So Long, Farewell to the Archives Unleashed Cloud

Published in Archives Unleashed, Jun 23, 2021

This spring, we mark a few special anniversaries for the Archives Unleashed Project! It’s been four years since the project began, and this is our third year of running the Archives Unleashed Cloud.

With the support of the Andrew W. Mellon Foundation, our focus has been on developing open-source analysis tools to help make web archives more accessible. One of our primary goals has been “to make petabytes of historical internet content accessible.” Thanks to our community of Cloud users and collaborators, we’ve been able to make that goal a reality.

As we sunset the Cloud at the end of June 2021, we reflect on what Archives Unleashed has accomplished with the Cloud.

Building the Cloud

The Archives Unleashed Cloud was developed to support access and use of web archives for scholarly inquiry. The Cloud presents an accessible and user-friendly interface, while simplifying the process of conducting analysis on web archive collections.

The first commit to the Archives Unleashed Cloud repository was on 24 October 2017. The weeks that followed focused on designing and building the site’s basic framework as well as functions for conducting analysis.

*Initial commit for Archives Unleashed Cloud*

The Cloud was built as an open-source platform, running on Ruby on Rails and Apache Spark. The majority of back-end development occurred between November 2017 and April 2018, with a focus on three main areas:

Harness the power of the Toolkit. The Cloud hosts the “canonical instance” of the Toolkit, meaning it uses the Toolkit as the underlying code-base to drive analytical functions. Because the Cloud provided an accessible user interface, researchers could focus on research questions and conduct analysis in as few as three clicks.
Hook up WASAPI. The Web Archiving Systems API (WASAPI) is a “standardized mechanism to export and import web archive data between systems”. Using WASAPI created a connection between platform and data source: the platform being the Archives Unleashed Cloud, and the data source being a user’s web archive collections from Archive-It.
Build background jobs. Adopting Apache Spark provided a pathway for building processing jobs that produced derivative files and in-browser visualizations.
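
To make the WASAPI step a little more concrete, here is a minimal Python sketch of how one page of a WASAPI-style file listing could be summarized before any processing jobs are queued. This is not the Cloud's actual code (the Cloud was written in Ruby on Rails). The field names (`files`, `filename`, `size`, `locations`, `next`) reflect our reading of the WASAPI specification and should be treated as assumptions, and the sample page and the `summarize_page` helper are hypothetical.

```python
from typing import Any

# Hypothetical sample shaped like one page of a WASAPI file listing.
# The filenames, sizes, and URLs are invented for illustration only.
SAMPLE_PAGE: dict[str, Any] = {
    "count": 2,
    "next": None,
    "previous": None,
    "files": [
        {
            "filename": "EXAMPLE-0001.warc.gz",
            "filetype": "warc",
            "size": 1_000_000_000,
            "locations": ["https://example.org/files/EXAMPLE-0001.warc.gz"],
        },
        {
            "filename": "EXAMPLE-0002.warc.gz",
            "filetype": "warc",
            "size": 500_000_000,
            "locations": ["https://example.org/files/EXAMPLE-0002.warc.gz"],
        },
    ],
}


def summarize_page(page: dict[str, Any]) -> dict[str, Any]:
    """Summarize one page of a WASAPI-style listing (hypothetical helper).

    Returns the number of files, their combined size in bytes, the first
    download location of each file, and whether another page follows.
    """
    files = page.get("files", [])
    return {
        "file_count": len(files),
        "total_bytes": sum(f.get("size", 0) for f in files),
        # Each file may list several locations; keep the first one, if any.
        "download_urls": [f["locations"][0] for f in files if f.get("locations")],
        "has_more": page.get("next") is not None,
    }


summary = summarize_page(SAMPLE_PAGE)
print(summary["file_count"], summary["total_bytes"], summary["has_more"])
```

In a real client, the page would come from an authenticated request to the data source's WASAPI endpoint, and the `next` link would be followed until it is empty.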

As you can imagine, we did a lot of testing! Special attention was paid to performance testing to validate, understand, and improve responsiveness, scalability, speed and resource usage.

In May 2018, the Archives Unleashed Cloud was soft launched.

Achievements

The Archives Unleashed Cloud ultimately provided new opportunities to engage with web archival research and explore web archive collections. Our team is extremely proud to reflect on the work completed and achievements met. So let’s take a look!

Reach and Impact

The Archives Unleashed Cloud has been adopted across 60 unique institutions in 10 countries, though the majority of our users were from North America, Europe, Australia and New Zealand. It has been used by a diverse range of institutions and organizations, including:

University/College Libraries
National Libraries
International & Non-Profit Organizations
Public Libraries
Government Organizations
Art Archives

On the research end, the Cloud has been used to analyze collections covering a wide range of topics, from journalism and politics to health and national sporting events. We’ve also seen the Cloud used within the classroom setting, as instructors have promoted experiential learning and understanding of web archives. Similarly, there are several cases of graduate students using the Cloud for exploratory research, and even identifying how it fits within scholarly workflows that incorporate multiple (and interrelated) tools.

The Cloud provided a scalable solution for the analysis of web archives and a robust data processing scaffolding. Over its lifespan, the Cloud has processed just under a petabyte of W/ARC data across 1500 collections. That’s a lot of web archive data!

Collaborations

Engaging with institutions has been important for the growth of our user community, and in bridging the gap between researchers and data.

Project co-investigator Nick Ruest worked with several institutions to support access by creating and sharing derivative data of over 30 web archive collections. Using the Archives Unleashed Toolkit and Cloud, derivatives were generated for select collections curated by Columbia University, Ivy Plus Libraries Confederation, and the Bibliothèque et Archives nationales du Québec. These derivative datasets have been made publicly available through Web Archives for Historical Research Group communities on Zenodo and Dataverse with a citable DOI.

This initiative had two goals in mind:

To help institutions promote and increase the visibility of their web archive collections; and
To continue lowering barriers to access and use, which in this case means providing a starting point for researchers who don’t have immediate access to WARCs while encouraging exploration of web archival data.

Early on in the Cloud’s development, our project connected with six Canadian research universities to test analysis functionality and processes. From this experience, we were able to test scalability and efficiency, and to identify and troubleshoot errors across a variety of collection sizes and topics.

Learning Resources

Our users noted that they often had difficulties knowing where to start once they had web archive data to explore. In response, and to provide a bit of inspiration, our team developed learning guides that walked through how to use Cloud derivatives with external tools. In total, half a dozen guides were created and addressed common approaches to both network and textual analysis, while incorporating tools like Gephi and Voyant. These resources have provided a starting point for exploring web archives and promoted confidence in scholars new to web archive research.

The Next Evolution

While the Archives Unleashed Cloud is sunsetting, it isn’t entirely going away. The conceptual framework has been reimagined to continue the support of access to and use of web archives, while also addressing long-term sustainability plans.

We will see the analytical functions from the Cloud brought into alignment within the Archive-It environment. This integration (2020–2023) will allow for an end-to-end service for working with web archives at scale: from collection curation to analysis. The collaboration between Archives Unleashed and Archive-It ultimately seeks to broaden and enhance the accessibility and usability of web archives.

Our team would like to express gratitude to everyone who has supported this project. We truly appreciate the enthusiastic response we’ve received from the community and look forward to sharing our next evolution.

The codebase for the Cloud is available through our GitHub repository: https://github.com/archivesunleashed/auk.

Additional Reading

Archives Unleashed has published several articles, posts, and materials addressing the concept, design, and impacts of the Cloud.

Journal Articles / Conference Papers

From archive to analysis: accessing web archives at scale through a cloud-based interface. International Journal of Digital Humanities, 2021.
The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives. arXiv:2001.05399, January 2020.
The Cost of a WARC: Analyzing Web Archives in the Cloud. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, Vol. 19 (2019).

Invited Talks

Medium Posts

Written by Samantha Fritz