Unleashing Web Archives: A Final Letter to Our Community

Samantha Fritz
Published in Archives Unleashed
16 min read · Jul 11, 2023

Established in 2017, the Archives Unleashed Project began with the goal of unlocking the potential of web archives for scholarly research.

Over the past six years, we’ve engaged with librarians, archivists, technologists, and researchers from a variety of disciplines, including digital humanities, communication and media studies, journalism, computer science, information management, the social sciences and beyond. Thanks to you all!

When we take a broad look at the phases of the Archives Unleashed Project, we’ve seen the shift from a stand-alone Mellon-funded project from 2017–2020 to a collaborative integration with the Internet Archive beginning in 2020. Through those two phases, we’ve evolved our tools and approaches, which address the technical complexities and burdens researchers face when working with web archives.

As the project came to a close at the end of June 2023, our team is looking back at our activities and celebrating the tools that we share (and leave) with the broader community.

Phase I (2017–2020)

The underlying goal of our work has been to lower the barriers and burdens of accessing and using web archives, and we have approached this work through two main activities: Tool Building and Community Engagement.

Building Scalable Analytical Tools

The W/ARC file format is messy and complex and can certainly be intimidating to new scholars. We also appreciate that web archive collections are typically difficult to work with due to their scale. In fact, even when collections are packaged into a more familiar format, such as CSV, many of them can’t be used with proprietary tools and software because the data exceeds the limits of spreadsheet programs. Not every scholar has the means, support, or time to learn the extensive computational analysis skills needed to explore this kind of data.

Describing the WARC File, from Ruest, Fritz & Milligan (2022) Creating order from the mess: web archive derivative datasets and notebooks, Archives and Records, 43:3, 316–331, DOI: 10.1080/23257962.2022.2100336
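
To give a feel for what sits inside a WARC file, here is a minimal sketch in Python that reads uncompressed WARC records from a byte stream: a version line, named header fields, a blank line, and then a payload whose size is given by Content-Length. It is an illustration only and is not how the Archives Unleashed Toolkit parses records. The sample bytes and the function names are made up for this example, and real-world files are usually gzip-compressed and far messier.

```python
import io


def parse_warc_record(stream):
    """Read one record from a binary stream.

    Returns (version, headers, payload), or None at end of stream.
    Assumes uncompressed input and well-formed headers.
    """
    # Skip the blank lines that separate one record from the next.
    line = stream.readline()
    while line in (b"\r\n", b"\n"):
        line = stream.readline()
    if not line:
        return None

    version = line.strip().decode("utf-8")  # e.g. "WARC/1.0"
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break  # a blank line ends the header block
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()

    # The payload is exactly Content-Length bytes long.
    payload = stream.read(int(headers.get("Content-Length", 0)))
    return version, headers, payload


def iter_warc_records(stream):
    """Yield records until the stream is exhausted."""
    while True:
        record = parse_warc_record(stream)
        if record is None:
            return
        yield record


# Hypothetical two-record sample, written by hand for this sketch.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"WARC-Date: 2023-06-26T00:00:00Z\r\n"
    b"Content-Length: 17\r\n"
    b"\r\n"
    b"HTTP/1.1 200 OK\r\n"
    b"\r\n\r\n"
    b"WARC/1.0\r\n"
    b"WARC-Type: metadata\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

records = list(iter_warc_records(io.BytesIO(sample)))
for version, headers, payload in records:
    print(version, headers.get("WARC-Type"), len(payload))
```

Even this toy reader has to juggle byte offsets, nested HTTP headers inside the payload, and record boundaries, which hints at why working directly with W/ARC files is a barrier for many researchers.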

To address this problem, our project has created scalable tools, methodologies, and approaches that reduce the technical complexities and burdens researchers often face with web archive data.

During the first phase of the project (2017–2020), the team developed open-source tools as well as methodological approaches for conducting scalable analysis. Together these tools presented new opportunities for researchers to investigate and analyze web archives and included:

Archives Unleashed Toolkit (2017 — present)

The Archives Unleashed Toolkit is an open-source platform for exploring and analyzing web archives. The Toolkit provides a set of approaches that allow users to investigate large-scale web archives by creating derivative datasets that can be used to fully explore and extract meaningful insights using Apache Spark.

Extensive documentation, written in a cookbook style, provides recipes for common use cases so users can confidently and successfully build customized scripts for common analytical tasks to explore text, network, and binary data.

Archives Unleashed Cloud (2018–2020)

The Archives Unleashed Cloud was designed as a one-stop, web-based portal for scholars to ingest their Archive-It collections and execute a number of analyses with the click of a mouse.

From a technical perspective, the Cloud was a Rails application that had a web-based front end, with the Toolkit and a web archive data transfer API called “WASAPI” working in the background. The Cloud provided a connection between a user’s Archive-It account via WASAPI, which provided a pipeline to their Archive-It collections. The Toolkit was used to process web archive collections and generate easy-to-use derivative datasets for further exploration.

The Archives Unleashed Cloud was officially sunsetted on 30 June 2021 and provided the theoretical foundation for work done in Phase II, which eventually led to the ARCH (Archives Research Compute Hub) iteration of 2023. Keep reading to learn more about ARCH!

Over its lifespan, the Archives Unleashed Cloud analyzed just under a petabyte of data and was used by individuals from 59 unique institutions across ten countries.

Infographic that highlights Archives Unleashed Cloud Stats

Warclight (2018–2023)

Warclight was developed as a framework to help with the search and discovery of web archives. As a Project Blacklight-based Rails engine, it supported the discovery of web archives held in the W/ARC formats and allowed faceted full-text search, record view, and other advanced discovery options. Warclight was specifically designed to work with web archive data indexed via the UK Web Archive’s webarchive-discovery project.

As part of the WALK (Web Archive for Longitudinal Knowledge) project, Warclight brought together the archival holdings of a half-dozen Canadian libraries by providing federated search and access to research derivatives of Canadian web archival collections.

Warclight instances were created for the University of Alberta, University of Toronto, University of Winnipeg, University of Victoria, Simon Fraser University, and Dalhousie University, which indexed their W/ARCs from their Archive-It accounts and provided an interface to interact with this data through a search and discovery layer.

Warclight interface

Datathons: Community Building

On the community engagement front, our team hosted a series of datathons for skills development and to foster a sense of belonging and support through collaborative projects. Borrowing from the hacking genre of events often found within the tech industry, as well as an earlier series of events held in 2016 and 2017, Archives Unleashed datathons provided an immersive and uninterrupted period of time for participants to work collaboratively on projects and gain hands-on experience working with web archive data. The datathon series cultivated community formation and empowered scholars to build confidence and the skills needed to work with web archives.

These datathons saw the participation of 73 attendees who were representative of a wide range of professional roles, including librarians, archivists, technologists, digital humanities scholars, graduate/post-graduate researchers, and computer scientists.

Mellon-funded Datathons include:

To view completed team projects, please visit the datathon event pages.

Archives Unleashed Datathon Infographic

Phase II (2020–2023)

Supported by a second Mellon Foundation grant, the Archives Unleashed Project collaborated closely with the Internet Archive in a highly integrated partnership aimed at enhancing the accessibility and usability of web archives for research purposes.

This second phase of Archives Unleashed (2020–2023) saw the maturing of tools and community engagement activities to provide a more robust and interconnected environment for engaging with web archives and more focused and intensive support for research. Much of this was done in collaboration with the Internet Archive’s Archiving and Data Services team.

ARCH (Archives Research Compute Hub)

In partnership with the Internet Archive, our goal was to design a sustainable tool and service that would support web archival research at scale.

On June 26, 2023, we collectively achieved this objective by launching the Archives Research Compute Hub (ARCH), a research and education service that helps users easily build, access, analyze, publish, and preserve web archive datasets at scale.

The road to ARCH drew on the combined expertise of project members throughout the design and development process, which included several components:

Building a Computational Environment. Building what would become the ARCH platform required both enhancing and reimagining the code base to ensure effective and efficient resource allocation and scalability of processing and transformation of W/ARC data. Engineers at the Internet Archive provided leadership in creating the Sparkling data processing library, implementing a new job and queue management system, and building a distributed, high-performance computing cluster. And we can’t forget the building of physical infrastructure, i.e. mounting racks to hold the servers that ARCH now calls home!

Wireframe and Prototyping. Responsive design strategy practices were applied to finesse wireframes and prototypes to ensure the interface aligned with the feedback and requirements outlined by stakeholder groups.

Iterative UX/UI Testing. Five scaffolded iterative rounds of User Experience (UX) and User Interface (UI) testing were conducted to test the efficiency and design of ARCH prototypes. Interviews and surveys were distributed to stakeholder groups, with particular attention on recruiting participants representative of a broad spectrum of skill levels, institutional types, geography, and collection sizes. The collection of qualitative and quantitative feedback evaluated the stability, robustness, and scalability of the ARCH platform. Virtual feedback sessions were also held with 2021–2022 Cohort participants to learn about their experience in piloting the platform.

Documentation. ARCH documentation was created and organized to reflect the wide range of skills of its users. Collaborative writing between Archives Unleashed and web archivists at the Internet Archive has resulted in user documentation (via the ARCH Help Center) that provides onboarding, an overview of the platform’s features, documented use cases of web archives in research, and guidance for using ARCH derivatives with other analysis tools and methods.

Stewardship and Sustainability. In building a platform and service that seamlessly integrates with Internet Archive’s infrastructure, as a project team, we’ve ensured reliable, scalable, and sustainable stewardship.

ARCH Dashboard

ARCH Features:

  • Intuitive, interactive, user-friendly stand-alone web-based app
  • Build custom collections with filtering capabilities
  • Transform web archive collections into research objects by generating over a dozen dataset types (e.g., full text, images, hyperlink network graphs, etc.) into widely compatible CSV files
  • Download generated datasets directly in-browser or via API
  • Research-ready datasets can be used with computational environments and tools (e.g. Jupyter Notebooks, Gephi, Voyant)
  • Direct connection between datasets and Google Colab Notebook to conduct initial exploratory analysis
  • Review content collection through in-browser visualizations and data previews
  • Publish datasets in line with best practices in reproducible research to the Internet Archive. All datasets will be preserved in perpetuity
  • Browse extensive documentation to support your use of ARCH, guidance on pairing datasets with computational tools, and sharing research examples and use cases
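
As a rough illustration of what working with a generated CSV dataset can look like outside a spreadsheet, the sketch below streams a hyperlink network file row by row and totals inbound links per target domain. The column names (crawl_date, source, target, count) and the sample rows are assumptions made for this example, so check the header of your own dataset before adapting it.

```python
import csv
import io
from collections import Counter


def inbound_link_totals(rows):
    """Sum link counts per target domain from an iterable of dict rows."""
    totals = Counter()
    for row in rows:
        totals[row["target"]] += int(row["count"])
    return totals


# Hypothetical sample standing in for a downloaded CSV file.
# With a real file, replace this with: open("your-dataset.csv", newline="")
sample_csv = io.StringIO(
    "crawl_date,source,target,count\n"
    "20230101,example.com,archive.org,12\n"
    "20230101,example.org,archive.org,3\n"
    "20230102,example.com,example.org,5\n"
)

# DictReader reads one row at a time, so memory use stays small
# even when the file is too large to open in a spreadsheet program.
totals = inbound_link_totals(csv.DictReader(sample_csv))
top = totals.most_common(2)
print(top)  # [('archive.org', 15), ('example.org', 5)]
```

Because the rows are processed one at a time, the same approach works on files with many millions of lines, and the resulting totals can be carried into tools such as Gephi or a notebook for further exploration.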

ARCH provides approachable and powerful opportunities to investigate web archives at scale. Researchers can now easily navigate, search, and make sense of vast collections of web data without requiring extensive technical expertise or specialized tools. This increases accessibility and opens doors for a wider range of researchers to delve into the wealth of information captured in web archives, leading to enhanced scholarly inquiry.

As the Archives Unleashed Project comes to an end, the Internet Archive’s Archiving and Data Services team will be stewarding the ARCH platform and service.

To learn more about ARCH, please reach out via the following form.

ARCH: https://webservices.archive.org/pages/arch

ARCH Help Center: https://arch-webservices.zendesk.com/hc/en-us

Archives Unleashed Toolkit (AUT) 1.2.0

As noted earlier, the Archives Unleashed Toolkit is an open-source platform for analyzing web archives using Apache Spark and makes use of Sparkling for parsing W/ARC records. The toolkit provides powerful tools for analytics and data processing.

Terminal window of using the Archives Unleashed Toolkit 1.2.0 with sample data

In November 2022, the Archives Unleashed Toolkit 1.2.0 version was released, which provides a stable and open-source solution for analyzing web archive collections at scale, and marks a major milestone for the Archives Unleashed Project! As of this release, the Toolkit:

  • Provides full support for DataFrames (DFs) and Resilient Distributed Datasets (RDDs) to allow for flexibility in processing and displaying structured, semi-structured, or unstructured data.
  • Incorporates Python and Scala implementations to accommodate the programming languages most familiar to scholars working with web archives.
  • Presents unlimited combinations in generating standardized datasets that can be used as scholarly objects to explore five functional analysis categories: collection, link, binary, text file analysis, and text extraction.
  • Allows users to utilize spark-submit to create over 20 different derivatives of web archive collections.
  • Adopts the Sparkling library, which was developed as a data processing library for the Internet Archive, to improve the speed, efficiency, and reliability of parsing of W/ARC records.
  • Offers enhanced user documentation that illustrates dozens of text, network, and binary analytical examples.

The Archives Unleashed Toolkit is open-source and available to anyone interested in exploring and analyzing web archives.

GitHub: https://github.com/archivesunleashed/aut

Documentation: https://aut.docs.archivesunleashed.org/

Understanding that not everyone has access to web archive collections, we’ve collaborated with several institutions to process over two dozen web archive collections through the Toolkit and ARCH. This means that as a scholar, you can access ready-to-use derivative datasets and start your research journey. Datasets are available through the Web Archives for Historical Research Group via Zenodo and Borealis.

Computational Notebooks (2019–2023)

Computational notebooks come in all shapes and sizes. Many are familiar with Jupyter and Google Colab. Notebooks provide a space for creating, editing, and sharing documents that can contain live code, equations, visualizations, and text (Driscoll).

Notebooks provide a more accessible entry point for researchers interested in exploring and interacting with web archive collections (Ruest). Once a researcher generates (or has access to) derivative datasets, many often ask, “How do I start using these derivatives?” Our goal with the notebooks was to help provide guidance to scholars on what investigating web archive data could look like through examples.

We’ve created both Python-based and Google Colab notebooks which provide a templated “madlib” approach, allowing scholars to work within browsers, filling in the blanks (where appropriate) to apply customized example methods and techniques while investigating web archive collections.

The development of computational notebooks, led by co-investigator Nick Ruest, provided some first steps in investigating these derivations of web archives and entry points or examples of types of analysis that can be conducted.

Using computational notebooks to conduct initial analysis on top-level domains
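
For readers curious what a first pass at top-level domain analysis might involve, here is a minimal sketch using only the Python standard library. It is not taken from the project notebooks. The URL list is invented for the example, and the helper simply takes the last dot-separated label of the hostname, which is a simplification (an IP address, for instance, would not yield a meaningful result).

```python
from collections import Counter
from urllib.parse import urlsplit


def top_level_domain(url):
    """Return the last label of the URL's hostname, or "" if there is none."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""


# Hypothetical URLs standing in for a column from a derivative dataset.
urls = [
    "https://www.example.ca/page",
    "http://example.com/",
    "https://archive.org/about",
    "https://uwaterloo.ca/",
]

# Count each top-level domain, skipping URLs with no usable hostname.
counts = Counter(tld for tld in map(top_level_domain, urls) if tld)

for tld, n in counts.most_common():
    print(f".{tld}: {n}")
```

In a notebook, counts like these are typically the starting point for a bar chart or a comparison across crawl dates.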

Notebooks have been integrated into the ARCH platform, providing a seamless transition from collection analysis and generated datasets to using a dataset directly within a Google Colab Notebook.

The Archives Unleashed Project also offers standalone Jupyter notebooks that are publicly available and can be used in conjunction with generated datasets from the Archives Unleashed Toolkit.

Using computational notebooks to conduct text analysis, NER (named entity recognition) with spaCy

Archives Unleashed Cohort Program

As a project, we’ve been committed to community building and engagement because we know projects can’t exist in silos, and a sense of belonging and connection is critical for tools and services to grow and thrive.

Alongside our tool-building efforts, our team built on the foundations of our community engagement activities by supporting web archives research more intensively than our datathon series allowed. This was realized through the development of the Archives Unleashed Cohort Program (2021–22 and 2022–23).

The Cohort Program was designed to facilitate intensive research engagement with web archives, foster a community of users, and cultivate a broad spectrum of research use cases across disciplines and fields.

Over two cycles, the program engaged with 50 interdisciplinary researchers from 20 institutions across nine countries. Over the course of each year-long collaboration, teams used web archives as a primary data source to investigate wide-ranging research topics and questions.

Geographic representation of Archives Unleashed Cohorts

Throughout the program, participants were provided technical and academic mentorship to steward research projects through direct one-on-one consultation from Archives Unleashed, connections to field experts, and opportunities for peer-to-peer support.

The Cohort Program has, at a micro level, demonstrated the value in expanding opportunities to access, explore and engage with web archives, all of which are dependent on building connections and relationships.

  • Access. Connecting researchers with (web archives) data uncovered opportunities to forge relationships with content creators and collection curators. In partnership with the Internet Archive and in collaboration with specific Archive-It partners, cohort teams were given read-access to web archive collections, which were then analyzed through the ARCH platform.
  • Explore. Building community among cohort participants and fostering an environment that embraced skills-building and knowledge-sharing, users were empowered to experiment with new tools and methodologies for investigating web archives.
  • Engage. The cohort projects have enhanced the visibility of web archive collections and facilitated meaningful connections with diverse disciplinary communities. As a result, these projects serve as exemplary use cases and showcase the high-quality research potential inherent in working with web archives. Through shared experiences and research insights, cohort participants have highlighted the possibilities, opportunities, and inspiration for those seeking to dive into the world of web archive research.

A wide range of Methods, Techniques, and Tools used by Cohort researchers

The scholarly contributions and accomplishments of these ten research teams have been tremendous! If we look at traditional scholarly content, teams have presented at conferences and meetings, led workshops, and written peer-reviewed articles and blog posts. They’ve also published research findings and engaged with the web archive field through creative and unique ways, such as developing computational notebooks for research and teaching, developing tools, contributing a Spanish Wikipedia entry for web archiving, producing a zine publication, and initiating Solr Wayback instances.

We celebrate these researchers and their contributions to the web archiving space. The Cohort Projects Summary Report highlights use cases of web archives research for several subjects and disciplines, including crisis communication, health mis/disinformation, commenting systems, digital activism, Latin American women’s rights movements, digital labour and motherhood, reconciliation processes, and the COVID-19 pandemic.

An exclusive summary of the program and research project can also be found in the Archives Unleashed blog post “Celebrating Research Opportunities with Web Archive through Archives Unleashed Cohort Projects.”

To Infinity and Beyond

The Archives Unleashed Project has spent the past six years building innovative tools and fostering a vibrant community. As we come to the end of our project, we proudly unleash our roster of tools, the Archives Unleashed Toolkit, ARCH, and notebooks, alongside co-created resources and datasets to inspire and empower continued research collaborations and knowledge-sharing endeavours.

We are grateful for the support we’ve received and thrilled to have played a part in creating a community of scholars who are deeply passionate about unlocking the rich resources within web archives.

We are not fans of goodbyes, so we hope to see you soon!

Acknowledgements

The work of the Archives Unleashed Project has been made possible by generous funding from the Mellon Foundation (2017–2023).

We’d also like to acknowledge the financial and in-kind support of Smart Start Labs, the University of Waterloo, York University, Archive-It, Social Sciences and Humanities Research Council of Canada, Compute Canada, and the Ministry of Research, Innovation, and Science.

We are incredibly grateful to our broader community of supporters, peers, and colleagues who have welcomed us into this space, championed our work, and been a source of inspiration.

The Archives Unleashed Project has engaged in highly integrated collaborations throughout its two phases and acknowledges the efforts and contributions of its project members, collaborators, and advisory board members!

Project Team

  • Ian Milligan, Primary Investigator, University of Waterloo (Phase I & II)
  • Nick Ruest, Co-Investigator, York University (Phase I & II)
  • Jimmy Lin, Co-Investigator, University of Waterloo (Phase I & II)
  • Jefferson Bailey, Co-Investigator, Internet Archive & Director, Archiving and Data Services, Internet Archive (Phase II)
  • Thomas Padilla, Deputy Director, Archiving and Data Services, Internet Archive (Phase II)
  • Helge Holzmann, Senior Data Engineer, Archiving and Data Services, Internet Archive (Phase II)
  • Samantha Fritz, Project Manager, Archives Unleashed (Phase I & II)
  • Kody Willis, Product Operations Manager, Archiving and Data Services, Internet Archive (Phase II)
  • Karl-Rainer Blumenthal, Web Archivist, Archiving and Data Services, Internet Archive (Phase II)
  • Alex Dempsey, Senior Engineering Manager, Archiving and Data Services, Internet Archive (Phase II)
  • Peggy Lee (2021–2022)

Collaborators

  • Sarah McTavish, Department of History, University of Waterloo (2018–2021)
  • Rebecca MacAlpine, Department of History, University of Waterloo (2019–2020)
  • Tobi Adewoye, David R. Cheriton School of Computer Science, University of Waterloo (2019–2020)
  • Xiao Han, David R. Cheriton School of Computer Science, University of Waterloo (2019–2020)
  • Gursimran Singh, David R. Cheriton School of Computer Science, University of Waterloo (2019–2020)
  • Hsiu-Wei Yang, David R. Cheriton School of Computer Science, University of Waterloo (2018–2019)
  • Linqing Liu, David R. Cheriton School of Computer Science, University of Waterloo (2018–2019)
  • Borui Lin, David R. Cheriton School of Computer Science, University of Waterloo (2018)
  • Jeremy Wiebe, Department of History, University of Waterloo (2018–2019)
  • Ryan Deschamps, Department of History, University of Waterloo (2017–2019)

Archives Unleashed Advisory Board

2017–2020

  • Jefferson Bailey
  • Matthew Weber
  • Nathalie Casemajor
  • Nicholas Worby
  • Robert H. McDonald

2020–2023

  • Matthew Weber
  • Michele Weigle
  • Robert H. McDonald
  • Jane Winters
  • Sylvain Bélanger
  • Nicholas Taylor

Additional Reading

Archives Research Compute Hub (ARCH)

  • Holzmann, H., Ruest, N., Bailey, J., Dempsey, A., Fritz, S., Lee, P., and Milligan, I. “ABCDEF: the 6 key features behind scalable, multi-tenant web archive processing with ARCH: archive, big data, concurrent, distributed, efficient, flexible.” In Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries (JCDL ’22). Association for Computing Machinery, New York, NY, USA, 1–11 (2022).

Archives Unleashed Toolkit

Notebooks

Datathons/Community

Archives Unleashed Cloud

References

Mike Driscoll. “Jupyter Notebook: An Introduction.” https://realpython.com/jupyter-notebook-introduction/

Nick Ruest, Samantha Fritz & Ian Milligan (2022) Creating order from the mess: web archive derivative datasets and notebooks, Archives and Records, 43:3, 316–331, DOI: 10.1080/23257962.2022.2100336
