Archives Unleashed Project: 2019 Progress Report

Samantha Fritz
Published in Archives Unleashed
Jan 21, 2020

It’s hard to believe that we are already two and a half years into the Archives Unleashed Project! So as we begin this new year, a new decade, and head into the final six months of this project, let’s reflect on the work and milestones the Archives Unleashed Team and community have reached.

The Archives Unleashed Project

For first-time visitors, the Archives Unleashed Project, supported by a grant from the Andrew W. Mellon Foundation, has three core objectives:

  • Develop and build an open-source toolkit;
  • Deploy the Archives Unleashed Cloud as a one-stop portal for analyzing web archives; and
  • Organize datathons to provide hands-on learning opportunities to work with web archives at scale and develop a sustainable user community.

A key pillar of the project is accessibility. We believe that it’s not enough to collect and store web archives; there needs to be a means to explore and use them at scale. The suite of tools and platforms offered by the project offers an opportunity to bridge an accessibility gap among librarians, archivists, scholars, and researchers.

2018/19 Recap with Archives Unleashed

Development

The Archives Unleashed Cloud has seen continual improvements in both usage and functionality over the last year.

The largest collection processed in the Cloud so far is a 17.6 TB collection from the International Internet Preservation Consortium (2012 Summer Olympics Collection).

We distributed a user-feedback survey to better understand the experience of Cloud users and to help the team assess the platform’s strengths, weaknesses, and areas of opportunity. As a result, several recommendations were incorporated into our development plan, specifically around enhancing our user interface and documentation to make them more usable.

Dashboard/Monitoring: To assist our team, as well as other Cloud users, we have implemented three back-end dashboards to visually measure and display real-time statistics on key performance indicators, track metrics of background job processes, and offer a triage point for diagnostics and troubleshooting. These include an overview of ongoing jobs, graphs to monitor any bottlenecks or issues, as well as overall statistics.

Additional Functionality: Participant feedback from our most recent datathons informed development cycles and led to the implementation of additional functionalities to the Cloud including:

  • An additional derivative: filtering all text in a web archive by the top ten most popular domains;
  • Graphing the presence of domains within a collection; and
  • Launching the Archives Unleashed Cloud Notebook service!
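To make the first two items above concrete, here is a minimal sketch in plain Python of counting domains in a collection and keeping only the text from the most frequent ones. It is an illustration under stated assumptions, not the Cloud’s or the Toolkit’s actual implementation: the `records` list, the function names, and the rule for stripping a leading `www.` are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical sample data: (url, extracted_text) pairs standing in for
# the pages of a web archive collection.
records = [
    ("http://www.example.org/a", "first page"),
    ("http://example.org/b", "second page"),
    ("https://example.org/c", "third page"),
    ("http://www.example.com/x", "fourth page"),
    ("http://example.com/y", "fifth page"),
    ("http://example.net/z", "sixth page"),
]


def domain_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def top_domains(records, n=10):
    """Count how often each domain appears and return the n most frequent."""
    counts = Counter(domain_of(url) for url, _ in records)
    return counts.most_common(n)


def filter_text_by_top_domains(records, n=10):
    """Keep only (domain, text) pairs whose domain is among the top n."""
    keep = {domain for domain, _ in top_domains(records, n)}
    return [
        (domain_of(url), text)
        for url, text in records
        if domain_of(url) in keep
    ]


if __name__ == "__main__":
    # The counts are the data one would plot to graph domain presence.
    for domain, count in top_domains(records):
        print(domain, count)
```

The list returned by `top_domains` is the kind of data a domain-presence graph would plot, and `filter_text_by_top_domains(records, n=10)` mirrors the idea of a text derivative limited to the ten most popular domains.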

The Archives Unleashed Toolkit had its first release in October 2017 and has seen significant improvements to functionality over 11 subsequent releases. The current iteration of the Toolkit supports the basic derivative dataset generation for the Archives Unleashed Cloud (hyperlinks, URLs, entities, and the like). If you’ve used the Cloud, without knowing it, you’ve been using the Toolkit too!

One of the significant developments this year has revolved around the move towards implementing DataFrames support in both Scala and Python. This allows users comfortable with Python to take advantage of the growing Python data science landscape. We undertook the shift to DataFrames to better integrate our tool with the broader fields of the digital humanities and computational social science. We’d sincerely like to thank our contributors Jeremy Wiebe and Gursimran Singh for their work and assistance in moving this piece of development forward in strides!

Additional priorities in 2019 included:

  • Updating and adapting our documentation to meet expanding functionality, especially with regard to new forms of working with networks, extracting entities, and working with non-textual elements such as images;
  • Improving the content flow of our website and documentation;
  • Restructuring how our documentation is organized: users can now access it in a cookbook-style format, which makes information easier to find;
  • Code refactoring: keeping long-term sustainability in mind, we’ve focused on refactoring our code base to simplify it and reduce dependencies.

Archives Unleashed Notebooks

Notebooks were created to help answer the question “I have derivatives, now what?” and have evolved into a method that makes working with web archive derivatives more approachable. We currently have a series of notebooks that allow users to work with the web archive derivatives, created through the Archives Unleashed Cloud, in their browser. You can try it out here, just click the “Open in Colab” button at the top of the page!

Notebooks are based on the “mad libs” approach, where users fill in the blanks to select what they want to do, from collection analysis to finding most-popular words to exporting more refined datasets for future work.
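As a rough illustration of the “mad libs” idea, the sketch below puts a few fill-in-the-blank settings at the top of a cell and leaves the analysis code underneath untouched. It is not taken from the Archives Unleashed notebooks: the setting names, the sample `pages`, and the `most_popular_words` helper are assumptions made for this example.

```python
import re
from collections import Counter

# ---- Fill in the blanks (hypothetical settings a user would edit) ----
TOP_N = 1                                  # how many words to report
STOP_WORDS = {"the", "and", "of", "a"}     # words to ignore
MIN_WORD_LENGTH = 3                        # skip very short words

# Hypothetical sample text standing in for a full-text derivative file.
pages = [
    "The archive of the web",
    "Web archives and the web",
]


def most_popular_words(pages, top_n, stop_words, min_length):
    """Return the top_n most frequent words across all pages."""
    words = []
    for text in pages:
        # Lowercase the text and pull out runs of letters and apostrophes.
        for word in re.findall(r"[a-z']+", text.lower()):
            if len(word) >= min_length and word not in stop_words:
                words.append(word)
    return Counter(words).most_common(top_n)


if __name__ == "__main__":
    print(most_popular_words(pages, TOP_N, STOP_WORDS, MIN_WORD_LENGTH))
```

Changing only the settings at the top, for example raising `TOP_N` or adding to `STOP_WORDS`, changes the result without editing the function, which is the fill-in-the-blanks pattern described above.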

Community Development and Outreach

We have continued to develop and expand our community engagement through online platforms including:

  • Publishing short blog posts through Medium (https://news.archivesunleashed.org) which address topics of interest related to web archiving, present case studies, and feature project milestones and tool functionalities.
  • We also continue to engage with current users and enthusiasts through our quarterly newsletter, which provides project updates, news, and notice of events. Our newsletter currently reaches 185 subscribers from Canada, the United States, Germany, the Netherlands, and Australia. You can subscribe here!
  • Communications and engagement with the wider archival community through our Twitter account (@unleasharchives).
  • Our Slack channel is a dedicated communication environment where our users can collaborate and find troubleshooting support for projects and challenges (http://slack.archivesunleashed.org). You can sign up by clicking the link!

Presentations and Publications

Throughout the year, team members have taken part in a number of scholarly meetings to present the tools and platforms available to researchers, with the ultimate goal of raising interest and growing the user base and community around the project. In 2018/2019 our team attended several high-profile web archiving conferences (IIPC, RESAW), presented our project at the Coalition for Networked Information (CNI) annual meeting, and gave several invited talks.

We have also published several pieces. In particular, we had two short papers and three poster/demonstration papers at the 2019 ACM/IEEE Joint Conference on Digital Libraries (held on the beautiful campus of the University of Illinois). These were:

  • Ryan Deschamps, Samantha Fritz, Jimmy Lin, Ian Milligan, and Nick Ruest. “The Cost of a WARC: Analyzing Web Archives in the Cloud.” http://hdl.handle.net/10315/36158
  • Ian Milligan, Nathalie Casemajor, Samantha Fritz, Jimmy Lin, Nick Ruest, Matthew S. Weber, and Nicholas Worby. “Building Community and Tools for Analyzing Web Archives through Datathons.” http://hdl.handle.net/10315/36180
  • Ryan Deschamps, Nick Ruest, Jimmy Lin, Samantha Fritz, and Ian Milligan. “The Archives Unleashed Notebook: Madlibs for Jumpstarting Scholarly Exploration of Web Archives.” http://hdl.handle.net/10315/36160
  • Hsiu-Wei Yang, Linqing Liu, Ian Milligan, Nick Ruest, and Jimmy Lin. “Scalable Content-Based Analysis of Images in Web Archives with TensorFlow and the Archives Unleashed Toolkit.” http://hdl.handle.net/10315/36161
  • Nick Ruest, Ian Milligan, and Jimmy Lin. “Warclight: A Rails Engine for Web Archive Discovery.” http://hdl.handle.net/10315/36159

A full list of our published work can be found in the Publications and Press section of our website.

We also launched our first promotional videos, which highlight the work of the Archives Unleashed project and the context of our project tools within the web archiving community.

  1. Introducing the Archives Unleashed Project: Introduces the context of web archiving, the overarching goals of the Archives Unleashed project and promotes tool and platform awareness.
  2. Archives Unleashed Cloud: A Tour: Provides a tour of the Archives Unleashed Cloud and explains the functionalities by examining the derivative files users can further explore, and the resources available through the Archives Unleashed Project.
  3. Archives Unleashed Toolkit 0.17.0 Release: Using a gource visualization, the project team captured development for the Toolkit 0.17.0 release cycle and visualized the extensive contributions.

Datathons

We love our datathons because they provide an opportunity to build and engage with our community, offer a chance for hands-on training with the Archives Unleashed Toolkit, and create an environment to collaboratively explore web archives. Over the last year and a bit, we’ve run two datathons, following up on our first one held at the University of Toronto in April 2018.

The first datathon in this period was organized and co-hosted with the Simon Fraser University Libraries and KEY, SFU’s Big Data Initiative, in November 2018. Rebecca Dowson, Digital Scholarship Librarian at SFU, served as a co-organizer of the event.

With an overwhelming response from local and international applicants, we were able to host 22 participants from 13 institutions. Projects explored local British Columbia events such as wildfires and political upheaval, and we were able to facilitate good conversations and projects around the use of web archives. [Check out the Vancouver projects]

Our second datathon in this period was co-hosted with a team from George Washington University Library in April 2019. We were pleased to collaborate with Laura Wrubel, Daniel Kerchner, Rachel Trent, and Robin Delaloye to run our largest and most diverse datathon to date, bringing together 26 participants from 18 institutions globally. We were very impressed by the creativity of the projects showcased here and their diverse approaches to web archiving questions, including novel methodological approaches, application of archival theory, and innovative analysis. We also used this event to introduce the Archives Unleashed Cloud notebooks, which were heavily used (and which provided additional feedback to our team). [Check out the Washington projects]

Advisory Board

We’d like to extend our sincerest thanks to our colleagues in the field who serve on our Advisory Board. These individuals have provided valuable feedback and advice, which have helped to shape the future direction of the project.

2020 Roadmap

So what’s next for Archives Unleashed? There are lots of exciting developments coming up including:

  • Continue development to support PySpark and DataFrames, and establish a stable 1.0.0 release for AUT;
  • Continue work with Notebooks and Google Colab;
  • Run our fourth datathon in New York in March 2020, in collaboration with Columbia University Libraries;
  • Prepare for a special datathon event, which will take place right after IIPC’s web archiving conference in Montreal, QC. Special thanks to University of Toronto Libraries and BAnQ for their support in hosting this event;
  • And more, so stay tuned!

Acknowledgments

The Archives Unleashed Project is primarily supported by the Andrew W. Mellon Foundation. Other financial and in-kind support comes from the Social Sciences and Humanities Research Council, Compute Canada, the Ontario Ministry of Research, Innovation, and Science, York University Libraries, Start Smart Labs, and the Faculty of Arts and David R. Cheriton School of Computer Science at the University of Waterloo.

Many thanks to Ian Milligan and Nick Ruest for their contributions to and review of this article.
