On Cloud Number 9: Accessibility, Usability, and Functionality with the Archives Unleashed Cloud (AUK)

Samantha Fritz
Published in Archives Unleashed
May 22, 2018

By: Samantha Fritz, Nick Ruest, Ian Milligan

Excitement around AUK, or the Archives Unleashed Cloud, has been floating around for the past few weeks both within our team and the broader web archiving community. We had a chance to share the news (and an interactive demo) with our Archives Unleashed Datathon participants, and now we’d like to officially share it with you.

Imagine a world where you could access and analyze a web archive file in as few as three clicks, with no coding necessary. Oh wait, there’s a portal for that: the Archives Unleashed Cloud.

AUK is an open source cloud-based analysis tool that helps researchers and scholars conduct web archive analysis. It is a component of the Archives Unleashed Project and supports the priorities of accessibility and usability of web archives by providing users a web-based front end to access the Archives Unleashed Toolkit.

AUK Milestone

Over the past five months, our project co-Investigator and developer, Nick Ruest, has taken extraordinary strides in getting the first iteration of AUK up and running.

AUK is a Ruby on Rails and Apache Spark-based project that downloads a user’s chosen Archive-It collection and then runs the Archives Unleashed Toolkit as a background job to create a basic set of derivative files and visualizations.

Putting together AUK hasn’t been as easy as 1, 2, 3 or as simple as Do-Re-Mi. Building technical infrastructure is always interesting, and usually involves a few surprises along the way. The construction of AUK meant not only building new infrastructure, but also connecting it to existing complex infrastructure. The AUT team would like to send a special shout-out to Nick, who not only found creative solutions to the challenges of building AUK’s technical framework, but also learned Rails for this project and for Warclight.

“Rails combines the Ruby programming language with HTML, CSS, and JavaScript to create a web application that runs on a web server.” (Daniel Kehoe, RailsApps Project)

The current version of AUK has helped propel the project to a major milestone, and we are so excited as we enter the next stage of AUK.

AUK Look and Functionality

AUK has a clean and modern interface with an easy-to-use dashboard. The best part is that you do not need to know how to code, which in turn supports the increased accessibility of working with web archives.

Core features of the Archives Unleashed Cloud include:

  • Syncing Archive-It collections
  • Ingesting Archive-It collections via WASAPI to AUK
  • Creating a network graph, a domain distribution list, and full-text derivatives, and making each available for download
  • Providing an in-browser network diagram to see major nodes and connections within your collection

In addition, AUK’s documentation offers guidance on how to get AUK up and running, as well as how to use some of its features.

A Guided Tour of AUK

Let’s take a tour of how AUK works. Once a user signs up at cloud.archivesunleashed.org, they enter their Archive-It credentials, which are salted and encrypted. Those credentials are then used to sync their Archive-It collections with AUK using Archive-It’s WASAPI endpoint. The sync runs as a background job, and once it is complete, AUK emails the user to let them know that their Archive-It collections are synced and available for further analysis on their dashboard.
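To make the sync step a little more concrete, here is a minimal Python sketch of how a client might walk a paginated WASAPI file listing. It is an illustration under stated assumptions, not AUK's actual Rails code: the base URL, the `collection` query parameter, and the `files` and `next` response fields are assumed from the WASAPI conventions we know, and `fake_fetch` is a hypothetical stand-in for an authenticated HTTP request.

```python
from urllib.parse import urlencode

# Assumed base URL for Archive-It's WASAPI webdata endpoint.
WASAPI_WEBDATA = "https://partner.archive-it.org/wasapi/v1/webdata"


def webdata_url(collection_id):
    """Build a query URL that lists the files of one collection."""
    return WASAPI_WEBDATA + "?" + urlencode({"collection": collection_id})


def collect_files(fetch_page, collection_id):
    """Follow the `next` links of a paginated listing and gather every file entry.

    `fetch_page` is any callable that takes a URL and returns the decoded
    JSON response as a dict (a real one would add the user's credentials).
    """
    url = webdata_url(collection_id)
    files = []
    while url:
        payload = fetch_page(url)
        files.extend(payload.get("files", []))
        url = payload.get("next")  # None or missing on the last page
    return files


# Hypothetical two-page response, used here instead of a network call.
PAGES = {
    webdata_url(1234): {
        "next": "page-2",
        "files": [{"filename": "a.warc.gz"}, {"filename": "b.warc.gz"}],
    },
    "page-2": {"next": None, "files": [{"filename": "c.arc.gz"}]},
}


def fake_fetch(url):
    """Stand-in for an HTTP GET; returns the canned page for a URL."""
    return PAGES[url]


files = collect_files(fake_fetch, 1234)
print([entry["filename"] for entry in files])
```

Passing the fetcher in as a callable keeps the paging logic testable without credentials or network access.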

The AUK dashboard provides some basic information about each collection: its title, whether it is publicly available (in Archive-It), the number of ARC/WARC files in the collection, and the size of the collection. You can see this below!

AUK dashboard

From the dashboard, users can select a collection to download, which will trigger a number of background jobs. The first job uses data gathered from the WASAPI endpoint to download each ARC/WARC file to our AUK instance and verify it. Once the entire collection is downloaded, an automatic email notifies the user that the download is complete and that analysis will begin.
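This post does not spell out how AUK verifies each file, so the following Python sketch shows one common approach rather than AUK's own method: comparing a downloaded file against published checksums. We assume each file entry carries a `checksums` mapping of algorithm name to hex digest (for example `md5` and `sha1`); the function names are hypothetical.

```python
import hashlib
import os
import tempfile


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large WARCs never need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, checksums):
    """Return True only if at least one checksum is given and all of them match."""
    return bool(checksums) and all(
        file_digest(path, algorithm) == expected
        for algorithm, expected in checksums.items()
    )


# Demo with a tiny temporary file standing in for a downloaded WARC.
payload = b"WARC/1.0\r\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)

expected = {
    "md5": hashlib.md5(payload).hexdigest(),
    "sha1": hashlib.sha1(payload).hexdigest(),
}
ok = verify_download(tmp.name, expected)
bad = verify_download(tmp.name, {"md5": "0" * 32})
os.remove(tmp.name)
print(ok, bad)
```

Treating an empty checksum mapping as a failure is a deliberate choice here: a file with nothing to check against should not silently count as verified.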

The analysis process then triggers an Apache Spark job and uses AUT to create a basic set of derivatives:

  • A GEXF file, which you can load with Gephi. It has a basic layout courtesy of our Graphpass program, which allows you to see major nodes and communities in the network.
  • A GraphML file, which you can also load with Gephi. It does not include any layout or transformations, so you will need to apply them manually. You can run Graphpass on the file if you would like it to provide a layout.
  • A CSV file that lists the distribution of domains within the web archive.
  • A text file that contains the plain text extracted from HTML documents within the web archive. Each entry includes the crawl date, the full URL, and the plain text of the page.

AUK collection analysis and derivatives
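If you would rather peek inside a GEXF derivative without opening Gephi, a few lines of Python with the standard library are enough. This is a rough sketch, not part of AUK: the sample graph is made up, and we assume edges carry `source`, `target`, and an optional `weight` attribute, as in the GEXF 1.2 format.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape of a small GEXF 1.2 file.
SAMPLE_GEXF = """<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="directed">
    <nodes>
      <node id="0" label="example.com"/>
      <node id="1" label="example.org"/>
      <node id="2" label="example.net"/>
    </nodes>
    <edges>
      <edge id="0" source="0" target="1" weight="3"/>
      <edge id="1" source="0" target="2" weight="1"/>
      <edge id="2" source="1" target="2" weight="2"/>
    </edges>
  </graph>
</gexf>"""


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def summarize_gexf(xml_text):
    """Return (node count, edge count, outgoing link weight per node label)."""
    root = ET.fromstring(xml_text)
    elements = list(root.iter())
    labels = {
        el.get("id"): el.get("label")
        for el in elements
        if local_name(el.tag) == "node"
    }
    edges = [el for el in elements if local_name(el.tag) == "edge"]
    outgoing = {}
    for edge in edges:
        source = labels.get(edge.get("source"), edge.get("source"))
        weight = float(edge.get("weight", "1"))  # unweighted edges count as 1
        outgoing[source] = outgoing.get(source, 0.0) + weight
    return len(labels), len(edges), outgoing


node_count, edge_count, outgoing = summarize_gexf(SAMPLE_GEXF)
print(node_count, edge_count, outgoing)
```

For anything beyond a quick sanity check, a graph tool such as Gephi will be far more comfortable.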

In addition, we use Graphpass to help create a simple network visualization, powered by Sigma.js, on the collection dashboard, and we display a table of the 10 most frequently occurring domains in a given collection.
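A table like this is easy to reproduce from the domain derivative. The Python sketch below assumes, purely for illustration, that the CSV has one `domain,count` pair per row and no header; the real file's layout may differ, and the sample data is invented.

```python
import csv
import io


def top_domains(csv_text, limit=10):
    """Return up to `limit` (domain, count) pairs, largest count first."""
    rows = csv.reader(io.StringIO(csv_text))
    # Skip blank lines; assume every other row is exactly `domain,count`.
    counts = [(row[0], int(row[1])) for row in rows if row]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)[:limit]


# Invented sample data in the assumed two-column layout.
SAMPLE_CSV = "example.org,12\nexample.com,340\nexample.net,57\n"

top_two = top_domains(SAMPLE_CSV, limit=2)
for domain, count in top_two:
    print(f"{domain}\t{count}")
```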

Sigma.js used to visualize collection network

Future development will focus on filtering a collection down further, and on integrating the new DataFrame functionality we’re adding to AUT via a JDBC connector.

Can I try it out?

Currently, anyone with an Archive-It subscription can take AUK for a spin. Please just come over to cloud.archivesunleashed.org and give it a try.

If you don’t have an Archive-It account, sit tight: we are looking into development for AUK users who would like to bring their own WARCs to AUK with webrecorder.io. Updates will be provided through our social media channels and newsletter, as well as on our website.

AUK is an open source project; you can view the codebase here. Although it is tied closely to the canonical instance running at cloud.archivesunleashed.org, it can also be run as a standalone project on your own server, desktop, or laptop! That said, our primary focus is on the canonical instance that we are hosting. If there is interest in generalizing aspects of the project, let us know and we can collaborate and figure out how to make it happen.

Acknowledgments

Sincerest thanks to Creighton Barrett, Corey Davis, Ben Goldman, and Greg Wiedeman for testing AUK out, and helping to surface some bugs.

This work is primarily supported by the Andrew W. Mellon Foundation. Other financial and in-kind support comes from the Social Sciences and Humanities Research Council, Compute Canada, the Ontario Ministry of Research, Innovation, and Science, York University Libraries, Start Smart Labs, and the Faculty of Arts and David R. Cheriton School of Computer Science at the University of Waterloo.

Get Involved

There are a variety of ways to get involved with the Archives Unleashed Project and Toolkit:
