Wrapping up WALK: Reflections on a Project

Ian Milligan
Published in Archives Unleashed
Sep 5, 2018 · 4 min read


This is the first of several posts that will reflect on WALK. In future instalments, researchers will discuss what they learned from working with the datasets we’ve created.

By: Ian Milligan and Nick Ruest

In 2015, a group of us from York University, the University of Waterloo, and the University of Alberta began to chat about an idea for a project. We all agreed that we had great web archive collections throughout Canada, but that they were siloed and sometimes hard to access. What if we could provide researchers with access to these collections, either through a search engine or through datasets? We applied to Compute Canada and on Christmas Eve 2015 learned that we’d been awarded computational resources to make “Web Archives for Longitudinal Knowledge” a reality.

The initial idea was to use our Warcbase platform (now the Archives Unleashed Toolkit) and the combination of webarchive-discovery with Apache Solr and Shine to let people search through these collections. We had done much the same with the Canadian Political Parties collection during the 2015 election, letting people search for trends in a political web archive stretching back ten years to 2005.

Shine was showing its age, however, and there was no longer much active development on it, so we quickly realized that a new search interface, based on Project Blacklight, would help web archives integrate better with library and archival discovery systems. This became the Warclight engine.

In short, our goal was to do the following with Canadian web archive collections:

  • Use Warcbase/AUT to generate full-text files, network graphs, and other scholarly derivatives so that computational researchers could play with Canadian web archives;
  • Use webarchive-discovery and Warclight to launch full-text portals; and
  • Run a few case studies to let graduate students learn how to use web archives.

We met these goals! The project ended up being a very useful trial run for what we are now doing, at scale, with the Archives Unleashed project.

WebArchive-Discovery and Warclight

For the uninitiated, webarchive-discovery is a utility that parses ARCs and WARCs and indexes them using Apache Solr, an open-source search platform. Once these ARCs and WARCs have been indexed, Solr provides us with searchable fields including title, host, crawl date, and content type. If you’re curious about all the fields available, check out this sample schema.xml for Warclight.
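To give a flavour of how those fields get queried, here is a minimal sketch that builds a Solr select URL against a webarchive-discovery index. The base URL, core name, and exact field names are assumptions for illustration; consult the sample schema.xml mentioned above for the authoritative field list.

```python
from urllib.parse import urlencode

# Hypothetical Solr core name and host -- adjust for a real deployment.
SOLR_BASE = "http://localhost:8983/solr/walk/select"

def build_query(term, host=None, content_type=None, rows=10):
    """Return a Solr query URL filtering on the kinds of fields
    webarchive-discovery indexes (content, host, content type)."""
    params = [("q", f"content:{term}"), ("rows", rows), ("wt", "json")]
    if host:
        # fq = filter query; narrows results without affecting scoring.
        params.append(("fq", f"host:{host}"))
    if content_type:
        params.append(("fq", f'content_type:"{content_type}"'))
    return SOLR_BASE + "?" + urlencode(params)

url = build_query("election", host="uwaterloo.ca", content_type="text/html")
print(url)
```

Fetching that URL against a live index would return JSON results faceted by those fields, which is essentially what Warclight does behind its search form.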

Warclight is a Project Blacklight based Rails engine that supports the discovery (faceted full-text search, record view, and other advanced discovery options) of web archives that have been indexed in Apache Solr by webarchive-discovery.

Our team has written a few times in the past about implementing Warclight and webarchive-discovery.

Warclight Sites

View of faceted results for University of Winnipeg Warclight instance

With the Compute Canada-supplied infrastructure, we’ve been able to index over 25 TB of ARCs and WARCs from our six partner institutions, resulting in nearly one billion Solr documents! (We’re re-indexing two large University of Alberta collections now, so those won’t be publicly available for a while yet.)

We were able to do this by implementing Apache SolrCloud. This allowed us to get past some limitations of running a standalone Apache Solr setup, and to index each of our six partner institutions in its own Solr collection. It also allowed us to create a “federated” search across all six institutions with the SolrCloud Collections API.

Warclight implementations for each of the six partner institutions can be found at the links below:

View of record from University of Winnipeg Warclight instance

Derivative Datasets

For each of the public Archive-It collections, we generated the following files:

  • Plain text of all the HTML pages within them;
  • A network diagram that can be loaded in the open-source Gephi network analysis program; and
  • A list of all the domains found within a given collection.
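To make the third derivative concrete, here is a small sketch of what a domain list captures: a tally of how often each domain appears among a collection's captured URLs. The sample URLs are made up for illustration; the real derivatives were generated with Warcbase/AUT over full collections.

```python
from collections import Counter
from urllib.parse import urlparse

# Toy stand-in for a collection's captured URLs.
urls = [
    "http://www.example.ca/news/1.html",
    "http://www.example.ca/news/2.html",
    "http://archive.example.org/about.html",
]

# Count captures per domain (the netloc component of each URL).
domain_counts = Counter(urlparse(u).netloc for u in urls)
for domain, count in domain_counts.most_common():
    print(domain, count)
```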

We’ve released all of these files through the Canadian Federated Research Data Repository (FRDR). You can find the collection of files at https://www.frdr.ca/repo/handle/doi:10.20383/101.036. They are arranged by collection: each compressed file contains the plain text, the network file, and the domain list.

The derivative files are hosted at FRDR.

These are what we call the “basic set” of scholarly derivatives. The domain count can let a researcher know what’s inside a collection. The network diagram can help them find sites of interest, or can be a direct dataset to use in social network analysis applications. Finally, the plain text can be used in data mining or text mining to uncover patterns within a web archival collection.
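A researcher can also inspect a network derivative programmatically before ever opening Gephi. The sketch below assumes a GEXF file (one of the graph formats Gephi reads) and counts its nodes and edges with the standard library; the inline sample is a toy two-node graph, not real WALK data.

```python
import xml.etree.ElementTree as ET

# Toy GEXF graph: two domains with one directed link between them.
GEXF = """<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="directed">
    <nodes>
      <node id="0" label="example.ca"/>
      <node id="1" label="example.org"/>
    </nodes>
    <edges>
      <edge id="0" source="0" target="1"/>
    </edges>
  </graph>
</gexf>"""

# GEXF elements live in a namespace, so queries must declare it.
NS = {"g": "http://www.gexf.net/1.2draft"}
root = ET.fromstring(GEXF)
nodes = root.findall(".//g:node", NS)
edges = root.findall(".//g:edge", NS)
print(len(nodes), len(edges))
```

The same node and edge lists feed directly into social network analysis tooling, which is what makes the network derivative a dataset in its own right.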

We have guides, now up as part of the Archives Unleashed Cloud, explaining how to use these datasets. How can you load them into Gephi? How can you run sentiment analysis, for example, on the plain text? Check them out here.
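As a taste of the plain-text analyses those guides cover, here is a deliberately naive lexicon-based sentiment score. The tiny word lists and sample sentence are placeholders; real analyses would use a proper sentiment lexicon or library over the full derivative text files.

```python
import re
from collections import Counter

# Placeholder lexicons -- far smaller than anything used in practice.
POSITIVE = {"good", "great", "success"}
NEGATIVE = {"bad", "poor", "failure"}

def sentiment_score(text):
    """Return (#positive - #negative) word hits for a page of text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    pos = sum(counts[w] for w in POSITIVE)
    neg = sum(counts[w] for w in NEGATIVE)
    return pos - neg

page = "The campaign was a great success despite poor early polling."
print(sentiment_score(page))  # 2 positive hits, 1 negative -> 1
```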

Some of this can seem a bit vague, however. In the posts that follow, we will explain how to use these files for analysis.


Associate professor of digital and Canadian history at @uWaterloo. Helping to lead @unleasharchives. Digital history, digital libraries, web archives.