Using the Archives Unleashed Toolkit at the Munich DigitiZation Center

Ian Milligan · Published in Archives Unleashed · Apr 22, 2020


By Katharina Schmid

The Munich DigitiZation Center (MDZ) at the Bavarian State Library (Bayerische Staatsbibliothek or BSB) has been archiving websites since 2012. Improving the ways in which our web archives can be searched, accessed and used is the focus of a current research project at the MDZ and the University of Passau, supported by the German Research Foundation (DFG). Together with the chair of Digital Humanities and the chair of European Politics, we are analyzing how methods and tools from the Digital Humanities can be applied to web archives. Our aim is to approach research questions from the library and information sciences as well as the humanities, especially the political sciences.

As part of the project, the MDZ conducted an event crawl in the context of the European Parliament elections from 23 March 2019 to 26 June 2019 and built a collection of archived websites of German and European parties, candidates and media as well as social media sites. The crawl produced 3,136 WARC files with a total size of 2.5 TB. Their sheer size and the diversity of their content can make WARC files challenging to work with. In order to open up this archived collection to researchers, we have experimented with the Archives Unleashed Toolkit. After the initial setup, we began extracting and filtering plain text for further analyses such as topic modeling.

Basic setup

We have used the Docker image of the toolkit provided on GitHub. It ships with all the necessary dependencies (Java 8, Apache Spark) and allowed us to get a Spark shell running in no time, with our own data directory mounted into the Docker container. One thing to be aware of is that processes in the Docker container run as root by default, so the derivatives generated by the toolkit will be owned by root as well. This can be problematic in a multi-user environment, where users may be able to start a Docker container but be unable to modify or delete the derivatives they generated, as they lack the required privileges.
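A simple way to deal with this, sketched here as a suggestion rather than a description of our actual setup, is to reclaim ownership of the generated derivatives on the host after leaving the container (the path is a placeholder):

# run on the host once the container has written its derivatives
sudo chown -R "$(id -u):$(id -g)" /my/data/derivatives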

With the Docker image, Apache Spark is used in local mode: it is deployed on a single machine and not on a cluster of several machines. As a consequence, you will not be able to make full use of Spark’s capabilities for parallel processing: your code will be executed in parallel on the different cores of your machine, but will not be distributed across multiple servers. For testing and experimenting with the toolkit, however, this setup worked fine.

Extracting plain text

Our first step was to extract plain text from a subset of the HTML documents captured during the event crawl. Plain text is the basis for frequency and concordance analyses as well as more sophisticated methods such as topic modeling. We used version 0.50.0 of the Archives Unleashed Toolkit. Starting with this release, the developers have begun to implement a Python API, moving away from Scala as the default language of the toolkit. Given that Python is more widely used in data science and comes with excellent libraries and frameworks for data processing and analysis, this will certainly facilitate adoption. As the Python API was not yet fully implemented at the time of writing, the following sample scripts still rely on the Scala API. Based on the sample scripts provided in the official documentation, we began our plain text extraction:
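What follows is a sketch of that script, modeled on the toolkit's documented DataFrame examples; the input path, the output directory and the exact form of the language filter are placeholders rather than a verbatim copy of our code.

import io.archivesunleashed._
import io.archivesunleashed.df._

// Load the WARC files of the event crawl, keep HTML pages in German,
// strip HTTP headers and HTML mark-up, and write crawl date, URL and
// plain text to CSV.
RecordLoader.loadArchives("/data/*.warc.gz", sc)
  .webpages()
  // language filter; the UDF name may differ slightly between toolkit versions
  .filter(DetectLanguageDF(RemoveHTMLDF($"content")) === "de")
  .select($"crawl_date", $"url",
    RemoveHTMLDF(RemoveHTTPHeaderDF($"content")).as("content"))
  .write
  .csv("/data/derivatives/plain-text-de/")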

As text analyses are typically language-specific, we have filtered the results to include only websites in German. We have also included metadata such as the crawl date and the URL in the final output.

The names of the different functions make it easy to see what is being done: load the WARC files, extract the HTML documents, keep only documents in German, select metadata and document content without HTML mark-up, and save the result in CSV format. The toolkit hides the complexities of processing WARC files from the user: “webpages()”, for example, parses the WARC records and filters them by MIME type and URL ending, excluding robots.txt, records without a crawl date, and records with an HTTP status code other than 200 OK. The sample scripts offer an easy way to get started with WARC analysis and cover basic use cases like generating plain text or link graphs from your data. More specific scenarios, such as when the HTML documents in your WARC files have been further compressed, are not covered by the toolkit and require more extensive modifications of the sample code.

Dealing with boilerplate content

As we reviewed the results, we soon noticed that the plain text included a lot of boilerplate content in the form of website headers, footers and navigation. This boilerplate content is highly repetitive and typically far outweighs the actual content of a single web page. Luckily for us, the Archives Unleashed Toolkit provides a way to remove boilerplate content, based on Boilerpipe, a library for boilerplate detection by Christian Kohlschütter and colleagues. We therefore modified our initial script as follows:
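Again as a sketch under the same assumptions as above, the change amounts to running the content through the Boilerpipe UDF instead of only stripping the HTML mark-up:

import io.archivesunleashed._
import io.archivesunleashed.df._

// Same extraction as before, but pass the content through Boilerpipe
// to drop boilerplate instead of only removing HTML mark-up.
RecordLoader.loadArchives("/data/*.warc.gz", sc)
  .webpages()
  .filter(DetectLanguageDF(RemoveHTMLDF($"content")) === "de")
  .select($"crawl_date", $"url",
    ExtractBoilerpipeTextDF(RemoveHTTPHeaderDF($"content")).as("content"))
  .write
  .csv("/data/derivatives/plain-text-de-no-boilerplate/")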

Looking at the results for different sample URLs, we found that the filter does in fact remove most of the boilerplate content. Figure 1 shows the plain text for a sample WARC record before and after boilerplate removal; it illustrates that the majority of the extracted text is considered boilerplate content and removed accordingly. For the sample record, we also checked the visual representation in Open Wayback to see which parts of the web page were removed as boilerplate. Figure 2 highlights the removed content in orange: it includes not just navigation, header and footer, but also previews of news content from other pages of the website.

Figure 1: Plain text extracted with AUT, before and after boilerplate removal.
Figure 2: Sample web page as displayed in Open Wayback. Content removed as boilerplate is covered in orange.

In some cases, however, parts of the HTML mark-up were not recognized by the filter and ended up in the final output. For example, our derivatives still contained "<h1>Mitteilung für die Presse</h1>" or "<img style="float: left; margin-right: 10px; margin-bottom: 10px;" src="[…].jpg">". In our case, these inclusions were few and comparatively short and could therefore probably be neglected. More frequent or lengthier inclusions could, however, complicate topic modeling. Running "RemoveHTMLDF" on the output of "ExtractBoilerpipeTextDF" helped to get rid of the instances mentioned above:

.select($"crawl_date", $"url", RemoveHTMLDF(ExtractBoilerpipeTextDF(RemoveHTTPHeaderDF($"content"))))

Memory issues

While experimenting with our plain text derivatives, we repeatedly ran into out-of-memory errors. The errors could be traced back to specific WARC files but did not seem to be directly related to file size: processing a 724 MB WARC file produced an out-of-memory error, while larger files of 955 MB could be handled without any problems. This does not seem to be an isolated case: the problem is discussed in a separate issue on GitHub, and the Archives Unleashed team seems to be working on it for the next release. As a workaround, and due to the restrictions of our basic setup, we resorted to processing the WARC files sequentially, loading only one file at a time. Apart from that, we increased the memory for the Spark driver process to 7 GB, as suggested in the documentation for the toolkit:

docker run --rm -it -v "/my/data:/data" aut:0.50.0 /spark/bin/spark-shell --packages "io.archivesunleashed:aut:0.50.0" --driver-memory 7G
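For the sequential processing itself, a plain loop over the individual WARC files on the driver is sufficient. The following is a rough sketch under the same assumptions as the earlier examples; the paths and the per-file output layout are placeholders:

import io.archivesunleashed._
import io.archivesunleashed.df._
import java.io.File

// Process one WARC file at a time and write a separate derivative per input file.
val warcs = new File("/data").listFiles
  .filter(_.getName.endsWith(".warc.gz"))
  .map(_.getPath)
  .sorted

warcs.foreach { warc =>
  RecordLoader.loadArchives(warc, sc)
    .webpages()
    .select($"crawl_date", $"url",
      ExtractBoilerpipeTextDF(RemoveHTTPHeaderDF($"content")).as("content"))
    .write
    .csv("/data/derivatives/" + new File(warc).getName.stripSuffix(".warc.gz"))
}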

One thing to highlight here is how easy it is to get in touch with the Archives Unleashed team on GitHub or in their Slack channel. We have repeatedly contacted the team with questions, issues and suggestions and have found them to be very open to our input and keen to improve the user experience of those working with the toolkit.

Our next step will be to run Latent Dirichlet Allocation on the plain-text derivatives generated with the Archives Unleashed Toolkit, to see whether the data is sufficient for this kind of analysis or requires further processing. In the future, we would also like to work with derivative files for link analysis in order to explore a more or less automated collection-building process for web archives in libraries.
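As a rough illustration of that next step, and not something we have settled on yet, such a topic-modeling pass could be built directly on Spark MLlib in the same shell; the column names, stop-word handling and number of topics below are placeholders:

import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover}

// Read the plain-text derivative (assumed column layout: crawl_date, url, content).
val docs = spark.read
  .csv("/data/derivatives/plain-text-de-no-boilerplate/")
  .toDF("crawl_date", "url", "content")
  .na.drop(Seq("content"))

// Tokenize, drop German stop words and build a bag-of-words representation.
val tokens = new RegexTokenizer().setInputCol("content").setOutputCol("tokens")
  .setPattern("\\W+").transform(docs)
val filtered = new StopWordsRemover().setInputCol("tokens").setOutputCol("filtered")
  .setStopWords(StopWordsRemover.loadDefaultStopWords("german")).transform(tokens)
val vectorizer = new CountVectorizer().setInputCol("filtered").setOutputCol("features")
  .setVocabSize(20000).fit(filtered)
val vectors = vectorizer.transform(filtered)

// Fit an LDA model with a placeholder number of topics and inspect the top terms;
// vectorizer.vocabulary maps the term indices back to actual words.
val model = new LDA().setK(20).setMaxIter(50).fit(vectors)
model.describeTopics(10).show(false)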
