Learning to Use Large-Scale Methods to Explore Web Archives

Sarah McTavish
Published in Archives Unleashed
Sep 24, 2018
Our last series of posts looked at the WALK project. This final post shows how you can do some of that research yourself with the Archives Unleashed Cloud!

In the Archives Unleashed Learning Guides, we outline various methods of performing text and network analysis on a large collection of archived web pages. In this post, I will discuss some of the skills needed to do this kind of analysis, the challenges one might face, and some of the resources I found invaluable in learning to use these tools.

Visit our Archives Unleashed Learning Guides

This type of analysis can be invaluable when working with extremely large collections. Keyword searches are useful and allow a researcher to examine every instance of a given keyword, but in a very large collection of archived web pages that keyword may still occur thousands of times (or even more!). Large-scale text analysis essentially does that “reading” for you, allowing those thousands of instances to be examined and analyzed quickly. More than this, network graphing methodologies allow these web collections to be viewed as a functional whole by highlighting the connections made through hyperlinking. Through these methodologies, it is possible to achieve a comprehensive perspective on large web collections that could not be “read” in traditional ways.

Using AntConc to find the context in which words in a web archive appear

Many of the text analysis methods discussed here require familiarity with the command line: essentially, typing commands to tell your computer what to do. Even when using a tool such as AntConc, which requires no coding knowledge, it can be helpful to know how to use the command line to break a large file into many smaller ones, which helps AntConc work faster and more efficiently. This can be quite intimidating to a beginner with no background in computer science; my own first interaction with the command line came while learning to do large-scale text analysis of archived GeoCities websites. Fortunately, there are many resources available to teach new users how to use the command line to perform tasks and run code. This Introduction to the Bash Command Line on the Programming Historian offers an excellent walkthrough of installing Bash, running commands, and interacting with files from the command line.
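
If you are entirely new to the shell, a handful of commands go a long way. Here is a minimal sketch of the kind of basic file inspection that comes up when working with text derivatives; the folder name, filename, and keyword are all hypothetical placeholders:

```bash
# Move into the folder that holds your derivative files (hypothetical path)
cd ~/web-archive-derivatives

# List the files and their sizes
ls -lh

# Count the lines and words in a plain-text derivative (hypothetical filename)
wc -l fulltext.txt
wc -w fulltext.txt

# Preview the first twenty lines without opening the whole file
head -n 20 fulltext.txt

# Count how many lines mention a keyword of interest
grep -c "geocities" fulltext.txt
```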

As mentioned earlier, the command line can be used to split large text files into many smaller ones. This is incredibly useful when performing text analysis on a very large text source; many analysis tools will crash or stall on a single very large file, but have no problem running the same analysis on several hundred small files. This discussion on Stack Overflow demonstrates several ways the command line can be used to split a large text file. In general, Stack Overflow is an excellent place to look up answers to almost any question you might have; you can be fairly sure someone else has asked the same question before.
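
For instance, the standard split utility (one common approach) can break a large derivative into evenly sized pieces. A minimal sketch, assuming a Bash shell and a hypothetical file called fulltext.txt:

```bash
# Split the file into pieces of 10,000 lines each;
# the pieces are written as chunk_aa, chunk_ab, chunk_ac, ...
split -l 10000 fulltext.txt chunk_

# Or split by size instead (roughly 10 MB per piece)
split -b 10M fulltext.txt chunk_

# Check how many pieces were created
ls chunk_* | wc -l
```

Each of the resulting pieces can then be loaded into AntConc (or another tool) on its own.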

Our “Sentiment Analysis with the Natural Language Toolkit” lesson shows you how to measure the positive or negative sentiment within a web archive collection

Despite building familiarity with the command line early on, using Python and the Natural Language Toolkit (NLTK) for sentiment analysis proved the most challenging for my developing skill set. Both the Programming Historian’s lesson on Sentiment Analysis for Exploratory Data Analysis and the NLTK textbook provide a good foundation for learning how to install and run Python and the Natural Language Toolkit.
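
To give a flavour of what that foundation looks like in practice, here is a minimal sketch using NLTK’s built-in VADER sentiment analyzer on a single sample sentence; the sentence is an invented example, not taken from either resource:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the lexicon that VADER relies on (only needed once)
nltk.download('vader_lexicon')

analyzer = SentimentIntensityAnalyzer()

# Score a single sample sentence
sentence = "Archived web pages are a wonderful resource for historians."
scores = analyzer.polarity_scores(sentence)

# polarity_scores returns negative, neutral, positive, and compound values;
# the compound score runs from -1 (most negative) to 1 (most positive)
print(scores)
```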

Even with this foundation, which guides users through running sentiment analysis on individual sentences and paragraphs using sample data, I found it quite difficult to make the jump to using my own data. Chapter 3 of the NLTK textbook outlines how to load your own text; however, without understanding exactly what each command was doing, it was difficult to write workable code for my own text source. In this case, I found it helpful to use a resource like this Python for Beginners website to look up what each snippet of code was doing and how changing the variables would change how the code operates.
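
As a rough illustration of that jump, the sketch below loads a plain-text file from disk, splits it into sentences, and scores each one; the filename is a hypothetical placeholder, and this is just one possible way to structure the loop:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Needed once: the sentence tokenizer and the VADER lexicon
nltk.download('punkt')
nltk.download('vader_lexicon')

analyzer = SentimentIntensityAnalyzer()

# Load your own text source (hypothetical filename)
with open('my-collection-text.txt', encoding='utf-8') as f:
    raw = f.read()

# Break the raw text into sentences and score each one,
# printing only the clearly positive or clearly negative sentences
for sentence in nltk.sent_tokenize(raw):
    scores = analyzer.polarity_scores(sentence)
    if abs(scores['compound']) > 0.5:
        print(round(scores['compound'], 3), sentence[:80])
```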

Our “Network Graphing Archived Websites with Gephi” tutorial shows you how to use AUK to create this!

Using Gephi to generate network graphs from web archive collections also requires some specialized knowledge. Gephi has a graphical interface, which makes it possible for someone with no (or limited) coding knowledge to do network analysis. However, the software still has a learning curve, particularly in deciding how to filter the data, which statistics to run, and how to organize the graph using Gephi’s built-in algorithms. Gephi’s own manuals are quite comprehensive and include descriptions of each algorithm and how it operates on your dataset. Gephi also maintains a list of tutorials, including ones created by the Gephi team, others contributed by users, and video tutorials on YouTube. These resources make it easy to get started, and the software itself rewards trial and error: testing out different settings to see what works and what doesn’t.

Despite the technical knowledge needed for this type of computer-mediated analysis, the biggest learning curve has been in recognizing when each type of text analysis is actually useful. The idea is compelling: a computer that can read thousands of sources quickly and efficiently and return useful analysis. The reality is often quite different. In some cases, the tool and the source base clash, obscuring any useful results. For example, when I used topic modelling with MALLET to examine a web collection from 2016, made up primarily of news articles and social media, almost the entire list of generated topics related to the metadata and menu bars on the pages rather than to the actual content. Running text analysis on complex web collections, which contain much more than just content, brings its own challenges.
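
For readers who have not used MALLET, the workflow looks roughly like this; the file and directory names are hypothetical, and this is a sketch rather than the exact commands run on that collection:

```bash
# Convert a directory of plain-text files into MALLET's internal format,
# removing a standard list of English stopwords along the way
bin/mallet import-dir --input collection-text/ --output collection.mallet \
  --keep-sequence --remove-stopwords

# Train a topic model with 20 topics and write out the top keywords per topic
bin/mallet train-topics --input collection.mallet --num-topics 20 \
  --output-topic-keys topic-keys.txt --output-doc-topics doc-topics.txt
```

The top keywords for each topic end up in topic-keys.txt, which is where issues like the menu-bar and metadata terms described above become visible.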

In another case, I used sentiment analysis with the NLTK on GeoCities’ LGBTQ community, in an attempt to highlight the emotion expressed in connection with certain identity labels. Almost predictably, this type of analysis was not able to understand the nuances of the multitude of ways that individuals express their own identity. The only identity labels that were calculated as being anything but neutral were those terms which are primarily used as LGBTQ slurs. To make matters more complicated, these slurs often had been reclaimed by the community and, though this could be caught by reading a statement in context, sentiment analysis only recognized the word itself. It is important to recognize the limitations of each method, and the situations where each method will be most effective. Often it is a process of trial and error to determine which method will yield the most useful results.

Finally, one of the most important pieces of knowledge for exploring web collections with text analysis is a general recognition that these tools, like all other forms of analysis, have their own biases. Computer-mediated analysis appears objective on the surface; however, the choice of method, the way you structure or filter your text, your use of keywords, and the programming of the tool itself all shape the results that come out of the analysis. With this recognition comes a need to be transparent about methodology and to remain open to the idea that these text analysis tools provide one interpretation of your text among many.
