Celebrating Research Opportunities with Web Archives through Archives Unleashed Cohort Projects

Samantha Fritz
Archives Unleashed
Jun 29, 2023

Following the successes of the Archives Unleashed datathon series (2017–2020), the Archives Unleashed Cohort Program launched in 2021 to foster research engagement with web archives.

The program has seen two cycles of year-long intensive collaborations to support research with web archives as a primary data source and scholarly objects.

Research teams were provided with technical and academic mentorship to steward projects through direct one-on-one consultation from the Archives Unleashed team, members of the Archiving and Data Services team, connections to field experts, and opportunities for peer-to-peer support.

The program also saw teams publish their research through a number of scholarly avenues — from journal publications to conferences to community engagement and advocacy work. As these projects offer high-quality use cases for working with web archive data, we provide a summary of the cohort program and feature the research of these interdisciplinary and international project teams.

View their project profiles:

Project Summaries Report: https://bit.ly/AUCohortProjects

About Cohort Researchers

Map featuring the geographic representation of cohort teams

From 2021 to 2023, the program hosted ten research projects involving an international and interdisciplinary group of scholars. Their projects spanned a broad range of topics: perspectives, experiences, and responses to the COVID-19 pandemic; the role of new media in feminist-based landscapes; queer identity and discourse formation in online spaces; and cultural practices and reconciliation processes.

Word cloud of self-expressed topics and themes studied by cohort teams

Our research teams self-identified over two dozen topics and themes that their projects explored, including:

#COVID19 #Pandemic #Multilingualism #CrisisCommunication #RiskCommunication #BigData #CommentingSystems #OnlineComments #NewsWebsites #DigitalActivism #LatinAmericanFeminism #CyberFeminism #CounterArchives #DigitalLabor #Motherhood #Blogging #Feminism #Domesticity #DisInformation #MisInformation #CulturalPractices #ReconciliationProcesses #HistoricalRedress #HumanRightsDefense

Methods + Tools

When collaborating with researchers interested in investigating web archive collections, we’ve found one of the first questions asked is what types of tools or methods should be used to explore web archival data.

The answer: it really depends on your research question(s) or what you are asking of the data. As cohort participants discovered, their questions often have to be reimagined based on the content of a web archive collection.

That said, all teams began their research process by using the ARCH (Archives Research Compute Hub) platform. ARCH transforms web archival collections, specifically the WARC files, into accessible scholarly objects that can be used with a myriad of tools and methods for further research. These derived datasets include domain frequency statistics, hyperlink network graphs, extracted full text, and metadata about binary objects within a collection.
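To make the idea of a domain frequency dataset concrete, here is a minimal Python sketch that counts how often each host appears in a list of captured URLs. It is an illustration only, not ARCH's implementation: the function name `domain_frequency` and the sample URLs are hypothetical, and the choice to fold a leading `www.` into the bare domain is an assumption.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_frequency(urls):
    """Return (host, count) pairs, most frequent first.

    Hypothetical helper for illustration; not part of ARCH.
    """
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if not host:
            continue  # skip entries that have no parseable host
        if host.startswith("www."):
            host = host[4:]  # assumption: treat www.example.org as example.org
        counts[host] += 1
    return counts.most_common()


# Hypothetical captures standing in for the URLs recorded in a collection.
sample_urls = [
    "http://www.example.org/news",
    "https://example.org/about",
    "http://example.com/",
]
print(domain_frequency(sample_urls))
```

A table like this gives a quick sense of which sites dominate a collection before any closer reading begins.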

After generating datasets with ARCH, cohort teams used mixed methods to conduct text and network analysis. Interestingly, while several teams pursued similar kinds of analysis, they applied a wide range of techniques and tools; topic modelling, for instance, was approached in several different ways!

Visual summary of analysis methods and tools used by cohort researchers

Text Analysis

Many teams began their investigations by exploring the text of their web archive collections. Since ARCH provides a dataset encompassing the full text of a web archive, it became an invaluable starting place to understand the types of content present. Cohorts used a wide range of methods, techniques, and tools and faced the challenge of dealing with scale.

Mixing both distant and close reading was crucial given that even smaller collections have thousands, if not hundreds of thousands, of lines of content, which makes it impossible to conduct manual coding or reading of all the content.

We saw many topic modelling and classification techniques used to engage with the data from a macro level, as well as the use of programming languages and computational notebooks to investigate recurring themes and patterns to draw out stories from the data.
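As a small illustration of distant reading, the Python sketch below counts the most frequent terms across a set of extracted page texts. It is a deliberately simple stand-in for the topic modelling and classification tools the cohorts used, not a reproduction of any team's method; the function `top_terms`, the short stopword list, and the sample pages are all hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny English stopword list; real projects use fuller lists.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


def top_terms(pages, n=5):
    """Return the n most frequent non-stopword terms across all pages."""
    counts = Counter()
    for text in pages:
        tokens = re.findall(r"[a-z']+", text.lower())
        # Drop stopwords and very short tokens before counting.
        counts.update(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return counts.most_common(n)


# Hypothetical page texts standing in for an extracted full-text dataset.
pages = [
    "Vaccine clinics open in the region",
    "The region updates vaccine guidance",
]
print(top_terms(pages, n=2))
```

Frequent terms point to passages worth reading closely, which is where the mix of distant and close reading comes in.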

Network Analysis

Looking at the interconnectivity within a web archive, teams used network analysis to understand how actors, bodies, and groups within web archive collections were linked, how information flowed, and what entities were central or influential to conversations.
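One simple way to ask which entities are central is to count how many distinct sites link to each domain (its in-degree). The Python sketch below does this for a list of (source, target) pairs. It is an assumption-laden illustration rather than any cohort's workflow: the function `in_degree` and the sample edges are hypothetical, and dedicated network tools offer richer centrality measures.

```python
from collections import defaultdict


def in_degree(edges):
    """Rank targets by the number of distinct sources linking to them."""
    sources_by_target = defaultdict(set)
    for source, target in edges:
        if source != target:  # ignore links from a site to itself
            sources_by_target[target].add(source)
    ranked = ((target, len(sources)) for target, sources in sources_by_target.items())
    # Highest in-degree first; ties are broken alphabetically by domain.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical hyperlink edges between domains in a collection.
edges = [
    ("a.org", "hub.org"),
    ("b.org", "hub.org"),
    ("hub.org", "a.org"),
    ("a.org", "a.org"),
    ("a.org", "hub.org"),
]
print(in_degree(edges))
```

Counting distinct sources rather than raw links keeps one heavily self-promoting site from looking more influential than it is.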

Image + Binary Analysis

A few of our cohorts were interested in using information from binary files found within web archive collections, such as images, PDFs, videos, spreadsheets, Word documents, and the like; essentially, any non-text objects.

Web Archive Collections

In partnership with members of the Internet Archive’s Archiving and Data Services team, cohorts were granted access to conduct analysis on web archive collections, many of which were curated by Archive-It Partners.

Establishing connections between cohorts and collection curators was important for several reasons:

  1. Highlighting the use of institutional web archive collections in scholarly inquiry & research;
  2. Providing a way for cohorts to gain contextualized knowledge about a collection, e.g. which scoping rules were applied as inclusion/exclusion criteria; and
  3. Facilitating collaborations and opening opportunities to contribute content to web archives of underrepresented communities and voices.

Collectively, cohort teams have explored 44 web archive collections from 24 Archive-It Partners and collection curators. We thank all curators and institutions for their care in preserving cultural heritage through these web archive collections. The cohorts used collections by the following institutions and individuals:

  • Archive-It, Archiving and Data Services team
  • Archive Team
  • Brigham Young University
  • Brock University
  • Columbia University Libraries
  • Duke University
  • International Internet Preservation Consortium
  • Library and Archives Canada
  • Mark Graham
  • McGill University
  • National Museum of Women in the Arts
  • New York University
  • Nick Ruest
  • San Francisco Public Library
  • San Jose State University, School of Information
  • Schlesinger Library
  • Smith College
  • Temple University Special Collections
  • University of Manitoba
  • University of Michigan Library
  • University of Michigan School of Information
  • University of Saskatchewan
  • University of Texas at Austin Libraries
  • University of Texas at San Antonio Libraries Special Collections

Celebrating Project Accomplishments

It has been a pleasure to collaborate with these teams and witness their growth, successes, and boundless creativity in meeting challenges!

The Archives Unleashed Cohort roster includes ten projects by fifty researchers from twenty institutions.

In collaboration with cohort teams, we’ve created a project summary document that highlights the use cases of working with web archival data and shares experiences in meeting challenges that big data presents.

Cohort Project Summaries: https://bit.ly/AUCohortProjects

Here we share the description for each project and celebrate their achievements!

Crisis Communication in the Niagara Region during the COVID-19 Pandemic by Tim Ribaric, David Sharron, Cal Murgu, Karen Louise Smith, and Duncan Koerber

Using web archives collected by Brock University, this project will examine how organizations in the Niagara region have responded to government COVID-19 mandates. Analysis will focus on investigating three types of entities: local government, non-profit organizations, and major private entities. Findings from this research aim to inform future crisis communication organizational planning, specifically at the local and municipal level. The project will also create several open computational notebooks to support teaching, learning, and research.

AWAC2 — Analysing Web Archives of the COVID Crisis through the IIPC Novel Coronavirus dataset by Valérie Schafer, Karin De Wild, Frédéric Clavert, Niels Brügger, Susan Aasman, Sophie Gebeil, Joshgun Sirajzade

Investigating transnational events through web archive collections, the AWAC2 team will focus on a distant reading of the IIPC COVID-19 web archival collection to understand actors, content types and interconnectivity throughout it.

Mapping and tracking the development of online commenting systems on news websites between 1996–2021 by Anne Helmond, Johannes Paßmann, Robert Jansma, Luca Hammer, and Lisa Gerzen

This project aims to reconstruct a history of online commenting by examining the role of commenting technologies in the popularisation of commenting practices. It will do so by examining the distribution and evolution of commenting technologies on the top 25 Dutch, German, and world news websites from 1996–2021 to understand how they have shaped the practices of users. This will allow the team to explore the interplay between technologies and practices of the past and to investigate histories of natively-born technologies and practices.

  • Two presentations at the Association of Internet Researchers Conference, 2022, Dublin, Ireland [Slides].
  • WARCnet closing event, 2022, in Aarhus, Denmark [Slides].
  • Internet Archive, 2021, online event [Slides].
  • Internet Archive, 2022, Canada.
  • Workshops on ‘online technography’ using web archives at the University of Siegen.
  • Development of Jupyter Notebooks for internal use.
  • Development of ‘the Technograph’ tool (in-progress), a visualization interface on the dataset to assist in the analysis.

Everything Old is New Again: A Comparative Analysis of Feminist Media Tactics between the 2nd- to 4th Waves by Shana MacDonald, Aynur Kadir, Brianna Wiens, Sid Heeg

Project members will explore web archive collections to conduct a comparative analysis of the history of feminist media practices across interdisciplinary multi-media sources. The team expects to produce a timeline of issue responses from different historical moments and map different feminist media practices over this timeline to determine overlaps. The project’s key outcome will be to recover earlier feminist media practices and contextualize them in the digital present.

  • “Approaches to Archiving Feminist Memes.” Preserving Digital Born Media by Women: methods for decolonial & feminist futures (Panel). Film and Media Studies Association of Canada annual conference, May 2023
  • “Activists Archiving the Internet: Social Justice Informed Approaches to Digitally Born Content.” Panel with Nick Ruest, Brianna Wiens, Shawn Walker, Mina Momeni. Shaking Up the Archive, Queen Margaret University, Edinburgh, June 23–25, 2023. Accepted February 2023.
  • “Reconceptualizing Internet Archives: Feminist Memes as Repertoire” Canadian Association for Theatre Research, Halifax, June 9–12, 2023. Accepted February 2023.
  • “From Placards to Memes: The Utopic Refusals of Feminist Media Techno-Imaginaries,” co-author Brianna Wiens, Feminist Encounters (Invited for special issue Sept 2023)

Viral health misinformation from Geocities to COVID-19 by Shawn Walker, Michael Simeone, Kristy Roschke, and Anna Muldoon

This project will examine and compare two case studies of health misinformation: HIV mis/disinformation circulating on Geocities in the mid-1990s to early 2000s with the role of official COVID-19 Dashboards in COVID mis/disinformation. This work contributes to our understanding of current and historical health misinformation as well as the connections between them, and will also garner insights into how historical narratives of health misinformation have been recycled and repurposed.

  • Viral health misinformation: From GeoCities to COVID-19, presented at the Association of Internet Researchers, Dublin
  • Paper under review

Latin American Women’s Rights Movements: Tracing Online Presence through Language, Time and Space by Sylvia Fernandez, Rosario Rogel-Salazar, Verónica Benítez-Pérez, Alan Colín-Arce, Abraham García-Monroy, Hejin Shin

In analyzing web archives related to human rights and feminist movements, the project will develop a historical analysis of the websites from women’s rights movements in Mexico and Latin America, particularly those focusing on eradicating femicides and gender violence. The team will also study how these movements relate to women’s rights movements globally, specifically by comparing language expression.

  • Built Web Archive Collection, as part of Huellas Incómodas Project: https://idrhku.org/huellasincomodas/webarchive
  • Spanish Wikipedia entry for web archiving https://es.wikipedia.org/wiki/Archivado_web
  • DH Workshop hosted at the Autonomous University of the State of Mexico, 2023, organized with a grant from the Science Council of the State of Mexico https://idrhku.org/huellasincomodas/investigaciondigital
  • (Forthcoming Article) Rosario Rogel-Salazar, Abraham García, Alan Colin-Arce, Verónica Benítez-Pérez. Preserving the memory of Latin American feminist movements on digital and web counterarchives
  • Received an additional $100,000 MXN grant from the Science Council of the State of Mexico under the program Research Funding for Women Scientists.

Historicizing Aughts-Era Mormon Mommy Blogging Media Landscapes by Emily Edwards, Robin Hershkowitz, and Lauren Andrikanich

In the era of post-blogging, or microblogging, this project will explore early aughts Mormon mommy blogging culture, mediations of marriage, mothering, and feminine domesticity to historicize this period in relation to contemporary manifestations and trends of mommy influencing on social media platforms.

This project will identify connections and transitions to new media ecologies and new iterations of racialized, gendered domestic ideologies that share historic genealogies to digital media practices and structures pioneered by Mormon mommy bloggers.

  • Abstract accepted, “Digital Pioneers: Mormon Mommy Bloggers and Digital Domestics,” for a special issue of Internet Histories on “Gender and the Internet/Web History” forthcoming 2025
  • Zine “Sliding Data: Feminist Methodological Pathways” accepted for publication in DIY Methods Low-Carbon Research Methods Initiative forthcoming 2023
  • Submission as part of a panel on “Digital scholarship and the web: Exploring new sources and emerging research methods” for the Digital Library Federation Forum Conference 2023

Using Web Archives for Mapping the Use of Cultural Practices in Postconflict Societies and During Reconciliation Processes by Ricardo Velasco Trujillo and Luis Gomez

Employing computational methods such as contextual search, data mining, and web scraping, amongst others, this project aims to make an initial assessment of, and map out, the use of cultural practices in different reconciliation processes and across several human rights organizations working actively in post-conflict societies.

Querying Queer Web Archives by Filipa Calado, Corey Clawson, Di Yoong, and Lisa Rhody

In studying queer online spaces, project investigators will explore investments in concepts like utopia, play, radicalism, normativity, religion, and conversion and how they affect queer identity and discourse formation over time.

  • Yoong, D., Calado, F., & Clawson, C., (2023, May 3). Querying Queer Web Archives [Conference presentation]. IIPC WAC 2023, Hilversum, The Netherlands.

Web Archiving and the Saskatchewan COVID Archive: Expanding Coverage to Capture Social Media, Medical Misinformation, and Radicalization by Jim Clifford, Derek Cameron, Erika Dyck, Craig Harkema, Patrick Chasse, and Tim Hutchinson

Using web archives collected by the University of Saskatchewan, this project seeks to map and connect conversations and experiences of the COVID-19 pandemic with the province of Saskatchewan. The research team will develop timelines and knowledge trees to provide an opportunity to learn about the causal relationship between social media and this important shift in public health policy in Saskatchewan.

Conclusion

It has been a wonderful two years of deep collaboration with our Cohort teams. Saying goodbye at the end of a collaboration is always tough, but we have been so proud of the excellent work that we have seen. Our goal was to help support web archive-based research that would make disciplinary and societal impact, and we hope you agree that we’ve been successful on that front!
