
Information Integrity through Web Archiving: Capturing Data Releases

James Lowry

This blog is a part of a series of posts on the importance of information integrity. Click here to read the introductory post.

On 11th November 2016, Matthew Kirschenbaum tweeted:

Safe to assume that the @realDonaldTrump account will now be archived by @USNatArchives under the Presidential Records Act?

Kirschenbaum was highlighting the political importance of digital preservation. When so much political debate and campaigning takes place over social media, capturing and preserving web-published information becomes vital for accountability.

Beyond social media, web archiving has other politically important applications. If open data is to be the basis of policy decisions, planning for service provision, or public debate, and if it's to be the common ground on which citizens and governments stand, then a record should be kept. What data did the government publish? Where and when?

Web archiving provides a solution: taking snapshots (harvesting) of web content using web-crawlers and preserving them in digital repositories. Importantly (from an audit point of view), web-crawlers also capture metadata about the harvesting process. The oldest web-archiving initiative is the Internet Archive, which began web-crawling in 1996 and which, since 2001, has provided access to its collection through the Wayback Machine.
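Archived snapshots can also be queried programmatically. As a minimal illustration, the sketch below uses the Internet Archive's public Wayback Machine availability endpoint to find the archived copy of a page closest to a given date. It assumes Python with the requests library installed, and the example URL and date are placeholders only.

```python
# Query the Wayback Machine availability API for the archived snapshot
# of a URL closest to a given date. Requires the `requests` library.
import requests

def closest_snapshot(url, timestamp="20161111"):
    """Return the closest archived snapshot record for `url`, or None."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": timestamp},
        timeout=30,
    )
    resp.raise_for_status()
    # Response shape: {"archived_snapshots": {"closest": {"url": ..., "timestamp": ..., ...}}}
    return resp.json().get("archived_snapshots", {}).get("closest")

if __name__ == "__main__":
    print(closest_snapshot("https://data.gov.uk"))  # placeholder example URL
```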

In his practical guide to archiving websites, Adrian Brown observed that the ‘ease with which content can be made available via the web, combined with the fragility of that content in a world of constant technological change, engenders an information environment which can be positively hostile to long-term sustainability’ (p.3). Technological change is one threat; the active removal of content is another. Text can be altered, pages taken down, links removed. Poor management and lack of resources also pose risks to the persistence of web content.

This suggests that the frequency of website capture needs close consideration. How frequently is web archiving happening? The UK Government Web Archive captures datasets published on data.gov.uk only twice a year. Even if governments archive their own web estates more frequently, the dynamic nature of online publishing suggests that official web archiving can’t realistically capture every update or every release of data.

Should civil society organisations be creating alternative archives that document the provenance of the data they’re using? Given some governments’ limited capacity for or interest in web archiving, civil society web archives may be the only record of the state’s online publications. The International Internet Preservation Consortium provides access to a range of tools for harvesting, preserving and providing access to archived web content that may be useful here. It also provides guidance on the various legal issues that arise from web archiving.

Perhaps web archiving needs to become part of the process of using open data. Is it feasible to produce a tool that would allow civil society actors to document the sources of the data they’re using? Learning lessons from the Eyewitness to Atrocities app, the tool could capture metadata about the context of the data release. A tool like this would allow users to trace data back to its source. Are there potential civil applications of the content identification work of initiatives such as DataCite, which is supporting the research community to locate, identify and cite research data? As the recent US election shows, being able to provide and check sources is critical for informed political engagement.
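As a purely illustrative sketch of the kind of provenance-documenting tool imagined above (not an existing tool), the Python snippet below shows one way such capture might work: download a dataset, record the retrieval time, a few descriptive HTTP headers, and a SHA-256 fixity checksum, and emit the result as a JSON record that could later be used to trace the data back to its source. All function and field names here are hypothetical.

```python
# Illustrative sketch only: one way a civil-society tool might document
# the provenance of a downloaded dataset. Names are hypothetical.
import hashlib
import json
import urllib.request
from datetime import datetime, timezone

def record_provenance(data_url):
    """Download a dataset and return a provenance record describing it."""
    with urllib.request.urlopen(data_url, timeout=60) as resp:
        body = resp.read()
        captured_headers = {
            name: resp.headers.get(name)
            for name in ("Last-Modified", "ETag", "Content-Type")
        }
    return {
        "source_url": data_url,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(body).hexdigest(),  # fixity check for later verification
        "http_headers": captured_headers,
    }

if __name__ == "__main__":
    record = record_provenance("https://data.gov.uk/")  # placeholder example URL
    print(json.dumps(record, indent=2))
```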
