Happy accidents: adventures in web preservation

A happy accident led me into exploring web preservation.  I was doing (or trying to do) some file format id-ing and realised I needed to document information relating to specific software.  Web preservation was something I confess I had been “putting off” because it “looked difficult”.  I mean everyone says it’s difficult so it must be, right? But inspired by a digital preservation mantra, ‘don’t let the best be the enemy of good enough’, I decided that if I wanted to capture information on the web and not find that the link had rotted when I came back to it I would need to explore ways of “preserving” it.  Oh wait – that’s like web preservation, right?  So armed with a use case I thought now was as good a time as any to experiment with web preservation tools.


So I started with a tool called Webrecorder – I had read about this but not had a chance to play with it.  Using it was pretty straightforward – you need to register and log in and then you create collections (say for example related websites, or themes) which you can add to at a later date.  The basic principle is that each time you “start” the recording you can hop to a website and it will capture each link you visit – including PDFs and other material (I haven’t tested it for video content – note to self – do this next!).  The tool appeals to the archivist in me because it captures everything, including the relevant metadata about the capture, and you can link “recordings” (ie sessions when you did the web capture) together.  I see it as great for personal digital archiving which is another thing I’m interested in developing as an advocacy tool.  It’s also useful for small scale sweeps like the one I was intending although for bigger projects something more automated would be required.


Also – and this is a big also –  this tool captures web sites but it doesn’t preserve them.  Like any digital preservation activity you can’t just have a tool which will “do it for you”.  The tool is only as good as the systems which you link it to.  In the case of Webrecorder the tool allows you to download your capture as a zipped WARC file – which is great as this is the format developed for capturing “web accessible content in an archived state”. Recordings from Webrecorder can then be downloaded and ingested into a preservation system and managed from there.  Brilliant!

However (and there’s always a however) I want to check and access my WARC files. Thankfully Webrecorder comes with a player which allows you to “play back” the captured web pages.  What I want to do next is experiment with using other web capture tools and playing their captures back with Webrecorder player, and also playing Webrecorder captured files using other playback methods.
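
For a quick look inside a WARC file without a player, a few lines of Python can list the records it holds.  This is only a minimal sketch, not part of Webrecorder: it assumes an uncompressed WARC stream in which each record starts with a “WARC/…” version line, followed by named header fields, a blank line and a content block of Content-Length bytes.  The function name iter_warc_headers is hypothetical, made up for illustration.

```python
import io


def iter_warc_headers(stream):
    """Yield a dict of header fields for each record in a WARC stream.

    `stream` is assumed to be a binary file-like object holding
    uncompressed WARC data. Content blocks are skipped, not parsed.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of the stream
        if not line.strip():
            continue  # blank lines separate one record from the next
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"expected a WARC version line, got {version!r}")
        headers = {"version": version}
        # Read "Name: value" header lines until the blank line that ends them.
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Skip over the content block so the next read starts at the next record.
        stream.read(int(headers.get("Content-Length", "0")))
        yield headers
```

A gzipped capture could be opened with Python’s gzip.open(path, "rb") and passed to the same function (path here stands for wherever the download was saved).  For real work an established WARC library would be a safer choice than this sketch.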

Webrecorder is a great system for people (like myself) who don’t have a huge amount of technical know-how but I would like to explore other tools and systems which might require a bit more investment in time for set up and installation.  The key things I want to explore are around automation and integration with our existing systems and workflows.


What I need to do:

  • spend a bit more dedicated time exploring and comparing tools
  • keep a log of my experiences (blog or other platform)
  • think about contributing to COPTR (I notice Webrecorder isn’t on there except in the wishlist column…)

What I need help with:

  • understanding the WARC file format
  • understanding more about the crawl process – what can/can’t/should/shouldn’t be attempted
  • understanding more about the metadata which is captured
  • and a whole lot more about automation processes

Next I want to have a go with WARCreate which is a Google Chrome plugin.   I got as far as installing it but it slowed down my browser performance so much I took it off again…

Wish me luck!



Radical Collections

Senate House
Image: Steve Cadman, Flickr (https://www.flickr.com/photos/stevecadman/496743569), CC BY-SA 2.0

I attended the inspirational Radical Collections conference held at Senate House on 3rd March which was part of their Radical Voices season.  The main themes of the conference were collections development, the politics of cataloguing and widening participation and representation.  Many of the papers focused on more than one of these themes and the papers and audience were a good (healthy?) mix of archives and library professionals and others.  My role as digital archivist is not just about preservation but also access to digital collections and their on-going management.  Archive collections (and other library special collections) do not sit in isolation and have to be considered as part of a wider cultural and political background.  Decisions made by library and archive professionals have consequences for their donors and users. Recognising the significance of the context in which collecting and managing take place is key to meeting equality and diversity agendas.

The first session looked at some collections with “radical” contents: Ken Loach’s archive at the BFI, the Underground and Alternative Press Collection at the University of Brighton and the archives of Radical Psychiatry (or anti-psychiatry).  The collections referred to were very varied in subject matter but were united in the way in which they cast light on “alternative” narratives.  Interviews in Ken Loach’s archive reveal the dissenting voices of union activists of the 1970s and 80s which are not otherwise represented in the official archives.  Brighton’s underground and alternative press collection documents the hugely influential narratives of alternative community activity in Brighton – much of which has since become mainstream, such as environmental activism, but which had its origins in alternative activism.  Likewise the history of the development of psychiatry has a counter-narrative of alternative practice.

The second panel looked at some of the more political aspects of library collections and tackled questions as diverse as discriminatory library cataloguing systems (and practice) and the predominance of whiteness in librarianship.  The papers were a useful reminder – if that were needed – of the constant need to address inbuilt discriminatory practice.  Inclusion and presence are not enough and sometimes the structures themselves need to be challenged to lessen discrimination.

Zine Library
Image: Cory Doctorow, Flickr (https://www.flickr.com/photos/doctorow/100318253), CC BY-SA 2.0

In the afternoon there was more on radical histories drawn from collections, from the children’s literature in Cork reflecting the emergence of the Irish Free State at the beginning of the twentieth century, through the archive of a women’s organisation of the 1990s to the issues around the preservation of and access to zines.  This session had a lot of focus on the personal relationships which develop between the creator of the collections and the collecting institution.  This exposes tensions where there are ideological differences between the creator and the collecting institution – something which was returned to later.

The final panel looked at issues directly relating to the workforce.  I was asked to step in to chair this session at the last minute – which I was very happy to do as it was a fascinating range of papers only marred by a fire alarm which interrupted the first speaker.  Tamsin Bookey from Tower Hamlets revisited the issue of whiteness in libraries and archives in terms of users, collections and staff.  She also looked at the Social Model of Disability in relation to archives provision.  Katherine Quinn in the second paper looked at the challenge of radical librarianship in the HE Sector and finally Kirsty Fife and Hannah Louise Henthorn discussed approaches to diversifying the archives sector and launched a survey which you can take part in: Marginalised in the UK archive sector.

The conference was extremely thought provoking and there are a number of issues that I have been reflecting on with respect to my practice.  Libraries and archives are not neutral spaces nor are they “a static auxiliary” to education, as defined by some sociologists.  Any collecting or engagement activity in libraries or archives needs very careful assessment and critique to support equality in service provision and maintain transparency (rather than neutrality which is not achievable).

Looking forward


A picture of people doing a lot more exercise than I do! (image: http://skitterphoto.com/?portfolio=4-mile-run-groningen)

Like a lot of people I have spent January setting priorities for the year ahead.  I haven’t given up chocolate or done any more exercise but I have been giving some thought to both where I would like to focus in my work and some of the areas I would like to develop in my practice.  One of the first things I would like to do is sign up for an XML course – I’m keen to improve my technical skills and this looks like a good place to start.  I will always be an archivist not a developer but I want to have more confidence to be able to:

  • talk to developers and IT colleagues
  • develop a more critical approach to choosing tools to work with
  • try out more technical tasks such as file format id-ing
  • explore more possibilities of using data in digital humanities contexts

Preservation workflows

(image: startupstockphotos.com/post/123128198211)

Another thing I’m focusing on at the moment is conducting an in-depth analysis of my digital preservation workflow.  We’ve been playing around with automating elements of our workflow which ingests and processes research data and then prepares it for long term preservation. What I have planned out at the moment is very piecemeal and I know from experience that piecemeal solutions hide weaknesses and dependencies that have not been fully thought through.  Our test instance of Archivematica fell over because of an upgrade elsewhere on the system – lack of communication and insufficient planning led to a problem.  This is of course why we’re not yet in the production stage but it did bring home to me how important solid planning and the identification of dependencies are (if that wasn’t already apparent!).

Getting the information out there

You won’t need a Mac to access our catalogue… (image: startupstockphotos.com/post/123128547586/at-barrel-soho-nyc)

I’m also exploring cataloguing systems and am currently playing with AtoM – an open source, standards-based cataloguing system from Artefactual (who also develop Archivematica) which looks to offer many of the things which we will be requiring.  I have some existing catalogues to import (which is proving rather more tricky than I thought it would be) but I like what I see of what the system offers in terms of standards conformance, ease of use and interoperability.  I am looking for a system that will nicely expose digital and non-digital descriptions side by side and an integration with Archivematica is important for this.  I am also keen for it to work alongside our current Onesearch library catalogue to allow users to navigate across collections and find their way around everything the university has to offer.


I want to get into the habit of regular blogging and have been inspired by Jen Mitcham’s regular Digital Archiving updates as well as Kirsty Lee’s Bits and Pieces.  A longer read which is worth a look, and which I will be coming back to, is Bentley Historical Library’s Appraising Digital Archives with Archivematica paper which was written from elements already appearing in their blog.

So – here’s to a busy year of digital archives!


United we stand 

I attended a Digital Preservation Coalition training event recently in Liverpool called “Making Progress with Digital Preservation”. This came at a good time for me after having been in post as Digital Archivist at Lancaster University for a couple of months and finding myself trying to do just that.  It was a great opportunity to meet some of my fellow professionals in the region and also to meet a wide range of practitioners from different disciplines who had come together to try and get their heads around some of the challenges faced by the emerging and changing discipline of Digital Preservation.

One of the big themes of the day – and something I’ve been giving quite a bit of thought to recently – is the need for advocacy – as William Kilbride, chief executive of the DPC, said “a huge part of digital preservation is relentless advocacy” and certainly the relentless nature of it can seem daunting. I often think that very few people really grasp what it is I am trying to achieve in my job – it can be quite hard to explain – and without having the record creators on board the task of preserving is impossible. Digital preservation does not take place in isolation – it is a combination of tasks undertaken by a wide range of people taking on the challenges posed by the technologies, information, curation, selection and so on and so on.

As was discussed at the event, digital preservation is an activity undertaken by people from many different disciplines each of whom brings a different angle or perspective to many of the issues which are being grappled with. This includes librarians, records managers, archivists, data managers, IT systems people, researchers… the list is endless.  It’s a collaborative effort and one which, if it is to succeed, needs to be taken up and be taken seriously by anyone who is engaged in data creation.  And by that of course I mean everybody.

Funding models for projects mean that there is a multiplicity of time-limited projects, the results of which are scattered and difficult to navigate even for someone who knows a little about the subject. On the plus side there are lots of people who are keen to share their knowledge, experience and expertise, and only by strong collaborative working will we really achieve results.

I’m preparing to introduce my colleagues to the principles of Digital Preservation because that advocacy work starts at home, and I can’t save the world’s digital data on my own.

This week I’ve been reading this article by Anthony Cocciolo, Professor of Information and Library Science at the Pratt Institute, New York, which looks at the archivist in a data manager’s world.  I’ve also looked at this article from the International Journal of Digital Curation on how we should be taking a holistic approach to data curation.

Happy International Archives Day

I’ve been motivated to write my blog to coincide with International Archives Day which is being celebrated on 9th June with the theme the year of democracy.  The blog is intended to chart my progress in digital preservation which is a new(-ish) direction for me. However as an archivist committed to ensuring authenticity, transparency and access to information it’s one which I see as the logical way of taking this work on into the future and ensuring current and future archives continue to maintain these principles.  In fact it underpins the whole democratic process, and the whole business of democracy cannot exist without archivists and information managers supporting its regulation.

“Secrecy, being an instrument of conspiracy, ought never to be the system of regular government.” Jeremy Bentham, On Publicity from The Works of Jeremy Bentham volume 2, part 2 (1839).

However before these weightier matters can be tackled I need to take my first steps in mapping out a digital preservation strategy and my first task has been to survey what other institutions are doing, what kind of policies they have and any interesting or innovative ways in which digital collections are preserved and presented.  It’s given me a great opportunity to spend some time looking at a variety of collections, some of my favourites being YODAL – the University of York’s Digital Library and New York Public Library’s digital collections. Whilst I was on the “York” theme (there must be something in the name which promotes good digital projects) I found a wonderful set of digitised images relating to the Spanish Civil War held at New York University and made available via their Digital Library Projects from originals held in the International Brigade Archives in Moscow.  Lots of other fascinating stuff here as well including the Guantánamo Lawyers Archive.

In the meantime I’ll be following #IAD15 on Twitter for all the best archives and democracy stories from around the world.