web archiving – An Old Hand Digital

Mapping essential skills for the twenty first century archivist – a presentation given at the 2019 Archives and Records Association Conference in Leeds, UK, 28th August 2019.

Here’s a version with slides of the talk I gave in Leeds .

So where did it all start for me?

*What I tweeted after leaving the Archives and Records Association Conference 2012*

I worked for many years in “traditional” archives and was inspired to make a move into digital preservation partly after the ARA Conference in 2012 (the last one I attended). Over the next few years I did what I could and started to develop a digital asset register as well as work on policy but the usual frustrations hampered further progress:

Got a break working on the digital preservation of research data management at Lancaster University – a related but distinct role – where I learnt a huge amount and one of the first things I began to do was map the divide between data and archives.

People (especially from Humanities backgrounds!) tend to shy away from the word data – but there’s a lot to learn! I’m now based at the Modern Records Centre at the University of Warwick which has the records of trades unions, pressure groups and some notable individuals including Bill Morris, Eric Hobsbawm and Rodney Bickerstaffe all of which contain digital content. So I have plenty to go on both in terms of our legacy collections and also new material coming in.

It isn’t that progress isn’t being made but it could be faster and perhaps should be as time is not on our side. And if you don’t believe me one of the first collections I have worked on is material from the University’s 50th anniversary which included the following:

*From a collection of material relating to the University’s 50th anniversary celebrations*

Of these links:

Works and resolves to an internal site (but not under our administrative control). I used Webrecorder to extract the content and create a WARC file for preservation
Does not resolve but I (eventually) tracked it down the the British Library UK web archive – but you have to go there to view it.
Storify ceased to operate in 2018 so the content is now gone.

So the time to act is… now. Or rather the time to act was yesterday.

*My uncle’s enduring (for now) presence on Facebook*

I don’t have any emails from my father because I changed provider and lots all my emails dating before 2012 by which time he was dead. It doesn’t matter too much in a way because I have lots of physical photos, writings and other things of his. That seems to be slightly before a tipping point whereby suddenly I have little or nothing physically documenting my interactions and connections with people. My uncle who passed away this year was a dedicated user of Facebook, to the point where many of my friends – who never met him – have said that they will miss his comments… I have emails, photographs, Facebook comments – all of which are digital – that somehow seem a little less easy to capture.
And why am I spending time telling you about my uncle and Facebook? Because these are all stories – and this is what it is all too easy to forget – that what seem to be files, spreadsheets, databases are also stories of life, love, loss and hope. I would like everyone in the room to remember that the digital is created by human activity and the stories told are no less.

I wanted to explore a bit more about where we might be in terms of tackling digital archives by talking to a few fellow archive sectors workers to get a better sense myself of what might be required to get people to be the “archivist required to deal with digital stuff”. The people I spoke to are representative of nothing but themselves and I’m aware that there is a large scale piece of work sponsored by Jisc and the National Archives which is underway (I should know as I took part in it). However as someone who does write, advocate and deliver training I wanted to know what it is people felt they wanted as well offer up suggestions about how to go forward and go from planning to doing.

I went into Archives because I like old stuff and history – it’s no accident that my Twitter handle is “An Old Hand” – which is a palaeography joke, and my avatar is an image from an early modern document. It was the physicality of the old documents – whatever age which first drew me in, but what kept me was the stories that they tell.

When I asked the people I spoke to about why they had gone into Archives they said more or less the same thing. They all mentioned history as being a reason they got interested in archives but they also mentioned other things as well – they mentioned working with people and they mentioned stories and these relate as much to digital as to any other type of collection. Perhaps the main problem is that digital material just isn’t that old – well maybe younger people might coo at a floppy disk but even so – in archives terms (and remember I work at the Modern Records Centre!) although I discovered out earliest document is 1633 (still modern right?).

*The eighties as I remember them (although Thatcher was in colour)*

But still our earliest digital material is from the 80s which at times seems like another world but I’m guessing you either remember it or you’ve heard a lot about from your parents. It’s more familiar and what constitutes these records – word processed documents, emails, photographs – it’s the stuff of everyday – even now – so it just doesn’t have the glamour or even the appeal of the parchment deed, the daguerreotype or whatever. And all of my conversations indicated that people – regardless of their age – were just not as excited about the archives that weren’t cool old stuff.

One of the things I want to do is to get people excited about the cool new stuff. And if it can’t be done through the physicality of the medium (which it hasn’t really got) then then it’s through the content – whether its (in my case) material from the Fire Brigades Union – whose records we have most recently taken in . It’s not hard to see how we can easily find interesting and engaging content which highlights important stories.

So if we know why we need to do it – the next thing is how. A word that came up again and again was “confidence” or lack thereof although this wasn’t reflected in all the people I spoke to. There is a course, webinar, book, podcast on everything under the sun if you know where to find it. But that’s one of the problems. There’s just too much out there – how do you know which is the “right” tool, the “right” workflow for the job? Here are some basics which I would recommend looking at if you’re looking to make a start:

If you are in a management position (and even if you are not) you need to be advocating for digital preservation so that means an understanding of the consequences of inaction, of the risks and of course to know about the collections which you are or might be dealing with. Digital Preservation Coalition has a great website which includes the basics which is fantastic for your elevator pitches and so on

But the secret is that there is no gold standard. Yes – it is helpful to have tools to work with like Preservica or Rosetta or Archivematica but you ask anyone who has those tools if they are doing the preservation for you? They are not. They are not capturing them, sorting them, listing or appraising them or even storing them.

If you can start by protecting those bit streams and knowing – as much as you can – what those file types are then you have already made a start. Because if you’ve done that then the next generation of archivists will be able to build on your base and do some of this more painstaking work that you don’t have time, resource or capacity for. We need to agree on an understanding of what the base line is – somewhere around bit stream preservation with “as much metadata as I can get”. To me that looks like capture, context and running the files through DROID. Even this might be too much for people in very locked down situations such as in a Local Authority environment where you might even be limited by this. Could we look at sharing some command line stuff that might at least enable basic tasks? And then share them?

It’s something we can all be doing and it’s something we do together. Many of you here are already doing things – big and small and everyone else wants to hear about it – to give ideas, encouragement and to help find strategies which work in the real world.

Let’s try and find platforms for this so we can share our experiences and learn from each other.

Yes – I am asking you to make your lives difficult difficult but be brave – you don’t have to do it alone.

A happy accident led me into exploring web preservation. I was doing (or trying to do) some file format id-ing and realised I needed to document information relating to specific software. Web preservation was something I confess I had been “putting off” because it “looked difficult”. I mean everyone says it’s difficult so it must be, right? But inspired by a digital preservation mantra: ‘don’t let the best be the enemy of good enough’ I decided that if I wanted to capture information on the web and not find it the the link had rotted when I came back to it I would need to explore ways of “preserving” it. Oh wait – that’s like web preservation, right? So armed with a use case I thought now was as good a time as any to experiment with web preservation tools.

tweet1

So I started with a tool called Webrecorder – I had read about this but not had a chance to play with it. Using it was pretty straightforward – you need to register and log in and then you then create collections (say for example related websites, or themes) which you can add to at a later date. The basic principle is that each time you “start” the recording you can hop to a website and it will capture each link you visit – including PDFs and other material (I haven’t tested it for video content – note to self – do this next!). The tool appeals to the archivist in me because it captures everything; the relevant metadata about the capture and you can link “recordings” (ie sessions when you did the web capture) together. I see it as a great for personal digital archiving which is another thing I’m interested in developing as an advocacy tool. It’s also useful for small scale sweeps like the one I was intending although for bigger projects something more automated would be required.

ipad-820272_1920

Also – and this is a big also – this tool captures web sites but it doesn’t preserve them. Like any digital preservation activity you can’t just have a tool which will “do it for you”. The tool is only as good as the systems which you link it to. In the case of Webrecorder the tool allows you to download your capture as a zipped WARC file – which is great as this is the format developed for capturing “web accessible content in an archived state”. Recordings from Webrecorder can then be downloaded and ingested into a preservation system and managed from there. Brilliant!

However (and there’s always a however) I want to check and access my WARC files. Thankfully Webrecorder comes with a player which allows you to “play back” the captured web pages. Want I want to do next is experiment with using other web capture tools and playing them back with Webrecorder player and also playing Webrecorder captured files using other playback methods.

Webrecorder is a great system for people (like myself) who don’t have a huge amount of technical know-how but I would like to explore other tools and systems which might require a bit more investment in time for set up and installation. The key things I want to explore are around automation and integration with our existing systems and workflows.

school-2253459_1920

What I need to do:

spend a bit more dedicated time exploring and comparing tools
keep a log of my experiences (blog or other platform)
think about contributing to COPTR (I notice Webrecorder isn’t on there except in the wishlist column…)

What I need help with:

understanding the WARC file format
understanding more about the crawl process – what can/can’t/should/shouldn’t be attempted
understanding more about the metadata which is captured
and a whole lot more about automation processes

Next I want to have a go with WARCreate which is a Google Chrome plugin. I got as far as installing it but it slowed down my browser performance so much I took it off again…

Wish me luck!

An Old Hand Digital

Tag: web archiving

Good enough

Mapping essential skills for the twenty first century archivist – a presentation given at the 2019 Archives and Records Association Conference in Leeds, UK, 28th August 2019.

Happy accidents: adventures in web preservation