A happy accident led me into exploring web preservation. I was doing (or trying to do) some file format identification and realised I needed to document information relating to specific software. Web preservation was something I confess I had been “putting off” because it “looked difficult”. I mean, everyone says it’s difficult so it must be, right? But inspired by a digital preservation mantra, ‘don’t let the best be the enemy of good enough’, I decided that if I wanted to capture information on the web and not find that the link had rotted when I came back to it, I would need to explore ways of “preserving” it. Oh wait – that’s like web preservation, right? So, armed with a use case, I thought now was as good a time as any to experiment with web preservation tools.
So I started with a tool called Webrecorder – I had read about this but not had a chance to play with it. Using it was pretty straightforward – you register and log in, and then you create collections (for example of related websites, or themes) which you can add to at a later date. The basic principle is that each time you “start” the recording you can hop to a website and it will capture each link you visit – including PDFs and other material (I haven’t tested it for video content – note to self – do this next!). The tool appeals to the archivist in me because it captures everything, including the relevant metadata about the capture, and you can link “recordings” (i.e. the sessions when you did the web capture) together. I see it as a great tool for personal digital archiving, which is another thing I’m interested in developing as an advocacy tool. It’s also useful for small-scale sweeps like the one I was intending, although for bigger projects something more automated would be required.
Also – and this is a big also – this tool captures websites but it doesn’t preserve them. As with any digital preservation activity, you can’t just have a tool which will “do it for you”. The tool is only as good as the systems which you link it to. In the case of Webrecorder the tool allows you to download your capture as a zipped WARC file – which is great as this is the format developed for capturing “web accessible content in an archived state”. The downloaded recordings can then be ingested into a preservation system and managed from there. Brilliant!
However (and there’s always a however) I want to check and access my WARC files. Thankfully Webrecorder comes with a player which allows you to “play back” the captured web pages. What I want to do next is experiment with using other web capture tools and playing their captures back with Webrecorder player, and also playing Webrecorder-captured files using other playback methods.
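For a quick look inside a downloaded WARC without a player, here is a minimal sketch in Python using only the standard library. It assumes the file follows the basic WARC record layout (a `WARC/` version line, `Name: value` header lines, a blank line, then a content block whose size is given by `Content-Length`), and it ignores folded header lines. The function names `open_warc` and `iter_warc_headers` are made up for illustration and are not part of Webrecorder.

```python
import gzip


def open_warc(path):
    """Open a WARC file, transparently handling gzip compression."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rb")
    return open(path, "rb")


def iter_warc_headers(stream):
    """Yield one dict of header fields per record in a binary WARC stream.

    Simplification (an assumption of this sketch): header lines that
    continue onto a following line are not handled.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separating records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {"version": line.strip().decode("utf-8")}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Skip over the captured content itself; only the headers are reported.
        stream.read(int(headers.get("Content-Length", 0)))
        yield headers
```

Something like `for h in iter_warc_headers(open_warc("capture.warc.gz")): print(h.get("WARC-Type"), h.get("WARC-Target-URI"))` would then list what was captured; the filename here is hypothetical. This only reads the record headers, so it is a sanity check rather than a replacement for proper playback.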
Webrecorder is a great system for people (like myself) who don’t have a huge amount of technical know-how, but I would like to explore other tools and systems which might require a bit more investment of time for setup and installation. The key things I want to explore are around automation and integration with our existing systems and workflows.
What I need to do:
- spend a bit more dedicated time exploring and comparing tools
- keep a log of my experiences (blog or other platform)
- think about contributing to COPTR (I notice Webrecorder isn’t on there except in the wishlist column…)
What I need help with:
- understanding the WARC file format
- understanding more about the crawl process – what can/can’t/should/shouldn’t be attempted
- understanding more about the metadata which is captured
- and a whole lot more about automation processes
Next I want to have a go with WARCreate, which is a Google Chrome plugin. I got as far as installing it, but it slowed down my browser so much that I took it off again…
Wish me luck!