Happy accidents: adventures in web preservation

A happy accident led me into exploring web preservation.  I was doing (or trying to do) some file format id-ing and realised I needed to document information relating to specific software.  Web preservation was something I confess I had been “putting off” because it “looked difficult”.  I mean everyone says it’s difficult so it must be, right? But inspired by a digital preservation mantra, ‘don’t let the best be the enemy of good enough’, I decided that if I wanted to capture information on the web and not find that the link had rotted when I came back to it, I would need to explore ways of “preserving” it.  Oh wait – that’s like web preservation, right?  So armed with a use case I thought now was as good a time as any to experiment with web preservation tools.

So I started with a tool called Webrecorder – I had read about this but not had a chance to play with it.  Using it was pretty straightforward – you need to register and log in and then you create collections (say, for example, of related websites or themes) which you can add to at a later date.  The basic principle is that each time you “start” the recording you can hop to a website and it will capture each link you visit – including PDFs and other material (I haven’t tested it for video content – note to self – do this next!).  The tool appeals to the archivist in me because it captures everything, including the relevant metadata about the capture, and you can link “recordings” (i.e. sessions when you did the web capture) together.  I see it as a great tool for personal digital archiving, which is another thing I’m interested in developing as an advocacy tool.  It’s also useful for small scale sweeps like the one I was intending, although for bigger projects something more automated would be required.

Also – and this is a big also –  this tool captures web sites but it doesn’t preserve them.  Like any digital preservation activity you can’t just have a tool which will “do it for you”.  The tool is only as good as the systems which you link it to.  In the case of Webrecorder the tool allows you to download your capture as a zipped WARC file – which is great as this is the format developed for capturing “web accessible content in an archived state”. Recordings from Webrecorder can then be downloaded and ingested into a preservation system and managed from there.  Brilliant!

However (and there’s always a however) I want to check and access my WARC files. Thankfully Webrecorder comes with a player which allows you to “play back” the captured web pages.  What I want to do next is experiment with using other web capture tools and playing their captures back with Webrecorder player, and also playing Webrecorder captured files using other playback methods.
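For anyone else who wants to peek inside a WARC file before trusting a player with it, here is a minimal sketch in Python. It is not how Webrecorder itself reads files; it only illustrates the basic shape of the format as I understand it: each record starts with a version line, then named header fields, then a blank line, then a content block whose size is given by the Content-Length field. The function names `iter_warc_records` and `make_record` are my own inventions, and the sample records are made up and deliberately simplified (a real record carries further fields such as a record ID and a date).

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of the stream
        if not line.strip():
            continue  # blank lines separate one record from the next
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"expected a WARC version line, got {version!r}")
        headers = {"version": version}
        # Read "Name: value" header fields until the blank line that ends them.
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length says exactly how many bytes of content follow.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def make_record(record_type, target_uri, body):
    """Build a simplified, made-up WARC record for demonstration only."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


# Two invented records standing in for a real capture.
sample = (
    make_record("response", "http://example.org/", b"HTTP/1.1 200 OK\r\n\r\nhello")
    + make_record("resource", "http://example.org/report.pdf", b"%PDF-1.4 demo")
)

for headers, payload in iter_warc_records(io.BytesIO(sample)):
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload), "bytes")
```

Downloaded captures are usually gzip-compressed, so for a real file I would expect to open it with `gzip.open(path, "rb")` and pass the result to the same function, but I have not tested this sketch against real Webrecorder output.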

Webrecorder is a great system for people (like myself) who don’t have a huge amount of technical know-how but I would like to explore other tools and systems which might require a bit more investment in time for set up and installation.  The key things I want to explore are around automation and integration with our existing systems and workflows.

What I need to do:

  • spend a bit more dedicated time exploring and comparing tools
  • keep a log of my experiences (blog or other platform)
  • think about contributing to COPTR (I notice Webrecorder isn’t on there except in the wishlist column…)

What I need help with:

  • understanding the WARC file format
  • understanding more about the crawl process – what can/can’t/should/shouldn’t be attempted
  • understanding more about the metadata which is captured
  • and a whole lot more about automation processes

Next I want to have a go with WARCreate which is a Google Chrome plugin.   I got as far as installing it but it slowed down my browser performance so much I took it off again…

Wish me luck!

 

Digital Ambition

Last week I was delighted to travel to Hull History Centre to speak at the Archives and Records Association’s Section for Archives and Technology Digital Ambition training session about digital preservation.  The audience was made up of archives professionals who were interested, but not necessarily specialist, in digital preservation.

We were welcomed to Hull History Centre by University Archivist Simon Wilson who gave us an overview of their project to capture the Hull City of Culture 2017 events.  This is extremely ambitious in its scale and complexity and he talked us through some of the more challenging aspects of working on a collaborative time-limited project.  They are learning lessons from The Olympic and Paralympic Record project run by the National Archives in capturing the digital elements of a cultural event.

Hull: Weeping Window (author’s own CC-BY) https://www.hull2017.co.uk/whatson/events/poppies-weeping-window/

Next up was Jen Mitcham of the University of York who shared her experience of using Open Source software.  She introduced the context by way of a great Lego video and then went through the challenges and benefits of using Open Source solutions.  Using Open Source definitely comes hand in hand with working as part of a community – and building a community is a feature of digital preservation, which relies on input and support from a wide variety of places.  There was discussion about how different institutions and organisations have a very different take on how far they embrace the Open Source model – unsurprising given the range of organisations represented and the different needs and priorities they have.

Lunch was a great opportunity to meet and catch up with friends and colleagues working on various aspects of digital preservation.  There was quite a lot of talk about developing digital skills for archivists – which was the theme of my presentation – and interestingly shortly after the event Library Carpentry put out a call for an “Archivists Wish List” of skills – please consider contributing to their call for suggestions or even become part of the community by taking part in their ideas sprint.

In the afternoon we heard from Gary Tuson from Norfolk Record Office who is leading an impressive consortium bid from East Anglia to procure preservation services at scale for 5 organisations of different sizes and needs.  It’s impressive stuff and not without challenges – especially in dealing with sensitive data – but an important and worthwhile exercise.  Even the scoping of the project is a huge learning experience, says Gary.

Image: Pixabay CC0 (https://pixabay.com/en/computer-display-electronics-1869236/)

Finally I had my slot and I talked about the transferable skills which all archivists have, in a talk I had rather cheesily called “From Secretary hand to software”.  The idea was to encourage those who were present to think about the skills they have as archivists and apply them to the issues around digital preservation.  Fragile formats in danger of obsolescence, metadata requirements, authenticity and provenance – all these key issues with regards to digital formats should be familiar to the archivist from a traditional background.  Granted, the technology is different, and this is where we are going to need help from our IT colleagues, although some basics are helpful.  In fact it would be useful to start bringing together some ideas – ARA’s Section for Archives and Technology are certainly interested in developing future training opportunities and the Library Carpentry “Archivists Wish List” is another way of contributing to the skills development process.

My main regret of the day was not to spend more time in the City of Culture 2017 and my main message of the day was:

Do nothing and you are guaranteed to lose records.

Do something!

I tried to be helpful on this and suggest my two favourite articles which I refer to time and again: NDSA’s Levels of Digital Preservation and Tim Gollins’ Parsimonious Preservation.  The former because it offers a structured approach and goals to aim for and the latter because it is short, accessible and reassuring.  I return to them often and when I got back to Lancaster I put into practice my “do something” and started my long overdue inventory of digital holdings in our Special Collections.  Time to get planning!
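For anyone wanting to “do something” in the same way, a first-pass inventory can be sketched in a few lines of Python. This is only an illustration and not the inventory I actually produced: the function names (`sha256_of`, `build_inventory`, `write_inventory`) and the choice of columns are assumptions made for the example. It walks a folder, records each file’s relative path, extension and size, and adds a SHA-256 checksum so that later fixity checks have something to compare against.

```python
import csv
import hashlib
import os
import sys
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_inventory(root):
    """Walk `root` and return one dictionary per file found."""
    rows = []
    for folder, subfolders, filenames in os.walk(root):
        subfolders.sort()  # visit folders in a predictable order
        for name in sorted(filenames):
            full_path = os.path.join(folder, name)
            rows.append({
                "path": os.path.relpath(full_path, root),
                "extension": os.path.splitext(name)[1].lower(),
                "bytes": os.path.getsize(full_path),
                "sha256": sha256_of(full_path),
            })
    return rows


def write_inventory(rows, out_file):
    """Write the inventory rows as CSV to an open text file."""
    writer = csv.DictWriter(out_file, fieldnames=["path", "extension", "bytes", "sha256"])
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Demonstration on a throwaway folder containing one invented file.
    with tempfile.TemporaryDirectory() as demo_root:
        with open(os.path.join(demo_root, "notes.txt"), "w", encoding="utf-8") as handle:
            handle.write("survey notes\n")
        write_inventory(build_inventory(demo_root), sys.stdout)
```

Pointing `build_inventory` at a real folder and writing the result to a CSV file gives a spreadsheet that can be sorted by extension or size, which is a reasonable starting point for planning.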

Image: Pixabay CC0 (https://pixabay.com/en/student-typing-keyboard-text-woman-849825/)

Radical Collections

Senate House
Image: Steve Cadman, Flickr (https://www.flickr.com/photos/stevecadman/496743569)      CC BY-SA 2.0

I attended the inspirational Radical Collections conference held at Senate House on 3rd March which was part of their Radical Voices season.  The main themes of the conference were collections development, the politics of cataloguing and widening participation and representation.  Many of the papers focused on more than one of these themes and the papers and audience were a good (healthy?) mix of archives and library professionals and others.  My role as digital archivist is not just about preservation but also access to digital collections and their on-going management.  Archive collections (and other library special collections) do not sit in isolation and have to be considered as part of a wider cultural and political background.  Decisions made by library and archive professionals have consequences for their donors and users. Understanding the context of collecting and managing is key to meeting equality and diversity agendas.

The first session looked at some collections with “radical” contents: Ken Loach’s archive at the BFI, the Underground and Alternative Press Collection at the University of Brighton and the archives of Radical Psychiatry (or anti-psychiatry).  The collections referred to were very varied in subject matter but were united in the way in which they cast light on “alternative” narratives.  Interviews in Ken Loach’s archive reveal the dissenting voices of union activists of the 1970s and 80s which are not otherwise represented in the official archives.  Brighton’s underground and alternative press collection documents the hugely influential narratives of alternative community activity in Brighton – much of which, such as environmental activism, has since become mainstream but had its origins in alternative activism.  Likewise the history of the development of psychiatry has a counter-narrative of alternative practice.

The second panel looked at some of the more political aspects of library collections and tackled questions as varied as discriminatory library cataloguing systems (and practice) and the predominance of whiteness in librarianship.  The papers were a useful reminder – if that were needed – of the constant need to address inbuilt discriminatory practice.  Inclusion and presence are not enough and sometimes the structures themselves need to be challenged to lessen discrimination.

Zine Library
Image: Cory Doctorow, Flickr (https://www.flickr.com/photos/doctorow/100318253)             CC BY-SA 2.0

In the afternoon there was more on radical histories drawn from collections, from the children’s literature in Cork reflecting the emergence of the Irish Free State at the beginning of the twentieth century, through the archive of a women’s organisation of the 1990s to the issues around the preservation of and access to zines.  This session had a lot of focus on the personal relationships which develop between the creator of the collections and the collecting institution.  This exposes tensions where there are ideological differences between the creator and the collecting institution – something which was returned to later.

The final panel looked at issues directly relating to the workforce.  I was asked to step in to chair this session at the last minute – which I was very happy to do as it was a fascinating range of papers, only marred by a fire alarm which interrupted the first speaker.  Tamsin Bookey from Tower Hamlets revisited the issue of whiteness in libraries and archives in terms of users, collections and staff.  She also looked at the Social Model of Disability in relation to archives provision.  Katherine Quinn in the second paper looked at the challenge of radical librarianship in the HE Sector and finally Kirsty Fife and Hannah Louise Henthorn discussed approaches to diversifying the archives sector and launched a survey which you can take part in: Marginalised in the UK archive sector.

The conference was extremely thought provoking and there are a number of issues that I have been reflecting on with respect to my practice.  Libraries and archives are not neutral spaces nor are they “a static auxiliary” to education, as defined by some sociologists.  Any collecting or engagement activity in libraries or archives needs very careful assessment and critique to support equality in service provision and maintain transparency (rather than neutrality which is not achievable).

Looking forward

A picture of people doing a lot more exercise than I do! (image: http://skitterphoto.com/?portfolio=4-mile-run-groningen)

Like a lot of people I have spent January setting priorities for the year ahead.  I haven’t given up chocolate or done any more exercise but I have been giving some thought both to where I would like to focus in my work and to some of the areas I would like to develop in my practice.  One of the first things I would like to do is sign up for an XML course – I’m keen to improve my technical skills and this looks like a good place to start.  I will always be an archivist not a developer but I want to have more confidence to be able to:

  • talk to developers and IT colleagues
  • develop a more critical approach to choosing tools to work with
  • try out more technical tasks such as file format id-ing
  • explore more possibilities of using data in digital humanities contexts

Preservation workflows

(image: startupstockphotos.com/post/123128198211)

Another thing I’m focusing on at the moment is conducting an in-depth analysis of my digital preservation workflow.  We’ve been playing around with automating elements of our workflow which ingests and processes research data and then prepares it for long term preservation. What I have planned out at the moment is very piecemeal and I know from experience that piecemeal solutions hide weaknesses and dependencies that have not been fully thought through.  Our test instance of Archivematica fell over because of an upgrade elsewhere on the system – lack of communication and insufficient planning led to a problem.  This is of course why we’re not yet in the production stage but it did bring home to me how important solid planning and the identification of dependencies are (if that wasn’t already apparent!).

Getting the information out there

You won’t need a Mac to access our catalogue… (image: startupstockphotos.com/post/123128547586/at-barrel-soho-nyc)

I’m also exploring cataloguing systems and am currently playing with AtoM – an open source, standards-based cataloguing system from Artefactual (who also develop Archivematica) which looks to offer many of the things which we will be requiring.  I have some existing catalogues to import (which is proving rather more tricky than I thought it would be) but I like what I see of what the system offers in terms of standards conformance, ease of use and interoperability.  I am looking for a system that will nicely expose digital and non-digital descriptions side by side and an integration with Archivematica is important for this.  I am also keen for it to work alongside our current Onesearch library catalogue to allow users to navigate across collections and find their way around everything the university has to offer.

Blogging

I want to get into the habit of regular blogging and have been inspired by Jen Mitcham’s regular Digital Archiving updates as well as Kirsty Lee’s Bits and Pieces.  A longer read which is worth a look, and which I will be coming back to, is Bentley Historical Library’s Appraising Digital Archives with Archivematica paper, which was written from elements already appearing in their blog.

So – here’s to a busy year of digital archives!

 

Piece by piece

The Old College, Edinburgh (Image by Kim Traynor – Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=18939432)

Breaking the boundaries

Recently I had the chance to take a trip up to Edinburgh University to take part in an event called Research Data, Records and Archives: Breaking the Boundaries which was organised by Edinburgh University to address “the challenge of managing research data in relation to records management and archives”. This was especially interesting to me having recently spoken about this subject at a Digital Futures conference in Cambridge (you can see my slides here).

Building blocks

The venue was the beautiful Playfair Library Hall, begun in Neo-Classical style in 1789 and finally completed by William Playfair and put into use as the University library in the 1820s. It took quite a long time to finish building the library that the university wanted, which has made me feel a bit better about the progress I’ve been making in digital preservation here in Lancaster! The Playfair Library now serves as a fantastic venue for a range of events such as this workshop where we were drawn together from a range of different disciplines to talk about research data and how to build for the long term. With so many reminders of the influence of the past around us it was good to focus on how we are going to continue to preserve and maintain academic endeavour.

 

Playfair Library (Image: Rachel MacGregor, 2016)

We were archivists, librarians, data managers and others from a wide variety of institutions and situations brought together with a common purpose and to compare and share approaches and experiences. Digital preservation is a slow and iterative process which needs a range of tools, processes and skills bolted together to work towards the long term goal. Every situation needs a slightly different approach according to the needs and resources available but we can all learn from each other and contribute towards making progress.

To keep or not to keep

The morning session focussed on a variety of presentations from information professionals and also a couple of case studies. It was refreshing to hear from a real life researcher talking about the importance of the re-use of data, in this case Professor Ian Deary whose research was based on a large scale dataset from a population study of the 1920s and 30s. This data was in paper format of course but became the basis for invaluable research into the effect of aging on the brain. Deary made the valuable point that the research he has undertaken was only possible because the dataset had not been sampled – data from the entire cohort had been kept. This sat a little awkwardly alongside the earlier call from our introductory speaker Kevin Ashley (Director of the Digital Curation Centre) who exhorted us to get “better at managing and better at throwing things away”. In fact this is not a digital vs non-digital issue – the tension of managing data with finite space and resources has always been there and appraisal techniques have been developed to help with this problem and work towards a solution.

The records continuum

The need to be involved in all stages of the lifespan of data was highlighted by a number of speakers, including Rachel Hosker of the University of Edinburgh who called for greater communication and collaboration with data creators and depositors. I think most of us would agree that this was the best approach, but how practical or sustainable it is, particularly when dealing with a deluge of research data from a multiplicity of sources I am not sure. What I do think is that we should be seeing data as part of the records continuum model – one which has been around a long time but which in the UK at least has not always had the prominence it should. In research data terms the model is almost always that of a life cycle and a move towards seeing it as a continuum would leave those managing and preserving the data in a much stronger position to plan for and develop strategies to ensure both long-term survival and access to data (or archives or records or whatever you like to think of the “stuff” as.  I think there’s another blog post in there).

Identifying what we have

The afternoon brought us together in small groups for discussion of some of the key problems – and solutions – as we saw them of managing research data. My group – which was a mixture of archivists, research data managers and software developers – spent time discussing the issue of obscure file formats and scientific research data. There is the initial problem of identifying the file formats and then the further problem of sustaining the software which supports the data. There are plenty of tools available for file format identification but most rely on the PRONOM file registry, which is invaluable but inevitably limited when working with research data file formats. PRONOM supports the work of the UK’s National Archives and whilst it has become the de facto international file format registry standard, its principal raison d’être is to support UK government departmental record keeping practices.  As a community supporting digital preservation we should be seeking ways to enhance and contribute towards file format id-ing which will enable work above and beyond this. The team at the University of York Borthwick Institute have made great strides in developing and supporting this initiative but it is high time a much greater number of us took part in this work. Here at Lancaster University we have over 70 datasets (and counting!) which we are working to preserve and make available for the long term.  A number of these are in file formats which we have little or no information about. One of my action points arising from the workshop is to work on file format identification and documentation – if anyone has any good suggestions of how to start work on this I would be very interested to hear from them!
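To make the idea of file format identification a little more concrete, here is a minimal sketch in Python of the general principle of matching the first few bytes of a file against known signatures (“magic numbers”). It is an assumption-laden illustration rather than how any particular tool works: the signature table below is a tiny hand-picked list of well-known signatures, not taken from PRONOM, and the names `SIGNATURES`, `identify_bytes` and `identify_file` are invented for the example. Registry-based tools use far richer signatures than a simple prefix match.

```python
# A tiny, hand-picked table of (label, leading bytes). Illustrative only.
SIGNATURES = [
    ("PDF document", b"%PDF-"),
    ("PNG image", b"\x89PNG\r\n\x1a\n"),
    ("GIF image", b"GIF87a"),
    ("GIF image", b"GIF89a"),
    ("ZIP container", b"PK\x03\x04"),
    ("gzip stream", b"\x1f\x8b"),
]


def identify_bytes(data):
    """Return a label for the first signature that matches the start of `data`."""
    for label, magic in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unidentified"


def identify_file(path):
    """Identify a file on disk from its first few bytes."""
    with open(path, "rb") as handle:
        return identify_bytes(handle.read(16))


# Invented sample data standing in for real files.
for sample in (b"%PDF-1.4 ...", b"PK\x03\x04 ...", b"instrument output v2"):
    print(identify_bytes(sample))
```

The last sample is the interesting case: output from a piece of scientific software with no registered signature simply comes back as “unidentified”, which is exactly the gap that documenting research data formats would help to fill.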

Sustainability and good practice

We were equally concerned with the long term sustainability of software. I anticipate both migration and emulation to play a role in our digital preservation strategies but having robust software development in the first place is a good starting point. The Software Sustainability Institute does a great deal of unsung work to improve the quality of software development and again we should all be engaged actively in promoting good practice. There is a great deal of useful information and guidance available on their website.

All in all it was a very thought provoking day and one which raised a lot of questions but for me at least gave me some things to put on my “to do” list. Digital preservation is an iterative process and it’s time to bolt another piece onto the digital preservation structure.

 

The preservation jigsaw puzzle

University of York Central Hall. Philip Pankhurst, via Wikimedia Commons

I had a great day meeting with Jen Mitcham, who blogs here on her work at the Borthwick Institute and also Laura from University of Sheffield Library to talk about and share experiences of digital preservation.  The needs and set up in our various institutions are different but we share many areas of concern, such as the need for advocacy and the challenges of integrating traditional archival theory and practice with the management of digital data.

Advocacy

We talked a bit about terminology and about how confusing this can get.  “File” and “archive” mean very different things to different people.  It might not just be a question of avoiding confusing terminology but also of learning a new language to discuss familiar territory.  For archiving read sustainability, for curate read preserve, for file read item and so on.

Old vs new

There are many areas where the management of traditional archives and born digital overlap.  All those involved in preservation – digital or otherwise – need to tackle some basic questions. What have we got? What should we be collecting and preserving? What are we trying to achieve? It is as important for the traditional archivist to have a sense of what to collect, why collect and who is the audience as it is for the digital archivist.

This could be formalised in a policy document or plan but need not be; there still needs to be a clear sense of direction and structure for the work being undertaken.

The need in the first instance is to have intellectual control of the collections – whatever they are.  The first question any archivist should ask themselves is “what have I got?” This can be a particularly difficult question to answer if it is obscured by the physical medium of the items themselves – whether they are documents written in a language or hand which is hard to read (it could be Latin, secretary hand or an early version of WordPerfect) or in a format which is inaccessible (water damaged document, reel-to-reel tape, 5 1/4″ floppy disk).  But even though it might be a technically difficult question to answer it’s one which most people can begin to tackle – even if the answer (description) of the data is “medieval document” or “word processed document” or even “research data in a study of termites”. Before making any further progress we have to know what the scale of the problem is.

Pieces in the preservation puzzle

Here at Lancaster University we are focussing on the management and preservation of research data – that is the raw data from scholarly outputs that the university staff and students produce.  This data is undeniably valuable and useful but only if it can be accessed and reused for the long term and be trusted, just like any other evidence (or archival document).  We are considering ways in which we can map research data outputs* which I think will have really big benefits for preservation planning and go some way towards tackling the questions about what we have, what might we be receiving, how long do we need to keep it for.  Because of the huge volume of data we are talking about it makes sense to try and automate as much of the process as possible – something my colleagues at York have been giving some thought to.

And while some of this might seem a little removed from those grappling with legacy files on outdated systems, managing emails, corporate archives or whatever else, they are all pieces in the preservation jigsaw puzzle.

What we should be collecting is a decision for each individual repository but the important thing is that this is clearly (although not necessarily rigidly) defined and that stakeholders (depositors, users, researchers) are consulted and involved.

Crossing the digital divide

What we are trying to achieve is the long-term preservation of archives/data and making them available and discoverable. This is something which presents a big challenge but one for which solutions are continually being developed.  There is not going to be one software tool which will adequately do all of this, but then there isn’t one single solution for arranging, describing, indexing, storing, labelling and retrieving traditional archives, and there are likely to be a range of solutions which are suitable for one, the other or both.  We are currently inhabiting a world of hybrid digital and non-digital archives and we need to be thinking about solutions which cross this perceived divide. These might be old or new but we need to bring together preservation and digital preservation to look at how we manage archives whatever their format.

There’s more about what we mean when we talk about archives here from Kate Theimer.

I have also been following the various Archive conferences in the US, Australia and Ireland.  A nice blog post here from SAA2015 and all the ARA UK and Ireland Conference tweets can be found on Twitter at #ARA2015 including (I think for the first time) a digital preservation strand.  There was a lot in all of these conferences on the subject of advocacy.

*if you are interested in Research Data Management you might want to read more about our JISC funded project here

United we stand 

I attended a Digital Preservation Coalition training event recently in Liverpool called “Making Progress with Digital Preservation”. This came at a good time for me after having been in post as Digital Archivist at Lancaster University for a couple of months and finding myself trying to do just that.  It was a great opportunity to meet some of my fellow professionals in the region and also to meet a wide range of practitioners from different disciplines who had come together to try and get their heads around some of the challenges faced by the emerging and changing discipline of Digital Preservation.

One of the big themes of the day – and something I’ve been giving quite a bit of thought to recently – is the need for advocacy. As William Kilbride, chief executive of the DPC, said, “a huge part of digital preservation is relentless advocacy” and certainly the relentless nature of it can seem daunting. I often think that very few people really grasp what it is I am trying to achieve in my job – it can be quite hard to explain – and without having the record creators on board the task of preserving is impossible. Digital preservation does not take place in isolation – it is a combination of tasks undertaken by a wide range of people taking on the challenges posed by the technologies, information, curation, selection and so on and so on.

As was discussed at the event, digital preservation is an activity undertaken by people from many different disciplines each of whom brings a different angle or perspective to many of the issues which are being grappled with. This includes librarians, records managers, archivists, data managers, IT systems people, researchers… the list is endless.  It’s a collaborative effort and one which, if it is to succeed, needs to be taken up and be taken seriously by anyone who is engaged in data creation.  And by that of course I mean everybody.

Funding models for projects mean that there are a multiplicity of time-limited projects, the results of which are scattered and difficult to navigate even for someone who knows a little about the subject. On the plus side there are lots of people who are keen to share their knowledge, experience and expertise, and only by strong collaborative working will we really achieve results.

I’m preparing to introduce my colleagues to the principles of Digital Preservation because that advocacy work starts at home, and I can’t save the world’s digital data on my own.

This week I’ve been reading this article by Anthony Cocciolo, Professor of Information and Library Science at the Pratt Institute, New York, which looks at the archivist in a data manager’s world.  I’ve also looked at this article from the International Journal of Digital Curation on how we should be taking a holistic approach to data curation.