Piece by piece

The Old College, Edinburgh (Image by Kim Traynor – Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=18939432)

Breaking the boundaries

Recently I had the chance to travel up to Edinburgh University to take part in an event called Research Data, Records and Archives: Breaking the Boundaries, which the university organised to address “the challenge of managing research data in relation to records management and archives”. This was especially interesting to me as I had recently spoken on this subject at a Digital Futures conference in Cambridge (you can see my slides here).

Building blocks

The venue was the beautiful Playfair Library Hall, begun in Neo-Classical style in 1789, finally completed by William Playfair and put into use as the University library in the 1820s. The library the university wanted took quite a long time to build, which has made me feel a bit better about the progress I’ve been making in digital preservation here in Lancaster! The Playfair Library now serves as a fantastic venue for a range of events such as this workshop, where we were drawn together from a range of different disciplines to talk about research data and how to build for the long term. With so many reminders of the influence of the past around us, it was good to focus on how we are going to continue to preserve and maintain academic endeavour.


Playfair Library (Image: Rachel MacGregor, 2016)

We were archivists, librarians, data managers and others from a wide variety of institutions and situations, brought together with a common purpose: to compare and share approaches and experiences. Digital preservation is a slow and iterative process which needs a range of tools, processes and skills bolted together to work towards the long-term goal. Every situation needs a slightly different approach according to the needs and resources available, but we can all learn from each other and contribute towards making progress.

To keep or not to keep

The morning session consisted of presentations from information professionals and a couple of case studies. It was refreshing to hear from a real-life researcher talking about the importance of the re-use of data, in this case Professor Ian Deary, whose research was based on a large-scale dataset from a population study of the 1920s and 30s. This data was in paper format of course, but became the basis for invaluable research into the effect of ageing on the brain. Deary made the valuable point that the research he has undertaken was only possible because the dataset had not been sampled – data from the entire cohort had been kept. This sat a little awkwardly alongside the earlier call from our introductory speaker Kevin Ashley (Director of the Digital Curation Centre), who exhorted us to get “better at managing and better at throwing things away”. In fact this is not a digital vs non-digital issue – the tension of managing data with finite space and resources has always been there, and appraisal techniques have been developed to help address it.

The records continuum

The need to be involved in all stages of the lifespan of data was highlighted by a number of speakers, including Rachel Hosker of the University of Edinburgh, who called for greater communication and collaboration with data creators and depositors. I think most of us would agree that this is the best approach, but I am not sure how practical or sustainable it is, particularly when dealing with a deluge of research data from a multiplicity of sources. What I do think is that we should be seeing data as part of the records continuum model – one which has been around a long time but which, in the UK at least, has not always had the prominence it should. In research data terms the model is almost always that of a life cycle, and a move towards seeing it as a continuum would leave those managing and preserving the data in a much stronger position to plan for and develop strategies to ensure both long-term survival of and access to data (or archives or records or whatever you like to think of the “stuff” as; I think there’s another blog post in there).

Identifying what we have

The afternoon brought us together in small groups to discuss some of the key problems – and solutions – as we saw them of managing research data. My group – which was a mixture of archivists, research data managers and software developers – spent time discussing the issue of obscure file formats and scientific research data. There is the initial problem of identifying the file formats and then the further problem of sustaining the software which supports the data. There are plenty of tools available for file format identification but most rely on the PRONOM file registry, which is invaluable but inevitably limited when working with research data file formats. PRONOM supports the work of the UK’s National Archives and whilst it has become the de facto international file format registry standard, its principal raison d’être is to support UK government departmental record keeping practices. As a community supporting digital preservation we should be seeking ways to enhance and contribute towards file format identification which will enable work above and beyond this. The team at the University of York Borthwick Institute have made great strides in developing and supporting this initiative but it is high time a much greater number of us took part in this work. Here at Lancaster University we have over 70 datasets (and counting!) which we are working to preserve and make available for the long term. A number of these are in file formats which we have little or no information about. One of my action points arising from the workshop is to work on file format identification and documentation – if anyone has any good suggestions of how to start work on this I would be very interested to hear from them!
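
For anyone unfamiliar with what signature-based identification involves, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the signature table is a handful of well-known "magic number" prefixes picked by hand rather than PRONOM data, and the function names (identify, survey) are hypothetical rather than taken from DROID or any other tool.

```python
from collections import Counter
from pathlib import Path

# Illustrative only: a few widely documented leading byte sequences.
# A real registry such as PRONOM holds far richer signatures.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify(header: bytes) -> str:
    """Return a format name if the leading bytes match a known signature."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unidentified"


def survey(folder: str) -> Counter:
    """Tally (format, extension) pairs for every file under a folder."""
    tally = Counter()
    for path in Path(folder).rglob("*"):
        if path.is_file():
            with path.open("rb") as handle:
                header = handle.read(16)  # enough for the prefixes above
            tally[(identify(header), path.suffix.lower())] += 1
    return tally
```

Running something like survey over a dataset folder and looking at the "unidentified" entries, grouped by extension, would give a rough list of the formats that still need researching and documenting.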

Sustainability and good practice

We were equally concerned with the long-term sustainability of software. I expect both migration and emulation to play a role in our digital preservation strategies, but having robust software development in the first place is a good starting point. The Software Sustainability Institute does a great deal of unsung work to improve the quality of software development and again we should all be actively engaged in promoting good practice. There is a great deal of useful information and guidance available on their website.

All in all it was a very thought-provoking day, one which raised a lot of questions but which, for me at least, also produced some things to put on my “to do” list. Digital preservation is an iterative process and it’s time to bolt another piece onto the digital preservation structure.




6 thoughts on “Piece by piece”

  1. Johan van der Knijff

    For identifying research formats you might also want to have a look at Apache Tika, which covers a wider range of formats than the PRONOM-based tools (although results are often less specific). Andy Jackson’s Format Registry aggregator is also useful here:


    (Although this hasn’t been updated for over a year)


      1. Johan van der Knijff

        Which reminds me … only a week back Tim Allison (who’s one of the developers working on Tika) ran a DROID vs Tika vs Unix File comparison on the Tika regression corpus. Results are available here:

        Also, the latest version of Richard Lehane’s Siegfried tool now combines format signatures from PRONOM and freedesktop.org; details here:



  2. Hi Rachel,

    Following from David on Twitter, two additional links outside of the TNA blog that I wrote are:



    Some good links to the other bits and pieces David spoke about on there. I also wrote this a while back, which provides a good methodology to shore up any identification work you do: http://exponentialdecay.co.uk/blog/published-the-skeleton-test-corpus/

    I hope that helps!



