Software Matters

Me thinking about software preservation

I’m going to let you into a secret. Well, it’s not really a secret, but whilst I have been gamely taking on (or trying to take on) the challenges of various technical aspects of digital preservation, my approach to software preservation has been decidedly that of an ostrich. I have been firmly sticking my head in the sand over this! It’s partly because I really don’t feel I know enough about software writing in the first place, and partly because someone else is doing something about it, aren’t they…?

However, databases, websites, video and so on are complex digital assets which I am only too happy* to tackle, but somehow software seems a step further.

However, a couple of things made me rethink my position.

The first was a recent talk from Neil Chue Hong of the Software Sustainability Institute at one of our Lancaster Data Conversations. He discouraged me and encouraged me in equal measure. Discouraged me because, when addressing a room full of code writers, he asked them to consider how they might access their code in three months’ time. Three months! What about three years, or even three decades? If people are not even considering a lifetime beyond three months, I’m starting to wonder if it’s worth getting involved at all.

On a more positive note, however, Neil was keen to promote good practice in software writing and management, and recognised the barriers to maintaining and sharing code. As with most preservation work, the key is getting creators to adopt good practice early in the process – the upstream approach I’ve alluded to before, which has been around a very long time and is indeed what makes digital preservation a human project. But in order to support good practice and build the right processes for managing software, what we need first is a better understanding.

Can we build code to last?

The second thing was the realisation that I was already responsible for code in our data repository – for example, this dataset here, which supports a recent Lancaster PhD thesis.

We don’t have a huge uptake from our PhD students for depositing data in the repository, and we are especially keen to encourage this because we want our researchers to get into good habits early. Neil explained in his session that there are particular barriers to certain researchers sharing data – early career researchers amongst them – as there is a fear of sharing “bad” code. But as he pointed out, everyone writes bad code, and part of the advocacy around sharing is getting over the fear of “being found out”. From my perspective, if people are willing (or even brave enough) to share their code, I want to make sure that, as someone charged with digital preservation, I can try to create the optimal environment for software preservation into the long term.

At the moment I think we have some way to go on this, but thankfully help is at hand with the Jisc/SSI Software Deposit and Preservation Policy and Planning Workshop. This “first workshop will present the results of work done by the SSI to examine the current workflows used to preserve software as a research object as part of research data management procedures.”

Sounds good to me.

What do I hope to get out of this?


  • I’m really interested in getting the right metadata to support the preservation of software
  • I’m keen to hear what other people are doing!
  • I want to know where I can best direct my efforts and the areas I need to concentrate on first to get up to speed with providing excellent support
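On the metadata point, one concrete approach is a machine-readable description deposited alongside the code itself. The sketch below builds a minimal record loosely modelled on the CodeMeta vocabulary; the package name, version and requirements are hypothetical examples, not a real deposit.

```python
import json

# A minimal, illustrative software-preservation metadata record, loosely
# following CodeMeta property names. All values below are made-up examples.
record = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "example-analysis-scripts",          # hypothetical deposit
    "version": "1.0.0",
    "programmingLanguage": "R",
    "license": "https://spdx.org/licenses/MIT",
    "description": "Scripts supporting a PhD thesis dataset.",
    "softwareRequirements": ["R >= 3.3"],        # what the code needs to run
}

# Serialise for deposit alongside the software itself.
print(json.dumps(record, indent=2))
```

Even a record this small answers the questions a future curator will ask first: what is it, what language is it in, what does it need, and may I reuse it?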

I’ll be reporting back soon on what my next steps are…

*or is that grimly determined?


Piece by piece

The Old College, Edinburgh (Image by Kim Traynor – Own work, CC BY-SA 3.0)

Breaking the boundaries

Recently I had the chance to take a trip up to Edinburgh University to take part in an event called Research Data, Records and Archives: Breaking the Boundaries, organised to address “the challenge of managing research data in relation to records management and archives”. This was especially interesting to me, having recently spoken about this subject at a Digital Futures conference in Cambridge (you can see my slides here).

Building blocks

The venue was the beautiful Playfair Library Hall, begun in Neo-Classical style in 1789, finally completed by William Playfair and put into use as the University library in the 1820s. It took quite a long time to finish building the library the university wanted, which has made me feel a bit better about the progress I’ve been making in digital preservation here in Lancaster! The Playfair Library now serves as a fantastic venue for events such as this workshop, where we were drawn together from a range of different disciplines to talk about research data and how to build for the long term. With so many reminders of the influence of the past around us, it was good to focus on how we are going to continue to preserve and maintain academic endeavour.


Playfair Library (Image: Rachel MacGregor, 2016)

We were archivists, librarians, data managers and others, from a wide variety of institutions and situations, brought together with a common purpose: to compare and share approaches and experiences. Digital preservation is a slow and iterative process which needs a range of tools, processes and skills bolted together to work towards the long-term goal. Every situation needs a slightly different approach according to the needs and resources available, but we can all learn from each other and contribute towards making progress.

To keep or not to keep

The morning session focussed on a variety of presentations from information professionals, plus a couple of case studies. It was refreshing to hear a real-life researcher talk about the importance of the re-use of data – in this case Professor Ian Deary, whose research was based on a large-scale dataset from a population study of the 1920s and 30s. This data was in paper format, of course, but became the basis for invaluable research into the effect of ageing on the brain. Deary made the valuable point that the research he has undertaken was only possible because the dataset had not been sampled – data from the entire cohort had been kept. This sat a little awkwardly alongside the earlier call from our introductory speaker Kevin Ashley (Director of the Digital Curation Centre), who exhorted us to get “better at managing and better at throwing things away”. In fact this is not a digital vs non-digital issue – the tension of managing data with finite space and resources has always been there, and appraisal techniques have been developed to help with this problem and work towards a solution.

The records continuum

The need to be involved in all stages of the lifespan of data was highlighted by a number of speakers, including Rachel Hosker of the University of Edinburgh, who called for greater communication and collaboration with data creators and depositors. I think most of us would agree that this is the best approach, but how practical or sustainable it is, particularly when dealing with a deluge of research data from a multiplicity of sources, I am not sure. What I do think is that we should be seeing data as part of the records continuum model – one which has been around a long time but which, in the UK at least, has not always had the prominence it should. In research data terms the model is almost always that of a life cycle, and a move towards seeing it as a continuum would leave those managing and preserving the data in a much stronger position to plan for and develop strategies to ensure both long-term survival of and access to data (or archives, or records, or whatever you like to think of the “stuff” as – I think there’s another blog post in there).

Identifying what we have

The afternoon brought us together in small groups for discussion of some of the key problems – and solutions – as we saw them of managing research data. My group – a mixture of archivists, research data managers and software developers – spent time discussing the issue of obscure file formats and scientific research data. There is the initial problem of identifying the file formats, and then the further problem of sustaining the software which supports the data. There are plenty of tools available for file format identification, but most rely on the PRONOM file format registry – invaluable, but inevitably limited when working with research data file formats. PRONOM supports the work of the UK’s National Archives, and whilst it has become the de facto international file format registry standard, its principal raison d’être is to support UK government departmental record-keeping practices. As a community supporting digital preservation we should be seeking ways to enhance and contribute towards file format identification which will enable work above and beyond this. The team at the University of York’s Borthwick Institute have made great strides in developing and supporting this initiative, but it is high time a much greater number of us took part in this work. Here at Lancaster University we have over 70 datasets (and counting!) which we are working to preserve and make available for the long term. A number of these are in file formats about which we have little or no information. One of my action points arising from the workshop is to work on file format identification and documentation – if anyone has any good suggestions of how to start work on this I would be very interested to hear from them!

Sustainability and good practice

We were equally concerned with the long-term sustainability of software. I anticipate that both migration and emulation will play a role in our digital preservation strategies, but having robust software development in the first place is a good starting point. The Software Sustainability Institute does a great deal of unsung work to improve the quality of software development, and again we should all be actively engaged in promoting good practice. There is a great deal of useful information and guidance available on their website.

All in all it was a very thought-provoking day, one which raised a lot of questions but which, for me at least, put some things on my “to do” list. Digital preservation is an iterative process, and it’s time to bolt another piece onto the digital preservation structure.