Archives Access and AI

I was really looking forward to this conference organised by Dr Lise Jaillant of Loughborough University and it did not disappoint. I took what I think must be a record fifteen pages of notes (yes, I do use a real notebook), not to mention the countless tweets (see #AcArAi), so it will be impossible for me to do much more than summarise some of the highlights (for me at least) of this conference.

The conference brought together digital humanities scholars, archivists, digital preservation practitioners and others to discuss and share ideas about making archives accessible – either with or without the aid of machine learning/AI.

The conference was near Hackney Wick – an interesting part of London

Helen Mavin from the Imperial War Museum gave a fascinating insight into the very complex work being undertaken at the museum to manage and preserve over a million digital files deposited by the Ministry of Defence. This is material specifically covered by UK government legislation (the Public Records Act) so is not everything which the Museum collects or preserves (this was important when discussing their appraisal and retention strategies). Collections management and transfer procedures were out of date and key contacts were unknown or unclear. Mavin needed to establish robust criteria for retention and selection of digital materials to facilitate fast(er) and more efficient transfer. Once the material was transferred there needed to be standardised workflows which are media agnostic. Key challenges were around staff turnover and both a skills and resource gap – something which will be familiar to many.

I was really interested to hear Jonathan Manton and Alice Prael from Yale University Libraries talk about their work on trying to centralise and standardise the workflows at their institution, which comprises a number of libraries and a museum. They began by centralising the processing of imaging and file extraction from physical media, which has helped standardise the process and assisted with tackling processing backlogs. They have also looked at email archiving, born digital catalogue description and network transfer. I was particularly keen to hear about the work they have been doing on describing born digital archives. This is something which really needs more discussion and action from practitioners – I’m finding it presents a very real headache for my own practice. Prael and Manton commented that the area which caused most difficulty was describing hybrid archives. I wasn’t surprised by this, and given this is the landscape we are going to be working in for some time to come it’s one we should all be giving a lot of thought to.

The cataloguing of archives is closely linked to (but not coterminous with) access, so I was also keen to hear Anthea Seles talk about the use of Machine Learning and Artificial Intelligence in archival processing and the extent to which archivists and information professionals should be (but are not necessarily) involved in the creation and use of algorithms and the ways in which data is explored, exposed and exploited. Seles’ talk gave a great deal of food for thought and my homework will be to read Cathy O’Neil’s Weapons of Math Destruction.

I also very much enjoyed hearing from Leontien Talboom, who is undertaking a PhD jointly with the UK National Archives and University College London looking at the barriers and opportunities which exist in making digital archives available. So far she has uncovered what a huge amount of work still needs to be done in this area, but I’m looking forward to the work coming out in full so we are better informed to overcome the barriers and exploit the opportunities. In the same panel another PhD student, Rebecca Oliva from the University of Glasgow, is looking at sensitivity reviewing and the extent to which it can be automated. Manual sensitivity reviewing is on the whole an opaque process, therefore lending itself well to automation and machine processing. However, sensitivity is also very context specific, i.e. data can be sensitive in one context and not in another, so there are many challenges to be met. Again I am really looking forward to the outputs of this work.

On the final day it was good to hear from Jenny Bunn, who invited us to ask “What can archivists bring to the (AI) party?”. The answer is (hopefully) quite a lot. AI has been around for a while now and it is starting to grow up. More people are asking for accountability in AI and this is where the archivist comes in – we are really, really good at documenting and organising things!

We had an entertaining presentation from Caylin Smith and Andy Irving from the British Library on their struggles with making non-published legal deposit material from the British Library available. They are extremely hampered by legislation which has not yet caught up with the technology (a point which Paul Gooding from the University of Glasgow also addressed in his presentation) but have made great strides to improve what they can, certainly in terms of the user experience of accessing material on site. We also heard a fantastic call to arms from William Kilbride of the Digital Preservation Coalition, which he has helpfully published as a blog post so I don’t have to report on it!

Many of the conference presentations are available here and the abstracts here. I thought this was a fascinating and thought provoking conference and I am still thinking about many of the themes which came up in it and hope to draw on them in my practice. Many thanks to Dr Lise Jaillant for all her hard work in putting this together.

Conference cake which like everything else was excellent!

Archivematica UK User Group meeting, University of Westminster

We met in the heart of London thanks to University of Westminster

It was an early 2020 start for the Archivematica UK User group but thankfully the weather was very kind for those of us travelling to London to our hosts, the University of Westminster, who kindly provided accommodation and refreshments for us. These meetings need caffeine and sugar!

The meetings are a great chance for users to get together and share their successes and their woes which tend to come in equal measure for anyone practising digital preservation (we are all agreed we’re all still learning!).

First to take the floor was Matthew Addis from Arkivum, who talked about the Preservation Action Registry, which is a bold but incredibly useful project to try and capture and share technical best practice for preservation actions in a human and machine executable way. Both Archivematica’s Format Policy Registry (of which more in a moment) and Preservica’s Linked Data Registry define rules for preservation actions but they are not interoperable. If they were, the benefits would be great for users of both systems and of any systems going forward. It was just the first in a series of presentations which touched on the theme of interoperability, which chimes well with the Open Source nature of Archivematica. After all, we’re all trying to work on a common task of preservation of a dizzying variety of digital file formats in a wide variety of contexts, so the more working together we can do the better!

Interoperability: what we all want.

The open part of the agenda allowed us to give some thought to PDFs. Who loves PDFs? Not many of us it seems. You thought a PDF was a PDF (or maybe a PDF/A) and that was it? You’d be wrong. There is a wide variety of types of PDF and then there are PDFs which conform to the standard they are supposed to and those which don’t. And then there are those that will render correctly and those which won’t… Archivematica can and does tell you the type and conformity of the PDF but what you are going to do with this information and whether the PDF will render as it should are separate questions. We had some good debates on this and we referenced Paul Wheatley’s thought provoking (or is that provocative?) blog on the subject of validation, something we should all be giving a lot more thought to.

PDFs – ugh

I then took gross advantage of the fact that I was chairing the meeting by chipping in with my tuppence worth on normalization, sparked off by Evelyn McLellan’s recent questions to the user community more broadly about normalization options in Archivematica. I was interested to know whether people had been investigating Archivematica’s Format Policy Registry – a question which had been posed at a previous user group meeting a couple of years ago in Aberystwyth. There definitely seemed to be more confidence about having a go at tinkering with it and overall a questioning about whether the “normalize everything approach” was the right one. I am keen to get more people thinking proactively about their normalization choices – when, where and whether to normalize and agree with Tim Walsh when they wrote about it in their response to Evelyn – that having a view is a responsibility of the institution because of the resource implications of large uncompressed normalized formats but also that each institution is very context specific.

The final session before lunch was an introduction to work by the Wellcome Collection, who are getting to grips with Archivematica and tailoring it to meet the needs of their fairly sophisticated workflows. I suspect I wasn’t the only one who was envious of their plans for full scale automation in many of their workflows and their contribution to the user community is going to be invaluable – all shared via GitHub. I’m excited about what the future holds and will be watching their work with keen interest.

After lunch Jen Mitcham from the Digital Preservation Coalition presented on their new project developing a guide to procurement for digital preservation systems and services. They are looking for ways to help both product users and vendors get the best from the procurement process and there was some valuable discussion about the dos and don’ts. We all recognised the “manage expectations” advice because procuring a system does not mean that you have solved digital preservation – I think everyone in the room has found this out!

Systems procurement – we can all learn from each other.

Next up was Matthew Neely from Bodleian Libraries, Oxford talking about their work on Archivematica. Some of us had heard about this work at last year’s Archivematica Camp in London and via their blog but it was good to hear about where they are and what their future development plans are. The Bodleian have particular challenges around the size and scale of both their holdings and their staffing – user management is a key concern for them in the way it isn’t for smaller organisations. They are also doing some interesting work on reporting functionality as an add-on to Archivematica. This isn’t something which Archivematica does (as it isn’t part of its core functionality) but it’s always interesting to hear about this kind of work – one of the great benefits to Open Source software is that it can encourage and support a whole range of additional parallel services to suit individual needs.

The final talk of the day was from Stephen Mackey of Penwern Ltd, a consultancy firm that are involved in a number of digital preservation projects including work on the Connecting Europe Facility (CEF) eArchiving project (E-ARK), which has the bold and ambitious aim to introduce compliance and interoperability into the way that institutions work, looking at creating AIPs which can be read and shared by any system. Stephen shared the proposed data model with us and there was some interesting discussion around how disparate systems might achieve conformance in different ways.

Europe-wide interoperability and eArchiving

Finally, no Archivematica User Group meeting would be complete without input from Artefactual and we were really pleased to be able to welcome Sarah Mason, Systems Archivist at Artefactual, who is based in the UK, which meant she was able to attend in person. She gave us the latest news on Artefactual developments including information about changes to normalization paths for video files (part of the work I had referred to earlier), some improvements to user acceptance testing which should make for smoother releases for updates, and the integration of new PRONOM updates.

All in all it was a very successful meeting although next time I think we should factor in more general discussion time. I certainly stayed on an extra half an hour or so (thank you University of Westminster!) exchanging experiences in an informal environment which is in part what these meetings are all about. I’m looking forward to the next one already!

iPres 2019, Amsterdam

Lovely Amsterdam

iPres is the International Digital Preservation conference which I am lucky enough to have attended twice now (the first was three years ago in Bern which I wrote about here).

It’s a massive conference which is far too big to get round all of, and thankfully (for those not able to clone themselves, or alternatively for those not able to come) it includes online papers and also online collaborative note taking, which means it is possible to catch up on some things at a distance or even later.

So I just thought I’d highlight some of the sessions which I enjoyed – there were many I missed – and what my main take aways were.

Arriving at the venue by ferry was a big hit with everyone – like the United Kingdom, the Netherlands is a maritime nation and it’s great to remember that it’s not just about canals. The venue at the EYE museum was stunning on the waterfront but it did involve a lot more climbing than I had bargained for, particularly in a flat country like the Netherlands…

Eye Museum of Film, Amsterdam

The kick off session was a workshop on the Preservation Action Registry – this was really useful for me as it helped me understand more about how I might document my actions better. Documenting what we actually do and also what we don’t do and indeed what we used to do but don’t any more. It might mean capturing the “people and process” part of preservation in a machine readable way. I got much more insight into how I might analyse all the processes which go into preservation work to create far better and more useful documentation strategies.

Then it was straight off to the Digital Preservation Coalition’s DP Anonymous where we are invited to share stories about digital preservation challenges (we don’t say failures) and it was gripping stuff (although as it’s Chatham House Rules I can say no more than that). I presented on my recent struggles with setting up a virus checking workflow and it was great to share because I had some practical helpful suggestions of what to do next.

The conference proper opened on Tuesday with a keynote from Geert Lovink, writer and activist. There was a lot to like about this and I think I was not the only person to respond positively to his call to value our networks – particularly those which take place in the same physical space. Lovink is a very persuasive speaker but at the end – partly because I come from a different political and cultural angle – I found I didn’t agree with him on all counts and instead was inclined to agree with comments from the floor from Leslie Johnston of NARA describing successful networks supported in spite of or indeed because of existing corporate structures.

Delicious conference catering

As I have outlined in a previous post I am really interested in skills development in the archives sector so it was a really good opportunity to hear about a number of projects from around the world looking at this. And at an international conference there is the opportunity not just to hear about projects but also meet the people driving them and benefit from their experience so I was particularly thrilled to meet Angela Beking from Library and Archives Canada and Jaye Weatherburn from the University of Melbourne who are variously spearheading initiatives to help fellow professionals develop their skills. Beking presented on her work developing a collaborative model for knowledge transfer aka “digital detention” which got great feedback from the staff who were undertaking it. Weatherburn meanwhile has been instrumental in leading Australasia Preserves which is aiming to support the growth in a community of practice across a large geographic region. All of this has given me a huge amount of food for thought and I hope to be able to build on this community development work in the future.

A strong theme of the conference

The keynote on Day Two was especially welcome as I wasn’t at the 2018 Archives and Records Association conference so missed that opportunity to hear from Professor Michelle Caswell of UCLA. I have recently read her piece on Feminist Standpoint Appraisal and it was great to be reminded how all of us – whatever role we play in safeguarding, curating or making archives and/or data available have a role to play in ensuring that this is done with equity and it does not reinforce the hierarchies of oppression. None of us are neutral operators and we and the collections we manage are a part of society and if we want to see a change, we are the ones to enact that change through our practice.

I really enjoyed the poster sessions and it’s such a privilege to be able to talk to people directly about their research. My main criticism was that I couldn’t get round them all but I did enjoy hearing from, amongst others, Merle Friedrich of the German National Library of Science and Technology about their analysis of AV file formats, which complemented the poster from the Open Preservation Foundation on significant properties of spreadsheets, both examples of studies which lead us all to a better understanding of formats.

Me on the big stage

On the Thursday I really enjoyed the lightning talks – despite giving one myself (which is not what you might call enjoyable). The range and breadth of topics covered and calls for contributions was fantastic, from Harvard’s Wolbach Library Project Phaedra, through the TRUST principles being developed for digital repositories, to the file format work happening at NARA, and I think the session included the best conference slide, on distributed storage services.

Digital preservation is all about the unicorn magic!

iPres 2019 was a great conference and I’m just sorry I didn’t have time to see a bit more of Amsterdam. It was a great privilege to attend and particularly exciting to be able to speak at the ad hoc session. A massive thank you to the organisers and all the participants – a conference is made by the community after all. I hope to be able to spend a bit more time looking at the contributions and putting into practice what I learnt from the conference.

Reward for all our hard work

Good enough

Mapping essential skills for the twenty first century archivist – a presentation given at the 2019 Archives and Records Association Conference in Leeds, UK, 28th August 2019.

Here’s a version with slides of the talk I gave in Leeds.

So where did it all start for me?

What I tweeted after leaving the Archives and Records Association Conference 2012

I worked for many years in “traditional” archives and was inspired to make a move into digital preservation partly after the ARA Conference in 2012 (the last one I attended). Over the next few years I did what I could and started to develop a digital asset register as well as work on policy but the usual frustrations hampered further progress:

I got a break working on the digital preservation of research data management at Lancaster University – a related but distinct role – where I learnt a huge amount, and one of the first things I began to do was map the divide between data and archives.

People (especially from Humanities backgrounds!) tend to shy away from the word data – but there’s a lot to learn! I’m now based at the Modern Records Centre at the University of Warwick which has the records of trades unions, pressure groups and some notable individuals including Bill Morris, Eric Hobsbawm and Rodney Bickerstaffe all of which contain digital content. So I have plenty to go on both in terms of our legacy collections and also new material coming in.

It isn’t that progress isn’t being made but it could be faster and perhaps should be as time is not on our side. And if you don’t believe me one of the first collections I have worked on is material from the University’s 50th anniversary which included the following:

From a collection of material relating to the University’s 50th anniversary celebrations

Of these links:

  1. Works and resolves to an internal site (but not under our administrative control). I used Webrecorder to extract the content and create a WARC file for preservation.
  2. Does not resolve but I (eventually) tracked it down in the British Library UK Web Archive – but you have to go there to view it.
  3. Storify ceased to operate in 2018 so the content is now gone.

So the time to act is… now. Or rather the time to act was yesterday.

My uncle’s enduring (for now) presence on Facebook

I don’t have any emails from my father because I changed provider and lost all my emails dating from before 2012, by which time he was dead. It doesn’t matter too much in a way because I have lots of physical photos, writings and other things of his. That seems to be slightly before a tipping point whereby suddenly I have little or nothing physically documenting my interactions and connections with people. My uncle, who passed away this year, was a dedicated user of Facebook, to the point where many of my friends – who never met him – have said that they will miss his comments… I have emails, photographs, Facebook comments – all of which are digital – that somehow seem a little less easy to capture.
And why am I spending time telling you about my uncle and Facebook? Because these are all stories – and this is what it is all too easy to forget – that what seem to be files, spreadsheets, databases are also stories of life, love, loss and hope. I would like everyone in the room to remember that the digital is created by human activity and the stories told are no less.

I wanted to explore a bit more about where we might be in terms of tackling digital archives by talking to a few fellow archive sector workers to get a better sense myself of what might be required to get people to be the “archivist required to deal with digital stuff”. The people I spoke to are representative of nothing but themselves and I’m aware that there is a large scale piece of work sponsored by Jisc and the National Archives which is underway (I should know as I took part in it). However, as someone who does write, advocate and deliver training, I wanted to know what it is people felt they wanted as well as offer up suggestions about how to go forward and go from planning to doing.

I went into Archives because I like old stuff and history – it’s no accident that my Twitter handle is “An Old Hand” – which is a palaeography joke, and my avatar is an image from an early modern document. It was the physicality of the old documents – whatever age which first drew me in, but what kept me was the stories that they tell.

When I asked the people I spoke to about why they had gone into Archives they said more or less the same thing. They all mentioned history as being a reason they got interested in archives but they also mentioned other things as well – they mentioned working with people and they mentioned stories, and these relate as much to digital as to any other type of collection. Perhaps the main problem is that digital material just isn’t that old in archives terms – well, maybe younger people might coo at a floppy disk, but even so (and remember I work at the Modern Records Centre, although I discovered our earliest document is from 1633 – still modern, right?).

The eighties as I remember them (although Thatcher was in colour)

But still our earliest digital material is from the 80s, which at times seems like another world, but I’m guessing you either remember it or you’ve heard a lot about it from your parents. It’s more familiar, and what constitutes these records – word processed documents, emails, photographs – is the stuff of everyday, even now, so it just doesn’t have the glamour or even the appeal of the parchment deed, the daguerreotype or whatever. And all of my conversations indicated that people – regardless of their age – were just not as excited about the archives that weren’t cool old stuff.

The Fire Brigades Union

One of the things I want to do is to get people excited about the cool new stuff. And if it can’t be done through the physicality of the medium (which it hasn’t really got) then it’s through the content – in my case, for example, material from the Fire Brigades Union, whose records we have most recently taken in. It’s not hard to see how we can easily find interesting and engaging content which highlights important stories.

So if we know why we need to do it – the next thing is how. A word that came up again and again was “confidence” or lack thereof although this wasn’t reflected in all the people I spoke to. There is a course, webinar, book, podcast on everything under the sun if you know where to find it. But that’s one of the problems. There’s just too much out there – how do you know which is the “right” tool, the “right” workflow for the job? Here are some basics which I would recommend looking at if you’re looking to make a start:

If you are in a management position (and even if you are not) you need to be advocating for digital preservation so that means an understanding of the consequences of inaction, of the risks and of course to know about the collections which you are or might be dealing with. Digital Preservation Coalition has a great website which includes the basics which is fantastic for your elevator pitches and so on

But the secret is that there is no gold standard. Yes – it is helpful to have tools to work with like Preservica or Rosetta or Archivematica, but ask anyone who has those tools whether the tools are doing the preservation for them. They are not. They are not capturing the files, sorting them, listing or appraising them or even storing them.

If you can start by protecting those bit streams and knowing – as much as you can – what those file types are then you have already made a start. Because if you’ve done that then the next generation of archivists will be able to build on your base and do some of this more painstaking work that you don’t have time, resource or capacity for. We need to agree on an understanding of what the base line is – somewhere around bit stream preservation with “as much metadata as I can get”. To me that looks like capture, context and running the files through DROID. Even this might be too much for people in very locked down situations such as in a Local Authority environment where you might even be limited by this. Could we look at sharing some command line stuff that might at least enable basic tasks? And then share them?
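As an illustration of the kind of small, shareable script that could cover this base line, here is a minimal sketch in Python using only the standard library. It walks a deposit folder and records each file's relative path, size, extension and SHA-256 checksum in a CSV manifest. The function names, the column names and the choice of SHA-256 are assumptions made for this example, not an agreed standard, and the file extension is only a rough hint at the file type: it is no substitute for running the files through DROID.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(deposit_dir: Path) -> list[dict]:
    """List every file under deposit_dir with its size, extension and checksum."""
    rows = []
    for path in sorted(deposit_dir.rglob("*")):
        if path.is_file():
            rows.append({
                "relative_path": path.relative_to(deposit_dir).as_posix(),
                "size_bytes": path.stat().st_size,
                # Extension only: a hint, not a format identification.
                "extension": path.suffix.lower(),
                "sha256": sha256_of(path),
            })
    return rows


def write_manifest(rows: list[dict], manifest_path: Path) -> None:
    """Write the manifest rows to a CSV file with a header line."""
    fields = ["relative_path", "size_bytes", "extension", "sha256"]
    with manifest_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

To use it, something like `write_manifest(build_manifest(Path("deposit")), Path("manifest.csv"))` would do, where both paths are placeholders. Writing the manifest outside the deposit folder keeps it from being listed in a later run, and re-running the script at a later date and comparing the checksums gives a basic fixity check.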

It’s something we can all be doing and it’s something we do together. Many of you here are already doing things – big and small and everyone else wants to hear about it – to give ideas, encouragement and to help find strategies which work in the real world.

Let’s try and find platforms for this so we can share our experiences and learn from each other.

Yes – I am asking you to make your lives difficult, but be brave – you don’t have to do it alone.

Do not attempt this at home

So we all know what it’s like – there is SO MUCH literature out there to read about on the theory of Digital Preservation it can be quite unnerving actually having to do the practical stuff and sometimes there seems to be a bit of a gap between the theory and how you might actually do that thing.

My case in point was quarantine. Every Digital Preservation handbook, how-to, policy and procedure said some variation on “digital files will be kept in a controlled environment for 30 days to protect against viruses”. This makes perfect sense to me and anyone used to dealing with physical archives (I refuse to call them analogue!) will be used to dealing with the concepts of quarantine – nobody likes pests or mold and we like them even less if they get anywhere near our strong rooms and archives stores. So potentially contaminated documents will be put into isolation and treated with the appropriate chemicals and left until we can be sure that the mold or pests are dead. And so it is – sort of – with digital archives: if we leave them isolated for a period of time (30 days allows for emerging threats to have been identified by updated virus checkers), hopefully the nasties will be mopped up (or wiped out) by our more up-to-date anti virus software.

Picture of a mouse
threats to the archives need removing before processing takes place

I also know that – partly for this reason – it’s important to have an isolated workstation and there’s some great advice out there for setting something up which needn’t be enormously costly or complicated (see this blog post by Porter Ohlsen). But until I sat there with a USB stick in one hand and a write blocker in the other I hadn’t considered how this would actually work.

WARNING – do not attempt this at home! I attached the write-protected USB drive to the workstation for initial examination – establishing checksums and a quick glance at the content using FTK Imager (I am still testing and comparing between this and BitCurator but that’s a story for another blog post). And then I thought – what do I do now? Do I sit and wait for 30 days? How do I set a reminder for when the 30 days is up? And what if I get another deposit next week and I want to start processing that? I realised that despite all the literature on the subject everyone was surprisingly silent on the practicalities of how this was supposed to work. So I turned to Twitter for advice:

And I got some great advice and discussion from this. Ross Spencer (@beet_keeper) suggested asking questions like:

  • How old is the deposit?
  • How long has it been since the checksums changed?
  • How long has the antivirus at the depositor’s site been running on the material?
  • Can you process the material in an isolated environment, and how long will the processing take?

Somaya Langley (@criticalsenses) suggested that most older material (ie lying around for a while before it has been processed) will have already passed its de facto quarantine period. She also suggested using more than one anti-virus tool – as a belt and braces approach. She commented that generally there is a lot of unseen labour put into the management of curation workstations which tends not to get documented at all… there is certainly planning to do around managing a non networked machine and how to ensure it gets updated regularly but on a schedule that suits the work of the archives.

David Underdown (@DavidUnderdown9) suggested the delay between deposit and final ingest is such that this should mop up any viruses but admitted working in a lab environment with a network and storage helps with the management of different deposits. Here at the Modern Records Centre it’s just me and my digital curation workstation. And I now realise (as I stare at my out of bounds machine) that I’m going to need a more nuanced workflow for this work based on the points which Ross made about the background to the material I am processing and also looking more closely at my own institution’s policies on anti-virus checks. Some of this is risk assessment and I would love to see current work on risk assessments in digital preservation looking at this in more detail. 

What I really want to see is more discussion about what people out there are doing in terms of quarantine and how much of a risk they deem it to be. In the NDSA Levels of Digital Preservation (a hugely influential and valuable set of recommendations for beginning digital preservation work) virus checking is variously at Level 2 “Virus check high risk content” and Level 3 “Virus check all content”. It’s not on Level 4 at all, I presume because you are doing it already at Level 3. A little bird tells me that in the upcoming NDSA Levels reboot there will be a more nuanced version of this, but it still won’t address how we spell out exactly how quarantining and virus checking fit into the workflow.

In the meantime I’m going to work on creating a more in depth workflow which tries to balance the risks and the practicalities of managing born digital stuff so that capture and identification is timely, safe and consistent.

I want it all

The new V&A building in Dundee
The new V&A at Dundee (author’s own, CC-BY)

I was very excited to be attending the ICA_SUV conference on appraisal in Dundee, not least because it was my first visit to this most interesting of Scottish cities. I even went up early so I could take in some of the sights, such as the new V&A, the RRS Discovery and the 71 Brewing brewery…

Beer at the Dundee Brewery tap
71 Brewing beer at the brewery tap (image author’s own CC-BY)

The conference theme was appraisal, relevant to every archivist, records manager, information professional, data curator or whatever you call yourself, and it was particularly good to have this discussion at an international conference. There was a significant US/Canadian presence and many other countries were represented as well, so record keeping practices and traditions varied quite considerably, adding to the richness of the debate.

I had chosen to speak at the conference about my work on email appraisal. I am hoping that the text and slides will be shared so I can post a link to them, and I will certainly post more on that work, and on developments since my conference presentation, in the future.

It’s hard to pick out highlights of the conference but I really enjoyed hearing from Karolien Claes from the University of Antwerp on developing a toolbox and guidelines to help academics manage their own records – it sounded like (and Karolien must forgive me if I totally misrepresent her work) a blend of Research Data Management, records management and personal digital archiving and a good example where a range of approaches from across disciplines can help work towards the goal of (digital) preservation. Professor Basma Makhlouf Shabou (Geneva School of Business Administration) also discussed some very interesting work taking place in Switzerland on automating various archival processes, and in particular appraisal, using a tool they have called ArchiSelect. Whilst there are often tensions around the idea of automating such a subjective and human task, very careful testing showed the Swiss researchers that a large percentage of tasks could be automated. In the era of Big Data it would be impossible to process this manually so a level of automation is required to do anything at all. Shabou was keen to stress that these tools support decision making rather than replace decision making so I don’t think we need to feel as if our jobs are going to be taken by robots just yet.

Robot (Image by uleem odhit from Pixabay)

Another presentation that stuck with me particularly – and something that many of us probably don’t give as much thought to as we should – was from Renata Arovelius and Karl Petterson from the Swedish University of Agricultural Sciences. Whilst the Swedish record keeping tradition and environment has some marked differences to other countries (for example the UK), some of the issues remain a constant. Petterson posed the question – when we say we have deleted digital files what exactly do we mean by this? (Spoiler – when you click on the delete button your data doesn’t actually evaporate – see for example ICO guidance on deleting data). It’s a really interesting point – Petterson speculated that Swedish law is not clear on what “deleted” actually constitutes and it is probably not the only legal system where this is the case.

Specimen brought back from the Antarctic by the RRS Discovery
Preservation brought to you by the Antarctic explorer vessel RRS Discovery (author’s own CC BY)

One of the talking points of the conference was Geoffrey Yeo’s keynote and subsequent “provocation” (at least I think that’s what it was meant to be…) that in the context of digital records the archivist should keep “everything”, because doing so removes the subjectivity of the appraising archivist and because the previous barriers to sorting and finding “relevant” (to the researcher) material are overcome by vastly improved and refined search capabilities. There are a number of problems with this position but to be fair to Yeo he was challenging the audience (and perhaps the wider archival community) to defend a theoretical (as opposed to practical) justification for appraisal in the digital age. In some ways the argument (like the justification) remains theoretical because the bottom line is that we do not have the resources, be they economic or environmental, to keep everything. Environmental concerns are the topic of a recent article by Keith L. Pendergrass, Walker Sampson, Tim Walsh and Laura Alagna. For all of us environmental impact must be our priority concern (which applies to paper as well as to digital of course), so any other considerations seem somewhat meaningless. It was helpful though to think through all the reasons for undertaking appraisal and not merely regard it as “a thing which archivists do”.

wind farm - a source of renewable energy
Wind farm (Image by Free-Photos from Pixabay)

If in some mythical future we had discovered an unlimited sustainable form of energy that would allow us to keep everything, would we do so? I don’t believe we would because we always need to keep in mind that, digital or otherwise, the bit streams that sit on our servers are actually people’s stories, people’s experiences and people’s lives. And in this we have a responsibility to manage these archives responsibly. There are lots of things we would never wish to be kept – I would be mortified if job application forms from when I was in my teens surfaced (unless I was considerably more sophisticated than I remember myself to be), emails from many moments in my life, medical records, school attendance and behaviour records. There are numerous types of data which people expect to be forgotten and deleted, an expectation which UK and EU GDPR legislation frames. And whilst the law also supports archiving in the public interest and allows for the refusal of a request for erasure, this is not and should not be justification for not treating the people whose records they are with respect. The law describes the record keeping activity as being in the public interest and this needs to be the touchstone upon which we base our appraisal decisions. So yes – we do need to appraise because we have a moral and ethical duty to do so. So no – I don’t want it all. I will take on the appraisal challenge with all its difficulties and complexities. I just need to work out how to make sure what I delete really is deleted…

Minnie the Minx - born in Dundee
Minnie the Minx (author’s own CC BY)

Seeing double

I’ve been making good progress with processing and ingesting some of our born-digital collections – in particular the records produced by the University. The most difficult thing about this work has been ensuring that we receive the files in the first place! I’ve chosen to make a start on this material because in the main it is predictable (usually Word documents or PDFs of various sorts) and we understand the context of it and in some cases get some additional metadata. We’re very lucky here at Warwick University because we’re well resourced in terms of having a Records Management Team (yes that’s right – there’s more than one person doing it!) plus me and two archives assistants who are able to spend some time on processing and cataloguing. And yes – quite a bit of time is spent on sending emails saying “where are these committee minutes” or “can you send them without password protection” and so on. There is no denying that the capture part is labour intensive before you’ve even started on the digital processing.

There’s a lot of fine tuning to be done in digital preservation and it can be very time consuming

I’ve developed a workflow document for the team here to follow so that the processing is consistent, although I am also constantly reviewing and revising our workflows. Digital preservation is not something which can be “achieved”; it’s an ongoing process: from fixity checking through to revising workflows and normalising files for preservation and access. You will literally NEVER be done. But don’t let me put you off…

Workflow for initial processing of committee minutes

For these regularly deposited and (relatively!) unproblematic files we have adopted a low-level processing workflow. The selection, appraisal and quite often the arrangement has already been done by the creators so we focus on cataloguing and ingesting the files into the preservation system. A file list (not really a manifest) is created using a dir /b command, with the output redirected to a text file, and used to list the files in the catalogue. This means any one of us can quickly and easily create this type of document. At present I have generally not been including a file manifest as part of any submission documentation – mainly because I’m trying to streamline the process and I would have to add it in manually. Also the file list is captured in the catalogue metadata. I’m not too worried about where the information is captured as long as it’s captured somewhere.
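For anyone not working at a Windows command prompt, the same kind of bare file list can be produced with a few lines of Python. This is only a minimal sketch and not part of our documented workflow: the function name write_file_list and the folder and output paths are made up for illustration, and unlike a plain directory listing it deliberately skips sub-folders so that only files appear in the list.

```python
from pathlib import Path


def write_file_list(folder, output):
    """Write the names of the files in `folder`, one per line, to `output`.

    Hypothetical helper for illustration: sub-folders are skipped and the
    names are sorted so that the list is the same each time it is run.
    """
    names = sorted(p.name for p in Path(folder).iterdir() if p.is_file())
    Path(output).write_text("\n".join(names) + "\n", encoding="utf-8")
    return names
```

Calling write_file_list("committee_minutes", "filelist.txt") (both names are placeholders) would give a text file that can be pasted into the catalogue record.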

However with some of the legacy files (ie the ones which have been lurking around on the Shared Drive for a year or six) I have more often been needing something a bit more involved. This is in part because the legacy material includes duplicates, surrogates and other versions so at this point I am more likely to be making some appraisal decisions or otherwise documenting what I have. For these collections I have been making file manifests, usually using DROID. The process of identifying duplicate files (deduplication) is a key part of management and appraisal decisions. Running a DROID report over the files gives you some great metadata to get started with – it identifies the file types, and gives them a checksum. With the report in csv format you can sort by file type and checksum which gives you instant results for the number of each file type and also allows you to see where there are duplicate checksums (which denotes a duplicate file). This is fine where you are dealing with 10 or 15 files but does not scale up – when I ran it across the 1,000 or so files I was dealing with I just couldn’t see where the duplicates were that easily.

DROID report csv but so many files – ugh!
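For anyone who would rather script this check than sort a spreadsheet by eye, the idea of grouping rows by checksum can be sketched in a few lines of Python. This is a rough sketch under stated assumptions rather than a tested part of my workflow: the column names MD5_HASH and FILE_PATH are what I would expect in a DROID csv export generated with MD5 checksums, so check the header row of your own report and change them if needed, and the function name find_duplicates is made up for illustration.

```python
import csv
from collections import defaultdict


def find_duplicates(report_path, hash_column="MD5_HASH", path_column="FILE_PATH"):
    """Group the files in a DROID-style csv report by checksum.

    Returns a dict mapping each checksum that occurs for more than one
    file to the list of file paths sharing it. The column names are
    assumptions: adjust them to match the header row of your export.
    """
    groups = defaultdict(list)
    with open(report_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            checksum = (row.get(hash_column) or "").strip()
            if not checksum:
                continue  # folders and similar rows carry no checksum
            path = row[path_column]
            if path not in groups[checksum]:
                # ignore repeated rows for the same file
                groups[checksum].append(path)
    return {c: paths for c, paths in groups.items() if len(paths) > 1}
```

Looping over the result of find_duplicates("droid_report.csv") (a placeholder file name) and printing each checksum with its paths gives a short list of candidate duplicates to appraise, however many rows the report has.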

However thankfully help was at hand courtesy of David Underdown (from the UK National Archives) and the csv validator, which I hadn’t previously come across. Even better, there is a user-friendly blog post to accompany it, with which (and with rather a lot of help from David) I created a csv schema which not only reported on duplicates (as errors) in a csv file but also (with an extra tweak) weeded out any null reports where DROID found a container file (eg a folder) for which it did not create a checksum.

Report of schema indicating where the duplicate (and therefore error) files occur.

If you want to have a go with this (assuming you’ve got DROID up and running) you can download the CSV validator here and then upload your DROID csv report and a copy of the deduplication schema (copy and paste it from here into a text file and save it somewhere). Hit the validate button and instant deduplication.

Having tried these things out largely whilst “chatting” over Twitter there also followed some great accompanying discussion including a great tip from Angela Beking of Library and Archives Canada who pointed out that you can set filters on DROID profile outputs (I shall be having a go with using this functionality too).

Other people came up with some alternative tools to try (eg TreeSize or HashMyFiles). There are literally hundreds of tools out there for performing all sorts of tasks – you can find some described at COPTR (Community Owned digital Preservation Tool Registry) – and I would encourage everyone to contribute to COPTR if they find a tool they like that’s useful for a particular aspect of their workflow. Free tools in particular are great for people working with small budgets (and who isn’t doing that?)

Always worth spending time trying to find the right tool to suit your needs.

This all started out with trying to find a way to weed out duplicate files and to do a bit less seeing double but ended up being a conversation and a piece of collaborative work which has certainly helped me see more clearly. My next step is to try and integrate the report outputs of this into my workflows. I hope some of the sharing of this work is helpful to other people too.

Gerald Aylmer Seminar 2019: Digital and the Archive

A sunny day at the National Archives

It’s been a regular-ish part of my calendar for a few years now to attend the Gerald Aylmer Seminar which has been held annually since 2002. This year’s theme was Digital and the Archive – so of great relevance to my current work and anyone engaged with digital preservation. It’s a great opportunity for historians and archivists to get together and share their work and experiences – something that ought to happen rather more often than it does.

In fact I’ve been very interested in how we might start to present our legacy born digital holdings to our users and potential users – what do researchers want from these kinds of sources? Do they yet know themselves? Jane Winters, who was one of the keynote speakers, has been asking these very questions and it was great to hear her outlining some of the challenges of getting researchers to interact with born digital material and pointing out some of the difficulties which still remain around capturing born digital resources and making them available.

John Sheridan completed the tripartite keynote, begun with Alice Prochaska looking back on a glittering career in digital librarianship and scholarship, by delivering a “call to arms” to archivists to develop the necessary archival practice to meet the challenges of capturing today’s digital sources (not just preserving yesterday’s) and suggested we do not (yet) have the right levels of advocacy to achieve this. I do agree with this although I am not convinced it is an entirely new problem. I also recognise the tensions inherent in constantly managing legacy collections as well as keeping up with the material which is being produced right now.

The next session focussed on the “hybrid” nature of the archive with Jen Mitcham talking about her work on the Marks and Gran archive which she has blogged about here. This came at a good time for me as I am currently taking my first steps in digital forensic work which I will be blogging about very soon. Something which I really took away from Jen’s talk was her reminder of the “user experience” of working with legacy born digital files (in her case with word processing packages from the 1980s) where the whole design and probably use of the software package was to produce a physical document. This is an important factor to bear in mind when considering (as I am doing) how to represent some digital objects from the archives. The theme of user interaction was further taken up by the following speaker, Professor James Newman from Bath Spa University, who had recorded a presentation on his work capturing the user experience in video games (specifically Super Mario Brothers – you can read all about it here!)

In the afternoon we heard about the fascinating work which has been undertaken by Ruth Ahnert on Tudor State networks which opened up huge possibilities using metadata derived from calendars and catalogues, as well as stressing the importance of linked open data in reconstructing networks of these kinds. Again her work is available here to read.

Rachel Foss from the British Library gave a fascinating insight into their “enhanced curation” work where they gather a huge amount of supporting information about the people whose papers they take – groundbreaking and innovative stuff which I am sure there is much to be learnt from, even if we can’t all get a trip to the south of France to record the ambient sounds of the valley where John Berger lived…

It is interesting that the British Library do ask authors questions about their writing practices in terms of engagement with digital technologies, something which gives key insights to understanding the digital collections. It would be interesting to see how this information is represented in the metadata made available to the researcher.

Adrian Glew from the Tate introduced a huge community engagement project which the Tate was involved in, the outputs from which have been shared.

The final presentation was from Naomi Wells talking about working with London’s Latin American Community on documenting their experiences. There were some very interesting findings in relation to attitudes towards digital and physical heritage – websites and other digital resources were seen as inherently ephemeral as opposed to physical objects. It was difficult to get the same level of engagement for the digital legacy.

The day ended with a panel “provocation” led by Margot Finn from the Royal Historical Society with Kelly Foster (blue badge guide and wikimedian), Jo Pugh (National Archives) and Jo Fox (Institute of Historical Research) all contributing to a thought provoking discussion. Foster drew out the power of open data and of licensing to give appropriate credit to voices which are often obscured from the narrative, and ended on a call to “open up” data and metadata. It’s something I’m going to take away with me and start acting on!

Memory Makers 2018

Beautiful Amsterdam

I was extremely lucky to be at the Amsterdam Museum for both the Memory Makers: Digital Skills and How to Get Them conference and also the Digital Preservation Awards 2018, where excellent practice across the sector was recognised and rewarded.

I missed the ePADD workshop in the morning but I did get to meet Josh Schneider later – which was great – so I need to make sure I follow up on my email preservation work so I can bother him with more questions in the future.  That’s one of the really great things about conferences – you get to meet people whose work you have followed and admired – this helps create connections, establishes areas of interest and builds communities. 

The conference was kicked off with an inspiring but impossible to summarise keynote from Eppo van Nissen tot Sevenaer, Director of the Netherlands Institute for Sound and Vision (Beeld en Geluid) in which he encouraged us all to be archivist activists and quoted William Ernest Henley’s poem “Invictus”. It fired us all up for a conference which explored how digital preservation knowledge is taught, acquired and disseminated.

The first session focussed on teaching “Digital Preservation” (there was quite a bit of discussion about what constituted this and how it was best described in terms of curricula). Eef Masson of the University of Amsterdam, who teaches on a Masters Programme on Preservation and Presentation of the Moving Image, discussed how the disciplines of film and media studies intersected with and led to collaboration with the traditional archives programmes – to everyone’s benefit. Sarah Higgins from University of Aberystwyth talked frankly about the difficulties of engaging students from humanities backgrounds with digital skills. Many (although by no means all) people choosing a career in archives do so because they like “old things” – this struck a chord with me as I am a medievalist by training and I have learned to “love the bits”. How did I get there and how can I take others there with me? It seems there is a need to engage and inspire people with our less tangible digital heritage. Later that evening on receiving her DPC Fellowship award Barbara Sierman said:

One of my big takeaways from the conference was how to engage people with digital preservation and encourage people to get as excited about it as I am! After Sarah Higgins, Simon Tanner rounded off the session talking us through King’s College’s Digital Asset and Media Management MA which boomed in numbers once they added the word “Media” into the title. The list of MA student dissertation topics sounded absolutely fascinating and very varied. Tanner explained that they don’t teach Digital Preservation as a module but rather it is woven into the fabric of the degree.

Sharon McMeekin of the Digital Preservation Coalition began the second session of the afternoon by talking through the survey of what kind of training members said they wanted (which might not necessarily be the same as what they ought to be focussing on…).  She encouraged sharing best and worst (!) practice and emphasised that Digital Preservation is a career of continuous learning – something to be aware of when employing someone in that role.  Next was Maureen Pennock of the British Library who illustrated an enviable internal advocacy strategy. She explained:

The final speaker of the day – Chantal Keijsper of the Utrecht Archives – described the “Twenty First Century” skills and competencies needed to realise our digital ambitions.

The evening was taken up with the Digital Preservation Awards 2018 which you can read about here. They were all worthy winners and there were many extremely unlucky losers. Almost all of my nominees won their category – I’m saying nothing beyond re-iterating my love for ePADD – they were very worthy winners in their category!

Jen Mitcham of the DPC and me at the awards ceremony

Day two of the conference was a chance for some of the Award finalists to showcase their work. First up was Amber Cushing from University College Dublin discussing the research done to try and build the digital information management course at the institution. In a targeted questionnaire aimed at those who had responsibility for digital curation there was a surprising lack of awareness of what digital preservation/curation was and a confusion between digital preservation and digitisation. Next up was Rosemary Lynch who was part of the Universities of Liverpool and Ibadan (Nigeria) project to review their Digital Curation curriculum. Both institutions learnt a lot from the process, which enabled them to make changes to their student offer. With support from the International Council on Archives this project has helped make standards and other resources available in countries where this can be difficult. Next was Frans Neggers from the Dutch Digital Heritage Network (Netwerk Digitaal Erfgoed) talking about the Leren Preserveren course launched in October 2017 enabling Dutch students to learn practical digital preservation skills. They have had excellent feedback from the course:

“I expected that I would learn about digital preservation, but I learned a lot about my own organization, too”

Student on the Leren Preserveren project

and Neggers added that another benefit was raising the profile of the Dutch Digital Heritage Network – often this course was how people got to find out about the organisation.  The final speaker in this session was Dorothy Waugh from Emory University, one of a group of archivists who have developed the Archivist’s Guide to Kryoflux.  I can testify that this is an invaluable piece of work for anyone planning to (in my case) or actually using a Kryoflux device (designed to read obsolete digital media carriers).  The Kryoflux was developed by audio visual specialists and does not come with archivist-friendly instructions:

In the final session we heard some great examples of training and advocacy. Jasper Snoeren from the Netherlands Institute of Sound and Vision (Beeld en Geluid) talked about their “Knowledge Cafes” where they invite staff to share a drink and learn about curation and preservation. He discussed how to turn a sector into a community: run very focussed training programmes and keep people engaged in between. Puck Huitsing from the Dutch Network of War Collections (Netwerk Oorlogsbronnen) followed and had a great deal of useful advice which would constitute a blog post in itself although my favourite quote was probably:

Rounding off an extremely useful and successful event, Valerie Jones from the UK National Archives presented the Archives Unlocked Strategic Vision for the sector, tempering this by saying:

If you’re going to innovate, just do it. Don’t write reports. Just go.

Valerie Jones, UK National Archives

I learnt a great deal at this conference and as usual I have added more to my “to do” list, especially around tackling internal advocacy and I can’t wait to start putting this into practice.

Stroopwafels and coffee
Stroopwafels and coffee kept me going!