May 2019 – An Old Hand Digital

Glasgow at its best (image author’s own CC BY)

The Spring Archivematica UK User Group this year headed to Glasgow, hosted by the University of Strathclyde right in the centre of one of my favourite cities. The user group is open to anyone using or interested in Archivematica and is user led and independent of Artefactual Systems who develop and maintain the software. If you are interested do drop me a line! We do always welcome input from Artefactual – usually a dial in from Canada but this time an in person presentation from their new managing director Justin Simpson.

As the user group are not a formal body (although there are terms of reference) there is no budget for anything so we are reliant on members to host the session, so a big thank you is in order for the University of Strathclyde for their hospitality.

In time honoured tradition our hosts Victoria Peters and Alan Morrison from Strathclyde gave an update on where they are with Archivematica, respectively from an Archives and Research Data Management perspective.

Victoria ran through some of her workflows and mused about the best place in the workflow to undertake appraisal – before ingest into Archivematica or after? There was quite a bit of debate on this and generally the view was that it was “easier” to do this before ingest but would we still say this if better tools were available for ingest in Archivematica (there is some functionality but it isn’t that easy to work with)? Is this really a development we would like to see? Strathclyde also hope to begin automate more of their workflows – something I sympathise with very much. Meanwhile Alan echoed this desire for automation which could be achieved by the development of integration tools (which may come in the future) especially with regard to metadata capture. Again this was an issue that others returned to during the day.

Next up was Anna McNally from the University of Westminster who bravely (in my view) tackled the question of PREMIS rights. PREMIS rights – for those of you who can start to feel a bit seasick at the extreme metadata end of preservation are clearly explained in this excellent blog from our friends at Bentley Historical Library. Archivematica helps you enormously by automatically capturing most of the metadata required by PREMIS however you do have to create your own rights statements if you want to include it in the SIP. We had an interesting discussion around whether or not it was necessary to include a rights statement with the digital object and if you do what happens when the rights either are not clear or change? But if you decide not to include them are you creating problems for yourself further down the line? I’m definitely going to be doing some work around this as I move onto material where the rights are not as straightforward as they are for our corporate records.

Chris Grygiel from the University of Leeds gave the final presentation before lunch and returned to everyone’s favourite topic: workflows. This was the theme of the day as Fabi Barticioti from the London School of Economics also talked about theirs – in both cases very different workflows for very different kinds of material but really useful to be able to share and discuss approaches. Chris’s work on digital forensic workflows very much mirrors my own so I am looking forward to bouncing some more ideas around with him in the future. It raised the question again of how much work is done pre-ingest and how much post ingest. The answer as ever appears to be: it depends… Fabi looked at the very complex issue of transferring metadata between systems and reflected on the amount of work necessary to prepare metadata for ingest. She said that you can’t treat each collection as a special flower but then every collection is different…

You can’t treat collections as special flowers

Laura Giles from the University of Hull gave a really positive update form the Hull City of Culture project and her big message was that if you can get your creators/depositors to do just a small amount of work (eg sensitivity tagging) it can lead to huge benefits in the long run.

Final presentation of the day was Matthew Addis wowing us with his metadata wizardry and use of Lego figures to illustrate his talk! He had developed a really neat way of ingesting ISAD(G) metadata into Archivematica and AtoM which gave very impressive results although came with the usual health warning: you need to do lots of preflight checks. With all workflows, it doesn’t matter how neat your tools and scripts are, you still have to input the metadata, and if there are errors here it can scupper the whole thing. I wonder if there’s a use case here for the National Archives CSV validator I blogged about recently?

We can rely on there being lego somewhere in the proceedings….

Finally we were really pleased to welcome Artefactual’s Managing Director Justin Simpson in person to the meeting as it coincided with a UK trip. There’s some great stuff happening at Artefactual and it’s fantastic to have the opportunity to ask questions of them directly like this.

Another brilliant Archivematica User Group meeting which was really useful as a platform to share experiences and ideas. Already looking forward to the next one!

I’ve been making good progress with processing and ingesting some of our born-digital collections – in particular the records produced by the University. The most difficult thing about this work has been ensuring that we receive the files in the first place! I’ve chosen to make a start on this material because in the main it is predictable (usually Word documents or PDFs of various sorts) and we understand the context of it and in some cases get some additional metadata. We’re very lucky here at Warwick University because we’re well resourced in terms of having a Records Management Team (yes that’s right – there’s more than one person doing it!) plus me and two archives assistants who are able to spend some time on processing and cataloguing. And yes – quite a bit of time is spent on sending emails saying “where are these committee minutes” or “can you send them without password protection” and so on. There is no denying that the capture part is labour intensive before you’ve even started on the digital processing.

There’s a lot of fine tuning to be done in digital preservation and it can be very time consuming

I’ve developed a workflow document for the team here to follow so that the processing is consistent although I am also constantly reviewing and revising our workflows. Digital preservation is not something which can be “achieved” it’s an ongoing process: from fixity checking through to revising workflows and normalising files for preservation and access. You will literally NEVER be done. But don’t let me put you off…

Workflow for initial processing of committee minutes

For these regularly deposited and (relatively!) unproblematic files we have adopted a low-level processing workflow. The selection, appraisal and quite often the arrangement has already been done by the creators so we focus on cataloguing and ingesting the files into the preservation system. A file list (not really a manifest) is created using a dir > b command and used to list the files in the catalogue. This means any one of us can quickly and easily create this type of document. At present I have generally not been including a file manifest as part of any submission documentation – mainly because I’m trying to streamline the process and I would have to add it in manually. Also the file list is captured in the catalogue metadata. I’m not too worried about where the information is captured as long as it’s captured somewhere.

However with some of the legacy files (ie the ones which have been lurking around on the Shared Drive for a year or six) I have more often been needing something a bit more involved. This is in part because the legacy material includes duplicates, surrogates and other versions so at this point I am more likely to be making some appraisal decisions or otherwise document what I have. For these collections I have been making file manifests, usually using DROID. The process of identifying duplicate files (deduplication) and is a key part of management and appraisal decisions. Running a DROID report over the files gives you some great metadata to get started with – it identifies the file types, and gives them a checksum. With the report in csv format you can sort by file type and checksum which gives you instant results for the number of each file type and also allows you to see where there are duplicate checksums (which denotes a duplicate file). This is fine for where you are dealing with 10 or 15 files but does not scale up – when I ran it across the 1,000 or so files I was dealing with I just couldn’t see where the duplicates were that easily.

DROID report csv but so manythe files – ugh!

However thankfully help was at hand courtesy of David Underdown (from the UK National Archives and the csv validator which I hadn’t previously come across. Even better a user-friendly blog post to accompany it which (with rather a lot of of help from David) I created a csv schema which not only reported on duplicates (as errors) in a csv file but also (with an extra tweak) weeded out any null reports where DROID found a container file (eg a folder) which it did not create a checksum for.

Rreport of schema indicating where the duplicate (and therefore) error files occur.

If you want to have a go with this (assuming you’ve got DROID up and running) you can download the CSV validator here and then upload your DROID csv report and a copy of the deduplication schema (copy and paste it from here into a text file and save it somewhere). Hit the validate button and instant deduplication.

Having tried these things out largely whilst “chatting” over Twitter there also followed some great accompanying discussion including a great tip from Angela Beking of Library and Archives Canada who pointed out that you can set filters on DROID profile outputs (I shall be having a go with using this functionality too).

I often filter "folders" out before exporting the profiles, depending on what I need to use the report for 🙂 pic.twitter.com/1nfqWjezyV
— Angela Beking (@AngelaBeking) May 9, 2019

Other people came up with some alternative tools to try (eg TreeSize or HashMyFiles). There are literally hundreds of files out there for performing all sorts of tasks – you can find some described at COPTR (Community Owned digital Preservation Tool Registry – and I would encourage everyone to contribute to COPTR if they find a tool they like that’s useful for a particular aspect of their workflow. Free tools in particular are great for people working with small budgets (and who isn’t doing that?)

Always worth spending time trying to find the right tool to suit your needs.

This all started out with trying to find a way to weed out duplicate files and to do a bit less seeing double but ended up being a conversation and a piece of collaborative work which has certainly helped me see more clearly. My next step is to try and integrate the report outputs of this into my workflows. I hope some of the sharing of this work is helpful to other people too.

An Old Hand Digital

Month: May 2019

Archivematica UK User Group Glasgow 16th May 2019

Seeing double