Seeing double

I’ve been making good progress with processing and ingesting some of our born-digital collections – in particular the records produced by the University. The most difficult thing about this work has been ensuring that we receive the files in the first place! I’ve chosen to make a start on this material because in the main it is predictable (usually Word documents or PDFs of various sorts) and we understand the context of it and in some cases get some additional metadata. We’re very lucky here at Warwick University because we’re well resourced in terms of having a Records Management Team (yes that’s right – there’s more than one person doing it!) plus me and two archives assistants who are able to spend some time on processing and cataloguing. And yes – quite a bit of time is spent on sending emails saying “where are these committee minutes” or “can you send them without password protection” and so on. There is no denying that the capture part is labour intensive before you’ve even started on the digital processing.

There’s a lot of fine tuning to be done in digital preservation and it can be very time consuming

I’ve developed a workflow document for the team here to follow so that the processing is consistent, although I am also constantly reviewing and revising our workflows. Digital preservation is not something which can be “achieved”; it’s an ongoing process: from fixity checking through to revising workflows and normalising files for preservation and access. You will literally NEVER be done. But don’t let me put you off…

Workflow for initial processing of committee minutes

For these regularly deposited and (relatively!) unproblematic files we have adopted a low-level processing workflow. The selection, appraisal and quite often the arrangement has already been done by the creators, so we focus on cataloguing and ingesting the files into the preservation system. A file list (not really a manifest) is created using a dir /b command with the output redirected to a text file, and used to list the files in the catalogue. This means any one of us can quickly and easily create this type of document. At present I have generally not been including a file manifest as part of any submission documentation – mainly because I’m trying to streamline the process and I would have to add it in manually. Also the file list is captured in the catalogue metadata. I’m not too worried about where the information is captured as long as it’s captured somewhere.
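If you are on a system without dir, or want something scriptable, the same sort of file list can be generated with a few lines of Python. This is a minimal sketch, not the workflow document itself, and the directory and output file names are just placeholders:

```python
import os

def write_file_list(root, out_path):
    """Walk a directory tree and write one relative file path per line."""
    with open(out_path, "w", encoding="utf-8") as out:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                out.write(rel + "\n")

# Example usage (placeholder names):
# write_file_list("committee_minutes", "file_list.txt")
```

The resulting text file can then be pasted into the catalogue record in the same way as the dir output.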

However with some of the legacy files (i.e. the ones which have been lurking around on the Shared Drive for a year or six) I have more often been needing something a bit more involved. This is in part because the legacy material includes duplicates, surrogates and other versions, so at this point I am more likely to be making some appraisal decisions or otherwise documenting what I have. For these collections I have been making file manifests, usually using DROID. The process of identifying duplicate files (deduplication) is a key part of management and appraisal decisions. Running a DROID report over the files gives you some great metadata to get started with – it identifies the file types, and gives them a checksum. With the report in csv format you can sort by file type and checksum, which gives you instant results for the number of each file type and also allows you to see where there are duplicate checksums (which denote duplicate files). This is fine where you are dealing with 10 or 15 files but does not scale up – when I ran it across the 1,000 or so files I was dealing with I just couldn’t see where the duplicates were that easily.
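At that scale the sort-and-scan step can also be scripted. Here is a rough Python sketch that reads a DROID csv export, counts the file formats and groups files sharing a checksum; the column names (FILE_PATH, PUID, MD5_HASH) are assumptions and may need adjusting to match your DROID version and export settings:

```python
import csv
from collections import Counter, defaultdict

def summarise_droid_report(csv_path, hash_col="MD5_HASH",
                           type_col="PUID", path_col="FILE_PATH"):
    """Count formats and group files that share a checksum.
    Column names are assumptions: adjust to your DROID export."""
    format_counts = Counter()
    by_hash = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            format_counts[row.get(type_col, "")] += 1
            h = row.get(hash_col, "")
            if h:  # folders/containers have no checksum, so skip them
                by_hash[h].append(row.get(path_col, ""))
    # Any checksum shared by more than one path indicates duplicate files
    duplicates = {h: paths for h, paths in by_hash.items() if len(paths) > 1}
    return format_counts, duplicates
```

Because rows without a checksum are skipped, folders reported by DROID do not show up as spurious duplicates.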

DROID report csv – but so many files – ugh!

However, thankfully help was at hand courtesy of David Underdown (from the UK National Archives) and the csv validator, which I hadn’t previously come across. Even better, there’s a user-friendly blog post to accompany it. With rather a lot of help from David, I created a csv schema which not only reported on duplicates (as errors) in a csv file but also (with an extra tweak) weeded out any null reports where DROID found a container file (eg a folder) for which it did not create a checksum.

Report of the schema indicating where the duplicate (and therefore error) files occur.

If you want to have a go with this (assuming you’ve got DROID up and running) you can download the CSV validator here and then upload your DROID csv report and a copy of the deduplication schema (copy and paste it from here into a text file and save it somewhere). Hit the validate button and instant deduplication.

Having tried these things out largely whilst “chatting” over Twitter, there followed some great accompanying discussion, including a great tip from Angela Beking of Library and Archives Canada, who pointed out that you can set filters on DROID profile outputs (I shall be having a go with this functionality too).

Other people came up with some alternative tools to try (eg TreeSize or HashMyFiles). There are literally hundreds of tools out there for performing all sorts of tasks – you can find some described at COPTR (Community Owned digital Preservation Tool Registry) – and I would encourage everyone to contribute to COPTR if they find a tool that’s useful for a particular aspect of their workflow. Free tools in particular are great for people working with small budgets (and who isn’t doing that?).

Always worth spending time trying to find the right tool to suit your needs.

This all started out with trying to find a way to weed out duplicate files and to do a bit less seeing double but ended up being a conversation and a piece of collaborative work which has certainly helped me see more clearly. My next step is to try and integrate the report outputs of this into my workflows. I hope some of the sharing of this work is helpful to other people too.

3 thoughts on “Seeing double”

  1. There’s one potential gotcha in this around how DROID handles the case of files identifying as multiple formats. When you do the export from DROID you can choose either one row per file or one row per format. It’s probably better to choose one row per format, as otherwise the CSV file may have a variable number of fields, which the validator won’t like. However, you could then get false positives on duplicates, as I think the checksum for the file would be given in both rows (otherwise it would be blank, which might also cause issues).

    I’ve updated the schema on GitHub now.
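If you are scripting the check rather than using the validator, the false-positive risk described in this comment can be handled by collapsing the one-row-per-format export back to one entry per file before comparing checksums. A sketch, again assuming DROID-style column names (FILE_PATH, MD5_HASH) that may differ in your export:

```python
import csv
from collections import defaultdict

def find_duplicates_per_file(csv_path, hash_col="MD5_HASH",
                             path_col="FILE_PATH"):
    """Collapse a one-row-per-format export to one entry per file,
    then report checksums shared by more than one distinct file.
    Column names are assumptions: adjust to your export."""
    hash_by_path = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            h = row.get(hash_col, "")
            if h:
                # Repeated rows for the same file collapse into one entry
                hash_by_path[row.get(path_col, "")] = h
    files_by_hash = defaultdict(list)
    for path, h in hash_by_path.items():
        files_by_hash[h].append(path)
    return {h: paths for h, paths in files_by_hash.items() if len(paths) > 1}
```

A file that appears twice only because it matched two formats is then counted once, so only genuinely distinct files sharing a checksum are flagged.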


