MidiPres May 2021: Cataloguing Born Digital

Image credit: Image by DreamQuest from Pixabay

It’s now over twelve months since Laura and I launched our “digital preservation support network” (at a real live event – imagine that!). It’s heartening to see how much we’ve developed during this time. For the Spring MidiPres meeting we had taken a bit of a poll on what topics people would like to focus on and “Cataloguing Born Digital Collections” was quite high up on the list.

We invited a few people to share their experiences of cataloguing born digital materials to get things started but in the spirit of the network this was as much to invite further debate and comment from others and get us all thinking. I know I find this useful as it gives me ideas to borrow and develop. It’s also helpful to be able to get a sense check from others – does my theory (or practice) make sense? Might there be better ways of doing it?

Our first presenter reflected on their considerable experience of cataloguing analogue collections and the challenges they were anticipating in tackling born digital collections. There were some technical challenges such as managing the integration between the CALM cataloguing software and the Preservica preservation system but these (as is so often the case) were smaller challenges than dealing with the size and complexity of the material coming in. Archivists spend their time making sense of the archive and presenting it to the outside world in a way that is intelligible and digestible, so there were inevitably going to be challenges with very large deposits which either had no structure or an extremely complex one – how is this best presented in a catalogue to facilitate discovery? Sometimes file names are user friendly, sometimes they are not and there are always the tensions of capturing the essence of what was created and facilitating users in their discovery of it. There was some discussion around managing personal data – identifying it in amongst a huge quantity of material and what resources there were for automated ways of doing this.
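As a rough illustration of what an automated first pass for personal data might look like, here is a minimal Python sketch that flags text files containing things that look like email addresses or UK National Insurance numbers. The two patterns and the function names are assumptions made for illustration, not a tool discussed at the meeting. Pattern matching of this kind will miss a great deal and throw up false positives, so it is a triage aid rather than a sensitivity review.

```python
import re
from pathlib import Path

# Rough, illustrative patterns only; real personal data takes many more forms.
PATTERNS = {
    "email_address": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ni_number_like": re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b"),
}


def scan_text(text):
    """Return {pattern name: [matching strings]} for patterns found in text."""
    hits = {}
    for name, pattern in PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits


def scan_folder(folder):
    """Scan every .txt file under folder; return {path: hits} for files with matches."""
    results = {}
    for path in sorted(Path(folder).rglob("*.txt")):
        hits = scan_text(path.read_text(encoding="utf-8", errors="replace"))
        if hits:
            results[str(path)] = hits
    return results
```

Calling scan_folder on a deposit returns only the files that need a closer look, which is about as much as a sketch like this can promise.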

Our second presenter focussed more on the practicalities of cataloguing standards and software. Many MidiPres members use CALM as their cataloguing software but by no means all; we have a strong feeling that the network should aim to be vendor neutral as far as possible which helps us be more inclusive. Several members do not have specific cataloguing software at all – usually using a spreadsheet to catalogue and manage their collections information. The CALM cataloguing system (like others) is standards based and as is common in the UK most people adhere to ISAD(G) for their cataloguing with local variations in practice. Increasingly, as ISAD(G) was not developed for digital collections, practice has been forced to diversify and develop and whilst alternatives are emerging (notably RiC) I’ve yet to get a feel for people embracing this. A lack of standardisation in the way we describe our collections is likely to have a negative effect on their discoverability – especially across different institutions – so the more we share experiences and practice the better.

There was some discussion of that perennial problem of dates – do we record the creation date, the last modified date or the presumed actual date of the document, and if so where and how do we record it and then represent it to the researcher? This led us on to the capture and use of the metadata which we create around digital collections – it’s not something that any of us were aware of being made publicly available routinely but it is definitely a consideration. There was quite some discussion more broadly about how we represent born digital collections via our catalogues and the only consensus is that it depends on the collection! We also mooted how we managed digital content within our collections, especially hybrid ones, and there were various practices shared for indicating the presence of digital content to give the archivist an easy way of gaining an overview of what is in the system, such as assigning an accession specific code or creating a drop down or Y/N field in the collections management system. This allows for greater reporting functionality although probably doesn’t address the need for a more granular approach.
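To make the dates problem a little more concrete, here is a minimal Python sketch of what the file system itself can tell you about a file. The function name file_dates is made up for illustration. What each timestamp means varies by platform, as the comments note, and none of them is the "presumed actual date" of the document, which still needs an archivist's judgement.

```python
import os
from datetime import datetime, timezone


def file_dates(path):
    """Return the timestamps the file system holds for a file, as UTC datetimes."""
    info = os.stat(path)
    dates = {
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
        # st_ctime has traditionally meant creation time on Windows but
        # metadata-change time on Unix-like systems, so treat it with care.
        "ctime": datetime.fromtimestamp(info.st_ctime, tz=timezone.utc),
    }
    # st_birthtime is only available on some platforms (for example macOS).
    birth = getattr(info, "st_birthtime", None)
    if birth is not None:
        dates["created"] = datetime.fromtimestamp(birth, tz=timezone.utc)
    return dates
```

Copying files between systems can reset some of these values, so if they matter it is worth recording them before transfer rather than after.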

Our third presenter talked about the move from traditional paper oriented ways of cataloguing towards incorporating digital – I think this is something which many struggle with because standards and software are so deeply rooted in the paper world that even if you want to move on from that (both in terms of the types of collections you are working with but also in the way in which collections are made available online) it can be something of a struggle. There was some good discussion of no “one size fits all” approach working for born digital (as indeed is the case for analogue archives) and Trevor Owens’ excellent “Theory and Craft of Digital Preservation” book was referenced – an excellent read for anyone interested in getting into more depth with the subject.

Sharepoint feels like the Wild West to many of us…
(Image by Brigitte from Pixabay)

There was lots of discussion around what the records creators/depositors can do for us. Some archive offices asked for lists to accompany any deposits which inevitably varied in usefulness, but in some cases were seen as an explicit statement of balancing the depositor/archivist relationship so the archives were not seen as a dumping ground for material which was no longer considered immediately useful! There was some discussion about how or even if it might be possible to acquire other sorts of creator generated metadata (checksums being the most obvious one) and despite the wealth of digital preservation literature recommending this as good practice, most if not all of us felt this was at present completely unachievable (and some had attempted it). It was even observed that in some situations putting demands for lists (for example) on depositors would just lead to nothing being deposited at all. The real world, it turns out, is quite a long way away from the text books.

There was loads more covered in the meetings including a SharePoint Anonymous discussion (“it’s like the Wild West”) and some thoughts on licensing of digital materials, both of which felt like they could form the core of future meetings. We certainly aren’t going to run out of things to say any time soon and I really feel we are helping everyone to build confidence in the sector, which is what we set out to achieve.

MidiPres January 2021: a closer look at email

No, not emails (Image by Andrys Stienstra from Pixabay)

It doesn’t seem five minutes since the last meeting of Midipres (and I said that last time) but here we are in 2021 and time for another Midlands Preservation Network event which brings together practitioners from across the UK Midlands region to share and discuss their digital preservation stories. This meeting had a special theme – email preservation – which proved a popular topic with many and we had some lively discussion covering a range of topics from normalisation to appraisal and access.

We kicked off with a case study from a group member who had received some emails as part of a deposit of various digital records from a Parish Council (no not Handforth!) including emails saved onto a CD in an .eml format. The archives were having trouble opening the emails* and had concerns about converting them to another format as that created problems with the attachments… We had a useful discussion around how much of the available literature addresses the issue of migrating and dealing with entire mailboxes rather than having a handful of emails amongst other material (I’ve certainly come across the latter scenario more than once). We talked about the various formats that you were likely to come across – .pst, .mbox and so on and how much or little influence you might have about asking for a particular format.
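For the "handful of emails amongst other material" scenario, a single .eml file can be inspected without converting it at all. The sketch below uses Python's standard library email package to pull out a few headers and list attachment file names. The function name summarise_eml is made up for illustration, and this is not the approach the case study took, just one way of seeing what a file contains before deciding what to do with it.

```python
from email import policy
from email.parser import BytesParser


def summarise_eml(raw_bytes):
    """Parse the bytes of one .eml file and summarise headers and attachment names."""
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    return {
        "from": msg["From"],
        "subject": msg["Subject"],
        "date": msg["Date"],
        # iter_attachments() skips the message body and yields the other parts
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }
```

In use, something like summarise_eml(Path("message.eml").read_bytes()) reads the file as it stands, leaving the original untouched.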

A couple of others shared their work – one archivist had been collecting Covid update emails from the Chief Executive’s Office as part of their contemporary collecting strategy but had concerns (don’t we all) about embedded links to external content including videos. I don’t think anyone has the answer to these issues but there would be useful work mapping out what might be possible and the sorts of tools needed to achieve this kind of capture. Another group member shared a success story in that they had succeeded where I have failed (I am not bitter about this at all!) in getting a deposit of the inbox of one of the Chief Officers in their organisation. It required developing a new deposit agreement and some significant degree of negotiation but to me sounds like a huge success and I look forward to hearing further updates on this, hoping we might be able to emulate their success. We discussed the attitudes people have towards their emails and how the mixing of personal and professional in email accounts made the management and capture of email particularly challenging. One of the group mentioned James Lappin’s recent blog post on this very topic and I shall be reading his article with great interest. Getting people to weed their emails in advance of depositing sounds like it could be challenging and we were pointed towards a recent IRMS podcast where Vincent Hoolt of the Netherlands National Archives discussed the pitfalls of exactly this, talking about how they had inadvertently received the divorce papers of a government official – not just embarrassing but a GDPR nightmare!

The thrill of a live demo of Emailchemy (photo author’s own)

I followed this up by giving a live demo (brave I know) of Emailchemy – a tool for converting email to different formats. I’ve had a go with this and with Aid4Mail in their demo versions and can definitely see them being useful for future work especially in conjunction with ePADD which I’ve also written about before. I am very enthusiastic about the potential use of things like ePADD for both appraisal and access to email collections. There was more discussion about cataloguing email collections and I’m sure we’ll be looking to talk more about cataloguing born digital in a future session.

In the meantime I think I’ll get back to sniffing around after those institutional email accounts!

*update: problem now solved: a file path issue 🙂

AURA Network Workshop: Open Data vs Privacy

National Library of Ireland (By YvonneM – Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15120535)

The last non-virtual conference I attended was back in January in London at the Archives, Access and AI conference which I blogged about here and that work has been built upon to form the AURA Network (Archives in the UK and Republic of Ireland and AI) which focuses on trying to unlock digital assets held by cultural institutions. So I was very excited to have the opportunity to travel to virtual Dublin (for the second time this year!) for this workshop which had a fabulous programme – and many thanks to Dr Lise Jaillant of the University of Loughborough and Dr Annalina Caputo of Dublin City University for putting together and co-ordinating a great event. Day One focussed on Open Data, Privacy and AI and Day Two on issues of access to Born Digital archives – very much interlinked topics. You can see the full programme here and all I will do is cover a few of my highlights.

The conference opened with a virtual tour of the National Library of Ireland from an architectural perspective, led by Brid O’Sullivan which was wonderful and really made me determined to visit, not least because she mentioned that they have the “second* best toilets in Dublin” (according to one newspaper anyway!). I’m adding it to my to-visit list for my next Dublin trip!

The opening session began with Rob Brennan from DCU exploring the difficulties of dealing with GDPR and personal data in the age of the digital deluge. The volume of information being created and sloshing around the place means that traditional or even more recent methods of tracking data (using spreadsheets for example) just don’t scale up – “please feel sorry for the Data Protection Office” said Brennan! The answer to this problem might be in using machine learning to create computational methods of identifying personal data and Brennan highlighted the work of the Data Privacy Vocabularies and Controls Community Group, part of W3C, who have been developing a taxonomy of privacy and data protection terms to assist with just this sort of issue. Hopefully this work will start to remove the headache for all DPOs and anyone else dealing with sensitive data – this is potentially valuable for all archivists and records managers.

This was followed by a wide ranging and thoughtful presentation from Rachel Hosker from the University of Edinburgh with her wonderfully titled “Beautiful Messy Data: Archival Access and Data Protection” which explored the issues familiar to archivists everywhere: digitised and digital archives can be investigated using data science methods but they are unstructured and it can be very difficult to identify and manage issues relating to privacy when you don’t know what the records might contain. And whilst machine learning and natural language processing give hope that some of these difficulties may be overcome using automated processes there is still much to be learnt about biases in these processes. Hosker highlighted new work by Lucy Havens, Melissa Terras, Benjamin Bach and Beatrice Alex which aims to unpick some of this.

Finally Frederic Saunderson from the National Library of Scotland invited us to consider the differences between data protection and privacy, which overlap but are by no means the same. The issues over access can be further complicated when rights management is exercised and practitioners need to be very careful to apply the appropriate access framework to different collections.

The round table session which followed covered topics as varied as semantic web technologies and data models to manage access, the FAIR principles, how AI failures are human failures and more on text mining. I really enjoyed how the different aspects of “access” were explored by those working in very different disciplines but all with a common goal of ethical access for the greater good and not to the detriment of the individual. It is very heartening to be at events where these challenges can be explored so that both researchers who want access and information managers who want to protect privacy can understand the challenges on each side.

Day two opened with another round table with more semantic web research, automated sensitivity reviewing, recreating serendipity in digital searching and the problems of disambiguating in large scale digital searches. I was particularly taken with Lucy McKenna (Trinity College Dublin)’s work on authoritative interlinking for semantic web cataloguing which is shared here. The panel reflected on the need to share skills and improve communication between disciplines, something we probably are aware of on one level but don’t spend enough time on….

Dublin – can’t wait to get back there for real! (Image by Claire Tardy from Pixabay)

Eilidh MacGlone from the UK web archive then opened up the next session talking about the work of the UK’s official web archive who work hard to capture the UK’s legacy web content, in parallel to the way the network of copyright libraries capture other published outputs. However they are necessarily restricted by what they can collect – no access to material behind log ins and so forth. A parallel presentation came from Joanna Finnegan and Della Keating from the National Library of Ireland who have responsibilities for the Irish web archive. They talked about successful use of crowd sourcing (such as Flickr) to help identify people and places featured in some of their collections.

Next came Paul Gooding from the University of Glasgow discussing his work looking at how researchers actually use digital collections – it is not easy to collect data on this in an ethical way and there is much that is “hidden” in terms of the way user analytics are exposed. Ciaran Wallace of Trinity College Dublin then continued the theme of user approaches by talking about how the historian approaches digital sources, although primarily the focus was on digitised sources and how a “definitive” history is made using accessible and discoverable resources. This is obviously always the case with whatever kind of archival collection we consider and both archivist and researcher have to be as explicit and transparent as possible about curatorial methods, decisions and dissemination. The final presentation in this session came from Gareth Jones from Dublin City University talking about search and access in broadcast media archives and was the reminder that I didn’t need that audio visual archives are extremely complex!

The afternoon concluded the workshop with another wide ranging round table discussing metadata and rights issues, interoperability and more on the semantic web, Covid collecting and social media archiving and the teaching of digital curation. Again lots of very current and interesting debate and by the end my head was bursting with thoughts and questions and things to follow up.

It might have been the end of the workshop but there are two more planned in 2021 and there is also a call for papers for a special issue of AI and Society entitled Shedding Light into the Darkness of Digital Culture. Abstracts are due by 11th January so if you feel you have something to say on any of these wide ranging topics then take a look at the call. I can’t wait to see these outputs and in the meantime am going to be looking at data modelling and ontologies with renewed enthusiasm!

Many thanks to Andrew Janes (UK National Archives) whose tweets I relied upon heavily for this summary!

*the best toilets in Dublin are in Brown Thomas apparently

Archivematica UK User Group Online Autumn 2020

Generic image search for Scotland brought inevitable Highland kuh Image by Frank Winkler from Pixabay

It seems like a million years since we last met as a user group but astonishingly it was earlier this year when we gathered in person (can anyone remember that?) at the University of Westminster. However we still want to meet and we are relying on one another even more now so it was fantastic to have the University of Glasgow host the event for us via Zoom so we could continue to meet and share ideas and experiences.

There was an appropriately Scottish feel to the meeting as we heard first of all from George Macgregor from the University of Strathclyde about the work they have been doing on integrating Archivematica with their ePrints repository. You can read the detail of the work here but it’s the result of a fantastic international collaboration with Concordia University in Canada and has taken a lot of time and hard work to come to fruition. Strathclyde (in common with many other Universities) use ePrints as their repository where they keep (note not preserve) academic research outputs. George was very candid about the fact that preservation was not part of the data management process but they are now making great strides towards integrating preservation practices into their management workflow.

Our second speaker was Sean Rippington from the University of St Andrews (continuing the Scottish theme) talking about their recent work. For the first part of his talk Sean spoke on a parallel theme to George – integrating their repository (Pure – not a repository system but used as one by many) with a preservation system. Thanks to ongoing development work by Jisc as part of their Open Research Hub there are now Adapters to enable integration which helped St Andrews to connect Pure with Archivematica. It’s a really encouraging piece of work which is available on GitHub and hopefully will lead to future interoperability work. Sean also updated us on a separate but fascinating piece of work on preserving HyperCard (early Mac) files. When attempting an ingest into Archivematica normalization didn’t achieve much either for preservation (complex METS files) or for access. As a consequence they turned to emulation to support the file but would look to developing a preservation pathway going forward.

By Jeff Keyzer from Austin, TX, USA – Goodwill Computer Museum, CC BY-SA 2.0, https://commons.wikimedia.org/w/index.php?curid=41058991

We had some time for discussion and talked about normalisation pathways for spreadsheets (answers on a postcard please). Online is convenient for actually attending events (we had attendees from a really wide geographic spread which almost certainly would not have happened in real life) but the chance to have informal catch ups with people facing similar issues is less easy to replicate online, although I am very interested in exploring ways of doing this…

Finally we were pleased to welcome Sarah Mason from Artefactual Systems to give us all the latest Archivematica news.

Reflecting on the event we can see how every institution has their own individual setup and needs which means a lot of different approaches are required but by sharing work and ideas hopefully we can all tackle our own challenges. I’m really looking forward to more events like this and will be looking to enhance the networking opportunities online so that we can get the best out of our events.

Autumn MidiPres

Image by Hans Braxmeier from Pixabay

I can’t believe that time has moved round so quickly – what happened to the summer (don’t answer that)? Already we’ve had the second meeting of the Midlands Digital Preservation Network. After the last meeting we sent out a feedback form to try and get a sense of what people might want to discuss. It was important for us from the outset to get as much feedback and input as possible. There seemed to be an interest in “transfer” and we interpreted this as “getting the stuff into your archive” which I have always maintained is one of the hardest parts of digital preservation – it doesn’t matter how fancy your preservation system is, if you can’t capture the “stuff” you want in the first place then you are not doing any kind of preservation.

I had been doing some work on using Sharepoint for depositing/collecting archives and I thought I would share my experiences with the group to get feedback, prompt others to have a go or just get people thinking a bit more about their record keeping and managing environment. A quick show of virtual hands suggested the majority of people at the meeting worked for organisations which used Sharepoint (even if they didn’t use it much themselves). On my part I have found myself using Sharepoint much more than I used to over the last six months as a way of working collaboratively and remotely and our Records Manager reports that the number of Sharepoint Libraries and Teams Sites in the organisation has mushroomed in the same period. This presents huge record keeping and compliance challenges as well as digital preservation issues but here is not the place to go into it (another blog perhaps…). However because of its ubiquity Sharepoint does seem to be a way of sharing and gathering collections. My experiments were purely in the context of internal transfers and I soon found it was beyond my skills to be able to set up the permissions to satisfy my data security requirements. Although my IT colleagues were able to help with this I still haven’t really been able to achieve what I want. I’d love to hear from anyone who has managed to set up something which they are happy with! We had some discussions about the kind of metadata you might or might not be able to capture plus that old chestnut, last modified date.

Laura then bravely offered to do some live tool demo-ing and she kicked off with showing the tools that they are using at the University of Nottingham to work with their audio visual collections. The thing about digital preservation work and trying to move forward with it is that it will depend on your specific situation and what your goals and priorities are. For Nottingham it was working with various A/V collections and in particular the collections of a local musician which itself has thrown up a number of digital preservation challenges. Laura first explained that she had found this guidance produced by New York University which I’m going to have a good look at myself. She then went on to walk us through two tools she had been using: Exact Audio Copy and IsoBuster for audio CDs and DVDs respectively. It gave us a chance not just to see the tools in action but also to discuss how some of the tools are used and we spent a bit of time discussing the pros and cons of disk imaging.

And to round off by popular demand Laura also live demoed Teracopy and those of us who have had a go with it (and I include myself in this) discussed the fact that it was a very useful tool but not an easy one to use (I for one find it quite complicated). Its checksumming ability is clearly of great benefit but just getting the checksums is one thing – working out how and when to use and check them is quite another!
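On the "how and when to use and check them" question, one common pattern is to record a checksum for every file at the point of transfer and then re-run the same process later to see what has changed. The Python sketch below illustrates that idea under some assumptions: SHA-256 as the algorithm, a plain dictionary as the manifest, and function names invented for this example. It is not how TeraCopy works internally.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Return {relative path: checksum} for every file under folder."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(before, after):
    """Report files that changed, went missing or appeared between two manifests."""
    return {
        "changed": sorted(k for k in before if k in after and before[k] != after[k]),
        "missing": sorted(k for k in before if k not in after),
        "new": sorted(k for k in after if k not in before),
    }
```

A manifest built at deposit and saved alongside the files gives something to compare against after every copy or migration. How often to re-check is a local policy decision that the code cannot make for you.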

Finally we agreed we would – where we could – share some of our procedures (must go and do that now!) and I think that would be a really useful discussion point for future meetings.

All the tools we have talked about are free to download in some version or another and we are all keen to keep the network system agnostic.

I came away full of ideas and thoughts of things I need to be getting on with – it’s really valuable to connect with others and get inspiration at times like these. Here’s to the next meeting – and as ever do get in touch with me if you want to know more (or join in!) rachel dot macgregor at warwick dot ac dot uk or DM me via Twitter @An_Old_Hand

We Miss iPres

If you are working in digital preservation and you are not collaborating, then you are on a lonely road.

Picture of me collaborating in the social event at WeMissiPres

In one way it doesn’t seem that long since iPres 2019 when I was lucky enough to be able to travel to Amsterdam to attend this excellent international conference. This year the conference was scheduled for Beijing which I would not have been able to attend under normal circumstances. But of course these are not normal circumstances and iPres 2020 has been moved on to iPres 2021. To plug the gap left by the postponement I was delighted to learn that an informal non-conference, We Miss iPres, was to be held. I wasn’t feeling inspired enough to propose a topic to present on but I did volunteer to chair a couple of sessions (which was quite hard enough work as it was!). A huge thank you to all the Friends of iPres who collaborated to make this non conference happen!

An online international event has to take place across a multitude of timezones so I didn’t quite make it to all of it. What I did hear was inspiring and thought provoking and I have already been pencilling up the programme to catch the parts I missed. With so many presentations I couldn’t possibly talk about everything I heard but I’m just going to pick on a few favourites that struck me for one reason and another – have a look through the programme and see what I mean about the variety on offer.

When looking for inspiration in my work I need look no further than Professor Michelle Caswell, whose keynote was one of the highlights of iPres 2019. She did not disappoint this year, inviting us all to imagine what “liberatory digital preservation” might look like, by which she meant digital preservation practice in which we consider things like what impact collecting/making available might have on oppressed and marginalised communities and who we are excluding by the way we practice. What resonated particularly with me was that Caswell invited us all to take time to close our eyes and imagine what new ways of practice might look like and feel and how they might be accomplished. In all aspects of preservation work it’s so easy to be swept along on the treadmill of processing and day to day matters that we don’t make time to reimagine our structures and processes, and this holds everyone back. Whether that is in dismantling oppressive structural practice, working towards new ways of working to support environmental sustainability or just trying to imagine how we might do more with less, we are always required to reevaluate and reassess our processes so that we can build fairer, more robust and more sustainable futures.

An example of how this might be put into practice was shared by Daniel Steinmeier (Dutch National Library) who discussed how inclusive practice could both help diversify collections and metadata and also involve and empower communities which would lead to more of the same. I’m really looking forward to reading more about this work in the future.

Barbara Sierman addresses WeMissiPres

Sometimes all this can seem very daunting, especially when stuck at home and faced with huge uncertainties on every front. So it was also heartening to hear a presentation from Somaya Langley (University of Sydney) talking about digitisation and digital preservation and focussing on examining workflow and infrastructure at the University of Sydney to reveal where work was needed.

If I don’t document this within 48 hours it’s gone from my head

Somaya Langley @criticalsenses

She also talked about celebrating small successes which is something I have repeated to myself on days when I feel like I haven’t achieved very much at all. Improving documentation is something which is permanently on my to do list and I will now add looking to see how digital preservation can be “baked in” to digitisation processes (and I think this means a longer and harder look at what those workflows and processes are).

I was really interested to hear about broader themes emerging in the digital preservation community and will spend some time looking at the US National Archives and Records Administration Digital Preservation Framework – a living working document which they are keen for people to use and get feedback from.

In digital preservation circles we often talk about “good enough” and try and consider what that looks like. We are all working with finite resources and need to optimise what we have for the best and most sustainable results. An interesting spin on the question of “what does good enough look like” came from Alex Garnett (Simon Fraser University) who considered the many and varied ways audio visual archives are made accessible – both amateur and professional (whatever that means!) – and what constituted “acceptable” levels of loss of quality in output. Something of a musing on the interplay between the technology and techniques available and the aesthetics and perceptions of “authenticity” there was much to think about and enjoy from his presentation (not least Big Buck Bunny)

There were other moments to savour from the conference including some wonderful films such as The Last Day of Bangkok Trams which Chalida Uabumrungjit from the National Film Archive of Thailand used to illustrate how they had promoted and advocated their work during lockdown (it was one of the most popular downloads and it is indeed charming).

And finally a personal favourite of mine was the film showcased by Andrew Davidson (Robert Gordon University) working on the Fraserburgh on Film project which connects community engagement, audio visual presentation and outreach in a truly magical way.

A wonderful unconference, so much to think about and good to know I am not alone on this road.

Introducing MidiPres*

Live demo at MidiPres (image courtesy of S Colbourne)

I’ve been a huge admirer of some of the digital preservation community networks which have emerged in recent years – I’m thinking here particularly of Aye Preserves and further afield Australasia Preserves and have thought how nice it would be to have something similar to go to. Circumstances conspired that I found myself at iPres in Amsterdam last year chatting to fellow UK Midlands preservationist Laura Peaurt (University of Nottingham) and we reflected on how ironic it was that we went all the way to Amsterdam to have the time and space to exchange ideas when we could be doing it nearer home. Fast forward a few months and Laura and I were preparing to launch an informal networking event maybe in Birmingham for Midlands colleagues to meet to share experiences and offer support, perhaps over a coffee or glass of beer… Fast forward a few more months and Laura and I realised we would not be meeting anyone any time soon in real life so we set up the very first online meeting aimed at anyone in the region interested in or working on digital preservation challenges (we’re still hoping it can be a physical meeting at some point!).

We chose to use Microsoft Teams because both of our institutions already support it and several members mentioned that their institutions blocked the use of Zoom. Teams allows us to share screens, chat and record which ticked the boxes and no one had a problem joining it.

To keep it simple the two speakers were Laura and myself. Laura and her colleague Sarah Colbourne spoke about the work that the University of Nottingham are doing on Covid-19 collecting, which builds on work already underway at the University collecting contemporary stories relating to the student experience, which you can read more about here. Laura also did a live demo of Conifer (the tool formerly known as Webrecorder) and it was really interesting to see the tool in use and hear the discussion about the benefits and drawbacks of using Conifer and of web capture in general. I think we all agreed it was useful to have the ability to capture something, and in quite a responsive way, but it was no substitute for the very complex task of full scale web preservation.

I led the second presentation which comprised a simple run through of using DROID, the National Archives’ tool for file format identification. As with Laura’s demo we weren’t claiming to show the “only” way of using the tool but just how we use it in the context of our own work. My main use cases are around capturing initial metadata, creating file lists to help with cataloguing, identifying duplicates for appraisal purposes and using the reporting functionality to plan the ongoing management of file formats. To me DROID is a great basis for “understand what you have” – surely the bedrock of collections management.
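For anyone wanting to try the duplicate-spotting use case, here is a minimal sketch in Python of how a file list with checksums might be grouped to flag possible duplicates. It is not what was shown in the demo. It assumes a CSV export with one row per file and checksums switched on, and the column names `MD5_HASH` and `FILE_PATH`, the function names and the example file name are all assumptions to be checked against your own export.

```python
import csv
from collections import defaultdict


def load_rows(csv_path):
    """Read a CSV file list into a list of dictionaries, one per row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def find_duplicates(rows, hash_column="MD5_HASH", path_column="FILE_PATH"):
    """Group file paths by checksum and keep only checksums seen more than once.

    The default column names are assumptions; adjust them to match
    the headers in your own export.
    """
    groups = defaultdict(list)
    for row in rows:
        checksum = (row.get(hash_column) or "").strip().lower()
        if not checksum:
            # Folders, or files that were not hashed, carry no checksum.
            continue
        groups[checksum].append(row.get(path_column, ""))
    return {
        checksum: sorted(paths)
        for checksum, paths in groups.items()
        if len(paths) > 1
    }
```

Something like `find_duplicates(load_rows("droid_export.csv"))` (a hypothetical file name) would then return a dictionary of checksums, each with the paths that share it. Matching checksums only tell you the bitstreams are identical, so deciding which copy to keep remains an appraisal decision.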

The meeting lasted two hours which went very quickly but I think a lot of people feel a bit screened out after longer than this. We asked for brief feedback on format and content and had some lovely comments:

Many thanks Rachel and Laura this DP group is just what we need to share experiences as well as theory.


I found it a very useful session, and feel that “Show and Tell” is a good way forward, complementing the more formal training available in other forums.


This is a very encouraging start to what we hope will be a community led resource. If you are in the UK Midlands Region and are interested in future events please do get in touch rachel dot macgregor at warwick dot ac dot uk or via @AnOldHand on Twitter.

*working title!

Archives Access and AI

I was really looking forward to this conference organised by Dr Lise Jaillant of the University of Loughborough and it did not disappoint. I took what I think must be a record fifteen pages of notes (yes I do use a real notebook) not to mention the countless tweets (see #AcArAi) so it will be impossible for me to do much more than summarise some of the highlights (for me at least) of this conference.

The conference brought together digital humanities scholars, archivists, digital preservation practitioners and others to discuss and share ideas about making archives accessible – either with or without the aid of machine learning/AI.

The conference was near Hackney Wick – an interesting part of London

Helen Mavin from the Imperial War Museum gave a fascinating insight into the very complex work being undertaken at the museum to manage and preserve over a million digital files deposited by the Ministry of Defence. This is material specifically covered by UK government legislation (the Public Records Act) so is not everything which the Museum collects or preserves (this was important when discussing their appraisal and retention strategies). Collections management and transfer procedures were out of date and key contacts were unknown or unclear. Mavin needed to establish robust criteria for retention and selection of digital materials to facilitate fast(er) and more efficient transfer. Once the material was transferred there needed to be standardised workflows which are media agnostic. Key challenges were around staff turnover and both a skills and resource gap – something which will be familiar to many.

I was really interested to hear from Jonathan Manton and Alice Prael from Yale University Libraries, who talked about their work on trying to centralise and standardise the workflows at their institution which comprises a number of libraries and a museum. They began by centralising the processing of imaging and file extraction from physical media which has helped standardise the process and assisted with tackling processing backlogs. They have also looked at email archiving, born digital catalogue description and network transfer. I was particularly keen to hear about the work they have been doing on describing born digital archives. This is something which really needs more discussion and action from practitioners – I’m finding it presents a very real headache for my own practice. Prael and Manton commented that the area which caused most difficulty was describing hybrid archives. I wasn’t surprised by this and, given this is the landscape we are going to be working in for some time to come, it’s one we should all be giving a lot of thought to.

The cataloguing of archives is closely linked to (but not coterminous with) access so I was also keen to hear Anthea Seles talk about the use of Machine Learning and Artificial Intelligence in archival processing and the extent to which archivists and information professionals should be (but are not necessarily) involved in the creation and use of algorithms and the ways in which data is explored, exposed and exploited. Seles’ talk gave a great deal of food for thought and my homework will be to read Cathy O’Neil’s Weapons of Math Destruction.

I also very much enjoyed hearing from Leontien Talboom who is undertaking a PhD jointly with the UK National Archives and University College London looking at the barriers and opportunities which exist in making digital archives available. So far she has uncovered what a huge amount of work still needs to be done in this area but I’m looking forward to the work coming out in full so we are better informed to overcome the barriers and exploit the opportunities. In the same panel another PhD student, Rebecca Oliva from the University of Glasgow, is looking at sensitivity reviewing and the extent to which it can be automated. Manual sensitivity reviewing is on the whole an opaque process, therefore lending itself well to automation and machine processing. However sensitivity is also very context specific, i.e. data can be sensitive in one context and not in another, so there are many challenges to be met. Again I am really looking forward to the outputs of this work.

On the final day it was good to hear from Jenny Bunn who invited us to ask “What can archivists bring to the (AI) party?”. The answer is (hopefully) quite a lot. AI has been around for a while now and it is starting to grow up. More people are asking for accountability in AI and this is where the archivist comes in – we are really really good at documenting and organising things!

We had an entertaining presentation from Caylin Smith and Andy Irving from the British Library on their struggles with making non-published legal deposit material from the British Library available. They are extremely hampered by legislation which has not yet caught up with the technology (a point which Paul Gooding from the University of Glasgow also addressed in his presentation) but have made great strides to improve what they can, certainly in terms of the user experience of accessing material on site. We also heard a fantastic call to arms from William Kilbride of the Digital Preservation Coalition which he has helpfully published as a blog post so I don’t have to report on it!

Many of the conference presentations are available here and the abstracts here. I thought this was a fascinating and thought provoking conference and I am still thinking about many of the themes which came up in it and hope to draw on them in my practice. Many thanks to Dr Lise Jaillant for all her hard work in putting this together.

Conference cake which like everything else was excellent!

Archivematica UK User Group meeting, University of Westminster

We met in the heart of London thanks to University of Westminster

It was an early 2020 start for the Archivematica UK User group but thankfully the weather was very kind for those of us travelling to London to our hosts, the University of Westminster, who kindly provided accommodation and refreshments for us. These meetings need caffeine and sugar!

The meetings are a great chance for users to get together and share their successes and their woes which tend to come in equal measure for anyone practising digital preservation (we are all agreed we’re all still learning!).

First to take the floor was Matthew Addis from Arkivum who talked about the Preservation Action Registry which is a bold but incredibly useful project to try and capture and share technical best practice for preservation actions in a human and machine executable way. Both Archivematica’s Format Policy Registry (of which more in a moment) and Preservica’s Linked Data Registry define rules for preservation actions but they are not interoperable, and the benefits if they were would be great for users of both systems and of any systems going forward. It was just the first in a series of presentations which touched on the theme of interoperability which chimes well with the Open Source nature of Archivematica. After all we’re all trying to work on a common task of preservation of a dizzying variety of digital file formats in a wide variety of contexts so the more working together we can do the better!

Interoperability: what we all want.

The open part of the agenda allowed us to give some thought to PDFs. Who loves PDFs? Not many of us it seems. You thought a PDF was a PDF (or maybe a PDF/A) and that was it? You’d be wrong. There is a wide variety of types of PDF and then there are PDFs which conform to the standard they are supposed to and those which don’t. And then there are those that will render correctly and those which won’t… Archivematica can and does tell you the type and conformity of the PDF but what you are going to do with this information and whether the PDF will render as it should are separate questions. We had some good debates on this and we referenced Paul Wheatley’s thought provoking (or is that provocative?) blog on the subject of validation, something we should all be giving a lot more thought to.

PDFs – ugh

I then took gross advantage of the fact that I was chairing the meeting by chipping in with my tuppence worth on normalization, sparked off by Evelyn McLellan’s recent questions to the user community more broadly about normalization options in Archivematica. I was interested to know whether people had been investigating Archivematica’s Format Policy Registry – a question which had been posed at a previous user group meeting a couple of years ago in Aberystwyth. There definitely seemed to be more confidence about having a go at tinkering with it and overall a questioning of whether the “normalize everything” approach was the right one. I am keen to get more people thinking proactively about their normalization choices – when, where and whether to normalize – and agree with Tim Walsh when they wrote about it in their response to Evelyn: that having a view is a responsibility of the institution because of the resource implications of large uncompressed normalized formats, but also that each institution is very context specific.

The final session before lunch was an introduction to work by the Wellcome Collection who are getting to grips with Archivematica and tailoring it to meet the needs of their fairly sophisticated workflows. I suspect I wasn’t the only one who was envious of their plans for full scale automation in many of their workflows and their contribution to the user community is going to be invaluable – all shared via GitHub. I’m excited about what the future holds and will be watching their work with keen interest.

After lunch Jen Mitcham from the Digital Preservation Coalition presented on their new project developing a guide to procurement for digital preservation systems and services. They are looking for ways to support both product users and vendors to get the best from the procurement process and there was some valuable discussion about the dos and don’ts – we all recognised the “manage expectations” advice because procuring a system does not mean that you have solved digital preservation – I think everyone in the room has found this out!

Systems procurement – we can all learn from each other.

Next up was Matthew Neely from Bodleian Libraries, Oxford talking about their work on Archivematica. Some of us had heard about this work at last year’s Archivematica Camp in London and via their blog but it was good to hear about where they are and what their future development plans are. The Bodleian have particular challenges around the size and scale of both their holdings and their staffing – user management is a key concern for them in the way it isn’t for smaller organisations. They are also doing some interesting work on reporting functionality as an add-on to Archivematica. This isn’t something which Archivematica does (as it isn’t part of its core functionality) but it’s always interesting to hear about this kind of work – one of the great benefits to Open Source software is that it can encourage and support a whole range of additional parallel services to suit individual needs.

The final talk of the day was from Stephen Mackey of Penwern Ltd, a consultancy firm that are involved in a number of digital preservation projects including work on the Connecting Europe Facility (CEF) eArchiving project (E-ARK), which has the bold and ambitious aim of introducing compliance and interoperability into the way that institutions work, looking at creating AIPs which can be read and shared by any system. Stephen shared the proposed data model with us and there was some interesting discussion around how disparate systems might achieve conformance in different ways.

Europe-wide interoperability and eArchiving

Finally no Archivematica User group meeting would be complete without input from Artefactual and we were really pleased to be able to welcome Sarah Mason, Systems Archivist at Artefactual, who is based in the UK which meant she was able to attend in person. She gave us the latest news on Artefactual developments including information about changes to normalization paths for video files (part of the work I had referred to earlier), some improvements to user acceptance testing which should make for smoother releases for updates and the integration of new PRONOM updates.

All in all it was a very successful meeting although next time I think we should factor in more general discussion time. I certainly stayed on an extra half an hour or so (thank you University of Westminster!) exchanging experiences in an informal environment which is in part what these meetings are all about. I’m looking forward to the next one already!

iPres 2019, Amsterdam

Lovely Amsterdam

iPres is the International Digital Preservation conference which I am lucky enough to have attended twice now (the first was three years ago in Bern which I wrote about here).

It’s a massive conference which is far too big to get round all of and thankfully (either for those not able to clone themselves or for those not able to come) includes online papers and also online collaborative note taking, which means it is possible to catch up on some things at a distance or even later.

So I just thought I’d highlight some of the sessions which I enjoyed – there were many I missed – and what my main take aways were.

Arriving at the venue by ferry was a big hit with everyone – like the United Kingdom the Netherlands is a maritime nation and it’s great to remember that it’s not just about canals. The venue at the EYE museum was stunning on the waterfront but it did involve a lot more climbing than I had bargained for, particularly in a flat country like the Netherlands…

Eye Museum of Film, Amsterdam

The kick off session was a workshop on the Preservation Action Registry – this was really useful for me as it helped me understand more about how I might document my actions better. Documenting what we actually do and also what we don’t do and indeed what we used to do but don’t any more. It might mean capturing the “people and process” part of preservation in a machine readable way. I got much more insight into how I might analyse all the processes which go into preservation work to create far better and more useful documentation strategies.

Then it was straight off to the Digital Preservation Coalition’s DP Anonymous where we were invited to share stories about digital preservation challenges (we don’t say failures) and it was gripping stuff (although as it’s Chatham House Rules I can say no more than that). I presented on my recent struggles with setting up a virus checking workflow and it was great to share because I had some practical helpful suggestions of what to do next.

The conference proper opened on Tuesday with a keynote from Geert Lovink, writer and activist. There was a lot to like about this and I think I was not the only person to respond positively to his call to value our networks – particularly those which take place in the same physical space. Lovink is a very persuasive speaker but at the end – partly because I come from a different political and cultural angle – I found I didn’t agree with him on all counts and instead was inclined to agree with comments from the floor from Leslie Johnson of NARA describing successful networks supported in spite of or indeed because of existing corporate structures.

Delicious conference catering

As I have outlined in a previous post I am really interested in skills development in the archives sector so it was a really good opportunity to hear about a number of projects from around the world looking at this. And at an international conference there is the opportunity not just to hear about projects but also meet the people driving them and benefit from their experience so I was particularly thrilled to meet Angela Beking from Library and Archives Canada and Jaye Weatherburn from the University of Melbourne who are variously spearheading initiatives to help fellow professionals develop their skills. Beking presented on her work developing a collaborative model for knowledge transfer aka “digital detention” which got great feedback from the staff who were undertaking it. Weatherburn meanwhile has been instrumental in leading Australasia Preserves which is aiming to support the growth in a community of practice across a large geographic region. All of this has given me a huge amount of food for thought and I hope to be able to build on this community development work in the future.

A strong theme of the conference

The keynote on Day Two was especially welcome as I wasn’t at the 2018 Archives and Records Association conference so missed that opportunity to hear from Professor Michelle Caswell of UCLA. I have recently read her piece on Feminist Standpoint Appraisal and it was great to be reminded how all of us – whatever role we play in safeguarding, curating or making archives and/or data available have a role to play in ensuring that this is done with equity and it does not reinforce the hierarchies of oppression. None of us are neutral operators and we and the collections we manage are a part of society and if we want to see a change, we are the ones to enact that change through our practice.

I really enjoyed the poster sessions and it’s such a privilege to be able to talk to people directly about their research. My main criticism was that I couldn’t get round them all but I did enjoy hearing from, amongst others, Merle Friedrich of the German National Library of Science and Technology about their analysis of AV file formats which complemented the poster from the Open Preservation Foundation on significant properties of spreadsheets, both examples of studies which lead us all to a better understanding of formats.

Me on the big stage

On the Thursday I really enjoyed the lightning talks – despite giving one myself (which is not what you might call enjoyable). The range and breadth of topics covered and calls for contributions was fantastic, from Harvard’s Wolbach Library Project Phaedra, through the TRUST principles being developed for digital repositories, to the file format work happening at NARA, and I think the session included the best conference slide on distributed storage services.

Digital preservation is all about the unicorn magic!

iPres 2019 was a great conference and I’m just sorry I didn’t have time to see a bit more of Amsterdam. It was a great privilege to attend and particularly exciting to be able to speak at the ad hoc session. A massive thank you to the organisers and all the participants – a conference is made by the community after all. I hope to be able to spend a bit more time looking at the contributions and putting into practice what I learnt from the conference.

Reward for all our hard work