Archivematica UK User Group Meeting November 2018

Modern Records Centre, University of Warwick (Image: Modern Records Centre)

Yesterday it was a privilege for us to host the autumn Archivematica UK User Group meeting here at the University of Warwick – the 9th User Group meeting since its inception in 2015. I wasn’t at that first meeting but I have been to all of the rest, and they are a great opportunity for people who are interested in, experimenting with or using Archivematica in full production mode to get together and discuss their experiences.

I used host’s privilege to kick off proceedings by giving a brief introduction to where we are at the University of Warwick, which I illustrated like this:

Ain’t no mountain high enough (Image: Pixabay)

It does sometimes feel like we have a mountain to climb. We have various issues with our installation of Archivematica to sort out, and once we’ve got those sorted it’s on to the really tough stuff!  We know that the future of digital archive processes is going to be about dealing with large quantities of material, so we need to work on automating as many of our processes as possible.  A good place to start is automating the capture of descriptive metadata, along with as many of the ingest processes as possible.  There are so many questions and I hope to be able to report on our progress at future meetings.

Next up we heard from Jenny Mitcham of the University of York at her final Archivematica UK User Group meeting before she moves on to pastures new at the Digital Preservation Coalition. Jenny was reporting back on her work on some old WordStar files which form part of the Marks and Gran archive. She has already blogged about her adventures with these files and she came to the meeting to report on her most recent work using the manual normalisation function in Archivematica.  Jenny emphasised that this work is incredibly time consuming and requires lots of experimentation and QA.  The work involved test migrations of the files to different formats – PDF, TXT and DOCX. By comparing the migrated results to the originals viewed in a copy of WordStar, which Jenny has running on an old machine in the corner of her office, she could see that each normalised format captured some of the properties of the original but none of them captured them all. There was a further complication in that some files had the same name (but with different extensions), which Archivematica does not like. On top of all of this, PREMIS metadata has to be added manually to record the event, and this gives not entirely satisfactory results in terms of the information it represents (or doesn’t represent). The whole normalisation process is long and complex and is summarised with a short and not entirely decipherable PREMIS entry.  Jenny’s main takeaway is that Archivematica struggled in situations where the normalisation path was unclear.  Any of the three normalised files she produced could serve as the preservation or access copy, but Archivematica does not allow for more than one of each.

Following these presentations the group had a short discussion on the appraisal tab feature in Archivematica. We had previously asked people to test it out and report back on whether they thought it was a feature they were likely to use or not.  A relatively small number of people said they had tried using it, and possibly this reflects the difficulty of use and/or the fact that the feature was designed for use specifically with ArchivesSpace (and therefore doesn’t necessarily integrate with AtoM or other systems). There also followed some discussion of how much appraisal people were likely to do in Archivematica (as opposed to before ingesting into the system). There was some feeling that it might be more useful if it did integrate with AtoM, but this of course would require development work.  Food for thought…

There was also a short discussion on how much IT support an institution might expect to need to run an instance of Archivematica. Admittedly this is a bit of a “how long is a piece of string?” question, but there were some valuable contributions around how advocacy was needed to engage IT support colleagues, which might lead to more of a feeling of ownership and help develop enthusiasm and experience (they go hand in hand). There was also discussion of costings and creating a business case (the Digital Preservation Coalition got a name check here).

String: how long is it? (Image: Pixabay)

After lunch we heard from Hrafn Malmquist from the University of Edinburgh who was updating us on his work automating their Archivematica workflow.  We heard at the last meeting about the beginnings of this piece of work, creating an integration between Archivematica, DSpace and ArchivesSpace.  I was extremely impressed by the way in which the SIP is processed and produces two AIPs, one of which goes through to a dark archive and the other to DSpace.  The DIP which is produced is then also pushed to DSpace, which then creates a link to ArchivesSpace.  So far, just getting all the storage and access locations working smoothly is impressive enough, but Hrafn says there is more to do – for example the DIP file structure is flat where it should be more hierarchical.

Matthew Addis suggested this is how we felt about dabbling with the FPR… (Image: Jeff Eaton: https://www.flickr.com/photos/jeffeaton/6586676977 CC BY-SA 2.0)

Next up was Matthew Addis talking about his “journey into the FPR”. For many Archivematica users (at least for those of us who discussed this at the Winter 2017 meeting in Aberystwyth), the Format Policy Registry is a thing to be approached with extreme caution. Archivematica offers the user the option of customising the normalisation pathways, although as we saw with Jenny’s presentation, approaches to normalisation are extremely complicated and often require a decision-making process based on “least worst” options. Matthew’s normalisation work was around Office documents and emails.  One example was creating a normalisation path from PowerPoint files to PDF/A, where the process is lossy as animations, fonts, comments and all sorts of other content are lost.  Normalising to an Open Document Format might be preferable, but this format is not widely supported, making the files relatively inaccessible.  Analysing files for significant properties is extremely complicated and time consuming and in the end not easy to quantify: how do you measure whether one particular property has “more” significance than another if you are trying to compare processes? Another challenge was that Archivematica only supports one input format, one tool and one output format per normalisation rule, whereas sometimes more than one format and more than one tool might be involved in normalising a file. It was good to be reminded just how complex office documents are, and what a headache they cause for anyone planning for future resilience.
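For anyone curious what this sort of normalisation looks like outside Archivematica, here is a minimal sketch using LibreOffice in headless mode – the file and directory names are illustrative placeholders, this produces plain PDF rather than genuine PDF/A (which needs extra export-filter settings), and it is not Matthew’s actual FPR command:

```python
import subprocess

# Illustrative only: convert a PowerPoint file to PDF using LibreOffice in
# headless mode. A custom FPR rule would wrap a command of roughly this shape;
# genuine PDF/A output needs additional export-filter options.
subprocess.run([
    "soffice", "--headless",
    "--convert-to", "pdf",        # lossy: animations, fonts and comments will not survive
    "--outdir", "normalised",     # hypothetical output directory
    "presentation.pptx",          # hypothetical input file
], check=True)
```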

Our final presentation for the day was from Alan Morrison from the University of Strathclyde. He took us through their Research Data Management workflows using Archivematica. They share an instance of Archivematica with their Archives and Special Collections but there is little overlap between the two services. At present Archivematica is used just to create AIPs, which are then stored on local network storage. DIPs are not created because Strathclyde use the front end of their institutional Research Information System (the database which manages all the research outputs) to make the datasets discoverable.  Alan recognised that there are ongoing issues, not least poor interoperability between systems and too many manual actions which lead to human error.  But there was much to look forward to as well, such as a possible development of dashboard monitoring to aid management of the AIPs and the development of a plug-in to integrate with an ePrints repository. He also mentioned a possible Scottish Archivematica Hub (given there are a number of Scottish institutions using Archivematica).  We’ll definitely be looking forward to hearing more about this in the future.

To wrap up the day we were delighted to hear from Kelly Stewart of Artefactual Systems, making an early start in Vancouver to give us an update on Archivematica developments at their end.  We’re looking forward to the release of Version 8.0, which is imminent, and we’re excited to hear about a possible Archivematica Camp UK/Europe – are there any interested hosts out there?

Overall I really enjoyed the day – there was a lot to think about and I gave myself a couple of pieces of homework which I must get on with sooner or later.

If you are interested in Archivematica and would like to join the group or just attend a future meeting to be able to chat to fellow users then do get in touch with me rachel_dot_macgregor_at_warwick_dot_ac_dot_uk

It takes a while to mature

I wrote in my last post about how I was looking for more resource so that I can make progress on various outstanding preservation tasks. This is not a speedy process, so in the meantime I am looking at ways to help in the search for more resource and also at the ways in which I should be deploying the resources I do have. It seems like a good time to write a roadmap which will hopefully help articulate the vision of where we are headed and identify concrete objectives and priorities, to help others understand the work we are trying to do.

First of all I would like to undertake some sort of audit of where we are as an organisation. I have long been an advocate of the NDSA Levels of Digital Preservation and if you have met me you have probably heard me banging on about them. I even have them pinned up next to my desk (I stole this idea from Jen Mitcham), alongside my favourite xkcd cartoon.

My desk

These are a great starting point but I’m looking for something a bit more in-depth.  This is where I’ve turned to maturity modelling, a method of assessing and scoring where an organisation is in different areas, to help define where improvements could be made and highlight the areas which need the most attention. To help me undertake this assessment I looked at the suggestions in the Digital Preservation Coalition’s Digital Preservation Handbook and also turned to Twitter, not least because that’s a place where many of those who have developed these models are to be found.

The Digital Preservation Capability Maturity Model was one of the suggestions that came back and is definitely one I am interested in; it can be found here. The Assessing Organisational Readiness toolkit proved harder to track down (as the Twitter conversation suggested, there was a link rot issue) but I managed to get hold of a PDF version with another call-out to Twitter (it would be great if there was some way of hosting it somewhere…).  The AOR toolkit is also very useful; it is based on the 2009 Jisc AIDA toolkit (also hard to find) and the CARDIO Research Data Assessment. This is also helpful because Warwick’s Research Data team have been developing their own roadmap using CARDIO and we are obviously keen to develop our services in a joined-up and collaborative way.  The third suggestion, which I’m going to look at closely, is Kenney and McGovern’s “Five Stages of Digital Preservation” (http://dx.doi.org/10.3998/spobooks.bbv9812.0001.001), which was not hard to track down and has its own DOI, giving at least some assurance that link rot will be less likely.

I’ve started going through these models and each has different things to offer which are more or less useful to my particular situation. Every institution has its own priorities and ways of working and there is no one approach to digital preservation which will be applicable across the board. The roadmap I want to develop will hopefully help in the following areas:

  • establishing my digital preservation priorities
  • working out how to develop and move forward with preservation activities
  • highlighting areas for collaboration within the organisation
  • raising the profile of digital preservation work within the organisation
  • helping make the case for additional resources based on an analysis of our current position

Using my assessment tools I can then identify my stakeholders and work towards a better understanding of where we are as an organisation and how we move forward.

So for now it’s back to my beloved spreadsheets and time to do some scoring!
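To give a flavour of what I mean by scoring (this is purely illustrative – the categories are borrowed from the NDSA Levels and the numbers are invented, not our actual assessment), the spreadsheet logic boils down to something like this:

```python
# Purely illustrative: score each area (categories borrowed from the NDSA
# Levels of Digital Preservation) and pick out where to focus effort.
# The numbers are invented, not a real assessment.
scores = {
    "Storage and geographic location": 2,
    "File fixity and data integrity": 1,
    "Information security": 3,
    "Metadata": 2,
    "File formats": 1,
}

weakest = min(scores, key=scores.get)   # the area needing the most attention
average = sum(scores.values()) / len(scores)

print(f"Weakest area: {weakest} (level {scores[weakest]})")
print(f"Average level across areas: {average:.1f}")
```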

You’ve got mail

Email preservation: it ain’t easy.  I knew that before I even started looking at it, but there were a number of factors which prompted me to begin having a look at what we could do here at MRC.  I have been spending quite a bit of time getting to know the collections here and getting a feel for what sorts of digital material we have.  This is normally the very first step in undertaking digital preservation work.  Understanding the collections means we can then prioritise and target particularly vulnerable formats, and make plans to tackle the media and formats which will cause particular issues (e.g. 5.25 inch disks, 3.5 inch disks and so on). I have not completed this process yet (there are several thousand accessions to pick through) but I have turned up some emails included in the collections in a variety of forms: printed out, copied and pasted into word-processed documents and so on.  We need to be preparing ourselves to deal with a new deposit of material which we might be offered from a trades union, activist or one of the many other people or organisations which come under our collecting policy, as it is almost certain to include email. The fact that it hasn’t yet become a huge issue is most likely because we haven’t yet asked the question.

I am also starting to think about how we collect archives from the university itself more effectively, and inevitably this includes email correspondence in a number of different settings.  I probably wouldn’t be able to even consider this were it not for our Records Manager, who is working on the front line of records creation. I don’t think I can emphasise enough that it doesn’t matter how many all-singing, all-dancing technological solutions we put in place to “preserve” the digital stuff: if we don’t actually have people and resources in place with record creators (in whatever capacity that might be) we can’t hope to capture the things that are really important (whether for cultural, legal, evidential or other reasons).  Preservation: it’s all about resourcing.

Whilst we may be a little way away from tackling some of the complexities of emails as archive collections, a more pressing use case for us is to preserve the correspondence which accompanies our collections and include it in the submission information of our SIPs ingested into Archivematica. So I have been looking at some tools which would be useful in this process and thinking about what they can do and how they might fit into a transfer/deposit/appraisal/ingest workflow.

ePADD, which is developed and maintained by Stanford University, describes itself as “the all-in-one email appraisal, processing, discovery, and delivery solution for donors, archival repositories, and researchers.”

It’s worth noting here that it does not mention preservation – that isn’t actually what ePADD does – although it’s often mentioned in the same breath as various other preservation tools.  ePADD is designed to help with the acquisition, appraisal and management of email collections, so particularly around capture and content management.  I would argue of course that this is all part of the preservation process, but it’s fair to point this out as other tools will be needed for the processing and preparation of emails.  I have also been looking at Emailchemy and Aid4Mail, both of which help with converting email export packages into preservation formats.

ePADD comes with excellent and very clear instructions on downloading and using the software, and there is a detailed and active community forum. ePADD can link directly to a mailbox, where you can select folders to capture emails from, or you can upload emails which you have exported from a system in MBOX format.  This is the standard export file format used by many email clients but NOT (inevitably) Outlook, which uses the proprietary .pst format.  This is where programs such as Emailchemy or Aid4Mail come in – they both enable the user to convert .pst files to MBOX format.
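As an aside, once you do have an MBOX file (however it was produced), it is easy to take a quick look inside it before loading it into ePADD. Here is a minimal sketch using Python’s built-in mailbox module, with an illustrative filename:

```python
import mailbox

# Illustrative filename: an MBOX export of deposit correspondence
mbox = mailbox.mbox("deposit-correspondence.mbox")

print(f"{len(mbox)} messages in the export")
for message in mbox:
    # Print a few headers to sanity-check the export before appraisal
    print(message.get("Date"), "|", message.get("From"), "|", message.get("Subject"))
```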

I got ePADD installed on my pc but immediately ran into problems…

[Screenshot: ePADD error message]

The good news is that Josh Schneider at Stanford was extremely quick off the mark with a diagnosis – that my pc did not have enough RAM.  He suggested running the Java version from the command line where you can specify how much RAM to allocate to the program.  I’m not a very techy person so although this sounded a bit daunting, again the excellent instructions for ePADD meant I could do this.
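For anyone else who hits the same wall, the essential trick is Java’s -Xmx option, which sets the maximum heap size. As a minimal sketch only (the jar filename and the 4 GB figure are illustrative placeholders, not values taken from the ePADD documentation):

```python
import subprocess

# Illustrative only: launch the ePADD jar with an explicit maximum heap size.
# "epadd.jar" and "-Xmx4g" are placeholders; use the jar name shipped with
# your ePADD release and a memory value your machine can actually provide.
subprocess.run(["java", "-Xmx4g", "-jar", "epadd.jar"], check=True)
```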

However:
[Screenshot: another error message]

Oh well.  At least it confirmed Josh’s diagnosis. And gave me a next step, which is to get more RAM behind me and hopefully get testing ePADD properly.

I was hoping this post would be documenting my adventures with email acquisition and appraisal but I’ll have to leave that for another day.  It came at a good time however as ePADD has been nominated for a Digital Preservation Coalition SSI Award for Research and Innovation. Based just on my experiences so far my vote is definitely going to ePADD as the documentation and support have been excellent and it looks as if it’s going to be a product with a lot of possibilities for us.

In the meantime – as with so much in digital preservation – I’m just going to have to look for more resources.

New beginnings

These are exciting times for me, starting a new chapter at the Modern Records Centre, University of Warwick – a great opportunity to get my teeth stuck into a whole new set of challenges.  I’m really going to miss the team at Lancaster University, where I started out on my preservation journey, but I’m looking forward to hearing about everything they get up to in the future.

Meanwhile I’m just completing my second week here at the Modern Records Centre (MRC), where great work has already been done on implementing a live instance of Archivematica and ingesting both digitised and born digital material.  I’m really lucky because my predecessor left tons of useful notes and guidance for me to pick up.  My task is to move things forward and try to scale up the processes, so that we move from the current manual upload and processing of content to a more automated, scalable approach.  This will help us tackle the backlog of digitised material which is awaiting ingest and also deal with the born digital material we are beginning to receive.

So my main focus in the first couple of weeks is around understanding the workflows and processes that take place already here at MRC with the creation of digital content and the cataloguing of both digitised and born digital materials.  It’s made me start to think (again) long and hard about the best approaches to cataloguing born digital materials.  I have returned again to the very excellent National Archives’ Digital Cataloguing Practices paper and also the University of California’s Guidelines for Born Digital Archival Description.  Any other suggestions very welcome to help get my thinking going! We are dealing largely with hybrid collections so all approaches need to take into consideration the legacy and current cataloguing methods. There is no clean slate or break – any developments need to work for the users and the archivists currently managing the physical collections. It’s going to be a collaborative effort and I’m looking forward to the challenge.

I’m also pushing a few digital assets through Archivematica and it is *very slow*. I’m hoping to concentrate on improving performance, particularly with a view to scaling up. As ever, there are those who have gone before, especially some fantastically useful blogs from the Bentley Historical Library and Jenny Mitcham at the University of York, not to mention the Archivematica User Forum.  In fact I’m writing this whilst waiting for a SIP to appear in my ingest tab (I hope I haven’t broken it already…).

Not the actual cake I baked – that got eaten…

So it’s busy busy busy here, although I have had time to bake a cake which hopefully has got me off on the right foot with colleagues.

Now – off to track down that missing SIP….

Software Matters

Me thinking about software preservation

I’m going to let you into a secret.  Well, it’s not a secret really, but whilst I have been gamely (trying to) take on the challenges of various technical aspects of digital preservation, my approach to software preservation has been decidedly that of an ostrich.  I have been firmly sticking my head in the sand over this! It’s partly because I really don’t feel like I know enough about software writing in the first place, and partly because someone else is doing something about it, aren’t they…?

Databases, websites, video and so on are complex digital assets which I am only too happy* to tackle, but somehow software seems a step further.

However a couple of things made me rethink my position.

The first was a recent talk from Neil Chue Hong of the Software Sustainability Institute at one of our Lancaster Data Conversations.  He discouraged me and encouraged me in equal measure.  Discouraged me because, when addressing a room full of code writers, he asked them to consider how they might access their code in three months’ time.  Three months!  What about three years or even three decades?  If people are not even considering a lifetime beyond three months, I’m starting to wonder if it’s worth getting involved at all.

However, on a more positive note, Neil was keen to promote good practice in software writing and management and recognised the barriers to maintaining and sharing code.  As with most preservation work, the key is getting the creators to adopt good practice early on in the process – the upstream approach which I’ve alluded to before, which has been around a very long time and is indeed what makes digital preservation a human project.  In order to support good practice and to build the right processes for managing software, though, a better understanding is required.

Can we build code to last?

The second thing was the realisation that I was already responsible for code in our data repository – for example this dataset here which supports a recent Lancaster PhD thesis.

We don’t have a huge uptake from our PhD students for depositing data in the repository and we are especially keen to encourage this because we want our researchers to get into good habits early.  Neil explained in his session that there were particular barriers to certain researchers sharing data – early career researchers amongst them – as there is a fear of sharing “bad” code.  But as he pointed out – everyone writes bad code and part of the advocacy around sharing is getting over the fear of “being found out”.  From my perspective – if people are willing (or even brave enough) to share their code I want to make sure that as someone charged with digital preservation I can try and create the optimal environment for software preservation into the long term.

At the moment I think we have some way to go on this, but thankfully help is at hand with the Jisc/SSI Software Deposit and Preservation Policy and Planning Workshop.  This “first workshop will present the results of work done by the SSI to examine the current workflows used to preserve software as a research object as part of research data management procedures.”

Sounds good to me.

What do I hope to get out of this?

  • I’m really interested in getting the right metadata to support the preservation of software
  • I’m keen to hear what other people are doing!
  • I want to know where I can best direct my efforts and the areas I need to concentrate on first to get up to speed with providing excellent support

I’ll be reporting back soon on what my next steps are…

*or is that grimly determined?

Happy accidents: adventures in web preservation

A happy accident led me into exploring web preservation.  I was doing (or trying to do) some file format id-ing and realised I needed to document information relating to specific software.  Web preservation was something I confess I had been “putting off” because it “looked difficult”.  I mean, everyone says it’s difficult so it must be, right? But inspired by a digital preservation mantra – ‘don’t let the best be the enemy of good enough’ – I decided that if I wanted to capture information on the web, and not find that the link had rotted when I came back to it, I would need to explore ways of “preserving” it.  Oh wait – that’s like web preservation, right?  So, armed with a use case, I thought now was as good a time as any to experiment with web preservation tools.

So I started with a tool called Webrecorder – I had read about this but not had a chance to play with it.  Using it was pretty straightforward – you need to register and log in, and then you create collections (say, for example, related websites or themes) which you can add to at a later date.  The basic principle is that each time you “start” a recording you can hop to a website and it will capture each link you visit – including PDFs and other material (I haven’t tested it for video content – note to self – do this next!).  The tool appeals to the archivist in me because it captures everything, including the relevant metadata about the capture, and you can link “recordings” (i.e. sessions when you did the web capture) together.  I see it as great for personal digital archiving, which is another thing I’m interested in developing as an advocacy tool.  It’s also useful for small-scale sweeps like the one I was intending, although for bigger projects something more automated would be required.

Also – and this is a big also –  this tool captures web sites but it doesn’t preserve them.  Like any digital preservation activity you can’t just have a tool which will “do it for you”.  The tool is only as good as the systems which you link it to.  In the case of Webrecorder the tool allows you to download your capture as a zipped WARC file – which is great as this is the format developed for capturing “web accessible content in an archived state”. Recordings from Webrecorder can then be downloaded and ingested into a preservation system and managed from there.  Brilliant!

However (and there’s always a however) I want to check and access my WARC files. Thankfully Webrecorder comes with a player which allows you to “play back” the captured web pages.  What I want to do next is experiment with using other web capture tools and playing the results back with Webrecorder Player, and also playing Webrecorder-captured files using other playback methods.
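Playback aside, a quick way to check what a WARC actually contains is to read it with the warcio library, which is maintained by the Webrecorder team. A minimal sketch, with an illustrative filename:

```python
from warcio.archiveiterator import ArchiveIterator

# Illustrative filename: a (gzipped) WARC downloaded from Webrecorder
with open("my-collection.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # List the URLs of the captured responses
        if record.rec_type == "response":
            print(record.rec_headers.get_header("WARC-Target-URI"))
```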

Webrecorder is a great system for people (like myself) who don’t have a huge amount of technical know-how but I would like to explore other tools and systems which might require a bit more investment in time for set up and installation.  The key things I want to explore are around automation and integration with our existing systems and workflows.

What I need to do:

  • spend a bit more dedicated time exploring and comparing tools
  • keep a log of my experiences (blog or other platform)
  • think about contributing to COPTR (I notice Webrecorder isn’t on there except in the wishlist column…)

What I need help with:

  • understanding the WARC file format
  • understanding more about the crawl process – what can/can’t/should/shouldn’t be attempted
  • understanding more about the metadata which is captured
  • and a whole lot more about automation processes

Next I want to have a go with WARCreate, which is a Google Chrome plugin.  I got as far as installing it but it slowed down my browser performance so much that I took it off again…

Wish me luck!

 

Digital Ambition

Last week I was delighted to travel to Hull History Centre to speak at the Archives and Records Association‘s Section for Archives and Technology Digital Ambition training session about digital preservation.  The audience was made up of archives professionals, interested but not necessarily specialist in digital preservation.

We were welcomed to Hull History Centre by University Archivist Simon Wilson who gave us an overview of their project to capture the Hull City of Culture 2017 events.  This is extremely ambitious in its scale and complexity and he talked us through some of the more challenging aspects of working on a collaborative time-limited project.  They are learning lessons from The Olympic and Paralympic Record project run by the National Archives in capturing the digital elements of a cultural event.

Hull: Weeping Window (author’s own CC-BY) https://www.hull2017.co.uk/whatson/events/poppies-weeping-window/

Next up was Jen Mitcham of the University of York who shared her experience of using Open Source software.  She introduced the context by way of a great Lego video

and then went through the challenges and benefits of using Open Source solutions.  Using Open Source definitely comes hand in hand with working as part of a community – and community building is a recurring feature of digital preservation, which relies on input and support from a wide variety of places.  There was discussion about how different institutions and organisations have a very different take on how far they embrace the Open Source model – unsurprising given the range of organisations represented and the different needs and priorities they have.

Lunch was a great opportunity to meet up and catch up with friends and colleagues working on various aspects of digital preservation.  There was quite a lot of talk about developing digital skills for archivists – which was the theme of my presentation – and interestingly, shortly after the event, Library Carpentry put out a call for an “Archivists Wish List” of skills – please consider contributing to their call for suggestions or even become part of the community by taking part in their ideas sprint.

In the afternoon we heard from Gary Tuson from Norfolk Record Office, who is leading an ambitious consortium bid from East Anglia to procure preservation services at scale for five organisations of different sizes and needs.  It’s impressive stuff and not without challenges – especially in dealing with sensitive data – but an important and worthwhile exercise.  Even the scoping of the project is a huge learning experience, says Gary.

Image: Pixabay CC0 (https://pixabay.com/en/computer-display-electronics-1869236/)

Finally I had my slot, and I talked about the transferable skills which all archivists have, in a presentation I had rather cheesily called “From Secretary hand to software”.  The idea was to encourage those present to think about the skills they have as archivists and apply them to the issues around digital preservation.  Fragile formats in danger of obsolescence, metadata requirements, authenticity and provenance – all these key issues with regard to digital formats should be familiar to the archivist from a traditional background.  Granted, the technology is different, and this is where we are going to need help from our IT colleagues, although some basics are helpful.  In fact it would be useful to start bringing together some ideas – ARA’s Section for Archives and Technology are certainly interested in developing future training opportunities, and the Library Carpentry Archivists Wish List is another way of contributing to the skills development process.

My main regret of the day was not to spend more time in the City of Culture 2017 and my main message of the day was:

Do nothing and you are guaranteed to lose records.

Do something!

I tried to be helpful on this and suggested my two favourite articles, which I refer to time and again: the NDSA’s Levels of Digital Preservation and Tim Gollins’ Parsimonious Preservation.  The former because it offers a structured approach and goals to aim for, and the latter because it is short, accessible and reassuring.  I return to them often, and when I got back to Lancaster I put my “do something” into practice and started my long overdue inventory of digital holdings in our Special Collections.  Time to get planning!
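For what it’s worth, even a first pass at that inventory can be a very simple script – a minimal sketch (the path is an illustrative placeholder, and this is a starting point rather than a proper survey):

```python
from collections import Counter
from pathlib import Path

# Illustrative placeholder path for the digital holdings to be surveyed
holdings = Path("/path/to/special-collections")

extensions = Counter()
total_bytes = 0
for item in holdings.rglob("*"):
    if item.is_file():
        # Tally file extensions and total size as a rough first-pass profile
        extensions[item.suffix.lower() or "(no extension)"] += 1
        total_bytes += item.stat().st_size

print(f"Total size: {total_bytes / 1024 ** 3:.2f} GB")
for ext, count in extensions.most_common():
    print(f"{ext}: {count} files")
```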

Image: Pixabay CC0 (https://pixabay.com/en/student-typing-keyboard-text-woman-849825/)