How to package and archive datasets using BagIt workflows in iRODS

Meant for:

iRODS users

Requirements:

iCommands installed
BagIt packaging workflow installed, see How to enable packaging workflows with the SURF BagIt iRODS rule
connected to iRODS, see How to connect to iRODS using iCommands

Packaging data can be useful when a dataset or folder/collection contains many files (small or big) and needs to be archived, either for publishing or to reduce costs. However, packaging data before uploading can be a tedious operation. Here we show how an iRODS collection can be archived by using BagIt tools installed on the server.

Packaging a collection using metadata through a BagIt workflow

The BagIt packaging workflow searches for all collections with the metadata attribute "SURFbagit". It expects the destination resource as the value and one of the two keywords "copy" or "move" as the units. With "copy", the collection is archived and the original remains available; with "move", the original is removed and only the archive file is kept. Here we assume that the destination storage resource is called 'surfArchive'. To start the packaging workflow for a collection, mark the collection with the following metadata:

user@login:~$ imeta set -C /your/collection/to/package SURFbagit surfArchive copy::tgz::avu

This addition of metadata to the collection will communicate to the iRODS system that this collection is a candidate to be packaged with the BagIt format. Note that this collection will not immediately be packaged; this will happen asynchronously in the background. This is useful as packaging could take a long time for collections with large data sets.

The metadata of the collection will be changed immediately:

user@login:~$ imeta ls -C /your/collection/to/package
AVUs defined for collection /your/collection/to/package:
attribute: SURFbagit
value: archive
units: copy
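
To check the marker from a script rather than by eye, the 'imeta ls' output can be filtered. The following is a minimal sketch, assuming the output layout shown above (one 'attribute:', 'value:' and 'units:' line per AVU); 'bagit_avu' is a hypothetical helper name and not part of iCommands.

```shell
# bagit_avu is a hypothetical helper, not part of iCommands.
# It reads `imeta ls` output on stdin and prints "<value> <units>"
# for the first AVU whose attribute name matches the argument.
bagit_avu() {
  awk -v attr="$1" '
    $1 == "attribute:" { found = ($2 == attr) }   # start of a new AVU
    found && $1 == "value:" { value = $2 }        # remember its value
    found && $1 == "units:" { print value, $2; exit }
  '
}

# Usage (requires a connected iRODS session):
#   imeta ls -C /your/collection/to/package | bagit_avu SURFbagit
```

With the listing above as input, the helper would print the value and units on one line, which is easier to compare in a script than the multi-line listing.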

In the background, iRODS will find this dataset marked as a packaging candidate and package its files. It makes use of the bdbag tool set to ensure the validity and consistency of the dataset, so that data is not lost upon archiving.

In time, the dataset will be packaged and moved to the 'surfArchive' resource:

user@login:~$ ils /your/collection/to
  package.tgz
  C- /your/collection/to/package

If you have specified 'move':

user@login:~$ imeta set -C /your/collection/to/package SURFbagit surfArchive move::tgz::avu

then the original collection will be removed:

user@login:~$ ils /your/collection/to
  package.tgz

Note that by default the package will be compressed into a '.tgz' file. If you would like a different format, you can specify this while marking the collection:

user@login:~$ imeta set -C /your/collection/to/package SURFbagit surfArchive copy::zip::avu

If you do not wish the metadata to be stored inside the archive file, omit the '::avu' part.
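
The units string in the examples above combines up to three fields. As a sketch of how it could be assembled in a script, assuming the fields are always joined with '::' in the order mode, format, 'avu' as in those examples: 'bagit_units' is a hypothetical helper, not part of iCommands or the SURF rule, and it always writes the format explicitly.

```shell
# bagit_units is a hypothetical helper, not part of iCommands or the SURF rule.
# Usage: bagit_units <copy|move> [format] [avu]
bagit_units() {
  mode=$1
  format=${2:-tgz}   # the text above names .tgz as the default format
  case "$mode" in
    copy|move) ;;
    *) echo "mode must be 'copy' or 'move'" >&2; return 1 ;;
  esac
  units="${mode}::${format}"
  # Append '::avu' only when the metadata should be stored in the archive.
  if [ "${3:-}" = "avu" ]; then
    units="${units}::avu"
  fi
  echo "$units"
}

# Usage (requires a connected iRODS session):
#   imeta set -C /your/collection/to/package SURFbagit surfArchive "$(bagit_units copy zip avu)"
```

Rejecting anything other than 'copy' or 'move' guards against a typo in the mode, since the text above treats every value other than 'copy' as removing the original collection.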

Unpackaging to retrieve the collection

If you have chosen the 'move' option or have removed the original collection manually, and you would like to retrieve the collection and unpackage the archive file, you can initiate the asynchronous reverse operation by doing:

user@login:~$ imeta set -d /your/collection/to/package.tgz SURFunbagit surfResc move

Note that we use the '-d' option for 'imeta' as the archive file is a data object and not a collection.

After some time the collection is restored in its original form:

user@login:~$ ils /your/collection/to
  C- /your/collection/to/package
