Meant for:
- iRODS users
Requirements:
- iCommands installed
- BagIt packaging workflow installed, see How to enable packaging workflows with the SURF BagIt iRODS rule
- connected to iRODS, see How to connect to iRODS using iCommands
Packaging data can be useful when a dataset or folder/collection contains many (small or big) files and needs to be archived, either for publishing or for cost reduction. However, packaging data before uploading can be a tedious operation. Here we show how an iRODS collection can be archived using the BagIt tools installed on the server.
Packaging a collection using metadata through a BagIt workflow
The BagIt packaging workflow searches for all collections with the metadata attribute "SURFbagit". It expects the destination resource as the value, and a unit string that combines one of the two keywords "copy" or "move" with the archive format and, optionally, "avu" (for example "copy::tgz::avu"). If the keyword "copy" is set, the collection is archived but the original remains available; with "move" the original is removed and only the archive file is kept. Here we assume that the destination storage resource is called 'surfArchive'. To start the packaging workflow for a collection, mark the collection with the following metadata:
user@login:~$ imeta set -C /your/collection/to/package SURFbagit surfArchive copy::tgz::avu
This addition of metadata to the collection will communicate to the iRODS system that this collection is a candidate to be packaged with the BagIt format. Note that this collection will not immediately be packaged; this will happen asynchronously in the background. This is useful as packaging could take a long time for collections with large data sets.
The metadata of the collection will be changed immediately:
user@login:~$ imeta ls -C /your/collection/to/package
AVUs defined for collection /your/collection/to/package:
attribute: SURFbagit
value: surfArchive
units: copy::tgz::avu
In the background, iRODS will find this collection marked as a packaging candidate and package its files. It makes use of the bdbag tool set to ensure the validity and consistency of the dataset, which means data will not get lost upon archiving.
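Because the archive is a BagIt bag, you can also verify its integrity yourself after downloading it. The commands below are a sketch, assuming the bdbag command-line tool is installed on your local machine and that the extracted bag directory is named 'package' (names and paths follow the example above):

user@login:~$ iget /your/collection/to/package.tgz
user@login:~$ tar -xzf package.tgz
user@login:~$ bdbag --validate full package

A full validation recomputes the checksums in the bag's manifest, so it will detect any corruption introduced during transfer or storage.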
In time, the dataset will be packaged and moved to the 'surfArchive' resource:
user@login:~$ ils /your/collection/to
  package.tgz
  C- /your/collection/to/package
If you have specified 'move':
user@login:~$ imeta set -C /your/collection/to/package SURFbagit surfArchive move::tgz::avu
then the original collection will be removed:
user@login:~$ ils /your/collection/to
  package.tgz
Note that by default the package will be compressed into a '.tgz' file. If you would like a different format, you can specify this while marking the collection:
user@login:~$ imeta set -C /your/collection/to/package SURFbagit surfArchive copy::zip::avu
If you do not wish the metadata to be stored inside the archive file, omit the '::avu' part.
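For example, to keep the original collection, package it as a '.zip' file, and leave the metadata out of the archive, the keyword and format are combined without the '::avu' suffix (same example collection and resource names as above):

user@login:~$ imeta set -C /your/collection/to/package SURFbagit surfArchive copy::zip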
Unpackaging to retrieve the collection
If you have chosen the 'move' option, or have removed the original collection manually, and would like to retrieve the collection by unpackaging the archive file, you can initiate the asynchronous reverse operation:
user@login:~$ imeta set -d /your/collection/to/package.tgz SURFunbagit surfResc move
Note that we use the '-d' option for 'imeta' as the archive file is a data object and not a collection.
After some time the collection is restored in its original form:
user@login:~$ ils /your/collection/to
  C- /your/collection/to/package