Introduction

SURF provides the Data Archive service to store large files or datasets for long-term preservation or as a temporary storage scale-out for compute infrastructure users. Files are stored on tapes that are managed and accessed by a tape library. A conventional disk-based file system receives files from the tapes and from remote systems, such as the compute systems (e.g. Lisa and Snellius).

As a user you can focus on structuring your files and folders and on whether or not your data is staged. You do not have to worry about where your data is stored on tape or which tape you need; the system handles this automatically.

Tools

Which tools you can use to access the Data Archive service depends on your operating system.

If you only need to list your files or want to add files to the archive, you can also use a file transfer application that supports the SSH and/or SFTP protocols (like WinSCP for Windows or Cyberduck for macOS). To retrieve data using these applications, you first need to stage the data before you can transfer it to another system.

macOS/Linux

Use the built-in terminal application.

Windows

Use PuTTY or MobaXTerm.

If you need to list your files or want to add files, you can also use a file transfer application that supports the SSH and/or SFTP protocols (like WinSCP).

System login and general usage

The Data Archive system is Unix-based and requires you to log in to the login node with your username and password. Key-based authentication is also supported. A more in-depth introduction and explanation can be found on the SSH usage page.

Users of Lisa, Snellius

Users of the compute infrastructure who also have access to the Data Archive service have their own directory on the archive file system, which is visible from the compute infrastructure. You can access your archive directory via:

cd /archive/<username>

Archive-only users

Users who have access only to the archive, and not to other parts of the compute infrastructure, can log in directly to the archive using a terminal or any tool that supports SFTP or SSH connections. Logging in directly to the Data Archive can be done via:

ssh <username>@archive.surfsara.nl

The SSH application may ask you to confirm the authenticity of a newly connected host:

The authenticity of host 'archive.surfsara.nl (<IP>)' can't be established.
ECDSA key fingerprint is SHA256:<hash>.
Are you sure you want to continue connecting (yes/no)?

You can safely continue by entering 'yes', as long as the host is 'archive.surfsara.nl' and the IP is in the range 145.100.xx.xx.

User home folder

If you log in directly to the archive, you start in your home folder:

pwd
/archive/<username>

By default, you can only store data in this folder or any subfolder.

Shared folders

A user may have access to one or more shared folders on the archive system. This means data can be accessed and/or modified in a location other than the user's home folder, for example a folder of another user.

In this case you can simply change to that shared folder, e.g.:

cd /archive/<shared folder>

If you are copying data to the shared folder from a remote system using SCP or another tool, make sure to use the absolute path in the target and your own username for authentication:

scp file <username>@archive.surfsara.nl:/archive/<shared folder>/.


Does your login have access to the archive?

To find out if this is the case:

  1. Go to https://portal.surfsara.nl/ and log in with your Lisa or Snellius username
  2. Go to the menu on the left-hand side and check "your profile"
  3. On the right-hand side you will see all systems you have access to. If "Data Archive" is not listed there, you have no access.
  4. If you think you should have access to the system, please open a ticket at https://servicedesk.surfsara.nl/ and provide your grant number or your login.

Basic commands

The archive file system is essentially transparent in use: the standard Unix commands (cp, ls, mv, ...) can be used to handle the files in the archive. There are also special commands that handle archived files more efficiently. These are the so-called DMF commands, which are explained later.

A new subfolder, here named “workdir”, can be created with the following command:

mkdir /archive/<username>/workdir

Copying data to this folder can be done via:

cp $HOME/workdir/output* /archive/<username>/workdir

Here, files from the HOME directory are copied to the stage area of the archive file system. Later on, these files are automatically replicated to tape.

Now the other way around:

dmget -a /archive/barbara/workdir/output*
cp /archive/barbara/workdir/output* $HOME/workdir

Here, files are copied from tape to the stage area (if not already there), and subsequently copied to the HOME file system.

Note that retrieving files from tape can take some time.
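
The two steps above can be wrapped in a small helper so that a copy never starts before the files are staged. This is only a sketch: fetch_from_archive is a hypothetical name, not part of DMF, and the "command -v" check is an assumption that lets the same function run on systems without dmget, where it falls back to a plain copy.

```shell
# Hypothetical helper (not part of DMF): stage files, then copy them.
# Usage: fetch_from_archive <destination dir> <archive file>...
fetch_from_archive() {
    dest_dir="$1"
    shift
    # Stage the files first when dmget exists; skip this step elsewhere.
    if command -v dmget >/dev/null 2>&1; then
        dmget -a "$@" || return 1
    fi
    cp -- "$@" "$dest_dir"
}
```

A call such as fetch_from_archive "$HOME/workdir" /archive/<username>/workdir/output* would then mirror the example above.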

DMF commands

DMF (Data Migration Facility) is a hierarchical storage management system for Silicon Graphics environments. Its primary purpose is to augment the economic value of storage media and stored data.

On Snellius, commands are provided that let file owners manually store and retrieve data. Users can do the following:

  • dmput: Explicitly migrate files
  • dmget: Explicitly recall files or parts of files (also called staging)
  • dmcopy: Copy all or part of the data from a migrated file to an online file
  • dmls: List files and determine whether a file is migrated
  • dmattr: Test in shell scripts whether a file is online or offline
  • dmfind: Search for migrated files

See the online manual pages for details.

DMF file types

A file status can have several values in DMF:

  • REG: Regular files are user files residing only on disk
  • MIG: Migrating files are files which are being copied from disk to tape
  • UNM: Unmigrating files are files which are being copied from tape to disk
  • Migrated files can be either of the following:
    • DUL: Dual-state files whose data resides both online and offline
    • OFL: Offline files whose data is no longer on disk


Directories also show a status when listed, but they always remain regular (REG): they are not stored on tape and exist only in the file system itself.
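
In a script, the file states above reduce to one question: can the data be read from disk right away, or does it still have to come back from tape? A minimal sketch follows. not_yet_on_disk is a hypothetical helper name, and treating MIG as readable is an interpretation of the definitions above (the data is still on disk while it is being copied to tape).

```shell
# Hypothetical helper: succeed (exit 0) when a file in the given DMF state
# does not yet have its data available on disk.
not_yet_on_disk() {
    case "$1" in
        OFL|UNM) return 0 ;;       # on tape only, or still being copied back
        REG|MIG|DUL) return 1 ;;   # data is present on disk
        *) echo "unknown DMF state: $1" >&2; return 2 ;;
    esac
}
```

For example, "if not_yet_on_disk OFL; then ...; fi" could guard a dmget call in a larger script.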

Listing file statuses

Example of the "dmls" command:

$ dmls -l
-rw-------    1 hthta    staff     632792 Jul 26  1999 (OFL) file1
-rw-------    1 hthta    staff     632792 Jul 27  1999 (OFL) file2
-rw-------    1 hthta    staff      15884 Jul 27  1999 (REG) file3
-rw-------    1 hthta    staff     632792 Aug  2  1999 (DUL) file4
-rw-------    1 hthta    staff     632792 Jun 19 23:20 (MIG) file5

Example of the "dmget" command:

$ dmget *
$ dmls -l
-rw-------    1 hthta    staff     632792 Jul 26  1999 (DUL) file1
-rw-------    1 hthta    staff     632792 Jul 27  1999 (DUL) file2
-rw-------    1 hthta    staff      15884 Jul 27  1999 (REG) file3
-rw-------    1 hthta    staff     632792 Aug  2  1999 (DUL) file4
-rw-------    1 hthta    staff     632792 Jun 19 23:20 (MIG) file5

All the files with (OFL) changed their status to (DUL).
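
The state column can also be picked out of the listing in a script, for example to see which files still have to be staged. The sketch below assumes the "dmls -l" layout shown above, with the state in parentheses directly before the file name; files_in_state and state_summary are hypothetical helper names.

```shell
# Print the names of files in state $1 (e.g. OFL).
# Reads `dmls -l` output from standard input.
files_in_state() {
    sed -n "s/^.*($1) //p"
}

# Count files per state. Reads `dmls -l` output from standard input.
state_summary() {
    awk 'match($0, /\([A-Z][A-Z][A-Z]\) /) { n[substr($0, RSTART + 1, 3)]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Only run against a real listing where dmls is installed.
if command -v dmls >/dev/null 2>&1; then
    dmls -l | state_summary
fi
```

Piping the listing through files_in_state OFL gives the names that a following dmget would have to recall.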

It is important to store a few large files rather than many small files on the archive file system. By small we mean less than 100 MB; by large we mean larger than 1 GB. If you have many small files, it is best to combine them into one large file, for example using the tar command:

cd $HOME/workdir
tar cvf /archive/barbara/workdir/output.tar output*

The reason is that if you archive many files, chances are that they will end up on hundreds of different tapes. Getting these files back will then take a long time, because every tape mount takes a significant amount of time.

The reverse of the above action would be:

cd $HOME/workdir
dmget -a /archive/barbara/workdir/output.tar
tar xvf /archive/barbara/workdir/output.tar
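
The bundling advice in this section can also be scripted. The following sketch counts the files below the 100 MB mark and packs or unpacks a whole directory as a single tar file. The helper names are hypothetical, the "k" size suffix of find is assumed to be available (it is in GNU and BSD find), and the threshold is approximated as 102400 KiB. Unlike the example above, it packs the entire directory rather than only the files matching output*.

```shell
# Count regular files under directory $1 smaller than about 100 MiB.
count_small_files() {
    find "$1" -type f -size -102400k | wc -l | tr -d ' '
}

# Pack the contents of directory $1 into the tar file $2.
# The tar file should live outside the directory being packed.
bundle_dir() {
    tar -cf "$2" -C "$1" .
}

# Unpack the tar file $1 into the existing directory $2.
unbundle_to() {
    tar -xf "$1" -C "$2"
}
```

For example, bundle_dir "$HOME/workdir" /archive/<username>/workdir/output.tar writes one large file to the archive. Before unpacking, stage the tar file on the archive with dmget.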

Sophisticated commands for power users

If you are handling really large amounts of data, the DMF utilities come in useful. On Lisa and Snellius the DMF utilities are installed system-wide.

The most important DMF commands are dmget and dmls.

  • dmget -a <files> retrieves files from tape and puts them on the stage area
  • dmls -l gives information about the status of the file (see man dmls)

Example:

dmget -a /archive/barbara/workdir/output*
cp /archive/barbara/workdir/output* $HOME/workdir

Archive and batch

In general, do not handle archived files in a batch job. Before a job starts, the files should already be available on the normal file system, retrieved as in the examples above.

Archive and the Lisa system

On Lisa, the archive file system is not available on the batch nodes, so all handling of the archive file system has to be done interactively.

Archive and the Snellius system

On Snellius, the archive file system is available on the login and service nodes. One possibility is to create a dependent job, in which the archive data is handled on a cheap service node. Please see the Batch - Howto.
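
Such a dependent job could be set up roughly as follows. This is a sketch only, assuming the batch system is Slurm; submit_with_staging is a hypothetical helper, and both job scripts, including the choice of a service node for the staging job, are placeholders you would write yourself. The Batch - Howto remains the authoritative reference.

```shell
# Hypothetical helper: submit a staging job, then a compute job that only
# starts after the staging job has finished successfully.
# Usage: submit_with_staging <staging job script> <compute job script>
submit_with_staging() {
    # --parsable prints the job ID, optionally followed by ";cluster".
    stage_id=$(sbatch --parsable "$1") || return 1
    stage_id=${stage_id%%;*}
    sbatch --dependency="afterok:${stage_id}" "$2"
}
```

The staging script would contain the dmget and cp commands shown earlier, so that the compute job finds its input on the normal file system.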