SURF provides the Data Archive service to store large files or datasets for long-term preservation, or as a temporary storage scale-out for users of the compute infrastructure. Files are stored on tapes that are managed and accessed by a tape library. A conventional disk-based file system is available for receiving files from the tapes and from the file systems of remote systems, for example the compute systems (e.g. Lisa and Snellius).
As a user, you can focus on structuring your files and folders and on deciding whether data is staged or not. You don't have to worry about where your data is stored on tape or which tape you need; this is handled automatically by the system.
Depending on the operating system you are using, different tools are available to access the Data Archive service.
Use the built-in terminal application.
If you only need to list your files or want to add files to the archive, you can also use a file transfer application that supports the SSH and/or SFTP protocols (such as WinSCP for Windows or Cyberduck for macOS). To retrieve data using these applications, you must first stage the data before you can transfer it to another system.
System login and general usage
The Data Archive system is Unix-based and requires you to log in to the login node with your username and password. The system also allows key-based authentication. A more in-depth introduction and explanation can be found on the SSH usage page.
Users of Lisa and Snellius
Users of the compute infrastructure who (additionally) have access to the Data Archive service have their own directory on the archive file system, visible from the compute infrastructure. Users can access their archive directory via:
Users who only have access to the archive, and not to other parts of the compute infrastructure, can log in directly to the archive. Login can be done using a terminal or a tool that supports SFTP or SSH connections. Logging in directly on the Data Archive can be done via:
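As a minimal sketch of such a direct login, assuming the host name archive.surfsara.nl that this guide mentions for the host authenticity check, two small helper functions could look as follows. The function names and the example user name are hypothetical.

```shell
ARCHIVE_HOST="archive.surfsara.nl"   # host name as given in this guide

# Usage: archive_ssh <username>
# Opens an interactive shell on the Data Archive login node.
archive_ssh() {
    ssh "${1:?usage: archive_ssh <username>}@${ARCHIVE_HOST}"
}

# Usage: archive_sftp <username>
# Opens an interactive SFTP session for listing and adding files.
archive_sftp() {
    sftp "${1:?usage: archive_sftp <username>}@${ARCHIVE_HOST}"
}
```

For a hypothetical user jdoe, calling `archive_ssh jdoe` is the same as typing `ssh jdoe@archive.surfsara.nl` in the terminal.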
The SSH application may ask you to confirm the authenticity of a newly connected host:
You can safely continue by entering 'yes', as long as the host is 'archive.surfsara.nl' and the IP is in the range
User home folder
When logged in directly to the archive, you start in your so-called home folder:
By default, you can only store data in this folder or any subfolder.
A user may have access to one or more shared folders on the archive system. This means data can be accessed and/or modified in a location other than the user's home folder, for example a folder of another user.
In this case, you can simply change directory to that shared folder, e.g.:
If you are copying data from a remote system to the shared folder using SCP or another tool, make sure to use the absolute path as the target while authenticating with your own user name:
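A minimal sketch of such a copy, assuming the host name archive.surfsara.nl from this guide, is shown here. The function name, the user name, the file name and the shared folder path in the usage note are hypothetical placeholders; substitute the absolute path of your own shared folder.

```shell
ARCHIVE_HOST="archive.surfsara.nl"   # host name as given in this guide

# Usage: copy_to_shared <username> <local_file> <absolute_shared_path>
# Copies one local file to a shared folder, logging in as <username>.
copy_to_shared() {
    case "$3" in
        /*) ;;   # absolute path, as required for a shared folder
        *)  echo "target must be an absolute path" >&2; return 1 ;;
    esac
    scp "$2" "$1@${ARCHIVE_HOST}:$3"
}
```

For example, `copy_to_shared jdoe results.tar /path/to/shared/folder/` logs in as jdoe and writes into the shared folder rather than into jdoe's home folder.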
Does your login have access to the archive?
To find out if this is the case:
- Go to https://portal.surfsara.nl/ and log in with your username (your Lisa or Snellius username)
- Go to the menu on the left-hand side and check "your profile"
- On the right-hand side you will see all systems you have access to. If the "Data Archive" is not listed there, you have no access.
- If you think you should have access to the system, please open a ticket at https://servicedesk.surfsara.nl/ and provide us with your grant number or your login.
The usage of the archive file system is essentially transparent: the standard Unix commands (mv, ...) can be used to handle the files in the archive. There are also special commands that do things more efficiently. These are the so-called DMF commands, which are explained later.
A new subfolder, here named "workdir", can be created with the following command:
Copying data to this folder can be done via:
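As a minimal sketch of these two steps with the standard mkdir and cp commands: the scratch directories below only stand in for the real locations, and the variable names and the sample file name are hypothetical. Replace them with your own archive directory and a folder under your HOME directory.

```shell
# Scratch directories stand in for the real locations (assumption):
ARCHIVE_DIR="$(mktemp -d)"   # replace with your directory on the archive file system
SOURCE_DIR="$(mktemp -d)"    # replace with a folder under your HOME directory

# Create a sample file so there is something to copy.
echo "sample data" > "$SOURCE_DIR/results.dat"

# Create the subfolder "workdir" inside the archive directory.
mkdir -p "$ARCHIVE_DIR/workdir"

# Copy the file into the archive subfolder.
cp "$SOURCE_DIR/results.dat" "$ARCHIVE_DIR/workdir/"
```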
What happens here is that files from the HOME directory are copied to the stage area of the archive file system. Later on, these files are automatically replicated to tape.
Now the other way around:
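A minimal sketch of the reverse copy with the standard cp command follows. As an assumption for illustration, scratch directories stand in for the archive directory and the HOME file system, and the sample file is created only so the sketch can be run anywhere.

```shell
# Scratch directories stand in for the real locations (assumption):
ARCHIVE_DIR="$(mktemp -d)"   # replace with your directory on the archive file system
TARGET_DIR="$(mktemp -d)"    # replace with a folder under your HOME directory

# Stand-in for a file that was archived earlier.
mkdir -p "$ARCHIVE_DIR/workdir"
echo "sample data" > "$ARCHIVE_DIR/workdir/results.dat"

# Copy the file from the archive back to the target folder.
cp "$ARCHIVE_DIR/workdir/results.dat" "$TARGET_DIR/"
```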
Here, files are copied from tape to the stage area (if not already there), and subsequently copied to the HOME file system.
Note that retrieving files from tape can take some time.
DMF (Data Migration Facility) is a hierarchical storage management system for Silicon Graphics environments. Its primary purpose is to augment the economic value of storage media and stored data.
On Snellius, commands are provided that let file owners manually store and retrieve data. Users can do the following:
dmput: Explicitly migrate files
dmget: Explicitly recall files or parts of files (also called staging)
dmcopy: Copy all or part of the data from a migrated file to an online file
dmls: List files and determine whether a file is migrated
dmattr: Test in shell scripts whether a file is online or offline
dmfind: Search for migrated files
See the online manual pages for details.
DMF file types
A file status can have several values in DMF:
REG: Regular files are user files residing only on disk
MIG: Migrating files are files which are being copied from disk to tape
UNM: Unmigrating files are files which are being copied from tape to disk
Migrated files can be either of the following:
DUL: Dual-state files whose data resides both online and offline
OFL: Offline files whose data is no longer on disk
Although directories themselves have statuses as well when listed, they will always remain regular (REG) as they are not stored on tape. They are only defined in the file system itself.
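The status values above can be turned into a small decision helper for scripts. This is only a sketch: the function name is hypothetical, and it assumes you already have the status string of a file at hand. It treats only OFL as needing an explicit recall, because for all other statuses the data is on disk or already on its way there.

```shell
# Usage: needs_recall <status>   (one of REG, MIG, UNM, DUL, OFL)
# Returns 0 if the file must be staged before it can be read,
# 1 if no explicit recall is needed, 2 for an unknown status.
needs_recall() {
    case "$1" in
        OFL)         return 0 ;;   # data is on tape only
        REG|MIG|DUL) return 1 ;;   # data is (still) on disk
        UNM)         return 1 ;;   # recall from tape is already in progress
        *) echo "unknown DMF status: $1" >&2; return 2 ;;
    esac
}
```

In a script this can be used as `if needs_recall "$status"; then ...; fi` to decide whether to stage a file first.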
Listing file statuses
The "dmls -l" command lists files together with their DMF status. After offline files are recalled with "dmget", all the files with status (OFL) change their status to (DUL).
It is important to store a few large files on the archive file system rather than many small ones. By small we mean less than 100 MB, by large we mean larger than 1 GB. If you have many small files, it is best to combine them into one large file, for example using the tar command:
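A minimal sketch of combining a folder of small files into one tar file follows. The scratch directory, the folder name "results" and the file names are hypothetical and only make the sketch runnable anywhere.

```shell
WORK="$(mktemp -d)"   # scratch directory standing in for your data location

# Create a folder with a few small sample files.
mkdir -p "$WORK/results"
for i in 1 2 3; do
    echo "run $i" > "$WORK/results/run_$i.txt"
done

# Pack the whole folder into a single file.
# -C changes into $WORK first, so the stored paths start with "results/".
tar -cf "$WORK/results.tar" -C "$WORK" results
```

The resulting results.tar is the single file you would then copy to the archive.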
The reason is that if you archive many files, chances are that they will be spread over hundreds of different tapes. Getting these files back will then take a long time, because every tape mount takes a significant amount of time.
The reverse of the above action would be:
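A minimal sketch of unpacking such a tar file follows. The scratch directory and file names are hypothetical, and the sample tar file is built first only so that the sketch is self-contained.

```shell
WORK="$(mktemp -d)"   # scratch directory standing in for your data location

# Build a small sample tar file to unpack.
mkdir -p "$WORK/results" "$WORK/restored"
echo "run 1" > "$WORK/results/run_1.txt"
tar -cf "$WORK/results.tar" -C "$WORK" results

# List the contents without extracting anything.
tar -tf "$WORK/results.tar"

# Extract all files into the folder "restored".
tar -xf "$WORK/results.tar" -C "$WORK/restored"
```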
Sophisticated commands for power users
If you are handling really large amounts of data, the DMF utilities are useful. On Lisa and Snellius the DMF utilities are installed system-wide.
The most important DMF commands are:
dmget -a <files>: retrieves files from tape and puts them on the stage area
dmls -l: gives information about the status of the file (see DMF file types)
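The two commands can be combined into one step that stages files and then shows their status. This is a sketch that uses only the invocations named in this guide (dmget -a and dmls -l); the function name is hypothetical, and it only works on a system where the DMF utilities are installed.

```shell
# Usage: stage_and_list <file>...
# Recalls the given files from tape, then lists them with their DMF status.
stage_and_list() {
    if ! command -v dmget >/dev/null 2>&1; then
        echo "DMF utilities not found on this system" >&2
        return 1
    fi
    dmget -a "$@" || return 1   # recall the files to the stage area
    dmls -l "$@"                # show the status of each file
}
```

Passing all files to a single dmget call, instead of calling it once per file, lets the system handle the recall as one request.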
Archive and batch
In general, do not handle archived files in a batch job. Before a job starts, the files should be available on the normal file system, using the examples above.
Archive and the Lisa system
On Lisa, the archive file system is not available on the batch nodes, so all handling of the archive file system has to be done interactively.
Archive and the Snellius system
On Snellius, the archive file system is available on the login and service nodes. One possibility is to create a dependent job, where the handling of archive data is done on a cheap service node. Please see the Batch - Howto.
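A minimal sketch of such a dependent job follows, assuming the Slurm scheduler and its sbatch options --parsable and --dependency=afterok. The function name and the two script names are hypothetical. The staging script is assumed to request a service node in its own header, which is not shown here.

```shell
# Usage: submit_with_staging <staging_script> <compute_script>
# Submits the staging job first and makes the compute job wait for it.
submit_with_staging() {
    # --parsable makes sbatch print only the job ID (optionally ";cluster").
    stage_id="$(sbatch --parsable "$1")" || return 1
    stage_id="${stage_id%%;*}"   # keep only the job ID part

    # The compute job starts only after the staging job ended successfully.
    sbatch --dependency="afterok:${stage_id}" "$2"
}
```

For example, `submit_with_staging stage.sh compute.sh` submits stage.sh, which would retrieve the data from the archive, and then compute.sh, which runs once the data is in place.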