Introduction: Encrypting Data with saracrypt

This page introduces the concept of encryption and the tool saracrypt to encrypt your data.

The following sections explain these tools, how to install them, and how to use them. The encryption engine used by saracrypt is GPG version 2 (also known as GnuPG, or GNU Privacy Guard). GPG offers various encryption algorithms; by default, saracrypt uses AES256, one of the strongest encryption algorithms in common use.

We stress that you should encrypt your files on your source system first, and only then transfer the encrypted files using a secure transfer protocol such as sftp.

saracrypt is a tool for bulk data encryption. It was created at SURF with the following in mind:

  • Security: strong encryption
  • Ease of use
  • Large archived datasets that reside on tape in SURF's Central Archive

Encryption is a means for hiding information from the prying eyes of others. In principle, your data is safe at SURF's highly secured data centres and infrastructure. However, encryption also creates an extra safety net against hacks and data leaks.

Before employing encryption you should consider how valuable your data really is, and whether it warrants putting in extra effort in keeping things secret.

saracrypt uses file encryption: each file is individually encrypted and as such can be stored securely in a long term archive. When you wish to access the data, it must first be decrypted. This is an explicit step in your workflow; data access is not fully transparent.

saracrypt uses a "master key" concept: a randomized secret key is used to encrypt the dataset. The key itself is protected by a passphrase. This scheme allows you to change the passphrase without having to decrypt and re-encrypt all the data, and it allows the master key to be used as a recovery key in case you forget the passphrase. You are free to choose either a single global master key for your data archive, or have different master keys for each dataset. You can read more about generating this master key in Generating the master key.

Installing saracrypt

saracrypt is provided as an open source tool via gitlab. The program can be obtained from the gitlab repository here: https://gitlab.com/surfsara/saracrypt

  1. You can download the source files using the download button at the top of the screen of the repository.
  2. Review the installation instructions on the repository page and in the README.md. After downloading, first verify that your environment meets the requirements below, then run the setup.

    Notes about the environment

    • By default saracrypt is installed in /usr/local.
    • Use 'SETUP_CONF=mysetup.conf ./setup.sh' if you need other settings.
    • The default python3 binary is used (>=3.6 is required). Set PYTHON3 to point at an alternative python3 binary, or adjust PATH.
    • Build depends on make, gcc, swig, python3-devel, libgcrypt-devel, and libgpg-error-devel.
  3. Open a command line terminal, and run the installation script:

    sudo ./setup.sh
  4. Or you can build an RPM package by running (for example):

    ./build_rpm.sh release-1.2
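
Before running setup.sh, it can help to confirm that the build tools named in the notes above are on the PATH. The following is a minimal sketch and not part of saracrypt: the function names missing_tools and python_ok are made up for this example. The development packages (python3-devel, libgcrypt-devel, libgpg-error-devel) are libraries rather than commands, so they are not covered here and should be checked with your package manager.

```shell
#!/bin/sh
# Print each named command that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Succeed if the selected python3 binary is version 3.6 or newer.
# PYTHON3 is the variable mentioned in the setup notes; default: python3.
python_ok() {
    "${PYTHON3:-python3}" -c \
        'import sys; sys.exit(0 if sys.version_info >= (3, 6) else 1)' 2>/dev/null
}

missing=$(missing_tools make gcc swig)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "missing build tools:" $missing
fi
python_ok || echo "python3 >= 3.6 not found"
```

If the script prints nothing, the command-line tools and the Python version look fine.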

Usage

saracrypt is a UNIX command-line tool. The following examples show the commands to type at the UNIX prompt, and the output that is expected to appear onscreen. The UNIX prompt itself is denoted by the dollar sign.

Generating the master key

For encrypting datasets we will use a master key. You may choose to have a single master key for your entire archive or use a master key per dataset.

saracrypt -g mykey

Enter new master key passphrase:
Enter it again (for verification):
writing mykey.gpg

The master recovery key is:

        H261-2FFY-SIVX-QBYZ-K37D

You should print this key out on paper and store it in a safe place,
or keep a copy of the key in a password manager application.
This key must be kept secret at all times.

This creates the file mykey.gpg. You will need this file to encrypt/decrypt the dataset.

Security considerations regarding the master key

The master key is protected by a passphrase. It is therefore important to choose a strong passphrase. The best thing to do is to use a password manager application to generate a long, random password. Long means 16 characters or more.
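
A password manager remains the recommended way to generate the passphrase. Where none is at hand, a random passphrase can also be produced on the command line. The sketch below is an illustration only: the function name random_passphrase is made up, and it assumes that /dev/urandom and the base64 command are available on your system. It reads 18 random bytes and encodes them as 24 printable characters, which is comfortably above the 16 character minimum.

```shell
#!/bin/sh
# Print a random 24-character passphrase.
# 18 random bytes encode to exactly 24 base64 characters (no padding).
random_passphrase() {
    head -c 18 /dev/urandom | base64
}

random_passphrase
```

Keep in mind that the passphrase appears on the terminal; store it in a password manager right away and clear the screen afterwards.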

It is reasonably safe to leave the master key file on the system next to the dataset. For added security you might offload it elsewhere, for example to a USB thumb drive. Note however that for use on SURF's Data Archive, you will have to manually copy over the master key file because the archive system has no direct access to your USB thumb drive. Alternatively, and possibly more secure, copy the encrypted dataset over to your private system, and decrypt the data there.

During master key generation, the recovery key is printed onscreen. You will want to double-check that no one was looking over your shoulder and that the key was not captured by a webcam or security camera that may be present in the room. Also, when printing the recovery key on paper, make sure no one but you (or authorized people) had access to that paper as it rolled out of the printer.

The master key file is created read-only to prevent accidental deletion.

The master key is always encrypted with AES256, no matter what cipher you pass on the command-line.

Encrypting a dataset

For encryption, we use the master key that was generated as described in the previous section.

When it starts, saracrypt asks for the passphrase to unlock the master key. At that point you might get a warning about having a readable umask. The umask is a setting in the UNIX environment that dictates the default permissions of newly created files. Access to your archive directory is probably already restricted, but this is your chance to double-check. If you see the warning, hit Ctrl+C and fix the settings with the commands that apply to your situation:

ls -ld ~/       # review home directory permissions

chmod 0750 ~/   # share directory with group

chmod 0700 ~/   # or: set private access

umask 027       # file creation mask: share with group

umask 077       # file creation mask: share with no one
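
To see what the two masks above actually do, you can create a scratch file under each and look at its permissions. This is a small sketch for illustration; the function name perms_under_umask is made up, and it assumes mktemp is available. It runs in a subshell, so your current umask is not changed.

```shell
#!/bin/sh
# Show the permissions a new file receives under a given umask.
# The parentheses make the function body a subshell, so the caller's
# umask is left untouched.
perms_under_umask() (
    umask "$1"
    tmpdir=$(mktemp -d) || exit 1
    touch "$tmpdir/probe"
    ls -l "$tmpdir/probe" | cut -c1-10   # keep only the permission string
    rm -rf "$tmpdir"
)

perms_under_umask 027   # prints -rw-r-----  (group may read)
perms_under_umask 077   # prints -rw-------  (owner only)
```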

Now that our umask is set securely, let's encrypt the dataset:

saracrypt -m mykey --progress dataset/

Enter master key passphrase:
saracrypt is dmgetting any offline files
dataset/file0001.dat.gpg
dataset/file0002.dat.gpg
dataset/file0003.dat.gpg
...

saracrypt asks for the passphrase to unlock the master key. It proceeds to stage files from tape if needed. Next, it will encrypt all files stored under the dataset/ directory. We passed the option --progress, which displays a textmode progress bar for large files.

It is possible to place the encrypted files under an alternate directory. Specify option --destdir to do so. The destination directory must already exist.
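
Because the destination directory must already exist, a small wrapper can create it first. The following is a sketch under assumptions: the function name encrypt_to_destdir is made up, and it assumes that --destdir takes the directory as its next argument; check saracrypt --help for the exact form.

```shell
#!/bin/sh
# Create the destination directory if needed, then encrypt into it.
# Assumption: --destdir expects the directory as the following argument.
encrypt_to_destdir() {
    key=$1
    src=$2
    dest=$3
    mkdir -p "$dest" || return 1
    saracrypt -m "$key" --destdir "$dest" "$src"
}

# Example call (requires saracrypt and an existing master key):
#   encrypt_to_destdir mykey dataset/ encrypted/
```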

By default, saracrypt will not erase the original files. The act of encryption will leave you with two versions of the files: encrypted and unencrypted. The fact that saracrypt does not delete data by default is considered a data safety feature. To delete the unencrypted originals, run:

saracrypt -m mykey --delete dataset/

If you feel unsure about using --delete, do a dry run first to see what files would be deleted:

saracrypt -m mykey --delete --dry-run dataset/
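
The two commands above can be combined into one step that always shows the dry run first and deletes only after you confirm. This is a minimal sketch; the function name delete_after_review is made up for this example, and saracrypt will ask for the passphrase once for each of the two runs.

```shell
#!/bin/sh
# Show what --delete would remove, then delete only after confirmation.
delete_after_review() {
    key=$1
    dataset=$2
    saracrypt -m "$key" --delete --dry-run "$dataset" || return 1
    printf 'Delete the files listed above? [y/N] '
    read -r answer
    case "$answer" in
        y|Y) saracrypt -m "$key" --delete "$dataset" ;;
        *)   echo "nothing deleted" ;;
    esac
}

# Example call (requires saracrypt and an existing master key):
#   delete_after_review mykey dataset/
```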

It is OK to run the saracrypt command again on the encrypted dataset. saracrypt will not encrypt any files that already were encrypted. It detects this only by the .gpg extension on the filename. A warning message will be issued when the file is already encrypted.
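
Since the check is based on the filename only, you can apply the same rule yourself to see which files in a dataset do not carry the .gpg extension yet. The sketch below uses the standard find command; the function name list_unencrypted is made up, and like saracrypt's check it looks at names, not at file contents.

```shell
#!/bin/sh
# List regular files under a directory whose names do not end in .gpg.
# This mirrors the extension-based rule described above; it does not
# inspect file contents.
list_unencrypted() {
    find "$1" -type f ! -name '*.gpg'
}

# Example call:
#   list_unencrypted dataset/
```

An empty result means every file under the directory already has the .gpg extension.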

By default, saracrypt encrypts all files under the dataset's directory. It is possible to exclude files by using the --exclude option. Multiple exclude patterns may be specified:

$ saracrypt -m mykey --exclude '*.chksum' --exclude '*.idx' dataset/

It is also possible to list the exclusion patterns in a file:

saracrypt -m mykey --exclude-from exclude.txt dataset/
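
As an illustration, the two patterns from the earlier --exclude example could be stored in exclude.txt as follows. Note the assumption: this sketch supposes that saracrypt reads one pattern per line from the file, which this guide does not spell out; check saracrypt --help to confirm.

```shell
#!/bin/sh
# Write two exclusion patterns to exclude.txt.
# Assumption: saracrypt expects one pattern per line in this file.
# The single quotes keep the shell from expanding the wildcards.
printf '%s\n' '*.chksum' '*.idx' > exclude.txt
cat exclude.txt
```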

saracrypt was made to process large datasets. Whenever it encounters an error with a file, it will continue to process the rest of the dataset. If you would rather stop at the first error, pass --bail to bail out early:

saracrypt -m mykey --bail dataset/

Alternatively, you may want to be even more restrictive, and treat all warnings and errors as fatal errors. This can be done by passing the option --werror.

The --stats option presents a short summary with statistics at the end of a run, for example:

saracrypt -m mykey --stats --decrypt dataset/
...
6 files, 0 errors
decrypting took 22 seconds
average rate: 60.5 MB/s
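
The average rate can be used for a rough estimate of how long a larger dataset will take. The sample figures above imply a dataset of about 22 s x 60.5 MB/s = 1331 MB. The sketch below does the reverse calculation; the function name estimate_seconds is made up, and the rate on your system will differ depending on disks, CPU, and whether files must be staged from tape.

```shell
#!/bin/sh
# Estimate the run time in seconds for a dataset of a given size.
#   $1: dataset size in MB
#   $2: throughput in MB/s (for example, taken from a --stats summary)
estimate_seconds() {
    LC_ALL=C awk -v size="$1" -v rate="$2" \
        'BEGIN { printf "%.0f\n", size / rate }'
}

estimate_seconds 1331 60.5      # the sample run above: prints 22
estimate_seconds 1000000 60.5   # a 1 TB dataset: prints 16529 (about 4.6 hours)
```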

Pass option --quiet to suppress informational messages and warnings.

For encryption, you may select an alternate cipher (encryption algorithm) by using the --cipher option. By default, saracrypt uses AES256, one of the strongest ciphers in common use; governments use it to protect state secrets. It is recommended to stay with the default. Run saracrypt --help to list the supported ciphers. The specific list of ciphers depends on the version of GPG that is installed on the system. Notably, older versions of GPG do not include the Camellia ciphers.

saracrypt uses GPG as encryption engine. It searches for the gpg command via the PATH environment variable. You may specify a different GPG command via the --gpg option. In general, this is not needed, but you may use this option to point saracrypt at a specific version of GPG.
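
To find out which gpg binary the PATH resolves to, and whether it supports the Camellia ciphers, you can ask GPG directly: gpg --version prints the version followed by the list of supported ciphers. The following sketch wraps that check; the function name mentions_camellia is made up for this example.

```shell
#!/bin/sh
# Succeed if the text on standard input mentions a Camellia cipher.
mentions_camellia() {
    grep -qi 'camellia'
}

gpg_bin=$(command -v gpg || true)
if [ -n "$gpg_bin" ]; then
    echo "gpg found at: $gpg_bin"
    "$gpg_bin" --version | head -n 1      # version banner
    if "$gpg_bin" --version | mentions_camellia; then
        echo "Camellia ciphers are available"
    else
        echo "Camellia ciphers are not listed"
    fi
else
    echo "gpg not found on PATH"
fi
```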

Security considerations regarding encryption

saracrypt uses GPG as encryption engine. Implementing encryption algorithms right, in a secure manner, is difficult and therefore left entirely to the implementors of GPG. GPG is well respected and considered to be a strong security tool.

saracrypt gives a warning when the umask (file creation mask) is set too permissive.

The master key is unlocked for use with encryption. The unencrypted key resides only in computer memory; it does not "touch" the disk. The memory is protected by the operating system. saracrypt disables core dumps to prevent the memory from being dumped to disk in the event of a critical error.

Without the --delete option, the unencrypted versions of the files remain on the system.

The --delete option uses the POSIX unlink system call. It does not overwrite any data blocks with zeroes. In general, there is no such thing as "secure erase" of files in modern UNIX filesystems. Moreover, SSD and HDD firmware decide where data is physically located on the drive, which is beyond the control of operating systems. At SURF, decommissioned equipment gets fully wiped in a secure manner and/or physically destroyed by a certified subcontractor.

Filenames are not encrypted. An attacker may be able to deduce certain information from the naming of files.

The filename extension of encrypted files is always .gpg. You should not rename the file extension; saracrypt expects .gpg. Moreover, it is easy to detect that files were encrypted with GPG anyway by using the file command.

For data safety and recovery reasons, the SURF Data Archive makes nightly backups. The data retention period is usually four weeks (as specified in the Service Level Agreement). It is not possible to delete files from these backups.

It is recommended to use the default AES256 cipher. The choice for a different, weaker cipher should be based on risk analysis, and reasoning for choosing the alternate cipher should preferably be documented. Ciphers tend to be broken over time, affecting the security of data, especially for long term storage.

Decrypting a dataset

To decrypt the dataset, pass the --decrypt option:

saracrypt -m mykey --decrypt --progress dataset/

Enter master key passphrase:
saracrypt is dmgetting any offline files
dataset/file0001.dat.gpg
dataset/file0002.dat.gpg
dataset/file0003.dat.gpg
...

As with encryption, saracrypt may issue a warning about the umask (file creation mask). Set the umask to a more restricted mode to get rid of the warning and be more secure. See the section Encrypting a dataset for more information.

Decryption leaves both the decrypted and the encrypted files on disk. To automatically delete the encrypted copies, pass the --delete option. Usually, however, you will want to keep the data in its encrypted form. If an unencrypted file with the same name already exists, saracrypt will not overwrite it. In case of error, saracrypt will continue to process the entire dataset. Pass the --bail option to bail out early.

Security considerations regarding decryption

Decryption leaves an unencrypted copy of the data on the system. You should make sure that no unauthorized access can take place. Be advised to remove the unencrypted copy when it is no longer needed.
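
When the time comes to clean up, it helps to know which plaintext files still sit next to their encrypted counterpart. The sketch below only lists such files and deletes nothing; the function name list_decrypted_copies is made up, and it assumes filenames do not contain newline characters.

```shell
#!/bin/sh
# List plaintext files that sit next to an encrypted copy (<name>.gpg).
# Nothing is deleted; review the list before removing anything.
list_decrypted_copies() {
    find "$1" -type f -name '*.gpg' | while IFS= read -r enc; do
        plain=${enc%.gpg}            # strip the .gpg suffix
        if [ -f "$plain" ]; then
            printf '%s\n' "$plain"
        fi
    done
}

# Example call:
#   list_decrypted_copies dataset/
```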

Changing the passphrase

It is possible to change the passphrase on the master key.

saracrypt --change mykey

Enter master key passphrase:
Enter new master key passphrase:
Enter it again (for verification):
writing mykey.gpg.tmp
saving mykey.gpg

As you can see, it first creates a new temporary file. When there are no errors, the master key file is moved in place.

Since the dataset is encrypted with the master key, we have effectively changed the passphrase to the data.

Batch usage

When working with large datasets, it is common to do batch processing. To accommodate non-interactive batch usage, saracrypt has the option of reading the passphrase from a file. A word of warning is in order: what we are about to do is generally considered bad practice because it is insecure.

  1. Set a private file creation mask: umask 077.

  2. Use a text editor to create a file pass.txt. It will contain only one line: the passphrase.

  3. In the batch script, use saracrypt --batch pass.txt to pass the passfile:

    umask 077
    saracrypt -m mykey --batch pass.txt --decrypt dataset/

Note that this leaves the unencrypted data still present at the end of the job. Any job temp directory will be cleaned up automatically by the batch system. Any other directory you should make sure to clean up; we can use the --delete option for this:

saracrypt -m mykey --batch pass.txt --delete dataset/

rm pass.txt   # done; delete the passfile asap

Batch mode in combination with option --quiet suppresses all messages but errors. When there are no errors, saracrypt will give an exit status of zero.

Security considerations regarding batch usage

The passphrase must be stored in its plaintext form. This is a big security risk because in case of a security breach an attacker can now easily read your passphrase. Only do this on a system that you can reasonably trust. Leave the passfile on the system only for as long as is truly necessary.

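Filesystems on batch compute systems are typically shared across multiple nodes, so the passfile and any decrypted data may be reachable from more machines than the one running your job.

Jobs may break intermittently and not reach the end of the job script, and thus fail to clean up.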
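
One way to reduce the risk of a passfile being left behind is a shell trap that removes it whenever the job script exits, including on failure. The following is a sketch, not a guarantee: the function name run_with_passfile is made up, and a trap cannot run if the job is killed outright (SIGKILL) or the node crashes.

```shell
#!/bin/sh
# Run a command, then remove the passfile when the subshell exits,
# whether the command succeeded, failed, or was interrupted.
run_with_passfile() (
    passfile=$1
    shift
    trap 'rm -f "$passfile"' EXIT      # clean up on every exit path
    trap 'exit 1' HUP INT TERM         # turn signals into a normal exit
    umask 077
    "$@"
    status=$?
    exit "$status"
)

# Example call in a job script (requires saracrypt and a master key):
#   run_with_passfile pass.txt \
#       saracrypt -m mykey --batch pass.txt --quiet --decrypt dataset/ \
#       || echo "saracrypt reported errors" >&2
```

The function passes on the exit status of the command, so the job script can still test for a zero status as described above.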

Using the recovery key

In the troublesome event that you lost the passphrase or the master key file, you may use the recovery key. You had printed the recovery key on paper, remember? Alternatively, it may be stored in a password manager. The procedure was described in the section on generating a master key.

There are two ways in which the recovery key can be used. Firstly, the recovery key may be used to decrypt data:

saracrypt --decrypt dataset/
Enter passphrase:

Enter the recovery key as passphrase.

Secondly, the recovery key can be used to reconstruct the master key file. To reconstruct the master key file, do the following:

  1. obtain the recovery key, either from printed paper or password manager
  2. set a private file creation mask: umask 077
  3. use a text editor to create the file mykey, containing only one line: the recovery key
  4. use saracrypt to encrypt the file mykey to mykey.gpg
  5. delete the unencrypted file mykey

The UNIX commands to enter are:

umask 077

nano mykey    # or use vim, emacs, ...

saracrypt mykey

Enter passphrase:
Enter it again (for verification):
saracrypt is dmgetting any offline files

mykey.gpg

rm mykey    # delete the plaintext copy

The outcome is the re-created master key file mykey.gpg.

Finally, make sure that the unencrypted file mykey holding the recovery key has indeed been deleted.

Security considerations on using the recovery key

If you lost either the passphrase or the master key file, pay attention to how you lost it. If it was stored on a USB thumb drive that you lost, it may well be that some other person now is in possession of your key material. Additional steps may be necessary to ensure your data stays secure.

When you obtain the recovery key, you are taking it from its secure environment. You must ensure that the recovery key is handled safely and securely during this time.

Encryption without using a master key

As you may have noticed in the previous section, saracrypt can be used to encrypt and decrypt data without using any master key at all.

This mode is convenient when you just want to encrypt a small number of files and not deal with a master key. Although entirely possible, it is not recommended to use this mode on large datasets; it is not possible to easily change the passphrase, and there is no recovery key in case you lose the passphrase. The recommendation is to use a master key as described earlier in this user guide.

That said, here is an example of encrypting a file using only a passphrase:

saracrypt --progress datafile.dat

Enter passphrase:
Enter it again (for verification):
saracrypt is dmgetting any offline files

datafile.dat.gpg

Final security considerations

A system is only as secure as its weakest link. Unfortunately, it is completely human to make mistakes. Moreover, security measures tend to get in the way of getting actual work done. saracrypt tries to strike a balance between having strong security and good usability.

Data security is not everybody's cup of tea. Nevertheless, if you work with important data then it's probably a good idea to spend some time going over the security aspects of your work. For example:

  • Is the passphrase strong enough? Tip: use a randomized passphrase, generated by a password manager application.

  • Is the recovery key safe from unauthorized access?

  • Do we have backups of the data? What if the encryption keys are lost? What if the encrypted files are lost?

  • Can the systems on which you work be trusted? If your workstation/laptop has already been compromised, the protection effectively falls flat.

  • Why are you using encryption? What are the risks?

  • What is your local security officer expecting and recommending?

  • What is the plan in case security is breached, and data is leaked?

