
No need to download common datasets yourselves!

On Snellius, we have installed and prepared a list of datasets that are frequently used to train or benchmark models, usually in the context of machine learning. Instead of filling up your own storage space or waiting for a download to finish, you can freely use the datasets available in the dataset folders on Snellius.

Importantly, most dataset folders are located under one of two roots: /scratch-nvme/ml-datasets/ or /projects/2/managed_datasets/
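
As a minimal sketch of how a script might locate a dataset under these roots, the helper below checks each root in turn and returns the first folder that exists. The function name `find_dataset` is hypothetical and not part of any Snellius tooling; the folder name `cifar-10` is taken from the dataset table on this page.

```python
from pathlib import Path
from typing import Iterable, Optional

# The two dataset roots named on this page.
DATASET_ROOTS = (
    Path("/scratch-nvme/ml-datasets"),
    Path("/projects/2/managed_datasets"),
)


def find_dataset(name: str, roots: Iterable[Path] = DATASET_ROOTS) -> Optional[Path]:
    """Return the first existing folder called `name` under `roots`, else None.

    Hypothetical helper for illustration; not an official Snellius tool.
    """
    for root in roots:
        candidate = Path(root) / name
        if candidate.is_dir():
            return candidate
    return None


if __name__ == "__main__":
    path = find_dataset("cifar-10")
    print(path if path is not None else "cifar-10 not found under the known roots")
```

Folder names are case-sensitive, so pass the name exactly as it appears in the "Path on Snellius" column.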

We use Python for data storage and conversion.

License: CC BY-NC-SA 3.0 (https://creativecommons.org/licenses/by-nc-sa/3.0/)

This means that you must attribute the work in the manner specified by the authors, you may not use this work for commercial purposes, and if you alter, transform, or build upon this work, you may distribute the resulting work only under the same license.

Dataset or model not listed?

If a dataset or model is missing, you can download or upload it to Snellius yourself. Please contact us if you think other people would also use this model or dataset; we can then add a copy to the public model and dataset space. This way, we avoid having many duplicates of models or datasets on the system, and users do not need to download or upload them from external sources. Of course, if your dataset or model is proprietary or privacy-sensitive, this does not apply.

Getting access to restricted datasets and models

Some datasets and models are not accessible by default on Snellius, because they require explicit acceptance of a license or agreement to the terms of use on the website of the dataset or model provider.

If you would like to access these datasets or models on Snellius, please send a ticket to https://servicedesk.surf.nl with a screenshot showing that the dataset or model provider has given you access to the data.

Even if access to a dataset is not restricted, it usually still has a license and terms of conduct.
By using the dataset or model, you agree to both the license and the terms of conduct.
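
To see whether your account can already read a given folder before opening a ticket, a quick permission check may help. This is only a sketch: it assumes that restrictions show up as ordinary filesystem permissions, which this page does not state, and `can_read_dir` is a hypothetical helper name.

```python
import os


def can_read_dir(path: str) -> bool:
    """Return True if `path` is a directory the current user can list and enter.

    Assumption: access restrictions appear as plain filesystem permissions.
    """
    return os.path.isdir(path) and os.access(path, os.R_OK | os.X_OK)


if __name__ == "__main__":
    # Paths copied from the tables on this page.
    for folder in (
        "/projects/2/managed_datasets/llama3",
        "/scratch-nvme/ml-datasets/cifar-10",
    ):
        status = "readable" if can_read_dir(folder) else "not readable (or missing)"
        print(f"{folder}: {status}")
```

A result of "not readable" does not distinguish between a missing folder and a restricted one, so treat it as a hint rather than a definitive answer.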


Model name	Free access	Path on Snellius	Available versions	License	Description	Website	Size
Llama3	❌	`/projects/2/managed_datasets/llama3`	8B 8B-Instruct 70B 70B-Instruct	Proprietary (community license)	-	https://llama.meta.com/	-
Llama2	❌	`/projects/2/managed_datasets/llama`	7B 7B-chat 13B 13B-chat 70B 70B-chat	Proprietary (community license)	-	https://llama.meta.com/	-
CodeLlama2	❌	`/projects/2/managed_datasets/codellama`	7B 7B-instruct 7B-python 13B 13B-instruct 13B-python 34B 34B-instruct 34B-python 70B 70B-instruct 70B-python	Proprietary (community license)	-	https://llama.meta.com/	-
Mistral	❌	`/projects/2/managed_datasets/hf_cache_dir`	7B-v0.1 7B-Instruct-v0.1 7B-Instruct-v0.2	Proprietary (community license)	-	https://huggingface.co/mistralai https://mistral.ai/	-
Mixtral	❌	`/projects/2/managed_datasets/hf_cache_dir`	8x7B-v0.1 8x7B-Instruct-v0.1 8x22B-v0.1 8x22B-Instruct-v0.1	Proprietary (community license)	-	https://mistral.ai/ https://huggingface.co/mistralai	-
Phi-3	✅	`/projects/2/managed_datasets/hf_cache_dir`	mini-4k-instruct mini-128k-instruct	MIT	-	https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3	-
Phi-2	✅	`/projects/2/managed_datasets/hf_cache_dir`	N/A	MIT	-	https://huggingface.co/microsoft/phi-2	-
Whisper	✅	`/projects/2/managed_datasets/hf_cache_dir`	large-v3	Apache 2.0	-	https://huggingface.co/openai/whisper-large-v3	-
GPT-2	✅	`/projects/2/managed_datasets/hf_cache_dir`	base medium large xl	MIT	-	https://huggingface.co/openai-community?sort_models=likes#models	-
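
Several of the models above live in a shared Hugging Face cache folder. The sketch below shows one way such a cache might be used with the `transformers` library, by passing `cache_dir` and `local_files_only` to `from_pretrained`. It assumes that `hf_cache_dir` follows the standard Hugging Face hub cache layout (folders named `models--<org>--<name>`) and that `transformers` is installed in your environment; neither is confirmed on this page. The helper names are hypothetical.

```python
from pathlib import Path

# Shared cache path from the table above.
HF_CACHE_DIR = Path("/projects/2/managed_datasets/hf_cache_dir")


def hub_folder_name(model_id: str) -> str:
    """Folder name the Hugging Face hub cache uses for a model repo id."""
    return "models--" + model_id.replace("/", "--")


def is_cached(model_id: str, cache_dir: Path = HF_CACHE_DIR) -> bool:
    """Check whether a model folder exists in the cache (assumed standard layout)."""
    return (Path(cache_dir) / hub_folder_name(model_id)).is_dir()


def load_model(model_id: str, cache_dir: Path = HF_CACHE_DIR):
    """Load tokenizer and model from the shared cache without downloading."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    kwargs = {"cache_dir": str(cache_dir), "local_files_only": True}
    tokenizer = AutoTokenizer.from_pretrained(model_id, **kwargs)
    model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
    return tokenizer, model


if __name__ == "__main__":
    model_id = "microsoft/phi-2"  # repo id taken from the Website column
    if is_cached(model_id):
        tokenizer, model = load_model(model_id)
        print(f"Loaded {model_id} from {HF_CACHE_DIR}")
    else:
        print(f"{model_id} not found under {HF_CACHE_DIR}")
```

Setting `local_files_only=True` keeps the library from contacting the Hugging Face Hub, so a missing model fails fast instead of triggering a download into the shared folder.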

Dataset	Free access	Path on Snellius	Available versions	License	Description	Website	Size
ADE20K	✅	`/projects/2/managed_datasets/ADE20K`	23-02-2024	ADE20K License	ADE20K is composed of more than 27K images from the SUN and Places databases. Images are fully annotated with objects, spanning over 3K object categories. Many of the images also contain object parts, and parts of parts. The original annotated polygons are also provided, as well as object instances for amodal segmentation. Images are also anonymized, blurring faces and license plates.	ADE20K Website	-
AlphaFold	✅	`/projects/2/managed_datasets/AlphaFold`	2.3.1	Apache 2.0	AlphaFold is an AI system developed by DeepMind that predicts a protein’s 3D structure from its amino acid sequence. It regularly achieves accuracy competitive with experiment.	AlphaFold Info	-
BDD100k (Berkeley Deep Drive 100k)	❌	`/scratch-nvme/ml-datasets/bdd100k`	-	BSD 3-Clause License	BDD100K is a diverse driving dataset for heterogeneous multitask learning.	BDD100K Website	2TB
CIFAR10	✅	`/scratch-nvme/ml-datasets/cifar-10`	-	-	CIFAR10 is an image database consisting of 60k 32x32 color images for image classification.	CIFAR10 Info	162MB
CIFAR100	✅	`/scratch-nvme/ml-datasets/cifar-100`	-	-	CIFAR100 is an image database consisting of 60k 32x32 color images in 100 classes for image classification.	CIFAR Info	162MB
Cityscapes	✅	`/scratch-nvme/ml-datasets/cityscapes`	-	Cityscapes License	Cityscapes is a large-scale dataset of stereo street video sequences with 5000 pixel-level annotations and 20k 'weak' annotations. Its primary purpose is to assess semantic segmentation on scene understanding (pixel-level, instance-level, and panoptic).	Cityscapes Info	1.9TB
COCO (Microsoft Common Objects in Context)	✅	`/projects/2/managed_datasets/COCO`	2017	-	MS Coco dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K color images. Most benchmarks are reported on the COCO 2017 images.	COCO Website	46GB
GigaCorpus	✅	`/projects/2/managed_datasets/GigaCorpus`	v1 March 2023	-	With 234GB of varied plain text (as many as 40 billion tokens), this is, at the very least, the largest Dutch corpus. In addition, the corpus is freely available and its quality is relatively high for its size: care has been taken to get the data as clean as possible. The corpus also contains 400 million forum posts in 10 million threads with their timestamps intact for linguistic research.	GigaCorpus Info	500GB
HYPFLOWSCI6	✅	`/projects/2/managed_datasets/hypflowsci6_v1.0`	V1.0	GPLv3	The datapackage HYPFLOWSCI6 (HYdrological Projection of Future gLObal Water States with CMIP6) contains a simulation dataset of global hydrology and water resource conditions covering the historical/past years from 1960 to the future projected period until 2100. The dataset has 5 arc-minute spatial resolution (about 10 km at the equator) and monthly temporal resolution.	HYPFLOWSCI6 Info	-
ImageNet	❌	`/scratch-nvme/ml-datasets/imagenet`	-	ImageNet License	ImageNet is a famous image database of various resolutions for image classification collected from Flickr and other external websites.	ImageNet Info	-
Kinetics	✅	`/scratch-nvme/ml-datasets/kinetics`	kinetics 700-2020	-	Kinetics is a collection of large-scale, high-quality datasets of URL links of up to 650,000 video clips that cover 400/600/700 human action classes, depending on the dataset version. The videos include human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging. Each action class has at least 400/600/700 video clips. Each clip is human annotated with a single action class and lasts around 10 seconds.	Kinetics Info	-
KITTI	✅	`-`	-	KITTI License	KITTI is an image/video dataset from traffic scenarios for computer vision tasks like stereo, optical flow, visual odometry, 3D object detection, 3D tracking, and semantic segmentation (without annotations).	-	-
LLaVA-CC3M-Pretrain-595K	✅	Virtual path (when using Huggingface): `/projects/2/managed_datasets/hf_cache_dir/` Real path (raw images): `/projects/2/managed_datasets/hf_cache_dir/downloads/extracted/30814bc1b79e86b8e7ef21b088d25da3ba559b0b6a36848dfd9ff92e75a62604`	-	-	LLaVA Visual Instruct CC3M Pretrain 595K is a subset of CC-3M dataset, filtered with a more balanced concept coverage distribution. Captions are also associated with BLIP synthetic caption for reference. It is constructed for the pretraining stage for feature alignment in visual instruction tuning.	LLaVA-CC3M-Pretrain-595K Info	-
MNIST	✅	`/scratch-nvme/ml-datasets/MNIST`	-	CC BY-SA License	MNIST is an image database of 70k grayscale handwritten digits under 10 categories (0 to 9) with a fixed resolution 28x28.	MNIST Info	55MB
STL10	✅	`/scratch-nvme/ml-datasets/stl10`	-	-	STL10 is an image database of 96x96 color images for image classification, with 13k labeled images in 10 classes and 100k unlabeled images.	STL10 Info	2.5GB
