Data Pool

The GWDG data pool consists of datasets that are relevant to different user groups over a longer period of time. This includes datasets that are to be shared within a working group or with external parties. The data pool concept is a well-known and simple strategy for organising the sharing of curated datasets. A good example is the system implemented by DKRZ [1].
This includes, for example:
- training datasets for machine learning applications
- open datasets of (inter-)governmental organizations
- open datasets from any HPC user
- project data for other projects to use
- semi-public project data that should only be shared upon application
- And many more!
Usage
Each data pool has a name, a version, content files (data and code), metadata files, and a `README.md` for other users to get started with the dataset. Pool data is either public (everyone on the cluster can access it) or non-public (access is granted to specific other projects and users), but pool metadata and the `README.md` are always public. A website listing all available data pools is planned.
All datasets are centrally located under `/pools/data`. The path to each pool follows the scheme `/pools/data/PROJECT/POOLNAME/POOLVERSION`, where `PROJECT` is the project’s HPC Project ID (see Project Structure), `POOLNAME` is the name of the pool, and `POOLVERSION` is the specific version of the pool.
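As a purely illustrative example, version `1.0` of a hypothetical pool `climate-sim` from a project with ID `abc123` would be found at:

```
# Hypothetical path; substitute the real PROJECT/POOLNAME/POOLVERSION.
ls /pools/data/abc123/climate-sim/1.0
```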
The file structure inside each data pool is
Path | Type | Description |
---|---|---|
public | file | DRAFT ONLY: Optional empty file. If present, the pool is public. |
README.md | file | Documentation for the pool |
METADATA.json | file | Pool metadata |
CITATION.bib | file | BibTeX file with references to cite if using everything in the pool |
GENERATED_METADATA.json | file | GENERATED: Pool metadata that can’t be generated before submission |
CHECKSUMS.EXT | file | GENERATED: Tagged checksums of content/CHECKSUMS.EXT and all top-level files ¹ |
content/ | directory | Directory holding pool content (non-public for non-public pools) |
content/CHECKSUMS.EXT | file | GENERATED: Tagged checksums of every content file ¹ |
content/code/ | directory | Directory holding pool code |
content/code/* | files/dirs | The actual code of the data pool |
content/data/ | directory | Directory holding pool data |
content/data/* | files/dirs | The actual data of the data pool |
¹ `EXT` is an extension based on the checksum algorithm (e.g. `sha256` for SHA2-256).
Creation
Pools must go through several phases.

Figure: Data Pool Workflow. Overview of the data pool creation process.
0. Prerequisite: Registered project
Only projects in the HPC Project Portal (see Project Management) are eligible to create pools. See Getting An Account for information on how to apply for a project.
NHR/HLRN projects created before 2024/Q2 must migrate to the HPC Project Portal before being eligible to create pools. See the NHR/HLRN Project Migration page for information on migration.
1. Requesting data pool staging area
A project’s draft pools are created in a staging area under `/pools/data-pool-staging/PROJECT`.
Initially, projects don’t have access to the staging area.
A project PI can request access to the staging area via a support request (see Start Here for the email address to use).
The request should include a rough estimate of how much disk space and how many files/directories will be used.
If approved, a directory in the staging area is created for the project.
Each individual draft pool in preparation should use a separate subdirectory of the project’s staging directory, specifically `/pools/data-pool-staging/PROJECT/POOL/VERSION`, where `POOL` is the pool name and `VERSION` is its version.
The maximum number of files/directories in a pool is limited in order to preserve I/O performance for everyone using the filesystem. For example, directories with a million files are not allowed because anyone using the pool would degrade the performance of the filesystem for all users. In many cases, it is possible to bundle large sets of small files together into archives.
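A minimal sketch of such bundling with `tar` (the directory name is hypothetical):

```
# Bundle a directory of many small files into one tar archive ...
tar -cf many_small_files.tar many_small_files/
# ... which users can later inspect or extract from selectively.
tar -tf many_small_files.tar | head
```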
2. Building pool draft
Project members set up draft pools in subdirectories of the staging directory (note that each individual pool is treated separately from this point in the workflow). The subdirectory must be `POOL/VERSION` so that the pool name and version can be deduced from the path. A template data pool is provided, which you can copy via:

```
cp -r /pools/data-pool-template/* /pools/data-pool-staging/PROJECT/POOL/VERSION/
```
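Note that `cp` assumes the target directory already exists; a minimal sketch of the full sequence (substitute your own `PROJECT`, `POOL`, and `VERSION`):

```
# Create the draft pool directory first, then copy in the template files.
mkdir -p /pools/data-pool-staging/PROJECT/POOL/VERSION
cp -r /pools/data-pool-template/* /pools/data-pool-staging/PROJECT/POOL/VERSION/
```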
The template contains the basic files that must be filled out for the pool and the directories that must receive files:

- `README.md`
- `METADATA.json`
- `CITATION.bib`
- `content/`
- `content/code/`
- `content/data/`
Make sure to create an empty `public` file if you want the pool to be public (and make sure it doesn’t exist if the pool should not be public). It can be created with

```
touch /pools/data-pool-staging/PROJECT/POOL/VERSION/public
```

and removed with

```
rm /pools/data-pool-staging/PROJECT/POOL/VERSION/public
```

Put the pool code and data into the `content/code/` and `content/data/` subdirectories respectively, whether copying them from elsewhere on the cluster or uploading them to the cluster (more details in our documentation on data transfer).
There are a few hard restrictions (a quick shell check is sketched after the list):

- No files or additional directories directly in `content/`. Everything must go under the `content/code/` and `content/data/` subdirectories.
- All symlinks must be relative links that stay entirely inside the `content/` directory and must eventually terminate on a file/directory (no circular links).
- File, directory, and symlink names must all meet the following requirements:
  - Hard requirements:
    - Must be UTF-8 encoded (ASCII is a subset of UTF-8)
    - Must not contain newline characters (wreaks havoc on Unix shells)
    - Must be composed entirely of printable characters (the only allowed whitespace is the space character)
    - Must not be a single dash `-` or double dash `--` (wreaks havoc when passed to command-line utilities)
    - Must not be a tilde `~` (wreaks havoc on Unix shells)
  - Recommendations:
    - Do not start with a dash `-` (wreaks havoc when passed to command-line utilities)
    - Do not start with a dot `.` (pools shouldn’t have hidden files, directories, and/or symlinks)
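As a rough self-check for some of the naming rules (a sketch only; the validator described below is authoritative), something like the following can be run from the draft pool’s top-level directory:

```
# Print names containing a newline, names that are exactly "-", "--", or "~",
# and names starting with a dash or a dot (the recommendation violations).
find content/ -name $'*\n*' -o -name '-' -o -name '--' -o -name '~' \
    -o -name '-*' -o -name '.*'
```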
You should use the CLI pool validator at `/pools/data-pool-tools/bin/data-pool-tools-validate` to validate the various parts of your draft pool:

```
/pools/data-pool-tools/bin/data-pool-tools-validate [OPTIONS] PATH
```

where `PATH` is the part of your draft pool you want to validate, or the whole draft pool if you give the path to its directory.
The validator autodetects what is being validated based on the specific `PATH`.
See Validator for more information.
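For example, one might validate a single file first and then the whole draft pool (the paths are placeholders):

```
# Validate one part of the draft pool ...
/pools/data-pool-tools/bin/data-pool-tools-validate /pools/data-pool-staging/PROJECT/POOL/VERSION/METADATA.json
# ... or the entire draft pool at once.
/pools/data-pool-tools/bin/data-pool-tools-validate /pools/data-pool-staging/PROJECT/POOL/VERSION
```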
For a straightforward setup of the data pool, we will eventually provide tools (CLI / WebUI / …) to help you prepare the various files in your draft pool and to check the draft pool for problems.
3. Submitting pool for review
A project PI submits the draft pool to become an actual pool by creating a support request (see Start Here for the email address to use). The following additional information must be included in the support request:

- Project ID (if the project’s POSIX group is `HPC_foo`, then the Project ID is `foo`)
- Pool name
- Pool version
- What sorts of other projects would be interested in using the data

The pool’s path must then be `/pools/data-pool-staging/PROJECT/POOL/VERSION`.
Eventually, this will be replaced with a web form.
Once the submission is received, a read-only snapshot of the draft pool is created and the various generated metadata files (checksum files and `GENERATED_METADATA.json`) are generated for review. The validator is then run. If it fails, the submitter is notified and the read-only snapshot is deleted so that they can fix the problems and resubmit. If the validator passes, the submitter is notified and the draft pool goes through the review process:
- All other PIs of the project are notified with the location of the draft pool snapshot and instructions on how to approve or reject the pool.
- If all other PIs have approved the pool, the draft pool goes to the Data Pool Approval Team.
- If the Data Pool Approval Team approves, the pool is accepted. Otherwise, they will contact the PIs.
4. Publishing Pool
Once a pool has been fully approved, it will be published in the following steps:
- The pool is copied to its final location and permissions are configured.
- The pool metadata is added to the pool index within the data catalogue to appear in our Data Lake.
- The draft pool’s read-only snapshot is deleted.

Finally, your data pool is available on our HPC system to all users via a high-speed connection. Anyone can access the data directly using the path to your data pool, `/pools/data/PROJECT/POOL/VERSION`.
5. Editing Pool
Projects are allowed to edit a pool, either by submitting a new version or by correcting an existing one. If you need to edit your data pool, copy it to the staging directory and follow the process from step 3 (Submitting pool for review), except that the following additional pieces of information must be given:

- The pool to be edited must be specified.
- It must be specified whether this is a new version or a replacement for an existing version.
- Specify what has changed (e.g. a changelog).
- If replacing/correcting a version, explain why. This is critical because replacing an existing pool version is a destructive operation that undermines the reproducibility of the scientific results others derived from it. Correcting a data pool version also means that the citation others used to acknowledge the usage of your data is technically no longer correct. The default is therefore to create a new version of the data pool; in-place corrections are performed only in exceptional cases.
6. Regular review
All data pools are periodically reviewed to determine whether each pool should be retained, deleted, or optionally archived once its requested availability window expires.
Managing Access to Non-Public Pools
For pools with non-public data, access to files under `/pools/data/PROJECT/POOL/VERSION/content` is restricted via ACLs.
Read access is granted to all members of the project, plus any additional projects (which must be in the HPC Project Portal) or specific project-specific usernames that a PI specifies.
Initially, changing who else is granted access requires creating a support request (see Start Here for the email address to use).
Eventually, this will be incorporated directly into the HPC Project Portal.
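In the meantime, the currently effective permissions can be inspected with the standard Linux ACL tooling, for example:

```
# Show the ACL entries on a non-public pool's content directory (illustrative).
getfacl /pools/data/PROJECT/POOL/VERSION/content
```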
Data Documentation and Reuse License/s
It is recommended to follow domain specific best practices for data management, such as metadata files, file formats, etc.
While helpful, this is not enough by itself to make a dataset usable to other researchers.
To ensure a basic level of reusability, each data pool has `README.md` and `METADATA.json` files in its top-level directory containing a basic description of the dataset and how to use it.
These files are also critical for informing others which license/s apply to the data. All data must have a license, which should conform to international standards to facilitate re-use and ensure credit to the data creators [2]. Different files can have different licenses, but it must be made clear to users of the pool which license each file uses. Common licenses are:
- The various Creative Commons licenses for text and images
- The various licenses approved by OSI for source code
- CC0 for raw numerical data (not actually copyrightable in many legal jurisdictions, but this makes it so everyone has the same rights everywhere)
In addition, a `CITATION.bib` file is required for correct citation of the dataset when used by other HPC users.
This is a good place for pool authors to place the bibliographic information for the associated paper/s, thesis, or dataset citation, as some journals like Nature provide.
It is for all things that would have to be cited if the whole data pool is used.
If some data requires only a subset of the citations, that is worth mentioning in the documentation (either in the `README.md` or somewhere under the `content/` directory).
The design of these files is strongly inspired by DKRZ [3][4].
All information in the `README.md` and the `METADATA.json` file is publicly available, including the names and email addresses of the PI/s and creator/s of the data pool.
General Recommendations
- Follow good data organization, naming, and metadata practices in your field, taking inspiration from other fields if there are none or if they don’t cover your kind of data.
- Include minimal code examples to use the data. Jupyter notebooks, org-mode files, etc. are encouraged. Please place them in the `content/code/` directory if possible.
- For source code, indicate its dependencies and which environment you have successfully run it in.
- Add metadata inside your data files if they support it (e.g. using Attributes in NetCDF and HDF5 files).
- Provide specifications and documentation for any custom formats you are using.
- Use established data file formats when possible, ideally ones that have multiple implementations and/or are well documented.
- Avoid patent-encumbered formats and codecs when possible.
- Bundle up large numbers of small files into a smaller number of larger files.
- Compress the data when it makes sense (e.g. use PNG or JPEG instead of BMP).
- Avoid spaces in filenames as much as possible (they cause havoc in people’s shell scripts).
- Use UTF-8 encoding and Unix newlines when possible (note that some formats may dictate other ones and some languages require other encodings).
Files and Templates
Submitted: public
This file in a draft pool, if it exists, indicates that the pool is public. If it does not exist, the pool is restricted. The file must have a size of zero. The easiest way to create it is via

```
touch /pools/data-pool-staging/PROJECT/POOL/VERSION/public
```

Note that the file is not copied to snapshots or the final published pool.
In snapshots and published pools, the information on whether the pool is public is instead recorded in `GENERATED_METADATA.json`.
Submitted: README.md
The `README.md` should document the data and its use, so that any domain expert can use the data without contacting the project members.
It must be a Markdown document following the conventions of CommonMark plus GitHub Flavored Markdown (GFM) tables.
It must be UTF-8 encoded with Unix line endings.
The data pool `TITLE` on the first line must be entirely composed of printable ASCII characters.
The template `README.md` structure is

```markdown
# TITLE
## owner / producer of the dataset
## data usage license
## content of the dataset
## data usage scenarios
## methods used for data creation
## issues
## volume of the dataset (and possible changes thereof)
## time horizon of the data set on /pool/data
```
Submitted: METADATA.json
The metadata written by the pool submitters.
It must be a JSON file.
It must be UTF-8 encoded with Unix line endings.
Dates and times must be in UTC.
Dates must be in the `"YYYY-MM-DD"` format and times must be in the `"YYYY-MM-DDThh:mm:ssZ"` format (where `Z` means UTC).
The file should be human readable (please use newlines and indentation).
It is a JSON dictionary of dictionaries.
At the top level are keys of the form `"v_NUM"` indicating a metadata version, under which the metadata for that version of the format are placed.
This versioning allows the format to evolve while making it extremely clear how each field should be interpreted (using the version of the key) and allowing the file to contain more than one version at once for wider compatibility.
At any given time, the submission rules will dictate which version/s the submitters are required/allowed to use.
Many fields are based on the CF Conventions.
Most string fields must be composed entirely of printable characters, with the only allowed whitespace characters being the space and, for some fields, the Unix newline `\n`.
The template `METADATA.json` is

```json
{
    "v_1": {
        "title": "TITLE",
        "pi": ["Jane Doe"],
        "pi_email": ["jane.doe@example.com"],
        "creator": ["Jane Doe", "John Doe"],
        "creator_email": ["jane.doe@example.com", "john.doe@example.com"],
        "institution": ["Example Institute"],
        "institution_address": ["Example Institute\nExample Straße 001\n00000 Example\nGermany"],
        "source": "generated from experimental data",
        "history": "2024-11-28 Created.\n2024-12-02 Fixed error in institution_address.",
        "summary": "Data gathered from pulling numbers out of thin air.",
        "comment": "Example data with no meaning. Never use.",
        "keywords": ["forest-science", "geophysics"],
        "licenses": ["CC0-1.0", "CC-BY-4.0"]
    }
}
```
Version 1
REQUIRED
The key is `"v_1"`. The fields are:
Key | Value type | Description |
---|---|---|
title | string | Title of the pool (must match TITLE in the README.md) |
pi | list of string | Name/s of the principal investigator/s |
pi_email | list of string | Email address/es of the PI/s in the same order as "pi" - used for communication and requests |
creator | list of string | Name/s of the people who made the pool |
creator_email | list of string | Email address/es of the people who made the pool in the same order as "creator" |
institution | list of string | Names of the responsible institutions (mostly the institutions of the PI/s) |
institution_address | list of string | Postal/street address of each institution, properly formatted with newlines (last line must be the country), in the same order as "institution" . Must be sufficient for mail sent via Deutsche Post to arrive there. |
source | string | Method the data was produced (field from CF Conventions) |
history | string | Changelog style history of the data (will have newlines) (field from CF Conventions) |
summary | string | Summary/abstract for the data |
comment | string | Miscellaneous comments about the data |
keywords | list of string | List of keywords relevant to the data |
licenses | list of string | List of all licenses that apply to some part of the data. Licenses on the SPDX License List must use the SPDX identifier. Other licenses must take the form "Other -- NAME" given some suitable name (should be explained in the documentation). |
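Before submission, it can be worth checking that the file parses as JSON at all, for example with `jq` (assuming it is available; this is a sketch, and the validator performs the authoritative checks):

```
# Exit code is non-zero if METADATA.json is not well-formed JSON.
jq empty /pools/data-pool-staging/PROJECT/POOL/VERSION/METADATA.json \
    && echo "METADATA.json is well-formed JSON"
```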
Submitted: CITATION.bib
What paper(s), thesis(es), report(s), etc. should someone cite when using the full dataset? Written by the pool submitter. If it is empty, the dataset cannot be cited in publications without contacting the author(s) (possibly because a publication using it had not yet been published at the time of the pool submission).
It is a BibTeX file.
It must be ASCII encoded with Unix line endings.
The encoding is restricted to ASCII so it is compatible with normal BibTeX.
Otherwise, a user would have to use bibtex8, bibtexu, or BibLaTeX.
See https://www.bibtex.org/SpecialSymbols for how to put various non-ASCII characters into the file.
An example `CITATION.bib` would be

```bibtex
@Misc{your-key,
  author       = {Musterfrau, Erika
                  and Mustermann, Max},
  title        = {Your paper title},
  year         = {2024},
  edition      = {Version 1.0},
  publisher    = {Your Publisher},
  address      = {G{\"o}ttingen},
  keywords     = {ai; llm; mlops; hpc},
  abstract     = {This is the abstract of your paper.},
  doi          = {10.48550/arXiv.2407.00110},
  howpublished = {\url{https://doi.org/10.48550/arXiv.2407.00110}},
}
```
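Since the file must be pure ASCII, a quick check along these lines can catch stray characters before submission (a sketch using GNU grep):

```
# Print any line of CITATION.bib containing a non-ASCII byte.
grep -nP '[^\x00-\x7F]' CITATION.bib || echo "CITATION.bib is pure ASCII"
```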
Generated: GENERATED_METADATA.json
Metadata generated during publication and not controlled by the submitters.
It is a JSON file.
It is UTF-8 encoded with Unix line endings.
Dates and times are in UTC.
Dates are in the `"YYYY-MM-DD"` format and times are in the `"YYYY-MM-DDThh:mm:ssZ"` format (where `Z` means UTC).
It is a JSON dictionary of dictionaries.
At the top level are keys of the form `"v_NUM"` indicating a metadata version, under which the metadata for that version of the format are placed.
This versioning allows the format to evolve while making it extremely clear how each field should be interpreted (using the version of the key) and allowing the file to contain more than one version at once for wider compatibility.
An example `GENERATED_METADATA.json` is

```json
{
    "v_1": {
        "public": true,
        "project_id": "project123",
        "pool_id": "cooldata",
        "version": "0.39",
        "submitter": "Mno Pqr",
        "submitter_email": "pqr@uni.com",
        "commit_date": "2024-12-03",
        "commit_history": [
            ["2023-12-02", "SHA2-256", "193931a1913891aef..."]
        ]
    }
}
```
Version 1
The key is `"v_1"`. The fields are:
Key | Value type | Description |
---|---|---|
public | boolean | Whether the pool is public or not |
project_id | string | The HPC Project ID of the project |
pool_id | string | Pool name, which is the name of its subdirectory |
version | string | Pool version string |
submitter | string | Name of the user who submitted the pool |
submitter_email | string | Email address of the user who submitted the pool |
commit_date | string | Date (UTC) the pool was committed/finalized |
commit_history | list of list | All previous commits as lists of "commit_date" , checksum algorithm, and checksum of the CHECKSUMS.EXT file |
Generated: content/CHECKSUMS.EXT
Checksum file generated during submission containing the tagged checksums of all files under `content/` except itself.
The extension `.EXT` is based on the checksum algorithm (e.g. `.sha256` for SHA2-256).
Tagged checksums include the algorithm on each line and are created by passing the `--tag` option to Linux checksum programs like `sha256sum`.
If a data pool only has the data file `content/data/foo`, the `content/CHECKSUMS.sha256` file would be something like

```
SHA256 (content/data/foo) = 7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730
```
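For illustration, a line in exactly this format is what GNU `sha256sum` produces in tagged mode when run from the pool’s top-level directory:

```
# Produce a tagged checksum line for one content file.
sha256sum --tag content/data/foo
```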
Generated: CHECKSUMS.EXT
Checksum file generated during submission containing the tagged checksums of the following files:

- `README.md`
- `METADATA.json`
- `CITATION.bib`
- `content/CHECKSUMS.EXT`

The extension `.EXT` is based on the checksum algorithm (e.g. `.sha256` for SHA2-256).
Tagged checksums include the algorithm on each line and are created by passing the `--tag` option to Linux checksum programs like `sha256sum`.
An example `CHECKSUMS.sha256` file would be something like

```
SHA256 (README.md) = 7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730
SHA256 (METADATA.json) = d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded977307
SHA256 (CITATION.bib) = 865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded977307d
SHA256 (content/CHECKSUMS.sha256) = 65e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded977307d8
```
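Users of a published pool can verify its integrity against these files; GNU coreutils accepts the tagged format with `--check` (a sketch):

```
# Verify the top-level files (including content/CHECKSUMS.sha256 itself),
# then every file under content/.
cd /pools/data/PROJECT/POOL/VERSION
sha256sum --check CHECKSUMS.sha256
sha256sum --check content/CHECKSUMS.sha256
```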
Get in Contact With Us
If you have any questions that this documentation could not answer, we are happy to be contacted via ticket (email to our support addresses). Please indicate “HPC-Data Pools” in the subject so your request reaches us quickly and without any detours.
[1] https://docs.dkrz.de/doc/dataservices/finding_and_accessing_data/pool-data/index.html
[2] The PI of the data project is responsible for making sure that this is in line with the respective data license.
[3] https://docs.dkrz.de/_downloads/b490cc716f55bf8f6c3cb96e8c30993a/README_HAPPI_POOL_DATA_210429.pdf
[4] https://docs.dkrz.de/_downloads/78422c3b903b2842d21f4b8f2970c95f/application_PoolData_HAPPI_210421.pdf