Data Pool
The GWDG data pool consists of datasets that are relevant to different user groups over a longer period of time. This includes datasets that are to be shared within a working group or with external parties. The data pool concept is a well-known and simple strategy for organising the sharing of curated datasets. A good example is the system implemented by DKRZ[^1].
This includes, for example:
- training data sets for machine learning applications
- open data sets of (inter-)governmental organizations
- open data sets of any HPC users
- project data for other projects to use
- semi-public project data that should be only shared upon application
- and many more!
Usage
Each data pool has a name, a version, data files, metadata files, and a `README.md` for other users to get started with the dataset. Pool data is either public (everyone on the cluster can access it) or non-public (access is granted to specific other projects and users), but pool metadata and the `README.md` are always public. A website listing all available data pools is planned.
All datasets are centrally located under `/pools/data`. The path to each pool follows the scheme

```
/pools/data/PROJECT/POOLNAME/POOLVERSION
```

where `PROJECT` is the project's HPC Project ID (see Project Structure), `POOLNAME` is the name of the pool, and `POOLVERSION` is the specific version of the pool.
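For illustration, using the hypothetical project ID `project123`, pool name `cooldata`, and version `0.39` that appear in the metadata examples further down this page, such a pool would live at:

```
/pools/data/project123/cooldata/0.39
```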
The file structure inside each data pool is:

| Path | Type | Description |
|---|---|---|
| `README.md` | file | Documentation for the pool |
| `METADATA.json` | file | Pool metadata |
| `CITATION.bib` | file | BibTeX file with references to cite if using everything in the pool |
| `GENERATED_METADATA.json` | file | GENERATED: pool metadata that can't be generated before submission |
| `CHECKSUMS.EXT` | file | GENERATED: tagged checksums of `data/CHECKSUMS.EXT` and all top-level files |
| `data/` | directory | Directory holding pool data (non-public for non-public pools) |
| `data/CHECKSUMS.EXT` | file | GENERATED: tagged checksums of every data file |
| `data/*` | files/dirs | The actual data of the data pool |
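Laid out on disk, a published pool using SHA-2-256 checksums would then look something like this (hypothetical names):

```
/pools/data/project123/cooldata/0.39/
├── README.md
├── METADATA.json
├── CITATION.bib
├── GENERATED_METADATA.json
├── CHECKSUMS.sha256
└── data/
    ├── CHECKSUMS.sha256
    └── ...          (the actual data files/directories)
```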
Creation
Pools have to go through several phases.
0. Prerequisite: Registered project
Only projects in the HPC Project Portal (see Project Management) are eligible to create pools. See Getting An Account for information on how to apply for a project.
NHR/HLRN projects created before 2024/Q2 must migrate to the HPC Project Portal before being eligible to create pools. See the NHR/HLRN Project Migration page for information on migration.
1. Requesting data pool staging area
Draft pools are created in a staging area. Initially, projects don’t have access to the staging area. A project PI can request access to the staging area via a support request (see Start Here for the email address to use). The request should include a rough estimate of how much disk space and how many files/directories will be used. If approved, a directory in the staging area is created for the project.
Each individual draft pool in preparation should use a separate subdirectory of the staging directory.
The maximum number of files/directories in a pool is limited in order to preserve I/O performance for everyone using the pool. For example, directories with a million files are not allowed because anyone using the pool would harm the performance of the filesystem for everyone. In many cases, it is possible to bundle large sets of small files together.
2. Building pool draft
Project members set up draft pools in subdirectories of the staging directory (note that each individual pool is treated separately from this point in the workflow). A template data pool is provided, which you can copy via:

```
cp /pools/data-pool-template/* /pools/data-staging/PROJECT/POOL/
```
The template contains the basic files that must be filled out for the pool:

- `README.md`
- `METADATA.json`
- `CITATION.bib`
Put the pool data into the `data/` subdirectory, whether copying it from elsewhere on the cluster or uploading it to the cluster (more details in our documentation on data transfer); a sketch follows below.
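As a minimal sketch, assuming your draft pool lives at `/pools/data-staging/PROJECT/POOL` and your data sits in a hypothetical `/scratch/myproject/results` directory elsewhere on the cluster, `rsync` can copy it in:

```bash
# Create the data/ subdirectory and copy the data into it.
# -a preserves permissions and timestamps; --info=progress2 shows overall progress.
mkdir -p /pools/data-staging/PROJECT/POOL/data
rsync -a --info=progress2 /scratch/myproject/results/ /pools/data-staging/PROJECT/POOL/data/
```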
There are a few hard restrictions (a sketch for checking them follows this list):

- `data/CHECKSUMS.*` is forbidden
- All symlinks must be relative links that stay entirely inside the `data/` directory and must eventually terminate in a file/directory (no circular links)
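A quick sketch using GNU `find` that catches the most common violations; note that it does not detect relative links whose targets escape `data/`, which still need a manual check:

```bash
# Symlinks with absolute targets (forbidden; links must be relative):
find data/ -type l -lname '/*'
# Broken symlinks (forbidden; every link must terminate in a file/directory):
find data/ -xtype l
```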
For a straightforward setup of the data pool, we will eventually provide tools (CLI / WebUI / …) to help you prepare the various files in your draft pool and to check the draft pool for problems. You will also have access to the same draft pool validator program that we use.
3. Submitting pool for review
A project PI submits the draft pool to become an actual pool by creating a support request (see Start Here for the email address to use). The following additional information must be included in the support request:
- Path to the pool draft
- Desired pool name
- Pool version
Eventually, this will be replaced with a web form.
Once the submission is received, a read-only snapshot of the draft pool is created for review and the validator is run. If the validator fails, the submitter is notified and the read-only snapshot is deleted so that they can fix the problems and resubmit. If the validator passes, the submitter is notified and the draft pool goes through the review process:
- All other PIs of the project are notified with the location of the draft pool snapshot and instructions on how to approve or reject the pool.
- If all other PIs have approved the pool, the draft pool goes to the Data Pool Approval Team.
- If the Data Pool Approval Team approves, the pool is accepted. Otherwise, they will contact the PIs.
4. Publishing Pool
Once a pool has been fully approved, it will be published in the following steps:
- A unique pool ID is generated (part of which is taken from the submission metadata).
- The pool is copied to its final location and permissions configured.
- The `data/CHECKSUMS.EXT` and `CHECKSUMS.EXT` files are generated.
- The `GENERATED_METADATA.json` is generated.
- The pool metadata is added to the pool index within the data catalogue to appear in our Data Lake.
- The draft pool's read-only snapshot is deleted.
Finally, your data pool is available on our HPC system to all users via a high-speed connection. Anyone can access the data directly using the path to your data pool, `/pools/data/PROJECT/POOL/VERSION`.
5. Editing Pool
Projects are allowed to edit a pool, either by submitting a new version or a correction. If you need to edit your data pool, copy it to the staging directory and follow the process from step 3 (Submitting pool for review), except that the following additional pieces of information must be given:
- The pool to be edited must be specified.
- It must be specified whether this is a new version or a replacement for an existing version.
- Specify what has changed (e.g. a changelog).
- If replacing/correcting a version, explain why. This is critical because replacing an existing pool version is a destructive operation that undermines the reproducibility of scientific results others derived from it. Correcting a data pool version means that the citation others used to acknowledge the usage of your provided data is technically no longer correct. Therefore, the default is to create a new version of the data pool and only perform in-place corrections in exceptional cases.
6. Regular review
All data pools are periodically reviewed to determine whether the data pool should be retained, deleted, or optionally archived when the requested availability window expires.
Managing Access to Non-Public Pools
For pools with non-public data, access to files under `/pools/data/PROJECT/POOL/VERSION/data` is restricted via ACLs.
Read access is granted to all members of the project, to any additional projects a PI specifies (which must be in the HPC Project Portal), and to specific usernames a PI specifies.
Initially, changing who else is granted access requires creating a support request (see Start Here for the email address to use).
Eventually, this will be incorporated directly into the HPC Project Portal.
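In the meantime, you can inspect the current access control list yourself with the standard `getfacl` tool:

```bash
# Show the ACL entries (users/groups with read access) on a pool's data directory:
getfacl /pools/data/PROJECT/POOL/VERSION/data
```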
Data Documentation and Reuse License/s
It is recommended to follow domain-specific best practices for data management, such as metadata files, file formats, etc.
While helpful, this is not enough by itself to make a dataset usable to other researchers.
To ensure a basic level of reusability, each data pool has `README.md` and `METADATA.json` files in its top-level directory containing a basic description of the dataset and how to use it.
These files are also critical for informing others which license/s apply to the data. All data must have a license, which should conform to international standards to facilitate re-use and ensure credit to the data creators[^2]. Different files can have different licenses, but it must be made clear to users of the pool which license each file uses. Common licenses are:
- The various Creative Commons licenses for text and images
- The various licenses approved by OSI for source code
- CC0 for raw numerical data (not actually copyrightable in many legal jurisdictions, but this makes it so everyone has the same rights everywhere)
In addition, a `CITATION.bib` file is required for correct citation of the dataset when used by other HPC users.
This is a good place for pool authors to put the bibliographic information for the associated paper/s, thesis, or data set citation, as some journals like Nature provide.
This is for all things that would have to be cited if the whole data pool is used.
If some data requires only a subset of the citations, that would be a good thing to mention in the documentation (either in the `README.md` or in some other way under the `data/` directory).
The design of these files is strongly inspired by DKRZ[^3][^4].
All information in the `README.md` and the `METADATA.json` file is publicly available, including the names and email addresses of the PI/s.
General Recommendations
- Follow good data organization, naming, and metadata practices in your field; take inspiration from other fields if your field has none or if they don't cover your kind of data.
- Include minimal code examples to use the data. Jupyter notebooks, org-mode files, etc. are encouraged. Please place them in a `data/code` directory if possible.
- For source code, indicate its dependencies and which environment you have successfully run it in.
- Add metadata inside your data files if they support it (e.g. using Attributes in NetCDF and HDF5 files).
- Provide specifications and documentation for any custom formats you are using.
- Use established data file formats when possible, ideally ones that have multiple implementations and/or are well documented.
- Avoid patent encumbered formats and codecs when possible.
- Bundle up large numbers of small files into a smaller number of larger files (see the sketch after this list).
- Compress the data when possible if it makes sense (e.g. use PNG or JPEG instead of BMP).
- Avoid spaces in filenames as much as possible (they cause havoc for people's shell scripts).
- Use UTF-8 encoding and Unix newlines when possible (note, some formats may dictate other ones and some languages require other encodings).
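For the bundling recommendation, a minimal sketch using hypothetical paths: `tar` packs many small files into one compressed archive that users can inspect and extract with standard tools:

```bash
# Bundle a directory of many small files into a single compressed archive:
tar -czf data/samples.tar.gz -C /path/to/many_small_files .
# Users can list its contents without extracting:
tar -tzf data/samples.tar.gz
```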
Files and Templates
Submitted: README.md
The `README.md` should document the data and its use, so that any domain expert can use the data without contacting the project members.
It must be a Markdown document following the conventions of CommonMark plus GitHub Flavored Markdown (GFM) tables.
It must be UTF-8 encoded with Unix line endings.
The data pool `TITLE` on the first line must be entirely composed of printable ASCII characters.
The template `README.md` structure is:
```markdown
# TITLE
## owner / producer of the dataset
## data usage license
## content of the dataset
## data usage scenarios
## methods used for data creation
## issues
## volume of the dataset (and possible changes thereof)
## time horizon of the data set on /pool/data
```
Submitted: METADATA.json
The metadata written by the pool submitters.
It must be a JSON file, UTF-8 encoded with Unix line endings.
Dates and times must be in UTC.
Dates must be in the `"YYYY-MM-DD"` format and times must be in the `"YYYY-MM-DDThh:mm:ssZ"` format (where `Z` means UTC).
The file should be human readable (please use newlines and indentation).
It is a JSON dictionary of dictionaries.
At the top level are keys of the form `"v_NUM"` indicating a metadata version, under which the metadata for that version of the format are placed.
This versioning allows the format to evolve while making it extremely clear how each field should be interpreted (using the version in the key) and allows the file to contain more than one version at once for wider compatibility.
At any given time, the submission rules will dictate which version/s the submitters are required/allowed to use.
Many fields are based on the CF Conventions.
The template `METADATA.json` is:
```json
{
    "v_1": {
        "title": "...",
        "public": true,
        "pi": ["Abc Def"],
        "pi_email": ["foo@example.com"],
        "creator": ["Mno Pqr", "Abc Def"],
        "creator_email": ["pqr@uni.com", "def@uni.com"],
        "institution": ["GWDG"],
        "institution_address": ["GWDG\n...\nGermany"],
        "source": "generated from experimental data",
        "history": "2024-11-28 Created.\n2024-12-02 Fixed error in institution_address.",
        "references": "Processed by the FOOBAR method from F Baz, et al. Nature 31 (2030) [DOI:a99adf9adf9asdf9asf]",
        "summary": "...",
        "comment": "...",
        "keywords": ["forest-science", "sentinel-2"],
        "licenses": ["CC0-1.0", "CC-BY-4.0"]
    }
}
```
Version 1
The key is `"v_1"`. The fields are:
| Key | Value type | Description |
|---|---|---|
| `title` | string | Title of the pool (must match `TITLE` in the `README.md`) |
| `public` | bool | Whether the data of the pool is public or not |
| `pi` | list of string | Name/s of the principal investigator/s |
| `pi_email` | list of string | Email address/es of the PI/s in the same order as `"pi"`; used for communication and requests |
| `creator` | list of string | Name/s of the people who made the pool |
| `creator_email` | list of string | Email address/es of the people who made the pool in the same order as `"creator"` |
| `institution` | list of string | Names of the responsible institutions (mostly the institutions of the PI/s) |
| `institution_address` | list of string | Postal/street address of each institution, properly formatted with newlines (last line must be the country), in the same order as `"institution"`. Must be sufficient for mail sent via Deutsche Post to arrive there. |
| `source` | string | Method by which the data was produced (field from CF Conventions) |
| `history` | string | Changelog-style history of the data (will have newlines) (field from CF Conventions) |
| `summary` | string | Summary/abstract for the data |
| `comment` | string | Miscellaneous comments about the data |
| `keywords` | list of string | List of keywords relevant to the data |
| `licenses` | list of string | List of all licenses that apply to some part of the data. Licenses on the SPDX License List must use the SPDX identifier. Other licenses must take the form `"Other -- NAME"` given some suitable name (which should be explained in the documentation). |
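As a minimal sketch (assuming version 1 metadata and that `jq` is available), you can check that the file parses and that key fields are present before submitting:

```bash
# Fails with a parse error if the JSON is malformed; prints null for missing fields.
# -e makes jq exit non-zero if the last printed value is null or false.
jq -e '.v_1 | .title, .pi, .pi_email, .licenses' METADATA.json
```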
Submitted: CITATION.bib
What paper(s), thesis(es), report(s), etc. should someone cite when using the full dataset? Written by the pool submitters. If it is empty, the dataset cannot be cited in publications without contacting the author(s) (possibly because a publication using it had not yet been published at the time of the pool submission).
It is a BibTeX file.
It must be ASCII encoded with Unix line endings.
The encoding is restricted to ASCII so it is compatible with normal BibTeX; otherwise, a user would have to use bibtex8, bibtexu, or BibLaTeX.
See https://www.bibtex.org/SpecialSymbols for how to put various non-ASCII characters into the file.
An example `CITATION.bib` would be:

```bibtex
@Misc{your-key,
  author       = {Musterfrau, Erika
                  and Mustermann, Max},
  title        = {Your paper title},
  year         = {2024},
  edition      = {Version 1.0},
  publisher    = {Your Publisher},
  address      = {G{\"o}ttingen},
  keywords     = {ai; llm; mlops; hpc},
  abstract     = {This is the abstract of your paper.},
  doi          = {10.48550/arXiv.2407.00110},
  howpublished = {\url{https://doi.org/10.48550/arXiv.2407.00110}},
}
```
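A quick way to confirm the file is pure ASCII, using GNU `grep` (no output means no violations):

```bash
# Print any lines (with line numbers) containing bytes outside the ASCII range:
grep -nP '[^\x00-\x7F]' CITATION.bib
```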
Generated: GENERATED_METADATA.json
Metadata generated during publication and not controlled by the submitters.
It is a JSON file, UTF-8 encoded with Unix line endings.
Dates and times are in UTC.
Dates are in the `"YYYY-MM-DD"` format and times are in the `"YYYY-MM-DDThh:mm:ssZ"` format (where `Z` means UTC).
It is a JSON dictionary of dictionaries.
At the top level are keys of the form `"v_NUM"` indicating a metadata version, under which the metadata for that version of the format are placed.
This versioning allows the format to evolve while making it extremely clear how each field should be interpreted (using the version in the key) and allows the file to contain more than one version at once for wider compatibility.
An example `GENERATED_METADATA.json` is:
```json
{
    "v_1": {
        "id": "f0o",
        "version": "0.39",
        "project_id": "project123",
        "pool_id": "cooldata",
        "submitter": "Mno Pqr",
        "submitter_email": "pqr@uni.com",
        "commit_date": "2024-12-03",
        "commit_history": [
            ["2023-12-02", "SHA2-256", "193931a1913891aef..."]
        ]
    }
}
```
Version 1
The key is `"v_1"`. The fields are:
| Key | Value type | Description |
|---|---|---|
| `id` | string | Unique persistent identifier for the pool |
| `version` | string | Pool version string |
| `project_id` | string | The HPC Project ID of the project |
| `pool_id` | string | Pool name, which is the name of its subdirectory |
| `submitter` | string | Name of the user who submitted the pool |
| `submitter_email` | string | Email address of the user who submitted the pool |
| `commit_date` | string | Date (UTC) the pool was committed/finalized |
| `commit_history` | list of list | All previous commits as lists of `"commit_date"`, checksum algorithm, and checksum of the `CHECKSUMS.EXT` file |
Generated: data/CHECKSUMS.EXT
Checksum file generated during submission containing the tagged checksums of all files under `data/` except itself.
The extension `.EXT` is based on the checksum algorithm (e.g. `sha256` for SHA-2-256).
Tagged checksums include the algorithm on each line and are created by passing the `--tag` option to Linux checksum programs like `sha256sum`.
If a data pool only has the data file `data/foo`, the `data/CHECKSUMS.sha256` file would be something like:

```
SHA256 (foo) = 7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730
```
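The same tools can verify the data before use; as a sketch, assuming the SHA-2-256 algorithm (recent GNU `sha256sum` understands the tagged, BSD-style format in check mode):

```bash
# Verify every data file against the recorded checksums:
cd /pools/data/PROJECT/POOL/VERSION/data
sha256sum --check CHECKSUMS.sha256
```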
Generated: CHECKSUMS.EXT
Checksum file generated during submission containing the tagged checksums of the following files:

- `README.md`
- `METADATA.json`
- `CITATION.bib`
- `data/CHECKSUMS.EXT`

The extension `.EXT` is based on the checksum algorithm (e.g. `sha256` for SHA-2-256).
Tagged checksums include the algorithm on each line and are created by passing the `--tag` option to Linux checksum programs like `sha256sum`.
An example `CHECKSUMS.sha256` file would be something like:

```
SHA256 (README.md) = 7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730
SHA256 (METADATA.json) = d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded977307
SHA256 (CITATION.bib) = 865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded977307d
SHA256 (data/CHECKSUMS.sha256) = 65e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded977307d8
```
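To check an entire pool, a minimal sketch: verify the top-level files first (which also covers `data/CHECKSUMS.sha256`), then the data files themselves:

```bash
cd /pools/data/PROJECT/POOL/VERSION
# Verify README.md, METADATA.json, CITATION.bib, and data/CHECKSUMS.sha256:
sha256sum --check CHECKSUMS.sha256
# Then verify the actual data files:
(cd data && sha256sum --check CHECKSUMS.sha256)
```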
Get in Contact With Us
If you have any questions left that we couldn't answer in this documentation, we are happy to be contacted via ticket (e-mail to our support addresses). Please indicate “HPC-Data Pools” in the subject so your request reaches us quickly and without any detours.
[^1]: https://docs.dkrz.de/doc/dataservices/finding_and_accessing_data/pool-data/index.html
[^2]: The PI of the data project is responsible for making sure that this is in line with the respective data licence.
[^3]: https://docs.dkrz.de/_downloads/b490cc716f55bf8f6c3cb96e8c30993a/README_HAPPI_POOL_DATA_210429.pdf
[^4]: https://docs.dkrz.de/_downloads/78422c3b903b2842d21f4b8f2970c95f/application_PoolData_HAPPI_210421.pdf