Data Pool

Visualization of a data pool showing the directory tree in a terminal window, two of the data images, a display of some gene sequences, and a diagram of the workflow in an image processing pipeline.

The GWDG data pool consists of datasets that are relevant to different user groups over an extended period of time. This includes datasets that are to be shared within a working group or with external parties. The data pool concept is a well-known and simple strategy for organising the sharing of curated datasets. A good example is the system implemented by DKRZ.

This includes, for example:

  • training data sets for machine learning applications
  • open data sets of (inter-)governmental organizations
  • open data sets of any HPC users
  • project data for other projects to use
  • semi-public project data that should only be shared upon application
  • And many more!

Usage

Each data pool has a name, a version, data files, metadata files, and a README.md for other users to get started with the dataset. Pool data is either public (everyone on the cluster can access them) or non-public (grant access to specific other projects and users), but pool metadata and the README.md are always public. A website listing all available data pools is planned.

All datasets are centrally located under /pools/data. The path to each pool follows the scheme

/pools/data/PROJECT/POOLNAME/POOLVERSION

where PROJECT is the project’s HPC Project ID (see Project Structure), POOLNAME is the name of the pool, and POOLVERSION is the specific version of the pool. The file structure inside each data pool is

| Path | Type | Description |
|------|------|-------------|
| README.md | file | Documentation for the pool |
| METADATA.json | file | Pool metadata |
| CITATION.bib | file | BibTeX file with references to cite if using everything in the pool |
| GENERATED_METADATA.json | file | GENERATED: Pool metadata that can’t be generated before submission |
| CHECKSUMS.EXT | file | GENERATED: Tagged checksums of data/CHECKSUMS.EXT and all top-level files |
| data/ | directory | Directory holding pool data (non-public for non-public pools) |
| data/CHECKSUMS.EXT | file | GENERATED: Tagged checksums of every data file |
| data/* | files/dirs | The actual data of the data pool |
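Once a pool is published, using it only requires knowing its path. A minimal shell sketch of resolving and exploring a pool; the project ID, pool name, and version below are placeholders, not a real pool:

```shell
# Placeholder names -- substitute the actual HPC Project ID, pool name,
# and version of the pool you want to use.
PROJECT=project123
POOLNAME=cooldata
POOLVERSION=1.0
POOL=/pools/data/$PROJECT/$POOLNAME/$POOLVERSION
echo "$POOL"

# Typical first steps on the cluster:
#   cat "$POOL/README.md"   # read the documentation
#   ls  "$POOL/data/"       # inspect the actual data
```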

Creation

Pools have to go through several phases.

Data Pool Workflow

Overview of the data pool creation process.

0. Prerequisite: Registered project

Only projects in the HPC Project Portal (see Project Management) are eligible to create pools. See Getting An Account for information on how to apply for a project.

Warning

NHR/HLRN projects created before 2024/Q2 must migrate to the HPC Project Portal before being eligible to create pools. See the NHR/HLRN Project Migration page for information on migration.

1. Requesting data pool staging area

Draft pools are created in a staging area. Initially, projects don’t have access to the staging area. A project PI can request access to the staging area via a support request (see Start Here for the email address to use). The request should include a rough estimate of how much disk space and how many files/directories will be used. If approved, a directory in the staging area is created for the project.

Each individual draft pool in preparation should use a separate subdirectory of the staging directory.

Info

The maximum number of files/directories in a pool is limited in order to preserve the IO performance of the filesystem for everyone using the pool. For example, directories with a million files are not allowed because anyone using such a pool would degrade filesystem performance for all users. In many cases, it is possible to bundle large sets of small files together.
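One common way to bundle many small files is a tar archive. A minimal sketch, assuming GNU tar and a placeholder directory named samples/:

```shell
# Create a placeholder directory with a few small files.
mkdir -p samples
for i in 1 2 3; do echo "record $i" > "samples/rec_$i.txt"; done

# Bundle the whole directory into a single compressed archive,
# so the pool stores one file instead of many.
tar -czf samples.tar.gz samples/

# Users can list or extract the bundle on demand.
tar -tzf samples.tar.gz
```

Domain-specific container formats (HDF5, NetCDF, zip, etc.) work just as well; the point is to reduce the file count.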

2. Building pool draft

Project members set up draft pools in subdirectories of the staging directory (note that each individual pool is treated separately from this point in the workflow). A template data pool is provided, which you can copy via:

cp /pools/data-pool-template/* /pools/data-staging/PROJECT/POOL/

The template contains the basic files that must be filled out for the pool:

  • README.md
  • METADATA.json
  • CITATION.bib

Put the pool data into the data/ subdirectory, whether copying it from elsewhere on the cluster or uploading it to the cluster (more details in our documentation on data transfer). There are a few hard restrictions:

  • data/CHECKSUMS.* is forbidden
  • All symlinks must be relative links that stay entirely inside the data/ directory and must eventually terminate in a file/directory (no circular links)
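The two simple symlink problems (absolute targets and dangling links) can be spotted with GNU find; a sketch using placeholder files (relative links that escape data/ via .. components need a separate check):

```shell
# Build a small placeholder data/ directory with one good and two bad links.
mkdir -p data/sub
echo hi > data/sub/file.txt
ln -sf sub/file.txt data/ok_link       # relative, stays inside data/ -> fine
ln -sf /tmp         data/bad_absolute  # absolute target -> forbidden
ln -sf missing.txt  data/bad_broken    # dangling target -> forbidden

# Symlinks whose target is an absolute path:
find data -type l -lname '/*'
# Broken (dangling or circular) symlinks:
find data -xtype l
```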

For a straightforward setup of the data pool, we will eventually provide tools (CLI / WebUI / …) to help you prepare the various files in your draft pool and to check the draft pool for problems. You will also have access to the same draft pool validator program that we use.

3. Submitting pool for review

A project PI submits the draft pool to become an actual pool by creating a support request (see Start Here for the email address to use). The following additional information must be included in the support request:

  1. Path to the pool draft
  2. Desired pool name
  3. Pool version

Eventually, this will be replaced with a web form.

Once the submission is received, a read-only snapshot of the draft pool is created for review and the validator is run. If the validator fails, the submitter is notified and the read-only snapshot is deleted so that they can fix the problems and resubmit. If the validator passes, the submitter is notified and the draft pool goes through the review process:

  1. All other PIs of the project are notified with the location of the draft pool snapshot and instructions on how to approve or reject the pool.
  2. If all other PIs have approved the pool, the draft pool goes to the Data Pool Approval Team.
  3. If the Data Pool Approval Team approves, the pool is accepted. Otherwise, they will contact the PIs.

4. Publishing Pool

Once a pool has been fully approved, it will be published in the following steps:

  1. A unique pool ID is generated (part of which is taken from the submission metadata).
  2. The pool is copied to its final location and permissions configured.
  3. The data/CHECKSUMS.EXT and CHECKSUMS.EXT files are generated.
  4. The GENERATED_METADATA.json is generated.
  5. The pool metadata is added to the pool index within the data catalogue to appear in our Data Lake.
  6. The draft pool’s read-only snapshot is deleted.

Finally, your data pool is available to all users on our HPC system with a high-speed connection. Anyone can access the data directly using the path to your data pool: /pools/data/PROJECT/POOL/VERSION.

5. Editing Pool

Projects are allowed to edit a pool, either by submitting a new version or by correcting an existing one. If you need to edit your data pool, copy it to the staging directory and follow the process from step 3 (Submitting pool for review), except that the following additional pieces of information must be given:

  1. The pool to be edited must be specified.
  2. It must be specified whether this is a new version or a replacement for an existing version.
  3. What has changed must be specified (e.g. a changelog).
  4. If replacing/correcting a version, the reason must be explained. This is critical because replacing an existing pool version is a destructive operation that undermines the reproducibility of scientific results others have derived from it: the citation that others used to acknowledge the usage of your data would technically no longer be correct. Therefore, the default is to create a new version of the data pool, and in-place corrections are performed only in exceptional cases.

6. Regular review

All data pools are periodically reviewed to determine whether they should be retained or deleted (or optionally archived) when the requested availability window expires.

Managing Access to Non-Public Pools

For pools with non-public data, access to files under /pools/data/PROJECT/POOL/VERSION/data is restricted via ACL. Read access is granted to all members of the project, any additional projects (which must be in the HPC Project Portal), and specific project-specific usernames that a PI specifies. Initially, changing who else is granted access requires creating a support request (see Start Here for the email address to use). Eventually, this will be incorporated directly into the HPC Project Portal.

Data Documentation and Reuse License/s

It is recommended to follow domain-specific best practices for data management, such as metadata files, file formats, etc. While helpful, this is not enough by itself to make a dataset usable to other researchers. To ensure a basic level of reusability, each data pool has README.md and METADATA.json files in its top-level directory containing a basic description of the dataset and how to use it.

These files are also critical for informing others which license/s apply to the data. All data must have a license, which should conform to international standards to facilitate re-use and ensure credit to the data creators. Different files can have different licenses, but it must be made clear to users of the pool which license each file uses. Common licenses are:

  • The various Creative Commons licenses for text and images
  • The various licenses approved by OSI for source code
  • CC0 for raw numerical data (not actually copyrightable in many legal jurisdictions, but this makes it so everyone has the same rights everywhere)

In addition, a CITATION.bib file is required for correct citation of the dataset when used by other HPC users. This is a good place for pool authors to put the bibliographic information for the associated paper/s, thesis, or dataset citation, as some journals like Nature provide. It should cover everything that would have to be cited if the whole data pool is used. If some data requires only a subset of the citations, that is worth mentioning in the documentation (either in the README.md or somewhere under the data/ directory).

The design of these files is strongly inspired by DKRZ.

Warning

All information in this README.md and the METADATA.json file is publicly available, including the names and email addresses of the PI/s.

General Recommendations

  • Follow good data organization, naming, and metadata practices in your field, taking inspiration from other fields if yours has none or if they don’t cover your kind of data.
  • Include minimal code examples to use the data. Jupyter notebooks, org-mode files, etc. are encouraged. Please place them in a data/code directory if possible.
  • For source code, indicate its dependencies and which environment you have successfully run it in.
  • Add metadata inside your data files if they support it (e.g. using Attributes in NetCDF and HDF5 files).
  • Provide specifications and documentation for any custom formats you are using.
  • Use established data file formats when possible, ideally ones that have multiple implementations and/or are well documented.
  • Avoid patent encumbered formats and codecs when possible.
  • Bundle up large numbers of small files into a smaller number of larger files.
  • Compress the data when possible if it makes sense (e.g. use PNG or JPEG instead of BMP).
  • Avoid spaces in filenames as much as possible (they cause havoc for people’s shell scripts).
  • Use UTF-8 encoding and Unix newlines when possible (note, some formats may dictate other ones and some languages require other encodings).

Files and Templates

Submitted: README.md

The README.md should document the data and its use, so that any domain expert can use the data without contacting the project members.

It must be a Markdown document following the conventions of CommonMark plus GitHub Flavored Markdown (GFM) tables. It must be UTF-8 encoded with Unix line endings. The data pool TITLE on the first line must be entirely composed of printable ASCII characters.

The template README.md structure is

# TITLE

## owner / producer of the dataset

## data usage license

## content of the dataset

## data usage scenarios

## methods used for data creation

## issues

## volume of the dataset (and possible changes thereof)

## time horizon of the data set on /pools/data

Submitted: METADATA.json

The metadata written by the pool submitters.

It must be a JSON file. It must be UTF-8 encoded with Unix line endings. Dates and times must be in UTC. Dates must be in the "YYYY-MM-DD" format and times must be in "YYYY-MM-DDThh:mm:ssZ" format (where Z means UTC). The file should be human readable (please use newlines and indentation).
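The required date and time formats can be produced directly with GNU date, for example:

```shell
# UTC date in "YYYY-MM-DD" form:
date -u +%Y-%m-%d
# UTC timestamp in "YYYY-MM-DDThh:mm:ssZ" form (Z marks UTC):
date -u +%Y-%m-%dT%H:%M:%SZ
```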

It is a JSON dictionary of dictionaries. At the top level are keys of the form "v_NUM" indicating a metadata version, under which the metadata for that version are placed. This versioning allows the format to evolve while making it extremely clear how each field should be interpreted (using the version of the key) and allowing the file to contain more than one version at once for wider compatibility. At any given time, the submission rules will dictate which version/s the submitters are required/allowed to use. Many fields are based on the CF Conventions.

The template METADATA.json is

{
    "v_1": {
        "title": "...",
        "public": true,
        "pi": ["Abc Def"],
        "pi_email": ["foo@example.com"],
        "creator": ["Mno Pqr", "Abc Def"],
        "creator_email": ["pqr@uni.com", "def@uni.com"],
        "institution": ["GWDG"],
        "institution_address": ["GWDG\n...\nGermany"],
        "source": "generated from experimental data",
        "history": "2024-11-28  Created.\n2024-12-02  Fixed error in institution_address.",
        "references": "Processed by the FOOBAR method from F Baz, et al. Nature 31 (2030) [DOI:a99adf9adf9asdf9asf]",
        "summary": "...",
        "comment": "...",
        "keywords": ["forest-science", "sentinel-2"],
        "licenses": ["CC0-1.0", "CC-BY-4.0"]
    }
}
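Before submission it is worth checking that the draft file is at least syntactically valid JSON. A minimal sketch using Python’s built-in parser, shown on a truncated placeholder file (not the full field set):

```shell
# Write a tiny placeholder METADATA.json (only a few of the v_1 fields).
cat > METADATA.json <<'EOF'
{
    "v_1": {
        "title": "Example pool",
        "public": true,
        "keywords": ["example"],
        "licenses": ["CC0-1.0"]
    }
}
EOF

# json.tool exits non-zero (with an error message) on malformed JSON.
python3 -m json.tool METADATA.json > /dev/null && echo "valid JSON"
```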

Version 1

The key is "v_1". The fields are

| Key | Value type | Description |
|-----|------------|-------------|
| title | string | Title of the pool (must match TITLE in the README.md) |
| public | bool | Whether the data of the pool is public or not |
| pi | list of string | Name/s of the principal investigator/s |
| pi_email | list of string | Email address/es of the PI/s in the same order as "pi", used for communication and requests |
| creator | list of string | Name/s of the people who made the pool |
| creator_email | list of string | Email address/es of the people who made the pool in the same order as "creator" |
| institution | list of string | Names of the responsible institutions (mostly the institutions of the PI/s) |
| institution_address | list of string | Postal/street address of each institution properly formatted with newlines (last line must be the country) in the same order as "institution". Must be sufficient for mail sent via Deutsche Post to arrive there. |
| source | string | Method by which the data was produced (field from CF Conventions) |
| history | string | Changelog-style history of the data (will have newlines) (field from CF Conventions) |
| summary | string | Summary/abstract for the data |
| comment | string | Miscellaneous comments about the data |
| keywords | list of string | List of keywords relevant to the data |
| licenses | list of string | List of all licenses that apply to some part of the data. Licenses on the SPDX License List must use the SPDX identifier. Other licenses must take the form "Other -- NAME" given some suitable name (which should be explained in the documentation). |

Submitted: CITATION.bib

What paper(s), thesis(es), report(s), etc. should someone cite when using the full dataset? Written by the pool submitter. If it’s empty, the dataset cannot be cited in publications without contacting the author(s) (possibly because a publication using it hadn’t been published at the time of the pool submission).

It is a BibTeX file. It must be ASCII encoded with Unix line endings. The encoding is restricted to ASCII so it is compatible with normal BibTeX. Otherwise, a user would have to use bibtex8, bibtexu, or BibLaTeX. See https://www.bibtex.org/SpecialSymbols for how to put various non-ASCII characters into the file. An example CITATION.bib would be

@Misc{your-key,
  author =  {Musterfrau, Erika
         and Mustermann, Max},
  title =   {Your paper title},
  year =    {2024},
  edition =     {Version 1.0},
  publisher =   {Your Publisher},
  address =     {G{\"o}ttingen},
  keywords =    {ai; llm; mlops; hpc},
  abstract =    {This is the abstract of your paper.},
  doi =     {10.48550/arXiv.2407.00110},
  howpublished= {\url{https://doi.org/10.48550/arXiv.2407.00110}},
}
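Whether a draft CITATION.bib is really pure ASCII can be checked by deleting every ASCII byte and counting what remains; a sketch assuming a POSIX shell with tr and a placeholder file:

```shell
# Placeholder ASCII-only BibTeX fragment.
printf '@Misc{key,\n  author = {Mustermann, Max},\n}\n' > CITATION.bib

# Delete all ASCII bytes (octal 000-177); anything left over is non-ASCII.
nonascii=$(LC_ALL=C tr -d '\000-\177' < CITATION.bib | wc -c)
if [ "$nonascii" -eq 0 ]; then
    echo "ASCII only"
else
    echo "$nonascii non-ASCII bytes found"
fi
```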

Generated: GENERATED_METADATA.json

Metadata generated during publication and not controlled by the submitters.

It is a JSON file. It is UTF-8 encoded with Unix line endings. Dates and times are in UTC. Dates are in the "YYYY-MM-DD" format and times are in "YYYY-MM-DDThh:mm:ssZ" format (where Z means UTC).

It is a JSON dictionary of dictionaries. At the top level are keys of the form "v_NUM" indicating a metadata version, under which the metadata for that version are placed. This versioning allows the format to evolve while making it extremely clear how each field should be interpreted (using the version of the key) and allowing the file to contain more than one version at once for wider compatibility.

An example GENERATED_METADATA.json is

{
    "v_1": {
        "id": "f0o",
        "version" "0.39"
        "project_id": "project123",
        "pool_id": "cooldata",
        "submitter": "Mno Pqr",
        "submitter_email": "pqr@uni.com",
        "commit_date": "2024-12-03",
        "commit_history": [
            ["2023-12-02", "SHA2-256", "193931a1913891aef..."]
        ]
    }
}

Version 1

The key is "v_1". The fields are

| Key | Value type | Description |
|-----|------------|-------------|
| id | string | Unique persistent identifier for the pool |
| version | string | Pool version string |
| project_id | string | The HPC Project ID of the project |
| pool_id | string | Pool name, which is the name of its subdirectory |
| submitter | string | Name of the user who submitted the pool |
| submitter_email | string | Email address of the user who submitted the pool |
| commit_date | string | Date (UTC) the pool was committed/finalized |
| commit_history | list of list | All previous commits as lists of "commit_date", checksum algorithm, and checksum of the CHECKSUMS.EXT file |

Generated: data/CHECKSUMS.EXT

Checksum file generated during submission containing the tagged checksums of all files under data/ except itself. The extension .EXT is based on the checksum algorithm (e.g. sha256 for SHA-2-256). Tagged checksums include the algorithm on each line and are created by passing the --tag option to Linux checksum programs like sha256sum. If a data pool only has the data file data/foo, the data/CHECKSUMS.sha256 file would be something like

SHA256 (foo) = 7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730

Generated: CHECKSUMS.EXT

Checksum file generated during submission containing the tagged checksums of the following files:

  • README.md
  • METADATA.json
  • CITATION.bib
  • data/CHECKSUMS.EXT

The extension .EXT is based on the checksum algorithm (e.g. sha256 for SHA-2-256). Tagged checksums include the algorithm on each line and are created by passing the --tag option to Linux checksum programs like sha256sum. An example CHECKSUMS.sha256 file would be something like

SHA256 (README.md) = 7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730
SHA256 (METADATA.json) = d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded977307
SHA256 (CITATION.bib) = 865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded977307d
SHA256 (data/CHECKSUMS.sha256) = 65e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded977307d8
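The same GNU coreutils tools that generate these files can also verify them. A sketch with a placeholder data file (the real checksum files are generated during publication, not by the submitter):

```shell
# Placeholder data file.
mkdir -p data
echo "example contents" > data/foo

# Generate a tagged (BSD-style) checksum file, as done during publication.
(cd data && sha256sum --tag foo > CHECKSUMS.sha256)
cat data/CHECKSUMS.sha256

# Anyone using the pool can later verify the data against the checksums;
# GNU sha256sum --check auto-detects the tagged format.
(cd data && sha256sum --check CHECKSUMS.sha256)
```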

Get in Contact With Us

If you have any questions that we couldn’t answer in this documentation, we are happy to be contacted via ticket (e-mail to our support addresses). Please indicate “HPC-Data Pools” in the subject so that your request reaches us quickly and without any detours.