GöDL - Data Catalog
The GöDL Data Catalog is a standalone service that can index all files stored on our HPC systems. It makes files findable and accessible via semantic and domain-specific metadata. Therefore, users do not have to remember explicit paths or create complicated directory trees to encode metadata within paths and filenames.
If you want access to your own namespace, you can request access via mail to support@gwdg.de using the subject “Access to GöDL HPC”.
Current Situation
Currently, users rely on hierarchical folder structures and well-defined filenames to organize their data with respect to domain-specific metadata. However, this can lead to confusion since our HPC systems provide multiple storage systems, so data can be distributed. Users can understandably struggle to get a global view of their data across all provided filesystems. In addition, the access pattern is often very inefficient. Such an example is shown in the image below, where users may have to remember the exact storage location or traverse the tree to find the data they are looking for.

What Are the Challenges with a Nested Folder Structure?
- Difficult to Find Data: Users must remember storage locations, which is not always intuitive
- No Efficient Search Methods: Finding specific data requires manually searching through directories
- Does Not Scale Well: As data grows, managing it manually becomes impractical
- Complex File Structures: Can lead to confusion and inefficient access patterns.
Why Use a Data Catalog?
Data catalogs help to index data based on user-provided metadata, which enables efficient and user-friendly searching for data. In addition, the Data Catalog helps to manage data throughout its lifecycle by providing commands to move, stage, and delete files.
What Are the Benefits of Using a Data Catalog?
- Improved Searchability: Quickly search and find relevant data using metadata or tags.
- Better Organization: Organize data using metadata, making it easier to categorize and access.
- Scalability: A Data Catalog grows with your data, making it easier to manage large datasets over time.
Usage Example With and Without a Data Catalog
Goal: A literature researcher is searching for datasets of books published between 2005 and 2007 in the Horror genre.
Scenario Without a Data Catalog: The researcher must rely on a manually structured folder hierarchy, such as the folder structure shown above in the Current Situation section.
To find the relevant datasets, the researcher must:
- Search for Horror books within each year’s folder.
- Cross check multiple directories to ensure all relevant books are found.
- Repeat the process if new books are added later.
Scenario With a Data Catalog: A Data Catalog allows the researcher to find relevant books instantly by searching metadata instead of browsing through directories.
Search Query Example: "year=2005-2007,genre=Horror"
Conclusion: A Data Catalog eliminates this manual workload, allowing researchers to focus on their core analysis.
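With the goedl CLI described below, such a search could look like this (assuming the datasets were annotated with year and genre attributes):
goedl --list "year=2005-2007,genre=Horror"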
Preparation
Generate JSON Files
To ingest data into the Data Catalog, users can either create a JSON file for each data file they want to register or annotate each file manually. The first option provides convenient capabilities for a bulk upload. The JSON file must contain the metadata and must be placed in the same folder as the corresponding data file. It is also crucial to save the JSON file under the same base name as the corresponding data file, i.e., for a data file called example.txt, the corresponding metadata must be located in a file example.json.
The metadata users save for their data is not predefined, allowing users to define domain-specific metadata based on their needs. This flexibility ensures that different research fields can include relevant metadata attributes.
For clarity, an example metadata file is given below:
{
  "researchfield": "Cardiology",
  "age": "45",
  "gender": "Male"
}
The descriptive metadata in this JSON file serves as searchable keywords in the Data Catalog. Users can query datasets based on these attributes.
Ideally, these JSON files can be created automatically!
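As a starting point, a small shell script can generate the sidecar files in bulk. The following is a minimal sketch that writes one JSON file per .txt data file in a folder; the metadata values and the file extension are placeholders you would adapt to your own workflow:
for datafile in ./test-data/*.txt; do
    # Write a sidecar file with the same base name as the data file.
    cat > "${datafile%.txt}.json" <<EOF
{
  "researchfield": "Cardiology",
  "age": "45",
  "gender": "Male"
}
EOF
done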
Usage
All commands have the same structure when using the goedl CLI tool:
- goedl: An alias or shortcut for executing the Python CLI script (cli.py) and loading your config data (goedl.json).
- operation: A placeholder for the operation you want to execute (e.g., --delete, --ingest-folder, --stage, --migrate, --annotate).
- parameter: A placeholder for any additional arguments or parameters required by the chosen operation.
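Putting these parts together, a typical invocation has the following shape (the operation, query, and parameters are placeholders):
goedl --<operation> "<metadata query>" <additional parameters>
For example, listing at most five matching datasets:
goedl --list "Year=2002" --size 5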
Syntax in Commands
Please use the following syntax rules when specifying commands:
- = (Equals): Used to specify exact matches. For example, "Year=2002" searches for all data where the year is exactly 2002.
- =< (Less Than or Equal to): Specifies that a value is less than or equal to the given value. For example, "Year=<2005" returns data from 2005 and earlier.
- => (Greater Than or Equal to): Specifies that a value is greater than or equal to the given value. For example, "Year=>2002" returns data from 2002 and later years.
- from-to (Range): Used to define a range between two values. For example, "Year=2002-2005" searches for data within this specific range of years.
Additional Rules
- Multiple conditions in a query must be separated by a comma without a space
- Correct: “Year=2002,Year=2005”
- Incorrect: “Year=2002, Year=2005”
- The entire query must be enclosed in quotes "…"
- Correct: “Year=2002”
- Incorrect: Year=2002
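For illustration, the following queries combine these rules; the attribute names are only examples and must match the metadata you ingested:
"Year=2002-2005,genre=Horror" matches all Horror data between 2002 and 2005
"PatientAge=>18,gender=Male" matches all male patients aged 18 or older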
Ingest Data
The operation ingest-folder is used to ingest the JSON metadata files created in the previous step. The command expects a folder path where each data file has a corresponding JSON sidecar file with the necessary metadata. If the target folder is nested, the tool recursively traverses all paths.
Command Example:
goedl --ingest-folder ./test-data/
This command ingests all data from the folder test-data.
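For example, a folder prepared for ingestion could look like this (file and folder names are illustrative):
test-data/
├── example.txt
├── example.json
└── scans/
    ├── patient_01.dcm
    └── patient_01.json
Each .json file contains the metadata for the data file with the same base name, and nested folders such as scans/ are traversed automatically.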
Annotate Data
Similar to the ingest command, you can also manually annotate data by adding metadata to a specific data object. This allows you to enrich existing datasets with additional descriptive attributes or to ingest new files if there are not yet any annotations available.
Example Command:
goedl --annotate "Season=Winter,Vegetation=Subtropical" --file ~/test/no_trees/Industrial_4.jpg
This command adds the metadata "Season=Winter" and "Vegetation=Subtropical" to the file Industrial_4.jpg.
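Once annotated, these attributes are searchable like any other metadata; for example, the following query (using the list operation described below) should now include the annotated file:
goedl --list "Season=Winter,Vegetation=Subtropical"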
Listing Available Data
The operation list lists all data that matches a given descriptive metadata query. The query is determined by the search parameter provided by the user.
Example Command:
goedl --list "PatientAge=25-30"
This command lists all datasets where the metadata contains the attribute PatientAge and its value is in the range of 25 to 30.
Optional
Limiting the Output Size: You can limit the number of returned results using the --size argument.
Example Command:
goedl --list "PatientAge=25-30" --size 3
This command limits the output to 3 results.
Displaying Full Technical Metadata: By default, only descriptive metadata is shown. If you need the full technical metadata, use the --full argument.
Example Command:
goedl --list "PatientAge=25-30" --full
This will display all stored metadata for each matching data object, including technical attributes stored in the database.
Data Staging
Before processing data in a job, it is highly recommended to stage the data into hot storage. This improves accessibility for compute nodes and enhances performance. You can learn more about the staging process here.
The operation stage copies all data matching the defined query to the specified target directory.
Example Command:
goedl --stage "PatientAge=25-30" --target ./test-stage/
This command copies all datasets where the metadata attribute PatientAge is in the range of 25 to 30 into the directory ./test-stage/.
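As a sketch of how staging can fit into a batch job, a Slurm script might stage the required data before the actual computation starts; the partition, time limit, and analysis program below are placeholders:
#!/bin/bash
#SBATCH --partition=medium
#SBATCH --time=01:00:00
#SBATCH --nodes=1
# Stage the required datasets into hot storage before processing
goedl --stage "PatientAge=25-30" --target ./test-stage/
# Run the analysis on the staged copies (placeholder program)
python analyze.py ./test-stage/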
Data Migration
The operation migrate moves data matching the defined query after the migrate statement to the specified target directory. Here, the handle is updated, meaning that the specified target directory will become the new source storage for the specified data.
Example Command:
goedl --migrate "PatientAge=25-30" --target ./test-migrate/
This command moves all datasets where the metadata attribute PatientAge is in the range of 25 to 30 to ./test-migrate/ and updates the reference to reflect the new storage location.
Important Notice:
- After migration, the data will no longer reside in its previous location.
- The metadata in the Data Catalog is automatically updated.
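To check where the data now resides, you can inspect the full technical metadata of the affected datasets, for example:
goedl --list "PatientAge=25-30" --full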
Delete Data
The delete operation removes all data that matches the specified key-value query.
Example Command:
goedl --delete "Region=Africa"
This command permanently removes all datasets where the metadata contains “Region=Africa”.
Deleting data will also remove its associated metadata from the catalog. Ensure that you no longer need the data before executing this command.
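A cautious workflow is to preview the matching datasets first and only delete them once you are sure the query selects the right data:
goedl --list "Region=Africa"
goedl --delete "Region=Africa"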
All user operations are defined in the CLI and can be accessed through the command: goedl --help
Addendum - Config file
This JSON file, named goedl.json, is stored in the home directory because the script only searches for this file in that location. Below is an example of the config file:
{
  "config1": {
    "username": "{username}",
    "password": "{password}",
    "index": "{MyIndex}",
    "url": "{https://es.gwdg.de}"
  }
}
Further configuration entries can be added following the same pattern.
:warning: Warning: Do not delete or modify the config file, as it is crucial for using the Data Catalog.