Data Migration Guide

Having a separate username for each project has upsides, such as separate data store quotas and never having to worry about submitting jobs with the wrong Slurm account. A major downside is that files must sometimes be copied or moved between usernames. Common scenarios are:

  • Copying scripts or configuration files in your HOME directory that took effort to create (e.g. .bashrc, .config/emacs/init.el, etc.)
  • Moving files from your legacy username to a new project-specific username
  • Moving files from your username of an expired project to the new username of a successor project

In all cases, you must pay attention to which data stores are available on each island of the cluster. Data can only be transferred between two data stores from a node that can access both. See Cluster Storage Map for information on which data stores are available where. If there is no node that can access both, you might have to use an intermediate data store.
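Before starting a transfer, it can help to confirm that the node you are logged in to actually sees both data stores. The following is a minimal sketch only: the helper name check_both_stores and the temporary stand-in directories are hypothetical, and on the cluster you would pass your real source and destination paths instead.

```shell
# Hypothetical helper: succeeds only if the source is a readable directory
# and the destination is a writable directory on the current node.
check_both_stores() {
    local src=$1
    local dst=$2
    if [ -d "$src" ] && [ -r "$src" ] && [ -d "$dst" ] && [ -w "$dst" ]; then
        echo "ok: both data stores are reachable from $(uname -n)"
    else
        echo "not usable from $(uname -n): check the storage map" >&2
        return 1
    fi
}

# Stand-in directories for demonstration only
src_store=$(mktemp -d)
dst_store=$(mktemp -d)
check_both_stores "$src_store" "$dst_store"
```

If the check fails on one node, try a login node of another island or plan for an intermediate data store.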

Note

This topic requires at least basic understanding of POSIX permissions and groups. Refer to the following links for further information:

https://en.wikipedia.org/wiki/File-system_permissions#Notation_of_traditional_Unix_permissions
https://en.wikipedia.org/wiki/Chmod
https://en.wikipedia.org/wiki/Unix_file_types#Representations

Warning

Only the username owning a file/directory can change its owner, group, or ACLs. Even other usernames attached to the same Academic ID are unable to, because the operating system does not know that the different usernames are aliases for the same person.

Warning

Many directories have both a logical path like /user/username/u12345 and a real path that points to the actual location on the filesystem.

Please always operate on the real paths, which are directories you can actually modify. For example, the symbolic links below the /user or /projects directories cannot be modified by users.

You can find out the real path with the following command:

realpath /path/to/directory

As a quick way to resolve the logical path to the real path, you can also append a / to the end of the path when using it in the following commands.
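The difference between a logical path and its real path can be tried out safely in a temporary directory. The sketch below uses hypothetical stand-in names (real_home, logical) that are not part of the cluster layout.

```shell
# Stand-in layout: "logical" is a symlink to the directory "real_home"
base=$(realpath "$(mktemp -d)")
mkdir "$base/real_home"
ln -s "$base/real_home" "$base/logical"

# realpath walks through the symlink and prints the actual location
resolved=$(realpath "$base/logical")
echo "$resolved"

# Without a trailing slash, ls -ld describes the symlink itself;
# with a trailing slash, it describes the directory behind it
ls -ld "$base/logical"
ls -ld "$base/logical/"
```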

Determine Method to Use

The exact method to use depends on the source and destination data stores of the data to be migrated. See the table below to find the easiest method to do each kind of migration.

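Source                    | Destination                                         | Method
--------------------------|-----------------------------------------------------|-------------------------------------
project-specific username | project-specific username with same Academic ID     | Shared AcademicID Group
project-specific username | Academic ID username (legacy SCC)                   | Shared AcademicID Group
project-specific username | your legacy HLRN/NHR username                       | Get Added to Shared AcademicID Group
project-specific username | any other username                                  | ACL
legacy SCC username       | project-specific username with the same Academic ID | Shared AcademicID Group
legacy SCC username       | any other username                                  | ACL
legacy HLRN/NHR username  | your project-specific usernames                     | Get Added to Shared AcademicID Group
legacy HLRN/NHR username  | your legacy SCC username                            | Get Added to Shared AcademicID Group
legacy HLRN/NHR username  | any other username                                  | ACL
project                   | any other project                                   | Between Projects
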
Info

ACLs work on most data stores (some don’t support them) and are the most powerful method, but they are also more complex. You can often use them, but we recommend using the other methods when possible. The data stores they do not work on are:

Get Added to Shared AcademicID Group

Legacy NHR/HLRN usernames have different AcademicIDs than the AcademicID used for project-specific usernames in the HPC Project Portal and legacy SCC usernames. Thus, the Shared AcademicID Group method cannot be used directly. However, our support team can add the legacy HLRN/NHR username to your shared AcademicID POSIX group (HPC_u_<academicid>) if you write a support ticket. Make sure to write from the email address associated with the accounts, or from one of them if different email addresses are associated with each. This proves you control one or both (you may be asked for additional information to prove you own the other if they are associated with different email addresses). Make sure to clearly state both your legacy HLRN/NHR username and the AcademicID whose POSIX group it should be added to. Once your legacy HLRN/NHR username has been added to the HPC_u_<academicid> POSIX group, proceed to the Shared AcademicID Group method.

Using Shared AcademicID Group

Every AcademicID with at least one project-specific username in the HPC Project Portal has a shared POSIX group of the form HPC_u_<academicid>. All of that AcademicID’s project-specific usernames as well as the AcademicID itself (legacy SCC username) are members of this shared POSIX group. For example, John Doe with AcademicID jdoe is a member of two projects and thus has two project-specific usernames, u12345 and u56789. He will have a group called HPC_u_jdoe with three members: jdoe, u12345, and u56789. This shared POSIX group is provided to facilitate easy data migration between these usernames without the risk of opening up your data stores to other people by accident.
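To confirm that the username you are logged in with is a member of the shared group, you can list its groups with id. This is a small sketch: on the cluster, group_name would be HPC_u_<academicid>; here the current primary group is used as a stand-in so the snippet runs anywhere.

```shell
# Stand-in: replace with HPC_u_<academicid> on the cluster
group_name=$(id -gn)

# id -Gn prints all groups of the current username, separated by spaces
if id -Gn | tr ' ' '\n' | grep -qxF -- "$group_name"; then
    member=yes
else
    member=no
fi
echo "member of $group_name: $member"
```

On Linux systems, a newly added group membership usually only shows up in sessions started after the change, so log in again before checking.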

To grant access to a file/directory to the other usernames in the shared AcademicID POSIX group, you would do the following using the username that owns the directories/files (your other usernames lack the permissions):

chgrp [OPTIONS] HPC_u_<academicid> <path>
chmod [OPTIONS] g+<perms> <path>

If the <path> is a directory, you should generally add the -R option to make the command apply the group/permissions recursively to subdirectories and files. <perms> should be rX for read-only access and rwX for read-write access, where the capital X gives execute permissions only to directories and to files that are already executable (typically by the owner).

Warning

Please do NOT use a lower-case x in <perms> as that gives execute permissions to all files, even those that should not be executable. Having random files be executable without good reason is confusing in the best case and a potential security risk and risk to your data in the worst.
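The effect of the capital X can be checked on a throwaway directory. The following sketch uses hypothetical file names (data.txt, run.sh) and assumes GNU coreutils, since stat -c is the GNU form of the command.

```shell
demo=$(mktemp -d)
mkdir "$demo/sub"
touch "$demo/sub/data.txt" "$demo/sub/run.sh"
chmod 700 "$demo/sub" "$demo/sub/run.sh"   # directory and script: owner rwx
chmod 600 "$demo/sub/data.txt"             # plain data file: owner rw only

# Capital X: search permission for the directory, execute only for run.sh
chmod -R g+rX "$demo/sub"

# Print octal mode and name: sub and run.sh become 750, data.txt becomes 640
stat -c '%a %n' "$demo/sub" "$demo/sub/data.txt" "$demo/sub/run.sh"
```

With a lower-case x instead, data.txt would end up as 650, that is, group-executable without being a program.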

Info

It is important to remember, your other usernames can’t access <dir>/<file> unless they can also access <dir>, so always be mindful of the parent directory/ies.

Since symlinks are used for many data stores, end <path> with a / when it is a directory, or use $(realpath <path>) to get the fully resolved destination after walking through all symlinks. Otherwise, the commands will operate on the symlink and not on the destination. For example, /user/jdoe/u12345 would be a symlink to u12345’s HOME directory, so if you wanted to share that with your other usernames in the same HPC_u_jdoe group, you would either run

chgrp -R HPC_u_jdoe /user/jdoe/u12345/

or

chgrp -R HPC_u_jdoe $(realpath /user/jdoe/u12345)

To give the destination username read-only access to the source, do the following:

  1. Login with the username of the source
  2. Change the group of the source to HPC_u_<academicid> (recursively)
  3. Add g+rX permissions to the source directory (recursively)
  4. If you are sharing a subdirectory in your data store, you will need to change the group of the parent directory/directories and add the permission g+X (non-recursively)

For example, suppose John Doe wants to give his other usernames access to the .config subdirectory of the HOME directory of his legacy SCC username, so the configuration files can be copied over. John Doe would do this by logging in with jdoe and running:

[jdoe@gwdu101 ~]$ chgrp HPC_u_jdoe ~/
[jdoe@gwdu101 ~]$ chgrp -R HPC_u_jdoe ~/.config
[jdoe@gwdu101 ~]$ chmod g+rX ~/
[jdoe@gwdu101 ~]$ chmod -R g+rX ~/.config

See Tips and advanced commands if you have a very large number of files/directories and the above commands are taking a long time.

Then, John Doe could access the files from his u12345 username like

[scc_cool_project] u12345@gwdu101 ~ $ cp -R /usr/users/jdoe/.config/emacs ~/.config/
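If the copy fails with a "Permission denied" error, a parent directory is often missing the group or the search permission. The sketch below prints mode, group, and path for a directory and each of its parents so the missing piece can be spotted. The helper name show_parents and the stand-in directory are hypothetical, and stat -c assumes GNU coreutils; where util-linux is installed, namei -l prints similar information.

```shell
# Hypothetical helper: print mode, group, and name for a path and each parent
show_parents() {
    local p
    p=$(realpath "$1")
    while :; do
        stat -c '%A %G %n' "$p"
        [ "$p" = "/" ] && break
        p=$(dirname "$p")
    done
}

# Stand-in for a shared subdirectory such as ~/.config/emacs
target=$(mktemp -d)
mkdir -p "$target/.config/emacs"
show_parents "$target/.config/emacs"
```

Every line from the shared directory up to the top of your data store should show the shared group and at least an x in the group triplet.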

If John wants to keep using the shared directory to create new files with the source username but by default grant access to the other usernames in HPC_u_<academicid>, he could also set the SGID-bit on the shared directory, so any newly created files will also be owned by the correct group automatically:

find <path> -type d -exec chmod g+s {} \;
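To see what the command does, it can be run on a throwaway directory tree first. This is a minimal sketch with hypothetical directory names; the inheritance behaviour shown in the last step is what Linux filesystems normally do.

```shell
shared=$(mktemp -d)
mkdir -p "$shared/a/b"

# Set the SGID bit on the directory and every subdirectory
find "$shared" -type d -exec chmod g+s {} \;

# An "s" or "S" in the group triplet of the mode marks the SGID bit
ls -ld "$shared" "$shared/a" "$shared/a/b"

# Subdirectories created afterwards inherit the bit (and the group)
mkdir "$shared/a/new"
ls -ld "$shared/a/new"
```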

Using ACLs

Data can be migrated using ACLs (Access Control Lists) on most data stores (some don’t support them), but it is more complex. ACLs can be more powerful than regular POSIX permissions, but are not immediately visible and can easily lead to confusion or mistakes. ACLs should be avoided unless you can’t use the easier Shared AcademicID Group or Get Added to Shared AcademicID Group methods.

The basic idea with ACLs is that you can give additional r/w/x permissions to specific users or groups without changing the file/directory owner, group, and main permissions. You can think of them as giving files/directories secondary users and groups with their own separate permissions.

Warning

You must use the username that owns the files/directories to add ACLs to them.

ACLs are added with the setfacl command like

setfacl [OPTIONS] -m <kind>:<name>:<perms> <path>

and removed like

setfacl [OPTIONS] -x <kind>:<name> <path>

where <kind> is u for a user and g for a group, <name> is the username or group name, and <perms> is the permissions. For <perms>, use r for read access, w for write access, and capital X to grant execute permissions only if the path is a directory or already has execute permissions (typically for the owner). Add the -R option to apply the ACL recursively to subdirectories and files.

Warning

Please do NOT use a lower-case x in <perms> as that gives execute permissions to all files, even those that should not be executable. Having random files be executable without good reason is confusing in the best case and a potential security risk and risk to your data in the worst.

Info

It is important to remember, your other usernames can’t access <dir>/<file> unless they can also access <dir>, so always be mindful of the parent directory/ies.

Since symlinks are used for many data stores, end <path> with a / when it is a directory, or use $(realpath <path>) to get the fully resolved destination after walking through all symlinks. Otherwise, the commands will operate on the symlink and not on the destination. For example, /user/jdoe/u12345 would be a symlink to u12345’s HOME directory, so if you wanted to share that with username u31913, you would either run

setfacl -m u:u31913:rX /user/jdoe/u12345/

or

setfacl -m u:u31913:rX $(realpath /user/jdoe/u12345)

You can see if a file/directory has ACLs when using ls -l by looking for a + sign at the end of the permissions column. ACLs can be displayed using getfacl. The following example demonstrates making two files bar and baz in subdirectory foo, adding an ACL to bar, showing the permissions with ls -l, and then reading the ACLs on bar:

[gzadmfnord@glogin4 ~]$ mkdir foo
[gzadmfnord@glogin4 ~]$ cd foo
[gzadmfnord@glogin4 foo]$ touch bar baz
[gzadmfnord@glogin4 foo]$ setfacl -m u:fnordsi1:r bar
[gzadmfnord@glogin4 foo]$ ls -l
total 0
-rw-r-----+ 1 gzadmfnord gzadmfnord 0 May 21 15:56 bar
-rw-r-----  1 gzadmfnord gzadmfnord 0 May 21 15:56 baz
[gzadmfnord@glogin4 foo]$ getfacl bar
# file: bar
# owner: gzadmfnord
# group: gzadmfnord
user::rw-
user:fnordsi1:r--
group::r--
mask::r--
other::---

To give the destination username read-only access to the source, do the following:

  1. Login with the username of the source
  2. Add an rX ACL for the destination username to the source directory (recursively)
  3. If you are sharing a subdirectory in your data store, you will need to add an X ACL for the destination username to the parent directory/directories (non-recursively)

For example, suppose John Doe wants to give his project-specific username u12345 access to the .config subdirectory of the HOME directory of his legacy HLRN/NHR username nibjdoe, so the configuration files can be copied over. John Doe would do this by logging in with nibjdoe and running:

[nibjdoe@glogin3 ~]$ setfacl -m u:u12345:rX ~/
[nibjdoe@glogin3 ~]$ setfacl -R -m u:u12345:rX ~/.config

Then, John Doe could access the files from his u12345 username like

[nib30193] u12345@gwdu101 ~ $ cp -R /mnt/vast-nhr/home/nibjdoe/.config/emacs ~/.config/

If John wants to keep using the shared directory to create new files with the source username but by default grant access to u12345, he could set a default ACL on the shared directory, so any newly created files and directories will automatically have the same ACL:

setfacl -R -d -m u:u12345:rX <path>

where the -d option is used to specify that the default ACL should be changed. If you want to remove a default ACL, you also need to include the -d option.
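Adding and removing a default ACL can be tried on a throwaway directory. The following sketch makes several assumptions: the current username stands in for the destination username, the variable acl_demo is purely illustrative, and the snippet skips itself if setfacl/getfacl are not installed or the filesystem does not support ACLs.

```shell
acl_demo=skipped
if command -v setfacl >/dev/null 2>&1 && command -v getfacl >/dev/null 2>&1; then
    shared=$(mktemp -d)
    me=$(id -un)    # stand-in for the destination username, e.g. u12345

    # Add a default ACL entry; this fails if the filesystem lacks ACL support
    if setfacl -d -m "u:${me}:rX" "$shared" 2>/dev/null; then
        # Default entries are listed with a "default:" prefix
        getfacl -p "$shared"

        # Remove the default entry again; note that -d is required here too
        setfacl -d -x "u:${me}" "$shared"
        if getfacl -p "$shared" | grep -q "^default:user:${me}:"; then
            acl_demo=still_present
        else
            acl_demo=removed
        fi
    fi
fi
echo "default ACL demo: $acl_demo"
```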

Between Projects

Have A Username in Both Projects

With your username that is a member of both projects, you can just copy the data from one to the other with rsync or cp, as long as the data isn’t too large.

Warning

If the data is very large, this will be very slow and may harm filesystem performance for everyone. In this case, please write a support ticket so the best way to copy or move the data can be found (the admins have additional more efficient ways to transfer data in many cases).

Have Different Usernames in Both Projects

If the data is small, it can be transferred via an intermediate hop. If the source project is A and the destination project is B and your usernames in both are userA and userB respectively, this would be done by:

  1. Copy the data from the datastore of A to a user datastore of userA.
  2. Share the data from the datastore of userA with username userB using the respective method in the table above.
  3. Using username userB, copy the data from the datastore of userA to the destination datastore of project B.

Otherwise, please write a support ticket so a suitable way to migrate the data can be found. Make sure to indicate the source and destination, their projects, and your usernames in each.
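For small data, the three steps of the intermediate hop can be sketched as follows. All paths are hypothetical stand-ins created in a temporary directory, and the current primary group stands in for the shared group; on the cluster, steps 1 and 2 would be run as userA and step 3 as userB, with the sharing method chosen from the table above.

```shell
# Stand-ins for the three data stores involved
work=$(mktemp -d)
project_a="$work/projectA_store"
user_a_store="$work/userA_store"
project_b="$work/projectB_store"
mkdir -p "$project_a/dataset" "$user_a_store" "$project_b"
echo "results" > "$project_a/dataset/results.txt"

# Step 1 (as userA): copy from project A into a data store of userA
cp -R "$project_a/dataset" "$user_a_store/"

# Step 2 (as userA): share the copy; the group here is only a stand-in
# for HPC_u_<academicid>
chgrp -R "$(id -gn)" "$user_a_store/dataset"
chmod -R g+rX "$user_a_store/dataset"

# Step 3 (as userB): copy from userA's data store into project B
cp -R "$user_a_store/dataset" "$project_b/"
ls -l "$project_b/dataset"
```

Once the data has arrived in project B, the intermediate copy in the data store of userA can be deleted.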

Tips and advanced commands

If you have a very large number of files/directories and the commands above take a long time to complete, here are some tips to speed things up.

  1. For large numbers of directories, set the SGID-bit with this more advanced command:
find <path> -type d \! -perm /g+s -print0 | xargs -0rn 200 chmod g+s
  2. For a large number of files, some of which may already belong to the correct group or have group r/w/x permissions, changing the group and setting permissions can be sped up by running:
find <path> \! -group HPC_u_<academicid> -print0 | xargs -0rn 200 chgrp HPC_u_<academicid>
find <path> \! -perm -g+rw -print0 | xargs -0rn 200 chmod g+rwX
  3. Use a terminal multiplexer to let the commands run overnight.
  4. Use the correct login node to run your commands. Accessing the filesystems of a specific cluster island from login nodes dedicated to other islands, where that is possible at all, may be a lot slower than accessing them from the correct login nodes. See Cluster Storage Map for the best islands to access each data store.
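Before starting a long run, you can count how many entries actually still need to be changed. The sketch below does this on a small throwaway tree; the helper name count_pending and the file names are hypothetical. It uses -perm -g+rw, which selects entries that are missing at least one of the two group bits.

```shell
tree=$(mktemp -d)              # created with mode 700, so it is pending too
touch "$tree/a" "$tree/b" "$tree/c"
chmod 600 "$tree/a" "$tree/b"  # no group permissions yet
chmod 660 "$tree/c"            # already group read-write

# Hypothetical helper: count entries still missing group read or write
count_pending() {
    find "$1" \! -perm -g+rw -print | wc -l
}

pending_before=$(count_pending "$tree")
find "$tree" \! -perm -g+rw -print0 | xargs -0rn 200 chmod g+rwX
pending_after=$(count_pending "$tree")
echo "before: $pending_before, after: $pending_after"
```

In this tree, three entries are pending before the run (the top directory, a, and b) and none afterwards, while c is never touched.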