Data Migration Guide
Having a separate username for each project has upsides, such as separate data store quotas and never having to worry about submitting jobs with the wrong Slurm account. A major downside is that files must sometimes be copied or moved between usernames. Common scenarios are:
- Copying scripts or configuration files in your HOME directory that took effort to create (e.g. .bashrc, .config/emacs/init.el, etc.)
- Moving files from your legacy username to a new project-specific username
- Moving files from your username of an expired project to the new username of a successor project
In all cases, you must pay attention to which data stores are available on each island of the cluster. Data can only be transferred between two data stores from a node that can access both. See Cluster Storage Map for information on which data stores are available where. If there is no node that can access both, you might have to use an intermediate data store.
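Whether the node you are logged in to can reach both data stores can be checked before starting a transfer. The following is a minimal sketch, not an official tool: the function name can_transfer is hypothetical, and the two scratch directories only stand in for real data store paths taken from the Cluster Storage Map.

```shell
# Hypothetical helper: succeeds only if the source is readable and the
# destination is writable from the node this is run on.
can_transfer() {
    src=$1
    dst=$2
    [ -d "$src" ] && [ -r "$src" ] && [ -x "$src" ] || return 1
    [ -d "$dst" ] && [ -w "$dst" ] && [ -x "$dst" ] || return 1
    return 0
}

# Stand-ins for two data stores; replace with real paths on the cluster.
store_a=$(mktemp -d)
store_b=$(mktemp -d)

if can_transfer "$store_a" "$store_b"; then
    echo "both data stores are reachable from $(uname -n)"
else
    echo "at least one data store is not reachable from $(uname -n)"
fi
```

If the check fails on every node you can log in to, that is the situation in which an intermediate data store is needed.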
This topic requires at least basic understanding of POSIX permissions and groups. Refer to the following links for further information:
https://en.wikipedia.org/wiki/File-system_permissions#Notation_of_traditional_Unix_permissions
https://en.wikipedia.org/wiki/Chmod
https://en.wikipedia.org/wiki/Unix_file_types#Representations
Only the username owning a file/directory can change its owner, group, or ACLs. Even other usernames attached to the same Academic ID are unable to, because the operating system does not know that the different usernames are aliases for the same person.
Many directories have both a logical path like /user/username/u12345 and a real path that points to the actual location on the filesystem.
Please always operate on the real paths, which are directories you can actually modify. The symbolic links below the /user or /projects directories, for example, cannot be modified by users.
You can find out the real path with the following command:
realpath /path/to/directory
As a quick way to resolve the logical path to the real path, you can also append a / to the end of the path when using it in the following commands.
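The difference between the logical and the real path can be tried out safely in a scratch directory. This is only a sketch: the names real_home and logical_home are made up for illustration and do not correspond to actual paths on the cluster.

```shell
# Scratch layout: logical_home is a symlink pointing to real_home.
workdir=$(mktemp -d)
mkdir "$workdir/real_home"
ln -s "$workdir/real_home" "$workdir/logical_home"

# Without a trailing slash, the symlink itself is inspected (type "l").
link_type=$(ls -ld "$workdir/logical_home" | cut -c1)
# With a trailing slash, the directory it points to is inspected (type "d").
target_type=$(ls -ld "$workdir/logical_home/" | cut -c1)
# realpath prints the fully resolved location.
resolved=$(realpath "$workdir/logical_home")

echo "symlink type: $link_type, target type: $target_type"
echo "real path:    $resolved"
```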
Determine Method to Use
The exact method to use depends on the source and destination data stores of the data to be migrated. See the table below to find the easiest method to do each kind of migration.
Source | Destination | Method |
---|---|---|
project-specific username | project-specific username with same Academic ID | Shared AcademicID Group |
project-specific username | Academic ID username (legacy SCC) | Shared AcademicID Group |
project-specific username | your legacy HLRN/NHR username | Get Added to Shared AcademicID Group |
project-specific username | any other username | ACL |
legacy SCC username | project-specific username with the same Academic ID | Shared AcademicID Group |
legacy SCC username | any other username | ACL |
legacy HLRN/NHR username | your project-specific usernames | Get Added to Shared AcademicID Group |
legacy HLRN/NHR username | your legacy SCC username | Get Added to Shared AcademicID Group |
legacy HLRN/NHR username | any other username | ACL |
project | any other project | Between Projects |
The ACL method works on most data stores and is the most powerful, but it is also more complex. You can often use it, but we recommend using the other methods when possible. ACLs are not supported on the following data stores:
- all ARCHIVE/PERM data stores
Get Added to Shared AcademicID Group
Legacy NHR/HLRN usernames have different AcademicIDs than the AcademicID used for project-specific usernames in the HPC Project Portal and legacy SCC usernames.
Thus, the Shared AcademicID Group method cannot be directly used.
However, our support team can add the legacy HLRN/NHR username to your shared AcademicID POSIX group (HPC_u_<academicid>) if you write a support ticket.
Make sure to write from the email address associated with the accounts, or from one of them if a different email address is associated with each.
This proves you control one or both (you may be asked for additional information to prove you own the other if they are associated with different email addresses).
Make sure to clearly state both your legacy HLRN/NHR username and the AcademicID whose POSIX group it should be added to.
Once your legacy HLRN/NHR username has been added to the HPC_u_<academicid>
POSIX group, proceed to the Shared AcademicID Group method.
Using Shared AcademicID Group
Every AcademicID with at least one project-specific username in the HPC Project Portal has a shared POSIX group of the form HPC_u_<academicid>
.
All of that AcademicID’s project-specific usernames as well as the AcademicID itself (legacy SCC username) are members of this shared POSIX group.
For example, John Doe with AcademicID jdoe is a member of two projects and thus has two project-specific usernames, u12345 and u56789. He will have a group called HPC_u_jdoe with 3 members: jdoe, u12345, and u56789.
This shared POSIX group is provided to facilitate easy data migration between these usernames without risk of opening up your data stores to other people by accident.
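Before relying on the shared group, it can help to confirm that the username you are logged in with is actually a member. The sketch below assumes the example group HPC_u_jdoe from the text; the helper name in_my_groups is hypothetical.

```shell
# Hypothetical helper: succeeds if the current username is a member of the
# group named in the first argument.
in_my_groups() {
    id -nG | tr ' ' '\n' | grep -qxF -- "$1"
}

# Example group name from the text; replace jdoe with your own AcademicID.
group=HPC_u_jdoe
if in_my_groups "$group"; then
    echo "$(id -un) is a member of $group"
else
    echo "$(id -un) is not a member of $group"
fi
```

Group membership is read at login, so a session that was already open when a username was added to the group may need to log in again before the check succeeds.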
To grant access to a file/directory to the other usernames in the shared AcademicID POSIX group, you would do the following using the username that owns the directories/files (your other usernames lack the permissions):
chgrp [OPTIONS] HPC_u_<academicid> <path>
chmod [OPTIONS] g+<perms> <path>
If the <path>
is a directory, you should generally add the -R
option to make the command apply the group/permissions recursively to subdirectories and files.
<perms>
should be rX
for readonly access and rwX
for read-write access, where the capital X
gives execute permissions only to files/directories that are already executable by the owner.
Please do NOT use a lower-case x
in <perms>
as that gives execute permissions to all files, even those that should not be executable.
Having random files be executable without good reason is confusing in the best case and a potential security risk and risk to your data in the worst.
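The effect of the capital X can be observed in a scratch directory. This is an illustrative sketch with made-up file names (notes.txt, run.sh); it does not touch any real data store and assumes a Linux system with the usual ls output format.

```shell
# Scratch tree: one directory, one plain file, one script executable by its owner.
demo=$(mktemp -d)
mkdir "$demo/shared"
touch "$demo/shared/notes.txt" "$demo/shared/run.sh"
chmod 700 "$demo/shared"
chmod 600 "$demo/shared/notes.txt"
chmod 700 "$demo/shared/run.sh"

# Grant the group read access; X adds execute only where it makes sense.
chmod -R g+rX "$demo/shared"

# Print just the type and permission characters of a path.
mode_of() { ls -ld "$1" | cut -c1-10; }
mode_of "$demo/shared"            # drwxr-x---  (directory becomes searchable)
mode_of "$demo/shared/notes.txt"  # -rw-r-----  (stays non-executable)
mode_of "$demo/shared/run.sh"     # -rwxr-x---  (was already executable)
```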
It is important to remember, your other usernames can’t access <dir>/<file>
unless they can also access <dir>
, so always be mindful of the parent directory/ies.
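To see at a glance whether every parent directory is accessible, the permissions along the whole path can be listed. The helper below is a hypothetical sketch (the name show_path_permissions is not an existing tool), shown here on scratch directories.

```shell
# Hypothetical helper: list the permissions of a path and of every parent
# directory up to /, so missing group access on a parent is easy to spot.
# The body runs in a subshell so its variables do not leak into the caller.
show_path_permissions() (
    p=$(realpath "$1") || exit 1
    while :; do
        ls -ld "$p"
        [ "$p" = "/" ] && break
        p=$(dirname "$p")
    done
)

# Scratch directories standing in for <dir>/<file>.
demo=$(mktemp -d)
mkdir -p "$demo/project/results"
show_path_permissions "$demo/project/results"
```

On systems that ship util-linux, namei -l <path> prints similar information.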
Since symlinks are used for many data stores, make sure to end <path> with a / when it is a directory, or use $(realpath <path>) to get the fully resolved destination after walking through all symlinks.
Otherwise, the commands will operate on the symlink and not the destination.
For example, /user/jdoe/u12345
would be a symlink to u12345
’s HOME directory, so if you wanted to share that with your other usernames in the same HPC_u_jdoe
group, you would either run
chgrp -R HPC_u_jdoe /user/jdoe/u12345/
or
chgrp -R HPC_u_jdoe $(realpath /user/jdoe/u12345)
To give the destination username read-only access to the source, do the following:
- Log in with the username of the source
- Change the group of the source to HPC_u_<academicid>
- Add g+rX permissions to the source directory (recursively)
- If you are sharing a subdirectory in your data store, you will need to change the group of the parent directory/directories and add the permission g+X (non-recursively)
For example, suppose John Doe wants to give access to the .config
subdirectory of his HOME
directory of his legacy SCC username to his other usernames so the configuration files can be copied over.
John Doe would do this by logging in with jdoe
and
[jdoe@gwdu101 ~]$ chgrp HPC_u_jdoe ~/
[jdoe@gwdu101 ~]$ chgrp -R HPC_u_jdoe ~/.config
[jdoe@gwdu101 ~]$ chmod g+rX ~/
[jdoe@gwdu101 ~]$ chmod -R g+rX ~/.config
See Tips and advanced commands if you have a very large number of files/directories and the above commands are taking a long time.
Then, John Doe could access the files from his u12345
username like
[scc_cool_project] u12345@gwdu101 ~ $ cp -R /usr/users/jdoe/.config/emacs ~/.config/
If John wants to keep using the shared directory to create new files with the source username but by default grant access to the other usernames in HPC_u_<academicid>
, he could also set the SGID-bit on the shared directory, so any newly created files will also be owned by the correct group automatically:
find <path> -type d -exec chmod g+s {} \;
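What the SGID bit does can be checked in a scratch directory before applying it to real data. This is a sketch under the assumption of a Linux filesystem; the directory names are made up, and the directory simply keeps the group it was created with instead of HPC_u_<academicid>.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/shared/old_subdir"

# Same command as in the text, applied to the scratch directory.
find "$demo/shared" -type d -exec chmod g+s {} \;

# Entries created afterwards inherit the directory's group, and new
# subdirectories also inherit the SGID bit itself.
mkdir "$demo/shared/new_subdir"
touch "$demo/shared/new_file"

# An "s" in the group triplet marks the SGID bit.
ls -ld "$demo/shared" "$demo/shared/old_subdir" "$demo/shared/new_subdir"
```

Files that existed before the bit was set keep their old group, so the chgrp step is still needed for them.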
Using ACLs
Data can be migrated using ACLs (Access Control Lists) on most data stores (some don’t support them), but it is more complex. ACLs can be more powerful than regular POSIX permissions, but are not immediately visible and can easily lead to confusion or mistakes. ACLs should be avoided unless you can’t use the easier Shared AcademicID Group or Get Added to Shared AcademicID Group methods.
The basic idea with ACLs is that you can give additional r/w/x permissions to specific users or groups without changing the file/directory owner, group, and main permissions. You can think of them as giving files/directories secondary users and groups with their own separate permissions.
You must use the username that owns the files/directories to add ACLs to them.
ACLs are added with the setfacl
command like
setfacl [OPTIONS] -m <kind>:<name>:<perms> <path>
and removed like
setfacl [OPTIONS] -x <kind>:<name> <path>
where <kind>
is u
for a user and g
for a group, <name>
is the username or group name, and <perms>
is the permissions.
For <perms>, use r for read access, w for write access, and capital X to grant execute permissions if the path already has execute permissions for the owner.
Add the -R
option to apply the ACL recursively to subdirectories and files.
Please do NOT use a lower-case x
in <perms>
as that gives execute permissions to all files, even those that should not be executable.
Having random files be executable without good reason is confusing in the best case and a potential security risk and risk to your data in the worst.
It is important to remember, your other usernames can’t access <dir>/<file>
unless they can also access <dir>
, so always be mindful of the parent directory/ies.
Since symlinks are used for many data stores, make sure to end <path> with a / when it is a directory, or use $(realpath <path>) to get the fully resolved destination after walking through all symlinks.
Otherwise, the commands will operate on the symlink and not the destination.
For example, /user/jdoe/u12345
would be a symlink to u12345
’s HOME directory, so if you wanted to share that with username u31913
, you would either run
setfacl -m u:u31913:rX /user/jdoe/u12345/
or
setfacl -m u:u31913:rX $(realpath /user/jdoe/u12345)
You can see if a file/directory has ACLs when using ls -l by looking for a + sign at the end of the permissions column.
ACLs can be displayed using getfacl
.
The following example demonstrates making two files bar
and baz
in subdirectory foo
, adding an ACL to bar
, showing the permissions with ls -l
, and then reading the ACLs on bar
:
[gzadmfnord@glogin4 ~]$ mkdir foo
[gzadmfnord@glogin4 ~]$ cd foo
[gzadmfnord@glogin4 foo]$ touch bar baz
[gzadmfnord@glogin4 foo]$ setfacl -m u:fnordsi1:r bar
[gzadmfnord@glogin4 foo]$ ls -l
total 0
-rw-r-----+ 1 gzadmfnord gzadmfnord 0 May 21 15:56 bar
-rw-r----- 1 gzadmfnord gzadmfnord 0 May 21 15:56 baz
[gzadmfnord@glogin4 foo]$ getfacl bar
# file: bar
# owner: gzadmfnord
# group: gzadmfnord
user::rw-
user:fnordsi1:r--
group::r--
mask::r--
other::---
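The + marker shown by ls -l in the example above can also be tested in a script. The helper name has_acl is hypothetical, and the sketch assumes the usual ls -l layout in which the marker is the 11th character of the line.

```shell
# Hypothetical helper: succeeds if ls -l marks the path with a trailing +,
# i.e. the path carries ACL entries beyond the basic permissions.
has_acl() {
    [ "$(ls -ld -- "$1" | cut -c11)" = "+" ]
}

# Scratch file without any ACL; on the cluster, pass a real path instead.
demo=$(mktemp -d)
touch "$demo/plain"
if has_acl "$demo/plain"; then
    echo "$demo/plain has ACLs, inspect them with getfacl"
else
    echo "$demo/plain has no ACLs"
fi
```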
To give the destination username read-only access to the source, do the following:
- Log in with the username of the source
- Add rX ACLs for the destination username to the source directory (recursively)
- If you are sharing a subdirectory in your data store, you will need to add an X ACL for the destination username to the parent directory/directories (non-recursively)
For example, suppose John Doe wants to give access to the .config
subdirectory of his HOME
directory of his legacy HLRN/NHR username nibjdoe
to his project-specific username u12345
so the configuration files can be copied over.
John Doe would do this by logging in with nibjdoe
and
[nibjdoe@glogin3 ~]$ setfacl -m u:u12345:rX ~/
[nibjdoe@glogin3 ~]$ setfacl -R -m u:u12345:rX ~/.config
Then, John Doe could access the files from his u12345
username like
[nib30193] u12345@gwdu101 ~ $ cp -R /mnt/vast-nhr/home/nibjdoe/.config/emacs ~/.config/
If John wants to keep using the shared directory to create new files with the source username but by default grant access to u12345, he could set a default ACL on the shared directory, so any newly created files and directories will automatically have the same ACL:
setfacl -R -d -m u:u12345:rX <path>
where the -d option is used to specify that the default ACL should be changed.
If you want to remove a default ACL, you also need to include the -d option.
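Setting and removing a default ACL can be rehearsed on a scratch directory. The sketch below is an assumption-laden illustration: it uses your own username as a stand-in for the destination username, and it skips itself if setfacl is not installed or the filesystem does not accept ACLs.

```shell
demo=$(mktemp -d)
me=$(id -un)   # stand-in for the destination username, e.g. u12345

# ACL support depends on the tools and the filesystem, so the sketch skips
# itself when setfacl is missing or refuses to set the entry.
if command -v setfacl >/dev/null 2>&1 &&
   setfacl -d -m "u:$me:rX" "$demo" 2>/dev/null; then
    # Show only the default entries of the directory.
    getfacl -p "$demo" | grep '^default:'
    # Remove the default entry for that username again (note the -d).
    setfacl -d -x "u:$me" "$demo"
    # Count how many default entries for that username are left.
    remaining=$(getfacl -p "$demo" | grep -c "^default:user:$me:" || true)
    acl_demo=done
else
    remaining=0
    acl_demo=skipped
fi
echo "default ACL demo: $acl_demo"
```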
Between Projects
Have A Username in Both Projects
With your username that is a member of both projects, you can simply copy the data from one to the other with rsync or cp, as long as it isn't too large.
If the data is very large, this will be very slow and may harm filesystem performance for everyone. In this case, please write a support ticket so the best way to copy or move the data can be found (in many cases, the admins have additional, more efficient ways to transfer data).
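For small data, a copy followed by a quick verification could look like the sketch below. The scratch directories and the dataset name are placeholders for the real project data stores, and cp is used only because it is available everywhere.

```shell
# Scratch directories standing in for the data stores of two projects.
src=$(mktemp -d)
dst=$(mktemp -d)
mkdir -p "$src/dataset/raw"
echo "sample" > "$src/dataset/raw/values.txt"

# Copy recursively, preserving modes and timestamps.
cp -Rp "$src/dataset" "$dst/"

# diff -r exits non-zero if any file differs or is missing on either side.
if diff -r "$src/dataset" "$dst/dataset" >/dev/null; then
    copy_status=identical
else
    copy_status=different
fi
echo "copy is $copy_status"
```

With rsync, the line rsync -a "$src/dataset" "$dst/" would take the place of the cp line.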
Have Different Usernames in Both Projects
If the data is small, it can be transferred via an intermediate hop.
If the source project is A
and the destination project is B
and your usernames in both are userA
and userB
respectively, this would be done by:
- Copy the data from the datastore of A to a user datastore of userA.
- Share the data from the datastore of userA with username userB using the respective method in the table above.
- Using username userB, copy the data from the datastore of userA to the destination datastore of project B.
Otherwise, please write a support ticket so a suitable way to migrate the data can be found. Make sure to indicate the source and destination, their projects, and your usernames in each.
Tips and advanced commands
If you have a very large number of files/directories and the commands above take a long time to complete, here are some tips to speed them up.
- For large numbers of directories, set the SGID-bit with this more advanced command:
find <path> -type d \! -perm /g+s -print0 | xargs -0rn 200 chmod g+s
- For a large number of files, some of which may already belong to the correct group or have group r/w/x permissions, changing the group and setting permissions can be sped up by running:
find <path> \! -group HPC_u_<academicid> -print0 | xargs -0rn 200 chgrp HPC_u_<academicid>
find <path> \! -perm /g+rw -print0 | xargs -0rn 200 chmod g+rwX
- Use a terminal multiplexer to let the commands run overnight.
- Use the correct login node to run your commands. Accessing the filesystems of a specific cluster island from login nodes dedicated to other islands, where that is possible at all, may be a lot slower than accessing them from the correct login nodes. See Cluster Storage Map for the best islands to access each data store.
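Before launching the long-running find and xargs commands above, it can be useful to count how many paths they would touch. The following sketch assumes GNU find; the helper name count_matches is hypothetical, and the numeric primary group of the current user stands in for HPC_u_<academicid>.

```shell
# Hypothetical helper: count the paths matched by a find expression without
# changing anything. Counting NUL bytes keeps unusual file names safe.
count_matches() {
    find "$@" -print0 | tr -cd '\0' | wc -c | tr -d ' '
}

# Scratch tree; the numeric primary group stands in for HPC_u_<academicid>.
demo=$(mktemp -d)
group=$(id -g)
mkdir "$demo/data"
touch "$demo/data/a" "$demo/data/b"
chmod 700 "$demo/data"
chmod 600 "$demo/data/a" "$demo/data/b"

# Same tests as in the commands above, but only counted.
wrong_group=$(count_matches "$demo/data" \! -group "$group")
no_group_rw=$(count_matches "$demo/data" \! -perm /g+rw)
echo "paths that chgrp would touch: $wrong_group"
echo "paths that chmod would touch: $no_group_rw"
```

A count of zero means the corresponding command has nothing left to do and can be skipped.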