Comment: made appraisal section and shared folders, added more info on google drive

Rclone is a tool for managing material in cloud services. In DDC, it is primarily a way of transferring content out of a donor's cloud storage into DDC's digital storage.


Setting up Rclone

Rclone should already be installed in BitCurator, but if you are transferring content that requires a login to access, you will need some additional setup to connect to the cloud storage service. This setup has to be repeated for each cloud storage service, and you may need to reconnect if some time has passed since you last used Rclone. The general setup process is listed below, with sections for the specific cloud storage providers used in practice so far. If you need to access a cloud storage provider that is not covered below, please contact the digital archivist.

...

In general we use Rclone for transferring files from cloud services. When possible we also use it to confirm the fixity of the files downloaded.

Appraisal or preparing for a transfer

When transferring content from a cloud service, the extent and nature of the content held in storage may not be apparent. Rclone has some features that allow for basic analysis of the contents of cloud storage.

Creating a list of files and basic metadata

The lsf command lists information about files in a machine-readable way. Its --format option takes a string of characters that select which columns to output; particularly useful ones are p for path, s for size, t for modified time, and m for mimetype (i.e. the file type). With the --csv option you can create a CSV of this information, which can be used to appraise the content's preservation needs, to analyze content based on file names and modified dates, and to sum the size column to determine how much storage a transfer will require. Here is an example:

Code Block
languagebash
linenumberstrue
rclone lsf --csv --format ptms -R --files-only [name of remote as set up above]:[name_of_folder_or_file (if spaces in name, you can put quotation marks around this after the colon)] > path/to/csv/filename.csv

Here is an example for Dropbox:

Code Block
languagebash
linenumberstrue
rclone lsf --csv --format ptms -R --files-only dropbox:"Radhika Nahpal INT" > /media/sf_BCShared01/processing/2022_061acc/file_list.csv
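Summing the size column can be done in a spreadsheet, but a short shell sketch may be quicker for large lists. The version below assumes the CSV was written with --format ptms as above, so that size is the last column. The function name sum_sizes and the sample rows are made up for illustration; in practice you would point the function at your own file_list.csv.

```shell
# Sum the size column (the last field) of a CSV written by
# `rclone lsf --csv --format ptms`. Reading the last field means that
# quoted paths containing commas do not shift the column being summed.
# Negative sizes (reported for Google objects) are skipped.
sum_sizes() {
  awk -F',' '$NF > 0 { total += $NF } END { printf "%.0f\n", total }' "$1"
}

# Hypothetical sample rows standing in for a real file_list.csv
sample_csv="$(mktemp)"
cat > "$sample_csv" <<'EOF'
notes/interview 01.docx,2021-03-04 10:15:00,application/vnd.openxmlformats-officedocument.wordprocessingml.document,20480
"photos/lab, 2019.jpg",2019-07-22 08:00:41,image/jpeg,1048576
audio/session.wav,2020-11-30 17:45:12,audio/x-wav,52428800
EOF

total_bytes="$(sum_sizes "$sample_csv")"
echo "Total bytes: $total_bytes"
rm -f "$sample_csv"
```

The result is in bytes; divide by 1048576 for an approximate figure in MiB.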

Finding if a transfer contains Google Drive formats

When transferring content from Google Drive, there may be Google objects (Docs, Sheets, Slides, etc.) that will not be exported by Rclone in the native Drive "format" but in an equivalent format such as Microsoft Word instead. Additionally, there are some object types that cannot be exported by Rclone (such as Forms). In order to prepare for how best to export these files, it is useful to make a list of all the files in the transfer that are Google objects. This list should be included in the submission documentation of the transfer to record the original format of these files.

To find this information, we will use the lsf command as above, but add --drive-show-all-gdocs to show all Google objects (even those that can't be exported) and --metadata-include to filter for only Google object mimetypes. We will also leave out size, since Rclone cannot measure it for Google objects.

Here is an example:

Code Block
languagebash
linenumberstrue
rclone lsf --csv --format ptm -R --files-only --drive-show-all-gdocs --metadata-include "vnd.google-apps.*" googledrive:"Radhika Nahpal INT" > /media/sf_BCShared01/processing/2022_061acc/orig_in_gdrive_formats.csv
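When deciding how to export these files, a count per mimetype can be a helpful summary of the list. The sketch below assumes the CSV was written with --format ptm as above, so that the mimetype is the last column. The function name count_by_mimetype is hypothetical, and the sample rows are invented and may not match exactly what Rclone reports for your remote.

```shell
# Count rows per mimetype (the last field) in a CSV written by
# `rclone lsf --csv --format ptm`, sorted by mimetype.
count_by_mimetype() {
  awk -F',' '{ count[$NF]++ } END { for (m in count) printf "%d %s\n", count[m], m }' "$1" \
    | LC_ALL=C sort -k2
}

# Hypothetical sample rows standing in for orig_in_gdrive_formats.csv
sample_csv="$(mktemp)"
cat > "$sample_csv" <<'EOF'
Interview notes.docx,2021-03-04 10:15:00,application/vnd.google-apps.document
"Budget, draft.xlsx",2021-05-11 09:30:22,application/vnd.google-apps.spreadsheet
Consent form,2020-01-15 14:02:10,application/vnd.google-apps.form
Transcript.docx,2021-06-01 16:20:05,application/vnd.google-apps.document
EOF

summary="$(count_by_mimetype "$sample_csv")"
echo "$summary"
rm -f "$sample_csv"
```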

Copying files

Basic copying

The command to copy files is fairly simple: you specify that you want to copy the files, then enter their location, and then their destination. For instance:

...

Code Block
languagebash
linenumberstrue
rclone copy dropbox:"Radhika Nahpal INT" "/media/sf_BCShared01/processing/2022_061acc/Radhika Nahpal INT" 

Shared folder copying

Content shared with you in cloud services often appears in a separate section from your personal storage area. To access that section with Rclone, you will often need to add a service-specific flag to the copy command detailed above. Here are some of the flags for common services at MIT:

  • Google Drive: --drive-shared-with-me 
  • Dropbox: a flag doesn't appear to always be necessary, but you can use --dropbox-shared-folders for folders or --dropbox-shared-files if looking for an individual file. There is no way to copy content that you can only reach through an open link, so you can either copy the content to your personal Dropbox or ask the creator to share it with you by email.
  • OneDrive/Sharepoint: currently not functional but there is a workaround that could work.

Google Drive copying

Google Drive has some unique features that sometimes allow for or require alternative steps.

When a transfer from Google Drive does not include Google objects (such as Docs, Sheets, and Slides) and additional analysis will most likely not be needed, as with a small transfer of Word documents, you can direct the output of Rclone to a folder that aligns with Archivematica's standard packaging structure. This will save some work later when preparing for Archivematica. Here is an example:

Code Block
languagebash
linenumberstrue
rclone copy googledrive:"Radhika Nahpal INT" "/media/sf_BCShared01/processing/2022_061acc/objects/Radhika Nahpal INT" 

If content has been shared with you and is not in your own Google Drive, you can use the --drive-shared-with-me flag to look in that area for the content instead.

Code Block
languagebash
linenumberstrue
rclone copy googledrive:"Radhika Nahpal INT" --drive-shared-with-me "/media/sf_BCShared01/processing/2022_061acc/objects/Radhika Nahpal INT" 

Additionally, when transferring content from Google Drive, there may be Google objects (Docs, Sheets, Slides, etc.). Because Rclone cannot tell the size of these files, they are all listed with a file size of -1, so you can check for them by listing (ls) the content with --max-size set to 0. There are also some formats that cannot be exported by Rclone (such as Forms) and are not listed by default, so we also add the flag --drive-show-all-gdocs.

Here is an example:

Code Block
languagebash
linenumberstrue
rclone ls googledrive:"Radhika Nahpal INT" --drive-show-all-gdocs --max-size 0

Once you have your list of Google object files (to check whether Google objects exist in your transfer, see the section on finding Google Drive formats above), you can assess how to export them. For the most common formats (Docs, Sheets, and Slides), we will choose to export in OpenDocument equivalent formats. You can do this by setting the Google Drive export formats (the defaults are Microsoft Office formats). There are other options (such as PDF) described at the link above. Here is an example:

...

Note

Google objects, such as Docs, Sheets, and Slides, do not have checksums stored in Google Drive that can be extracted. If you have any of these in the content you're transferring, they will be downloaded as regular files, but they will not have checksums in the checksum file extracted from Google Drive. In these cases, we will not reuse in Archivematica the checksum file we create here. Instead, name it googledrive_checksums.txt and save it in a location of your choosing for later inclusion in submission documentation.
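To see which downloaded files are missing from the checksum file, you can compare the paths it lists against the files on disk. The sketch below is one possible approach, not an established DDC step. It assumes the checksum file uses md5sum-style lines (a hash, two spaces, then a path relative to the download folder); the function name files_without_checksum and the sample files are hypothetical.

```shell
# Print downloaded files that have no entry in a checksum file.
# Assumes md5sum-style lines: "<hash>  <relative path>".
files_without_checksum() {
  checksum_file="$1"
  download_dir="$2"
  listed="$(mktemp)"
  found="$(mktemp)"
  # Drop the hash and two-space separator, keeping only the path
  sed 's/^[0-9a-f]*  //' "$checksum_file" | LC_ALL=C sort > "$listed"
  # List files on disk with paths relative to the download folder
  (cd "$download_dir" && find . -type f | sed 's|^\./||') | LC_ALL=C sort > "$found"
  # comm -13 keeps lines that appear only in the second list
  LC_ALL=C comm -13 "$listed" "$found"
  rm -f "$listed" "$found"
}

# Hypothetical sample: one regular file with a checksum entry and one
# exported Google Doc without one
work_dir="$(mktemp -d)"
mkdir "$work_dir/objects"
: > "$work_dir/objects/report.pdf"
echo "exported text" > "$work_dir/objects/notes.docx"
printf '%s  %s\n' "d41d8cd98f00b204e9800998ecf8427e" "report.pdf" \
  > "$work_dir/googledrive_checksums.txt"

missing="$(files_without_checksum "$work_dir/googledrive_checksums.txt" "$work_dir/objects")"
echo "$missing"
rm -rf "$work_dir"
```

Files printed by this check are likely to be exported Google objects, but any other file missing from the checksum file would appear as well.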

Confirming fixity

To confirm fixity, there are a number of options:

...