Rclone is a tool for managing material in cloud services. In DDC, it is primarily a way of transferring content out of a donor's cloud storage into DDC's digital storage.

Setting up Rclone

Rclone should already be installed in BitCurator, but if you are transferring content that requires a login, you'll need to do some additional setup to connect to the cloud storage service. This has to be done once for each cloud storage service, and you may need to reconnect if some time has passed since you last used Rclone. The general setup process is listed below, followed by sections for the specific cloud storage providers used in practice so far. If you need to access a cloud storage provider that's not covered below, please contact the digital archivist.

General setup

This general setup will work with Dropbox.

  1. Open the terminal on the left-hand side of the BitCurator desktop
  2. Type "rclone config" and hit enter

  3. Type "new" or "n" for new remote, i.e. new cloud service
  4. Enter a name for the cloud service (e.g. dropbox)
  5. Choose the number listed in the terminal for the cloud service you named in step 4 (e.g. at time of writing, Dropbox is 13)
  6. Hit enter for questions about client id and client secret to accept the defaults
  7. Type "no" and hit enter for the advanced config question
  8. Type "y" and hit enter for the auto config option
  9. A link will appear in the terminal. If it doesn't open in a browser automatically, highlight and copy it, then paste it into your internet browser.
  10. On the page that pops up, choose the option to authorize Rclone access
  11. Return to the terminal. If you see "got code" as part of the output, type "y" for "this is ok" and hit enter
  12. Type "q" and hit enter to quit if you are done setting up cloud connections.
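
Once a remote is configured, it can be checked from the terminal before starting a transfer. The following is a minimal sketch, assuming the remote was named dropbox as in the example above. remote_in_list is a hypothetical helper written for this page, not an Rclone command; rclone listremotes and rclone lsd are Rclone's own commands for listing configured remotes and listing the top-level folders of a remote.

```shell
# Succeeds if the given remote name appears in a list of remotes read from
# standard input, one "name:" per line, as printed by "rclone listremotes".
# remote_in_list is a hypothetical helper, not an Rclone command.
remote_in_list() {
  grep -Fxq -- "$1:"
}

# Only talk to Rclone if it is installed on this machine.
if command -v rclone >/dev/null 2>&1; then
  if rclone listremotes | remote_in_list dropbox; then
    # List the top-level folders to confirm that the login works.
    rclone lsd dropbox:
  else
    echo "No remote named dropbox; run rclone config again." >&2
  fi
fi
```

If the folder listing fails with an authorization error, the connection has probably expired and the setup steps need to be repeated.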

OneDrive or SharePoint

  1. Open the terminal on the left-hand side of the BitCurator desktop
  2. Type "rclone config" and hit enter

  3. Type "new" or "n" for new remote, i.e. new cloud service
  4. Enter a name for the cloud service (e.g. onedrive)
  5. Choose the number listed in the terminal for the cloud service you named in step 4 (e.g. at time of writing, OneDrive is 26)
  6. Hit enter for questions about client id and client secret to accept the defaults
  7. Choose a national cloud region for OneDrive, most likely, 1 - "Microsoft Cloud Global"
  8. Type "no" and hit enter for the advanced config question
  9. Type "y" and hit enter for the auto config option
  10. A link will appear in the terminal. If it doesn't open in a browser automatically, highlight and copy it, then paste it into your internet browser.
  11. Return to the terminal and check that "got code" appears as part of the output.
  12. For the type of connection, enter the number for the type of OneDrive/SharePoint connection you want. 1, for basic OneDrive Personal or Business, gives you your personal OneDrive account and anything shared with you. 2, for SharePoint root, gives you the SharePoint sites that are open to you. 3 lets you enter the site URL for a specific SharePoint site; this is the easiest option for finding exactly what you want.
  13. If you chose option 2, select the SharePoint site you want to access from the list. If you chose option 3, enter the URL for the SharePoint site.
  14. If the drive you selected looks good, type "y" for "this is ok" and hit enter
  15. You will then get a summary of the configuration. If everything looks ok, type "y" for "this is ok" and hit enter
  16. Type "q" and hit enter to quit if you are done setting up cloud connections.

Google Drive

  1. Open the terminal on the left-hand side of the BitCurator desktop
  2. Type "rclone config" and hit enter

  3. Type "new" or "n" for new remote, i.e. new cloud service
  4. Enter a name for the cloud service (e.g. googledrive)
  5. Choose the number listed in the terminal for the cloud service you named in step 4 (e.g. at time of writing, Google Drive is 15)
  6. For the application client ID, the digital archivist has set this up. Contact them for access to the ID, then copy and paste it here and hit enter
  7. For the OAuth client secret, follow the same steps as in step 6. It will be found in the same place as the client ID.
  8. For "Scope that rclone should use when requesting access from drive" choose 2 - "Read-only access to file metadata and file contents."
  9. Press enter for ID of root folder
  10. Press enter for Service Account Credentials JSON file path
  11. Type "no" and hit enter for the advanced config question
  12. Type "y" and hit enter for the auto config option
  13. A link will appear in the terminal. If it doesn't open in a browser automatically, highlight and copy it, then paste it into your internet browser.
  14. On the page that pops up, choose the option to authorize Rclone access
  15. Configure this as a shared drive - (still need to check on this)
  16. Return to the terminal. If you see "got code" as part of the output, type "y" for "this is ok" and hit enter
  17. Type "q" and hit enter to quit if you are done setting up cloud connections.

Using Rclone

In general we use Rclone for transferring files from cloud services. When possible we also use it to confirm the fixity of the files downloaded.

Copying files

The command to copy files is fairly simple: you specify that you want to copy the files, then enter their location and their destination. For instance:

rclone copy [name of remote as set up above]:[name_of_folder_or_file] [/path/to/destination/folder/originalname]

If the folder or file name contains spaces, put quotation marks around it after the colon. The destination is your processing folder, etc. If you want to retain the original folder name, add it to the end of the destination path, quoted if there are spaces in it.

Here is an example:


rclone copy dropbox:"Radhika Nahpal INT" "/media/sf_BCShared01/processing/2022_061acc/Radhika Nahpal INT" 
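
Before a large transfer it can help to preview what Rclone would copy. The sketch below wraps the copy command in a small shell function. copy_from_remote is a hypothetical helper name, not part of Rclone, while --dry-run is Rclone's own flag for reporting what would be transferred without copying anything.

```shell
# Hypothetical helper: copy a remote folder into a local folder, keeping the
# original folder name at the destination. Any arguments after the first
# three (for example --dry-run) are passed through to "rclone copy".
copy_from_remote() {
  remote=$1      # remote name chosen during "rclone config", e.g. dropbox
  folder=$2      # folder in the cloud service; spaces are fine when quoted
  dest_root=$3   # local folder to copy into, e.g. the accession folder
  shift 3
  rclone copy "$@" "$remote:$folder" "$dest_root/$folder"
}

# Preview first, then run the real copy:
# copy_from_remote dropbox "Radhika Nahpal INT" /media/sf_BCShared01/processing/2022_061acc --dry-run
# copy_from_remote dropbox "Radhika Nahpal INT" /media/sf_BCShared01/processing/2022_061acc
```

Quoting the variables inside the function is what lets folder names with spaces pass through intact.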

When transferring files from Google Drive that do not include Google objects (such as Docs, Sheets, and Slides), and additional analysis will most likely not be needed (for example, a small transfer of Word documents), you can direct the output of Rclone to a folder that aligns with Archivematica's standard packaging structure. This will save some work later when preparing for Archivematica. Here is an example:


rclone copy googledrive:"Radhika Nahpal INT" "/media/sf_BCShared01/processing/2022_061acc/objects/Radhika Nahpal INT" 

If content has been shared with you and is not in your own Google Drive, you can use the --drive-shared-with-me flag to look in that area for the content instead.

rclone copy googledrive:"Radhika Nahpal INT" --drive-shared-with-me "/media/sf_BCShared01/processing/2022_061acc/objects/Radhika Nahpal INT" 

Additionally, when transferring content from Google Drive, there may be Google objects (Docs, Sheets, Slides, etc.). Because Rclone cannot tell the size of these files, they are all listed as having a file size of -1, so you can check for them by listing (ls) the content and setting --max-size to 0. Some formats (such as Forms) cannot be exported by Rclone and are not listed by default, so we also add the --drive-show-all-gdocs flag.

Here is an example:

rclone ls googledrive:"Radhika Nahpal INT" --drive-show-all-gdocs --max-size 0
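
With --max-size 0, the listing may also include ordinary empty files, since a size of 0 is within the limit. Here is a small sketch for narrowing the listing to Google objects only, assuming the usual rclone ls layout of a size column followed by the path. only_google_objects is a hypothetical helper, not an Rclone command.

```shell
# Read "rclone ls" output on standard input and print only the paths whose
# size column is -1, which is how Google objects are listed.
# only_google_objects is a hypothetical helper, not an Rclone command.
only_google_objects() {
  awk '$1 == -1 { sub(/^[ \t]*-1 /, ""); print }'
}

# Example (requires a configured googledrive remote):
# rclone ls googledrive:"Radhika Nahpal INT" --drive-show-all-gdocs --max-size 0 | only_google_objects
```

Only the leading size column is stripped, so paths that contain spaces are printed unchanged.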

Once you have your list of Google object files, you can assess how to export them. For the most common formats (Docs, Sheets, and Slides), we will choose to export in the OpenDocument equivalent formats. You can do this by setting the Google Drive export formats with the --drive-export-formats flag (the defaults are Microsoft Office formats). There are other options (such as PDF) described in Rclone's Google Drive documentation. Here is an example:

rclone copy googledrive:"Radhika Nahpal INT" --drive-shared-with-me "/media/sf_BCShared01/processing/2022_061acc/objects/Radhika Nahpal INT" --drive-export-formats ods,odt,odp


Extracting checksums

Some cloud providers store checksums in their system that you can extract to facilitate fixity checking. Some checksum types are unique to a provider, while others are more standard. Here is the general layout of the command to extract the checksums into a text file:

rclone hashsum [type of checksum] [remote source]:"folder_name or file" --output-file /path/to/output/file.txt

The remote source and folder or file name are the same as those used when copying.

Here is an example for Dropbox:

rclone hashsum dropbox dropbox:"Radhika Nahpal INT" --output-file /media/sf_BCShared01/processing/2022_061acc/submissionDocumentation/dropbox_checksums.txt 

Here is an example for OneDrive or SharePoint:

rclone hashsum quickxor onedrive:"Radhika Nahpal INT" --output-file /media/sf_BCShared01/processing/2022_061acc/submissionDocumentation/onedrive_checksums.txt 

Here is an example for Google Drive. Because you can reuse MD5 checksums in Archivematica, we can name the checksum file and store it according to its standard packaging and naming structure:

rclone hashsum md5 googledrive:"Radhika Nahpal INT" --output-file /media/sf_BCShared01/processing/2022_061acc/submissionDocumentation/checksum.md5

Google objects, such as Docs, Sheets, and Slides, do not have checksums stored in Google Drive that can be extracted. If you have any of these in the content you're transferring, they will be downloaded as regular files, but they will not have checksums in the checksum file extracted from Google Drive. In these cases, we will not reuse the checksum file we create here in Archivematica, and it should be named googledrive_checksums.txt and saved in a location of your choosing.
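
The naming rule above could be sketched as a small check on an rclone ls listing. This is only an illustration, assuming Google objects appear in the listing with a size of -1. checksum_filename is a hypothetical helper, not an Rclone command.

```shell
# Read "rclone ls" output on standard input and print the checksum file name
# to use: googledrive_checksums.txt if any Google objects (size -1) are
# present, otherwise checksum.md5 for reuse in Archivematica.
# checksum_filename is a hypothetical helper, not an Rclone command.
checksum_filename() {
  if awk '$1 == -1 { found = 1 } END { exit !found }'; then
    echo "googledrive_checksums.txt"
  else
    echo "checksum.md5"
  fi
}

# Example (requires a configured googledrive remote):
# rclone ls googledrive:"Radhika Nahpal INT" --drive-show-all-gdocs | checksum_filename
```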

Confirming fixity

To confirm fixity, there are a number of options:

Confirm using the checksums you extracted in the steps above:

rclone checksum [checksum type] /path/to/checksum/file.txt /path/to/local_directory/of/copied_files

Here is an example for Dropbox:

rclone checksum dropbox /media/sf_BCShared01/processing/2022_061acc/submissionDocumentation/dropbox_checksums.txt "/media/sf_BCShared01/processing/2022_061acc/Radhika Nahpal INT"

Confirm without a local checksum file, comparing the remote against checksums that Rclone generates for the local copy:

rclone check [remote name]:[source folder] /path/to/local_copy/of/source_folder
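
For a record of the comparison, rclone check can also write a report file. The sketch below assumes the installed Rclone version supports the --one-way and --combined flags, which may not be the case for older versions. check_against_remote and count_problems are hypothetical helper names, not Rclone commands.

```shell
# Hypothetical helper: compare a remote folder against its local copy and
# write a combined report with one line per file.
check_against_remote() {
  src=$1          # e.g. dropbox:"Radhika Nahpal INT"
  local_copy=$2   # local folder holding the copied files
  report=$3       # text file to write the report to
  # --one-way only checks that every source file exists and matches locally.
  rclone check --one-way --combined "$report" "$src" "$local_copy"
}

# Count report lines that are not marked "=" (identical on both sides).
# grep -c prints 0 and exits non-zero when nothing is selected, hence "|| true".
count_problems() {
  grep -vc '^= ' "$1" || true
}

# Example (requires a configured dropbox remote):
# check_against_remote dropbox:"Radhika Nahpal INT" "/media/sf_BCShared01/processing/2022_061acc/Radhika Nahpal INT" /path/to/check_report.txt
# count_problems /path/to/check_report.txt
```

A count of 0 means every line in the report was marked as identical.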

Next steps and packaging files

While Rclone exports the files from the cloud provider, under most circumstances it doesn't perform the packaging or analysis needed for processing the files. Proceed to the Logical Transfer section for next steps in processing this content.
