Using NeuroCAAS from the Command Line

The simplest way to use NeuroCAAS is through the website interface, located here. However, this is not the only way to use NeuroCAAS. In particular, NeuroCAAS also supports a command line interface (CLI), which lets you upload data, submit jobs, and retrieve results from the platform directly from your computer’s terminal window, as opposed to a web browser. In this document, we’ll discuss the pros and cons of this method, how it interfaces with other ways of using NeuroCAAS, and the steps you’ll need to follow.

Pros and Cons

The biggest pros of using a CLI interface to interact with NeuroCAAS is efficiency and scale. You can easily upload large datasets to NeuroCAAS (~100s of GB), which can be quite inconvenient using the web browser, and you can easily submit multiple jobs in parallel, or build software around calls to NeuroCAAS (although this may be easier with the SDK for your programming language– although NeuroCAAS also supports this use case, we do not cover it in detail here). It can also be useful to download data to a machine where you do not have access to a web browser.

The downsides of using a CLI interface is that it can be less user friendly- in particular, monitoring jobs can be more difficult: the website interface lets you monitor logs for jobs in real time, but you will have to download and inspect them using the CLI. If you are getting started, we recommend that you work with the web interface first, in order to get familiar with how NeuroCAAS works.

Interoperability

The website interface to NeuroCAAS and the CLI (as well as any other cloud storage browser) are interoperable- you can start a job with the web inteface, and download result via the CLI, or upload data via the CLI and run jobs with the browser.

Installation

We use the AWS CLI to support a CLI interface to NeuroCAAS, as it is platform independent and optimized for performance. You will not need to create an AWS account in order to use the AWS CLI, but you will need to do the following:

  1. Create an account for NeuroCAAS here if you have not already.

  2. Log in on the NeuroCAAS web interface, and click on your name in the top right hand corner of the website to reach your profile.

  3. On your profile, record the following information:

    • AWS Access Key

    • AWS Secret Access Key

    • S3 Bucket For Datasets and config files

We will use this information to set up the CLI and submit work to NeuroCAAS. Once you have this information, please follow the instructions here AWS CLI to install the CLI for your platform. Use the AWS Access Key and Secret Access Key you retrieved earlier to configure your CLI when prompted.

To confirm that installation was successful, run the command aws s3 ls and confirm that you see a list of cloud storage buckets. If you do not, contact an AWS administrator with your issues, or submit an issue on GitHub.

Basic Usage

Once you have the AWS CLI set up correctly, you can start working with NeuroCAAS analyses. All user interaction on NeuroCAAS happens by uploading and downloading files from cloud storage, which in this case corresponds to AWS S3 Buckets. Each bucket that you see when you run the command aws s3 ls corresponds to a different analysis. Within each bucket, you will see that each user group has a directory, that lets them manage their analysis related data and results.

Uploading Data

You can upload data and configuration files to the {inputs,configs} subdirectories indicated above. Note that you cannot download data from the inputs directory, so you cannot use NeuroCAAS as free file storage (you can delete files, however).

Assuming you have some data on your local machine at /path/to/data, to upload data from your local machine to NeuroCAAS you can run aws s3 cp /path/to/data s3://caiman-ncap-web/user1/{inputs,configs}/data. This will create a file named data in storage. To delete the file, you should run aws s3 rm s3://caiman-ncap-web/user1/{inputs,configs}/data. You can also upload directories using the aws s3 sync command, instead of aws s3 cp.

Submitting Jobs

To submit jobs to NeuroCAAS, you will upload a special file to NeuroCAAS, which we call a submit file.A submit file is a JSON file that has the fields dataname,configname,timestamp, as follows:

{
   "dataname": ["user1/inputs/data"],
   "configname": "user1/configs/config.yaml",
   "timestamp": "unique_timestamp"
}

You can write this file on your local machine. Note that the dataname and configname parameters are paths to the input data and configuration file you want to use for analysis, without the s3 bucket prefix. Additionally, you can pass a string or list to the dataname parameter. If you pass a list, the analysis will be parallelized across each data file in the list, using the same configuration file. Finally, the timestamp parameter is a unique timestamp that will be used to write an output directory where the outputs of your job will be stored.

You can start a job by saving your submit file as any file suffixed as submit.json. You should upload this file to the submissions, using the command aws s3 cp submit.json s3://caiman-ncap-web/user1/submissions/.

Note that this file is generated automatically when you click the “submit” button on the NeuroCAAS website.

Monitoring Jobs

Once you upload a submit file, a result directory for the corresponding job will be generated at s3://caiman-ncap-web/user1/results/job__caiman-ncap-web_{unique_timestamp}, where the timestamp is what you provided in the timestamp field of your submit file. You can monitor jobs by looking in the directory s3://caiman-ncap-web/user1/results/job__caiman-ncap-web_{unique_timestamp}/logs. You can find two types of files here:

  • certificate.txt: There will always be one certificate file that uniquely identifies a job. It will contain any logging and error information related to the startup of the job you requested, and will print high-level information about all the IAEs are running in real time.

  • DATASET_NAME_{}.txt: If you request processing for multiple IAEs in parallel, each will generate a detailed log that contains the standard error and standard output from the IAEs console, as well as live information on the IAEs memory and CPU usage, and the amount of time it has been running.

You can monitor job process by downloading these files to your local computer, and opening them. Run the command aws s3 cp s3://caiman-ncap-web/user1/results/job__caiman-ncap-web_{unique_timestamp}/logs/{certificate.txt,DATASET_NAME_{}.txt} /path/to/local/location/ to download them.

Downloading Results

Finally, all results that your analysis generates will be sent to the folder s3://caiman-ncap-web/user1/results/job__caiman-ncap-web_{unique_timestamp}/process_results. You can download them all at once with the command aws s3 sync s3://caiman-ncap-web/user1/results/job__caiman-ncap-web_{unique_timestamp}/process_results/ /path/to/local/location/. Note that when processing finishes, it will upload a file called end.txt to the process_results folder, which you can monitor for to detect if the analysis is done.

Advanced Usage

Storage Bypass (June 9th, 2022)

If you have especially large datasets or analysis outputs, it may be prohibitively slow to first transfer them to NeuroCAAS, analyze them, and remove the analysis outputs, especially as we do not allow users to use NeuroCAAS itself to analyze data. For these cases, we can offer a “Storage Bypass” option: if you have access to your data in a publically accessible S3 bucket, you can read and write directly to that bucket. Timestamped outputs and logs will be delivered back to the storage location of your choice. In order to make use of this feature, modify your submit files as follows:

{
    dataname: [s3://inputbucket/sep_inputs/file.ext],
    configname: [s3://inputbucket/sep_configs/configsep.json],
    timestamp: timestamp
    [resultpath: s3://outputbucket/sep_results],
}

Modifications to the original submit file format are presented in brackets. Note that here we assume the following:

  • data and configuration files come from a public bucket, and can be accessed by anyone.

  • data and config must come from the same input bucket, but no longer require an explicit group

  • results will be written to a subfolder of the optionally different output bucket.

Finally, older analyses last deployed before this feature will need to be updated in order to work correctly with it. We are actively updating analyses at this time to standardize interfaces across analyses, but if you encounter any issues please contact neurocaas@gmail.com and we can expedite updates.