Copick and HPC

A frequent issue encountered when working with cryoET datasets is the need to validate or visualize results of data analysis run on a high-performance computing (HPC) cluster. Often the data is stored on a remote file system that the user's machine does not have direct access to.

This tutorial demonstrates how to use copick to access data stored on an HPC cluster from your local machine.

Step 1: Install copick

See the quickstart guide for instructions on how to install copick. Copick must be installed on both the HPC cluster and your local machine.
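In most environments, copick can be installed directly from PyPI; run the same command on the cluster and on your local machine:

pip install copick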

Step 2: Setup your HPC project

On the HPC cluster, we can access the data directly via the local filesystem. Here, we assume that we are working on the HPC cluster my_cluster. For reproducibility's sake, we will assume that the static dataset is dataset 10301, retrieved from the cryoET data portal at cryoetdataportal.czscience.com. We will assume that the project overlay is stored in the directory /hpc/data/copick_project on the HPC cluster.

We will store this information in a configuration file copick_config.json on the HPC cluster.

HPC Configuration

{
    "config_type": "cryoet_data_portal",
    "name": "Example HPC Project",
    "description": "This project lives on an HPC cluster.",
    "version": "0.5.0",
    "pickable_objects": [
        {
            "name": "ribosome",
            "is_particle": true,
            "go_id": "GO:0022626",
            "label": 1,
            "color": [  0, 117, 220, 255],
            "radius": 150
        },
        {
            "name": "atpase",
            "is_particle": true,
            "go_id": "GO:0045259",
            "label": 2,
            "color": [251, 192, 147, 255],
            "radius": 150
        },
        {
            "name": "membrane",
            "is_particle": false,
            "go_id": "GO:0016020",
            "label": 3,
            "color": [200, 200, 200, 255],
            "radius": 10
        }
    ],
    "overlay_root": "local:///hpc/data/copick_project/",
    "overlay_fs_args": {
        "auto_mkdir": true
    },
    "dataset_ids" : [10301]
}
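To verify that the configuration resolves correctly, a quick sanity check on the cluster can load the project and list its runs. This is a minimal sketch; the run names are populated from portal dataset 10301:

import copick

# Load the project; static data is streamed from the data portal,
# new annotations are written to the overlay directory
root = copick.from_file("copick_config.json")

# List the available runs
print([run.name for run in root.runs])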

Note that the same concept can also be applied to fully locally stored datasets. A config for that case is provided below, but it will not be used in this tutorial.

HPC Configuration (data fully on cluster)
{
    "config_type": "filesystem",
    "name": "Example HPC Project",
    "description": "This project lives on an HPC cluster.",
    "version": "0.5.0",
    "pickable_objects": [
        {
            "name": "ribosome",
            "is_particle": true,
            "go_id": "GO:0022626",
            "label": 1,
            "color": [  0, 117, 220, 255],
            "radius": 150
        },
        {
            "name": "atpase",
            "is_particle": true,
            "go_id": "GO:0045259",
            "label": 2,
            "color": [251, 192, 147, 255],
            "radius": 150
        },
        {
            "name": "membrane",
            "is_particle": false,
            "go_id": "GO:0016020",
            "label": 3,
            "color": [200, 200, 200, 255],
            "radius": 10
        }
    ],
    "overlay_root": "local:///hpc/data/copick_project",
    "overlay_fs_args": {
        "auto_mkdir": true
    },
    "static_root": "local:///hpc/data/copick_project_static",
    "static_fs_args": {
        "auto_mkdir": true
    }
}

Step 3: Access the data on your local machine

To access the data on your local machine, we reuse most of the configuration file from the HPC cluster. We only need to inform copick about the location of the project on the cluster and how to access it. For simplicity, we will assume that passwordless login is set up on the HPC cluster; see the SSH documentation for more information.

SSH Authentication

In cases of mandatory two-factor authentication (2FA), you may need to set up an SSH tunnel to the remote filesystem, e.g.

ssh -L 2222:localhost:22 user.name@my_cluster

and then use "host": "localhost" and "port": 2222 in the config and commands below.
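With the tunnel active, the SSH-based fs_args in the local configuration below would become (a sketch; adjust the username and port to your setup):

"overlay_fs_args": {
    "host": "localhost",
    "port": 2222,
    "username": "user.name"
}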

On our local machine, we create a new configuration file copick_config_local.json with the following content:

Local Configuration

{
    "config_type": "cryoet_data_portal",
    "name": "Example Local Project",
    "description": "This Project accesses data from an HPC cluster.",
    "version": "0.5.0",
    "pickable_objects": [
        {
            "name": "ribosome",
            "is_particle": true,
            "go_id": "GO:0022626",
            "label": 1,
            "color": [  0, 117, 220, 255],
            "radius": 150
        },
        {
            "name": "atpase",
            "is_particle": true,
            "go_id": "GO:0045259",
            "label": 2,
            "color": [251, 192, 147, 255],
            "radius": 150
        },
        {
            "name": "membrane",
            "is_particle": false,
            "go_id": "GO:0016020",
            "label": 3,
            "color": [200, 200, 200, 255],
            "radius": 10
        }
    ],
    "overlay_root": "ssh:///hpc/data/copick_project/",
    "overlay_fs_args": {
        "host": "my_cluster",
        "username": "user.name"
    },
    "dataset_ids" : [10301]
}
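Opening the project locally now works the same way as on the cluster. Note that the ssh:// protocol is backed by fsspec's SFTP implementation, which requires paramiko on your local machine; depending on how copick was installed, you may need to install it separately. A minimal check, assuming passwordless SSH to my_cluster is set up:

import copick

# Open the project; overlay reads and writes go over SFTP to the cluster
root = copick.from_file("copick_config_local.json")

# Listing the runs exercises both the portal (static) and SSH (overlay) backends
print(f"Found {len(root.runs)} runs: {[run.name for run in root.runs]}")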

As before, we can also provide a configuration for a dataset that is fully stored on the HPC cluster. This will not be used in this tutorial.

Local Configuration (data fully on cluster)
{
    "config_type": "filesystem",
    "name": "Example Local Project",
    "description": "This Project accesses data from an HPC cluster.",
    "version": "0.5.0",
    "pickable_objects": [
        {
            "name": "ribosome",
            "is_particle": true,
            "go_id": "GO:0022626",
            "label": 1,
            "color": [  0, 117, 220, 255],
            "radius": 150
        },
        {
            "name": "atpase",
            "is_particle": true,
            "go_id": "GO:0045259",
            "label": 2,
            "color": [251, 192, 147, 255],
            "radius": 150
        },
        {
            "name": "membrane",
            "is_particle": false,
            "go_id": "GO:0016020",
            "label": 3,
            "color": [200, 200, 200, 255],
            "radius": 10
        }
    ],
    "overlay_root": "ssh:///hpc/data/copick_project/",
    "overlay_fs_args": {
        "host": "my_cluster",
        "username": "user.name"
    },
    "static_root": "ssh:///hpc/data/copick_project_static",
    "static_fs_args": {
        "host": "my_cluster",
        "username": "user.name"
    }
}

Step 4: Modify the data on the HPC cluster

Using the configuration file copick_config.json from the previous step, we can now access the data on the HPC cluster and perform processing tasks. In lieu of a full processing example, we will demonstrate how to read a set of picks from the dataset and save a random subset to a new file.

import copick
import numpy as np

# Get applicable picks
root = copick.from_file("copick_config.json")
run = root.get_run("14069")
picks = run.get_picks(object_name="ribosome")[0]

# Get all points from the existing pick set
points = picks.points

# Create a new pick object (this will be saved in the overlay directory)
new_picks = run.new_picks(object_name="ribosome", user_id="subset", session_id="0")
# Select 20 random points; list() converts numpy's object array to a plain list
new_picks.points = list(np.random.choice(points, 20, replace=False))
new_picks.store()

Step 5: Access the modified data on your local machine

Using the configuration file copick_config_local.json created in Step 3, we can now access the modified data on our local machine without any additional downloads.

import copick

# Get applicable picks
root = copick.from_file("copick_config_local.json")
run = root.get_run("14069")
picks = run.get_picks(object_name="ribosome", user_id="subset")[0]

# Confirm the number of picks in the subset
print(f"Number of picks: {len(picks.points)}")

Step 6: Visualize the data

In practice, you will likely want to visualize the data rather than just count the picks. For this, you can use the ChimeraX-copick plugin; visualization works as with any other copick project. For more information, see the tutorial on ChimeraX-copick integration, or check out CellCanvas or napari-copick for alternative visualization options.
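If ChimeraX-copick is installed, a copick project is typically opened from the ChimeraX command line, e.g. (assuming the plugin's copick start command; check the plugin documentation for your version):

copick start copick_config_local.json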