Soft Segmentation Tutorial (SMURF-lite)

This notebook demonstrates the SMURF-lite workflow on the 10x FFPE mouse brain Visium HD dataset. SMURF-lite is the non-deep-learning mode of SMURF and does not require a GPU. In the shared-spot transcript-assignment step, SMURF-lite uses a spatial-distance-weighted multinomial strategy. Use the full tutorial instead if you want the deep-learning refinement.
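To build intuition for the distance-weighted multinomial idea, here is a toy sketch (not SMURF's actual implementation; the coordinates and the inverse-distance weighting are illustrative assumptions) of splitting one shared spot's transcripts among nearby nuclei:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one shared 2 um spot and three candidate nuclei nearby.
# Coordinates and the inverse-distance weighting are made up for illustration.
spot_xy = np.array([5.0, 5.0])
nuclei_xy = np.array([[4.0, 5.0],
                      [6.0, 6.0],
                      [9.0, 5.0]])
spot_counts = 20  # transcripts of one gene observed in the shared spot

# Weight each candidate nucleus by inverse distance to the spot center
# and normalize to probabilities.
dist = np.linalg.norm(nuclei_xy - spot_xy, axis=1)
weights = 1.0 / dist
prob = weights / weights.sum()

# Split the spot's transcripts among the nuclei with one multinomial draw;
# the assigned counts always sum to spot_counts.
assigned = rng.multinomial(spot_counts, prob)
```

The closest nucleus receives the highest assignment probability, so on average it collects the largest share of the shared spot's transcripts.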

Dataset Information

Access the dataset here: Visium HD CytAssist Gene Expression Libraries of Mouse Brain

Important Downloads

To download the necessary files, run the following commands:

# Download data from 10x
!mkdir -p "smurf_example_data"
!wget -P smurf_example_data https://cf.10xgenomics.com/samples/spatial-exp/3.0.0/Visium_HD_Mouse_Brain/Visium_HD_Mouse_Brain_tissue_image.tif
!wget -P smurf_example_data https://cf.10xgenomics.com/samples/spatial-exp/3.0.0/Visium_HD_Mouse_Brain/Visium_HD_Mouse_Brain_binned_outputs.tar.gz
!tar -xzvf smurf_example_data/Visium_HD_Mouse_Brain_binned_outputs.tar.gz -C smurf_example_data

# Download our nuclei segmentation results if needed
!curl -L -o smurf_example_data/segmentation_final.npy.gz https://github.com/The-Mitra-Lab/SMURF/raw/main/data/segmentation_final.npy.gz
!gunzip smurf_example_data/segmentation_final.npy.gz

If you use the same cell segmentation approach as introduced here, you may skip the first part and jump directly to the section labeled "Start here".

[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import scanpy as sc
import anndata

import smurf as su

import pickle
import copy
import gzip

Note on software versions

This tutorial was developed and tested with the exact package versions listed below.
Using these versions should reproduce the results exactly.
Newer releases may introduce subtle differences.
If you encounter any issues, please open an issue on our GitHub repository and we’ll respond as soon as possible. Thanks!

Package        Version
-------------  -----------
numpy          1.26.4
pandas         1.5.3
tqdm           4.66.2
scanpy         1.10.0
scipy          1.12.0
matplotlib     3.8.3
numba          0.59.0
scikit-learn   1.4.1.post1
Pillow         10.2.0
anndata        0.10.6
h5py           3.10.0
pyarrow        16.1.0
igraph         0.11.5

Here is Python code you can use to install the exact versions:

import importlib.metadata as metadata
import subprocess
import sys

# Dictionary of desired package versions
desired_versions = {
    "numpy": "1.26.4",
    "pandas": "1.5.3",
    "tqdm": "4.66.2",
    "scanpy": "1.10.0",
    "scipy": "1.12.0",
    "matplotlib": "3.8.3",
    "numba": "0.59.0",
    "scikit-learn": "1.4.1.post1",
    "Pillow": "10.2.0",
    "anndata": "0.10.6",
    "h5py": "3.10.0",
    "pyarrow": "16.1.0",
    "igraph": "0.11.5",
}

for pkg, target_ver in desired_versions.items():
    try:
        installed_ver = metadata.version(pkg)
    except metadata.PackageNotFoundError:
        installed_ver = None

    if installed_ver != target_ver:
        print(f"{pkg} current version: {installed_ver or 'not installed'}, installing/upgrading to {target_ver}…")
        subprocess.run(
            [sys.executable, "-m", "pip", "install", f"{pkg}=={target_ver}"],
            check=True
        )
    else:
        print(f"{pkg} is already at version {target_ver}, no action needed.")

Here, we define the data path and the save path. Ensure that the data path contains the complete HE or DAPI image and a folder with the square_002um data. Specifically, you will need tissue_positions.parquet and filtered_feature_bc_matrix.

[2]:
import os

data_path = 'smurf_example_data/'
save_path = 'results_example_mousebrain/'

# Check if the directory exists
if not os.path.exists(save_path):
    # If it doesn't exist, create it
    os.makedirs(save_path)

We start building our so (spatial object). The inputs are your tissue_positions.parquet in 'square_002um/spatial', the full-resolution tissue image, and the format of your experiment (either 'HE' or 'DAPI').

[3]:
so = su.prepare_dataframe_image(data_path + 'binned_outputs/square_002um/spatial/tissue_positions.parquet',
                                data_path + 'Visium_HD_Mouse_Brain_tissue_image.tif',
                                'HE')

Please use your own tools to perform cell segmentation on so.image_temp() (the area covered by Visium HD in the original image). Set all background pixels to 0 and assign each cell a unique integer cell ID. segmentation_final should be a NumPy array with the same shape as so.image_temp().
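The requirements above can be sanity-checked before handing the mask to SMURF. A minimal sketch, with a toy array standing in for your real segmentation_final:

```python
import numpy as np

# Toy stand-in for a segmentation mask; replace with your real array.
segmentation_final = np.array([
    [0, 0, 1, 1],
    [0, 2, 2, 0],
    [3, 3, 0, 0],
], dtype=np.int32)

# IDs must be integers, background must be 0, and each cell needs a
# unique positive integer ID.
assert np.issubdtype(segmentation_final.dtype, np.integer)
labels = np.unique(segmentation_final)
cell_ids = labels[labels > 0]
n_cells = len(cell_ids)

# In the real workflow, also confirm the mask matches the cropped image, e.g.:
# assert segmentation_final.shape == so.image_temp().shape[:2]
```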

[4]:
plt.imshow(so.image_temp())
plt.show()
plt.close()

Please save the image for segmentation now.

[5]:
with gzip.GzipFile(data_path + 'image_to_segmentation.npy.gz', 'w') as f:
    np.save(f, so.image_temp())

Assuming we now have the final segmentation results, let’s proceed to read them. You can always use the provided example results to try out the tutorial.

[6]:
so.segmentation_final = np.load(data_path + 'segmentation_final.npy')
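If your segmentation is still gzip-compressed (like the .npy.gz files used above), you can read it without gunzipping first. A self-contained round-trip sketch, using a temporary file rather than the tutorial's paths:

```python
import gzip
import os
import tempfile

import numpy as np

# Round-trip a toy array through a compressed .npy.gz file.
arr = np.arange(12, dtype=np.int32).reshape(3, 4)
path = os.path.join(tempfile.mkdtemp(), 'segmentation_final.npy.gz')

with gzip.GzipFile(path, 'wb') as f:
    np.save(f, arr)

with gzip.GzipFile(path, 'rb') as f:
    loaded = np.load(f)

# In the tutorial this pattern would read the downloaded file directly:
# with gzip.GzipFile(data_path + 'segmentation_final.npy.gz', 'rb') as f:
#     so.segmentation_final = np.load(f)
```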

Now, we start generating cell and spot information to prepare for the analysis.

[7]:
so.generate_cell_spots_information()
Generating cells information.
100%|██████████| 23340/23340 [01:09<00:00, 337.90it/s]
Generating spots information.
100%|██████████| 23340/23340 [02:58<00:00, 131.01it/s]
Filtering cells
We have 57300 nuclei in total.
Creating NN network

Visualize the nuclei segmentation results overlaid on the original HE or DAPI image. Ensure that the results are meaningful and that the segmented nuclei accurately align with the original image.

[8]:
su.plot_results(so.image_temp(), so.segmentation_final)

Start here (optional shortcut)

Use this section only if you already saved a prepared so.pkl object from a previous run, or downloaded a prepared example object. Otherwise, continue from the beginning of the notebook.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import scanpy as sc
import anndata

import smurf as su

import pickle
import copy


with open(save_path + 'so.pkl', 'rb') as file:
    so = pickle.load(file)

Load adata from the original 2 μm results.

[9]:
# load adata from original 2um results

adata = sc.read_10x_mtx(data_path +'binned_outputs/square_002um/filtered_feature_bc_matrix')
adata = copy.deepcopy(adata[so.df[so.df.in_tissue == 1]['barcode']])
adata.obs
[9]:
s_002um_02448_01644-1
s_002um_00700_02130-1
s_002um_00305_01021-1
s_002um_02703_00756-1
s_002um_02278_00850-1
...
s_002um_02189_01783-1
s_002um_01609_01578-1
s_002um_00456_02202-1
s_002um_00292_01011-1
s_002um_00865_01560-1

6296688 rows × 0 columns

[10]:
sc.pp.filter_genes(adata, min_counts=1000)

Here, we begin by generating the nuclei × genes matrix. The result will be stored in so.final_nuclei.

[11]:
su.nuclei_rna(adata, so)
adata_sc = copy.deepcopy(so.final_nuclei)
adata_sc
100%|██████████| 57300/57300 [01:10<00:00, 807.64it/s]
[11]:
AnnData object with n_obs × n_vars = 57300 × 9970
    var: 'gene_ids', 'feature_types'

Filter cells by counts while preserving the raw data version.

[12]:
sc.pp.filter_cells(adata_sc, min_counts=5)
adata_raw = copy.deepcopy(adata_sc)
adata_raw
[12]:
AnnData object with n_obs × n_vars = 56867 × 9970
    obs: 'n_counts'
    var: 'gene_ids', 'feature_types'
[13]:
adata_raw = copy.deepcopy(adata_sc)

Initial analysis of nuclei data. Note that many datasets may not perform well at this stage; it is normal to observe hundreds of clusters due to low-quality nuclei counts.

Note on the clustering resolution used below

In su.singlecellanalysis(..., resolution=2), resolution refers to the Leiden clustering resolution used for nuclei-level initialization. We use 2.0 as the default starting value in this tutorial. Lower values generally merge clusters, whereas higher values tend to split clusters. If you observe obvious over-splitting, try a lower value; if distinct populations are merged, try a higher value.

[14]:
adata_sc = su.singlecellanalysis(adata_sc,resolution=2)
Starting mt
Starting normalization
Starting PCA
Starting UMAP
Starting Clustering

Here, we start performing iterations. Note that this function can be time-consuming.

[15]:
su.itering_arragement(adata_sc, adata_raw, adata, so, resolution=2, save_folder = save_path, show = True, keep_previous = False)
100%|██████████| 56867/56867 [03:59<00:00, 237.18it/s]
100%|██████████| 56867/56867 [01:06<00:00, 860.10it/s]
Starting mt
Starting normalization
Starting PCA
Starting UMAP
Starting Clustering
WARNING: saving figure to file figures/umap1_umap.png
0.42587605351650787 1
100%|██████████| 56867/56867 [03:58<00:00, 238.02it/s]
100%|██████████| 56867/56867 [01:05<00:00, 874.50it/s]
Starting mt
Starting normalization
Starting PCA
Starting UMAP
Starting Clustering
WARNING: saving figure to file figures/umap2_umap.png
0.6653163073440685 2
100%|██████████| 56867/56867 [04:57<00:00, 191.11it/s]
100%|██████████| 56867/56867 [01:04<00:00, 885.56it/s]
Starting mt
Starting normalization
Starting PCA
Starting UMAP
Starting Clustering
WARNING: saving figure to file figures/umap3_umap.png
0.692272176364526 3
100%|██████████| 56867/56867 [04:27<00:00, 212.77it/s]
100%|██████████| 56867/56867 [01:04<00:00, 875.97it/s]
Starting mt
Starting normalization
Starting PCA
Starting UMAP
Starting Clustering
WARNING: saving figure to file figures/umap4_umap.png
0.7058946412586965 4
100%|██████████| 56867/56867 [04:48<00:00, 197.11it/s]
100%|██████████| 56867/56867 [01:04<00:00, 882.54it/s]
Starting mt
Starting normalization
Starting PCA
Starting UMAP
Starting Clustering
WARNING: saving figure to file figures/umap5_umap.png
0.7017089650355296 5
100%|██████████| 56867/56867 [04:58<00:00, 190.19it/s]
100%|██████████| 56867/56867 [01:05<00:00, 874.40it/s]
Starting mt
Starting normalization
Starting PCA
Starting UMAP
Starting Clustering
WARNING: saving figure to file figures/umap6_umap.png
0.6986723859819448 6

Load results here. From this point onward, the SMURF-lite and SMURF-full workflows differ. The lite workflow uses the non-deep-learning shared-spot assignment strategy, whereas the full workflow adds a deep-learning refinement step. This is the last natural checkpoint at which you can switch to the full workflow using the same intermediate objects (adatas_final, cells_final, and weights_record).

[16]:
adatas_final = sc.read_h5ad(save_path +'adatas.h5ad')

with open(save_path +'cells_final.pkl', 'rb') as file:
    cells_final = pickle.load(file)

with open(save_path +'weights_record.pkl', 'rb') as file:
    weights_record = pickle.load(file)
[17]:
cell_cluster_final = su.return_celltype_plot(adatas_final, so)
su.plot_cellcluster_position(cell_cluster_final, col_num=5)
100%|██████████| 23340/23340 [01:16<00:00, 305.09it/s]

Here, we generate the final result. By setting plot = True, we will create so.pixels_cells, which might take more than half an hour. If you do not need this or have already generated so.pixels_cells, set plot = False.

Here, adata_sc_final.obs contains the following columns:

  • cell_cluster: Denotes the cluster assignment from the iteration.

  • cos_simularity: The cosine similarity of the cell’s gene expression with the average expression of its cluster.

  • cell_size: The number of 2 μm spots occupied by this cell.

  • array_row and array_col: The absolute coordinates of the cell, matching the spot locations provided by 10x.
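With these columns in place, quick per-cluster summaries are easy. A sketch on a toy DataFrame mimicking adata_sc_final.obs (the values are made up; note the library's 'cos_simularity' spelling):

```python
import pandas as pd

# Toy stand-in for adata_sc_final.obs.
obs = pd.DataFrame({
    'cell_cluster': [0, 0, 1, 1, 1],
    'cos_simularity': [0.91, 0.88, 0.75, 0.80, 0.78],
    'cell_size': [30, 42, 18, 25, 21],
})

# Number of cells, mean cell size, and mean cosine similarity per cluster.
summary = obs.groupby('cell_cluster').agg(
    n_cells=('cell_size', 'size'),
    mean_size=('cell_size', 'mean'),
    mean_cos=('cos_simularity', 'mean'),
)
```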

[18]:
adata_sc_final = su.get_finaldata_fast(cells_final, so, adatas_final, adata, weights_record, plot = True)
adata_sc_final
100%|██████████| 1690156/1690156 [02:57<00:00, 9545.57it/s]
100%|██████████| 23340/23340 [27:03<00:00, 14.38it/s]
[18]:
AnnData object with n_obs × n_vars = 56823 × 9970
    obs: 'cell_cluster', 'cos_simularity', 'cell_size', 'array_row', 'array_col', 'n_counts'
    var: 'gene_ids', 'feature_types'
    uns: 'spatial'
    obsm: 'spatial'

Since we are using the filtered version of the data, if you want to recover the full gene set in the final dataset, you can use the following code:

adata1 = sc.read_10x_mtx(data_path + 'binned_outputs/square_002um/filtered_feature_bc_matrix')
adata1 = adata1[so.df[so.df.in_tissue == 1]['barcode']]
weight_to_celltype = su.calculate_weight_to_celltype(adatas_final, adata1, cells_final, so)
adata_sc_final2 = su.get_finaldata_fast(cells_final, so, adatas_final, adata1, weight_to_celltype, plot=False)
adata_sc_final2

Finally, we generate a pixels-to-cell visualization overlaid on the original staining image. Saving this figure can take a long time. Use save=None or save=False if saving is not needed. High DPI settings also increase runtime; dpi=1500 is a typical high-resolution setting.

[19]:
su.plot_results(so.image_temp(), so.pixels_cells, dpi = 1500, save = save_path + 'final_results.pdf')

Save your results here.

Please note that so is a very large object because it contains the raw image. If you do not have enough disk space, use the following compressed format instead:

import pickle
import gzip

def save_compressed_pickle(data, filename):
    with gzip.GzipFile(filename, 'wb') as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

save_compressed_pickle(so, save_path + "so.pkl.gz")

You can load your file using this code:

import pickle
import gzip

def load_compressed_pickle(filename):
    with gzip.GzipFile(filename, 'rb') as f:
        return pickle.load(f)

so = load_compressed_pickle(save_path + 'so.pkl.gz')
[20]:
with open(save_path + "so.pkl", 'wb') as f:
    pickle.dump(so, f)

adata_sc_final.write(save_path + "adata_sc_final.h5ad")
[22]:
adata1 = sc.read_10x_mtx(data_path +'binned_outputs/square_002um/filtered_feature_bc_matrix')
adata1 = adata1[so.df[so.df.in_tissue == 1]['barcode']]
weight_to_celltype = su.calculate_weight_to_celltype(adatas_final, adata1, cells_final, so)
adata_sc_final2 = su.get_finaldata_fast(cells_final, so, adatas_final, adata1, weight_to_celltype, plot = False)
100%|██████████| 56867/56867 [01:29<00:00, 635.48it/s]
100%|██████████| 1690156/1690156 [04:15<00:00, 6611.70it/s]
[23]:
adata_sc_final2.write(save_path + "adata_sc_final2.h5ad")