Doublet Detection on 10k PBMCs from 10x Genomics v3#

[1]:
import numpy as np
import doubletdetection
import scanpy as sc
import matplotlib.pyplot as plt

sc.settings.n_jobs=8
sc.set_figure_params()
%matplotlib inline

Download Data from 10x#

Load Count Matrix#

[2]:
adata = sc.read_10x_h5(
    "pbmc_10k_v3_filtered_feature_bc_matrix.h5",
    backup_url="https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_10k_v3/pbmc_10k_v3_filtered_feature_bc_matrix.h5"
)
adata.var_names_make_unique()
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
[3]:
# remove "empty" genes
sc.pp.filter_genes(adata, min_cells=1)

Run Doublet Detection#

Here we show-off the new backend implementation that uses scanpy. This new implementation is over 2x faster than version 2.4.0. To use the previous version of DoubletDetection please add the parameters (clustering_algorithm="phenograph", verbose=True, standard_scaling=False) to the classifier and use the thresholds p_thresh=1e-7, voter_thresh=0.8. We recommend using these original parameters if the parameters below lead to undesirable results.

We support the following clustering algorithms:

  1. phenograph

  2. louvain

  3. leiden

The latter two use the scanpy implementations. The default pseudocount for log transform is 0.1. The classifier can become much more memory efficient (but slower) if pseudocount=1 and standard_scaling=False as the array can remain sparse the entire time. This may lead to fewer detected doublets, in which case the prediction thresholds can be manipulated.

[4]:
clf = doubletdetection.BoostClassifier(
    n_iters=10,
    clustering_algorithm="louvain",
    standard_scaling=True,
    pseudocount=0.1,
    n_jobs=-1,
)
doublets = clf.fit(adata.X).predict(p_thresh=1e-16, voter_thresh=0.5)
doublet_score = clf.doublet_score()
[5]:
adata.obs["doublet"] = doublets
adata.obs["doublet_score"] = doublet_score

Visualize Results#

Convergence of doublet calls#

[6]:
f = doubletdetection.plot.convergence(clf, save='convergence_test.pdf', show=True, p_thresh=1e-16, voter_thresh=0.5)
_images/tutorial_11_0.png

Doublets on umap#

[7]:
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)
sc.tl.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
[8]:
sc.pl.umap(adata, color=["doublet", "doublet_score"])
/home/adam/.pyenv/versions/3.9.7/envs/doubletdetection/lib/python3.9/site-packages/anndata/_core/anndata.py:1228: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.
  c.reorder_categories(natsorted(c.categories), inplace=True)
... storing 'feature_types' as categorical
/home/adam/.pyenv/versions/3.9.7/envs/doubletdetection/lib/python3.9/site-packages/anndata/_core/anndata.py:1228: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.
  c.reorder_categories(natsorted(c.categories), inplace=True)
... storing 'genome' as categorical
_images/tutorial_14_1.png
[9]:
sc.pl.violin(adata, "doublet_score")
_images/tutorial_15_0.png

Number of predicted doublets at different threshold combinations#

[10]:
f3 = doubletdetection.plot.threshold(clf, save='threshold_test.pdf', show=True, p_step=6)
_images/tutorial_17_0.png