DoubletDetection

Doublet detection in single-cell RNA-seq data.

class doubletdetection.doubletdetection.BoostClassifier(boost_rate=0.25, n_components=30, n_top_var_genes=10000, replace=False, use_phenograph=True, phenograph_parameters={'prune': True}, n_iters=25, normalizer=None, random_state=0, verbose=False, standard_scaling=False)[source]

Classifier for doublets in single-cell RNA-seq data.

Parameters:
  • boost_rate (float, optional) – Proportion of cell population size to produce as synthetic doublets.
  • n_components (int, optional) – Number of principal components used for clustering.
  • n_top_var_genes (int, optional) – Number of highest variance genes to use; other genes discarded. Will use all genes when zero.
  • replace (bool, optional) – If False, a cell will be selected as a synthetic doublet’s parent no more than once.
  • use_phenograph (bool, optional) – Set to False to disable PhenoGraph clustering in exchange for louvain clustering implemented in scanpy. Defaults to True.
  • phenograph_parameters (dict, optional) – Parameter dict to pass directly to PhenoGraph. Note that we change the PhenoGraph ‘prune’ default to True; you must specifically include ‘prune’: False here to change this. Only used when use_phenograph is True.
  • n_iters (int, optional) – Number of fit operations from which to collect p-values. Defualt value is 25.
  • normalizer ((sp_sparse) -> ndarray) – Method to normalize raw_counts. Defaults to normalize_counts, included in this package. Note: To use normalize_counts with its pseudocount parameter changed from the default 0.1 value to some positive float new_var, use: normalizer=lambda counts: doubletdetection.normalize_counts(counts, pseudocount=new_var)
  • random_state (int, optional) – If provided, passed to PCA and used to seedrandom seed numpy’s RNG. NOTE: PhenoGraph does not currently admit a random seed, and so this will not guarantee identical results across runs.
  • verbose (bool, optional) – Set to False to silence all normal operation informational messages. Defaults to True.
  • standard_scaling (bool, optional) – Set to True to enable standard scaling of normalized count matrix prior to PCA. Recommended when not using Phenograph. Defaults to False.
all_log_p_values_

Hypergeometric test natural log p-value per cell for cluster enrichment of synthetic doublets. Shape (n_iters, num_cells).

Type:ndarray
all_p_values_

DEPRECATED. Exponentiated all_log_p_values. Due to rounding point errors, use of all_log_p_values recommended. Will be removed in v3.0.

Type:ndarray
all_scores_

The fraction of a cell’s cluster that is synthetic doublets. Shape (n_iters, num_cells).

Type:ndarray
communities_

Cluster ID for corresponding cell. Shape (n_iters, num_cells).

Type:ndarray
labels_

0 for singlet, 1 for detected doublet.

Type:ndarray, ndims=1
parents_

Parent cells’ indexes for each synthetic doublet. A list wrapping the results from each run.

Type:list of sequences of int
suggested_score_cutoff_

Cutoff used to classify cells when n_iters == 1 (scores >= cutoff). Not produced when n_iters > 1.

Type:float
synth_communities_

Cluster ID for corresponding synthetic doublet. Shape (n_iters, num_cells * boost_rate).

Type:sequence of ints
top_var_genes_

Indices of the n_top_var_genes used. Not generated if n_top_var_genes <= 0.

Type:ndarray
voting_average_

Fraction of iterations each cell is called a doublet.

Type:ndarray
fit(raw_counts)[source]

Fits the classifier on raw_counts.

Parameters:raw_counts (array-like) – Count matrix, oriented cells by genes.
Sets:
all_scores_, all_p_values_, all_log_p_values_, communities_, top_var_genes, parents, synth_communities
Returns:The fitted classifier.
predict(p_thresh=1e-07, voter_thresh=0.9)[source]

Produce doublet calls from fitted classifier

Parameters:
  • p_thresh (float, optional) – hypergeometric test p-value threshold that determines per iteration doublet calls
  • voter_thresh (float, optional) – fraction of iterations a cell must be called a doublet
Sets:
labels_ and voting_average_ if n_iters > 1. labels_ and suggested_score_cutoff_ if n_iters == 1.
Returns:0 for singlet, 1 for detected doublet
Return type:labels_ (ndarray, ndims=1)
doubletdetection.doubletdetection.load_10x_h5(file, genome)[source]
Load count matrix in 10x H5 format
Adapted from: https://support.10xgenomics.com/single-cell-gene-expression/software/ pipelines/latest/advanced/h5_matrices
Parameters:
  • file (str) – Path to H5 file
  • genome (str) – genome, top level h5 group
Returns:

Raw count matrix. ndarray: Barcodes ndarray: Gene names

Return type:

ndarray

doubletdetection.doubletdetection.load_mtx(file)[source]

Load count matrix in mtx format

Parameters:file (str) – Path to mtx file
Returns:Raw count matrix.
Return type:ndarray