DoubletDetection¶

Doublet detection in single-cell RNA-seq data.

class doubletdetection.doubletdetection.BoostClassifier(boost_rate=0.25, n_components=30, n_top_var_genes=10000, replace=False, use_phenograph=True, phenograph_parameters={'prune': True}, n_iters=25, normalizer=None, random_state=0, verbose=False, standard_scaling=False)[source]¶

Classifier for doublets in single-cell RNA-seq data.

Parameters:

boost_rate (float, optional) – Proportion of cell population size to produce as synthetic doublets.
n_components (int, optional) – Number of principal components used for clustering.
n_top_var_genes (int, optional) – Number of highest variance genes to use; other genes discarded. Will use all genes when zero.
replace (bool, optional) – If False, a cell will be selected as a synthetic doublet’s parent no more than once.
use_phenograph (bool, optional) – Set to False to disable PhenoGraph clustering in exchange for louvain clustering implemented in scanpy. Defaults to True.
phenograph_parameters (dict, optional) – Parameter dict to pass directly to PhenoGraph. Note that we change the PhenoGraph ‘prune’ default to True; you must specifically include ‘prune’: False here to change this. Only used when use_phenograph is True.
n_iters (int, optional) – Number of fit operations from which to collect p-values. Defualt value is 25.
normalizer ((sp_sparse) -> ndarray) – Method to normalize raw_counts. Defaults to normalize_counts, included in this package. Note: To use normalize_counts with its pseudocount parameter changed from the default 0.1 value to some positive float new_var, use: normalizer=lambda counts: doubletdetection.normalize_counts(counts, pseudocount=new_var)
random_state (int, optional) – If provided, passed to PCA and used to seedrandom seed numpy’s RNG. NOTE: PhenoGraph does not currently admit a random seed, and so this will not guarantee identical results across runs.
verbose (bool, optional) – Set to False to silence all normal operation informational messages. Defaults to True.
standard_scaling (bool, optional) – Set to True to enable standard scaling of normalized count matrix prior to PCA. Recommended when not using Phenograph. Defaults to False.

all_log_p_values_¶

Hypergeometric test natural log p-value per cell for cluster enrichment of synthetic doublets. Shape (n_iters, num_cells).

Type:	ndarray

all_p_values_¶

DEPRECATED. Exponentiated all_log_p_values. Due to rounding point errors, use of all_log_p_values recommended. Will be removed in v3.0.

Type:	ndarray

all_scores_¶

The fraction of a cell’s cluster that is synthetic doublets. Shape (n_iters, num_cells).

Type:	ndarray

communities_¶

Cluster ID for corresponding cell. Shape (n_iters, num_cells).

Type:	ndarray

labels_¶

0 for singlet, 1 for detected doublet.

Type:	ndarray, ndims=1

parents_¶

Parent cells’ indexes for each synthetic doublet. A list wrapping the results from each run.

Type:	list of sequences of int

suggested_score_cutoff_¶

Cutoff used to classify cells when n_iters == 1 (scores >= cutoff). Not produced when n_iters > 1.

Type:	float

synth_communities_¶

Cluster ID for corresponding synthetic doublet. Shape (n_iters, num_cells * boost_rate).

Type:	sequence of ints

top_var_genes_¶

Indices of the n_top_var_genes used. Not generated if n_top_var_genes <= 0.

Type:	ndarray

voting_average_¶

Fraction of iterations each cell is called a doublet.

Type:	ndarray

fit(raw_counts)[source]¶

Fits the classifier on raw_counts.

Parameters:	raw_counts (array-like) – Count matrix, oriented cells by genes.

Sets:: all_scores_, all_p_values_, all_log_p_values_, communities_, top_var_genes, parents, synth_communities

Returns:	The fitted classifier.

predict(p_thresh=1e-07, voter_thresh=0.9)[source]¶

Produce doublet calls from fitted classifier

Parameters:	p_thresh (float, optional) – hypergeometric test p-value threshold that determines per iteration doublet calls voter_thresh (float, optional) – fraction of iterations a cell must be called a doublet

Sets:: labels_ and voting_average_ if n_iters > 1. labels_ and suggested_score_cutoff_ if n_iters == 1.

Returns:	0 for singlet, 1 for detected doublet
Return type:	labels_ (ndarray, ndims=1)

doubletdetection.doubletdetection.load_10x_h5(file, genome)[source]¶

Load count matrix in 10x H5 format: Adapted from: https://support.10xgenomics.com/single-cell-gene-expression/software/ pipelines/latest/advanced/h5_matrices

Parameters:	file (str) – Path to H5 file genome (str) – genome, top level h5 group
Returns:	Raw count matrix. ndarray: Barcodes ndarray: Gene names
Return type:	ndarray

doubletdetection.doubletdetection.load_mtx(file)[source]¶

Load count matrix in mtx format

Parameters:	file (str) – Path to mtx file
Returns:	Raw count matrix.
Return type:	ndarray