DoubletDetection
Doublet detection in single-cell RNA-seq data.
-
class doubletdetection.doubletdetection.BoostClassifier(boost_rate=0.25, n_components=30, n_top_var_genes=10000, replace=False, use_phenograph=True, phenograph_parameters={'prune': True}, n_iters=25, normalizer=None, random_state=0, verbose=False, standard_scaling=False)
Classifier for doublets in single-cell RNA-seq data.
Parameters: - boost_rate (float, optional) – Proportion of cell population size to produce as synthetic doublets.
- n_components (int, optional) – Number of principal components used for clustering.
- n_top_var_genes (int, optional) – Number of highest variance genes to use; other genes discarded. Will use all genes when zero.
- replace (bool, optional) – If False, a cell will be selected as a synthetic doublet’s parent no more than once.
- use_phenograph (bool, optional) – Set to False to replace PhenoGraph clustering with the Louvain clustering implemented in scanpy. Defaults to True.
- phenograph_parameters (dict, optional) – Parameter dict to pass directly to PhenoGraph. Note that we change the PhenoGraph ‘prune’ default to True; you must specifically include ‘prune’: False here to change this. Only used when use_phenograph is True.
- n_iters (int, optional) – Number of fit operations from which to collect p-values. Default value is 25.
- normalizer ((sp_sparse) -> ndarray) – Method to normalize raw_counts. Defaults to normalize_counts, included in this package. Note: To use normalize_counts with its pseudocount parameter changed from the default 0.1 value to some positive float new_var, use: normalizer=lambda counts: doubletdetection.normalize_counts(counts, pseudocount=new_var)
- random_state (int, optional) – If provided, passed to PCA and used to seed numpy’s RNG. NOTE: PhenoGraph does not currently admit a random seed, so this will not guarantee identical results across runs.
- verbose (bool, optional) – Set to False to silence all informational messages from normal operation. Defaults to False.
- standard_scaling (bool, optional) – Set to True to enable standard scaling of normalized count matrix prior to PCA. Recommended when not using Phenograph. Defaults to False.
-
all_log_p_values_
Hypergeometric test natural log p-value per cell for cluster enrichment of synthetic doublets. Shape (n_iters, num_cells).
Type: ndarray
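To make the enrichment test concrete, here is a minimal pure-Python sketch of an upper-tail hypergeometric log p-value for a single cluster. It is not the package's implementation: the function name, the argument names, and the choice of an inclusive tail P(X >= k) are assumptions made for illustration.

```python
import math


def log_hypergeom_upper_tail(k, population, successes, draws):
    """Natural log of P(X >= k) for X ~ Hypergeometric(population, successes, draws).

    Illustrative mapping (an assumption, not the library's code):
      population = original cells + synthetic doublets
      successes  = synthetic doublets overall
      draws      = size of the cluster being tested
      k          = synthetic doublets observed in that cluster
    """
    upper = min(successes, draws)
    if k > upper:
        return float("-inf")  # the event is impossible
    total = math.comb(population, draws)
    tail = sum(
        math.comb(successes, i) * math.comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    # Subtract logs of exact integers to avoid underflow for tiny p-values.
    return math.log(tail) - math.log(total)


# 100 cells plus 25 synthetic doublets; a cluster of 10 holds 8 synthetics.
log_p = log_hypergeom_upper_tail(8, population=125, successes=25, draws=10)
```

Working in log space is what makes very small p-values comparable without rounding to zero.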
-
all_p_values_
DEPRECATED. Exponentiated all_log_p_values_. Because of floating point rounding errors, use of all_log_p_values_ is recommended. Will be removed in v3.0.
Type: ndarray
-
all_scores_
The fraction of a cell’s cluster that is synthetic doublets. Shape (n_iters, num_cells).
Type: ndarray
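As a rough illustration of this score for one iteration, the sketch below computes, for each original cell, the fraction of its cluster made up of synthetic doublets. The function name and the plain-list inputs are hypothetical; the package itself works on arrays.

```python
from collections import Counter


def cluster_scores(communities, synth_communities):
    """Per-cell score: synthetic doublets in the cell's cluster / cluster size.

    communities       -- cluster ID for each original cell
    synth_communities -- cluster ID for each synthetic doublet
    """
    orig_counts = Counter(communities)
    synth_counts = Counter(synth_communities)
    fraction = {
        c: synth_counts[c] / (synth_counts[c] + orig_counts[c])
        for c in orig_counts
    }
    return [fraction[c] for c in communities]


# Three cells in cluster 0 (no synthetics), one cell in cluster 1 (three synthetics).
scores = cluster_scores([0, 0, 0, 1], [1, 1, 1])
```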
-
communities_
Cluster ID for corresponding cell. Shape (n_iters, num_cells).
Type: ndarray
-
labels_
0 for singlet, 1 for detected doublet.
Type: ndarray, ndims=1
-
parents_
Parent cells’ indexes for each synthetic doublet. A list wrapping the results from each run.
Type: list of sequences of int
-
suggested_score_cutoff_
Cutoff used to classify cells when n_iters == 1 (scores >= cutoff). Not produced when n_iters > 1.
Type: float
-
synth_communities_
Cluster ID for corresponding synthetic doublet. Shape (n_iters, num_cells * boost_rate).
Type: sequence of ints
-
top_var_genes_
Indices of the n_top_var_genes used. Not generated if n_top_var_genes <= 0.
Type: ndarray
-
voting_average_
Fraction of iterations each cell is called a doublet.
Type: ndarray
-
fit(raw_counts)
Fits the classifier on raw_counts.
Parameters: raw_counts (array-like) – Count matrix, oriented cells by genes.
Sets: all_scores_, all_p_values_, all_log_p_values_, communities_, top_var_genes_, parents_, synth_communities_
Returns: The fitted classifier.
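Because fit returns the fitted classifier, a call to predict can be chained onto it. The wrapper below is a minimal usage sketch built only from the signatures documented on this page; the function name call_doublets is hypothetical, and the import is deferred so the sketch can be defined without the package installed.

```python
def call_doublets(raw_counts, n_iters=25):
    """Fit BoostClassifier on a cells-by-genes count matrix and return 0/1 labels."""
    # Deferred import: requires the doubletdetection package at call time only.
    from doubletdetection.doubletdetection import BoostClassifier

    clf = BoostClassifier(
        n_iters=n_iters,
        use_phenograph=False,   # use scanpy's Louvain clustering instead of PhenoGraph
        standard_scaling=True,  # recommended on this page when not using PhenoGraph
        random_state=0,
    )
    # fit() returns the classifier itself, so predict() can follow directly.
    return clf.fit(raw_counts).predict(p_thresh=1e-7, voter_thresh=0.9)
```

The thresholds passed to predict are the documented defaults, written out here only to make them visible.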
-
predict(p_thresh=1e-07, voter_thresh=0.9)
Produce doublet calls from the fitted classifier.
Parameters: - p_thresh (float, optional) – Hypergeometric test p-value threshold that determines per-iteration doublet calls.
- voter_thresh (float, optional) – Fraction of iterations in which a cell must be called a doublet to be labeled a doublet.
Sets: labels_ and voting_average_ if n_iters > 1; labels_ and suggested_score_cutoff_ if n_iters == 1.
Returns: labels_, with 0 for singlet and 1 for detected doublet.
Return type: ndarray, ndims=1
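The interplay of p_thresh and voter_thresh for n_iters > 1 can be sketched in pure Python as follows. This is an illustration of the voting idea, not the package's code: the function name is hypothetical, and treating both comparisons as inclusive (<= and >=) is an assumption.

```python
import math


def vote_doublets(all_log_p_values, p_thresh=1e-7, voter_thresh=0.9):
    """Return (labels, voting_average) from per-iteration log p-values.

    all_log_p_values -- list of n_iters rows, each with one log p-value per cell
    """
    log_thresh = math.log(p_thresh)  # compare in log space
    n_iters = len(all_log_p_values)
    n_cells = len(all_log_p_values[0])
    # Fraction of iterations in which each cell passes the p-value threshold.
    voting_average = [
        sum(all_log_p_values[i][j] <= log_thresh for i in range(n_iters)) / n_iters
        for j in range(n_cells)
    ]
    labels = [1 if v >= voter_thresh else 0 for v in voting_average]
    return labels, voting_average


# Three iterations, two cells: cell 0 passes every time, cell 1 only once.
log_p = [[-20.0, -1.0], [-18.0, -30.0], [-25.0, -2.0]]
labels, voting_average = vote_doublets(log_p)
```

Raising voter_thresh makes calls more conservative, since a cell must pass the p-value threshold in a larger share of iterations.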
-
doubletdetection.doubletdetection.load_10x_h5(file, genome)
Load count matrix in 10x H5 format.
Adapted from: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/h5_matrices
Parameters: - file (str) – Path to H5 file
- genome (str) – Genome name, the top-level H5 group.
Returns: Raw count matrix, barcodes, and gene names, in that order.
Return type: ndarray (for each of the three values)