doubletdetection.BoostClassifier#
- class doubletdetection.BoostClassifier(boost_rate=0.25, n_components=30, n_top_var_genes=10000, replace=False, clustering_algorithm='phenograph', clustering_kwargs=None, n_iters=10, normalizer=None, pseudocount=0.1, random_state=0, verbose=False, standard_scaling=False, n_jobs=1)[source]#
Classifier for doublets in single-cell RNA-seq data.
- Parameters
boost_rate (float, optional) – Proportion of cell population size to produce as synthetic doublets.
n_components (int, optional) – Number of principal components used for clustering.
n_top_var_genes (int, optional) – Number of highest variance genes to use; other genes discarded. Will use all genes when zero.
replace (bool, optional) – If False, a cell will be selected as a synthetic doublet’s parent no more than once.
clustering_algorithm (str, optional) – One of `["louvain", "leiden", "phenograph"]`. "louvain" and "leiden" refer to the scanpy implementations.
clustering_kwargs (dict, optional) – Keyword args to pass directly to the clustering algorithm. Note that we change the PhenoGraph ‘prune’ default to True. We also set directed=False and resolution=4 for Louvain and Leiden clustering. You must specifically include these params here to change them. random_state and key_added should not be overridden when the clustering algorithm is Louvain or Leiden.
n_iters (int, optional) – Number of fit operations from which to collect p-values. Default value is 10.
normalizer ((sp_sparse) -> ndarray) – Method to normalize raw_counts. Defaults to normalize_counts, included in this package. Note: To use normalize_counts with its pseudocount parameter changed from the default pseudocount value to some positive float new_var, use: normalizer=lambda counts: doubletdetection.normalize_counts(counts, pseudocount=new_var)
pseudocount (float, optional) – Pseudocount used in normalize_counts. If 1 is used and standard_scaling=False, the classifier is much more memory efficient; however, this may result in fewer doublets detected.
random_state (int, optional) – If provided, passed to PCA and used to seed numpy’s RNG. NOTE: PhenoGraph does not currently admit a random seed, and so this will not guarantee identical results across runs.
verbose (bool, optional) – Set to True to print informational messages during normal operation. Defaults to False.
standard_scaling (bool, optional) – Set to True to enable standard scaling of the normalized count matrix prior to PCA. Recommended when not using PhenoGraph. Defaults to False.
n_jobs (int, optional) – Number of jobs to use. Speeds up neighbor computation.
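The interaction between boost_rate and replace can be illustrated with a minimal sketch in plain Python. It assumes that a synthetic doublet is the gene-wise sum of two parent cells' raw counts and that the number of synthetic doublets is int(boost_rate * num_cells); the helper name make_synthetic_doublets and its seed argument are hypothetical and not part of the package.

```python
import random


def make_synthetic_doublets(raw_counts, boost_rate=0.25, replace=False, seed=0):
    """Sketch only: build synthetic doublets from a cells-by-genes list of rows."""
    rng = random.Random(seed)
    num_cells = len(raw_counts)
    num_synths = int(boost_rate * num_cells)
    if replace:
        # Any cell may be drawn as a parent any number of times.
        flat = [rng.randrange(num_cells) for _ in range(2 * num_synths)]
    else:
        # Each cell is drawn as a parent at most once.
        flat = rng.sample(range(num_cells), 2 * num_synths)
    parents = [(flat[2 * i], flat[2 * i + 1]) for i in range(num_synths)]
    # Assumption: a synthetic doublet is the sum of its parents' counts.
    synthetic = [
        [a + b for a, b in zip(raw_counts[p], raw_counts[q])]
        for p, q in parents
    ]
    return synthetic, parents


counts = [[i, 2 * i, 1] for i in range(8)]  # 8 toy cells, 3 genes
synthetic, parents = make_synthetic_doublets(counts, boost_rate=0.25)
```

With 8 cells and boost_rate=0.25 the sketch produces 2 synthetic doublets, and with replace=False their 4 parents are all distinct cells.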
- all_log_p_values_#
Hypergeometric test natural log p-value per cell for cluster enrichment of synthetic doublets. Use for thresholding. Shape (n_iters, num_cells).
- Type
ndarray
- all_scores_#
The fraction of a cell’s cluster that is synthetic doublets. Shape (n_iters, num_cells).
- Type
ndarray
- communities_#
Cluster ID for corresponding cell. Shape (n_iters, num_cells).
- Type
ndarray
- labels_#
0 for singlet, 1 for detected doublet.
- Type
ndarray, ndims=1
- parents_#
Parent cells’ indexes for each synthetic doublet. A list wrapping the results from each run.
- Type
list of sequences of int
- suggested_score_cutoff_#
Cutoff used to classify cells when n_iters == 1 (scores >= cutoff). Not produced when n_iters > 1.
- Type
float
- synth_communities_#
Cluster ID for corresponding synthetic doublet. Shape (n_iters, num_cells * boost_rate).
- Type
sequence of ints
- top_var_genes_#
Indices of the n_top_var_genes used. Not generated if n_top_var_genes <= 0.
- Type
ndarray
- voting_average_#
Fraction of iterations each cell is called a doublet.
- Type
ndarray
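For a single iteration, the relationship between communities_, synth_communities_, all_scores_ and all_log_p_values_ can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the p-value is taken as the upper-tail hypergeometric probability P(X >= k) of seeing at least k synthetic doublets in a cluster, and the exact tail convention the package uses is not stated here. The helper name cluster_scores is hypothetical.

```python
import math
from collections import Counter


def cluster_scores(communities, synth_communities):
    """Sketch only: per-cell score and natural-log p-value for one iteration."""
    orig = Counter(communities)        # original cells per cluster
    synth = Counter(synth_communities)  # synthetic doublets per cluster
    total_synth = len(synth_communities)
    total_cells = len(communities) + total_synth
    scores, log_p = {}, {}
    for cluster in orig:
        k = synth[cluster]             # synthetic doublets in this cluster
        n = orig[cluster] + k          # cluster size, including synthetics
        scores[cluster] = k / n
        # Assumed test: upper tail P(X >= k) of a hypergeometric draw of n
        # cells from total_cells, of which total_synth are synthetic.
        tail = sum(
            math.comb(total_synth, j) * math.comb(total_cells - total_synth, n - j)
            for j in range(k, min(n, total_synth) + 1)
        )
        log_p[cluster] = math.log(tail / math.comb(total_cells, n))
    return ([scores[c] for c in communities], [log_p[c] for c in communities])


communities = [0, 0, 0, 0, 1, 1]  # cluster ID per original cell
synth_communities = [1, 1]        # cluster ID per synthetic doublet
scores, log_p_values = cluster_scores(communities, synth_communities)
```

In this toy input both synthetic doublets land in cluster 1, so its two original cells get a score of 0.5 and a negative log p-value, while the cells in cluster 0 get a score of 0 and a log p-value of 0.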
Methods table#
| doublet_score() | Produce doublet scores |
| fit(raw_counts) | Fits the classifier on raw_counts. |
| predict([p_thresh, voter_thresh]) | Produce doublet calls from fitted classifier |
Methods#
doublet_score#
- BoostClassifier.doublet_score()[source]#
Produce doublet scores
The doublet score is the negative log p-value of doublet enrichment, averaged over the iterations. Higher means more likely to be a doublet.
- Returns
Average negative log p-value over iterations
- Return type
scores (ndarray, ndims=1)
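The averaging described above amounts to the following sketch, which uses nested lists in place of the (n_iters, num_cells) array all_log_p_values_. The function name mean_neg_log_p is hypothetical, and any handling of missing values that the package may perform is omitted.

```python
def mean_neg_log_p(all_log_p_values):
    """Sketch only: average negative log p-value per cell over iterations."""
    n_iters = len(all_log_p_values)
    num_cells = len(all_log_p_values[0])
    return [
        -sum(run[cell] for run in all_log_p_values) / n_iters
        for cell in range(num_cells)
    ]


# Two iterations, two cells; rows are iterations, columns are cells.
toy_log_p = [
    [-2.0, -20.0],
    [-4.0, -10.0],
]
toy_scores = mean_neg_log_p(toy_log_p)  # [3.0, 15.0]
```

The second cell has the more negative log p-values in both iterations, so it receives the higher score.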
fit#
- BoostClassifier.fit(raw_counts)[source]#
Fits the classifier on raw_counts.
- Parameters
raw_counts (array-like) – Count matrix, oriented cells by genes.
- Sets:
all_scores_, all_log_p_values_, communities_, top_var_genes_, parents_, synth_communities_
- Returns
The fitted classifier.
predict#
- BoostClassifier.predict(p_thresh=1e-07, voter_thresh=0.9)[source]#
Produce doublet calls from fitted classifier
- Parameters
p_thresh (float, optional) – Hypergeometric test p-value threshold that determines per-iteration doublet calls.
voter_thresh (float, optional) – Fraction of iterations in which a cell must be called a doublet for it to be labeled a doublet.
- Sets:
labels_ and voting_average_ if n_iters > 1. labels_ and suggested_score_cutoff_ if n_iters == 1.
- Returns
0 for singlet, 1 for detected doublet
- Return type
labels_ (ndarray, ndims=1)
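For n_iters > 1, the voting scheme implied by p_thresh and voter_thresh can be sketched as below. The sketch assumes that a cell is called a doublet in an iteration when its log p-value is at or below log(p_thresh), and labeled a doublet overall when its voting average is at or above voter_thresh; whether the package's comparisons are inclusive is an assumption. The function name vote is hypothetical, and the n_iters == 1 path using suggested_score_cutoff_ is not covered.

```python
import math


def vote(all_log_p_values, p_thresh=1e-7, voter_thresh=0.9):
    """Sketch only: per-iteration calls followed by a vote across iterations."""
    log_thresh = math.log(p_thresh)
    n_iters = len(all_log_p_values)
    num_cells = len(all_log_p_values[0])
    # Fraction of iterations in which each cell passes the p-value threshold.
    voting_average = [
        sum(run[cell] <= log_thresh for run in all_log_p_values) / n_iters
        for cell in range(num_cells)
    ]
    # 1 for detected doublet, 0 for singlet.
    labels = [int(v >= voter_thresh) for v in voting_average]
    return labels, voting_average


# Four iterations, two cells; log(1e-7) is roughly -16.1.
toy_runs = [
    [-20.0, -1.0],
    [-18.0, -2.0],
    [-17.0, -20.0],
    [-5.0, -3.0],
]
toy_labels, toy_votes = vote(toy_runs, p_thresh=1e-7, voter_thresh=0.75)
```

Here the first cell passes the threshold in 3 of 4 iterations and is labeled a doublet at voter_thresh=0.75, while the second passes in only 1 of 4 and remains a singlet.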