FAQ

Input and Output 

Can cellsnp-lite be used for other omics, such as scDNA-seq or scATAC-seq data?

Yes. Cellsnp-lite supports multiple omics data from various sequencing platforms. You may use proper options for different omics, mainly the --cellTAG for cell tag and --UMItag for UMI tag, e.g., please set --UMItag None for 10x scDNA-seq or scATAC-seq data.

See Processing other omics data for details and more options.

Can cellsnp-lite be used for spatial transcriptomics data?

Yes. Efforts have been made to explore using cellsnp-lite to calculate the B allele frequency (BAF) as one input signal for Numbat to calculate the CNV probability per spot in spatial transcriptomics (Arora et al, 2023).

Can cellsnp-lite be used for mouse data?

Yes.

For Mode 2, by default it runs on chr1 to 22 on human. For mouse, you need to specify it to 1,2,...,19 (replace the ellipsis).

On the other hand, if it is lab straint with genotype available, e.g., from the mouse genome project (ftp://ftp-mouse.sanger.ac.uk/current_snps), you can also run Mode 1.

Why file “cellSNP.cells.vcf.gz” is missing in the output folder?

The --genotype option should be added for cellsnp-lite to output the file cellSNP.cells.vcf.gz.

Running Mode 

Mode 2a is too slow on large dataset, what should I do?

Mode 2a is more suitable for small datasets. For large datasets, you may try Mode 2b + Mode 1a. Mode 2a does joint calling and genotyping, but it is substantially slower than calling first in a bulk manner by Mode 2b followed by genotyping in Mode 1a. To speed up, you may try --minMAF 0.1 --minCOUNT 100 options in both modes.

My dataset is not large, why it takes so long time?

To speedup, you may

Check whether the cell barcodes are “filtered”, i.e., from filtered_gene_bc_matrices instead of from raw_gene_bc_matrices in the cellranger output folder (update -b);
Try to use the SNP list from AF5e2 VCF file instead of AF5e4 in this human SNP list folder (update -R);
Use more threads or cores (update -p).

Pileup and Genotype Results 

Why the reference alleles (REF) are different from FASTA file (in Mode 2)?

Cellsnp-lite Mode 2 takes the allele with the highest count as REF and the second highest as ALT by default. Therefore, neither allele is necessarily identical to the actual (genomic) reference in Mode 2. This is different from Mode 1, which uses the REF and ALT alleles specified in the input VCF.

However, since v1.2.2, cellsnp-lite has the -f or --refseq option to extract the real (genomic) reference allele from FASTA file as REF, and assign the allele (other than REF) with the highest UMI/read counts as the ALT.

How to check the real reads covering an SNP in a cell?

You can extract the reads by samtools and then view them in IGV.

To extract reads covering a SNP and output to a BAM file (assuming the SNP position is chr1:100000):

samtools view -h -b  "input_BAM"  chr1:100000  >  "output_BAM"
samtoos index "output_BAM"

If you only want to extract SNP reads in specific cell (assuming cell barcode is XXX-1 and cell tag is CB):

samtools view -h -b  -d CB:XXX-1  "input_BAM"  chr1:100000  >  "output_BAM"
samtoos index "output_BAM"

Then you can load the output BAM file above into IGV to view the reads.

Why the output allele counts are different from IGV view?

IGV would filter some reads by default, which could lead to the difference in allele counts with cellsnp-lite output. The allele counts should be the same if given the same read filtering settings.

You may refer to the question on this page Which reads will be filtered by cellsnp-lite? and Preferences -> Alignments for read filtering settings of cellsnp-lite and IGV, respectively.

Why the output read counts are different from samtools view?

The inconsistency of read counts between samtools and cellsnp-lite is probably due to the different filtering settings of the two tools, e.g., by default, cellsnp-lite will filter some low-quality reads (please check --exclFLAG option) while samtools do not. To make the filtering settings the same, you can use -F option in samtools view.

Why the output read counts are different from cellSNP (python version)?

The difference in read counts is probably because the two methods used different read filtering settings, especially in Mode 2.

In Mode 2, cellSNP (actually the dependency pysam.pileup()) has a default limitation that the max_depth (i.e., max pileup-ed read count) is 8000, However, cellsnp-lite does not have this max_depth limitation by default, it will pileup as many reads as possible. You may try using the same read filtering settings for both cellsnp-lite and cellSNP, to make their read counts highly concordant in Mode 2.

About Implementation 

Which SNPs will be filtered by cellsnp-lite?

Cellsnp-lite has a few options for SNP filtering. By default, SNP will be filtered if

its aggregated UMI (if --UMItag is not None) or read (otherwise) count is <20 (--minCOUNT);
its minor allele frequency (the frequency of the allele with second highest read or UMI count) is <0 (--minMAF).

See Optional Arguments in manual for details and more options.

Which reads will be filtered by cellsnp-lite?

Cellsnp-lite has a few options for read filtering. By default, read will be filtered if

it does not contain target cell tag (if set in --cellTAG) or its cell barcode is not in the input barcode list (-b);
it does not contain target UMI tag (if set in --UMItag);
any mask bits is set in SAM FLAG: UNMAP, SECONDARY, QCFAIL (when use UMI) or UNMAP, SECONDARY, QCFAIL, DUP (otherwise).
its mapped length is <30 (--minLEN);
its mapping quality MAPQ is <20 (--minMAPQ);
total pileup read count per input file is >INT_MAX (--maxDEPTH);
it is not mapped in proper pairs (--countORPHAN).

See Read Filtering in manual for details and more options.

What is the computational efficiency of cellsnp-lite?

In theory, the computational complexity (i.e., running time) of cellSNP-lite is O(n) for number of variants and O(n*log(n)) for number of cells, which means it is more sensitive to the cell counts.

For human SNP list, we suggest using the version with AF5e2 (i.e., AF>5%, 7.4M SNPs), instead of AF5e4 (i.e., AF>0.05%, 36.6M SNPs).

Is it fine to have multiple primary alignments in BAM file?

IMPO, the “multi-primary” strategy, in which multiple alignments with the best score are labeled as primary, should be fine for downstream tasks if the fraction of the “extra” primary alignments is low.

Generally, we recommend to use “single-primary” strategy for genotyping, in which only one alignment with best alignment score is labelled as primary and the rest as secondary.

See detailed discussion in issue #39.

Command Line Options 

What is `--minMAF`, can I post-filter SNPs based on `minMAF`?

Cellsnp-lite was designed for bi-allelic SNPs. In its Mode 1, REF and ALT alleles are specified by user while in mode 2, REF and ALT are inferred from data as the alleles with highest and second highest read(UMI) counts. Therefore, in Mode 1, the REF or ALT in the reference VCF could be different from the major or minor allele inferred from data. For example, the ALT in VCF could be REF in the data.

In cellsnp cmdline (for both Mode 1 and 2), MAF is always caculated as the fraction read(UMI)_count_of_minor_allele / read(UMI)_count_of_all_alleles, where the minor allele is the allele with second highest read(UMI) count inferred from data. See also issue #77.

Therefore, in Mode 1, post-filtering SNPs based on the minimum allele frequency of the REF and ALT alleles in VCF file could be different from filtering SNPs with --minMAF in the cellsnp cmdline, for a small subset of SNPs whose major allele (with highest read/UMI count) or minor allele (second highest) is neither REF or ALT allele but one of the OTH alleles. See also issue #90.

The number of SNPs whose major or minor allele is one of the OTH alleles is expected to be quite small (in Mode 1), given the input reference VCF is reliable (e.g., with common SNPs compiled from 1000 genome project), hence should have limited influence on downstream donor deconvolution.

Downstream Demultiplexing with Vireo 

Cellsnp-lite output not working with Vireo using mitochondrial SNPs?

The large mitochondrial read counts in cellsnp-lite output makes it more likely for vireo to reach local optima so that the parameters of donors become the same and hence vireo cannot assign the cells to certain donor.

Besides, vireo is designed for nuclear SNVs. For mito SNVs, you may want to try this vireo Mito tutorial, which was used by MQuad. Note that the duplicate reads should probably be removed beforehand, if there are no UMIs in your data.

How can I use genotypes for each individual when avaiable?

You may use bcftools merge to make a combined VCF for all donors.

Downstream Clonal Substructure Discovery with MQuad 

What is the recommended pipeline for SMART-seq2 data (MQuad)?

For SMART-seq2 data, cellsnp-lite supports multiple BAM files as input, so there is no need to merge all FASTQ files to generate single BAM file.

Specifically, one BAM file (and corresponding .bai index file) should be generated per FASTQ (pairs, Read 1&2), e.g, with STAR. Then cellsnp-lite can be run with the -S option, which will pileup every well (cell) in one go. The “-S” option requires a plain file listing BAM files, each per line.

Notably, as SMART-seq2 reads have no cell barcode and UMI tag, you may need to specify --cellTAG None and --UMItag None in cellsnp-lite. Additionally, the —gzip option can be used to output zipped VCF file(s).

The output of cellsnp-lite (“-S” option) run on SMART-seq2 data will be in the same format as 10x scRNA-seq, i.e., the three SNP x cell count matrices in the output folder are already “pooled” from all input wells (cells), each row is SNP and column is cell.

Therefore, the downstream processing (after running cellsnp-lite “-S”) should be similar to 10x scRNA-seq data, such as running MQuad with the cellsnp folder (three “pooled” matrices) and then using vireo for clonal reconstruction (see vireo Mito tutorial for details).