Supplementary Materials

Additional file 1: This file includes: (1) supplementary methods describing details of single-cell quality control and preprocessing, application details of additional DE methods, and a statistical model linking UMI and read counts; (2) all supplementary figures.

Source code is available on Bitbucket (https://bitbucket.org/Wenan/nbid) and is also archived under DOI 10.5281/zenodo.1225670. The code for data QC and DE analysis using other packages can be downloaded from https://bitbucket.org/Wenan/scrna_qc_de. The public datasets used in this paper are from Ziegenhain et al., Zheng et al., Grun et al., Jatin et al., Klein et al., Islam et al., and Scialdone et al.

Abstract

Read counting and unique molecular identifier (UMI) counting are the principal gene expression quantification schemes used in single-cell RNA-sequencing (scRNA-seq) analysis. By using multiple scRNA-seq datasets, we reveal distinct distribution differences between these schemes and conclude that the negative binomial model is a good approximation for UMI counts, even in heterogeneous populations. We further propose a novel differential expression analysis algorithm based on a negative binomial model with independent dispersions in each group (NBID). Our results show that this method properly controls the FDR and achieves better power for UMI counts when compared to other recently developed packages for scRNA-seq analysis.

Electronic supplementary material

The online version of this article (10.1186/s13059-018-1438-9) contains supplementary material, which is available to authorized users.

[Figure caption; beginning truncated in the source] ... of two cells with similar read counts or UMI counts. a, b Read counts for Smart-Seq2. c, d Read counts for CEL-Seq2/C1. e, f UMI counts for CEL-Seq2/C1. a, c, e Plots with color-coded density, with the highest density at the origin.

Modeling and goodness of fit for UMI counts in large-scale scRNA-seq datasets

Although the datasets of Ziegenhain et al.
provided an unparalleled opportunity to evaluate the difference between read counts and UMI counts, the number of cells captured was relatively small (range = 29–80). We extended our analysis to additional datasets generated by different platforms [7, 20–23] to evaluate whether the same pattern held for other datasets in general. Despite technical differences among protocols and heterogeneity within cell populations, the model selection and goodness-of-fit analysis for these datasets overall supported our conclusion that UMI counts can be modeled by simpler models than read counts (Additional file 2: Tables S1A and S1B).

Since 2016, many Drop-seq UMI-based systems have appeared with the ability to process a large number of cells in one experiment [2, 8]. We therefore studied whether the same pattern held for such large-scale datasets. We applied the described model-selection method and goodness-of-fit test to the following datasets: (1) CD4+ naïve T cells (9850 cells) and (2) CD4+ memory T cells (9578 cells), both of which were generated on the GemCode platform (10x Genomics, Pleasanton, CA, USA); and (3) Rh41 cells, a human positive alveolar rhabdomyosarcoma (ARMS) cell line (6875 cells) prepared in-house on the Chromium platform (10x Genomics). Rh41 cells contained two distinct subpopulations based on unsupervised clustering analysis (Additional file 1: Figure S2) and were included to evaluate the effects of strong heterogeneity on model selection and fitting (Table 3). Although few genes (4–7, 0.04–0.06%) preferred the ZINB model in the relatively homogeneous T-cell populations, the percentage of genes selecting the ZINB model in Rh41 cells was slightly elevated, albeit still low (39 genes, 0.21%). The expression of these genes differed significantly between the two clusters (FDR < 0.05, Wilcoxon rank sum test; see also Additional file 2: Table S2), suggesting that the fraction of genes preferring the ZINB model correlates with the level of heterogeneity.

Table 3 Number of genes with selected models for large-scale datasets on the GemCode and Chromium platforms

Fig. 2 Goodness of fit using the negative binomial distribution on the naïve T-cell data (Tn). a The empirical and theoretical probability mass function (pmf) for the first gene at the FDR 0.2 cutoff. b The empirical and theoretical cumulative distribution function (cdf) for the same gene. c, d The same pmf and cdf plots for the first gene at the FDR 0.05 cutoff. e, f The same pmf and cdf plots for the gene with the worst FDR

scRNA-seq differential expression analysis

A direct consequence of properly modeling scRNA-seq counts is the ability to accurately conduct differential expression analyses. Based on the knowledge derived from UMI-count modeling, we proposed an NB-based algorithm for differential expression analysis of large-scale UMI-based scRNA-seq data. We extended the general NB-based models by allowing independent dispersion parameters in each biological condition, resulting in the NBID method. This approach is analogous to.
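
The idea of giving each group its own dispersion can be illustrated with a minimal Python sketch for a single gene and two groups. This is not the authors' NBID implementation (their code is in the Bitbucket repository cited in this article), and several things here are assumptions made for illustration only: the function names `simulate_nb`, `nb_negloglik`, `best_negloglik_given_mean` and `nbid_lrt` are hypothetical, size factors and covariates are omitted, the search range for the log-dispersion is arbitrary, both groups are assumed to have a positive mean count, and a likelihood-ratio test with one degree of freedom is used as one plausible way to compare the group means.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import chi2, nbinom

# Assumed search range for the log-dispersion (illustrative, not from the paper).
LOG_PHI_BOUNDS = (-10.0, 5.0)


def simulate_nb(mu, phi, n_cells, rng):
    """Draw NB counts with mean mu and dispersion phi (variance = mu + phi * mu**2)."""
    size = 1.0 / phi
    return rng.negative_binomial(size, size / (size + mu), n_cells)


def nb_negloglik(counts, mu, log_phi):
    """Negative NB log-likelihood for one group with mean mu and dispersion exp(log_phi)."""
    size = np.exp(-log_phi)        # scipy's "n" parameter equals 1 / phi
    prob = size / (size + mu)      # scipy's success probability "p"
    return -nbinom.logpmf(counts, size, prob).sum()


def best_negloglik_given_mean(counts, mu):
    """Minimize the negative log-likelihood over this group's own dispersion."""
    res = minimize_scalar(lambda lp: nb_negloglik(counts, mu, lp),
                          bounds=LOG_PHI_BOUNDS, method="bounded")
    return res.fun


def nbid_lrt(group1, group2):
    """Likelihood-ratio test for equal means, keeping a separate dispersion per group."""
    y1, y2 = np.asarray(group1), np.asarray(group2)
    m1, m2 = y1.mean(), y2.mean()

    # Alternative: each group has its own mean. Without size factors, the NB
    # maximum-likelihood estimate of the mean is the sample mean.
    nll_alt = best_negloglik_given_mean(y1, m1) + best_negloglik_given_mean(y2, m2)

    # Null: one shared mean but still two dispersions; profile over the log-mean.
    def profile(log_mu):
        mu = np.exp(log_mu)
        return best_negloglik_given_mean(y1, mu) + best_negloglik_given_mean(y2, mu)

    lo, hi = np.log(min(m1, m2)), np.log(max(m1, m2))
    null = minimize_scalar(profile, bounds=(lo - 0.5, hi + 0.5), method="bounded")

    stat = max(0.0, 2.0 * (null.fun - nll_alt))
    return stat, chi2.sf(stat, df=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = simulate_nb(2.0, 0.5, 300, rng)   # group with low mean, low dispersion
    b = simulate_nb(8.0, 2.0, 300, rng)   # group with higher mean and dispersion
    print(nbid_lrt(a, b))
```

Because the dispersions are estimated separately under both the null and the alternative, a difference in variability between the groups alone does not count as evidence for a difference in mean expression in this sketch.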