Dataset of NvWA

Description of datasets for eight species

Here, we provide single-cell labels for eight species.

Labels

Dataset.*_train_test.cells.csv.gz
Dataset.*_train_test.genes.csv.gz
Dataset.*_train_test.label.npz

Human   Mouse   Zebrafish   Drosophila   Ciona   Earthworm   C.elegans   Planarian
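The internal layout of the `.npz` label files is not described here, so a reasonable first step after downloading one is to list the arrays it contains. The following is a minimal sketch using NumPy; the helper name `summarize_npz` is hypothetical, and no particular array names are assumed.

```python
import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype)} for every array in a .npz archive.

    Hypothetical helper: it only inspects the file and makes no assumption
    about which arrays the label files actually contain.
    """
    # allow_pickle=False is NumPy's default; it is spelled out here because
    # object arrays in an untrusted file would otherwise require unpickling.
    with np.load(path, allow_pickle=False) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }
```

Calling `summarize_npz` on a downloaded label file prints nothing by itself; it returns a dictionary that can be inspected to find the label matrix and its dimensions.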


Datasets

Dataset.*_train_test.h5

Human   Mouse   Zebrafish   Drosophila   Ciona   Earthworm   C.elegans   Planarian

High-quality single cell data collection

To obtain high-quality cells, we set a higher cutoff to generate a dataset with an average of approximately 1,000 genes per cell for zebrafish and Drosophila, and approximately 370 genes per cell for earthworm. We used a similar cutoff (1,000 genes per cell) to select data from the Human Cell Landscape (HCL), the Mouse Cell Atlas (MCA), and published atlases of Ciona, C. elegans, and planarians. In total, we obtained 480 cell groups from eight species, covering the major cell lineages: epithelial, immune, neuronal, stromal, muscle, secretory, erythroid, germ, endothelial, and proliferating.

MAGIC preprocessing

We then used the Markov affinity-based graph imputation of cells (MAGIC) algorithm to denoise the cell count matrix, fill in missing expression values, and increase the number of expressed genes detected per cell.

Genome

The genome sequences of human (GRCh38 genome), mouse (GRCm38 genome), and zebrafish (GRCz11 genome) were downloaded from Ensembl. The genome sequence of Ciona (Joined-scaffold (KH) genome) was downloaded from the Ghost database. The genome sequence of Drosophila (BDGP6.28 genome) was downloaded from the FlyBase database. The genome sequence of earthworm (GWHACBE00000000) was downloaded from the Genome Warehouse (http://bigd.big.ac.cn/gwh/). The genome sequence of C. elegans (GCA_000002985.3 genome) was downloaded from the WormBase database. The genome sequence of planarian (S2F2 genome) was downloaded from the PlanMine database. Putative promoters were defined as the 10-kilobase (kb) sequence centered on the transcription start site (TSS). The promoter sequences were extracted according to gene coordinates using pysam, a Python interface to samtools.
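The 10 kb promoter window can be illustrated with a short sketch. It works on an in-memory sequence string rather than a FASTA file, so it runs without pysam. With pysam, the slice would typically be replaced by `pysam.FastaFile(path).fetch(chrom, start, end)`. The function names, the 0-based coordinate convention, the clipping at chromosome ends, and the reverse-complementing of minus-strand genes are all assumptions made for illustration, not details confirmed by the text above.

```python
# Translation table for complementing DNA; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def promoter_window(tss, chrom_length, flank=5000):
    """Return a 0-based, half-open (start, end) interval spanning
    `flank` bp on each side of the TSS, clipped to the chromosome."""
    start = max(0, tss - flank)
    end = min(chrom_length, tss + flank)
    return start, end


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def promoter_sequence(chrom_seq, tss, strand, flank=5000):
    """Extract the promoter around a 0-based TSS position.

    Assumption: minus-strand promoters are reverse-complemented so that
    every promoter reads in the direction of transcription.
    """
    start, end = promoter_window(tss, len(chrom_seq), flank)
    seq = chrom_seq[start:end].upper()
    return reverse_complement(seq) if strand == "-" else seq
```

With the default `flank=5000`, a TSS far from either chromosome end yields a 10,000 bp sequence. A TSS near an end yields a shorter one, which would need padding before it can be stacked into a fixed-size input.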

Data processing for machine learning

Because single-cell data follow a zero-inflated negative binomial (ZINB) distribution, the gene expression patterns of the high-quality MAGIC datasets generated above were binarized into labels. Considering the numbers of protein-coding genes and expressed genes across the eight species, different label cutoffs were used to separate expressed (label 1) from unexpressed (label 0) genes, with ~8,000 and ~5,000 expressed genes for vertebrates and invertebrates, respectively. Each label (gene expression vector) was paired with its input (one-hot encoded promoter sequence) to generate the full dataset for deep learning.
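The pairing of inputs and labels can be sketched as follows. The channel order (A, C, G, T), the all-zero encoding of ambiguous bases such as N, and the use of a single fixed `cutoff` are assumptions. The text above does not spell out how the species-specific cutoffs were derived, so the threshold here is only a placeholder.

```python
import numpy as np

BASES = "ACGT"  # assumed channel order


def one_hot_encode(seq):
    """Encode a DNA string as a (4, len(seq)) float32 array.

    Rows follow BASES; any other character (e.g. N) becomes an
    all-zero column.
    """
    encoded = np.zeros((len(BASES), len(seq)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        row = BASES.find(base)
        if row >= 0:
            encoded[row, pos] = 1.0
    return encoded


def binarize_expression(expr, cutoff):
    """Turn an expression matrix (genes x cell groups) into 0/1 labels.

    A value strictly above `cutoff` is labelled 1 (expressed),
    otherwise 0 (unexpressed). `cutoff` is a hypothetical parameter.
    """
    return (np.asarray(expr) > cutoff).astype(np.int8)
```

For one gene, the model input would then be `one_hot_encode(promoter)` and the target would be the corresponding row of the binarized matrix, with one entry per cell group.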

Machine learning dataset construction

For each species, the dataset was divided into three sets for deep learning: a training set, a validation set, and a testing set. For human and mouse, we held out all genes on chromosome 8 (chr8) for testing and used genes on the other chromosomes for training and validation (1,000 randomly selected genes were held out for validation). This approach (Zhou, 2015) strictly excludes sequence overlap between the training and testing sets. For the other six species, because gene numbers are imbalanced across chromosomes or scaffolds, we randomly held out 1,000 genes for testing and 1,000 genes for validation, and used the remaining genes for training.
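The two splitting schemes can be sketched in one function. The name `split_genes`, the `gene_to_chrom` mapping, the chromosome label `"chr8"` (Ensembl annotations often use `"8"` instead), and the fixed random seed are illustrative assumptions rather than the authors' implementation.

```python
import random


def split_genes(gene_to_chrom, test_chrom=None, n_val=1000, n_test=1000, seed=0):
    """Split gene IDs into (train, val, test) lists.

    If `test_chrom` is given, every gene on that chromosome goes to the
    test set (the human/mouse scheme). Otherwise `n_test` genes are drawn
    at random (the scheme for the other six species). In both cases
    `n_val` of the remaining genes are drawn at random for validation.
    """
    rng = random.Random(seed)
    genes = sorted(gene_to_chrom)  # sort first so the shuffle is reproducible
    if test_chrom is not None:
        test = [g for g in genes if gene_to_chrom[g] == test_chrom]
        rest = [g for g in genes if gene_to_chrom[g] != test_chrom]
        rng.shuffle(rest)
    else:
        rest = list(genes)
        rng.shuffle(rest)
        test, rest = rest[:n_test], rest[n_test:]
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test
```

For human or mouse this would be called as `split_genes(gene_to_chrom, test_chrom="chr8")`. For the other species, omitting `test_chrom` gives the fully random split.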