Vector Representation of Gene Co-expression in Single Cell RNA-Seq

EasyChair Preprint 4571
11 pages • Date: November 15, 2020

Abstract

The sparsity of gene expression is a well-known problem in single cell RNA-seq data. Because of a phenomenon known as dropout, the gene expression observed for each cell is only a fraction of the total transcriptome. Several techniques have been adopted to address this challenge, including variable gene selection and expression imputation. We present an approach for finding dense vector representations of genes from co-expression that can be used in place of the sparse expression profile over cells. By leveraging co-expression across all cells, each gene vector is a meaningful representation that is independent of missing data from individual cells. Similar genes, measured by cosine similarity between vectors, are found to correspond to known cell type markers. Using latent space arithmetic, these gene vectors have the additive capacity to accurately describe each cell and can be used to generate a low-dimensional cell embedding. It is also possible to decompose and subtract sources of variation, including batch effects. Any feature that can be described as a set of genes can be represented as a composite of vectors. We demonstrate the application of these vectors in identification of cell type markers, dimensionality reduction, and batch correction.

Keyphrases: batch effect correction, dimensionality reduction, gene expression, machine learning, single-cell RNA-seq, vector representations
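
The vector operations named in the abstract (cosine similarity between gene vectors, additive composition into a cell embedding, and subtraction of a source of variation) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the gene names (`geneA` to `geneD`), the 3-dimensional vectors, the count-weighted average used to compose a cell vector, and the projection-based subtraction are all hypothetical and are not taken from the preprint, which may train and combine its vectors differently.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def composite_vector(gene_vectors, genes, weights=None):
    """Weighted average of the vectors for a set of genes.

    Assumption: a feature described by a gene set is represented as the
    (optionally weighted) mean of its gene vectors.
    """
    mat = np.stack([gene_vectors[g] for g in genes])
    if weights is None:
        weights = np.ones(len(genes))
    weights = np.asarray(weights, dtype=float)
    return weights @ mat / weights.sum()

def remove_component(vec, direction):
    """Subtract the projection of vec onto direction."""
    unit = direction / np.linalg.norm(direction)
    return vec - np.dot(vec, unit) * unit

# Hypothetical toy gene vectors; real vectors would be learned from co-expression.
gene_vectors = {
    "geneA": np.array([1.0, 0.1, 0.0]),
    "geneB": np.array([0.9, 0.2, 0.0]),
    "geneC": np.array([0.0, 1.0, 0.1]),
    "geneD": np.array([0.0, 0.0, 1.0]),  # stands in for a batch-associated gene
}

# Similar genes have a high cosine similarity.
sim_ab = cosine_similarity(gene_vectors["geneA"], gene_vectors["geneB"])
sim_ac = cosine_similarity(gene_vectors["geneA"], gene_vectors["geneC"])

# A cell is described by the genes it expresses, weighted by observed counts.
cell_counts = {"geneA": 3, "geneB": 1, "geneD": 2}
cell_vector = composite_vector(
    gene_vectors, list(cell_counts), weights=list(cell_counts.values())
)

# A source of variation described as a gene set can be subtracted from the cell.
batch_direction = composite_vector(gene_vectors, ["geneD"])
corrected = remove_component(cell_vector, batch_direction)
```

In this toy setup `geneA` and `geneB` point in nearly the same direction, so their similarity is much higher than that of `geneA` and `geneC`, and the corrected cell vector has no remaining component along the hypothetical batch direction. Because the cell vector depends only on which genes are observed, a missing gene shifts the average but does not leave a zero-filled gap as it would in a sparse expression profile.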