Title :
Non-negative Tensor Factorization with missing data for the modeling of gene expressions in the Human Brain
Author :
Nielsen, Soren Fons Vind ; Morup, Morten
Author_Institution :
Dept. of Appl. Math. & Comput. Sci., Tech. Univ. of Denmark, Lyngby, Denmark
Abstract :
Non-negative Tensor Factorization (NTF) has become a prominent tool for analyzing high-dimensional, multi-way structured data. In this paper we analyze gene expression across brain regions in multiple subjects, based on data from the Allen Human Brain Atlas [1], where more than 40% of the data are missing. Our analysis is based on the non-negativity constrained Canonical Polyadic (CP) decomposition, in which we handle the missing data by marginalization, and we consider three prominent alternating least squares procedures: multiplicative updates, column-wise updating, and row-wise updating of the component matrices. We examine three gene expression prediction scenarios: data missing at random, whole genes missing, and whole areas missing within a subject. We find that the column-wise updating approach, also known as HALS, is the most efficient when fitting the model. We further observe that the non-negativity constrained CP model predicts gene expression better than the subject average when data are missing at random. When whole genes or whole areas are missing, it is in general better to predict by subject averages. However, when whole genes are missing from all subjects, the model-based predictions are useful. When analyzing the structure of the components derived for one of the best-predicting model orders, we find that the components in general correspond to localized regions of the brain. Non-negative tensor factorization based on marginalization thus forms a promising framework for imputing missing values and characterizing gene expression in the human brain. However, care must be taken, in particular when predicting gene expression levels for a whole brain region that is missing, as our analysis indicates that reliable model predictions require a substantial number of subjects with data for that region.
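To make the idea of handling missing data by marginalization concrete, the following is a minimal sketch, not the authors' implementation. It fits a non-negative CP model to a three-way array using multiplicative updates (one of the three procedures named in the abstract; HALS is not shown), and only observed entries contribute to the updates. The function name `masked_ntf_mu`, the synthetic data, the array sizes, the rank, and the 40% missing rate used in the demo are all assumptions chosen for illustration.

```python
import numpy as np

def masked_ntf_mu(X, M, rank, n_iter=200, eps=1e-12, seed=0):
    """Non-negative CP of a 3-way array X, fitted on observed entries only.

    M is a 0/1 mask with the same shape as X (1 = observed). Hypothetical
    helper for illustration; uses weighted multiplicative updates.
    """
    rng = np.random.default_rng(seed)
    # Strictly positive random initialization of the three component matrices.
    A, B, C = (rng.random((n, rank)) + 0.1 for n in X.shape)
    MX = M * X  # missing entries are zeroed and never enter the fit
    losses = []
    for _ in range(n_iter):
        Xhat = np.einsum('ir,jr,kr->ijk', A, B, C)
        losses.append(float(np.sum(M * (X - Xhat) ** 2)))
        # Each factor is scaled by (masked data term) / (masked model term).
        A *= np.einsum('ijk,jr,kr->ir', MX, B, C) / (
            np.einsum('ijk,jr,kr->ir', M * Xhat, B, C) + eps)
        Xhat = np.einsum('ir,jr,kr->ijk', A, B, C)
        B *= np.einsum('ijk,ir,kr->jr', MX, A, C) / (
            np.einsum('ijk,ir,kr->jr', M * Xhat, A, C) + eps)
        Xhat = np.einsum('ir,jr,kr->ijk', A, B, C)
        C *= np.einsum('ijk,ir,jr->kr', MX, A, B) / (
            np.einsum('ijk,ir,jr->kr', M * Xhat, A, B) + eps)
    Xhat = np.einsum('ir,jr,kr->ijk', A, B, C)
    losses.append(float(np.sum(M * (X - Xhat) ** 2)))
    return A, B, C, losses

# Synthetic demo (assumed sizes): a rank-2 non-negative tensor, ~40% missing at random.
rng = np.random.default_rng(1)
true_factors = [rng.random((n, 2)) for n in (12, 10, 8)]
X = np.einsum('ir,jr,kr->ijk', *true_factors)
M = (rng.random(X.shape) > 0.4).astype(float)

A, B, C, losses = masked_ntf_mu(X, M, rank=2)
X_imputed = np.einsum('ir,jr,kr->ijk', A, B, C)  # model predictions, including missing cells
print("masked squared error, first vs. last:", losses[0], losses[-1])
```

The reconstructed array `X_imputed` supplies predictions for the unobserved cells, which is how such a model could be compared against a simple baseline such as a per-subject average.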
Keywords :
biology computing; data analysis; genetics; least mean squares methods; matrix algebra; Allen human brain atlas; CP; HALS; alternating least squares procedures; column-wise component matrix updating; gene expressions modeling; high dimensional multiway structured data analysis; human brain; missing data; multiplicative updates; nonnegative tensor factorization; nonnegativity constrained canonical polyadic decomposition; row-wise component matrix updating; Abstracts; Genetics; Loading; Noise; Tensile stress; Training; Vectors; CP; CandeComp/PARAFAC; Marginalization; Missing Values; Non-negative Matrix Factorization; Non-negative Tensor Factorization;
Conference_Title :
Machine Learning for Signal Processing (MLSP), 2014 IEEE International Workshop on
Conference_Location :
Reims
DOI :
10.1109/MLSP.2014.6958919