Abstract:
In this era of rapid technological innovation, massive amounts of data are being generated at almost every level of application and in almost every discipline. Extracting interesting knowledge from raw data, or data mining in a broad sense, has become an indispensable task. Nevertheless, data collected from complex phenomena often represent the integrated result of several interrelated variables, and these variables themselves are frequently only imprecisely defined. A basic principle of data mining is to determine which variables are related to which, and how they are related. In many situations, the digitized information is gathered and stored as a data matrix. It is often the case, or at least assumed, that the exogenous variables depend linearly on the endogenous variables. Retrieving "useful" information can therefore often be characterized as finding a "suitable" matrix factorization. From this perspective, this paper offers a synopsis of how linear algebra techniques can help carry out the task of data mining. Examples from factor analysis, cluster analysis, latent semantic indexing, and link analysis are used to demonstrate how matrix factorization helps to uncover hidden connections and to speed up computation. Low-rank matrix approximation plays a fundamental role in cleaning and compressing the data. Other types of constraints, such as nonnegativity, will also be briefly discussed.
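As a minimal illustration of the low-rank approximation idea mentioned above (not part of the paper itself, and using an arbitrary, hypothetical data matrix), the following Python/NumPy sketch truncates the singular value decomposition of a small matrix A, keeping only the k dominant singular values; by the Eckart-Young theorem the result is the best rank-k approximation of A, which is the basic mechanism behind the data cleaning and compression discussed in the paper.

    import numpy as np

    # Hypothetical 4 x 4 data matrix (illustrative values only).
    A = np.array([[3., 0., 1., 0.],
                  [0., 2., 0., 1.],
                  [1., 0., 2., 0.],
                  [0., 1., 0., 3.]])

    # Singular value decomposition A = U diag(s) V^T.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Keep the k largest singular values: this gives the best rank-k
    # approximation of A in the 2-norm and Frobenius norm (Eckart-Young).
    k = 2
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    print("rank-2 approximation error:", np.linalg.norm(A - A_k))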
Keywords:
approximation theory; cluster analysis; data analysis; data cleaning; data compression; data mining; factor analysis; information retrieval; knowledge extraction; linear algebra; linear model; link analysis; low rank matrix approximation; matrix factorization