The three authors have each provided important research on both the theoretical and the application sides of the field, and they represent both the experience of the old-timers and the visions of the coming generations. I can say that without insulting anyone (I hope), as I am in the same age group as the more distinguished part of the authors.
I have the deepest respect and the highest admiration for the three authors. I have learned so many things from their individual contributions over the years. Reading this joint work is not a disappointment. Please do enjoy!
Rasmus Bro
Køge, Denmark, July 28, 2021
Preface
Combining information from two or possibly several blocks of data is gaining increased attention and importance in several areas of science and industry. Typical examples can be found in chemistry, spectroscopy, metabolomics, genomics, systems biology, and sensory science. Many methods and procedures have been proposed and used in practice. The area goes under different names: data integration, data fusion, multiblock analyses, multiset analyses, and others.
This book is an attempt to provide an up-to-date treatment of the most widely used and important methods within a central branch of the area, namely methods based on so-called components or latent variables. These methods have already attracted enormous attention in, for instance, chemometrics, bioinformatics, machine learning, and sensometrics, and have proved to be important both for prediction and for interpretation.
The book is primarily a description of methodologies, but most of the methods are illustrated by examples from the above-mentioned areas. The book is written such that both users of the methods and method developers will hopefully find sections of interest. At the end of the book there is a description of a software package developed particularly for the book. This package is freely available in R and covers many of the methods discussed.
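For readers who want to locate the software right away, a minimal sketch of installing and loading it in R is shown below; it assumes the companion package is published on CRAN under the name multiblock, and Part V gives the authoritative instructions.

install.packages("multiblock")  # fetch the companion package from CRAN (package name assumed here)
library(multiblock)             # load it; see Part V and the package documentation for details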
To distinguish the different types of methods from each other, the book is divided into five parts. Part I is an introduction and description of preliminary concepts. Part II is the core of the book containing the main unsupervised and supervised methods. Part III deals with more complex structures and, finally, Part IV presents alternative unsupervised and supervised methods. The book ends with Part V discussing the available software.
Our recommendations for reading the book are as follows. A minimum read of the book would involve Chapters 1, 2, 3, 5, and 7. Chapters 4, 6, and 8 are more specialized, and Chapters 9 and 10 contain methods we think are more advanced or less obvious to use. We feel privileged to have so many friendly colleagues who were willing to spend their time helping us improve the book by reading separate chapters. We would like to express our thanks to: Rasmus Bro, Margriet Hendriks, Ulf Indahl, Henk Kiers, Ingrid Måge, Federico Marini, Åsmund Rinnan, Rosaria Romano, Lars Erik Solberg, Marieke Timmerman, Oliver Tomic, Johan Westerhuis, and Barry Wise. Of course, the correctness of the final text is fully our responsibility!
Age Smilde, Utrecht, The Netherlands
Tormod Næs, Ås, Norway
Kristian Hovde Liland, Ås, Norway
March 2022
List of Figures
Figure 1.1 High-level, mid-level, and low-level fusion for two input blocks. The Z's represent the combined information from the two blocks which is used for making the predictions. The upper figure represents high-level fusion, where the results from two separate analyses are combined. The figure in the middle is an illustration of mid-level fusion, where components from the two data blocks are combined before further analysis. The lower figure illustrates low-level fusion, where the data blocks are simply combined into one data block before further analysis takes place.
Figure 1.2 Idea of dimension reduction and components. The scores T summarise the relationships between samples; the loadings P summarise the relationships between variables. Sometimes weights W are used to define the scores.
Figure 1.3 Design of the plant experiment. Numbers in the top row refer to light levels (in μE m⁻² sec⁻¹); numbers in the first column are degrees centigrade. Legend: D = dark, LL = low light, L = light and HL = high light.
Figure 1.4 Scores on the first two principal components of a PCA on the plant data (a) and scores on the first ASCA interaction component (b). Legend: D = dark, LL = low light, L = light and HL = high light.
Figure 1.5 Idea of copy number variation (a), methylation (b), and mutation (c) of the DNA. For (a) and (c): Source: Adapted from Koch et al., 2012.
Figure 1.6 Plot of the Raman spectra used in predicting the fat content. The dashed lines show the split of the data set into multiple blocks.
Figure 1.7 L-shape data of consumer liking studies.
Figure 1.8 Phylogeny of some multiblock methods and relations to basic data analysis methods used in this book.
Figure 1.9 The idea of common and distinct components. Legend: blue is common variation; dark yellow and dark red are distinct variation; shaded areas are noise (unsystematic variation).
Figure 2.1 Idea of dimension reduction and components. Sometimes W is used to define the scores T which in turn define the loadings P.
Figure 2.2 Geometry of PCA. For explanation, see text (with permission of H.J. Ramaker, TIPb, The Netherlands).
Figure 2.3 Score (a) and loading (b) plots of a PCA on Cabernet Sauvignon wines. Source: Bro and Smilde (2014). Reproduced with permission of the Royal Society of Chemistry.
Figure 2.4 PLS validated explained variance when applied to Raman with PUFA responses. Left: PLSR on one response at a time. Right: PLS on both responses (standardised).
Figure 2.5 Score and loading plots for the single response PLS regression model predicting PUFA as percentage of total fat in the sample (PUFAsample).
Figure 2.6 Raw and normalised urine NMR spectra. Different colours are spectra of different subjects.
Figure 2.7 Numerical representations of the lengths of sticks: (a) left: the empirical relational system (ERS) of which only the length is studied, right: a numerical representation (NRS1); (b) an alternative numerical representation (NRS2) of the same ERS carrying essentially the same information.
Figure 2.8 Classical (a) and logistic PCA (b) on the same mutation data of different cancers. Source: Song et al. (2017). Reproduced with permission from Oxford Academic Press.
Figure 2.9 Classical (a) and logistic PCA (b) on the same methylation data of different cancers. Source: Song et al. (2017). Reproduced with permission from Oxford Academic Press.
Figure 2.10 SCA for two data blocks: one containing binary data and one with ratio-scaled data.
Figure 2.11 The block scores of the rows of the two blocks. Legend: green squares are block scores of the first block; blue circles are block scores of the second block; and the red stars are their averages (indicated with ta). Panel (a) favouring block X1, (b) the MAXBET solution, (c) the MAXNEAR solution.
Figure 2.12 Two column-spaces each of