Generalized probabilistic canonical correlation analysis for multi-modal data integration with full or partial observations
Journal: arXiv
Published Date: Apr 15, 2025
Abstract
Background: The integration and analysis of multi-modal data are increasingly
essential across many domains, including bioinformatics. As the volume and
complexity of such data grow, there is a pressing need for computational models
that not only integrate diverse modalities but also leverage their
complementary information to improve clustering accuracy and downstream insight,
especially when observations are partial and contain missing data. Results:
We propose Generalized Probabilistic Canonical Correlation Analysis (GPCCA), an
unsupervised method for the integration and joint dimensionality reduction of
multi-modal data. GPCCA addresses key challenges in multi-modal data analysis
by handling missing values within the model, enabling the integration of more
than two modalities, and identifying informative features while accounting for
correlations within individual modalities. The model demonstrates robustness to
various missing data patterns and provides low-dimensional embeddings that
facilitate downstream clustering and analysis. In a range of simulation
settings, GPCCA outperforms existing methods in capturing essential patterns
across modalities. Additionally, we demonstrate its applicability to
multi-omics data from TCGA cancer datasets and a multi-view image dataset.
Conclusion: GPCCA offers a framework for multi-modal data integration that
handles missing data within the model and provides informative low-dimensional
embeddings. Its performance on cancer genomics and multi-view image data
highlights its robustness and potential for broad application. To make the
method accessible to the wider research community, we have released an R
package, GPCCA, which is available at https://github.com/Kaversoniano/GPCCA.
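
To illustrate how a probabilistic latent-variable model can handle missing values within the model, here is a minimal Python sketch. It is not the GPCCA R package or its API, and it is not claimed to be the authors' estimation procedure. It assumes a generic linear-Gaussian model, x = W z + mu + eps with z ~ N(0, I) and eps ~ N(0, Psi), where the modalities are stacked into one vector and Psi is block-diagonal so that features may be correlated within a modality. The function name `posterior_latent_mean`, the dimensions, and all parameter values are hypothetical. Missing entries (NaN) are marginalized out by conditioning only on the observed coordinates.

```python
import numpy as np


def posterior_latent_mean(x, W, mu, Psi):
    """Posterior mean of the latent vector z given the observed entries of x.

    Assumed model (illustrative only): x = W z + mu + eps,
    z ~ N(0, I), eps ~ N(0, Psi). NaN entries of x are treated as missing.
    """
    obs = ~np.isnan(x)                    # mask of observed features
    W_o = W[obs]                          # loadings for observed features
    Psi_oo = Psi[np.ix_(obs, obs)]        # noise covariance of observed features
    r = x[obs] - mu[obs]                  # centered observed values
    A = np.linalg.solve(Psi_oo, W_o)      # Psi_oo^{-1} W_o
    precision = np.eye(W.shape[1]) + W_o.T @ A   # posterior precision of z
    return np.linalg.solve(precision, A.T @ r)   # posterior mean of z


# Hypothetical example: three modalities with 4, 3, and 5 features, 2 latent dims.
rng = np.random.default_rng(0)
dims = [4, 3, 5]
d = 2
p_total = sum(dims)
W = rng.normal(size=(p_total, d))
mu = np.zeros(p_total)

# Block-diagonal noise covariance: full covariance within each modality,
# no residual covariance across modalities.
Psi = np.zeros((p_total, p_total))
start = 0
for p in dims:
    B = rng.normal(size=(p, p))
    Psi[start:start + p, start:start + p] = B @ B.T + np.eye(p)
    start += p

# Simulate one sample, then hide the entire second modality.
z_true = rng.normal(size=d)
x = W @ z_true + mu + rng.multivariate_normal(np.zeros(p_total), Psi)
x_partial = x.copy()
x_partial[4:7] = np.nan

print("embedding from full observation:   ", posterior_latent_mean(x, W, mu, Psi))
print("embedding from partial observation:", posterior_latent_mean(x_partial, W, mu, Psi))
```

In this sketch the parameters are taken as given. In practice they would have to be estimated, for example with an EM-type algorithm, which is outside the scope of this illustration.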