Deep Multimodal Data Fusion
Abstract
Multimodal Artificial Intelligence (Multimodal AI), in general, involves various types of data (e.g., images, texts, or data collected from different sensors), feature engineering (e.g., extraction, combination/fusion), and decision-making (e.g., majority vote). As architectures become increasingly sophisticated, multimodal neural networks can integrate feature extraction, feature fusion, and decision-making into a single model, and the boundaries between those processes are increasingly blurred. The conventional multimodal data fusion taxonomy (e.g., early/late fusion), which classifies models by the stage at which fusion occurs, is no longer suitable for the modern deep learning era. Therefore, based on the mainstream techniques used, we propose a new fine-grained taxonomy grouping the state-of-the-art (SOTA) models into five classes: Encoder-Decoder methods, Attention Mechanism methods, Graph Neural Network methods, Generative Neural Network methods, and other Constraint-based methods. Most existing surveys on multimodal data fusion focus only on one specific task with a combination of two specific modalities. Unlike those, this survey covers a broader combination of modalities, including Vision + Language (e.g., videos, texts), Vision + Sensors (e.g., images, LiDAR), and so on, and their corresponding tasks (e.g., video captioning, object detection). Moreover, a comparison among these methods is provided, as well as challenges and future directions in this area.
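The abstract contrasts the conventional taxonomy, which classifies models by the stage at which fusion occurs, with the modern integrated pipelines the survey covers. A minimal sketch of the two conventional strategies (NumPy only; all function names and the averaging combiner are illustrative assumptions, not taken from the survey):

```python
import numpy as np

def early_fusion(img_feat, txt_feat, w):
    # Early fusion: concatenate modality features first,
    # then make a single decision on the fused representation.
    fused = np.concatenate([img_feat, txt_feat])
    return fused @ w

def late_fusion(img_feat, txt_feat, w_img, w_txt):
    # Late fusion: score each modality separately, then combine
    # the decisions (a simple average here, standing in for
    # e.g. a majority vote over classifiers).
    return 0.5 * (img_feat @ w_img + txt_feat @ w_txt)

rng = np.random.default_rng(0)
img, txt = rng.normal(size=4), rng.normal(size=3)
w = rng.normal(size=7)

score_early = early_fusion(img, txt, w)
score_late = late_fusion(img, txt, w[:4], w[4:])
```

In the deep models the survey classifies (encoder-decoder, attention, GNN, generative, constraint-based), these stages are learned jointly rather than fixed as above, which is why a stage-based taxonomy breaks down.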
BibTeX
@article{Zhao2024,
  author    = {Zhao, Fei and Zhang, Chengcui and Geng, Baocheng},
  title     = {Deep Multimodal Data Fusion},
  journal   = {ACM Computing Surveys},
  year      = {2024},
  volume    = {56},
  number    = {9},
  pages     = {1--36},
  publisher = {ACM, New York, NY},
}