Deep Multimodal Data Fusion

Fei Zhao, Chengcui Zhang, Baocheng Geng

2024, ACM Computing Surveys

Abstract

Multimodal Artificial Intelligence (Multimodal AI), in general, involves various types of data (e.g., images, texts, or data collected from different sensors), feature engineering (e.g., extraction, combination/fusion), and decision-making (e.g., majority vote). As architectures become increasingly sophisticated, multimodal neural networks can integrate feature extraction, feature fusion, and decision-making into a single model, and the boundaries between these processes are increasingly blurred. The conventional multimodal data fusion taxonomy (e.g., early/late fusion), defined by the stage at which fusion occurs, is therefore no longer suitable for the modern deep learning era. Instead, based on the mainstream techniques used, we propose a new fine-grained taxonomy that groups state-of-the-art (SOTA) models into five classes: Encoder-Decoder methods, Attention Mechanism methods, Graph Neural Network methods, Generative Neural Network methods, and other Constraint-based methods. Most existing surveys on multimodal data fusion focus on a single task with one specific combination of two modalities. Unlike those, this survey covers a broader range of modality combinations, including Vision + Language (e.g., videos, texts), Vision + Sensors (e.g., images, LiDAR), and so on, together with their corresponding tasks (e.g., video captioning, object detection). Moreover, a comparison among these methods is provided, as well as challenges and future directions in this area.
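To make the taxonomy concrete, the sketch below illustrates one representative member of the Attention Mechanism class: a cross-attention block that lets text tokens attend over image patches before a decision head. This is a minimal illustration in PyTorch under assumed module names and dimensions; it is not the implementation of any specific model covered in the survey.

# Minimal sketch of attention-based multimodal fusion (illustrative assumptions only).
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Fuse a text-token sequence and an image-patch sequence via cross-attention."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Queries come from text; keys/values come from image patches.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, 2)  # e.g., a binary decision head

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, n_text_tokens, dim); image_feats: (batch, n_patches, dim)
        attended, _ = self.cross_attn(query=text_feats, key=image_feats, value=image_feats)
        fused = self.norm(text_feats + attended)  # residual connection + layer norm
        pooled = fused.mean(dim=1)                # simple pooling before the decision head
        return self.classifier(pooled)

# Usage: fuse 16 text tokens with 49 image patches for a batch of 8 samples -> (8, 2) logits.
model = CrossModalAttentionFusion()
logits = model(torch.randn(8, 16, 256), torch.randn(8, 49, 256))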

Cited in this thesis

BibTeX
@article{Zhao2024,
  author = {Zhao, Fei and Zhang, Chengcui and Geng, Baocheng},
  journal = {ACM Computing Surveys},
  title = {Deep multimodal data fusion},
  year = {2024},
  number = {9},
  pages = {1--36},
  volume = {56},
  publisher = {ACM New York, NY},
}