A Survey of Multimodal Learning Methods

A survey of multimodal machine learning

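  • Summary: Big data is multi-source and heterogeneous. With the rapid development of information technology, multimodal data has recently become the main form of data resources. Studying multimodal learning methods, which give computers the ability to understand massive multi-source heterogeneous data, is of great value. This paper summarizes the definition of multimodality and the basic tasks of multimodal learning, and introduces the cognitive mechanism and development history of multimodal learning. On this basis, it reviews in detail multimodal statistical learning methods and deep learning methods. It also systematically summarizes cross-modal matching and generation techniques based on adversarial learning that have emerged in the past two years. Finally, it outlines the main forms of multimodal learning and offers reflections and an outlook on possible future research directions.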

     

    Abstract: “Big data” is typically collected from different sources that have different data structures. With the rapid development of information technologies, today’s most valuable data resources are characteristically multimodal. As a result, multimodal learning, which builds on classical machine learning strategies, has become a valuable research topic that enables computers to process and understand “big data”. Human cognition involves perception through different sense organs. Signals from the eyes, ears, nose, and hands (tactile sense) together constitute a person’s understanding of a specific scene or of the world as a whole. It is reasonable to believe that multimodal methods, with their greater ability to process complex heterogeneous data, can further promote the progress of information technologies. The concept of multimodality originated in psychology and pedagogy hundreds of years ago and has become popular in computer science during the past decade. In contrast to the concept of “media”, a “mode” is a more fine-grained concept that is associated with a particular data source or data form. The effective use of multimodal data can help a computer understand a specific environment in a more holistic way. In this paper, we first introduce the definition and main tasks of multimodal learning. On this basis, we briefly introduce the cognitive mechanism and origin of multimodal machine learning. We then comprehensively summarize statistical learning methods and deep learning methods for multimodal tasks. We also introduce the main styles of data fusion in multimodal perception tasks, including feature representation, shared mapping, and co-training. Additionally, we review novel adversarial learning strategies for cross-modal matching and generation. Finally, we outline the main methods for multimodal learning and discuss possible future research directions in this field.

     
