HGMF: Heterogeneous Graph-based Fusion for Multimodal Data with Incompleteness

Paper link

Background introduction

With advances in data collection technology, multimodal data is growing rapidly. Multimodal data fusion has effectively improved performance in many application scenarios, such as object detection, sentiment analysis, emotion recognition, and medical diagnosis.

Earlier research combined multiple modalities to learn joint representations or make predictions, including traditional methods such as early fusion and late fusion, deep-learning methods (e.g., graph-based late fusion), and deep fusion methods that focus on exploring multimodal interactions.

In the real world, modalities are often missing from multimodal data for various reasons, such as sensor failure, data corruption, and human recording errors. Effectively fusing and analyzing incomplete multimodal data is a challenging problem.

Three main problems must be solved when modalities are missing from multimodal data:

  1. Multimodal data combinations with different missing modalities have inconsistent feature-set dimensionalities and sizes, which makes it difficult to apply fusion models designed for complete multimodal data.
  2. Effective multimodal fusion requires learning complementary information, modality-specific information, and multimodal interactions, but because of the missing modalities, this information cannot be obtained directly from an incomplete individual sample.
  3. A large amount of missing data may greatly reduce the usable data size, making it difficult to learn high-dimensional interaction features from a small number of samples.

Previous studies usually handle missing modalities either by deleting incomplete samples or by imputing the missing modalities. Directly deleting incomplete samples significantly reduces the sample size, which may cause subsequent deep learning models to overfit, especially when many samples are missing different modalities. Imputation methods try to generate the missing modalities from the observed ones, e.g., zero-value filling, mean-value filling, matrix completion, and deep-learning-based methods, but these may instead introduce new noise into the data and hurt model performance; such noise can also destabilize complex auxiliary models such as deep generative models.
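The trade-off between the two conventional strategies can be seen on a toy example. The sketch below (my own illustration, not from the paper) uses `NaN` to mark missing modality features and shows how deletion shrinks the sample size while zero/mean filling keeps all rows at the cost of fabricated values:

```python
import numpy as np

# Toy feature matrix: 5 samples x 3 features, NaN marks a missing modality value.
X = np.array([
    [1.0, 2.0, np.nan],
    [2.0, np.nan, 3.0],
    [np.nan, 1.0, 1.0],
    [4.0, 2.0, 2.0],
    [3.0, 3.0, np.nan],
])

# Strategy 1: delete incomplete samples -- only 1 of 5 rows survives here.
complete = X[~np.isnan(X).any(axis=1)]

# Strategy 2: zero-value filling.
zero_filled = np.nan_to_num(X, nan=0.0)

# Strategy 3: mean-value filling (column-wise mean over observed entries).
col_mean = np.nanmean(X, axis=0)
mean_filled = np.where(np.isnan(X), col_mean, X)
```

With 80% of the rows discarded by strategy 1, any downstream deep model would be starved of data, which is exactly the overfitting risk the paper points out.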

Main research content

This paper studies incomplete multimodal data fusion based on heterogeneous graphs, and aims to process incomplete data without imputation. Several lines of work have attempted this, for example:

  • Multi-source feature learning: divide the incomplete data into multiple sub-combinations, then integrate these sub-combinations, turning the task into a sparse multi-task learning problem.
  • Multi-hypergraph learning: combine high-order subgroup relations and learn directly on the output.

Although these methods provide solutions, they ignore inter-modal/intra-modal interactions and cannot learn relationships between incomplete samples. The authors develop a new framework that extracts this complex information and fuses multimodal data with missing modalities without deleting or imputing data.

The method is called Heterogeneous Graph-based Multimodal Fusion (HGMF). It first models the incomplete multimodal data in a heterogeneous graph structure, then uses a transductive learning framework based on graph neural networks to extract complementary information from the highly interactive, incomplete modalities and merge the information from different subspaces into a unified space. The main contributions are:

  • It proposes modeling highly interactive multimodal data with different incomplete forms in a heterogeneous hypernode graph (HHG).
  • It proposes a transductive learning framework based on graph neural networks to perform multimodal fusion of incomplete data on the constructed HHG.
  • Experiments under multiple levels of missing data show that the method can handle realistic scenarios with a high percentage of missing data.

Problem formalization

Incomplete forms of multimodal data

For an $M$-modal incomplete dataset, there are $2^M-1$ possible combinations of missing modalities, so an incomplete multimodal dataset has at most $2^M-1$ incomplete forms. The paper's figure shows a block-based structure diagram for a three-modal dataset ($M=3$) with 7 incomplete forms: colored cells mark the available modalities and X marks a missing modality. The figure also shows that the instances can be divided into several groups (blocks); all instances in a group share the same missing form, and each instance belongs to exactly one form.

Problem 2.1 Multimodal fusion with incomplete data

Suppose $M$ is the number of modalities in the dataset and $N$ the number of samples; $\psi$ is a function mapping each sample to a form, and $\phi(q)\subseteq\{1,\dots,M\}$ is the set of available modalities for form $q$. Given a set of incomplete multimodal samples $D=\{\tilde{x}_i\}_{i=1}^N$ as input, each sample contains a set of available modalities $\tilde{x}_i=\{x_{i,m}\}_{m\in\phi(\psi(i))}$.
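The form bookkeeping above can be made concrete. The snippet below (an illustration of the definitions, with names of my own choosing) enumerates the $2^M-1$ incomplete forms for $M=3$, builds a $\phi$ table from form index to modality set, and resolves $\psi$ for a sample from its observed modalities:

```python
from itertools import combinations

def incomplete_forms(M):
    """All 2^M - 1 non-empty subsets of the modality indices {1..M}."""
    mods = range(1, M + 1)
    return [frozenset(c) for r in range(1, M + 1) for c in combinations(mods, r)]

forms = incomplete_forms(3)                        # 7 forms when M = 3
phi = {q: s for q, s in enumerate(forms, start=1)}  # form q -> available modalities

def psi(sample_modalities, phi):
    """Map a sample to the form whose modality set matches its observed modalities."""
    observed = frozenset(sample_modalities)
    return next(q for q, s in phi.items() if s == observed)
```

For example, a sample observed only in modalities 1 and 3 is assigned the unique form `q` with `phi[q] == {1, 3}`.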

Transductive learning

The paper builds a transductive learning framework, which differs from inductive learning: transductive (instance-based) learning directly integrates the feature information carried by other samples. The key point is that, through this framework, an incomplete sample can obtain its missing information from other observed samples; instances with different missing modalities can effectively exchange their modality-specific and interaction information, achieving multimodal fusion in the process.

Research method

The proposed HGMF method is a transductive learning framework based on graph neural networks (GNNs) and consists of three steps:

  1. Model the incomplete multimodal data in a heterogeneous hypernode graph (HHG) structure.
  2. Encode the highly interactive multimodal data with missing modalities into modality-specific and cross-modal interaction information.
  3. Integrate and exchange information across the different missing forms among multimodal instances, fusing all data into the same embedding space.

The paper's figure illustrates this three-stage workflow of HGMF on a dataset with 4 missing forms.

Modeling incomplete multimodal data with a heterogeneous hypernode graph

An incomplete multimodal dataset containing multiple missing forms can be modeled as a $k$-NN association graph structure in which each node is an instance.

To do so, first define the HHG. An HHG can be defined as $G=(V,E,\psi,\phi)$, where:

  • $V=\{v_i\}_{i=1}^N$ is the set of hypernodes, one per instance, with the feature set $X=\{\{x_{i,m}\mid\forall m\in\phi(\psi(i))\},1\leq i\leq N\}$.

  • $E=\{e_j\}_{j=1}^{|E|}$ is the set of hyperedges, built by $k$-NN search. The incidence matrix $H\in\{0,1\}^{|V|\times|E|}$ links hypernodes and hyperedges: $H(v_i,e_j)=1$ if $v_i$ belongs to $e_j$. Every edge weight $w_j$ is set to 1.

  • $\psi:V\longmapsto T$ maps each hypernode to its incomplete form, where $T=\{1,2,\dots,\overline{M}\}$ and $\overline{M}=2^M-1$.

  • $\phi:T\longmapsto P(M)\backslash\emptyset$ maps each form to its set of available modalities, where $P(M)$ is the power set of the modality set $M=\{1,2,\dots,M\}$.

Given the dataset $D$, the HHG is constructed from $\phi(\cdot)$, $\psi(\cdot)$, and the feature set $X$.

For each block $b$ ($1\leq b\leq B$) with hypernode set $V_b$ and available modality set $M_b$, node similarity within the block is measured per modality by $\|u_m(x_{i,m})-u_m(x_{j,m})\|_2^2$ for $i,j\in V_b$, where $\|\cdot\|$ is the Euclidean norm and $u_m(\cdot)$, $m=1,\dots,M$, are modality-wise feature transformations. $k$-NN hyperedges are then built within each block, giving $B$ block-level incidence matrices $\{H_1,H_2,\dots,H_B\}$, which are combined into the incidence matrix of the whole HHG, $H=[H_1;H_2;\dots;H_B]$.
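A minimal sketch of this block-wise construction, under my own simplifying assumptions (features already mapped into a shared space, so the role of $u_m$ is played by the input features; one hyperedge per node containing that node and its $k$ nearest neighbors; and the per-block incidence matrices combined block-diagonally, since nodes of one block belong only to that block's hyperedges):

```python
import numpy as np

def knn_incidence(features, k):
    """Build a k-NN hyperedge incidence matrix for one block:
    hyperedge j contains node j plus its k nearest neighbors."""
    n = features.shape[0]
    # Squared Euclidean distances between all node pairs.
    d = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    H = np.zeros((n, n), dtype=int)
    for j in range(n):
        members = np.argsort(d[j])[:k + 1]   # node j itself plus k nearest
        H[members, j] = 1
    return H

# Two blocks (two missing forms) with 6 and 4 instances respectively.
H1 = knn_incidence(np.random.rand(6, 4), k=2)
H2 = knn_incidence(np.random.rand(4, 4), k=2)

# Combine the block-level incidence matrices into the HHG incidence matrix.
H = np.block([
    [H1, np.zeros((6, 4), dtype=int)],
    [np.zeros((4, 6), dtype=int), H2],
])
```

Each column of a block matrix then contains exactly $k+1$ ones, one hyperedge per node.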

Encoding the incomplete multimodal data

Modality-specific feature extractors are chosen according to the data type, for example:

  1. Convolutional neural networks (CNNs).
  2. Bidirectional Long Short-Term Memory networks (Bi-LSTM).

Each modality $m$ has an encoder $f_m(\cdot;\Theta_m)$ with parameters $\Theta_m$. For a hypernode $v_i$ with available modalities $\tilde{x}_i=\{x_{i,m}\}_{m\in\phi(\psi(i))}$, the modality-$m$ representation is

$h_i^m=f_m(x_{i,m};\Theta_m)$

where $h_i^m\in\mathbb{R}^{F_m}$ and $F_m$ is the dimensionality of the modality-$m$ representation.
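The encoders $f_m$ can be sketched as follows. This is not the paper's CNN/Bi-LSTM setup; as a stand-in, each $f_m$ is a single affine layer with ReLU, and the dimensions `in_dims` and `F` are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality input dims and representation dims F_m.
in_dims = {1: 16, 2: 8, 3: 12}
F = {1: 4, 2: 4, 3: 4}

# Theta_m: one weight matrix and bias per modality.
Theta = {m: (rng.standard_normal((F[m], in_dims[m])) * 0.1, np.zeros(F[m]))
         for m in in_dims}

def f(m, x):
    """Modality-m encoder f_m(x; Theta_m): affine layer + ReLU."""
    W, b = Theta[m]
    return np.maximum(W @ x + b, 0.0)

# A sample observed only in modalities {1, 3} is encoded modality by modality;
# nothing is imputed for the missing modality 2.
available = {1: rng.standard_normal(16), 3: rng.standard_normal(12)}
h = {m: f(m, x) for m, x in available.items()}   # h[m] = h_i^m in R^{F_m}
```

Note that only the observed modalities are encoded, which is the point of the imputation-free design.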

For every non-empty subset $S\in P(M)\backslash\emptyset$ of the modality set $M=\{1,2,\dots,M\}$, a factor representation of a common dimension $F'$ is computed, so that information from all subsets lives in the same space.

When $|S|=1$ (unimodal factors), modality $m$'s representation $h_i^m$ is projected with $U_m\in\mathbb{R}^{F'\times F_m}$ and bias $b_m\in\mathbb{R}^{F'}$ via $g_m(\cdot)$, yielding the unimodal factor $\bar{h}_i^m$; in addition, an intra-modality interaction factor $h_i^{m,m}$ is computed from the self outer product $G_i^m\in\mathbb{R}^{F_m\times F_m}$ of $h_i^m$, using $U_{m,m}\in\mathbb{R}^{F'\times (F_m)^2}$, $b_{m,m}\in\mathbb{R}^{F'}$ and $g_{m,m}(\cdot)$.

When $|S|>1$ (cross-modal factors), the representations $\{h_i^m\mid\forall m\in S\}$ are combined into $h_i^S$: the tensor product $C_i^S$ of the $|S|$ representations is projected with $U_S\in\mathbb{R}^{F'\times(\prod_{m\in S}F_m)}$ and bias $b_S\in\mathbb{R}^{F'}$ via $g_S(\cdot)$.
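The subset-factor construction can be illustrated numerically. In this sketch (my own simplification: `tanh` stands in for the unspecified $g_S(\cdot)$, random matrices for the learned $U_S$, $b_S$, and the $|S|=1$ self-product term is omitted), every non-empty subset $S$ of the observed modalities is mapped to an $F'$-dimensional factor via a flattened tensor product:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
F_out = 5                                   # common factor dimension F'

def factor(h_list, U, b):
    """Project the flattened tensor (outer) product of the given
    modality representations down to the shared F'-dimensional space."""
    prod = h_list[0]
    for h in h_list[1:]:
        prod = np.outer(prod, h).ravel()    # C_i^S as a flattened tensor product
    return np.tanh(U @ prod + b)

# Representations h_i^m for an instance observed in modalities {1, 2}.
h = {1: rng.standard_normal(3), 2: rng.standard_normal(4)}

factors = {}
mods = sorted(h)
for r in range(1, len(mods) + 1):
    for S in combinations(mods, r):
        dim = int(np.prod([h[m].size for m in S]))
        U_S = rng.standard_normal((F_out, dim)) * 0.1   # U_S in R^{F' x prod F_m}
        b_S = np.zeros(F_out)
        factors[S] = factor([h[m] for m in S], U_S, b_S)
# factors holds one F'-dimensional vector per non-empty subset S.
```

Projecting every subset to the same dimension $F'$ is what lets instances of different missing forms be compared later in the fusion stage.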

(a) The encoded HHG is $G_{enc}=(V_{enc},E,\psi,\phi)$ with encoded features $X_{enc}=\{\tilde{h}_i\}_{i=1}^N$.

  • $G_{enc}$ is an $\overline{M}$-partite graph: by the form mapping $\psi(\cdot)$, the hypernodes are partitioned into $\overline{M}=|T|$ groups, $V_{enc}=\{V_p\mid\forall p\in T\}$. Each node $v_i\in V_p$ carries the set of $F'$-dimensional factors $\tilde{h}_i=\{h_i^S\mid\forall S\in P(\phi(p))\backslash\emptyset\}$. The fusion stage then maps all $\overline{M}$ groups into a unified embedding matrix $Z\in\mathbb{R}^{N\times d}$.

Fusion is performed by Multi-fold Bilevel Graph Attention Networks (MBGAT), which handle the $\overline{M}$ node groups with $\overline{M}$-fold parameters.

The input of an MBGAT layer is $z=\{\{z_i^p\mid\forall v_i\in V_p\}\mid\forall p\in T\}$, where $z_i^p\in\mathbb{R}^{d_p}$ is the $d_p$-dimensional representation of node $v_i$ of form $p$; the output is $z'=\{\{{z'}_i^p\mid\forall v_i\in V_p\}\mid\forall p\in T\}$ with ${z'}_i^p\in\mathbb{R}^{{d'}_p}$, where ${d'}_p$ is the output dimension for form $p$.

The layer uses form-pair projection matrices $\{W_{pq}\mid\forall p,q\in T\}$, where $W_{pq}\in\mathbb{R}^{d'_q\times d_p}$ maps a form-$p$ representation into the form-$q$ space, so that representations of different forms become comparable.

Node-level attention: $N_q(i)=\{v_j\mid\forall v_j\in V_q\wedge (HH^T)_{ij}>0\}$ is the set of form-$q$ neighbors of $v_i$ under the incidence matrix $H$. For each form $q$, an attention vector $\vec{a}_q\in\mathbb{R}^{2d'_q\times 1}$ scores the projected node pairs, and the neighbors in $N_q(i)$ are aggregated into a form-$q$ summary of $v_i$; $\sigma(\cdot)$ denotes the sigmoid activation.

Form-level attention: the per-form summaries $\{s_i^1,s_i^2,\dots,s_i^{\overline{M}}\}$, $s_i^q\in\mathbb{R}^{d'_q}$, are scored against $v_i$'s own form $p$ using an attention vector $b_p\in\mathbb{R}^{2d'_p\times 1}$ and projections $V_{qp}\in\mathbb{R}^{d'_p\times d'_q}$ that map form-$q$ summaries into the form-$p$ space; their weighted combination gives $v_i$'s updated representation.
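The node-level attention step can be sketched as follows. This is a simplified illustration, not the paper's exact formulation: all forms share one input dimension, a single projection matrix is applied to target and neighbors alike, and softmax attention weights are computed from a concatenation-based score as in standard graph attention:

```python
import numpy as np

rng = np.random.default_rng(2)
d_out = 4

def node_level_attention(z_i, neighbors, W, a):
    """Aggregate one form-q neighborhood of v_i into a summary s_i^q.
    W projects into the form-q space; a scores each (target, neighbor) pair."""
    zi = W @ z_i
    zj = neighbors @ W.T                          # projected neighbor representations
    scores = np.array([a @ np.concatenate([zi, z]) for z in zj])
    scores -= scores.max()                        # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum() # softmax attention weights
    return (alpha[:, None] * zj).sum(axis=0)      # s_i^q in R^{d'_q}

z_i = rng.standard_normal(6)                  # v_i's current representation (d_p = 6)
neigh = rng.standard_normal((3, 6))           # three form-q neighbors
W_pq = rng.standard_normal((d_out, 6)) * 0.1  # W_pq in R^{d'_q x d_p}
a_q = rng.standard_normal(2 * d_out)          # attention vector a_q
s_iq = node_level_attention(z_i, neigh, W_pq, a_q)
```

Running this once per form $q$ yields the summaries $\{s_i^1,\dots,s_i^{\overline{M}}\}$ that the form-level attention then combines.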


Experiments

Datasets:

  1. ModelNet40: a 3D CAD model dataset.
  2. NTU: a 3D shape dataset.

Baselines:

  1. Concat and Tensor Fusion Network (TFN)
  2. Low-rank Fusion Network (LFM)
  3. Multitask Multimodal Learning (MTL)
  4. Hypergraph Neural Network (HGNN)

HGMF is implemented in Python 3.