HGMF: Heterogeneous Graph-based Fusion for Multimodal Data with Incompleteness
Background introduction
With the advancement of data collection technology, multimodal data is growing rapidly. Multimodal data fusion has effectively improved performance in many application scenarios, such as object detection, sentiment analysis, emotion recognition, and medical diagnosis.
Early research combined multiple modalities to learn joint representations or make predictions, ranging from traditional approaches such as early fusion and late fusion to deep-learning approaches such as graph-based late fusion and deep fusion, which focuses on exploring cross-modal interactions.
In the real world, multimodal data often has missing modalities due to sensor failure, data corruption, or human recording errors. Effectively fusing and analyzing incomplete multimodal data is therefore a challenging problem.
There are three main challenges in handling missing modalities in multimodal data:
- For data combinations with different missing modalities, the dimensionality and number of available feature sets are inconsistent, which makes it difficult to apply a fusion model designed for complete multimodal data.
- Effective multimodal fusion requires learning complementary information, modality-specific information, and cross-modal interactions, but because of missing modalities this information cannot be obtained directly from an incomplete individual sample.
- A large amount of missing data may greatly shrink the usable dataset, making it difficult to learn high-dimensional interaction features from a small number of samples.
Previous studies usually deal with missing modalities by deleting incomplete samples or imputing the missing modalities. Directly deleting incomplete samples significantly reduces the sample size, which can lead to overfitting in subsequent deep-learning models, especially when many samples have different missing modalities. Imputation methods try to generate the missing modalities from the observed ones, e.g., zero-value filling, mean-value filling, matrix completion, and deep-learning-based methods; however, these methods may instead introduce new noise into the data, hurting model performance, and such noise can also destabilize complex auxiliary models such as deep generative models.
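As a concrete illustration of the imputation baselines mentioned above (these are the baselines the article argues against, not its own method), here is a minimal sketch of mean-value filling with a zero-value fallback, assuming each sample stores its available modalities in a dict keyed by modality name:

```python
import numpy as np

def mean_fill(samples, modality, dim):
    """Fill a missing modality with the mean of the observed vectors.

    samples: list of dicts mapping modality name -> feature vector.
    A sample that lacks `modality` simply has no such key.
    """
    observed = [s[modality] for s in samples if modality in s]
    # Fall back to zero filling when the modality was never observed.
    fill = np.mean(observed, axis=0) if observed else np.zeros(dim)
    for s in samples:
        s.setdefault(modality, fill.copy())
    return samples

samples = [
    {"audio": np.array([1.0, 3.0]), "text": np.array([0.5])},
    {"text": np.array([0.2])},        # audio modality missing
    {"audio": np.array([3.0, 1.0])},  # text modality missing
]
mean_fill(samples, "audio", dim=2)
# The second sample now carries the mean of the observed audio vectors: [2.0, 2.0]
```

The filled-in vector is the same for every incomplete sample, which is exactly the kind of artificial signal ("new noise") the article warns about.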
Main research content
This article studies incomplete multimodal data fusion based on heterogeneous graphs, and is dedicated to processing incomplete data without imputation. Several lines of prior work attempt this, for example:
- Multi-source feature learning: divide the incomplete data into multiple sub-combinations, then integrate these sub-combinations, turning the task into a sparse multi-task learning problem.
- Multi-hypergraph learning: combine high-order subgroup relations and learn directly on the output.
Although these methods provide solutions, they ignore inter-modal and intra-modal interactions and cannot learn relationships between incomplete samples. The authors therefore develop a new architecture that extracts this complex information and fuses multimodal data with missing modalities without deleting samples or imputing values.
The method is called Heterogeneous Graph-based Multimodal Fusion (HGMF). It first models the incomplete multimodal data in a heterogeneous graph structure, then uses a transductive learning framework based on graph neural networks to extract complementary information from the highly interactive, incomplete modalities and merge information from different subspaces into a unified space. Specifically:
- It proposes to model highly interactive multimodal data with different incomplete modalities in a heterogeneous hypernode graph (HHG).
- It proposes a transductive learning framework based on graph neural networks to perform multimodal fusion of the incomplete data on the constructed HHG.
- Experiments at multiple levels of data missingness show that the method can handle real scenarios with a high percentage of missing data.
Problem formalization
Incomplete forms of multimodal data
For an $M$-modal incomplete dataset, there are $(2^M-1)$ combinations of missing modalities, so an incomplete multimodal dataset has at most $(2^M-1)$ incomplete forms. The figure below shows a block-based structure diagram for a three-modal dataset with 7 incomplete forms ($M=3$): colored cells are the present modalities, and X marks a missing modality. The figure also shows that the instances can be divided into groups: all instances in a group share the same missing form, and each instance belongs to exactly one form.
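The count $(2^M-1)$ is simply the number of non-empty subsets of the modality set, which can be checked directly (a quick sketch, not part of the original method):

```python
from itertools import combinations

def missing_forms(M):
    """Enumerate all non-empty subsets of {1, ..., M}: one per incomplete form."""
    mods = range(1, M + 1)
    return [set(c) for r in range(1, M + 1) for c in combinations(mods, r)]

forms = missing_forms(3)
print(len(forms))  # 2**3 - 1 = 7 incomplete forms for a three-modal dataset
```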
Problem 2.1 Multimodal fusion with incomplete data
Let $M$ be the number of modalities in the dataset, $N$ the number of samples, $\psi$ a function mapping each sample to its form, and $\phi(q)\subseteq\{1,\dots,M\}$ the set of available modalities for form $q$. Given a set of incomplete multimodal data samples $D=\{\tilde{x}_i\}^N_{i=1}$ as input, each data sample contains only its set of available modalities, $\tilde{x}_i=\{x_{i,m}\}_{m\in\phi(\psi(i))}$, and the goal is to fuse these samples into a common embedding space.
Transductive learning
This article builds a transductive learning framework, in contrast to inductive learning: transductive (instance-based) learning directly integrates the feature information implicit in other samples. The key point is that, through this transductive framework, an incomplete data sample can obtain missing information from other observed samples, and instances with different missing modalities can effectively exchange their modality-specific and interaction information, achieving multimodal fusion in the process.
Research method
This article proposes HGMF, a transductive learning framework based on graph neural networks (GNNs), which consists of three steps:
- Model the incomplete multimodal data in a heterogeneous hypernode graph structure.
- Encode the highly interactive multimodal data with missing modalities into modality-specific and cross-modal interaction information.
- Integrate and exchange information across instances with different missing forms, fusing all data into the same embedding space.
The following figure shows the three-stage workflow of HGMF on a dataset with 4 missing forms.
Modeling incomplete multimodal data with a heterogeneous hypernode graph
An incomplete multimodal dataset containing multiple missing forms can be modeled as a kNN association graph structure in which each node is an instance.
To achieve this modeling, first define the heterogeneous hypernode graph (HHG). An HHG can be defined as $G=(V,E,\psi,\phi)$, where:

- $V=\{v_i\}^N_{i=1}$ is the node (instance) set, with node features $X=\{\{x_{i,m}\ \forall m\in\phi(\psi(i))\},1\leq i\leq N\}$, i.e., each node carries only its available modalities.
- $E=\{e_j\}^{|E|}_{j=1}$ is the hyperedge set, constructed by kNN with parameter $k$. The structure of $E$ is recorded in an incidence matrix $H\in\{0,1\}^{|V|\times|E|}$, where $H(v_i,e_j)=1$ if node $v_i$ belongs to hyperedge $e_j$; each hyperedge weight $w_j$ is set to 1.
- $\psi:V\longmapsto T$ maps each node to its incomplete form, where $T=\{1,2,\dots,\overline{M}\}$ and $\overline{M}=2^M-1$.
- $\phi:T\longmapsto P(\mathcal{M})\backslash\emptyset$ maps each form to its set of available modalities, where $P(\mathcal{M})$ is the power set of the modality index set $\mathcal{M}=\{1,2,\dots,M\}$.

To construct the HHG from the dataset $D$, note that $\phi(\cdot)$ and $\psi(\cdot)$ are determined by the missing patterns of $D$, and the features $X$ come directly from the data. The instances are divided into $B$ blocks, where $V_b$ denotes the nodes of block $b$ and $M_b$ the modalities available in block $b$. Within each block, the pairwise distance under modality $m$ is $Z_m(i,j)=\|u_m(x_{i,m})-u_m(x_{j,m})\|^2_2$ for $i,j\in V_b$, where $u_m(\cdot),m=1\dots M$ are modality-specific feature extractors; these distances are combined over the available modalities $m\in M_b$ to find each instance's k nearest neighbours within the block. This yields per-block incidence matrices $\{H_1,H_2,\dots,H_B\}$, which are stacked into the incidence matrix of the whole HHG, $H=[H_1;H_2;\dots;H_B]$.
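The per-block kNN hyperedge construction can be sketched as follows. This is an illustrative reading, assuming plain Euclidean distance on already-extracted features (the modality-specific extractors $u_m(\cdot)$ and the summation over modalities are abstracted into a single feature matrix):

```python
import numpy as np

def block_knn_incidence(features, k):
    """Build a hypergraph incidence matrix H for one block.

    features: (n, d) array; each row is one instance of the block.
    Each instance v_i centres one hyperedge e_i containing v_i and its
    k nearest neighbours, so H is (n, n) with H[v, e] = 1 for membership.
    """
    n = len(features)
    # Pairwise squared Euclidean distances.
    diff = features[:, None, :] - features[None, :, :]
    dist = (diff ** 2).sum(-1)
    H = np.zeros((n, n), dtype=int)
    for i in range(n):
        nearest = np.argsort(dist[i])[: k + 1]  # includes v_i itself
        H[nearest, i] = 1
    return H

X = np.array([[0.0], [0.1], [5.0], [5.1]])
H = block_knn_incidence(X, k=1)
# Each column (hyperedge) contains the centre node plus its nearest neighbour.
```

Stacking the per-block matrices produced this way corresponds to the $H=[H_1;H_2;\dots;H_B]$ step in the text.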
Encoding modality-specific and cross-modal interaction information

Given the constructed HHG $G$ and the raw features $X$, the second stage encodes $X$. Raw features of different modalities are first embedded by modality-appropriate extractors, for example:

- convolutional neural networks (CNNs) for visual data;
- Bidirectional Long Short-Term Memory networks (BiLSTM) for sequential data.

Let $f_m(\cdot;\Theta_m)$ denote the encoder with parameters $\Theta_m$ for modality $m$. For a node $v_i$ with available modalities $\tilde{x}_i=\{x_{i,m}\}_{m\in\phi(\psi(i))}$, each available modality $m$ is embedded as $h_i^m=f_m(x_{i,m};\Theta_m)\in\mathbb{R}^{F_m}$, where $F_m$ is the embedding dimension of modality $m$.

Let $P(\cdot)$ denote the power set of the modality index set $\mathcal{M}=\{1,2,\dots,M\}$. For every non-empty subset $\forall S\in P(\mathcal{M})\backslash\emptyset$, an interaction factor of dimension $F'$ is encoded for $S$.

When $|S|=1$, both the modality-specific information and the intra-modal interaction $h_i^{m,m}$ are encoded: $h_i^{\{m\}}=g_m(U_m h_i^m+b_m)$ and $h_i^{m,m}=g_{m,m}(U_{m,m}\,\mathrm{vec}(G_i^m)+b_{m,m})$, where $U_m\in\mathbb{R}^{F'\times F_m}$ and $b_m\in\mathbb{R}^{F'}$ (respectively $U_{m,m}\in\mathbb{R}^{F'\times (F_m)^2}$ and $b_{m,m}\in\mathbb{R}^{F'}$) are weights and biases, $g_m(\cdot)$ and $g_{m,m}(\cdot)$ are activation functions, and $G_i^m\in\mathbb{R}^{F_m\times F_m}$ is the outer product of $h_i^m$ and $\bar{h}^m_i$ (a transformed copy of $h_i^m$), capturing the intra-modal interaction of modality $m$.

When $|S|>1$, the cross-modal interaction of $\{h_i^m\ \forall m\in S\}$ is encoded into $h_i^S=g_S(U_S\,\mathrm{vec}(C_i^S)+b_S)$, where $C_i^S$ is the tensor (outer) product of the embeddings of the modalities in $S$, $U_S\in\mathbb{R}^{F'\times(\prod_{m\in S}F_m)}$, $b_S\in\mathbb{R}^{F'}$, and $g_S(\cdot)$ is an activation function.
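The subset-wise interaction encoding described above can be sketched in a few lines of numpy. Shapes follow the $U_S\in\mathbb{R}^{F'\times\prod_{m\in S}F_m}$ fragment; the activation $g_S$ is taken to be tanh and the weights are random, both purely illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_subset(h, U, b):
    """Encode one modality subset S: the outer product of the unimodal
    embeddings in h is flattened and mapped into a shared F'-dim space.

    h: list of 1-D arrays, one per modality in S (dims F_m).
    U: (F', prod F_m) weight; b: (F',) bias.
    """
    C = h[0]
    for v in h[1:]:
        C = np.tensordot(C, v, axes=0)   # outer-product tensor C_i^S
    return np.tanh(U @ C.ravel() + b)    # h_i^S = g_S(U_S vec(C_i^S) + b_S)

F_audio, F_text, Fp = 3, 2, 4
h_audio, h_text = rng.normal(size=F_audio), rng.normal(size=F_text)
U = rng.normal(size=(Fp, F_audio * F_text))
b = np.zeros(Fp)
h_S = encode_subset([h_audio, h_text], U, b)  # code for S = {audio, text}
```

Because every subset code lands in the same $F'$-dimensional space, instances with different available modalities end up with comparable representations, which is what the fusion stage relies on.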
After encoding, we obtain the encoded HHG $G_{enc}=(V_{enc},E,\psi,\phi)$ with node features $X_{enc}=\{\tilde{h}_i\}^N_{i=1}$.

Fusing instances across the $\bar{M}$ missing forms

In $G_{enc}=(V_{enc},E,\psi,\phi)$ there are $\bar{M}=|\tau|$ node types, one per incomplete form, determined by $\psi(\cdot)$, and the node set can be partitioned as $V_{enc}=\{V_p\ \forall p\in\tau\}$. Each node $v_i\in V_p$ carries a set of $F'$-dimensional codes $\tilde{h}_i=\{h_i^S\ \forall S\in P(\phi(p))\backslash\emptyset\}$. The goal of the fusion stage is to exchange information across the $\bar{M}$ node types and fuse all instances into a unified embedding $Z\in\mathbb{R}^{N\times d}$.
To do this, the authors propose Multi-fold Bilevel Graph Attention Networks (MBGAT), which handle the $\bar{M}$ node types and the interactions among them.

An MBGAT layer takes as input $z=\{\{z_i^p\ \forall v_i\in V_p\}\ \forall p\in\tau\}$, where $z_i^p\in\mathbb{R}^{d_p}$ is the representation of node $v_i$ of type $p$ and $d_p$ is the input dimension of type $p$, and outputs $z'=\{\{{z'}_{i}^{p}\ \forall v_i\in V_p\}\ \forall p\in\tau\}$ with ${z'}_i^p\in\mathbb{R}^{{d'}_p}$, where ${d'}_p$ is the output dimension of type $p$.
An MBGAT layer works as follows:

- Type-specific projections $\{W_{pq}\ \forall p,q\in\tau\}$ are learned, one per ordered pair of types in $\tau$, with $W_{pq}\in\mathbb{R}^{d'_q\times d_p}$ projecting a type-$p$ representation into the space of type $q$, so that nodes of different types become comparable.
- Node-level attention: let $N_q(i)=\{v_j\ \forall v_j\in V_q\wedge (HH^T)_{ij}>0\}$ denote the type-$q$ neighbours of $v_i$ under the incidence matrix $H$. For each type $q$, an attention vector $\vec{a}_q\in\mathbb{R}^{2d'_q\times 1}$ scores each neighbour in $N_q(i)$ against $v_i$ (both projected into the type-$q$ space), and the normalized scores are used to aggregate $N_q(i)$ into a type-$q$ summary of $v_i$; $\sigma(\cdot)$ denotes the sigmoid activation function.
- Type-level attention: the first level yields per-type summaries $\{s_i^1,s_i^2,\dots,s^{\bar{M}}_i\},s_i^q\in\mathbb{R}^{d'_q}$, one from each type $q$. For a target type $p$, a second attention vector $b_p\in\mathbb{R}^{2d'_p\times 1}$ weights these summaries, with $V_{qp}\in\mathbb{R}^{d'_p\times d'_q}$ projecting the type-$q$ summary into the type-$p$ space, producing the new representation ${z'}_i^p$ of $v_i$.
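The two attention levels can be illustrated with a toy numpy sketch: node-level attention aggregates a node's type-$q$ neighbours into a summary $s_i^q$, and type-level attention then weights the per-type summaries. This is a schematic reading of MBGAT, not the authors' implementation; the projections are assumed to have been applied already, and the LeakyReLU slope and type weights are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def node_level_attention(zi, neighbours, a):
    """Aggregate one node's (already projected) type-q neighbours.

    zi: (d,) projected query node; neighbours: (n, d) projected neighbours;
    a: (2d,) attention vector, mirroring a_q in R^{2 d'_q}.
    """
    scores = np.array([leaky_relu(a @ np.concatenate([zi, nb]))
                       for nb in neighbours])
    alpha = softmax(scores)        # attention weights over the neighbours
    return alpha @ neighbours      # type-q summary s_i^q

def type_level_attention(summaries, betas):
    """Weight the per-type summaries (all projected to the target space)."""
    w = softmax(np.array(betas))
    return sum(wq * sq for wq, sq in zip(w, summaries))

rng = np.random.default_rng(1)
d = 4
zi = rng.normal(size=d)
s1 = node_level_attention(zi, rng.normal(size=(3, d)), rng.normal(size=2 * d))
s2 = node_level_attention(zi, rng.normal(size=(5, d)), rng.normal(size=2 * d))
zi_new = type_level_attention([s1, s2], betas=[0.3, 0.7])  # fused z'_i
```

The bilevel design is what lets an incomplete instance weight information differently depending on which missing form a neighbour comes from, rather than treating all neighbours uniformly.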
Experiments

HGMF is evaluated on three datasets:

- ModelNet40: a 3D CAD model dataset;
- NTU: a 3D object dataset;
- IEMOCAP: a multimodal emotion recognition dataset.

The baselines include:

- Concat and Tensor Fusion Network (TFN);
- Low-rank Fusion Network (LFM);
- Multi-task Multimodal Learning (MTL);
- Hypergraph Neural Network (HGNN).

HGMF is implemented in Python 3.