
STREAMULT: STREAMING MULTIMODAL TRANSFORMER FOR HETEROGENEOUS AND ARBITRARY LONG SEQUENTIAL DATA

Victor Pellegrain†⋆, Myriam Tami⋆, Michel Batteux†, Céline Hudelot⋆

† Institut de Recherche Technologique SystemX, 2 boulevard Thomas Gobert, 91120, Palaiseau, France
⋆ Université Paris-Saclay, CentraleSupélec, MICS, 91190, Gif-sur-Yvette, France

ABSTRACT

This paper tackles the problem of efficiently processing and combining arbitrarily long data streams coming from different modalities with different acquisition frequencies. Typical applications include, for instance, long-term monitoring of industrial or real-life systems from multimodal heterogeneous data (sensor data, monitoring reports, images, etc.). To tackle this problem, we propose StreaMulT, a Streaming Multimodal Transformer relying on cross-modal attention and an augmented memory bank to process arbitrarily long input sequences at training time and to run in a streaming way at inference. StreaMulT reproduces state-of-the-art results on the CMU-MOSEI dataset, while being able to deal with much longer inputs than other models such as the previous Multimodal Transformer.

Index Terms— Multimodal learning, Streaming data, Transformer, Long-term dependencies

1. INTRODUCTION

The availability of massive amounts of data, coupled with recent machine learning breakthroughs, offers great potential in numerous domains. More specifically, in the Industry 4.0 era, a major challenge is to exploit all information sources related to a system in order to perform monitoring for corrective and predictive maintenance. To do so, signal processing approaches must be able to handle multimodal sources such as sensor measurements, maintenance textual reports, or machine images. They therefore need to deal with data streams that are heterogeneous both by nature (time series, raw text, images, etc.) and by their acquisition frequency. Besides, these different streams are also unaligned, as the behaviour of a sensor at the present time can be highly correlated with a maintenance report from several days or weeks in the past. Finally, the data history may be arbitrarily long, and input streams shall be processed in a streaming fashion at inference, as an industrial system may never stop (see Fig. 1).

Since the introduction of self-attention [1], Transformer-based architectures have constituted a breakthrough in many Deep Learning fields, providing efficient contextualized encoders [2] and decoders [3] and regularly beating state-of-the-art benchmarks [4, 5, 6]. Some of these approaches have been proposed to handle multimodal data, such as the Multimodal Transformer [7], which infers unaligned dependencies across modalities. These approaches, however, do not tackle the challenges of arbitrarily long inputs or streaming inference, mainly because of their time and memory complexity, which is quadratic in the input sequence length. Many approaches have tried to alleviate this issue [8], either by using low-rank approximations of the self-attention matrix [9, 10], by adding sparsity through selected or learned attention patterns [11, 12, 13, 14], or by conveying information via a bounded memory [15, 16, 17], decreasing the complexity down to a linear level. Furthermore, some approaches focus on handling streaming data, especially in the Automatic Speech Recognition (ASR) domain, to ensure low latency at test time by chunking input sequences into smaller segments [18, 19, 20]. Notably, the Emformer architecture [21] performs streaming ASR by updating a memory bank to convey information across segments, but this architecture is limited to unimodal sequences.

In this paper, we thus propose to combine these two lines of work in StreaMulT, a Streaming Multimodal Transformer. Our global architecture extends the Emformer approach to a more challenging task by dealing with heterogeneous and unaligned modalities: it can handle both arbitrarily long multimodal input data and streaming inference. Our contributions are threefold. First, we define a new applicative paradigm, in which one aims to solve a prediction task across time from heterogeneous (by nature and acquisition frequency) multimodal sequential data, in a streaming fashion, hence handling arbitrarily long input data. We then propose StreaMulT, a Streaming Multimodal Transformer architecture, to tackle this issue and deal with unaligned input streams. Due to the lack of a public dataset adapted to our task, we finally propose to evaluate our model on the CMU-MOSEI dataset, on a multimodal sentiment analysis task, in order to compare StreaMulT performances with previous approaches; this dataset includes both multimodal and unaligned streams.

In Section 2, we formalize our new paradigm. We then introduce our model, StreaMulT, in Section 3. Finally, we conduct experiments on the CMU-MOSEI dataset in Section 4.

2. MULTIMODAL LEARNING IN STREAMING

In this section, we define the challenging problem our method tackles. For clarity, we consider three modalities, denoted by α, β and γ; this case can be extended to any number of modalities without loss of generality. We consider three time series (X_α, X_β, X_γ) from different modalities (e.g. text, image, sound, numerical, etc.) as our input data. Each series is indexed by time according to its own acquisition times and lies in its own definition space. Hence, for modality α,

X_α := (X_α(t))_{t ∈ T_α}   and   ∀ t ∈ T_α, X_α(t) ∈ ℝ^{d_α}

where T_α and d_α are respectively the countable set containing the acquisition times of modality α and its associated feature dimension. Our objective is to enable prediction tasks (regression or classification) across time. Let X be the set defined as

X := { [X(s)]_{s≤t},  t ∈ ℝ }

where [X(s)]_{s≤t} denotes the data of all modalities acquired before time step t.
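As a concrete illustration of this setting, the following minimal Python sketch (not taken from the paper; `ModalityStream` and `history_up_to` are illustrative names) represents each modality by its own acquisition times T_m and features, and builds the history [X(s)]_{s≤t} used as model input at a prediction time t:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ModalityStream:
    times: np.ndarray      # shape (n_m,), acquisition times T_m, sorted
    features: np.ndarray   # shape (n_m, d_m), X_m(t) for each t in T_m

def history_up_to(streams: dict, t: float) -> dict:
    """Return [X(s)]_{s <= t}: all samples of every modality acquired up to time t."""
    return {m: s.features[s.times <= t] for m, s in streams.items()}

# Three toy modalities with different acquisition frequencies and feature dimensions.
rng = np.random.default_rng(0)
streams = {
    "alpha": ModalityStream(np.arange(0.0, 10.0, 0.1), rng.normal(size=(100, 8))),
    "beta":  ModalityStream(np.arange(0.0, 10.0, 1.0), rng.normal(size=(10, 300))),
    "gamma": ModalityStream(np.array([2.5, 7.3]),      rng.normal(size=(2, 16))),
}
x_t = history_up_to(streams, t=5.0)   # input to the predictor at time t = 5.0
```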


Fig. 1: Multimodal learning in a streaming scheme applied to industrial monitoring

Formally, given a labeling space Y that is common to the different modalities, we try to find the optimal prediction function h* : X → Y minimizing a loss L over some hypothesis space H:

h* = argmin_{h ∈ H} L(h),   with   L(h) := (1 / |T_y|) ∑_{t ∈ T_y} l( h([X(s)]_{s≤t}), y_t )

where l is a score function and T_y is the set of ground-truth time steps, whose definition depends on the downstream task. For instance, in the previous industrial monitoring application, T_y := T_α ∪ T_β ∪ T_γ, as the objective is to detect a fault at any time. However, if we now consider a task in which the objective is to classify each sentence contained in a long sequence (keeping past sentences as input), then for a sequence of s multimodal sentences, the associated ground-truth time steps are the last acquisition time steps of each sentence:

T_y = { max_{t ∈ T_α^j ∪ T_β^j ∪ T_γ^j} t,  1 ≤ j ≤ s }

where j is the sentence index.
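The two choices of T_y can be made concrete with a small example (made-up acquisition times, purely illustrative):

```python
# Acquisition times of each modality (made-up values).
T_alpha = {0.0, 0.5, 1.0, 1.5, 2.0}
T_beta = {0.2, 1.2, 2.2}
T_gamma = {0.9, 1.9}

# Monitoring: a prediction is expected at every acquisition time of any modality.
T_y_monitoring = T_alpha | T_beta | T_gamma

# Sentence classification: sentence j covers the acquisition times T_m^j of each
# modality m; its ground-truth time step is the latest time over all modalities.
sentences = [
    {"alpha": {0.0, 0.5}, "beta": {0.2}, "gamma": set()},    # sentence 1
    {"alpha": {1.0, 1.5}, "beta": {1.2}, "gamma": {0.9}},    # sentence 2
    {"alpha": {2.0},      "beta": {2.2}, "gamma": {1.9}},    # sentence 3
]
T_y_sentences = {max(s["alpha"] | s["beta"] | s["gamma"]) for s in sentences}
# -> {0.5, 1.5, 2.2}
```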

To the best of our knowledge, this paradigm has never been introduced as such. In the following section, we introduce a new architecture to address our objective.

3. PROPOSED MODEL

We propose StreaMulT, a Streaming Multimodal Transformer architecture, taking advantage of both the Multimodal Transformer [7] and the Emformer [21]. Multimodality is managed using Crossmodal Transformer layers, while arbitrarily long sequences are handled through a block processing architecture.

3.1. Crossmodal Transformer and Block processing reviews

The Crossmodal Attention module, as defined in [7], deals with the heterogeneity gap of multimodal inputs [22] by expressing a target modality α with raw features from a source modality β. Formally, considering our input sequences X_α and X_β from modalities α and β, the crossmodal attention for X_α attending to X_β, denoted X_{β→α}, is computed as:

X_{β→α} := softmax( Q_α K_β^T / √d_k ) V_β
         = softmax( X_α W_{Q_α} W_{K_β}^T X_β^T / √d_k ) X_β W_{V_β}

with Q_α the query matrix for modality α, K_β and V_β the key and value matrices for modality β, and W_{Q_α}, W_{K_β}, W_{V_β} learned weights.
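For readers who prefer code, here is a minimal single-head PyTorch sketch of this crossmodal attention (dimension values are illustrative; the actual model in [7] is multi-headed):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossmodalAttention(nn.Module):
    """Single-head crossmodal attention: modality alpha queries modality beta."""
    def __init__(self, d_alpha: int, d_beta: int, d_k: int, d_v: int):
        super().__init__()
        self.W_q = nn.Linear(d_alpha, d_k, bias=False)   # W_{Q_alpha}
        self.W_k = nn.Linear(d_beta, d_k, bias=False)    # W_{K_beta}
        self.W_v = nn.Linear(d_beta, d_v, bias=False)    # W_{V_beta}

    def forward(self, x_alpha: torch.Tensor, x_beta: torch.Tensor) -> torch.Tensor:
        # x_alpha: (T_alpha, d_alpha), x_beta: (T_beta, d_beta)
        q, k, v = self.W_q(x_alpha), self.W_k(x_beta), self.W_v(x_beta)
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # (T_alpha, T_beta)
        return F.softmax(scores, dim=-1) @ v                    # X_{beta -> alpha}

# Toy sequences with different lengths and feature dimensions per modality.
x_beta_to_alpha = CrossmodalAttention(300, 74, 40, 40)(torch.randn(50, 300), torch.randn(375, 74))
```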

Since the input data can be arbitrarily long here, Multimodal Transformer training is intractable due to its quadratic complexity, and inference cannot be done in a streaming way, as the vanilla model needs the whole sequence as input. To alleviate this, we use the block processing method, chunking input sequences into non-overlapping smaller segments (C_i)_{i≥0} (see Fig. 2). We then compute attention on these segments, which reduces the complexity of the cross-modal attention computation. Extending the block processing method to input data with heterogeneous sampling rates, we define hard segment bounds with respect to the temporal axis, hence producing shared segments across modalities. To prevent boundary effects, left and right context blocks are concatenated with the initial blocks to form contextual segments X_i = [L_i : C_i : R_i].
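A minimal sketch of this block processing step, assuming segment bounds and context sizes are expressed on the shared time axis (function and variable names are illustrative, not the authors' code):

```python
import numpy as np

def contextual_segments(times, feats, bounds, left, right):
    """Chunk one modality into contextual segments [L_i : C_i : R_i].

    times: (n,) acquisition times; feats: (n, d) features; bounds: shared segment
    edges on the time axis; left / right: context sizes in time units.
    """
    segments = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        L = feats[(times >= lo - left) & (times < lo)]   # left context block
        C = feats[(times >= lo) & (times < hi)]          # center (initial) block
        R = feats[(times >= hi) & (times < hi + right)]  # right context block
        segments.append((L, C, R))
    return segments

# The same temporal bounds are applied to every modality, whatever its sampling rate.
bounds = np.arange(0.0, 12.0, 2.0)                       # hard segment bounds
segs_alpha = contextual_segments(np.arange(0.0, 10.0, 0.1), np.random.randn(100, 8), bounds, 0.5, 0.5)
segs_beta = contextual_segments(np.arange(0.0, 10.0, 1.0), np.random.randn(10, 300), bounds, 0.5, 0.5)
```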

An Augmented-Memory Transformer (AM-TRF) [18] approach then encodes segment information by learning and storing a memory bank to convey information through time. Considering a contextual segment X_i = [L_i : C_i : R_i] and a memory bank M_i = [m_1, ..., m_{i-1}] containing compressed information from previous segments, the output X^{n+1}_i of the n-th layer is computed as:

\hat{X}^n_i = LN(X^n_i)
K^n_i = W_K [M^n_i, \hat{X}^n_i]
V^n_i = W_V [M^n_i, \hat{X}^n_i]
Q^n_i = W_Q \hat{X}^n_i
[Z^n_{L,i} : Z^n_{C,i} : Z^n_{R,i}] := Attn(Q^n_i, K^n_i, V^n_i) + X^n_i
\hat{X}^{n+1}_i = FFN(LN([Z^n_{L,i} : Z^n_{C,i} : Z^n_{R,i}]))
X^{n+1}_i = LN(\hat{X}^{n+1}_i + [Z^n_{L,i} : Z^n_{C,i} : Z^n_{R,i}])
m^n_i = Attn(W_Q s^n_i, K^n_i, V^n_i)

where s^n_i is the mean of C^n_i, and LN, FFN and Attn respectively correspond to Layer Normalization, Feed-Forward and Attention layers. After passing through all N layers, the outputs corresponding to the left and right contexts are discarded to keep only the center segment representations (C^N_i)_{i≥0}.
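The following simplified, single-head PyTorch sketch illustrates one such memory-augmented layer; it is our interpretation of the equations above in the spirit of AM-TRF / Emformer, not the authors' implementation, and the names (`AMTRFLayer`, `center_slice`) are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attn(q, k, v):
    # single-head scaled dot-product attention
    return F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1) @ v

class AMTRFLayer(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.W_q, self.W_k, self.W_v = (nn.Linear(d, d, bias=False) for _ in range(3))
        self.ln_in, self.ln_mid, self.ln_out = (nn.LayerNorm(d) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

    def forward(self, x_i, mem, center_slice):
        # x_i: (|L|+|C|+|R|, d) contextual segment; mem: (i-1, d) memory bank
        x_hat = self.ln_in(x_i)
        k = self.W_k(torch.cat([mem, x_hat], dim=0))
        v = self.W_v(torch.cat([mem, x_hat], dim=0))
        q = self.W_q(x_hat)
        z = attn(q, k, v) + x_i                              # [Z_L : Z_C : Z_R]
        x_next = self.ln_out(self.ffn(self.ln_mid(z)) + z)   # segment passed to layer n+1
        s_i = x_hat[center_slice].mean(dim=0, keepdim=True)  # summary of the center block
        m_i = attn(self.W_q(s_i), k, v)                      # new memory slot
        return x_next, m_i

# Usage on one contextual segment with an (initially empty) memory bank:
layer = AMTRFLayer(d=32)
x_i = torch.randn(6 + 20 + 6, 32)                            # [L : C : R]
x_next, m_i = layer(x_i, mem=torch.zeros(0, 32), center_slice=slice(6, 26))
```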


Fig. 2: Block processing for Multimodal learning in a streaming scheme. For modality α: X_α, C_{α,i}, L_{α,i} and R_{α,i} respectively correspond to the full input sequence, the i-th initial block, and the left and right contexts associated to this block to form the i-th contextual segment. s_{α,i} corresponds to the mean of the current segment C_{α,i}. The blue area represents an initial block for modality β, while the pink one represents a contextual segment for modality γ.

In this paper, we choose to build on the Emformer architecture [21], an improved and more efficient implementation of AM-TRF; hence we use cached key and value representations from previous segments for the left context, instead of recomputing attention.

3.2. Putting it together with the Memory bank

Our global end-to-end architecture combines the benefits of the Emformer and the Multimodal Transformer. The architecture is illustrated in Fig. 4. We describe here the processing of modality α. X_α is first passed through a 1D convolutional layer, aiming to model some local temporal structure and to map all modalities to a common feature dimension d. Segment bounds are then fixed and, following the block processing approach, every contextual segment X_{α,i} is processed in a parallel way. The segments are first given to a modality-specific Emformer to initialize the modality's own memory bank M_α. Then, each source modality / target modality (β / α) pair is processed by its own Streaming Crossmodal Transformer (SCT) module. Specifically, each segment from the target modality X_{α,i} = [L_{α,i} : C_{α,i} : R_{α,i}] is expressed using the same temporal segment from the source modality X_{β,i}, along with the source modality memory bank M_{β,i}. For each layer n:

[\hat{C}^n_{α,i}, \hat{R}^n_{α,i}] = LN([C^n_{α,i}, R^n_{α,i}])
[\hat{C}^n_{β,i}, \hat{R}^n_{β,i}] = LN([C^n_{β,i}, R^n_{β,i}])
K^n_{β,i} = [K^n_{M,β→α,i}, K^n_{L,β→α,i}, K^n_{C,β→α,i}, K^n_{R,β→α,i}]
V^n_{β,i} = [V^n_{M,β→α,i}, V^n_{L,β→α,i}, V^n_{C,β→α,i}, V^n_{R,β→α,i}]
Z^n_{C,β→α,i} = Attn(Q^n_{C,β→α,i}, K^n_{β,i}, V^n_{β,i}) + C^n_{β→α,i}
Z^n_{R,β→α,i} = Attn(Q^n_{R,β→α,i}, K^n_{β,i}, V^n_{β,i}) + R^n_{β→α,i}
[\hat{C}^{n+1}_{α,i}, \hat{R}^{n+1}_{α,i}] = FFN(LN([Z^n_{C,β→α,i}, Z^n_{R,β→α,i}]))
[C^{n+1}_{α,i}, R^{n+1}_{α,i}] = LN([\hat{C}^{n+1}_{α,i}, \hat{R}^{n+1}_{α,i}] + [Z^n_{C,β→α,i}, Z^n_{R,β→α,i}])

Fig. 3: Streaming Crossmodal Transformer module

where

[K^n_{M,β→α,i}, K^n_{C,β→α,i}, K^n_{R,β→α,i}] = W_{K,β→α} [M_{β,i}, \hat{C}^n_{β,i}, \hat{R}^n_{β,i}]
[V^n_{M,β→α,i}, V^n_{C,β→α,i}, V^n_{R,β→α,i}] = W_{V,β→α} [M_{β,i}, \hat{C}^n_{β,i}, \hat{R}^n_{β,i}]
[Q^n_{C,β→α,i}, Q^n_{R,β→α,i}] = W_{Q,β→α} [C^n_{β→α,i}, R^n_{β→α,i}]

and (K^n_{L,β→α,i}, V^n_{L,β→α,i}) are the cached key and value copies corresponding to previous segments, up to the left context size. This module is illustrated in Fig. 3. After the last layer N, the right-context representations (R^N_{β→α,i})_{i≥0} are discarded, and the (C^N_{β→α,i})_{i≥0} are concatenated to form the final crossmodal representation X_{β→α}. We then concatenate along the feature dimension all crossmodal outputs corresponding to the same target modality α in a vector

Z_α := [X_{β→α} ; X_{γ→α}],

which is given as input to a Transformer Encoder exploiting the sequential nature of the data, to produce the modality output y_α. All modality outputs are eventually concatenated and passed through a final fully-connected layer to output the prediction y. In the next section, we experimentally validate our model.
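To make the SCT module more concrete, here is a simplified single-head PyTorch sketch of one SCT layer for a (source β → target α) pair. It is our interpretation of the equations above, not the released code, and the names (`SCTLayer`, `k_left`, `v_left`) are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attn(q, k, v):
    # single-head scaled dot-product attention
    return F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1) @ v

class SCTLayer(nn.Module):
    """One Streaming Crossmodal Transformer layer for a (source beta -> target alpha) pair."""
    def __init__(self, d: int):
        super().__init__()
        self.W_q, self.W_k, self.W_v = (nn.Linear(d, d, bias=False) for _ in range(3))
        self.ln_tgt, self.ln_src, self.ln_mid, self.ln_out = (nn.LayerNorm(d) for _ in range(4))
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

    def forward(self, tgt_cr, src_cr, mem_src, k_left, v_left):
        # tgt_cr: (|C|+|R|, d) target-side (crossmodal) representation of segment i
        # src_cr: (|C|+|R|, d) source segment; mem_src: (i-1, d) source memory bank
        # k_left / v_left: cached keys/values of the source left context (no recomputation)
        src = self.ln_src(src_cr)
        k = torch.cat([self.W_k(mem_src), k_left, self.W_k(src)], dim=0)
        v = torch.cat([self.W_v(mem_src), v_left, self.W_v(src)], dim=0)
        q = self.W_q(self.ln_tgt(tgt_cr))
        z = attn(q, k, v) + tgt_cr                            # [Z_C, Z_R]
        return self.ln_out(self.ffn(self.ln_mid(z)) + z)      # next-layer [C, R]

# Usage on the first segment (empty memory bank and empty left-context cache):
layer = SCTLayer(d=32)
tgt = torch.randn(10 + 4, 32)          # [C, R] of the target modality
src = torch.randn(10 + 4, 32)          # [C, R] of the source modality
out = layer(tgt, src, mem_src=torch.zeros(0, 32),
            k_left=torch.zeros(0, 32), v_left=torch.zeros(0, 32))
```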

4. EXPERIMENTS AND RESULTS

4.1. Dataset and setups

In the absence of a public dataset compatible with the Streaming Multimodal Learning challenge, i.e. involving long, heterogeneous and unaligned input sequences, we conduct experiments on the CMU-MOSEI dataset [23] to empirically evaluate the StreaMulT architecture and compare it with existing approaches handling sequential unaligned multimodal data. The CMU-MOSEI dataset consists of 23,454 movie review video clips from YouTube, from which audio and video features are extracted using Facet (based on CERT [24]) and COVAREP [25]. Textual features are also extracted from word transcripts, using GloVe [26] pretrained embeddings. This produces an unaligned version of the dataset, which is used to create a word-aligned version using the P2FA algorithm [27]. All aligned sentences are padded to a fixed length of 50 time steps. The related task aims to perform sentiment analysis on these clips, labeled by human annotators with a sentiment score from -3 to 3. As in [7] and previous works, we evaluate model performance using various metrics: 7-class accuracy, binary accuracy (positive or negative statements), F1-score, MAE and the correlation between the model's predictions and the labels.
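For reference, a rough sketch of these metrics is given below. Conventions differ slightly across papers (e.g. weighted versus binary F1-score, handling of zero labels), so this is only an indicative implementation, not the exact protocol of [7]:

```python
import numpy as np

def mosei_metrics(y_pred: np.ndarray, y_true: np.ndarray) -> dict:
    # 7-class accuracy: scores rounded and clipped to the integer range [-3, 3].
    cls_pred = np.clip(np.rint(y_pred), -3, 3)
    cls_true = np.clip(np.rint(y_true), -3, 3)
    acc7 = float(np.mean(cls_pred == cls_true))
    # Binary accuracy and F1 on positive vs. negative statements.
    pos_pred, pos_true = y_pred > 0, y_true > 0
    acc2 = float(np.mean(pos_pred == pos_true))
    tp = float(np.sum(pos_pred & pos_true))
    precision = tp / max(float(np.sum(pos_pred)), 1.0)
    recall = tp / max(float(np.sum(pos_true)), 1.0)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    mae = float(np.mean(np.abs(y_pred - y_true)))
    corr = float(np.corrcoef(y_pred, y_true)[0, 1])
    return {"Acc7": acc7, "Acc2": acc2, "F1": f1, "MAE": mae, "Corr": corr}
```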


Fig. 4: Streaming Multimodal Transformer architecture. SCT stands for Streaming Crossmodal Transformer. Different colors represent the heterogeneous nature of the different modalities, and shadings represent crossmodal features.

To highlight the added value of StreaMulT, we conduct experiments in different settings. (1) We first consider the input video clips as our whole input sequences and observe StreaMulT performances when dividing these clips into smaller segments. As we need to define hard temporal segment bounds, which are not given in the unaligned version of CMU-MOSEI, we conduct this experiment with the aligned version of the dataset. For StreaMulT, we choose to divide the input sentences into 5 segments of length 10. (2) We then concatenate all video clips related to the same speaker and consider these as input sequences, to simulate arbitrarily long input streams.

We compared StreaMulT performances with the Multimodal Transformer (MulT) and other models addressing the Multimodal Sentiment Analysis challenge, among which the recent state-of-the-art methods [28, 29]. We strongly emphasize that the added value of StreaMulT is its ability to deal with arbitrarily long unaligned multimodal inputs, and that it does not intend to address the specific Multimodal Sentiment Analysis task. Hence we only report the Multimodal Transformer metric scores given in [7] for a fair comparison. We also used the available official code¹ for the Multimodal Transformer architecture to run the experiments, with the hyperparameters given in [7]. We could not reproduce the results shown in the paper, hence we present the results we obtained, which are not as good as the published ones. All scores from our experiments are averaged over 5 runs.

Metric           Acc_7^h   Acc_2^h   F1^h     MAE^l    Corr^h
MulT             51.8      82.5      82.3     0.580    0.703
MulT‡ (1)        49.32     81.05     81.42*   0.615    0.666
StreaMulT‡ (1)   50.08*    81.08*    81.01    0.608*   0.671*
MulT‡ (2)        -         -         -        -        -
StreaMulT‡ (2)   49.25     80.55     80.84    0.621    0.665

Table 1: Results on CMU-MOSEI. Best results are marked in bold. ‡: own implementation or reproduced from official code with provided hyperparameters. *: best result obtained among the ‡ category. (1) and (2) refer to the experiment settings defined above.

Table 1 shows that our architecture globally reproduces the results of the Multimodal Transformer for setting (1), and even performs slightly better on some metrics. This indicates that the memory bank properly conveys salient information through time, as the receptive field of StreaMulT only attends to segments of length 10, while MulT attends to the whole sequence of length 50. For setting (2), results are slightly worse, but this setting only aims to simulate arbitrarily long inputs and to show that the StreaMulT approach keeps running, whereas MulT faces a memory error. This validates the StreaMulT architecture in its ability to run in a streaming fashion.

¹ https://github.com/yaohungt/Multimodal-Transformer

5. CONCLUSION

The proposed StreaMulT merges the crossmodal attention module of the Multimodal Transformer with the parallelized block processing method of the Emformer to process multimodal data in a streaming scheme. In that way, it addresses the newly introduced challenge of Multimodal Learning in Streaming, in which input data are arbitrarily long, heterogeneous and unaligned sequences. Experiments conducted on the CMU-MOSEI dataset showed promising results, with no loss of performance but an ability to handle arbitrarily long data at train time and to process sequences in a streaming fashion at inference. Numerous applications of this paradigm, such as industrial monitoring, need an adapted dataset against which related future works can be compared.

Acknowledgements. Victor Pellegrain is funded by IRT SystemX in collaboration with CentraleSupélec. This work was performed using HPC resources from the Mésocentre computing center of CentraleSupélec and École Normale Supérieure Paris-Saclay, supported by CNRS and Région Île-de-France.


6. REFERENCES

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. 2017, vol. 30, Curran Associates, Inc.

[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in NAACL, 2019.

[3] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever, "Language models are unsupervised multitask learners," 2019.

[4] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," 2021.

[5] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang, "Conformer: Convolution-augmented transformer for speech recognition," 2020.

[6] Linhao Dong, Shuang Xu, and Bo Xu, "Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5884–5888.

[7] Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov, "Multimodal transformer for unaligned multimodal language sequences," ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pp. 6558–6569, 2020.

[8] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler, "Efficient transformers: A survey," 2020.

[9] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret, "Transformers are RNNs: Fast autoregressive transformers with linear attention," 2020.

[10] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller, "Rethinking attention with Performers," 2021.

[11] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever, "Generating long sequences with sparse transformers," 2019.

[12] Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed, "Big Bird: Transformers for longer sequences," 2021.

[13] Iz Beltagy, Matthew E. Peters, and Arman Cohan, "Longformer: The long-document transformer," 2020.

[14] Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier, "Efficient content-based sparse attention with routing transformers," 2020.

[15] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov, "Transformer-XL: Attentive language models beyond a fixed-length context," 2019.

[16] Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, and Timothy P. Lillicrap, "Compressive transformers for long-range sequence modelling," 2019.

[17] Pedro Henrique Martins, Zita Marinho, and André F. T. Martins, "∞-former: Infinite memory transformer," 2021.

[18] Chunyang Wu, Yongqiang Wang, Yangyang Shi, Ching-Feng Yeh, and Frank Zhang, "Streaming transformer-based acoustic models using self-attention with augmented memory," 2020.

[19] Zhengkun Tian, Jiangyan Yi, Ye Bai, Jianhua Tao, Shuai Zhang, and Zhengqi Wen, "Synchronous transformers for end-to-end speech recognition," 2020.

[20] Linhao Dong, Feng Wang, and Bo Xu, "Self-attention aligner: A latency-control end-to-end model for ASR using self-attention network and chunk-hopping," 2019.

[21] Yangyang Shi, Yongqiang Wang, Chunyang Wu, Ching-Feng Yeh, Julian Chan, Frank Zhang, Duc Le, and Mike Seltzer, "Emformer: Efficient memory transformer based acoustic model for low latency streaming speech recognition," 2020.

[22] Wenzhong Guo, Jianwen Wang, and Shiping Wang, "Deep multimodal representation learning: A survey," IEEE Access, vol. 7, pp. 63373–63394, 2019.

[23] AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency, "Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, July 2018, pp. 2236–2246, Association for Computational Linguistics.

[24] Gwen Littlewort, Jacob Whitehill, Tingfan Wu, Ian Fasel, Mark Frank, Javier Movellan, and Marian Bartlett, "The computer expression recognition toolbox (CERT)," in 2011 IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2011, pp. 298–305.

[25] Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer, "COVAREP: A collaborative voice analysis repository for speech technologies," May 2014.

[26] Jeffrey Pennington, Richard Socher, and Christopher Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, Oct. 2014, pp. 1532–1543, Association for Computational Linguistics.

[27] Jiahong Yuan and Mark Y. Liberman, "Speaker identification on the SCOTUS corpus," Journal of the Acoustical Society of America, vol. 123, pp. 3878–3878, 2008.

[28] Wenmeng Yu, Hua Xu, Ziqi Yuan, and Jiele Wu, "Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis," arXiv, 2021.

[29] Wei Han, Hui Chen, and Soujanya Poria, "Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis," Sept. 2021.

©2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.