

Learning Representations from Imperfect Time Series Data via Tensor Rank Regularization

Paul Pu Liang♣∗, Zhun Liu♢∗, Yao-Hung Hubert Tsai♣, Qibin Zhao♡, Ruslan Salakhutdinov♣, Louis-Philippe Morency♢
♣Machine Learning Department, Carnegie Mellon University, USA
♢Language Technologies Institute, Carnegie Mellon University, USA
♡Tensor Learning Unit, RIKEN Center for Advanced Intelligence Project, Japan
{pliang,zhunl,yaohungt,rsalakhu,morency}@cs.cmu.edu

[email protected]

Abstract

There has been an increased interest in multimodal language processing including multimodal dialog, question answering, sentiment analysis, and speech recognition. However, naturally occurring multimodal data is often imperfect as a result of imperfect modalities, missing entries or noise corruption. To address these concerns, we present a regularization method based on tensor rank minimization. Our method is based on the observation that high-dimensional multimodal time series data often exhibit correlations across time and modalities which leads to low-rank tensor representations. However, the presence of noise or incomplete values breaks these correlations and results in tensor representations of higher rank. We design a model to learn such tensor representations and effectively regularize their rank. Experiments on multimodal language data show that our model achieves good results across various levels of imperfection.

1 Introduction

Analyzing multimodal language sequences spans various fields including multimodal dialog (Das et al., 2017; Rudnicky, 2005), question answering (Antol et al., 2015; Tapaswi et al., 2015; Das et al., 2018), sentiment analysis (Morency et al., 2011), and speech recognition (Palaskar et al., 2018). Generally, these multimodal sequences contain heterogeneous sources of information across the language, visual and acoustic modalities. For example, when instructing robots, these machines have to comprehend our verbal instructions and interpret our nonverbal behaviors while grounding these inputs in their visual sensors (Schmerling et al., 2017; Iba et al., 2005). Likewise, comprehending human intentions requires integrating human language, speech,

∗first two authors contributed equally


Figure 1: Clean multimodal time series data (in shades of green) exhibits correlations across time and across modalities, leading to redundancy in low rank tensor representations. On the other hand, the presence of imperfect entries (in gray, blue, and red) breaks these correlations and leads to higher rank tensors. In these scenarios, we use tensor rank regularization to learn tensors that more accurately represent the true correlations and latent structures in multimodal data.

facial behaviors, and body postures (Mihalcea, 2012; Rossiter, 2011). However, even as more modalities are required for improved performance, we now face a challenge of imperfect data, where data might be 1) incomplete due to mismatched modalities or sensor failure, or 2) corrupted with random or structured noise. As a result, an important research question involves learning robust representations from imperfect multimodal data.

Recent research in both unimodal and multimodal learning has investigated the use of tensors for representation learning (Anandkumar et al., 2014). Given representations h1, ..., hM from M modalities, the order-M outer product tensor T = h1 ⊗ h2 ⊗ ... ⊗ hM is a natural representation for all possible interactions between the modality dimensions (Liu et al., 2018). In this paper, we propose a model called the Temporal Tensor Fusion Network (T2FN) that builds tensor representations from multimodal time series data. T2FN


learns a tensor representation that captures multimodal interactions across time. A key observation is that clean data exhibits tensors that are low-rank, since high-dimensional real-world data is often generated from lower dimensional latent structures (Lakshmanan et al., 2015). Furthermore, clean multimodal time series data exhibits correlations across time and across modalities (Yang et al., 2017; Hidaka and Yu, 2010). This leads to redundancy in these overparametrized tensors, which explains their low rank (Figure 1). On the other hand, the presence of noise or incomplete values breaks these natural correlations and leads to higher rank tensor representations. As a result, we can use tensor rank minimization to learn tensors that more accurately represent the true correlations and latent structures in multimodal data, thereby alleviating imperfection in the input. With these insights, we show how to integrate tensor rank minimization as a simple regularizer for training in the presence of imperfect data. As compared to previous work on imperfect data (Sohn et al., 2014; Srivastava and Salakhutdinov, 2014; Pham et al., 2019), our model does not need to know which of the entries or modalities are imperfect beforehand. Our model combines the strength of temporal non-linear transformations of multimodal data with a simple regularization technique on tensor structures. We perform experiments on multimodal video data consisting of humans expressing their opinions using a combination of language and nonverbal behaviors. Our results back up our intuitions that imperfect data increases tensor rank. Finally, we show that our model achieves good results across various levels of imperfection.

2 Related Work

Tensor Methods: Tensor representations have been used for learning discriminative representations in unimodal and multimodal tasks. Tensors are powerful because they can capture important higher order interactions across time, feature dimensions, and multiple modalities (Kossaifi et al., 2017). For unimodal tasks, tensors have been used for part-of-speech tagging (Srikumar and Manning, 2014), dependency parsing (Lei et al., 2014), word segmentation (Pei et al., 2014), question answering (Qiu and Huang, 2015), and machine translation (Setiawan et al., 2015). For multimodal tasks, Huang et al. (2017) used tensor products between images and text features for image captioning. A similar approach was proposed to learn

representations across text, visual, and acoustic features to infer speaker sentiment (Liu et al., 2018; Zadeh et al., 2017). Other applications include multimodal machine translation (Delbrouck and Dupont, 2017), audio-visual speech recognition (Zhang et al., 2017), and video semantic analysis (Wu et al., 2009; Gao et al., 2009).

Imperfect Data: In order to account for imperfect data, several works have proposed generative approaches for multimodal data (Sohn et al., 2014; Srivastava and Salakhutdinov, 2014). Recently, neural models such as cascaded residual autoencoders (Tran et al., 2017), deep adversarial learning (Cai et al., 2018), or translation-based learning (Pham et al., 2019) have also been proposed. However, these methods often require knowing which of the entries or modalities are imperfect beforehand. While there has been some work on using low-rank tensor representations for imperfect data (Chang et al., 2017; Fan et al., 2017; Chen et al., 2017; Long et al., 2018; Nimishakavi et al., 2018), our approach is the first to integrate rank minimization with neural networks for multimodal language data, thereby combining the strength of non-linear transformations with the mathematical foundations of tensor structures.

3 Proposed Method

In this section, we present our method for learning representations from imperfect human language across the language, visual, and acoustic modalities. In §3.1, we discuss some background on tensor ranks. We outline our method for learning tensor representations via a model called Temporal Tensor Fusion Network (T2FN) in §3.2. In §3.3, we investigate the relationship between tensor rank and imperfect data. Finally, in §3.4, we show how to regularize our model using tensor rank minimization.

We use lowercase letters x ∈ ℝ to denote scalars, boldface lowercase letters x ∈ ℝ^d to denote vectors, and boldface capital letters X ∈ ℝ^{d1×d2} to denote matrices. Tensors, which we denote by calligraphic letters X, are generalizations of matrices to multidimensional arrays. An order-M tensor has M dimensions, X ∈ ℝ^{d1×...×dM}. We use ⊗ to denote the outer product between vectors.

3.1 Background: Tensor Rank

The rank of a tensor measures how many vectors are required to reconstruct the tensor. Simple tensors that can be represented as outer products of



Figure 2: The Temporal Tensor Fusion Network (T2FN) creates a tensor M from temporal data. The rank of M increases with imperfection in data, so we regularize our model by minimizing an upper bound on the rank of M.

vectors have lower rank, while complex tensors have higher rank. To be more precise, we define the rank of a tensor using Canonical Polyadic (CP) decomposition (Carroll and Chang, 1970). For an order-M tensor X ∈ ℝ^{d1×...×dM}, there exists an exact decomposition into vectors w:

X = ∑_{i=1}^{r} ⊗_{m=1}^{M} w_m^i.   (1)

The minimal r for exact decomposition is called the rank of the tensor. The vectors {{w_m^i}_{m=1}^{M}}_{i=1}^{r} are called the rank r decomposition factors of X.

3.2 Multimodal Tensor Representations

Our model for creating tensor representations is called the Temporal Tensor Fusion Network (T2FN), which extends the model in Zadeh et al. (2017) to include a temporal component. We show that T2FN increases the capacity of TFN to capture high-rank tensor representations, which itself leads to improved prediction performance. More importantly, our knowledge about tensor rank properties allows us to regularize our model effectively for imperfect data.

We begin with time series data from the language, visual and acoustic modalities, denoted as [x_ℓ^1, ..., x_ℓ^T], [x_v^1, ..., x_v^T], and [x_a^1, ..., x_a^T] respectively. We first use Long Short-term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) to encode the temporal information from each modality, resulting in sequences of hidden representations [h_ℓ^1, ..., h_ℓ^T], [h_v^1, ..., h_v^T], and [h_a^1, ..., h_a^T]. Similar to prior work which found tensor representations to capture higher-order interactions from multimodal data (Liu et al., 2018; Zadeh et al., 2017; Fukui et al., 2016), we form tensors via outer products of the individual representations through time (as shown in Figure 2):

M = ∑_{t=1}^{T} [h_ℓ^t; 1] ⊗ [h_v^t; 1] ⊗ [h_a^t; 1]   (2)

where we append a 1 so that unimodal, bimodal, and trimodal interactions are all captured as described in Zadeh et al. (2017). M is our multimodal representation, which can then be used to predict the label y using a fully connected layer. Observe how the construction of M closely resembles equation (1) as the sum of vector outer products. As compared to TFN, which uses a single outer product to obtain a multimodal tensor of rank one, T2FN creates a tensor of high rank (upper bounded by T). As a result, the notion of rank naturally emerges when we reason about the properties of M.
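Concretely, equation (2) amounts to appending a constant 1 to each hidden state and summing T three-way outer products. A minimal numpy sketch (the helper name and shapes are ours, assuming the per-time-step LSTM hidden states are already computed):

```python
import numpy as np

def t2fn_tensor(h_l, h_v, h_a):
    """Form M = sum_t [h_l^t; 1] ⊗ [h_v^t; 1] ⊗ [h_a^t; 1] as in equation (2).

    h_l, h_v, h_a: arrays of shape (T, d_l), (T, d_v), (T, d_a) holding
    per-time-step LSTM hidden states for each modality.
    """
    T = h_l.shape[0]
    ones = np.ones((T, 1))
    # appending a 1 keeps unimodal/bimodal/trimodal interactions in the product
    h_l = np.concatenate([h_l, ones], axis=1)
    h_v = np.concatenate([h_v, ones], axis=1)
    h_a = np.concatenate([h_a, ones], axis=1)
    # sum of T rank-one terms, so the CP rank of M is upper bounded by T
    return np.einsum('ti,tj,tk->ijk', h_l, h_v, h_a)

rng = np.random.default_rng(0)
M = t2fn_tensor(rng.normal(size=(20, 8)),
                rng.normal(size=(20, 4)),
                rng.normal(size=(20, 6)))
print(M.shape)  # (9, 5, 7)
```

One quick check: the entry at index (-1, -1, -1) multiplies the three appended 1s, so it always equals T. In a real model the einsum would run on framework tensors so that gradients flow back into the LSTMs.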

3.3 How Does Imperfection Affect Rank?

We first state several observations about the rank of the multimodal representation M:
1) r_noisy: The rank of M is maximized when data entries are sampled from i.i.d. noise (e.g. Gaussian distributions). This is because this setting leads to no redundancy at all between the feature dimensions across time steps.
2) r_clean < r_noisy: Clean real-world data is often generated from lower dimensional latent structures (Lakshmanan et al., 2015). Furthermore, multimodal time series data exhibits correlations across time and across modalities (Yang et al., 2017; Hidaka and Yu, 2010). This redundancy leads to low-rank tensor representations.
3) r_clean < r_imperfect < r_noisy: If the data is imperfect, the presence of noise or incomplete values breaks these natural correlations and leads to higher rank tensor representations.
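The ordering r_clean < r_imperfect can be illustrated numerically. The toy construction below is entirely ours (not the paper's experiment): perfectly correlated time steps give a tensor whose unfolding has rank one, while randomly dropping entries breaks the correlation and inflates the numerical rank.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 50, 10

def build_tensor(h1, h2, h3):
    # M = sum_t h1[t] ⊗ h2[t] ⊗ h3[t]
    return np.einsum('ti,tj,tk->ijk', h1, h2, h3)

def effective_rank(M, tol=1e-6):
    # number of singular values of the mode-0 unfolding above tol * largest
    s = np.linalg.svd(M.reshape(M.shape[0], -1), compute_uv=False)
    return int((s > tol * s[0]).sum())

# "clean": every time step repeats the same hidden state (full correlation),
# so the tensor is exactly rank one
base = [rng.normal(size=d) for _ in range(3)]
clean = [np.tile(b, (T, 1)) for b in base]

# "imperfect": randomly zero out 30% of the entries of the clean states
drop = lambda h: h * (rng.random(h.shape) >= 0.3)
imperfect = [drop(h) for h in clean]

r_clean = effective_rank(build_tensor(*clean))
r_imperfect = effective_rank(build_tensor(*imperfect))
print(r_clean, r_imperfect)  # 1 versus a strictly larger value
```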

These intuitions are also backed up by severalexperimental results which are presented in §4.2.

3.4 Tensor Rank Regularization

Given our intuitions above, it would then seem natural to augment the discriminative objective function with a term to minimize the rank of M.


In practice, the rank of an order-M tensor is computed using the nuclear norm ∥X∥∗, which is defined as (Friedland and Lim, 2014):

∥X∥∗ = inf { ∑_{i=1}^{r} |λ_i| : X = ∑_{i=1}^{r} λ_i (⊗_{m=1}^{M} w_m^i), ∥w_m^i∥ = 1, r ∈ ℕ }.   (3)

When M = 2, this reduces to the matrix nuclear norm (the sum of singular values). However, computing the rank of a tensor or its nuclear norm is NP-hard for tensors of order ≥ 3 (Friedland and Lim, 2014). Fortunately, there exist efficiently computable upper bounds on the nuclear norm, and minimizing these upper bounds would also minimize the nuclear norm ∥M∥∗. We choose the upper bound presented in Hu (2014), which bounds the nuclear norm by the tensor Frobenius norm scaled by the tensor dimensions:

∥M∥∗ ≤ √( (∏_{i=1}^{M} d_i) / max{d_1, ..., d_M} ) ∥M∥_F,   (4)

where the Frobenius norm ∥M∥_F is the square root of the sum of squared entries in M. Since ∥M∥_F is easily computable and convex, including this term adds negligible computational cost to the model. We use this upper bound as a surrogate for the nuclear norm in our objective function, which is therefore a weighted combination of the prediction loss and the tensor rank regularizer in equation (4).
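Because the bound in equation (4) is just a rescaled Frobenius norm, the regularizer is a one-liner. The numpy sketch below is ours (in training it would be written against the framework's differentiable tensors); it also sanity-checks the bound in the matrix case, where the nuclear norm is exactly the sum of singular values.

```python
import numpy as np

def rank_regularizer(M):
    """Upper bound on the tensor nuclear norm from equation (4):
    sqrt(prod(d_i) / max(d_i)) * ||M||_F."""
    dims = np.array(M.shape)
    scale = np.sqrt(dims.prod() / dims.max())
    # np.linalg.norm on an N-d array (ord=None) is the Frobenius norm
    return scale * np.linalg.norm(M)

# matrix sanity check (M = 2): the bound must dominate the true nuclear
# norm, which for matrices is the sum of singular values
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 7))
nuclear = np.linalg.svd(A, compute_uv=False).sum()
print(nuclear <= rank_regularizer(A))  # True
```

The training objective is then `prediction_loss + lam * rank_regularizer(M)` for some weight `lam` (a hyperparameter; its value is not specified in this section).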

4 Experiments

Our experiments are designed with two research questions in mind: 1) What is the effect of various levels of imperfect data on tensor rank in T2FN? 2) Does T2FN with rank regularization perform well on prediction with imperfect data? We answer these questions in §4.2 and §4.3 respectively.

4.1 Datasets

We experiment with real video data consisting of humans expressing their opinions using a combination of language and nonverbal behaviors. We use the CMU-MOSI dataset, which contains 2199 video segments annotated for sentiment in the range [−3, +3] (Zadeh et al., 2016). CMU-MOSI and related multimodal language datasets have been studied in the NLP community (Gu et al., 2018; Liu et al., 2018; Liang et al., 2018) in fully supervised settings but not from the perspective of supervised learning with imperfect data. We use 52 videos for training, 10 for validation and 31 for testing. GloVe word embeddings (Pennington et al., 2014), Facet (iMotions, 2017), and COVAREP (Degottex et al., 2014) features are extracted for the language, visual and acoustic modalities respectively. Forced alignment is performed using P2FA (Yuan and Liberman, 2008) to align visual and acoustic features to each word, resulting in a multimodal sequence. Our data splits, features, alignment, and preprocessing steps are consistent with prior work on the CMU-MOSI dataset (Liu et al., 2018).

4.2 Rank Analysis

We first study the effect of imperfect data on the rank of tensor M. We introduce the following types of noise, parametrized by noise level ∈ {0.0, 0.1, ..., 1.0}; higher noise levels imply more imperfection: 1) clean: no imperfection, 2) random drop: each entry is dropped independently with probability p ∈ noise level, and 3) structured drop: independently for each modality, each time step is chosen with probability p ∈ noise level; if a time step is chosen, all feature dimensions at that time step are dropped. For all imperfect settings, features are dropped during both training and testing.
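The two corruption schemes can be sketched as follows, assuming a (T, d) feature matrix per modality and that "dropped" means set to zero (the helper names are ours):

```python
import numpy as np

def random_drop(x, p, rng):
    """Drop each entry of a (T, d) modality sequence independently w.p. p."""
    return x * (rng.random(x.shape) >= p)

def structured_drop(x, p, rng):
    """Drop whole time steps: each of the T rows is zeroed out w.p. p."""
    keep = rng.random((x.shape[0], 1)) >= p
    return x * keep

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 16))
x_rand = random_drop(x, 0.3, rng)
x_struct = structured_drop(x, 0.3, rng)

# structured drop zeroes entire rows, random drop scatters zeros
rows = (x_struct == 0).all(axis=1) | (x_struct == x).all(axis=1)
print(rows.all())  # True: every row is either fully dropped or untouched
```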

We would like to show how the tensor ranks vary under different imperfection settings. However, as mentioned above, determining the exact rank of a tensor is an NP-hard problem (Friedland and Lim, 2014). In order to analyze the effect of imperfections on tensor rank, we perform CP decomposition (equation (5)) on the tensor representations under different rank settings r and measure the reconstruction error ε:

ε = min_{w_m^i} ∥ ( ∑_{i=1}^{r} ⊗_{m=1}^{M} w_m^i ) − X ∥_F.   (5)

Given the true rank r∗, ε will be high at ranks r < r∗, while ε will be approximately zero at ranks r ≥ r∗ (for example, a rank 3 tensor would display a large reconstruction error with CP decomposition at rank 1, but almost zero error with CP decomposition at rank 3). By analyzing the effect of r on ε, we are then able to derive a surrogate r for the true rank r∗.
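This surrogate-rank procedure can be reproduced with a plain alternating-least-squares CP fit; the implementation below is our own minimal sketch (the paper does not specify its solver). The reconstruction error of equation (5) stays large below the true rank and should collapse once the decomposition rank reaches it.

```python
import numpy as np

def cp_als_error(X, rank, iters=300, seed=0):
    """Approximate the reconstruction error of equation (5) for an
    order-3 tensor via alternating least squares on the CP factors."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.normal(size=(I, rank))
    B = rng.normal(size=(J, rank))
    C = rng.normal(size=(K, rank))
    # column-wise Khatri-Rao product, matching the C-order unfoldings below
    kr = lambda U, V: (U[:, None, :] * V[None, :, :]).reshape(-1, U.shape[1])
    X0 = X.reshape(I, -1)                      # mode-0 unfolding
    X1 = np.moveaxis(X, 1, 0).reshape(J, -1)   # mode-1 unfolding
    X2 = np.moveaxis(X, 2, 0).reshape(K, -1)   # mode-2 unfolding
    for _ in range(iters):
        A = X0 @ np.linalg.pinv(kr(B, C).T)
        B = X1 @ np.linalg.pinv(kr(A, C).T)
        C = X2 @ np.linalg.pinv(kr(A, B).T)
    return np.linalg.norm(X - np.einsum('ir,jr,kr->ijk', A, B, C))

# a tensor of true rank (at most) 3: the error is large at rank 1 and
# typically near zero once the decomposition rank reaches 3
rng = np.random.default_rng(1)
F = [rng.normal(size=(d, 3)) for d in (6, 7, 8)]
X = np.einsum('ir,jr,kr->ijk', *F)
errs = [cp_als_error(X, r) for r in (1, 2, 3)]
```

Plotting such errors against r for clean versus corrupted tensors reproduces the shape of Figure 3(a): corruption pushes the error curve up and to the right.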

Using this approach, we experimented on CMU-MOSI and the results are shown in Figure 3(a). We observe that imperfection leads to an increase in (approximate) tensor rank as measured


(a) CP decomposition error of M under random and structured dropping of features. Imperfect data leads to an increase in decomposition error and an increase in (approximate) tensor rank.

(b) Sentiment classification accuracy under random drop (i.e. dropping entries randomly with probability p ∈ noise level). T2FN with rank regularization (green) performs well.

(c) Sentiment classification accuracy under structured drop (dropping entire time steps randomly with probability p ∈ noise level). T2FN with rank regularization (green) performs well.

Figure 3: (a) Effect of imperfect data on tensor rank. (b) and (c): CMU-MOSI test accuracy under imperfect data.

by reconstruction error (the graph shifts outwards and to the right), supporting our hypothesis that imperfect data increases tensor rank (§3.3).

4.3 Prediction Results

Our next experiment tests the ability of our model to learn robust representations despite data imperfections. We use the tensor M for prediction and report binary classification accuracy on the CMU-MOSI test set. We compare to several baselines: Early Fusion (EF)-LSTM, Late Fusion (LF)-LSTM, TFN, and T2FN without rank regularization. These results are shown in Figure 3(b) for random drop and Figure 3(c) for structured drop. T2FN with rank regularization maintains good performance despite imperfections in data. We also observe that our model's improvement is more significant in random drop settings, which result in a higher tensor rank than structured drop settings (from Figure 3(a)). This supports our hypothesis that our model learns robust representations when imperfections that increase tensor rank are introduced. On the other hand, the existing baselines suffer in the presence of imperfect data.

5 Discussion and Future Work

We acknowledge that there are other alternative methods to upper bound the true rank of a tensor (Alexeev et al., 2011; Atkinson and Lloyd, 1980; Ballico, 2014). From a theoretical perspective, there exists a trade-off between the cost of computation and the tightness of the approximation. In addition, the tensor rank can (far) exceed the maximum dimension, and a best low-rank approximation for tensors may not even exist (de Silva and Lim, 2008). While our tensor rank regularization method seems to work well empirically, there is certainly room for a more thorough theoretical analysis of constructing and regularizing tensor representations for multimodal learning.
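The cheap bound our construction affords can be made concrete with a minimal numpy sketch (our illustration, not the paper's released code): a tensor assembled as a sum of T rank-one outer products has CP rank at most T, and summing the products of per-timestep modality norms gives a differentiable surrogate, since by the triangle inequality this sum upper-bounds the norm of the full tensor.

```python
import numpy as np

def build_tensor(h1, h2, h3):
    # M = sum_t h1[t] ⊗ h2[t] ⊗ h3[t]; each summand is rank one, so the
    # CP rank of M is at most the number of timesteps T.
    return np.einsum('ti,tj,tk->ijk', h1, h2, h3)

def rank_surrogate(h1, h2, h3):
    # Sum over timesteps of the product of modality norms. The Frobenius
    # norm of a rank-one term a ⊗ b ⊗ c equals ||a||·||b||·||c||, so by
    # the triangle inequality this sum upper-bounds ||M||_F, giving a
    # differentiable penalty that discourages many large summands.
    return float(np.sum(np.linalg.norm(h1, axis=1)
                        * np.linalg.norm(h2, axis=1)
                        * np.linalg.norm(h3, axis=1)))

rng = np.random.default_rng(0)
T, dl, dv, da = 5, 4, 3, 2          # timesteps and toy modality dims
hl, hv, ha = (rng.standard_normal((T, d)) for d in (dl, dv, da))

M = build_tensor(hl, hv, ha)
penalty = rank_surrogate(hl, hv, ha)
# The surrogate dominates the Frobenius norm of M (triangle inequality).
assert np.linalg.norm(M) <= penalty + 1e-9
```

Tighter alternatives from the literature cited above would trade this O(T) computation for more expensive spectral or combinatorial bounds.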

6 Conclusion

This paper presented a regularization method based on tensor rank minimization. We observe that clean multimodal sequences often exhibit correlations across time and modalities, which lead to low-rank tensors, while the presence of imperfect data breaks these correlations and results in tensors of higher rank. We designed a model, the Temporal Tensor Fusion Network, to learn such tensor representations and effectively regularize their rank. Experiments on multimodal language data show that our model achieves good results across various levels of imperfection. We hope to inspire future work on regularizing tensor representations of multimodal data for robust prediction in the presence of imperfect data.

Acknowledgements

PPL, ZL, and LM are partially supported by the National Science Foundation (Awards #1750439 and #1722822) and Samsung. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of Samsung and NSF, and no official endorsement should be inferred. YHT and RS are supported in part by DARPA HR00111990016, AFRL FA8750-18-C-0014, NSF IIS1763562, Apple, and a Google focused award. QZ is supported by JSPS KAKENHI (Grant No. 17K00326). We also acknowledge NVIDIA's GPU support and the anonymous reviewers for their constructive comments.


References

Boris Alexeev, Michael A. Forbes, and Jacob Tsimerman. 2011. Tensor rank: Some lower and upper bounds. CoRR, abs/1102.0072.

Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. 2014. Tensor decompositions for learning latent variable models. J. Mach. Learn. Res., 15(1):2773–2832.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV).

M.D. Atkinson and S. Lloyd. 1980. Bounds on the ranks of some 3-tensors. Linear Algebra and its Applications, 31:19–31.

E. Ballico. 2014. An upper bound for the real tensor rank and the real symmetric tensor rank in terms of the complex ranks. Linear and Multilinear Algebra, 62(11):1546–1552.

Lei Cai, Zhengyang Wang, Hongyang Gao, Dinggang Shen, and Shuiwang Ji. 2018. Deep adversarial learning for multi-modality missing data completion. In KDD '18, pages 1158–1166.

J. Douglas Carroll and Jih-Jie Chang. 1970. Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition. Psychometrika, 35(3):283–319.

Yi Chang, Luxin Yan, Houzhang Fang, Sheng Zhong, and Zhijun Zhang. 2017. Weighted low-rank tensor recovery for hyperspectral image restoration. CoRR, abs/1709.00192.

Xiai Chen, Zhi Han, Yao Wang, Qian Zhao, Deyu Meng, Lin Lin, and Yandong Tang. 2017. A general model for robust tensor factorization with unknown noise. CoRR, abs/1705.06755.

Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. 2018. Embodied Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jose M.F. Moura, Devi Parikh, and Dhruv Batra. 2017. Visual Dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. COVAREP: A collaborative voice analysis repository for speech technologies. In ICASSP. IEEE.

Jean-Benoit Delbrouck and Stephane Dupont. 2017. Multimodal compact bilinear pooling for multimodal neural machine translation. CoRR, abs/1703.08084.

Haiyan Fan, Yunjin Chen, Yulan Guo, Hongyan Zhang, and Gangyao Kuang. 2017. Hyperspectral image restoration using low-rank tensor recovery. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, PP:1–16.

Shmuel Friedland and Lek-Heng Lim. 2014. Computational complexity of tensor nuclear norm. CoRR, abs/1410.6072.

Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847.

Xinbo Gao, Yimin Yang, Dacheng Tao, and Xuelong Li. 2009. Discriminative optical flow tensor for video semantic analysis. Computer Vision and Image Understanding, 113(3):372–383. Special Issue on Video Analysis.

Yue Gu, Kangning Yang, Shiyu Fu, Shuhong Chen, Xinyu Li, and Ivan Marsic. 2018. Multimodal affective analysis using hierarchical attention strategy with word-level alignment. In ACL.

Shohei Hidaka and Chen Yu. 2010. Analyzing multimodal time series as dynamical systems. In International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction, ICMI-MLMI '10, pages 53:1–53:8, New York, NY, USA. ACM.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Shenglong Hu. 2014. Relations of the Nuclear Norms of a Tensor and its Matrix Flattenings. arXiv e-prints, page arXiv:1412.2443.

Qiuyuan Huang, Paul Smolensky, Xiaodong He, Li Deng, and Dapeng Oliver Wu. 2017. Tensor product generation networks. CoRR, abs/1709.09118.

Soshi Iba, Christiaan J. J. Paredis, and Pradeep K. Khosla. 2005. Interactive multimodal robot programming. The International Journal of Robotics Research, 24(1):83–104.

iMotions. 2017. Facial expression analysis.

Jean Kossaifi, Zachary C. Lipton, Aran Khanna, Tommaso Furlanello, and Anima Anandkumar. 2017. Tensor regression networks. CoRR, abs/1707.08308.

Karthik Lakshmanan, Patrick T. Sadtler, Elizabeth C. Tyler-Kabara, Aaron P. Batista, and Byron M. Yu. 2015. Extracting low-dimensional latent structure from time series in the presence of delays. Neural Computation, 27:1825–1856.


Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-rank tensors for scoring dependency structures. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1381–1391.

Paul Pu Liang, Ziyin Liu, Amir Zadeh, and Louis-Philippe Morency. 2018. Multimodal language analysis with recurrent multistage fusion. EMNLP.

Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, AmirAli Bagher Zadeh, and Louis-Philippe Morency. 2018. Efficient low-rank multimodal fusion with modality-specific factors. In ACL.

Zhen Long, Yipeng Liu, Longxi Chen, and Ce Zhu. 2018. Low rank tensor completion for multiway visual data. CoRR, abs/1805.03967.

Rada Mihalcea. 2012. Multimodal sentiment analysis. In Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis, WASSA '12, pages 1–1, Stroudsburg, PA, USA. Association for Computational Linguistics.

Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. 2011. Towards multimodal sentiment analysis: Harvesting opinions from the web. In Proceedings of the 13th International Conference on Multimodal Interfaces, pages 169–176. ACM.

Madhav Nimishakavi, Pratik Kumar Jawanpuria, and Bamdev Mishra. 2018. A dual framework for low-rank tensor completion. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5484–5495. Curran Associates, Inc.

Shruti Palaskar, Ramon Sanabria, and Florian Metze. 2018. End-to-end multimodal speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

Wenzhe Pei, Tao Ge, and Baobao Chang. 2014. Max-margin tensor neural network for Chinese word segmentation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 293–303. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Hai Pham, Paul Pu Liang, Thomas Manzini, Louis-Philippe Morency, and Barnabas Poczos. 2019. Found in translation: Learning robust joint representations by cyclic translations between modalities. AAAI.

Xipeng Qiu and Xuanjing Huang. 2015. Convolutional neural tensor network architecture for community-based question answering. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pages 1305–1311. AAAI Press.

James Rossiter. 2011. Multimodal intent recognition for natural human-robotic interaction.

Alexander I. Rudnicky. 2005. Multimodal Dialogue Systems, pages 3–11. Springer Netherlands, Dordrecht.

Edward Schmerling, Karen Leung, Wolf Vollprecht, and Marco Pavone. 2017. Multimodal probabilistic model-based planning for human-robot interaction. CoRR, abs/1710.09483.

Hendra Setiawan, Zhongqiang Huang, Jacob Devlin, Thomas Lamar, Rabih Zbib, Richard Schwartz, and John Makhoul. 2015. Statistical machine translation features with multitask tensor networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 31–41. Association for Computational Linguistics.

Vin de Silva and Lek-Heng Lim. 2008. Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM J. Matrix Anal. Appl., 30(3):1084–1127.

Kihyuk Sohn, Wenling Shang, and Honglak Lee. 2014. Improved multimodal deep learning with variation of information. In NIPS.

Vivek Srikumar and Christopher D. Manning. 2014. Learning distributed representations for structured output prediction. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3266–3274. Curran Associates, Inc.

Nitish Srivastava and Ruslan Salakhutdinov. 2014. Multimodal learning with deep Boltzmann machines. JMLR, 15.

Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. MovieQA: Understanding stories in movies through question-answering. CoRR, abs/1512.02902.

Luan Tran, Xiaoming Liu, Jiayu Zhou, and Rong Jin. 2017. Missing modalities imputation via cascaded residual autoencoder. In CVPR.

F. Wu, Y. Liu, and Y. Zhuang. 2009. Tensor-based transductive learning for multimodality video semantic concept detection. IEEE Transactions on Multimedia, 11(5):868–878.


Xiao Yang, Ersin Yumer, Paul Asente, Mike Kraley, Daniel Kifer, and C. Lee Giles. 2017. Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Jiahong Yuan and Mark Liberman. 2008. Speaker identification on the SCOTUS corpus. Journal of the Acoustical Society of America.

Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. In EMNLP, pages 1114–1125.

Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259.

Qingchen Zhang, Laurence T. Yang, Xingang Liu, Zhikui Chen, and Peng Li. 2017. A Tucker deep computation model for mobile multimedia feature learning. ACM Trans. Multimedia Comput. Commun. Appl., 13(3s):39:1–39:18.