Enhancing Language & Vision with Knowledge: The Case of Visual Question Answering

Freddy Lecue, CortAIx, Thales, Canada
Inria, France
http://www-sop.inria.fr/members/Freddy.Lecue/
Maryam Ziaeefard, François Gardères (contributors), CortAIx, Thales, Canada

(Keynote) 2020 International Conference on Advance in Ambient Computing and Intelligence
September 13th, 2020
Introduction
What is Visual Question Answering (aka VQA)?
A VQA model combines visual and textual features in order to answer questions grounded in an image.

Examples: What's in the background? Where is the child sitting?
Classic Approaches to VQA
Most approaches combine Convolutional Neural Networks (CNN) with Recurrent Neural Networks (RNN) to learn a mapping directly from input images (vision) and questions to answers (language):

Visual Question Answering: A Survey of Methods and Datasets. Wu et al. (2016)
Evaluation [1]
Acc(ans) = min(1, #{humans that provided ans} / 3)

An answer is deemed 100% accurate if at least 3 workers provided that exact answer.

Example: What sport can you use this for?
#{humans that provided ans}: race (6 times), motocross (2 times), ride (2 times)
Predicted answer: motocross
Acc(motocross) = min(1, 2/3) = 0.66

[1] VQA: Visual Question Answering. Agrawal et al. (ICCV 2015)
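The metric above can be sketched as a small helper function (the function name and the test answers are ours, taken from the slide's example):

```python
from collections import Counter

def vqa_accuracy(predicted, human_answers):
    """VQA accuracy: min(1, #humans that provided the answer / 3)."""
    counts = Counter(human_answers)
    return min(1.0, counts[predicted] / 3)

# Slide example: workers answered race (6x), motocross (2x), ride (2x).
answers = ["race"] * 6 + ["motocross"] * 2 + ["ride"] * 2
vqa_accuracy("motocross", answers)  # min(1, 2/3), reported as 0.66
vqa_accuracy("race", answers)       # min(1, 6/3) = 1.0
```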
VQA Models - State-of-the-Art
Major breakthrough in VQA (models and real-image dataset)
Accuracy Results: DAQUAR [2] (13.75%), VQA 1.0 [1] (54.06%), Visual Madlibs [3] (47.9%), Visual7W [4] (55.6%), Stacked Attention Networks [5] (VQA 2.0: 58.9%, DAQUAR: 46.2%), VQA 2.0 [6] (62.1%), Visual Genome [7] (41.1%), Up-Down [8] (VQA 2.0: 63.2%), Teney et al. (VQA 2.0: 63.15%), XNM Net [9] (VQA 2.0: 64.7%), ReGAT [10] (VQA 2.0: 67.18%), ViLBERT [11] (VQA 2.0: 70.55%), GQA [12] (54.06%)

[2] Malinowski et al., [3] Yu et al., [4] Zhu et al., [5] Yang et al., [6] Goyal et al., [7] Krishna et al., [8] Anderson et al., [9] Shi et al., [10] Li et al., [11] Lu et al., [12] Hudson et al.
Limitations
- Answers are required to be in the image.
- Knowledge is limited.
- Some questions cannot be correctly answered, as some level of (basic) reasoning is required.

Alternative strategy: integrating external knowledge such as domain Knowledge Graphs.

Q: What sort of vehicle uses this item?
Q: When was the soft drink company shown first created?
Knowledge-based VQA models - State-of-the-Art
- Exploiting associated facts for each question in VQA datasets [18], [19];
- Identifying search queries for each question-image pair and using a search API to retrieve answers ([20], [21]).

Accuracy Results: Multimodal KB [17] (NA), Ask Me Anything [18] (59.44%), Weng et al. (VQA 2.0: 59.50%), KB-VQA [19] (71%), FVQA [20] (56.91%), Narasimhan et al. (ECCV 2018) (FVQA: 62.2%), Narasimhan et al. (NeurIPS 2018) (FVQA: 69.35%), OK-VQA [21] (27.84%), KVQA [22] (59.2%)

[17] Zhu et al., [18] Wu et al., [19] Wang et al., [20] Wang et al., [21] Marino et al., [22] Shah et al.
Our Contribution
Yet another knowledge-base-driven approach? No.

- We go one step further and implement a VQA model that relies on large-scale knowledge graphs.
- No dedicated knowledge annotations in VQA datasets, nor search queries.
- Implicit integration of common sense knowledge through knowledge graphs.
Knowledge Graphs (1)
- Set of (subject, predicate, object) triples (SPO): subject and object are entities, and predicate is the relationship holding between them.
- Each SPO triple denotes a fact, i.e. the existence of an actual relationship between two entities.
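The SPO view above can be sketched as a tiny triple store (all entities and relations below are invented for illustration; real KGs hold millions of such facts):

```python
# A knowledge graph as a set of (subject, predicate, object) triples.
# Entities and relations here are illustrative, not from a real KG.
kg = {
    ("cat", "IsA", "animal"),
    ("cat", "CapableOf", "purring"),
    ("animal", "CapableOf", "eating"),
}

def objects_of(subject, predicate):
    """All objects o such that (subject, predicate, o) is a fact."""
    return {o for s, p, o in kg if s == subject and p == predicate}

objects_of("cat", "IsA")  # {'animal'}
```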
Knowledge Graphs (2)
- Manual Construction: curated, collaborative.
- Automated Construction: semi-structured, unstructured.

Right: Linked Open Data cloud, over 1200 interlinked KGs encoding more than 200M facts about more than 50M entities. Spans a variety of domains: Geography, Government, Life Sciences, Linguistics, Media, Publications, Cross-domain.
Problem Formulation
Our Machine Learning Pipeline
V: Language-attended visual features.
Q: Vision-attended language features.
G: Concept-language representation.
Image Representation - Faster R-CNN
✓ Post-processing CNN with region-specific image features: Faster R-CNN [24], suited for VQA [23].

✓ We use a pretrained Faster R-CNN to extract 36 objects per image and their bounding box coordinates.

Other region proposal networks could be trained as an alternative approach.

[23] Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge. Teney et al. (2017)
[24] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Ren et al. (2015)
Language (Question) Representation - BERT
✓ BERT embedding [25] for question representation. Each question has 16 tokens.

✓ BERT shows the value of transfer learning in NLP and makes use of the Transformer, an attention mechanism that learns contextual relations between words in a text.

[25] Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805 (2018)
Knowledge Graph Representation - Graph Embeddings
ConceptNet: the only KG designed to understand the meanings of the words people use and to include common sense knowledge.

Pre-trained ConceptNet embeddings [26] (dimension = 200).

[26] Commonsense Knowledge Base Completion with Structural and Semantic Context. Malaviya et al. (AAAI 2020)
Attention Mechanism (General Idea)
- Attention learns a context vector, informing about the most important information in the inputs for given outputs.

Example: attention in machine translation (input: English, output: French):
Attention Mechanism (More Technical)
Scaled Dot-Product Attention [27].
Query Q: target / output embedding.
Keys K, Values V: source / input embedding.

✓ Machine translation example: Q is an embedding vector from the target sequence; K, V are embedding vectors from the source sequence.

✓ Dot-product similarity between Q and K determines attentional distributions over the V vectors.

✓ The resulting weight-averaged value vector forms the output of the attention block.
[27] Attention Is All You Need. Vaswani et al. (NeurIPS 2017)
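A dependency-free sketch of scaled dot-product attention as described above. Pure-Python lists stand in for tensors and there are no learned projection matrices; this only illustrates the score/softmax/weighted-average pipeline:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V,
    with queries Q, keys K, and values V given as lists of vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Dot-product similarity between the query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs

# One query attending over two source positions.
scaled_dot_product_attention([[1.0, 0.0]],
                             [[1.0, 0.0], [0.0, 1.0]],
                             [[1.0, 0.0], [0.0, 1.0]])
```

The query is closer to the first key, so the output leans towards the first value vector.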
Attention Mechanism - Transformer
Multi-head Attention: any given word can have multiple meanings → more than one query-key-value set.

Encoder-style Transformer Block: a multi-headed attention block followed by a small fully-connected network, both wrapped in a residual connection and a normalization layer.
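A toy, single-head version of such an encoder block, purely to show the residual-plus-normalization wiring. All projections are identity and the feed-forward network is a bare ReLU, so this is not the actual BERT/ViLBERT implementation, just the structure:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean, unit variance."""
    mu = sum(x) / len(x)
    var = sum((xi - mu) ** 2 for xi in x) / len(x)
    return [(xi - mu) / math.sqrt(var + eps) for xi in x]

def self_attention(X):
    """Single-head self-attention with identity projections (toy)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, X)) for j in range(d)])
    return out

def feed_forward(x):
    """Toy position-wise feed-forward network: just a ReLU."""
    return [max(0.0, xi) for xi in x]

def transformer_block(X):
    """Attention and FFN, each wrapped in a residual + layer norm."""
    attended = self_attention(X)
    h = [layer_norm([a + b for a, b in zip(x, att)])
         for x, att in zip(X, attended)]
    return [layer_norm([a + b for a, b in zip(x, feed_forward(x))])
            for x in h]
```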
Vision-Language (Question) Representation
Co-TRM: joint vision-attended language features and language-attended visual features to learn joint representations using the ViLBERT model [28].

[28] ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Lu et al. (2019)
Concept-Language (Question) Representation
✓ Question features are conditioned on knowledge graph embeddings.

✓ The concept-language module is a series of Transformer blocks that attends to question tokens based on KG embeddings.

✓ The input consists of queries from the question embeddings, and keys and values from the KG embeddings.

✓ The concept-language representation enhances question comprehension with the information found in the knowledge graph.
Concept-Vision-Language Module
Compact Trilinear Interaction (CTI) [29] applied to (V, Q, G) to achieve the joint representation of concept, vision, and language.
- V represents language-attended visual features.
- Q represents vision-attended language features.
- G represents concept-attended language features.

✓ Trilinear interaction learns the interaction between V, Q, and G by computing an attention map over all possible combinations of V, Q, G. These attention maps are used as weights; the joint representation is then computed as a weighted sum over all possible combinations. (There are n1 × n2 × n3 possible combinations over the three inputs with dimensions n1, n2, and n3.)
[29] Compact trilinear interaction for visual question answering. Do et al. (ICCV 2019)
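An unparameterized toy of the combinatorial idea above. This is not the paper's CTI, which relies on learned parameters and a low-rank decomposition to stay compact; here every (v, q, g) combination simply gets a scalar score, the n1 × n2 × n3 scores are softmaxed into attention weights, and the joint vector is the weighted sum:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def trilinear_joint(V, Q, G):
    """Toy trilinear interaction: enumerate all (v, q, g) combinations,
    fuse each by element-wise product, softmax their scalar scores into
    attention weights, and return the weighted sum of the fused vectors.
    All vectors share the same dimension d."""
    combos, scores = [], []
    for v in V:
        for q in Q:
            for g in G:
                fused = [vi * qi * gi for vi, qi, gi in zip(v, q, g)]
                combos.append(fused)
                scores.append(sum(fused))  # scalar trilinear score
    weights = softmax(scores)  # one weight per combination
    d = len(V[0])
    return [sum(w * c[j] for w, c in zip(weights, combos)) for j in range(d)]
```

With n1, n2, n3 items in V, Q, G, the loop visits exactly n1 · n2 · n3 combinations, which is why the real CTI needs a compact parameterization.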
Implementation Details
- Vision-Language Module: 6 layers of Transformer blocks, with 8 and 12 attention heads in the visual and linguistic streams, respectively.
- Concept-Language Module: 6 layers of Transformer blocks, 12 attention heads.
- Concept-Vision-Language Module: embedding size = 1024.
- Classifier: binary cross-entropy loss, batch size = 1024, 20 epochs, BertAdam optimizer, initial learning rate = 4e-5.
- Experiments conducted on 8 NVIDIA TitanX GPUs.
Datasets (1)
VQA 2.0 [30]
- 1.1 million questions; 204,721 images extracted from the COCO dataset (265,016 images).
- At least 3 questions (5.4 questions on average) are provided per image.
- Each question: 10 different answers (through crowd sourcing).
- Question categories: Yes/No, Number, and Other.
- Special interest: the "Other" category.

[30] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. Goyal et al. (CVPR 2017)
Datasets (2)
Outside Knowledge-VQA (OK-VQA) [31]
- The only VQA dataset that requires external knowledge.
- 14,031 images and 14,055 questions.
- Divided into eleven categories: Vehicles and Transportation (VT); Brands, Companies and Products (BCP); Objects, Materials and Clothing (OMC); Sports and Recreation (SR); Cooking and Food (CF); Geography, History, Language and Culture (GHLC); People and Everyday Life (PEL); Plants and Animals (PA); Science and Technology (ST); Weather and Climate (WC); and "Other".

[31] OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. Marino et al. (CVPR 2019)
Results and Lessons Learnt (1)
Model        Overall  Yes/No  Number  Other
Up-Down      63.2     80.3    42.8    55.8
XNM Net      64.7     -       -       -
ReGAT        67.18    -       -       -
ViLBERT      68.14    82.99   54.27   67.15
ConceptBert  71.81    81.56   61.29   72.59

Table 1: Our Model vs. State-of-the-Art Approaches on VQA 2.0
- Integrating common sense knowledge improves overall performance (5.3% higher).
- Major improvement in the "Other" category.
- ViLBERT outperforms on Yes/No questions, as they lean more towards direct analysis of the image.
Results and Lessons Learnt (2)
Model        Overall  VT     BCP    OMC    WC     GHLC
XNM Net      25.24    26.84  21.86  18.22  42.64  23.83
MUTAN+AN     27.84    25.56  23.95  26.87  39.84  20.71
ViLBERT      31.47    26.74  29.72  30.65  46.20  31.47
ConceptBert  36.10    30.02  28.92  30.38  53.13  36.91

Model        CF     PEL    PA     ST     SR     Other
XNM Net      23.93  20.79  24.81  21.43  33.02  24.39
MUTAN+AN     29.94  25.05  29.70  24.76  33.44  23.62
ViLBERT      31.93  26.54  30.49  27.38  35.24  28.72
ConceptBert  37.04  31.55  37.88  34.38  39.85  37.08

Table 2: Our Model vs. State-of-the-Art Approaches on OK-VQA
- Our model is better in the PA, ST, and CF categories (14.7% higher).
- ViLBERT outperforms our model in the OMC and BCP categories; these questions lean more towards direct analysis of the image.
Qualitative Results (1)
Q: What is the likely relationship of these animals?  V: friends  C: mother
Q: What is the lady looking at?  V: phone  C: camera

Figure 1: VQA 2.0 examples in category "Other": ConceptBert (C) outperforms ViLBERT (V) on question Q.
Qualitative Results (2)
Q: How big is the distance between the two players?  V: yes  C: 20ft  GT: 10ft
Q: What play is being advertised on the side of the bus?  V: nothing  C: movie  GT: smurfs

Figure 2: VQA 2.0 examples: ConceptBert (C) identifies answers of the same type as the ground truth (GT) when compared with ViLBERT (V) on question Q.
Qualitative Results (3)
Q: What holiday is associated with this animal?  V: sleep  C: halloween
Q: What do these animals eat?  V: water  C: plant

Figure 3: OK-VQA examples: ConceptBert (C) outperforms ViLBERT (V) on question Q.
Qualitative Results (4)
Q: Where can you buy contemporary furniture?  V: couch  C: store  GT: ikea
Q: What kind of boat is this?  V: ship  C: freight  GT: tug

Figure 4: OK-VQA examples: ConceptBert (C) identifies answers of the same type as the ground truth (GT) when compared with ViLBERT (V) on question Q.
Conclusion and Future Work
- A concept-aware VQA model for questions which require common sense knowledge from external structured content.
- A novel representation of questions enhanced with commonsense knowledge, exploiting Transformer blocks and knowledge graph embeddings.
- Aggregation of vision, language, and concept embeddings to learn a joint concept-vision-language embedding for VQA tasks.

Future work
- Integrating explicit relations between entities and objects in the knowledge graph.
- Evaluation through a semantic metric.
- Integrating spatial relations or scene graphs into VQA models.