
CNM: An Interpretable Complex-valued Network for Matching

Qiuchi Li*
University of Padua, Padua, Italy
[email protected]

Benyou Wang*
University of Padua, Padua, Italy
[email protected]

Massimo Melucci†
University of Padua, Padua, Italy
[email protected]

Abstract

This paper seeks to model human language with the mathematical framework of quantum physics. With the well-designed mathematical formulations of quantum physics, this framework unifies different linguistic units in a single complex-valued vector space, e.g. words as particles in quantum states and sentences as mixed systems. A complex-valued network is built to implement this framework for semantic matching. With well-constrained complex-valued components, the network admits interpretations with explicit physical meanings. The proposed complex-valued network for matching (CNM)1 achieves comparable performance to strong CNN and RNN baselines on two benchmarking question answering (QA) datasets.

1 Introduction

There is growing concern about the interpretability of neural networks. Along with the increasing power of neural networks comes the challenge of interpreting the numerical representations of network components into human-understandable language. Lipton (2018) points out two important factors for a model to be interpretable, namely post-hoc interpretability and transparency. The former refers to explanations of why a model works after it is executed, while the latter concerns the self-explainability of components through some mechanism in the design phase of the model.

We seek inspiration from quantum physics to build transparent and post-hoc interpretable networks for modeling human language. The emerging research field of quantum cognition suggests that there exist quantum-like phenomena in human cognition (Aerts and Sozzo, 2014), especially language understanding (Bruza et al., 2008).

* Equal contribution. † Corresponding author.

1 https://github.com/wabyking/qnn.git

Intuitively, a sentence can be treated as a physical system with multiple words (like particles), and these words are usually polysemous (superposed) and correlated (entangled) with each other. Motivated by these existing works, we aim to investigate the following research questions (RQs).

RQ1: Is it possible to model human language with the mathematical framework of quantum physics?

Towards this question, we build a novel quantum-theoretic framework for modeling language, in an attempt to capture the quantumness in the cognitive aspects of human language. The framework models different linguistic units as quantum states by adopting quantum probability (QP), the mathematical framework of quantum physics that models uncertainty, on a uniform Semantic Hilbert Space (SHS).

Complex values are crucial in the mathematical framework of quantum physics. In order to preserve physical properties, the linguistic units have to be represented as complex vectors or matrices. This naturally gives rise to another research question:

RQ2: Can we benefit from the complex-valued representation of human language in a real natural language processing (NLP) scenario?

To this end, we formulate a linguistic unit as a complex-valued vector, and link its length and direction to different physical meanings: the length represents the relative weight of the word while the direction is viewed as a superposition state. The superposition state is further represented in an amplitude-phase manner, with amplitudes corresponding to the lexical meaning and phases implicitly reflecting higher-level semantic aspects such as polarity, ambiguity or emotion.

In order to evaluate the above framework, we implement it as a complex-valued network (CNM) for semantic matching. The network is applied to the question answering task, which is the most typical matching task, aiming at selecting the best answer for a question from a pool of candidates. In order to facilitate local matching between n-grams of a sentence pair, we design a local matching scheme in CNM. Most state-of-the-art QA models are based on Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) and their many variants (Wang and Nyberg, 2015; Yang et al., 2016; Hu et al., 2014; Tan et al., 2015). However, with the opaque structures of convolutional kernels and recurrent cells, these models are hard for humans to understand. We argue that our model is advantageous in terms of interpretability.

Our proposed CNM is transparent in that it is designed in alignment with quantum physics. Experiments on benchmarking QA datasets show that CNM has comparable performance to strong CNN and RNN baselines, whilst admitting post-hoc interpretations in human-understandable language. We therefore answer RQ1 by claiming that it is possible to model human language with the quantum-theoretical framework proposed in this paper. Furthermore, an ablation study shows that the complex-valued word embedding performs better than its real counterpart, which allows us to answer RQ2 by claiming that we benefit from the complex-valued representation of natural language on the QA task.

2 Background

Here we briefly introduce quantum probability and discuss a relevant work on a quantum-inspired framework for QA.

2.1 Quantum Probability

Quantum probability provides a sound explanation for the phenomena and concepts of quantum mechanics, by formulating events as subspaces of a vector space with projective geometry.

2.1.1 Quantum Superposition

Quantum superposition is one of the fundamental concepts in quantum physics, which describes the uncertainty of a single particle. In the micro world, a particle like a photon can be in multiple mutually exclusive basis states simultaneously, with a probability distribution. In a two-dimensional example, the two basis vectors are denoted as $|0\rangle$ and $|1\rangle$.²

² We here adopt the widely used Dirac notation of quantum probability, in which a unit vector $\vec{u}$ and its transpose $\vec{u}^{T}$ are denoted as a ket $|u\rangle$ and a bra $\langle u|$ respectively.

Superposition is used to model a general state, which is a linear combination of basis vectors with complex-valued weights, such that

$$|\phi\rangle = \alpha_0 |0\rangle + \alpha_1 |1\rangle, \quad (1)$$

where $\alpha_0$ and $\alpha_1$ are complex scalars satisfying $0 \le |\alpha_0|^2 \le 1$, $0 \le |\alpha_1|^2 \le 1$ and $|\alpha_0|^2 + |\alpha_1|^2 = 1$. It follows that $|\phi\rangle$ is defined over the complex field. When $\alpha_0$ and $\alpha_1$ are non-zero values, the state $|\phi\rangle$ is said to be a superposition of the states $|0\rangle$ and $|1\rangle$, and the scalars $\alpha_0$ and $\alpha_1$ denote the probability amplitudes of the superposition.

2.1.2 Measurement

The uncertainty of an ensemble system with multiple particles is encapsulated as a mixed state, represented by a positive semi-definite matrix with unit trace called a density matrix: $\rho = \sum_{i=1}^{m} |\phi_i\rangle \langle\phi_i|$, where $\{|\phi_i\rangle\}_{i=1}^{m}$ are pure states like Eq. 1. In order to infer the probabilistic properties of $\rho$ in the state space, Gleason's theorem (Gleason, 1957; Hughes, 1992) is used to calculate the probability of observing $x$ through the projection measurement $|x\rangle \langle x|$, a rank-one projector denoted as the outer product of $|x\rangle$:

$$p_x(\rho) = \langle x| \rho |x\rangle = \mathrm{tr}(\rho |x\rangle \langle x|) \quad (2)$$

The measured probability $p_x(\rho)$ is a non-negative real-valued scalar, since both $\rho$ and $|x\rangle \langle x|$ are Hermitian. The unit trace property guarantees $\sum_{x \in X} p_x(\rho) = 1$ for $X$ being a set of orthogonal basis states.
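For concreteness, the following minimal NumPy sketch (ours, not part of the original implementation; the helper name `measure_probability` is a label of our own) illustrates Eq. 1 and Eq. 2: a mixed state is built from equally weighted pure states so that its trace is 1, and the probability of a measurement outcome is obtained as $\langle x|\rho|x\rangle$.

```python
import numpy as np

def normalize(v):
    """Scale v to unit length, i.e. a valid pure state (Eq. 1)."""
    return v / np.linalg.norm(v)

def measure_probability(rho, x):
    """Gleason's theorem (Eq. 2): p_x(rho) = <x| rho |x> = tr(rho |x><x|)."""
    return np.real(np.conj(x) @ rho @ x)

# Two pure states in a two-dimensional complex Hilbert space.
phi1 = normalize(np.array([1 + 1j, 0.5 - 0.2j]))
phi2 = normalize(np.array([0.3j, 1.0]))

# A mixed state: an equally weighted mixture of the two pure states (unit trace).
rho = 0.5 * np.outer(phi1, np.conj(phi1)) + 0.5 * np.outer(phi2, np.conj(phi2))

# Probabilities of the orthogonal basis outcomes |0> and |1> are non-negative and sum to 1.
basis = np.eye(2, dtype=complex)
probs = [measure_probability(rho, basis[k]) for k in range(2)]
print(probs, sum(probs))
```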

2.2 Neural Network based Quantum-like Language Model (NNQLM)

Based on the density matrix representation of documents in information retrieval (Van Rijsbergen, 2004; Sordoni et al., 2013), Zhang et al. (2018a) built a neural network with density matrices for question answering. This Neural Network based Quantum-like Language Model (NNQLM) embeds a word as a unit vector and a sentence as a real-valued density matrix. The distance between a pair of density matrices is obtained by extracting features from their matrix multiplication in two ways: NNQLM-I directly takes the trace of the resulting matrix, while NNQLM-II applies convolutional structures on top of the matrix to determine whether the pair of sentences match or not.

NNQLM is limited in that it does not make proper use of the full potential of the probabilistic properties of density matrices. By treating density matrices as ordinary real vectors (NNQLM-I) or matrices (NNQLM-II), the full potential of complex-valued formulations is largely ignored. Meanwhile, adding convolutional layers on top of a density matrix is more of an empirical workaround than an implementation of a theoretical framework.

In contrast, our complex-valued matching network is built on top of a quantum-theoretical framework for natural language. In particular, we introduce an indirect way to measure the distance between two density matrices through trainable measurement operations, which takes advantage of the probabilistic properties of density matrices and also provides a flexible matching score driven by training data.

3 Semantic Hilbert Space

Here we introduce the Semantic Hilbert Space $\mathcal{H}$ defined on a complex vector space $\mathbb{C}^n$, and three different linguistic units on the space, namely sememes, words and word combinations. The concept of semantic measurement is introduced last.

Sememes. We assume $\mathcal{H}$ is spanned by the set of orthogonal basis states $\{|e_j\rangle\}_{j=1}^{n}$ for sememes, which are the minimum semantic units of word meanings in language universals (Goddard and Wierzbicka, 1994). The unit state $|e_j\rangle$ can be seen as a one-hot vector, i.e., the j-th element of $|e_j\rangle$ is one while the other elements are zero, so that a set of orthogonal unit states is obtained. Semantic units of larger granularities are built on this set of sememe bases.

Words. Words are composed of sememes in superposition. Each word $w$ is a superposition over all sememes $\{|e_j\rangle\}_{j=1}^{n}$, or equivalently a unit-length vector on $\mathcal{H}$:

$$|w\rangle = \sum_{j=1}^{n} r_j e^{i\phi_j} |e_j\rangle, \quad (3)$$

where $i$ is the imaginary unit with $i^2 = -1$. In the above expression, $\{r_j\}_{j=1}^{n}$ are non-negative real-valued amplitudes satisfying $\sum_{j=1}^{n} r_j^2 = 1$ and $\phi_j \in [-\pi, \pi]$ are the corresponding complex phases. In comparison to Eq. 1, $\{r_j e^{i\phi_j}\}$ are the polar-form representations of the complex-valued scalars $\{\alpha_j\}$.
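As an illustration (our own sketch, not code from the released repository), a word state in the amplitude-phase form of Eq. 3 can be built from an amplitude vector $r$ and a phase vector $\phi$:

```python
import numpy as np

def word_state(r, phi):
    """Build |w> = sum_j r_j * exp(i * phi_j) |e_j|  (Eq. 3).

    r   : non-negative amplitudes, rescaled so that sum_j r_j^2 = 1
    phi : phases in [-pi, pi]
    """
    r = np.asarray(r, dtype=float)
    r = r / np.linalg.norm(r)          # enforce sum_j r_j^2 = 1
    return r * np.exp(1j * np.asarray(phi, dtype=float))

w = word_state(r=[0.2, 0.5, 0.8], phi=[0.3, -1.2, 2.0])
print(np.linalg.norm(w))  # 1.0: a unit-length vector on the Semantic Hilbert Space
```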

Word Combinations. We view a combination of words (e.g. a phrase, n-gram, sentence or document) as a mixed system composed of individual words, and its representation is computed as follows:

$$\rho = \sum_{j=1}^{m} \frac{1}{m} |w_j\rangle \langle w_j|, \quad (4)$$

where $m$ is the number of words and $|w_j\rangle$ is the word superposition state in Eq. 3, allowing multiple occurrences. Eq. 4 produces a density matrix $\rho$ for the semantic composition of words. It also describes a non-classical distribution over the set of sememes: the complex-valued off-diagonal elements describe the correlations between sememes, while the diagonal entries (guaranteed to be real by construction) correspond to a standard probability distribution. The off-diagonal elements give our framework the potential to model interactions between the basic sememe bases, which are usually considered mutually independent.
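A small sketch (ours, under the uniform-weight assumption of Eq. 4) of the corresponding density matrix and its defining properties:

```python
import numpy as np

def density_matrix(word_states):
    """rho = (1/m) * sum_j |w_j><w_j|  (Eq. 4) for a list of unit word states."""
    m = len(word_states)
    n = word_states[0].shape[0]
    rho = np.zeros((n, n), dtype=complex)
    for w in word_states:
        rho += np.outer(w, np.conj(w)) / m
    return rho

rng = np.random.default_rng(0)
# Three toy word states: random complex vectors normalized to unit length.
states = []
for _ in range(3):
    v = rng.normal(size=4) + 1j * rng.normal(size=4)
    states.append(v / np.linalg.norm(v))

rho = density_matrix(states)
print(np.trace(rho).real)              # 1.0: unit trace
print(np.allclose(rho, rho.conj().T))  # True: Hermitian, so the diagonal is real
```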

Semantic Measurements. The high-level features of a sequence of words are extracted through measurements on its mixed state. Given a density matrix $\rho$ of a mixed state, a rank-one projector $P$, which is the outer product of a unit complex vector, i.e. $P = |x\rangle \langle x|$, is applied as a measurement projector. It is worth mentioning that $|x\rangle$ can be any pure state in this Hilbert space (not limited to a specific word $w$). The measured probability is computed by Gleason's theorem in Eq. 2.

4 Complex-valued Network for Matching

We implement an end-to-end network for matching on the Semantic Hilbert Space. Fig. 1 shows the overall structure of the proposed Complex-valued Network for Matching (CNM). Each component of the network is discussed in this section.

4.1 Complex-valued Embedding

On the Semantic Hilbert Space, each word $w$ is embedded as a complex-valued vector $\vec{w}$. Here we link its length and direction to different physical meanings: the length of the vector represents the relative weight of the word, while the vector direction is viewed as a superposition state. Each word $w$ is normalized into a superposition state $|w\rangle$ and a word-dependent weight $\pi(w)$:

$$|w\rangle = \frac{\vec{w}}{||\vec{w}||}, \qquad \pi(w) = ||\vec{w}||, \quad (5)$$

where $||\vec{w}||$ denotes the 2-norm length of $\vec{w}$. $\pi(w)$ is used to compute the relative weight of a word in a local context window, which we will elaborate on in Section 4.2.


[Figure 1: Architecture of the Complex-valued Network for Matching. M denotes a measurement operation according to Eq. 2. A question and an answer are each processed through four stages: Embedding, Mixture, Measurement and Matching; the complex embeddings yield superposition states and mixture weights, which produce local mixture density matrices; these are measured against shared states $|v_1\rangle\langle v_1|, \dots, |v_k\rangle\langle v_k|$ to give measurement probabilities and finally a matching output (1/0).]

4.2 Sentence Modeling with Local Mixture Scheme

A sentence is modeled as a combination of the individual words in it. NNQLM (Zhang et al., 2018a) models a sentence as a global mixture of all words, which implicitly assumes a global interaction among all sentence words. This seems unreasonable in practice, especially for a long text segment such as a paragraph or a document, where the interaction between the first word and the last word is often negligible. Therefore, we address this limitation by proposing a local mixture of words, which tends to capture the semantic relations between neighbouring words and downplay long-range word dependencies. As shown in Fig. 2, a sliding window is applied and a density matrix is constructed for each local window of length l (e.g. 3). Therefore, a sentence is composed of a sequence of density matrices for l-grams.

The representation of a local l-gram window is obtained by an improved approach over Eq. 4. In Eq. 4, each word is assigned the same weight, which does not hold from an empirical point of view. In this study, we take the L2-norm of the word vector as the relative weight of a specific word in a local context window, which can be updated during training. To some extent, the L2-norm is a measure of the semantic richness of a word, i.e. the longer the vector, the richer the meaning.

[Figure 2: Architecture of the local mixture component. A sliding window (shown in black) is applied to the sentence, generating a local mixture density matrix for each local window of length l. ⊙ denotes multiplying a matrix by a scalar element-wise; ⊗ denotes the outer product of a vector. The superposition states give word density matrices, and the L2-norm weights π1, π2, π3 are softmax-normalized into probabilities p1, p2, p3.]

The density matrix of an l-gram is computed as follows:

$$\rho = \sum_{i=1}^{l} p(w_i) \, |w_i\rangle \langle w_i|, \quad (6)$$

where the relative importance $p(w_i)$ of each word in an l-gram is the softmax-normalized word-dependent weight, $p(w_i) = \frac{e^{\pi(w_i)}}{\sum_{j=1}^{l} e^{\pi(w_j)}}$, with $\pi(w_i)$ being the word-dependent weight. By converting word-dependent weights to a probability distribution, a legal density matrix is produced, because $\sum_{i=1}^{l} p(w_i) = 1$ gives $\mathrm{tr}(\rho) = 1$. Moreover, the weight of a word also depends on its neighboring words in the local context.
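The following sketch (our own simplification; the variable names and the handling of sentence boundaries are assumptions, not taken from the released code) combines Eq. 5 and Eq. 6: unnormalized complex word vectors are split into a superposition state and an L2-norm weight, and each sliding window of length l yields one local mixture density matrix.

```python
import numpy as np

def local_mixture_matrices(word_vectors, l=3):
    """Return one density matrix per sliding window of length l (Eq. 5 and Eq. 6)."""
    matrices = []
    for start in range(len(word_vectors) - l + 1):
        window = word_vectors[start:start + l]
        norms = np.array([np.linalg.norm(v) for v in window])    # pi(w) = ||w||
        states = [v / n for v, n in zip(window, norms)]          # |w> = w / ||w||
        p = np.exp(norms) / np.exp(norms).sum()                  # softmax over pi(w)
        rho = sum(p_i * np.outer(s, np.conj(s)) for p_i, s in zip(p, states))
        matrices.append(rho)
    return matrices

rng = np.random.default_rng(1)
# A toy "sentence" of 6 unnormalized complex word vectors of dimension 5.
sentence = [rng.normal(size=5) + 1j * rng.normal(size=5) for _ in range(6)]
for rho in local_mixture_matrices(sentence, l=3):
    print(np.trace(rho).real)  # each local density matrix has unit trace
```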


4.3 Matching of Question and Answer

In quantum information, there have been works trying to estimate a quantum state from the results of a series of measurements (Rehacek et al., 2001; Lvovsky, 2004). Inspired by these works, we introduce trainable measurements to extract density matrix features and match a pair of sentences.

Suppose a pair of sentences of length $L$ are represented as two sets of density matrices $\{\rho_j^1\}_{j=1}^{L}$ and $\{\rho_j^2\}_{j=1}^{L}$ respectively. The same set of $K$ semantic measurement operators $\{|v_k\rangle\}_{k=1}^{K}$ is applied to both sets, producing a pair of $K$-by-$L$ probability matrices $p^1$ and $p^2$, where $p^1_{jk} = \langle v_k| \rho_j^1 |v_k\rangle$ and $p^2_{jk} = \langle v_k| \rho_j^2 |v_k\rangle$ for $k \in \{1, ..., K\}$ and $j \in \{1, ..., L\}$. A classical vector-based distance between $p^1$ and $p^2$ can then be computed as the matching score of the sentence pair. By involving a set of semantic measurements, the properties of the density matrices are taken into consideration in computing their distance.
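A minimal sketch of this matching step (ours; it uses max pooling over the sentence dimension and cosine similarity, following the settings in Section 5.2.2, but the function names are hypothetical):

```python
import numpy as np

def measurement_features(density_matrices, projectors):
    """K-by-L probability matrix p, with p[k, j] = <v_k| rho_j |v_k>, max-pooled over j."""
    p = np.array([[np.real(np.conj(v) @ rho @ v) for rho in density_matrices]
                  for v in projectors])          # shape (K, L)
    return p.max(axis=1)                         # max pooling over the sentence dimension

def matching_score(rhos_q, rhos_a, projectors):
    """Cosine similarity between the pooled measurement features of question and answer."""
    q = measurement_features(rhos_q, projectors)
    a = measurement_features(rhos_a, projectors)
    return float(q @ a / (np.linalg.norm(q) * np.linalg.norm(a)))

rng = np.random.default_rng(2)
def toy_rho(n=5):
    v = rng.normal(size=n) + 1j * rng.normal(size=n)
    v /= np.linalg.norm(v)
    return np.outer(v, np.conj(v))

projectors = [np.eye(5)[k].astype(complex) for k in range(3)]   # K = 3 unit measurement states
print(matching_score([toy_rho() for _ in range(4)],             # L = 4 local windows each
                     [toy_rho() for _ in range(4)], projectors))
```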

We believe that this way of computing the density matrix distance is both theoretically sound and applicable in practice. The trace inner product of density matrices (Zhang et al., 2018a) breaks the basic axioms of a metric, namely non-negativity, identity of indiscernibles and the triangle inequality. The CNN-based feature extraction over the density matrix multiplication (Zhang et al., 2018a) loses the property of a density matrix as a probability distribution. Nielsen and Chuang (2010) introduced three measures, namely trace distance, fidelity and VN-divergence. However, it is computationally costly to compute these metrics and propagate the loss through them in an end-to-end training framework.

We set the measurements to be trainable so that the matching of question and answer can be integrated into the whole neural network, and the discriminative semantic measurements can be identified in a data-driven manner. From the perspective of linear discriminant analysis (LDA) (Fisher, 1936), this approach is intended to find a group of finite discriminative projection directions for a better division of different classes, but in a more theoretically sound framework inspired by quantum probability with complex values. From an empirical point of view, the data-driven measurements make it flexible to match two sentences.

Table 1: Dataset statistics. Each cell denotes the number of questions and question-answer pairs respectively.

Dataset  | train      | dev     | test
TREC QA  | 1229/53417 | 65/117  | 68/1442
WikiQA   | 873/8627   | 126/130 | 633/2351

5 Experiments

5.1 Datasets and Evaluation Metrics

The experiments were conducted on two benchmarking question answering (QA) datasets, namely TREC QA (Voorhees and Tice, 2000) and WikiQA (Yang et al., 2015). TREC QA is a standard QA dataset from the Text REtrieval Conference (TREC). WikiQA is released by Microsoft Research for open-domain question answering. On both datasets, the task is to select the most appropriate answer from the candidate answers for a question, which requires a ranking of candidate answers. After removing the questions with no correct answers, the statistics of the cleaned datasets are given in Tab. 1. Two common rank-based metrics, namely mean average precision (MAP) and mean reciprocal rank (MRR), are used to measure the performance of the models.

5.2 Experiment Details

5.2.1 Baselines

We conduct a comprehensive comparison across a wide range of models. On TREC QA, the compared models include Bigram-CNN (Yu et al., 2014), a three-layered Long Short-Term Memory network combined with BM25 (LSTM-3L-BM25) (Wang and Nyberg, 2015), the attention-based neural matching model (aNMM) (Yang et al., 2016), Multi-Perspective CNN (MP-CNN) (He et al., 2015), CNTN (Qiu and Huang, 2015), an attention-based LSTM+CNN model (LSTM-CNN-attn) (Tang et al., 2015) and pairwise word interaction modeling (PWIM) (He and Lin, 2016). On the WikiQA dataset, we include the following models in the comparison: Bigram-CNN (Yu et al., 2014), CNN with word count information (CNN-Cnt) (Yang et al., 2015), QA-BILSTM (Santos et al., 2016), BILSTM with attentive pooling (AP-BILSTM) (Santos et al., 2016), and LSTM with attention (LSTM-attn) (Miao et al., 2015). On both datasets, we report the results of the quantum language model (QLM) (Sordoni et al., 2013) and the two models NNQLM-I and NNQLM-II (Zhang et al., 2018a) for comparison.

5.2.2 Parameter Settings

The parameters of the network are $\Theta = \{R, \Phi, \{|v_i\rangle\}_{i=1}^{k}\}$, in which $R$ and $\Phi$ denote the lookup tables for the amplitudes and complex phases of each word, and $\{|v_i\rangle\}_{i=1}^{k}$ denotes the set of semantic measurements. We use 50-dimensional complex word embeddings. The amplitudes are initialized with 50-dimensional GloVe vectors (Pennington et al., 2014) and are L2-norm regularized during training. The phases are randomly initialized in $[-\pi, \pi]$. The semantic measurements $\{|v_i\rangle\}_{i=1}^{k}$ are initialized with orthogonal real-valued one-hot vectors, and each measurement is constrained to be of unit length during training. We perform max pooling over the sentence dimension on the measurement probability matrices, resulting in a k-dimensional vector for both a question and an answer. We concatenate the vectors for l = 1, 2, 3, 4 for questions and answers; larger window sizes were also tried, and we use longer sliding windows for datasets with longer sentences. Cosine similarity is used as the distance metric over the measured probabilities. We use a triplet hinge loss and set the margin α = 0.1. A dropout layer is applied over the embedding layer and the measurement probabilities with a dropout rate of 0.9.

A grid search is conducted over the parameter pools to find the best parameters. The parameters under exploration include {0.01, 0.05, 0.1} for the learning rate, {1e-5, 1e-6, 1e-7, 1e-8} for the L2 regularization of the complex word embeddings, {8, 16, 32} for the batch size, and {50, 100, 300, 500} for the number of semantic measurements.

5.2.3 Parameter Scale

The proposed CNM has a limited parameter scale. Apart from the complex word embeddings, which are of size |V| × 2n, the only other set of parameters is $\{|v_i\rangle\}_{i=1}^{k}$, of size k × 2n, with |V|, k, n being the vocabulary size, the number of semantic measurements and the embedding dimension, respectively. In comparison, a single-layered CNN has at least l × k × n additional parameters, with l being the filter width, while a single-layered LSTM has a minimum parameter scale of 4 × k × (k + n). Although we use both an amplitude part and a phase part for the word embedding, a lower embedding dimension, namely 50, is adopted with comparable performance.
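To make the comparison concrete, a back-of-the-envelope count under n = 50 and illustrative values k = 100 and l = 3 (our own example figures; the CNN/LSTM numbers are the lower bounds stated above, not measured models):

```python
# Rough parameter counts, excluding the |V| x 2n embedding lookup tables.
n = 50          # embedding dimension
k = 100         # number of semantic measurements (example value)
l = 3           # CNN filter width (example value)

cnm_params  = k * 2 * n                # trainable measurement states {|v_i>}
cnn_params  = l * k * n                # at least this many extra weights for one CNN layer
lstm_params = 4 * k * (k + n)          # minimum scale of a single-layered LSTM

print(cnm_params, cnn_params, lstm_params)   # 10000, 15000, 60000
```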

Table 2: Experiment results on the TREC QA dataset. The best-performing values are in bold.

Model          | MAP     | MRR
Bigram-CNN     | 0.5476  | 0.6437
LSTM-3L-BM25   | 0.7134  | 0.7913
LSTM-CNN-attn  | 0.7279  | 0.8322
aNMM           | 0.7495  | 0.8109
MP-CNN         | 0.7770  | 0.8360
CNTN           | 0.7278  | 0.7831
PWIM           | 0.7588  | 0.8219
QLM            | 0.6780  | 0.7260
NNQLM-I        | 0.6791  | 0.7529
NNQLM-II       | 0.7589  | 0.8254
CNM            | 0.7701  | 0.8591
Over NNQLM-II  | 1.48% ↑ | 4.08% ↑

Table 3: Experiment results on the WikiQA dataset. The best-performing values are in bold.

Model          | MAP     | MRR
Bigram-CNN     | 0.6190  | 0.6281
QA-BILSTM      | 0.6557  | 0.6695
AP-BILSTM      | 0.6705  | 0.6842
LSTM-attn      | 0.6639  | 0.6828
CNN-Cnt        | 0.6520  | 0.6652
QLM            | 0.5120  | 0.5150
NNQLM-I        | 0.5462  | 0.5574
NNQLM-II       | 0.6496  | 0.6594
CNM            | 0.6748  | 0.6864
Over NNQLM-II  | 3.88% ↑ | 4.09% ↑

Therefore, our network scales better than the advanced CNN- or LSTM-based models.

5.3 Experiment Results

Tab. 2 and Tab. 3 show the experimental results on TREC QA and WikiQA respectively, where bold values are the best performances out of all models. Our model achieves the best performance on 3 out of the 4 metrics across TREC QA and WikiQA, and performs slightly worse than the best-performing model on the remaining metric. This illustrates the effectiveness of our proposed model from a general perspective.

Specifically, CNM outperforms most CNN- and LSTM-based models, which have more complicated structures and relatively larger parameter scales. CNM also performs better than the existing quantum-inspired QA models, QLM and NNQLM, on both datasets, which means that the quantum-theoretical framework gives rise to a better-performing model. Moreover, a significant improvement over NNQLM-I is observed on these two datasets, supporting our claim that the trace inner product is not an effective distance metric between two density matrices.


Table 4: Ablation test. The values in parentheses are the performance difference between the model and CNM.

Setting                  | MAP              | MRR
FastText-MaxPool         | 0.6659 (0.1042↓) | 0.7152 (0.1439↓)
CNM-Real                 | 0.7112 (0.0589↓) | 0.7922 (0.0659↓)
CNM-Global-Mixture       | 0.6968 (0.0733↓) | 0.7829 (0.0762↓)
CNM-trace-inner-product  | 0.6952 (0.0749↓) | 0.7688 (0.0903↓)
CNM                      | 0.7701           | 0.8591

5.4 Ablation Test

An ablation test is conducted to examine the influence of each component of our proposed CNM. The following models are implemented in the ablation test. FastText-MaxPool adopts max pooling over word embeddings, just like FastText (Joulin et al., 2016). CNM-Real replaces the word embeddings and measurements with their real counterparts. CNM-Global-Mixture adopts a global mixture of the whole sentence, in which a sentence is represented as a single density matrix, leading to a single probability vector as the measurement result. CNM-trace-inner-product replaces the trainable measurements with the trace inner product, like NNQLM.

For the real-valued models, we double the embedding dimension in order to eliminate the impact of the parameter scale on performance. Due to limited space, we only report the ablation test results on TREC QA; WikiQA shows similar trends. The test results in Tab. 4 demonstrate that each component plays a crucial role in the CNM model. In particular, the comparison with CNM-Real and FastText-MaxPool shows the effectiveness of introducing complex-valued components, the improvement over CNM-Global-Mixture reveals the superiority of the local mixture, and the comparison with CNM-trace-inner-product confirms the usefulness of the trainable measurements.

6 Discussions

This section investigates the research questions proposed in Sec. 1. For RQ1, we explain the physical meaning of each component in terms of transparency (Sec. 6.1), and design some case studies for post-hoc interpretability (Sec. 6.2). For RQ2, we argue that the complex-valued representation can model different aspects of semantics and naturally addresses non-linear semantic compositionality, as discussed in Sec. 6.3.

Table 5: Physical meanings and constraints.

Component                 | DNN                  | CNM
Sememe                    | -                    | complex basis vector / basis state: {w | w ∈ C^n, ||w||_2 = 1}, complete & orthogonal
Word                      | real vector, (−∞, ∞) | unit complex vector / superposition state: {w | w ∈ C^n, ||w||_2 = 1}
Low-level representation  | real vector, (−∞, ∞) | density matrix / mixed system: {ρ | ρ = ρ*, tr(ρ) = 1}
Abstraction               | CNN/RNN, (−∞, ∞)     | unit complex vector / measurement: {w | w ∈ C^n, ||w||_2 = 1}
High-level representation | real vector, (−∞, ∞) | real value / measured probability, (0, 1)

Table 6: Selected learned important words in TREC QA. All words are lowercased.

Important:   studio, president, women, philosophy; scandinavian, washingtonian, berliner, championship; defiance, reporting, adjusted, jarred
Unimportant: 71.2, 5.5, 4m, 296036, 3.5; may, be, all, born; movements, economists, revenues, computers

6.1 Transparency

CNM aims to unify semantic units of different granularities, e.g. sememes, words, phrases (or n-grams) and documents, in a single complex-valued vector space, as shown in Tab. 5. In particular, we formulate atomic sememes as a group of complete orthogonal basis states and words as superposition states over them. A linguistic unit of larger granularity, e.g. a phrase or a sentence, is represented as a mixed system over the words (with a density matrix, i.e. a positive semi-definite matrix with unit trace).

More importantly, trainable projection measurements are adopted to extract high-level representations for a phrase or a sentence. Each measurement is also directly embedded in this unified Hilbert space as a specific unit state (like words), so it can be understood through the words near this state. The corresponding trainable components in state-of-the-art neural network architectures, namely kernels in CNNs and cells in RNNs, are arbitrary real-valued parameters without any constraints, which makes them difficult to understand.

6.2 Post-hoc Interpretability

Post-hoc interpretability is demonstrated through three groups of case studies, namely the word weighting scheme, matching patterns and discriminative semantic measurements.

6.2.1 Word Weighting Scheme

Tab. 6 shows words selected from the top-50 most important words as well as the top-50 unimportant ones.


Table 7: Matching patterns for specific sentence pairs in TREC QA. The darker the color, the larger the weight of the word. The [ and ] denote the possible borders of the current sliding windows.

Question: Who is the [ president or chief executive of Amtrak ] ?
Answer:   " Long-term success ... " said George Warrington , [ Amtrak 's president and chief executive ] .

Question: When [ was Florence Nightingale born ] ?
Answer:   On May 12 , 1820 , the founder of modern nursing , [ Florence Nightingale , was born ] in Florence , Italy .

Question: When [ was the IFC established ] ?
Answer:   [ IFC was established in ] 1956 as a member of the World Bank Group .

Question: [ how did women 's role change during the war ]
Answer:   ..., the [ World Wars started a new era for women 's ] opportunities to ....

Question: [ Why did the Heaven 's Gate members commit suicide ] ?
Answer:   This is not just a case of [ members of the Heaven 's Gate cult committing suicide ] to ...

The importance of a word is based on the L2-norm of its learned amplitude embedding according to Eq. 5. It is consistent with intuition that the important words are mostly about specific topics or discriminative nouns, while the unimportant words include meaningless numbers and super-high-frequency words. Note that some special forms of words (e.g. the plural forms in the last row) are also identified as unimportant, since we did not stem the words.
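Illustrative only (a toy snippet of ours; the real amplitude lookup table R is learned during training, not random): ranking words by the L2-norm of their amplitude embeddings reproduces this weighting scheme.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["president", "studio", "be", "all", "5.5"]
# Stand-in for the learned amplitude lookup table R (one 50-dim vector per word).
R = {w: rng.normal(size=50) * rng.uniform(0.2, 1.5) for w in vocab}

# Importance of a word = L2-norm of its amplitude embedding (Eq. 5).
ranked = sorted(vocab, key=lambda w: np.linalg.norm(R[w]), reverse=True)
print(ranked)
```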

6.2.2 Matching Pattern

Tab. 7 shows the matching scheme with local sliding windows. In a local context window, we visualize the relative weight (i.e. the weight after softmax normalization) of each word with a degree of darkness. The table illustrates that our model is capable of identifying truly matched local windows of a sentence pair. Even when some words are replaced with similar forms (e.g. commit and committing in the last case) or similar meanings (e.g. change and new in the fourth case), the model remains robust and still obtains a relatively high matching score. From an empirical point of view, our model outperforms other models in situations where specific matching patterns are crucial to the sentence meaning, such as when two sentences share some unordered bag-of-words combinations. To some extent, it is robust to the replacement of words with similar ones in the Semantic Hilbert Space.

6.2.3 Discriminative Semantic Measurements

The semantic measurements are performed through rank-one projectors $\{|x\rangle \langle x|\}$. From a classical point of view, each projector is associated with a superposition of fundamental sememes, which is not necessarily linked to a particular word. Since the similarity metric in the Semantic Hilbert Space can be used to indicate semantic relatedness, we rely on the words near the learned measurement projectors to understand what they may refer to.

Essentially, we identify the 10 most similar words to each measurement based on the cosine similarity metric.

Table 8: Selected learned measurements for TREC QA. They were selected according to the nearest words to each measurement vector in the Semantic Hilbert Space.

Nearest neighborhood words for a measurement vector
1 | andes, nagoya, inter-american, low-caste
2 | cools, injection, boiling, adrift
3 | andrews, paul, manson, bair
4 | historically, 19th-century, genetic, hatchback
5 | missile, exile, rebellion, darkness

Tab. 8 shows some of the most similar words for 5 measurements, which are randomly chosen from the k = 10 trainable measurements for the TREC QA dataset. It can be seen that the first three selected measurements are roughly about positions, movement verbs and people's names, while the remaining ones are about the topics of history and rebellion respectively. Even though a clear explanation of the measurements is not available, we are still able to roughly understand the meaning of the measurements through the proposed data-driven approach.
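A sketch of how such neighborhoods can be extracted (ours; it assumes a dictionary of learned unit word states and measurement states, and uses the modulus of the Hermitian inner product as a stand-in for the cosine similarity):

```python
import numpy as np

def nearest_words(measurement, word_states, top_n=10):
    """Words whose states are most similar to a measurement state.

    Both arguments are unit complex vectors; similarity is the modulus of the
    Hermitian inner product (an assumed proxy for the paper's cosine similarity).
    """
    sims = {w: abs(np.vdot(measurement, v)) for w, v in word_states.items()}
    return sorted(sims, key=sims.get, reverse=True)[:top_n]

rng = np.random.default_rng(4)
def unit(n=50):
    v = rng.normal(size=n) + 1j * rng.normal(size=n)
    return v / np.linalg.norm(v)

word_states = {w: unit() for w in ["andes", "nagoya", "cools", "missile", "paul"]}
print(nearest_words(unit(), word_states, top_n=3))
```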

6.3 Complex-valued Representation

In CNM, each word is naturally embedded as a complex vector, composed of a complex phase part, a unit amplitude part and a scalar-valued length. We argue that the amplitude part (i.e. the square root of a probabilistic weight) corresponds to the classical word embedding carrying the lexical meaning, while the phase part implicitly reflects higher-level semantic aspects such as polarity, ambiguity or emotion. The scalar-valued length is considered as the relative weight in a mixed system. The ablation study in Sec. 5.4 confirms that the complex-valued word embedding performs better than its real counterpart, which indicates that we benefit from the complex-valued embedding on the QA task.

From a mathematical point of view, the complex-valued word embedding and the other complex-valued components form a new Hilbert vector space for modelling language, with new definitions of addition and multiplication, as well as a new inner product operation. For instance, addition in the combination of word meanings is defined as

$$z = z_1 + z_2 = r_1 e^{i\theta_1} + r_2 e^{i\theta_2} = \sqrt{r_1^2 + r_2^2 + 2 r_1 r_2 \cos(\theta_2 - \theta_1)} \; e^{\,i \arctan\left(\frac{r_1 \sin\theta_1 + r_2 \sin\theta_2}{r_1 \cos\theta_1 + r_2 \cos\theta_2}\right)} \quad (7)$$

where $z_1$ and $z_2$ are the values of the corresponding element of two different word vectors $|w_1\rangle$ and $|w_2\rangle$ respectively. Both the amplitude and the complex phase of $z$ are nonlinear combinations of the phases and amplitudes of $z_1$ and $z_2$. A classical linear addition gives $z = r_1 + r_2$, which can be viewed as a degenerate case of the complex-valued addition with the phase information removed ($\theta_1 = \theta_2 = 0$ in the example).
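A quick numerical check (ours) that the polar form in Eq. 7 agrees with ordinary complex addition; we use arctan2 so that the phase lands in the correct quadrant:

```python
import numpy as np

r1, theta1 = 0.8, 0.6
r2, theta2 = 0.5, -1.1

# Direct complex addition.
z_direct = r1 * np.exp(1j * theta1) + r2 * np.exp(1j * theta2)

# Polar form of Eq. 7 (arctan2 handles the quadrant of the phase).
amplitude = np.sqrt(r1**2 + r2**2 + 2 * r1 * r2 * np.cos(theta2 - theta1))
phase = np.arctan2(r1 * np.sin(theta1) + r2 * np.sin(theta2),
                   r1 * np.cos(theta1) + r2 * np.cos(theta2))
z_polar = amplitude * np.exp(1j * phase)

print(np.isclose(z_direct, z_polar))  # True
```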

7 Conclusions and Future Work

Towards the issue of interpretable matching, we propose two research questions to investigate the possibility of modelling language with the mathematical framework of quantum physics. To this end, we design a new framework that models all linguistic units in a unified Hilbert space with well-defined mathematical constraints and explicit physical meanings. We implement the framework as a neural network and demonstrate its effectiveness on the question answering (QA) task. Due to its well-designed components, our model is advantageous in terms of transparency and post-hoc interpretability, and it also shows the potential of complex-valued components in NLP.

Despite the effectiveness of the current network, we would like to further explore the phase part of the complex-valued word embedding so that it directly links to concrete semantics such as word sentiment or word position. Another possible direction is to borrow other quantum concepts to capture the interaction and non-interaction between word semantics, such as Fock space (Sozzo, 2014), which considers both interacting and non-interacting entities in different Hilbert spaces. Furthermore, a deeper and more robust quantum-inspired neural architecture in a higher-dimensional Hilbert space, such as (Zhang et al., 2018b), is also worth investigating for stronger performance and better explanatory power.

ACKNOWLEDGEMENT

We thank Sagar Uprety, Dawei Song, and Prayag Tiwari for helpful discussions. Peng Zhang and Peter Bruza gave us constructive comments to improve the paper. The GPU computing resources are partly supported by Beijing Ultrapower Software Co., Ltd and Jianquan Li.

The three authors are supported by the Quantum Access and Retrieval Theory (QUARTZ) project, which has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 721321.

References

Diederik Aerts and Sandro Sozzo. 2014. Quantum Entanglement in Concept Combinations. International Journal of Theoretical Physics, 53(10):3587–3603.

Peter D. Bruza, Kirsty Kitto, Douglas McEvoy, and Cathy McEvoy. 2008. Entangling words and meaning. In QI, pages 118–124. College Publications.

Ronald A. Fisher. 1936. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188.

Andrew M. Gleason. 1957. Measures on the closed subspaces of a Hilbert space. Journal of Mathematics and Mechanics, pages 885–893.

Cliff Goddard and Anna Wierzbicka. 1994. Semantic and Lexical Universals: Theory and Empirical Findings. John Benjamins Publishing.

Hua He, Kevin Gimpel, and Jimmy Lin. 2015. Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks. In EMNLP, pages 1576–1586. ACL.

Hua He and Jimmy Lin. 2016. Pairwise Word Interaction Modeling with Deep Neural Networks for Semantic Similarity Measurement. In NAACL, pages 937–948. ACL.

Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional Neural Network Architectures for Matching Natural Language Sentences. In Advances in Neural Information Processing Systems 27, pages 2042–2050. Curran Associates, Inc.

Richard I. G. Hughes. 1992. The Structure and Interpretation of Quantum Mechanics. Harvard University Press.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Zachary C. Lipton. 2018. The Mythos of Model Interpretability. Queue, 16(3):30:31–30:57.

A. I. Lvovsky. 2004. Iterative maximum-likelihood reconstruction in quantum homodyne tomography. Journal of Optics B: Quantum and Semiclassical Optics, 6(6):S556.

Yishu Miao, Lei Yu, and Phil Blunsom. 2015. Neural Variational Inference for Text Processing. arXiv preprint arXiv:1511.06038.

Michael A. Nielsen and Isaac L. Chuang. 2010. Quantum Computation and Quantum Information. Cambridge University Press, Cambridge; New York.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In EMNLP, volume 14, pages 1532–1543.

Xipeng Qiu and Xuanjing Huang. 2015. Convolutional neural tensor network architecture for community-based question answering. In IJCAI, pages 1305–1311.

Cicero dos Santos, Ming Tan, Bing Xiang, and Bowen Zhou. 2016. Attentive pooling networks. arXiv preprint arXiv:1602.03609.

Alessandro Sordoni, Jian-Yun Nie, and Yoshua Bengio. 2013. Modeling term dependencies with quantum language models for IR. In SIGIR, pages 653–662. ACM.

Sandro Sozzo. 2014. A quantum probability explanation in Fock space for borderline contradictions. Journal of Mathematical Psychology, 58:1–12.

Ming Tan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2015. LSTM-based Deep Learning Models for Non-factoid Answer Selection. arXiv preprint arXiv:1511.04108.

Duyu Tang, Bing Qin, and Ting Liu. 2015. Document Modeling with Gated Recurrent Neural Network for Sentiment Classification. In EMNLP, pages 1422–1432, Lisbon, Portugal.

Cornelis Joost Van Rijsbergen. 2004. The Geometry of Information Retrieval. Cambridge University Press.

J. Rehacek, Z. Hradil, and M. Jezek. 2001. Iterative algorithm for reconstruction of entangled states. Physical Review A, 63:040303.

Ellen M. Voorhees and Dawn M. Tice. 2000. Building a question answering test collection. In SIGIR, pages 200–207.

Di Wang and Eric Nyberg. 2015. A Long Short-Term Memory Model for Answer Sentence Selection in Question Answering. In ACL, pages 707–712.

Liu Yang, Qingyao Ai, Jiafeng Guo, and W. Bruce Croft. 2016. aNMM: Ranking Short Answer Texts with Attention-Based Neural Matching Model. In CIKM, pages 287–296. ACM.

Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A Challenge Dataset for Open-Domain Question Answering. In EMNLP, pages 2013–2018. Association for Computational Linguistics.

Lei Yu, Karl Moritz Hermann, Phil Blunsom, and Stephen Pulman. 2014. Deep Learning for Answer Sentence Selection. arXiv preprint arXiv:1412.1632.

Peng Zhang, Jiabin Niu, Zhan Su, Benyou Wang, Liqun Ma, and Dawei Song. 2018a. End-to-End Quantum-like Language Models with Application to Question Answering. In AAAI, pages 5666–5673.

Peng Zhang, Zhan Su, Lipeng Zhang, Benyou Wang, and Dawei Song. 2018b. A quantum many-body wave function inspired language modeling approach. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 1303–1312. ACM.