


Findings of the Association for Computational Linguistics: EMNLP 2021, pages 763–778, November 7–11, 2021. ©2021 Association for Computational Linguistics


Recommend for a Reason: Unlocking the Power of Unsupervised Aspect-Sentiment Co-Extraction

Zeyu Li¹, Wei Cheng², Reema Kshetramade¹, John Houser¹, Haifeng Chen², and Wei Wang¹

¹Department of Computer Science, University of California, Los Angeles
²NEC Labs America

{zyli,weiwang}@cs.ucla.edu  {reemakshe,johnhouser}@ucla.edu  {weicheng,haifeng}@nec-labs.com

Abstract

Compliments and concerns in reviews are valuable for understanding users' shopping interests and their opinions with respect to specific aspects of certain items. Existing review-based recommenders favor large and complex language encoders that can only learn latent and uninterpretable text representations. They lack explicit user-attention and item-property modeling, which, however, could provide valuable information beyond the ability to recommend items. Therefore, we propose a tightly coupled two-stage approach, including an Aspect-Sentiment Pair Extractor (ASPE) and an Attention-Property-aware Rating Estimator (APRE). Unsupervised ASPE mines Aspect-Sentiment pairs (AS-pairs) and APRE predicts ratings using AS-pairs as concrete aspect-level evidence. Extensive experiments on seven real-world Amazon Review Datasets demonstrate that ASPE can effectively extract AS-pairs, which enables APRE to deliver superior accuracy over the leading baselines.

1 Introduction

Reviews and ratings are valuable assets for the recommender systems of e-commerce websites since they immediately describe the users' subjective feelings about the purchases. Learning user preferences from such feedback is straightforward and efficacious. Previous research on review-based recommendation has been fruitful (Chin et al., 2018; Chen et al., 2018; Bauman et al., 2017; Liu et al., 2019). Cutting-edge natural language processing (NLP) techniques are applied to extract the latent user sentiments, item properties, and the complicated interactions between the two components.

However, existing approaches have shortcomings that leave room for improvement. Firstly, they overlook the fact that users may pay different levels of attention to different properties of the merchandise. An item property is the combination of an aspect of the item and the characteristic associated with it. Users may show strong attention to certain properties but indifference to others. The attended advantageous or disadvantageous properties can dominate a user's attitude and, consequently, decide their generosity in rating.

Table 1 exemplifies the impact of user attitude using three real reviews of a headset. Three aspects are covered: microphone quality, comfortableness, and sound quality. The microphone quality is controversial: R2 and R3 criticize it, but R1 praises it. The sole disagreement between R1 and R2, on the microphone, which is the major concern of R2, results in the divergence of their ratings (5 stars vs. 3 stars). However, R3 neglects that disadvantage and grades highly (5 stars) for the superior comfortableness indicated by the metaphor of a "pillow".

Secondly, understanding user motivations in granular item properties provides valuable information beyond the ability to recommend items. It requires aspect-based NLP techniques to extract explicit and definitive aspects. However, existing aspect-based models mainly use latent or implicit aspects (Chin et al., 2018) whose real semantics are unjustifiable. Similar to Latent Dirichlet Allocation (LDA, Blei et al., 2003), the semantics of the derived aspects (topics) are mutually overlapped (Huang et al., 2020b). These models undermine the resultant aspect distinctiveness and lead to uninterpretable and sometimes counterintuitive results. The root of the problem is the lack of large review corpora with aspect and sentiment annotations. The existing ones are either too small or too domain-specific (Wang and Pan, 2018) to be applied to general use cases. Progress on sentiment term extraction (Dai and Song, 2019; Tian et al., 2020; Chen et al., 2020a) takes advantage of neural networks and linguistic knowledge and partially makes it possible to use unsupervised term annotation to tackle the lack-of-huge-corpus issue.


| Reviews | Microphone | Comfort | Sound |
| R1 [5 stars]: Comfortable. Very high quality sound. ... Mic is good too. There is an switch to mute your mic. ... I wear glasses and these are comfortable with my glasses on. ... | good (satisfied) | comfortable | high quality (praising) |
| R2 [3 stars]: I love the comfort, sound, and style but the mic is complete junk! | complete junk (angry) | love | love |
| R3 [5 stars]: ... But this one feels like a pillow, there's nothing wrong with the audio and it does the job. ... con is that the included microphone is pretty bad. | pretty bad (unsatisfied) | like a pillow (enjoyable) | nothing wrong |

Table 1: Example reviews of a headset with three aspects, namely microphone quality, comfort level, and sound quality, highlighted specifically. The extracted sentiments are on the right. R1 vs. R2: Different users react differently (microphone quality) to the same item due to distinct personal attentions and, consequently, give divergent ratings. R1 vs. R3: A user can still rate an item highly due to special attention to particular aspects (comfort level) regardless of certain unsatisfactory or indifferent properties (microphone and sound qualities).

In this paper, we seek to understand how reviews and ratings are affected by users' perception of item properties in a fine-grained way, and discuss how to utilize these findings transparently and effectively in rating prediction. We propose a two-stage recommender with an unsupervised Aspect-Sentiment Pair Extractor (ASPE) and an Attention-Property-aware Rating Estimator (APRE). ASPE extracts (aspect, sentiment) pairs (AS-pairs) from reviews. The pairs are fed into APRE as explicit carriers of user attention and item properties, indicating both the frequencies and the sentiments of aspect mentions. APRE encodes the text with a contextualized encoder and processes the implicit text features and the annotated AS-pairs with a dual-channel rating regressor. Together, ASPE and APRE extract explicit aspect-based attentions and properties and deliver strong rating prediction performance.

Aspect-level user attitude differs from user preference. The user attitudes produced by the interactions of user attentions and item properties are sophisticated and granular sentiments and rationales for interpretation (see Section 4.4 and A.3.5). Preferences, on the contrary, are coarse sentiments such as like, dislike, or neutral. Preference-based models may infer that R1 and R3 are written by headset lovers because of the high ratings. Instead, attitude-based methods further understand that it is the comfortableness that matters to R3 rather than the item being a headset. Aspect-level attitude modeling is more accurate, informative, and personalized than preference modeling.

Note. Due to the page limits, some supportive materials, marked by "†", are presented in the Supplementary Materials. We strongly recommend readers check out these materials. The source code of our work is available on GitHub at https://github.com/zyli93/ASPE-APRE.

2 Related Work

Our work is related to four lines of literature that lie at the intersection of ABSA and recommender systems.

2.1 Aspect-based Sentiment Analysis

Aspect-based sentiment analysis (ABSA) (Xu et al., 2020; Wang et al., 2018) predicts sentiments toward aspects mentioned in the text. Natural language is modeled by graphs in (Zhang et al., 2019; Wang et al., 2020), such as Pointwise Mutual Information (PMI) graphs and dependency graphs. Phan and Ogunbona (2020) and Tang et al. (2020) utilize contextualized language encoding to capture the context of aspect terms. Chen et al. (2020b) focuses on the consistency of the emotion surrounding the aspects, and Du et al. (2020) equips pre-trained BERT with domain-awareness of sentiments. Our work is informed by these advances and utilizes PMI, dependency trees, and BERT for syntax feature extraction and language encoding.

2.2 Aspect or Sentiment Terms Extraction

Aspect and sentiment term extraction is a prerequisite of ABSA. However, manually annotating training data, which requires the hard labor of experts, is only feasible on small datasets in particular domains such as Laptop and Restaurant (Pontiki et al., 2014, 2015), which are overused in ABSA.

Recently, RINANTE (Dai and Song, 2019) and SDRN (Chen et al., 2020a) automatically extract both terms using rule-guided data augmentation and double-channel opinion-relation co-extraction, respectively. However, the supervised approaches are too domain-specific to generalize to out-of-domain or open-domain corpora.


Conducting domain adaptation from small labeled corpora to unlabeled open corpora only produces suboptimal results (Wang and Pan, 2018). SKEP (Tian et al., 2020) exploits an unsupervised PMI+seed strategy to coarsely label sentimentally polarized tokens as sentiment terms, showing that the unsupervised method is advantageous when annotated corpora are insufficient in the domain of interest.

Compared to the above models, our ASPE has two merits: it is (1) unsupervised and hence free from expensive data labeling, and (2) generalizable to different domains by combining three different labeling methods.

2.3 Aspect-based Recommendation

Aspect-based recommendation is a related task; the major difference is that specific terms indicating sentiments are not extracted, and only the aspects are needed (Hou et al., 2019; Guan et al., 2019; Huang et al., 2020a; Chin et al., 2018). Some disadvantages are summarized as follows. First, the aspect extraction tools are usually outdated and inaccurate, such as LDA (Hou et al., 2019), TF-IDF (Guan et al., 2019), and word embedding-based similarity (Huang et al., 2020a). Second, sentiment is represented as a scalar, which is coarser than the embedding-based representation used in our work.

2.4 Rating Prediction

Rating prediction is an important task in recommendation. Related approaches utilize text mining algorithms to build user and item representations and predict ratings (Kim et al., 2016; Zheng et al., 2017; Chen et al., 2018; Chin et al., 2018; Liu et al., 2019; Bauman et al., 2017). However, the text features learned are latent and unable to provide explicit hints for explaining user interests.

3 ASPE and APRE

3.1 Problem Formulation

Review-based rating prediction involves two major entities: users and items. A user $u$ writes a review $r_{u,t}$ for an item $t$ and rates a score $s_{u,t}$. Let $R_u$ denote all reviews given by $u$ and $R_t$ denote all reviews received by $t$. A rating regressor takes in a tuple of a review-and-rate event $(u, t)$ and the review sets $R_u$ and $R_t$ to estimate the rating score $s_{u,t}$.
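To make the formulation concrete, below is a minimal sketch of the data layout it implies; the class and field names are illustrative assumptions, not part of the paper's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ReviewEvent:
    user: str      # u
    item: str      # t
    text: str      # review r_{u,t}
    rating: float  # score s_{u,t}

def build_review_sets(events):
    """Group review events into R_u (reviews given by each user)
    and R_t (reviews received by each item)."""
    R_u, R_t = defaultdict(list), defaultdict(list)
    for e in events:
        R_u[e.user].append(e)
        R_t[e.item].append(e)
    return R_u, R_t

# A rating regressor takes (u, t, R_u[u], R_t[t]) and estimates s_{u,t}.
```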

3.2 Unsupervised ASPE

We combine three separate methods to label AS-pairs without the need for supervision, namely PMI-based, neural network-based (NN-based), and language knowledge- or lexicon-based methods. The framework is visualized in Figure 1.

Figure 1: Pipeline of ASPE. Review text feeds three sentiment-term extractors (PMI-based, NN-based, and lexicon-based), whose outputs form the sentiment term set ST; dependency parsing with the amod and nsubj+acomp rules produces AS-pair candidates (Aspect 1, Sentiment 1), (Aspect 2, Sentiment 2), ..., which are filtered and merged into the final AS-pair extractions.

3.2.1 Sentiment Terms Extraction

PMI-based method  Pointwise Mutual Information (PMI) originates from information theory and is adapted to NLP (Zhang et al., 2019; Tian et al., 2020) to measure statistical word associations in corpora. It determines the sentiment polarities of words using a small number of carefully selected positive and negative seeds ($s^+$ and $s^-$) (Tian et al., 2020). It first extracts candidate sentiment terms satisfying the part-of-speech patterns of Turney (2002) and then measures the polarity of each candidate term $w$ by

$$\mathrm{Pol}(w) = \sum_{s^+} \mathrm{PMI}(w, s^+) - \sum_{s^-} \mathrm{PMI}(w, s^-). \quad (1)$$

Given a sliding window-based context sampler $ctx$, the $\mathrm{PMI}(\cdot,\cdot)$ between words is defined by

$$\mathrm{PMI}(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}, \quad (2)$$

where $p(\cdot)$, the probability estimated by token counts, is defined by $p(w_1, w_2) = \frac{|\{ctx \mid w_1, w_2 \in ctx\}|}{\text{total } \#ctx}$ and $p(w_1) = \frac{|\{ctx \mid w_1 \in ctx\}|}{\text{total } \#ctx}$. Afterward, we collect the top-$q$ sentiment tokens with strong polarities, both positive and negative, as $ST_{PMI}$.
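For concreteness, here is a minimal sketch of the PMI-based polarity scoring of Eqs. (1) and (2) over sliding-window contexts; the seed lists, window size, and the omission of the part-of-speech candidate filter are illustrative simplifications rather than the paper's exact configuration.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_sentiment_terms(corpus, pos_seeds, neg_seeds, window=5, top_q=1000):
    """Score each token by Eq. (1): sum of PMI with positive seeds minus
    sum of PMI with negative seeds; keep the top-q most polarized tokens."""
    word_ctx, pair_ctx, n_ctx = Counter(), Counter(), 0
    for tokens in corpus:                      # corpus: iterable of tokenized sentences
        for i in range(max(1, len(tokens) - window + 1)):
            ctx = set(tokens[i:i + window])    # one sliding-window context
            n_ctx += 1
            word_ctx.update(ctx)
            pair_ctx.update(frozenset(p) for p in combinations(sorted(ctx), 2))

    def pmi(w1, w2):                           # Eq. (2) with count-based probabilities
        joint = pair_ctx[frozenset((w1, w2))]
        if joint == 0 or word_ctx[w1] == 0 or word_ctx[w2] == 0:
            return 0.0
        return math.log((joint / n_ctx) / ((word_ctx[w1] / n_ctx) * (word_ctx[w2] / n_ctx)))

    polarity = {w: sum(pmi(w, s) for s in pos_seeds) - sum(pmi(w, s) for s in neg_seeds)
                for w in word_ctx}
    ranked = sorted(polarity, key=lambda w: abs(polarity[w]), reverse=True)
    return set(ranked[:top_q])                 # ST_PMI
```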

NN-based method  As discussed in Section 2, co-extraction models (Dai and Song, 2019) can accurately label AS-pairs only in the training domain. For sentiment terms with consistent semantics across domains, such as good and great, NN methods can still provide a robust extraction recall. In this work, we take a pretrained SDRN (Chen et al., 2020a) as the NN-based method to generate $ST_{NN}$. The pretrained SDRN is considered an off-the-shelf tool, similar to the pretrained BERT, and is irrelevant to our rating prediction data.

Page 4: Recommend for a Reason: Unlocking the Power of

766

Therefore, we argue that ASPE is unsupervised for open-domain rating prediction.

Knowledge-based method  The PMI- and NN-based methods have shortcomings. The PMI-based method depends on the seed selection, and the accuracy of the NN-based method deteriorates when the applied domain is distant from its training data. As compensation, we integrate a sentiment lexicon $ST_{Lex}$ summarized by linguists, since expert knowledge is widely used in unsupervised learning. Examples of linguistic lexicons include SentiWordNet (Baccianella et al., 2010) and the Opinion Lexicon (Hu and Liu, 2004). The latter is used in this work.

Building sentiment term set  The three sentiment term subsets are joined to build the overall sentiment term set used in AS-pair generation: $ST = ST_{PMI} \cup ST_{NN} \cup ST_{Lex}$. The three sets compensate for one another's discrepancies and expand the coverage of terms, as shown in Table 10†.
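A small sketch of how the union $ST = ST_{PMI} \cup ST_{NN} \cup ST_{Lex}$ might be assembled; accessing the Hu and Liu (2004) Opinion Lexicon through NLTK's `opinion_lexicon` corpus and the tiny stand-in sets for the PMI and NN outputs are assumptions made for illustration.

```python
import nltk
from nltk.corpus import opinion_lexicon

nltk.download("opinion_lexicon", quiet=True)

# ST_Lex: the Opinion Lexicon of Hu and Liu (2004), positive and negative words.
st_lex = set(opinion_lexicon.positive()) | set(opinion_lexicon.negative())

# ST_PMI and ST_NN would come from the PMI scorer above and a pretrained SDRN
# tagger; tiny hand-written stand-ins keep this sketch self-contained.
st_pmi = {"amazing", "superior", "junk"}
st_nn = {"comfortable", "cheapy"}

sentiment_terms = st_pmi | st_nn | st_lex     # ST = ST_PMI ∪ ST_NN ∪ ST_Lex
print(len(sentiment_terms))
```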

3.2.2 Syntactic AS-pairs Extraction

To extract AS-pairs, we first label AS-pair candidates using dependency parsing and then filter out non-sentiment-carrying candidates using $ST$ [1]. Dependency parsing extracts the syntactic relations between words. Some nouns are considered potential aspects and are modified by adjectives with two types of dependency relations shown in Figure 2: amod and nsubj+acomp. The pairs of nouns and their modifying adjectives compose the AS-pair candidates. Similar techniques are widely used in unsupervised aspect extraction models (Tulkens and van Cranenburgh, 2020; Dai and Song, 2019). AS-pair candidates are noisy since not all adjectives in them carry sentiment inclination. $ST$ comes into use to filter out non-sentiment-carrying AS-pair candidates whose adjective is not in $ST$. The remaining candidates form the AS-pair set. Admittedly, the dependency-based extraction of (noun, adj.) pairs is suboptimal and causes missing aspect or sentiment terms; an implicit module is designed to remedy this issue. Open-domain AS-pair co-extraction is blocked by the lack of public labeled data and is left for future work.

We introduce ItemTok as a special aspect token for the nsubj+acomp rule where nsubj is a pronoun referring to the item, such as it or they. Infrequent aspect terms with fewer than $c$ occurrences are ignored to reduce sparsity. We use WordNet synsets (Miller, 1995) to merge synonymous aspects. The aspect with the most synonyms is selected as the representative of that aspect set.

[1] Section A.2.1† explains this procedure in detail with the pseudocode of Algorithm 1†.
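The following sketch illustrates the two dependency rules, the ItemTok substitution, and the ST filter using spaCy for parsing; the model name, the pronoun list, and the `sentiment_terms` set are assumptions, and compound-noun handling, frequency filtering, and WordNet-based synonym merging are omitted.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed
ITEM_PRONOUNS = frozenset({"it", "they", "this", "these"})

def extract_as_pairs(text, sentiment_terms):
    """Label AS-pair candidates with the amod and nsubj+acomp rules,
    then keep only candidates whose adjective is a sentiment term in ST."""
    candidates = []
    for tok in nlp(text):
        # Rule 1 (amod): adjective directly modifying a noun, e.g. "amazing sound".
        if tok.dep_ == "amod" and tok.pos_ == "ADJ" and tok.head.pos_ in ("NOUN", "PROPN"):
            candidates.append((tok.head.text.lower(), tok.text.lower()))
        # Rule 2 (nsubj+acomp): "comfort is excellent" -> (comfort, excellent).
        if tok.dep_ == "acomp" and tok.pos_ == "ADJ":
            for subj in (c for c in tok.head.children if c.dep_ == "nsubj"):
                aspect = "ItemTok" if subj.text.lower() in ITEM_PRONOUNS else subj.text.lower()
                candidates.append((aspect, tok.text.lower()))
    return [(a, s) for a, s in candidates if s in sentiment_terms]

print(extract_as_pairs("Sound quality is superior and comfort is excellent.",
                       {"superior", "excellent", "amazing"}))
```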

Figure 2: Two dependency-based rules for AS-pair candidate extraction, with the effective dependency relations and the aspect and sentiment candidates highlighted. amod dependency relation: "Amazing sound and quality, all in one headset." yields the AS-pair candidates (sound, amazing) and (quality, amazing). nsubj+acomp dependency relation: "Sound quality is superior and comfort is excellent." yields the AS-pair candidates (Sound quality, superior) and (comfort, excellent).

Discussion  ASPE is different from Aspect Extraction (AE) (Tulkens and van Cranenburgh, 2020; Luo et al., 2019; Wei et al., 2020; Ma et al., 2019; Angelidis and Lapata, 2018; Xu et al., 2018; Shu et al., 2017; He et al., 2017a), which extracts aspects only and infers sentiment polarities in {pos, neg, (neu)}. AS-pair co-extraction, however, offers more diversified emotional signals than the bipolar sentiment measurement of AE.

3.3 APRE

APRE, depicted in Figure 3, predicts ratings given reviews and the corresponding AS-pairs. It first encodes language into embeddings, then learns explicit and implicit features, and finally computes the score regression. One distinctive feature of APRE is that it explicitly models the aspect information by incorporating a $d_a$-dimensional aspect representation $a_i \in \mathbb{R}^{d_a}$ in each side of the substructures for review encoding. Let $A^{(u)} = \{a^{(u)}_1, \ldots, a^{(u)}_k\}$ denote the $k$ aspect embeddings for users, and $A^{(t)}$ those for items. $k$ is decided by the number of unique aspects in the AS-pair set.

Language encoding  The reviews are encoded into low-dimensional token embedding sequences by a fixed pre-trained BERT (Devlin et al., 2019), a powerful transformer-based contextualized language encoder. For each review $r$ in $R_u$ or $R_t$, the resulting encoding $H^0 \in \mathbb{R}^{(|r|+2) \times d_e}$ consists of $(|r|+2)$ $d_e$-dimensional contextualized vectors: $H^0 = \{h^0_{[CLS]}, h^0_1, \ldots, h^0_{|r|}, h^0_{[SEP]}\}$.

Figure 3: Pipeline of APRE, including a user review encoder (orange dashed box) and an item review encoder (top blue box), each containing an implicit channel (left) and an aspect-based explicit channel (right). Token representation sequences from the contextualized language encoder are pooled per aspect (explicit) and per review (implicit), aggregated review-wise, and fed to the rating prediction $s_{u,t}$. Internal details of the item encoder are identical to those of the user encoder and hence omitted.

[CLS] and [SEP] are two special tokens indicating the starts and separators of sentences. We use a trainable linear transformation, $h^1_i = W_{ad}^{T} h^0_i + b_{ad}$, to adapt the BERT output representation $H^0$ to our task as $H^1$, where $W_{ad} \in \mathbb{R}^{d_e \times d_f}$, $b_{ad} \in \mathbb{R}^{d_f}$, and $d_f$ is the transformed dimension of the internal features. BERT encodes token semantics based upon the context, which resolves the polysemy of certain sentiment terms, e.g., "cheap" is positive for price but negative for quality. This step transforms the sentiment encoding to attention-property modeling.
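A sketch of this language-encoding step with HuggingFace Transformers and PyTorch: a frozen pre-trained BERT produces $H^0$ and a trainable linear layer adapts it to $H^1$; the model name and the dimension $d_f$ are illustrative choices.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class ReviewEncoder(nn.Module):
    def __init__(self, d_f=128, bert_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        for p in self.bert.parameters():      # BERT is fixed, not fine-tuned
            p.requires_grad = False
        d_e = self.bert.config.hidden_size    # d_e (768 for bert-base)
        self.adapt = nn.Linear(d_e, d_f)      # h^1_i = W_ad^T h^0_i + b_ad

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():
            h0 = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state  # H^0
        return self.adapt(h0)                 # H^1, shape (batch, |r|+2, d_f)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = ReviewEncoder()
batch = tokenizer(["Sound quality is superior and comfort is excellent."],
                  return_tensors="pt", padding=True, truncation=True)
print(encoder(batch["input_ids"], batch["attention_mask"]).shape)
```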

Explicit aspect-level attitude modeling  For aspect $a$ among the $k$ total aspects, we pull out all the contextualized representations of the sentiment words [2] that modify $a$, and aggregate their representations into a single embedding of aspect $a$ in $r$ as

$$h^{(a)}_{u,r} = \sum_j h^1_j, \quad w_j \in ST \cap r \text{ and } w_j \text{ modifies } a.$$

An observation by Chen et al. (2020b) suggests that users tend to use semantically consistent words for the same aspect in reviews. Therefore, sum-pooling can nicely handle both the sentiments and the frequencies of term mentions. Aspects that are not mentioned by $r$ have $h^{(a)}_{u,r} = 0$. To completely picture user $u$'s attention to all aspects, we aggregate all reviews from $u$, i.e., $R_u$, using review-wise aggregation weighted by $\alpha^{(a)}_{u,r}$ given in the equation below. $\alpha^{(a)}_{u,r}$ indicates the significance of each review's contribution to the overall understanding of $u$'s attention to aspect $a$:

$$\alpha^{(a)}_{u,r} = \frac{\exp\!\big(\tanh(w_{ex}^{T} [h^{(a)}_{u,r}; a^{(u)}])\big)}{\sum_{r' \in R_u} \exp\!\big(\tanh(w_{ex}^{T} [h^{(a)}_{u,r'}; a^{(u)}])\big)},$$

where $[\cdot;\cdot]$ denotes the concatenation of tensors and $w_{ex} \in \mathbb{R}^{d_f + d_a}$ is a trainable weight. With the usefulness distribution $\alpha^{(a)}_{u,r}$, we aggregate the $h^{(a)}_{u,r}$ of $r \in R_u$ by weighted average pooling:

$$g^{(a)}_u = \sum_{r \in R_u} \alpha^{(a)}_{u,r}\, h^{(a)}_{u,r}.$$

Now we obtain the user attention representation for aspect $a$, $g^{(a)}_u \in \mathbb{R}^{d_f}$. We use $G_u \in \mathbb{R}^{d_f \times k}$ to denote the matrix of the $g^{(a)}_u$. The item-tower architecture is omitted in Figure 3 since the item property modeling shares the identical computing procedure; it generates the item property representations $g^{(a)}_t$ of $G_t$. Mutual attention (Liu et al., 2019; Tay et al., 2018; Dong et al., 2020) is not utilized since the generation of the user attention encodings $G_u$ is independent of the item properties and vice versa.

[2] BERT uses the WordPiece tokenizer, which can break an out-of-vocabulary word into shorter word pieces. If a sentiment word is broken into word pieces, we use the representation of the first word piece produced.
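A minimal PyTorch sketch of this explicit channel, assuming a precomputed mask that marks which tokens are sentiment words modifying each aspect: sum-pooling yields $h^{(a)}_{u,r}$, and the $w_{ex}$-scored softmax over a user's reviews yields $G_u$. Shapes and names are illustrative.

```python
import torch
import torch.nn as nn

class ExplicitChannel(nn.Module):
    """Aspect-level sum-pooling followed by review-wise attention aggregation."""
    def __init__(self, d_f=128, d_a=32, k=64):
        super().__init__()
        self.aspect_emb = nn.Parameter(torch.randn(k, d_a))   # A^(u): k aspect embeddings
        self.w_ex = nn.Linear(d_f + d_a, 1, bias=False)        # w_ex in R^(d_f + d_a)

    def forward(self, h1, sent_mask):
        # h1:        (n_reviews, seq_len, d_f)  adapted token representations H^1
        # sent_mask: (n_reviews, k, seq_len)    1 where a token is a sentiment word of aspect a
        h_ar = torch.einsum("rks,rsd->rkd", sent_mask, h1)     # sum-pooling -> h^(a)_{u,r}
        a = self.aspect_emb.unsqueeze(0).expand(h_ar.size(0), -1, -1)
        scores = torch.tanh(self.w_ex(torch.cat([h_ar, a], dim=-1))).squeeze(-1)
        alpha = torch.softmax(scores, dim=0)                   # alpha^(a)_{u,r} over reviews
        return torch.einsum("rk,rkd->kd", alpha, h_ar)         # G_u: (k, d_f)

g_u = ExplicitChannel()(torch.randn(3, 20, 128), torch.zeros(3, 64, 20))
print(g_u.shape)  # torch.Size([64, 128])
```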

Implicit review representation  It is acknowledged by the existing works covered in Section 2 that implicit semantic modeling is critical because some emotions are conveyed without explicit sentiment word mentions. For example, "But this one feels like a pillow . . . " in R3 of Table 1 does not contain any sentiment tokens but expresses strong satisfaction with the comfortableness, which will be missed by the extractive annotation-based ASPE.

In APRE, we combine a global feature $h^1_{[CLS]}$, a local context feature $h_{cnn} \in \mathbb{R}^{n_c}$ learned by a convolutional neural network (CNN) with output channel size $n_c$ and kernel size $n_k$ followed by max pooling, and two token-level features, the average and max pooling of $H^1$, to build a comprehensive multi-granularity review representation $v_{u,r}$:

$$v_{u,r} = [h^1_{[CLS]};\, h_{cnn};\, \mathrm{MaxPool}(H^1);\, \mathrm{AvgPool}(H^1)],$$
$$h_{cnn} = \mathrm{MaxPool}(\mathrm{ReLU}(\mathrm{Conv1D}(H^1))).$$
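A sketch of the multi-granularity implicit representation $v_{u,r}$ above (the [CLS] vector, a 1-D CNN feature with max pooling, and the max/average pooling of $H^1$, concatenated); the kernel size and channel count are illustrative.

```python
import torch
import torch.nn as nn

class ImplicitReviewRepr(nn.Module):
    def __init__(self, d_f=128, n_c=100, n_k=3):
        super().__init__()
        self.conv = nn.Conv1d(d_f, n_c, kernel_size=n_k, padding=n_k // 2)

    def forward(self, h1):
        # h1: (batch, seq_len, d_f); position 0 holds the [CLS] token.
        h_cls = h1[:, 0, :]                                               # h^1_[CLS]
        h_cnn = torch.relu(self.conv(h1.transpose(1, 2))).max(-1).values  # local context feature
        h_max = h1.max(dim=1).values                                      # MaxPool(H^1)
        h_avg = h1.mean(dim=1)                                            # AvgPool(H^1)
        return torch.cat([h_cls, h_cnn, h_max, h_avg], dim=-1)            # v_{u,r}: 3*d_f + n_c

v = ImplicitReviewRepr()(torch.randn(4, 20, 128))
print(v.shape)  # torch.Size([4, 484])
```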


We apply review-wise aggregation without aspects for the latent review embedding $v_u$:

$$\beta_{u,r} = \frac{\exp\!\big(\tanh(w_{im}^{T} v_{u,r})\big)}{\sum_{r' \in R_u} \exp\!\big(\tanh(w_{im}^{T} v_{u,r'})\big)}, \qquad v_u = \sum_{r \in R_u} \beta_{u,r}\, v_{u,r},$$

where $\beta_{u,r}$ is the counterpart of $\alpha^{(a)}_{u,r}$ in the implicit channel, $w_{im} \in \mathbb{R}^{d_{im}}$ is a trainable parameter, and $d_{im} = 3 d_f + n_c$. Using similar steps, we can also obtain $v_t$ for the item implicit embeddings.

Rating regression and optimization  The implicit features $v_u$ and $v_t$ and the explicit features $G_u$ and $G_t$ compose the input to the rating predictor, which estimates the score $\hat{s}_{u,t}$ by

$$\hat{s}_{u,t} = \underbrace{b_u + b_t}_{\text{biases}} + \underbrace{F_{im}([v_u; v_t])}_{\text{implicit feature}} + \underbrace{\langle \gamma, F_{ex}([G_u; G_t]) \rangle}_{\text{explicit feature}}.$$

$F_{im}: \mathbb{R}^{2 d_{im}} \to \mathbb{R}$ and $F_{ex}: \mathbb{R}^{2 d_f \times k} \to \mathbb{R}^k$ are multi-layer fully-connected neural networks with ReLU activation and dropout to avoid overfitting. They model user-attention and item-property interactions in the implicit and explicit channels, respectively. $\langle \cdot, \cdot \rangle$ denotes the inner product. $\gamma \in \mathbb{R}^k$ and $\{b_u, b_t\} \in \mathbb{R}$ are trainable parameters. The optimization objective over the trainable parameter set $\Theta$ with an $L_2$ regularization weighted by $\lambda$ is

$$J(\Theta) = \sum_{r_{u,t} \in R} (\hat{s}_{u,t} - s_{u,t})^2 + L_2\text{-reg}(\lambda).$$

$J(\Theta)$ is optimized by back-propagation learning methods such as Adam (Kingma and Ba, 2014).
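A sketch of this rating head in PyTorch: bias terms plus the implicit MLP $F_{im}$ and the aspect-wise explicit MLP $F_{ex}$ weighted by $\gamma$, trained with the squared-error objective; the hidden sizes, the scalar (rather than per-user/per-item) biases, and the flattening of $[G_u; G_t]$ are illustrative simplifications.

```python
import torch
import torch.nn as nn

class RatingPredictor(nn.Module):
    def __init__(self, d_im=484, d_f=128, k=64):
        super().__init__()
        self.f_im = nn.Sequential(nn.Linear(2 * d_im, 128), nn.ReLU(),
                                  nn.Dropout(0.2), nn.Linear(128, 1))        # F_im
        self.f_ex = nn.Sequential(nn.Linear(2 * d_f * k, 256), nn.ReLU(),
                                  nn.Dropout(0.2), nn.Linear(256, k))        # F_ex
        self.gamma = nn.Parameter(torch.ones(k))                             # gamma in R^k
        self.b_u = nn.Parameter(torch.zeros(1))                              # user bias (scalar here)
        self.b_t = nn.Parameter(torch.zeros(1))                              # item bias (scalar here)

    def forward(self, v_u, v_t, g_u, g_t):
        implicit = self.f_im(torch.cat([v_u, v_t], dim=-1)).squeeze(-1)
        explicit = self.f_ex(torch.cat([g_u.flatten(1), g_t.flatten(1)], dim=-1))
        return self.b_u + self.b_t + implicit + (self.gamma * explicit).sum(-1)

model = RatingPredictor()
s_hat = model(torch.randn(2, 484), torch.randn(2, 484),
              torch.randn(2, 64, 128), torch.randn(2, 64, 128))
loss = nn.MSELoss()(s_hat, torch.tensor([5.0, 3.0]))   # squared-error part of J(Theta)
loss.backward()
```

In practice, the $L_2$ regularization can be realized as the optimizer's weight decay, e.g., `torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=lam)`.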

4 Experiments

4.1 Experimental Setup

Datasets  We use seven datasets from the Amazon Review Datasets (He and McAuley, 2016) [3]: AutoMotive (AM), Digital Music (DM), Musical Instruments (MI), Pet Supplies (PS), Sports and Outdoors (SO), Toys and Games (TG), and Tools and Home Improvement (TH). Their statistics are shown in Table 2.

We use an 8:1:1 train, validation, and test split for all experiments. Users and items with fewer than 5 reviews and reviews with fewer than 5 words are removed to reduce data sparsity.
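A sketch of the filtering and 8:1:1 split described above, using pandas; the column names and the single-pass (rather than iterative) filtering are assumptions for illustration.

```python
import pandas as pd

def preprocess(df, min_reviews=5, min_words=5, seed=42):
    """Drop short reviews and rare users/items, then split 8:1:1."""
    df = df[df["review_text"].str.split().str.len() >= min_words]
    df = df[df.groupby("user_id")["user_id"].transform("size") >= min_reviews]
    df = df[df.groupby("item_id")["item_id"].transform("size") >= min_reviews]
    df = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n = len(df)
    return (df.iloc[: int(0.8 * n)],              # train
            df.iloc[int(0.8 * n): int(0.9 * n)],  # validation
            df.iloc[int(0.9 * n):])               # test
```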

Baseline models  Thirteen baselines in the traditional and deep learning categories are compared with the proposed framework. The pre-deep-learning traditional approaches predict ratings solely based upon the entity IDs. Table 3 introduces their basic profiles, which are extended in Section A.3.3†. Specifically, AHN-B refers to AHN using pretrained BERT as the input embedding encoder; it is included to test the impact of the input encoders.

Evaluation metric  We use Mean Square Error (MSE) for performance evaluation. Given a test set $R_{test}$, the MSE is defined by

$$\mathrm{MSE} = \frac{1}{|R_{test}|} \sum_{(u,t) \in R_{test}} (\hat{s}_{u,t} - s_{u,t})^2.$$

Reproducibility  We provide instructions to reproduce the AS-pair extraction of ASPE and the rating prediction of the baselines and APRE in Section A.3.1†. The source code of our models is publicly available on GitHub [4].

4.2 AS-pair Extraction of ASPE

We present the extraction performance of unsupervised ASPE. The distributions of the frequencies of extracted AS-pairs in Figure 5 follow the trend of Zipf's Law with a deviation common to natural languages (Li, 1992), meaning that ASPE performs consistently across domains. We show the qualitative results of term extraction separately.

Sentiment terms  Generally, the AS-pair statistics given in Table 9† on the different datasets are quantitatively consistent with the data statistics in Table 2† regardless of domain. Figure 4 is a Venn diagram showing the sources of the sentiment terms extracted by ASPE from AM. All three methods are efficacious and contribute uniquely, which can also be verified by Table 10† in Section A.3.2†.

Aspect terms  Table 4 presents the most frequent aspect terms of all datasets. ItemTok is ranked top as users tend to describe overall feelings about items. Domain-specific terms (e.g., car in AM) and general terms (e.g., price, quality, and size) are intermingled, illustrating the comprehensive coverage and the high accuracy of the results of ASPE.

4.3 Rating Prediction of APRE

[3] https://jmcauley.ucsd.edu/data/amazon  [4] https://github.com/zyli93/ASPE-APRE


| Dataset | Abbr. | #Reviews | #Users | #Items | Density | Ttl. #W | #R/U | #R/T | #W/R |
| AutoMotive | AM | 20,413 | 2,928 | 1,835 | 3.419×10⁻³ | 1.77M | 6.274 | 10.011 | 96.583 |
| Digital Music | DM | 111,323 | 14,138 | 11,707 | 6.053×10⁻⁴ | 5.69M | 7.087 | 8.558 | 56.828 |
| Musical Instruments | MI | 10,226 | 1,429 | 900 | 7.156×10⁻³ | 0.96M | 6.440 | 10.226 | 103.958 |
| Pet Supplies | PS | 157,376 | 19,854 | 8,510 | 8.383×10⁻⁴ | 14.23M | 7.134 | 16.644 | 100.469 |
| Sports and Outdoors | SO | 295,434 | 35,590 | 18,357 | 4.070×10⁻⁴ | 26.38M | 7.471 | 14.484 | 99.199 |
| Toys and Games | TG | 167,155 | 19,409 | 11,924 | 6.500×10⁻⁴ | 17.16M | 7.751 | 12.616 | 114.047 |
| Tools and Home improv. | TH | 134,129 | 16,633 | 10,217 | 7.103×10⁻⁴ | 15.02M | 7.258 | 11.815 | 124.429 |

Table 2: The statistics of the seven real-world datasets. (W: Words; U: Users; T: iTems; R: Reviews.)

| Model | Reference | Cat. | U/T ID | Review |
| MF | - | Trad. | X | |
| WRMF | Hu et al. (2008) | Trad. | X | |
| FM | Rendle (2010) | Trad. | X | |
| ConvMF | Kim et al. (2016) | Deep | X | X |
| NeuMF | He et al. (2017b) | Deep | X | |
| D-CNN | Zheng et al. (2017) | Deep | | X |
| D-Attn | Seo et al. (2017) | Deep | | X |
| NARRE | Chen et al. (2018) | Deep | X | X |
| ANR | Chin et al. (2018) | Deep | | X |
| MPCN | Tay et al. (2018) | Deep | X | X |
| DAML | Liu et al. (2019) | Deep | | X |
| AHN | Dong et al. (2020) | Deep | X | X |
| AHN-B | Same as AHN | Deep | X | X |

Table 3: Basics of compared baselines. Models' input is marked by "X". "U" and "T" denote Users and iTems. D-CNN represents DeepCoNN. AHN-B denotes the variant of AHN with BERT embeddings.

Figure 4: Venn diagram of the sources of sentiment terms from AM (PMI terms, NN terms, and Lexicon).

Figure 5: Frequency rank vs. frequency of AS-pairs (log-log) for AM, DM, MI, PS, SO, TG, and TH.

Comparisons with baselines  For the task of review-based rating prediction, a percentage increase above 1% in performance is considered significant (Chin et al., 2018; Tay et al., 2018). According to Table 5, our model outperforms all baseline models, including AHN-B, on all datasets, by a minimum of 1.337% on MI and a maximum of 4.061% on TG; these are significant improvements. This demonstrates that (1) our model has a superior capability to make accurate rating predictions in different domains (Ours vs. the rest), and (2) the performance improvement is NOT due to the use of BERT (Ours vs. AHN-B).

| AM | DM | MI | PS | SO | TG | TH |
| ItemTok | song | ItemTok | ItemTok | ItemTok | ItemTok | ItemTok |
| product | ItemTok | sound | dog | knife | toy | light |
| time | album | guitar | food | quality | game | tool |
| car | music | string | cat | product | piece | quality |
| look | time | quality | toy | size | quality | price |
| price | sound | tone | time | price | child | product |
| quality | voice | price | product | look | color | bulb |
| light | track | pedal | price | bag | part | battery |
| oil | lyric | tuner | treat | fit | fun | size |
| battery | version | cable | water | light | size | flashlight |

Table 4: High-frequency aspects of the corpora.

AHN-B underperforms the original word2vec-based AHN [5] because the weights of the word2vec vectors are trainable while the BERT embeddings are fixed, which reduces the parameter capacity. Among the baseline models, deep-learning-based models are generally stronger than entity ID-based traditional methods, and more recent ones tend to perform better.

Ablation study  The ablation studies answer the question: which channel, explicit or implicit, contributes to the superior performance, and to what extent? We measure their contributions with the rows w/o EX and w/o IM in Table 5. w/o EX presents the best MSEs of an APRE variant without explicit features under the default settings; the impact of AS-pairs is nullified. w/o IM, in contrast, shows the best MSEs of an APRE variant leveraging only the explicit channel while removing the implicit one. We observe that the optimal performances of the single-channel variants all fall behind those of the dual-channel model, which reflects positive contributions from both channels. w/o IM has lower MSEs than w/o EX on several datasets, showing that the explicit channel supplies comparatively more performance improvement than the implicit channel.

[5] The authors of AHN also confirmed this observation.


| Models | AM | DM | MI | PS | SO | TG | TH |

TRADITIONAL MODELS
| MF | 1.986 | 1.715 | 2.085 | 2.048 | 2.084 | 1.471 | 1.631 |
| WRMF | 1.327 | 0.537 | 1.358 | 1.629 | 1.371 | 1.068 | 1.216 |
| FM | 1.082 | 0.436 | 1.146 | 1.458 | 1.212 | 0.922 | 1.050 |

DEEP LEARNING-BASED MODELS
| ConvMF | 1.046 | 0.407 | 1.075 | 1.458 | 1.026 | 0.986 | 1.104 |
| NeuMF | 0.901 | 0.396 | 0.903 | 1.294 | 0.893 | 0.841 | 1.072 |
| D-Attn | 0.816 | 0.403 | 0.835 | 1.264 | 0.897 | 0.887 | 0.980 |
| D-CNN | 0.809 | 0.390 | 0.861 | 1.250 | 0.894 | 0.835 | 0.975 |
| NARRE | 0.826 | 0.374 | 0.837 | 1.425 | 0.990 | 0.908 | 0.958 |
| MPCN | 0.815 | 0.447 | 0.842 | 1.300 | 0.929 | 0.898 | 0.969 |
| ANR | 0.806 | 0.381 | 0.845 | 1.327 | 0.906 | 0.844 | 0.981 |
| DAML | 0.829 | 0.372 | 0.837 | 1.247 | 0.893 | 0.820 | 0.962 |
| AHN-B | 0.810 | 0.385 | 0.840 | 1.270 | 0.896 | 0.829 | 0.976 |
| AHN | 0.802 | 0.376 | 0.834 | 1.252 | 0.887 | 0.822 | 0.967 |

OUR MODELS AND PERCENTAGE IMPROVEMENTS
| Ours | 0.791 | 0.359 | 0.823 | 1.218 | 0.863 | 0.788 | 0.936 |
| ∆(%) | 1.390 | 3.621 | 1.337 | 2.381 | 2.784 | 4.061 | 2.350 |
| Val. | 0.790 | 0.362 | 0.821 | 1.216 | 0.860 | 0.790 | 0.933 |

ABLATION STUDIES
| w/o EX | 0.814 | 0.379 | 0.833 | 1.244 | 0.882 | 0.796 | 0.965 |
| w/o IM | 0.798 | 0.374 | 0.863 | 1.226 | 0.873 | 0.798 | 0.956 |

Table 5: MSE of baselines, our model (Ours for test and Val. for validation), and variants. The ∆ row gives the percentage improvements over the best baselines. All reported improvements over the best baselines are statistically significant with p-value < 0.01.

It also suggests that the costly latent review encoding can be less effective than the aspect-sentiment-level user and item profiling, which is a useful finding.

Hyper-parameter sensitivity  A number of hyper-parameter settings are of interest, e.g., dropout, learning rate (LR), the internal feature dimensions ($d_a$, $d_f$, $n_c$, and $n_k$), and the regularization weight $\lambda$ of the $L_2$-reg in $J(\Theta)$. We run each set of sensitivity-search experiments 10 times and report the average performances. We tune the dropout rate in [0, 0.1, 0.2, 0.3, 0.4, 0.5] and the LR [6] in [0.0001, 0.0005, 0.001, 0.005, 0.01] with the other hyper-parameters set to default, and report in Figure 6 the minimum MSEs and the epoch numbers (Ep.) on AM. For dropout, we find the balance between its effects on avoiding overfitting and reducing active parameters at 0.2; larger dropout rates need more training epochs. For LR, we also target a balance between the training instability of large LRs and the overfitting concern of small LRs, so 0.001 is selected. Larger LRs plateau earlier with fewer epochs while smaller LRs plateau later with more.

[6] The reported LRs are initial values, since Adam and a LR scheduler adjust the LR dynamically during training.

Figure 6: Hyper-parameter searching and sensitivity on AM: (a) dropout rate vs. MSE and number of epochs; (b) initial learning rate vs. MSE and number of epochs.

| Dataset | AM | DM | MI | PS | SO | TG | TH |
| Run time | 127s* | 31min | 90s* | 36min | 90min | 51min | 35min |

Table 6: Per-epoch run time of APRE on the seven datasets. The run time of AM and MI, denoted by "*", is disproportional to their sizes since they can fit into the GPU memory for acceleration.

Figure 7† analyzes the hyper-parameter sensitivities to changes in the internal feature dimensions ($d_a$, $d_f$, and $n_c$), the CNN kernel size $n_k$, and the $L_2$-reg weight $\lambda$.

Efficiency  A brief run-time analysis of APRE is given in Table 6. The model runs fast when all data fit in GPU memory, as with AM and MI, which demonstrates the efficiency of our model and the room for improvement in the run time of datasets that cannot fit in GPU memory. The efficiency of ASPE is less critical since it only runs once for each dataset.

4.4 Case Study for Interpretation

Finally, we showcase an interpretation procedure for the rating estimation of an instance in AM: how does APRE predict $u^*$'s rating for a smart driving assistant $t^*$ using the output AS-pairs of ASPE? We select seven example aspect categories with all review snippets mentioning those categories. Each category is a set of similar aspect terms, e.g., {look, design} and {beep, sound}. Without loss of generality, we refer to the categories as aspects. Table 7 presents the aspects and review snippets given by $u^*$ and received by $t^*$ with AS-pair annotations. Three aspects, {battery, install, look}, are shared (yellow rows). Each side has two unique aspects never mentioned by the reviews of the other side: {materials, smell} of $u^*$ (green rows) and {price, sound} of $t^*$ (blue rows).

APRE measures the aspect-level contributions of user-attention and item-property interactions with the last term of the $s_{u,t}$ prediction, i.e., $\langle \gamma, F_{ex}([G_u; G_t]) \rangle$.


The contribution on the $i$-th aspect is calculated as the $i$-th dimension of $\gamma$ times the $i$-th value of $F_{ex}([G_u; G_t])$, which is shown in Table 8. The top two rows summarize the attentions of $u^*$ and the properties of $t^*$. Inferred Impact states the interactional effects of user attentions and item properties based on our assumption that attended aspects bear stronger impacts on the final prediction. On the overlapping aspects, the inferior property of battery produces the only negative score (-0.008), whereas the advantages on install and look create positive scores (0.019 and 0.015), which is consistent with the inferred impact. Other aspects, unknown either to the user attentions or to the item properties, contribute relatively less: $t^*$'s unappealing price accounts for the small score 0.009 and the mixed property of sound accounts for the 0.006.

This case study demonstrates the usefulness of the numbers that add up to $s_{u,t}$. Although small in scale, they carry significant information about valued or disliked aspects in $u^*$'s perception of $t^*$. This decomposition is a great way to interpret model predictions at an aspect-level granularity, a capacity that the baseline models do not enjoy.
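A sketch of this aspect-level decomposition, reusing the `RatingPredictor` sketched earlier: each aspect's contribution is $\gamma_i \cdot F_{ex}([G_u; G_t])_i$; the aspect-index mapping and the tensors `g_u` and `g_t` are hypothetical stand-ins.

```python
import torch

# Hypothetical indices of the seven example aspect categories in the k-aspect vocabulary.
example_aspects = {"material": 10, "smell": 23, "battery": 3, "install": 7,
                   "look": 12, "price": 1, "sound": 5}

g_u = torch.randn(1, 64, 128)   # explicit user representation G_u for u*
g_t = torch.randn(1, 64, 128)   # explicit item representation G_t for t*

with torch.no_grad():
    explicit = model.f_ex(torch.cat([g_u.flatten(1), g_t.flatten(1)], dim=-1)).squeeze(0)
    contributions = model.gamma * explicit        # gamma_i * F_ex(.)_i for each aspect

for name, idx in example_aspects.items():
    print(f"{name:>10s}: {contributions[idx].item():+.3f}")
```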

In Section A.3.5†, another case study indicates that an imperfect item property that receives no user attention affects the rating only marginally, even though the aspect is mentioned in the user's reviews.

5 Conclusion

In this work, we propose a tightly coupled two-stage review-based rating predictor, consisting of an Aspect-Sentiment Pair Extractor (ASPE) and an Attention-Property-aware Rating Estimator (APRE). ASPE extracts aspect-sentiment pairs (AS-pairs) from reviews, and APRE learns explicit user attentions and item properties as well as implicit sentence semantics to predict the rating. Extensive quantitative and qualitative experimental results demonstrate that ASPE accurately and comprehensively extracts AS-pairs without using domain-specific training data, and that APRE outperforms the state-of-the-art recommender frameworks and explains its prediction results by taking advantage of the extracted AS-pairs.

Several challenges are left open, such as fully or weakly supervised open-domain AS-pair extraction and an end-to-end design for AS-pair extraction and rating prediction. We leave these problems for future work.

From reviews given by user u*. All aspects attended (✓).

battery — [To t1] After leaving this attached to my car for two days of non-use I have a dead battery. Never had a dead battery . . . , so I am blaming this device.
install — [To t2] This was unbelievably easy to install. I have done . . . . The real key . . . the installation is so easy. [To t3] There were many installation options, but once . . . , they clicked on easily.
look — [To t3] It was not perfect and not shiny, but it did look better. [To t4] It takes some elbow grease, but the results are remarkable.
material — [To t5] The plastic however is very thin and the cap is pretty cheap. [To t6] Great value. . . . . They are very hard plastic, so they don't mark up panels.
smell — [To t7] This has a terrible smell that really lingers awhile. It goes on green. . . .

From reviews received by item t*.

battery — [From u1] The reason this won't work on an iPhone 4 or . . . because it uses low power Bluetooth, . . . . (✗)
install — [From u2] Your mileage and gas mileage and cost of fuel is tabulated for each trip - Installation is pretty simple - but it . . . . (✓)
look — [From u3] Driving habits, fuel efficiency, and engine health are nice features. The overall design is nice and easy to navigate. (✓)
price — [From u4] In fact, there are similar products to this available at a much lower price that do work with . . . (✗)
sound — [From u5] The Link device makes an audible sound when you go over 70 mpg, brake hard, or accelerate too fast. (✓) [From u6] Also, the beep the link device makes . . . sounds really cheapy. (✗)

Table 7: Examples of reviews given by u* and received by t* with Aspect-Sentiment pair mentions as well as other sentiment evidence on seven example aspects.

| Aspects | material | smell | battery | install | look | price | sound |
| Attn. of u* | ✓ | ✓ | ✓ | ✓ | ✓ | n/a | n/a |
| Prop. of t* | n/a | n/a | ✗ | ✓ | ✓ | ✗ | ✓/✗ |
| Inferred Impact | Unk. | Unk. | Neg. | Pos. | Pos. | Unk. | Unk. |
| γᵢFex(·)ᵢ (×10⁻²) | 1.0 | 0.8 | -0.8 | 1.9 | 1.5 | 0.9 | 0.6 |

Table 8: Attention and property summaries, inferred impacts, and the learned aspect-level contributions.

Acknowledgement

We would like to thank the reviewers for their helpful comments. The work was partially supported by NSF DGE-1829071 and NSF IIS-2106859.

Broader Impact Statement

This paper proposes a rating prediction model that has great potential to be widely applied to recommender systems with reviews due to its high accuracy. In the meantime, it tries to relieve the unjustifiability issue of black-box neural networks by suggesting what aspects of an item a user may feel satisfied or dissatisfied with.


The recommender system can better understand the rationale behind users' reviews so that the merits of items can be carried forward while the defects can be fixed. As far as we are concerned, this work is the first that addresses both rating prediction and rationale understanding by utilizing NLP techniques.

We then address generalizability and deployment. The reported experiments cover multiple English-language domains with distinct review styles and diverse user populations, and our model performs consistently across them, which supports its generalizability. Ranging from smaller to larger datasets, we have not observed any deployment issues. Instead, we note that stronger computational resources can greatly speed up training and inference and scale up the problem size while keeping the main execution pipeline unchanged.

In terms of potential harms and misuses, we believe the consequences involve two perspectives: (1) the harm of generating inaccurate or suboptimal recommendations; (2) the risk of misusing (attacking) the model to reveal user identity. For point (1), suboptimal results have little impact on the main functions of online shopping websites since recommenders only provide suggestive content. For point (2), our model does not model user and item IDs. Moreover, we aggregate user reviews in the representation space, so user identity is hard to infer through reverse-engineering attacks. Overall, we believe our model poses little risk of causing dysfunction of online shopping platforms or leaking user identities.

References

Stefanos Angelidis and Mirella Lapata. 2018. Summarizing opinions: Aspect extraction meets sentiment prediction and they are both weakly supervised. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the Seventh International Conference on Language Resources and Evaluation.

Konstantin Bauman, Bing Liu, and Alexander Tuzhilin. 2017. Aspect based recommendations: Recommending items with the most valuable aspects based on user reviews. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research.

Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2018. Neural attentional rating regression with review-level explanations. In Proceedings of the 2018 World Wide Web Conference.

Shaowei Chen, Jie Liu, Yu Wang, Wenzheng Zhang, and Ziming Chi. 2020a. Synchronous double-channel recurrent network for aspect-opinion pair extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Xiao Chen, Changlong Sun, Jingjing Wang, Shoushan Li, Luo Si, Min Zhang, and Guodong Zhou. 2020b. Aspect sentiment classification with document-level sentiment preference modeling. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Jin Yao Chin, Kaiqi Zhao, Shafiq Joty, and Gao Cong. 2018. ANR: Aspect-based neural recommender. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management.

Hongliang Dai and Yangqiu Song. 2019. Neural aspect and opinion term extraction with mined rules as weak supervision. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT.

Xin Dong, Jingchao Ni, Wei Cheng, Zhengzhang Chen, Bo Zong, Dongjin Song, Yanchi Liu, Haifeng Chen, and Gerard de Melo. 2020. Asymmetrical hierarchical networks with attentive interactions for interpretable review-based recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence.

Chunning Du, Haifeng Sun, Jingyu Wang, Qi Qi, and Jianxin Liao. 2020. Adversarial and domain-aware BERT for cross-domain sentiment analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Xinyu Guan, Zhiyong Cheng, Xiangnan He, Yongfeng Zhang, Zhibo Zhu, Qinke Peng, and Tat-Seng Chua. 2019. Attentive aspect modeling for review-aware recommendation. ACM Transactions on Information Systems (TOIS), 37(3):1–27.


Ruidan He, Wee Sun Lee, Hwee Tou Ng, and Daniel Dahlmeier. 2017a. An unsupervised neural attention model for aspect extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web.

Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017b. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web.

Yunfeng Hou, Ning Yang, Yi Wu, and Philip S. Yu. 2019. Explainable recommendation with fusion of aspect information. World Wide Web, 22(1):221–240.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE International Conference on Data Mining.

Chunli Huang, Wenjun Jiang, Jie Wu, and Guojun Wang. 2020a. Personalized review recommendation based on users' aspect sentiment. ACM Transactions on Internet Technology (TOIT), 20(4):1–26.

Jiaxin Huang, Yu Meng, Fang Guo, Heng Ji, and Jiawei Han. 2020b. Weakly-supervised aspect-based sentiment analysis via joint aspect-sentiment topic embedding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Donghyun Kim, Chanyoung Park, Jinoh Oh, Sungyoung Lee, and Hwanjo Yu. 2016. Convolutional matrix factorization for document context-aware recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Wentian Li. 1992. Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6).

Donghua Liu, Jing Li, Bo Du, Jun Chang, and Rong Gao. 2019. DAML: Dual attention mutual learning between ratings and reviews for item recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Huaishao Luo, Tianrui Li, Bing Liu, and Junbo Zhang. 2019. DOER: Dual cross-shared RNN for aspect term-polarity co-extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Dehong Ma, Sujian Li, Fangzhao Wu, Xing Xie, and Houfeng Wang. 2019. Exploring sequence-to-sequence learning in aspect term extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM.

Minh Hieu Phan and Philip O. Ogunbona. 2020. Modelling context and syntactical features for aspect-based sentiment analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. 2015. SemEval-2015 task 12: Aspect based sentiment analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015).

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014).

Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International Conference on Data Mining.

Sungyong Seo, Jing Huang, Hao Yang, and Yan Liu. 2017. Interpretable convolutional neural networks with dual local and global attention for review rating prediction. In Proceedings of the 11th ACM Conference on Recommender Systems.

Lei Shu, Hu Xu, and Bing Liu. 2017. Lifelong learning CRF for supervised aspect extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Hao Tang, Donghong Ji, Chenliang Li, and Qiji Zhou. 2020. Dependency graph enhanced dual-transformer structure for aspect-based sentiment classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Yi Tay, Anh Tuan Luu, and Siu Cheung Hui. 2018. Multi-pointer co-attention networks for recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Hao Tian, Can Gao, Xinyan Xiao, Hao Liu, Bolei He, Hua Wu, Haifeng Wang, and Feng Wu. 2020. SKEP: Sentiment knowledge enhanced pre-training for sentiment analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.


Stéphan Tulkens and Andreas van Cranenburgh. 2020. Embarrassingly simple unsupervised aspect extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Peter D. Turney. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics.

Kai Wang, Weizhou Shen, Yunyi Yang, Xiaojun Quan, and Rui Wang. 2020. Relational graph attention network for aspect-based sentiment analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Shuai Wang, Sahisnu Mazumder, Bing Liu, Mianwei Zhou, and Yi Chang. 2018. Target-sensitive memory networks for aspect sentiment classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Wenya Wang and Sinno Jialin Pan. 2018. Recursive neural structural correspondence network for cross-domain aspect and opinion co-extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Zhenkai Wei, Yu Hong, Bowei Zou, Meng Cheng, and Jianmin Yao. 2020. Don't eclipse your arts due to small discrepancies: Boundary repositioning with a pointer network for aspect extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Hu Xu, Bing Liu, Lei Shu, and Philip Yu. 2020. DomBERT: Domain-oriented language model for aspect-based sentiment analysis. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. 2018. Double embeddings and CNN-based sequence labeling for aspect extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Chen Zhang, Qiuchi Li, and Dawei Song. 2019. Aspect-based sentiment classification with aspect-specific graph convolutional networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

Lei Zheng, Vahid Noroozi, and Philip S. Yu. 2017. Joint deep modeling of users and items using reviews for recommendation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining.


A Supplementary Materials

A.1 Introduction

This document is the Supplementary Materials for Recommend for a Reason: Unlocking the Power of Unsupervised Aspect-Sentiment Co-Extraction. It provides supporting material that is important but could not be fully covered in the main paper due to page limits.

A.2 Methods

A.2.1 Pseudocode of ASPE

Although Section 3.2 is self-explanatory, we detail the AS-pair generation process of Section 3.2.2 in Algorithm 1. It leverages the sentiment term set ST obtained in Section 3.2.1, a dependency parser, and WordNet synsets to build AS-pairs.

Algorithm 1: AS-pair Generation
Input: Sentiment terms ST, dependency parser DepParser, frequency threshold c.
Data: Review-rating corpus R; WordNet with synsets.
Output: AS-pairs.

/* Initialize the AS-pair candidate and AS-pair sets */
AS-cand, AS-pairs ← ∅, ∅
/* Extract AS-pair candidates */
foreach review r ∈ R do
    dep-graph_r ← DepParser(r)
    foreach dependency relation r_dep in dep-graph_r do
        if r_dep is nsubj+acomp or r_dep is amod then
            Add the corresponding (noun, adj.) tuple to AS-cand (Figure 2)
/* Merge synonym aspects */
foreach (noun, adj.) tuple ∈ AS-cand do
    MergeSynAspect(synsets, noun)
/* Filter out non-AS-pairs by ST and frequency threshold c */
foreach (noun, adj.) tuple ∈ AS-cand do
    if adj. ∈ ST and Freq[noun] > c then
        Add (noun, adj.) to AS-pairs
return AS-pairs
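As a concrete illustration, the following minimal Python sketch mirrors Algorithm 1 using spaCy for dependency parsing and NLTK's WordNet for the synonym-merging step. It assumes the en_core_web_sm model and the NLTK wordnet corpus are installed; the helper merge_syn_aspect and the exact matching rules are simplifications for illustration, not the released implementation.

from collections import Counter

import spacy
from nltk.corpus import wordnet as wn

nlp = spacy.load("en_core_web_sm")

def merge_syn_aspect(noun):
    # Map a noun to the first lemma of its first WordNet noun synset,
    # a simple stand-in for the synonym-merging step of Algorithm 1.
    synsets = wn.synsets(noun, pos=wn.NOUN)
    return synsets[0].lemmas()[0].name() if synsets else noun

def extract_as_pairs(reviews, sentiment_terms, c=50):
    # Collect (aspect, sentiment) candidates from amod and nsubj+acomp
    # relations, then keep pairs whose adjective is a known sentiment term
    # and whose (merged) aspect noun appears more than c times.
    candidates = []
    for doc in nlp.pipe(reviews):
        for tok in doc:
            # Pattern 1: adjectival modifier, e.g., "great sound".
            if tok.dep_ == "amod" and tok.head.pos_ == "NOUN":
                candidates.append((merge_syn_aspect(tok.head.lemma_), tok.lemma_))
            # Pattern 2: nsubj + acomp, e.g., "the sound is great".
            if tok.dep_ == "acomp":
                for subj in (t for t in tok.head.children if t.dep_ == "nsubj"):
                    candidates.append((merge_syn_aspect(subj.lemma_), tok.lemma_))
    noun_freq = Counter(noun for noun, _ in candidates)
    return {(noun, adj) for noun, adj in candidates
            if adj in sentiment_terms and noun_freq[noun] > c}

# Toy usage with a one-review corpus and a tiny sentiment-term set.
print(extract_as_pairs(["The sound is great but the microphone feels cheap."],
                       sentiment_terms={"great", "cheap"}, c=0))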

A.3 Experiments

This section provides additional content on the experiments, including a detailed experimental setup, instructions for reproducing the baselines and our model, supplemental experimental results, and another case study. We hope this content helps readers gain deeper insight into the performance of the proposed framework.

A.3.1 Reproducibility of ASPE and APRE

ASPE+APRE is implemented in Python (3.6.8) with PyTorch (1.5.0) and runs on a single 12GB Nvidia Titan Xp GPU. The code is available on GitHub7, together with comprehensive instructions on how to reproduce our model. The default hyper-parameter settings for the results in Section 4.3 are as follows:

ASPE In the AS-pair extraction stage, we set the size of ctx to 5 and the PMI term quota q to 400 for both polarities. The counting thresholds c for different datasets are given in Table 9. SDRN (Chen et al., 2020a), used for term extraction, is trained under the default settings of its source code8 with the SemEval 14/15 datasets mentioned in Section 2. spaCy9, a Python package specialized in NLP algorithms, provides the dependency parsing pipeline.
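For reference, a bare-bones version of the PMI-based term scoring can be sketched as follows. The exact formulation is described in Section 3.2.1 of the main paper; this sketch follows the classic SO-PMI scoring of Turney (2002) against small seed-word sets, so the scoring function and seed sets are assumptions, while the window size ctx and the per-polarity quota q follow the settings above.

import math
from collections import Counter

def so_pmi_terms(tokenized_reviews, pos_seeds, neg_seeds, ctx=5, q=400):
    # Count unigrams and co-occurrences within a +/- ctx token window.
    word_count, pair_count = Counter(), Counter()
    for tokens in tokenized_reviews:
        word_count.update(tokens)
        for i, w in enumerate(tokens):
            for v in tokens[i + 1:i + 1 + ctx]:
                pair_count[(w, v)] += 1
                pair_count[(v, w)] += 1
    total = sum(word_count.values())

    def pmi(w, v):
        joint = pair_count[(w, v)]
        if joint == 0:
            return 0.0
        return math.log(joint * total / (word_count[w] * word_count[v]))

    # SO-PMI: association with positive seeds minus association with negative seeds.
    score = {w: sum(pmi(w, s) for s in pos_seeds) - sum(pmi(w, s) for s in neg_seeds)
             for w in word_count}
    ranked = sorted(score, key=score.get)
    return set(ranked[-q:]), set(ranked[:q])  # (positive terms, negative terms)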

APRE In the rating prediction stage, we use a pre-trained BERT model with 4 layers, 4 heads, and 256 hidden dimensions ("BERT-mini") for manageable GPU memory consumption. The BERT parameters (weights) are fixed. The BERT tokenizer and model are loaded from the Hugging Face model repository10. The initial learning rate is set to 0.001 with two adjusting mechanisms: (1) the Adam optimizer with (β1, β2) = (0.9, 0.999) (the default setting in PyTorch); (2) a learning rate scheduler, StepLR, with step size 3 and gamma 0.8. Dropout is set to 0.2 for both towers. df, da, and nc are all set to 200 for consistency. The CNN kernel size is 4. The L2-reg weight λ is set globally to 0.0001. We use a clamp function to constrain predictions to the interval (1.0, 5.0).

7 https://github.com/zyli93/ASPE-APRE
8 https://github.com/chenshaowei57/SDRN
9 https://spacy.io
10 https://huggingface.co/google/bert_uncased_L-4_H-256_A-4
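To make the configuration concrete, the sketch below wires up a frozen BERT-mini encoder, the Adam optimizer, the StepLR scheduler, and the output clamp as listed above. APREStub is an illustrative placeholder rather than the released APRE architecture, and using weight_decay for the global L2 weight λ is an assumption about how the regularization is applied.

import torch
import torch.nn as nn
from transformers import AutoModel

BERT_NAME = "google/bert_uncased_L-4_H-256_A-4"  # "BERT-mini"

class APREStub(nn.Module):
    # Placeholder standing in for the full APRE network; it only shows how a
    # frozen BERT encoder and a small rating head are composed and clamped.
    def __init__(self, d_f=200, dropout=0.2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(BERT_NAME)
        for p in self.bert.parameters():  # BERT weights are kept fixed
            p.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size, d_f),
            nn.Dropout(dropout),
            nn.Linear(d_f, 1),
        )

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        rating = self.head(h.last_hidden_state[:, 0]).squeeze(-1)
        return torch.clamp(rating, 1.0, 5.0)  # constrain to the valid rating range

model = APREStub()
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-4,  # weight_decay stands in for λ
)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.8)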


A.3.2 ASPE: Additional Experimental Results of AS-pair Extraction

We present in Table 9 the statistics of the extracted AS-pairs of the corpora, which are quantitatively consistent with the data statistics in Table 2 regardless of domain.

Data  c    #AS-pairs/R  #A/U    #A/T    #A   #S
AM    50   3.076        12.681  16.284  291  8,572
DM    100  1.973        5.792   8.380   296  9,781
MI    50   3.358        12.521  16.323  167  8,143
PS    150  3.445        14.886  23.893  529  12,563
SO    250  4.078        19.401  28.314  747  17,195
TG    150  4.482        19.053  26.657  680  13,972
TH    150  5.235        22.833  29.816  659  14,145

Table 9: Statistics of unsupervised AS-pair extraction. c: frequency threshold; R: reviews; U: users; T: items.

We provide Table 10 as ancillary to the Venn diagram in Figure 4 and the corresponding conclusion in Section 4.2. Table 10 illustrates the contributions of the three sentiment term extraction methods discussed in Section 3.2, namely the PMI-based, neural network-based, and lexicon-based methods. All three methods extract useful sentiment-carrying words in the Automotive domain, and no single method's contribution subsumes the others', which underlines the necessity of combining unsupervised methods for term extraction in the domain-general usage scenario. Together they provide comprehensive coverage of sentiment terms in AM.
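The seven regions summarized in Table 10 are plain set differences and intersections over the three term sets. Using the first row of Table 10 as toy stand-ins for the full P, N, and L sets:

# Toy stand-ins built from the first row of Table 10.
P = {"countless", "ultimate", "uplifting", "amazing"}   # PMI-based terms
N = {"therapeutic", "ultimate", "dazzling", "amazing"}  # neural-network terms
L = {"fateful", "uplifting", "dazzling", "amazing"}     # lexicon terms

regions = {
    "P only": P - N - L,
    "N only": N - P - L,
    "L only": L - P - N,
    "P ∩ N \\ L": (P & N) - L,
    "P ∩ L \\ N": (P & L) - N,
    "N ∩ L \\ P": (N & L) - P,
    "P ∩ N ∩ L": P & N & L,
}
for name, terms in regions.items():
    print(name, sorted(terms))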

A.3.3 APRE: Information of Baselines

We introduce the baseline models mentioned in Table 3, including the source code of the software and the key parameter settings. For fairness of comparison, we only compare against models that have open-source implementations.

MF, WRMF, FM, and NeuMF11 Matrix factorization views user-item ratings as a matrix with missing values. By factorizing the matrix with the known values, it recovers the missing values as predictions. Weighted Regularized MF (Hu et al., 2008) assigns different weights to the values in the matrix. Factorization machines (Rendle, 2010) consider additional second-order feature interactions of users and items. Neural MF (He et al., 2017b) is a combination of generalized MF (GMF) and a multilayer perceptron (MLP). Hyper-parameter settings: The number of factors is 200. The regularization weight is 0.0001. We run for 50 epochs with a learning rate of 0.01, except for MI, which uses a learning rate of 0.02 for MF and FM. The dropout of NeuMF is set to 0.2.

11 Source code of MF, WRMF, FM, and NeuMF is available in DaisyRec, an open-source Python toolkit: https://github.com/AmazingDD/daisyRec.
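As a point of reference, a bare-bones MF baseline under the hyper-parameters above (200 factors, regularization weight 1e-4, learning rate 0.01) can be sketched as follows. This is a generic PyTorch sketch, not the DaisyRec implementation used in the experiments, and the choice of SGD as the optimizer is an assumption.

import torch
import torch.nn as nn

class MF(nn.Module):
    # Plain matrix factorization: rating ~= user factors . item factors + biases.
    def __init__(self, n_users, n_items, n_factors=200):
        super().__init__()
        self.user_f = nn.Embedding(n_users, n_factors)
        self.item_f = nn.Embedding(n_items, n_factors)
        self.user_b = nn.Embedding(n_users, 1)
        self.item_b = nn.Embedding(n_items, 1)

    def forward(self, users, items):
        dot = (self.user_f(users) * self.item_f(items)).sum(-1)
        return dot + self.user_b(users).squeeze(-1) + self.item_b(items).squeeze(-1)

model = MF(n_users=1000, n_items=500)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
loss_fn = nn.MSELoss()

# One toy gradient step on random user-item-rating triples.
users = torch.randint(0, 1000, (32,))
items = torch.randint(0, 500, (32,))
ratings = torch.randint(1, 6, (32,)).float()
loss = loss_fn(model(users, items), ratings)
loss.backward()
optimizer.step()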

ConvMF A CNN-based model proposed by Kim et al. (2016)12 that utilizes a convolutional neural network (CNN) for feature encoding of text embeddings. Hyper-parameter settings: The regularization factor is 10 for the user model and 100 for the item model. We used a dropout rate of 0.2.

ANR Aspect-based Neural Recommender (Chin et al., 2018)13 first proposes aspect-level representations of reviews, but its aspects are completely latent, without constraints or definitions on their semantics. Hyper-parameter settings: L2 regularization is 1 × 10−6. The learning rate is 0.002. The dropout rate is 0.5. We used 300-dimensional pretrained Google News word embeddings.

DeepCoNN DeepCoNN (Zheng et al., 2017)14 separately encodes user reviews and item reviews with complex neural networks. Hyper-parameter settings: The learning rate is 0.002 and the dropout rate is 0.5. Word embeddings are the same as for ANR.

NARRE A model similar to DeepCoNN, enhanced by an attention mechanism (Chen et al., 2018). Attentional weights are assigned to each review to measure its importance. Hyper-parameter settings: The L2 regularization weight is 0.001. The learning rate is 0.002. The dropout rate is 0.5. We used the same word embeddings as described for ANR.

D-Attn15 The dual attention-based model (Seo et al., 2017) utilizes CNNs as text encoders and builds local and global attention (dual attention) over user and item reviews. Hyper-parameter settings: In accordance with the paper, we used 100-dimensional word embeddings. The number of factors is 200. The dropout rate is 0.5. The learning rate and regularization weight are both 0.001.

MPCN The Multi-Pointer Co-Attention Network (Tay et al., 2018) selects a useful subset of reviews via pointer networks to build the user profile for the current item. Hyper-parameter settings are the same as for D-Attn, except that the dropout is 0.2.

12 https://github.com/cartopy/ConvMF
13 https://github.com/almightyGOSU/ANR
14 Source code of DeepCoNN and NARRE: https://github.com/chenchongthu
15 Source code of D-Attn, MPCN, and DAML: https://github.com/ShomyLiu/Neu-Review-Rec


P only: countless, dreamy, edgy, entire, forgettable, melodious, moral, propulsive, tasteful, uninspired
N only: therapeutic, vital, uncanny, adept, fulfilling, attracted, celestial, harmonic, newest, enduring
L only: fateful, poorest, tedious, unwell, joyous, illegal, noxious, lovable, crappy, arduous
P ∩ N \ L: ultimate, new, rhythmic, generic, atmospheric, greater, supernatural, contemporary, surprising, tremendous
P ∩ L \ N: uplifting, concerned, joyful, bombastic, unforgettable, phenomenal, inventive, classy, insightful, masterful
N ∩ L \ P: dazzling, costly, devastated, faster, graceful, affordable, supreme, robust, useless, unpredictable
P ∩ N ∩ L: amazing, beautiful, classic, delightful, enjoyable, fantastic, gorgeous, horrible, inexpensive, magnificent

Table 10: Example sentiment terms from each part of the Venn diagram (Figure 4) on the AM dataset. We use P (PMI), N (neural network), and L (lexicon) to denote the sentiment term sets produced by the three methods, respectively. Operator \ denotes set minus, e.g., P ∩ L \ N refers to the set of terms that are in both P and L but not in N. All sets contain commonly used sentiment adjectives that can modify automotive items. This table strongly explains why all three methods are necessary for term extraction in non-domain-specific use cases: each makes unique contributions to the sentiment term set for larger coverage.


DAML DAML (Liu et al., 2019) forces the encoders of user and item reviews to exchange information in a fusion layer with local and mutual attention so that the encoders can mutually guide representation generation. Hyper-parameter settings are the same as MPCN.

AHN Asymmetrical Hierarchical Networks (Dong et al., 2020)16 guide user representation generation with item-side asymmetric attentive modules so that only relevant targets are significant. Experiments are reproduced following the settings in the paper.

A.3.4 APRE: Additional Analyses on Hyper-parameter Sensitivity

Continuing Section 3.3, the search and sensitivity analysis of the feature dimensions (da, df, nc), the CNN kernel size nk, and the regularization weight λ are shown in Figure 7. We always set df = da = nc for consistency of internal feature dimensions. For (df, da, nc) in Figure 7a, we choose values from [50, 100, 150, 200] since the output dimension of the BERT encoder is 256. The best performance occurs at 200. The training time is stable across different values. The CNN kernel size nk in Figure 7b varies in [4, 6, 8, 10]. We observe that larger kernel sizes may in turn hurt the performance, as local features are fused with larger sequential contexts in natural language. The epoch numbers are stable as well. Figure 7c demonstrates how λ affects performance. As λ becomes larger, the "resistance" against loss minimization increases, so the number of training epochs increases. However, there is no clear trend of performance fluctuation, meaning that the sensitivity to the L2-reg weight is insignificant.

16 https://github.com/Moonet/AHN

Finally, we evaluate the effect of adding non-linearity to the embedding adaptation function (EAF) mentioned in Section 3.3, which transforms H0 to H1 by h1_i = σ(W_ad^T h0_i + b_ad). We try LeakyReLU, tanh, and the identity function for σ(·) and report the performances in Figure 7d. Without non-linear layers, APRE achieves the best results, whereas non-linearity speeds up training.
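A sketch of this adaptation layer with the three activation choices compared in Figure 7d (identity, i.e., no non-linearity, LeakyReLU, and tanh). The module name and the 256-to-200 dimensions are illustrative, assuming BERT-mini token embeddings are mapped to the df = 200 internal feature dimension.

import torch
import torch.nn as nn

class EmbeddingAdaptation(nn.Module):
    # Maps the frozen BERT token embeddings H0 to adapted embeddings H1 via
    # h1_i = sigma(W_ad^T h0_i + b_ad), with a configurable sigma.
    def __init__(self, d_in=256, d_out=200, activation="none"):
        super().__init__()
        acts = {"none": nn.Identity(), "leaky_relu": nn.LeakyReLU(), "tanh": nn.Tanh()}
        self.linear = nn.Linear(d_in, d_out)
        self.act = acts[activation]

    def forward(self, h0):
        return self.act(self.linear(h0))

# H0: a batch of 2 reviews, 16 tokens each, with 256-dim BERT-mini embeddings.
h0 = torch.randn(2, 16, 256)
h1 = EmbeddingAdaptation(activation="tanh")(h0)
print(h1.shape)  # torch.Size([2, 16, 200])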

A.3.5 Case Study II for Interpretation

Finally, we show another case study from the AM dataset using the same attention-property-score visualization schema as in Section 4.4. In this case, our model predicts the score user u∗ will give to a color and clarity compound for vehicle surfaces, t∗. The aspects mentioned by u∗ and the properties of t∗ are given in Table 11, including three overlapping aspects (quality, look, cleaning) and one aspect unique to each side (size for u∗ and smell for t∗). A summary table, Table 12, shows the summarized attentions and properties, the inferred impacts, and the corresponding score components of 〈γ, Fex([Gu; Gt])〉.


[Figure 7: Additional hyper-parameter sensitivity and search results for the internal feature dimensions (Dims: df, da, and nc), the CNN kernel size (nk), the L2 regularization weight, and the token embedding adaptation function (EAF). Panels: (a) Dims vs. MSE and epochs; (b) nk vs. MSE and epochs; (c) L2-reg vs. MSE and epochs; (d) EAF vs. MSE and epochs. Each panel plots performance in MSE and the number of training epochs.]

In this case study, we observe the interesting phenomenon, also exemplified by the contrast between R1 and R3 in Table 1, that the aspect look, which is mentioned by u∗ and reviewed negatively as a property of t∗ ("strange yellow color"), produces only a negligible negative effect (-0.002) on the final score prediction. This indicates that the imperfect look (or color) of the item, although mentioned by u∗ in his/her reviews, receives little attention from u∗ and thus has only a tiny negative impact on the predicted rating. The other two overlapping aspects show intuitive correlations between their inferred impacts and scores. The unique aspects, size and smell, have relatively small influences on the prediction because they are either not attended aspects or not mentioned properties.

It is also notable that some sentences carrying strong emotions may contain few explicit sentiment mentions, e.g., "But for an all in one cleaner and wax I think this outperforms most." This supports the design of APRE, which carefully takes implicit sentiment signals into consideration, and it also calls for more advanced aspect-based sentiment modeling beyond the term level. Different proportions of such sentences across datasets may account for the inconsistency in which of the two ablation variants performs better.

From reviews given by user u∗.

quality: [To t1] As soon as I poured it into the bucket and started getting ready, I can tell the product was already better quality than my previous washing liquid.

look: [To t4] I bought [this item] because I had neglected my paint job for too long. . . . it made my black paint job look dull.

cleaning: [To t2] . . . I was able to dry my car in record time and not have any water marks left on the paint. I just slid the towel over any parts with water and it left no trace of water and a clean shine to my car. [To t3] I had completely neglected these areas, except for minor cleaning and protection. Once I applied it, the difference was night and day!

size: [To t6] The size was great as well, allowing me to get larger areas in an easier amount of time so that I could wash my car quicker than I have in the past.

From reviews received by item t∗.

quality: [From u1] Adding too little soap will increase the tendency . . . This thick, high quality soap helps prevent against that. (✓) [From u2] . . . Cons: A bit pricey, but quality matters, and this product absolutely has it. Worth every cent for sure! (✓)

look: [From u3] I was a bit disappointed. It is a strange yellow color and it is thick and I personally did not care for the smell. (✗)

cleaning: [From u4] As far as cleaning power it does fairly good, . . . The best cleaning of a car is in steps, but for an all in one cleaner and wax I think this outperforms most. (✓)

smell: [From u5] Just giving some useful feedback about the truth behind the product . . . that it smells good. [From u6] I believe this preserves the wax layer longer . . . This is much thicker than the [some brand] soap, and has a very pleasant smell to it. (✓)

Table 11: Examples of reviews from u∗ and to t∗ with Aspect-Sentiment pair mentions as well as other sentiment evidence on five example aspects.

Aspects                  size  quality  look  cleaning  smell
Attn. of u∗              ✓     ✓        –     ✓         n/a
Prop. of t∗              n/a   ✓        ✗     ✓         ✓
Inferred Impact          Unk.  Pos.     Neg.  Pos.      Unk.
γ_i F_ex(·)_i (×10^−2)   0.5   2.9      −0.2  1.4       0.3

Table 12: Attention and property summaries, inferred impacts, and the learned aspect-level contributions to the score prediction.