On the Winograd Schema Challenge: Towards a New Test for Thinking Machines

WALID SABA
Co-founder & CTO, Klangoo, Inc.

INTRODUCTION

Purely quantitative (e.g., statistical) methods would, with a high degree of certainty, erroneously resolve “it” in (1) to ‘school’, as it seems to be the only candidate in the text consistent with training data obtained from processing a large corpus.

(1) Dave told everyone in school that he wants to be a guitarist, because he thinks it is a great-sounding instrument.

A 5-year-old with basic commonsense knowledge, however, would effortlessly and correctly infer that “it” in (1) refers to “a guitar”, a noun phrase that is not even explicitly mentioned in the text. In reasoning beyond what is explicitly stated in the text, the 5-year-old is presumably accessing what we call commonsense (background) knowledge: a guitarist is a musician who plays a musical instrument, an instrument we call a ‘guitar’. Undoubtedly it is this kind of thinking that Alan Turing had in mind when he posed the question “Can Machines Think?”, suggesting further that a machine that comprehends and communicates in ordinary spoken language, much like humans do, must be a thinking machine¹. While it was certainly ingenious of Turing to recognize that competency in ordinary spoken language is the best test for thinking machines (Turing 1950), the imitation game (a.k.a. the “Turing Test”) that he proposed has several shortcomings.

¹ Although this is not the subject of the current discussion, we would like to unequivocally concur that language, that infinite object so tightly related to our capacity to entertain an infinite number of thoughts, is the ultimate test for thinking machines. While several accomplishments in computation are usually attributed to AI (probably due to an ill-formed definition of AI), most of these tasks amount to finding a near-optimal solution in a finite set of possibilities and hardly constitute what we might call thinking. For example, although the search space is very large (and would therefore require some intelligent heuristics), playing chess is, ultimately, a matter of looking up and comparing numeric values in a large table, as the possibilities are in the end finite. That chess was once an activity only humans could perform makes it no different from complex arithmetic, which was also once performed only by humans and is now carried out by calculators, although we would hardly say that calculators are thinking machines. The same can be said of pattern (sound and image) recognition systems, which essentially find (or, in some techniques, learn) regularities in data. True human-level object recognition (beyond the image recognition that the most primitive of species can perform) would in the end require reasoning somewhat similar to that required in language understanding. In this regard, the recently proposed visual Turing Test for vision systems (Geman et al., 2015) is a step in the right direction.


As recently suggested by Levesque et al. (2012), the Turing Test leaves room for some systems to pass the test not because any thinking is going on, but through trickery and deception. As Levesque points out, systems that have participated in the Loebner competition (Shieber 1994) typically resort to deception and trickery, throwing in “elaborate wordplay, puns, jokes, quotations, clever asides, emotional outbursts,” while avoiding clear answers to questions that a 5-year-old would be very comfortable answering correctly. In addressing the shortcomings of the Turing Test, while staying true to its spirit, namely deciding whether a machine is thinking, Levesque et al. suggested what they termed the “Winograd Schema (WS) challenge”, which can be illustrated by the classic example given originally by Terry Winograd (1972):

(2) The city councilmen refused the demonstrators a permit because they
    a. feared violence.
    b. advocated violence.

The question posed against this sentence would be: what does “they” refer to in (2a), and what does it refer to in (2b)? The answer is so obvious to humans, who reason using commonsense (background) knowledge, that a machine would certainly seem to be thinking if it could also correctly resolve such references. Levesque correctly points out, however, that care should be taken in the design of such queries so as to avoid the pitfalls of the original Turing Test, namely that a program should not be able to pass the test by performing simple syntactic-level and pattern-matching computation. For example, simple word co-occurrence data obtained from a corpus analysis might be all that is needed to make the correct guess in (3), without doing anything we might call thinking, while the same cannot be done in (4).

(3) The women stopped taking the pills because they were
    a. pregnant.
    b. carcinogenic.

(4) The trophy would not fit into the brown suitcase because it was too
    a. small.
    b. big.

Levesque calls the query in (4) “Google-proof”, since having access to a large corpus does not help here: the frequencies of contexts in which “small” and “big” appear should in principle be the same. Not so in (3), where a purely quantitative system would pass many queries based on simple co-occurrence data, as the likelihood of “carcinogenic” co-occurring with “pills” should be much higher than its co-occurrence with “women” (and similarly for the other combination). Another important point Levesque makes in proposing the WS challenge is to avoid posing queries that are either too obvious or too difficult. The latter could happen, for example, if the questions posed required knowledge of a special vocabulary – one that only specialized domain experts might know. Essentially, these queries should be ones that a 5-year-old could effortlessly answer – or, as Levesque puts it, “a good question for a WS is one that an untrained subject (your Aunt Edna, say) can answer immediately”. This is a crucial issue, and it is central to the design of a good set of WS sentences.
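To make this contrast concrete, here is a minimal sketch (in Python) of the kind of co-occurrence baseline Levesque warns about. The counts are hypothetical stand-ins for frequencies one might collect from a large corpus, not real data; the point is only that the margin between the two candidates is large for (3) and negligible for (4):

# Hypothetical corpus co-occurrence counts: (candidate referent, cue word) -> count.
# These numbers are illustrative assumptions, not measured frequencies.
COOCCURRENCE = {
    ("women", "pregnant"): 900, ("pills", "pregnant"): 40,
    ("women", "carcinogenic"): 10, ("pills", "carcinogenic"): 350,
    ("trophy", "big"): 120, ("suitcase", "big"): 118,
    ("trophy", "small"): 117, ("suitcase", "small"): 121,
}

def resolve_by_cooccurrence(candidates, cue):
    """Pick the candidate referent that co-occurs most often with the cue word."""
    scored = sorted(((COOCCURRENCE.get((c, cue), 0), c) for c in candidates), reverse=True)
    margin = scored[0][0] - scored[1][0]
    return scored[0][1], margin  # a tiny margin signals a "Google-proof" item

# Sentence (3): the margin is large, so the baseline "succeeds" without any thinking.
print(resolve_by_cooccurrence(["women", "pills"], "carcinogenic"))  # ('pills', 340)

# Sentence (4): the margin is negligible, so the baseline is reduced to guessing.
print(resolve_by_cooccurrence(["trophy", "suitcase"], "big"))       # ('trophy', 2)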


THE KNOWLEDGE-LEVEL IN THE WINOGRAD SCHEMA

To systematically deal with Levesque’s concern of not posing questions that are too easy or too difficult, we consider such questions at a higher level of abstraction. All WS queries share the following features:

1. there are two noun phrases mentioned in the sentence;
2. there is a reference made in the sentence to one of these noun phrases;
3. the question involves determining the right referent.

This template applies to sentences where the reference (i) can be resolved by simple syntactic data; (ii) can be resolved by semantic information (type restrictions); (iii) can be resolved by reasoning at the pragmatic level and by accessing commonsense (background) knowledge; or (iv) cannot be resolved at all, as illustrated in Figure 1. It is at the third level (the knowledge/pragmatic level) where good WS sentences are situated, and Levesque’s concern about the difficulty level is precisely about capturing sentences in level 3 – in particular, those in the middle of level 3 (the shaded rectangle in Figure 1).

Figure 1. Various ‘reference resolution’ levels in the Winograd Schema

To start with, we can dismiss discussion of sentences in level 4, as these are sentences for which even humans cannot resolve the reference, as in the following:

(5) John told Bill that he will not be nominated for the committee.

Aside from level 4, there are three levels representing cases where a referent can in theory be resolved by humans. We discuss these three levels below, concentrating on the knowledge/pragmatic level, where WS sentences are situated.
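Before turning to the individual levels, the template above can be made concrete with a small data structure. What follows is a minimal sketch in Python; the class, its field names, and the level labels are our own illustrative formalization, not anything proposed by Levesque et al.:

from dataclasses import dataclass
from enum import Enum

class Level(Enum):
    SYNTACTIC = 1      # resolvable by gender/number agreement alone
    SEMANTIC = 2       # resolvable by type (selectional) restrictions
    PRAGMATIC = 3      # requires commonsense background knowledge
    UNRESOLVABLE = 4   # even humans cannot resolve the reference

@dataclass
class WinogradItem:
    sentence: str      # the sentence containing the ambiguous reference
    candidates: tuple  # the two noun phrases (feature 1 of the template)
    pronoun: str       # the reference made in the sentence (feature 2)
    answer: str        # the correct referent (feature 3)
    level: Level       # where the item sits in Figure 1

item = WinogradItem(
    sentence="The trophy would not fit into the brown suitcase because it was too big.",
    candidates=("the trophy", "the brown suitcase"),
    pronoun="it",
    answer="the trophy",
    level=Level.PRAGMATIC,
)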


The Syntactic/Data Level (level 1 in figure 1)

This is the level at which only one of the two noun phrases is the correct referent, and simple syntactic data is enough to resolve the reference. Here are some typical examples:

(6) a. John informed Mary that he passed the exam.
       (the reference can easily be resolved by gender data)
    b. John invited his classmates for his birthday party, and most of them showed up.
       (the reference can easily be resolved by number data)
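A minimal sketch of level-1 resolution is given below. The tiny hand-written feature lexicon is an assumption for illustration; a real system would obtain gender and number features from a morphological analyzer or parser:

# Hypothetical gender/number features for the candidate noun phrases in (6).
FEATURES = {
    "John": {"gender": "masc", "number": "sg"},
    "Mary": {"gender": "fem", "number": "sg"},
    "his classmates": {"gender": "any", "number": "pl"},
    "his birthday party": {"gender": "neut", "number": "sg"},
}
PRONOUNS = {
    "he": {"gender": "masc", "number": "sg"},
    "them": {"gender": "any", "number": "pl"},
}

def agree(candidate, pronoun):
    """True if the candidate's gender and number are compatible with the pronoun's."""
    c, p = FEATURES[candidate], PRONOUNS[pronoun]
    gender_ok = "any" in (c["gender"], p["gender"]) or c["gender"] == p["gender"]
    return gender_ok and c["number"] == p["number"]

def resolve_level1(candidates, pronoun):
    """Return the referent if agreement alone singles one out, else None."""
    matches = [c for c in candidates if agree(c, pronoun)]
    return matches[0] if len(matches) == 1 else None

print(resolve_level1(["John", "Mary"], "he"))                             # John (6a)
print(resolve_level1(["his classmates", "his birthday party"], "them"))  # his classmates (6b)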

The Semantic/Information Level (level 2 in figure 1)

Again, this is the level at which only one of the two noun phrases is the correct referent, but where syntactic information is not enough, although semantic information (usually type information that might be available in a strongly-typed ontology) is enough to resolve the reference. Here is a typical example:

(7) Our graduate students published 20 papers this year and, apparently, few of them
    a. also authored books.
    b. appeared in tier-1 journals.

Here type information (often referred to in the literature as selectional restrictions) is sufficient to resolve the anaphor ‘them’: students, not papers, author books; and papers, not students, appear in journals. Although such sentences can be part of a good test, we agree with Levesque that they can also be easily answered by a program that has a rich (and strongly-typed) ontology, without doing what we might call thinking.
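The following is a minimal sketch of level-2 resolution under the assumption of a toy strongly-typed ontology and a hand-written table of selectional restrictions; both tables are illustrative assumptions, not a real resource:

# Hypothetical ontology: noun phrase -> type.
ONTOLOGY = {"graduate students": "Human", "papers": "Document"}

# Hypothetical selectional restrictions: predicate -> required subject type.
RESTRICTIONS = {
    "author books": "Human",           # only humans author books (7a)
    "appear in journals": "Document",  # only documents appear in journals (7b)
}

def resolve_level2(candidates, predicate):
    """Return the unique candidate whose type satisfies the predicate's restriction."""
    required = RESTRICTIONS[predicate]
    matches = [c for c in candidates if ONTOLOGY[c] == required]
    return matches[0] if len(matches) == 1 else None

print(resolve_level2(["graduate students", "papers"], "author books"))        # graduate students
print(resolve_level2(["graduate students", "papers"], "appear in journals"))  # papers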

The Pragmatic/Knowledge Level (level 3 in figure 1)

This is the level at which ‘good’ WS sentences are situated. Sentences at this level are those where the reference in question can, in theory, be resolved to either of the two noun phrases, and the most appropriate referent is usually the one that is more probable than the other (i.e., the one that makes the scenario being described more likely to be the case). Care should therefore be exercised not to choose WS sentences where the likelihoods of both referents are nearly equal (cases where the WS is too difficult), or where the likelihood of one is clearly much higher than the other (cases where the WS is too easy). In what follows we discuss several examples in level 3, in order of difficulty (from the too easy to the too difficult), and we suggest that good WS sentences are those in the middle of level 3 (the gray area in Figure 1).

Level 3.1

At this level both referents are in theory possible, but the likelihood of one is usually much higher than the likelihood² of the other. The sentences in (8) are typical examples.

² By ‘likelihood’ here we mean something like “the commonsense likelihood”. Thus “the likelihood of referent R being the correct one” essentially means “the likelihood that selecting referent R is more consistent with our commonsense understanding of the world”.


(8) a. John told Bill he will be coming to the meeting.
    b. John asked Bill if he is coming to the meeting.

Level 3.2

Again, at this level both referents are also possible in theory. While the likelihood of one of the referents being the correct one is higher than that of the other, it is not much higher, and the correct referent is chosen because it is the one that is more consistent with our commonsense understanding of the world. The sentences in (9) through (12) are typical examples.

(9) Professor Carnap told John that he
    a. did very well in his final exam.
    b. did not yet grade his final exam.

(10) John gave the book to Fred when he
    a. finished with it.
    b. asked for it.

(11) John could not lift his son because he was too
    a. heavy.
    b. weak.

(12) The trophy would not fit into the brown suitcase because it was too
    a. big.
    b. small.

Level 3.3

Sentences at this level are similar to those in level 3.2 in that both referents are also possible in theory, and the likelihood of one of the referents being the correct one is higher than that of the other. However, the difference in the degrees of likelihood of the two referents becomes very small in such examples, although one referent is still more consistent with our commonsense understanding of the world. The sentences in (13) and (14) are typical examples.

(13) The town councillors refused to give the angry demonstrators a permit because they
    a. feared violence.
    b. advocated violence.

(14) A young teenager fired several shots at a policeman. Eyewitnesses say he immediately
    a. fell down.
    b. fled away.

Although it is not likely, one can imagine (13) describing a scenario where some (ill-intentioned) town councillors refused to give a permit for a (peaceful) demonstration precisely to incite violence. The sentence in (14) is an even more typical example at this level, where both referents start to become almost equally likely: one can easily imagine a young teenager falling down after shooting a policeman with a very large machine gun, or a policeman fleeing after being shot, perhaps to escape further injury.


While sentences in levels 3.1, 3.2 and 3.3 are all examples where both referents are in theory possible, sentences in level 3.3 are ones where the resolution of the reference becomes too difficult, in contrast to those in level 3.1, where the resolution is perhaps too easy. It is thus the sentences in level 3.2 that are good WS sentences. In particular, they are sentences where both referents are in principle possible, yet humans (like our 5-year-old, or Aunt Edna!) can easily make the correct choice because the likelihood of one of the referents is, after all, more consistent with our commonsense understanding of the world³.
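This selection criterion can be summarized in a small sketch. The plausibility scores and band thresholds below are illustrative assumptions rather than measured quantities; the point is only that good WS items live in a middle band of the likelihood gap:

def ws_difficulty(p_correct, p_other, easy_gap=0.6, hard_gap=0.1):
    """Classify a level-3 item by the gap between the two readings' plausibilities.
    The thresholds are hypothetical; any calibrated middle band would do."""
    gap = abs(p_correct - p_other)
    if gap >= easy_gap:
        return "3.1: too easy"
    if gap <= hard_gap:
        return "3.3: too difficult"
    return "3.2: good WS sentence"

print(ws_difficulty(0.95, 0.05))  # an (8a)-style item: 3.1, too easy
print(ws_difficulty(0.70, 0.30))  # a (12)-style item: 3.2, a good WS sentence
print(ws_difficulty(0.53, 0.47))  # a (14)-style item: 3.3, too difficult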

IS THE WINOGRAD SCHEMA A SPECIAL CASE?

Having discussed the Winograd Schema (WS) in some detail, suggesting in the process where good WS sentences are situated, we would like to suggest here that WS sentences are special cases of a more general phenomenon in natural language understanding that a good test for intelligence must consider. The sentences in level 3.2 (a subset of level 3 in Figure 1) are good WS sentences specifically because they are typical examples where humans perform commonsense reasoning, accessing background knowledge that is not explicitly stated in the text. As Levesque et al. (2012) noted:

“You need to have background knowledge that is not expressed in the words of the sentence to be able to sort out what is going on …. And it is precisely bringing this background knowledge to bear that we informally call thinking.” (Emphasis added)

We wholeheartedly agree: accessing commonsense (background) knowledge that is not expressed in the text is what we humans do in resolving references in WS sentences; this is what we call thinking, and it is probably what Turing had in mind in his test for thinking machines. However, this phenomenon of accessing background knowledge not explicitly stated in the text is not specific to reference resolution, but is in fact the common denominator in many other linguistic phenomena. In what follows we cover a few example sentences where [the missing text] that is never stated, but is assumed to be available to all humans, is what allows humans to understand the sentence. Consider for example the following:

(15) a. John enjoyed [reading] the book.
     b. John enjoyed [watching] the movie.

While John can, in theory, enjoy writing, publishing, buying, selling, etc. a book, and enjoy directing, producing, buying, selling, etc. a movie, a 5-year-old would immediately infer that [the missing text] in (15a) is ‘reading’ and that in (15b) it is ‘watching’ – again, precisely because these two words make the final meaning more consistent with our commonsense understanding of the world.

³ Ironically, while purely statistical (or other quantitative) models would not be able to pass good WS sentences, thinking seems to be happening precisely in those sentences where humans are good at choosing the meaning that is more likely than the others – i.e., the one that is most consistent with our commonsense understanding of the world.


A query posed against such sentences would be “what did John enjoy about the book?” (15a) and “what did John enjoy about the movie?” (15b). The sentences in (16) are examples where prepositional phrase (PP) attachments must be resolved. Specifically, it must be decided what the prepositional phrase attaches to (i.e., what it modifies). Here also it seems that humans resolve these attachments by accessing background knowledge, inferring in the process the [missing text] that is typically not explicitly stated.

(16) a. I read a story about evolution [and I did that] in ten minutes.
     b. I read a story about [the] evolution [that happened] in the last million years.

In (17) we have an example where what is referred to in the literature as a quantifier scope ambiguity is also resolved by accessing commonsense background knowledge to infer the [missing text] that is not usually stated:

(17) John visited a [different] house on every street in his neighbourhood.

What is referred to in the literature as metonymy is yet another example where humans use commonsense background knowledge in inferring [the missing text], as illustrated by the sentences in (18) and (19):

(18) The [person sitting at the] corner table wants another beer.
(19) The [person driving the] car in front of us is annoying me, please pass it.
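These phenomena suggest a common format for the broader test advocated here. The following is a minimal sketch of how such ‘missing text’ items might be encoded; the class, its fields, and the phenomenon labels are our own illustrative assumptions:

from dataclasses import dataclass

@dataclass
class MissingTextItem:
    phenomenon: str  # e.g., reference, verb ellipsis, PP attachment, scope, metonymy
    sentence: str    # the sentence as actually uttered
    question: str    # what the system is asked
    missing: str     # the implicit material a human effortlessly supplies

suite = [
    MissingTextItem("verb ellipsis", "John enjoyed the book.",
                    "What did John enjoy about the book?", "reading it"),
    MissingTextItem("PP attachment", "I read a story about evolution in ten minutes.",
                    "What took ten minutes?", "the reading, not the evolution"),
    MissingTextItem("metonymy", "The corner table wants another beer.",
                    "Who wants the beer?", "the person sitting at the corner table"),
]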

CONCLUDING REMARKS

What came to be known as the Turing Test was initially suggested by Turing as a possible answer to the question “Can Machines Think?”. As Levesque et al. (2012) pointed out, however, the current Turing Test can be passed by programs that perform a few tricks without doing anything we might call thinking. The Winograd Schema (WS) challenge suggested by Levesque is indeed a step in the right direction, as programs correctly answering WS sentences would indeed be thinking. In this short article we discussed where good WS sentences are situated. We further suggested that WS sentences are in fact instances of a much larger phenomenon. In particular, we suggested that in language comprehension there are many phenomena where, as in the resolution of references in WS sentences, humans seem to employ commonsense reasoning, accessing background knowledge to infer missing words that are not explicitly stated in the text. A good test for thinking machines would therefore be a carefully selected set covering several such phenomena (prepositional phrase attachments, quantifier scope ambiguities, metonymy, etc.), all of which are examples where humans use commonsense background knowledge to infer the missing text – text not explicitly stated – i.e., all of which are examples of the thinking humans perform in the process of language comprehension.


REFERENCES

Geman, D. et al. (2015), Visual Turing test for computer vision systems, Proceedings of the National Academy of Sciences.

Levesque, H., Davis, E. and Morgenstern, L. (2012), The Winograd Schema Challenge, in Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, AAAI Press, pp. 552-561.

Shieber, S. M. (1994), Lessons from a restricted Turing Test, Communications of the ACM, 37(6), pp. 70-78.

Turing, A. M. (1950), Computing Machinery and Intelligence, Mind, 59(236), pp. 433-460.

Winograd, T. (1972), Understanding Natural Language, Academic Press.

Walid Saba is a Co-founder and the CTO of Klangoo, the developer of the Magnet semantic engine. He has over 20 years of experience in information technology, holding various positions at such places as the American Institutes for Research, AT&T Bell Labs, MetLife, Nortel Networks, IBM and Cognos. He has also spent 7 years in academia, where he taught computer science at the University of Ottawa, the New Jersey Institute of Technology, the University of Windsor, the American University of Beirut, and the American University of Technology. He has published over 35 technical articles, including an award-winning paper presented at KI-2008 in Germany. Walid holds an MSc in Computer Science from the University of Windsor and a PhD in Computer Science from Carleton University, which he obtained in 1999.