Slide 2
Overview
• Project Halo is a staged research and development effort towards a digital Aristotle:
  • An application capable of providing user-appropriate answers and justifications for questions in an ever-growing number of domains
Slide 3
Step One: The Halo Pilot
• Goals:
  • To investigate the state of the art in knowledge representation and reasoning (KRR), especially "deep reasoning"
  • To identify leaders in the field and to get to know them well
  • To quickly come up to speed on the algorithmic and technical issues critical for good program management
  • To establish and "test run" an evaluation methodology
  • To determine a roadmap for possible future research
• Complete scientific transparency within the program and ultimately with the entire scientific community
• Limited/tight timeframe (6 months)
Slide 4
Domain/Syllabus Selection
• 50 pages from the AP chemistry syllabus (stoichiometry, reactions in aqueous solutions, acid-base equilibria)
  • Small and self-contained enough to be doable in a short period of time
  • Large enough to create many novel questions
  • Complex "deep" combinations of rules
• Standardized exam with well-understood scores (AP1-AP5)
• Chemistry is an exact science, more "monotonic"
• No undue reliance on graphics (no free-body diagrams)
• Availability of experts for exam generation and grading
Slide 5
Team Selection
• Selective Call-For-Proposals
  • Solid track record of relevant .gov and industrial funding
  • Significant number of man-years invested in existing relevant technology
  • World-class team
  • Responsiveness of proposal to the CFP
  • Bid within guidelines and expectations
  • Ability to work within the Project Halo contractual environment
• Funded teams: Cycorp, SRI and Ontoprise
Slide 6
Climbing a Steep Hill
• Vulcan had little background in question answering prior to Project Halo
• Hundreds of hours were dedicated to three rounds of training:
  • General primer in AI
  • Algorithmic training from each team
  • Tools and admin training from each team
• At the end, Vulcan was capable of encoding questions using each team's formal language
Slide 7
The Challenge
• Each team had four months to develop their chemistry question answering applications
• At the end of this time, the systems were sequestered and the exam was released
• Each team had two weeks to create formal encodings of the 100 questions (169 total subparts) in three sections (MC, DA, FF)
• Formal encodings were evaluated by committee for fidelity against the original English
• Encoded questions were run in batch on the sequestered systems, generating answers and justifications in English
• These answers were distributed to three SMEs for grading
Slide 9
Metrics
• "Coverage": the ability of the system to answer novel questions from the entire specified syllabus
  • What percentage of the question types was the system capable of reliably answering?
• "Justification": the ability to provide concise, user- and domain-appropriate explanations
  • What percentage of the answer justifications was acceptable to domain evaluators?
• "Query encoding": the ability to robustly represent queries
  • Were the question encodings faithful to the original English? How sensitive were the systems to these encodings?
• "Brittleness": the ability to describe, measure, and defeat major sources of brittleness
  • What were the major causes of failure? How can these be remedied?
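As an illustration of how the first of these metrics could be operationalized, here is a minimal sketch; the question-type names below are invented for the example, not taken from the Halo syllabus:

```python
# Hypothetical sketch: "coverage" as the percentage of syllabus question
# types the system can reliably answer. Type names are illustrative only.

def coverage(reliably_answered: set[str], all_types: set[str]) -> float:
    """Percentage of the syllabus question types answered reliably."""
    if not all_types:
        return 0.0
    return 100.0 * len(reliably_answered & all_types) / len(all_types)

all_types = {"stoichiometry", "net-ionic", "acid-base", "limiting-reagent"}
answered = {"stoichiometry", "acid-base"}
print(coverage(answered, all_types))  # 50.0
```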
Slide 10
Examples of Question Encodings
• MC2. When lithium metal is reacted with nitrogen gas, under proper conditions, the product is:
(a) no reaction occurs
(b) LiN
(c) Li2N
(d) Li3N
(e) LiN3
Slide 11
F-logic Encoding
• Encoded question:

m1:Reaction[hasReactants->>{"Li","N"}; enforced->>TRUE].
answer("A") <- exists X m1[hasProducts->>X] and not equal(X,"LiN")
    and not equal(X,"Li2N") and not equal(X,"Li3N") and not equal(X,"LiN3").
answer("B") <- m1[hasProducts->>"LiN"].
answer("C") <- m1[hasProducts->>"Li2N"].
answer("D") <- m1[hasProducts->>"Li3N"].
answer("E") <- m1[hasProducts->>"LiN3"].
FORALL X <- answer(X)
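For readers unfamiliar with F-logic, the same answer-selection logic can be rendered in ordinary Python; the function and data names below are hypothetical, not part of any team's system, and the product set is supplied rather than derived by reasoning:

```python
# Illustrative Python rendering of the F-logic MC2 rules: pick the answer
# choice whose formula appears among the predicted reaction products; if
# none does, fall back to (a) "no reaction occurs".
# A real system would derive the products; here they are passed in directly.

CHOICES = {"B": "LiN", "C": "Li2N", "D": "Li3N", "E": "LiN3"}

def answer(products: set[str]) -> str:
    for letter, formula in CHOICES.items():
        if formula in products:
            return letter
    return "A"  # none of the listed products formed

print(answer({"Li3N"}))  # D
```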
Slide 12
KM Encoding

(every QF2 has
  (context ((a Reaction with
    (raw-material
      ((a Chemical with
         (has-basic-structural-unit
           (((a Metal) & (an instance of
              (the output of (a Compute-Element-from-Name with
                (input ("Lithium")))))))))
       (a Chemical with
         (has-basic-structural-unit
           (((a Molecular-Compound with
              (has-chemical-formula
                ((a Chemical-Formula with
                   (term ((:seq (:pair 2 N)))))))))))
       (state ((a State-Value with (value (*gas)))))))))))
Slide 13
KM Encoding (Cont)

(output ((forall (the atomic-chemical-formula of
            (the has-basic-structural-unit of
              (the result of (the context of Self))))
    (if ((the elements of (the term of It)) = (:set (:pair 1 Li) (:pair 1 N)))
        then "(b) LiN"
    else (if ((the elements of (the term of It)) = (:set (:pair 2 Li) (:pair 1 N)))
        then "(c) Li2N"
    else (if ((the elements of (the term of It)) = (:set (:pair 3 Li) (:pair 1 N)))
        then "(d) Li3N"
    else (if ((the elements of (the term of It)) = (:set (:pair 1 Li) (:pair 3 N)))
        then "(e) LiN3"
    else "(a) no reaction occurs")))))
  (comm [QF2-output-1] Self)))))
Slide 14
CYCL Encoding

(implies
  (and
    (chemicalReactants-TypeType ?REACTION Nitrogen)
    (chemicalReactants-TypeType ?REACTION (ElementalSubstanceFn Lithium))
    (ionicDecomposition ?LI2N LithiumIon 2 NitrideIon 1)
    (ionicDecomposition ?LI3N LithiumIon 3 NitrideIon 1)
    (ionicDecomposition ?LIN LithiumIon 1 NitrideIon 1)
    (ionicDecomposition ?LIN3 LithiumIon 1 NitrideIon 3))
  (thereExists ?COMPOUND
    (thereExists ?LI-NUM
      (thereExists ?N-NUM
        (thereExists ?LI-CHARGE
          (thereExists ?N-CHARGE
            (and
Slide 15
CYCL Encoding (Cont.)

              (relationAllInstance chargeOfObject LithiumIon
                (ElectronicCharge ?LI-CHARGE))
              (relationAllInstance chargeOfObject NitrideIon
                (ElectronicCharge ?N-CHARGE))
              (ionicDecomposition ?COMPOUND LithiumIon ?LI-NUM NitrideIon ?N-NUM)
              (evaluate 0 (PlusFn (TimesFn ?LI-CHARGE ?LI-NUM)
                                  (TimesFn ?N-CHARGE ?N-NUM)))
              (goodChoiceAmongSentences ?ANSWER
                (TheList
                  (not (thereExists ?REACTION-2
                         (and (chemicalReactants-TypeType ?REACTION-2 (GaseousFn Nitrogen))
                              (chemicalReactants-TypeType ?REACTION-2 (ElementalSubstanceFn Lithium)))))
                  (equals ?COMPOUND ?LIN)
                  (equals ?COMPOUND ?LI2N)
                  (equals ?COMPOUND ?LI3N)
                  (equals ?COMPOUND ?LIN3))))))))))
Slide 16
Evaluating Encodings
• High-fidelity encodings do not add or delete relevant chemical knowledge from the original English.
• The encoding committee reviewed all encodings to verify that they were "high fidelity."
• A second criterion was "automatability": the likelihood that encodings could be produced automatically from English, given today's state of the art.
Slide 17
Challenge Results
• All three teams produced challenge results
• All SMEs graded all the results
• Each question part received separate grades for answers and justifications
• The grade range for each question part was 0, 0.5, or 1 for answers, and likewise for justifications
• Graders were given guidelines to be as "AP-like" as possible
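The grading scheme above can be sketched as a simple aggregation; the per-part grades in the example are invented, not actual challenge data:

```python
# Hypothetical sketch of score aggregation: each question part earns
# 0, 0.5, or 1 for its answer and likewise for its justification;
# a section score is the total earned as a percentage of the parts graded.
# The grade lists below are illustrative, not real Halo results.

def section_score(grades: list[float]) -> float:
    assert all(g in (0, 0.5, 1) for g in grades), "grades must be 0, 0.5, or 1"
    return 100.0 * sum(grades) / len(grades)

answer_grades = [1, 0.5, 0, 1]           # per-part answer grades
justification_grades = [0.5, 0.5, 0, 1]  # per-part justification grades
print(section_score(answer_grades))         # 62.5
print(section_score(justification_grades))  # 50.0
```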
Slide 18
Results: MC Section
• Features 50 multiple-choice questions (MC1-MC50)
• MC3. Sodium azide is used in air bags to rapidly produce gas to inflate the bag. The products of the decomposition reaction are:
(a) Na and water
(b) Ammonia and sodium metal
(c) N2 and O2
(d) Sodium and nitrogen gas
(e) Sodium oxide and nitrogen gas
Slide 19
MC Results
[Bar charts: MC Justification Scores (axis 0-60%) and MC Answer Scores (axis 0-80%), per grader (SME1-SME3), for Cycorp, Ontoprise, and SRI]
Slide 20
Results: DA Section
• Features 25 multi-part questions (DA1-DA25)
• DA1. Balance the following reactions, and indicate whether they are examples of combustion, decomposition, or combination:
(a) C4H10 + O2 → CO2 + H2O
(b) KClO3 → KCl + O2
(c) CH3CH2OH + O2 → CO2 + H2O
(d) P4 + O2 → P2O5
(e) N2O5 + H2O → HNO3
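DA1(a) balances as 2 C4H10 + 13 O2 → 8 CO2 + 10 H2O. A minimal sketch of how such a balance can be verified by atom counting (the coefficients are supplied as assumptions, not solved for by the code):

```python
# Verify a proposed balancing of DA1(a) by counting atoms on each side.
# Formulas are written as element -> count dicts; the coefficients (2, 13,
# 8, 10) are the assumed balanced values, not computed by this sketch.

from collections import Counter

def side_atoms(terms):
    """Total atom counts for one side: terms is [(coefficient, formula-dict)]."""
    total = Counter()
    for coeff, formula in terms:
        for elem, n in formula.items():
            total[elem] += coeff * n
    return total

lhs = [(2, {"C": 4, "H": 10}), (13, {"O": 2})]          # 2 C4H10 + 13 O2
rhs = [(8, {"C": 1, "O": 2}), (10, {"H": 2, "O": 1})]   # 8 CO2 + 10 H2O
print(side_atoms(lhs) == side_atoms(rhs))  # True
```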
Slide 21
DA Results
[Bar charts: DA Justification Scores and DA Answer Scores (axes 0-60%), per grader (SME1-SME3), for Cycorp, Ontoprise, and SRI]
Slide 22
Results: FF Section
• Features 25 multi-part questions (FF1-FF25)
• More qualitative, less computational
• FF2. Pure water is a poor conductor of electricity, yet ordinary tap water is a good conductor. Account for this difference.
Slide 23
FF Results
[Bar charts: FF Justification Scores (axis 0-30%) and FF Answer Scores (axis 0-45%), per grader (SME1-SME3), for Cycorp, Ontoprise, and SRI]
Slide 24
Total Results
[Bar charts: Challenge Justification Scores (axis 0-45%) and Challenge Answer Scores (axis 0-60%), per grader (SME1-SME3), for Cycorp, Ontoprise, and SRI]
Slide 25
Grader Comments
• Organization and brevity were the two major criticisms
  • Some of the justifications were over 16 pages long
  • Many of the arguments were used repetitively
  • Proofs took a long time to "get to the point"
  • In some multiple-choice cases, proofs involved invalidating all wrong answers rather than proving the right one
• Generalized proofs relied on instance-based solutions, reflecting a lack of meta-reasoning capability
• Gaps in the knowledge were evident; e.g., many of the teams had issues with net ionic equations
Slide 26
Brittleness Analysis: SRI
[Pie chart: SRI brittleness by cause — B-MOD-2: 107, B-IMP-1: 63.5, B-ANJ-3: 37.5, B-MTA-1: 26, Other: 261]
Slide 27
Brittleness Analysis: Cycorp

[Pie chart: Cycorp brittleness by cause — B-MOD-1: 19.6, B-MOD-2: 46.1, B-MOD-3: 20, B-MOD-4: 29.6, B-MGT-2: 25.1, B-INF-3: 46.5, B-ANJ-1: 61.5, Other: 408.5]
Slide 28
Brittleness Analysis: Ontoprise
[Pie chart: Ontoprise brittleness by cause — B-MOD-1: 487, B-IMP-1: 5.5, B-INF-2: 6, B-ANJ-1: 7, Other: 143]
Slide 32
Projections for the Next Iteration (3 Months)
• Same domain and scope:
  • AP-5 for multiple choice (~85%)
  • AP-4 for non-multiple-choice (DA & FF) (~65%)
Slide 33
Observations
• Per-page encoding cost was on the order of $10K for the 50 pages
• Encoding took highly expert teams 2 weeks of effort
• SRI relied most heavily on professional chemists and was the most thorough on its assembly process
• The Ontoprise platform was the fastest and most reliable (<2 hours); F-Logic was the most concise formal language
  • SRI took >5 hours and Cycorp >12 hours
• Cycorp's generative explanations were the most ambitious, but needed more domain-expert feedback
• Previously stated metrics, such as the number of concepts and relations, do not provide insight into coverage
Slide 34
Next Steps: Phase II
• Building tools to allow domain experts to encode robust knowledge
• Building tools to allow students to pose questions/problems
• Currently in pre-CFP design
• Required skills:
  • Knowledge Engineering
  • Knowledge Acquisition (against documents)
  • HCI and Human Factors