A Trainable Multi-factored QA System

Radu Ion, Dan Ştefănescu, Alexandru Ceauşu, Dan Tufiş, Elena Irimia, Verginica Barbu-Mititelu

Research Institute for Artificial Intelligence, Romanian Academy

ResPubliQA

• We participated in the Romanian-Romanian ResPubliQA task

• 500 juridical questions to be answered from the Romanian JRC Acquis (10714 docs)

• Questions have been translated from other languages => a more difficult QA task since translated terms are not necessarily expressed the same in the actual Romanian documents

Corpus processing and indexing

• POS tagging, lemmatization, chunking.

• Only the ‘body’ part of a document was indexed (no annexes, no headers)

• We have two Lucene indexes: a document index and a paragraph index

• What’s in the index: lemmas and paragraph classes for the paragraph index

QA flow

• Web services based:

– Question preprocessing using TTL (http://ws.racai.ro/ttlws.wsdl)

– Question classification using a ME classifier (http://shadow.racai.ro/JRCACQCWebService/Service.asmx)

– Query generation (2 types: TFIDF and chunk based) (http://shadow.racai.ro/QADWebService/Service.asmx)

– Search engine interrogation (http://www.racai.ro/webservices/search.asmx)

– Paragraph relevance score computation and paragraph reordering

The combined QA system

• To account for NOA strings (which, when given, will increase the overall performance measure), we decided to combine 2 results:

– The QA system using the TFIDF query

– The QA system using the chunk query

• When the same paragraph was returned among the top K (=3) paragraphs by both systems, it was given as the answer

• Otherwise, we returned the NOA string
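The combination rule above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `combine_answers`, the literal `"NOA"` value, and the tie-breaking choice (take the shared paragraph ranked highest by the TFIDF system) are not stated on the slides.

```python
from typing import Sequence

# Placeholder for the "no answer" string; the literal value is an assumption.
NOA = "NOA"

def combine_answers(tfidf_ranked: Sequence[str],
                    chunk_ranked: Sequence[str],
                    k: int = 3) -> str:
    """Return a paragraph id present in the top-k of both rankings, else NOA.

    If several paragraphs are shared, the one ranked highest by the
    TFIDF system is chosen (an assumption made for this sketch).
    """
    chunk_top = set(chunk_ranked[:k])
    for paragraph_id in tfidf_ranked[:k]:
        if paragraph_id in chunk_top:
            return paragraph_id
    return NOA
```

For example, `combine_answers(["p1", "p2", "p3"], ["p9", "p3", "p2"])` yields `"p2"`, while two rankings with no common paragraph in their top 3 yield the NOA string.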

Paragraph relevance

• s1 to s5 are paragraph relevance scores

• λi are weights trained by iteratively computing MRR scores on a 200-question test set, using sets of weights whose sum is 1

• Retaining the weights that yield the largest MRR results in a MERT-like training procedure

• Increment step was 0.01
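A minimal sketch of such a training loop follows. It assumes the paragraph score is the weighted sum of the individual scores, s = λ1·s1 + ... + λ5·s5, which the bullets imply but do not show. The data layout and all function names (`combined_score`, `mrr`, `simplex_grid`, `train_weights`) are hypothetical.

```python
from typing import Dict, Iterator, List, Sequence, Tuple

# candidates[question_id] = list of (paragraph_id, [s1, ..., sn])
Candidates = Dict[str, List[Tuple[str, Sequence[float]]]]

def combined_score(weights: Sequence[float], scores: Sequence[float]) -> float:
    """Weighted sum of the relevance scores (assumed combination rule)."""
    return sum(w * s for w, s in zip(weights, scores))

def mrr(weights: Sequence[float], candidates: Candidates,
        gold: Dict[str, str]) -> float:
    """Mean reciprocal rank of the gold paragraph under the given weights."""
    if not candidates:
        return 0.0
    total = 0.0
    for question_id, cands in candidates.items():
        ranked = sorted(cands,
                        key=lambda c: combined_score(weights, c[1]),
                        reverse=True)
        for rank, (paragraph_id, _) in enumerate(ranked, start=1):
            if paragraph_id == gold[question_id]:
                total += 1.0 / rank
                break
    return total / len(candidates)

def simplex_grid(n: int, steps: int) -> Iterator[Tuple[int, ...]]:
    """Yield all tuples of n non-negative integers that sum to `steps`."""
    if n == 1:
        yield (steps,)
        return
    for first in range(steps + 1):
        for rest in simplex_grid(n - 1, steps - first):
            yield (first,) + rest

def train_weights(candidates: Candidates, gold: Dict[str, str],
                  n_scores: int, steps: int = 100):
    """Exhaustive search over weight vectors summing to 1 (step = 1/steps)."""
    best_weights, best_mrr = None, -1.0
    for combo in simplex_grid(n_scores, steps):
        weights = tuple(c / steps for c in combo)
        value = mrr(weights, candidates, gold)
        if value > best_mrr:
            best_weights, best_mrr = weights, value
    return best_weights, best_mrr
```

With `steps=100` the grid matches the 0.01 increment; for five scores that is roughly 4.6 million weight vectors, so a coarser grid (for example `steps=10`) is a sensible first pass when experimenting.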

Relevance scores

• Lucene scores for the document and paragraph retrieval

• One BLEU-like relevance score, which is high if a candidate paragraph contains the question keywords in much the same order as in the question

• One indicator variable that is 1 if the candidate paragraph has the same class as the question (0 otherwise)

• One lexical-chain-based score (a real number quantifying the semantic distance between the question and the candidate paragraph)
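To make the order-sensitive idea behind the BLEU-like score concrete, here is a toy n-gram overlap measure. It is not the formula used by the system, which the slides do not give. It assumes lemmatized input, and the names `ngrams` and `order_score` are hypothetical.

```python
from typing import Sequence, Set, Tuple

def ngrams(tokens: Sequence[str], n: int) -> Set[Tuple[str, ...]]:
    """Set of contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def order_score(question_keywords: Sequence[str],
                paragraph_lemmas: Sequence[str],
                max_n: int = 2) -> float:
    """Average fraction of question keyword n-grams found in the paragraph.

    Paragraph tokens that are not question keywords are dropped first, so
    keywords separated by other words still count as adjacent.
    """
    keyword_set = set(question_keywords)
    filtered = [t for t in paragraph_lemmas if t in keyword_set]
    fractions = []
    for n in range(1, max_n + 1):
        question_grams = ngrams(question_keywords, n)
        if not question_grams:
            continue
        paragraph_grams = ngrams(filtered, n)
        fractions.append(len(question_grams & paragraph_grams)
                         / len(question_grams))
    return sum(fractions) / len(fractions) if fractions else 0.0
```

A paragraph that contains all keywords in question order scores 1.0; one that contains them all in reversed order keeps the unigram credit but loses the bigram credit, so it scores 0.5.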

Evaluations

• Official results

• Second run: query contained the question class

Post CLEF2009 Evaluations

• Results with all questions (500) answered (no NOA strings)

• With trained parameters for every question class, we obtain an overall accuracy of 0.5774 (29 additional correctly answered questions)

Post CLEF2009 Evaluations (II)

• Some other informative measures:

– Answering precision: correct / answered

– Rejection precision: (1 – correct) / unanswered

• AP(icia092roro) = 75.58%

• RP(icia092roro) = 86.53%

• So, the system is able to reject giving wrong answers at a high rate, which is a merit in itself (discovered due to the c@1 calculus), even if it cannot offer the same answering precision in the unanswered area
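The measures above, together with c@1 as defined for ResPubliQA, can be written out as a short sketch. The function names are hypothetical. Rejection precision is read here as the share of withheld candidate answers that would indeed have been wrong, which is an interpretation of the slide's formula.

```python
def answering_precision(correct_answered: int, answered: int) -> float:
    """Correct answers divided by the number of questions actually answered."""
    return correct_answered / answered if answered else 0.0

def rejection_precision(wrong_withheld: int, unanswered: int) -> float:
    """Share of unanswered questions whose withheld candidate was wrong
    (assumed reading of "(1 - correct) / unanswered")."""
    return wrong_withheld / unanswered if unanswered else 0.0

def c_at_1(correct_answered: int, unanswered: int, total: int) -> float:
    """c@1 = (n_R + n_U * n_R / n) / n.

    n_R = correctly answered questions, n_U = unanswered questions,
    n = total number of questions.
    """
    return (correct_answered + unanswered * correct_answered / total) / total
```

When no question is left unanswered, c@1 reduces to plain accuracy; leaving a question unanswered is rewarded in proportion to the accuracy achieved on the rest, which is why a high rejection precision pays off under this measure.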

Conclusions

• A multi-factored QA system may be easily extended with new paragraph relevance scores

• It’s also easily adaptable to new domains and/or languages

• Update: better correlation between document and paragraph relevance scores

• Future plans: to develop the English QA system along the same lines and combine the En-Ro outputs

Conclusions (II)

• Competition drives innovation but let’s not forget that these tools are there to help users

• Useful requirement: QA systems to be on the Web

• Ours is at http://www2.racai.ro/sir-resdec/
