What happened?
Martin Majlis
28/01/10 SWT - Final Project 2
Outline
Introduction
Architecture
Back-end
  Downloading
  Extraction
Front-end
  Web application
  iGoogle Gadget
Introduction
Answers questions such as:
  what happened on 3 January
  what happened on 3 January 1865
  what happened in January 1825
  what happened from January until July 1985
  what happened during the 16th century
  what started in January 1930
  what ended in 1990
Architecture
Back-end
  Downloading
  Structure
  Converting
  Parsing
Front-end
  Web application
  iGoogle Gadget
Build process
Fully automated
One target for each phase
Less error-prone
GNU Make
Data Source
Czech Wikipedia
  Documented format
  Dumps regularly generated
  Cleaner than general texts
Downloading / Conversion
Downloading
  Script from DBpedia
  Added traffic shaping
Data Conversion
  Recognizing pages/categories
  Building category “hierarchy”
Categories
Confusing structure
  Netherlands: 229
    Physics, Planets, Illusions, Psychology, Literature, Organ, Neuroscience, etc.
  Maximum depth: 5
  Median: 31
  Mean: 33.87
Date Extraction – Regular Expressions
Regular expressions aren't for parsing
  Day = (\d+)\.
  Month = (Jan|Feb|...)
  Year = (\d+)
  Date = (Day Month Year | Day Month | Month Year | Year)
  Extract = (“from” Date “until” Date | Date “-” Date | “between” Date “and” Date | “from” Date)
The day number can appear at 14 positions
In reality more than 1000 slots
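The grammar above can be composed mechanically, which makes the slot explosion easy to see. The following is a minimal sketch, not the project's actual code: the English month names, the sample sentence, and all variable names are assumptions made for illustration. It builds one large pattern from the Day/Month/Year pieces and counts how often the Day group occurs.

```python
import re

# Building blocks from the slide; English month names are an assumption
# (the project itself processed Czech Wikipedia).
DAY = r"(\d+)\."
MONTH = (r"(January|February|March|April|May|June|July|"
         r"August|September|October|November|December)")
YEAR = r"(\d+)"

# Date = Day Month Year | Day Month | Month Year | Year
DATE = f"(?:{DAY} {MONTH} {YEAR}|{DAY} {MONTH}|{MONTH} {YEAR}|{YEAR})"

# Extract = from..until | Date - Date | between..and | from Date
EXTRACT = (f"(?:from {DATE} until {DATE}|{DATE} - {DATE}|"
           f"between {DATE} and {DATE}|from {DATE})")

pattern = re.compile(EXTRACT)

# Date occurs 7 times in Extract and Day occurs twice in each Date,
# so the day number can land in 7 * 2 = 14 different capture groups.
day_positions = EXTRACT.count(DAY)
total_groups = pattern.groups  # 8 groups per Date * 7 Dates = 56

match = pattern.search("from 3. January 1865 until 5. February 1866")
filled = [g for g in match.groups() if g is not None]
print(day_positions, total_groups, filled)
```

Even this small grammar yields 56 capture groups, most of them empty for any given match, which illustrates why a single monolithic expression becomes hard to maintain.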
Date Extraction - Tools
Standard way:
  GNU Flex / GNU Bison
  Ragel
Problem with UTF-8 support
  Unicode: almost 100,000 characters
  Big transition tables (100,000 vs 127)
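The table-size concern can be made concrete with simple arithmetic. This is a sketch under the assumption of a dense transition table with one row per automaton state and one column per input symbol; the state count of 1,000 is a hypothetical figure, while the alphabet sizes are the ones quoted on the slide.

```python
def dense_table_entries(states: int, alphabet_size: int) -> int:
    """Entries in a dense table: one row per state, one column per symbol."""
    return states * alphabet_size

STATES = 1_000  # hypothetical number of automaton states

ascii_entries = dense_table_entries(STATES, 127)        # figure from the slide
unicode_entries = dense_table_entries(STATES, 100_000)  # figure from the slide

print(ascii_entries, unicode_entries, unicode_entries // ascii_entries)
```

Under these assumptions the Unicode table is several hundred times larger than the ASCII one, regardless of the state count chosen.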
Date Extraction - Mixed
Lexical Analysis
  Regular expressions
  Filling a table
Syntactic Analysis
  Theoretically a CFG
  Practically regular expressions again
Date Extraction - Example
Lexical Analysis
  “From 23 January 1956 until 2 February 1960”
  “From {{DATE_1}} until {{DATE_2}}”
Syntactic Analysis
  Interval = “From” DATE “until” DATE
  Interval = “Between” DATE “and” DATE
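The two phases can be sketched as follows. This is an illustrative reconstruction, not the project's implementation: the function names `lex` and `parse_interval`, the English month names, and the table layout are all assumptions. The lexical phase replaces each date with a placeholder and fills a table; the syntactic phase then matches short rules over the placeholders only.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Lexical rule: optional day, month name, year (a simplification).
DATE_RE = re.compile(r"\b(?:(\d{1,2}) )?(" + "|".join(MONTHS) + r") (\d{1,4})\b")

def lex(text):
    """Replace each date with {{DATE_n}} and record its parts in a table."""
    table = {}

    def replace(match):
        key = f"DATE_{len(table) + 1}"
        day, month, year = match.groups()
        table[key] = (int(day) if day else None,
                      MONTHS.index(month) + 1,
                      int(year))
        return "{{" + key + "}}"

    return DATE_RE.sub(replace, text), table

# Syntactic rules operate on placeholders, so they stay short.
PLACEHOLDER = r"\{\{(DATE_\d+)\}\}"
RULES = [
    re.compile(rf"From {PLACEHOLDER} until {PLACEHOLDER}"),
    re.compile(rf"Between {PLACEHOLDER} and {PLACEHOLDER}"),
]

def parse_interval(text):
    """Return (start, end) date tuples, or None if no rule matches."""
    tokens, table = lex(text)
    for rule in RULES:
        match = rule.fullmatch(tokens)
        if match:
            return table[match.group(1)], table[match.group(2)]
    return None

tokens, table = lex("From 23 January 1956 until 2 February 1960")
print(tokens)
print(parse_interval("From 23 January 1956 until 2 February 1960"))
```

Because the date details live in the table, each syntactic rule needs only two capture groups instead of one group per possible date component.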
Date Representation
Dates from 10,000 BC to 2500 AD
Not exact: 13th century, June 1689
Year zero
  2 January - 5 days = 28 December
  2 January 1 AD - 5 days = 28 December 1 BC
Simple tuples
  (“I”, 23, 1, 1956, 20, 2, 2, 1960, 20)
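The year-zero example and the tuple layout can be checked with a small sketch. Several things here are assumptions rather than the project's design: the proleptic Gregorian calendar, the reading of the tuple as (type, day, month, year, century) for the start and end of an interval, and all function names. Python's `datetime` cannot represent BC dates, so the sketch uses a well-known integer day-count algorithm with astronomical year numbering (year 0 = 1 BC).

```python
def days_from_civil(year, month, day):
    """Days since 1970-01-01, proleptic Gregorian, astronomical years."""
    y = year - 1 if month <= 2 else year
    era = y // 400
    yoe = y - era * 400
    doy = (153 * (month - 3 if month > 2 else month + 9) + 2) // 5 + day - 1
    doe = yoe * 365 + yoe // 4 - yoe // 100 + doy
    return era * 146097 + doe - 719468

def civil_from_days(z):
    """Inverse of days_from_civil: returns (year, month, day)."""
    z += 719468
    era = z // 146097
    doe = z - era * 146097
    yoe = (doe - doe // 1460 + doe // 36524 - doe // 146096) // 365
    doy = doe - (365 * yoe + yoe // 4 - yoe // 100)
    mp = (5 * doy + 2) // 153
    day = doy - (153 * mp + 2) // 5 + 1
    month = mp + 3 if mp < 10 else mp - 9
    year = yoe + era * 400 + (1 if month <= 2 else 0)
    return year, month, day

def shift(day, month, year, era, delta_days):
    """Add delta_days to a BC/AD date; there is no year zero in BC/AD."""
    astro = year if era == "AD" else 1 - year
    y, m, d = civil_from_days(days_from_civil(astro, month, day) + delta_days)
    return (d, m, y, "AD") if y >= 1 else (d, m, 1 - y, "BC")

def interval_tuple(start, end):
    """Build ("I", d, m, y, century, d, m, y, century); layout is assumed."""
    def century(year):
        return (year - 1) // 100 + 1
    (d1, m1, y1), (d2, m2, y2) = start, end
    return ("I", d1, m1, y1, century(y1), d2, m2, y2, century(y2))

print(shift(2, 1, 1, "AD", -5))
print(interval_tuple((23, 1, 1956), (2, 2, 1960)))
```

Converting to astronomical years before doing arithmetic keeps the day count continuous across the BC/AD boundary, and the conversion back restores the conventional labels.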
Web application
PHP5 + MySQL
Nette Framework + Dibi
http://css.majlis.cz/
GT: http://jdem.cz/dspw9
HTML, JSON, XML output
iGoogle Gadget
iGoogle = Google personalized homepage
URL: http://jdem.cz/dspx7
Using JSON
Tricky development
Future Work
Improve performance
  20th century events – 28 s – 406,980 (one OR)
  20th century events – 0.0007 s – 392,573 (no OR)
Improve parser architecture
Questions?
Thank You!