A Data Warehouse for Canadian Literature Eduardo Gutarra




  • Slide 1

    Eduardo Gutarra

  • Slide 2
    Overview
    Introduction and Motivation
    Background

  • Slide 3
    Introduction
    Data warehouses are often used as one of the main components of decision support systems.
    Data warehouses can be used to perform analyses in different fields, as long as there is a lot of data.
    We want to build a data warehouse on places mentioned in books.
    The Gutenberg Canada website provides books as text files and in other formats, free of charge.
    [Figure: place names Barcelona, Saint John, Montreal]

  • Slide 4
    Motivation
    The project is inspired by the LitOLAP project.
    It seeks to apply data warehousing techniques in the domain of text processing.
    It allows a literary researcher to answer questions about an author's style or a particular book, among others.
    It facilitates the analysis of texts for a domain expert.
    What kind of applications could this have?

  • Slide 5
    Data warehouse
    A data warehouse is a database specifically used for reporting.
    Populating a data warehouse (DW) involves an ETL process in which the data is:
        Extracted from the data sources.
        Transformed to conform to the schema of the DW.
        Loaded into the data warehouse.
    Once the DW is populated, Online Analytical Processing (OLAP) can be performed on it.

  • Slide 6
    Data warehouse
    [Figure: Sales in Store 1, Sales in Store 2 and flat files feed the ETL process, which loads the data warehouse; OLAP cubes are built from the warehouse. Annotations in the figure: "Tend to be orders of magnitude larger", "Query response time is more important", "Transactional throughput is more important", "Summarize the data".]

  • Slide 7
    ETL Process
    According to Kimball, about 70% of the effort is spent on the ETL process.
    My project has a single data source.
    Obtain the metadata and the books separately.
    [Figure: Gutenberg Canada (index.html) and a metadata table with columns Author, Title, Year]

  • Slide 8
    [Figure: pipeline diagram. Labels: English? (Yes / No), Annotated XML File, Transform to Table Form, Structured Table, MySQL]

  • Slide 9
    Book1.txt
        21: I have lived in Saint John.
        22: This sentence has no place mentioned.
        ...
    Book1.xml
        21: I have lived in Saint John.
        22: This sentence has no place mentioned.
        ...
    Natural Language Processing
    GATE: open-source software for text processing.
    A gazetteer determines which words or phrases are locations.
    Annotates sentences and locations.
    Produces an XML file.

  • Slide 10
    [Figure: the pipeline diagram of Slide 8, repeated]

  • Slide 11
    Book1.xml
        21: I have lived in Saint John.
        22: This sentence has no place mentioned.
        ...
    Book2.xml
        31: This sentence mentions Fredericton and Halifax.
        32: This sentence mentions Saint John.
        ...
    Once the XML file is written, a process transforms it into a single denormalized table.

    Book    Place         Sentence  Frequency
    Book1   Saint John    21        1
    Book2   Fredericton   31        1
    Book2   Halifax       32        1

    Book    Place         Sentence  Frequency
    Book1   Saint John    21        1
    Book1   NONE          22        1
    Book2   Fredericton   31        1
    Book2   Halifax       32        1

  • Slide 12
    [Figure: the pipeline diagram of Slide 8, repeated]

  • Slide 13
    The Multidimensional Model
    We use the multidimensional model to design the way the data is structured.
    The multidimensional model divides the data into measures and context.
    Measures: the numerical data being tracked, stored as facts.
    Dimensions provide the context for the facts.

  • Slide 14
    Measures: Units Sold (20), Profit ($45)
    Dimensions: Time, Product, Location

  • Slide 15
    The Star Schema
    When we store a multidimensional model in a relational database, it is called a star schema.

    Fact table
    ProductID  LocationID  MonthID  Units Sold  Profit
    2          1           2        20          45
    ..

    Dimension tables
    ProductID  Product
    1          Sardines
    2          Anchovies
    3          Herring
    4          Pilchards

    LocationID  Location
    1           Boston
    2           Benson
    3           Seattle
    4           Wichita

    MonthID  Month
    1        April
    2        May
    3        June
    4        July

    [Figure labels: 20, $45; 2NF, 3NF, 2NF]

  • Slide 16
    Attributes
    Attributes are abstract items for convenient qualification or summarization of data.
    Attributes often form hierarchies.
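A hierarchy such as Month, Quarter, Year lets the same measure be summarized at a finer or a coarser level. The following is a minimal sketch of that roll-up, assuming Python's built-in sqlite3 module in place of the MySQL database used in the project. The table names (`time_dim`, `sales_fact`), the helper names and the sample values are illustrative assumptions, not taken from the project.

```python
import sqlite3

# Hierarchy levels of the (hypothetical) time dimension, finest to coarsest.
LEVELS = ("month", "quarter", "year")


def build_demo():
    """Create a tiny star schema: one time dimension and one fact table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE time_dim (
            time_id INTEGER PRIMARY KEY,
            month   TEXT,
            quarter TEXT,
            year    INTEGER
        );
        CREATE TABLE sales_fact (
            time_id    INTEGER REFERENCES time_dim(time_id),
            units_sold INTEGER
        );
        """
    )
    conn.executemany(
        "INSERT INTO time_dim VALUES (?, ?, ?, ?)",
        [
            (4, "April", "Q2", 2010),
            (5, "May", "Q2", 2010),
            (6, "June", "Q2", 2010),
            (7, "July", "Q3", 2010),
        ],
    )
    # Illustrative measure values, one fact row per month.
    conn.executemany(
        "INSERT INTO sales_fact VALUES (?, ?)",
        [(4, 33), (5, 20), (6, 45), (7, 10)],
    )
    return conn


def units_by(conn, level):
    """Sum the measure at one level of the Month/Quarter/Year hierarchy."""
    if level not in LEVELS:
        raise ValueError(f"unknown hierarchy level: {level}")
    query = (
        f"SELECT t.{level}, SUM(f.units_sold) "
        "FROM sales_fact f JOIN time_dim t ON t.time_id = f.time_id "
        f"GROUP BY t.{level} ORDER BY MIN(t.time_id)"
    )
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    demo = build_demo()
    for level in LEVELS:
        print(level, units_by(demo, level))
```

Moving from `month` to `quarter` to `year` only changes the grouping column; the fact table itself is untouched, which is what makes a hierarchy convenient for summarization.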
    TimeID  Month      Quarter  Year
    1       January    Q1       2010
    2       February   Q1       2010
    3       March      Q1       2010
    4       April      Q2       2010
    5       May        Q2       2010
    6       June       Q2       2010
    7       July       Q3       2010
    8       August     Q3       2010
    9       September  Q3       2010
    10      October    Q4       2010
    11      November   Q4       2010
    12      December   Q4       2010
    13      January    Q1       2011

    Finest to coarsest: Month, Quarter, Year.
    [Figure: the values 33, 20 and 45 within Q2 are summarized as 98 in the cell Q2 x Anchovies x Boston]

  • Slide 17
    Cube: SentenceID x PlaceID, with measure Frequency
    Place: Place ID, City, Country, Continent
    Fact table: Sentence ID, Place ID, Frequency
    Sentence: Sentence ID, Text, Sentence #, Book, Author, Occupation

  • Slide 18
    Issues with the Design
    What if the place is a country? What if the place is a continent?
    "I live in Canada." "I live in North America."
    The dummy value "Unspecified" can fill in the missing values.

    PlaceID  City         Country  Continent
    40       Unspecified  Canada   North America
    41       Unspecified           North America
    42       Unspecified           South America

  • Slide 19
    Issues with the Design
    "I live in London." London in England, or London in Ontario?

    PlaceID  City    Country  Continent
    40       London  Canada   North America
    41       London  England  Europe

    Two possible solutions:
        Allocation.
        Determining the country from context.

  • Slide 20
    Issues with the Design
    There is a many-to-many relationship between authors and books.
    Many-to-many relationships are tricky. They can lead to double-counting and other problems.

    Author          Title                             SentenceID  Frequency
    ?               The Knight of the Burning Pestle  1           1
    Fletcher, John  A Story                           2           1
    ?               A Tale of The Big Mountain        3           1

    Each "?" stands for the author pair Beaumont, Francis and Fletcher, John.

    Allocation
    Author             Title                             SentenceID  Frequency
    Beaumont, Francis  The Knight of the Burning Pestle  1
    Fletcher, John     The Knight of the Burning Pestle  1
    Fletcher, John     A Story                           2           1
    Beaumont, Francis  A Tale of The Big Mountain        3
    Fletcher, John     A Tale of The Big Mountain        3

    Additional attribute
    Author_1           Author_2        Title                             SentenceID  Frequency
    Beaumont, Francis  Fletcher, John  The Knight of the Burning Pestle  1           1
    Fletcher, John     NULL            A Story                           2           1
    Beaumont, Francis  Fletcher, John  A Tale of The Big Mountain        3           1

  • Slide 21
    Add two tables to the star schema: a bridge table and an outrigger table.
    Place: Place ID, City, Country, Continent
    Fact table: Sentence ID, Place ID, Frequency
    Sentence (dimension table): Sentence ID, Text, Sentence #, Book, AuthorGID
    Author tables (bridge and outrigger): AuthorGID, AuthorID, AuthorName, Occupation

  • Slide 22
    Cube: AuthorID x SentenceID x PlaceID
    Figure labels: Frequency, Text
    Authors: Author ID, Author Name, Occupation, DOB, DOD
    Sentences: Sentence ID, Book Name, Sentence #
    Places: Place ID, City, Country, Continent
    Fact table: Sentence ID, Place ID, Frequency, Author ID

  • Slide 23
    Data Integration

  • Slide 24
    PlaceID  City    Country  Continent
    33       London  England  Europe
    :        :       :        :
    45       London  Canada   North America

    BookID  SentenceID  PlaceID  Frequency
    28      10          33       0.8
    28      10          45       0.2

  • Slide 25
    Algorithms for reading unstructured data into relational tables
    Parse the XML file and read a sentence from it.
    Add the sentence to the table of sentences.
    Check whether the sentence contains a place.
    If there is a place, check whether it is new.
    If it is a new place, add an entry for it to the places table.

  • Slide 26
    OLAP Schema
    The OLAP schema file defines which table is the fact table and which tables are the dimension tables in the MySQL schema.
    Once the cube is built, we also add an OLAP schema file.
    [Figure label: MySQL]

  • Slide 27
    http://msdn.microsoft.com/en-us/library/aa216779(v=sql.80).aspx

  • Slide 28

  • Slide 29
    MDX Query Language

  • Slide 30

  • Slide 31
    Author                             Title                             SentenceID  Frequency
    Beaumont, Francis; Fletcher, John  The Knight of the Burning Pestle  1           1
    Fletcher, John                     A Story                           2           1
    Beaumont, Francis; Fletcher, John  A Tale of The Big Mountain        3           1

  • Slide 32
    A Comparison
    Multidimensional models
        More appropriate for OLAP applications.
        Provide faster query response times.
        Reduce the number of joins.
        Make the data easier to understand.
        Queried with MDX (Multidimensional Expressions).
    Relational models
        More appropriate for OLTP, or operational, databases.
        Better transactional throughput.
        Reduce redundancies as much as possible.
        Queried with SQL (Structured Query Language).
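The loading steps listed on Slide 25 (read a sentence, store it, check it for places, add any new place) can be sketched roughly as follows. This is only an illustration under stated assumptions: the XML layout with `<book>`, `<sentence>` and `<place>` elements is a hypothetical simplification and not GATE's actual output format, sqlite3 stands in for MySQL, and all table, column and function names are invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, simplified annotation layout; real GATE output looks different.
SAMPLE_XML = """
<book title="Book2">
  <sentence n="31">This sentence mentions <place>Fredericton</place> and <place>Halifax</place>.</sentence>
  <sentence n="32">This sentence mentions <place>Saint John</place>.</sentence>
  <sentence n="33">This sentence has no place mentioned.</sentence>
</book>
"""


def create_schema(conn):
    """Create the sentence and place tables plus the fact table linking them."""
    conn.executescript(
        """
        CREATE TABLE sentence (
            sentence_id INTEGER PRIMARY KEY,
            book TEXT, sentence_no INTEGER, text TEXT
        );
        CREATE TABLE place (place_id INTEGER PRIMARY KEY, city TEXT UNIQUE);
        CREATE TABLE place_sentence (
            sentence_id INTEGER, place_id INTEGER, frequency INTEGER
        );
        """
    )


def load_book(conn, xml_text):
    """Store every sentence of one book and register the places it mentions."""
    book = ET.fromstring(xml_text.strip())
    title = book.get("title")
    for sent in book.iter("sentence"):
        # Always add the sentence to the table of sentences.
        text = "".join(sent.itertext())
        sentence_id = conn.execute(
            "INSERT INTO sentence (book, sentence_no, text) VALUES (?, ?, ?)",
            (title, int(sent.get("n")), text),
        ).lastrowid
        # Count how often each place occurs in this sentence.
        mentions = Counter(
            p.text.strip() for p in sent.iter("place") if p.text and p.text.strip()
        )
        for city, frequency in mentions.items():
            row = conn.execute(
                "SELECT place_id FROM place WHERE city = ?", (city,)
            ).fetchone()
            if row is None:
                # New place: add an entry to the places table.
                place_id = conn.execute(
                    "INSERT INTO place (city) VALUES (?)", (city,)
                ).lastrowid
            else:
                place_id = row[0]
            conn.execute(
                "INSERT INTO place_sentence VALUES (?, ?, ?)",
                (sentence_id, place_id, frequency),
            )
    conn.commit()


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    create_schema(db)
    load_book(db, SAMPLE_XML)
    for row in db.execute("SELECT * FROM place ORDER BY place_id"):
        print(row)
```

Looking a place up before inserting it keeps one row per place in the dimension table, so loading further books reuses existing place IDs instead of duplicating them. The sketch does not attempt the disambiguation or allocation discussed on Slides 19 and 24.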