Building a corpus from the WWW for Arabic
Arabic NLP group at Imam University, 2013
Al-Fridi.A, Bhattab.R, Al-Rakaf.N
Outline
• Introduction
• Data collection
• Data processing
• Architecture
• Problems
• Tools Methodology
• Conclusion
Introduction
• Building a corpus requires considerable time and effort.
• Texts may not be easily available for building a corpus.
• Using Web data as a corpus has developed into a new strand of research.
• The Web is immense, free, and available.
• The Web serves as a source of language data because it is so much larger than other sources.
• The idea of building corpora dates back to 1897, with the German linguist Kaeding.
Data collection
• There are many ways to collect data from websites.
• A locally developed spider program was used to get the data from each site.
• The Arabic Optical Character Recognition (OCR) program Automatic Reader was used.
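The spider itself is not shown in the slides. As a rough illustration of the idea, the following is a minimal sketch of a breadth-first spider that stays on one site and keeps the visible text of each page. The names `LinkAndTextParser`, `crawl`, and the caller-supplied `fetch` function are assumptions made for this example, not the group's actual program.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkAndTextParser(HTMLParser):
    """Collects href targets and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl restricted to the host of start_url.

    fetch is a hypothetical caller-supplied function that maps a URL to
    its HTML text, or returns None when the page cannot be retrieved.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        pages[url] = " ".join(parser.text_parts)
        for href in parser.links:
            target = urljoin(url, href).split("#")[0]  # drop fragments
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages
```

Passing `fetch` in from outside keeps the traversal logic separate from network access, so the same function can be tried on an in-memory dictionary of pages before it is pointed at a real site.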
Data processing
The processing of the data to obtain the corpus consisted of the following steps:
• Language classification.
• Linguistic filtering.
• Processing.
• Corpus indexing.
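The four steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: language classification is approximated by the share of Arabic letters, linguistic filtering by a minimum word count, processing by removing diacritics and tatweel, and indexing by a simple inverted index. The thresholds and all function names are hypothetical and are not taken from the slides.

```python
import re
from collections import defaultdict

ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")  # harakat and tatweel
TOKEN = re.compile(r"[\u0621-\u064A]+")


def arabic_ratio(text):
    """Share of alphabetic characters that are Arabic letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if ARABIC_LETTER.match(ch))
    return arabic / len(letters)


def is_arabic(text, threshold=0.7):
    """Step 1, language classification (assumed threshold)."""
    return arabic_ratio(text) >= threshold


def passes_filter(text, min_words=5):
    """Step 2, a stand-in for linguistic filtering: drop very short texts."""
    return len(text.split()) >= min_words


def normalize(text):
    """Step 3, processing: strip diacritics and tatweel."""
    return DIACRITICS.sub("", text)


def build_index(docs):
    """Step 4, corpus indexing: map each token to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in TOKEN.findall(text):
            index[token].add(doc_id)
    return index


def build_corpus(raw_docs, threshold=0.7, min_words=5):
    """Run the four steps in order and return (kept documents, index)."""
    kept = {}
    for doc_id, text in raw_docs.items():
        if is_arabic(text, threshold) and passes_filter(text, min_words):
            kept[doc_id] = normalize(text)
    return kept, build_index(kept)
```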
Architecture
Problems
• Textual layout.
• Spelling mistakes.
• Duplicates.
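The slides do not say how duplicates were handled. One common approach, shown here as a minimal sketch, is to fingerprint each document after collapsing whitespace, so that copies differing only in layout are treated as the same text. The function names are hypothetical, and this catches exact duplicates only, not near-duplicates or spelling variants.

```python
import hashlib
import re

WHITESPACE = re.compile(r"\s+")


def fingerprint(text):
    """Hash of the text after collapsing runs of whitespace."""
    canonical = WHITESPACE.sub(" ", text).strip()
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def drop_duplicates(docs):
    """Keep the first document seen for each fingerprint."""
    seen = set()
    unique = {}
    for doc_id, text in docs.items():
        key = fingerprint(text)
        if key not in seen:
            seen.add(key)
            unique[doc_id] = text
    return unique
```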
Tools Methodology
Crawler System
Cosmas Query
BootCaT
• BootCaT was the first to propose a full procedure for the automated extraction of specialized corpora and technical terms by web mining.
• Let us try to build a corpus.
Sketch Engine
Introduction
• The Sketch Engine is a corpus processing system developed in 2002.
• The basic elements of the Sketch Engine are concordances, word sketches, grammatical relations, and a distributional thesaurus.
• The Sketch Engine service makes a number of large web corpora available for online analysis through a web-based corpus query interface.
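To make the notion of a concordance concrete, here is a minimal keyword-in-context sketch. It is not the Sketch Engine's implementation; the function name `concordance` and the `window` parameter are assumptions for this example, and matching is by exact token only.

```python
def concordance(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines
```

Each returned tuple corresponds to one concordance line, with the keyword in the middle and up to `window` tokens of context on either side.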
Sketch Engine
Implementation and Design
• The Sketch Engine has a different query system.
• A Word Sketch includes: subject, object, prepositional object, and modifier.
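As a rough illustration of what a word sketch summarizes, the sketch below groups a headword's collocates by grammatical relation. It assumes the corpus has already been parsed into (headword, relation, collocate) triples, which is a hypothetical input format, and it ranks collocates by raw frequency as a simplification rather than by a statistical association score.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword, top_n=3):
    """Group the collocates of headword by grammatical relation."""
    by_relation = defaultdict(Counter)
    for head, relation, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    # Keep the most frequent collocates for each relation.
    return {rel: counts.most_common(top_n) for rel, counts in by_relation.items()}
```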
Ghawwas tool (أداة غواص)
Conclusion
• Building a corpus from the WWW for Arabic.
• Ways of collecting data from the web.
• The problems we faced and the tools that supported us in building the corpus.
Acknowledgments
This work was supervised by Dr. Amal Al-Saif. We thank her for helping and supporting us.