
Research of Public News Monitor Software Based on Chinese Semantic Model

Yafu Xiao [email protected]

Xiongkai Shao [email protected]

Jianzhou Liu [email protected]

Yanjie Zhang [email protected]

School of Computer, Hubei University of Technology Wuhan, China

Abstract—This paper proposes a new monitoring method based on a semantic model and verifies it on both websites and microblogs. The method uses a series of hand-crafted news event templates for a certain domain, which a computer program resolves in order to search for and crawl web news events automatically; the program then uses the same templates to analyse and classify the collected news events. Such a mechanism can replace manual monitoring work, and it also improves the classification of news event material.

Keywords—semantic model; XML database; web news event monitoring; microblog information collection

I. INTRODUCTION

Net news event monitoring is a rapidly developing product of today's Internet in the Web 2.0 era. With the booming of social networks, people's enthusiasm for expressing and disseminating news events has increased rapidly. Apparently, this phenomenon has greatly influenced government decision-making. Against this background, it is necessary for certain departments or governments to collect, discover, and classify news events on the network[1].

This paper researches web news event monitoring software based on a semantic model. The semantic model[2], which is the core of the monitoring software, defines a set of news events of concern to administrators or experts of a certain domain; for each kind of event it establishes the core keywords and the syntagmatic relations between them. Based on the semantic model, the program generates data tables and the related data. When the information collection job begins, the topic crawler uses the keywords included in the related data to search for and download news event information; after the download job is done, the program uses the semantic model again to analyse and classify the content, in order to obtain a complete and accurate final data set.

II. ANALYSIS OF SEMANTIC MODEL

The semantic model is a structural model used to describe certain kinds of news events. It is written in XML and has good universality and extensibility. The semantic model is the core of the whole web news event monitoring software.

In general, a semantic model contains the elements below:

1) The name of one event category.

2) BasicKeywords. No event can stand apart from certain basic keywords that people habitually use when talking about it; these words are the basic elements of a semantic model.

3) Template combination. A template determines how to construct one semantic model from the basic keywords.

As an example of public news events in universities, a possible semantic model may be defined like this:

<AllEvents>
  <Event name="students' violation of discipline">
    <BasicKeywords name="university">Wuhan University, Hubei University of Technology</BasicKeywords>
    <BasicKeywords name="activity">skip class, cheat</BasicKeywords>
    <Template ID="01">university + activity</Template>
    <Template ID="02">activity + university</Template>
  </Event>
  <Event name="campus security">
    <BasicKeywords name="university">Wuhan University, Hubei University of Technology</BasicKeywords>
    <BasicKeywords name="activity">dormitory stealing, gambling</BasicKeywords>
    <Template ID="01">university + activity</Template>
    <Template ID="02">activity + university</Template>
  </Event>
</AllEvents>

The semantic model above defines two different kinds of event; for each one, the XML document sets the basic keywords and templates. Take the "students' violation of discipline" event for instance: inside the model we put two groups of basic keywords, one for university names and the other for activities, and each group holds several words as displayed. The "university" BasicKeywords node contains university names, while the "activity" node contains certain actions. Finally, the Template nodes organize the basic keywords according to the way people speak.



III. WEB NEWS MONITORING SOFTWARE BASED ON SEMANTIC MODEL

The whole monitoring software contains these main modules: 1. website crawler; 2. microblog crawler; 3. analysis and categorization module. All of these modules use the semantic model to complete their jobs.

A. Definition of the Semantic Model

The semantic model is an XML document whose root node is "AllEvents"; the child nodes of the root are named "Event". This paper defines five kinds of event: 1. academic corruption; 2. students' violation of discipline; 3. students' crime; 4. students' psychological and health problems; 5. campus security problems. All five events belong to the "AllEvents" node.

Inside each "Event" node, this paper defines its feature nodes, which fall into two kinds: BasicKeywords nodes that contain the basic words, and Template nodes that organize the keywords. The "BasicKeywords" nodes define the special vocabulary elements of a certain news event, and the "Template" nodes define how the words of one event are organized.
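As a minimal sketch of how a program might resolve such a model, the Java fragment below reads the example XML of Section II with the standard DOM API and lists each event's keyword groups and templates. The file name and class name are assumptions for illustration; the paper does not show its own parsing code.

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.*;

    public class SemanticModelReader {
        public static void main(String[] args) throws Exception {
            // Load the semantic model XML file (file name is assumed)
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(new File("semantic-model.xml"));

            NodeList events = doc.getElementsByTagName("Event");
            for (int i = 0; i < events.getLength(); i++) {
                Element event = (Element) events.item(i);
                System.out.println("Event: " + event.getAttribute("name"));

                // Each BasicKeywords node holds one comma-separated word group
                NodeList groups = event.getElementsByTagName("BasicKeywords");
                for (int j = 0; j < groups.getLength(); j++) {
                    Element group = (Element) groups.item(j);
                    System.out.println("  Group " + group.getAttribute("name")
                            + ": " + group.getTextContent().trim());
                }

                // Each Template node describes one way the word groups combine
                NodeList templates = event.getElementsByTagName("Template");
                for (int j = 0; j < templates.getLength(); j++) {
                    Element t = (Element) templates.item(j);
                    System.out.println("  Template " + t.getAttribute("ID")
                            + ": " + t.getTextContent().trim());
                }
            }
        }
    }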

B. Website Crawler

The website crawler module's duty is to download web pages to the local host. It first reads the semantic model and then generates search keywords for a certain meta search engine[3]; after that, the crawler converts them into the proper URL with the specified string encoding and sends it to the meta search engine. The search engine responds with a search result list, not the final web pages. As soon as the crawler receives the response, it resolves the result using a regular expression tailored to that meta search engine, so it can obtain the URL list of the news web pages. Finally, the crawler downloads the web pages and saves the important information into the database, such as the download date, the URL, and the path of the downloaded page file. The basic workflow is displayed in Figure 1.

Figure 1. Basic Work Flow of Website Crawler (Start → Read Semantic Model → Generate Search Keywords → Invoke Meta Search Engine → Resolve Response Page → Download Webpage → Save into Database and Save as Local Files → End)

The website crawler first reads data from the database, including the target websites and the semantic models. After the program gets the data, it uses the advanced features of a certain search engine (e.g. how to restrict the search to one website, and how to submit multiple keywords) to generate the search keywords. The crawler then converts the search keywords into a special URL and requests the search result page through it; after the search engine returns the result page, the crawler resolves the page and extracts the target URLs from it. Finally, the crawler fetches the web pages at the target URLs, downloads them to the local host, and saves the data into the database.
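For instance, the first step of this workflow, reading the target websites from MySQL, might look like the plain JDBC sketch below. The database name, credentials, and table layout are invented for illustration; only the use of MySQL is stated by the paper.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class TargetSiteReader {
        public static void main(String[] args) throws Exception {
            // Connection string, user, and password are assumptions
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/newsmonitor", "root", "password");
            Statement st = conn.createStatement();

            // Hypothetical table holding the monitored websites
            ResultSet rs = st.executeQuery("SELECT name, url FROM target_website");
            while (rs.next()) {
                System.out.println(rs.getString("name") + " -> " + rs.getString("url"));
            }
            rs.close();
            st.close();
            conn.close();
        }
    }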

This paper takes Baidu as an example to demonstrate how to crawl website news through a meta search engine[4]. Baidu search has an advanced search feature whose keyword format is:

searchKey = site:(website URL)(keyword11 | keyword12…)(keyword21 | keyword22…)…

The website URL is the target website address, and each keyword is a basic keyword from the semantic model; the keywords within the same pair of parentheses must come from the same BasicKeywords node in the semantic model. For example, a search keyword string may look like this:

site:(hb.qq.com)(Wuhan University|Hubei University of Technology)(dormitory stealing|gambling)

As shown above, "hb.qq.com" is one of the target websites, and the keywords are separated into two groups, one for university names and the other for activities. When the program gets this string, it converts it into a search URL according to the rules of the chosen search engine. For example, the Baidu URL parameter configuration is displayed in Table 1:

Table 1. Parameter Configuration of the Baidu Search URL

Param Name | Function
wd | Keyword string to be searched
cl | Search type: cl=3 means web page search, cl=2 means picture search
pn | Result page number
ie | Encoding of the keyword; the default value handles Simplified Chinese, and both utf-8 and gb2312 are accepted

Based on the sample search keywords above, the generated URL is:

http://www.baidu.com/s?wd=site%3A%28hb.qq.com%29%28%E6%AD%A6%E6%B1%89%E5%A4%A7%E5%AD%A6%7C%E6%B9%96%E5%8C%97%E5%B7%A5%E4%B8%9A%E5%A4%A7%E5%AD%A6%29%28%E5%AE%BF%E8%88%8D%E8%A2%AB%E7%9B%97%7C%E8%B5%8C%E5%8D%9A%29&rsv_bp=0&ch=&tn=baidu&bar=&rsv_spt=3&ie=utf-8&rsv_n=2&inputT=577.

Within the URL there are many text strings such as "%29%28%E6%AD…"; these strings are the URL encoding of the search keywords.
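As a rough illustration of this step, the sketch below builds the search keyword string from two keyword groups and URL-encodes it into a Baidu search URL using the wd, ie, and pn parameters of Table 1. The class and helper names are assumptions, not the paper's actual code, and the parameter set is simplified relative to the full URL shown above.

    import java.net.URLEncoder;

    public class SearchUrlBuilder {

        // Join a keyword group with "|", as in the advanced-search format above
        static String group(String[] words) {
            StringBuilder sb = new StringBuilder("(");
            for (int i = 0; i < words.length; i++) {
                if (i > 0) sb.append('|');
                sb.append(words[i]);
            }
            return sb.append(')').toString();
        }

        public static void main(String[] args) throws Exception {
            // site:(website)(university names)(activities)
            String searchKey = "site:(hb.qq.com)"
                    + group(new String[]{"武汉大学", "湖北工业大学"})  // Wuhan University, Hubei University of Technology
                    + group(new String[]{"宿舍被盗", "赌博"});         // dormitory stealing, gambling

            // URL-encode the keyword string (producing the %28%E6... sequences
            // seen above) and attach it as the wd parameter
            String url = "http://www.baidu.com/s?wd="
                    + URLEncoder.encode(searchKey, "utf-8")
                    + "&ie=utf-8&pn=0";
            System.out.println(url);
        }
    }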

Once the URL is generated, the crawler sends a request to the meta search engine to get the response page, which contains one page of the search result list; each item in the list represents one result with its title, abstract, and original URL. The crawler extracts this information according to the characteristics of the particular search engine, usually with regular expressions[5]. After that, the crawler sends requests to the real websites at the original URLs, downloads the pages to the local host, and saves records into the database.
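The exact markup of a result page differs per engine and changes over time, so the following Java fragment only illustrates the regular-expression extraction step schematically, run here on a made-up result-page fragment:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ResultLinkExtractor {
        public static void main(String[] args) {
            // A fragment of a hypothetical search-result page; real result-page
            // markup varies by engine and over time.
            String html = "<h3 class=\"t\"><a href=\"http://hb.qq.com/a/20130101/000123.htm\">"
                    + "Dormitory theft reported...</a></h3>";

            // Match anchor tags in result items and capture the original URL
            Pattern p = Pattern.compile("<a\\s+href=\"(http[^\"]+)\"");
            Matcher m = p.matcher(html);
            while (m.find()) {
                System.out.println("Target URL: " + m.group(1));
            }
        }
    }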

C. Microblog Crawler

The microblog crawler's job is to download microblog content related to the events the user is concerned about. When it works, it first resolves the semantic model to generate all possible search keywords and saves them into a database table. After this step, the microblog crawler invokes the public API provided by the microblog operator with the correct parameters. When the crawler receives the response from the microblog operator, usually a character string in JSON or XML format, it resolves the information and extracts the microblog content. Finally, the crawler saves the microblog information into the database. The basic workflow is displayed in Figure 2:

Figure 2. Basic Work Flow of Microblog Crawler (Start → Read Semantic Model → Generate Keyword List → Save Keyword List to DB → Read Keyword List → Initialize Microblog API → Invoke Microblog API → Resolve Response Data → Save Valid Data to DB → Close Microblog API → End)

When invoking the API of a microblog operator, the program needs to follow the parameters defined by that API; different operators have different parameters, but they are similar. This paper takes Tencent Microblog as an example; its key parameters are shown in Table 2:

Table 2. Key Parameters of the Tencent Microblog API

Param Name | Function
format | format of the response text (JSON or XML)
keyword | search keywords
msgtype | type of microblog (0 means all messages, 1 means original, 2 means reprint)
pagesize | number of records per page
page | current page number

When the parameters are configured, the crawler initializes the API, invokes its search function, and sends an HTTP request to the microblog operator. The crawler receives the HTTP response through a Java response object. The response usually has one of two formats, JSON or XML. The program resolves the response as soon as it arrives, and finally extracts the information and saves it into the database.
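Because the real Tencent Microblog API also requires OAuth authentication, the sketch below only illustrates the general request/response pattern with the parameters of Table 2 against a hypothetical endpoint; the host, path, and the omitted credential parameters are assumptions.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class MicroblogSearch {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint: the real operator API has its own
            // host/path and requires OAuth credentials omitted here
            String endpoint = "https://api.example-microblog.com/search"
                    + "?format=json"
                    + "&keyword=" + URLEncoder.encode("宿舍被盗", "utf-8") // dormitory stealing
                    + "&msgtype=0&pagesize=30&page=1";

            HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
            conn.setRequestMethod("GET");

            // Read the raw JSON response body into a string
            StringBuilder body = new StringBuilder();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "utf-8"));
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
            in.close();

            // A JSON parser (e.g. org.json or Gson) would then pull out each
            // microblog's text, author, and time before saving it to MySQL
            System.out.println(body);
        }
    }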

D. Analysis and Categorization Module

The original data set needs to be sifted and classified. On one hand, the data from the meta search engine is not accurate, especially when the search results are insufficient. On the other hand, the original information contains tremendous numbers of duplicate records, because much information on the web or on microblogs is reprinted or copied. So the originally extracted data needs a duplicate-removal and analysis workflow.

For removing duplicate records, since much online news and many microblog messages on today's Internet are reprinted or copied, this paper chooses the LCS (Longest Common Subsequence) algorithm[6] to detect them.

The main idea of LCS is as follows. Define character strings A = a1a2a3…an and B = b1b2b3…bm, and let L[i, j] denote the length of the longest common subsequence of the prefixes a1a2a3…ai and b1b2b3…bj (0 ≤ i ≤ n and 0 ≤ j ≤ m).

If ai = bj, then L[i, j] = L[i−1, j−1] + 1;

if ai ≠ bj, then L[i, j] = max{L[i, j−1], L[i−1, j]}.

So the recursion formula of the LCS algorithm is:

$$
L[i,j] =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0 \\
L[i-1, j-1] + 1 & \text{if } i, j > 0 \text{ and } a_i = b_j \\
\max\{L[i, j-1],\ L[i-1, j]\} & \text{if } i, j > 0 \text{ and } a_i \neq b_j
\end{cases}
\tag{1}
$$

This is the core idea of the LCS algorithm: for each pair of i and j (0 ≤ i ≤ n and 0 ≤ j ≤ m), we can use an (n+1) × (m+1) table to calculate L[i, j] by formula (1), and thereby obtain the length of the LCS of A and B.[7]

According to this paper, if the LCS length exceeds 80% of the length of the longer character string, the crawler judges that A and B are duplicate records.
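A direct Java rendering of formula (1) together with the 80% duplicate rule might look like the following; the class and method names are our own, as the paper does not list its implementation.

    public class DuplicateChecker {
        // Dynamic-programming LCS length, following formula (1):
        // an (n+1) x (m+1) table where L[i][j] is the LCS length of
        // the prefixes a1..ai and b1..bj
        static int lcsLength(String a, String b) {
            int n = a.length(), m = b.length();
            int[][] L = new int[n + 1][m + 1]; // row 0 and column 0 stay 0
            for (int i = 1; i <= n; i++) {
                for (int j = 1; j <= m; j++) {
                    if (a.charAt(i - 1) == b.charAt(j - 1))
                        L[i][j] = L[i - 1][j - 1] + 1;
                    else
                        L[i][j] = Math.max(L[i][j - 1], L[i - 1][j]);
                }
            }
            return L[n][m];
        }

        // Two records are treated as duplicates when the LCS covers more
        // than 80% of the longer string, as the paper's threshold states
        static boolean isDuplicate(String a, String b) {
            int longer = Math.max(a.length(), b.length());
            return longer > 0 && lcsLength(a, b) > 0.8 * longer;
        }

        public static void main(String[] args) {
            System.out.println(isDuplicate(
                    "dormitory stealing at Wuhan University",
                    "dormitory stealing at Wuhan Univ.")); // prints true
        }
    }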

For sifting and categorizing the data, this paper uses a word-spacing strategy[8] to process the original text. Its basic mechanism is to find a suitable distance D, relying on the expression habits of Chinese, that maximizes the function y = f(D), where y is the accuracy rate of the analysis result. The program uses the semantic model and D to generate a matching pattern and matches the target text through a regular expression[9]. If the match returns true, the program saves the record into the database and assigns it the current event category. The workflow of the analysis and categorization module is displayed in Figure 3, and a sketch of the matching step follows the figure:

Figure 3. Basic Work Flow of Analyse and Category Module (Start → Read Web News Event → Read Semantic Model → Text Pattern Matching → Is Matched? → Yes: Save into DB / No: Abandon → End)
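As an illustration of the word-spacing strategy, the sketch below turns the template "university + activity" into a regular expression in which the two keyword groups may be separated by at most D characters. The value D = 10 and all names here are assumptions for demonstration only.

    import java.util.regex.Pattern;

    public class TemplateMatcher {
        public static void main(String[] args) {
            // Template "university + activity" with an assumed maximum
            // distance D of 10 characters between the two keyword groups
            int d = 10;
            String university = "武汉大学|湖北工业大学"; // university keyword group
            String activity = "宿舍被盗|赌博";           // activity keyword group

            // Matched pattern: a university name, at most D arbitrary
            // characters, then an activity word (template ID 01 order)
            Pattern pattern = Pattern.compile(
                    "(" + university + ").{0," + d + "}(" + activity + ")");

            // "Recently a dormitory at Wuhan University was robbed"
            String text = "近日武汉大学一学生宿舍被盗";
            boolean matched = pattern.matcher(text).find();
            System.out.println(matched ? "save to DB with this event category"
                                       : "abandon");
        }
    }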

IV. EXPERIMENT DESIGN AND DATA PRESENTATION

As its experiment background, the public web news monitoring software chooses five kinds of university news and events that attract public concern. The events are academic corruption, students' violation of discipline, students' crime, students' psychological and health problems, and campus security.

As the objects of monitoring, this paper chooses ten universities in Hubei Province: Wuhan University, Huazhong University of Science and Technology, China University of Geosciences, Central China Normal University, Huazhong Agricultural University, Wuhan University of Technology, Zhongnan University of Economics and Law, Hubei University, Hubei University of Technology, and Wuhan Textile University.

For website news event monitoring, this paper chose 13 news or BBS websites, displayed in Table 3:

Table 3. Chosen News or BBS Websites

Website Name | Official URL
Tianya Community | http://tianya.cn/
Changjiang Times | http://www.changjiangtimes.com/
Changjiang Daily | http://cjmp.cnhan.com/
Chu Network | http://hb.qq.com
Tencent Education | http://edu.qq.com/
Sina Hubei | http://hb.sina.com.cn/
Xinhua Net | http://www.xinhuanet.com/
People's Network | http://www.people.com.cn/
Southern Weekly | http://www.infzm.com/
iFeng | http://www.ifeng.com/
Netease | http://www.163.com/
Sohu | http://www.sohu.com/
Tencent | http://www.qq.com/

The software is implemented in Java; the development tool is MyEclipse and the database is MySQL. The whole development and runtime environment is shown in Table 4:

Table 4. Development and Runtime Environment of the Web News Event Monitoring Software

CPU | Intel Core i5 2.66 GHz
RAM | 6 GB
Hard Drive | 1 TB, 5400 r/min
Operating System | Windows 8 Professional 64-bit
Development Tool | MyEclipse 8.5
Runtime Environment | JDK 1.7.0.25
Database | MySQL 5.5.22
Server | Tomcat 6.0.37

When the software first crawled the news events of websites and microblogs, it obtained an original data set of about 39891 website records and 32463 microblog records. After processing, the data set decreased to 1320 website records and 3948 microblog records; this decrease is mainly due to uncorrelated and duplicate records. The detailed statistics are shown in Table 5.

Table 5. Detailed Statistics of the Final Data Set

Event Category | Website | Microblog
academic corruption | 346 | 184
students' violation of discipline | 259 | 530
students' crime | 215 | 734
students' psychological and health problems | 230 | 1610
campus security | 270 | 890
total | 1320 | 3948

From the table we can see that the number of microblog news events is larger than that from websites, especially for events about students' life at university. One possible reason is that many microblog users are students or young people who are recent college graduates; these people are especially concerned about campus life and prefer microblogs to websites, so they are always willing to share or comment on this kind of news event on microblogs. But for an event like academic corruption, which is distant from young people's lives, there is less information on microblogs than on websites.

V. CONCLUSION AND FUTURE WORK

This paper has researched and verified web news event monitoring technology based on a semantic model. The monitoring area covers not only traditional news websites but also personal microblogs, so it is largely broadened. The chosen monitoring background is a set of important news events happening in universities, which has reference value for the management of universities and for administrative departments of education.

In the verification work the software performs well, but it has some disadvantages, listed below:


1. The software needs more universities to generate a more complete data set.

2. The analysis module is not precise enough.

3. The speed of analysis is a little slow.

All of this work needs to be done in the future. To make the monitoring software work better and be more robust, we will add more universities to the software and update the analysis module to obtain a more accurate data set.

REFERENCES

[1] Mai-Vu Tran, Minh-Hoang Nguyen, Sy-Quan Nguyen. VnLoc: A Real-time News Event Extraction Framework for Vietnamese[C]. 2012 Fourth International Conference on Knowledge and Systems Engineering: 161-162.

[2] Wang Tao, Wang Yanzhang, Lu Yanxia. Research on ontology-based meta event model of public emergencies[J]. Journal of Dalian University of Technology, 2012, 03: 458-463.

[3] Boo Vooi Keong, Patricia Anthony. Meta Search Engine Powered by DBpedia[C]. 2011 International Conference on Semantic Technology and Information Retrieval: 90-91.

[4] Li Jia, Zeng Ping. The Analysis and Applications about Baidu Search Feature[J]. Information Research, 2011, 08: 99-100.

[5] He Youquan, Xu Cheng, Xu Xiaole, Tang Huajiao. Approach of Eliminating Web Page Noise Based on Statistical Characteristics and DOM Tree[J]. Journal of Chongqing University of Technology (Natural Science), 2011, 01: 55-56.

[6] Mohamed Elhadi, Amjad Al-Tobi. Duplicate Detection in Documents and WebPages Using Improved Longest Common Subsequence and Documents Syntactical Structures[C]. 2009 Fourth International Conference on Computer Sciences and Convergence Information Technology: 679-680.

[7] M. H. Alsuwaiyel. Algorithm Design Techniques and Analysis[M]. Beijing: Publishing House of Electronics Industry, 2012: 130-131.

[8] Yao Zhanlei, Xu Xin. Research on the Detection of Sudden Events in News Stories of Online Information[J]. New Technology of Library and Information Service, 2011, 04: 54-55.

[9] Wu Xingrui. Study of Java Regular Expressions and Pattern Matching[J]. Public Communication of Science & Technology, 2011, 15: 180-186.
