Text mining

Natural Language Processing by Advanced Artificial Intelligence Methods

Jan Žižka

Department of Informatics, Faculty of Business and Economics

Mendel University in Brno, Czech Republic

[email protected], [email protected]

(Text Mining)



● Data, information, knowledge

● Electronic text data

● Inductive machine learning (ML)

● Pre-processing of data and its representation

● Methods of searching, similarity, pattern recognition

● Algorithms (just some examples)

● Application areas



● Data, information, knowledge

- data here means all (text) values obtained in some way (relevant or irrelevant, with or without noise, exact or inexact, approximate, and so on)

- information is the part of the data that is interesting from the viewpoint of the specific problem being solved

- knowledge is generalized information

- metaknowledge is “knowledge about knowledge” (for example, to know which knowledge is applicable to a specific problem)



● Electronic text data

Text in an electronic form (ASCII/ANSI, Unicode, etc.).

Typical text data can be found, e.g., on the Internet.

Electronic text is used in many areas.

Electronic text data are created in any common natural language, not only in the prevailing English.

Machine processing of such “human-like” data is extraordinarily complicated and often depends on the specific language.



● Inductive machine learning


- learning by using a limited set of examples;

- the examples generally cover only a part of reality;

- sufficient descriptive characteristics of the data are missing (for example, their distribution);

- a mathematical model cannot be created for a reliable prediction or classification;

- knowledge is obtained by the generalization of information.




What color is a crow?

Black? And why?

Has any of you seen a crow that was not black?

Has anyone seen absolutely all crows that have ever existed anywhere on Earth? (Surely not.)

To what degree is the generalization “a crow is black” correct and acceptable? Can you say?

How many specific crows do we need to see to generalize that “a crow is black”?




[Figure: the hooded crow, which is grey and black rather than entirely black.]




Generalizing from the specific examples available is one of the possible learning methods.

Unlike human beings, machines (computers) usually need a significantly larger number of specific examples to generalize, that is, to acquire knowledge.

A method for determining the degree of similarity plays a major role, for example, when assigning an unknown example to a certain group of known samples.




Machine learning algorithms set their relevant parameters automatically during the training phase. The quality of the training is verified during testing. If the test results are acceptable, the trained algorithm can be used for the given application.

The training phase requires suitable learning examples because an algorithm’s properties (parameters) are ultimately determined by the training data applied. The testing phase uses examples that were not used by the algorithm during its training phase.
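The separation of training and testing examples can be sketched as follows. This is only a minimal illustration: the function name `split_examples`, the 25 % test fraction, and the placeholder documents are assumptions made for the sketch, not part of the lecture.

```python
import random

def split_examples(examples, test_fraction=0.25, seed=0):
    """Shuffle labeled examples and split them into disjoint training and testing parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    # The first n_test shuffled examples are held out; the rest are used for training.
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder labeled documents (hypothetical data).
examples = [(f"document {i}", "+" if i % 2 else "-") for i in range(8)]
train_set, test_set = split_examples(examples)
print(len(train_set), "training examples,", len(test_set), "testing examples")
```

Because the two parts are disjoint, the quality measured on `test_set` reflects how well the algorithm generalizes to examples it has not seen.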



● Pre-processing of data, their representation

The typical way to get knowledge from unstructured electronic texts consists of the following steps:

- source → a necessary volume of (generally noisy) data

- removing noise → clean data

- interesting part of data from the application viewpoint → information

- information generalization → knowledge




Representing text documents: bag of words (BOW).

Machine learning methods mostly treat text documents as collections of symbolic values (terms, words) without analyzing their meaning (or only shallowly) or their mutual dependence.

The word order in a document is therefore considered “meaningless”. This naturally discards some of the information content; however, it significantly simplifies natural language processing from, for example, the classification point of view.
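A bag of words can be sketched with a simple word counter. The helper name `bag_of_words` and the whitespace tokenization are assumptions made for this illustration; real systems usually tokenize more carefully.

```python
from collections import Counter

def bag_of_words(text):
    """Represent a document as word counts, ignoring the word order."""
    return Counter(text.lower().split())

first = bag_of_words("machine learning")
second = bag_of_words("learning machine")

# Both phrases contain the same words, so their representations are identical.
print(first == second)
```

The two phrases have different meanings, yet the representation cannot tell them apart. This is the information lost in exchange for simpler processing.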




Pre-processing significantly affects the quality of the results:

- excluding common words that have no specific meaning from the application viewpoint (prepositions, abbreviations, definite/indefinite articles, etc.);

- excluding words with a very low or very high frequency across all processed documents;

- excluding punctuation, spaces, and so on;

- converting alphabetic characters to lower case;

- eliminating insignificant characters and words reduces the problem dimensionality (e.g., from 10^4 to 10^3) because each unique word is one dimension.
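The pre-processing steps listed above might look roughly like this in code. The stop-word set and the two sample sentences are hypothetical and far from complete; the function name `preprocess` is likewise only an assumption for the sketch.

```python
import re

# Illustrative stop-word list only; real lists are much longer and language-specific.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "is"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lower-case the text, drop punctuation and spaces, and remove stop words."""
    tokens = re.findall(r"\w+", text.lower())  # \w+ keeps letters and digits only
    return [token for token in tokens if token not in stop_words]

docs = ["The weather is nice.", "It is NOT very cold, is it?"]

# Each unique remaining word becomes one dimension of the representation.
vocabulary = sorted({token for doc in docs for token in preprocess(doc)})
print(vocabulary)
```

Filtering by very low or very high document frequency is not shown here; it would be applied to `vocabulary` after counting in how many documents each word occurs.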




An example of text representation in which we ignore punctuation, spatial zoning (new lines, paragraphs, chapters, etc.), upper and lower case, the presence of two languages (English terms in a Czech sentence), and word order, which can be very significant (for example, machine learning versus learning machine), and in which we exclude general words (“stop words”). The result is a dictionary (a list of symbols) used to train a chosen algorithm.

Source text (the Czech original of the description above):

Příklad representace textu, kde se ignoruje interpunkce, členění textu do řádků, velká a malá písmena, dvojjazyčnost (anglické termíny v české větě), pořadí slov, které může mít velký význam (např. machine learning – strojové učení a learning machine – učící stroj má zcela odlišný význam), a vynechají se obecná slova.

Resulting dictionary:

anglické české členění dvojjazyčnost ignoruje interpunkce learning machine má malá metody mít může obecná odlišný písmena pomocí pořadí příklad representace řádků slov stroj strojové termíny textu učení učící velká velký větě vynechají význam words zcela




A further dimensionality reduction can be obtained, for example, by converting words to their stems. In the previous example, we could reduce the generated dictionary by unifying grammatical forms (infinitive, grammatical case, singular, voice, and so on), so that 8 dimensions decrease to 4:

mít má stroj strojové učení učící velká velký → mít stroj učit velký

Stemming, of course, depends on the language. For English, there is a simplified procedure, Porter stemming, in which the machine simply cuts off word endings; this is far from perfect, but it is very effective in practice.
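The idea of cutting off word endings can be illustrated with a deliberately crude sketch. This is not the Porter algorithm, which applies several ordered rule sets with additional conditions; the suffix list, the minimum stem length, and the name `crude_stem` are assumptions chosen only for this example.

```python
# Hypothetical suffix list; far cruder than the rules of the Porter stemmer.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word, min_stem=3):
    """Cut off the first matching suffix if a long enough stem remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["learn", "learns", "learned", "learning"]
stems = {crude_stem(word) for word in words}

# Four dictionary entries collapse into a single dimension.
print(stems)
```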




Word occurrence can be represented in several ways:

- binary: 1/0 means that a word is/is not in a document (the word weight is 1 or 0);

- frequency: the word weight is given by its frequency in a document;

- tf-idf (term frequency-inverse document frequency): the frequency of a word in a document (how well the word represents the document) related to the number of documents containing that word (the higher the number of documents with that word, the lower the word’s discrimination value).
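One common variant of the tf-idf weight multiplies the word count in a document by log(N / df), where N is the number of documents and df is the number of documents containing the word. The sketch below assumes this variant (others exist) and uses hypothetical, already tokenized documents.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dictionary per document, with weight = tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs at least once.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({word: tf[word] * math.log(n_docs / df[word]) for word in tf})
    return weights

docs = [["cold", "weather"], ["nice", "weather"], ["cold", "cold", "wind"]]
weights = tf_idf(docs)
print(weights[2])
```

In the third document, "wind" occurs once but in no other document, so it receives a higher weight than "cold", which occurs twice there but also appears elsewhere. A word present in every document would get the weight 0.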



● Methods of searching, similarity

The general task is to find the similarity between an unlabeled document and a labeled one. It can be used, for example, for classification: interesting/uninteresting, and so on.

Unsupervised learning (clustering): learning without a teacher. Supervised learning: learning with a teacher.

Semi-supervised learning: a small number of labeled samples significantly improves clustering.




Supervised learning:

- k-NN (k-nearest neighbors);

- generation of decision trees;

- disjunctive normal form (generating rules);

- support vector machines;

- naïve Bayes classifier (using conditional probability);

- etc. (there are many more possibilities).
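As an illustration of the first method in the list, a k-NN classifier can be sketched on top of a similarity measure between bag-of-words vectors. Cosine similarity is assumed here as the measure (the lecture does not prescribe one), and the labeled sentences, the value k = 3, and the function names are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two word-count vectors stored as Counters."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_classify(text, labeled, k=3):
    """Label a text by majority vote among its k most similar labeled documents."""
    query = Counter(text.lower().split())
    ranked = sorted(
        labeled,
        key=lambda example: cosine(query, Counter(example[0].lower().split())),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled documents.
labeled = [
    ("nice sunny weather", "+"),
    ("warm sunny day", "+"),
    ("cold rainy day", "-"),
    ("cold windy weather", "-"),
    ("very cold", "-"),
]
print(knn_classify("cold rainy weather", labeled))
```

k-NN has no real training phase: the labeled documents themselves are the model, and all the work happens when an unknown document is compared with them.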


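Training texts (Czech, with English glosses) and their classes:

  je pěkné počasí      (the weather is nice)     +
  je chladno           (it is cold)              -
  není velmi chladno   (it is not very cold)     +
  není pěkné           (it is not nice)          -
  velmi chladno        (very cold)               -
  chladno              (cold)                    -

[Figure: diagram with the nodes w1, w2, w3, … and the class cj.]

A document to classify: “to není pěkné chladno” (literally “it is not nice cold”): + or - ?

+ texts: 6 words in total; - texts: 7 words in total; number of unique words: 6.

The sorted dictionary and the word statistics:

                      w1        w2     w3     w4      w5       w6
  word                chladno   je     není   pěkné   počasí   velmi
  frequency wi in +   1         1      1      1       1        1
  frequency wi in -   3         1      1      1       0        1
  p(wi | +)           1/6       1/6    1/6    1/6     1/6      1/6
  p(wi | -)           3/7       1/7    1/7    1/7     0/7      1/7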

After creating the dictionary of unique words (here 6), computing the a priori class probabilities (2 texts are + and 4 texts are - out of 6), computing the conditional probabilities of the words given + and -, and normalizing, we obtain the result:

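$$ p_{NBK} = p(\text{'není'}, \text{'pěkné'}, \text{'chladno'} \mid +/-) = p(\text{'není'} \mid +/-) \times p(\text{'pěkné'} \mid +/-) \times p(\text{'chladno'} \mid +/-) $$

“w3 w4 w1” = “není pěkné chladno”

$$ P_{+} = p(+)\, p(w_3 = \text{'není'} \mid +)\, p(w_4 = \text{'pěkné'} \mid +)\, p(w_1 = \text{'chladno'} \mid +) = \frac{2}{6} \times \frac{1}{6} \times \frac{1}{6} \times \frac{1}{6} \approx 0.00154 $$

$$ P_{-} = p(-)\, p(w_3 = \text{'není'} \mid -)\, p(w_4 = \text{'pěkné'} \mid -)\, p(w_1 = \text{'chladno'} \mid -) = \frac{4}{6} \times \frac{1}{7} \times \frac{1}{7} \times \frac{3}{7} \approx 0.00583 $$

$$ P^{n}_{+} = \frac{0.00154}{0.00154 + 0.00583} \approx 0.21, \qquad P^{n}_{-} = \frac{0.00583}{0.00154 + 0.00583} \approx 0.79 $$

$$ P^{n}_{-} > P^{n}_{+} \Rightarrow \text{negative} $$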
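The same computation can be reproduced with a short sketch. It uses the six training texts from the example and, like the example, applies no smoothing and skips words that are not in the dictionary (here “to”). The function names `train` and `classify` are chosen only for this illustration.

```python
from collections import Counter

training = [
    ("je pěkné počasí", "+"),
    ("je chladno", "-"),
    ("není velmi chladno", "+"),
    ("není pěkné", "-"),
    ("velmi chladno", "-"),
    ("chladno", "-"),
]

def train(examples):
    """Count texts per class and word occurrences per class."""
    priors = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in priors}
    for text, label in examples:
        word_counts[label].update(text.split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return priors, word_counts, vocabulary

def classify(text, priors, word_counts, vocabulary):
    """Return normalized class scores p(class) * product of p(word | class)."""
    n_texts = sum(priors.values())
    scores = {}
    for label in priors:
        total_words = sum(word_counts[label].values())
        score = priors[label] / n_texts
        for word in text.split():
            if word in vocabulary:  # words outside the dictionary are skipped
                score *= word_counts[label][word] / total_words
        scores[label] = score
    norm = sum(scores.values()) or 1.0  # avoid division by zero if every score is 0
    return {label: score / norm for label, score in scores.items()}

priors, word_counts, vocabulary = train(training)
posterior = classify("to není pěkné chladno", priors, word_counts, vocabulary)
print({label: round(p, 2) for label, p in posterior.items()})
```

Without smoothing, a single word with zero frequency in a class (such as “počasí” in the - texts) would set that class score to 0; in practice this is often avoided by adding a small constant to all counts.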


● Application areas

Many applications exist in areas where massive amounts of electronic text data are available. Typical examples are browsing the Internet and filtering email spam. Contemporary application areas include, for example:

- grouping similar blog posts;

- determining subjectivity in text;

- identifying opinions, feelings, moods, attitudes, and meanings in text;

- detecting plagiarism in text;

- analyzing opinions;

- business intelligence (legal commercial “espionage”);

and so on.

