
Efficient Practices for Large Scale Text Mining Process


Page 1: Efficient Practices for Large Scale Text Mining Process

March 2, 2017

Ivelina Nikolova

Senior NLP Engineer

Efficient Practices for Large Scale Text Mining Process

Page 2: Efficient Practices for Large Scale Text Mining Process


In this webinar you will learn …

• Industry applications that maximize Return on Investment (ROI) of your text mining process

• To describe your text mining problem

• To define the output of the text mining

• To select the appropriate text analysis techniques

• To plan the prerequisites for a successful text mining solution

• DOs and DON’Ts in setting up a text mining process.

Page 3: Efficient Practices for Large Scale Text Mining Process


Outline

• Business need for text mining solutions

• Introduction to NLP and information extraction

• How to tailor your text analysis process

• Applications and demonstrations


Page 4: Efficient Practices for Large Scale Text Mining Process

Analyzing text to capture data from it supports:
• increased user engagement via content recommendations,
• a shortened research cycle via semantic search,
• regulatory compliance via smart indexing,
• better content management, etc.

Business needs for text mining solutions


Page 5: Efficient Practices for Large Scale Text Mining Process


Some of our customers


Page 6: Efficient Practices for Large Scale Text Mining Process

• Parsing texts in order to extract machine-readable facts from them.

• Creating sets of structured or semi-structured data out of heaps of unstructured heterogeneous documents.

• Relies on natural language processing techniques like:
  – automatic morphological analysis,
  – automated syntax analysis,
  – term weights and co-occurrence,
  – lexical semantics,
  and more complex tasks like:
  – named entity recognition,
  – relation extraction, etc.

Text analysis

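To make the NLP building blocks listed above concrete, here is a minimal illustrative sketch using the open-source spaCy library; the model name en_core_web_sm and the sample sentence are assumptions for the example, not part of the original webinar.

```python
# Minimal sketch of the NLP building blocks listed above, using spaCy.
# Assumes the small English model is installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Ontotext opened a new office in Sofia, Bulgaria in 2016.")

# Morphological and syntactic analysis: lemma, part of speech, dependency label
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)

# Named entity recognition: surface string, entity type and character offsets
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```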

Page 7: Efficient Practices for Large Scale Text Mining Process


• Inextricably tied to text analysis

• Links mentions in the text to knowledge base concepts

• Automatic, manual and semi-automatic

Semantic annotation/enrichment


Page 8: Efficient Practices for Large Scale Text Mining Process

• Named Entity Recognition
  – 60% F1 [OKE-challenge@ESWC2015]
  – 82.9% F1 [Leaman and Lu, 2016] in the biomedical domain
  – above 90% F1 for more specific tasks

State of the art


Page 9: Efficient Practices for Large Scale Text Mining Process

Designing the text mining process

• Know your business problem

• Know your data

• Find appropriate samples

• Use common formats, or formats which can easily be transformed into common ones

• Get together domain experts, technical staff, NLP engineers and potential users

• Narrow the business problem down to an information extraction task

• Clearly define the annotation types

• Clearly define the annotation guidelines

• Apply the appropriate algorithm for IE

• Do iterations of evaluation and improvement

• Ensure continuous adaptation by curation and re-training

Page 10: Efficient Practices for Large Scale Text Mining Process


Page 11: Efficient Practices for Large Scale Text Mining Process

Clear problem definition

• Define clearly your business problem:
  – specific smart search
  – content recommendation
  – content enrichment
  – content aggregation, etc.

  E.g. the system must do <A, B, C>

• Define clearly the text analysis problem:
  – Reduce the business problem to an information extraction problem

  Business problem: faceted search by Persons, Organizations, Locations
  Information extraction problem: extract mentions of Persons, Organizations and Locations and link them to the corresponding concepts in the knowledge base

Page 12: Efficient Practices for Large Scale Text Mining Process

• Annotations – abstract descriptions of the mentions of concepts of interest

  Named entities: Person, Location, Organization; Disease, Symptom, Chemical; SpaceObject, SpaceCraft

  Relations: PersonHasRoleInOrganisation, Causation

Define the annotation types I

Page 13: Efficient Practices for Large Scale Text Mining Process

• Annotation types
  – Person, Organization, Location
  – Person, Organization, City
  – Person, Organization, City, Country

• Annotation features

  Location: string, geonames instance, latitude, longitude

Define the annotation types II

Page 14: Efficient Practices for Large Scale Text Mining Process

• Annotation types
  – Person, Organization, Location
  – Person, Organization, City
  – Person, Organization, City, Country

• Annotation features

  Location: string, geonames instance, latitude, longitude
  Chemical: string, InChI, SMILES, CAS
  PersonHasRoleInOrganization: person instance, role instance, organization instance, timestamp

Define the annotation types II

Example annotation:
  string: the Gulf of Mexico
  startOffset: 71
  endOffset: 89
  type: Location
  inst: http://ontology.ontotext.com/resource/tsk7b61yf5ds
  links: [http://sws.geonames.org/3523271/, http://dbpedia.org/resource/Gulf_of_Mexico]
  latitude: 25.368611
  longitude: -90.390556
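As an illustrative sketch (not from the slides), the Location annotation above could be represented in code roughly as follows; the field names simply mirror the example and are not a fixed Ontotext schema.

```python
# Sketch of the Location annotation shown above as a Python data class.
# Field names mirror the example; they are illustrative, not a fixed schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Annotation:
    string: str                 # surface form as it appears in the text
    start_offset: int           # character offset where the mention starts
    end_offset: int             # character offset where the mention ends
    type: str                   # annotation type, e.g. "Location"
    inst: str                   # URI of the concept in the knowledge base
    links: List[str] = field(default_factory=list)  # external identifiers
    latitude: Optional[float] = None
    longitude: Optional[float] = None

gulf = Annotation(
    string="the Gulf of Mexico",
    start_offset=71,
    end_offset=89,
    type="Location",
    inst="http://ontology.ontotext.com/resource/tsk7b61yf5ds",
    links=["http://sws.geonames.org/3523271/",
           "http://dbpedia.org/resource/Gulf_of_Mexico"],
    latitude=25.368611,
    longitude=-90.390556,
)
```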

Page 15: Efficient Practices for Large Scale Text Mining Process

Locations mentioned in Holocaust documents

Page 16: Efficient Practices for Large Scale Text Mining Process

• Realistic

• Demonstrating the desired output

• Positive and negative
  – “It therefore increases insulin secretion and reduces POS[glucose] levels, especially postprandially.”
  – “It acts by increasing POS[NEG[glucose]-induced insulin] release and by reducing glucagon secretion postprandially.”

• Representative and balanced set of the types of problems

• In an appropriate/commonly used format – XML, HTML, TXT, CSV, DOC, PDF.

Provide examples

Page 17: Efficient Practices for Large Scale Text Mining Process


Domain model and knowledge

• Domain model/ontology - describes the types of objects in the problem area and the relations between them


Page 18: Efficient Practices for Large Scale Text Mining Process

• Data sources – proprietary data, public data, professional data

• Data cleanup

• Data formats

• Data stores
  – For metadata – GraphDB (http://ontotext.com/graphdb/)
  – For content – MongoDB, MarkLogic, etc.

• Data modeling is an inevitable part of the process of semantic data enrichment
  – Start it as early as possible
  – Keep to the common data formats
  – Mistakes and underestimations are expensive because they influence the whole process of developing a text mining solution

Data
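As a hedged illustration of the metadata store mentioned above, the sketch below reads extracted metadata back from a GraphDB repository over SPARQL using the SPARQLWrapper library; the endpoint URL, repository name ("docs") and the mentions predicate are placeholders for whatever data model you define, not a real deployment.

```python
# Sketch: querying extracted metadata from a GraphDB repository via SPARQL.
# The endpoint, repository name and predicate below are illustrative only.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:7200/repositories/docs")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?doc ?label WHERE {
        ?doc <http://example.org/mentionsLocation> ?loc .
        ?loc rdfs:label ?label .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["doc"]["value"], row["label"]["value"])
```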

Page 19: Efficient Practices for Large Scale Text Mining Process

• Gold standard – annotated data of superior quality

• Annotation guidelines – used as guidance for manually annotating the documents.

  POS[London] universities = universities located in London
  NEG[London] City Council
  NEG[London] Mayor

• Manual annotation tools – intuitive UI, visualization features, export formats
  – MANT – Ontotext's in-house tool
  – GATE – http://gate.ac.uk/ and https://gate.ac.uk/teamware/
  – brat – http://brat.nlplab.org/

• Annotation approach
  – Manual vs. semi-automatic
  – Domain experts vs. crowd annotation
  – E.g. Mechanical Turk – https://www.mturk.com/

• Inter-annotator agreement

• Train:Test ratio – 60:40, 70:30

Gold standard
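The two measurable points above, inter-annotator agreement and the train:test split, can be sketched with scikit-learn; the toy labels and document names below are made up for illustration.

```python
# Sketch: inter-annotator agreement (Cohen's kappa) and a 70:30 corpus split.
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# Labels two annotators assigned to the same mentions (toy data)
annotator_a = ["Person", "Location", "Organization", "Location", "Person"]
annotator_b = ["Person", "Location", "Person", "Location", "Person"]
print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))

# 70:30 split of the gold-standard documents into training and test sets
documents = [f"doc_{i}" for i in range(100)]
train_docs, test_docs = train_test_split(documents, test_size=0.3, random_state=42)
print(len(train_docs), "training /", len(test_docs), "test documents")
```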

Page 20: Efficient Practices for Large Scale Text Mining Process

• Rule-based approach
  – lower number of clear patterns which do not change over time, or change only slightly
  – high precision
  – appropriate for domains where it is important to know how the decision for extracting a given annotation is taken – e.g. the biomedical domain

• Machine learning approach
  – higher number of patterns which do change over time
  – requires annotated data
  – allows for retraining over time

• Neural network approach
  – Deep neural networks – getting closer to AI
  – Recent advances promise true natural language understanding via complex neural networks
  – Great results in speech recognition, image recognition and machine translation; a breakthrough is expected in NLP
  – Still unclear why and how it works, and thus difficult to optimize

Text analysis approach
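To make the rule-based option above tangible, here is a toy sketch that combines a small gazetteer with a regular-expression pattern; the names, the pattern and the relation label are invented for illustration and are not Ontotext rules.

```python
# Toy rule-based extractor: a gazetteer lookup plus a regex relation pattern,
# illustrating the "few clear patterns, high precision" trade-off above.
import re

CITIES = {"London", "Sofia", "New York"}  # tiny illustrative gazetteer
ROLE_PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+), (?P<role>CEO|CTO|chairman) of (?P<org>[A-Z][\w&. ]+)"
)

text = "Jane Smith, CEO of Acme Corp, opened the London office."

# Gazetteer lookup: exact matches of known city names, with offsets
for city in CITIES:
    for m in re.finditer(re.escape(city), text):
        print("Location:", city, m.start(), m.end())

# Pattern-based relation extraction: PersonHasRoleInOrganization
for m in ROLE_PATTERN.finditer(text):
    print("PersonHasRoleInOrganization:", m.group("person"), m.group("role"), m.group("org"))
```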

Page 21: Efficient Practices for Large Scale Text Mining Process


• Preprocessing

• Keyphrase extraction

• Gazetteer-based enrichment

• Named entity recognition and disambiguation

• Generic entity extraction

• Result consolidation

• Relation extraction

NER Pipeline

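The steps above can be thought of as a sequence of functions applied to a document. The sketch below is an assumed, simplified structure for illustration (not Ontotext's actual pipeline), showing preprocessing, gazetteer-based enrichment and result consolidation.

```python
# Simplified sketch of an NER pipeline as a sequence of steps over a document
# dict. This is an assumed structure for illustration only.

def preprocess(doc):
    doc["tokens"] = doc["text"].split()  # placeholder tokenizer
    return doc

def gazetteer_enrichment(doc):
    gazetteer = {"London": "Location", "Ontotext": "Organization"}  # toy gazetteer
    doc["annotations"] = [
        {"string": tok, "type": gazetteer[tok]}
        for tok in doc["tokens"] if tok in gazetteer
    ]
    return doc

def consolidate(doc):
    # Deduplicate annotations that share the same surface string and type
    unique = {(a["string"], a["type"]): a for a in doc["annotations"]}
    doc["annotations"] = list(unique.values())
    return doc

PIPELINE = [preprocess, gazetteer_enrichment, consolidate]

doc = {"text": "Ontotext opened an office in London near London Bridge"}
for step in PIPELINE:
    doc = step(doc)
print(doc["annotations"])
```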

Page 22: Efficient Practices for Large Scale Text Mining Process

NER pipeline

Page 23: Efficient Practices for Large Scale Text Mining Process

NER pipeline

Page 24: Efficient Practices for Large Scale Text Mining Process

NER pipeline

Page 25: Efficient Practices for Large Scale Text Mining Process

• Curation of results – domain experts manually assess the work of the text analysis components

• Testing interfaces

• Feedback
  – Select a representative set of documents to evaluate manually
  – Provide as full a description of the results and the component used as possible: <pipeline version> <input as sent for processing> <description of the wrong behavior> <description of the correct behavior>

• The earlier this happens, the sooner it triggers a revision of the models and an improvement of the annotation

Results curation / Error analysis
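Purely as an illustration of the feedback format described above, a curation record could be captured as a small structure like the one below; the field names and example values are hypothetical.

```python
# Hypothetical feedback record mirroring the fields listed above.
from dataclasses import dataclass

@dataclass
class FeedbackItem:
    pipeline_version: str    # version of the pipeline that produced the result
    input_text: str          # input exactly as it was sent for processing
    wrong_behavior: str      # description of the incorrect output
    correct_behavior: str    # description of the expected output

item = FeedbackItem(
    pipeline_version="ner-pipeline-1.4.2",
    input_text="He sailed across the Gulf of Mexico in 2016.",
    wrong_behavior="'Gulf' annotated as Organization",
    correct_behavior="'the Gulf of Mexico' annotated as Location",
)
print(item)
```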

Page 26: Efficient Practices for Large Scale Text Mining Process

• Gold standard split train:test
  – 70:30
  – 80:20

• Which task you want to evaluate
  – e.g. extraction at document level or inline annotation

• Evaluation metrics
  – Information extraction tasks – precision, recall, F-measure
  – Recommendations – A/B testing

Evaluation of the results
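A minimal sketch of the information extraction metrics named above, comparing predicted annotations against the gold standard; the toy spans are invented for illustration.

```python
# Sketch: precision, recall and F-measure for extracted annotations,
# where each annotation is identified by (document, start, end, type).
def precision_recall_f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)                          # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {("doc1", 71, 89, "Location"), ("doc1", 0, 10, "Person")}
predicted = {("doc1", 71, 89, "Location"), ("doc1", 20, 28, "Organization")}
print(precision_recall_f1(predicted, gold))  # (0.5, 0.5, 0.5)
```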

Page 27: Efficient Practices for Large Scale Text Mining Process


Continuous adaptation


Page 28: Efficient Practices for Large Scale Text Mining Process

• Document categorization
  – post, political news, sport news, etc.

• Topic extraction
  – important words and phrases in the text

• Named entity recognition
  – People, Organization, Location, Time, Amounts of money, etc.

• Keyterm assignment from predefined hierarchies

• Concept extraction
  – entities from a knowledge base

• Relation extraction
  – relations between types of entities

Types of extracted information

Page 29: Efficient Practices for Large Scale Text Mining Process


• TAG (http://tag.ontotext.com)

• NOW (http://now.ontotext.com)

• Patient Insights (http://patient.ontotext.com/) - contact [email protected] for credentials.

Applications


Page 30: Efficient Practices for Large Scale Text Mining Process

Take away messages – DOs

• A clearly defined business problem needs to be broken down into a clearly defined information extraction problem

• Requires combined efforts from business decision makers, domain experts, natural language processing experts and technical staff

• Data modeling is an inevitable part of the process; consider it as early as possible

• Create clear annotation guidelines based on real-world examples

• Start with an initial small set of balanced and representative documents

• Plan the evaluation of the results in advance

• Choose an appropriate manual annotation tool

• While annotating content, check how the quantity influences the performance

• Select the appropriate text analysis approach

• Plan iterations of curation by domain experts followed by revision of the text analysis approach

• Plan the aspects of continuous adaptation – document quantity, timing, temporality of the information fed into the model

Page 31: Efficient Practices for Large Scale Text Mining Process

Take away messages – DON'Ts

Most common mistakes are caused by under- or overestimation of some phases in the text mining process:

• Underestimated effort for the training corpus – this may lead to a longer phase of determining the correct algorithms and training models.

• Underestimated effort for the evaluation corpus – this may lead to a solution which cannot be practically evaluated and thus formally delivered/released.

• Overestimating the value of your own data in the text mining process – if you spend too much effort building your own vocabularies, you will most probably end up with the same text mining solution as if you had bought professionally prepared data.

• Underestimating the data ETL before starting a text mining solution – this may lead to a delay in the text mining solution, caused by a delayed training cycle.

• Overexpectations from dynamic data updates – it often turns out that when the solution is ready, it is more important to have a good process for dynamic update of the data than to have the updates instantly available.

• Intolerance towards extraction speed – this may lead to a faster solution which offers lower quality results. If the speed is not crucial, tolerate it.

• No readiness to implement changes in the workflow and collected data. A good automated solution is not the one that completely replaces the manual workflow but the one that brings higher value to your business. Be ready to slightly change your workflow, start collecting some new data and aim for an automated solution which is focused on new benefits.

Page 32: Efficient Practices for Large Scale Text Mining Process


Thank you very much for your attention!

You are welcome to try our demos at http://ontotext.com
