Paul H. Cleverley

Robert Gordon University, Aberdeen, Scotland, UK.

GSA Annual Conference 24th October 2017; Seattle, USA.

Applying Text and Data Mining to Geological Articles: Towards Cognitive Computing Assistants


Background - Typical uses

Spatializing entities/concepts and associations, e.g. ‘mentions’ of Pre-Cambrian

Extracting integer and float data from unstructured text, e.g. a ppm value associated with a chemical element

For example, GeoDeepDive-supported papers (Peters et al. 2015; Liu et al. 2016; Yulaeva et al. 2017): stromatolite relationship to dolomite; link between cobalt and supercontinent assembly; extracting hydrogeological data.
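As an illustration of pulling numeric values out of free text, here is a minimal Python sketch. It is not the method used in the cited papers. The element list, the regular expressions and the "nearest element mention" rule are all assumptions made for the example.

```python
import re

# Hypothetical element names; a real lexicon would be much longer.
ELEMENTS = ["gold", "copper", "arsenic", "cobalt"]

# An integer or float followed by "ppm", e.g. "12 ppm" or "0.35ppm".
PPM_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*ppm\b", re.IGNORECASE)
ELEMENT_PATTERN = re.compile(r"\b(" + "|".join(ELEMENTS) + r")\b", re.IGNORECASE)


def extract_ppm(sentence):
    """Pair each ppm value with the nearest element mention in the sentence."""
    elements = [(m.start(), m.group(1).lower()) for m in ELEMENT_PATTERN.finditer(sentence)]
    results = []
    for m in PPM_PATTERN.finditer(sentence):
        if not elements:
            continue
        # Nearest element by character distance: a crude association rule (assumption).
        _, nearest = min(elements, key=lambda e: abs(e[0] - m.start()))
        results.append((nearest, float(m.group(1))))
    return results


text = "Samples returned 12 ppm cobalt and up to 0.35 ppm gold."
print(extract_ppm(text))
```

A distance rule like this misfires on long or list-like sentences, so a production pipeline would more likely rely on sentence parsing or curated patterns.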

Cleverley (2017)

But what else can we do? Examples using Python...

Learning by comparison: Discriminatory Search Term Word Associations

100,000+ Society of Petroleum Engineers (SPE), American Geosciences Institute (AGI), Geological Society of London (GSL)

Primary search query = submarine fan

Comparing secondary search terms:
- Miocene
- Eocene

Cleverley, P.H., Burnett, S. (2014)
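One way to picture discriminatory word associations is to count which terms co-occur with each secondary term inside the primary query's results, then rank terms by how strongly they favour one side. The sketch below does this on a hypothetical five-document corpus. The smoothed log-ratio score is an assumption chosen for simplicity, not necessarily the statistic used in Cleverley and Burnett (2014).

```python
import math
from collections import Counter

# Hypothetical mini-corpus standing in for the 100,000+ society documents.
DOCS = [
    "miocene submarine fan channel sandstone turbidite",
    "miocene submarine fan channel levee sandstone",
    "eocene submarine fan chalk turbidite",
    "eocene submarine fan injectite sandstone",
    "eocene delta coal",
]


def cooccurring_terms(docs, primary, secondary):
    """Count terms in documents containing both the primary phrase and the secondary term."""
    counts = Counter()
    skip = set(primary.split()) | {secondary}
    for doc in docs:
        tokens = doc.split()
        if primary in doc and secondary in tokens:
            # Count each term once per document (document frequency).
            counts.update(t for t in set(tokens) if t not in skip)
    return counts


def discriminatory_terms(docs, primary, term_a, term_b):
    """Rank terms by a smoothed log-ratio; positive scores favour term_a."""
    a = cooccurring_terms(docs, primary, term_a)
    b = cooccurring_terms(docs, primary, term_b)
    scores = {t: math.log((a[t] + 1) / (b[t] + 1)) for t in set(a) | set(b)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranked = discriminatory_terms(DOCS, "submarine fan", "miocene", "eocene")
for term, score in ranked:
    print(f"{term:10s} {score:+.2f}")
```

Terms near the top are relatively more associated with Miocene submarine fans in this toy corpus, terms near the bottom with Eocene ones, and terms scoring close to zero do not discriminate.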

Stimulating Serendipity (Discriminatory Word Associations)

“…word associations highlighted new and unexpected terms... This surprising result led us to consider a new geological element which could impact our business opportunity”

Geologist, Oil & Gas Company (2015)

n=53

To what extent do current search interfaces in your organization facilitate serendipitous discovery?

42% - To a moderate/large extent

To what extent could word co-occurrence techniques facilitate serendipitous discovery?

75% - To a moderate/large extent

A Wilcoxon Signed Rank Test showed a statistically significant difference (p<0.05).
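For readers unfamiliar with the test, the following is a minimal pure-Python sketch of a Wilcoxon signed-rank test on paired ratings. The ratings are hypothetical and are not the survey data (n=53). The p-value uses the normal approximation without tie or continuity corrections, which is an assumption made to keep the example short.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Return (W, p) for paired samples, two-sided, via the normal approximation."""
    diffs = [b - a for a, b in zip(x, y) if b != a]  # zero differences are dropped
    n = len(diffs)
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))
    return w, p


# Hypothetical paired ratings on a 1-4 scale (1 = not at all, 4 = large extent).
current_interface = [1, 2, 2, 3, 1, 2, 3, 2, 1, 3]
word_cooccurrence = [3, 3, 2, 4, 3, 3, 3, 4, 2, 2]

w, p = wilcoxon_signed_rank(current_interface, word_cooccurrence)
print(f"W = {w}, p = {p:.4f}")
```

With samples this small an exact test is preferable. A statistics library would normally be used in place of a hand-rolled version.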


Cleverley, P.H., Burnett, S. (2014)

100,000+ Society of Petroleum Engineers (SPE), American Geosciences Institute (AGI), Geological Society of London (GSL). Some colour coding from NASA SWEET Ontology and others.

Question: Which is the most similar formation to the Kimmeridge Formation?

Word Vectors – very simple theory

Cleverley (2016) Digital Energy
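The basic idea behind word vectors can be sketched in a few lines of Python: represent each word by counts of the words around it, then compare two words by the cosine of the angle between their count vectors. The sentences below are hypothetical toy data, and the window size is an arbitrary assumption. The slide's own vectors come from a much larger corpus.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy sentences; real vectors would be built from a large corpus.
SENTENCES = [
    "kimmeridge formation organic rich marine shale source rock",
    "draupne formation organic rich marine shale source rock",
    "zebbag formation shallow marine carbonate platform",
    "brent formation deltaic sandstone reservoir",
]


def build_vectors(sentences, window=3):
    """Represent each word by counts of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def most_similar(vectors, word, candidates):
    """Return the candidate whose vector is closest to the vector for `word`."""
    return max(candidates, key=lambda c: cosine(vectors[word], vectors[c]))


vectors = build_vectors(SENTENCES)
print(most_similar(vectors, "kimmeridge", ["draupne", "zebbag", "brent"]))
```

Words that appear in similar contexts end up with similar vectors, which is what allows a "find similar" query over formation names.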


Find Similar

Similarity of entities

“I input the Zebbag Formation that I studied in Tunisia and it returned a lateral equivalent (in Libya) that I had not come across before.”

Geologist, Multi-National Oil and Gas Company (June 2016)

What are the analogues for xxx?

Cleverley (2016) Digital Energy

Adding more sophistication…
- Curation (lemmas, synonyms)
- NLP, e.g. ‘post Triassic’, ‘not porous’
- Word2Vec (Mikolov et al. 2013; Řehůřek 2014): using neural networks to generate richer and more complex representations of meaning in text (text embeddings).
- Using geoscience ontologies to enrich meaning and add logic for reasoning.
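The NLP point can be illustrated with a small preprocessing step: if ‘not porous’ is split into separate tokens, a word-vector model sees ‘porous’ and loses the negation. A minimal sketch, assuming a short hypothetical modifier list, joins each modifier to the word it qualifies before any vectors are built.

```python
import re

# Hypothetical modifier list; a curated geoscience lexicon would be far larger.
MODIFIERS = ("not", "non", "post", "pre")

# A modifier, then whitespace or a hyphen, then the word it qualifies.
MODIFIER_PATTERN = re.compile(
    r"\b(" + "|".join(MODIFIERS) + r")[\s-]+(\w+)", re.IGNORECASE
)


def merge_modifiers(text):
    """Join each modifier to the following word and return lower-cased tokens."""
    merged = MODIFIER_PATTERN.sub(lambda m: m.group(1) + "_" + m.group(2), text)
    return merged.lower().split()


print(merge_modifiers("The post Triassic sandstone is not porous"))
```

The merged tokens (for example `not_porous`) then receive their own vectors, so a negated property is no longer conflated with the property itself.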

More “related” to volcanics than limestone

More “related” to limestone than volcanics

Testing Hypotheses (word vector v word vector)

6,000+ Articles over 100 years of the Society of Economic Geologists (SEG) - (courtesy GeoScienceWorld)

Cleverley (2017)

R² = 0.2576

A weak correlation. Arid environments can lead to high pH (evaporation/desorption), which can lead to arsenic in groundwater. So the more arid the environment (less rainwater), the more likely arsenic may mobilize.

Word Vector (Arsenic)

NOAA Annual Rainfall (mm)

Testing Hypotheses (word vector v existing data)

Word Vector (US States)

6,000+ Articles over 100 years of the Society of Economic Geologists (SEG) - (courtesy GeoScienceWorld)

National Oceanic and Atmospheric Administration (NOAA) Environmental Data

Cleverley (2017)
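Comparing a word-vector score with existing data amounts to a correlation between two columns of numbers, one row per US state. The sketch below computes R² as the squared Pearson correlation, which equals the R² of a simple linear fit. All values are hypothetical placeholders and not the SEG or NOAA figures.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)


# Hypothetical per-state values: similarity of the state's word vector to the
# 'arsenic' vector, and annual rainfall in mm.
arsenic_similarity = [0.42, 0.35, 0.30, 0.18, 0.22]
annual_rainfall_mm = [240, 380, 900, 1300, 1100]

print(f"R^2 = {r_squared(arsenic_similarity, annual_rainfall_mm):.4f}")
```

A low R² such as the one reported on the slide means rainfall explains only a small share of the variation in the text-derived score, so the result is better read as a prompt for further investigation than as confirmation.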

Are all the conditions likely to be in place for …?

Labelled training data + skip-grams + geoscience ‘friendly’ lexicon

Using literature to help challenge individual cognitive biases and organizational dogma

Reports from United States Geological Survey (USGS) Petroleum Assessments

Cleverley (2017)

Summary – Areas for further research

• Opportunities may exist to increase the propensity of ‘general purpose’ enterprise search user interfaces to facilitate serendipity.

• Combining text analytics & machine learning to address a specific work task to provide actionable insights.

www.paulhcleverley.com