Big Data at BITEM Research Group

• (Text|Web) Mining Research Group
  – patrick.ruch@hesge.ch, http://bitem.hesge.ch

• Research projects: Digital Libraries, Web, Personalized medicine, Patent analytics, Consumer Analytics, Pharmacovigilance, Clinical trials…

• Specialised in (semi|un)structured data
  – We like text, text and more text
  – Especially on the noisy/dirty Web

• Technological expertise: CouchDB replication, SolrCloud (distributed indexing and search), indexing/searching in SSD/HDFS/Hadoop, SPARQL endpoints…

Pharmacovigilance on Big Social Media Data

Dynamic and Real Time Data Analysis

[Diagram labels: Drugbank; Web Sources (Twitter API, RSS, Forum); Cleaning, Normalisation; CouchDB (×4) with NoSQL Replication; Solr Cloud; Trends Analysis; Correlation Analysis; Novelty Detection]

• 19’000 drug names checked every 10 min
• 26’000 per day
• 7 M docs in 9 months
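
As a rough sanity check on the volumes quoted above, the following sketch derives daily rates from the slide's figures. It assumes 30-day months and one check per drug name per 10-minute cycle; neither assumption is stated on the slide.

```python
# Back-of-the-envelope rates from the slide's figures.
# Assumptions (not in the slide): 30-day months, one check per name per cycle.
DRUG_NAMES = 19_000
CHECK_INTERVAL_MIN = 10
TOTAL_DOCS = 7_000_000
MONTHS = 9
DAYS_PER_MONTH = 30  # assumed

cycles_per_day = 24 * 60 // CHECK_INTERVAL_MIN
name_checks_per_day = DRUG_NAMES * cycles_per_day
docs_per_day = TOTAL_DOCS / (MONTHS * DAYS_PER_MONTH)

print(f"polling cycles per day: {cycles_per_day}")
print(f"drug-name checks per day: {name_checks_per_day:,}")
print(f"average documents per day: {docs_per_day:,.0f}")
```

Under these assumptions the average lands close to the 26’000 per day shown on the slide, which suggests (but does not confirm) that the figure counts collected documents.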

Managing the data deluge for proteins annotation

• 23 000 000 articles
  – Proteins annotation based on literature by curators
  – Annotated articles: GOA
  – Manual annotation planned for 2045! (Baumgartner et al)

• Machine Learning based on Information Retrieval methods
  – 40’000 concepts [Big-scale Multiclass Multilabel Classifier]
  – Lazy learning!

• Assisting curators

• Macro reading of literature

• Profiling any textual content
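
The slide names lazy learning and Information Retrieval methods without giving details. One common reading is a k-nearest-neighbour classifier: retrieve the annotated articles most similar to a new text and reuse their labels. The sketch below illustrates that idea with TF-IDF and cosine similarity; it is not necessarily the group's classifier, and the toy corpus, the labels and the function names are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy annotated corpus (hypothetical texts and labels, for illustration only).
TRAIN = [
    ("kinase phosphorylates substrate protein", {"kinase activity"}),
    ("protein kinase signalling in the nucleus", {"kinase activity", "nucleus"}),
    ("transcription factor binds dna in the nucleus", {"dna binding", "nucleus"}),
    ("membrane transporter moves ions", {"transport"}),
]

def tokenize(text):
    return text.lower().split()

def build_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

def vectorize(text, idf):
    """L2-normalised TF-IDF vector; terms unknown to the corpus are dropped."""
    tf = Counter(tokenize(text))
    vec = {t: f * idf[t] for t, f in tf.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def cosine(a, b):
    # Both vectors are already normalised, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def predict(text, k=2, top=2):
    """Rank labels by summed similarity over the k nearest annotated texts."""
    idf = build_idf([doc for doc, _ in TRAIN])
    query = vectorize(text, idf)
    neighbours = sorted(
        ((cosine(query, vectorize(doc, idf)), labels) for doc, labels in TRAIN),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for similarity, labels in neighbours:
        for label in labels:
            votes[label] += similarity
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, score in ranked if score > 0][:top]

print(predict("a new kinase in the nucleus"))
```

No model is trained ahead of time: all the work happens at query time, which is what makes the approach "lazy" and lets newly annotated articles be used as soon as they are indexed.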

Patent retrieval

Step                      The real situation (0.5-1 TB)   Experiments
Database                  13 million patents              A sample of 1 million patents
Extraction                33 days                         2.5 days
XML patents               0.221 TB                        17 GB
Normalization             33 days                         2.5 days
XML patents + metadata    0.234 TB                        18 GB
Indexing                  5 days                          10 hours
Index                     0.1 TB                          3 GB
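
The two sets of figures above can be compared with a simple projection. The sketch below scales the 1-million-patent sample up to 13 million patents, assuming cost and size grow linearly with the number of patents; that assumption is not stated on the slide, and the dictionary keys are illustrative names.

```python
# Figures measured on the 1-million-patent sample, as given on the slide.
SAMPLE_PATENTS = 1_000_000
FULL_PATENTS = 13_000_000
sample = {
    "extraction_days": 2.5,
    "normalization_days": 2.5,
    "indexing_days": 10 / 24,  # 10 hours
    "xml_gb": 17,
    "xml_metadata_gb": 18,
    "index_gb": 3,
}

# Assumption (not stated on the slide): everything scales linearly with patent count.
scale = FULL_PATENTS / SAMPLE_PATENTS
projected = {name: value * scale for name, value in sample.items()}

for name, value in projected.items():
    print(f"{name}: {value:.1f}")
```

Linear scaling gives roughly 33 days for extraction and for normalization, about 5 days for indexing, and 221 GB and 234 GB for the XML data, in line with the full-collection figures. It gives only 39 GB for the index, where the slide reports 0.1 TB, so not every full-scale figure follows from this simple projection.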