Big Data at BITEM Research Group
• (Text|Web) Mining Research Group
  – [email protected], http://bitem.hesge.ch
• Research projects: Digital Libraries, Web, Personalized medicine, Patent analytics, Consumer Analytics, Pharmacovigilance, Clinical trials…
• Specialised in (semi|un)structured data
  – We like text, text and more text
  – Especially on the noisy/dirty Web
• Technological expertise: CouchDB replication, SolrCloud (distributed indexing and search), indexing/searching in SSD/HDFS/Hadoop, SPARQL endpoints…
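As an illustration of the CouchDB replication item above, the following minimal sketch builds (but does not send) a request to CouchDB's `/_replicate` endpoint. The server URL, database name, and target host are hypothetical placeholders, not values from the slides, and the group's actual replication setup is not described in the deck.

```python
import json
from urllib.request import Request


def build_replication_request(server, source, target, continuous=True):
    """Build (without sending) a POST request for CouchDB's /_replicate endpoint."""
    body = {"source": source, "target": target, "continuous": continuous}
    return Request(
        url=f"{server.rstrip('/')}/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# All names below are hypothetical placeholders.
req = build_replication_request(
    "http://localhost:5984",               # assumed local CouchDB node
    "tweets",                              # assumed source database
    "http://replica.example:5984/tweets",  # assumed remote target
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sending the request (for example with `urllib.request.urlopen(req)`) would additionally require a running CouchDB instance and credentials, which are left out of this sketch.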
Pharmacovigilance on Big Social Media Data
Dynamic and Real Time Data Analysis
[Figure: architecture diagram. Labels: Drugbank, Twitter API, RSS, Forum, Web Sources, Cleaning/Normalisation, CouchDB (×4), NoSQL Replication, Solr Cloud, Trends Analysis, Correlation Analysis, Novelty Detection]
• 19’000 drug names checked every 10 min
• 26’000 per day
• 7 M docs in 9 months
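A quick sanity check of the volumes quoted on this slide. This sketch assumes that "26’000 per day" refers to collected documents and that a month is 30 days; neither assumption is stated in the slides.

```python
DOCS_PER_DAY = 26_000        # assumed to be documents collected per day
DAYS = 9 * 30                # nine months, assuming 30-day months
DRUG_NAMES = 19_000          # drug names checked per polling cycle
POLL_SECONDS = 10 * 60       # one cycle every 10 minutes

total_docs = DOCS_PER_DAY * DAYS
names_per_second = DRUG_NAMES / POLL_SECONDS

print(f"documents after 9 months: {total_docs:,}")
print(f"average drug-name checks per second: {names_per_second:.1f}")
```

Under these assumptions the daily rate is consistent with the roughly 7 M documents reported over nine months.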
Managing the data deluge for protein annotation
• 23’000’000 articles: protein annotation based on literature, by curators
• GOA: annotated articles
• Manual annotation planned for 2045! (Baumgartner et al)
• 40’000 concepts [large-scale multiclass, multilabel classifier]: lazy learning!
• Machine learning based on information retrieval methods
  – Assisting curators
  – Macro reading of literature
  – Profiling any textual content
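The slide names only "lazy learning" and "information retrieval methods". One common way to combine the two is a k-nearest-neighbour classifier: retrieve the annotated articles most similar to a new text and let them vote on labels. The sketch below is an assumption about what such a classifier could look like, not the group's actual system; the TF-IDF weighting, the parameters `k` and `threshold`, and the toy documents and labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vecs = [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]
    return vecs, idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_multilabel(query, docs, labels, k=2, threshold=0.5):
    """Lazy learner: no training step, all work happens at query time."""
    vecs, idf = tfidf_vectors(docs)
    # Terms unseen in the annotated collection get weight 0.
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    ranked = sorted(((cosine(q, v), i) for i, v in enumerate(vecs)), reverse=True)[:k]
    total = sum(sim for sim, _ in ranked)
    if total == 0:
        return []
    scores = defaultdict(float)
    for sim, i in ranked:
        for label in labels[i]:
            scores[label] += sim  # similarity-weighted vote
    return sorted(l for l, s in scores.items() if s / total >= threshold)


# Toy annotated collection (hypothetical, for illustration only).
docs = [
    "kinase phosphorylates substrate protein",
    "protein kinase activity in cell signalling",
    "membrane transport of glucose",
    "glucose transporter in plasma membrane",
]
labels = [
    {"kinase activity"},
    {"kinase activity", "signal transduction"},
    {"transport"},
    {"transport", "membrane"},
]
result = knn_multilabel("novel kinase phosphorylates a signalling protein", docs, labels)
print(result)
```

Because nothing is fitted in advance, adding newly curated articles only requires appending them to the collection, which is one reason a lazy approach can suit a label space of tens of thousands of concepts.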
Patent retrieval
Stage                     The real situation (0.5-1 TB)    Experiments
Database                  13 million patents               A sample of 1 million patents
Extraction                33 days                          2.5 days
XML patents               0.221 TB                         17 GB
Normalization             33 days                          2.5 days
XML patents + metadata    0.234 TB                         18 GB
Indexing                  5 days                           10 hours
Index                     0.1 TB                           3 GB
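The figures for the full collection are largely consistent with scaling the 1-million-patent sample linearly by a factor of 13. The sketch below reproduces that arithmetic; whether the authors obtained the full-scale numbers by extrapolation or by measurement is not stated in the slides, so the linear model is an assumption.

```python
SCALE = 13_000_000 / 1_000_000  # full collection size / sample size

# Sample measurements taken from the "Experiments" column.
sample = {
    "extraction_days": 2.5,
    "xml_gb": 17,
    "normalization_days": 2.5,
    "xml_meta_gb": 18,
    "indexing_hours": 10,
    "index_gb": 3,
}

# Assumed linear scaling with the number of patents.
projected = {name: value * SCALE for name, value in sample.items()}

for name, value in projected.items():
    print(f"{name}: {value:g}")
print(f"indexing_days: {projected['indexing_hours'] / 24:.1f}")
```

Extraction, normalization, XML size, and indexing time match the reported full-scale values after rounding. The index size does not: linear scaling gives about 39 GB, whereas the slide reports 0.1 TB.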