DESCRIPTION
Big data? According to IBM estimates, we generate 2.5 quintillion bytes of data every day. Every day, you read that correctly. Or look at it this way: 90% of the data available worldwide was created in the past two years. Mind-boggling. Gartner predicts that by 2015 at least 4.4 million jobs linked to big data and analytics will be created. Can we call this a trend? We think so. A trend you do not want to miss! Because you, too, can benefit from analysing your data. Covered in the workshop, with numerous examples: a bird's-eye view of the analytics process model; the different steps (data preprocessing, analytics and post-processing); and recent new applications such as process analytics, social media analytics and fraud analytics.
Advances in Data Mining and Big Data Analytics
Prof. dr. Bart Goethals
Advanced Database Research & Modelling
Department of Mathematics & Computer Science
Cevora - 19 November 2014
Big Data Analytics or …
• Statistics
• Data Mining
• Knowledge Discovery in Data
• Analytics
• Data Science
• …
Big Data is like teenage sex:
• everyone talks about it,
• nobody really knows how to do it,
• everyone thinks everyone else is doing it,
• so everyone claims they are doing it…
[Dan Ariely]
See also Data News Survey March 2014
The Goal of Big Data
The goal is the same: find useful patterns or models in data.
The emphasis changes: Volume, Velocity, Variety, V…
Big Data Volume
[source: EMC]
Is Big better?
• Yes! But, some fundamental principles: [U. Fayyad]
• Data gains value exponentially when integrated and coalesced. When fragmented: dramatic value loss.
• Fusing data together from disparate or independent sources is difficult and impossible to maintain.
• 80% of the effort of Data Mining goes to getting the right data together.
• Standardisation. Data governance and policy. Data privacy, encryption and masking. Data infrastructure.
• Data is a primary competency and not a side-activity.
Is Big a problem?
• Data can (not) be summarised (sampling)
• Too much information lost for reasonable sizes
• We need to find patterns that are useful and valid for all data
• Personalized Recommendation
• Personalized Advertising
• Rare diseases
• Current analytics methods do not scale or produce satisfactory results
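The point about sampling can be made concrete with a small calculation. The sketch below assumes independent draws and uses a hypothetical prevalence and sample size (neither comes from the slides) to show how easily a sample contains no rare case at all.

```python
def prob_sample_misses_all(prevalence: float, sample_size: int) -> float:
    """Probability that a random sample (independent draws) contains
    no rare case at all."""
    return (1.0 - prevalence) ** sample_size

# Hypothetical numbers, chosen only for illustration.
prevalence = 1e-5      # 1 rare case per 100,000 records
sample_size = 10_000   # records kept after sampling

p_miss = prob_sample_misses_all(prevalence, sample_size)
expected_cases = prevalence * sample_size

print(f"expected rare cases in the sample: {expected_cases:.2f}")
print(f"probability the sample has none:   {p_miss:.3f}")
```

Under these assumed numbers the sample will usually contain no rare case, so any pattern specific to those cases is lost before the analysis starts.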
Big Data Velocity (60s on the internet)
[Source: Qmee]
Big Data Variety
• Data can be
  • structured,
  • semi-structured,
  • text,
  • images,
  • video,
  • time series,
  • click-streams,
  • graphs or (social) networks,
  • …
Big Data Value
• Predict voting behaviour based on Twitter (~1M tweets) [UA Master thesis Christophe Van Gysel]
• Detect fiscal fraud based on a network of ~7M transactions [UA Applied Data Mining, Prof. dr. David Martens]
• Recognise cyberpedophiles [UA Computational Linguistics, Prof. dr. Walter Daelemans]
• e-Health, predict rare diseases [UA Biomina, UZA, Prof. dr. Bart Goethals]
• Mining train delays [UA, Prof. dr. Bart Goethals and Infrabel]
• Personalised Advertising, Recommendation, Cross-selling, Product placement, Distribution planning
• …
What about the methods?
• Association and Pattern Discovery
• Classification, Prediction, Regression
• Clustering
• Recommendation
• Exploration
• Summarization
• Visualization
Association and Pattern Discovery
• Imagine a supermarket
• What sets of products are frequently bought together?
• Which products influence each other's sales?
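The supermarket questions can be sketched as frequent itemset counting. The example below is a minimal brute-force illustration, not an optimised algorithm such as Apriori; the baskets, the function name `frequent_itemsets` and the support threshold are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(baskets, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items whose
    support (fraction of baskets containing them) is at least min_support."""
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    n = len(baskets)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

# Hypothetical baskets, invented for illustration only.
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["bread", "milk"],
    ["beer", "diapers", "milk"],
    ["bread", "diapers"],
]

frequent = frequent_itemsets(baskets, min_support=0.6)
for itemset, support in sorted(frequent.items()):
    print(itemset, support)
```

Enumerating every subset of every basket only works for tiny examples; practical miners prune the search space instead of counting everything.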
Challenge
Number of potentially interesting patterns is larger than the number of particles in the universe
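A quick calculation shows how fast this happens for itemsets. With n distinct products there are 2^n - 1 non-empty product sets; the sketch below compares that count with 10^80, a commonly quoted order-of-magnitude estimate for the number of atoms in the observable universe (an assumption here, not a figure from the slides).

```python
# Rough order-of-magnitude estimate, assumed for illustration only.
PARTICLES_IN_UNIVERSE = 10 ** 80

def num_itemsets(num_products: int) -> int:
    """Number of non-empty subsets of num_products distinct products."""
    return 2 ** num_products - 1

# Smallest catalogue size at which candidate itemsets outnumber the estimate.
n = 1
while num_itemsets(n) <= PARTICLES_IN_UNIVERSE:
    n += 1

print(n)
```

Under this assumption, a catalogue of a few hundred products is already enough, which is why exhaustive enumeration of candidate patterns is not an option.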
Association and Pattern Discovery
• “75% of all customers that buy diapers also buy beer”
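A statement of this form is an association rule, and the percentage is usually called its confidence: the fraction of baskets containing the left-hand side that also contain the right-hand side. The sketch below computes it on hypothetical baskets constructed so that the rule comes out at 75%; the data and function names are illustrative assumptions.

```python
def support(baskets, itemset):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(b) for b in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent:
    support(antecedent together with consequent) / support(antecedent).
    Assumes the antecedent occurs in at least one basket."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

# Hypothetical baskets, constructed so the rule has 75% confidence.
baskets = [
    {"diapers", "beer"},
    {"diapers", "beer", "chips"},
    {"diapers", "beer", "milk"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
print(f"diapers -> beer: confidence {rule_confidence:.0%}")
```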
Different patterns for different data
• Patients, symptoms, diseases
• Movies, ratings, viewers
• Friends, Likes, Status Updates, Interactions
• Routes, Trucks, Packages, Distributors, Locations
• Sequences, spatial, time series, graphs, multi-relations, RDF, …
Classification / Prediction
How to separate two classes of objects from each other
Rare diseases
• Neonatal heel prick used for detection of potential medium-chain acyl-coenzyme A dehydrogenase deficiency
• Classify whether expensive genetic test is required
• Intensive Care, fast prediction of e.g. kidney failure
[UA Biomina]
Fraud detection
[De Standaard, Prof. dr. David Martens, UA Applied Data Mining research group]
"Twitter brengt raad" ("Twitter brings counsel")
Voting behaviour prediction on Twitter
[UA Master thesis Christophe Van Gysel]
Classification methods
• Pattern Based Classification
• Nearest Neighbour Classification
• Decision Trees
• Support Vector Machines
• Neural Networks
• Random Forests
• Conditional Random Fields
• …
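As an illustration of one method from this list, here is a minimal sketch of nearest neighbour classification: a new object gets the majority label of its k closest training objects. The training points, the Euclidean distance and the name `knn_predict` are assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict the label of query by majority vote among its k nearest
    training points (Euclidean distance)."""
    neighbours = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training data with two classes.
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

print(knn_predict(train, (1.8, 1.6)))  # query close to the "A" points
print(knn_predict(train, (6.2, 6.4)))  # query close to the "B" points
```

Sorting all training points for every query is fine for a toy data set; at scale one would typically use an index structure or an approximate search instead.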
Recommendation methods
• A customer arrives at your web-shop: show her the product she doesn't know yet, but might be interested in
• For any (online) shop! Famous example: Netflix (pattern mining is even used to produce new series: ‘House of Cards’)
• Recommendation is everywhere.
• Understand user intent!
• Methods:
  • Collaborative Filtering
  • Matrix Factorisation
  • …
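A minimal sketch of user-based collaborative filtering, one possible reading of the first method above: items a user has not seen yet are scored by the ratings of other users, weighted by how similar those users are. The ratings, the cosine similarity and the summed-score ranking are illustrative assumptions, not a description of any production system.

```python
import math

# Hypothetical user -> {item: rating} data, invented for illustration.
ratings = {
    "ann":  {"A": 5, "B": 5, "C": 5},
    "bob":  {"A": 5, "B": 4},
    "cara": {"A": 1, "B": 2, "D": 5},
    "dave": {"A": 5, "C": 4},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    dot = sum(u[i] * v[i] for i in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(user, ratings, top_n=1):
    """Rank items the user has not rated by the sum of other users'
    ratings, each weighted by that user's similarity to the target user."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("bob", ratings))
```

In this toy data the users most similar to "bob" both rated item "C" highly, so "C" ranks above "D", which only a dissimilar user liked.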
Sentiment analysis
Clustering: grouping similar things together
What is a natural grouping of these objects?
Male vs. Female
Young vs. Old
Simpson family vs. Others
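The three groupings above can be reproduced on the same objects simply by changing the attribute that defines similarity. The records below are hypothetical stand-ins for the pictured objects; names and attribute values are invented for illustration.

```python
from collections import defaultdict

# Hypothetical objects; attribute values are invented for illustration.
people = [
    {"name": "p1", "gender": "female", "age": 8,  "family": "Simpson"},
    {"name": "p2", "gender": "male",   "age": 39, "family": "Simpson"},
    {"name": "p3", "gender": "female", "age": 45, "family": "other"},
    {"name": "p4", "gender": "male",   "age": 10, "family": "other"},
]

def group_by(objects, key):
    """Group object names by the value that key() assigns to each object."""
    groups = defaultdict(list)
    for obj in objects:
        groups[key(obj)].append(obj["name"])
    return dict(groups)

by_gender = group_by(people, lambda p: p["gender"])
by_age = group_by(people, lambda p: "young" if p["age"] < 18 else "old")
by_family = group_by(people, lambda p: p["family"])

print(by_gender)
print(by_age)
print(by_family)
```

Each grouping is equally "natural"; which one a clustering method finds depends entirely on the similarity measure it is given.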
Similarity is hard to measure
curse of dimensionality
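One way to see the curse of dimensionality is distance concentration: for random points, the nearest and the farthest neighbour of a query become almost equally far away as the number of dimensions grows. The sketch below assumes uniformly random points in the unit cube and Euclidean distance; the function name and parameters are illustrative.

```python
import math
import random

def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of the nearest to the farthest Euclidean distance from a random
    query point to n_points random points in the unit cube [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return min(dists) / max(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(nearest_to_farthest_ratio(dim), 3))
```

A ratio close to 1 means "nearest" carries little information, which is one reason similarity-based methods degrade on high-dimensional data.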
Enough about the Methods
What about privacy?
• Most methods function on anonymised data
• Problem solved? No!
• Patterns or predictions themselves can also cause privacy infringement
Privacy Preserving Data Mining and Discrimination Aware Data Mining methods exist!
Conclusion
http://www.uantwerpen.be/bart-goethals [email protected]