Teacher’s Day Data Mining
Paul Kennedy, School of Software, Faculty of Engineering & IT
What is Predictive Analytics?
www.youtube.com/watch?v=BjznLJcgSFI
Outline
• The exercise: making a dataset
• What is data mining?
• Data mining success stories
• What kind of jobs are available?
• Data mining you!
• My work in kids’ cancer
The Exercise
• Collect a dataset of measurements of students.
• Visualise the data.
• Predict the gender of students based on the measurements.
• Visualisation and prediction are done using free, open-source software called R/Rattle.
Running the exercise
1. Make the dataset - measure the students.
2. Describe data mining: visualisation & prediction.
3. Get students to visualise themselves then build predictive models.
4. How these same methods can be applied in the childhood cancer domain.
R
• R: a sophisticated, state-of-the-art and powerful statistical software package that is cross-platform, open-source and free.
• Interpreted, like Python.
• I use it to teach my first-year undergraduate students.
• http://www.r-project.org/
• R is a language and environment for statistical computing and for graphics.
• It’s available for Linux, MS Windows and MacOS X
• It’s quite sophisticated and contains 1000s of ‘packages’, which implement different statistical and data analytics tools.
• The packages and code live on CRAN, the Comprehensive R Archive Network, which is mirrored throughout the world.
• You can write R scripts into files and run them.
Rattle
• Rattle is a graphical user interface to do data analytics.
• It calls R in the background and makes the R learning curve easier.
• It matches up with the CRISP-DM phases.
• This is a very nice text if you feel inspired!
Rattle
• Written by Graham Williams.
• Director of Data Mining at the Australian Taxation Office.
• Rattle is used at the ATO, so it works well in a complex industrial setting with very large data sets.
Installing Rattle
• The easiest way to install R and Rattle is to go to the Rattle web site
• http://rattle.togaware.com
• Then scroll down to the Install section and choose the one you want.
• There is a troubleshooting page and a Google group if you get stuck.
Installing Rattle
• The only difficulties I have had with getting Rattle installed are dependencies on other external programs.
• Some dependencies:
• GGobi
• GTK+
• XQuartz (only on Mac OSX)
• Sometimes a DLL problem with Windows.
• The troubleshooting documentation or the Google Group will help.
Installing Rattle
• GGobi
• Rattle uses another piece of software called GGobi for the visualisation. GGobi was written by AT&T.
• It should be installed, but if not it can be found at: http://www.ggobi.org
• On Windows, both a 32-bit and a 64-bit version of R are installed. You need the matching version of GGobi: if you run the 32-bit R you need the 32-bit GGobi, and similarly for 64-bit.
• For this exercise it won’t matter if you use 32 or 64 bit R.
1. Generate a dataset of your height, arm length and circumference of head.
2. Visualise you.
3. Use the data to predict your gender.
The dataset
• Suggested attributes
• Name - as an identifier
• Gender - M or F
• Height
• Circumference around head
• Length of arm
• Height etc. are (usually) strongly correlated with gender, so they visualise nicely and are good for building a predictive model.
• What if the class is all boys or all girls? Potentially use whether the birthday falls in Jan-Jun or Jul-Dec as the class instead.
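Before plotting anything, a quick way to see whether an attribute separates the two groups is to compare group averages. The sketch below is in Python rather than R/Rattle, purely for illustration. The measurements, the column order and the helper name `group_means` are all invented for this example and are not real class data.

```python
from statistics import mean

# Hypothetical measurements in cm: (name, gender, height, head, arm).
# These values are made up for illustration only.
rows = [
    ("A", "M", 178.0, 57.0, 64.0),
    ("B", "M", 172.0, 56.5, 62.0),
    ("C", "F", 163.0, 54.5, 57.0),
    ("D", "F", 167.0, 55.0, 59.0),
]

def group_means(rows, column):
    """Return {gender: mean of the numeric column at the given index}."""
    groups = {}
    for row in rows:
        groups.setdefault(row[1], []).append(row[column])
    return {gender: mean(values) for gender, values in groups.items()}

# Column 2 is height in this hypothetical layout.
height_means = group_means(rows, 2)
print(height_means)
```

If the group averages are far apart relative to the spread within each group, the attribute is likely to be useful for prediction.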
How to do the measurements?
• The main issues are crowd control and generating and sharing a dataset quickly.
• I had 3 helpers, one each to record height, arm length and head circumference.
• They entered information into a spreadsheet on Google Drive.
• I downloaded the spreadsheet as a CSV file onto Dropbox and made a URL pointing to the CSV file.
• I copied the URL into http://tinyurl.com to distribute.
The data
• It works better if the same person measures the same thing to avoid bias.
• Make sure gender is always encoded in the same way: M or F.
• Not a mix of M, m, Male, male, F, female, …
• R/Rattle is case sensitive.
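If the spreadsheet does end up with mixed labels, they can be cleaned before loading the data into R/Rattle. Here is a minimal sketch in Python (not R), assuming a CSV with the columns Name, Gender, Height, Head and Arm. The column names, the sample rows and the `load_rows` helper are assumptions made for this example.

```python
import csv
import io

# Hypothetical CSV export; column names and values are invented.
RAW = """Name,Gender,Height,Head,Arm
A,M,178,57,64
B,male,172,56.5,62
C,f,163,54.5,57
D,Female,167,55,59
"""

# Map every accepted spelling (lower-cased) to a single code.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}

def load_rows(text):
    """Parse the CSV text, normalise Gender to M/F and convert numbers."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        key = record["Gender"].strip().lower()
        if key not in GENDER_CODES:
            raise ValueError(f"unrecognised gender label: {record['Gender']!r}")
        record["Gender"] = GENDER_CODES[key]
        for column in ("Height", "Head", "Arm"):
            record[column] = float(record[column])
        rows.append(record)
    return rows

rows = load_rows(RAW)
genders = [row["Gender"] for row in rows]
print(genders)
```

Raising an error on an unknown label, rather than guessing, makes typos visible while the helpers are still in the room to fix them.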
• Data Mining is the analysis of large databases to find novel, commercially valuable and exploitable patterns.
• Aim: discover meaningful insights and knowledge from data.
• Discoveries expressed as models.
Models
• A model
• Captures the essence of the discovered knowledge.
• Can assist in understanding the world.
• Can be used to make predictions.
Helping to catch the backpacker killer
• Australia’s most notorious serial murder case
• Early 1990s, 7 young backpackers murdered.
• Police had developed a profile.
• Huge dataset generated of vehicle records, gym memberships, gun licensing and police records.
• Link analysis software from Sydney company NetMap Analytics narrowed the list of suspects from 18 million to 32, which included the murderer: Ivan Milat.
Predicting the 2012 US election result
• Nate Silver used predictive analytics & statistics to correctly predict outcomes of 50 out of 50 states from polling and related data.
• Republican pundits were confident in their landslide-win predictions. Democrat pundits predicted a razor-thin victory.
• Shows the power of a data-centric approach over “gut-feeling”.
Source: Wiki Commons, Official White House Photo by Pete Souza
Data Mining Jobs
• Consultant
• analyses the data and builds data mining models
• Manager
• communicates data mining results to customers
• Data Scientist / Researcher
• develops new algorithms
Types of data
• Spreadsheets, ...
• Transactions
• DNA sequences: gtatcct ...
• Text: tweets, emails, documents
• Images
• Sound
• ... anything else you can imagine.
CRISP-DM
Source: Kenneth Jensen / Wikimedia Commons / Public Domain
Finding the business problem can be hard
• Problem: identifying people likely to change to a different phone provider = churn
• Finding: unemployed people over 80 had a most regrettable tendency to churn
• But: these customers were churning because they had passed away, so no incentive program had much impact on decreasing the churn.
Two main modeling approaches
• Unsupervised methods
• Model tries to make sense of the data
• Supervised methods
• Model learns a relationship between inputs and outputs from old data.
• Model can then be used to predict output for new inputs.
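To make the supervised idea concrete, here is a minimal Python sketch of one very simple model, a nearest-centroid classifier. This is a stand-in chosen for brevity, not the kind of model Rattle builds. The training values and the function names `fit_centroids` and `predict` are invented for this example.

```python
from math import dist
from statistics import mean

# Hypothetical "old data": (height_cm, arm_cm) with a known label.
training = [
    ((178.0, 64.0), "M"),
    ((172.0, 62.0), "M"),
    ((163.0, 57.0), "F"),
    ((167.0, 59.0), "F"),
]

def fit_centroids(examples):
    """Learn one average point (centroid) per label from labelled examples."""
    grouped = {}
    for features, label in examples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*points))
        for label, points in grouped.items()
    }

def predict(centroids, features):
    """Predict the label whose centroid is closest to the new input."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

centroids = fit_centroids(training)   # learn from old data
prediction = predict(centroids, (175.0, 63.0))   # predict for a new input
print(prediction)
```

The two steps mirror the bullets above: `fit_centroids` learns from old inputs and outputs, and `predict` applies what was learned to an input the model has not seen.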
Task: Who has better access to other friends?
“structural”
One possible answer: Characterisation of the type of person.
[Figure: decision tree that splits on Hair Colour (blonde, ginger, brown) and then on Lotion (yes/no), with leaves labelled sunburned or no.]
R/Rattle
• R is a statistical programming language.
• Rattle is a user interface to make it easier to use.
• R/Rattle are big and complex but we will only use a little part of it.
• Working with Children’s Hospital at Westmead to build a tool to help clinicians to better diagnose and treat childhood cancer
• Visualise patients and predict how a patient will react to treatment by comparing patients with previous patients
Data Mining for Childhood Cancer
• Cancer is the disease that kills the most kids in Australia.
• ~700 kids diagnosed per year in Australia.
• Cancer is heterogeneous.
• More than 50 types & subtypes.
• We want to find a better way to treat children with cancer using biological data.
• Focus: Leukaemia
• Collaborative work between A/Prof Paul Kennedy (UTS) and A/Prof Daniel Catchpoole (The Children’s Hospital at Westmead)
[Figure: normal blood compared with leukaemia, showing red blood cells, white blood cells and plasma.]
• Chemotherapy initially helps, but it’s bad if the cancer relapses (i.e. it comes back).
• So treatment is based on the risk of relapse.
• Clinicians group patients into risk categories to treat them.
• Strong treatment is needed for those with a high risk of relapse.
Acute Lymphoblastic Leukaemia
Survival rates on BFM-95 drug trial
[Figure: survival curves (axes: Years, Survival) for the High Risk, Medium Risk and Standard Risk groups.]
Can we identify at diagnosis the 10% of standard risk patients who will relapse ... and give them different therapy?
Acute Lymphoblastic Leukaemia
• Goal: identify at diagnosis which patients may not respond to treatment to inform clinician
• Data: gene expression, gene variation, clinical, ...
• Patient-to-patient comparison based on biological background
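The slides that follow plot patients using singular value decomposition. As a rough idea of what that involves, the Python sketch below approximates the leading singular direction of a small matrix by power iteration and gives each patient a score along it. This is a simplified stand-in, not the project's actual pipeline. The tiny `profiles` matrix and the function name `top_singular_direction` are invented for this example, and real data would normally be centred and far larger.

```python
from math import sqrt

# Hypothetical expression matrix: one row per patient, one column per gene.
profiles = [
    [3.0, 0.0],
    [0.0, 1.0],
]

def top_singular_direction(matrix, iterations=100):
    """Approximate the leading right singular vector of `matrix`.

    Uses power iteration on A^T A, starting from an all-ones vector.
    Assumes the matrix is non-zero and the start vector is not
    orthogonal to the leading direction.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0] * n_cols
    for _ in range(iterations):
        # u = A v
        u = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        # w = A^T u
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

direction = top_singular_direction(profiles)
# Each patient's coordinate along the leading direction.
scores = [sum(a * b for a, b in zip(row, direction)) for row in profiles]
print(direction, scores)
```

Patients with similar profiles get similar scores, which is what makes a low-dimensional plot useful for patient-to-patient comparison.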
Singular value decomposition
[Figure: patients plotted using singular value decomposition. Legend: purple = ALL patient, yellow = normal child.]
[Figure: patients coloured by treatment protocol. Legend: green = BFM 95, red = Study 8, purple = ?]
Let’s look more closely at these 2 patients.
These two patients treated on different drug trials are biologically similar ...
[Legend, risk category: green = standard, yellow = medium, red = high, purple = ?]
... but the one on the right was classified by the clinician at lower risk of relapse ...
[Legend, relapsed: green = no, red = yes, purple = ?]
The one classified by the clinician at low risk suffered a relapse.
[Legend, death: red = died, green = survived]
He eventually survived but ...
should he have been originally classed at a higher risk and received a modified therapy?
Can we take a personalised treatment approach and predict at diagnosis how a patient will respond based on their biological similarity to previous patients?
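One simple way to phrase "predict from similar previous patients" is a nearest-neighbour vote. The Python sketch below is only an illustration of that idea, not the tool being built with the hospital. The feature vectors, the outcomes, the choice of k = 3 and the function name `predict_outcome` are all invented for this example.

```python
from collections import Counter
from math import dist

# Hypothetical previous patients: (biological feature vector, known outcome).
previous = [
    ((0.9, 0.1, 0.4), "relapse"),
    ((0.8, 0.2, 0.5), "relapse"),
    ((0.2, 0.9, 0.6), "no relapse"),
    ((0.1, 0.8, 0.7), "no relapse"),
    ((0.3, 0.7, 0.5), "no relapse"),
]

def predict_outcome(previous, new_patient, k=3):
    """Majority vote among the k previous patients closest to the new one."""
    nearest = sorted(previous, key=lambda item: dist(item[0], new_patient))[:k]
    votes = Counter(outcome for _, outcome in nearest)
    return votes.most_common(1)[0][0]

outcome = predict_outcome(previous, (0.85, 0.15, 0.45))
print(outcome)
```

Because the prediction comes from named previous patients, a clinician can inspect exactly which cases drove it, which suits a decision-support setting.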
Neuroblastoma
• Cancer of the nervous system.
• Lowest survival rate among childhood cancers.
• Heterogeneous disease. Clinical course may range from spontaneous regression to very aggressive behaviour.
• Analysing the histopathology of tumour samples is time-consuming and error-prone.
• Aim: build a computer-aided diagnosis system
• Build classifiers for morphological features of stained images to help diagnosis.