Teacher’s Day Data Mining

Paul Kennedy
School of Software, Faculty of Engineering & IT

What is Predictive Analytics?

www.youtube.com/watch?v=BjznLJcgSFI

[email protected]

Outline

• The exercise: making a dataset

• What is data mining?

• Data mining success stories

• What kind of jobs are available?

• Data mining you!

• My work in kids’ cancer


The Exercise

• Collect a dataset of measurements of students.

• Visualise the data.

• Predict the gender of students based on the measurements.

• Visualisation and prediction are done using free, open-source software called R/Rattle.


Running the exercise

1. Make the dataset - measure the students.

2. Describe data mining: visualisation & prediction.

3. Get students to visualise themselves then build predictive models.

4. Show how these same methods can be applied in the childhood cancer domain.

R

• R: a sophisticated, state-of-the-art, powerful statistical software package that is cross-platform, open-source and free.

• Interpreted like Python.

• I use it to teach my first-year undergraduate students.

• http://www.r-project.org/

• R is a language and environment for statistical computing and for graphics.

• It’s available for Linux, MS Windows and Mac OS X.

• It’s quite sophisticated and contains thousands of ‘packages’, which implement different statistical and data analytics tools.

• The packages and code live on CRAN, the Comprehensive R Archive Network, which is mirrored throughout the world.

• You can write R scripts into files and run them like programs.


Rattle

• Rattle is a graphical user interface to do data analytics.

• It calls R in the background and makes the R learning curve easier.

• It matches up with the CRISP-DM phases.

• This is a very nice text if you feel inspired!


Rattle

• Written by Graham Williams.

• Director of Data Mining at the Australian Taxation Office.

• Rattle is used at the ATO, so it works well in a complex industrial setting with very large data sets.


Installing Rattle

• The easiest way to install R and Rattle is to go to the Rattle web site

• http://rattle.togaware.com

• Then scroll down to the Install section and choose the one you want.

• There is a troubleshooting page and a Google group if you get stuck.


Installing Rattle

• The only difficulties I have had with getting Rattle installed are dependencies on other external programs.

• Some dependencies:

• GGobi

• GTK+

• XQuartz (only on Mac OSX)

• Sometimes a DLL problem with Windows.

• The troubleshooting documentation or the Google Group will help.


Installing Rattle

• GGobi

• Rattle uses another piece of software called GGobi for the visualisation. GGobi was written by AT&T.

• It should already be installed, but if not it can be found at: http://www.ggobi.org

• On Windows, both a 32-bit and a 64-bit version of R are installed. You need the matching version of GGobi: if you run the 32-bit R you need the 32-bit GGobi, and similarly for 64-bit.

• For this exercise it won’t matter if you use 32 or 64 bit R.


Making a dataset


1. Generate a dataset of your height, arm length and circumference of head.

2. Visualise you.

3. Use the data to predict your gender.


The dataset

• Suggested attributes

• Name - as an identifier

• Gender - M or F

• Height

• Circumference around head

• Length of arm

• Height etc. are (usually) strongly correlated with gender, so they visualise nicely and work well for building a predictive model.

• What if the class is all boys or all girls? You could instead predict whether each student’s birthday falls in Jan-Jun or Jul-Dec.
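As a rough illustration of why these attributes suit the exercise, the sketch below parses a tiny dataset in the suggested layout and compares the average height for each gender. The exercise itself uses R/Rattle; Python is only a stand-in here, and the names, column headers and measurements (assumed to be in cm) are all invented for the example.

```python
import csv
import io
from statistics import mean

# Hypothetical class data in the suggested layout (all values invented, in cm).
CSV_TEXT = """Name,Gender,Height,Head,Arm
Alex,M,178,58,64
Blake,F,163,55,58
Casey,F,160,54,57
Drew,M,182,59,66
Eden,F,168,56,60
Finn,M,175,57,63
"""

def mean_by_gender(csv_text, column):
    """Return {gender: mean of `column`} for the rows in `csv_text`."""
    groups = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups.setdefault(row["Gender"], []).append(float(row[column]))
    return {gender: mean(values) for gender, values in groups.items()}

height_means = mean_by_gender(CSV_TEXT, "Height")
for gender, value in sorted(height_means.items()):
    print(gender, round(value, 1))
```

A clear gap between the two group means is what makes a measurement useful for prediction; the same function can be called with "Head" or "Arm" to compare the other attributes.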


Measuring …


How to do the measurements?

• The main issues are crowd control and generating and sharing a dataset quickly.

• I had three helpers, one each to record height, arm length and head circumference.

• They entered information into a spreadsheet on Google Drive.

• I downloaded the spreadsheet as a CSV file onto Dropbox and made a URL pointing to the CSV file.

• I copied the URL into http://tinyurl.com to distribute.


The data

• It works better if the same person measures the same thing to avoid bias.

• Make sure gender is always encoded in the same way: M or F.

• Not a mix of M, m, Male, male, F, female, …

• R/Rattle is case sensitive.
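If the spreadsheet does end up with mixed labels, they can be cleaned before loading the file into R/Rattle. The following is a minimal sketch in Python (not part of the original exercise); the function name `normalise_gender` and the accepted spellings are assumptions chosen for illustration.

```python
def normalise_gender(label):
    """Map free-form gender labels to 'M' or 'F'; raise on anything else."""
    cleaned = label.strip().lower()
    if cleaned in {"m", "male"}:
        return "M"
    if cleaned in {"f", "female"}:
        return "F"
    raise ValueError(f"unrecognised gender label: {label!r}")

raw_labels = ["M", "m", "Male", "male", "F", "female"]
clean_labels = [normalise_gender(label) for label in raw_labels]
print(clean_labels)
```

Raising an error on unknown labels, rather than guessing, makes data-entry mistakes visible before they reach the model.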


What is data mining?


• Data Mining is the analysis of large databases to find novel, commercially valuable and exploitable patterns.

• Aim: discover meaningful insights and knowledge from data.

• Discoveries expressed as models.


Models

• A model

• Captures the essence of the discovered knowledge.

• Can assist in understanding the world.

• Can be used to make predictions.


Data Mining Successes


Helping to catch the backpacker killer

• Australia’s most notorious serial murder case

• Early 1990s, 7 young backpackers murdered.

• Police had developed a profile.

• Huge dataset generated of vehicle records, gym memberships, gun licensing and police records.

• Link analysis software from Sydney company NetMap Analytics narrowed the list of suspects from 18 million to 32, which included the murderer: Ivan Milat.


Predicting the 2012 US election result

• Nate Silver used predictive analytics & statistics to correctly predict outcomes of 50 out of 50 states from polling and related data.

• Republican pundits were confident in their landslide-win predictions. Democrat pundits predicted a razor-thin victory.

• Shows the power of a data-centric approach over “gut-feeling”.

Source: Wiki Commons, Official White House Photo by Pete Souza


moviegalaxies.com

the lion, the witch and the wardrobe


fellowship of the ring


the return of the king


Data Mining Jobs

• Consultant

• analyses the data and builds data mining models

• Manager

• communicates data mining results to customers

• Data Scientist / Researcher

• develops new algorithms


Types of data

• Spreadsheets, ...

• Transactions

• DNA sequences: gtatcct ...

• Text: tweets, emails, documents

• Images

• Sound

• ... anything else you can imagine.


CRISP-DM

Source: Kenneth Jensen / Wikimedia Commons / Public Domain


Finding the business problem can be hard

• Problem: identifying people likely to change to a different phone provider = churn

• Finding: unemployed people over 80 had a most regrettable tendency to churn

• But: these customers were churning because they had passed away, so no incentive program had much impact on decreasing the churn.


Two main modeling approaches

• Unsupervised methods

• Model tries to make sense of the data

• Supervised methods

• Model learns a relationship between inputs and outputs from old data.

• Model can then be used to predict output for new inputs.
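The two approaches can be contrasted on the same kind of measurements used in the class exercise. The sketch below is an illustration only: it is written in Python rather than R/Rattle, the data values are invented, and the nearest-centroid classifier and simple 2-means clustering were picked as small examples of each approach, not because the slides prescribe them.

```python
from statistics import mean

# Hypothetical "old data": (height in cm, arm length in cm) with known gender.
labelled = [
    ((178, 64), "M"), ((182, 66), "M"), ((175, 63), "M"),
    ((163, 58), "F"), ((160, 57), "F"), ((168, 60), "F"),
]

def distance_sq(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Supervised: learn one centroid per class from labelled examples.
def fit_centroids(examples):
    by_class = {}
    for features, label in examples:
        by_class.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_class.items()
    }

def predict(centroids, features):
    """Predict the class whose centroid is closest to `features`."""
    return min(centroids, key=lambda label: distance_sq(centroids[label], features))

# Unsupervised: split unlabelled points into two groups (simple 2-means).
def two_means(points, iterations=10):
    centres = [min(points), max(points)]  # crude starting guess
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for p in points:
            nearest = min((0, 1), key=lambda i: distance_sq(centres[i], p))
            groups[nearest].append(p)
        centres = [
            tuple(mean(col) for col in zip(*g)) if g else centres[i]
            for i, g in enumerate(groups)
        ]
    return groups

centroids = fit_centroids(labelled)
prediction = predict(centroids, (180, 65))  # a new, unseen student
clusters = two_means([features for features, _ in labelled])
print(prediction, [len(c) for c in clusters])
```

The supervised model needs the gender column to learn from; the clustering never sees it and only groups similar measurements together.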


University Friends

Attributes

Instances


Task: Who has better access to other friends?

“structural”


Possible answer
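One way to make "better access to other friends" concrete is to count how many friendship hops each person needs to reach everyone else. The sketch below assumes that interpretation; the friendship network and names are invented, the graph is assumed to be connected, and Python is used purely for illustration.

```python
from collections import deque

# Hypothetical friendship network (undirected); the names are invented.
friends = {
    "Ann": {"Bob", "Cat"},
    "Bob": {"Ann", "Cat", "Dan"},
    "Cat": {"Ann", "Bob"},
    "Dan": {"Bob", "Eve"},
    "Eve": {"Dan"},
}

def hops_from(graph, start):
    """Breadth-first search: friendship hops from `start` to each reachable person."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for friend in graph[person]:
            if friend not in hops:
                hops[friend] = hops[person] + 1
                queue.append(friend)
    return hops

def average_hops(graph, person):
    """Average hops needed to reach everyone else (lower means better access)."""
    hops = hops_from(graph, person)
    others = [d for name, d in hops.items() if name != person]
    return sum(others) / len(others)

scores = {person: average_hops(friends, person) for person in friends}
best = min(scores, key=scores.get)
print(best, scores[best])
```

Here the answer depends only on who is connected to whom, not on any attribute of the people themselves.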


Task: Predict whether someone gets sunburned.

The “Class”

The “mining table”


One possible answer: Characterisation of the type of person.

[Figure: decision tree that splits on Hair Colour (blonde, ginger, brown) and Lotion (yes, no); leaves are “sunburned” or “no”]
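A characterisation based on hair colour and lotion use can be written as a small decision tree and checked against a mining table. The sketch below is hypothetical throughout: the rows and the exact shape of the tree are invented for illustration and are not taken from the slide, and Python stands in for R/Rattle.

```python
# Hypothetical "mining table": each row describes a person; "sunburned" is the class.
mining_table = [
    {"hair": "blonde", "lotion": "no",  "sunburned": "yes"},
    {"hair": "blonde", "lotion": "yes", "sunburned": "no"},
    {"hair": "ginger", "lotion": "no",  "sunburned": "yes"},
    {"hair": "ginger", "lotion": "yes", "sunburned": "no"},
    {"hair": "brown",  "lotion": "no",  "sunburned": "no"},
    {"hair": "brown",  "lotion": "yes", "sunburned": "no"},
]

# An illustrative tree: inner nodes are (attribute, {value: subtree}); leaves are class labels.
tree = ("hair", {
    "blonde": ("lotion", {"no": "yes", "yes": "no"}),
    "ginger": ("lotion", {"no": "yes", "yes": "no"}),
    "brown": "no",
})

def classify(node, person):
    """Walk from the root to a leaf, following the branch for each attribute value."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[person[attribute]]
    return node

correct = sum(classify(tree, row) == row["sunburned"] for row in mining_table)
print(f"{correct} of {len(mining_table)} rows classified correctly")
```

Unlike many models, a tree like this can be read aloud as plain rules, which is why it works as a characterisation as well as a predictor.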


Data Mining You!



2. Visualise you.



R/Rattle

• R is a statistical programming language.

• Rattle is a user interface to make it easier to use.

• R/Rattle are big and complex, but we will only use a small part of them.


Visualising …




Prediction …


Data Mining for Kids with Cancer


Data Mining for Childhood Cancer

• Working with Children’s Hospital at Westmead to build a tool to help clinicians to better diagnose and treat childhood cancer

• Visualise patients and predict how a patient will react to treatment by comparing patients with previous patients


• Cancer is the disease that kills the most kids in Australia.

• ~700 kids diagnosed per year in Australia.

• Cancer is heterogeneous.

• More than 50 types & subtypes.

• We want to find a better way to treat children with cancer using biological data.

• Focus: Leukaemia

• Collaborative work between A/Prof Paul Kennedy (UTS) and A/Prof Daniel Catchpoole (The Children’s Hospital at Westmead)


[Figure: two panels, “Normal Blood” and “Leukaemia”, with labels for Red Blood Cells, White Blood Cells and Plasma]


Acute Lymphoblastic Leukaemia

• Chemotherapy initially helps, but it’s bad if the cancer relapses (i.e. it comes back).

• So treatment is based on the risk of relapse.

• Clinicians group patients into risk categories to treat them.

• Strong treatment is needed for those with a high risk of relapse.


Survival rates on BFM-95 drug trial

[Figure: survival curves for High Risk, Medium Risk and Standard Risk patients; x-axis: Years, y-axis: Survival]

Can we identify at diagnosis the 10% of standard risk patients who will relapse?

... and give them different therapy.


Acute Lymphoblastic Leukaemia

• Goal: identify at diagnosis which patients may not respond to treatment, to inform the clinician

• Data: gene expression, gene variation, clinical, ...

• Patient-to-patient comparison based on biological background
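Patient-to-patient comparison can be pictured as a nearest-neighbour search: find the previous patients whose biological profile is closest to the new patient's and look at what happened to them. The sketch below is a toy version under stated assumptions: the expression values, patient ids and outcomes are invented, Euclidean distance is an arbitrary choice of similarity measure, and this is not the tool built with the hospital.

```python
import math

# Hypothetical previous patients: invented expression values and outcomes.
previous_patients = {
    "P1": {"expression": [2.1, 0.4, 1.8], "relapsed": False},
    "P2": {"expression": [0.3, 1.9, 0.2], "relapsed": True},
    "P3": {"expression": [1.9, 0.5, 2.0], "relapsed": False},
    "P4": {"expression": [0.5, 2.2, 0.1], "relapsed": True},
}

def most_similar(new_profile, patients, k=2):
    """Return the ids of the k previous patients closest to `new_profile`."""
    ranked = sorted(
        patients,
        key=lambda pid: math.dist(new_profile, patients[pid]["expression"]),
    )
    return ranked[:k]

new_patient = [0.4, 2.0, 0.3]
neighbours = most_similar(new_patient, previous_patients)
relapse_share = sum(previous_patients[p]["relapsed"] for p in neighbours) / len(neighbours)
print(neighbours, relapse_share)
```

A high share of relapses among the closest previous patients would be a prompt for the clinician to look more closely, not a diagnosis in itself.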


singular value decomposition

purple = ALL patient, yellow = normal child

treatment protocol: green = BFM 95, red = Study 8, purple = ?

Let’s look more closely at these 2 patients.

These two patients treated on different drug trials are biologically similar ...

risk category: green = standard, yellow = medium, red = high, purple = ?

... but the one on the right was classified by the clinician at lower risk of relapse ...

relapsed: green = no, red = yes, purple = ?

The one classified by the clinician at low risk suffered a relapse.

death: red = died, green = survived

He eventually survived but ...

should he have been originally classed at a higher risk and received a modified therapy?

Can we take a personalised treatment approach and predict at diagnosis how a patient will respond based on their biological similarity to previous patients?


Neuroblastoma

• Cancer of the nervous system.

• Lowest survival rate among childhood cancers.

• Heterogeneous disease. Clinical course may range from spontaneous regression to very aggressive behaviour.

• Analysing the histopathology of tumour samples is time-consuming and error-prone.

• Aim: build a computer-aided diagnosis system

• Build classifiers for morphological features of stained images to help diagnosis.


Questions ...
