Teacher’s Day Data Mining

Paul Kennedy, School of Software, Faculty of Engineering & IT

What is Predictive Analytics?

www.youtube.com/watch?v=BjznLJcgSFI

paul.kennedy@uts.edu.au

Outline

• The exercise: making a dataset

• What is data mining?

• Data mining success stories

• What kind of jobs are available?

• Data mining you!

• My work in kids’ cancer

The Exercise

• Collect a dataset of measurements of students.

• Visualise the data.

• Predict the gender of students based on the measurements.

• Visualisation and prediction are done using free, open-source software called R/Rattle.

Running the exercise

1. Make the dataset - measure the students.

2. Describe data mining: visualisation & prediction.

3. Get students to visualise themselves then build predictive models.

4. Show how these same methods can be applied in the childhood cancer domain.

R

• R: sophisticated, statistical software package, state-of-the-art, powerful, cross platform, open-source and free.

• Interpreted like Python.

• I use it to teach my first-year undergraduate students.

• http://www.r-project.org/

• R is a language and environment for statistical computing and for graphics.

• It’s available for Linux, MS Windows and MacOS X

• It’s quite sophisticated and contains 1000s of ‘packages’, which implement different statistical and data analytics tools.

• The packages and code live in CRAN: the Comprehensive R Archive Network, which is mirrored throughout the world.

• You can write R scripts into files and run them like programs.

Rattle

• Rattle is a graphical user interface to do data analytics.

• It calls R in the background and makes the R learning curve easier.

• It matches up with the CRISP-DM phases.

• This is a very nice text if you feel inspired!

Rattle

• Written by Graham Williams.

• Director of Data Mining at the Australian Taxation Office.

• Rattle is used at the ATO, so it works well in a complex industrial setting with very large data sets.

Installing Rattle

• The easiest way to install R and Rattle is to go to the Rattle web site

• http://rattle.togaware.com

• Then scroll down to the Install section and choose the one you want.

• There is a troubleshooting page and a Google group if you get stuck.

Installing Rattle

• The only difficulties I have had with getting Rattle installed are dependencies on other external programs.

• Some dependencies:

• GGobi

• GTK+

• XQuartz (only on Mac OS X)

• Sometimes a DLL problem with Windows.

• The troubleshooting documentation or the Google Group will help.

Installing Rattle

• GGobi

• Rattle uses another piece of software called GGobi for the visualisation. GGobi was written by AT&T.

• It should be installed, but if not it can be found at: http://www.ggobi.org

• On Windows, both a 32-bit and a 64-bit version of R are installed. You need the matching version of GGobi: if you run the 32-bit R you need the 32-bit GGobi, and similarly for 64-bit.

• For this exercise it won’t matter whether you use 32-bit or 64-bit R.

Making a dataset

1. Generate a dataset of your height, arm length and circumference of head.

2. Visualise you.

3. Use the data to predict your gender.

The dataset

• Suggested attributes

• Name - as an identifier

• Gender - M or F

• Height

• Circumference around head

• Length of arm

• Height etc. are (usually) strongly correlated with gender, so they visualise nicely and are good for building a predictive model.

• What if the class is all boys or all girls? Potentially use whether the birthday is Jan-Jun or Jul-Dec instead of gender.
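As a rough illustration of why these attributes suit a predictive model, here is a minimal sketch in Python (not R/Rattle, which the exercise itself uses). The measurements are invented for the example, and the nearest-average rule is just one simple choice of model; `centroids` and `predict` are hypothetical helper names.

```python
from statistics import mean

# Hypothetical measurements in cm (made up for illustration):
# (gender, height, head circumference, arm length)
students = [
    ("F", 158.0, 54.0, 66.0),
    ("F", 162.0, 55.0, 68.0),
    ("F", 165.0, 55.5, 69.0),
    ("M", 172.0, 57.0, 73.0),
    ("M", 178.0, 58.0, 76.0),
    ("M", 181.0, 58.5, 77.0),
]

def centroids(rows):
    """Average each measurement separately for every gender label."""
    result = {}
    for label in {row[0] for row in rows}:
        columns = zip(*(row[1:] for row in rows if row[0] == label))
        result[label] = tuple(mean(col) for col in columns)
    return result

def predict(model, measurements):
    """Return the label whose average measurements are closest (squared distance)."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(model[label], measurements))
    return min(model, key=distance)

model = centroids(students)
print(predict(model, (160.0, 54.5, 67.0)))
print(predict(model, (179.0, 58.0, 75.0)))
```

With real class data the groups will overlap more than in this toy table, so some students will be predicted incorrectly.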

Measuring …

How to do the measurements?

• The main issues are crowd control and generating and sharing a dataset quickly.

• I had 3 helpers, one each to record height, arm length and head circumference.

• They entered information into a spreadsheet on Google Drive.

• I downloaded the spreadsheet as a CSV file onto Dropbox and made a URL pointing to the CSV file.

• I copied the URL into http://tinyurl.com to distribute.

The data

• It works better if the same person measures the same thing to avoid bias.

• Make sure gender is always encoded in the same way: M or F.

• Not a mix of M, m, Male, male, F, female, …

• R/Rattle is case sensitive.
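If the spreadsheet already contains mixed spellings, they can be tidied before loading. The sketch below is in Python rather than R, and `normalise_gender` is a hypothetical helper written for this example, not part of Rattle.

```python
def normalise_gender(raw):
    """Map common spellings of a gender label to 'M' or 'F'."""
    cleaned = raw.strip().lower()
    if cleaned in {"m", "male"}:
        return "M"
    if cleaned in {"f", "female"}:
        return "F"
    # Anything else is flagged rather than silently guessed.
    raise ValueError(f"Unrecognised gender label: {raw!r}")

raw_labels = ["M", "m", "Male", "male ", "F", "female"]
clean_labels = [normalise_gender(label) for label in raw_labels]

# Case-sensitive software would see six different categories before cleaning.
print(len(set(raw_labels)), len(set(clean_labels)))
print(clean_labels)
```

Raising an error on unknown labels makes typos visible during data entry instead of creating an extra category.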

What is data mining?

• Data Mining is the analysis of large databases to find novel, commercially valuable and exploitable patterns.

• Aim: discover meaningful insights and knowledge from data.

• Discoveries expressed as models.

• A model

• Captures the essence of the discovered knowledge.

• Can assist in understanding the world.

• Can be used to make predictions.

Models

Data Mining Successes

Helping to catch the backpacker killer

• Australia’s most notorious serial murder case

• Early 1990s, 7 young backpackers murdered.

• Police had developed a profile.

• Huge dataset generated of vehicle records, gym memberships, gun licensing and police records.

• Link analysis software from Sydney company NetMap Analytics narrowed the list of suspects from 18 million to 32, which included the murderer: Ivan Milat.

Predicting the 2012 US election result

• Nate Silver used predictive analytics & statistics to correctly predict outcomes of 50 out of 50 states from polling and related data.

• Republican pundits were confident in their landslide-win predictions. Democrat pundits predicted a razor-thin victory.

• Shows the power of a data-centric approach over “gut-feeling”.

Source: Wiki Commons, Official White House Photo by Pete Souza

moviegalaxies.com

the lion, the witch and the wardrobe

fellowship of the ring

the return of the king

Data Mining Jobs

• Consultant

• analyses the data and builds data mining models

• Manager

• communicates data mining results to customers

• Data Scientist / Researcher

• develops new algorithms

Types of data

• Spreadsheets, ...

• Transactions

• DNA sequences: gtatcct ...

• Text: tweets, emails, documents

• Images

• Sound

• ... anything else you can imagine.

CRISP-DM view

Source: Kenneth Jensen / Wikimedia Commons / Public Domain

Finding the business problem can be hard

• Problem: identifying people likely to change to a different phone provider = churn

• Finding: unemployed people over 80 had a most regrettable tendency to churn

• But: the unemployed people over 80 passed away and no incentive program had much impact on decreasing the churn.

Two main modeling approaches

• Unsupervised methods

• Model tries to make sense of the data

• Supervised methods

• Model learns a relationship between inputs and outputs from old data.

• Model can then be used to predict output for new inputs.
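The difference between the two approaches can be sketched on a single attribute. This is a Python illustration with invented heights and labels; the two-centre grouping and the midpoint cut-off are simple stand-ins chosen for the example, not the specific algorithms Rattle provides.

```python
from statistics import mean

heights = [158.0, 162.0, 165.0, 172.0, 178.0, 181.0]  # hypothetical inputs (cm)
labels = ["F", "F", "F", "M", "M", "M"]               # hypothetical outputs

def two_means(values, rounds=10):
    """Unsupervised: group values around two centres without using any labels."""
    low, high = min(values), max(values)
    for _ in range(rounds):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(near_low), mean(near_high)
    return low, high

def learn_threshold(values, outputs):
    """Supervised: use old labelled data to place a cut-off between the classes."""
    mean_f = mean(v for v, o in zip(values, outputs) if o == "F")
    mean_m = mean(v for v, o in zip(values, outputs) if o == "M")
    return (mean_f + mean_m) / 2

def predict(threshold, value):
    """Predict the output for a new input (assumes the 'M' group is taller on average)."""
    return "M" if value > threshold else "F"

centres = two_means(heights)
threshold = learn_threshold(heights, labels)
print(centres)
print(threshold, predict(threshold, 175.0))
```

The unsupervised function finds two groups but cannot name them; only the supervised one, which saw the labels, can output "M" or "F" for a new student.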

University Friends

Attributes

Instances

Task: Who has better access to other friends?

“structural” component

Possible answer

Task: Predict whether someone gets sunburned.

The “Class”

The “mining table”

One possible answer: Characterisation of the type of person.

Figure: decision tree testing Hair Colour (branches include blonde and ginger) and Lotion (yes/no), with leaves marked sunburned or no.

Data Mining You!

2. Visualise you.

R/Rattle

• R is a statistical programming language.

• Rattle is a user interface to make it easier to use.

• R/Rattle are big and complex but we will only use a little part of them.

Visualising …

Prediction …

Data Mining for Kids with Cancer

• Working with Children’s Hospital at Westmead to build a tool to help clinicians to better diagnose and treat childhood cancer

• Visualise patients and predict how a patient will react to treatment by comparing patients with previous patients

Data Mining for Childhood Cancer

• Cancer is the disease that kills the most kids in Australia.

• ~700 kids diagnosed per year in Australia.

• Cancer is heterogeneous.

• More than 50 types & subtypes.

• We want to find a better way to treat children with cancer using biological data.

• Focus: Leukaemia

• Collaborative work between A/Prof Paul Kennedy (UTS) and A/Prof Daniel Catchpoole (The Children’s Hospital at Westmead)

Figure: Normal Blood and Leukaemia samples, labelled with Red Blood Cells, White Blood Cells and Plasma.

• Chemotherapy initially helps, but it’s bad if the cancer relapses (i.e. it comes back).

• So treatment is based on the risk of relapse.

• Clinicians group patients into risk categories to treat them.

• Strong treatment is needed for those with a high risk of relapse.

Acute Lymphoblastic Leukaemia

Survival rates on BFM-95 drug trial

Can we identify at diagnosis the 10% of standard risk patients who will relapse?

... and give them different therapy.

Figure: Survival by risk group (Standard Risk, Medium Risk, High Risk).

Acute Lymphoblastic Leukaemia

• Goal: identify at diagnosis which patients may not respond to treatment to inform clinician

• Data: gene expression, gene variation, clinical, ...

• Patient-to-patient comparison based on biological background

singular value decomposition

purple = ALL patient, yellow = normal child

treatment protocol: green = BFM 95, red = Study 8, purple = ?

treatment protocol

Let’s look more closely at these 2 patients.

treatment protocol: green = BFM 95, red = Study 8, purple = ?

These two patients treated on different drug trials are biologically similar ...

risk category: green = standard, yellow = medium, red = high, purple = ?

... but the one on the right was classified by the clinician at lower risk of relapse ...

relapsed: green = no, red = yes, purple = ?

The one classified by the clinician at low risk suffered a relapse.

death: red = died, green = survived

He eventually survived but ...

should he have been originally classed at a higher risk and received a modified therapy?

Can we take a personalised treatment approach and predict at diagnosis how a patient will respond based on their biological similarity to previous patients?
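One simple way to picture "prediction from similar previous patients" is a nearest-neighbour vote. The Python sketch below is only an illustration of that idea: the feature vectors and outcomes are invented, `predict_relapse` is a hypothetical name, and nothing here describes the actual tool being built with the hospital.

```python
from collections import Counter
from math import dist

# Hypothetical previous patients: (biological feature vector, whether they relapsed).
previous_patients = [
    ((0.2, 1.1, 0.3), "no"),
    ((0.3, 0.9, 0.4), "no"),
    ((0.1, 1.0, 0.2), "no"),
    ((1.2, 0.2, 1.5), "yes"),
    ((1.0, 0.3, 1.4), "yes"),
]

def predict_relapse(new_patient, history, k=3):
    """Majority vote among the k most similar previous patients."""
    nearest = sorted(history, key=lambda item: dist(item[0], new_patient))[:k]
    votes = Counter(outcome for _, outcome in nearest)
    return votes.most_common(1)[0][0]

print(predict_relapse((1.1, 0.25, 1.3), previous_patients))
print(predict_relapse((0.2, 1.0, 0.3), previous_patients))
```

Real gene expression data has thousands of features per patient, which is one reason a dimension-reduction step such as singular value decomposition is useful before comparing patients.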

Neuroblastoma

• Cancer of the nervous system.

• Lowest survival rate among childhood cancers.

• Heterogeneous disease. Clinical course may range from spontaneous regression to very aggressive behaviour.

• Analysing the histopathology of tumour samples is time-consuming and error-prone.

• Aim: build a computer-aided diagnosis system

• Build classifiers for morphological features of stained images to help diagnosis.

Questions ...
