
CS 657/790 Machine Learning and Data Mining

Course Introduction

Student Survey

• Please hand in a sheet of paper with:
  • Your name and email address
  • Your classification (e.g., 2nd-year computer science PhD student)
  • Your experience with MATLAB (none, some, or much)
  • Your undergraduate degree (when, what, where)
  • Your AI experience (courses at UWM or elsewhere)
  • Your programming experience

Course Information

• Course instructor: Joe Bockhorst
• Email: [email protected]
• Office: 1155 EMS
• Course webpage: http://www.uwm.edu/~joebock/790.html
• Office hours: ???
  • Possible times:
    • before class on Monday (3:30-5:30)
    • Monday morning
    • Wednesday morning
    • after class on Monday (7:00-9:00)

Textbook & Reading Assignment

• Machine Learning (Tom Mitchell)
  • Bookstore in union: $140 new
  • Amazon.com hard cover: $125 new, $80 used
  • Amazon.com soft cover: < $30

• Read (posted on class web page):
  • Preface
  • Chapter 1
  • Sections 6.1, 6.2, 6.9, 6.10
  • Sections 8.1, 8.2

PowerPoint vs. Whiteboard

• PowerPoint encourages words over pictures (not good)

• But PowerPoint can be saved, tweaked, easily shared, …
  • Notes posted on course website following lecture

• Your thoughts?

Full Disclosure

• Slides are a combination of:
  1) Jude Shavlik’s notes from the UW-Madison machine learning course (a professor I had)
  2) Textbook slides (Google “machine learning textbook”)
  3) My notes

Class Email List

• Is there one?

Course Outline

• 1st half covers supervised learning
  • Algorithms: support vector machines, neural networks, probabilistic models, …
  • Methodology

• 2nd half covers graphical probability models
  • Powerful statistical models, very useful for learning in complex and/or noisy settings

Course "Style"Course "Style"

• Primarily algorithmic & experimental
• Some theory, both mathematical & conceptual (much on statistics)
• "Hands on" experience, interactive lectures/discussions
• Broad survey of many ML subfields
  • "symbolic" (rules, decision trees)
  • "connectionist" (neural nets)
  • support vector machines
  • statistical ("Bayes rule")
  • genetic algorithms (if time)

Two Major Goals

• to understand what a learning system should do

• to understand how (and how well) existing systems work

Background Assumed

• Programming
  • Data structures and algorithms
  • CS 535

• Math
  • Calculus (partial derivatives)
  • Simple probability & statistics

Programming Assignments in MATLAB

• Why MATLAB?
  • Fast prototyping
  • Integrated plotting
  • Widely used in academia (industry too?)
  • Will save you time in the long run

• Why not MATLAB?
  • Proprietary software
  • Harder to work from home

• Optional assignment: familiarize yourself with MATLAB; use the MATLAB help system (a small taste below)
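To give a feel for the fast-prototyping and integrated-plotting points above, here is a minimal sketch with made-up data (not part of any assignment): fitting and plotting a least-squares line.

    % Toy data: fit a least-squares line and plot it (all values made up).
    x = linspace(0, 10, 50)';           % 50 sample points (column vector)
    y = 2*x + 1 + randn(50, 1);         % noisy linear data
    w = [x, ones(50,1)] \ y;            % least-squares fit via backslash
    plot(x, y, 'o', x, w(1)*x + w(2), '-')
    xlabel('x'), ylabel('y'), title('Least-squares fit')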

Student Computer Labs

• E256, E280, E285, E384, E270
• All have MATLAB installed under Windows XP

Requirements

• Bi-weekly programming plus perhaps some “paper & pencil” homework
  • "hands on" experience is valuable
  • HW0: build a dataset
  • HW1 & HW2: supervised learning algorithms
  • HW3 & HW4: graphical probability models

• Midterm exam (after about 8-10 weeks)

• Final exam

• A project of your choosing
  • during last 4-5 weeks of class

Grading

HW's                  25%
Project               20%
Midterm               20%
Final                 30%
Quality discussion     5%

Late HW's Policy

• HW's due @ 4pm
• You have 5 late days to use over the semester
  • (Fri 4pm → Mon 4pm is 1 late "day")
• SAVE UP late days!
  • extensions only for extreme cases
• Penalty after late days are exhausted: 10% per day
• Can't be more than one week late

Machine Learning vs. Data Mining

• Machine learning: computer algorithms that improve automatically through experience [Mitchell]
• Data mining: extracting knowledge from large amounts of data [Han & Kamber] (synonym: knowledge discovery in databases, KDD)

What’s the Difference? Topics in ML and DM Texts (Mitchell vs. Han & Kamber)

• ML only: reinforcement learning, learning theory, evaluating learning systems, using domain knowledge, inductive logic programming, …

• DM only: data warehouses, OLAP, query languages, association rules, presentation, …

• Both ML and DM: supervised learning, decision trees, neural nets, Bayesian networks, k-nearest neighbor, genetic algorithms, unsupervised learning (clustering in DM jargon), …

We’ll try to cover topics in red.

The Learning Problem

• Learning = improving with experience

• Example: learn to play checkers

  Improve over task T, with respect to performance measure P, based on experience E

  T: play checkers
  P: % of games won
  E: games played against self
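As a concrete (made-up) illustration, the measure P here is just a win percentage; a sketch in MATLAB:

    % Toy outcomes of self-play games (1 = win, 0 = loss); values made up.
    outcomes = [1 0 1 1 0 1 1 0 1 1];
    P = 100 * sum(outcomes) / numel(outcomes)   % P = 70 (% of games won)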

Famous Example: Discovering Genes

• T: find genes in DNA sequences
  ACGTGCATGTGTGAACGTGTGGGTCTGATGATGT…
• P: % of genes found
• E: experimentally verified genes

* Prediction of Complete Gene Structures in Human Genomic DNA, Burge & Karlin, J. Molecular Biology, 1997, 268:78-94

Famous Example 2: Autonomous Vehicle Driving

• T: drive vehicle
• P: reach destination
• E: machine observation of a human driver

ML Key to Winning the DARPA Grand Challenge

“The robot's software system relied predominately on state-of-the-art AI technologies, such as machine learning and probabilistic reasoning.”

[Winning the DARPA Grand Challenge, Thrun et al., Journal of Field Robotics, 2006]

The Stanford team won the 2005 driverless-vehicle race across the Mojave Desert.

Why Study Machine Learning (Data Mining)?

• Data is plentiful
  • Retail, video, images, speech, text, DNA, bio-medical measurements, …

• Computational power is available
• Budding industry
• ML has great applications
• ML is still relatively immature

Next Time: HW0 – Create Your Own Dataset

• Think about this
  • You will need to create it by the week after next

• Google to find:
  • UCI archive (or UCI KDD archive)
  • UCI ML archive (UCI machine learning repository)

HW0 – Your “Personal Concept”

• Step 1: Choose a Boolean (true/false) concept
  • Subjective judgement
    • Books I like/dislike
    • Movies I like/dislike
    • Web pages I like/dislike
  • “Time will tell” concepts
    • Stocks to buy
    • Medical outcomes
  • Sensory interpretation
    • Face recognition (see text)
    • Handwritten digit recognition
    • Sound recognition

HW0 – Your “Personal Concept”

• Step 2: Choose a feature space
  • We will use fixed-length feature vectors (this defines a space)
    • Choose N features
    • Each feature has V_i possible values
    • Each example is represented by a vector of N feature values (i.e., is a point in the feature space), e.g. <red, 50, round> for the features color, weight, and shape (see the sketch below)
  • Feature types:
    • Boolean
    • Nominal
    • Ordered
    • Hierarchical
    (In HW0 we will use a subset; see next slide)

• Step 3: Collect examples (“I/O” pairs)
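A minimal MATLAB sketch of the <red, 50, round> example as a fixed-length feature vector (the variable and label names are just illustrations):

    % One example with N = 3 features, stored as a cell array.
    color   = 'red';                    % nominal
    weight  = 50;                       % continuous
    shape   = 'round';                  % leaf value of a hierarchy
    example = {color, weight, shape};   % one point in the feature space
    label   = true;                     % the Boolean concept value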

Standard Feature Types
(for representing training examples – a source of “domain knowledge”)

• Nominal
  • No relationship among possible values
    e.g., color є {red, blue, green} (vs. color = 1000 Hertz)

• Linear (or Ordered)
  • Possible values of the feature are totally ordered
    e.g., size є {small, medium, large}  ← discrete
          weight є [0…500]              ← continuous

• Hierarchical
  • Possible values are partially ordered in an ISA hierarchy
    e.g., for shape:

      closed
        polygon
          triangle
          square
        continuous
          circle
          ellipse
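One way to encode such a partial order in MATLAB, purely as an illustration (the homeworks use only leaf values):

    % Hypothetical child-to-parent map for the shape ISA hierarchy.
    parent = containers.Map( ...
        {'polygon', 'continuous', 'triangle', 'square', 'circle', 'ellipse'}, ...
        {'closed',  'closed',     'polygon',  'polygon', 'continuous', 'continuous'});
    parent('triangle')   % ans = 'polygon'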

Example Hierarchy (KDD* Journal, Vol. 5, No. 1-2, 2001, page 17)

Product (the structure of one feature!)
  → 99 product classes (e.g., Pet Foods)
    → 2302 product subclasses (e.g., Tea, Canned Cat Food, Dried Cat Food)
      → ~30k products (e.g., Friskies Liver, 250g)

• “the need to be able to incorporate hierarchical (knowledge about data types) is shown in every paper.”
  - From the editors’ introduction to the special issue (on applications) of the KDD journal, Vol. 5, 2001

* Officially “Data Mining and Knowledge Discovery”, Kluwer Publishers

Our Feature Types (for homeworks)

• Discrete
  • tokens (character strings, without quote marks or spaces)

• Continuous
  • numbers (ints or floats)

• If a feature has only a few possible values (e.g., 0 & 1), use discrete
  • i.e., merge nominal and discrete-ordered (or convert discrete-ordered into 1, 2, …)

• We will ignore hierarchy info and only use the leaf values (it is rare anyway)
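A hedged sketch of reading such a dataset in MATLAB, assuming a made-up whitespace-separated file where each row is <token> <number> <token> <label> (the file name and column layout are illustrations, not a required HW format):

    % Read tokens as strings and continuous features as floats.
    fid = fopen('examples.txt', 'r');
    C = textscan(fid, '%s %f %s %s');
    fclose(fid);
    [colors, weights, shapes, labels] = C{:};   % one array per column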

Today’s Topics

• Creating a dataset of fixed-length feature vectors

• HW0 out on-line
  • Due next Monday

Some Famous Examples

• Car steering (Pomerleau):
  digitized camera image → learned function → steering angle

• Medical diagnosis (Quinlan):
  medical record (e.g., age = 13, sex = M, wgt = 18) → learned function → ill vs. healthy

• DNA categorization
• TV-pilot rating
• Chemical-plant control
• Backgammon playing
• WWW page scoring
• Credit application scoring

HW0: Creating Your Dataset

1. Choose a dataset
  • based on interest/familiarity
  • meets the basic requirements:
    • > 1000 examples
    • the category (function) learned should be binary valued
    • ~500 examples labeled class A, the other ~500 labeled class B (see the sampling sketch below)

→ Internet Movie Database (IMDb)
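A sketch of drawing a balanced ~500/~500 sample in MATLAB, assuming a labels column vector (1 = class A, 0 = class B) over all candidate examples (names made up):

    % Shuffle each class, then keep 500 examples of each.
    posIdx = find(labels == 1);
    negIdx = find(labels == 0);
    p = posIdx(randperm(numel(posIdx)));
    n = negIdx(randperm(numel(negIdx)));
    keep = [p(1:500); n(1:500)];   % indices of the balanced dataset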

HW0: Creating Your Dataset

2. IMDb has a lot of data that are not discrete, continuous, or binary-valued for the target function (category). Entities and their attributes:

  • Studio: name, country, list of movies (Studio –Made→ Movie)
  • Movie: title, genre, year, opening weekend BO receipts, list of actors/actresses, release season
  • Director/Producer: name, year of birth, list of movies (–Directed→ / –Produced→ Movie)
  • Actor: name, year of birth, gender, Oscar nominations, list of movies (–Acted in→ Movie)

HW0: Creating Your Dataset

3. Choose a Boolean or binary-valued target function (category):

  • Opening weekend box office receipts > $2 million
  • Movie is drama? (action, sci-fi, …)
  • Movies I like/dislike (e.g., TiVo)

HW0: Creating Your Dataset

4. How to transform the available attributes
   Other example attributes (select predictive features):

  • Movie
    • Average age of actors
    • Number of producers
    • Percent female actors

  • Studio
    • Number of movies made
    • Average movie gross
    • Percent of movies released in the US

HW0: Creating Your Dataset

  • Director/Producer
    • Years of experience
    • Most prevalent genre
    • Number of award-winning movies
    • Average movie gross

  • Actor
    • Gender
    • Has previous Oscar award or nominations
    • Most prevalent genre
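A hedged sketch of deriving one such feature (“average movie gross” per studio) in MATLAB, assuming parallel column vectors studioIDs and grosses with one entry per movie (names made up):

    % Group the per-movie grosses by studio and take each group's mean.
    [uStudios, ~, idx] = unique(studioIDs);
    avgGross = accumarray(idx, grosses, [], @mean);   % one value per studio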

HW0: Creating Your Dataset

David Jensen’s group at UMass used Naïve Bayes (NB) to predict the following, based on attributes they selected and a novel way of sampling from the data:

• Opening weekend box office receipts > $2 million
  • 25 attributes
  • Accuracy = 83.3%
  • Default accuracy = 56%

• Movie is drama?
  • 12 attributes
  • Accuracy = 71.9%
  • Default accuracy = 51%

• http://kdl.cs.umass.edu/proximity/about.html
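As a pointer (not part of HW0, and not how the UMass study was run): recent versions of MATLAB’s Statistics and Machine Learning Toolbox include a built-in Naïve Bayes classifier, so a baseline like the above can be sketched in a few lines (X and Y are assumed names for an N-by-D numeric feature matrix and a label vector):

    % Fit a Naive Bayes model and measure training-set accuracy.
    Mdl = fitcnb(X, Y);
    acc = 100 * mean(predict(Mdl, X) == Y);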

What Do You Think Machine Learning Means?

What is Learning?

“Learning denotes changes in the system that … enable the system to do the same task … more effectively the next time.”
  - Herbert Simon

“Learning is making useful changes in our minds.”
  - Marvin Minsky

Major Paradigms of Machine Learning

• Inducing functions from I/O pairs
  • Decision trees (e.g., Quinlan’s C4.5 [1993])
  • Connectionism / neural networks (e.g., backprop)
  • Nearest-neighbor methods
  • Genetic algorithms
  • SVMs

• Learning without a teacher
  • Conceptual clustering
  • Self-organizing systems
  • Discovery systems
  (Not in Mitchell’s textbook; we will spend 0-2 lectures on this. Also covered in CS 776.)

Major Paradigms of Machine Learning

• Improving a multi-step problem solver
  • Explanation-based learning
  • Reinforcement learning

• Using preexisting domain knowledge inductively (will be covered briefly)
  • Analogical learning
  • Case-based reasoning
  • Inductive/explanatory hybrids