Information Visualization in Data Mining S.T. Balke Department of Chemical Engineering and Applied...


Information Visualization in Data Mining

S.T. Balke
Department of Chemical Engineering and Applied Chemistry
University of Toronto

Motivation

Data visualization
– relies primarily on human cognition for value discovery;
– permits direct incorporation of human ingenuity and analytic capabilities into data mining;
– can very effectively deal with very large quantities of data;
– powerfully combines with machine-based discovery techniques.

Uses

Explorative Analysis
– Data cleaning
– Provide hypotheses

Confirmative Analysis
– Confirm or reject hypotheses

Presentation
– Communicate your work

http://www.alz.washington.edu/DATA2001/GERALD1/sld011.htm

Calculated Properties of the Anscombe Data Sets

mean of the x values = 9.0

mean of the y values = 7.5

equation of the least-squares regression line: y = 3 + 0.5x

sum of squares of the x values about their mean = 110.0

Calculated Properties of the Anscombe Data Sets

regression sum of squares (variation accounted for by x) = 27.5

residual sum of squares (about the regression line) = 13.75

correlation coefficient = 0.82

coefficient of determination = 0.67
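The properties listed above can be recomputed with a few lines of code. The following is a minimal sketch, not part of the original slides: the raw values of the four Anscombe data sets are typed in from Anscombe's published table as an assumption (the slides give only the summaries), and the names `QUARTET`, `summarize`, and `SUMMARIES` are hypothetical.

```python
from math import sqrt

# Anscombe's four data sets, entered here as an assumption;
# the slides show only the summary statistics, not the raw values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics named on the slides for one data set."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)   # sum of squares of x about its mean
    syy = sum((yi - mean_y) ** 2 for yi in y)   # total sum of squares of y
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx                           # least-squares slope
    intercept = mean_y - slope * mean_x         # least-squares intercept
    ss_reg = slope * sxy                        # regression sum of squares
    ss_res = syy - ss_reg                       # residual sum of squares
    r = sxy / sqrt(sxx * syy)                   # correlation coefficient
    return {
        "mean_x": mean_x, "mean_y": mean_y,
        "intercept": intercept, "slope": slope,
        "sxx": sxx, "ss_reg": ss_reg, "ss_res": ss_res,
        "r": r, "r2": r * r,
    }


SUMMARIES = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

if __name__ == "__main__":
    for name, stats in SUMMARIES.items():
        print(name, {key: round(value, 2) for key, value in stats.items()})
```

All four data sets give nearly identical summaries even though their scatterplots look very different, which is the reason for plotting the data rather than relying on the statistics alone.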

The Anscombe Data

Marey, 1885

Snow’s Cholera Map, 1855

http://pupgg.princeton.edu/disk20/anonymous/groth/lick/licknorth.gif

Graphical Excellence

Graphical displays should:
– show the data
– induce the viewer to think about the substance, not the methodology
– avoid distorting what the data says
– present many numbers in a small space
– make large data sets coherent
– encourage the eye to compare different pieces of data
– reveal the data at several levels of detail (broad overview to fine structure)
– serve a reasonably clear purpose: description, exploration, tabulation, or decoration
– be closely integrated with the statistical and verbal descriptions of the data set.

(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)

Graphical Excellence

– Gives the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.
– Nearly always multivariate.
– Requires telling the truth about the data.

(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)

Lie Factor=14.8

(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)

Lie Factor

\[
\text{Lie Factor} = \frac{\text{size of effect shown in graphic}}{\text{size of effect in data}}
\]

\[
\text{Lie Factor} = \frac{\dfrac{(5.3 - 0.6)}{0.6} \times 100}{\dfrac{(27.5 - 18.0)}{18} \times 100} = 14.8
\]

Require: 0.95 < Lie Factor < 1.05

(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
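The Lie Factor calculation can be checked numerically. This is a small sketch using the four numbers from the slide (0.6 and 5.3 for the graphic, 18.0 and 27.5 for the data); the function names `percent_change`, `lie_factor`, and `is_acceptable` are hypothetical and chosen only for illustration.

```python
def percent_change(first, second):
    """Size of an effect, expressed as a percentage change from first to second."""
    return (second - first) / first * 100.0


def lie_factor(graphic_first, graphic_second, data_first, data_second):
    """Ratio of the effect shown in the graphic to the effect present in the data."""
    shown = percent_change(graphic_first, graphic_second)
    actual = percent_change(data_first, data_second)
    return shown / actual


def is_acceptable(factor, low=0.95, high=1.05):
    """Apply the slide's requirement 0.95 < Lie Factor < 1.05."""
    return low < factor < high


# Numbers taken from the slide's worked example.
factor = lie_factor(0.6, 5.3, 18.0, 27.5)

if __name__ == "__main__":
    print(f"Lie Factor = {factor:.1f}, acceptable: {is_acceptable(factor)}")
```

A graphic whose drawn sizes change by the same percentage as the data has a Lie Factor of exactly 1 and passes the check.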

Using Area for One-Dimensional Data

Lie Factor = 2.8

(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
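Why area exaggerates one-dimensional data can be seen with a short calculation. The sketch below assumes a picture whose width and height are both scaled in proportion to the data value, so that its area grows with the square of the value. The numbers are illustrative only and are not taken from the figure on the slide; `area_lie_factor` is a hypothetical name.

```python
def area_lie_factor(data_first, data_second):
    """Lie Factor when both width and height of a symbol scale with the data.

    If the data grow by a factor k, the drawn area grows by k**2, so the
    effect shown is (k**2 - 1) while the effect in the data is (k - 1).
    """
    if data_first == data_second:
        raise ValueError("the data must change for an effect size to exist")
    k = data_second / data_first
    effect_in_data = k - 1.0
    effect_in_graphic = k ** 2 - 1.0
    return effect_in_graphic / effect_in_data


if __name__ == "__main__":
    # Illustrative values: the quantity doubles, the picture's area quadruples.
    print(area_lie_factor(100.0, 200.0))
```

Under this assumption the ratio simplifies to k + 1, so any change in the data is overstated by at least a factor of two when the values increase.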

More guidelines:

– The number of information-carrying (variable) dimensions depicted should not exceed the number of dimensions in the data.
– No legends: use labels on the graph.
– Graphics must not quote data out of context.

(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)

Data-Ink Ratio

\[
\text{Data-ink ratio} = \frac{\text{data-ink}}{\text{total ink used to print the graphic}}
\]

Data-ink ratio = proportion of a graphic’s ink devoted to the non-redundant display of data-information.

Data-ink ratio = 1.0 - (proportion of a graphic that can be erased without loss of data-information)

(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
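The two forms of the data-ink ratio give the same number, as a short sketch shows. The ink amounts here are hypothetical (for instance, counts of inked pixels) and are not measurements of any real graphic; both function names are assumptions made for illustration.

```python
def data_ink_ratio(data_ink, total_ink):
    """Data-ink divided by the total ink used to print the graphic."""
    if total_ink <= 0 or not 0 <= data_ink <= total_ink:
        raise ValueError("need 0 <= data_ink <= total_ink and total_ink > 0")
    return data_ink / total_ink


def data_ink_ratio_from_erasable(erasable_fraction):
    """1.0 minus the proportion that can be erased without losing data-information."""
    if not 0.0 <= erasable_fraction <= 1.0:
        raise ValueError("erasable_fraction must lie between 0 and 1")
    return 1.0 - erasable_fraction


if __name__ == "__main__":
    # Hypothetical graphic: 1200 of 4800 inked pixels carry data.
    print(data_ink_ratio(1200, 4800))            # direct form
    print(data_ink_ratio_from_erasable(0.75))    # erasable form, same graphic
```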

Maximize Data Density

\[
\text{data density of a graphic} = \frac{\text{number of entries in the data matrix}}{\text{area of data graphic}}
\]

(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
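As a quick worked example of the data density formula, the sketch below assumes a data matrix described by its row and column counts and a rectangular graphic measured in arbitrary length units. The sizes are hypothetical and `data_density` is an illustrative name.

```python
def data_density(n_rows, n_columns, width, height):
    """Entries in the data matrix per unit area of the data graphic."""
    if width <= 0 or height <= 0:
        raise ValueError("the graphic must have a positive area")
    return (n_rows * n_columns) / (width * height)


if __name__ == "__main__":
    # Hypothetical case: 11 (x, y) pairs plotted in a 4 x 3 unit graphic.
    density = data_density(n_rows=11, n_columns=2, width=4.0, height=3.0)
    print(f"{density:.2f} entries per square unit")
```

Shrinking the same plot to half the width and half the height quadruples the density, which is one way to act on the slide's advice as long as the plot stays legible.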

Beware Chartjunk

NO:

“Isn’t it remarkable that the computer can be programmed to draw like that.”

YES:

“My, what interesting data!”

(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)

How to Say Nothing with Information Visualization

http://www.crs4.it/~zip/13ways.html

– Never include a color legend.
– Avoid annotation.
– Never mention error characteristics of the visualization method.
– When in doubt, smooth.
– Don’t say how long it required to plot.
– Never compare your results with other data visualization techniques.
– Never cite references for the data.
– Claim generality but show results from a single data set.
– Use viewing angle to hide blemishes in 3D objects.

An Overview of Information Visualization Methods

http://www.informatik.uni-halle.de/~keim/tutorials.html

Methods of Interest

– Scatterplot Matrices
– Parallel Coordinates
– Pixel Oriented Methods
– Icon based Methods
– Dimensional Stacking
– Treemap

Assignment 1: see handout

Some websites of interest:

http://dmoz.org/Computers/Software/Databases/Data_Mining/Public_Domain_Software/

http://www.cs.man.ac.uk/~ngg/InfoViz/Projects_and_Products/Visualization/

Try a search at google.com using the following key words together:

name_of_method download software