CSE4334/5334 Data Mining
3 Data, Data Mining

Chengkai Li
Department of Computer Science and Engineering
University of Texas at Arlington
Fall 2018
(Slides partly courtesy of Pang-Ning Tan, Michael Steinbach and Vipin Kumar, and Jiawei Han, Micheline Kamber and Jian Pei)


Copyright ©2007-2017 The University of Texas at Arlington. All Rights Reserved.

Big Data

http://dilbert.com/strip/2012-07-29


Big Data

http://www.ibmbigdatahub.com/infographic/four-vs-big-data


Big Data: The 4 Vs

o Volume
o Variety
o Velocity
o Veracity


Volume: How much data is out there?


http://www.sciencedaily.com/releases/2013/05/130522085217.htm

http://www.storagenewsletter.com/rubriques/market-reportsresearch/ibm-cmo-study/


Variety: Types of Data

Structured data
o (relational) database tables
o CSV/TSV files

Semi-structured data
o XML, JSON, RDF

Unstructured data
o text data (documents, Web pages, short texts, e.g., social media)

Multimedia data
o images, videos, audio

Other types of data
o matrices, graphs, sequences, time series, spatio-temporal data
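
As a rough illustration of the first two categories, the sketch below reads one made-up set of records as CSV (structured: fixed columns) and as JSON (semi-structured: nested and optional fields). The field names and values are invented for this example and do not come from the slides.

```python
import csv
import io
import json

# Structured: every row has the same columns (hypothetical sample data).
csv_text = "id,name,city\n1,Alice,Arlington\n2,Bob,Dallas\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: records share a loose schema; fields may nest or be missing.
json_text = '[{"id": 1, "name": "Alice", "tags": ["student"]}, {"id": 2, "name": "Bob"}]'
records = json.loads(json_text)

# CSV values arrive as strings in fixed columns; JSON keeps types and nesting.
csv_cities = [row["city"] for row in rows]
json_tags = [rec.get("tags", []) for rec in records]

print(csv_cities)
print(json_tags)
```

Note that the second JSON record has no "tags" field at all, which a fixed-column table would have to represent as an empty cell.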


Velocity: Streaming Data
o Stock trades
o Highway sensors
o Weather data
o Social media
o Telephone calls
o Video streaming
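
One practical consequence of streaming data is that values often have to be processed as they arrive, without storing the whole history. The sketch below is one possible illustration, assuming a hypothetical stream of sensor readings; it keeps a running mean in constant memory. The function name and the sample values are invented for this example.

```python
from typing import Iterable, Iterator, Tuple


def running_mean(stream: Iterable[float]) -> Iterator[Tuple[int, float]]:
    """Yield (count, mean) after each arriving value, using O(1) memory."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: earlier values never need to be stored.
        mean += (value - mean) / count
        yield count, mean


# Hypothetical sensor readings arriving one at a time.
readings = [10.0, 20.0, 30.0, 40.0]
snapshots = list(running_mean(readings))
print(snapshots[-1])
```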


http://mashable.com/2012/06/22/data-created-every-minute/


Veracity: uncertain and imprecise data


Datasets
o Amazon Public Data Sets
o Data.gov
o Linked Open Data, Knowledge Bases, Encyclopedia
o Yahoo! Webscope
o Stanford Large Network Dataset Collection
o UCI Machine Learning Repository
o UCR Time Series Classification/Clustering
o Time Series Data Library http://robjhyndman.com/TSDL/
o KDnuggets Dataset List http://www.kdnuggets.com/datasets/index.html
o KDD Cup Datasets http://www.sigkdd.org/kddcup/index.php


Amazon Public Data Sets
http://aws.amazon.com/public-data-sets/

o NASA NEX: A collection of Earth science data sets maintained by NASA, including climate change projections and satellite images of the Earth's surface
o Common Crawl Corpus: A corpus of web crawl data composed of over 5 billion web pages
o 1000 Genomes Project: A detailed map of human genetic variation
o Google Books Ngrams: A data set containing Google Books n-gram corpuses
o US Census Data: US demographic data from 1980, 1990, and 2000 US Censuses
o Freebase Data Dump: A data dump of all the current facts and assertions in the Freebase system, an open database covering millions of topics


Data.gov
http://www.data.gov/ (137,608 datasets)
o Consumer Complaint Database
o U.S. International Trade in Goods and Services: Monthly report that provides national trade data including imports, exports, and balance of payments for goods and services.
o DTV Reception Maps
o Food Access Research Atlas: presents a spatial overview of food access indicators for low-income and other census tracts using different measures of supermarket...
o U.S. Hourly Precipitation Data
o Great Chile Earthquake of May 22, 1960
o Consumer Expenditure Survey
o Farmers Markets Geographic Data: longitude and latitude, state, address, name, and zip code of Farmers Markets in the United States
o Crimes - 2001 to present (City of Chicago)


Linked Data, Knowledge Bases, Encyclopedia
http://linkeddata.org/ (hundreds of datasets, billions of RDF triples)

o IMDB
o DBLP
o PubMed
o Wikipedia, DBpedia
o YAGO
o Freebase/Google Knowledge Graph
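
An RDF triple is a (subject, predicate, object) statement. The sketch below is only an in-memory illustration of that shape, with invented facts and a hypothetical `match` helper; real RDF data identifies resources with IRIs and is typically queried with SPARQL rather than with code like this.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical facts written as (subject, predicate, object).
triples: List[Triple] = [
    ("Arlington", "locatedIn", "Texas"),
    ("UT_Arlington", "locatedIn", "Arlington"),
    ("UT_Arlington", "type", "University"),
]


def match(s: Optional[str], p: Optional[str], o: Optional[str]) -> List[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# All "locatedIn" statements, regardless of subject and object.
located = match(None, "locatedIn", None)
print(located)
```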


Stanford Large Network Dataset Collection
http://snap.stanford.edu/data/
o Social networks: online social networks, edges represent interactions between people
o Communication networks: email communication networks with edges representing communication
o Citation networks: nodes represent papers, edges represent citations
o Collaboration networks: nodes represent scientists, edges represent collaborations (co-authoring a paper)
o Web graphs: nodes represent webpages and edges are hyperlinks
o Amazon networks: nodes represent products and edges link commonly co-purchased products
o Internet networks: nodes represent computers and edges communication
o Road networks: nodes represent intersections and edges roads connecting the intersections
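
Network datasets like these are commonly distributed as edge lists. As a small sketch of the citation-network case, the code below builds an adjacency list and counts incoming citations. The paper identifiers and edges are invented for illustration and are not taken from any SNAP dataset.

```python
from collections import defaultdict

# Hypothetical directed citation edges: (citing paper, cited paper).
edges = [("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p4", "p3")]

adjacency = defaultdict(list)  # paper -> papers it cites
in_degree = defaultdict(int)   # paper -> number of times it is cited
for src, dst in edges:
    adjacency[src].append(dst)
    in_degree[dst] += 1

# The most-cited paper is the node with the highest in-degree.
most_cited = max(in_degree, key=in_degree.get)
print(most_cited, in_degree[most_cited])
```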


Data in Every Application Area
o Business: e-commerce, transactions (retailers, banking, credit cards), ratings, reviews, stock trading, …
o Web, social media (YouTube, Flickr, …), and social networks (Facebook, Twitter, …)
o News
o Science: bioinformatics, scientific experiments, environment, climate, astronomy
o Logs and measurements
o Personal information: emails, calendars, digital photos, videos
o Transportation
o Telecommunication
o Education
o Entertainment (film, music, gaming, …)
o Sports
o Health care
o Crime, security


What is Data Mining?

Data mining (knowledge discovery from data)
o Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data

What is not Data Mining?
o Retrieving data rather than knowledge or patterns
o Results that are not interesting (trivial, explicit, known, useless)
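
To make the distinction concrete, here is a minimal sketch on invented market-basket transactions. The first step is plain retrieval (looking up records that match a known condition); the second discovers frequently co-occurring item pairs that were not named in advance. The data and the `min_support` threshold are assumptions chosen for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]

# Retrieval (not data mining): look up records matching a known condition.
with_milk = [t for t in transactions if "milk" in t]

# Mining: count every item pair and keep those that co-occur often.
pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)
min_support = 3  # assumed threshold: pair appears in at least 3 transactions
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_support}

print(len(with_milk))
print(frequent_pairs)
```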


Knowledge Discovery (KDD) Process
o Data mining plays an essential role in the knowledge discovery process


http://cacm.acm.org/magazines/1996/11/8517-the-kdd-process-for-extracting-useful-knowledge-from-volumes-of-data/abstract


KDD Process: A Typical View from ML and Statistics

Input Data -> Data Pre-Processing -> Data Mining -> Post-Processing

Data Pre-Processing
o Data integration
o Normalization
o Feature selection
o Dimension reduction

Data Mining
o Pattern discovery
o Association & correlation
o Classification
o Clustering
o Outlier analysis

Post-Processing
o Pattern evaluation
o Pattern selection
o Pattern interpretation
o Pattern visualization

This is a view from typical machine learning and statistics communities
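
The three stages can be sketched end to end on a toy example. The code below is one possible illustration, not the method from the slides: min-max normalization stands in for pre-processing, a simple z-score rule stands in for outlier analysis, and a text summary stands in for pattern interpretation. All function names, the threshold, and the input values are assumptions made for this sketch.

```python
from statistics import mean, pstdev
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Pre-processing: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def find_outliers(values: List[float], z_threshold: float = 2.0) -> List[int]:
    """Data mining: flag indices whose z-score exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > z_threshold]


def summarize(raw: List[float], outlier_idx: List[int]) -> List[str]:
    """Post-processing: turn flagged indices into readable statements."""
    return [f"record {i} (value {raw[i]}) looks unusual" for i in outlier_idx]


# Hypothetical input data with one obvious anomaly.
raw_data = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0]
normalized = min_max_normalize(raw_data)
outliers = find_outliers(normalized)
report = summarize(raw_data, outliers)
print(report)
```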


Data Mining: Confluence of Multiple Disciplines

Data Mining draws on:
o Machine Learning
o Statistics
o Applications
o Algorithm
o Pattern Recognition
o High-Performance Computing
o Visualization
o Database Technology


Data Mining Software

Free, open-source
o RapidMiner
o Weka: data mining tool in Java
o SCaVis: scientific computation and visualization, Java
o Orange: Python suite
o Scikit-learn: Python machine learning library
o NumPy/SciPy/IPython/mlpy (Python modules for scientific computing, scientific library, interactive computing, machine learning)
o R: statistical computing and graphics
o RattleGUI: data mining GUI using R
o Octave: numerical analysis
o Shogun: machine learning toolkit in C++

Text Mining Tools
o NLTK (NLP Toolkit): NLP suite for Python
o SenticNet API: sentiment analysis
o Stanford NLP software
o UIMA

Large-Scale Data Processing, Machine Learning
o Apache Mahout
o GraphLab
o MapReduce/Hadoop
o Spark
o Pregel/Giraph

Commercial Products
o Matlab
o Oracle Data Mining
o SAS
o IBM SPSS
o Microsoft SQL Server Analysis Services
o HP Vertica