
Background Material

Craig C. Douglas, University of Wyoming

[email protected]


Schedule

• Undergraduates
  o June 30, 2:00-6:00, Background material
  o July 2, 2:00-6:00, Data finding
  o July 4, 2:00-6:00, Data finding and machine learning
  o July 6, 2:00-6:00, Machine learning

• Graduates
  o July 1, 2:30-5:30, Introduction and data finding
  o July 3, 8:30-11:30, Data finding and machine learning
  o July 3, 2:30-5:30, Machine learning


Introduction


Useful References

• http://www.mgnet.org/~douglas/Classes/bigdata/2019su-index.html

• Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman, Mining of Massive Datasets, 2nd ed. (version 2.1), Stanford University, 2014. The most up-to-date version is online at http://www.mmds.org. I will lecture from the 3rd edition draft as well.

• Andriy Burkov, The Hundred-Page Machine Learning Book, http://themlbook.com/wiki/doku.php, 2019.


Useful References

• Wooyoung Kim, Parallel Clustering Algorithms: Survey, http://grid.cs.gsu.edu/~wkim/index_files/SurveyParallelClustering.html, 2009.

• Deep Learning exercises using TensorFlow, https://www.coursera.org/learn/intro-to-deep-learning/home/welcome
  o https://github.com/hse-aml/intro-to-dl


Useful Software

• TensorFlow
  o Version 1.13 is stable. Version 2.0.0-beta is not.
  o Anaconda or Miniconda environments
  o Additional Python packages: jupyter, matplotlib, pandas

• Tableau
• MapReduce, Spark, and workflow systems
• Many problems run 1000X faster on a GPU


Some Sources of Big Data

• Interactions with dynamic databases
• Internet data
• City or regional transportation flow control
• Environment and disaster management
• Oil/gas fields or pipelines, seismic imaging
• Government or industry regulation/statistics
• Closed circuit camera identification


Oil/Gas Pipelines

Picture courtesy of the Merriam-Webster Dictionary.


Pipeline Network Properties

• Pipe diameters range from 2 inches to 5 feet.

• Rarely straight and level.
• Contain
  – Possibly different grades of oil or gas simultaneously.
  – Pigs as separators.
  – Sensors (inside and outside).
• Not restricted to oil/gas pipelines (water, etc.).


1970’s Modeling

• Problem modeled mathematically based on time dependent, nonlinear coupled partial differential equations (two models).
  – Sensors on all pipeline components (recall the cartoon).
  – Distributed GRID computing with scattered phone booths:
    • 2 minicomputers, 4 array processors, a heat pump on top, and a U.S. nickel soldered in place to allow “free” calls for telemetry.
• Sensors provided data (temperature, pressure, and velocity) dynamically based on need and anomalies and controlled by the environment and running model.
• No central computing, just central and distributed control sites.
• 2,000 pieces of telemetry/minute in the complete KSA network (1978).


Current Modeling

• 3D math models of pipelines with topography.
• Central computing and fiber optic TCP/IP with Gigabit Ethernet backup near pipelines.
• Many more sensors, plus ones to measure pipe (shape) changes, internal pollutants, and external gas leakages.
• When the 1978 system was replaced in KSA in 1998, there was 100,000 times the telemetry/minute. In 2014, a tsunami of uncountable data.


Monitoring Site Evolution

• In the 1970s, a primitive center where “what if” scenarios were run, in parallel with regular monitoring, to keep pipelines from breaking.

• Now, large scale visualization is used to monitor pipelines in a multiscale framework. Individual high-resolution monitors (1080p and 4K+) are used for “what if” scenarios.

• Always trying to find anomalies in the data streams to avoid pipeline problems.


Computer Science Techniques


Hash Tables

• A hash table is a data structure with N buckets.
  – N is usually a prime number and may be quite large.
  – Each bucket contains data.
  – Accessed using a hash function Key = h(x).
    • h(x) must be inexpensive to evaluate.
    • Key is an index 0, 1, …, N-1 into the hash table.
    • Data x can be found only in bucket h(x).
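A minimal Python sketch of this layout, assuming strings as data and a simple modular hash function (the bucket count and the function h below are illustrative choices, not from the slides):

```python
# A hash table as a fixed array of N buckets; here each bucket is a Python list.
# h(x) maps each item to the index of the only bucket where it can be stored.

N = 7  # number of buckets; a prime, as suggested above


def h(x):
    """Inexpensive hash function: sum of character codes modulo N."""
    return sum(ord(c) for c in x) % N


table = [[] for _ in range(N)]  # buckets 0, 1, ..., N-1

for item in ["abc", "def", "acd"]:
    table[h(item)].append(item)  # each item goes only into bucket h(item)

print(table)
```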


Storing a Hash Table

• If the data is very simple (numbers or short strings), then a spreadsheet may be optimal.

• If the data is arbitrary, then dynamically allocated memory techniques are common.
  – Common to use linked lists inside of each bucket.
  – Can be error prone.
  – Must remember to deallocate all of the hash table when done, which can also be error prone.
  – Must decide if duplicates are allowed in a bucket.


Common Data Structure

[Diagram: an array of buckets 0, 1, 2, …, N-2, N-1, each pointing to a null-terminated list holding the data for that bucket.]

Variations:
• doubly linked lists
• nested tables
• spreadsheet


Hash Table Functionalities

• Search
• Add
  – Uses Search
• Delete
  – Uses Search
• Modify (optional)
  – Uses Search
• Change order of data in a bucket (optional)
  – Uses Search and possibly Delete and Add


Functionality

• Search(x)
  – Compute Key = h(x)
  – For each data item stored in bucket Key, compare x to the data.
    • If a match, then return something that allows the data to be accessed.
    • If there is no match, return a Failure notice.


Functionality

• Add(x)
  – F = Search(x)
  – If F ≠ Failure, then
    • If no duplicates are allowed, return something that allows the data to be accessed (and that it is already in the hash table).
  – Otherwise,
    • Probably make a copy of x and add it to bucket h(x).
      – Usually added as the first or last element in bucket h(x).
      – Usually have to modify the linked list for bucket h(x).


Functionality

• Delete(x)
  – F = Search(x)
  – If F ≠ Failure, then
    • Remove the data from bucket h(x). This usually means deleting the copy of x and relinking inside the linked list. There may be other bookkeeping, too.
    • Return Success.
  – Otherwise,
    • Return Failure.
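A short Python sketch (not from the slides) of Search, Add, and Delete on the bucket-of-lists layout above; duplicates are disallowed here, which is one of the design choices mentioned earlier, and the hash function is a placeholder:

```python
class HashTable:
    """Hash table with N buckets, each bucket a list, supporting Search/Add/Delete."""

    def __init__(self, n_buckets=7):
        self.n = n_buckets
        self.buckets = [[] for _ in range(n_buckets)]

    def h(self, x):
        # Inexpensive hash function; any cheap mapping to 0..n-1 would do.
        return hash(x) % self.n

    def search(self, x):
        """Return (bucket, position) where x is stored, or None as the Failure notice."""
        key = self.h(x)
        for i, item in enumerate(self.buckets[key]):
            if item == x:
                return key, i
        return None

    def add(self, x):
        """Add x to bucket h(x) unless it is already present (no duplicates allowed)."""
        found = self.search(x)
        if found is not None:
            return found                       # already in the table
        key = self.h(x)
        self.buckets[key].append(x)            # added as the last element of the bucket
        return key, len(self.buckets[key]) - 1

    def delete(self, x):
        """Remove x from its bucket; return True on Success, False on Failure."""
        found = self.search(x)
        if found is None:
            return False
        key, i = found
        del self.buckets[key][i]               # the Python list relinks for us
        return True
```

For example, t = HashTable(); t.add("abc") stores "abc" in bucket h("abc"), t.search("abc") returns its location, and t.delete("abc") returns True.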


Simple Examples

• Dataset D consists of strings of length exactly 3 over the letters a, b, c, …, x, y, z.

• We encode each letter by 00, 01, 02, ..., 23, 24, 25. So, abz is 000125 = 125.

• Consider two hash functions:
  – h1(x) = x mod 7
  – h2(x) = leading encoded letter in x

• We get two very different hash tables.


Example Dataset D

• D = { abc, def, acd, zaa, bbb, bzq, zxw, faq, cap, eld, ssa, bab }, or encoded

• D = { 102, 30405, 203, 250000, 10101, 12516, 252322, 50016, 20015, 41103, 181800, 10001 }
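A small Python sketch (not part of the slides) that reproduces this encoding and evaluates both hash functions on D; the helper names encode, h1, and h2 are chosen here for illustration:

```python
D = ["abc", "def", "acd", "zaa", "bbb", "bzq", "zxw", "faq", "cap", "eld", "ssa", "bab"]


def encode(s):
    """Encode a 3-letter string as an integer, two digits per letter (a=00, ..., z=25)."""
    return int("".join(f"{ord(c) - ord('a'):02d}" for c in s))


def h1(x):
    return x % 7          # 7 buckets


def h2(x):
    return x // 10000     # leading encoded letter (works for length-3 strings)


for s in D:
    x = encode(s)
    print(f"{s}  encoded={x}  h1={h1(x)}  h2={h2(x)}")
```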


h1(x) for D

• The number of buckets is 7 (a prime).
• This is not necessarily a well balanced hash table since too many members of D go into bucket 0.
• We can store the hash table using linked lists.

x        h1(x)     x        h1(x)     x        h1(x)
102      4         30405    4         203      0
250000   2         10101    0         12516    0
252322   0         50016    1         20015    2
41103    6         181800   3         10001    5


Hash Table for h1(x)

Buckets   Data for each bucket
0         203 → 10101 → 12516 → 252322
1         50016
2         250000 → 20015
3         181800
4         102 → 30405
5         10001
6         41103


h2(x) for D

• The number of buckets is 26 (not a prime).
• This is a very different distribution of data than for h1(x) and more balanced for our particular D.
• We can store it as a table or spreadsheet.

x        h2(x)     x        h2(x)     x        h2(x)
102      0         30405    3         203      0
250000   25        10101    1         12516    1
252322   25        50016    5         20015    2
41103    4         181800   18        10001    1


Hash Table for h2(x)

key   value    value    value
0     102      203
1     10101    12516    10001
2     20015
3     30405
4     41103
5     50016
18    181800
25    250000   252322

(Keys 6-17 and 19-24 are empty.)


Fracking Data Example

• Open database maintained by the Pennsylvania state government for the fractured oil and gas wells in the Marcellus Basin.

• About 8,000 wells have been drilled, and information about each is maintained in this database.

• Each state in the United States has at least one public database about fracking wells.

• 15.3 million Americans live within 1 mile (1.6 km) of a well drilled since 2000.

• Spreadsheets in the comma-separated values format (.csv) or PDF are common.


Fracking Data File Information

• Each file contains information for a period of time during 2000-2014
  o Locations of wells
  o Owner of property
  o Approximate latitude and longitude of each well
  o Drilling company
  o Production information
    § Potential production
    § Actual production (units: barrels for oil, 1000 cubic feet for gas)
    § Active/Inactive
  o Much more information, with some cells blank


Interesting Questions

• What are the production curves?
  o Are they uniform in regions or do they vary a lot?

• How long is there a good payout? (0, 12, 39-40, …, 120 months?)

• Are there some drillers whose wells are more likely to not be in production after some period of time?

• Where are clusters of wells?
• How do you visualize the data?
• How do you put the data into the right format in order to ask the right questions and get answers quickly?


Data Files

• Approximately 574 MB of files.
• First things to do:
  o Determine how to use the data (Excel, MongoDB, Hadoop, Matlab, R, etc.).
  o Use the data to answer some simple, but interesting questions (see the sketch after this list).
  o Visualize the results (Excel, Matlab, R, Tableau, etc.).
• Thereafter,
  o Determine how to answer general, complex questions.
  o Use a general database approach that uses all of your computer’s cores and GPUs.
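As a sketch of those first steps, a few lines of pandas; the file name wells.csv and the column names below are hypothetical stand-ins, since the actual column names in the Pennsylvania database are not listed in these slides:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical CSV export from the state database: one row per well per report.
wells = pd.read_csv("wells.csv")

# A simple, but interesting question: which drillers report the most gas production?
by_driller = wells.groupby("drilling_company")["gas_production_mcf"].sum()
print(by_driller.sort_values(ascending=False).head(10))

# Visualize a production curve for one (hypothetical) well.
one_well = wells[wells["well_id"] == "ABC-123"].sort_values("report_date")
one_well.plot(x="report_date", y="gas_production_mcf", title="Production curve")
plt.show()
```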
