MapReduce Theory and Practice http://net.pku.edu.cn/~course/cs402/2010/ Peng Bo (彭波) [email protected] School of Electronics Engineering and Computer Science, Peking University 7/15/2010


Page 1: MapReduce Theory and Practice

MapReduce

Theory and Practice

http://net.pku.edu.cn/~course/cs402/2010/

Peng Bo (彭波) [email protected]

School of Electronics Engineering and Computer Science, Peking University

7/15/2010

Page 2: MapReduce Theory and Practice

Last Course Review

Page 3: MapReduce Theory and Practice

Quiz

What are they?

1. Data

   1. Bit

   2. Byte

2. Data types

3. Information

Page 4: MapReduce Theory and Practice

Data

The term data refers to groups of information that represent the qualitative or quantitative attributes of a variable or set of variables.

Data (plural of "datum", which is seldom used) are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables.

Data are often viewed as the lowest level of abstraction from which information and knowledge are derived.

Raw data are unprocessed numbers, characters, images, or other outputs from devices that collect information by converting physical quantities into symbols.

Page 5: MapReduce Theory and Practice

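Bit

A bit is a single binary digit, the smallest unit of information. “Bit” is short for “binary digit”.

Suppose an event occurs in one of two ways, A or B, each with probability 0.5. One bit can then record which of A or B occurred. For example, a bit can represent:

a simple positive or negative sign

a switch with two states (such as a light switch)

a transistor being on or off

the presence or absence of voltage on a wire

an abstract logical yes or no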

Page 6: MapReduce Theory and Practice

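Byte

A byte is eight bits. The name is sometimes expanded as “Binary Term”, although it is usually traced to a deliberate respelling of “bite”. The byte is the usual unit for measuring computer information, regardless of the type of data being stored.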

Page 7: MapReduce Theory and Practice

History of “Information”

Latin origin: a representation implanted in the mind -> idea

Language and coding: hide information in messages and then decode them, for example Morse code.

Mathematics: in his work on channel transmission, Shannon defined the amount of information in a message as the negative base-2 logarithm of its probability of occurrence at the source, measured in bits.

Logic and linguistics: a communication-oriented sense of information that involves semantic meaning and knowledge.

Society: information as something that is contained in the message used to inform. “Information is the tennis ball of communication.”
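Shannon's measure can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function name `information_bits` is an illustrative assumption. It uses the standard form -log2(p) for a message that occurs with probability p, so an event with two equally likely outcomes A and B carries exactly one bit.

```python
import math

def information_bits(p: float) -> float:
    """Self-information, in bits, of a message that occurs with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

fair_choice = information_bits(0.5)     # two equally likely outcomes, A or B
one_of_256 = information_bits(1 / 256)  # one of 256 equally likely byte values
print(fair_choice, one_of_256)
```

The less likely a message is, the more bits it carries: picking one value out of 256 equally likely ones takes eight bits, which matches the size of a byte.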

Page 8: MapReduce Theory and Practice


Page 9: MapReduce Theory and Practice

How much data?

Google processes 20 PB a day (2008)

Wayback Machine has 3 PB + 100 TB/month (3/2009)

Facebook has 2.5 PB of user data + 15 TB/day (4/2009)

eBay has 6.5 PB of user data + 50 TB/day (5/2009)

CERN’s LHC will generate 15 PB a year (??)

“640K ought to be enough for anybody.”

Page 10: MapReduce Theory and Practice

“We are living in exponential times”

Page 11: MapReduce Theory and Practice

Information Overload

Media theorist Neil Postman spoke to the German Informatics Society in 1990, claiming that we are informing ourselves to death. He argued that the development of computer technology is not as positive as it has been heralded to be. With our focus on technology, we are forfeiting our humanity. We are drowning in information that contains empty promises of improving our lives (Postman 1990).

Page 12: MapReduce Theory and Practice

How do we cope with information overload?

Page 13: MapReduce Theory and Practice

What does it matter to ME?!

What would you do with 1,000 PCs, or even 100,000 PCs?

Page 14: MapReduce Theory and Practice

Cloud is coming…

Google alone has 450,000 systems running across 20 datacenters, and Microsoft's Windows Live team is doubling the number of servers it uses every 14 months, which is faster than Moore's Law.

“Data Center is a Computer”

Parallelism everywhere

Massive, Scalable, Reliable

Resource Management

Data Management

Programming Model & Tools

Page 15: MapReduce Theory and Practice

What’s MapReduce?

Parallel/Distributed Computing Programming Model

Input -> split -> shuffle -> output

Page 16: MapReduce Theory and Practice

Word Frequencies in Web Pages

Input: one document per record.

The user implements a map function whose input is:

key = document URL

value = document contents

map outputs (potentially many) key/value pairs: for each occurrence of a word in the document, it emits one record <word, “1”>.
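A minimal Python sketch of such a map function, assuming words are separated by whitespace. The name `map_word_count` and the sample URL are hypothetical; real MapReduce libraries define their own interfaces.

```python
from typing import Iterator, Tuple

def map_word_count(url: str, contents: str) -> Iterator[Tuple[str, str]]:
    """Emit one <word, "1"> record per word occurrence in the document."""
    # The key (url) is not needed to count words; it is kept to match the
    # <document URL, document contents> input described on the slide.
    for word in contents.split():  # assumption: whitespace-separated words
        yield (word, "1")

pairs = list(map_word_count("http://example.com/a", "to be or not to be"))
print(pairs)
```

Note that the function emits a separate record for every occurrence, so a repeated word such as "to" appears more than once in the output.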

Page 17: MapReduce Theory and Practice

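Example continued:

The MapReduce runtime system (library) gathers all records with the same key together (shuffle/sort).

The user implements a reduce function that computes over the values associated with one key.

Here the computation is a sum.

reduce outputs <key, sum>.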
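The shuffle and reduce steps can be sketched in Python as follows. This is an in-memory illustration under stated assumptions, not how an actual MapReduce runtime is implemented: `shuffle` and `reduce_sum` are hypothetical names, and the input is a hand-written list of <word, "1"> records of the kind a map function might emit.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def shuffle(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group values by key and order the keys, standing in for shuffle/sort."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_sum(key: str, values: List[str]) -> Tuple[str, int]:
    """Sum the counts for one key and return <key, sum>."""
    return (key, sum(int(v) for v in values))

# Hand-written sample records (assumption: produced by an earlier map step).
records = [("to", "1"), ("be", "1"), ("or", "1"), ("not", "1"), ("to", "1"), ("be", "1")]
counts = [reduce_sum(key, values) for key, values in shuffle(records).items()]
print(counts)
```

Because each key is reduced independently of the others, a real system can run the reduce calls for different keys on different machines.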

Page 18: MapReduce Theory and Practice

Homework Reading

Page 19: MapReduce Theory and Practice

Checklist

What is the title?

What is the main point of view?

What had the most impact on you?

Page 20: MapReduce Theory and Practice

Introduction to Distributed System Design

How many times does “physicist” occur in this document?

Tell me something about Remote Procedure Calls

Tell me something about the types of failures that can occur in a distributed system

Page 21: MapReduce Theory and Practice

Introduction to Parallel Programming and MapReduce

The MASTER/WORKER technique: approximating pi

MapReduce is an abstraction that allows Google engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing and fault tolerance.
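The master/worker technique for approximating pi can be sketched as below. This is a hedged illustration, not the reading's own code: it assumes a Monte Carlo estimate (counting random points that fall inside the unit quarter circle), uses Python threads from `concurrent.futures` as stand-in workers, and the names `worker` and `master` and the sample counts are made up for the example.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def worker(task):
    """Count random points in the unit square that land inside the quarter circle."""
    seed, samples = task
    rng = random.Random(seed)  # per-worker generator so runs are repeatable
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def master(num_workers: int = 4, samples_per_worker: int = 50_000) -> float:
    """Split the work into tasks, hand one to each worker, and combine the counts."""
    tasks = [(seed, samples_per_worker) for seed in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_hits = list(pool.map(worker, tasks))
    total = num_workers * samples_per_worker
    # The quarter circle covers pi/4 of the unit square.
    return 4.0 * sum(partial_hits) / total

pi_estimate = master()
print(pi_estimate)
```

Threads keep the sketch short, but in standard CPython they do not speed up CPU-bound work like this; a process pool or separate machines would be needed for real parallelism. The structure is the same either way: the master only splits and combines, and the workers never talk to each other.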

Page 22: MapReduce Theory and Practice

End