MapReduce
Theory and Practice
http://net.pku.edu.cn/~course/cs402/2010/
彭波 (Peng Bo)
[email protected]
School of Electronics Engineering and Computer Science, Peking University
7/15/2010
Last Course Review
Quiz
What are they?
1. Data
   1. Bit
   2. Byte
2. Data types
3. Information
Data
The term data refers to groups of information that represent the qualitative or quantitative attributes of a variable or set of variables.
Data (plural of "datum", which is seldom used) are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables.
Data are often viewed as the lowest level of abstraction from which information and knowledge are derived.
Raw data are unprocessed numbers, characters, images, or other outputs from devices that collect information by converting physical quantities into symbols.
Bit
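A bit, also called a binary digit, is one digit of a binary number and the smallest unit of information. “Bit” is short for “binary digit”.
Suppose an event occurs as either A or B, and A and B are equally likely, each with probability 0.5. Then one bit can represent which of A or B occurred. For example:
A bit can represent a simple positive or negative sign
A switch with two states (such as a light switch)
A transistor being on or off
The presence or absence of voltage on a wire
An abstract logical yes or no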
Byte
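One byte is eight bits. It is the usual unit for measuring computer information, regardless of the type of data being stored. The name is often expanded as “Binary Term”, but that is a backronym: Werner Buchholz coined “byte” in 1956 as a deliberate respelling of “bite”.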
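As a small illustration of the bit and byte definitions above, the following Python sketch prints the eight bits of one byte and shows that a byte count measures storage regardless of what the data represents. The helper names `to_bits` and `size_in_bits` and the sample values are chosen here for illustration only.

```python
# One byte holds 8 bits, so it can take 2**8 = 256 distinct values.
BITS_PER_BYTE = 8


def to_bits(byte_value: int) -> str:
    """Return the 8-character binary string for a value in 0..255."""
    if not 0 <= byte_value <= 255:
        raise ValueError("a byte must be in the range 0..255")
    return format(byte_value, "08b")


def size_in_bits(data: bytes) -> int:
    """Storage size in bits, independent of what the bytes mean."""
    return len(data) * BITS_PER_BYTE


if __name__ == "__main__":
    print(to_bits(65))                          # the byte for ASCII "A"
    print(size_in_bits("A".encode("utf-8")))    # one ASCII character
    print(size_in_bits("位".encode("utf-8")))   # one Chinese character, several bytes
```

The same unit applies to text, numbers, or images; only the number of bytes changes.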
History of “Information”
Latin origin: a representation implanted in the mind -> idea
Language and coding: hide information in messages and then decode them, e.g. Morse code
Mathematics: in his work on channel transmission, Shannon defined the information content of a message as the negative base-2 logarithm of its probability of occurring in the source, -log2(p), measured in bits.
Logic and linguistics: the communication-oriented sense of information involves semantic meaning and knowledge
Society: information as something that is contained in the message used to inform. “Information is the tennis ball of communication.”
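Shannon's definition in the Mathematics item above can be checked numerically. This is a minimal sketch, assuming a message that occurs with probability p carries -log2(p) bits; the function name `self_information` is a label chosen here, not something from the lecture.

```python
import math


def self_information(p: float) -> float:
    """Information content, in bits, of a message with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(p)


if __name__ == "__main__":
    # Two equally likely outcomes (A or B, p = 0.5) need exactly one bit.
    print(self_information(0.5))
    # One of 256 equally likely values needs 8 bits, i.e. one byte.
    print(self_information(1 / 256))
```

Rarer messages carry more information: halving the probability adds one bit.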
How much data?
Google processes 20 PB a day (2008)
Wayback Machine has 3 PB + 100 TB/month (3/2009)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
CERN’s LHC will generate 15 PB a year (??)
“640K ought to be enough for anybody.”
“We are living in exponential times”
Information Overloading
Media theorist Neil Postman spoke to the German Informatics Society in 1990, claiming that we are informing ourselves to death. He argued that the development of computer technology is not as positive as it has been heralded to be. With our focus on technology, we are forfeiting our humanity. We are drowning in information that contains empty promises of improving our lives (Postman 1990).
How do we cope with information overload?
What does this have to do with ME?!
What would you want to do with 1,000 PCs, or even 100,000 PCs?
Cloud is coming…
Google alone has 450,000 systems running across 20 datacenters, and Microsoft's Windows Live team is doubling the number of servers it uses every 14 months, which is faster than Moore's Law
“Data Center is a Computer”
Parallelism everywhere
Massive, Scalable, Reliable
Resource Management
Data Management
Programming Model & Tools
What’s MapReduce?
Parallel/Distributed Computing Programming Model
Input -> split -> shuffle -> output
Word Frequencies in Web pages
Input: one document per record
The user implements a map function whose input is
key = document URL
value = document contents
map outputs (potentially many) key/value pairs. For each occurrence of a word in the document, it emits one record <word, “1”>
Example continued:
The MapReduce runtime system (library) gathers all records with the same key together (shuffle/sort)
The user implements a reduce function that computes over the values for one key
Here: sum them
Reduce outputs <key, sum>
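The word-frequency example can be sketched on a single machine in Python. This is only an illustration of the programming model, not the API of Google's MapReduce or of Hadoop: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, the shuffle is simulated with an in-memory dictionary, words are split on whitespace, and counts are emitted as integers rather than the string “1”.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(url: str, contents: str) -> Iterator[Tuple[str, int]]:
    """Map: emit <word, 1> for every word occurrence in one document."""
    for word in contents.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce: sum all values collected for one key."""
    return word, sum(counts)


def run_mapreduce(documents: Dict[str, str]) -> Dict[str, int]:
    """Toy single-process driver: map, shuffle/sort, then reduce."""
    # Shuffle: group every intermediate value under its key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for url, contents in documents.items():
        for word, one in map_fn(url, contents):
            groups[word].append(one)
    # Sort the keys, then reduce each group to <key, sum>.
    return dict(reduce_fn(word, groups[word]) for word in sorted(groups))


if __name__ == "__main__":
    docs = {
        "http://example.com/a": "the quick brown fox",
        "http://example.com/b": "the lazy dog and the fox",
    }
    print(run_mapreduce(docs))
```

In a real system the map calls run on many machines and the shuffle moves data over the network; the user still writes only the two small functions.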
Homework Reading
Checklist
What’s the title?
What’s the main point of view?
What had the most impact on you?
Introduction to Distributed System Design
How many times does “physicist” occur in this document?
Tell me something about Remote Procedure Calls
Tell me something about the types of failures that can occur in a distributed system
Introduction to Parallel Programming and MapReduce
Master/worker technique: approximating pi
MapReduce is an abstraction that allows Google engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing and fault tolerance.
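The master/worker approach to approximating pi mentioned above can be sketched as follows. This is one possible reading, assuming a Monte Carlo estimate: each worker counts random points that land inside a quarter circle, and the master combines the partial counts. The names `worker` and `master`, the seeds, and the sample sizes are all illustrative choices, and a thread pool stands in for a cluster of machines.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List


def worker(seed: int, samples: int) -> int:
    """Worker: count random points in the unit square that fall inside
    the quarter circle of radius 1."""
    rng = random.Random(seed)  # private generator, so tasks are independent
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def master(num_workers: int = 4, samples_per_worker: int = 50_000) -> float:
    """Master: hand one task to each worker, then combine the partial counts."""
    seeds: List[int] = list(range(num_workers))
    sizes: List[int] = [samples_per_worker] * num_workers
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_hits = list(pool.map(worker, seeds, sizes))
    total = num_workers * samples_per_worker
    # Area of the quarter circle / area of the square = pi / 4.
    return 4.0 * sum(partial_hits) / total


if __name__ == "__main__":
    print(master())
```

Threads keep the sketch runnable anywhere, but in standard CPython they do not speed up CPU-bound work; swapping in `concurrent.futures.ProcessPoolExecutor` is one way to get real parallelism on a single machine.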
End