When big data meet python @ COSCUP 2012

DESCRIPTION

Big Data involves several issues: data collection, storage, computing, analysis, and visualization. Python is a popular scripting language with good code readability, which makes it suitable for fast development. In these slides, the author shares how to address Big Data issues using open source Python tools.

Page 2: When big data meet python @ COSCUP 2012

2012

About Me

• 賴弘哲 (Jimmy Lai)

• Interests: Data mining, Machine Learning, Natural Language Processing, Distributed Computing, Python

• LinkedIn profile: http://goo.gl/XTEM5

• Currently works at Oxygen Intelligence Taiwan Limited (引京聚點知識結構搜索公司), doing semantic analysis of big data

Page 3: When big data meet python @ COSCUP 2012

Outline

1. Big Data

a. Concept

b. Technical issues

2. Big Data + Python

a. Related open source tools

b. Example

Page 4: When big data meet python @ COSCUP 2012

Benefits of Big Data

1. Creating transparency

2. Enabling experimentation to discover needs, expose variability, and improve performance

3. Segmenting populations to customize actions

4. Replacing/supporting human decision making with automated algorithms

5. Innovating new business models, products, and services

(May 2011). Big Data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute.

e.g. http://www.data.gov/

Shortage of talent with deep data analysis skills

Page 5: When big data meet python @ COSCUP 2012

Initiative from the White House

• (Mar 2012) Big Data Research and Development Initiative, the White House.

• National Science Foundation encourages education on Big Data.

• The government invests in developing state-of-the-art technologies, harnessing those technologies, and expanding the workforce for Big Data.

Page 6: When big data meet python @ COSCUP 2012

Big Data Issues

Collecting

– User Generated Content

– Machine Generated Data

Storage

Computing

Analysis

Visualization

Page 7: When big data meet python @ COSCUP 2012

Big Data Techniques: Collecting

• Crawler

– Collect raw data

– E.g. Heritrix, Nutch

• Scraping

– Parse information from raw data

– E.g. Yahoo! Pipes, Scrapy
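The scraping step above (parsing information out of raw data) can be sketched with the Python standard library alone. This is a minimal illustration, not Scrapy's API: the class name `LinkTitleParser` and the sample page `RAW_HTML` are hypothetical, and a real crawler would supply the downloaded HTML.

```python
from html.parser import HTMLParser


class LinkTitleParser(HTMLParser):
    """Collect (href, link text) pairs from raw HTML."""

    def __init__(self):
        super().__init__()
        self.links = []      # extracted (href, text) records
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> tag.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical raw page, standing in for what a crawler would download.
RAW_HTML = """
<html><body>
  <a href="/post/1">Beef noodle review</a>
  <a href="/post/2">Night market <b>snacks</b></a>
</body></html>
"""

parser = LinkTitleParser()
parser.feed(RAW_HTML)
links = parser.links
print(links)
```

The crawler and the scraper stay separate here: the parser only sees a string, so it does not matter how the raw data was collected.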

Page 8: When big data meet python @ COSCUP 2012

Big Data Techniques: Storage

• Big Table

– Distributed key-value storage

– E.g. HBase, Cassandra

• NoSQL

– Does not use SQL for manipulation

– Does not use the relational database model

– E.g. MongoDB, Redis, CouchDB

Page 9: When big data meet python @ COSCUP 2012

Big Data Techniques: Computing

• Batch

– MapReduce

– E.g. Hadoop

• Real-time

– Stream processing

– E.g. S4, Storm
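As a rough illustration of the batch MapReduce model named above, the following single-process sketch counts words in three phases: map, shuffle, reduce. It is not Hadoop code; the function names and the two input lines are made up for the example, and a real framework would run the phases in parallel across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


# Hypothetical input records.
lines = ["big data meet python", "python meet data"]

pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```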

Page 10: When big data meet python @ COSCUP 2012

Big Data Techniques: Analysis

• Data mining – Weka

• Machine learning – scikit-learn

• Natural language processing – NLTK, Stanford NLP

• Statistics – R

Page 11: When big data meet python @ COSCUP 2012

Big Data Techniques: Visualization

• Abstract

• Interactive

• E.g. Processing, Gephi, D3.js

Page 12: When big data meet python @ COSCUP 2012

Why Python?

• Good code readability for fast development.

• Scripting language: less code means higher productivity.

• Fast growing among open source communities.

– Commit statistics from ohloh.net

Page 13: When big data meet python @ COSCUP 2012

When Big Data meet Python

• Collecting (User Generated Content, Machine Generated Data)

– Scrapy: scraping framework

• Storage

– PyMongo: Python client for MongoDB

• Computing

– Hadoop streaming: Linux pipe interface

– Disco: lightweight MapReduce in Python

• Analysis

– Pandas: data analysis/manipulation

– Statsmodels: statistics

– NLTK: natural language processing

– Scikit-learn: machine learning

• Visualization

– Matplotlib: plotting

– NetworkX: graph visualization

Diagram label: Infrastructure

Page 14: When big data meet python @ COSCUP 2012

When Big Data meet Python

Scrapy: web scraping framework

• Simple and extensible

• Components:

– Scheduler

– Downloader

– Spider (scraper)

– Item pipeline
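How the four components listed above might hand work to each other can be sketched in plain Python. This is a toy model under stated assumptions, not Scrapy's actual classes or API: `FAKE_WEB`, `downloader`, `spider`, `item_pipeline`, and `crawl` are hypothetical names, and the "web" is an in-memory dictionary.

```python
from collections import deque
import re

# Hypothetical in-memory "web": URL -> raw HTML.
FAKE_WEB = {
    "http://example.com/": '<a href="http://example.com/a">A</a> <h1>Home</h1>',
    "http://example.com/a": "<h1>Page A</h1>",
}


def downloader(url):
    """Downloader: fetch the raw page for a URL (here, a dictionary lookup)."""
    return FAKE_WEB.get(url, "")


def spider(url, html):
    """Spider: parse a page into items and follow-up URLs."""
    items = [{"url": url, "title": t} for t in re.findall(r"<h1>(.*?)</h1>", html)]
    next_urls = re.findall(r'href="(.*?)"', html)
    return items, next_urls


def item_pipeline(item):
    """Item pipeline: clean each item before it is stored."""
    item["title"] = item["title"].strip().lower()
    return item


def crawl(start_url):
    """Scheduler: a queue of pending URLs plus a set of URLs already seen."""
    queue, seen, stored = deque([start_url]), {start_url}, []
    while queue:
        url = queue.popleft()
        items, next_urls = spider(url, downloader(url))
        stored.extend(item_pipeline(item) for item in items)
        for nxt in next_urls:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return stored


results = crawl("http://example.com/")
print(results)
```

Keeping each stage a separate function mirrors the "simple and extensible" point: any one stage can be swapped without touching the others.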

http://scrapy.org/

Page 15: When big data meet python @ COSCUP 2012

When Big Data meet Python

MongoDB: NoSQL database

• PyMongo: client for Python

• Document (JSON)-oriented

• No schema

• Scalable:

– Auto-sharding

– Replica set

• File storage

• MapReduce aggregation
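The "document-oriented, no schema" idea can be illustrated without a database server. The sketch below uses only the standard library and is not PyMongo code: the sample documents and the helper `find` are hypothetical, loosely imitating the query-by-example style of a document store.

```python
import json

# A "collection" is just a list of documents; no schema is declared up front.
collection = [
    {"name": "Noodle House", "city": "Taipei", "rating": 4.5},
    {"name": "Night Market Stall", "tags": ["snack", "cheap"]},  # different fields
]


def find(docs, query):
    """Return documents whose fields equal every key/value in the query."""
    return [d for d in docs if all(d.get(k) == v for k, v in query.items())]


taipei = find(collection, {"city": "Taipei"})
print(json.dumps(taipei, ensure_ascii=False))
```

Documents with missing fields simply do not match a query on those fields, which is the practical consequence of having no fixed schema.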

http://www.mongodb.org/

Page 16: When big data meet python @ COSCUP 2012

When Big Data meet Python

Disco: lightweight MapReduce in Python

• Distributed computing:

– MapReduce

– Disco distributed file system

• Write code in Python:

– Easy and fast to profile

– Easy and fast to debug

http://discoproject.org/

Page 17: When big data meet python @ COSCUP 2012

When Big Data meet Python

pandas

• Data analysis library

• Data structures for fast data manipulation:

– Slicing

– Indexing

– Subsetting

• Handling missing data

• Aggregation

• Time series

http://pandas.pydata.org/

Page 18: When big data meet python @ COSCUP 2012

When Big Data meet Python

Statsmodels

• Statistical analysis

• Statistical models

• Fit data with model

• Statistical tests

• Data exploration

• Time series analysis

http://statsmodels.sourceforge.net/

Page 19: When big data meet python @ COSCUP 2012

When Big Data meet Python

scikit-learn

• Machine learning algorithms

– Supervised learning

– Unsupervised learning

• Dataset

– Preprocessing

– Feature extraction

• Model

– Selection

– Pipeline

http://scikit-learn.org/

Page 20: When big data meet python @ COSCUP 2012

When Big Data meet Python

NLTK: Natural Language Toolkit

• Natural language processing

• Annotated corpora and resources

http://nltk.org/

Information extraction workflow:

Sentence Segmentation → Tokenization → POS Tagging → Named Entity Recognition → Relation Recognition
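The first two stages of this workflow can be sketched with regular expressions. This is a naive stand-in, not NLTK's tokenizers: the functions `segment_sentences` and `tokenize` and the sample text are hypothetical, and the later stages (POS tagging, named entity recognition, relation recognition) need trained models and are not shown.

```python
import re


def segment_sentences(text):
    """Split text after ., ! or ? followed by whitespace (a naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


# Hypothetical input text.
text = "Jimmy works in Taipei. He likes Python!"

sentences = segment_sentences(text)
tokens = [tokenize(s) for s in sentences]
print(tokens)
```

Each stage consumes the output of the previous one, which is why the workflow is drawn as a chain.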

Page 21: When big data meet python @ COSCUP 2012

When Big Data meet Python

Matplotlib

• Plotting

– Histograms

– Power spectra

– Bar charts

– Error charts

– Scatter plots

• Full control over plotting details

http://matplotlib.sourceforge.net/

Page 22: When big data meet python @ COSCUP 2012

When Big Data meet Python

NetworkX

• Graph algorithms and visualization

• Draw graphs with layouts:

– Circular

– Random

– Spectral

– Spring

– Shell

– Graphviz
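A layout is simply a mapping from each node to a drawing position. As a small sketch of the circular case, written with the standard library rather than NetworkX itself, the hypothetical function below spaces nodes evenly on a circle; NetworkX provides its own `circular_layout`, whose exact output conventions may differ from this version.

```python
import math


def circular_layout(nodes, radius=1.0):
    """Place nodes at equal angles on a circle centered at the origin."""
    n = len(nodes)
    return {
        node: (radius * math.cos(2 * math.pi * i / n),
               radius * math.sin(2 * math.pi * i / n))
        for i, node in enumerate(nodes)
    }


# Hypothetical four-node graph; only the node names matter for this layout.
positions = circular_layout(["a", "b", "c", "d"])
for node, (x, y) in positions.items():
    print(f"{node}: ({x:.2f}, {y:.2f})")
```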

http://networkx.lanl.gov/

Page 23: When big data meet python @ COSCUP 2012

聚寶評 www.ezpao.com

Food search engine

Searches food reviews from major blogs

Page 24: When big data meet python @ COSCUP 2012

聚寶評 www.ezpao.com

Semantic analysis search engine

Page 25: When big data meet python @ COSCUP 2012

Analysis of dishes shared by users

Positive/negative review analysis

Review topic analysis

Page 26: When big data meet python @ COSCUP 2012

Thank you for your attention. Q & A

We are hiring!

• Core engine algorithm R&D engineer

• System R&D engineer

• Web application R&D engineer

Oxygen Intelligence Taiwan Limited

引京聚點 知識結構搜索股份有限公司

• Company profile: http://www.ezpao.com/about/

• Job openings: http://www.ezpao.com/join/

• Please send your résumé to [email protected]

When big data meet python by Jimmy Lai is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.