
Preprocessing the Data in Apache Spark

Steps in preprocessing

• Deploy the data to the cluster

• Create and build RDDs

• Verify the data through sampling

• Clean the data, for example:

– Converting data types

– Filling missing values

• Other steps: integration, reduction, transformation, discretization
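As a small local sketch of the missing-value step (the sample scores and the mean-imputation choice are illustrative assumptions, not part of the linkage pipeline itself):

```scala
// Sketch: parse score fields where "?" means missing, then impute the column mean.
val raw = Seq("0.95", "?", "0.80")  // assumed sample values
val parsed = raw.map(s => if (s == "?") Double.NaN else s.toDouble)
val known = parsed.filterNot(_.isNaN)
val mean = known.sum / known.size
val filled = parsed.map(v => if (v.isNaN) mean else v)
// The missing entry is replaced by the mean of the known values (0.875).
assert(math.abs(filled(1) - 0.875) < 1e-12)
```

On the cluster the same logic would run inside an RDD transformation rather than on a local Seq.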


Deploy the data to the cluster

• Distributed computing requires the data files to be distributed across the cluster

• Transfer the local data files to HDFS

– $ hdfs dfs -mkdir linkage

– $ hdfs dfs -put block_*.csv linkage

Creating RDDs

• Create an RDD (Resilient Distributed Dataset) from a text file

– val rawblocks = sc.textFile("hdfs:///user/yxie2/linkage2")

• Create an RDD from an external database

– val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

– val test_enc_orc = hiveContext.sql("select * from test_enc_orc")

• Spark uses lazy evaluation: transformations define an RDD but are not executed until an action requires their results
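This laziness can be seen in plain Scala with an Iterator, whose map is also deferred; a rough local analogy, not Spark itself:

```scala
// Plain-Scala analogy for Spark's laziness: Iterator.map defers work
// just as rdd.map does; nothing runs until the result is demanded.
var evaluated = 0
val it = Iterator(1, 2, 3).map { x => evaluated += 1; x * 2 }
// No elements have been processed yet, like an RDD transformation before an action.
assert(evaluated == 0)
val result = it.toList  // forcing the iterator plays the role of an action
assert(evaluated == 3)
assert(result == List(2, 4, 6))
```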

Sampling data

• Sample:

– val head = rawblocks.first()

– val top10 = rawblocks.take(10)

• View data:

– Print to the client console:

• top10.foreach(println)

The Data Set

• Pre-process the data, Part II: structuring data with tuples and case classes

– The sampled records are all strings of comma-separated fields

– To make the data easier to analyze, we need to parse these strings into a structured format, converting each field to the correct data type

def parse(line: String) = {
  val pieces = line.split(',')
  val id1 = pieces(0).toInt
  val id2 = pieces(1).toInt
  val scores = pieces.slice(2, 11).map(_.toDouble)
  val matched = pieces(11).toBoolean
  (id1, id2, scores, matched)
}

val lines = sc.textFile("hdfs://...")
val flines = lines.map(line => parse(line))

• The toDouble conversion must be redefined, because missing scores are encoded as "?"

def toDouble(s: String) = {
  if ("?".equals(s)) Double.NaN else s.toDouble
}
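Applied to a couple of field values (the sample inputs here are illustrative), the redefined conversion behaves as follows:

```scala
def toDouble(s: String): Double =
  if ("?".equals(s)) Double.NaN else s.toDouble

// "?" marks a missing score in the linkage data; it becomes NaN
// instead of throwing a NumberFormatException.
assert(toDouble("0.83") == 0.83)
assert(toDouble("?").isNaN)
```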

def parse(line: String) = {
  val pieces = line.split(',')
  val id1 = pieces(0).toInt
  val id2 = pieces(1).toInt
  val scores = pieces.slice(2, 11).map(toDouble)
  val matched = pieces(11).toBoolean
  (id1, id2, scores, matched)
}

val flines = lines.map(line => parse(line))
val filtered_data = flines.filter(tup => tup._4)  // keep only matched record pairs

Transformations vs. actions

• map and filter are transformations: they lazily define a new RDD

• first, take, and foreach(println) are actions: they trigger computation on the cluster