Building Spark Connector for Ryft - hardware high-speed compute appliance. Aleksandr Pavlenko, Big Data Software Engineer, DataArt [email protected]

Alexander Pavlenko, Java Software Engineer, DataArt

Page 1

Building Spark Connector for Ryft - hardware high-speed compute appliance

Aleksandr Pavlenko, Big Data Software Engineer, [email protected]

Page 2

What is Apache Spark?

Page 3

Ryft ONE - hardware producing Big Data

Page 4

Ryft Query Language

Query examples:

Exact Search: (RAW_TEXT CONTAINS "Some Text")

Edit Search: (RAW_TEXT CONTAINS FEDS("Some Text", DIST=2, ...))

Date Search: (RECORD.date CONTAINS DATE(MM/DD/YYYY <= "04/05/2015"))

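The three query shapes above are plain strings, so they can be assembled with small helpers. The sketch below is only an illustration: the object and method names are hypothetical, only the DIST option shown on the slide is modeled (the elided FEDS options are left out), and no escaping of embedded quotes is attempted.

```scala
// Hypothetical helpers that build the query strings shown on the slide.
object RyftQueryStrings {
  // Exact search: (RAW_TEXT CONTAINS "Some Text")
  def exact(text: String): String =
    s"""(RAW_TEXT CONTAINS "$text")"""

  // Edit (fuzzy) search; only the DIST option from the slide is modeled.
  def fuzzyEdit(text: String, dist: Int): String =
    s"""(RAW_TEXT CONTAINS FEDS("$text", DIST=$dist))"""

  // Date search on a record field, using the MM/DD/YYYY form from the slide.
  def dateAtOrBefore(field: String, date: String): String =
    s"""(RECORD.$field CONTAINS DATE(MM/DD/YYYY <= "$date"))"""
}
```

For example, RyftQueryStrings.dateAtOrBefore("date", "04/05/2015") yields the Date Search string shown above.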
Page 5

Ryft REST Service

Page 6

Spark Ryft Connector

Use Cases:

● Financial services

● Customer visibility

● Call center records

● Security and defense

● e-Medical records

● Genomic research

● IoT sensor and devices

● Supply chain logistics

Page 7

Supercharging Spark with Ryft

*Benchmark comparisons against Apache Spark running on a cluster of AWS EC2 – c3.8xlarge “Compute Optimized” 2U servers that require 1100 Watts each.

http://www.ryft.com/products#performance-proof

Page 8

RDD - Resilient Distributed Dataset

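abstract class RDD[T](...) {
  // Computes the elements of one partition
  @DeveloperApi
  def compute(split: Partition, context: TaskContext): Iterator[T]

  // Returns the partitions that make up this RDD
  protected def getPartitions: Array[Partition]

  // Optionally names the hosts where a partition is best computed
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}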
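To make the three methods concrete, here is a minimal sketch of a custom RDD, assuming spark-core is on the classpath. SlicePartition and SliceRDD are hypothetical names invented for this illustration and have nothing to do with the Ryft connector; the RDD simply produces the integers 0 until n split into a fixed number of partitions.

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: one contiguous slice [start, end) of the range
case class SlicePartition(index: Int, start: Int, end: Int) extends Partition

// Hypothetical RDD producing 0 until n, split into numSlices partitions
class SliceRDD(sc: SparkContext, n: Int, numSlices: Int)
    extends RDD[Int](sc, Nil) { // Nil: this RDD has no parent RDDs

  // Runs on an executor: produce the elements of a single partition
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[SlicePartition]
    (p.start until p.end).iterator
  }

  // Runs on the driver: describe how the data set is divided
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices) { i =>
      SlicePartition(i, i * n / numSlices, (i + 1) * n / numSlices)
    }

  // No locality preference, same as the default
  override protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}
```

With a SparkContext sc, new SliceRDD(sc, 10, 3).collect() returns the ten integers in order, computed in three tasks.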

Page 9

Ryft RDD

*Typical query: http://ryftone0/search?query=(RAW_TEXT CONTAINS "test")&files=somefile.txt
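The typical query above contains spaces, quotes and parentheses, which normally need encoding before being sent over HTTP. The sketch below shows one way to build such a URL; the object name is hypothetical, and it assumes the REST service accepts standard form encoding (spaces as +), which the slides do not confirm.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Hypothetical helper that assembles a search URL like the one on the slide.
object RyftSearchUrl {
  // Form-encodes a single query-parameter value as UTF-8
  private def enc(s: String): String =
    URLEncoder.encode(s, StandardCharsets.UTF_8.name())

  // baseUrl is e.g. "http://ryftone0"; query and file are passed unencoded
  def build(baseUrl: String, query: String, file: String): String =
    s"$baseUrl/search?query=${enc(query)}&files=${enc(file)}"
}
```

Calling RyftSearchUrl.build("http://ryftone0", "(RAW_TEXT CONTAINS \"test\")", "somefile.txt") produces the same request as the slide with the query value percent-encoded.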

Page 10

Ryft RDD Example

import com.ryft.spark.connector._
...
val sc = new SparkContext(sparkConf)
val query = RecordQuery(recordField("Description") contains
  IPv4Value(IP === IPv4("192.168.190.151")))
val ryftOptions = RyftQueryOptions("data/*", xml)
val ryftRDD = sc.ryftRDD(Seq(query), ryftOptions)
...

Page 11

Data Locality & Partitioning Mechanism

abstract class RDD[T](...) {
  ...
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
  ...
}
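Spark treats the strings returned by getPreferredLocations as host names and tries to schedule the task for that partition close to them. The slides do not say how the connector fills this in, so the sketch below is purely an assumption: each partition is imagined to carry the REST endpoint it would query, and the endpoint's host is offered as the preferred location. EndpointPartition and PreferredLocations are hypothetical names.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical: a partition description holding the REST endpoint it would query
final case class EndpointPartition(index: Int, endpoint: String)

object PreferredLocations {
  // Returns the endpoint's host as the single preferred location, or Nil
  // (no preference, the RDD default) when no host can be extracted.
  def of(p: EndpointPartition): Seq[String] =
    Try(new URI(p.endpoint).getHost).toOption // malformed URI: no preference
      .flatMap(Option(_))                     // getHost may return null
      .toList
}
```

A custom RDD could return PreferredLocations.of(...) from its getPreferredLocations override; an endpoint of "http://ryftone0/search" would give Seq("ryftone0").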

Page 12

Ryft DataFrame Support

Mapping of structured data at Ryft (JSON/XML) to DataFrame

RyftRelation extends BaseRelation with PrunedFilteredScan

val schema = StructType(Seq(
  StructField("Arrest", BooleanType), StructField("Date", TimestampType),
  StructField("Description", StringType), StructField("ID", StringType)
))

sqlContext.read.ryft(schema, xml, "*.crimestat", "temp_table",
  Map("date_format" -> "MM/dd/yyyy hh:mm:ss aa"))

sqlContext.sql("""select Date, ID, Description, Arrest from temp_table
  where Date = '2015-04-15 23:59:00' ORDER BY Date""")
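The date_format option uses pattern letters that match java.text.SimpleDateFormat, so the relationship between a raw field value and the TimestampType column can be sketched as below. Whether the connector actually uses SimpleDateFormat is an assumption, the object name is hypothetical, and the raw value "04/15/2015 11:59:00 PM" is an invented sample chosen to correspond to the timestamp in the SQL query.

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.Locale

// Hypothetical helper: parse a raw date string into a SQL timestamp.
object DateFormatSketch {
  def parse(raw: String, pattern: String): Timestamp = {
    // Locale.US so that the AM/PM marker is matched as "AM"/"PM"
    val fmt = new SimpleDateFormat(pattern, Locale.US)
    fmt.setLenient(false) // reject out-of-range fields instead of rolling over
    new Timestamp(fmt.parse(raw).getTime)
  }
}

// Invented sample value in the 12-hour format declared by date_format
val ts = DateFormatSketch.parse("04/15/2015 11:59:00 PM", "MM/dd/yyyy hh:mm:ss aa")
println(ts) // 2015-04-15 23:59:00.0
```

The 12-hour source value becomes the 24-hour timestamp that the where clause compares against.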

Page 13

Ryft Twitter Demo

Page 14

Q & A