Introduction to Apache Geode (incubating)
Cork, September 2015
Anthony Baker (@metatype) William Markito (@william_markito)
Agenda
• Introduction to Geode
• Geode concepts and usage
• The Geode open source project
• StockPrediction demo
Introduction
“…an in-memory, distributed database with strong consistency built to support low latency transactional applications at extreme scale.”
History
2004 2008 2014
• Massive increase in data volumes
• Falling margins per transaction
• Increasing cost of IT maintenance
• Need for elasticity in systems
• Financial Services Providers (every major Wall Street bank)
• Department of Defense
• Real-time response needs
• Time-to-market constraints
• Need for flexible data models across enterprise
• Distributed development
• Persistence + In-memory
• Global data visibility needs
• Fast ingest needs for data
• Need to allow devices to hook into enterprise data
• Always on
• Largest travel portal
• Airlines
• Trade clearing
• Online gambling
• Largest telcos
• Large manufacturers
• Largest payroll processor
• Auto insurance giants
• Largest rail systems on earth
• 1000+ customers in production
• Cutting edge use cases
Interesting use cases
China Railway Corporation
• 5,700 train stations
• 4.5 million tickets per day
• 20 million daily users
• 1.4 billion page views per day
• 40,000 visits per second
*http://pivotal.io/big-data/pivotal-gemfire
Indian Railways
• 7,000 stations
• 72,000 miles of track
• 23 million passengers daily
• 120,000 concurrent users
• 10,000 transactions per minute
Population:
• Indian Railways: 1,251,695,616
• China Railway Corporation: 1,401,586,609
• World: ~7,349,000,000
• Together: ~36% of the world population
Application Patterns
• Caching for speed and scale ➡ Read-Through, Write-Through, Write-Behind
• As the OLTP system of record ➡ Data in-memory for low-latency, on disk for durability
• Parallel compute engine
• Real-time analytics
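The caching patterns listed above can be illustrated without Geode at all. The following is a minimal plain-Java sketch of read-through and write-through, assuming a simple map stands in for the database. The class `ReadThroughCache` and its fields are hypothetical names chosen for illustration and are not part of the Geode API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration of read-through and write-through caching.
// This is plain Java, not the Geode API.
class ReadThroughCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Map<String, String> backingStore; // stands in for a database
    int loads = 0; // counts reads that went to the backing store (not thread-safe)

    ReadThroughCache(Map<String, String> backingStore) {
        this.backingStore = backingStore;
    }

    // Read-through: on a miss, load from the backing store and keep the value.
    String get(String key) {
        String value = cache.get(key);
        if (value == null) {
            value = backingStore.get(key);
            loads++;
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    // Write-through: update the backing store first, then the cache.
    void put(String key, String value) {
        backingStore.put(key, value);
        cache.put(key, value);
    }

    public static void main(String[] args) {
        Map<String, String> db = new ConcurrentHashMap<>();
        db.put("K01", "May");
        ReadThroughCache c = new ReadThroughCache(db);
        System.out.println(c.get("K01")); // first read loads from the store
        System.out.println(c.get("K01")); // second read is served from memory
        System.out.println("store reads: " + c.loads);
    }
}
```

Write-behind differs only in when the backing store is updated: the write is queued and applied later, which trades a window of unflushed data for lower write latency.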
Concepts
• Cache
• Region
• Member
• Client Cache
• Functions
• Listeners
Geode Concepts and Usage
• Cache
  • In-memory storage and management for your data
  • Configurable through XML, Spring, the Java API, or the CLI
  • A collection of Regions
[Diagram: a JVM hosts a Cache, which contains several Regions]
• Region
  • Distributed java.util.Map on steroids (Key/Value)
  • Consistent API regardless of where or how data is stored
  • Observable (reactive)
  • Highly available, with redundant copies on other cache members
  • Querying
[Diagram: a Region inside a Cache inside a JVM, shown as a java.util.Map with entries K01 -> May and K02 -> Tim]
• Region
  • Local, Replicated or Partitioned
  • In-memory or persistent
  • Redundant
  • LRU
  • Overflow
Available region types:
• LOCAL, LOCAL_HEAP_LRU, LOCAL_OVERFLOW, LOCAL_PERSISTENT, LOCAL_PERSISTENT_OVERFLOW
• PARTITION, PARTITION_HEAP_LRU, PARTITION_OVERFLOW, PARTITION_PERSISTENT, PARTITION_PERSISTENT_OVERFLOW, PARTITION_PROXY, PARTITION_PROXY_REDUNDANT
• PARTITION_REDUNDANT, PARTITION_REDUNDANT_HEAP_LRU, PARTITION_REDUNDANT_OVERFLOW, PARTITION_REDUNDANT_PERSISTENT, PARTITION_REDUNDANT_PERSISTENT_OVERFLOW
• REPLICATE, REPLICATE_HEAP_LRU, REPLICATE_OVERFLOW, REPLICATE_PERSISTENT, REPLICATE_PERSISTENT_OVERFLOW, REPLICATE_PROXY
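To make the difference between a partitioned and a redundant partitioned region more concrete, here is a rough plain-Java sketch of one way keys could be placed on members. The bucket count, the member names, and the "next member in the list" placement rule are assumptions made for illustration. Geode's actual placement logic is more involved, and `PartitionSketch` is a hypothetical class, not Geode API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of data placement in a partitioned region.
// Bucket count, member names, and placement rule are illustrative assumptions.
class PartitionSketch {
    private final List<String> members;
    private final int bucketCount;
    private final int redundantCopies;

    PartitionSketch(List<String> members, int bucketCount, int redundantCopies) {
        if (redundantCopies >= members.size()) {
            throw new IllegalArgumentException("need more members than redundant copies");
        }
        this.members = members;
        this.bucketCount = bucketCount;
        this.redundantCopies = redundantCopies;
    }

    // Map a key to a bucket using its hash code.
    int bucketFor(Object key) {
        return Math.floorMod(key.hashCode(), bucketCount);
    }

    // The first owner is the primary; the remaining owners hold redundant copies.
    List<String> ownersOf(Object key) {
        int bucket = bucketFor(key);
        List<String> owners = new ArrayList<>();
        for (int i = 0; i <= redundantCopies; i++) {
            owners.add(members.get((bucket + i) % members.size()));
        }
        return owners;
    }

    public static void main(String[] args) {
        List<String> servers = new ArrayList<>();
        servers.add("server1");
        servers.add("server2");
        servers.add("server3");
        PartitionSketch region = new PartitionSketch(servers, 113, 1);
        System.out.println("K01 is stored on " + region.ownersOf("K01"));
    }
}
```

With one redundant copy, every key lives on two distinct members, so losing a single member does not lose data. A replicated region, by contrast, keeps every entry on every member.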
• Persistent Regions
  • Durability
  • WAL (write-ahead log) for efficient writing
  • Consistent recovery
  • Compaction
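The interplay of append-only writes, recovery, and compaction can be sketched in a few lines of plain Java. This is an in-memory model only: real oplog files live on disk and also record removals, which this sketch leaves out. `OplogSketch` and its methods are hypothetical names, not Geode API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of an append-only operation log.
// Real oplogs are files on disk; this only shows replay and compaction.
class OplogSketch {
    // One logged operation: a key and its new value.
    static final class Entry {
        final String key;
        final String value;

        Entry(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    private final List<Entry> log = new ArrayList<>();

    // Writes are appended and never updated in place, so no seek is needed.
    void append(String key, String value) {
        log.add(new Entry(key, value));
    }

    // Recovery replays the log in order; later entries overwrite earlier ones.
    Map<String, String> recover() {
        Map<String, String> region = new LinkedHashMap<>();
        for (Entry e : log) {
            region.put(e.key, e.value);
        }
        return region;
    }

    // Compaction rewrites the log so it holds only the latest value per key.
    void compact() {
        Map<String, String> latest = recover();
        log.clear();
        for (Map.Entry<String, String> e : latest.entrySet()) {
            log.add(new Entry(e.getKey(), e.getValue()));
        }
    }

    int size() {
        return log.size();
    }
}
```

For example, appending k2->v2, k4->v4, and then k4->v7 leaves three log entries; after `compact()` only k2->v2 and k4->v7 remain, and recovery yields the same region contents either way.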
[Diagram: on Member 1, a Put k4->v7 is appended as “Modify k4->v7” to Oplog3.crf; the older Oplog2.crf holds Modify k1->v5, Create k6->v6, Create k2->v2, and Create k4->v4]
[Diagram: Server 1 through Server N, each a JVM hosting a Cache with a Region]
• Member
  • A process that has a connection to the system
  • A process that has created a cache
  • Embeddable within your application
[Diagram: Client, Locator, Server]
• Client cache
  • A process connected to the Geode server(s)
  • Can have a local copy of the data
  • Can be notified about events on the servers
[Diagram: an Application with a Client Cache and its Regions, connected to a GemFire Server hosting Regions]
• Functions
  • Used for distributed concurrent processing (Map/Reduce, stored procedure)
  • Highly available
  • Data oriented
  • Member oriented
[Diagram: Submit (f1); Execute Functions f1, f2, … fn]
[Diagram: a client Region calls FunctionService.onRegion.withFilter.execute with filter = Keys X, Y. The call reaches the servers of the distributed system (Partitioned Region Data Accessors and Partitioned Region Data Stores X, Y, Z); the execute and result messages are numbered 1 to 6, and the client reads the outcome through ResultCollector.getResult]
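Data-oriented function execution, where a filter such as keys X and Y decides which data stores run the function, can be approximated in plain Java as follows. This sketch runs sequentially in one JVM, whereas the real thing executes on the members in parallel. `FunctionSketch` and its methods are hypothetical and only mimic the shape of the onRegion / withFilter / execute / getResult flow.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of data-oriented function execution.
// Each map stands in for one data store of a partitioned region.
class FunctionSketch {
    private final List<Map<String, Integer>> dataStores = new ArrayList<>();

    void addDataStore(Map<String, Integer> store) {
        dataStores.add(store);
    }

    // Run fn only on stores that hold a filtered key, passing each store
    // its matching local entries, and collect one result per store.
    <R> List<R> execute(Set<String> filter, Function<Map<String, Integer>, R> fn) {
        List<R> results = new ArrayList<>();
        for (Map<String, Integer> store : dataStores) {
            Map<String, Integer> local = new HashMap<>();
            for (String key : filter) {
                if (store.containsKey(key)) {
                    local.put(key, store.get(key));
                }
            }
            if (!local.isEmpty()) {
                results.add(fn.apply(local)); // the function runs where the data is
            }
        }
        return results; // plays the role of reading a result collector
    }
}
```

With three stores holding X, Y, and Z and a filter of {X, Y}, the function runs twice and the store holding Z is never touched. Moving the computation to the data in this way avoids shipping entries to the caller.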
• Listeners
  • CacheWriter / CacheListener
  • AsyncEventListener (queue / batch)
  • Parallel or Serial
  • Conflation
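Queueing, batching, and conflation for asynchronous listeners can be sketched with an insertion-ordered map. This is a simplified plain-Java model under the assumption that only the latest value per key matters to the consumer. `ConflatingQueue` is a hypothetical class and does not reproduce the AsyncEventListener interface.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an event queue with batching and conflation.
// Not the Geode AsyncEventListener API; names are for illustration only.
class ConflatingQueue {
    // Insertion-ordered map holding one pending value per key.
    private final Map<String, String> pending = new LinkedHashMap<>();

    // Conflation: a newer update replaces the queued value for the same key
    // while the key keeps its original position in the queue.
    synchronized void enqueue(String key, String value) {
        pending.put(key, value);
    }

    // Remove up to batchSize events and return them for delivery to a listener.
    synchronized List<Map.Entry<String, String>> drainBatch(int batchSize) {
        List<Map.Entry<String, String>> batch = new ArrayList<>();
        Iterator<Map.Entry<String, String>> it = pending.entrySet().iterator();
        while (it.hasNext() && batch.size() < batchSize) {
            Map.Entry<String, String> e = it.next();
            // Copy the entry before removing it from the underlying map.
            batch.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
            it.remove();
        }
        return batch;
    }
}
```

Enqueuing k4->v4, k1->v5, and k4->v7 and then draining yields two events, with k4 carrying only v7. Fewer events reach the downstream system, at the cost of hiding intermediate values.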
Core Beliefs
Performance
Consistency
Resiliency
Why Apache Geode?
© Copyright 2014 Pivotal. All rights reserved.
Pivotal GemFire High Availability and Fault Tolerance in 6 acts
• Failing data copies are replaced transparently
• Data is replicated to other clusters and sites (WAN)
• Network segmentations (“split brain”) are identified and fixed automatically
• Client and cluster disconnections are handled gracefully
• Data is persisted on local disk for ultimate durability
• Failed function executions are restarted automatically
What makes it fast?
• Minimize copying
• Minimize contention points
• Flexible consistency model
• Partitioning and parallelism
• Avoid disk seeks
• Automated benchmarks
Benchmarks and Testing
[Chart: speedup, latency (ms), and CPU % plotted against the number of server hosts (2 to 10)]
The Geode Project
Why Open Source? Why ASF?
• Open source is fundamentally changing software buying patterns
• Customers get transparency and co-development of features
• It’s the community that matters
• ASF provides a framework for open source
Geode Will Be a Significant Apache Project
• 1M+ LOC, more than 1,000 person-years invested in cutting edge R&D
• Thousands of production customers in very demanding verticals
• Cutting edge use cases that have shaped product thinking
• A core technology team that has stayed together since founding
• Performance differentiators that are baked into every aspect of the product
Geode versus GemFire
• Geode is a project supported by the OSS community
• GemFire is a product from Pivotal, based on the Geode source
• We donated everything but the kitchen sink*
• Development process follows “The Apache Way”
* Not donated: multi-site WAN replication, continuous queries, and the native (C/C++/.NET) client
"Talk is cheap, show me the code"
• Clone & Build
    git clone https://github.com/apache/incubator-geode
    cd incubator-geode
    ./gradlew build -Dskip.tests=true
• Start a server
    cd gemfire-assembly/build/install/apache-geode
    ./bin/gfsh
    gfsh> start locator --name=locator
    gfsh> start server --name=server
    gfsh> create region --name=myRegion --type=REPLICATE
Stock Predictions with
Apache Geode, Spark, and SpringXD
Quick intro to Apache Spark
• RDD
• Dataframe
• Driver
• Worker
“An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.”
“A dataframe is a distributed collection of rows organized into named columns. An abstraction for selecting, filtering and plotting structured data (pandas), previously known as SchemaRDD.”
Live Data
1 - Live data is ingested into the grid
2 - Trained ML model compares new data to historical patterns
3 - Results are pushed immediately to deployed applications
4 - Re-training is triggered, updating the model with the latest historical data
[Diagram labels: Spring XD, Apache Geode / GemFire, Machine Learning model; Data Temperature: Hot, Warm]
Machine Learning Concepts
Machine Learning Model (e.g. Linear Regression)
• Features: price(x), medium avg (x), relative strength (x)
• Label: medium avg (x+1)
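SpringXD
• Extensible
• Open-Source
• Fault-Tolerant
• Horizontally Scalable
• Cloud-Native
[Diagram: demo pipeline. Sources: Real data, Simulator. SpringXD stages: Transform, Enrich, Filter, Split, Sink. Components: Indicators, Predict, Machine Learning, Dashboard, with steps numbered 1 to 3. Paths: /Stocks, /TechIndicators, /Predictions]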
Roadmap
• Off-heap memory storage
• HDFS persistence
• Lucene indexes
• Spark connector
• Cloud Foundry service
• Distributed transactions
…and other ideas from the Geode community!
How to Get Involved
• http://geode.incubator.apache.org
• Join the mailing lists; ask a question, answer a question, learn
[email protected] [email protected]
• File a bug in JIRA
• Update the wiki, website, or documentation
• Create example applications
• Use it in your project! We need you!
Questions?
Thank you!