The QuantCell Big Data Spreadsheet
Agust Egilsson, PhD Big Data Science
[email protected] Saturday, March 23, 2013
* Image cropped from an article about QuantCell Research in Java Magazine, July/August 2012
We will talk about …
• Java based programming for big data scientists & end-users
• How the experts benefit directly from open source Java APIs
• Transferring JVM performance to the expert or programmer
• Deployment of solutions into production
• Big Data analytics created and consumed by end-users
• Long running operations, multithreading and garbage collection
• Simplifying coding for the data scientists
• Questions
Java based programming for big data scientists & end-users
Why spreadsheets for data-scientists and domain experts?
• shorter turnaround times (e.g. financial products)
• dynamic execution, debugging and testing
• integrated runtime and development environments
• experiment driven programming
• expression-oriented programming
• minimum or no GUI design
• by far the most widely used programming system
Why Java?
• large ecosystem of analytical tools and resources
• explosive growth in publicly available APIs
• concurrency support
• big data analytics & technologies are mostly Java based
• HPC and cloud ready
• performance
• optimization
The QuantCell big data spreadsheet supports
• high performance and access to Hadoop clusters
• intuitive access to local and remote data-sources
• access to a variety of algorithms and methods
• simplified programming already familiar to the expert
• effortless deployment of solutions to Hadoop and into production
Common use cases include
• big data analytics
• data mining using Mahout, Weka, etc.
• risk analysis, pricing and trading strategies
Live demo: Simple Java spreadsheet expressions: Data Market, Bio Data and simple analysis.
How the experts benefit directly from open source Java APIs
Explosive growth in publicly available Java analytical and visualization libraries. For example:
• OpenGamma (695,000 lines)
• Weka (507,000 lines)
• RapidMiner/YALE (535,000 lines)
• BioJava (270,000 lines)
• Chemistry Development Kit (861,000 lines)
• NASA WorldWind (420,000 lines)
• and so on …
The same is true for Java based analytical frameworks
• Apache Hadoop (2,200,000 lines – Java and XML)
• Apache Pig (320,000 lines, analyzing large datasets)
• Apache Hive (420,000 lines, data warehousing)
• …
Taking advantage of these libraries from the spreadsheet is simple and, in many cases, possible for non-developers.
Live demo: OpenGamma Financial API example.
Transferring JVM performance to the expert or programmer
Analytical projects require top performance
• expressions are compiled to bytecode
• expressions are optimized by Java
• dynamically loaded into the JVM for execution
• just-in-time compilation
Live demo: Let’s look at a few Java optimization tricks and confirm that these are used dynamically in the spreadsheet to optimize user expressions/functions
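The compile, load and execute steps listed on this slide can be sketched in plain Java with the standard javax.tools compiler API. This is only an illustration under assumptions, not QuantCell's implementation: the names DynamicCell, CellExpr and evaluate are hypothetical, the expression is pasted into a generated class without any validation, and a full JDK (not just a JRE) is assumed to be available at runtime.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: compiles a Java expression to bytecode at runtime,
// loads the generated class into the running JVM and evaluates it.
final class DynamicCell {

    /** Evaluates `expression` (which may refer to the variable c2) at c2 = x. */
    static double evaluate(String expression, double x) throws Exception {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("A JDK with the system compiler is required");
        }

        // 1. Generate source code that wraps the expression in a static method.
        Path dir = Files.createTempDirectory("cells");
        Path source = dir.resolve("CellExpr.java");
        String code = "public class CellExpr {\n"
                + "  public static double f(double c2) { return " + expression + "; }\n"
                + "}\n";
        Files.write(source, code.getBytes(StandardCharsets.UTF_8));

        // 2. Compile the source to bytecode (a .class file in the same directory).
        int status = compiler.run(null, null, null, "-d", dir.toString(), source.toString());
        if (status != 0) {
            throw new IllegalArgumentException("Compilation failed for: " + expression);
        }

        // 3. Load the class into the running JVM and invoke it; from here on
        //    the JIT compiler treats it like any other Java code.
        try (URLClassLoader loader = new URLClassLoader(new URL[] { dir.toUri().toURL() })) {
            Class<?> cls = loader.loadClass("CellExpr");
            Method f = cls.getMethod("f", double.class);
            return (Double) f.invoke(null, x);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(evaluate("Math.sqrt(c2) + 1", 16.0)); // prints 5.0
    }
}
```

Because the generated class is ordinary bytecode, it is subject to the same just-in-time compilation as hand-written Java code.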
Let’s run the code with interpreted mode turned on and off (-Xint), and then change the expression to defeat the optimization
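double (c2) = {
    long start = System.nanoTime();
    double add = c2;
    // 2 billion increments; the result is never used, so the JIT may drop the loop
    for (int i = 0; i < 2_000_000_000; i++)
        add++;
    // elapsed time in seconds
    return (System.nanoTime() - start) / 1000000000.0;
}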
Hint: replace “start” with “start + add - add” in the return statement
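Outside the spreadsheet, the same experiment can be approximated with a small standalone class. This is a sketch under assumptions: the class name LoopTiming and the Result holder are hypothetical, and instead of the “start + add - add” trick the loop result is simply returned to the caller, which should have the same effect of making the loop's value observable so that it cannot be discarded as dead code.

```java
// Hypothetical standalone version of the timing cell above.
final class LoopTiming {

    /** Elapsed seconds plus the loop's final value. */
    static final class Result {
        final double seconds;
        final double sum;

        Result(double seconds, double sum) {
            this.seconds = seconds;
            this.sum = sum;
        }
    }

    /** Increments a double `iterations` times, starting at c2, and times the loop. */
    static Result timeLoop(double c2, int iterations) {
        long start = System.nanoTime();
        double add = c2;
        for (int i = 0; i < iterations; i++) {
            add++;
        }
        double seconds = (System.nanoTime() - start) / 1_000_000_000.0;
        // Returning `add` keeps the loop result live for the caller.
        return new Result(seconds, add);
    }

    public static void main(String[] args) {
        // The slide uses 2_000_000_000 iterations; a smaller default keeps
        // the interpreted run short. Pass a count as the first argument to override.
        int iterations = args.length > 0 ? Integer.parseInt(args[0]) : 200_000_000;
        Result r = timeLoop(0.0, iterations);
        System.out.println(r.seconds + " s, final value = " + r.sum);
    }
}
```

Running it once normally and once with the HotSpot flag -Xint (interpreter only) should show the gap between interpreted and JIT-compiled execution; the exact timings depend on the machine and JVM version.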
Deployment
• deployment depends on the production environment
• the user logic should be created independent of the eventual deployment path chosen
Live demo: Deploying MapReduce algorithms to Cloudera’s CDH or to EMR
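One way to read the second bullet: write the map and reduce logic as plain functions that know nothing about where they run, and let a thin runner decide the deployment path. The sketch below is an assumption for illustration only. WordCountLogic and runLocally are hypothetical names, no Hadoop API is used, and a cluster deployment would need a separate adapter that calls the same two functions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical example: word-count logic kept separate from how it is run.
final class WordCountLogic {

    /** Map step: split one line of text into lowercase words. */
    static List<String> map(String line) {
        List<String> words = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    /** Reduce step: combine two partial counts for the same word. */
    static int reduce(int a, int b) {
        return a + b;
    }

    /** Local runner: applies the same map and reduce functions in-process. */
    static Map<String, Integer> runLocally(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : map(line)) {
                counts.merge(word, 1, WordCountLogic::reduce);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("big data", "Big Data spreadsheet");
        System.out.println(runLocally(lines)); // prints {big=2, data=2, spreadsheet=1}
    }
}
```

Only the runner would change between a local test and a cluster job; the map and reduce functions stay the same.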
Big Data analytics created and consumed by end-users
Spreadsheets are becoming more and more popular for various Big Data tasks
• driven by high demand and low supply of Big Data experts (deep analytic talents and data-savvy managers)
• using the Java based spreadsheet tool is also beneficial for other reasons
Live demo: How the spreadsheet uses both local cycles and Hadoop cloud resources
Long running operations, multithreading and garbage collection
The spreadsheet banks on Java to
• support multiple threading techniques
• reclaim memory
• distribute work between hardware resources/cores
Live demo: Long running operations; let’s write and execute a fork and join algorithm in the spreadsheet …
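For readers without the demo, a minimal fork and join algorithm in plain Java looks roughly like the sketch below. It uses the standard java.util.concurrent fork/join framework; the class name ParallelSum and the threshold value are illustrative assumptions, and the code shown in the live demo may have differed.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical example: sums an array by recursively splitting it in half
// and letting the fork/join pool spread the halves across available cores.
final class ParallelSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000; // below this size, sum sequentially

    private final long[] values;
    private final int from; // inclusive
    private final int to;   // exclusive

    ParallelSum(long[] values, int from, int to) {
        this.values = values;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += values[i];
            }
            return sum;
        }
        int mid = (from + to) >>> 1;
        ParallelSum left = new ParallelSum(values, from, mid);
        ParallelSum right = new ParallelSum(values, mid, to);
        left.fork();                      // schedule the left half asynchronously
        long rightSum = right.compute();  // work on the right half in this thread
        return left.join() + rightSum;    // wait for the left half and combine
    }

    /** Sums the array using the JVM's common fork/join pool. */
    static long sum(long[] values) {
        return ForkJoinPool.commonPool().invoke(new ParallelSum(values, 0, values.length));
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println(sum(data)); // prints 500000500000
    }
}
```

Forking one half and computing the other in the current thread keeps the calling worker busy instead of leaving it idle while it waits.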
Simplifying coding for the data scientists
End-user coding
• encourage Java spreadsheet like functions
• allow Java code expressions common to C like languages (Java, C, C++, C#, …)
• include Scala, SQL, Hive and Impala expressions
• use wizards to generate the more complex expressions
Thank you for attending.
Q & A
Sign up for our upcoming beta release: www.quantcell.com
Agust Egilsson Bjorn Jonsson [email protected] [email protected]