Agenda
● Why a (sloppy) travel guide?
● Chasing Cool Technologies (Big Data Envy)
○ Architectural complexity
○ Operational complexity
● The cliff of confusion & The Unknown Unknowns
● Guidelines
○ Making simple but not simpler
○ Right task, right tool, right usage
○ Don’t let API fool you
○ Loading data, layouts and file formats
○ Learning from costs
Why a travel guide?
“... Martin is an excellent map reader even in the most
hectic Italian traffic … And after Martin and Cindy
left us, we did better because we had learned from what
they had showed us … When there’s no guide available, it
helps to have someone who understands how to read the
maps, tracks, signs, and indications. When we’re on our
own, it helps to learn how to do those things ourselves”
“Software projects are always traveling in
areas they don’t know”
Ron Jeffries (from his foreword to the PoAPA book)
Chasing Cool Technologies - Big Data Envy
“We continue to see organizations chasing ‘cool’ technologies,
taking on unnecessary complexity and risk when a simpler choice
would be better.”
“While we've long understood the value of Big Data to better
understand how people interact with us, we've noticed an alarming
trend of Big Data envy: organizations using complex tools to handle
‘not-really-that-big’ Data.”
“The Apache Cassandra database promises massive scalability on commodity
hardware, but we have seen teams overwhelmed by its architectural and
operational complexity. Unless you have data volumes that require a 100+
node cluster, we recommend against using Cassandra.”
https://www.thoughtworks.com/radar/techniques/big-data-envy
Big Data Envy - architectural complexity (expectation)
From a ‘10,000-foot view’,
big data systems may look
like ‘good old n-tier’ systems.
Big Data Envy - architectural complexity (example)
A dataflow diagram from a good (but still only a)
reference application.
Real-life examples are
usually more complex!
Big Data Envy - operational complexity (devops)
http://www.slideshare.net/jcmia1/apache-spark-20-tuning-guide
● Tuning the JVM, the OS and each (big) data system
● Choosing the right hardware for each ‘right solution’
● Orchestrating, monitoring and debugging many small applications running on and/or interacting with such distributed systems
OOM Troubleshooting example for Apache Spark
Know thyself - reaching the cliff of confusion
https://www.vikingcodeschool.com/posts/why-learning-to-code-is-so-damn-hard
The Unknown Unknowns - the iceberg of ignorance
In his acclaimed study “The Iceberg of Ignorance”, consultant Sidney Yoshida concluded: “Only 4% of an organization’s front line problems are known by top management, 9% are known by middle management, 74% by supervisors and 100% by employees…”
Guidelines - the first principle
“DDD isn’t first and foremost about technology.
In its most central principles, DDD is about
discussion, listening, understanding, discovery,
and business value, all in an effort to
centralize knowledge. If you are capable of
understanding the business in which your company
works, you can at a minimum participate in the
software model discovery process to produce a
Ubiquitous Language.”
Our highest priority is to satisfy
the customer through early and continuous
delivery of valuable software
(the very first principle of the Agile Manifesto)
Guidelines - making simple but not simpler
● “Make things as simple as possible, but not simpler.” (Albert Einstein)
● As simple as possible: no over-engineering. Search for the simplest feasible solution
○ feasible ‘ready’ solutions
○ fully managed solutions
○ manageable packaged solutions with support
○ solutions known for stability and manageability
● Not simpler: no under-engineering
○ right task, right tool
○ right usage: design patterns, best practices
Guidelines - right task right tool right usage
DynamoDB Design Patterns and Best Practices : https://www.youtube.com/watch?v=PDQ3jbDyTQ4
Guidelines - don’t let API fool you (cassandra)
CQL Under The Hood : https://www.youtube.com/watch?v=CY5-bWpqAVA
Guidelines - learn data paths and structures (C*)
Learning the “write path”, the “read path” and the main
internal data structures gives critical hints
about do’s and don’ts, especially anti-patterns:
● Queue-like designs
● Intensive updates
● Deletes
http://www.slideshare.net/doanduyhai/cassandra-nice-use-cases-and-worst-anti-patterns
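Why queue-like designs and deletes hurt can be illustrated with a toy model. The sketch below is not Cassandra code: `ToyLogStore`, `TOMBSTONE` and `queue_like_workload` are hypothetical names, and the only assumption taken from Cassandra is that a delete is written as a marker (a tombstone) instead of removing data in place, so dead entries pile up until a later compaction.

```python
from typing import Dict, List, Tuple

TOMBSTONE = object()  # hypothetical marker appended in place of a deleted value


class ToyLogStore:
    """Append-only toy store: puts and deletes are appended, never applied in place."""

    def __init__(self) -> None:
        self.log: List[Tuple[str, object]] = []

    def put(self, key: str, value: object) -> None:
        self.log.append((key, value))

    def delete(self, key: str) -> None:
        # A delete is just another write: it appends a tombstone.
        self.log.append((key, TOMBSTONE))

    def live_rows(self) -> Dict[str, object]:
        # A read has to walk every entry, including dead ones and tombstones.
        latest: Dict[str, object] = {}
        for key, value in self.log:
            latest[key] = value  # later entries override earlier ones
        return {k: v for k, v in latest.items() if v is not TOMBSTONE}

    def compact(self) -> None:
        # Compaction rewrites the log, keeping only live rows.
        self.log = list(self.live_rows().items())


def queue_like_workload(n: int) -> ToyLogStore:
    """Enqueue n messages, then consume (delete) all but the last one."""
    store = ToyLogStore()
    for i in range(n):
        store.put(f"msg-{i}", i)
    for i in range(n - 1):
        store.delete(f"msg-{i}")
    return store


store = queue_like_workload(100)
print(len(store.log), len(store.live_rows()))  # entries scanned vs. rows still alive
store.compact()
print(len(store.log))
```

In this toy run, a read scans 199 log entries to return a single live row. A real cluster is far more sophisticated, but the shape of the problem is the same: reads pay for data that is logically gone.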
Guidelines - loading data, layouts and file formats (hdfs)
● Data distribution, small files problem
● Row vs. columnar formats
● I/O advantage, read only what you need:
○ Vertical: projection
○ Horizontal: predicate pushdown
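The two I/O savings can be sketched with plain Python lists standing in for a columnar file. This is a toy model, not a real file format reader: `to_columns`, `scan` and the sample sensor rows are hypothetical. Real columnar formats such as Parquet go further and keep per-chunk statistics so that whole chunks can be skipped.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Columns = Dict[str, List[object]]


def to_columns(rows: List[Row]) -> Columns:
    """Pivot row-oriented records into one list per column (columnar layout)."""
    columns: Columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def scan(columns: Columns, project: List[str],
         filter_column: str, predicate: Callable[[object], bool]) -> List[Row]:
    # Horizontal saving (predicate pushdown): only the filter column is read
    # to decide which row positions are needed.
    keep = [i for i, value in enumerate(columns[filter_column]) if predicate(value)]
    # Vertical saving (projection): only the requested columns are touched.
    return [{name: columns[name][i] for name in project} for i in keep]


rows = [
    {"sensor_id": 1, "temp": 20.5, "site": "A"},
    {"sensor_id": 2, "temp": 23.0, "site": "B"},
    {"sensor_id": 3, "temp": 19.0, "site": "A"},
]
hot = scan(to_columns(rows), project=["sensor_id"],
           filter_column="temp", predicate=lambda t: t > 21)
print(hot)
```

With a row-oriented layout the same query would have to read every field of every record, including `site`, which the query never uses.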
Guidelines - learning from costs (kinesis)
“Pricing is based on volume of data ingested into Amazon Kinesis Firehose, which is calculated as the number of data records you send to the service, times the size of each record rounded up to the nearest 5KB. For example, if your data records are 42KB each, Amazon Kinesis Firehose will count each record as 45 KB of data ingested.”
“A record is the data that your data producer adds to your Amazon Kinesis Stream. A PUT Payload Unit is counted in 25KB payload “chunks” that comprise a record. For example, a 5KB record contains one PUT Payload Unit, a 45KB record contains two PUT Payload Units, and a 1MB record contains 40 PUT Payload Units. PUT Payload Unit is charged with a per million PUT Payload Units rate.”
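The rounding rules in the two quotes above are easy to turn into arithmetic, which makes the cost of small records visible. The function names below are hypothetical, the increments (5 KB and 25 KB) are taken from the quoted pricing text, and the 1 MB example only works out to 40 units if 1 MB is read as 1000 KB, which is an assumption here.

```python
import math


def firehose_billed_kb(record_kb: float, increment_kb: int = 5) -> int:
    """Billed size of one Firehose record: size rounded up to the next 5 KB."""
    return math.ceil(record_kb / increment_kb) * increment_kb


def put_payload_units(record_kb: float, chunk_kb: int = 25) -> int:
    """Number of 25 KB PUT Payload Units one Kinesis Streams record consumes."""
    return max(1, math.ceil(record_kb / chunk_kb))


print(firehose_billed_kb(42))    # the 42 KB example from the quote
print(put_payload_units(5))      # the 5 KB example
print(put_payload_units(45))     # the 45 KB example
print(put_payload_units(1000))   # 1 MB, assuming 1 MB = 1000 KB

# A 1 KB record is billed as 5 KB by Firehose: 80% of the bill is padding.
padding_ratio = 1 - 1 / firehose_billed_kb(1)
print(padding_ratio)
```

The lesson for design is that batching many small events into one record close to the billing increment can cut the ingestion bill considerably.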
Guidelines - learn windows of opportunity (streaming)
SELECT sensorid, COUNT(*) AS count
FROM sensorreadings TIMESTAMP BY time
GROUP BY sensorid, TumblingWindow(second, 10)
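What the tumbling-window query computes can be sketched in a few lines. This is a simplified model, not the streaming engine's implementation: `tumbling_counts` is a hypothetical name, timestamps are assumed to be integer seconds, and windows are assumed to be half-open intervals `[start, start + size)`, which may differ from the engine's exact boundary rules.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Event = Tuple[str, int]  # (sensor id, timestamp in whole seconds)


def tumbling_counts(events: Iterable[Event], size: int = 10) -> Dict[Tuple[str, int], int]:
    """Count events per sensor in fixed, non-overlapping windows of `size` seconds."""
    counts: Counter = Counter()
    for sensor_id, ts in events:
        window_start = (ts // size) * size  # each event falls into exactly one window
        counts[(sensor_id, window_start)] += 1
    return dict(counts)


events = [("a", 1), ("a", 9), ("a", 10), ("b", 3)]
print(tumbling_counts(events))
```

Each event is counted exactly once, so the per-window counts always add up to the total number of events.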
Guidelines - learn windows of opportunity (streaming)
SELECT sensorid, COUNT(*) AS count
FROM sensorreadings TIMESTAMP BY time
GROUP BY sensorid, HoppingWindow(second, 10, 5)
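A hopping window differs from a tumbling one in that windows overlap: with a size of 10 seconds and a hop of 5 seconds, every event belongs to two windows. The sketch below models this under the same simplifying assumptions as before (integer-second timestamps, half-open windows `[start, start + size)`); `hopping_counts` is a hypothetical name, not an engine API.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Event = Tuple[str, int]  # (sensor id, timestamp in whole seconds)


def hopping_counts(events: Iterable[Event], size: int = 10,
                   hop: int = 5) -> Dict[Tuple[str, int], int]:
    """Count events per sensor in overlapping windows that start every `hop` seconds."""
    counts: Counter = Counter()
    for sensor_id, ts in events:
        start = (ts // hop) * hop  # latest window starting at or before the event
        # Walk back through every earlier window that still covers the event.
        while start > ts - size:
            counts[(sensor_id, start)] += 1
            start -= hop
    return dict(counts)


events = [("a", 7), ("a", 12)]
print(hopping_counts(events))
```

Here the event at second 7 lands in the windows starting at 0 and 5, and the event at second 12 in those starting at 5 and 10, so the window starting at 5 sees both.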