All The Things We Didn’t Do

Kresten Krab Thorup, Humio CTO

A Tale in Three Parts

• Part 1: About Logging and Metrics Tools

• Part 2: Product Team Practices

• Part 3: Careful Engineering (Data Processing Engine)

Part 1: Log Analytics, and Why You Should Care

Record Logs, Monitor & Respond

Log Aggregation & Analytics Engine

• Metrics/Monitoring: Dashboards/Alerts

• Incident response: Log Search, Drill-down

Dimensions in Tooling

• Logs vs. Metrics

• Historic vs. Real-Time

• Cloud vs. On-Prem

• Schema vs. Ad-Hoc

Logs vs. Metrics

Logs are events; metrics are aggregates of events

Logs have high dimensionality; metrics have low dimensionality

Logs tend to be unstructured; metrics are structured

Logs support drill-down and analysis; metrics lean towards dashboards and alerting

Logs vary in volume; metrics have a fixed volume rate

Logs tend to be high volume; metrics tend to be low volume
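
The first contrast, logs as events and metrics as aggregates of events, can be sketched in a few lines of Python. This is only an illustration: the event fields (`ts`, `status`, `path`) and the function name `errors_per_minute` are hypothetical and not taken from the talk.

```python
from collections import Counter

def errors_per_minute(events):
    """Aggregate raw log events into a low-dimensional metric:
    the number of error events per one-minute bucket."""
    counts = Counter()
    for event in events:
        if event["status"] >= 500:
            minute = event["ts"] - event["ts"] % 60  # start of the minute
            counts[minute] += 1
    return dict(counts)

# Hypothetical events: timestamps in seconds, HTTP-like status codes.
events = [
    {"ts": 0, "status": 200, "path": "/a"},
    {"ts": 12, "status": 500, "path": "/b"},
    {"ts": 59, "status": 503, "path": "/b"},
    {"ts": 61, "status": 500, "path": "/c"},
]
metric = errors_per_minute(events)
```

The events keep every field (high dimensionality), while the resulting metric has only a time bucket and a number.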

Historic vs. Real-Time

Real-time processing lets you generate alerts and dashboards

Historic processing is great for incident response and audits

Real-time addresses known issues to look out for

Historic searches let you look for unknown issues

Real-time needs only CPU processing

Historic data may require a lot of disk storage

Cloud vs. On-Premises

Cloud-based systems may have privacy and security concerns

On-premises deployments are often required in health-care and banking applications

With cloud systems you can pay as you go

On-prem systems require dedicated hardware

With a cloud solution you don’t need to manage it

An on-prem solution requires you to consider ease of operations

Schema vs. Ad-Hoc Based Search

Schema-based systems address known issues to look out for

With ad-hoc searching, you can dig into new, unknown issues

Setting up schemas is often a job for the DBA or administrator

Everyone can use free-text search and learn things about the system

schema ≠ index, but they often go hand in hand

Keeping indexes around increases disk-storage requirements

Lack of indexes slows down searching
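
The index trade-off can be made concrete with two toy stores. This is a sketch under assumptions, not how any particular product works: `IndexedStore` and `ScanStore` are hypothetical names, and tokenizing on whitespace is a simplification.

```python
from collections import defaultdict

class IndexedStore:
    """Pays at insert time: every token of every line goes into an inverted index."""
    def __init__(self):
        self.lines = []
        self.index = defaultdict(set)  # token -> set of line numbers

    def insert(self, line):
        line_no = len(self.lines)
        self.lines.append(line)
        for token in line.lower().split():
            self.index[token].add(line_no)

    def search(self, token):
        # Cheap lookup, but only for whole tokens that were indexed.
        return [self.lines[i] for i in sorted(self.index.get(token.lower(), ()))]

class ScanStore:
    """Pays at query time: inserts are plain appends, searches scan every line."""
    def __init__(self):
        self.lines = []

    def insert(self, line):
        self.lines.append(line)

    def search(self, text):
        needle = text.lower()
        return [line for line in self.lines if needle in line.lower()]

indexed, scanned = IndexedStore(), ScanStore()
for line in ["ERROR disk full", "INFO started", "error timeout"]:
    indexed.insert(line)
    scanned.insert(line)
```

Both stores answer `search("error")` identically, but only the scanning store finds the substring `"time"`, because nobody indexed it in advance. The indexed store also holds an extra structure that grows with the data.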

[Figure: effort per query plotted against effort per insert]

Log Analytics Sweet Spot

• Record everything: TBs of data per day

• Generate metrics from the logs in real time

• Interactive/ad-hoc search on historic data: 100s of TB

• Can be installed on-premises (privacy / security)

• Affordable: TCO (hardware, license, operations)

Record Events, Monitor & Respond

Humio

• Metrics/Monitoring: Dashboards/Alerts

• Incident response: Log Search, Drill-down

Part 2: Humio Product Team Practices

Be The Customer

• Design target was an on-premises solution

• Co-locate with first customer

• Provide a hosted service ⇒ “eat our own dog food”

Safe Environment

• “It takes all kinds”

• Be open about strengths and weaknesses

• Be open to learn (and teach) new practices

• Experienced team initially to set practices and culture

Be in doubt!

• Discuss trade-offs, not do’s and don’ts

• Leave time to wonder

• No one knows “what’s best”

High BUS factor

• We depend on people. Period.

• Don’t try to make them replaceable

• Everyone is responsible

Choosing Scala

• I ❤ Erlang

• Knowing what Erlang can do for you, coordination code is painful to write and manage in Scala (threadpools, futures, async).

• Use “Scala, the good parts”.

Choosing Elm

• Elm is similar to React (functional JavaScript), but with proper syntax and static type checking.

• Tooling and libraries are less mature.

• Takes time for new devs to learn

• Upside is that it is “cool” — we give talks and contribute to the community.

Take small steps — but look up!

• Running a SaaS with frequent deployments teaches you to take small steps.

• Define design goals and discuss trade-offs. Keep them in mind and work towards them.

• Avoid long-running side projects. Feature-flag new work.

Manage critical dependencies

• Own all critical components

• It is tempting (and easy) to pull in 200+ Apache libraries

• We use docker for delivery (reduce customer’s deps)

• Two outside dependencies: HighCharts and Kafka

Don’t waste hardware

“The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry.”

Henry Petroski

Part 3: Humio Data Processing Engine

State Machine

Query: /error/i | count()

[Diagram: the query is turned into a State Machine fed from the Event Store; the two State Machine boxes show count: 473 and count: 243,565]
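
A minimal sketch of what a state machine for `/error/i | count()` might look like, written in Python for brevity rather than the Scala the talk mentions. The class name `FilterCount` and its methods are hypothetical; the slides only show the query and the running counts.

```python
import re

class FilterCount:
    """State machine for a query shaped like `/error/i | count()`."""
    def __init__(self, pattern):
        self.regex = re.compile(pattern, re.IGNORECASE)  # the /i flag
        self.count = 0                                   # the entire state

    def step(self, line):
        # Filter stage: only matching events reach the aggregate.
        if self.regex.search(line):
            self.count += 1

    def result(self):
        return self.count

machine = FilterCount("error")
for line in ["ERROR: disk full", "ok", "minor Error seen", "all good"]:
    machine.step(line)
```

Because the state is a single number, the same machine can be fed one event at a time from a live stream or in bulk from stored data.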

Query Language State Machine

filter … | aggregate()

event: Map[String,String]

Aggregates State Machine

Function | State | Step | Merge | Result
count | N | N+1 | N1+N2 | N
avg | (N, s) | (N+1, s+value) | (N1+N2, s1+s2) | s/N
stddev | (N, s, q) | (N+1, s+value, q+value²) | (N1+N2, s1+s2, q1+q2) | √(N*q − s²)/N
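
The table translates almost directly into code. The sketch below follows the State / Step / Merge / Result columns; the class layout and the `run` helper are assumptions made for illustration, not the talk's implementation.

```python
import math

class Count:
    zero = 0
    @staticmethod
    def step(n, value): return n + 1
    @staticmethod
    def merge(n1, n2): return n1 + n2
    @staticmethod
    def result(n): return n

class Avg:
    zero = (0, 0.0)  # (N, s): number of values, sum of values
    @staticmethod
    def step(st, value): return (st[0] + 1, st[1] + value)
    @staticmethod
    def merge(a, b): return (a[0] + b[0], a[1] + b[1])
    @staticmethod
    def result(st): return st[1] / st[0]

class StdDev:
    zero = (0, 0.0, 0.0)  # (N, s, q): count, sum, sum of squares
    @staticmethod
    def step(st, value): return (st[0] + 1, st[1] + value, st[2] + value * value)
    @staticmethod
    def merge(a, b): return (a[0] + b[0], a[1] + b[1], a[2] + b[2])
    @staticmethod
    def result(st): return math.sqrt(st[0] * st[2] - st[1] ** 2) / st[0]

def run(agg, values):
    """Fold a sequence of values through an aggregate's step function."""
    state = agg.zero
    for value in values:
        state = agg.step(state, value)
    return state

values = [2, 4, 4, 4, 5, 5, 7, 9]
whole = run(StdDev, values)
# Two partial states, e.g. from two data segments, merged afterwards.
merged = StdDev.merge(run(StdDev, values[:3]), run(StdDev, values[3:]))
```

The Merge column is what allows partial results computed over separate chunks of data to be combined. Note that the sum-of-squares form of the standard deviation can lose precision when values are large relative to their spread.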

GroupBy(host, function=count())

State | Map[String, State2]
Step(G, e) | key = e[“host”]; map[key] = Step2(map[key])
Merge(G1, G2) | ∀ key in G1, G2 => result[key] = Merge2(G1[key], G2[key])
Result(G) | ∀ key in G => result[key] = Result2(G[key])
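
A sketch of the same GroupBy rules in Python, wrapping an inner aggregate (here a count). The class names and the use of a plain `dict` for the state map are assumptions; the Step, Merge and Result rules follow the slide.

```python
class Count:
    zero = 0
    @staticmethod
    def step(state, event): return state + 1
    @staticmethod
    def merge(a, b): return a + b
    @staticmethod
    def result(state): return state

class GroupBy:
    """Keeps one inner-aggregate state per distinct value of `field`."""
    def __init__(self, field, inner):
        self.field = field
        self.inner = inner

    def zero(self):
        return {}

    def step(self, groups, event):
        key = event[self.field]
        groups[key] = self.inner.step(groups.get(key, self.inner.zero), event)
        return groups

    def merge(self, g1, g2):
        # A key missing on one side contributes the inner zero state.
        return {
            key: self.inner.merge(g1.get(key, self.inner.zero),
                                  g2.get(key, self.inner.zero))
            for key in g1.keys() | g2.keys()
        }

    def result(self, groups):
        return {key: self.inner.result(state) for key, state in groups.items()}

query = GroupBy("host", Count)
events = [{"host": "web1"}, {"host": "web2"}, {"host": "web1"}, {"host": "db1"}]

left, right = query.zero(), query.zero()
for e in events[:2]:
    left = query.step(left, e)
for e in events[2:]:
    right = query.step(right, e)
totals = query.result(query.merge(left, right))
```

Since GroupBy exposes the same four operations as its inner aggregate, it can itself be nested or merged across partial results.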

Time Boxing: groupby( time − time % bucket_size )

[Figure: per-bucket values along a time axis, combined three buckets at a time: 7 + 4 + 3 = 14, 4 + 3 + 6 = 13, 3 + 6 + 2 = 11]
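
The bucketing expression can be tried out directly. In this sketch the function names are hypothetical, and reading the slide's numbers as sums over three adjacent buckets is an interpretation of the figure rather than something the slide states.

```python
from collections import Counter

def bucket_start(time, bucket_size):
    """Key used for grouping: the start of the bucket that `time` falls in."""
    return time - time % bucket_size

def count_per_bucket(timestamps, bucket_size):
    counts = Counter(bucket_start(t, bucket_size) for t in timestamps)
    return dict(sorted(counts.items()))

def window_totals(bucket_counts, width):
    """Combine `width` adjacent buckets at a time (assumed reading of the figure)."""
    return [sum(bucket_counts[i:i + width])
            for i in range(len(bucket_counts) - width + 1)]

buckets = count_per_bucket([0, 3, 9, 10, 14, 25, 29, 30], bucket_size=10)
totals = window_totals([7, 4, 3, 6, 2], width=3)
```

Here `buckets` maps each bucket start to its event count, and `totals` reproduces the 14, 13, 11 sequence from the slide's bucket values.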

Event Store Design

• Build minimal index and compress data

Store an order of magnitude more events

• Fast “grep” for filtering events

Filtering and time/metadata selection reduce the problem space

Event Store

[Diagram: a sequence of 10GB segments, each labeled (start-time, end-time, metadata)]

Event Store

[Diagram: the segments after compression, each now 1GB and still labeled (start-time, end-time, metadata)]

1 month × 30GB/day ingest: 90GB data, <1MB index

1 month × 1TB/day ingest: 4TB data, <1MB index
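
A quick back-of-the-envelope check of these figures. Everything beyond the slide's own numbers is an assumption, labeled in the code: a 30-day month, decimal units, one index entry per 1GB compressed segment, and "<1MB" read as one million bytes.

```python
DAYS_PER_MONTH = 30             # assumption
GB_PER_TB = 1000                # assumption: decimal units
SEGMENT_GB = 1                  # compressed segment size shown on the slide
INDEX_BUDGET_BYTES = 1_000_000  # assumption: "<1MB index" as decimal megabytes

def summarize(ingest_gb_per_day, stored_gb):
    """Derive compression ratio and per-segment index budget from the slide's figures."""
    raw_gb = ingest_gb_per_day * DAYS_PER_MONTH
    segments = stored_gb // SEGMENT_GB
    return {
        "raw_gb": raw_gb,
        "compression_ratio": raw_gb / stored_gb,
        "segments": segments,
        "index_bytes_per_segment": INDEX_BUDGET_BYTES // segments,
    }

small = summarize(30, 90)                         # 30GB/day case
large = summarize(1 * GB_PER_TB, 4 * GB_PER_TB)   # 1TB/day case
```

Under these assumptions the stated sizes imply roughly 10x and 7.5x compression, and even the larger case leaves about 250 bytes of index per segment, which is plausible for a start time, an end time and a few metadata tags.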

Query

[Diagram: 1GB segments laid out along a time axis, in rows by metadata: #ds1, #web; #ds1, #app; #ds2, #web]

Query

[Diagram: the same grid of 1GB segments by time and metadata (#ds1, #web; #ds1, #app; #ds2, #web), with 10GB of segment data feeding the State Machine]
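
Selecting segments by time range and metadata, before any data is decompressed, might look like the sketch below. The `Segment` fields mirror the (start-time, end-time, metadata) labels on the slides; the `path` field, the function name and the sample values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    start: int        # start-time of the segment
    end: int          # end-time of the segment
    tags: frozenset   # metadata, e.g. {"#ds1", "#web"}
    path: str         # hypothetical location of the compressed data

def select_segments(index, query_start, query_end, required_tags):
    """Return only the segments a query has to scan."""
    required = frozenset(required_tags)
    return [
        seg for seg in index
        if seg.start <= query_end and seg.end >= query_start  # time ranges overlap
        and required <= seg.tags                               # all required tags present
    ]

index = [
    Segment(0, 99, frozenset({"#ds1", "#web"}), "seg-0"),
    Segment(100, 199, frozenset({"#ds1", "#web"}), "seg-1"),
    Segment(0, 149, frozenset({"#ds1", "#app"}), "seg-2"),
    Segment(50, 250, frozenset({"#ds2", "#web"}), "seg-3"),
]
hits = select_segments(index, 120, 180, {"#web"})
```

Only `seg-1` and `seg-3` survive here; the rest are skipped using the small index alone, and the survivors are what the brute-force filter then has to read.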

Filter 1GB data

Brute Force: Grep at 30x

• Streaming disk access, use async file I/O

• Compress data at rest (and in OS-level cache)

• Run one JVM per NUMA node

• Critical search code is sticky: 1 thread per core

• Reduce context switching (explicit scheduling)

• Localize data access (each core works on 64k chunks)

Go and find videos and blog posts about “Mechanical Sympathy” (Martin Thompson, LMAX) and “Why KDB+ is fast”
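
To give a feel for streaming over compressed data in small chunks, here is a single-threaded Python sketch. It is not the talk's engine: gzip stands in for whatever compression is actually used, the function name is hypothetical, and nothing here models NUMA placement, thread pinning or async I/O.

```python
import gzip
import re

CHUNK_SIZE = 64 * 1024  # mirrors the 64k chunk size mentioned on the slide

def grep_count(path, pattern, chunk_size=CHUNK_SIZE):
    """Count lines matching `pattern` (a bytes regex) in a gzip file,
    decompressing and scanning one chunk at a time."""
    regex = re.compile(pattern)
    matches = 0
    tail = b""  # partial line carried over from the previous chunk
    with gzip.open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            lines = (tail + chunk).split(b"\n")
            tail = lines.pop()  # the last piece may be an incomplete line
            matches += sum(1 for line in lines if regex.search(line))
    if tail and regex.search(tail):
        matches += 1  # final line without a trailing newline
    return matches
```

The data stays compressed on disk and only a small window is ever held uncompressed, which is the property the bullets above are aiming for. A call such as `grep_count("segment.gz", rb"error")` would return the number of matching lines in a file of that (hypothetical) name.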

Event Processing

• “Materialized views” for relevant metrics

• Processed when data is in-memory anyway

• Fast response times for “known” queries

Brute-Force Search

• Shift CPU load to query time

• Data compression

• Allows ad-hoc queries

• Requires “full stack” ownership

[Figure: effort per query plotted against effort per insert]

Log Analytics

• Part 1: Logging / Metrics Landscape

• Part 2: Product Team Practices & User Engagement

• Part 3: Careful Engineering

Thanks for your time.

Kresten Krab Thorup, Humio CTO