Scaling ELK Stack - DevOpsDays Singapore


ELK

Log processing at Scale

#DevOpsDays 2015, Singapore

@DevOpsDaysSG

Angad Singh

About me

DevOps at Viki, Inc - a global video streaming site with subtitles.

Previously a Twitter SRE, National University of Singapore

Twitter @angadsg,

Github @angad

Elasticsearch - Log Indexing and Searching

Logstash - Log Ingestion plumbing

Kibana - Frontend

Metrics vs Logging

Metrics

● Numeric timeseries data

● Actionable

● Counts, Statistical (p90, p99 etc.)

● Scalable, cost-effective solutions already available

Logging

● Useful for debugging

● Catch-all

● Full text searching

● Computationally intensive, harder to scale

Logs

● Application logs - Stack traces, handled exceptions

● Access logs - Status codes, URI, HTTP method at all levels of the stack

● Client logs - Direct HTTP requests containing log events from client-side JavaScript or mobile applications (Android/iOS)

● Standardized log format to JSON - easy to add / remove fields.

● Request tracing through various services using a unique ID assigned at the load balancer
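The last two points (JSON log lines and a unique ID carried across services) can be illustrated with a minimal sketch. The field names (`timestamp`, `service`, `request_id`, `message`) and both helper functions are hypothetical, chosen only for illustration; the talk does not specify the actual schema, and in the setup described the ID would be assigned at the load balancer rather than generated in application code.

```python
import json
import uuid
from datetime import datetime, timezone


def new_request_id() -> str:
    """Generate a unique ID (stand-in for the one assigned at the load balancer)."""
    return uuid.uuid4().hex


def log_event(service: str, request_id: str, message: str, **fields) -> str:
    """Serialize one log event as a single JSON line.

    Field names here are assumptions for illustration, not the talk's schema.
    """
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": service,
        "request_id": request_id,
        "message": message,
    }
    event.update(fields)  # extra fields can be added or dropped without a format change
    return json.dumps(event, sort_keys=True)


if __name__ == "__main__":
    rid = new_request_id()
    # Two services logging the same request share one request_id.
    print(log_event("gateway", rid, "request received", method="GET", status=200))
    print(log_event("api", rid, "handled exception", error="TimeoutError"))
```

Searching for a single `request_id` value would then return every event that request produced, regardless of which service logged it.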

Logstash

● Log aggregator

● Log preprocessing (filtering etc.)

● 3 stage pipeline

● Input > Filter > Output

Elasticsearch

● Full text searching and indexing

● Built on top of Apache Lucene

● RESTful web interface

● Horizontally scalable

Kibana

● Frontend

● Visualizations, dashboards

● Supports geo visualizations

● Uses ES REST API

Logstash

Input - Any Stream

● local file

● queue

● tcp, udp

● twitter

● etc.

Filter - Mutation

● add/remove field

● parse as json

● ruby code

● parse geoip

● etc.

Output

● elasticsearch

● redis

● queue

● file

● pagerduty

● etc.
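To make the Input > Filter > Output flow concrete, here is a small Python sketch of the same three-stage shape. It is not Logstash code and does not use any Logstash API: `read_lines`, `parse_json`, `drop_field` and `run_pipeline` are hypothetical names, and the failure tag only mimics the tag that Logstash's json filter adds when parsing fails.

```python
import json
from typing import Callable, Iterable, Iterator, List, Optional

Event = dict
Filter = Callable[[Event], Optional[Event]]


def read_lines(lines: Iterable[str]) -> Iterator[Event]:
    """Input stage: wrap each raw line in an event."""
    for line in lines:
        yield {"message": line.rstrip("\n")}


def parse_json(event: Event) -> Optional[Event]:
    """Filter: parse the message as JSON and merge its fields; tag failures."""
    try:
        parsed = json.loads(event["message"])
    except ValueError:
        event["tags"] = ["_jsonparsefailure"]
        return event
    if isinstance(parsed, dict):
        event.update(parsed)
    return event


def drop_field(name: str) -> Filter:
    """Filter factory: remove one field from every event."""
    def _filter(event: Event) -> Optional[Event]:
        event.pop(name, None)
        return event
    return _filter


def run_pipeline(lines: Iterable[str], filters: List[Filter],
                 output: Callable[[Event], None]) -> None:
    """Run every event through the filters in order, then hand it to the output."""
    for event in read_lines(lines):
        for apply_filter in filters:
            event = apply_filter(event)
            if event is None:  # a filter may drop the event entirely
                break
        else:
            output(event)


if __name__ == "__main__":
    raw = ['{"status": 200, "password": "secret"}\n', "not json\n"]
    run_pipeline(raw, [parse_json, drop_field("password")], print)
```

Each stage only sees the event dictionary, which is why inputs, filters and outputs can be swapped independently.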

Logstash Forwarder

● Golang program that sits next to log files, lumberjack protocol

● Forwards logs from a file to a logstash server

● Removes the need for a buffer (such as redis, or a queue) for logs pending ingestion to logstash.

● Docker container with volume mounted /var/log. Configuration stored in Consul.

● Application containers with volume mounted /var/log to /var/log/docker/<container>/application.log

Logstash pool with HAProxy

4 x logstash machines, 8 cores, 16 GB RAM

7 x logstash processes per machine, 5 for application logs, 2 for HTTP client logs.

Fronted by HAProxy for both lumberjack protocol as well as HTTP protocol.

Easily scalable by adding more machines and spinning up more logstash processes.

Elasticsearch Hardware

12 core, 64GB RAM with RAID 0 - 2 x 3TB 7200rpm disks.

20 nodes, 20 shards, each with 1 primary and 3 replicas.

Each day ~300GB x 4 copies (1 primary + 3 replicas) = ~1.2TB, so 120TB holds ~3 months of data.

Average 6k-8k logs per second, peak 25k logs per second.

https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html
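The storage figures above follow from simple arithmetic, sketched below in Python. The function names are hypothetical, decimal units (1 TB = 1000 GB) are assumed, and the estimate ignores filesystem overhead and the free-space headroom a real cluster needs.

```python
def raw_capacity_gb(nodes: int, disks_per_node: int, disk_tb: int) -> int:
    """Total raw disk across the cluster, in GB (assumes 1 TB = 1000 GB)."""
    return nodes * disks_per_node * disk_tb * 1000


def daily_footprint_gb(daily_primary_gb: int, copies: int) -> int:
    """Disk consumed per day once every copy of the data is counted."""
    return daily_primary_gb * copies


def retention_days(capacity_gb: int, daily_gb: int) -> int:
    """Whole days of data that fit, ignoring overhead and headroom."""
    return capacity_gb // daily_gb


if __name__ == "__main__":
    capacity = raw_capacity_gb(nodes=20, disks_per_node=2, disk_tb=3)  # 120 TB
    daily = daily_footprint_gb(daily_primary_gb=300, copies=4)         # 1.2 TB
    print(capacity, daily, retention_days(capacity, daily))
```

With the numbers from the slide this gives 100 days, which matches the roughly 3 months quoted.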

Hardware Tuning

● Heap < 30.5 GB - Java uses compressed pointers only below a 30.5GB heap

● Sweet spot - 64GB of RAM with half available for Lucene file buffers

● SSD or RAID 0 (or multiple path directories similar to RAID 0)

● If SSD then set I/O scheduler to deadline instead of cfq

● RAID 0 - no need to worry about disks failing as machines can easily be replaced due to multiple copies of data

● Disable swap
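The heap guidance (half of RAM for the heap, but staying below the 30.5GB compressed-pointer threshold) can be written as a one-line rule. The sketch below is an assumption-laden illustration: `recommended_heap_gb` and its `margin_gb` parameter are hypothetical, and the 0.5GB margin is an arbitrary choice to stay strictly below the threshold quoted in the slide.

```python
COMPRESSED_POINTER_LIMIT_GB = 30.5  # threshold quoted in the slide


def recommended_heap_gb(ram_gb: float, margin_gb: float = 0.5) -> float:
    """Half of RAM, capped strictly below the compressed-pointer threshold.

    margin_gb is a hypothetical safety margin, not a value from the talk.
    """
    return min(ram_gb / 2, COMPRESSED_POINTER_LIMIT_GB - margin_gb)


if __name__ == "__main__":
    for ram in (16, 32, 64, 128):
        print(ram, recommended_heap_gb(ram))
```

For the 64GB machines described, this yields a 30GB heap and leaves the remainder to the operating system for file caching; adding more RAM beyond that point does not raise the heap.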

Elasticsearch Configuration

● 20 days of indexes open based on available memory, rest closed - open on demand

● Field data - cache used while sorting and aggregating data.

● Circuit breaker - cancels requests which require large memory, to prevent OOM. Clear caches via http://elasticsearch:9200/_cache/clear if field data is very close to the memory limit.

● Shards >= Number of nodes

● Lucene forceMerge - minor performance improvements for older indexes (https://www.elastic.co/guide/en/elasticsearch/client/curator/current/optimize.html)

And also...

Prevent a split brain situation to avoid losing data - set the minimum number of master-eligible nodes to (n/2 + 1), rounded down, where n is the number of master-eligible nodes (for example, 2 when n = 3).

Set a higher ulimit for the elasticsearch process.

Daily cronjob which deletes data older than 90 days, closes indices older than 20 days, and optimizes (forceMerge) indices older than 2 days.
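The daily cronjob's decision logic can be sketched as a pure function that maps index names to actions. This is only an illustration under stated assumptions: it assumes daily indices named `logstash-YYYY.MM.DD` (the talk does not state the naming scheme), the function names are hypothetical, and it only plans actions; a real job would then call Elasticsearch or a tool such as Curator to carry them out.

```python
from datetime import date, datetime
from typing import Dict, Iterable, List

DELETE_AFTER_DAYS = 90       # thresholds taken from the slide
CLOSE_AFTER_DAYS = 20
FORCE_MERGE_AFTER_DAYS = 2


def index_age_days(index_name: str, today: date, prefix: str = "logstash-") -> int:
    """Age of a daily index, assuming names like logstash-2015.10.05."""
    day = datetime.strptime(index_name[len(prefix):], "%Y.%m.%d").date()
    return (today - day).days


def plan_actions(index_names: Iterable[str], today: date) -> Dict[str, List[str]]:
    """Assign each index at most one action, preferring the most drastic."""
    plan: Dict[str, List[str]] = {"delete": [], "close": [], "force_merge": []}
    for name in index_names:
        age = index_age_days(name, today)
        if age > DELETE_AFTER_DAYS:
            plan["delete"].append(name)
        elif age > CLOSE_AFTER_DAYS:
            plan["close"].append(name)
        elif age > FORCE_MERGE_AFTER_DAYS:
            plan["force_merge"].append(name)
    return plan


if __name__ == "__main__":
    names = ["logstash-2015.10.09", "logstash-2015.10.05",
             "logstash-2015.09.10", "logstash-2015.06.01"]
    print(plan_actions(names, today=date(2015, 10, 10)))
```

Keeping the planning step separate from the API calls makes the retention policy easy to test without a running cluster.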

Monitoring

Marvel - Official plugin from Elasticsearch

KOPF - Index management plugin

CAT APIs - REST APIs to view cluster information

Curator - Data management

Thanks

email: angad@viki.com

twitter: @angadsg
