Linux NUMA & Databases: Perils and Opportunities

NUMA Reference architecture

What is NUMA
● Stands for Non-Uniform Memory Access
  ○ Non-uniform to whom?
  ○ Von Neumann bottleneck
  ○ Cache-coherent NUMA
● How does it work?
  ○ Memory is placed local to the processes.
  ○ Balancing access to data over the available processors on multiple nodes.
● Large memory installations are becoming the norm
  ○ The i2 series on AWS.
  ○ Databases are the main consumers.
● Constraints
  ○ Speed of light
  ○ Interconnect saturation

What is NUMA
● Constraints
  ○ Speed of light
    ■ Higher latency of accessing remote memory.
  ○ Interconnect saturation
    ■ Performance counters.
● Slow abundant memory
  ○ Fast limited memory
● Cache coherence
  ○ Processor threads and cores share resources
    ■ Execution units (between HT threads)
    ■ Cache (between threads and cores)
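
The remote-access penalty listed under the constraints can be estimated from the node distance table that Linux exports under /sys/devices/system/node/nodeN/distance (populated from the ACPI SLIT). The following is a minimal sketch, not a measurement tool: the function names are hypothetical, the two-node row "10 21" is made-up sample data, and the distances are relative firmware hints rather than measured latencies.

```python
from pathlib import Path


def parse_distance_row(text):
    """Parse one sysfs distance row such as '10 21' into a list of ints."""
    return [int(tok) for tok in text.split()]


def relative_cost(row, local_node):
    """Scale a distance row so that the local node has cost 1.0."""
    local = row[local_node]
    return [d / local for d in row]


def read_distance_matrix(root="/sys/devices/system/node"):
    """Read every node's distance row; returns {} when no NUMA sysfs tree exists."""
    matrix = {}
    for node_dir in Path(root).glob("node[0-9]*"):
        node_id = int(node_dir.name[len("node"):])
        matrix[node_id] = parse_distance_row((node_dir / "distance").read_text())
    return matrix


# Hypothetical row for node 0 of a two-node machine: local = 10, remote = 21.
sample_row = parse_distance_row("10 21")
print(relative_cost(sample_row, local_node=0))
```

On a real machine, read_distance_matrix() returns one row per node, and relative_cost can be applied to each row with that node's own index as local_node.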

Exotic cases
● Network cards
● PCIe storage
● NVRAM
● Nodes without memory
● Nodes without processors
● Unbalanced
● Central/Large memory
● big.LITTLE architecture
● GPU

NUMA complications
● Unmovable memory
● KSM (Kernel Samepage Merging)
● THP (Transparent Huge Pages)
● Interrupt balancing and locality

Tools/libraries for NUMA
● Supported by Linux since 2.5
  ○ Symmetric and CPU/Memory
● numactl
● hwloc / lstopo
● numad
● numatop
● libnuma
● numastat
● taskset
● KVM for simulation and testing
● perf

Tools/libraries for NUMA
● KVM for simulation and testing
● Useful for testing databases.

qemu-system-x86_64 -enable-kvm \
  -drive file=./debian-8.1-lxc-puppet.qcow2 \
  -net nic,macaddr=52:54:00:00:EE:03 -net vde \
  -smp sockets=2,cores=2,threads=2,maxcpus=16 \
  -numa node,nodeid=0,cpus=0-3 \
  -numa node,nodeid=1,cpus=4-7 \
  -numa node,nodeid=2,cpus=8-15 \
  -m 2G
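
Inside the guest, the simulated topology can be checked against the -numa arguments by reading /sys/devices/system/node/nodeN/cpulist. The sketch below only covers parsing the kernel's cpulist notation; the function name and the expected dictionary are illustrative assumptions that mirror the command line above.

```python
def parse_cpulist(text):
    """Expand a cpulist string such as '0-3,8-11' into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty cpulist means a node without processors
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


# Node layout requested with -numa node,nodeid=N,cpus=... on the QEMU command line.
expected = {0: "0-3", 1: "4-7", 2: "8-15"}
layout = {node: parse_cpulist(cpus) for node, cpus in expected.items()}
print(layout)
```

Comparing layout with the values parsed from the guest's sysfs files shows whether the guest kernel saw the node layout that was requested.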

Tunings and observables
● /proc/zoneinfo
  ○ sysctl vm.zone_reclaim_mode OR /proc/sys/vm/zone_reclaim_mode
  ○ /proc/sys/vm/min_unmapped_ratio
● /proc/meminfo
● /proc/vmstat
● ftrace
● Cgroup hierarchy
  ○ memory
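
As a small illustration of the zone reclaim tunable, the sketch below reads vm.zone_reclaim_mode and decodes its bitmask. The bit meanings follow the kernel's sysctl documentation; the function names are hypothetical, and the file is simply skipped when it does not exist (for example on a kernel built without NUMA support).

```python
from pathlib import Path

# Bit meanings as described in the kernel documentation for vm.zone_reclaim_mode.
ZONE_RECLAIM_BITS = {
    1: "zone reclaim on",
    2: "zone reclaim writes dirty pages out",
    4: "zone reclaim swaps pages",
}


def decode_zone_reclaim_mode(value):
    """Return the labels of all bits set in a zone_reclaim_mode value."""
    return [label for bit, label in ZONE_RECLAIM_BITS.items() if value & bit]


def read_zone_reclaim_mode(path="/proc/sys/vm/zone_reclaim_mode"):
    """Return the current mode as an int, or None if the file is absent."""
    p = Path(path)
    return int(p.read_text()) if p.exists() else None


mode = read_zone_reclaim_mode()
if mode is not None:
    print(mode, decode_zone_reclaim_mode(mode) or ["off"])
```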

Tunings and observables
● ACPI
  ○ SLIT and SRAT
● Per process:
  ○ /proc/<pid>/numa_maps
  ○ /proc/<pid>/sched
● Auto NUMA balancing
  ○ CONFIG_NUMA_BALANCING in /proc/config.gz
● get_mempolicy(2), mbind(2), migrate_pages(2), move_pages(2), set_mempolicy(2), sched_getaffinity(2)
● libnuma, numa(3)
  ○ Higher abstraction - numa_set_localalloc
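
The per-process view in /proc/<pid>/numa_maps lists, for each mapping, how many pages sit on each node as N<node>=<pages> fields. A minimal sketch of summing those fields follows; the function name is hypothetical and the sample lines are invented for illustration, not captured from a real process.

```python
import re
from collections import Counter

# Matches per-node page counts such as 'N0=100' in a numa_maps line.
NODE_PAGES = re.compile(r"\bN(\d+)=(\d+)\b")


def pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> fields over all mappings."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for node, pages in NODE_PAGES.findall(line):
            totals[int(node)] += int(pages)
    return dict(totals)


# Invented sample in the numa_maps layout: address, policy, then key=value fields.
SAMPLE = """\
00400000 default file=/usr/bin/example mapped=12 N0=12 kernelpagesize_kB=4
7f0000000000 interleave:0-1 anon=200 dirty=200 N0=100 N1=100 kernelpagesize_kB=4
7ffd00000000 default stack anon=3 dirty=3 N1=3 kernelpagesize_kB=4
"""
print(pages_per_node(SAMPLE))
```

For a live process, pass the contents of /proc/<pid>/numa_maps instead of SAMPLE. Page counts are in units of each mapping's own page size, so mappings backed by huge pages need separate handling.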

NUMA statistics
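
The counters that numastat reports are also available per node in /sys/devices/system/node/nodeN/numastat as name/value pairs (numa_hit, numa_miss, numa_foreign, interleave_hit, local_node, other_node). The sketch below parses that layout and derives a miss ratio; the helper names and the sample numbers are illustrative assumptions.

```python
def parse_numastat(text):
    """Parse 'name value' lines from a per-node numastat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def miss_ratio(stats):
    """Fraction of allocations that landed on this node but were intended for another."""
    hit = stats.get("numa_hit", 0)
    miss = stats.get("numa_miss", 0)
    total = hit + miss
    return miss / total if total else 0.0


# Invented sample values in the layout of a per-node numastat file.
SAMPLE_NUMASTAT = """\
numa_hit 900
numa_miss 100
numa_foreign 40
interleave_hit 10
local_node 880
other_node 120
"""
print(miss_ratio(parse_numastat(SAMPLE_NUMASTAT)))
```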

AutoNUMA
● CPU follows memory
  ○ Reschedule tasks on same nodes as memory
● Memory follows CPU
  ○ Copy memory pages to same nodes as tasks/threads
● Heuristics
  ○ Fault statistics
  ○ Task grouping
  ○ Multi-resource optimization - cache, cpu, memory, starvation
    ■ Avoid thrashing
● Only CPU and memory?
  ○ For others, use manual pinning!
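
The fault statistics that drive these heuristics surface in /proc/vmstat on kernels built with CONFIG_NUMA_BALANCING, as counters such as numa_hint_faults, numa_hint_faults_local and numa_pages_migrated. A rough sketch of turning them into a locality figure is shown below; the function names are hypothetical and the sample values are invented.

```python
def parse_vmstat(text):
    """Parse 'name value' lines from /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def hint_fault_locality(counters):
    """Share of NUMA hinting faults that were already node-local, or None if no faults."""
    faults = counters.get("numa_hint_faults", 0)
    local = counters.get("numa_hint_faults_local", 0)
    return local / faults if faults else None


# Invented sample restricted to the automatic NUMA balancing counters.
SAMPLE_VMSTAT = """\
numa_pte_updates 1000
numa_hint_faults 400
numa_hint_faults_local 300
numa_pages_migrated 50
"""
print(hint_fault_locality(parse_vmstat(SAMPLE_VMSTAT)))
```

A locality value that stays low while numa_pages_migrated keeps growing is one possible sign of thrashing, though interpreting it depends on the workload.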

NUMA Policies
● MPOL_DEFAULT
● MPOL_BIND
● MPOL_INTERLEAVE
  ○ Memory striping in hardware
● MPOL_PREFERRED
● MPOL_MF_MOVE | MPOL_MF_MOVE_ALL (flags to mbind(2), not policies)
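
From the command line, these policies are usually applied with numactl rather than by calling set_mempolicy(2) directly. The sketch below maps each policy name to the corresponding numactl option and builds an argument vector; the helper name and the mysqld example are assumptions made for illustration, and nothing is executed.

```python
import shlex

# numactl options that select the corresponding memory policy.
POLICY_FLAGS = {
    "MPOL_BIND": "--membind={nodes}",
    "MPOL_INTERLEAVE": "--interleave={nodes}",
    "MPOL_PREFERRED": "--preferred={nodes}",  # numactl expects a single node here
}


def numactl_command(policy, nodes, argv):
    """Build an argv that runs `argv` under the given memory policy."""
    if policy == "MPOL_DEFAULT":
        return list(argv)  # default policy: no numactl wrapper needed
    flag = POLICY_FLAGS[policy].format(nodes=nodes)
    return ["numactl", flag] + list(argv)


cmd = numactl_command("MPOL_INTERLEAVE", "all", ["mysqld"])
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run on a machine where numactl is installed.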

Databases
● Most databases support multiple cores and NUMA.
  ○ MAP_ANONYMOUS and O_DIRECT are common
● Most default to interleaving to avoid zone imbalance issues
  ○ Effects
    ■ Swapping due to reclaim
    ■ OOM
  ○ Downsides to interleaving
  ○ MySQL, Cassandra et al.
● Pattern of accesses
  ○ Cause of imbalance
● Duality of applications vs. OS

Reclaim
● Swappiness
  ○ Anon vs. file-backed
● Zone reclaim
  ○ Single process can span multiple zones
  ○ Imbalance without any strategies
  ○ Watermarks
  ○ Databases suffer the most
    ■ They carry a lot of state!
  ○ Types of reclaim
● Imbalance
  ○ Why does this happen
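
Zone imbalance can be observed by comparing each zone's free pages with its watermarks in /proc/zoneinfo. The following sketch extracts the free, min, low and high values per (node, zone) and lists zones that have dropped below their low watermark. The parser is a simplified assumption about the file layout, and the sample text is invented.

```python
import re

ZONE_HEADER = re.compile(r"^Node\s+(\d+),\s+zone\s+(\S+)")
# Matches '  pages free  900' and the indented 'min/low/high' lines that follow it.
FIELD = re.compile(r"^\s+(?:pages\s+)?(free|min|low|high)\s+(\d+)\s*$")


def parse_watermarks(zoneinfo_text):
    """Return {(node, zone): {'free': n, 'min': n, 'low': n, 'high': n}}."""
    zones = {}
    current = None
    for line in zoneinfo_text.splitlines():
        header = ZONE_HEADER.match(line)
        if header:
            current = (int(header.group(1)), header.group(2))
            zones[current] = {}
            continue
        field = FIELD.match(line)
        # Keep only the first occurrence of each field within a zone.
        if field and current is not None and field.group(1) not in zones[current]:
            zones[current][field.group(1)] = int(field.group(2))
    return zones


def below_low_watermark(zones):
    """List the (node, zone) keys whose free page count is under the low watermark."""
    return [key for key, z in zones.items()
            if "free" in z and "low" in z and z["free"] < z["low"]]


# Invented two-node sample in the layout of /proc/zoneinfo.
SAMPLE_ZONEINFO = """\
Node 0, zone   Normal
  pages free     900
        min      1000
        low      1250
        high     1500
Node 1, zone   Normal
  pages free     5000
        min      1000
        low      1250
        high     1500
"""
print(below_low_watermark(parse_watermarks(SAMPLE_ZONEINFO)))
```

In this sample, node 0 is under pressure while node 1 has ample free memory, which is the kind of imbalance that triggers reclaim on one node even though the machine as a whole is not short of memory.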

Access Pattern Optimizations
● Thread pool
  ○ Reuse of threads with longer lifetime
  ○ Explicit or implicit bind
    ■ numa_set_localalloc / numa_set_preferred
    ■ sched_setaffinity
    ■ CONFIG_NO_HZ and latency
● Global heaps - buffer pool, JVM
  ○ Allocation by proxy
  ○ mbind and MPOL_BIND
  ○ MAP_POPULATE (why? - First touch policy)
  ○ Node_set_preferred
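
An explicit bind with sched_setaffinity can be sketched from Python, which exposes the call as os.sched_setaffinity on Linux. Under the default first-touch policy, memory that a pinned worker touches afterwards tends to be allocated on the node that owns its CPUs. The helper name below is hypothetical, and choosing the lowest allowed CPU is only a placeholder for a real node-to-CPU mapping.

```python
import os


def pin_current_process(cpus):
    """Restrict the caller to the given CPU ids and return the resulting set."""
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process/thread
    return os.sched_getaffinity(0)


if hasattr(os, "sched_setaffinity"):  # these calls are only available on Linux
    allowed = os.sched_getaffinity(0)
    # Assumption for illustration: pin to the lowest-numbered CPU we may use.
    print(pin_current_process({min(allowed)}))
    pin_current_process(allowed)  # restore the original affinity mask
```

In a thread pool, each long-lived worker would call this once at start-up with the CPU list of its assigned node.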

Access Patterns (contd)
● Split Pools
  ○ Independent pools of memory in a database. Ex: Multiple buffer pool instances
● Multiple instances
  ○ Mostly for simple databases.
    ■ Redis
  ○ Containers
● Hybrid
  ○ Linux kernel - boot and init
  ○ MySQL / InnoDB
    ■ MPOL_LOCAL for threads
    ■ MPOL_INTERLEAVE for global heaps
● Task Grouping
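
The multiple-instances approach can be sketched as one database process per node, each with both its CPUs and its memory bound to that node. The helper below is a hypothetical planner that only builds command lines; the round-robin assignment, the base port and the use of redis-server are assumptions made for illustration.

```python
def plan_instances(node_ids, instance_count, base_port=6379):
    """Assign instances to nodes round-robin and build one numactl argv per instance."""
    plans = []
    for i in range(instance_count):
        node = node_ids[i % len(node_ids)]
        plans.append([
            "numactl", f"--cpunodebind={node}", f"--membind={node}",
            "redis-server", "--port", str(base_port + i),
        ])
    return plans


for argv in plan_instances([0, 1], instance_count=4):
    print(" ".join(argv))
```

Strict binding with --membind keeps each instance node-local, at the cost of reclaim or OOM on that node if an instance outgrows it, which is the trade-off against interleaving.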

Credits!
● http://queue.acm.org/detail.cfm?id=2513149
● www.linux-kvm.org/images/7/75/01x07b-NumaAutobalancing.pdf
● http://events.linuxfoundation.org/sites/events/files/slides/Normal%20and%20Exotic%20use%20cases%20for%20NUMA%20features.pdf
● https://en.wikipedia.org/wiki/Non-uniform_memory_access
● https://lihz1990.gitbooks.io/transoflptg/content/02.%E7%9B%91%E6%8E%A7%E5%92%8C%E5%8E%8B%E6%B5%8B%E5%B7%A5%E5%85%B7/sample-output-of-the-numastat-command.png