raghavendra-prabhu
Linux NUMA & Databases: Perils and Opportunities
What is NUMA
● Stands for Non-Uniform Memory Access
  ○ Non-uniform to whom?
  ○ Von Neumann bottleneck
  ○ Cache-coherent NUMA
● How does it work
  ○ Memory is placed local to the processes.
  ○ Balancing access to data over the available processors on multiple nodes.
● Large memory installations are becoming the norm
  ○ The i2 series on AWS
  ○ Databases are the main consumers
● Constraints
  ○ Speed of light
  ○ Interconnect saturation
What is NUMA (contd)
● Constraints
  ○ Speed of light
    ■ Higher latency of accessing remote memory
  ○ Interconnect saturation
    ■ Performance counters
● Slow abundant memory
  ○ Fast limited memory
● Cache coherence
  ○ Processor threads and cores share resources
    ■ Execution units (between HT threads)
    ■ Cache (between threads and cores)
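
The remote-access latency constraint above can be observed on a running system through the node distance table that Linux exposes in sysfs. The following is a minimal sketch, assuming the usual `/sys/devices/system/node/nodeN/distance` layout; the helper names (`parse_distances`, `read_distances`, `remote_penalty`) are hypothetical and not part of any library. The distance values are relative costs reported by the firmware, not measured latencies.

```python
from pathlib import Path


def parse_distances(rows):
    """Turn sysfs 'distance' rows (one string per node) into a matrix of ints."""
    return [[int(value) for value in row.split()] for row in rows]


def read_distances(root="/sys/devices/system/node"):
    """Return {node_id: [distance to each node]}; empty dict if sysfs has no NUMA info."""
    base = Path(root)
    if not base.is_dir():
        return {}
    result = {}
    for node_dir in base.glob("node[0-9]*"):
        dist_file = node_dir / "distance"
        if dist_file.exists():
            node_id = int(node_dir.name[len("node"):])
            result[node_id] = [int(v) for v in dist_file.read_text().split()]
    return result


def remote_penalty(matrix):
    """Ratio of the worst distance in the table to the local distance of the first node."""
    local = matrix[0][0]
    worst = max(max(row) for row in matrix)
    return worst / local


if __name__ == "__main__":
    # Illustrative two-node table: 10 is the conventional local distance.
    sample = parse_distances(["10 21", "21 10"])
    print("sample penalty:", remote_penalty(sample))
    print("this machine:", read_distances())
```

A ratio near 1 suggests a nearly uniform machine; larger ratios hint that placement decisions matter more.
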
Exotic cases
● Network cards
● PCIe storage
● NVRAM
● Nodes without memory
● Nodes without processors
● Unbalanced
● Central/Large memory
● big.LITTLE architecture
● GPU
Tools/libraries for NUMA
● Supported by Linux since 2.5
  ○ Symmetric and CPU/Memory
● numactl
● hwloc / lstopo
● numad
● numatop
● libnuma
● numastat
● taskset
● KVM for simulation and testing
● perf
Tools/libraries for NUMA (contd)
● KVM for simulation and testing
● Useful for testing databases.
  qemu-system-x86_64 -enable-kvm \
    -drive file=./debian-8.1-lxc-puppet.qcow2 \
    -net nic,macaddr=52:54:00:00:EE:03 -net vde \
    -smp sockets=2,cores=2,threads=2,maxcpus=16 \
    -numa node,nodeid=0,cpus=0-3 \
    -numa node,nodeid=1,cpus=4-7 \
    -numa node,nodeid=2,cpus=8-15 \
    -m 2G
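
Once such a guest boots, one way to check that the simulated topology is visible is to read each node's CPU list from sysfs. This is a sketch under the assumption that the guest exposes `/sys/devices/system/node/nodeN/cpulist`; `parse_cpulist` and `node_cpus` are hypothetical helper names.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8,10-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty cpulist means the node has no online CPUs
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def node_cpus(root="/sys/devices/system/node"):
    """Return {node_id: set of CPU ids}; empty dict when no NUMA sysfs tree exists."""
    base = Path(root)
    if not base.is_dir():
        return {}
    mapping = {}
    for node_dir in base.glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        if cpulist.exists():
            mapping[int(node_dir.name[len("node"):])] = parse_cpulist(cpulist.read_text())
    return mapping


if __name__ == "__main__":
    for node, cpus in sorted(node_cpus().items()):
        print(f"node {node}: cpus {sorted(cpus)}")
```

Nodes whose CPUs are not yet online may report an empty list, which is one of the exotic cases worth testing a database against.
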
Tunings and observables
● /proc/zoneinfo
  ○ sysctl vm.zone_reclaim_mode or /proc/sys/vm/zone_reclaim_mode
  ○ /proc/sys/vm/min_unmapped_ratio
● /proc/meminfo
● /proc/vmstat
● ftrace
● Cgroup hierarchy
  ○ memory
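
As a rough illustration of how these files can be consumed, the sketch below pulls the `numa_*` allocation counters out of `/proc/vmstat` text and reads the zone reclaim setting. The helper names are hypothetical, and the miss ratio is only a coarse indicator chosen for this example, not a metric defined by the kernel.

```python
from pathlib import Path


def parse_vmstat(text):
    """Parse 'name value' lines from /proc/vmstat into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def numa_counters(stats):
    """Keep only the NUMA-related counters (numa_hit, numa_miss, ...)."""
    return {name: value for name, value in stats.items() if name.startswith("numa_")}


def miss_ratio(stats):
    """Fraction of page allocations that did not land on the intended node."""
    hit = stats.get("numa_hit", 0)
    miss = stats.get("numa_miss", 0)
    total = hit + miss
    return miss / total if total else 0.0


def read_int(path):
    """Read a single integer tunable; None if the file is missing or unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    vmstat = Path("/proc/vmstat")
    if vmstat.exists():
        stats = parse_vmstat(vmstat.read_text())
        print(numa_counters(stats), miss_ratio(stats))
    print("zone_reclaim_mode:", read_int("/proc/sys/vm/zone_reclaim_mode"))
```
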
Tunings and observables (contd)
● ACPI
  ○ SLIT and SRAT
● Per process:
  ○ /proc/<pid>/numa_maps
  ○ /proc/<pid>/sched
● Auto NUMA balancing
  ○ CONFIG_NUMA_BALANCING in /proc/config.gz
● get_mempolicy(2), mbind(2), migrate_pages(2), move_pages(2), set_mempolicy(2), sched_getaffinity(2)
● libnuma, see numa(3)
  ○ Higher abstraction - numa_set_localalloc
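
To show what the per-process view offers, here is a small sketch that sums the `N<node>=<pages>` fields of `/proc/<pid>/numa_maps` to see where a process's pages currently live. The function name `pages_per_node` is hypothetical, and the sample lines are illustrative rather than captured output.

```python
import re
from collections import Counter
from pathlib import Path

# Matches per-node page counts such as 'N0=512' in a numa_maps line.
NODE_FIELD = re.compile(r"^N(\d+)=(\d+)$")


def pages_per_node(numa_maps_text):
    """Sum resident pages per NUMA node across all mappings of a process."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        # Fields 0 and 1 are the start address and the memory policy.
        for token in line.split()[2:]:
            match = NODE_FIELD.match(token)
            if match:
                totals[int(match.group(1))] += int(match.group(2))
    return dict(totals)


if __name__ == "__main__":
    sample = (
        "00400000 default file=/usr/bin/cat mapped=6 N0=6 kernelpagesize_kB=4\n"
        "7f5c3c000000 interleave:0-1 anon=1024 dirty=1024 N0=512 N1=512 kernelpagesize_kB=4\n"
    )
    print(pages_per_node(sample))
    own = Path("/proc/self/numa_maps")
    if own.exists():
        print(pages_per_node(own.read_text()))
```

A heavily skewed result for a large database process is a hint that the imbalance issues discussed later may apply.
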
AutoNUMA
● CPU follows memory
  ○ Reschedule tasks on same nodes as memory
● Memory follows CPU
  ○ Copy memory pages to same nodes as tasks/threads
● Heuristics
  ○ Fault statistics
  ○ Task grouping
  ○ Multi-resource optimization - cache, cpu, memory, starvation
    ■ Avoid thrashing
● Only CPU and memory?
  ○ For others, use manual pinning!
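
The fault statistics that drive these heuristics are also visible from user space. The sketch below, assuming a kernel built with CONFIG_NUMA_BALANCING, reads the `numa_hint_faults` and `numa_hint_faults_local` counters from `/proc/vmstat` text and reports how many hinting faults were already node-local. The function names are hypothetical, and the locality ratio is an illustrative summary, not something the kernel itself reports.

```python
from pathlib import Path


def hint_fault_locality(vmstat_text):
    """Share of NUMA hinting faults that hit local memory; None if no faults were recorded."""
    stats = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    faults = stats.get("numa_hint_faults", 0)
    local = stats.get("numa_hint_faults_local", 0)
    return local / faults if faults else None


def balancing_enabled(path="/proc/sys/kernel/numa_balancing"):
    """True/False for the runtime switch; None when the kernel lacks the feature."""
    try:
        return int(Path(path).read_text().strip()) != 0
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print("automatic balancing enabled:", balancing_enabled())
    vmstat = Path("/proc/vmstat")
    if vmstat.exists():
        print("hint fault locality:", hint_fault_locality(vmstat.read_text()))
```
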
NUMA Policies
● MPOL_DEFAULT
● MPOL_BIND
● MPOL_INTERLEAVE
  ○ Memory striping in hardware
● MPOL_PREFERRED
● MPOL_MF_MOVE | MPOL_MF_MOVE_ALL (flags for mbind(2) and move_pages(2), not policy modes)
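
For processes that cannot be modified, these policies are commonly applied from the outside with numactl. The helper below is a hypothetical sketch that maps a policy name to the corresponding numactl option (`--membind`, `--interleave`, `--preferred`) and builds an argument list; it only constructs the command and does not run it.

```python
# Policy names from the slide mapped to the numactl options that request them.
POLICY_FLAGS = {
    "MPOL_BIND": "--membind",
    "MPOL_INTERLEAVE": "--interleave",
    "MPOL_PREFERRED": "--preferred",
}


def numactl_argv(policy, nodes, command):
    """Build an argv list that launches `command` under the given memory policy.

    `nodes` is a numactl node string such as '0', '0-1' or 'all'.
    MPOL_DEFAULT needs no wrapper, so the command is returned unchanged.
    """
    if policy == "MPOL_DEFAULT":
        return list(command)
    if policy not in POLICY_FLAGS:
        raise ValueError(f"unsupported policy: {policy}")
    return ["numactl", f"{POLICY_FLAGS[policy]}={nodes}", *command]


if __name__ == "__main__":
    # 'mysqld' is only an example command name.
    print(numactl_argv("MPOL_INTERLEAVE", "all", ["mysqld"]))
    print(numactl_argv("MPOL_BIND", "0", ["redis-server"]))
```

The resulting list can be passed to `subprocess.run` on a machine where numactl is installed.
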
Databases
● Most databases support multiple cores and NUMA.
  ○ MAP_ANONYMOUS and O_DIRECT are common
● Most default to interleaving to avoid zone imbalance issues
  ○ Effects of imbalance
    ■ Swapping due to reclaim
    ■ OOM
  ○ Downsides to interleaving
  ○ MySQL, Cassandra et al.
● Pattern of accesses
  ○ Cause of imbalance
● Duality of applications vs. OS
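
The imbalance argument can be made concrete with a toy model. The sketch below is not how the kernel allocates pages; it is a deliberately simplified simulation, with hypothetical names and made-up capacities, that contrasts local-first placement with round-robin interleaving for a single large allocation.

```python
def place_pages(n_pages, node_capacity, policy, local_node=0):
    """Toy placement model: return the number of pages that end up on each node.

    policy 'local'      - try the local node first, then fall back to the next nodes
    policy 'interleave' - rotate the starting node for every page (round robin)
    """
    n_nodes = len(node_capacity)
    used = [0] * n_nodes
    for page in range(n_pages):
        start = page % n_nodes if policy == "interleave" else local_node
        for offset in range(n_nodes):
            node = (start + offset) % n_nodes
            if used[node] < node_capacity[node]:
                used[node] += 1
                break
        else:
            raise MemoryError("all nodes are full")
    return used


if __name__ == "__main__":
    capacity = [80, 80]  # pages per node, illustrative numbers only
    print("local:     ", place_pages(100, capacity, "local"))
    print("interleave:", place_pages(100, capacity, "interleave"))
```

In this model the local-first run fills node 0 completely while node 1 stays mostly free, which is the situation where reclaim and swapping can start on one node despite free memory elsewhere. Interleaving spreads the same allocation evenly at the cost of roughly half the accesses being remote.
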
Reclaim
● Swappiness
  ○ Anon vs. file-backed
● Zone reclaim
  ○ Single process can span multiple zones
  ○ Imbalance without any strategies
  ○ Watermarks
  ○ Databases suffer the most
    ■ They carry a lot of state!
  ○ Types of reclaim
● Imbalance
  ○ Why does this happen
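
Per-zone watermarks can be inspected in `/proc/zoneinfo`. The following sketch parses the free page count and the min/low/high watermarks for each zone and lists zones whose free pages have dropped below the low watermark. It assumes the common text layout of that file, which varies somewhat between kernel versions; the function names are hypothetical.

```python
import re
from pathlib import Path

ZONE_HEADER = re.compile(r"^Node (\d+), zone\s+(\S+)")


def parse_zoneinfo(text):
    """Return {(node, zone): {'free': n, 'min': n, 'low': n, 'high': n}}."""
    zones = {}
    current = None
    for line in text.splitlines():
        header = ZONE_HEADER.match(line)
        if header:
            current = zones.setdefault((int(header.group(1)), header.group(2)), {})
            continue
        if current is None:
            continue
        parts = line.split()
        if len(parts) == 3 and parts[:2] == ["pages", "free"] and parts[2].isdigit():
            current["free"] = int(parts[2])
        elif len(parts) == 2 and parts[0] in ("min", "low", "high") and parts[1].isdigit():
            # Keep the first occurrence; per-CPU lines use 'high:' and are skipped.
            current.setdefault(parts[0], int(parts[1]))
    return zones


def zones_below_low(zones):
    """Zones whose free pages are under the low watermark, sorted by (node, zone)."""
    return sorted(
        key for key, info in zones.items()
        if "free" in info and "low" in info and info["free"] < info["low"]
    )


if __name__ == "__main__":
    zoneinfo = Path("/proc/zoneinfo")
    if zoneinfo.exists():
        print(zones_below_low(parse_zoneinfo(zoneinfo.read_text())))
```
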
Access Pattern Optimizations
● Thread pool
  ○ Reuse of threads with longer lifetime
  ○ Explicit or implicit bind
    ■ numa_set_localalloc / numa_set_preferred
    ■ sched_setaffinity
    ■ CONFIG_NO_HZ and latency
● Global heaps - buffer pool, JVM
  ○ Allocation by proxy
  ○ mbind and MPOL_BIND
  ○ MAP_POPULATE (why? - First touch policy)
  ○ numa_set_preferred
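
An explicit bind for a long-lived worker can be sketched with the affinity calls that Python exposes on Linux (`os.sched_getaffinity` and `os.sched_setaffinity`). The helper name `pin_to_cpus` is hypothetical. The sketch only restricts where the caller runs; under the default first-touch behaviour, memory the worker touches afterwards would tend to be allocated on the node of those CPUs, but that is not enforced here.

```python
import os


def pin_to_cpus(cpus):
    """Restrict the caller to the given CPUs and return the resulting affinity set.

    Returns None on platforms without sched_setaffinity (for example macOS).
    CPUs outside the currently allowed set are ignored.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)
    target = set(cpus) & allowed
    if not target:
        raise ValueError("none of the requested CPUs are available to this task")
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        first_cpu = min(os.sched_getaffinity(0))
        print("pinned to:", pin_to_cpus({first_cpu}))
```

In a thread pool, each worker would typically call such a helper once at start-up, before allocating the buffers it owns.
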
Access Patterns (contd)
● Split Pools
  ○ Independent pools of memory in a database, e.g. multiple buffer pool instances
● Multiple instances
  ○ Mostly for simple databases.
    ■ Redis
  ○ Containers
● Hybrid
  ○ Linux kernel - boot and init
  ○ MySQL / InnoDB
    ■ MPOL_LOCAL for threads
    ■ MPOL_INTERLEAVE for global heaps
● Task Grouping
Credits!
● http://queue.acm.org/detail.cfm?id=2513149
● www.linux-kvm.org/images/7/75/01x07b-NumaAutobalancing.pdf
● http://events.linuxfoundation.org/sites/events/files/slides/Normal%20and%20Exotic%20use%20cases%20for%20NUMA%20features.pdf
● https://en.wikipedia.org/wiki/Non-uniform_memory_access
● https://lihz1990.gitbooks.io/transoflptg/content/02.%E7%9B%91%E6%8E%A7%E5%92%8C%E5%8E%8B%E6%B5%8B%E5%B7%A5%E5%85%B7/sample-output-of-the-numastat-command.png