Detecting Hidden Spam Bots (and other tales from the NetFlow front lines)
Jim Meehan
Director, Product Marketing
All contents © Kentik Inc. 2
Agenda
● What is flow data?
● Legacy solutions and frustrations
● Modern requirements and architecture
● Forensic use cases and real-world examples
● Detection use cases and real-world examples
What is NetFlow?
• De-facto standard for network traffic statistics
• Developed by Cisco; variants now supported by all major vendors
• Provides detailed insight into the entire network
• Does not inspect content, so user privacy is not affected
NetFlow records contain:
• Who communicated with whom
• How long
• Amount of data transferred
• Which protocol was used
• Additional information
Flow Sources and Insights
Sources:
• Routers
• Switches
• Servers
• Firewalls, VMs, etc.
Insights:
• Who? Source IP Address; Destination IP Address
• What? Source Port; Destination Port; Protocol
• When? Flow Start and End Time
• Usage? Packet Count; Octet Count
• Path? Input and Output Interface
• Route? NextHop; Source AS; Destination AS
• QoS? ToS; TCP Flags; Protocol
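The field groups on this slide can be pictured as a single record type. The following is a minimal sketch in Python; the class name `FlowRecord`, its field names, and the sample values are hypothetical and do not correspond to any particular NetFlow or IPFIX export format.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical flow record grouping the fields listed on the slide."""
    # Who?
    src_ip: IPv4Address
    dst_ip: IPv4Address
    # What?
    src_port: int
    dst_port: int
    protocol: int       # IANA protocol number, e.g. 6 = TCP, 47 = GRE
    # When? (seconds)
    start_time: float
    end_time: float
    # Usage?
    packets: int
    octets: int
    # Path?
    input_if: int
    output_if: int
    # Route?
    next_hop: IPv4Address
    src_as: int
    dst_as: int
    # QoS?
    tos: int
    tcp_flags: int

    def duration(self) -> float:
        """'How long': flow end minus flow start."""
        return self.end_time - self.start_time

    def avg_bps(self) -> float:
        """Average bits per second over the flow's lifetime."""
        d = self.duration()
        return self.octets * 8 / d if d > 0 else 0.0


# Sample values are illustrative only.
flow = FlowRecord(
    src_ip=IPv4Address("1.1.1.1"), dst_ip=IPv4Address("2.2.2.2"),
    src_port=80, dst_port=6500, protocol=6,
    start_time=0.0, end_time=10.0,
    packets=10, octets=1500,
    input_if=307, output_if=308,
    next_hop=IPv4Address("9.9.9.9"), src_as=1234, dst_as=9876,
    tos=0, tcp_flags=0x18,
)
print(flow.avg_bps())
```

Note that the record carries no payload, only metadata about the conversation.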
Enhanced Flow Data
Historically: Separate silos for Flow or Routing or SNMP or DNS
Today: Flow becomes much richer when combined with:
• Performance and layer 7 information
• BGP attributes
• Geography
• DNS lookups
• Tags (rack, department, customer…)
• Config changes and software versions
• Threat intelligence and known-bad IPs
Where to Get Enhanced Flow?
• On-server or sensor software - kprobe, nprobe, argus
• Commercial sensors - nBox, nPulse, and others
• Packet Brokers - Ixia and Gigamon (IPFIX, potentially more)
• IDS (Bro) – a superset of most flow fields, + app decode
• Web servers (nginx, varnish) – web logs + tcp_info for perf
• Load balancers – already see HTTPS-decoded URLs
• Cisco AVC, NetFlow Lite – generally only on small devices
The Promise of Flow Data
• Pervasive collection of all network activity and conversations
• “Instrument everything”
• Leverage existing network elements (routers / switches) as sensors
• Situational awareness from macro to micro
• Real-time and historical visibility
Network Traffic Data Use Cases
Anomaly Detection
Planning and Peering
Traffic Engineering
DDoS Defense
Performance Management
Threat Detection
Service Creation
Network Forensics
Business Analytics
Network Traffic Data Stakeholders
Is the network the problem?
Are we providing a great digital experience?
Are we under DDoS attack?
What does this traffic cost?
Where should we invest going forward?
Network Operations
Network Engineering
SecOps
DevOps
Finance
Sales / BD
Legacy Flow Data Storage and Processing
Flow Records
Src_ASN: 1234, Src_IP: 1.1.1.1, Src_Port: 80, Dst_ASN: 9876, Dst_IP: 2.2.2.2, Dst_Port: 6500, Interface: 307, Device: 9.9.9.9
Src_ASN: 1234, Src_IP: 1.1.1.1, Src_Port: 80, Dst_ASN: 9876, Dst_IP: 3.3.3.3, Dst_Port: 6500, Interface: 307, Device: 9.9.9.9
Src_ASN: 1234, Src_IP: 1.1.4.8, Src_Port: 80, Dst_ASN: 9876, Dst_IP: 2.2.2.2, Dst_Port: 6500, Interface: 307, Device: 9.9.7.7
Src_ASN: 1234, Src_IP: 1.1.1.1, Src_Port: 80, Dst_ASN: 9876, Dst_IP: 3.3.3.3, Dst_Port: 6500, Interface: 307, Device: 9.9.7.7
Reduction & Dedoop (https://dbs.uni-leipzig.de/dedoop)
Record Drop 999:1000 on interval
Src_ASN: 1234, Src_IP: 1.1.1.1, Src_Port: 80, Dst_ASN: 9876, Dst_IP: 2.2.2.2, Dst_Port: 6500
Src_ASN: 1234, Src_IP: 1.1.1.1, Src_Port: 80, Dst_ASN: 9876, Dst_IP: 3.3.3.3, Dst_Port: 6500
Src_ASN: 1234, Src_IP: 1.1.4.8, Src_Port: 80, Dst_ASN: 9876, Dst_IP: 2.2.2.2, Dst_Port: 6500
Src_ASN: 1234, Src_IP: 1.1.1.1, Src_Port: 80, Dst_ASN: 9876, Dst_IP: 3.3.3.3, Dst_Port: 6500
Map Phase
Src_ASN: 1234, Src_IP: 1.1.1.1, Src_Port: 80, Dst_ASN: 9876, Dst_IP: 2.2.2.2, Dst_Port: 6500
Src_ASN: 1234, Src_IP: 1.1.1.1, Src_Port: 80, Dst_ASN: 9876, Dst_IP: 3.3.3.3, Dst_Port: 6500
Src_ASN: 1234, Src_IP: 1.1.1.1, Src_Port: 80, Dst_ASN: 9876, Dst_IP: 3.3.3.3, Dst_Port: 6500
Shuffle & Sort
Src_ASN: 1234, 1; Src_ASN: 1234, 1; Src_ASN: 1234, 1
Src_IP: 1.1.1.1, 1; Src_IP: 1.1.1.1, 1; Src_IP: 1.1.1.1, 1
Src_Port: 80, 1; Src_Port: 80, 1; Src_Port: 80, 1
Dst_ASN: 9876, 1; Dst_ASN: 9876, 1; Dst_ASN: 9876, 1
Dst_IP: 2.2.2.2, 1
Dst_IP: 3.3.3.3, 1; Dst_IP: 3.3.3.3, 1
Dst_Port: 6500, 1; Dst_Port: 6500, 1; Dst_Port: 6500, 1
Reduce Phase & Data Cube
Src_ASN: 1234, 3
Src_IP: 1.1.1.1, 3
Src_Port: 80, 3
Dst_ASN: 9876, 3
Dst_IP: 2.2.2.2, 1
Dst_IP: 3.3.3.3, 2
Dst_Port: 6500, 3
Data Warehouse (Gigabytes of Storage)
Analytics / Visualization
User Query
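To make the cost of this pipeline concrete, here is a minimal Python sketch of the map and reduce steps, run over the three records shown on the slide. The function names `map_phase` and `reduce_phase` are hypothetical; the sketch only illustrates the per-field counting depicted here, not any specific product's implementation.

```python
from collections import Counter

# The three reduced records from the slide (Interface / Device already dropped).
records = [
    {"Src_ASN": 1234, "Src_IP": "1.1.1.1", "Src_Port": 80,
     "Dst_ASN": 9876, "Dst_IP": "2.2.2.2", "Dst_Port": 6500},
    {"Src_ASN": 1234, "Src_IP": "1.1.1.1", "Src_Port": 80,
     "Dst_ASN": 9876, "Dst_IP": "3.3.3.3", "Dst_Port": 6500},
    {"Src_ASN": 1234, "Src_IP": "1.1.1.1", "Src_Port": 80,
     "Dst_ASN": 9876, "Dst_IP": "3.3.3.3", "Dst_Port": 6500},
]


def map_phase(recs):
    """Emit ((field, value), 1) for every field of every record."""
    for rec in recs:
        for field, value in rec.items():
            yield (field, value), 1


def reduce_phase(pairs):
    """Sum the counts per (field, value) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


cube = reduce_phase(map_phase(records))
for (field, value), count in cube.items():
    print(f"{field}: {value}, {count}")
```

The resulting counts match the "Reduce Phase & Data Cube" box, but the link between fields is gone: the cube can say that 3.3.3.3 appeared twice, not which source, port, or device those two flows came from.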
Legacy Solution Frustrations
Siloed, incomplete, 20-year-old tools based on appliances and open source
• Turnkey
• Ops ⬄ BI
• Ad Hoc Queries
• Scale
• Detail
• Unified View
Real Network Visibility
Key Network Operator Requirements
For modern data-driven network and security operations:
● No data aggregation or pre-filtering
● Correlation (fusing) between data types
● Full resolution, searchable and stored for months
● Fast: Less than 10s for results. Cannot wait minutes to explore
● Network-savvy UIs and APIs (understands routing and CIDR)
● Detect anomalies: Should not have to watch graphs manually
● Data and alerts available across the company
● APIs to access raw or processed data for integration
● “0”-to-usable in minutes to weeks, not months to years
Modern Ingest Architecture
ROUTER
Decoder Modules:
• NetFlow v5
• NetFlow v9
• IPFIX
• SFlow
PCAP:
• PCAP agent
• proxy
DATA FUSION:
• BGP RIB
• BGP Daemons
• Custom Tags
• SNMP Poller
• Enrichment DB (Geo ←→ IP, ASN ←→ IP)
MemTable
Single flow fused row sent to storage
TRAFFIC-SAVVY DATASTORE
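As an illustration of the fusion step, the sketch below enriches one decoded flow with ASN, geography, and custom tags before it would be written to storage. It is a minimal Python sketch under stated assumptions: the lookup tables, the `fuse` function, the prefixes, and the tag values are all hypothetical, and a linear scan stands in for whatever prefix index a real system would use.

```python
import ipaddress

# Hypothetical enrichment tables (a real system would load these from
# BGP feeds, a geo database, and operator-defined tags).
ASN_BY_PREFIX = {
    ipaddress.ip_network("1.1.0.0/16"): 1234,
    ipaddress.ip_network("1.1.4.0/24"): 64500,
    ipaddress.ip_network("2.2.2.0/24"): 9876,
}
GEO_BY_PREFIX = {
    ipaddress.ip_network("1.1.0.0/16"): "US",
    ipaddress.ip_network("2.2.2.0/24"): "DE",
}
TAGS_BY_DEVICE = {"9.9.9.9": ["rack-12"]}


def longest_prefix_match(ip, table):
    """Return the value of the most specific prefix containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for net, value in table.items():
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, value)
    return best[1] if best else None


def fuse(flow):
    """Return a single fused row: the raw flow plus enrichment columns."""
    row = dict(flow)
    for side in ("src", "dst"):
        ip = flow[f"{side}_ip"]
        row[f"{side}_asn"] = longest_prefix_match(ip, ASN_BY_PREFIX)
        row[f"{side}_geo"] = longest_prefix_match(ip, GEO_BY_PREFIX)
    row["tags"] = TAGS_BY_DEVICE.get(flow["device"], [])
    return row


row = fuse({"src_ip": "1.1.4.8", "dst_ip": "2.2.2.2",
            "dst_port": 6500, "device": "9.9.9.9"})
print(row)
```

Enriching at ingest time means every stored row already carries its context, so later queries can filter or group on ASN, geography, or tags without a join.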
Modern Storage and Query Architecture
Flow Records
Src_ASN: 1234, Src_IP: 1.1.1.1, Src_Port: 80, Dst_ASN: 9876, Dst_IP: 2.2.2.2, Dst_Port: 6500, Interface: 307, Device: 9.9.9.9
Src_ASN: 1234, Src_IP: 1.1.1.1, Src_Port: 80, Dst_ASN: 9876, Dst_IP: 3.3.3.3, Dst_Port: 6500, Interface: 307, Device: 9.9.9.9
Src_ASN: 1234, Src_IP: 1.1.4.8, Src_Port: 80, Dst_ASN: 9876, Dst_IP: 2.2.2.2, Dst_Port: 6500, Interface: 307, Device: 9.9.7.7
Src_ASN: 1234, Src_IP: 1.1.1.1, Src_Port: 80, Dst_ASN: 9876, Dst_IP: 3.3.3.3, Dst_Port: 6500, Interface: 307, Device: 9.9.7.7
Ingestion & Enhancement
Src_ASN: 1234, Src_IP: 1.1.1.1, Src_Port: 80, Dst_ASN: 9876, Dst_IP: 2.2.2.2, Dst_Port: 6500, Interface: 307, Device: 9.9.9.9, Geo Data: +, BGP: +, Custom Data: +
Src_ASN: 1234, Src_IP: 1.1.1.1, Src_Port: 80, Dst_ASN: 9876, Dst_IP: 3.3.3.3, Dst_Port: 6500, Interface: 307, Device: 9.9.9.9, Geo Data: +, BGP: +, Custom Data: +
Src_ASN: 1234, Src_IP: 1.1.4.8, Src_Port: 80, Dst_ASN: 9876, Dst_IP: 2.2.2.2, Dst_Port: 6500, Interface: 307, Device: 9.9.7.7, Geo Data: +, BGP: +, Custom Data: +
Src_ASN: 1234, Src_IP: 1.1.1.1, Src_Port: 80, Dst_ASN: 9876, Dst_IP: 3.3.3.3, Dst_Port: 6500, Interface: 307, Device: 9.9.7.7, Geo Data: +, BGP: +, Custom Data: +
Data Warehouse (Petabytes of Storage)
Sub Query
Master Query
Open API: SQL, RESTful, Portal
Analytics / Visualization
User Query
Network Traffic Data Use Cases
Anomaly Detection
Planning and Peering
Traffic Engineering
DDoS Defense
Performance Management
Threat Detection
Service Creation
Network Forensics
Business Analytics
Incident #1: External Service Dependency
• Build system reports:
“Unexpected status code [429] : Quota Exceeded”
• Investigation reveals our build system can’t connect to Google-hosted container registry, gcr.io
• GCE admin console shows no indication of quota exceeded, error, or expiry
• Are we / were we talking to gcr.io?
Incident #1: Hosts talking to gcr.io
Incident #1: Whodunnit
• Relatively high pps from two hosts (k122 / k212) toward gcr.io
• Ended abruptly shortly after 11:00 UTC
• k122 / k212 are development VMs assigned to interns
• Interns were working on a registry project
• It all clicks: Interns’ script hammered gcr.io and got us blacklisted
• Without detailed traffic history data, time-to-root-cause would have been much longer (or the cause might never have been found)
Incident #2: Poor query performance
• Monitor indicated > 4 sec query response time, normally < 1 sec
• Also, network bandwidth alarm: 20+ Gbps traffic among 20+ nodes
• Drilled down to immediately identify affected microservice
• Aggregation service didn’t anticipate 50+ workers responding simultaneously during a large query over many flow sources
• < 30 min to troubleshoot, would have been hours+ without detailed visibility
• As a result, we rebuilt the aggregation service pipeline
Incident #2: Source IPs hitting agg service
Incident #2: Definitely Service Affecting
In addition to the spikes on port 14999 (aggregation), we saw a dip on port 20012 (ingest), a service collocated on the same node. Data collection was affected, not just query latency.
Incident #3: Hidden spam bots
• Spammers make outbound SMTP connections from origin server
• Origin is a host / network with bad reputation
• Spoofed source IP belongs to a hosting instance (reflector) that’s under their control and has a good reputation
• SYN/ACK goes back to reflector
• Tunneled back to origin over GRE
• Hosting network never sees outbound TCP/25 SYN, which would have been blocked
Incident #3: Find the reflectors
• Find the hosts that are receiving traffic from source port TCP/25
• AND who are also sending GRE
• Traffic volumes are small
• Probably won’t appear in Top-N of either condition individually
• How can those conditions be combined?
Incident #3: SQL over raw traffic archive
Incident #3: UNION with GRE sources
SELECT ipv4_src_addr,
       ROUND(MAX(f_sum_both_bytes$gre_src_mbps) * 8 / 1000000 / 60, 3) AS gre_src_mbps,
       ROUND(MAX(f_sum_both_bytes$smtp_dst_mbps) * 8 / 1000000 / 60, 3) AS smtp_dst_mbps
FROM (
    -- Subquery for src IPs sending GRE
    (SELECT ipv4_src_addr,
            SUM(both_bytes) AS f_sum_both_bytes$gre_src_mbps,
            0 AS f_sum_both_bytes$smtp_dst_mbps
     FROM all_devices
     WHERE ctimestamp > 60
       AND protocol=47
       AND (src_flow_tags ILIKE '%MYNETWORK%')
     GROUP BY ipv4_src_addr
     ORDER BY f_sum_both_bytes$gre_src_mbps DESC
     LIMIT 10000)
    UNION
    -- Subquery for dest IPs receiving traffic from source port TCP/25
    (SELECT ipv4_dst_addr,
            0 AS f_sum_both_bytes$gre_src_mbps,
            SUM(both_bytes) AS f_sum_both_bytes$smtp_dst_mbps
     FROM all_devices
     WHERE ctimestamp > 60
       AND protocol=6
       AND l4_src_port=25
       AND (dst_flow_tags ILIKE '%MYNETWORK%')
     GROUP BY ipv4_dst_addr
     ORDER BY f_sum_both_bytes$smtp_dst_mbps DESC
     LIMIT 10000)
) a
GROUP BY ipv4_src_addr
HAVING MAX(f_sum_both_bytes$gre_src_mbps) > 0
   AND MAX(f_sum_both_bytes$smtp_dst_mbps) > 0
ORDER BY MAX(f_sum_both_bytes$gre_src_mbps) DESC,
         MAX(f_sum_both_bytes$smtp_dst_mbps) DESC
LIMIT 100
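The same "both conditions" logic can be sketched outside SQL. The Python below is a minimal illustration over in-memory flow dictionaries; the function `find_reflectors`, the record keys, the `is_local` predicate, and the sample addresses are hypothetical stand-ins for the query's columns and the MYNETWORK tag filter.

```python
from collections import defaultdict

GRE, TCP = 47, 6  # IANA protocol numbers


def find_reflectors(flows, is_local):
    """Return local IPs that both send GRE and receive from TCP source port 25."""
    gre_bytes = defaultdict(int)   # local source IP -> GRE bytes sent
    smtp_bytes = defaultdict(int)  # local dest IP -> bytes received from TCP/25
    for f in flows:
        if f["protocol"] == GRE and is_local(f["src_ip"]):
            gre_bytes[f["src_ip"]] += f["bytes"]
        elif (f["protocol"] == TCP and f.get("src_port") == 25
              and is_local(f["dst_ip"])):
            smtp_bytes[f["dst_ip"]] += f["bytes"]
    # Keep only hosts that satisfy both conditions (the HAVING clause).
    both = gre_bytes.keys() & smtp_bytes.keys()
    return sorted(
        ((ip, gre_bytes[ip], smtp_bytes[ip]) for ip in both),
        key=lambda t: (-t[1], -t[2]),
    )


# Illustrative flows; addresses are made up.
flows = [
    {"protocol": GRE, "src_ip": "10.0.0.5", "dst_ip": "198.51.100.9", "bytes": 4000},
    {"protocol": TCP, "src_port": 25, "src_ip": "203.0.113.7",
     "dst_ip": "10.0.0.5", "bytes": 1200},
    {"protocol": GRE, "src_ip": "10.0.0.6", "dst_ip": "198.51.100.9", "bytes": 9000},
    {"protocol": TCP, "src_port": 25, "src_ip": "203.0.113.8",
     "dst_ip": "10.0.0.7", "bytes": 800},
]
suspects = find_reflectors(flows, is_local=lambda ip: ip.startswith("10."))
print(suspects)
```

Because each condition is aggregated over all matching hosts before the intersection, low-volume reflectors are found even when they would never appear in a Top-N view of either condition alone.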
Incident #3: Hosts meeting both conditions
Anomaly Detection
Anomaly Detection
• Continuously compare historical vs. current data
• Proactively find network conditions that impact:
  • Security
  • Performance
  • Cost
• Find out before users complain!
• With full traffic history details
  • Quickly determine root cause and reduce MTTI / MTTR
Policy Parameters
• Filters: Which subset of traffic should we inspect?
• Segment / Group By: Which “items” within that subset should we generate alerts for? (IPs, interfaces, ASNs, combinations)
• Metrics: What should we measure for each of those items? (bits/sec, packets/sec, unique src/dst IPs, etc.)
• Thresholds: Static, baseline, change in Top-N
• Latency: Duration above threshold before alarm?
• Notifications: Email, syslog, Slack, PagerDuty, etc.
• Actions: API call, route injection, hardware mitigation
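To show how these parameters fit together, here is a minimal Python sketch of a policy with a static threshold. The `Policy` class, the `evaluate` function, and the record keys are hypothetical; "latency" is modeled as a number of consecutive intervals above threshold, and notifications and actions are left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Policy:
    flt: Callable[[dict], bool]      # Filters: which flows to inspect
    group_by: Callable[[dict], str]  # Segment / Group By: the alerting key
    metric: str                      # Metrics: flow field summed per interval
    threshold: float                 # Thresholds: static limit per interval
    latency: int                     # Latency: consecutive intervals required


def evaluate(policy, intervals):
    """intervals: one list of flow dicts per time interval.

    Returns the keys that raised an alarm, in the order they fired.
    """
    streak = {}
    alarms = []
    for flows in intervals:
        totals = {}
        for f in flows:
            if policy.flt(f):
                key = policy.group_by(f)
                totals[key] = totals.get(key, 0) + f[policy.metric]
        for key in set(streak) | set(totals):
            if totals.get(key, 0) > policy.threshold:
                streak[key] = streak.get(key, 0) + 1
            else:
                streak[key] = 0
            if streak[key] == policy.latency:  # fire once per excursion
                alarms.append(key)
    return alarms


udp_per_dst = Policy(
    flt=lambda f: f["protocol"] == 17,
    group_by=lambda f: f["dst_ip"],
    metric="bytes",
    threshold=1000,
    latency=2,
)
intervals = [
    [{"protocol": 17, "dst_ip": "A", "bytes": 1500}],
    [{"protocol": 17, "dst_ip": "A", "bytes": 1500},
     {"protocol": 17, "dst_ip": "B", "bytes": 2000}],
    [{"protocol": 17, "dst_ip": "B", "bytes": 500}],
]
print(evaluate(udp_per_dst, intervals))
```

In this run, "A" stays above the threshold for two intervals and alarms, while "B" exceeds it only once and does not, which is the purpose of the latency parameter.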
Policy Example
Incident #4: Rogue host detection
• Policy in place to profile count of unique src IPs per dest IP
• Top N dest IPs profiled, everything else (long tail) compared to lowest item in Top N
• Alarm fired showing 587 unique sources to one dest IP
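The Top-N comparison described above can be sketched as follows. This is a minimal Python illustration, not the product's algorithm: the function names, the multiplicative `factor` threshold, and the sample traffic are assumptions, with only the figure of 587 unique sources taken from the slide.

```python
from collections import defaultdict


def unique_sources(flows):
    """Count distinct source IPs per destination IP from (src, dst) pairs."""
    seen = defaultdict(set)
    for src, dst in flows:
        seen[dst].add(src)
    return {dst: len(srcs) for dst, srcs in seen.items()}


def top_n(counts, n):
    """Keep the n destinations with the most unique sources."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])


def find_alarms(baseline_top, current, factor=2.0):
    """Compare each destination with its own baseline, or with the lowest
    Top-N entry if it is part of the long tail. Assumes a non-empty baseline."""
    floor = min(baseline_top.values())
    alarms = []
    for dst, count in current.items():
        expected = baseline_top.get(dst, floor)
        if count > expected * factor:
            alarms.append((dst, count, expected))
    return alarms


# Baseline: A has 10 unique sources, B has 5, C has 1; profile the Top 2.
baseline_flows = ([(f"s{i}", "A") for i in range(10)]
                  + [(f"s{i}", "B") for i in range(5)]
                  + [("s0", "C")])
baseline_top = top_n(unique_sources(baseline_flows), 2)

# Current: A is roughly normal, previously unseen D has 587 unique sources.
current_flows = ([(f"s{i}", "A") for i in range(12)]
                 + [(f"x{i}", "D") for i in range(587)])
alarms = find_alarms(baseline_top, unique_sources(current_flows))
print(alarms)
```

Comparing the long tail against the lowest Top-N entry lets a host with no history at all still trigger an alarm, without keeping a baseline for every address.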
Incident #4: Digging Deeper
• This IP was from DHCP range
• Zero traffic history prior to alarm
• Lots of TCP/30303 traffic, which is associated with cryptocurrency mining!
Incident #4: Enhanced Flow, More Context
• DNS flows labeled with L7 details
• More cryptocurrency evidence
• Looks like regular client host — but in a datacenter network?
Incident #4: Resolution
• Switch port was immediately disabled for this host
• Ops team remembered that a remote hands contractor was on-site in the cage that day
• Quick call to contractor’s cell phone confirmed that he had connected his laptop to the production LAN!
• New policies and procedures to prevent unknown hosts from accessing production networks
Incident #5: Load spike
• Policy built to profile UDP traffic volume per dest IP
• Alarmed on high bps / pps to an ingest node (fl13) based on deviation from historical baseline
• Correlated with high CPU on the same node
Incident #5: Top Source ASNs for Alarming Node
Incident #5: Load spike
• A very informative alert
• Indicating dramatically higher traffic volume from a customer
• Allowed us to quickly:
  • Understand where additional traffic was coming from
  • For which service
  • Verify this node was handling the additional load adequately
  • Instantly know where to look to verify other vitals
• Total time investment: < 10 minutes
Incident #6: Traffic Shift
• Policy built to baseline traffic volume per source ASN, per PoP
• Continuously compares current to historical
• Baseline incorporates normal daily / weekly variation
• Alarm indicating a drop in traffic from EdgeCast’s ASN at NYC PoP
• 1.2 Gbps observed, vs. 2.1 Gbps expected
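A baseline that follows daily and weekly variation can be approximated by averaging history per weekday and hour. The sketch below is a minimal Python illustration under that assumption; `build_baseline`, `deviation`, and the historical samples are hypothetical, and only the 1.2 Gbps observed and 2.1 Gbps expected figures come from the slide.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def build_baseline(history):
    """history: iterable of (datetime, gbps) samples for one ASN at one PoP.

    Returns the mean rate per (weekday, hour) slot, so the baseline
    follows normal daily and weekly variation.
    """
    buckets = defaultdict(list)
    for ts, gbps in history:
        buckets[(ts.weekday(), ts.hour)].append(gbps)
    return {slot: mean(values) for slot, values in buckets.items()}


def deviation(baseline, ts, observed):
    """Return (expected, relative change) for an observation at time ts."""
    expected = baseline[(ts.weekday(), ts.hour)]
    return expected, (observed - expected) / expected


# Three previous Mondays at 14:00 (illustrative samples averaging 2.1 Gbps).
history = [
    (datetime(2024, 1, 1, 14), 2.0),
    (datetime(2024, 1, 8, 14), 2.1),
    (datetime(2024, 1, 15, 14), 2.2),
]
baseline = build_baseline(history)
expected, change = deviation(baseline, datetime(2024, 1, 22, 14), 1.2)
print(f"expected {expected:.1f} Gbps, change {change:.0%}")
```

With these numbers the observed rate is roughly 43% below the expected value; a policy would alarm when that relative change crosses its configured threshold.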
Incident #6: EdgeCast traffic per PoP
• Total EdgeCast traffic remains relatively constant
• Drop in NYC traffic mirrored by increase in Ashburn traffic
• Policy has detected change in traffic distribution
• Outage? Policy change?
• Could be service affecting if insufficient capacity in NYC
Questions?