Using Set Cover to Optimize a Large-Scale Low Latency Distributed Graph

DESCRIPTION

Using a modified version of the set cover algorithm to reduce real-time query latencies in LinkedIn's distributed graph infrastructure.

Transcript

1. Using Set Cover to Optimize a Large-Scale Low Latency Distributed Graph
   Rui Wang, Staff Software Engineer
   SEARCH, NETWORK AND ANALYTICS
   ©2013 LinkedIn Corporation. All Rights Reserved.

2. Outline
   • LinkedIn's Distributed Graph Infrastructure
   • The Scaling Problem
   • Set Cover by Example
   • Evaluation
   • Related Work
   • Conclusions and Future Work

3. Outline (repeated)

4. LinkedIn Products Backed By Social Graph

5. LinkedIn's Distributed Graph APIs
   • Graph APIs
       • Get Connections: Who does member X know?
       • Get Shared Connections: Who do I know in common with member Y?
       • Get Graph Distances: What are the graph distances for each member returned from a search query? Is member Z within my second degree network so that I can view her profile?
   • Over 50% of queries are graph distance queries

6. Distributed Graph Architecture Diagram

7. LinkedIn's Distributed Graph Infrastructure
   • Graph Database (Graph DB)
       • Partitioned and replicated graph database
       • Distributed adjacency list
   • Distributed Network Cache (NCS)
       • LRU cache stores the second degree network for active members
       • Graph traversals are converted to set intersections
   • Graph APIs
       • Get Connections
       • Get Shared Connections
       • Get Graph Distances

8. Outline (repeated)

9. Graph Distance Queries and Second Degree Creation
   [Flow diagram; labels: get graph distances; cache lookup; [exists=false] retrieve 1st degree connections; [exists=true] set intersections; Partition IDs; partial merges; K-way merges; return]

10. The Scaling Problem Illustrated
   [Diagram of Graph DB and NCS nodes]
   • Increased query distribution
   • Merging and de-duping shift to a single NCS node

11. Outline (repeated)

12. Set Cover Problem
   • Definition
       • Given a set of elements U = {1, 2, …, m} (called the universe) and a family S of n sets whose union equals the universe, the set cover problem is to identify the smallest subset of S whose union contains all elements in the universe.
   • Greedy Algorithm
       • Rule to select sets: at each stage, choose the set that contains the largest number of uncovered elements.
       • Upper bound on the approximation ratio: O(log s), where s is the size of the largest set in S

13. Modified Set Cover Algorithm Example (cont.)
   • Build a map of partition ID -> Graph DB nodes for the input graph

14. Modified Set Cover Algorithm Example (cont.)
   • Randomly pick an element from U = {1, 2, 5, 6}: e = 5
   • Retrieve from map: nodes = {R21, R13}
   • Intersect R21 = {1, 5} with U, R13 = {5, 6} with U
   • Select R21
   • U = {2, 6}, C = {R21}

15. Modified Set Cover Algorithm Example (cont.)
   • Randomly pick an element from U = {2, 6}: e = 2
   • Retrieve from map: nodes = {R11, R22}
   • Intersect R11 = {1, 2} with U, R22 = {2, 4} with U
   • Select R22
   • U = {6}, C = {R21, R22}

16. Modified Set Cover Algorithm Example (cont.)
   • Pick the final element from U = {6}: e = 6
   • Retrieve from map: nodes = {R23, R13}
   • Intersect R23 = {3, 6} with U, R13 = {5, 6} with U
   • Select R23
   • U = {}, C = {R21, R22, R23}

17. Modified Set Cover Algorithm Example Solution
   • Solution compared to the optimal result for U, where U = {1, 2, 5, 6}

18. Outline (repeated)

19. Evaluation
   • Production Results
       • Second degree cache creation latency drops 38% at the 99th percentile
       • Graph distance calculation latency drops 25% at the 99th percentile
       • Outbound traffic drops 40%; inbound traffic drops 10%
   [Figure: normalized latency at the 95th and 99th quantiles, control vs. setcover; panels: network cache building, getdistances]

20. Outline (repeated)

21. Related Work
   • Scaling through Replications
       • Collocating neighbors [Pujol2010]
       • Replication based on read/write frequencies [Mondal2012]
       • Replication based on locality [Carrasco2011]
   • Multi-Core Implementations
       • Parallel graph exploration [Hong2011]
   • Offline Graph Systems
       • Google's Pregel [Malewicz2010]
       • Distributed GraphLab [Low2012]

22. Outline (repeated)

23. Conclusions and Future Work
   • Future Work
       • Incremental cache updates
       • Replication on GraphDB nodes through LRU caching
       • New graph traversal algorithms
   • Conclusions
       • Key challenges tackled
           • Work distribution balancing
           • Communication bandwidth
       • Set cover optimized latency for distributed queries by
           • Identifying a much smaller set of GraphDB nodes serving queries
           • Shifting workload to GraphDB to utilize parallel powers
           • Alleviating the K-way merging costs on NCS by reducing K
       • Available at https://github.com/linkedin-sna/norbert/blob/master/network/src/main/scala/com/linkedin/norbert/network/partitioned/loadbalancer/DefaultPartitionedLoadBalancerFactory.scala
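
The modified set cover walk-through in the example slides (build a partition ID -> node map, pick an uncovered partition at random, intersect only the nodes hosting it with U, select one, repeat) can be sketched in Python as follows. This is a minimal illustration, not the Norbert implementation linked above: the function names are hypothetical, the node table contains only the five nodes whose partitions appear in the slides, and breaking ties by taking the first best candidate is an assumption, since the slides do not say how ties are resolved.

```python
import random
from collections import defaultdict


def build_partition_map(node_partitions):
    """Invert node -> partition IDs into partition ID -> list of nodes."""
    partition_to_nodes = defaultdict(list)
    for node, partitions in node_partitions.items():
        for pid in partitions:
            partition_to_nodes[pid].append(node)
    return partition_to_nodes


def modified_set_cover(universe, node_partitions, rng=None):
    """Return a list of nodes whose partitions together cover `universe`.

    Each round looks only at the nodes hosting one randomly picked
    uncovered partition, instead of scanning every node as the classic
    greedy rule would.
    """
    rng = rng or random.Random()
    partition_to_nodes = build_partition_map(node_partitions)
    uncovered = set(universe)
    cover = []
    while uncovered:
        # Randomly pick an element e from the uncovered set U.
        e = rng.choice(sorted(uncovered))
        # Retrieve the nodes hosting e from the map.
        candidates = partition_to_nodes.get(e, [])
        if not candidates:
            raise ValueError(f"no node hosts partition {e}")
        # Intersect each candidate with U and keep the largest overlap.
        # Assumption: ties go to the first candidate found.
        best = max(candidates, key=lambda n: len(node_partitions[n] & uncovered))
        cover.append(best)
        # Remove everything the selected node covers (always includes e).
        uncovered -= node_partitions[best]
    return cover


# Nodes and partitions as they appear in the example slides; other
# nodes of the example cluster are not listed in the transcript.
NODES = {
    "R11": {1, 2},
    "R13": {5, 6},
    "R21": {1, 5},
    "R22": {2, 4},
    "R23": {3, 6},
}

cover = modified_set_cover({1, 2, 5, 6}, NODES, rng=random.Random(0))
print(cover)
```

Because the starting element is random and every intersection in this example ties, different runs can return either a two-node or a three-node cover for U = {1, 2, 5, 6}. The slides' own run ends with three nodes, C = {R21, R22, R23}. Every selected node covers at least the picked element, so the loop always terminates with at most |U| nodes.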