The Importance of using Small Solutions to solve Big Problems How to move a mountain (of data)

Christopher Gallo Technology Evangelist

SoftLayer, an IBM Company Houston, USA

[email protected]

Abstract— Designing applications that can produce meaningful results out of large-scale data sets is a challenging and often problematic undertaking. The difficulties in these projects are often compounded by designers using the improper tool or, worse, designing a new tool that is inadequate for the task. In the current state of cloud computing there exists a myriad of services and software to handle even the most daunting tasks; however, discovering these tools is often a challenge in and of itself. This paper presents a case study concerning the design of an application that uses minimal code to solve a large-data problem, as an exercise in choosing the proper tools and creating a quickly scalable application in a cloud environment. The study takes every registered Internet domain name and determines whether it is hosted by a specific hosting provider (in this case SoftLayer, an IBM Company). While the case may seem simple, the technical challenges presented are both interesting to solve and general enough to apply to a wide variety of similar problems. This case study shows the benefits provided by Infrastructure as a Service (IaaS), queues as a form of task distribution, configuration management tools for rapid scalability, and the importance of leveraging threads for maximum performance.

Keywords— Infrastructure as a Service; Cloud Scaling; Large-Scale Application Design

I. INTRODUCTION "The Cloud" is defined by The National Institute of

Standards and Technology as a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. [1]

Creating an application that is not only capable of, but optimized for, operating in "The Cloud" is challenging, in part due to the very distributed and dynamic nature of "The Cloud" and to the rapidly changing array of tools that need to be employed. This case study will solve the same problem with two different methods: one a traditional single node approach, and the other a cloud based approach. While many of the techniques required can, and will, be used for the single node approach, only when we apply these techniques to "The Cloud" will we see their optimal value.

The problem starts off fairly simply. We are tasked with iterating through every registered domain name and assessing whether or not it is hosted in a SoftLayer [2] datacenter. The scale of the problem becomes clear when we consider how many domains there could be. The only limitation on a domain name is that each label be at most 63 ASCII characters, usually only A-Z and the "-" character [3]. This gives us on the order of 26^63 possible combinations per Top Level Domain (TLD), of which there are now over 800 [4]. To make our task somewhat easier, various registrars allow access to their list of registered domain names, so we will restrict our search to only domains we know to exist, and will not attempt to search every possible domain name combination, as that would take an eternity. The registrars behind the most popular TLDs, .COM, .NET, and .ORG, all give out access, which comprises about 80% of the total registered domains, or around 150,000,000 domains total [5]. We will need to be content with that number, as obtaining access to 100% of domains is cost prohibitive for this case study.
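To give a sense of that scale, the rough arithmetic below compares the raw search space against the roughly 150 million domains that are actually registered. The calculation is a simplifying assumption: labels of up to 63 characters drawn only from the 26 letters, ignoring digits and hyphens.

# Back-of-the-envelope scale check for the brute-force approach we are
# avoiding: labels of length 1 to 63 over the 26 letters A-Z only
# (digits and hyphens ignored for simplicity).
possible_labels = sum(26 ** length for length in range(1, 64))
registered = 150 * 1000 * 1000   # ~150 million registered domains [5]
print("possible labels per TLD: a %d-digit number" % len(str(possible_labels)))
print("registered domains to check: %d" % registered)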

This paper will present the case study by first elaborating on some of the background technical challenges presented by iterating through one hundred and fifty million records and how we plan to solve them, along with the methodology we plan to use for the two cases. Then we will discuss the Base Case, which would be a traditional single node solution to this problem, and some of the lessons learned. Next we will study the Cloud Case, and how it compares to the Base Case. Finally we will close with some thoughts on what could have been done better along with some other concluding remarks.

II. BACKGROUND

It might seem unusual that a large IaaS provider like SoftLayer does not have ready access to information on which domains are being hosted on its infrastructure, but while SoftLayer keeps track of how many servers are online and how many IP addresses are being leased out, it does not keep track of anything that runs on a server once access is handed over to a customer. This leaves SoftLayer in the position of having to determine the number of domains hosted the hard way: by checking each and every registered domain.

Since there are around 150,000,000 domains to check, a monolithic program where each domain is processed fully before proceeding to the next is simply going to take too long; each task must be broken down and parallelized as much as possible. Multi-threaded programming is generally significantly more challenging than single-threaded programming, to such a degree that many programmers avoid it altogether [18]. Yet here multi-threading is going to be a must in order to get meaningful results in a reasonable amount of time. While multi-threaded programming has not gotten easier since the paper by Bridges et al. was published in 2007, there are now many new tools, explored here, that help make the task easier.

Even on a single machine, being able to take advantage of every core is paramount to maximizing the performance of an application [19], and the easiest way to do so for this application is to split every task into its own program so that they can run simultaneously and independently of each other. The tasks are broken down as listed below.

a. Domain Parser

This is the script responsible for taking the files provided by the various registrars and adding them to the RabbitMQ server. These zone files are downloaded ahead of time, since they can be fairly large, and are located on the system running the Domain Parser. To help minimize queue transactions, domains are packaged into groups of 25. The package is a simple array of objects, encoded as JSON. The logic for this code is shown in Fig. 1.
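A minimal sketch of that batching logic is shown below. It assumes the zone files have already been reduced to one domain per line; the queue name and connection details are placeholder assumptions rather than the exact values used in the study.

# Sketch of the Domain Parser (Fig. 1): read a zone file and publish
# packets of 25 domains to RabbitMQ as JSON. Host and queue names are
# illustrative, and the input file is assumed to hold one domain per line.
import json
import pika

PACKET_SIZE = 25

def parse_zone_file(path, channel, queue='domains_to_resolve'):
    channel.queue_declare(queue=queue, durable=True)
    packet = []
    with open(path) as zone_file:
        for line in zone_file:
            domain = line.strip().lower()
            if not domain:
                continue
            packet.append({'domain': domain})
            if len(packet) == PACKET_SIZE:
                channel.basic_publish(exchange='', routing_key=queue,
                                      body=json.dumps(packet))
                packet = []
    if packet:  # flush the final partial packet
        channel.basic_publish(exchange='', routing_key=queue,
                              body=json.dumps(packet))

if __name__ == '__main__':
    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    parse_zone_file('com.zone', connection.channel())
    connection.close()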

b. Domain Resolver

This script takes a packet of domains from the queue, attempts to resolve each one in a thread, and then adds an updated packet of domains to a final queue, adding in some new information about each domain. This section is where multi-threading really shines. The average time to resolve a domain successfully for this project was 0.306 seconds. However, even with optimizations to Unbound, the time to unsuccessfully resolve a domain was 2.051 seconds, which is a very long time for a CPU to wait for a result. Thankfully, threads allow us to continue attempting to resolve domains while we wait on a response from the upstream DNS server. The logic for this code is contained in Fig. 2.

DNS lookups are going to be the biggest bottleneck for this study, especially since it is expected that about 25% of the lookups will result in a failure [6], which significantly slows down the rate at which we can query domains. To mitigate this, a local DNS resolver service (Unbound DNS [7]) will be required so that control can be exercised over how long to wait on slow DNS servers, and so that caching can be limited to save on resource utilization. Each domain will only be queried once, so there should be no need for caching at all in this project.
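A rough sketch of the threaded lookup follows. It assumes the operating system's resolver points at the local Unbound instance, so lookup timeouts are governed by Unbound's configuration rather than by the script, and it uses concurrent.futures (available to Python 2.7 through the futures backport) as a simplification of the thread handling actually used.

# Sketch of the threaded lookup step inside the Domain Resolver (Fig. 2).
# Lookups go through the system resolver, assumed to be the local Unbound
# service, so slow or failed lookups time out on Unbound's terms.
import socket
from concurrent.futures import ThreadPoolExecutor

def resolve_one(entry):
    try:
        entry['ip'] = socket.gethostbyname(entry['domain'])
    except socket.gaierror:
        entry['ip'] = None   # unresolvable, expected for roughly 25% of lookups [6]
    return entry

def resolve_packet(packet, max_threads=25):
    # One thread per domain keeps the CPU busy while slow lookups
    # (around 2 seconds each) wait on upstream name servers.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(resolve_one, packet))

print(resolve_packet([{'domain': 'softlayer.com'}, {'domain': 'ibm.com'}]))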

c. Domain Checker

This script takes a packet of domains from the final queue and checks the IP address of each domain against our database of SoftLayer IP addresses. Once the check is complete, the domain object is updated with that information and finally saved to ElasticSearch. The logic is shown in Fig. 3.

Fig. 1. Domain Parser Logic

Fig. 2. Domain Resolver Logic

To evenly distribute domains between the processes of each program, a message queue needs to be added. For this project an Advanced Message Queuing Protocol (AMQP) compatible queue was chosen because it is an open standard supported by a wide variety of client and service applications [8]. The AMQP protocol is designed to be usable from different programming environments, operating systems, and hardware devices, as well as to make high-performance implementations possible on various network transports including TCP, SCTP (Stream Control Transmission Protocol), and InfiniBand [9].

Fig. 3. Domain Checker Logic

Specifically, RabbitMQ was chosen for this project due to its ease of setup and its support for the Python programming language through the pika library [20]; however, any AMQP compatible service would likely have worked just as well.
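The consuming side of each worker is similarly small. The sketch below uses the pika 0.10 call style referenced in [20]; the queue name and the processing callback are placeholders.

# Sketch of a worker pulling domain packets off the queue with pika.
# Queue name, host, and the process() body are illustrative placeholders.
import json
import pika

def process(packet):
    for entry in packet:
        print(entry['domain'])

def on_packet(channel, method, properties, body):
    process(json.loads(body))
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='domains_to_resolve', durable=True)
channel.basic_qos(prefetch_count=1)   # hand each worker one packet at a time
channel.basic_consume(on_packet, queue='domains_to_resolve')
channel.start_consuming()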

Although the WHOIS [22] database serves as a great resource for looking up which organization owns an IP address, it will not be used here, as SoftLayer has provided a database containing all of their IP address information. To make querying this database as fast as possible, the IP information will be converted from the common dotted quad format into its decimal representation using the netaddr Python library. These decimal numbers will be stored in an indexed MySQL database to facilitate fast queries [23].
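A hedged sketch of that lookup follows; the table and column names (a range table with ip_start and ip_end columns) and the connection credentials are assumptions for illustration, not the actual schema used.

# Sketch of the IP ownership check: netaddr converts the dotted quad to
# its integer form, and an indexed range table is queried directly.
# Table, column, and credential values are illustrative assumptions.
from netaddr import IPAddress
import MySQLdb

def is_softlayer_ip(cursor, dotted_quad):
    ip_value = int(IPAddress(dotted_quad))   # '10.0.0.1' -> 167772161
    cursor.execute("SELECT COUNT(*) FROM softlayer_ranges "
                   "WHERE %s BETWEEN ip_start AND ip_end", (ip_value,))
    return cursor.fetchone()[0] > 0

db = MySQLdb.connect(host='localhost', user='domains', passwd='secret', db='domains')
print(is_softlayer_ip(db.cursor(), '192.0.2.10'))   # False for this reserved test address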

Storing the data is the most important technical challenge to solve, since up until this point all the work we have done has been in memory and would be lost if the services were shut down. NoSQL is defined as a collection of next generation databases mostly addressing some of these points: being non-relational, distributed, open-source and horizontally scalable [10], which are precisely the problems that we are likely to encounter. There is a wide variety of NoSQL implementations, and for this project a Document Store style offering is the best fit for how the data will be used after it is stored. In light of the huge variety of NoSQL applications that could possibly work with this project, ElasticSearch was chosen for three main reasons.

• Storing data is fast, and as simple as forming an HTTP PUT request [21] (a sketch of such a request follows this list).

• Searching through the data is the main purpose of ElasticSearch, which will be useful for doing post mortem data analysis.

• Most importantly, Kibana [11] is a fantastic tool to visualize data stored in ElasticSearch, and was used to create many of the graphs in this case study.
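A minimal sketch of that PUT, using the requests library; the index name, document type, and field layout are assumptions, not the exact mapping used in the study.

# Sketch of storing one finished domain document in ElasticSearch with a
# single HTTP PUT [21]. Index, type, and field names are illustrative.
import json
import requests

ES_URL = 'http://localhost:9200/domains/domain'

def store_domain(doc):
    # Using the domain name as the document id means re-runs overwrite
    # an existing record instead of duplicating it.
    response = requests.put('%s/%s' % (ES_URL, doc['domain']),
                            data=json.dumps(doc),
                            headers={'Content-Type': 'application/json'})
    response.raise_for_status()
    return response.json()

store_domain({'domain': 'example.com', 'ip': '192.0.2.10', 'softlayer': False})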

Finally, all of this will be run on the Debian “jessie/sid” operating system, with most of the custom code written in Python 2.7. The operating system and programming language are just personal preferences, however; similar results should be expected with different choices made here.

III. Methodology

The end goal of this project is to determine with some accuracy the exact number of domains that resolve to a SoftLayer owned IP address. Yet there are three important milestones that will be observed in trying to reach this goal.

1. The Proof of Concept. During this phase, the core components of the project are put together, tested, and checked for consistency. This is critically important for any software project.

2. The Base Case. The first full run through the data set, which will serve as a benchmark for what performance looks like with a single server approach.

3. The Cloud Case. Here we will attempt to leverage as many resources as possible to answer our question in the shortest time possible; the result will be compared against the Base Case.

While finding the answer to our question may be interesting to some, especially SoftLayer, we have set up this study to help answer some questions that might be more relevant to the community, specifically those who lack extensive experience working with cloud technologies and distributed workloads. We hope to address the following general concerns with this case study.

Concern 1: What are the difficulties in solving a large-data problem with a monolithic approach?

Concern 2: How much time and effort can be saved with a cloud based approach compared to a monolithic approach?

These concerns are important because they mirror many of the concerns newcomers to the cloud computing space encounter, and addressing them will hopefully alleviate some of the hesitancy to adopt cloud computing.

IV. Proof of Concept

Creating a proof of concept version is critical to the success of any application. It is during this phase that we try to answer the most basic question: "can this plan actually work?". Even with most of the technology stack already chosen before attempting the proof of concept, creating one is important to prove that all the technology works well together before work is wasted on a solution that is impossible. This stage brought to light a collection of issues that had previously not been apparent on the surface.

As mentioned earlier, multi-threaded programming is inherently difficult, and working out these difficulties is much easier in the proof of concept phase than in a full production run. This phase also uncovered an interesting problem: the domain files were being parsed entirely too quickly, which had the result of crashing the RabbitMQ server almost instantly by exhausting the available RAM. Thankfully this issue was discovered early, and with some fine tuning of the RabbitMQ settings and some rate limiting on the parsing program, everything ended up running very smoothly afterwards.
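One simple way to implement that rate limiting is to have the parser poll the queue depth and pause while the backlog is large. The sketch below does this with a passive queue declaration; the threshold and poll interval are arbitrary illustrative values, not the tuning actually used.

# Sketch of rate limiting the Domain Parser: stop publishing while the
# backlog of unconsumed packets is above a threshold, so RabbitMQ never
# exhausts its RAM. Threshold and interval are arbitrary examples.
import time

MAX_BACKLOG = 50000   # packets allowed to sit in the queue
POLL_SECONDS = 5

def wait_for_room(channel, queue='domains_to_resolve'):
    while True:
        ok = channel.queue_declare(queue=queue, passive=True)
        if ok.method.message_count < MAX_BACKLOG:
            return
        time.sleep(POLL_SECONDS)

The parser would then call wait_for_room(channel) before each basic_publish.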

Aside from those major issues uncovered, this proof of concept phase helped illuminate which areas of the program were likely to break, and where best to put in logging messages to ensure any errors were being properly reported and handled. The data structure used to pass domain information between processes was finalized here, along with the end document that will eventually be stored in ElasticSearch.

V. Base Case

With the proof of concept finished, it is time to move on to actually running everything together at full speed. This involves ordering a new server, installing the required libraries and packages, configuring everything, and then setting all the programs running.

A. New Problems

Going from a proof of concept to a full run is generally bound to uncover new problems, and this transition was no exception. The first unexpected hurdle turned out to be the difficulty of turning a Python program into a background service, which was surprisingly complicated, at least for someone not intimately familiar with how Debian manages startup scripts. Secondly, while DNS lookups were expected to be fairly CPU expensive, they turned out to largely be the limiting factor in how many processes could be launched at once. Since none of the DNS lookups being performed would be in a cache already, the resolver needed to query the root name servers, then the zone name servers, and then finally the authoritative name servers for each domain.

Passing messages between processes with RabbitMQ was incredibly easy, but slightly error prone. The biggest issue was that occasionally the connector would hit a timeout, which would cause the resolver program to exit. Once some logic was added to the programs interacting with RabbitMQ to handle that exception and keep going, everything ran smoothly.
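That fix amounts to wrapping the queue interaction in a retry loop. A hedged sketch using pika's connection error classes is shown below; the back-off interval and the consume callback are placeholders.

# Sketch of the reconnect logic: catch pika's connection errors and
# re-establish the connection instead of letting the worker exit.
import time
import pika
from pika.exceptions import AMQPConnectionError

def run_forever(host, consume):
    while True:
        try:
            connection = pika.BlockingConnection(pika.ConnectionParameters(host))
            consume(connection.channel())   # blocks until a connection error
        except AMQPConnectionError:
            time.sleep(5)                   # back off briefly, then reconnect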

B. The Hardware

The power of a bare metal server has been well documented [12], so for this single server case a single bare metal server will be used to get the most optimal performance. This server is something that would be easily found in any datacenter, or at least something very similar. The server will be an Intel Xeon E3-1270 (4 cores @ 3.40GHz), 2 hard drives, and 8 GB of RAM, costing $0.368/hour [13]. This server was chosen because of its fast clock speed, cheap hourly rate, and enough RAM to hold all our data.

Fig. 4. Base Case Domains Per Hour

C. Results

Below is a breakdown of the average percentage of CPU time each part of our solution took up. These numbers are approximate averages to give a good sense of where most of the time was spent. As noted earlier, Unbound (our DNS resolver) takes up nearly 50% of the CPU time. RabbitMQ and ElasticSearch are both fairly low on this chart, which was a little unexpected; however, it goes a long way to show how powerful and well made these tools are. So it should be no surprise that the code written specifically for this study performed worse than tools written by industry experts.

TABLE I. CPU USAGE BREAKDOWN

Process                 CPU %
Unbound                 45%
Domain Resolver x 40    25%
Domain Parser            1%
Domain Checker           1%
RabbitMQ                15%
ElasticSearch           10%
Operating System         3%

Fig. 5. RabbitMQ Network Utilization

Overall, the whole system took about 300 hours to run, for a grand total of $102.672, averaging between 100 and 200 domains a second. A bargain, considering that the Intel Xeon E3-1270 v3 CPU alone costs $373.11 [24].

Increasing the number of cores will easily help reduce the runtime; however, there are only so many cores you can fit inside a single machine. The biggest hourly server SoftLayer provides is the Intel Xeon E5-2690 v3 (12 cores, 2.60 GHz) at $2.226/hour [14]. Since this server has three times as many cores as our original, it can be generously assumed this process would have taken a third of the time (100 hours). However, 100 hours at $2.226/hour is significantly more expensive, at $222.60.

Overall, once all of the programs were set running, the base case performed admirably without supervision. There are still some performance improvements that could have been made to the code and configuration of services, but that would take a significant amount of intimate knowledge about each service and some of the inner workings of the Python libraries involved, so to get our runtime and overall cost down, it is easier to simply spread everything out into a cloud deployment.

VI. Cloud Case

One of the many benefits of Cloud Computing is a smoother scalability path. Cloud Computing empowers any application with architectures that were designed to easily scale with added hardware and infrastructure resources [15]. This path to smoother scalability is exactly what this case will study. The simplest way to start scaling is to split off each service onto its own bare metal or virtual server. The RabbitMQ service will get a virtual server with plenty of RAM, and the ElasticSearch service, MySQL, and the Domain Parser will get a bare metal server with plenty of disk space and ample disk speed. Unbound and the Domain Resolver will be paired together on a series of virtual servers to maximize cores while minimizing costs. Each of these virtual servers will need at least two cores: one to run Unbound, and the other to work through all of the Domain Resolver threads. The Domain Checker service will also get a series of virtual servers, as it is also only dependent on CPU time, with very little disk or RAM usage.

A. New Problems

The first major problem in adopting a cloud computing deployment is the network. In the Base Case, data was transferred between services via the loopback interface, which is incredibly fast since the data never has to actually go over the wire. In the Cloud Case, however, it quickly became apparent that the default 100Mbps data transfer rate was entirely too slow for our application. Thankfully it is a simple matter to upgrade to a 1Gbps connection in a cloud environment, which provided plenty of bandwidth, with our application maxing out at around 250Mbps. Due to the amounts of data being transferred over the network, bandwidth costs also become a big concern. Luckily, SoftLayer does not meter traffic over their private network, even across regions [25]. Provided all network traffic is kept to the private network, there will be no additional costs for splitting out the infrastructure.

1. Network traffic handled by RabbitMQ

Configuration management starts to become a real problem in cloud environments due to the ever increasing number of nodes requiring configuration. Setting up a single server is a fairly trivial task for any seasoned administrator, but managing dozens of nodes that all need to be provisioned simultaneously becomes a bit of a nightmare. Thankfully there are a myriad of configuration management tools [16] that help manage cloud deployments, and for this project SaltStack [17] was selected for its ability to easily provision servers on the SoftLayer platform. Once SaltStack has been fleshed out with the details of the application and its deployment structure, creating the thirty-six servers required for the Cloud Case comes down to one simple command, and it takes about fifteen minutes for all nodes to be provisioned, configured, and running the programs they were told to run.
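In this study the ordering and provisioning were driven entirely by SaltStack, so no custom ordering code was required. Purely to illustrate how little is involved in programmatic ordering, the sketch below uses the SoftLayer Python bindings directly; the parameter values (datacenter, operating system code, and so on) are assumptions, and this is not the SaltStack configuration actually used.

# Illustrative only: ordering one hourly resolver node through the
# SoftLayer Python bindings. The study used SaltStack for this step;
# all parameter values here are assumptions.
import SoftLayer

client = SoftLayer.create_client_from_env(username='apiuser', api_key='apikey')
vs_manager = SoftLayer.VSManager(client)

node = vs_manager.create_instance(
    hostname='resolver01',
    domain='example.com',
    cpus=2,
    memory=1024,          # MB, matching the 2 core / 1 GB resolver nodes
    hourly=True,
    os_code='DEBIAN_LATEST',
    datacenter='hou02')
print(node['id'])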

B. The Hardware

a. Domain Master - Hourly Bare Metal - 4 cores @ 3.50GHz, 32 GB RAM - $0.595/hour

This server is responsible both for being the master for the SaltStack configuration management and for running the ElasticSearch, Kibana, MySQL, and Domain Parser services. This is the only bare metal server, since this is the only node where data is actually written to or read from a disk.

b. Rabbit Node - Virtual Server - 4 cores, 48 GB RAM, 1Gbps network - $0.606/hour

Responsible for the RabbitMQ service. 48 GB of RAM is a significant increase from the Base Case, which is due to the rate at which domains enter the queue. In the Base Case we limited the rate of the Domain Parser to keep pace with the Domain Resolver; in this case that rate limit has been removed, since the Domain Resolver will be scaled up significantly and the hardware provisioned here can support holding the entirety of the data being worked with. This now makes the network a limiting factor where it was not previously, hence the 1Gbps network connection.

c. 25 Resolver Nodes - Virtual Server - 2 cores, 1 GB RAM - $0.060/hour

Responsible for Unbound and the Domain Resolver script. Each node can run about 40 Domain Resolver scripts before maxing out the CPU. Because the Domain Resolver depends so heavily on Unbound, keeping them together worked out very well.

d. 10 Checking Nodes - Virtual Server - 2 cores, 1 GB RAM - $0.060/hour

Responsible for running the Domain Checker script. Each node can run about 80 Domain Checker scripts before maxing out the CPU. The amount of work required for the Domain Checker is significantly less than for the Domain Resolver, which is why the same number of domains could be processed with ten nodes instead of the twenty five used for the Domain Resolver.

Separating out the services in this manner has the very significant advantage of being able to use more CPUs and RAM than can fit into a single server. Each server, aside from the Domain Master, was managed entirely by SaltStack, from the ordering step all the way to the final provisioning and running of the needed services, without ever having to log in to the server itself.

Overall, the server count here was a bit on the conservative side; however, this setup still completely exceeded expectations, even without hitting any cloud bottlenecks. With 78 cores working, the Cloud Case managed to progress through between 6000 and 7000 domains a second, which is a huge increase over the Base Case.

C. Results

From the point domains started being added to RabbitMQ, this project took a little under 6 hours to fully complete the assigned task, and could have been even shorter had more Resolver Nodes been added. This project was left to run overnight, given the extreme length of time the Base Case took, which is why no additional Resolver Nodes were added; the project had already completed before it was noticed how fast it was going.

Despite the significantly higher CPU and RAM count used in the Cloud Case, the end cost was only $26.998, roughly a quarter of the Base Case cost. This should hopefully make it clear how powerful cloud architectures can be in terms of both time and money savings.

Since everything is also specified in SaltStack, redeploying this environment again is a trivial process, which is another huge benefit of using a cloud computing model for solving problems.

VII. Conclusion

In the face of increasingly vast and complicated workloads, traditional programming techniques are quickly becoming inadequate and time consuming. Distributing tasks across a wide array of discrete nodes is going to be a critical aspect of any large-data project, and being able to master the plethora of services that assist programmers in this space is a must for any developer going into the Cloud Era.

Fig. 6. Cloud Case Domains Per Hour

Message queues as a tool for task distribution and NoSQL data stores are going to play some of the biggest roles in these architectures. Hopefully this paper helped shed some light on how all these services can work together to build a successful application, even without a significant amount of prior knowledge of the products involved.

Finally, we can fully address our concerns from earlier.

Concern 1: The difficulties with solving large-data problems with a monolithic approach tend to be the limitations imposed by physical restrictions. Even though both the Cloud Case and the Base Case used a similar software architecture, the Base Case simply could not get a server big enough to go through the data in even a reasonable fraction of the time compared to the Cloud Case. Even though a myriad of unfamiliar technology was employed here, generally the only information required was how to get each service installed and how to get data into or out of the service in question. While the inner workings remain a mystery, the services themselves perform well with intelligently designed defaults.

Concern 2: With the Cloud Case clocking in at around 6 hours and $27, it greatly surpassed the Base Case in both time and cost, as the Base Case took around 300 hours and $103. Although it is counterintuitive, using more computing power can actually be cheaper if it reduces the required computation time for a program. Getting the Cloud Case set up in SaltStack was certainly challenging and time consuming; however, now that the work has been done, redeploying the Cloud Case takes no time at all, whereas redeploying the Base Case would still take a few hours of configuration by hand to get everything working.

In conclusion, it should hopefully be clear that expertise in cloud computing is not required to take advantage of the power it offers. Nor should distributed or parallelized programming techniques be avoided because they are difficult to understand; the performance improvements they allow are too great to ignore. Work is constantly being done to make these techniques easier to understand, and there are already a great many tools and concepts, such as queues for message transfers between programs, that allow even an inexperienced developer to make great choices in how to solve difficult problems.

VIII. Acknowledgments

This work was sponsored by SoftLayer, which is why they were the IaaS vendor of choice in this paper. While the pricing and servers are specific to SoftLayer, we expect the findings in this paper to be replicable with any other IaaS vendor. The developers at SaltStack were also incredibly helpful in sorting out issues relating to some of the more complicated configurations in the deployment.

REFERENCES

1. http://faculty.winthrop.edu/domanm/csci411/Handouts/NIST.pdf
2. https://softlayer.com
3. https://tools.ietf.org/html/rfc1035, section 3.1
4. https://ntldstats.com/
5. http://www.registrarstats.com/TLDDomainCounts.aspx
6. J. Jung, E. Sit, H. Balakrishnan, and R. Morris, "DNS Performance and the Effectiveness of Caching," IEEE/ACM Transactions on Networking, vol. 10, no. 5, October 2002.
7. https://www.unbound.net/
8. https://en.wikipedia.org/wiki/Advanced_Message_Queuing_Protocol
9. J. O'Hara, "Toward a commodity enterprise middleware," ACM Queue, vol. 5, no. 4, pp. 48-55, 2007.
10. http://nosql-database.org/
11. https://www.elastic.co/products/kibana
12. J. Ekanayake and G. Fox, "High performance parallel computing with clouds and cloud technologies," Cloud Computing, Springer Berlin Heidelberg, 2010, pp. 294-308.
13. https://www.softlayer.com/Store/orderHourlyBareMetalInstance/37276/64
14. https://www.softlayer.com/Store/orderHourlyBareMetalInstance/165559/103
15. M. Creeger, "Cloud Computing: An Overview," ACM Queue, vol. 7, no. 5, 2009.
16. https://en.wikipedia.org/wiki/Configuration_management
17. http://saltstack.com/
18. M. Bridges et al., "Revisiting the sequential programming model for multi-core," Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, IEEE Computer Society, 2007.
19. J. Dean and S. Ghemawat, "Distributed programming with MapReduce," Beautiful Code, Sebastopol: O'Reilly Media, Inc., 2007, p. 384.
20. https://pika.readthedocs.org/en/0.10.0/
21. https://www.elastic.co/guide/en/elasticsearch/guide/current/create-doc.html
22. https://whois.icann.org/en/about-whois
23. B. Schwartz, P. Zaitsev, and V. Tkachenko, High Performance MySQL: Optimization, Backups, and Replication, O'Reilly Media, Inc., 2012, pp. 115-130.
24. http://amzn.com/B00D697QRM
25. http://blog.softlayer.com/tag/private-network