

Real-Time Discovery in Big Data Using the Urika-GD™ Appliance

Technical White Paper October 2014

www.cray.com

FINANCIAL SERVICES · LIFE SCIENCES · CUSTOMER INSIGHTS · TELECOMMUNICATIONS · CYBERSECURITY · SCIENTIFIC RESEARCH · GOVERNMENT · FRAUD · SPORTS ANALYTICS


Table of Contents

Executive Summary
Discovery Through Human-Machine Collaboration
Using Graphs for Discovery Analytics
Introducing the Cray® Urika-GD™ Graph Analytics Appliance
Overview of the Multiprocessor, Shared Memory Architecture
Addressing the “Memory Wall” Through Massive Multithreading
Delivering High Bandwidth with a High Performance Interconnect
Enabling Fine-Grained Parallelism with Word-Level Synchronization Hardware
Delivering Scalable I/O to Handle Dynamic Graphs
Comparison: The Urika-GD Appliance’s Hardware and Commodity Hardware
The Urika-GD System Software Stack
The Graph Analytics Database
Enabling Ad-hoc Queries and Pattern-based Search with RDF and SPARQL
Augmenting Relationships in the Graph Through Inferencing
Benefits of the Urika-GD System’s Software Architecture
The Benefits of an Appliance
Integrating the Urika-GD Appliance into an Existing Analytics Environment
Building the Graph
Visualization
Integration with Other Analytics Packages
Conclusion

Discovery in Big Data Using the Urika-GD Graph Analytics Appliance | 1


Executive Summary

Discovery, the often accidental revelation that has changed the world since Archimedes, is a vital component of the advancement of knowledge. The recognition of previously unknown linkages between occurrences, objects and facts underpins advances in such diverse areas as life sciences (cancer drug discovery, personalized medicine or understanding the spread of disease), financial services (counter-party credit risk analysis, fraud detection, identity resolution or anti-money laundering) and government operations (cybersecurity threat analysis, person-of-interest identification or counterterrorism threat detection). New discoveries often deliver very high value: Consider the harm avoided through the proactive detection of fraud or counterterrorism operations, or the billions of dollars in revenue generated from a new cancer drug.

The traditionally slow pace of discovery is being greatly accelerated by the advent of big data. Discovery takes place when a researcher has a “Eureka!” moment, where a flash of insight leads to the formulation of a new theory, followed by a painstaking validation of that theory against observations in the real world. Big data can assist in both of these phases. Applying analytics and visualization to the huge volume of captured data stimulates insight, and the ability to test new theories electronically can speed validation a thousandfold, fulfilling the true promise of big data — as long as an organization’s systems are up to the challenge.

Traditional data warehouses and business intelligence (BI) tools built on relational models are not well suited to discovery, however. BI tools are highly optimized to generate defined reports from operational systems or data warehouses. They require the development of a data model that is designed to answer specific business questions — but the model then limits the types of questions that can be asked. Discovery, by contrast, is an iterative process, where a line of inquiry may raise previously unanticipated questions, which may in turn require new sources of data to be loaded. Accommodating these will likely require time-consuming, error-prone and complex extensions to the data model, for which overstretched IT professionals do not have time.

Graph analytics are ideally suited to meet these challenges. Graphs represent entities and the relationships between them explicitly, greatly simplifying the addition of new relationships and new data sources, and efficiently support ad-hoc querying and analysis. Real-time response to complex queries against multi-terabyte graphs is also achievable, with the appropriate platform.

Cray’s Urika-GD appliance is built to meet the challenging requirements of discovery. With one of the world’s most scalable shared memory architectures, the Urika-GD appliance employs graph analytics to surface unknown linkages and non-obvious patterns in big data, do it with speed and simplicity, and facilitate the kinds of breakthroughs that can give any organization — in government activities ranging from national security to fraud detection, medical and pharmaceutical research, financial services and even retail — a measurable advantage. The Urika-GD appliance complements existing data warehouses and Hadoop® clusters by offloading challenging data discovery applications while still interoperating with the existing analytics workflow.


A new approach is needed. An approach that:

• Separates data and its representation, allowing for new data sources and new relationships to be included without complex data model changes.

• Supports a wide range of ad-hoc analysis as needed to spark insight and validate new theories. Typically, these will take the form of searching for patterns of relationships, but other types of analytics and visualization will also be required.

• Operates in real time, supporting collaborative, iterative discovery in very large datasets.


Discovery through Human-Machine Collaboration

Discovery is the desired outcome of an investigative analytical process. Discovery in big data requires the collaboration of man and machine, where the guiding intellect — the ability to posit and infer — is human. In time, artificial intelligence may be able to make suppositions and draw conclusions but, for now, humans still have the advantage.

The process of discovery is iterative, as shown in Figure 1. The analyst must be able to test a hypothesis against all available data by posing a question that the technology answers in depth and then renders visually, shortening the time from question to result.

This requires the ability to ask questions that were not anticipated by those who built the knowledge base, referred to as ad-hoc queries in the database world. In discovery, you don’t know the next question until you get the first answer, and each iteration may require additional datasets for analysis. The addition of those datasets demands fast, flexible and powerful I/O. This cycle continues until the “Eureka!” moment, where the analyst makes a high-value breakthrough discovery. Example 1, on cancer drug discovery, illustrates this process.


Figure 1. The cycle of discovery: discovery through fast hypothesis validation.

“All truths are easy to understand once they are discovered; the point is to discover them.” (Galileo Galilei)


With traditional analytics technologies, discovery is challenging because of several interrelated difficulties:

1. Predicting what data is needed. Discovery depends upon the ability to import and combine new datasets, ranging from structured (databases) to semistructured (XML, log files) to unstructured sources (text, audio, video) as needed to support new lines of inquiry. Traditional analytics solutions use fixed data schema, and the addition of new types of data and relationships between data items involves complex, time-consuming schema extension, often requiring person-weeks or months of effort. Analysts using traditional technologies report spending up to 80 percent of their time on data import and schema manipulation.

2. Predicting what questions will be asked. Discovery depends on the ability to follow up on new lines of questioning, including questions about the relationships implied within the data. Traditional solutions depend upon optimizing data schemata for specific queries in order to deliver acceptable performance. Failure to do so results in nested table joins, which are very damaging to performance. IT groups have described these as “forbidden queries,” for their tendency to bring the analytics infrastructure to a grinding halt until the queries are killed.

3. Delivering predictable, real-time performance as data sizes and query complexity grow.

Discovery depends upon real-time results being delivered in response to queries. Traditional systems have difficulty achieving deterministic response times to ad-hoc queries, let alone real-time response as dataset sizes and query complexity grow. The result is that analysts “cherry pick” their lines of reasoning, driven by system capability rather than investigating all the avenues desired, introducing bias from their own preconceptions.

The result of these challenges is an organizational unwillingness to extemporaneously experiment with data, unless the value has been proven beyond the shadow of a doubt. This is a major constraint on innovation.


Example 1: Cancer Drug Discovery Using Graph Analytics

The Institute for Systems Biology (ISB) is approaching the challenge of cancer drug discovery using a systems biology approach, involving the modeling of the formation and growth of tumors at the molecular level. The objective is to understand the gene mutations and the biological processes that lead to cancer, to discover highly targeted treatments. This is very challenging because the volume of published, relevant scholarly articles and genomic and protein databases is beyond human ability to digest.

ISB tackled this problem by extracting the relationships contained in Medline articles, containing journal citations and abstracts for biomedical literature from around the world, using natural language processing. They combined these relationships with genomic and proteomic data of healthy and cancerous cells from the Cancer Genome Atlas and other databases, as well as their own experimental wet lab results, into a very large graph comprising billions of relationships. New sources of data were continually added as their relevancy was determined. Researchers wrote complex, ad-hoc queries, effectively validating hypotheses in-silico, in an iterative process where each new set of results suggested new lines of inquiry.

Graph analytics served ISB very well for discovery. They were able to quickly add new sources of data and new types of relationships as they were uncovered, and write sophisticated, partially specified queries looking for patterns of relationships in the data. Visualization of the results enabled quick comprehension, while the ability to export large sets of results for statistical processing helped guide the discovery process and provided statistical rigor. “In the amount of time it took to validate one hypothesis, we can now validate 1,000 hypotheses — increasing our success rate significantly,” remarked Dr. Ilya Shmulevich of the ISB.

This approach led to the discovery that many breast cancers have an increase in the expression of the ABCG2 gene, and that the HIV drug nelfinavir inhibits ABCG2. This drug is a strong candidate for repurposing to treat breast cancer, a discovery with considerable potential revenue opportunity.

Repurposing is a very cost-effective way of bringing new drug therapies to market, and graph analytics are now a proven way to identify these opportunities.


Using Graphs for Discovery Analytics

Using graphs in data analytics provides many advantages.

A graph consists of nodes, representing data items, and edges, representing the relationships between nodes, as shown in Figure 2. Graphs represent the relationships between data explicitly, facilitating analysis of patterns of relationships, a key aspect of discovery. Contrast this with traditional tabular representations, where the focus is on processing data (the rows in the tables), and where relationships are second-class entities, represented indirectly by table column headings and indices.

Graphs address the challenges presented by traditional analytics.

1. Predicting what data is needed. Graphs provide a flexible data model, where new types of relationships are readily added, greatly simplifying the addition of new data sources. Relationships extracted from structured, semistructured or unstructured data can be readily represented in the same graph.

2. Predicting what questions will be asked. Graphs have no fixed schema constraining the universe of queries that can be posed. Relationships are not hidden: it is possible to write queries that interrogate the very types of relationships that exist. Graphs also enable advanced analytics techniques such as community detection, path analysis, clustering and others.

3. Delivering predictable, real-time performance as data sizes and query complexity grow. Graph analytics can deliver predictable, real-time performance, as long as the hardware and software are appropriate to the task. Cray developed the Urika-GD appliance specifically for this task, as described in the following sections.

These attributes enable graph analytics to deliver value incrementally. As understanding grows, new data sources and new relationships can be added, building an ever more potent and accurate model.
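This flexibility can be sketched with a toy triple store (plain Python standing in for an RDF store; the entities and predicates below are illustrative, borrowed from the cancer drug example): every fact is a (subject, predicate, object) tuple, so a new relationship type is just a new tuple, with no schema migration.

```python
def match(triples, s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

graph = {
    ("drug:nelfinavir", "inhibits", "gene:ABCG2"),
    ("gene:ABCG2", "expressed_in", "tissue:breast_tumor"),
}

# A newly imported dataset introduces a relationship type the original
# model never anticipated -- no schema change, just more triples.
graph.add(("drug:nelfinavir", "approved_for", "disease:HIV"))

print(match(graph, p="inhibits"))
# → [('drug:nelfinavir', 'inhibits', 'gene:ABCG2')]
print(len(match(graph, s="drug:nelfinavir")))   # everything known about nelfinavir
# → 2
```

A relational schema would need a new table or column for "approved_for"; here the new edge is immediately queryable alongside the old ones.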


Figure 2. Graphs consist of nodes (data items) and edges (relationships).

“What differentiates Urika from the many graph databases available today is its ability to enable data discovery at scale and on an interactive basis.” (Chartis Research, “Looking for Risk: Applying Graph Analytics to Risk Management,” Peyman Mestchian)


Introducing the Cray Urika-GD Graph Analytics Appliance

The Urika-GD appliance was introduced in recognition of the important role that graph analytics can play in discovery. A large governmental organization approached Cray about performing discovery analytics in a large and constantly growing graph. They had investigated several technologies, but none satisfied their needs.

Analysis of this organization’s needs led to a canonical list of hardware requirements for all graph analytics (the software requirements will be discussed in a later section):

• Discovery analytics requires real-time response: A multiprocessor solution is required for scale and performance. Many graph analytics solutions are single-computer implementations, very useful for small problems, but unusable at scale.

• Graphs are hard to partition: A large, shared memory is required to avoid the need to partition graphs. Analyzing graph relationships requires following the edges in the graph. Regardless of the scheme used, partitioning the graph across a cluster will result in edges spanning cluster nodes. In most cases, the number of edges crossing cluster nodes is so large that a time-consuming network transfer is required each time those edges are crossed. Compared to local memory, even a fast commodity network such as 10 Gigabit Ethernet is at least 100 times slower at transferring data. Given the highly interconnected nature of graphs, users gain a significant processing advantage if the entire graph is held in a sufficiently large shared memory.

• Graphs are not predictable, and therefore cache-busting: A custom graph processor is needed to deal with the mismatch between processor and memory speeds. Analyzing relationships in large graphs requires the examination of multiple, competing alternatives. These memory accesses are highly data dependent and defeat traditional performance techniques such as pre-fetching and caching. Given that even RAM is roughly 100 times slower than the processor, and that graph analytics consists of exploring alternatives, the processor sits idle most of the time waiting for data to arrive. Cray developed hardware multithreading technology to alleviate this problem: threads can explore different alternatives, and each thread can have its own memory access in flight. As long as the processor supports a sufficient number of hardware threads, it can be kept busy. Given the highly nondeterministic nature of graph traversal, a massively multithreaded architecture yields a tremendous performance advantage.

• Graphs are highly dynamic: A scalable, high performance I/O system is required for fast loading. Graph analytics for discovery involves examining the relationships and correlations between multiple datasets and, consequently, requires loading many large, constantly changing datasets into memory. The sluggish speed of I/O systems — 1,000 times slower than the CPU — translates into graph load and modification times that can stretch into hours or days, far longer than the time required for running analytics. In a dynamic enterprise with constantly changing data, a scalable I/O infrastructure provides a tremendous performance advantage for discovery.
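The partitioning difficulty in the second point can be illustrated numerically. The sketch below makes purely illustrative assumptions (a uniformly random graph and a random two-way split; real partitioners do better, but on highly connected graphs a large cut typically remains): every cut edge is a traversal step that would pay the network penalty on a cluster.

```python
import random

random.seed(42)

# Build a random graph: 1,000 nodes, 10,000 edges with random endpoints.
n_nodes, n_edges = 1000, 10000
edges = [(random.randrange(n_nodes), random.randrange(n_nodes))
         for _ in range(n_edges)]

# Randomly assign each node to one of two cluster machines.
partition = [random.randrange(2) for _ in range(n_nodes)]

# Count "cut" edges: endpoints landing on different machines.
cut = sum(1 for u, v in edges if partition[u] != partition[v])
print(f"{cut / n_edges:.0%} of edges cross the partition")
# Roughly half of all edges are cut, so roughly half of all traversal
# steps would incur the ~100x slower network transfer.
```

In a shared-memory system the partition simply does not exist, so every edge traversal is a local memory reference.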

These requirements drove the design of the Urika-GD system’s hardware, and resulted in a hardware platform proven to deliver real-time performance for complex data discovery applications.


“Uncovering previously unknown patterns and relationships across increasingly large repositories of multistructured data represents one of the biggest opportunities to derive new sources of innovation, growth and productivity from analytics.” (Gartner, “Cool Vendors in Content and Social Analytics,” by Rita Salaam)


Overview of the Multiprocessor, Shared Memory Architecture

Cray’s Urika-GD appliance is a heterogeneous system consisting of Urika-GD appliance services nodes and graph accelerator nodes linked by a high performance interconnect fabric for data exchange (see Figure 3).

Graph accelerator nodes (“accelerator nodes”) utilize a purpose-built Threadstorm™ processor capable of delivering several orders of magnitude better performance on graph analytics applications than a conventional microprocessor. Accelerator nodes share memory and run a single instance of a UNIX-based, compute-optimized OS named multithreaded kernel (MTK).

Urika-GD appliance services nodes (“service nodes”), based on x86 processors, provide I/O, appliance and database management. Service nodes may be added as needed, enabling connectivity and management functions to scale for larger Urika-GD appliances. Each service node runs a distinct instance of a fully featured Linux® operating system.

The interconnect fabric is designed for high-speed access to memory anywhere in the system from any processor, as well as scaling to large processor counts and memory capacity.

The Urika-GD system architecture supports flexible scaling to 8,192 graph accelerator processors and 512 TB of shared memory. Urika-GD systems can be incrementally expanded to this maximum size as data analytics needs and dataset sizes grow.


Figure 3. Urika-GD system architecture: graph accelerator nodes (Threadstorm processors running the BSD-based MTK) and Urika-GD appliance services nodes (x86 processors running SUSE Linux, with network and RAID controller connections), linked by the interconnect.



Addressing the “Memory Wall” Through Massive Multithreading

“Memory wall” refers to the growing imbalance between CPU speeds and memory speeds. Starting in the early 1980s, CPU speed improved at an annual rate of 55 percent, while memory speed only improved at a rate of 10 percent. This imbalance has been traditionally addressed by either managing latency or amortizing latency. However, neither approach is suitable for graph analytics.

Managing latency is achieved by creating a memory hierarchy (levels of hardware caches) and by software optimization to pre-fetch data. However, this approach is not suitable for graph analytics, where the workload is heavily dependent on “pointer chasing” (the following of edges between nodes in the graph) because the random access to memory results in frequent cache “misses” and processors stalling while waiting for data to arrive.

Amortizing latency is achieved by fetching large blocks of data from memory. Vector processors and GPUs employ this technique to great advantage when all the data in the block retrieved is used in computation. This approach is also not suitable for graph analytics, where relatively little data is associated with each graph node, other than pointers to other nodes.

In response to the ineffectiveness of managing or amortizing latency on graph problems, a new approach was developed: the use of massive multithreading to tolerate latency. The Threadstorm processor is massively multithreaded, with 128 independent hardware “streams.” Each stream has its own register set and executes one instruction thread. The fully pipelined Threadstorm processor switches context to a different hardware stream on each clock cycle. Up to eight memory references can be in flight for each thread, and each hardware stream is eligible to execute every 21 clock cycles if its memory dependencies are met. No caches are necessary or present anywhere in the system, since the fundamental premise is that at least some of the 128 threads will have the data required to execute at any given time. Effectively, Threadstorm issues multiple random, dynamic memory references simultaneously, without pre-fetching or caching, turning the memory latency problem into a requirement for high bandwidth.
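The arithmetic behind this design can be checked with Little’s law (outstanding requests = latency × issue rate). The 128 streams, 21-cycle eligibility interval and eight in-flight references per thread are taken from the description above; the memory latency figure is an illustrative assumption, not a Threadstorm specification.

```python
# Hardware figures from the Threadstorm description in this paper:
streams_per_processor = 128    # independent hardware streams per processor
revisit_cycles = 21            # a stream is eligible to issue every 21 cycles
max_refs_per_thread = 8        # memory references in flight per thread

# Assumed round-trip memory latency in clock cycles (illustrative only;
# real latency varies with distance across the interconnect):
memory_latency_cycles = 600

# To keep the pipeline full, some stream must be ready on every clock,
# which needs only revisit_cycles runnable streams -- far fewer than 128.
pipeline_ok = streams_per_processor >= revisit_cycles

# Little's law: sustaining one memory reference per clock against a
# latency of L cycles requires about L references in flight at once.
refs_needed = memory_latency_cycles * 1                        # 1 ref/clock
refs_available = streams_per_processor * max_refs_per_thread   # 128 * 8

print(pipeline_ok, refs_available, refs_needed)   # → True 1024 600
```

With 1,024 possible outstanding references against roughly 600 needed, the assumed latency is fully hidden, which is exactly why the bottleneck shifts from latency to bandwidth.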

Delivering High Bandwidth with a High Performance Interconnect

The Urika-GD system uses a purpose-built high-speed network. This interconnect links nodes in a 3-D torus topology (Figure 3) to deliver the system’s high communication bandwidth. This topology provides excellent cross-sectional bandwidth and scaling, without layers of switches.

Key performance data for the interconnection network:

• Sustained bidirectional injection bandwidth of more than 4 GB/s per processor and an aggregate bandwidth of almost 40 GB/s through each vertex in the 3-D torus

• Efficient support for Threadstorm remote memory access (RMA), as well as direct memory access (DMA) for rapid and efficient data transfer between the service nodes and accelerator nodes

• The combination of a high-bandwidth, low-latency network and massively multithreaded processors makes the Urika-GD appliance ideally suited to handle the most challenging graph workloads

“A DBMS designed for known relationships and anticipated requests runs badly if the relationships actually discovered are different, and if requests are continually adapted to what is learned.” (Gartner, “Urika Shows Big Data Is More Than Hadoop and Data Warehouses,” by Carl Claunch)


Enabling Fine-Grained Parallelism with Word-Level Synchronization Hardware

The benefits of massive parallelism are quickly lost if synchronization between threads involves serial processing, in accordance with Amdahl’s Law. The Threadstorm processors and memory implement fine-grained synchronization to support asynchronous, data-driven parallelism and spread the synchronization load physically within the memory and interconnection network to avoid hot spots.
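The word-level full/empty synchronization used in this design can be approximated in software for illustration (a sketch only: the hardware attaches a full/empty bit to every 64-bit word and needs no locks or OS calls, whereas this Python version pays for a mutex and condition variable).

```python
import threading

class SyncWord:
    """A memory word with a full/empty bit, emulated with a condition variable.

    read_fe() blocks until the word is full, returns the value and marks it
    empty; write_ef() blocks until the word is empty, stores a value and
    marks it full -- producer/consumer semantics on a single word.
    """
    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write_ef(self, value):          # write-when-empty, then set full
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value, self._full = value, True
            self._cond.notify_all()

    def read_fe(self):                  # read-when-full, then set empty
        with self._cond:
            while not self._full:
                self._cond.wait()
            value, self._full = self._value, False
            self._cond.notify_all()
            return value

# Hand a value from one thread to another through the synchronized word.
word = SyncWord()
results = []
consumer = threading.Thread(target=lambda: results.append(word.read_fe()))
consumer.start()
word.write_ef("edge:123")
consumer.join()
print(results)   # → ['edge:123']
```

Because the synchronization state lives with the data itself, millions of such words can be synchronized independently with no central lock, which is what lets the hardware version scale without hot spots.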

Full/empty bits are provided on every 64-bit memory word in the entire global address space for fine-grained synchronization. This mechanism can be used across unrelated processes on a spin-wait basis. Within a process (which can have up to tens of thousands of active threads), multiple threads can also do efficient blocking synchronization using mechanisms based on the full/empty bits. The Urika-GD system’s software uses this mechanism directly, without OS intervention in the hot path.

Delivering Scalable I/O to Handle Dynamic Graphs

Any number of appliance services nodes can be plugged into the interconnect, allowing the appliance’s I/O capabilities to be scaled independently from the graph processing engine.

A Lustre® parallel file system is used to provide scalable, high performance storage. Lustre is an open-source file system designed to scale to multiple exabytes of storage, and to provide near linear scaling in I/O performance with the addition of Lustre nodes.


Comparison: The Urika-GD Appliance’s Hardware and Commodity Hardware

The Urika-GD appliance hardware provides a number of key advantages for discovery analytics over commodity cluster systems: a large, global shared memory, extreme processing power, a purpose-built, massively multithreaded graph acceleration processor, extreme memory bandwidth and extreme tolerance for memory latency. The table below sums up these differentiators and the benefit each provides for graph analytics.

LARGE GLOBAL SHARED MEMORY (scales up to 512 TB): Enables uniform, low-latency access to all the data, regardless of data partitioning, layout or access pattern. A large shared memory holds the entire graph, avoiding the need to partition and allowing unknown linkages and non-obvious patterns in the data to be surfaced with no advance knowledge of the relationships in the dataset.

EXTREME PROCESSING POWER (scales up to 8,192 processors): Achieving real-time performance requires employing as many processors as needed, all sharing the same memory. This scalability ensures interactive response on the most demanding workloads.

MASSIVE MULTITHREADING (128 hardware threads per processor): Graph analytics involves random memory access patterns, which cause individual threads to stall. Processors can tolerate latency if they have multiple concurrently executing hardware threads, so there are always threads ready to execute upon a memory stall. The Urika-GD appliance effectively investigates multiple changing hypotheses simultaneously in real time, enabling it to deliver two to four orders of magnitude improvement in performance.

EXTREME MEMORY PERFORMANCE (memory bandwidth scales with the size of the appliance): Traditional processors amortize memory latency: they assume data will have locality, so they retrieve blocks of data into a complex hierarchy of caches. Graphs, and discovery applications generally, do not have locality, so this approach does not work. The Urika-GD platform’s Threadstorm processors instead tolerate latency through massive multithreading; because each thread can issue up to eight concurrent memory references, massive memory bandwidth is required to keep the processors running at peak performance. Massive multithreading and extreme memory bandwidth go hand in hand to deliver the Urika-GD system’s performance advantage. Word-level memory synchronization hardware enables near-linear scaling to high thread and processor counts.

Together, these optimizations deliver an appliance finely tuned to the requirements of discovery in big data.



The Urika-GD System Software Stack

The Urika-GD system’s software stack was crafted with several goals in mind:

• Create a standards-based appliance for real-time data discovery using graph analytics

• Facilitate migration of existing graph workloads onto the Urika-GD appliance

• Allow users to easily fuse diverse datasets from structured, semistructured and unstructured sources without upfront modeling, schema design or partitioning considerations

• Enable ad-hoc queries and pattern-based searches across the entire dynamic graph database

The Urika-GD appliance software (Figure 4) is partitioned across the two types of processors in the service nodes and accelerator nodes, with each processing the workload for which it is best suited.

The service nodes run discrete copies of Linux and are responsible for all interactions with the external world — database and appliance management, and database services, including a SPARQL endpoint for external query submission. The service nodes also perform network and file system I/O. A Lustre parallel file system enables near-linear scalability across multiple service nodes, allowing even the largest datasets to be loaded into memory in minutes.

The accelerator nodes perform the functions of maintaining the in-memory graph database, including loading and updating the graph, performing inferencing and responding to queries.

Figure 4. Urika-GD system software architecture.

(Figure 4 diagram labels: Urika-GD appliance accelerator nodes; RDF, SPARQL, Java, visualization tools; Urika-GD appliance service nodes; graph analytics application services, including database manager, database services and visualization services; SUSE Linux 11; optimized multithreaded kernel; graph analytics database.)


The Graph Analytics Database

The graph analytics database provides an extensive set of capabilities for defining and querying graphs using the industry standard RDF and SPARQL. These standards are widely used for storing graphs and performing analytics against them.

Cray built the database and query engine from the ground up to take advantage of the massive multithreading on the Threadstorm processors. The standards-based approach and comprehensive feature set ensure that existing graph data and workloads can be migrated onto the Urika-GD platform with minimal or no changes to existing queries and application software.

With the Urika-GD appliance, query results can be sent back to the user or written to the parallel file system. The latter is particularly useful when the result set is very large.

Goal: Proactively identify patterns of activity and threat candidates by aggregating intelligence and analysis

Data sets: Reference data, people, places, things, organizations, communications....

Technical challenges: Volume and velocity of data; inaccurate, incomplete and falsified data

Users: Intelligence analysts

Usage model: Search for patterns of activity and graphically explore relationships between candidate behavior and activities

Augmenting: Existing Hadoop cluster and multiple data appliances


Figure 5. An example of identifying threat patterns.


Enabling Ad-hoc Queries and Pattern-based Search with RDF and SPARQL

Traditional relational databases and data warehouses require the definition of a schema specifying the design of the tables in which data is held. The database uses the schema to store related data together, improving performance through increased data locality. However, imposing a schema sacrifices the flexibility to run ad-hoc queries: if the queries to be run are not carefully anticipated in the schema design, many complex queries will result in multiple database joins (or even worse, self-referential joins), destroying a traditional database’s performance.

In contrast, graphs are “schema-free” in that the relationships between nodes are individually represented. RDF represents a relationship (an edge in a graph) using a triple (<subject>, <predicate>, <object>), where URIs uniquely identify each member of the triple. No advance consideration of the queries to be run is required. Additionally, structured, semistructured and unstructured data may be easily merged, without schema or layout considerations, by simply combining the sets of RDF triples.

SPARQL provides a powerful pattern-based search capability, enabling queries that would be infeasible on a relational database because of the complexity of the joins required. This pattern-search capability is of tremendous value in the search for unknown linkages and non-obvious patterns.
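As a toy illustration of why pattern matching over triples is so flexible (plain Python, not the Urika-GD query engine; the data and variable names are invented), a basic graph pattern can be evaluated as a join of triple patterns that share variables:

```python
# Illustrative sketch: RDF-style triples held as Python tuples, with a tiny
# SPARQL-like basic-graph-pattern matcher.

def match(triples, pattern, bindings=None):
    """Yield variable bindings for one (s, p, o) pattern; '?x' marks a variable."""
    bindings = bindings or {}
    for triple in triples:
        candidate = dict(bindings)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):              # variable: bind, or check consistency
                if candidate.get(term, value) != value:
                    ok = False
                    break
                candidate[term] = value
            elif term != value:                   # constant: must match exactly
                ok = False
                break
        if ok:
            yield candidate

def query(triples, patterns):
    """Join several triple patterns, SPARQL WHERE-clause style."""
    results = [{}]
    for pattern in patterns:
        results = [b for partial in results for b in match(triples, pattern, partial)]
    return results

# Fusing two "sources" is just set union -- no schema design needed.
source_a = {("alice", "met", "bob"), ("bob", "works-at", "acme")}
source_b = {("carol", "met", "bob"), ("carol", "works-at", "acme")}
graph = source_a | source_b

# Who met someone working at acme?
hits = query(graph, [("?x", "met", "?y"), ("?y", "works-at", "acme")])
print(sorted(b["?x"] for b in hits))   # ['alice', 'carol']
```

Note how the question is posed directly as a pattern, with no need for the data to have been laid out with this query in mind.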

Augmenting Relationships in the Graph Through Inferencing

The Urika-GD system’s graph analytics database supports the use of inferencing rules to augment the relationships contained therein. Common uses include:

• Reconciling differences between different data sources. The representation of relationships in RDF triples makes it easy to combine data from different sources. However, it is often necessary to write rules to establish equivalence between name usages. For example, the Austrian city of Vienna may also be known as “Wien,” depending upon the user’s native language. Similarly, knowing whether two identically named genes are the same or not requires knowledge of whether the genes come from the same species. Rules provide a convenient way to provide this knowledge.

• Inferring the existence of new relationships based upon deduction. This powerful and general capability can be used in a variety of ways. For example, if person A met person B, then symmetry implies that person B met person A. That connection can be codified as a rule (the question marks indicate variables to be instantiated):

    (?A met ?B) >> (?B met ?A)

Rules can also deduce new relationships transitively. The rule

    (?A is-a ?B) and (?B is-a ?C) >> (?A is-a ?C)

indicates that a member of a subclass is also a member of the parent class. For example, if codeine is a member of the class of drugs known as opiate agonists, and opiate agonists are members of the class of drugs known as opioids, then codeine is also a member of the opioid family. Chaining is supported: a rule can add a relationship to the database, which triggers one or more other rules to add further relationships, and so forth.
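The chaining behavior described above can be sketched in a few lines of Python (a hypothetical toy, not Cray's rule engine): rules are applied repeatedly until they produce no new triples, i.e., until a fixpoint is reached.

```python
# Toy forward-chaining sketch over RDF-style triples (invented example data).

def symmetry(triples):
    # (?A met ?B) >> (?B met ?A)
    return {(b, "met", a) for (a, p, b) in triples if p == "met"}

def transitivity(triples):
    # (?A is-a ?B) and (?B is-a ?C) >> (?A is-a ?C)
    isa = {(a, c) for (a, p, c) in triples if p == "is-a"}
    return {(a, "is-a", d) for (a, c) in isa for (c2, d) in isa if c == c2}

def infer(triples, rules):
    triples = set(triples)
    while True:
        new = set().union(*(rule(triples) for rule in rules)) - triples
        if not new:                 # fixpoint: no rule adds anything further
            return triples
        triples |= new              # chaining: new facts can trigger more rules

graph = {("alice", "met", "bob"),
         ("codeine", "is-a", "opiate agonist"),
         ("opiate agonist", "is-a", "opioid")}

closed = infer(graph, [symmetry, transitivity])
print(("bob", "met", "alice") in closed)          # True
print(("codeine", "is-a", "opioid") in closed)    # True
```

A production engine would index the triples rather than rescan them, but the fixpoint loop is the essence of chained inferencing.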

RDF and inferencing make the fusing of multiple data sources to create, augment and update graph databases straightforward.


Benefits of the Urika-GD System’s Software Architecture

The benefits of the Urika-GD system’s software stack for discovery applications are summarized in the table below.

The Benefits of an Appliance

There are a number of benefits to the use of an integrated, tested, optimized appliance that go well beyond the benefits of the hardware and software independently.

Easy to Deploy. The Urika-GD appliance augments your existing IT infrastructure with a datacenter-ready, standards-based appliance designed specifically for data discovery using graph analytics. It combines purpose-built hardware with a tuned and optimized software stack encompassing graph query endpoints, an application server, a full-featured W3C-standard graph database, interface and query tools for users, and administration tools for managing the appliance, database security and rights. Alternative approaches would require the user to separately buy hardware, database and administration tools, then configure, integrate and support them, adding cost, complexity and delay. The Urika-GD appliance easily integrates with other analytical and visualization tools to provide a rapid return on investment.

Single Point of Support. Cray offers a single point of support for discovery analytics, with expertise spanning hardware, software, graph analytics, consultation on user solutions and optimization.

The result is a lower initial cost of acquisition, a lower total cost of ownership and a quicker time to solution — analysts can be up and running within hours of installing the Urika-GD appliance.


DIFFERENTIATOR | URIKA-GD SYSTEM CAPABILITY | SIGNIFICANCE TO DISCOVERY ANALYTICS

DATABASE TUNED TO URIKA-GD APPLIANCE HARDWARE | Very highly parallel database | Achieving the benefits the Threadstorm processor can deliver requires a database capable of exploiting massive parallelism. Cray’s implementation provides near-linear scaling and performance that outstrips traditional solutions.

STANDARDS-BASED SOFTWARE STACK | Conforms to W3C RDF and SPARQL standards | RDF enables fusing of structured, semistructured and unstructured data from many different sources without schemas or layouts, and the ready addition of new data sources. SPARQL is designed to express pattern-based queries, a natural way to test hypotheses against the entire dataset. Because both are open W3C standards, migration to the Urika-GD appliance is easy and vendor lock-in is avoided.

ENTERPRISE-GRADE MANAGEMENT ENVIRONMENT | Comprehensive Graph Analytics Manager | The Graph Analytics Manager is a first for a graph database: a comprehensive, enterprise-grade tool providing database and appliance administration capabilities.

“Why is a purpose-built analytics appliance needed … The answer is that x86 clusters and associated software tools have some distinct limitations for the most challenging analytics work.”

(IDC, “Urika Shines Where Others Falter: Finding High-Value Relationships in Big Data” by Steve Conway, Earl Joseph, and Chirag DeKate)


Integrating the Urika-GD Appliance into an Existing Analytics Environment

Cray designed the Urika-GD system to integrate seamlessly into existing analytics and visualization environments, as illustrated in Figure 6. The many applications of analytics range from historical reporting to statistical clustering techniques to discovery, each optimally served by different solutions.

The Urika-GD appliance is designed to integrate with central repositories of analytics data, or to extract information from multiple data silos and integrate the relationships in the data into a unified graph. It may then be used to offload the computationally challenging graph analytics workload. The results the Urika-GD appliance generates may be displayed visually to the analyst, or exported to other analytics engines or back to the repository.

Figure 6. Integrating the Urika-GD appliance into an existing analytics environment.

(Figure 6 diagram labels: application; user visualization; load graph; datasets; share analytic/relationship results; data warehouse; other big data appliances; existing analytics environments; data sources.)

“[The] Urika does not need to stand alone: it can work collaboratively within a broader data warehousing environment, importing graph datasets and exporting results for further analysis.”

(The Bloor Group, “Urika, an In-Detail White Paper,” Philip Howard, Bloor)


Building the Graph

Graph analytics has the power to ingest a variety of types of data, ranging from highly structured database representations, through semistructured data (XML, for example), to unstructured information (text, video, audio). A variety of techniques are used to extract data from these different sources and build them into a cohesive set of relationships.

Relational databases, the principal structured representation, encode the relationships between fields into the structure of tables. Cray solution architects have tools that enable the automatic generation of RDF triples, using column headings as the relationship between a pair of fields. The tools have sophisticated mapping capabilities to facilitate integrating databases employing different schemas, and the inferencing capabilities of the Urika-GD appliance allow missing relationships to be generated automatically.
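A minimal sketch of the idea (illustrative Python only, not Cray’s actual ETL tooling; the table and naming scheme are invented) is to take the primary key as the subject and each column heading as the predicate:

```python
# Illustrative sketch: turning relational rows into RDF-style triples by using
# the primary key as subject and the column heading as predicate.

def table_to_triples(table_name, key_column, rows):
    triples = set()
    for row in rows:
        subject = f"{table_name}/{row[key_column]}"   # URI-like subject
        for column, value in row.items():
            if column != key_column:
                triples.add((subject, column, value))
    return triples

employees = [
    {"id": "e1", "name": "Ada", "department": "d7"},
    {"id": "e2", "name": "Grace", "department": "d7"},
]

triples = table_to_triples("employee", "id", employees)
print(("employee/e1", "name", "Ada") in triples)   # True
```

Triples generated this way from several tables, or several databases, can simply be unioned into one graph; real tooling adds mapping rules to reconcile differing schemas.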

Semistructured and unstructured data go through a multiphase transformation in which the entities and relationships described in the data are extracted and then integrated into the graph. A variety of tools are used, depending upon the source data representation. Cray solution architects are experts at identifying the optimum solution for every type of data and would be pleased to help design the pipeline for this extract-transform-load (ETL) phase.

Visualization

The ability to visually inspect the results of testing a hypothesis against the entire dataset is critical to collaborative discovery.

Cray has found that the best visualization is very strongly domain dependent. For example, the visualization demanded by biologists working on a cancer cure is very different from that required by cybersecurity analysts trying to identify a cyberattack. Consequently, the Urika-GD appliance integrates with a range of visualization solutions from a variety of vendors, as well as home-grown visualization solutions. Please contact Cray for the list of visualization packages supported and assistance with integrating custom visualization packages.

Integration with Other Analytics Packages

The power of graph analytics for discovery lies in the ability to test a hypothesis against the entire dataset. However, the results returned in response to a query may still be too large to be directly assimilated by a human. The answer often lies in mixing analytics techniques, for example by using statistical techniques to summarize data.

The Urika-GD graph analytics appliance makes such integration straightforward by supporting the ready export of large datasets. An example is discussed in Example 1 (page 4), where this capability was successfully used to combine statistical analysis and graph analytics in the search for cancer treatments.

1 The term “memory wall” was introduced in: Wulf, W.A. and McKee, S.A., “Hitting the Memory Wall: Implications of the Obvious,” Technical Note published by the Dept. of Computer Science, University of Virginia, Dec. 1994.

2 Amdahl’s Law predicts the speedup of a parallel implementation of an algorithm over its serial form, based upon the time spent in sequential and parallel execution. For example, if a program needs 20 hours on a single processor core, and a particular one-hour portion cannot be parallelized while the remaining 19 hours (95 percent) can be, then no matter how many processors are devoted to the parallelized execution, the minimum execution time cannot be less than that critical one hour. Hence the speedup is limited to at most 20×.
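The limit in the footnote’s example can be checked numerically (a quick illustrative Python calculation, not part of the original text):

```python
# Amdahl's Law: speedup = 1 / (f + (1 - f) / p), where f is the serial
# fraction and p the processor count. The footnote's example has f = 1/20.

def amdahl_speedup(serial_fraction, processors):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

f = 1.0 / 20.0                                # 5 percent serial
for p in (2, 16, 1024):
    print(p, round(amdahl_speedup(f, p), 2))  # speedup approaches, never reaches, 20x
```

Even with 1,024 processors the speedup stays below 20×, which is why minimizing the serial fraction matters so much for massively parallel hardware.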

3 As an example, see S.J. Plimpton and K.D. Devine, “MapReduce in MPI for Large-scale Graph Algorithms,” Parallel Computing, 2011, where they measured approximately 2,000x the performance over optimized, commodity hardware.

4 RDF stands for resource description framework, a format for representing directed, labeled graph data. It is a W3C standard; the full specification is available at http://w3.org/RDF.

5 SPARQL is a recursive acronym for SPARQL protocol and RDF query language. It is a W3C standard; a full specification, tutorials and examples are available at http://w3.org/2009/sparql/wiki/Main_Page.

6 Some argue that the representation of an RDF triple is still a schema. The key is that all graph data is represented using the same schema, so no a priori knowledge of the queries to be run is required.

7 URI stands for uniform resource identifier (a W3C standard).

8 This assumes the semantics of the data from the different sources are identical. If not, inference rules can sometimes map one set of semantics onto another.

“… IDC believes that IT organizations will increasingly be expected to augment today’s static searches with such an ability, which is becoming crucial … for a wide range of other Big Data problems …”

(IDC, “Urika Shines Where Others Falter: Finding High-Value Relationships in Big Data” by Steve Conway, Earl Joseph, and Chirag DeKate)


Conclusion

Many of the most beneficial applications of big data involve discovery. However, as dataset sizes grow, a collaborative human-machine approach to discovery is required to enable humans to cope with the size and complexity of the datasets.

Discovery in big data typically involves fusing information from many different sources and then testing hypotheses, expressed as complex queries, against the entire dataset. Traditional solutions hard-code the schema, making the addition of new data sources difficult, and display extremely poor performance on ad-hoc queries where the database isn’t optimized for the specific queries to be run.

Graph analytics supports discovery by overcoming these limitations. Expressing relationships explicitly makes it easy to add new types of relationships from new data sources, and it enables ad-hoc queries and powerful new graph analytics techniques.

Cray’s easy-to-deploy Urika-GD appliance supports discovery in large datasets. The appliance combines custom graph acceleration hardware with a highly optimized graph database, analytics capability and management to make adoption effortless, while avoiding vendor lock-in.

Finally, Cray’s solution focus all but ensures customer success. Engagements typically begin with a pilot, which demonstrates feasibility and the benefits to be gained. Cray’s subscription model eliminates acquisition risk, enabling quick adoption by enterprises seeking to solve vital business problems.

Urika-GD Big Data Appliance for Real-Time Data Discovery

• Gain new insights by easily fusing diverse datasets without upfront modeling and independent of linkage: the in-memory graph analytics database enables merging of structured, semistructured and unstructured data without schemas or layouts, and querying data without pre-specifying the connections between the data.

• Surface unknown linkages or non-obvious patterns without advance knowledge of relationships in the data: the shared memory model enables uniform, low-latency access to all the data regardless of data partition, layout or access pattern.

• Investigate multiple changing hypotheses simultaneously in real time: the hardware accelerator enables global access of multiple, random, dynamic memory references in parallel without prefetching or caching.

For more information on the Urika-GD graph analytics appliance:

See our website: www.cray.com

©2014 Cray Inc. All rights reserved. Specifications subject to change without notice. Cray is a registered trademark and Urika-GD is a trademark of Cray Inc. All other trademarks mentioned herein are the properties of their respective owners. 20140915
