Guest Editors' Introduction: Internet-Scale Data Management

Sam Madden, Massachusetts Institute of Technology
Maarten van Steen, VU University Amsterdam

In the past few years, many areas of computer science have been abuzz with talk about “big data.” Although different people mean different things when they use this term, the consensus is that data volumes are exploding, and that new tools are needed to handle these volumes. EMC and IDC have released a series of white papers estimating that in 2009, there were 800 exabytes of digital data stored worldwide (1 exabyte = 1,000 petabytes = 1,000,000 terabytes; see www.emc.com/digital_universe). By 2020, they predict that this number will be 44 times larger, or 35 zettabytes. (According to blogger Paul McNamara, if all of humanity were to post continuously to Twitter, it would take one century before the total volume of tweets occupied 1 zettabyte! See www.networkworld.com/community/node/60897.) This flood of data is due in large part to the increased desire to collect, query, and mine logs, clickstreams, and other Internet-based data. This is increasingly true as companies look to monetize their data through targeted advertising, clickstream analysis, and customer behavior profiling.

Such data isn’t just localized on a single machine but spread across many machines, which are often geographically distributed. This distribution arises as more and more datasets are available via the Internet, and as organizations themselves become larger and more geographically diverse. In addition, we’re realizing the need to give users more, and even exclusive, control over their own data, as witnessed by the privacy discussions that surround online social networks. Requiring distributed control on top of geographical distribution further complicates data management.

This demand for data processing has given rise to new technologies and techniques for Internet-scale data management. These include numerous “NoSQL” tools, designed to provide access to structured data without using a conventional relational database (often to address scalability or fault-tolerance concerns), as well as new relational systems aimed at operating at larger scales or higher rates than conventional databases. Additionally, researchers have developed several systems for accessing unstructured textual data, or heterogeneous and “semistructured” data spread across the Internet.

In this Issue

This special issue of IC presents five articles covering a range of technologies related to Internet-scale data management. Taken together, they give an exciting overview of recent developments in this space.

In their article “PNUTS in Flight: Web-Scale Data Serving at Yahoo,” Adam Silberstein and his colleagues provide an overview of their PNUTS system, a new database system that’s widely used inside Yahoo. It provides several features designed to target wide-area scalability, including the use of relaxed consistency to improve wide-area query performance, and a simplified, key-value-based query interface targeted toward the needs of many of Yahoo’s Web applications. The article describes PNUTS’ numerous advanced features, including notifications, materialized views, and bulk operations.
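To give a flavor of what a relaxed-consistency, key-value interface looks like, here is a minimal Python sketch of a versioned store that offers both a fast, possibly stale read and a read that demands a minimum version. The class and method names are illustrative only, not PNUTS’ actual API; see the article for the real design.

```python
# Illustrative sketch of a versioned key-value store with relaxed reads.
# Names are invented for this example; this is not the PNUTS API.

class Record:
    def __init__(self, value, version):
        self.value = value
        self.version = version

class KeyValueStore:
    """One replica of a table; each record carries a version number."""
    def __init__(self):
        self.records = {}

    def read_any(self, key):
        # Fast path: return whatever this replica has (possibly stale).
        return self.records.get(key)

    def read_critical(self, key, min_version):
        # Return the record only if it is at least as new as min_version;
        # a real system would otherwise forward to a fresher replica.
        rec = self.records.get(key)
        if rec is not None and rec.version >= min_version:
            return rec
        raise LookupError("replica too stale; retry at a fresher replica")

    def put(self, key, value):
        # Writes to a given record are serialized, yielding a single
        # timeline of versions for each key.
        old = self.records.get(key)
        version = (old.version + 1) if old else 1
        self.records[key] = Record(value, version)
        return version

store = KeyValueStore()
v = store.put("user:42", {"name": "Alice"})
print(store.read_any("user:42").value)          # may be stale on a replica
print(store.read_critical("user:42", v).value)  # guaranteed version >= v
```

The design point such interfaces make is that many Web applications tolerate slightly stale reads in exchange for low wide-area latency, paying for freshness only when they ask for it.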

In “Querying Heterogeneous Datasets on the Linked Data Web: Challenges, Approaches, and Trends,” André Freitas, Edward Curry, João Gabriel Oliveira, and Seán O’Riain describe some of the challenges that arise when trying to build a system to write structured queries (in a language such as SPARQL or SQL) over heterogeneous datasets spread across the Web. Key issues include semantic heterogeneity, in which related datasets are encoded using diverse schemas that aren’t always easy to combine. This difficulty arises for several reasons. First, two datasets might reference a particular person or object (entity) using different labels (keys) — this is the so-called entity-resolution problem. Second, two datasets might encode the same type of data (such as a salary) under different names (for example, “earnings” or “income”). The article surveys various techniques that have been developed to address these challenges and summarizes their strengths and limitations.
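The “earnings” versus “income” example lends itself to a toy illustration. The following Python sketch is our own, not from the article: it resolves both kinds of heterogeneity with hand-built mappings, whereas real Linked Data systems must derive such mappings automatically, using ontology matching and similarity measures.

```python
# Toy illustration of the two heterogeneity problems described above:
# the same attribute under different names, and the same entity under
# different labels. The mappings here are hand-curated for the example.

dataset_a = [{"person": "J. Smith",   "earnings": 50000}]
dataset_b = [{"person": "John Smith", "income":   52000}]

# Attribute-level mapping: both source names denote a canonical "salary".
ATTRIBUTE_MAP = {"earnings": "salary", "income": "salary"}

# Entity-level mapping: both labels denote the same person
# (the entity-resolution problem).
ENTITY_MAP = {"J. Smith": "John Smith", "John Smith": "John Smith"}

def normalize(record):
    out = {}
    for name, value in record.items():
        if name == "person":
            out["person"] = ENTITY_MAP.get(value, value)
        else:
            out[ATTRIBUTE_MAP.get(name, name)] = value
    return out

merged = [normalize(r) for r in dataset_a + dataset_b]
print(merged)
# [{'person': 'John Smith', 'salary': 50000},
#  {'person': 'John Smith', 'salary': 52000}]
```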

In their article, “Multiterm Keyword Search in NoSQL Systems,” Christian von der Weth and Anwitaman Datta describe a general methodology for adding keyword search to NoSQL systems (that is, structured data storage systems without a SQL interface). Specifically, for NoSQL systems based on a key-value store interface, they describe how to create an index that lets users look up keys or values that contain specific words, or sets of words. They demonstrate that their approach provides excellent performance and scalability.
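The general idea is straightforward to sketch. The following Python fragment, an illustration of the approach rather than the authors’ implementation, maintains an inverted index alongside the primary data and answers a multiterm query by intersecting the per-word key sets.

```python
# Minimal sketch: an inverted index over a key-value store, mapping each
# word to the set of record keys whose values contain it. A multiterm
# query is answered by intersecting the sets for each query word.

kv = {}      # primary key-value data
index = {}   # word -> set of keys (the inverted index)

def put(key, value):
    kv[key] = value
    for word in value.lower().split():
        index.setdefault(word, set()).add(key)

def keyword_search(*words):
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

put("doc1", "scalable NoSQL keyword search")
put("doc2", "keyword search over key value stores")
print(keyword_search("keyword", "search"))  # {'doc1', 'doc2'}
print(keyword_search("NoSQL", "search"))    # {'doc1'}
```

In a real deployment, the index entries would themselves be stored as key-value records in the same NoSQL system, so the index scales along with the data.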

“Scalable Execution of Continuous Aggregation Queries over Web Data,” by Rajeev Gupta and Krithi Ramamritham, describes a set of problems related to continuously monitoring a collection of distributed data sources for some condition. As an example, imagine a user who wishes to be notified when some condition over a collection of stock prices or volumes has been satisfied. Such applications might not involve huge volumes of stored data but still require a scalable infrastructure that can handle very high data rates. Many methods exist for architecting a system that provides this type of interface, and the authors survey these architectures and provide a set of guidelines summarizing the best one to choose depending on different architectural parameters (for instance, the update rate of underlying data, or number of queries).
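As a rough sketch of what such a continuous aggregation query involves, the following Python fragment (with a hypothetical threshold and callback, not drawn from the article) maintains a running aggregate over streaming updates and fires a notification when a user-specified condition becomes true.

```python
# Sketch of a continuous aggregation query: each incoming update revises
# one source's latest value; the running sum is adjusted incrementally
# and checked against the user's condition.

class ContinuousAggregate:
    def __init__(self, threshold, on_trigger):
        self.values = {}         # source -> latest value
        self.total = 0
        self.threshold = threshold
        self.on_trigger = on_trigger

    def update(self, source, value):
        # Adjust the running sum incrementally rather than rescanning
        # all sources on every update.
        self.total += value - self.values.get(source, 0)
        self.values[source] = value
        if self.total >= self.threshold:
            self.on_trigger(self.total)

monitor = ContinuousAggregate(
    threshold=1000,
    on_trigger=lambda total: print(f"alert: total volume hit {total}"))

monitor.update("AAPL", 400)
monitor.update("MSFT", 500)
monitor.update("AAPL", 600)   # running sum is now 1100 -> alert fires
```

The architectural questions the authors study begin where this sketch ends: where such monitors should run, how updates should be routed to them, and how to keep up when the update rate or the number of standing queries grows large.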

Finally, in “Enabling Web Services to Consume and Produce Large Datasets,” Spiros Koulouzis, Reginald Cushing, Konstantinos Karasavvas, Adam Belloum, and Marian Bubak describe the challenges associated with transferring large datasets using conventional SOAP-based Web services. Specifically, SOAP doesn’t deal well with large messages, and using XML increases message sizes and inhibits performance. As an alternative, the authors propose a solution that uses SOAP messages to coordinate data transfer, while actually transferring large files and data streams using an out-of-band communication channel that better handles large volumes. They show that their design provides substantially better performance than conventional SOAP.
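The pattern is easy to sketch in miniature. In the following Python fragment, a simplification with invented names rather than the authors’ actual design, the control message carries only a small reference, while the bulk payload travels through a separate data plane.

```python
# Sketch of out-of-band data transfer: the control message (which would
# be SOAP in the authors' setting) stays tiny regardless of payload size;
# the data itself moves over a separate channel, here simulated by a dict.

import uuid

data_plane = {}   # stands in for the out-of-band transfer channel

def publish_dataset(payload: bytes) -> str:
    """Stage a large payload out of band and return a small reference."""
    ref = f"data://{uuid.uuid4()}"
    data_plane[ref] = payload
    return ref

def make_control_message(operation: str, ref: str) -> dict:
    # Only the reference crosses the control plane -- this is what keeps
    # the SOAP layer from choking on large data.
    return {"operation": operation, "data_ref": ref}

def consume(message: dict) -> bytes:
    return data_plane[message["data_ref"]]

big = b"x" * 10_000_000   # 10 MB of data
msg = make_control_message("process", publish_dataset(big))
print(len(str(msg)), "bytes of control message for",
      len(consume(msg)), "bytes of data")
```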

The Road Ahead

Viewed as a whole, these articles provide a snapshot of the exciting and diverse research happening in the area of Internet-scale data management. However, they’re just a sampling of the broad challenges that Internet-scale data presents. The future of data management poses several fascinating questions.

How can recent advances in probabilistic models and statistical machine learning best be integrated into large-scale data management systems? Increasingly, users want to do more than simply query or extract simple statistics about their data, looking instead to fit functions, detect outliers, and make predictions on the data they store. Figuring out how these integrate with data management systems is nontrivial, especially as more sophisticated models were originally designed to operate on datasets that fit into a single machine’s RAM.
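For the simplest models, the RAM constraint can be sidestepped by restating the fit in terms of small sufficient statistics accumulated in a single pass over the data. The following Python sketch, our illustration rather than a system from this issue, fits a least-squares line this way:

```python
# Fit y = a*x + b in one pass using running sums (sufficient statistics),
# so memory use is constant no matter how large the input stream is.
def fit_line(stream):
    n = sx = sy = sxx = sxy = 0.0
    for x, y in stream:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Lazily generated data: one million points on the line y = 2x + 1.
points = ((float(x), 2.0 * x + 1.0) for x in range(1_000_000))
print(fit_line(points))  # approximately (2.0, 1.0)
```

More sophisticated models resist this kind of decomposition, which is precisely where the integration question gets hard.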

Is there a better language than SQL for data access on the Internet? If so, what is it? Ideally, the language chosen would be able to express rich computation over data — such as the statistical models described in the previous paragraph — in a more natural way than SQL-based user-defined functions, while preserving SQL’s ability to parallelize naturally over a very large number of processors.

What is the role of analytic SQL-based databases vis-à-vis MapReduce? Several lively debates have occurred on this topic in blogs over the past few years, and both classes of systems are quite widely used. The widespread adoption of SQL-like interfaces such as Hive and Pig on top of MapReduce suggests that many users prefer a declarative, high-level language. Increasingly, these tools’ authors are adopting techniques from the relational database community for query optimization, materialized views, and so on. Furthermore, database vendors are rapidly adding support to both ingest data from and output data to HDFS (the Hadoop Distributed File System). Both trends suggest that these systems might converge, or at least learn to peacefully coexist.
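To make the contrast concrete, here is the classic word-count example written in the MapReduce style in Python (illustrative only, not Hadoop code), with the roughly equivalent declarative query a Hive-like layer would accept shown in a comment.

```python
# Word count in the MapReduce style: a map phase emits (word, 1) pairs,
# and a reduce phase sums counts per word after grouping by key.

from itertools import groupby
from operator import itemgetter

def map_phase(docs):
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    grouped = groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))
    for word, group in grouped:
        yield (word, sum(count for _, count in group))

docs = ["big data big tools", "data tools"]
print(dict(reduce_phase(map_phase(docs))))
# {'big': 2, 'data': 2, 'tools': 2}

# A Hive/Pig user instead writes roughly:
#   SELECT word, COUNT(*) FROM words GROUP BY word;
# and leaves the map/shuffle/reduce plumbing to the system.
```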

How can we provide end users with control of their data? As noted, concern is increasing that large quantities of information about individuals are in the hands of a few very large organizations. What’s needed are simple mechanisms that give “data back to the people,” yet at the same time allow that data to be used for their owners’ benefit. For example, building good recommendation systems requires gathering and processing a broad range of data from different people to optimally serve individuals.

As multisensor devices such as smart phones become commonplace, how can we best process the data coming from them? Substantial research and development efforts are under way that will let us continuously monitor the whereabouts of people, thus generating self-tracking data (for example, detailed activity measurements), data on location and movement (estimating road traffic from phone position data), and so on. Likewise, enough sensory input will soon exist to let us accurately derive actual social ties. The processing of continuous data streams presents a new field of research in which large-scale data management plays a crucial role.

Indeed, the Internet already introduces a very diverse collection of applications and associated data, including microblogs such as Twitter, sensor and location data from mobile phones, and human-supplied data from crowdsourcing systems such as Amazon’s Mechanical Turk. Streaming, storing, querying, and mining this data introduces yet more new challenges that will continue to drive research and innovation in data management.

Sam Madden is an associate professor at the Massachusetts Institute of Technology. His research interests include databases, mobile computing, sensor networks, and distributed systems. Madden has a PhD in computer science from the University of California, Berkeley. Contact him at [email protected].

Maarten van Steen is a full professor at VU University Amsterdam. His research interests include large-scale distributed systems and complex networks. Van Steen has a PhD in computer science from Leiden University. He’s an associate editor in chief for IEEE Internet Computing and a member of IEEE and the ACM. Contact him at [email protected].

