Source: nms.csail.mit.edu/~arvindt/functiondb.pdf

Representing and Querying Regression Models in a DBMS

Arvind Thiagarajan, MIT CSAIL

[email protected]

Samuel Madden, MIT CSAIL

[email protected]

ABSTRACT

Curve fitting is a widely employed and useful modeling tool in several financial, scientific, engineering and data mining applications, as well as in applications like sensor networks that need to tolerate missing and/or noisy data. These applications need to both fit functions to their data using regression, and pose relational-style queries over these functional models. Unfortunately, existing DBMSs are ill suited for this task because they do not include support for creating, representing and querying this functional data, short of brute-force discretization of the functions into a collection of tuples. In this paper, we describe FunctionDB, a novel DBMS that treats functions output by regression as first-class citizens that can be queried and manipulated like traditional relations. The key contributions of this paper are a simple, compact, algebraic representation for regression models in a DBMS as piecewise functions, and an algebraic query processor that executes relational queries directly on this representation as combinations of algebraic operations like function inversion, zero finding and symbolic integration. We evaluate FunctionDB on two real-world data sets: measurements from a temperature sensor network, and a collection of traffic traces from cars driving on Boston roads. We show that operating directly in the functional domain has substantial advantages in terms of accuracy (over 15% for some queries) and order-of-magnitude (10x-100x) performance gains over existing approaches that represent models as discrete collections of points.

1. INTRODUCTION

Relational databases have traditionally taken the view that the data they store is a set of discrete observations. This is clearly reasonable when storing individual facts, such as the salary of an employee or the description of a product. However, when representing time- or space-varying data, such as a series of temperature observations, the trajectory of a moving object, or a history of salaries over time, a set of discrete points is often neither the most intuitive nor the most compact representation. Indeed, for researchers in many fields – from social sciences to biology to aeronautics to computer science – a common first step in understanding a set of data points is to model those points as a collection of curves, typically generated using some form of regression (i.e., curve fitting). Regression – a form of modeling – helps smooth over errors and gaps in raw data points, yields a compact and more accurate representation of those points as a few parameters, and provides insight into the data by revealing trends and outliers. Once in this curve domain, it is natural to ask questions over the fit data directly, looking, for example, for curves that intersect, are confined within a certain area, or that have the maximum slope.

In this paper, we describe a system we are building, called FunctionDB, that allows users to directly query the functions output by regression inside a database system. By pushing this support into the database, rather than requiring the use of an external curve fitting and analysis tool, users can manage these models just like any other data, providing the benefits of declarative queries, indexability, and integration with existing database data.

One alternative to modeling might be to simply run queries over the raw data points. Unfortunately, it is usually not desirable or even feasible to directly query raw data, because raw data are either missing (necessitating interpolation), noisy (necessitating smoothing) or simply unavailable (necessitating prediction via extrapolation). For example, in a sensor network setting, where a number of sensors are used to monitor some remote environment, it may be necessary to interpolate the data to predict the sensor readings at locations where sensors are not physically deployed. Also, sensors may occasionally fail or malfunction, or report garbage values (due to low batteries and other anomalies) that must be smoothed or eliminated by filtering.

Hence, FunctionDB is designed to help users who need to use regression models manage their data, particularly when that raw data is noisy or has missing values. FunctionDB is a relational database system with support for special tables that can contain functions, in addition to standard tables with raw (discrete) data. In addition to managing raw data, FunctionDB provides users with tools to fit that raw data with one or more functions, and to pose familiar relational queries over those functions. For example, FunctionDB might represent the points (t = 1, x = 5), (t = 2, x = 7), (t = 3, x = 9) as the function x(t) = 2t + 3. FunctionDB includes an algebraic query processor that executes relational queries on such functions by using direct symbolic algebra (e.g., solving symbolic equations). Traditional relational operations become algebraic manipulations in this functional domain; for example, a selection query that finds the time at which the temperature of a sensor whose value is described by the equation x(t) = 2t + 3 crosses the line x = 5 involves solving 5 = 2t + 3 to find t = 1. Similarly, the symbolic analogs for aggregation and join use integration and function inversion.
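As a concrete illustration of this idea, the selection above can be executed by symbolically inverting the linear function rather than scanning materialized tuples. The following is a minimal sketch in Python, not FunctionDB's actual implementation, and `invert_linear` is a hypothetical helper name:

```python
def invert_linear(a, b, target):
    """Solve target = a*t + b for t, i.e., invert the function x(t) = a*t + b.

    Returns None for a constant piece (a == 0), where the equation either
    has no solution or holds everywhere on the piece.
    """
    if a == 0:
        return None
    return (target - b) / a

# The selection "find the time when x(t) = 2t + 3 crosses x = 5":
t = invert_linear(2, 3, 5)
print(t)  # 1.0
```

A single piece thus yields at most one answer interval boundary, with no discretization error at all.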

Existing database systems provide some support for fitting models, but do not support regression models as first-class objects. Some commercial DBMSs do provide modeling tools for data mining applications – for example, IBM's Intelligent Miner [8] and Oracle Data Miner [10] support creating models using PMML (Predictive Model Markup Language). However, these tools do not export a relational interface to model data. Rather, models are viewed as standalone black boxes with specialized interfaces for fitting. A typical use of PMML involving regression is to first fit a set of points to functions using an external tool, load those functions into the database, and then use the functions to predict the value of some


other set of points by plugging them into the functions. This is very different from our approach, where the functions themselves can be joined, aggregated, or queried.

Perhaps the most closely related work is the MauveDB system [4], which also proposes to integrate models directly into a database system, but performs query processing over those models by storing them as gridded data points that can be fed directly into existing relational operators. For example, a curve like y(x) = 2x + 1 would be represented as a set of discrete data points in MauveDB – e.g., (0, 1), (1, 3), (2, 5) . . ., which can then be queried. In contrast, though FunctionDB does support gridding to expose relational answers to users, query processing is done over ungridded data, yielding substantial efficiency and accuracy gains.

FunctionDB is also related to constraint databases [11,12], which were first proposed in the context of spatial databases, and allow users to pose queries over arbitrary systems of algebraic constraints. In contrast to constraint databases, FunctionDB views query processing in terms of algebraic primitives applied to functions, rather than in terms of solving systems of arbitrary constraints. This restriction enables a simpler design and data model that are also more extensible to a variety of regression functions.

In summary, FunctionDB makes it possible to summarize raw relational data as a collection of functions, and allows users to pose queries over those functions just as they would query relational data. We discuss three key elements of FunctionDB in this paper:

• A compact representation for regression models as collections of piecewise functions.

• A symbolic query processor that operates directly on this compact algebraic representation of data without first materializing functions into discrete data points. We describe how relational query processing operators can be described as a combination of algebraic primitives: function evaluation, zero finding, inversion, and symbolic integration. We also describe an indexing method that allows FunctionDB to efficiently identify functions that satisfy common classes of relational predicates.

• Evaluation of FunctionDB on two real-world data sets — a collection of temperature data from 54 temperature sensors, and a collection of traffic traces from cars driving on Boston roads. FunctionDB achieves order-of-magnitude (10x-100x) better performance for aggregate queries and substantial savings (2-4x) for selective queries, compared to MauveDB-like approaches that represent and process regression models as discrete points. FunctionDB query plans are also more accurate than gridding, which results in over 15% discretization error on some queries.

2. EXAMPLE APPLICATIONS

In this section, we introduce FunctionDB through two real-world applications: an indoor sensor network and an application that analyzes data from car trajectories. We show queries for both applications in the FunctionDB query language, which is SQL with extensions to support regression. We use real data and queries from these applications for our evaluation in Section 5.

2.1 Indoor Sensor Network

Application Scenario. In this application, a network of temperature sensors is placed on the floor of a building (we use real data from a lab deployment in Intel Research, Berkeley). Each sensor produces a time series of temperature observations. We consider several queries that users of this application might want to ask:

1. What is the temperature at a particular location in the building at a given time?

2. At what locations is the heating insufficient? For example, where does the temperature drop below 18°C?

3. Are there nearby sensors (e.g., within 2 feet of each other) that produce temperature values at the same time instant that differ significantly (e.g., by more than 5°C)?

4. Compute the average temperature at a particular sensor location over a particular time window, say between times t1 and t2.

5. Over a longer time scale (e.g., a day), what does the histogram of temperature with time look like at a particular location? This might help detect locations prone to getting hot at particular times of the day when they face the sun.

Regression is useful in this application for several reasons. First, raw temperature data is only available at select sensor locations, but users would like to know the temperature at all locations, requiring interpolation or extrapolation. Second, radios on sensor nodes lose packets (loss rates are sometimes as high as 15-20%); third, nodes themselves fail (e.g., when batteries are low), producing no readings or even garbage data. Hence, our application needs to cope with missing as well as incorrect and noisy data.

While interpolation can be accomplished by simple averaging, regression is a more robust alternative. Regression takes as input a dataset with two correlated variables (like time and temperature), and produces as output a formula for one of the variables (the dependent variable) as a continuous function of the other variable (the independent variable). The aim of regression is to produce a model that approximates the raw data with as little error as possible. The most common form of regression is linear regression, which takes a set of basis functions (e.g., x, x^2, x^3) and computes coefficients for each of the basis functions (e.g., a, b, c) such that the sum of the products of the basis functions and their coefficients (e.g., ax + bx^2 + cx^3) produces a minimum-error fit for an input vector of raw data, X. Performing linear regression amounts to solving the normal equations, a linear system whose size is the number of basis functions (e.g., by Gaussian elimination). A standard application of regression in modeling data is to first segment data into regions within which it exhibits regularity or a well-defined trend, and then choose basis functions that are most appropriate to fit the data in each region. We term this process piecewise regression.
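A sketch of this procedure in Python (our own illustration with hypothetical names, not FunctionDB code): it fits coefficients for a user-supplied list of basis functions by forming the normal equations and solving them with Gaussian elimination.

```python
def fit_basis(xs, ys, basis):
    """Least-squares fit: find coefficients c so that sum_i c[i]*basis[i](x)
    minimizes squared error over the points (xs, ys)."""
    k = len(basis)
    # Augmented normal-equation matrix [A^T A | A^T y], of size k x (k+1),
    # where A has one row per data point and one column per basis function.
    M = [[sum(basis[i](x) * basis[j](x) for x in xs) for j in range(k)]
         + [sum(basis[i](x) * y for x, y in zip(xs, ys))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]

# Fit x(t) = 2t + 3 from the points (1, 5), (2, 7), (3, 9):
coeffs = fit_basis([1, 2, 3], [5, 7, 9], [lambda x: 1, lambda x: x])
print(coeffs)  # [3.0, 2.0] -> intercept 3, slope 2
```

Note that the system solved is only k x k for k basis functions, which is why fitting is cheap even for large raw datasets.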

In this application, a regression model for temperature (the dependent variable) as a function of time or location (independent variables) is preferable to averaging, for two reasons. First, regression models the underlying phenomenon, which is a continuous function of location or time. Interpolation and extrapolation both fit naturally into this framework. Second, when there is sufficient data, regression is less prone to outliers than local averaging.

Figure 1 illustrates linear regression for temperature data from our sensors. The plot shows a snapshot of temperature data from one of the sensors in the Intel Lab dataset, and a piecewise linear regression model for the observed data. Each linear piece is defined over a time interval, and captures an increasing or decreasing trend in temperature over that interval (the temperature increases during the morning and decreases later in the day).

Figure 1: Temperature recorded by a sensor in the Intel Lab, plotted as a function of time. Dots represent temperature readings and line segments represent "pieces" of the regression fit.

FunctionDB Implementation. We now show how creating a regression view of temperature data can be accomplished with FunctionDB, and illustrate how each of the above queries would be expressed in our query language. We assume that raw temperatures are stored in a standard relational table, tempdata, and that the schema looks like <ID,x,y,time,temp> – where ID is the ID of the sensor that made the measurement, (x,y) are coordinates that specify the sensor location, time is the time of measurement, and temp is the actual temperature measurement.

Our previous work on MauveDB [4] introduced SQL extensions to fit a regression view to data already in a relational table; the FunctionDB syntax is similar. The FunctionDB query to fit a piecewise linear model of temperature as a function of time to the readings in tempdata looks like:

CREATE VIEW timemodel
AS FIT temp OVER time
USING FUNCTION Line2D
USING PARTITION FindPeaks, 0.1
TRAINING DATA SELECT temp, time FROM tempdata

This query instructs the DBMS to fit a regression model, timemodel, using data from the column temp as the dependent variable, and data from time as the independent variable. This model can be queried like any relational table with the schema (temp,time). The USING FUNCTION clause specifies the type of regression function to use to fit the data – in our case, Line2D, which represents a line in 2D space. The USING PARTITION clause tells the DBMS how to segment (partition) the data into regions within which to fit different regression functions. Our example uses FindPeaks, a simple built-in segmentation algorithm that finds peaks and valleys in temperature data and segments the data at these extrema; the parameter to FindPeaks specifies how aggressive the algorithm should be about segmenting data into pieces. Finally, the TRAINING DATA clause specifies the data used to train the regression model, which in this case consists of all the readings from the table tempdata.
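The segmentation step can be sketched as follows (a simplified stand-in for FindPeaks written by us for illustration; the real algorithm's aggressiveness parameter is omitted here): cut the series at every interior local extremum, so that each resulting segment is monotone and can be fit with a single line.

```python
def segment_at_extrema(ys):
    """Return indices that split ys into monotone runs: a breakpoint is
    placed at every interior point where the trend changes direction."""
    breaks = [0]
    for i in range(1, len(ys) - 1):
        if (ys[i] - ys[i - 1]) * (ys[i + 1] - ys[i]) < 0:  # peak or valley
            breaks.append(i)
    breaks.append(len(ys) - 1)
    return breaks

# A day of temperatures that rises, falls, then rises again:
print(segment_at_extrema([16, 18, 21, 19, 17, 18, 20]))  # [0, 2, 4, 6]
```

Each pair of consecutive breakpoints then delimits one piece to hand to the regression fitter.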

The most appropriate choice of regression functions is often data- or application-dependent, and a restricted set of built-in regression functions is unlikely to prove sufficient. Hence, our query processor enables users to create and add new types of regression functions (Section 4.1).

Currently, our FunctionDB implementation permits users to fit regression models in a single pass over raw data. While updates are not a focus of this paper, it would be interesting to explore online maintenance strategies for models as new raw data arrives. Some ideas for online maintenance have been proposed previously, e.g., in the MauveDB work.

We now turn to querying regression models. We assume that two regression views – timemodel, for temperature as a function of time, and locmodel, also for temperature but as a function of (x,y) location – have already been constructed using CREATE VIEW as described previously. Given these views, we show below what each of the queries posed earlier looks like in FunctionDB:

Query 1: Temperature at a given location (Simple Selection)

SELECT temp FROM locmodel WHERE x = 20 AND y = 20

Query 2: Locations where temperature is below threshold (Simple Selection)

SELECT x, y FROM locmodel WHERE temp < 18
GRID 0.5,0.5

Query 3: Nearby locations reporting differing temperatures (Join)

SELECT T1.time, T1.ID, T1.temp, T2.ID, T2.temp
FROM timemodel AS T1, timemodel AS T2
WHERE T1.time = T2.time AND
ABS(T1.temp - T2.temp) > 5 AND
(T1.x - T2.x)^2 + (T1.y - T2.y)^2 < 2^2

Query 4: Time window average of temperature (Aggregation):

SELECT AVG(temp) FROM timemodel
WHERE ID = given id AND time ≥ t1 AND time ≤ t2

Query 5: Distribution of temperatures experienced by a particular location (Grouping, Aggregation):

SELECT temp, AMOUNT(time) FROM timemodel
WHERE ID = given id AND time ≥ t1 AND time ≤ t2
GROUP BY temp GROUPSIZE 1

All queries use the regression fit to predict the values of the dependent variable used in the query. For example, Query 1 would return a temperature value for location (20, 20) even if a temperature sensor were not physically deployed at that location, by evaluating the regression function at (20, 20). Also, some queries include a GRID clause. This is a simple SQL extension that specifies the granularity with which results are displayed to the end user. Since FunctionDB represents regression functions symbolically and executes queries without actually materializing the models at any point in a query plan (Section 4), the results of queries are continuous intervals or regions, unlike in a traditional RDBMS where queries return discrete tuples. GRID is one of the possible output semantics¹ for a query that returns a continuous result. GRID discretizes the independent variable at fixed intervals to generate output tuples in the vein of a traditional DBMS. For example, the GRID 0.5,0.5 clause in Query 2 specifies that regions where the temperature is below 18°C should be output as a grid of (x, y) coordinates, with grid size 0.5 along both the x and y attributes.
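The GRID semantics can be sketched as follows (our own illustration in Python, assuming linear pieces as in the temperature model): materialization simply evaluates each piece at fixed steps of the independent variable and emits the resulting discrete tuples.

```python
def grid_piece(slope, intercept, start, end, step):
    """Materialize one linear piece y = slope*t + intercept over [start, end)
    at fixed intervals of the independent variable."""
    out, t = [], start
    while t < end:
        out.append((t, slope * t + intercept))
        t += step
    return out

# GRID 1 over the piece x(t) = 2t + 3 defined on [0, 3):
print(grid_piece(2, 3, 0, 3, 1))  # [(0, 3), (1, 5), (2, 7)]
```

Because gridding happens only at output time, the step size affects display granularity but not the accuracy of the query plan itself.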

The histogram query (Query 5) includes a GROUPSIZE clause, which in this context indicates the bin size of the histogram. In other words, temp is grouped into bins of size 1°C and the aggregate AMOUNT(time) is computed over these bins.
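For a linear piece, AMOUNT(time) per bin can be computed exactly by inversion rather than by gridding. The following Python sketch (our own, not FunctionDB code) measures how much time a piece temp(t) = a*t + b defined on [t0, t1] spends inside a temperature bin [lo, hi):

```python
def amount_in_bin(a, b, t0, t1, lo, hi):
    """Length of time within [t0, t1] during which a*t + b lies in [lo, hi).
    Assumes a non-constant piece (a != 0)."""
    # Invert the bin boundaries: temp in [lo, hi) <=> t in [(lo-b)/a, (hi-b)/a)
    ta, tb = (lo - b) / a, (hi - b) / a
    if a < 0:                     # a negative slope flips the interval
        ta, tb = tb, ta
    return max(0.0, min(t1, tb) - max(t0, ta))

# temp(t) = 0.5*t + 16 on [0, 10]: time spent in the 1-degree bin [18, 19):
print(amount_in_bin(0.5, 16, 0, 10, 18, 19))  # 2.0
```

Summing this quantity over all pieces, for each bin, yields the histogram with no discretization error.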

2.2 Spatial Queries on Car Trajectories

Application Scenario. Our second application aims to support queries on data from Cartel [5], a mobile sensor platform developed at MIT that has been deployed on automobiles in and around Boston. Each car is equipped with an embedded computer connected to several sensors as well as a GPS device for determining vehicle location in real time.

¹FunctionDB queries need not always use GRID – e.g., users might instead want to connect to a GUI or plotting interface to display a continuous curve.


Figure 2: Trajectory of a car driving near Boston. The points represent raw GPS data, and the line segments illustrate our regression fit to interpolate the data.

In this paper, we consider queries on car trajectory data specified by a sequence of GPS readings. This data requires interpolation because GPS data are sometimes missing, e.g., when the car passes under a bridge or through a tunnel, or whenever the GPS signal is weak. Also, GPS values are sometimes incorrect or include outliers, requiring smoothing.

The fact that trajectories correspond to cars driving on a structured road network, as opposed to being arbitrary curves, makes this data an attractive target for regression. Using FunctionDB, we have constructed a piecewise regression model where the pieces are linear functions that represent road segments. This approximates the trajectory quite well, and smooths over gaps and errors in the data. Figure 2 illustrates real GPS data collected from a car driving near Boston, and the corresponding regression model. Notice the large gap in data near the Massachusetts Turnpike, and the linear function that helps interpolate this gap.

FunctionDB Implementation. The FunctionDB schema for the regression view of a car trajectory is <tid, lon, lat>, where tid is a trajectory identifier, and lat (latitude) is modeled as a piecewise linear function of the independent variable, lon (longitude), on each road segment. Below, we describe two queries that Cartel users are interested in, and show how they are written in FunctionDB.

Bounding Box. How many (or what fraction) of the trajectories pass through a given geographic area (e.g., bounding box)? This query is trivial to express as a selection query:

SELECT COUNT DISTINCT(tid) FROM locmodel
WHERE lat > 42.4 AND lat < 42.5 AND
lon < -71 AND lon > -71.1

Trajectory Similarity. One application of the Cartel data is finding routes between two locations taking recent road and traffic conditions into account. To do this, we need to cluster actual routes into similar groups, and compute statistics about commute time for each cluster. We express this query as a self-join that finds pairs of trajectories that start and end near each other, computing a similarity metric for each pair. Our similarity metric pairs up points from the trajectories that correspond to the same fraction of distance travelled along their respective routes. The metric computes the average distance between the points in each pair. In the interest of brevity, we show a somewhat simplified version of the query that computes all-pairs trajectory similarities assuming the availability of a view fracview with a precomputed distance fraction, frac²:

SELECT table1.tid, table2.tid,
AVG(sqrt((table2.lon - table1.lon)^2 +
(table2.lat - table1.lat)^2))
FROM fracview AS table1, fracview AS table2
WHERE table1.frac = table2.frac AND
table1.tid < table2.tid
GROUP BY table1.tid, table2.tid

As in the sensor application, querying the regression model, as opposed to the raw data, has benefits. For example, even if the bounding box in Query 1 happened to fall entirely in a gap such as the one shown in Figure 2, the trajectory would count as passing through the box as long as one of the underlying line segment(s) intersected it.
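The intersection test implied here can be sketched with standard segment-vs-box clipping (our illustrative Python, not FunctionDB's code): a trajectory piece counts as passing through the bounding box if the clipped parameter range of the segment is non-empty.

```python
def segment_intersects_box(p, q, xmin, xmax, ymin, ymax):
    """Liang-Barsky-style test: does the segment from p to q intersect the
    axis-aligned box [xmin, xmax] x [ymin, ymax]?"""
    t0, t1 = 0.0, 1.0
    for d, lo, hi, s in ((q[0] - p[0], xmin, xmax, p[0]),
                         (q[1] - p[1], ymin, ymax, p[1])):
        if d == 0:
            if s < lo or s > hi:      # parallel to this slab and outside it
                return False
        else:
            ta, tb = (lo - s) / d, (hi - s) / d
            if ta > tb:
                ta, tb = tb, ta
            t0, t1 = max(t0, ta), min(t1, tb)
            if t0 > t1:               # clipped away entirely
                return False
    return True

# A segment spanning a data gap still hits the box:
print(segment_intersects_box((0, 0), (10, 10), 4, 6, 4, 6))  # True
print(segment_intersects_box((0, 0), (10, 0), 4, 6, 2, 3))   # False
```

This is why the functional representation answers the bounding-box query correctly even when no raw GPS point falls inside the box.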

As the above examples illustrate, FunctionDB provides a logical abstraction to the user that is similar to a traditional relation. This has the advantage that SQL queries already written for raw data run in FunctionDB with little or no modification, but because the underlying implementation represents and queries regression functions, the same queries now tolerate missing or incorrect data. Also, as we shall show, because FunctionDB represents and queries models algebraically, as opposed to materializing the models into gridded data, the same queries also execute faster and more accurately.

3. REPRESENTATION AND DATA MODEL

Functions are first-class objects in the FunctionDB data model. In this section, we describe the basic representation we adopt for functions and the semantics of the relational operations FunctionDB supports over them.

3.1 Piecewise Functions

The basic idea behind our data model is to add a new type of queryable relation to the standard relational data model – the function table. A function table represents regression models that consist of a collection of pieces, where each piece is a continuous function defined over an interval of values taken by the independent variable of regression (more generally, a region if there are multiple independent variables). Each tuple in a function table describes a single piece of the regression model, and consists of two sets of attributes: interval attributes that describe the interval over which the piece is defined, and function attributes that describe the parameters of the regression function that represents the value of the dependent variable in that interval. These parameters are in algebraic form; for example, in polynomial regression they are a list of polynomial coefficients.

Figure 3 shows an example of FunctionDB fitting a regression model to raw data with two attributes: x and y. The data has been modeled with two regression functions that express y (the dependent variable) as a function of x (the independent variable): y = x when 1 ≤ x < 6.5, and y = 2x − 6 when 6.5 ≤ x < 14. In the function table, the attributes "Start x" and "End x" are interval attributes, while "Slope" and "Intercept" are function attributes.
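A function table tuple from this example might be represented as follows (a minimal Python sketch under our own naming, not FunctionDB's internal layout):

```python
from collections import namedtuple

# One tuple of a function table: interval attributes (start_x, end_x)
# plus function attributes (slope, intercept) for a linear piece.
Piece = namedtuple("Piece", "start_x end_x slope intercept")

table = [Piece(1.0, 6.5, 1.0, 0.0),    # y = x      on [1, 6.5)
         Piece(6.5, 14.0, 2.0, -6.0)]  # y = 2x - 6 on [6.5, 14)

def evaluate(table, x):
    """Evaluate the piecewise model at x, or None if x lies outside all pieces."""
    for p in table:
        if p.start_x <= x < p.end_x:
            return p.slope * x + p.intercept
    return None

print(evaluate(table, 4))   # 4.0
print(evaluate(table, 10))  # 14.0
```

Two tuples thus stand in for the infinitely many (x, y) points on the curve.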

While a function table is represented by a finite collection of functions, at the query language level a function table is just a relation. However, this relation describes an infinite set, and logically consists of every single point on a continuous curve representing the function. This is in contrast to traditional relations, which are finite collections of discrete tuples. To illustrate this, Figure 3 shows the result of a SELECT * query on the function table. Since it is not possible to output an infinite set of tuples, the query includes a GRID 1 clause, which tells the system to materialize the function table by evaluating the function at intervals of 1 along the independent variable, x, and display the resulting (gridded) set of discrete points. As we shall see in Section 4, most FunctionDB query plans only perform materialization at the very end, for displaying the result to the user (though some queries require a form of partial materialization in the case of higher-dimensional functions).

²frac always lies in the interval [0, 1], and has value 0.5 at the midpoint of a trajectory.

[Figure 3 contents: the raw (x, y) data points; the fitted regression curve with pieces y = x and y = 2x − 6; the function table with tuples (Start x = 1, End x = 6.5, Slope = 1.0) and (Start x = 6.5, End x = 14, Slope = 2.0); and the materialized grid produced by SELECT * GRID 1.]

Figure 3: Raw data, the corresponding function table, and the result of a SELECT * query.

Interval attributes represent an interval on the real line in the case of a single independent variable, and a hypercube³ in the case of multiple independent variables. Also, our data model permits relations that contain overlapping intervals or regions of the independent variable(s). This enables us to support models in which there can be multiple values of the dependent variable corresponding to a single value or set of values taken by the independent variable(s). While regression functions on time series data are single-valued, this relaxation is useful to represent other models that are not single-valued, like car trajectories.

3.2 Relational Algebra

Since function models are just relations, traditional relational operators such as selections, projections, joins and aggregates extend naturally to function models. These operators retain semantics similar to their discrete versions, but with some differences, because our operators logically operate over infinite sets, as opposed to traditional operators on finite relations.

Some operators, like aggregates, need to be generalized to operate in the continuous domain. Also, arbitrary relational operators on higher dimensional functions sometimes yield regions that cannot always be represented as function tables. In these cases, our implementation approximates these regions with hypercubes, which ensures that the result is finitely representable. While our data model does sacrifice some expressive power in this theoretical sense, Section 3.3 compares it to a more expressive (but complex) constraint-based data model, and explains why we believe trading off a little generality for a simpler data model is worthwhile.

We define the semantics of FunctionDB operators below. For simplicity, we focus on the semantics of operators on regression functions of a single independent variable, but we do mention the conditions under which the hypercube approximation mentioned above is required for higher dimensional functions.

³Our implementation is currently restricted to hypercubes. Extending to more complex regions would require supporting arbitrary constraints; see Section 3.3 for a discussion.

Selection. The selection operator, σ_P, when applied to the infinite relation F represented by a function table, yields another relation F′ that can also be represented by a function table (assuming a single independent variable). F′ is defined to be the (largest) subset of F satisfying the selection predicate P. Selection predicates can involve both dependent and independent variables of regression.

As an example of selection, applying σ_{y≥5} to a table with two pieces, y = 2x for 0 ≤ x ≤ 2 and y = x + 2 for 2 < x ≤ 4, results in a function table with a single piece: y = x + 2 for 3 ≤ x ≤ 4.

Applying selection predicates to functions of more than a single independent variable does not always yield a function table. For example, the selection σ_{z<4}, when applied to the function z = x² + y², yields the circular region x² + y² < 4, which is not representable as a function table. In such cases, the operation is not lossless and needs to use the hypercube approximation mentioned earlier.

Projection. The projection operator, π_V, when applied to an infinite relation F, yields an infinite relation F′ consisting of all the possible values the projection attributes V take in the original relation F. If the variables being projected are all independent or dependent variables, the result is simply a set of intervals.

As with selection, in higher dimensions function tables need not be closed under projection. Omitting an arbitrary subset of the independent variables may yield regions that are not representable as function tables. For example, projecting x and z from the function z = √(1 − x² − y²) for 0 ≤ x, y ≤ 1 yields the region x² + z² ≤ 1, which is not representable as a function table.

Join. The join operator, ⋈_P, is applied to two relations R1 and R2, where R1 and R2 can be function tables or normal relations. Joins have identical logical semantics to traditional relational algebra: a tuple (t1, t2) is present in the joined result if and only if t1 ∈ R1, t2 ∈ R2 and (t1, t2) satisfies the join predicate P. Note that it is possible to join a function table to a traditional relational table: in this case, the result is always a finite set of tuples. As with the other operators, the results of some joins on higher dimensional functions are not function tables, necessitating materialization.

Aggregates. In addition to relational operators, our algebra includes a suite of aggregate operators that are the continuous analogues of traditional database aggregates like SUM, AVG and COUNT. Traditional aggregates over discrete relations generalize to definite integrals when operating over continuous functions.

For example, consider the SQL COUNT operator applied to an attribute R.A of a relation R. This operator counts the number of tuples that occur in column A of R, which essentially computes the sum Σ_{v∈R.A} 1 (ignoring duplicates). In the continuous domain, this sum generalizes to a definite integral: ∫_{v∈R.A} 1 dv. If R.A represents time, then this integral computes the total time spanned by the model R, which is the sum of the lengths of all the time intervals that pieces of R are defined over. We name this aggregate operator AMOUNT. Other aggregate operators generalize similarly. Table 1 shows three common SQL aggregate operators, the discrete sum they compute, and the definite integral that this sum generalizes to in the continuous domain. In the table, R.X denotes the independent variable(s) that the integral is computed over. The aggregation attribute R.A can be either dependent or independent.

    SQL Aggregate                                FunctionDB Analogue
    COUNT(R.A) = Σ_{v∈R.A} 1                     AMOUNT(R.A) = ∫_{v∈R.A} 1 dv
    SUM(R.A)   = Σ_{v∈R.A} v                     AREA(R.A)   = ∫_{v∈R.A} v d(R.X)
    AVG(R.A)   = Σ_{v∈R.A} v / Σ_{v∈R.X} 1       AVG(R.A)    = ∫_{v∈R.A} v d(R.X) / ∫ 1 d(R.X)

Table 1: Discrete aggregates and their FunctionDB counterparts.

Grid. FunctionDB operators take infinite sets as arguments and produce results that can be infinite sets. As we have discussed earlier, materializing models is required to display the results of executing queries to the user, or for some query plans on higher dimensional functions. To handle this, our algebra includes a special materialization operator, GRID. GRID takes a function table as input and selects a finite set of discrete tuples from the corresponding infinite relation by evaluating the function at fixed intervals for each independent variable. For example, the result of applying GRID with grid interval 1 to a function table with two pieces, y = 2x for 0 ≤ x < 2 and y = x + 2 for 2 ≤ x ≤ 4, is a finite collection of tuples: { (0, 0), (1, 2), (2, 4), (3, 5), (4, 6) }.
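A minimal sketch of the GRID operator for one independent variable might look as follows (the piece representation and the handling of interval boundaries are simplifying assumptions of ours, not FunctionDB's implementation):

```python
import math

def grid(pieces, step):
    """Materialize a piecewise model by evaluating each piece at
    multiples of `step` inside its interval. We treat intervals as
    closed and drop duplicate grid points shared by adjacent pieces,
    a simplifying assumption for continuous models."""
    out, seen = [], set()
    for (lo, hi, f) in pieces:               # piece = (start_x, end_x, callable)
        x = math.ceil(lo / step) * step      # first grid point at or after lo
        while x <= hi:
            if x not in seen:
                seen.add(x)
                out.append((x, f(x)))
            x += step
    return out

# The two pieces from the text: y = 2x on [0, 2) and y = x + 2 on [2, 4]
pieces = [(0, 2, lambda x: 2 * x), (2, 4, lambda x: x + 2)]
```

With grid interval 1, this reproduces the finite collection of tuples given in the text.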

3.3 Discussion

Our choice of piecewise continuous functions as the underlying representation for models has several benefits:

Algebraic Query Execution. Restricting models to functions enables query execution using algebra (Section 4), which helps answer queries faster and more accurately.

Losslessness. It is a completely lossless representation of the underlying continuous function. Because the FunctionDB query processor usually avoids materialization until late in a query plan, no errors are introduced in addition to the inherent error of fitting the model. In contrast, executing queries on a gridded representation, as done in MauveDB [4], introduces and propagates additional discretization errors throughout the query plan (Section 5).

Reduced Footprint. It is compact compared to maintaining a gridded representation of the model, because a function table has only as many rows as there are pieces in the regression model, usually an order of magnitude smaller than a materialized representation. The lower footprint results in substantial savings in I/O and CPU cost, and order of magnitude better performance on queries (Section 5).

Wide Applicability. It is simple to reason about and implement, and at the same time generalizes naturally to support a wide class of regression models used in practice.

Function tables are only one possible way to represent an infinite relation using a finite set of tuples. For example, our data model is more restrictive than constraint databases [11, 12], which represent infinite regions like polygons or line segments as algebraic constraints. Constraint databases are more general than function tables because any function table can be described by a set of inequality constraints for the interval attribute(s), like 740 ≤ t ≤ 1001, and an equality constraint for functions, like temp = 2t + 3. These databases are capable of solving arbitrary linear programs, and the results of arbitrary relational operators in higher dimensions are always well defined, but this generality comes at a price. General constraint solvers are difficult to build, with the result that, in practice, constraint databases have typically been confined to linear constraints. Our query processor is simpler and more extensible to a variety of functions. Also, while our data model is not closed under arbitrary relational operations in higher dimensions, we deal with this problem in practice by approximating regions that cannot be expressed as function tables (like x² + y² < 4) with collections of hypercubes. This is simple to implement, and provides most of the practical performance advantages of algebraic processing.

4. QUERY PROCESSOR

The core of FunctionDB is an algebraic query processor that executes relational queries using operations in the functional domain. This section describes the key idea behind our query processor: an extensible framework for expressing relational operators as combinations of primitive operations on functions.

4.1 Function Framework

The obvious way to implement a function-aware query processor would be to define abstract data types (ADTs) for different classes of functions (e.g., linear, periodic or polynomial) and implement relational operators that work on each function type. While this approach would be adequate if FunctionDB only needed to support a limited class of functions, it requires replicating implementations of operators for each new class of function added to the system. Hence, we have instead chosen to characterize a small set of abstract interfaces implemented by all continuous functions, and to express relational and aggregate operators in terms of these interfaces. Since our operators work by invoking these interfaces and do not directly rely on function type, it is sufficient to implement these algebraic primitives for each new model type to be supported.

At one extreme, it is possible to implement arbitrary relational queries on functions using a pure gridding approach similar to that used in MauveDB [4]. The only property required for this approach to work is function evaluation, i.e., the ability to compute the value taken by the function, f(x), for a particular value x (or set of values) of the independent variables. However, this approach is quite slow and results in loss of accuracy due to discretization (Section 5).

Suppose we now require functions⁴ to implement a simple new primitive: root finding. Given a real number a as input, a "root finding" interface determines all values of the independent variable x for which f(x) = a. This primitive is sufficient to solve selection queries with predicates of the form⁵ WHERE y = a, which involve the dependent variable of regression, y. Selection based on root finding is simple. For each piece in the function table, the algorithm invokes root finding to determine candidate values of x such that f(x) = a, and then checks which of these values fall within the interval of x over which the piece is defined. For each such value, say x0, it outputs a tuple with the same function coefficients as the original tuple, but defined over a single point: [x0, x0]. If none of the candidate x values lie within the interval of definition, no points in the input piece satisfy the selection predicate, so no output tuples are emitted.

Selection by root finding is the simplest example of algebraic query processing. Taking advantage of more properties progressively enables new classes of queries to be executed algebraically. To illustrate, Table 2 lists some common classes of relational operators, an example application query for each class, the algebraic primitives that enable executing that operator without materialization, and the advantages of algebraic execution over materialization in the context of that operator. The table focuses on functions of a single variable, but does include a few operators in higher dimensions. As explained in Section 3, some of these require approximation. The table is not exhaustive, but rather is meant to give a flavour of the algebraic techniques used by FunctionDB. The next section presents more detailed algorithms for relational operators.

⁴Of a single variable.
⁵Because pieces are continuous functions, root finding is sufficient to solve arbitrary > / < predicates (Algorithm 1).

4.2 Operators


Relational Operator | Application Example | Required Primitives | Advantages Over Gridding
Selection on dependent variable (σ_{F.y>a / F.y<a / F.y=a}) | Find when temperature crosses threshold | Root Finding, Evaluation | Faster if low selectivity
Selection on independent variable (σ_{F.x>a / F.x<a / F.x=a}) | Restrict query to time window | None (Interval Manipulation) | Faster if low selectivity
Equijoin of independent variables, compare dependent (⋈_{F1.x=F2.x ∧ F1.y>/<F2.y}) | Compare temperatures at two sensors at same time | Function Subtraction, Root Finding | More accurate for = test; faster for > / < test if low selectivity
Equijoin of dependent variables (⋈_{F1.y=F2.y}) | Line up vehicle trajectories on distance fraction (Section 2.2, Query 2) | Function Inversion | More accurate for = test; much faster
Group by independent variable, aggregate dependent | Compute temperature average over time windows | Definite Integral (area under curve) | More accurate, faster
Group by dependent variable, aggregate independent | Duration histogram for temperature ranges (Section 2.1, Query 5) | Function Inversion, Definite Integral for inverse | More accurate, faster
Selection on dependent variable in two dimensions (σ_{F.y>/</=a}) | Find region where temperature crosses threshold (Section 2.1, Query 2) | Evaluation (Hypercube Approximation) | Faster
Selection on independent variable in two dimensions (σ_{F.x1>/</=a1 ∧ F.x2>/</=a2}) | Find temperature at particular location in building (Section 2.1, Query 1) | None (Interval Manipulation) | More accurate (especially =); faster if low selectivity

Table 2: Commonly used relational operators and algebraic primitives that can be used to execute them. F, F1, F2 denote function tables, x, x1, x2 denote independent variables, y denotes a dependent variable, and a, a1 and a2 denote constants (query parameters).

This section presents algebraic algorithms for selection, projection, join, grouping and aggregation. In this paper, we confine ourselves to functions of a single variable. Our operators use a standard cursor interface that returns the next output tuple if available, or a null value if done.

Algorithm 1: (Selection on y) σ_{y>a}(F)
Note: algebraic primitives are typeset in bold italics in the original.

Given: an upstream cursor, F, to apply selection to, and a selection predicate, y > a.

if cache not empty then return tuple from cache
while F has more tuples do
    Next ← F.GetNext()
    {Start, End} ← Next.interval
    Piece ← Next.function
    AllRoots ← Piece.findroots(a)
    Candidates ← {x : x ∈ AllRoots and Start ≤ x ≤ End}
    SR ← smallest root ∈ Candidates larger than Start
    if Piece.evaluate(Start) > a then
        I ← alternating intervals from Candidates including [Start, SR]
    else
        I ← alternating intervals from Candidates excluding [Start, SR]
    for i ∈ I do
        Output.function ← Next.function
        Output.interval ← i
        return Output if first tuple, else cache it

Selection. Selection queries can involve either the dependent or the independent variable of a model, or both. Selection over the independent variable (x > / < / = a) is trivial, and does not depend on the function at all. It merely requires checking each piece in the function table to determine what portion of the corresponding interval (if any) satisfies the selection predicate. For example, applying the predicate x ≥ 3 to the function piece y = 2x defined over [2, 4] would yield as output a piece with the same function, y = 2x, but now defined over [3, 4].

We previously described selection for predicates of the form y = a using root finding. Algorithm 1 handles predicates of the form y > a⁶. This algorithm also uses root finding, and additionally exploits the fact that a continuous function f(x) − a has alternating⁷ signs in the intervals of x between the roots of the equation f(x) = a. Note that for general functions, a single piece could contain multiple disjoint intervals of x where the selection predicate is true (though there is at most one for linear functions).
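For linear pieces, root finding degenerates to solving m·x + b = a, and the alternating-interval logic of Algorithm 1 reduces to clipping the interval on one side of the root. The following is a simplified sketch of ours (handling only linear functions and the ≥ case, not FunctionDB's general algorithm):

```python
def select_y_above(pieces, a):
    """Selection sigma_{y >= a} over linear pieces (lo, hi, m, b),
    where y = m*x + b over x in [lo, hi]. Uses the unique root of
    m*x + b = a to clip each interval; a hypothetical helper."""
    out = []
    for (lo, hi, m, b) in pieces:
        if m == 0:                     # constant piece: keep it whole or drop it
            if b >= a:
                out.append((lo, hi, m, b))
            continue
        root = (a - b) / m             # unique root of m*x + b = a
        if m > 0:                      # y >= a holds for x >= root
            new_lo = max(lo, root)
            if new_lo <= hi:
                out.append((new_lo, hi, m, b))
        else:                          # y >= a holds for x <= root
            new_hi = min(hi, root)
            if lo <= new_hi:
                out.append((lo, new_hi, m, b))
    return out

# The example from Section 3.2: y = 2x on [0, 2]; y = x + 2 on (2, 4]
pieces = [(0, 2, 2, 0), (2, 4, 1, 2)]
```

Applying it with a = 5 drops the first piece entirely (its maximum y is 4) and clips the second to [3, 4], matching the earlier selection example.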

The algebraic algorithms for selection presented here outperform the gridding approach, because their running time depends only on the number of pieces in the regression model, which is usually an order of magnitude smaller than a materialized grid of points. Also, Section 5 shows that even if the selection result ultimately needs to be materialized to display results to the user, the algebraic approach has significant performance gains when the filter is selective.

Join. Though there are many interesting cases of joins, we primarily confine ourselves to describing two important classes of joins we have encountered in our applications: equijoins on the independent variable, corresponding to predicates of the form F1.x1 = F2.x2, and equijoins on the dependent variable, corresponding to F1.y1 = F2.y2.

Equijoin on x. As with selections, equijoins involving only the independent variable are easy to compute and do not use the function attributes at all. Any of the traditional join algorithms can be used here, with a minor modification to test for overlap of x intervals rather than to compare attributes. For example, a simple nested

⁶The algorithm for y < a follows by symmetry.
⁷For this to be true always, findroots also needs to return two copies of corner-case "degenerate" roots where the function is tangential to the line y = a.


loops join would loop over all pairs of pieces in both relations and determine which pairs overlap. For each pair of pieces whose x intervals overlap, the algorithm outputs a result tuple whose interval is the intersection of the overlapping x intervals, and which contains the function attributes from both pieces, copied over without any modifications.

Equijoin on y. Equijoins on the dependent variable are a little trickier. The key insight is that an equijoin on y can be transformed into an equijoin on x using function inversion. As a simple example, consider the problem of joining two isolated pieces: y = 2x + 2 defined over x ∈ [0, 2], and y = x defined over x ∈ [3, 7]. Inverting the linear functions enables us to rewrite x as a function of y instead. The transformed equivalents of these pieces are x = (1/2)y − 1 defined over y ∈ [2, 6], and x = y defined over y ∈ [3, 7]. We now simply use the procedure for equijoins on the independent variable, yielding a composite piece with two functions, x = (1/2)y − 1 and x = y, both defined over the overlapping range y ∈ [3, 6]. It is easy to verify that this correctly represents the result of the y equijoin.

While the above procedure works for functions that have a unique inverse (like linear functions), it may not be immediately obvious how to extend the algorithm to functions that do not have a mathematical inverse, e.g., functions like y = x² for which there are multiple values of x that yield the same value of y. Fortunately, our piecewise data model comes to the rescue. For such functions, we require the inversion primitive to return a list of inverse pieces defined over appropriate intervals, rather than a single piece. For example, invoking inversion on the quadratic function y = x² defined over x ∈ [−2, 2] would return two pieces, x = +√y and x = −√y, both defined over the interval y ∈ [0, 4]. Algorithm 2 formally describes this procedure.
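For the uniquely invertible linear case, the invert-and-intersect transformation can be sketched as follows (hypothetical helpers of ours; FunctionDB's invert primitive returns a list of pieces in general, which we elide here):

```python
def invert_linear(lo, hi, m, b):
    """Invert y = m*x + b over x in [lo, hi], returning the inverse
    piece x = (1/m)*y - b/m over the corresponding y interval.
    Assumes m != 0, so the inverse is unique (a quadratic would
    instead yield two inverse pieces)."""
    y0, y1 = m * lo + b, m * hi + b
    return (min(y0, y1), max(y0, y1), 1 / m, -b / m)

def equijoin_on_y(p1, p2):
    """Join two linear pieces on y: invert both, then intersect their
    y intervals, as in the equijoin-on-x procedure from the text."""
    a = invert_linear(*p1)
    b = invert_linear(*p2)
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    if lo > hi:
        return None                      # y ranges do not overlap
    # over y in [lo, hi]: x1 = a[2]*y + a[3] and x2 = b[2]*y + b[3]
    return (lo, hi, (a[2], a[3]), (b[2], b[3]))

# The example from the text: y = 2x + 2 on x in [0, 2]; y = x on x in [3, 7]
result = equijoin_on_y((0, 2, 2, 2), (3, 7, 1, 0))
```

On the text's example, this yields the composite piece over the overlapping range y ∈ [3, 6], with x = (1/2)y − 1 and x = y.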

Like a traditional DBMS, FunctionDB can also use indexes to speed up queries. Selections and joins on x benefit from an interval tree index that stores intervals of x. If functions support an interface to find their extreme values over an interval (e.g., using differential calculus), it is easy to do the same for y, by determining the maximum and minimum values taken by y in each function piece, and storing these intervals in a tree. In either case, if the selection/join query is selective and not many pairs overlap, it is faster to look up overlapping intervals in the index.

Aggregation. We have implemented the three aggregates mentioned in Section 3: namely, AMOUNT (analogous to SQL COUNT), AREA (analogous to SQL SUM) and AVG. Just like selections and joins, these aggregate functions can be used to compute aggregates of fields that are independent or dependent variables.

In the continuous domain, aggregates over any kind of variable (independent or dependent) can be evaluated as the limit of a sum, which is equivalent to a definite integral. For example, consider the aggregate AMOUNT(x) applied to tuples from a function table F with schema <x, y = f(x)>. For each tuple fed to it with x interval [x1, x2], AMOUNT(x) computes the length of the interval, x2 − x1, and accumulates the sum of these lengths⁸ over all the input tuples in F. Mathematically, AMOUNT(x) can be viewed as computing the sum Σ_{[x1,x2]∈F} (∫_{x1}^{x2} 1 dx) (though it simplifies, as we have seen above).

While aggregates over the independent variable have simple expressions and do not actually require computing symbolic integrals, this ability is required for computing averages or sums of dependent variables. For example, the aggregate AREA(y), when applied to a regression model with schema <x, y>, computes the total area under

⁸It is possible to either include or exclude the signs of the individual integrals when computing this sum. We have implemented both versions of the aggregate operators.

Algorithm 2: (NL-Join on y) ⋈_{F1.y1=F2.y2}(F1, F2)

Given: upstream cursors F1 and F2, and the join predicate F1.y1 = F2.y2.

if cache not empty then return tuple from cache
while F1 has more tuples do
    NextOuter ← F1.GetNext()
    OuterInv ← NextOuter.function.invert()
    OuterInt ← OuterInv.interval
    F2.Rewind()  // Reset inner cursor
    while F2 has more tuples do
        NextInner ← F2.GetNext()
        IList ← NextInner.function.invert()
        for InnerInv ∈ IList do
            InnerInt ← InnerInv.interval
            if OuterInt and InnerInt overlap then
                Output.interval ← OuterInt ∩ InnerInt
                Output.function1 ← OuterInv.function
                Output.function2 ← InnerInv.function
                return Output if first tuple, else cache it

the regression curve, which is equal to the sum of the areas under all the pieces in the function table. For an individual piece y = f(x) defined over the interval [x1, x2], the area under the piece is given by the definite integral ∫_{x1}^{x2} f(x) dx. Accordingly, our aggregate implementation works by invoking a "definite integral" primitive on each piece in the function table. This primitive takes the endpoints of the interval as a parameter, and returns the value of the integral (which itself is computed symbolically). AREA(y) accumulates the returned values over all the pieces in the table, in effect evaluating the sum Σ_{[x1,x2]∈F} (∫_{x1}^{x2} y dx).
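For linear pieces y = m·x + b, the definite-integral primitive has a closed form, ∫_{x1}^{x2} (m·x + b) dx = (m/2)(x2² − x1²) + b(x2 − x1), so AREA and AMOUNT can be sketched as (our own simplified helpers, not FunctionDB's primitives):

```python
def area_linear(pieces):
    """AREA(y) over linear pieces (x1, x2, m, b): each piece
    contributes the symbolically computed definite integral of
    m*x + b over [x1, x2]."""
    total = 0.0
    for (x1, x2, m, b) in pieces:
        total += m / 2 * (x2 ** 2 - x1 ** 2) + b * (x2 - x1)
    return total

def amount_x(pieces):
    """AMOUNT(x): total length of the x intervals, the continuous
    analogue of COUNT."""
    return sum(x2 - x1 for (x1, x2, _, _) in pieces)

# y = 2x on [0, 2] and y = x + 2 on [2, 4]
pieces = [(0, 2, 2, 0), (2, 4, 1, 2)]
```

For these two pieces the areas are 4 and 10, so AREA(y) is 14 while AMOUNT(x) is 4, the total x extent.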

The other aggregates, AREA(x), AVG(x), AMOUNT(y), and AVG(y), have similar expressions, derived from Table 1 in Section 3. We omit a detailed description here.

Grouping. As with other operators, aggregate queries that first GROUP BY a model attribute can be classified on the basis of the grouping field type: independent or dependent.

Grouping the independent variable into groups of a given size, S, is accomplished with simple interval manipulation. The algorithm first splits each of the input tuples into tuples defined over smaller intervals that lie entirely within an interval of the form [kS, (k + 1)S] for some integer k ≥ 0. For example, using bins of size 0.1, the intervals [0.22, 0.38] and [0.38, 0.56] would be split into [0.22, 0.3], [0.3, 0.38], [0.38, 0.4], [0.4, 0.5], and [0.5, 0.56]. Once the tuples have been split, it is trivial to hash each of the smaller tuples into an appropriate bucket based on the start points of their intervals (e.g., [0.3, 0.38] and [0.38, 0.4] would fall in the same bucket). The required aggregate over the dependent variable can now be computed by summing the values of definite integrals over all the pieces within each hash bucket, as explained earlier.
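The split step can be sketched as follows (a helper of our own; the rounding is just to keep floating-point bin edges tidy and is not part of the paper's algorithm):

```python
import math

def split_into_bins(intervals, size):
    """Split each (lo, hi) interval into sub-intervals that each lie
    within a single bin [k*size, (k+1)*size). Results are rounded to
    9 decimal places to avoid floating-point drift at bin edges."""
    out = []
    for (lo, hi) in intervals:
        x = lo
        while x < hi:
            # next bin edge strictly above x, clipped to hi
            edge = round((math.floor(round(x / size, 9)) + 1) * size, 9)
            nxt = min(edge, hi)
            out.append((round(x, 9), round(nxt, 9)))
            x = nxt
    return out
```

On the text's example (bins of size 0.1), this reproduces the five sub-intervals listed above; hashing each sub-interval by the bin containing its start point then completes the GROUP BY.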

Grouping the dependent variable (e.g., Query 5 in Section 2.1) uses the same inverse transformation as discussed for joins. The algorithm first invokes the inversion primitive on each tuple to express x in terms of y, and then follows a split-and-hash procedure identical to that described above to group on y.

5. EVALUATION

We present an experimental evaluation of FunctionDB on queries from the applications described in Section 2. We first show a simple


moving average query on which using FunctionDB to fit a regression model helps smooth over gaps in raw data. We then quantify two main advantages of our algebraic query processor over the gridding approach used by systems like MauveDB, where models are represented as discrete points by evaluating the regression model at a fixed gridding interval.

First, gridding introduces discretization error, because a grid is a coarse approximation to a continuous model. We show that this error can be significant even for simple aggregate queries on our data. In contrast, FunctionDB queries avoid gridding models until the output stage of a query plan, and hence do not introduce any discretization error.

Second, while it is sometimes possible to reduce discretization error by using a small gridding interval, this requires storing and/or processing a large number of discrete points. FunctionDB's algebraic approach is a performance win, because it only has to process as many tuples as there are pieces in the regression model. Although algebraic manipulations on tuples are slightly more complex than relational operations on individual raw data points, our results demonstrate that in practice this tradeoff largely favours algebraic query processing. The lower footprint of our approach results in reduced CPU cost for per-tuple processing overhead, memory allocation and deallocation, and substantially lower I/O cost.

Experimental Methodology. For evaluation, we built a simple prototype of FunctionDB in C++. We constructed FunctionDB query plans by hand, by connecting together FunctionDB operators. For each experiment, we also implemented a gridding version of the query, which operates on a model representation gridded at regular intervals of the independent variable and uses traditional query processing operators. In all our experiments (algebraic and gridding), the query processor reads data stored on disk into an in-memory table, executes the query plan on the in-memory table, and writes the query results to a file on disk. We have chosen to build an in-memory prototype of our query processor for simplicity, because our data sets fit in main memory. Our experiments quantify I/O cost (reading data from disk) separately from CPU cost (query processing). The CPU cost measures query execution time when all the data is already in the DBMS buffer pool, while the I/O cost provides insight into how performance might scale to larger on-disk datasets. All our experiments were run on a 3.2 GHz Pentium 4 single-processor machine with 1 GB RAM and a 512 KB L2 cache. Our results are all averaged over 10 experimental runs.

5.1 Temperature Sensor Application

We evaluated FunctionDB on the temperature data described in Section 2.1. To recap, this data contains temperature observations collected from 54 sensors, with the schema <time, temp>. We first fitted a piecewise linear regression model to temperature using a peak finding procedure, as described in Section 2.1. For testing, we inserted all the regression models into a single function table, tempmodel, containing 5360 function pieces. This corresponds to 10 days of temperature data, with a total of ∼100,000 temperature readings. Each piece describes temp as a linear function of time, and on average fits approximately 200 raw temperature readings.

Comparison to Raw Data. We present a simple comparison of FunctionDB to query processing over raw temperature data. Figure 4 shows the results of a simple query that computes moving averages of temperature over 10 second windows (Query 4, Section 2.1). The figure shows that computing the average over raw data is inaccurate and yields spurious results whenever there is a gap in the data. Running the query over regression functions yields

[Figure 4: 10-second moving averages over 25 minutes of temperature data, computed over raw data and over a piecewise linear regression model with FunctionDB. The spikes in the average are errors due to missing data, which are corrected by regression.]

a smoother moving average without outliers.

Comparison to Gridding. To compare to a gridded representation of the model, we evaluated FunctionDB on the histogram query from Section 2.1 (Query 5). The query computes a histogram of temperatures over the time period of the dataset, using temperature bins of width B0 (a parameter). For each bin, the height of the histogram measures the total length of time (summed over all sensors) for which any of the locations experiences a temperature that lies in the range described by that bin. This query involves a grouping operation prior to aggregation:

SELECT AMOUNT(time) FROM tempmodel

GROUP BY temp GROUPSIZE B0

The FunctionDB plan for the above query consists of an algebraic GROUP BY operation on the dependent variable, temp, followed by an AMOUNT aggregate over the independent variable, time. The query consists entirely of algebraic operators. The GROUP BY operation works by first inverting the function and then splitting temperature intervals into groups with the specified size B0, and the aggregation works by summing the lengths of all the time intervals that fall within each temperature bucket (Section 4.2). The gridding query plan, on the other hand, reads a gridded representation (gridded on time) of data off disk and processes it. This query plan uses a traditional GROUP BY operator to map temperature values to bins of size B0, and the discrete SQL COUNT aggregate to count the number of time samples that lie within each bin.
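To make the inversion-based plan concrete, the following is a minimal sketch of the idea, not FunctionDB's actual implementation: the Piece class and histogram_amount function are hypothetical names, and each piece is assumed to describe temp = a*time + b over a time interval [t0, t1].

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Piece:
    a: float   # slope of temp with respect to time
    b: float   # intercept
    t0: float  # start of the piece's time interval
    t1: float  # end of the piece's time interval

def histogram_amount(pieces, bin_width):
    """Exact total time spent in each temperature bin (bin k covers
    temperatures [k*bin_width, (k+1)*bin_width))."""
    hist = defaultdict(float)
    for p in pieces:
        lo = min(p.a * p.t0 + p.b, p.a * p.t1 + p.b)
        hi = max(p.a * p.t0 + p.b, p.a * p.t1 + p.b)
        if p.a == 0:
            # Constant piece: the whole time interval falls in one bin.
            hist[int(lo // bin_width)] += p.t1 - p.t0
            continue
        k = int(lo // bin_width)
        while k * bin_width < hi:
            # Invert temp = a*time + b: the part of this bin the piece covers
            # corresponds to a time interval of length (t_hi - t_lo) / |a|.
            t_lo = max(lo, k * bin_width)
            t_hi = min(hi, (k + 1) * bin_width)
            hist[k] += (t_hi - t_lo) / abs(p.a)
            k += 1
    return dict(hist)
```

The key property is that the answer is computed from interval endpoints alone, with no per-sample work, which is why the cost scales with the number of pieces rather than the number of raw readings.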

Figure 5 shows the execution time for the histogram query when using FunctionDB, as compared to gridding. Results are shown for 4 values of the grid size at which the query can be executed without significant loss of accuracy. The figure shows that FunctionDB is faster than the gridding strategies by an order of magnitude in terms of both CPU and I/O cost. Both CPU and I/O savings are due to FunctionDB's small footprint: the FunctionDB query plan needs to read, allocate memory for and process only 5360 tuples (one per function piece). On the other hand, "Grid Raw" (gridding with the same spacing as the raw data) needs to read and process ∼100,000 discrete points, and hence performs an order of magnitude worse.

Figure 5 indicates that it is possible to reduce gridding footprint, and hence processing time, by widening grid size. Doing so, however, adversely impacts query accuracy. Figure 6 shows the discretization error introduced by gridding (averaged over all bins) on the histogram query, as a function of grid size. The error is computed as a percent deviation from the result of the algebraic query plan, which does not suffer from this error. The graph shows that discretization error grows significantly with grid size. Hence, using a widely spaced grid is not a viable option.

Figure 5: Performance of FunctionDB compared to gridding with different grid sizes, on the histogram query (1 degree C bins). The base grid size, "Grid Raw", is equal to the average spacing of raw temperatures (1 sec). The other three grid sizes are successively larger multiples of this spacing (2 sec, 4 sec, 8 sec).


Figure 6: Discretization error (% deviation from the algebraic answer) due to gridding, as a function of grid size, averaged over all bins for the histogram query.

The discretization error in Figure 6 results from sampling, which is an imperfect way to compute the aggregate of a continuous function. The gridding approach samples temperature data at discrete time intervals and counts the number of temperature observations in each bin. This becomes inaccurate when the sampling interval is comparable to a time scale over which temperature varies significantly (affecting the bin to which sampled temperature readings are assigned). As the figure shows, the error is exacerbated for queries that group temperature into smaller bins.

While FunctionDB results in a clear performance win for aggregate queries that do not need to grid results at any point in a query plan, some queries (including simple selections) do need to grid results for display or for writing to a file, in order to provide the familiar output semantics of a traditional DBMS. (While we have used FunctionDB results as the baseline for computing error, we recognize that our results are limited by the inherent error in the raw data and in the model fit by the user. The experiments are mainly intended to show that gridding introduces significant additional error, while FunctionDB does not.) In such situations, algebraic query execution does not always outperform gridding, but it is still beneficial for selective queries, if gridding is performed after selection. Our second query determines when the temperature exceeds a particular cutoff, T0:

SELECT time, temp FROM tempmodel

WHERE temp > T0 GRID G0

The query takes an additional parameter G0, which specifies the granularity with which results are gridded for display to the user. The FunctionDB query plan for selection uses an algebraic root finding procedure (Algorithm 1, Section 4.2), but then applies GRID to the selected functions for displaying the result of selection. The gridding approach applies a traditional selection directly to data gridded at regular intervals of the time attribute.
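A rough sketch of this selection plan follows, assuming each linear piece stores temp = a*time + b on a time interval [t0, t1]; the Piece class and select_above function are illustrative names, not FunctionDB's API.

```python
from dataclasses import dataclass

@dataclass
class Piece:
    a: float   # slope: temp = a*time + b
    b: float   # intercept
    t0: float  # start of the piece's time interval
    t1: float  # end of the piece's time interval

def select_above(pieces, T0, grid):
    """Algebraic selection temp > T0: solve each piece symbolically,
    then grid only the passing time intervals for output."""
    out = []
    for p in pieces:
        if p.a == 0:
            interval = (p.t0, p.t1) if p.b > T0 else None
        else:
            root = (T0 - p.b) / p.a   # root finding: where temp crosses T0
            if p.a > 0:
                interval = (max(p.t0, root), p.t1)
            else:
                interval = (p.t0, min(p.t1, root))
        if interval and interval[0] < interval[1]:
            t = interval[0]
            while t < interval[1]:    # GRID G0: discretize only the result
                out.append((t, p.a * t + p.b))
                t += grid
    return out
```

Because gridding happens after the predicate is solved algebraically, a highly selective query discretizes only the small fraction of intervals that pass the filter, which is the effect measured in Figure 7.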

Figure 7 shows the total query execution time (CPU + I/O) for the above query, for three different values of the cutoff temperature T0. These values correspond to three different predicate selectivities: Not Selective (∼70% of tuples pass the filter), Selective (∼20% of tuples pass), and Highly Selective (only ∼2% of tuples pass). Algebraic query processing yields significant benefits for the selective query plan, because this plan only needs to grid 2% of the data for output. The benefits are less pronounced for the less selective query plan, and there is no difference for a query plan that needs to output most of the data, because gridding is the main bottleneck in both query plans. We have repeated this experiment for larger values of output grid size G0; the relative results and conclusions are unchanged as long as the grid size is not too wide.


Figure 7: Selection performance of FunctionDB compared to gridding with output grid size 1 second, for 3 values of selectivity: 70%, 20% and 2%. Algebraic processing is a win for selective queries, which do not grid much data for display.

5.2 Trajectory Similarity

Our second evaluation dataset consists of vehicle trajectories from CarTel [5], fitted with a piecewise linear model for road segments. The data consists of 1,974 vehicle trajectories collected over a year. Our regression model consists of 72,348 road segment "pieces", used to fit 1.65 million raw GPS readings. This model fits fewer raw data points per piece than the regression model for temperature, because trajectories are sometimes curved or include many turns, requiring multiple line segments for approximation.

For evaluation, we used a variant of the trajectory similarity query described in Section 2.2. Given a trajectory identifier, this query finds trajectories whose endpoints are within 1′ of latitude and longitude (corresponding to ∼1.4 km) of the endpoints of the given trajectory. For each such neighbouring trajectory, the query lines up points on the given trajectory with points on its neighbour based on a "distance fraction" criterion (Section 2.2), and computes the average distance between pairs of lined up points as a similarity metric between the trajectory and its neighbour.

To focus our benchmark on join performance, we simplified the query somewhat for both the FunctionDB and gridding approaches, by precomputing a materialized view, fracview, with the logical schema <tid, lon, lat, frac>. In the gridded representation, points in the trajectory are sampled at regular intervals of the independent variable, lon. For each point in the trajectory, frac is a real number in the range [0, 1] representing the fraction of distance along the trajectory at which the point occurs. In the function table representation, both lat and frac are stored as piecewise linear functions of lon. This is possible because the increase in frac along each road segment is proportional to the distance along the segment, √(Δlat² + Δlon²), which is a linear function of lon whenever lat is a linear function of lon. Given this, trajectory similarity involves computing a join on frac. The SQL query is similar to that presented in Section 2.2, except that it does not compute all-pairs similarities, and it includes an additional nearness criterion.

The FunctionDB plan for trajectory similarity is completely algebraic. The major step is an NL-Join on frac (the dependent variable) using function inversion (Section 4.2). Joined trajectories are grouped on tid using a traditional GROUP BY. The last step maps the algebraic expression for Euclidean distance to the functions in each tuple, and computes an AVG aggregate over this expression.


Figure 8: Lining up two trajectories using a grid approach.

The procedure for the gridded version of the join on frac is illustrated in Figure 8. Each grid point on a search trajectory is matched to the point with the closest value of frac on the (given) query trajectory using binary search on a tree data structure built on the fly, and the average distance is computed between the pairs of points lined up in this way.
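In code, this gridded pairing might look roughly like the following; bisect stands in for the tree built on the fly, each trajectory is assumed to be a list of (frac, lat, lon) grid points sorted by frac, and all names are illustrative.

```python
import bisect
import math

def similarity(query, search):
    """Average distance between each search point and the query point
    with the closest frac value (the gridded pairing of Figure 8)."""
    fracs = [f for f, _, _ in query]      # query points sorted by frac
    total = 0.0
    for f, lat, lon in search:
        i = bisect.bisect_left(fracs, f)  # binary search on frac
        # Pick the nearer of the two neighbouring query points.
        cands = [j for j in (i - 1, i) if 0 <= j < len(query)]
        j = min(cands, key=lambda j: abs(fracs[j] - f))
        _, qlat, qlon = query[j]
        total += math.hypot(lat - qlat, lon - qlon)
    return total / len(search)
```

The accuracy of this pairing depends entirely on how evenly the grid points are spaced along frac, which is the source of the discretization error discussed below.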

Figure 9 shows a CDF of the discretization error in the similarity metric (averaged over search trajectories) computed by the gridding approach. The distribution is computed over different choices for the query trajectory, and is shown for two grid sizes, one equal to the raw data, and the other 8 times this size (the curves for 2 and 4 times lie in between). The graph shows that median discretization error is larger for the wider grid size. Also, for both grid sizes, portions of the distribution experience significant discretization error (> 15%).

Discretization error occurs in this example because the data is gridded along the lon attribute (Figure 8). Hence, each trajectory has an unequal number of grid points, unevenly spaced along the join attribute frac, resulting in imperfect pairings. While using a representation gridded over frac could be more accurate in this


Figure 9: CDF of discretization error for two different grid sizes on the trajectory similarity query, computed over query trajectories. The error distribution has a significant tail.


Figure 10: Performance of FunctionDB on the trajectory similarity query, compared to gridding with different grid sizes. As before, the grid sizes are multiples (1x, 2x, 4x, 8x) of the average spacing of raw data in the data set.

particular case, this might not work well for other queries. It may be possible to maintain multiple gridded representations and infer which one to use for each query, but this would add complexity and redundancy, and would be harder to maintain.

Figure 10 shows the corresponding performance results, averaged over query trajectories. FunctionDB outperforms all but one of the gridding strategies, and at the same time has no discretization error. The improvements are not as dramatic as in the temperature histogram, because the ratio of the number of grid points to pieces is somewhat lower (as mentioned earlier). However, these results do show that algebraic processing has significantly better performance than discrete sampling.

6. RELATED WORK

Existing DBMSs do include some support for models for data mining [8, 10], but they typically treat models as black box user-defined functions, as opposed to FunctionDB, in which regression functions are represented and manipulated in the database as first-class objects. [16] generalizes query optimization to support predicates that involve data mining models; however, [16] is mainly focused on classification, as opposed to regression models, which are our primary focus.

Our previous work on MauveDB [4] proposed querying models (including regression models) using a relational framework. FunctionDB is based on a similar idea, but extends the state of the art by using an algebraic framework and representation, which enable substantially faster and more accurate query execution than the gridding approach used by MauveDB.

The ideas of representing an infinite relation in a DBMS, and of query processing using algebraic computations, are not new. Constraint query languages, proposed in the context of querying geometric regions and formalized by [11], represent and query infinite regions as systems of constraints. There have been prototype implementations of constraint database systems for solving spatial queries and interpolating spatial data [3, 12-15]. FunctionDB differs from constraint databases in two main ways. First, our data model is simpler and specifically restricted to regression models. Hence, our query processor is very different from a generalized constraint solver, and explicitly leverages the fact that tables contain functions that support algebraic primitives. FunctionDB is consequently more extensible to new classes of models, whereas constraint databases have focused mainly on linear constraints to keep query processing tractable. Second, the focus of our work is on efficient query processing for regression models, while work on constraint query languages and databases has traditionally focused on the complexity of supporting large numbers of constraints (e.g., for linear programming applications).

Some systems (such as Postgres with PostGIS [1] extensions) support polyline and other trajectory data as ADTs. Internally, these types have a similar structure to the curves output by regression, but they do not support the range of operations over functional data that our more general model supports, and are targeted exclusively towards geospatial data. Hence, these GIS extensions cannot, for example, compute the average value of a curve over some range, join two curves together, or convert to or from curves and discrete points.

Moving objects databases [6, 9] have proposed query languages and representations suitable for expressing queries on continuously varying spatio-temporal data, including trajectories. While some of their techniques have parallels with our ideas for querying continuous data, their work is specifically targeted at representing and querying object trajectories, while our work aims to support a more general class of applications using regression. Specific algorithms for similarity and indexing using piecewise line segments and functions have also been investigated earlier for time series and spatial data, e.g., [7] and [17]. These also are interesting instances of the general class of applications supported by FunctionDB.

Tools like Matlab [2] support fitting regression functions, as well as algebraic and symbolic manipulation of functions, but lack support for relational queries. Also, as argued in [4], using these tools is inconvenient if data is already in a DBMS, because data needs to be moved back and forth between the external tool and the DBMS, resulting in considerable inconvenience and performance overhead.

7. CONCLUSION

This paper described FunctionDB, a novel DBMS that supports regression functions as a data type that can be queried and manipulated like traditional relations. We proposed a simple piecewise function representation for regression models, and showed how to build an extensible algebraic query processor by expressing relational operations in terms of basic algebraic primitives on functions. We have evaluated and quantified the benefits of algebraic query processing on two real-world applications that use modeling, and have shown that regression provides accuracy benefits over querying raw data directly. In addition, we have shown that our query processor is 10x-100x faster, as well as up to 15% more accurate on several realistic queries, compared to existing approaches that represent models as gridded data.

8. REFERENCES

[1] PostGIS. http://postgis.refractions.net/.
[2] Matlab. http://www.mathworks.com/products/matlab/.
[3] Alexander Brodsky, Victor E. Segal, Jia Chen, and Pavel A. Exarkhopoulo. The CCUBE Constraint Object-Oriented Database System. In SIGMOD Conference on Management of Data, 1999.
[4] Amol Deshpande and Samuel Madden. MauveDB: Supporting Model-Based User Views in Database Systems. In ACM SIGMOD Conference on Management of Data, 2006.
[5] Bret Hull, Vladimir Bychkovsky, Yang Zhang, Kevin Chen, Michel Goraczko, Allen K. Miu, Eugene Shih, Hari Balakrishnan, and Samuel Madden. CarTel: A Distributed Mobile Sensor Computing System. In 4th ACM SenSys, Boulder, CO, November 2006.
[6] R. H. Guting, M. H. Bohlen, M. Erwig, C. S. Jensen, N. A. Lorentzos, M. Schneider, and M. Vazirgiannis. A Foundation for Representing and Querying Moving Objects. ACM Transactions on Database Systems, 25(1):1-42, 2000.
[7] Huanmei Wu, Betty Salzberg, Gregory C. Sharp, Steve B. Jiang, Hiroki Shirato, and David Kaeli. Subsequence Matching on Structured Time Series Data. In ACM SIGMOD Conference on Management of Data, 2005.
[8] IBM. IBM DB2 Intelligent Miner. http://www-306.ibm.com/software/data/iminer/.
[9] Michalis Vazirgiannis and Ouri Wolfson. A Spatiotemporal Model and Language for Moving Objects on Road Networks. In SSTD, pages 20-35, 2001.
[10] Oracle. Oracle Data Miner. http://www.oracle.com/technology/products/bi/odm/odminer.html.
[11] Paris C. Kanellakis, Gabriel M. Kuper, and Peter Z. Revesz. Constraint Query Languages. In Symposium on Principles of Database Systems, pages 299-313, 1990.
[12] Peter Z. Revesz. Constraint Databases: A Survey. In Semantics in Databases, pages 209-246, 1995.
[13] Peter Z. Revesz, Rui Chen, Pradip Kanjamala, Yiming Li, Yuguo Liu, and Yonghui Wang. The MLPQ/GIS Constraint Database System. In SIGMOD Conference on Management of Data, 2000.
[14] Stephane Grumbach, Philippe Rigaux, and Luc Segoufin. The DEDALE System for Complex Spatial Queries. In SIGMOD Conference on Management of Data, pages 213-224, 1998.
[15] Stephane Grumbach, Philippe Rigaux, and Luc Segoufin. Manipulating Interpolated Data is Easier than You Thought. In The VLDB Journal, pages 156-165, 2000.
[16] Surajit Chaudhuri, Vivek R. Narasayya, and Sunita Sarawagi. Efficient Evaluation of Queries with Mining Predicates. In International Conference on Data Engineering, 2002.
[17] Yuhan Cai and Raymond Ng. Indexing Spatio-Temporal Trajectories with Chebyshev Polynomials. In ACM SIGMOD Conference on Management of Data, 2004.