155
UC Irvine UC Irvine Electronic Theses and Dissertations Title Supporting Interactive Analytics and Visualization on Large Data Permalink https://escholarship.org/uc/item/3pn936k6 Author Jia, Jianfeng Publication Date 2017 License CC BY 4.0 Peer reviewed|Thesis/dissertation eScholarship.org Powered by the California Digital Library University of California

UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

UC IrvineUC Irvine Electronic Theses and Dissertations

TitleSupporting Interactive Analytics and Visualization on Large Data

Permalinkhttps://escholarship.org/uc/item/3pn936k6

AuthorJia, Jianfeng

Publication Date2017

LicenseCC BY 4.0 Peer reviewed|Thesis/dissertation

eScholarship.org Powered by the California Digital LibraryUniversity of California

Page 2: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

UNIVERSITY OF CALIFORNIA,IRVINE

Supporting Interactive Analytics and Visualization on Large Data

DISSERTATION

submitted in partial satisfaction of the requirementsfor the degree of

DOCTOR OF PHILOSOPHY

in Computer Science

by

Jianfeng Jia

Dissertation Committee:Professor Chen Li, Chair

Professor Michael J. CareyProfessor Sharad Mehrotra

2017

Page 3: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

All materials © 2017 Jianfeng Jia

Page 4: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

DEDICATION

To my family.

ii

Page 5: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

TABLE OF CONTENTS

Page

LIST OF FIGURES vi

LIST OF TABLES ix

ACKNOWLEDGMENTS x

CURRICULUM VITAE xii

ABSTRACT OF THE DISSERTATION xiv

1 Introduction 1

2 TwitterMap: Interactive Visualization on One Billion Tweets 62.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.2 Frontend Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.3 Frontend Web Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.4 AsterixDB Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.5 Tweet Analytics Demonstration . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.5.1 Initial Map View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112.5.2 Normalized Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.5.3 Zooming In . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.5.4 Query Slicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3 Cloudberry: A Middleware System to Support Interactive Analytics 173.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173.2 Clouberry Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233.3 API Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.3.1 Data Declaration Request . . . . . . . . . . . . . . . . . . . . . . . . 243.3.2 Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273.3.3 Data Query Request . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.4 System Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353.4.1 Data Registration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353.4.2 Query Rewriting Using Views . . . . . . . . . . . . . . . . . . . . . . 373.4.3 View Maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

iii

Page 6: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

3.4.4 Concurrency Management . . . . . . . . . . . . . . . . . . . . . . . . 513.5 Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3.5.1 Paraview Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533.5.2 Twitter Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543.5.3 Connecting to Different Databases . . . . . . . . . . . . . . . . . . . 55

3.6 Comparison with Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 553.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4 Drum: A Rhythmic Approach to Interactive Analytics on Large Data 614.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.2.1 Architecture and Query Slicing . . . . . . . . . . . . . . . . . . . . . 654.2.2 Slicing Schedules and User Satisfaction . . . . . . . . . . . . . . . . . 664.2.3 Schedule Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.3 Drum: An Adaptive Framework for Generating Mini-Queries . . . . . . . . . 694.3.1 Regression Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 704.3.2 Uncertainty Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 714.3.3 Tradeoff of Running Time and Penalty . . . . . . . . . . . . . . . . . 734.3.4 Choosing the Range ri for Next Mini-Query Qi . . . . . . . . . . . . 75

4.4 Choosing an Optimal Predicate Range . . . . . . . . . . . . . . . . . . . . . 764.4.1 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 764.4.2 Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 804.4.3 An Example Sequence of Mini-queries . . . . . . . . . . . . . . . . . . 82

4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834.5.1 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 844.5.2 Effect of Different Slicing Methods . . . . . . . . . . . . . . . . . . . 854.5.3 Effect of Penalty Weight α . . . . . . . . . . . . . . . . . . . . . . . . 894.5.4 Adaptiveness of Regression Function . . . . . . . . . . . . . . . . . . 90

4.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 924.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

5 Using Filters to Improve Secondary-to-Primary Index Search 955.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 955.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

5.2.1 LSM-Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 965.2.2 Filter-based LSM Storage in AsterixDB . . . . . . . . . . . . . . . . . 97

5.3 Using Filters to Improve Secondary-to-Primary Index Search . . . . . . . . . 1045.3.1 Basic Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1045.3.2 Implementation in AsterixDB . . . . . . . . . . . . . . . . . . . . . . 107

5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1085.4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1085.4.2 Machine and Parameter Configuration . . . . . . . . . . . . . . . . . 1105.4.3 Query Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1115.4.4 Analysis of the Improvement . . . . . . . . . . . . . . . . . . . . . . . 116

5.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

iv

Page 7: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

6 Conclusions and Future Work 1216.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1216.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

Bibliography 127

A Solve the Optimal ri in the Drum Framework 135A.1 The Optimal ri using Histogram Distribution . . . . . . . . . . . . . . . . . 135A.2 The Optimal ri using Gaussian Distribution . . . . . . . . . . . . . . . . . . 136

v

Page 8: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

LIST OF FIGURES

Page

2.1 TwitterMap interface when a user searches “Zika” on tweets. . . . . . . . . . 72.2 TwitterMap architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.3 A frontend web JSON record to indicate that the user has a “move” action. . 92.4 The $4,000 TwitterMap cluster supporting interactive visualization on 2TB

data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112.5 Normalized distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.6 Zoom into county-level details in Florida. . . . . . . . . . . . . . . . . . . . . 132.7 Example of using query slicing . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.1 The spatial and temporal distribution of tweets that mention “Zika” . . . . . 183.2 The relation between the event time and the ingestion time. . . . . . . . . . 213.3 Cloudberry architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223.4 An example of the dataset tweet in AsterixDB . . . . . . . . . . . . . . . . 263.5 DDL in Cloudberry to register the tweet dataset of the underlying database. 273.6 Get the per-state and per-hour count of tweets that contain “zika” and “virus” 303.7 Get the top-10 related hashtags for tweets that mention “zika” . . . . . . . . 313.8 Get 100 latest sample tweets that mention “zika” . . . . . . . . . . . . . . . . 313.9 Lookup the population value from the “population” dataset and append it

to the current record. The “state” field (joinKey) in the result record willbe used to join with the population.stateID (lookupKey) from thepopulation dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.10 An estimate request that asks for the total number of tweets and it can acceptan approximate answer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.11 Using the option field to specify that the partial result should be deliv-ered every 2,000 milliseconds and the result should be updated every 2,000milliseconds for the newest data. . . . . . . . . . . . . . . . . . . . . . . . . 34

3.12 Data registration workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353.13 An example metadata record of the tweet dataset in Cloudberry. . . . . . . 373.14 Create view “zika”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413.15 Create the “zika” view in AsterixDB. . . . . . . . . . . . . . . . . . . . . . . 423.16 The “zika” view metadata record . . . . . . . . . . . . . . . . . . . . . . . . . 433.17 The process of rewriting the request using “zika” view. . . . . . . . . . . . 443.18 The time relationship between the “zika” view and the query. . . . . . . . . . 46

vi

Page 9: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

3.19 The SQL++ query to combine the records from the “zika” view and the basedataset “tweet”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

3.20 Update the view based on its time range. . . . . . . . . . . . . . . . . . . . . 493.21 The SQL++ example of appending more records from the base dataset tweet

to the view “zika”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503.22 The “zika” view’s metadata record after the update finished at 2017-10-10T10:01:00.

513.23 Actor architecture of Cloudberry. . . . . . . . . . . . . . . . . . . . . . . . . 523.24 The large display visualization system at the Army Research Lab. . . . . . . 54

4.1 Distribution of “Zika” tweets collected over 1.5 years from November 2015 toMay 2017. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.2 Three query-slicing schedules, where each thick line indicates a delay for anexpected 2-second pace. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.3 Missing a delivery deadline. . . . . . . . . . . . . . . . . . . . . . . . . . . . 684.4 Drum framework for adaptive query slicing. . . . . . . . . . . . . . . . . . . . 704.5 Linear regression for tweet mini-queries with different predicates on the create_at

attribute. (110 million tweets, keyword=zika) . . . . . . . . . . . . . . . . 714.6 The Histogram and Gaussian models. The errors are collected from the ob-

servation ti − f(ri) in Fig. 4.5. . . . . . . . . . . . . . . . . . . . . . . . . . . 724.7 The score and error PDF for two target running times, ri = 2.2 seconds and

r′i = 1.85 seconds, with time limit Li = 2.2 seconds, f(ri) = 0.08ri + 0.9, andPDF (ti) = N(f(ri), 0.4). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

4.8 Use an error histogram to compute the expected score for targeting runningtime f(ri). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

4.9 Expected score as ri changes (histogram model). . . . . . . . . . . . . . . . . 794.10 Evaluating slicing methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 884.11 Effect of penalty weight α (DRUM-HS) . . . . . . . . . . . . . . . . . . . . . 894.12 Effect of penalty weight α (DRUM-GA) . . . . . . . . . . . . . . . . . . . . . 904.13 Adaptiveness of the linear regression (adding a 1-second sleep for the second

half of the mini-queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 914.14 Adaptiveness of the linear regression (adding a 1-second sleep for each mini-

query randomly) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

5.1 Life cycle of a component-based LSM-tree [51] . . . . . . . . . . . . . . . . . 975.2 LSM storage example in AsterixDB. . . . . . . . . . . . . . . . . . . . . . . 985.3 SQL++ DDL statement to declare the filter field. . . . . . . . . . . . . . . . 1005.4 Update the filter range when merging two components. . . . . . . . . . . . . 1005.5 The query plan of secondary-to-primary index search using a filter. “SIX”

stands for secondary index search. “PIX” stands for primary index search. . . 1015.6 SQL++ query with a filter condition. . . . . . . . . . . . . . . . . . . . . . . 1015.7 The physical data flow of filter-based secondary-to-primary search . . . . . . 1025.8 SQL++ query to count the number of tweets mentioning zika. . . . . . . . . 104

vii

Page 10: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

5.9 The relationship between secondary-index components with primary-indexcomponents based on their filter value. The primary key found in a secondarycomponent does not need to be probed on the primary components whosefilter values are not overlapping with the secondary component’s filter. . . . 105

5.10 The filter-based secondary-to-primary index search plan. . . . . . . . . . . . 1075.11 Tweet type definition used in the experiments. . . . . . . . . . . . . . . . . . 1095.12 Three types of test queries. We changed the keywords, city ids, and radii of

the queries to change their selectivity. . . . . . . . . . . . . . . . . . . . . . . 1125.13 Comparison of the query performance between the original filter plan to using

the secondary filter values for the primary index search. (The percentagesbelow the predicate values on the x-axes show their selectivities.) . . . . . . 113

5.14 Comparison of the query performance between using and not using the sec-ondary filters for the primary index search under the correlated merge policy. 114

5.15 Break down the performance gain between not using (W/O) and using (W/I)the secondary filter plan. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

viii

Page 11: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

LIST OF TABLES

Page

3.1 Cloudberry data types and functions. “-” means unsupported . . . . . . . . . 28

4.1 The costs of the schedules of Fig. 4.2 . . . . . . . . . . . . . . . . . . . . . . 694.2 Maximal expected score for each interval bin . . . . . . . . . . . . . . . . . . 804.3 A sequence of mini-queries generated adaptively based on the latest statistical

information. (α = 25) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 824.4 The first three mini-query results . . . . . . . . . . . . . . . . . . . . . . . . 83

5.1 A sample dataset tweet. The id is the primary key. The value of the timefield has a timestamp number. The text field contains a bag of words. . . . 98

5.2 AsterixDB settings for the experiment. . . . . . . . . . . . . . . . . . . . . . 1105.3 The number of components of each index after ingestion. . . . . . . . . . . . 1115.4 The number of Bloom filter checks and primary B-tree searches to find “flood”

related records. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

ix

Page 12: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

ACKNOWLEDGMENTS

I am deeply grateful to my advisers Professor Chen Li and Professor Michael J. Carey fortheir continuous support and guidance during the past five years. Professor Li guided everyaspect of my Ph.D. study from choosing my research topics, developing systems, to writingtechnical articles. His insightful vision made the Cloudberry project happen. His distinctiveperspective and serious research procedure fundamentally changed my mindset and shapedme to be an independent researcher. Professor Carey is a knowledgeable system researcherwho has been a walking encyclopedia and led me to a more enlightened research path. Hiscarefulness on every detail of experiment analysis as well as the enthusiasm in every potentialperformance improvement educated me how to become a serious system builder. Withoutthem, I could not have finished this thesis.

I would like to thank Professor Sharad Mehrotra for joining my doctoral committee. Hisknowledge and insightful comments have always strengthened my work.

I would like to thank Ian Maxon for proof-reading the draft version of the thesis and providinginvaluable suggestions from structure to wordings.

I would like to thank Dr. Simon Su, Micheal An, and Vincent Perry from the Army ResearchLab for building their multi-display visualization application on top of the Cloudberry systemand authorizing us to use their demo picture in the thesis.

I would like to thank Chen Li (the student version) and Zhang Xi for contributing the earlyprototype of Cloudberry. With their help, we won the Best Visualization Award in the UCIData Hackathon, which gave a high confidence on the direction of the system at the earlystage.

I would like to thank Yingyi Bu for his invaluable suggestions and discussions during myresearch. His idea directly inspired the work in Chapter 5.

I would like to thank Mengfan Tang for his great help on the mathematical solutions inChapter 4.

I would like to thank all other Cloudberry team members and alumnus: Taewoo Kim,Chen Luo, Te-Yu Chen, Dharini Sreenivasan, Nishad Gurav, Aishwarya Kapse, Kaiyi Ma,Zongheng Ma, Monique Moledo, Victor Phung, Hao Chen, Vignesh Sankar, Shengjie Xu,Liangju Chu, Sicong Liu, Vidhyasagar Thirumaraiselvan, Haohan Zhang, Qiancheng Wu,Qiushi Bai, and Tejia Zhang. I would like to thank the Professor Wenhai Li and ProfessorHui Zhang who tried Cloudberry and gave us feedback.

I would like to thank the AsterixDB team including Dr. Till Westmann, Dr. Alex Behm,Dr. Raman Grover, Dr. Sattam Alsubaiee, Pouria Pirzadeh, Ian Maxon, Abdullah Alam-oudi, Inci Cetindil, Taewoo Kim, Zach Heilbron, Preston Carman, Ildar Absalyamov, StevenJacobs, Madhusudan C.S, Khurram Faraaz, Kereno Ouaknine, Xikui Wang, Chen Luo, Pro-fessor Heri Ramampiaro, and Professor Wenhai Li for working hard together to make the

x

Page 13: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

system stable and reliable.

I would like to thank my wife Hangyan for every day we have spent together, for her love,support, care, and understanding through these years. I thank my daughter Jasmine forfilling my life with love, joy, and happiness. I thank my parents Zhongqi and Faping fortheir love and continuous encouragement. Their sacrifices allowed me to pursue this pathand come this far.

The work reported in this thesis has also been supported by NIH award 1U01HG008488- 01,NSF CNS award 1305430, and the Army Research Laboratory under Cooperative AgreementNo. W911NF-16-2-0110.

xi

Page 14: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

CURRICULUM VITAE

Jianfeng Jia

EDUCATION

Doctor of Philosophy in Information and Computer Science 2017University of California, Irvine Irvine, California

Master of Science in Computer Science 2008Xiamen University Xiamen, China

Bachelor of Science in Computer Science 2005Xiamen University Xiamen, China

SELECTED HONORS AND AWARDS

Google Graduate Student Award in ICS 2017University of California, Irvine

Best Visualization Demo Award in UCI Data Science Hackathon 2016University of California, Irvine

xii

Page 15: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

PUBLICATIONS

Drum: A Rhythmic Approach to Interactive Analytics on LargeData

2017

IEEE International Conference on Big Data (IEEE Big Data)

Caching Geospatial Objects in Web Browsers 2017International Conference on Advances in Geographic Information Systems

Visual Analytics Ecology for Complex System Testing 2017Visualization in Practice at IEEE VIS

Twitter Coverage of Climate Change and Health before and afterthe 2016 US Presidential Election

2017

APHA Annual Meeting

Towards Interactive Analytics and Visualization on One BillionTweets

2016

International Conference on Advances in Geographic Information Systems

Pregelix: Big(ger) Graph Analytics on A Dataflow Engine 2014Proceedings of the Very Large Database Endowment (PVLDB)

xiii

Page 16: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

ABSTRACT OF THE DISSERTATION

Supporting Interactive Analytics and Visualization on Large Data

By

Jianfeng Jia

Doctor of Philosophy in Computer Science

University of California, Irvine, 2017

Professor Chen Li, Chair

There is an increasing demand to visualize large datasets as human observable reports in

order to quickly draw insights and gain timely awareness from the data. An interactive

user interface is an indispensable tool that allows users to analyze the data from different

perspectives and to inspect the result from the global overview to the finest granularity. To

enable this type of interactive user experience, the frontend needs to generate new requests on

the fly, and the results must be computed and delivered within seconds. Big Data platforms

can take tens or hundreds of seconds to complete an OLAP-style query, so there is a need

for a solution that can meet the stringent latency requirement of interactive visualization

frontends.

In this thesis, we address the interactivity challenges from a middleware perspective to

provide a generic solution that can utilize existing database systems as a “black box” to

support various interactive visualization applications efficiently.

We first present Cloudberry, an open-source general-purpose middleware system to support

interactive analytics and visualization on big data with various attributes. It can auto-

matically create, maintain, and delete materialized views by analyzing each request and its

results. By utilizing materialized views stored in the backend database, it can reduce the

query response time remarkably. We build an application called “TwitterMap” using Cloud-

xiv

Page 17: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

berry to demonstrate its suitability to support interactive data analytics and visualization

on more than one billion tweets (about 2TB).

We then present a query-slicing technique in Cloudberry, called Drum, that can “slice” a query

into small pieces (called “mini-queries”) so that the middleware can send these mini-queries

to the DBMS one by one and compute results progressively. The Drum framework can collect

run-time behavioral statistics for the database system to decide the predicate for the next

mini-query so that it can appropriately provide a smooth user experience. Moreover, Drum

is a general technique in Cloudberry that requires no changes to the underlying database

system. Our experiments on a large, real dataset show that the Drum technique can reduce

the delay of delivering intermediate results to the user without much reduction of the overall

speed for the complete query answer.

To further speed up Big Data visualization, we present a method of using LSM filters to

accelerate secondary-to-primary index search in an LSM storage setting. This technique can

propagate the filter values from the secondary index to speed up the primary index search.

We implemented it in AsterixDB, and our experiments show that the new approach can

reduce the query time by 20% to 70% for different queries with various selectivities. With

this improvement, Cloudberry can generate mini-queries that contain much wider range

predicates, which can reduce the cost of the progressive query evaluation.

xv

Page 18: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Chapter 1

Introduction

We are living in an era where a significant amount of digital information is generated on a

daily or even hourly basis through social networks, blogs, online communities, news sources,

and mobile applications. This massive amount of data holds valuable information that can

be used for various analytical purposes, e.g., to track social opinions on different topics,

trends of a specific event, etc. There are increasing demands to visualize the large datasets

as human observable reports in order to quickly draw insights and gain timely awareness

from the data. Most importantly, an interactive user interface is an indispensable tool that

allows users to analyze the data from multiple perspectives and to inspect the results from

the global overview to the finest granularity.

To achieve the interactivity, the frontend needs to generate new requests on the fly, and the

results must be computed and delivered in seconds or even in milliseconds. Many analytics

systems (e.g., Superset [9], Tableau [87]) are connecting to databases directly and send new

database queries once users have new actions in the frontend. Consequently, the responsive-

ness depends on the query execution time of the underlying database. Although nowadays

Big Data systems (e.g., Apache AsterixDB [13]) can manage large amounts of data, it can

1

Page 19: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

still take tens or hundreds of seconds to answer an OLAP-style query. There is a need for a

backend service that can meet the stringent latency requirements of interactive visualization

in frontend applications.

In this thesis, we address the interactivity challenges from a middleware perspective to pro-

vide a general solution that can utilize existing database systems to support various inter-

active visualization applications efficiently. In particular, we study three different problems

and provide an efficient solution for each of them.

Design a General-Purpose Middleware System

An analytical application often issues a series of queries where the results of the previous

queries affect the formulation of the next ones. For example, an analyst may want to find

how social networks respond to events such as the Zika virus. The analyst may first want

to get a quick overview of the distribution of the tweets that mention Zika. Once they

have found that Zika was discussed more frequently in a particular region, e.g., Florida, the

analyst may want to drill down to the Florida region to further explore the time and spatial

patterns. The later queries share the same interest in Zika as the first one. This pattern

presents a chance to speed up new queries by using the previous query results. For example,

if we can store all the Zika-related tweets in a materialized view after the first query, the

future zoom-in queries will only need to visit the view without touching many irrelevant

records in the main dataset. It is desirable to have a system that can automatically collect

the previous results and reuse them efficiently for solving related queries.

Chapter 3 presents Cloudberry, an open-source general-purpose middleware system to sup-

port interactive analytics and visualization on large data with various attributes. It can

automatically create, maintain, and delete views by analyzing each request and its results.

By utilizing materialized views stored in the database, it can reduce the query-response time

significantly. Furthermore, by implementing the main logic in the middleware layer, it is a

2

Page 20: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

generic solution that can connect to different types of frontend applications as well as various

database systems with the corresponding connectors.

Progressively Solving Queries in Middleware

Some queries cannot be answered using materialized views, in which case executing the

query over the entire dataset is inevitable. In order to guarantee the responsiveness of the

frontend, we present query-slicing techniques in Cloudberry in Chapter 4, called Drum1,

which can slice a query into small pieces (called “mini-queries”) so that the middleware can

send them efficiently one by one and compute results progressively. For example, suppose

the first request that the user sends to Cloudberry is “show the number of tweets mentioning

the keyword zika for each state”. Instead of directly sending one query to the database

system, Cloudberry can issue a sequence of cheaper “mini-queries” by adding a predicate on

the time period, e.g., “show the number of tweets mentioning the keyword zika for each

state from January 2017 to February 2017”, so that each of them can access a small amount

of data and could finish faster. Cloudberry can merge the results as needed and send them

to the frontend progressively. Then the frontend interface can be updated more quickly,

which leads to a more responsive user experience. We formulate an optimization problem to

produce the predicates of mini-queries by considering both their total running time as well

as the smoothness of result delivery, the key goal being to provide the incremental results at

a rhythmic pace to improve the user experience. The Drum framework can collect run-time

behavioral statistics for the database system to decide the predicate for the next mini-query

appropriately. Moreover, Drum is a middleware solution that requires no changes to the

underlying database system.

Using Filters to Improve Secondary-to-Primary Index Search

Besides the enormous size, data can be created and ingested at a high rate in real-time.1Drum stands for “Data Retrieval Using Milestones.”

3

Page 21: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

To accommodate this kind of insert-intensive workload, popular NoSQL systems such as

HBase [1], Cassandra [22], LevelDB [54], and BigTable [23] have adopted LSM-trees (log-

structured merge-trees) [63]. LSM-trees amortize the cost of writes by batching updates

in-memory before writing them to disk, thus avoiding random writes. This benefit comes

at the cost of possibly sacrificing read efficiency. To improve read performance, AsterixDB

implemented LSM filters [15] as a way to add additional metadata to LSM components

and to exploit that metadata during query processing to skip irrelevant components. The

filter feature plays an important role in optimizing mini-query performance. If the original

query is split on the same dimension as the filter field, it can significantly improve query

performance since most of the irrelevant components can be skipped. In order to efficiently

utilize the power of the filter, the mini-query needs to have a highly selective filter related

predicate. This limitation largely restricts the scenarios that can utilize the pruning power of

the filter, as many analytical queries do not match this pattern. For example, a mini-query

can have a wide time range predicate, and then many index components will be matched

based on their filters. In Chapter 5, we visit the filter idea, explore the aligned structure

property between secondary index components and primary index components, and develop

an improved version of the secondary-to-primary search. The main idea is to propagate the

filter attached to the secondary index components along with their output primary keys to

the primary-index search. Then the primary-index search can use the filter hints to improve

the search speed, even though the query itself does not contain any filter-related predicates.

The thesis is organized as follows. We first present an analytics web application example

of Cloudberry, called TwitterMap, that allows users to interactively query, analyze, and

visualize more than one billion tweets in Chapter 2. Then we introduce the design and

implementation of Cloudberry in Chapter 3. In Chapter 4 we present the Drum framework

to show how to progressively answer a time-consuming query on a large dataset by generating

a sequence of mini-queries. In Chapter 5 we visit the filter idea and present a method of using

LSM filters to accelerate secondary-to-primary LSM index search in AsterixDB to reduce the

4

Page 22: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

response time of a mini-query. Chapter 6 concludes the thesis and outlines a few directions

for future work.

5

Page 23: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Chapter 2

TwitterMap: Interactive Visualization

on One Billion Tweets

We have developed a live visualization system called “TwitterMap” [84] that allows users

to interactively explore and visualize more than one billion tweets by specifying temporal,

spatial, and textual conditions in real time. In this chapter, we will introduce how Twit-

terMap supports interactive analysis and visualization to illustrate the requirements of the

visualization system.

Fig. 2.1 shows the user interface of TwitterMap. The system follows the “overview first, zoom

and filter, then details on demand” [74] design principle. A user can type keywords (e.g.,

“zika”) to see the spatial, temporal, and hashtag distributions of the related tweets. After

identifying a hot area, time range, or hashtags of interest, the user can further drill down

or zoom out to different levels, including states, counties, and cities. The system also lists

relevant tweets in a sidebar in their reverse chronological order so that the user can review

the finest details of the tweets.

6

Page 24: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Figure 2.1: TwitterMap interface when a user searches “Zika” on tweets.

2.1 Architecture

Fig. 2.2 shows the four-tier architecture of the application as a full-stack analytics and visu-

alization solution. It includes a modularized web interface, a web server to interpret the user

action description into the corresponding Cloudberry JSON request, the Cloudberry middle-

ware to rewrite the request into backend queries, and an AsterixDB cluster to continuously

ingest, store, and query data in parallel. The four parts together enable an interactive,

real-time visualization UI on a large-scale Twitter dataset.

2.2 Frontend Interface

The frontend of TwitterMap is a single web page developed in HTML, Angular JS, and

CSS. It displays the spatial and temporal distribution of tweets that contain user-specified

keywords on both a map and a timeline dashboard. The interface allows users to zoom into

7

Page 25: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Resources Data model

Map view Time viewSidebar

view

Fro

nt-e

nd

Web

page

Web

Sev

er

WebSocket

Cloudberry

Ast

erix

DB

MetadataTwitter

Data Feed

Views

WebSocket

User Actions

Response

Cloudberry Request

Cloudberry Response

DB queries DB Results

Cache

Figure 2.2: TwitterMap architecture.

8

Page 26: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

1 {2 "action": "move",3 "from": [[-117.826505,33.684567], [-115.139830,36.169941]],4 "to": " [[-117.396156,33.953349], [-115.112997,36.106965]],5 "mapLevel": "county",6 "keywords": ["zika"],7 "time": ["2015-10-00T00:00:00", "2017-10-06T17:02:18"]8 }

Figure 2.3: A frontend web JSON record to indicate that the user has a “move” action.

finer details on both spatial and temporal dimensions. Whenever the zoom-in level changes

on either dimension, the frontend will issue a new request to the web server and render

the results on the UI. By using a WebSocket connection, the UI is continuously updated as

new results are received from the web server. Fig. 2.3 shows an example JSON record that

describes that a “move” event has happened on the web page. It contains the geographic

bounding box (left-bottom and right-top coordinates) of the map canvas, the keyword that

was input, and the time range that was selected on the UI.

2.3 Frontend Web Server

The web server launches the frontend web page and provides essential resources such as a

GeoJSON file of the state, county, and city polygons. It also receives the frontend actions

then interprets them as corresponding data request(s) to the Cloudberry system.

Take the “move” action in Fig. 2.3 as an example. The web server will first check the overlap

relationship between the stored county shapes with the requested boundary to generate a list

of matched county ids. Then it will generate four Cloudberry requests to ask for per-county

and per-day count aggregations, top-50 hashtags, and the latest sample tweets.

We will introduce how to compose the Cloudberry requests later in Section 3.3.3.

9

Page 27: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

2.4 AsterixDB Cluster

Apache AsterixDB is used to store a large number of tweets and views, and provide a

fast query engine to run queries efficiently. It is a scalable big data management system

that supports a variety of indexes such as B-tree, R-tree, and inverted index to support

filtering operations without scanning entire datasets. It has a built-in parallel runtime query

execution engine, Hyracks, to scale up to hundreds of machines so that an aggregation

query can finish with a low latency. The view manager in Cloudberry communicates with

AsterixDB server via SQL++ queries.

In addition to query processing, AsterixDB provides a “data feed” feature that accepts con-

tinuous data into the database. The TwitterMap system uses AsterixDB’s built-in socket

feed adapter to get streaming tweets while applying a geo-tagging function to annotate the

named locations of each tweet. In this way, users can query the newest tweets in real-time.

We set up the whole system on a cluster of five Intel NUC machines (shown in Fig. 2.4).

Each machine has four cores, 16 GB of memory, and a 500GB SSD disk, and each runs an

Apache AsterixDB node. The cluster costs less than $4,000. The data set is collected using

the Twitter streaming API for the North American area. We have already collected data for

about two years since November 2015, yielding more than one billion tweet records (about

2TB in size), and are still receiving new data at a rate of about 20 tweets per second.

10

Page 28: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Figure 2.4: The $4,000 TwitterMap cluster supporting interactive visualization on 2TB data.

2.5 Tweet Analytics Demonstration

2.5.1 Initial Map View

Fig. 2.1 shows the primary Web interface, which consists of a U.S. geographic map, a time

chart, and a sidebar with rich contextual information. The user types keywords into the

search box, which are used as a filter condition in the frontend action request sent to the

web server. The web server translates the user’s actions into a request in the Cloudberry

JSON format and sends it to the Cloudberry middleware system. After the response comes

back, the map displays the geographic tweet-count distribution for different states. The

time chart shows the number of tweets per day, and the sidebar shows the top-50 frequent

hashtags in the tweets and also the details of the latest sample tweets. For instance, Fig. 2.1

shows that there were over 41 thousand tweets mentioning “zika” out of one billion tweets.

11

Page 29: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

The map figure illustrates how “zika” is distributed in different states. The color of each

region illustrates the “hotness” of the topic in the region. The time chart shows how “zika” is

distributed at different days. We can see there is a spike in August 2016 when the Olympic

Games in Rio de Janerio, Brazil took place and there was a lot of global concern about the

Zika epidemic.

2.5.2 Normalized Count

Figure 2.5: Normalized distribution.

One limitation of using the absolute count to show the map result as shown in Fig. 2.1 is

that the “hotness” states are often the populous ones, e.g., California, Texas, New York, etc.

We provide a “normalization” view feature that can show the number of tweets per capita

in the region to show the relative “hotness” among different places.

The feature is implemented by using the “lookup” request in the Cloudberry, which can

augment the results with additional fields from another dataset. Specifically, the aggregation

12

Page 30: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

records for each region contain additional “population” fields whose values are obtained from

a “US population” dataset.

Fig. 2.5 shows the normalized distribution. In the figure, we can see that “zika” is discussed

more frequently per capita in Florida than other states. This result may due to the fact that

there were many reported “zika” cases in Florida, so people there were tweeting more about

“zika” than in other regions.

2.5.3 Zooming In

Figure 2.6: Zoom into county-level details in Florida.

TwitterMap allows users to explore multi-granularity data. Fig. 2.6 shows a user zooming

into the Florida area after the finding that it is the hottest state talking about “zika”. The

map section automatically switches to the county-level distribution. There are two hot-spots,

13

Page 31: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

in Miami and Martin counties. The time chart is updated to show the corresponding patterns

of tweets published in the zoomed-in area. The user can further select an interesting period

to add another predicate in the time dimension to focus on tweets issued within that period.

Under the hood of the Web interface, every frontend action triggers an HTTP request to the

web sever and finally to Cloudberry to get the required data. Thus, the response time is a

critical factor to achieving an interactive user experience.

2.5.4 Query Slicing

(a) The first response returns the result of the queryon the dataset since February 2017

(b) The second response returns the result from lastDecember to now

Figure 2.7: Example of using query slicing

When there is no view to be utilized, scanning the entire dataset is inevitable, which may

result in a long waiting time in the frontend application. To improve the interactiveness of

the user experience, Cloudberry provides a query slicing feature to return a stream of quick

responses. The query-slicing strategy decomposes a single large query into a sequence of

mini-queries by adding a series of range predicates on the time dimension to the original

query. In this way, each of the mini-queries can deliver results sooner so that it can provide

a better interactive user experience. Inside the middleware system, the partial results will be

merged and sent to the client as a stream of progressive results. TwitterMap uses this feature

by specifying the option “slicingMills:2000” in the request to Cloudberry to indicate

14

Page 32: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

that it accepts partial results and that the expected interval per result is 2,000 milliseconds.

We will present the details of this feature in Chapter 4.

From the perspective of the frontend web page, there is a stream of results from the server.

TwitterMap uses a feature called “watch” in the AngularJS library [17] to continuously update

the UI when the new results are coming. Fig. 2.7 shows the effect of the UI changes. A

user is asking for the distribution of the tweets that mention “fire”. Since there is no view

available, the query will be answered progressively. The first response shown in Fig 2.7a

returns the distribution of the tweets published since February. After two seconds, another

response that contains the aggregation results from last December is returned, and the

frontend updates the interface with the new results. (Though the results are not complete,

they already convey much more information, e.g., how many tweets per day, which states

discuss more about “fire”, etc.) In this way, users can receive progressive results as soon as

possible without waiting for query completion.

2.6 Summary

In this chapter, we presented the TwitterMap application that can allow users to interactively

explore and visualize more than one billion tweets by specifying temporal, spatial, and textual

conditions in real time. Besides its powerful and friendly interface for users to visualize

information on a map, it has several unique capabilities that are enabled by Cloudberry

and the underlying AsterixDB system: (1) Scalability: it can store, index, and query large

amounts of information (e.g., billions of tweets). Besides the primary dataset of tweets, it

can also access other datasets to augment the visualization results. (2) Interactivity: it can

answer an analytical query efficiently (e.g., in sub-seconds). In the case where computing

the complete result for the original query takes a long time, it can provide progressive results

to give timely feedback on the user interface. (3) Real-time analysis: the tweets are ingested

15

Page 33: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

in real-time, and the system allows users to query the latest information as well as historical

data.

In the following chapters, we will introduce the details of the underlying system to explain

how it helps the TwitterMap implement those features.

16

Page 34: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Chapter 3

Cloudberry: A Middleware System to

Support Interactive Analytics

3.1 Introduction

As discussed earlier, we are living in an era where a tremendous amount of digital information

is being generated on a continuous basis through social networks, blogs, online communities,

news sources, and mobile applications. This massive amount of data holds valuable infor-

mation that can be used for various analytical purposes, e.g., to track social opinions on

different topics, trends of a specific event, etc.

To understand what is in a dataset and the characteristics of the data, an analyst often

uses data visualization or tabular reports to have an initial view of the data and a basic

understanding of key characteristics. It is usually followed by drill-down or filtering of the

data to identify anomalies or patterns recognized through a serial of visualization results [48,

49].

17

Page 35: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Figure 3.1: The spatial and temporal distribution of tweets that mention “Zika”

In the TwitterMap example, an analyst may want to find how social networks respond to

events like epidemics, such as the Zika virus. The analyst may want to get a quick overview

of the distribution of the tweets that mention “Zika”. Fig. 3.1 shows a by-state and by-day

distribution of those tweets on a web page. In the figure, we can see that Zika was discussed

more frequently in Florida than in other states. Also, on the time bar chart, there is a high

spike of tweets in August 2016. The analyst can drill down to the Florida region or the period

of August 2016 to further explore the time and spatial patterns. Then the analyst may make

some hypotheses about the pattern, and examine the detailed tweet content afterward to

verify the hypotheses.

This kind of iterative analysis often defined as an exploration session [48], which includes

several queries where the results of the previous queries trigger the formulation of the next

ones. There is thus a chance to speed up the queries by using the previous query results.

For interactive visualization applications, the query-response time greatly affects the user

experience. It is therefore desirable to have a system that can automatically collect the

18

Page 36: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

previous results and reuse them efficiently for solving the later queries.

Connecting a visualization frontend with a database system directly (e.g., Superset [9], Po-

laris [78], Tableau [87]) does not suffice for this end, mainly due to two reasons. First, the

database system usually processes each incoming query independently. Therefore, it lacks

optimization for a sequence of queries that share the similar interests. Second, even though

some database systems support materialized views, it is difficult for the system to create or

drop such views automatically. They instead rely on the administrator to define the views

by evaluating a predictable workload. Moreover, in the exploration case, users may not know

what they are looking for, and they will know that something is interesting only after they

find it. Thus, it is not feasible to define views in advance. Those difficulties often lead to

a complex design in the application’s logic, which burdens visualization developers with the

task of optimizing query speed.

OLAP datacubes [39] support fast aggregation queries, but only over the dimensions that

were included during the construction of the cubes. With large datasets involving 20 or

more dimensions, constructing a “full” cube with all dimensions is often not feasible [56]. For

the same reason, it is not efficient to build a cube with a field that holds a large number of

values, such as a textual field that contains tens of thousands of words.

There are also end-to-end solutions for building specialized visualization data management

systems (e.g., SeeDB [86], DVMS [89]) that improve the visualization performance at every

layer of the overall software stack. Since the outputs of such a system are visualization figures

instead of data, they may not be easily integrated into existing frontend user interfaces. In

addition, these kinds of systems often need to store another copy of the data in their own

managed space. From the developer’s perspective, it could be an unacceptable additional

overhead to maintain an entire copy of a large dataset just for visualization.

In this chapter, we introduce a general-purpose middleware system, called Cloudberry, sitting

19

Page 37: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

between a visualization application and a database system to support interactive analytics

and visualization on large datasets with various attributes. Cloudberry can automatically

create, maintain, and delete views by analyzing each request and its results. By utilizing

materialized views stored in the database, it can reduce the query response time remarkably.

The system has several unique capabilities:

• Visualization friendly interface: we developed a generic API for the frontend to define

and query data. The API allows the frontend to specify rich semantic information

about the dataset, functional operations (e.g., filtering, aggregations, and lookups),

query execution and result delivery requirements (e.g., ship the entire result as a whole

or as a sequence of streaming partial results), and other visualization-related parame-

ters;

• Interactivity: it is able to answer an analytical query efficiently (e.g., in sub-seconds)

by using view-materialization techniques, which allows a user to analyze and explore

data interactively;

• Real-time Analysis: by asking the query to visit both the base dataset and the already

materialized view dataset, it allows users to query the latest information as well as

historical data;

• Generality: by implementing the main logic in the middleware layer, it can connect

to different types of frontend applications. It can also connect to various database

systems with the corresponding connectors.

We focus on datasets with temporal, spatial, and textual attributes, which are common in

domains such as social media and mobile phone application usage. We make two assumptions

about the datasets.

20

Page 38: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Ingestion TimeDelay Tolerance

3:29PM 3:32PM

Data Feed

Event Time

≤ 5mins

Figure 3.2: The relation between the event time and the ingestion time.

• Append-only: Many systems, including logging or monitoring, Internet of Things (IoT),

social-media, etc., generate a huge amount of data and keep it for a long time for future

analysis. Our target dataset is append-only in the sense that there is no update or

deletion on historical records.

• No “too late” records: Often the time when a record is ingested into the database (called

the “ingestion time”) is later than the time when the event in the record occurred (called

“event time”). Fig. 3.2 shows an example of TwitterMap that is using the AsterixDB

data feed to collect the tweets from the Twitter API. The actual tweet was published

at 3:29pm (event time). Due to delay in the Twitter service and network processing,

it was not inserted into the database until 3:32pm (ingestion time). Cloudberry allows

the developer to specify a maximum delay tolerance (noted as D, e.g., 5 minutes) to

define the maximum time between the event time and the ingestion time.

In the following sections, we will provide an overview of the Cloudberry system (Section 3.2),

its request interface (Secion 3.3), and its internal implementation details (Section 3.4). We

then show several use cases of Cloudberry to demonstrate its generality (Section 3.5).

21

Page 39: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Query Rewriter

View Manager

Request Parser

Database Connector

Client Interface

JSON Request

Schema

MetaData

Cloudberry Query

JSON ResultsCreate/Update/Delete

Response

Transformed Queries

DB QueriesDB Results

Base DataSet

Views Cloudberry MetaData

Clo

udbe

rry

Dat

abas

e

DDL Request

DB Modification Queries

Interactive Analytics Application

Figure 3.3: Cloudberry architecture.

22

Page 40: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

3.2 Clouberry Overview

Fig.. 3.3 shows an overview of the main software components of Cloudberry.

The clients of Cloudberry are web applications. It provides HTTP as an interface to receive

JSON requests. After receiving a JSON request, the Request Parser translates it into an

internal Cloudberry query object. Each query object contains the dataset name, a sequence

of data transformation expressions, e.g., an array of filter conditions, a series of unnesting

and lookup parameters, group-by and aggregation functions, etc. To validate the request,

the parser asks for the schema information of the dataset from the View Manager. In case

the request contains a lookup request, the lookup dataset schema will also be retrieved from

the View Manager. If the request is valid, it will be forwarded to the Query Rewriter to

process.

The main optimization logic inside the Query Rewriter is utilizing materialized views [40] to

answer the query. Based on the available view information provided by the View Manager,

it may rewrite the given query into one or multiple queries on both the base dataset and

view dataset.

The finalized query (or series of queries) is forwarded to the underlying database via the

Database Connector that translates a query object into the corresponding database state-

ments and then runs them in the connected database. The connector will also transform the

database results of various formats into a unified JSON object.

Finally, the results are returned back to the Query Rewriter, which packages the results

directly or combines them with the previous results into one response and then returns it to

the client.

The View Manager controls the life-cycle of views. It stores the metadata of the base datasets

and their associated views. The materialized views and the meta-dataset of Cloudberry are

23

Page 41: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

also stored in the backend database. In this way, we can rely on the database to do the

computations both on views and base datasets so that the middleware is very lightweight.

3.3 API Design

The frontend application can communicate with Cloudberry using a JSON format request

via an HTTP connection. Similar to SQL, there are two types of requests, including a Data

Declaration Request that describes the data model of a dataset in a database and a Data

Querying Request that specifies how to filter and transform the records of the dataset. In

this section, we explain the details of these two request types.

3.3.1 Data Declaration Request

A Data Declaration Request defines the data model of a dataset in the underlying database

of Cloudberry. In addition to the structure of a dataset, i.e., its field names and types,

we require the application to specify additional semantic information that may be useful to

analytics and visualizations. One main requirement is that the fields of a dataset should

be classified as being either a dimension or a measurement. For example, a product name

could be a dimension, and a product price or size would be a measurement. The definition

is composed of five elements:

• dataset: it specifies the name of the dataset in the database;

• primaryKey: the name of the primary key in the dataset;

• dimension: it is a field considered as an independent variable whose data type is ordinal

(e.g., the id of a record) or categorical. It is a field that can be filtered on or grouped

by. From the perspective of visualizations, dimension fields are usually used as an

24

Page 42: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

axis of a figure. Internally, the middleware may keep statistical information on certain

dimensions for the purpose of optimized query rewriting;

• measurement: a measurement field is a dependent variable; that is, its value is a

function of one or more dimensions. It is often of a quantitative type and is stored as

numbers (integers or floats) or as complex objects like text or bag. A measurement

is usually used to apply an aggregation function, such as count(), min(), max(), or

topK(). It can also be used to filter the data, but should not be used as a “group by”

key in a request;

• timeField: the name of the timestamp field in the dataset. This field is mandatory for

declaring a more “active” dataset that new records are being inserted into. The time

field is critical to guarantee the correctness of the response if it is composed partially by

the results from a previous view. If the timeField is not declared, then Cloudberry

will treat the dataset in question as a “static” dataset that will be used for the lookup

request to augment the record from another dataset. We will introduce the lookup

request in Section 3.3.3.

We will explain the Data Declaration Request of Cloudberry by using a dataset named

tweet. Each tweet record contains a mix of date, string, and integer attributes. In

addition, it also has nested data types. For example, the hashtags field contains a bag

(unordered list) of string values; the user field itself is a structured record that contains user-

related information such as id, name, image, etc. Before being registered with Cloudberry,

the dataset should already be created and possibly ingested in the underlying database

system. As an example, Fig. 3.4 shows the definition of the tweet dataset stored in the

AsterixDB system in SQL++.

A frontend application can register the tweet dataset with the Cloudberry system by using

the JSON specification in Fig. 3.5. The fields that are more likely to be grouped on, like id

25

Page 43: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

1 create dataverse twitter;2 use twitter;3 create type typeUser if not exists as open {4 id: int64,5 name: string,6 screen_name : string,7 lang : string8 };9

10 create type typeGeoTag if not exists as open {11 stateID: int32,12 countyID: int32,13 cityID: int3214 };15

16 create type typeTweet if not exists as open{17 create_at : datetime,18 id: int64,19 "text": string,20 lang : string,21 favorite_count : int64,22 retweet_count : int64,23 hashtags : {{string }},24 user_mentions : {{ int64 }} ,25 user : typeUser,26 geo_tag: typeGeoTag27 };28

29 create dataset tweet(typeTweet) if not exists primary key id;

Figure 3.4: An example of the dataset tweet in AsterixDB

and lang, are declared as dimensions, while the fields that are less likely to be grouped on,

like text and retweet_count, are specified as measurements. The primary key is the

id field. The create_at field is the time field, indicating that there will be new records

appended to this dataset in the database.

26

Page 44: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

1 {2 "dataset":"twitter.tweet",3 "dimension":[4 {"name":"create_at", "datatype":"Time"},5 {"name":"id", "datatype":"Number"},6 {"name":"lang","datatype":"String"},7 {"name":"user.id","datatype":"Number"},8 {"name":"geo_tag.stateID","datatype":"Number"},9 {"name":"geo_tag.countyID","datatype":"Number"},

10 {"name":"geo_tag.cityID","datatype":"Number"},11 {"name":"geo","datatype":"Hierarchy","innerType":"Number",12 "levels":[13 {"level":"state","field":"geo_tag.stateID"},14 {"level":"county","field":"geo_tag.countyID"},15 {"level":"city","field":"geo_tag.cityID"}]}16 ],17 "measurement":[18 {"name":"text","datatype":"Text"},19 {"name":"user_mentions","datatype":"Bag","innerType":"Number"},20 {"name":"hashtags","datatype":"Bag","innerType":"String"},21 {"name":"favorite_count","datatype":"Number"},22 {"name":"retweet_count","datatype":"Number"},23 ],24 "primaryKey":["id"],25 "timeField":"create_at"26 }

Figure 3.5: DDL in Cloudberry to register the tweet dataset of the underlying database.

3.3.2 Data Model

The Cloudberry system can support the flat relational data model that has been widely

adopted in visualization applications. It is straightforward to use the fields that are defined

in the underlying relational database (e.g., MySQL and PostgreSQL). If the dataset is stored

in a nested format, the full path of a field is required to be specified in the DDL. For example,

AsterixDB supports the semi-structured data model. The user.id field name in Fig. 3.5

is a nested field stored at one level deeper in the tweet record. A client can use dot (“.”)

to declare the user.id field in the Cloudberry DDL. When a request needs to access the

nested field, the Cloudberry system will generate the appropriate underlying query syntax

to obtain the value of the nested field. Unlike the Druid system [95], we do not need users

to manually flatten their nested data records before the analysis can be.

27

Page 45: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Type Filter Group Aggregation

Boolean isTrue, isFalse self -Number <, <=, >, >=, ==, in, inRange self,bin count, sum, min, max, avgTime <, <=, >, >=, ==, inRange self,interval -String contains, matches, in self distinct-countText contains - -Bag contains - -

Table 3.1: Cloudberry data types and functions. “-” means unsupported

Each field needs to have a data type. The Cloudberry system supports primitive JSON types,

such as String, Number, and Boolean, as well as other frequently-used data types, such

as Datetime and Text. Compared to the String type that specifies a short attribute

(e.g., the language field lang), the Text type specifies a field with much larger content.

Users may care more about its inside strings than using the entire bag of words as a single

dimension. For example, the text field is declared to be of type Text so that the user can

filter on the text field, but they should not group on the field.

In addition, Cloudberry supports the nested datatype Bag that declares a collection of

values that have the same primitive type. For example, the hashtags field declares a bag

of words of the String type. Table 3.1 lists the supported types as well as some of their

corresponding functions. A complete list and more details can be found in the Cloudberry

documentation [26].

Cloudberry allows users to define a synthetic field consisting of the existing fields. For

example, the geo field in Fig. 3.5 declares a hierarchical field that consists of the stateID,

countyID, and cityID. By specifying the level, Cloudberry knows the hierarchical relation

between the three fields. When a client request a group-by operation on the geo field

on the “county” level, the actual query will group on both stateID and countyID fields.

In the future, the Cloudberry system could apply a roll-up function on such a field to

generate nested group-by results in a predefined hierarchical order.

28

Page 46: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Notice that the fields in a request may not correspond to all the fields of a dataset in the

backend database. If some fields are not relevant to the frontend analysis, they need not be

mentioned in the request. For example, the user.name and user.screen_name fields

in AsterixDB do not appear in the Cloudberry DDL in Fig. 3.5.

Compared to database DDL, which focuses more on the physical nature of the record’s

fields, the Cloudberry system is designed to capture a higher level of information to better

understand the semantic nature of fields and their relationships. This kind of information is

helpful to infer the relationship between previous requests and the current request, so that

the common computation between them can potentially be saved by using previously stored

query results.

3.3.3 Data Query Request

After declaring a dataset, clients can send a request to Cloudberry to query the underlying

dataset. A Cloudberry’s data-querying request is also expressed in a JSON format and

can be submitted via an HTTP POST method or via a WebSocket. Such requests specify

which dataset to query on and what records should be filtered and transformed. The request

language supports common SQL-like functionalities, e.g., filter, group, sort, and lookup. In

addition, it can also define how to deliver the query results. For instance, a query can be

sliced into mini-queries and then there will be a stream of partial results returned to the

client instead of one result (see Chapter 4). The result is also a JSON object, and contains

the filtered or aggregated values. We will describe the request format by presenting a series

of illustrative examples based on the previous declared tweet schema.

We begin the introduction with the request shown in Fig 3.6, which contains the basic

filtering and group-by specifications. The request first filters the tweets whose text field

contains “zika” and “virus”, then groups the filtered tweets by both the geo_tag.stateID

29

Page 47: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

1 {2 "dataset": "twitter.tweet",3 "filter": [{4 "field": "text",5 "relation": "contains",6 "values": [ "zika", "virus" ]7 }],8 "group": {9 "by": [

10 { "field": "geo_tag.stateID", "as": "state"},11 { "field": "create_at",12 "apply": { "name": "interval", "args": {"unit": "hour"} },13 "as": "hour"}14 ],15 "aggregate": [16 { "field": "*", "apply": { "name": "count" }, "as": "count"}17 ]18 }19 }

Figure 3.6: Get the per-state and per-hour count of tweets that contain “zika” and “virus”

and hour fields, generated by applying the interval function to the create_at field. By

using the “as” renaming operation, the result will contain the grouped keys in the “state”

and “hour” field, and the aggregated value in the “count” field.

Fig 3.7 illustrates an example of the unnest and the order operations. The unnest

operation flattens a Bag field into individual items, producing multiple records, each of

which is one of the original input records augmented with a flattened item from its original

collection. For example, after unnesting the “hashtags” field, there will be more records

produced and each record will contain a single hashtag word in the newly added “tag” field.

After that, the request groups the record on the new “tag” field and uses “count” as the

aggregation function to get the frequency of each hashtag. Lastly, the select segment

asks to sort the hashtags by its count in the descending order. The “-” mark indicates

descending order, while a field name without “-” defines ascending order. The “limit”

property specifies that only the top 10 most frequent hashtag should be returned.

Except for the aggregation result, analytics applications often have the need to examine

30

Page 48: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

1 {2 "dataset": "twitter.tweet",3 "filter": [4 { "field": "text", "relation": "contains", "values": ["zika"]}5 ],6 "unnest" : [{ "hashtags": "tag"}],7 "group": {8 "by": [{ "field": "tag" }],9 "aggregate": [

10 { "field" : "*", "apply" : { "name": "count" },"as" : "count"}11 ]12 },13 "select" : {14 "order" : [ "-count"], "limit": 10, "offset" : 015 }16 }

Figure 3.7: Get the top-10 related hashtags for tweets that mention “zika”

1 {2 "dataset": "twitter.tweet",3 "filter": [4 { "field": "text", "relation": "contains", "values": ["zika"]}5 ],6 "select" : {7 "order" : [ "-create_at"],8 "limit": 100, "offset" : 0,9 "field": ["text" ]

10 }11 }

Figure 3.8: Get 100 latest sample tweets that mention “zika”

31

Page 49: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

1 {2 "dataset": "twitter.tweet",3 "filter": [4 { "field": "text", "relation": "contains", "values": ["zika"]}5 ],6 "group": {7 "by": [{ "field": "geo_tag.stateID", "as": "state"}],8 "aggregate": [9 { "field" : "*", "apply" : { "name": "count" },"as" : "count"}

10 ]11 "lookup" : {12 "joinKey": ["state"],13 "dataset": "twitter.population",14 "lookupKey": ["stateID"],15 "select": ["population"],16 "as": ["population"]17 }18 }19 }

Figure 3.9: Lookup the population value from the “population” dataset and append it to thecurrent record. The “state” field (joinKey) in the result record will be used to join withthe population.stateID (lookupKey) from the population dataset.

the details of the interesting records that satisfy certain conditions. Fig. 3.8 illustrates a

sampling request that fetches the latest 100 tweets that mention “zika”. Users can specify

the limit and offset properties to return fewer results. In addition, one can specify

certain fields in field properties to output only the selected fields of the results.

Fig. 3.9 shows an example of specifying the lookup operation. It expresses the semantics of

a left-outer join that augments the records from the main dataset or the aggregation results

with additional fields from another dataset. The joinKey specifies the fields from the

original record; the lookupKey specifies the fields from another dataset to join with. In the

example, the request first generates the per-state aggregation result of the tweets mentioning

“zika”, then augments the aggregated result by adding a new “population” field obtained from

the matching records in the “population” dataset whose stateID is equal to the grouped

state value. This is a useful feature to combine fields from other datasets. Particularly,

TweetMap uses this lookup request to get the population of each administrative region to

support the normalization feature that shows the per-capita count of matched tweets.

32

Page 50: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

1 {2 "dataset": "twitter.tweet",3 "global": {4 "aggregate": {5 "field": "*", "apply": { "name": "count"}, "as": "count"},6 }7 "estimate" : true,8 }

Figure 3.10: An estimate request that asks for the total number of tweets and it can acceptan approximate answer.

Often in analytics applications, the most precise result may not always be needed. A frontend

application may only want to get a rough idea of some simple facts (e.g., the cardinality of the

dataset or the distinct count of a dimension) about the dataset quickly. Cloudberry supports

this kind of approximate request by declaring the “estimate” property in the JSON request.

The request in Fig. 3.10 illustrates an “estimate” example that asks for the cardinality of the

tweet dataset. Recall that the attributes of a dataset are separated into dimensions and

measurement. Cloudberry considers dimensions and measurements differently. Internally,

the system can periodically collect the metrics (e.g., min, max, cardinality, etc) of some

dimensions and store the values in a separate metadata dataset. In this way, without asking

the underlying database, Cloudberry itself can answer some simple questions if an estimation

is acceptable. Currently, we store the cardinality of the dataset and the min and max values

of the time dimension in Cloudberry’s metadata. Thus, the “count *” request in Fig. 3.10

can be estimated simply by checking the metadata at the middleware layer. In the future,

we could keep track of the metrics of all dimension fields to be able to answer more estimate

requests. If the estimate option is not specified or set to “false”, the request will be

solved by sending the rewritten queries to the database for their precise results.

Fig. 3.11 shows another distinct feature called “slicing” in Cloudberry. Analytical queries are

often expensive due to the large data size. Thus, the frontend application may need to wait

for a long time to update its user interface. To improve the interactiveness of the application,

Cloudberry supports progressive computing by sending partial results sequentially to the

33

Page 51: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

1 {2 "dataset": "twitter.tweet",3 "group": {4 "by": [{ "field": "geo_tag.stateID", "as": "state"}],5 "aggregate": [6 { "field" : "*", "apply" : { "name": "count" },"as" : "count"}7 ]8 }9 "options" : {

10 "sliceMillis": 200011 "continueMillis": 200012 }13 }

Figure 3.11: Using the option field to specify that the partial result should be deliveredevery 2,000 milliseconds and the result should be updated every 2,000 milliseconds for thenewest data.

client, rather than waiting for query completion. As a result, the frontend application can

refresh its UI at a much faster rate, and hence the interactiveness is much improved. The

example in Fig. 3.11 asks for the per-state count for the entire tweet dataset. It has an

“option” property indicating that it accepts the slicing of result and that the expected

updating rate is 2,000 milliseconds. (Currently, we only support slicing the dataset on the

time dimension from the latest data to the oldest one.) Then, instead of a single result, there

will be a stream of partial results returned in reverse-chronological order in 2,000 millisecond

intervals. Each partial result will contain the following attributes:

• result: It includes the requested aggregation results. In the case of the request in Fig.

3.11, it will be a collection of records containing the count of each state;

• percentage: It shows the percentage of progress of answering the current request. The

result value is complete once the percentage reaches 100%;

• time interval: It indicates the time interval during which the result is valid.

In the future, we can also support continuous queries in the same manner by specifying the

“continueMillis” option in the request.

34

Page 52: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

3.4 System Implementation

We have described how a frontend application can issue requests to the Cloudberry system.

Now we introduce the architecture and implementation details of the system.

3.4.1 Data Registration

Mid

dlew

are

View Manager

(name,schema,query,stats)(metadata)

Dat

abas

e

Tweet(base)

3. Write MetaData("tweet", schema, null, 2015 Oct -2017 Oct, 1Billion)

Query Rewriter

2. Collect Dataset Info

1. DDL Request

Figure 3.12: Data registration workflow

The first step of using Cloudberry is to register the dataset in the database by using the DDL

JSON request described in Section 3.3.1. Fig. 3.12 shows the registration process. When

the View Manager receives the data declaration request as described in Section 3.3.1 for

registering the twitter dataset, a batch of statistical queries will be sent to the declared

dataset. For example, the View Manager will send the query to get the range of the time

dimension, the distinct count of the user.id dimension, the cardinality of the dataset, etc.

After the statistical information is collected, the View Manager will generate a metadata

35

Page 53: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

record including:

• name: the name of the dataset in the database;

• schema: the schema information of the dataset;

• query: the query object that defines the view if the dataset is a view. Views and their

creating queries are automatically generated by the system. For the registered base

dataset it will be always empty.

• time interval: the range of the time dimension. This field is mainly used to define the

valid interval of the view, which is critical for the purpose of correct query rewriting

using a view.

• stats: the statistical information about this dataset, including “create time”, “last mod-

ify time”, “last read time”, “cardinality”, etc. This information is mainly used to main-

tain a view. It can also be used to estimate the simple query result if the client accepts

approximate answers.

The record is stored as a nested JSON record. Fig. 3.13 shows an example metadata record af-

ter registering the tweet dataset. The “schema” field stores the DDL JSON object as shown

in Fig. 3.5. The “query” field is empty because it is a base dataset. The “timeInterval”

field stores the min and max values of the time dimension at the registration time. The

“stats” field shows that the dataset was registered at “2017-10-03T12:40:12” and

that there are about one billion records in the tweet dataset.

The nested JSON format may not be fully supported in a traditional database system (e.g.,

MySQL). In such a case, we store each meta record as a string in the database and load the

metadata dataset entirely into the middleware layer. The Database Connector parses the

string record and sends the transformed JSON record to the View Manager for later use.

36

Page 54: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

1 { "name": "twitter.tweet",2 "schema": {3 "dimension":[4 {"name":"create_at", "datatype":"Time"},5 {"name":"id", "datatype":"Number"},6 {"name":"lang","datatype":"String"},7 {"name":"user.id","datatype":"Number"},8 ...9 ],

10 "measurement":[11 {"name":"text","datatype":"Text"},12 {"name":"user_mentions","datatype":"Bag","innerType":"Number"},13 {"name":"hashtags","datatype":"Bag","innerType":"String"},14 ...15 ],16 "primaryKey":["id"],17 "timeField":"create_at"18 },19 "query": null,20 "timeInterval": {21 "start": "2015-01-01T13:33:18",22 "end": "2017-10-03T12:19:49"23 },24 "stats": {25 "createTime": "2017-10-03T12:40:12",26 "lastModifyTime": "2017-10-03T12:40:12",27 "lastReadTime": "2017-10-03T12:40:12",28 "cardinality": 101082633029 }30 }

Figure 3.13: An example metadata record of the tweet dataset in Cloudberry.

Once the data about a dataset is written into the database, the dataset is ready to be queried

using Cloudberry.

3.4.2 Query Rewriting Using Views

Now we describe how views are created, utilized, and maintained in Cloudberry.

37

Page 55: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

1 {2 "dataset": "twitter.tweet",3 "filter": [4 {"field": "text", "relation": "contains","values": [ "zika", "virus" ]}5 {"field": "retweet_count", "relation": ">","values": 5}6 ],7 "global": {8 "aggregate": {9 "field": "*", "apply": { "name": "count"}, "as": "count"},

10 }11 }12 }

Request 3.1: Get the count of tweets that contain “zika” and “virus” and retweeted morethan five times.

3.4.2.1 View Creation

A view is a form of derived data as the result of applying some structural or computational

transformations to the base data [25]. It is said to be materialized if its results are stored in

the database. Studies [35, 94] show that precomputing views can significantly reduce query

response time. In Cloudberry, views are always materialized.

In Cloudberry, we focus on append-only datasets that keep ingesting new data at a fast rate,

in which case the time dimension plays an important role for correctly rewriting a query. For

this reason, we include the time range as part of a view definition to specify that the result

of a view is consistent with the result of the original query applied on the records from the

base dataset within a specific time interval.

Formally, we define a view V of a dataset S as a query Q that applies on S with an additional

time range predicate Pt. A view can be represented as V (S,Q, Pt).

Cloudberry currently supports subset views. A subset view is created by applying one filter

predicate (in addition to the time predicate) onto the base table. Since such a view keeps

every detail of the filtered records, it will be more likely to be utilized for answering new

queries.

38

Page 56: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

When the Query Rewriter receives a query, the filter predicates of the query expressions

are transformed into conjunctive normal form (CNF). The filter expression is defined as

Fq = Pt∧P1∧P2∧ . . .∧Pm, where Pt is the time condition. If it does not appear in the query,

we will use the current time (now()) as an upper bound for the time filter condition. For

example, if the query issuing time is “2017-10-10T08:03:00”, then the filter predicates

in Fig. 3.1 can be expressed as follows:

Ftweet = {create_at < 2017-10-10T08:03:00}

∧ {contains(text, “zika”)}

∧ {contains(text, “virus”)}

∧ {retweet_count > 5}.

(3.1)

When there are no views that can be utilized, the query will be run as it is. By the time

the query finishes processing, the query rewriter will suggest the views that may be useful to

speed up such queries in the future. One principle it follows is that the size of the generated

view should be as small as possible so that the future computation on the view could be

cheaper. However, as a middleware component, the rewriter may not have the statistics for

every filter condition. As a heuristic approach, we assume that a point-search predicate (e.g.

“==” comparison or contains()) is more selective than a range search predicate. Under

this assumption, we use the following rules:

1. If a point search predicate exists, only the point search predicate will be used to define

the view;

2. If there are multiple point search predicates, multiple views will be generated such that

each has a point search predicate;

3. If no point search is available, only one of the range predicates will be used to define

the view. In the case where the predicate is on a field without statistical information,

39

Page 57: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

the first predicate will be used.

As an example, in case the request in Fig. 3.1 has been processed and no views are available,

the query rewriter will suggest two views:

1. view zika with the definition predicate as Pzika = {contains(text, zika)};

2. view virus with the definition query as Pvirus = {contains(text, virus)}.

The {retweet_count > 5} condition will not be used to generate a view since there are

already two point search predicates.

The query rewriter generates the view-creation requests with the predicate Pv of the view

definition, but it does not create any views. The query will be packaged into a “createView”

request and sent to the View Manager that will conduct the actual view creation task. The

View Manager manages the lifecycle of views and their metadata information.

When the View Manager receives the view-creation query, it tries to create a view as latest as

possible to catch up with the base dataset. Ideally, the time predicate of the view definition

should be from the start time of the dataset (Ts) to the current timestamp (Tnow). The issue

is that as studied in [36, 10], the ingestion time when a record is ingested into the database

is often later than the event time when the event in the record occurred. Moreover, the

dataset usually only contains the event time. Thus, if we directly use the current time Tnow

as the view ending time to match the event time of the records, there could be some records

that have not yet arrived in the system, and hence will not be included in the view. The

view-based computation would then be incorrect. To solve this problem, Cloudberry allows

the developer to specify a maximum Delay Tolerance (noted as D) to define the maximum

time (e.g., 3 minutes) between the event time and the ingestion time to accommodate the

lag between them.

40

Page 58: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

When the View Manager receives the creation query, it will complete the view definition

by filling the time period predicate that is from the start time of the dataset (Ts) that was

collected during the dataset registration, and the timestamp of the current second minus

the delay tolerance (Tnow − D), i.e., Pt = {Ts <= timeField < Tnow − D}. Then the

view definition V (S, σPv , σPt) is complete and the data-creation query that contains two

conjunctive predicates will be sent to the underlying database to execute.

Mid

dlew

are View Manager

(tweet,schema,null, 2015 Jan...)(metadata)"zika"

(view)

Dat

abas

e

Tweet(base)

text contains("zika") andcreate_at < Oct.10 2017

3. View Stats

4. Write MetaData("zika", schema, ?, [2015 Jan 01, 2017 Oct.10), 41K)

Query Rewriter 1.Suggest View "zika"

2. Create NewDataset "zika"

Figure 3.14: Create view “zika”.

Fig. 3.14 shows the process of creating the “zika” view (the request parser and the database

connector are not shown for ease of illustration). First, the definition query σcontains(text,zika)

is sent to the View Manager. Suppose that the time when the View Manager receives the

query is 2017-10-10T08:03:00, the delay tolerance D is 3 minutes, and the starting date of the

base dataset is 2015-01-01T13:33:18 (collected when registering the tweet dataset). Then

the view definition is

V (tweet, σcontains(text,zika), σ2015-01-01T13:33:18<=create_at<2017-10-10T08:00:00).

41

Page 59: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Then the “zika” view will be generated by pulling the matching records from the tweet

dataset to the newly created zika dataset. The generated SQL++ statement is shown in

Fig. 3.15.

1 USE twitter;2 CREATE dataset zika(typeTweet) IF NOT EXISTS primary key id;3 INSERT INTO zika (4 SELECT VALUE t5 FROM twitter.tweet t6 WHERE t.`create_at` < datetime('2017-10-10T08:00:00')7 AND ftcontains(t.`text`, ['zika'], {'mode':'all'})8 )

Figure 3.15: Create the “zika” view in AsterixDB.

When the view is created, the View Manager will again issue statistical queries on the newly

generated view to collect necessary information, e.g., the cardinality of the view dataset.

Once the statistical data is returned, a new record about the “zika” view is stored in the meta-

data. Fig. 3.16 shows an example meta record about the “zika” view after it has been created.

It has the same schema as the base dataset. The “query” field stores the query with the predi-

cate Pv. Its time range is from the start date of the base dataset (2015-01-01T13:33:18)

to the end of the creating time (2017-10-10T08:00:00). The “stats” fields indicate that

the view was successfully built at 2017-10-10T08:05:35, which is the creation time, the

last update time, and also the last read time of the view. This time information is used by

the View Manager to control the lifecycle of the view. The view can be used by the query

rewriter in the future.

Multiple view creation requests will be executed sequentially since the dataset creation could

be an expensive operation due to a significant amount of disk reads from the base dataset

and also the disk writes to the view dataset. For example, the “virus” view as suggested by

the query rewriter will not be processed until the “zika” view has been built successfully.

42

Page 60: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

1 { "name": "twitter.zika",2 "schema": {...},3 "query": {4 "dataset": "twitter.ds_tweet",5 "filter": [6 { "field": "text", "relation": "contains", "values": [ "zika" ] }7 ]8 },9 "timeInterval": {

10 "start": "2015-01-01T13:33:18",11 "end": "2017-10-10T08:00:00""12 },13 "stats": {14 "createTime": "2017-10-10T08:05:35",15 "lastModifyTime": "2017-10-10T08:05:35",16 "lastReadTime": "2017-10-10T08:05:35",17 "cardinality": 4115018 }19 }

Figure 3.16: The “zika” view metadata record

3.4.2.2 View Selection

Having discussed how to create views, we now turn to their use. After receiving a data

request, the query rewriter will ask the View Manager to get all the views created on the

requested base dataset. From all the views of the requested dataset, the rewriter needs to

find a view whose definition contains the original query. This is a classic problem that has

been studied extensively in the literature [43, 67, 38]. Based on the context of the subset

views, in which case the details of the record are well kept in the view dataset, the question

we wish to answer is “can the required rows be selected using the view?”.

We will use the request example in Fig. 3.1 again to explain the view-selection process.

Suppose that request is sent to the system at 2017-10-10T08:30:00. Then the predicate

43

Page 61: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Mid

dlew

are

View Manager

("zika", ..., 41K)("virus", ..., 58K)

(metadata)"zika"(view)

Dat

abas

e

Tweet(base)

text contains("zika")

2. Query MetaData

Query Rewriter1.Ask Viewsof "tweet"

3. Get Views

4.Get Views

5.Count "zika virus" from Oct.10 2017 to Now

6.Count "virus"till Oct.10 2017

Figure 3.17: The process of rewriting the request using “zika” view.

expression of the request is as below:

Ftweet = {create_at < “2017-10-10T08:30:00"}

∧ {contains(text, “zika”)}

∧ {contains(text, “virus”)}

∧ {retweet_count > 5}.

(3.2)

Similarly, the view definition can be expressed as two predicates Fv = Pv,t ∧ Pv. The

predicates of the “zika” view and the “virus” view are given as:

44

Page 62: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Fzika = {2015-01-01T13:33:18 <= create_at < 2017-10-10T08:00:00}

∧ {contains(text, “zika”)}

and

Fvirus = {2015-01-01T13:33:18 <= create_at < 2017-10-10T08:00:00}

∧ {contains(text, “virus”)}

respectively.

The first task for the query rewriter now is to find a matching view. We say a view is

matching if it satisfies both of the following conditions:

1. the predicate of the view V.Pv covers one of the predicates Q.Pi from the query Q,

2. the time range predicate of the view V.Pt overlaps with the time predicate Q.Pt of the

original query.

The coverage test between V.Pv and Q.Pi is conducted in a semantic way so that it will

check the logical meaning of the expression rather than a simple string match. For example,

{contains(text, “zika”)} and {contains(text, “ZIKA”)} will be treated as a covered case

because text.contains is a case-insensitive keyword search. Similarly, the set predicate

{in(state.id, [23, 55]} is covered by {in(state.id, [55, 23])} since the order of values does not

matter for the “in” relation. The time overlap test is based on checking interval overlaps.

For example, given the predicate in Equation 3.2 of the request in Fig. 3.15, both views

“zika” and “virus” will be matched because both time ranges overlap with the request’s

range and both view definitions cover the query predicates {contains(tweet.text, “zika”)

and {contains(tweet.text, “virus”)}, respectively.

45

Page 63: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Once a view is matched, the rewriter will generate a new equivalent predicate expression on

the view by applying the uncovered predicates from the original query. In case the view’s

time range does not entirely cover the request range, the time predicate on the base table

query will also be updated to compensate for uncovered query results.

2017-10-1008:30:00

Query Predicate Q.Pt

Zika View Predicate V.Pt

2015-01-0113:33:18

Difference

2017-10-1008:00:00

Figure 3.18: The time relationship between the “zika” view and the query.

For instance, as illustrated in Fig. 3.18, the “zika” view can be used to answer the query

partially. To compute all the answers, the base dataset is also accessed to retrieve the newer

records between the view ending time and the query ending time. The predicate on the view

is

F ′zika = {2015-01-01T13:33:18 <= create_at < 2017-10-10T08:00:00}

∧ {contains(text, “virus”)}

∧ {retweet_count > 5}

46

Page 64: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

and the predicate on the base dataset is

F ′tweet = {2017-10-10T08:00:00 <= create_at < 2017-10-10T08:30:00}

∧ {contains(text, “zika”)}

∧ {contains(text, “virus”)}

∧ {retweet_count > 5}.

The equivalent predicate from Q.Pi will be eliminated in the rewritten filter expression on

the view. However, the time predicate is not dropped because there might be an ongoing

maintenance process on the view, which appends the records that happen after the original

view’s time predicate. For example, when the query Q was issued, there could be already

a maintenance operation to append the records that occur during the “difference” interval

after the view ending time (2017-10-10T08:00:00 as shown in Fig. 3.18). If we do not

include the ending time predicate to the query on the “zika” view, the records that happened

after 2017-10-10T08:00:00 may be included twice.

There could be multiple views that match the query, in which case we need to decide which

views are more likely to speed up query processing. Currently, we only pick one of the views

to use. We use a heuristic approach to pick the view with the smallest cardinality. In our

current example, both the “zika” view and “virus” view are matched based on the coverage

test. Since the cardinality of “zika” view (41,000) is less than that of the “virus” view

(58,000) as shown in Fig. 3.17, the “zika” view will be chosen.

At last, the records from the selected view and the base dataset will be combined, and then

the rest of the operation expressions (e.g., group by, order, lookup, etc.) will be appended

to form a new query. The newly composed query will be translated into the appropriate

database language and executed. As an example, the rewritten query using “zika” view can

be expressed in SQL++ as shown in Fig. 3.19.

47

Page 65: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

1 SELECT count(u) AS count FROM (2 SELECT * FROM twitter.zika t3 WHERE t.`create_at` >= datetime('2015-01-01T13:33:18')4 AND t.`create_at` < datetime('2017-10-10T08:00:00')5 AND ftcontains(t.`text`, ['virus'])6 AND t.`retweet_count` > 57 UNION ALL8 SELECT * FROM twitter.tweet t9 WHERE t.`create_at` >= datetime('2017-10-10T08:00:00')

10 AND t.`create_at` < datetime('2017-10-10T08:30:00')11 AND ftcontains(t.`text`, ['zika'])12 AND ftcontains(t.`text`, ['virus'])13 AND t.`retweet_count` > 514 ) AS u;

Figure 3.19: The SQL++ query to combine the records from the “zika” view and the basedataset “tweet”.

The size of the “zika” view (41,000) is significantly smaller compared to the size of the base

tweet dataset (one billion). Thus, the time spent on scanning on view “zika” is much less

than querying on the base dataset. On the other hand, the time range predicate applied on

the base tweet dataset is only asking for records from the last eight hours. Compared to

the entire range of two years, it is again a small portion of data to process, if there are some

storage optimizations based on the time dimension. For example, there could be an index

on the time dimension to speed up the data access. It is also not uncommon for a database

system to use the time dimension to partition the dataset so that visiting a certain small

range of the data will be even faster than using an index. Particularly, AsterixDB uses filters

to accelerate data access by skipping many unrelated LSM components [15]. Moreover, a

database system often optimizes the access speed on the “hot” (recent) data, e.g., keeping

the most recent data in memory. Thus, the total time that is spent on the base data for a

small portion of the latest data on the base dataset should be small. In this way, by routing

the data access path from the large base dataset to a small query on a materialized view

plus a small residual base query, we can significantly reduce the total query time.

48

Page 66: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

3.4.3 View Maintenance

After a view has been created, we need to keep it up to date with changes to the base

data (a.k.a view maintenance) to keep the content consistent with its definition. Cloudberry

focuses on the domain where the dataset is append-only (no updates and deletes on the

historical records) and new records are continually inserted. Though there will not be any

correctness issue if a view is not updated, it is desirable in order to cover a larger range of

the base dataset so that more pre-computed results could be used to answer future queries.

3.4.3.1 Incremental Updates of Views

We employ a periodic incremental view maintenance strategy (e.g., every 1 hour) [41] so

that the views are updated gradually over time. A naive implementation, for example, is to

trigger an update process every 1 hour, which retrieves the records that satisfy the predicate

from the base table between the last update time to the current system time Tnow and stores

them in the view. As we described in Section 3.4.2.1, however, we also need to consider the

maximum Delay Tolerance (e.g., 3 minutes) between the event time and the ingestion time

to accommodate the lag between them.

new time range Ri+1

View time range Ri

Ei 2017-10-10

08:57

Si2015-01-01

13:33

Ui+1

10:00

incremental range ∆R

delay tolerance D

Ui

09:00

update interval

Ei+1

09:57

Figure 3.20: Update the view based on its time range.

Fig. 3.20 illustrates the timestamps that are used to append a view. Recall that each view

49

Page 67: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

contains the “stats” information about when it has been created or updated as described in

Section 3.4.2.1. Take the “zika” view as an example. Based on the metadata in Fig. 3.16, the

view ending time (Ei) is 2017-10-10-08:00:00, meaning that all of the “zika” related

records that happened before that time have been stored in the view. The last updating

time (Ui) is at 09:00am. Assuming the update interval is 1 hour, by the time 10:00am

(Ui+1), the View Manager is reminded to update the “zika” view. If the delay tolerance is 3

minutes, then the ending time of the updated view is 09:57am (Ei+1, 3 minutes to Ui+1).

The View Manager then applies the range predicate 08:57am <= create_at < 09:57am

(∆R) together with the filter predicates of the view definition {contains(text, “zika”)} to

the base dataset to incrementally update the view. The corresponding SQL++ request is

shown in Fig. 3.21.

1 INSERT INTO twitter.zika (2 SELECT value t3 FROM twitter.tweet t4 WHERE t.`create_at` >= datetime('2017-10-10T08:57:00')5 and t.`create_at` < datetime('2017-10-10T09:57:00')6 and ftcontains(t.`text`, ['zika'], {'mode':'all'})7 )

Figure 3.21: The SQL++ example of appending more records from the base dataset tweetto the view “zika”.

Once the update request has finished, the View Manager collects the new cardinality of

the view and updates the meta record of the view by changing the time range as well

as the statistical information including the last update time and cardinality of the view.

Fig. 3.22 shows an example of the updated view meta record after the maintenance process

was finished.

3.4.3.2 View Deletion

The advantage of having a subset view, compared to an aggregation view, is that it keeps

the finest details of the original records from the base dataset. As a result, it allows more

50

Page 68: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

1 { "name": "twitter.zika",2 "schema": {...},3 "query": {4 "dataset": "twitter.ds_tweet",5 "filter": [6 { "field": "text", "relation": "contains", "values": [ "zika" ] }7 ]8 },9 "timeInterval": {

10 "start": "2015-01-01T13:33:18",11 "end": "2017-10-10T09:57:00"12 },13 "stats": {14 "createTime": "2017-10-10T08:05:35",15 "lastModifyTime": "2017-10-10T10:01:00",16 "lastReadTime": "2017-10-10T01:03:35",17 "cardinality": 4115418 }19 }

Figure 3.22: The “zika” view’s metadata record after the update finished at 2017-10-10T10:01:00.

operations (e.g., group, lookup, sort, project) on the view data, which can increase the chance

of utilizing a view. On the other hand, the disadvantage of keeping the complete copy is that

more resources will be needed and the update cost will be increased proportionally with the

number of views.

In Cloudberry, we tackle this problem by removing less-frequently used views. Recall that

each view meta record has a “last read time” property. Each time a view gets used, the “last

read time” in the meta record will be updated. The View Manager will check all the views

periodically to remove the views that have not been actively used for a certain amount of

time (e.g., 24 hours).

3.4.4 Concurrency Management

Cloudberry follows the actor model [5] to handle concurrency. The actor model provides

a high-level abstraction for writing concurrent and distributed systems. An actor is a con-

51

Page 69: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

ceptual entity to deal with computation within its context thread. Actors are completely

isolated from each other, and they will never share memory. Actors communicate with each

other by sending asynchronous messages. Those messages are stored in actors’ mailboxes

until they are processed. We use the Akka [11] toolkit to implement the actor model for

Cloudberry. The library natively supports mailbox and message passing. Thus, it alleviates

efforts of dealing with explicit locking and thread management, making it easier to implement

concurrent and parallel systems.

Dat

abas

e

("zika", ..., 41K)("virus", ..., 58K)

(metadata)

"zika"(view)

Tweet(base)

Cached MetaData

Actor for View Manager

Actor for base dataset

StatsQueue

Client Actor A

Query Rewriter

Client Actor B

Actor formetadata

StatsQueue

Actor for zika view

StatsQueue

actor message

read query

read/writequery

Request Parser

Query Rewriter

Request Parser

Figure 3.23: Actor architecture of Cloudberry.

Fig. 3.23 shows how actors are organized in Cloudberry. Internally, each actor has a queue

to store its incoming messages. Each HTTP connection is handled by a Client Actor. The

actor has its own Request Parser and Query Rewriter to serve its queries independently.

There is only one View Manager Actor in the system that provides the functionality of the

view maintenance. It also has an in-memory copy of the metadata to speed up the frequent

metadata search requests. Each dataset in the database is assigned to one specific actor

52

Page 70: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

to control the query and maintain against the dataset. The dataset actor can read other

datasets, but it can only write to its assigned one. The actor assigned to the base data is

unique in the sense that it can only read but not write the base dataset.

A client actor asks for the data actor through the view router; then the client will interact

with the data actor directly to send the query. Different client actors may interact with the

same data actor. In this way, the data actor can precisely know the statistical information,

such as how many times different clients have used the view. The read requests sent to

the data actor will be issued asynchronously, while the write requests (e.g., updating view,

updating metadata) will be run sequentially so that multiple maintenance processes will not

interleave with each other. Whenever a data actor needs to write to another dataset, for

example, if the “zika” view wants to update its time range, the request will be sent to the

centralized router to reconcile the multiple writes.

Another interesting aspect of the actor model is that actors can live remotely on different

machines. An actor is capable of sending a message to a remote actor that lives on another

machine. Thus, this architecture can easily scale up to multiple machines, which is an

important step towards a scalable middleware.

3.5 Use Cases

3.5.1 Paraview Web

The Army Research Lab [18] has developed a high-resolution data visualization frame-

work [79] on top of the ParaViewWeb [66] supporting both scientific and non-scientific data

visualization within a single framework. Fig. 3.24 shows their data visualization application

running on a high-resolution tiled display within the SAGE2 [59] framework. It shows an

53

Page 71: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Figure 3.24: The large display visualization system at the Army Research Lab.

example of using Paraview web to visualize one billion tweets on a map interface similar to

TwitterMap. Additionally, their interface can show temporal patterns for the selected re-

gions for a more in-depth analysis. Cloudberry is used as a middleware layer between a data

management system and the visualization side of the ecology. Based on their experience,

it is simple for the frontend application to interface with Cloudberry. The frontend passes

a JSON request to Cloudberry and listens for messages and responses. As the Cloudberry

middleware supports query optimization, there is no more complexity on the client side as

compared to sending a query directly to a database. Therefore, the frontend can focus on

the visualization tasks without worrying about performance issues.

3.5.2 Twitter Analytics

The TwitterMap application is used by the UCI public health department to conduct research

on social media’s reactions to certain public health concerns. [47] studied the content of

climate change in Twitter messages before, during, and after the November 8th, 2016 US

54

Page 72: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Presidential election. TwitterMap was used to generate the by-time trend of the related

topics. [60] employed TwitterMap to investigate the feasibility of using Twitter data for

Zika virus epidemic tracking and forecasting at a national and state (Florida) level. They

demonstrated the value of utilizing Twitter data for the purposes of disease surveillance

and early disease signaling. TwitterMap was used to quickly generate geographic and time

patterns of the “zika” related tweets to compare with the official Zika cases reported by U.S.

Centers for Disease Control and Prevention (CDC).

3.5.3 Connecting to Different Databases

Cloudberry is a general-purpose middleware system that can not only allow multiple different

applications to request various datasets, but also connect to different databases.

We have implemented an AsterixDB connector to communicate with an AsterixDB cluster

by using SQL++ statements. We also implemented a MySQL connector and a PostgreSQL

connector to interact with a MySQL database or a PostgreSQL database by using their SQL

dialects. With minimal schema changes (due to their flattened relational model), one can

successfully build the TwitterMap application with the data stored in MySQL or PostgreSQL

database separately. More details can be found on the Cloudberry Wiki page [27].

3.6 Comparison with Related Work

Analytics and Visualization Systems

There have been many research studies on data analytics and visualization as surveyed in [37].

Frontend visualization solutions, like Superset [80] and Polaris [78], can generate precise sets

of relational queries from the visualization specification in the user interface. Such solutions

55

Page 73: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

could be clients of Cloudberry that could be used to quickly retrieve their required data.

There are in-memory visualization solutions, like Nanocubes [55], ImMens [56] , that rely on

in-memory data structures to reduce query processing time. These solutions would not scale

easily if the data size is larger than memory.

There are specialized storage systems supporting interactive analytics on big data. Druid [95]

is a high-performance, column-oriented, distributed data store. Dremel [61] combines multi-

level execution trees and a columnar data layout to store read-only nested data, and it

can support interactive aggregation queries over trillion-row tables. MapD [58] provides an

analytical solution by relying on hardware GPU support. Our solution is extensible in the

sense that by building the corresponding database connectors, Cloudberry can sit on top

of these systems to explore the shared context between queries to further reduce the query

time, which could then save computation resources.

Some visualization systems use pre-processing to reduce their query latency in the case where

the workload is known in a-priori. Battle et al [20] developed middleware based on a memory

cache to serve data tiles from a backend database based on users’ recent requests. Hippo [96]

pre-partitions the data on a pre-defined attribute and builds a lightweight index structure to

skip the irrelevant pages if they do not contain the queried partitions. Since their results are

calculated beforehand, this approach may not be feasible if each query is not known a-priori.

XmdvTool [72] provides speculative prefetching strategies where the interaction options in

the visualization frontend are limited.

There are also domain-specialized visualization systems. KITE [57] is an end-to-end system

capable of managing microblog data at a large scale. HadoopViz [33] uses a MapReduce-

based framework to generate high-quality geographic images with giga-pixel resolution for

spatial data. Our work targets more general domains and more diverse workload.

There are also end-to-end systems, such as Tableau [83] and DVMS [90], that are designed

56

Page 74: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

to explore cross-layer optimization opportunities from the bottom data management layer

to the top visualization layer. Since the outputs of the system are visualization figures, the

solution may not be easily integrated with other existing frontend user interfaces. Besides,

this kind of systems need to store another copy of the data in their own managed space.

From the developer’s perspective, this additional overhead to maintain an entire copy of the

dataset just for visualization may not be acceptable.

BlinkDB [4] and INCVISAGE [68] answer queries interactively by delivering approximate

results using sampling techniques. The approximate answering technique could also be used

in our system.

Materialized Views

Materialized views have been studied intensively in the literature [40, 2, 41, 43, 38, 67]. Given

a set of predefined views, the problems are divided into two categories: 1) maintaining mate-

rialized views efficiently when the base dataset changes, and 2) answering queries using views

effectively to improve performance and availability. Given a general-purpose database envi-

ronment, the maintenance workload must consider deletions and updates, which greatly in-

crease the overhead of maintenance compared to the unique scenario of append-only datasets.

Answering queries using views to find the least expensive plan is computationally hard even

for restricted query classes [2]. The work in [43, 38] has explored various rewriting techniques

that can apply to both select-project-join view and aggregation views. These techniques are

being used in our work to implement materialized views in the middleware layer instead of

in the database layer. In this way the visualization frontend can connect to the dataset in

a database system that does not have its own support for materialized views. Moreover,

Cloudberry’s views are automatically managed by the middleware layer based on the query

logic. The visualization frontend then does not need to handle query performance issues.

57

Page 75: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Data Cube

Data Cube [39] is another pre-computing technique to generate aggregation results for pre-

defined multidimensional groups. There are many studies of efficiently computing data

cubes [3, 44, 71] and organizing them for query answering [44, 75, 75]. There are also

systems [52, 77] that generate a data cube for a large dataset stored in Hadoop or HBase.

Different from the materialized view approaches, data cube systems usually enumerate all

combinations of every dimension value so that users can quickly slice and dice on the specific

range of dimensions to get the pre-calculated results. However, the exploration path is largely

limited by the pre-defined dimensions and aggregation functions during cube construction.

For example, this approach cannot support slicing on measurement fields. Especially for the

case of keyword matching, it is infeasible to enumerate all keywords in a text field to build

the cube. Our system adopts the notion of dimension and measurement only for the purpose

of supporting better query rewriting. The views in Cloudberry can be built by applying any

filter conditions, either on dimensions or measurements.

Semantic Caching

Semantic caching [30, 16] studies how to reuse subsets of input tables that are stored in a

client-side cache. Each entry is modeled semantically as a set of queries, rather than at the

physical level as a set of data pages or tuples. When a query is submitted to the client and

can be answered (partially) using the cache, only a “remainder query” will be sent to the

server to fetch the residual results. The technique needs to split the result on an attribute

with bounded value space, thus its usage is limited and more suitable for a predefined query

workload on the client side. We have implemented this technique in the TwitterMap frontend

in [50] to cache the geographic aggregation results in order to skip unnecessary Cloudberry

requests. Our work in the middleware layer focuses on using much richer context information

58

Page 76: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

between requests to speed up a broader range of requests.

Automated Physical Design

Automated tuning [24] is a rich field including topics such as partitioning [65, 70], index

selection [12, 73], and materialized view selection [6, 7, 19, 31]. Adaptive index selection

creates and drops indexes on-the-fly [12, 73]. Adaptive materialized view selection [6, 7,

19, 31] shares the same philosophy by selecting the most promising views based on the

observed queries. Most of the index and materialized view selection techniques use a DBMS

optimizer’s cost model to evaluate the benefits of an index or view, and they have to be

implemented inside a database. Cloudberry implemented the automatic view maintenance

in the middleware layer, which may save the frontend development efforts for managing the

views if the application connects to a database systems without the automated physical

design support.

3.7 Conclusions

In this chapter, we have presented the design, implementation, and use cases of Cloudberry,

a general-purpose middleware system specially designed for interactive analytic and visual-

ization applications for large-scale data. It can reduce the query response time remarkably

by utilizing materialized views stored in the database. By using the example TwitterMap,

we demonstrated that it is a suitable solution to support interactive data analytics and visu-

alization on one billion tweets. By using an example Paraview web application and building

connectors to different databases, we have demonstrated that Cloudberry is general-purpose

and can be used by various frontend applications on different backend database systems. We

have open-sourced the Cloudberry system (http://cloudberry.ics.uci.edu) since 2015, and we

59

Page 77: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

invite others to download and try the system.

60

Page 78: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Chapter 4

Drum: A Rhythmic Approach to

Interactive Analytics on Large Data

4.1 Introduction

Data exploration and analytics are becoming increasingly important in applications to help

users gain insights from their data and make time-critical decisions. Data analysis can

become even more powerful and desirable by being interactive, so that users can see the

results quickly, ideally in sub-seconds latency after submitting a request. At the same time,

achieving such a user experience is technically challenging on large data sets due to the high

computational cost.

One way to solve the problem is to materialize earlier query results as views, with which

we can answer future related queries, as discussed earlier in Chapter 3. This view-based

approach does not solve the problem completely, since there can always be queries that

cannot be answered using materialized views, thus resulting in a high latency.

61

Page 79: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

As a motivating scenario, suppose the first request that the TwitterMap application sends to

Cloudberry is “show the number of tweets mentioning the keyword zika per state”. Since

there are no views available, Cloudberry has to send the query to the database as it is:

Q: SELECT state, COUNT(*)

FROM twitter t

WHERE ftcontains(t.text, "zika")

GROUP BY state;

The predicate ftcontains(message, “zika”) checks if the textual message value con-

tains the keyword zika. Due to the large number of records in the relation, the query Q

may take a long time to finish, and the user has to wait during this period. One way to solve

this long-waiting-time problem is to issue a sequence of cheaper “mini-queries” by adding

a range predicate on the create_at attribute so that each of them can be answered by

the database system efficiently. Then Cloudberry can merge the results as needed and send

them back to TwitterMap progressively. Finally, the frontend user interface can be updated

more quickly, which in turn leads to a more responsive user experience as demonstrated in

Section 2.5.4.

The following is an example mini-query:

Qm: SELECT state, COUNT(*)

FROM twitter t

WHERE ftcontains(t.text, "zika") AND

2016-01-01 <= t.create_at AND t.create_at < 2016-03-01

GROUP BY state;

One naive solution is to divide the space of the create_at attribute into multiple fixed-

length intervals and generate a mini-query for each of them. For instance, we could divide

62

Page 80: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14

Figure 4.1: Distribution of “Zika” tweets collected over 1.5 years from November 2015 toMay 2017.

the attribute into multiple months and generate a mini-query for each month. While this

slicing method is easy to implement, it has two major limitations. First, the interval is

difficult to decide, especially for queries with different query times. For instance, a tweet

query for a popular keyword such as water can take a much longer time than a query with

a rare keyword such as authoritarianism. Second, the user experience can be very poor

due to the large variance of running times of the min-queries, mainly due to two factors.

1. The data distribution along the slicing attribute can be skewed. Fig. 4.1 shows the

distribution of the number of tweets mentioning the keyword zika, which has a large

variance. There was a peak in the summer of 2016 due to the Olympic Games in Rio

de Janeiro, Brazil and to the global concern about this potential epidemic, but public

attention quickly diminished after that. If we divide the time range equally into 14

months (as shown in the figure), the mini-queries will have different running times

because they need to access different amounts of data.

2. The performance of the backend database system can fluctuate over time, especially

when there are multiple queries running simultaneously. As a result, some of the mini-

queries can be fast, and others can be slow, causing the incremental results to be sent

at an irregular pace.

In this chapter we study how to progressively answer a time-consuming query on a large data

63

Page 81: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

set by generating a sequence of mini-queries. We focus on how to deliver the results of mini-

queries smoothly by following an “expected rhythm” for the user so that the user sees regular

updates of the incremental results. We address two main challenges, as discussed above:

(1) The data distribution of the slicing attribute is hard to model using offline statistics,

especially when the query can have various selection predicates. In the example, the user

can type in arbitrary keywords, and it would be difficult to know a-priori the distributions

for the keywords. (2) The performance of the database system can be very dynamic and thus

hard to model offline due to other running queries that compete for the limited computing

resources. We make the following contributions here:

• We formulate an optimization problem to generate the predicates of mini-queries by

considering both their total running time as well as the smoothness of result deliv-

ery to deliver incremental results at a rhythmic pace to improve the user experience

(Section 4.2).

• We develop an adaptive framework, called “Drum”1, in which we collect run-time be-

havioral statistics of the mini-queries and the backend database system. It includes a

regression function for the relationship between the predicate and the running time of a

mini-query and an uncertainty model for the estimation error of the regression function.

It also includes a greedy algorithm that can automatically use the observed performance

information to generate the predicate for the next mini-query (Section 4.3).

• We develop a technique that considers both the total running time and the penalty

of missing the next expected milestone to deliver the incremental results. Based on

a rigorous analysis of the benefit and cost of varying the predicate, this technique

can select a predicate that can maximize the expected gain of the next mini-query

(Section 4.4).1Drum stands for “Data Retrieval Using Milestones.”

64

Page 82: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

• We have conducted an extensive experimental study on a real, large data set of social

media tweets to evaluate Drum and these techniques and to demonstrate their efficiency

(Section 4.5).

4.2 Problem Formulation

4.2.1 Architecture and Query Slicing

We consider a widely-used multi-tier Web architecture that includes a backend database

system, a Cloudberry middleware server, and a frontend application, such as the web service

as described in Section 2.1. The frontend application sends requests to Cloudberry, which

in turn generates queries to the backend system to retrieve results and sends a response to

the frontend.

Consider a dataset R that contains a set of fields in the backend database. The middleware

posts a query Q to the dataset. Due to a large number of records in R, the query Q can

be computationally expensive, and the user needs to wait for a long time to see the results.

We assume that the dataset R has an attribute ρ, called a slicing attribute, using which

the middleware can generate a sequence of mini-queries Q1, Q2, . . ., and send them to the

backend. Each mini-query Qi is Q with an additional filtering condition on the slicing

attribute ρ. After receiving the results of each mini-query Qi, the middleware uses these

results to update the frontend interface. We make the following two assumptions:

1. Each mini-query is much faster than the original query Q, e.g., due to the smaller

amount of data accessed by Q and an available index on this attribute.

2. The results of these mini-queries can be incrementally combined to compute the results

of the query Q.

65

Page 83: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Section 4.1 showed an original tweet query Q and a mini-query Qm on this dataset. Notice

that after receiving the results of each mini-query, the frontend can decide how to combine

them with earlier results and update the interface. It can either show the results as they

are or it could show estimated results for the whole data set based on a statistical model.

Our work mainly focuses on the timing of the generated mini-queries, not on the way their

results are displayed. We also focus on the case where the ρ is a numerical attribute.

4.2.2 Slicing Schedules and User Satisfaction

There can be many ways to slice a query into mini-queries. Fig. 4.2 shows three ways to

answer our example query Q. (We call each of them a slicing schedule, or just schedule for

short.) The first schedule S1 uses a single query (Q) that takes 6 seconds to finish. The

second schedule S2 uses 4 mini-queries that take 2 seconds each, with a total time of 8

seconds. The third schedule S3 also uses 4 mini-queries, but they take 1, 3, 1, and 3 seconds,

respectively.

Schedule S1

Q

seconds 2 40 6 81 3 5 7 9

Schedule S3

Q1 Q2 Q3 Q4

Schedule S2

seconds 2 40 6 81 3 5 7 9

seconds 2 40 6 81 3 5 7 9

Q1 Q2 Q3 Q4

Figure 4.2: Three query-slicing schedules, where each thick line indicates a delay for anexpected 2-second pace.

66

Page 84: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

When deciding a good strategy to slice a query, we need to consider how the corresponding

mini-query schedule affects the user experience. In the three schedules above, schedule S1

takes the least amount of total time (6 seconds), but the user cannot get any intermediate

results before that. Schedule S2 takes a longer time (8 seconds) to finish, but it gives the

user an earlier response and updates the interface periodically every 2 seconds. Schedule S3

takes the same total amount of time as S2, but its update frequency is not regular. From the

user’s perspective, the way the interface is updated in S3 is not very “smooth”, as the next

time the new results are shown is not predictable. Schedule S2 updates the interface with a

regular pace and has a better predictability and rhythm in terms of the next time the user

expects the interface to be refreshed, so it provides a better user experience.

4.2.3 Schedule Quality

The example above suggests the following factors that affect the user experience in the way

mini-query results are delivered:

1. Total running time: For a frontend request, we want to minimize the total time of

these mini-queries to compute the complete results.

2. Smoothness of result delivery: Users want the frontend to be updated at a regular

pace so that the next time of seeing new results is predictable.

Based on this analysis, we define the cost of a schedule S as follows:

Cost(S) = Costt(S) + Costm(S). (4.1)

In the formula, Costt(S), measures the overall time cost of the mini-queries, which can be

quantified by their total running time. Costm(S), where “m” means “smoothness,” measures

67

Page 85: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Qi-1

Qi

Ui-1 Ui

P DiQi-1 finished Qi-1 results delivered

Qi started Qi finished

Li

Figure 4.3: Missing a delivery deadline.

the smoothness of result delivery. To quantify this cost, we assume a given desired pace

parameter P (an interval time), and the user expects an update of the frontend P seconds

after the previous update. Every time the middleware does not update the interface after

this time P , this schedule needs to pay a “missing-the-deadline” penalty. The smoothness

cost can be quantified as:

Costm(S) = α∑i

Di. (4.2)

In the formula, α is a weight used to quantify the penalty of missing the next deadline. Di

is the delay time of mini-query Qi based on the pace P , defined as:

Di = max(0, Ui − (Ui−1 + P )), (4.3)

where Ui and Ui−1 are the update times when the actual results of mini-queries Qi and Qi−1

are delivered to the frontend, respectively. The general idea is shown in Fig. 4.3. Notice

that the middleware starts the new mini-query Qi right after the previous mini-query Qi−1

finishes, but the middleware can choose to wait for some time before delivering the results to

the frontend in order to provide a smooth experience, since the user is expecting an update

at time Ui−1 based on the pace P . (Our developed results are orthogonal to whether or not

the middleware decides to wait on the mini-query results before the next milestone).

68

Page 86: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Table 4.1 shows the formula costs for the three schedules in Fig. 4.2 when the pace P = 2

seconds. Take schedule S3 as an example. It finished in 8 seconds, so Costt(S3) = 8 seconds.

Its Q1 and Q3 did not miss their deadlines, while Q2 and Q4 missed their deadlines by 1

second each, so its smoothness cost is Costm(S3) = α ∗ 2 seconds. If α = 2, for example, the

total cost of S3 is 12 seconds. In general, the lower the cost a schedule has, the better a user

experience it provides.

Schedule Cost Cost (α = 2)S1 6 + 4α 14S2 8 8S3 8 + 2α 12

Table 4.1: The costs of the schedules of Fig. 4.2

Next we study the following problem: given a query Q, generate a sequence of mini-queries

whose execution schedule has a minimal cost.

4.3 Drum: An Adaptive Framework for Generating Mini-

Queries

In this section, we present an adaptive middleware-based framework, called Drum, to dynam-

ically decide a condition on the slicing attribute for the next mini-query based on statistics

collected from earlier mini-queries. A main advantage of this framework is that it does not

require any apriori knowledge about the performance characteristics of the backend database

system, and it can be adaptive with respect to skewed data distributions and performance

fluctuations of the backend.

Fig. 4.4 shows the framework. Each incoming request from the frontend is submitted to

the mini-query generator. The generator utilizes estimation information provided by the

69

Page 87: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Mini-query Generator

Mini-Query Executor

Estimator

RequestNext Mini-query Qi Results

Run-time Statistics

Estimation

Regression Function Uncertainty

Figure 4.4: Drum framework for adaptive query slicing.

estimator module to choose a range of size ri of the slicing attribute for the next mini-

query Qi. It then generates Qi and sends it to the backend database to execute. After the

intermediate results return, the middleware sends updated results to the frontend, possibly

by combining them with earlier results. Meanwhile, the run-time statistics of executing the

mini-queries are collected and stored in the estimator module. The estimator utilizes the

collected information to help the generator create the next mini-query. Next we give the

details of the framework.

4.3.1 Regression Function

The regression function in the estimator is used to capture the relationship between the slicing

predicate range size and the corresponding mini-query running time. As more statistics

are collected, the regression function can become more accurate. Since each mini-query is

obtained by adding a predicate on the slicing attribute, we need to know how this predicate

affects the execution time of the mini-query. As an example, if the mini-query time is closely

related to the number of records satisfying the predicate, we can use a linear function between

a predicate range size ri and the running time f(ri) for the next mini-query Qi:

f(ri) = a1ri + a0. (4.4)

70

Page 88: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

0

0.5

1

1.5

2

2.5

3

3.5

5 10 15 20 25 30

Runnin

g T

ime (

seconds)

Predicate Range (days)

f(r)=0.08r+0.9

Figure 4.5: Linear regression for tweet mini-queries with different predicates on thecreate_at attribute. (110 million tweets, keyword=zika)

Notice that the values a1 and a0 in the regression can be request-dependent, so we build a

linear regression for each request. That is, we would have one linear function for a Zika

tweet query and another for a Trump query. We can use a standard curve-fitting algorithm

with the least squares fitting method to train the linear models on the collected pairs of

range size and running time. Each newly added observation from running a mini-query can

update the linear model. Note that due to data skew and fluctuation of system performance,

the relationship between the predicate range size and running time can also change over

time. To enable the regression function to quickly adapt to such changes, we can adjust the

length of the history to train the model. As an illustration, Fig 4.5 shows an actual linear

model learned from a collection of pairs of range size ri and running time ti:

4.3.2 Uncertainty Model

The real running time ti for a mini-query Qi will differ from the predicted time f(ri) from the

regression function. The uncertainty model in Drum is used to measure the distribution of

71

Page 89: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

0

2

4

6

8

10

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Fre

qu

en

cy o

f e

rro

rs (

His

tog

ram

)

PD

F v

alu

e (

Ga

ussia

n)

Error ti-f(ri) (seconds)

Figure 4.6: The Histogram and Gaussian models. The errors are collected from the obser-vation ti − f(ri) in Fig. 4.5.

such per-query errors for use in selecting ri. We consider two types of models: (1) Histogram:

We split the space of the errors into equal-size bins and maintain a frequency for each bin.

(2) Gaussian function: We use a function to model the error distribution. For example,

we can assume errors follow the Gaussian distribution function N(0, σ). Fig. 4.6 shows

both the histogram distribution and Gaussian distribution for the errors ti − f(ri) collected

in Fig. 4.5. For the same collection of observations, using different models can produce

different probability values and thus lead to different mini-query choices.

In reality, the error distribution may depend on the target running time f(ri). Intuitively,

the larger the time f(ri), the harder it may be to decide a predicate range ri to achieve the

desired time. In addition, the middleware has a limited number of performance data points

collected during the execution of past mini-queries. For simplicity, our model will assume

that the error distribution is independent of the target time f(ri). That is, if the error

probability distribution function (PDF) is PDF (ti − f(ri)), then for a specific given f(ri),

the distribution of ti is PDF(ti|f(ri)

)= PDF

(ti − f(ri)

). For instance, when using the

72

Page 90: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Gaussian distribution to model the error distribution as N(0, σ), we assume the real running

time ti follows the distribution of N(f(ri), σ), where f(ri) is the target running time and σ

is the standard derivation of all the observations.

4.3.3 Tradeoff of Running Time and Penalty

When deciding the predicate range ri for the next mini-query Qi, if we choose a large range,

the total running time can be reduced, since the middleware will send fewer mini-queries

to the database. At the same time, a mini-query with a large range ri will have a longer

execution time, which can cause a large penalty of missing the next deadline. To consider

the tradeoff between the two factors, we define the following scoring function for mini-query

Qi:

score(Qi) =

riI

if ti <= Li

riI− α ti−Li

CiLiotherwise.

(4.5)

In the function,

• I is the entire interval range of Qi’s slicing attribute;

• ri is the predicate range on the slicing attribute of Qi;

• riIquantifies the progress that Qi will contribute to the overall task;

• Li is the time limit (i.e., time to deadline) for the next mini-query Qi. As shown in

Fig. 4.3, the limit Li can be bigger than the pace P if the previous query Qi−1 completes

before its own deadline;

• ti is the real running time of query Qi;

• α is the constant weight in the penalty function in Equation 4.2.

73

Page 91: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

• Ci is the estimated total number of mini-queries needed. It can be estimated from

the number of mini-queries issued so far and the summation of the relative progress

that has been made. CiLi is thus a current estimate of the total running time, making

the penalty be on the same scale as the progress riI. Notice that Ci is specific to the

mini-query Qi, but in our analysis step for deciding range ri, it is a constant.

Using the linear function in Fig. 4.5 and the error PDF from Fig. 4.6, Fig. 4.7 shows the

score values for two ri values for Qi, where we have Li = 2.2 seconds from the time Qi−1 was

finished until the time we need to have the results of Qi in order to update the frontend.

Fig. 4.7(a) shows the score for a strategy A1 that picks a range ri = 16 days for a target

running time of f(ri) = 2.2 seconds. When the real time ti is less than Li, the score is equal

to the progress. As the real time ti becomes larger than Li, the score constantly decreases

since we have to pay a penalty for missing the Li deadline. Fig. 4.7(b) shows the PDF of

the running time ti that is of the same shape as shown in Fig. 4.6, but the mean of the

distribution is f(ri) = 2.2 instead of 0.

Fig. 4.7(c) shows the score function for another strategy A2 that picks a range r′i = 12 days

and the target running time f(r′i) = 1.85 seconds, which is more conservative by using a

value smaller than Li. Similar to the previous case, when the real time ti is less than Li (2.2

seconds), the score is a constant (which is smaller than the value when we had f(ri) = 2.2).

When the real time ti is larger than Li, the score value constantly decreases. Fig. 4.7(d)

shows the PDF of the running time ti when r′i is 12 days. As a consequence, compared to the

previous f(ri), the PDF “shifts” to the left, providing a larger probability of having results to

deliver to the user by the Li deadline. When comparing these two strategies, we notice that

strategy A1 is more aggressive by choosing a target running time f(ri) = 2.2s that is equal

to the time limit Li, while A2 is more conservative by using a smaller time f(r′i) = 1.85s

(with a smaller slicing range r′i). As a consequence, A2 has a lower progress r′i/I, but it also

has a lower probability of missing the deadline.

74

Page 92: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Score(Qi)

t

PDF(ti)

meet deadline

miss deadline

t

r i=16 days

f(ri)=2.2s

Li=2.2s

Score(Qi’)

t

PDF(ti’)

meet deadline

miss deadline

t

ri’ =12 days

f(ri’)=1.85s 2.2s

Li=2.2s0s

0s

(a) Score of A1

(b) PDF of A1

(c) Score of A2

(d) PDF of A2

Figure 4.7: The score and error PDF for two target running times, ri = 2.2 seconds andr′i = 1.85 seconds, with time limit Li = 2.2 seconds, f(ri) = 0.08ri + 0.9, and PDF (ti) =N(f(ri), 0.4).

4.3.4 Choosing the Range ri for Next Mini-Query Qi

At each step, Drum picks a range ri to maximize the expected score for mini-query Qi so

that Qi can finish as close to the deadline Li as possible but have less risk of running longer

than Li. In this heuristic way, the schedule composed by all Qis can deliver the results at

a regular pace. For a specific range ri, the running time f(ri) can be estimated using the

regression function. Let P (t|f(ri)) be the PDF of the real running time ti when the targeted

time is f(ri). Then the expected score E(score) of the mini-query Qi can be calculated as:

E(score(Qi)) =

∫ ∞0

score(Qi)P (t|f(ri))dt

=

∫ Li

0

riIP(t|f(ri)

)dt+

∫ ∞Li

(riI− α(t− Li)

CiLi

)P(t|f(ri)

)dt

=riI− α

∫ ∞Li

(t− Li)CiLi

P(t|f(ri)

)dt.

(4.6)

75

Page 93: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Our goal is to compute a value ri to maximize this expected score E(score(Qi)).

4.4 Choosing an Optimal Predicate Range

We now study how to choose a predicate range ri for the next mini-query Qi in order to

maximize the expected score E(score(Qi)). We consider two different error models for the

time distribution P (t|f(ri)), namely histogram and Gaussian, and develop a technique for

deciding the range ri for each of them.

4.4.1 Histogram

Construction

A histogram of the estimated error distribution can be constructed as follows. Take an equi-

width histogram as an example. We first choose its error bin size b. After running each

mini-query Qj, we measure its real execution time tj and compute the difference between tj

and the estimated time f(rj). We identify the bin corresponding to this error, tj − f(rj),

and increment the frequency of the corresponding bin.

Calculating expected score for a given range ri

For a given predicate range ri and the corresponding estimated running time f(ri) from

the regression function, we want to calculate the expected score for the generated mini-

query Qi. The intuition of our method is the following. The obtained histogram will give

an error distribution centered at f(ri). We apply Equation 4.6 to compute the integral of

the score using the error distribution. In particular, before the time limit Li there is only

76

Page 94: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

2.0

P2=3%

P3=12%

P1=1%

1.7 2.3 2.61.41.10.8 2.9

Time Limit Li=2.2sEstimated Time when ri=12 days

P5=26%

P6=22%

P7=2%

(seconds)

b=0.3

P4=34%

T4 T5 T6

y

T7 T8T3

m=5

f(ri)

Figure 4.8: Use an error histogram to compute the expected score for targeting running timef(ri).

progress without any penalty, and after Li the penalty increases linearly. The integral can

be computed based on the probability accumulated in each of the bins whose ending time

is larger than Li. Depending on the distance between f(ri) with Li, the Li could fall into

different bins with different probabilities, which results in different score computations.

Using the previous example of Li = 2.2 seconds and f(ri) = 0.08ri + 0.9, Fig. 4.8 shows the

histogram distribution of the running time ti when ri = 12 days and f(ri) = 1.85. The limit

Li falls into the 5-th bin in the histogram, with error interval [2.0, 2.3]. The left four bins

are all less than Li, so they will not be contribute to the penalty. The two rightmost bins

[2.3, 2.6] and [2.6, 2.9] are both larger than Li. We thus need to multiply their probability

with the linear penalty function to compute the integral. Within the interval bin [2.0, 2.3],

we assume the error is uniformly distributed, and can deal with the two pieces divided by

the Li limit separately to compute the integral as the expected score. At the end, we take

the summation of these three cases as the overall expected score.

Formally, we assume the error distributes evenly within each interval bin of id k. Hence the

PDF for the error within the k-th interval bin is PDFk = Prk/b, where Prk is the aggregated

77

Page 95: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

probability in the bin. Then the expected score can be computed as:

E(ri) =riI− α

CiLi

(∫ Tm+1

Li

(1− yb

)Prmb

(t−Li)dt+n∑

k=m+1

( ∫ Tk+1

Tk

Prkb

(t−Li)dt)), (4.7)

where

• n: total number of bins in the histogram;

• b: the width of each bin;

• k: the sequence id of each bin;

• Tk: the start time of the k-th bin;

• m: the id of the bin that time limit Li falls into, i.e., Tm ≤ Li < Tm+1;

• y: the time difference Li − Tm.

For a specific range of ri that makes Li fall into them-th bin, Equation 4.7 can be transformed

to:

E(ri) =riI− α

CiLi

(Prm2b2

(b− y)3 −n∑

k=m+1

Prky + b

n∑k=m+1

Prk(1

2+ k −m)

). (4.8)

Computing ri to maximize the expected score

Fig. 4.9 shows how the expected score changes as we increase the range ri for the histogram

distribution in Fig. 4.8 for different values of the penalty weight α. As ri increases, initially

the expected score also increases, since the progress increases without much penalty. After a

certain time, the expected score starts declining since the penalty of missing the time limit

78

Page 96: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

0.06

0.07

0.08

0.09

0.1

0.11

0.12

8.5 9 9.5 10 10.5 11 11.5

Exp

ecte

d S

co

re

Range ri (days)

α=0.1α=5.0α=10

Figure 4.9: Expected score as ri changes (histogram model).

also increases. There is an optimal ri that can achieve the best expected score, and the

optimal value depends on the penalty weight α.

In general, we can find an optimal ri to maximize the expected score as follows. For each

interval bin Bm that the time limit Li falls into, we can use Equation 4.8 to compute a best

y value, namely ymax, that maximizes the expected score. The value ymax, if it exists, must

satisfy the following equation (see the Appendix A for more details):

(b− ymax)2 =2b2

3Prm(CiLiαa1I

−n∑

k=m+1

Prk). (4.9)

Let g be the id of the bin that covers f(ri). Using Equation 4.9 we can compute the

corresponding ymax value. We can then compute the best f(ri) as:

f(ri) = Li − ymax − (m− g − 1/2)b.

Then we can compute the best ri value based on the regression function in Equation 4.4.

79

Page 97: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Bin id k max(E(score)) ri (days)

4 0.085 10.855 0.095 10.156 0.096 9.97 0.087 9.45

Table 4.2: Maximal expected score for each interval bin

Notice that the expectation score in Equation 4.8 will change based on which bin the Li falls

into, so we need to examine all the possible bins to find the ri that yields the global maximum

expectation. A simple ri optimization method is to find such a ymax for each of the bins and

choose the one with the largest value. Then we use the ymax value for this chosen bin to

compute the corresponding ri as the final predicate range on the slicing attribute. Table 4.2

shows the maximum expected score for each interval bin in the example in Fig. 4.8.

To terminate the search for the maximum progress earlier, we can start by being aggressive

and choosing the target time f(ri) close to the time limit Li. We can gradually decrease the

f(ri) value by the bin size b. As a consequence, the progress will decrease, while the penalty

also decreases, and the overall expected score increases. We can keep decreasing f(ri) until

we reach an interval bin where the expected score starts to decrease. Then we choose the

best f(ri) in this bin as the final value.

4.4.2 Gaussian Distribution

Assuming that the estimated time error ti − f(ri) follows a Gaussian distribution N(0, σ),

the value of ti follows the distribution of N(f(ri), σ). The parameter σ is the standard

deviation of the observation data. We keep updating a single standard deviation estimate

after seeing each new data point of the performance of the database system. One advantage

of the Gaussian distribution is that it has a closed-form PDF. We can replace the PDF in

80

Page 98: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Equation 4.6 with N(f(ri), σ) and get the following function:

E(ri) =riI− α

∫ ∞Li

t− LiCiLi

1√2πσ

e−( t−f(ri)√

2σ)2dt. (4.10)

This equation can be transformed to

E(ri) =riI− ασ

CiLi√

2

( 1√πe−z

2

+ z(1 + erf(z)

)), (4.11)

where

z =f(ri)− Li√

2σ. (4.12)

To compute an ri value to maximize the expected score, we need to find a value that makes

the derivative of Equation 4.11 be 0, i.e., E ′(ri) = 0. The derivative can be transformed to

E ′(ri) =1

I− f ′(ri)

α

2CiLi

(1 + erf(z)

). (4.13)

Using the estimation model in Equation 4.4, we can compute ri as

ri =

√2σzmax + Li − a0

a1. (4.14)

When

α >2CiLia1R

, (4.15)

81

Page 99: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

we have

zmax = erf−1(2CiLia1Iα

− 1). (4.16)

To summarize, for a given time limit Li, penalty weight α, range space I, coefficients a1 and

a0 of the linear function, and variance σ of the Gaussian model, we can use Equation 4.14

and Equation 4.16 to compute ri to achieve the maximal expected score.

4.4.3 An Example Sequence of Mini-queries

In this section, we use an example to illustrate how the Drum framework decides the sequence

of mini-queries by using the adaptive regression function and the uncertainty model.

Mini-query Li ri f(ri) ti error σ new f(r)

Q1 2.0s 1.0h 0.74s 0.5s -0.24s 0.2623 -Q2 3.5s 2.0h 1.13s 1.5s 0.37s 0.2623 -Q3 4.0s 4.0h 1.92s 1.8s -0.12s 0.2623 0.392r + 0.350Q4 4.2s 9.2h 3.96s 3.0s -0.96s 0.5294 0.269r + 0.611Q5 3.2s 7.3h 2.57s 3.2s 0.63s 0.5503 0.302r + 0.578Q6 2.0s 1.8h 1.12s 1.5s 0.38s 0.5255 0.286r + 0.710Q7 2.5s ... ... ... ... ... ...... ... ... ... ... ... ... ...

Table 4.3: A sequence of mini-queries generated adaptively based on the latest statisticalinformation. (α = 25)

All Drum variants start with three mini-queries with fixed time ranges of one, two, and four

times the minimal range, respectively, to accumulate the initial statistics, and then begin

the adaptive process. An example of possible running times ti for the first three mini-queries

is shown in Table 4.4.

Based on the first three observations of range size ri and running time ti, we can learn

the linear function f(ri), and use it to calculate errors (ti − f(ri)) to build the uncertainty

82

Page 100: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Mini-queries ri ti (seconds)Q1 1h 0.5Q2 2h 1.5Q3 4h 1.8

Table 4.4: The first three mini-query results

model. For simplicity, we use the Gaussian distribution model that only requires the standard

deviation (σ) of the errors to estimate the uncertainty. Table 4.3 shows the chosen range

size ri of each mini-query and how the model is adaptively updated based on newly arriving

statistical information. Taking Q4 as an example, after the first three mini-queries, we got

the linear function f(r) = 0.392r+ 0.350 and the uncertainty model N(f(r), 0.2623). Given

the time limit of 4.2 seconds, using Equation 4.14 we select the range size r4 = 9.2h for Q4,

which is predicated by the model to finish in 3.96 seconds. When Q4 actually finished in 3.0

seconds, we calculate the error (3.0s− 3.96s = −0.96s), then update the standard deviation

of the Gaussian model to σ = 0.5294. Meanwhile, based on the new observation of range

size r4 and running time t4, we update the linear function as f(r) = 0.269r + 0.611, then

use the new f(r) and σ to choose the next range size for Q5. This online learning process

continues until the mini-queries cover the entire value space of the slicing attribute.

4.5 Experiments

In this section we report the results of an empirical evaluation of the proposed Drum frame-

work and its adaptive estimation techniques.

83

Page 101: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

4.5.1 Setting

We collected 114 million tweets using the Twitter streaming API from November 14, 2016 to

January 17 2017. We used Apache AsterixDB [13] (0.9.2 version) as the backend database to

store the data. Its schema included the following attributes: (id: int, create_at: datetime,

text: string, lang:string, stateID: int, countyID:int, cityID:int). The total size of the

records in the database was 90GB. The Drum middleware was written in Scala 2.11.7 and

ran on a 64-bit JVM (Oracle 1.8 version). Both the middleware and the database ran on a

machine with a CPU of 2 cores, 16GB memory, and a 500GB M.2 SSD disk, running the

Centos 7 operating system. We evaluated queries to count the number of tweets mentioning

a particular keyword. The original, unsliced queries to AsterixDB were in the following

format:

1 SELECT COUNT(t)

2 FROM twitter t

3 WHERE ftcontains(t.message, $keyword);

In the query, the predicate ftcontains() checks if the message attribute includes a given

keyword. We considered different types of queries regarding their keyword condition:

1. NoKeyword: The query did not have an ftcontains predicate;

2. Uniform keyword: The keyword in the query was distributed uniformly over the whole

time range. An example was rain, whose daily tweet number did not change very

much over time;

3. Non-uniform keyword: The keyword had a skewed distribution in the time range. Ex-

ample keywords were election (popular around November 2016) and happy (pop-

ular around January 2017).

We built a full-text index on the text field to speed up these keyword-search queries. We used

84

Page 102: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

the “create_at” field as the slicing attribute, which was the creation timestamp of the

tweet. We also used this field to do filtering for the AsterixDB indexes [15], which can prune

irrelevant disk components if a query has a range predicate on the attribute. Therefore, a

mini-query with a short range condition on the create_at field ran faster than one with

a bigger range. We assume the user was more interested in the latest results, so the slicing

was moving from the latest time to the earliest.

4.5.2 Effect of Different Slicing Methods

For comparison purposes, we implemented the following methods to run a query:

1. No slicing (“NS”): we ran the original query as it was;

2. Fixed-length slicing (“FL”): we used a fixed interval size for the slicing attribute in each

mini-query.

3. Baseline Drum (“DRUM-BL”): we used the Drum framework without an error model,

and used a linear regression function to directly compute a predicate range for the next

mini-query without considering the penalty of missing the next milestone;

4. Drum with a histogram error model (“DRUM-HS”): we used a histogram error model

(Equation 4.7) to decide the next predicate range;

5. Drum with a Gaussian error model (“DRUM-GA”): we used a Gaussian error model

(Equation 4.10) to decide the next predicate range.

All Drum methods started with 3 mini-queries with a fixed creation timestamp range of

one, two, and four hours respectively to accumulate the initial statistics and then began the

adaptive process. For each query, we ran each slicing method three times and computed their

average performance numbers. We set the desired rhythmic response pace to be 2 seconds.

85

Page 103: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

As explained in Section 4.1, a good fixed length is hard to decide for the FL method. To give

it a fair (comparable) setting, we used the average number of mini-queries obtained from the

DRUM-BL method for our different keywords.

Number of mini-queries

We first tested the different methods to evaluate their number of mini-queries. We considered

four queries with different keyword conditions, namely keyword rain (uniform), keyword

election (non-uniform), keyword happy (non-uniform and expensive), and NoKeyword

(uniform and expensive). The results are shown in Fig. 4.10a. The NS method sent one big

query directly to the database, so its number of mini-queries was always 1. The FL method

used a fixed interval of 29 hours and issued 60 mini-queries regardless of the keyword con-

dition. In contrast, the Drum framework dynamically adjusted the number of mini-queries

depending on the observed performance numbers of the database system. For example, key-

words election and rain were very selective, and their corresponding original queries

would access fewer records than the query of the popular keyword happy. The Drum frame-

work utilized the collected performance information and picked a bigger predicate range for

these two queries. As a result, their numbers of mini-queries were much less than that of the

happy query and that of the query without keywords. Among the three Drum methods, the

DRUM-HS and DRUM-GA methods considered the uncertainty of the regression, thus were

more conservative when generating the next predicate range and issued more mini-queries

than DRUM-BL.

Total running time

Fig. 4.10b shows the total running time of the schedules generated by different models. The

NS method had the least total running time because it did not do any slicing and thus did

86

Page 104: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

not pay any overhead. The FL method sent 60 mini-queries and had a slightly longer total

time. The times of all the DRUM methods were comparable, and they did not increase the

running time much.

Total delay of missing milestones

We next evaluated the different methods in terms of their total delay time. The results are

shown in Fig. 4.10c. The NS method had the longest delay because it did not give the user

any updates until the whole query was finished. The delay time of FL was always 0 for the

election and rain requests, as each of their mini-queries used a small fixed range and

could finish before the next milestone. However, the same fixed range setting caused mini-

queries of the expensive requests happy and NoKeyword to miss their deadlines. This result

illustrates the difficulty of choosing a single range that works well for different requests.

Compared to the DRUM-BL method, DRUM-HS and DRUM-GA used the histogram and

Gaussian distribution uncertainty models to measure the penalty of missing a deadline, and

hence they were more conservative regarding the range of each mini-query. As a result, the

delays of DRUM-GA and DRUM-HS were much smaller.

Total cost

Fig. 4.10d shows the overall cost of a schedule generated by each method as defined in

Equation 4.1, where the weight was set to α = 25 to reflect a case where the client is

concerned more about the pace of result delivery than the total running time. The NL

method had the highest cost due to its long delay. The cost of FL was also significant for

the two expensive queries happy and NoKeyword. The Drum methods had lower costs for

all the queries because of the low penalty of their generated schedules. The DRUM-HS and

DRUM-GA methods were able to balance the number of mini-queries and the risk of missing

87

Page 105: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

0

10

20

30

40

50

60

70

80

90

100

elec

tion

rain

happ

yN

oKey

wor

d

Nu

m o

f m

ini-q

ue

rie

s

NSFL

DRUM-BLDRUM-HSDRUM-GA

(a) Number of mini-queries

0

20

40

60

80

100

120

140

160

180

200

elec

tion

rain

happ

yN

oKey

wor

d

To

tal tim

e (

se

co

nd

s)

NSFL

DRUM-BLDRUM-HSDRUM-GA

(b) Total time

0

20

40

60

80

100

120

140

160

180

elec

tion

rain

happ

yN

oKey

wor

d

De

lay (

se

co

nd

s)

NSFL

DRUM-BLDRUM-HSDRUM-GA

(c) Total delay (FL is 0 in election and rain cases)

0

500

1000

1500

2000

2500

3000

3500

4000

4500

elec

tion

rain

happ

yN

oKey

wor

d

Co

st

(se

co

nd

s)

NSFL

DRUM-BLDRUM-HSDRUM-GA

(d) Total cost

Figure 4.10: Evaluating slicing methods

88

Page 106: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

0

10

20

30

40

50

60

70

80

90

100

elec

tion

rain

happ

yN

oKey

wor

d

Num

ber

of m

ini-queries α=5

α=25α=125

(a) Number of mini-queries

0

1

2

3

4

5

6

7

8

9

elec

tion

rain

happ

yN

oKey

wor

d

Tota

l dela

ys (

seconds) α=5

α=25α=125

(b) Delay

Figure 4.11: Effect of penalty weight α (DRUM-HS)

milestones, so their costs were the lowest.

These experiments showed that there was not much difference between the schedules gener-

ated by DRUM-HS and DRUM-GA. The main reason was that Drum is an adaptive frame-

work, and it can automatically adjust the regression function based on observed performance

numbers, so the estimation error distribution was similar to Gaussian.

4.5.3 Effect of Penalty Weight α

We evaluated the effect of the penalty weight α in the scoring function of Equation 4.2.

Intuitively, a higher α value means a higher penalty for missed deadlines, and the slicing

method should become more conservative in generating the next predicate range. We did

experiments using three different values for α, namely 5, 25, and 125, for both DRUM-HS and

DRUM-GA. Fig. 4.11a shows the number of mini-queries when using the DRUM-HS method.

When α increases, the number of mini-queries also increases since each mini-query became

89

Page 107: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

0

10

20

30

40

50

60

70

80

90

100

elec

tion

rain

happ

yN

oKey

wor

d

Num

ber

of m

ini-queries α=5

α=25α=125

(a) Number of mini-queries

0

1

2

3

4

5

6

7

8

9

elec

tion

rain

happ

yN

oKey

wor

d

Tota

l dela

ys (

seconds) α=5

α=25α=125

(b) Delay

Figure 4.12: Effect of penalty weight α (DRUM-GA)

more conservative (with a smaller range). Meanwhile, the total delay decreased due to the

smaller predicate ranges, as shown in Fig. 4.11b. The DRUM-GA results for different α values

are shown in Fig. 4.12.

4.5.4 Adaptiveness of Regression Function

To evaluate the adaptiveness of the regression function, we considered two ways to derive

the linear regression function: (1) All, which used the observed performance numbers of

all previous mini-queries; and (2) Last10, which used the latest 10 performance numbers.

The purpose here was to see how well the Drum framework can adapt to changes in the

performance of the underlying database system. We considered the NoKeyword query, whose

results were distributed evenly over the entire time range.

We first simulated a situation with a dynamic database workload. The first half of the mini-

queries were run normally, but after some time we added a 1-second sleep to the database

90

Page 108: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

0

50

100

150

200

250

300

350

DRUM-HS DRUM-GA

Tim

e (

seconds)

AllLast10

(a) Total running time

0

2

4

6

8

10

12

14

DRUM-HS DRUM-GA

Tim

e (

seconds)

AllLast10

(b) Total delay

Figure 4.13: Adaptiveness of the linear regression (adding a 1-second sleep for the secondhalf of the mini-queries

connection before sending each result back to the mini-query executor in order to simulate

a case where the second half of the mini-queries took longer than the mini-queries with the

same time ranges in the first half. Fig. 4.13 shows the results. The Last10 method was able

to adapt to the new performance of the database, since its obtained linear regression better

described the relationship between the predicate range and the running time.

We then simulated a situation where the database system is less stable. In this case we

added a 1-second sleep randomly (with probability of 0.5) in the database connector for each

mini-query. Fig. 4.14 shows the results. In this case, the linear function trained using the All

method was more precise, and both its total running time and penalty were less than those

for the Last10 method.

These two experiments showed that using fewer recent observations to train a linear function

can help Drum to adapt to new backend performance changes as expected. However, if the

underlying database is unstable, using fewer observations may more likely be affected by

“noise”, making the trained model less precise, which could lead to worse performance (e.g.,

a longer running time and larger penalties).

91

Page 109: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

0

50

100

150

200

250

300

350

DRUM-HS DRUM-GA

Tim

e (

seconds)

AllLast10

(a) Total running time

0

2

4

6

8

10

12

14

DRUM-HS DRUM-GA

Tim

e (

seconds)

AllLast10

(b) Total delay

Figure 4.14: Adaptiveness of the linear regression (adding a 1-second sleep for each mini-query randomly)

Remarks: In summary, we have seen that (1) The Drum framework can adaptively adjust

the size of each mini-query to successfully reduce the amount of delay for different queries

without any prior knowledge, and it does not increase the total running time by much.

(2) The uncertainty models can automatically balance the number of mini-queries and the

penalty of missing milestones depending on the weight on the penalties. (3) The Gaussian

model and histogram model gave similar results. Since the Gaussian model is easier to

implement, it is arguably the better choice to adopt.

4.6 Related Work

Progressive computing: Many approaches have been proposed in the literature to support

interactive queries on large data sets residing in a backend database system [46, 42, 45, 64].

For example, [46, 42] developed a solution called “online aggregation” that progressively

shows approximate aggregation answers with a confidence interval based on partial results

and a statistical model. The solution relies on random sampling, which requires significant

92

Page 110: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

changes to the underlying database system. [69] introduced another method for improving

the user experience by showing partial results, which also requires changes to the database

system. [28, 64] studied online aggregation in a MapReduce-like framework. [53, 82] studied

how to provide aggregation results with different levels of details incrementally by using

an auxiliary data structure specially designed for spatio-temporal data; the data structure

needs to be synchronized with the underlying database system. Our study is different since

we consider the case where the underlying database system is used as a “black box” without

any changes.

[81] also addressed how a middleware layer can decompose a query into mini-queries to

improve responsiveness. Their focus was mainly on how to select an attribute to do the

decomposition, and the generated mini-queries have the same-size predicate intervals. (See

the analysis of fixed-length intervals above.) Our focus is on how to generate variable-range

predicates for the mini-queries to maintain the rhythm of delivering results.

Waiting time in progressive computation: Since the user encounters a series of waiting

times, the quality of the experience heavily depends on the timing of those updates on the

frontend [32, 76]. Similar to the case of streaming online videos, user-perceived quality

can suffer from freezes during the delivery of results. A sudden, different waiting time can

be different from the user’s expectation, causing a negative experience [76, 88]. Thus, the

rhythm of result delivery is a key determinant related to user satisfaction in progressive

computing. Deciding a good amount of waiting time for the next incremental update has

been studied in [32, 21, 85]. [34] suggested 10 seconds as an upper time limit for keeping the

user’s attention focused on the interface. Instead of returning a single response after several

minutes, progressive computing provides a flow of partial results that helps users “lose their

sense of time” [29], resulting in a good experience of distorted time perception [76].

Query time estimation: Estimating the running time of a query is a fundamental topic

that has been extensively studied (e.g., [8, 91]). Many of the existing techniques can be

93

Page 111: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

adopted in our Drum framework as part of the regression function. [92] considered an un-

certainty model that relied on internal information about a database system that may not

be available in many applications. [4] implemented a system called BlinkDB that allows

users to choose a trade-off between query accuracy and response time. [93] considered an

uncertainty model for admission control of multiuser workloads, which needs to be trained in

advance on sample data. Our proposed Drum solution is different; it does not rely on inter-

nal information about the database system nor a pre-trained model, as it collects statistical

information on the fly during the execution of mini-queries.

4.7 Conclusions

In this chapter, we have studied how to progressively answer a time-consuming query on

a large data set by generating a sequence of mini-queries. We formulated an optimization

problem to produce the predicates of mini-queries by considering both their total running

time as well as the smoothness of result delivery. Its goal key is to provide the incremental

results at a rhythmic pace to improve the user experience. We developed an adaptive frame-

work called Drum that can collect run-time behavioral statistics for the database system to

decide the predicate for the next mini-query appropriately. Drum is a general middleware

solution that requires no changes to the underlying database system. Our experiments on a

large, real data set showed that Drum and its techniques can reduce the delay of delivering

intermediate results to the user without sacrificing much total time. Therefore, the user can

see smooth result updates at the expected rhythm.

94

Page 112: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Chapter 5

Using Filters to Improve

Secondary-to-Primary Index Search

5.1 Introduction

Fast query response time is critical for a good user experience. In the previous query slic-

ing work, Cloudberry slices a query into a sequence of mini-queries by adding time-range

predicates to the original query so that the mini-queries can run much faster to support a

responsive user experience. It is always desirable to improve the query processing speed so

that there will be fewer mini-queries needed, and hence less time for the frontend to receive

the complete results.

To improve mini-query performance, the TwitterMap application benefits from the LSM filter

feature [15] provided by AsterixDB to speed up the time-related queries. AsterixDB’s LSM

filters provides a way to add additional metadata to LSM components and to exploit the

metadata during query processing to skip irrelevant components. In order to efficiently utilize

the power of the filters, the query needs to have a highly selective filter-related predicate.

95

Page 113: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

This limitation largely restricts the scenarios that can utilize the pruning power of the filters.

Unfortunately, many analytical queries do not match this pattern. For example, a mini-

query can have a wide time-range predicate, and many index components will be matched

by checking their filters. In such cases, the query time can not be reduced much.

In this chapter, we revisit the filter idea, explore the aligned structure property between

secondary index components and primary index components, and propose an improvement

for accelerating the secondary-to-primary search. The main idea is to propagate the filter

information attached to a searched secondary index component along with its output primary

keys to the primary index search. The primary index search can then still use filter hints to

improve the speed, even though the query itself does not contain any filter-related predicates.

The rest of the chapter is organized as follows. First, we revisit the background of LSM-

trees and the filter concept in Section 5.2. Then we introduce the idea of acceleration using

secondary components’ filters in Section 5.3. We present the evaluation of the improvement

in Section 5.4. Finally, we provide conclusions in Section 5.6.

5.2 Background

5.2.1 LSM-Tree

An LSM-tree [63] is an ordered, persistent index structure that supports typical operations

such as insert, delete, and search. It is optimized for frequent or high-volume updates. By

first batching updates in memory, the LSM-tree amortizes the cost of an update by converting

what would have been several disk seeks into some portion of a sequential I/O. Entries being

inserted into an LSM-tree are initially placed into a component of the index that resides

in main memory, called an in-memory component. As shown in Fig. 5.1, when the space

96

Page 114: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Disk Memory

in-memory component (mutable)

disk component (immutable)

insert

T2 flush

merge

T1

T3

time

Figure 5.1: Life cycle of a component-based LSM-tree [51].

occupancy of the in-memory component exceeds a specified threshold, entries are sequentially

flushed to disk. As entries accumulate on disk, the entries are periodically merged together

subject to a policy that decides when and what to merge.

5.2.2 Filter-based LSM Storage in AsterixDB

5.2.2.1 LSM Storage in AsterixDB

The AsterixDB’s storage layer [14] provides a general framework for converting a class of

indexes (including conventional B-trees, R-trees, and inverted indexes) into LSM-based in-

dexes.

Fig. 5.2 shows a snapshot of the LSM storage after inserting the records from Table 5.1.

Each record contains the id field that stores the primary key, the time field that collects

the timestamp, and the text field that holds the tweet content. After ingestion, there is one

primary LSM B-tree index and one secondary inverted LSM index built on field text. The

in-memory component of the primary LSM B-tree stores entries of 〈pk, record〉 pairs and

orders them by the primary key (noted as “pk”). There are multiple on-disk components of

97

Page 115: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

id time text1 25 Zika virus symptoms and risks2 34 Rio Olympics3 35 Flu virus4 47 Zika cases5 48 Zika in mosquitos6 67 News about Olympics7 69 Ebola virus

Table 5.1: A sample dataset tweet. The id is the primary key. The value of the timefield has a timestamp number. The text field contains a bag of words.

[25,34]

<7,R7>

[35,47] [48,67]

[25,47] [48,67]

PC1 PC2 PC3 PC0

SC1 SC2 SC0

<1, R1><2, R2>

<3, R3><4, R4>

<5, R5><6, R6>

<zika, [1,4]><virus, [1,3] >, <rio, [2]>

<zika, [5]><news, [6]>

<virus, [7]>

Secondary LSM keywordInverted Index(key field: text)

PrimaryLSM B-treeIndex(key field: id)

Index Type Disk Components In-memory Components

Bloom Filter B-tree[25,34]

LSM Filter

[69,69]

[69,69]

Figure 5.2: LSM storage example in AsterixDB.

98

Page 116: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

the primary LSM B-tree that are formed by flushing an in-memory component or by merging

multiple smaller on-disk components. Particularly for the case of the primary index, a disk

component consists of a B-tree with an associated Bloom filter that is built by the primary

keys stored in the specific component. The Bloom filter is used to skip irrelevant B-tree

components when looking for a specific primary-key.

Similarly, the secondary index consists of an in-memory index structure and several on-

disk index structures. Different from the primary index, a secondary disk component does

not have Bloom filters. Logically, the secondary index stores entries of 〈sk, pk〉, where sk

represents a secondary key. Depending on the type of the index, the entries may be stored

in different formats. For example, in the case of inverted index, all pk’s associated with sk

are physically stored as a list in a separate inverted list file. Then the B-tree structure stores

〈sk, pointer〉, where the pointer stores the offset of the list address in the inverted list file.

5.2.2.2 LSM Filter

The LSM filter is an additional piece of metadata added to each LSM component that stores

the minimum and maximum values of a selected attribute (called the filter key) among the

records in the component [15]. If a query has an predicate on the filter key, the query

executor can exploit the metadata to skip irrelevant components.

Users of filters are advised to use time-correlated fields or a field that is monotonically

increasing over the order of record arrivals as their filter field. In this case, the filters on

the disk components are likely to have disjoint filter ranges, making them very effective for

pruning. Fig. 5.3 shows an example of the SQL++ DDL to declare the time field filter of

the tweet dataset in AsterixDB.

As shown in Fig. 5.2, after declaring the filter field, both primary and secondary index

components contain an LSM filter. The in-memory component incrementally maintains the

99

Page 117: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

1 CREATE DATASET tweet(tweetType) PRIMARY KEY id WITH FILTER ON time;

Figure 5.3: SQL++ DDL statement to declare the filter field.

[25,47]

SC1

<zika, [R1,R4]><virus, [R1,R3] >, <rio, [R2]>

[25,34]

SC’11

<zika, [R1><virus, [R1] >, <rio, [R2]>

[35,47]

SC’12

<zika, [R4]><virus, [R3] >

merge

Figure 5.4: Update the filter range when merging two components.

minimum and maximum filter key values based on newly inserted values. When the in-

memory component is flushed to disk, its filter record is also flushed to disk. During the

merge phase, the minimum and maximum values for the resulting component’s filter record

are taken from the smallest and largest values of all components participating in the merge.

For example, as shown in Fig. 5.4, component SC1 is created by merging the previous two

on-disk components SC ′11 and SC ′12. The filter range of the merged component is set to

cover both filter ranges of the previous components.

5.2.2.3 Filter-based Secondary-to-Primary Search

When a query comes to the system, the AsterixDB query optimizer first checks if the dataset

has a filter. If so, then it analyzes the query to find if there are any filter-related predicates.

Once the optimizer finds applicable predicates, it will pass them to all the index search

100

Page 118: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

SIX SEARCH

SORT

predicate on SK & filter field

<PK> ...

PIX SEARCH

<PK, query’s filter predicate range>….

ASSIGN

ordered <PK> ...

Figure 5.5: The query plan of secondary-to-primary index search using a filter. “SIX” standsfor secondary index search. “PIX” stands for primary index search.

1 SELECT t FROM tweet t2 WHERE t.time > 20 AND t.time < 303 AND ftcontains(t.text, "zika");

Figure 5.6: SQL++ query with a filter condition.

operators. At runtime, the index search operator uses the filter predicate to prune irrelevant

components before searching them. Then a regular search on the index is performed only for

the components that match the query’s filter predicates. Both the primary and secondary

index search operators use the filter in the same way to skip unnecessary searches.

Fig. 5.5 shows the query plan for filter-based secondary-to-primary search. The secondary

index is searched first to return all the matching primary keys. Then the keys are sorted and

sent to the assign operator to attach the filter predicate ranges from the query. Finally, the

primary index search operator searches for all primary keys one by one, and the associated

filter predicate can be used to prune the irrelevant primary components during this process.

Fig. 5.7 shows an example of the corresponding data flow when the query in Fig. 5.6 is issued.

101

Page 119: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

[25,34]

<7,R7>

[35,47] [48,67]

[25,47] [48,67]

PC1 PC2 PC3 PC0

SC1 SC2 SC0

<1, R1><2, R2>

<3, R3><4, R4>

<5, R5><6, R6>

<zika, [1,4]><virus, [1,3] >, <rio, [2]>

<zika, [5]><news, [6]>

<virus, [7]>

Secondary LSM keywordInverted Index(key field: text)

PrimaryLSM B-treeIndex(key field: id)

Index Type Disk Components In-memory Component

Index Type Disk Components In-memory Component

<1><4>

SORT

<zika, 20 to 30 >

an index search operation

<1, 20 to 30 ><4, 20 to 30 >

R1

[69,69]

[69,69]

ASSIGN

Figure 5.7: The physical data flow of filter-based secondary-to-primary search

102

Page 120: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

The query optimizer finds a full-text search predicate ftcontains(t.text, “zika”)

that can utilize the inverted index on the text field and another predicate ([20,30]) that

can utilize the LSM filter on time field of the tweet dataset. At runtime, both the keyword

and the time range predicates are sent to the inverted index to find the matching primary

keys. Given the filter predicate, SC2 will be filtered out because its range [48,67] does not

overlap with the query predicate [20, 30]. The in-memory component is always selected

because it can include update and delete operations. After the secondary index search, only

the primary keys 1 and 4 are returned after searching SC1. The output records are sent

to the sort operator, then through the “assign” operator to attach the same filter predicate

range from the query again. Finally, the primary search operator receives a collection of

primary key pk’s. The collection will be “unnested” in the sense that each pk will be searched

independently by using a point lookup in the primary B-tree index. Hence, there will be a

sequence of key searches instead of one search as in the secondary index search case. For

each pk search, the filter predicate from the query is again used to prune irrelevant disk

components. For example, there is no need to search PC2 and PC3 since their component

filter ranges do not overlap with the query predicate. Note that since each pk search is a

point lookup, the Bloom filter will also be used to eliminate unnecessary B-tree component

accesses in the primary index search. Finally, pk = 1 will be found in PC1, and R1 will be

returned from the index and pushed to downstream operators.

5.2.2.4 Merge Policies

The flushed on-disk components are periodically merged together to improve the query

performance. During the process, the merge policy plays a critical role to decide when

and what to merge.

The prefix merge policy relies on component sizes and the number of components to decide

which components to merge. When the size of a component exceeds a certain limit, which

103

Page 121: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

1 SELECT t FROM tweet t2 WHERE ftcontains(t.text, "zika");

Figure 5.8: SQL++ query to count the number of tweets mentioning zika.

can be configured by users, the component is not included in the merging process. The size

limit parameter is shared by all indexes that are built on a dataset, and each index is merged

independently. Since the size of the secondary index is often comparatively smaller than the

primary index, the number of secondary components can be much less than the number of

primary components.

The correlated-prefix policy delegates the decision of merging the disk components of all the

indexes in a dataset to its primary index. When the policy decides that the primary index

needs to be merged, it will issue successive merge requests to the I/O scheduler on behalf of

all other indexes associated with the same dataset. The end result is that secondary indexes

will always have the same number of disk components as their primary index and that they

will be aligned in terms of their filter ranges.

5.3 Using Filters to Improve Secondary-to-Primary In-

dex Search

5.3.1 Basic Idea1

LSM filters can improve the performance of a query if the query contains a highly selective

predicate on the LSM filter field [15]. Otherwise, if the query does not contain a filter-related

predicate or the predicate range is not highly selective, the filter is inapplicable or of limited

use.1The idea is primarily credited to Yingyi Bu.

104

Page 122: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

[25,34]

<7,R7>

[35,47] [48,67]

[25,47] [48,67]

PC1 PC2 PC3 PC0

SC1 SC2 SC0

<1, R1><2, R2>

<3, R3><4, R4>

<5, R5><6, R6>

<zika, [1,4]><virus, [1,3] >, <rio, [2]>

<zika, [5]><news, [6]>

<virus, [7]>

Secondary LSM keywordInverted Index(key field: text)

PrimaryLSM B-treeIndex(key field: id)

Index Type Disk Components In-memory Component

Index Type Disk Components In-memory Component

<1, 25 to 47> <4, 25 to 47><5, 48 to 67>

SORT

<zika>

[69,69]

[69,69]

R1, R4, R5

Figure 5.9: The relationship between secondary-index components with primary-index com-ponents based on their filter value. The primary key found in a secondary component doesnot need to be probed on the primary components whose filter values are not overlappingwith the secondary component’s filter.

105

Page 123: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

However, in the secondary-to-primary search case, since both secondary and primary index

components contain the filter metadata, the filter ranges of the secondary components can

still be used to improve the primary index search regardless of whether the query has a filter

related predicate or not. We illustrate the idea with an example.

Suppose the LSM storage layout is the same as shown in Fig. 5.2. The inverted index has two

on-disk components, and the primary B-tree has three on-disk components. All components

have associated filters that are formed from the minimum and maximum values of the time

attribute of the records within each component. The issued SQL++ query is the one shown

in Fig. 5.8, which computes the number of tweets mentioning zika.

Since the query does not contain a filter predicate, all secondary components will be searched

to get a list of “zika”-related primary keys (pk’s). After the pk’s arrive at the primary index

search, again each pk will search (possibly) all the disk components in the reverse-chronical

order to obtain the matching record. (The older components will not be accessed once the

pk has been found in a more recent component).

On more careful consideration, we can observe that some of the primary components ac-

tually do not need to be searched. As illustrated in Fig. 5.9, the LSM filter ranges of the

secondary and primary components reveal some relationships between them. For example,

the secondary component SC2 stores the text field of the tweets that were created between

48 and 67, and the filter of primary component PC3 indicates that PC3 only stores the

records from 48 to 67. Logically, to search the matched primary key 5 from SC2, only PC3

needs to be searched. Similarly, to check the primary keys 1 and 4 obtained from SC1, only

PC1 and PC2 need to be accessed.

To utilize this relationship between the secondary and primary components, we can choose

to carry the filter value from the secondary index along the data flow pipeline to the primary

index search operator. The primary index search can then reuse the original filter-based

106

Page 124: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

pruning logic to skip the irrelevant components and thus improve the performance.

5.3.2 Implementation in AsterixDB

SIX SEARCH

SORT

<PK, Ts, Te> ...

PIX SEARCH

<PK, Ts, Te> ...

(a) Single secondary index search plan.

SIX SEARCH

SORT

predicate

<PK, Ts, Te> ...

PIX SEARCH

SIX SEARCH

predicate

SORT

<PK, Ts, Te> ...

PIX SEARCH

<PK, Ts, Te> ...

<PK, Ts, Te> ...

<PK, Ts, Te> ...

(b) Multiple-secondary-index search plan.

Figure 5.10: The filter-based secondary-to-primary index search plan.

To implement this search improvement when filters are available, we modified the existing

filter rewrite rule in the AsterixDB optimizer. Fig. 5.10 shows the logical query plan after

applying the rewrite rule. The optimizer analyzes the query to see if there is any secondary-

to-primary index search. If so, it sets the parameter in the secondary search to have it output

the pk and also append a pair of filter values (denoted as Ts and Te) of the component where

pk come from. Each resultant record can be represented as a triple 〈pk, Ts, Te〉. The entire

list of records will be sorted on the pk field and sent to the primary index to use in its search.

The primary index search operator can then utilize the filter predicate in the record for every

pk search to prune components for different pk’s independently. If multiple indexes are found

107

Page 125: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

in the path, all indexes will append the filter values of their own. The records will then be

sorted by pk and sent to the intersect operator. Only the records whose pk’s appear in all

the index paths will be selected, and only one of the index’s records will be output. (Ideally,

the output record could have a narrower filter range computed by getting the overlapped

range by comparing the multiple filter values. However, the intersect operator supported in

Hyracks is a general-purpose operator without an interface to specify such merging logic.)

Finally, the intersected records are sent to the primary-index-search operator to execute the

filter-based index search.

5.4 Experiments

In this section, we present the result of an experimental evaluation of the proposed filter-

based secondary-to-primary index search implemented in AsterixDB. We focus on the queries

that can trigger the secondary-to-primary index search but without any filter related pred-

icates. We show the effect of using the improved filter-based index searches with different

secondary index access paths and merge policy settings.

5.4.1 Dataset

We used one month of real Twitter data collected via the Twitter streaming API from

September 13, 2016 to October 14, 2016. The dataset had 54 million records with a total

size of 50GB. The schema of the dataset is shown in Fig. 5.11.

The size of each tweet was about 1KB. Some of the fields of a tweet that can have an impact

on the storage layout and as well as query performance are as follows:

• id: the primary key of a tweet;

108

Page 126: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

1 create type typeUser if not exists as open {2 id: int64,3 name: string,4 screen_name : string,5 lang : string,6 location: string,7 create_at: date,8 description: string,9 followers_count: int32,

10 friends_count: int32,11 statues_count: int6412 }13 create type typePlace if not exists as open {14 country : string,15 country_code : string,16 full_name : string,17 id : string,18 name : string,19 place_type : string,20 bounding_box : rectangle21 }22 create type typeGeoTag if not exists as open {23 stateID: int32,24 stateName: string,25 countyID: int32,26 countyName: string,27 cityID: int32,28 cityName: string29 }30 create type typeTweet if not exists as open {31 create_at : datetime,32 id: int64,33 "text": string,34 in_reply_to_status : int64,35 in_reply_to_user : int64,36 favorite_count : int64,37 coordinate: point? ,38 retweet_count : int64,39 lang : string,40 is_retweet: boolean,41 hashtags : {{ string }} ? ,42 user_mentions : {{ int64 }} ? ,43 user : typeUser,44 place : typePlace? ,45 geo_tag: typeGeoTag46 }

Figure 5.11: Tweet type definition used in the experiments.

109

Page 127: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

• create_at: the publishing time of a tweet. We used this field as the LSM filter field

and ingested the data in the same order as their creating time so that the storage

components could have disjoint minimum and maximum values, making them very

effective for pruning using the filter rules;

• text: the content of the tweet. We built a full-text inverted index on this field for the

purpose of quickly finding the tweets that mentioned a specific keyword;

• geo_tag.cityID: the geo_tag field describes the region where the tweet was published.

We built a B-tree index on this field to find the tweets published in a certain city.

• coordinate: the latitude and longitude values for the place where the tweet was pub-

lished. It was a null-able field, meaning not every tweet contained this field. In our test

dataset, 8 million out of 54 million tweets had a non-null value for this field. We built

an R-tree on this field to quickly find the tweets published within a certain geographic

area.

5.4.2 Machine and Parameter Configuration

We used a single machine with a dual-core CPU, 16GB memory, and a 500GB M.2 SSD disk

to host an AsterixDB instance. We used the parameters shown in Table 5.2 to configure the

datasets.

Parameter ValueData page size 128KBDisk buffer cache size 3GBBloom filter target false-positive rate 1%Max component size for merge policy 128MB

Table 5.2: AsterixDB settings for the experiment.

We ingested two copies of the 54 million-tweet dataset separately with the LSM prefix

and correlated merge policies as described in Section 5.2.2.4 to test the performance of

110

Page 128: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

the secondary-to-primary search filter rule under different merge policies. Table 5.3 shows

the number of on-disk components that resulted from using these different policies after the

same data file was ingested. Under the prefix merge policy, each secondary index was merged

independently and the adjacent components were merged until the merged size reached the

given size limit, i.e., 128MB. Since the secondary index was smaller than the primary index,

there were fewer secondary index components than primary index components. For example,

the text index had 43 components, which means that one inverted index component roughly

mapped to 6 primary index components. The R-tree index had the fewest components be-

cause only 8/54 of the tweets had this attribute, hence the size was the smallest among all

secondary indexes. The correlated merge policy aligned each secondary component with the

corresponding primary component. Therefore, all secondary and primary indexes had the

same number of components, i.e., 240.

Index type Prefix merge policy Correlated merge policyPrimary 231 240text (inverted index) 43 240cityID (B-tree) 11 240coordinate (R-tree) 5 240

Table 5.3: The number of components of each index after ingestion.

5.4.3 Query Performance

5.4.3.1 Test Queries

We experimented with different types of secondary-to-primary index searches to see the

performance of using the proposed optimization. The queries were simple count queries

that calculated the number of tweets satisfying specific predicates. The database system

utilized different secondary indexes to find the matching records depending on the predicates.

Fig. 5.12 shows three examples that led the system to use the inverted index on the text

111

Page 129: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

field, B-tree index on the cityID field, R-tree index on the coordinate field, respectively.

Each query was run three times and we report the average running time.

SELECT count(t) FROM twitter.tweet t

WHERE ftcontains(t.text, ['zika'], {'mode':'all'});

(a) Count the number of tweets mentioning zika.

SELECT count(t) FROM twitter.tweet t

WHERE t.geo_tag.cityID = 100820;

(b) Find the number of tweets published in Al-abaster, Alabama (cityID = 100820).

SELECT count(t) FROM twitter.tweet t

WHERE spatial_intersect(t.coordinate,

create_circle(create_point(-118.3238,34.1347), 0.01));

(c) Find the number of tweets published around the “Hollywood” sign in Los Angeles,CA by specifying a circle range around its coordinates.

Figure 5.12: Three types of test queries. We changed the keywords, city ids, and radii of thequeries to change their selectivity.

5.4.3.2 Using the Prefix Merge Policy

We first evaluated the performance of using the secondary filter optimization on the dataset

ingested using the prefix merge policy. Fig. 5.13 shows the query performance for the keyword

search, B-tree search, and R-tree search queries against the dataset. Fig. 5.13a shows the

query time using different keyword search predicates with different selectivities. Fig. 5.13b

shows the speedup from using the secondary filters. The results show that the secondary

filters optimization can reduce the query time for all the keyword search predicates of various

selectivities. As we increased the selectivity of the search predicate, the speedup became more

significant. For the “love” query with a 2.98% selectivity, the filter search optimization can

reduce the query time by 70%, which corresponds to a three-times speedup as compared to

112

Page 130: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

0

100

200

300

400

500

zika(0.004%)

flood(0.05%)

thanks(1.01%)

love(2.98%)

Runnin

g tim

e (

seconds) Original

Optimized

(a) Query time of counting the records with dif-ferent matching keywords.

0

10

20

30

40

50

60

70

80

zika(0.004%)

flood(0.05%)

thanks(1.01%)

love(2.98%)

Tim

e r

eduction(%

)

(b) Time reduction of using the filter from theinverted index.

0

100

200

300

400

500

AlabasterAL

(0.007%)

GoodyearAZ

(0.04%)

BrooklynNY

(0.95%)

LosAngeles(2.96%)

Runnin

g tim

e (

seconds) Original

Optimized

(c) Query time of counting the records with dif-ferent matching cities .

0

10

20

30

40

50

60

70

80

AlabasterAL

(0.007%)

GoodyearAZ

(0.04%)

BrooklynNY

(0.95%)

LosAngeles(2.96%)

Tim

e r

eduction (

%)

(d) Time reduction of using the filter from sec-ondary B-tree.

0

100

200

300

400

500

r=0.025(0.006%)

r=0.035(0.04%)

r=1.0(1.00%)

r=8.0(2.75%)

Runnin

g tim

e (

seconds) Original

Optimized

(e) Query time of counting the records withindifferent spatial circle regions with different radiir.

0

10

20

30

40

50

60

70

80

r=0.025(0.006%)

r=0.035(0.04%)

r=1.0(1.00%)

r=8.0(2.75%)

Tim

e r

eduction (

%)

(f) Time reduction using the filter from sec-ondary R-tree.

Figure 5.13: Comparison of the query performance between the original filter plan to usingthe secondary filter values for the primary index search. (The percentages below the predicatevalues on the x-axes show their selectivities.)

113

Page 131: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

0

100

200

300

400

500

zika(0.004%)

flood(0.05%)

thanks(1.01%)

love(2.98%)

Runnin

g tim

e (

seconds) Original

Optimized

(a) Query time of counting the records with dif-ferent matching keywords.

0

10

20

30

40

50

60

70

80

zika(0.004%)

flood(0.05%)

thanks(1.01%)

love(2.98%)

Tim

e r

eduction(%

)

(b) Time reduction using the filter from the in-verted index.

0

100

200

300

400

500

AlabasterAL

(0.007%)

GoodyearAZ

(0.04%)

BrooklynNY

(0.95%)

LosAngeles(2.96%)

Runnin

g tim

e (

seconds) Original

Optimized

(c) Query time of counting the records with dif-ferent matching city ids.

0

10

20

30

40

50

60

70

80

AlabasterAL

(0.007%)

GoodyearAZ

(0.04%)

BrooklynNY

(0.95%)

LosAngeles(2.96%)

Tim

e r

eduction (

%)

(d) Time reduction using the filter from sec-ondary B-tree.

0

100

200

300

400

500

r=0.025(0.006%)

r=0.035(0.04%)

r=1.0(1.002%)

r=8.0(2.75%)

Runnin

g tim

e (

seconds) Original

Optimized

(e) Query time of counting the records withindifferent spatial circle regions with different radiir.

0

10

20

30

40

50

60

70

80

r=0.025(0.006%)

r=0.035(0.04%)

r=1.0(1.002%)

r=8.0(2.75%)

Tim

e r

eduction (

%)

(f) Time reduction using the filter from sec-ondary R-tree.

Figure 5.14: Comparison of the query performance between using and not using the sec-ondary filters for the primary index search under the correlated merge policy.

114

Page 132: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

the original query plan. The reason behind this is as expected, using the filter hints reduces

the individual primary index search time, as it then only needs to check a few primary

components. As a result, the more primary keys that the secondary index returns, the more

time the optimization can save.

Fig. 5.13c and Fig. 5.13d show the query performance when using the secondary filters from

the secondary B-tree index built on the cityID field. We can see a similar improvement of

using the secondary filters from the inverted index. Compared to the speedup values of the

inverted index case, the speedup from the cityID index was relatively smaller for queries

with similar selectivities. Since the number of cityID index components was small (11),

the filter values attached to the secondary component was 4 times (onemonth11

v.s. onemonth43

)

wider than the range of the inverted index filter. Thus, there were more matched primary

index components to search in this case.

Fig. 5.13e and Fig. 5.13f show the performance when using the filters from the R-tree index

built on the coordinate field. Due to missing locations, the R-tree index was only built

on 15% of the records. Since the total size of index was small, when using the prefix

merge policy the index only consisted of 5 components, which means on average each R-tree

component aligned with 46 (231/5) primary index components. Thus, the pruning power of

the secondary filter from this coordinate index is even less than the other indexes.

5.4.3.3 Using the Correlated Merge Policy

In contrast with the prefix merge policy, where each index is merged independently, the

correlated merge policy coordinates all secondary indexes with the associated primary index

so that each secondary index component aligns with precisely one primary index component.

As Table 5.3 shows, all of the indexes, including the text index, city index, and coordinate

index, had 240 components, which was identical to the number of components of the primary

115

Page 133: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

index. Fig. 5.14 shows the response time of queries against the dataset ingested using the

correlated merge policy. As with the prefix merge policy, we can see that using the secondary

filters optimization can significantly improve performance. Moreover, since the secondary

index components are perfectly aligned with the primary index, the filter values coming from

the secondary index can overlap with at most 3 primary index components (the aligned one

and its left and right neighbors). Thus, the speedup here was higher than the experiment on

the dataset with the prefix merge policy. Especially for the case of R-tree index in Fig. 5.14f,

the time reduction can be up to 70% for the query with a 2.75% selectivity, while the same

query can only have a 35% reduction in the prefix merge case.

5.4.4 Analysis of the Improvement

In this section, we drill down further to analyze the internal access details to identify where

the speed improvement came from. The cost of secondary-to-primary search is composed

of the secondary index search, sort, and primary index search. Since using secondary filter

optimization does not change the secondary search or sort operation much, we focus on how

the cost of the primary index search has changed.

To search for a specific pk on n primary index components, the search operator checks for

pk in the latest component to the earliest one until pk is found. For each component, it first

checks the Bloom filter to avoid unnecessary B-tree searches. If the Bloom filter matches

pk, the B-tree search will look for pk. Note that there could be a false positive match

from the Bloom filter, which results in extra B-tree lookups. By using the secondary filter

optimization, we can reduce this cost by searching fewer primary components instead of

all the n components. Correspondingly, there are fewer Bloom filter searches and also less

false-positive B-tree searches.

We can use one of the keyword search queries matching the keyword “flood” as an example

116

Page 134: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

0

2

4

6

8

10

12

14

W/O W/I

Tim

e (

se

co

nd

s)

Initialize CursorBloom Filter

B-tree Search

(a) The Bloom filter time and B-tree search timecomparison when cache is clean.

0

2

4

6

8

10

12

14

W/O W/I

Tim

e (

se

co

nd

s)

Initialize CursorBloom Filter

B-tree Search

(b) The Bloom filter time and B-tree search timecomparison when records are cached.

Figure 5.15: Break down the performance gain between not using (W/O) and using (W/I)the secondary filter plan.

to verify the analysis. Internally, we added code to collect the numbers of Bloom filter

searches and B-tree searches as well as their corresponding timing. The experiment was

conducted on the dataset built using the prefix merge policy.

Count Original Secondary filter optimization ReductionBloom filter checks 4,261,002 108,433 97.5%B-tree searches 66,920 27,623 58.7%False Positive B-tree searches 40,046 749 98.1%

Table 5.4: The number of Bloom filter checks and primary B-tree searches to find “flood”related records.

Table 5.4 shows the number of Bloom filter searches and B-tree searches. As expected, both

numbers were greatly reduced by using the secondary index filter. The Bloom filter searches

were reduced by about 97.5%. Correspondingly, the false-positive B-tree searches were also

reduced by a similar factor (98.1%). Therefore, the total number of B-tree searches was

greatly decreased.

Fig. 5.15a shows the time spent on the Bloom filter search and B-tree search when the buffer

cache of the database was initially empty. We can see the Bloom filter search time was

greatly reduced when the secondary filter optimization was used. Surprisingly, however, the

117

Page 135: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

B-tree search time did not change much, so the performance gain was mainly due to the

4 million fewer Bloom filter searches. The reason for this result is that the B-tree search

time was mainly spent on fetching cold pages. Since the primary keys were ordered, false

positive searches may not need to load new pages, so the entire B-tree time did not reduce

much. To verify this, we conducted another experiment to check the numbers when the

B-tree pages were already cached (by running the same query multiple times). The result is

shown in Fig. 5.15b. In this in-memory setting, the B-tree search time reduced by more than

half (from 0.33 seconds to 0.12 seconds), which was consistent with our previous analysis.

However, compared to the 2.5-second difference that was saved by reducing Bloom filter

searches, the absolute time benefit from saved B-tree searches was small.

Besides the Bloom filter and B-tree search, for each pk, the system also needs to initialize

the search cursor for every filtered component. Though one such initialization operation is

as cheap as a Bloom filter search, as the number of pk’s and the number of components n

increases, the overall time for the cursor initialization adds up. Based on the result shown

in Fig. 5.15, by using the information from the secondary filter, the initialization time was

also significantly reduced, this is because many components were filtered and thus there was

no need to initialize their search cursors.

In summary, by leveraging the secondary filter, we are able to reduce the query time by

skipping many unnecessary searches of irrelevant components. The main contributor to this

savings comes from reducing many “cheap” in-memory operations.

5.5 Related Work

The work in [62] introduced an index structure called LHAM for transaction-time temporal

data. Instead of using an attribute from records, their solution uses the insertion timestamp

118

Page 136: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

as the record version number, and the data is partitioned into successive components based

on the timestamps of the record versions. Queries with temporal predicates only need to

access those components that satisfy the predicates, and the remaining components can be

skipped. HBase uses an idea similar to LHAM by using LSM-tree to store incoming records

and maintaining a timestamp for each record for versioning purposes. Each LSM component

is then tagged with the minimum and maximum timestamps of the records contained in

the component. When answering a query with temporal predicates, HBase can leverage the

minimum and maximum timestamp filters to only access those relevant components, which

reduces query time.

The main difference between our work and the previous work, including the original use of

filter in AsterixDB, is that we do not require the query to have filter-related predicates. Our

secondary filter optimization can be applied to leverage the presence of filters whenever there

is a secondary-to-primary index search, which is a much more general query setting.

5.6 Conclusion

In this chapter, we presented a method of using LSM filters to accelerate the secondary-to-

primary LSM index search process in AsterixDB. This technique automatically propagates

filter range values from secondary index searches to speed up the primary index search.

Importantly, the technique does not rely on the predicate of the query. Thus, it can benefit

a broader range of queries.

We have designed and implemented this technique in AsterixDB. Our experiments showed

that the new approach can reduce query execution times by 20% to 70% for different queries

with various selectivities. The lower the selectivity of the predicate, thus the more keys in-

volved, the more significant the improvement becomes. With this improvement, Cloudberry

119

Page 137: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

can choose to generate mini-queries that have much wider time range predicates, which will

introduce less overhead for progressive query answering. We also analyzed the reasons for

the improvement in detail. We found that a large portion of the observed speedup is from

saving many in-memory operations, which reveals the fact that many individual “cheap”

in-memory operations are no longer cheap when their number becomes large.

120

Page 138: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Chapter 6

Conclusions and Future Work

6.1 Conclusions

In this thesis, we have presented a general-purpose middleware system called Cloudberry

that sits between a frontend visualization application and a backend database system to

support interactive analytics and visualization on large datasets.

In Chapter 3 we presented the design, implementation, and use cases of Cloudberry. We

showed that it can reduce the query response time significantly by utilizing materialized views

stored in the database. By using an example application, TwitterMap, we demonstrated its

suitability to support interactive data analytics and visualization on more than one billion

tweets. By using an example Paraview web application and connectors to different databases,

we demonstrated that Cloudberry is a general-purpose middleware system.

In Chapter 4 we studied how to progressively answer a time-consuming query on a large

data set by generating a sequence of mini-queries. We formulated an optimization problem

to produce the predicates of mini-queries by considering both their total running time as well

121

Page 139: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

as the smoothness of result delivery, the key goal being to provide the incremental results

at a rhythmic pace to improve the user experience. We developed an adaptive framework

called Drum that can collect run-time behavioral statistics of the database system to decide

on the predicate for the next mini-query appropriately. Drum is a general-purpose technique

that requires no changes to the underlying database system. Our experiments on a large,

real data set showed that Drum can reduce the delay of delivering intermediate results to the

user without increasing the total query time much. Drum enables users to see smooth result

updates at the expected rhythm.

In Chapter 5 we presented a way to use LSM filters to accelerate secondary-to-primary

LSM index searches in Apache AsterixDB. This technique automatically propagates filter

values from the secondary index to speed up the primary index search. It does not rely on

having a filter predicate in the user query, so a broader range of queries can benefit from the

pruning power of the filter. We have designed and implemented the technique in AsterixDB.

Experiments showed that it can reduce the query time by 20% to 70% for different queries

with various selectivities; the lower the selectivity of the predicate, the more significant the

improvement can have. With this improvement, Cloudberry can generate mini-queries that

contain much wider time range predicates, resulting in less overhead for progressive query

evaluation. We analyzed the reason for this improvement and found that a large portion

of the speedup was from saving many in-memory operations - many individual “cheap” in-

memory operations are no longer cheap when their number becomes large.

6.2 Future Work

Our work on Cloudberry has identified a number of interesting opportunities for follow-on

work.

122

Page 140: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Aggregation Views

In Chapter 3 we have shown the general logic of using views to answer queries. We mainly

focused on subset views that contain records matching certain filter conditions. One possible

future direction is to exploit aggregation view techniques to extend the answering-query-

using-view logic to support aggregation queries. For example, it would be natural to ask

for the distribution of the total number of tweets for each state or different days. We

can store the by-state and by-day counts in a separate materialized view in order to skip

the time-consuming scan and aggregation steps on many historical records by fetching the

aggregated results from a much smaller view. Moreover, we can exploit the hierarchical

information provided by the Cloudberry DDL to store the finer granularity aggregation

results to maximize the utilization of views. For example, since we know there are hierarchical

relationships between the stateID, countyID, and cityID fields, then instead of storing

the by-state aggregation results, we can store the by-city results. In this case, the view can

also be useful if a query asks for by-county or by-state aggregation results.

Updates in the Underlying Database

We made two assumptions about the datasets: 1) the dataset should be append-only; 2)

there are no too late records whose difference between the event time and ingestion time

is more than a delay tolerant time. We want to support the occasional update or delete

operations, and late records. One possible way could be to “watch” the database system

logs, if it has such feature. Then Cloudberry will not be blind to the underlying updates

and can maintain the views accordingly to make sure that the content of a view remains

consistent with the base datasets.

123

Page 141: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Approximate Answering by Sampling in Middleware

In Fig. 3.10 we introduced the “estimate” feature, allowing some simple requests to be an-

swered in the middleware layer without accessing the database. To support this feature,

the middleware needs to periodically collect the metrics (e.g., min, max, cardinality, etc) of

certain dimensions and to store the values in a separate metadata dataset. This statistical

information can also be used to support a sampling approach to approximate answering.

For example, instead of query slicing, a user may want to get a rough but fast estimation

of the by-state distribution of tweets that mention certain keywords, e.g., “hurricane”. One

estimation approach can be fetching a certain number of samples, e.g., 1,000 tweets, from

the tweets in each state to get the percentages of tweets contain “hurricane” for each of

them. We can then give an estimated count of each state by multiplying the total number of

tweets in each state with each sampled percentage. To enable such an estimation, we need

to know the total number of tweets for all the states in advance. We also need to implement

the corresponding logic of calculating the error bounds of the estimation in the middleware

system accordingly.

Slicing on More Attributes

In Chapter 4 we studied the query-slicing technique and mainly focused on slicing on a time

dimension. One future direction could be to extend the techniques to do slicing on other

dimensions. For example, it might be reasonable to slice a query by geographical region so

that the frontend user interface can show the results progressively in the spatial dimension.

This feature still relies on the underlying database’s support to speed up the mini-queries’

performance. Compared to the time dimension, the values of other attributes, e.g., spatial

coordinates, are less likely to be consistent with the ingestion order of records. Consequently,

this could result in highly overlapping filter ranges between LSM components, making them

124

Page 142: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

ineffective for pruning. One possible direction could be to periodically re-partition the com-

ponents based on the declared filter values so that most of the components can have disjoint

filter ranges afterward. The mini-query with a geographic boundary predicate can then still

skip many irrelevant components so that the overall time of query slicing can be reduced.

More Mergeable Queries

Another challenge of query-slicing is that the mini-query results may not always be mergeable

in a straightforward way. For a query with distributive functions (e.g., sum(), count(),

min(), max()), the results can be merged by applying the function on the partial result

of the same groups. For the algebraic function case (e.g., avg()), the middleware needs

to rewrite the query to get all algebraic elements and then apply the algebraic function

similarly to the result of the same group. In contrast, a sort query is not easily mergeable

in general. However, if the ordering attribute is same as the slicing attribute, the result is

still mergeable by a simple union operation if the whole value range of the sorting attribute

is known a priori. For example, if a request wants to sort tweets by the number of “likes”

they have received and the middleware knows the distribution of the “like”s, it can compose

a series of mini-queries by attaching disjoint range predicates to the number of “like”s. Thus,

the results can be concatenated directly without extra processing. For an aggregation query

with holistic aggregation functions, e.g., median() or topk(), it cannot be simply merged

because it requires global information. One possible solution could be to rewrite the mini-

query to ask for a richer context of the result. For example, we can request the top 20 results

in the mini-query in order to return the top 10 results in the merging case. We could even

re-issue the previous mini-query if the previous top 20 results are not adequate to produce

the final results. In this way, we can make query slicing more general to support a larger set

of progressive display options.

125

Page 143: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Batch Primary Index Lookup

In Chapter 5 we investigated filter-based acceleration for the secondary-to-primary index

search process. We found that a significant speedup is from saving many in-memory opera-

tions. This observation suggests improving the secondary-to-primary index search path. For

example, there could be a specialized primary index set-based search operator that opens

the cursor only once for all primary key probes within the set. By doing so, we could save

many repeated open operations for each primary key search.

126

Page 144: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Bibliography

[1] Apache HBase Website. http://hbase.apache.org/.

[2] S. Abiteboul and O. M. Duschka. Complexity of answering queries using materializedviews. In A. O. Mendelzon and J. Paredaens, editors, Proceedings of the SeventeenthACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June1-3, 1998, Seattle, Washington, USA, pages 254–263. ACM Press, 1998.

[3] S. Agarwal, R. Agrawal, P. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan,and S. Sarawagi. On the computation of multidimensional aggregates. In T. M. Vijayara-man, A. P. Buchmann, C. Mohan, and N. L. Sarda, editors, VLDB’96, Proceedings of22th International Conference on Very Large Data Bases, September 3-6, 1996, Mumbai(Bombay), India, pages 506–521. Morgan Kaufmann, 1996.

[4] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. Blinkdb:queries with bounded errors and bounded response times on very large data. InZ. Hanzálek, H. Härtig, M. Castro, and M. F. Kaashoek, editors, Eighth Eurosys Con-ference 2013, EuroSys ’13, Prague, Czech Republic, April 14-17, 2013, pages 29–42.ACM, 2013.

[5] G. A. Agha. ACTORS - a model of concurrent computation in distributed systems. MITPress series in artificial intelligence. MIT Press, 1990.

[6] S. Agrawal, S. Chaudhuri, and V. R. Narasayya. Automated selection of materializedviews and indexes in SQL databases. In A. E. Abbadi, M. L. Brodie, S. Chakravarthy,U. Dayal, N. Kamel, G. Schlageter, and K. Whang, editors, VLDB 2000, Proceedingsof 26th International Conference on Very Large Data Bases, September 10-14, 2000,Cairo, Egypt, pages 496–505. Morgan Kaufmann, 2000.

[7] S. Agrawal, E. Chu, and V. R. Narasayya. Automatic physical design tuning: workloadas a sequence. In S. Chaudhuri, V. Hristidis, and N. Polyzotis, editors, Proceedings ofthe ACM SIGMOD International Conference on Management of Data, Chicago, Illinois,USA, June 27-29, 2006, pages 683–694. ACM, 2006.

[8] M. Ahmad, S. Duan, A. Aboulnaga, and S. Babu. Predicting completion times of batchquery workloads using interaction-aware models and simulation. In ACM EDBT, pages449–460, 2011.

127

Page 145: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

[9] Airbnb superset, https://github.com/airbnb/superset.

[10] T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. Fernández-Moctezuma, R. Lax,S. McVeety, D. Mills, F. Perry, E. Schmidt, and S. Whittle. The dataflow model: A prac-tical approach to balancing correctness, latency, and cost in massive-scale, unbounded,out-of-order data processing. PVLDB, 8(12):1792–1803, 2015.

[11] Akka Website, https://akka.io/.

[12] I. Alagiannis, D. Dash, K. Schnaitter, A. Ailamaki, and N. Polyzotis. An automated,yet interactive and portable DB designer. In A. K. Elmagarmid and D. Agrawal, editors,Proceedings of the ACM SIGMOD International Conference on Management of Data,SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010, pages 1183–1186. ACM,2010.

[13] S. Alsubaiee, Y. Altowim, H. Altwaijry, A. Behm, V. R. Borkar, Y. Bu, M. J. Carey,I. Cetindil, M. Cheelangi, K. Faraaz, E. Gabrielova, R. Grover, Z. Heilbron, Y. Kim,C. Li, G. Li, J. M. Ok, N. Onose, P. Pirzadeh, V. J. Tsotras, R. Vernica, J. Wen, andT. Westmann. AsterixDB: A scalable, open source BDMS. PVLDB, 7(14):1905–1916,2014.

[14] S. Alsubaiee, A. Behm, V. R. Borkar, Z. Heilbron, Y.-S. Kim, M. J. Carey, M. Dreseler,and C. Li. Storage management in asterixdb. PVLDB, 7(10):841–852, 2014.

[15] S. Alsubaiee, M. J. Carey, and C. Li. Lsm-based storage and indexing: An old ideawith timely benefits. In Second International ACM Workshop on Managing and MiningEnriched Geo-Spatial Data, GeoRich at SIGMOD 2015, Melbourne, VIC, Australia,May 31, 2015, pages 1–6, 2015.

[16] M. Altinel, C. Bornhövd, S. Krishnamurthy, C. Mohan, H. Pirahesh, and B. Reinwald.Cache tables: Paving the way for an adaptive database cache. In J. C. Freytag, P. C.Lockemann, S. Abiteboul, M. J. Carey, P. G. Selinger, and A. Heuer, editors, VLDB2003, Proceedings of 29th International Conference on Very Large Data Bases, Septem-ber 9-12, 2003, Berlin, Germany, pages 718–729. Morgan Kaufmann, 2003.

[17] Angular JavaScript, https://angularjs.org/.

[18] Army Research Lab, https://www.arl.army.mil/.

[19] E. Baralis, S. Paraboschi, and E. Teniente. Materialized views selection in a multi-dimensional database. In M. Jarke, M. J. Carey, K. R. Dittrich, F. H. Lochovsky,P. Loucopoulos, and M. A. Jeusfeld, editors, VLDB’97, Proceedings of 23rd Interna-tional Conference on Very Large Data Bases, August 25-29, 1997, Athens, Greece, pages156–165. Morgan Kaufmann, 1997.

[20] L. Battle, R. Chang, and M. Stonebraker. Dynamic prefetching of data tiles for interac-tive visualization. In Proceedings of the 2016 International Conference on Managementof Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016,pages 1363–1375, 2016.

128

Page 146: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

[21] N. Bhatti, A. Bouch, and A. Kuchinsky. Integrating user-perceived quality into webserver design. Computer Networks, 33(1):1–16, 2000.

[22] Apache Cassandra, http://cassandra.apache.org/.

[23] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra,A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data.ACM Trans. Comput. Syst., 26(2):4:1–4:26, 2008.

[24] S. Chaudhuri and V. R. Narasayya. Self-tuning database systems: A decade of progress.In C. Koch, J. Gehrke, M. N. Garofalakis, D. Srivastava, K. Aberer, A. Deshpande,D. Florescu, C. Y. Chan, V. Ganti, C. Kanne, W. Klas, and E. J. Neuhold, editors,Proceedings of the 33rd International Conference on Very Large Data Bases, Universityof Vienna, Austria, September 23-27, 2007, pages 3–14. ACM, 2007.

[25] R. Chirkova and J. Yang. Materialized views. Foundations and Trends in Databases,4(4):295–405, 2012.

[26] Cloudberry Project Website, http://cloudberry.ics.uci.edu/.

[27] Cloudberry Wiki Page, https://github.com/ISG-ICS/cloudberry/wiki/Documentation-for-Cloudberry-(Using-SQL-Database).

[28] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. MapRe-duce Online. In Proc. USENIX Symposium on NSDI, pages 313–328, 2010.

[29] M. Csikszentmihalyi. Flow and the psychology of discovery and invention. New York:Harper Collins, 1996.

[30] S. Dar, M. J. Franklin, B. Þ. Jónsson, D. Srivastava, and M. Tan. Semantic datacaching and replacement. In T. M. Vijayaraman, A. P. Buchmann, C. Mohan, and N. L.Sarda, editors, VLDB’96, Proceedings of 22th International Conference on Very LargeData Bases, September 3-6, 1996, Mumbai (Bombay), India, pages 330–341. MorganKaufmann, 1996.

[31] J. Du, R. J. Miller, B. Glavic, and W. Tan. Deepsea: Progressive workload-awarepartitioning of materialized views in scalable data analytics. In V. Markl, S. Orlando,B. Mitschang, P. Andritsos, K. Sattler, and S. Breß, editors, Proceedings of the 20thInternational Conference on Extending Database Technology, EDBT 2017, Venice, Italy,March 21-24, 2017., pages 198–209. OpenProceedings.org, 2017.

[32] S. Egger, T. Hoßfeld, R. Schatz, and M. Fiedler. Waiting times in quality of experiencefor web based services. In QoMEX, pages 86–96, 2012.

[33] A. Eldawy, M. F. Mokbel, and C. Jonathan. Hadoopviz: A mapreduce framework forextensible visualization of big spatial data. In 32nd IEEE International Conferenceon Data Engineering, ICDE 2016, Helsinki, Finland, May 16-20, 2016, pages 601–612.IEEE Computer Society, 2016.

129

Page 147: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

[34] M. Fiedler et al. State-of-the-art with regards to user-perceived Quality of Service andquality feedback. Euro-NGI Deliverable D. WP. JRA. 6. 1. 1, 2004.

[35] D. Florescu, A. Y. Levy, D. Suciu, and K. Yagoub. Optimization of run-time man-agement of data intensive web-sites. In VLDB’99, Proceedings of 25th InternationalConference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland,UK, pages 627–638, 1999.

[36] A. U. Frank, I. Campari, and U. Formentini, editors. Theories and Methods of Spatio-Temporal Reasoning in Geographic Space, International Conference GIS - From Space toTerritory: Theories and Methods of Spatio-Temporal Reasoning, Pisa, Italy, September21-23, 1992, Proceedings, volume 639 of Lecture Notes in Computer Science. Springer,1992.

[37] P. Godfrey, J. Gryz, and P. Lasek. Interactive visualization of large data sets. IEEETrans. Knowl. Data Eng., 28(8):2142–2157, 2016.

[38] J. Goldstein and P. Larson. Optimizing queries using materialized views: A practical,scalable solution. In Proceedings of the 2001 ACM SIGMOD international conferenceon Management of data, Santa Barbara, CA, USA, May 21-24, 2001, pages 331–342,2001.

[39] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow,and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by,cross-tab, and sub-totals. Data Min. Knowl. Discov., 1(1):29–53, Jan. 1997.

[40] A. Gupta and I. S. Mumick. Maintenance of materialized views: Problems, techniques,and applications. IEEE Data Eng. Bull., 18(2):3–18, 1995.

[41] A. Gupta and I. S. Mumick. Materialized views: techniques, implementations, andapplications. MIT press, 1999.

[42] P. J. Haas and J. M. Hellerstein. Ripple Joins for Online Aggregation. In Proc. ACMSIGMOD, pages 287–298, 1999.

[43] A. Y. Halevy. Answering queries using views: A survey. VLDB J., 10(4):270–294, 2001.

[44] V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently.In H. V. Jagadish and I. S. Mumick, editors, Proceedings of the 1996 ACM SIGMODInternational Conference on Management of Data, Montreal, Quebec, Canada, June4-6, 1996., pages 205–216. ACM Press, 1996.

[45] J. M. Hellerstein, R. Avnur, A. Chou, C. Hidber, C. Olston, V. Raman, T. Roth, andP. J. Haas. Interactive data analysis: The Control Project. IEEE Computer, 32(8):51–59, 1999.

[46] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online Aggregation. In Proc. ACMSIGMOD, pages 171–182, 1997.

130

Page 148: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

[47] S. Hopfer, M. Runnerstrom, J. Jia, T. Kim, and C. Li. Twitter coverage of climatechange and health before and after the 2016 us presidential election. In Proceedings ofthe 2017 American Public Health Association Annual Meeting, Nov. 4-8, 2017, Atlanta,Georgia, USA., 2017.

[48] S. Idreos, O. Papaemmanouil, and S. Chaudhuri. Overview of data exploration tech-niques. In T. K. Sellis, S. B. Davidson, and Z. G. Ives, editors, Proceedings of the2015 ACM SIGMOD International Conference on Management of Data, Melbourne,Victoria, Australia, May 31 - June 4, 2015, pages 277–281. ACM, 2015.

[49] S. Kandel, A. Paepcke, J. M. Hellerstein, and J. Heer. Enterprise data analysis andvisualization: An interview study. IEEE Trans. Vis. Comput. Graph., 18(12):2917–2926, 2012.

[50] T. Kim, V. Thirumaraiselvan, J. Jia, and C. Li. Caching geospatial objects in webbrowsers. In Proceedings of the 25th ACM SIGSPATIAL International Conference onAdvances in Geographic Information Systems, GIS 2017, Redondo Beach, California,USA, November 7 - November 10, 2017, 2017.

[51] Y.-S. Kim. Transactional and Spatial Query Processing in the Big Data Era. PhDthesis, UNIVERSITY OF CALIFORNIA, IRVINE, 2016.

[52] Apache Kylin, http://kylin.apache.org/.

[53] I. Lazaridis and S. Mehrotra. Progressive Approximate Aggregate Queries with a Multi-Resolution Tree Structure. In Proc. ACM SIGMOD, pages 401–412, 2001.

[54] LevelDB, URL: https://github. com/google/leveldb.

[55] L. D. Lins, J. T. Klosowski, and C. E. Scheidegger. Nanocubes for real-time explorationof spatiotemporal datasets. IEEE Trans. Vis. Comput. Graph., 19(12):2456–2465, 2013.

[56] Z. Liu, B. Jiang, and J. Heer. imMens : Real-time visual querying of big data. Comput.Graph. Forum, 32(3):421–430, 2013.

[57] A. Magdy and M. F. Mokbel. Demonstration of kite: A scalable system for microblogsdata management. In 33rd IEEE International Conference on Data Engineering, ICDE2017, San Diego, CA, USA, April 19-22, 2017, pages 1383–1384. IEEE Computer So-ciety, 2017.

[58] Mapd demo. https://www.mapd.com/demos/taxis.

[59] T. Marrinan, J. Aurisano, A. Nishimoto, K. Bharadwaj, V. A. Mateevitsi, L. Renambot,L. Long, A. E. Johnson, and J. Leigh. SAGE2: A new approach for data intensive col-laboration using scalable resolution shared displays. In E. Bertino, S. Chen, K. Aberer,P. Krishnamurthy, and M. Kantarcioglu, editors, 10th IEEE International Conferenceon Collaborative Computing: Networking, Applications and Worksharing, Collaborate-Com 2014, Miami, Florida, USA, October 22-25, 2014, pages 177–186. ICST / IEEE,2014.

131

Page 149: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

[60] S. Masri, J. Jia, C. Li, G. Zhou, M.-C. Lee, G. Yan, and J. Wu. Use of twitter data topredict zika virus cases in the united states during the 2016 epidemic. 2017.

[61] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassi-lakis. Dremel: interactive analysis of web-scale datasets. Commun. ACM, 54(6):114–123,2011.

[62] P. Muth, P. E. O’Neil, A. Pick, and G. Weikum. Design, implementation, and per-formance of the LHAM log-structured history data access method. In A. Gupta,O. Shmueli, and J. Widom, editors, VLDB’98, Proceedings of 24rd International Con-ference on Very Large Data Bases, August 24-27, 1998, New York City, New York,USA, pages 452–463. Morgan Kaufmann, 1998.

[63] P. E. O’Neil, E. Cheng, D. Gawlick, and E. J. O’Neil. The log-structured merge-tree(lsm-tree). Acta Inf., 33(4):351–385, 1996.

[64] N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online Aggregation for LargeMapReduce Jobs. PVLDB, 4(11):1135–1145, 2011.

[65] S. Papadomanolakis and A. Ailamaki. Autopart: Automating schema design for largescientific databases using data partitioning. In Proceedings of the 16th InternationalConference on Scientific and Statistical Database Management (SSDBM 2004), 21-23June 2004, Santorini Island, Greece, pages 383–392. IEEE Computer Society, 2004.

[66] ParaViewWeb Website, https://www.paraview.org/web/.

[67] R. Pottinger and A. Y. Halevy. Minicon: A scalable algorithm for answering queriesusing views. VLDB J., 10(2-3):182–198, 2001.

[68] S. Rahman, M. Aliakbarpour, H. Kong, E. Blais, K. Karahalios, A. G. Parameswaran,and R. Rubinfeld. I’ve seen “enough”: Incrementally improving visualizations to supportrapid decision making. PVLDB, 10(11):1262–1273, 2017.

[69] V. Raman and J. M. Hellerstein. Partial results for online query processing. In M. J.Franklin, B. Moon, and A. Ailamaki, editors, Proc.ACM SIGMOD, pages 275–286,2002.

[70] J. Rao, C. Zhang, N. Megiddo, and G. M. Lohman. Automating physical databasedesign in a parallel database. In M. J. Franklin, B. Moon, and A. Ailamaki, editors,Proceedings of the 2002 ACM SIGMOD International Conference on Management ofData, Madison, Wisconsin, June 3-6, 2002, pages 558–569. ACM, 2002.

[71] K. A. Ross and D. Srivastava. Fast computation of sparse datacubes. In M. Jarke,M. J. Carey, K. R. Dittrich, F. H. Lochovsky, P. Loucopoulos, and M. A. Jeusfeld,editors, VLDB’97, Proceedings of 23rd International Conference on Very Large DataBases, August 25-29, 1997, Athens, Greece, pages 116–125. Morgan Kaufmann, 1997.

132

Page 150: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

[72] E. A. Rundensteiner, M. O. Ward, Z. Xie, Q. Cui, C. V. Wad, D. Yang, and S. Huang.Xmdvtoolq : : quality-aware interactive data exploration. In Proceedings of the ACMSIGMOD International Conference on Management of Data, Beijing, China, June 12-14, 2007, pages 1109–1112, 2007.

[73] K. Schnaitter, S. Abiteboul, T. Milo, and N. Polyzotis. On-line index selection forshifting workloads. In Proceedings of the 23rd International Conference on Data En-gineering Workshops, ICDE 2007, 15-20 April 2007, Istanbul, Turkey, pages 459–468.IEEE Computer Society, 2007.

[74] B. Shneiderman. The eyes have it: A task by data type taxonomy for information visu-alizations. In Proceedings of the 1996 IEEE Symposium on Visual Languages, Boulder,pages 336–343, 1996.

[75] Y. Sismanis, A. Deligiannakis, N. Roussopoulos, and Y. Kotidis. Dwarf: shrinkingthe petacube. In M. J. Franklin, B. Moon, and A. Ailamaki, editors, Proceedings ofthe 2002 ACM SIGMOD International Conference on Management of Data, Madison,Wisconsin, June 3-6, 2002, pages 464–475. ACM, 2002.

[76] Y. Skadberg and J. R. Kimmel. Visitors’ flow experience while browsing a Web site: itsmeasurement, contributing factors and consequences. Computers in Human Behavior,20(3):403–422, 2004.

[77] J. Song, C. Guo, Z. Wang, Y. Zhang, G. Yu, and J. Pierson. Haolap: A hadoop basedOLAP system for big data. Journal of Systems and Software, 102:167–181, 2015.

[78] C. Stolte and P. Hanrahan. Polaris: A system for query, analysis and visualization ofmulti-dimensional relational databases. In J. D. Mackinlay, S. F. Roth, and D. A. Keim,editors, IEEE Symposium on Information Visualization 2000 (INFOVIS’00), Salt LakeCity, Utah, USA, October 9-10, 2000., pages 5–14. IEEE Computer Society, 2000.

[79] S. Su, V. Perry, N. Cantner, D. Kobayashi, and J. Leigh. High-resolution interactive andcollaborative data visualization framework for large-scale data analysis. In W. W. Smariand J. Natarian, editors, 2016 International Conference on Collaboration Technologiesand Systems, CTS 2016, Orlando, FL, USA, October 31 - November 4, 2016, pages275–280. IEEE Computer Society, 2016.

[80] Apache Superset, https://superset.incubator.apache.org/.

[81] K.-L. Tan, C. H. Goh, and B. C. Ooi. Query rewriting for SWIFT (first) answers. IEEETrans. Knowl. Data Eng., 12(5):694–714, 2000.

[82] Y. Tao, G. Kollios, J. Considine, F. Li, and D. Papadias. Spatio-Temporal AggregationUsing Sketches. In Proc. ICDE, pages 214–225, 2004.

[83] P. Terlecki, F. Xu, M. Shaw, V. Kim, and R. M. G. Wesley. On improving user responsetimes in tableau. In T. K. Sellis, S. B. Davidson, and Z. G. Ives, editors, Proceedings ofthe 2015 ACM SIGMOD International Conference on Management of Data, Melbourne,Victoria, Australia, May 31 - June 4, 2015, pages 1695–1706. ACM, 2015.

133

Page 151: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

[84] TwitterMap demo, http://cloudberry.ics.uci.edu/demos/twittermap/.

[85] A. Van Moorsel. Metrics for the internet age: Quality of experience and quality ofbusiness. In Fifth International Workshop on Performability Modeling of Computer andCommunication Systems, volume 34, pages 26–31, 2001.

[86] M. Vartak, S. Madden, A. G. Parameswaran, and N. Polyzotis. SEEDB: automaticallygenerating query visualizations. PVLDB, 7(13):1581–1584, 2014.

[87] R. M. G. Wesley, M. Eldridge, and P. Terlecki. An analytic data engine for visualizationin tableau. In T. K. Sellis, R. J. Miller, A. Kementsietsidis, and Y. Velegrakis, editors,Proceedings of the ACM SIGMOD International Conference on Management of Data,SIGMOD 2011, Athens, Greece, June 12-16, 2011, pages 1185–1194. ACM, 2011.

[88] A. K. Wong. A literature review of the impact of flow on human-computer interactions(hci)–the study of a fundamental ingredient in the effective use of computers. In Proc.IAMB, 2006.

[89] E. Wu, L. Battle, and S. R. Madden. The case for data visualization managementsystems. PVLDB, 7(10):903–906, 2014.

[90] E. Wu, F. Psallidas, Z. Miao, H. Zhang, and L. Rettig. Combining design and per-formance in a data visualization management system. In CIDR 2017, 8th BiennialConference on Innovative Data Systems Research, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings. www.cidrdb.org, 2017.

[91] W. Wu, Y. Chi, H. Hacigümüs, and J. F. Naughton. Towards Predicting Query Execu-tion Time for Concurrent and Dynamic Database Workloads. PVLDB, 6(10):925–936,2013.

[92] W. Wu, X. Wu, H. Hacigümüs, and J. F. Naughton. Uncertainty Aware Query ExecutionTime Prediction. PVLDB, 7(14):1857–1868, 2014.

[93] P. Xiong, Y. Chi, S. Zhu, J. Tatemura, C. Pu, and H. Hacigümüs. ActiveSLA: aprofit-oriented admission control framework for database-as-a-service providers. In ACMSOSP, page 15, 2011.

[94] K. Yagoub, D. Florescu, V. Issarny, and P. Valduriez. Caching strategies for data-intensive web sites. In VLDB 2000, Proceedings of 26th International Conference onVery Large Data Bases, September 10-14, 2000, Cairo, Egypt, pages 188–199, 2000.

[95] F. Yang, E. Tschetter, X. Léauté, N. Ray, G. Merlino, and D. Ganguli. Druid: areal-time analytical data store. In International Conference on Management of Data,SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, pages 157–168, 2014.

[96] J. Yu, R. Moraffah, and M. Sarwat. Hippo in action: Scalable indexing of a billionnew york city taxi trips and beyond. In 33rd IEEE International Conference on DataEngineering, ICDE 2017, San Diego, CA, USA, April 19-22, 2017, pages 1413–1414.IEEE Computer Society, 2017.

134

Page 152: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Appendix A

Solve the Optimal ri in the Drum

Framework

A.1 The Optimal ri using Histogram Distribution

The expectation score of using histogram is

E(ri) =riI− α

CiLi

(∫ Tm+1

Li

(1− yb

)Prmb

(t−Li)dt+n∑

k=m+1

( ∫ Tk+1

Tk

Prkb

(t−Li)dt)). (A.1)

It can be solved as

E(ri) =riI− α

CiLi

(Prmb

(1− y

b)(t2

2− Lit) |Tm+1

Li+

1

b

n∑k=m+1

(Prk(t2

2− Lit) |Tk+1

Tk)). (A.2)

Let Tm+1 = Li − y + b, then

Prmb

(1− y

b)(t2

2− Lit) |Tm+1

Li=Prm2b2

(b− y)3. (A.3)

135

Page 153: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Let Tk = Li − y + (k −m)b, then

1

b

n∑k=m+1

(Prk(

t2

2− Lit) |Tk+1

Tk

)=

n∑k=m+1

Prk

((1

2+ i− k)b− y

). (A.4)

Then, for a given m, the expectation function is as following:

E(ri) =riI− α

CiLi

(Prm2b2

(b− y)3 −n∑

k=m+1

Prky + bn∑

k=m+1

Prk(1

2+ k −m)

). (A.5)

To get the critical value, by solving the E ′(ri) = 0, we can get the following form:

(b− ymax)2 =2b2

3Prm(CiLiαa1I

−n∑

k=m+1

Prk). (A.6)

Using this equation we can compute the corresponding ymax value. We compute the best

f(ri) as:

f(ri) = Li − ymax − (m− g − 1/2)b.

A.2 The Optimal ri using Gaussian Distribution

The expectation score function is

E(score(qi)) =riI−α

∫ ∞Li

ti − LiCiLi

P (t|f(ri))dt =riI− α

CiLi

∫ ∞Li

(t−Li)1√2πσ

e−( t−f(ri)√

2σ)2

)dt

(A.7)

Let

y =t− f(ri)√

2σ, (A.8)

136

Page 154: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

the penalty part of the function can be written as following

Penalty =

∫ ∞Li

(t− Li)1√2πσ

e−( t−f(ri)√

2σ)2dt

=

∫ ∞Li−f(ri)√

(√

2σy + f(ri)− Li)1√πe−y

2

dy

=

√2σ√π

∫ ∞Li−f(ri)√

ye−y2

dy

+f(ri)− Li√

π

∫ ∞Li−f(ri)√

e−y2

dy

(A.9)

Let

z =f(ri)− Li√

2σ, (A.10)

then

Penalty =σ√2πe−z

2

+σz√

2(1 + erf(z)). (A.11)

Then the score function is

E(score) =rxI− ασ

CiLi√2

( 1√πe−z

2+ z(1 + erf(z)

)). (A.12)

The critical point of ri is solved by let E ′(ri) = 0.

E′(ri) =1

I

a1√2σ

ασ

CiLi√2

d(

1√πe−z

2+ z(1 + erf(z)

))dz

. (A.13)

137

Page 155: UC Irvine - cloudberry.ics.uci.educloudberry.ics.uci.edu/wp-content/uploads/2018/11/jianfeng-thesis.pdf · 4.6 The Histogram and Gaussian models. The errors are collected from the

Using

d( 1√πe−z

2)

dz= −2e−z

2z√

π(A.14)

and

d(z(1 + erf(z)

))dz

= erf(z) + 1 +2e−z

2z√

π(A.15)

Let equation A.13 = 0 will have

2CiLia1Iα

= 1 + erf(z) (A.16)

Based on the constraints of erf(z) ∈ (−1, 1) and f(ri) < Li, we will have the limit value

when the erf(z) ∈ (−1, 0), when

α >2CiLia1I

, (A.17)

the critical point is

zmax = erf−1(2CiLia1Iα

− 1) (A.18)

Finally we can get the ri by

ri =

√2σzmax + Li − a0

a1(A.19)

138