Wisteria: Nurturing Scalable Data Cleaning Infrastructure
Presented by: Ashkan Malekloo
Fall 2015
Wisteria: Nurturing Scalable Data Cleaning Infrastructure
Type: Demonstration paper
Authors: Daniel Haas, Sanjay Krishnan, Jiannan Wang, Michael J. Franklin, Eugene Wu
VLDB '15
Introduction
Dirty data
Data cleaning is often specific to the domain, the dataset, and the eventual analysis; analysts report spending upwards of 80% of their time on data cleaning problems. Possible issues:
What to extract
How to clean the data
Whether that cleaning will significantly change results
Example
Extraction Methods
While the extraction operation can be represented at a logical level by its input and output schema, there is a huge space of possible physical implementations of the logical operators:
Rule-based
Learning-based
Crowd-based
Or a combination of the three
Suppose we select a crowd-based operator as our extraction method; there are still many parameters that might influence the quality of the output:
the number of crowd workers
the amount each worker is paid
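The logical/physical split above can be sketched in a few lines (a toy illustration, not Wisteria's API; all function names and the phone-number rule are ours):

```python
import re

# One logical extraction operator, several interchangeable physical
# implementations. Rule-based: a regex for US-style phone numbers.
def rule_based_extract(text):
    return re.findall(r"\d{3}-\d{3}-\d{4}", text)

# Crowd-based: output quality depends on tuning knobs such as worker
# count and pay; here the crowd task submission is only stubbed out.
def crowd_extract(text, num_workers=3, pay_cents=5):
    return {"task": text, "workers": num_workers, "pay_cents": pay_cents}

print(rule_based_extract("Call 555-123-4567 or 555-765-4321"))
# -> ['555-123-4567', '555-765-4321']
```

Both implementations share one logical schema (text in, extracted values out), which is what lets a system search the space of physical choices.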
Example
Related works
ETL (Extract-Transform-Load)
Constraint Driven tools
Wrangler
OpenRefine
Crowd-based
Wisteria
a system designed to support the iterative development and optimization of data cleaning plans end-to-end
allows users to specify declarative data cleaning plans
Wisteria phases:
Sampling
Recommendation
Crowd Latency
Annotating Database Schemas to Help Enterprise Search
Presented by: Ashkan Malekloo
Fall 2015
Annotating Database Schemas to Help Enterprise Search
Type: Demonstration paper
Authors: Eli Cortez, Philip A. Bernstein, Yeye He, Lev Novik
VLDB '15
Introduction
In large enterprises, data discovery is a common problem faced by users who need to find relevant information in relational databases:
Finding tables that are relevant
Finding out whether a table is truly relevant
In this paper, their sample comprises 29 databases
639 tables
4216 data columns
Introduction
Many frequently-used column names are very generic:
Name
Id
Description
Field
Code
Column
These generic column names are useless for helping users find tables that have the data they need.
Barcelos
a system that automatically generates candidate keywords to annotate columns of database tables
by mining spreadsheets (spreadsheets are more readable)
Contribution
A method to automatically extract tables from a corpus of enterprise spreadsheets.
A method for identifying and ranking relevant column annotations, and an efficient technique for calculating it.
An implementation of our method, and an experimental evaluation that shows its efficiency and effectiveness.
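The annotation idea can be illustrated with a toy sketch (our own simplification, not Barcelos's actual ranking method): score candidate headers mined from spreadsheets by how strongly their values overlap with the generically named database column.

```python
from collections import Counter

def rank_annotations(column_values, spreadsheet_columns):
    # Score each spreadsheet header by value overlap with the DB column.
    scores = Counter()
    value_set = set(column_values)
    for header, values in spreadsheet_columns:
        overlap = len(value_set & set(values))
        if overlap:
            scores[header] += overlap
    return [header for header, _ in scores.most_common()]

# A database column blandly named "Name" ...
db_column = ["Seattle", "Redmond", "Bellevue"]
# ... matched against spreadsheet columns with descriptive headers.
sheets = [("City", ["Seattle", "Redmond", "Tacoma"]),
          ("Employee", ["Alice", "Bob"])]
print(rank_annotations(db_column, sheets))  # -> ['City']
```

The top-ranked header ("City") becomes a candidate annotation for the generic column, making it findable by enterprise search.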
Smart Drill-Down: A New Data Exploration Operator
Type: Demonstration Paper
Authors: Manas Joglekar, Hector Garcia-Molina (Stanford), Aditya Parameswaran (University of Illinois)
Presented by: Siddhant Kulkarni
Term: Fall 2015
Motivation
Drill-down -> data exploration
Drawbacks of the traditional drill-down operation:
Too many distinct values
One column at a time
Simultaneously drilling down on several columns presents too many value combinations
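A toy sketch of that blow-up (our own example, not the paper's): drilling down on two columns at once yields close to the product of the individual cardinalities.

```python
rows = [("US", "toy"), ("US", "book"), ("EU", "toy"), ("EU", "food"),
        ("ASIA", "toy"), ("ASIA", "book")]

# One column at a time: a handful of groups.
by_region = {region for region, _ in rows}
# Two columns simultaneously: many more distinct combinations to display.
by_region_and_product = {(region, product) for region, product in rows}

print(len(by_region), len(by_region_and_product))  # -> 3 6
```

On real tables with thousands of distinct values per column, the combination count quickly overwhelms the user, which motivates a "smart" drill-down that surfaces only interesting rules.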
Related Work
Interpretable and informative explanations of outcomes
User-adaptive exploration of multidimensional data
User-cognizant multidimensional analysis
Discovery-driven exploration of OLAP data cubes
Contribution
Provenance for SQL through Abstract Interpretation: Value-less, but Worthwhile
Type: Demonstration Paper
Authors: Tobias Müller, Torsten Grust (Universität Tübingen, Tübingen, Germany)
Presented by: Siddhant Kulkarni
Term: Fall 2015
The idea
Given a query, record its control-flow decisions (why-provenance) and its data access locations (where-provenance)
Without actual data values (value-less!), determine the why-origin and where-origin of a query
Motivation
DETERMINE THE I/O DEPENDENCIES OF REAL LIFE SQL QUERIES
How do they do it?
Step 1: Convert the SQL query into Python code
Step 2: Apply program splicing
Step 3: Apply abstract interpretation
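A hypothetical sketch of the intuition behind steps 1 and 3 (our own toy, not the paper's pipeline): a `WHERE` filter re-expressed as a Python loop that logs which cells it reads (where-provenance) and which control-flow decisions admit a row (why-provenance).

```python
# SELECT * FROM items WHERE price > 10, as an instrumented Python loop.
def run_with_provenance(table):
    where_prov, why_prov, result = [], [], []
    for i, row in enumerate(table):
        where_prov.append((i, "price"))   # cell read by the predicate
        if row["price"] > 10:             # recorded control-flow decision
            why_prov.append(i)            # this row justifies an output
            result.append(row)
    return result, where_prov, why_prov

rows = [{"price": 5}, {"price": 20}]
result, where_prov, why_prov = run_with_provenance(rows)
print(why_prov, where_prov)  # -> [1] [(0, 'price'), (1, 'price')]
```

Abstract interpretation lets the same bookkeeping be derived from the program structure alone, without running it over the actual values.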
Demo with PostgreSQL
EFQ: Why-Not Answer Polynomials in Action
Presented by: Omar Alqahtani
Fall 2015
Authors
Nicole Bidoit (Université Paris-Sud / Inria)
Mélanie Herschel (Universität Stuttgart)
Katerina Tzompanaki (Université Paris-Sud / Inria)
Motivation
Related Works
Explanations to Why-Not questions: data-based explanations, query-based explanations, or mixed
Contribution
Explain and Fix Query platform (EFQ) that enables users to execute queries, express a Why-Not question, and ask for:
Explanations to the Why-Not question: query-based Why-Not answer polynomials
Query refinements that produce the desired results, with a cost model for ranking them
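The query-based flavor can be sketched as follows (a toy illustration under our own names, not EFQ's polynomial machinery): walk the query's conditions and report the first one that the missing tuple fails.

```python
def why_not(missing_tuple, conditions):
    # Return the name of the first operator that prunes the tuple.
    for name, predicate in conditions:
        if not predicate(missing_tuple):
            return name
    return None  # the tuple should have appeared in the result

# "Why is this item not in my result?"
missing = {"price": 8, "in_stock": 0}
conditions = [("price_filter", lambda t: t["price"] > 5),
              ("stock_filter", lambda t: t["in_stock"] > 0)]
print(why_not(missing, conditions))  # -> stock_filter
```

Pointing at the culpable operator also suggests the refinement: relax `stock_filter` and the desired tuple reappears.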
DATASPREAD: Unifying Databases and Spreadsheets
Authors: Mangesh Bendre, Bofan Sun, Ding Zhang, Xinyan Zhou, Kevin Chen-Chuan Chang, Aditya Parameswaran (University of Illinois at Urbana-Champaign, UIUC)
Type: Demo
Presented by: Ranjan
Fall 2015
Intro
Database ?
Spreadsheets?
Problem ?
Example
A spreadsheet containing course assignment scores and eventual grades for students from rows 1–1000, columns 1–10 in one sheet, and demographic information for the students from rows 1–1000, columns 1–20 in another sheet.
The user wants to understand the impact of assignment grades on the course grade, e.g., for students having std_points > 90 in at least one assignment.
The user wants to plot the average grade by demographic group (undergrad, MS, PhD).
the course management software outputs actions performed by students into a relational database or a CSV file; there is no easy way for the user to study this data within the spreadsheet, as the data is continuously added.
Challenges
Schema
Addressing
Modifications
Computation: spreadsheets support value-at-a-time formulae to allow derived computation, while databases support arbitrary SQL queries operating on groups of tuples at once.
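The computation gap can be made concrete with a toy example (ours, not DATASPREAD's code): a per-cell spreadsheet formula versus one set-oriented SQL query over the same scores.

```python
import sqlite3

scores = [(1, 95), (2, 80), (3, 91)]

# Spreadsheet-style: a value-at-a-time formula, e.g. =IF(B2>90, 1, 0),
# evaluated cell by cell down the column.
flags = [1 if points > 90 else 0 for _, points in scores]

# Database-style: one SQL query operating on all tuples at once.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE grades(id INTEGER, std_points INTEGER)")
con.executemany("INSERT INTO grades VALUES (?, ?)", scores)
high = [r[0] for r in
        con.execute("SELECT id FROM grades WHERE std_points > 90")]

print(flags, high)  # -> [1, 0, 1] [1, 3]
```

A unified system has to reconcile these two models so a formula and a query can reference the same cells.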
DATASPREAD Architecture
DEMONSTRATION
a) analytic queries that reference data on the spreadsheet, as well as data in other database relations.
b) importing or exporting data from the relational database.
c) keeping data in the front-end and back-end in sync during modifications at either end.
Related Work
a) Use of spreadsheets to mimic relational database functionality:
achieves the expressivity of SQL, but is unable to leverage the scalability of databases.
b) Use of databases to mimic spreadsheet functionality:
achieves the scalability of databases, but does not support the ad-hoc tabular management provided by spreadsheets.
c) Use of a spreadsheet interface for querying data:
provides an intuitive interface to query data, but loses the expressivity of SQL as well as ad-hoc data management capabilities.
Conclusion
Overall, the aforementioned demonstration scenarios will convince attendees that the DATASPREAD system offers a valuable hybrid between spreadsheets and databases, retaining the ease-of-use of spreadsheets and the power of databases.
Permutation Search Methods are Efficient, Yet Faster Search is Possible
Presented by: Zohreh Raghebi
Fall 2015
Authors
Bilegsaikhan Naidan (Norwegian University of Science and Technology, Trondheim, Norway)
Leonid Boytsov (Carnegie Mellon University, Pittsburgh, PA, USA)
Eric Nyberg (Carnegie Mellon University, Pittsburgh, PA, USA)
Motivation
Nearest-neighbor searching is a fundamental operation employed in many applied areas such as: pattern recognition, computer vision, multimedia retrieval
Given a query data point q, the goal is to identify the nearest (neighbor) data point x
A natural generalization is a k-NN search, where we aim to find k closest points
The most studied instance of the problem is an exact nearest-neighbor search in vector spaces
where a distance function is an actual metric distance
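The baseline every index competes with is the sequential scan; a minimal exact k-NN by scan (toy 2-D points, our own example):

```python
import math

def knn(query, points, k):
    # Exact k-NN by sequential scan: compute every distance,
    # keep the k closest points.
    return sorted(points, key=lambda p: math.dist(query, p))[:k]

points = [(0, 0), (1, 1), (5, 5), (2, 2)]
print(knn((0, 0), points, 2))  # -> [(0, 0), (1, 1)]
```

In high-dimensional spaces, exact index structures can rarely beat this simple scan, which is what drives interest in approximate methods.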
Related works
Exact methods work well only in low dimensional metric spaces
Experiments showed that exact methods can rarely outperform the sequential scan when dimensionality exceeds ten
This is a well-known phenomenon known as “the curse of dimensionality”
Approximate search methods can be much more efficient than exact ones
but this comes at the expense of a reduced search accuracy
The quality of approximate searching is often measured using recall
the average fraction of true neighbors returned by a search method
Permutation-based methods
It is based on the idea that if we rank a set of reference points–called pivots–with respect to distances from a given point
the pivot rankings produced by two near points should be similar
In these methods, every data point is represented by a ranked list of pivots sorted by the distance to this point.
Such ranked lists are called permutations
the distance between permutations is a good proxy for the distance between original points
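A minimal sketch of the permutation idea (toy 1-D data, our own): nearby points rank the pivots similarly, so a cheap rank distance (the Spearman footrule) stands in for the true distance.

```python
def permutation(point, pivots):
    # Rank pivots by distance from the point; rank[i] is pivot i's position.
    order = sorted(range(len(pivots)), key=lambda i: abs(point - pivots[i]))
    rank = [0] * len(pivots)
    for position, pivot_index in enumerate(order):
        rank[pivot_index] = position
    return rank

def footrule(p, q):
    # Spearman footrule: L1 distance between two pivot rankings.
    return sum(abs(a - b) for a, b in zip(p, q))

pivots = [0.0, 5.0, 10.0]
near_a = permutation(1.0, pivots)
near_b = permutation(1.5, pivots)
far = permutation(9.0, pivots)
print(footrule(near_a, near_b), footrule(near_a, far))  # -> 0 4
```

The two nearby points get identical rankings while the far point's ranking disagrees, so filtering by footrule distance lets a search examine only candidates whose permutations resemble the query's.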
However, a comprehensive evaluation that involves a diverse set of large metric and nonmetric data sets is lacking
We survey permutation-based methods for approximate k-nearest neighbor search, which answer queries by examining only a tiny subset of data points whose permutations are similar to the permutation of a query
Conclusion
Converting the vector of distances to pivots into a permutation entails information loss
but this loss is not necessarily detrimental
our preliminary experiments showed that using permutations instead of vectors of original distances:
results in slightly better retrieval performance
Permutation methods are a good fit when:
(1) The distance function is expensive (or the data resides on disk)
(2) The indexing costs of k-NN graphs are unacceptably high
(3) There is a need for a simple, but reasonably efficient, implementation that operates on top of a relational database
FIT to Monitor Feed Quality
Tamraparni Dasu (AT&T Labs–Research)
Vladislav Shkapenyuk (AT&T Labs–Research)
Divesh Srivastava (AT&T Labs–Research)
Presented by: Zohreh Raghebi
Motivation
Data are being collected and analyzed today at an unprecedented scale
Data errors (or glitches) in many domains, such as medicine, finance can have severe consequences
need to develop data quality management systems to effectively detect and correct glitches in the data
Data errors can arise throughout the data lifecycle
from data entry, through storage, data integration, analysis
Introduction
Much of the data quality effort in database research has focused on detecting and correcting errors in data once the data has been collected
This is surprising since data entry time offers the first opportunity to detect and correct errors
We address this problem in our paper and describe principled techniques for online data quality monitoring in a dynamic feed environment
While there has been significant focus on collecting and managing data feeds
it is only now that attention is turning to their quality
Data feed management systems
Our goal is to alert quickly when feed behavior deviates from expectations
Data feed management systems (DFMSs) have recently emerged to provide reliable, continuous data delivery to databases and data-intensive applications that need to perform real-time correlation and analysis
In prior work we have presented the Bistro DFMS, which is deployed at AT&T Labs
responsible for the real-time delivery of over 100 different raw feeds,
distributing data to several large-scale stream warehouses.
Related works
Bistro uses a publish-subscribe architecture to efficiently process incoming data from a large number of data publishers,
identify logical data feeds,
and reliably distribute these feeds to remote subscribers
FIT naturally fits into this DFMS architecture:
both as a subscriber of data and metadata feeds
as a publisher of learned statistical models and identified outliers
we propose novel enhancements that permit a publish-subscribe approach to incorporate data quality modules into the DFMS architecture
Contribution
Early detection of errors by FIT enables data administrators to quickly remedy any problems with the incoming feeds
FIT’s online feed monitoring can naturally detect errors from two distinct perspectives:
(i) errors in the data feed processes
e.g., missing or delayed delivery of files in a feed
by continuously analyzing the DFMS metadata feed
(ii) significant changes in distributions in the data records present in the feeds
e.g., erroneously switching from packets/second to bytes/second in a measurement feed
by continuously analyzing the contents of the data feeds.
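Both perspectives can be caricatured in a few lines (our own toy checks, not FIT's statistical models):

```python
def missing_files(expected, arrived):
    # Perspective (i): feed-process errors, e.g. files that never arrived.
    return sorted(set(expected) - set(arrived))

def distribution_alert(baseline_mean, window, factor=10.0):
    # Perspective (ii): a sharp shift versus a learned baseline, e.g. a
    # feed that silently switched from packets/second to bytes/second.
    mean = sum(window) / len(window)
    return mean > factor * baseline_mean or mean < baseline_mean / factor

print(missing_files(["feed_00.gz", "feed_01.gz"], ["feed_00.gz"]))
print(distribution_alert(100.0, [1500.0, 1450.0]))  # -> True
```

The first check consumes the DFMS metadata feed; the second consumes the data records themselves, matching the two error sources described above.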
Differential Privacy in Telco Big Data Platform
Presented by: Shahab Helmi
Fall 2015
Paper Info
Authors:
Publication: VLDB 2015
Type: Industrial Paper
Introduction
What has been done in this paper?
The first attempt to implement three basic DP architectures in the deployed telecommunication (telco) big data platform for data mining applications (churn prediction).
What is DP?
Differential Privacy (DP) is an Anonymization technique.
What is Anonymization?
A privacy protection technique, which removes or replaces the explicitly sensitive identifiers (ID) of customers, such as the identification number or mobile phone number, by random mapping or encryption mechanisms in DB, and provides the sanitized dataset without any ID information to DM services.
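For intuition, the textbook building block of DP is the Laplace mechanism (standard DP background, not one of the paper's three architectures): release a query answer with noise scaled to sensitivity/ε.

```python
import math
import random

def laplace_noise(scale):
    # Sample Laplace(0, scale) via the inverse-CDF transform.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon, sensitivity=1.0):
    # A count query has sensitivity 1: one person changes it by at most 1.
    return true_count + laplace_noise(sensitivity / epsilon)

random.seed(0)
# Smaller epsilon = stronger privacy guarantee = noisier answers.
print(dp_count(1000, epsilon=10.0), dp_count(1000, epsilon=0.1))
```

This is the accuracy/privacy-budget trade-off the experiments measure: shrinking ε strengthens privacy but degrades the statistics the churn classifiers are trained on.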
Introduction (2)
Who is a Churner?
A person who quits the service! Customer churn is one of the biggest challenges in the telco industry.
Telecommunication (telco) big data platform
Telco big data platforms record billions of customers' communication behaviors over many years, worldwide. Mining this big data to improve customer experience and raise profits has become one of the important tasks for telco operators.
Contributions
Implementation of DP in the telco big data platform: Data Publication Architecture, Separated Architecture, and Hybridized Architecture.
Extensive experimental results on big data:
The influence of the privacy budget parameter on different DP implementations with industrial big data.
The accuracy and privacy budget trade-off.
The performance of the three basic DP architectures in churn prediction.
How the volume and variety of big data affect performance.
Comparing DP implementation performance between the simple decision tree and the relatively complicated random forest classifier in churn prediction.
Contributions (2)
Findings:
All DP architectures have a relative accuracy loss of less than 5% with a weak privacy guarantee, and more than 15% (up to 30%) with a strong privacy guarantee.
Among all three basic DP architectures, the Hybridized architecture performs the best.
Prediction error increases with the number of features and decreases with the growth of the training data volume.
Related Work
Anonymization techniques, such as K-Anonymity
DP is currently the strongest privacy protection technique; it needs no assumption about the attacker's background knowledge, since the attacker can be assumed to have maximum knowledge.
Studying DP in different scenarios:
Histogram query
Statistical geospatial data query
Frequent item set mining
Crowdsourcing …
System Overview
Experimental Results
Dataset: collected from one of the biggest telco operators in China; 9 consecutive months of behavior records for more than 2 million prepaid customers, from 2013 to 2014 (around 2M users).
Experiments: checking the effect of the following properties on churn prediction accuracy:
Privacy budget parameter
Number of features
Training data volume
Experimental Results (2) AUC: Area under ROC Curve
ROC is a graphical plot that illustrates the performance of a binary classifier system. [Wikipedia]
The effect of number of features on prediction accuracy (1M training records)
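AUC has a direct probabilistic reading, which a toy computation makes concrete (our own illustration, not the paper's evaluation code): it is the probability that the classifier scores a random positive above a random negative, counting ties as half.

```python
def auc(positive_scores, negative_scores):
    # Count positive/negative pairs ranked correctly; ties count half.
    wins = sum((p > n) + 0.5 * (p == n)
               for p in positive_scores for n in negative_scores)
    return wins / (len(positive_scores) * len(negative_scores))

# Churners scored 0.9 and 0.8; non-churners scored 0.3 and 0.85:
# three of the four pairs are ordered correctly.
print(auc([0.9, 0.8], [0.3, 0.85]))  # -> 0.75
```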
The effect of training data volume on prediction accuracy
Decision Trees vs. Random Forests
AIDE: An Automatic User Navigation System for Interactive Data Exploration
Presented by: Shahab Helmi
Fall 2015
Paper Info
Authors:
Publication: VLDB 2015
Type: Demonstration Paper
Introduction
Data analysts often engage in data exploration tasks to discover interesting data patterns, without knowing exactly what they are looking for (exploratory analysis).
Users try to make sense of the underlying data space by navigating through it. The process includes a great deal of experimentation with queries, backtracking on the basis of query results, and revision of results at various points in the process.
When the data size is huge, finding the relevant sub-space and the relevant results takes a long time.
AIDE
AIDE is an automated data exploration system that:
Steers the user towards interesting data areas based on her relevance feedback on database samples.
Aims to achieve the goal of identifying all database objects that match the user interest with high efficiency.
It relies on a combination of machine learning techniques and sample selection algorithms to provide effective data exploration results as well as high interactive performance over databases of large sizes.
Experimental Results
Datasets:
AuctionMark: information on auction items and their bids. 1.77GB.
Sloan Digital Sky Survey: This is a scientific data set generated by digital surveys of stars and galaxies. Large data size and complex schema. 1GB-100GB.
US housing and used cars: available through the DAIDEM Lab
System Implementation:
Java: ML, clustering and classification algorithms, such as SVM, k-means, decision trees
PostgreSQL