Upload
garey-phelps
View
221
Download
1
Tags:
Embed Size (px)
Citation preview
Spatial Partitioning Techniques in SpatialHadoop
Type: Research Paper (Experimental evaluation)
Authors: Ahmed Eldawy, Louai Alarabi, Mohamed F. Mokbel
Presented by: Siddhant Kulkarni
Term: Fall 2015
Motivation
Spatial Hadoop is extended MapReduce framework that supports global indexing
Experiment with four new partitioning techniques
Related Work
Two of their own papers A Demonstration of SpatialHadoop: An Efficient MapReduce
framework for Spatial data (2013) SpatialHadoop: A MapReduce framework for spatial data (2015)
These demonstrated SpatialHadoop along with indexing techniques (used to implement grid, R tree, R+ tree)
Contribution
Main contribution of this paper is Experimentation and evaluation of performance of
Range Queries Join Queries
It also explains the three steps used by Spatial Hadoop (compute, sample, partition) for partitioning and different partitioning techniques in short one liners
Conclusion
There are two main conclusions drawn in this paper QuadTree technique outperformed Z-curve, Hilbert Curve and K-d Tree 1% sample is enough to produce high quality partitions
StarDB: A Large Scale DBMS for
StringsType: Demo
Authors: Majed Sahli, Essam Mansour, Panos Kalnis
Presented by: Siddhant Kulkarni
Term: Fall 2015
Motivation
Increasing use of Strings and Applications that use strings Need for the ability to query String data
Related Work
S. Tata, W. Lang, and J. M. Patel. Periscope/SQ: Interactive exploration of biological sequence databases. In Proc. of VLDB, 2007 Extended version of PostgreSQL for biological sequences
Two of their own papers that describe the data structures and parallel string algorithms
Contribution
Major Contributions: Data Query Language Query Optimization Large scale parallelism
And experimentation with some very high configuration systems
Stale View Cleaning: Getting Fresh
Answers from Stale Materialized Views
Presented by: Omar Alqahtani
Fall 2015
Authors
Sanjay Krishnan, Jiannan Wang, Michael J. Franklin, Ken Goldberg, Tim Kraska
VLDB Endowment, Vol. 8, No. 12 - 2015
Introduction
Materialized Views ( MV ): are stored pre-computed results, which used to facilitate fast queries on large datasets.
Problem
How to update MVs: Updating all affected MVs, but it reduces the available
transaction throughput. Batching updates, but it leads to stale MVs.
Contribution
Stale View Cleaning (SVC): Efficiently cleans a sample of rows from a stale MV, and use the
clean sample to estimate aggregate query results. Provides an outlier indexing technique to reduce sensitivity to
skewed datasets Gives tight bounds on approximate results.
Title: Lenses –An On-Demand Approach to ETL
Author: Ying Yang; Niccolo Meneghetti; Ronny Fehling; Zhen Hua Lia; Oliver Kennedy
(SUNY Buffalo, Oracle)41st International conference on very large Data
Bases.
Presented by:
Ranjan Yadav
Fall 2015
Overview
Selective parsing, transformation and loading into a new structure as data curation
Accurate Reliable Ad-hoc approach(High-quality information)
Datasets : Product data, Credit data, Real Estate data etc.
Major Challenges
Schema matching and entity resolutionDe-duplication Information extraction and inference Information integration Ontology construction Efficiently ranking potential data curation tasks.
Related Work
The problem of incomplete data arises frequently in distributed systems where no failure are common.
Prioritizing curation tasks is quite closely related to Stochastic Boolean Function Evaluation (SBFE)
Deshpande et al designed a 3-approximation algorithm for linear threshold formulas, and
Allen et al developed exact and approximate solution for monotone K-term DNF formulas
Conclusion
On-demand ETL enables composable non-deterministic data processing operators called Lenses that provide the illusion of fully cleaned relational data that can be queried using standard SQL.
Paygo: which generalizes task-specific on-demand curation solution( Schema matching and entity resolution)
Future work the effectiveness of on-demand ETL and CPI-based heuristics.
GIS Navigation Boosted by Column
StoresPresented by: Shahab Helmi
VLDB 2015 Paper Review Series
Fall 2015
Paper InfoAuthors:
Foteini Alvanaki, Centrum Wiskunde & Informatica
Romulo Goncalves, NLeSC Amsterdam
Milena Ivanovaa , NuoDB Cambridge MA
Martin Kersten, Centrum Wiskunde & Informatica
Kostis Kyzirakos, Centrum Wiskunde & Informatica
Publication: VLDB 2015
Type: Demonstration
Introduction / Motivation Earth observation sciences, astronomy, and seismology have large data sets which
have inherently rich spatial and geospatial information.
In combination with large collections of semantically rich objects which have a large number of domain properties, they form a new source of knowledge for urban planning, smart cities and natural resource management.
These properties and the relationship between then are usually modeled as relational databases. The production of point cloud data sets have increased in size over the past years to Terabyte
scale due to its easy collection using airborne laser scanning.
In this paper column store used to query a 640 billion point dataset. Lightweight and cache conscious secondary index called Imprints
Results are visualized in real time using QGIS.
Related Work Processing point cloud data:
Known DMBSs with point cloud extension such as Oracle and PostgreSQL. They use the data type POINT to store point cloud data.
Points are assigned to blocks and the blocks are stored in tables.
It helps to reduce the space requirements and speeds up the query.
File-based solutions: work directly with the standard LAS format. Such solutions are now encountering scalability problems.
Column-stores: MonetDB is a modern in-memory column-store database system, designed in the late 90's.
It’s free and open source.
has recently been extended with a novel lightweight secondary indexing scheme called column imprints. Imprint is used during query evaluation to limit data access, and thus minimize memory traffic.
Related Work (2) Indexing for point cloud data:
Space filling curves: reduces the reduces the dimensionality of the data by mapping for example the X and Y coordinates in one dimension. Used by spatial DBMSs, such as Oracle spatial, and files systems.
Octrees
System Architecture: Storage Model
The point cloud id is stored in a flat table and it’s XYZ coordinate + other 23 properties are stored in a column.
Using a flat table needs more memory. However, we will be able to use compression techniques for column-store such as run length encoding.
System Architecture: Database Loading
In most of the systems, the dominant part of loading stems from the conversion of the LAZ files into CSV format and the subsequent parsing of the CSV records by the database engine.
In this system:
The loader takes as input a LAS/LAZ files and for each property it generates a new file that is the binary dump of a C-array containing the values of the property for all points.
Then, the generated files are appended to each column of the at table using the bulk loading operator COPY BINARY.
load and index the full AHN2 dataset that consists of approximately 640 billion points in less than one day.
For PostgreSQL it takes a week!
System Architecture: Query Model
MonetDB has an SQL interface to the Simple Features Access standard of the Open Geospatial Consortium(OGC).
1. Filtering step: the majority of points that do not satisfy the spatial predicate for a given geometry G are identied and disregarded using a fast approximation of the predicate. MonetDB performs the faltering using the column imprints.
Refinements step: During this step, the spatial predicate is evaluated against the precise geometry G. When the geometry is complex, it is not possible to evaluate every single point -> MonetDB, creates a regular grid over the point geometries selected in the filtering step and assigns each geometry to a grid cell. Then it evaluates grids against the predicates.
Scaling up Concurrent Main-Memory Column-Store Scans: Towards
Adaptive NUMA–Aware Data and Task Placement
Presented by: Ashkan Malekloo
Fall 2015
Scaling up Concurrent Main-Memory Column-Store Scans: Towards Adaptive NUMA–Aware Data and Task Placement
Type: Research paper
Authors:
International conference on Very Large Data Bases.
Iraklis Psaroudakis, Tobias Scheuer, Norman May, Abdelkader Sellami, Anastasia Ailamaki
Introduction
Non-Uniform Memory Access
The Efficient Usage of NUMA Architecture Depends on: Data placement
Scheduling strategy of the column store
Related works
Data Placement HyPer [23]
ERIS [17]
Task Scheduling HyPer [23]
NUMA-aware Operators Albutiu et al [6]
Yinan et al [25]
Black-box Approach Giceva et al [15]
Contributions
Presenting a partitioning scheme for dictionary encoded columns
Investigating the effect of partitioning
Presenting a design that can adapt the task scheduling and data placement
JetScope: Reliable and Interactive Analytics at Cloud
Scale Type: Industry paper
Authors:
International conference on Very Large Data Bases.
Eric Boutin, Paul Brett, Xiaoyu Chen, Jaliya Ekanayake Tao Guan, Anna Korsun, Zhicheng Yin, Nan Zhang, Jingren Zhou
Motivation
Reliability, Scalability and low latency are the main capabilities of a cloud scale data analytics.
Alongside these, an effective fault tolerance and efficient recovery are very important in distributed environments.
JetScope
A cloud scale interactive query processing system
Developed at Microsoft
Delivers massive scalability and high performance
In order to achieve low latency Various access methods
Optimizes delivering first row
Maximizes network efficiency
Maximizes scheduling efficiency
Offers a fine-grained fault tolerance mechanism which is able to efficiently detect and mitigate failure without significantly impacting the query latency and user experience
Processing of Probabilistic
Skyline Queries Using MapReduce
Presented by: Dardan Xhymshiti
Fall 2015
Processing of Probabilistic Skyline Queries Using
MapReduce Type: Research paper
Authors:
International conference on Very Large Data Bases.
Yoonjar ParkSeoul National University
Seoul, [email protected]
Jun-Ki MinKorea Univ. of Tech. & Edu
Cheonan, [email protected]
Kyuseok ShimSeoul National University
Seoul, [email protected]
Keywords
MapReduce: is a programming model and implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
Skyline operator: The Skyline operator is used in a query and performs a filtering of results from a database so that it keeps only those objects that are not worse than any other.
Major problem?
Applications that produce large volumes of uncertain data: Social networks, Data integration Sensor data management
Uncertain data sources? Data randomness,
Data incompleteness,
Limitation of measuring equipments.
Some details:
What is skyline in this case? Let be a set of objects
: where is represented a -dimensional point then:
Skyline is the set of all objects that are not dominated by any other objects in
What is domination?
An object is said to dominate an object () if the teo following condiions hold: (1) such that we have
(2) There exists w such that
An uncertain object can be modeled in two ways: (1) Discrete model
An object is modeled as a set of instances denoted by where is a -dimensional point with its existence probability.
(2) Continuous model An object is modeled as an uncertainty region with its pdf.
Possible world
Given a set of uncertain objects represented by the discrete model, a possible world is a set of instances from objects in where at most a single instance may be selected from each object.
Skyline probability (discrete model):
The skyline probability of an instance is the probability that it appears in a possible world an is not dominated by every instance of the other objects in the possible world.
The skyline probability of an objects is the sum of the skyline probabilities of its all instances.
Similarly for the continuous model, we define the skyline probability of an object by using its uncertainty region and pdf.
Probability threshold
Given a regardless of the uncertainty models used, the probabilistic skyline is the set of uncertain objects whose skyline probabilities are at least
Example: Consider a large number of sensors to measure and concentrations in the air are
deployed in a wide area to monitor the air pollution.
Every sensor is associated with some measurement error, resulting in uncertain data.
The pairs of measured and by each device can be modeled as an objects with its uncertainty region and pdf.
Each device can be modeled as an object where each pair of measured values can be considered as an instance of the object.
How to find the less pollution locations?
The locations of the devices whose pairs of measure and are in probabilistic skyline are said to be less polluted.
Problem: We have a large amount of uncertain data, computing probabilistic skyline is challenging today.
Contribution:
Propose parallel algorithms using MapReduce to process the probabilistic skyline
queries for uncertain data modeled by both discrete and continuous models.
3 filtering methods to identify probabilistic non-skyline objects in advance.
Development of a single MapReduce phase algorithm PS-QP-MR.
Enhances algorithm PS-QPF-MR by applying the three filtering methods
additionally.
Presents brute-force algorithms PS-BR-MR and PS-BRF-MR with partitioning
randomly and applying the filtering methods.
Related work
Several algorithms have been proposed for skyline queries: Nearest Neighbor (Kossman).
Papadias improved NN algorithm by using the branch-and-bound strategy.
Have been proposed techniques for processing uncertain queries such as probabilistic top-K.
The serial algorithms for probabilistic skyline processing over uncertain data have been introduced.
Performance and Scalability of Indexed Subgraph Query
Processing MethodsFoteini Katsarou
Nikos Ntarmos
Peter Triantafillou
Presented by: Zohreh Raghebi
Motivations
One of the main problems addressed by Graph data management systems is subgraph query processing;
The naive method for processing such queries is to perform a subgraph isomorphism test against each graph in the dataset
Filter-and-verification methods have been proposed to alleviate the problem
Solutions utilize an index based on features (i.e., substructures) of the graphs to filter out the graphs that definitely do not contain q
problem definition
we identify a set of key factors-parameters, that influence the performance of related index methods:
the number of nodes per graph
the graph density
the number of distinct labels
the number of graphs in the dataset
the query graph size
Related work
Most related works are tested against the AIDS antiviral dataset and synthetic datasets
These sets are not adequate to provide definitive conclusions
Grapes alone used several real datasets; , the authors did not evaluate scalability
The iGraph comparison framework [10], which implements several such techniques, compared the performance of older algorithms (up to 2010).
Since then, several, more efficient algorithms have been proposed
Conclusion
a linear increase in the number of nodes results in a quadratic increase in the number of edges;
As the number of features is superlinear to the size of a graph, the increase of the above two factors leads to a detrimental increase in the indexing time
The increase in the number of distinct labels leads to an easier dataset to index
It results in fewer occurrences of any given feature and thus a decrease in the false positiveratio of the various algorithms
Our findings give rise to the following adage:
“Keep It Simple and Smart”
the simpler the feature structure and extraction process:
the faster the indexing and query processingalgorithm
Respect to the type of graph features indexed (i.e., paths, trees, cycles, subgraphs) and the method for
generating graph features (i.e., based onfrequency mining or exhaustive enumeration of graphfeatures)
We choose six well established indexing methods, namely:
Grapes, CT-Index, GraphGrepSX, gIndex, Tree+∆, and gCode
For query processing time and lowest indexing time, the approaches using exhaustive enumeration (Grapes, GGSX, CT-Index) are theclear winners, with those indexing simple features (paths; i.e., Grapes, GGSX) having the edge over those with more complex features (trees, cycles; i.e., CT-Index).
Frequent mining approaches (gIndex, Tree+∆) are usually an orderof magnitude slower
frequent mining approaches (gIndex, Tree+∆) are competitive only for small/sparse datasets, buttheir indexing times grow very high very fast.
In contrast to iGraph’s conclusion, our study reveals 2 methods, GGSX and Grapes always the clear winner for query processing time and scalability
One Trillion Edges: Graph Processing at Facebook-Scale
Industrial paper
Avery Ching
Sergey Edunov
Maja Kabiljo
Presented by: Zohreh Raghebi
Motivation
Analyzing the real world graphs at the scale of hundreds of billions or even a trillion edges is very difficult
graph processing engines tend to have additional challenges in scaling to larger graphs
Related work
Apache Giraph is a graph processing system designed to scale to hundreds or thousands of machines and process trillions of edges.
It is currently used at Facebook to analyze the social graph formed by users and their connections
Giraph was inspired by Pregel, the graph processing architecture developed at Google.
While Giraph did not scale to our needs at Facebook with over 1.39B users and hundreds of billions of social connections
we improved the platform in a variety of ways to support our workloads
Limitations: Giraph’s graph input model was only vertex centric
Parallelizing Giraph infrastructure relied completely on MapReduce’s task level parallelism did not have multithreading support
Giraph’s flexible types and computing model were initially implemented using native Java objects consumed excessive memory and garbage collection time
The aggregator framework was inefficiently implemented in ZooKeeper
we needed to support very large aggregators
Improvements: we modified Giraph to allow loading vertex data and edges from separate
sources
Parallelization support: Adding more workers per machine.
Use multithreading to take advantage of additional CPU cores.
memory optimization : by default we serialize the edges of every vertex into a byte array rather than instantiating them as native Java objects
sharded aggregator architecture: each aggregator is now randomly assigned to one of the workers
The assigned worker is in charge of gathering the values of its aggregators, performing the aggregation, and distributing the final values to the master and other workers