Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Ramnad
Pondicherry | Salem | Erode | Tirunelveli
http://www.elysiumtechnologies.com, [email protected]
ETPL NT-001 Answering “What-If” Deployment and Configuration Questions With WISE: Techniques and Deployment Experience
ETPL NT-002 Complexity Analysis and Algorithm Design for Advance Bandwidth Scheduling in Dedicated Networks
ETPL NT-003 Diffusion Dynamics of Network Technologies With Bounded Rational Users: Aspiration-Based Learning
ETPL NT-004 Delay-Based Network Utility Maximization
ETPL NT-005 A Distributed Control Law for Load Balancing in Content Delivery Networks
ETPL NT-006 Efficient Algorithms for Neighbor Discovery in Wireless Networks
ETPL NT-007 Stochastic Game for Wireless Network Virtualization
ETPL NT-008 ABC: Adaptive Binary Cuttings for Multidimensional Packet Classification
ETPL NT-009 A Utility Maximization Framework for Fair and Efficient Multicasting in Multicarrier Wireless Cellular Networks
ETPL NT-010 Achieving Efficient Flooding by Utilizing Link Correlation in Wireless Sensor Networks
ETPL NT-011 Random Walks and Green's Function on Digraphs: A Framework for Estimating Wireless Transmission Costs
ETPL NT-012 A Flexible Platform for Hardware-Aware Network Experiments and a Case Study on Wireless Network Coding
ETPL NT-013 Exploring the Design Space of Multichannel Peer-to-Peer Live Video Streaming Systems
ETPL NT-014 Secondary Spectrum Trading—Auction-Based Framework for Spectrum Allocation and Profit Sharing
ETPL NT-015 Towards Practical Communication in Byzantine-Resistant DHTs
ETPL NT-016 Semi-Random Backoff: Towards Resource Reservation for Channel Access in Wireless LANs
ETPL NT-017 Entry and Spectrum Sharing Scheme Selection in Femtocell Communications Markets
ETPL NT-018 On Replication Algorithm in P2P VoD
ETPL NT-019 Back-Pressure-Based Packet-by-Packet Adaptive Routing in Communication Networks
ETPL NT-020 Scheduling in a Random Environment: Stability and Asymptotic Optimality
ETPL NT-021 An Empirical Interference Modeling for Link Reliability Assessment in Wireless Networks
ETPL NT-022 On Downlink Capacity of Cellular Data Networks With WLAN/WPAN Relays
ETPL NT-023 Centralized and Distributed Protocols for Tracker-Based Dynamic Swarm Management
ETPL NT-024 Localization of Wireless Sensor Networks in the Wild: Pursuit of Ranging Quality
ETPL NT-025 Control of Wireless Networks With Secrecy
ETPL NT-026 ICTCP: Incast Congestion Control for TCP in Data-Center Networks
ETPL NT-027 Context-Aware Nanoscale Modeling of Multicast Multihop Cellular Networks
ETPL NT-028 Moment-Based Spectral Analysis of Large-Scale Networks Using Local Structural Information
ETPL NT-029 Internet-Scale IPv4 Alias Resolution With MIDAR
ETPL NT-030 Time-Bounded Essential Localization for Wireless Sensor Networks
ETPL NT-031 Stability of FIPP p-Cycles Under Dynamic Traffic in WDM Networks
ETPL NT-032 Cooperative Carrier Signaling: Harmonizing Coexisting WPAN and WLAN Devices
ETPL NT-033 Mobility Increases the Connectivity of Wireless Networks
ETPL NT-034 Topology Control for Effective Interference Cancellation in Multiuser MIMO Networks
ETPL NT-035 Distortion-Aware Scalable Video Streaming to Multinetwork Clients
ETPL NT-036 Combined Optimal Control of Activation and Transmission in Delay-Tolerant Networks
ETPL NT-037 A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless
Kernel principal component analysis (KPCA) combined with the reconstruction error is an effective anomaly
detection technique for non-linear datasets. In an environment where a phenomenon generates data that is
non-stationary, anomaly detection requires a recomputation of the kernel eigenspace in order to represent the
current data distribution. Recomputation is a computationally complex operation and reducing computational
complexity is therefore a key challenge. In this paper, we propose an algorithm that is able to accurately
remove data from a kernel eigenspace without performing a batch recomputation. Coupled with a kernel
eigenspace update, we demonstrate that our technique is able to remove and add data to a kernel eigenspace
more accurately than existing techniques. An adaptive version determines an appropriately sized sliding
window of data and when a model update is necessary. Experimental evaluations on both synthetic and real-
world datasets demonstrate the superior performance of the proposed approach in comparison to alternative
incremental KPCA approaches and alternative anomaly detection techniques.
ETPL DM-001 Adaptive Anomaly Detection with Kernel Eigenspace Splitting and Merging
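The detector described in the abstract scores a point by how poorly the leading kernel principal components reconstruct it. The following is an illustrative sketch, not the paper's splitting/merging update: it uses an uncentered RBF kernel matrix, power iteration for the leading eigenpair, and the reconstruction error k(x,x) − p² as the anomaly score. The kernel width `gamma` and the single-component setup are simplifying assumptions.

```python
import math

def rbf(a, b, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an assumed width."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def kpca_top_component(X, gamma=0.5, iters=200):
    """Leading eigenpair of the (uncentered) kernel matrix via power iteration."""
    n = len(X)
    K = [[rbf(X[i], X[j], gamma) for j in range(n)] for i in range(n)]
    v, lam = [1.0] * n, 1.0
    for _ in range(iters):
        w = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = math.sqrt(sum(x * x for x in w))
        v = [x / lam for x in w]
    # scale so the corresponding feature-space component has unit norm
    alpha = [x / math.sqrt(lam) for x in v]
    return alpha, lam

def reconstruction_error(x, X, alpha, gamma=0.5):
    """k(x, x) minus the squared projection onto the leading component."""
    p = sum(a * rbf(xi, x, gamma) for a, xi in zip(alpha, X))
    return rbf(x, x, gamma) - p * p
```

A far-away point projects onto almost nothing, so its error approaches k(x,x) = 1, while points resembling the training data score much lower; the paper's contribution is updating `alpha` as data is added and removed without recomputing K from scratch.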
It is nowadays well established that the construction of quality domain ontologies benefits from
the involvement of multiple actors in the modelling process, possibly having different roles and skills. To be
effective, the collaboration between these actors has to be fostered, enabling each of them to actively and
readily participate in the development of the ontology, favouring as much as possible the direct involvement of
the domain experts in the authoring activities. Recent works have shown that ontology modelling tools based
on the wiki paradigm and technology can contribute to meeting these collaborative requirements. This paper
investigates, both at the theoretical and empirical level, the effectiveness of wiki features for collaborative
ontology authoring in supporting teams composed of domain experts and knowledge engineers, as well as
their impact on the entire process of collaborative ontology modelling and entity lifecycle.
ETPL DM-002 Evaluating Wiki Collaborative Features in Ontology Authoring
H.264 video codec systems require a large-capacity Frame Store (FS) for buffering reference frames. The
up-to-date Phase-change Random Access Memory (PRAM) is a promising approach for on-chip caching of the
reference signals, as PRAM offers the advantages of high density and low leakage power. However,
the write endurance problem, namely that a PRAM cell can only tolerate a limited number of write
operations, remains the main barrier in practical applications. This paper studies wear reduction techniques
for PRAM-based FS in an H.264 codec system. On the basis of rate-distortion theory, content-oriented
selective writing mechanisms are proposed to reduce bit updates in the reference frame buffers. Experiments
demonstrate that, for typical video sequences with different frame sizes, our methods on average achieve more
than a 30% reduction in bit updates, while introducing around 20% BD-BR cost. The power consumption is
reduced by 55% on average, and the estimated PRAM lifetime is extended by 61%.
ETPL DM-003 B^p-tree: A Predictive B^+-tree for Reducing Writes on Phase Change Memory
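The selective-writing idea rests on a baseline fact about PRAM: a write only needs to touch the bits that actually differ from the stored contents. A minimal sketch of such data-comparison writing follows; the paper's rate-distortion-guided selection is not modeled here, and the word width and frame layout are assumptions.

```python
def bit_flips(old: int, new: int) -> int:
    """Bits that actually change when writing `new` over `old` (data-comparison write)."""
    return bin(old ^ new).count("1")

def frame_update_cost(old_frame, new_frame, word_bits=8):
    """Naive full-rewrite cost vs. cost when only differing bits are written."""
    naive = len(new_frame) * word_bits
    dcw = sum(bit_flips(o, n) for o, n in zip(old_frame, new_frame))
    return naive, dcw
```

Comparing the two costs on real reference-frame traffic is how a reduction in bit updates, and hence in wear and write energy, would be quantified.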
Edit distance is widely used for measuring the similarity between two strings. As a primitive
operation, edit distance based string similarity search is to find strings in a collection that are similar to a given
query string using edit distance. Existing approaches for answering such string similarity queries follow the
filter-and-verify framework by using various indexes. Typically, most approaches assume that indexes and
datasets are maintained in main memory. To overcome this limitation, in this paper, we propose B+-tree based
approaches to answer edit distance based string similarity queries, and hence, our approaches can be easily
integrated into existing RDBMSs. In general, we answer string similarity search using pruning techniques
employed in metric spaces, since edit distance is a metric. First, we split the string collection into partitions
according to a set of reference strings. Then, we index strings in all partitions using a single B+-tree based on
the distances of these strings to their corresponding reference strings. Finally, we propose two approaches to
efficiently answer range and KNN queries, respectively, based on the B+-tree. We prove that the optimal
partitioning of the dataset is an NP-hard problem, and therefore propose a heuristic approach for selecting the
reference strings greedily and present an optimal partition assignment strategy to minimize the expected
number of strings that need to be verified during the query evaluation. Through extensive experiments over a
variety of real datasets, we demonstrate that our B+-tree based approaches provide superior performance over
state-of-the-art techniques on both range and KNN queries in most cases.
ETPL DM-004 Efficiently Supporting Edit Distance based String Similarity Search Using B+-trees
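The partitioning scheme relies on the triangle inequality: since edit distance is a metric, |d(q,r) − d(s,r)| ≤ d(q,s) for any reference string r, so a string whose precomputed distance to r differs from the query's by more than the threshold can be skipped without computing its distance at all. A minimal single-reference sketch (the paper uses multiple references and a B+-tree; the helper names here are ours):

```python
def edit_distance(a, b):
    """Standard dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def range_query(query, strings, ref, tau):
    """Answer d(query, s) <= tau using one reference string for pruning."""
    dq = edit_distance(query, ref)
    index = [(s, edit_distance(s, ref)) for s in strings]  # built once, reusable
    out = []
    for s, ds in index:
        if abs(dq - ds) > tau:          # triangle inequality: skip without verifying
            continue
        if edit_distance(query, s) <= tau:
            out.append(s)
    return out
```

In the paper the per-string distances to the reference strings are what gets indexed in the B+-tree, so the pruning test becomes a range scan instead of a linear pass.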
Search Engine Marketing (SEM) agencies manage thousands of search keywords for their
clients. The campaign management dashboards provided by advertisement brokers have interfaces to change
search campaign attributes. Using these dashboards, advertisers create test variants for various bid choices,
keyword ideas, and advertisement text options. Later on, they conduct controlled experiments for selecting the
best performing variants. Given a large keyword portfolio and many variants to consider, campaign
management can easily become a burden on even experienced advertisers. In order to target users in need of a
particular service, advertisers have to determine the purchase intents or information needs of target users.
Once the target intents are determined, advertisers can target those users with relevant search keywords. In
order to formulate information needs and to scale campaign management with increasing number of keywords,
we propose a framework called TopicMachine, where we learn the latent topics hidden in the available search
terms reports. Our hypothesis is that these topics correspond to the set of information needs that best
matchmake a given client with users. In our experiments, TopicMachine outperformed its closest competitor by 41%
on predicting total user subscriptions.
ETPL DM-005 TopicMachine: Conversion Prediction in Search Advertising using Latent Topic Models
A highly comparative, feature-based approach to time series classification is introduced
that uses an extensive database of algorithms to extract thousands of interpretable features from time series.
These features are derived from across the scientific time-series analysis literature, and include summaries of
time series in terms of their correlation structure, distribution, entropy, stationarity, scaling properties, and fits
to a range of time-series models. After computing thousands of features for each time series in a training set,
those that are most informative of the class structure are selected using greedy forward feature selection with a
linear classifier. The resulting feature-based classifiers automatically learn the differences between classes
using a reduced number of time-series properties, and circumvent the need to calculate distances between time
series. Representing time series in this way results in orders of magnitude of dimensionality reduction,
allowing the method to perform well on very large datasets containing long time series or time series of
different lengths. For many of the datasets studied, classification performance exceeded that of conventional
instance-based classifiers, including one-nearest-neighbor classifiers using Euclidean distances and dynamic
time warping and, most importantly, the features selected provide an understanding of the properties of the
dataset, insight that can guide further scientific investigation.
ETPL DM-006 Highly comparative feature-based time-series classification
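A toy version of the pipeline above: mass-extract a few interpretable features, then greedily add whichever feature most improves a leave-one-out nearest-centroid accuracy. The paper extracts thousands of features and uses a linear classifier; the three features and the centroid classifier here are simplifying assumptions.

```python
import statistics

def features(ts):
    """Three interpretable summaries of a series (a stand-in for thousands)."""
    m = statistics.fmean(ts)
    num = sum((a - m) * (b - m) for a, b in zip(ts, ts[1:]))
    den = sum((a - m) ** 2 for a in ts)
    return {"mean": m, "std": statistics.pstdev(ts),
            "acf1": num / den if den else 0.0}

def nc_accuracy(rows, labels, feats):
    """Leave-one-out accuracy of a nearest-centroid classifier on `feats`."""
    correct = 0
    for i in range(len(rows)):
        groups = {}
        for j, (r, y) in enumerate(zip(rows, labels)):
            if j != i:
                groups.setdefault(y, []).append(r)
        def dist(r, grp):
            return sum((r[f] - statistics.fmean(g[f] for g in grp)) ** 2 for f in feats)
        pred = min(groups, key=lambda y: dist(rows[i], groups[y]))
        correct += pred == labels[i]
    return correct / len(rows)

def greedy_select(rows, labels, all_feats):
    """Greedy forward selection: add features while accuracy still improves."""
    chosen, best = [], 0.0
    while len(chosen) < len(all_feats):
        cand = max((f for f in all_feats if f not in chosen),
                   key=lambda f: nc_accuracy(rows, labels, chosen + [f]))
        acc = nc_accuracy(rows, labels, chosen + [cand])
        if acc <= best:
            break
        chosen.append(cand)
        best = acc
    return chosen
```

Note that once series are reduced to a feature dictionary, series of different lengths become directly comparable, which is one of the advantages the abstract highlights.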
Product quantization-based approaches are effective to encode high-dimensional data points for
approximate nearest neighbor search. The space is decomposed into a Cartesian product of low-dimensional
subspaces, each of which generates a sub codebook. Data points are encoded as compact binary codes using
these sub codebooks, and the distance between two data points can be approximated efficiently from their
codes by the precomputed lookup tables. Traditionally, to encode a subvector of a data point in a subspace,
only one sub codeword in the corresponding sub codebook is selected, which may impose strict restrictions on
the search accuracy. In this paper, we propose a novel approach, named Optimized Cartesian K-Means
(OCKM), to better encode the data points for more accurate approximate nearest neighbor search. In OCKM,
multiple sub codewords are used to encode the subvector of a data point in a subspace. Each sub codeword
stems from different sub codebooks in each subspace, which are optimally generated with regards to the
minimization of the distortion errors. The high-dimensional data point is then encoded as the concatenation of
the indices of multiple sub codewords from all the subspaces. This can provide more flexibility and lower
distortion errors than traditional methods. Experimental results on the standard real-life datasets demonstrate
the superiority over state-of-the-art approaches for approximate nearest neighbor search.
ETPL DM-007 Optimized Cartesian K-Means
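For reference, the plain product-quantization baseline that OCKM relaxes looks like this: each subvector is assigned to its single nearest sub-codeword, and query distances are approximated by summing precomputed lookup tables. Codebook training is omitted; the codebooks are assumed given.

```python
def encode(vec, codebooks):
    """Assign each subvector to its single nearest sub-codeword (plain PQ)."""
    d = len(vec) // len(codebooks)
    code = []
    for m, cb in enumerate(codebooks):
        sub = vec[m * d:(m + 1) * d]
        code.append(min(range(len(cb)),
                        key=lambda k: sum((a - b) ** 2 for a, b in zip(sub, cb[k]))))
    return tuple(code)

def adc_tables(query, codebooks):
    """Squared distance from each query subvector to every sub-codeword."""
    d = len(query) // len(codebooks)
    return [[sum((a - b) ** 2 for a, b in zip(query[m * d:(m + 1) * d], w))
             for w in cb] for m, cb in enumerate(codebooks)]

def approx_dist(code, tables):
    """Approximate query-to-point distance: one table lookup per subspace."""
    return sum(t[k] for t, k in zip(tables, code))
```

OCKM's change is in `encode`: it selects multiple sub-codewords per subspace, drawn from different sub-codebooks, which lowers the distortion of the code at the same lookup-table query cost structure.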
The key task in developing graph-based learning algorithms is constructing an
informative graph to express the contextual information of a data manifold. Since traditional graph
construction methods are sensitive to noise and less datum-adaptive to changes in density, a new method
called ℓ1-graph was proposed recently. A graph construction needs to have two important properties: sparsity
and locality. The ℓ1-graph has a strong sparsity property, but a weak locality property. Thus, we propose a
new method of constructing an informative graph using auto-grouped sparse regularization based on the ℓ1-
graph, which we call the Group Sparse graph (GS-graph). We also show how to efficiently construct a GS-
graph in reproducing kernel Hilbert space with the kernel trick. The new methods, the GS-graph and its
kernelized version (KGS-graph), have the same noise-insensitive property as the ℓ1-graph and can
successfully preserve the properties of sparsity and locality simultaneously. Furthermore, we integrate the
proposed graph with several graph-based learning algorithms to demonstrate the effectiveness of our method.
The empirical studies on benchmarks show that the proposed methods outperform the ℓ1-graph and other
traditional graph construction methods in various learning tasks.
ETPL DM-008 Graph-based Learning via Auto-Grouped Sparse Regularization and Kernelized Extension
Location-based services (LBS) enable mobile users to query points-of-interest (e.g., restaurants,
cafes) on various features (e.g., price, quality, variety). In addition, users require accurate query results with
up-to-date travel times. Lacking the monitoring infrastructure for road traffic, the LBS may obtain live travel
times of routes from online route APIs in order to offer accurate results. Our goal is to reduce the number of
requests issued by the LBS significantly while preserving accurate query results. First, we propose to exploit
recent routes requested from route APIs to answer queries accurately. Then, we design effective lower/upper
bounding techniques and ordering techniques to process queries efficiently. Also, we study parallel route
requests to further reduce the query response time. Our experimental evaluation shows that our solution is 3
times more efficient than a competitor, and yet achieves high result accuracy (above 98%).
ETPL DM-009 Route-Saver: Leveraging Route APIs for Accurate and Efficient Query Processing at Location-Based Services
The knowledge remembered by the human body and reflected by the dexterity of body motion is
called embodied knowledge. In this paper, we propose a new method using singular value decomposition for
extracting embodied knowledge from time-series motion data. We compose a matrix from the time-
series data and use the left singular vectors of the matrix as motion patterns, and the singular values
as scalar weights indicating how strongly each corresponding left singular vector contributes to the matrix. Two experiments were
conducted to validate the method. One is a gesture recognition experiment in which we categorize gesture
motions by two kinds of models with indexes of similarity and estimation that use left singular vectors. The
proposed method obtained a higher correct categorization ratio than principal component analysis (PCA) and
correlation efficiency (CE). The other is an ambulation evaluation experiment in which we distinguished the
levels of walking disability. The first singular values derived from the walking acceleration were suggested to
be a reliable criterion to evaluate walking disability. Finally we discuss the characteristic and significance of
the embodied knowledge extraction using the singular value decomposition proposed in this paper.
ETPL DM-010 Knowledge Acquisition Method based on Singular Value Decomposition for Human Motion Analysis
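The first singular value and left singular vector that the ambulation evaluation relies on can be obtained without a full SVD, for example by power iteration on A^T A. A pure-Python sketch, where the row-major matrix layout and the fixed iteration count are assumptions:

```python
import math

def first_singular(A, iters=100):
    """Leading singular triple of matrix A via power iteration on A^T A."""
    rows, cols = len(A), len(A[0])
    v = [1.0] * cols
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]   # A v
        z = [sum(A[i][j] * w[i] for i in range(rows)) for j in range(cols)]   # A^T A v
        norm = math.sqrt(sum(x * x for x in z))
        v = [x / norm for x in z]
    Av = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av]       # left singular vector = motion pattern
    return sigma, u, v
```

Here `u` plays the role of the extracted motion pattern and `sigma` the scalar weight discussed in the abstract.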
Given a scoring function that computes the score of a pair of objects, a top-k pairs query returns k
pairs with the smallest scores. In this paper, we present a unified framework for answering generic top-k pairs
queries including k-closest pairs queries, k- furthest pairs queries and their variants. Note that k-closest pairs
query is a special case of top-k pairs queries where the scoring function is the distance between the two objects
in a pair. We are the first to present a unified framework to efficiently answer a broad class of top-k queries
including the queries mentioned above. We present efficient algorithms and provide a detailed theoretical
analysis that demonstrates that the expected performance of our proposed algorithms is optimal for two
dimensional data sets. Furthermore, our framework does not require pre-built indexes, uses limited main
memory and is easy to implement. We also extend our techniques to support top-k pairs queries on multi-
valued (or uncertain) objects. We also demonstrate that our framework can handle exclusive top-k pairs
queries. Our extensive experimental study demonstrates effectiveness and efficiency of our proposed
techniques.
ETPL DM-011 A Unified Framework for Answering k Closest Pairs Queries and Variants
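The query semantics are easy to pin down with a brute-force baseline: enumerate the pairs and keep the k smallest under an arbitrary scoring function. The paper's contribution is precisely avoiding this full enumeration, but the baseline shows how one generic scoring function covers closest pairs, furthest pairs, and their variants.

```python
import heapq
from itertools import combinations

def top_k_pairs(objects, k, score):
    """Brute-force baseline: enumerate all pairs, keep the k smallest scores."""
    return heapq.nsmallest(k, combinations(objects, 2), key=lambda p: score(*p))

pts = [(0, 0), (1, 0), (5, 5), (1, 1)]
sq_dist = lambda a, b: (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
closest = top_k_pairs(pts, 2, sq_dist)                       # k-closest pairs
furthest = top_k_pairs(pts, 2, lambda a, b: -sq_dist(a, b))  # k-furthest pairs
```

Negating the score turns a closest-pairs query into a furthest-pairs query, which is the sense in which one framework answers the whole family.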
In this paper, we tackle the problem of discovering movement-based communities of users,
where users in the same community have similar movement behaviors. Note that the identification of
movement-based communities is beneficial to location-based services and trajectory recommendation services.
Specifically, we propose a framework to mine movement-based communities which consists of three phases: 1)
constructing trajectory profiles of users, 2) deriving similarity between trajectory profiles, and 3) discovering
movement-based communities. In the first phase, we design a data structure, called the Sequential Probability
tree (SP-tree), as a user trajectory profile. SP-trees not only derive sequential patterns, but also indicate
transition probabilities of movements. Moreover, we propose two algorithms: BF (standing for Breadth-First)
and DF (standing for Depth-First) to construct SP-tree structures as user profiles. To measure the similarity
values among users’ trajectory profiles, we further develop a similarity function that takes SP-tree information
into account. In light of the similarity values derived, we formulate an objective function to evaluate the
quality of communities. According to the objective function derived, we propose a greedy algorithm Geo-
Cluster to effectively derive communities. To evaluate our proposed algorithms, we have conducted
comprehensive experiments on two real datasets. The experimental results show that our proposed framework
can effectively discover movement-based user communities.
ETPL DM-012 Exploring Sequential Probability Tree for Movement-based Community Discovery
In the literature about association analysis, many interestingness measures have been
proposed to assess the quality of obtained association rules in order to select a small set of the most interesting
among them. In the particular case of hierarchically organized items and generalized association rules
connecting them, a measure that dealt appropriately with the hierarchy would be advantageous. Here we
present the further developments of a new class of such hierarchical interestingness measures and compare
them with a large set of conventional measures and with three hierarchical pruning methods from the
literature. The aim is to find interesting pairwise generalized association rules connecting the concepts of
multiple ontologies. Interested in the broad empirical evaluation of interestingness measures, we compared the
rules obtained by 39 methods on three real world datasets against predefined ground truth sets of associations.
To this end, we adopted a framework of instance-based ontology matching and extended the set of performance
measures by two novel measures: relation learning recall and precision which take into account hierarchical
relationships between rules.
ETPL DM-013 Evaluation of hierarchical interestingness measures for mining pairwise generalized association rules
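For concreteness, these are the classic non-hierarchical interestingness measures such evaluations start from, computed for a rule A ⇒ C over set-valued transactions; hierarchy-aware measures build on the same counts while additionally weighting the positions of A and C in the ontology.

```python
def rule_measures(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule `antecedent => consequent`."""
    n = len(transactions)
    a = sum(antecedent <= t for t in transactions)     # transactions with antecedent
    c = sum(consequent <= t for t in transactions)     # transactions with consequent
    both = sum((antecedent | consequent) <= t for t in transactions)
    support = both / n
    confidence = both / a if a else 0.0
    lift = confidence / (c / n) if c else 0.0
    return support, confidence, lift
```

A lift above 1 indicates the antecedent raises the probability of the consequent, which is the kind of signal the 39 compared methods rank in different ways.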
In some real world applications, like information retrieval and data classification, we often
confront the situation that the same semantic concept can be expressed using different views with similar
information. Thus, how to obtain a certain Semantically Consistent Patterns (SCP) for cross-view data, which
embeds the complementary information from different views, is of great importance for those applications.
However, the heterogeneity among cross-view representations brings a significant challenge on mining the
SCP. In this paper, we propose a general framework to discover the SCP for cross-view data. Specifically,
aiming at building a feature-isomorphic space among different views, a novel Isomorphic Relevant Redundant
Transformation (IRRT) is first proposed. The IRRT linearly maps multiple heterogeneous low-level feature
spaces to a high-dimensional redundant feature-isomorphic one, which we call the mid-level space. Thus,
much more complementary information from different views can be captured. Furthermore, to mine the
semantic consistency among the isomorphic representations in the mid-level space, we propose a new
Correlation-based Joint Feature Learning (CJFL) model to extract a unique high-level semantic subspace
shared across the feature-isomorphic data. Consequently, the SCP for cross-view data can be obtained.
Comprehensive experiments on three datasets demonstrate the advantages of our framework in classification
and retrieval.
ETPL DM-014 Mining Semantically Consistent Patterns for Cross-View Data
Given a real world graph, how should we lay out its edges? How can we compress it? These
questions are closely related, and the typical approach so far is to find clique-like communities, like the
‘cavemen graph’, and compress them. We show that the block-diagonal mental image of the ‘cavemen graph’
is the wrong paradigm, in full agreement with earlier results that real world graphs have no good cuts. Instead,
we propose to envision graphs as a collection of hubs connecting spokes, with super-hubs connecting the hubs,
and so on, recursively. Based on the idea, we propose the SLASHBURN method to recursively split a graph
into hubs and spokes connected only by the hubs. We also propose techniques to select the hubs and give an
ordering to the spokes, in addition to the basic SLASHBURN. We give theoretical analysis of the proposed
hub selection methods. Our viewpoint has several advantages: (a) it avoids the ‘no good cuts’ problem, (b) it
gives better compression, and (c) it leads to faster execution times for matrix-vector operations, which are the
backbone of most graph processing tools. Through experiments, we show that SLASHBURN consistently
outperforms other methods for all datasets, resulting in better compression and faster running time. Moreover,
we show that SLASHBURN with the appropriate spokes ordering can further improve compression while
hardly sacrificing the running time.
ETPL DM-015 SlashBurn: Graph Compression and Mining beyond Caveman Communities
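The hub-and-spoke decomposition can be sketched directly from the description: repeatedly slash the highest-degree node(s), then burn off every connected component except the giant one as spokes. A compact version with one hub slashed per iteration; the paper's spoke-ordering refinements are omitted.

```python
def slashburn(adj, k=1):
    """Return (hubs, spokes): hubs slashed per round, non-giant leftovers burned."""
    adj = {u: set(vs) for u, vs in adj.items()}
    hubs, spokes = [], []
    while adj:
        # slash: remove the k highest-degree nodes and their edges
        for h in sorted(adj, key=lambda u: len(adj[u]), reverse=True)[:k]:
            for v in adj.pop(h):
                adj[v].discard(h)
            hubs.append(h)
        # burn: find connected components of the remainder
        comps, seen = [], set()
        for s in adj:
            if s in seen:
                continue
            comp, stack = [], [s]
            seen.add(s)
            while stack:
                u = stack.pop()
                comp.append(u)
                stack += [v for v in adj[u] if v not in seen]
                seen.update(adj[u])
            comps.append(comp)
        comps.sort(key=len)
        for comp in comps[:-1]:         # everything but the giant component
            for u in comp:
                adj.pop(u)
                spokes.append(u)
    return hubs, spokes
```

Ordering hubs first and spokes last concentrates nonzeros of the adjacency matrix, which is what yields the compression and the faster matrix-vector products the abstract reports.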
Subgraph similarity search is used in graph databases to retrieve graphs whose subgraphs are
similar to a given query graph. It has been proven successful in a wide range of applications including
bioinformatics and cheminformatics. Due to the cost of providing efficient similarity search services on
ever-increasing graph data, database outsourcing is apparently an appealing solution to database owners.
Unfortunately, query service providers may be untrusted or compromised by attacks. To our knowledge, no
studies have been carried out on the authentication of the search. In this paper, we propose authentication
techniques that follow the popular filtering-and-verification framework. We propose an authentication-friendly
metric index called GMTree. Specifically, we transform the similarity search into a search in a graph metric
space and derive small verification objects (VOs) to be transmitted to query clients. To further optimize
GMTree, we propose a sampling-based pivot selection method and an authenticated version of MCS
computation. Our comprehensive experiments verified the effectiveness and efficiency of our proposed
techniques.
ETPL DM-016 Authenticated Subgraph Similarity Search in Outsourced Graph Databases
Keyword search is a useful tool for exploring large RDF datasets. Existing techniques
either rely on constructing a distance matrix for pruning the search space or building summaries from the RDF
graphs for query processing. In this work, we show that existing techniques have serious limitations in dealing
with realistic, large RDF data with tens of millions of triples. Furthermore, the existing summarization
techniques may lead to incorrect/incomplete results. To address these issues, we propose an effective
summarization algorithm to summarize the RDF data. Given a keyword query, the summaries lend significant
pruning powers to exploratory keyword search and result in much better efficiency compared to previous
works. Unlike existing techniques, our search algorithms always return correct results. Besides, the summaries
we built can be updated incrementally and efficiently. Experiments on both benchmark and large real RDF
data sets show that our techniques are scalable and efficient.
ETPL DM-017 Scalable Keyword Search on Large RDF Data
Mobile devices with geo-positioning capabilities (e.g., GPS) enable users to access
information that is relevant to their present location. Users are interested in querying about points of interest
(POI) in their physical proximity, such as restaurants, cafes, ongoing events, etc. Entities specialized in various
areas of interest (e.g., certain niche directions in arts, entertainment, travel) gather large amounts of geo-tagged
data that appeal to subscribed users. Such data may be sensitive due to their contents. Furthermore, keeping
such information up-to-date and relevant to the users is not an easy task, so the owners of such datasets will
make the data accessible only to paying customers. Users send their current location as the query parameter,
and wish to receive as result the nearest POIs, i.e., nearest-neighbors (NNs). But typical data owners do not
have the technical means to support processing queries on a large scale, so they outsource data storage and
querying to a cloud service provider. Many such cloud providers exist who offer powerful storage and
computational infrastructures at low cost. However, cloud providers are not fully trusted, and typically behave
in an honest-but-curious fashion. Specifically, they follow the protocol to answer queries correctly, but they
also collect the locations of the POIs and the subscribers for other purposes. Leakage of POI locations can lead
to privacy breaches as well as financial losses to the data owners, for whom the POI dataset is an important
source of revenue. Disclosure of user locations leads to privacy violations and may deter subscribers from using
the service altogether. In this paper, we propose a family of techniques that allow processing of NN queries in
an untrusted outsourced environment, while at the same time protecting both the POI and querying users’
positions. Our techniques rely on mutable order preserving encoding (mOPE), the only secure order-preserving
encryption method known to-date. We also provide performance optimizations to decrease the computational
cost inherent to processing on encrypted data, and we consider the case of incrementally updating datasets.
ETPL DM-018 Secure kNN Query Processing in Untrusted Cloud Environments
The multiple longest common subsequence (MLCS) problem, related to the identification of sequence
similarity, is an important problem in many fields. As an NP-hard problem, its exact algorithms have difficulty
in handling large-scale data and time- and space-efficient algorithms are required in real-world applications.
To deal with time constraints, anytime algorithms have been proposed to generate good solutions within a
reasonable time. However, there is little work on space-efficient MLCS algorithms. In this paper, we
formulate the MLCS problem into a graph search problem and present two space-efficient anytime MLCS
algorithms, SA-MLCS and SLA-MLCS. SA-MLCS uses an iterative beam widening search strategy to reduce
space usage during the iterative process of finding better solutions. Based on SA-MLCS, a space-
bounded algorithm, SLA-MLCS, is developed to keep space usage from exceeding available memory. SLA-MLCS uses a
replacing strategy when SA-MLCS reaches a given space bound. Experimental results show SA-MLCS and
SLA-MLCS use an order of magnitude less space and time than the state-of-the-art approximate algorithm
MLCS-APP while finding better solutions. Compared to the state-of-the-art anytime algorithm Pro-MLCS,
SA-MLCS and SLA-MLCS can solve an order of magnitude larger size instances. Furthermore, SLA-MLCS
can find much better solutions than SA-MLCS on large size instances.
ETPL
DM - 019
A Space-Bounded Anytime Algorithm for the Multiple Longest Common
Subsequence Problem
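The iterative beam-widening idea behind SA-MLCS can be sketched as follows. This is a minimal illustration, not the authors' implementation: the beam-scoring heuristic and the doubling widening schedule are assumptions of this sketch.

```python
def beam_mlcs(seqs, width):
    """One beam-search pass over position-vector states: a state records,
    for each sequence, the index of the last matched character; extending
    by a symbol advances every index to that symbol's next occurrence."""
    alphabet = set(seqs[0]).intersection(*map(set, seqs[1:]))
    beam = [((-1,) * len(seqs), "")]
    best = ""
    while beam:
        nxt = []
        for pos, sub in beam:
            for ch in alphabet:
                new = []
                for s, p in zip(seqs, pos):
                    i = s.find(ch, p + 1)
                    if i < 0:
                        break
                    new.append(i)
                else:
                    nxt.append((tuple(new), sub + ch))
        # keep the `width` most promising states: longer subsequences first,
        # then states whose match positions are furthest to the left
        nxt.sort(key=lambda st: (len(st[1]), [-p for p in st[0]]), reverse=True)
        beam = nxt[:width]
        for _, sub in beam:
            if len(sub) > len(best):
                best = sub
    return best

def anytime_mlcs(seqs, max_width=8):
    """Iterative beam widening: rerun with a doubled beam width, keeping
    the best subsequence found so far (the anytime behaviour)."""
    best, w = "", 1
    while w <= max_width:
        cand = beam_mlcs(seqs, w)
        if len(cand) > len(best):
            best = cand
        w *= 2
    return best
```

Each widening pass costs more space but can only improve the answer, which is what makes the scheme anytime.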
As machine learning techniques mature and are used to tackle complex scientific problems,
challenges arise such as the imbalanced class distribution problem, where one of the target class labels is under-
represented in comparison with other classes. Existing oversampling approaches for addressing this problem
typically do not consider the probability distribution of the minority class while synthetically generating new
samples. As a result, the minority class is not represented well which leads to high misclassification error. We
introduce two probabilistic oversampling approaches, namely RACOG and wRACOG, to synthetically
generate and strategically select new minority class samples. The proposed approaches use the joint
probability distribution of data attributes and Gibbs sampling to generate new minority class samples. While
RACOG selects samples produced by the Gibbs sampler based on a predefined lag, wRACOG selects those
samples that have the highest probability of being misclassified by the existing learning model. We validate our
approach using nine UCI datasets that were carefully modified to exhibit class imbalance and one new
application domain dataset with inherent extreme class imbalance. In addition, we compare the classification
performance of the proposed methods with three other existing resampling techniques.
ETPL
DM - 020
RACOG and wRACOG: Two Probabilistic Oversampling Techniques
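The Gibbs-sampling idea behind RACOG can be sketched for discrete attributes as below. This is a hedged illustration only: the smoothed empirical conditional and its weighting are assumptions of this sketch, standing in for the joint attribute distribution the paper actually learns.

```python
import random

def gibbs_oversample(minority, n_new, lag=5, seed=0):
    """RACOG-style sketch: run a Gibbs sampler over the attribute vector
    and keep every `lag`-th state as a synthetic minority sample."""
    rng = random.Random(seed)
    d = len(minority[0])
    values = [sorted({row[i] for row in minority}) for i in range(d)]
    state = list(rng.choice(minority))   # start the chain at a real sample
    out, step = [], 0
    while len(out) < n_new:
        i = step % d
        weights = []
        for v in values[i]:
            # count rows agreeing with the rest of the current state,
            # backing off to the marginal count (Laplace-smoothed)
            match = sum(1 for row in minority if row[i] == v and
                        all(row[j] == state[j] for j in range(d) if j != i))
            marg = sum(1 for row in minority if row[i] == v)
            weights.append(1 + 5 * match + marg)
        state[i] = rng.choices(values[i], weights=weights)[0]
        step += 1
        if step % lag == 0:              # RACOG's predefined-lag selection
            out.append(tuple(state))
    return out
```

wRACOG would replace the fixed-lag selection with a filter that keeps the states the current model is most likely to misclassify.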
Malware is pervasive in networks, and poses a critical threat to network security. However, we have
very limited understanding of malware behavior in networks to date. In this paper, we investigate how
malware propagates in networks from a global perspective. We formulate the problem and establish a rigorous
two-layer epidemic model for malware propagation from network to network. Based on the proposed model,
our analysis indicates that the distribution of a given malware follows an exponential distribution, a power law
distribution with a short exponential tail, and a power law distribution at its early, late, and final stages,
respectively. Extensive experiments have been performed through two real-world global scale malware data
sets, and the results confirm our theoretical findings.
ETPL
DM - 021
Malware Propagation in Large-Scale Networks
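The network-to-network setting can be illustrated with a toy two-layer susceptible-infected simulation. This is purely illustrative: the rates, topology, seeding, and landing rule below are assumptions of the sketch, not the paper's model.

```python
import random

def two_layer_si(networks, net_links, p_intra=0.5, p_inter=0.1,
                 steps=30, seed=1):
    """Toy two-layer SI spread: the lower layer is a list of networks
    (each a dict node -> neighbours); the upper layer is a dict of
    network -> reachable peer networks. Infection spreads along
    intra-network edges with p_intra and jumps between networks
    (landing on an arbitrary node) with p_inter."""
    rng = random.Random(seed)
    first = {n: sorted(networks[n])[0] for n in range(len(networks))}
    infected = {(0, first[0])}           # patient zero in network 0
    for _ in range(steps):
        new = set(infected)
        for net, node in infected:
            for nb in networks[net][node]:
                if rng.random() < p_intra:
                    new.add((net, nb))
            for peer in net_links.get(net, ()):
                if rng.random() < p_inter:
                    new.add((peer, first[peer]))
        infected = new
    return infected
```

Counting infections per network over many runs is one way to inspect the stage-dependent distributions the analysis predicts.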
This paper focuses on an important query in scientific simulation data analysis: the Spatial
Distance Histogram (SDH). The computation time of an SDH query using the brute-force method is quadratic.
Often, such queries are executed continuously over certain time periods, increasing the computation time. We
propose a highly efficient approximate algorithm to compute SDH over consecutive time periods with provable
error bounds. The key idea of our algorithm is to derive statistical distribution of distances from the spatial and
temporal characteristics of particles. Upon organizing the data into a Quad-tree based structure, the
spatiotemporal characteristics of particles in each node of the tree are acquired to determine the particles’
spatial distribution as well as their temporal locality in consecutive time periods. We report our efforts in
implementing and optimizing the above algorithm on Graphics Processing Units (GPUs) as a means to further
improve the efficiency. The accuracy and efficiency of the proposed algorithm are backed by mathematical
analysis and results of extensive experiments using data generated from real simulation studies.
ETPL
DM - 022
Computing Spatial Distance Histograms for Large Scientific Datasets On-the-
Fly
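The quadratic brute-force baseline that the approximate algorithm improves on can be written directly:

```python
from itertools import combinations
from math import dist

def spatial_distance_histogram(points, bucket_width, n_buckets):
    """Brute-force SDH: count every pairwise particle distance into
    fixed-width buckets; O(n^2) in the number of particles."""
    hist = [0] * n_buckets
    for p, q in combinations(points, 2):
        b = int(dist(p, q) // bucket_width)
        if b < n_buckets:
            hist[b] += 1
    return hist
```

The paper's approach avoids enumerating pairs by reasoning about whole Quad-tree nodes and their temporal locality instead.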
Although mostly used for pattern classification, linear discriminant analysis (LDA) can also be used in
feature selection as an effective measure to evaluate the separative ability of a feature subset. When applied to
feature selection on high-dimensional small-sized (HDSS) data, which generally exhibit class imbalance, LDA
encounters four problems: singularity of the scatter matrix, overfitting, overwhelming, and prohibitive
computational complexity. In this study, we propose the LDA-based feature selection method MCE-LDA
(minority class emphasized linear discriminant analysis) with a new regularization technique to address the
first three problems. Unlike conventional forms of regularization, which give equal or more emphasis to the
majority class, the proposed regularization places more emphasis on the minority class, with the expectation of
improving overall performance by alleviating both the overwhelming of the minority class by the majority class
and overfitting in the minority class. To reduce computational overhead, an incremental implementation of
LDA-based feature selection has been introduced. Comparative studies with other forms of regularization to
LDA as well as with other popular feature selection methods on five HDSS problems show that MCE-LDA
can produce feature subsets with excellent performance in both classification and robustness. Further
experimental results of true positive rate (TPR) and true negative rate (TNR) have also verified the
effectiveness of the proposed technique in alleviating overwhelming and overfitting problems.
ETPL
DM - 023
Emphasizing Minority Class in LDA for Feature Subset Selection on High-
Dimensional Small-Sized Problems
In a top-k Geometric Intersection Query (top-k GIQ) problem, a set of n weighted,
geometric objects in R^d is to be pre-processed into a compact data structure so that for any query geometric
object, q, and integer k > 0, the k largest-weight objects intersected by q can be reported efficiently. While the
top-k problem has been studied extensively for non-geometric problems (e.g., recommender systems), the
geometric version has received little attention. This paper gives a general technique to solve any top-k GIQ
problem efficiently. The technique relies only on the availability of an efficient solution for the underlying (non-
top-k) GIQ problem, which is often the case. Using this, asymptotically efficient solutions are derived for
several top-k GIQ problems, including top-k orthogonal and circular range search, point enclosure search,
halfspace range search, etc. Implementations of some of these solutions, using practical data structures, show
that they are quite efficient in practice. This paper also does a formal investigation of the hardness of the top-k
GIQ problem, which reveals interesting connections between the top-k GIQ problem and the underlying (non-
top-k) GIQ problem.
ETPL
DM - 024
A General Technique for Top-k Geometric Intersection Query Problems
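As a concrete instance, a top-k orthogonal range query has the following linear-scan baseline; the paper's contribution is answering such queries asymptotically faster via the underlying (non-top-k) GIQ structure.

```python
import heapq

def topk_range_search(objects, query, k):
    """Baseline top-k orthogonal range query: among axis-aligned
    rectangles (x1, y1, x2, y2, weight), report the k largest-weight
    ones that intersect the query rectangle."""
    qx1, qy1, qx2, qy2 = query

    def intersects(r):
        x1, y1, x2, y2, _ = r
        return x1 <= qx2 and qx1 <= x2 and y1 <= qy2 and qy1 <= y2

    # heapq.nlargest keeps only k candidates in memory during the scan
    return heapq.nlargest(k, (r for r in objects if intersects(r)),
                          key=lambda r: r[4])
```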
Ensemble learning has become a common tool for data stream classification, being able to handle
large volumes of stream data and concept drifting. Previous studies focus on building accurate ensemble
models from stream data. However, a linear scan of a large number of base classifiers in the ensemble during
prediction incurs significant costs in response time, preventing ensemble learning from being practical for
many real world time-critical data stream applications, such as Web traffic stream monitoring, spam detection,
and intrusion detection. In these applications, data streams usually arrive at a speed of GB/second, and it is
necessary to classify each stream record in a timely manner. To address this problem, we propose a novel
Ensemble-tree (E-tree for short) indexing structure to organize all base classifiers in an ensemble for
fast prediction. On one hand, E-trees treat ensembles as spatial databases and employ an R-tree-like
height-balanced structure to reduce the expected prediction time from linear to sub-linear complexity. On the
other hand, E-trees can automatically update themselves by continuously integrating new classifiers and
discarding outdated ones, adapting well to new trends and patterns underneath data streams. Theoretical
analysis and empirical studies on both synthetic and real-world data streams demonstrate the performance of
our approach.
ETPL
DM - 025
E-Tree: An Efficient Indexing Structure for Ensemble Models on Data Streams
Affinity Propagation (AP) clustering has been successfully used in many
clustering problems. However, most of the applications deal with static data. This paper considers how to
apply AP in incremental clustering problems. Firstly, we point out the difficulties in Incremental Affinity
Propagation (IAP) clustering, and then propose two strategies to solve them. Correspondingly, two IAP
clustering algorithms are proposed. They are IAP clustering based on K-Medoids (IAPKM) and IAP clustering
based on Nearest Neighbor Assignment (IAPNA). Five popular labeled data sets, real world time series and a
video are used to test the performance of IAPKM and IAPNA. Traditional AP clustering is also implemented
to provide benchmark performance. Experimental results show that IAPKM and IAPNA can achieve
comparable clustering performance with traditional AP clustering on all the data sets. Meanwhile, the time
cost is dramatically reduced in IAPKM and IAPNA. Both the effectiveness and the efficiency make IAPKM
and IAPNA well suited to incremental clustering tasks.
ETPL
DM - 026
Incremental Affinity Propagation Clustering Based on Message Passing
The key task in developing graph-based learning algorithms is constructing an informative graph to
express the contextual information of a data manifold. Since traditional graph construction methods are
sensitive to noise and less datum-adaptive to changes in density, a new method called ℓ1-graph was proposed
recently. A graph construction needs to have two important properties: sparsity and locality. The ℓ1-graph has
a strong sparsity property, but a weak locality property. Thus, we propose a new method of constructing an
informative graph using auto-grouped sparse regularization based on the ℓ1-graph, which we call the Group
Sparse graph (GS-graph). We also show how to efficiently construct a GS-graph in reproducing kernel Hilbert
space with the kernel trick. The new methods, the GS-graph and its kernelized version (KGS-graph), have the
same noise-insensitive property as the ℓ1-graph while preserving the properties of
sparsity and locality simultaneously. Furthermore, we integrate the proposed graph with several graph-based
learning algorithms to demonstrate the effectiveness of our method. The empirical studies on benchmarks
show that the proposed methods outperform the ℓ1-graph and other traditional graph construction methods in
various learning tasks.
ETPL
DM - 027 Graph-based Learning via Auto-Grouped Sparse Regularization and
Kernelized Extension
Location-based services (LBS) enable mobile users to query points-of-interest (e.g., restaurants,
cafes) on various features (e.g., price, quality, variety). In addition, users require accurate query results with
up-to-date travel times. Lacking the monitoring infrastructure for road traffic, the LBS may obtain live travel
times of routes from online route APIs in order to offer accurate results. Our goal is to reduce the number of
requests issued by the LBS significantly while preserving accurate query results. First, we propose to exploit
recent routes requested from route APIs to answer queries accurately. Then, we design effective lower/upper
bounding techniques and ordering techniques to process queries efficiently. Also, we study parallel route
requests to further reduce the query response time. Our experimental evaluation shows that our solution is 3
times more efficient than a competitor, and yet achieves high result accuracy (above 98%).
ETPL
DM - 028 Route-Saver: Leveraging Route APIs for Accurate and Efficient Query
Processing at Location-Based Services
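The lower/upper-bounding idea can be sketched as follows. The bound functions here are hypothetical stand-ins (e.g., derived from recently cached routes and straight-line distances); this is not the paper's exact pruning rule.

```python
def prune_by_bounds(pois, k, lower, upper):
    """A POI whose lower-bound travel time exceeds the k-th smallest
    upper bound can never be a top-k answer, so no (costly) route-API
    request needs to be issued for it. `lower(p)`/`upper(p)` are
    assumed bounds on the live travel time to POI p."""
    ups = sorted(upper(p) for p in pois)
    cutoff = ups[k - 1] if len(ups) >= k else float("inf")
    return [p for p in pois if lower(p) <= cutoff]
```

Only the survivors are ordered and refreshed with live routes, which is where the saving in API requests comes from.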
Recent large-scale hierarchical classification tasks typically have tens of thousands of classes on
which the most widely used approach to multiclass classification--one-versus-rest--becomes intractable due to
computational complexity. The top-down methods are usually adopted instead, but they are less accurate
because of the so-called error-propagation problem in their classifying phase. To address this problem, this
paper proposes a meta-top-down method that employs metaclassification to enhance the normal top-down
classifying procedure. The proposed method is first analyzed theoretically on complexity and accuracy, and
then applied to five real-world large-scale data sets. The experimental results indicate that the classification
accuracy is largely improved, while the increased time costs are smaller than most of the existing approaches.
ETPL
DM - 029 A Meta-Top-Down Method for Large-Scale Hierarchical Classification
Creating an efficient and economic trip plan is the most annoying job for a backpack traveler.
Although travel agencies can provide some predefined itineraries, they are not tailored for each specific
customer. Previous efforts address the problem by providing an automatic itinerary planning service, which
organizes the points of interest (POIs) into a customized itinerary. Because the search space of all possible
itineraries is too costly to fully explore, most existing work simplifies the problem by assuming that the user's trip is limited
to some important POIs and will be completed within one day. To address this limitation, in this paper, we
design a more general itinerary planning service, which generates multiday itineraries for the users. In our
service, all POIs are considered and ranked based on the users' preference. The problem of searching the
optimal itinerary is a team orienteering problem (TOP), a well-known NP-complete problem. To reduce the
processing cost, a two-stage planning scheme is proposed. In its preprocessing stage, single-day itineraries are
precomputed via MapReduce jobs. In its online stage, an approximate search algorithm is used to combine
the single-day itineraries. In this way, we transform the TOP problem, which has no polynomial-time approximation, into
another NP-complete problem (the set-packing problem) that does have good approximation algorithms. Experiments on real
data sets show that our approach can generate high-quality itineraries efficiently.
ETPL
DM - 030 Automatic Itinerary Planning for Traveling Services
Time provides context for all our experiences, cognition, and coordinated collective action. Prior
research in linguistics, artificial intelligence, and temporal databases suggests the need to differentiate between
temporal facts with goal-related semantics (i.e., telic) from those that are intrinsically devoid of culmination (i.e.,
atelic). To differentiate between telic and atelic data semantics in conceptual database design, we propose an
annotation-based temporal conceptual model that generalizes the semantics of a conventional conceptual
model. Our temporal conceptual design approach involves: 1) capturing "what" semantics using a
conventional conceptual model; 2) employing annotations to differentiate between telic and atelic data
semantics that help capture "when" semantics; 3) specifying temporal constraints, specifically nonsequenced
semantics, in the temporal data dictionary as metadata. Our proposed approach provides a mechanism to
represent telic/atelic temporal semantics using temporal annotations. We also show how these semantics can
be formally defined using constructs of the conventional conceptual model and axioms in first-order logic. Via
what we refer to as the "semantics of composition," i.e., semantics implied by the interaction of annotations,
we illustrate the logical consequences of representing telic/atelic data semantics during temporal conceptual
design.
ETPL
DM - 031 Capturing Telic/Atelic Temporal Data Semantics: Generalizing Conventional
Conceptual Models
The extended space forest is a new method for decision tree construction in which training is done with input
vectors including all the original features and their random combinations. The combinations are generated with
a difference operator applied to random pairs of original features. The experimental results show that extended
space versions of ensemble algorithms have better performance than the original ensemble algorithms. To
investigate the success dynamics of the extended space forest, the individual accuracy and diversity creation
powers of ensemble algorithms are compared. The extended space forest creates more diversity than Bagging
and Rotation Forest when it uses all the input features, and achieves higher individual accuracy than the
Random Subspace and Random Forest methods when it uses random selection of the features. It needs more
training time than the original algorithms because it uses more features, but its testing time is lower
because it generates less complex base learners.
ETPL
DM - 032
Classifier Ensembles with the Extended Space Forest
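The extended-space construction itself is a small transformation of the training vectors; a sketch, with the random pairing as described and everything else (seeding, tuple layout) an implementation choice of this illustration:

```python
import random

def extend_space(X, n_extra, seed=0):
    """Append n_extra new features, each the difference of a random pair
    of original features; any ensemble (Bagging, Rotation Forest, ...)
    is then trained on the widened vectors."""
    rng = random.Random(seed)
    d = len(X[0])
    pairs = [(rng.randrange(d), rng.randrange(d)) for _ in range(n_extra)]
    return [tuple(row) + tuple(row[i] - row[j] for i, j in pairs)
            for row in X]
```

The same `pairs` must of course be reused at prediction time so that test vectors are widened consistently.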
The visualization of information contained in reports is an important aspect of
human-computer interaction, since both the accuracy and the complexity of relationships among data must be
preserved. Greater attention has been paid to individual report visualization through different types of
standard graphs (histograms, pies, etc.). However, this kind of representation provides separate information
items and gives no support for visualizing their relationships, which are extremely important for most decision
processes. This paper presents a design methodology exploiting the visual language CoDe, based on a logic
paradigm. CoDe organizes the visualization through the CoDe model, which graphically represents
relationships between information items and can be considered a conceptual map of the view. The proposed
design methodology is composed of four phases: the CoDe Modeling and OLAP Operation pattern definition
phases define the CoDe model and underlying metadata information; the OLAP Operation phase physically
extracts data from a data warehouse; and the Report Visualization phase generates the final visualization.
Moreover, a case study on real data is provided.
ETPL
DM - 033 CoDe Modeling of Graph Composition for Data Warehouse Report Visualization
There are numerous applications where we wish to discover unexpected activities in a sequence of
time-stamped observation data; for instance, we may want to detect inexplicable events in transactions at a
website or in video of an airport tarmac. In this paper, we start with a known set A of activities (both
innocuous and dangerous) that we wish to monitor. However, in addition, we wish to identify “unexplained”
subsequences in an observation sequence that are poorly explained (e.g., because they may contain
occurrences of activities that have never been seen or anticipated before, i.e., they are not in A). We formally
define the probability that a sequence of observations is unexplained (totally or partially) w.r.t. A. We develop
efficient algorithms to identify the top-k totally and partially unexplained sequences w.r.t. A. These
algorithms leverage theorems that enable us to speed up the search for totally/partially unexplained sequences.
We describe experiments using real-world video and cyber-security data sets showing that our approach works
well in practice in terms of both running time and accuracy.
ETPL
DM - 034 Discovering the Top-k Unexplained Sequences in Time-Stamped Observation Data
This paper studies the problem of finding objects with durable quality over time in historical time series
databases. For example, a sociologist may be interested in the top 10 web search terms during the period of
some historical events; the police may seek vehicles that move close to a suspect 70 percent of the time
during a certain time period and so on. Durable top-k (DTop-k) and nearest neighbor (DkNN) queries can be
viewed as natural extensions of the standard snapshot top-k and NN queries to timestamped sequences of
values or locations. Although their snapshot counterparts have been studied extensively, to our knowledge,
there is little prior work that addresses this new class of durable queries. Existing methods for DTop-k
processing either apply trivial solutions, or rely on domain-specific properties. Motivated by this, we propose
efficient and scalable algorithms for the DTop-k and DkNN queries, based on novel indexing and query
evaluation techniques. Our experiments show that the proposed algorithms outperform previous and baseline
solutions by a wide margin.
ETPL
DM - 035
Durable Queries over Historical Time Series
As uncertainty is inherent in a wide spectrum of applications such as radio frequency identification
(RFID) networks and location-based services (LBS), there is a strong demand for addressing the uncertainty of
objects. In this paper, we propose a novel indexing structure, named U-Quadtree, to organize uncertain
objects in the multidimensional space such that the queries can be processed efficiently by taking advantage of
U-Quadtree. Particularly, we focus on the range search on multidimensional uncertain objects since it is a
fundamental query in a spatial database. We propose a cost model which carefully considers various factors
that may impact the performance. Then, an effective and efficient index construction algorithm is proposed to
build the optimal U-Quadtree regarding the cost model. We show that U-Quadtree can also efficiently support
other types of queries such as uncertain range query and nearest neighbor query. Comprehensive experiments
demonstrate that our techniques outperform the existing works on multidimensional uncertain objects.
ETPL
DM - 036 Effectively Indexing the Multidimensional Uncertain Objects
The vast majority of existing approaches to opinion feature extraction rely on mining
patterns only from a single review corpus, ignoring the nontrivial disparities in word distributional
characteristics of opinion features across different corpora. In this paper, we propose a novel method to
identify opinion features from online reviews by exploiting the difference in opinion feature statistics across
two corpora, one domain-specific corpus (i.e., the given review corpus) and one domain-independent corpus
(i.e., the contrasting corpus). We capture this disparity via a measure called domain relevance (DR), which
characterizes the relevance of a term to a text collection. We first extract a list of candidate opinion features
from the domain review corpus by defining a set of syntactic dependence rules. For each extracted candidate
feature, we then estimate its intrinsic-domain relevance (IDR) and extrinsic-domain relevance (EDR) scores
on the domain-dependent and domain-independent corpora, respectively. Candidate features that are less
generic (EDR score less than a threshold) and more domain-specific (IDR score greater than another
threshold) are then confirmed as opinion features. We call this interval thresholding approach the intrinsic and
extrinsic domain relevance (IEDR) criterion. Experimental results on two real-world review domains show the
proposed IEDR approach to outperform several other well-established methods in identifying opinion features.
ETPL
DM - 037 Identifying Features in Opinion Mining via Intrinsic and Extrinsic Domain
Relevance
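The interval-thresholding step of the IEDR criterion can be sketched as follows. Note the assumption: normalized corpus frequency stands in for the paper's domain-relevance measure.

```python
def iedr_select(candidates, domain_freq, general_freq, idr_min, edr_max):
    """Keep a candidate feature only if it is domain-specific
    (IDR >= idr_min on the domain corpus) and not too generic
    (EDR <= edr_max on the domain-independent corpus)."""
    def relevance(term, freq):
        total = sum(freq.values()) or 1
        return freq.get(term, 0) / total

    return [t for t in candidates
            if relevance(t, domain_freq) >= idr_min
            and relevance(t, general_freq) <= edr_max]
```

Generic words score high in both corpora and are filtered by the EDR cap, while rare noise fails the IDR floor.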
Social networks model the social activities between individuals, which change as time goes by. Given
the useful information in such dynamic networks, there is a continuous demand for privacy-preserving
data sharing with analyzers, collaborators, or customers. In this paper, we address the privacy risks of identity
disclosures in sequential releases of a dynamic network. To prevent privacy breaches, we propose novel
kw-structural diversity anonymity, where k is the desired privacy level and w is the time period over which an adversary
can monitor a victim to collect attack knowledge. We also present a heuristic algorithm for generating
releases satisfying kw-structural diversity anonymity so that the adversary cannot utilize his knowledge to
reidentify the victim and take advantage. Evaluations on both real and synthetic data sets show that the
proposed algorithm retains much of the characteristics of the networks while ensuring privacy
protection.
ETPL
DM - 038 Identity Protection in Sequential Releases of Dynamic Networks
If knowledge such as classification rules is extracted from sample data in a distributed way, it may be
necessary to combine or fuse these rules. In a conventional approach this would typically be done either by
combining the classifiers' outputs (e.g., in form of a classifier ensemble) or by combining the sets of
classification rules (e.g., by weighting them individually). In this paper, we introduce a new way of fusing
classifiers at the level of parameters of classification rules. This technique is based on the use of probabilistic
generative classifiers using multinomial distributions for categorical input dimensions and multivariate normal
distributions for the continuous ones. That means, we have distributions such as Dirichlet or normal-Wishart
distributions over parameters of the classifier. We refer to these distributions as hyperdistributions or second-
order distributions. We show that fusing two (or more) classifiers can be done by multiplying the
hyperdistributions of the parameters and derive simple formulas for that task. Properties of this new approach
are demonstrated with a few experiments. The main advantage of this fusion approach is that the
hyperdistributions are retained throughout the fusion process. Thus, the fused components may, for example,
be used in subsequent training steps (online training).
ETPL
DM - 039
Knowledge Fusion for Probabilistic Generative Classifiers with Data Mining
Applications
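For the multinomial part of such a classifier, multiplying hyperdistributions has a closed form, since the product of two Dirichlet densities is again Dirichlet (a standard conjugate identity; shown here only for this one parameter block):

```python
def fuse_dirichlet(alpha, beta):
    """Parameter-level fusion: Dir(alpha) * Dir(beta) is proportional to
    Dir(alpha + beta - 1), because the densities multiply as
    prod x_i^(a_i - 1) * prod x_i^(b_i - 1) = prod x_i^(a_i + b_i - 2)."""
    return [a + b - 1 for a, b in zip(alpha, beta)]
```

Because the result is again a Dirichlet, the fused classifier keeps a full hyperdistribution and can continue online training afterwards, which is the advantage the abstract highlights.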
Advanced microarray technologies have made it possible to simultaneously monitor the expression levels of all
genes. An important problem in microarray data analysis is to discover phenotype structures. The goal is to 1)
find groups of samples corresponding to different phenotypes (such as disease or normal), and 2) for each group
of samples, find the representative expression pattern or signature that distinguishes this group from others.
Some methods have been proposed for this problem; however, a common drawback is that the identified signatures
often include a large number of genes but with low discriminative power. In this paper, we propose a g*-
sequence model to address this limitation, where the ordered expression values among genes are profitably
utilized. Compared with the existing methods, the proposed sequence model is more robust to noise and allows
the discovery of signatures with more discriminative power using fewer genes. This is important for the
subsequent analysis by the biologists. We prove that the problem of phenotype structure discovery is NP-
complete. An efficient algorithm, FINDER, is developed, which includes three steps: 1) trivial g*-sequences
identifying, 2) phenotype structure discovery, and 3) refinement. Effective pruning strategies are developed to
further improve the efficiency.
ETPL
DM - 040 Learning Phenotype Structure Using Sequence Model
One-to-many data linkage is an essential task in many domains, yet only a handful of prior publications
have addressed this issue. Furthermore, while traditionally data linkage is performed among entities of the
same type, it is increasingly necessary to develop linkage techniques that link between matching entities of
different types as well. In this paper, we propose a new one-to-many data linkage method that links between
entities of different natures. The proposed method is based on a one-class clustering tree (OCCT) that
characterizes the entities that should be linked together. The tree is built such that it is easy to understand and
transform into association rules, i.e., the inner nodes consist only of features describing the first set of
entities, while the leaves of the tree represent features of their matching entities from the second data set. We
propose four splitting criteria and two different pruning methods which can be used for inducing the OCCT.
The method was evaluated using data sets from three different domains. The results affirm the effectiveness
of the proposed method and show that the OCCT yields better performance in terms of precision and recall
(in most cases it is statistically significant) when compared to a C4.5 decision tree-based linkage method.
ETPL
DM - 041
OCCT: A One-Class Clustering Tree for Implementing One-to-Many Data
Linkage
Feature selection is an important technique for data mining. Despite its importance,
most studies of feature selection are restricted to batch learning. Unlike traditional batch learning methods,
online learning represents a promising family of efficient and scalable machine learning algorithms for large-
scale applications. Most existing studies of online learning require accessing all the attributes/features of
training instances. Such a classical setting is not always appropriate for real-world applications when data
instances are of high dimensionality or it is expensive to acquire the full set of attributes/features. To address
this limitation, we investigate the problem of online feature selection (OFS) in which an online learner is
only allowed to maintain a classifier involving only a small and fixed number of features. The key challenge
of online feature selection is how to make accurate predictions for an instance using a small number of active
features. This is in contrast to the classical setup of online learning where all the features can be used for
prediction. We attempt to tackle this challenge by studying sparsity regularization and truncation techniques.
Specifically, this article addresses two different tasks of online feature selection: 1) learning with full input,
where the learner is allowed to access all the features to decide the subset of active features, and 2) learning
with partial input, where only a limited number of features is allowed to be accessed for each instance by the
learner. We present novel algorithms to solve each of the two problems and give their performance analysis.
The encouraging results of our experiments validate the efficacy and efficiency of the proposed techniques.
ETPL
DM - 042 Online Feature Selection and Its Applications
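The feature-budget idea in the abstract above, keeping only a small fixed number B of active features after each online update, can be sketched as a perceptron with hard truncation. This is an illustrative simplification, not the authors' OFS algorithm; the function names and the learning rate are our own.

```python
import numpy as np

def truncate(w, B):
    """Keep only the B largest-magnitude weights; zero out the rest."""
    if np.count_nonzero(w) <= B:
        return w
    idx = np.argsort(np.abs(w))[:-B]   # indices of all but the top-B weights
    w = w.copy()
    w[idx] = 0.0
    return w

def ofs_perceptron(X, y, B, eta=0.2):
    """Online learning with at most B active features (full-input setting)."""
    w = np.zeros(X.shape[1])
    for x_t, y_t in zip(X, y):
        if y_t * np.dot(w, x_t) <= 0:  # mistake: perceptron update
            w = w + eta * y_t * x_t
            w = truncate(w, B)         # enforce the feature budget
    return w
```

After every update the weight vector has at most B nonzero entries, so prediction needs only those active features.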
This paper designs an efficient image hashing with a ring partition and a nonnegative matrix
factorization (NMF), which has both the rotation robustness and good discriminative capability. The key
contribution is a novel construction of rotation-invariant secondary image, which is used for the first time in
image hashing and helps make the image hash resistant to rotation. In addition, NMF coefficients are
approximately linearly changed by content-preserving manipulations, which allows hash similarity to be
measured with the correlation coefficient. We conduct experiments on 346 images to illustrate efficiency. Our
experiments show that the proposed hashing is robust against content-preserving operations, such as image
rotation, JPEG compression, watermark embedding, Gaussian low-pass filtering, gamma correction,
brightness adjustment, contrast adjustment, and image scaling. Receiver operating characteristics (ROC) curve
comparisons are also conducted with the state-of-the-art algorithms, and demonstrate that the proposed
hashing is much better than all these algorithms in classification performances with respect to robustness and
discrimination.
ETPL
DM - 043 Robust Perceptual Image Hashing Based on Ring Partition and NMF
Similarity query is a fundamental problem in database, data mining and information retrieval research.
Recently, querying incomplete data has attracted extensive attention as it poses new challenges to traditional
querying techniques. The existing work on querying incomplete data addresses the problem where the data
values on certain dimensions are unknown. However, in many real-life applications, such as data collected by
a sensor network in a noisy environment, not only the data values but also the dimension information may be
missing. In this work, we propose to investigate the problem of similarity search on dimension incomplete
data. A probabilistic framework is developed to model this problem so that the users can find objects in the
database that are similar to the query with probability guarantee. Missing dimension information poses great
computational challenge, since all possible combinations of missing dimensions need to be examined when
evaluating the similarity between the query and the data objects. We develop the lower and upper bounds of
the probability that a data object is similar to the query. These bounds enable efficient filtering of irrelevant
data objects without explicitly examining all missing dimension combinations. A probability triangle
inequality is also employed to further prune the search space and speed up the query process. The proposed
probabilistic framework and techniques can be applied to both whole and subsequence queries. Extensive
experimental results on real-life data sets demonstrate the effectiveness and efficiency of our approach.
ETPL
DM - 044
Searching Dimension Incomplete Databases
High-dimensional data arise naturally in many domains, and have regularly presented a great challenge
for traditional data mining techniques, both in terms of effectiveness and efficiency. Clustering becomes
difficult due to the increasing sparsity of such data, as well as the increasing difficulty in distinguishing
distances between data points. In this paper, we take a novel perspective on the problem of clustering high-
dimensional data. Instead of attempting to avoid the curse of dimensionality by observing a lower dimensional
feature subspace, we embrace dimensionality by taking advantage of inherently high-dimensional phenomena.
More specifically, we show that hubness, i.e., the tendency of high-dimensional data to contain points (hubs)
that frequently occur in k-nearest-neighbor lists of other points, can be successfully exploited in clustering. We
validate our hypothesis by demonstrating that hubness is a good measure of point centrality within a high-
dimensional data cluster, and by proposing several hubness-based clustering algorithms, showing that major
hubs can be used effectively as cluster prototypes or as guides during the search for centroid-based cluster
configurations. Experimental results demonstrate good performance of our algorithms in multiple settings,
particularly in the presence of large quantities of noise. The proposed methods are tailored mostly for
detecting approximately hyperspherical clusters and need to be extended to properly handle clusters of
arbitrary shapes.
ETPL
DM - 045
The Role of Hubness in Clustering High-Dimensional Data
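The hubness statistic the abstract relies on, N_k(x), the number of times a point appears in the k-nearest-neighbor lists of other points, can be computed directly. Below is a brute-force sketch of our own, not the paper's implementation:

```python
import numpy as np

def hubness_scores(X, k=5):
    """N_k(x): how often each point occurs in the k-nearest-neighbor
    lists of the other points (brute-force pairwise distances)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)        # a point is not its own neighbor
    counts = np.zeros(n, dtype=int)
    for i in range(n):
        knn = np.argsort(D[i])[:k]     # indices of the k closest points
        counts[knn] += 1
    return counts
```

Points with the largest scores are the hubs that the abstract proposes to use as cluster prototypes.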
Traditionally, as soon as confidentiality becomes a concern, data are encrypted before outsourcing to a
service provider. Any software-based cryptographic constructs then deployed, for server-side query processing
on the encrypted data, inherently limit query expressiveness. Here, we introduce TrustedDB, an outsourced
database prototype that allows clients to execute SQL queries with privacy and under regulatory compliance
constraints by leveraging server-hosted, tamper-proof trusted hardware in critical query processing stages,
thereby removing any limitations on the type of supported queries. Despite the cost overhead and performance
limitations of trusted hardware, we show that the costs per query are orders of magnitude lower than any
(existing or) potential future software-only mechanisms. TrustedDB is built and runs on actual hardware, and
its performance and costs are evaluated here.
ETPL
DM - 046
TrustedDB: A Trusted Hardware-Based Database with Privacy and Data
Confidentiality
Collaborative filtering (CF) is an important and popular technology for recommendation
systems. However, current collaborative filtering methods suffer from some problems such as sparsity
problem, inaccurate recommendation and producing big-error predictions. In this paper, we borrow ideas of
object typicality from cognitive psychology and propose a novel typicality-based collaborative filtering
recommendation method named TyCo. A distinct feature of typicality-based CF is that it finds `neighbors' of
users based on user typicality degrees in user groups (instead of the co-rated items of users or common users
of items in traditional CF). To the best of our knowledge, there is no prior work investigating collaborative
filtering recommendation based on object typicality.
ETPL
DM - 047
TyCo: Towards Typicality-based Collaborative Filtering Recommendation
In this paper, we address the problem of the high annotation cost of acquiring training data for
semantic segmentation. Most modern approaches to semantic segmentation are based upon graphical models,
such as the conditional random fields, and rely on sufficient training data in form of object contours. To reduce
the manual effort on pixel-wise annotating contours, we consider the setting in which the training data set for
semantic segmentation is a mixture of a few object contours and an abundant set of bounding boxes of objects.
Our idea is to borrow the knowledge derived from the object contours to infer the unknown object contours
enclosed by the bounding boxes. The inferred contours can then serve as training data for semantic
segmentation. To this end, we generate multiple contour hypotheses for each bounding box with the
assumption that at least one hypothesis is close to the ground truth. This paper proposes an approach, called
augmented multiple instance regression (AMIR), that formulates the task of hypothesis selection as the
problem of multiple instance regression (MIR), and augments information derived from the object contours to
guide and regularize the training process of MIR. In this way, a bounding box is treated as a bag with its
contour hypotheses as instances, and the positive instances refer to the hypotheses close to the ground truth.
The proposed approach has been evaluated on the Pascal VOC segmentation task. The promising results
demonstrate that AMIR can precisely infer the object contours in the bounding boxes, and hence provide
effective alternatives to manually labeled contours for semantic segmentation.
ETPL
DM - 048
A Two-Level Topic Model Towards Knowledge Discovery from Citation
Networks
Access control mechanisms protect sensitive information from unauthorized users. However, when
sensitive information is shared and a Privacy Protection Mechanism (PPM) is not in place, an authorized user
can still compromise the privacy of a person, leading to identity disclosure. A PPM can use suppression and
generalization of relational data to anonymize and satisfy privacy requirements, e.g., k-anonymity and l-
diversity, against identity and attribute disclosure. However, privacy is achieved at the cost of precision of
authorized information. In this paper, we propose an accuracy-constrained privacy-preserving access control
framework. The access control policies define selection predicates available to roles while the privacy
requirement is to satisfy the k-anonymity or l-diversity. An additional constraint that needs to be satisfied by
the PPM is the imprecision bound for each selection predicate. The techniques for workload-aware
anonymization for selection predicates have been discussed in the literature. However, to the best of our
knowledge, the problem of satisfying the accuracy constraints for multiple roles has not been studied before.
In our formulation of the aforementioned problem, we propose heuristics for anonymization algorithms and
show empirically that the proposed approach satisfies imprecision bounds for more permissions and has lower
total imprecision than the current state of the art.
ETPL
DM- 049
Accuracy-Constrained Privacy-Preserving Access Control Mechanism for
Relational Data
Traditional active learning methods require the labeler to provide a class label for each queried instance. The
labelers are normally highly skilled domain experts to ensure the correctness of the provided labels, which in
turn results in expensive labeling cost. To reduce labeling cost, an alternative solution is to allow nonexpert
labelers to carry out the labeling task without explicitly telling the class label of each queried instance. In this
paper, we propose a new active learning paradigm, in which a nonexpert labeler is only asked “whether a pair
of instances belong to the same class”, namely, a pairwise label homogeneity. Under such circumstances, our
active learning goal is twofold: (1) decide which pair of instances should be selected for query, and (2) make
use of the pairwise homogeneity information to improve the active learner. To achieve this goal, we
propose a “Pairwise Query on Max-flow Paths” strategy to query pairwise label homogeneity from a
nonexpert labeler, whose query results are further used to dynamically update a Min-cut model (to
differentiate instances in different classes). In addition, a “Confidence-based Data Selection” measure is used
to evaluate data utility based on the Min-cut model's prediction results. The selected instances, with inferred
class labels, are included into the labeled set to form a closed-loop active learning process. Experimental
results and comparisons with state-of-the-art methods demonstrate that our new active learning paradigm can
result in good performance with nonexpert labelers.
ETPL
DM - 050
Active Learning without Knowing Individual Instance Labels: A Pairwise Label
Homogeneity Query Approach
In this paper we present a framework for automatic exploitation of news in stock trading
strategies. Events are extracted from news messages presented in free text without annotations. We test the
introduced framework by deriving trading strategies based on technical indicators and impacts of the extracted
events. The strategies take the form of rules that combine technical trading indicators with a news variable,
and are revealed through the use of genetic programming. We find that the news variable is often included in
the optimal trading rules, indicating the added value of news for predictive purposes and validating our
proposed framework for automatically incorporating news in stock trading strategies.
ETPL
DM - 051
An Automated Framework for Incorporating News into Stock Trading
Strategies
We identify relation completion (RC) as one recurring problem that is central to the success of
novel big data applications such as Entity Reconstruction and Data Enrichment. Given a semantic relation ℜ,
RC attempts to link entity pairs between two entity lists under the relation ℜ. To accomplish the RC goals,
we propose to formulate search queries for each query entity α based on some auxiliary information, so as to
detect its target entity β from the set of retrieved documents. For instance, a pattern-based method (PaRE) uses
extracted patterns as the auxiliary information in formulating search queries. However, high-quality patterns
may decrease the probability of finding suitable target entities. As an alternative, we propose CoRE method
that uses context terms learned surrounding the expression of a relation as the auxiliary information in
formulating queries. The experimental results based on several real-world web data collections demonstrate
that CoRE reaches a much higher accuracy than PaRE for the purpose of RC.
ETPL
DM - 052
CoRE: A Context-Aware Relation Extraction Method for Relation Completion
Authority flow techniques like PageRank and ObjectRank can provide personalized ranking of typed entity-
relationship graphs. There are two main ways to personalize authority flow ranking: Node-based
personalization, where authority originates from a set of user-specific nodes; edge-based personalization,
where the importance of different edge types is user-specific. We propose the first approach to achieve
efficient edge-based personalization using a combination of precomputation and runtime algorithms. In
particular, we apply our method to ObjectRank, where a personalized weight assignment vector (WAV)
assigns different weights to each edge type or relationship type. Our approach includes a repository of rankings
for various WAVs. We consider the following two classes of approximation: (a) SchemaApprox is formulated
as a distance minimization problem at the schema level; (b) DataApprox is a distance minimization problem at
the data graph level. SchemaApprox is not robust since it does not distinguish between important and trivial
edge types based on the edge distribution in the data graph.
ETPL
DM - 053
Efficient Ranking on Entity Graphs with Personalized Relationships
Although several distance or similarity functions for trees have been introduced, their performance is
not always satisfactory in many applications, ranging from document clustering to natural language
processing. This research proposes a new similarity function for trees, namely Extended Subtree (EST), where
a new subtree mapping is proposed. EST generalizes edit-based distances by providing new rules for subtree
mapping. Further, the new approach seeks to resolve the problems and limitations of previous approaches.
Extensive evaluation frameworks are developed to evaluate the performance of the new approach against
previous proposals. Clustering and classification case studies utilizing three real-world and one synthetic
labeled data sets are performed to provide an unbiased evaluation where different distance functions are
investigated. The experimental results demonstrate the superior performance of the proposed distance
function. In addition, an empirical runtime analysis demonstrates that the new approach is one of the best tree
distance functions in terms of runtime efficiency.
ETPL
DM - 054 Extended Subtree: A New Similarity Function for Tree Structured Data
Conventional spatial queries, such as range search and nearest neighbor retrieval,
involve only conditions on objects' geometric properties. Today, many modern applications call for novel
forms of queries that aim to find objects satisfying both a spatial predicate, and a predicate on their associated
texts. For example, instead of considering all the restaurants, a nearest neighbor query would instead ask for
the restaurant that is the closest among those whose menus contain “steak, spaghetti, brandy” all at the same
time. Currently, the best solution to such queries is based on the IR2-tree, which, as shown in this paper, has a
few deficiencies that seriously impact its efficiency. Motivated by this, we develop a new access method
called the spatial inverted index that extends the conventional inverted index to cope with multidimensional
data, and comes with algorithms that can answer nearest neighbor queries with keywords in real time. As
verified by experiments, the proposed techniques significantly outperform the IR2-tree in query response time,
often by orders of magnitude.
ETPL
DM - 055
Fast Nearest Neighbor Search with Keywords
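The query the abstract describes, the nearest neighbor among objects whose texts contain all query keywords, can be illustrated with a plain inverted index and a brute-force distance scan over the candidates. This sketch omits the paper's spatial indexing entirely; all names are ours.

```python
import math

def build_inverted_index(objects):
    """objects: id -> (point, set_of_words). Returns word -> set of ids."""
    index = {}
    for oid, (_, words) in objects.items():
        for w in words:
            index.setdefault(w, set()).add(oid)
    return index

def nn_with_keywords(objects, index, q, keywords):
    """Nearest object containing ALL query keywords: intersect the
    posting lists, then scan the surviving candidates for distance."""
    cands = None
    for w in keywords:
        posting = index.get(w, set())
        cands = posting if cands is None else cands & posting
    best, best_d = None, math.inf
    for oid in cands or ():
        p, _ = objects[oid]
        d = math.dist(q, p)
        if d < best_d:
            best, best_d = oid, d
    return best
```

The paper's spatial inverted index makes the candidate scan unnecessary by merging spatial and textual pruning; here the keywords only shrink the brute-force set.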
Activity recognition is a key task for the development of advanced and effective
ubiquitous applications in fields like ambient assisted living. A major problem in designing effective
recognition algorithms is the difficulty of incorporating long-range dependencies between distant time instants
without incurring substantial increase in computational complexity of inference. In this paper we present a
novel approach for introducing long-range interactions based on sequential pattern mining. The algorithm
searches for patterns characterizing time segments during which the same activity is performed. A
probabilistic model is learned to represent the distribution of pattern matches along sequences, trying to
maximize the coverage of an activity segment by a pattern match. The model is integrated in a segmental
labeling algorithm and applied to novel sequences, tagged according to matches of the extracted patterns. The
rationale of the approach is that restricting dependencies to span the same activity segment (i.e., sharing the
same label), allows keeping inference tractable. An experimental evaluation shows that enriching sensor-based
representations with the mined patterns allows improving results over sequential and segmental labeling
algorithms in most of the cases. An analysis of the discovered patterns highlights non-trivial interactions
spanning over a significant time horizon.
ETPL
DM - 056 Improving Activity Recognition by Segmental Pattern Mining
Frequent weighted itemsets represent correlations frequently holding in data in which items may
weight differently. However, in some contexts, e.g., when the need is to minimize a certain cost function,
discovering rare data correlations is more interesting than mining frequent ones. This paper tackles the issue of
discovering rare and weighted itemsets, i.e., the infrequent weighted itemset (IWI) mining problem. Two
novel quality measures are proposed to drive the IWI mining process. Furthermore, two algorithms that
perform IWI and Minimal IWI mining efficiently, driven by the proposed measures, are presented.
Experimental results show the efficiency and effectiveness of the proposed approach.
ETPL
DM - 057
Infrequent Weighted Itemset Mining Using Frequent Pattern Growth
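A toy rendering of the task above: score itemsets by a weighted support and keep those that fall below a maximum threshold. The measure used here, the sum over supporting transactions of the minimum item weight, is our own stand-in and not necessarily the paper's IWI-support measure; the exhaustive enumeration likewise replaces the paper's FP-growth-based miner.

```python
from itertools import combinations

def iwi_support(itemset, transactions):
    """Weighted support (our stand-in measure): for each transaction
    containing the itemset, add the minimum weight among its items."""
    s = 0.0
    for t in transactions:             # t: dict item -> weight
        if all(i in t for i in itemset):
            s += min(t[i] for i in itemset)
    return s

def mine_iwi(transactions, max_threshold, max_size=3):
    """Enumerate itemsets whose weighted support is BELOW the threshold,
    i.e., the infrequent weighted itemsets."""
    items = sorted({i for t in transactions for i in t})
    result = []
    for k in range(1, max_size + 1):
        for iset in combinations(items, k):
            if iwi_support(iset, transactions) < max_threshold:
                result.append(iset)
    return result
```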
Local thresholding algorithms were first presented more than a decade ago and have since been
applied to a variety of data mining tasks in peer-to-peer systems, wireless sensor networks, and in grid
systems. One critical assumption made by those algorithms has always been cycle-free routing. The existence
of even one cycle may lead all peers to the wrong outcome. Outside the lab, unfortunately, cycle freedom is
not easy to achieve. This work is the first to lift the requirement of cycle freedom by presenting a local
thresholding algorithm suitable for general network graphs. The algorithm relies on a new repositioning of the
problem in weighted vector arithmetics, on a new stopping rule, whose proof does not require that the network
be cycle free, and on new methods for balance correction when the stopping rule fails. The new stopping and
update rules permit calculation of the very same functions that were calculable using previous algorithms,
which do assume cycle freedom. The algorithm is implemented on a standard peer-to-peer simulator and is
validated for networks of up to 80,000 peers, organized in three different topologies representative of major
current distributed systems: the Internet, structured peer-to-peer systems, and wireless sensor networks.
ETPL
DM- 058 Local Thresholding in General Network Graphs
The main aim of this paper is to develop a community discovery scheme in a multi-dimensional
network for data mining applications. In online social media, networked data consists of multiple
dimensions/entities such as users, tags, photos, comments, and stories. We are interested in finding a group of
users who interact significantly on these media entities. In a co-citation network, we are interested in finding a
group of authors who relate to other authors significantly on publication information in titles, abstracts, and
keywords as multiple dimensions/entities in the network. The main contribution of this paper is to propose a
framework (MultiComm) to identify a seed-based community in a multi-dimensional network by evaluating
the affinity between two items in the same type of entity (same dimension) or different types of entities
(different dimensions) from the network. Our idea is to calculate the probabilities of visiting each item in each
dimension, and compare their values to generate communities from a set of seed items. In order to evaluate a
high quality of generated communities by the proposed algorithm, we develop and study a local modularity
measure of a community in a multi-dimensional network. Experiments based on synthetic and real-world data
sets suggest that the proposed framework is able to find a community effectively. Experimental results have
also shown that the performance of the proposed algorithm is better in accuracy than the other testing
algorithms in finding communities in multi-dimensional networks.
ETPL
DM - 059
MultiComm: Finding Community Structure in Multi-Dimensional Networks
We formulate and investigate the novel problem of finding the skyline k-tuple groups from an n-tuple data
set, i.e., groups of k tuples that are not dominated by any other group of equal size, based on an
aggregate-based group dominance relationship. The major technical challenge is to identify effective anti-
monotonic properties for pruning the search space of skyline groups. To this end, we first show that the anti-
monotonic property in the well-known Apriori algorithm does not hold for skyline group pruning. Then, we
identify two anti-monotonic properties with varying degrees of applicability: order-specific property which
applies to SUM, MIN, and MAX as well as weak candidate-generation property which applies to MIN and
MAX only. Experimental results on both real and synthetic data sets verify that the proposed algorithms
achieve orders of magnitude performance gain over the baseline method.
ETPL
DM - 060
On Skyline Groups
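The group dominance relation the abstract builds on can be made concrete with a brute-force enumeration under the SUM aggregate (assuming larger values are better). The paper's contribution is precisely the anti-monotonic pruning that avoids this exponential scan; the sketch below only illustrates the definitions.

```python
from itertools import combinations

def group_vector(group):
    """Aggregate a group of tuples dimension-wise with SUM."""
    return tuple(sum(t[d] for t in group) for d in range(len(group[0])))

def dominates(u, v):
    """u dominates v if u >= v on every dimension and u > v on at
    least one (larger values are better)."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def skyline_groups(tuples, k):
    """Brute-force skyline over all k-tuple groups; exponential in n,
    for illustration only."""
    groups = list(combinations(tuples, k))
    vecs = [group_vector(g) for g in groups]
    return [g for g, v in zip(groups, vecs)
            if not any(dominates(w, v) for w in vecs)]
```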
The probabilistic threshold query is one of the most common queries in uncertain databases,
where a result satisfying the query must also have a probability meeting the threshold requirement. In this
paper, we investigate probabilistic threshold keyword queries (PrTKQ) over XML data, a problem not studied
before. We first introduce the notion of quasi-SLCA and use it to represent results for a PrTKQ under
possible-worlds semantics. Then we design a probabilistic inverted (PI) index that can be used
to quickly return the qualified answers and filter out the unqualified ones based on our proposed lower/upper
bounds. After that, we propose two efficient and comparable algorithms: a baseline algorithm and a PI-index-
based algorithm. To accelerate both algorithms, we also utilize a probability density function. An
empirical study using real and synthetic data sets has verified the effectiveness and the efficiency of our
approaches.
ETPL
DM - 061
Quasi-SLCA Based Keyword QueryProcessing over Probabilistic XML Data
We propose a protocol for secure mining of association rules in horizontally distributed databases. The
current leading protocol is that of Kantarcioglu and Clifton. Our protocol, like theirs, is based on the Fast
Distributed Mining (FDM) algorithm of Cheung et al., which is an unsecured distributed version of the Apriori
algorithm. The main ingredients in our protocol are two novel secure multi-party algorithms: one that
computes the union of private subsets held by the interacting players, and another that tests the
inclusion of an element held by one player in a subset held by another. Our protocol offers enhanced privacy
with respect to that of Kantarcioglu and Clifton. In addition, it is simpler and significantly more efficient in
terms of communication rounds, communication cost, and computational cost.
ETPL
DM - 062
Secure Mining of Association Rules in Horizontally Distributed Databases
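The unsecured FDM skeleton the protocol builds on can be sketched as follows: each site reports its locally frequent itemsets, their union forms the candidate set, and the candidates' global support is then checked. The paper's contribution replaces the plain union and the support exchange with secure multi-party computations, which this sketch deliberately omits; all names are ours.

```python
from itertools import combinations

def locally_frequent(transactions, min_frac, max_size=2):
    """Itemsets meeting the support threshold at a single site."""
    items = sorted({i for t in transactions for i in t})
    out = set()
    for k in range(1, max_size + 1):
        for iset in combinations(items, k):
            sup = sum(1 for t in transactions if set(iset) <= t)
            if sup >= min_frac * len(transactions):
                out.add(iset)
    return out

def fdm_globally_frequent(sites, min_frac, max_size=2):
    """Unsecured FDM skeleton: candidates = union of locally frequent
    itemsets; an itemset is globally frequent if its total support
    over all sites meets the threshold."""
    candidates = set()
    for txns in sites:
        candidates |= locally_frequent(txns, min_frac, max_size)
    total = sum(len(txns) for txns in sites)
    result = set()
    for iset in candidates:
        sup = sum(sum(1 for t in txns if set(iset) <= t) for txns in sites)
        if sup >= min_frac * total:
            result.add(iset)
    return result
```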
Pattern classification systems are commonly used in adversarial applications, like biometric
authentication, network intrusion detection, and spam filtering, in which data can be purposely manipulated by
humans to undermine their operation. As this adversarial scenario is not taken into account by classical design
methods, pattern classification systems may exhibit vulnerabilities, whose exploitation may severely affect
their performance, and consequently limit their practical utility. Extending pattern classification theory and
design methods to adversarial settings is thus a novel and very relevant research direction, which has not yet
been pursued in a systematic way. In this paper, we address one of the main open issues: evaluating at design
phase the security of pattern classifiers, namely, the performance degradation under potential attacks they may
incur during operation. We propose a framework for empirical evaluation of classifier security that formalizes
and generalizes the main ideas proposed in the literature, and give examples of its use in three real
applications. Reported results show that security evaluation can provide a more complete understanding of the
classifier's behavior in adversarial environments, and lead to better design choices.
ETPL
DM - 063
Security Evaluation of Pattern Classifiers under Attack
This paper takes the shortest path discovery to study efficient relational approaches to graph search
queries. We first abstract three enhanced relational operators, based on which we introduce an FEM
framework to bridge the gap between relational operations and graph operations. We show new features
introduced by recent SQL standards, such as window function and merge statement, can improve the
performance of the FEM framework. Second, we propose an edge-weight-aware graph partitioning scheme and
design a bi-directional restrictive BFS (breadth-first search) over partitioned tables, which improves
scalability and performance without extra indexing overheads. Finally, extensive experimental results
illustrate that our relational approach with optimization strategies can achieve high scalability and performance.
ETPL
DM - 064
Shortest Path Computing in Relational DBMSs
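The bi-directional BFS at the heart of the approach can be illustrated outside the DBMS: treat the edge table as a list of (u, v) rows and expand BFS levels from both ends, always growing the smaller frontier. This is a plain in-memory sketch of the search strategy, not the paper's SQL-based FEM implementation.

```python
from collections import deque

def bidirectional_bfs(edges, src, dst):
    """Shortest path length over a relational-style edge table
    (list of directed (u, v) rows), expanding BFS from both ends."""
    adj, radj = {}, {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        radj.setdefault(v, []).append(u)
    df, db = {src: 0}, {dst: 0}        # distances from src / to dst
    qf, qb = deque([src]), deque([dst])
    while qf and qb:
        meet = df.keys() & db.keys()
        if meet:                       # frontiers intersect: done
            return min(df[v] + db[v] for v in meet)
        # expand the smaller frontier by one full level
        if len(qf) <= len(qb):
            q, dist, g = qf, df, adj
        else:
            q, dist, g = qb, db, radj
        for _ in range(len(q)):
            u = q.popleft()
            for v in g.get(u, []):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
    meet = df.keys() & db.keys()
    return min((df[v] + db[v] for v in meet), default=None)
```

Checking the intersection only after a full level completes keeps the result exact: the first time the settled sets overlap, the minimum of df(v) + db(v) over the overlap is the true shortest distance.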
The online shortest path problem aims at computing the shortest path based on live
traffic circumstances. This is very important in modern car navigation systems as it helps drivers to make
sensible decisions. To the best of our knowledge, there is no efficient system or solution that can offer
affordable costs at both the client and server sides for online shortest path computation. Unfortunately, the
conventional client-server architecture scales poorly with the number of clients. A promising approach is to let
the server collect live traffic information and then broadcast it over radio or a wireless network. This approach
has excellent scalability with the number of clients. Thus, we develop a new framework called live traffic
index (LTI), which enables drivers to quickly and effectively collect the live traffic information on the
broadcasting channel. An
impressive result is that the driver can compute/update their shortest path result by receiving only a small
fraction of the index. Our experimental study shows that LTI is robust to various parameters and it offers
relatively short tune-in cost (at the client side), fast query response time (at the client side), small broadcast
size (at the server side), and light maintenance time (at the server side) for the online shortest path problem.
ETPL
DM - 065
Towards Online Shortest Path Computation
The Object Summary (OS) is a recently proposed tree structure, which summarizes all data held in a
relational database about a data subject. An OS can potentially be very large in size and therefore unfriendly
for users who wish to view synoptic information about the data subject. In this paper, we investigate the
effective and efficient retrieval of concise and informative OS snippets (denoted as size-l OSs). We propose
and investigate the effectiveness of two types of size-l OSs, namely size-l OS(t)s and size-l OS(a)s, which
consist of l tuple nodes and l attribute nodes, respectively. For computing size-l OSs, we propose an optimal
dynamic programming algorithm, two greedy algorithms and preprocessing heuristics. By collecting feedback
from real users (e.g., from DBLP authors), we assess the relative usability of the two different types of
snippets, the choice of the size- l parameter, as well as the effectiveness of the snippets with respect to the user
expectations. In addition, via thorough evaluation on real databases, we test the speed and effectiveness of our
techniques.
ETPL
DM- 066
Versatile Size-l Object Summaries for Relational Keyword Search
Ontology reuse offers great benefits by measuring and comparing ontologies. However, state-of-the-art
approaches to measuring ontologies neglect the problems of both the polymorphism of ontology
representation and the addition of implicit semantic knowledge. One way to tackle these problems is to devise
a mechanism for ontology measurement that is stable, the basic criterion for automatic measurement. In this
paper, we present a graph derivation representation (GDR) based approach for stable semantic measurement,
which captures the structural semantics of ontologies and addresses the problems that cause unstable
measurement of ontologies. This paper makes three original contributions. First, we introduce and define the
concepts of semantic measurement and stable measurement, and present the GDR-based approach, a
three-phase process that transforms an ontology into its GDR. Second, we formally analyze important
properties of GDRs on the basis of which stable semantic measurement and comparison can be achieved.
Third, we compare our GDR-based approach with existing graph-based methods using a dozen real-world
exemplar ontologies. Our experimental comparison is based on nine ontology measurement entities and a
distance metric that stably compares the similarity of two ontologies in terms of their GDRs.
ETPL
DM - 067
A Graph Derivation Based Approach for Measuring and Comparing
Structural Semantics of Ontologies
Building Bayesian belief networks in the absence of data involves the challenging task of
eliciting conditional probabilities from experts to parameterize the model. In this paper, we develop an
analytical method for determining the optimal order for eliciting these probabilities. Our method uses prior
distributions on network parameters and a novel expected proximity criterion to propose an order that
maximizes information gain per unit elicitation time. We present analytical results when priors are uniform
Dirichlet; for other priors, we find through experiments that the optimal order is strongly affected by which
variables are of primary interest to the analyst. Our results should prove useful to researchers and practitioners
involved in belief network model building and elicitation.
ETPL
DM - 068
A Myopic Approach to Ordering Nodes for Parameter Elicitation in Bayesian
Belief Networks
Many problems in natural language processing, data mining, information retrieval, and
bioinformatics can be formalized as string transformation, a task defined as follows: given an input string, the
system generates the k most likely output strings corresponding to it. This paper proposes a novel
probabilistic approach to string transformation that is both accurate and efficient. The approach includes a
log-linear model, a method for training the model, and an algorithm for generating the top k candidates, with
or without a predefined dictionary. The log-linear model is defined as a conditional probability distribution of
an output string and a rule set for the transformation, conditioned on an input string. The learning method
employs maximum likelihood estimation for parameter estimation. The string generation algorithm, based on
pruning, is guaranteed to generate the optimal top k candidates. The proposed method is applied to the
correction of spelling errors in queries as well as the reformulation of queries in web search. Experimental
results on large-scale data show that the proposed approach improves upon existing methods in terms of both
accuracy and efficiency in different settings.
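To make rule-based candidate generation concrete, here is a hedged toy sketch in Python: candidates are produced by applying weighted transformation rules and ranked by score. The rules and weights below are invented for illustration; the paper's model learns rule weights by maximum likelihood and prunes the top-k search, which this sketch does not attempt:

```python
import heapq

# Toy rule set: (pattern, replacement, log-weight). Values are hypothetical.
rules = [("ie", "ei", -0.1), ("teh", "the", -0.05), ("a", "e", -1.2)]

def top_k(word, k=3):
    """Rank strings reachable by applying at most one rule at one position,
    by log-weight (a toy stand-in for a learned log-linear score).
    The identity transformation is kept with score 0.0."""
    cands = {word: 0.0}
    for pat, rep, w in rules:
        start = 0
        while (i := word.find(pat, start)) != -1:
            out = word[:i] + rep + word[i + len(pat):]
            if w > cands.get(out, float("-inf")):
                cands[out] = w
            start = i + 1
    return heapq.nlargest(k, cands.items(), key=lambda kv: kv[1])

print(top_k("recieve"))  # includes the correction ("receive", -0.1)
```

A real system would also condition scores on context or a dictionary, so that corrections can outrank the identity candidate.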
ETPL
DM - 069
A Probabilistic Approach to String Transformation
Domain transfer learning, which learns a target classifier using labeled data from a different
distribution, has shown promising value in knowledge discovery yet remains a challenging problem. Most
previous works designed adaptive classifiers by exploring two learning strategies independently: distribution
adaptation and label propagation. In this paper, we propose a novel transfer learning framework, referred to as
Adaptation Regularization based Transfer Learning (ARTL), to model them in a unified way based on the
structural risk minimization principle and the regularization theory. Specifically, ARTL learns the adaptive
classifier by simultaneously optimizing the structural risk functional, the joint distribution matching between
domains, and the manifold consistency underlying marginal distribution. Based on the framework, we propose
two novel methods using Regularized Least Squares (RLS) and Support Vector Machines (SVMs),
respectively, and use the Representer theorem in reproducing kernel Hilbert space to derive corresponding
solutions. Comprehensive experiments verify that ARTL can significantly outperform state-of-the-art learning
methods on several public text and image datasets.
ETPL
DM - 070
Adaptation Regularization: A General Framework for Transfer Learning
Clustering algorithms and cluster validity are two highly correlated parts of cluster analysis. In
this paper, a novel idea for cluster validity and a clustering algorithm based on the validity index are
introduced. A centroid ratio is first introduced to compare two clustering results. This centroid ratio is then
used in prototype-based clustering by introducing a Pairwise Random Swap clustering algorithm that avoids
the local optimum problem of k-means. The swap strategy in the algorithm alternates between simple
perturbation of the solution and convergence toward the nearest optimum by k-means. The centroid ratio is
shown to be highly correlated with the mean square error (MSE) and other external indices; moreover, it is
fast and simple to calculate. An empirical study on several different datasets indicates that the proposed
algorithm works more efficiently than Random Swap, Deterministic Random Swap, Repeated k-means, or
k-means++. The algorithm is successfully applied to document clustering and color image quantization as well.
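The swap strategy described above can be sketched in a few lines of Python. This is a simplified illustration only: it accepts a candidate when plain MSE improves, whereas the paper's algorithm uses the centroid ratio, and all data below are hypothetical:

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, cents):
    return [min(range(len(cents)), key=lambda j: dist2(p, cents[j])) for p in points]

def kmeans_step(points, cents):
    """One Lloyd iteration: reassign points, then recompute centroids."""
    labels = assign(points, cents)
    new = []
    for j in range(len(cents)):
        members = [p for p, l in zip(points, labels) if l == j]
        new.append(tuple(sum(c) / len(members) for c in zip(*members)) if members else cents[j])
    return new

def mse(points, cents):
    return sum(min(dist2(p, c) for c in cents) for p in points) / len(points)

def random_swap(points, k, iters=50, seed=0):
    """Alternate between perturbation (swap one centroid to a random point)
    and convergence (a few k-means steps); keep candidates that lower MSE."""
    rng = random.Random(seed)
    cents = rng.sample(points, k)
    best = mse(points, cents)
    for _ in range(iters):
        cand = list(cents)
        cand[rng.randrange(k)] = rng.choice(points)  # the swap perturbation
        for _ in range(2):                            # partial k-means convergence
            cand = kmeans_step(points, cand)
        cost = mse(points, cand)
        if cost < best:
            cents, best = cand, cost
    return cents, best

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
cents, best = random_swap(pts, 2)
print(sorted(cents), round(best, 3))
```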
ETPL
DM - 071
Centroid Ratio for a Pairwise Random Swap Clustering Algorithm
Result diversification has recently attracted considerable attention as a means of increasing
user satisfaction in recommender systems, as well as in web and database search. In this paper, we focus on
the problem of selecting the k most diverse items from a result set. Whereas previous research has mainly
considered the static version of the problem, in this paper we study the dynamic case in which the result set
changes over time, as, for example, in notification services. We define the CONTINUOUS k-DIVERSITY
PROBLEM along with appropriate constraints that enforce continuity requirements on the
diversified results. Our proposed approach is based on cover trees and supports dynamic item insertion and
deletion. The diversification problem is in general NP-hard; we provide theoretical bounds that characterize
the quality of our cover tree solution with respect to the optimal one. Since results are often associated with a
relevance score, we extend our approach to account for relevance. Finally, we report experimental results
concerning the efficiency and effectiveness of our approach on a variety of real and synthetic datasets.
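For intuition about what "k most diverse" means, here is the classic greedy max-min heuristic for the static version of the problem, on hypothetical scalar items. This is not the cover-tree index the paper builds, which additionally supports dynamic insertion, deletion, and continuity constraints:

```python
def greedy_diverse(items, k, dist):
    """Greedy max-min diversification: repeatedly add the item whose minimum
    distance to the already-chosen set is largest."""
    chosen = [items[0]]  # seed with the first item for simplicity
    while len(chosen) < k:
        best = max((it for it in items if it not in chosen),
                   key=lambda it: min(dist(it, c) for c in chosen))
        chosen.append(best)
    return chosen

nums = [1, 2, 3, 10, 11, 20]
print(greedy_diverse(nums, 3, lambda a, b: abs(a - b)))  # [1, 20, 10]
```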
ETPL
DM - 072
Diverse Set Selection Over Dynamic Data
Recently, probabilistic graphs have attracted significant interest from the data mining community. It
has been observed that correlations may exist among adjacent edges in various probabilistic graphs. As one of
the basic mining techniques, graph clustering is widely used in exploratory data analysis, such as data
compression, information retrieval, and image segmentation. Graph clustering aims to divide data into clusters
according to their similarities, and a number of algorithms have been proposed for clustering graphs, such as
the pKwikCluster algorithm, spectral clustering, and k-path clustering. However, little research has been
devoted to efficient clustering algorithms for probabilistic graphs, and the problem becomes even more
challenging when correlations are considered. In this paper, we define the problem of clustering correlated
probabilistic graphs. To solve this challenging problem, we propose two algorithms, namely the PEEDR and
CPGS clustering algorithms. For each of the proposed algorithms, we develop several pruning techniques to
further improve their efficiency. We evaluate the effectiveness and efficiency of our algorithms and pruning
methods through comprehensive experiments.
ETPL
DM - 073
Effective and Efficient Clustering Methods for Correlated Probabilistic Graphs
This paper describes a three-level framework for semi-supervised feature selection. Most feature
selection methods focus mainly on finding relevant features for high-dimensional data. In this paper, we show
that relevance requires two further procedures to provide efficient feature selection in the semi-supervised
context. The first concerns the selection of pairwise constraints that can be extracted from the labeled part of
the data. The second aims to reduce the redundancy that could be detected among the selected relevant
features. For relevance, we develop a filter approach based on a constrained Laplacian score. Finally,
experimental results are provided to show the efficiency of our proposal in comparison with several
representative methods.
ETPL
DM - 074
Efficient Semi-Supervised Feature Selection: Constraint, Relevance, and
Redundancy
A well-studied query type on moving objects is the continuous range query. An interesting
and practical situation is that instead of being continuously evaluated, the query may be evaluated at different
degrees of continuity, e.g., every 2 seconds (close to continuous), every 10 minutes or at irregular time
intervals (close to snapshot). Furthermore, the range query may be stacked under predicates applied to the
returned objects. An example is the count predicate that requires the number of objects in the range to be at
least γ. The conjecture is that these two practical considerations can help reduce communication
costs. We propose a safe region-based solution that exploits these two practical considerations. An extensive
experimental study shows that our solution can reduce communication costs by a factor of 9.5 compared to an
existing state-of-the-art system.
ETPL
DM - 075
Evaluation of Range Queries With Predicates on Moving Objects
Millions of users share their opinions on Twitter, making it a valuable platform for tracking
and analyzing public sentiment. Such tracking and analysis can provide critical information for decision
making in various domains, and has therefore attracted attention in both academia and industry. Previous
research mainly focused on modeling and tracking public sentiment. In this work, we move one step further to
interpret sentiment variations. We observed that emerging topics (named foreground topics) within the
sentiment variation periods are highly related to the genuine reasons behind the variations. Based on this
observation, we propose a Latent Dirichlet Allocation (LDA) based model, Foreground and Background LDA
(FB-LDA), to distill foreground topics and filter out longstanding background topics. These foreground topics
can give potential interpretations of the sentiment variations. To further enhance the readability of the mined
reasons, we select the most representative tweets for foreground topics and develop another generative model
called Reason Candidate and Background LDA (RCB-LDA) to rank them with respect to their “popularity”
within the variation period. Experimental results show that our methods can effectively find foreground topics
and rank reason candidates. The proposed models can also be applied to other tasks such as finding topic
differences between two sets of documents.
ETPL
DM - 076
Interpreting the Public Sentiment Variations on Twitter
Data uncertainty is inherent in many real-world applications such as environmental surveillance
and mobile tracking. Mining sequential patterns from inaccurate data, such as those data arising from sensor
readings and GPS trajectories, is important for discovering hidden knowledge in such applications. In this
paper, we propose to measure pattern frequentness based on the possible world semantics. We establish two
uncertain sequence data models abstracted from many real-life applications involving uncertain sequence data,
and formulate the problem of mining probabilistically frequent sequential patterns (or p-FSPs) from data that
conform to our models. However, the number of possible worlds is extremely large, which makes the mining
prohibitively expensive. Inspired by the famous PrefixSpan algorithm, we develop two new algorithms,
collectively called U-PrefixSpan, for p-FSP mining. U-PrefixSpan effectively avoids the problem of “possible
worlds explosion”, and when combined with our four pruning and validating methods, achieves even better
performance. We also propose a fast validating method to further speed up our U-PrefixSpan algorithm. The
efficiency and effectiveness of U-PrefixSpan are verified through extensive experiments on both real and
synthetic datasets.
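Under the possible-world semantics above, deciding whether a pattern is probabilistically frequent reduces to a tail probability over worlds. A hedged sketch, assuming for illustration only that each sequence independently contains the pattern with a known probability (the paper's models and the U-PrefixSpan machinery handle far more general uncertainty):

```python
def prob_frequent(probs, minsup):
    """P(support >= minsup) when sequence i contains the pattern with
    probability probs[i], independently (Poisson-binomial tail via DP,
    avoiding explicit enumeration of the exponentially many worlds).
    dist[s] holds P(support == s) over the sequences processed so far."""
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for s, q in enumerate(dist):
            nxt[s] += q * (1 - p)   # worlds where this sequence lacks the pattern
            nxt[s + 1] += q * p     # worlds where it contains the pattern
        dist = nxt
    return sum(dist[minsup:])

print(prob_frequent([0.5, 0.5], 1))  # 0.75
```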
ETPL
DM - 077
Mining Probabilistically Frequent Sequential Patterns in Large Uncertain
Databases
In spatial domains, interaction between features gives rise to two types of interaction patterns: co-
location and segregation patterns. Existing approaches to finding co-location patterns have several
shortcomings: (1) they depend on user-specified thresholds for prevalence measures; (2) they do not take
spatial auto-correlation into account; and (3) they may report co-locations even if the features are randomly
distributed. Segregation patterns have yet to receive much attention. In this paper, we propose a method for
finding both types of interaction patterns, based on a statistical test. We introduce a new definition of co-
location and segregation pattern, we propose a model for the null distribution of features so spatial auto-
correlation is taken into account, and we design an algorithm for finding both co-location and segregation
patterns. We also develop two strategies to reduce the computational cost compared to a naïve approach based
on simulations of the data distribution, and we propose an approach to reduce the runtime of our algorithm
even further by using an approximation of the neighborhood of features. We evaluate our method empirically
using synthetic and real data sets and demonstrate its advantages over a state-of-the-art co-location mining
algorithm.
ETPL
DM - 078
Mining Statistically Significant Co-location and Segregation Patterns
In this paper we present a solution to one of the location-based query problems. This problem is
defined as follows: (i) a user wants to query a database of location data, known as Points Of Interest (POIs),
and does not want to reveal his/her location to the server due to privacy concerns; (ii) the owner of the location
data, that is, the location server, does not want to simply distribute its data to all users. The location server
desires to have some control over its data, since the data is its asset. We propose a major enhancement upon
previous solutions by introducing a two stage approach, where the first step is based on Oblivious Transfer and
the second step is based on Private Information Retrieval, to achieve a secure solution for both parties. The
solution we present is efficient and practical in many scenarios. We implement our solution on a desktop
machine and a mobile device to assess the efficiency of our protocol. We also introduce a security model and
analyse the security in the context of our protocol. Finally, we highlight a security weakness of our previous
work and present a solution to overcome it.
ETPL
DM - 079
Privacy-Preserving and Content-Protecting Location Based Queries
Numerous consumer reviews of products are now available on the Internet. Consumer
reviews contain rich and valuable knowledge for both firms and users. However, the reviews are often
disorganized, leading to difficulties in information navigation and knowledge acquisition. This article proposes
a product aspect ranking framework, which automatically identifies the important aspects of products from
online consumer reviews, aiming at improving the usability of the numerous reviews. The important product
aspects are identified based on two observations: 1) the important aspects are usually commented on by a large
number of consumers and 2) consumer opinions on the important aspects greatly influence their overall
opinions on the product. In particular, given the consumer reviews of a product, we first identify product
aspects by a shallow dependency parser and determine consumer opinions on these aspects via a sentiment
classifier. We then develop a probabilistic aspect ranking algorithm to infer the importance of aspects by
simultaneously considering aspect frequency and the influence of consumer opinions given to each aspect over
their overall opinions. The experimental results on a review corpus of 21 popular products in eight domains
demonstrate the effectiveness of the proposed approach. Moreover, we apply product aspect ranking to two
real-world applications, i.e., document-level sentiment classification and extractive review summarization, and
achieve significant performance improvements, which demonstrate the capacity of product aspect ranking in
facilitating real-world applications.
ETPL
DM - 080
Product Aspect Ranking and Its Applications
In this paper, we present a novel ensemble method, Random Projection Random Discretization
Ensembles (RPRDE), that creates ensembles of linear multivariate decision trees using a univariate decision
tree algorithm. The method combines the better computational complexity of a univariate decision tree
algorithm with the better representational power of linear multivariate decision trees. We develop a random
discretization (RD) method that creates randomly discretized features from continuous features. Random
projection (RP) is used to create new features that are linear combinations of the original features. A new
dataset is created by augmenting the discretized features (created using RD) with the features created using
RP. Each decision tree of an RPRDE ensemble is trained on one dataset from the pool of these datasets using
a univariate decision tree algorithm. As these multivariate decision trees (because of the features created by
RP) have more representational power than univariate decision trees, we expect accurate decision trees in the
ensemble, while diverse training datasets ensure diverse decision trees. We study the performance of RPRDE
against other popular ensemble techniques using C4.5 as the base classifier. RPRDE matches or outperforms
the other methods, and experimental results also suggest that the proposed method is quite robust to class
noise.
ETPL
DM- 081
Random Projection Random Discretization Ensembles—Ensembles of Linear
Multivariate Decision Trees
In the classic range aggregation problem, we have a set S of objects such that, given an interval
I, a query counts how many objects of S are covered by I. Besides COUNT, the problem can also be
defined with other aggregate functions, e.g., SUM, MIN, MAX, and AVERAGE. This paper studies a novel
variant of range aggregation where an object can belong to multiple sets. A query (at runtime) picks any two
sets and aggregates over their intersection. More formally, let S_1, ..., S_m be m sets of objects.
Given distinct set ids i, j and an interval I, a query reports how many objects in S_i ∩ S_j are covered
by I. We call this problem range aggregation with set selection (RASS). Its hardness lies in the fact that the
pair (i, j) can have m(m-1)/2 choices, rendering effective indexing a non-trivial task. The RASS problem can
also be defined with other aggregate functions, and generalized so that a query chooses more than two sets.
We develop a system, also called RASS, to power this type of query. Our system has excellent efficiency in
both theory and practice. Theoretically, it consumes linear space and achieves nearly optimal query time.
Practically, it outperforms existing solutions on real datasets by up to an order of magnitude. The paper also
features a rigorous theoretical analysis of the hardness of the RASS problem, which reveals invaluable insight
into its characteristics.
ETPL
DM - 082
Range Aggregation With Set Selection
The integration of social networking concepts into the Internet of things has led to the Social Internet
of Things (SIoT) paradigm, according to which objects are capable of establishing social relationships in an
autonomous way with respect to their owners with the benefits of improving the network scalability in
information/service discovery. Within this scenario, we focus on the problem of understanding how the
information provided by members of the social IoT has to be processed so as to build a reliable system on the
basis of the behavior of the objects. We define two models for trustworthiness management starting from the
solutions proposed for P2P and social networks. In the subjective model each node computes the
trustworthiness of its friends on the basis of its own experience and on the opinion of the friends in common
with the potential service providers. In the objective model, the information about each node is distributed and
stored making use of a distributed hash table structure so that any node can make use of the same information.
Simulations show how the proposed models can effectively isolate almost all malicious nodes in the network
at the expense of an increase in network traffic for feedback exchange.
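The subjective model above blends a node's own experience with the opinions of common friends. A minimal hedged sketch of that blending step, with a hypothetical weighting parameter alpha and trust values in [0, 1] (the paper's actual formula is more elaborate):

```python
def subjective_trust(direct, friend_opinions, alpha=0.7):
    """Trust in a provider = alpha * the node's own experience
    + (1 - alpha) * the mean opinion of common friends.
    Falls back to direct experience when no common friends exist."""
    if not friend_opinions:
        return direct
    return alpha * direct + (1 - alpha) * sum(friend_opinions) / len(friend_opinions)

print(subjective_trust(0.9, [0.5, 0.7]))  # own experience dominates, pulled down slightly
```

The objective model would instead store each node's feedback in a distributed hash table, so every node computes trust from the same global record rather than its own neighborhood.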
ETPL
DM - 083
Trustworthiness Management in the Social Internet of Things
We are witnessing increasing interests in the effective use of road networks. For example, to enable
effective vehicle routing, weighted-graph models of transportation networks are used, where the weight of an
edge captures some cost associated with traversing the edge, e.g., greenhouse gas (GHG) emissions or travel
time. It is a precondition to using a graph model for routing that all edges have weights. Weights that capture
travel times and GHG emissions can be extracted from GPS trajectory data collected from the network.
However, GPS trajectory data typically lack the coverage needed to assign weights to all edges. This paper
formulates and addresses the problem of annotating all edges in a road network with travel cost based weights
from a set of trips in the network that cover only a small fraction of the edges, each with an associated ground-
truth travel cost. A general framework is proposed to solve the problem. Specifically, the problem is modeled
as a regression problem and solved by minimizing a judiciously designed objective function that takes into
account the topology of the road network. In particular, the use of weighted PageRank values of edges is
explored for assigning appropriate weights to all edges, and the property of directional adjacency of edges is
also taken into account to assign weights. Empirical studies with weights capturing travel time and GHG
emissions on two road networks (Skagen, Denmark, and North Jutland, Denmark) offer insight into the design
properties of the proposed techniques and offer evidence that the techniques are effective.
ETPL
DM - 084
Using Incomplete Information for Complete Weight Annotation of Road
Networks
Multicore systems and multithreaded processing are now the de facto standards of enterprise and
personal computing. If used in an uninformed way, however, multithreaded processing might actually degrade
performance. We present the facets of the memory access bottleneck as they manifest in multithreaded
processing and show their impact on query evaluation. We present a system design based on partition
parallelism, memory pooling, and data structures conducive to multithreaded processing. Based on this design,
we present alternative implementations of the most common query processing algorithms, which we
experimentally evaluate using multiple scenarios and hardware platforms. Our results show that the design and
algorithms are indeed scalable across platforms, but the choice of optimal algorithm largely depends on the
problem parameters and underlying hardware. However, our proposals are a good first step toward generic
multithreaded parallelism.
ETPL
DM - 085
A Comparative Study of Implementation Techniques for Query Processing in
Multicore Systems
The selection of relevant and significant features is an important problem, particularly for data sets
with a large number of features. In this regard, a new feature selection algorithm is presented based on a rough
hypercuboid approach. It selects a set of features from a data set by maximizing the relevance, dependency,
and significance of the selected features. By introducing the concept of the hypercuboid equivalence partition
matrix, a novel representation of degree of dependency of sample categories on features is proposed to
measure the relevance, dependency, and significance of features in approximation spaces. The equivalence
partition matrix also offers an efficient way to calculate many more quantitative measures to describe the
inexactness of approximate classification. Several quantitative indices are introduced based on the rough
hypercuboid approach for evaluating the performance of the proposed method. The superiority of the proposed
method over other feature selection methods, in terms of computational complexity and classification
accuracy, is established extensively on various real-life data sets of different sizes and dimensions.
ETPL
DM - 086
A Rough Hypercuboid Approach for Feature Selection in Approximation
Spaces
Schemas are often used to constrain the content and structure of XML documents. They can be
quite big and complex and, thus, difficult to access manually. The ability to query a single schema or a
collection of schemas, or to retrieve schema components that meet certain structural constraints, significantly
eases schema management and is thus useful in many contexts. In this paper, we propose a query language,
named XSPath, specifically tailored to XML schemas. It works on logical graph-based representations of
schemas, on which it enables navigation and allows the selection of nodes. We also propose
XPath/XQuery-based translations that can be exploited for the evaluation of XSPath queries. An extensive
evaluation of the usability and efficiency of the proposed approach is finally presented within the EXup
system.
ETPL
DM - 087
XSPath: Navigation on XML Schemas Made Easy
Our proposed framework consists of two parts. First, we put forward uncertain one-class learning to
cope with uncertain data. We propose a local kernel-density-based method to generate a bound score
for each instance, which refines the location of the corresponding instance, and then construct
an uncertain one-class classifier (UOCC) by incorporating the generated bound score into a one-class
SVM-based learning phase. Second, we propose a support-vector-based clustering technique to summarize
the user's concept from the history chunks: each chunk is represented by the support vectors of the uncertain
one-class classifier developed on it, and the k-means clustering method is extended to group the history
chunks into clusters from which the concept can be summarized.
ETPL
DM - 088
Uncertain One-Class Learning and Concept Summarization Learning on
Uncertain Data Streams
Personalized web search (PWS) has demonstrated its effectiveness in improving the quality of various
search services on the Internet. However, evidence shows that users' reluctance to disclose their private
information during search has become a major barrier for the wide proliferation of PWS. We
study privacy protection in PWS applications that model user preferences as hierarchical user profiles. We
propose a PWS framework called UPS that can adaptively generalize profiles by queries while respecting
user-specified privacy requirements. Our runtime generalization aims at striking a balance between two
predictive metrics that evaluate the utility of personalization and the privacy risk of exposing the generalized
profile. We present two greedy algorithms, namely GreedyDP and GreedyIL, for runtime generalization. We
also provide an online prediction mechanism for deciding whether personalizing a query is beneficial.
Extensive experiments demonstrate the effectiveness of our framework. The experimental results also reveal
that GreedyIL significantly outperforms GreedyDP in terms of efficiency.
ETPL
DM - 089
Supporting Privacy Protection in Personalized Web Search
In data warehousing and OLAP applications, scalar-level predicates in SQL become increasingly
inadequate to support a class of operations that require set-level comparison semantics, i.e., comparing
a group of tuples with multiple values. Currently, complex SQL queries composed by scalar-level operations
are often formed to obtain even very simple set-level semantics. Such queries are not only difficult to write but
also challenging for a database engine to optimize, thus can result in costly evaluation. This paper proposes to
augment SQL with set predicates, to bring out otherwise obscured set-level semantics. We studied two
approaches to processing set predicates: an aggregate function-based approach and a bitmap index-based
approach. Moreover, we designed a histogram-based probabilistic method of set predicate selectivity
estimation, for optimizing queries with multiple predicates. The experiments verified its accuracy and
effectiveness in optimizing queries.
ETPL
DM - 090
Set Predicates in SQL: Enabling Set-Level Comparisons for Dynamically
Formed Groups
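The set-level comparison described in the abstract above can be made concrete with a small sketch. This is not the paper's SQL extension or its evaluation algorithms; it is a minimal, illustrative Python analogue of the aggregate-function-based idea, evaluating a "group CONTAINS a required set of values" predicate via grouping. All names (`groups_containing`, the sample rows) are hypothetical.

```python
# Illustrative sketch: a set-level predicate ("the set of `course` values for
# each student CONTAINS {'DB', 'OS'}") evaluated by grouping and aggregation,
# in the spirit of an aggregate function-based approach.
from collections import defaultdict

def groups_containing(rows, key, value, required):
    """Return keys whose set of `value`s contains every member of `required`."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[key]].add(row[value])
    return sorted(k for k, vals in groups.items() if set(required) <= vals)

rows = [
    {"student": "ann", "course": "DB"},
    {"student": "ann", "course": "OS"},
    {"student": "bob", "course": "DB"},
]
matches = groups_containing(rows, "student", "course", {"DB", "OS"})  # ['ann']
```

Expressing the same query with scalar-level SQL would require a self-join or correlated subqueries per required value, which is exactly the complexity the set predicate is meant to hide.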
We tackle the time-series classification problem using a novel probabilistic model that represents
the conditional densities of the observed sequences being time-warped and transformed from an underlying
base sequence. We call it probabilistic sequence translation-alignment model (PSTAM) since it aims to
capture both feature alignment and mapping between sequences, analogous to translating one language into
another in the field of machine translation. To deal with general time-series, we impose the time-monotonicity
constraints on the hidden alignment variables in the model parameter space, where marginalizing them out
allows effective learning of class-specific time-warping and feature transformation simultaneously. Our
PSTAM, thus, naturally enjoys the advantages from two typical approaches widely used
in sequence classification: 1) benefits from the alignment-based methods that aim to estimate distance
measures between non-equal-length sequences via direct comparison of aligned features, and 2) merits of
the model-based approaches that can effectively capture the class-specific patterns or trends. Furthermore, the
low-dimensional modeling of the latent base sequence naturally provides a way to discover the intrinsic
manifold structure possibly retained in the observed data, leading to an unsupervised manifold learning
for sequence data. The benefits of the proposed approach are demonstrated on a comprehensive set of
evaluations with both synthetic and real-world sequence data sets.
ETPL
DM - 091
Probabilistic Sequence Translation-Alignment Model for Time-Series
Classification
Imbalanced learning problems contain an unequal distribution of data samples among different
classes and pose a challenge to any classifier as it becomes hard to learn the minority class samples. Synthetic
oversampling methods address this problem by generating the synthetic minority class samples to balance the
distribution between the samples of the majority and minority classes. This paper identifies that most of the
existing oversampling methods may generate the wrong synthetic minority samples in some scenarios and
make learning tasks harder. MWMOTE first identifies the hard-to-learn informative minority class samples and
assigns them weights according to their Euclidean distance from the nearest majority class samples. It then
generates the synthetic samples from the weighted informative minority class samples using a clustering
approach. This is done in such a way that all the generated samples lie inside some minority class
cluster. MWMOTE has been evaluated extensively on four artificial and 20 real-world data sets. The
simulation results show that our method is better than or comparable with some other existing methods in
terms of various assessment metrics, such as geometric mean (G-mean) and area under the receiver operating
characteristic (ROC) curve, usually known as area under the curve (AUC).
ETPL
DM - 092
MWMOTE--Majority Weighted Minority Oversampling Technique for
Imbalanced Data Set Learning
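The generation step MWMOTE builds on can be sketched in a few lines. This is not MWMOTE itself (which first weights informative minority samples and clusters them); it is a minimal SMOTE-style interpolation step showing how synthetic minority samples are placed between existing ones. The names here are illustrative, not from the paper.

```python
# Minimal SMOTE-style oversampling: synthetic minority samples are generated
# by interpolating between pairs of existing minority samples. MWMOTE refines
# this by weighting samples and constraining generation to minority clusters.
import random

def interpolate(a, b, t):
    """Point at fraction t along the segment from a to b."""
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]

def oversample(minority, n_new, rng):
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)   # pick two distinct minority samples
        synthetic.append(interpolate(a, b, rng.random()))
    return synthetic

rng = random.Random(0)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = oversample(minority, 5, rng)
```

Because each synthetic point lies on a segment between two minority samples, it stays inside the minority region, which is the property MWMOTE's clustering step enforces more carefully.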
Evaluation metrics are an essential and integral part of a ranking system. In the past, several evaluation
metrics have been proposed in information retrieval and web search; among them, Discounted
Cumulative Gain (DCG) has emerged as one that is widely adopted for evaluating the performance of ranking
functions used in web search. However, the two sets of parameters, the gain values and discount factors, used
in DCG are usually determined in a rather ad hoc way, and their impacts have not been carefully analyzed. In
this paper, we first show that DCG is generally not coherent, i.e., comparing the performance of ranking
functions using DCG very much depends on the particular gain values and discount factors used. We then
propose a novel methodology that can learn the gain values and discount factors from user preferences over
rankings, modeled as a special case of learning linear utility functions. We also discuss how to extend our
methods to handle tied preference pairs and how to explore active learning to reduce preference labeling.
Numerical simulations illustrate the effectiveness of our proposed methods. Moreover, experiments are also
conducted over a side-by-side comparison data set from a commercial search engine to validate the proposed
methods on real-world data.
ETPL
DM - 093
Learning the Gain Values and Discount Factors of Discounted Cumulative
Gains
The problem of learning conditional preference networks (CP-nets) from a set of examples has
received great attention recently. However, because of the randomness of users' behaviors and
observation errors, there is always some noise making the examples inconsistent, namely, there exists at least
one outcome preferred over itself (by transitivity) in the examples. Existing CP-net learning methods cannot
handle inconsistent examples. In this work, we introduce the model of learning consistent CP-nets
from inconsistent examples and present a method to solve this model. We do not learn the CP-nets directly.
Instead, we first learn a preference graph from the inconsistent examples, because dominance testing and
consistency testing in preference graphs are easier than those in CP-nets. The problem
of learning preference graphs is translated into a 0-1 programming problem and is solved by branch-and-bound
search. Then, the obtained preference graph is transformed into an equivalent CP-net, which can entail a
subset of examples with maximal sum of weights. Examples are given to show that our method can obtain
consistent CP-nets over both binary and multivalued variables from inconsistent examples. The proposed
method is verified on both simulated data and real data, and it is also compared with existing methods.
ETPL
DM - 094
Learning Conditional Preference Networks from Inconsistent Examples
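The two DCG parameter sets discussed above, gain values and discount factors, can be made concrete with a short sketch. The specific choices shown (gain 2^rel − 1, discount 1/log2(i+1)) are the common ad hoc defaults the paper critiques, not the learned parameters it proposes.

```python
# DCG with explicit gain and discount functions, making visible the two
# parameter sets the paper proposes to learn instead of fixing ad hoc.
import math

def dcg(relevances, gain=lambda r: 2 ** r - 1,
        discount=lambda i: 1.0 / math.log2(i + 1)):
    """Discounted Cumulative Gain of a ranked list of relevance labels."""
    return sum(gain(r) * discount(i) for i, r in enumerate(relevances, start=1))

# Swapping the top two results changes DCG, since later positions are discounted.
better = dcg([3, 2, 0])   # the more relevant document ranked first
worse  = dcg([2, 3, 0])
```

Changing either the gain or the discount function can reverse which of two ranking functions scores higher, which is the incoherence the paper analyzes.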
Keyword search is an intuitive paradigm for searching linked data sources on the web. We propose to
route keywords only to relevant sources to reduce the high cost of processing keyword search queries over all
sources. We propose a novel method for computing top-k routing plans based on their potentials to contain
results for a given keyword query. We employ a keyword-element relationship summary that compactly
represents relationships between keywords and the data elements mentioning them. A multilevel scoring
mechanism is proposed for computing the relevance of routing plans based on scores at the level of keywords,
data elements, element sets, and subgraphs that connect these elements. Experiments carried out using 150
publicly available sources on the web showed that valid plans (precision@1 of 0.92) that are highly relevant
(mean reciprocal rank of 0.89) can be computed in 1 second on average on a single PC. Further, we
show routing greatly helps to improve the performance of keyword search, without compromising its result
quality.
ETPL
DM - 095
Keyword Query Routing
Given a graph with billions of nodes and edges, how can we find patterns and anomalies? Are there
nodes that participate in too many or too few triangles? Are there close-knit near-cliques? These questions are
expensive to answer unless we have the first several eigenvalues and eigenvectors of the graph adjacency
matrix. However, eigensolvers suffer from subtle problems (e.g., convergence) for large sparse matrices, let
alone for billion-scale ones. We address this problem with the proposed HEIGEN algorithm, which we
carefully design to be accurate, efficient, and able to run on the highly scalable MAPREDUCE (HADOOP)
environment. This enables HEIGEN to handle matrices more than 1,000× larger than those which can
be analyzed by existing algorithms. We implement HEIGEN and run it on the M45 cluster, one of the top 50
supercomputers in the world. We report important discoveries about near-cliques and triangles on several real-
world graphs, including a snapshot of the Twitter social network (56 GB, 2 billion edges) and the
“YahooWeb” data set, one of the largest publicly available graphs (120 GB, 1.4 billion nodes,
6.6 billion edges).
ETPL
DM - 096
HEigen: Spectral Analysis for Billion-Scale Graphs
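HEIGEN itself is a distributed eigensolver designed for MapReduce; as a single-machine sketch of the underlying idea, power iteration recovers the leading eigenvalue/eigenvector of a symmetric adjacency matrix. This toy version is illustrative only and sidesteps the convergence subtleties at billion-node scale that the paper addresses.

```python
# Power iteration on a small symmetric adjacency matrix: repeatedly multiply
# and normalize; the Rayleigh quotient approximates the leading eigenvalue.
def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def power_iteration(A, iters=100):
    v = [1.0] * len(A)
    for _ in range(iters):
        w = mat_vec(A, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    lam = sum(x * y for x, y in zip(v, mat_vec(A, v)))  # Rayleigh quotient
    return lam, v

# Adjacency matrix of a triangle (nodes 0, 1, 2) plus a pendant node 3.
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
lam, v = power_iteration(A)
```

Eigenvector entries like these are exactly what triangle-counting and near-clique analyses consume: for instance, the number of triangles at a node can be estimated from eigenvalues and eigenvector components.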
A large number of organizations today generate and share textual descriptions of their products,
services, and actions. Such collections of textual data contain a significant amount of structured information,
which remains buried in the unstructured text. While information extraction algorithms facilitate the extraction
of structured relations, they are often expensive and inaccurate, especially when operating on top of text that
does not contain any instances of the targeted structured information. We present a novel alternative approach
that facilitates the generation of structured metadata by identifying documents that are likely to contain
information of interest, information that will subsequently be useful for querying the database. Our
approach relies on the idea that humans are more likely to add the necessary metadata during creation time, if
prompted by the interface; or that it is much easier for humans (and/or algorithms) to identify the metadata
when such information actually exists in the document, instead of naively prompting users to fill in forms with
information that is not available in the document. As a major contribution of this paper, we present algorithms
that identify structured attributes that are likely to appear within the document, by jointly utilizing
the content of the text and the query workload. Our experimental evaluation shows that our approach generates
superior results compared to approaches that rely only on the textual content or only on the query workload, to
identify attributes of interest.
ETPL
DM - 097
Facilitating Document Annotation Using Content and Querying Value
With the wide deployment of public cloud computing infrastructures, using clouds to host data query
services has become an appealing solution for the advantages on scalability and cost-saving. However,
some data might be so sensitive that the data owner does not want to move it to the cloud unless the data
confidentiality and query privacy are guaranteed. On the other hand, a secured query service should still
provide efficient query processing and significantly reduce the in-house workload to fully realize the benefits
of cloud computing. The RASP data perturbation method combines order-preserving encryption,
dimensionality expansion, random noise injection, and random projection, to provide strong resilience to
attacks on the perturbed data and queries. It also preserves multidimensional ranges, which allows existing
indexing techniques to be applied to speed up range query processing. The kNN-R algorithm is designed to
work with the RASP range query algorithm to process the kNN queries. We have carefully analyzed the
attacks on data and queries under a precisely defined threat model and realistic security assumptions.
Extensive experiments have been conducted to show the advantages of this approach on efficiency and
security.
ETPL
DM - 098
Building Confidential and Efficient Query Services in the Cloud with RASP
Data Perturbation
Many supervised learning approaches that adapt to changes in data distribution over time (e.g., concept drift)
have been developed. The majority of them assume that the data comes already preprocessed or
that preprocessing is an integral part of a learning algorithm. In real application tasks, data that comes from,
e.g., sensor readings, is typically noisy and contains missing values and redundant features, and a very large
part of model development effort is devoted to data preprocessing. As data evolves over time, learning models
need to be able to adapt to changes automatically. From a practical perspective, automating a predictor makes
little sense if preprocessing requires manual adjustment over time. Nevertheless, adaptation
of preprocessing has been largely overlooked in research. In this paper, we introduce and address the problem
of adaptive preprocessing. We analyze when and under what circumstances it is beneficial to handle adaptivity
of preprocessing and adaptivity of the learning model separately. We present three scenarios where
handling adaptive preprocessing separately benefits the final prediction accuracy and illustrate them using
computational examples. As a result of our analysis, we construct a prototype approach for
combining adaptive preprocessing with an adaptive predictor online. Our case study with real sensor data from a
production process demonstrates that decoupling the adaptivity of preprocessing and the predictor contributes
to improving the prediction accuracy. The developed reference framework and our experimental findings are
intended to serve as a starting point in systematic research of adaptive preprocessing mechanisms
for adaptive learning with evolving data.
ETPL
DM - 099
Adaptive Preprocessing for Streaming Data
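The decoupling argued for above can be illustrated with a toy component: an online standardizer that tracks a drifting mean with exponential forgetting, adapting on its own schedule regardless of which predictor consumes its output. The class and parameter names are illustrative, not from the paper.

```python
# A preprocessing component that adapts independently of the predictor:
# running mean/variance with exponential forgetting track distribution drift.
class OnlineStandardizer:
    def __init__(self, alpha=0.1):
        self.alpha = alpha      # forgetting factor: higher adapts faster
        self.mean = 0.0
        self.var = 1.0

    def update(self, x):
        """Update running statistics, then return the standardized value."""
        d = x - self.mean
        self.mean += self.alpha * d
        self.var = (1 - self.alpha) * (self.var + self.alpha * d * d)
        return (x - self.mean) / (self.var ** 0.5)

std = OnlineStandardizer(alpha=0.2)
# A mean shift from ~0 to ~10 mid-stream: the standardizer tracks it on its
# own, so a downstream predictor keeps seeing roughly standardized inputs.
stream = [0.1, -0.2, 0.0, 0.1] + [10.0] * 30
out = [std.update(x) for x in stream]
```

If the same statistics were frozen inside the predictor, the model would have to relearn the shift itself, which is the coupling the paper's experiments show hurts accuracy.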
Many real-world data sets grow dynamically in size. This phenomenon occurs in several fields including
economics, population studies, and medical research. As an effective and efficient mechanism to deal with
such data, incremental techniques have been proposed in the literature and have attracted much attention, which
stimulates the results in this paper. When a group of objects is added to a decision table, we first introduce
incremental mechanisms for three representative information entropies and then develop a group incremental
rough feature selection algorithm based on information entropy. When multiple objects are added to a decision
table, the algorithm aims to find the new feature subset in a much shorter time. Experiments have been carried
out on eight UCI data sets and the experimental results show that the algorithm is effective and efficient.
ETPL
DM - 100
A Group Incremental Approach to Feature Selection Applying Rough Set
Technique
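The incremental idea above, updating an information entropy when a batch of objects arrives rather than recomputing over the whole table, can be sketched minimally. This toy maintains decision-class counts so Shannon entropy can be refreshed without rescanning old objects; it is illustrative only and not the paper's three entropies or its feature-selection algorithm.

```python
# Incremental entropy sketch: fold a new group of decision labels into running
# class counts, then recompute entropy from the counts alone (no rescan of
# previously seen objects).
import math
from collections import Counter

def entropy(counts):
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)

class IncrementalEntropy:
    def __init__(self):
        self.counts = Counter()

    def add_group(self, labels):
        """Add a batch of new decision labels; return the updated entropy."""
        self.counts.update(labels)
        return entropy(self.counts)

inc = IncrementalEntropy()
h1 = inc.add_group(["yes", "yes", "no", "no"])   # balanced: entropy is 1 bit
h2 = inc.add_group(["yes"] * 4)                  # skewed: entropy drops
```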
Recent years have witnessed an increased interest in recommender systems. Despite significant progress
in this field, there still remain numerous avenues to explore. Indeed, this paper provides a study of exploiting
online travel information for personalized travel package recommendation. A critical challenge along this line
is to address the unique characteristics of travel data, which distinguish travel packages from traditional items
for recommendation. To that end, in this paper, we first analyze the characteristics of the existing travel
packages and develop a tourist-area-season topic (TAST) model. This TAST model can represent travel
packages and tourists by different topic distributions, where the topic extraction is conditioned on both the
tourists and the intrinsic features (i.e., locations, travel seasons) of the landscapes. Then, based on this topic
model representation, we propose a cocktail approach to generate the lists for personalized travel package
recommendation. Furthermore, we extend the TAST model to the tourist-relation-area-season topic (TRAST)
model for capturing the latent relationships among the tourists in each travel group. Finally, we evaluate the
TAST model, the TRAST model, and the cocktail recommendation approach on the real-world travel package
data. Experimental results show that the TAST model can effectively capture the unique characteristics of the
travel data and the cocktail approach is, thus, much more effective than traditional recommendation techniques
for travel package recommendation. Also, by considering tourist relationships, the TRAST model can be used
as an effective assessment for travel group formation.
ETPL
DM - 101
A Cocktail Approach for Travel Package Recommendation
A protein-protein interaction (PPI) network is a biomolecule relationship network that plays an important role
in biological activities. Studies of functional modules in a PPI network contribute greatly to the understanding
of biological mechanisms. With the development of life science and computing science, a great amount of PPI
data has been acquired by various experimental and computational approaches, which presents a significant
challenge of detecting functional modules in a PPI network. To address this challenge,
many functional module detecting methods have been developed. In this survey, we first analyze the existing
problems in detecting functional modules and discuss the countermeasures in data preprocessing and
postprocessing. Second, we introduce some special metrics for distances or graphs developed in the clustering
process of proteins. Third, we give a classification system of functional module detecting methods and describe some
existing detection methods in each category. Fourth, we list databases in common use and conduct
performance comparisons of several typical algorithms by popular measurements. Finally, we present the
prospects and references for researchers engaged in analyzing PPI networks.
ETPL
DM - 102
Survey: Functional Module Detection from Protein-Protein Interaction
Networks
ETPL
DM - 103
Decision Trees for Mining Data Streams Based on the Gaussian Approximation
ETPL
DM - 104
Structural Diversity for Resisting Community Identification in Published Social
Networks
Since the Hoeffding tree algorithm was proposed in the literature, decision trees became one of the
most popular tools for mining data streams. The key point of constructing the decision tree is to determine the
best attribute to split the considered node. Several methods to solve this problem were presented so far.
However, they are either wrongly mathematically justified (e.g., in the Hoeffding tree algorithm) or time-
consuming (e.g., in the McDiarmid tree algorithm). In this paper, we propose a new method which
significantly outperforms the McDiarmid tree algorithm and has a solid mathematical basis. Our method
ensures, with a high probability set by the user, that the best attribute chosen in the considered node using a
finite data sample is the same as it would be for the whole data stream.
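The split decision the abstract above critiques can be sketched concretely. This is the classical Hoeffding-tree test (the one the paper argues is wrongly justified and replaces with a Gaussian-approximation bound), shown only to make the mechanism visible; the function names are illustrative.

```python
# Classical Hoeffding-tree split test: with n samples and a split criterion
# bounded in [0, R], split on the best attribute only if its observed
# advantage over the runner-up exceeds the Hoeffding deviation bound.
import math

def hoeffding_bound(R, delta, n):
    """eps such that a sample mean of n values in [0, R] is within eps of the
    true mean with probability at least 1 - delta."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def should_split(gain_best, gain_second, R=1.0, delta=1e-6, n=1000):
    return (gain_best - gain_second) > hoeffding_bound(R, delta, n)
```

The bound shrinks as n grows, so a node postpones its split until enough stream samples have accumulated to make the attribute choice statistically safe.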
As increasing amounts of social network data are published and shared for commercial and research
purposes, the privacy of the individuals in social networks has become a serious concern.
Vertex identification, which identifies a particular user from a network based on background knowledge such
as vertex degree, is one of the most important problems that have been addressed. In reality, however, each
individual in a social network is inclined to be associated with not only a vertex identity but also
a community identity, which can represent the personal privacy information sensitive to the public, such as
political party affiliation. This paper first addresses the new privacy issue, referred to
as community identification, by showing that the community identity of a victim can still be inferred even
though the social network is protected by existing anonymity schemes. For this problem, we then propose the
concept of structural diversity to provide the anonymity of the community identities. The k-
Structural Diversity Anonymization (k-SDA) is to ensure sufficient vertices with the same vertex degree in at
least k communities in a social network. We propose an Integer Programming formulation to find optimal
solutions to k-SDA and also devise scalable heuristics to solve large-scale instances of k-SDA from different
perspectives. The performance studies on real data sets from various perspectives demonstrate the practical
utility of the proposed privacy scheme and our anonymization approaches.
The task of assigning geographic coordinates to textual resources plays an increasingly central role in
geographic information retrieval. The ability to select those terms from a given collection that are most
indicative of geographic location is of key importance in successfully addressing this task. However, this
process of selecting spatially relevant terms is at present not well understood, and the majority of current
systems are based on standard term selection techniques, such as χ² or information gain, and thus fail to
exploit the spatial nature of the domain. In this paper, we propose two classes of term selection techniques
based on standard geostatistical methods. First, to implement the idea of spatial smoothing
of term occurrences, we investigate the use of kernel density estimation (KDE) to model each term as a two-
dimensional probability distribution over the surface of the Earth. The second class of term selection methods
we consider is based on Ripley's K statistic, which measures the deviation of a point set from spatial
homogeneity. We provide experimental results which compare these classes of methods against existing
baseline techniques on the tasks of assigning coordinates to Flickr photos and to Wikipedia articles, revealing
marked improvements in cases where only a relatively small number of terms can be selected.
ETPL
DM - 105
Spatially Aware Term Selection for Geotagging
Information extraction from printed documents is still a crucial problem in many interorganizational
workflows. Solutions for other application domains, for example, the web, do not fit this peculiar scenario
well, as printed documents do not carry any explicit structural or syntactical description. Moreover,
printed documents usually lack any explicit indication about their source. We present a system, which we call
PATO, for extracting predefined items from printed documents in a dynamic multisource scenario. PATO
selects the source-specific wrapper required by each document, determines whether no suitable wrapper exists,
and generates one when necessary. PATO assumes that the need for new source-specific wrappers is a part of
normal system operation: new wrappers are generated online based on a few point-and-click operations
performed by a human operator on a GUI. The role of operators is an integral part of the design and PATO
may be configured to accommodate a broad range of automation levels. We show that PATO exhibits very
good performance on a challenging data set composed of more than 600 printed documents drawn from three
different application domains: invoices, datasheets of electronic components, and patents. We also perform an
extensive analysis of the crucial tradeoff between accuracy and automation level.
ETPL
DM - 106
Semisupervised Wrapper Choice and Generation for Print-Oriented Documents
Nowadays, the high availability of data gathered from wireless sensor networks and
telecommunication systems has drawn the attention of researchers to the problem of extracting knowledge
from spatiotemporal data. Detecting outliers which are grossly different from or inconsistent with the
remaining spatiotemporal data set is a major challenge in real-world knowledge discovery and data mining
applications. In this paper, we deal with the outlier detection problem in spatiotemporal data and describe
a rough set approach that finds the top outliers in an unlabeled spatiotemporal data set. The proposed method,
called Rough Outlier Set Extraction (ROSE), relies on a rough set theoretic representation of
the outlier set using the rough set approximations, i.e., lower and upper approximations. We have also
introduced a new set, named Kernel Set, that is a subset of the original data set, which is able to describe the
original data set both in terms of data structure and of obtained results. Experimental results on real-world
data sets demonstrate the superiority of ROSE, both in terms of some quantitative indices
and outliers detected, over those obtained by various rough fuzzy clustering algorithms and by the state-of-the-
art outlier detection methods. It is also demonstrated that the kernel set is able to detect the
same outlier set but with less computational time.
ETPL
DM - 107
Rough Sets, Kernel Set, and Spatiotemporal Outlier Detection
This paper investigates a framework of search-based face annotation (SBFA)
by mining weakly labeled facial images that are freely available on the World Wide Web (WWW). One
challenging problem for search-based face annotation schemes is how to effectively perform annotation by
exploiting the list of most similar facial images and their weak labels that are often noisy and incomplete. To
tackle this problem, we propose an effective unsupervised label refinement (ULR) approach for refining the
labels of web facial images using machine learning techniques. We formulate the learning problem as a
convex optimization and develop effective optimization algorithms to solve the large-scale learning task
efficiently. To further speed up the proposed scheme, we also propose a clustering-based approximation
algorithm which can improve the scalability considerably. We have conducted an extensive set of empirical
studies on a large-scale web facial image testbed, in which encouraging results showed that the proposed ULR
algorithms can significantly boost the performance of the promising SBFA scheme.
ETPL
DM - 108
Mining Weakly Labeled Web Facial Images for Search-Based Face Annotation
In this paper, we construct a linkable ring signature scheme with unconditional anonymity. It has been
regarded as an open problem in [22] since 2004 to construct
an unconditionally anonymous linkable ring signature scheme. We are the first to solve this open problem by
giving a concrete instantiation, which is proven secure in the random oracle model. Our construction is even
more efficient than other schemes that can only provide computational anonymity. Simultaneously, our
scheme can act as a counterexample to show that [19, Theorem 1] is not always true, which stated
that a linkable ring signature scheme cannot provide strong anonymity. Yet we prove that our scheme can
achieve strong anonymity (under one of the interpretations).
ETPL
DM - 109
Linkable Ring Signature with Unconditional Anonymity
The new method proposed in this paper applies a multivariate reconstructed phase space (MRPS) for
identifying multivariate temporal patterns that are characteristic and predictive of anomalies or events in
a dynamic data system. The new method extends the original univariate reconstructed phase space framework,
which is based on fuzzy unsupervised clustering method, by incorporating a new mechanism
of data categorization based on the definition of events. In addition to modeling temporal dynamics in a
multivariate phase space, a Bayesian approach is applied to model the first-order Markov behavior in the
multidimensional data sequences. The method utilizes an exponential loss objective function to optimize a
hybrid classifier which consists of a radial basis kernel function and a log-odds ratio component. We
performed experimental evaluation on three data sets to demonstrate the feasibility and effectiveness of the
proposed approach.
ETPL
DM - 110
Event Characterization and Prediction Based on Temporal Patterns in Dynamic
Data System
This paper introduces two kinds of decision tree ensembles for imbalanced classification problems,
extensively utilizing properties of α-divergence. First, a novel splitting criterion based on α-divergence is
shown to generalize several well-known splitting criteria such as those used in C4.5 and CART. When the α-
divergence splitting criterion is applied to imbalanced data, one can obtain decision trees that tend to be less
correlated (α-diversification) by varying the value of α. This increased diversity in an ensemble of
such trees improves AUROC values across a range of minority class priors. The resultant ensemble produces a
set of interpretable rules that provide higher lift values for a given coverage, a property that is much desirable
in applications such as direct marketing. Experimental results across many class-imbalanced data sets,
including BRFSS, and MIMIC data sets from the medical community and several sets from UCI and KEEL
are provided to highlight the effectiveness of the proposed ensembles over a wide range of data distributions
and of class imbalance.
ETPL
DM - 111
Ensembles of α-Trees for Imbalanced Classification Problems
Detection of emerging topics is now receiving renewed interest, motivated by the rapid growth
of social networks. Conventional term-frequency-based approaches may not be appropriate in this context,
because the information exchanged in social network posts includes not only text but also images, URLs, and
videos. We focus on the emergence of topics signaled by social aspects of these networks. Specifically, we focus
on mentions of users: links between users that are generated dynamically (intentionally or unintentionally)
through replies, mentions, and retweets. We propose a probability model of the mentioning behavior of
a social network user, and propose to detect the emergence of a new topic from the anomalies measured
through the model. Aggregating anomaly scores from hundreds of users, we show that we can
detect emerging topics based only on the reply/mention relationships in social network posts. We demonstrate
our technique on several real data sets we gathered from Twitter. The experiments show that the proposed
mention-anomaly-based approaches can detect new topics at least as early as text-anomaly-based approaches,
and in some cases much earlier when the topic is poorly identified by the textual contents of posts.
ETPL
DM - 112
Discovering Emerging Topics in Social Streams via Link-Anomaly Detection
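The aggregation idea in the abstract above can be sketched in a toy form: score each observed mention event by its negative log-likelihood under a per-user mention model, sum the scores across users, and flag emergence when the aggregate crosses a threshold. All names and the threshold here are illustrative assumptions, not the paper's actual model:

```python
import math

def anomaly_score(prob):
    """Anomaly score of one observed mention event: its negative
    log-likelihood under the user's (hypothetical) mention model.
    Low-probability events yield high scores."""
    return -math.log(prob)

def detect_emergence(user_probs, threshold):
    """Aggregate per-user anomaly scores and flag a topic as
    emerging when the total exceeds the threshold."""
    total = sum(anomaly_score(p) for p in user_probs)
    return total > threshold
```

With probabilities near 1 nothing fires; a burst of improbable mention patterns across many users pushes the aggregate over the threshold.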
Big Data concerns large-volume, complex, growing data sets with multiple, autonomous
sources. With the fast development of networking, data storage, and data collection capacity, Big Data is
now rapidly expanding in all science and engineering domains, including the physical, biological, and
biomedical sciences. This paper presents a HACE theorem that characterizes the features of the Big Data
revolution, and proposes a Big Data processing model from the data mining perspective. This data-driven
model involves demand-driven aggregation of information sources, mining and analysis, user interest
modeling, and security and privacy considerations. We analyze the challenging issues in the data-driven
model and in the Big Data revolution.
ETPL
DM - 113
Data mining with big data
In this paper, we tackle a novel problem of ranking multivalued objects, where an object has multiple
instances in a multidimensional space, and the number of instances per object is not fixed. Given an ad hoc
scoring function that assigns a score to a multidimensional instance, we want to rank a set of multivalued
objects. Different from the existing models of ranking uncertain and probabilistic data, which model an object
as a random variable and assume the instances of an object to be mutually exclusive, we have to capture the
coexistence of instances here. To tackle the problem, we advocate the semantics of favoring widely preferred
objects instead of majority votes, a principle widely used in many elections and competitions.
ETPL
DM - 114
Consensus-Based Ranking of Multivalued Objects: A Generalized Borda Count
Approach
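The consensus semantics above can be illustrated with a simplified Borda-style count over instances (a sketch under strong simplifying assumptions; the paper's generalized Borda count is more involved than this):

```python
def borda_rank(objects):
    """Rank multivalued objects by a simplified Borda count.

    `objects` maps an object name to its list of instance scores
    (higher is better); the number of instances per object may
    vary. Each instance earns one point for every instance of a
    *different* object it outranks; an object's consensus score
    is the average over its own instances, so objects with more
    instances are not favored outright."""
    points = {}
    for name, scores in objects.items():
        others = [s for other, ss in objects.items()
                  if other != name for s in ss]
        pts = sum(sum(1 for o in others if s > o) for s in scores)
        points[name] = pts / len(scores)
    # Highest average Borda points first.
    return sorted(points, key=points.get, reverse=True)
```

An object whose every instance beats most instances of its rivals ranks first, even if no single instance has the top score.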
Technology-supported learning systems have proved to be helpful in many learning situations. These
systems require an appropriate representation of the knowledge to be learned, the Domain Module. The
authoring of the Domain Module is cost- and labor-intensive, but its development cost might be reduced by
exploiting semiautomatic Domain Module authoring techniques and promoting knowledge reuse.
DOM-Sortze is a system that uses natural language processing techniques, heuristic reasoning, and ontologies for the
semiautomatic construction of the Domain Module from electronic textbooks. To determine how it might help
in the Domain Module authoring process, it has been tested with an electronic textbook, and the gathered
knowledge has been compared with the Domain Module that instructional designers developed manually. This
paper presents DOM-Sortze and describes the experiment carried out.
ETPL
DM - 115
Automatic Generation of the Domain Module from Electronic Textbooks:
Method and Validation
Shortest distance query is a fundamental operation in large-scale networks. Many existing methods in
the literature take a landmark embedding approach, which selects a set of graph nodes as landmarks and
computes the shortest distances from each landmark to all nodes as an embedding. To answer a shortest
distance query, the precomputed distances from the landmarks to the two query nodes are used to compute an
approximate shortest distance based on the triangle inequality. In this paper, we analyze the factors that affect
the accuracy of distance estimation in landmark embedding. In particular, we find that a globally selected,
query-independent landmark set may introduce a large relative error, especially for nearby query nodes. To
address this issue, we propose a query-dependent local landmark scheme, which identifies a local landmark
close to both query nodes and provides more accurate distance estimation than the traditional global landmark
approach. We propose efficient local landmark indexing and retrieval techniques, which achieve low offline
indexing complexity and online query complexity. Two optimization techniques, on graph compression and
graph online search, are also proposed with the goal of further reducing index size and improving query
accuracy. Furthermore, the challenge of immense graphs whose index may not fit in memory leads us to
store the embedding in a relational database, so that a query under the local landmark scheme can be expressed
with relational operators. Effective indexing and query optimization mechanisms are designed in this context.
Our experimental results on large-scale social networks and road networks demonstrate that the local landmark
scheme reduces the shortest distance estimation error significantly when compared with global landmark
embedding and the state-of-the-art sketch-based embedding.
ETPL
DM - 116
Approximate Shortest Distance Computing: A Query-Dependent Local
Landmark Scheme
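The triangle-inequality estimation described above, and the large relative error a global landmark incurs for nearby query nodes, can be shown on a toy path graph (a minimal sketch; graph, names, and landmark choices are illustrative, not from the paper):

```python
import heapq

def dijkstra(graph, src):
    """Single-source shortest distances on a weighted graph given
    as an adjacency dict {node: {neighbor: weight}}."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def landmark_estimate(embeddings, u, v):
    """Upper-bound distance estimate via the triangle inequality:
    d(u, v) <= min over landmarks l of d(l, u) + d(l, v)."""
    return min(dist[u] + dist[v] for dist in embeddings.values())
```

On the path a-b-c-d (unit weights), a single global landmark at `a` estimates d(b, c) as 1 + 2 = 3, three times the true distance 1; adding a landmark at `b`, close to both query nodes, recovers the exact answer — the intuition behind the query-dependent local landmark scheme.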
Semi-supervised clustering aims to improve clustering performance by considering user supervision in
the form of pairwise constraints. In this paper, we study the active learning problem of selecting pairwise
must-link and cannot-link constraints for semi-supervised clustering. We consider active learning in an
iterative manner where in each iteration queries are selected based on the current clustering solution and the
existing constraint set. We apply a general framework that builds on the concept of neighborhood, where
neighborhoods contain "labeled examples" of different clusters according to the pairwise constraints. Our
active learning method expands the neighborhoods by selecting informative points and querying their
relationship with the neighborhoods. Under this framework, we build on the classic uncertainty-based
principle and present a novel approach for computing the uncertainty associated with each data point. We
further introduce a selection criterion that trades off the amount of uncertainty of each data point with the
expected number of queries (the cost) required to resolve this uncertainty. This allows us to select queries that
have the highest information rate. We evaluate the proposed method on benchmark data sets, and the results
demonstrate consistent and substantial improvements over the current state of the art.
ETPL
DM - 117
Active Learning of Constraints for Semi-Supervised Clustering
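The selection criterion described above — trading off a point's uncertainty against the expected number of queries needed to resolve it — reduces to picking the candidate with the highest information rate. A minimal sketch (the tuple layout and names are illustrative assumptions):

```python
def select_query(candidates):
    """Pick the data point with the highest information rate.

    `candidates` is a list of (point, uncertainty, expected_queries)
    tuples; the rate is uncertainty divided by the expected number
    of pairwise must-link/cannot-link queries (the cost) needed to
    resolve that uncertainty."""
    return max(candidates, key=lambda c: c[1] / c[2])[0]
```

A very uncertain point that needs many queries can lose to a moderately uncertain point resolvable with a single query.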
Extending the keyword search paradigm to relational data has been an active area of research within
the database and IR community during the past decade. Many approaches have been proposed, but despite
numerous publications, there remains a severe lack of standardization for the evaluation of proposed search
techniques. This lack of standardization has resulted in contradictory results from different evaluations, and the
numerous discrepancies muddle what advantages are proffered by different approaches. In this paper, we
present the most extensive empirical performance evaluation of relational keyword search techniques to appear
to date in the literature. Our results indicate that many existing search techniques do not provide acceptable
performance for realistic retrieval tasks. In particular, memory consumption precludes many search techniques
from scaling beyond small data sets with tens of thousands of vertices. We also explore the relationship
between execution time and factors varied in previous evaluations; our analysis indicates that most of these
factors have relatively little impact on performance. In summary, our work confirms previous claims regarding
the unacceptable performance of these search techniques and underscores the need for standardization in
evaluations--standardization exemplified by the IR community.
ETPL
DM - 118
An Empirical Performance Evaluation of Relational Keyword Search
Techniques
Collaborative tagging is one of the most popular services available online, and it allows end users to
loosely classify either online or offline resources based on their feedback, expressed in the form of free-text
labels (i.e., tags). Although tags may not be per se sensitive information, the wide use of
collaborative tagging services increases the risk of cross-referencing, thereby seriously compromising
user privacy. In this paper, we make a first contribution toward the development of a privacy-preserving
collaborative tagging service, by showing how a specific privacy-enhancing technology, namely tag
suppression, can be used to protect end-user privacy. Moreover, we analyze how our approach can affect the
effectiveness of a policy-based collaborative tagging system that supports enhanced web access
functionalities, like content filtering and discovery, based on preferences specified by end users.
ETPL
DM - 119
Privacy-Preserving Enhanced Collaborative Tagging
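The tag-suppression idea above — withholding tags the user deems revealing before they are published — can be sketched in its simplest form (the function and the example tags are illustrative assumptions, not the paper's mechanism):

```python
def suppress_tags(tags, sensitive):
    """Tag suppression: publish only the tags outside the user's
    sensitive set, trading some tagging utility for a reduced
    risk of cross-referencing and profiling."""
    return [t for t in tags if t not in sensitive]
```

Suppressing more tags strengthens privacy but degrades the content filtering and discovery that the collaborative tagging system builds on — the effectiveness trade-off the paper analyzes.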
Thank You!