Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Coimbatore | Ramnad
Pondicherry | Salem | Erode | Tirunelveli
http://www.elysiumtechnologies.com, [email protected]
ETPL NT-001 Answering “What-If” Deployment and Configuration Questions With WISE: Techniques and Deployment Experience
ETPL NT-002 Complexity Analysis and Algorithm Design for Advance Bandwidth Scheduling in Dedicated Networks
ETPL NT-003 Diffusion Dynamics of Network Technologies With Bounded Rational Users: Aspiration-Based Learning
ETPL NT-004 Delay-Based Network Utility Maximization
ETPL NT-005 A Distributed Control Law for Load Balancing in Content Delivery Networks
ETPL NT-006 Efficient Algorithms for Neighbor Discovery in Wireless Networks
ETPL NT-007 Stochastic Game for Wireless Network Virtualization
ETPL NT-008 ABC: Adaptive Binary Cuttings for Multidimensional Packet Classification
ETPL NT-009 A Utility Maximization Framework for Fair and Efficient Multicasting in Multicarrier Wireless Cellular Networks
ETPL NT-010 Achieving Efficient Flooding by Utilizing Link Correlation in Wireless Sensor Networks
ETPL NT-011 Random Walks and Green's Function on Digraphs: A Framework for Estimating Wireless Transmission Costs
ETPL NT-012 A Flexible Platform for Hardware-Aware Network Experiments and a Case Study on Wireless Network Coding
ETPL NT-013 Exploring the Design Space of Multichannel Peer-to-Peer Live Video Streaming Systems
ETPL NT-014 Secondary Spectrum Trading—Auction-Based Framework for Spectrum Allocation and Profit Sharing
ETPL NT-015 Towards Practical Communication in Byzantine-Resistant DHTs
ETPL NT-016 Semi-Random Backoff: Towards Resource Reservation for Channel Access in Wireless LANs
ETPL NT-017 Entry and Spectrum Sharing Scheme Selection in Femtocell Communications Markets
ETPL NT-018 On Replication Algorithm in P2P VoD
ETPL NT-019 Back-Pressure-Based Packet-by-Packet Adaptive Routing in Communication Networks
ETPL NT-020 Scheduling in a Random Environment: Stability and Asymptotic Optimality
ETPL NT-021 An Empirical Interference Modeling for Link Reliability Assessment in Wireless Networks
ETPL NT-022 On Downlink Capacity of Cellular Data Networks With WLAN/WPAN Relays
ETPL NT-023 Centralized and Distributed Protocols for Tracker-Based Dynamic Swarm Management
ETPL NT-024 Localization of Wireless Sensor Networks in the Wild: Pursuit of Ranging Quality
ETPL NT-025 Control of Wireless Networks With Secrecy
ETPL NT-026 ICTCP: Incast Congestion Control for TCP in Data-Center Networks
ETPL NT-027 Context-Aware Nanoscale Modeling of Multicast Multihop Cellular Networks
ETPL NT-028 Moment-Based Spectral Analysis of Large-Scale Networks Using Local Structural Information
ETPL NT-029 Internet-Scale IPv4 Alias Resolution With MIDAR
ETPL NT-030 Time-Bounded Essential Localization for Wireless Sensor Networks
ETPL NT-031 Stability of FIPP p-Cycles Under Dynamic Traffic in WDM Networks
ETPL NT-032 Cooperative Carrier Signaling: Harmonizing Coexisting WPAN and WLAN Devices
ETPL NT-033 Mobility Increases the Connectivity of Wireless Networks
ETPL NT-034 Topology Control for Effective Interference Cancellation in Multiuser MIMO Networks
ETPL NT-035 Distortion-Aware Scalable Video Streaming to Multinetwork Clients
ETPL NT-036 Combined Optimal Control of Activation and Transmission in Delay-Tolerant Networks
ETPL NT-037 A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless
Kernel principal component analysis (KPCA) combined with the reconstruction error is an effective anomaly
detection technique for non-linear datasets. In an environment where a phenomenon generates data that is
non-stationary, anomaly detection requires a recomputation of the kernel eigenspace in order to represent the
current data distribution. Recomputation is a computationally complex operation and reducing computational
complexity is therefore a key challenge. In this paper, we propose an algorithm that is able to accurately
remove data from a kernel eigenspace without performing a batch recomputation. Coupled with a kernel
eigenspace update, we demonstrate that our technique is able to remove and add data to a kernel eigenspace
more accurately than existing techniques. An adaptive version determines an appropriately sized sliding
window of data and when a model update is necessary. Experimental evaluations on both synthetic and real-
world datasets demonstrate the superior performance of the proposed approach in comparison to alternative
incremental KPCA approaches and alternative anomaly detection techniques.
ETPL DM-001 Adaptive Anomaly Detection with Kernel Eigenspace Splitting and Merging
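The detector described in the abstract scores a point by how poorly the leading kernel principal components reconstruct it. The following is an illustrative sketch, not the paper's splitting/merging update: it uses an uncentered RBF kernel matrix, power iteration for the leading eigenpair, and the reconstruction error k(x,x) − p² as the anomaly score. The kernel width `gamma` and the single-component setup are simplifying assumptions.

```python
import math

def rbf(a, b, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an assumed width."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def kpca_top_component(X, gamma=0.5, iters=200):
    """Leading eigenpair of the (uncentered) kernel matrix via power iteration."""
    n = len(X)
    K = [[rbf(X[i], X[j], gamma) for j in range(n)] for i in range(n)]
    v, lam = [1.0] * n, 1.0
    for _ in range(iters):
        w = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = math.sqrt(sum(x * x for x in w))
        v = [x / lam for x in w]
    # scale so the corresponding feature-space component has unit norm
    alpha = [x / math.sqrt(lam) for x in v]
    return alpha, lam

def reconstruction_error(x, X, alpha, gamma=0.5):
    """k(x, x) minus the squared projection onto the leading component."""
    p = sum(a * rbf(xi, x, gamma) for a, xi in zip(alpha, X))
    return rbf(x, x, gamma) - p * p
```

A far-away point projects onto almost nothing, so its error approaches k(x,x) = 1, while points resembling the training data score much lower; the paper's contribution is updating `alpha` as data is added and removed without recomputing K from scratch.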
It is nowadays well established that the construction of quality domain ontologies benefits from
the involvement of multiple actors in the modelling process, possibly having different roles and skills. To be
effective, the collaboration between these actors has to be fostered, enabling each of them to actively and
readily participate in the development of the ontology, favouring as much as possible the direct involvement of
the domain experts in the authoring activities. Recent works have shown that ontology modelling tools based
on the wiki paradigm and technology can contribute to meeting these collaborative requirements. This paper
investigates, both at the theoretical and empirical level, the effectiveness of wiki features for collaborative
ontology authoring in supporting teams composed of domain experts and knowledge engineers, as well as
their impact on the entire process of collaborative ontology modelling and entity lifecycle.
ETPL DM-002 Evaluating Wiki Collaborative Features in Ontology Authoring
H.264 video codec systems require a large-capacity Frame Store (FS) for buffering reference frames. The
up-to-date Phase-change Random Access Memory (PRAM) is a promising approach for on-chip caching of the
reference signals, as PRAM offers the advantages of high density and low leakage power. However,
the write endurance problem, namely that a PRAM cell can only tolerate a limited number of write
operations, remains the main barrier in practical applications. This paper studies wear reduction techniques
for PRAM-based FS in an H.264 codec system. On the basis of rate-distortion theory, content-oriented
selective writing mechanisms are proposed to reduce bit updates in the reference frame buffers. Experiments
demonstrate that, for typical video sequences with different frame sizes, our methods on average achieve more
than a 30% reduction in bit updates, while introducing around 20% BD-BR cost. The power consumption is
reduced by 55% on average, and the estimated PRAM lifetime is extended by 61%.
ETPL DM-003 B^p-tree: A Predictive B^+-tree for Reducing Writes on Phase Change Memory
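The selective-writing idea rests on a baseline fact about PRAM: a write only needs to touch the bits that actually differ from the stored contents. A minimal sketch of such data-comparison writing follows; the paper's rate-distortion-guided selection is not modeled here, and the word width and frame layout are assumptions.

```python
def bit_flips(old: int, new: int) -> int:
    """Bits that actually change when writing `new` over `old` (data-comparison write)."""
    return bin(old ^ new).count("1")

def frame_update_cost(old_frame, new_frame, word_bits=8):
    """Naive full-rewrite cost vs. cost when only differing bits are written."""
    naive = len(new_frame) * word_bits
    dcw = sum(bit_flips(o, n) for o, n in zip(old_frame, new_frame))
    return naive, dcw
```

Comparing the two costs on real reference-frame traffic is how a reduction in bit updates, and hence in wear and write energy, would be quantified.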
Edit distance is widely used for measuring the similarity between two strings. As a primitive
operation, edit distance based string similarity search is to find strings in a collection that are similar to a given
query string using edit distance. Existing approaches for answering such string similarity queries follow the
filter-and-verify framework by using various indexes. Typically, most approaches assume that indexes and
datasets are maintained in main memory. To overcome this limitation, in this paper, we propose B+-tree based
approaches to answer edit distance based string similarity queries, and hence, our approaches can be easily
integrated into existing RDBMSs. In general, we answer string similarity search using pruning techniques
employed in metric spaces, since edit distance is a metric. First, we split the string collection into partitions
according to a set of reference strings. Then, we index strings in all partitions using a single B+-tree based on
the distances of these strings to their corresponding reference strings. Finally, we propose two approaches to
efficiently answer range and KNN queries, respectively, based on the B+-tree. We prove that the optimal
partitioning of the dataset is an NP-hard problem, and therefore propose a heuristic approach for selecting the
reference strings greedily and present an optimal partition assignment strategy to minimize the expected
number of strings that need to be verified during the query evaluation. Through extensive experiments over a
variety of real datasets, we demonstrate that our B+-tree based approaches provide superior performance over
state-of-the-art techniques on both range and KNN queries in most cases.
ETPL DM-004 Efficiently Supporting Edit Distance based String Similarity Search Using B+-trees
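The partitioning scheme relies on the triangle inequality: since edit distance is a metric, |d(q,r) − d(s,r)| ≤ d(q,s) for any reference string r, so a string whose precomputed distance to r differs from the query's by more than the threshold can be skipped without computing its distance at all. A minimal single-reference sketch (the paper uses multiple references and a B+-tree; the helper names here are ours):

```python
def edit_distance(a, b):
    """Standard dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def range_query(query, strings, ref, tau):
    """Answer d(query, s) <= tau using one reference string for pruning."""
    dq = edit_distance(query, ref)
    index = [(s, edit_distance(s, ref)) for s in strings]  # built once, reusable
    out = []
    for s, ds in index:
        if abs(dq - ds) > tau:          # triangle inequality: skip without verifying
            continue
        if edit_distance(query, s) <= tau:
            out.append(s)
    return out
```

In the paper the per-string distances to the reference strings are what gets indexed in the B+-tree, so the pruning test becomes a range scan instead of a linear pass.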
Search Engine Marketing (SEM) agencies manage thousands of search keywords for their
clients. The campaign management dashboards provided by advertisement brokers have interfaces to change
search campaign attributes. Using these dashboards, advertisers create test variants for various bid choices,
keyword ideas, and advertisement text options. Later on, they conduct controlled experiments for selecting the
best performing variants. Given a large keyword portfolio and many variants to consider, campaign
management can easily become a burden on even experienced advertisers. In order to target users in need of a
particular service, advertisers have to determine the purchase intents or information needs of target users.
Once the target intents are determined, advertisers can target those users with relevant search keywords. In
order to formulate information needs and to scale campaign management with increasing number of keywords,
we propose a framework called TopicMachine, where we learn the latent topics hidden in the available search
terms reports. Our hypothesis is that these topics correspond to the set of information needs that best
matchmake a given client with users. In our experiments, TopicMachine outperformed its closest competitor by 41%
on predicting total user subscriptions.
ETPL DM-005 TopicMachine: Conversion Prediction in Search Advertising using Latent Topic Models
A highly comparative, feature-based approach to time series classification is introduced
that uses an extensive database of algorithms to extract thousands of interpretable features from time series.
These features are derived from across the scientific time-series analysis literature, and include summaries of
time series in terms of their correlation structure, distribution, entropy, stationarity, scaling properties, and fits
to a range of time-series models. After computing thousands of features for each time series in a training set,
those that are most informative of the class structure are selected using greedy forward feature selection with a
linear classifier. The resulting feature-based classifiers automatically learn the differences between classes
using a reduced number of time-series properties, and circumvent the need to calculate distances between time
series. Representing time series in this way results in orders of magnitude of dimensionality reduction,
allowing the method to perform well on very large datasets containing long time series or time series of
different lengths. For many of the datasets studied, classification performance exceeded that of conventional
instance-based classifiers, including one-nearest-neighbor classifiers using Euclidean distances and dynamic
time warping and, most importantly, the features selected provide an understanding of the properties of the
dataset, insight that can guide further scientific investigation.
ETPL DM-006 Highly comparative feature-based time-series classification
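A toy version of the pipeline above: mass-extract a few interpretable features, then greedily add whichever feature most improves a leave-one-out nearest-centroid accuracy. The paper extracts thousands of features and uses a linear classifier; the three features and the centroid classifier here are simplifying assumptions.

```python
import statistics

def features(ts):
    """Three interpretable summaries of a series (a stand-in for thousands)."""
    m = statistics.fmean(ts)
    num = sum((a - m) * (b - m) for a, b in zip(ts, ts[1:]))
    den = sum((a - m) ** 2 for a in ts)
    return {"mean": m, "std": statistics.pstdev(ts),
            "acf1": num / den if den else 0.0}

def nc_accuracy(rows, labels, feats):
    """Leave-one-out accuracy of a nearest-centroid classifier on `feats`."""
    correct = 0
    for i in range(len(rows)):
        groups = {}
        for j, (r, y) in enumerate(zip(rows, labels)):
            if j != i:
                groups.setdefault(y, []).append(r)
        def dist(r, grp):
            return sum((r[f] - statistics.fmean(g[f] for g in grp)) ** 2 for f in feats)
        pred = min(groups, key=lambda y: dist(rows[i], groups[y]))
        correct += pred == labels[i]
    return correct / len(rows)

def greedy_select(rows, labels, all_feats):
    """Greedy forward selection: add features while accuracy still improves."""
    chosen, best = [], 0.0
    while len(chosen) < len(all_feats):
        cand = max((f for f in all_feats if f not in chosen),
                   key=lambda f: nc_accuracy(rows, labels, chosen + [f]))
        acc = nc_accuracy(rows, labels, chosen + [cand])
        if acc <= best:
            break
        chosen.append(cand)
        best = acc
    return chosen
```

Note that once series are reduced to a feature dictionary, series of different lengths become directly comparable, which is one of the advantages the abstract highlights.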
Product quantization-based approaches are effective to encode high-dimensional data points for
approximate nearest neighbor search. The space is decomposed into a Cartesian product of low-dimensional
subspaces, each of which generates a sub codebook. Data points are encoded as compact binary codes using
these sub codebooks, and the distance between two data points can be approximated efficiently from their
codes by the precomputed lookup tables. Traditionally, to encode a subvector of a data point in a subspace,
only one sub codeword in the corresponding sub codebook is selected, which may impose strict restrictions on
the search accuracy. In this paper, we propose a novel approach, named Optimized Cartesian K-Means
(OCKM), to better encode the data points for more accurate approximate nearest neighbor search. In OCKM,
multiple sub codewords are used to encode the subvector of a data point in a subspace. Each sub codeword
stems from different sub codebooks in each subspace, which are optimally generated with regards to the
minimization of the distortion errors. The high-dimensional data point is then encoded as the concatenation of
the indices of multiple sub codewords from all the subspaces. This can provide more flexibility and lower
distortion errors than traditional methods. Experimental results on the standard real-life datasets demonstrate
the superiority over state-of-the-art approaches for approximate nearest neighbor search.
ETPL DM-007 Optimized Cartesian K-Means
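For reference, the plain product-quantization baseline that OCKM relaxes looks like this: each subvector is assigned to its single nearest sub-codeword, and query distances are approximated by summing precomputed lookup tables. Codebook training is omitted; the codebooks are assumed given.

```python
def encode(vec, codebooks):
    """Assign each subvector to its single nearest sub-codeword (plain PQ)."""
    d = len(vec) // len(codebooks)
    code = []
    for m, cb in enumerate(codebooks):
        sub = vec[m * d:(m + 1) * d]
        code.append(min(range(len(cb)),
                        key=lambda k: sum((a - b) ** 2 for a, b in zip(sub, cb[k]))))
    return tuple(code)

def adc_tables(query, codebooks):
    """Squared distance from each query subvector to every sub-codeword."""
    d = len(query) // len(codebooks)
    return [[sum((a - b) ** 2 for a, b in zip(query[m * d:(m + 1) * d], w))
             for w in cb] for m, cb in enumerate(codebooks)]

def approx_dist(code, tables):
    """Approximate query-to-point distance: one table lookup per subspace."""
    return sum(t[k] for t, k in zip(tables, code))
```

OCKM's change is in `encode`: it selects multiple sub-codewords per subspace, drawn from different sub-codebooks, which lowers the distortion of the code at the same lookup-table query cost structure.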
The key task in developing graph-based learning algorithms is constructing an
informative graph to express the contextual information of a data manifold. Since traditional graph
construction methods are sensitive to noise and less datum-adaptive to changes in density, a new method
called ℓ1-graph was proposed recently. A graph construction needs to have two important properties: sparsity
and locality. The ℓ1-graph has a strong sparsity property, but a weak locality property. Thus, we propose a
new method of constructing an informative graph using auto-grouped sparse regularization based on the ℓ1-
graph, which we call the Group Sparse graph (GS-graph). We also show how to efficiently construct a GS-
graph in reproducing kernel Hilbert space with the kernel trick. The new methods, the GS-graph and its
kernelized version (KGS-graph), have the same noise-insensitive property as the ℓ1-graph and can
successfully preserve the properties of sparsity and locality simultaneously. Furthermore, we integrate the
proposed graph with several graph-based learning algorithms to demonstrate the effectiveness of our method.
The empirical studies on benchmarks show that the proposed methods outperform the ℓ1-graph and other
traditional graph construction methods in various learning tasks.
ETPL DM-008 Graph-based Learning via Auto-Grouped Sparse Regularization and Kernelized Extension
Location-based services (LBS) enable mobile users to query points-of-interest (e.g., restaurants,
cafes) on various features (e.g., price, quality, variety). In addition, users require accurate query results with
up-to-date travel times. Lacking the monitoring infrastructure for road traffic, the LBS may obtain live travel
times of routes from online route APIs in order to offer accurate results. Our goal is to reduce the number of
requests issued by the LBS significantly while preserving accurate query results. First, we propose to exploit
recent routes requested from route APIs to answer queries accurately. Then, we design effective lower/upper
bounding techniques and ordering techniques to process queries efficiently. Also, we study parallel route
requests to further reduce the query response time. Our experimental evaluation shows that our solution is 3
times more efficient than a competitor, and yet achieves high result accuracy (above 98%).
ETPL DM-009 Route-Saver: Leveraging Route APIs for Accurate and Efficient Query Processing at Location-Based Services
The knowledge remembered by the human body and reflected by the dexterity of body motion is
called embodied knowledge. In this paper, we propose a new method using singular value decomposition for
extracting embodied knowledge from time-series motion data. We compose a matrix from the time-
series data and use the left singular vectors of the matrix as motion patterns, and the singular values
as scalar weights indicating how strongly each corresponding left singular vector contributes to the matrix. Two experiments were
conducted to validate the method. One is a gesture recognition experiment in which we categorize gesture
motions by two kinds of models with indexes of similarity and estimation that use left singular vectors. The
proposed method obtained a higher correct categorization ratio than principal component analysis (PCA) and
correlation efficiency (CE). The other is an ambulation evaluation experiment in which we distinguished the
levels of walking disability. The first singular values derived from the walking acceleration were suggested to
be a reliable criterion to evaluate walking disability. Finally we discuss the characteristic and significance of
the embodied knowledge extraction using the singular value decomposition proposed in this paper.
ETPL DM-010 Knowledge Acquisition Method based on Singular Value Decomposition for Human Motion Analysis
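The first singular value and left singular vector that the ambulation evaluation relies on can be obtained without a full SVD, for example by power iteration on A^T A. A pure-Python sketch, where the row-major matrix layout and the fixed iteration count are assumptions:

```python
import math

def first_singular(A, iters=100):
    """Leading singular triple of matrix A via power iteration on A^T A."""
    rows, cols = len(A), len(A[0])
    v = [1.0] * cols
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]   # A v
        z = [sum(A[i][j] * w[i] for i in range(rows)) for j in range(cols)]   # A^T A v
        norm = math.sqrt(sum(x * x for x in z))
        v = [x / norm for x in z]
    Av = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av]       # left singular vector = motion pattern
    return sigma, u, v
```

Here `u` plays the role of the extracted motion pattern and `sigma` the scalar weight discussed in the abstract.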
Given a scoring function that computes the score of a pair of objects, a top-k pairs query returns k
pairs with the smallest scores. In this paper, we present a unified framework for answering generic top-k pairs
queries including k-closest pairs queries, k- furthest pairs queries and their variants. Note that k-closest pairs
query is a special case of top-k pairs queries where the scoring function is the distance between the two objects
in a pair. We are the first to present a unified framework to efficiently answer a broad class of top-k queries
including the queries mentioned above. We present efficient algorithms and provide a detailed theoretical
analysis that demonstrates that the expected performance of our proposed algorithms is optimal for two
dimensional data sets. Furthermore, our framework does not require pre-built indexes, uses limited main
memory and is easy to implement. We also extend our techniques to support top-k pairs queries on multi-
valued (or uncertain) objects. We also demonstrate that our framework can handle exclusive top-k pairs
queries. Our extensive experimental study demonstrates effectiveness and efficiency of our proposed
techniques.
ETPL DM-011 A Unified Framework for Answering k Closest Pairs Queries and Variants
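The query semantics are easy to pin down with a brute-force baseline: enumerate the pairs and keep the k smallest under an arbitrary scoring function. The paper's contribution is precisely avoiding this full enumeration, but the baseline shows how one generic scoring function covers closest pairs, furthest pairs, and their variants.

```python
import heapq
from itertools import combinations

def top_k_pairs(objects, k, score):
    """Brute-force baseline: enumerate all pairs, keep the k smallest scores."""
    return heapq.nsmallest(k, combinations(objects, 2), key=lambda p: score(*p))

pts = [(0, 0), (1, 0), (5, 5), (1, 1)]
sq_dist = lambda a, b: (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
closest = top_k_pairs(pts, 2, sq_dist)                       # k-closest pairs
furthest = top_k_pairs(pts, 2, lambda a, b: -sq_dist(a, b))  # k-furthest pairs
```

Negating the score turns a closest-pairs query into a furthest-pairs query, which is the sense in which one framework answers the whole family.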
In this paper, we tackle the problem of discovering movement-based communities of users,
where users in the same community have similar movement behaviors. Note that the identification of
movement-based communities is beneficial to location-based services and trajectory recommendation services.
Specifically, we propose a framework to mine movement-based communities which consists of three phases: 1)
constructing trajectory profiles of users, 2) deriving similarity between trajectory profiles, and 3) discovering
movement-based communities. In the first phase, we design a data structure, called the Sequential Probability
tree (SP-tree), as a user trajectory profile. SP-trees not only derive sequential patterns, but also indicate
transition probabilities of movements. Moreover, we propose two algorithms: BF (standing for Breadth-First)
and DF (standing for Depth-First) to construct SP-tree structures as user profiles. To measure the similarity
values among users’ trajectory profiles, we further develop a similarity function that takes SP-tree information
into account. In light of the similarity values derived, we formulate an objective function to evaluate the
quality of communities. According to the objective function derived, we propose a greedy algorithm Geo-
Cluster to effectively derive communities. To evaluate our proposed algorithms, we have conducted
comprehensive experiments on two real datasets. The experimental results show that our proposed framework
can effectively discover movement-based user communities.
ETPL DM-012 Exploring Sequential Probability Tree for Movement-based Community Discovery
In the literature about association analysis, many interestingness measures have been
proposed to assess the quality of obtained association rules in order to select a small set of the most interesting
among them. In the particular case of hierarchically organized items and generalized association rules
connecting them, a measure that dealt appropriately with the hierarchy would be advantageous. Here we
present the further developments of a new class of such hierarchical interestingness measures and compare
them with a large set of conventional measures and with three hierarchical pruning methods from the
literature. The aim is to find interesting pairwise generalized association rules connecting the concepts of
multiple ontologies. Interested in the broad empirical evaluation of interestingness measures, we compared the
rules obtained by 39 methods on three real world datasets against predefined ground truth sets of associations.
To this end, we adopted a framework of instance-based ontology matching and extended the set of performance
measures by two novel measures: relation learning recall and precision which take into account hierarchical
relationships between rules.
ETPL DM-013 Evaluation of hierarchical interestingness measures for mining pairwise generalized association rules
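For concreteness, these are the classic non-hierarchical interestingness measures such evaluations start from, computed for a rule A ⇒ C over set-valued transactions; hierarchy-aware measures build on the same counts while additionally weighting the positions of A and C in the ontology.

```python
def rule_measures(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule `antecedent => consequent`."""
    n = len(transactions)
    a = sum(antecedent <= t for t in transactions)     # transactions with antecedent
    c = sum(consequent <= t for t in transactions)     # transactions with consequent
    both = sum((antecedent | consequent) <= t for t in transactions)
    support = both / n
    confidence = both / a if a else 0.0
    lift = confidence / (c / n) if c else 0.0
    return support, confidence, lift
```

A lift above 1 indicates the antecedent raises the probability of the consequent, which is the kind of signal the 39 compared methods rank in different ways.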
In some real world applications, like information retrieval and data classification, we often
confront the situation that the same semantic concept can be expressed using different views with similar
information. Thus, how to obtain a certain Semantically Consistent Patterns (SCP) for cross-view data, which
embeds the complementary information from different views, is of great importance for those applications.
However, the heterogeneity among cross-view representations brings a significant challenge on mining the
SCP. In this paper, we propose a general framework to discover the SCP for cross-view data. Specifically,
aiming at building a feature-isomorphic space among different views, a novel Isomorphic Relevant Redundant
Transformation (IRRT) is first proposed. The IRRT linearly maps multiple heterogeneous low-level feature
spaces to a high-dimensional redundant feature-isomorphic one, which we call the mid-level space. Thus,
much more complementary information from different views can be captured. Furthermore, to mine the
semantic consistency among the isomorphic representations in the mid-level space, we propose a new
Correlation-based Joint Feature Learning (CJFL) model to extract a unique high-level semantic subspace
shared across the feature-isomorphic data. Consequently, the SCP for cross-view data can be obtained.
Comprehensive experiments on three datasets demonstrate the advantages of our framework in classification
and retrieval.
ETPL DM-014 Mining Semantically Consistent Patterns for Cross-View Data
Given a real world graph, how should we lay out its edges? How can we compress it? These
questions are closely related, and the typical approach so far is to find clique-like communities, like the
‘cavemen graph’, and compress them. We show that the block-diagonal mental image of the ‘cavemen graph’
is the wrong paradigm, in full agreement with earlier results that real world graphs have no good cuts. Instead,
we propose to envision graphs as a collection of hubs connecting spokes, with super-hubs connecting the hubs,
and so on, recursively. Based on the idea, we propose the SLASHBURN method to recursively split a graph
into hubs and spokes connected only by the hubs. We also propose techniques to select the hubs and give an
ordering to the spokes, in addition to the basic SLASHBURN. We give theoretical analysis of the proposed
hub selection methods. Our viewpoint has several advantages: (a) it avoids the ‘no good cuts’ problem, (b) it
gives better compression, and (c) it leads to faster execution times for matrix-vector operations, which are the
backbone of most graph processing tools. Through experiments, we show that SLASHBURN consistently
outperforms other methods for all datasets, resulting in better compression and faster running time. Moreover,
we show that SLASHBURN with the appropriate spokes ordering can further improve compression while
hardly sacrificing the running time.
ETPL DM-015 SlashBurn: Graph Compression and Mining beyond Caveman Communities
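The hub-and-spoke decomposition can be sketched directly from the description: repeatedly slash the highest-degree node(s), then burn off every connected component except the giant one as spokes. A compact version with one hub slashed per iteration; the paper's spoke-ordering refinements are omitted.

```python
def slashburn(adj, k=1):
    """Return (hubs, spokes): hubs slashed per round, non-giant leftovers burned."""
    adj = {u: set(vs) for u, vs in adj.items()}
    hubs, spokes = [], []
    while adj:
        # slash: remove the k highest-degree nodes and their edges
        for h in sorted(adj, key=lambda u: len(adj[u]), reverse=True)[:k]:
            for v in adj.pop(h):
                adj[v].discard(h)
            hubs.append(h)
        # burn: find connected components of the remainder
        comps, seen = [], set()
        for s in adj:
            if s in seen:
                continue
            comp, stack = [], [s]
            seen.add(s)
            while stack:
                u = stack.pop()
                comp.append(u)
                stack += [v for v in adj[u] if v not in seen]
                seen.update(adj[u])
            comps.append(comp)
        comps.sort(key=len)
        for comp in comps[:-1]:         # everything but the giant component
            for u in comp:
                adj.pop(u)
                spokes.append(u)
    return hubs, spokes
```

Ordering hubs first and spokes last concentrates nonzeros of the adjacency matrix, which is what yields the compression and the faster matrix-vector products the abstract reports.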
Subgraph similarity search is used in graph databases to retrieve graphs whose subgraphs are
similar to a given query graph. It has been proven successful in a wide range of applications including
bioinformatics and cheminformatics. Due to the cost of providing efficient similarity search services on
ever-increasing graph data, database outsourcing is apparently an appealing solution to database owners.
Unfortunately, query service providers may be untrusted or compromised by attacks. To our knowledge, no
studies have been carried out on the authentication of the search. In this paper, we propose authentication
techniques that follow the popular filtering-and-verification framework. We propose an authentication-friendly
metric index called GMTree. Specifically, we transform the similarity search into a search in a graph metric
space and derive small verification objects (VOs) to be transmitted to query clients. To further optimize
GMTree, we propose a sampling-based pivot selection method and an authenticated version of MCS
computation. Our comprehensive experiments verified the effectiveness and efficiency of our proposed
techniques.
ETPL DM-016 Authenticated Subgraph Similarity Search in Outsourced Graph Databases
Keyword search is a useful tool for exploring large RDF datasets. Existing techniques
either rely on constructing a distance matrix for pruning the search space or building summaries from the RDF
graphs for query processing. In this work, we show that existing techniques have serious limitations in dealing
with realistic, large RDF data with tens of millions of triples. Furthermore, the existing summarization
techniques may lead to incorrect/incomplete results. To address these issues, we propose an effective
summarization algorithm to summarize the RDF data. Given a keyword query, the summaries lend significant
pruning powers to exploratory keyword search and result in much better efficiency compared to previous
works. Unlike existing techniques, our search algorithms always return correct results. Besides, the summaries
we built can be updated incrementally and efficiently. Experiments on both benchmark and large real RDF
data sets show that our techniques are scalable and efficient.
ETPL DM-017 Scalable Keyword Search on Large RDF Data
Mobile devices with geo-positioning capabilities (e.g., GPS) enable users to access
information that is relevant to their present location. Users are interested in querying about points of interest
(POI) in their physical proximity, such as restaurants, cafes, ongoing events, etc. Entities specialized in various
areas of interest (e.g., certain niche directions in arts, entertainment, travel) gather large amounts of geo-tagged
data that appeal to subscribed users. Such data may be sensitive due to their contents. Furthermore, keeping
such information up-to-date and relevant to the users is not an easy task, so the owners of such datasets will
make the data accessible only to paying customers. Users send their current location as the query parameter,
and wish to receive as result the nearest POIs, i.e., nearest-neighbors (NNs). But typical data owners do not
have the technical means to support processing queries on a large scale, so they outsource data storage and
querying to a cloud service provider. Many such cloud providers exist who offer powerful storage and
computational infrastructures at low cost. However, cloud providers are not fully trusted, and typically behave
in an honest-but-curious fashion. Specifically, they follow the protocol to answer queries correctly, but they
also collect the locations of the POIs and the subscribers for other purposes. Leakage of POI locations can lead
to privacy breaches as well as financial losses to the data owners, for whom the POI dataset is an important
source of revenue. Disclosure of user locations leads to privacy violations and may deter subscribers from using
the service altogether. In this paper, we propose a family of techniques that allow processing of NN queries in
an untrusted outsourced environment, while at the same time protecting both the POI and querying users’
positions. Our techniques rely on mutable order preserving encoding (mOPE), the only secure order-preserving
encryption method known to-date. We also provide performance optimizations to decrease the computational
cost inherent to processing on encrypted data, and we consider the case of incrementally updating datasets.
ETPL DM-018 Secure kNN Query Processing in Untrusted Cloud Environments
The multiple longest common subsequence (MLCS) problem, related to the identification of sequence
similarity, is an important problem in many fields. As an NP-hard problem, its exact algorithms have difficulty
in handling large-scale data and time- and space-efficient algorithms are required in real-world applications.
To deal with time constraints, anytime algorithms have been proposed to generate good solutions within a
reasonable time. However, there is little work on space-efficient MLCS algorithms. In this paper, we
formulate the MLCS problem into a graph search problem and present two space-efficient anytime MLCS
algorithms, SA-MLCS and SLA-MLCS. SA-MLCS uses an iterative beam widening search strategy to reduce
space usage during the iterative process of finding better solutions. Based on SA-MLCS, a space-
bounded algorithm, SLA-MLCS, is developed to keep space usage from exceeding available memory. SLA-MLCS uses a
replacing strategy when SA-MLCS reaches a given space bound. Experimental results show SA-MLCS and
SLA-MLCS use an order of magnitude less space and time than the state-of-the-art approximate algorithm
MLCS-APP while finding better solutions. Compared to the state-of-the-art anytime algorithm Pro-MLCS,
SA-MLCS and SLA-MLCS can solve an order of magnitude larger size instances. Furthermore, SLA-MLCS
can find much better solutions than SA-MLCS on large size instances.
ETPL
DM - 019
A Space-Bounded Anytime Algorithm for the Multiple Longest Common
Subsequence Problem
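The iterative beam-widening idea behind SA-MLCS can be sketched as follows. This is a minimal illustration, not the authors' implementation: the beam-scoring heuristic and the doubling widening schedule are assumptions of this sketch.

```python
def beam_mlcs(seqs, width):
    """One beam-search pass over position-vector states: a state records,
    for each sequence, the index of the last matched character; extending
    by a symbol advances every index to that symbol's next occurrence."""
    alphabet = set(seqs[0]).intersection(*map(set, seqs[1:]))
    beam = [((-1,) * len(seqs), "")]
    best = ""
    while beam:
        nxt = []
        for pos, sub in beam:
            for ch in alphabet:
                new = []
                for s, p in zip(seqs, pos):
                    i = s.find(ch, p + 1)
                    if i < 0:
                        break
                    new.append(i)
                else:
                    nxt.append((tuple(new), sub + ch))
        # keep the `width` most promising states: longer subsequences first,
        # then states whose match positions are furthest to the left
        nxt.sort(key=lambda st: (len(st[1]), [-p for p in st[0]]), reverse=True)
        beam = nxt[:width]
        for _, sub in beam:
            if len(sub) > len(best):
                best = sub
    return best

def anytime_mlcs(seqs, max_width=8):
    """Iterative beam widening: rerun with a doubled beam width, keeping
    the best subsequence found so far (the anytime behaviour)."""
    best, w = "", 1
    while w <= max_width:
        cand = beam_mlcs(seqs, w)
        if len(cand) > len(best):
            best = cand
        w *= 2
    return best
```

Each widening pass costs more space but can only improve the answer, which is what makes the scheme anytime.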
As machine learning techniques mature and are used to tackle complex scientific problems,
challenges arise such as the imbalanced class distribution problem, where one of the target class labels is under-
represented in comparison with other classes. Existing oversampling approaches for addressing this problem
typically do not consider the probability distribution of the minority class while synthetically generating new
samples. As a result, the minority class is not represented well which leads to high misclassification error. We
introduce two probabilistic oversampling approaches, namely RACOG and wRACOG, to synthetically
generate and strategically select new minority class samples. The proposed approaches use the joint
probability distribution of data attributes and Gibbs sampling to generate new minority class samples. While
RACOG selects samples produced by the Gibbs sampler based on a predefined lag, wRACOG selects those
samples that have the highest probability of being misclassified by the existing learning model. We validate our
approach using nine UCI datasets that were carefully modified to exhibit class imbalance and one new
application domain dataset with inherent extreme class imbalance. In addition, we compare the classification
performance of the proposed methods with three other existing resampling techniques.
ETPL
DM - 020
RACOG and wRACOG: Two Probabilistic Oversampling Techniques
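The Gibbs-sampling idea behind RACOG can be sketched for discrete attributes as below. This is a hedged illustration only: the smoothed empirical conditional and its weighting are assumptions of this sketch, standing in for the joint attribute distribution the paper actually learns.

```python
import random

def gibbs_oversample(minority, n_new, lag=5, seed=0):
    """RACOG-style sketch: run a Gibbs sampler over the attribute vector
    and keep every `lag`-th state as a synthetic minority sample."""
    rng = random.Random(seed)
    d = len(minority[0])
    values = [sorted({row[i] for row in minority}) for i in range(d)]
    state = list(rng.choice(minority))   # start the chain at a real sample
    out, step = [], 0
    while len(out) < n_new:
        i = step % d
        weights = []
        for v in values[i]:
            # count rows agreeing with the rest of the current state,
            # backing off to the marginal count (Laplace-smoothed)
            match = sum(1 for row in minority if row[i] == v and
                        all(row[j] == state[j] for j in range(d) if j != i))
            marg = sum(1 for row in minority if row[i] == v)
            weights.append(1 + 5 * match + marg)
        state[i] = rng.choices(values[i], weights=weights)[0]
        step += 1
        if step % lag == 0:              # RACOG's predefined-lag selection
            out.append(tuple(state))
    return out
```

wRACOG would replace the fixed-lag selection with a filter that keeps the states the current model is most likely to misclassify.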
Malware is pervasive in networks, and poses a critical threat to network security. However, we have
very limited understanding of malware behavior in networks to date. In this paper, we investigate how
malware propagates in networks from a global perspective. We formulate the problem and establish a rigorous
two-layer epidemic model for malware propagation from network to network. Based on the proposed model,
our analysis indicates that the distribution of a given malware follows an exponential distribution, a power law
distribution with a short exponential tail, and a power law distribution at its early, late, and final stages,
respectively. Extensive experiments have been performed through two real-world global scale malware data
sets, and the results confirm our theoretical findings.
ETPL
DM - 021
Malware Propagation in Large-Scale Networks
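The network-to-network setting can be illustrated with a toy two-layer susceptible-infected simulation. This is purely illustrative: the rates, topology, seeding, and landing rule below are assumptions of the sketch, not the paper's model.

```python
import random

def two_layer_si(networks, net_links, p_intra=0.5, p_inter=0.1,
                 steps=30, seed=1):
    """Toy two-layer SI spread: the lower layer is a list of networks
    (each a dict node -> neighbours); the upper layer is a dict of
    network -> reachable peer networks. Infection spreads along
    intra-network edges with p_intra and jumps between networks
    (landing on an arbitrary node) with p_inter."""
    rng = random.Random(seed)
    first = {n: sorted(networks[n])[0] for n in range(len(networks))}
    infected = {(0, first[0])}           # patient zero in network 0
    for _ in range(steps):
        new = set(infected)
        for net, node in infected:
            for nb in networks[net][node]:
                if rng.random() < p_intra:
                    new.add((net, nb))
            for peer in net_links.get(net, ()):
                if rng.random() < p_inter:
                    new.add((peer, first[peer]))
        infected = new
    return infected
```

Counting infections per network over many runs is one way to inspect the stage-dependent distributions the analysis predicts.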
This paper focuses on an important query in scientific simulation data analysis: the Spatial
Distance Histogram (SDH). The computation time of an SDH query using the brute-force method is quadratic.
Often, such queries are executed continuously over certain time periods, increasing the computation time. We
propose a highly efficient approximate algorithm to compute SDH over consecutive time periods with provable
error bounds. The key idea of our algorithm is to derive statistical distribution of distances from the spatial and
temporal characteristics of particles. Upon organizing the data into a Quad-tree based structure, the
spatiotemporal characteristics of particles in each node of the tree are acquired to determine the particles’
spatial distribution as well as their temporal locality in consecutive time periods. We report our efforts in
implementing and optimizing the above algorithm on Graphics Processing Units (GPUs) as a means to further
improve the efficiency. The accuracy and efficiency of the proposed algorithm are backed by mathematical
analysis and results of extensive experiments using data generated from real simulation studies.
ETPL
DM - 022
Computing Spatial Distance Histograms for Large Scientific Datasets On-the-
Fly
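The quadratic brute-force baseline that the approximate algorithm improves on can be written directly:

```python
from itertools import combinations
from math import dist

def spatial_distance_histogram(points, bucket_width, n_buckets):
    """Brute-force SDH: count every pairwise particle distance into
    fixed-width buckets; O(n^2) in the number of particles."""
    hist = [0] * n_buckets
    for p, q in combinations(points, 2):
        b = int(dist(p, q) // bucket_width)
        if b < n_buckets:
            hist[b] += 1
    return hist
```

The paper's approach avoids enumerating pairs by reasoning about whole Quad-tree nodes and their temporal locality instead.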
Although mostly used for pattern classification, linear discriminant analysis (LDA) can also be used in
feature selection as an effective measure to evaluate the separative ability of a feature subset. When applied to
feature selection on high-dimensional small-sized (HDSS) data, which generally exhibit class imbalance, LDA
encounters four problems: singularity of the scatter matrix, overfitting, overwhelming, and prohibitive
computational complexity. In this study, we propose the LDA-based feature selection method MCE-LDA
(minority class emphasized linear discriminant analysis) with a new regularization technique to address the
first three problems. Unlike conventional forms of regularization, which give equal or more emphasis to the
majority class, the proposed regularization places more emphasis on the minority class, with the expectation of
improving overall performance by alleviating both the overwhelming of the minority class by the majority class
and overfitting in the minority class. To reduce computational overhead, an incremental implementation of
LDA-based feature selection has been introduced. Comparative studies with other forms of regularization to
LDA as well as with other popular feature selection methods on five HDSS problems show that MCE-LDA
can produce feature subsets with excellent performance in both classification and robustness. Further
experimental results of true positive rate (TPR) and true negative rate (TNR) have also verified the
effectiveness of the proposed technique in alleviating overwhelming and overfitting problems.
ETPL
DM - 023
Emphasizing Minority Class in LDA for Feature Subset Selection on High-
Dimensional Small-Sized Problems
In a top-k Geometric Intersection Query (top-k GIQ) problem, a set of n weighted,
geometric objects in R^d is to be pre-processed into a compact data structure so that for any query geometric
object, q, and integer k > 0, the k largest-weight objects intersected by q can be reported efficiently. While the
top-k problem has been studied extensively for non-geometric problems (e.g., recommender systems), the
geometric version has received little attention. This paper gives a general technique to solve any top-k GIQ
problem efficiently. The technique relies only on the availability of an efficient solution for the underlying (non-
top-k) GIQ problem, which is often the case. Using this, asymptotically efficient solutions are derived for
several top-k GIQ problems, including top-k orthogonal and circular range search, point enclosure search,
halfspace range search, etc. Implementations of some of these solutions, using practical data structures, show
that they are quite efficient in practice. This paper also does a formal investigation of the hardness of the top-k
GIQ problem, which reveals interesting connections between the top-k GIQ problem and the underlying (non-
top-k) GIQ problem.
ETPL
DM - 024
A General Technique for Top-k Geometric Intersection Query Problems
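As a concrete instance, a top-k orthogonal range query has the following linear-scan baseline; the paper's contribution is answering such queries asymptotically faster via the underlying (non-top-k) GIQ structure.

```python
import heapq

def topk_range_search(objects, query, k):
    """Baseline top-k orthogonal range query: among axis-aligned
    rectangles (x1, y1, x2, y2, weight), report the k largest-weight
    ones that intersect the query rectangle."""
    qx1, qy1, qx2, qy2 = query

    def intersects(r):
        x1, y1, x2, y2, _ = r
        return x1 <= qx2 and qx1 <= x2 and y1 <= qy2 and qy1 <= y2

    # heapq.nlargest keeps only k candidates in memory during the scan
    return heapq.nlargest(k, (r for r in objects if intersects(r)),
                          key=lambda r: r[4])
```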
Ensemble learning has become a common tool for data stream classification, being able to handle
large volumes of stream data and concept drifting. Previous studies focus on building accurate ensemble
models from stream data. However, a linear scan of a large number of base classifiers in the ensemble during
prediction incurs significant costs in response time, preventing ensemble learning from being practical for
many real world time-critical data stream applications, such as Web traffic stream monitoring, spam detection,
and intrusion detection. In these applications, data streams usually arrive at a speed of GB/second, and it is
necessary to classify each stream record in a timely manner. To address this problem, we propose a novel
Ensemble-tree (E-tree for short) indexing structure to organize all base classifiers in an ensemble for
fast prediction. On one hand, E-trees treat ensembles as spatial databases and employ an R-tree-like
height-balanced structure to reduce the expected prediction time from linear to sub-linear complexity. On the
other hand, E-trees can automatically update themselves by continuously integrating new classifiers and
discarding outdated ones, adapting well to new trends and patterns underneath data streams. Theoretical
analysis and empirical studies on both synthetic and real-world data streams demonstrate the performance of
our approach.
ETPL
DM - 025
E-Tree: An Efficient Indexing Structure for Ensemble Models on Data Streams
Affinity Propagation (AP) clustering has been successfully used in many
clustering problems. However, most of the applications deal with static data. This paper considers how to
apply AP in incremental clustering problems. Firstly, we point out the difficulties in Incremental Affinity
Propagation (IAP) clustering, and then propose two strategies to solve them. Correspondingly, two IAP
clustering algorithms are proposed. They are IAP clustering based on K-Medoids (IAPKM) and IAP clustering
based on Nearest Neighbor Assignment (IAPNA). Five popular labeled data sets, real world time series and a
video are used to test the performance of IAPKM and IAPNA. Traditional AP clustering is also implemented
to provide benchmark performance. Experimental results show that IAPKM and IAPNA can achieve
comparable clustering performance with traditional AP clustering on all the data sets. Meanwhile, the time
cost is dramatically reduced in IAPKM and IAPNA. Both the effectiveness and the efficiency make IAPKM
and IAPNA well suited to incremental clustering tasks.
ETPL
DM - 026
Incremental Affinity Propagation Clustering Based on Message Passing
The key task in developing graph-based learning algorithms is constructing an informative graph to
express the contextual information of a data manifold. Since traditional graph construction methods are
sensitive to noise and less datum-adaptive to changes in density, a new method called ℓ1-graph was proposed
recently. A graph construction needs to have two important properties: sparsity and locality. The ℓ1-graph has
a strong sparsity property, but a weak locality property. Thus, we propose a new method of constructing an
informative graph using auto-grouped sparse regularization based on the ℓ1-graph, which we call the Group
Sparse graph (GS-graph). We also show how to efficiently construct a GS-graph in reproducing kernel Hilbert
space with the kernel trick. The new methods, the GS-graph and its kernelized version (KGS-graph), have the
same noise-insensitive property as the ℓ1-graph while preserving the properties of
sparsity and locality simultaneously. Furthermore, we integrate the proposed graph with several graph-based
learning algorithms to demonstrate the effectiveness of our method. The empirical studies on benchmarks
show that the proposed methods outperform the ℓ1-graph and other traditional graph construction methods in
various learning tasks.
ETPL
DM - 027 Graph-based Learning via Auto-Grouped Sparse Regularization and
Kernelized Extension
Location-based services (LBS) enable mobile users to query points-of-interest (e.g., restaurants,
cafes) on various features (e.g., price, quality, variety). In addition, users require accurate query results with
up-to-date travel times. Lacking the monitoring infrastructure for road traffic, the LBS may obtain live travel
times of routes from online route APIs in order to offer accurate results. Our goal is to reduce the number of
requests issued by the LBS significantly while preserving accurate query results. First, we propose to exploit
recent routes requested from route APIs to answer queries accurately. Then, we design effective lower/upper
bounding techniques and ordering techniques to process queries efficiently. Also, we study parallel route
requests to further reduce the query response time. Our experimental evaluation shows that our solution is 3
times more efficient than a competitor, and yet achieves high result accuracy (above 98%).
ETPL
DM - 028 Route-Saver: Leveraging Route APIs for Accurate and Efficient Query
Processing at Location-Based Services
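The lower/upper-bounding idea can be sketched as follows. The bound functions here are hypothetical stand-ins (e.g., derived from recently cached routes and straight-line distances); this is not the paper's exact pruning rule.

```python
def prune_by_bounds(pois, k, lower, upper):
    """A POI whose lower-bound travel time exceeds the k-th smallest
    upper bound can never be a top-k answer, so no (costly) route-API
    request needs to be issued for it. `lower(p)`/`upper(p)` are
    assumed bounds on the live travel time to POI p."""
    ups = sorted(upper(p) for p in pois)
    cutoff = ups[k - 1] if len(ups) >= k else float("inf")
    return [p for p in pois if lower(p) <= cutoff]
```

Only the survivors are ordered and refreshed with live routes, which is where the saving in API requests comes from.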
Recent large-scale hierarchical classification tasks typically have tens of thousands of classes on
which the most widely used approach to multiclass classification--one-versus-rest--becomes intractable due to
computational complexity. The top-down methods are usually adopted instead, but they are less accurate
because of the so-called error-propagation problem in their classifying phase. To address this problem, this
paper proposes a meta-top-down method that employs metaclassification to enhance the normal top-down
classifying procedure. The proposed method is first analyzed theoretically on complexity and accuracy, and
then applied to five real-world large-scale data sets. The experimental results indicate that the classification
accuracy is largely improved, while the increased time costs are smaller than most of the existing approaches.
ETPL
DM - 029 A Meta-Top-Down Method for Large-Scale Hierarchical Classification
Creating an efficient and economic trip plan is the most annoying job for a backpack traveler.
Although travel agencies can provide some predefined itineraries, they are not tailored for each specific
customer. Previous efforts address the problem by providing an automatic itinerary planning service, which
organizes the points of interest (POIs) into a customized itinerary. Because the search space of all possible
itineraries is too costly to fully explore, most existing work simplifies the problem by assuming that the user's trip is limited
to some important POIs and will be completed within one day. To address this limitation, in this paper, we
design a more general itinerary planning service, which generates multiday itineraries for the users. In our
service, all POIs are considered and ranked based on the users' preference. The problem of searching the
optimal itinerary is a team orienteering problem (TOP), a well-known NP-complete problem. To reduce the
processing cost, a two-stage planning scheme is proposed. In its preprocessing stage, single-day itineraries are
precomputed via MapReduce jobs. In its online stage, an approximate search algorithm is used to combine
the single-day itineraries. In this way, we transform the TOP problem, which has no polynomial-time approximation, into
another NP-complete problem (the set-packing problem) that does have good approximation algorithms. Experiments on real
data sets show that our approach can generate high-quality itineraries efficiently.
ETPL
DM - 030 Automatic Itinerary Planning for Traveling Services
Time provides context for all our experiences, cognition, and coordinated collective action. Prior
research in linguistics, artificial intelligence, and temporal databases suggests the need to differentiate between
temporal facts with goal-related semantics (i.e., telic) from those that are intrinsically devoid of culmination (i.e.,
atelic). To differentiate between telic and atelic data semantics in conceptual database design, we propose an
annotation-based temporal conceptual model that generalizes the semantics of a conventional conceptual
model. Our temporal conceptual design approach involves: 1) capturing "what" semantics using a
conventional conceptual model; 2) employing annotations to differentiate between telic and atelic data
semantics that help capture "when" semantics; 3) specifying temporal constraints, specifically nonsequenced
semantics, in the temporal data dictionary as metadata. Our proposed approach provides a mechanism to
represent telic/atelic temporal semantics using temporal annotations. We also show how these semantics can
be formally defined using constructs of the conventional conceptual model and axioms in first-order logic. Via
what we refer to as the "semantics of composition," i.e., semantics implied by the interaction of annotations,
we illustrate the logical consequences of representing telic/atelic data semantics during temporal conceptual
design.
ETPL
DM - 031 Capturing Telic/Atelic Temporal Data Semantics: Generalizing Conventional
Conceptual Models
The extended space forest is a new method for decision tree construction in which training is done with input
vectors including all the original features and their random combinations. The combinations are generated with
a difference operator applied to random pairs of original features. The experimental results show that extended
space versions of ensemble algorithms have better performance than the original ensemble algorithms. To
investigate the success dynamics of the extended space forest, the individual accuracy and diversity creation
powers of ensemble algorithms are compared. The extended space forest creates more diversity than Bagging
and Rotation Forest when it uses all the input features, and achieves higher individual accuracy than the
Random Subspace and Random Forest methods when it uses random selection of the features. It needs more
training time than the original algorithms because it uses more features, but its testing time is lower
because it generates less complex base learners.
ETPL
DM - 032
Classifier Ensembles with the Extended Space Forest
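The extended-space construction itself is a small transformation of the training vectors; a sketch, with the random pairing as described and everything else (seeding, tuple layout) an implementation choice of this illustration:

```python
import random

def extend_space(X, n_extra, seed=0):
    """Append n_extra new features, each the difference of a random pair
    of original features; any ensemble (Bagging, Rotation Forest, ...)
    is then trained on the widened vectors."""
    rng = random.Random(seed)
    d = len(X[0])
    pairs = [(rng.randrange(d), rng.randrange(d)) for _ in range(n_extra)]
    return [tuple(row) + tuple(row[i] - row[j] for i, j in pairs)
            for row in X]
```

The same `pairs` must of course be reused at prediction time so that test vectors are widened consistently.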
The visualization of information contained in reports is an important aspect of
human-computer interaction, since both the accuracy and the complexity of relationships among data must be
preserved. Greater attention has been paid to individual report visualization through different types of
standard graphs (histograms, pies, etc.). However, this kind of representation provides separate information
items and gives no support for visualizing their relationships, which are extremely important for most decision
processes. This paper presents a design methodology exploiting the visual language CoDe, based on a logic
paradigm. CoDe organizes the visualization through the CoDe model, which graphically represents
relationships between information items and can be considered a conceptual map of the view. The proposed
design methodology is composed of four phases: the CoDe Modeling and OLAP Operation pattern definition
phases define the CoDe model and underlying metadata information; the OLAP Operation phase physically
extracts data from a data warehouse; and the Report Visualization phase generates the final visualization.
Moreover, a case study on real data is provided.
ETPL
DM - 033 CoDe Modeling of Graph Composition for Data Warehouse Report Visualization
There are numerous applications where we wish to discover unexpected activities in a sequence of
time-stamped observation data; for instance, we may want to detect inexplicable events in transactions at a
website or in video of an airport tarmac. In this paper, we start with a known set A of activities (both
innocuous and dangerous) that we wish to monitor. However, in addition, we wish to identify “unexplained”
subsequences in an observation sequence that are poorly explained (e.g., because they may contain
occurrences of activities that have never been seen or anticipated before, i.e., they are not in A). We formally
define the probability that a sequence of observations is unexplained (totally or partially) w.r.t. A. We develop
efficient algorithms to identify the top-k totally and partially unexplained sequences w.r.t. A. These
algorithms leverage theorems that enable us to speed up the search for totally/partially unexplained sequences.
We describe experiments using real-world video and cyber-security data sets showing that our approach works
well in practice in terms of both running time and accuracy.
ETPL
DM - 034 Discovering the Top-k Unexplained Sequences in Time-Stamped Observation Data
This paper studies the problem of finding objects with durable quality over time in historical time series
databases. For example, a sociologist may be interested in the top 10 web search terms during the period of
some historical events; the police may seek vehicles that move close to a suspect 70 percent of the time
during a certain time period and so on. Durable top-k (DTop-k) and nearest neighbor (DkNN) queries can be
viewed as natural extensions of the standard snapshot top-k and NN queries to timestamped sequences of
values or locations. Although their snapshot counterparts have been studied extensively, to our knowledge,
there is little prior work that addresses this new class of durable queries. Existing methods for DTop-k
processing either apply trivial solutions, or rely on domain-specific properties. Motivated by this, we propose
efficient and scalable algorithms for the DTop-k and DkNN queries, based on novel indexing and query
evaluation techniques. Our experiments show that the proposed algorithms outperform previous and baseline
solutions by a wide margin.
ETPL
DM - 035
Durable Queries over Historical Time Series
As uncertainty is inherent in a wide spectrum of applications such as radio frequency identification
(RFID) networks and location-based services (LBS), there is a strong demand for addressing the uncertainty of
objects. In this paper, we propose a novel indexing structure, named U-Quadtree, to organize uncertain
objects in the multidimensional space such that the queries can be processed efficiently by taking advantage of
U-Quadtree. Particularly, we focus on the range search on multidimensional uncertain objects since it is a
fundamental query in a spatial database. We propose a cost model which carefully considers various factors
that may impact the performance. Then, an effective and efficient index construction algorithm is proposed to
build the optimal U-Quadtree regarding the cost model. We show that U-Quadtree can also efficiently support
other types of queries such as uncertain range query and nearest neighbor query. Comprehensive experiments
demonstrate that our techniques outperform the existing works on multidimensional uncertain objects.
ETPL
DM - 036 Effectively Indexing the Multidimensional Uncertain Objects
The vast majority of existing approaches to opinion feature extraction rely on mining
patterns only from a single review corpus, ignoring the nontrivial disparities in word distributional
characteristics of opinion features across different corpora. In this paper, we propose a novel method to
identify opinion features from online reviews by exploiting the difference in opinion feature statistics across
two corpora, one domain-specific corpus (i.e., the given review corpus) and one domain-independent corpus
(i.e., the contrasting corpus). We capture this disparity via a measure called domain relevance (DR), which
characterizes the relevance of a term to a text collection. We first extract a list of candidate opinion features
from the domain review corpus by defining a set of syntactic dependence rules. For each extracted candidate
feature, we then estimate its intrinsic-domain relevance (IDR) and extrinsic-domain relevance (EDR) scores
on the domain-dependent and domain-independent corpora, respectively. Candidate features that are less
generic (EDR score less than a threshold) and more domain-specific (IDR score greater than another
threshold) are then confirmed as opinion features. We call this interval thresholding approach the intrinsic and
extrinsic domain relevance (IEDR) criterion. Experimental results on two real-world review domains show the
proposed IEDR approach to outperform several other well-established methods in identifying opinion features.
ETPL
DM - 037 Identifying Features in Opinion Mining via Intrinsic and Extrinsic Domain
Relevance
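The interval-thresholding step of the IEDR criterion can be sketched as follows. Note the assumption: normalized corpus frequency stands in for the paper's domain-relevance measure.

```python
def iedr_select(candidates, domain_freq, general_freq, idr_min, edr_max):
    """Keep a candidate feature only if it is domain-specific
    (IDR >= idr_min on the domain corpus) and not too generic
    (EDR <= edr_max on the domain-independent corpus)."""
    def relevance(term, freq):
        total = sum(freq.values()) or 1
        return freq.get(term, 0) / total

    return [t for t in candidates
            if relevance(t, domain_freq) >= idr_min
            and relevance(t, general_freq) <= edr_max]
```

Generic words score high in both corpora and are filtered by the EDR cap, while rare noise fails the IDR floor.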
Social networks model the social activities between individuals, which change as time goes by. Given
the useful information in such dynamic networks, there is a continuous demand for privacy-preserving
data sharing with analyzers, collaborators, or customers. In this paper, we address the privacy risks of identity
disclosures in sequential releases of a dynamic network. To prevent privacy breaches, we propose novel
kw-structural diversity anonymity, where k is the desired privacy level and w is the time period over which an adversary
can monitor a victim to collect attack knowledge. We also present a heuristic algorithm for generating
releases satisfying kw-structural diversity anonymity so that the adversary cannot utilize his knowledge to
reidentify the victim and take advantage. Evaluations on both real and synthetic data sets show that the
proposed algorithm retains much of the characteristics of the networks while ensuring privacy
protection.
ETPL
DM - 038 Identity Protection in Sequential Releases of Dynamic Networks
If knowledge such as classification rules is extracted from sample data in a distributed way, it may be
necessary to combine or fuse these rules. In a conventional approach this would typically be done either by
combining the classifiers' outputs (e.g., in form of a classifier ensemble) or by combining the sets of
classification rules (e.g., by weighting them individually). In this paper, we introduce a new way of fusing
classifiers at the level of parameters of classification rules. This technique is based on the use of probabilistic
generative classifiers using multinomial distributions for categorical input dimensions and multivariate normal
distributions for the continuous ones. That means, we have distributions such as Dirichlet or normal-Wishart
distributions over parameters of the classifier. We refer to these distributions as hyperdistributions or second-
order distributions. We show that fusing two (or more) classifiers can be done by multiplying the
hyperdistributions of the parameters and derive simple formulas for that task. Properties of this new approach
are demonstrated with a few experiments. The main advantage of this fusion approach is that the
hyperdistributions are retained throughout the fusion process. Thus, the fused components may, for example,
be used in subsequent training steps (online training).
ETPL
DM - 039
Knowledge Fusion for Probabilistic Generative Classifiers with Data Mining
Applications
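For the multinomial part of such a classifier, multiplying hyperdistributions has a closed form, since the product of two Dirichlet densities is again Dirichlet (a standard conjugate identity; shown here only for this one parameter block):

```python
def fuse_dirichlet(alpha, beta):
    """Parameter-level fusion: Dir(alpha) * Dir(beta) is proportional to
    Dir(alpha + beta - 1), because the densities multiply as
    prod x_i^(a_i - 1) * prod x_i^(b_i - 1) = prod x_i^(a_i + b_i - 2)."""
    return [a + b - 1 for a, b in zip(alpha, beta)]
```

Because the result is again a Dirichlet, the fused classifier keeps a full hyperdistribution and can continue online training afterwards, which is the advantage the abstract highlights.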
Advanced microarray technologies have made it possible to simultaneously monitor the expression levels of all
genes. An important problem in microarray data analysis is to discover phenotype structures. The goal is to 1)
find groups of samples corresponding to different phenotypes (such as disease or normal), and 2) for each group
of samples, find the representative expression pattern or signature that distinguishes this group from others.
Some methods have been proposed for this problem; however, a common drawback is that the identified signatures
often include a large number of genes but with low discriminative power. In this paper, we propose a g*-
sequence model to address this limitation, where the ordered expression values among genes are profitably
utilized. Compared with the existing methods, the proposed sequence model is more robust to noise and allows
the discovery of signatures with more discriminative power using fewer genes. This is important for the
subsequent analysis by the biologists. We prove that the problem of phenotype structure discovery is NP-
complete. An efficient algorithm, FINDER, is developed, which includes three steps: 1) trivial g*-sequences
identifying, 2) phenotype structure discovery, and 3) refinement. Effective pruning strategies are developed to
further improve the efficiency.
ETPL
DM - 040 Learning Phenotype Structure Using Sequence Model
One-to-many data linkage is an essential task in many domains, yet only a handful of prior publications
have addressed this issue. Furthermore, while traditionally data linkage is performed among entities of the
same type, it is increasingly necessary to develop linkage techniques that link between matching entities of
different types as well. In this paper, we propose a new one-to-many data linkage method that links between
entities of different natures. The proposed method is based on a one-class clustering tree (OCCT) that
characterizes the entities that should be linked together. The tree is built such that it is easy to understand and
transform into association rules, i.e., the inner nodes consist only of features describing the first set of
entities, while the leaves of the tree represent features of their matching entities from the second data set. We
propose four splitting criteria and two different pruning methods which can be used for inducing the OCCT.
The method was evaluated using data sets from three different domains. The results affirm the effectiveness
of the proposed method and show that the OCCT yields better performance in terms of precision and recall
(in most cases it is statistically significant) when compared to a C4.5 decision tree-based linkage method.
ETPL
DM - 041
OCCT: A One-Class Clustering Tree for Implementing One-to-Many Data
Linkage
Feature selection is an important technique for data mining. Despite its importance,
most studies of feature selection are restricted to batch learning. Unlike traditional batch learning methods,
online learning represents a promising family of efficient and scalable machine learning algorithms for large-
scale applications. Most existing studies of online learning require accessing all the attributes/features of
training instances. Such a classical setting is not always appropriate for real-world applications when data
instances are of high dimensionality or it is expensive to acquire the full set of attributes/features. To address
this limitation, we investigate the problem of online feature selection (OFS) in which an online learner is
only allowed to maintain a classifier involving only a small and fixed number of features. The key challenge
of online feature selection is how to make accurate predictions for an instance using a small number of active
features. This is in contrast to the classical setup of online learning where all the features can be used for
prediction. We attempt to tackle this challenge by studying sparsity regularization and truncation techniques.
Specifically, this article addresses two different tasks of online feature selection: 1) learning with full input,
where the learner is allowed to access all the features to decide the subset of active features, and 2) learning
with partial input, where only a limited number of features is allowed to be accessed for each instance by the
learner. We present novel algorithms to solve each of the two problems and give their performance analysis.
The encouraging results of our experiments validate the efficacy and efficiency of the proposed techniques.
ETPL
DM - 042 Online Feature Selection and Its Applications
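The feature-budget idea in the abstract above, keeping only a small fixed number B of active features after each online update, can be sketched as a perceptron with hard truncation. This is an illustrative simplification, not the authors' OFS algorithm; the function names and the learning rate are our own.

```python
import numpy as np

def truncate(w, B):
    """Keep only the B largest-magnitude weights; zero out the rest."""
    if np.count_nonzero(w) <= B:
        return w
    idx = np.argsort(np.abs(w))[:-B]   # indices of all but the top-B weights
    w = w.copy()
    w[idx] = 0.0
    return w

def ofs_perceptron(X, y, B, eta=0.2):
    """Online learning with at most B active features (full-input setting)."""
    w = np.zeros(X.shape[1])
    for x_t, y_t in zip(X, y):
        if y_t * np.dot(w, x_t) <= 0:  # mistake: perceptron update
            w = w + eta * y_t * x_t
            w = truncate(w, B)         # enforce the feature budget
    return w
```

After every update the weight vector has at most B nonzero entries, so prediction needs only those active features.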
This paper designs an efficient image hashing with a ring partition and a nonnegative matrix
factorization (NMF), which has both the rotation robustness and good discriminative capability. The key
contribution is a novel construction of rotation-invariant secondary image, which is used for the first time in
image hashing and helps make the image hash resistant to rotation. In addition, NMF coefficients are
approximately linearly changed by content-preserving manipulations, which allows hash similarity to be
measured with the correlation coefficient. We conduct experiments on 346 images to illustrate efficiency. Our
experiments show that the proposed hashing is robust against content-preserving operations, such as image
rotation, JPEG compression, watermark embedding, Gaussian low-pass filtering, gamma correction,
brightness adjustment, contrast adjustment, and image scaling. Receiver operating characteristics (ROC) curve
comparisons are also conducted with the state-of-the-art algorithms, and demonstrate that the proposed
hashing is much better than all these algorithms in classification performances with respect to robustness and
discrimination.
ETPL
DM - 043 Robust Perceptual Image Hashing Based on Ring Partition and NMF
Similarity query is a fundamental problem in database, data mining and information retrieval research.
Recently, querying incomplete data has attracted extensive attention as it poses new challenges to traditional
querying techniques. The existing work on querying incomplete data addresses the problem where the data
values on certain dimensions are unknown. However, in many real-life applications, such as data collected by
a sensor network in a noisy environment, not only the data values but also the dimension information may be
missing. In this work, we propose to investigate the problem of similarity search on dimension incomplete
data. A probabilistic framework is developed to model this problem so that the users can find objects in the
database that are similar to the query with probability guarantee. Missing dimension information poses great
computational challenge, since all possible combinations of missing dimensions need to be examined when
evaluating the similarity between the query and the data objects. We develop the lower and upper bounds of
the probability that a data object is similar to the query. These bounds enable efficient filtering of irrelevant
data objects without explicitly examining all missing dimension combinations. A probability triangle
inequality is also employed to further prune the search space and speed up the query process. The proposed
probabilistic framework and techniques can be applied to both whole and subsequence queries. Extensive
experimental results on real-life data sets demonstrate the effectiveness and efficiency of our approach.
ETPL
DM - 044
Searching Dimension Incomplete Databases
High-dimensional data arise naturally in many domains, and have regularly presented a great challenge
for traditional data mining techniques, both in terms of effectiveness and efficiency. Clustering becomes
difficult due to the increasing sparsity of such data, as well as the increasing difficulty in distinguishing
distances between data points. In this paper, we take a novel perspective on the problem of clustering high-
dimensional data. Instead of attempting to avoid the curse of dimensionality by observing a lower dimensional
feature subspace, we embrace dimensionality by taking advantage of inherently high-dimensional phenomena.
More specifically, we show that hubness, i.e., the tendency of high-dimensional data to contain points (hubs)
that frequently occur in k-nearest-neighbor lists of other points, can be successfully exploited in clustering. We
validate our hypothesis by demonstrating that hubness is a good measure of point centrality within a high-
dimensional data cluster, and by proposing several hubness-based clustering algorithms, showing that major
hubs can be used effectively as cluster prototypes or as guides during the search for centroid-based cluster
configurations. Experimental results demonstrate good performance of our algorithms in multiple settings,
particularly in the presence of large quantities of noise. The proposed methods are tailored mostly for
detecting approximately hyperspherical clusters and need to be extended to properly handle clusters of
arbitrary shapes.
ETPL
DM - 045
The Role of Hubness in Clustering High-Dimensional Data
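The hubness statistic the abstract relies on, N_k(x), the number of times a point appears in the k-nearest-neighbor lists of other points, can be computed directly. Below is a brute-force sketch of our own, not the paper's implementation:

```python
import numpy as np

def hubness_scores(X, k=5):
    """N_k(x): how often each point occurs in the k-nearest-neighbor
    lists of the other points (brute-force pairwise distances)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)        # a point is not its own neighbor
    counts = np.zeros(n, dtype=int)
    for i in range(n):
        knn = np.argsort(D[i])[:k]     # indices of the k closest points
        counts[knn] += 1
    return counts
```

Points with the largest scores are the hubs that the abstract proposes to use as cluster prototypes.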
Traditionally, as soon as confidentiality becomes a concern, data are encrypted before outsourcing to a
service provider. Any software-based cryptographic constructs then deployed, for server-side query processing
on the encrypted data, inherently limit query expressiveness. Here, we introduce TrustedDB, an outsourced
database prototype that allows clients to execute SQL queries with privacy and under regulatory compliance
constraints by leveraging server-hosted, tamper-proof trusted hardware in critical query processing stages,
thereby removing any limitations on the type of supported queries. Despite the cost overhead and performance
limitations of trusted hardware, we show that the costs per query are orders of magnitude lower than any
(existing or) potential future software-only mechanisms. TrustedDB is built and runs on actual hardware, and
its performance and costs are evaluated here.
ETPL
DM - 046
TrustedDB: A Trusted Hardware-Based Database with Privacy and Data
Confidentiality
Collaborative filtering (CF) is an important and popular technology for recommendation
systems. However, current collaborative filtering methods suffer from some problems such as sparsity
problem, inaccurate recommendation and producing big-error predictions. In this paper, we borrow ideas of
object typicality from cognitive psychology and propose a novel typicality-based collaborative filtering
recommendation method named TyCo. A distinct feature of typicality-based CF is that it finds `neighbors' of
users based on user typicality degrees in user groups (instead of the co-rated items of users or common users
of items in traditional CF). To the best of our knowledge, there is no prior work investigating collaborative
filtering recommendation based on object typicality.
ETPL
DM - 047
TyCo: Towards Typicality-based Collaborative Filtering Recommendation
In this paper, we address the problem of the high annotation cost of acquiring training data for
semantic segmentation. Most modern approaches to semantic segmentation are based upon graphical models,
such as the conditional random fields, and rely on sufficient training data in form of object contours. To reduce
the manual effort on pixel-wise annotating contours, we consider the setting in which the training data set for
semantic segmentation is a mixture of a few object contours and an abundant set of bounding boxes of objects.
Our idea is to borrow the knowledge derived from the object contours to infer the unknown object contours
enclosed by the bounding boxes. The inferred contours can then serve as training data for semantic
segmentation. To this end, we generate multiple contour hypotheses for each bounding box with the
assumption that at least one hypothesis is close to the ground truth. This paper proposes an approach, called
augmented multiple instance regression (AMIR), that formulates the task of hypothesis selection as the
problem of multiple instance regression (MIR), and augments information derived from the object contours to
guide and regularize the training process of MIR. In this way, a bounding box is treated as a bag with its
contour hypotheses as instances, and the positive instances refer to the hypotheses close to the ground truth.
The proposed approach has been evaluated on the Pascal VOC segmentation task. The promising results
demonstrate that AMIR can precisely infer the object contours in the bounding boxes, and hence provide
effective alternatives to manually labeled contours for semantic segmentation.
ETPL
DM - 048
A Two-Level Topic Model Towards Knowledge Discovery from Citation
Networks
Access control mechanisms protect sensitive information from unauthorized users. However, when
sensitive information is shared and a Privacy Protection Mechanism (PPM) is not in place, an authorized user
can still compromise the privacy of a person, leading to identity disclosure. A PPM can use suppression and
generalization of relational data to anonymize and satisfy privacy requirements, e.g., k-anonymity and l-
diversity, against identity and attribute disclosure. However, privacy is achieved at the cost of precision of
authorized information. In this paper, we propose an accuracy-constrained privacy-preserving access control
framework. The access control policies define selection predicates available to roles while the privacy
requirement is to satisfy the k-anonymity or l-diversity. An additional constraint that needs to be satisfied by
the PPM is the imprecision bound for each selection predicate. The techniques for workload-aware
anonymization for selection predicates have been discussed in the literature. However, to the best of our
knowledge, the problem of satisfying the accuracy constraints for multiple roles has not been studied before.
In our formulation of the aforementioned problem, we propose heuristics for anonymization algorithms and
show empirically that the proposed approach satisfies imprecision bounds for more permissions and has lower
total imprecision than the current state of the art.
ETPL
DM- 049
Accuracy-Constrained Privacy-Preserving Access Control Mechanism for
Relational Data
Traditional active learning methods require the labeler to provide a class label for each queried instance. The
labelers are normally highly skilled domain experts to ensure the correctness of the provided labels, which in
turn results in expensive labeling cost. To reduce labeling cost, an alternative solution is to allow nonexpert
labelers to carry out the labeling task without explicitly telling the class label of each queried instance. In this
paper, we propose a new active learning paradigm, in which a nonexpert labeler is only asked “whether a pair
of instances belong to the same class”, namely, a pairwise label homogeneity. Under such circumstances, our
active learning goal is twofold: (1) decide which pair of instances should be selected for query, and (2) make
use of the pairwise homogeneity information to improve the active learner. To achieve this goal, we
propose a “Pairwise Query on Max-flow Paths” strategy to query pairwise label homogeneity from a
nonexpert labeler, whose query results are further used to dynamically update a Min-cut model (to
differentiate instances in different classes). In addition, a “Confidence-based Data Selection” measure is used
to evaluate data utility based on the Min-cut model's prediction results. The selected instances, with inferred
class labels, are included into the labeled set to form a closed-loop active learning process. Experimental
results and comparisons with state-of-the-art methods demonstrate that our new active learning paradigm can
result in good performance with nonexpert labelers.
ETPL
DM - 050
Active Learning without Knowing Individual Instance Labels: A Pairwise Label
Homogeneity Query Approach
In this paper we present a framework for automatic exploitation of news in stock trading
strategies. Events are extracted from news messages presented in free text without annotations. We test the
introduced framework by deriving trading strategies based on technical indicators and impacts of the extracted
events. The strategies take the form of rules that combine technical trading indicators with a news variable,
and are revealed through the use of genetic programming. We find that the news variable is often included in
the optimal trading rules, indicating the added value of news for predictive purposes and validating our
proposed framework for automatically incorporating news in stock trading strategies.
ETPL
DM - 051
An Automated Framework for Incorporating News into Stock Trading
Strategies
We identify relation completion (RC) as one recurring problem that is central to the success of
novel big data applications such as Entity Reconstruction and Data Enrichment. Given a semantic relation ℜ,
RC attempts to link entity pairs between two entity lists under the relation ℜ. To accomplish the RC goals,
we propose to formulate search queries for each query entity α based on some auxiliary information, so as to
detect its target entity β from the set of retrieved documents. For instance, a pattern-based method (PaRE) uses
extracted patterns as the auxiliary information in formulating search queries. However, high-quality patterns
may decrease the probability of finding suitable target entities. As an alternative, we propose CoRE method
that uses context terms learned surrounding the expression of a relation as the auxiliary information in
formulating queries. The experimental results based on several real-world web data collections demonstrate
that CoRE reaches a much higher accuracy than PaRE for the purpose of RC.
ETPL
DM - 052
CoRE: A Context-Aware Relation Extraction Method for Relation Completion
Authority flow techniques like PageRank and ObjectRank can provide personalized ranking of typed entity-
relationship graphs. There are two main ways to personalize authority flow ranking: Node-based
personalization, where authority originates from a set of user-specific nodes; edge-based personalization,
where the importance of different edge types is user-specific. We propose the first approach to achieve
efficient edge-based personalization using a combination of precomputation and runtime algorithms. In
particular, we apply our method to ObjectRank, where a personalized weight assignment vector (WAV)
assigns different weights to each edge type or relationship type. Our approach includes a repository of rankings
for various WAVs. We consider the following two classes of approximation: (a) SchemaApprox is formulated
as a distance minimization problem at the schema level; (b) DataApprox is a distance minimization problem at
the data graph level. SchemaApprox is not robust since it does not distinguish between important and trivial
edge types based on the edge distribution in the data graph.
ETPL
DM - 053
Efficient Ranking on Entity Graphs with Personalized Relationships
Although several distance or similarity functions for trees have been introduced, their performance is
not always satisfactory in many applications, ranging from document clustering to natural language
processing. This research proposes a new similarity function for trees, namely Extended Subtree (EST), where
a new subtree mapping is proposed. EST generalizes edit-based distances by providing new rules for subtree
mapping. Further, the new approach seeks to resolve the problems and limitations of previous approaches.
Extensive evaluation frameworks are developed to evaluate the performance of the new approach against
previous proposals. Clustering and classification case studies utilizing three real-world and one synthetic
labeled data sets are performed to provide an unbiased evaluation where different distance functions are
investigated. The experimental results demonstrate the superior performance of the proposed distance
function. In addition, an empirical runtime analysis demonstrates that the new approach is one of the best tree
distance functions in terms of runtime efficiency.
ETPL
DM - 054 Extended Subtree: A New Similarity Function for Tree Structured Data
Conventional spatial queries, such as range search and nearest neighbor retrieval,
involve only conditions on objects' geometric properties. Today, many modern applications call for novel
forms of queries that aim to find objects satisfying both a spatial predicate, and a predicate on their associated
texts. For example, instead of considering all the restaurants, a nearest neighbor query would instead ask for
the restaurant that is the closest among those whose menus contain “steak, spaghetti, brandy” all at the same
time. Currently, the best solution to such queries is based on the IR2-tree, which, as shown in this paper, has a
few deficiencies that seriously impact its efficiency. Motivated by this, we develop a new access method
called the spatial inverted index that extends the conventional inverted index to cope with multidimensional
data, and comes with algorithms that can answer nearest neighbor queries with keywords in real time. As
verified by experiments, the proposed techniques significantly outperform the IR2-tree in query response time,
often by orders of magnitude.
ETPL
DM - 055
Fast Nearest Neighbor Search with Keywords
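The query the abstract describes, the nearest neighbor among objects whose texts contain all query keywords, can be illustrated with a plain inverted index and a brute-force distance scan over the candidates. This sketch omits the paper's spatial indexing entirely; all names are ours.

```python
import math

def build_inverted_index(objects):
    """objects: id -> (point, set_of_words). Returns word -> set of ids."""
    index = {}
    for oid, (_, words) in objects.items():
        for w in words:
            index.setdefault(w, set()).add(oid)
    return index

def nn_with_keywords(objects, index, q, keywords):
    """Nearest object containing ALL query keywords: intersect the
    posting lists, then scan the surviving candidates for distance."""
    cands = None
    for w in keywords:
        posting = index.get(w, set())
        cands = posting if cands is None else cands & posting
    best, best_d = None, math.inf
    for oid in cands or ():
        p, _ = objects[oid]
        d = math.dist(q, p)
        if d < best_d:
            best, best_d = oid, d
    return best
```

The paper's spatial inverted index makes the candidate scan unnecessary by merging spatial and textual pruning; here the keywords only shrink the brute-force set.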
Activity recognition is a key task for the development of advanced and effective
ubiquitous applications in fields like ambient assisted living. A major problem in designing effective
recognition algorithms is the difficulty of incorporating long-range dependencies between distant time instants
without incurring substantial increase in computational complexity of inference. In this paper we present a
novel approach for introducing long-range interactions based on sequential pattern mining. The algorithm
searches for patterns characterizing time segments during which the same activity is performed. A
probabilistic model is learned to represent the distribution of pattern matches along sequences, trying to
maximize the coverage of an activity segment by a pattern match. The model is integrated in a segmental
labeling algorithm and applied to novel sequences, tagged according to matches of the extracted patterns. The
rationale of the approach is that restricting dependencies to span the same activity segment (i.e., sharing the
same label), allows keeping inference tractable. An experimental evaluation shows that enriching sensor-based
representations with the mined patterns allows improving results over sequential and segmental labeling
algorithms in most of the cases. An analysis of the discovered patterns highlights non-trivial interactions
spanning over a significant time horizon.
ETPL
DM - 056 Improving Activity Recognition by Segmental Pattern Mining
Frequent weighted itemsets represent correlations frequently holding in data in which items may
weight differently. However, in some contexts, e.g., when the need is to minimize a certain cost function,
discovering rare data correlations is more interesting than mining frequent ones. This paper tackles the issue of
discovering rare and weighted itemsets, i.e., the infrequent weighted itemset (IWI) mining problem. Two
novel quality measures are proposed to drive the IWI mining process. Furthermore, two algorithms that
perform IWI and Minimal IWI mining efficiently, driven by the proposed measures, are presented.
Experimental results show the efficiency and effectiveness of the proposed approach.
ETPL
DM - 057
Infrequent Weighted Itemset Mining Using Frequent Pattern Growth
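A toy rendering of the task above: score itemsets by a weighted support and keep those that fall below a maximum threshold. The measure used here, the sum over supporting transactions of the minimum item weight, is our own stand-in and not necessarily the paper's IWI-support measure; the exhaustive enumeration likewise replaces the paper's FP-growth-based miner.

```python
from itertools import combinations

def iwi_support(itemset, transactions):
    """Weighted support (our stand-in measure): for each transaction
    containing the itemset, add the minimum weight among its items."""
    s = 0.0
    for t in transactions:             # t: dict item -> weight
        if all(i in t for i in itemset):
            s += min(t[i] for i in itemset)
    return s

def mine_iwi(transactions, max_threshold, max_size=3):
    """Enumerate itemsets whose weighted support is BELOW the threshold,
    i.e., the infrequent weighted itemsets."""
    items = sorted({i for t in transactions for i in t})
    result = []
    for k in range(1, max_size + 1):
        for iset in combinations(items, k):
            if iwi_support(iset, transactions) < max_threshold:
                result.append(iset)
    return result
```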
Local thresholding algorithms were first presented more than a decade ago and have since been
applied to a variety of data mining tasks in peer-to-peer systems, wireless sensor networks, and in grid
systems. One critical assumption made by those algorithms has always been cycle-free routing. The existence
of even one cycle may lead all peers to the wrong outcome. Outside the lab, unfortunately, cycle freedom is
not easy to achieve. This work is the first to lift the requirement of cycle freedom by presenting a local
thresholding algorithm suitable for general network graphs. The algorithm relies on a new repositioning of the
problem in weighted vector arithmetics, on a new stopping rule, whose proof does not require that the network
be cycle free, and on new methods for balance correction when the stopping rule fails. The new stopping and
update rules permit calculation of the very same functions that were calculable using previous algorithms,
which do assume cycle freedom. The algorithm is implemented on a standard peer-to-peer simulator and is
validated for networks of up to 80,000 peers, organized in three different topologies representative of major
current distributed systems: the Internet, structured peer-to-peer systems, and wireless sensor networks.
ETPL
DM- 058 Local Thresholding in General Network Graphs
The main aim of this paper is to develop a community discovery scheme in a multi-dimensional
network for data mining applications. In online social media, networked data consists of multiple
dimensions/entities such as users, tags, photos, comments, and stories. We are interested in finding a group of
users who interact significantly on these media entities. In a co-citation network, we are interested in finding a
group of authors who relate to other authors significantly on publication information in titles, abstracts, and
keywords as multiple dimensions/entities in the network. The main contribution of this paper is to propose a
framework (MultiComm) to identify a seed-based community in a multi-dimensional network by evaluating
the affinity between two items in the same type of entity (same dimension) or different types of entities
(different dimensions) from the network. Our idea is to calculate the probabilities of visiting each item in each
dimension, and compare their values to generate communities from a set of seed items. In order to evaluate a
high quality of generated communities by the proposed algorithm, we develop and study a local modularity
measure of a community in a multi-dimensional network. Experiments based on synthetic and real-world data
sets suggest that the proposed framework is able to find a community effectively. Experimental results have
also shown that the performance of the proposed algorithm is better in accuracy than the other testing
algorithms in finding communities in multi-dimensional networks.
ETPL
DM - 059
MultiComm: Finding Community Structure in Multi-Dimensional Networks
We formulate and investigate the novel problem of finding the skyline k-tuple groups from an n-tuple data
set, i.e., groups of k tuples that are not dominated by any other group of equal size, based on an
aggregate-based group dominance relationship. The major technical challenge is to identify effective anti-
monotonic properties for pruning the search space of skyline groups. To this end, we first show that the anti-
monotonic property in the well-known Apriori algorithm does not hold for skyline group pruning. Then, we
identify two anti-monotonic properties with varying degrees of applicability: order-specific property which
applies to SUM, MIN, and MAX as well as weak candidate-generation property which applies to MIN and
MAX only. Experimental results on both real and synthetic data sets verify that the proposed algorithms
achieve orders of magnitude performance gain over the baseline method.
ETPL
DM - 060
On Skyline Groups
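The group dominance relation the abstract builds on can be made concrete with a brute-force enumeration under the SUM aggregate (assuming larger values are better). The paper's contribution is precisely the anti-monotonic pruning that avoids this exponential scan; the sketch below only illustrates the definitions.

```python
from itertools import combinations

def group_vector(group):
    """Aggregate a group of tuples dimension-wise with SUM."""
    return tuple(sum(t[d] for t in group) for d in range(len(group[0])))

def dominates(u, v):
    """u dominates v if u >= v on every dimension and u > v on at
    least one (larger values are better)."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def skyline_groups(tuples, k):
    """Brute-force skyline over all k-tuple groups; exponential in n,
    for illustration only."""
    groups = list(combinations(tuples, k))
    vecs = [group_vector(g) for g in groups]
    return [g for g, v in zip(groups, vecs)
            if not any(dominates(w, v) for w in vecs)]
```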
The probabilistic threshold query is one of the most common queries in uncertain databases,
where a result satisfying the query must also have a probability meeting the threshold requirement. In this
paper, we investigate probabilistic threshold keyword queries (PrTKQ) over XML data, a problem not studied
before. We first introduce the notion of quasi-SLCA and use it to represent results for a PrTKQ under
possible-worlds semantics. Then we design a probabilistic inverted (PI) index that can be used
to quickly return the qualified answers and filter out the unqualified ones based on our proposed lower/upper
bounds. After that, we propose two efficient and comparable algorithms: a baseline algorithm and a PI-index-
based algorithm. To accelerate both algorithms, we also utilize a probability density function. An
empirical study using real and synthetic data sets has verified the effectiveness and the efficiency of our
approaches.
ETPL
DM - 061
Quasi-SLCA Based Keyword QueryProcessing over Probabilistic XML Data
We propose a protocol for secure mining of association rules in horizontally distributed databases. The
current leading protocol is that of Kantarcioglu and Clifton. Our protocol, like theirs, is based on the Fast
Distributed Mining (FDM) algorithm of Cheung et al., which is an unsecured distributed version of the Apriori
algorithm. The main ingredients in our protocol are two novel secure multi-party algorithms: one that
computes the union of private subsets held by the interacting players, and another that tests the
inclusion of an element held by one player in a subset held by another. Our protocol offers enhanced privacy
with respect to that of Kantarcioglu and Clifton. In addition, it is simpler and significantly more efficient in
terms of communication rounds, communication cost, and computational cost.
ETPL
DM - 062
Secure Mining of Association Rules in Horizontally Distributed Databases
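The unsecured FDM skeleton the protocol builds on can be sketched as follows: each site reports its locally frequent itemsets, their union forms the candidate set, and the candidates' global support is then checked. The paper's contribution replaces the plain union and the support exchange with secure multi-party computations, which this sketch deliberately omits; all names are ours.

```python
from itertools import combinations

def locally_frequent(transactions, min_frac, max_size=2):
    """Itemsets meeting the support threshold at a single site."""
    items = sorted({i for t in transactions for i in t})
    out = set()
    for k in range(1, max_size + 1):
        for iset in combinations(items, k):
            sup = sum(1 for t in transactions if set(iset) <= t)
            if sup >= min_frac * len(transactions):
                out.add(iset)
    return out

def fdm_globally_frequent(sites, min_frac, max_size=2):
    """Unsecured FDM skeleton: candidates = union of locally frequent
    itemsets; an itemset is globally frequent if its total support
    over all sites meets the threshold."""
    candidates = set()
    for txns in sites:
        candidates |= locally_frequent(txns, min_frac, max_size)
    total = sum(len(txns) for txns in sites)
    result = set()
    for iset in candidates:
        sup = sum(sum(1 for t in txns if set(iset) <= t) for txns in sites)
        if sup >= min_frac * total:
            result.add(iset)
    return result
```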
Pattern classification systems are commonly used in adversarial applications, like biometric
authentication, network intrusion detection, and spam filtering, in which data can be purposely manipulated by
humans to undermine their operation. As this adversarial scenario is not taken into account by classical design
methods, pattern classification systems may exhibit vulnerabilities, whose exploitation may severely affect
their performance, and consequently limit their practical utility. Extending pattern classification theory and
design methods to adversarial settings is thus a novel and very relevant research direction, which has not yet
been pursued in a systematic way. In this paper, we address one of the main open issues: evaluating at design
phase the security of pattern classifiers, namely, the performance degradation under potential attacks they may
incur during operation. We propose a framework for empirical evaluation of classifier security that formalizes
and generalizes the main ideas proposed in the literature, and give examples of its use in three real
applications. Reported results show that security evaluation can provide a more complete understanding of the
classifier's behavior in adversarial environments, and lead to better design choices.
ETPL
DM - 063
Security Evaluation of Pattern Classifiers under Attack
This paper takes the shortest path discovery to study efficient relational approaches to graph search
queries. We first abstract three enhanced relational operators, based on which we introduce an FEM
framework to bridge the gap between relational operations and graph operations. We show new features
introduced by recent SQL standards, such as window function and merge statement, can improve the
performance of the FEM framework. Second, we propose an edge-weight-aware graph partitioning scheme and
design a bi-directional restrictive BFS (breadth-first search) over partitioned tables, which improves
scalability and performance without extra indexing overheads. Finally, extensive experimental results
illustrate that our relational approach with optimization strategies can achieve high scalability and performance.
ETPL
DM - 064
Shortest Path Computing in Relational DBMSs
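The bi-directional BFS at the heart of the approach can be illustrated outside the DBMS: treat the edge table as a list of (u, v) rows and expand BFS levels from both ends, always growing the smaller frontier. This is a plain in-memory sketch of the search strategy, not the paper's SQL-based FEM implementation.

```python
from collections import deque

def bidirectional_bfs(edges, src, dst):
    """Shortest path length over a relational-style edge table
    (list of directed (u, v) rows), expanding BFS from both ends."""
    adj, radj = {}, {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        radj.setdefault(v, []).append(u)
    df, db = {src: 0}, {dst: 0}        # distances from src / to dst
    qf, qb = deque([src]), deque([dst])
    while qf and qb:
        meet = df.keys() & db.keys()
        if meet:                       # frontiers intersect: done
            return min(df[v] + db[v] for v in meet)
        # expand the smaller frontier by one full level
        if len(qf) <= len(qb):
            q, dist, g = qf, df, adj
        else:
            q, dist, g = qb, db, radj
        for _ in range(len(q)):
            u = q.popleft()
            for v in g.get(u, []):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
    meet = df.keys() & db.keys()
    return min((df[v] + db[v] for v in meet), default=None)
```

Checking the intersection only after a full level completes keeps the result exact: the first time the settled sets overlap, the minimum of df(v) + db(v) over the overlap is the true shortest distance.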
The online shortest path problem aims at computing the shortest path based on live
traffic circumstances. This is very important in modern car navigation systems as it helps drivers to make
sensible decisions. To the best of our knowledge, there is no efficient system or solution that can offer
affordable costs at both the client and server sides for online shortest path computation. Unfortunately, the
conventional client-server architecture scales poorly with the number of clients. A promising approach is to let
the server collect live traffic information and then broadcast it over radio or a wireless network. This approach
has excellent scalability with the number of clients. Thus, we develop a new framework called live traffic
index (LTI), which enables drivers to quickly and effectively collect the live traffic information on the
broadcasting channel. An
impressive result is that the driver can compute/update their shortest path result by receiving only a small
fraction of the index. Our experimental study shows that LTI is robust to various parameters and it offers
relatively short tune-in cost (at the client side), fast query response time (at the client side), small broadcast
size (at the server side), and light maintenance time (at the server side) for the online shortest path problem.
ETPL
DM - 065
Towards Online Shortest Path Computation
The Object Summary (OS) is a recently proposed tree structure, which summarizes all data held in a
relational database about a data subject. An OS can potentially be very large in size and therefore unfriendly
for users who wish to view synoptic information about the data subject. In this paper, we investigate the
effective and efficient retrieval of concise and informative OS snippets (denoted as size-l OSs). We propose
and investigate the effectiveness of two types of size-l OSs, namely size-l OS(t)s and size-l OS(a)s, which
consist of l tuple nodes and l attribute nodes, respectively. For computing size-l OSs, we propose an optimal
dynamic programming algorithm, two greedy algorithms and preprocessing heuristics. By collecting feedback
from real users (e.g., from DBLP authors), we assess the relative usability of the two different types of
snippets, the choice of the size- l parameter, as well as the effectiveness of the snippets with respect to the user
expectations. In addition, via thorough evaluation on real databases, we test the speed and effectiveness of our
techniques.
ETPL
DM- 066
Versatile Size-l Object Summaries for Relational Keyword Search
Ontology reuse offers great benefits by measuring and comparing ontologies. However, state-of-the-art
approaches to measuring ontologies neglect the problems of both the polymorphism of ontology
representation and the addition of implicit semantic knowledge. One way to tackle these problems is to devise
a mechanism for ontology measurement that is stable, the basic criterion for automatic measurement. In this
paper, we present a graph derivation representation (GDR) based approach for stable semantic measurement,
which captures the structural semantics of ontologies and addresses the problems that cause unstable
measurement of ontologies. This paper makes three original contributions. First, we introduce and define the
concepts of semantic measurement and stable measurement, and present the GDR-based approach, a
three-phase process that transforms an ontology into its GDR. Second, we formally analyze important
properties of GDRs on the basis of which stable semantic measurement and comparison can be achieved.
Third, we compare our GDR-based approach with existing graph-based methods using a dozen real-world
exemplar ontologies. Our experimental comparison is based on nine ontology measurement entities and a
distance metric that stably compares the similarity of two ontologies in terms of their GDRs.
ETPL
DM - 067
A Graph Derivation Based Approach for Measuring and Comparing
Structural Semantics of Ontologies
Building Bayesian belief networks in the absence of data involves the challenging task of
eliciting conditional probabilities from experts to parameterize the model. In this paper, we develop an
analytical method for determining the optimal order for eliciting these probabilities. Our method uses prior
distributions on network parameters and a novel expected proximity criterion to propose an order that
maximizes information gain per unit elicitation time. We present analytical results when priors are uniform
Dirichlet; for other priors, we find through experiments that the optimal order is strongly affected by which
variables are of primary interest to the analyst. Our results should prove useful to researchers and practitioners
involved in belief network model building and elicitation.
ETPL
DM - 068
A Myopic Approach to Ordering Nodes for Parameter Elicitation in Bayesian
Belief Networks
Many problems in natural language processing, data mining, information retrieval, and
bioinformatics can be formalized as string transformation, a task defined as follows: given an input string, the
system generates the k most likely output strings corresponding to it. This paper proposes a novel
probabilistic approach to string transformation that is both accurate and efficient. The approach includes a
log-linear model, a method for training the model, and an algorithm for generating the top k candidates, with
or without a predefined dictionary. The log-linear model is defined as a conditional probability distribution of
an output string and a rule set for the transformation, conditioned on an input string. The learning method
employs maximum likelihood estimation for parameter estimation. The string generation algorithm, based on
pruning, is guaranteed to generate the optimal top k candidates. The proposed method is applied to the
correction of spelling errors in queries as well as the reformulation of queries in web search. Experimental
results on large-scale data show that the proposed approach improves upon existing methods in terms of both
accuracy and efficiency in different settings.
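To make rule-based candidate generation concrete, here is a hedged toy sketch in Python: candidates are produced by applying weighted transformation rules and ranked by score. The rules and weights below are invented for illustration; the paper's model learns rule weights by maximum likelihood and prunes the top-k search, which this sketch does not attempt:

```python
import heapq

# Toy rule set: (pattern, replacement, log-weight). Values are hypothetical.
rules = [("ie", "ei", -0.1), ("teh", "the", -0.05), ("a", "e", -1.2)]

def top_k(word, k=3):
    """Rank strings reachable by applying at most one rule at one position,
    by log-weight (a toy stand-in for a learned log-linear score).
    The identity transformation is kept with score 0.0."""
    cands = {word: 0.0}
    for pat, rep, w in rules:
        start = 0
        while (i := word.find(pat, start)) != -1:
            out = word[:i] + rep + word[i + len(pat):]
            if w > cands.get(out, float("-inf")):
                cands[out] = w
            start = i + 1
    return heapq.nlargest(k, cands.items(), key=lambda kv: kv[1])

print(top_k("recieve"))  # includes the correction ("receive", -0.1)
```

A real system would also condition scores on context or a dictionary, so that corrections can outrank the identity candidate.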
ETPL
DM - 069
A Probabilistic Approach to String Transformation
Domain transfer learning, which learns a target classifier using labeled data from a different
distribution, has shown promising value in knowledge discovery yet remains a challenging problem. Most
previous works designed adaptive classifiers by exploring two learning strategies independently: distribution
adaptation and label propagation. In this paper, we propose a novel transfer learning framework, referred to as
Adaptation Regularization based Transfer Learning (ARTL), to model them in a unified way based on the
structural risk minimization principle and the regularization theory. Specifically, ARTL learns the adaptive
classifier by simultaneously optimizing the structural risk functional, the joint distribution matching between
domains, and the manifold consistency underlying marginal distribution. Based on the framework, we propose
two novel methods using Regularized Least Squares (RLS) and Support Vector Machines (SVMs),
respectively, and use the Representer theorem in reproducing kernel Hilbert space to derive corresponding
solutions. Comprehensive experiments verify that ARTL can significantly outperform state-of-the-art learning
methods on several public text and image datasets.
ETPL
DM - 070
Adaptation Regularization: A General Framework for Transfer Learning
Clustering algorithms and cluster validity are two highly correlated parts of cluster analysis. In
this paper, a novel idea for cluster validity and a clustering algorithm based on the validity index are
introduced. A centroid ratio is first introduced to compare two clustering results. This centroid ratio is then
used in prototype-based clustering by introducing a Pairwise Random Swap clustering algorithm that avoids
the local optimum problem of k-means. The swap strategy in the algorithm alternates between simple
perturbation of the solution and convergence toward the nearest optimum by k-means. The centroid ratio is
shown to be highly correlated with the mean square error (MSE) and other external indices; moreover, it is
fast and simple to calculate. An empirical study on several different datasets indicates that the proposed
algorithm works more efficiently than Random Swap, Deterministic Random Swap, Repeated k-means, or
k-means++. The algorithm is successfully applied to document clustering and color image quantization as well.
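The swap strategy described above can be sketched in a few lines of Python. This is a simplified illustration only: it accepts a candidate when plain MSE improves, whereas the paper's algorithm uses the centroid ratio, and all data below are hypothetical:

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, cents):
    return [min(range(len(cents)), key=lambda j: dist2(p, cents[j])) for p in points]

def kmeans_step(points, cents):
    """One Lloyd iteration: reassign points, then recompute centroids."""
    labels = assign(points, cents)
    new = []
    for j in range(len(cents)):
        members = [p for p, l in zip(points, labels) if l == j]
        new.append(tuple(sum(c) / len(members) for c in zip(*members)) if members else cents[j])
    return new

def mse(points, cents):
    return sum(min(dist2(p, c) for c in cents) for p in points) / len(points)

def random_swap(points, k, iters=50, seed=0):
    """Alternate between perturbation (swap one centroid to a random point)
    and convergence (a few k-means steps); keep candidates that lower MSE."""
    rng = random.Random(seed)
    cents = rng.sample(points, k)
    best = mse(points, cents)
    for _ in range(iters):
        cand = list(cents)
        cand[rng.randrange(k)] = rng.choice(points)  # the swap perturbation
        for _ in range(2):                            # partial k-means convergence
            cand = kmeans_step(points, cand)
        cost = mse(points, cand)
        if cost < best:
            cents, best = cand, cost
    return cents, best

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
cents, best = random_swap(pts, 2)
print(sorted(cents), round(best, 3))
```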
ETPL
DM - 071
Centroid Ratio for a Pairwise Random Swap Clustering Algorithm
Result diversification has recently attracted considerable attention as a means of increasing
user satisfaction in recommender systems, as well as in web and database search. In this paper, we focus on
the problem of selecting the k most diverse items from a result set. Whereas previous research has mainly
considered the static version of the problem, in this paper we study the dynamic case in which the result set
changes over time, as, for example, in notification services. We define the CONTINUOUS k-DIVERSITY
PROBLEM along with appropriate constraints that enforce continuity requirements on the
diversified results. Our proposed approach is based on cover trees and supports dynamic item insertion and
deletion. The diversification problem is in general NP-hard; we provide theoretical bounds that characterize
the quality of our cover tree solution with respect to the optimal one. Since results are often associated with a
relevance score, we extend our approach to account for relevance. Finally, we report experimental results
concerning the efficiency and effectiveness of our approach on a variety of real and synthetic datasets.
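For intuition about what "k most diverse" means, here is the classic greedy max-min heuristic for the static version of the problem, on hypothetical scalar items. This is not the cover-tree index the paper builds, which additionally supports dynamic insertion, deletion, and continuity constraints:

```python
def greedy_diverse(items, k, dist):
    """Greedy max-min diversification: repeatedly add the item whose minimum
    distance to the already-chosen set is largest."""
    chosen = [items[0]]  # seed with the first item for simplicity
    while len(chosen) < k:
        best = max((it for it in items if it not in chosen),
                   key=lambda it: min(dist(it, c) for c in chosen))
        chosen.append(best)
    return chosen

nums = [1, 2, 3, 10, 11, 20]
print(greedy_diverse(nums, 3, lambda a, b: abs(a - b)))  # [1, 20, 10]
```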
ETPL
DM - 072
Diverse Set Selection Over Dynamic Data
Recently, probabilistic graphs have attracted significant interest from the data mining community. It
has been observed that correlations may exist among adjacent edges in various probabilistic graphs. As one of
the basic mining techniques, graph clustering is widely used in exploratory data analysis, such as data
compression, information retrieval, and image segmentation. Graph clustering aims to divide data into clusters
according to their similarities, and a number of algorithms have been proposed for clustering graphs, such as
the pKwikCluster algorithm, spectral clustering, and k-path clustering. However, little research has been
devoted to efficient clustering algorithms for probabilistic graphs, and the problem becomes even more
challenging when correlations are considered. In this paper, we define the problem of clustering correlated
probabilistic graphs. To solve this challenging problem, we propose two algorithms, namely the PEEDR and
CPGS clustering algorithms. For each of the proposed algorithms, we develop several pruning techniques to
further improve their efficiency. We evaluate the effectiveness and efficiency of our algorithms and pruning
methods through comprehensive experiments.
ETPL
DM - 073
Effective and Efficient Clustering Methods for Correlated Probabilistic Graphs
This paper describes a three-level framework for semi-supervised feature selection. Most feature
selection methods focus mainly on finding relevant features for high-dimensional data. In this paper, we show
that relevance requires two further procedures to provide efficient feature selection in the semi-supervised
context. The first concerns the selection of pairwise constraints that can be extracted from the labeled part of
the data. The second aims to reduce the redundancy that could be detected among the selected relevant
features. For relevance, we develop a filter approach based on a constrained Laplacian score. Finally,
experimental results are provided to show the efficiency of our proposal in comparison with several
representative methods.
ETPL
DM - 074
Efficient Semi-Supervised Feature Selection: Constraint, Relevance, and
Redundancy
A well-studied query type on moving objects is the continuous range query. An interesting
and practical situation is that instead of being continuously evaluated, the query may be evaluated at different
degrees of continuity, e.g., every 2 seconds (close to continuous), every 10 minutes or at irregular time
intervals (close to snapshot). Furthermore, the range query may be stacked under predicates applied to the
returned objects. An example is the count predicate that requires the number of objects in the range to be at
least γ. The conjecture is that these two practical considerations can help reduce communication
costs. We propose a safe region-based solution that exploits these two practical considerations. An extensive
experimental study shows that our solution can reduce communication costs by a factor of 9.5 compared to an
existing state-of-the-art system.
ETPL
DM - 075
Evaluation of Range Queries With Predicates on Moving Objects
Millions of users share their opinions on Twitter, making it a valuable platform for tracking
and analyzing public sentiment. Such tracking and analysis can provide critical information for decision
making in various domains, and has therefore attracted attention in both academia and industry. Previous
research mainly focused on modeling and tracking public sentiment. In this work, we move one step further to
interpret sentiment variations. We observed that emerging topics (named foreground topics) within the
sentiment variation periods are highly related to the genuine reasons behind the variations. Based on this
observation, we propose a Latent Dirichlet Allocation (LDA) based model, Foreground and Background LDA
(FB-LDA), to distill foreground topics and filter out longstanding background topics. These foreground topics
can give potential interpretations of the sentiment variations. To further enhance the readability of the mined
reasons, we select the most representative tweets for foreground topics and develop another generative model
called Reason Candidate and Background LDA (RCB-LDA) to rank them with respect to their “popularity”
within the variation period. Experimental results show that our methods can effectively find foreground topics
and rank reason candidates. The proposed models can also be applied to other tasks such as finding topic
differences between two sets of documents.
ETPL
DM - 076
Interpreting the Public Sentiment Variations on Twitter
Data uncertainty is inherent in many real-world applications such as environmental surveillance
and mobile tracking. Mining sequential patterns from inaccurate data, such as those data arising from sensor
readings and GPS trajectories, is important for discovering hidden knowledge in such applications. In this
paper, we propose to measure pattern frequentness based on the possible world semantics. We establish two
uncertain sequence data models abstracted from many real-life applications involving uncertain sequence data,
and formulate the problem of mining probabilistically frequent sequential patterns (or p-FSPs) from data that
conform to our models. However, the number of possible worlds is extremely large, which makes the mining
prohibitively expensive. Inspired by the famous PrefixSpan algorithm, we develop two new algorithms,
collectively called U-PrefixSpan, for p-FSP mining. U-PrefixSpan effectively avoids the problem of “possible
worlds explosion”, and when combined with our four pruning and validating methods, achieves even better
performance. We also propose a fast validating method to further speed up our U-PrefixSpan algorithm. The
efficiency and effectiveness of U-PrefixSpan are verified through extensive experiments on both real and
synthetic datasets.
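Under the possible-world semantics above, deciding whether a pattern is probabilistically frequent reduces to a tail probability over worlds. A hedged sketch, assuming for illustration only that each sequence independently contains the pattern with a known probability (the paper's models and the U-PrefixSpan machinery handle far more general uncertainty):

```python
def prob_frequent(probs, minsup):
    """P(support >= minsup) when sequence i contains the pattern with
    probability probs[i], independently (Poisson-binomial tail via DP,
    avoiding explicit enumeration of the exponentially many worlds).
    dist[s] holds P(support == s) over the sequences processed so far."""
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for s, q in enumerate(dist):
            nxt[s] += q * (1 - p)   # worlds where this sequence lacks the pattern
            nxt[s + 1] += q * p     # worlds where it contains the pattern
        dist = nxt
    return sum(dist[minsup:])

print(prob_frequent([0.5, 0.5], 1))  # 0.75
```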
ETPL
DM - 077
Mining Probabilistically Frequent Sequential Patterns in Large Uncertain
Databases
In spatial domains, interaction between features gives rise to two types of interaction patterns: co-
location and segregation patterns. Existing approaches to finding co-location patterns have several
shortcomings: (1) they depend on user-specified thresholds for prevalence measures; (2) they do not take
spatial auto-correlation into account; and (3) they may report co-locations even if the features are randomly
distributed. Segregation patterns have yet to receive much attention. In this paper, we propose a method for
finding both types of interaction patterns, based on a statistical test. We introduce a new definition of co-
location and segregation pattern, we propose a model for the null distribution of features so spatial auto-
correlation is taken into account, and we design an algorithm for finding both co-location and segregation
patterns. We also develop two strategies to reduce the computational cost compared to a naïve approach based
on simulations of the data distribution, and we propose an approach to reduce the runtime of our algorithm
even further by using an approximation of the neighborhood of features. We evaluate our method empirically
using synthetic and real data sets and demonstrate its advantages over a state-of-the-art co-location mining
algorithm.
ETPL
DM - 078
Mining Statistically Significant Co-location and Segregation Patterns
In this paper we present a solution to one of the location-based query problems. This problem is
defined as follows: (i) a user wants to query a database of location data, known as Points Of Interest (POIs),
and does not want to reveal his/her location to the server due to privacy concerns; (ii) the owner of the location
data, that is, the location server, does not want to simply distribute its data to all users. The location server
desires to have some control over its data, since the data is its asset. We propose a major enhancement upon
previous solutions by introducing a two stage approach, where the first step is based on Oblivious Transfer and
the second step is based on Private Information Retrieval, to achieve a secure solution for both parties. The
solution we present is efficient and practical in many scenarios. We implement our solution on a desktop
machine and a mobile device to assess the efficiency of our protocol. We also introduce a security model and
analyse the security in the context of our protocol. Finally, we highlight a security weakness of our previous
work and present a solution to overcome it.
ETPL
DM - 079
Privacy-Preserving and Content-Protecting Location Based Queries
Numerous consumer reviews of products are now available on the Internet. Consumer
reviews contain rich and valuable knowledge for both firms and users. However, the reviews are often
disorganized, leading to difficulties in information navigation and knowledge acquisition. This article proposes
a product aspect ranking framework, which automatically identifies the important aspects of products from
online consumer reviews, aiming at improving the usability of the numerous reviews. The important product
aspects are identified based on two observations: 1) the important aspects are usually commented on by a large
number of consumers and 2) consumer opinions on the important aspects greatly influence their overall
opinions on the product. In particular, given the consumer reviews of a product, we first identify product
aspects by a shallow dependency parser and determine consumer opinions on these aspects via a sentiment
classifier. We then develop a probabilistic aspect ranking algorithm to infer the importance of aspects by
simultaneously considering aspect frequency and the influence of consumer opinions given to each aspect over
their overall opinions. The experimental results on a review corpus of 21 popular products in eight domains
demonstrate the effectiveness of the proposed approach. Moreover, we apply product aspect ranking to two
real-world applications, i.e., document-level sentiment classification and extractive review summarization, and
achieve significant performance improvements, which demonstrate the capacity of product aspect ranking in
facilitating real-world applications.
ETPL
DM - 080
Product Aspect Ranking and Its Applications
In this paper, we present a novel ensemble method, Random Projection Random Discretization
Ensembles (RPRDE), that creates ensembles of linear multivariate decision trees using a univariate decision
tree algorithm. The method combines the better computational complexity of a univariate decision tree
algorithm with the better representational power of linear multivariate decision trees. We develop a random
discretization (RD) method that creates randomly discretized features from continuous features. Random
projection (RP) is used to create new features that are linear combinations of the original features. A new
dataset is created by augmenting the discretized features (created using RD) with the features created using
RP. Each decision tree of an RPRDE ensemble is trained on one dataset from the pool of these datasets using
a univariate decision tree algorithm. As these multivariate decision trees (because of the features created by
RP) have more representational power than univariate decision trees, we expect accurate decision trees in the
ensemble, while diverse training datasets ensure diverse decision trees. We study the performance of RPRDE
against other popular ensemble techniques using C4.5 as the base classifier. RPRDE matches or outperforms
the other methods, and experimental results also suggest that the proposed method is quite robust to class
noise.
ETPL
DM- 081
Random Projection Random Discretization Ensembles—Ensembles of Linear
Multivariate Decision Trees
In the classic range aggregation problem, we have a set S of objects such that, given an interval
I, a query counts how many objects of S are covered by I. Besides COUNT, the problem can also be
defined with other aggregate functions, e.g., SUM, MIN, MAX, and AVERAGE. This paper studies a novel
variant of range aggregation where an object can belong to multiple sets. A query (at runtime) picks any two
sets and aggregates over their intersection. More formally, let S_1, ..., S_m be m sets of objects.
Given distinct set ids i, j and an interval I, a query reports how many objects in S_i ∩ S_j are covered
by I. We call this problem range aggregation with set selection (RASS). Its hardness lies in the fact that the
pair (i, j) can have m(m-1)/2 choices, rendering effective indexing a non-trivial task. The RASS problem can
also be defined with other aggregate functions, and generalized so that a query chooses more than two sets.
We develop a system, also called RASS, to power this type of query. Our system has excellent efficiency in
both theory and practice. Theoretically, it consumes linear space and achieves nearly optimal query time.
Practically, it outperforms existing solutions on real datasets by up to an order of magnitude. The paper also
features a rigorous theoretical analysis of the hardness of the RASS problem, which reveals invaluable insight
into its characteristics.
ETPL
DM - 082
Range Aggregation With Set Selection
The integration of social networking concepts into the Internet of things has led to the Social Internet
of Things (SIoT) paradigm, according to which objects are capable of establishing social relationships in an
autonomous way with respect to their owners with the benefits of improving the network scalability in
information/service discovery. Within this scenario, we focus on the problem of understanding how the
information provided by members of the social IoT has to be processed so as to build a reliable system on the
basis of the behavior of the objects. We define two models for trustworthiness management starting from the
solutions proposed for P2P and social networks. In the subjective model each node computes the
trustworthiness of its friends on the basis of its own experience and on the opinion of the friends in common
with the potential service providers. In the objective model, the information about each node is distributed and
stored making use of a distributed hash table structure so that any node can make use of the same information.
Simulations show how the proposed models can effectively isolate almost all malicious nodes in the network
at the expense of an increase in network traffic for feedback exchange.
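The subjective model above blends a node's own experience with the opinions of common friends. A minimal hedged sketch of that blending step, with a hypothetical weighting parameter alpha and trust values in [0, 1] (the paper's actual formula is more elaborate):

```python
def subjective_trust(direct, friend_opinions, alpha=0.7):
    """Trust in a provider = alpha * the node's own experience
    + (1 - alpha) * the mean opinion of common friends.
    Falls back to direct experience when no common friends exist."""
    if not friend_opinions:
        return direct
    return alpha * direct + (1 - alpha) * sum(friend_opinions) / len(friend_opinions)

print(subjective_trust(0.9, [0.5, 0.7]))  # own experience dominates, pulled down slightly
```

The objective model would instead store each node's feedback in a distributed hash table, so every node computes trust from the same global record rather than its own neighborhood.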
ETPL
DM - 083
Trustworthiness Management in the Social Internet of Things
We are witnessing increasing interests in the effective use of road networks. For example, to enable
effective vehicle routing, weighted-graph models of transportation networks are used, where the weight of an
edge captures some cost associated with traversing the edge, e.g., greenhouse gas (GHG) emissions or travel
time. It is a precondition to using a graph model for routing that all edges have weights. Weights that capture
travel times and GHG emissions can be extracted from GPS trajectory data collected from the network.
However, GPS trajectory data typically lack the coverage needed to assign weights to all edges. This paper
formulates and addresses the problem of annotating all edges in a road network with travel cost based weights
from a set of trips in the network that cover only a small fraction of the edges, each with an associated ground-
truth travel cost. A general framework is proposed to solve the problem. Specifically, the problem is modeled
as a regression problem and solved by minimizing a judiciously designed objective function that takes into
account the topology of the road network. In particular, the use of weighted PageRank values of edges is
explored for assigning appropriate weights to all edges, and the property of directional adjacency of edges is
also taken into account to assign weights. Empirical studies with weights capturing travel time and GHG
emissions on two road networks (Skagen, Denmark, and North Jutland, Denmark) offer insight into the design
properties of the proposed techniques and offer evidence that the techniques are effective.
ETPL
DM - 084
Using Incomplete Information for Complete Weight Annotation of Road
Networks
Multicore systems and multithreaded processing are now the de facto standards of enterprise and
personal computing. If used in an uninformed way, however, multithreaded processing might actually degrade
performance. We present the facets of the memory access bottleneck as they manifest in multithreaded
processing and show their impact on query evaluation. We present a system design based on partition
parallelism, memory pooling, and data structures conducive to multithreaded processing. Based on this design,
we present alternative implementations of the most common query processing algorithms, which we
experimentally evaluate using multiple scenarios and hardware platforms. Our results show that the design and
algorithms are indeed scalable across platforms, but the choice of optimal algorithm largely depends on the
problem parameters and underlying hardware. However, our proposals are a good first step toward generic
multithreaded parallelism.
ETPL
DM - 085
A Comparative Study of Implementation Techniques for Query Processing in
Multicore Systems
The selection of relevant and significant features is an important problem, particularly for data sets
with a large number of features. In this regard, a new feature selection algorithm is presented based on a rough
hypercuboid approach. It selects a set of features from a data set by maximizing the relevance, dependency,
and significance of the selected features. By introducing the concept of the hypercuboid equivalence partition
matrix, a novel representation of degree of dependency of sample categories on features is proposed to
measure the relevance, dependency, and significance of features in approximation spaces. The equivalence
partition matrix also offers an efficient way to calculate many more quantitative measures to describe the
inexactness of approximate classification. Several quantitative indices are introduced based on the rough
hypercuboid approach for evaluating the performance of the proposed method. The superiority of the proposed
method over other feature selection methods, in terms of computational complexity and classification
accuracy, is established extensively on various real-life data sets of different sizes and dimensions.
ETPL
DM - 086
A Rough Hypercuboid Approach for Feature Selection in Approximation
Spaces
Schemas are often used to constrain the content and structure of XML documents. They can be
quite big and complex and, thus, difficult to access manually. The ability to query a single schema or a
collection of schemas, or to retrieve schema components that meet certain structural constraints, significantly
eases schema management and is thus useful in many contexts. In this paper, we propose a query language,
named XSPath, specifically tailored to XML schemas. It works on logical graph-based representations of
schemas, on which it enables navigation and allows the selection of nodes. We also propose
XPath/XQuery-based translations that can be exploited for the evaluation of XSPath queries. An extensive
evaluation of the usability and efficiency of the proposed approach is finally presented within the EXup
system.
ETPL
DM - 087
XSPath: Navigation on XML Schemas Made Easy
Our proposed framework consists of two parts. First, we put forward uncertain one-class learning to
cope with uncertain data. We propose a local kernel-density-based method to generate a bound score
for each instance, which refines the location of the corresponding instance, and then construct
an uncertain one-class classifier (UOCC) by incorporating the generated bound score into a one-class
SVM-based learning phase. Second, we propose a support-vector-based clustering technique to summarize
the user's concept from the history chunks: each chunk is represented by the support vectors of the uncertain
one-class classifier developed on it, and the k-means clustering method is extended to group the history
chunks into clusters from which the concept can be summarized.
ETPL
DM - 088
Uncertain One-Class Learning and Concept Summarization Learning on
Uncertain Data Streams
Personalized web search (PWS) has demonstrated its effectiveness in improving the quality of various
search services on the Internet. However, evidence shows that users' reluctance to disclose their private
information during search has become a major barrier for the wide proliferation of PWS. We
study privacy protection in PWS applications that model user preferences as hierarchical user profiles. We
propose a PWS framework called UPS that can adaptively generalize profiles by queries while respecting
user-specified privacy requirements. Our runtime generalization aims at striking a balance between two
predictive metrics that evaluate the utility of personalization and the privacy risk of exposing the generalized
profile. We present two greedy algorithms, namely GreedyDP and GreedyIL, for runtime generalization. We
also provide an online prediction mechanism for deciding whether personalizing a query is beneficial.
Extensive experiments demonstrate the effectiveness of our framework. The experimental results also reveal
that GreedyIL significantly outperforms GreedyDP in terms of efficiency.
ETPL
DM - 089
Supporting Privacy Protection in Personalized Web Search
In data warehousing and OLAP applications, scalar-level predicates in SQL become increasingly
inadequate to support a class of operations that require set-level comparison semantics, i.e., comparing
a group of tuples with multiple values. Currently, complex SQL queries composed by scalar-level operations
are often formed to obtain even very simple set-level semantics. Such queries are not only difficult to write but
also challenging for a database engine to optimize, thus can result in costly evaluation. This paper proposes to
augment SQL with set predicates, to bring out otherwise obscured set-level semantics. We studied two
approaches to processing set predicates: an aggregate function-based approach and a bitmap index-based
approach. Moreover, we designed a histogram-based probabilistic method of set predicate selectivity
estimation, for optimizing queries with multiple predicates. The experiments verified its accuracy and
effectiveness in optimizing queries.
ETPL
DM - 090
Set Predicates in SQL: Enabling Set-Level Comparisons for Dynamically
Formed Groups
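The set-level comparison described in the abstract above can be made concrete with a small sketch. This is not the paper's SQL extension or its evaluation algorithms; it is a minimal, illustrative Python analogue of the aggregate-function-based idea, evaluating a "group CONTAINS a required set of values" predicate via grouping. All names (`groups_containing`, the sample rows) are hypothetical.

```python
# Illustrative sketch: a set-level predicate ("the set of `course` values for
# each student CONTAINS {'DB', 'OS'}") evaluated by grouping and aggregation,
# in the spirit of an aggregate function-based approach.
from collections import defaultdict

def groups_containing(rows, key, value, required):
    """Return keys whose set of `value`s contains every member of `required`."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[key]].add(row[value])
    return sorted(k for k, vals in groups.items() if set(required) <= vals)

rows = [
    {"student": "ann", "course": "DB"},
    {"student": "ann", "course": "OS"},
    {"student": "bob", "course": "DB"},
]
matches = groups_containing(rows, "student", "course", {"DB", "OS"})  # ['ann']
```

Expressing the same query with scalar-level SQL would require a self-join or correlated subqueries per required value, which is exactly the complexity the set predicate is meant to hide.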
We tackle the time-series classification problem using a novel probabilistic model that represents
the conditional densities of the observed sequences being time-warped and transformed from an underlying
base sequence. We call it probabilistic sequence translation-alignment model (PSTAM) since it aims to
capture both feature alignment and mapping between sequences, analogous to translating one language into
another in the field of machine translation. To deal with general time-series, we impose the time-monotonicity
constraints on the hidden alignment variables in the model parameter space, where marginalizing them out
allows effective learning of class-specific time-warping and feature transformation simultaneously. Our
PSTAM, thus, naturally enjoys the advantages from two typical approaches widely used
in sequence classification: 1) benefits from the alignment-based methods that aim to estimate distance
measures between non-equal-length sequences via direct comparison of aligned features, and 2) merits of
the model-based approaches that can effectively capture the class-specific patterns or trends. Furthermore, the
low-dimensional modeling of the latent base sequence naturally provides a way to discover the intrinsic
manifold structure possibly retained in the observed data, leading to an unsupervised manifold learning
for sequence data. The benefits of the proposed approach are demonstrated on a comprehensive set of
evaluations with both synthetic and real-world sequence data sets.
ETPL
DM - 091
Probabilistic Sequence Translation-Alignment Model for Time-Series
Classification
Imbalanced learning problems contain an unequal distribution of data samples among different
classes and pose a challenge to any classifier as it becomes hard to learn the minority class samples. Synthetic
oversampling methods address this problem by generating the synthetic minority class samples to balance the
distribution between the samples of the majority and minority classes. This paper identifies that most of the
existing oversampling methods may generate the wrong synthetic minority samples in some scenarios and
make learning tasks harder. MWMOTE first identifies the hard-to-learn informative minority class samples and
assigns them weights according to their Euclidean distance from the nearest majority class samples. It then
generates the synthetic samples from the weighted informative minority class samples using a clustering
approach. This is done in such a way that all the generated samples lie inside some minority class
cluster. MWMOTE has been evaluated extensively on four artificial and 20 real-world data sets. The
simulation results show that our method is better than or comparable with some other existing methods in
terms of various assessment metrics, such as geometric mean (G-mean) and area under the receiver operating
characteristic (ROC) curve, usually known as area under the curve (AUC).
ETPL
DM - 092
MWMOTE--Majority Weighted Minority Oversampling Technique for
Imbalanced Data Set Learning
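The generation step MWMOTE builds on can be sketched in a few lines. This is not MWMOTE itself (which first weights informative minority samples and clusters them); it is a minimal SMOTE-style interpolation step showing how synthetic minority samples are placed between existing ones. The names here are illustrative, not from the paper.

```python
# Minimal SMOTE-style oversampling: synthetic minority samples are generated
# by interpolating between pairs of existing minority samples. MWMOTE refines
# this by weighting samples and constraining generation to minority clusters.
import random

def interpolate(a, b, t):
    """Point at fraction t along the segment from a to b."""
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]

def oversample(minority, n_new, rng):
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)   # pick two distinct minority samples
        synthetic.append(interpolate(a, b, rng.random()))
    return synthetic

rng = random.Random(0)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = oversample(minority, 5, rng)
```

Because each synthetic point lies on a segment between two minority samples, it stays inside the minority region, which is the property MWMOTE's clustering step enforces more carefully.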
Evaluation metrics are an essential and integral part of a ranking system. In the past, several evaluation
metrics have been proposed in information retrieval and web search; among them, Discounted
Cumulative Gain (DCG) has emerged as one that is widely adopted for evaluating the performance of ranking
functions used in web search. However, the two sets of parameters, the gain values and discount factors, used
in DCG are usually determined in a rather ad hoc way, and their impacts have not been carefully analyzed. In
this paper, we first show that DCG is generally not coherent, i.e., comparing the performance of ranking
functions using DCG very much depends on the particular gain values and discount factors used. We then
propose a novel methodology that can learn the gain values and discount factors from user preferences over
rankings, modeled as a special case of learning linear utility functions. We also discuss how to extend our
methods to handle tied preference pairs and how to explore active learning to reduce preference labeling.
Numerical simulations illustrate the effectiveness of our proposed methods. Moreover, experiments are also
conducted over a side-by-side comparison data set from a commercial search engine to validate the proposed
methods on real-world data.
ETPL
DM - 093
Learning the Gain Values and Discount Factors of Discounted Cumulative
Gains
The problem of learning conditional preference networks (CP-nets) from a set of examples has
received great attention recently. However, because of the randomness of users' behaviors and
observation errors, there is always some noise making the examples inconsistent, namely, there exists at least
one outcome preferred over itself (by transitivity) in the examples. Existing CP-net learning methods cannot
handle inconsistent examples. In this work, we introduce the model of learning consistent CP-nets
from inconsistent examples and present a method to solve this model. We do not learn the CP-nets directly.
Instead, we first learn a preference graph from the inconsistent examples, because dominance testing and
consistency testing in preference graphs are easier than those in CP-nets. The problem
of learning preference graphs is translated into a 0-1 programming problem and is solved by branch-and-bound
search. Then, the obtained preference graph is transformed into an equivalent CP-net, which can entail a
subset of examples with maximal sum of weights. Examples are given to show that our method can obtain
consistent CP-nets over both binary and multivalued variables from inconsistent examples. The proposed
method is verified on both simulated data and real data, and it is also compared with existing methods.
ETPL
DM - 094
Learning Conditional Preference Networks from Inconsistent Examples
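The two DCG parameter sets discussed above, gain values and discount factors, can be made concrete with a short sketch. The specific choices shown (gain 2^rel − 1, discount 1/log2(i+1)) are the common ad hoc defaults the paper critiques, not the learned parameters it proposes.

```python
# DCG with explicit gain and discount functions, making visible the two
# parameter sets the paper proposes to learn instead of fixing ad hoc.
import math

def dcg(relevances, gain=lambda r: 2 ** r - 1,
        discount=lambda i: 1.0 / math.log2(i + 1)):
    """Discounted Cumulative Gain of a ranked list of relevance labels."""
    return sum(gain(r) * discount(i) for i, r in enumerate(relevances, start=1))

# Swapping the top two results changes DCG, since later positions are discounted.
better = dcg([3, 2, 0])   # the more relevant document ranked first
worse  = dcg([2, 3, 0])
```

Changing either the gain or the discount function can reverse which of two ranking functions scores higher, which is the incoherence the paper analyzes.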
Keyword search is an intuitive paradigm for searching linked data sources on the web. We propose to
route keywords only to relevant sources to reduce the high cost of processing keyword search queries over all
sources. We propose a novel method for computing top-k routing plans based on their potentials to contain
results for a given keyword query. We employ a keyword-element relationship summary that compactly
represents relationships between keywords and the data elements mentioning them. A multilevel scoring
mechanism is proposed for computing the relevance of routing plans based on scores at the level of keywords,
data elements, element sets, and subgraphs that connect these elements. Experiments carried out using 150
publicly available sources on the web showed that valid plans (precision@1 of 0.92) that are highly relevant
(mean reciprocal rank of 0.89) can be computed in 1 second on average on a single PC. Further, we
show routing greatly helps to improve the performance of keyword search, without compromising its result
quality.
ETPL
DM - 095
Keyword Query Routing
Given a graph with billions of nodes and edges, how can we find patterns and anomalies? Are there
nodes that participate in too many or too few triangles? Are there close-knit near-cliques? These questions are
expensive to answer unless we have the first several eigenvalues and eigenvectors of the graph adjacency
matrix. However, eigensolvers suffer from subtle problems (e.g., convergence) for large sparse matrices, let
alone for billion-scale ones. We address this problem with the proposed HEIGEN algorithm, which we
carefully design to be accurate, efficient, and able to run on the highly scalable MAPREDUCE (HADOOP)
environment. This enables HEIGEN to handle matrices more than 1,000× larger than those which can
be analyzed by existing algorithms. We implement HEIGEN and run it on the M45 cluster, one of the top 50
supercomputers in the world. We report important discoveries about near-cliques and triangles on several real-
world graphs, including a snapshot of the Twitter social network (56 GB, 2 billion edges) and the
“YahooWeb” data set, one of the largest publicly available graphs (120 GB, 1.4 billion nodes,
6.6 billion edges).
ETPL
DM - 096
HEigen: Spectral Analysis for Billion-Scale Graphs
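HEIGEN itself is a distributed eigensolver designed for MapReduce; as a single-machine sketch of the underlying idea, power iteration recovers the leading eigenvalue/eigenvector of a symmetric adjacency matrix. This toy version is illustrative only and sidesteps the convergence subtleties at billion-node scale that the paper addresses.

```python
# Power iteration on a small symmetric adjacency matrix: repeatedly multiply
# and normalize; the Rayleigh quotient approximates the leading eigenvalue.
def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def power_iteration(A, iters=100):
    v = [1.0] * len(A)
    for _ in range(iters):
        w = mat_vec(A, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    lam = sum(x * y for x, y in zip(v, mat_vec(A, v)))  # Rayleigh quotient
    return lam, v

# Adjacency matrix of a triangle (nodes 0, 1, 2) plus a pendant node 3.
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
lam, v = power_iteration(A)
```

Eigenvector entries like these are exactly what triangle-counting and near-clique analyses consume: for instance, the number of triangles at a node can be estimated from eigenvalues and eigenvector components.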
A large number of organizations today generate and share textual descriptions of their products,
services, and actions. Such collections of textual data contain a significant amount of structured information,
which remains buried in the unstructured text. While information extraction algorithms facilitate the extraction
of structured relations, they are often expensive and inaccurate, especially when operating on top of text that
does not contain any instances of the targeted structured information. We present a novel alternative approach
that facilitates the generation of structured metadata by identifying documents that are likely to contain
information of interest, information that will subsequently be useful for querying the database. Our
approach relies on the idea that humans are more likely to add the necessary metadata during creation time, if
prompted by the interface; or that it is much easier for humans (and/or algorithms) to identify the metadata
when such information actually exists in the document, instead of naively prompting users to fill in forms with
information that is not available in the document. As a major contribution of this paper, we present algorithms
that identify structured attributes that are likely to appear within the document, by jointly utilizing
the content of the text and the query workload. Our experimental evaluation shows that our approach generates
superior results compared to approaches that rely only on the textual content or only on the query workload, to
identify attributes of interest.
ETPL
DM - 097
Facilitating Document Annotation Using Content and Querying Value
With the wide deployment of public cloud computing infrastructures, using clouds to host data query
services has become an appealing solution for the advantages on scalability and cost-saving. However,
some data might be so sensitive that the data owner does not want to move it to the cloud unless the data
confidentiality and query privacy are guaranteed. On the other hand, a secured query service should still
provide efficient query processing and significantly reduce the in-house workload to fully realize the benefits
of cloud computing. The RASP data perturbation method combines order-preserving encryption,
dimensionality expansion, random noise injection, and random projection, to provide strong resilience to
attacks on the perturbed data and queries. It also preserves multidimensional ranges, which allows existing
indexing techniques to be applied to speed up range query processing. The kNN-R algorithm is designed to
work with the RASP range query algorithm to process the kNN queries. We have carefully analyzed the
attacks on data and queries under a precisely defined threat model and realistic security assumptions.
Extensive experiments have been conducted to show the advantages of this approach on efficiency and
security.
ETPL
DM - 098
Building Confidential and Efficient Query Services in the Cloud with RASP
Data Perturbation
Many supervised learning approaches that adapt to changes in data distribution over time (e.g., concept drift)
have been developed. The majority of them assume that the data comes already preprocessed or
that preprocessing is an integral part of a learning algorithm. In real application tasks, data that comes from,
e.g., sensor readings, is typically noisy and contains missing values and redundant features, and a very large
part of model development effort is devoted to data preprocessing. As data evolves over time, learning models
need to be able to adapt to changes automatically. From a practical perspective, automating a predictor makes
little sense if preprocessing requires manual adjustment over time. Nevertheless, adaptation
of preprocessing has been largely overlooked in research. In this paper, we introduce and address the problem
of adaptive preprocessing. We analyze when and under what circumstances it is beneficial to handle adaptivity
of preprocessing and adaptivity of the learning model separately. We present three scenarios where
handling adaptive preprocessing separately benefits the final prediction accuracy and illustrate them using
computational examples. As a result of our analysis, we construct a prototype approach for
combining adaptive preprocessing with an adaptive predictor online. Our case study with real sensor data from a
production process demonstrates that decoupling the adaptivity of preprocessing and the predictor contributes
to improving the prediction accuracy. The developed reference framework and our experimental findings are
intended to serve as a starting point in systematic research of adaptive preprocessing mechanisms
for adaptive learning with evolving data.
ETPL
DM - 099
Adaptive Preprocessing for Streaming Data
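The decoupling argued for above can be illustrated with a toy component: an online standardizer that tracks a drifting mean with exponential forgetting, adapting on its own schedule regardless of which predictor consumes its output. The class and parameter names are illustrative, not from the paper.

```python
# A preprocessing component that adapts independently of the predictor:
# running mean/variance with exponential forgetting track distribution drift.
class OnlineStandardizer:
    def __init__(self, alpha=0.1):
        self.alpha = alpha      # forgetting factor: higher adapts faster
        self.mean = 0.0
        self.var = 1.0

    def update(self, x):
        """Update running statistics, then return the standardized value."""
        d = x - self.mean
        self.mean += self.alpha * d
        self.var = (1 - self.alpha) * (self.var + self.alpha * d * d)
        return (x - self.mean) / (self.var ** 0.5)

std = OnlineStandardizer(alpha=0.2)
# A mean shift from ~0 to ~10 mid-stream: the standardizer tracks it on its
# own, so a downstream predictor keeps seeing roughly standardized inputs.
stream = [0.1, -0.2, 0.0, 0.1] + [10.0] * 30
out = [std.update(x) for x in stream]
```

If the same statistics were frozen inside the predictor, the model would have to relearn the shift itself, which is the coupling the paper's experiments show hurts accuracy.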
Many real-world data sets grow dynamically in size. This phenomenon occurs in several fields including
economics, population studies, and medical research. As an effective and efficient mechanism to deal with
such data, incremental techniques have been proposed in the literature and have attracted much attention, which
stimulates the results in this paper. When a group of objects is added to a decision table, we first introduce
incremental mechanisms for three representative information entropies and then develop a group incremental
rough feature selection algorithm based on information entropy. When multiple objects are added to a decision
table, the algorithm aims to find the new feature subset in a much shorter time. Experiments have been carried
out on eight UCI data sets and the experimental results show that the algorithm is effective and efficient.
ETPL
DM - 100
A Group Incremental Approach to Feature Selection Applying Rough Set
Technique
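The incremental idea above, updating an information entropy when a batch of objects arrives rather than recomputing over the whole table, can be sketched minimally. This toy maintains decision-class counts so Shannon entropy can be refreshed without rescanning old objects; it is illustrative only and not the paper's three entropies or its feature-selection algorithm.

```python
# Incremental entropy sketch: fold a new group of decision labels into running
# class counts, then recompute entropy from the counts alone (no rescan of
# previously seen objects).
import math
from collections import Counter

def entropy(counts):
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)

class IncrementalEntropy:
    def __init__(self):
        self.counts = Counter()

    def add_group(self, labels):
        """Add a batch of new decision labels; return the updated entropy."""
        self.counts.update(labels)
        return entropy(self.counts)

inc = IncrementalEntropy()
h1 = inc.add_group(["yes", "yes", "no", "no"])   # balanced: entropy is 1 bit
h2 = inc.add_group(["yes"] * 4)                  # skewed: entropy drops
```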
Recent years have witnessed an increased interest in recommender systems. Despite significant progress
in this field, there still remain numerous avenues to explore. Indeed, this paper provides a study of exploiting
online travel information for personalized travel package recommendation. A critical challenge along this line
is to address the unique characteristics of travel data, which distinguish travel packages from traditional items
for recommendation. To that end, in this paper, we first analyze the characteristics of the existing travel
packages and develop a tourist-area-season topic (TAST) model. This TAST model can represent travel
packages and tourists by different topic distributions, where the topic extraction is conditioned on both the
tourists and the intrinsic features (i.e., locations, travel seasons) of the landscapes. Then, based on this topic
model representation, we propose a cocktail approach to generate the lists for personalized travel package
recommendation. Furthermore, we extend the TAST model to the tourist-relation-area-season topic (TRAST)
model for capturing the latent relationships among the tourists in each travel group. Finally, we evaluate the
TAST model, the TRAST model, and the cocktail recommendation approach on the real-world travel package
data. Experimental results show that the TAST model can effectively capture the unique characteristics of the
travel data and the cocktail approach is, thus, much more effective than traditional recommendation techniques
for travel package recommendation. Also, by considering tourist relationships, the TRAST model can be used
as an effective assessment for travel group formation.
ETPL
DM - 101
A Cocktail Approach for Travel Package Recommendation
A protein-protein interaction (PPI) network is a biomolecule relationship network that plays an important role
in biological activities. Studies of functional modules in a PPI network contribute greatly to the understanding
of biological mechanisms. With the development of life science and computing science, a great amount of PPI
data has been acquired by various experimental and computational approaches, which presents a significant
challenge of detecting functional modules in a PPI network. To address this challenge,
many functional module detecting methods have been developed. In this survey, we first analyze the existing
problems in detecting functional modules and discuss the countermeasures in data preprocessing and
postprocessing. Second, we introduce some special metrics for distances or graphs developed in the clustering
process of proteins. Third, we give a classification system of functional module detecting methods and describe some
existing detection methods in each category. Fourth, we list databases in common use and conduct
performance comparisons of several typical algorithms by popular measurements. Finally, we present the
prospects and references for researchers engaged in analyzing PPI networks.
ETPL
DM - 102
Survey: Functional Module Detection from Protein-Protein Interaction
Networks
ETPL
DM - 103
Decision Trees for Mining Data Streams Based on the Gaussian Approximation
ETPL
DM - 104
Structural Diversity for Resisting Community Identification in Published Social
Networks
Since the Hoeffding tree algorithm was proposed in the literature, decision trees became one of the
most popular tools for mining data streams. The key point of constructing the decision tree is to determine the
best attribute to split the considered node. Several methods to solve this problem were presented so far.
However, they are either wrongly mathematically justified (e.g., in the Hoeffding tree algorithm) or time-
consuming (e.g., in the McDiarmid tree algorithm). In this paper, we propose a new method which
significantly outperforms the McDiarmid tree algorithm and has a solid mathematical basis. Our method
ensures, with a high probability set by the user, that the best attribute chosen in the considered node using a
finite data sample is the same as it would be for the whole data stream.
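The split decision the abstract above critiques can be sketched concretely. This is the classical Hoeffding-tree test (the one the paper argues is wrongly justified and replaces with a Gaussian-approximation bound), shown only to make the mechanism visible; the function names are illustrative.

```python
# Classical Hoeffding-tree split test: with n samples and a split criterion
# bounded in [0, R], split on the best attribute only if its observed
# advantage over the runner-up exceeds the Hoeffding deviation bound.
import math

def hoeffding_bound(R, delta, n):
    """eps such that a sample mean of n values in [0, R] is within eps of the
    true mean with probability at least 1 - delta."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def should_split(gain_best, gain_second, R=1.0, delta=1e-6, n=1000):
    return (gain_best - gain_second) > hoeffding_bound(R, delta, n)
```

The bound shrinks as n grows, so a node postpones its split until enough stream samples have accumulated to make the attribute choice statistically safe.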
As increasing amounts of social network data are published and shared for commercial and research
purposes, the privacy of the individuals in social networks has become a serious concern.
Vertex identification, which identifies a particular user from a network based on background knowledge such
as vertex degree, is one of the most important problems that have been addressed. In reality, however, each
individual in a social network is inclined to be associated with not only a vertex identity but also
a community identity, which can represent the personal privacy information sensitive to the public, such as
political party affiliation. This paper first addresses the new privacy issue, referred to
as community identification, by showing that the community identity of a victim can still be inferred even
though the social network is protected by existing anonymity schemes. For this problem, we then propose the
concept of structural diversity to provide the anonymity of the community identities. The k-
Structural Diversity Anonymization (k-SDA) is to ensure sufficient vertices with the same vertex degree in at
least k communities in a social network. We propose an Integer Programming formulation to find optimal
solutions to k-SDA and also devise scalable heuristics to solve large-scale instances of k-SDA from different
perspectives. The performance studies on real data sets from various perspectives demonstrate the practical
utility of the proposed privacy scheme and our anonymization approaches.
The task of assigning geographic coordinates to textual resources plays an increasingly central role in
geographic information retrieval. The ability to select those terms from a given collection that are most
indicative of geographic location is of key importance in successfully addressing this task. However, this
process of selecting spatially relevant terms is at present not well understood, and the majority of current
systems are based on standard term selection techniques, such as χ² or information gain, and thus fail to
exploit the spatial nature of the domain. In this paper, we propose two classes of term selection techniques
based on standard geostatistical methods. First, to implement the idea of spatial smoothing
of term occurrences, we investigate the use of kernel density estimation (KDE) to model each term as a two-
dimensional probability distribution over the surface of the Earth. The second class of term selection methods
we consider is based on Ripley's K statistic, which measures the deviation of a point set from spatial
homogeneity. We provide experimental results which compare these classes of methods against existing
baseline techniques on the tasks of assigning coordinates to Flickr photos and to Wikipedia articles, revealing
marked improvements in cases where only a relatively small number of terms can be selected.
ETPL
DM - 105
Spatially Aware Term Selection for Geotagging
Information extraction from printed documents is still a crucial problem in many interorganizational
workflows. Solutions for other application domains, for example, the web, do not fit this peculiar scenario
well, as printed documents do not carry any explicit structural or syntactical description. Moreover,
printed documents usually lack any explicit indication about their source. We present a system, which we call
PATO, for extracting predefined items from printed documents in a dynamic multisource scenario. PATO
selects the source-specific wrapper required by each document, determines whether no suitable wrapper exists,
and generates one when necessary. PATO assumes that the need for new source-specific wrappers is a part of
normal system operation: new wrappers are generated online based on a few point-and-click operations
performed by a human operator on a GUI. The role of operators is an integral part of the design and PATO
may be configured to accommodate a broad range of automation levels. We show that PATO exhibits very
good performance on a challenging data set composed of more than 600 printed documents drawn from three
different application domains: invoices, datasheets of electronic components, and patents. We also perform an
extensive analysis of the crucial tradeoff between accuracy and automation level.
ETPL
DM - 106
Semisupervised Wrapper Choice and Generation for Print-Oriented Documents
Nowadays, the high availability of data gathered from wireless sensor networks and
telecommunication systems has drawn the attention of researchers to the problem of extracting knowledge
from spatiotemporal data. Detecting outliers which are grossly different from or inconsistent with the
remaining spatiotemporal data set is a major challenge in real-world knowledge discovery and data mining
applications. In this paper, we deal with the outlier detection problem in spatiotemporal data and describe
a rough set approach that finds the top outliers in an unlabeled spatiotemporal data set. The proposed method,
called Rough Outlier Set Extraction (ROSE), relies on a rough set theoretic representation of
the outlier set using the rough set approximations, i.e., lower and upper approximations. We have also
introduced a new set, named Kernel Set, that is a subset of the original data set, which is able to describe the
original data set both in terms of data structure and of obtained results. Experimental results on real-world
data sets demonstrate the superiority of ROSE, both in terms of some quantitative indices
and outliers detected, over those obtained by various rough fuzzy clustering algorithms and by the state-of-the-
art outlier detection methods. It is also demonstrated that the kernel set is able to detect the
same outlier set but with less computational time.
ETPL
DM - 107
Rough Sets, Kernel Set, and Spatiotemporal Outlier Detection
This paper investigates a framework of search-based face annotation (SBFA)
by mining weakly labeled facial images that are freely available on the World Wide Web (WWW). One
challenging problem for search-based face annotation schemes is how to effectively perform annotation by
exploiting the list of most similar facial images and their weak labels that are often noisy and incomplete. To
tackle this problem, we propose an effective unsupervised label refinement (ULR) approach for refining the
labels of web facial images using machine learning techniques. We formulate the learning problem as a
convex optimization and develop effective optimization algorithms to solve the large-scale learning task
efficiently. To further speed up the proposed scheme, we also propose a clustering-based approximation
algorithm which can improve the scalability considerably. We have conducted an extensive set of empirical
studies on a large-scale web facial image testbed, in which encouraging results showed that the proposed ULR
algorithms can significantly boost the performance of the promising SBFA scheme.
ETPL
DM - 108
Mining Weakly Labeled Web Facial Images for Search-Based Face Annotation
In this paper, we construct a linkable ring signature scheme with unconditional anonymity. It has been
regarded as an open problem in [22] since 2004 to construct
an unconditionally anonymous linkable ring signature scheme. We are the first to solve this open problem by
giving a concrete instantiation, which is proven secure in the random oracle model. Our construction is even
more efficient than other schemes that can only provide computational anonymity. Simultaneously, our
scheme can act as a counterexample to show that [19, Theorem 1] is not always true, which stated
that a linkable ring signature scheme cannot provide strong anonymity. Yet we prove that our scheme can
achieve strong anonymity (under one of the interpretations).
ETPL
DM - 109
Linkable Ring Signature with Unconditional Anonymity
The new method proposed in this paper applies a multivariate reconstructed phase space (MRPS) for
identifying multivariate temporal patterns that are characteristic and predictive of anomalies or events in
a dynamic data system. The new method extends the original univariate reconstructed phase space framework,
which is based on fuzzy unsupervised clustering method, by incorporating a new mechanism
of data categorization based on the definition of events. In addition to modeling temporal dynamics in a
multivariate phase space, a Bayesian approach is applied to model the first-order Markov behavior in the
multidimensional data sequences. The method utilizes an exponential loss objective function to optimize a
hybrid classifier which consists of a radial basis kernel function and a log-odds ratio component. We
performed experimental evaluation on three data sets to demonstrate the feasibility and effectiveness of the
proposed approach.
ETPL
DM - 110
Event Characterization and Prediction Based on Temporal Patterns in Dynamic
Data System
This paper introduces two kinds of decision tree ensembles for imbalanced classification problems,
extensively utilizing properties of α-divergence. First, a novel splitting criterion based on α-divergence is
shown to generalize several well-known splitting criteria such as those used in C4.5 and CART. When the α-
divergence splitting criterion is applied to imbalanced data, one can obtain decision trees that tend to be less
correlated (α-diversification) by varying the value of α. This increased diversity in an ensemble of
such trees improves AUROC values across a range of minority class priors. The resultant ensemble produces a
set of interpretable rules that provide higher lift values for a given coverage, a property that is much desirable
in applications such as direct marketing. Experimental results across many class-imbalanced data sets,
including BRFSS, and MIMIC data sets from the medical community and several sets from UCI and KEEL
are provided to highlight the effectiveness of the proposed ensembles over a wide range of data distributions
and of class imbalance.
ETPL
DM - 111
Ensembles of α-Trees for Imbalanced Classification Problems
Detection of emerging topics is now receiving renewed interest, motivated by the rapid growth
of social networks. Conventional term-frequency-based approaches may not be appropriate in this context,
because the information exchanged in social network posts includes not only text but also images, URLs, and
videos. We focus on the emergence of topics signaled by social aspects of these networks. Specifically, we focus
on mentions of users: links between users that are generated dynamically (intentionally or unintentionally)
through replies, mentions, and retweets. We propose a probability model of the mentioning behavior of
a social network user, and propose to detect the emergence of a new topic from the anomalies measured
through the model. Aggregating anomaly scores from hundreds of users, we show that we can
detect emerging topics based only on the reply/mention relationships in social network posts. We demonstrate
our technique on several real data sets we gathered from Twitter. The experiments show that the proposed
mention-anomaly-based approaches can detect new topics at least as early as text-anomaly-based approaches,
and in some cases much earlier when the topic is poorly identified by the textual contents of posts.
ETPL
DM - 112
Discovering Emerging Topics in Social Streams via Link-Anomaly Detection
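The aggregation idea in the abstract above can be sketched in a toy form: score each observed mention event by its negative log-likelihood under a per-user mention model, sum the scores across users, and flag emergence when the aggregate crosses a threshold. All names and the threshold here are illustrative assumptions, not the paper's actual model:

```python
import math

def anomaly_score(prob):
    """Anomaly score of one observed mention event: its negative
    log-likelihood under the user's (hypothetical) mention model.
    Low-probability events yield high scores."""
    return -math.log(prob)

def detect_emergence(user_probs, threshold):
    """Aggregate per-user anomaly scores and flag a topic as
    emerging when the total exceeds the threshold."""
    total = sum(anomaly_score(p) for p in user_probs)
    return total > threshold
```

With probabilities near 1 nothing fires; a burst of improbable mention patterns across many users pushes the aggregate over the threshold.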
Big Data concerns large-volume, complex, growing data sets with multiple, autonomous
sources. With the fast development of networking, data storage, and data collection capacity, Big Data is
now rapidly expanding in all science and engineering domains, including the physical, biological, and
biomedical sciences. This paper presents a HACE theorem that characterizes the features of the Big Data
revolution, and proposes a Big Data processing model from the data mining perspective. This data-driven
model involves demand-driven aggregation of information sources, mining and analysis, user interest
modeling, and security and privacy considerations. We analyze the challenging issues in the data-driven
model and in the Big Data revolution.
ETPL
DM - 113
Data mining with big data
In this paper, we tackle a novel problem of ranking multivalued objects, where an object has multiple
instances in a multidimensional space, and the number of instances per object is not fixed. Given an ad hoc
scoring function that assigns a score to a multidimensional instance, we want to rank a set of multivalued
objects. Different from the existing models of ranking uncertain and probabilistic data, which model an object
as a random variable and assume the instances of an object to be mutually exclusive, we have to capture the
coexistence of instances here. To tackle the problem, we advocate the semantics of favoring widely preferred
objects instead of majority votes, a principle widely used in many elections and competitions.
ETPL
DM - 114
Consensus-Based Ranking of Multivalued Objects: A Generalized Borda Count
Approach
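The consensus semantics above can be illustrated with a simplified Borda-style count over instances (a sketch under strong simplifying assumptions; the paper's generalized Borda count is more involved than this):

```python
def borda_rank(objects):
    """Rank multivalued objects by a simplified Borda count.

    `objects` maps an object name to its list of instance scores
    (higher is better); the number of instances per object may
    vary. Each instance earns one point for every instance of a
    *different* object it outranks; an object's consensus score
    is the average over its own instances, so objects with more
    instances are not favored outright."""
    points = {}
    for name, scores in objects.items():
        others = [s for other, ss in objects.items()
                  if other != name for s in ss]
        pts = sum(sum(1 for o in others if s > o) for s in scores)
        points[name] = pts / len(scores)
    # Highest average Borda points first.
    return sorted(points, key=points.get, reverse=True)
```

An object whose every instance beats most instances of its rivals ranks first, even if no single instance has the top score.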
Technology-supported learning systems have proved to be helpful in many learning situations. These
systems require an appropriate representation of the knowledge to be learned, the Domain Module. The
authoring of the Domain Module is cost- and labor-intensive, but its development cost might be reduced by
exploiting semiautomatic Domain Module authoring techniques and promoting knowledge reuse.
DOM-Sortze is a system that uses natural language processing techniques, heuristic reasoning, and ontologies for the
semiautomatic construction of the Domain Module from electronic textbooks. To determine how it might help
in the Domain Module authoring process, it has been tested with an electronic textbook, and the gathered
knowledge has been compared with the Domain Module that instructional designers developed manually. This
paper presents DOM-Sortze and describes the experiment carried out.
ETPL
DM - 115
Automatic Generation of the Domain Module from Electronic Textbooks:
Method and Validation
Shortest distance query is a fundamental operation in large-scale networks. Many existing methods in
the literature take a landmark embedding approach, which selects a set of graph nodes as landmarks and
computes the shortest distances from each landmark to all nodes as an embedding. To answer a shortest
distance query, the precomputed distances from the landmarks to the two query nodes are used to compute an
approximate shortest distance based on the triangle inequality. In this paper, we analyze the factors that affect
the accuracy of distance estimation in landmark embedding. In particular, we find that a globally selected,
query-independent landmark set may introduce a large relative error, especially for nearby query nodes. To
address this issue, we propose a query-dependent local landmark scheme, which identifies a local landmark
close to both query nodes and provides more accurate distance estimation than the traditional global landmark
approach. We propose efficient local landmark indexing and retrieval techniques, which achieve low offline
indexing complexity and online query complexity. Two optimization techniques, on graph compression and
graph online search, are also proposed with the goal of further reducing index size and improving query
accuracy. Furthermore, the challenge of immense graphs whose index may not fit in memory leads us to
store the embedding in a relational database, so that a query under the local landmark scheme can be expressed
with relational operators. Effective indexing and query optimization mechanisms are designed in this context.
Our experimental results on large-scale social networks and road networks demonstrate that the local landmark
scheme reduces the shortest distance estimation error significantly when compared with global landmark
embedding and the state-of-the-art sketch-based embedding.
ETPL
DM - 116
Approximate Shortest Distance Computing: A Query-Dependent Local
Landmark Scheme
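The triangle-inequality estimation described above, and the large relative error a global landmark incurs for nearby query nodes, can be shown on a toy path graph (a minimal sketch; graph, names, and landmark choices are illustrative, not from the paper):

```python
import heapq

def dijkstra(graph, src):
    """Single-source shortest distances on a weighted graph given
    as an adjacency dict {node: {neighbor: weight}}."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def landmark_estimate(embeddings, u, v):
    """Upper-bound distance estimate via the triangle inequality:
    d(u, v) <= min over landmarks l of d(l, u) + d(l, v)."""
    return min(dist[u] + dist[v] for dist in embeddings.values())
```

On the path a-b-c-d (unit weights), a single global landmark at `a` estimates d(b, c) as 1 + 2 = 3, three times the true distance 1; adding a landmark at `b`, close to both query nodes, recovers the exact answer — the intuition behind the query-dependent local landmark scheme.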
Semi-supervised clustering aims to improve clustering performance by considering user supervision in
the form of pairwise constraints. In this paper, we study the active learning problem of selecting pairwise
must-link and cannot-link constraints for semi-supervised clustering. We consider active learning in an
iterative manner where in each iteration queries are selected based on the current clustering solution and the
existing constraint set. We apply a general framework that builds on the concept of neighborhood, where
neighborhoods contain "labeled examples" of different clusters according to the pairwise constraints. Our
active learning method expands the neighborhoods by selecting informative points and querying their
relationship with the neighborhoods. Under this framework, we build on the classic uncertainty-based
principle and present a novel approach for computing the uncertainty associated with each data point. We
further introduce a selection criterion that trades off the amount of uncertainty of each data point with the
expected number of queries (the cost) required to resolve this uncertainty. This allows us to select queries that
have the highest information rate. We evaluate the proposed method on benchmark data sets, and the results
demonstrate consistent and substantial improvements over the current state of the art.
ETPL
DM - 117
Active Learning of Constraints for Semi-Supervised Clustering
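The selection criterion described above — trading off a point's uncertainty against the expected number of queries needed to resolve it — reduces to picking the candidate with the highest information rate. A minimal sketch (the tuple layout and names are illustrative assumptions):

```python
def select_query(candidates):
    """Pick the data point with the highest information rate.

    `candidates` is a list of (point, uncertainty, expected_queries)
    tuples; the rate is uncertainty divided by the expected number
    of pairwise must-link/cannot-link queries (the cost) needed to
    resolve that uncertainty."""
    return max(candidates, key=lambda c: c[1] / c[2])[0]
```

A very uncertain point that needs many queries can lose to a moderately uncertain point resolvable with a single query.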
Extending the keyword search paradigm to relational data has been an active area of research within
the database and IR community during the past decade. Many approaches have been proposed, but despite
numerous publications, there remains a severe lack of standardization for the evaluation of proposed search
techniques. This lack of standardization has resulted in contradictory results from different evaluations, and the
numerous discrepancies muddle what advantages are proffered by different approaches. In this paper, we
present the most extensive empirical performance evaluation of relational keyword search techniques to appear
to date in the literature. Our results indicate that many existing search techniques do not provide acceptable
performance for realistic retrieval tasks. In particular, memory consumption precludes many search techniques
from scaling beyond small data sets with tens of thousands of vertices. We also explore the relationship
between execution time and factors varied in previous evaluations; our analysis indicates that most of these
factors have relatively little impact on performance. In summary, our work confirms previous claims regarding
the unacceptable performance of these search techniques and underscores the need for standardization in
evaluations--standardization exemplified by the IR community.
ETPL
DM - 118
An Empirical Performance Evaluation of Relational Keyword Search
Techniques
Collaborative tagging is one of the most popular services available online, and it allows end users to
loosely classify either online or offline resources based on their feedback, expressed in the form of free-text
labels (i.e., tags). Although tags may not be per se sensitive information, the wide use of
collaborative tagging services increases the risk of cross-referencing, thereby seriously compromising
user privacy. In this paper, we make a first contribution toward the development of a privacy-preserving
collaborative tagging service, by showing how a specific privacy-enhancing technology, namely tag
suppression, can be used to protect end-user privacy. Moreover, we analyze how our approach can affect the
effectiveness of a policy-based collaborative tagging system that supports enhanced web access
functionalities, like content filtering and discovery, based on preferences specified by end users.
ETPL
DM - 119
Privacy-Preserving Enhanced Collaborative Tagging
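The tag-suppression idea above — withholding tags the user deems revealing before they are published — can be sketched in its simplest form (the function and the example tags are illustrative assumptions, not the paper's mechanism):

```python
def suppress_tags(tags, sensitive):
    """Tag suppression: publish only the tags outside the user's
    sensitive set, trading some tagging utility for a reduced
    risk of cross-referencing and profiling."""
    return [t for t in tags if t not in sensitive]
```

Suppressing more tags strengthens privacy but degrades the content filtering and discovery that the collaborative tagging system builds on — the effectiveness trade-off the paper analyzes.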
Thank You!