SPECIAL SECTION ON CHALLENGES FOR SMART WORLDS

Received September 10, 2015, accepted October 15, 2015, date of publication November 2, 2015, date of current version November 18, 2015.

Digital Object Identifier 10.1109/ACCESS.2015.2497278

A Deep Awareness Framework for Pervasive Video Cloud

WEISHAN ZHANG1, PENGCHENG DUAN1, ZHONGWEI LI1, QINGHUA LU1, WENJUAN GONG1, AND SU YANG2

1Department of Software Engineering, China University of Petroleum, Qingdao 266580, China
2College of Computer Science and Technology, Fudan University, Shanghai 200433, China

Corresponding author: W. Zhang ([email protected])

This work was supported in part by the National Natural Science Foundation of China under Grant 61402533, in part by the Natural Science Foundation of Shandong Province under Grant ZR2014FM038 and Grant ZR2015FL015, and in part by the Key Technologies Development Plan of Qingdao Technical Economic Development Area. The work of W. Zhang was supported by the Start-Up Funds for Academic Top-Notch Professors through the China University of Petroleum.

ABSTRACT Context-awareness for big data applications differs from that of traditional applications in that it is increasingly challenging to obtain contexts from big data, due to the complexity, velocity, variety, and other aspects of big data, especially big video data. The awareness of contexts in big data is more difficult, and should be more in-depth, than that of classical applications. Therefore, in this paper, we propose an in-depth context-awareness framework for a pervasive video cloud in order to obtain the underlying contexts in big video data. In this framework, we propose an approach that combines the historical view with the current view to obtain meaningful in-depth contexts, where deep learning techniques are used to obtain raw context data. We have conducted initial evaluations to show the effectiveness of the proposed approach in terms of performance and also the accuracy of obtaining the contexts. The evaluation results show that the proposed approach is effective for real-time context-awareness in a pervasive video cloud.

INDEX TERMS Pervasive video cloud, deep learning, framework, context awareness, cloud computing.

I. INTRODUCTION
Huge amounts of video data are generated every day by smart city applications, such as smart transportation monitoring and security surveillance. Making full use of these video data may provide critical benefits, e.g., for national and social security. An example is the 2014 violent terrorist attack at Kunming Railway Station. If we could take advantage of cloud computing facilities to process surveillance video in a timely and accurate manner, and then recognize what was happening from the surveillance video, or even from images and videos captured by people's smart phones, we could then make emergency decisions to minimize potential harm. This kind of computing paradigm, which combines cloud computing with networked embedded video devices including smart phones, surveillance cameras and so on, is called a pervasive cloud [1].

There has been much research on various aspects of context-awareness, for example, how to model fuzzy contexts [2], and how to use hybrid artificial intelligence techniques to get better reasoning capabilities [3], [4]. This existing work on context-awareness cannot work properly on a pervasive video cloud due to the intrinsic complexities of big video data, in that it cannot retrieve context data effectively from videos, especially when there are huge amounts of data. On the other hand, the context data obtained from big data should be properly synthesized in order to get meaningful and useful knowledge on contexts.

The success of deep learning [5], initiated by Hinton and Salakhutdinov, provides possibilities to extract in-depth raw contexts from big video data. Hinton et al. obtained the world's best classification results for the ImageNet problem with a deep convolutional neural network (deep CNN, or DCNN). DCNNs can be trained with raw input images, without manually designed feature extractors, and they perform better in object recognition than many traditional approaches [6], [7]. A number of object detection studies take advantage of DCNNs for accuracy improvements. OverFeat [8] is presented as an integrated framework for using convolutional networks for classification, localization, and detection. Other deep learning techniques include the deep belief network (DBN). A DBN is a probabilistic neural network consisting of multiple undirected layers, which are called restricted Boltzmann machines (RBMs). Hinton et al. [9] showed that significantly better results could be achieved in deeper architectures when each RBM layer was pre-trained with an unsupervised learning algorithm (contrastive divergence [9]).

An RBM can capture features or patterns of data. A DBN comprises multiple stacked RBMs and can capture deep features.

In order to know appropriate contexts from big video data, for example, the current traffic status on a road, the historical data (e.g., from an hour ago) should be considered together with the most up-to-date traffic monitoring video. Therefore, for obtaining in-depth contexts in big data, we propose to adopt an approach that combines the historical data view with the current data view, where deep learning techniques are used to obtain raw context data. For big data computing, this approach of synthesizing historical data with current data using deep learning to obtain the underlying contexts is called in-depth context awareness, or deep awareness for short. Consequently, it can get deeper-level perceptual contexts, which we call 'in-depth contexts'. Deeper-level contexts such as traffic status from big video data differ from the shallow information readily available at a system level in traditional non-big-data environments, such as places, people, networks and so on.

There are already some efforts to understand big video and image data using deep learning. For example, the classic convolutional neural network was extended to three dimensions to exploit time-domain features in video for action recognition [10]. This research provides good insights for achieving context-awareness in video data. However, in order to obtain contexts from big data in time, we need to make use of a powerful computing and storage infrastructure such as cloud computing provides. This is especially important for understanding huge amounts of video data, particularly in public safety and security cases where fast run-time deep awareness is needed, as in the Kunming terrorist attack.

Therefore, in this paper we propose an in-depth context-awareness framework for a pervasive video cloud based on deep learning techniques, supported by both online and offline cloud computing technologies. The contributions include:

• We propose a deep awareness architecture that can be used to achieve deep context-awareness for a pervasive video cloud.

• We propose a deep learning based approach for raw context data acquisition, including using a convolutional neural network to obtain in-depth raw context data inside big video data, and a deep belief network based method to predict the workload status of different cloud nodes as part of the knowledge on system running status.

• We make use of both online and offline cloud computing technologies to implement the in-depth context-awareness framework; raw context data are then converged by synthesizing processing results from both online and offline data views.

• We evaluate the accuracy of the prediction and the performance of object recognition in video to show the effectiveness of the proposed framework.

The remainder of this paper is organized as follows. In Section II we present a high-level view of the proposed in-depth context-awareness framework. In Sections III and IV, we show the results of our experiments with retrieving cloud-node running-status data using a DBN and retrieving raw context data from video data using a DCNN, respectively, together with a discussion of the effectiveness of the proposed approach. Finally, related work and conclusions end the paper.

II. ARCHITECTURE OF THE IN-DEPTH CONTEXT-AWARENESS FRAMEWORK
As discussed in the introduction, the awareness of context in big data relies on both historical data processing and current data processing. Therefore, it is natural that both large-scale offline processing of historical data and fast online processing of current data should be converged in order to achieve deep awareness. At the same time, due to the complexities of video data processing, it is important to utilize the processing advantages of two different cloud computing styles, namely offline batch processing (e.g., Hadoop1) and online real-time stream processing (e.g., Storm2).

In the proposed framework, we adopt a layered architecture style that combines different cloud computing technologies and deep learning technologies, as shown in Figure 1, where the main software packages deployed on each layer are also shown.

The first layer is the Hadoop-based off-line processing layer. There are three main features in this layer:
• Storage of large video data sets. This is achieved using HDFS (Hadoop Distributed File System) and HBase.3

• MapReduce-based high-performance batch processing of video data, for example, processing video data to produce a video summary in order to reduce the size of the video for long-term storage. For achieving context-awareness, some special kinds of offline data processing may be needed, for example, counting the moving vehicles on a road over the past 3 hours.

• Training of deep learning algorithms in order to get an optimal parameter set for the underlying deep learning network. Different deep learning algorithms serve different purposes, as discussed in the introduction.

Different offline data views on video data are obtained mainly through MapReduce-based processing. For example, to understand whether there are dangerous actions in a historical video, the processing includes background subtraction, preprocessing, and action classification. The results of the processing are video frames containing only people with moving actions.

The second layer is the Storm-based fast real-time stream processing layer, covering real-time object recognition and so on. This layer has the following features:

1http://hadoop.apache.org
2http://storm.apache.org
3http://hbase.apache.org/

FIGURE 1. High level architecture of the in-depth context-awareness framework.

• Run-time processing to obtain the real-time view of video data, involving lightweight algorithms such as background subtraction using a GMM (Gaussian Mixture Model), face recognition, and so on.

• Real-time monitoring of cluster running status, including workload status, CPU and RAM usage, and other parameters, in order to achieve optimized scheduling of video processing tasks.

• Run-time decision making based on the collected offline and online contexts, for example to initiate route scheduling during rush hours.

• Run-time deep learning based data mining. This includes using a DBN for workload prediction, object recognition using a DCNN, and so on.

The service layer (service convergence layer) is built on top of the above two layers, where data sets from both off-line and real-time video processing are fused in the context management component in order to get meaningful context data. Splout SQL4 is used to query offline data views, and Trident DRPC is used for retrieving real-time data views. These combined data are passed as inputs to deep learning components to realize continuous learning. The service layer is built on a converged infrastructure with Hadoop and Storm, coordinated by ZooKeeper.5 Semantic web ontologies are also used as a knowledge base to further improve context reasoning capabilities.
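To make the fusion idea concrete, the following is a minimal sketch of how an offline (historical) view and an online (real-time) view could be combined into one in-depth context. All names and thresholds here are hypothetical illustrations, not the framework's actual API; in the framework the offline view would come from a Splout SQL query and the online view from Trident DRPC.

```python
# Hypothetical sketch: fuse a historical traffic view with a real-time view.
# Function name, inputs, and thresholds are illustrative assumptions.
def fuse_traffic_context(historical_counts, current_count):
    """historical_counts: vehicle counts per hour from the offline view;
    current_count: the latest hourly count from the online view."""
    baseline = sum(historical_counts) / len(historical_counts)
    ratio = current_count / max(baseline, 1e-9)  # guard against a zero baseline
    if ratio > 1.5:
        return "congested"   # well above the historical norm
    if ratio < 0.5:
        return "light"
    return "normal"

# Example: fuse_traffic_context([120, 135, 128], 210) -> "congested"
```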

4http://www.datasalt.com/products/splout-sql/
5http://zookeeper.apache.org

As obtaining raw context data is a critical precondition for knowing contexts before these data are fused, we evaluate extensively the performance of using a DCNN and a DBN for obtaining raw context data, to show the effectiveness of the proposed approach. These evaluations include:

• The prediction of the workload of different nodes can make optimal scheduling of cloud resources possible in advance. In the proposed framework, we use DBN-based time-series data analysis as a means of predicting workload.

• Object recognition is one of the most important tasks in video processing, and one of the most useful ways to mine video data in order to understand videos. A DCNN is used in our framework to recognize objects as part of how we obtain raw context data from video.

III. EVALUATION ON PREDICTING WORKLOAD OF CLOUD NODES
We first evaluate using a DBN to predict the workload of cloud nodes. We choose the Google cluster trace6 released in 2011 as the starting point to show the effectiveness of the prediction. The trace was measured on a heterogeneous 7000-machine server cluster over a 29-day period, involving 672,075 jobs and more than 48 million tasks. It is made up of several datasets that record metrics of jobs/tasks and machine nodes in the cluster. A dataset contains a single table, indexed by a primary key that typically includes a timestamp. At a certain timestamp, the cluster's metrics are recorded into several tables. One of them is the task usage table, which includes the Google cluster's RAM workloads, the subject studied in this paper. More details such as the format and schema are explained in [11].

6http://code.google.com/p/googleclusterdata and http://googleresearch.blogspot.com/2011/11/more-google-cluster-data.html

A. TRAINING AND PREDICTION
The training process of our deep belief networks consists of five parts, as shown in Figure 2.

FIGURE 2. Training workflow.

The workloads are first extracted from the Google cluster trace and then go through a preprocessing procedure. During preprocessing, each workload time series $x(t)$ is transformed by the difference equation $x_d(t) = x(t) - x(t-d)$, where $d$ is the integer time delay. $x_d(t)$ is then normalized (denoted $x_{dn}(t)$) to the range $[0, 1]$ and fed to a DBN for training the network's parameters. The purpose of the differential transformation is to reduce linearity, owing to the neural network's low learning capability for data with strong linear factors.

FIGURE 3. Structure of a DBN.

Figure 3 illustrates how the normalized workload time series $x_{dn}(t)$ is used to train the DBN, where $d = 5$ and $\hat{x}_{dn}(t)$ denotes the predicted workload at time $t$. $d$ is decided by the ARIMA model [12], because the model analyzes the autoregression and autocorrelation of the workload time series to check linearity and interdependency, which gives hints for deciding the number of visible neurons. Re-normalizing $\hat{x}_{dn}(t)$, we get $\hat{x}_d(t)$. Due to the transformation of the original data during pre-processing, the predicted data should be reconstructed by $\hat{x}(t+1) = \hat{x}_d(t+1) + x(t-d+1)$ during post-processing, where $\hat{x}(t+1)$ is the prediction for the original data.
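The pre- and post-processing steps above can be summarized in a short NumPy sketch. This is a minimal illustration assuming min-max normalization (the paper does not name its exact normalization scheme), and the function names are ours:

```python
import numpy as np

def difference(x, d=5):
    """x_d(t) = x(t) - x(t - d): the differential transform."""
    return x[d:] - x[:-d]

def normalize(xd):
    """Scale the differenced series to [0, 1]; return the scaling constants."""
    lo, hi = xd.min(), xd.max()
    return (xd - lo) / (hi - lo), lo, hi

def reconstruct(xdn_hat_next, x, t, lo, hi, d=5):
    """Post-processing: x_hat(t+1) = x_d_hat(t+1) + x(t - d + 1)."""
    xd_hat_next = xdn_hat_next * (hi - lo) + lo   # undo the normalization
    return xd_hat_next + x[t - d + 1]
```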

An RBM can be regarded as a probabilistic graphical model. An RBM is undirected, and each maximal clique in an RBM has exactly two neurons (one in the visible layer, the other in the hidden layer). This means that any two neurons in the same layer of an RBM are separated by the other layer, which indicates that they are conditionally independent. More specifically, an RBM can be seen as an MRF (Markov Random Field).7

FIGURE 4. An illustration of an RBM from the MRF's viewpoint.

Figure 4 illustrates an RBM with 2 visible neurons and 3 hidden neurons from the MRF's viewpoint. We can see that the two hidden neurons (denoted by a red and a blue solid circle) are separated by the neurons in the visible layer (denoted by the green circles). Viewing an RBM as a parameterized generative model representing a probability distribution, the training data can be regarded as samples drawn from the sample space of the random variables in the visible layer. Given the training data, minimizing the loss means minimizing the difference between the distribution generated from the training samples and the ideally desired distribution. One way to minimize the loss is to maximize the likelihood $\mathcal{L} : \Theta \rightarrow \mathbb{R}$ over the RBM parameters (i.e., finding the RBM parameters $\theta$ that maximize the likelihood given the training data). Maximizing the likelihood is the same as maximizing the log-likelihood, given by

$$\ln \mathcal{L}(\theta \mid S) = \ln \prod_{i=1}^{l} p(x_i \mid \theta) = \sum_{i=1}^{l} \ln p(x_i \mid \theta),$$

where $S$ represents the training data, $l$ the number of training samples, $p$ the distribution of the MRF, and $x_i$ the $i$th training sample in $S$. So, intuitively, the loss function for training an RBM should be

$$\ell(\theta \mid S) = -\frac{\ln \mathcal{L}(\theta \mid S)}{l} = -\frac{\sum_{i=1}^{l} \ln p(x_i \mid \theta)}{l}.$$

Although it is in general not possible to find the maximum likelihood parameters analytically, due to the intractability of exact maximum likelihood learning in an RBM, Gibbs sampling [13] can be applied stochastically to obtain a sequence of observations that approximates the joint distribution of the RBM. However, Gibbs sampling requires many sampling steps. Based on the MRF property of the RBM, contrastive divergence (CD) learning by Hinton et al. [9] greatly reduces the sampling steps and is adopted in this paper. Note that because $x_{dn}(t)$ ranges over $[0, 1]$, the visible layer of the first RBM and the hidden layer of the last RBM in the DBN should not be Bernoulli but Gaussian.

7https://en.wikipedia.org/wiki/Markov_random_field
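As a concrete illustration, the following is a minimal CD-1 update for a single Bernoulli RBM. This is a sketch, not the paper's implementation; in particular, the Gaussian visible/hidden units mentioned above would change how those units are sampled, but not the structure of the update:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b_v, b_h, lr=0.1, rng=np.random.default_rng()):
    """One CD-1 step for a Bernoulli RBM: a single Gibbs chain
    v0 -> h0 -> v1 -> h1 replaces the intractable model expectation."""
    p_h0 = sigmoid(v0 @ W + b_h)                      # P(h = 1 | v0)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    p_v1 = sigmoid(h0 @ W.T + b_v)                    # reconstruction
    p_h1 = sigmoid(p_v1 @ W + b_h)                    # P(h = 1 | v1)
    # positive phase minus negative phase, applied as a gradient step
    W += lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))
    b_v += lr * (v0 - p_v1)
    b_h += lr * (p_h0 - p_h1)
```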

The weights and biases of each RBM are first trained from bottom to top based on $\ell(\theta \mid S)$ using the CD algorithm. They are then fine-tuned using the backpropagation (BP)8 algorithm, followed by the long-term and short-term forecasting procedures. In this paper, long-term forecasting means predicting workloads one or more days ahead, while short-term forecasting means predicting workloads one or more hours or minutes ahead.

The automatic ARIMA fitting by Hyndman and Khandakar [14] generates the orders of the three parts, i.e., the autoregressive, integrated, and moving average orders, which are (5, 1, 2) for the original data and (4, 0, 5) for the differenced data, respectively. None of these orders is larger than 5, which guides the design of the DBN.
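For reference, this automatic order selection can be reproduced with pmdarima, a Python port of Hyndman and Khandakar's procedure [14]; the trace file name below is a hypothetical placeholder:

```python
import numpy as np
import pmdarima as pm   # implements auto.arima-style order selection

workload = np.loadtxt("ram_workload.txt")        # hypothetical trace export
model = pm.auto_arima(workload, seasonal=False)  # search over (p, d, q) orders
print(model.order)   # e.g. (5, 1, 2) on the original data, per the text
```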

When multiple factors, each with multiple levels, are considered in a model, conducting a fully covered series of experiments may take rather extensive effort. For example, $4^3 = 64$ experiments would be needed for a three-factor model with 4 levels per factor. It is essential, therefore, to use a method that reduces the number of experiments without significantly affecting the results. Several methods achieve this, such as uniform experimental design and orthogonal experimental design (OED)9 [15]. OED guarantees that the effect of one factor or interaction can be estimated separately from the effect of any other factor or interaction in the model. One key step in OED is choosing the factors and the number of levels of each factor. The factors and levels are then arranged into one of the OED tables, denoted by $L_a(b^c)$, where $a$, $b$ and $c$ represent the number of experiments to be conducted, the number of levels of each factor, and the number of factors, respectively. A DBN can be affected by several factors such as input nodes, hidden nodes, learning rate, and training sample size. We employ orthogonal experimental design to find an optimal parameter set for the DBN. In this paper we focus on the following 4 factors: input nodes (IN), which represents the number of neurons in the input layer of a DBN; learning rate (LR); training sample size (TSS); and RBM numbers (RN); and we build an $L_9(3^4)$ OED table. Each factor has 3 levels, as shown in Table 1.

TABLE 1. Levels of the Four Factors With OED for the DBN.

8https://en.wikipedia.org/wiki/Backpropagation
9https://en.wikipedia.org/wiki/Principle_of_orthogonal_design

Two of the considered factors affect the structure of a DBN: the input nodes and the RBM numbers. The number of hidden nodes is rarely larger than double the number of input nodes, and Zhang et al. [16] suggest that input nodes are more important than hidden nodes in a neural network model built for prediction. Hence, we do not treat the hidden nodes as a factor, and set them equal to the input nodes. Given the input nodes and the RBM numbers, we can therefore construct the structure of a DBN. Figure 5 illustrates a DBN given 3 input nodes and 3 RBMs.

FIGURE 5. Constructing a DBN given the input nodes and the RBM numbers.

B. MEASUREMENTS ANALYSIS
The mean square error (MSE) measures the average of the squares of the ''errors'', that is, the differences between the predicted outcomes and the original data. It is employed by authors such as [16]–[19] to evaluate prediction performance. In this paper it is defined by

$$\mathrm{MSE}(t, s) = \frac{\sum_{i=t}^{t+s} (\hat{l}_i - l_i)^2}{s},$$

which computes the MSE of the subsequent $s$ time units starting from time $t$. $\hat{l}_i$ is the predicted load and $l_i$ is the observed load at time $i$. The short-term prediction is made at the hour level, and the long-term prediction at the day level.
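In code, this metric is a windowed mean of squared errors; a minimal sketch (averaging over the window in the conventional way):

```python
import numpy as np

def mse(predicted, observed, t, s):
    """MSE(t, s): mean squared error over the s time units starting at t."""
    err = np.asarray(predicted[t:t + s]) - np.asarray(observed[t:t + s])
    return float(np.mean(err ** 2))
```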

Suppose we train the DBN from the beginning of the observation to time $t_0$, and we want to start prediction at a future time $t_0 + t$; then the subsequent prediction interval is $[t_0 + t, t_0 + t + s]$, as illustrated in Figure 6.

FIGURE 6. Workload time series.

Our purpose is to see how the prediction accuracy behaves with $t$ and $s$, and also how it differs across the factors' levels. For the long-term prediction, we start prediction in 0/1/2/3 days (that is, $t = 0/1/2/3$) and take prediction intervals of 1/2/3 days (that is, $s = 1/2/3$). For the short-term prediction, we start prediction right after the end of the training-sample traces (that is, $t = 0$) and take prediction intervals of 1/2/3 hours (that is, $s = 1/2/3$).

Tables 3 and 4 show which values $t$ and $s$ can take for short-term and long-term prediction, respectively. Therefore each short-term experiment has 3 MSEs, while each long-term experiment has 9 MSEs. Note that in this paper we focus on the DBN's performance for predicting the Google cluster trace and consider 4 key factors in the OED. All other parameters, including those of the backpropagation, are set the same in all experiments conducted in this paper. Parameter details for each experiment are shown in Table 2.

TABLE 2. Parameters used in the prediction experiments.

Tables 3 and 4 show the prediction accuracy for the short-term and long-term RAM workloads of the Google cluster trace under the OED $L_9(3^4)$ methodology. There are 9 orthogonal experiments, and each is assigned one level of each factor. The target is to find the DBN with a set of good levels that generates the least prediction mean square error. We can see that the best solution for the short-term prediction is the 5th experiment, with 5 input nodes, 2 RBMs, learning rate 0.5, and 40% of the total samples for training; the best solution for the long-term case is also the 5th experiment. The least accurate performance is the first experiment in both short-term and long-term prediction, which may result from its fewer input nodes and single hidden layer. The fewer input nodes leave the DBN unable to memorize enough metric history to give a reasonable prediction, while a single hidden layer is too simple to model the high variance of the Google cluster trace.

Figure 7 shows a subrange of the prediction interval of the 9th experiment with a 2-hour span for the short-term prediction.

We compare our outcomes with those of an ARIMA predictor [20]. ARIMA (2, 3, 5) was chosen for comparison with our proposed DBN-based approach under the optimal level sets of the short-term and long-term cases, respectively, as shown in Figure 8. Compared with the ARIMA predictor, the short-term prediction shows an MSE reduction of 61%, and the long-term prediction a reduction of 67%.

We also try to find the main-effect factor and the best level set by analyzing the OED results shown in Table 5. The analysis method is the visual analysis method [21], [22] with a single evaluation index, because the MSE is the only index considered in this paper.

FIGURE 7. Comparison between predicted and original RAM workloads.

FIGURE 8. Comparing the proposed DBN with ARIMA (2, 3, 5).

The $i$th experiment is scored by

$$s(i) = \frac{\mathrm{Max}[\mathrm{AMSE}] - \mathrm{Min}[\mathrm{AMSE}]}{\mathrm{AMSE}(i)},$$

where $\mathrm{AMSE}(i)$ is the average mean square error of the $i$th experiment. For example, the AMSE of the 2nd experiment for the long-term prediction is $\mathrm{AMSE}(2) = 1.762 \times 10^{-5}$. $\mathrm{Max}[\mathrm{AMSE}]$ and $\mathrm{Min}[\mathrm{AMSE}]$ represent the maximum and minimum of all AMSEs for the short-term and long-term cases, respectively. For the short-term prediction, $\mathrm{Max}[\mathrm{AMSE}] = 5.454 \times 10^{-5}$ and $\mathrm{Min}[\mathrm{AMSE}] = 8.79 \times 10^{-6}$, while for the long-term case, $\mathrm{Max}[\mathrm{AMSE}] = 3.962 \times 10^{-5}$ and $\mathrm{Min}[\mathrm{AMSE}] = 9.28 \times 10^{-6}$. Although there is no uniform standard for scoring an experiment in the visual analysis, a reasonable score function should reflect the range among experiments and should make distinctions. So we take the difference between the maximum and the minimum of the AMSEs, and normalize it by dividing by the average mean square error of each experiment.
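Computing the scores is straightforward once the AMSEs are collected; a minimal sketch with hypothetical values:

```python
def oed_scores(amses):
    """s(i) = (Max[AMSE] - Min[AMSE]) / AMSE(i) for each experiment i."""
    spread = max(amses) - min(amses)
    return [spread / a for a in amses]

# Hypothetical example with three AMSEs (the paper uses nine):
# oed_scores([3.962e-5, 1.762e-5, 9.28e-6]) -> [0.766, 1.722, 3.269]
```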

Table 5 shows the analysis results. $k_i$ represents the average score of each level of every factor. $R_i$ is the data range of each factor $i$, reflecting the factor's importance.

TABLE 3. Short-term prediction accuracy of RAM workloads ($10^{-5}$).

TABLE 4. Long-term prediction accuracy of RAM workloads ($10^{-5}$).

The factors' importance rank can be obtained by sorting their ranges, so that we can find the main-effect factor. The results in Table 5 verify the correctness of the experiment results in Tables 3 and 4, where the best solutions match. The main-effect factor is the number of RBMs in both short-term and long-term predictions. The best solution for predicting the RAM workloads is the level set [IN = 5, RN = 2, LR = 0.5, TSS = 40%].

C. DISCUSSION
After conducting the prediction experiments and comparing our proposed method with the ARIMA predictor on the Google cluster trace, we observe that:

• The proposed method achieves an MSE in the range $[10^{-6}, 10^{-5}]$, which is quite good prediction accuracy.

• After training, running the DBN with the obtained optimized parameters takes only around 30 ms, which is quite acceptable for real-time situations.

IV. EVALUATION ON DCNN-BASED OBJECT RECOGNITION AND TRACKING
For the second evaluation, we experiment with retrieving vehicle types from traffic videos and then tracking a specific vehicle.

A. PREPARING VIDEO DATA
To build the recognition dataset, we collected 7 videos using mounted cameras in Qingdao city. Each video is about 2 GB, and the frame size is 1920 × 1080 pixels. To prepare the training and validation dataset, we select 5 videos to generate vehicle images, which serve as the training and validation data; the other 2 videos are used to evaluate our approach. Vehicle images from these videos need to be cropped to fit the DCNN. After preprocessing and preparation, our dataset has 2500 different images. We increase the size of the dataset to 5000 by flipping all images. In this paper, we separate vehicle types into five classes: car, bus, mini-bus, truck, and motorcycle. All images in the training dataset have been pre-processed to grayscale and re-sized to 96 × 96 pixels.
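The described preprocessing (grayscale conversion, resizing to 96 × 96, and flip-based augmentation) can be sketched with OpenCV; the path handling is a hypothetical simplification:

```python
import cv2

def prepare_vehicle_image(path):
    """Grayscale, resize to 96x96, and return the flipped copy used to
    double the dataset, as described above."""
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (96, 96))
    flipped = cv2.flip(small, 1)   # horizontal flip for augmentation
    return small, flipped
```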

TABLE 5. Visual analysis results for RAM workloads.

B. VEHICLE TYPE RECOGNITION IN VIDEOS
First, we briefly introduce CNNs. A CNN consists of several layers of three types [5]:

Convolutional: Convolutional layers consist of a rectangular grid of neurons and require that the previous layer also be a rectangular grid of neurons. Each neuron takes inputs from a rectangular section of the previous layer; the weights for this rectangular section are the same for each neuron in the convolutional layer.

Max-Pooling: The pooling layer takes small rectangular blocks from the convolutional layer and subsamples each block to produce a single output. The pooling can be done with the average, the maximum, or a learned linear combination of the neurons in the block. Max-pooling layers take the maximum of the block they are pooling.

Fully-Connected: After several convolutional and max-pooling layers, the high-level reasoning in the neural network is done via fully connected layers. A fully connected layer takes all neurons in the previous layer and connects them to every single one of its own neurons.

We design our DCNN by borrowing the DCNN structure of [5] with some simplification, to suit our own dataset and problem. Table 6 shows the architectural design of the DCNN. The first three layers are convolutional layers, each followed by max pooling, and the following two layers are fully connected. In layers 1 to 3, a neuron's output f as a function of its input x uses the non-saturating nonlinearity f(x) = max(0, x) [6], while layer 4 uses f(x) = tanh(x). Softmax regression is used on top of the network to perform vehicle type classification.

The training database is composed of grayscale images of the standard size of 96 × 96 pixels. These pre-processed images are fed into the trained DCNN to recognize the vehicle type, and the recognition results are returned to the client, which can draw bounding boxes and show the vehicle type information in the video in real time.

Figure 9 shows the outputs of each layer. We can see that each convolutional layer acquires vehicle features, especially edges. As the depth of the layers increases, the convolutional layers gain higher-level feature abstraction capabilities.

FIGURE 9. Output of each layer of vehicle type recognition DCNNs.

Using the aforementioned dataset to train the DCNN, we achieve an accuracy of 89.5% for vehicle type recognition. Note that a large number of images in this dataset do not contain whole contours of vehicles, but just parts of contours.

The DCNN-based approach can handle 16 fps using a dated K1000M GPU. Figure 10 shows the results of vehicle type recognition in video, in two groups. Figures 10a, 10b and 10c show that, with some appropriate morphological image processing steps, the traditional three-frame difference approach performs well on traffic video in most cases, and we do not require that images contain the whole vehicle. Even though the image in a bounding box contains only part of a vehicle, the network can still recognize the vehicle type. Figures 10d, 10e and 10f show some failure cases. From Figures 10d and 10e we can see that some parts of vehicles cannot be recognized by the DCNN; the reason is that the training dataset is not large enough, so some images from different angles cannot be classified accurately.

TABLE 6. Architecture design for DCNNs used for vehicle type recognition.

FIGURE 10. Vehicle detection and type recognition from video using the DCNN.

TABLE 7. Accuracy, performance, and deep learning network scale.

Figure 10f shows that there are also some limitations in the vehicle detection in this work, caused by the image processing steps after the three-frame difference, such as dilation and erosion, which can blur the contours of different vehicles.

C. MORE THOROUGH TESTS
To further evaluate the feasibility of using a DCNN for real-time retrieval of raw contexts from big video data, we conducted extensive evaluations with different image sizes, network scales, and training times and iterations, recording the corresponding accuracy and the time taken to recognize objects in video. These evaluations are shown in Table 7. From this table, we can see that for real-time deep awareness, the size of the images and the corresponding deep learning network should be kept at a moderate scale.

D. OBSERVATION
We have done comprehensive tests on the accuracy and performance of using a DCNN for detecting targets in video data. The evaluations show that we can achieve a good recognition rate, and that online object recognition takes only a few milliseconds.

V. RELATED WORK
Context-awareness has been an important research topic for over a decade, ranging from the modeling and reasoning of contexts to the retrieval of contexts, for example, how to model fuzzy contexts [2] and how to use hybrid artificial intelligence techniques to get better reasoning capabilities [3], [4]. This existing research focuses mainly on situations where the initial raw contexts can be retrieved relatively easily and the data involved are not big.

Deep learning has recently been used extensively in natural language processing, image processing, voice recognition, and so on, and is developing rapidly, propelled not only by academia but also by many companies including Google, Microsoft, and Baidu. In [23], the authors presented a vehicle recognition approach for a real transportation surveillance system using sparse coding. The dataset they used consists of 2520 images, equally distributed over 4 classes: car, bus, motor, and minibus. They compared sparse coding with the conventional histogram of oriented gradients (HOG), showing that the sparse-coding learned feature is better than the HOG feature in such vehicle recognition applications. Their work is similar to ours, but they do not target traffic video frame sequences. In addition, DCNNs perform better than sparse coding in image recognition, especially for large image datasets.

OverFeat [24] is a CNN framework for classification, localization and detection; it is used in [25] with a few minor modifications to its labels in order to handle occlusions of cars, predict lanes, and accelerate inference. We will extend our work to consider this kind of video, shot by car-mounted cameras, to obtain such contexts.

A Bayes-based model is proposed in [19] to predict host loads; however, the prediction accuracy is not good. SVR with Kalman filter preprocessing is proposed in [26] to predict Google cluster metrics. A limitation of the support vector approach is that the choice of kernel depends heavily on domain specifics, and adjusting the kernel parameters is very difficult. Roy et al. develop a model-predictive algorithm (ARIMA model) [27] for workload forecasting, which is then used for resource auto-scaling. From our experience, the ARIMA model is inferior in fitting highly variant patterns like the Google cluster trace, as shown above.

VI. CONCLUSIONS AND FUTURE WORK
Context-awareness for big data should be more in-depth than that of classical applications, especially for video data with intrinsic complexities. In this paper we proposed an in-depth context-awareness framework for a pervasive video cloud, based on deep learning techniques such as the deep belief network and the deep convolutional neural network, in order to know the underlying contexts in big video data. We evaluated the effectiveness of the proposed approach by measuring the accuracy of cloud-node workload prediction and of real-time target recognition in video. The evaluations show that deep learning is an efficient and effective way to retrieve raw contexts from big video data, and that the proposed in-depth context-awareness is usable in terms of accuracy and performance for retrieving contexts in video.

In the future, we will add more deep learning techniques to this framework, to make it capable of achieving complete deep awareness. In addition, a larger-scale video data set is being prepared, which will improve the existing data set and help raise the recognition rate of the DCNN.

REFERENCES
[1] W. Zhang, K. M. Hansen, and P. Bellavista, ''A research roadmap for context-awareness-based self-managed systems,'' in Proc. Int. Service-Oriented Comput. (ICSOC Workshops), 2013, pp. 275–283.

[2] J. Zhao, H. Boley, and W. Du, ''A fuzzy logic based approach to expressing and reasoning with uncertain knowledge on the semantic Web,'' in Computational Intelligence. New York, NY, USA: Springer-Verlag, 2012, pp. 167–181.

[3] G. Nakamiti, V. E. da Silva, J. H. Ventura, and S. A. da Silva, ''Urban traffic control and monitoring—An approach for the Brazilian intelligent cities project,'' in Practical Applications of Intelligent Systems. New York, NY, USA: Springer-Verlag, 2012, pp. 543–551.

[4] W. Zhang, K. M. Hansen, and T. Kunz, ''Enhancing intelligence and dependability of a product line enabled pervasive middleware,'' Pervasive Mobile Comput., vol. 6, no. 2, pp. 198–217, Apr. 2010.

[5] G. E. Hinton and R. R. Salakhutdinov, ''Reducing the dimensionality of data with neural networks,'' Science, vol. 313, no. 5786, pp. 504–507, Jul. 2006.

[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ''ImageNet classification with deep convolutional neural networks,'' in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.

[7] K. Simonyan and A. Zisserman. (2014). ''Very deep convolutional networks for large-scale image recognition.'' [Online]. Available: http://arxiv.org/abs/1409.1556

[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik, ''Rich feature hierarchies for accurate object detection and semantic segmentation,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014, pp. 580–587.

[9] G. E. Hinton, S. Osindero, and Y.-W. Teh, ''A fast learning algorithm for deep belief nets,'' Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.

[10] S. Ji, W. Xu, M. Yang, and K. Yu, ''3D convolutional neural networks for human action recognition,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221–231, Jan. 2013.

[11] C. Reiss, J. Wilkes, and J. L. Hellerstein, ''Google cluster-usage traces: Format + schema,'' Google Inc., Mountain View, CA, USA, Tech. Rep., 2011. [Online]. Available: https://drive.google.com/file/d/0B5g07T_gRDg9Z0lsSTEtTWtpOW8/view

[12] E. S. Gardner, Jr., and E. McKenzie, ''Note—Seasonal exponential smoothing with damped trends,'' Manage. Sci., vol. 35, no. 3, pp. 372–376, 1989.

[13] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer-Verlag, 2006.

[14] R. J. Hyndman and Y. Khandakar, ''Automatic time series forecasting: The forecast package for R,'' J. Statist. Softw., vol. 27, no. 3, pp. 1–22, 2008.

[15] J. Zurovac and R. Brown, ''Orthogonal design: A powerful method for comparative effectiveness research with multiple interventions,'' Center Healthcare Effectiveness, Math. Policy Res., Washington, DC, USA, Tech. Rep., 2012, pp. 12–30. [Online]. Available: http://www.mathematica-mpr.com/~/media/publications/PDFs/health/orthogonaldesign_ib.pdf

[16] G. P. Zhang, B. E. Patuwo, and M. Y. Hu, ''A simulation study of artificial neural networks for nonlinear time-series forecasting,'' Comput. Oper. Res., vol. 28, no. 4, pp. 381–396, Apr. 2001.

[17] P. A. Dinda and D. R. O'Hallaron, ''Host load prediction using linear models,'' Cluster Comput., vol. 3, no. 4, pp. 265–280, Dec. 2000.

[18] T. Kuremoto, S. Kimura, K. Kobayashi, and M. Obayashi, ''Time series forecasting using a deep belief network with restricted Boltzmann machines,'' Neurocomputing, vol. 137, pp. 47–56, Aug. 2014.

[19] S. Di, D. Kondo, and W. Cirne, ''Host load prediction in a Google compute cloud with a Bayesian model,'' in Proc. IEEE/ACM 24th Int. Conf. High Perform. Comput., Netw., Storage Anal. (SC), Nov. 2012, pp. 1–11.

[20] J. D. Hamilton, Time Series Analysis, vol. 2. Princeton, NJ, USA: Princeton Univ. Press, 1994.

[21] J. W. Creswell and V. L. P. Clark, Designing and Conducting Mixed Methods Research, 2nd ed. New York, NY, USA: Sage Publications, 2010.

[22] T. A. Matyas and K. M. Greenwood, ''Visual analysis of single-case time series: Effects of variability, serial dependence, and magnitude of intervention effects,'' J. Appl. Behavior Anal., vol. 23, no. 3, pp. 341–351, 1990.

[23] S. Zeng, X. Niu, and Y. Dou, ''Vehicle recognition for surveillance video using sparse coding,'' in Pattern Recognition. New York, NY, USA: Springer-Verlag, 2014, pp. 228–234.


[24] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. (2013). ''OverFeat: Integrated recognition, localization and detection using convolutional networks.'' [Online]. Available: http://arxiv.org/abs/1312.6229

[25] B. Huval et al. (2015). ''An empirical evaluation of deep learning on highway driving.'' [Online]. Available: http://arxiv.org/abs/1504.01716

[26] R. Hu, J. Jiang, G. Liu, and L. Wang, ''Efficient resources provisioning based on load forecasting in cloud,'' Sci. World J., vol. 2014, Feb. 2014, Art. ID 321231.

[27] N. Roy, A. Dubey, and A. Gokhale, ''Efficient autoscaling in the cloud using predictive models for workload forecasting,'' in Proc. IEEE Int. Conf. Cloud Comput. (CLOUD), Jul. 2011, pp. 500–507.

WEISHAN ZHANG was an NSTB Post-Doctoral Research Fellow with the Department of Computer Science, National University of Singapore (2001–2003). He was an Associate Professor with the School of Software Engineering, Tongji University, Shanghai, China (2003–2007), and a Visiting Scholar with the Department of Systems and Computer Engineering, Carleton University, Canada (2006–2007). He was a Research Associate Professor/Senior Researcher with the Computer Science Department, University of Aarhus (2010), where he was involved in the EU FP6 Hydra pervasive middleware project as a Technical Manager (2008–2009). He is currently a Full Professor and the Deputy Head of Research with the Department of Software Engineering, School of Computer and Communication Engineering, China University of Petroleum. He is the Director of the Big Data Intelligent Processing Innovation Team of Huangdao District, the Director of the Smart City Research Center with Fuwode Electronic Corporation, and the Founding Director of the Autonomous Service Systems Laboratory. He has an h-index of 13 according to Google Scholar, and his total number of citations is around 490 as of 2015.

PENGCHENG DUAN received the bachelor's degree from the China University of Petroleum in 2014, where he is currently pursuing the master's degree with a focus on big data processing and software architecture. He has authored two papers using Storm for image recognition.

ZHONGWEI LI received the Ph.D. degree from the China University of Petroleum, Qingdao, China, in 2011. He is currently an Associate Professor with the Department of Computer Applications, China University of Petroleum. His research interests include big data processing for petroleum engineering, software architecture, and so on.

QINGHUA LU received the Ph.D. degree from the University of New South Wales in 2013. She is currently a Lecturer with the Department of Software Engineering, China University of Petroleum, Qingdao, China. Her research interests include software architecture, dependability of cloud computing, and service engineering.

WENJUAN GONG received the Ph.D. (cum laude) degree from the Autonomous University of Barcelona in 2013. She was a Post-Doctoral Research Assistant with Oxford Brookes University in 2014. She participated in the European project Consolider Ingenio 2010 and the EPSRC project Tensorial Modeling of Dynamical Systems for Gait and Activity Recognition. She has led the CCF-Tencent Funds and the Natural Science Foundation of Shandong Province. She is currently a Lecturer with the China University of Petroleum. Her research interests include computer vision and machine learning.

SU YANG is currently a Professor with the Department of Computer Science and Engineering, Fudan University. His research interests are mainly pattern recognition, social computing, machine vision, and data mining. He is the PI of a number of NSFC projects, including the Graphical Symbol Recognition in Natural Scenes and the Detection of Abnormal Collective Behaviors via Movement and Communication Pattern Analysis.
