AAL Data Cluster Analysis
BACHELOR'S THESIS
submitted in partial fulfilment of the requirements for the academic degree of
Bachelor of Science
in the programme
Media Informatics and Visual Computing
by
Dzenan Hamzic, matriculation number 0327029
at the Faculty of Informatics
of the Technische Universität Wien
Supervisor: Ao.Univ.Prof. Dipl.-Ing. Dr.techn. Wolfgang Zagler
Assistance: Projektass. Dipl.-Ing. Peter Mayer
Vienna, 28 June 2016
Dzenan Hamzic Dipl.Ing. Peter Mayer Dr.techn. Wolfgang Zagler
Declaration of Authorship
Dzenan Hamzic Stavangergasse 4/7/3 1220 Wien [email protected]
I hereby declare that I have written this thesis independently, that I have fully listed all sources and aids used, and that I have in every case marked as borrowed, with an indication of the source, all parts of the work (including tables, maps and figures) that were taken from other works or from the Internet, whether verbatim or in substance.

Vienna, 28 June 2016 Dzenan Hamzic
Acknowledgments
First and foremost I would like to thank Amra Omeragic for her patience and loving support throughout the long writing process.
I would like to thank Peter Mayer and Wolfgang Zagler for their contributions and for steering this project with their advice, ideas and revisions.
I must also thank my mother and father, Nevresa and Enver Hamzic, for all of their patience, support and love, and for being my biggest fans.
Finally, I would like to thank all those who stood by me in good and bad times.
Dzenan Hamzic
Abstract
The eHome project from the Vienna University of Technology [1] is an R&D project with the goal of providing assistive technologies for the private households of older people, with the idea of enabling them to live longer and independently in their own homes. The eHome system consists of an adaptive, intelligent network of wireless sensors for activity monitoring with a central context-aware embedded system [2].

The primary goal of this thesis is to investigate unsupervised prediction and clustering possibilities for user behaviour based on collected time-series data from infrared temperature sensors in the eHome environment. Three different prediction approaches are described: the Hourly Based Event Binning approach is compared to two clustering algorithms, Hierarchical Clustering and the Dirichlet Process GMM. Prediction rates are measured on data from three different test persons.

This thesis first examines two different approaches for event detection in infrared signal data. In a second stage, three different methods for unsupervised prediction analytics are discussed and tested on selected datasets. The clustering algorithms' parameter settings for time-series data are also discussed and tested in detail. Finally, the prediction performance results are compared and each method's advantages and disadvantages are discussed.

The practical part of this thesis is implemented in an IPython notebook, using Python 2.7 on 64-bit Ubuntu Linux 12.04 LTS. Data analysis is implemented with Python's Pandas library. Visualisations are made with the Matplotlib and Seaborn libraries.

The results reveal that prediction accuracy depends on data quantity and the spread of the data points. The simplest method in the prediction comparison, Hourly Based Binning, has nevertheless given the best prediction rates overall. In contrast to Hourly Based Binning, Dirichlet Process Gaussian Mixture Model clustering shows its best prediction performance on smaller training datasets with well spread data. With further parameter tuning, the prediction rates of Dirichlet Process GMM clustering could be improved further, coming very close to or even outperforming Hourly Based Binning. Due to the unknown distribution and the well spread data, choosing the right threshold parameter for Hierarchical Clustering was trickier than initially assumed; despite the initial assumptions, this method proved the least applicable for unsupervised prediction analytics on the datasets used.
Contents

Declaration of Authorship
Acknowledgments
Abstract
1. Introduction
    1.1. Motivation
    1.2. Outline
    1.3. The Goal
    1.4. Methodology
2. Theory Section
    2.1. Gaussian Mixture Model
    2.2. Dirichlet Process GMM
    2.3. Hierarchical Clustering
3. Implementation
    3.1. Software Specification
    3.2. Data and Sensors
    3.3. Data Visualisation and Inspection
    3.4. Sensor Values Discretisation and Extraction
    3.5. Unsupervised Event Extraction
    3.6. Data Structure for Event Analysis
    3.7. Data Quality and Quantity
    3.8. Predictive Analysis
        3.8.1. Hourly Binning Analysis
        3.8.2. Clustering Analysis
            3.8.2.1. Hierarchical Clustering Analysis
            3.8.2.2. Dirichlet Process Gaussian Mixture Model Clustering Analysis
4. Summary
    4.1. Results
    4.2. Discussion
Literature
List of Figures
List of Tables
List of Abbreviations
1. Introduction
We are living in times of "smart homes". Sensors register our every movement. We can use these data in various ways, for example to prevent leakage, disasters and robbery, or to automate doors, lights and so on. On the other hand, we can use these data to support older people by detecting whether they are behaving differently than usual, in order to trigger some kind of help, like sending an emergency call or cutting the power supply to prevent a disaster. The eHome project from the Vienna University of Technology [1] is an R&D project with the goal of providing assistive technologies for the private households of older people, with the idea of enabling them to live longer and independently in their own homes.

1.1 Motivation

Most people want to be supported by their loved ones when they are old. At the same time, they do not want to be a burden to them. There are many cases where such people live alone. Many of them have difficulties walking and are prone to falling without the possibility of getting up again or calling for help. As another example, if a cooking plate stays on without being noticed, the house could be set on fire. In order to prevent such a disaster, it is a goal to detect events like these. Once detected, it is possible to trigger an alarm, to remind the person to turn off the plate, or to send someone to help the person get up. For these reasons and many others, there is a need for a system like eHome that detects unusual and dangerous behaviour and supports older people with assistive technologies.

1.2 Outline

Multiple wireless sensors of different types, like the Temperature Sensor, IIR (Infrared Temperature Sensor), REED (Magnetic Contact Sensor), PIR (Passive Infrared Sensor) etc., are placed in the house. The data from all sensors are recorded and analysed on eHome's central unit. Events like cooking and leaving the flat must first be correctly recognised in the sensor data. Having recognised and extracted the events, it can be investigated further whether any behavioural patterns or event accumulations can be recognised in the data, which would then be useful in predicting the future behaviour of the person.

1.3 The Goal

The goal is to have solutions that are as reliable as possible for detecting the typical times of a person's behaviour, like cooking, without any supervised interventions. If the person usually cooks in the intervals between 10 AM and 1 PM and between 3 PM and 5 PM, it should be possible to find such time intervals. It should also be possible to obtain a probability of an event occurring in any given hour or time span.
1.4 Methodology

Three methods for possible event prediction are going to be discussed. The Hourly Binning of detected events is the simplest solution; it returns event probabilities for every given hour. The two clustering approaches, the Dirichlet Process Gaussian Mixture Model and Hierarchical Clustering, are used for finding event accumulations in time spans. Prediction possibilities are measured by dividing each test person's dataset 50:50 into training and testing datasets and then comparing them for event hits and misses.
2. Theory

Every person is different and behaves differently. Some people normally cook at 12 PM, others at 6 AM. Some people cook only once a day while others cook three or even four times a day. Some people don't cook at all. This leads to the conclusion that no fixed number of cooking events can be assigned to each person. With this in mind, the eHome system must be able to find the number of events taking place in the homes of older people without supervision. The clustering algorithms chosen in this section are later used for finding event accumulations and for event clustering on time-series data from the eHome project's infrared sensors. The Dirichlet Process Gaussian Mixture Model and Hierarchical Clustering need no fixed number of clusters, which makes them appropriate for the unsupervised discovery of potential behavioural patterns.

2.1 Gaussian Mixture Model

Common and important clustering techniques are based on probability density estimation using the Gaussian Mixture Model (GMM) and Expectation Maximisation. The K-means algorithm uses just single points in feature space as cluster centres; each data point is then assigned to the nearest cluster using the Euclidean distance from the cluster centre. Using only the Euclidean distance, K-means is not well suited for overlapping data points in feature space or for clusters that do not form a convex, circular shape. (Convex sets: in Euclidean space, an object is convex if for every pair of points within the object, every point on the straight line segment that joins them is also within the object [3].) K-means is often viewed as a special case of a Gaussian Mixture Model. GMM models can be seen as an extension of K-means in which clusters are modelled with Gaussian distributions, using not only their means but also covariances that describe their ellipsoid shapes. The covariance parameter in GMM can be constrained to spherical, diagonal, tied or full.
Figure 2.1 Gaussian Mixture Model density estimation [4].
Figure 2.1 shows 3 Gaussian components. Each component is described as a Gaussian distribution, so each has a mean μ, a variance or covariance Σ and a "size" π. The mean is responsible for the distribution shift, the variance determines how wide or narrow the component is, and π is the component's height. The goal of performing GMM clustering on a dataset is to find the model parameters (the mean and variance of each component) so that the model fits the data as well as possible. This best-fit estimation usually translates into maximising the likelihood of the data under the GMM model. Likelihood maximisation is performed by the Expectation Maximisation (EM) algorithm [B1]. The EM algorithm proceeds iteratively in 2 steps. The Expectation step treats each Gaussian component's mean, covariance and size as fixed; for each data point i and each cluster c, the probability R_ic of that data point belonging to cluster c is computed. If R_ic is low, data point i does not belong to cluster c; R_ic = 1 means the data point is fully explained by that single cluster. The Maximisation step starts from the assigned R_ic probabilities and updates the Gaussian components' parameters mean, covariance and size; for each cluster c, the parameters are updated using the R_ic probabilities as estimated weights. Each iteration increases the log-likelihood of the model. Prior knowledge of the cluster number is assumed for this clustering algorithm. Advantages:
- Fastest algorithm for mixture model learning
- No cluster shape and size limits
Disadvantages:
- Bad performance on small datasets
- Fixed number of components
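As a minimal sketch of the EM fitting just described, a GMM can be trained on a toy one-dimensional dataset with scikit-learn (the class is named GaussianMixture in current releases; the scikit-learn 0.x versions contemporary with this thesis called it GMM). The data values here are invented for illustration:

import numpy as np
from sklearn.mixture import GaussianMixture

# two artificial accumulations of event times (seconds from midnight)
rng = np.random.RandomState(0)
X = np.concatenate([rng.normal(9 * 3600, 1800, 40),
                    rng.normal(17 * 3600, 2400, 60)]).reshape(-1, 1)

# EM fitting: the E-step computes the responsibilities R_ic, the M-step
# re-estimates each component's mean, covariance and size pi from them
gmm = GaussianMixture(n_components=2, covariance_type='full').fit(X)
print(gmm.means_.ravel())   # component means (seconds from midnight)
print(gmm.weights_)         # component sizes pi
labels = gmm.predict(X)     # hard assignment of each point to a component

The covariance_type argument corresponds to the spherical/diagonal/tied/full constraint mentioned above.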
2.2 Dirichlet Process Gaussian Mixture Model

As noted earlier, the Gaussian Mixture Model and K-means algorithms assume a fixed number of components to be found, but in most real-world problems the data are unstructured and no exact conclusion on the data points' distribution can be made in advance. The DPGMM is a Gaussian Mixture Model variant where no prior knowledge of the cluster number is necessary. It uses a maximum number of clusters parameter as an upper bound on the number of components to be found. Setting this parameter to e.g. five should find all clusters present in the data, up to a maximum of five: the algorithm should not simply split the data into five components but deliver the real cluster number and cluster the data accordingly. This is illustrated in the diagram below.
Figure 2.2 Gaussian Mixture Model and Dirichlet Process GMM, both initialised with 5 components [5].

This upper bound parameter should be loosely coupled to the real cluster number. In comparison to the Gaussian Mixture Model, the Dirichlet Process GMM uses one additional parameter α (alpha) to specify the data points' concentration. The alpha parameter controls the number of components used to fit the data. Lowering the alpha parameter clusters the data points more tightly, as the expected number of clusters is alpha*log(N); raising it produces more clusters in any finite set of points. Given a low data quantity, the DPGMM tends to fit the data points to only a single component.
Depending on the data type and distribution, the DPGMM allows setting which cluster parameters are updated in the training process. It can be set up to update w (weights), m (means) and c (covariances), or any combination of the three. The Dirichlet Process can be explained by the "Chinese restaurant process", which satisfies the following properties [6][7]:
- "Rich get richer" property: the more people sit at a table, the higher the chance of a new person joining it.
- There is always a small probability that a newly entering person starts a new table.
- The probability of a new group is set by the concentration parameter alpha.
Tables in the Chinese restaurant paradigm correspond to components of the GMM. Advantages:
- Relatively stable (no big changes with small parameter tuning)
- Less tuning needed
- No need to specify the component number (only a loose upper bound)
Disadvantages:
- The Dirichlet Process makes inference slower (though not by much)
- Implicit biases (sometimes it is better to use a finite mixture model such as the GMM)
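A minimal sketch of DPGMM usage follows, assuming the current scikit-learn API, where the Dirichlet process variant is exposed as BayesianGaussianMixture (older releases shipped a dedicated DPGMM class). The data are invented for illustration:

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.RandomState(0)
X = rng.normal(9 * 3600, 1800, 100).reshape(-1, 1)    # one true accumulation

dpgmm = BayesianGaussianMixture(
    n_components=5,                                   # loose upper bound only
    weight_concentration_prior_type='dirichlet_process',
    weight_concentration_prior=0.1,                   # alpha: lower -> fewer clusters
    max_iter=500).fit(X)
print(dpgmm.weights_.round(3))  # weights near zero mark components the DP switched off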
2.3 Hierarchical Clustering

Hierarchical methods need no cluster number and no cluster seed specification. Hierarchical clustering methods produce a nested series of clusters, trying to capture the underlying data structure by constructing a tree of clusters. Two hierarchical approaches are possible [B2]. In the bottom-up approach, every data object starts as a cluster by itself; nearby clusters are iteratively merged into bigger clusters until all clusters are merged into a single cluster at the highest hierarchy level or some stopping criterion is met. The top-down approach starts from one big cluster containing all data points at the highest level of the hierarchy; going towards the bottom of the hierarchy, this method repeatedly splits clusters into smaller and smaller ones until every data point is a cluster by itself or some stopping criterion is met. Depending on the data objects' distances, a threshold for flat cluster formation has to be set. Both approaches can use distance as a stopping criterion. Computing the distance between all data points in two clusters is an expensive operation, especially on big datasets. Therefore, the hierarchical clustering method offers multiple algorithms for computing distances between clusters.
- Single-link algorithm: computes the distance between the two nearest points, one in each cluster.
- Complete-link algorithm: computes the distance between the two furthest points, one in each cluster (the opposite of single-link).
- Centroid algorithm: computes the distance between the two cluster centre points.
- Average-link algorithm: computes the average distance over all pairs of data points, one from each cluster.
Advantages:
- Can provide more insight into the data (a possible cluster hierarchy)
- Simple to implement
- Can provide clusters at different levels of granularity
Disadvantages:
- No reassignment of data objects to other clusters
- Time complexity O(n³)
- The distance matrix requires O(n²) memory space
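The following sketch shows the bottom-up (agglomerative) variant with SciPy, which is also the implementation used later in this thesis; the input values are invented for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# six event times in seconds from midnight, one feature per event
times = np.array([31200.0, 31500.0, 32100.0,
                  61800.0, 62400.0, 77400.0]).reshape(-1, 1)

Z = linkage(times, method='single')                  # single-link merge distances
# cut the tree into flat clusters: points closer than the threshold end up together
labels = fcluster(Z, t=1800, criterion='distance')
print(labels)                                        # e.g. [1 1 1 2 2 3]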
3. Implementation

3.1 Software Specification

The practical part of this thesis is implemented in IPython Notebook [8] on Ubuntu 12.04 LTS. The Python version is 2.7.10 with 64-bit Anaconda 2.3.0 [9]. Anaconda is a completely free scientific Python distribution; it includes more than 300 of the most popular Python packages for science, math, engineering and data analysis. NumPy is used for mathematical functions like transposing, rounding and others, Pandas for CSV data table management, scikit-learn for the Dirichlet Process GMM algorithm implementation, and SciPy for the Hierarchical Clustering implementation. Visualisations are made using the Matplotlib and Seaborn [10] libraries.
3.2 Data and Sensors

The eHome system consists of an adaptive, intelligent network of wireless sensors placed in the homes of older people for activity monitoring, with a central context-aware embedded system. In each home, the data are monitored by the central system in real time. The data subsets of the monitored test persons used in this thesis are collections of all sensor events recorded in a time frame of a few months.
Data events from wireless sensors like Accelerometers, Temperature Sensors, IIR (Infrared Temperature Sensor), REED (Magnetic Contact Sensor) and PIR (Passive Infrared Sensor) [2] are recorded to single CSV-formatted data files. Each line represents a single sensor event, and each new day begins with a new data file. The recorded data are comma-separated and have the following format:

year.month.day hour:min:sec, unix timestamp, milliseconds, sensor type, event type, event subtype, sensor ID, network ID, sensor value.

A CSV file looks as follows:
2010.05.15 00:03:28,1273881808,716,R,0,4,573769,173,1,22.600000,2130706433 2010.05.15 00:03:30,1273881810,74,R,0,6,573771,183,1,0.000000,2130706433 2010.05.15 00:05:55,1273881955,232,R,0,6,573805,174,1,0.000000,2130706433 2010.05.15 00:06:29,1273881989,272,R,0,6,573811,173,1,3.000000,2130706433 2010.05.15 00:06:51,1273882011,301,R,0,6,573817,178,1,3.000000,2130706433 2010.05.15 00:06:57,1273882017,581,R,0,6,573822,177,1,3.000000,2130706433 2010.05.15 00:07:04,1273882024,338,R,0,6,573824,181,1,3.000000,2130706433 2010.05.15 00:07:05,1273882025,162,R,0,6,573826,175,1,3.000000,2130706433 2010.05.15 00:07:22,1273882042,59,R,0,6,573828,166,1,3.000000,2130706433 2010.05.15 00:07:37,1273882057,313,R,0,3,573832,180,1,24.500000,2130706433 2010.05.15 00:07:42,1273882062,834,R,0,6,573834,171,1,0.000000,2130706433 2010.05.15 00:08:52,1273882132,638,R,0,3,573851,146,1,19.500000,2130706433 2010.05.15 00:09:05,1273882145,308,R,0,3,573857,175,1,21.500000,2130706433 2010.05.15 00:10:44,1273882244,775,R,0,6,573877,176,1,3.000000,2130706433 2010.05.15 00:10:52,1273882252,784,R,0,6,573880,146,1,0.000000,2130706433 2010.05.15 00:11:37,1273882297,607,R,0,6,573890,180,1,3.000000,2130706433 2010.05.15 00:11:55,1273882315,719,R,0,6,573897,182,1,0.000000,2130706433 2010.05.15 00:13:30,1273882410,808,R,0,6,573916,183,1,0.000000,2130706433 2010.05.15 00:13:31,1273882411,141,R,0,4,573918,173,1,22.600000,2130706433 2010.05.15 00:15:55,1273882555,966,R,0,6,573952,174,1,0.000000,2130706433 2010.05.15 00:16:31,1273882591,700,R,0,6,573959,173,1,3.000000,2130706433
Table 3.1. Sensor data sample from CSV file.
The marked lines indicate a temperature of 22.6 °C recorded by the infrared sensor with ID 173.
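A hedged sketch of reading such a file with Pandas follows; the file name and the column names are hypothetical, inferred from the sample rows above rather than taken from the eHome documentation:

import pandas as pd

cols = ['datetime', 'unix_ts', 'ms', 'sensor_type', 'event_type', 'event_subtype',
        'event_id', 'sensor_id', 'network_id', 'value', 'extra']
df = pd.read_csv('2010_05_15.csv', header=None, names=cols)
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y.%m.%d %H:%M:%S')

iir = df[df['sensor_id'] == 173]          # infrared sensor by the cooking plate
print(iir[['datetime', 'value']].head())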
3.3 Data Visualisation and Inspection
The first step of almost every data analysis is to get to know the data. In this case the data of the Infrared Temperature Sensor (placed by the cooking plate) and Magnetic Contact Sensors (placed on doors) are going to be inspected by visualising their values.
Figure 3.1 Cooking Sensor data overview.
As can be seen from figure 3.1, the cooking plate was active multiple times on 15.05.2010. It can be noticed that the temperature is not as high as usual on cooking plates; the reason is that the sensor is not placed directly on the cooking plate but to the side. By visually inspecting the cooking sensor temperature values, 3 big peaks and multiple small peaks can be seen. Such small peaks are of no interest for this thesis because they do not imply cooking: cooking usually takes longer than 10 minutes. Event length detection and filtering will be discussed in further sections. The IIR (infrared temperature sensor) cooking sensor delivers temperature values continuously. The minimal sending interval is 1 minute and the maximal sending interval is 10 minutes; sending is triggered by a temperature change of ±0.5 °C. This setup can be visually confirmed in Figure 3.2 below.
Figure 3.2 Infra Red Sensor sending intervals.
It can be seen that the sensor fired multiple times in the interval between 8:46 and 8:56, which indicates a strong increase in temperature and leads to the conclusion that the cooking plate was turned on.
Sensors like the PIR sensor (movement sensor) work with discrete values. PIR sensors record a one whenever movement is detected; otherwise nothing is recorded. Such discrete values are
easier to work with. The clustering algorithms can be fed with such values directly. Below is the visualisation of the Passive Infrared sensor values on 19 and 20 May 2010 in the home of Test Person 1.
Figure 3.3 PIR Sensor values visualisation for TP1.
Figure 3.3 shows clearly whether there was any movement in the house. The value gaps, which could indicate sleeping or being outdoors, can easily be filled with virtual sensor values for further processing. In order to do density estimations on passive infrared sensor time-series values, some kind of conversion to discrete values is needed. When working with time-series data, the density estimation clustering algorithms should be fed only with values from when the cooking plate is turned on; all other values from the cooking sensor can be filtered out of the dataset or set to 0.

Multiple steps are needed to achieve the discretisation of the values from the cooking-plate infrared sensor. There are multiple possible solutions to this issue, and some of them are going to be discussed in the next section of the thesis.
3.4 Sensor Values Discretisation and Extraction

As noted earlier, the IIR sensor values need discretisation. The idea is to implement some kind of sampling on the continuous values of the IIR cooking sensor. This is needed to be able to extract the cooking events from the rest of the sensor events in the dataset.

Given a discrete signal from the infrared temperature sensor, two approaches for cooking event extraction from a dataset are going to be discussed. The goal is reliable and correct detection of cooking events in a single dataset.
The first approach is the sequencing of temperature rises that belong together. The second approach is unsupervised event extraction using a clustering algorithm to group temperature increases that belong together; the latter is discussed in the next section. The three big peaks from Figure 3.1 need to be discretised, since only "heating in progress" on the cooking plate is of interest. This suggests that positive temperature increases should be inspected and leads to the conclusion that the first step of the temperature signal transformation should be a difference operation on the temperature values.

The result of the temperature difference operation is visualised in the following diagram.
Figure 3.4 Temperature differences.
Figure 3.4 clearly shows the temperature peaks. Small increases in temperature can be categorised as signal noise (±0.1 °C). This noise was present in most of the datasets and can clearly be seen e.g. between 03:00 and 06:00. Negative temperature differences can simply be filtered out of the new dataset. The positive "noise" in the temperature signal (+0.1 °C) also has to be filtered out in order to keep only the relevant temperature increases. This can be dealt with using two approaches.

First: taking the sensor's sensitivity of ±0.5 °C into consideration, every temperature difference value below that threshold could be set to 0.

Second: counting the absolute probability of every positive temperature difference and filtering out (setting to 0) all values below some threshold. This threshold could be set by finding the values which have probabilities of e.g. 15% or 20% in the difference dataset. Using this approach demands a detailed analysis of the sensor behaviour.
Below is the table of probabilities for every temperature difference.
0.0 °C: 62x, 34%
0.1 °C: 48x, 26%
0.2 °C: 4x, 2%
0.3 °C: 2x, 1%
0.4 °C: 3x, 1%
0.5 °C: 1x, 0%
0.6 °C: 22x, 12%
0.7 °C: 12x, 6%
0.8 °C: 8x, 4%
0.9 °C: 5x, 2%
1.0 °C: 2x, 1%
1.1 °C: 1x, 0%
1.2 °C: 2x, 1%
1.6 °C: 1x, 0%
1.9 °C: 1x, 0%
2.5 °C: 1x, 0%
2.9 °C: 1x, 0%
3.1 °C: 1x, 0%
3.3 °C: 1x, 0%
3.6 °C: 1x, 0%
5.8 °C: 1x, 0%

Figure 3.5 Absolute probabilities of temperature-value differences.
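A minimal sketch of the first (sensitivity-threshold) approach, assuming the temperature readings are available as a Pandas series; the values here are a short invented excerpt:

import pandas as pd

temps = pd.Series([22.6, 22.7, 23.2, 24.3, 24.2, 24.9],
                  index=pd.date_range('2010-05-15 08:46', periods=6, freq='2min'))

diffs = temps.diff().fillna(0.0)          # difference operation on temperature values
clean = diffs.where(diffs >= 0.5, 0.0)    # zero out negatives and the +/-0.1 C noise band
rising = clean > 0                        # binary column: heating in progress
print(clean[rising])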
Almost every provided dataset was noisy. One of the two above-mentioned methods can be used to obtain a clean signal. After successful implementation of this step, the dataset should no longer contain the sensor noise.
Figure 3.6. Filtered temperature differences.
The filtered signal can now be manipulated easily. One could add a new binary column to the new dataset which indicates the positive increases in temperature. This column would set ones in the rows which indicate that the cooking plate is turned on, isolating the needed signal events from the rest of the dataset. By plotting the temperature values marked as true in the new column, we get the following diagram:
Figure 3.7 Isolated cooking signals.
The three big cooking peaks from figure 3.1 are now isolated and clearly visible in figure 3.7. The newly created binary column contains the sampled cooking signal, which can be manipulated further.
Figure 3.8 Sampled temperature rises.
Figure 3.8 shows the data points where the temperature was rising. Some stand-alone points, like the one between 11:19:00 and 11:49:00, could indicate that the cooking plate was turned on only briefly. Such isolated single points indicate no cooking and should also be filtered out.

One possibility to find such single temperature increases is to check whether the data points are isolated relatively far from the others. If such points are not in the vicinity (duration-based) of other positive temperature-rise sequences, they can be filtered out; a simple check on the timing of the previous and succeeding positive temperature increases would be enough. Taking into account that some cooking plates "turn off" for short periods of time after reaching a certain temperature, and then turn on again, the threshold for filtering such single-standing events should be chosen carefully and not be too short.
If there is no positive temperature increase before or after such an isolated point, it can be set to "NO SEQUENCE EVENT", or -1 in the following example. The following listing demonstrates the idea of events that stand close to other positive temperature increases.
2010-05-15 09:32:29, -0.3, 27.7, 28.0, sequence_event: -1
2010-05-15 09:42:32, -0.4, 27.3, 27.7, sequence_event: -1
2010-05-15 09:47:39, 0.6, 27.9, 27.3, sequence_event: 0
2010-05-15 09:50:39, 0.7, 28.6, 27.9, sequence_event: 1
2010-05-15 09:51:40, 0.7, 29.3, 28.6, sequence_event: 2
2010-05-15 09:55:41, 0.6, 29.9, 29.3, sequence_event: 3
2010-05-15 09:57:41, 0.7, 30.6, 29.9, sequence_event: 4
2010-05-15 10:00:42, 0.8, 31.4, 30.6, sequence_event: 5
2010-05-15 10:04:43, 0.7, 32.1, 31.4, sequence_event: 6
2010-05-15 10:10:44, 0.8, 32.9, 32.1, sequence_event: 7
2010-05-15 10:17:46, 3.1, 36.0, 32.9, sequence_event: 8
2010-05-15 10:18:46, -1.9, 34.1, 36.0, sequence_event: -1   a) belongs to sequence.
2010-05-15 10:20:47, 0.9, 35.0, 34.1, sequence_event: 0
2010-05-15 10:21:47, 0.8, 35.8, 35.0, sequence_event: 1
2010-05-15 10:22:47, 2.5, 38.3, 35.8, sequence_event: 2
2010-05-15 10:23:49, -1.0, 37.3, 38.3, sequence_event: -1   b) belongs to sequence.
2010-05-15 10:25:49, 0.7, 38.0, 37.3, sequence_event: 0
2010-05-15 10:29:04, -0.9, 37.1, 38.0, sequence_event: -1
2010-05-15 10:31:04, -0.9, 36.2, 37.1, sequence_event: -1
Table 3.2. Sequencing cooking sensor events. (Datetime, temperature difference from previous event, current temperature, previous event temperature, is_in_sequence)

Table 3.2 demonstrates the idea of sequencing the positive temperature differences. Events a and b should be merged with the other sequences, since they are very close to the preceding and following sensor events. If they were e.g. 15 or 20 minutes away from the other positive increases, they could be filtered out. With the sensor events sequenced and the single temperature increases filtered out, one additional field can be added to the dataset indicating the sequence name. This is needed to add the functionality of programmatic selection of distinct cooking events from the dataset. The idea is to have some select distinct or group by functionality over the dataset. Table 3.3 demonstrates the idea.
2010-05-15 08:46:17, 0.6, 23.2, sequence_event: 0, sequence_name: A
2010-05-15 08:48:17, 1.1, 24.3, sequence_event: 1, sequence_name: A
2010-05-15 08:49:18, 0.6, 24.9, sequence_event: 2, sequence_name: A
2010-05-15 08:50:18, 0.6, 25.5, sequence_event: 3, sequence_name: A
2010-05-15 08:52:18, 1.0, 26.5, sequence_event: 4, sequence_name: A
2010-05-15 08:53:19, 0.6, 27.1, sequence_event: 5, sequence_name: A
2010-05-15 08:55:19, 0.9, 28.0, sequence_event: 6, sequence_name: A
2010-05-15 08:57:20, 0.9, 28.9, sequence_event: 7, sequence_name: A
2010-05-15 08:59:20, 0.6, 29.5, sequence_event: 8, sequence_name: A
2010-05-15 09:02:21, 0.7, 30.2, sequence_event: 9, sequence_name: A
2010-05-15 09:05:22, 0.6, 30.8, sequence_event: 10, sequence_name: A
2010-05-15 09:07:22, -0.7, 30.1, sequence_event: -1, nan
2010-05-15 09:10:23, -0.7, 29.4, sequence_event: -1, nan
2010-05-15 09:14:24, -0.7, 28.7, sequence_event: -1, nan
2010-05-15 09:22:27, -0.7, 28.0, sequence_event: -1, nan
2010-05-15 09:32:29, -0.3, 27.7, sequence_event: -1, nan
2010-05-15 09:42:32, -0.4, 27.3, sequence_event: -1, nan
2010-05-15 09:47:39, 0.6, 27.9, sequence_event: 0, sequence_name: B
2010-05-15 09:50:39, 0.7, 28.6, sequence_event: 1, sequence_name: B
2010-05-15 09:51:40, 0.7, 29.3, sequence_event: 2, sequence_name: B
2010-05-15 09:55:41, 0.6, 29.9, sequence_event: 3, sequence_name: B
2010-05-15 09:57:41, 0.7, 30.6, sequence_event: 4, sequence_name: B
2010-05-15 10:00:42, 0.8, 31.4, sequence_event: 5, sequence_name: B
2010-05-15 10:04:43, 0.7, 32.1, sequence_event: 6, sequence_name: B
2010-05-15 10:10:44, 0.8, 32.9, sequence_event: 7, sequence_name: B
2010-05-15 10:17:46, 3.1, 36.0, sequence_event: 8, sequence_name: B
Table 3.3 Naming the sequences.
(Datetime, temperature difference from previous event, current temperature, sequence event number, sequence name)
With the sequences named, the cooking events can be selected programmatically. The distinct cooking event sequences can then be visualised.
Figure 3.9 Programmatically selected cooking events.
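A minimal sketch of this select distinct idea, assuming a DataFrame events with 'datetime' and 'sequence_name' columns as in Table 3.3 (rows outside any sequence carry NaN and drop out):

summary = (events.dropna(subset=['sequence_name'])
                 .groupby('sequence_name')['datetime']
                 .agg(['min', 'max', 'count']))
summary['seconds'] = (summary['max'] - summary['min']).dt.total_seconds()
print(summary)   # one row per distinct cooking sequence: start, end, count, length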
Found events are summarised in the following table.
Sequence A B C D E
Minutes 21.1 31.1 3.0 4.0 7.0
Seconds 1265.1 1867.5 182.0 241.0 421.7
Start 08:46:17 09:47:39 10:20:47 12:29:56 12:36:58
End 09:05:00 10:17:46 10:22:47 12:32:57 12:42:59
Count events 11 9 3 4 2
Table 3.4. Programmatically selecting cooking sequence events.
Table 3.4 lists 5 distinct cooking events. Sequences B and C are very close to one another, as are sequences D and E. Looking at the start and end times of those sequences leads to the conclusion that such nearby sequences can be merged into larger ones, since they are separated from one another by only a few minutes.
The merging of sequences can be done programmatically, with some threshold variable to check the sequence distances, or by some clustering algorithm which groups data events that belong together. The latter is discussed in the next section.
3.5 Unsupervised Event Extraction

Another approach for extracting cooking events from datasets is to feed an unsupervised clustering algorithm with the sampled cooking data. An advantage of this method is that cooking sequences close to one another are merged automatically. The precondition for this approach is, as already mentioned, the transformation of the continuous data to a discrete format and the removal of single peaks in the temperature values.
Hierarchical clustering can be used for the unsupervised grouping of temperature-rise sequences. As noted earlier, some cooking plates turn off for a short period after reaching a certain temperature and then turn on again, producing multiple positive temperature increases close to one another. Hierarchical clustering merges such temperature-rise sequences nicely. It performs well on sparse data, which is the case for the positive temperature increases. Hierarchical clustering is discussed further in the theory section (see Chapter 2.3). The time-series data need to be converted before the hierarchical clustering algorithm can be fed with them.
The time dimension needs to be converted into a distance dimension. One possible solution is to convert the sensor timestamps into distances from the start of the day: a new column in the dataset can hold the seconds (plus milliseconds) delta of every sensor event from the beginning of the day.
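A short sketch of this conversion, assuming a DataFrame df with a parsed 'datetime' column as in the reading sketch of section 3.2:

midnight = df['datetime'].dt.normalize()                          # 00:00 of each event's day
df['day_delta'] = (df['datetime'] - midnight).dt.total_seconds()  # seconds since day begin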
With the cooking temperature values sampled and the delta times from the start of the day computed, hierarchical clustering can be tested. Using the algorithm's default parameters [11] with criterion = "distance", a threshold of 1000 seconds (≈16.7 min) and single linkage, which reduces computing time, the following clusters are found:
Figure 3.10 Hierarchical Clustering of sequenced datapoints.
Cluster 1 2 3
Minutes 11 21.1 34.2
Seconds 662.7 1265.1 2049.5
Start 12:29:56 08:46:17 09:47:39
End 12:42:59 09:05:22 10:22:47
Count events 6 11 12
Table 3.4. Hierarchical clustering cluster statistics.
Figure 3.10 and Table 3.4 show 3 bigger data-event groupings. All small temperature peaks are merged with those closest to them. This result can be checked against the three big peaks from Figure 3.1: as can be seen, the found clusters correspond exactly to all the cooking peaks in the dataset.

This method has been tested on all provided datasets and has given good results in recognising cooking events. The threshold of 1000 seconds with single linkage has given very good results in recognising and merging temperature increases in the event detection step.

Another example of multiple temperature increases close to one another, all of them indicating one and the same cooking event, is shown in the following diagram. The green cluster consists of multiple temperature increases; the hierarchical clustering algorithm does a nice job of grouping them together.
Figure 3.11 Hierarchical Clustering of sequenced datapoints.
Figure 3.11 shows a good example of a long cooking event in which the periodic turning off of the cooking plate is clearly visible.

Both methods of cooking event extraction have been tested on multiple datasets. They are reliable methods of cooking event detection and extraction; in the event detection step, each method can be used alone as well as in combination with the other. Hierarchical clustering provides great results in the step of merging shorter temperature increases with longer ones. Since the quantity of data is not huge, algorithmic complexity plays no big role in this case. It also reduces programming complexity and removes the need for a duration threshold between temperature increases. Additionally, one could implement a temperature-increase delta check. If a temperature
difference in a peak is not significant, which implies no cooking, that peak is of no interest and can be filtered out.
Successful and reliable implementation of cooking event detection and extraction on a single dataset is the basis for automating that method for event extraction on all provided datasets at once. A data structure should be created for holding all relevant data of the cooking events for each day. This is discussed in the next section of the thesis.
3.6 Data Structure for Event Analysis

One unified data structure is needed to hold all recognised events from a given test person's dataset, so that the person's behaviour can be analysed further.

This data structure may be implemented in various ways. One possibility is some kind of nested dictionary or multiple-level key-value structure. Since the practical part of this thesis is implemented in Python, a nested dictionary was used to hold all extracted cooking events.
The nested dictionary has the following form:

# first-level dictionary
outputDict = {}
# the date of the data-source file serves as the key
key = day_date.strftime('%Y_%m_%d')   # day_date: the datetime of the file's day
# second-level dictionary
outputDict[key] = {}
# number of event clusters extracted with hierarchical clustering
outputDict[key]['hclustersnr'] = hclusters_found
# number of programmatically extracted events with DISTINCT
outputDict[key]['eventgroupsnr'] = eventgroups_found
# third-level dictionaries for holding lists of extracted events
outputDict[key]['hclusters'] = {}
outputDict[key]['eventgroups'] = {}

The 'hclusters' and 'eventgroups' dictionaries can now be filled with the detected cooking events:

# for each event detected with hierarchical clustering:
# first event (first cooking detected)
outputDict[key]['hclusters'][1] = [eventNumber, length, startTime, endTime, meanTime, sensorEventsNumber]
# second event (second cooking detected)
outputDict[key]['hclusters'][2] = [eventNumber, length, startTime, endTime, meanTime, sensorEventsNumber]
# n-th event (n-th cooking detected)
outputDict[key]['hclusters'][n] = [eventNumber, length, startTime, endTime, meanTime, sensorEventsNumber]

# for each programmatically detected event:
# first event (first cooking detected)
outputDict[key]['eventgroups'][1] = [eventNumber, length, startTime, endTime, meanTime, sensorEventsNumber]
# second event (second cooking detected)
outputDict[key]['eventgroups'][2] = [eventNumber, length, startTime, endTime, meanTime, sensorEventsNumber]
# n-th event (n-th cooking detected)
outputDict[key]['eventgroups'][n] = [eventNumber, length, startTime, endTime, meanTime, sensorEventsNumber]
Such a data structure gives many possibilities for further data manipulation and analysis. For each day, the number of detected events is saved; for each single cooking event, the length, start time, end time, mean time, sensor-event counter and event number are saved. Such a data structure also allows sorting by date, which is needed to divide the test person's dataset into training and testing parts.
3.7 Data Quantity and Quality

Data quality and data quantity are of great importance for measuring prediction rates and for finding event accumulations. No pattern can be learned if the data collection period is too short. If the test person rarely cooks, or uses the cooking plate only to briefly heat up meals, longer periods of data collection are needed to detect reliable patterns.

Having extracted all events from the test person's datasets and holding them in one single data structure, the cooking event distribution can be inspected. The dataset statistics shown in the following table are useful for gaining insight into possible underlying event patterns and the person's behaviour. All this can help in the further analysis of the prediction methods.
Three randomly chosen test persons are going to be compared. Below are the dataset statistics.
Test Person             TP10       TP2        TP3
Mean Cooking Time       11:00:45   10:23:09   11:08:37
Cooking Standard Dev    03:19:34   03:43:24   03:51:26
Days Cooked             26         75         90
Days No Cooking         45         52         26
Total Days              71         127        116
Percentage Cooked       36%        59%        77%
Percentage No Cooking   63%        40%        22%
Table 3.5 Test persons dataset statistics.
The "Total Days" parameter indicates the number of days of recorded data. Test persons 2 and 3 have significantly more recorded data than test person 10; those two datasets are good examples where prediction rates should be higher. The "Percentage Cooked" and "Percentage No Cooking" parameters indicate the event density for a given person.

The visualisation of the event occurrence density over the recording time period gives another good insight into a person's behaviour patterns and possible hints about the prediction rates.
Figure 3.12 TP10 Cooking density overview.
Figure 3.13 TP3 Cooking density overview.
Figure 3.14 TP2 Cooking density overview.
Figure 3.12 shows that test person 10 uses the cooking plate mostly once a day. On most days, no cooking was detected. Making reliable predictions with such a sparse dataset may be a hard task, taking the relatively small amount of data into consideration.

Figures 3.13 and 3.14 show the relatively good recorded data quality of test persons 2 and 3. Test person 2 has only one big gap where no cooking event was registered. Finding behavioural patterns in dense datasets like these two should be possible. This statement shall be put to the test later in the thesis.
3.8 Predictive Analysis

This section analyses two different approaches to predicting a test person's behaviour: hourly based event grouping versus event accumulations found by clustering algorithms. The main goal is to obtain predictions that are as reliable and precise as possible.

The first approach is to bin the detected events by hour and calculate the hourly probability of an event occurring. This is the easier and less computationally intensive approach of the two. Many say that the simplest method is usually the best, but we'll put this saying to the test.

The second approach uses clustering algorithms to discover possible data-point accumulations. Two clustering algorithms are put to the prediction analysis test: Hierarchical Clustering, which was already used for cooking event detection, and the Dirichlet Process Gaussian Mixture Model.

Hierarchical Clustering has already shown good performance on clustering sparse, discretised sensor signals. Having extracted all cooking events from the given datasets, a new dataset with dense data is created; the question is how Hierarchical Clustering performs on such datasets.

The test persons' datasets containing all extracted cooking events (the nested dictionary from the previous section) are split 50:50 into training and testing datasets. The fifty-fifty split is fairly unusual, but the idea is to find out whether event accumulations can be found and whether the cooking patterns are predictable at all.

In the case of hourly binning predictions, the hit rate is measured by checking whether the mean times of the testing dataset points fall into any of the hourly bins from the training dataset. In the case of clustering predictions, the hit rate is measured by checking whether the mean times of the data points from the testing dataset fall within any of the clusters found in the training dataset.
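A minimal sketch of this hit/miss measurement for the clustering case, with hypothetical helper and variable names:

def hit_rate(test_mean_times, train_intervals):
    # test_mean_times: event mean times (seconds from day start) of the testing half
    # train_intervals: list of (start, end) spans learned from the training half
    hits = sum(1 for t in test_mean_times
               if any(start <= t <= end for start, end in train_intervals))
    return hits / float(len(test_mean_times))   # float() keeps Python 2.7 division honest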
3.8.1 Hourly Binning Analysis

Cooking events from the training dataset are binned into hour intervals. If the mean time of a cooking event is 8:45, it is put into the 8-hour bin. Events with mean times very close to the next hour (like 10:59) are also binned strictly according to their hour; such events could theoretically be assigned to the next or previous overlapping hour as well, which may be an issue in the further analysis of hourly binning predictions.
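A minimal sketch of the binning, assuming the event mean times are available as a Pandas series (the times are invented for illustration):

import pandas as pd

mean_times = pd.Series(pd.to_datetime(
    ['2010-05-15 08:45', '2010-05-16 08:10', '2010-05-16 12:30', '2010-05-17 12:55']))

bins = mean_times.dt.hour.value_counts().sort_index()   # 8:45 -> bin 8, 10:59 -> bin 10
hour_prob = bins / float(bins.sum())                    # event probability per hourly bin
print(hour_prob)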
Having binned the data from the training dataset, the probability of an event occurring in a certain hour (bin) can be calculated. The same binning is applied to the testing dataset, and the overlapping counts are calculated. Below is the visualisation of the binned events of the training and testing datasets for TP2.
Figure 3.15 TP2 training vs testing data bins comparison.
Some behavioural patterns are visible at first sight. The most probable bins had the highest match rates, with the exception of the 4 AM bin. Only the 14-hour and 21-hour bins had no matches. Visualising the hourly cooking probability gives an even better insight into the prediction possibilities.
Figure 3.16 TP2 training vs testing hourly probability comparison.
Figures 3.15 and 3.16 show a clear resemblance in test person 2's behaviour. Training and testing dataset visualisations can be made for the other test persons to get a better intuition of the possible end results.
Figure 3.17 TP3 training vs testing data bins and hourly probability comparison.
Below is the binning visualisation of test person 10's training and testing datasets, which clearly shows the need for better quality or a bigger amount of data.
Figure 3.18 TP10 training vs testing data bins and hourly probability comparison.
Due to the rare cooking events and the small size of test person 10's dataset, as shown in figure 3.18, excellent prediction results are not to be expected.

The table below summarises the hourly binning prediction rates on all three datasets.
Test Person TP2 TP3 TP10
Hit Rate 92.40 % 100.00 % 62.00 %
Miss Rate 7.60 % 0.00 % 38.00 %
Training set size 60 87 21
Testing set size 79 103 15
Table 3.6. Hourly binning prediction analysis summary.
Hourly based pattern discovery yields relatively good prediction rates. Test person 3's data were the biggest in size and had a 100% prediction rate. Test person 10's unsatisfactory numbers can be explained by the data shortage. One thing to note about hourly binning prediction is that the prediction rates are higher if the data points are nicely distributed across the hourly bins: if, for example, the test person cooked only once in every hour of the training dataset, the prediction rate would be 100%. The advantage of hourly based binning is that the cooking probability can be computed for every single hour.
Taking its low computational demands into consideration, hourly binning is the way to go if the provided hardware has no huge processing resources.
3.8.2 Clustering Analysis

Clustering is unsupervised learning: it tries to learn only from the provided data. The main idea is to find similarity measures and group similar objects together.

The cooking plate temperature values have no labels and no known structures. In order to find any hidden structures in the temperature values, unsupervised learning is the way to go. Unsupervised learning uses machine learning algorithms to discover and describe key data features; it is closely related to density estimation in statistics. Depending on the data, different techniques are applicable, and to produce better results, combinations of different unsupervised learning methods can also be applied. The main problem in most unsupervised learning methods is finding a good number of clusters.

Fed with the same data, different unsupervised learning algorithms behave differently and produce different results. Choosing the optimal solution for a specific dataset is more an art than a science: a lot of testing of different methods and parameter settings is needed to be able to choose the appropriate method.

In this section, two clustering algorithms are compared on unsupervised clustering hit/miss performance. Since Hierarchical Clustering has fewer parameters to experiment with, it is a good decision to start with it. A cluster number found by Hierarchical Clustering can then be used as the max-components parameter of the Dirichlet Process GMM.
3.8.2.1 Hierarchical Clustering Analysis

As noted earlier, Hierarchical Clustering takes a threshold input parameter as the basic distance measure between data clusters. The cooking event data have two dimensions: the distance from the beginning of the day in seconds, and the binary value indicating that a cooking event took place. Taking the time distance between cooking events into consideration, the Hierarchical Clustering algorithm is set up with criterion = "distance".

Typical cooking takes place a few times a day, and normally the distance between cooking events should be at least 30 minutes. Considering this, the Hierarchical Clustering algorithm can be set up with a threshold of 1800 seconds. Lowering the threshold would produce more clusters with smaller distances between them.
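A sketch of this setup, assuming the training events' mean times are already converted to seconds from the start of the day (the values here are invented):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

train_times = np.array([20400.0, 21000.0, 37800.0, 38700.0, 60300.0])

Z = linkage(train_times.reshape(-1, 1), method='single')
labels = fcluster(Z, t=1800, criterion='distance')   # merge events closer than 30 min

# cluster spans (start, end) for the later hit/miss check
spans = [(train_times[labels == c].min(), train_times[labels == c].max())
         for c in np.unique(labels)]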
Below are the hierarchical clustering execution results on the data of test persons 2, 3 and 10, with criterion "distance" and a threshold of 1800 seconds.
Figure 3.19 Hierarchical Clustering on TP3 training data with threshold of 1800 seconds.
Figure 3.19 shows the clusters found in the training data. The test data (red points) are plotted just for comparison. The blue and green lines indicate the hourly probability of cooking events occurring in the training and testing datasets respectively. The black horizontal line stands for the standard deviation of the hourly probability. This parameter could be an answer to the question "when should the cluster be grouped?": the hourly probability standard deviation seemed to be a good threshold parameter for filtering rare events on the time dimension. This can be visually inspected by looking at clusters 5, 6 and 7 (CL 5, CL 6, CL 7), which are based on single points. Cluster 2 with 3 points and half of cluster 3 would be filtered out. The following figure demonstrates hourly std. filtering on the TP3 dataset.
Figure 3.20 Threshold filtered TP3 dataset.
There are only 3 clusters in Figure 3.20, which is pretty close to the real cluster number. Hourly std. threshold filtering could therefore be used to estimate a loose cluster number, which could be used as an input parameter for K-means or DPGMM.

There is an alternative to hourly std. based filtering: one could use the cluster duration or the number of data points per cluster as constraints. After executing the Hierarchical Clustering method on the dataset, a cluster duration check can be made to filter out clusters that are too short. This approach would filter out the single-point clusters, and clusters shorter than e.g. 10 minutes are not long enough and could also be filtered out. Additionally, a check on the number of data points per cluster can be made: if a cluster contains fewer than e.g. 5 data points, it could be filtered out. This can be seen with cluster 2, which has only 3 points, although it is 17 minutes in duration. Using these filtering methods would result in the same clusters as in Figure 3.20.
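A minimal sketch of such a duration and point-count filter, with hypothetical names and illustrative thresholds:

def filter_clusters(spans, counts, min_seconds=600, min_points=5):
    # spans: list of (start, end) cluster spans in seconds; counts: points per cluster
    return [(s, e) for (s, e), n in zip(spans, counts)
            if (e - s) >= min_seconds and n >= min_points]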
The table below shows some cluster statistics from Figure 3.19.
Cluster 1 2 3 4 5 6 7
Start 16:06:46 14:16:34 12:32:26 05:39:21 15:13:27 04:34:26 21:29:20
End 18:22:47 14:33:40 13:40:08 11:56:54 15:13:27 04:34:26 21:29:20
Duration 2:16:01 0:17:06 1:07:42 6:17:33 0:00:00 0:00:00 0:00:00
Elements 18 3 9 57 1 1 1
Percentage 20% 3% 10% 63% 1% 1% 1%
Count hits 19 0 7 70 0 0 0
Hit percentage 16% 0% 6% 61% 0% 0% 0%
Table 3.7. Hierarchical Clustering cluster statistics of TP3 training data.
Using the same parameters as described at the beginning of the section, visualisations and cluster statistics are made for TP2 and TP10 respectively.
Figure 3.21 Hierarchical Clustering on TP2 training data.
Cluster # 1 2 3 4 9 10 11
Start 08:05:33 04:08:46 06:36:29 05:25:40 16:59:45 15:08:28 15:56:04
End 13:01:07 04:45:19 07:24:44 05:30:06 18:11:43 15:12:53 16:00:47
Duration 4:55:34 0:36:33 0:48:15 0:04:26 1:11:58 0:04:25 0:04:43
Elements 37 9 3 2 5 2 2
Percentage 57% 14% 4% 3% 7% 3% 3%
Hit count 48 2 1 1 4 0 0
Hit percentage 60% 2% 1% 1% 5% 0% 0%
Table 3.8. Hierarchical Clustering cluster statistics of TP2 training data.
Clusters and statistics for TP 10.
Figure 3.22. Hierarchical clustering on TP10 training data.
Cluster 1 2 3 4 5 6
Start 08:55:49 12:33:33 11:50:19 05:54:14 07:36:25 14:44:41
End 10:45:17 12:53:05 11:50:19 06:22:49 07:36:25 14:44:41
Duration 1:49:28 0:19:32 0:00:00 0:28:35 0:00:00 0:00:00
Elements 10 3 1 5 1 1
Percentage 47% 14% 4% 23% 4% 4%
Hit count 3 0 0 0 0 0
Hit percentage 21% 0% 0% 0% 0% 0%
Table 3.9. Hierarchical Clustering cluster statistics of TP10 training data.
As can be seen from diagrams 3.19, 3.21 and 3.22, filtering with the hourly probability std. threshold would only lead to an even worse hit rate, since it would cut off most of the good clusters. It can, however, be used to find a realistic cluster number, which can be used as an initialisation parameter in other clustering techniques. Below are the statistics of the clustering hit/miss performance on all three datasets. The results are from the clustering with no thresholds and the parameter initialisation as described at the beginning of the section. Single-point clusters are of no relevance for the hit/miss performance in the results, since they have the same starting and ending time.
Test Person TP3 TP2 TP10
Hits 96 56 3
Misses 17 23 11
Hit Rate 84% 70% 21%
Miss Rate 16% 30% 79%
Training set size 87 60 21
Testing set size 103 79 15
Table 3.10: Summary of hierarchical clustering on hit/miss performance.
Every single data point from the testing dataset (the second data portion) was checked for whether it belongs to any cluster from the training dataset; more exactly, whether the mean time of a second-portion event lies between any cluster's start and end time. The results from table 3.10 show a direct correlation between training set size and hit rate. The TP10 results are unsatisfactory due to the insufficient amount of data.
3.8.2.2 Dirichlet Process GMM Clustering Analysis

The first test for the Dirichlet Process GMM is to see whether it groups the data points into a reasonable number of clusters. The best way to test this is to set a high maximum number of components: the Dirichlet Process GMM should deliver the real density estimates, as described in the theory section. Initialising the DPGMM with a maximum component number of 20 and leaving alpha, the covariance type ("diag"), the cluster update parameters (w "weights", m "means" and c "covariances" combined) and the number of iterations (10) at their defaults results in the following diagram [12].
Figure 3.23 DPGMM run with default parameter setting.
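As a sketch, the run above corresponds roughly to the following call, assuming the pre-0.20 scikit-learn API available at the time (the DPGMM class was later removed in favour of BayesianGaussianMixture); train_times is the illustrative array assumed in section 3.8.2.1:

import numpy as np
from sklearn.mixture import DPGMM

train_times = np.array([20400.0, 21000.0, 37800.0, 38700.0, 60300.0])

dpgmm = DPGMM(n_components=20, alpha=1.0, covariance_type='diag',
              params='wmc', n_iter=10)        # update weights, means and covariances
dpgmm.fit(train_times.reshape(-1, 1))
labels = dpgmm.predict(train_times.reshape(-1, 1))

# the 'wm' variant discussed below updates only weights and means:
# dpgmm_wm = DPGMM(n_components=20, params='wm', n_iter=10)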
The DPGMM simply puts all the data into one single cluster. Increasing the alpha parameter, which should lead to a larger number of clusters, makes no change at all; increasing the number of iterations returns no better results either. To build a useful model with the Dirichlet Process GMM, the distribution of the data has to be inspected closely. How are the data aligned? Is there any possible convergence between the data points? Are there any obvious accumulation points? All cooking events are aligned on one horizontal line; incremented by seconds, they form one long straight line, which can be seen as perfect convergence. Accumulation points are possible but not so frequent. Small, well spread data chunks are also visible, which characterises the cooking event data as sparse with noise.

All the points converge to one straight line, which may be the reason why the Dirichlet Process GMM initialised with covariance updates puts them into one single cluster. As discussed in the theory section, each Gaussian component can be described by its mean, variance/covariance and size π. The mean is responsible for the distribution shift, the variance describes the component's width, and π is responsible for the component's height (the number of points in the direct neighbourhood that are clustered together). This leads to the conclusion that the DPGMM's cluster update parameters and covariance types have to be experimented with in order to get a meaningful model. Since the data are perfectly convergent, the covariance parameter can be removed from the training process. Removing the covariance from the cluster update parameters, leaving them at 'wm' only, meaning that only the weights (the components' heights π) and the means are updated during clustering, results in the following diagram:
Figure 3.24 DPGMM Clustering with weights and means combined on TP3 dataset.
Looking at figure 3.24 above, four clusters are found. All four correspond to the hourly probabilities of cooking events occurring (blue line). The cluster probabilities are calculated based on the number of data objects they contain. Outliers far away from the others, such as the first point from the right (red cluster) and the first point from the left (blue cluster), are not grouped into single clusters but are grouped with their nearest objects. This widens the clusters in length, which is optimal for better hit rates. This seems to be an acceptable solution for clustering with the DPGMM.

Removing the weights parameter ("w") results in the same cluster configuration as shown in figure 3.24. This means that for this kind of time-series data the mean ("m") parameter, which shifts the Gaussian component's centre, is crucial. Changing the covariance type between all four possible options had no influence on the result [11]. The following figure shows the Gaussian components for the TP3 data.
Figure 3.25 TP3 training vs testing datasets gaussian components.
Table 3.11 gives some cluster insights for comparison.
Cluster 2 4 5 12
Start 04:34:26 07:23:44 09:59:54 14:16:34
End 07:06:44 09:34:48 13:40:08 21:29:20
Duration 2:32:18 2:11:04 3:40:14 7:12:46
Elements 22 17 28 23
Percentage 24% 18% 31% 25%
Hit count 20 16 37 25
Hit percentage 17% 14% 32% 22%
Table 3.11. DPGMM cluster statistics and hit performance on TP3.
Cluster five contains 31% of the training dataset. The cluster starts at 10:00 and ends at 13:40; looking at the test data, this is the most probable cooking time of test person 3. Compared with the testing dataset, this cluster has the highest hit percentage of 32%. The second biggest cluster is number 12. It is the longest cluster, with a duration of over 7 hours due to the outlier point at the end of the cluster. Although this increases the possible hit rate (depending on the data quality), it is highly improbable that the person typically cooks at 9 PM. The same can be said for cluster 2, with an outlier at the beginning. Initialising the DPGMM with the same training data and setting the maximum components to 2 with the "means" parameter would return 2 clusters, combining clusters 2 with 4 and 5 with 12. Setting the maximum component number to 3 would result in combining clusters 2 and 4 …