
RESEARCH PROPOSAL

CHALLENGES AND OPPORTUNITIES OF ANALYSING COMPLEX DATA USING DEEP LEARNING

NICLAS STÅHL
Informatics


ABSTRACT

The era of big data and data analysis is here. Unlike data analysis just some decades ago, the analysis today does not only comprise data that are stored in well organized tables. Instead, the data are much more diverse and may, for example, consist of images or text. This type of data often goes under the term complex data. However, there is no profound definition of complex data, and the term is often used merely to highlight that an analysis of the data is non-trivial. Therefore, this research aims to find such a definition and to specify a set of properties that are required for data to be complex.

A sub-field within machine learning that has shown promising results in analysing complex data in recent years is deep learning. Therefore, the properties of complex data will be analysed from a deep learning perspective. Even though deep learning has been successful in many fields, there are still several open problems that need to be solved. There are also fields in which deep learning has not yet made any major breakthroughs. This research aims to find case studies in such fields, to explain why deep learning has not yet succeeded there, and to connect this to the properties of the data from each field. With knowledge about the limitations these properties pose on current deep learning methods, the aim is to refine these methods and to develop new ones.

Keywords: Complex data, Deep learning


CONTENTS

1 Background
  1.1 Machine learning and computer aided analysis
    1.1.1 Supervised learning
    1.1.2 Unsupervised learning
    1.1.3 Feature learning
    1.1.4 Multimodal learning
  1.2 Different types of data
  1.3 Feature engineering – Creating structured data from unstructured data
  1.4 Deep learning
    1.4.1 Artificial neural networks
    1.4.2 Feed forward neural networks
    1.4.3 Convolutional neural networks
    1.4.4 Recurrent neural networks
    1.4.5 Generative adversarial networks
  1.5 Challenges and open problems in deep learning
2 Problem specification
  2.1 Related work
  2.2 Delimitations
  2.3 Time plan
3 Method
4 Preliminary results
  4.1 Case 1 – Molecule property prediction
  4.2 Case 2 – Steel rolling
Bibliography


INTRODUCTION

The era of big data and data analysis is here. Our modern society now generates data at such a speed that it takes less than two days to produce more data than all data produced by humans before 2003 [1]. This increase in data generation has not stagnated, but has kept growing exponentially, and the world has started to go through a big data revolution [1-6]. Before the big data revolution, a lot of effort was put into designing data collection schemes and surveys [5,6]. The analysis process was often defined before the data collection and served a given purpose. Today this has changed. Data are now abundant in many fields and are no longer hard or expensive to gather. Hence, data are no longer collected for a specific study, but are instead often collected as a by-product of a given process, so called secondary data collection [7]. This creates several opportunities and challenges for the field of data analysis [1,2,4,8]. One of the main challenges with this type of data, identified by both Fan et al. [8] and Chen & Zhang [9], is that the data often are very heterogeneous and do not follow a predefined structure, which most conventional methods for data analysis cannot handle [10]. The data can also be stored in different formats, have different quality and granularity, and come from many different sources [2,8,9], all factors that make the data more difficult to analyse. These factors, as well as the studied process, may also change over time, making the analysis even more difficult.

A research field that has shown promising results in solving problems arising from this type of unstructured data from multiple sources is deep learning [11]. Deep learning is a sub-field of machine learning that has revolutionized several fields such as image processing [12,13], speech recognition [14] and natural language processing [15,16]. Due to the great success in these areas, deep learning has gained a lot of attention from the scientific community and popular media. This has caused researchers from many different fields, e.g. chemistry [17,18], finance [19] and biology [6], to use deep learning to solve problems in their respective fields. Even though deep learning algorithms are sometimes presented as something that will work straight out of the box, this is seldom the case in reality. How to initialize and train a deep learning model is often a non-trivial problem that requires expert knowledge [20]. Because of this, there are still many open problems that could potentially be solved with deep learning. There are also several open problems within the field of deep learning itself. One of the main ones is that there is still no complete understanding of why deep learning works as well as it does [21]. The understanding of deep learning is, so far, mostly based on empirical studies and heuristics. How to select, configure and train deep learning models is by many still seen as a “black art”.


CHAPTER 1

BACKGROUND

The first part of this chapter gives a short and simplistic definition of machine learning (ML). Machine learning is a sub-field of artificial intelligence (AI) that is focused on how machines may learn and draw conclusions from data. This research will focus on how such algorithms are affected by properties of the data. Therefore, the second part of this chapter is devoted to an overview of different types of data and what characterizes each type. In this research there is a special focus on one of these types, namely complex data. This type of data is often hard to analyse with conventional ML methods. To solve this problem, the data are often first manually crafted into a new dataset with new features. This is called feature engineering, and the process is described in the third part of this chapter. In the last few years, a new sub-field of machine learning has emerged that can handle complex data without feature engineering. This sub-field is called deep learning (DL), and an introduction to the field and the models within it is given in the fourth part of this chapter. In the final part, open problems and challenges in deep learning are described.

1.1 MACHINE LEARNING AND COMPUTER AIDED ANALYSIS

A very general definition of machine learning is that machine learning is a sub-field of computer science which aims to make computers learn [22,23]. Even though this is a somewhat simplistic definition, it is still problematic, since the meaning of learning must be defined. Since the start of the study of artificial intelligence, researchers have tried to create machines that are able to learn in the same manner as humans. Human learning is a very complex process, which is far from fully understood, and no general computerized learning algorithm is so far able to mimic it. However, several methods have been developed that allow computers to draw statistical inferences from presented examples, which in some sense could be called learning. By presenting a large number of training samples to a machine, it is able to extrapolate knowledge from the observed data. The machine can then use the gained knowledge to draw conclusions about new examples that are similar to, while not exactly the same as, those presented during training. This has proven useful in many problem domains, both for automatically drawing conclusions and for supporting human decision processes.

1.1.1 SUPERVISED LEARNING

Supervised learning can be regarded as similar to the process of a teacher teaching a student. The teacher shows the student several examples and how these examples should be solved. It is then up to the student to generalize this knowledge, so that it can be applied to future examples. Most success within the field of machine learning has so far come through supervised learning. This includes, for example, classification of images, where a large number of images and their


corresponding classes are presented to the computer [12]. From these examples, the computer is able to learn which patterns make a given image belong to a certain class. With this generalized knowledge, the computer can later correctly classify new images that were not presented during the training phase.

1.1.2 UNSUPERVISED LEARNING

In unsupervised learning, the computer tries to generalize the data and learn its underlying patterns. This is useful when dividing the data into different clusters or when trying to find samples with similar meanings. As stated before, most successes within machine learning come from supervised learning. However, there is great belief in unsupervised learning, and many, for example Bengio [23], believe that unsupervised learning will become more and more important in the future. This belief is mainly based on the idea that the human learning process is mostly unsupervised [24].

1.1.3 FEATURE LEARNING

In feature learning, which should not be confused with feature engineering, machine learning algorithms are trained to create new representations of the data. Feature learning can be conducted in both a supervised and an unsupervised manner. As with feature engineering, the most common purpose of learning new representations is to reduce the number of dimensions while keeping the variance between samples. This is often done in order to be able to visualize and explore the data. Reducing the number of variables per sample (if not done in excess) also generally increases the performance of classification and regression algorithms [25]. A second reason for learning new representations is to find a more interpretable representation of the data. This is often done by adding a sparsity penalty on the new representation [26]. In this case the number of variables is often increased instead of decreased, but the new representation is sparse: for each sample only a few variables are non-zero. This can be interpreted as having a long list of different properties, where only the main characterizing properties are selected for each sample.

The idea of automatically learning new features is not new, and there are several classical methods for feature learning. Examples of such methods are Singular Value Decomposition (SVD) [27], Principal Component Analysis (PCA) [27] and Non-Negative Matrix Factorization (NNMF) [28]. However, in their basic form these algorithms only find new variables that are linear combinations of the old ones. There are also several deep learning methods that are able to learn new features that are non-linear combinations of variables. These methods use models such as Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs) and Autoencoders (AEs) [29-31]. These methods have been used to learn better representations of data in many different cases and settings.
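As a hedged sketch of this difference, the snippet below contrasts linear feature learning with PCA against a small autoencoder whose learned code is pushed towards sparsity with an L1 penalty. The library choices (scikit-learn and PyTorch), the sizes and the penalty weight are all assumptions made for illustration, not part of the proposal.

```python
# Linear feature learning (PCA) versus a non-linear autoencoder with a
# sparsity penalty on the learned code. All sizes, the L1 weight and the
# toy data are assumptions made for this sketch.
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_np = rng.normal(size=(500, 20)).astype("float32")  # 500 samples, 20 variables

# Classical, linear: each of the 5 new variables is a linear
# combination of the 20 old ones.
X_pca = PCA(n_components=5).fit_transform(X_np)

# Deep, non-linear: an autoencoder trained to reconstruct its input.
# The encoder output is the learned representation; the L1 term pushes
# it towards sparsity, so few code variables are non-zero per sample.
encoder = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 5))
decoder = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 20))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

X = torch.from_numpy(X_np)
for _ in range(200):
    opt.zero_grad()
    code = encoder(X)
    loss = nn.functional.mse_loss(decoder(code), X) + 1e-3 * code.abs().mean()
    loss.backward()
    opt.step()
```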

One example of how RBMs can be used is provided by Jaitly & Hinton [32], who use RBMs to learn new representations of sound waves. In this case the RBM managed to learn several interesting patterns for representing the sound waves. These representations could then be used to learn more about the differences and similarities in the data. Another example of deep learning used for feature learning is Plötz et al. [33], who use a DBN to automatically learn representations of sensor data for activity recognition. A DBN is a stack of several layers of RBMs and can therefore learn more abstract and complex representations [30]. Using the representations learned by the DBN, Plötz et al. [33] were able to train a classifier


with better classification accuracy than if heuristically designed features had been used.

1.1.4 MULTIMODAL LEARNING

In multimodal learning, a model learns from data with multiple modalities. Ngiam et al. [34] present several models that are able to learn a shared representation of the audio of a speech and a video of the speaker's lips. The presented models build on RBMs, AEs and DBNs. It was shown that these models were much more stable than models trained on either audio or video alone. Srivastava & Salakhutdinov [35] use a similar approach and train a DBN to model the bimodal distribution of images and captions. This DBN could then be used to find images for captions and captions for images. In a later paper, Srivastava & Salakhutdinov [36] also trained a Deep Boltzmann Machine (DBM) for the same task. Several authors, such as Kim et al. [37], also use DBNs for learning joint representations of multimodal data. Others, such as Kahou et al. [38], combine the unsupervised learning of DBNs with the supervised learning of convolutional neural networks.

A classical approach to analysing multimodal data is multiple kernel learning, a method where a set of kernels is combined in either a linear or a non-linear way [39,40]. This approach is beneficial in at least two ways. Firstly, it reduces the importance of the choice of kernel and hyper-parameters, since kernels with better performance are given more significance when the kernels are combined. Secondly, different kernels may handle different input formats; one kernel may learn from images while another learns from text [40].
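As a minimal illustration of the linear case, the sketch below combines two precomputed kernels, one per modality, with fixed mixing weights and feeds the result to a support vector machine. The toy data, the kernel choices and the weights are assumptions for the example; a full multiple kernel learning method would also learn the mixing weights.

```python
# Linear multiple kernel learning, sketched: one kernel per modality,
# combined as a weighted sum. Toy data and fixed weights are assumptions;
# proper MKL would optimize the mixing weights as well.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, linear_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_img = rng.normal(size=(100, 64))   # stand-in for image features
X_txt = rng.normal(size=(100, 32))   # stand-in for text features
y = rng.integers(0, 2, size=100)     # toy binary labels

# Combined kernel: a convex sum of an RBF kernel on the image modality
# and a linear kernel on the text modality.
K = 0.7 * rbf_kernel(X_img) + 0.3 * linear_kernel(X_txt)

clf = SVC(kernel="precomputed")
clf.fit(K, y)                        # trains on the combined kernel matrix
```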

1.2 DIFFERENT TYPES OF DATA

There have been several attempts to divide data into different classes. One of the oldest and most widely known divisions is made by Stevens [41], who divides all measured data into four categories: nominal, ordinal, interval and ratio. However, this division has been criticized several times, especially from the statistical community [42]. To resolve this, Mosteller & Tukey [43] presented a new way of dividing data into a larger number of classes, namely: names, grades, ranks, counted fractions, counts, amounts, balances and circles.

However, the works of both Stevens [41] and Mosteller & Tukey [43] are from a time when most data were stored in tabular form, and they therefore only consider structured data. Structured data are data with a high level of organization, such as information stored in a relational database [44]. As previously mentioned, most of the data generated today are not in this format, but instead in an unstructured format, such as text [8]. This research proposal will mainly concern such data, but there are several more ways to classify data. To avoid any further confusion, the following definitions of different data types will be used:

Structured data:

Structured data are data arranged in a tabular form, where each row represents a sample that is independent of all other rows. Due to this format, relational databases are often used to store structured data. Since this has historically been the main way to store data, most machine learning methods have been developed to analyse structured data [45], and many conventional methods are limited to this type of data.


Unstructured data:

In this research proposal, unstructured data are defined as all data that are not structured data. Thus, all data that cannot be stored in tabular form, where each row is independent of all other rows, are unstructured. This naming convention can be confusing, since the name suggests that there is no structure in the data. This is seldom the case: unstructured data often have a lot of internal structure, even if it is impossible to arrange them as rows in a table. Images and text are two examples of data that are unstructured even though the data follow certain rules and contain a lot of internal structure.

Multi-levelled data:

Multi-levelled data are data measured at different granularities, where data with low granularity cover more than one sample in the high granularity data [46,47]. For example, when predicting the opinion of an individual, both individual data and data about the local area or region can be used [46]. The molecular data in the study described in chapter 4 are multi-levelled as well, since information from both the atom and the molecule level is used.

Multimodal data:

Multimodal data are data concerning multiple and diverse modalities [36]. Thus, data where, for example, both text and images are stored for the same instance are multimodal [48,49]. With the introduction of the internet, and especially social media, this type of data has become more and more common [48]. Another typical example of multimodal data is video streams, which contain both a sequence of images and audio [38,50].

Complex data:

Even though many authors, such as Bastian et al. [51], Cao et al. [52], Talia [53] and Lum et al. [54], use the term complex data, a thorough definition of it is seldom given. Instead, the term is often used to emphasize that the data are non-trivial, and thus worth studying. In some publications, complex data are the same as high dimensional data. Blanco et al. [55], for example, define complex data as “a collections of tuples with many attributes”. Another definition, used by Toubiana et al. [56], is that complex data are data generated by complex interactions in the studied system. Van Leeuwen & Knobbe [57] bring both these definitions together and state that the complexity of a dataset may be due either to the dataset “contain[ing] many rows as well as many attributes” or to it containing “non-trivial interactions between attributes”.

1.3 FEATURE ENGINEERING – CREATING STRUCTURED DATA FROM UNSTRUCTURED DATA

Most machine learning algorithms require structured data and can therefore not be applied to unstructured data such as sequences and graphs [10]. To bypass this problem, the machine learning algorithm can instead be applied to a fixed set of features extracted from the raw data [22,33,58,59]. This process, where new features are created, is called feature engineering and is a manually conducted step in the data analysis process. Another problem often solved by feature engineering and feature selection is when too much information is collected for each instance. This makes the training of any model more complicated and may cause important information to disappear in the noise of all the irrelevant information. However, whenever


feature engineering is applied, there is a risk that some important information is lost and that a given problem no longer can be solved. This risk increases as the complexity of the data grows, since there may be complex interactions that are hard to spot even for domain experts.

Domingos [22] argues that feature engineering is the most important factor determining failure or success when solving a problem with machine learning. Plötz et al. [33] agree with this view, but point out that feature engineering is often based on heuristic arguments and that systematic research on feature engineering is lacking. Feature engineering does, moreover, often require domain expertise, which is costly and hard to find. Another problem with feature engineering, pointed out by Kim et al. [37], is that often only linear combinations of variables are considered; manual feature engineering therefore often misses complex higher-order dependencies between variables. Due to these shortcomings of manual feature engineering, deep learning methods capable of automatic feature creation and selection are of great utility [60]. Such methods include autoencoders (AEs), restricted Boltzmann machines (RBMs) and deep belief networks (DBNs).

1.4 DEEP LEARNING

Deep learning is a type of representation learning where the machine itself learns several internal representations from raw data in order to perform regression or classification [11]. This is in contrast to more classical machine learning algorithms, which often require carefully engineered features based on domain expertise [45]. Deep learning models are built in a layer-wise structure where each layer learns a set of hidden representations, which in many cases cannot be understood by a human observer. The representations in each layer are non-linear compositions of the representations in the previous layer. This allows the model to first learn very simple representations in the first layers, which are then combined into more and more complex and abstract representations layer by layer. For example, when deep learning models are used on images, they often start by learning to detect edges and strokes [61], which are then combined into simple objects that become more and more complex for each layer. Since each layer only learns from the representation of the previous layer, a general purpose learning algorithm, such as backpropagation [62], can be used.

1.4.1 ARTIFICIAL NEURAL NETWORKS

Most algorithms in deep learning are based on artificial neural networks [11]. In contrast to deep learning, the field of artificial neural networks has been around for some time. It started in 1943 when McCulloch and Pitts, a neuroscientist and a mathematician, defined a mathematical model of how they believed a neuron in a biological brain worked [63]. The next step came in 1949 when Hebb formulated a rule that could be used to make an artificial neuron learn a set of given patterns [64]. In 1958, Rosenblatt, a psychologist, further generalised the work of McCulloch and Pitts and proposed a model for an artificial neuron called the perceptron [65]. The mathematical definition of a perceptron is given in equation (1.1) and a graphical representation is shown in Figure 1.1.

$$y = f\left(\sum_i x_i w_i + b\right), \tag{1.1}$$



where $x_i$ represents input variable $i$, $w_i$ the weight of variable $i$, $b$ the bias and $f$ the activation function.

Figure 1.1: A graphical illustration of a perceptron. The output of a perceptron is an activation function applied to the weighted sum of the inputs plus a bias. The mathematical definition of a perceptron is given by equation (1.1).

When the perceptron was first introduced, the step function

$$f(x) = \begin{cases} 1, & \text{if } x > 0 \\ 0, & \text{if } x \le 0 \end{cases} \tag{1.2}$$

was used as the activation function $f$.
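To make equations (1.1) and (1.2) concrete, the snippet below implements a single perceptron with the step activation; the weights, bias and input are arbitrary example values.

```python
# A perceptron as in equation (1.1) with the step activation (1.2).
# Weights, bias and input are arbitrary example values.
import numpy as np

def step(x):
    """Step activation: 1 if x > 0, else 0 (equation 1.2)."""
    return np.where(x > 0, 1, 0)

def perceptron(x, w, b):
    """Activation applied to the weighted sum of the inputs plus a bias."""
    return step(np.dot(x, w) + b)

x = np.array([0.5, -1.0, 2.0])   # input variables x_i
w = np.array([0.8, 0.2, -0.5])   # weights w_i
b = 0.1                          # bias
print(perceptron(x, w, b))       # -> 0, since 0.4 - 0.2 - 1.0 + 0.1 < 0
```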

The perceptron was then further analysed and developed by Minsky and Papert [66], who showed that a single perceptron is not sufficient to learn certain problems, for example the XOR problem. Instead, they argued that multi-layered perceptrons were needed to solve such problems. However, such networks were not possible to train at that time. This led to an AI winter, and very little research on artificial neural networks was conducted for some time. This has changed during the last few years thanks to the increase in computational power [67] and improvements to the methodology, such as the introduction of the backpropagation algorithm [62], unsupervised pre-training [60] and the rectified linear unit [68]. These improvements have allowed researchers to build networks with many hidden layers, so called deep neural networks [11]. In the following sections, several different architectures of artificial neural networks are presented.

1.4.2 FEED FORWARD NEURAL NETWORKS

A feed forward neural network is an artificial neural network where information only moves in one direction. Feed forward networks are thus acyclical and free of loops [69]. This was also the first network architecture to be presented. The layout of a typical feed forward network is shown in Figure 1.2. The most basic feed forward network is the perceptron [65], where the output is an activation function applied to the weighted sum of the input plus a bias. A common choice for the activation function is the sigmoid function,

$$\sigma(x) = \frac{1}{1 + e^{-x}}. \tag{1.3}$$



Figure 1.2: The layout of a multilayer perceptron with two hidden layers, each having 5 neurons.

However, in the last few years, there has been a transition from the sigmoid function to the rectifier function

$$f(x) = \max(x, 0), \tag{1.4}$$

because of its simple and well behaved derivative. The rectifier function in equation (1.4) was introduced as an activation function by Nair & Hinton [68], and units in a neural network that use this function are called rectified linear units, or ReLUs for short. If the sigmoid function in equation (1.3) is used as the activation function, a single perceptron performs exactly the same task as logistic regression [70]. A standard feed forward architecture is to arrange multiple neurons in interconnected layers, where each neuron in any layer, except the final output layer, has directed connections to all neurons in the subsequent layer. Networks of this type are called multilayer perceptrons. As with the perceptron, the output that each neuron propagates to the next layer is an activation function applied to the weighted sum of all its inputs plus a bias. As long as the activation function is differentiable, it is possible to calculate how the output will change if any of the weights is changed, and the network can thus be optimized with gradient based methods.
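A minimal sketch of such a multilayer perceptron, matching the layout in Figure 1.2 (four inputs, two hidden layers of five units, one output) and trained with a gradient based method, is given below; PyTorch and the toy data are assumptions made for the example.

```python
# A multilayer perceptron like the one in Figure 1.2: 4 inputs, two
# hidden layers of 5 ReLU units each, one sigmoid output. PyTorch and
# the toy data are assumptions made for this sketch.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 5), nn.ReLU(),     # first hidden layer
    nn.Linear(5, 5), nn.ReLU(),     # second hidden layer
    nn.Linear(5, 1), nn.Sigmoid(),  # output layer
)

X = torch.randn(64, 4)                    # 64 toy samples, 4 input variables
y = torch.randint(0, 2, (64, 1)).float()  # toy binary targets

# Differentiable activations make gradient based optimization possible.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCELoss()
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()   # backpropagation computes the gradients
    opt.step()        # a gradient step updates the weights
```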

1.4.3 CONVOLUTIONAL NEURAL NETWORKS

Convolutional neural networks (CNNs) are a special type of neural network mainly used in image analysis [71], although some authors have also successfully used CNNs for natural language processing [72]. The main idea behind the CNN architecture is that basic features in a small area of an image can be analysed independently of their position and of the rest of the image. An image can thus be split up into many small patches, each of which can be analysed in the same way and independently of the other patches. The information from the patches can then be merged, creating a more abstract representation of the image.

This scheme is implemented in a CNN using two kinds of steps: convolution and sub-sampling. In a convolutional step, a feed forward neural network is applied to all small patches of the image, generating several maps of hidden features.



Figure 1.3: A recurrent neural network with one hidden layer consisting of two neurons. Note the recurrent connections in the hidden layer, where each neuron has a connection to itself. The outputs of the neurons in the hidden layer thus depend on the input as well as on the previous output of the hidden layer.

In the sub-sampling step, the size of the feature map is reduced. This is often done by reducing a neighbourhood of features to a single value, most commonly by representing the neighbourhood with its maximum or average value. These two kinds of steps are then combined into a deep structure with several layers.
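The sketch below stacks two such convolution plus sub-sampling stages; PyTorch, the channel counts and the input size are assumptions made for illustration.

```python
# Two convolution + sub-sampling (max pooling) stages combined into a
# small deep structure. PyTorch and all sizes are assumptions.
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # convolution over patches
    nn.ReLU(),
    nn.MaxPool2d(2),                             # sub-sampling: max per 2x2 block
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # more abstract feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),
)

image = torch.randn(1, 1, 28, 28)   # one 28x28 single-channel toy image
features = cnn(image)
print(features.shape)               # -> torch.Size([1, 16, 7, 7])
```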

It has been shown that a CNN learns to detect general and simple patterns, such as edges, lines and dots, in its first layers [61]. The abstraction of the learned features increases with each layer: if the first layer detects edges and dots, the next layer may combine these into simple structures, which in turn may be combined into more complex and abstract objects in the layer after that. One of the main benefits of this approach is that the CNN learns translation invariant features. A CNN can thus learn general features about objects in an image independently of their position within the image.

1.4.4 RECURRENT NEURAL NETWORKS

A recurrent neural network (RNN) is a type of artificial neural network with cyclical connections between neurons, unlike feed forward networks, which are acyclical [73]. Such connections are illustrated in Figure 1.3, which shows a recurrent neural network. The cycles allow the network to keep an inner state, so that it can act on information from previous inputs and thus exhibit dynamic temporal behaviour. This makes RNNs well suited for the analysis of sequential data, such as text [74] and time series [75]. One big problem with recurrent neural networks, which also occurs in deep feed forward networks, is that the gradients in the backpropagation either vanish towards zero or grow towards infinity [76]. This has, however, been partially solved by the introduction of special network architectures, such as the long short-term memory (LSTM) [77] and the gated recurrent unit (GRU) [78].
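A minimal sketch of sequence processing with an LSTM, one of the gated architectures named above, is given below; PyTorch and the toy sequence are assumptions made for the example.

```python
# An LSTM keeps an inner state (hidden and cell state) that carries
# information across time steps. PyTorch and the toy data are assumptions.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=3, hidden_size=8, batch_first=True)

sequence = torch.randn(1, 20, 3)      # one sequence: 20 steps, 3 features
outputs, (h_n, c_n) = lstm(sequence)  # per-step outputs + final inner state

print(outputs.shape)  # -> torch.Size([1, 20, 8]), one output per time step
print(h_n.shape)      # -> torch.Size([1, 1, 8]), the final hidden state
```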

1.4.5 GENERATIVE ADVERSARIAL NETWORKS

Generative adversarial networks (GANs) were introduced by Goodfellow et al. [79]. The idea behind a GAN is to have two artificial neural networks that compete.


The first network, called the generator, tries to generate examples following the same distribution as the collected data, while the second network, called the discriminator, tries to distinguish between examples generated by the generator and samples drawn from the real data distribution.

The training of these two networks consists of two phases, where the first aims to train the generator and the second to train the discriminator. In the first phase, the generator generates several examples and gets information about how the discriminator would judge these examples, and in which direction to change them so that they are more likely to pass as real data. In the second phase, several generated and real examples are presented to the discriminator, which classifies them as real or generated. The discriminator is then given the right answers and told how to change its settings to become better at this classification in future trials. This can be compared to the competition between a money counterfeiter and a bank. The task of the counterfeiter is to create fake money, and the bank should be able to determine whether money is fake or not. If the counterfeiter gets better at creating fake money, the bank must take new measures to discover it, and if the bank gets better at discovering fake money, the counterfeiter must come up with better and more creative ways to create it. The hope when training a GAN is that the generating and discriminating networks will reach a stalemate where both are good at their tasks. There are several successful works showing how GANs can be used to generate new samples, for example the generation of images of human faces [80], images of hotel rooms [81] and text tags for images [82].
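The two training phases can be sketched as below for a one-dimensional toy distribution; PyTorch, the network sizes and the data distribution are all assumptions made for illustration.

```python
# Minimal GAN training loop with the two phases described above.
# PyTorch, the network sizes and the toy 1-D "real" distribution
# are assumptions made for this sketch.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))   # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1),
                  nn.Sigmoid())                                     # discriminator

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(32, 1) * 0.5 + 2.0   # "real" data drawn from N(2, 0.5)
    noise = torch.randn(32, 8)

    # Phase 1: the generator tries to make examples the discriminator
    # judges as real.
    opt_g.zero_grad()
    fake = G(noise)
    loss_g = bce(D(fake), torch.ones(32, 1))
    loss_g.backward()
    opt_g.step()

    # Phase 2: the discriminator is told the right answers and updated
    # to better separate real from generated examples.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(32, 1)) + \
             bce(D(fake.detach()), torch.zeros(32, 1))
    loss_d.backward()
    opt_d.step()
```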

1.5 CHALLENGES AND OPEN PROBLEMS IN DEEP LEARNING

One of the biggest problems with deep neural networks has historically been, and to some extent still is, that the gradient based learning used to train the networks is computationally demanding [83]. This has been partially solved by the increase in computational power and the introduction of graphical processing units (GPUs) for computation [67]. The main remaining problem with gradient based learning methods is that they may get stuck on suboptimal saddle points for a long time [84]. While the problem of long training times has been addressed by many authors, such as Erhan et al. [60], Dauphin et al. [84] and Kingma & Ba [85], there is still room for improvement.

So far, most progress in deep learning has come through the exploration of different network set-ups and architectures. Most of these works are only validated through experimentation, and very little effort has been put into understanding why and how state of the art results can be achieved with deep neural networks [21,83]. Many of the improvements to the methodology of training deep networks, such as the introduction of rectified linear units [86] and dropout [87], are therefore based on heuristic arguments and have no basis in theory [83].

While the flexibility and expressiveness of neural networks are two of the reasons for their great success, these properties also come with drawbacks, the most obvious ones being the loss of interpretability and the difficulty of training the networks. Szegedy et al. [88] point out that the great expressiveness of the networks also causes them to generalize badly in the neighbourhood of examples in the dataset. They show that neural networks are very sensitive to small perturbations: it is possible to create examples that are almost exactly the same but are interpreted completely differently by a given network.


This also implies another drawback of models with high flexibility: errors and misreadings in the data may be perfectly incorporated into the model.

Another big challenge in the training of deep neural networks is the large number of hyper-parameters, that is, parameters that control the behaviour of the algorithms and must be optimized manually [83,89]. Because of this, training deep networks often requires expert knowledge, and the training process is by some considered a “black art”. This has created a need for experts to share their knowledge, and several guides on how to train deep architectures have been published, such as Hinton [29]. However, even if these guides can offer some insights and help, they cannot offer any general guidelines, since the optimal hyper-parameter settings depend on the data [89]. Many of the hyper-parameters concern the structure of the network, such as the number of layers and the size of each layer. It would therefore be desirable to have models that can adjust their structure automatically, a need described by Angelov & Sperduti [83]. There has been some work in this direction, such as that of Côté & Larochelle [90]. However, this problem is far from solved, and many of the manually selected hyper-parameters specify the trade-off between computational cost and performance, both of which increase with a more complex network. Since the most appreciated results are those from models with high performance, the aim of hyper-parameter optimization is often to find a model with as high performance as possible while keeping the computational cost tractable. This optimum is not fixed across time and machine configurations. Therefore, there is a need for a framework in which different models can be compared not only by their performance but also by their complexity [83]. With such a framework it would be possible to credit models with the optimal trade-off between performance and complexity, instead of just the models with the best performance.

Even though deep learning methods are more flexible and easier to modify than many classical methods, there are still limitations on how they can be applied to data that are very heterogeneous and complex [4]. This type of data is very common in industry, and novel methods to analyse such data are in great need [2].

A growing issue is that research on deep learning has become more and more focused on achieving the best performance metric on some specific problem. To achieve this, domain knowledge is often built into the models, making them more powerful but also less generalizable. This poses a threshold for the adoption of deep learning in new domains. Therefore, a general model that could be applied to any domain would be of great utility.


CHAPTER 2

PROBLEM SPECIFICATION

As mentioned in section 1.2, the term complex data is often used to highlight the non-triviality of the data analysis. While no profound definition is established, multiple authors have given their own definitions of complex data. One that is commonly used, for example by Blanco et al. [55] and Cao et al. [52], is that complex data are the same as high dimensional data that are heterogeneous. However, this definition does not imply that complex data are harder to analyse than any other type of heterogeneous data. There is nothing that prevents high dimensional data from being generated by a simple process, where all dependencies in the data can easily be found. If complex data were simply high-dimensional and heterogeneous data, their analysis would pose no additional challenges, and the definition would add nothing new to the analysis. Because of this, one of the aims of this research is to find a more thorough definition of complex data. This will be done by further expanding another definition, namely that complex data are data generated by complex interactions in the system being studied.

Much of today's research within the field of deep learning (and machine learning in general) is focused on developing new methods to analyse data. However, it is seldom reflected upon how the type, quality and complexity of the data affect the analysis. It is therefore of great interest to study how different properties of the data affect the result of an analysis algorithm. As described in section 1.5, the analysis of complex data is considered to be difficult [2,4]. However, it has so far not been investigated what makes this a hard problem. This work will therefore look at how properties of data can negatively affect the analysis with different deep learning algorithms, with a special focus on the properties possessed by complex data. This will be done iteratively through several case studies on real world data. Since no profound definition of complex data exists, there is currently no way to classify data as complex or not. Hence, it is impossible to take a purely theoretical approach to studying the properties of complex data, or to generate synthetic complex data from scratch. Instead, data that are referred to as complex need to be gathered and analysed through several case studies. Since it would be impossible to cover every dimension of complex data, especially without a finalized definition, this research will initially focus on the types of data and the specific properties listed below. While these lists are far from complete, and several new entries may be added, they specify a minimum of what should be investigated within this research. The types of data that will be investigated are:

• Data that are structured as a graph.

• Sequential data where both long and short time dependencies exist.

The initial focus in the study of data properties will be:

• The granularity of the data, that is, how fine-grained the measurements in the data are.


• Errors in the data and how they affect the result of an analysis.

• The number of interacting agents or parts in the system that has generated the data.

• Skewness of different classes in the data.

The conducted case studies will be used to validate the hypothesis that it is the properties of complex data that have prevented the use of deep learning methods in some fields. Through these case studies, knowledge about data properties will be gathered. The next step of this work will then be to refine current methods, or develop new ones, for the analysis of complex data; methods that do not suffer from the limitations identified in the case studies.

To conclude, the research questions of this work are:

Q1 What are complex data, and how can they be defined in terms of metrics?

Q2 What properties of complex data are problematic for current deep learning methods to handle, and what is the reason for this? The deep learning methods that will be considered are those described in section 1.4, for example RNNs and CNNs.

Q3 How can current deep learning methods be refined to handle these properties of the data, or do new methods have to be developed?

To answer these questions, the following objectives will be considered:

O1 Find a thorough and profound definition of complex data, grounded in the definitions in section 1.2, namely that complex data are data generated by systems with several interacting agents [57].

O2 Identify one or more case studies involving complex data, in which deep learning methods can potentially be used.

O3 Show which properties of the collected data are the limiting factors for the analysis.

O4 Set up a framework in which synthetic complex data can be generated. This framework will be used to evaluate the behaviour of different methods when the complexity of the data changes (a minimal sketch of such a generator is given after this list).

O5 Develop new methods, or refine existing deep learning methods, so that they can better cope with the limiting factors identified in O3.

O6 Generalize and summarize the knowledge gained from all case studies.
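As a hedged illustration of what the framework in O4 might look like, the sketch below generates data from a system of pairwise interacting agents, with the number of agents and the interaction strength as tunable knobs. Every name and parameter here is a hypothetical choice for illustration, not a finalized design.

```python
# Hypothetical sketch of a synthetic complex-data generator for O4:
# samples are produced by n_agents whose states interact pairwise, and
# interaction_strength moves the data from simple (0.0, purely additive)
# towards complex (strong non-linear interactions).
import numpy as np

def generate_complex_data(n_samples, n_agents, interaction_strength, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_agents))        # each agent's state
    W = rng.normal(size=(n_agents, n_agents))         # pairwise interaction weights
    additive = X.sum(axis=1)                          # simple, non-complex part
    interactions = np.einsum("si,ij,sj->s", X, W, X)  # pairwise interaction part
    y = additive + interaction_strength * interactions
    return X, y

# Sweeping the knob makes it possible to watch how a method behaves as
# the data transition from non-complex to complex.
for strength in (0.0, 0.5, 1.0):
    X, y = generate_complex_data(1000, 10, strength)
```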

The first two objectives, O1 and O2, will be conducted to answer the first of the posed questions, Q1. As stated in O1, a profound definition must be specified before Q1 can be answered. This definition must be reasonable and relevant; thus there must exist real world cases of data that correspond to the definition. The aim of O2 is therefore to find such cases. To be able to give a more refined answer to Q1 and to support O1 and O3, a framework will be set up to generate synthetic complex data, as stated in O4. This will allow for an analysis of how a given method behaves during a transition from non-complex to complex data, and will thus also help answer the second question, Q2. The framework would also specify certain measures of complex data. The second question, Q2, will then be answered using


the complex data from the real world cases identified in O2. As described in O3, this is done by finding the limiting factors in deep learning methods currently considered state of the art, with a special focus on limiting factors that arise due to the properties of the data. The limitations these properties enforce on the studied deep learning methods will be identified in O3 and will lead to the last question, Q3. As specified in objective O5, an attempt will be made to develop new methods, or refine current ones, so that they better handle the identified limitations. While both the questions and objectives have been numbered, this is only for convenience when referencing; the numbering does not state the priority or order in which the objectives and questions will be tackled. Furthermore, the process will not be linear but iterative. Especially objectives O1 and O3 will be iterated, and insights gained while working on one of them will feed back into the other. This process will be repeated until a solid definition of complex data has been reached, making the impact of further iterations insignificant. This allows for the conclusion of the research and thus the final objective, O6, in which the knowledge gained in all iterations will be summarized and generalized to compile a final answer to all research questions asked.

2.1 RELATED WORK

While many authors define complex data in their own way, the starting definition used in this work is similar to that given by Toubiana et al. [56]: complex data are data generated by complex interactions in the studied system. Thus, complex data are data collected from interactions in a complex system. The reason for this choice is that there already exists an agreed upon and thorough definition of complex systems. The first case studies will therefore investigate data that originate from different complex systems.

Haken [91], for example, gives the following definition of complex systems:

“Systems which are composed of many parts, or elements, or components which may be of the same or of different kinds. The components or parts may be connected in a more or less complicated fashion.”

As stated by Haken [91] himself, this is a somewhat naive and simplified definition of complex systems. However, it is sufficient for the purpose of this work. Examples of such systems are present in almost every branch of scientific research, such as social science, biology, engineering and physics. Regarding the data collected from complex systems, Haken [91] states that:

“The data to be collected often seem to be quite inexhaustible. In addition it is often impossible to decide which aspect to choose a priori, and we must instead undergo a learning process in order to know how to cope with a complex system.”

Haken [91] thus suggests that a complex system cannot be analysed by merely studying the data; the system has to be learned and understood first. This is most intriguing from a machine learning perspective, since it would require algorithms to first learn about the system before they are able to analyse the data. Algorithms that solely map a given input to an output in an advanced way would thus not be sufficient. Instead, there is a need for algorithms that can learn and keep track of the state of the system and determine how the system reacts to a given event. Machine learning must thus move towards algorithms that are able to learn systems, instead of algorithms that learn functions mapping inputs to outputs.


2.2 DELIMITATIONS

This work will be focused on the analysis of complex data using deep learning methods, and will therefore only consider and analyse models for such data. Since deep learning methods have shown promising results within this domain, the research will focus on deep learning methods alone; other methods, such as those within the field of statistical machine learning, will not be considered. Several different deep learning methods are presented in section 1.4. However, all of these may not be considered in the later research; some are only presented as a historical overview and as a natural passage to more advanced methods.


2.3 TIME PLAN

Table 2.1: Project time plan

Year 1: Literature review, Case study 1, Research proposal
Year 2: Case study 2, Case study 3, Paper 1, Paper 2, Research proposal
Year 3: Half way report, Case study 4, Define complex data, Paper 3 + 4, Paper 5
Year 4: Case study 5, Case study 6, Paper 6
Year 5: Generalize case studies, Write thesis, Paper 7


CHAPTER 3

METHOD

The first question posed in this research aims at finding a profound definition of complex data, for use within the field of data analysis and deep learning. This definition has to be grounded in previous work, as well as being relevant and applicable to real world datasets that are considered to be complex. The definition must also relate to adjacent fields that consider complex data, such as the study of complex systems. The first step of this research will therefore be a literature study, surveying current definitions and opinions, which will give an initial working definition of complex data. The main perspective will be that the term complex data originates from the study of complex systems, and such literature will therefore be studied alongside the literature on data analysis and deep learning. The focal point of the literature study will be to highlight previously proposed definitions and defining properties of complex data.

Several case studies will then be conducted to validate, refine and consolidate the produced definition, as well as to further study the properties of complex data and thus also answer the second research question. The choice of case studies as a method is grounded in the lack of theoretical work within the field of artificial neural networks, and hence within deep learning as well. Due to this lack of theory, most groundbreaking results are so far based on heuristic arguments drawn from solving specific cases, and the same approach will therefore be used within this research. The case studies will be selected so that an approach using deep learning methods is feasible and can be beneficial. The cases will also be selected to cover as many different types of complex data as possible, where each considered type must fulfil the current working definition of complex data. In the selection of case studies, the possibility to collaborate with industry will be viewed as positive, since this makes it possible to involve experts with domain knowledge in the case study, and hence easier to understand the studied data and to validate the obtained models.

It is likely that several different deep learning models will be applied in each case study. The selection as well as the training of models will consider similar studies where such models were applied successfully, which allows a comparison of the performance of a given model in the two cases. The aim will then be to connect differences in the performance of the model to properties of the data, leading to a greater understanding of the problem and the limitations posed by the data. Using this knowledge, the utilized model will be further refined to handle the studied problem. The aim of each case is thus to study the strengths and weaknesses of the selected models in relation to the properties of the data. This knowledge will then be incorporated into the proposed definition of complex data and considered in the selection of future case studies. The work is therefore an iterative process, where the result of each case study is incorporated into the definition of complex data and thus affects the selection of the next case study.

When a sufficient number of case studies have been conducted, a framework for generating data will be created. The framework will allow for the generation of data that possess the properties identified as characteristic of complex data. This will make it possible to study these properties at a finer granularity, and thus to further study the behaviour of different models as the data go from not being complex to being complex. It will also make it possible to study combinations of several properties that so far have only been observed in separate case studies. The framework will therefore help to provide a robust answer to the two latter research questions.
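To illustrate what such a framework could look like, the following sketch generates a regression dataset where a single parameter moves the target from a simple additive function of the features towards one dominated by feature interactions. The function and the chosen property are illustrative assumptions only, not the final framework.

    import numpy as np

    def generate_dataset(n_samples, n_features, interaction_strength=0.0,
                         noise=0.1, seed=0):
        """Generate data whose target moves from purely additive
        (interaction_strength = 0) to interaction-dominated
        (interaction_strength = 1)."""
        rng = np.random.default_rng(seed)
        X = rng.normal(size=(n_samples, n_features))

        # Simple part: a plain weighted sum of the individual features.
        w = rng.normal(size=n_features)
        additive = X @ w

        # Complex part: pairwise interactions, mimicking properties that
        # emerge from interacting components rather than single features.
        W = rng.normal(size=(n_features, n_features))
        interactions = np.einsum('ni,ij,nj->n', X, W, X)

        y = ((1 - interaction_strength) * additive
             + interaction_strength * interactions
             + noise * rng.normal(size=n_samples))
        return X, y

    # Sweep the property with a finer granularity than case studies allow.
    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
        X, y = generate_dataset(1000, 10, interaction_strength=alpha)

Sweeping such a parameter makes it possible to observe exactly where a given model starts to fail, rather than only comparing whole datasets.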

So far, contact has been established with Astra Zeneca and Outokumpu, which offer possible research problems within chemistry and steel manufacturing, two fields with many complex processes that generate large amounts of data. The preliminary results from these collaborations are described further in the next chapter.


CHAPTER 4

PRELIMINARY RESULTS

4.1 CASE 1 -- MOLECULE PROPERTY PREDICTION

A first study has been conducted in collaboration with Astra Zeneca. In this study, a flexible deep convolutional neural network method for the analysis of arbitrarily sized graph structures representing molecules was presented. This case thus investigates how data structured as graphs can be analysed using convolutional neural networks. The method makes use of RDKit92, an open-source cheminformatics software, allowing the incorporation of both global molecular information and local information. The model is evaluated on the SIDER (Side Effect Resource) v4.1 dataset, and it is shown that the presented model significantly outperforms another recently proposed model93.

The main challenge that this work addresses is how molecules can be analysed without any feature engineering. Instead, the molecules were represented as mathematical graphs, a commonly used representation. Another challenge faced in this paper was that molecules carry information on many different levels: there is local information about each atom and the bonds connecting them, as well as global information about the whole molecule.
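The exact featurisation used in the study is not reproduced here, but the following sketch shows how RDKit can turn a molecule into the kind of graph-structured input such a network consumes: local (per-atom) features, the bond structure as an adjacency matrix, and global (per-molecule) descriptors. The selected features are illustrative assumptions only.

    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    def molecule_to_graph(smiles):
        """Parse a SMILES string into (atom features, adjacency, global features)."""
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            raise ValueError('Could not parse SMILES: ' + smiles)

        # Local information: one feature vector per atom.
        atom_features = np.array([[atom.GetAtomicNum(),
                                   atom.GetDegree(),
                                   atom.GetTotalNumHs(),
                                   int(atom.GetIsAromatic())]
                                  for atom in mol.GetAtoms()], dtype=float)

        # Structure: adjacency matrix over the atoms, one entry per bond,
        # which accommodates arbitrarily sized molecules.
        n = mol.GetNumAtoms()
        adjacency = np.zeros((n, n))
        for bond in mol.GetBonds():
            i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
            adjacency[i, j] = adjacency[j, i] = 1.0

        # Global information: descriptors of the molecule as a whole.
        global_features = np.array([Descriptors.MolWt(mol),
                                    Descriptors.MolLogP(mol)])
        return atom_features, adjacency, global_features

    atoms, adj, glob = molecule_to_graph('CC(=O)Oc1ccccc1C(=O)O')  # aspirin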

Molecules can be considered complex systems, with atoms as agents that interact with each other94. Most properties of a given molecule thus do not emerge simply because it contains a certain number of a given atom, or because of some other simple metric. If, for example, two molecules that are both toxic are combined into a new molecule, the resulting molecule may be non-toxic. A molecule can therefore not be studied by only considering the atoms it consists of; their interactions and relations must also be considered.

In the paper “Improving the use of deep convolutional neural networks for the prediction of molecular properties”95, the results from this case study are reported, together with a reflection on how different subsets of molecular information, and the structure of the input data, affect the predictive power of the presented convolutional neural network. This reflection highlights several open problems that should be solved to further improve the use of deep learning within cheminformatics, and that may be studied further within this work; for example, the predictive performance could be improved by taking the class imbalance in the data into account. The study also shows the importance of selecting an optimisation method for the model in relation to the performance metric, something that is rarely reflected upon but needs to be highlighted.
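One standard way of taking such class imbalance into account, sketched below under the assumption of a binary side-effect label, is to weight the loss by inverse class frequencies so that the rare class is not drowned out; whether this particular remedy will be used remains an open question within this work.

    import numpy as np

    def class_weighted_log_loss(y_true, y_pred, eps=1e-7):
        """Binary cross-entropy where each class is weighted by its
        inverse frequency, so rare labels contribute as much as common ones."""
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)

        pos_frac = max(y_true.mean(), eps)
        weights = np.where(y_true == 1,
                           1.0 / pos_frac,
                           1.0 / max(1 - pos_frac, eps))

        loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return float(np.mean(weights * loss))

    # With 5% positives, always predicting "no side effect" is heavily
    # penalised once the loss is reweighted.
    y = np.array([0] * 95 + [1] * 5)
    print(class_weighted_log_loss(y, np.full(100, 0.05)))

This also illustrates the point about optimisation and metrics: a model trained on the unweighted loss can look good on accuracy while remaining useless on the metric that matters for rare side effects.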

4.2 CASE 2 -- STEEL ROLLING

A second case study has also been started in collaboration with the steel rolling company Outokumpu. In this case study, the aim is to detect what causes faulty steel sheets, using sensor readings from multiple machines in different parts of the process. As with the molecules, the factory where the steel is rolled into coils is a very complex system. A faulty coil can, for example, be produced due to the interaction of multiple events that each occur somewhere along the production line.

Another major problem that needs to be resolved is that the data are also multi-levelled: some readings represent the state of machines and furnaces, while other readings are connected to each individual metal sheet. It is also problematic that the amount of data may differ among metal sheets, since the final product is not always the same; the final coils can, for example, have different widths and thicknesses. Part of the steel rolling process is iterative, so the same metal sheet passes the same stations and sensors several times. However, the properties of the metal change between the passes. The length of the metal sheet, for example, increases during the process. Therefore, sensors that measure values every metre will give many more readings late in the process than at the beginning. This is problematic for most conventional analysis methods, since the amount of data varies considerably between samples.
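No model has been settled on for this case yet, but one common way of making such variable-length readings comparable, sketched below with hypothetical sensor arrays, is to resample every per-sheet series onto a fixed grid before feeding it to a network (padding combined with masking is an alternative).

    import numpy as np

    def resample_series(readings, target_length=128):
        """Linearly interpolate a per-sheet sensor series onto a fixed
        grid, so long and short series become comparable."""
        readings = np.asarray(readings, dtype=float)
        old_grid = np.linspace(0.0, 1.0, num=len(readings))
        new_grid = np.linspace(0.0, 1.0, num=target_length)
        return np.interp(new_grid, old_grid, readings)

    # The same sheet measured early (short) and late (long, after
    # elongation) in the process ends up with the same representation.
    early_pass = np.random.rand(40)   # ~40 m of sheet, one reading per metre
    late_pass = np.random.rand(300)   # the same sheet after elongation
    assert resample_series(early_pass).shape == resample_series(late_pass).shape

Resampling discards the absolute length of the sheet, so that would have to be reintroduced as a separate global feature, much like the global molecular descriptors in the first case.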


BIBLIOGRAPHY


1. Sagiroglu, S. & Sinanc, D. Big data: A review in Collaboration Technologies and Systems (CTS), 2013 International Conference on (2013), 42–47.

2. Boyd, D. & Crawford, K. Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society 15, 662–679 (2012).

3. McAfee, A., Brynjolfsson, E., Davenport, T. H., Patil, D. & Barton, D. Big data. The management revolution. Harvard Business Review 90, 61–67 (2012).

4. Chen, X.-W. & Lin, X. Big data deep learning: challenges and perspectives. IEEE Access 2, 514–525 (2014).

5. Kitchin, R. Big Data, new epistemologies and paradigm shifts. Big Data & Society 1, 2053951714528481 (2014).

6. Marx, V. Biology: The big challenges of big data. Nature 498, 255–260 (2013).

7. Hox, J. J. & Boeije, H. R. Data collection, primary vs. secondary. Encyclopedia of Social Measurement 1, 593–599 (2005).

8. Fan, J., Han, F. & Liu, H. Challenges of big data analysis. National Science Review 1, 293–314 (2014).

9. Chen, C. P. & Zhang, C.-Y. Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences 275, 314–347 (2014).

10. Yang, Q. & Wu, X. 10 challenging problems in data mining research. International Journal of Information Technology & Decision Making 5, 597–604 (2006).

11. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).

12. Krizhevsky, A., Sutskever, I. & Hinton, G. E. in Advances in Neural Information Processing Systems 25 (eds Pereira, F., Burges, C. J. C., Bottou, L. & Weinberger, K. Q.) 1097–1105 (Curran Associates, Inc., 2012). <http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf>.

13. Szegedy, C. et al. Going deeper with convolutions in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), 1–9.

14. Hinton, G. et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29, 82–97 (2012).

15. Zhang, X. & LeCun, Y. Text Understanding from Scratch. ArXiv e-prints. arXiv: 1502.01710 [cs.LG] (Feb. 2015).

16. Kim, Y. Convolutional Neural Networks for Sentence Classification. ArXiv e-prints. arXiv: 1408.5882 [cs.CL] (Aug. 2014).

17. Dahl, G. E., Jaitly, N. & Salakhutdinov, R. Multi-task Neural Networks for QSAR Predictions. ArXiv e-prints. arXiv: 1406.1231 [stat.ML] (June 2014).

18. Mayr, A., Klambauer, G., Unterthiner, T. & Hochreiter, S. DeepTox: toxicity prediction using deep learning. Frontiers in Environmental Science 3, 80 (2016).

19. Heaton, J. B., Polson, N. G. & Witte, J. H. Deep Learning in Finance. ArXiv e-prints. arXiv: 1602.06561 [cs.LG] (Feb. 2016).

20. Sutskever, I., Martens, J., Dahl, G. E. & Hinton, G. E. On the importance of initialization and momentum in deep learning. ICML (3) 28, 1139–1147 (2013).

21. Lin, H. W. & Tegmark, M. Why does deep and cheap learning work so well? ArXiv e-prints. arXiv: 1608.08225 [cond-mat.dis-nn] (Aug. 2016).


22. Domingos, P. A Few Useful Things to Know About Machine Learning. Commun. ACM 55, 78–87. ISSN: 0001-0782 (Oct. 2012).

23. Bengio, Y. Learning Deep Architectures for AI. Foundations and Trends® in Machine Learning 2, 1–127. ISSN: 1935-8237 (2009).

24. Le, Q. V. Building high-level features using large scale unsupervised learning in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on (2013), 8595–8598.

25. Hinton, G. E. & Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006).

26. Bengio, Y., Courville, A. & Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1798–1828 (2013).

27. Wall, M. E., Rechtsteiner, A. & Rocha, L. M. in A practical approach to microarray data analysis 91–109 (Springer, 2003).

28. Lee, D. D. & Seung, H. S. Algorithms for non-negative matrix factorization in Advances in Neural Information Processing Systems (2001), 556–562.

29. Hinton, G. A practical guide to training restricted Boltzmann machines. Momentum 9, 926 (2010).

30. Bengio, Y. Deep learning of representations: Looking forward in International Conference on Statistical Language and Speech Processing (2013), 1–37.

31. Hinton, G. E. & Zemel, R. S. Autoencoders, minimum description length and Helmholtz free energy in Advances in Neural Information Processing Systems (1994), 3–10.

32. Jaitly, N. & Hinton, G. Learning a better representation of speech soundwaves using restricted Boltzmann machines in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on (2011), 5884–5887.

33. Plötz, T., Hammerla, N. Y. & Olivier, P. L. Feature learning for activity recognition in ubiquitous computing in Twenty-Second International Joint Conference on Artificial Intelligence (2011).

34. Ngiam, J. et al. Multimodal deep learning in Proceedings of the 28th International Conference on Machine Learning (ICML-11) (2011), 689–696.

35. Srivastava, N. & Salakhutdinov, R. Learning representations for multimodal data with deep belief nets in International Conference on Machine Learning Workshop (2012).

36. Srivastava, N. & Salakhutdinov, R. R. Multimodal learning with deep Boltzmann machines in Advances in Neural Information Processing Systems (2012), 2222–2230.

37. Kim, Y., Lee, H. & Provost, E. M. Deep learning for robust feature generation in audiovisual emotion recognition in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on (2013), 3687–3691.

38. Kahou, S. E. et al. EmoNets: Multimodal deep learning approaches for emotion recognition in video. Journal on Multimodal User Interfaces 10, 99–111 (2016).

39. Gönen, M. & Alpaydın, E. Multiple kernel learning algorithms. Journal of Machine Learning Research 12, 2211–2268 (2011).


40. Bucak, S. S., Jin, R. & Jain, A. K. Multiple kernel learning for visual object recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 1354–1369 (2014).

41. Stevens, S. S. On the theory of scales of measurement (1946).

42. Velleman, P. F. & Wilkinson, L. Nominal, ordinal, interval, and ratio typologies are misleading. The American Statistician 47, 65–72 (1993).

43. Mosteller, F. & Tukey, J. W. Data analysis and regression: a second course in statistics. Addison-Wesley Series in Behavioral Science: Quantitative Methods (1977).

44. Buneman, P., Davidson, S., Fernandez, M. & Suciu, D. Adding structure to unstructured data in International Conference on Database Theory (1997), 336–350.

45. Kotsiantis, S. B., Zaharakis, I. D. & Pintelas, P. E. Machine learning: a review of classification and combining techniques. Artificial Intelligence Review 26, 159–190 (2006).

46. Steenbergen, M. R. & Jones, B. S. Modeling multilevel data structures. American Journal of Political Science, 218–237 (2002).

47. Hox, J. in Classification, data analysis, and data highways 147–154 (Springer, 1998).

48. Gao, Y., Wang, F., Luan, H. & Chua, T.-S. Brand data gathering from live social media streams in Proceedings of International Conference on Multimedia Retrieval (2014), 169.

49. Ma, L., Lu, Z., Shang, L. & Li, H. Multimodal convolutional neural networks for matching image and sentence in Proceedings of the IEEE International Conference on Computer Vision (2015), 2623–2631.

50. Wu, Z., Wang, X., Jiang, Y.-G., Ye, H. & Xue, X. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification in Proceedings of the 23rd ACM International Conference on Multimedia (2015), 461–470.

51. Bastian, M., Heymann, S., Jacomy, M., et al. Gephi: an open source software for exploring and manipulating networks. ICWSM 8, 361–362 (2009).

52. Cao, L., Zhang, H., Zhao, Y., Luo, D. & Zhang, C. Combined mining: discovering informative knowledge in complex data. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 41, 699–712 (2011).

53. Talia, D. Clouds for scalable big data analytics. Computer 46, 98–101 (2013).

54. Lum, P. et al. Extracting insights from the shape of complex data using topology. Scientific Reports 3, 1236 (2013).

55. Blanco, L., Crescenzi, V., Merialdo, P. & Papotti, P. Probabilistic models to reconcile complex data from inaccurate data sources in International Conference on Advanced Information Systems Engineering (2010), 83–97.

56. Toubiana, D., Fernie, A. R., Nikoloski, Z. & Fait, A. Network analysis: tackling complex data to study plant metabolism. Trends in Biotechnology 31, 29–36 (2013).

57. Van Leeuwen, M. & Knobbe, A. Non-redundant subgroup discovery in large and complex data. Machine Learning and Knowledge Discovery in Databases, 459–474 (2011).

58. Scott, S. & Matwin, S. Feature engineering for text classification in ICML 99 (1999), 379–388.


59. Wu, X., Zhu, X., Wu, G.-Q. & Ding, W. Data mining with big data. IEEE Transactions on Knowledge and Data Engineering 26, 97–107 (2014).

60. Erhan, D. et al. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research 11, 625–660 (2010).

61. Zeiler, M. D. & Fergus, R. Visualizing and understanding convolutional networks in European Conference on Computer Vision (2014), 818–833.

62. Le Cun, Y., Touresky, D., Hinton, G. & Sejnowski, T. A theoretical framework for back-propagation in The Connectionist Models Summer School 1 (1988), 21–28.

63. McCulloch, W. S. & Pitts, W. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics 5, 115–133 (1943).

64. Hebb, D. The organization of behavior; a neuropsychological theory. (Wiley, 1949).

65. Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65, 386 (1958).

66. Minsky, M. & Papert, S. Perceptrons. (1969).

67. Ngiam, J. et al. On optimization methods for deep learning in Proceedings of the 28th International Conference on Machine Learning (ICML-11) (2011), 265–272.

68. Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines in Proceedings of the 27th International Conference on Machine Learning (ICML-10) (2010), 807–814.

69. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Networks 61, 85–117 (2015).

70. Dreiseitl, S. & Ohno-Machado, L. Logistic regression and artificial neural network classification models: a methodology review. Journal of Biomedical Informatics 35, 352–359 (2002).

71. LeCun, Y. et al. Comparison of learning algorithms for handwritten digit recognition in International Conference on Artificial Neural Networks 60 (1995), 53–60.

72. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324 (1998).

73. Pineda, F. J. Generalization of back-propagation to recurrent neural networks. Physical Review Letters 59, 2229 (1987).

74. Mikolov, T. & Zweig, G. Context dependent recurrent neural network language model. SLT 12, 234–239 (2012).

75. Connor, J. T., Martin, R. D. & Atlas, L. E. Recurrent neural networks and robust time series prediction. IEEE Transactions on Neural Networks 5, 240–254 (1994).

76. Pascanu, R., Mikolov, T. & Bengio, Y. On the difficulty of training recurrent neural networks. ICML (3) 28, 1310–1318 (2013).

77. Graves, A. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850 (2013).

78. Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Gated feedback recurrent neural networks in International Conference on Machine Learning (2015), 2067–2075.

79. Goodfellow, I. et al. Generative adversarial nets in Advances in Neural Information Processing Systems (2014), 2672–2680.


80. Gauthier, J. Conditional generative adversarial nets for convolutional face generation. Class Project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter semester 2014, 2 (2014).

81. Radford, A., Metz, L. & Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).

82. Mirza, M. & Osindero, S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014).

83. Angelov, P. & Sperduti, A. Challenges in Deep Learning in Proceedings of the 24th European Symposium on Artificial Neural Networks (ESANN). i6doc.com (2016), 489–495.

84. Dauphin, Y. N. et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization in Advances in Neural Information Processing Systems (2014), 2933–2941.

85. Kingma, D. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

86. Maas, A. L., Hannun, A. Y. & Ng, A. Y. Rectifier nonlinearities improve neural network acoustic models in Proc. ICML 30 (2013).

87. Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1929–1958 (2014).

88. Szegedy, C. et al. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013).

89. Bergstra, J. & Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research 13, 281–305 (2012).

90. Côté, M.-A. & Larochelle, H. An infinite restricted Boltzmann machine. Neural Computation (2016).

91. Haken, H. Information and self-organization: a macroscopic approach to complex systems 2., enl. English. ISBN: 9783540662860; 3540662863 (Springer, Berlin, 2000).

92. Landrum, G. RDKit: Open-source cheminformatics. (Online). http://www.rdkit.org. Accessed 3, 2012 (2006).

93. Wu, Z. et al. MoleculeNet: A Benchmark for Molecular Machine Learning. ArXiv e-prints. arXiv: 1703.00564 [cs.LG] (Mar. 2017).

94. MacKerell Jr, A. D. et al. All-atom empirical potential for molecular modeling and dynamics studies of proteins. The Journal of Physical Chemistry B 102, 3586–3616 (1998).

95. Ståhl, N., Falkman, G., Karlsson, A., Mathiason, G. & Boström, J. Improving the use of deep convolutional neural networks for the prediction of molecular properties (2017).