51
Finnish Centre of Excellence for Algorithmic Data Analysis Research (Algodan) Director: Professor Esko Ukkonen Host organizations: University of Helsinki, Helsinki University of Technology Summary The importance of data analysis in science and in industry is increasing continuously, as our ability to measure and store data grows. While data analysis is as old as science it- self, the new methods of collecting raw data pose unprecedented challenges and oppor- tunities to data analysis and to the algorithms of data analysis. The Algorithmic Data Analysis (Algodan) Centre of Excellence develops new concepts, algorithms, principles, and frameworks for data analysis. The work combines strong basic research in computer science with interdisciplinary work in a wide variety of sci- entific disciplines and industrial problems. The research of the Algodan CoE lies in the areas of combinatorial pattern matching, data mining, and machine learning. The work in Algodan is strongly interdisciplinary: we cooperate constantly with application experts in various application areas, formulat- ing novel computational concepts and ways of attacking the scientific and industrial problems of the application areas. Developing new concepts and algorithms is an itera- tive process consisting of interacting extensively with the application experts, formulat- ing computational concepts, analyzing the properties of the concepts, designing algo- rithms and analyzing their performance, implementing and experimenting with the al- gorithms, and applying the results in practice. The main application areas of the Algo- dan CoE are in biology, medicine, telecommunications, environmental studies, linguis- tics, and neuroscience. The formulation of new computational concepts, their analysis, and the design of algo- rithms are some key ingredients that make the Algodan CoE unique. First, rather than concentrating on improvements to existing problems and methods, the CoE focuses on defining new tasks where significant impact can be made by introducing new concepts. Second, we emphasize the need for analyzing the performance of the algorithms, in- stead of just relying on heuristic approaches. Third, we use our strong background in algorithmic and probabilistic methods to guarantee that our algorithms perform well both in terms of modelling accuracy and robustness, and in terms of computational complexity and practical efficiency. The research in Algodan is grouped under four interacting themes: sequence analysis, learning from and mining complex and heterogeneous data, discovery of hidden struc- ture in high-dimensional data, and foundations of algorithmic data analysis. All these themes combine aspects of combinatorial pattern matching, data mining, and machine learning. The host organizations of the Algodan CoE are University of Helsinki and Helsinki University of Technology. The CoE is in part a continuation of the “From Data to Knowledge” CoE, and consists of about 70 persons. The director of the Algodan CoE is Professor Esko Ukkonen and the vice-director is Academy Professor Heikki Mannila.

Finnish Centre of Excellence for Algorithmic Data Analysis …€¦ · Finnish Centre of Excellence for Algorithmic Data Analysis Research (Algodan) Director: Professor Esko Ukkonen

  • Upload
    vandiep

  • View
    219

  • Download
    0

Embed Size (px)

Citation preview

Finnish Centre of Excellence for Algorithmic Data Analysis Research (Algodan) Director: Professor Esko Ukkonen Host organizations: University of Helsinki, Helsinki University of Technology

Summary The importance of data analysis in science and in industry is increasing continuously, as our ability to measure and store data grows. While data analysis is as old as science it-self, the new methods of collecting raw data pose unprecedented challenges and oppor-tunities to data analysis and to the algorithms of data analysis. The Algorithmic Data Analysis (Algodan) Centre of Excellence develops new concepts, algorithms, principles, and frameworks for data analysis. The work combines strong basic research in computer science with interdisciplinary work in a wide variety of sci-entific disciplines and industrial problems. The research of the Algodan CoE lies in the areas of combinatorial pattern matching, data mining, and machine learning. The work in Algodan is strongly interdisciplinary: we cooperate constantly with application experts in various application areas, formulat-ing novel computational concepts and ways of attacking the scientific and industrial problems of the application areas. Developing new concepts and algorithms is an itera-tive process consisting of interacting extensively with the application experts, formulat-ing computational concepts, analyzing the properties of the concepts, designing algo-rithms and analyzing their performance, implementing and experimenting with the al-gorithms, and applying the results in practice. The main application areas of the Algo-dan CoE are in biology, medicine, telecommunications, environmental studies, linguis-tics, and neuroscience. The formulation of new computational concepts, their analysis, and the design of algo-rithms are some key ingredients that make the Algodan CoE unique. First, rather than concentrating on improvements to existing problems and methods, the CoE focuses on defining new tasks where significant impact can be made by introducing new concepts. Second, we emphasize the need for analyzing the performance of the algorithms, in-stead of just relying on heuristic approaches. Third, we use our strong background in algorithmic and probabilistic methods to guarantee that our algorithms perform well both in terms of modelling accuracy and robustness, and in terms of computational complexity and practical efficiency. The research in Algodan is grouped under four interacting themes: sequence analysis, learning from and mining complex and heterogeneous data, discovery of hidden struc-ture in high-dimensional data, and foundations of algorithmic data analysis. All these themes combine aspects of combinatorial pattern matching, data mining, and machine learning. The host organizations of the Algodan CoE are University of Helsinki and Helsinki University of Technology. The CoE is in part a continuation of the “From Data to Knowledge” CoE, and consists of about 70 persons. The director of the Algodan CoE is Professor Esko Ukkonen and the vice-director is Academy Professor Heikki Mannila.

ii

Finnish Centre of Excellence for Algorithmic Data Analysis Research (Algodan) Contact information:

Table of contents Part I: Research plans 2008 – 2013................................................................................1 1. Research programme.................................................................................................1

1.1 Introduction ........................................................................................................1 1.2 Structure and background of the Algodan CoE ...................................................3 1.3 Main themes of Algodan .....................................................................................5 1.4 Research tasks .....................................................................................................6 1.5 Organization and schedule ................................................................................18

2. International significance and innovativeness ..........................................................20 3. Inter- and multidisciplinary work.............................................................................21

3.1 General remarks ................................................................................................21 3.2 Current collaboration partners in applications ...................................................22 3.3 Example challenge problems .............................................................................24

4. Integration of research with the host organizations’ strategies .................................26 5. Comparison with the previous CoE ........................................................................27 6. Personnel ................................................................................................................27 7. Funding...................................................................................................................28 8. Exit strategy ............................................................................................................28

Director: Prof. Esko Ukkonen Helsinki Institute for Information Technology HIIT Department of Computer Science University of Helsinki, P.O. Box 68 FIN-00014 University of Helsinki Phone: +358 9 191 51280 Email: [email protected]

Vice-director: Academy Prof. Heikki Mannila Helsinki Institute for Information Technology HIIT Laboratory of Computer and Information Science, Helsinki University of Technology P.O. Box 5400, FIN-02015 TKK Phone: +358 9 191 51246 Email : [email protected]

iii

Part II: Research environment ....................................................................................29 9. Added value ...........................................................................................................29 10. Collaboration with foreign research units .............................................................29 11. Collaboration with Finnish research units..............................................................33 12. Collaboration with enterprises ...............................................................................36 13. Management and cohesion between research teams ..............................................38 14. Promotion of research careers (Women and young researchers)............................39 15. Facilities ................................................................................................................40 16. Promotion of creative research..............................................................................40 17. Contribution to various sectors of society .............................................................41 Part III: Success and potential of the proposed CoE in researcher training .................42 18. Organization of researcher training .......................................................................42 19. Degrees .................................................................................................................43 20. Plan of degrees ......................................................................................................43 21. Need of researchers...............................................................................................43 Part IV: Scientific merits and output ...........................................................................44 22. Publications and other output ...............................................................................44 23. Improving professional and public understanding of science.................................47 24. Prizes and scientific honours .................................................................................47 25. Positions of trust ...................................................................................................47 26. Visits .....................................................................................................................48 27. Involvement in international research programmes ...............................................48

1

__________________________________ Part I: Research plans 2008 – 2013 __________________________________

1. Research programme

1.1 Introduction The importance of data analysis in science and in industry is increasing continuously. This is partly due to technological factors: computers are becoming faster and cheaper, memory sizes increase, the cost of telecommunications is going down, and new meas-urement technologies and sensors are revolutionizing data gathering methods. Many areas of research can be called data-rich. Thus the need for new solutions is increasing all the time. Simultaneously, advances in computational methods have shown that there is a wealth of potential in the development of new computational methods for data analysis. While data analysis is as old as science itself, the new methods of collecting raw data pose unprecedented challenges to data analysis methods. The volume of the data is huge, there are many different types of data, the dimension of the data is large, and the models that one would like to study are much more complex than before. Defining the appropriate computational concepts or model classes is difficult, and designing suffi-ciently fast methods of computation requires deep knowledge of algorithms; no in-crease in computational power will change this situation. Algorithms are at the core of data analysis, as emphasized by two influential recent re-ports1 on computational science. The Algorithmic Data Analysis (Algodan) Centre of Excellence develops new algorithms, principles and frameworks for computational data analysis. The work combines strong basic research in computer science with interdisci-plinary work in a wide variety of applications. The research of the Algodan CoE falls within the areas of combinatorial pattern match-ing, data mining, and machine learning. Combinatorial pattern matching is an area of algorithmics that develops efficient methods for detecting regularities in discrete data. It has a central role in various important application areas such as bioinformatics, natural language processing, and information retrieval. Data mining develops techniques of pattern and structure discovery in heterogeneous and high-dimensional data sets. It has in recent years emerged as a large research theme on the border of algorithms, data-bases, and statistics; the use of methods from combinatorial pattern matching and ma-chine learning is crucial in data mining. Machine learning has its roots in artificial intel-

1 “Steering the future of computing”, Editorial, Nature 440, 383 (23 March 2006); “Computational Science: Ensuring America's Competitiveness", President's Information Technology Advisory Committee (PITAC), http://www.nitrd.gov/pitac/reports/20050609_computational/computational.pdf.

2

ligence and in the attempt to build intelligent systems. Recent developments in machine learning emphasize advanced statistical methods as well as the probabilistic and data-driven nature of learning. This brings the field close to the algorithmic methods of pat-tern matching and data mining. These three areas are thus highly connected. In these areas the emphasis of the Algodan CoE is on basic algorithmic research in computer science, combined with strong commitment to interdisciplinary work. The process of developing new concepts and algorithms is an iterative process consisting of the following ingredients:

1. Deep and continuous interaction with the application experts 2. Formulating computational concepts 3. Analyzing the properties of the concepts 4. Designing algorithms and analyzing their performance 5. Implementing and experimenting with the algorithms 6. Applying the results in practice.

The work in Algodan is strongly interdisciplinary: we cooperate with application experts in various areas, looking for novel computational concepts and ways of attacking the scientific and industrial problems of the applications. The main application areas of the Algodan CoE are in biology, medicine, telecommunications, environmental studies, lin-guistics, and neuroscience. (See Sections 3 and 10-12 for a more detailed description of the cooperation.) The scientific results and publications from these collaborations de-pend strongly on the novel concepts. In many areas, the collaborations include funding of Algodan researchers by the application groups and interchange of personnel. The formulation of new computational concepts, their analysis, and design of algo-rithms are some key ingredients that make the Algodan CoE unique: rather than con-centrating on improvements to existing problems and methods, the CoE focuses on finding new tasks where significant impact can be made by developing new concepts. We emphasize the need for analyzing the performance of the algorithms, instead of just relying on heuristic approaches. We use our strong background both in algorithmic and probabilistic methods to guarantee that our algorithms perform well both in terms of modelling accuracy and robustness, and in terms of computational complexity. In Algodan the interplay between concept formation and algorithm design is one of the key aspects. The work proceeds by first defining the type of structure (probabilistic model, pattern class, or other representation) that can be used to describe the salient structure of the data in the application, and only then designing efficient algorithms for finding the structure. The key researchers of Algodan have extensive experience in this type of work, and the success of the previous CoE “From Data to Knowledge” has demonstrated the utility of the approach. The different application areas provide inter-esting questions for basic research in computer science, and the techniques used in one application area can often also be used in other areas. The research in the Algodan CoE is grouped under four interacting themes: sequence analysis, learning from and mining complex and heterogeneous data, discovery of hidden structure in high-dimensional data, and foundations of algorithmic data analysis. These areas all contain as-pects of combinatorial pattern matching, data mining, and machine learning.

3

The structure and background of the Algodan CoE is given in Section 1.2, while the above four themes are described in Section 1.3. More details of the research tasks are given in Section 1.4, and the organization and schedule are discussed in Section 1.5.

1.2 Structure and background of the Algodan CoE The Algodan CoE contains five research teams. The team leaders are Prof. Esko Ukkonen, Prof. Heikki Mannila, Prof. Hannu Toivonen, Dr. Aapo Hyvärinen, and Prof. Jyrki Kivinen. The boundaries between the teams are not strict, and there is day-to-day interaction between the teams. We expect that in the period 2008 – 2013 new teams will emerge and the structure of the existing teams will change. We next describe briefly the teams and the background of the team leaders. The team of Prof. Esko Ukkonen consists of 4 senior members (Ukkonen, Helena Aho-nen-Myka, Kjell Lemström, and Juha Kärkkäinen), 5 postdocs, and 7 Ph.D. students. The team is algorithmic in character: the main interests are in sequence analysis, string algorithms, combinatorial pattern matching, and network models. In applications, the team works on genome structure, metabolic modelling, gene regulation, music informa-tion retrieval, and language technology. The team of Academy Prof. Heikki Mannila consists of 3 senior members (Mannila, Aristides Gionis, Jaakko Hollmén), 10 postdocs and 10 Ph.D. students. The back-ground of the team is in algorithms and data mining. The main research emphasis is on sequence analysis (e.g., finding orders and segmentation), mining complex and hetero-geneous data, especially pattern discovery, and structure discovery in high-dimensional data sets. The team has strong ties to application groups in gene mapping and genome structure, environmental sciences, linguistics, and telecommunications. Prof. Hannu Toivonen’s team has 2 senior members (Toivonen, Petteri Sevon), 1 post-doc, and 6 Ph.D. students. The group works on pattern discovery, especially on struc-tured and heterogeneous data, and also on sequences. The main applications are in bio-informatics, genetics, and in ubiquitous computing, in tight collaboration with applied scientists and companies. The senior members of Prof. Jyrki Kivinen’s team are Kivinen and Juho Rousu; addi-tionally, the group has 2 postdocs and 3 Ph.D. students. The main focus of the team is in machine learning, especially on-line learning, learning structured representations, and kernel methods. In particular, the team studies theoretically well-founded methods and rigorous performance bounds for them. This includes analyzing the methods both in the classical statistical setting and in an online setting, where many of the classical as-sumptions can be avoided. The main applications are systems biology and language technology. The team of Dr. Aapo Hyvärinen has 2 senior members (Hyvärinen, Hoyer), 3 postdocs and 4 Ph.D. students. The main competence of the team is in non-gaussian probabilis-tic models. The work of this team is centred on a recent revolution in probabilistic modelling of continuous-valued data: exploitation of non-gaussian structure in the data.

4

Cooperation between the teams. The teams in Algodan cooperate on a day-to-day basis. As mentioned, the team boundaries are not strict, and many researchers work in ad hoc collaborations that span several teams. Out of the five team leaders, Ukkonen, Mannila, Toivonen, and Kivinen have collaborated on many occasions over the years, while Hyvärinen is a new addition to the CoE. The different backgrounds and ap-proaches of the teams have high potential for combination of ideas, and we expect that the period 2008 – 2013 will demonstrate very interesting work spanning various data analysis methods. Brief background of the team leaders. The work of Esko Ukkonen on combinatorial pattern matching started already in the 1980s. The group has developed several meth-ods that belong to the core algorithmics of approximate pattern matching, DNA se-quence assembly, and pattern matching in large masses of text or images. Many of the results have been included in recent international textbooks in the field. A very recent successful work by Ukkonen’s group is a novel algorithm for finding gene regulatory elements in the DNA of mammals. The results, verified in vivo, were published in Cell. Heikki Mannila’s group focuses on data mining. Mannila has introduced and analyzed several central data mining concepts, including the problem of finding integrity con-straints from databases. With Professor Hannu Toivonen he studied association rules and introduced the problem of finding episodes from sequences. Episodes have been applied to telecommunications and to medical genetics. Recently, Mannila’s group has defined the concepts of recurrent sources in sequences and studied the problem of finding orders from unordered data. Heikki Mannila received the ACM SIGKDD In-novation award in 2003. He coauthored the book “Principles of Data Mining” with David Hand and Padhraic Smyth (MIT Press 2001). Mannila also has wide industrial experience from Microsoft Research (Redmond, USA) and Nokia Research Centre. Hannu Toivonen has in recent years concentrated on the development of data analysis methods for human genetics, as well as on context-sensitive computation. The group has developed novel concepts and methods for gene mapping, for instance, based on discovery of genetically motivated tree-structured patterns. The group introduced (vari-able length) Markov models to the problem of reconstructing haplotype strings. These methods have turned out to be very useful in the practice of medical genetics. In con-text-sensitive computation, Toivonen and his group developed the ContextPhone software that is in wide use in several research institutions all over the world. Toivonen has worked on a large number of applications, and also has six years of industrial ex-perience from Nokia Research Centre. He is currently a visiting scientist at the Univer-sity of Freiburg. Jyrki Kivinen has specialized in computational learning theory. His main themes have been online prediction with sparse targets or wide margins and combinations of ex-perts. He has also worked on computational issues in data mining. In 2003, Kivinen returned from Australian National University. He is currently the President of the As-sociation for Computational Learning. Aapo Hyvärinen is a leading expert in non-gaussian probabilistic modelling of continu-ous-valued data. Hyvärinen’s FastICA method for independent component analysis (ICA) is computationally very efficient, and has become the most widely-used ICA method in the world. The software package has been downloaded more than 10,000

5

times. Hyvärinen coauthored the book “Independent Component Analysis” (Wiley, 2001) with E. Oja and J. Karhunen. Recent work by Hyvärinen’s team emphasizes ap-plications of non-gaussian probabilistic models in neuroinformatics, and the discovery of causal models.

1.3 Main themes of Algodan The main research themes of the Algodan CoE are the following. S Sequence analysis L Learning from and mining structured and heterogeneous data D Discovery of hidden structure in high-dimensional data F Foundations of algorithmic data analysis There is considerable overlap between the themes: certain algorithmic and probabilistic techniques occur in many themes. In the same way, several themes can be used for a single application. We next describe the themes briefly. Sequence analysis considers the algorithmic techniques for sequential data. The key meth-ods in the theme are string algorithms, pattern discovery techniques, dynamic pro-gramming, and probabilistic modelling. Examples of the algorithmic tasks in the area are approximate string matching, episode discovery, and finding motifs and orders from data. The techniques of sequence analysis have numerous applications in, for ex-ample, gene mapping, finding regulatory regions in genomes, telecommunications, lin-guistics, and paleontology. Most applications have multiple types of data objects, many different types of data, etc., instead of the classical situation of a single table with observations and variables. Learn-ing from and mining structured and heterogeneous data looks for techniques for data analysis tasks involving such data sets. The methods studied are pattern discovery, prediction of structured objects, the analysis of flows, etc. The applications include biological data analysis, information retrieval, telecommunications, and environmental studies. Algo-rithmic techniques for probabilistic modelling are crucial in this theme. The high dimensionality of many datasets causes interesting modelling problems and leads to extremely challenging algorithmic questions. The third theme, discovery of hidden structure in high-dimensional data, looks at how to find latent structure in high-dimensional data sets. The latent structure can be in the form of components, as in independent component analysis, or cluster-like structures, or it can be a parsimonious model giving weight only to a small fraction of the observed variables. The techniques in this theme are based on probabilistic modelling, with a strong algorithmic component. The theme on foundations of algorithmic data analysis looks at the frameworks of algo-rithmic data analysis. What can be said about the limitations of pattern discovery? What are the fundamental bounds on the efficiency of string algorithms? What is the compu-tational complexity of fitting probabilistic models of a certain type? Questions such as these abound in algorithmic data analysis, and they are fascinating problems in core computer science.

6

1.4 Research tasks In this section we list the concrete research tasks we plan to work on in the period 2008 – 2013. Obviously, new tasks will be developed during this period, and some tasks will be dropped.

S: Sequence analysis Sequential data occurs in many applications: telecommunication alarm logs, actions of users of a computer system, DNA sequences, documents, newsfeeds etc. are all exam-ples of inherently sequential data. The research tasks in this main theme are connected with sequences. The basic algorithmic techniques are dynamic programming, greedy set cover methods, matching algorithms etc. Many problems that are NP-hard in general situations can be solved in polynomial time on sequential structures by using, e.g., dy-namic programming, which can also be used in the estimation of probabilistic models. The first task of this theme is developing string pattern matching and pattern synthesis algorithms, with applications in various domains. The other tasks include segmenting sequences, finding orders from data, and developing kernel methods for sequences. In each of these the role of combinatorial algorithms is strong. S.1: String algorithms. This is an extensive task that continues our long-term and suc-cessful work on combinatorial pattern matching algorithms and their applications in the analysis of text strings, biological sequences, and generalized sequences such a music files. This work also supports many other research tasks of Algodan unit. On the theoretical side, text indexing and motif synthesis are the main questions. We participate with our international collaborators in the currently very active work on compressed self-indexes for texts. It is possible to represent a text string and a full in-dex of its all substrings in less space than originally needed by the text alone. What op-erations can be supported by such structures and their practical applicability are still largely open questions we plan to work on. We plan to use our Generic Library of String Algorithms (GLAS) as the platform for algorithm engineering work on this area. One particularly interesting theme is indexing of XML data. There is still no clear un-derstanding of what kind of queries need to be answered, what kind of patterns should be discovered, and so on. Our aim is to develop new index structures suitable for in-dexing XML documents. The experience of the researchers on full-text indexes pro-vides a starting point for handling the text aspect of XML documents, but new tech-niques are needed to handle the structural relationships in the tree. The goals include an efficient support for even the more complicated XML queries, small size using com-pressed data structure techniques, and advanced pattern discovery.

7

Developing indexing structures for music files is another indexing problem we will work on. Here quite intriguing problems are encountered: a good index should be in-variant against time scaling and transposition of music as well as robust against local fluctuations. So a special type of approximate matching is needed. Our earlier work on geometric distance functions for polyphonic music serves as a starting point. Novel dis-tance functions, which possibly will be tuned using machine learning methods, are needed. In applied work on biological sequence analysis we will work on different motif finding problems. Using Hidden Markov techniques and our idea of founder-based modelling of genotypes and haplotypes we plan to find disease-associated signals and motifs from sequence datasets provided by our partners. We also continue our work on genome-wide prediction of regulatory modules using dynamic programming methods and novel characterizations of such modules. This work is done within a European consortium in close collaboration with biologists. Teams: Ukkonen, Mannila, Toivonen; persons: E. Ukkonen, J. Kärkkäinen, K. Lemström, V. Mäkinen, T. Kivioja, H. Mannila, H. Ahonen-Myka, M. Koivisto, H. Toivonen. S.2: Finding orders from data. Tasks involving ordering are often encountered in computer science: sorting was one of the earliest problems considered in the study of algorithms, and finding top-k elements is a typical question in information retrieval. Recently, based in part on the work done by the researchers of Algodan, the problem of finding orders has been studied from totally new viewpoints: the order relations be-tween the items must be inferred from the data, and the best structure to be inferred is often something else than a total order or a top-k list. The first problem, finding the ordering relation between items (or within sets of items), has been approached by us by using the spectral methods, fragments, and probabilistic techniques, including Markov chain Monte Carlo methods. The problem of finding or-dering relations from complex data sets is far from being solved, and we expect to make significant progress in this field by further developing these ideas and methods. The second problem is finding good ways of representing ordering information. In many applications, a total order is usually unnecessary – the user is probably not inter-ested in the relative ordering of the items 562 and 563 in the ranked list. On the other hand, general partial orders are often too complex to be of use to the user interpreting the results. We have studied the concept of bucket orders, or orders with ties. They form a middle ground between the total orders and general partial orders. The bucket orders are usually very natural and intuitive for the user: when people are asked to rank items they usually rank them in a dozen or so buckets (one example of such system is the five star ranking of movies). We expect that representing ranking with bucket orders, or mixtures of bucket orders, are in many real world applications the best representation. The Algodan CoE will design efficient algorithms that can solve this newly formulated ranking problem. One application of ordering, and an example of an extremely successful interdiscipli-nary collaboration, is seriation, or finding the temporal order of fossil sites, a central

8

problem in paleontology. We have in the last years developed algorithms for this task. The methods have a large potential in enabling full use of existing large fossil databases. Further challenges include the discovery of partial orders from data, a problem that has applications also in the combination of ranked data. Similar ordering problems are en-countered in non-gaussian Bayesian networks, where the ordering is related to causality. We will investigate the possibility of using the algorithms developed in this task in Bayesian networks as well. The applications include paleontology, linguistics, ecology, and language technology. Teams: Mannila, Hyvärinen, Ukkonen; persons: H. Mannila, K. Puolamäki, A. Gionis, A. Hyvärinen, P. Hoyer, E. Ukkonen. S.3: Sequence segmentation and structure. In many applications the sequences have a natural segmental structure. For example, a DNA sequence has segmental duplica-tions, inversions, etc. A newsfeed, on the other hand, can be viewed as a merge of sev-eral sequences, each corresponding to a particular topic or theme. Each topic is active only for a limited period of time, but at the same time many topics will be active. This merged mixture of segments is also the structure of telecommunication alarm logs and user interface logs. We study methods for finding the global structure of sequential data. For sequence segmentation, the task is to find a decomposition of the sequence into a small number of segments each of which can be viewed as having a unique source. In most applica-tions it is useful to consider the case where the number of sources is much smaller than the number of segments. The task of finding k segments from h sources, called (k,h)-segmentation, has been introduced by us a few years ago, and the basic framework is very promising. Simple algorithms can be shown to have a guaranteed small approxima-tion factor, and the practical performance results are excellent. There are several algorithmic open problems, e.g., characterizing for which classes of probabilistic sources does the method work, extending the methods to a fully Bayesian treatment of segmentation, designing good approximation methods that work in sub-quadratic time, etc. Also the question of evaluating the significance of discovered seg-ments needs further work. The other problem is finding the sources such that the input sequence can be consid-ered to be a random merge of these sources. Defining the correct computational con-cepts is far from trivial, and it is not yet clear how to do this. The work will proceed by, once again, interplay of theory and practice, i.e., by investigation of whether the con-cepts make sense in the applications, and in the development of algorithms for the con-cepts. This research task has a strong interplay of probabilistic modelling and algorithmic techniques. The models considered are general probabilistic ones, while the basic algo-rithmic methods are dynamic programming approaches generalized from string tech-niques, combined with approximation methods. The main applications of the task are genome studies, newsfeed tracking in language technology, and linguistics.

9

Teams: Mannila; Toivonen, Ukkonen; persons: H. Mannila, H. Ahonen-Myka, A. Gionis, H. Toivonen, E. Ukkonen, E. Bingham. S.4: Kernel algorithms for sequences. Kernel methods such as support vector ma-chines (SVMs) have been successful in many application domains. Using them for structured data such as sequences requires the definition of a feature map translating the structured object into an inner product space and algorithms to compute the kernel values, the vector inner products in that space. While many kernel algorithms exist for sequences, they typically use time quadratic in the length of the sequences and in the number of data points, which constitutes a major bottleneck for scalability. Efficient string algorithms to approximate sequence kernels are thus required. In structured output learning with kernels, one typically has some feature representation of the output as well, and the prediction of the model is a feature vector rather than a structured object such as a sequence. Thus, one is faced with the problem of computing the pre-image of the feature map, i.e., the (set of) objects that are (approximately) mapped to a given feature vector. We will develop efficient pre-image algorithms for string and subsequence kernels and apply them in structured prediction problems in bioinformatics and natural language processing. A further goal is development of effi-cient pre-image algorithms for complex structures such as trees and graphs, and com-plexity analysis of pre-image algorithms. Teams: Kivinen, Ukkonen; persons: J. Rousu, V. Mäkinen, E. Ukkonen, J. Kärk-käinen, T. Mielikäinen

L: Learning from and mining structured and heterogeneous data Traditionally, data analysis has been concerned with data consisting of a single observa-tional matrix of observations and variables. However, most applications have multiple types of data objects, many different types of data, etc.: for example, in biological and ecological datasets and in information retrieval the schema of the data is very complex. The tasks in this theme consider different ways of obtaining patterns and models from structured and heterogeneous data. One task considers the theory, applications, and limitations of pattern discovery from heterogeneous data, while another looks at the definition of similarities of objects on the basis of contextual information. The other tasks involve XML, language technology, learning of structured objects via kernel methods, and the analysis of flows. L.1: Pattern languages and search algorithms for heterogeneous data. Many data mining tasks can be characterized as pattern discovery of the following form: given a class of patterns, a selection predicate, and a data set, find all patterns that satisfy the selection predicate in the database. The pattern class determines what can be discovered and expressed about the data set. The pattern language and properties of the selection

10

predicate together determine the complexity of the discovery task. Certain simple and popular instances of the problem are already quite well understood (e.g., the discovery of association rules). However, most of the interesting cases are complex, and efficient algorithms are needed to find optimal or near-optimal solutions. There have been ef-forts, for instance, on mining patterns of first-order logic, and on mining sequences, trees, and graphs. A central theoretical data mining task is to specify and characterize pattern discovery tasks of interest. We will study pattern languages relevant for complex and heterogene-ous data. The main questions will be to define the appropriate pattern classes and selec-tion predicates. That is, what types of patterns are needed, and how are the interesting patterns selected? The pattern classes will probably be based on Boolean formulae or models on substruc-tures of the data, using the successful examples from, for example, episode patterns on sequences. Traditionally the selection predicates have been based on frequency of oc-currence of the pattern or other quantities that are simple to compute. Especially monotone selection predicates behave well. We plan to emphasize heavily other types of selection predicates, e.g., ones based on similarities and analogies between a small number of objects in the data. The research has applications in various areas, including genome structure, mining biological databases, and ecology. Teams: Toivonen, Mannila, Ukkonen; persons: H. Toivonen, H. Mannila, P. Sevon, A. Gionis, T. Mielikäinen, J. Seppänen, E. Ukkonen, J. Kärkkäinen. L.2: Contextual similarities and analogies. Similarity is a central concept in data mining: discovery of patterns or regularities as well as clustering and classification are often based on finding similarities between data points. Usually similarity is based on direct distance measures. Indirect similarities have been investigated, e.g., in text analy-sis: two terms can be considered similar, if they tend to occur in similar contexts. Simi-lar approaches are promising in other contexts as well, especially in domains where there is no meaningful way to define direct similarity. Contextual similarity can be de-fined recursively (objects are similar if their contexts are similar), leading to issues of first, second and higher-order similarities. An orthogonal approach to contextual similarity is hierarchically defined analogy: two structures can be considered similar if many of their parts and their relationships are similar. Discovery of such analogies gives an interesting possibility for inference: if two objects are analogous by some of their features, inferences or predictions can poten-tially be drawn about their other features. While the mainstream data mining and ma-chine learning research tries to find regularities that hold for many objects, here a goal is also to discover strong analogies between any pair of objects. The aim in this task is to define useful similarity and analogy measures for complex data, especially graph-structured data, and to devise efficient search algorithms for find-ing analogies between objects. The problem can be computationally very complex. For instance, when data objects are described as graphs, a simple form of the analogy search problem can be stated as the maximum common subgraph isomorphism problem. We plan to experiment with several different pattern classes and selection predicates, inves-

11

tigating the practical utility and the algorithmic properties. The main applications of the work are in mining biological databases and ecological datasets. Teams: Toivonen, Hyvärinen; persons: H. Toivonen, P. Sevon, A. Hyvärinen. L.3: XML Mining. In XML documents, textual content is organized with a hierarchi-cal structure. When information is collected from many sources, a document collection typically is heterogeneous, both in terms of the topics and the structure of the docu-ments. For instance, if biomedical information is collected from several sources, these sources probably provide both similar information (the same topic, but varying struc-ture) and complementary information (both topics and structures vary). We address the problem of mining collections of heterogeneous XML documents. Most of the XML mining related research so far restricts its scope to some aspects of XML only. For in-stance, often only structure is considered, or the textual content only with very limited local structural context, if any. We will define classes of patterns that take into account both the hierarchical tree struc-ture and sequential natural language content, and develop discovery methods for these pattern classes. We will consider, for instance, clustering of heterogeneous collections of XML documents, i.e., identifying homogeneous classes for both structure and con-tent. Teams: Ukkonen, Kivinen; persons: A. Doucet, H. Ahonen-Myka, M. Salmenkivi, J. Rousu, J. Kärkkäinen. L.4: Minimally supervised methods for text understanding. Information Extrac-tion (IE), or fact finding, consists of finding instances of particular types of events, fo-cal events, in a large body of text, e.g., a news stream. A crucial sub-problem is the un-derstanding of entities which act as participants in these events. IE converts from plain text to a structured representation. To date, IE systems have been built by two methods: by manual writing of rules (pat-terns) and ontologies of terms, or by annotating examples of events and entities in large amounts of text for supervised learning. Both of these approaches necessitate a sub-stantial amount of human labour. Our focus is on truly minimally-supervised methods for building IE systems. The methods are general and applicable not only to IE, but to a host of other text-understanding applications (such as question answering and sum-marization). We have had some success in devising bootstrapping methods for acquiring knowledge for text understanding. The methodology for discovering patterns that express facts is similar to the methodology for discovering classes of terms. Given a very small num-ber of seed examples of patterns that describe a given topic (e.g., “corporate lawsuits”, or “epidemics of infectious disease”) we can bootstrap a large set of relevant patterns (that approaches a complete cover for the topic). A parallel procedure finds terms be-longing to a given semantic class, starting from a small number of seed examples. These procedures are related conceptually to the research on emerging patterns in data

12

mining. Discovery of large semantic classes of terms is also closely related to research on the semantic web. A recent realization is that the precision of the bootstrapping procedures may be greatly improved if, in addition to the focal domain (the topic or semantic class that we seek), seeds are provided also for “negative” categories, i.e., domains that we are not interested in. However, these new approaches are far from being well understood. Just as the seeds for the focal domains, the seeds for negative domains must be given explicitly, which is an undesirable requirement. What is the nature of the interplay between the competing (positive and negative) domains? Where do the candidates for the negative domains come from, and what distinguishes between good and bad candidates? These are the questions we propose to investigate. Thus, in contrast to traditional approaches to machine learning methods for knowledge acquisition for text understanding, we view the problem as essentially multi-class, and, in a sense, maximally multi-class. Whereas it is common to view multi-class problems as a technical inconvenience, and hence to decompose N-class problems into N binary classification problems, we wish to learn simultaneously about as many domains as is feasible. Teams: Ukkonen, Toivonen; persons: H. Ahonen-Myka, R. Yangarber, H. Toivonen. L.5: Prediction of structured objects. The desired output of prediction tasks in ap-plication domains is often structured. For example, image segmentation, recognition of hand-written words, genome annotations, and classification of objects into a taxonomy all are prediction tasks where the output has structure. Traditionally, machine learning has treated such tasks by decomposing the problem into series of scalar classification or regression problems, and then combining the results to make the global prediction. Re-cently, methods that do not (fully) decompose the target but predict it as a whole (or in chunks) have emerged and have shown to have superior performance. We concentrate on developing kernel-based methods for structured outputs with applications in bioin-formatics (sequence annotation, protein structure and function prediction) and in lan-guage technology (hierarchical text classification, statistical machine translation). The goals include scaling up the methods to very large datasets, and developing a learning theory for structured prediction. Teams: Kivinen, Ukkonen; persons: J. Rousu, M. Kääriäinen, J. Kivinen, V. Mäkinen. L.6: Analysis of flows. Flows occur in various applications. For example, migration of persons within a country, the transfer of goods, or radio frequency identification (RFID) logs can be viewed as a large dataset consisting of edges (start, end), or paths containing also intermediate location and timing information. Other types of flows arise from modelling and analysis of the metabolic system of cells. In metabolic networks a basic question is to construct consistent metabolic network models from genome, gene expression, proteome, and metabolome data. This is a challenge for integrated data analysis utilizing heterogeneous data, and the available solutions so far are not satisfac-tory. For metabolic flows we can apply, quite unexpectedly, techniques from classic data flow algorithms to monitor the flow of atoms as well as advanced linear and non-linear

13

optimization methods to solve the reaction rates (flows) of metabolic reactions. The flow of atoms and molecules in such a network is of major interest in understanding the dynamics of metabolism and also in biotechnological applications where one wants to optimize the cell to maximally yield some useful product. For analysis of sets of edges or paths the useful models and algorithms are far from clear at present; we view the area as one that has high potential in the future. Teams: Ukkonen, Mannila, Kivinen; persons: E. Ukkonen, H. Mannila, J. Rousu, A. Gionis, T. Mielikäinen.

D: Discovery of hidden structure in high-dimensional data The high dimensionality of many datasets poses interesting modelling problems and leads to extremely challenging algorithmic questions. The third theme in Algodan is dis-covery of hidden structure in high-dimensional data. The focus in this theme is on finding latent structure in high-dimensional data sets. The latent structure can be in the form of com-ponents, as in independent component analysis, or in cluster-like structures, or it can be a parsimonious model giving weight only to a small fraction of the observed variables. The techniques in this theme are based on probabilistic modelling, with a strong algo-rithmic component. D.1: Parsimonious models for high-dimensional data. If the data set has very high dimension, then predictive models (i.e., multivariate regression) and descriptive models (i.e., clustering based on mixture models) can be difficult to interpret. Parsimonious or sparse models are models where only a subset of the variables is used in the model. For regression models, different priors on the functions can be used to achieve this, but the algorithmic problems are not easy. For clustering, striving for sparse solutions leads to the problem of subspace (or subset) clustering, where cluster membership depends for each cluster on a cluster-specific set of variables. Again, the algorithmic problems are hard. The goals of the initial stages of this work are to find solid concepts of subset clustering and good algorithms for estimating the parameters of the models. The computational complexity of subset clustering is also of interest. The applications include ecology and bioinformatics, where collaborating domain experts can give judgments about the plau-sibility of the resulting, parsimonious models. As an example, understanding the struc-ture and evolution of communities of species requires novel methods. The same is true for the analysis of large phenotypic databases in medical genetics. Teams: Mannila, Hyvärinen; persons: J. Hollmén, H. Mannila, A. Gionis, M. Salmenkivi, S. Hyvönen, A. Hyvärinen D.2: Decomposition into dependent components. Independent component analysis (ICA) has been tremendously successful in many applications; many data sets are strongly non-gaussian, and this makes ICA a good choice of a decomposition method.

14

However, most realistic datasets cannot be adequately described as a linear combination of strictly independent components. There is thus a need for more flexible representa-tions. One approach is to consider decompositions which allow for some statistical depend-encies between the components. We formulate hierarchical models in which higher lev-els determine the dependencies on lower levels, whereas the lower levels determine the actual decomposition of the data. The higher-level parameters that determine the de-pendencies are estimated simultaneously with the lower-level parameters, i.e., the de-composition. In some cases, it is possible to find methods that correctly estimate the lower-level parameters even in the presence of the dependencies induced by the higher level. Another approach is to consider general nonlinear transformations, instead of the linear transformation used in the basic case. This means that the data is considered to be a nonlinear transformation of latent independent components, and likewise, the compo-nents are obtained as nonlinear functions of the data variables. In the most general case, using completely flexible nonlinearities leads to problems of identifiability, as we have shown previously. Fortunately, in many applications it is possible to consider a restricted class of nonlinear transformations whose parameters can still be identifiable. We are also developing extensions of independent component analysis to discrete-valued data. Formulating the component analysis problem for binary data leads to problems that are NP-hard and difficult to approximate. However, some algorithmic techniques have promising behaviour in these settings. The discrete component analysis question also leads to interesting conjectures on the relationship of the Boolean rank and real rank of binary matrices. We also plan to work on the concept of effective di-mension of binary data. Nonlinear and dependent component methods can be applied to virtually all the application areas of the CoE, including but not limited to neurosci-ence, bioinformatics and systems biology, ecology, and linguistics. Teams: Hyvärinen, Kivinen, Mannila; persons: A. Hyvärinen, P. Hoyer, J. Hurri, E. Bingham, A. Gionis, J. Kivinen, P. Kaski, H. Mannila D.3: Non-gaussian Bayesian networks and causality. The ultimate goal of much of the empirical sciences is the discovery of causal relationships between variables. When randomized experiments are not feasible one must rely on uncontrolled data when in-ferring causal structure. This “causal discovery” problem has recently attracted signifi-cant attention in the machine learning community. We have recently developed a new framework, based on independent component analysis, for learning linear non-gaussian Bayesian networks. Using non-gaussianity, the method is able to learn the causal structure from continuous-valued data in cases where classic methods fail. Future work will focus on developing new algorithms in this framework under more challenging assumptions, and combining the framework with existing ones. Important extensions include estimation in the presence of latent confounding variables; analysis of a combination of uncontrolled and controlled data (when such is available); using

15

active learning methods; and development of non-linear extensions of the basic method, including the case where some of the variables are discrete-valued. Especially interesting are methods combining the present approach with some of the classic tech-niques for learning Bayesian networks. The method can be applied in the analysis of a variety of data being studied in this CoE, in particular brain imaging data and gene expression data and other data related to gene regulation, genome structure, and systems biology. Teams: Hyvärinen, Mannila; persons: P. Hoyer, A. Hyvärinen, M. Koivisto, E. Bingham. D.4 Spatiotemporal data and models. Several applications contain multidimensional data with additional spatial or temporal components. For example, in ecology and pale-ontology we have for each location presence/absence data with additional spatial and temporal information. For microarray measurements, we know the expression values of genes of an organism with possibly spatial information about the location of the tissue, and the location of the gene in the genome. Additionally, some microarray datasets have a time-series nature. Analysing such high-dimensional datasets with spatial and temporal character poses interesting challenges, both for modelling and for algorithms. As an example, the data analysis of microarrays lacks commonly accepted powerful tools. For biological sequences there are standard tools such as BLAST that rapidly find similarities (homologies) between the sequences. It is not clear whether such simple and widely applicable models of search can be developed for microarray data. We plan to work on structure discovery in high-dimensional data with spatiotemporal aspects, con-centrating in the applications on microarray data and on ecological data sets. Teams: Ukkonen, Mannila, Hyvärinen; persons: J. Hollmén, J. Seppänen, E. Ukkonen, H. Mannila, S. Hyvönen, A. Hyvärinen, M. Salmenkivi.

F: Foundations of algorithmic data analysis The fourth theme looks at the fundamental issues of algorithmic data analysis. What can be said about the limitations of pattern discovery? What are the fundamental bounds on string algorithms? What is the computational complexity of fitting probabil-istic models from a model class? Questions such as these abound in algorithmic data analysis, and they are fascinating problems in core computer science. All teams of Al-godan participate in this theme. It should be emphasized again that the boundaries be-tween the tasks are not strict: many of the tasks listed in the earlier themes contain foundational studies, and the tasks described in this section also draw their motivation from more applied research themes. F.1: Theory of string matching. String matching and related questions, such as index-ing, are well understood theoretically for the case of exact matching and lossless data compression. However, this is not true for the case of approximate matching. Finding approximate matches, synthesis of internal motifs in strings or segmenting strings into

16

similar sections, and finding a clustering for a set of strings are examples of important problems whose solution requires deep understanding of approximate matching. For approximate string matching, the properties of the methods for matching and for filtering and indexing depend deeply on the underlying distance or similarity function. Hamming and Levenshtein distances and their weighted versions are standard examples of string distance functions, used in numerous applications in text processing, bioin-formatics and elsewhere. We will work towards a unifying theoretical framework for string distance functions. The framework should facilitate systematic study of various problems that relate to ap-proximate matching. We anticipate this to lead to improved understanding of the trade-offs between time and space as well as to new useful families of distance functions. We see it possible that for example the kernel approach currently widely used in machine learning can also give interesting string distance functions. Another already classic but not really utilized approach is to define string distance measures by using the relative compressibility of the strings. Finally it seems that in some applications such as music retrieval the distance function should be learned or at least fine-tuned from data. Teams: Ukkonen, Kivinen; persons: E. Ukkonen, J. Kärkkäinen, V. Mäkinen, K. Lemström, J. Rousu. F.2: Estimation of highly complex probabilistic models. Many of the probabilistic models we want to use are difficult to estimate with classic methods such as maximum likelihood. A classic problem is that model probability density does not integrate to unity, and the integral is too difficult to compute, so the model remains non-normalized. In other cases, the model includes latent variables which cannot be easily integrated out. We have recently started developing a new framework for estimating non-normalized models. This is based on matching the gradients of the log-densities of the model and data distribution, which can be computed easily by using a trick of partial integration. Regarding latent variables, we are investigating a similar idea where we match the den-sity functions of the model and the data using a suitable metric. By a judicious choice of the metric, the computation can become quite simple. In structured output learning with kernels, similar issues are encountered. So-called Max-Margin Markov models have proven to be effective; there, one learns max-margin parameters of a conditional Markov random field, so that the margin of the correct structure to the worst incorrect structure is maximized in the feature space. Develop-ment of such methods would be tremendously useful in all the applications of probabil-istic modelling, e.g., vision modelling, analysis of brain imaging data, ecology, and phe-notype analysis. Teams: Hyvärinen, Kivinen; persons: A. Hyvärinen, J. Rousu

17

F.3: Assessment of data mining results. The key aspect of the Algodan CoE is to look at novel computational concepts and to search for efficient algorithms for com-puting them. Determining the statistical significance of the results is, of course, very important for the applications: does the result tell us something real, or would we have obtained the same type of results even for random data? For complex structured data traditional significance testing methods based on distributional assumptions are seldom applicable. Randomization techniques based on permutation tests, bootstrapping, etc. can, on the other hand, be used fairly often. However, randomization tests are not always easy to apply. For example, for sequence analysis or segmentation algorithms it is not immediately clear what the appropriate randomization method would be. The same is true for complex data. We have recently studied with good success the use of generation of matrices with given row and column sums in assessment of data mining results; a crucial part is the analysis of the algo-rithmic properties of the methods. Work similar to this is clearly needed for other types of data mining, especially for evaluating the results of mining methods on complex and heterogeneous data. Teams: Mannila, Toivonen, persons: H. Mannila, A. Gionis, M. Koivisto, T. Mielikäinen, H. Toivonen, P. Sevon. F.4: Sums of products. Data analysis with probabilistic models often involves the computation of large sums of products: an example is structure discovery in Bayesian network models. Recent results in the CoE include an exact algorithm for learning Bayesian networks of moderate size, and a major improvement in the complexity of partition problems, which for example lowers the running time on exact algorithms for graph colouring from O(p(n) 2.4023n) to O(p(n)2n), where p(n) is a polynomial. The task involves a study of the basic computational properties of such computations, combin-ing various techniques (dynamic programming, inclusion-exclusion expressions, Fourier and Möbius transformations). We plan to continue the research on sum-product algorithms in several directions. On one hand, we plan to study new techniques and their combinations for classical, theo-retically important problems. On the other hand, we will apply the methodology in spe-cific data analysis contexts. Specifically, we will design algorithms for sequence segmen-tation and for variants of variable length Markov models and apply them to genome data. To make some algorithms feasible for realistic problem sizes, we will also study approximation schemes that are derived from exact sum-product algorithms. Teams: Mannila, Kivinen, persons: M. Koivisto, H. Mannila, P. Kaski, J. Rousu, J. Kivinen. F.5: Semi-supervised and active learning and active data mining. In many applica-tions, unlabeled data (inputs for the predictor) are abundant but labelling (obtaining the desired outputs) is difficult. This is the case, e.g., in text classification, and also on do-mains with complex and thus expensive to obtain structured outputs. One learning-theoretic model for such situations is semi-supervised learning, where the learning algo-rithm uses cheap unlabeled data to compensate for the small amount of expensive la-

18

belled data. Another model is active learning, where the learning algorithm is allowed to steer the labelling effort towards data points that are most likely to yield useful in-formation. Thus, such active learning algorithms can be useful also in systematizing and automating the experimental design problem on applications ranging from biology to linguistics. Developing efficient algorithms for these models and understanding their theoretical background are currently very significant research topics in machine learn-ing. Techniques from online learning will be useful here, both in truly online active set-tings where the unlabeled data comes as a stream, and also in the batch setting where online algorithms may provide fast solutions with provable performance guarantees. Further challenges are provided by extending the methods to deal with more complex structured outputs. Teams: Kivinen, Hyvärinen, Mannila; persons: M. Kääriäinen, J. Kivinen, J. Rousu, T. Mielikäinen, P. Hoyer.

1.5 Organization and schedule The Algodan CoE consists of five research teams, as described above. The research teams are organized on the basis of basic approaches and shared interests. The bounda-ries of the teams are not strict, as mentioned: many persons participate in the work of several teams. The teams are strong individually, but the major strength in Algodan comes from the interaction and cooperation of the teams. This cooperation has many forms, from day-to-day discussion on general approaches to joint publications to educational activities. The relationships of the teams and tasks are shown in Table 1. In formulation the re-search plan we have emphasized the tasks that will benefit from the varied expertise of the different themes: the potential for deep collaboration is obvious in most of the tasks, and we expect to see many fine results from joint operation of the teams. A rough schedule of the tasks is given in Table 2. Obviously, the plans for the first three-year period 2008 – 2010 are more developed than those for the second period of 2011 – 2013.

19

Tasks

Tea

ms

Tea

m H

yvär

inen

Tea

m K

ivin

en

Tea

m M

anni

la

Tea

m T

oivo

nen

Tea

m U

kkon

en

S.1 String algorithms x x x S.2 Finding orders from data x x x S.3 Sequence segmentation and structure x x x S.4 Kernel algorithms for sequences x x

L.1 Pattern languages for complex data x x x L.2 Contextual similarities x x L.3 XML Mining x x L.4 Minimally supervised methods for text understanding x x L.5 Prediction of structured objects x x L.6 Analysis of flows x x x

D.1 Parsimonious models x x D.2 Decomposition into dependent components x x x D.3 Non-gaussian Bayesian networks x x D.4 Spatiotemporal models x x x F.1 Theory of string matching x x F.2 Estimation of highly complex models x x F.3 Assessment of data mining results x x F.4 Sums of products x x x F.5 Semi-supervised and active learning and mining x x x

Table 1. The tasks and the teams of the Algodan CoE. An “x” implies that the team participates in the task; a bold “x” means that the team has the main responsibility for the task.

20

Task

Y

ear

Cur

rent

200

8

200

9

201

0

201

1

201

2

201

3

S.1 String algorithms S.2 Finding orders from data S.3 Sequence segmentation and structure S.4 Kernel algorithms for sequences L.1 Pattern languages for complex data L.2 Contextual similarities L.3 XML Mining L.4 Minimally supervised methods for text … L.5 Prediction of structured objects L.6 Analysis of flows D.1 Parsimonious models D.2 Decomposition into dependent components D.3 Non-gaussian Bayesian networks D.4 Spatiotemporal models F.1 Theory of string matching F.2 Estimation of highly complex models F.3 Assessment of data mining results F.4 Sums of products F.5 Semi-supervised and active learning and mining

Table 2. A rough schedule of the tasks in the Algodan CoE. A black cell means that the task is under intensive study, a dark gray cell denotes medium activity, and a light gray cell means initial feasibility studies or final activities.

2. International significance and innovativeness The Algodan CoE combines strong basic research in computer science with interdisci-plinary work in a wide variety of scientific disciplines and industrial problems. It also combines world-class expertise in many areas of algorithms and their applications, and will produce high-quality results with a strong impact. The main approach of Algodan has its unique features: we stress the development of new computational concepts that are motivated by applications, and we analyze the computational properties of the concepts rigorously, while also taking care of the prac-tical applicability of the methods. This style of work has in the past been very success-ful, and we expect that the success will continue. Our past work also means that the collaboration partners are eager to use our methods and they have a strong commit-ment to the work that we are pursuing. Several of the application challenges are very difficult, and require long-term work from both sides of the cooperation; we are in the good position of having solid relationships with the application partners, and new sug-gestions for collaboration arrive regularly. The results of Algodan will also be visible in terms of actual software prototypes for the application areas and their use in the re-search.

21

The researchers in Algodan have over the years pioneered several different algorithmic concepts and methods that have found wide applicability and have been adopted by several research groups elsewhere. In particular, in the areas of pattern discovery, inde-pendent component analysis and algorithms on strings we have had seminal interna-tional impact. We expect to produce strong results also in the future. The plan of Algodan for 2008 – 2013 focuses on algorithmic themes that will have a strong impact on the theory and practice of data analysis. The interplay between meth-ods and approaches from discrete algorithms and from probability and statistics is one of the key themes in our research agenda. Statistics has over the years developed ex-tremely powerful methods for modelling various phenomena; the emphasis has been on relatively complicated models (as the phenomena are complex, the models also have to be complex). Such complex models are necessary in many situations, but they often lead to computationally infeasible algorithms. Discrete algorithmic techniques, on the other hand, have been concentrating on finding fast ways for producing understandable summaries from various types of high-dimensional or heterogeneous data. The combi-nation of these two research approaches will be one of the focus points of the work of the Algodan CoE. Recent work in the CoE in, e.g., using hidden Markov techniques for modelling genotypes has led to extremely good results and shows that such combina-tions of techniques can be found and that they can be effective. This combining aspect is central in all four main themes of the research plan, and it is present in almost all of the tasks listed above. Another general challenge is combining different types of data in data analysis, the main point of part L in the research plan. The CoE is ideally suited for developing the novel concepts and algorithms needed for handling this type of situations: our background in modelling, discrete algorithms, and databases means that the necessary conceptual tools are available. As an example, consider the question of modelling flows on cells or other contexts. One of the challenges is finding good ways of using the different types of data (genome, proteome, metabolome). We expect that this work will lead to useful new computational concepts and algorithms. The significance of the Algodan CoE will also be visible in the conceptual frameworks of algorithmic data analysis. For example the work on methods and limitations of pat-tern discovery methods, theory of string matching, assessment of data mining results and estimating non-normalized models all have a large potential.

3. Inter- and multidisciplinary work

3.1 General remarks The work of the Algodan CoE combines basic research in computer science with ap-plied interdisciplinary work in various applications. The CoE has an extensive network of collaborations in several different areas of science and industry. In this section we describe some aspects of the collaborations; details of the partners are given in Sections 10 and 11.

22

As mentioned in the introduction, intensive discussions with the application group concentrate on formulating the relevant computational concepts. The analysis of the computational properties of the concepts and the design and implementation of the methods happens within Algodan. As soon as preliminary experimental results are available, discussions with the application groups resume, resulting possibly in the re-formulation of the computational concepts. The cycle from concept formation to early computational results can on many occasions be quite fast, just months or even weeks. The collaborations of the Algodan CoE with the other sciences can be grouped into the following areas: • Bioinformatics and systems biology (gene regulation, genome structure and map-

ping, metabolic modelling) • Environmental research (atmospheric research, ecology, paleontology, forest re-

search) • Linguistics (language variation and change, onomastics, language technology) • Neuroscience (analysis of brain imaging data, vision modelling) In the period 2008 – 2013 the role of environmental research will increase: this area is data-rich, and there is a great need for more realistic models for understanding the ef-fects of global change. Other expanding applications areas are forest industry and other types of process industry. We are also increasing the efforts in linguistics, where the new approaches to the analysis of linguistic data will have a strong impact.

3.2 Current collaboration partners in applications This section describes our collaboration network in the application areas, concentrating on cooperation with Finnish research groups. In bioinformatics and systems biology, we have several long-term collaborations with the leading research units in the area. As an example, in the area of medical genetics we collaborate with the groups of Leena Peltonen (University of Helsinki and National Public Health Institute) and Juha Kere (Karolinska Institute); Peltonen and Kere are members of the CoE on Disease Genetics. The collaboration involves joint supervision of Ph.D. students, sharing of personnel, and funding of Algodan researchers from sources of the application groups. In the closely related area of cytomolecular genetics, we work with Prof. Sakari Knuutila’s group; the collaboration includes shared funding for a joint Ph.D. student. In gene regulation, we collaborate with Prof. Jussi Taipale (CoE on Translational Genome-Scale Biology) and Prof. Irma Thesleff (CoE on De-velopmental Biology) in methods for finding regulating modules of motifs. In metabolic modelling we have a long collaboration with Prof. Merja Penttilä's group (VTT Biotechnology). The group is an expert in yeast biotechnology and can provide up-to-date mass-spectrometry and NMR data. One of our former PhDs works in Pent-tilä's group. Another long collaboration is with Dr. Alvis Brazma (European Bioinfor-matics Institute). We work on the analysis of gene expression and gene regulation. Brazma' group is famous for their work on building microarray data repository. Brazma gives us a good source of data and an excellent international networking. With

23

Prof. Liisa Holm's group (Institute of Biotechnology, University of Helsinki) we work on predicting biological functions from biological sequences using machine learning methods. Prof. Holm is well-known researcher bioinformatics who has developed widely-used software for protein analysis. The above-mentioned groups have excellent sources of data, and in the next years the amount of data and its complexity will grow explosively. Most of the research tasks of Algodan listed above can be applied to some aspects of bioinformatics and systems bi-ology. The sequential structure of genomes means that the sequence analysis theme is especially important, but also the other themes are highly relevant: for example the study of heterogeneous data and component techniques for high-dimensional data oc-cur in the mining of biological databases and in the study of phenotypic data. In environmental research there are three main lines of application partners. In ecology and paleontology we study the spatiotemporal dynamics of communities of species. This will be done in collaboration with Prof. Mikael Fortelius and Prof. Jukka Jernvall (University of Helsinki), as well as Prof. Ilkka Hanski (CoE on Metapopulation re-search); we are currently pursuing a European project aiming at the study of the interac-tion of species communities and environmental factors. Fortelius is funding a Ph.D. student in the Algodan CoE. The second theme is collaboration with the Markku Kul-mala’s group (CoE on Physics, Chemistry and Biology of Atmospheric Composition and Climate Change). The third theme in environmental research is the emerging col-laboration with Finnish Forest Research Institute (Metla), with Prof. Kari Mielikäinen and Prof. E. Tomppo on aspects of forest growth. In the environmental research, the themes on sequence analysis, mining heterogeneous data, and discovering structure on high-dimensional data are especially relevant. In linguistics and language technology the synchronic and diachronic corpora provide unprecedented possibilities of describing the structure of components of language, for modelling change in language, and for developing new techniques for natural language processing. Our main emphasis will be on corpus based studies, where the relatively recently started collaborations with Prof. Terttu Nevalainen (CoE on Language Varia-tion and Changes), Prof. Ritva-Liisa Pitkänen (University of Helsinki and Kotus), and Prof. Ulla-Maija Kulonen will have high priority. In language technology, the collabora-tors include Prof. Kimmo Koskenniemi and Lauri Carlson (University of Helsinki). In neuroinformatics the main collaborators are Dr. Simo Vanni (Helsinki University of Technology), Dr. Vesa Kiviniemi (Oulu University Central Hospital), Dr. Fabrizio Esposito (University of Naples), Prof. Elia Formisano and Prof. Rainer Goebel (Uni-versity of Maastricht) on analysis methods for brain image data. The work on vision modelling is done in collaboration with Dr. Pentti Laurinen (Department of Psychol-ogy, University of Helsinki); the collaboration involves a jointly supervised Ph.D. stu-dent. Our collaboration with industry can be viewed under four themes: biotechnology, data-base systems, information extraction, and telecommunications. Details of participating companies are given in Section 12. Table 3 describes in more detail the relationship between the application areas and the tasks described in part 1.4. As can be seen, many of the application areas can be ad-

24

dressed by a variety of methods, and most methodological tasks are applicable to sev-eral areas of applications.

Task

App

licat

ion

Gen

ome

stru

ctur

e

Gen

e re

gula

tion

Met

abol

ic m

odel

ling

Atm

osph

eric

dat

a

Eco

logy

, pal

eont

olog

y

For

est

rese

arch

Lin

guis

tics

Lan

guag

e te

chno

logy

Neu

rosc

ienc

e

Bio

tech

nolo

gy

Dat

abas

e sy

stem

s

Inf

orm

atio

n ex

trac

tion

Tel

ecom

mun

icat

ions

S.1 String algorithms x x x x x x x x

S.2 Finding orders from data x x x x x

S.3 Sequence segmentation and … x x x x x x x

S.4 Kernel algorithms for sequences… x x x x x x

L.1 Pattern languages for complex … x x x x x x x

L.2 Contextual similarities x x x x x

L.3 XML Mining x x x

L.4 Minimally supervised methods … x x

L.5 Prediction of structured … x x x

L.6 Analysis of flows x x x x x x x

D.1 Parsimonious models x x x x x x

D.2 Decomposition into dependent … x x x x

D.3 Non-gaussian Bayesian nets x x x x

D.4 Spatiotemporal models x x x x

F.1 Theory of string matching x x

F.2 Estimation of highly complex x x x x x

F.3 Assessment of data mining … x x x x x x x

F.4 Sums of products x x x

F.5 Semi-supervised and active … x x x x x x x

Table 3. The tasks and application areas of the Algodan CoE. An “x” implies that the results of the tasks can probably be used in the application area.

3.3 Example challenge problems In this section we briefly state some examples of larger challenge problems we plan to work on, and list the most relevant tasks for these challenges. Challenge 1: Learning network structures Network-like structures are numerous in various domains. For example, the molecular level processes of a living organism can be modelled by using gene regulatory networks and metabolic networks. The human society is full of different social networks. The

25

Internet is a huge network, with a plethora of subnetworks. New computational meth-ods are needed for finding the structure of such networks and for understanding their dynamic behaviour. As explicit knowledge of the networks is typically missing, we have to resort to computational techniques that reconstruct networks from data. As grow-ing amounts of data describing such networks are available, we see improved possibili-ties for significant progress on this very demanding area. Especially relevant tasks: L.6 Analysis of flows; S.2 Finding orders from data; D.3 Non-gaussian Bayesian networks; F.5 Semi-supervised and active learning and mining. Challenge 2: Finding structure and knowledge in natural-language data How is information conveyed in natural language? Natural language, and text in par-ticular, is a remarkably effective and efficient medium, in the sense that it compactly conveys a wealth of information to a human reader. Yet, computers – as good as they are at amassing ever growing volumes of text – remain poor at making sense of the information contained in it. The problem of text understanding is hard in large part be-cause language conveys meaning at multiple levels, and permits a seemingly limitless variety of ways to say the same thing. Also, understanding requires some model of prior domain knowledge. For example, in several state-of-the-art application areas, such as text summarization, information extraction, question answering, topic- or do-main-specific patterns of expression and prior facts are explicitly encoded or inferred from sets of training examples via supervised learning. Manual or supervised knowl-edge engineering is costly and does not scale to novel domains. Especially relevant tasks: S.1 String algorithms, S.2 finding orders from data, S.3 Se-quence segmentation and structure, S.4 Kernel algorithms for sequences; L.2 Contex-tual similarities; L.5 Prediction of structured objects; D.2 Decomposition into depend-ent components. Challenge 3: The vocabulary, grammar and history of genomes The genome codes information, the blueprint, of the species and the individual. How does it do this? Data about genome structure are created continuously: In a few years there will be available massive sets of full genomic sequences of many different species, as well as detailed data of the genomic variations between individuals. Still, many fun-damental questions are unanswered. Why do genome sizes differ so much? What is the role of inter- and intraspecies variation? How is the variation reflected in the operation of the organism? What are the functional components in gene regulation (vocabulary) and how are they connected (grammar)? How have the genomes of different species evolved in the history? Especially relevant tasks: S.1 String algorithms, S.2 Finding orders from data, S.3 Se-quence segmentation and structure, L.2 Contextual similarities; L.5 Prediction of struc-tured objects; D.1 Parsimonious models; D.4 Spatiotemporal models; F.3 Assessment of data mining results. Challenge 4: Understanding the environment through computational modelling of ecosystems

26

How do components of the environment interact with each other? The environment can be measured in many ways on different scales ranging from remote-sensing based satellite images of landscapes to chemical compositions of nutrients in individual plants. The complex interactions in both the spatial and temporal domains across different scales are largely unknown, and their importance is growing. Especially relevant tasks: S.2 Finding orders from data, S.3 Sequence segmentation and structure, L.1 Pattern languages for complex data; L.2 Contextual similarities; L.6 Analysis of flows; D.1 Parsimonious models; D.3: Non-gaussian Bayesian networks and causality; D.4 Spatiotemporal models; F.3 Assessment of data mining results. Challenge 5: Information processing in the visual system of the brain How does the visual system in the brain work? How is the two-dimensional retinal im-age transformed and analyzed? Humans are visual animals, and solving the riddle of vision is essential for our understanding of brain function in general. What is needed to solve this problem is new theoretical insights that can serve as a basis for future ex-periments. At the same time, new instruments for measuring brain function gather enormous amounts of measurement data. Functional magnetic resonance imaging (fMRI) gives three-dimensional images of brain activity with a resolution of tens of thousands of voxels. Magnetoencephalography and electroencephalography measure brain activity with a millisecond temporal resolution. How do we discover meaningful structures in such high-dimensional, continuous-valued data sets? Especially relevant tasks: S.2 Finding orders from data; D.1 Parsimonious models; D.2 Decomposition into dependent components; D.3 Non-gaussian Bayesian networks; F.2 Estimation of highly complex models.

4. Integration of research with the host organizations’ strategies The host organizations of the Algodan CoE are University of Helsinki and Helsinki University of Technology. These universities have established a joint strategic research institute, the Helsinki Institute for Information Technology HIIT, and the Algodan CoE is part of HIIT. Hence Algodan research naturally integrates with the strategies of the two host universities. The universities provide some of the funding of the CoE through HIIT. The recent strategy efforts of the two universities have stressed the importance of digi-talization as a general trend, and the increasingly important role of interdisciplinary work in science: these ideas are strongly present in the work of the Algodan CoE. The new directions of the CoE in methods and in applications stress the importance of both universities in the operation of the Algodan CoE. On the methods side, Algodan innovatively combines areas that have traditionally been represented in only one of the universities. In applications, the CoE is emphasizing activities in the areas of linguistics and humanities on the one hand, and applications stemming from process industry, es-

27

pecially paper and pulp industry on the other. These areas have important roles in the strategies of the host universities.

5. Comparison with the previous CoE The Algodan CoE is in part a continuation of our current CoE “From Data to Knowl-edge (FDK)”, funded in 2002 - 2007. The FDK CoE has combinatorial pattern match-ing and data mining as its methodological backbone, aims at combining combinatorial and probabilistic approaches in data analysis, and works in close co-operation with sev-eral application groups to develop innovative solutions to new data analysis problems. We see that this approach is extremely fruitful and there is no reason to change it fun-damentally. We have lots of interesting developments on the algorithmics side and our pool of high-quality collaborating groups is expanding and can offer interesting and relevant data analysis problems for years. To make our methodological basis stronger, the Algodan CoE will include two new very strong senior researchers: Prof. Jyrki Kivinen will bring to the CoE a strong exper-tise in computational learning theory; Dr. Aapo Hyvärinen and his group are experts in non-gaussian probabilistic modelling and in independent component analysis, where Hyvärinen is one of the leading researchers. These additions give the CoE a comple-mentary collection of main techniques of modern data analysis. They provide us a unique possibility for interdisciplinarity within data analysis, in addition to interdiscipli-nary work with the application sciences. Our research plan and mode of operation has been devised to provide maximum pos-sibilities for fruitful cooperation both within the CoE and between the CoE and the application partners. The groups of Hyvärinen and Kivinen have already started a strong interaction with the groups in FDK, and we expect to see many useful results from this combination of expertise. Compared to the start of FDK, our palette of applications has grown a lot. We are working on many areas of science, and, as mentioned, there are regular contacts from groups who would like to initiate collaboration with us. Building successful collabora-tions takes time, and the efforts we have made in the FDK unit are now beginning to bring in handsome returns.

6. Personnel The personnel profile, as anticipated on January 1, 2008, is given on Form 1a. There are 4 professors, 10 senior researchers, 21 postdoctoral researchers and about 30 students and other personnel. The total number of members is 68 (full-time equivalent of 65 person-years). Among the members, 9 are from abroad, and 15 are female. Professor Esko Ukkonen is the director of the CoE and one of the team leaders. The other team leaders are Academy Professor Heikki Mannila (the vice-director of the CoE), Professor Hannu Toivonen, Professor Jyrki Kivinen, and Dr. Aapo Hyvärinen.

28

Subteam leaders within groups include Dr. Aris Gionis, Dr. Jaakko Hollmen, Dr. He-lena Ahonen-Myka, Dr. Kjell Lemström, Dr. Juho Rousu, and Dr. Juha Kärkkäinen. The balance between different personnel categories is satisfactory. We emphasize that our researchers have various backgrounds, some of them having their basic academic education also in application areas (biotechnology, medicine). In the future hiring we emphasize senior positions, hiring females, and obtaining postdocs and other long-term visitors from abroad.

7. Funding The current FDK CoE gets its funding from a variety of sources, including the Univer-sity of Helsinki and the Helsinki University of Technology, HIIT, graduate schools (HeCSE, KIT, and ComBi), Academy of Finland, the National Technology Agency in Finland (Tekes), the European Commission, and private enterprises and foundations. The funding planned for years 2008 – 2013 is shown in Part II: Forms 2a and 2b. The use of the funding is as follows. The funding from the University of Helsinki and the Helsinki University of Technology is used to provide the necessary infrastructure for the unit, as well as salaries for some team and sub-team leaders (who also participate in teaching). We anticipate that the graduate schools will provide salaries for ten of our graduate students. The CoE funding from the Academy of Finland is used foremost for hiring senior and postdoctoral researchers, both from Finland and abroad, and also for hiring additional graduate students and research staff. Many of the researchers in the unit use part of their time in the application research teams.

8. Exit strategy The work will be continued in smaller scale without goals shared between various groups. An exit funding from the Academy is necessary for 2 years to make a smooth reduction of personnel possible as well as to get new funding.

29

Part II: Research environment

__________________________________ Cooperation and organization of activities

9. Added value We see the Centre-of-Excellence status as a very high recognition of the merits of our research, which is granted based on tough competition and careful international review. Our CoE brings together a coherent collection of research teams who can benefit from each others’ special competencies. We have an extensive portfolio of application partners who give us up-to-date problems and data. Working as a CoE, we can combine our knowledge to find better solutions. Also, we organize joint research efforts on some selected challenge problems. We have found the relatively long-term CoE funding extremely fruitful. It allows us to concentrate on longer projects and to take risks in the research. It also gives flexibility: if a good postdoc candidate shows up, it is easy to hire such a person rapidly. In particular, long funding is a good tool for developing international collaborations as well as hiring from abroad. Also important is that the CoE status itself attracts good people and interesting collaboration partners as well as helps in getting additional funding from elsewhere.

10. Collaboration with foreign research units International collaboration is mainly focused on methods and has been ordered according to the main tasks, Sequence analysis, Learning on and mining structured and heterogeneous data, Discovery of hidden structure in high-dimensional data, and Foundations of algorithmic data analysis.

S: Sequence analysis Collaborating unit: Yahoo! Research Barcelona, Professor Ricardo Baeza-Yates General themes: Web mining, text mining, information retrieval, analysis of the web graph and social networks, query mining, link analysis, and natural language processing applied to information retrieval. Forms and expected results: Exchange of researchers, joint projects. Collaborator: Professor Gonzalo Navarro, Center for Web Research, University of Chile

30

General themes: String algorithmics. Forms and expected results: Long-term collaboration on theory and practice of string matching, which has most recently concentrated on compressed self-index structures. Long-term visits, joint publications. Collaborators: Professors Robert Giegerich and Jens Stoye, Universität Bielefeld General themes: Biological sequence analysis, bioinformatics. Forms and expected results: Exchange of researchers, joint actions in PhD education, joint publications. Collaborating unit: European Network of Excellence for Integrated Genome Annotation (BIOSAPIENS), Dr. Janet Thornton, European Bioinformatics Institute, Hinxton, UK General themes: Analysis and annotation of biological sequences. Forms and expected results: Joint research, exchange of researchers, researcher education. Collaborating unit: Regulatory Genomics – European Union research consortium (STREP). Consortium members: University of Helsinki, Department of Medical Genetics, Biomedicum Helsinki, Finland, Deutsches Krebsforschungszentrum DKFZ, Germany, Aarhus University Hospital, Denmark, Pomeranian Medical University, Poland, Febit Biotech GmbH General themes: Regulatory patterns in genomes, effect of SNPs on gene regulation. Forms and expected results: Joint research to develop methods for finding regulatory motifs and cascades. Integrated approach that combines computational and wet lab techniques. Collaborator: Professor Alberto Apostolico, Georgia Tech, U.S.A., and University of Padova General themes: String algorithmics. Forms and expected results: Joint publications, exchange of researchers. Collaborators: Professor Geraint Wiggins, Goldsmith College, London; Professor Costas Iliopoulos, King’s College, London; Professor Remco Weltkamp, Utrecht University General themes: Algorithms and software for retrieval and analysis of music sequences. Forms and expected results: Joint projects, exchange of researchers. L: Learning on and mining structured and heterogeneous data Collaborating unit: April II – Application of Probabilistic Inductive Logic Programming. European Union research project (STREP). Consortium members: Albert-Ludwigs-Universität Freiburg, Germany, Imperial College of Science, Technology and Medicine, UK, University of Helsinki/Institute of Information and Technology, Finland, INRIA, Rocquencourt, France, University of Florence, Florence, Italy. General themes: Probabilistic modeling using inductive logic programming. Forms and expected results: Joint research, exchange of researchers.

31

Collaborating unit: IQ – Inductive Queries for Mining Patterns and Models. European Union research project (STREP). Consortium members: Institute Jozef Stefan, Slovenia, coordinator, Albert-Ludwigs Universität Freiburg, Germany, INSA Lyon, France, University of Wales Aberysthwyth, United Kingdom, University of Helsinki HIIT, Finland, and University of Antwerp, Belgium. General themes: Inductive databases targeted to applications in bioinformatics. Forms and expected results: Joint publications, exchange of researchers. Collaborator: Professor Masanori Arita, University of Tokyo General themes: Metabolic networks, metabolic reconstruction. Specific topics: Atom-level representations of metabolic reactions. Forms and expected results: Atomic-level representations and reconstruction of metabolism. Current topic of interest is the in silico prediction of atom-mappings of biochemical reactions. Collaborator: Dr. Alvis Brazma, European Bioinformatics Institute, UK General themes: Gene expression data analysis. Forms and expected results: Exchange of researchers, joint project. Collaborator: Professor Dimitris Achlioptas, University of California at Santa Cruz General themes: High-dimensional data, sums of products. Forms and expected results: Joint publications, NSF proposal in preparation. Collaborator: Professor Dimitrios Gunopulos, University of California at Riverside General themes: Pattern discovery in complex data sets, spatiotemporal data. Forms and expected results: Joint publications. Collaborator: Professor Gautam Das, University of Texas at Arlington General themes: Pattern discovery, high-dimensional data. Forms and expected results: Joint publications. Collaborator: Professor Ivan Janssens, University of Antwerp, Department of Biology, Research Group of Plant and Vegetation Ecology, Antwerp, Belgium General themes: Environmental research, forest research, ecology; high-dimensional data. Collaborator: Dr. Harriet Wikman, University Medical Center Hamburg-Eppendorf, Department of Tumor Biology, Hamburg, Germany General themes: Genome structure and gene mapping: Cytomolecular genetics. Collaborator: Professor Luc De Raedt, University of Freiburg and Katholieke Universiteit Leuven General themes: Machine learning in relational data Specific topics: Probabilistic inductive logic programming, multi-relational data mining, inductive querying. Forms and expected results: Collaboration in joint research projects (currently April II and IQ), exchange of personnel (Professor Toivonen is currently on a 12 month visit to Freiburg).

32

Collaborating unit: ECDC European Centre for Disease Prevention and Control (European Commission), Stockholm. General theme: Applying language technology to process epidemiological text collections. Forms and expected results: Joint research. Collaborating unit: Professor Ralf Grisham, Proteus Project, Department of Computer Science, New York University, USA General theme: Advanced methods for information extraction. Forms and expected results: Joint projects. Collaborating unit: Professor Jennifer Spenader, Artificial Intelligence Department, University of Groningen General theme: Information extraction. Forms and expected results: Joint project on discourse processing for information extraction.

D: Discovery of hidden structure in high-dimensional data Collaborators: Dr. Shohei Shimizu, Institute of Statistical Mathematics, Tokyo; and Professor Yutaka Kano, Division of Mathematical Sciences of Osaka University. General theme: Non-Gaussian probabilistic modeling. Specific topic: Causal modeling. Forms and expected results: Joint supervision of PhD student (Shohei Shimizu), who visited us for 18 months during his PhD and is now back for a post-doc period. Dr. Hyvärinen visited Institute of Statistical Mathematics for one month in 2006. Collaborators: Dr. Fabrizio Esposito, University of Naples; Professor Elia Formisano and Professor Rainer Goebel, University of Maastricht General theme: Brain imaging data analysis methods. Specific topic: Stability analysis of ICA and application of ICA on data from different subjects. Collaborators: Dr. Patrick Bas, Institut Polytechnique de Grenoble. General theme: Image processing. Specific topic: Image watermarking and ICA. Collaborating unit: Canadian Institute for Advanced Research network Neural Computation and Adaptive Perception, organized by Geoffrey Hinton, Toronto.

F: Foundations of algorithmic data analysis Collaborator: Associate Professor Tommi Jaakkola, Computer Science and Artificial Intelligence Laboratory (CSAIL), MIT General themes: Machine learning methods and their applications in modeling and analysis of molecular biological systems. Forms and expected results: Yearly visits, joint research.

33

Collaborators: Professor John Shawe-Taylor, University College London and Dr. Craig Saunders, Dr. Sandor Szedmak, University of Southampton General themes: Kernel methods, prediction of structured objects, large-scale optimization. Forms and expected results: Exchange of researchers and joint projects. The collaboration circles around kernel methods for structured data, especially string and graph kernels and structured output methods, especially large-scale optimization algorithms for them. Collaborator: Professor Manfred Warmuth, University of California, Santa Cruz General themes: Computational learning theory. Forms and expected results: Joint publications, exchange of researchers. Collaborating unit: PASCAL – European Network of Excellence on Pattern Analysis, Statistical Modeling and Computational Learning. General themes: Europe-wide Distributed Institute which will pioneer principled methods of pattern analysis, statistical modelling and computational learning. Forms and expected results: Joint publications, exchange of researchers, researcher education, research seminars.

11. Collaboration with Finnish research units The main collaboration relations of the CoE with application partners are listed below according to the application area as collaboration with Finnish research institutes is mainly focused on applications. On the methodological side, we currently collaborate with the CoE on Inversion Problems (Professor Lassi Päivärinta, University of Helsinki) and plan to continue earlier collaboration with Professor Elja Arja’s statistics group at the Department of Mathematics and Statistics of the University of Helsinki. We also collaborate with the Statistical Machine Learning and Bioinformatics group at Helsinki University of Technology (Professor Sami Kaski) and the Finnish IT Centre for Science (CSC).

Genome structure Collaborating unit: CoE on Disease Genetics, Academy Professor Leena Peltonen-Palotie, University of Helsinki and National Public Health Institute; Professor Juha Kere, Karolinska Institute, Stockholm. General themes: Sequence analysis, mining complex data. Specific topics: Genome segmentation, the vocabulary of the genome, structure of haplotype data. Forms and expected results: Exchange of personnel, joint projects, joint supervision of Ph.D. students, joint publications.

34

Collaborators: Professor Sakari Knuutila, University of Helsinki, Laboratory of Cytomolecular Genetics, Dr. Sisko Anttila, Finnish Institute of Occupational Health. General themes: Cytomolecular genetics. Forms and expected results: Joint project, joint publications. Gene regulation Collaborating unit: CoE in Translational Genome-Scale Biology, Professor Jussi Taipale, Academy Professor Lauri Aaltonen, University of Helsinki General themes: Gene regulatory motifs, effect of SNPs on regulation, regulatory networks. Forms and expected results: Joint research projects, exchange of researchers. Collaborating unit: CoE in Developmental Biology, Professor Irma Thesleff, Professor Jukka Jernvall, University of Helsinki General theme: Gene regulatory motifs. Forms and expected results: Joint publications, exchange of researchers. Collaborator: Professor Arto Urtti, Drug Discovery and Development Technology Centre, University of Helsinki General theme: Gene regulatory motifs. Forms and expected results: Joint project. Metabolic modeling Collaborator: Professor Liisa Holm., University of Helsinki, Institute of Biotechnology General themes: Protein bioinformatics, functional genomics. Specific topics: Protein structure and function prediction, function evolution. Forms and expected results: Development of new prediction methods for enzyme function and modeling of function evolution. Kernel based structured prediction methods are applied. Collaborating unit: VTT Biotechnology, Professor Merja Penttilä, Professor Hans Söderlund General themes: Metabolic modeling, flux analysis in metabolic networks. Forms and expected results: Joint projects, exchange of researchers.

Atmospheric data analysis Collaborating unit: CoE on Atmospheric Composition and Climate Change, Professor Markku Kulmala, University of Helsinki. General themes: Mining complex data, structure discovery in high-dimensional data. Specific topics: Prediction of nucleation events in atmospheric data. Forms and expected results: jointly funded postdoc, joint publications.

35

Ecology and Paleontology Collaborators: Professor Mikael Fortelius, Department of Geology, University of Helsinki, Professor Jukka Jernvall, Institute of Biotechnology, University of Helsinki, Ilkka Hanski, University of Helsinki General themes: Sequence analysis, structure discovery in high-dimensional data. Specific topics: Finding orders from data. Forms and expected results: Joint projects, joint publications. A Ph.D. student in the Algodan CoE is funded by Professor Fortelius.

Forest research Collaborating unit: Finnish Forest Research Institute METLA, Professor Erkki Tomppo, Professor Kari Mielikäinen General themes: Structure discovery in high-dimensional data, spatio-temporal data. Forms and expected results: Joint research.

Linguistics Collaborating unit: CoE of Variation, Contacts, and Change in English, Professor Terttu Nevalainen, University of Helsinki. General themes: Sequence analysis, structure discovery in high-dimensional data. Specific topics: Language variation and change. Forms and expected results: Joint research. Collaborating unit: University of Helsinki and Research Institute for the Languages of Finland (Kotus), Professor Ritva-Liisa Pitkänen General themes: Structure discovery in high-dimensional data. Specific topics: Dialects and place names. Forms and expected results: Joint supervision of Ph.D. student. Collaborating unit: Professor Ulla-Maija Kulonen, Department of Finno-Ugric Studies, Dr. Risto Pulkkinen, Department of Comparative Religion, University of Helsinki General themes: Linguistic population history. Forms and expected results: Joint research.

Language technology Collaborating unit: Director Mikko Olkkonen, Pöyry Application Services, Pöyry Oy General theme: Term discovery and extraction for ontology building in the domain of engineering Forms and expected results: Joint project. Collaborating unit: Connexor General theme: Advanced methods for information extraction, discovering new terms for building ontologies.

36

Forms and expected results: Joint project. Collaborating unit: Language Technology Research Network (LANTERN) Members of the steering group: Professor Kimmo Koskenniemi, Professor Lauri Carlson, Dr. Timo Honkela (Adaptive Informatics CoE at Helsinki University of Technology), Professor Juha Puustjärvi (Lappeenranta University of Technology). General themes: Learning on and mining structured and heterogeneous data Specific topics: XML mining, minimally supervised methods for text understanding. Forms and expected results: Regular co-operation with language technology industry cluster in Finland (e.g. AAC Global, Lingsoft, Sanako, Master's Innovations, Kielikone). Collaborating unit: Adaptive Informatics CoE at Helsinki University of Technology, Dr. Timo Honkela. General theme: Computational language modeling. Specific theme: Feature extraction from natural language using ICA. Forms and expected results: Joint research.

Neuroinformatics Collaborator: Visual Science group at the Dept of Psychology of the University of Helsinki. General theme: Data analysis for experimental visual psychophysics. Specific topic: Nonlinear classification images. Forms and expected results: Two collaborative Academy projects and supervision of a PhD student (Ilmari Kurki). Collaborator: Dr. Simo Vanni at the Brain Research Unit of the Helsinki, University of Technology. General theme: Brain imaging in vision research. Specific topic: Functional specialization of extrastriate visual cortex. Forms and expected results: Dr. Hyvärinen is leader of an interdisciplinary brain research consortium funded by the Academy of Finland which includes this group and the one above. Collaborator: Dr. Vesa Kiviniemi of the Oulu University Central Hospital, Finland. General theme: Brain imaging data analysis methods. Specific topics: Analysis of spontaneous activity using unsupervised learning techniques.

12. Collaboration with enterprises The Algodan CoE has wide contacts to companies both in Finland and elsewhere. In this section we briefly describe the expected form of collaboration. In the area of telecommunications, we have a long tradition of collaboration with Nokia Research Center. Currently, we are discussing projects related to ubiquitous and proactive computing. As mobile devices are becoming multimedia computers instead of

37

just phones, the need for more intelligent user interfaces and ways of understanding the behavior of users is becoming more important. In the area of pharmacogenomics, genome structure and gene mapping, our current industrial partners include Orion Pharma, Biocomputing Platforms, GeneOS, Perkin-Elmer, Medix Biochemica, and Jurilab. For example, with Orion Pharma we are analyzing large phenotypic databases: the goal is to search for subdiagnoses that might be amenable to target medication. We expect that the cooperation with these and other biotechnical companies will continue. In the area of language technology, current industrial cooperation involves Nokia NBI, Citec Informatics, Fujitsu Services, Pasanet/Lingsoft, Finland’s Slot Machine Association, and Wärtsilä Finland. Again, we expect that similar collaborations will continue. One of the goals of the Algodan CoE is to become more deeply involved in forest research and in the paper and pulp industry. We will start collaborations in this area. In information technology, we have industrial ties to large companies abroad, including Microsoft (Search Labs; Dr. Rakesh Agrawal) and Yahoo! Research (Professor Ricardo Baeza-Yates and Professor Raghu Ramakrishnan).

Biotechnology Orion Oyj, Biocomputing Platforms Ltd, GeneOS Ltd, Jurilab Ltd, Mebidiag Oy, MediCel Oy, Oy Panimolaboratorio - Bryggerilaboratorium Ab, Oy Sinebrychoff Ab, Danisco Sugar Oy, Roal Oy, Whitelake Software Point Oy, Fibrogen Europe Oyj, Medipolis GMP Oy, Perkin-Elmer , Medix Biochemica

Database systems Google, Microsoft Research, Yahoo! Research

Information extraction Citec Information Oy, Fujitsu Services Oy, Nokia NBI, Pasanet/Lingsoft Oy, Raha-automaattiyhdistys ry (Finland’s Slot Machine Association), Wärtsilä Finland Oy, Esmerk Oy, Connexor Oy

Telecommunications Nokia Research Center

38

13. Management and cohesion between research teams

The organization chart of the Algodan CoE is shown above. The management board, consisting of the five team leaders of the unit, co-ordinates and organizes the joint activities and develops the research programme and fund raising of the CoE. The leader is responsible for the allocation of the joint CoE funding to the teams, new hirings, and developing the infrastructure of the CoE. When taking these decisions, he should consult with the management team. The University of Helsinki and the Helsinki University of Technology are the host universities of the Algodan CoE. It is located at the Helsinki Institute for Information

Helsinki University of Technology

University of Helsinki Helsinki Institute

for Information Technology HIIT

CoE director- E. Ukkonen Vice director - H. Mannila

Administrative services University of Helsinki Dept. of Computer Science - Administrative services - Computing services - Library services

Administrative services Helsinki University of Tech- nology, CIS - Administrative services - Computing services

Management Board - E. Ukkonen - H. Mannila - A. Hyvärinen - J. Kivinen - H. Toivonen

Administrative Office- T. Kujala - G. Lindén

Team Hyvärinen -9 members

Team Kivinen -10 members

Team Mannila -23 members

Team Toivonen -9 members

Team Ukkonen -17 members

Tasks

Applications

Home institutions

39

Technology HIIT, a joint research institute of the two universities. HIIT has in part outsourced its administration and gets administrative, computing and other infrastructure services from the Department of Computer Science of the University of Helsinki and also from the Laboratory of Computer and Information Science (CIS) of the Helsinki University of Technology. This arrangement leads to efficient use of resources and works well in practice. The CoE will utilize this scheme. The five teams have competences complementing each other, and there have already been several collaborative projects between some of them. In the future this will be even stronger as evidenced by Table 1 which summarizes joint research of the teams on the planned tasks. Also important is that most team members work in the same building, physically close to each other.

14. Promotion of research careers (Women and young researchers) The Algodan CoE actively promotes the research careers of all its young researchers. The Ph.D. students work in research groups, and we aim at giving them a strong back-ground both in theory and in applications, and also the necessary skills for cooperative and interdisciplinary work in industry or in sciences. We actively encourage fresh Ph.D. graduates who want to pursue academic careers to search for postdoc positions abroad. While, e.g., the two-body problem has made this difficult in some cases, we strongly feel that this is a necessary part of the development of research career. The CoE will help them to find a suitable location for such a visit. In the similar way, we actively recruit postdocs abroad to broaden the skill set of the Al-godan CoE. We emphasize the formation of own research programs by the postdoctoral research-ers. Many of our current senior or postdoctoral researchers already are co-leaders or sub-group leaders in a team, and participate in planning and management as well as run independently some projects. For instance Helena Ahonen-Myka, Juho Rousu, Aris-tides Gionis, Jaakko Hollmén, Kjell Lemström and Juha Kärkkäinen have such a status. We expect (and hope) their and others leader status to get still stronger in the future. To concretely facilitate this important development we allocate some funds to be used as start-up funding for new subteams. There are 15 females (23 percent) in our planned research personnel. Within computer science, this is a relatively high number, but obviously there is a lot of room for im-provement. The CoE participates in recruiting efforts by the Departments of Computer Science of the host universities, and some of these efforts are specifically targeted at attracting more women to the area. In our own recruiting, we actively search for female postdoctoral researchers and Ph.D. students. At lower levels, we observed that there is a relatively high number of 8 females among our 21 summer students in 2006.

40

15. Facilities The CoE has an excellent infrastructure for research. Most of the activity takes place in the new Exactum building located at the Kumpula campus of the University of Helsinki. The departments of Computer Science, and Mathematics and Statistics are located in the same building. We have a very stable and powerful computing infrastructure maintained by the Computer Science Department, and the Computer Science Library of the University of Helsinki is the best in the country and is on the level of the best such libraries in the world. In addition to the normal updating of our regular computing environment we need in 2008-2013 several powerful workstations with very large memory capacity to study retrieval and mining of massive data in core memory.

16. Promotion of creative research The CoE has an extensive network of collaborators and partners on various methodological and application domains. For example, we are collaborating with eight current or past Centres-of-Excellence of the Academy of Finland, and we are members in five European Union research networks or research consortiums. Some our researchers are jointly hired with an application partner, and some of our former researchers have moved to work in an application group. They of course take their methodological tool box to the new environment. Our concrete actions to strengthen the creative research and training environment within the CoE and in its immediate surroundings include:

We distribute freely software that we have implemented and that is based on our research as well as make many software packages available via www servers. This has significant potential in serving the community and also in promoting the CoE itself as a site of interesting advanced software. The researchers of the CoE regularly participate in teaching at the host institutions' departments by giving lectures, arranging seminars, and supervising of project work and theses.

We favour joint hiring and exchange of researchers with our application partners.

The Helsinki Graduate School in Computer Science and Engineering (HeCSE) is coordinated in the CoE; Professor Hannu Toivonen is the director of the school, which has 20 full-time PhD positions and a coordinator financed by the Ministry of Education for 2007-2011. The participating universities include the University of Helsinki and the Helsinki University of Technology.

41

The Graduate School on Computational Biology, Bioinformatics and Biometry (ComBi) is coordinated in the CoE. Academy Professor Heikki Mannila is the director of the school. It has 13 full-time PhD positions financed by the Ministry of Education and a coordinator. The participating universities include the Universities of Helsinki, Tampere and Turku, and the Helsinki University of Technology. The ComBi School is a member of European network BREW of PhD programs in bioinformatics. The joint Masters Programme in Bioinformatics (MBI) of the University of Helsinki and the Helsinki University of Technology is coordinated in the CoE and lead by a CoE member, Dr Juho Rousu. We intend to organise multidisciplinary seminars comprising the subjects of other CoEs.

17. Contribution to various sectors of society The Algodan CoE contributes to society in many different ways. The CoE has a strong impact on scientific research, education of researchers, and industry, and the members of the CoE participate in the general societal discussion on aspects of information society. The impact of Algodan CoE on science is felt on the one hand in computer science and on the other hand in the application sciences. The influence in computer science stems from the novel concepts, algorithms, and frameworks that are obtained in the work of the CoE. The concepts and methods will have a large international impact. The Centre of Excellence also has a significant impact on other sciences. First, the new methods designed by Algodan will produce new and important results for the various applications. Furthermore, the use of novel algorithmic data analysis methods will change the way many researchers in other sciences operate: study designs will change when new possibilities for data analysis are opened. There are many current examples of this effect of the work of Algodan: for example in gene regulation, gene mapping, paleontology, and linguistics our work is currently strongly influencing the future plans of the application sciences. Our goal is to still strengthen this type of impact: especially in the humanities, environmental sciences, and process technology there are many possibilities of a large impact. In education of researchers Algodan CoE will also have a large effect. We educate Ph.D. and M.Sc. students in the design and analysis of algorithmic data analysis and other areas of computer science, while at the same time we give the students solid experience in working on interdisciplinary projects. This provides the students with abilities to work on different themes, and gives them a good start in their future careers in industry, academia, or elsewhere.

42

The educational effect of Algodan is also felt in the application areas: the scientists in the applications learn to use and evaluate new data analysis methods; this is especially important, as the improvements in measurement technology and data collection methods are changing the mode of operation in many areas. The industrial impact of Algodan is partly due to direct industrial cooperation and partly to the education of students in Algodan. The work done by the Algodan CoE will impact the research and development efforts of industrial companies both in the area of information technology and in other areas, by providing the industries specific results and a good general view on the developments in the areas of data analysis, databases, data mining, and information retrieval. Our students who move to industry take with them solid skills in computer science and the experience and mindset they have gained from the interdisciplinary application projects. We expect that some of our students will also start their own companies. In addition to the purely scientific and educational goals, one of the missions of the universities is to act as experts for the society as a whole. In that spirit, the researchers of the proposed CoE have given, on invitation, expert statements to the Finnish parliament and ministries on current issues related to their field of expertise. ____________________________________________________________________

Part III: Success and potential of the proposed CoE in researcher training ____________________________________________________________________

18. Organization of researcher training The Algodan CoE offers a unique multi- and interdisciplinary environment for PhD education. The students can take courses and seminars, as well as conduct discussions, on all facets of data analysis, ranging from strings to matrices, and from combinatorial algorithms to probabilistic theory. Such an environment cannot be found in traditional statistics or computer science departments. Moreover, the students can gain in-depth knowledge about applications in molecular biology, genetics, and neuroscience to name just a few. Currently the CoE has ten PhD student positions funded by graduate schools (HeCSE, ComBi, KIT, and ComIT). We expect this to stay on about the same level in the future. The graduate school students work simultaneously in our research teams. Project work will give them an opportunity to obtain the results that eventually will make their doctoral thesis. As mentioned, the CoE has close connections to two graduate schools: The Helsinki Graduate School in Computer Science and Engineering (HeCSE) is coordinated in the CoE; Professor Hannu Toivonen is the director of the school. It has 20 full-time PhD positions and a coordinator financed by the Ministry of Education for 2007-2011. The participating universities are the University of Helsinki and the Helsinki University of Technology. The Graduate School on Computational Biology, Bioinformatics and

43

Biometry (ComBi) is also coordinated in the CoE. Academy Professor Heikki Mannila is the director of the school. It has 13 full-time PhD positions financed by the Ministry of Education and a coordinator. The participating universities include the Universities of Helsinki, Tampere and Turku, and the Helsinki University of Technology. A new Masters Programme in Bioinformatics will start in 2006, coordinated by the CoE.

19. Degrees The members of the CoE have supervised 20 completed PhD degrees in 2001-2005 (see Form 3a). The number of professors who are expected to supervise PhD students has been four or even less as some have started quite recently. Dr. Aapo Hyvärinen has been supervising several PhDs without being a professor. In 2006-2007 we expect about 20 PhD students to graduate. Among the 20 graduated PhDs, 12 continue working in the CoE. One of them (Juho Rousu) is currently a (acting) professor, the others being postdocs. Five of the PhDs have also made a postdoctoral visit abroad (e.g., Berkeley, Bielefeld, and Royal Holloway London). Among those who have left the CoE, three work in private companies (one of them in Estonia), two as researchers in universities, two in other research institutions, and one as a teacher in a Polytechnic.

20. Plan of degrees The plan of new doctoral degrees earned within the CoE in 2008-2012 is given on Form 3b. We hope to be able to produce somewhat more than currently as the number of senior members in the CoE has been increasing.

21. Need of researchers As the importance of data analysis in science and in industry is increasing continuously, also the need of researchers trained by the Algodan CoE will grow. Some special areas of the CoE such as bioinformatics, neuroinformatics and language technology as well as information retrieval and filtering are expanding fields where competent researchers are sorely needed. There are job opportunities in academic research and in many types of industries. Launching a start-up company is also possible: four former members of the CoE are currently employed by such enterprises. Also in the long run, the need for researchers in the areas represented by Algodan seems to be very high. The role of computational science in general and especially data analysis is constantly growing in the sciences and in industry, and this implies that the need for researchers in the area will remain high also in the future. Our emphasis on providing the Ph.D. students with experience on several application projects gives them good skills for their careers, as versatility and the ability to tackle new areas are among the important recruiting criteria

44

__________________________________

Part IV: Scientific merits and output

__________________________________ The quality and quantity of scientific output

22. Publications and other output The members of Algodan CoE have during 2001-2005 produced 120 publications in international refereed journals and 239 publications in refereed international conferences and edited works and two international monographs (Form 4a). The most significant publications are listed on Form 4b. In the other output, the number of patents is 6 and the number of computer programs 56. Here we have included the more significant programs that are available in the Internet and are reported in some publication.

Most significant discoveries and innovations (Form 4d)

Methods and tools for gene mapping, haplotyping, diagnostic markers and gene regulation

This long-term collaborative project is a good example of the working style of the Algodan CoE. We started with the problem of how to find loci in the genome that predispose to certain diseases. While many methods (e.g., linkage analysis) have been developed for this problem, they typically fail to take into account certain important aspects of the data. Since ten years, the CoE has been involved in a deep collaboration with medical geneticists. The goal of the collaboration is to develop statistical data analysis methods that can be used for the analysis of haplotype data. The first important results included tools for association analysis of haplotype data using techniques from data mining (Toivonen et al., American Journal of Human Genetics, 2000). This algorithm was then successfully used by geneticists in the Karolinska Institutet, Stockholm, to locate the asthma gene, a highly significant finding that was published in Science. Later, we developed a novel 'founder model' for genomes of a population which led to new efficient algorithm for haplotyping genome data, using hidden Markov techniques (Ukkonen, WABI 2003; Koivisto et al, WABI 2005). The resulting haplotyping software has accuracy and speed that is among the very best available at the moment. A similar founder approach has recently been applied by at least two leading groups elsewhere. With gene copy number analyses, we have identified chromosomal regions, where gene copy number changes occur in the case of asbestos-exposure related lung cancer (Nymark, Cancer Research, 2006). This offers an avenue for further research in

45

developing diagnostic procedures to specifically detect asbestos-exposure related lung cancer in relation to lung cancer in general. One patent application has been filed covering the diagnostic use of the chromosomal copy number change regions. Most recently, we have developed in collaboration with Professor Jussi Taipale (Biomedicum, Helsinki) a new model for so-called gene enhancer elements in mammalian genomes. Such elements have important role in the regulation of gene activity. Finding and understanding them is a crucial step in the current worldwide effort of deciphering the functions of the program encoded in the genome. We implemented the enhancer model using dynamic programming algorithm and carried out a genomewide comparative analysis, and were able to predict several new enhancer elements. Some of them were then successfully verified in vivo (Hallikas et al, Cell 2006; Palin et al, Nature Protocols 2006).

Advances in full-text indexing In the area of text indexing there has been recent significant international progress including the discovery of so-called self-indexes: It is possible to replace a text string with a compressed (self-)index that takes space less than the text but at the same time supports many kinds of queries of the text content. We have several contributions to these developments. The PhD work of Juha Kärkkäinen on Repetition based text indexes (1999) introduced indexes based on Lempel-Ziv parsing whose space was close to the text size, yet the text was stored separately. Many up-to-date self-index structures are based on Lempel-Ziv parsing following the approach initiated by Kärkkäinen and Ukkonen. Recently Juha Kärkkäinen, with P. Sanders, was able to solve the long-standing open problem of 'direct' suffix-array construction in small space (Kärkkäinen and Sanders: ICALP 2003 and JACM, to appear). This (with two independently and almost simultaneously discovered alternative solutions) has been a celebrated result in the combinatorial pattern matching community. The Kärkkäinen-Sanders algorithm has already been included in a text book Handbook of Data Structures and Applications, Editors Mehta & Sahni, Chapman & Hall, 2005. Finally, Veli Mäkinen has contributed to the conceptual simplification of self-indexes and co-developed the most space-efficient self-index up-to-date, the main term of the space-usage being optimal in the terms of the k-th order entropy (Ferragina, Manzini, Mäkinen, and Navarro: SPIRE 2004 and ACM Transactions on Algorithms, to appear).

ContextPhone In the area of context-awareness and smart phones, there has been significant success in recognizing context by analysing user situation data. The results include a prototyping platform for context-aware applications running on Smartphones, specifically on Nokia's S60 platform (Raento et al., IEEE Pervasive Computing, 2005). The ContextPhone platform consists of about 30 distinct components that implement data gathering, generalized event services, data logging, user interfaces, network protocols and debugging facilities. The platform has been published in both the sense of academic publications and as freely downloadable software, licensed both under GPL version 2 and MIT free software licenses. Applications built on top of ContextPhone have been used in several research institutes. The ContextContacts application has been central to the study of user perceptions of context, social awareness and privacy at HIIT (Oulasvirta et al., MOBILEHCI, 2005, Oulasvirta et al., Human-Computer Interaction, 2006) and is currently under commercialization. The

46

data logging ContextLogger was used to gather a unique dataset from one hundred participants over nine months by Nathan Eagle at the MIT Media Lab, and has been the basis for data analysis method development at HIIT (Laasonen et al., Pervasive, 2004, Laasonen 2005 (PKDD)). ContextMedia, a contextual mobile media gathering tool, has been used together with the University of Art and Design Helsinki in several artist-led workshops around the world as well as by the Garage Cinema Research Group at UCB, with an end-user version released for public consumption under the name Merkitys-Meaning. A special-purpose sensor network version of ContextPhone is used in an artist-led cross-disciplinary project (Evans in press). Datasets from these experiments have been released publicly (Raento, Pervasive, 2004) and have been used amongst others by research at the University of Helsinki and University of Jyväskylä (Mazhelis et al., HICSS, 2006). Although the data gathered has been used mostly to study general human behaviour, it could also be used to get accurate statistics on the ways mobile phones are used. This would be useful both in academia and industry.

Finding orders from data In certain applications there is a natural ordering for the rows or columns of the data. For example, in paleontological presence/absence data the rows represent sites and the columns represent species: the task is to find an ordering for the sites so that for each species its occurrences are in consecutive observations. In the error-free case this seriation problem reduces to consecutive ones problem, but it is NP-hard for realistic data. We have in the last years developed novel algorithms for this seriation task (Gionis et al., Paleobiology 2006; Puolamäki et al PLoS Computational Biology 2006); their performance is excellent compared to previous approaches. Recent results (Gionis et al., KDD 2006, to appear) show also that finding partial orders can be done efficiently.

Techniques and tools for learning linear latent variable models A common data-analysis framework for continuous data is to describe the data as a linear mixture of some underlying hidden variables. This family of methods includes Independent Component Analysis (ICA) and Non-negative Matrix Factorization (NMF), which both have received considerable attention in the machine learning community. We have contributed significantly to the problem formulation, solution algorithms, and freely available software for these methods. In particular, we have published a book which is now the standard reference on ICA (Hyvärinen et al, Independent Component Analysis, Wiley, 2001). We have also developed and improved the FastICA MATLAB package (www.cis.hut.fi/projects/ica/fastica/), implementing the world-wide most widely-used ICA algorithm, which we developed in the 1990's (Hyvärinen, IEEE Trans. NN, 1999). Furthermore, we have focused on the important problem of estimating the reliability of ICA components (Himberg et al, NeuroImage, 2004), and implemented the resulting algorithms in the novel ICASSO software package. For non-negative data analysis, we have extended the standard NMF method to include sparseness constraints. The resulting method (Hoyer, JMLR, 2004) has become a main reference for modern approaches to NMF, and our corresponding MATLAB package a widely-used implementation of NMF-based techniques.

47

23. Improving professional and public understanding of science The Algodan COE plans to continue publishing also popularizing articles in Finnish and international fora. The CoE will also use the web for popular demos. The status of researchers in their field

24. Prizes and scientific honours We list thirteen items of various prizes on Form 5. To name a few, Academy Professor Heikki Mannila received the ACM SIGKDD Award in 2003 and Dr. Aapo Hyvärinen the Nokia Educational Award in 2001. In 2005, Dr. Petteri Kaski received an award for best doctoral dissertation (out of 141) at Helsinki University of Technology. The Finnish Society for Computer Science selects each year the best doctoral dissertation in Finland. In 2000 they gave the award to Jaakko Hollmén and in 2006 to Taneli Mielikäinen, both members of Algodan CoE. Professor Ukkonen and Academy Professor Mannila are members of the Finnish Academy of Science and Letters.

25. Positions of trust The numerous international (175) and domestic (35) positions of trust of the CoE members in 2001-2005 are listed on Forms 6a and 6b. Among others, Professor Esko Ukkonen is Editor-in-Chief of the Nordic Journal of Computing, and Member of the Board of Editors of the Journal of Universal Computer Science. He is also Associate Editor of the IEEE-ACM Transactions on Computational Biology and Bioinformatics. He is Director of the current From Data to Knowledge CoE. Professor Ukkonen was chairman of the Board of the Rolf Nevanlinna Institute in 1998-2003, and currently chairman of the Board of the joint Master's Programme of Bioinformatics of the University of Helsinki and the Helsinki University of Technology. He is a Board member of the Institute of Biotechnology and of the Helsinki Area Master's Programme in Biotechnology. Academy Professor Heikki Mannila is Area Editor of IEEE Transactions on Knowledge and Data Engineering and Associate Editor of ACM Transactions on Database Systems. He was recently also Editor-in-Chief of Data Mining and Knowledge Discovery and associate editor of ACM Transactions on Internet Technology. He currently Chair of the ESFRI Expert Group on Computing and Data

48

Treatment, Senior Science Adviser of the Finnish IT Centre of Science and vice chair of the Finnish Genome Centre. He is also Chairman of the Board of the Graduate School on Computational Biology, Bioinformatics, and Biometry (ComBi). Dr. Aapo Hyvärinen is Action Editor of Neural Computation and Journal of Machine Learning Research as well as Contributing Faculty Member of Faculty of 1000 Biology. He was previously Co-Editor-in-Chief of International Journal of Neural Systems. Professor Jyrki Kivinen is the current President of the Association for Computational Learning, while Professor Hannu Toivonen is Action Editor of Data Mining and Knowledge Discover and on the Editorial Board of International Journal of Data Mining and Bioinformatics. Researcher mobility and involvement in international research programmes

26. Visits The longer visits from the CoE had in 2001-2005 a total duration of 167 person-months and the visits to the CoE a duration of 124 person-months (Forms 7a and 7b). Examples of short important visits to the CoE are given on Form 7c. To mention one, Professor Richard M. Karp, inventor of NP hardness theory, visited the unit in 2004. During his visit, a post-doc visit to his home institution, the International Computer Science Institute (ICSI) was set up and Dr. Matti Kääriäinen followed up by visiting ICSI for 12 months during 2005-2006.

27. Involvement in international research programmes Academy Professor Heikki Mannila was the Programme Director of the Research Programme on Proactive Computing that run for three years during 2002-2005 (PROACT). The programme was co-funded by the Academy of Finland, Tekes and the French Ministry of Research and New Technologies and it provided funding for fourteen research projects in the field of proactive computing. The CoE has memberships in 5 European research consortiums or networks of excellence, namely in Strep projects Application of Probabilistic Inductive Logic Programming (April II), Inductive Queries for Mining Patterns and Models (IQ) and Regulatory Genomics; and in the European Network of Excellence on Pattern Analysis, Statistical Modeling and Computational Learning (PASCAL) and in the European Network of Excellence for Integrated Genome Annotation (BIOSAPIENS).