1 Semantic Web Usage Mining – Overview and Case Studies – Bettina Berendt Humboldt University Berlin Institute of Information Systems berendt

1 Semantic Web Usage Mining: Overview and Case Studies
Bettina Berendt, Humboldt University Berlin, Institute of Information Systems

2 Goals and top-level questions
- Make the world's knowledge available to the world
- How do people discover knowledge on the Web?
- How can more knowledge sources contribute to the Web?

3 Web Mining extracts implicit knowledge. The Semantic Web makes knowledge machine-understandable.
Semantic Web Mining:
- use semantics to improve mining
- use mining results to generate semantics
Approaches to the current Web's biggest challenges: lots of data, human-understandable
[Berendt, Hotho, & Stumme, Proc. ISWC 2002]
[Berendt, Mladenic, et al. (Eds.), From Web to Semantic Web, Springer LNAI 2004]
[Berendt, Grobelnik, Mladenic et al. (Eds.), Semantics, Web, and Mining, Springer LNAI 2006]

4 Agenda: Web Mining. Why?

5 1. What should I buy?

6 2. Where do I find relevant information on...?

7 3. What do people do there?

8 4. How can a site be made usable for a worldwide audience?

9 5a. Why go to a shop if everything is available on the Internet?

10 5b. What is my site worth for my business?

11 6. How to help people become active members of the knowledge society: help them to contribute content?

12 Agenda: Web Mining. How?

13 Web Mining
Knowledge discovery (aka Data mining): "the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." [1]
Web Mining: the application of data mining techniques on the content, (hyperlink) structure, and usage of Web resources.
Web mining areas:
- Web content mining
- Web structure mining
- Web usage mining
[1] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (Eds.) (1996). Advances in Knowledge Discovery and Data Mining. Boston, MA: AAAI/MIT Press

14 Data analysis: the textbook version
- The meaning of attributes is clear
- The meaning of attribute values is clear
Data modelling can be applied directly (e.g., regression, classification, clustering, association-rule discovery)
(Figure: a simplified extract from the adult dataset in the UCI machine learning repository)

15 Data analysis: the reality: a data mining / knowledge discovery process...
    p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:03: ] "GET /search.html?t=jane%20austen&SID=023785&ord=asc HTTP/1.0"
    p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:05: ] "GET /search.html?t=jane%20austen&m=video&SID=023785&ord=desc HTTP/1.0"
    p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:06: ] "GET /view.asp?id=3456&SID= HTTP/1.0"
- What is the meaning of the attributes?
- What is the meaning of the attribute values?
Data modelling is only one part! (CRISP-DM)

16 Where does semantics come in?
(Figure label: Semantics)

17 Agenda: Semantic Web. How?

18 What is an ontology?
Definition. Core ontology with axioms: a structure $O := (C, \leq_C, R, \sigma, \leq_R, A)$ consisting of
- two disjoint sets $C$ (concept identifiers) and $R$ (relation identifiers)
- a partial order $\leq_C$ on $C$ (concept hierarchy or taxonomy)
- a function $\sigma: R \to C^+$ (signature), where $C^+$ is the set of all finite tuples of elements in $C$
- a partial order $\leq_R$ on $R$ (relation hierarchy), where $r_1 \leq_R r_2$ implies
  - $|\sigma(r_1)| = |\sigma(r_2)|$
  - $\pi_i(\sigma(r_1)) \leq_C \pi_i(\sigma(r_2))$ for all $1 \leq i \leq |\sigma(r_1)|$, with $\pi_i$ the projection on the $i$-th component
- a set $A$ of axioms in a logical language $L$
[Stumme, Hotho, & Berendt, Journal of Web Semantics, 2006, and sources there]
"an explicit specification of a shared conceptualisation" (Gruber, 1993)

19 Agenda: Web Mining, Semantic Web...
    p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:03: ] "GET /search.html?t=jane%20austen&SID= &ord=asc HTTP/1.0"
    p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:05: ] "GET /search.html?t=jane%20austen&m=video&SID=023785&ord=desc HTTP/1.0"
    p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:06: ] "GET /view.asp?id=3456&SID= HTTP/1.0"
Understand

20 Semantics of requests, Step 1: Domain ontology
[Oberle, Berendt, Hotho, & Gonzalez, Proc. AWIC 2003]
- Community portal: ka2portal.aifb.uni-karlsruhe.de
- Ontology-based: knowledge base in F-Logic
- Static pages: annotations
- Dynamic pages: generated from queries
- Queries are also in F-Logic
- Logs contain these queries

21 Semantics of requests, Step 2: Modelling requests and sessions-as-sets
Concepts and relations (figure labels): RESEARCHER, PERSON, PROJECT, PUBLICATION, RESEARCHTOPIC, EVENT, ORGANIZATION, RESEARCHINTEREST, LASTNAME, TITLE, ISABOUT, EVENTS, EVENTTITLE, WORKSATPROJECT, AUTHOR, AFFILIATION, ISWORKEDONBY, PROGRAMCOMMITTEE, EMPLOYS, NAME, RESEARCHGROUPS
An example query with concepts and relations:
    FORALL N,PEOPLE > "http://www.anInstitute.org"] and PEOPLE:Person[lastName->>N].
- Query = feature vector of concepts + relations
- Session = feature vector of concepts + relations, summed over all queries in the session
Applicable mining methods: clustering, association rules, classification,...

22 Semantics of sequences, Step 3: Strategy pattern discovery
An ontology of navigation strategies
- Define strategy templates as regular expressions
  - of requests (mapped to ontological entities)
  - of transitions (between ontological entities)
  - Ex.: [.search.* individual]
- Discover strategies by learning a strategy trie
(Figure: strategy trie; node labels and counts include affiliationSearch, 629; topicSearch; repetition, 402; refinement, 113; individual; repetition)
[Berendt & Spiliopoulou, VLDB Journal, 2000] [Berendt, Data Mining and Knowledge Discovery, 2002]

23 NB: For more exploratory analyses: The Web Usage Miner WUM
    select t
    from node a b, template a * b as t
    where a.url startswith "SEITE1-"
      and a.occurrence = 1
      and b.url contains "1SCHULE"
      and b.occurrence = 1
      and (b.support / a.support) >= 0.2
[Spiliopoulou, 1999; Berendt & Spiliopoulou, VLDB Journal, 2000]

24 Semantics of sequences, Step 4: Strategy pattern evaluation
Use strategy pattern statistics to
- derive descriptive measures of patterns
  - support, confidence
  - popularity, effectiveness, efficiency
- apply inferential statistics to compare patterns
[Berendt, Data Mining and Knowledge Discovery, 2002]

25 Communication: Visual data mining, Step 5: Mapping an ontological relation over concepts to a linear order and to visual variables
(Figure labels: First search page; Goal: Individual page; Concreteness; Time; Reach goal; Refine search; Remain unspecific; Abandon search; More constraints on search)

26 Ad Q.3: What do people do there?

27 Communication: Visual data mining, Step 5: Example
(Figure labels: Entry page; Search with x parameters; Individual page; Search criterion location; Search criterion textual property)
[Berendt, Data Mining and Knowledge Discovery, 2002], [Berendt, Postproc. WebKDD 2001]

28 An online shop with a difference
[Berendt, Günther, & Spiekermann, Communications of the ACM, 2005]

29 Communication: Visual data mining, Step 6: Visual abstraction → new semantic patterns
(Figure labels: Shopping for jackets; Shopping for cameras; Closeness to product)
[Berendt, Data Mining and Knowledge Discovery, 2002], [Berendt, Postproc. WebKDD 2002]

30 Ad Q.4: Worldwide usability

31 The impact of language and domain knowledge on search option choice
2 studies on the use of search options in the eHealth site:
- Webserver log: requests / sessions from 188 countries
  - 83.2% first-language users, 16.8% second-language users
- Webserver log + questionnaire: 165 (106) people from 34 countries
  - 84.9% first-language users, 15.1% second-language users
  - 10.4% physicians, 89.6% patients
- Results:
  - Search engine, alphabetical search: in particular first-language users, physicians
  - Content-organized search: in particular second-language patients
Domain knowledge compensates for limited language knowledge.
[Kralisch & Berendt, New Review of Hypermedia and Multimedia, 2005]

32 Semantics: Service ontology
(Figure labels: Alphabetical search; Diagnosis; Diagnosis info; TOP; Search)

33 Results on frequent search patterns
- Diagnoses are "hubs" for navigation (5.3%, 4%)
- Alphabetical search: hub-and-spoke; only linguistic relations (6.4%)
- Localization search: linear / depth-first; search refinement & medical knowledge (5%)
[Berendt, Postproc. WebKDD 2005]

34 Mining with ISOVIS: Semantic drill-down, visualizing detail & context
[Berendt, Postproc. WebKDD 2005]

35 Ad Q.5: Shopping behaviour and Web site value

36 5. What is my site worth for my business?
- A site is often only a part of a distribution strategy / one channel to reach customers.
- What are the conversion rates (how many visitors become buyers etc.)?
- What are the cross-channel effects?
(Figure: Internet market shares [BCG 2002])

37 Semantics: The buying process as a service ontology

38 Mining (example): Association rules for investigating preferences in the buying process
A study based on ~100K sessions and ~13K transactions from 2002 at a leading European retailer of consumer electronics showed, among other things (s = support, c = confidence):
- Online payment → Direct delivery (s=0.27, c=0.97): < 1/3 traditional online users!
- Online payment → In-store pickup (s=0.02, c=0.03)
- Cash on delivery → Direct delivery (s=0.02, c=0.03)
- In-store payment → In-store pickup (s=0.69, c=0.94): the site is primarily used for information search.
Key performance indicators (Web metrics), e.g.:
- conversion efficiency
- offline conversion
- effectiveness and efficiency of search options
[Berendt & Spiliopoulou, VLDB Journal, 2000; Berendt, Data Mining and Knowl. Discovery, 2002; Teltzrow & Berendt, Proc. WebKDD 2003]

39 Agenda: Web Mining → (Semantic) Web

40 Step 6: Deployment of results. Example 1: Using results for site improvement
Path analysis + metrics + χ² analysis showed:
- All search criteria were approximately equally effective
- Location-based search was most popular
- City-based search was most efficient... but least popular
→ Modify site design to make efficient search more popular
[Berendt & Spiliopoulou, VLDB Journal, 2000; Berendt, Data Mining and Knowl. Discovery, 2002; Spiliopoulou & Pohle, DMKD, 2001]

41 Step 6: Deployment of results. Example 2: Using results for personalization
Recommendations for Web site design
[Kralisch, Eisend, & Berendt, Proc. HCI International, 2005]

42 Step 6: Deployment of results. Example 3: A privacy-preserving Web-metrics analysis service
[Teltzrow, Preibusch, & Berendt, IEEE EC Conf. 2004]

43 Agenda: Web Mining, Semantic Web: contribute
(Figure: page from a dissertation bibliography)
    Literaturverzeichnis [bibliography]
    [1] Agarwal, R.; Krueger, B. P.; Scholes, G. D.; Yang, M.; Yom, J.; Mets, L.; Fleming, G. R. Ultrafast energy transfer in LHC-II revealed by three-pulse photon echo peak shift measurements, J. Phys. Chem. B, 2000, 104, 2908,...

44 Data and metadata in the Digital Library EDOC
    136 Literaturverzeichnis [bibliography] ...
    [2] Albrecht, T. F.; Bott, K.; Meier, T.; Schulze, A.; Koch, M.; Cundiff, S. T.; Feldmann, J.; Stolz, W.; Thomas, P.; Koch, S. W.; Göbel, E. O.
    Disorder mediated biexcitonic beats in semiconductor quantum wells, Phys. Rev. B, 1996, 54, 4436,...

45 Authoring support for document servers
- Surveys & Web usage mining analysis of a digital publishing service showed:
  - Metadata creation is one of the main barriers for contribution.
- Reasons include deficiencies in
  - information flow
  - understanding and use of structured search
  - education in structured writing
  - HCI aspects
(Figure labels for the remedies: Marketing; Education; Intelligent Authoring Tools)
[Berendt, Brenstein, Li, & Wendland, Proc. ETD 2003] [Berendt, Proc. AAAI Spring Symposium KCVC, 2005]

46 ... and this has consequences (problems of the fully manual approach)
(Figure: the dissertation bibliography page "136 Literaturverzeichnis", entry [1] Agarwal et al., as on slide 43)

47 The fully automatic approach

48 Why is this a problem?
[Cardona & Marx, Physik Journal 2004] [Berendt, in Neues Handbuch Hochschullehre, 2003]

49 The Scientific Authoring Process
Why invest the extra effort to provide good metadata?
- because I see they are useful for me
- because I have tools that help me
Process steps and tool support:
- Search & retrieval: search query; retrieval via Web + OAI from Citeseer (http://citeseer.ist.psu.edu)
- Reading: reference parsing and Information Extraction via a combination of code used in citeseer and citebase (Paratools)
- Sense-making (organisation of the literature): clustering based on co-citation analysis; manual re-organisation and labelling of clusters
- Discussion: annotation of clusters; publication and sharing in HTML and XML
- Writing: corrected, XML annotated, and formatted
System architecture:
- Web services
- Text mining / Information Extraction tools
- Databases (local a/o mirrored)
- other WS and info. sources
- VBA macro
NB: Evaluation of IR/IE quality: see our paper in proceedings
Build a tool that is
- user-friendly
- intelligent
- modular and extensible

50 The Scientific Authoring Process (overview diagram as on slide 49)
[Berendt, Dingel, & Hanser, Proc. ECDL 2006]

51 IR-THESIS: System architecture
- Web services
- Text mining / Information Extraction tools
- Databases (local a/o mirrored)
- other WS and info. sources
- VBA macro

52 The Scientific Authoring Process (overview diagram as on slide 49)

53 Search and retrieval

54, 55 The Scientific Authoring Process (overview diagram as on slide 49)

56 Organisation of the literature / bibliography construction

57 The Scientific Authoring Process (overview diagram as on slide 49)

58 Discussion

59 The Scientific Authoring Process (overview diagram as on slide 49)

60 Writing: corrected, XML annotated, and formatted

61 Conclusions and outlook
- Semantics are often necessary to do mining at all
- Semantics often allow the analyst to make more sense of the results
- Semantic Web Mining is semi-automatic → interactive tools!
- Standardisation can make the mining process more automatic
- Mining can help to generate semantics
- To what extent is further user and context modelling useful a/o necessary for valid conclusions (intentions, goals, constraints, ...)?
- How can we encourage standards?
- When are explicit (formal) semantics better, when implicit semantics?
- How can we move beyond the Web (ubiquitous environments)?
- How can privacy be protected in a data-rich and mining-rich world? (Are privacy semantics à la P3P a solution?)
- What do users want? What about other stakeholders? Whom and what and how to ask?

62 Thank you for your attention!

63 Discussion point 1: Is reference markup ontological / Semantic Web?
- DiML (Dissertation Markup Language), used in the case study above, is structured approximately like Bibtex (with the difference that the type of publication is an attribute, so there is only one top-level concept, "citation"). This also makes it comparable to Dublin Core. [In its latest versions, the system also contains mappings to DC and other commonly used schemata.]
- This makes it indeed an extremely primitive ontology (essentially, a concept hierarchy with one concept, "publication", with attributes that have literals as value range: author, title, etc.).
- Extensions to make this really "semantic" include (some are part of our current work)
  - Author, affiliation, etc. as concepts with instances, as in Repec.org; this introduces relations like is-author-of
  - Unique identifiers of publications that allow the detection of duplicates, as in Citeseer
  - Links to libraries, as in OpenURL
  - Versioning and other interesting relations between different publications (cf. the Dublin Core element "relation")

64 Discussion point 2: Can folksonomies be used instead of ontologies? (1)
- This is a difficult question, not least because it is still unclear what exactly tags are:
  - an object-level summary and thus "more content", or
  - a truly meta-level classification which comes from a set of labels that is categorically different from "just more content words"?
- In the following, I use the second interpretation. I refer to folksonomy tags as "concepts" because a folksonomy can formally be regarded as an extremely simple ontology: a set of concepts with no hierarchical or other relations between them.
- The answer to the question in the title of this slide depends on the aspect of folksonomies one is most interested in, and on how important one considers certain properties of ontologies to be.

65 Discussion point 2: Can folksonomies be used instead of ontologies? (2)
The answer tends to be YES when one focuses on
- WHO DEFINES THE CONCEPTS
  - All ontologies used in the case studies shown were based on or extended popular models and/or ontologies in the domain of investigation (search in the educational portal: models of information search from information science; shopping: models of the customer buying process from marketing; shopping with bot assistance: the same + our design of questions, developed in conjunction with a major German retailer; search in the medical portal: like search in the educational portal plus the medical ICD-9, the International Classification of Diseases; DiML/DC).
  - But none of the ontologies used in the case studies here was a "standard" in the sense that many people agree on it and many applications use it. In fact, there are precious few such standard behaviour models!
  - In that sense, the ontologies used here are, like much of the Semantic Web work, just one possibility proposed by a number of people (the research group + application partners), instead of the result of a standardisation effort.
n IN FOLKSONOMY-STYLE TAGGING, A RESOURCE USUALLY HAS MORE THAN ONE TAG l Any set of concepts that a group agrees on can be used. l In SWUM (Semantic Web Usage Mining), Web pages are mapped to single concepts (ex.: slides 22ff.) or sets of concepts (ex.: slide 21). This set of concepts could also be a tag set as in del.icio.us. 66 Discussion point 2: Can folksonomies be used instead of ontologies? (3) The answer tends to be MAYBE when one focuses on l DYNAMICS introduce a non-stability of the mapping, which means that the patterns would change "depending on how you look at them" - which may or may not be desirable My opinion: This quickly becomes untractable, thus an ontology-based treatment of different viewpoints and dynamics ( ontology evolution) appears to be the better choice. The answer tends to be NO when one focuses on n FORMAL PROPERTIES l HIERARCHIES: generalization is an important feature of many mining algorithms (unless you abstract, you may not find any pattern. l (Non-hierarchical) RELATIONS: In folksonomies, there are no relations on concepts. Therefore, meaningful visualizations become harder to produce (note that the stratograms shown on slides 27 and 29 require relations that induce a linear order on concepts). Also, all other inference possibilities are lost. n COMPARABILITY: The results of SWUM can only be compared (e.g., conversion rates in one site with those in another site) if stable and uniform ontologies are used. 67 Discussion point 3: Which of the techniques shown in this talk are being used in industry and other real-world sites? (1) Pre-remark 1: The contents of this talk was (recent) research, thus it would be surprising to see it already incorporated into industrial practice. However, given that Web usage mining has been around for a number of years, the question is valid. Pre-remark 2: Web usage mining is used on a large scale by search engines. Google says it, Yahoo! Says it. 
Both say they rely rather on latent-semantic-indexing style semantics than on Semantic-Web-style semantics (but they do use lexica and other helpers); the boundaries are fluid. Anyway, they dont say too much about the details of their algorithms. After all, mining is their business model... Anyway, we believe that SWUM is applicable to analysing search when the focus is on what services of a site(`s interface) are used, not when the content of searches is investigated (cf. content vs. Service conceptual hierarchies in Berendt & Spiliopoulou, VLDB Journal 2000). Thus, search engines are not the intended application areas of our techniques, but retail, information, e-Government, etc. sites. The question should therefore be rephrased as 3 questions: n Do off-the-shelf software packages (used by end-user companies either on-site or in ASP mode, i.e. without external consultants to do the analyses) support Web usage mining, and specifically Web usage mining with semantics? l The answer is: Very partly. n Do consultants offer SWUM analyses? l The answer is: partly. n What are the likely reasons? l A tentative answer is: Perception problems and lack of incentives. 68 Discussion point 3 (2): Support in off-the-shelf software basic forms of analysis n Pageview counts and simple OLAP-type analyses (hits by country, by language, etc.) are pretty standard and supported even by most of the simplest freeware products (e.g., Analog). Their usage is very common in industry. n State-of-the-art commercial analysis software like Webtrends allows a certain degree of programming for extracting more attributes that can be subjected to OLAP-type analyses (see below for an example). n State-of-the-art software often also supports the extraction of more information transferred via Javascript. An example is Google Analytics. n Syntax is generally the only basis. 
Semantics usually comes in only insofar as the Content Management System used by most sites today provides a certain frame of reference and meaning.

69 Discussion point 3 (3): Support in off-the-shelf software: conversion rates
n Software generally also supports the definition of simple templates from which conversion rates can be computed automatically (e.g., a click on page X with referrer Y, or after a sequence of pages that started with referrer Y, counts as a converted customer brought to us from the banner shown on affiliated site S).
n Conversion rates are not only extremely simple (divide the number of sessions that reached X and then Y by the number of sessions that reached X), but also quite powerful: Every success measure that can be defined via reachability can be cast as a conversion rate.
n The 3-click rule (every page must be reachable with 3 clicks) is a related and equally simple-to-compute measure. That a page is reachable in 3 clicks can be computed from the site graph; that it is actually reached can be computed from frequent sequences. This only requires that the tool can compute frequent contiguous sequences, which is algorithmically simple and requires little thinking on the part of the analyst.
n For conversion-rate computations, semantics occurs in the simple sequence templates offered by the tools; the mapping is gathered from the users via Web forms or scripts.
n Conversion rates are also related to pricing models such as GoogleAds.
For a survey of software, see

70 Discussion point 3 (4): Support in off-the-shelf software: possibilities and limitations / example country & language
n Language
l is usually defined as either the presentation language (in a site with dynamic pages generated by a content management system, this can easily be extracted)
l or the language (assumed to be) preferred by the user (the browser setting, which in most cases is likely to be the default with which the browser is shipped).
n Country is inferred from the IP address and an IP-to-geo-coordinates mapping. Such mappings are provided by software like MaxMind. This is relatively reliable according to the producers and according to a test we did (publication in preparation).
n To obtain the user's native language, we inferred it from the Geo-IP mapping and official data on official languages in countries around the world. In a small experimental sample in which we asked users to specify their native language, we obtained quite high accuracy (Kralisch & Berendt, NRHM 2005).
n I do not know of data on the accuracy of the browser-setting-to-native-language mapping, or of data comparing it to the Geo-IP approach we used.
n But: only the combination of presentation language + user's native language gives information about whether a user accesses content in his/her native language or in a foreign language, and this knowledge may be much more important for personalization than presentation language or preferred language alone (see Kralisch, Ph.D. dissertation 2006, berlin.de/docviews/abstract.php?id=27410).
n Nonetheless, even the semantics of presentation language ≠ user language are, to my knowledge, not utilized in off-the-shelf software. One reason is that awareness of the importance of language in Internet design has only begun to develop.

71 Discussion point 3 (5): Consultancy companies
n More advanced forms of conversion-rate analysis, which rely on (some) semantics, have been introduced or popularized by consultancy companies. Examples:
l NetGenesis (Cutler & Sterne), E-Metrics White Paper, 2000. The funnel metrics introduced there are now also offered, for example, by Google Analytics.
l Accenture (R. Ghani), Mining the Web to add semantics to retail data mining, in Berendt et al., Web Mining: From Web to Semantic Web (2004).
l Survey by Anand et al., On the deployment of Web usage mining, ibid.
n Unfortunately, publicly available data on Web usage are usually at a very high level of aggregation and (also for this reason) build on essentially non-semantic analysis types, e.g.

72 Discussion point 3 (6): Likely reasons
n One major problem is a divergence between the (current or definitional?) nature of data mining / knowledge discovery on the one hand, and business expectations on the other:
l KD is still more an art than an engineering process, with few standards even for the process.
l Business often expects data mining to be a set of fully automatic, pre-packaged black-box solutions.
n The CRISP-DM process model shown on slide 16, for example, is a very high-level attempt at standardisation which leaves many details open.
n In fact, it can be (and often is) argued that the search for interesting and novel patterns through exploratory data analysis by definition involves hand-crafting. Going back to the original definition of data mining (see slide 13), one could argue that looking for the values of pre-defined pattern templates (e.g., conversion rates) is the antithesis of finding novel patterns and thus by definition not data mining.
n On the other hand, Web usage mining is essentially market research: a study of user / consumer behaviour. Market research is an established discipline in which it is quite accepted that methods involve human intervention and interpretation rather than the automatic application of pre-packaged procedures (one example is the focus-group method).

73 Discussion point 3 (7): Likely reasons contd.
n Maybe this is a perception problem: While it is clear that consumer opinions bear a strong qualitative element (such that focus groups cannot be prepared, administered and interpreted by a machine alone), data mining carries the image of number crunching (implying that computers are the main actors here).
n In line with this, the responsible people often have disjoint qualifications: The market research people have a strong background in the relevant social-science methods; the IT people (who are expected to do the data mining on the side) can use tools, but usually have limited knowledge about empirical methods in general or data mining in particular.
n This point was discussed at a panel at the WebKDD workshop at SIGKDD 2005; one result was that the job description "Chief Data Officer" (= a senior-management person with resources who knows about data mining in the sense of data analysis AND computers) was a really recent invention. In the meantime, data-mining consultancies filled the gap (but had to convince companies they were worth it).
n Or it is a problem of a lack of standards (once we have behaviour models of retail sites, of education sites, etc., we can pre-package these behaviour ontologies and even compare sites).
n Standards (in behaviour modeling) require that there is an interest in what the behaviour models say, and an interest in being comparable to other sites. Encouraging developments in this direction can currently be observed in the digital libraries community.
... to be continued...

74 Thank you for your questions!
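
Supplementary sketch for Discussion point 3 (3): the conversion-rate computation described there (sessions that reached X and then Y, divided by sessions that reached X) could look roughly as follows. This assumes that sessions have already been reconstructed from the log as ordered lists of page identifiers; the function names, page names and sample sessions are hypothetical illustrations, not taken from any of the tools mentioned.

```python
def reached_then(session, x, y):
    """Return True if page x occurs in the session and page y occurs later."""
    if x not in session:
        return False
    first_x = session.index(x)
    return y in session[first_x + 1:]


def conversion_rate(sessions, x, y):
    """Share of the sessions that reached x which afterwards also reached y."""
    reached_x = [s for s in sessions if x in s]
    if not reached_x:
        # No session reached x, so the rate is reported as 0.0 (an assumption).
        return 0.0
    converted = [s for s in reached_x if reached_then(s, x, y)]
    return len(converted) / len(reached_x)


# Hypothetical sessions: each is the ordered list of pages one visitor requested.
sessions = [
    ["home", "search", "product", "checkout"],
    ["home", "product"],
    ["search", "product", "home"],
    ["home", "help"],
]

rate = conversion_rate(sessions, "product", "checkout")
print(f"product -> checkout conversion rate: {rate:.2f}")
```

In a semantic variant, each page identifier would first be mapped to a concept of the site ontology, so that X and Y are concepts rather than individual URLs.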

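Supplementary sketch for Discussion point 3 (4): the combination of presentation language and inferred native language could be evaluated along the following lines. This assumes the country code has already been obtained from a Geo-IP lookup; the lookup table is a tiny hypothetical excerpt, and the function names are illustrative only, not the procedure used in the cited studies.

```python
# Hypothetical excerpt: ISO country code -> official languages (ISO 639-1 codes).
OFFICIAL_LANGUAGES = {
    "DE": {"de"},
    "FR": {"fr"},
    "CH": {"de", "fr", "it", "rm"},
}


def likely_native_languages(country_code):
    """Infer candidate native languages from the country's official languages."""
    return OFFICIAL_LANGUAGES.get(country_code, set())


def access_type(country_code, presentation_language):
    """Classify a request as native-language or foreign-language access."""
    candidates = likely_native_languages(country_code)
    if not candidates:
        # Country not covered by the table: no inference is attempted.
        return "unknown"
    return "native" if presentation_language in candidates else "foreign"


print(access_type("DE", "en"))
```

For countries with several official languages the inference yields a set of candidates, which is one reason why accuracy has to be checked empirically, as in the sample mentioned on the slide.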