Overview of the UCIAD (user centric integration of activity data), presented during the JISC visit - 20/04/2011
User Centric Integration of Activity Data
Mathieu d’Aquin, Stuart Brown, Salman Elahi, Enrico Motta
The Open University
Agenda
• Introduction of the Team
• Objectives and Hypothesis
• Overview of technical realization
• Challenges
• Summary of results so far and dissemination
Team
• Dr Mathieu d’Aquin – Research fellow, KMi – project director
• Stuart Brown – Web developments and online communities, communication services – member of the steering group, liaison with online services
• Salman Elahi – Research assistant and PhD student, KMi – developer/researcher
• Prof Enrico Motta – Professor of knowledge technologies, KMi – Chair of the steering group
Objectives and Hypothesis
Hypothesis
1. Taking a user centric point of view can allow different types of analysis of logs/activity data, which are valuable to the organisation and the user
2. Ontologies and ontology-based reasoning can support the integration, consolidation and interpretation of activity data from multiple sources
Organisation Centric Activity Data
[Diagram: Users access Website 1 to Website 4 of the Organisation; each website produces its own logs (Logs 1 to Logs 4), which are consolidated. Analytics = aggregated stats]
At the Open University
• An analytics system building aggregated data from the university’s various websites
• Based on manually defined sitemaps
• Good for website optimization, marketing campaigns, etc.
• But because the data is pre-aggregated, it is limited with respect to what it can do
• Limited control
• No user view
User Centric Activity Data
[Diagram: Users access Website 1 to Website 4 of the Organisation; the logs (Logs 1 to Logs 4) go through consolidation, integration and interpretation, supported by ontologies, for activity analysis for and by individual users]
Ontologies
• Formal conceptual models of a domain
– Here, the domain is online user activity
• At the basis of Semantic Web technologies
– Standard languages for expressing ontologies and ontological data (RDF, OWL)
– Tools to manipulate and work with ontologies and semantic data (NeOn Toolkit, OWLIM)
– Many ontologies to reuse (cf. Watson)
• Adhere to a logical formalism
– Enable inferences on the data
Objectives and Deliverables
• Build the technical infrastructure that can hold traces of activity data as semantic data
– Includes a triple store with reasoning capability, log parsers for different formats of logs, and renderers as semantic data (RDF)
• Build the ontologies to interpret and reason upon activity data
– Including various aspects of activity data in a way which is extensible
• Tools to support users in analyzing their own activity data
– Recognize a user from the different settings and provide a view on his/her own data
– Allow him/her to customize the view, by customizing the ontology
• Test, validate, deploy, distribute
Technical infrastructure
[Diagram: applications on Server1, Server2 and Server3 write logs; each log is processed by a Parser/RDF renderer that produces daily RDF traces; a Scheduler/Manager and the Semantic Triple Store receive these traces]
Technical infrastructure
• Development of parsers for different kinds of log formats
– Currently handle Apache web server log files, parameterized from the Apache configuration
– Easily extensible for dedicated log formats
• Provide a common data structure serialized in RDF by the RDF renderer
• Each server produces a daily extract from the logs in RDF, which is used to populate the semantic triple store
• The triple store includes multiple repositories and sub-spaces depending on time/user/server
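The parser and RDF renderer steps above can be sketched as follows. This is a minimal illustration and not the UCIAD code: it assumes the Apache "combined" log format, and the `http://example.org/uciad/` namespace and the property names (`hasIP`, `hasAgent`, and so on) are hypothetical placeholders rather than terms from the project's ontologies.

```python
import re
from datetime import datetime

# Apache "combined" log format (assumption). A real parser would derive
# the pattern from the LogFormat directive in the Apache configuration.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

NS = "http://example.org/uciad/"  # hypothetical namespace


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


def escape(value):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def render_ntriples(record, trace_id, server):
    """Render one parsed record as a list of N-Triples lines."""
    subject = f"<{NS}trace/{server}/{trace_id}>"
    values = {
        "hasIP": record["ip"],
        "hasAgent": record["agent"],
        "hasMethod": record["method"],
        "hasPath": record["path"],
        "hasTime": record["time"].isoformat(),
    }
    return [
        f'{subject} <{NS}{prop}> "{escape(value)}" .'
        for prop, value in values.items()
    ]
```

Running `parse_line` over one day of a server's log and concatenating the output of `render_ntriples` would give one daily RDF extract per server, which could then be loaded into the triple store.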
Ontologies
• Key concepts to be represented:
– Actors (human users and robots)
– Sitemaps
– Traces (broad notion of logs)
– Activities
• Reusing existing ontologies
– FOAF: for people and documents
– Time Ontology: for traces
– Action ontology: for traces and activities
– (Planned) OPO: Online presence
– (Planned) SIOC: Online communities
• Iterative and extensible construction of the ontologies
– Provide a base with actors, sitemaps and traces
– Specific extensions with typologies of activities, depending on user and site
– Dynamically building and integrating
Tool for analysis
• Need a tool which, given
– A set of ontologies
– A data repository (which can be the overall one, one restricted by time, or one for a given user)
can provide a meaningful and interactive overview of the activity data
• To be used to
– Provide an ontology-specific view of data analytics
– Support the iterative development of the ontologies
– Provide a user centric view of the data
Tools for analysis
Example
In the ontology:
/robot.txt is a RobotTXT page
A Spider is a RobotAgent (ActorAgent)
An agent used to access a RobotTXT is a Spider
An AutomaticActivity is a Trace realized by a RobotAgent
Result:
Thousands of traces automatically classified as automatic activities.
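The chain of definitions in this example can be mimicked in plain Python to show how the classification propagates. This is only a sketch of the reasoning, not the OWL ontology or the reasoner used in the project. The `Trace` class and the `classify_automatic` function are hypothetical, and the default path assumes the conventional file name `/robots.txt` (the slide writes `/robot.txt`).

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    agent: str                      # user agent string of the request
    path: str                       # requested page
    types: set = field(default_factory=set)


def classify_automatic(traces, robots_path="/robots.txt"):
    """Apply the three rules of the example and return the set of spiders."""
    # An agent used to access a RobotTXT page is a Spider,
    # and a Spider is a RobotAgent.
    spiders = {t.agent for t in traces if t.path == robots_path}
    # An AutomaticActivity is a Trace realized by a RobotAgent.
    for t in traces:
        if t.agent in spiders:
            t.types.add("AutomaticActivity")
    return spiders
```

Note that a single request to the robots file is enough to mark every other trace of the same agent as automatic, which is how a small rule set can classify thousands of traces.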
Example
In the ontology:
UCIAD-Blog and LUCERO-Blog are Blogs (Website)
A BlogPage is a page which is part of a Blog
An activity onBlog is an activity happening on a BlogPage
Result:
Can look specifically at activities happening on a Blog and specialize them (same applies to Wikis, and other types of websites)
Example
In the ontology:
A SPARQLEndpoint is a specific type of Webpage
AccessingSPARQLEndpoint is an activity on a SPARQLEndpoint
SPARQLQueryParameter is a parameter with the name “query” used in an AccessingSPARQLEndpoint activity
ExecutingSPARQLQuery is an AccessingSPARQLEndpoint activity attached to a SPARQLQueryParameter
Result:
Can explore the specific activity of executing SPARQL queries and its parameters
Can combine: detect the activity of automatically accessing a SPARQL endpoint, which is both an automatic activity and an activity accessing a SPARQL endpoint.
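As a rough illustration of how these definitions combine, the sketch below classifies a single request URL in plain Python. It is not the ontology itself: the set of endpoint paths and the combined class name `AutomaticallyAccessingSPARQLEndpoint` are assumptions made for the example.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical list of pages known to be SPARQL endpoints.
SPARQL_ENDPOINT_PATHS = {"/sparql", "/query"}


def classify_request(url, agent_is_robot=False):
    """Return the set of activity types that apply to one request."""
    parts = urlsplit(url)
    types = set()
    if parts.path in SPARQL_ENDPOINT_PATHS:
        types.add("AccessingSPARQLEndpoint")
        # A parameter named "query" makes it an executed SPARQL query.
        if "query" in parse_qs(parts.query):
            types.add("ExecutingSPARQLQuery")
    if agent_is_robot:
        types.add("AutomaticActivity")
    # Combination: both an automatic activity and an endpoint access.
    if {"AutomaticActivity", "AccessingSPARQLEndpoint"} <= types:
        types.add("AutomaticallyAccessingSPARQLEndpoint")
    return types
```

In an ontology the last step would be a class defined as the intersection of the two others, so no extra code would be needed once both classes exist.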
Next step: User support
• Allow users
– to log in
– detect setting
– bring up the relevant data
– explore it
• But also,
– to customize the view of the data
– to extend the ontologies to provide a personalized analysis of activity data
– to export (interpreted) activity data for reuse
User support
[Flowchart]
• User logs in or registers
• Detect setting (agent+IP)
– Known setting for user: display activity data related to all known settings of the user
– Unknown setting: check whether the setting is non-ambiguous
• If ambiguous: register setting as ambiguous
• If non-ambiguous: ask “It is the first time you log into UCIAD with this setting (detail). Do you want to attach it to your account?” (yes/no)
– Yes: add setting to known settings
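One possible reading of this login flow is sketched below. The slides do not say how ambiguity is decided, so the criterion used here (a setting already attached to another user) is an assumption, as are the `Registry` class and its method names.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    known: dict = field(default_factory=dict)    # user -> set of (agent, ip)
    ambiguous: set = field(default_factory=set)  # settings not tied to one user

    def owners(self, setting):
        """Users who already have this setting attached to their account."""
        return {u for u, settings in self.known.items() if setting in settings}

    def login(self, user, setting, confirm):
        """Handle a login; `confirm` asks the user whether to attach the setting."""
        mine = self.known.setdefault(user, set())
        if setting in mine:
            return "display"            # known setting: show the data
        if setting in self.ambiguous or self.owners(setting):
            self.ambiguous.add(setting)  # assumed ambiguity criterion
            return "ambiguous"
        if confirm(setting):
            mine.add(setting)            # user accepted the new setting
        return "display"
```

The `confirm` callback stands in for the question shown to the user the first time a setting is seen.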
User support: data for a user
For a user <u>, the SPARQL query
CONSTRUCT {?trace ?p ?y. ?y ?q ?z} WHERE
{<u> actor:hasKnownSetting ?s.
?trace trace:hasSetting ?s. ?trace ?p ?y. ?y ?q ?z}
builds the traces of activities around the known settings of <u>
Used to populate a specific repository with sub-spaces for each registered user
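The idea behind this query, collecting the triples that leave each trace of the user and then the triples that leave their objects, can be illustrated over an in-memory set of triples. This is a plain-Python approximation and not an exact evaluation of the SPARQL query; the function name and the use of prefixed names as plain strings are assumptions made for the example.

```python
def user_subgraph(triples, user,
                  has_known_setting="actor:hasKnownSetting",
                  has_setting="trace:hasSetting"):
    """Return the triples around the traces made from the user's known settings.

    `triples` is a set of (subject, predicate, object) tuples.
    """
    # Settings attached to the user's account.
    settings = {o for s, p, o in triples
                if s == user and p == has_known_setting}
    # Traces recorded with one of those settings.
    traces = {s for s, p, o in triples
              if p == has_setting and o in settings}
    # First hop: everything stated about those traces.
    first_hop = {t for t in triples if t[0] in traces}
    # Second hop: everything stated about the objects of the first hop.
    objects = {o for _, _, o in first_hop}
    second_hop = {t for t in triples if t[0] in objects}
    return first_hop | second_hop
```

The returned set is what would be loaded into the sub-space of the repository reserved for that user.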
Deployment, test, validation
• At the moment, testing for websites of projects and events hosted on KMi servers:
– sssw.org, sssw09.org, loted.eu, lucero-project.info, uciad.info, data.open.ac.uk, lucero.open.ac.uk, …
• Next level up, websites/systems from the main Open University website:
– www.open.ac.uk, study at the OU, podcasts.open.ac.uk, VLE
• Extend to deployment of instances for specific projects with distributed websites
Challenges
• Scalability
– OWLIM triple store can handle billions of triples
– But struggles with millions when inference is “on”
– 1 repository without inference with all historical data, 1 with inference with 1 week of data only, and 1 with inference for registered users
• User management and privacy
– Ensuring that the user who logs in from a particular setting is the one having the activity is difficult (e.g., in the case of shared computers)
– Is this really a problem?
– Check ambiguity – ask verification questions – moderate?
• Distribution and IPR
– Code and ontologies under open licenses (small uncertainty regarding code developed in other projects)
– Overall data: privacy issues (is k-anonymity actually applicable? Would it work?)
– Overall data: institutional issues (can we show the traffic on our websites to everybody?)
– User data export: what license?
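The three-repository split mentioned under Scalability could be driven by a small routing function like the one below. This is a sketch under assumptions: the repository names are hypothetical, and the slides do not describe how traces are actually assigned to repositories.

```python
from datetime import date, timedelta


def target_repositories(trace_date, setting, registered_settings,
                        today, window_days=7):
    """Decide which repositories should receive a trace."""
    # All historical data goes to the repository without inference.
    repos = ["history"]
    # Only the last week of data goes to the repository with inference.
    age = today - trace_date
    if timedelta(0) <= age < timedelta(days=window_days):
        repos.append("recent-inference")
    # Traces from settings of registered users get their own repository.
    if setting in registered_settings:
        repos.append("users-inference")
    return repos
```

Keeping inference to the two smaller repositories limits the number of triples the reasoner has to handle at once.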
Summary and dissemination
• Promising initial results
– Can create new ways of analysis at run-time by editing the ontologies!
– Mechanisms to provide personal views on own activity data across websites
• First version of the ontologies: ongoing task
• First version of the tools: test and validate!
• Dissemination
– Blog / Twitter #uciad
– KMi’s internal newsletter (KMi Planet)
– Salman’s paper at the ESWC 2011 PhD symposium: “Personal Semantics: Personal information management in the Web with Semantic Technologies”
– Position paper at the W3C Web tracking and privacy workshop: “Self-Tracking on the Web: Why and How”
– Submission to the Personal Semantic Data workshop at K-CAP 2011
More info
UCIAD Blog: http://uciad.info
Code base: http://github.com/uciad
Twitter: #uciad
@mdaquin