
Better Logging to Improve Interactive Data Analysis Tools


Page 1: Better Logging to Improve Interactive Data Analysis Tools

Better Logging to Improve Interactive Data Analysis Tools

Sara Alspaugh

Archana Ganapathi

Marti Hearst

Randy Katz

Page 2: Better Logging to Improve Interactive Data Analysis Tools

09-28-2012 18:28:01.134 -0700 INFO AuditLogger - Audit:[timestamp=09-28-2012 18:28:01.134, user=splunk-system-user, action=search, info=granted, search_id='scheduler__nobody__testing__RMD56569fcf2f137b840_at_1348882080_101256', search='search index=_internal metrics per_sourcetype_thruput | head 100', autojoin='1', buckets=0, ttl=120, max_count=500000, maxtime=8640000, enable_lookups='1', extra_fields='', apiStartTime='ZERO_TIME', apiEndTime='Fri Sep 28 18:28:00 2012', savedsearch_name="sample scheduled search for dashboards (existing job case)"]

event
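A minimal sketch, in Python, of how the key=value pairs in an audit event like the one above might be pulled into named fields. The sample line is abridged from the slide and uses straight quotes; the function name `parse_audit_event` and the regular expression are illustrative assumptions, not the parser of any real tool.

```python
import re

# Abridged copy of the audit event from the slide, with straight quotes.
SAMPLE = (
    "09-28-2012 18:28:01.134 -0700 INFO AuditLogger - "
    "Audit:[timestamp=09-28-2012 18:28:01.134, user=splunk-system-user, "
    "action=search, info=granted, "
    "search='search index=_internal metrics per_sourcetype_thruput | head 100', "
    "autojoin='1', buckets=0, ttl=120]"
)

# A value is either a single-quoted string or a bare run of characters
# that ends at the next comma or closing bracket.
FIELD = re.compile(r"(\w+)=(?:'([^']*)'|([^,\]]*))")


def parse_audit_event(line):
    """Return the key=value pairs inside Audit:[...] as a dict of strings."""
    start = line.index("Audit:[") + len("Audit:[")
    body = line[start:line.rindex("]")]
    fields = {}
    for key, quoted, bare in FIELD.findall(body):
        # Only one of the two value groups is non-empty for any match.
        fields[key] = quoted if quoted else bare.strip()
    return fields


fields = parse_audit_event(SAMPLE)
```

Once the event is a dictionary, the timestamp, user, and action can be read directly (for example `fields["user"]`) instead of being searched for in free text.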

Page 3: Better Logging to Improve Interactive Data Analysis Tools

(The same audit log event as on Page 2, with callouts: event, timestamp.)

Page 4: Better Logging to Improve Interactive Data Analysis Tools

(The same audit log event as on Page 2, with callouts: event, timestamp, user.)

Page 5: Better Logging to Improve Interactive Data Analysis Tools

(The same audit log event as on Page 2, with callouts: event, timestamp, user, action.)

action

Page 6: Better Logging to Improve Interactive Data Analysis Tools

(The same audit log event as on Page 2, with callouts: event, timestamp, user, action, parameters, execution environment, configuration and version, stack trace.)

Page 7: Better Logging to Improve Interactive Data Analysis Tools

Why do we need better logging?

Motivation

Page 8: Better Logging to Improve Interactive Data Analysis Tools
Page 9: Better Logging to Improve Interactive Data Analysis Tools
Page 10: Better Logging to Improve Interactive Data Analysis Tools
Page 11: Better Logging to Improve Interactive Data Analysis Tools

Visualizing records of user activity to help optimize the user experience using Google Analytics Goal Flow Tool

Page 12: Better Logging to Improve Interactive Data Analysis Tools

Applications of Good User Activity Records

• recommenders
• predictive interfaces
• task guidelines
• activity visualizations
• traffic analysis
• UX optimization

Jaideep Srivastava, Robert Cooley, Mukund Deshpande, and Pang-Ning Tan. “Web usage mining: discovery and applications of usage patterns from web data.” SIGKDD Explorations Newsletter. 2000.

Page 13: Better Logging to Improve Interactive Data Analysis Tools

Examples of this in IDEA tools

• SYF: Systematic yet flexible (Perer and Shneiderman)
– social network analysis tool
– task guidelines for exploring social network data
– users can provide feedback on task usefulness
– records when users have completed tasks
• SeeDB (Parameswaran, Polyzotis, Garcia-Molina)
– recommend visualizations for a given SQL query

Adam Perer and Ben Shneiderman. “Systematic yet flexible discovery: guiding domain experts through exploratory data analysis.” Conference on Intelligent User Interfaces (IUI). 2008.

Aditya Parameswaran, Neoklis Polyzotis, and Hector Garcia-Molina. “SeeDB: visualizing database queries efficiently.” International Conference on Very Large Data Bases (VLDB). 2013.

Page 14: Better Logging to Improve Interactive Data Analysis Tools
Page 15: Better Logging to Improve Interactive Data Analysis Tools

“Understanding the domain experts’ tasks is necessary to defining the systematic steps for guided discovery. Although some professions such as physicians, field biologists, and forensic scientists have specific methodologies defined for accomplishing tasks, this is rarer in data analysis. Interviewing analysts, reviewing current software approaches, and tabulating techniques common in research publications are important ways to deduce these steps.”

Page 16: Better Logging to Improve Interactive Data Analysis Tools

Some problems with logging

• ICSE 2012 study of logging best practices
• looks at four top OSS projects, finds logging is:
– “often a subjective and arbitrary practice”
– “seldom a core feature provided by the vendors”
– “written as ‘after-thoughts’ after a failure”
– “arbitrary decisions on when, what and where to log”

Ding Yuan, Soyeon Park, and Yuanyuan Zhou. “Characterizing logging practices in open-source software.” International Conference on Software Engineering (ICSE). 2012.

Page 17: Better Logging to Improve Interactive Data Analysis Tools

“. . . it is critical to gain access to a stream of user actions. Unfortunately, systems and applications have not been written with an eye to user modeling.”

Eric Horvitz, Jack Breese, David Heckerman, David Hovel, and Koos Rommelse. “The Lumière project: Bayesian user modeling for inferring the goals and needs of software users.” Conference on Uncertainty in Artificial Intelligence. 1998.

Page 18: Better Logging to Improve Interactive Data Analysis Tools

Recommendations

Plan ahead to capture high-level user actions when designing the system.

Track detailed provenance for all events.

Observe intermediate user actions that are not “submitted” to the system.

Record the metadata and statistics of the data set being analyzed.

Collect user goals and feedback.

Work towards a standard for logging data analysis activity records.

Page 19: Better Logging to Improve Interactive Data Analysis Tools

Plan ahead to capture high-level user actions when designing the system.

Recommendation #1

Page 20: Better Logging to Improve Interactive Data Analysis Tools

High-level task: clustering in Excel

Page 21: Better Logging to Improve Interactive Data Analysis Tools
Page 22: Better Logging to Improve Interactive Data Analysis Tools

Examples of this in IDEA tools

• HARVEST (Gotz and Zhou)
– visual analytics tool that incorporates action semantics not events as core design element
– based on catalogue of common analytics actions derived through review of many analytics systems
– exposes high-level actions that retain rich semantics as way of interacting with data

David Gotz and Michelle Zhou. “Characterizing users’ visual analytic activity for insight provenance.” Symposium on Visual Analytics Science and Technology (VAST). 2008.

Page 23: Better Logging to Improve Interactive Data Analysis Tools

“...work in this area has relied on either manually recorded provenance (e.g., user notes) or automatically recorded event-based insight provenance (e.g., clicks, drags, and key-presses), both approaches have fundamental limitations.”

Page 24: Better Logging to Improve Interactive Data Analysis Tools

Track detailed provenance for all events.

Recommendation #2

Page 25: Better Logging to Improve Interactive Data Analysis Tools

09-28-2012 18:28:01.134 -0700 INFO AuditLogger - Audit:[timestamp=09-28-2012 18:28:01.134, user=salspaugh, action=search, info=granted, search_id='scheduler__nobody__testing__RMD56569fcf2f137b840_at_1348882080_101256', search='search source=*access_log* | eval http_success = if(status=200, true, false) | timechart count by http_success', autojoin='1', buckets=0, ttl=120, max_count=500000, maxtime=8640000, enable_lookups='1', extra_fields='', apiStartTime='ZERO_TIME', apiEndTime='Fri Sep 28 18:28:00 2012', savedsearch_name=""]

Sources of data transformation activity:
• interactively entered at search bar
• triggered by dashboard reload
• issued from external user script

Bad if same event is logged
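One way to keep these sources apart is to attach an explicit provenance field to every logged event. The sketch below is a hedged illustration in Python: the `Source` enum, the `log_search` function, and the JSON field names are all hypothetical and do not correspond to any existing tool's log format.

```python
import json
from enum import Enum


class Source(Enum):
    """Hypothetical provenance labels for a search event."""
    SEARCH_BAR = "search_bar"              # interactively entered by the user
    DASHBOARD_RELOAD = "dashboard_reload"  # triggered by a dashboard reload
    EXTERNAL_SCRIPT = "external_script"    # issued from an external user script


def log_search(user, query, source):
    """Serialize one search event with an explicit provenance field."""
    if not isinstance(source, Source):
        raise TypeError("source must be a Source")
    record = {
        "user": user,
        "action": "search",
        "search": query,
        "source": source.value,
    }
    return json.dumps(record, sort_keys=True)


QUERY = "search source=*access_log* | timechart count"
interactive = log_search("salspaugh", QUERY, Source.SEARCH_BAR)
reloaded = log_search("salspaugh", QUERY, Source.DASHBOARD_RELOAD)
```

With the extra field, the same query issued interactively and by a dashboard reload produces two distinguishable records, so later analysis can filter on `source`.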

Page 26: Better Logging to Improve Interactive Data Analysis Tools

“...the log files do not differentiate between Show Me and Show Me Alternatives. These commands are implemented with the same code and the log entry is generated when the command is successfully executed.”

Visualization recommendation in Tableau’s Show Me.

Page 27: Better Logging to Improve Interactive Data Analysis Tools

Record the metadata and statistics of the data set being analyzed.

Recommendation #3
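As one illustration of this recommendation, the Python sketch below summarizes a small data set by column type and a few basic statistics, so that the summary could be logged next to each user action. The function name `summarize_dataset`, the field names, and the rule for calling a column quantitative are assumptions made for this example.

```python
import statistics


def summarize_dataset(rows):
    """Return per-column type and simple statistics for a list of dict rows."""
    summary = {"row_count": len(rows), "columns": {}}
    columns = rows[0].keys() if rows else []
    for name in columns:
        values = [row[name] for row in rows]
        # Assumption: a column is quantitative only if every value is a number.
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
        if is_numeric:
            summary["columns"][name] = {
                "type": "quantitative",
                "min": min(values),
                "max": max(values),
                "mean": statistics.mean(values),
            }
        else:
            summary["columns"][name] = {
                "type": "categorical",
                "distinct": len(set(values)),
            }
    return summary


# Made-up rows, for illustration only.
rows = [
    {"bytes": 100, "host": "web1"},
    {"bytes": 200, "host": "web2"},
    {"bytes": 300, "host": "web1"},
]
metadata = summarize_dataset(rows)
```

The resulting dictionary is small enough to attach to an activity record, which keeps the action and the shape of the data it was applied to in the same place.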

Page 28: Better Logging to Improve Interactive Data Analysis Tools

Toy Example Influence Diagram

data → action

Toy Example Conditional Probability Table

P( action | data )

data                            scatter plot   bar chart
{categorical, categorical}      .001           .999
{categorical, quantitative}     .185           .815
{quantitative, quantitative}    .900           .100
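A table like the toy example could, in principle, be estimated from logs that record both the data types involved and the action taken. The Python sketch below uses simple relative frequencies; the function name `conditional_probability_table` and the observations are made up for illustration and are not the numbers or the method behind the slide.

```python
from collections import Counter, defaultdict


def conditional_probability_table(observations):
    """Estimate P(action | data) from (data, action) pairs by relative frequency."""
    counts = defaultdict(Counter)
    for data, action in observations:
        counts[data][action] += 1
    table = {}
    for data, action_counts in counts.items():
        total = sum(action_counts.values())
        table[data] = {
            action: n / total for action, n in action_counts.items()
        }
    return table


# Made-up log of (column types, chosen chart) pairs, for illustration only.
observations = (
    [(("quantitative", "quantitative"), "scatter plot")] * 9
    + [(("quantitative", "quantitative"), "bar chart")] * 1
    + [(("categorical", "categorical"), "bar chart")] * 4
)
cpt = conditional_probability_table(observations)
```

This only works if the log captures the data description alongside each action, which is the point of recording data set metadata and statistics.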

Page 29: Better Logging to Improve Interactive Data Analysis Tools

Wolfram Predictive Interface in Mathematica

Recommendation ranking based on the data

Initial recommendation ranking

Page 30: Better Logging to Improve Interactive Data Analysis Tools

Collect user goals and feedback.

Recommendation #4

Page 31: Better Logging to Improve Interactive Data Analysis Tools
Page 32: Better Logging to Improve Interactive Data Analysis Tools
Page 33: Better Logging to Improve Interactive Data Analysis Tools
Page 34: Better Logging to Improve Interactive Data Analysis Tools

Work towards a standard for logging data analysis activity records.

Recommendation #5

Page 35: Better Logging to Improve Interactive Data Analysis Tools

Conclusion

• Goal: improve interactive data exploration and analysis (IDEA): interfaces, recommender systems, task guidelines, predictive suggestions

• Problem: need better data to mine

• Recommendations for logging IDEA activity

• When you build your next system for IDEA, will you consider how you log user activity?