
Ad Hoc Data: From Uggh to Smug

David Walker

Princeton University

00000000: 9192 d8fb 8480 0001 05d8 0000 0000 0872 ...............r
00000010: 6573 6561 7263 6803 6174 7403 636f 6d00 esearch.att.com.
00000020: 00fc 0001 c00c 0006 0001 0000 0e10 0027 ...............'
00000030: 036e 7331 c00c 0a68 6f73 746d 6173 7465 .ns1...hostmaste
00000040: 72c0 0c77 64e5 4900 000e 1000 0003 8400 r..wd.I.........
00000050: 36ee 8000 000e 10c0 0c00 0f00 0100 000e 6...............
00000060: 1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00 ......linux.....
00000070: 0f00 0100 000e 1000 0c00 0a07 6d61 696c ............mail
00000080: 6d61 6ec0 0cc0 0c00 0100 0100 000e 1000 man.............


Ad Hoc Data is Everywhere

• Lots of data in databases ==> even more data that isn't
• Ad Hoc Data: sets of semi-structured data files for which standard data processing tools are unavailable
• Tasks: "getting the data into a database" (and other kinds of transformations), data cleaning, querying, editing, parsing...
• Troubles: error prone, limited documentation, evolving formats, huge volume, ...

Web Logs

Network Monitoring

Billing Info

Router Configs

Cosmology Data

Two New Systems

• Anne: A "Mark-up Language" for Ad Hoc Data [PLDI 2010]
  • with Qian Xi (Princeton)
• Forest: A Language for Specifying Environmental Assumptions
  • with Kathleen Fisher (AT&T)
  • Nate Foster (Princeton)
  • Kenny Zhu (Shanghai Jiao Tong University)

Anne: A Context-free Mark-up Language for Ad Hoc Data

[PLDI 2010]

Qian Xi

The Problem

What is the fastest, most reliable way to go from data like this:

207.136.97.49 - - "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
polux.entelchile.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
...

To a parse tree like this:

(Figure: parse tree with nodes EntryList, Entry, IP, Message, Sort, URL, Protocol, Code, Size over the leaves 207.136.97.49, GET, /turkey/amnty1.gif, HTTP/1.0, 200, 3013.)

And generate documentation (a grammar) and tools such as a parser, printer, query engine, editor, XML converter, ...
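For comparison, this is roughly what the hand-written route looks like for the sample lines above. It is a sketch only: the regular expression and the field names `host`, `ident`, and `user` are assumptions made for illustration, while `sort`, `url`, `protocol`, `code`, and `size` follow the labels in the parse tree.

```python
import re

# Hypothetical hand-written pattern for one web-log entry.
ENTRY = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'"(?P<sort>[A-Z]+) (?P<url>\S+) (?P<protocol>\S+)" '
    r'(?P<code>\d+) (?P<size>\d+|-)'
)

def parse_entry(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = ENTRY.fullmatch(line.strip())
    return m.groupdict() if m else None

def parse_entry_list(text):
    """Parse every non-empty line; lines that do not match come back as None."""
    return [parse_entry(line) for line in text.splitlines() if line.strip()]
```

Every format change means editing the pattern by hand, which is the maintenance burden the talk is aiming to remove.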

Our Solution: Anne

• Develop a "mark-up language" for ordinary text
  • programmers annotate raw text using a set of "grammatical directives"
  • a simple, predictable algorithm generates a complete grammar & processing tools from directives + the surrounding raw data

Pros:
• really easy to use
  • directives are simple -- applied when & where needed
  • you can do it at 3am
• predictable
• documentation and tools may be generated automatically

Cons:
• not completely automatic
  • but I'm skeptical any other more magical bullet exists anyway

Document:

207.136.97.49 - - "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - amnesty "GET /members/afreport.html HTTP/1.0" 200 450

Generated Grammar:

Document:

{Entry:207.136.97.49 - - "GET /turkey/amnty1.gif HTTP/1.0" 200 3013}
207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - amnesty "GET /members/afreport.html HTTP/1.0" 200 450

Generated Grammar:

Entry ::= int . int . int . int ' ' - ' ' - ' ' '"' word ... int ' ' int

• Edit document to add directives
• Default tokenization of tagged data
• Non-terminal name drawn from directive
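The default tokenization step can be pictured with a small sketch. The token classes below (digit runs are `int`, letter runs are `word`, everything else is a quoted literal) are an assumption for illustration and are not Anne's actual tokenizer; the function names are hypothetical.

```python
import re

# Assumed token classes: runs of digits are "int", runs of letters are "word",
# and every other single character becomes a quoted literal.
TOKEN = re.compile(r"(?P<int>\d+)|(?P<word>[A-Za-z]+)|(?P<lit>.)", re.S)

def default_rule(name, text):
    """Build one grammar rule from the token shape of the tagged text."""
    parts = []
    for m in TOKEN.finditer(text):
        parts.append(repr(m.group()) if m.lastgroup == "lit" else m.lastgroup)
    return f"{name} ::= " + " ".join(parts)

def rule_from_directive(tagged):
    """Handle a single flat directive of the form {Name:text}."""
    m = re.fullmatch(r"\{(\w+):(.*)\}", tagged, re.S)
    if m is None:
        raise ValueError("not a {Name:text} directive")
    # The non-terminal name is drawn from the directive itself.
    return default_rule(m.group(1), m.group(2))
```

Unlike the slide, this sketch quotes every literal (including `.` and `-`) and does not abbreviate the middle of the rule with `...`.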

Document:

{Entry:207.136.97.49 - {ID:-} "GET /turkey/amnty1.gif HTTP/1.0" 200 3013}
207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - amnesty "GET /members/afreport.html HTTP/1.0" 200 450

Generated Grammar:

ID ::= '-'
Entry ::= int . int . int . int ' ' - ' ' ID ' ' '"' word ... int ' ' int

• Second directive
• New grammar rule
• Default grammar now includes new non-terminal

Document:

{Entry:207.136.97.49 - {ID:-} "GET /turkey/amnty1.gif HTTP/1.0" 200 3013}
207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - {ID:amnesty} "GET /members/afreport.html HTTP/1.0" 200 450

Generated Grammar:

ID ::= '-' + word
Entry ::= int . int . int . int ' ' - ' ' ID ' ' '"' word ... int ' ' int

• Multiple identical name occurrences imply union of grammars
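The union rule can be sketched as follows, assuming the same illustrative token classes as before (digits, letters, quoted literals) and handling only innermost, non-nested directives. The names `shape` and `union_rules` are hypothetical; `+` is used for union as on the slide.

```python
import re

DIRECTIVE = re.compile(r"\{(\w+):([^{}]*)\}")  # innermost, non-nested directives only
TOKEN = re.compile(r"(?P<int>\d+)|(?P<word>[A-Za-z]+)|(?P<lit>.)", re.S)

def shape(text):
    """Token shape of a tagged fragment, e.g. 'amnesty' -> 'word'."""
    return " ".join(
        repr(m.group()) if m.lastgroup == "lit" else m.lastgroup
        for m in TOKEN.finditer(text)
    )

def union_rules(doc):
    """Collect every occurrence of each name and union the distinct shapes."""
    alternatives = {}
    for name, body in DIRECTIVE.findall(doc):
        alts = alternatives.setdefault(name, [])
        s = shape(body)
        if s not in alts:  # identical shapes add nothing to the union
            alts.append(s)
    return {name: " + ".join(alts) for name, alts in alternatives.items()}
```

Tagging `-` on one line and `amnesty` on another as `ID` yields the two-alternative rule shown in the generated grammar.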

Document:

{Entry:207.136.97.49 - {ID:-} "{=GET} /turkey/amnty1.gif HTTP/1.0" 200 3013}
207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - {ID:amnesty} "GET /members/afreport.html HTTP/1.0" 200 450

Generated Grammar:

ID ::= '-' + word
Entry ::= int . int . int . int ' ' - ' ' ID ' ' '"' 'GET' ... int ' ' int

• = denotes presence of constant string

Document:

{Entry:{Loc$:207.136.97.49} - {ID:-} "{=GET} /turkey/amnty1.gif HTTP/1.0" 200 3013}
207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - {ID:amnesty} "GET /members/afreport.html HTTP/1.0" 200 450

Generated Grammar:

Loc ::= {[^ ]*}
ID ::= '-' + word
Entry ::= Loc ' ' - ' ' ID ' ' '"' 'GET' ... int ' ' int

• $ directs the system to infer a terminating symbol
• a space follows the closing brace
• any string terminated by a space
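One way to read the `$` directive is "take the character right after the closing brace as the terminator and accept anything up to it". The sketch below follows that reading; it is an assumption about the behaviour described on the slide, and `infer_terminators` is a hypothetical name.

```python
import re

# {Name$:text} followed by one more character, which is taken as the terminator.
DOLLAR = re.compile(r"\{(\w+)\$:([^{}]*)\}(.)", re.S)

def infer_terminators(doc):
    """Map each $-tagged name to a compiled pattern 'any run not containing the terminator'."""
    rules = {}
    for name, _body, terminator in DOLLAR.findall(doc):
        rules[name] = re.compile("[^" + re.escape(terminator) + "]*")
    return rules
```

With the tagged line above, `Loc` is followed by a space, so the inferred pattern also accepts host names such as `polux.entel.net` that the default `int . int . int . int` shape would reject.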

Interjection: The Config File

default.config:

def db [0-9][0-9]
def zone [+-][0-1][0-9]00
def ampm am\|AM\|pm\|PM
def trip [0-9][0-9][0-9]\|[0-9][0-9]\|[0-9]
...

exp Time {db}:{db}:{db}\([ ]*{ampm}\)?\([ \t]+{zone}\)?

exp IP {trip}\.{trip}\.{trip}\.{trip}

• A config file provides a mechanism for defining regular expressions and giving them names
  • def is an internal definition
  • exp is an exported named regular expression
• The default config file provides regular expressions for common systems data (IP, dates, times, URL, email, ... )
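To make the def/exp distinction concrete, here is a sketch of a loader for the fragment shown above. It assumes the config uses backslash-escaped `\|`, `\(`, and `\)` as operators and `{name}` to reference an earlier definition; the translation to Python `re` syntax and the function names are assumptions, and the lines elided by `...` on the slide are simply left out.

```python
import re

# Only the definitions visible on the slide; the elided ones are omitted.
CONFIG = r"""
def db [0-9][0-9]
def zone [+-][0-1][0-9]00
def ampm am\|AM\|pm\|PM
def trip [0-9][0-9][0-9]\|[0-9][0-9]\|[0-9]
exp Time {db}:{db}:{db}\([ ]*{ampm}\)?\([ \t]+{zone}\)?
exp IP {trip}\.{trip}\.{trip}\.{trip}
"""

def to_python(pattern):
    """Translate the escaped operators to Python's bare | ( ) forms."""
    return pattern.replace(r"\|", "|").replace(r"\(", "(").replace(r"\)", ")")

def load_config(text):
    """Return only the exported (exp) names, each as a compiled pattern."""
    defs, exported = {}, {}
    for line in text.strip().splitlines():
        kind, name, body = line.split(None, 2)
        # Inline earlier definitions, grouped so alternation stays local.
        body = re.sub(r"\{(\w+)\}",
                      lambda m: "(?:" + defs[m.group(1)] + ")",
                      to_python(body))
        defs[name] = body
        if kind == "exp":
            exported[name] = re.compile(body)
    return exported
```

Internal names such as `db` and `trip` are usable inside later definitions but never appear in the returned mapping, which mirrors the def/exp split.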

Document:

{Entry:{IP:207.136.97.49} - {ID:-} "{=GET} /turkey/amnty1.gi .... 200 3013}
207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - {ID:amnesty} "GET /members/afreport.html HTTP/1.0" 200 450

Generated Grammar:

IP ::= ... from config file ...
ID ::= '-' + word
Entry ::= IP ' ' - ' ' ID ' ' '"' 'GET' ... int ' ' int

• pre-defined token
• Definition drawn from config file

XML Generation & Debugging

Other Features

• Most features inspired by similar constructs found in PADS
  • Enumerations
  • Recursion (context-freedom)
  • Kleene Star (with optional element definitions, separators, and terminators)
  • Options
  • Prioritized Unions
  • Assertions
  • Tables
• Generated Artifacts:
  • PADS description (and from there, the PADS tool suite)
  • XML & CSS for debugging
• Semantics: connections to Relevance Logic [see PLDI 10]

Repetition (1)

{Record*[|]:9152271|9152271|1|0|0|0|0|1}

Elem ::= int
Record ::= (Elem ('|' Elem)* )?

• Kleene Star with elements separated by '|' and defined by first element

Repetition (2)

{Record/Item*[|]:9152271|{Item:9152271}|1|0|0|0|0|1}

Item ::= int
Record ::= (Item ('|' Item)* )?

• Kleene Star with elements separated by '|' and defined by Item
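A parser for the rule `Record ::= (Item ('|' Item)* )?` is short enough to write out. This is a sketch of what a generated parser would have to do, not Anne's output; `parse_separated` is a hypothetical helper that takes the element parser as an argument.

```python
def parse_separated(text, separator, parse_item):
    """Parse (Item (sep Item)*)? : an empty string is the empty record."""
    if text == "":
        return []
    return [parse_item(field) for field in text.split(separator)]
```

Calling `parse_separated("9152271|9152271|1|0|0|0|0|1", "|", int)` parses the record from the slide; a non-integer field raises `ValueError` from `int`, which plays the role of a parse error here.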

Assertions & Context-Freedom

{Parens?:({Parens!:(((())))})}

Parens ::= ('(' Parens ')')?

• ! claims underlying data will satisfy nonterminal Parens

Optional Data

{Record/Item*[|]:9152271|{Item?:9152271}|1|0||0||1}

Item ::= int?
Record ::= (Item ('|' Item)* )?

• ? denotes optional data
• missing elements
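Both rules on these slides can be checked with a few lines of recursive descent. The sketch below is an illustration of the grammars as written, under the assumption that a missing optional item is represented as `None`; the function names are hypothetical.

```python
def match_parens(s, i=0):
    """Parens ::= ('(' Parens ')')? ; return the index just past the match starting at i."""
    if i < len(s) and s[i] == "(":
        j = match_parens(s, i + 1)          # recursive (context-free) step
        if j < len(s) and s[j] == ")":
            return j + 1
    return i                                # the empty alternative

def is_parens(s):
    """True when the whole string is derived from Parens."""
    return match_parens(s) == len(s)

def parse_optional_record(text):
    """Item ::= int? with '|' separators; empty fields become None."""
    if text == "":
        return []
    return [int(field) if field else None for field in text.split("|")]
```

Note that the grammar only generates nested parentheses, so `()()` is rejected even though it is balanced.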

Table (1)

{E#:Jason Blake, 78 25 38 63 -2
Alexei Ponikarovsky, 82 23 38 61 6
...}

Row ::= Word ' ' Word ',' '\t' int ...
Record ::= Row (NL Row)*

Table (2)

{E#h:Name GP Goals Assists Points +/-
Jason Blake, 78 25 38 63 -2
Alexei Ponikarovsky, 82 23 38 61 6
...}

Row ::= ...
Header ::= 'Name' '\t' ...
Record ::= Header NL Row*
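As a rough picture of what the table rules describe, the sketch below parses the sample rows with an optional header line. It assumes one row per line, a comma after the player name, and whitespace-separated integers after it; `parse_table` is a hypothetical name and this is not the parser Anne generates.

```python
def parse_table(text, has_header=False):
    """Return (header, rows); header is None when the table has no header line."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines.pop(0).split() if has_header else None
    rows = []
    for line in lines:
        name, numbers = line.split(",", 1)
        rows.append((name.strip(), [int(n) for n in numbers.split()]))
    return header, rows
```

The `has_header` flag corresponds to the difference between the `#` and `#h` directives on the two slides.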

Forest: A Specification Language for Environmental Assumptions

[work in progress!]

Nate Foster

Kenny Zhu

Kathleen Fisher

PADS Web Site

Various causes for errors:
• Missing files
• Directories/files in wrong locations
• Wrong permissions
• Links to wrong targets

If only we could...

• Describe required file and directory structure, including permissions, etc.
• Check that the actual file system matches the spec.
• Eliminate a whole class of errors!
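The describe-then-check idea can be sketched in a few lines, under a made-up spec format: a directory is a dict from names to specs, a file is `("file", mode)`, and a symbolic link is `("link", target)`. None of this is Forest syntax; it only illustrates the class of errors listed earlier (missing files, wrong permissions, links to wrong targets).

```python
import os
import stat

def check(path, spec, errors=None):
    """Compare the file system at `path` against `spec` and collect mismatches."""
    errors = [] if errors is None else errors
    if isinstance(spec, dict):                       # directory
        if not os.path.isdir(path):
            errors.append(f"missing directory: {path}")
            return errors
        for name, child in spec.items():
            check(os.path.join(path, name), child, errors)
    elif spec[0] == "file":                          # ("file", mode)
        if not os.path.isfile(path):
            errors.append(f"missing file: {path}")
        elif stat.S_IMODE(os.stat(path).st_mode) != spec[1]:
            errors.append(f"wrong permissions: {path}")
    elif spec[0] == "link":                          # ("link", target)
        if not os.path.islink(path):
            errors.append(f"missing link: {path}")
        elif os.readlink(path) != spec[1]:
            errors.append(f"link to wrong target: {path}")
    return errors
```

For example, `check(root, {"doc": {"index.html": ("file", 0o644)}})` returns an empty list when `root/doc/index.html` exists with mode 644, and a list of messages otherwise.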

CORAL Monitoring System

• Monitoring system for an "Internet-scale, self-organizing, web-content distribution network" developed by Mike Freedman, Princeton.

Observations on Monitoring

• Coral is similar to other monitoring systems: PlanetLab and a multitude of systems at AT&T.
• Often a configuration file specifies which hosts to monitor, what data to collect, and how often.
• File and directory names encode meta-data.
• Want to ask questions such as:
  • what was the total load on planetlab1 last week?
  • on what days and at what times are files missing?
  • what is the maximum memory usage?
• Answering questions requires formulating queries both in terms of the contents of files and the structure of the file system (directory names, file names)
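A question like "on what days are files missing?" is answered from path names alone. The sketch below assumes a `host/YYYY-MM-DD/file` layout and ISO dates in directory names; both are assumptions for illustration, as are the file names used in the usage note.

```python
from collections import defaultdict
from datetime import date, timedelta

def missing_by_day(paths, host, start, days, expected):
    """Report, per day, which expected file names are absent for `host`.

    paths    -- iterable of 'host/YYYY-MM-DD/name' strings (assumed layout)
    start    -- first day to inspect (datetime.date)
    days     -- number of consecutive days to inspect
    expected -- set of file names that should exist every day
    """
    seen = defaultdict(set)
    for p in paths:
        h, day, name = p.split("/")      # meta-data is encoded in the path
        if h == host:
            seen[day].add(name)
    report = {}
    for offset in range(days):
        day = (start + timedelta(days=offset)).isoformat()
        missing = set(expected) - seen[day]
        if missing:
            report[day] = sorted(missing)
    return report
```

With hypothetical files `load.log` and `mem.log`, a call such as `missing_by_day(paths, "planetlab1", date(2010, 6, 1), 7, {"load.log", "mem.log"})` lists the days in that week on which either file is absent.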

Other Possible Examples

• Filesystem Hierarchy Standard (FHS) for Unix-like installations
• Haskell code base, PADS Source Tree
  • source code, data, examples, executables, ...
• Cabal system for GHC libraries
• Disk cache for browser history, IMAP mail
• Scientific data sets
• CVS, SVN, other source control systems

To Do!

• We need a language not just for specifying the contents (formats) of ad hoc data files but also for the structure of file system fragments
  • specify files
  • directory structure
  • dependencies (config files determine file system structure)
  • meta-data (permissions, sizes, owners, modification times)
• The Plan
  • Build such a specification language on top of PADS
  • Generate a checker from the specifications
  • Interface that allows programs to slurp up specified data from the file system
  • Stand-alone tools: query engine, monitor, etc...

Back to CORAL

Example: CORAL

ptype conf_t = ... {- pads description -}
ptype corald_t = ... {- pads description -}
ptype dns_t = ... {- pads description -}
ptype web_t = ... {- pads description -}
ptype probe_t = ... {- pads description -}

Example: CORAL

ptype conf_t = ... {- pads description -}
ptype corald_t = ... {- pads description -}
ptype dns_t = ... {- pads description -}
ptype web_t = ... {- pads description -}
ptype probe_t = ... {- pads description -}

ptype date_d(t::pdate) = pdirectory {
  corald   is "corald.log" :: corald_t <| timestamp >= t |>;
  coraldns is "nssrv.log"  :: dns_t    <| timestamp >= t |>;
  coralweb is "websrv.log" :: web_t    <| timestamp >= t |>;
  probe    is "probed.log" :: probe_t  <| timestamp >= t |>;
  time :: pdate = t;
}

Example: CORAL

ptype conf_t = ... {- pads description -}
ptype corald_t = ... {- pads description -}
ptype dns_t = ... {- pads description -}
ptype web_t = ... {- pads description -}
ptype probe_t = ... {- pads description -}

ptype date_d(t::pdate) = pdirectory { ... as before ... }

ptype host_d = pdirectory {
  times is [t::date_d(t) | t <- pdate];
}

Example: CORAL

ptype conf_t = ... {- pads description -}
ptype corald_t = ... {- pads description -}
ptype dns_t = ... {- pads description -}
ptype web_t = ... {- pads description -}
ptype probe_t = ... {- pads description -}

ptype host_d(h::phostname, t::pdate) = pdirectory { ... as before ... }

ptype host_d () = pdirectory {
  hosts is [t::date_d(t) | t <- pdate];
}

ptype coral_d () = pdirectory {
  hostNames is "Config" :: conf_t;
  hosts is [h::host_d | h <- hostNames];
}
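Read operationally, the description says: the `Config` file names the hosts, each host has a directory of dated subdirectories, and each dated subdirectory holds four log files. The sketch below checks exactly that and nothing more. It assumes `Config` lists one host name per line, treats every entry under a host directory as a date directory, and ignores the `timestamp >= t` constraints; `check_coral` is a hypothetical name, not part of Forest.

```python
import os

# File names taken from the date_d description.
LOGS = ["corald.log", "nssrv.log", "websrv.log", "probed.log"]

def check_coral(root):
    """Return (host, date, problem) triples for root/Config and root/<host>/<date>/<log>."""
    problems = []
    with open(os.path.join(root, "Config")) as f:
        hosts = [line.strip() for line in f if line.strip()]  # assumed: one host per line
    for host in hosts:                      # the config file determines the expected structure
        host_dir = os.path.join(root, host)
        if not os.path.isdir(host_dir):
            problems.append((host, None, "missing host directory"))
            continue
        for day in sorted(os.listdir(host_dir)):
            for log in LOGS:
                if not os.path.isfile(os.path.join(host_dir, day, log)):
                    problems.append((host, day, f"missing {log}"))
    return problems
```

The dependency runs from file contents to directory structure: changing `Config` changes which directories the checker expects, which is the kind of dependency the description language is meant to express.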

Current & Future Plans

• Designing a semantics based on a classical logic of trees
  • We considered using one of the substructural ("separating") tree logics but we discarded it as the substructural logics gave us the wrong defaults & made the system harder to design and understand (especially in the presence of parent pointers)
• Building a "file system parser" & tool generation infrastructure in Haskell
  • Leverage type-directed programming.
  • Leverage laziness in loading structures.
• Envision a collection of file system management tools based on descriptions
  • valid -desc d -- check for conformance to d
  • ls -desc d -- list files described by d
  • grep pattern -desc d -- grep for pattern in files described by d
  • mv -desc d foo bar -- move files described by d rooted at foo to bar
• Thinking about a query engine & continuous monitoring system
• Considering extensions to handle other elements of the programming environment: environment variables

The End