Doing data science with Clojure @sbelak [email protected]

Design constraints

The analytics chasm

• < 2 min: ideal. Almost real-time, can be done during brainstorming without disrupting flow

• < 20 min: squeeze in somewhere in the day

• fail

• project: roadmap ahoy!

Think in distributions, not numbers
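
One way to read this slide: report the shape of the data rather than a single average. The following is a minimal sketch in plain Clojure, assuming a nearest-rank quantile definition. The `quantile` and `summary` helpers and the sample numbers are hypothetical and not taken from the talk or from any library it mentions.

```clojure
(defn quantile
  "Nearest-rank quantile of xs for 0 <= q <= 1 (hypothetical helper)."
  [q xs]
  (let [sorted (vec (sort xs))
        n      (count sorted)
        ;; nearest rank: the ceil(q * n)-th smallest value, clamped to a valid index
        idx    (-> (Math/ceil (* q n)) long dec (max 0) (min (dec n)))]
    (nth sorted idx)))

(defn summary
  "Describe a collection of numbers by several points of its distribution."
  [xs]
  {:mean   (double (/ (reduce + xs) (count xs)))
   :min    (quantile 0.0 xs)
   :q1     (quantile 0.25 xs)
   :median (quantile 0.5 xs)
   :q3     (quantile 0.75 xs)
   :max    (quantile 1.0 xs)})

;; Made-up data: one outlier drags the mean far away from the typical value.
(def daily-orders [12 15 14 13 16 15 14 250])

(summary daily-orders)
```

On this sample the mean is 43.625 while the median is 14, which is the kind of gap a single number hides.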

No throwaways

Sharing results

• Have one canonical version that is always current.

• Concentrate discussion in one place and make it searchable and persistent.

• Include methodology (=code).

The environment

REPL vs. notebook

REPL + notebook

(hacked) gorilla-repl.org + auto-refresh + hypothes.is

#alderaan #sales #growth

• Code hidden, but can be expanded

• Questions, comments, & annotations

• Shareable

• Periodically re-run to keep it fresh

• #alderaan #sales #growth: discoverability

Wishlist/TODO

• Better editor (shaunlebron.github.io/parinfer/ ?)

• Embedded REPL

• Better exception reporting

• Browsable data structures

(tried and miserably failed: org-babel)

The tools

Data frame

• Data tends to be heterogeneous

• Clojure excels at structure manipulation/encoding

github.com/sbelak/huri

• No data structures, just functions over collections

• Composable (even DSLs — no macros!)

• Reasonably fast (transducers <3)

• Do-what-I-mean (auto-sort, liberal with inputs, …)

• Minimal buy-in

• Support reaching into nested structures everywhere
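
The bullets above can be illustrated with plain `clojure.core`. This is a sketch of the style only and deliberately does not use Huri's API: the `sum-where` and `total-by` helpers, the key names and the sample rows are all hypothetical. It assumes the data frame is a vanilla vector of maps and that columns are addressed by key paths, so nested fields work the same way as top-level ones.

```clojure
;; Plain clojure.core only; NOT Huri's API, just the style the bullets describe.
(def orders
  "A data frame as a vanilla vector of maps (made-up sample data)."
  [{:customer {:country "SI"} :amount 120}
   {:customer {:country "DE"} :amount 80}
   {:customer {:country "SI"} :amount 30}
   {:customer {:country "DE"} :amount 200}])

(defn sum-where
  "Sum the values at value-path over rows whose value at key-path equals v.
  Filtering and extraction are fused into one pass by a transducer."
  [key-path v value-path rows]
  (transduce (comp (filter #(= v (get-in % key-path)))
                   (map #(get-in % value-path)))
             +
             rows))

(defn total-by
  "Map from each distinct value at key-path to the sum of value-path."
  [key-path value-path rows]
  (reduce (fn [acc row]
            (update acc
                    (get-in row key-path)
                    (fnil + 0)
                    (get-in row value-path)))
          {}
          rows))

(sum-where [:customer :country] "SI" [:amount] orders)
(total-by [:customer :country] [:amount] orders)
```

Because the rows are ordinary maps, anything else in the ecosystem that consumes collections can use them without conversion, which is one reading of "minimal buy-in".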

composable data structure based DSLs

->> and partial friendly

Support reaching into nested structures everywhere

vanilla vector of maps

interoperability

Provide curried versions where possible

Composability is key to quick iterating

• Provide curried versions where possible

• ->> and partial friendly

• encode computation in structure (comp, some-fn, every-pred, data structure based DSLs, …)

• consistent API
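
A short sketch of what these bullets can look like in practice, using only `clojure.core` (`comp`, `partial`, `some-fn`, `every-pred`, `->>`). The `deals` data and the predicate names are hypothetical and chosen for illustration.

```clojure
(def deals
  "Made-up sample data."
  [{:owner "ana"  :stage :won  :value 900}
   {:owner "bor"  :stage :lost :value 400}
   {:owner "ana"  :stage :open :value 1500}
   {:owner "cene" :stage :won  :value 100}])

;; Small predicates built by composition; sets act as membership functions.
(def closed? (comp #{:won :lost} :stage))
(def open?   (comp #{:open} :stage))
(def big?    (comp (partial <= 500) :value)) ; true when value >= 500

;; Computation encoded in structure: (closed AND big) OR open.
(def interesting? (some-fn (every-pred closed? big?) open?))

;; Taking the collection as the last argument keeps helpers
;; usable with both partial and ->>.
(def total-value (partial transduce (map :value) +))

(->> deals
     (filter interesting?)
     total-value)
```

Each piece stays a plain function or value, so swapping `interesting?` for another predicate or reusing `total-value` elsewhere needs no changes to the pipeline.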

Catching errors early ⇒ more context ⇒ easier debugging ⇒ faster iterating

<3 Bret Victor

Q: What about machine learning?

A: farm it out to sklearn

huri.plot

• DSL on top of ggplot2 (via gg4clj)

• Targets Gorilla REPL

• Follows the rest of Huri’s design philosophy

• bar chart, scatter plot, line chart, box & violin plot, heatmap, histogram

Wishlist/TODO

• (even) better structure manipulation (via Specter?)

• Interactive plots

• More transducer-compatible (online) math functions

• Optimizing ->> (rewrite code on the fly to do more with transducer composition)
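
To make "transducer-compatible (online) math functions" concrete, here is a minimal sketch of a running mean written as a reducing function, so it can be passed straight to `transduce`. The name `online-mean` is hypothetical and this is not claimed to be how Huri implements it. It assumes the incremental update mean' = mean + (x - mean) / n.

```clojure
(defn online-mean
  "Reducing function that computes a mean in a single pass.
  The three arities follow the transduce contract: init, completion, step."
  ([] [0 0.0])                                ; state is [count mean]
  ([[_ mean]] mean)                           ; completion: unwrap the mean
  ([[n mean] x]
   (let [n' (inc n)]
     [n' (+ mean (/ (- x mean) n'))])))       ; incremental update

;; Composes with any transducer; no intermediate collection is built.
(transduce (map :amount)
           online-mean
           [{:amount 10} {:amount 20} {:amount 60}])
```

One design consequence of this sketch is that an empty input yields the initial mean of 0.0; a real implementation might prefer to return nil in that case.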

Projects worth keeping an eye on

github.com/thi-ng/geom

github.com/yieldbot/vizard

zeppelin-project.org

github.com/aphyr/tesser

github.com/nathanmarz/specter

Questions

@sbelak

github.com/sbelak/huri