
Page 1: Docker for data science

Docker for Data Science
Down with package managers, up with docker

Calvin Giles - [email protected] - @calvingiles

Page 2: Docker for data science

Who knows what docker is?

Who uses docker?

Page 3: Docker for data science

Who am I?
- A Physicist - MPhys from University of Southampton
- A Data Scientist at Adthena
- A PyData meetup and conference co-organiser
- Data Science Advisor for untangleconsulting.io
- Programming in python for nearly 10 years
- Using docker for 3 months

Page 4: Docker for data science

Who am I not?
- A computer scientist
- DevOps
- A docker expert
- A docker contributor

Page 5: Docker for data science

My Problem
I maintained a dev_setup.md document. It was 450 lines long and growing.

Page 6: Docker for data science

It got worse

Ruby
A project requiring ruby wasn't supported by MacPorts. I would have to install outside of my package manager.

Page 7: Docker for data science

Things started to break.

Page 8: Docker for data science

Then I decided to contribute to sklearn.

Followed by many build errors.

The quickest solution - disable macports.

$ git clone git://github.com/scikit-learn/scikit-learn.git
$ python setup.py build_ext --inplace

Page 9: Docker for data science

When I re-enabled MacPorts, it was never the same again.

Page 10: Docker for data science

With my environment in tatters...
...and faced with re-installing from scratch, I decided there must be a better way.

What about:

Homebrew
Boxen
virtualenv
anaconda
npm, rpm
vagrant, chef, puppet, ansible
VirtualBox, VMware Fusion
Docker, CoreOS Rocket
fig, dokku, flynn, deis

Surely one of these would help?

Page 11: Docker for data science

What do I want in a solution?
- Trivial to wipe the slate clean and recreate
- Portable (home laptop env == work laptop state)
- Easy to share
- Configure once, use everywhere: remote databases, servers etc.; customisation (sublime, .vimrc, .bashrc etc.); installation quirks
- No system-wide backup required
- Compatible with deployment to servers
- OS X-centric

Page 12: Docker for data science

Introducing Docker!
- boot2docker - a single virtual machine running on VirtualBox (OS X or Windows)
- docker daemon running on the boot2docker OS
- docker containers running in partial isolation inside the same boot2docker virtual machine
- docker client running on the host (OS X) to simplify the issuing of commands
- docker images as templates for containers

docker images -> .iso-style templates

docker containers -> lightweight virtual machines intended to run just one process each
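
The split shows up directly in the CLI - images and containers are listed separately:

$ docker images  # the templates you have pulled or built
$ docker ps      # containers currently running
$ docker ps -a   # include stopped containers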

Page 13: Docker for data science

What do you get with docker?
- Run multiple environments independently
- Run services independently of environments, e.g. databases
- Permit an environment to interact with a specific subset of the host files
- Share a pool of resources between all environments: a single container can consume 100% of CPU, RAM and HDD, with quotas for when resources are busy (see the example below)
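
Per-container limits can be set at run time with standard docker run flags (the values here are illustrative):

$ docker run -it --rm -m 512m -c 512 ipython/ipython ipython
# -m caps memory at 512MB; -c sets relative CPU shares (default 1024)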

Page 14: Docker for data science

What can be problematic
- Trust - processes are given read-write access to your files; stick to trusted builds and automated builds (not really different to installing any software)
- Resources are limited to the VM allocation
- Lots to learn
- Managing containers (starting, stopping etc.)

Page 15: Docker for data science

Get docker
- boot2docker .dmg or .exe
- apt-get install docker.io
- ...

See https://docs.docker.com/installation/

Initialise:

$ boot2docker init
$ boot2docker start
$ $(boot2docker shellinit)
$ docker login

Page 16: Docker for data science

What can you do?
Start an ipython shell:

$ docker run -it --rm ipython/ipython ipython

Page 17: Docker for data science

What can you do?
Run a python script in an ipython shell with the scipy stack:

$ docker run \
    -it --rm \
    -v $(pwd):/home \
    -w=/home \
    ipython/scipystack \
    ipython my-script.py

Page 18: Docker for data science

What can you do?
Run a notebook server:

$ docker run -e PASSWORD=MyPass -it --rm ipython/scipyserver

Page 19: Docker for data science

What can you do?
Convert a .ipynb file into reveal.js slides (like these) and serve them:

$ docker run \
    -i -t --rm \
    -p 8000:8000 \
    -v "$(pwd)":/slides \
    nbreveal \
    /convert_and_reveal.sh 'Docker for Data Science.ipynb'

Page 20: Docker for data science

What can you do?
Start a complete environment:

$ cd ~/fig/data-science-env
$ fig up -d

Page 21: Docker for data science

What do all these arguments do?

-d              run as daemon
-i -t --rm      run interactively and auto-remove on exit
-e              set an env variable
-p              map a port like -p host:container
-v              map a host volume into the container
--link          automatically link containers, particularly databases
-w              set the working directory
--volumes-from  map all the volumes from the named container
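
Putting several of these together (ipython-dev-env and dev-postgres are the image and container built on later slides; the volume path is illustrative):

$ docker run -d \
    -e PASSWORD=MyPass \
    -p 443:8888 \
    -v $(pwd):/notebooks \
    --link dev-postgres:dev-postgres \
    ipython-dev-env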

Page 22: Docker for data science

Where do containers come from?

Containers can be started using: docker run <image>.

Page 23: Docker for data science

Where do images come from?
- The trusted builds on docker hub (like ubuntu, postgres, node etc.)
- Open source providers with automated builds (like ipython, julia etc.)
- Public images uploaded in a built state (quite opaque)
- Private images (built locally or via docker login)

Page 24: Docker for data science

How do I build my own images?
Write a Dockerfile.

Either build and run it locally like:

$ docker build -t calvingiles/magic-image .
$ docker run calvingiles/magic-image

Or upload it to github and have the docker hub build it for you automatically:

$ git push

Wait for build...

$ docker run calvingiles/magic-image

Page 25: Docker for data science

What is this Dockerfile?

FROM ipython/scipyserver

MAINTAINER Calvin Giles <[email protected]>

# Create install folder
RUN mkdir /install_files

# Install postgres libraries and python dev libraries
# so we can install psycopg2 later
RUN apt-get update
RUN apt-get install -y libpq-dev python-dev

# install python requirements
COPY requirements.txt /install_files/requirements.txt
RUN pip2 install -r /install_files/requirements.txt
RUN pip3 install -r /install_files/requirements.txt

# Set the working directory to /notebooks
WORKDIR /notebooks

Page 26: Docker for data science

Components of a Dockerfile
- FROM: another image to build upon (ubuntu, debian, ipython...)
- RUN: execute a command in the container and write the results into the image
- COPY: copy a file from the build filesystem to the image
- WORKDIR: change the working directory (the container starts in the last WORKDIR)
- ENV: set an env variable
- EXPOSE: open up a port to linked containers and the host
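
ENV and EXPOSE don't appear in the Dockerfile on the previous page, so here is a minimal sketch of both (the variable name and value are illustrative):

FROM ipython/scipyserver
# Visible to every later RUN step and to the running container
ENV NOTEBOOK_DIR /notebooks
# Reachable from linked containers; map it to the host with -p
EXPOSE 8888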

Page 27: Docker for data science

So how do I actually use docker?
- Find an image to start your environment off (ubuntu, ipython/scipystack, rocker/rstudio)
- Create a Dockerfile containing only a FROM line:

  FROM ipython/scipystack

- build and run

Page 28: Docker for data science

Let's start with the ipython notebook server with scipystack:

$ echo 'FROM ipython/scipyserver' > Dockerfile
$ docker build -t ipython-dev-env .
$ docker run -i -t --rm -e PASSWORD=MyPass -p 443:8888 ipython-dev-env

Find your boot2docker ip:

$ boot2docker ip

Navigate to https://your-ip:443 and sign in with the PASSWORD.

Page 29: Docker for data science

How do I build on this?

Page 30: Docker for data science

Install an extra python module into a notebook server
Test the install of the package you want:

In [5]: !pip3 search gensim

gensim - Python framework for fast Vector Space Modelling

In [7]: !pip3 install gensim

Downloading/unpacking gensim
  Downloading gensim-0.10.3.tar.gz (3.1MB): 3.1MB downloaded
  Running setup.py (path:/tmp/pip_build_root/gensim/setup.py) egg_info for package gensim
    warning: no files found matching '*.sh' under directory '.'
    no previously-included directories found matching 'docs/src*'
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.3 in /usr/local/lib/python2.7/dist-packages (from gensim)
Requirement already satisfied (use --upgrade to upgrade): scipy>=0.7.0 in /usr/local/lib/python2.7/dist-packages (from gensim)
Requirement already satisfied (use --upgrade to upgrade): six>=1.2.0 in /usr/lib/python2.7/dist-packages (from gensim)
Installing collected packages: gensim
  Running setup.py install for gensim
    warning: no files found matching '*.sh' under directory '.'
    no previously-included directories found matching 'docs/src*'
    building 'gensim.models.word2vec_inner' extension
    x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/tmp/pip_build_root/gensim/gensim/models -I/usr/include/python2.7 -I/usr/local/lib/python2.7/dist-packages/numpy/core/include -c ./gensim/models/word2vec_inner.c -o build/temp.linux-x86_64-2.7/./gensim/models/word2vec_inner.o
    In file included from /usr/include/python2.7/numpy/ndarraytypes.h:1761:0,
                     from /usr/include/python2.7/numpy/ndarrayobject.h:17,
                     from /usr/include/python2.7/numpy/arrayobject.h:4,
                     from ./gensim/models/word2vec_inner.c:232:
    /usr/include/python2.7/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]

Page 31: Docker for data science

In [10]: import gensim

If that works, add the install commands to your Dockerfile:

FROM ipython/scipyserver
RUN pip2 install gensim
RUN pip3 install gensim

And rebuild:

$ docker build -t ipython-dev-env .
$ docker run -i -t --rm -e PASSWORD=MyPass -p 443:8888 ipython-dev-env

Page 32: Docker for data science

I want to use a requirements.txt file
Create requirements.txt:

$ echo 'gensim' >> requirements.txt

Dockerfile:

FROM ipython/scipyserver
COPY requirements.txt /requirements.txt
RUN pip2 install -r /requirements.txt
RUN pip3 install -r /requirements.txt

Page 33: Docker for data science

But what do I put in requirements.txt?

In [12]: !pip3 freeze | head

Cython==0.20.1post0
Jinja2==2.7.2
MarkupSafe==0.18
Pillow==2.3.0
Pygments==1.6
SQLAlchemy==0.9.8
Sphinx==1.2.2
brewer2mpl==1.4.1
certifi==14.05.14
chardet==2.0.1
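
To capture the whole environment rather than just the first ten lines, freeze straight into the file:

$ pip3 freeze > requirements.txt

(Pin only what you need, though - a full freeze locks every transitive dependency.)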

Page 34: Docker for data science

How do I install system libraries for MSSQL Server?
Create your odbcinst.ini, odbc.ini and freetds.conf files.

RUN apt-get update && apt-get -y install \
    unixodbc \
    unixodbc-dev \
    freetds-dev \
    tdsodbc
COPY freetds.conf /etc/freetds/freetds.conf
COPY odbcinst.ini /etc/odbcinst.ini
COPY odbc.ini /etc/odbc.ini

Page 35: Docker for data science

How do I install the PyODBC library from source?

RUN pip2 install https://pyodbc.googlecode.com/files/pyodbc-3.0.7.zip
RUN pip3 install https://pyodbc.googlecode.com/files/pyodbc-3.0.7.zip

Page 36: Docker for data science

How do I get a database?

$ docker run -d --name dev-postgres postgres
$ docker run -d \
    -e PASSWORD=MyPass \
    -p 443:8888 \
    --link dev-postgres:dev-postgres \
    ipython-dev-env

You will get the IP and PORTS to connect to as env variables in the ipython-dev-env container, as in the sketch below.
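
From inside the notebook, those env variables are enough to build a connection. A minimal sketch, assuming psycopg2 is installed and the stock postgres image defaults:

import os
import psycopg2  # assumed to be in your requirements.txt

# --link dev-postgres:dev-postgres injects these variables automatically
host = os.environ['DEV_POSTGRES_PORT_5432_TCP_ADDR']
port = os.environ['DEV_POSTGRES_PORT_5432_TCP_PORT']

conn = psycopg2.connect(host=host, port=port, user='postgres', dbname='postgres')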

Page 37: Docker for data science

What about my data?

$ docker run -d \
    -v "$HOME/Google Drive/data":/data \
    --name gddata \
    busybox echo
$ docker run -d \
    -e PASSWORD=MyPass \
    -p 443:8888 \
    --volumes-from gddata \
    ipython-dev-env
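
Inside the container the shared volume then behaves like any local directory. A sketch (the file name is hypothetical):

import pandas as pd

# /data is the volume mapped in via --volumes-from gddata
df = pd.read_csv('/data/my-dataset.csv')  # hypothetical file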

Page 38: Docker for data science

Help, I ran out of RAM

$ boot2docker stop
$ VBoxManage modifyvm boot2docker-vm --memory 5555
$ boot2docker start
$ boot2docker info
{ ... "Memory": 5555 ... }

(Stop the VM first - VirtualBox only applies memory changes while it is powered off.)

Page 39: Docker for data science

Git push?
- In Docker hub, create a new repository and select Automated Build
- Point it at your github or bitbucket repo
- Wait for the build to complete

$ docker pull calvingiles/data-science-environment
$ docker run calvingiles/data-science-environment

Page 40: Docker for data science

I seem to be running a lot of containers
Fig can help a lot with that.

- Install fig: fig.sh/install.html
- Create a fig.yml file specifying a set of containers to start (see the sketch below)
- fig up -d to begin
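
A minimal sketch of a fig.yml wiring together the notebook and database containers from the earlier slides (service names are illustrative):

notebook:
  image: ipython-dev-env
  environment:
    - PASSWORD=MyPass
  ports:
    - "443:8888"
  links:
    - db
db:
  image: postgres

fig up -d then starts (or recreates) both containers in one go.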

Page 41: Docker for data science

Is this all really better than before?
I use docker for 100% of my data science tasks.

I use docker for nearly everything else.

Page 42: Docker for data science

FROM ipython/scipyserver

MAINTAINER Calvin Giles <[email protected]>

# Create install folder
RUN mkdir /install_files

# Update aptitude with new repo
RUN apt-get update

# Install software
RUN apt-get install -y git

# Make ssh dir
RUN mkdir /root/.ssh/

## Authenticate with github
# Copy over private key, and set permissions
COPY id_rsa /root/.ssh/id_rsa
RUN chmod 600 /root/.ssh/id_rsa

# Create known_hosts
RUN touch /root/.ssh/known_hosts
# Add github key
RUN ssh-keyscan github.com >> /root/.ssh/known_hosts

## install pyodbc so we can talk to MS SQL
# install unixodbc and freetds
RUN apt-get -y install unixodbc unixodbc-dev freetds-dev tdsodbc
# configure Adthena database with read-only permissions
COPY freetds.conf.suffix /install_files/freetds.conf.suffix
RUN cat /install_files/freetds.conf.suffix >> /etc/freetds/freetds.conf
COPY odbcinst.ini /etc/odbcinst.ini
COPY odbc.ini /etc/odbc.ini

# Install pyodbc from source
RUN pip2 install https://pyodbc.googlecode.com/files/pyodbc-3.0.7.zip
RUN pip3 install https://pyodbc.googlecode.com/files/pyodbc-3.0.7.zip

Page 43: Docker for data science

# install python requirements
COPY requirements.txt /install_files/requirements.txt
RUN pip2 install -r /install_files/requirements.txt
RUN pip3 install -r /install_files/requirements.txt

# Clone wayside into the docker container
RUN mkdir -p /repos/wayside
WORKDIR /repos/wayside
RUN git clone [email protected]:Adthena/wayside.git .
RUN python2 setup.py develop
RUN python3 setup.py develop

# Get rid of ssh key from image now repos have been cloned
RUN rm /root/.ssh/id_rsa

# Put the working directory back to notebooks at the end
WORKDIR /notebooks

Page 44: Docker for data science

Sum up
- Find a base image
- Run a container and trial run your install steps
- Create a Dockerfile to perform those steps consistently

My environments
my public development environment - github.com/calvingiles/data-science-environment
my public docker images - hub.docker.com/u/calvingiles/

- docker run -it --rm calvingiles/<image>
- build upon with FROM calvingiles/<image>
- fork (in github) if you need things a little different

Page 45: Docker for data science

Thanks

[email protected]@calvingiles