Version control and open-source methodology in data science
Mislav Marohnić, software developer at GitHub
2018. 1. 25.

Topics

• What is version control,

• How has open source influenced software,

• How can this be relevant to researchers in data science.

Version control

version control outside of the software world

Version control terminology

• repository: project directory

• “checking in”: adding files

• commit: saving changes

• push/pull: syncing

• remote: this project elsewhere

• branch: isolated changes

• merge: combining changes

• fork: copying a project

• pull request: contributing changes
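
To make the local terms above concrete, here is a minimal Python sketch of a toy repository. It is an illustration only and not how git is implemented: the `Commit` and `Repository` classes, their method names, and the file contents are all hypothetical, and the merge is deliberately naive (the other branch wins on any conflict, and only one parent is recorded). Push/pull, remotes, forks, and pull requests are not modeled.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Commit:
    """A saved snapshot of the project's files, linked to its parent."""
    message: str
    files: Dict[str, str]
    parent: Optional["Commit"] = None


class Repository:
    """Toy repository: a branch is just a name pointing at a commit."""

    def __init__(self) -> None:
        self.branches: Dict[str, Optional[Commit]] = {"main": None}
        self.current = "main"
        self.staged: Dict[str, str] = {}

    def head(self) -> Optional[Commit]:
        return self.branches[self.current]

    def add(self, path: str, content: str) -> None:
        # "checking in": adding files
        self.staged[path] = content

    def commit(self, message: str) -> Commit:
        # commit: saving changes as a new snapshot on the current branch
        base = self.head().files if self.head() else {}
        new = Commit(message, {**base, **self.staged}, self.head())
        self.branches[self.current] = new
        self.staged = {}
        return new

    def branch(self, name: str) -> None:
        # branch: isolated changes, starting from the current commit
        self.branches[name] = self.head()
        self.current = name

    def switch(self, name: str) -> None:
        self.current = name

    def merge(self, other: str) -> Commit:
        # merge: combining changes (naive: the other branch wins conflicts)
        ours = self.head().files if self.head() else {}
        theirs = self.branches[other].files
        new = Commit(f"Merge {other}", {**ours, **theirs}, self.head())
        self.branches[self.current] = new
        return new


repo = Repository()
repo.add("analysis.R", "model <- lm(y ~ x)")
repo.commit("Add first model")
repo.branch("experiment")
repo.add("analysis.R", "model <- lm(y ~ x + z)")
repo.commit("Try a second predictor")
repo.switch("main")
repo.merge("experiment")
```

After the merge, `main` contains the change that was made in isolation on the `experiment` branch.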

What version control facilitates

• code storage & backups

• isolated environment (branches) to experiment with changes

• syncing work within the team

• project history

• tracking down software bugs

• release management

• continuous integration (CI)
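
As one illustration of how project history helps with tracking down software bugs, the sketch below binary-searches a list of commits for the first one where a check fails. This is the general idea behind `git bisect`, but the function name `first_bad_commit`, the commit labels, and the `is_bad` check are hypothetical. The sketch assumes the oldest commit is good, the newest is bad, and a bug stays present once introduced.

```python
from typing import Callable, Sequence


def first_bad_commit(history: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search an oldest-to-newest history for the first bad commit.

    Assumes history[0] is good, history[-1] is bad, and that every commit
    after the first bad one is also bad.
    """
    lo, hi = 0, len(history) - 1  # history[lo] is good, history[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(history[mid]):
            hi = mid  # the bug was introduced at mid or earlier
        else:
            lo = mid  # the bug was introduced after mid
    return history[hi]


# Hypothetical history in which the bug first appears in commit "c6".
history = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
broken_since = history.index("c6")
culprit = first_bad_commit(history, lambda c: history.index(c) >= broken_since)
```

With eight commits, the search needs only three checks instead of testing every commit in turn.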

What version control looks like

Open-source

FOSS: Anyone is freely licensed to use, copy, study, and change the software in any way, and the source code is openly shared so that people are encouraged to voluntarily improve the design of the software.

Examples of open source

• Python / R

• the Web & most browsers

• Linux

• parts of Apple's macOS

• Android OS

• Microsoft .NET

Benefits of open source

• transparency → trust

• fosters learning

• fosters collaboration

• more resilient software

• longer-lasting software

What GitHub provides

• web interface for git

• storage & backups

• issue tracking

• pull requests

• code search

• collaboration features

• project management

• downloadable releases

• web site publishing

• API for integrations

Pull Requests
a small “unit” of collaboration

The GitHub Flow

• new branch

• changes (commits)

• create pull request

• collaboration

• peer approval

• merge

Continuous integration (CI)
The killer feature of pull requests
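
A minimal sketch of why CI pairs well with pull requests: automated checks run against the proposed changes, and the merge goes ahead only once the checks pass and a peer has approved. The `PullRequest` class, the `run_ci` and `try_merge` functions, the branch name, and the check names are all hypothetical; this is not GitHub's data model or API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PullRequest:
    """Hypothetical model of a pull request: a branch plus review state."""
    branch: str
    commits: List[str] = field(default_factory=list)
    approvals: int = 0
    merged: bool = False


def run_ci(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every automated check and record pass/fail per check name."""
    return {name: check() for name, check in checks.items()}


def try_merge(pr: PullRequest, ci_results: Dict[str, bool],
              required_approvals: int = 1) -> bool:
    """Merge only if all CI checks passed and enough peers approved."""
    if all(ci_results.values()) and pr.approvals >= required_approvals:
        pr.merged = True
    return pr.merged


# New branch, a commit, then a pull request with two stand-in checks.
pr = PullRequest(branch="fix-outliers")
pr.commits.append("Drop rows with missing values")
checks = {
    "unit tests": lambda: sum([1, 2, 3]) == 6,  # stand-in for a test suite
    "report builds": lambda: True,              # stand-in for a build step
}
results = run_ci(checks)

merged_before_review = try_merge(pr, results)  # checks pass, no approval yet
pr.approvals += 1                              # peer approval
merged_after_review = try_merge(pr, results)
```

The design choice illustrated here is that reviewers do not have to run anything by hand: the check results travel with the pull request, and the merge step consults them.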

Version control in data science

Similarities to software development

• syncing materials & data

• writing actual code (e.g. R)

• collaboration within a team

• peer review process

• publishing

Writing formats

• LaTeX

• Markdown

• R Markdown

• Jupyter (IPython) Notebook

• AsciiDoc

github.com/mislav/utrecht

Potential problems

• git can be tricky to learn for non-developers

• large datasets can be inconvenient to add to version control

• transition paths from other tools aren't always clear
