Scientific Python


Eueung Mulyana

http://eueung.github.io/python/sci | Attribution-ShareAlike CC BY-SA


Agenda

1. Jupyter / IPython
2. NumPy
3. SciPy
4. matplotlib
5. pandas
6. SymPy
7. scikit-learn
8. jakevdp: The State of the Stack


Jupyter / IPython


IPython

- Powerful interactive shell
- Supports tab completion of just about everything
- Inline help system for modules, classes etc. with ?, source code with ??
- Browser-based notebook (Jupyter) with support for (runnable) code, text, mathematical expressions using LaTeX, inline plots etc.
- Can be used as computational lab notes or worksheets
- Magic functions to access the shell, run R code etc.
- Parallel computing


 

Notes on Jupyter

1. The Jupyter Notebook works with over 40 languages
2. Jupyter Notebooks render on GitHub

Jupyter: Computational Narratives

1. Computers are optimized for producing, consuming and processing data.
2. Humans are optimized for producing, consuming and processing narratives/stories.
3. For code and data to be useful to humans, we need tools for creating and sharing narratives that involve code and data.

The Jupyter Notebook is a tool for creating and sharing computational narratives.


Jupyter & Data Science

The Jupyter Notebook is a tool that allows us to explore the fundamental questions of Data Science

- with a particular dataset
- with code and data
- in a manner that produces a computational narrative
- that can be shared, reproduced, modified, and extended.

At the end of it all, those computational narratives encapsulate the goal or end point of Data Science. The character of the narrative (prediction, inference, data generation, insight, etc.) will vary from case to case.

The purpose of computing is insight, not numbers.

Hamming, Richard (1962). Numerical Methods for Scientists and


NumPy


NumPy is the fundamental package for scientific computing with Python. It contains, among other things:

- A powerful N-dimensional array object
- Sophisticated (broadcasting) functions
- Tools for integrating C/C++ and Fortran code
- Useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data.

Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
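The "arbitrary data-types" point can be made concrete with a structured dtype. The sketch below is only an illustration: the field names and the sample records are invented, not taken from the slides.

```python
import numpy as np

# A structured dtype: every record has a name, an age and a weight.
# Field names and values are hypothetical, chosen only for illustration.
person = np.dtype([("name", "U10"), ("age", np.int32), ("weight", np.float64)])

people = np.array([("Ana", 31, 58.5), ("Budi", 45, 72.0)], dtype=person)

ages = people["age"]                  # one field across all records
older = people[people["age"] > 40]    # boolean mask built from a field

print(ages)
print(older["name"])
```

Each record behaves like a row in a database table, which is roughly what makes this layout convenient for exchanging data with external stores.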

- NumPy provides a powerful N-dimensional array object
- Methods on these arrays are fast because they rely on well-optimised libraries for linear algebra (BLAS, ATLAS, MKL)
- NumPy is tolerant of Python lists (they are accepted wherever array input is expected)
- NumPy builds on years of experience in computer-based numerical analysis
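The capabilities listed above (broadcasting, linear algebra, Fourier transforms, random numbers) can be sketched in a few lines. The matrices, the signal and the seed below are arbitrary choices made for this illustration.

```python
import numpy as np

# Broadcasting: a (3, 1) column and a (4,) row combine into a (3, 4) table
col = np.arange(3).reshape(3, 1)
row = np.arange(4)
table = col * 10 + row

# Linear algebra: solve A @ x = b (here 3x + y = 9 and x + 2y = 8)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)

# Fourier transform: a sine with 5 cycles over 64 samples peaks at bin 5
n = 64
signal = np.sin(2 * np.pi * 5 * np.arange(n) / n)
peak = int(np.argmax(np.abs(np.fft.rfft(signal))))

# Random numbers: a seeded generator gives reproducible samples
rng = np.random.default_rng(0)
samples = rng.normal(size=1000)

print(table.shape, x, peak, samples.shape)
```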


    import numpy as np

    a = np.array([1, 2, 3])     # Create a rank 1 array
    print(type(a))              # Prints "<class 'numpy.ndarray'>"
    print(a.shape)              # Prints "(3,)"
    print(a[0], a[1], a[2])     # Prints "1 2 3"
    a[0] = 5                    # Change an element of the array
    print(a)                    # Prints "[5 2 3]"

    b = np.array([[1, 2, 3], [4, 5, 6]])    # Create a rank 2 array
    print(b.shape)                          # Prints "(2, 3)"
    print(b[0, 0], b[0, 1], b[1, 0])        # Prints "1 2 4"

    # -----
    a = np.zeros((2, 2))    # Create an array of all zeros
    print(a)                # Prints "[[0. 0.]
                            #          [0. 0.]]"

    b = np.ones((1, 2))     # Create an array of all ones
    print(b)                # Prints "[[1. 1.]]"

    c = np.full((2, 2), 7)  # Create a constant array
    print(c)                # Prints "[[7 7]
                            #          [7 7]]"

    d = np.eye(2)           # Create a 2x2 identity matrix
    print(d)                # Prints "[[1. 0.]
                            #          [0. 1.]]"

    e = np.random.random((2, 2))    # Create an array filled with random values
    print(e)                        # Might print "[[0.91940167 0.08143941]
                                    #               [0.68744134 0.87236687]]"

NumPy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object (MATLAB style), and tools for working with these arrays.

Arrays

- A NumPy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers.
- The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.
- We can initialize NumPy arrays from nested Python lists, and access elements using square brackets.
- NumPy also provides many functions to create arrays.


SciPy


SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering. SciPy core packages: IPython, NumPy, SciPy Library, SymPy, matplotlib, pandas.

SciPy Library

The SciPy library is a collection of mathematical algorithms and convenience functions built on top of NumPy. It includes modules for statistics, integration & ODE solvers, linear algebra, optimization, FFT, etc.

We use the terms SciPy and SciPy Library interchangeably; the meaning depends on context.

SciPy is a toolbox for researchers and scientists, and it contains many hidden treasures for them.
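As a rough illustration of the integration, ODE and optimization modules mentioned above, the sketch below uses three textbook problems with known answers. The test functions and tolerances are choices made for this example, not part of the slides.

```python
import numpy as np
from scipy import integrate, optimize

# Integration: the integral of sin(t) from 0 to pi is exactly 2
area, abs_err = integrate.quad(np.sin, 0.0, np.pi)

# Optimization: (t - 3)^2 + 1 has its minimum at t = 3
result = optimize.minimize_scalar(lambda t: (t - 3.0) ** 2 + 1.0)

# ODE solver: dy/dt = -y with y(0) = 1 has the solution exp(-t)
sol = integrate.solve_ivp(lambda t, y: -y, (0.0, 1.0), [1.0],
                          rtol=1e-8, atol=1e-10)
y_end = sol.y[0, -1]

print(area, result.x, y_end)
```

All three results are ordinary floats and NumPy arrays, so they can be passed straight on to NumPy or matplotlib.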


SciPy & NumPy

NumPy provides a high-performance multidimensional array and basic tools to compute with and manipulate these arrays.

SciPy builds on this, and provides a large number of functions that operate on NumPy arrays and are useful for different types of scientific and engineering applications.

SciPy provides numerous numerical routines that run efficiently on top of NumPy arrays for optimization, signal processing, linear algebra and many more. It also provides some convenient data structures, such as compressed sparse matrices and spatial data structures. If you have already used some scikits (scikit-learn, scikit-image), you have already used SciPy extensively.

A few thoughts on SciPy:

- Contains linear algebra routines that overlap with NumPy; SciPy’s linear algebra routines always run on the optimized system libraries (LAPACK, ATLAS, Intel Math Kernel Library, etc.)
- Sparse matrix support
- Extends NumPy’s statistical capabilities
- Under active development, new toys added constantly!
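The data structures mentioned above (compressed sparse matrices, spatial structures) and the LAPACK-backed linear algebra might be exercised as follows. The matrix and the point set are made up for illustration.

```python
import numpy as np
from scipy import linalg, sparse
from scipy.spatial import KDTree

dense = np.array([[4.0, 0.0, 0.0],
                  [0.0, 5.0, 0.0],
                  [1.0, 0.0, 6.0]])

# Compressed sparse row storage keeps only the non-zero entries
m = sparse.csr_matrix(dense)
v = np.array([1.0, 2.0, 3.0])
product = m @ v                      # sparse matrix times dense vector

# scipy.linalg: LU-factorise once, then solve for a right-hand side
lu, piv = linalg.lu_factor(dense)
v_back = linalg.lu_solve((lu, piv), product)

# scipy.spatial: nearest-neighbour lookup in a k-d tree
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
dist, idx = KDTree(points).query([0.9, 0.1])

print(m.nnz, product, v_back, idx)
```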


A big box of tools:

- Special functions (scipy.special)
- Integration (scipy.integrate)
- Optimization (scipy.optimize)
- Interpolation (scipy.interpolate)
- Fourier Transforms (scipy.fftpack)
- Signal Processing (scipy.signal)
- Statistics (scipy.stats)
- Linear Algebra (scipy.linalg)
- File IO (scipy.io)
- Sparse Eigenvalue Problems with ARPACK
- Compressed Sparse Graph Routines (scipy.sparse.csgraph)
- Spatial data structures and algorithms (scipy.spatial)
- Multi-dimensional image processing (scipy.ndimage)
- Weave (scipy.weave)

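    import numpy as np
    from scipy.stats import linregress

    # Sample data assumed here; the slide does not define x and noisy_y
    x = np.arange(10.0)
    noisy_y = x + np.random.normal(0, 1, 10)

    (slope, intercept, r, p, se) = linregress(x, noisy_y)

    # ---

    from scipy.stats import spearmanr, pearsonr

    x_cubed = x ** 3
    x_cubed += np.random.normal(0, 3, 10)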
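A self-contained variant of the fragment above might look like the sketch below. The linear trend 2x + 1, the noise level and the seed are assumptions made here so that the example is reproducible. The noise on x_cubed is left out so that the contrast between the two correlation measures is exact.

```python
import numpy as np
from scipy.stats import linregress, pearsonr, spearmanr

rng = np.random.default_rng(42)          # seeded for reproducibility
x = np.linspace(0.0, 9.0, 10)
noisy_y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, x.size)   # assumed trend

# Least-squares line through the noisy points
fit = linregress(x, noisy_y)
print(fit.slope, fit.intercept, fit.rvalue, fit.pvalue, fit.stderr)

# Pearson measures linear association, Spearman measures monotonic association
x_cubed = x ** 3
r_pearson, _ = pearsonr(x, x_cubed)
rho_spearman, _ = spearmanr(x, x_cubed)
print(r_pearson, rho_spearman)
```

Because x ** 3 is strictly increasing in x, the Spearman coefficient is 1, while the Pearson coefficient stays below 1 since the relation is not linear.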


matplotlib


The ultimate plotting library that renders high-quality 2D and 3D plots for Python.

- pyplot implements MATLAB-style plotting
- Object-oriented API for more advanced graphics
- The API mimics the MATLAB one in many ways, easing the transition for MATLAB users to Python
- Once again, no surprises: matplotlib is a very stable and mature project (expect one major release per year)

Inline plots in the notebook (older IPython releases; current Jupyter notebooks use the %matplotlib inline magic instead):

    ipython notebook --pylab inline


    import numpy as np
    import matplotlib.pyplot as plt

    # Compute the x and y coordinates for points on a sine curve
    x = np.arange(0, 3 * np.pi, 0.1)
    y = np.sin(x)

    # Plot the points using matplotlib
    plt.plot(x, y)
    plt.show()  # You must call plt.show() to make graphics appear.

    import numpy as np
    import matplotlib.pyplot as plt

    # Compute the x and y coordinates for points on sine and cosine curves
    x = np.arange(0, 3 * np.pi, 0.1)
    y_sin = np.sin(x)
    y_cos = np.cos(x)

    # Plot the points using matplotlib
    plt.plot(x, y_sin)
    plt.plot(x, y_cos)
    plt.xlabel('x axis label')
    plt.ylabel('y axis label')
    plt.title('Sine and Cosine')
    plt.legend(['Sine', 'Cosine'])
    plt.show()

matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, errorcharts, scatterplots, etc., with just a few lines of code.

For simple plotting the pyplot interface provides a MATLAB-like interface, particularly when combined with IPython. For the power user, you have full control of line styles, font properties, axes properties, etc., via an object-oriented interface or via a set of functions familiar to MATLAB users.

With just a little bit of extra work we can easily plot a more complex chart, e.g. multiple lines at once, and add a title, legend, and axis labels.
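The object-oriented interface mentioned above could be used roughly as follows. This is a sketch: the Agg backend and the in-memory PNG buffer are choices made here so that it runs without a display, and the styling values are arbitrary.

```python
import io

import matplotlib
matplotlib.use("Agg")   # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 3 * np.pi, 0.1)

# One Figure with two Axes objects; each Axes is styled independently
fig, (ax_top, ax_bottom) = plt.subplots(2, 1, sharex=True, figsize=(6, 4))

ax_top.plot(x, np.sin(x), color="tab:blue", linestyle="--", label="Sine")
ax_top.set_ylabel("sin(x)")
ax_top.legend()

ax_bottom.plot(x, np.cos(x), color="tab:orange", linewidth=2, label="Cosine")
ax_bottom.set_xlabel("x")
ax_bottom.set_ylabel("cos(x)")
ax_bottom.legend()

fig.suptitle("Sine and Cosine")

# Render to an in-memory PNG instead of calling plt.show()
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=100)
```

Working on explicit Figure and Axes objects avoids the implicit "current axes" state that pyplot keeps, which tends to matter once a figure has more than one panel.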


Notes


TL;DR

- NumPy is the foundation
- SciPy is built upon NumPy, with some overlapping functionality
- matplotlib complements both

NumPy, SciPy, matplotlib

- NumPy is the foundation of scientific and numerical computing with Python
- SciPy is a collection of mathematical and scientific tools
- matplotlib is a technical plotting package

NumPy Arrays

- Implemented in C for efficiency
- Python indexing and slicing
- Elements are strongly typed

Taking advantage of NumPy

- Think in parallel!
- Replace loops with vector operations

matplotlib

- Primarily 2D plotting
- Basic 3D plots available with mplot3d (import mpl_toolkits.mplot3d)
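The NumPy points above (typed elements, Python-style slicing, vector operations instead of loops) can be sketched with a small example. The function names norm_loop and norm_vectorized are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def norm_loop(values):
    """Euclidean norm with an explicit Python loop."""
    total = 0.0
    for v in values:
        total += v * v
    return total ** 0.5

def norm_vectorized(values):
    """Same computation expressed as whole-array operations."""
    return np.sqrt(np.sum(values * values))

data = np.arange(1.0, 1001.0)        # 1.0, 2.0, ..., 1000.0

# Python indexing and slicing work on arrays
evens = data[1::2]                   # every second element, starting at 2.0
tail = data[data > 990]              # boolean mask

# Elements are strongly typed: a float stored in an int array is truncated
ints = np.array([1, 2, 3], dtype=np.int64)
ints[0] = 2.9

print(norm_loop(data), norm_vectorized(data), evens[:3], tail.size, ints)
```

Both functions return the same value; the vectorized form moves the loop into compiled code, which is usually where the speed-up comes from.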


Other Notes

NumPy/SciPy/scikit-learn rely on many low-level Fortran/C libraries such as BLAS, ATLAS, the Intel MKL…

- most of these libraries are shipped by your favorite OS unoptimized (well, maybe not the case for Mac)
- you may want to re-compile these libraries or to use a packaged Python distribution (Anaconda, Canopy)
- libraries for performance: numba, cython, ...


pandas


pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

- "R for Python"
- Provides easy-to-use data structures & a ton of useful helper functions for data cleanup and transformations
- Fast! (backed by NumPy arrays)
- Integrates well with other libs, e.g. scikit-learn


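    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    s = pd.Series([1, 3, 5, np.nan, 6, 8])
    dates = pd.date_range('20130101', periods=6)

    # Arguments marked "assumed" are cut off on the slide and are filled in
    # here only so that the snippet runs.
    df = pd.DataFrame(np.random.randn(6, 4), index=dates,
                      columns=list('ABCD'))  # assumed: four column labels
    df2 = pd.DataFrame({
        'A': 1.,
        'B': pd.Timestamp('20130102'),
        'C': pd.Series(1, index=list(range(4)), dtype='float32'),  # assumed dtype
        'D': np.array([3] * 4, dtype='int32'),
        'E': pd.Categorical(["test", "train", "test", "train"]),  # assumed last value
        'F': 'foo',
    })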

- pandas provides the DataFrame class, which is very similar to a data.frame in R
- Built on top of NumPy arrays, and allows mixed column types
- Copes well with missing values (unlike NumPy)
- Intelligently matches on columns/indices (supports SQL-like joins etc.)
- Reads and writes .csv, .xls, HTML tables etc.
- Lots of useful data analysis tools built in
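The points above on missing values, SQL-like joins and CSV round trips can be sketched as follows. The two small tables are invented for this illustration, and an in-memory buffer stands in for a real .csv file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample tables with mixed column types and one missing score
people = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Ana", "Budi", "Citra"],
    "score": [90.0, np.nan, 75.0],
})
cities = pd.DataFrame({"id": [1, 2, 4], "city": ["Bandung", "Jakarta", "Medan"]})

# Missing values are skipped by default: (90 + 75) / 2
mean_score = people["score"].mean()
filled = people["score"].fillna(mean_score)

# SQL-like left join on the shared "id" column; id 3 has no matching city
joined = people.merge(cities, on="id", how="left")

# Round trip through CSV text
csv_text = joined.to_csv(index=False)
restored = pd.read_csv(io.StringIO(csv_text))

print(joined)
print(restored.dtypes)
```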


SymPy


SymPy is a Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible.

SymPy is written entirely in Python; recent versions depend only on the pure-Python mpmath library.

    import sympy
    sympy.sqrt(8)
    # 2*sqrt(2)

    from sympy import symbols
    x, y = symbols('x y')
    expr = x + 2*y
    expr
    # x + 2*y

    expr - x
    # 2*y
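Beyond building expressions, a computer algebra system manipulates them symbolically. The sketch below shows a few such operations with standard SymPy functions; the particular expressions are arbitrary examples.

```python
import sympy
from sympy import diff, expand, factor, integrate, sin, solve, symbols

x, y = symbols("x y")

expanded = expand((x + 2*y)**2)          # x**2 + 4*x*y + 4*y**2
factored = factor(expanded)              # (x + 2*y)**2

derivative = diff(sin(x) * x**2, x)      # product rule, done symbolically
integral = integrate(2*x, (x, 0, 3))     # definite integral, equals 9
roots = solve(x**2 - 2, x)               # exact roots -sqrt(2) and sqrt(2)

print(expanded, factored, derivative, integral, roots)
```

The results stay exact (sqrt(2) rather than 1.414...), which is the main difference from the numerical tools in NumPy and SciPy.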


scikit-learn


- Machine learning algorithms implemented in Python on top of NumPy & SciPy
- Conveniently maintains the same interface to a wide range of algorithms
- Includes algorithms for classification, regression, clustering, dimensionality reduction
- As well as lots of useful utilities (cross-validation, preprocessing etc.)

    from sklearn import datasets
    iris = datasets.load_iris()
    digits = datasets.load_digits()

    print(digits.data)
    digits.target
    digits.images[0]

    from sklearn import svm
    clf = svm.SVC(gamma=0.001, C=100.)
    clf.fit(digits.data[:-1], digits.target[:-1])
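The "same interface" point and the cross-validation utilities can be illustrated on the digits data used above. This is a sketch: the choice of a k-nearest-neighbours classifier as the second model, the 25% test split and the random seed are assumptions made for this example.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Different algorithms, identical fit / score calls
models = {
    "SVC": svm.SVC(gamma=0.001, C=100.0),
    "kNN": KNeighborsClassifier(n_neighbors=3),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)   # accuracy on held-out data

# 5-fold cross-validation on the full data set
cv_scores = cross_val_score(svm.SVC(gamma=0.001, C=100.0),
                            digits.data, digits.target, cv=5)

print(scores)
print(cv_scores.mean())
```

Holding out test data (or cross-validating) gives an estimate of how the model behaves on unseen samples, which fitting and scoring on the same data cannot show.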


The State of the Stack


Many More Tools

- Performance: Numba, Weave, Numexpr, Theano ...
- Visualization: Bokeh, Seaborn, Plotly, Chaco, mpld3, ggplot, MayaVi, vincent, toyplot, HoloViews ...
- Data Structures & Computation: Blaze, Dask, DistArray, XRay, Graphlab, SciDBpy, pySpark ...
- Packaging & distribution: pip/wheels, conda, EPD, Canopy, Anaconda ...


References

1. Brian Granger: Project Jupyter as a Foundation for Open Data Science
2. Juan Luis Cano Rodriguez, IPython: How a notebook is changing science | Python as a real alternative to MATLAB, Mathematica and other commercial software
3. Olivier Hervieu: Introduction to scientific programming in python
4. CS231n: IPython Tutorial, http://cs231n.github.io/ipython-tutorial/
5. J.R. Johansson: Introduction to scientific computing with Python
6. Introduction to solving biological problems with Python by pycam
7. Jake VanderPlas: The State of the Stack
