Transcript


Chapter 10 – Introducing Python Pandas

Introduction

Pandas is Python's library for data analysis.

Pandas – "Panel Data Analysis"

What is Data Analysis?

It refers to the process of evaluating big data sets using analytical and statistical tools so as to discover useful information and conclusions to support business decision making.

Python pandas & Data Analysis

Python Pandas provides various tools for data analysis, making it a simple and easy process.

The author of Pandas is Wes McKinney.

Using Pandas

Pandas is an open-source library built for the Python programming language, which provides high-performance data analysis tools.

In order to work with Pandas in Python, you need to import the pandas library into your Python environment.
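For example, using the conventional alias pd:

import pandas as pd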

Benefits of using Pandas for Data Analysis

1. It can read or write in many different data formats (integer, float, double, etc.).

2. It can calculate across all the ways data is organized, i.e., across rows and down columns.

3. It can easily select subsets of data from bulky data sets and even combine multiple datasets together.

4. It has functionality to find and fill missing data.

5. It supports advanced time-series functionality. (Time-series forecasting is the use of a model to predict future values based on previously observed values.)

**Pandas is best at handling huge tabular data sets comprising different data formats.

NumPy Arrays

NumPy ('Numerical Python' or 'Numeric Python') is an open-source module of Python that offers functions and routines for fast mathematical computation on arrays and matrices.

In order to use NumPy, you must import it in your module by using a statement like:

import numpy as np

The above statement gives np as an alias name for the numpy module. Once imported, you can use both names, i.e., numpy or np, for functions; e.g., numpy.array( ) is the same as np.array( ).

Array

It refers to a named group of homogeneous (same-type) elements. E.g., if a students array contains 5 entries as [34, 37, 36, 41, 40], then students is an array.

Types of Numpy array

A NumPy array is simply a grid that contains values of the same/homogeneous type. NumPy arrays come in two forms:

1-D (one-dimensional) arrays, known as vectors (having a single row/column only)

Multidimensional arrays, known as matrices (can have multiple rows and columns)

** You can use any identifier name in place of np.


Example 1: (1-D array)

import numpy as np

lst = [1,2,3,4]      # avoid the name 'list', which shadows the built-in

a1 = np.array(lst)   # creates a NumPy array from the given list

print(a1)

Output: [1 2 3 4]

** Individual elements of the above array can be accessed just like a list's elements, i.e., arrayname[index].

Example 2: (2-D array)

import numpy as np

a7 = np.array([ [10,11,12,13] , [21,22,23,24] ])   # a 2-D array having rank 2

print(a7[1,3])    # access as <array>[row, col]

print(a7[1][3])   # or as <array>[row][col]

print(a7)

Output:

24
24
[[10 11 12 13]
 [21 22 23 24]]

type( ) function in NumPy

It is used to check the type of objects in Python. Two commonly used array attributes are:

(i) shape, which returns the dimensions of a NumPy array, and

(ii) itemsize, which returns the length of each element of the array in bytes.

Example:

import numpy as np

lst = [1,2,3,4]

a1 = np.array(lst)                                 # creates a NumPy array from the given list

a2 = np.array([ [10,11,12,13] , [21,22,23,24] ])   # a 2-D array having rank 2

print(type(a1))

print(type(a2))

print(a1.shape)

print(a2.shape)

print(a2.itemsize)

Output:

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
(4,)
(2, 4)
8

** The shape attribute gives the dimensions of a NumPy array.

** The itemsize attribute returns the length of each element of the array in bytes. (The itemsize shown is 8, since the integer elements here default to the 8-byte int64 type.)


** You can check the data type of a NumPy array's elements using <arrayname>.dtype.

** In a NumPy array, dimensions are called axes. The number of axes is called rank.

** The default dtype is float in a NumPy array.

Difference between NumPy Arrays and Lists

1. Once a NumPy array is created, you cannot change its size; a list's size can be changed.

2. Every NumPy array contains elements of a homogeneous type, i.e., all its elements have one and only one data type; a list can contain elements of different data types.

3. NumPy arrays support vectorized operations, i.e., if you apply a function, it is performed on every item in the array; lists do not support vectorized operations.

NumPy Data Types

NumPy arrays can have elements of any data type supported by NumPy, e.g. bool_, int8, int16, int32, int64, uint8, uint16, uint32, uint64, float16, float32, float64, complex64 and complex128.


Ways to create NumPy Arrays

1. Creating empty arrays using empty( )

Sometimes you need to create an empty or uninitialized array of a specified shape and dtype, in which you can store actual data as and when required. For this you can use the empty( ) function as:

numpy.empty(shape [, dtype=float64])

** After creating an empty array, if you display its contents, it will display random contents, which are uninitialized garbage values.

Example:

import numpy as np

arr1 = np.empty([3,2])                    # no dtype specified, so it defaults to float64

arr2 = np.empty([3,4] , dtype=np.int8)    # dtype specified as int8

print(arr1.dtype , arr2.dtype)

print(arr1)

Output:

float64 int8
(arr1 prints with random garbage values, since empty( ) creates an array with uninitialized contents.)

2. Creating arrays filled with zeros using zeros( )

The function zeros( ) takes the same arguments as empty( ), and creates an array of the specified size and type, but filled with zeros.



Example:

import numpy as np

arr1 = np.zeros([3,2],dtype=np.int64)

print(arr1)

Output:

[[0 0]
 [0 0]
 [0 0]]

3. Creating arrays filled with 1's using ones( )

The function ones( ) takes the same arguments as empty( ), and creates an array of the specified size and type, but filled with ones.

Example:

import numpy as np

arr1 = np.ones([3,2],dtype=np.int64)

print(arr1)

Output:

[[1 1]
 [1 1]
 [1 1]]

** There are three more functions empty_like( ), zeros_like( ) and ones_like( ) that you can use to create an

array similar to another existing array.

4. Creating arrays with a numerical range using arange( )

arange( ) creates a NumPy array with evenly spaced values within a specified numerical range. It is used as:

numpy.arange([start,] stop [, step] [, dtype])

Example:

import numpy as np


arr1 = np.arange(7)

print(arr1)

arr2=np.arange(1,7,2,np.float32)

print(arr2)

Output:

[0 1 2 3 4 5 6]
[1. 3. 5.]

5. Creating arrays with a numerical range using linspace( )

linspace( ) is used to generate evenly spaced elements between two given limits.

Example:

import numpy as np

arr1 = np.linspace(2,10,3)

print(arr1)

Output:

[ 2.  6. 10.]

Pandas Data Structure

A data structure is a particular way of storing and organizing data in a computer to suit a specific purpose so that

it can be accessed and worked with in appropriate ways.

Two basic data structures in Pandas are: Series and DataFrame.

Series Data Structure

It represents a one-dimensional array of indexed data. A series type object has two main components:

1. An array of actual data

2. An associated array of indexes or data labels.

Creating Series Objects

A Series-type object/variable can be created using the Pandas library's Series( ). Make sure that you have imported the pandas and numpy modules with import statements.

I. Creating empty series Object

To create an empty Series object, i.e., one having no values, you can just use Series( ) as:

<series object> = pandas.Series( )

The above statement will create an empty Series-type object with no values, having the default datatype, which is float64.

Example:

import numpy as np


import pandas as pd

obj1 = pd.Series( )

print(obj1)

Output: Series([], dtype: float64)

II. Creating non-empty Series objects

To create non-empty Series objects, you need to specify arguments for data and indexes as per following syntax:

<Series-object> = pd.Series(data , index=idx)

where data is the data part of the Series object, it can be one of the following:

* A Python sequence * An ndarray * A Python dictionary * A Scalar Value

(i) Specify data as Python Sequence

<Series-object> = pd.Series(<any Python sequence>)

Example:

import numpy as np

import pandas as pd

obj1 = pd.Series(range(7))

obj2 = pd.Series([3.5 , 5 ,6.5, 8])

print(obj1)

print(obj2)

Output:

0    0
1    1
2    2
3    3
4    4
5    5
6    6
dtype: int64
0    3.5
1    5.0
2    6.5
3    8.0
dtype: float64

(ii) Specify data as ndarray

The data attribute can be ndarray also.

Example:

import numpy as np

import pandas as pd

arr1=np.array([2,4,6,8])

obj1 = pd.Series(arr1)

print(obj1)

Output:

0    2
1    4
2    6
3    8
dtype: int64


(iii) Specify data as a Python dictionary

The sequence that you provide with Series( ) can be any sequence, including dictionaries.

Example:

import numpy as np

import pandas as pd

obj1 = pd.Series({'Jan': 31 , 'Feb' : 28 , 'Mar' : 31})

print(obj1)

Output:

Jan    31
Feb    28
Mar    31
dtype: int64

(iv) Specify data as a scalar value

The data can be in the form of a single value, i.e., a scalar value. BUT if the data is a scalar value, then the index must be provided. The scalar value (given as data) will be repeated to match the length of the index, e.g.
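For instance, a minimal sketch (the value and indexes here are arbitrary):

import pandas as pd

obj1 = pd.Series(7, index=['a', 'b', 'c'])   # the scalar 7 is repeated for each index
print(obj1)

Output:
a    7
b    7
c    7
dtype: int64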

III. Creating Series Objects – Additional Functionality

(i) Specifying/Adding NaN values in a Series object

Sometimes you need to create a series object of a certain size but you do not have complete data available at that

time. In such cases, you can fill missing data with a NaN(Not a Number) value.

Example:

import numpy as np

import pandas as pd

obj1 = pd.Series([2.4 , np.NaN , 3.8])

print(obj1)


Output:

0    2.4
1    NaN
2    3.8
dtype: float64

(ii) Specify index(es) as well as data with Series( )

While creating a Series-type object, along with values you can also provide indexes. The syntax for this is as follows:

<Series Object> = pandas.Series(data=None , index = None)

Example:

import numpy as np

import pandas as pd

arr = [31 , 28 , 31 , 30]

mon = ["Jan" , "Feb" , "Mar" , "Apr"]

obj3 = pd.Series(data=arr, index=mon)

print(obj3)

Output:

Jan    31
Feb    28
Mar    31
Apr    30
dtype: int64

* You could skip the keyword data too, i.e., the following statement will also do the same as above:

obj3 = pd.Series(arr, index=mon)

* You may use a loop for defining the index sequence too, e.g.

s1 = pd.Series(range(1, 15, 3), index=[x for x in 'abcde'])

(iii) Specify data type along with data and index

You can specify data type along with data and index with Series( ) as per following syntax:

<Series Object> = pandas.Series(data=None , index=None , dtype=None)

None is the default value for different parameters taken in case no value is provided for a parameter.

Example:

import numpy as np

import pandas as pd

arr=[31,28,31,30]

mon=["Jan","Feb","Mar","Apr"]

obj5=pd.Series(data=arr,index=mon,dtype=np.float64)

print(obj5)



Output:

Jan    31.0
Feb    28.0
Mar    31.0
Apr    30.0
dtype: float64

(iv) Using a mathematical function/expression to create data array in Series( )

Series( ) allows you to define a function or expression that calculates values for the data sequence. It is done in the following form:

<Series Object> = pandas.Series(index=None, data=<function|expression>)

Example1:

import numpy as np

import pandas as pd

a= np.arange(9,13)

print(a)

obj1 = pd.Series(index=a, data=a*2)

print(obj1)

obj2 = pd.Series(index=a, data= a**2)

print(obj2)

Output:

[ 9 10 11 12]
9     18
10    20
11    22
12    24
dtype: int64
9      81
10    100
11    121
12    144
dtype: int64

Example2:

import numpy as np

import pandas as pd

list1=[10,20,30,40]

obj1 = pd.Series(data=list1*2)

print(obj1)

Output:

0    10
1    20
2    30
3    40
4    10
5    20
6    30
7    40
dtype: int64


** A number multiplied with a list replicates the list that many times; hence the data array has the same list values replicated twice (list1 * 2).

Important

While creating a Series object, when you give the index array as a sequence, there is no compulsion for the uniqueness of indexes. That is, you may have duplicate entries in the index array and Python won't raise any error.

e.g.

import numpy as np

import pandas as pd

val=np.array([10,20,30,40])

obj1 = pd.Series(index=['a','b','a','b'],data=val)

print(obj1)

Output:

a    10
b    20
a    30
b    40
dtype: int64

** See, the index array has duplicate entries.

Series Object Attributes

When you create a Series type object, all information related to it(such as its size, its datatype etc.) is available

through attributes. You can use these attributes in the following format to get information about the Series

object:

<Series object>.<attribute name>

(a) Retrieving Index Array and Data Array of a Series Object



You can access the index array and data values array of an existing Series object by using attributes index and

values.

Example:

import numpy as np

import pandas as pd

obj5=pd.Series(data=[10,20,30,40])

obj6=pd.Series(data=[56.6,70.5,78.6], index=['a','b','c'])

print(obj5.index)

print(obj5.values)

print(obj6.index)

print(obj6.values)

Output:

RangeIndex(start=0, stop=4, step=1)
[10 20 30 40]
Index(['a', 'b', 'c'], dtype='object')
[56.6 70.5 78.6]

(b) Retrieving Type (dtype) and Size of Type (itemsize)

To retrieve the data type of the individual elements of a Series object, you can use the attribute dtype with the Series object, as <objectname>.dtype. You can use the itemsize attribute to know the number of bytes allocated to each data item,

e.g.

import numpy as np

import pandas as pd

obj5=pd.Series(data=[10,20,30,40])

obj6=pd.Series(data=[56.6,70.5,78.6], index=['a','b','c'])

print(obj5.dtype)

print(obj5.itemsize)

print(obj6.dtype)

print(obj6.itemsize)

Output:

int64

8

float64

8

(c) Retrieving Shape

The shape of a Series object tells how big it is, i.e., how many elements it contains, including missing or empty values (NaN). Since there is only one axis in a Series object, the shape is shown as (<n>,), where n is the number of elements in the object.

e.g.

import numpy as np

import pandas as pd

obj5=pd.Series(data=[10,20,30,40])

obj6=pd.Series(data=[56.6,70.5,78.6],index=['a','b','c'])

print(obj5.shape)

print(obj6.shape)

Output:

(4,)
(3,)


(d) Retrieving Dimension, Size and Number of Bytes

e.g.

import numpy as np

import pandas as pd

obj5=pd.Series(data=[10,20,30,40])

obj6=pd.Series(data=[56.6,70.5,78.6],index=['a','b','c'])

print(obj5.ndim, obj6.ndim)

print(obj5.size, obj6.size)

print(obj5.nbytes, obj6.nbytes)

Output:

1 1
4 3
32 24

(e) Checking Emptiness and Presence of NaNs

If you want to check whether a Series object is empty, you can do so by checking the empty attribute, as shown below. Similarly, if you want to check whether a Series object contains some NaN value or not, you can use the hasnans attribute.

If you use len( ) on a Series object, it returns the total number of elements in it, including NaNs, but <series>.count( ) returns only the count of non-NaN values in a Series object.

e.g.

import numpy as np

import pandas as pd

obj5=pd.Series(data=[10,20,30,np.NaN])

obj6=pd.Series(data=[56.6,70.5,78.6],index=['a','b','c'])

obj7=pd.Series()

print(obj5.empty, obj6.empty, obj7.empty)

print(obj5.hasnans,obj6.hasnans,obj7.hasnans)

print(len(obj5),len(obj6))

print(obj5.count( ),obj6.count( ))

Output:

False False True
True False False
4 3
3 3

Accessing a Series Object and its Elements

After creating a Series-type object, you can access its indexes separately, its data separately, and even access individual elements.


1. Accessing Individual Elements

To access individual elements of a Series object, you can give its index in square brackets along with its name, as you have been doing with other Python sequences, i.e., as:

<Series Object name> [<valid index>]

Example:

import numpy as np

import pandas as pd

obj5=pd.Series(data=[10,20,30,np.NaN])

obj6=pd.Series(data=[56.6,70.5,78.6,89.8],index=['a','b','c','a'])

print(obj5[2], obj5[3])

print(obj6['b'], obj6['c'])

print(obj6['a'])

Output:

30.0 nan
70.5 78.6
a    56.6
a    89.8
dtype: float64

** Only legal and valid indexes should be used inside the square brackets.

** If the Series object has duplicate indexes, then giving such an index with the Series object will return all the entries with that index.

2. Extracting Slices from Series Object

Like other sequences, you can extract slices too from a Series object. Here, you need to understand an important thing about slicing:

Slicing takes place position-wise and not index-wise in a Series object.

Irrespective of the Series object's indexes, positions always start with 0 and go on like 1, 2, 3 and so on. When you have to extract slices, you specify them as [start : end : step], like you do for other sequences, but start and end signify the positions of elements, not the indexes.

E.g.

import numpy as np

import pandas as pd

obj5=pd.Series(data=[10,20,30,np.NaN])

obj6=pd.Series(data=[56.6,70.5,78.6,89.8],index=['a','b','c','a'])

print(obj5[0 : 2])

print(obj5[ 0: 4: 2])


print(obj5[: :-1])

print(obj6[0 : : 2])

print(obj6[ 4: 5])

Output:

0    10.0
1    20.0
dtype: float64
0    10.0
2    30.0
dtype: float64
3     NaN
2    30.0
1    20.0
0    10.0
dtype: float64
a    56.6
c    78.6
dtype: float64
Series([], dtype: float64)

Operations on Series Object

1. Modifying Elements of Series Object

The data values of a Series object can be easily modified through item assignment, i.e.

<SeriesObject> [<index>] = <new_data_value>

Above assignment will change the data value of the given index in Series Object.

<SeriesObject>[start : stop] = <new_data_value>

Above assignment will replace all the values falling in given slice.

You can even change indexes of a Series object by assigning new index array to its index attribute, i.e.

<Object>.index = <new index array>

But the size of the new index array must match the existing index array's size.

Example:

import numpy as np

import pandas as pd

obj5=pd.Series(data=[10,20,30,np.NaN])

obj6=pd.Series(data=[56.6,70.5,78.6,89.8],index=['a','b','c','a'])

obj5[0]=20

obj6['b']=67.8

print(obj5)

print(obj6)

obj6.index = ['x','y','z','x']

print(obj6)

Output:

0    20.0
1    20.0
2    30.0
3     NaN
dtype: float64
a    56.6
b    67.8
c    78.6
a    89.8
dtype: float64
x    56.6
y    67.8
z    78.6
x    89.8
dtype: float64


2. The head( ) and tail( ) functions

The head( ) function is used to fetch the first n rows from a pandas object and the tail( ) function returns the last n rows from a pandas object. The syntax to use these functions is:

<pandas object>.head([n])

Or <pandas object>.tail([n])

If you do not provide any value for n, then head( ) and tail( ) will return the first 5 and the last 5 rows, respectively, of the pandas object.

Example:

import numpy as np

import pandas as pd

obj5=pd.Series(data=[10,20,30,np.NaN])

obj6=pd.Series(data=[56.6,70.5,78.6,89.8],index=['a','b','c','a'])

print(obj5.head( ))

print(obj6.tail( ))

print(obj5.head(2))

print(obj6.tail(3))

Output:

0    10.0
1    20.0
2    30.0
3     NaN
dtype: float64
a    56.6
b    70.5
c    78.6
a    89.8
dtype: float64
0    10.0
1    20.0
dtype: float64
b    70.5
c    78.6
a    89.8
dtype: float64

3. Vector operations on Series Object

Vector operations mean that if you apply a function or expression, it is applied individually on each item of the object.


Example:

import numpy as np

import pandas as pd

obj5=pd.Series(data=[10,20,30,np.NaN])

obj6=pd.Series(data=[56.6,70.5,78.6,89.8],index=['a','b','c','a'])

obj7 = obj6 * 2

print(obj5+2)

print(obj5*2)

print(obj7)

print(obj5>15)

Output:

0    12.0
1    22.0
2    32.0
3     NaN
dtype: float64
0    20.0
1    40.0
2    60.0
3     NaN
dtype: float64
a    113.2
b    141.0
c    157.2
a    179.6
dtype: float64
0    False
1     True
2     True
3    False
dtype: bool

4. Arithmetic on Series Objects

You can perform arithmetic like addition, subtraction, division etc. with two Series objects, but the operation is performed only on the matching indexes. Also, if the data items of two matching indexes are not compatible with the operation, NaN (Not a Number) is returned as the result of the operation.

You can store the result of object arithmetic in another object, which will also be a Series object.

Example:

import numpy as np

import pandas as pd

obj5=pd.Series(data=[10,20,30,np.NaN])

obj6=pd.Series(data=[56.6,70.5,78.6,89.8],index=['a','b','c','a'])

obj7 = obj5* 2

print(obj5+obj6)

print(obj5+obj7)

Output:


5. Filtering Entries

You can filter out entries from a Series object using expressions of Boolean type (i.e., expressions that yield a Boolean value) as per the following syntax:

<Series Object> [<Boolean expression on Series Object>]

Example:

import numpy as np

import pandas as pd

obj5=pd.Series(data=[10,20,30,np.NaN])

obj6=pd.Series(data=[56.6,70.5,78.6,89.8],index=['a','b','c','a'])

print(obj5<15)

print(obj5[obj5<15])

Output:

0     True
1    False
2    False
3    False
dtype: bool
0    10.0
dtype: float64

6. Dropping Entries from an axis

Sometimes you do not need a data value at a particular index. You can remove that entry from a Series object using drop( ) as per this syntax:

<Series Object>.drop(<index to be removed>)
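For instance, a minimal sketch (data assumed); note that drop( ) returns a new Series and leaves the original unchanged:

import pandas as pd

obj = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
obj2 = obj.drop('b')     # remove the entry having index 'b'
print(obj2)

Output:
a    10
c    30
dtype: int64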

Difference between NumPy Arrays and Series Objects

1. With ndarrays, vectorized operations can be performed only if the shapes of the two ndarrays match; otherwise an error is raised. With Series objects, shapes do not matter: for non-matching indexes, NaN is returned.

2. In ndarrays, the indexes are always numeric, starting from 0 onwards. Series objects can have any type of indexes, including numbers.

***********************

CHAPTER 11 – PYTHON PANDAS II – DATAFRAMES AND OTHER OPERATIONS

DataFrame Data Structure


A DataFrame is another pandas data structure, which stores data in a two-dimensional array. It is a two-dimensional labelled array: an ordered collection of columns, where different columns may store different types of data, e.g. numeric, string, floating point or Boolean type.

A two-dimensional array is an array in which each element is itself an array. For instance, an array A[m][n] is an m-by-n table with m rows and n columns, containing m x n elements.

Characteristics

1. It has two indexes, or we can say two axes – a row index (axis=0) and a column index (axis=1).

2. Each value is identifiable with the combination of row index and column index. The row index is known as the index in general, and the column index is called the column name.

3. The indexes can be of numbers or letters or strings.

4. There is no condition of having all data of the same type across columns; its columns can have data of different types.

5. You can easily change its values, i.e., it is value-mutable.

6. You can add or delete rows/columns in a DataFrame. In other words, it is size-mutable.

Creating and Displaying a DataFrame

A DataFrame object can be created by passing data in two-dimensional format. As with the Series data structure, before you start working with DataFrames, the following two libraries need to be imported:

import pandas as pd

import numpy as np

To create a DataFrame object, you can use syntax as:

<DF object> = pandas.DataFrame(<2-D data structure>, [index=<index sequence>,] [columns=<column sequence>])


1. Creating a DataFrame Object from a 2-D Dictionary

A two-dimensional dictionary is a dictionary having items as (key : value), where the value part is a data structure of any type: another dictionary, an ndarray, a Series object, a list, etc. But here the value parts of all keys should have similar structures and equal lengths.

(a) Creating a dataframe from a 2D dictionary having values as lists/ndarrays

e.g.

import numpy as np

import pandas as pd

dict1={'Students': ['Ruchika' , 'Neha', 'Mark' , 'Gurpreet' , 'Jamaal'] ,

'Marks': [79.5 , 83.75 , 74 , 88.5 , 89] ,

'Sport' : ['Cricket', 'Badminton', 'Football' , 'Athletics' , 'Kabaddi'] ,

}

dtf1 = pd.DataFrame(dict1)

print(dtf1)

Output:

** As you can see, the DataFrame object created has its index assigned automatically (0 onwards), just as happens with Series objects, and the columns are placed in sorted order. The keys of the dictionary have become the columns.

** You can specify your own indexes too by specifying a sequence by the name index in the DataFrame( )

function, e.g.

dtf1 = pd.DataFrame(dict1, index=['I', 'II', 'III', 'IV', 'V'])

print(dtf1)

(b) Creating a DataFrame from a 2D dictionary having values as dictionary objects:

e.g.

import numpy as np

import pandas as pd

yr2015 = { 'Qtr1' : 34500 , 'Qtr2' : 56000 , 'Qtr3' : 47000 , 'Qtr4': 49000}

yr2016 = {'Qtr1' : 44900, 'Qtr2' : 46100 , 'Qtr3' : 57000 , 'Qtr4': 59000}

yr2017 = { 'Qtr1' : 54500 , 'Qtr2' : 51000 , 'Qtr3' : 57000 , 'Qtr4': 58500}

diSales = { 2015 : yr2015 , 2016 : yr2016 , 2017 : yr2017}

df1 = pd.DataFrame(diSales)

print(df1)


Output:

       2015   2016   2017
Qtr1  34500  44900  54500
Qtr2  56000  46100  51000
Qtr3  47000  57000  57000
Qtr4  49000  59000  58500

- In this case, Python interprets the outer dict keys as the columns and the inner keys as the row indices.

- As the keys of all the inner dictionaries (yr2015, yr2016, yr2017) are exactly the same in number and names, the dataframe object df1 also has the same set of indexes. Since the inner keys have values in all the inner dictionaries, there is no missing value in the dataframe object.

- Now had there been a situation where the inner dictionaries had non-matching keys, then Python would have done the following things:

(i) The total number of indexes would equal the number of unique inner keys across all the inner dictionaries.

(ii) For a key that has no matching keys in the other inner dictionaries, the value NaN would be used to depict the missing values.

Example:

import numpy as np

import pandas as pd

yr2015 = { 'Qtr1' : 34500 , 'Qtr2' : 56000 , 'Qtr3' : 47000 , 'Qtr4': 49000}

yr2016 = {'Q1' : 44900, 'Q2' : 46100 , 'Q3' : 57000 , 'Q4': 59000}

yr2017 = { 'A' : 54500 , 'B' : 51000 , 'C' : 57000 }

diSales = { 2015 : yr2015 , 2016 : yr2016 , 2017 : yr2017}

df1 = pd.DataFrame(diSales)

print(df1)

Output:

Example:

import numpy as np

import pandas as pd

yr2015 = { 'Qtr1' : 34500 , 'Qtr2' : 56000 , 'Qtr3' : 47000 , 'Qtr4': 49000}

yr2016 = {'Qtr1' : 44900, 'Qtr2' : 46100 , 'Q3' : 57000 , 'Q4': 59000}

yr2017 = { 'A' : 54500 , 'B' : 51000 , 'Qtr4' : 57000 }


diSales = { 2015 : yr2015 , 2016 : yr2016 , 2017 : yr2017}

df1 = pd.DataFrame(diSales)

print(df1)

Output:

2. Creating a DataFrame Object from a 2-D ndarray

You can also pass a two-dimensional NumPy array to DataFrame( ) to create a dataframe object.

Example:

import numpy as np

import pandas as pd

narr1=np.array([[40,43,53],[64,55,46]],np.int32)

dtf1 = pd.DataFrame(narr1)

print(dtf1)

Output:

    0   1   2
0  40  43  53
1  64  55  46

** As no keys are there, default names are given to the indexes and columns, i.e., 0 onwards. You can, however, specify your own column names and/or index names by giving a columns sequence and/or index sequence.

Example:

import numpy as np

import pandas as pd

narr1=np.array([[40,43,53],[64,55,46]],np.int32)

dtf1 = pd.DataFrame(narr1,columns=['First','Second','Three'], index=['A','B'])

print(dtf1)

Output:

   First  Second  Three
A     40      43     53
B     64      55     46

If the rows of the ndarray differ in length, i.e., if the number of elements in each row differs, then Python will create just a single column in the dataframe object and the type of the column will be considered object.


Example:

import numpy as np

import pandas as pd

narr1=np.array([[40,43],[64,55,46], [46.2,56.2]])

dtf1 = pd.DataFrame(narr1)

print(dtf1)

Output:

** A single column is created this time because the lengths of the rows of the ndarray did not match.

3. Creating a DataFrame object from a 2D dictionary with values as Series Object

Example:

import numpy as np

import pandas as pd

population=pd.Series([10927986,12691836,4631392,4328063],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

AvgIncome = pd.Series([7216781092,8508781269,4226785362,5261784321],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

dict2 = {0 : population , 1 : AvgIncome}

dtf2 = pd.DataFrame(dict2)

print(dtf2)

Output:

** The DataFrame object created (dtf2) has its columns assigned from the keys of the dictionary object and its index assigned from the indexes of the Series objects, which are the values of the dictionary object.

4. Creating a DataFrame Object from another DataFrame Object

Example:

import numpy as np

import pandas as pd

population=pd.Series([10927986,12691836,4631392,4328063],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

AvgIncome = pd.Series([7216781092,8508781269,4226785362,5261784321],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

dict2 = {0 : population , 1 : AvgIncome}

dtf2 = pd.DataFrame(dict2)

print(dtf2)

dtf3= pd.DataFrame(dtf2)

print(dtf3)



Output:

DataFrame Attributes

When you create a DataFrame object, all information related to it (such as its size, its datatype etc.) is available through

attributes. You can use these attributes in the following format to get information about the dataframe object.

<DataFrame object>.<attribute name>

(a) Retrieving index(axis 0), Columns(axis 1), axes’ details and data type of columns

Example:

import numpy as np

import pandas as pd

population=pd.Series([10927986,12691836,4631392,4328063],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

AvgIncome = pd.Series([7216781092,8508781269,4226785362,5261784321],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

dict2 = {0 : population , 1 : AvgIncome}

dtf2 = pd.DataFrame(dict2)

print(dtf2)

print(dtf2.index)

print(dtf2.columns)

print(dtf2.axes)

print(dtf2.dtypes)

Output:


(b) Retrieving size(number of elements), shape, number of dimensions

Use the attributes size, shape and ndim to get the number of elements, the shape (dimensionality) and the number of axes, respectively, of a dataframe object, e.g.

Example:

import numpy as np

import pandas as pd

population=pd.Series([10927986,12691836,4631392,4328063],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

AvgIncome = pd.Series([7216781092,8508781269,4226785362,5261784321],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

dict2 = {0 : population , 1 : AvgIncome}

dtf2 = pd.DataFrame(dict2)

print(dtf2)

print(dtf2.size)

print(dtf2.shape)

print(dtf2.ndim)

Output:

(c) Checking for emptiness of a dataframe or presence of NaNs in a dataframe

Use the attribute empty to check for the emptiness of a dataframe,

e.g.

import numpy as np

import pandas as pd

population=pd.Series([10927986,12691836,4631392,4328063],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

AvgIncome = pd.Series([7216781092,8508781269,4226785362,5261784321],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])


dict2 = {0 : population , 1 : AvgIncome}

dtf2 = pd.DataFrame(dict2)

print(dtf2)

print(dtf2.empty)

Output:

(d) Getting number of rows in a dataframe

The len(<DF Object>) will return the number of rows in a dataframe.

(e) Getting count of non-NA values in dataframe

You can use count( ) with a dataframe to get the count of non-NaN values, but count( ) with a dataframe is a little more elaborate:

I. If you do not pass any argument or pass 0 (the default is 0), then it returns the count of non-NaN values for each column.

II. If you pass the argument 1, then it returns the count of non-NaN values for each row.

Example:

import numpy as np

import pandas as pd

population=pd.Series([10927986,12691836,4631392,4328063],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

AvgIncome = pd.Series([7216781092,8508781269,4226785362,5261784321],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

dict2 = {0 : population , 1 : AvgIncome}

dtf2 = pd.DataFrame(dict2)

print(dtf2)

print(len(dtf2))

print(dtf2.count(0))

print(dtf2.count(1))

Output:


(f) Transposing a Dataframe

You can transpose a dataframe, i.e., swap its indexes and columns, by using the attribute T,

e.g.

import numpy as np

import pandas as pd

population=pd.Series([10927986,12691836,4631392,4328063],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

AvgIncome = pd.Series([7216781092,8508781269,4226785362,5261784321],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

dict2 = {0 : population , 1 : AvgIncome}

dtf2 = pd.DataFrame(dict2)

print(dtf2)

print(dtf2.T)

Output:

SELECTING OR ACCESSING DATA

1. Selecting/Accessing a Column

Single column at a time

<DataFrame object> [<Column name>]

Or

<DataFrame object>.<Column name>

Multiple columns at a time

<DataFrame object>[ [columnname , columnname, ……….]]


Example:

import numpy as np

import pandas as pd

population=pd.Series([10927986,12691836,4631392,4328063],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

AvgIncome = pd.Series([7216781092,8508781269,4226785362,5261784321],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

dict2 = {'Population' : population , 'Avg. Income' : AvgIncome}

dtf2 = pd.DataFrame(dict2)

print(dtf2)

print("========")

print(dtf2.Population)

print("========")

print(dtf2[['Population','Avg. Income']])

Output:

2. Selecting/Accessing a SubSet from a Dataframe using Row/Column Name

For this purpose, you can use following syntax to select/access a subset from a dataframe object:

<DataFrameObject>.loc [<startrow>: <endrow>, <startcolumn> :<endcolumn>]

I. To access a row, just give the row name/label like this : <DF Object>.loc[<row label> , : ]

Make sure not to miss the COLON AFTER THE COMMA.

II. To access multiple rows, use : <DF object>.loc[<start row> : <end row> , : ]

Make sure not to miss the COLON AFTER THE COMMA.

III. To access selective columns, use : <DF object>.loc[ : , <start column> : <end column>]

IV. To access a range of columns from a range of rows, use:

<DF object>.loc[<start row> : <end row> , <start column> : <end column>]

Example:


import numpy as np

import pandas as pd

population=pd.Series([10927986,12691836,4631392,4328063],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

AvgIncome = pd.Series([7216781092,8508781269,4226785362,5261784321],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

dict2 = {'Population' : population , 'Avg. Income' : AvgIncome}

dtf2 = pd.DataFrame(dict2)

print(dtf2)

print("==Accessing Single row==")

print(dtf2.loc['Delhi' , :])

print(dtf2.loc['Kolkata' ,:])

print("==Accessing Multiple rows==")

print(dtf2.loc['Mumbai' : 'Chennai' , :])

print("==Accessing Columns==")

print(dtf2.loc[ : , 'Population'])

print("==Accessing range of columns and rows==")

print(dtf2.loc['Delhi' : 'Mumbai' , 'Population' : 'Avg. Income'])

Output:

3. Obtaining a Subset/Slice from a Dataframe using Row/Column Numeric Index/Position

Sometimes your dataframe object does not contain row or column labels, or you may not remember them. In such cases, you can extract a subset from the dataframe using the row and column numeric positions, but this time you will use iloc instead of loc. iloc means integer location.

<DF object>.iloc[<startrow index>: <endrow index>, <startcolumn index> :<endcolumn index>]


** endindex is excluded here.

Example:

import numpy as np

import pandas as pd

population=pd.Series([10927986,12691836,4631392,4328063],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

AvgIncome = pd.Series([7216781092,8508781269,4226785362,5261784321],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

dict2 = {'Population' : population , 'Avg. Income' : AvgIncome}

dtf2 = pd.DataFrame(dict2)

print(dtf2)

print(dtf2.iloc[0:2,0:1])

Output:

4. Selecting/Accessing Individual Value

(i) Either give the name of the row or its numeric index in square brackets with the column, i.e., as this:

<DF object>.<column>[<row name or row numeric index>]

(ii) You can use at or iat attribute with DF object as shown below:

<DF object>.at [<row name>, <column name>]

Or

<DF object>. iat [<numeric row index>, <numeric column index>]

Example:

import numpy as np

import pandas as pd

population=pd.Series([10927986,12691836,4631392,4328063],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

AvgIncome = pd.Series([7216781092,8508781269,4226785362,5261784321],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

dict2 = {'Population' : population , 'Avg. Income' : AvgIncome}

dtf2 = pd.DataFrame(dict2)

print(dtf2)

print(dtf2.Population['Delhi'])

print(dtf2.at['Delhi', 'Population'])

print(dtf2.iat[0,0])

Output:


5. Assigning/Modifying Data Values in Dataframe

(a) To change or modify a single data value, use syntax :

<DF>.<columnname>[<row name/label>] = <value>

Example:

import numpy as np

import pandas as pd

population=pd.Series([10927986,12691836,4631392,4328063],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

AvgIncome = pd.Series([7216781092,8508781269,4226785362,5261784321],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

dict2 = {'Population' : population , 'Avg. Income' : AvgIncome}

dtf2 = pd.DataFrame(dict2)

print(dtf2)

dtf2.Population['Mumbai'] = 63819621

print(dtf2)

Output:

6. Adding Columns , rows and Deleting Columns in DataFrames

(a) To change or add a column, use syntax :

<DF object>[< column name >] = <new value>

If the given column name does not exist in the dataframe, then a new column with this name is added. All the rows of this new column get the same given value.

Another way of adding a column to a dataframe:

<DF object>.loc[ : , <column name>] = <values for column>

(b) Similarly, to change or add a row, use syntax:

<DF object>.loc[<row name> , : ] = <new value>

Likewise, if there is no row with such a row label, then Python adds a new row with this row label and assigns the given value to all of its columns.

(c) If you want to add a new column that has different values for all its rows, then you can assign the data values

for each row of the column in form of a list, e.g.

<DF Object>[<column name>] = [<value>, <value>, ……]

Example:

import numpy as np

import pandas as pd

population=pd.Series([10927986,12691836,4631392,4328063],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

AvgIncome = pd.Series([7216781092,8508781269,4226785362,5261784321],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

dict2 = {'Population' : population , 'Avg. Income' : AvgIncome}

dtf2 = pd.DataFrame(dict2)

print(dtf2)

print("==Adding Column==")

dtf2['density']=1219

print(dtf2)

print("==Adding Row==")

dtf2.loc['Bangalore', : ] = 1200

print(dtf2)

print("==Adding Column with different values==")

dtf2['density']= [1500, 1219 , 1630, 1050, 1100]

print(dtf2)

Output:

7. Deleting Columns and Rows

To delete a column, you use the del statement like this :

del <DF object>[<column name>]

To delete rows from a dataframe, you can use :

<DF>.drop(<DF object>.index[[<index value(s)>]])


e.g.

import numpy as np

import pandas as pd

population=pd.Series([10927986,12691836,4631392,4328063],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

AvgIncome = pd.Series([7216781092,8508781269,4226785362,5261784321],\

index=['Delhi', 'Mumbai','Kolkata','Chennai'])

dict2 = {'Population' : population , 'Avg. Income' : AvgIncome}

dtf2 = pd.DataFrame(dict2)

print("Dataframe before deletion of column")

dtf2['density']= [1500, 1219 , 1630, 1050]

print(dtf2)

print("Dataframe after deletion of column")

del dtf2['density']

print(dtf2)

print("Dataframe after deletion of first and third row")

print(dtf2.drop(dtf2.index[[0,2]]))

Output:

ITERATING OVER A DATAFRAME

Two methods are used for iterating over a given dataframe:

1. The method <DF>.iterrows( ) views a dataframe in the form of horizontal subsets, i.e., row-wise. Each horizontal subset is in the form of (row index, Series), where the Series contains all the column values for that row index.

2. The method <DF>.iteritems( ) views a dataframe in the form of vertical subsets, i.e., column-wise. Each vertical subset is in the form of (col index, Series), where the Series contains all the row values for that column index. (In recent versions of pandas, iteritems( ) has been renamed items( ).)


Using pandas.iterrows( ) Function

Example:

import numpy as np

import pandas as pd

diSales = { 2015 : {'Qtr1': 34500 , 'Qtr2' : 56000 , 'Qtr3' : 47000 , 'Qtr4': 49000}, \

2016 : {'Qtr1': 44900 , 'Qtr2' : 46100 , 'Qtr3' : 57000 , 'Qtr4': 59000 }, \

2017 : { 'Qtr1': 54500 , 'Qtr2' : 51000 , 'Qtr3' : 57000 , 'Qtr4': 58500}

}

df1 = pd.DataFrame(diSales)

print(df1)

for (row,rowSeries) in df1.iterrows( ) :

print("Row index", row)

print("Containing:")

print(rowSeries)

Output:

** The backslashes ( \ ) in the code above are command continuation marks.

** Each row is taken one at a time in the form of (row, rowSeries), where row stores the row-index and rowSeries stores all the values of that row in the form of a Series object.

Using pandas.iteritems( ) Function


Unlike iterrows( ), iteritems( ) brings you vertical subsets from a dataframe in the form of column index and a

series object containing values for all rows in that column.

Example:

import numpy as np

import pandas as pd

diSales = { 2015 : {'Qtr1': 34500 , 'Qtr2' : 56000 , 'Qtr3' : 47000 , 'Qtr4': 49000}, \

2016 : {'Qtr1': 44900 , 'Qtr2' : 46100 , 'Qtr3' : 57000 , 'Qtr4': 59000 }, \

2017 : { 'Qtr1': 54500 , 'Qtr2' : 51000 , 'Qtr3' : 57000 , 'Qtr4': 58500}

}

df1 = pd.DataFrame(diSales)

print(df1)

for (col,colSeries) in df1.iteritems( ) :

print("Column index", col)

print("Containing:")

print(colSeries)

Output:

** Each column is taken one at a time in the form of (col, colSeries), where col stores the col-index and colSeries stores all the values of that column in the form of a Series object.

Binary Operations in a DataFrame


In a binary operation, the data from the two dataframes are aligned on the basis of their row and column indexes; for matching row/column indexes the given operation is performed, and for non-matching row/column indexes the value NaN is stored in the result.

You can perform the add binary operation on two dataframe objects using either the + operator or add( ), as per the syntax <DF1>.add(<DF2>), which means <DF1> + <DF2>; or by using radd( ), i.e., reverse add, as per the syntax <DF1>.radd(<DF2>), which means <DF2> + <DF1>.

You can perform the subtract binary operation on two dataframe objects using either the – (minus) operator or sub( ), as per the syntax <DF1>.sub(<DF2>), which means <DF1> – <DF2>; or by using rsub( ), i.e., reverse subtract, as per the syntax <DF1>.rsub(<DF2>), which means <DF2> – <DF1>.

You can perform the multiply binary operation on two dataframe objects using either the * operator or mul( ), as per the syntax <DF1>.mul(<DF2>).

You can perform the division binary operation on two dataframe objects using either the / operator or div( ), as per the syntax <DF1>.div(<DF2>).

Example:

import numpy as np

import pandas as pd

d1 = { 'A' : [1,4,7], 'B' : [2,5,8] , 'C' : [3, 6 ,9]}

df1 = pd.DataFrame(d1)

d2 = { 'A' : [10,40,70], 'B' : [20,50,80] , 'C' : [30, 60 ,90]}

df2 = pd.DataFrame(d2)

d3 = { 'A' : [100,400], 'B' : [200,500] , 'C' : [300, 600]}

df3 = pd.DataFrame(d3)

d4 = { 'A' : [1000,4000,7000], 'B' : [2000,5000,8000]}

df4 = pd.DataFrame(d4)

print(df1)

print(df2)

print(df3)

print(df4)

print("==Addition Operation==")

print(df1+df2)

print(df1.add(df3))

print(df1+df4)

print("==Subtraction Operation==")

print(df1-df2)

print(df1.sub(df3))

print(df1-df4)

print("==Multiply Operation==")

print(df1*df2)

print(df1.mul(df3))

print(df1*df4)


print("==Division Operation==")

print(df1/df2)

print(df1.div(df3))

Output:

Note:


* The results of df1 + df3 have the sum values displayed as float numbers. The reason is that whenever any column of a dataframe has a NaN value in it, its type is automatically changed to floating point, because in Python integer types cannot store NaN values.

* df1 – df2 is equal to df1.sub(df2), but df2 – df1 is equal to df1.rsub(df2).

* When + is performed on string values, it concatenates them.

* The data types of values in the dataframes should be compatible with the binary operator.

Matching and Broadcasting Operations

While performing arithmetic operations on dataframes, data is aligned on the basis of matching indexes and then the arithmetic is performed; for non-overlapping indexes, the arithmetic operation results in NaN (Not a Number). This default behaviour of data alignment on the basis of matching indexes is called MATCHING.

BROADCASTING IN DATAFRAMES


Consider two dataframes (namely row and col) created by extracting the middle row and the middle column, respectively, from a dataframe df1. Let us see how these will perform an arithmetic operation, say subtraction, with df1, i.e., what will be produced by df1 – row (or df1.sub(row)) and df1 – col (or df1.sub(col)). In each case, the single row or column is internally transformed, i.e., replicated (broadcast) to match the shape of df1, and then element-wise subtraction takes place.
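Since the figures are not reproduced in this transcript, the following sketch (with assumed data) shows the same idea. A single row is broadcast across all the rows by matching on the column labels; for a column, axis=0 must be given so that it is matched against the row index:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
row = df1.iloc[1]              # middle row as a Series (its index: A, B, C)
col = df1['B']                 # middle column as a Series (its index: 0, 1, 2)
print(df1 - row)               # same as df1.sub(row): row broadcast down all rows
print(df1.sub(col, axis=0))    # col broadcast across the columns, matched on row index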


Handling Missing Data

Missing Values

These are values that cannot contribute to any computation, or we can say that missing values are values that carry no computational significance, e.g. the NaN value.

Why handle missing data?

The simple answer is that data science is all about inferring accurate results/predictions from the given data; if missing values are not handled properly, they may affect the overall result, and the researchers may end up drawing inaccurate inferences from the data.

For detecting and handling missing data, we shall use sample objects having some missing data: a Series object namely dtSer and a dataframe namely patientDF.

Detecting/Filtering Missing Data

You can use isnull( ) to detect missing values in a pandas object; it returns True or False for each value in the pandas object, indicating whether it is a missing value or not. It can be used as :


<PandaObject>.isnull( )

<PandaObject> means it is applicable to both Series as well as DataFrame objects.

Consider the following outputs:
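The sample objects dtSer and patientDF are not reproduced in this transcript; assuming small objects like the following, isnull( ) works as shown:

import numpy as np
import pandas as pd

dtSer = pd.Series([71, np.nan, 106, 98])                    # assumed sample Series
patientDF = pd.DataFrame({'Name': ['Kavya', 'Arjun', np.nan],
                          'Temp': [98.6, np.nan, np.nan]})  # assumed sample dataframe
print(dtSer.isnull())       # True for each missing value
print(patientDF.isnull())   # element-wise True/False for the whole dataframe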

If you want to filter data that is not missing, i.e., non-null data, then you can use the following for a Series object:

<Series object>[SeriesObject.notnull( )]

E.g.
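Continuing with the assumed dtSer (redefined here so the snippet runs on its own):

import numpy as np
import pandas as pd

dtSer = pd.Series([71, np.nan, 106, 98])
print(dtSer[dtSer.notnull()])    # keeps only the non-null entries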

** The above method will not work for a dataframe object.

Handling Missing Data – Dropping Missing Values

To drop missing values you can use dropna( ) in the following three ways:

(a) <PandaObject>.dropna( )

This will drop all the rows that have NaN values in them, even a row with a single NaN value in it, e.g.

(b) <DF>.dropna(how=’all’)

With argument how=’all’ , it will drop only those rows that have all NaN values, i.e. , no value is non-

null in those rows, e.g.

(c) <DF>.dropna(axis=1)


With the argument axis=1, it will drop the columns that have any NaN values in them, e.g.
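A combined sketch of the three forms, using the assumed patientDF:

import numpy as np
import pandas as pd

patientDF = pd.DataFrame({'Name': ['Kavya', 'Arjun', np.nan],
                          'Temp': [98.6, np.nan, np.nan]})
print(patientDF.dropna())            # drops every row containing any NaN
print(patientDF.dropna(how='all'))   # drops only the rows that are all NaN
print(patientDF.dropna(axis=1))      # drops the columns containing any NaN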

Handling Missing Data – Filling Missing Values

Though dropna( ) removes the null values, you can lose other non-null data with it too. To avoid this, you may want to fill the missing data with some appropriate value of your choice. For this purpose you can use fillna( ) in the following ways:

(a) <PandaObject>.fillna(<n>)

This will fill all NaN values in a pandas object with the given <n> value, e.g.

(b) Using dictionary with fillna( ) to specify fill values for each column separately

You can create a dictionary that defines fill values for each of the columns, and then pass this dictionary as an argument to fillna( ); Pandas will fill the specified value for each column defined in the dictionary. It will leave untouched or unchanged those columns that are not in the dictionary.

The syntax to use this format of fillna( ) is :

<DF>.fillna(<dictionary having fill values for columns>)

e.g.
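A sketch of both forms, again with the assumed patientDF:

import numpy as np
import pandas as pd

patientDF = pd.DataFrame({'Name': ['Kavya', 'Arjun', np.nan],
                          'Temp': [98.6, np.nan, np.nan]})
print(patientDF.fillna(0))                                  # every NaN replaced with 0
print(patientDF.fillna({'Name': 'Unknown', 'Temp': 98.4}))  # a different fill value per column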

More operations with Pandas Objects


1. Comparisons of Pandas Object

For understanding this, we shall use some sample dataframe objects (say dfc1 and dfc3):

Normal comparison operators can be used as shown below:

Now, look carefully at another add operation on the dfc1 and dfc3 objects. Even though both results appear just the same, their comparison returned False for some values. The reason behind this is that NaN values do not compare as equal.

To compare objects having NaN values, it is better to use equals( ), which returns True when two NaN values are compared for equality. E.g.

Similarly, you can compare pandas objects with scalar values; use equals( ) if the values involve NaNs:
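A minimal sketch of this behaviour (data assumed):

import numpy as np
import pandas as pd

s1 = pd.Series([1.0, np.nan])
s2 = pd.Series([1.0, np.nan])
print(s1 == s2)        # second entry is False: NaN never compares equal with ==
print(s1.equals(s2))   # True: equals( ) treats two NaNs as equal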


** For comparison, two DataFrame objects should match in lengths and in columns.

2. Comparing Series

Comparison of two series objects is also based on same points as Dataframe:

For comparison, two Series objects should match in lengths and in indexes.

NaN values cannot be compared using comparison operators.

equals( ) results in True while comparing two NaN values.

Boolean Reductions


With Boolean reduction, you can get the overall result of a row or column as a single True or False; it summarizes the Boolean result for an axis. For this purpose Pandas offers the following Boolean reduction functions or attributes:

1. empty – this attribute indicates whether a DataFrame is empty. It is used as :

<DF>.empty

Pandas will return True or False according to whether the dataframe is empty or not. An empty dataframe does not contain any value, not even NaN; a dataframe having all NaN values is also not considered empty.

2. any( ) – This function returns True if any element is True over the requested axis. By default, it checks whether any value in a column (default axis 0) meets this criterion; i.e., if any of the values along the specified axis is True, this will return True. It is used as per the following syntax:

<DataFrame comparison result object>.any(axis=None)

3. all( ) – Unlike any( ), all( ) will return True only if all the values on an axis are True according to the given comparison. It is used as per the syntax:

<DataFrame comparison result object>.all(axis=None)
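A small sketch of all three reductions (data assumed):

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 0]})
print(pd.DataFrame().empty)   # True: a dataframe with no values at all
print((df > 0).any())         # column-wise: is any value > 0?  (A: True, B: True)
print((df > 0).all())         # column-wise: are all values > 0?  (A: True, B: False)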


Combining DataFrames

1. Combining Dataframes using combine_first( )

This combines two dataframes using the method of patching the data. Patching the data means that if a certain cell in one dataframe has missing data and the corresponding cell (the one with the same index and column id) in the other dataframe has some valid data, then this method picks the valid data from the second dataframe and patches it into the first dataframe, so that it too has valid data in that cell. combine_first( ) is used as per the syntax:

<DF>.combine_first(<DF2>)
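A minimal sketch of patching (data assumed):

import numpy as np
import pandas as pd

dfA = pd.DataFrame({'A': [1.0, np.nan], 'B': [np.nan, 4.0]})
dfB = pd.DataFrame({'A': [10, 20], 'B': [30, 40]})
print(dfA.combine_first(dfB))   # NaN cells of dfA are patched with values from dfB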


2. Combining Dataframes using concat( )

The concat( ) can concatenate two dataframes along the rows or along the columns. This method is useful if the two dataframes have similar structures. It is used as per the following syntax:

pd.concat([<df1>, <df2>])

pd.concat([<df1>, <df2>], ignore_index=True)

pd.concat([<df1>, <df2>], axis=1)

If you skip the axis=1 argument, it will join the two dataframes along the rows, i.e., the result will be the union of rows from both the dataframes. E.g.

Consider the following dataframes:

As you can see above, by default concat( ) concatenates all the rows from both the dataframes (sd1 and sd2), keeping their original row indexes. But if you do not want this mechanism for row indexes and want new row indexes generated from 0 to n-1 instead, then you can give the argument ignore_index=True, e.g.


By default it concatenates along the rows; to concatenate along the columns, you can give the argument axis=1, as shown below:
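Since the original dataframes sd1 and sd2 are not shown in this transcript, a sketch with assumed data:

import pandas as pd

sd1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
sd2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
print(pd.concat([sd1, sd2]))                      # rows of both, original indexes kept
print(pd.concat([sd1, sd2], ignore_index=True))   # rows of both, fresh 0..n-1 index
print(pd.concat([sd1, sd2], axis=1))              # side by side, along the columns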

3. Combining Dataframes using merge( )

Sometimes, you may want to combine two dataframes such that rows with some common value are merged together in the final result. For this, Pandas provides the merge( ) function, in which you can specify the field on the basis of which you want to combine the two dataframes. It is used as per the syntax:

pd.merge(<DF1>, <DF2>)

pd.merge(<DF1>, <DF2>, on=<field_name>)

If you skip the argument on=<field_name>, then the merge is performed on the common fields of the two dataframes; but you can explicitly specify the field on the basis of which you want to merge the two dataframes.

E.g.
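A sketch with assumed data, where Dept is the common field:

import pandas as pd

emp = pd.DataFrame({'Dept': ['Sales', 'HR'], 'Name': ['Esha', 'Raj']})
dept = pd.DataFrame({'Dept': ['Sales', 'HR'], 'City': ['Delhi', 'Mumbai']})
print(pd.merge(emp, dept))              # merges on the common field Dept
print(pd.merge(emp, dept, on='Dept'))   # same, with the field named explicitly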

***********************


CHAPTER 12 – DATA TRANSFER BETWEEN FILES, SQL DATABASES AND DATAFRAMES

Transferring Data between .csv Files and DataFrames

The acronym CSV is short for Comma Separated Values. The CSV format refers to tabular data that has been saved as plain text, where values are separated by commas.

In CSV format:

1. Each row of the table is stored in one line, i.e., the number of lines in a CSV file is equal to the number of rows in the table.

2. The field values of a row are stored together with commas after every field value; but after the last field's value in a line/row, no comma is given, just the end of line.

Advantages of CSV format

A simple, compact and ubiquitous format for data storage.

A common format for data interchange.

It can be opened in popular spreadsheet packages like MS-Excel, Calc etc.

Nearly all spreadsheets and databases support import/export to csv format.

Loading Data from CSV to Dataframes

Python's Pandas library offers two functions, read_csv( ) and to_csv( ), that help you bring data from a CSV file into a dataframe and write a dataframe's data to a CSV file, respectively.

Suppose we have the following CSV file, sample.csv, shown in Notepad as well as in MS Excel.

Before doing anything with Pandas, like always, make sure to import the Pandas library by giving command:

import pandas as pd

Reading From a CSV File to Dataframe

You can use read_csv( ) function to read data from a CSV file in your dataframe by using the function as per

following syntax:

<DF> = pandas.read_csv(<filepath>)

That is, if you give the following command after importing the pandas library (suppose the sample file is stored in the c:\data folder):

df = pd.read_csv("c:\\data\\sample.csv")

Now if you print dataframe df, Python will show:


Reading CSV File and Specifying Own Column Names

Case I :

But you may have a CSV file that does not have a top row containing column headers (as shown below):

Now if you read such a file by just giving the filepath, it will take the top row as the column headers, e.g., as shown below:

df2 = pd.read_csv("c:\\data\\mydata.csv")

Now if you print dataframe df2, Python will show :

But the top row (1, Sarah, Kapur) is data, not column headings. For such a situation, you can specify your own column headings in read_csv( ) using the names argument, as per the following syntax:

<DF> = pandas.read_csv(<filepath>, names=<sequence containing column names>)

That is, you need to modify the above statement as:

df2 = pd.read_csv("c:\\data\\mydata.csv", names=["Roll_no", "First_Name", "Last_Name"])

And now when you print df2, Python will show:


Case II:

If you do not want the first row to be used as a header and at the same time you do not want to specify column headings, but rather go with the default column headings 0, 1, 2, 3, ..., then simply give the argument header=None in read_csv( ), i.e., as:

<DF> = pandas.read_csv(<filepath>, header=None)

For example, the following statement will not consider row 1 as column headings, and the default column headings will be used.

df3 = pd.read_csv("c:\\data\\mydata.csv", header=None)

The dataframe df3 will be storing data as shown below:

Case III:

Now comes a situation where the first row of the CSV file stores some column headings but you do not want to use them, e.g. as shown below:

In that case, in read_csv( ), if you simply give

(i) the argument names=<sequence of column headings>, then the first row (the unwanted headings) will also be read as data, e.g.

df4 = pd.read_csv("C:\\data\\mydata.csv", names=["Roll_no", "Name", "Marks"])

(ii) or the argument header=None, then it will take the default headings 0, 1, 2, ..., which you do not want:

df3 = pd.read_csv("C:\\data\\mydata.csv", header=None)


The solution to this problem is to make it skip row 1 while fetching data from the CSV file. For this, you need to give two arguments along with the file path: one for the column headings, i.e., names=<column headings sequence>, and another, skiprows=<n>, i.e.,

<DF> = pandas.read_csv(<filepath>, names=<column names sequence>, skiprows=<n>)

The number given with skiprows tells how many rows are to be skipped from the CSV file while reading data. Thus, you can give the statement as below:

df5 = pd.read_csv("C:\\data\\mydata.csv", names=["Roll_no", "Name", "Marks"], skiprows=1)

Reading Specified Number of Rows from CSV file

Giving the argument nrows=<n> in read_csv( ) will read the specified number of rows from the CSV file, e.g.
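For instance, assuming the sample.csv used earlier:

import pandas as pd

df6 = pd.read_csv("c:\\data\\sample.csv", nrows=3)   # reads only the first 3 rows of data
print(df6)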

Reading from CSV files having Separator Different from Comma

Some CSV files are created with a separator character different from the comma, such as a semicolon (;) or a pipe symbol (|). To read data from such CSV files, you need to specify the additional argument sep=<separator character>. If you skip this argument, then the default separator character (comma) is assumed.

e.g.
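For instance, assuming a pipe-separated file mydata2.csv (a hypothetical name):

import pandas as pd

df8 = pd.read_csv("c:\\data\\mydata2.csv", sep='|')   # '|' is the separator in this file
print(df8)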

Storing Dataframe’s Data to CSV File

For this purpose, Python Pandas provides the to_csv( ) function, which saves the data of a dataframe in a CSV file. to_csv( ) can be used as per the following syntaxes:

<DF>.to_csv(<filepath>)

Or

<DF>.to_csv(<filepath>, sep=<separator_character> )

** The separator character must be a one-character string only.


For instance, if you have a dataframe namely df7 storing following data:

To store this data of df7, you can write:

df7.to_csv("c:\\data\\new.csv")

As no separator has been mentioned, the default separator character, the comma, will be used to separate values in the CSV file. Also, even if a file with the same name exists at the given location, it will simply be overwritten without any warning.

Handling NaN Values with to_csv( )

Suppose dataframe df7 stores data as:

Now, if you store this dataframe in a CSV file by giving the following command:

df7.to_csv("c:\\data\\new3.csv", sep="|")

then, by default, the missing/NaN values are stored as empty strings in the CSV file. You can specify your own string to be written for missing/NaN values by giving the argument na_rep=<string>; i.e., the following statement will write NULL in place of NaN values in the CSV file:

df7.to_csv("c:\\data\\new3.csv", sep="|", na_rep="NULL")


Transferring Data Between Dataframes and SQL Databases

1. SQL databases are relational databases: data is stored in relations (tables) and they make use of SQL commands, e.g. MySQL.

2. Here we will learn to store data into an SQL table and to read data from an SQL table using Python's library sqlite3, which is bundled with Python by default.

SQLite Database

It is an embedded SQL database engine, which implements a relational database management system that

is self-contained , serverless and requires zero configuration.

It is in public domain and is freely available for any type of use.

It can be downloaded from SQLite’s download page: www.sqlite.org/download.html

COMMANDS UNDER SQLITE

1. Creating Database in sqlite

.open <databasename>.db

This command works in two ways:

If the given database does not exist, sqlite will create one and then open it for processing.

If the given database exists, sqlite will open the database for processing.

2. Creating Table

To create a table, you can write SQL statement CREATE TABLE as per following syntax:

CREATE TABLE <table name> (

<column1> <datatype> <constraint>,

<column2> <datatype> <constraint>,

…………

) ;

You can use any of the following, most common data types:

INTEGER – for non-fractional numbers, whole numbers

TEXT – for storing text, alphabets, alphanumeric data

REAL – for storing fractional numbers i.e., numbers with decimals

CHAR(<n>) – for storing exact number of characters in a field.

Following SQL command will create a basic table namely student:

Sqlite > CREATE TABLE student(

Rollno INTEGER NOT NULL PRIMARY KEY,

Name TEXT NOT NULL,

Marks REAL ) ;

You can give the .tables command at the sqlite prompt to see if the table has been created in your database, i.e.,

sqlite> .tables

3. Insert Data in Table

You can insert data in an existing table using SQL command INSERT INTO as per following syntax:

INSERT INTO <tablename>

VALUES(<value for column1>, <value for column2>…..) ;

For example, following command will add a record in our Student table that we created just before:

sqlite> INSERT INTO Student

VALUES(1001, 'Esha', 67.5);


4. Querying Upon Table

You can check what data is stored in table using SELECT commands as per following syntaxes:

SELECT * FROM <table>;

SELECT * FROM <table> WHERE <condition>;

SELECT <column1>, <column2> FROM <table> WHERE <condition>;

Bringing Data from SQL Database Table into a Dataframe

Step 1: Import the sqlite3 library by giving the Python statement: import sqlite3 as sq

Once you have imported sqlite3 library, you need to do the following to bring data from a database table to a

dataframe.

Step 2: Make a connection to the SQL database. A connection is required; it works like a channel connecting Python and the database file, through which data can travel to and from the SQL database. You can make a connection to the database by giving the command:

conn = sqlite3.connect(<database name and path>)

The database that we created, new.db, was in the same folder as sqlite3, i.e., it was stored in the folder C:\sqlite3 on our computer. Thus we gave the command as:

conn = sq.connect("C:\\sqlite3\\new.db")

Now Python has established a connection to our SQL database new.db and it can be used in Python with name

conn. You will need this connection name while reading from the database.

Step 3: Once the connection to the SQL database is established, read data from a table into a dataframe using read_sql( ) as per the following syntax:

<DF> = pandas.read_sql("<SQL Statement>", <connection name>)

That is, if you give the following command, it will read all the records from the table Student of the database new.db (to which we have established the connection named conn).

df = pd.read_sql("SELECT * FROM Student;", conn)

Storing a Dataframe’s Data as a Table in an SQL Database.

Step 1: Import the pandas and sqlite3 libraries using import commands.

Step 2: Establish connection to an SQL database as explained earlier, i.e.,

conn = sqlite3.connect(<path to SQL database file>)

Step 3: Next, you can write a dataframe in the form of a table by using to_sql( ) as per the following syntax:

<DF>.to_sql(<tablename>, <connection name> [, if_exists="append" | "replace"])

Notes (for the read_sql( ) statement above):

** df is an identifier.

** You can use sqlite3 or sq as the module name, since the module was imported as sq.

** The SQL query statement must end with a semicolon.

** conn is the connection that we created in Step 2.


If you run to_sql( ) with a table name that already exists, then you must specify the argument if_exists="append" or if_exists="replace", otherwise Python will give an error. If you set the value to "append", the new data will be appended to the existing table; if you set the value to "replace", the new data will replace the old data in the given table.

e.g., if we have another dataframe dtf5 having the following data:

Now if you give following command:

dtf5.to_sql("metros", conn, if_exists="append")

Now you can check in the SQLite database (after opening the database, new.db in our case) by giving the command SELECT * FROM metros; to see if the new data has been appended.

**********************


Chapter 10 – Introducing Python Pandas (Question Bank)

Q1. What is the use of Python Pandas library?

Q2. What are ndarrays? How are they different from Python lists?

Q3. What are axes in a numpy array?

Q4. To create a sequence of numbers, NumPy provides a function namely ____________that works like

range( ) but returns arrays instead of lists.

Q5. In NumPy arrays, dimensions are called__________ and number of axes is called ______________.

Q6. Can you duplicate indexes in a series object?

Q7. What do these attributes of a Series signify? (a) size (b) itemsize (c) nbytes

Q8. If S1 is a series object then how will len(S1) and S1.count( ) behave?

Q9. Write commands to create following data structures, each having three integer elements 3,4,5 :

(a) ndarray (b) list (c) series

Q10. What are NaNs? How do you store them in data structures?

Q11. True/False: Series objects always have indexes 0 to n-1.

Q12. Given a Series that stores the area of some states in km². Write code to find out the biggest and smallest three areas from the given Series. The given Series has been created like this:

Ser1 = pd.Series([34567, 890, 450, 67892, 34677, 78902, 256711, 678291, 637632, 25723, 2367, 11789, 345, 256517])

Q13. Given are two objects, a list object namely lst1 and a Series object namely ser1, both having similar values, i.e., 2, 4, 6, 8. Find the output produced by the following statements:

(i) print(lst1 * 2) (ii) print(ser1 * 2)

Q14. Can you specify the reason behind the output produced by the code statements of previous question?

Q15. From the series of areas (given earlier, which stores areas of states in km²), find out the areas that are more than 50000 km².

**************************

