
Data cleaning and graphical presentation


May 14, 2020

1 Loading the required libraries

[1]: # basics
     import pandas as pd
     import numpy as np

     # data preparation
     import missingno as msno

     # charts
     import matplotlib.pyplot as plt
     import squarify as sq
     import seaborn as sns

     # settings
     %matplotlib inline
     plt.rcParams['figure.figsize'] = [10, 5]  # default figure size

2 Loading the data

[2]: titanik = pd.read_csv('train.csv')
     titanik.head()

[2]:    PassengerId  Survived  Pclass  \
     0            1         0       3
     1            2         1       1
     2            3         1       3
     3            4         1       1
     4            5         0       3

                                                    Name     Sex   Age  SibSp  \
     0                           Braund, Mr. Owen Harris    male  22.0      1
     1  Cumings, Mrs. John Bradley (Florence Briggs Th…  female  38.0      1
     2                            Heikkinen, Miss. Laina  female  26.0      0
     3      Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
     4                          Allen, Mr. William Henry    male  35.0      0

        Parch            Ticket     Fare Cabin Embarked
     0      0         A/5 21171   7.2500   NaN        S
     1      0          PC 17599  71.2833   C85        C
     2      0  STON/O2. 3101282   7.9250   NaN        S
     3      0            113803  53.1000  C123        S
     4      0            373450   8.0500   NaN        S

2.1 Basic analysis

[3]: titanik.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

[4]: # Clearly, some of the data is missing
     msno.matrix(titanik)

[4]: <matplotlib.axes._subplots.AxesSubplot at 0x2252efac748>

2.2 Initial data cleaning

[5]: # The passenger's name (Name), cabin (Cabin) and similar fields are
     # unlikely to affect whether the passenger survived
     titanik.drop(columns=['PassengerId', 'Name', 'Cabin', 'Ticket'], inplace=True)  # drop the columns

[6]: # Two values are missing for the port of embarkation.
     # We fill them with the most frequent port of embarkation
     najcesce = titanik.Embarked.mode()[0]
     print(najcesce)
     titanik.Embarked = titanik.Embarked.fillna(value=najcesce)

S
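The fill step above can be tried in isolation on a tiny, made-up Series. This is only a minimal sketch: the `ports` values are illustrative and are not taken from `train.csv`.

```python
import pandas as pd

# Hypothetical port codes with two gaps (not the real Embarked column)
ports = pd.Series(['S', 'C', None, 'S', 'Q', None, 'S'])

most_common = ports.mode()[0]       # mode() skips missing values by default
filled = ports.fillna(most_common)  # every gap receives the most frequent code

print(most_common)
print(filled.isna().sum())
```

Filling with the mode keeps the column categorical and changes only the rows that were missing, which is a reasonable choice when just a handful of values are absent.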

[7]: msno.matrix(titanik)

[7]: <matplotlib.axes._subplots.AxesSubplot at 0x2252f0c5ec8>

[8]: # Number of unique values in each column
     titanik.nunique()

[8]: Survived      2
     Pclass        3
     Sex           2
     Age          88
     SibSp         7
     Parch         7
     Fare        248
     Embarked      3
     dtype: int64

3 Data visualization

[9]: # Plot the number of rows in each group
     sns.countplot(x='Pclass', data=titanik)

[9]: <matplotlib.axes._subplots.AxesSubplot at 0x2252f13a788>

[10]: sns.distplot(titanik.Age.dropna())

[10]: <matplotlib.axes._subplots.AxesSubplot at 0x2252f040a88>

[11]: # Pick out the different kinds of columns
      # (kategorije = categories, vrednosti = values)
      kategorije = ['Survived', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']
      vrednosti = ['Age', 'Fare']

3.1 Python generators

[12]: for i in range(0, 10):
          print(i)

0
1
2
3
4
5
6
7
8
9

[13]: for i in range(0, len(kategorije)):
          print(i)

0
1
2
3
4
5

[14]: for i in range(0, len(kategorije)):
          print(kategorije[i])

Survived
Pclass
Sex
SibSp
Parch
Embarked

3.1.1 What else we could have done

[15]: # all columns
      for col in titanik.columns:
          print(col)

Survived
Pclass
Sex
Age
SibSp
Parch
Fare
Embarked

[16]: # list of columns with their number of unique values
      [(col, titanik[col].nunique()) for col in titanik.columns]

[16]: [('Survived', 2),
       ('Pclass', 3),
       ('Sex', 2),
       ('Age', 88),
       ('SibSp', 7),
       ('Parch', 7),
       ('Fare', 248),
       ('Embarked', 3)]

[17]: # how to build a DataFrame from a list
      pd.DataFrame.from_records([('Kolona', 10), ('Druga', 15)])

[17]:         0   1
      0  Kolona  10
      1   Druga  15

[18]: # "Better" column names
      pd.DataFrame.from_records([('Kolona', 10), ('Druga', 15)], columns=['Naziv', 'Broj'])

[18]:     Naziv  Broj
      0  Kolona    10
      1   Druga    15

[19]: # Build the DataFrame
      niz = [(col, titanik[col].nunique()) for col in titanik.columns]
      unikatno = pd.DataFrame.from_records(niz, columns=['Kolona', 'Unikatno'])
      # Display it sorted
      unikatno.sort_values(by=['Unikatno'])

[19]:      Kolona  Unikatno
      0  Survived         2
      2       Sex         2
      1    Pclass         3
      7  Embarked         3
      4     SibSp         7
      5     Parch         7
      3       Age        88
      6      Fare       248
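The same table of unique counts can also be produced without an intermediate list, directly from `nunique()`. The sketch below uses a small hypothetical frame `df` in place of the Titanic data; the column names `Kolona` and `Unikatno` are reused from the cell above.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic frame
df = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Pclass': [3, 1, 3, 2],
    'Fare': [7.25, 71.28, 7.92, 13.0],
})

# nunique() returns a Series indexed by column name;
# reset_index() turns it into a two-column DataFrame
unique_counts = (
    df.nunique()
      .rename_axis('Kolona')
      .reset_index(name='Unikatno')
      .sort_values(by='Unikatno')
)
print(unique_counts)
```

Both routes give the same result; the list comprehension is the more general pattern, while the chained form is shorter when the per-column value already comes from a pandas method.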

3.2 Additional information

[20]: # Histogram of ages
      sns.distplot(titanik.Age.dropna())

[20]: <matplotlib.axes._subplots.AxesSubplot at 0x2252fa24e88>

3.2.1 Decades

This is a proportional split: all ranges have the same width.

[21]: # determine the lowest and the highest age
      titanik.Age.describe()

[21]: count    714.000000
      mean      29.699118
      std       14.526497
      min        0.420000
      25%       20.125000
      50%       28.000000
      75%       38.000000
      max       80.000000
      Name: Age, dtype: float64

[22]: np.linspace(0, 100, 11)  # from 0 to 100, one point per decade

[22]: array([  0.,  10.,  20.,  30.,  40.,  50.,  60.,  70.,  80.,  90., 100.])

[23]: tacke = np.linspace(0, 90, 10)  # nobody is older than 80, so this is sufficient
      nazivi = ['0s', '10s', '20s', '30s', '40s', '50s', '60s', '70s', '80s']

      titanik['AgeGroup'] = pd.cut(titanik.Age.dropna(), tacke, labels=nazivi)
      titanik.head(10)

[23]:    Survived  Pclass     Sex   Age  SibSp  Parch     Fare Embarked AgeGroup
      0         0       3    male  22.0      1      0   7.2500        S      20s
      1         1       1  female  38.0      1      0  71.2833        C      30s
      2         1       3  female  26.0      0      0   7.9250        S      20s
      3         1       1  female  35.0      1      0  53.1000        S      30s
      4         0       3    male  35.0      0      0   8.0500        S      30s
      5         0       3    male   NaN      0      0   8.4583        Q      NaN
      6         0       1    male  54.0      0      0  51.8625        S      50s
      7         0       3    male   2.0      3      1  21.0750        S       0s
      8         1       3  female  27.0      0      2  11.1333        S      20s
      9         1       2  female  14.0      1      0  30.0708        C      10s
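One detail of `pd.cut` is worth knowing here: by default the intervals are closed on the right, so an age of exactly 20 lands in (10, 20] and is labelled '10s', and 80 is labelled '70s'. If the label should instead follow the first digit of the age, `right=False` can be passed. The sketch below compares the two on a few illustrative boundary ages (these are made-up values, not rows from the data set).

```python
import numpy as np
import pandas as pd

tacke = np.linspace(0, 90, 10)
nazivi = ['0s', '10s', '20s', '30s', '40s', '50s', '60s', '70s', '80s']

# Illustrative ages that sit on or near the bin edges
ages = pd.Series([0.42, 10.0, 20.0, 35.0, 80.0])

# Default: intervals (0, 10], (10, 20], ...
default_bins = pd.cut(ages, tacke, labels=nazivi)

# Alternative: intervals [0, 10), [10, 20), ...
left_closed = pd.cut(ages, tacke, labels=nazivi, right=False)

print(pd.DataFrame({'age': ages, 'default': default_bins, 'right=False': left_closed}))
```

Which convention is preferable depends on how the decades are meant to be read; the notebook's results are based on the default behaviour.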

3.2.2 Age categories

This is not a proportional split, because some ranges are wider than others.

Category     Translation (Serbian)  From   To
Missing      Nepoznato                -1    0
Infant       Dojenčad                  0    5
Child        Dete                      6   12
Teenager     Tinejdžer                13   19
Young Adult  Mladi                    20   35
Adult        Odrasli                  36   60
Elderly      Stariji                  61  100

[24]: tacke = [-1, 0, 6, 13, 20, 36, 61, 101]  # the extra edge at 101 is required
      nazivi = ['Missing', 'Infant', 'Child', 'Teenager', 'Young Adult', 'Adult', 'Elderly']

      titanik['AgeCategory'] = pd.cut(titanik.Age.fillna(-0.5), tacke, labels=nazivi)
      titanik.head(10)

[24]:    Survived  Pclass     Sex   Age  SibSp  Parch     Fare Embarked AgeGroup  \
      0         0       3    male  22.0      1      0   7.2500        S      20s
      1         1       1  female  38.0      1      0  71.2833        C      30s
      2         1       3  female  26.0      0      0   7.9250        S      20s
      3         1       1  female  35.0      1      0  53.1000        S      30s
      4         0       3    male  35.0      0      0   8.0500        S      30s
      5         0       3    male   NaN      0      0   8.4583        Q      NaN
      6         0       1    male  54.0      0      0  51.8625        S      50s
      7         0       3    male   2.0      3      1  21.0750        S       0s
      8         1       3  female  27.0      0      2  11.1333        S      20s
      9         1       2  female  14.0      1      0  30.0708        C      10s

         AgeCategory
      0  Young Adult
      1        Adult
      2  Young Adult
      3  Young Adult
      4  Young Adult
      5      Missing
      6        Adult
      7       Infant
      8  Young Adult
      9     Teenager
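Because `pd.cut` closes intervals on the right by default, ages that fall exactly on an edge (6, 13, 20, 36, 61) are assigned to the lower category, which differs slightly from the From/To table above. A minimal sketch of the two behaviours, using made-up boundary ages rather than real passengers:

```python
import pandas as pd

tacke = [-1, 0, 6, 13, 20, 36, 61, 101]
nazivi = ['Missing', 'Infant', 'Child', 'Teenager', 'Young Adult', 'Adult', 'Elderly']

# Illustrative ages: one missing value plus values on the bin edges
ages = pd.Series([None, 5.0, 6.0, 13.0, 20.0, 36.0, 61.0])
filled = ages.fillna(-0.5)  # same placeholder for missing ages as in the cell above

# Default: (-1, 0], (0, 6], (6, 13], ...
default_bins = pd.cut(filled, tacke, labels=nazivi)

# right=False: [-1, 0), [0, 6), [6, 13), ... which matches the From/To table
table_bins = pd.cut(filled, tacke, labels=nazivi, right=False)

print(pd.DataFrame({'age': ages, 'default': default_bins, 'right=False': table_bins}))
```

With `right=False` a six-year-old is a Child and a twenty-year-old a Young Adult, as the table states; with the default they are an Infant and a Teenager respectively.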

3.2.3 Family size

[25]: # siblings/spouses aboard (SibSp) + parents/children aboard (Parch)
      titanik['FamilySize'] = titanik.SibSp + titanik.Parch

3.3 Data types

[26]: # First 5 rows
      titanik.head()

[26]:    Survived  Pclass     Sex   Age  SibSp  Parch     Fare Embarked AgeGroup  \
      0         0       3    male  22.0      1      0   7.2500        S      20s
      1         1       1  female  38.0      1      0  71.2833        C      30s
      2         1       3  female  26.0      0      0   7.9250        S      20s
      3         1       1  female  35.0      1      0  53.1000        S      30s
      4         0       3    male  35.0      0      0   8.0500        S      30s

         AgeCategory  FamilySize
      0  Young Adult           1
      1        Adult           1
      2  Young Adult           0
      3  Young Adult           1
      4  Young Adult           0

[27]: # Column information
      titanik.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       891 non-null object
AgeGroup       714 non-null category
AgeCategory    891 non-null category
FamilySize     891 non-null int64
dtypes: category(2), float64(2), int64(5), object(2)
memory usage: 65.3+ KB

[28]: # Convert to bool
      titanik.Survived = titanik.Survived.map({0: False, 1: True})

      # Full port names
      port_map = {'S': 'Southampton', 'C': 'Cherbourg', 'Q': 'Queenstown'}
      titanik.Embarked = titanik.Embarked.map(port_map)

      # Categories
      for kolona in ['Sex', 'Embarked', 'AgeGroup', 'AgeCategory']:
          titanik[kolona] = titanik[kolona].astype('category')

      titanik.info()  # we now use considerably less memory

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
Survived       891 non-null bool
Pclass         891 non-null int64
Sex            891 non-null category
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       891 non-null category
AgeGroup       714 non-null category
AgeCategory    891 non-null category
FamilySize     891 non-null int64
dtypes: bool(1), category(4), float64(2), int64(4)
memory usage: 47.2 KB

[29]: # Numbers
      for kolona in ['SibSp', 'Parch', 'FamilySize']:
          titanik[kolona] = titanik[kolona].astype('category')

      titanik.info()  # we now use considerably less memory

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
Survived       891 non-null bool
Pclass         891 non-null int64
Sex            891 non-null category
Age            714 non-null float64
SibSp          891 non-null category
Parch          891 non-null category
Fare           891 non-null float64
Embarked       891 non-null category
AgeGroup       714 non-null category
AgeCategory    891 non-null category
FamilySize     891 non-null category
dtypes: bool(1), category(7), float64(2), int64(1)
memory usage: 30.0 KB

[30]: # Everything still works as expected
      titanik.head()

[30]:    Survived  Pclass     Sex   Age SibSp Parch     Fare     Embarked AgeGroup  \
      0     False       3    male  22.0     1     0   7.2500  Southampton      20s
      1      True       1  female  38.0     1     0  71.2833    Cherbourg      30s
      2      True       3  female  26.0     0     0   7.9250  Southampton      20s
      3      True       1  female  35.0     1     0  53.1000  Southampton      30s
      4     False       3    male  35.0     0     0   8.0500  Southampton      30s

         AgeCategory FamilySize
      0  Young Adult          1
      1        Adult          1
      2  Young Adult          0
      3  Young Adult          1
      4  Young Adult          0
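The memory saving from `astype('category')` comes from storing each distinct value once and keeping only small integer codes per row. A minimal sketch on a hypothetical column (three port names repeated many times; not the real data) makes the effect measurable:

```python
import pandas as pd

# Hypothetical column with only three distinct string values
ports = pd.Series(['Southampton', 'Cherbourg', 'Queenstown'] * 1000)
as_cat = ports.astype('category')

# deep=True also counts the memory held by the strings themselves
bytes_before = ports.memory_usage(deep=True)
bytes_after = as_cat.memory_usage(deep=True)

print(bytes_before, bytes_after)
print(list(as_cat.cat.categories))
```

The saving is largest for columns with few distinct values and many rows; for a column where almost every value is unique, a categorical type may not help.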

3.4 Univariate analysis

[31]: # Figure size
      fig = plt.figure(figsize=(30, 20))

      # For all categories
      for i in range(0, len(kategorije)):
          fig.add_subplot(3, 3, i+1)  # number of rows, number of columns, plot index
          sns.countplot(x=kategorije[i], data=titanik);

      # For all values
      for i in range(0, len(vrednosti)):
          fig.add_subplot(3, 3, i+1 + len(kategorije))  # skip the plots already drawn
          sns.distplot(titanik[vrednosti[i]].dropna())

[32]: # Which columns do we have, and what are their types
      titanik.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
Survived       891 non-null bool
Pclass         891 non-null int64
Sex            891 non-null category
Age            714 non-null float64
SibSp          891 non-null category
Parch          891 non-null category
Fare           891 non-null float64
Embarked       891 non-null category
AgeGroup       714 non-null category
AgeCategory    891 non-null category
FamilySize     891 non-null category
dtypes: bool(1), category(7), float64(2), int64(1)
memory usage: 30.0 KB

[33]: kolone = ['Pclass', 'Sex', 'Age']  # Sex will be skipped because it is a category
      titanik[kolone].hist(bins=15, color='steelblue', edgecolor='black', grid=False)
      plt.tight_layout(rect=(0, 0, 1.2, 1.2))

[34]: titanik_orig = pd.read_csv('train.csv')
      titanik_orig.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

[35]: kolone = ['Survived', 'Pclass', 'Age', 'Parch', 'SibSp', 'Fare']
      titanik_orig[kolone].hist(bins=15, color='steelblue', edgecolor='black', grid=False)
      plt.tight_layout(rect=(0, 0, 1.2, 1.2))

[36]: # Figure size
      bar = plt.figure(figsize=(10, 12))

      bar.add_subplot(321)  # 3 rows, 2 columns, 1st plot
      sns.boxplot(y='Age', data=titanik, color='yellow')

      bar.add_subplot(322)  # 3 rows, 2 columns, 2nd plot
      sns.boxplot(y='Fare', data=titanik)

      bar.add_subplot(323)  # 3 rows, 2 columns, 3rd plot
      sns.violinplot(y='Age', data=titanik, color='yellow')

      bar.add_subplot(324)  # 3 rows, 2 columns, 4th plot
      sns.violinplot(y='Fare', data=titanik)

      bar.add_subplot(325)  # 3 rows, 2 columns, 5th plot
      sns.stripplot(y='Age', data=titanik, alpha=0.5, color='black')

      bar.add_subplot(326)  # 3 rows, 2 columns, 6th plot
      sns.stripplot(y='Fare', data=titanik, alpha=0.5)

[36]: <matplotlib.axes._subplots.AxesSubplot at 0x22530728048>

3.4.1 Selecting data

[37]: # We want to look only at age and fare, because they have many distinct values
      podaci = titanik[['Age', 'Fare']]
      podaci

[37]:       Age     Fare
      0    22.0   7.2500
      1    38.0  71.2833
      2    26.0   7.9250
      3    35.0  53.1000
      4    35.0   8.0500
      ..    ...      ...
      886  27.0  13.0000
      887  19.0  30.0000
      888   NaN  23.4500
      889  26.0  30.0000
      890  32.0   7.7500

      [891 rows x 2 columns]

[38]: # merge the data into two columns: variable = column name, value = value from that column
      podaci = pd.melt(podaci)
      podaci

[38]:      variable  value
      0         Age  22.00
      1         Age  38.00
      2         Age  26.00
      3         Age  35.00
      4         Age  35.00
      ...       ...    ...
      1777     Fare  13.00
      1778     Fare  30.00
      1779     Fare  23.45
      1780     Fare  30.00
      1781     Fare   7.75

      [1782 rows x 2 columns]
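What `pd.melt` does is easier to see on a frame small enough to read in full. A minimal sketch, with a hypothetical two-row frame `wide` built from the first two Age/Fare pairs shown above:

```python
import pandas as pd

# Two rows, two columns ("wide" layout)
wide = pd.DataFrame({'Age': [22.0, 38.0], 'Fare': [7.25, 71.2833]})

# melt stacks the columns on top of each other ("long" layout):
# one row per (column name, value) pair
long = pd.melt(wide)
print(long)
```

Every original column contributes one block of rows, so a frame with 891 rows and 2 columns becomes 1782 rows, which is the shape seaborn expects when one axis encodes the variable name.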

[39]: sns.boxplot(x='variable', y='value', data=podaci)

[39]: <matplotlib.axes._subplots.AxesSubplot at 0x2253080b8c8>

[40]: # Go back to the original values
      podaci = titanik[['Age', 'Fare']]
      podaci.head()

[40]:     Age     Fare
      0  22.0   7.2500
      1  38.0  71.2833
      2  26.0   7.9250
      3  35.0  53.1000
      4  35.0   8.0500

[41]: # Original distribution
      fig = plt.figure(figsize=(15, 7))

      fig.add_subplot(121)
      sns.distplot(podaci['Age'].dropna())

      fig.add_subplot(122)
      sns.distplot(podaci['Fare'].dropna())

[41]: <matplotlib.axes._subplots.AxesSubplot at 0x22530267608>

[42]: # standardization / Z-score normalization / normalization around the mean
      podaci = (podaci - podaci.mean()) / podaci.std()
      podaci.head()

[42]:         Age      Fare
      0 -0.530005 -0.502163
      1  0.571430  0.786404
      2 -0.254646 -0.488580
      3  0.364911  0.420494
      4  0.364911 -0.486064
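The standardized values can be checked by hand from the summary statistics reported earlier by `describe()`. The sketch below redoes the arithmetic for the first two passengers; `z_score` is a helper name introduced here for illustration, and the constants are copied from the notebook output (rounded to six decimals).

```python
# Mean and (sample) standard deviation of Age, copied from describe()
mean_age = 29.699118
std_age = 14.526497

def z_score(x, mean, std):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / std

z_first = z_score(22.0, mean_age, std_age)   # passenger 0, Age 22
z_second = z_score(38.0, mean_age, std_age)  # passenger 1, Age 38

print(round(z_first, 6), round(z_second, 6))
```

A negative score means the passenger is younger than average, a positive one older; after the transformation each column has mean 0 and standard deviation 1, which is what makes Age and Fare comparable on one axis.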

[43]: # distribution
      fig = plt.figure(figsize=(15, 7))

      fig.add_subplot(121)
      sns.distplot(podaci['Age'].dropna())

      fig.add_subplot(122)
      sns.distplot(podaci['Fare'].dropna())

[43]: <matplotlib.axes._subplots.AxesSubplot at 0x2252ff8b388>

[44]: # Go back to the original values
      podaci = titanik[['Age', 'Fare']]

      # normalization / min-max method / rescaling (map onto the range from 0 to 1)
      podaci = (podaci - podaci.min()) / (podaci.max() - podaci.min())
      podaci.head()

[44]:         Age      Fare
      0  0.271174  0.014151
      1  0.472229  0.139136
      2  0.321438  0.015469
      3  0.434531  0.103644
      4  0.434531  0.015713
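The min-max values can likewise be reproduced by hand from the minimum and maximum reported by `describe()`. In this sketch `min_max` is an illustrative helper name, and the constants are copied from the notebook output.

```python
# Minimum and maximum of Age and Fare, copied from describe()
age_min, age_max = 0.42, 80.0
fare_min, fare_max = 0.0, 512.3292

def min_max(x, lo, hi):
    """Rescale x linearly so that lo maps to 0 and hi maps to 1."""
    return (x - lo) / (hi - lo)

age_scaled = min_max(22.0, age_min, age_max)     # passenger 0, Age 22
fare_scaled = min_max(7.25, fare_min, fare_max)  # passenger 0, Fare 7.25

print(round(age_scaled, 6), round(fare_scaled, 6))
```

Unlike standardization, this method is driven entirely by the two extreme values, so a single very high fare (512.33 here) squeezes most of the other fares close to 0.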

[45]: # min-max distribution
      fig = plt.figure(figsize=(15, 7))

      fig.add_subplot(121)
      sns.distplot(podaci['Age'].dropna())

      fig.add_subplot(122)
      sns.distplot(podaci['Fare'].dropna())

[45]: <matplotlib.axes._subplots.AxesSubplot at 0x2253021e6c8>

[46]: # Now we can finally show how the values in the Age column
      # compare with the values in the Fare column
      spojeno = pd.melt(podaci)
      sns.boxplot(x='variable', y='value', data=spojeno)

[46]: <matplotlib.axes._subplots.AxesSubplot at 0x2253009ab08>

[47]: sns.violinplot(x='variable', y='value', data=spojeno)

[47]: <matplotlib.axes._subplots.AxesSubplot at 0x22530108548>

[48]: sns.stripplot(x='variable', y='value', data=spojeno, alpha=0.5)

[48]: <matplotlib.axes._subplots.AxesSubplot at 0x22530845c88>

3.4.2 What we gained from the decades / categories

[49]: # Number of passengers per decade and per age category
      box = plt.figure(figsize=(20, 7))
      kategorije = ['AgeGroup', 'AgeCategory']

      for i in range(0, len(kategorije)):
          box.add_subplot(1, 2, i+1)
          sns.countplot(x=kategorije[i], data=titanik);

3.5 Bivariate analysis

3.5.1 Histogram

[50]: # Number of deceased/survivors by age
      grupe = titanik.groupby('Survived')

      plt.figure(figsize=(10, 7))
      plt.title("Outcome by age", fontsize=18)
      for name, group in grupe:
          plt.hist(group['Age'].dropna(), bins=15, alpha=0.5, label=str(name))
      plt.legend(['Deceased', 'Survived'])

[51]: g = sns.FacetGrid(titanik, col='Survived', height=5)
      g.map(plt.hist, 'Age', bins=15, color='orange', edgecolor='black')

[51]: <seaborn.axisgrid.FacetGrid at 0x2252fe06e88>

[52]: g = sns.FacetGrid(titanik, col='Survived', row='Sex', height=5)
      g.map(plt.hist, 'Age', bins=15, color='steelblue', edgecolor='black')

[52]: <seaborn.axisgrid.FacetGrid at 0x2252fcdfd48>

3.5.2 Bars and percentages

[53]: # Number of survivors / drowned by sex
      grouped = pd.crosstab(index=titanik['Sex'], columns=titanik['Survived'])

      grouped.plot.bar(stacked=True)
      plt.title("Count of survivors by gender", fontsize=18)
      plt.legend(title='Survived')
      plt.show()

[54]: # Percentage of survivors / drowned by sex
      grouped = pd.crosstab(index=titanik['Sex'], columns=titanik['Survived'], normalize='index')

      grouped.plot.bar(stacked=True)
      plt.title("% of survivors by gender", fontsize=18)
      plt.legend(title='Survived')
      plt.show()
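The only difference between the count chart and the percentage chart is the `normalize='index'` argument, which divides every row of the cross-tabulation by its row total. A minimal sketch on a hypothetical six-row frame (made-up passengers, with `Survived` coded as 0/1):

```python
import pandas as pd

# Hypothetical passengers, not rows from train.csv
df = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'Survived': [1, 1, 0, 0, 1, 0],
})

counts = pd.crosstab(index=df['Sex'], columns=df['Survived'])
shares = pd.crosstab(index=df['Sex'], columns=df['Survived'], normalize='index')

print(counts)
print(shares)
print(shares.sum(axis=1))  # each row adds up to 1
```

Because each row sums to 1, the stacked bars all have the same height, which makes groups of different sizes directly comparable.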

[55]: # Number of survivors / drowned by class
      grouped = pd.crosstab(index=titanik['Pclass'], columns=titanik['Survived'])

      grouped.plot.bar(stacked=True)
      plt.title("Count of survivors by the ticket class", fontsize=18)
      plt.legend(title='Survived')
      plt.show()

[56]: # Percentage of survivors / drowned by class
      grouped = pd.crosstab(index=titanik['Pclass'], columns=titanik['Survived'], normalize='index')

      grouped.plot.bar(stacked=True)
      plt.title("% of survivors by the ticket class", fontsize=18)
      plt.legend(title='Survived')
      plt.show()

[57]: sns.catplot(x='Survived', col='AgeCategory', hue='Sex', data=titanik, col_wrap=3, kind='count')

[57]: <seaborn.axisgrid.FacetGrid at 0x22532779508>

3.5.3 Boxes

[58]: boxes = pd.melt(pd.crosstab(index=titanik['Sex'], columns=titanik['Survived']))
      labels = ['Drowned Females', 'Drowned Males', 'Survived Females', 'Survived Males']

      sq.plot(sizes=boxes.value, label=labels, alpha=.4, color=["red", "green", "blue", "orange"])

      plt.title("Outcome by gender", fontsize=18)
      plt.axis('off')
      plt.show()

3.5.4 Violin

[59]: # Chance of surviving (1) or not (0) by sex and ticket class
      plt.figure(figsize=(10, 7))
      plt.title("Chance of survival", fontsize=18)
      sns.violinplot(x="Pclass", y="Survived", hue="Sex", data=titanik, split=True, palette="muted")

[59]: <matplotlib.axes._subplots.AxesSubplot at 0x2252fdc3e08>

3.5.5 Combined

[60]: # Who paid how much for a ticket
      sns.jointplot(x='Age', y='Fare', data=titanik)

[60]: <seaborn.axisgrid.JointGrid at 0x2253120a248>

[61]: # A different view
      sns.jointplot(x='Age', y='Fare', data=titanik, kind='hex')

[61]: <seaborn.axisgrid.JointGrid at 0x22532dfdd08>

[62]: sns.jointplot(x='Age', y='Fare', data=titanik, alpha=.3, color='red')

[62]: <seaborn.axisgrid.JointGrid at 0x225337246c8>

3.5.6 Scatter

[63]: plt.figure(figsize=(12, 9))
      sns.scatterplot(x='Age', y='Fare', data=titanik)

[63]: <matplotlib.axes._subplots.AxesSubplot at 0x225348ec588>

[64]: # We can also add sex (different colour)
      plt.figure(figsize=(12, 9))
      sns.scatterplot(x='Age', y='Fare', data=titanik, hue='Sex')

[64]: <matplotlib.axes._subplots.AxesSubplot at 0x2252fd11b88>

[65]: # We can also add whether they survived (different size)
      plt.figure(figsize=(12, 9))
      sns.scatterplot(x='Age', y='Fare', data=titanik, hue='Sex', size='Survived', alpha=.7)  # alpha makes overlapping points easier to see

[65]: <matplotlib.axes._subplots.AxesSubplot at 0x22534967948>

[66]: # A different marker for each port of embarkation (shape)
      plt.figure(figsize=(16, 12))
      sns.scatterplot(x='Age', y='Fare', data=titanik,
                      hue='Sex', size='Survived', style='Embarked', alpha=.7)

[66]: <matplotlib.axes._subplots.AxesSubplot at 0x22534d1e4c8>

[67]: # The same grid we used for the histogram
      g = sns.FacetGrid(titanik, col='Survived', row='Pclass', hue='Sex', height=5)
      g.map(plt.scatter, 'Age', 'Fare')

[67]: <seaborn.axisgrid.FacetGrid at 0x22534d9c448>

4 Numbers

4.1 Grouped data

[68]: # We want grouped data
      def grupisano(grupa):
          df = titanik[[grupa, 'Survived']]  # take only these two columns
          gr = df.groupby(grupa, as_index=False).mean()  # mean value per group
          return gr.sort_values(by='Survived', ascending=False)  # return sorted in descending order

      # What we want to see
      grupisano('Pclass')  # by class

[68]:    Pclass  Survived
      0       1  0.629630
      1       2  0.472826
      2       3  0.242363

[69]: grupisano('Sex')  # by sex

[69]:       Sex  Survived
      0  female  0.742038
      1    male  0.188908

[70]: grupisano('AgeCategory')  # by age

[70]:    AgeCategory  Survived
      1       Infant  0.702128
      4  Young Adult  0.400000
      5        Adult  0.380682
      2        Child  0.375000
      3     Teenager  0.370370
      0      Missing  0.293785
      6      Elderly  0.263158
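The numbers in the `Survived` column are survival rates: the mean of a boolean column is the share of `True` values, because `True` counts as 1 and `False` as 0. A minimal sketch of the same group-and-average pattern on a hypothetical six-row frame:

```python
import pandas as pd

# Hypothetical passengers, not rows from train.csv
df = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3, 3],
    'Survived': [True, False, True, False, False, True],
})

rates = (
    df.groupby('Pclass', as_index=False)['Survived']
      .mean()                                   # share of True per class
      .sort_values(by='Survived', ascending=False)
)
print(rates)
```

Here class 2 has one passenger who survived (rate 1.0), class 1 has one of two (0.5), and class 3 has one of three, so the sorted order is 2, 1, 3.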

4.2 Descriptive statistics

[71]: titanik.describe()

[71]:            Pclass         Age        Fare
      count  891.000000  714.000000  891.000000
      mean     2.308642   29.699118   32.204208
      std      0.836071   14.526497   49.693429
      min      1.000000    0.420000    0.000000
      25%      2.000000   20.125000    7.910400
      50%      3.000000   28.000000   14.454200
      75%      3.000000   38.000000   31.000000
      max      3.000000   80.000000  512.329200

[72]: perc = [.20, .40, .60, .80]  # percentiles
      include = ['object', 'float', 'int']  # dtypes we want to see

      titanik.describe(percentiles=perc, include=include)

[72]:               Age        Fare
      count  714.000000  891.000000
      mean    29.699118   32.204208
      std     14.526497   49.693429
      min      0.420000    0.000000
      20%     19.000000    7.854200
      40%     25.000000   10.500000
      50%     28.000000   14.454200
      60%     31.800000   21.679200
      80%     41.000000   39.687500
      max     80.000000  512.329200

[73]: titanik.describe(include='all')  # when we want to see every column regardless of type

[73]:        Survived      Pclass   Sex         Age  SibSp  Parch        Fare  \
      count       891  891.000000   891  714.000000  891.0  891.0  891.000000
      unique        2         NaN     2         NaN    7.0    7.0         NaN
      top       False         NaN  male         NaN    0.0    0.0         NaN
      freq        549         NaN   577         NaN  608.0  678.0         NaN
      mean        NaN    2.308642   NaN   29.699118    NaN    NaN   32.204208
      std         NaN    0.836071   NaN   14.526497    NaN    NaN   49.693429
      min         NaN    1.000000   NaN    0.420000    NaN    NaN    0.000000
      25%         NaN    2.000000   NaN   20.125000    NaN    NaN    7.910400
      50%         NaN    3.000000   NaN   28.000000    NaN    NaN   14.454200
      75%         NaN    3.000000   NaN   38.000000    NaN    NaN   31.000000
      max         NaN    3.000000   NaN   80.000000    NaN    NaN  512.329200

                 Embarked AgeGroup  AgeCategory  FamilySize
      count           891      714          891       891.0
      unique            3        8            7         9.0
      top     Southampton      20s  Young Adult         0.0
      freq            646      230          340       537.0
      mean            NaN      NaN          NaN         NaN
      std             NaN      NaN          NaN         NaN
      min             NaN      NaN          NaN         NaN
      25%             NaN      NaN          NaN         NaN
      50%             NaN      NaN          NaN         NaN
      75%             NaN      NaN          NaN         NaN
      max             NaN      NaN          NaN         NaN

4.3 Describing the DataFrame

[74]: titanik.dtypes  # column types

[74]: Survived           bool
      Pclass            int64
      Sex            category
      Age             float64
      SibSp          category
      Parch          category
      Fare            float64
      Embarked       category
      AgeGroup       category
      AgeCategory    category
      FamilySize     category
      dtype: object

[75]: titanik.shape  # number of rows, number of columns

[75]: (891, 11)

[76]: titanik.axes  # description of the data axes

[76]: [RangeIndex(start=0, stop=891, step=1),
       Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
              'Embarked', 'AgeGroup', 'AgeCategory', 'FamilySize'],
             dtype='object')]

[77]: titanik.ndim  # number of dimensions (this is a two-dimensional array)

[77]: 2

[78]: print(891 * 11)  # number of rows * number of columns
      titanik.size  # total number of cells

9801

[78]: 9801

[79]: titanik.nunique()  # number of unique values

[79]: Survived         2
      Pclass           3
      Sex              2
      Age             88
      SibSp            7
      Parch            7
      Fare           248
      Embarked         3
      AgeGroup         8
      AgeCategory      7
      FamilySize       9
      dtype: int64

4.4 Inspecting the data

[80]: titanik.head()  # from the start

[80]:    Survived  Pclass     Sex   Age SibSp Parch     Fare     Embarked AgeGroup  \
      0     False       3    male  22.0     1     0   7.2500  Southampton      20s
      1      True       1  female  38.0     1     0  71.2833    Cherbourg      30s
      2      True       3  female  26.0     0     0   7.9250  Southampton      20s
      3      True       1  female  35.0     1     0  53.1000  Southampton      30s
      4     False       3    male  35.0     0     0   8.0500  Southampton      30s

         AgeCategory FamilySize
      0  Young Adult          1
      1        Adult          1
      2  Young Adult          0
      3  Young Adult          1
      4  Young Adult          0

[81]: titanik.tail()  # from the end

[81]:      Survived  Pclass     Sex   Age SibSp Parch   Fare     Embarked AgeGroup  \
      886     False       2    male  27.0     0     0  13.00  Southampton      20s
      887      True       1  female  19.0     0     0  30.00  Southampton      10s
      888     False       3  female   NaN     1     2  23.45  Southampton      NaN
      889      True       1    male  26.0     0     0  30.00    Cherbourg      20s
      890     False       3    male  32.0     0     0   7.75   Queenstown      30s

           AgeCategory FamilySize
      886  Young Adult          0
      887     Teenager          0
      888      Missing          3
      889  Young Adult          0
      890  Young Adult          0

[82]: titanik.sample(5)  # random rows

[82]:      Survived  Pclass     Sex   Age SibSp Parch     Fare     Embarked  \
      375      True       1  female   NaN     1     0  82.1708    Cherbourg
      142      True       3  female  24.0     1     0  15.8500  Southampton
      187      True       1    male  45.0     0     0  26.5500  Southampton
      282     False       3    male  16.0     0     0   9.5000  Southampton
      874      True       2  female  28.0     1     0  24.0000    Cherbourg

          AgeGroup  AgeCategory FamilySize
      375      NaN      Missing          1
      142      20s  Young Adult          1
      187      40s        Adult          0
      282      10s     Teenager          0
      874      20s  Young Adult          1

4.5 Statistics

[83]: titanik.corr()  # correlation

[83]:           Survived    Pclass       Age      Fare
      Survived  1.000000 -0.338481 -0.077221  0.257307
      Pclass   -0.338481  1.000000 -0.369226 -0.549500
      Age      -0.077221 -0.369226  1.000000  0.096067
      Fare      0.257307 -0.549500  0.096067  1.000000
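The output above was produced by a pandas version that silently left non-numeric columns out of `corr()`. Newer releases (pandas 2.0 and later) no longer do this by default, so on a current installation the numeric columns may need to be selected explicitly. A minimal sketch on a hypothetical frame, assuming only that behaviour change:

```python
import pandas as pd

# Hypothetical frame mixing numeric and text columns
df = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0],
    'Fare': [7.25, 71.2833, 7.925, 53.1],
    'Sex': ['male', 'female', 'female', 'female'],
})

# Keep only numeric and boolean columns before computing correlations
numeric = df.select_dtypes(include=['number', 'bool'])
corr = numeric.corr()
print(corr)
```

The same selection step can be applied before `cov()`, `var()`, `std()` and the other column-wise statistics in this section.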

[84]: titanik.cov()  # covariance

[84]:           Survived     Pclass         Age         Fare
      Survived  0.236772  -0.137703   -0.551296     6.221787
      Pclass   -0.137703   0.699015   -4.496004   -22.830196
      Age      -0.551296  -4.496004  211.019125    73.849030
      Fare      6.221787 -22.830196   73.849030  2469.436846

[85]: titanik.var()  # variance

[85]: Survived         0.236772
      Pclass           0.699015
      Age            211.019125
      SibSp            1.216043
      Parch            0.649728
      Fare          2469.436846
      FamilySize       2.603248
      dtype: float64

[86]: titanik.std()  # standard deviation

[86]: Survived       0.486592
      Pclass         0.836071
      Age           14.526497
      SibSp          1.102743
      Parch          0.806057
      Fare          49.693429
      FamilySize     1.613459
      dtype: float64

[87]: titanik.skew()

[87]: Survived      0.478523
      Pclass       -0.630548
      Age           0.389108
      SibSp         3.695352
      Parch         2.749117
      Fare          4.787317
      FamilySize    2.727441
      dtype: float64

[88]: titanik.kurtosis()

[88]: Survived      -1.775005
      Pclass        -1.280015
      Age            0.178274
      SibSp         17.880420
      Parch          9.778125
      Fare          33.398141
      FamilySize     9.159666
      dtype: float64

[89]: titanik.mad()  # mean absolute deviation

[89]: Survived     0.473013
      Pclass       0.761968
      Age         11.322944
      Fare        28.163692
      dtype: float64
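`DataFrame.mad()` was deprecated in pandas 1.5 and removed in 2.0, so the call above fails on current versions. The statistic itself is simple to compute by hand: it is the mean of the absolute distances from the mean. A minimal sketch using the first five Age values shown earlier in the notebook:

```python
import pandas as pd

# First five Age values from the head() output
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0])

# Mean absolute deviation: average distance from the mean
mad = (ages - ages.mean()).abs().mean()
print(mad)
```

For a whole frame, the same expression can be applied to its numeric columns, for example `(df - df.mean()).abs().mean()` on a numeric-only selection.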

[90]: titanik.sem()  # standard error of the mean

[90]: Survived      0.016301
      Pclass        0.028009
      Age           0.543640
      SibSp         0.036943
      Parch         0.027004
      Fare          1.664792
      FamilySize    0.054053
      dtype: float64
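The standard error of the mean is the standard deviation divided by the square root of the number of non-missing values. The sketch below checks this against the figures above, using the `std` and `count` values copied from the earlier `describe()` output.

```python
import math

# Standard deviation and count of non-missing values, copied from describe()
std_age, n_age = 14.526497, 714
std_fare, n_fare = 49.693429, 891

sem_age = std_age / math.sqrt(n_age)
sem_fare = std_fare / math.sqrt(n_fare)

print(round(sem_age, 6), round(sem_fare, 6))
```

Age has fewer observations than Fare (714 against 891) because the missing ages are excluded from the count.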

[91]: titanik.mean()  # mean value

[91]: Survived       0.383838
      Pclass         2.308642
      Age           29.699118
      SibSp          0.523008
      Parch          0.381594
      Fare          32.204208
      FamilySize     0.904602
      dtype: float64

[92]: titanik.Fare.sum()  # total; it makes no sense to compute this for every column

[92]: 28693.9493

[93]: titanik.median()  # median: the middle value (half of the values are >= it and half are <= it)

[93]: Survived       0.0000
      Pclass         3.0000
      Age           28.0000
      SibSp          0.0000
      Parch          0.0000
      Fare          14.4542
      FamilySize     0.0000
      dtype: float64

[94]: titanik.mode()  # most frequent value

[94]:    Survived  Pclass   Sex   Age SibSp Parch  Fare     Embarked AgeGroup  \
      0     False       3  male  24.0     0     0  8.05  Southampton      20s

         AgeCategory FamilySize
      0  Young Adult          0

4.6 Saving the changes

[95]: titanik.to_csv('titanik.csv', index=False)