STAT 408 Data Scraping and SQL

March 8, 2018

Data Scraping

Data scraping is defined as using a computer to extract information, typically from human-readable websites. We could spend multiple weeks on this, so this will be a basic introduction that will allow you to:

  • extract text and numbers from webpages and
  • extract tables from webpages.


A bit about HTML

HTML elements are written with a start tag, an end tag, and the content in between: <tagname>content</tagname>. These tags typically contain the textual content we wish to scrape. Some common tags include:

  • <h1>, <h2>, ..., <h6>: headings
  • <p>: paragraph elements
  • <ul>: unordered (bulleted) list
  • <ol>: ordered list
  • <li>: individual list item
  • <div>: division or section
  • <table>: table


HTML Example

Figure 1: MSU website
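The tag structure described above can be explored without visiting a real site. The following is a minimal sketch that parses a small hand-written page with the rvest package; the page content is hypothetical and exists only for illustration.

```r
library(rvest)

# A tiny hand-written page (hypothetical content, not taken from a real site)
page <- read_html('
<html><body>
  <h1>Course Notes</h1>
  <p>First paragraph.</p>
  <ul><li>rvest</li><li>stringr</li></ul>
</body></html>')

# Select elements by tag name, then extract the text between the tags
heading <- page %>% html_nodes("h1") %>% html_text()
items   <- page %>% html_nodes("li") %>% html_text()

heading
items
```

Each call to html_nodes() returns every element matching the tag, so a page with several list items yields a character vector with one entry per item.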


Scraping with rvest

library(rvest)
library(stringr)
msu.math <- read_html("http://math.montana.edu/")
msu.math

## {xml_document}
## <html lang="en-US">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="responsive">\n <header class="wrapper" id="header" ...


Scraping with rvest

msu.math %>% html_nodes('h1')

## {xml_nodeset (1)}
## [1] <h1>\n \t\tDepartment of Mathematical Science ...

msu.math %>% html_nodes('h1') %>% html_text()

## [1] "\n \t\tDepartment of Mathematical Sciences\n \t\n "


Tidying Up

msu.math %>% html_nodes('h1') %>% html_text() %>%
  str_replace_all("\\s+", " ") %>%
  str_replace_all(pattern = "\n", replacement = "") %>%
  str_replace_all(pattern = "\t", replacement = "")

## [1] " Department of Mathematical Sciences "
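The result above still carries a leading and a trailing space. As a hedged alternative, the sketch below applies the same clean-up to a literal copy of the scraped string and then removes the outer whitespace with stringr's str_trim() or str_squish(); the variable names are illustrative only.

```r
library(stringr)

# Literal copy of the raw heading text shown on the earlier slide
raw <- "\n \t\tDepartment of Mathematical Sciences\n \t\n "

# Collapse every run of whitespace (spaces, tabs, newlines) to one space
collapsed <- str_replace_all(raw, "\\s+", " ")

# Remove the remaining leading and trailing space
trimmed <- str_trim(collapsed)

# str_squish() performs both steps in one call
squished <- str_squish(raw)

trimmed
```

Because "\\s+" already matches newlines and tabs, the separate "\n" and "\t" replacements in the slide code have nothing left to remove.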


Scraping h3

msu.math %>% html_nodes('h3') %>% html_text() %>%
  str_replace_all("\\s+", " ") %>%
  str_replace_all(pattern = "\n", replacement = "") %>%
  str_replace_all(pattern = "\t", replacement = "")

## [1] "Faculty"          "Undergraduate"    "News"
## [4] "Events"           "More Information" "Resources"
## [7] "Follow Us"


A River Runs Through It

Figure 2: IMDB: A River Runs Through It


Get Story line

river <- read_html("http://www.imdb.com/title/tt0105265/")
story.line <- river %>%
  html_nodes('#titleStoryLine') %>%
  html_nodes('p') %>%
  html_text() %>%
  str_replace_all(pattern = "\n", replacement = "")

The storyline is: The Maclean brothers, Paul and Norman, live a relatively idyllic life in rural Montana, spending much of their time fly fishing. The sons of a minister, the boys eventually part company when Norman moves east to attend college, leaving his rebellious brother to find trouble back home. When Norman finally returns, the siblings resume their fishing outings, and assess both where they’ve been and where they’re going. Written by Jwelch5742.


SelectorGadget for accessing actors

Figure 3: Using SelectorGadget


Actors

# Get Actors
river %>% html_nodes('#titleCast') %>%
  html_nodes(".itemprop span") %>% html_text()

##  [1] "Craig Sheffer"   "Brad Pitt"       "Tom Skerritt"
##  [4] "Brenda Blethyn"  "Emily Lloyd"     "Edie McClurg"
##  [7] "Stephen Shellen" "Vann Gravage"    "Nicole Burdette"
## [10] "Susan Traylor"   "Michael Cudlitz" "Rob Cox"
## [13] "Buck Simmonds"   "Fred Oakland"    "David Creamer"
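The selectors used above follow CSS conventions: a leading # matches an element id, a leading . matches a class, and a space means "a descendant of". The sketch below illustrates this on hypothetical markup; the id "cast" and the class "actor" are invented for the example and are not IMDB's actual page structure.

```r
library(rvest)

# Hypothetical markup loosely imitating a cast list
page <- read_html('
<div id="cast">
  <span class="actor"><span>Actor One</span></span>
  <span class="actor"><span>Actor Two</span></span>
</div>
<div id="other">
  <span class="actor"><span>Not In Cast</span></span>
</div>')

# Two-step selection: first the element with id "cast", then spans
# nested inside elements of class "actor"
two_step <- page %>%
  html_nodes("#cast") %>%
  html_nodes(".actor span") %>%
  html_text()

# The same selection written as a single CSS selector
one_step <- page %>% html_nodes("#cast .actor span") %>% html_text()

two_step
```

Restricting the search to the "cast" element first is what keeps the span under "other" out of the result.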


Selecting Tables: baseball data

Figure 4: HTML Table


Scraping Tables

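library(dplyr)  # provides tbl_df(); newer dplyr versions deprecate it in favor of as_tibble()
library(knitr)  # provides kable()
batting <- read_html("https://www.baseball-reference.com/leagues/MLB/2017-standard-batting.shtml")
# html_table() returns a list with one data frame per <table> on the page
batting.list <- batting %>% html_nodes('table') %>% html_table()
batting.df <- tbl_df(batting.list[[1]])
kable(batting.df)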

Tm #Bat BatAge R/G G PA AB R H 2B 3B HR RBI SB CS BB SO BA OBP SLG OPS OPS+ TB GDP HBP SH SF IBB LOB
ARI 45 28.3 5.01 162 6224 5525 812 1405 314 39 220 776 103 30 578 1456 .254 .329 .445 .774 93 2457 106 54 39 27 44 1118
ATL 49 28.7 4.52 162 6216 5584 732 1467 289 26 165 706 77 31 474 1184 .263 .326 .412 .738 94 2303 137 66 59 32 57 1127
BAL 50 28.6 4.59 162 6140 5650 743 1469 269 12 232 713 32 13 392 1412 .260 .312 .435 .747 99 2458 138 50 10 37 12 1041
BOS 49 27.3 4.85 162 6338 5669 785 1461 302 19 168 735 106 31 571 1224 .258 .329 .407 .736 92 2305 141 53 9 36 48 1134
CHC 47 27.1 5.07 162 6283 5496 822 1402 274 29 223 785 62 31 622 1401 .255 .338 .437 .775 100 2403 134 82 48 32 54 1147
CHW 51 26.7 4.36 162 6059 5513 706 1412 256 37 186 670 71 31 401 1397 .256 .314 .417 .731 95 2300 124 76 35 33 17 1055
CIN 47 27.1 4.65 162 6213 5484 753 1390 249 38 219 715 120 39 565 1329 .253 .329 .433 .761 98 2372 116 72 50 42 41 1135
CLE 41 28.0 5.05 162 6234 5511 818 1449 333 29 212 780 88 23 604 1153 .263 .339 .449 .788 104 2476 125 50 23 45 30 1158
COL 41 28.3 5.09 162 6201 5534 824 1510 293 38 192 793 59 34 519 1408 .273 .338 .444 .781 91 2455 143 44 62 41 46 1088
DET 49 29.6 4.54 162 6150 5556 735 1435 289 35 187 699 65 34 503 1313 .258 .324 .424 .748 96 2355 128 52 11 27 21 1104
HOU 46 28.8 5.53 162 6271 5611 896 1581 346 20 238 854 98 42 509 1087 .282 .346 .478 .823 126 2681 139 70 11 61 27 1094
KCR 49 28.9 4.33 162 6027 5536 702 1436 260 24 193 660 91 31 390 1166 .259 .311 .420 .731 92 2323 160 45 17 37 19 1005
LAA 55 30.0 4.38 162 6073 5415 710 1314 251 14 186 678 136 44 523 1198 .243 .315 .397 .712 93 2151 141 70 17 46 30 1033
LAD 52 27.9 4.75 162 6191 5408 770 1347 312 20 221 730 77 28 649 1380 .249 .334 .437 .771 103 2362 119 64 31 38 41 1146
MIA 43 28.4 4.80 162 6248 5602 778 1497 271 31 194 743 91 30 486 1282 .267 .331 .431 .761 104 2412 119 67 50 41 48 1130
MIL 50 27.4 4.52 162 6135 5467 732 1363 267 22 224 695 128 41 547 1571 .249 .322 .429 .751 94 2346 116 53 42 26 34 1088
MIN 52 27.0 5.03 162 6261 5557 815 1444 286 31 206 781 95 28 593 1342 .260 .334 .434 .768 105 2410 105 46 26 39 26 1147
NYM 52 28.8 4.54 162 6169 5510 735 1379 286 28 224 713 58 23 529 1291 .250 .320 .434 .755 98 2393 118 57 36 37 31 1099
NYY 51 28.6 5.30 162 6354 5594 858 1463 266 23 241 821 90 22 616 1386 .262 .339 .447 .785 105 2498 119 64 18 56 22 1184
OAK 54 28.7 4.56 162 6126 5464 739 1344 305 15 234 708 57 22 565 1491 .246 .319 .436 .755 103 2381 129 43 13 40 15 1075
PHI 51 26.6 4.26 162 6133 5535 690 1382 287 36 174 654 59 25 494 1417 .250 .315 .409 .723 91 2263 128 47 21 36 25 1079
PIT 47 28.2 4.12 162 6136 5458 668 1331 249 36 151 635 67 36 519 1213 .244 .318 .386 .704 84 2105 120 88 42 28 39 1129
SDP 52 26.2 3.73 162 5954 5356 604 1251 227 31 189 576 89 33 460 1499 .234 .299 .393 .692 84 2107 99 53 52 33 20 1037
SEA 61 29.5 4.63 162 6166 5551 750 1436 281 17 200 714 89 35 487 1267 .259 .325 .424 .749 101 2351 131 78 14 35 31 1084
SFG 49 29.5 3.94 162 6137 5551 639 1382 290 28 128 612 76 34 467 1204 .249 .309 .380 .689 83 2112 136 36 31 52 37 1093
STL 48 28.0 4.70 162 6219 5470 761 1402 284 28 196 728 81 31 593 1348 .256 .334 .426 .760 99 2330 139 65 47 44 36 1118
TBR 53 28.3 4.28 162 6147 5478 694 1340 226 32 228 671 88 34 545 1538 .245 .317 .422 .739 101 2314 115 55 16 48 33 1114
TEX 51 28.3 4.93 162 6122 5430 799 1326 255 21 237 756 113 44 544 1493 .244 .320 .430 .750 93 2334 110 81 27 39 18 1015
TOR 60 30.9 4.28 162 6154 5499 693 1320 269 5 222 661 53 24 542 1327 .240 .312 .412 .724 88 2265 153 51 25 35 12 1064
WSN 49 29.2 5.06 162 6214 5553 819 1477 311 31 215 796 108 30 542 1327 .266 .332 .449 .782 100 2495 116 31 43 45 56 1101
LgAvg 45 28.3 4.65 162 6177 5519 753 1407 280 27 204 719 84 31 528 1337 .255 .324 .426 .750 97 2351 127 59 31 39 32 1098
1358 28.3 4.65 4860 185295 165567 22582 42215 8397 795 6105 21558 2527 934 15829 40104 .255 .324 .426 .750 97 70517 3804 1763 925 1168 970 32942
Tm #Bat BatAge R/G G PA AB R H 2B 3B HR RBI SB CS BB SO BA OBP SLG OPS OPS+ TB GDP HBP SH SF IBB LOB


Scraping Exercise: Get Team Info

Visit the baseball reference website for the Colorado Rockies, https://www.baseball-reference.com/teams/COL/2017.shtml, and scrape a table or text.


Scraping Solution: Get Team Info

batting.CO <- read_html("https://www.baseball-reference.com/teams/COL/2017.shtml")
tables.CO <- batting.CO %>% html_nodes('table') %>% html_table()
tbl_df(tables.CO[[1]])

## # A tibble: 48 x 28
##    Rk    Pos   Name              Age   G     PA    AB    R     H     `2B`
##    <chr> <chr> <chr>             <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 1     C     Tony Wolters*     25    83    266   229   30    55    8
##  2 2     1B    Mark Reynolds     33    148   593   520   82    139   22
##  3 3     2B    DJ LeMahieu       28    155   682   609   95    189   28
##  4 4     SS    Trevor Story      24    145   555   503   68    120   32
##  5 5     3B    Nolan Arenado     26    159   680   606   100   187   43
##  6 6     LF    Gerardo Parra*    30    115   425   392   56    121   24
##  7 7     CF    Charlie Blackmon* 30    159   725   644   137   213   35
##  8 8     RF    Carlos Gonzalez*  31    136   534   470   72    123   34
##  9 Rk    Pos   Name              Age   G     PA    AB    R     H     2B
## 10 9     UT    Ian Desmond       31    95    373   339   47    93    11
## # ... with 38 more rows, and 18 more variables: `3B` <chr>, HR <chr>,
## #   RBI <chr>, SB <chr>, CS <chr>, BB <chr>, SO <chr>, BA <chr>,
## #   OBP <chr>, SLG <chr>, OPS <chr>, `OPS+` <chr>, TB <chr>, GDP <chr>,
## #   HBP <chr>, SH <chr>, SF <chr>, IBB <chr>
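In the output above every column is stored as character, and row 9 is a copy of the header that the website repeats inside the table body. One possible clean-up, sketched here on a small hypothetical data frame rather than the real scraped table, is to drop the repeated header rows and then convert the numeric columns.

```r
library(dplyr)

# Hypothetical miniature of a scraped table: all columns are character
# and the header row reappears in the middle of the data
scraped <- data.frame(
  Rk   = c("1", "2", "Rk", "3"),
  Name = c("Player A", "Player B", "Name", "Player C"),
  HR   = c("6", "30", "HR", "8"),
  stringsAsFactors = FALSE
)

cleaned <- scraped %>%
  filter(Rk != "Rk") %>%                  # drop repeated header rows
  mutate(Rk = as.integer(Rk),             # convert text to numbers
         HR = as.numeric(HR))

cleaned
```

Filtering before converting matters: calling as.numeric() on a column that still contains "HR" would produce NA with a warning.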


SQL


SQLite

For this class we will use SQLite, which enables users to store database files locally, but the principles are the same for querying a server-based database.

We will use a European soccer database available at https://www.kaggle.com/hugomathien/soccer/, which can be downloaded with the following link: https://www.kaggle.com/hugomathien/soccer/downloads/database.sqlite


Accessing Database

library(DBI)
library(RSQLite)
## connect to a database WHICH IS STORED LOCALLY
my.database <- dbConnect(SQLite(),
  dbname="~/Google Drive/teaching/STAT408/data/database.sqlite")
dbListTables(my.database)

## [1] "Country"           "League"            "Match"
## [4] "Player"            "Player_Attributes" "Team"
## [7] "Team_Attributes"   "sqlite_sequence"

dbDisconnect(my.database)


Identifying fields in table

my.database <- dbConnect(SQLite(),
  dbname="~/Google Drive/teaching/STAT408/data/database.sqlite")

dbListFields(my.database, "Player")

## [1] "id"                 "player_api_id"      "player_name"
## [4] "player_fifa_api_id" "birthday"           "height"
## [7] "weight"


SQL commands

The most basic SQL queries have the following structure:

SELECT var1name, var2name   (choose columns)
FROM tablename              (identify table)
WHERE condition1            (filter rows)
GROUP BY var3name           (aggregate data)
HAVING condition2           (filter aggregated data)
ORDER BY var                (arrange ordering)
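To see all six clauses working together, the following sketch runs one query against a small in-memory SQLite database. The table name toy_players, its columns, and its values are hypothetical and unrelated to the soccer database.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")

# Hypothetical toy table
toy_players <- data.frame(
  name   = c("A", "B", "C", "D", "E", "F"),
  team   = c("x", "x", "y", "y", "z", "x"),
  height = c(180, 190, 185, 200, 195, 170),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "toy_players", toy_players)

result <- dbGetQuery(con, "
  SELECT team, AVG(height) AS mean_height  -- columns to return
  FROM toy_players                         -- table to read
  WHERE height > 175                       -- filter rows before grouping
  GROUP BY team                            -- one output row per team
  HAVING COUNT(*) >= 2                     -- filter groups after aggregating
  ORDER BY mean_height DESC                -- sort the result
")

dbDisconnect(con)
result
```

Player F is removed by WHERE before any averages are computed, while team z is removed by HAVING because only one of its rows survives the WHERE filter.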


SQL Query 1

Select all columns for player and view first 5 rows.

kable((dbGetQuery(my.database,"SELECT * FROM Player"))[1:5,])

id  player_api_id  player_name         player_fifa_api_id  birthday             height  weight
1   505942         Aaron Appindangoye  218353              1992-02-29 00:00:00  182.88  187
2   155782         Aaron Cresswell     189615              1989-12-15 00:00:00  170.18  146
3   162549         Aaron Doran         186170              1991-05-13 00:00:00  170.18  163
4   30572          Aaron Galindo       140161              1982-05-08 00:00:00  182.88  198
5   23780          Aaron Hughes        17725               1979-11-08 00:00:00  182.88  154
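The query above brings the whole table into R and then keeps five rows. For large tables it can be cheaper to let the database return only the rows needed, which SQL expresses with a LIMIT clause. The sketch below shows this on a hypothetical in-memory table named toy_player.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")

# Hypothetical table with 100 rows
dbWriteTable(con, "toy_player",
             data.frame(id = 1:100, height = seq(160, 209.5, by = 0.5)))

# Only five rows are transferred from the database to R
first5 <- dbGetQuery(con, "SELECT * FROM toy_player ORDER BY id LIMIT 5")

dbDisconnect(con)
first5
```

The ORDER BY is included because SQL does not otherwise promise which five rows a LIMIT returns.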


SQL Query 2

Retain player name, weight, height for players over 200 cm

kable(dbGetQuery(my.database,"SELECT player_name, height, weight FROM Player WHERE height > 200"))

player_name             height  weight
Abdoul Ba               200.66  212
Asmir Begovic           200.66  183
Bogdan Milic            203.20  216
Costel Pantilimon       203.20  212
Daniel Burn             200.66  192
Danny Wintjens          200.66  168
Fejsal Mulic            203.20  185
Fraser Forster          200.66  198
Jurgen Wevers           203.20  212
Kevin Vink              203.20  194
Konrad Jalocha          200.66  187
Kristof van Hout        208.28  243
Lacina Traore           203.20  192
Nikola Zigic            203.20  212
Paolo Acerbis           203.20  190
Peter Crouch            200.66  185
Pietro Marino           203.20  209
Robert Jones            200.66  170
Stefan Maierhofer       203.20  216
Vanja Milinkovic-Savic  203.20  203
Wojciech Kaczmarek      200.66  218
Zeljko Kalac            203.20  209


SQL Query 3

Compute average weight for players over 200 cm

dbGetQuery(my.database,"SELECT AVG(weight) as mean_weight FROM Player WHERE height > 200")

##   mean_weight
## 1    200.2727


Create a database

new.db <- dbConnect(RSQLite::SQLite(), ":memory:")
dbListTables(new.db)

## character(0)

player <- tbl_df(dbGetQuery(my.database,"SELECT * FROM Player"))
dbWriteTable(new.db, "player", player)
dbListTables(new.db)

## [1] "player"

dbDisconnect(new.db)
dbDisconnect(my.database)


Additional SQL

SQL also has functionality for merging and updating tables. See the cheat sheet for more details.
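As a brief illustration of both operations, the sketch below merges two tables with an INNER JOIN and changes a value with UPDATE. The tables toy_country and toy_goals and their values are hypothetical, created in memory for the example.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")

# Hypothetical toy tables
dbWriteTable(con, "toy_country",
             data.frame(id = c(1, 2), name = c("Belgium", "England"),
                        stringsAsFactors = FALSE))
dbWriteTable(con, "toy_goals",
             data.frame(country_id = c(1, 2), total_goals = c(2.8, 2.7)))

# Merging: match each goals row to the country with the same id
merged <- dbGetQuery(con, "
  SELECT toy_country.name, toy_goals.total_goals
  FROM toy_goals
  INNER JOIN toy_country ON toy_goals.country_id = toy_country.id
  ORDER BY toy_country.name")

# Updating: dbExecute() runs a statement and returns the number of rows changed
n_changed <- dbExecute(con,
  "UPDATE toy_goals SET total_goals = 3.0 WHERE country_id = 1")
updated <- dbGetQuery(con,
  "SELECT total_goals FROM toy_goals WHERE country_id = 1")

dbDisconnect(con)
merged
```

The join is read-only, whereas the UPDATE modifies the stored table, so it is worth checking the WHERE condition with a SELECT first.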


SQL Exercise

Select the average goals scored in matches in different countries from the match table


SQL Solution

Select the average goals scored in matches in different countries from the match table

my.database <- dbConnect(SQLite(),
  dbname="~/Google Drive/teaching/STAT408/data/database.sqlite")

st <- "SELECT AVG(home_team_goal + away_team_goal) as total_goals,country_id FROM match GROUP by country_id"

dbGetQuery(my.database,st)

##    total_goals country_id
## 1     2.801505          1
## 2     2.710526       1729
## 3     2.443092       4769
## 4     2.901552       7809
## 5     2.616838      10257
## 6     3.080882      13274
## 7     2.425000      15722
## 8     2.534600      17642
## 9     2.633772      19694
## 10    2.767105      21518
## 11    2.929677      24558

dbListTables(my.database)

## [1] "Country"           "League"            "Match"
## [4] "Player"            "Player_Attributes" "Team"
## [7] "Team_Attributes"   "sqlite_sequence"


SQL Solution

Select the average goals scored in matches in different countries from the match table

st2 <- "Create Table goals as SELECT AVG(home_team_goal + away_team_goal) as
  total_goals, country_id FROM match GROUP by country_id "
dbGetQuery(my.database,st2)

## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries

## data frame with 0 columns and 0 rows

dbGetQuery(my.database, "SELECT * from goals INNER JOIN country on goals.country_id = country.id")

##    total_goals country_id    id        name
## 1     2.801505          1     1     Belgium
## 2     2.710526       1729  1729     England
## 3     2.443092       4769  4769      France
## 4     2.901552       7809  7809     Germany
## 5     2.616838      10257 10257       Italy
## 6     3.080882      13274 13274 Netherlands
## 7     2.425000      15722 15722      Poland
## 8     2.534600      17642 17642    Portugal
## 9     2.633772      19694 19694    Scotland
## 10    2.767105      21518 21518       Spain
## 11    2.929677      24558 24558 Switzerland

dbSendQuery(my.database, "Drop Table goals")

## <SQLiteResult>
##   SQL  Drop Table goals
##   ROWS Fetched: 0 [complete]
##        Changed: 0

dbDisconnect(my.database)

## Warning in rsqlite_disconnect(conn@ptr): There are 1 result in use. The
## connection will be released when they are closed
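Both warnings come from how the statements were sent: dbGetQuery() is meant for queries that return rows, and dbSendQuery() leaves a result open until it is cleared. One way to avoid them, sketched here under the assumption that a DBI version providing dbExecute() is installed, is to run CREATE TABLE and DROP TABLE through dbExecute(). The table toy_match and its values are hypothetical.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")

# Hypothetical stand-in for the match table
dbWriteTable(con, "toy_match",
             data.frame(country_id     = c(1, 1, 2),
                        home_team_goal = c(1, 2, 0),
                        away_team_goal = c(1, 0, 3)))

# Statements that return no rows go through dbExecute()
dbExecute(con, "CREATE TABLE goals AS
  SELECT AVG(home_team_goal + away_team_goal) AS total_goals, country_id
  FROM toy_match GROUP BY country_id")

# Queries that return rows still use dbGetQuery()
goals <- dbGetQuery(con, "SELECT * FROM goals ORDER BY country_id")

dbExecute(con, "DROP TABLE goals")
tables_after <- dbListTables(con)

# No result is left open, so disconnecting is clean
dbDisconnect(con)
goals
```

If dbSendQuery() is used instead, calling dbClearResult() on the returned result object before disconnecting releases it.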