14
Wikipedia as a Big Data source for Tourism Global Forum on Tourism Statistics Venice, November 23 rd 2016 Serena Signorelli 1,2 , Fernando Reis 2 , Silvia Biffignandi 1 1 University of Bergamo 2 EUROSTAT

Big Data StrategyTitle - 14th Global Forum on โ€ฆtsf2016venice.enit.it/images/articles/Presentations/s2/2...Wikipedia as a Big Data source for Tourism Global Forum on Tourism Statistics

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Big Data StrategyTitle - 14th Global Forum on โ€ฆtsf2016venice.enit.it/images/articles/Presentations/s2/2...Wikipedia as a Big Data source for Tourism Global Forum on Tourism Statistics

Wikipedia as a Big Data source for Tourism

Global Forum on Tourism Statistics

Venice, November 23rd 2016

Serena Signorelli1,2, Fernando Reis2, Silvia Biffignandi1 1University of Bergamo 2EUROSTAT

Page 2: Big Data StrategyTitle - 14th Global Forum on โ€ฆtsf2016venice.enit.it/images/articles/Presentations/s2/2...Wikipedia as a Big Data source for Tourism Global Forum on Tourism Statistics

Wikipedia as a Big Data source for Tourism

Introduction on the study

Big data source

Official Statistics source

Methodology and results

Conclusions and future research

Page 3: Big Data StrategyTitle - 14th Global Forum on โ€ฆtsf2016venice.enit.it/images/articles/Presentations/s2/2...Wikipedia as a Big Data source for Tourism Global Forum on Tourism Statistics

Introduction on the study

Source of data

Wikipedia page views, starting from Wikidata (linked data source of the Wikimedia Foundation)

Official Tourism data, composed by arrivals and overnight stays

Aim of the study

evaluate the use of these data as a source of information for the identification of factors that drive tourism to an area and whether it is possible to predict tourism flows using these data.

Context

January 2012 to December 2015

three cities: Barcelona, Bruges, Vienna

Page 4: Big Data StrategyTitle - 14th Global Forum on โ€ฆtsf2016venice.enit.it/images/articles/Presentations/s2/2...Wikipedia as a Big Data source for Tourism Global Forum on Tourism Statistics

Big data source

Description of data

Wikipedia page views: how many people have visited an article during a given time period

Languages considered: 31

24 official languages of the European Union

7 languages in the top Wikipedia rankings in no. of page views

Summary of data

City Number of Wikidata items Number of Wikipedia articles

Barcelona 1093 3996

Bruges 561 868

Vienna 2663 6315

Page 5: Big Data StrategyTitle - 14th Global Forum on โ€ฆtsf2016venice.enit.it/images/articles/Presentations/s2/2...Wikipedia as a Big Data source for Tourism Global Forum on Tourism Statistics

Official Statistics source

Description of data

Official Tourism data:

arrivals (number of passengers)

overnight stays (number of bookings)

Collection of data

Available on the municipality of Barcelona website

Available on the Flemish tourism website

Available on Statistics Austria website

+ missing months provided by Statistics Austria

Page 6: Big Data StrategyTitle - 14th Global Forum on โ€ฆtsf2016venice.enit.it/images/articles/Presentations/s2/2...Wikipedia as a Big Data source for Tourism Global Forum on Tourism Statistics

Methodology and results

Type of results

Only using the big data source

Maps with points of interest in cities

Charts with top six languages time series

Maps with top six languages points of interest

Classification of points of interest

Using the two combined data sources

Modelling Official Tourism data using big data series as regressors

Page 7: Big Data StrategyTitle - 14th Global Forum on โ€ฆtsf2016venice.enit.it/images/articles/Presentations/s2/2...Wikipedia as a Big Data source for Tourism Global Forum on Tourism Statistics

Classification of Wikipedia articles

Method

Topic modelling using Latent Dirichlet Allocation (LDA) algorithm

For unclassified/uncertain categories, string match between some identified keywords and title of the article

Classified Wikipedia articles classified Wikidata items

Quality of classification

95.5% of the total number of Barcelona Wikidata items

89.7% of the total number of Bruges Wikidata items

79.2% of the total number of Vienna Wikidata items

Page 8: Big Data StrategyTitle - 14th Global Forum on โ€ฆtsf2016venice.enit.it/images/articles/Presentations/s2/2...Wikipedia as a Big Data source for Tourism Global Forum on Tourism Statistics

Classification of Wikipedia items

Barcelona โ€“ 14 categories

โ€ขSport 64.3%

โ€ขSagrada Familia 14.8%

โ€ขBuildings 9.0%

โ€ข Public transport 2.8%

โ€ขStreets and districts 2.4%

Bruges โ€“ 11 categories

โ€ขSport 40.1%

โ€ข Places of worship 23.3%

โ€ขDistricts 13.7%

โ€ขBuildings 8.7%

โ€ขStreets and streams 6.8%

Vienna โ€“ 23 categories

โ€ขHistory 38.4%

โ€ข Institutions/organizations 22.8%

โ€ขBuildings 7.9%

โ€ขMuseums 6.7%

โ€ขSport 3.7%

Page 10: Big Data StrategyTitle - 14th Global Forum on โ€ฆtsf2016venice.enit.it/images/articles/Presentations/s2/2...Wikipedia as a Big Data source for Tourism Global Forum on Tourism Statistics

Combined data sources

Type of model โ€ข Autoregressive Integrated Moving Average with Explanatory Variables

(ARIMAX)

Data โ€ข We try to model number of passengers (arrivals) and number of bookings

(overnight stays) using the previously classified series as external regressors

Procedure โ€ข We run the models and identified the significant parameters: No doubts on interpretation of positive parameters

Uncertainty about the negative ones

โ€ข Further analysis into each significant category with negative parameter Identify of the Wikipedia article that collects the highest number of page views

Remove that article from the category and put into the model separately

Page 11: Big Data StrategyTitle - 14th Global Forum on โ€ฆtsf2016venice.enit.it/images/articles/Presentations/s2/2...Wikipedia as a Big Data source for Tourism Global Forum on Tourism Statistics

Combined data sources

Barcelona ๐๐ฎ๐ฆ๐›๐ž๐ซ ๐จ๐Ÿ ๐ฉ๐š๐ฌ๐ฌ๐ž๐ง๐ ๐ž๐ซ๐ฌ = ๐๐ฎ๐ฆ๐›๐ž๐ซ ๐จ๐Ÿ ๐’‘๐’‚๐’”๐’”๐’†๐’๐’ˆ๐’†๐’“๐’”๐ญโˆ’๐Ÿ +๐ŸŽ. ๐Ÿ‘๐Ÿ”๐Ÿ๐Ÿ”๐Ÿ–๐Ÿ• ๐ฉ๐š๐ซ๐ค๐ฌ โˆ’ ๐ŸŽ. ๐Ÿ’๐Ÿ–๐Ÿ๐ŸŽ๐Ÿ๐Ÿ– ๐ฌ๐ฉ๐จ๐ซ๐ญ โˆ’๐ŸŽ. ๐Ÿ‘๐Ÿ•๐Ÿ๐Ÿ“๐ŸŽ๐Ÿ ๐†๐ซ๐š๐ง ๐“๐ž๐š๐ญ๐ซ๐จ ๐๐ž๐ฅ ๐‹๐ข๐œ๐ž๐จ + ๐๐’•

๐๐ฎ๐ฆ๐›๐ž๐ซ ๐จ๐Ÿ ๐›๐จ๐จ๐ค๐ข๐ง๐ ๐ฌ = ๐๐ฎ๐ฆ๐›๐ž๐ซ ๐จ๐Ÿ ๐›๐จ๐จ๐ค๐ข๐ง๐ ๐ฌ๐ญโˆ’๐Ÿ +๐ŸŽ. ๐Ÿ’๐Ÿ“๐Ÿ๐Ÿ๐Ÿ—๐Ÿ’ ๐ฉ๐š๐ซ๐ค๐ฌ โˆ’ ๐ŸŽ. ๐Ÿ‘๐Ÿ–๐Ÿ–๐Ÿ—๐Ÿ‘๐Ÿ– ๐ฌ๐ฉ๐จ๐ซ๐ญ + ๐›œ๐ญ

Bruges ๐๐ฎ๐ฆ๐›๐ž๐ซ ๐จ๐Ÿ ๐ฉ๐š๐ฌ๐ฌ๐ž๐ง๐ ๐ž๐ซ๐ฌ = โˆ’ ๐ŸŽ. ๐Ÿ•๐Ÿ’๐Ÿ“๐Ÿ”๐Ÿ๐Ÿ‘๐Ÿ— ๐๐ฎ๐ฆ๐›๐ž๐ซ ๐จ๐Ÿ ๐ฉ๐š๐ฌ๐ฌ๐ž๐ง๐ ๐ž๐ซ๐ฌ๐ญโˆ’๐Ÿ

+ ๐ŸŽ. ๐Ÿ๐Ÿ—๐Ÿ•๐Ÿ”๐Ÿ๐Ÿ‘๐Ÿ ๐›๐ฎ๐ข๐ฅ๐๐ข๐ง๐ ๐ฌ + ๐›œ๐ญ

๐๐ฎ๐ฆ๐›๐ž๐ซ ๐จ๐Ÿ ๐›๐จ๐จ๐ค๐ข๐ง๐ ๐ฌ = โˆ’ ๐ŸŽ. ๐Ÿ–๐ŸŽ๐Ÿ’๐ŸŽ๐Ÿ“๐Ÿ”๐Ÿ’ ๐๐ฎ๐ฆ๐›๐ž๐ซ ๐จ๐Ÿ ๐›๐จ๐จ๐ค๐ข๐ง๐ ๐ฌ๐ญโˆ’๐Ÿ + ๐ŸŽ. ๐Ÿ‘๐Ÿ๐Ÿ•๐Ÿ“๐Ÿ๐Ÿ•๐Ÿ” ๐›๐ฎ๐ข๐ฅ๐๐ข๐ง๐ ๐ฌ + ๐ŸŽ. ๐Ÿ“๐Ÿ๐Ÿ’๐Ÿ‘๐Ÿ“๐ŸŽ๐Ÿ‘ ๐œ๐จ๐ฆ๐ฉ๐š๐ง๐ข๐ž๐ฌ โˆ’ ๐ŸŽ. ๐Ÿ“๐Ÿ“๐Ÿ’๐Ÿ–๐Ÿ”๐Ÿ—๐Ÿ ๐๐ข๐ฌ๐ญ๐ซ๐ข๐œ๐ญ๐ฌโˆ’ ๐ŸŽ. ๐Ÿ๐Ÿ—๐Ÿ—๐Ÿ‘๐Ÿ•๐Ÿ“๐Ÿ ๐™๐ž๐ž๐›๐ซ๐ฎ๐ ๐ ๐ž โˆ’ ๐ŸŽ. ๐Ÿ๐ŸŽ๐Ÿ‘๐Ÿ•๐Ÿ–๐Ÿ–๐ŸŽ ๐๐ž๐ฅ๐Ÿ๐จ๐ซ๐ญ ๐•๐š๐ง ๐๐ซ๐ฎ๐ ๐ ๐ž + ๐›œ๐ญ

Page 12: Big Data StrategyTitle - 14th Global Forum on โ€ฆtsf2016venice.enit.it/images/articles/Presentations/s2/2...Wikipedia as a Big Data source for Tourism Global Forum on Tourism Statistics

Combined data sources

Vienna

๐๐ฎ๐ฆ๐›๐ž๐ซ ๐จ๐Ÿ ๐ฉ๐š๐ฌ๐ฌ๐ž๐ง๐ ๐ž๐ซ๐ฌ= ๐๐ฎ๐ฆ๐›๐ž๐ซ ๐จ๐Ÿ ๐ฉ๐š๐ฌ๐ฌ๐ž๐ง๐ ๐ž๐ซ๐ฌ๐ญโˆ’๐Ÿ + ๐ŸŽ. ๐Ÿ‘๐Ÿ๐Ÿ–๐Ÿ๐ŸŽ๐Ÿ‘ ๐ฉ๐ฅ๐š๐œ๐ž๐ฌ ๐จ๐Ÿ ๐ฐ๐จ๐ซ๐ฌ๐ก๐ข๐ฉ+ ๐ŸŽ. ๐Ÿ๐Ÿ–๐Ÿ—๐Ÿ”๐Ÿ๐Ÿ’ ๐ฆ๐จ๐ฎ๐ง๐ญ๐š๐ข๐ง๐ฌ + ๐ŸŽ. ๐Ÿ๐Ÿ’๐ŸŽ๐Ÿ“๐Ÿ’๐Ÿ‘ ๐œ๐ž๐ฆ๐ž๐ญ๐ž๐ซ๐ข๐ž๐ฌโˆ’ ๐ŸŽ. ๐Ÿ‘๐Ÿ”๐Ÿ๐Ÿ—๐Ÿ—๐ŸŽ ๐ข๐ง๐ฌ๐ญ๐ข๐ญ๐ฎ๐ญ๐ข๐จ๐ง๐ฌ ๐จ๐ซ๐ ๐š๐ง๐ข๐ณ๐š๐ญ๐ข๐จ๐ง๐ฌ โˆ’ ๐ŸŽ. ๐Ÿ‘๐Ÿ๐Ÿ”๐Ÿ”๐Ÿ๐Ÿ“ ๐›๐ฎ๐ข๐ฅ๐๐ข๐ง๐ ๐ฌโˆ’ ๐ŸŽ. ๐Ÿ๐Ÿ•๐Ÿ•๐Ÿ“๐Ÿ“๐Ÿ ๐’๐œ๐ก๐ฅ๐จ๐ฌ๐ฌ ๐’๐œ๐ก๐จ๐ง๐›๐ซ๐ฎ๐ง๐ง โˆ’ ๐ŸŽ. ๐Ÿ‘๐ŸŽ๐Ÿ๐Ÿ๐Ÿ’๐ŸŽ ๐”๐ง๐ข๐ฏ๐ž๐ซ๐ฌ๐ข๐ญรค๐ญ ๐–๐ข๐ž๐งโˆ’ ๐ŸŽ. ๐Ÿ๐Ÿ’๐Ÿ“๐ŸŽ๐Ÿ๐Ÿ– ร–๐ฌ๐ญ๐ž๐ซ๐ซ๐ž๐ข๐œ๐ก โˆ’ ๐”๐ง๐ ๐š๐ซ๐ง + ๐›œ๐ญ

๐๐ฎ๐ฆ๐›๐ž๐ซ ๐จ๐Ÿ ๐›๐จ๐จ๐ค๐ข๐ง๐ ๐ฌ= ๐๐ฎ๐ฆ๐›๐ž๐ซ ๐จ๐Ÿ ๐›๐จ๐จ๐ค๐ข๐ง๐ ๐ฌ๐ญโˆ’๐Ÿ + ๐ŸŽ. ๐Ÿ‘๐Ÿ’๐Ÿ’๐Ÿ๐Ÿ–๐Ÿ“ ๐ฉ๐ฅ๐š๐œ๐ž๐ฌ ๐จ๐Ÿ ๐ฐ๐จ๐ซ๐ฌ๐ก๐ข๐ฉ+ ๐ŸŽ. ๐Ÿ๐Ÿ๐Ÿ—๐Ÿ’๐Ÿ—๐Ÿ” ๐œ๐จ๐ฆ๐ฉ๐š๐ง๐ข๐ž๐ฌ + ๐ŸŽ. ๐Ÿ”๐Ÿ‘๐Ÿ–๐Ÿ–๐Ÿ’๐Ÿ– ๐ฆ๐ฎ๐ฌ๐ž๐ฎ๐ฆ๐ฌ + ๐ŸŽ. ๐Ÿ‘๐Ÿ๐Ÿ๐Ÿ•๐ŸŽ๐Ÿ— ๐ญ๐ก๐ž๐š๐ญ๐ซ๐ž๐ฌโˆ’ ๐ŸŽ. ๐Ÿ“๐Ÿ๐Ÿ—๐Ÿ–๐Ÿ•๐Ÿ” ๐ž๐ฆ๐›๐š๐ฌ๐ฌ๐ข๐ž๐ฌ โˆ’ ๐ŸŽ. ๐Ÿ—๐Ÿ‘๐Ÿ”๐Ÿ”๐Ÿ–๐Ÿ‘ ๐ซ๐ข๐ฏ๐ž๐ซ๐ฌ ๐š๐ง๐ ๐ฉ๐š๐ซ๐ค๐ฌ โˆ’ ๐ŸŽ. ๐Ÿ’๐Ÿ‘๐Ÿ๐Ÿ—๐Ÿ๐Ÿ“ ๐›๐ฎ๐ข๐ฅ๐๐ข๐ง๐ ๐ฌโˆ’ ๐ŸŽ. ๐Ÿ‘๐Ÿ‘๐Ÿ“๐Ÿ’๐Ÿ๐Ÿ“ ๐ก๐ข๐ ๐ก ๐ž๐๐ฎ๐œ๐š๐ญ๐ข๐จ๐ง โˆ’ ๐ŸŽ. ๐Ÿ’๐ŸŽ๐Ÿ—๐Ÿ’๐ŸŽ๐Ÿ ๐’๐œ๐ก๐จ๐ฌ๐ฌ ๐’๐œ๐ก๐จ๐ง๐›๐ซ๐ฎ๐ง๐งโˆ’ ๐ŸŽ. ๐Ÿ๐Ÿ•๐Ÿ๐Ÿ—๐Ÿ๐Ÿ” ๐”๐ง๐ข๐ฏ๐ž๐ซ๐ฌ๐ข๐ญรค๐ญ ๐–๐ข๐ž๐ง โˆ’ ๐ŸŽ. ๐Ÿ“๐Ÿ•๐Ÿ๐Ÿ—๐Ÿ•๐Ÿ“ ร–๐ฌ๐ญ๐ซ๐ž๐ซ๐ซ๐ž๐ข๐œ๐ก โˆ’ ๐”๐ง๐ ๐š๐ซ๐ง + ๐›œ๐ญ

Page 13: Big Data StrategyTitle - 14th Global Forum on โ€ฆtsf2016venice.enit.it/images/articles/Presentations/s2/2...Wikipedia as a Big Data source for Tourism Global Forum on Tourism Statistics

Conclusions and future research

Future developments

Edit history of a Wikipedia article

โ€ข Multicollinearity effects between page views series

โ€ข Lags in the Wikipedia series

โ€ข

โ€ข Split the tourism flow into residents' tourism and foreigners' tourism

Page 14: Big Data StrategyTitle - 14th Global Forum on โ€ฆtsf2016venice.enit.it/images/articles/Presentations/s2/2...Wikipedia as a Big Data source for Tourism Global Forum on Tourism Statistics

Thank you for your attention

Serena Signorelli [email protected]

Fernando Reis [email protected]

Silvia Biffignandi [email protected]

Barcelona Bruges Vienna