
Linking Socio-economic and Demographic Characteristics to Twitter Topics in London

Guy Lansley
Department of Geography, University College London

@[email protected]

ECTQG 2015
Bari, Italy

Contents
• Twitter Topic Classification
• Variations in topics between users
• Twitter topics and demographics
  • Inferring characteristics from names
• Twitter topics and socio-economics
  • Inferring characteristics from neighbourhoods

A small proportion of Twitter users make their location publicly available

Twitter

Recap
• We have previously created a textual classification of 1.3 million georeferenced Tweets recorded from Inner London in 2013
  • 20 Groups of topics
  • 100 Subgroups
• The classification was created by running a Latent Dirichlet Allocation model on text-cleaned Tweets
• The topics of Tweets were found to vary by space and time

The sample area

The spatial distribution of Tweets from subgroup 13D: Education

20 Twitter Topics
1 Photography and Sights
2 Optimism, Kindness and Positivity
3 Leisure and Attractions
4 TV and Film
5 Humour and Informal Conversations
6 Transport and Travel
7 Politics, Beliefs and Current Affairs
8 Sport and Games
9 Anticipation and Socialising
10 Business, Information and Networking
11 Pessimism and Negativity
12 Music and Musicians
13 Routine Activities
14 Food and Drink
15 Body, Appearances and Clothes
16 Social Media and Apps
17 Slang and Profanities
18 Place and Check-Ins
19 Wishes and Gratitude
20 Foreign and Other

[Figure: one panel per topic group, in the order listed above. Panel titles match the list except for Photography and Tourism (group 1), Music (group 12) and Body, Appearances and Fashion (group 15).]

Disassociations between Twitter Topics
• Index of Dissimilarity
  • Based on Duncan and Duncan (1955)
  • Based on all the individual users who have 100 or more Tweets in our database

[Figure: matrix of pairwise Index of Dissimilarity scores between the 20 topic groups. Key: 30 to 80.]

• It measures how evenly each pair of topic groups is distributed across users

• Higher values correspond with higher dissimilarity
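
As a rough illustration of the measure, the Index of Dissimilarity between two topic groups A and B can be written as D = 100 × ½ × Σ |a_i / A − b_i / B|, where a_i and b_i are user i's Tweet counts in each group and A and B are the group totals. The sketch below assumes the 0 to 100 scaling suggested by the figure key; the function name and the per-user counts are hypothetical and are not taken from the study.

```python
def dissimilarity_index(counts_a, counts_b):
    """Duncan and Duncan (1955) Index of Dissimilarity, scaled to 0-100.

    counts_a[i] and counts_b[i] are the numbers of Tweets that user i
    sent in topic group A and topic group B respectively.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both topics need one count per user")
    total_a = sum(counts_a)
    total_b = sum(counts_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each topic needs at least one Tweet")
    # Sum, over users, the gap between the user's share of topic A
    # and the same user's share of topic B, then halve and scale.
    gap = sum(abs(a / total_a - b / total_b)
              for a, b in zip(counts_a, counts_b))
    return 50.0 * gap


# Hypothetical counts for four users (illustrative values only).
transport = [10, 0, 5, 5]
routine = [8, 2, 5, 5]
foreign = [0, 20, 0, 0]

d_similar = dissimilarity_index(transport, routine)    # topics shared by the same users
d_distinct = dissimilarity_index(transport, foreign)   # topics used by different users
```

In this toy data the two topics that share users score low, while the topic concentrated in a single user who posts in nothing else scores at the top of the scale.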

Disassociations between Twitter Topics

Network graph (drawn in Gephi) of inverse dissimilarity scores
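
One way such a network could be prepared is to turn each pairwise score into an edge weight and write an edge list that Gephi's spreadsheet importer can read (it recognises `Source`, `Target` and `Weight` columns). This is only a sketch: the slides do not say how the scores were inverted, so the weight 1 / D used here is an assumption, and the scores below are made up.

```python
import csv
import io
from itertools import combinations


def inverse_dissimilarity_edges(topics, dissimilarity):
    """Yield (source, target, weight) rows, one per pair of topics.

    `dissimilarity` maps a frozenset of two topic names to a score on
    the 0-100 scale. Weight = 1 / score is an assumption, not the
    authors' documented choice.
    """
    for a, b in combinations(topics, 2):
        score = dissimilarity[frozenset((a, b))]
        if score > 0:  # identical distributions would need special handling
            yield a, b, 1.0 / score


def edges_to_csv(rows):
    """Serialise edge rows as CSV text with Gephi-style column names."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical scores for three topic groups (illustrative only).
topics = ["Transport and Travel", "Routine Activities", "Foreign and Other"]
scores = {
    frozenset(("Transport and Travel", "Routine Activities")): 40.0,
    frozenset(("Transport and Travel", "Foreign and Other")): 80.0,
    frozenset(("Routine Activities", "Foreign and Other")): 50.0,
}
csv_text = edges_to_csv(inverse_dissimilarity_edges(topics, scores))
```

With this weighting, topic pairs that are distributed similarly across users are joined by heavier edges and so sit closer together in a force-directed layout.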

Twitter Users
• Twitter users who frequently Tweet about one particular topic are also more or less likely to Tweet about certain other topics
• So there is a pattern in behaviour on Twitter between users
• Can this be explained by the characteristics of Twitter users?

Those who Tweeted about Transport and Travel were also likely to talk about Routine Activities

Those who Tweeted about Business, Information and Networking were likely to also Tweet about Politics, Beliefs and Current Affairs

Those who regularly Tweeted in the Foreign and Other category were less likely to Tweet in any other category

Problem
• Twitter data does not include much information about the users, and the information which is collected is not standardised
• What useful information is absent?
  • Demographics (age, gender)
  • Ethnicity
  • Occupation and Socio-economics
  • Etc…

Inferring Characteristics
• There are three main means of inferring characteristics about Twitter users:
  • Harnessing characteristics from names
  • Estimating probable area of residential location from georeferenced Tweets
  • Text mining the user bios
• But all three approaches have limitations
  • E.g.
    • Not all users use their real names
    • Few users use the geolocation function frequently
    • There is no consistency between what users put in their Twitter bios, which makes the inference of set characteristics extremely difficult

What can names tell us?
• There is already plenty of research into naming conventions and what can be inferred about individuals from their names
• E.g. ONOMAP – a classification of forename-surname pairs into cultural, ethnic and linguistic groups
• For use on Twitter data see Longley et al., 2015

UK Forenames Database
• 32,000 unique names
• Built from:
  • Market data
    • From CACI Ltd.
    • Representative of 7,085,617
    • Only includes individuals aged 18+
  • Birth certificate records
    • From the Office for National Statistics
    • Representative of 10,412,724 births in England and Wales
    • Only includes individuals below 18
• Both sources are referenced to 2012
• Data reweighted by the age-gender distributions from the 2011 Census

The Most Common Forenames

The projected most common forenames

Rank  Female name  Estimated Pop.  Male name  Estimated Pop.
1     MARGARET     555,000         JOHN       1,003,500
2     SUSAN        544,600         DAVID      969,000
3     SARAH        425,600         JAMES      614,600
4     ELIZABETH    363,200         MICHAEL    612,800
5     PATRICIA     350,200         PAUL       546,700
6     MARY         332,500         ROBERT     482,800
7     CHRISTINE    321,400         PETER      480,600
8     JULIE        314,600         ANDREW     435,700
9     KAREN        313,700         WILLIAM    415,300
10    LINDA        298,200         MARK       370,700

The most common name for each age band

Age Group  Female    Male
0-4        OLIVIA    OLIVER
5-9        EMILY     JACK
10-14      CHLOE     JACK
15-19      SARAH     JAMES
20-24      SARAH     JAMES
25-29      SARAH     DAVID
30-34      SARAH     DAVID
35-39      SARAH     PAUL
40-44      KAREN     PAUL
45-49      JULIE     DAVID
50-54      SUSAN     DAVID
55-59      SUSAN     DAVID
60-64      SUSAN     JOHN
65-69      MARGARET  JOHN
70-74      MARGARET  JOHN
75-79      MARGARET  JOHN
80-84      MARGARET  JOHN
85+        MARGARET  JOHN

Forenames – Gender

Forenames – Age (Females)

5 clusters of forenames based on their age distributions

Forenames – Age (Males)

5 clusters of forenames based on their age distributions
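
To make the idea of reading gender and age from a forename concrete, here is a minimal sketch of a lookup against a table of estimated population counts by forename, gender and age band. The table contents, the function name and the returned fields are all hypothetical; the real database and the way it was queried are not described in the slides.

```python
from collections import defaultdict

# Hypothetical (forename, gender, age band) -> estimated population.
# These figures are invented for illustration only.
NAME_COUNTS = {
    ("SARAH", "F", "25-29"): 900,
    ("SARAH", "F", "30-34"): 1100,
    ("ALEX", "M", "20-24"): 600,
    ("ALEX", "F", "20-24"): 200,
}


def infer_from_forename(name, counts=NAME_COUNTS):
    """Return the share of bearers who are female and the modal age band.

    Returns None when the forename is not in the table.
    """
    by_gender = defaultdict(float)
    by_age = defaultdict(float)
    for (forename, gender, age_band), n in counts.items():
        if forename == name.upper():
            by_gender[gender] += n
            by_age[age_band] += n
    total = sum(by_gender.values())
    if total == 0:
        return None
    return {
        "p_female": by_gender["F"] / total,
        "modal_age_band": max(by_age, key=by_age.get),
    }


sarah = infer_from_forename("Sarah")
alex = infer_from_forename("alex")
unknown = infer_from_forename("Zzz")
```

A name used by both genders, like the invented ALEX rows here, yields a probability rather than a hard label, which is one reason name-based inference works better in aggregate than for individuals.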

Isolating Names from Twitter Data
• A text analytics algorithm was used to tokenise Twitter user names to extract probable forenames.
  • The algorithm used space characters within the user names (where available) to split them into separate string tokens
  • A database of over 300 million names was then used to identify if the tokens were probable surnames or forenames

‘User Name’ field divided into separate ‘forename’ and ‘surname’ fields. (Longley et al., 2015)
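
The tokenising step described above might look something like the following sketch. It splits a user name on whitespace and checks each token against reference lists; the two small sets stand in for the database of over 300 million names, and the cleaning rules (trimming punctuation, dropping tokens with non-letters) are assumptions rather than the published algorithm.

```python
# Hypothetical miniature reference lists standing in for the names database.
KNOWN_FORENAMES = {"GUY", "SARAH", "JOHN"}
KNOWN_SURNAMES = {"LANSLEY", "SMITH"}


def split_user_name(user_name):
    """Split a Twitter 'User Name' into probable forename and surname."""
    # Split on whitespace, trim edge punctuation, normalise case.
    tokens = [t.strip(".,_-").upper() for t in user_name.split()]
    # Keep purely alphabetic tokens only (an assumed cleaning rule).
    tokens = [t for t in tokens if t.isalpha()]
    forename = next((t for t in tokens if t in KNOWN_FORENAMES), None)
    surname = next(
        (t for t in tokens if t in KNOWN_SURNAMES and t != forename), None
    )
    return {"forename": forename, "surname": surname}


full = split_user_name("Guy Lansley")
first_only = split_user_name("John")
no_match = split_user_name("xX_gamer_Xx")
```

User names without spaces or real names simply return no match, which mirrors the limitation noted earlier that not all users use their real names.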

The Inferred Demographic Structure of Tweeters

The O2 Arena The Emirates stadium Canary Wharf Westfield Stratford

Age & Gender and Twitter Topics
Z-Scores

Socio-economics
• There is an association between names and socio-economics and geodemographics
  • E.g. Top 5 forenames for each 2011 OAC Supergroup

Rural Residents  Cosmopolitans  Ethnicity Central  Multicultural Metropolitans
PENELOPE         TOM            MOHAMED            MOHAMMED
HUGH             NICK           AHMED              MUHAMMAD
ALASTAIR         HARRIET        ALI                MOHAMMAD
ROSEMARY         MAX            JOSE               ABDUL
PHILIPPA         ALEX           ABDUL              AHMED

Urbanites   Suburbanites  Constrained City Dwellers  Hard-Pressed Living
TOBY        HILARY        LILLIAN                    KAYLEIGH
PHILIPPA    GEOFFREY      MAY                        LEANNE
JEREMY      KATHRYN       ETHEL                      LYNDSEY
KATHERINE   JILL          KAYLEIGH                   STACEY
DUNCAN      GILLIAN       ELSIE                      KYLE

Data: 2011 Enhanced Electoral Roll (CACI UK Ltd)

Socio-economics
• However, this association is more tenuous than the association with age and gender, owing to social mobility and popular trends in baby names
• Therefore, we have instead tried to link neighbourhood characteristics to Tweets by estimating the probable residential areas of users¹
  • Such as NS-SEC², deprivation, etc…

¹ Using Census Output Areas (OAs) as neighbourhood units. OAs typically represent 309 individuals, as recorded in the 2011 Census. Most residential data from the 2011 Census is available at this geography

² National Statistics Socio-economic Classification

Assigning Census Data to Tweets
• All of the Tweets from the model from users who had sent multiple Tweets were tested
• Using the Generalised Land Use Database (GLUD), Tweets sent from residential land parcels could be identified
• Each user was assigned the Census Output Area from which they had submitted the most text within residential land parcels.
• Some neighbourhood characteristics could be broadly representative of the individual, such as deprivation and NS-SEC, owing to spatial segregation.
• Each Twitter user was assigned the value of the proportion of each NS-SEC group in their residential OA
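
The assignment steps above can be sketched as follows. The sketch assumes each Tweet already carries an Output Area code and a flag saying whether it fell inside a residential land parcel (in practice that flag would come from a spatial join with GLUD), and it measures "most text" by character count, which is an assumption. The OA codes and Tweets are invented.

```python
from collections import Counter, defaultdict


def assign_home_output_area(tweets):
    """Pick, per user, the OA with the most text sent from residential land.

    `tweets` is an iterable of (user_id, oa_code, is_residential, text).
    Users with no residential Tweets are left unassigned.
    """
    text_by_user = defaultdict(Counter)
    for user_id, oa_code, is_residential, text in tweets:
        if is_residential:
            # "Most text" is measured in characters here (an assumption).
            text_by_user[user_id][oa_code] += len(text)
    return {
        user: counter.most_common(1)[0][0]
        for user, counter in text_by_user.items()
    }


# Hypothetical Tweets and OA codes (illustrative only).
sample_tweets = [
    ("u1", "OA_A", True, "at home again"),
    ("u1", "OA_B", True, "hi"),
    ("u1", "OA_C", False, "a long message sent from the office today"),
    ("u2", "OA_B", True, "morning"),
    ("u3", "OA_C", False, "never at home"),
]
homes = assign_home_output_area(sample_tweets)

# Hypothetical NS-SEC proportions per OA, attached to each assigned user.
nssec_by_oa = {"OA_A": [0.3, 0.7], "OA_B": [0.6, 0.4]}
user_nssec = {user: nssec_by_oa[oa] for user, oa in homes.items()}
```

Note that the third user never Tweets from residential land and so receives no neighbourhood at all, which is one way the sample shrinks at this stage.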

Twitter Topics and NS-SEC Groups

Location Quotients (1 = average penetration)

Columns (NS-SEC group):
(1) Higher managerial & professional occupations
(2) Lower managerial & professional occupations
(3) Intermediate occupations
(4) Lower supervisory & technical occupations
(5) Semi routine occupations
(6) Routine occupations
(7) Never worked & long term unemployed

Twitter Topic Group                     (1)   (2)   (3)   (4)   (5)   (6)   (7)
Photography and Tourism                 1.16  1.05  0.89  0.9   0.86  0.87  0.92
Optimism, Kindness and Positivity       1.04  1.02  1.01  0.98  0.95  0.95  0.95
Leisure and Attractions                 1.13  1.06  0.93  0.89  0.88  0.89  0.91
TV and Film                             1.02  1.01  1.02  0.99  0.98  0.98  0.97
Humour and Informal Conversations       0.91  0.95  1.05  1.07  1.1   1.07  1.05
Transport and Travelling                1.02  1.01  1.01  0.99  0.99  0.98  0.98
Politics, Beliefs and Current Affairs   1.06  1.03  0.99  0.94  0.95  0.97  0.97
Sport and Games                         1.00  0.99  1.04  1.00  1.01  1.01  1.01
Anticipation and Socialising            1.00  1.00  1.01  1.01  1.01  1.00  0.98
Business, Information and Networking    1.15  1.06  0.95  0.91  0.87  0.87  0.89
Pessimism and Negativity                0.95  0.97  1.03  1.04  1.05  1.05  1.03
Music and Musicians                     0.98  1.00  1.00  1.01  1.02  1.02  1.00
Routine Activities                      0.96  0.98  1.03  1.03  1.04  1.02  1.01
Food and Drink                          1.1   1.05  0.96  0.92  0.91  0.91  0.94
Body, Appearances and Clothes           0.98  0.99  1.02  1.01  1.03  1.02  1.02
Social Media and Apps                   1.03  1.01  0.99  0.99  0.99  0.98  0.99
Slang and Profanities                   0.83  0.92  1.05  1.09  1.16  1.16  1.14
Place and Check-Ins                     1.18  1.12  0.83  0.81  0.79  0.82  0.89
Wishes and Gratitude                    0.94  0.97  1.04  1.06  1.06  1.04  1.02
Foreign and Other                       0.97  1.00  0.97  1.05  0.99  1.04  1.05
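
For readers unfamiliar with location quotients, the sketch below shows one plausible way a table like this could be computed: for each topic, take the mean NS-SEC proportion attached to its Tweets and divide by the mean across all Tweets, so that 1 means average penetration. The slides do not spell out the exact calculation, so the per-Tweet averaging, the function name and the data are assumptions.

```python
def location_quotients(records, n_groups):
    """Compute a location quotient per topic and NS-SEC group.

    `records` is an iterable of (topic, shares) pairs, one per Tweet,
    where `shares` lists the NS-SEC proportions of the sender's OA.
    Assumes every group has a non-zero overall mean.
    """
    overall = [0.0] * n_groups
    sums_by_topic = {}
    counts = {}
    total = 0
    for topic, shares in records:
        sums = sums_by_topic.setdefault(topic, [0.0] * n_groups)
        for g in range(n_groups):
            sums[g] += shares[g]
            overall[g] += shares[g]
        counts[topic] = counts.get(topic, 0) + 1
        total += 1
    # Topic mean divided by overall mean: 1.0 is average penetration.
    return {
        topic: [
            (sums[g] / counts[topic]) / (overall[g] / total)
            for g in range(n_groups)
        ]
        for topic, sums in sums_by_topic.items()
    }


# Hypothetical data with two NS-SEC groups (illustrative only).
sample_records = [
    ("Place and Check-Ins", [0.6, 0.4]),
    ("Place and Check-Ins", [0.6, 0.4]),
    ("Slang and Profanities", [0.2, 0.8]),
    ("Slang and Profanities", [0.2, 0.8]),
]
lq = location_quotients(sample_records, n_groups=2)
```

Values above 1 indicate that a topic is over-represented among users from neighbourhoods with a high share of that NS-SEC group, and values below 1 the reverse.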

Conclusions
• Based on a large sample of georeferenced Tweets in London from 2013:
  • There is a distinctive pattern of associations and dissociations between the frequency of topics per user
  • The popularity of certain topics on Twitter does vary by age and gender
  • The popularity of topics also varies by neighbourhood socio-economics
  • Demographic characteristics of Twitter users could be estimated using the registered users’ forenames
  • It is also possible to link neighbourhood statistics to georeferenced Twitter users by locating their probable neighbourhood of origin from their activity.
    • Although it is not possible to confidently identify the correct residential locations

References
• Blei, D., Ng, A. and Jordan, M. (2003) Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022
• Duncan, O. D. and Duncan, B. (1955) A methodological analysis of segregation indexes. American Sociological Review, 20, 210–217
• Lansley, G. and Longley, P. A. (2015) The Geography of Twitter Topics in London. In Press
• Lansley, G. and Longley, P. A. (2015) Deriving Age and Gender from Forenames for Consumer Analytics. In Press
• Longley, P. A., Adnan, M. and Lansley, G. (2015) The geotemporal demographics of Twitter usage. Environment and Planning A, 47(2), 465–484
• Mateos, P., Longley, P. A. and O’Sullivan, D. (2011) Ethnicity and population structure in personal naming networks. PLoS ONE, 6(9), e22943, 1–12

Guy Lansley
Department of Geography
[email protected] @GuyLansley

End