Linking Socio-economic and Demographic Characteristics to Twitter Topics in London
Guy Lansley
Department of Geography, University College London
ECTQG 2015
Bari, Italy
Contents
• Twitter Topic Classification
• Variations in topics between users
• Twitter topics and demographics
  • Inferring characteristics from names
• Twitter topics and socio-economics
  • Inferring characteristics from neighbourhoods
A small proportion of Twitter users make their location publicly available
Recap
• We have previously created a textual classification of 1.3 million georeferenced Tweets recorded from Inner London in 2013
  • 20 Groups of topics
  • 100 Subgroups
• The classification was created by running a Latent Dirichlet Allocation model on text-cleaned Tweets
• The topics of Tweets were found to vary by space and time
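The slides do not say how the Tweets were cleaned before the Latent Dirichlet Allocation model was run. The following is only a minimal sketch of one plausible cleaning step, using the Python standard library: the regular expressions, the tiny `STOP_WORDS` set and the function name `clean_tweet` are all assumptions made for illustration, not the procedure used in the study.

```python
import re

# Assumed patterns for illustration; the study's actual cleaning rules are not given.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "in", "at", "is", "on", "for"}


def clean_tweet(text):
    """Return lower-case word tokens with URLs, mentions and stop words removed."""
    text = URL_RE.sub(" ", text)        # drop links
    text = MENTION_RE.sub(" ", text)    # drop @user mentions
    text = text.lower().replace("#", " ")  # keep hashtag words, drop the symbol
    text = NON_LETTER_RE.sub(" ", text)  # drop digits and punctuation
    return [tok for tok in text.split() if tok not in STOP_WORDS and len(tok) > 1]


tokens = clean_tweet("Stuck on the Central line again @TfL #commute http://t.co/abc123")
print(tokens)
```

Token lists of this kind could then be passed to any LDA implementation as the per-Tweet documents.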
Figure: The sample area
Figure: The spatial distribution of Tweets from subgroup 13D: Education
20 Twitter Topics
1 Photography and Sights
2 Optimism, Kindness and Positivity
3 Leisure and Attractions
4 TV and Film
5 Humour and Informal Conversations
6 Transport and Travel
7 Politics, Beliefs and Current Affairs
8 Sport and Games
9 Anticipation and Socialising
10 Business, Information and Networking
11 Pessimism and Negativity
12 Music and Musicians
13 Routine Activities
14 Food and Drink
15 Body, Appearances and Clothes
16 Social Media and Apps
17 Slang and Profanities
18 Place and Check-Ins
19 Wishes and Gratitude
20 Foreign and Other
Disassociations between Twitter Topics
• Index of Dissimilarity
  • Based on Duncan and Duncan (1955)
  • Based on all the individual users who have 100 or more Tweets in our database
  • It measures how evenly each pair of topic groups is distributed across users
  • Higher values correspond to higher dissimilarity
Figure key: dissimilarity scores from 30 to 80
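As a rough illustration of the measure, the Duncan and Duncan (1955) index for two topic groups A and B is half the sum, over users, of the absolute difference between each user's share of all A Tweets and their share of all B Tweets. The sketch below assumes the scores are scaled to 0-100 (suggested by the 30 to 80 key, but not stated on the slide); the function name and the example counts are hypothetical.

```python
def dissimilarity_index(counts_a, counts_b):
    """Index of dissimilarity between two topic groups, scaled to 0-100.

    counts_a[i] and counts_b[i] are the numbers of Tweets that user i sent
    in topic groups A and B respectively.
    """
    total_a = sum(counts_a)
    total_b = sum(counts_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each topic group needs at least one Tweet")
    # Half the summed absolute difference in user shares, times 100.
    return 50.0 * sum(
        abs(a / total_a - b / total_b) for a, b in zip(counts_a, counts_b)
    )


# Hypothetical Tweet counts for four users in two topic groups.
transport = [10, 0, 5, 5]
routine = [8, 2, 6, 4]
score = dissimilarity_index(transport, routine)
print(round(score, 1))
```

A score of 0 means both topics are spread across users in identical proportions; 100 means no user Tweets about both.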
Disassociations between Twitter Topics
Gephi network graph of inverse dissimilarity scores
Twitter Users
• Twitter users who frequently Tweet about one particular topic are also more or less likely to Tweet about other topics
• So there is a pattern in behaviour on Twitter between users
• Can this be explained by the characteristics of Twitter users?
Those who Tweeted about Transport and Travel were also likely to talk about Routine Activities
Those who Tweeted about Business, Information and Networking were also likely to Tweet about Politics, Beliefs and Current Affairs
Those who regularly Tweeted in the Foreign and Other category were less likely to Tweet in any other category
Problem
• Twitter data does not include much information about the users, and the information which is collected is not standardised
• What useful information is absent?
  • Demographics (age, gender)
  • Ethnicity
  • Occupation and Socio-economics
  • Etc…
Inferring Characteristics
• There are three main means of inferring characteristics about Twitter users:
  • Harnessing characteristics from names
  • Estimating probable area of residential location from georeferenced Tweets
  • Text mining the user bios
• But all three approaches have limitations
  • E.g.
    • Not all users use their real names
    • Few users use the geolocation function frequently
    • There is no consistency between what users put in their Twitter bios, which makes the inference of set characteristics extremely difficult
What can names tell us?
• There is already plenty of research into naming conventions and what can be inferred about individuals from their names
  • E.g. ONOMAP – a classification of forename-surname pairs into cultural, ethnic and linguistic groups
  • For use on Twitter data see Longley et al., 2015
UK Forenames Database
• 32,000 unique names
• Built from:
  • Market data
    • From CACI Ltd.
    • Representative of 7,085,617
    • Only includes individuals aged 18+
  • Birth certificate records
    • From the Office for National Statistics
    • Representative of 10,412,724 births in England and Wales
    • Only includes individuals below 18
• Both sources are referenced to 2012
• Data reweighted by the age-gender distributions from the 2011 Census
The Most Common Forenames

The projected most common forenames

Rank  Female                       Male
      Name        Estimated Pop.   Name      Estimated Pop.
1     MARGARET    555,000          JOHN      1,003,500
2     SUSAN       544,600          DAVID     969,000
3     SARAH       425,600          JAMES     614,600
4     ELIZABETH   363,200          MICHAEL   612,800
5     PATRICIA    350,200          PAUL      546,700
6     MARY        332,500          ROBERT    482,800
7     CHRISTINE   321,400          PETER     480,600
8     JULIE       314,600          ANDREW    435,700
9     KAREN       313,700          WILLIAM   415,300
10    LINDA       298,200          MARK      370,700

The most common name for each age band

Age Group   Female     Male
0-4         OLIVIA     OLIVER
5-9         EMILY      JACK
10-14       CHLOE      JACK
15-19       SARAH      JAMES
20-24       SARAH      JAMES
25-29       SARAH      DAVID
30-34       SARAH      DAVID
35-39       SARAH      PAUL
40-44       KAREN      PAUL
45-49       JULIE      DAVID
50-54       SUSAN      DAVID
55-59       SUSAN      DAVID
60-64       SUSAN      JOHN
65-69       MARGARET   JOHN
70-74       MARGARET   JOHN
75-79       MARGARET   JOHN
80-84       MARGARET   JOHN
85+         MARGARET   JOHN
Forenames – Gender
Forenames – Age (Females)
5 clusters of forenames based on their age distributions
Forenames – Age (Males)
5 clusters of forenames based on their age distributions
Isolating Names from Twitter Data
• A text analytics algorithm was used to tokenise Twitter user names to extract probable forenames
• The algorithm used space characters within the user names (where available) to split them into separate string tokens
• A database of over 300 million names was then used to identify whether the tokens were probable surnames or forenames
Figure: ‘User Name’ field divided into separate ‘forename’ and ‘surname’ fields (Longley et al., 2015)
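The tokenise-and-look-up step can be sketched as follows. This is an illustrative approximation only: the two small sets stand in for the database of over 300 million names, and the function name, the punctuation stripping and the rule for choosing between candidate tokens are assumptions rather than the algorithm of Longley et al. (2015).

```python
# Hypothetical miniature lookup tables standing in for the name database.
FORENAMES = {"sarah", "james", "guy", "olivia"}
SURNAMES = {"smith", "jones", "lansley"}


def split_user_name(user_name):
    """Split a Twitter 'User Name' into a probable forename and surname.

    The name is split on whitespace; tokens that are not purely alphabetic
    after stripping edge punctuation are discarded (an assumed rule).
    """
    tokens = [t.strip(".,_-").lower() for t in user_name.split()]
    tokens = [t for t in tokens if t.isalpha()]
    # First token found in the forename table, if any.
    forename = next((t for t in tokens if t in FORENAMES), None)
    # Last token found in the surname table that is not the forename.
    surname = next(
        (t for t in reversed(tokens) if t in SURNAMES and t != forename), None
    )
    return {"forename": forename, "surname": surname}


print(split_user_name("Sarah Smith"))
print(split_user_name("LondonFoodie"))
```

User names without spaces, such as the second example, yield no match here, which mirrors the "where available" caveat on the slide.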
The Inferred Demographic Structure of Tweeters
Figure panels: The O2 Arena; The Emirates Stadium; Canary Wharf; Westfield Stratford
Age & Gender and Twitter Topics
Z-Scores
Socio-economics
• There is an association between names and socio-economics and geodemographics
  • E.g. Top 5 forenames for each 2011 OAC Supergroup

Rural Residents   Cosmopolitans   Ethnicity Central   Multicultural Metropolitans
PENELOPE          TOM             MOHAMED             MOHAMMED
HUGH              NICK            AHMED               MUHAMMAD
ALASTAIR          HARRIET         ALI                 MOHAMMAD
ROSEMARY          MAX             JOSE                ABDUL
PHILIPPA          ALEX            ABDUL               AHMED

Urbanites   Suburbanites   Constrained City Dwellers   Hard-Pressed Living
TOBY        HILARY         LILLIAN                     KAYLEIGH
PHILIPPA    GEOFFREY       MAY                         LEANNE
JEREMY      KATHRYN        ETHEL                       LYNDSEY
KATHERINE   JILL           KAYLEIGH                    STACEY
DUNCAN      GILLIAN        ELSIE                       KYLE

Data: 2011 Enhanced Electoral Roll (CACI UK Ltd)
Socio-economics
• However, this association is more tenuous than the one used to estimate age and gender, due to social mobility and popular trends in baby names
• Therefore, we have instead tried to link neighbourhood characteristics to Tweets by estimating the probable residential areas of users¹
  • Such as NS-SEC², deprivation, etc…
¹ Using Census Output Areas (OAs) as neighbourhood units. OAs are typically representative of 309 individuals as recorded in the 2011 Census. Most residential data from the 2011 Census are available at this geography
² National Statistics Socio-economic Classification
Assigning Census Data to Tweets
• All of the Tweets from the model from users who had sent multiple Tweets were tested
• Using the Generalised Land Use Database (GLUD), Tweets sent from residential land parcels could be identified
• Each user was assigned the Census Output Area from which they had submitted the most text within residential land parcels
• Some neighbourhood characteristics, such as deprivation and NS-SEC, could be broadly representative of the individual due to spatial segregation
• Each Twitter user was assigned the proportion of each NS-SEC group in their residential OA
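The assignment rule can be sketched as below. The record layout (keys `user`, `oa`, `residential`, `text`), the OA codes, and the use of character count as the measure of "most text" are all assumptions for illustration; the slides do not specify how the volume of text was measured or how the GLUD lookup was implemented.

```python
from collections import defaultdict


def assign_home_oa(tweets):
    """Map each user to the Output Area where they sent the most residential text.

    `tweets` is an iterable of dicts with hypothetical keys:
    user, oa (Output Area code), residential (bool), text.
    Text volume is measured in characters (an assumption).
    """
    chars = defaultdict(lambda: defaultdict(int))
    for tw in tweets:
        if tw["residential"]:  # keep only Tweets sent from residential parcels
            chars[tw["user"]][tw["oa"]] += len(tw["text"])
    # Users with no residential Tweets are simply absent from the result.
    return {user: max(by_oa, key=by_oa.get) for user, by_oa in chars.items()}


# Hypothetical records; the OA codes are placeholders, not real Census codes.
sample = [
    {"user": "u1", "oa": "OA_A", "residential": True, "text": "home at last"},
    {"user": "u1", "oa": "OA_B", "residential": True, "text": "hi"},
    {"user": "u1", "oa": "OA_C", "residential": False, "text": "at the office all day long"},
    {"user": "u2", "oa": "OA_B", "residential": True, "text": "good morning"},
    {"user": "u3", "oa": "OA_C", "residential": False, "text": "lunch"},
]
home = assign_home_oa(sample)
print(home)
```

Each user's NS-SEC profile would then be read from a Census table keyed by the assigned OA.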
Twitter Topics and NS-SEC Groups
Location Quotients (1 = average penetration)

NS-SEC groups (column key):
  Higher M&P   Higher managerial & professional occupations
  Lower M&P    Lower managerial & professional occupations
  Intermed.    Intermediate occupations
  Lower S&T    Lower supervisory & technical occupations
  Semi-rout.   Semi routine occupations
  Routine      Routine occupations
  Never/LTU    Never worked & long term unemployed

Twitter Topic Group                    Higher M&P  Lower M&P   Intermed.   Lower S&T   Semi-rout.  Routine     Never/LTU
Photography and Tourism                1.16        1.05        0.89        0.9         0.86        0.87        0.92
Optimism, Kindness and Positivity      1.04        1.02        1.01        0.98        0.95        0.95        0.95
Leisure and Attractions                1.13        1.06        0.93        0.89        0.88        0.89        0.91
TV and Film                            1.02        1.01        1.02        0.99        0.98        0.98        0.97
Humour and Informal Conversations      0.91        0.95        1.05        1.07        1.1         1.07        1.05
Transport and Travelling               1.02        1.01        1.01        0.99        0.99        0.98        0.98
Politics, Beliefs and Current Affairs  1.06        1.03        0.99        0.94        0.95        0.97        0.97
Sport and Games                        1.00        0.99        1.04        1.00        1.01        1.01        1.01
Anticipation and Socialising           1.00        1.00        1.01        1.01        1.01        1.00        0.98
Business, Information and Networking   1.15        1.06        0.95        0.91        0.87        0.87        0.89
Pessimism and Negativity               0.95        0.97        1.03        1.04        1.05        1.05        1.03
Music and Musicians                    0.98        1.00        1.00        1.01        1.02        1.02        1.00
Routine Activities                     0.96        0.98        1.03        1.03        1.04        1.02        1.01
Food and Drink                         1.1         1.05        0.96        0.92        0.91        0.91        0.94
Body, Appearances and Clothes          0.98        0.99        1.02        1.01        1.03        1.02        1.02
Social Media and Apps                  1.03        1.01        0.99        0.99        0.99        0.98        0.99
Slang and Profanities                  0.83        0.92        1.05        1.09        1.16        1.16        1.14
Place and Check-Ins                    1.18        1.12        0.83        0.81        0.79        0.82        0.89
Wishes and Gratitude                   0.94        0.97        1.04        1.06        1.06        1.04        1.02
Foreign and Other                      0.97        1.00        0.97        1.05        0.99        1.04        1.05
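A location quotient compares a topic's share of Tweets within one group with its share across everyone, so that 1 corresponds to average penetration. The slides do not give the exact formula used, so the sketch below is one plausible formulation under stated assumptions: each user's Tweets are weighted by the proportion of the given NS-SEC group in their assigned OA. The data layout, class labels and numbers are hypothetical.

```python
def location_quotient(users, topic, nssec_class):
    """Location quotient of `topic` for `nssec_class` (1 = average penetration).

    `users` is a list of dicts with hypothetical keys:
      topic_counts: {topic name: number of Tweets}
      nssec: {NS-SEC class: proportion of that class in the user's home OA}
    Each user's Tweets are weighted by their OA's NS-SEC proportion (an assumption).
    """
    topic_all = sum(u["topic_counts"].get(topic, 0) for u in users)
    tweets_all = sum(sum(u["topic_counts"].values()) for u in users)
    topic_w = sum(
        u["topic_counts"].get(topic, 0) * u["nssec"][nssec_class] for u in users
    )
    tweets_w = sum(
        sum(u["topic_counts"].values()) * u["nssec"][nssec_class] for u in users
    )
    # Weighted topic share divided by the overall topic share.
    return (topic_w / tweets_w) / (topic_all / tweets_all)


# Two hypothetical users with made-up counts and OA proportions.
users = [
    {"topic_counts": {"Place and Check-Ins": 6, "Slang and Profanities": 4},
     "nssec": {"higher": 0.5, "routine": 0.1}},
    {"topic_counts": {"Place and Check-Ins": 2, "Slang and Profanities": 8},
     "nssec": {"higher": 0.1, "routine": 0.5}},
]
lq_higher = location_quotient(users, "Place and Check-Ins", "higher")
lq_routine = location_quotient(users, "Place and Check-Ins", "routine")
print(round(lq_higher, 2), round(lq_routine, 2))
```

Values above 1 indicate that a topic is over-represented among users from neighbourhoods with a high share of that NS-SEC group, and values below 1 the reverse. The sketch does not guard against empty inputs or topics with no Tweets.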
Conclusions
• Based on a large sample of georeferenced Tweets in London from 2013:
  • There is a distinctive pattern of associations and dissociations between the frequency of topics per user
  • The popularity of certain topics on Twitter does vary by age and gender
  • The popularity of topics also varies by neighbourhood socio-economics
  • Demographic characteristics of Twitter users could be estimated using the registered users’ forenames
  • It is also possible to link neighbourhood statistics to georeferenced Twitter users by locating their probable neighbourhood of origin from their activity
    • Although it is not possible to confidently identify the correct residential locations
References
• Blei, D., Ng, A. and Jordan, M. (2003) Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022
• Duncan, O. D. and Duncan, B. (1955) A methodological analysis of segregation indexes. American Sociological Review, 20, 210–217
• Lansley, G. and Longley, P. A. (2015) The Geography of Twitter Topics in London. In Press
• Lansley, G. and Longley, P. A. (2015) Deriving Age and Gender from Forenames for Consumer Analytics. In Press
• Longley, P. A., Adnan, M. and Lansley, G. (2015) The geotemporal demographics of Twitter usage. Environment and Planning A, 47(2), 465–484
• Mateos, P., Longley, P. A. and O’Sullivan, D. (2011) Ethnicity and population structure in personal naming networks. PLoS ONE, 6(9), e22943, 1–12
Guy Lansley
Department of Geography
@GuyLansley
End