
Prediction of Reaction towards Textual Posts in Social Networks

Mohamed Mahmoud ([email protected])

Abstract

Posting on social networks could be a gratifying or a terrifying experience depending on the reaction the post and its author —by association— receive from the readers. To better understand what makes a post popular, this project inquires into the factors that determine the number of likes, comments, and shares a textual post gets on LinkedIn; and finds a predictor function that can estimate those quantitative social gestures.

Keywords: Linear Regression; LinkedIn; Machine Learning; Popularity; Social Networks

1 Introduction

Social media have been playing a prominent role in our daily lives in the past decade; they connect us with our families, friends, and colleagues in unprecedented ways. Sharing on social networks became a primary means of communication —a basic human need— that has the potential of reaching a large group of recipients around the globe. Your posts on social networks are a reflection of who you think you are (self-perception), or what you want others to see; and how readers react to them is a reflection of how they perceive you. The lack of immediacy in such social interactions allows the author to ruminate on what to write, when sharing a textual post, in hopes of maximizing the fulfillment of a desideratum (akin to an essayist or a reporter seeking a certain goal). On the other hand, readers reciprocate by showing appreciation of the effort exerted in the form of virtual social gestures (akin to fan letters), which may help the post travel further, and propagate through many social networks.

Let’s take LinkedIn as an example of a social network where members are allowed to like, comment on, and re-share posts; the more likes a LinkedIn post gets, the more fulfilled the author feels about it; such reaction vouches for the author’s ethos, and supports the author’s claim to fame. Conversely, the author’s reputation and self-esteem might suffer when the post is uncelebrated. In order to alleviate the social pressure associated with sharing on social media, this work proposes a system that predicts quantitative reaction towards textual posts in social networks over a specific time window. This time series analysis can help authors gauge the popularity (as a proxy for quality) of an update before posting it; they can refine the post if the scores are unsatisfactory, and examine how various versions of the same post score. In addition, the system can be used by social networks to predict which posts are more appealing to readers for ranking purposes [1]. This work will focus on LinkedIn as the social network of choice.

2 Input-Output Behavior

The input to the proposed system is a textual post shared by a member of a social network, and the output is a prediction of the quantitative reaction (number of likes, comments, and shares) the post shall receive within a certain time window; we’ll use a fixed window of one day. Take, for example, the post in the figure below:

Figure 1: an example of a textual post

The output corresponding to this input post is a vector of (predicted number of likes, predicted number of comments, and predicted number of shares) — one for each day. Here’s another example that should score much higher than the former:

Figure 2: an example of a popular textual post


3 Model

Disclaimer: The analysis, exploration, and preparation of data; feature engineering and extraction; use and fine-tuning of machine learning algorithms, along with code developed for training, validating, and testing the model; and practices adopted for this project are driven by my own personal experience, and are not connected to any LinkedIn product. The data discussed here were anonymized —by removing personally identifiable information— in accordance with LinkedIn’s strict policies regarding data privacy. In order to protect LinkedIn’s intellectual property, some of the features mentioned below are redacted.

The goal of the proposed system is to obtain a predictor function f that maps new input x to output y ∈ R^3 corresponding to (predicted number of likes, predicted number of comments, and predicted number of shares) for the input post and age in days (time window).

This is a regression problem, which is solved using the following supervised learning framework (a learner is given the training data and produces the predictor function):

Figure 3: diagram of a learning framework

The training data Dtrain is a set of examples, which are basically input-output pairs: the inputs are the posts (including age in days), and the outputs are the corresponding social gestures.

3.1 Data Preparation

The data came from original textual posts on LinkedIn; an original post is authored by its poster, and is not a re-share of another post. In order to learn a predictor, three datasets were gathered for training, validation, and testing; the datasets came from historical LinkedIn data using ETL (Extract, Transform, Load). One of the challenges faced during this project was performing ETL at the scale of LinkedIn: joining multiple datasets that contain billions of records, where each dataset contains certain aspects of the record that serves as an example for training. Member IDs, along with personally identifiable information, were removed once the metadata were generated. The combined tally of examples in the training and validation sets is around 3.8 million — after removing outliers. In order to maintain good data hygiene, the test set was gathered after training was completed, from a date range that doesn’t overlap with that of the training and validation datasets; it amounts roughly to 0.25 million examples (out of 4.05 million examples in total). The data are split into 75% for training, 19% for validation, and 6% for testing.

3.1.1 Outliers

Outliers were determined by plotting [2] the distribution of the label values; here’s an example of the distribution of label values for likes using a log scale:

Figure 4: distribution of label values for likes (x-axis: log(label value); y-axis: count)

After exploring the distribution of data at different bucket widths, a cutoff line was chosen to remove outliers that are unlikely to occur (based on the gathered data). This process was repeated for comments and shares as well.
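This cutoff step can be sketched as follows; the paper chose the cutoff line by visual inspection, so the fraction-based threshold and the helper names here are hypothetical stand-ins:

```python
def outlier_cutoff(label_values, keep_fraction=0.9):
    """Stand-in for the manually chosen cutoff line: keep the lowest
    keep_fraction of the observed label values."""
    ordered = sorted(label_values)
    index = max(0, int(keep_fraction * len(ordered)) - 1)
    return ordered[index]

def drop_outliers(examples, labels, cutoff):
    """Drop (example, label) pairs whose label value exceeds the cutoff."""
    return [(x, y) for x, y in zip(examples, labels) if y <= cutoff]

# Toy like counts: the extreme value 1000 is treated as an outlier.
likes = [0, 1, 2, 3, 5, 8, 1000]
cutoff = outlier_cutoff(likes, keep_fraction=0.9)  # 8
kept = drop_outliers(["p1", "p2", "p3", "p4", "p5", "p6", "p7"], likes, cutoff)
```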

3.1.2 Missing Data

Some of the fields in the collected records were missing at random; for example, fields that were left blank by the members of the social network.


Whenever possible, a replacement value was calculated for the missing field. For example, for a missing timezone field, an approximation was calculated using the member’s country. A more interesting example is a missing real-valued field, which was replaced by the mean value of that field in the observed examples — an acceptable solution for data missing at random (MAR).
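Both imputation strategies can be sketched as below; the country-to-timezone map and field names are hypothetical (real countries span several timezones, so this is only a rough approximation, as in the text):

```python
# Hypothetical country -> representative timezone fallback.
COUNTRY_TIMEZONE = {"de": "Europe/Berlin", "jp": "Asia/Tokyo"}

def impute_timezone(record):
    """Approximate a missing timezone field from the member's country."""
    if record.get("timezone") is None:
        record["timezone"] = COUNTRY_TIMEZONE.get(record.get("country"))
    return record

def mean_impute(values):
    """Replace missing real-valued entries with the mean of the observed
    ones, an acceptable fix for data missing at random (MAR)."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]
```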

3.1.3 Raw Data

A labeled example is basically a tuple of the textual post along with its metadata (input); its label is the number of likes, comments, and shares generated for the post (output). A typical record in a dataset looks like the following:

((Text & Metadata), (Likes, Comments, Shares))

Metadata include data about the post like age, visibility (e.g. public), author metadata (e.g. network size), etc. Such metadata are essential to better represent the inputs (in correlation to the output) in the context of the problem at hand: in a social network, the text of a post per se is insufficient to predict its popularity; we need to consider other factors like metadata about the post and its author. For example, the size of the author’s network is expected to play a major role in predicting popularity.

After the raw data were extracted, the records were serialized and stored into binary files to save space. Those binary files served as input files for the feature extractor.

3.2 Scoring

Each input is going to be distilled into a feature vector φ(x) = [φ1(x), ..., φd(x)] ∈ R^d, which represents the input and will be computed using a feature extractor. Correspondingly, a weight vector wi = [wi1, ..., wid] ∈ R^d, i ∈ {likes, comments, shares}, specifies the weight of each feature for each component of the prediction vector y ∈ R^3.

Given a feature vector φ(x) and a weight vector wi, the respective prediction score component yi ∈ R is their inner product:

yi = wi · φ(x)
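In code, the three score components are just three inner products; a sketch with toy numbers (the real feature and weight vectors are learned and far larger):

```python
def predict_scores(phi, weights):
    """Compute yi = wi . phi(x) for each gesture i in {likes, comments, shares}."""
    return {
        gesture: sum(w * f for w, f in zip(w_i, phi))
        for gesture, w_i in weights.items()
    }

phi = [1.0, 0.5, 2.0]          # toy feature vector phi(x)
weights = {                    # toy weight vectors wi
    "likes":    [2.0, 0.0, 1.0],
    "comments": [1.0, 1.0, 0.0],
    "shares":   [0.0, 0.0, 0.5],
}
predict_scores(phi, weights)   # {'likes': 4.0, 'comments': 1.5, 'shares': 1.0}
```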

The score vector represents a snapshot of the post’s popularity at the given age in days; to get the time series, the age is incremented by the desired time unit.

3.3 Feature Extraction

Based on domain knowledge, a feature vector φ(x) is picked to represent an input x and contribute to the prediction vector y.

The examples were loaded from the binary files where they were stored, then transformed into tuples of (feature vector, label vector). The feature vector is the union of raw features and features derived from the raw data. The feature extractor stored the tuples into binary files using Python’s cPickle module. Caching the feature vectors and their respective labels speeds up the training process, instead of re-extracting the features from the examples every time predicted scores are calculated.
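The caching step can be sketched as follows; Python 3’s pickle stands in for the cPickle module used in the project, and the extractor and file layout are simplified assumptions:

```python
import os
import pickle

def cache_feature_tuples(examples, extract, path):
    """Extract (feature vector, label vector) tuples once and cache them
    to a binary file, so training runs can skip re-extraction."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    tuples = [(extract(x), y) for x, y in examples]
    with open(path, "wb") as f:
        pickle.dump(tuples, f, protocol=pickle.HIGHEST_PROTOCOL)
    return tuples
```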

The features that contribute to the popularity of textual posts can be divided into the following:

• Textual features (pertaining to the post’s text)

• Post metadata features (pertaining to the action of posting)

• Author features (pertaining to the author)

In each category, there are real-valued and indicator features that are listed below as feature templates (to be filled out by the training data).

3.3.1 Textual Features

• log(post length) ∈ [0.0, 3.2): boolean feature

• log(post length) ∈ [3.2, 6.4): boolean feature

• log(post length) ∈ [6.4, ∞): boolean feature

• contains a URL: boolean feature

• contains a question: boolean feature

• contains an e-mail address: boolean feature

• contains a hashtag: boolean feature

• contains a smiley emoticon: boolean feature

• contains a frowny emoticon: boolean feature

• post length: real-valued feature

• log(post length): real-valued feature

• ratio of non-alphanumeric characters: real-valued feature

• word count: real-valued feature

• stemming: real-valued features, one for each stem in the text and its count


• unigrams: real-valued features, one for each word in the text and its count

• bigrams: real-valued features, one for each bigram (two adjacent words) in the text and its count

• trigrams: real-valued features, one for each trigram (three adjacent words) in the text and its count

3.3.1.1 Bucketization of Post Length

In order to figure out the relationships between the proposed features and their respective weights, interactive exploration of the data was performed as a part of the feature engineering process; plotting various features vs. the number of social gestures uncovered some insights. For example, the log(post length) can be bucketized into three buckets:

Figure 5: bubble chart of log(post length) and number of likes excluding outliers

The bucket boundaries are 0, 3.2, and 6.4; unsurprisingly, counts of comments and shares followed suit due to the correlation between the three social gestures:

Figure 6: bubble chart of log(post length) and number of comments excluding outliers

Figure 7: bubble chart of log(post length) and number of shares excluding outliers

3.3.1.2 Stemming

The Snowball stemmer from nltk [3], which is language-specific, was used whenever the post’s language was supported; otherwise, the Porter stemmer was used. The Snowball stemmer has a much better understanding of the language model —including stopwords exclusion— and it supports the following languages: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, and Swedish; many more languages were found in the examples.
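The stemmer selection can be sketched with nltk (assuming nltk is installed; stopword exclusion is omitted here because it requires the separately downloaded stopwords corpus):

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

# Languages with a dedicated Snowball stemmer in nltk.
SNOWBALL_LANGUAGES = set(SnowballStemmer.languages)

def make_stemmer(language):
    """Language-specific Snowball stemmer when supported, Porter otherwise."""
    if language in SNOWBALL_LANGUAGES:
        return SnowballStemmer(language)
    return PorterStemmer()

make_stemmer("english").stem("running")  # 'run'
```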

3.3.2 Post Metadata Features

• language of post: indicator feature

• day of month: indicator feature

• day of week: indicator feature

• hour of day: indicator feature

• day of month and hour: indicator feature

• day of week and hour: indicator feature

• sharing visibility: indicator feature

• post is in member interface locale: boolean feature

• post is in member default locale: boolean feature

• post is in member locale: boolean feature

• post age in days: real-valued feature

• log(post age in days): real-valued feature

• post age in minutes: real-valued feature

• log(post age in minutes): real-valued feature

• mentions count: real-valued feature

3.3.2.1 Language Identification

Language identification was performed using a language identifier software library developed by LinkedIn.

3.3.2.2 Locality of Timestamps

Timestamps were adjusted to represent local time according to the post’s timezone; the predictor should calculate the same score for two posts shared at the same local time — all other factors being equal. For example, if a member who lives in California shared a post at 10 AM PST, it should have the same score as a post shared at 10 AM EST by an identical member who lives in New York; time-based features have to be trained with respect to groups of values that contribute to the score in the same fashion. This is important because of the role the time of day when a post was published plays in predicting its popularity; we here assume that the majority of the post’s target readership lives in the same timezone as the poster [4], [5].
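The adjustment can be sketched with the standard library’s zoneinfo module (Python 3.9+, an assumption; the project predates it). The two example posts from the text land on the same local hour:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def local_post_time(posted_utc, tz_name):
    """Convert a post's UTC timestamp to the author's local time, so that
    time-of-day features compare posts by local hour rather than UTC."""
    return posted_utc.astimezone(ZoneInfo(tz_name))

# Two posts at 10 AM local time on opposite coasts (January, so PST/EST):
la = local_post_time(datetime(2015, 1, 5, 18, 0, tzinfo=timezone.utc),
                     "America/Los_Angeles")
ny = local_post_time(datetime(2015, 1, 5, 15, 0, tzinfo=timezone.utc),
                     "America/New_York")
# la.hour == ny.hour == 10, so both yield the same hour-of-day feature.
```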

3.3.3 Author Features

• default locale: indicator feature

• interface locale: indicator feature

• country: indicator feature

• industry: indicator feature

• timezone: indicator feature

• connections visibility: indicator feature

• feed visibility: indicator feature

• picture visibility: indicator feature

• is a LinkedIn influencer: boolean feature

• interface locale is default: boolean feature

• connections count: real-valued feature

• log(connections count): real-valued feature

• followers count: real-valued feature

• log(followers count): real-valued feature

• average likes count: real-valued feature

• average comments count: real-valued feature

• average shares count: real-valued feature

• a set of proprietary features (redacted)

Figure 8: chart of average likes per member (excluding outliers) and number of likes (excluding outliers); unsurprisingly, there is a correlation between the two dimensions


4 Approach

4.1 Baseline

A rule-based system was chosen to predict the number of likes, comments, and shares for a given input by searching for certain keywords in the text, and factoring in the size of the author’s network in a formula for each prediction; for example:

likes = α(network size) + ∑_{word ∈ text} weight(word)

The coefficients and weights in such a system can be guessed based on heuristics or domain expertise. An example baseline chosen with α = 0.011 and weight(’I’) = 1 yielded a large test error; see the results section for more details.
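A sketch of such a baseline with the stated guesses (α = 0.011, weight(’I’) = 1); matching keywords on whitespace tokens is an assumption:

```python
def baseline_prediction(text, network_size, alpha=0.011, word_weights=None):
    """Rule-based baseline: alpha * network size, plus a guessed weight
    for every keyword occurrence in the text."""
    if word_weights is None:
        word_weights = {"I": 1.0}
    keyword_total = sum(word_weights.get(word, 0.0) for word in text.split())
    return alpha * network_size + keyword_total

baseline_prediction("I went back to school", 1000)  # about 12.0 (0.011 * 1000 + 1)
```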

4.2 Oracle

An oracle can see the future, and tell us exactly how many likes, comments, and shares a post has at the end of a future time window. So, in our case, it’s basically a time machine.

4.3 Linear Regression

Linear regression can be used to predict the number of social gestures by learning the weight vectors that contribute to the scores. The objective is to minimize the average loss determined by the squared loss function:

Loss_squared(x, yi, wi) = (wi · φ(x) − yi)²
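For gradient-based training, both the loss and its gradient with respect to wi are needed; a sketch for a single example:

```python
def squared_loss(w, phi, y):
    """Loss_squared(x, yi, wi) = (wi . phi(x) - yi)^2."""
    residual = sum(wj * fj for wj, fj in zip(w, phi)) - y
    return residual ** 2

def squared_loss_gradient(w, phi, y):
    """Gradient w.r.t. wi: 2 * (wi . phi(x) - yi) * phi(x)."""
    residual = sum(wj * fj for wj, fj in zip(w, phi)) - y
    return [2.0 * residual * fj for fj in phi]

squared_loss([0.5, 0.5], [1.0, 2.0], 2.0)           # (1.5 - 2)^2 = 0.25
squared_loss_gradient([0.5, 0.5], [1.0, 2.0], 2.0)  # [-1.0, -2.0]
```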

4.4 Stochastic Gradient Descent

The choice of stochastic gradient descent (SGD) for linear regression was an obvious one, as the datasets used in this project are large, a fact that influenced the tuning of hyperparameters and the algorithm as well.

4.4.1 Hyperparameters

Because of the large number of examples, the following formula was used for the learning rate:

η = min(0.001, 1/√(number of updates))

It starts with a value that’s relatively large –yet small enough to keep the weights from overflowing– then gets smaller as the number of updates increases (as the convergence rate increases).

Termination of the algorithm was determined either by reaching diminishing improvements (ε = 0.0001) of the combined training and validation errors (|TEt+1 − TEt| + |VEt+1 − VEt| < ε), or by exhausting the maximum number of iterations allowed; however, the latter can be incremented by 1 if a significant improvement in the validation error has been observed (ε = 0.1). The same condition was used to save a snapshot of the training program in case it got aborted. When the program was done, the weights vector was pickled (serialized) and saved to disk for the test program to load and evaluate. The examples used for training were equally divided into 100 files; the training dataset had 80 files, and the validation dataset had 20 files. The order of the training files to process at the start of each iteration was picked at random, and the content of each file was then shuffled as well, in hopes of increasing the convergence rate.
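A condensed sketch of this training loop, under simplifying assumptions: one gesture, a single in-memory dataset standing in for the 100 shuffled files, and only the diminishing-improvement stopping rule (the separate validation error and snapshotting are omitted):

```python
import random

def sgd_train(examples, dim, eta0=0.001, max_iters=100, eps=1e-4):
    """Minimize average squared loss with SGD: shuffled passes, a capped
    decaying step size, and stopping on diminishing error improvement."""
    w = [0.0] * dim
    updates, prev_err = 0, float("inf")
    for _ in range(max_iters):
        random.shuffle(examples)
        for phi, y in examples:
            updates += 1
            eta = min(eta0, 1.0 / updates ** 0.5)
            residual = sum(wj * fj for wj, fj in zip(w, phi)) - y
            for j, fj in enumerate(phi):
                w[j] -= eta * 2.0 * residual * fj
        err = sum((sum(wj * fj for wj, fj in zip(w, phi)) - y) ** 2
                  for phi, y in examples) / len(examples)
        if abs(prev_err - err) < eps:
            break
        prev_err = err
    return w
```

On toy data with labels y = 3x (and a larger initial step size than the production value), the loop recovers a weight near 3; the real runs train one such weight vector per gesture.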

4.4.2 Feature Normalization

One of the issues found while training was overflow of weights; the feature values were highly variant, and had to be normalized. The following formula was used to rescale the values:

φ(x)′ = (φ(x) − min(φ(x))) / (max(φ(x)) − min(φ(x)))
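A per-feature min-max rescaling sketch; fitting the statistics on the training data and reusing them unchanged elsewhere is an assumption, as is the guard for constant columns:

```python
def minmax_fit(columns):
    """Record per-feature (min, max) over the training data."""
    return [(min(col), max(col)) for col in columns]

def minmax_apply(value, lo, hi):
    """phi' = (phi - min(phi)) / (max(phi) - min(phi)); constant columns
    map to 0.0 to avoid division by zero."""
    span = hi - lo
    return (value - lo) / span if span else 0.0

stats = minmax_fit([[0.0, 5.0, 10.0]])
[minmax_apply(v, *stats[0]) for v in [0.0, 5.0, 10.0]]  # [0.0, 0.5, 1.0]
```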

4.5 Gradient Descent Using VW

Vowpal Wabbit [6] makes use of parallel threads, feature hashing, and cache files to speed up the gradient descent algorithm. Running VW with a single pass and with 100 passes generated similar results; when run with multiple passes, VW held out 10% of the examples for validation and reported the dev loss instead of the training loss. VW was run using the squared loss function and feature normalization.
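Feeding examples to VW requires its plain-text input format (a label, a pipe, then name:value features); a hypothetical converter for one labeled example (the paper does not describe its actual conversion step):

```python
def to_vw_line(label, features):
    """Render one example in VW's text input format. Feature names must
    not contain spaces, colons, or pipe characters."""
    rendered = " ".join(f"{name}:{value}"
                        for name, value in sorted(features.items()))
    return f"{label} | {rendered}"

to_vw_line(3.0, {"log_len": 4.6, "has_url": 1.0})
# '3.0 | has_url:1.0 log_len:4.6'
```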

4.5.1 Hyperparameters

VW was also run with the adaptive option, which sets an individual learning rate for each feature; this improves learning when the feature vector is large [7]. The initial learning rate was set to the default value (η0 = 0.5).

5 Results

Please note that the fitted coefficients were redacted to protect LinkedIn’s intellectual property.


                  Baseline   SGD        VW (1 Pass)   VW (100 Passes)
Likes
  Iterations      N/A        101        1             100
  Features Count  2          3016665    590650105     53157724900
  Training RMSE   N/A        3.930      3.683         N/A
  Dev RMSE        N/A        4.222      N/A           3.384
  Test RMSE       8.352      2.440      N/A           N/A
Comments
  Iterations      N/A        25         1             100
  Features Count  2          3016878    590650105     5315772490
  Training RMSE   N/A        2.374      2.144         N/A
  Dev RMSE        N/A        2.497      N/A           2.120
  Test RMSE       2.576      1.068      N/A           N/A
Shares
  Iterations      N/A        2          1             100
  Features Count  2          3017243    590650105     53157724900
  Training RMSE   N/A        0.446      0.474         N/A
  Dev RMSE        N/A        0.484      N/A           0.447
  Test RMSE       63.878     0.424      N/A           N/A

Table 1: comparison of various approaches

RMSE is the root-mean-square error (since the predictors used the squared loss function); in the table above, it’s rounded to the third decimal place.

VW produced a model file of hashed features and their respective weights; parsing that excessively large vector into a format that the test harness understands —to calculate Test RMSE— was prohibitive. Dev RMSE is a good proxy for it (in the scope of this project).

6 Literature Review

Another project that set out to predict Facebook likes [5] compared two approaches, linear regression and nearest neighbor, and found that the former is more effective in predicting likes than the latter. It’s also worth mentioning that the number of examples used in that project, 49216, is two orders of magnitude smaller than the one used here, which indicates that the test data may not have enough statistical coverage for a social network of more than one billion daily unique users [8].

7 Potential Improvements

There are a few more improvements that I would have liked to explore; for example, adding more textual features like part-of-speech tagging, and lemmatization.

Another possible addition to the feature vector is metadata features about who liked, commented on, and shared a post. The observed labels represent a time series of an underlying sequence of values that aren’t IID (Independent and Identically Distributed); for example, the probability that a post gets n more likes at time tj depends on who liked it at all times before tj, because likes propagate through the news feed (when member x likes a public post y, it shows up as news for x’s network, and they can like it, comment on it, and/or share it from their news feeds); this is known as serial coupling [9]. Augmenting the feature vector with metadata features about members who reacted to the post can improve the accuracy of predicting its popularity [10], but it will make the cardinality of the feature vector orders of magnitude larger than what is currently used. Exploring other loss functions, ensemble learning, more non-linear features, and feature interactions (e.g., the cross product of metadata features) might yield more accurate predictions.

8 Acknowledgments

I’d like to thank Percy Liang and the CS221 TAs for the guidance they gave me throughout the quarter. I’d also like to thank LinkedIn (special thanks to Guy Lebanon and Bee-Chung Chen) for providing the data that made this project possible.


References

[1] D. Agarwal, B.-C. Chen, Q. He, Z. Hua, G. Lebanon, Y. Ma, P. Shivaswamy, H.-P. Tseng, J. Yang, and L. Zhang, “Personalizing LinkedIn feed”, in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’15, Sydney, NSW, Australia: ACM, 2015, pp. 1651–1660, ISBN: 978-1-4503-3664-2. DOI: 10.1145/2783258.2788614. [Online]. Available: http://doi.acm.org/10.1145/2783258.2788614.

[2] H. Wickham, ggplot2: Elegant Graphics for Data Analysis. Springer New York, 2009, ISBN: 978-0-387-98140-6. [Online]. Available: http://had.co.nz/ggplot2/book.

[3] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python, 1st ed. O’Reilly Media, Inc., 2009, ISBN: 0596516495, 9780596516499.

[4] Z. Ellison and S. Hildick-Smith, “Blowing up the Twittersphere: Predicting the optimal time to tweet”, Stanford University, Stanford, CA, Tech. Rep., 2014. [Online]. Available: http://cs229.stanford.edu/proj2014/Seth%20Hildick-Smith,%20Zach%20Ellison,%20Blowing%20Up%20The%20Twittersphere-%20Predicting%20the%20Optimal%20Time%20to%20Tweet.pdf.

[5] K. Chen, B. Huang, and B. Lee, “Facebook like predictor within your friends”, Northwestern University, Evanston, IL, Tech. Rep., 2015. [Online]. Available: http://kbbz.github.io/files/Final%20report.pdf.

[6] J. Langford, L. Li, and A. Strehl, Vowpal Wabbit, 2007.

[7] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization”, J. Mach. Learn. Res., vol. 12, pp. 2121–2159, Jul. 2011, ISSN: 1532-4435. [Online]. Available: http://dl.acm.org/citation.cfm?id=1953048.2021068.

[8] M. Zuckerberg, 2015.

[9] L. Cao, “Non-iidness learning in behavioral and social data”, The Computer Journal, 2013. doi:10.1093/comjnl/bxt084. eprint: http://comjnl.oxfordjournals.org/content/early/2013/08/22/comjnl.bxt084.full.pdf+html. [Online]. Available: http://comjnl.oxfordjournals.org/content/early/2013/08/22/comjnl.bxt084.abstract.

[10] M. Dundar, B. Krishnapuram, J. Bi, and R. B. Rao, “Learning classifiers when the training data is not iid”, in Proceedings of the 20th International Joint Conference on Artificial Intelligence, ser. IJCAI ’07, Hyderabad, India: Morgan Kaufmann Publishers Inc., 2007, pp. 756–761. [Online]. Available: http://dl.acm.org/citation.cfm?id=1625275.1625397.


Appendix A Learning Rate Plots

Figure A9: line chart of learning rate using VW (single pass); x-axis: log2(examples count), y-axis: RMSE for likes, comments, and shares


Figure A10: line chart of learning rate using VW (100 passes); x-axis: log2(examples count), y-axis: RMSE for likes, comments, and shares
