Computer-Aided Civil and Infrastructure Engineering 00 (2015) 1–16

Communities of Interest–Interest of Communities: Social and Semantic Analysis of Communities in Infrastructure Discussion Networks

Mazdak Nik-Bakht* & Tamer E. El-diraby

Department of Civil Engineering, University of Toronto, Ontario, Canada

Abstract: The process of public consultation for planning and construction of sustainable urban infrastructure requires bidirectional communication among as many stakeholders of the project as possible. In modern infrastructure projects, the social web and online social media play an important role in involving citizens through what is referred to as “microparticipation.” However, the social network that forms behind the scenes of microparticipation is usually ignored by both researchers and practitioners. This article combines community detection in social networks with information retrieval methods to detect and label communities of project followers and cores of interest in the network of urban infrastructure project stakeholders. We study the dynamics of the network and its communities by monitoring them over time. This technique is used to analyze followers of a Light Rail Transit (LRT) project on the microblogging website Twitter and to profile them based on their interests and the ideas they support. Although the methods of this article can be readily applied to other cases where the connectivity of project followers and their descriptions are retrievable, our work is mainly interested in how to analyze Twitter discussions, without any assumption that they replace the other online or offline means of public involvement in infrastructure projects. This article is an extended version of a conference paper that appeared as Nik Bakht & El-Diraby, 2013(2).

1 INTRODUCTION

Civil infrastructure in modern society is a complex network of physical assets, together with the actors who develop, maintain, and operate them, and even the end users who use the services provided by the developed system and set its performance measures (Ottens et al., 2006). Infrastructure planning/management is, on the other hand, a multicriteria problem; maximum efficiency and usability (Yang et al., 2007) and minimum maintenance costs (Ng et al., 2009) are among the parameters traditionally comprising those criteria. More recently, sustainability of the system has also been emphasized as a part of the design procedure (Ferguson et al., 2012; Abdul Aziz and Ukkusuri, 2012).

*To whom correspondence should be addressed. E-mail: [email protected].

Although sustainability provisions are usually introduced to the decision-making problem at later phases, they are crucial criteria at a strategic planning level (Lopez and Monzon, 2010). As the realm of sustainability is context-sensitive and project-dependent, it adds even more to the complexity of the decision environment. Decision making for an infrastructure system in such a complex environment must happen within a network, rather than a hierarchy, of decision makers (Bruijn and Heuvelhof, 2000), and should involve as many social interactions among the stakeholders as possible (Parkin, 1994). This includes not only the internal stakeholders (who are directly involved in the decision making), but also the end users of the system, as external stakeholders who are affected by the decisions made (Atkin and Skitmore, 2008). Acceptability of decisions in a democratized society is deeply tied to the relevant social value structure (Parkin, 1994). Hence, sustainability of infrastructure is closely dependent on locally defined norms and limitations (Akinyemi and Zuidgeest, 2002). The “best solutions” in one environment may be a failure when applied to a different community with different cultural and demographic characteristics. Detection of the norms and standards of the local community is therefore an integral part of decision making, which requires public engagement.

© 2015 Computer-Aided Civil and Infrastructure Engineering. DOI: 10.1111/mice.12152

To involve end users in the network of decision makers, “public relations” programs in infrastructure projects are evolving into an “engagement partnership”; that is, instead of updating the public through a top-down transfer of information from the project management team to the community, community engagement programs today target other forms of consultation by involving public communities through two-way communication tools. As a result, citizens are becoming important sources of input for infrastructure development, construction, and maintenance plans. The e-society formed in the social web provides an opportunity for involving these important sources. Knowledge-enabled citizens in the e-society can be better defined as “prosumers” (a portmanteau formed by contracting the word professional, or also producer, with the word consumer) rather than mere users.

1.1 Online public involvement (PI) in infrastructure: Status quo

“Prosumerism” has recently been a source of new business models and scientific achievements. Websites such as Amazon, Google, and Facebook collect revenue from it, and projects such as Wikipedia and IMDB rely on the power of prosumers to create corpora of knowledge. Governments and other macrolevel decision makers in the AEC (Architecture, Engineering, and Construction) industry have also started to benefit from prosumer culture in the process of engaging public partnerships for infrastructure projects. As a result, in North America alone, 82 out of the 100 strategic infrastructure projects announced in 2011 by North America strategic infrastructure leadership have active Facebook or Twitter accounts to complete the bidirectional communication path between official/technical decision makers and nontechnical/public decision contributors.

With more than 645 million active users, and still growing at an average rate of 135,000 new users per day, the microblogging website Twitter records 58 million tweets every day (Twitter Statistics, 2015). People express their opinions on many different issues, including the built environment, in statements of fewer than 140 characters, and this can be a significant opportunity for decision makers to communicate with citizens and detect their demands or feedback. Evans-Cowley and Griffin (2012) refer to public engagement through microblogging as “microparticipation.” They focused on transit-related ideas discussed over Twitter and analyzed public discussions from the content (type and theme) and sentiment points of view. The results of this study confirm that aggregation of microblogs can create “meaning” and help PI practitioners to understand the perspective of the public community.

In the industry, on the other hand, according to the TRB (Transportation Research Board) transit cooperative research program (synthesis 99), major transportation service providers who use microparticipation to involve the public in the United States and Canada find Twitter in many aspects the most (and in some aspects the second most) convenient communication tool (Bregman et al., 2012). As this survey shows, public relations agencies use social media with goals such as communicating with current customers, improving customer satisfaction, and improving agency image. Apart from connecting with the customers/community, where online social media plays the role of a tool for real-time communication, advocacy, feedback collection, etc., it puts the customers in power and creates opportunities for user innovation (Bregman and Watkins, 2013).

1.2 Gaps, challenges, and scope of this article

Web 2.0 plays the role of a platform which not only brings all internal and external stakeholders together, but also archives their opinions, showcases their ideas, and maintains the flow of project-related discussions among them. As a result, a combination of people and their ideas about a construction project are linked to each other, as explained by El-Diraby (2011), and form an Infrastructure Discussion Network, or IDN (Nik Bakht and El-diraby, 2013a). Nodes of an IDN are knowledge-enabled members of the e-society. They use free access to information about construction projects and discuss its different aspects.

However, related studies and practices in the domain of infrastructure mainly target social media as a communication medium; they have mostly focused on the communicated content and ignored the social network formed in the background of online communication. Analysis of social connectivity over the IDN can provide decision makers with significant insights regarding the typology of stakeholders. We have formerly discussed some opportunities in this regard, including detection of influential actors to be targeted for public consultation, and reverse marketing for feedback and demand detection (Nik Bakht and El-diraby, 2013a). In this article, another important application will be discussed.

Clustering stakeholders and classifying their vested interests are among the major goals of PI programs in infrastructure projects (Olander, 2007). This is a difficult task due to the complexity of the stakeholders' network and the diversity of interests. In offline PI programs, lack of documentation on participants' identities adds to the difficulty. Moreover, interests change over time, which complicates the issue even more. In this article, we will show how analysis of the IDN can provide guidelines for decision makers to tackle this issue. Analysis of the ideas discussed, without respect to the identity of the users supporting such discussions, is not enough for decision makers (Evans-Cowley and Griffin, 2011). On the other hand, profiling such identities at the individual level and aggregating information at the community level require some level of automation. Moreover, although communities of online social networks normally form around similarities and shared interests, detecting infrastructure-project-specific interests and similarities among nodes of an IDN is challenging. We offer an approach that combines some of the best practices in community detection, influence assessment, and text mining to overcome these challenges. This article focuses on detecting communities of followers in an IDN, labeling each community based on commonalities of its members, and detecting core interests in each community by analyzing their statements. We offer a hybrid topology analysis and topic detection algorithm and apply it to the case of an LRT (Light Rail Transit) development in Toronto, ON, Canada, for validation. We will detect communities and core interests of the project followers and will finally analyze the dynamics of the communities and opinions.

It must be noted that we target microparticipation and limit the scope of this article to Twitter. This portrays one segment of project followers and, in some respects, may not be a full representation of the whole public. Consequently, our work does not intend to replace traditional/offline community engagement or even the need for analysis of inputs from other electronic contents (beyond Twitter).

2 BACKGROUND LITERATURE AND RELATED WORKS

A “network” can be generally defined as a set of interdependent individuals (nodes) with all of the interactions among them (edges or links). If the nodes are social entities (actors), and the edges are social linkages (relational ties or social interactions), then the resulting network is called a “Social Network.” Social network analysis (SNA) is a well-established line of research in many domains, and has found extensive applications during the past few decades. In particular, the prevalence of the social web has defined many new applications for SNA in the analysis of online social networks. The construction industry has been organized in project networks since the 1950s (Stinchcombe, 1959). Project networks, as Taylor and Levitt (2007) suggest, are social networks of experts working together on specific construction projects. As the network perspective of a construction project is a more mature structure than the traditional hierarchical management system (Pryke, 2004), the “social network model of construction” was introduced by Chinowsky and colleagues (2008) to increase efficiency in the management of projects through management of nontechnical components and team performance. This model had two main components: dynamics (to address the dynamics of interactions) and mechanics (to address the free flow of knowledge among project participants).

In the context of urban infrastructure projects, an IDN can be considered as a propagation of the project network in online social media. Mixing the project network (of technical and official decision makers) with the network of the public (end users) on one hand increases the complexity of the resulting heterogeneous network, but on the other hand creates new opportunities for decision makers. Detecting customers' reactions to policy makers' decisions is one important opportunity, which can assist decision makers. By focusing on housing infrastructure, Nejat and Damnjanovic (2012) used simulation to model the dynamic behavior of homeowners in response to the decisions. Monitoring the IDN can give a more realistic image of such dynamism. Demand detection is another opportunity provided by this heterogeneous network. Multiple sources of uncertainty in the demand give it a chaotic nature and have led many researchers to model it as a random variable (Ukkusuri et al., 2007, among others). Analysis of customers' online behavior in many industries has helped to detect trends of demand and reinforce the process of evaluating such a random variable. Taking advantage of such opportunities requires finding patterns of order that exist beneath the chaos of the IDN. We have shown that IDNs follow the general mathematical behavior of social networks, including small-world and scale-free nature (Nik Bakht and El-diraby, 2014). SNA can therefore be helpful to detect such patterns of order in IDNs.

Social networks are typically composed of groups of nodes that are tightly connected among themselves and sparsely connected to nodes from other groups. This behavior has roots in the formation process of the network and is due to the fact that people are more likely to join communities in which they not only have more friends, but their friends are also more densely connected to each other. These densely connected cores, which are “clusters” of the social connectivity graph, are called “modules” or “communities.” Splitting a graph into smaller components is referred to as “partitioning”; partitions are subsets of graph nodes with all the links among them. Communities are in fact “dense” partitions in the graph of social connectivity. “Community detection” is, therefore, the mathematical problem of graph partitioning to find such densely connected clusters.

Various classes of community detection algorithms have been developed in the SNA literature. Among them are methods based on the strength of weak ties (Girvan and Newman, 2002), modularity maximization (Clauset et al., 2004; Blondel et al., 2008), spectral graph partitioning (Donetti and Munoz, 2004), trawling (Kumar et al., 1999), and the clique percolation method (Palla et al., 2005). For a comprehensive review of these methods, see Fortunato (2010).

We have recently compared the applicability of these methods to the context of IDNs, based on algorithmic efficiency, scalability, stability, and accuracy. Table 1 gives an overview of the results. The results suggested that modularity maximization algorithms are good matches due to their high computational efficiency and accuracy (Nik Bakht and El-Diraby, 2013b). In this article, we use modularity maximization due to its high performance.

Modularity maximization is an optimization problem. Modularity, the objective function of this optimization problem, is defined as the difference between the number of existing edges in a partition and the expected number of edges that can exist among nodes of that partition. Modularity of a partition (Q) is estimated as (Newman, 2006):

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \tag{1}$$

where $A_{ij}$ is the $ij$ entry of the graph's adjacency matrix (weight of link $ij$), $k_i$ is the degree of node $i$, $m$ is the total number of edges, $c_i$ is the community that node $i$ belongs to, and $\delta(u, v)$ is the Kronecker delta (equal to 1 if $u = v$, and 0 otherwise).
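As an illustration of Equation (1), the following is a minimal Python sketch, not the authors' implementation. It assumes an undirected graph without self-loops, given as an edge list, and a dictionary that maps each node to a community label; the function name `modularity` and the toy graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of an undirected graph, per Equation (1).

    edges: iterable of (u, v) or (u, v, weight) tuples, one per undirected link
    community: dict mapping each node to its community label (c_i)
    """
    degree = defaultdict(float)    # k_i: (weighted) degree of node i
    internal = defaultdict(float)  # sum of A_ij over ordered pairs inside a community
    m = 0.0                        # total edge weight
    for edge in edges:
        u, v = edge[0], edge[1]
        w = edge[2] if len(edge) > 2 else 1.0
        degree[u] += w
        degree[v] += w
        m += w
        if community[u] == community[v]:
            internal[community[u]] += 2.0 * w  # A_uv and A_vu both contribute
    total_degree = defaultdict(float)  # sum of k_i over the nodes of each community
    for node, k in degree.items():
        total_degree[community[node]] += k
    # The delta in Equation (1) keeps only same-community pairs, so the double
    # sum can be evaluated one community at a time.
    return sum(
        (internal[c] - total_degree[c] ** 2 / (2.0 * m)) / (2.0 * m)
        for c in total_degree
    )


# Toy graph: two triangles joined by a single link (2-3).
two_triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(round(modularity(two_triangles, split), 4))  # 0.3571
```

Splitting the toy graph into its two triangles gives Q = 5/14, whereas placing all six nodes in a single community gives Q = 0.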

The higher the modularity of a partition is, the denser that community will be. Therefore, by taking Q as an objective function, the community detection problem can be formulated as partitioning a graph into clusters to maximize Q over all graph partitions. It has been shown that this is an NP-hard optimization problem with no closed-form solution (Fortunato, 2010). However, various heuristic methods or theoretical approximation algorithms are suggested in the literature for solving it. Agglomerative clustering (Clauset et al., 2004), fast modularity optimization (Newman, 2006), and fast unfolding (Blondel et al., 2008) are examples of such methods.

However, notwithstanding their high accuracy, modularity maximization algorithms, as will be discussed later, are heuristic-search-based and therefore lack algorithmic stability. This issue will be addressed in Section 3.1.

3 COMMUNITY DETECTION AND LABELING/PROFILING COMMUNITIES IN THE IDN

Community detection has found extensive applications in various domains including biology, social sciences, marketing, bibliometrics, and scientific collaboration. Even providing a list of all these applications would be beyond the scope of this article; however, applications such as topic detection in collaborative tagging systems (as reviewed by Papadopoulos et al., 2012) and social trust evaluation for recommender systems (addressed by Pitsilis et al., 2011, among others) indicate that community detection can lead to profiling and grouping users based on their common interests. This is due to the simple fact that “birds of a feather flock together,” which is indirectly reflected in the topology of the social connectivity graph. Communities of a network can be thought of as groups of people getting together around common interests, and therefore, in the context of an IDN, detecting communities can help to distinguish some of the interests around an infrastructure project. Some potential applications of community detection in an IDN for official decision makers of the infrastructure project can be listed as follows:

1. profiling end users of the facility or system that is being developed;

2. highlighting cores of interest around the project;

3. finding community leaders to be engaged by public engagement programs;

4. detecting interactions and interrelations among different communities;

5. finding possible bottlenecks in the process of public communication and partnership for infrastructure projects; and

6. monitoring social reactions to decisions made, by studying community dynamics.

Achieving these goals depends not only on detecting communities of the IDN, but also on “labeling” them based on their common interests and analyzing the semantics of topics discussed within and between the communities. Direct search for shared interests and clustering nodes based on the similarity of their interests is one solution to handle this issue. Steinhaeuser and Chawla (2008) suggested defining the interest as a node attribute and then clustering the graph based on the attribute similarity among nodes. In a similar study, Kalafatis (2009) clustered Twitter users based on their similar interests by looking for the occurrence of some predefined keywords in their biography on Twitter.

Table 1. Community detection algorithms reviewed by Nik Bakht and El-Diraby (2013b)

| Algorithm class | Computational performance | Stability | Scalability | Recommended for |
|---|---|---|---|---|
| Strength of weak ties | Low | Stable | Low | Un-directed/un-weighted graphs |
| Modularity maximization | High | Unstable | High | All graphs |
| Spectral graph partitioning | Average | Depends on the features | Depends on algorithm | All graphs |
| Trawling | Average | Stable | Average | Detecting the “signature of networks” in large-scale/dense bipartite graphs |
| Clique percolation | Low | Depends on the features | Low | Detecting overlapping communities |

Although such methods can detect the main interests around a project, they ignore the “networked-ness” of the interests and ideas discussed. Discussions over the IDN are associated with nodes of a network; therefore, stated ideas and interests gain weight not only from the number of people who support them, but also (and maybe more importantly) from how densely these supporters are connected to one another and how strongly they can influence other nodes of the network. We refer to such a property as the network value of interests/ideas expressed over the IDN. Evaluation of interests within the context of their network value requires a reverse procedure; that is, topological communities of the network must be detected first, and then nodes' interests within each community should be mined to find similarities among the community members. This article suggests such an approach by combining community detection with computational linguistics to create a social and semantic analysis tool. In the following, we explain our methodology.

3.1 Community detection algorithm

As mentioned, modularity maximization is an NP-hard problem, solved through combinatorial optimization and heuristic search. Like many other combinatorial optimization methods, modularity maximization algorithms lack algorithmic stability; that is, the results (both modularity and community membership) depend on the order of the heuristic search over the graph. As we aim to study the dynamics of communities, stability is an important criterion in selecting the algorithm. This is not limited to algorithmic stability and also requires detection of stable communities, which intrinsically have higher levels of stability in their membership over time. The fast unfolding method of Blondel et al. (2008), sometimes referred to as the “Louvain method,” is the method we use in this article. This algorithm results in the highest modularity (i.e., the best partitions) compared to other modularity maximization algorithms tested at different snapshots of our network. As Srinivasan and Bhowmick (2012) suggest, stable communities are associated with better partitions, and therefore, despite its lower algorithmic stability (compared to methods such as Clauset et al., 2004), the Louvain method is used in this article. To overcome the algorithmic instability, for each snapshot we run the algorithm several times (each time the graph is introduced in a different random order) and pick the best result (the highest modularity).
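The "run several times and keep the best" strategy can be sketched as follows. This is a hedged illustration only. The detector used here, `local_moving`, is a toy greedy stand-in that evaluates Equation (1) by brute force. It is not the Louvain method, since it has no folding phase. The names `local_moving` and `best_of_runs`, the number of runs, and the seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def modularity(edges, community):
    """Equation (1) for an unweighted, undirected edge list."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends in the same community
    degree_sum = defaultdict(int)  # total degree of each community
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


def local_moving(edges, order):
    """Toy order-dependent detector: greedy node moves, visited in `order`."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    community = {node: node for node in order}  # every node starts alone
    best_q = modularity(edges, community)
    improved = True
    while improved:
        improved = False
        for node in order:
            home = community[node]
            best_target = home
            # Try each neighbouring community; keep the move with the largest gain.
            for target in sorted({community[n] for n in neighbours[node]}, key=str):
                community[node] = target
                q = modularity(edges, community)
                if q > best_q + 1e-12:
                    best_q, best_target = q, target
            community[node] = best_target
            improved = improved or best_target != home
    return community


def best_of_runs(edges, runs=10, seed=0):
    """Run the detector on shuffled node orders; keep the highest-modularity result."""
    rng = random.Random(seed)
    nodes = sorted({n for edge in edges for n in edge}, key=str)
    best_partition, best_q = None, float("-inf")
    for _ in range(runs):
        order = nodes[:]
        rng.shuffle(order)  # a different random visiting order for each run
        partition = local_moving(edges, order)
        q = modularity(edges, partition)
        if q > best_q:
            best_partition, best_q = partition, q
    return best_partition, best_q


two_triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
partition, q = best_of_runs(two_triangles, runs=5, seed=1)
```

In practice, the toy detector would be replaced by an actual Louvain implementation; only the outer loop in `best_of_runs` reflects the procedure described above.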

Louvain fast unfolding is a bottom-up (agglomerative) method, which starts by forming small communities via local optimization of modularity, and then aggregates communities to form larger clusters. Initially, each node in the graph is one community. Then an iterative greedy algorithm is applied by visiting all nodes and computing $\Delta Q_{xy}$ (the gain in modularity if communities x and y are joined) for merging each pair of communities. An important advantage of this algorithm is the convenience in calculating $\Delta Q$; when a node i joins a community, Equation (1) is modified as follows to evaluate the change in the modularity:

$$\Delta Q = \left[ \frac{\Sigma_{in} + k_{i,in}}{2m} - \left( \frac{\Sigma_{tot} + k_i}{2m} \right)^2 \right] - \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2 - \left( \frac{k_i}{2m} \right)^2 \right] \tag{2}$$

where $\Sigma_{in}$ is the total number of edges among nodes inside the community, $k_{i,in}$ is the total number of links from node i to nodes inside the community, and $\Sigma_{tot}$ is the total number of incoming edges to the community.
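As a worked step (a purely algebraic simplification of Equation (2) as printed above, not an additional result from the source), expanding the squared terms shows why the gain is convenient to compute: the terms in $\Sigma_{in}$, $\Sigma_{tot}^2$, and $k_i^2$ cancel.

```latex
\begin{aligned}
\Delta Q
  &= \frac{k_{i,in}}{2m}
     - \frac{(\Sigma_{tot} + k_i)^2 - \Sigma_{tot}^2 - k_i^2}{4m^2} \\
  &= \frac{k_{i,in}}{2m} - \frac{2\,\Sigma_{tot}\,k_i}{4m^2}
   = \frac{k_{i,in}}{2m} - \frac{\Sigma_{tot}\,k_i}{2m^2}
\end{aligned}
```

Evaluating a candidate move therefore needs only $k_{i,in}$, $k_i$, $\Sigma_{tot}$, and $m$, all of which can be maintained incrementally as nodes change community.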

This is an extension to Equation (1), bringing the calculation from the level of individual nodes to the community level (see Blondel et al., 2008, for more details). Communities x and y are merged as long as $\Delta Q_{xy}$ is positive, that is, as long as modularity is increasing. This phase stops when no further increase in the modularity is possible by merging more communities. The second phase of this algorithm is called “folding,” where all nodes in the same community are merged into one “hyper node.” Weights are assigned to nodes and edges such that the weight of each hyper node is the sum of the weights of the links inside it ($\Sigma_{in}$), and the weight of the edge between two hyper nodes represents the number of cross-community connections. Then the first phase is reapplied to the meta-communities, and the phases are iterated until the maximum of modularity is reached (Blondel et al., 2008). In performing calculations for the weighted graph, the numbers of nodes and edges are replaced by the total weights of nodes and edges.
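The folding phase can be illustrated with a short Python sketch, under the assumption that the graph is an undirected edge list (optionally weighted) and that community membership is given as a dictionary. The function name `fold` and the toy data are hypothetical.

```python
from collections import defaultdict


def fold(edges, community):
    """Collapse each community into one hyper node.

    Returns (node_weight, edge_weight), where
      node_weight[c]      = total weight of the links inside community c
      edge_weight[(c, d)] = total weight of the links between communities c and d
    """
    node_weight = {c: 0.0 for c in set(community.values())}
    edge_weight = defaultdict(float)
    for edge in edges:
        u, v = edge[0], edge[1]
        w = edge[2] if len(edge) > 2 else 1.0  # unweighted links count as 1
        cu, cv = community[u], community[v]
        if cu == cv:
            node_weight[cu] += w  # link stays inside the hyper node
        else:
            key = tuple(sorted((cu, cv), key=str))  # undirected: order-independent key
            edge_weight[key] += w
    return node_weight, dict(edge_weight)


# Toy graph: two triangles joined by a single link (2-3).
two_triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
hyper_nodes, hyper_edges = fold(two_triangles, split)
print(hyper_nodes["a"], hyper_nodes["b"], hyper_edges[("a", "b")])  # 3.0 3.0 1.0
```

The folded graph can then be passed back to the first phase, with the hyper-node and hyper-edge weights taking the place of node and edge counts.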

3.2 Labeling and profiling communities

Once communities of the IDN are detected, they must be labeled based on the profile of community members and their common interests. For this purpose, we apply topic detection to the biographies and recent tweets of nodes within each community via Term Frequency–Inverse Document Frequency (TF-IDF). This can help to detect the dominant themes of opinion and interest for each community. TF-IDF is a term weighting system used in text mining to evaluate the representativeness of a term in a set of documents. This measure is composed of two components: TF, which simply gives higher weights to terms with higher occurrence in a text (because in a document they are more likely to be descriptive than terms with low frequencies); and IDF, which scores down terms that are common to multiple documents (as they are less likely to be good discriminators for a particular document). Therefore, if term i appears $f_{ij}$ times in document j, then

$$\mathrm{TF}_{ij} = \frac{f_{ij}}{m_j} \tag{3}$$

in which $m_j = \max_i (f_{ij})$ cancels the effect of the document's size. On the other hand, if n is the number of documents, then

$$\mathrm{IDF}_i = \log\left( \frac{n}{1 + d_i} \right)$$

where $d_i$ is the number of documents in which term i has occurred (document frequency). The unity is added in the denominator to prevent division by zero. Several experiments have confirmed that the product of the two components (called TF-IDF) provides a good measure of representativeness for term i in a given document j. We consider each community of the IDN as one document, which collects all the community nodes' discussions (or their descriptions), and build our analysis corpus as the aggregation of all these documents. A collection of all terms in the corpus will form a dictionary, and then TF-IDF can help to detect terms from this dictionary that are shared within one community and not the others.
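The following Python sketch illustrates the community-as-document scoring with Equation (3) and the standard IDF given above. It is a minimal example under stated assumptions: documents are already tokenized, the natural logarithm is used (the source does not state the base), and the function name `tf_idf` and the toy communities are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: dict community_id -> list of tokens (one document per community).

    Returns dict community_id -> {term: TF-IDF score}.
    """
    n = len(documents)  # number of documents (communities)
    counts = {doc: Counter(tokens) for doc, tokens in documents.items()}
    doc_freq = Counter()  # d_i: number of documents that contain term i
    for counter in counts.values():
        doc_freq.update(counter.keys())
    scores = {}
    for doc, counter in counts.items():
        m_j = max(counter.values(), default=1)  # count of the most frequent term
        scores[doc] = {
            term: (f / m_j) * math.log(n / (1 + doc_freq[term]))
            for term, f in counter.items()
        }
    return scores


# Toy corpus: three communities, each reduced to a handful of tokens.
communities = {
    "c1": ["transit", "transit", "lrt", "bike"],
    "c2": ["transit", "tax", "tax"],
    "c3": ["transit", "tax", "music"],
}
scores = tf_idf(communities)
```

In this toy corpus, "lrt" occurs in one community only and receives a positive score there, while "tax" occurs in two of the three communities and scores exactly zero under the standard IDF.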

The standard TF-IDF needs to be slightly modified before being applied to the analysis of communities. The +1 in the denominator of the standard IDF expression above means that if $d_i = n - 1$ (i.e., the term occurs in all except one of the communities), then IDF will return zero for such a term, and the term will be completely removed from the list of buzz words (irrespective of its TF). In our case, we need to modify the formula to get a small (yet nonzero) value for $d_i = n - 1$. One solution can be using $\log(n/d_i)$. Since the dictionary is formed on the corpus of communities, every term in it occurs at least once ($d_i \neq 0$), and we will not have the issue of division by zero. However, to track the terms which occur in all communities, we modify the IDF as follows:

$$\mathrm{IDF}_i = \log\left( \frac{n}{0.05 + d_i} \right) \tag{4}$$

This will always return positive/nonzero values, except when a term is common among all the communities (in which case $d_i = n$ and IDF returns a negative, still nonzero number). Therefore, on top of traceability of common words among all communities, TF-IDF in this form can detect terms that are absent in some communities; a TF-IDF of zero for a term in a community necessarily stems from TF = 0 and indicates that the term has never been used by users of that community. After this simple correction, TF-IDF analysis can be combined with community detection to detect terms which uniquely describe each community. As a result, communities of the IDN will be detected and labeled, and themes of intracommunity and cross-community discussions can be extracted.
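The effect of the 0.05 offset can be checked numerically with a short sketch. The natural logarithm is an assumption, since the base is not stated, and the function names and the value n = 8 are hypothetical choices for illustration.

```python
import math


def idf_standard(n, d_i):
    """Standard form: log(n / (1 + d_i))."""
    return math.log(n / (1 + d_i))


def idf_modified(n, d_i, offset=0.05):
    """Equation (4): log(n / (0.05 + d_i))."""
    return math.log(n / (offset + d_i))


n = 8  # hypothetical number of communities
for d_i in (1, n - 1, n):
    print(d_i, idf_standard(n, d_i), idf_modified(n, d_i))
```

For $d_i = n - 1$ the standard form returns exactly zero while the modified form stays slightly positive; for $d_i = n$ the modified form turns slightly negative, which flags terms that are shared by every community.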

One issue with applying such a method is the high volume of noise among detected terms with high TF-IDF. Although IDF is intended to remove common words (which are used frequently but are not descriptive), many specific words with high occurrence remain that are not relevant to the particular scope of the infrastructure project. For example, high TF-IDF terms such as “brunch” and “pizza” in one community, and “rock” and “soundcloud” in another, may give information about members of those communities but are not relevant to the segmentation of project stakeholders. Therefore, after detecting the descriptive terms of each community, we should manually screen the lists of top words and remove terms that are not relevant to the context of the infrastructure project or do not add any meaning to the process of profiling. In the next part we suggest a filter to facilitate this task.

3.3 Modified TF-IDF

As mentioned earlier, ideas discussed over an IDN depend on a network of project followers, and the analysis of an IDN must take such networked-ness into account. Different followers of a project have different network values based on the level of influence they have on others, and this must be considered when aggregating discussions. Here we suggest a modified version of TF-IDF, which incorporates users’ network value in evaluating their interests and descriptions. In this method, the weight of each term depends not only on how uniquely it describes a community, but also on the degree of influence of the person who uses it. The network value of each node in the IDN can be interpreted as the level of influence that the node has on other members of the network. Nodes in a social network secure their level of influence through connections to their friends and followers. Several measures exist to evaluate the influence level of a node in a social network. The most conventional one is “centrality” in its different forms, including degree, closeness, betweenness, and eigenvector centrality. However, as shown by the authors in Nik Bakht and El-diraby (2013a), other methods are better matched to the context of the IDN, owing to their background logic and actual outputs. PageRank is one of these measures; it accounts for both the quantity and the quality of a node’s followers when calculating its degree of influence.

PageRank, as originally introduced by Brin and Page (1998), is in fact a variant of eigenvector centrality used by Google to rank search results based on the connectivity of the web network. This algorithm starts by allocating equal weights (called PageRank) to all nodes of a network. Weights are then iteratively updated based on a flow from each node to its neighbors (the people followed by the node) through outgoing links. As a result, the more links a node receives from other nodes with a high level of influence, the higher its PageRank will be at the end. In this sense, not only the number of followers but also their quality and level of importance influence a node’s PageRank.
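The iterative update described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the damping factor of 0.85, the fixed number of iterations, and the uniform redistribution of weight from accounts that follow nobody are assumptions, and the account names are hypothetical.

```python
def pagerank(follows: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 100) -> dict[str, float]:
    """Power-iteration PageRank on a follower graph.

    follows[u] lists the accounts that u follows, so weight flows from u
    to those accounts along u's outgoing links.
    """
    nodes = set(follows) | {v for targets in follows.values() for v in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # equal starting weights
    for _ in range(iterations):
        # Assumption: weight held by accounts that follow nobody is spread evenly.
        dangling = sum(rank[u] for u in nodes if not follows.get(u))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for u, targets in follows.items():
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
        rank = new_rank
    return rank


# Toy network: three followers of a project account; "c" also follows "a".
follows = {"a": ["project"], "b": ["project"], "c": ["a", "project"], "project": []}
rank = pagerank(follows)
```

Here the project account receives links from every other node and ends with the highest weight, while "a" outranks "b" because it has a follower of its own.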

We combine the TF-IDF of terms detected in the corpus of communities’ discussions with the PageRank of the person who uses them. A new measure called “modified TF-IDF” is introduced here, which integrates the social connectivity of nodes with the semantic classification of their discussions. If node n, which is a member of community j, uses term i, then the modified TF-IDF for this term is calculated as

$$\text{Modified TF-IDF} = \mathrm{PageRank}_n \times \mathrm{TF}_{ij} \times \mathrm{IDF}_i \tag{5}$$

Using such a measure not only helps to filter the noise in the results of TF-IDF analysis, but also adds “context” to the topic detection process. It is a social-semantic evaluation of discussions in the specific context of the infrastructure project. In the next part, we use this measure to analyze the dynamics of communities in the Twitter network of an infrastructure project.
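A minimal Python sketch of Equation (5) follows. The text does not specify how contributions from several members who use the same term are aggregated; the sketch assumes they are summed, so that each user’s share of the community term frequency is weighted by that user’s PageRank. The function name, user names, and PageRank values are hypothetical.

```python
import math
from collections import Counter, defaultdict


def modified_tfidf(user_tokens, community_of, pagerank, offset=0.05):
    """PageRank-weighted TF-IDF per community.

    Assumption: contributions of all members using a term are summed, i.e.
    score[j][i] = sum_n PageRank_n * (count_n(i) / tokens in j) * IDF_i.
    """
    communities = sorted(set(community_of.values()))
    n_docs = len(communities)
    weighted = {c: defaultdict(float) for c in communities}
    raw = {c: Counter() for c in communities}
    for user, tokens in user_tokens.items():
        c = community_of[user]
        for term, count in Counter(tokens).items():
            raw[c][term] += count
            weighted[c][term] += pagerank[user] * count
    # d_i: number of communities whose document contains term i
    doc_freq = Counter(term for c in communities for term in raw[c])
    scores = {}
    for c in communities:
        total = sum(raw[c].values()) or 1
        scores[c] = {
            term: (w / total) * math.log(n_docs / (offset + doc_freq[term]))
            for term, w in weighted[c].items()
        }
    return scores


# Toy data: an influential operator account and a low-influence rider in C0.
user_tokens = {
    "operator": ["ttc", "transit"],
    "rider": ["pizza", "pizza", "transit"],
    "councillor": ["ward", "transit"],
}
community_of = {"operator": "C0", "rider": "C0", "councillor": "C1"}
pagerank_values = {"operator": 0.6, "rider": 0.05, "councillor": 0.35}
scores = modified_tfidf(user_tokens, community_of, pagerank_values)
```

In this toy case "pizza" occurs more often than "ttc" in C0, yet "ttc" receives the higher score because it is used by the more influential node, which is the noise-filtering effect described above.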

4 ANALYSIS OF COMMUNITIES AND COMMUNITY DYNAMICS

As a project evolves throughout its life cycle, the associated impacts and influences on citizens’ lives take different forms. In addition, as the project becomes more visible to the public, more people take an interest in its different aspects and start following it more closely and actively. This can change the composition and balance of groups of followers, and introduce a level of dynamism to the IDN. Analysis of dynamics in the structure and interests of subcommunities can provide decision makers with a key controlling factor by connecting public opinions with parameters such as the stability of the groups supporting (or opposing) those opinions. In this part, we analyze eight different snapshots of an IDN over a course of 2 years. We first introduce the project and its IDN on Twitter. Then we detect its communities and label them. Finally, we look at the evolution of communities as the project proceeds. Results of these analyses, which can easily be repeated for other projects, provide insights for the decision-making process that are normally impossible to gain through offline public relations programs.

4.1 Case study project

The “Eglinton Crosstown LRT” is one of the largest transit projects currently under construction in North America. It is part of a bigger city-wide transit plan called “Transit City,” which was announced in 2007 and has been the subject of long debates since then. The cancellation of Transit City in late 2010 by the mayor of Toronto (who supported a subway alternative) met resistance from public groups who supported the project. They formed a campaign called “save transit city” and efficiently employed social media, among other tools, to encourage citizens to contact the provincial government and MPPs (Members of Provincial Parliament) and appeal against this decision. The pressure from the public finally convinced the Toronto City Council to resume the plan in early 2012. This was one of the reasons that brought this project particular attention. Crosstown is an $8.2 billion, 25.2-km east–west LRT line passing through a congested corridor of Toronto’s midtown and running underground for most of its length (19.5 km). The street-level segment is planned to be separated from street traffic by raised medians. TTC (Toronto Transit Commission) is the eventual operator of the project. Metrolinx (a provincial planning and finance agency) is the owner and the provincial government’s agent in the construction. Several Canadian contractors and consultants are other “technical” partners of the project. Procurement began in March 2011 with the manufacturing of precast tunnel linings, and the opening is planned for 2020. Construction officially launched in November 2011 and, as of December 2014, tunneling is underway.

The history of urban transit plans in the city shows the high cost of social opposition, as well as the high sensitivity of Toronto residents to transit projects. The St. Clair streetcar upgrade, a similar LRT project in a nearby neighborhood, is an example that highlights such costs. Lack of clear communication with the public during the planning and preconstruction phases of that project eventually resulted in a $100 million lawsuit against the city by local businesses and the neighborhood community in 2010. Therefore, several community meetings have been held for public consultation, and more are planned on different aspects of the project such as general specifications, station designs, construction schedule, and operation plan. In the era of online social media, official decision makers also launched a Twitter account for the project with the screen name “@CrosstownTo” in December 2011.

At the time data were first collected (September 2012), CrosstownTo had 521 followers; 2 years later, at the last data collection, this number had increased to more than 2,500. The network of CrosstownTo followers is a good example of an IDN. In the following, we detect communities of followers in this network and apply the proposed method to profile them based on descriptions of community members. Then we analyze tweets posted by the followers to distill the shared interests of each community. We take the short biography in Twitter user profiles as a description of users’ affiliations, and the dominant themes of their recent tweets as an indicator of users’ interests and concerns. Note that we do not limit our analysis to project-specific tweets. This gives a better perspective on followers’ typology and interests, and does not exclude project followers who do not participate in project-related discussions; rather, their information is included in deriving the social composition of the IDN. To scope the project context, we use the modified TF-IDF, which, as explained, helps to filter project-irrelevant terms from the results of the analysis.

We will repeat the analysis on eight different snapshots of the IDN to study trends of evolution in the composition and opinions of communities.

4.2 Data collection and analysis methodology

Social-semantic analysis of communities requires collecting data on two aspects of the IDN: first, the connectivity among followers, and second, the contents they post on Twitter. To collect data from Twitter, we communicate with the Twitter API (Application Programming Interface). This requires encoding a request (including the request URL, information on the object(s) about which data are requested, and the electronic signature of the sender for authentication) and sending it to the Twitter API. For the social connectivity, we sent requests to collect the followers list of each follower of CrosstownTo. For contents, two groups of requests were sent: one on user descriptions, and one on user status (which includes the recent tweets by each user). For the latter, we collected the last 200 tweets of each user. Note that some users leave their profile descriptions blank, some do not provide enough information, and some do not include exact information in their biography. Analysis of recent tweets is particularly helpful in accommodating such cases.

The Twitter API responds to the requests it receives in the form of a .json (JavaScript Object Notation) file, which requires preprocessing before use. The list of followers’ followers was searched to detect connectivity between any two users. This resulted in a directed network in which each node represents a Twitter account, and a directed edge A→B implies that A is following B. We wrote code to detect these connections by reading the full followers list and returning an edge list (an m × 2 matrix E in which m is the total number of edges and each row i introduces an edge of the graph: entry E_i1 is following entry E_i2). Figure 1 visualizes such a directed network for the first data set, collected in September 2012. Version 0.8.1 of the Gephi software (https://gephi.org/) was used to visualize this network.
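The construction of the edge list can be sketched as follows. The sketch assumes the followers lists have already been parsed from the .json responses into a dictionary that maps each account in the network to the set of accounts following it; this input structure, the function name, and the account names other than CrosstownTo are hypothetical.

```python
def build_edge_list(followers_of: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return rows (a, b) meaning "a follows b", restricted to network members.

    followers_of[b] is the set of accounts that follow b. Followers that are
    not themselves nodes of the network are ignored.
    """
    members = set(followers_of)
    edges = []
    for followed, followers in followers_of.items():
        for follower in followers:
            if follower in members and follower != followed:
                edges.append((follower, followed))
    return sorted(edges)


# Toy input: the ego account plus two followers.
followers_of = {
    "CrosstownTo": {"alice", "bob"},
    "alice": {"bob", "outsider"},  # "outsider" does not follow the project
    "bob": set(),
}
edges = build_edge_list(followers_of)
```

Each returned pair corresponds to one row of the m × 2 matrix E, with the first entry following the second.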

Fig. 1. IDN of Crosstown project in September 2012, as a directed graph.

User descriptions and status (collected in the form of two separate .json files) went through rounds of cleaning to remove noise and unneeded text. RegEx (Regular Expression) was used to clean the collected content. We removed all html tags and attributes (which are not normally visible in a browser), and replaced all html character codes with their ASCII equivalents. We also removed all URLs to eliminate advertisements, and removed Twitter-specific elements such as mentions of other users (anchored by the @ sign). Moreover, as all monetary values were grouped into the same semantic cluster, we substituted any number anchored with a dollar sign by a traceable tag ($XXX). A similar technique was used for all percentages in the texts (resulting in the semantic tag XX%). All texts were transformed into lower case (to minimize spelling differences when counting frequencies), and nondescriptive terms (such as numbers, gibberish, and punctuation) were removed using a built-in stop list.
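The cleaning steps listed above can be sketched with Python’s standard `re` and `html` modules. The regular expressions, the tiny stop list, and the intermediate placeholder tokens are illustrative assumptions; the exact patterns and stop list used in the study are not given in the text.

```python
import html
import re

# Tiny illustrative stop list (assumption; the study used a built-in list).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "for", "on"}


def clean_text(text: str) -> list[str]:
    """Apply the cleaning steps in order and return the remaining tokens."""
    text = re.sub(r"<[^>]+>", " ", text)        # strip html tags and attributes
    text = html.unescape(text)                  # decode html character codes
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"@\w+", " ", text)           # drop mentions of other users
    text = re.sub(r"\$\d[\d,.]*", " MONEYTAG ", text)      # monetary values
    text = re.sub(r"\d[\d.]*\s?%", " PERCENTTAG ", text)   # percentages
    text = text.lower()
    # Keep words and hashtags; bare numbers and punctuation are dropped.
    tokens = re.findall(r"#?[a-z][a-z0-9_]*", text)
    tags = {"moneytag": "$XXX", "percenttag": "XX%"}
    return [tags.get(t, t) for t in tokens if t not in STOP_WORDS]


sample = ("The <b>Crosstown</b> costs $8.2 billion &amp; is 40% done! "
          "@TTChelps https://t.co/abc #TTC")
tokens = clean_text(sample)
```

For the sample tweet, the remaining tokens are `crosstown`, `costs`, `$XXX`, `billion`, `XX%`, `done`, and `#ttc`.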

Data were collected at eight different points in time, starting from September 2012 when the project was still in the preconstruction phase. Six months later, data were collected again, and after the tunneling machines were launched (in June 2013) we repeated data collection every other month until September 2014. All snapshots of the IDN are “ego-centered” networks having “CrosstownTo” as their focal actor (see Figure 1). Ego-centered networks consist of a focal actor, which is called the ego, and a set of alters who have direct ties to it. Such networks are widely used by anthropologists to study the social environment surrounding individuals or families, as well as by sociologists to study social support. An IDN in our definition is always an ego-centered network formed around the infrastructure project. Collecting data was not possible for users with protected profiles. However, in the different snapshots, such nodes comprise less than 1% of the IDN population, and they all have low follower counts. Moreover, as tweets by such individuals are not publicly accessible, it was assumed that these actors cannot have a high impact on the general network, or even on their own communities. Therefore, in the graph of the IDN, such profiles appear as nodes with only one outgoing link (following the project profile).

Different modularity maximization and hierarchical clustering algorithms were applied to the first snapshot of this network (Nik Bakht and El-Diraby, 2013b). The results showed that modularity maximization in general outperforms other community detection algorithms in terms of computational performance. Applying those algorithms to the other seven snapshots largely confirms the same conclusion. Therefore, we selected “fast unfolding modularity maximization” as the community detection algorithm to analyze communities and their dynamics. In the following, we first analyze communities in the first snapshot, and then study their evolution over time.

4.3 Analysis of communities

A closer look at communities in each snapshot of the IDN reveals interesting details about the social structure behind followers of the Crosstown LRT project. We explain the first snapshot as an example. Each community is more or less composed of people with different backgrounds and different affiliations with respect to the project. However, analysis of the top influential nodes (those with high PageRank) in each community shows that each community is mainly under the influence of a certain group of nodes (Nik Bakht and El-diraby, 2013a). Politicians (including but not limited to the mayor and some provincial ministers), technical/official decision makers of the project (the owner: Metrolinx, the operator: TTC, and some of the contractors), policy makers at the city government (mainly city councilors), some journalists and reporters (city hall and transit reporters in particular), urban planners, and transportation experts (with no official affiliation to this specific project) are the main groups forming community leaders.

Such a classification of communities and their leaders was formerly performed manually by searching through descriptions of the top influential nodes (Nik Bakht and El-diraby, 2013a). Here we show that the modified TF-IDF introduced in this article can accelerate and partly automate the profiling of IDN communities. Our method combines the level of influence extracted from social connectivity among followers with computational linguistics to detect context-related keywords that uniquely describe the interests of each community. The modified TF-IDF is applied twice: once to label communities by analyzing user-profile descriptions, and a second time to detect shared interests in each community through analysis of their members’ most recent tweets. The latter is then extended to find clues on cross-community dialogues. In each of the two cases, the analysis is performed at the community level; that is, a compilation of users’ descriptions (or their last 50 tweets) for each community is considered as one document. Consequently, two dictionaries are generated: one for terms used in users’ biographies (which contains 1,952 terms for the first snapshot), and another for the last 50 tweets (containing 36,944 words for the first snapshot). These dictionaries are formed automatically by parsing the input texts.

Table 2 collects the top keywords detected by the modified TF-IDF of users’ descriptions in the first snapshot of the network. Semantic similarities can be observed among the themes in each community. The high frequency of hashtags with different combinations of “TTC” in community C0 can indicate the dominance of nodes from the project operator in this community. Similarly, analysis of terms with high weights in other communities can reveal the dominant themes that have brought people together in those communities. Terms such as cityhall, ward, and councillor in community C1, and terms such as author, journalist, magazine, and Torontoist (a city blog about Toronto) in community C4 are cases in point. Given the breadth of concepts covered by the biographies of politicians and their associates, there is a level of diversity in the themes represented by high-weight terms in community C3. However, terms such as minister, queenspark, and government can help to explain the dominant theme of this community. It must be noted that labeling is intended to give a general overview of the social construct of the IDN and is a subjective process. In many cases, a clear line may not exist between the affiliations of nodes from two different communities. The subjectivity increases with the diversity of community members. For example, it is difficult to detect a firm similarity among the keywords of community C2. When the top influential nodes of this community are revisited, a combination of councilors, journalists, and even TTC members is found among them. Therefore, it would be difficult to label this community at this snapshot. However, as we will see in the next part, this community is unstable; within 6 months it is decomposed and completely absorbed by other communities.

A comparison between the results of the normal and modified TF-IDF for profiling communities shows that discriminating terms for each community are retrieved considerably faster when the modified version is applied. More specifically, distinctive terms that are detected by screening the top 150 words when the normal TF-IDF is applied can be found among the top 25 to 50 terms when the modified version is applied. For example, the keyword “city hall,” which signals the presence of councilors or city council reporters, is the top word in community C1 under the modified TF-IDF, but is ranked 33rd under the normal version. Moreover, some key terms are lost when the normal version of TF-IDF is used. As an example, the keyword “ToCityhall,” the 24th top term in C1 under the modified TF-IDF, is not among the top 50 terms when the normal version is applied. This can be interpreted as filtering noise and, more importantly, adding context sensitivity to the analysis when the modified TF-IDF is used. It is a direct result of giving more weight to terms related to nodes that have a higher weight in the specific context of the infrastructure project.

On the other hand, it must be admitted that in some cases the results of the modified version may be misleading. This happens in small communities where one or two nodes have a considerably high degree of influence (PageRank) in the main network. In such cases, the modified TF-IDF allows those nodes to hijack the whole community by removing all terms associated with other nodes of the community. In the present IDN, such a case only happened for community C5 in the snapshot of September 2013, where the size of the community is 6 in a network of size 1,392. Also, the PageRank of the top influential node in this community is 4.5 times greater than that of the second node and 9 times greater than that of the other nodes of the community. As a result, the output terms for labeling this community are entirely related to the profile of the single top influential node. However, as will be seen in the next part, such a community has a transient nature and is quickly absorbed by more stable communities. Therefore, this problem does not challenge the applicability of the modified TF-IDF over a wide time period.

Taking such an analysis one step further and mining the topics discussed by users in each community can give decision makers a better perspective on followers’ opinions, concerns, and interests. This can be a summary index of social discussions about the project, clustered based on the community of people who support them. In particular, using the modified TF-IDF can help to identify communities that have (or have not) participated in discussions regarding certain aspects of the project. A positive number for the modified TF-IDF shows that a term has been used by community members (at least once); zero indicates that the term has never occurred in tweets of community members; and a negative number means that all communities have used the term in their tweets. In the CrosstownTo IDN, none of the project-related terms is exclusively used by one community and not the others. As an example, we can mention the hashtag #St.clair-disaster-glorified-streetcars-subway, which refers to the failure of the St. Clair project. All groups of CrosstownTo followers except the project’s technical and official decision makers have referred to this experience. Monorail (as an alternative to the LRT system), in singular or plural form, is an example of a term used by all communities. It is also interesting that all communities except politicians have used the hashtag #fix-transit-now. The hashtag #develop-transit-etc., on the other hand, has been used by all communities except politicians and policy makers. #transitcity and #antitransit are two terms common to the public and politicians, which may reflect the long negotiations against and in favor of the Transit City project.

Table 2. Top descriptive terms in user descriptions at each community (first snapshot of the network)

| | C0 | C1 | C2 | C3 | C4 |
|---|---|---|---|---|---|
| Terms with high modified TF-IDF | #TTCwifi | #cityhall | gta | university | journalist |
| | public | #cityofto | instructor | minister | columnist |
| | #TTCbendybus | #govmaker | progressive | Davisville | freelance |
| | construction | strategist | LRT | industry | torontoist |
| | Hamilton | sustainable | town | #queenspark | author |
| | dedicated | ward | Eglinton | infrastructure | civic |
| | ttc | #tocouncil | business | air | magazine |
| | GTA | committee | economics | adventure | write |
| | #TTChelps | ceo | #climatechange | government | neighborhood |
| | #TTCnotices | councillor | TV | mgmt. | planner |
| Community label | Technical decision makers | City policy makers | Mixed | Politicians/mixed | Public |

Focusing on terms with negative TF-IDF shows that terms such as subway, highways, and transit-system, which reflect debates on alternatives and substitutions for LRT, as well as preconstruction, contracting, and carbon-tax, which are generally related to the project phase (back in September 2012), have been used by users from all communities. This provides a general perspective on cross-community social discussions about the project. It is worth pointing out that cross-community discussions formed around a specific hashtag can be traced back to detect the communities involved. Figure 2 illustrates some of the terms shared by different communities in their 50 most recent tweets as of September 2012.

4.4 Dynamics of communities

We can extend similar analyses over time to study the evolution of communities as the project proceeds through its life cycle. As studies suggest, the sustainability of an infrastructure system must be integrated over the life cycle and be incorporated into its serviceability at any given time period (Akinyemi and Zuidgeest, 2002). Consequently, a meaningful PI process must be open and ongoing during all phases of the project life cycle (Wagner, 2013). The time dimension has recently been considered by many researchers in the design of transportation networks.

Demand dynamics and changes in the behavior of travelers (or prospective travelers) are among the factors that the infrastructure management team must capture over time to keep the level of social equity consistent over the life span of the system (Szeto et al., 2010). Such ongoing monitoring and capturing can be applied more easily in the online environment. Monitoring the IDN not only provides decision makers with a good estimate of social reactions to decisions; detecting core interests and their supporters, as well as the evolution of the two over time, can also assist public relations agencies in designing an ongoing public engagement program with the right scope and involving the right stakeholders. In this part, we detect communities in the eight snapshots of the CrosstownTo IDN listed in Table 3 and, after labeling each community using the modified TF-IDF, we look at trends of evolution over time.

Fig. 2. Co-occurrence of terms in the last 50 tweets of CrosstownTo main four communities on September 2012.

Table 3. Eight snapshots of the IDN and their geometric properties

| Date taken | Sep-12 | Apr-13 | Jul-13 | Sep-13 | Nov-13 | May-14 | Jul-14 | Sep-14 |
|---|---|---|---|---|---|---|---|---|
| Size | 523 | 971 | 1,251 | 1,392 | 1,464 | 2,078 | 2,167 | 2,544 |
| Ave. degree | 19 | 24 | 19 | 20 | 19 | 22 | 22 | 23 |
| Ave. clustering | 0.664 | 0.689 | 0.591 | 0.577 | 0.579 | 0.532 | 0.528 | 0.517 |
| Modularity | 0.172 | 0.188 | 0.188 | 0.205 | 0.211 | 0.212 | 0.220 | 0.229 |
| # of communities | 5 | 6 | 5 | 5 | 4 | 5 | 5 | 5 |
| Community size: C0 | 78 | 386 | 574 | 462 | 639 | 553 | 593 | 635 |
| C1 | 84 | 211 | 253 | 274 | 290 | 424 | 330 | 384 |
| C2 | 77 | 143 | 202 | 203 | 215 | 394 | 446 | 502 |
| C3 | 205 | 154 | 111 | 362 | 320 | 462 | 479 | 614 |
| C4 | 78 | 63 | 111 | 91 | N.A. | 245 | 318 | 409 |
| C5 | N.A. | 14 | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. |

1. Size = Total number of nodes.
2. Average degree = Total number of edges / Total number of nodes.
3. Clustering of a node = Ratio of number of edges between its neighbors to the maximum possible number of such edges.
4. Average clustering = Expectation of clustering over all nodes.

Communities of the IDN are not static clusters; their properties, including structure, size, and membership, continuously change over time. Some communities expand and some shrink. The expansion and shrinkage result from the attraction power of communities as well as the attachment of new arrivals. A transformation matrix was formed to follow communities over time. Rows and columns of this matrix refer to communities in different snapshots, and entries reflect the rate of transformation of nodes from a community in one snapshot to a community in the next snapshot. For example, entry 1,1 of this matrix has a value of 0.71, indicating that 71% of nodes from community C0 in September 2012 are transferred to community C0 in April 2013. We filtered out entries below 0.15 from this matrix and divided the remaining cells into three categories: between 0.15 and 0.25, between 0.25 and 0.5, and above 0.5. When more than 50% of nodes from one community transfer to another community, we assumed that the same community is maintained. Such communities are considered stable communities of the IDN.
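The transformation matrix between two consecutive snapshots can be sketched as follows. The sketch assumes each snapshot is available as a mapping from node to community label and that rates are normalized by the size of the community in the earlier snapshot (so nodes that leave the network still count in the denominator); the function names and toy memberships are hypothetical.

```python
from collections import Counter, defaultdict


def transformation_matrix(before: dict[str, str],
                          after: dict[str, str]) -> dict[str, dict[str, float]]:
    """matrix[a][b] = share of nodes of community a (earlier snapshot)
    that are found in community b in the later snapshot."""
    sizes = Counter(before.values())
    moves = defaultdict(Counter)
    for node, old in before.items():
        new = after.get(node)  # None if the node is absent from the later snapshot
        if new is not None:
            moves[old][new] += 1
    return {old: {new: count / sizes[old] for new, count in row.items()}
            for old, row in moves.items()}


def classify(rate: float) -> str:
    """Bin a transformation rate using the thresholds given in the text."""
    if rate < 0.15:
        return "filtered"
    if rate < 0.25:
        return "0.15-0.25"
    if rate <= 0.5:
        return "0.25-0.5"
    return "above 0.5 (same community maintained)"


# Toy snapshots: node -> community label.
before = {"a": "C0", "b": "C0", "c": "C0", "d": "C0", "e": "C4", "f": "C4"}
after = {"a": "C0", "b": "C0", "c": "C0", "d": "C1",
         "e": "C0", "f": "C4", "g": "C1"}
matrix = transformation_matrix(before, after)
```

In the toy data, 75% of C0 stays in C0, so C0 is treated as maintained, whereas C4 splits evenly between C0 and C4 and does not pass the "more than 50%" criterion.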

Fig. 3. Evolution of communities over time.

Unstable communities are usually decomposed and absorbed by other communities (in different proportions) over time. To analyze the transformation matrix, we have visualized its results in Figure 3. Bubbles in this figure represent communities, and their sizes reflect the sizes of the communities. Arrows highlight how users of a community behave over time. As this figure suggests, communities C0 and C1 are the most stable communities; that is, they maintain their basic structure as time passes. Community C2 reaches such stability after April 2013, and C3 after September 2013. On the other hand, community C4 is the least stable one. Members of this community (which is the community of the public) are continuously absorbed by other, more stable communities. C0 (technical decision makers) and C2 (the mixed community) have the highest level of attractiveness to members of C4. Community C4 even disappears completely in November 2013 and is formed again by new arrivals and some members of C1. Members of C3, in turn, show a strong tendency to move toward the more stable community C0 at the beginning, before C3 itself becomes stable. The behavior of the newborn community C5 in April 2013 is also worth mentioning. It results from the joining of new groups of followers and is absorbed by the more attractive community C0 soon after its formation. This figure suggests that community C0 has the highest level of attraction power and often absorbs members from other communities, as well as new arrivals. A comparison between the left side and the right side of Figure 3 suggests that, over the 2-year horizon of analysis, communities become comparatively more stable as the IDN evolves.

Profiling communities can now connect the dots and give a complete overview of the social construct of the IDN and the behavior of different groups of project followers over time. Community C0, which is initially composed of technical decision makers, lacks some important nodes in its first snapshot. The owner (Metrolinx) and the project team (Twitter account “Crosstownteam”) are two of the nodes missing from C0 in September 2012. However, 6 months later, C0 has absorbed these nodes together with the project ID (ego), and forms the largest and most stable community of the network. This community not only absorbs technical actors of the project that are added later (such as Aecom, one of the main tunneling contractors, added in September 2013), but also attracts many other project followers (such as the Toronto mayor in July 2013, and many public officials). Analysis of user descriptions in all later snapshots shows that technical decision makers of the project dominate this community.

Community C1 has more or less similar conditions, but is dominated by city policy makers. City councilors, members of parliament, and associated reporters form the top influential nodes of this community. Many city councilors who were initially in C2 are attracted by C1 in later phases. Keywords such as City hall, councillor, ward, and cityofto, which after July 2013 are constantly among the top 20 keywords of this community, confirm this interpretation. Community C2 has two totally different phases. In September 2012, this community is mainly composed of city councilors, journalists, and city hall reporters. But from April 2013 onward, keywords such as MPP, minister, and ministry suggest that some politicians (many at the provincial level) have migrated to this community. Since then, the community has taken a more stable form and continued its expansion.

After April 2013, C3 and C4 are mainly communities of the public. Urban planners, transit reporters, and transportation specialists who do not have any official affiliation with the project are the community leaders of these clusters. A noticeable characteristic of communities of the public is the relatively low stability of their membership, particularly in the early phases of the IDN. In addition, the content of discussions in these communities has the largest amount of noise (terms unrelated to the scope of the project). In general, however, the related keywords in tweets of these communities belong to the economy and environment semantic classes, as well as construction impacts.

Some recommendations for official decision makersof the project based on results of the 2-year analysis canbe listed as follows:

1. Beyond a communication tool, online social me-dia can be used for analysis of connectivityamong project stakeholders; social network ana-lytics combined with lexical analyses and seman-tics can help to understand typology of project’sonline followers, and to segment their vestedinterests.

2. Leaders of communities C3 and C4 are good can-didates to be targeted and involved in the processof consultation with the public.

3. Communities of the IDN generally stabilize over time. Turbulence in the behavior of IDN communities at early stages of the project life cycle should not mislead decision makers regarding the social construct of project followers.

4. Communities of the public have the lowest level of stability and are frequently absorbed by other communities. In the case study project, the community of technical decision makers has the highest level of attractiveness to the public, which can be a good sign for decision makers, as nodes affiliated with the owner and operator form the leaders of this community.

5. Communities that are too small in size are, most of the time, temporary in nature and will be absorbed by larger communities. Labeling such communities using modified TF-IDF may be misleading, so it is best to wait until they have merged into other, stable communities.

6. Communities may completely change their composition as the project proceeds through its life cycle; therefore, monitoring must be applied as a continuous process over the different phases of the project.

7. Construction impacts, economy, and environment-related issues are among the major categories discussed by communities of the public. This can indicate the main concerns of the public communities with respect to this project. Also, no major topic is discussed within one community and not in the others, which can be a positive sign of mutuality of main interests among the major groups of followers.

8. In combination with a sentiment analysis, the present results can give a clear perspective on followers’ positions with respect to the project and associated decisions.
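
The stability and absorption patterns noted in recommendations 3 to 5 can be illustrated with a small sketch. It assumes that each monitoring phase yields a mapping from community label to a set of member identifiers. The Jaccard index used here is one common overlap measure and is an assumption of this sketch, not necessarily the measure used in the case study; the function names and the toy snapshots are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two member sets (1.0 = identical, 0.0 = disjoint)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def track_communities(earlier, later):
    """For each community in the earlier snapshot, report its best-matching
    successor in the later snapshot, the Jaccard stability of that match,
    and the share of its members that ended up in the successor."""
    report = {}
    for name, members in earlier.items():
        members = set(members)
        if not members:
            continue  # nothing to track for an empty community
        # Successor = later community that holds most of this community's members
        successor = max(later, key=lambda k: len(members & set(later[k])))
        shared = members & set(later[successor])
        report[name] = {
            "successor": successor,
            "stability": jaccard(members, later[successor]),
            "share_moved": len(shared) / len(members),
        }
    return report


# Hypothetical toy snapshots (not data from the case study):
# the small community C3 is fully absorbed by C1 in the second phase.
snap_1 = {"C1": {"a", "b", "c", "d"}, "C3": {"e", "f"}}
snap_2 = {"C1": {"a", "b", "c", "d", "e", "f"}}

report = track_communities(snap_1, snap_2)
for name, row in report.items():
    print(name, row)
```

A community whose successor carries a different label and whose share of moved members is close to 1 can be read as absorbed; under this assumption, labeling it would be deferred until the following phase.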

5 CONCLUSION AND FUTURE WORK

This article employed a combination of mathematical methods and information retrieval techniques for detecting and labeling/profiling communities in the mishmash of the IDN. Communities are normally formed around common interests and similarities. Hence, detecting and labeling communities can direct decision makers to some of the core interests in the project. Semantic similarity among the terms discussed and social connectivity of members are the main criteria we used in profiling communities. One strength of this method is its bottom-up nature in assigning higher weights to nodes that have higher levels of influence on others. It was shown that such analyses can reveal patterns of terms and topics communicated by different groups of project followers.

The modified TF-IDF algorithm introduced here outperforms the normal TF-IDF by removing noise and adding context to the text analysis. With the aid of this algorithm, we studied community dynamics and showed how communities shrink or expand over time. We also evaluated the level of attraction power of different communities. As an IDN evolves, its communities become more stable. Newborn communities, which form from time to time, are most often too small in size and transient in nature; therefore, they quickly merge into larger communities. Results of the case study showed that communities of technical decision makers in this project are the most stable and attractive communities of the IDN. On the other hand, the structure of communities of the public changes more frequently over time.
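
As a rough illustration of influence-weighted keyword labeling, the following sketch treats each community as one "document" and weights every term occurrence by the influence score of the user who wrote it. It does not reproduce the exact formulation of the modified TF-IDF used in this article; the weighting scheme, the function name `weighted_tfidf`, the influence scores, and the toy tokens are all assumptions made for the example.

```python
import math
from collections import defaultdict


def weighted_tfidf(communities, influence):
    """communities: {community: {user: [tokens]}}
    influence: {user: weight}, e.g., a normalized centrality score (assumed).
    Returns {community: {term: score}}."""
    # Influence-weighted term frequency, normalized within each community
    tf = {}
    for comm, users in communities.items():
        counts = defaultdict(float)
        for user, tokens in users.items():
            w = influence.get(user, 0.0)
            for tok in tokens:
                counts[tok] += w
        total = sum(counts.values()) or 1.0
        tf[comm] = {t: c / total for t, c in counts.items()}

    # Document frequency: number of communities in which a term carries weight
    n = len(communities)
    df = defaultdict(int)
    for scores in tf.values():
        for t, v in scores.items():
            if v > 0:
                df[t] += 1

    # Terms shared by every community get log(n / n) = 0 and drop out as labels
    return {
        comm: {t: v * math.log(n / df[t]) for t, v in scores.items() if v > 0}
        for comm, scores in tf.items()
    }


# Hypothetical toy input (not data from the case study)
communities = {
    "C1": {"u1": ["council", "ward", "lrt"], "u2": ["council", "lrt"]},
    "C3": {"u3": ["traffic", "cost", "lrt"], "u4": ["cost", "coffee"]},
}
influence = {"u1": 0.5, "u2": 0.3, "u3": 0.15, "u4": 0.05}

scores = weighted_tfidf(communities, influence)
for comm, terms in scores.items():
    top = sorted(terms, key=terms.get, reverse=True)[:2]
    print(comm, top)
```

In this toy input, a term used by every community (here "lrt") receives a zero score, and a term used only by a low-influence user (here "coffee") ranks below terms used by more influential members, which is the kind of noise reduction the weighting is meant to achieve.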

In combination with a proper sentiment analysis of online discussions and semantic clustering of the detected terms, the findings of this article can provide official decision makers with a mental map of the major online followers of the project. Analysis of the dynamics in discussions can help decision makers to evaluate social feedback regarding the decisions they make. Interactions in the real world are often reflected in the online behavior of the actors. The presented method has the advantage of a self-organizing nature that emerges from the interactions of the involved actors from “within” the system. Outcomes of analyzing the IDN can therefore provide a guideline for communication with the public.

Finally, the method’s limitations should be acknowledged. Twitter (or any other social medium) does not necessarily reflect an exact picture of society. Despite the acceptable distribution in the backgrounds of Twitter followers for our case study, which makes our IDN a decent sample, it is by no means a full representation of Toronto. There are always people with stakes in the project who are not social-web savvy, and people who have a high impact on their communities but are not active in the online environment or do not have a Twitter account. Moreover, the confidentiality of project-related as well as follower-related information can be a barrier that makes online discussions a distorted picture of reality. Furthermore, online and offline social attitudes do not necessarily match perfectly. Consequently, the existence of a followership relation between two nodes on Twitter may not be the best notion of influence, nor a strong social tie, in some cases. Our ongoing research is focusing on closer types of connection and stronger ties among project followers (such as mentioning and re-tweeting) as well as more intuitive measures of influence. In addition, communities of the IDN should be regarded as an incomplete/imperfect image of communities of interest in the real project. However, as IDNs mature and involve more followers, this image improves. Moreover, TF-IDF stops at the term level and does not go beyond the topic. Although the list of key terms can provide decision makers with an overview of social concerns and interests, detecting the content and sentiment of discussions is necessary for completing the layout of the social mental map with respect to the project. Natural language processing and semantic clustering can help with this issue, and studies are currently underway in this direction.

REFERENCES

Abdul Aziz, H. M. & Ukkusuri, S. V. (2012), Integration of environmental objectives in a system optimal dynamic traffic assignment model, Computer-Aided Civil & Infrastructure Engineering, 27, 494–511.

Akinyemi, E. O. & Zuidgeest, M. H. (2002), Managing transportation infrastructure for sustainable development, Computer-Aided Civil & Infrastructure Engineering, 17(3), 148–61.

Atkin, B. & Skitmore, M. (2008), Editorial: stakeholder management in construction, Construction Management and Economics, 26(6), 549–52.

Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. (2008), Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment, 10, 2000–12.

Bregman, S. & Watkins, K. (2013), Best Practices for Transportation Agency Use of Social Media, CRC Press, Boca Raton, FL.

Bregman, S., TRB & TCRPSYNTH. (2012), TCRP-Synthesis 99-Use of Social Media in Public Transportation, TRB, Washington.

Brin, S. & Page, L. (1998), The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems, 30(1-7), 107–17.

Bruijn, H. D. & Heuvelhof, E. T. (2000), Networks and Decision Making, LEMMA Publishers, Utrecht.

Chinowsky, P., Diekmann, J. & Galotti, V. (2008), Social network model of construction, ASCE Journal of Construction Engineering & Management, 134(10).

Clauset, A., Newman, M. E. & Moore, C. (2004), Finding community structure in very large networks, Physical Review E, 70, 066111–1 to 066111–6.

Donetti, L. & Munoz, M. A. (2004), Detecting network communities: a new systematic and efficient algorithm, Journal of Statistical Mechanics: Theory and Experiment, 10, 127–34.

El-Diraby, T. E. (2011), Civil infrastructure as a chaotic socio-technical system: how can information systems support collaborative innovation, CIBW078, Computer Knowledge Building, Sophia Antipolis, France.

Evans-Cowley, J. S. & Griffin, G. (2012), Microparticipation with social media for community engagement in transportation planning, Transportation Research Record (Journal of the Transportation Research Board), 2307, 90–98.

Evans-Cowley, J. & Griffin, G. (2011), Micro-Participation: The Role of Microblogging in Planning. Available at: http://ssrn.com/abstract=1760522 or http://dx.doi.org/10.2139/ssrn.1760522, accessed February 2014.

Ferguson, E. M., Duthie, J. & Waller, S. (2012), Comparing delay minimization and emission minimization in the network design problems, Computer-Aided Civil & Infrastructure Engineering, 27, 288–302.

Fortunato, S. (2010), Community detection in graphs, Physics Reports, 486, 75–174.

Girvan, M. & Newman, M. E. (2002), Community structure in social and biological networks, Proceedings of National Academy of Sciences of the USA, 99(12), 7821–26.

Kalafatis, T. (2009), Twitter analytics: cluster analysis reveals similar twitter users. Available at: Life analytics: http://lifeanalytics.blogspot.ca/2009/05/twitter-analytics-cluster-analysis.html, accessed April 5, 2015.

Kumar, R., Raghavan, P., Rajagopalan, S. & Tomkins, A. (1999), Trawling the web for emerging cyber-communities, Computer Networks: The International Journal of Computer and Telecommunications Networking, 31(11), 1491–493.

Lopez, E. & Monzon, A. (2010), Integration of sustainability issues in strategic transportation planning: a multi-criteria model for the assessment of transportation infrastructure plans, Computer-Aided Civil & Infrastructure Engineering, 25(6), 440–51.

Nejat, A. & Damnjanovic, I. (2012), Agent-based modeling of behavioral housing recovery following disasters, Computer-Aided Civil & Infrastructure Engineering, 27, 748–63.

Newman, M. E. (2006), Modularity and community structure in networks, PNAS (Proceedings of the National Academy of Sciences of United States of America), 103(23), 8577–82.

Ng, M., Lin, D. & Waller, S. (2009), Optimal long-term infrastructure maintenance planning accounting for traffic dynamics, Computer-Aided Civil & Infrastructure Engineering, 24, 459–69.

Nik Bakht, M. & El-diraby, T. (2014), Infrastructure discussion networks: analyzing social media debates of LRT projects in North American cities, in TRB 93rd Annual Meeting, Transportation Research Board, Washington DC.

Nik Bakht, M. & El-diraby, T. E. (2013a), Analyzing infrastructure discussion networks: order of ‘influence’ in chaos of ‘followers’, in CSCE Annual Conference-4th Construction Specialty Conference, CSCE, Montreal.

Nik Bakht, M. & El-Diraby, T. E. (2013b), What Does Social Media Say about the Infrastructure Construction Project? CIB W78, Beijing, China.

Olander, S. (2007), Stakeholder impact analysis in construction project management, Construction Management and Economics, 25, 277–87.

Ottens, M. M., Franssen, M., Kroes, P. A. & Poel, V. I. (2006), Modelling infrastructures as socio-technical systems, International Journal of Critical Infrastructures, 2(2), 133–45.

Palla, G., Derenyi, I., Farkas, I. & Vicsek, T. (2005), Uncovering the overlapping community structure of complex networks in nature and society, Nature, 435(7043), 814–18.

Papadopoulos, S., Kompatsiaris, Y., Vakali, A. & Spyridonos, P. (2012), Community detection in social media, Journal of Data Mining and Knowledge Discovery, 24(3), 515–54.

Parkin, J. (1994), A power model of urban infrastructure decision making, Geoforum, 25, 203–11.

Pitsilis, G., Zhang, X. & Wang, W. (2011), Clustering recommenders in collaborative filtering using explicit trust information, in Proceedings of Trust Management V, Springer, Copenhagen, Denmark, 82–97.

Pryke, S. D. (2004), Analysing construction project coalitions: exploring the application of social network analysis, Construction Management and Economics, 22(8), 787–97.

Srinivasan, S. & Bhowmick, S. (2012), Using stable communities for maximizing modularity, 10th DIMACS Implementation Challenge: Graph Partitioning and Graph Clustering, DIMACS, Atlanta, Georgia.

Steinhaeuser, K. & Chawla, N. V. (2008), Community detection in a large real-world social network, in International Conference on Social Computing, Behavioral Modeling and Prediction, Springer, Phoenix, Arizona, USA, 168–75.

Stinchcombe, A. (1959), Bureaucratic and craft administration of production: a comparative study, Administrative Science Quarterly, 4(2), 168–87.

Szeto, W., Jaber, X. & O’Mahony, M. (2010), Time-dependent discrete network design frameworks considering land use, Computer-Aided Civil & Infrastructure Engineering, 25(6), 411–26.

Taylor, J. & Levitt, R. (2007), Innovation alignment and project network dynamics: an integrative model for change, Project Management Journal, 38(3), 22–35.

Twitter Statistics (2015), Available at Statisticbrain: http://www.statisticbrain.com/twitter-statistics/, accessed January 15, 2015.

Ukkusuri, S. V., Mathew, T. V. & Waller, S. (2007), Robust transportation network design under demand uncertainty, Computer-Aided Civil & Infrastructure Engineering, 22(1), 6–18.

Wagner, J. (2013), Measuring the performance of public engagement in transportation planning: three best principles, Transportation Research Record: Journal of Transportation Research Board (Issue Number 2397), 13(2646), 38–44.

Yang, Z., Yu, B. & Cheng, C. (2007), A parallel ant colony algorithm for bus network optimization, Computer-Aided Civil & Infrastructure Engineering, 22(1), 44–55.