Microsoft Academic Search and Google Scholar Citations: Comparative Analysis of Author Profiles

José Luis Ortega
VICYT-CSIC, Serrano, 113 28006 Madrid, Spain. E-mail: [email protected]

Isidro F. Aguillo
Cybermetrics Laboratory, CCHS-CSIC, Albasanz, 26-28 28037 Madrid, Spain. E-mail: [email protected]

This article offers a comparative analysis of the personal profiling capabilities of the two most important free citation-based academic search engines, namely, Microsoft Academic Search (MAS) and Google Scholar Citations (GSC). Author profiles can be useful for evaluation purposes once the advantages and the shortcomings of these services are described and taken into consideration. In total, 771 personal profiles appearing in both the MAS and the GSC databases were analyzed. Results show that the GSC profiles include more documents and citations than those in MAS but with a strong bias toward the information and computing sciences, whereas the MAS profiles are disciplinarily better balanced. MAS shows technical problems such as a higher number of duplicated profiles and a lower updating rate than GSC. It is concluded that both services could be used for evaluation purposes only if they are applied along with other citation indices as a way to supplement that information.

Introduction

In November 2009, Microsoft Research Asia started a new web search service specializing in scientific information. Even though Google had already introduced an academic search engine (Google Scholar) in 2004, the proposal of Microsoft Academic Search (MAS) went beyond a mere document retrieval service that counts citations. It automatically provided bibliographic records grouped by authors, journals, institutions, or research disciplines, a grouping that, albeit with limited quality control, is enough for research evaluation and scientific benchmarking. Given the limitations in the control of identifiers, the most interesting feature is that the whole search service relies on self-edited personal profiles. That means that they can be updated, modified, or merged, after approval, by the researchers themselves. This allows scientists to develop a public, qualitatively controlled, and accessible curriculum vitae to promote their research activities and overall performance. At the very beginning, this service was beta-tested with databases covering computer science disciplines only, but since March 2011, it has become multidisciplinary, with the agreement of different source providers.

At the same time, Google launched Google Scholar Citations (GSC) (Google Scholar blog, 2011a, 2011b) in response to Microsoft’s initiative (Fenner, 2011). Google Scholar (GS) is considered the largest academic search engine; it harvests not only scientific papers available on the web through repositories, publisher platforms, and personal web pages but also other academic materials, court opinions, and patents. GSC provides facilities for exporting bibliographic records, which is useful for building a personal homepage with the following bibliographic and bibliometric information gathered by GS: lists of publications, names of coauthors, subject labels, and bibliometric indicators such as the h-index. Unlike MAS, each profile should be voluntarily self-created and self-edited by the author using a Google personal account, selecting his or her own references, choosing his or her partners, and choosing the labels that best describe his or her fields of interest in natural language. Thus, whereas MAS adopts a restricted model in which the researchers can only suggest changes or merges in their automatically supplied profiles, the GSC model grants the users complete ownership of their profiles, with full capabilities for editing and modifying them freely.

With regard to traditional databases used for research analysis and evaluation (Web of Science [WoS] and Scopus), both MAS and GSC profilers introduce several innovations specifically suited for research assessment and other scientometric studies. Unlike WoS and Scopus, the two services are free to access, which means they can reach wider and larger audiences, both in academia and outside; this is particularly welcomed in developing countries. The second innovation is that these new services are focused on the author as the main unit, instead of the journal, easing the way to science 2.0 functionalities and web 2.0 tools such as the social networks (Waldrop, 2008). The improvement in the (self-)standardization of names, the availability of unique global identifiers for authors, the exporting capabilities, and the new-generation indicators provided are going to significantly enhance the evaluation of scientists' research, also facilitating the incorporation (through ImpactStory, for example) of new research measures related to social networking and web performance (altmetrics indicators; Priem, Piwowar, & Hemminger, 2011). A third interesting aspect is that both search engines use databases completely different from those of WoS and Scopus for extracting and compiling their records and indicators, offering, for citation analysis, a new scenario that has limited overlap with the other traditional sources. However, feasibility is hampered by several technical issues that, because of the novelty of the products, are still not fixed. Those shortcomings include a high number of duplicate profiles (due to mistakes in the names or affiliations and interinstitutional mobility) and duplicate records (which split the received citations across different entries, so that no single entry shows the full count), which is explained largely by the high number and format diversity of the different sources that these services use to build their databases. The open capabilities for external editing can also be used to manipulate the data (Delgado López-Cozar, Robinson-Garcia, & Torres-Salinas, 2012), although MAS requires prior approval of changes, and GSC penalizes such unethical behavior.

Received November 16, 2012; revised February 18, 2013; accepted June 6, 2013

© 2014 ASIS&T • Published online 26 February 2014 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/asi.23036

JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 65(6):1149–1156, 2014

These profiling capabilities are similar in MAS and GSC, and they are better developed than in the traditional bibliographic databases, so we focus on them, excluding both WoS and Scopus, which have been extensively discussed by other authors. The new services are a serious alternative to the traditional bibliographic databases in terms of coverage and access (Thomson Reuters, 2012; Wikipedia, 2013; Scopus, 2013). Their commercial strategy regarding the provision of aggregated indicators has changed during the past few years, but the quantitative measurements play a central role in both services, with GSC providing h-index indicators and MAS focusing more on different usages of citations. Not only are the indicators incorporated into the profiles, but they are also used to build rankings at different levels of aggregation. In this way, it is possible to use these tools not only for microlevel (author) research evaluation but also at the meso- or macrolevels (journals, institutions, disciplines). Furthermore, GSC is by far the largest platform currently available; it provides additional research materials from usually neglected or poorly covered fields or from informally published documents that are deposited in subject or institutional repositories. The main justification for this article is that this is the first time a comparative analysis of the data provided by the profiling capabilities of the free citation databases has been performed.

Related Research

There are no previous studies comparing MAS and GSC from the point of view of the implications for research evaluation of their end user–friendly academic profiling systems and data. Therefore, this literature review presents the most relevant papers that have studied both search engines from a research evaluation perspective.

Several papers have used GS as a data source because of its large coverage not only of the main scientific peer-reviewed journals and preprints from open access repositories but also of obscure or not so formally published scientific documents, such as popularization papers, conference presentations, or learning support materials. Most of these studies have analyzed the citation coverage of GS in relation to the major scientific citation databases. Bakkalbasi et al. (2006) studied the coverage of GS, Scopus, and WoS in two research disciplines, finding that GS was the source that contributed most new citations, which makes it an important complementary citation index to the traditional ones. Along the same lines, Meho and Yang (2007) observed that GS is a supporting source to WoS and Scopus because it contains citations from conference proceedings, non-English-language journals, and other academic materials, which enriches the research evaluation and gives it a broader international scope. Harzing and Van der Wal (2008) went further and suggested that GS impact measures have to be taken into account when evaluating the impact of both journals and researchers in disciplines poorly represented in WoS, such as management and international business. In this sense, Kousha and Thelwall (2007) argued that GS is a more comprehensive tool for citation tracking in social sciences, because this branch of knowledge produces the most varied research objects, such as books, working papers, technical reports, and so on. However, Jacsó (2008, 2010) was critical of the technical problems and advised that the use of GS for bibliometric evaluation could produce unscholarly measures. Aguillo (2012) also warned that the use of GS for evaluation purposes should be done with great care, especially regarding the items not overlapping with those present in Scopus or WoS.

However, there are still very few papers related to GSC data. Huang and Yuan (2012) carried out an exploratory study of GSC, showing different citation patterns between disciplines and concluding that GSC is a powerful data source for citation analysis, whereas Bar-Ilan et al. (2012) checked the web footprint of a group of information scientists, uncovering that 23% of them hold a GSC profile. Ortega and Aguillo (2012) used GSC to build a map of science from the labels that describe each personal profile and later introduced a global country and institutional collaboration map from GSC profile affiliations (Ortega & Aguillo, 2013). Jacsó (2012) corroborated the current limitations but predicted a role for GSC in scientific evaluation in the future.

With regard to MAS, just a couple of articles have been published on the performance of the Microsoft service: Jacsó (2011) presented a review of its principal functionalities in comparison with Scopus and WoS, concluding that MAS may become a free tool to help in research assessment; Zhang (2011) analyzed the h-index and the coauthor network of 100 profiles in MAS.

Objectives

This paper presents an empirical comparison of two web scientific services that provide personal profiles, MAS and GSC. In the first stage, we intend to provide a descriptive view of the coverage of both search engines in terms of disciplines and institutions. The objective is to detect whether there are gaps and biases between them and to determine whether they can be used effectively in bibliometric/webometric studies as well as in research evaluations.

In a second stage, a detailed comparison of a group of profiles present in both services is made. Our objective is to observe the variation between documents and citations in the two databases and to obtain quantitative estimations from these differences.

Methods

Data Extraction

MAS is a scientific web database that gathers bibliographic information from the principal scientific publishers (Elsevier, Springer) and bibliographic services (CrossRef). It contains roughly 40 million documents and 19 million profiles (August 2012), which are automatically created from the signatures of the authors of these papers. In addition to other units such as journals, institutions, and conferences, MAS also builds profiles for individuals, which include the author’s list of publications; bibliometric indicators (publications, citations); disciplinary areas of interest; and different sets of the most frequent coauthors, journals, and key words.

Each MAS author’s profile includes a randomly assigned identification number. This randomness is assumed because there are IDs larger than the total number of profiles. Because of this, an SQL script was designed to request automatically a range of numbers from 1 to 100,000 and to extract the existing profiles in that range. These margins were established for technical reasons to avoid an excessive querying time. Hence, from 100,000 possible profiles, 15,211 (only 0.08% of the total profiles) were detected. For each of these profiles, the name, affiliation, number of documents, number of citations, and h-index were collected.

On the other hand, GSC is a platform in which a registered user can build his or her personal scientific profile from the documents indexed by GS. This publications list is also supplied with some bibliometric indicators (citations, h-index, and i10-index), a coauthors list, and some key words, in this case defined by the user. The total number of profiles is unknown. A crawling process performed in June 2012 allowed us to retrieve 396,072 profiles, 83,774 of which were not duplicated. The standard capture–recapture formula (accepting a closed population given the short period devoted to sampling) was used to estimate a population from the number of duplicated records (Ortega & Aguillo, 2012), so the total number of profiles in GSC in June 2012 could be estimated at roughly 106,246 profiles.
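As a minimal sketch of the capture–recapture idea mentioned above, the snippet below assumes the classical two-sample Lincoln–Petersen form; the text does not spell out how the crawl was divided into samples, so the function name `lincoln_petersen` and the counts used are hypothetical and serve only to illustrate the arithmetic.

```python
def lincoln_petersen(first_sample: int, second_sample: int, recaptured: int) -> float:
    """Estimate the size of a closed population from two samples.

    first_sample  -- distinct items seen in the first sample
    second_sample -- distinct items seen in the second sample
    recaptured    -- items of the second sample already seen in the first
    """
    if recaptured <= 0:
        raise ValueError("at least one recaptured item is needed")
    # Assumption: the population is closed and every item is equally likely
    # to be sampled, so recaptured / second_sample ~ first_sample / population.
    return first_sample * second_sample / recaptured


# Hypothetical counts (not taken from the study).
estimate = lincoln_petersen(first_sample=1000, second_sample=800, recaptured=200)
print(f"estimated population: {estimate:.0f}")
```

The fewer items that are seen twice, the larger the estimated population, which is why the number of duplicated records carries the information needed for the estimate.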

This crawling process was performed in two stages. First, 600 random queries were built by the combination of the 25 letters of the Latin alphabet in groups (strings) of two (i.e., aa, ab, ac, and so on), then they were automatically launched to the search engine, and after checking that the search accepted the strings and could be contained in the names as intended, the resulting profiles from these queries were retrieved and stored. That exhaustive list of profiles was used in the second stage to harvest author information, such as working address, e-mail domain, bibliometric indicators (papers, citations, h-index, and i10-index), and coauthors list. An SQL routine was implemented for both the crawling and harvesting tasks to accomplish the data extraction process.
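The two citation indicators harvested from GSC profiles can be derived from a list of per-document citation counts. The sketch below uses the usual definitions (h-index: the largest h such that h documents have at least h citations each; i10-index: the number of documents with at least 10 citations); the function names and the sample counts are illustrative assumptions, not data from the study.

```python
def h_index(citations: list[int]) -> int:
    """Largest h such that h documents have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citations: list[int]) -> int:
    """Number of documents with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)


# Hypothetical citation counts for one profile.
profile_citations = [25, 12, 10, 8, 3, 0]
print(h_index(profile_citations), i10_index(profile_citations))
```

Because both indicators depend only on the citation counts attached to a profile, any difference in the documents a service indexes carries over directly to the indicator values.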

Data Comparison

To compare the performance of each profile in MAS and GSC, we identified how many profiles from the MAS list (15,211) are also included in the GSC list (83,774). First, we broke down each name into first name, initials, second name, and last name in both MAS and GSC. Then we permuted queries between each field, looking for matches between similar names in MAS and GSC; 802 candidate common profiles were detected in this way. Next we refined each pair, checking whether the affiliation was the same for both profiles and comparing the list of publications and coauthors. By manually checking the current personal web pages of each scientist, we found that 41% of the profiles in MAS presented an outdated affiliation, whereas GSC contained far fewer inaccurate addresses. The low updating/correcting/merging/deleting rate in MAS occurs not only because the end user must explicitly request modifications but also because of slow quality control from the editors of the service. Otherwise, multiple profiles of the same author will be available showing the different affiliations appearing in the original sources over the years. This appraisal indicates that institutional or country studies with MAS are not recommended or at least have to be performed carefully and after a detailed verification of the affiliations. Finally, 771 similar profiles were identified in both samples. That means that 5% of the MAS profiles are present in GSC, whereas just 1% of the GSC profiles are in MAS. According to the traditional overlap measure (Hood & Wilson, 2003), the overlap between both sources is .079%.

However, these estimations are based on samples of different representativeness and size. Thus, whereas the GSC sample is 78% of the registered scientists, the MAS sample represents just 0.08% of the total author population. Another way to compute the overlap, perhaps more precisely, would be to match each profile directly against the entire database of MAS or GSC, but to do that it would be necessary to extract all the profiles in both search engines to carry out the matching process. This procedure would be too expensive in time and computing resources. Even in this case, the number of spurious matches (i.e., popular names such as López, Smith, or Wong would produce multiple coincidences) could require a large and tedious manual cleaning process.

However, although this approximation has to be cautiously considered, the observed overlap in general is rather low. This could be due to MAS creating profiles automatically for each author that appears in a new paper, whereas in GSC the profile has to be intentionally created by a user, which results in a disproportionately greater number of profiles in MAS than in GSC.
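The sample-based shares quoted in this section can be re-derived from the reported counts. The sketch below assumes that the traditional overlap is the number of shared profiles divided by the size of the union of both samples, expressed as a percentage; that definition, the function name `overlap_stats`, and the use of the rounded 19 million MAS total are assumptions made for illustration.

```python
def overlap_stats(n_a: int, n_b: int, n_common: int) -> dict[str, float]:
    """Percentage shares of the common items in two samples A and B."""
    union = n_a + n_b - n_common  # each shared item is counted once
    return {
        "share_of_a": 100 * n_common / n_a,
        "share_of_b": 100 * n_common / n_b,
        "overlap": 100 * n_common / union,  # assumed shared / union form
    }


# Counts reported in the text: MAS sample, GSC sample, matched profiles.
stats = overlap_stats(n_a=15_211, n_b=83_774, n_common=771)

# Sampling fractions, using the population figures given in the text.
gsc_sample_fraction = 100 * 83_774 / 106_246      # estimated GSC population
mas_sample_fraction = 100 * 15_211 / 19_000_000   # rounded MAS total

for name, value in stats.items():
    print(f"{name}: {value:.2f}%")
print(f"GSC sample: {gsc_sample_fraction:.2f}%  MAS sample: {mas_sample_fraction:.2f}%")
```

Under the assumed shared/union definition, these counts give an overlap of about 0.79%, so the printed overlap value may rest on a different normalization. The two sampling fractions (about 78.85% and 0.08%) show how unevenly the two samples represent their populations.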

Results

Comparative Analysis

This comparative analysis was performed with the total data that MAS can provide through its search interface. According to the main page of MAS, it contained about 19 million profiles or authors in August 2012. In the case of GSC, that information is unknown, but the crawl carried out in June 2012 allows an estimate of 106,246 as the total number of profiles. This means that our sample represents 78.85% of that population. This calculation shows that the number of profiles in MAS is almost 200 times the current number of profiles in GSC. Two main reasons could explain this large difference between the two providers: (a) MAS has been working since 2009, whereas GSC started 2 years later, in 2011, and (b) profiles in MAS are automatically created from the metadata (Microsoft, 2012) of several bibliographic services (CrossRef, Elsevier, Springer, etc.), whereas the GSC profiles are created only when the end user, an author, freely decides to do it.

Table 1 shows a comparative distribution by disciplines of the number of authors registered in both search engines. It contains only the 10 most frequent subjects. These services do not use the same subject matter classification; GSC does not use any at all, only uncontrolled labels supplied by the authors. Therefore, these labels were used to classify each profile according to a defined classification scheme, in this case the subject area categories of Scopus (2008). On the other hand, MAS uses its own classification scheme (Microsoft, 2011), which is similar to that of Scopus, allowing a fair comparison between them. Because the same profile can be classified under several disciplines, the total in Table 1 does not match the total number of profiles in MAS and GSC.

The most interesting aspect of both distributions is the huge presence of computer science profiles in GSC, with 21.4%, followed by engineering (6.3%) and biochemistry, genetics, and molecular biology (6.1%), whereas important disciplines such as medicine (5%) and chemistry (2.1%) are comparatively poorly represented. This fits with the statement that GSC may be seen as a database populated by scientists interested in new information technologies, science 2.0, and scientific networking, which causes a large presence of computer science profiles (Ortega & Aguillo, 2012). On the contrary, MAS shows a distribution more similar to the traditional citation indices, in which medicine (26.8%), biology (13.9%), and chemistry (12.7%) are the expected disciplines with the largest number of authors. This balanced distribution in MAS is due to its more homogeneous data sources, such as generalist scientific publishers (Elsevier, Springer) and bibliographic services (CrossRef). However, the different taxonomies used by MAS and GSC have to be considered when analyzing these discrepancies. For example, the medicine class in MAS is split up into medicine, health professions, and neurosciences in GSC. If these three categories are combined, then medicine would be the second most popular GSC discipline with 6.9%.

TABLE 1. Distribution of authors by disciplines in GSC and MAS.

Subject classes (GSC) | Authors | Percentage | Categories (MAS) | Authors | Percentage
--- | --- | --- | --- | --- | ---
Computer sciences | 20,122 | 21.4 | Medicine | 5,849,432 | 26.8
Engineering | 5,936 | 6.3 | Biology | 3,034,758 | 13.9
Biochemistry, genetics, and molecular biology | 5,752 | 6.1 | Chemistry | 2,774,794 | 12.7
Mathematics | 5,524 | 5.9 | Engineering | 2,464,999 | 11.3
Physics and astronomy | 5,218 | 5.5 | Computer science | 1,583,061 | 7.2
Agricultural and biological sciences | 5,167 | 5.5 | Physics | 1,560,789 | 7.1
Medicine | 4,687 | 5.0 | Social science | 977,975 | 4.5
Social sciences | 4,403 | 4.7 | Geosciences | 581,715 | 2.7
Multidisciplinary | 4,015 | 4.3 | Arts & humanities | 482,143 | 2.2
Environmental science | 3,875 | 4.1 | Mathematics | 384,713 | 1.8
Total | 94,093 | 100 | Total | 21,857,368 | 100

Table 2 shows the 10 most important research institutions ranked by the number of authors registered in GSC and MAS. Higher education institutions predominate in the case of GSC, with most of them from the United States. Hence the organizations with the most authors are University of California, Berkeley (0.57%), University of Michigan (0.55%), and Harvard University (0.52%). On the contrary, MAS profiles show a completely different distribution in which the institutions with the largest number of scientists are the Russian Academy of Sciences (0.23%), the Chinese Academy of Sciences (0.20%), and Harvard University (0.17%). This distribution in MAS could be due to several data processing mistakes. The first is that MAS creates these profiles from several bibliographic services, which do not unify the different authors’ name variants. Thus, it is very common for the same author to have two or even more different profiles (Figure 1). This is especially true in languages with many possible name variants and different translations, such as Portuguese, Chinese, and Russian. The second is that the low updating rates can cause some authors to show their previous or first addresses when they are already working in other places. For instance, many of the 31,889 profiles from Harvard University (Table 2) could belong to alumni or retired professors, because Harvard now has only 2,100 faculty members (Harvard University, 2012). The third is that MAS also contains data from the 20th and 19th centuries, and it is likely that many of these profiles are outdated or obsolete. A specific crawl of Stanford University’s profiles indicated that 22% of the profiles had been inactive since the year 2000. In spite of these technical problems, this institutional distribution also reflects the emergence of new research powers such as the BRIC countries (Brazil, Russia, India, and China).

TABLE 2. Distribution of authors by affiliations in GSC and MAS.

Affiliation (GSC) | Authors | Percentage | Affiliation (MAS) | Authors | Percentage
--- | --- | --- | --- | --- | ---
University of California, Berkeley | 473 | 0.6 | Russian Academy of Sciences | 44,066 | 0.2
University of Michigan | 462 | 0.6 | Chinese Academy of Sciences | 37,562 | 0.2
Harvard University | 435 | 0.5 | Harvard University | 31,889 | 0.2
University of Washington | 433 | 0.5 | University of São Paulo | 28,763 | 0.2
University of Oxford | 369 | 0.4 | Stanford University | 25,586 | 0.1
University of British Columbia | 368 | 0.4 | ERNET India | 24,305 | 0.1
Stanford University | 353 | 0.4 | University of Toronto | 21,979 | 0.1
Massachusetts Institute of Technology | 325 | 0.4 | Assistance Publique Hôpitaux de Paris | 22,156 | 0.1
University of Minnesota | 314 | 0.4 | National Institutes of Health | 20,940 | 0.1
University of Cambridge | 305 | 0.4 | University of California Berkeley | 20,696 | 0.1
Total | 83,665 | 100 | Total | 19,165,290 | 100

FIG. 1. Example of multiple profiles for the same author (November 15, 2012). [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Profile Comparison

A comparison of the number of documents, citations, and h-index values was made from the 771 profiles that were present at the same time in both GSC and MAS to detect the differences and similarities between the services. The aim was to obtain a relative measure of the proportion of elements included in a group of profiles in both search engines:

$$P_{x,y} = \frac{\sum_{i=1}^{N} \frac{x_i}{y_i}}{N} \cdot 100 \qquad (1)$$

where P_{x,y} is the average proportion of the number of items included in one search engine (x) relative to the number of items in the other search engine (y) for the same profile (i). Summing these proportions and dividing by the total number of profiles (N) produces an average measure of the relative size of a search engine and of its profile completeness with regard to the other search engine. P_{x,y} > 100 (an average ratio above 1) means that profiles in x contain on average more items than those in y, and a value below 100 means the opposite. This indicator avoids the size effect bias that arises when both magnitudes are compared in absolute terms. However, this measurement cannot be understood as an overlapping indicator between search engines, because it does not count how many documents or citations are present in the two services but rather compares the size of one profile in the two search services according to the number of assigned items.
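A minimal sketch of Equation 1 follows, assuming per-profile item counts are held in two parallel lists. The function name `average_proportion`, the choice to skip profiles whose count in y is zero (the text does not say how such cases were handled), and the sample counts are illustrative assumptions.

```python
def average_proportion(x_counts: list[int], y_counts: list[int]) -> float:
    """Average of the per-profile ratios x_i / y_i, expressed as a percentage."""
    if len(x_counts) != len(y_counts):
        raise ValueError("both lists must describe the same profiles")
    # Assumption: profiles with no items in y are skipped to avoid dividing by zero.
    ratios = [x / y for x, y in zip(x_counts, y_counts) if y > 0]
    if not ratios:
        raise ValueError("no profile has items in y")
    return 100 * sum(ratios) / len(ratios)


# Hypothetical document counts for three profiles present in both services.
gsc_docs = [30, 12, 50]
mas_docs = [20, 12, 25]

p_gsc_mas = average_proportion(gsc_docs, mas_docs)
p_mas_gsc = average_proportion(mas_docs, gsc_docs)
print(f"P(GSC, MAS) = {p_gsc_mas:.1f}%  P(MAS, GSC) = {p_mas_gsc:.1f}%")
```

Because the indicator averages ratios profile by profile, the two directions are not reciprocals of each other: with these counts the GSC-to-MAS value is 150.0% while the MAS-to-GSC value is about 72.2%, not 66.7%.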

According to the sample of 771 profiles, the GSC profiles show a higher proportion of items than the MAS ones. Thus, the proportion of GSC with respect to MAS is 158.3% for documents per profile, 327.4% for citations, and 155.8% for h-index values. Conversely, the proportion of MAS with respect to GSC is only 89.5% for documents, 76.7% for citations, and 72.1% for h-index scores per profile.

Figure 2 shows the proportion of documents and citations per profile between GSC and MAS by research disciplines. Disciplines with fewer than five profiles were excluded, such as arts and humanities, environmental sciences, agriculture science, and materials science. GSC has a larger proportion of documents than MAS in every discipline, except for chemistry (MAS = 200%) and medicine (MAS = 190%), in which MAS has more documents per profile. On the other hand, the areas in which GSC has the largest proportion of documents per profile are social sciences (GSC = 390%) and geosciences (GSC = 350%). These results confirm that the data sources of the two services are rather different and thus produce significant differences in the number of documents and citations by disciplines. Whereas MAS is based only on metadata from bibliographic services, GSC considers a wide range of scientific and technical documents from academic websites. The high proportion of social science and geosciences documents in GSC may be caused by books, technical reports, or other scholarly materials. Another candidate reason is the local nature of many of the contributions in these fields, which can be published in local journals not belonging to the mainstream group of international titles. This difference is perhaps most noticeable in citations per profile, in which GSC shows the largest proportion of citations in every discipline, most notably in social sciences (2200%) and physics (930%). These high percentages in all disciplines confirm the wide variety of document types indexed by GSC.

Discussion

A few considerations can be derived from the results concerning both academic search engines. The Microsoft Research Asia product (the location of the laboratory is unrelated to any geographical bias in the sources) is an autonomous citation database built mainly on bibliographic metadata from CrossRef (73%) and records from other scientific publishers such as Elsevier (13%) and Springer (9%). As a consequence, its disciplinary coverage is more balanced and similar to Scopus or WoS because it gathers mostly journal articles and proceedings papers. This standard coverage and the fact that MAS counts on a classification scheme suggest that this search engine can be a suitable tool for disciplinary studies.

FIG. 2. Proportion of documents and citations between GSC and MAS.

MAS data show that the number of documents is clearly smaller than in GSC, a fact that has a strong effect on citation counts and the derived bibliometric indicators (g-index and h-index). However, MAS’s principal limitation is that, after being combined, these bibliographic metadata are not processed to remove duplicates, nor are the affiliations updated when they change. We suspect that many of the 19 million authors are duplicates because a new profile is created whenever an author does not automatically match a previous name. At the same time, 41% of the affiliations seem outdated with respect to GSC, and Table 2 shows a strange distribution of research institutions. These technical faults, along with the unusual mistakes detected by Jacsó (2011), warn against the use of this service as an evaluation tool, mainly in organizational benchmarking and even more for evaluating authors.

GSC is populated not only through contracts with a large number of editors and database providers but also with data obtained from a crawling process that recovers and indexes academic documents on the public web in a fashion similar to any search engine, like its parent product Google. This fact explains why the numbers of documents are significantly higher than in MAS and, therefore, why the number of citations and h-index scores are larger too. Moreover, the GSC harvesting process includes not only journal articles and proceedings papers but all academic material on the web, such as reports, books, presentations, and so on. This goes beyond the formal scientific publishing channels, and it explains why disciplines such as social or computer sciences, which produce a great deal of informal scholarly documents, have achieved a strong academic impact on the web. Kousha and Thelwall (2007) noted this fact when they found that GS citations were more numerous than ISI citations in social sciences and computer science, suggesting that GS is a more comprehensive tool for citation tracking in the social sciences and in high-obsolescence fields. Meho and Yang (2007) also found that GS stands out for its high number of citations because of its elevated coverage of proceedings papers. However, its most important weakness as an evaluation tool is that the profiles are created by the researchers themselves, which introduces a bias in favor of authors strongly linked with the new information technologies and web 2.0 environments (Ortega & Aguillo, 2012), and it may allow the intentional manipulation of indicator values (Delgado López-Cozar, Robinson-Garcia, & Torres-Salinas, 2012). The larger-than-expected proportion of computer science authors in GSC in Table 1 could thus be evidence of this unbalanced subject-matter distribution. Another problem that affects data operations is that GSC does not use any taxonomy (or even preferred language) to express the profiles, and much of this information, such as affiliations and labels, is expressed in free, natural language, so data processing is noticeably more difficult in GSC than in MAS.
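
The extra processing that free-text fields require can be sketched as follows. This is a minimal Python illustration, not the procedure used in this study: the function names and the sample affiliation strings are hypothetical, and the normalization (accent stripping, lowercasing, punctuation removal) is only a crude first pass.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_affiliation(text):
    """Reduce a free-text affiliation to a crude comparison key."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = no_accents.lower()
    # Replace punctuation with spaces, then collapse runs of whitespace.
    cleaned = re.sub(r"[^\w\s]+", " ", lowered)
    return " ".join(cleaned.split())


def group_affiliations(raw_values):
    """Group raw affiliation strings that share the same normalized key."""
    groups = defaultdict(list)
    for value in raw_values:
        groups[normalize_affiliation(value)].append(value)
    return dict(groups)


# Hypothetical free-text variants as they might be typed into a profile.
samples = [
    "Université de Montréal",
    "Universite de Montreal",
    "  UNIVERSITÉ DE MONTRÉAL. ",
    "Univ. de Montréal",
]
for key, variants in group_affiliations(samples).items():
    print(key, "<-", variants)
```

The first three variants collapse to one key, but the abbreviated form remains separate, which suggests why free-text affiliations need further manual or dictionary-based cleaning that a controlled taxonomy would make unnecessary.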

Conclusions

The comparison between MAS and GSC shows that GSC contains more academic materials, which produces more citations than MAS; whereas MAS is a more thematically balanced service, GSC is biased in favor of the computer sciences disciplines. In addition, the restrictive environment of MAS causes low updating rates and the proliferation of multiple profiles of the same author; meanwhile, the openness of GSC favors accuracy of the data but makes data processing more time consuming. In view of these drawbacks and advantages of the two search engines, we conclude that both introduce important improvements to research evaluation, such as a wider coverage of academic materials and therefore broader coverage of research impact; maintenance of public and free, customizable profiles that reinforce personal evaluations; and an autonomous system that makes possible the bibliometric analyses of huge populations. However, their technical limitations, such as duplicate profiles, spurious citations, and possible manipulation, make it advisable to use these citation indices in research evaluation together with other citation databases that permit detection of biases or gaps that could undervalue an institution, discipline, or author. Many of these limitations are probably due to the novelty of these platforms, so it can be expected that future developments will improve these services and that they will become stronger competitors for subscription-based citation databases.

With regard specifically to the suitability of these services for bibliometric research, we conclude that MAS is better for disciplinary studies than for analyses at institutional and individual levels. On the other hand, GSC is a good tool for individual assessment because it draws on a wider variety of documents and citations. However, it is not yet recommended for disciplinary studies because of its unbalanced thematic distribution.

References

Aguillo, I.F. (2012). Is Google Scholar useful for bibliometrics? A webometric analysis. Scientometrics, 91(2), 343–351.

Bakkalbasi, N., Bauer, K., Glover, J., & Wang, L. (2006). Three options for citation tracking: Google Scholar, Scopus and Web of Science. Biomedical Digital Libraries, 3(7). Retrieved from http://www.bio-diglib.com/content/3/1/7

Bar-Ilan, J., Haustein, S., Peters, I., Priem, J., Shema, H., & Terliesner, J. (2012). Beyond citations: Scholars’ visibility on the social web. 17th International Conference on Science and Technology Indicators (p. 14). Montréal: Science-Metrix and OST.

Delgado López-Cozar, E., Robinson-Garcia, N., & Torres-Salinas, D. (2012). Manipulating Google Scholar Citations and Google Scholar metrics: Simple, easy and tempting. EC3 Working Papers 6.

Fenner, M. (2011). Google Scholar Citations, researcher profiles, and why we need an open bibliography. PLoS Blogs. Retrieved from http://blogs.plos.org/mfenner/2011/07/27/google-scholar-citations-researcher-profiles-and-why-we-need-an-open-bibliography/

Google Scholar blog. (2011a). Google Scholar Citations. Retrieved from http://googlescholar.blogspot.com/2011/07/google-scholar-citations.html

JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, June 2014, p. 1155. DOI: 10.1002/asi


Google Scholar blog. (2011b). Google Scholar Citations open to all. Retrieved from http://googlescholar.blogspot.com/2011/11/google-scholar-citations-open-to-all.html

Harvard University. (2012). Harvard at a glance. Retrieved from http://www.harvard.edu/harvard-glance

Harzing, A.W.K., & Van der Wal, R. (2008). Google Scholar as a new source for citation analysis. Ethics in Science and Environmental Politics, 8(1), 61–73.

Hood, W.W., & Wilson, C.S. (2003). Overlap in bibliographic databases. Journal of the American Society for Information Science and Technology, 54(12), 1091–1103.

Huang, Z., & Yuan, B. (2012). Mining Google Scholar Citations: An exploratory study. Lecture Notes in Computer Science, 7389/2012, 182–189.

Jacsó, P. (2008). Google Scholar revisited. Online Information Review, 32(1), 102–114.

Jacsó, P. (2010). Metadata mega mess in Google Scholar. Online Information Review, 34(1), 175–191.

Jacsó, P. (2011). The pros and cons of Microsoft Academic Search from a bibliometric perspective. Online Information Review, 35(6), 983–997.

Jacsó, P. (2012). Google Scholar Author Citation Tracker: Is it too little, too late? Online Information Review, 36(1), 126–141.

Kousha, K., & Thelwall, M. (2007). Google Scholar Citations and Google web-URL citations: A multi-discipline exploratory analysis. Journal of the American Society for Information Science and Technology, 58(7), 1055–1065.

Meho, L.I., & Yang, K. (2007). Impact of data sources on citation counts and rankings of LIS faculty: Web of Science versus Scopus and Google Scholar. Journal of the American Society for Information Science and Technology, 58(13), 2105–2125.

Microsoft. (2011). Academic categories in Microsoft Academic Search. Retrieved from http://social.microsoft.com/Forums/en-US/mas/thread/bf20d54a-ede2-48a9-8bbb-f6c1c1f30429

Microsoft. (2012). Microsoft Academic Search data coverage. Retrieved from http://social.microsoft.com/Forums/en-US/mas/thread/7ed5d49d-65b7-4a2f-9adf-4de9e23ee70e

Ortega, J.L., & Aguillo, I.F. (2012). Science is all in the eye of the beholder: Keyword maps in Google Scholar Citations. Journal of the American Society for Information Science and Technology, 63(12), 2370–2377.

Ortega, J.L., & Aguillo, I.F. (2013). Institutional and country collaboration in an online service of scientific profiles: Google Scholar Citations. Journal of Informetrics, 7(2), 394–403.

Priem, J., Piwowar, H., & Hemminger, B. (2011). Altmetrics in the wild: An exploratory study of impact metrics based on social media. In Metrics 2011: Symposium on Informetric and Scientometric Research, New Orleans, LA.

Scopus. (2008). Subject Area Categories. Retrieved from http://help.scopus.com/robo/projects/schelp/h_subject_categories.htm

Scopus. (2013). Scopus_SciVerse. Retrieved from http://www.info.sciverse.com/scopus

Thomson Reuters. (2012). Web of Science: Fact sheet. Retrieved from http://thomsonreuters.com/content/science/pdf/Web_of_Science_factsheet.pdf

Waldrop, M.M. (2008). Science 2.0. Scientific American, 298, 68–73.

Wikipedia. (2013). Microsoft Academic Search. Retrieved from http://en.wikipedia.org/wiki/Microsoft_Academic_Search

Zhang, S. (2011). Hirsch index and a co-authorship network. Retrieved from http://scenic.princeton.edu/network20q/blog/?p=583
