
Towards a Semantic Citation Index for the German Social Sciences


Towards a Semantic Citation Index for the German Social Sciences

William Dinkel, Philipp Mayr, Frank Sawitzky, Andreas Strotmann*

GESIS – Leibniz-Institut für Sozialwissenschaften, Köln

*alphabetical ordering of names

The Problem

● German sociology / political science research output / impact coverage in the SSCI
– SOLIS: ~1/3 each of books, journal articles, chapters
● Cover ~50% of German researchers' “relevant” output*
– ~1/3 of core journals covered in the SSCI**
– So, ~10% of the literature is indexed there
– Very low percentage of the cited literature indexed in the SSCI***
● * Research rating exercise Sociology, Wissenschaftsrat
● ** compared to SOLIS “class A” journals
● *** Chi (IfQ) study of core German political science journals

The Problem (ctd.)

● Citation culture in the social sciences
– Citations are important
● Perhaps even more so than in the natural sciences
– Some authors are extremely highly cited (Weber, Marx, ...)
● Suspect a very high(!!) Gini coefficient in the distribution
● But: it is their books (not articles) that are highly cited!
– A significant fraction of citations are contrastive
– Datasets (survey results) are often mentioned, but not cited
– Multilingual citation environment

The Need

● German social scientists & the SSCI
– They consider their field inadequately represented in “the” citation index
– But use it quite heavily anyway
● e.g. for research and evaluation
● Survey of sociologists and political scientists, GESIS

The Need (ctd.)

● We need a citation index for the (German) social sciences
– Existing citation indexes are frankly inadequate
● No reasonable effort in sight to resolve this
– Hence, we need to build our own
● If we want to do serious bibliometrics on the social sciences
● If we want to provide a decent social science citation index in, e.g., sociology or political science

The Need (ctd.)

● We need an open semantic citation index for the (German) social sciences
– Incorporate referential semantics into the search engine
● e.g., reliable hyperlinks to referenced articles
● e.g., equivalence or hierarchy relations for translations, aggregations
– Publish referential semantics as linked open data
● Allow other institutions to discover references to their holdings in our database(s)
● Invite them to offer the same service to us, too
– Bibliometrics requires cleaned/disambiguated data!

The Long-Term Goal

A globally distributed open semantic citation index

● Based on digital full-text collections (cooperate with publishers)
– Semi-automatic / computer-aided
– Algorithms + professional indexers (authority files) + crowdsourcing + ...
● Reference extraction (with contexts)
– Enables sentiment analysis (important in the social sciences)
● Reference matching
– Enables referential semantics
● Open reference semantics information exchange
– “<this> paper indexed in our collection cites <that> paper indexed in yours”
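
Such an exchange statement can be expressed as linked open data. A minimal sketch using rdflib and the CiTO vocabulary follows; the record URIs are hypothetical, and CiTO is only one possible vocabulary, not necessarily the one the project will adopt:

    from rdflib import Graph, Namespace, URIRef

    # CiTO (http://purl.org/spar/cito/) offers a generic "cites" property.
    # The record URIs below are hypothetical placeholders.
    CITO = Namespace("http://purl.org/spar/cito/")

    g = Graph()
    g.bind("cito", CITO)

    citing = URIRef("http://example.org/sowiport/record/123")  # "our" record
    cited = URIRef("http://example.org/partner/record/456")    # "their" record

    g.add((citing, CITO.cites, cited))
    print(g.serialize(format="turtle"))

Publishing such triples would let partner institutions harvest them and answer with the reverse links into our holdings.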

Sowiport – German Social Sciences Research Information

● GESIS' Sowiport portal: a single access point to 18 databases, including
– 6 Cambridge Scientific Abstracts databases on the social sciences
– GESIS' own SOLIS (literature) and SOFIS (projects) RISs
– SSOAR (Social Science Open Access Repository) @ GESIS
● Goal: extend Sowiport into a social science citation index
– CSA comes with cited refs for some docs
– SSOAR: extract refs from OA full texts and index them in Sowiport
– Extract links to data sets / surveys used but not cited from full texts
– Crawl Google Scholar for citations to “our” docs
– Link to/from RePEc (and other) data ...

First Steps: National CSA Social Sciences Citation Index

● Cambridge Scientific Abstracts – Social Sciences
– 6 CSA databases offered & run by GESIS
● National research licence for Germany
– Includes >8 million references
● A good starting point
● Recently activated in Sowiport
● ~25-30% of refs found to link to other records
– Using a simple matching algorithm (see the sketch below)
– Biased towards accuracy (>90%), not recall
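
A minimal sketch of what a simple, accuracy-biased matcher can look like, assuming an exact match on a normalized title/year key (field names and normalization rules are illustrative assumptions, not the algorithm actually running in Sowiport):

    import re

    def normalize(title: str) -> str:
        """Lowercase, strip punctuation, collapse whitespace."""
        title = re.sub(r"[^\w\s]", " ", title.lower())
        return re.sub(r"\s+", " ", title).strip()

    def build_index(full_records):
        """Map a (normalized title, year) key to each full record's ID."""
        return {(normalize(r["title"]), r["year"]): r["id"] for r in full_records}

    def match_reference(ref, index):
        """Return the matching record ID or None; unmatched references simply
        stay unlinked -- precision over recall."""
        return index.get((normalize(ref["title"]), ref["year"]))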

First Steps: CSA Reference Matching

Reference matching is much(!) harder in the social sciences

● Social science publication culture
– Books & chapters, and articles
● Published in roughly equal numbers; books are cited most
– Multilingual publishing
● English is not the only language
● Publications may be cited in translation or in different editions
– Broad referencing behaviour
● Large proportion of references to non-source items

=> A first-try high-precision match rate of ~25-30% is an excellent result
● Close to the expected rate of references to journal articles

CSA References in GESIS' Sowiport Database

● Each full record contains “references” and “cited-by” information
– Some with actionable links to full records
● Combines the WoS/Scopus and Google Scholar approaches to citation index construction

First Steps: Citation Extraction

● SSOAR full texts
– First successful experiments to extract references from full text
● Based on RePEc's ParsCit (see the sketch below)
● Extended to German citation styles
– First successful experiments to identify acknowledgements of large surveys in the text
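
For orientation, ParsCit is usually driven from the command line; a minimal sketch of wrapping such a call from Python (the input file name is hypothetical, and any adaptations for German citation styles would live in the extraction models rather than in this wrapper):

    import subprocess

    # Run ParsCit's citeExtract.pl in citation-extraction mode on a plain-text
    # document and capture the XML it prints to stdout.
    result = subprocess.run(
        ["citeExtract.pl", "-m", "extract_citations", "ssoar_document.txt"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)  # XML listing of the extracted references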

Next Steps: “Haus der Sozialwissenschaften”

● Goal: a digital special collection for German social scientists
– Digital access to the full literature in one place
● Large parts unfortunately only accessible in-house
● Collect existing digital versions from “all” sources
● Digitize “important” literature where necessary
● Full text of literature, survey data, project descriptions ...
● Joint DFG application with the Sondersammelgebiet Sozialwissenschaften, Univ.- & Stadt-Bibl. Köln

Next Step: “GESIS Application Laboratory Web 3.0”

● Full-text collection and processing results available in toto to visiting researchers
– Social scientists
– Computer scientists
– Computational linguists
– Bibliometricians: you are invited!
● Upgrade the database
– e.g. disambiguation of authors, institutions, titles
– e.g. incorporation of external authority files / semantic web

Experiment: E-Traces

● Goal: tracking ideas through the sociology literature (“text re-use”)
– Experiment (ongoing): attempt to categorize citation contexts as positive/neutral/negative (sentiment analysis)
– BMBF-funded project with U Leipzig, U Göttingen
● Long-term use: identify negative citations and contrastive co-citations for the social science citation index

Summary

● For GESIS' core covered social sciences (German sociology, political science), traditional citation indexes are inadequate
● and Google Scholar only provides “cited by” info
● Yet GESIS' core audience uses them
● and complains about their inadequacies
● Bibliometrics requires an adequate citation index for reliable results (given typical distributions)
● but no improvements in sight for the classic indexes
● Therefore, we need to build our own
● and we have the expertise at GESIS to succeed where others have failed
● and we have taken the first few steps in this direction

Summary (ctd.)

● In the long run, we would like
– A citation index that is
● Semantic (with explicit referential semantics)
● Distributed (each institution builds its own)
● Open (each institution shares semantics as LOD)
● Global (implemented worldwide)
● Cooperative (indexers + researchers contribute)
● Computer-aided (software to get started, people to improve)
– Based on best practices we hope to develop

Thank You!

Two Models of Citation Graphs

Bipartite (classic IR) model: citing and cited partitions
• Citing nodes: full bibliographic records
• Cited nodes: “keys”, e.g.
– First author name & initials + year of publication + journal key + volume + number + page

Uniform model: interconnected documents
• All nodes: bibliographic records
– Citing nodes: full records
– Cited nodes: mostly simplified records
– “Matched” cited nodes have full records
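
As a rough illustration, the two models could be rendered as data structures like the following sketch (the field choices are assumptions for illustration, not a specification from the slides):

    from dataclasses import dataclass
    from typing import List, Optional

    # Bipartite (classic IR) model: the cited side is only a key.
    @dataclass(frozen=True)
    class CitedKey:
        first_author: str   # family name & initials
        year: int
        journal_key: str
        volume: str
        number: str
        page: str

    # Uniform model: every node is a bibliographic record; cited nodes start
    # out as simplified records and are upgraded once matched.
    @dataclass
    class Record:
        record_id: str
        title: Optional[str] = None
        authors: Optional[List[str]] = None
        year: Optional[int] = None
        is_full_record: bool = False   # True for citing and matched cited nodes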

Citation Matching

• Goal: a citation network
– Unique nodes for documents
• Sub-tasks:
– Match cited references to each other
– Match cited references to full records
– Match full records across databases

Matching Citations to Full Records

“Internal” matching
● Direct access to the full database(s)
● Options: match-key based or algorithmic matching

“External” matching
● Access only via a search engine
● Options: matching against the same or a different database

Scopus Citations

• Cited reference info contains
– Up to 8 author names (family name + initials)
• Including the last author
• Frequently as cited (not standardized or corrected)
– Publication year, title, journal name/vol./no./p.
• Frequently as cited
– Reasonably well parsable, not normalized

Matching Scopus Citations to Scopus Full Records

External matching: Scopus search engine
● “Algorithm”: parse the Scopus reference into subfields, construct complex search queries for the Scopus engine, download the resulting full records, choose the best fit
● High-precision searches: complex searches allowed, many searchable fields
– Improve recall with successively vaguer queries
● Only a small number of downloads is allowed, so many queries are needed to construct a sizable citation index
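
A sketch of the “successively vaguer queries” cascade; the Scopus-style field codes, the search_engine callable and the best-fit chooser are illustrative assumptions, not a working Scopus client:

    def build_queries(ref):
        """Candidate queries, from most to least specific."""
        return [
            f'AUTH({ref["first_author"]}) AND TITLE("{ref["title"]}") AND PUBYEAR IS {ref["year"]}',
            f'TITLE("{ref["title"]}") AND PUBYEAR IS {ref["year"]}',
            f'AUTH({ref["first_author"]}) AND SRCTITLE("{ref["source"]}") AND PUBYEAR IS {ref["year"]}',
        ]

    def match(ref, search_engine, choose_best_fit):
        """Try each query in turn; stop at the first one that returns hits."""
        for query in build_queries(ref):
            hits = search_engine(query)   # download the resulting full records
            if hits:
                return choose_best_fit(ref, hits)
        return None                       # unmatched: keep the raw reference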

Matching Scopus Citations to PubMed Full Records

Cross-DB external match: Scopus/Medline
● “Algorithm”: parse the Scopus reference, construct PubMed batch citation matcher queries, download the matched PubMed(!) records
– Only for biomedical fields
– Result is a citation network of PubMed records, not Scopus
– Requires matching of the Scopus citing records as well
● Either direction (Scopus <-> PubMed)
● Both include PubMed IDs
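
A sketch of what one such batch query could look like against NCBI's ECitMatch E-utility (the example reference is hypothetical; batching of many references, rate limiting and error handling are omitted):

    import requests

    ECITMATCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ecitmatch.cgi"

    # ECitMatch expects pipe-delimited lines:
    # journal|year|volume|first page|author name|your key|
    bdata = "J Cell Biol|2005|171|421|Smith AB|ref-0001|"  # hypothetical reference

    resp = requests.get(ECITMATCH, params={"db": "pubmed", "retmode": "xml",
                                           "bdata": bdata})
    resp.raise_for_status()

    # The service echoes each query line with the matched PMID
    # (or NOT_FOUND) appended.
    print(resp.text)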

Matching Web of Science References to WoS Full Records

WoS cited reference info contains
● First author (last name plus initials)
● Publication year
● Source title code
● Vol./num./page
● More and more frequently, a DOI

No title included!

Matching WoS Cited References to WoS Records

External matching via the WoS web search
● Only small queries supported
– Many downloads necessary
● Crucial search fields not supported (vol., num.)
– Therefore, highly ambiguous results are to be expected
● Requires translation of the source title from code to full form
● Requires algorithmic filtering of the correct hit from a long result list

Matching WoS References to WoS

● Internal matching
– Kompetenzzentrum Bibliometrie has a full local copy of the WoS data
● Experiment: what is a good “match key” to support this?
– Dinkel (2011), ISSI
– Results in error estimates for references
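
A match key along these lines might be built as in the following sketch; the exact key composition and normalization studied in Dinkel (2011) may well differ:

    def match_key(first_author: str, year: int, source_code: str,
                  volume: str, page: str) -> str:
        """Reduce a cited reference (or a full record) to a comparable key."""
        author = first_author.replace(".", "").replace(" ", "").upper()
        return f"{author}|{year}|{source_code.upper()}|{volume}|{page}"

    # Equal keys on the cited-reference and full-record sides count as a match;
    # key collisions and typos drive the error estimates mentioned above.
    print(match_key("Mayr P", 2009, "J INFORMETR", "3", "92"))  # hypothetical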

Building a Citation Index for the Social Sciences: CSA

● Basis: Cambridge Scientific Abstracts (Social Sciences)
– To be extended with additional sources of cited-reference info
● Nationwide licensing scheme for Germany administered at GESIS
● Six CSA/ProQuest databases incorporated into GESIS' “Sowiport” social sciences portal
– Now including ~8.5 million cited references
● No matchings to full records provided by ProQuest
● Early experimental results available on the portal
– Focus on precision, not recall

Citation Matching in CSA

“Algorithm”:
● Internal matching
– However, across multiple CSA databases
● Parse references; construct search queries (Solr)
– exact title and year,
– or fuzzy title and year and ISSN;
– choose the first match
● Favors precision over recall
– Fuzzy match only for journal literature, for example
● Research to be continued!
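
A sketch of the two-step Solr lookup described above; the Solr URL and field names are assumptions for illustration, not the actual Sowiport schema:

    import requests

    SOLR = "http://localhost:8983/solr/sowiport/select"  # hypothetical core

    def solr_query(q, rows=1):
        resp = requests.get(SOLR, params={"q": q, "rows": rows, "wt": "json"})
        resp.raise_for_status()
        return resp.json()["response"]["docs"]

    def match_reference(ref):
        # Step 1: exact (phrase) title plus year -- favors precision.
        hits = solr_query(f'title_exact:"{ref["title"]}" AND year:{ref["year"]}')
        if hits:
            return hits[0]
        # Step 2: fuzzier title terms, anchored by year and ISSN
        # (journal literature only); take the first match.
        if ref.get("issn"):
            hits = solr_query(f'title:({ref["title"]}) AND year:{ref["year"]} '
                              f'AND issn:"{ref["issn"]}"')
            if hits:
                return hits[0]
        return None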

Experiments – Datasets

Caveat
● The Scopus/PubMed and WoS experiments were run on the stem cell research field (biomedical area)
– <100k citing docs, ~1 million references
– >95% of refs are to journal articles
● The CSA experiment was run on the social sciences databases
– ~1 million full records, ~10 million references
● Only recent records contain refs
● Many(!!) refs to non-journal items

Some Rough Numbers

● Scopus ↔ PubMed full record matching
– >95% match rate
● Scopus references → Scopus/PubMed full records
– ~90% match rate “exact” + ~5% fuzzy matches
– ~1% false positives needed to be filtered out
● WoS references → WoS full records
– ~90% match rate
– >>50% false positives needed to be filtered out
● CSA references → CSA full records
– ~30% match rate
– ~1% false positives

CSA reference information

● Fields: citing ID, reference ID, authors, title, year, publisher, source title/num./vol./p., ISSN
– The format changes over time, though
● Mostly parsed automatically, as fields are frequently mis-assigned
● Example (book):

<CI>200601317</CI><CA>Voice UK</CA>
<CT>No More Abuse.</CT><CY>2000</CY>
<CZ>Derby: Voice UK</CZ>
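
Such tagged reference strings can be pulled apart with a small parser. A sketch follows; the tag-to-field mapping is inferred from this single example and will not cover every format variant:

    import re

    # Tag-to-field mapping inferred from the example above (an assumption).
    TAG_TO_FIELD = {"CI": "id", "CA": "authors", "CT": "title",
                    "CY": "year", "CZ": "source"}

    def parse_csa_reference(raw: str) -> dict:
        fields = {}
        for tag, value in re.findall(r"<(\w{2})>(.*?)</\1>", raw, flags=re.S):
            fields[TAG_TO_FIELD.get(tag, tag)] = value.strip()
        return fields

    example = ("<CI>200601317</CI><CA>Voice UK</CA>"
               "<CT>No More Abuse.</CT><CY>2000</CY>"
               "<CZ>Derby: Voice UK</CZ>")
    print(parse_csa_reference(example))
    # {'id': '200601317', 'authors': 'Voice UK', 'title': 'No More Abuse.',
    #  'year': '2000', 'source': 'Derby: Voice UK'}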

Discussion

● Plenty of research opportunities to improve the matching of non-journal literature references to source records
– e.g. to GESIS' own SOLIS / SOFIS / SSOAR databases
– e.g. by crawling Google Scholar for reference links
– You are invited to try your hand at this, too!
● See the “GESIS Application Laboratory” slide above