Finding Nobel prize window by PageRank
FUJITA Yuji, Turnstone Research Inst., Nihon Univ.
The Window
Citation count vs. PageRank
Graph and Network
Graph theory: part of mathematics
Network science: interdisciplinary study of
Graph theory
Physics
Social science
Informatics
particular topics from finance, biology, ...
Graph theory
Dates back to the 1730s
Objectives:
Lower-dimensional topological structure
Combinatorial and topological studies
Topics:
Four colour theorem
Invariants
From Wikipedia
Network science
Objectives:
Statistics and dynamics
Social, Financial, Technological themes
Topics:
Six degrees of separation
Scale-free networks
PageRank
Bibliometrics
Quantitative evaluation of (academic) documents
Conventional approach: number of citations
Citation network:
Node: paper
Edge: citation
directed graph
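A minimal sketch of the citation network as a directed graph, assuming the data arrives as (citing paper, cited paper) pairs. The paper IDs and the helper names `build_citation_graph` and `citation_counts` are hypothetical. The in-degree of a node is the conventional citation count.

```python
from collections import defaultdict

# Hypothetical toy data: (citing paper, cited paper) pairs.
citations = [("B", "A"), ("C", "A"), ("D", "A"), ("D", "C"), ("E", "C")]


def build_citation_graph(pairs):
    """Return {paper: set of papers it cites} for a directed citation graph."""
    graph = defaultdict(set)
    for citing, cited in pairs:
        graph[citing].add(cited)
        graph[cited]  # touch the key so papers that cite nothing are still nodes
    return dict(graph)


def citation_counts(graph):
    """Count incoming edges (citations received) for every paper."""
    counts = {paper: 0 for paper in graph}
    for cited_set in graph.values():
        for cited in cited_set:
            counts[cited] += 1
    return counts


graph = build_citation_graph(citations)
counts = citation_counts(graph)
```

Here `counts` ranks papers by raw citations only; it says nothing about who the citing papers are.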
A more faithful metric: PageRank
Citation vs PageRank
The most-cited papers do not have the best PageRank scores
Top articles
Clinical Medicine: Effects of an angiotensin-converting-enzyme inhibitor, ramipril, on cardiovascular events in high-risk patients
Clinical Medicine: Vitamin E supplementation and cardiovascular events in high-risk patients
Immunology: Cytotoxic T lymphocyte-associated antigen 4 plays an essential role in the function of CD25(+)CD4(+) regulatory cells that control intestinal inflammation
Immunology: Immunologic self-tolerance maintained by CD25(+)CD4(+) regulatory T cells constitutively expressing cytotoxic T lymphocyte-associated antigen 4
Physics: String theory and noncommutative geometry
Physics: Large-N limit of non-commutative gauge theories
Molecular Biology & Genetics: Smac, a mitochondrial protein that promotes cytochrome c-dependent caspase activation by eliminating IAP inhibition
Molecular Biology & Genetics: Identification of DIABLO, a mammalian protein that promotes apoptosis by binding to and antagonizing IAP proteins
Molecular Biology & Genetics: Systematic variation in gene expression patterns in human cancer cell lines
Molecular Biology & Genetics: A gene expression database for the molecular pharmacology of cancer
The Protein Data Bank
Effects of an angiotensin-converting-enzyme inhibitor, ramipril, on cardiovascular events in high-risk patients
The genome sequence of Drosophila melanogaster
String theory and noncommutative geometry
The complete atomic structure of the large ribosomal subunit at 2.4 angstrom resolution
Smac, a mitochondrial protein that promotes cytochrome c-dependent caspase activation by eliminating IAP inhibition
Identification of DIABLO, a mammalian protein that promotes apoptosis by binding to and antagonizing IAP proteins
The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000
Class switch recombination and hypermutation require activation-induced cytidine deaminase (AID), a potential RNA editing enzyme
Cytotoxic T lymphocyte-associated antigen 4 plays an essential role in the function of CD25(+)CD4(+) regulatory cells that control intestinal inflammation
Graph representation
Embedding: drawing on sphere/space
Matrix
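A small sketch of the matrix representation, assuming the convention that row i marks the papers citing paper i and column j marks the papers cited by paper j. The function name `adjacency_matrix` and the toy data are hypothetical.

```python
def adjacency_matrix(papers, citations):
    """Build a 0/1 matrix with adj[i][j] = 1 when paper j cites paper i."""
    index = {paper: k for k, paper in enumerate(papers)}
    n = len(papers)
    adj = [[0] * n for _ in range(n)]
    for citing, cited in citations:
        adj[index[cited]][index[citing]] = 1
    return adj


# Hypothetical toy data: (citing paper, cited paper) pairs.
papers = ["A", "B", "C", "D", "E"]
citations = [("B", "A"), ("C", "A"), ("D", "A"), ("D", "C"), ("E", "C")]
adj = adjacency_matrix(papers, citations)

# Row sums give citations received; column sums give references made.
received = [sum(row) for row in adj]
made = [sum(adj[i][j] for i in range(len(papers))) for j in range(len(papers))]
```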
PageRank overview
A link from a highly ranked node matters more than raw degree as a score
But how can this be computed? The process can get stuck in a loop.
Figure from The PageRank Citation Ranking: Bringing Order to the Web
Finite state Markov chain
Node: state
Transition matrix: moving along an edge
Row: linked (cited) vector
Column: link (cite) vector
The probability vector is updated by multiplying it by the transition matrix
Steady state gives PageRank
Some Markov chains have a unique steady state
Steady state is given by an eigenvector: a vector x such that Mx = ax (with a = 1 for the steady state)
Eigenvectors are computed by linear algebra: the methods are widely known
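A minimal sketch of finding the steady state by repeated multiplication (power iteration), assuming a column-stochastic matrix M where M[i][j] is the probability of moving from state j to state i. The 3-state matrix and the name `steady_state` are made up for illustration.

```python
def steady_state(M, tol=1e-12, max_iter=1000):
    """Iterate x <- M x from the uniform vector until x stops changing."""
    n = len(M)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        y = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
        if sum(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x


# Hypothetical 3-state chain; every column sums to 1.
M = [
    [0.50, 0.25, 0.00],
    [0.50, 0.50, 0.50],
    [0.00, 0.25, 0.50],
]
pi = steady_state(M)
```

For this chain the iteration settles at approximately (0.25, 0.5, 0.25), which satisfies Mx = x.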
Why does PageRank work?
Not all citations are equally significant
Fewer citations can signal even greater work:
Fundamental work is often not cited directly
Academic cascade
Meanings of citation
Brainchild
History
Respect
Identity
Something more than a tag
To reach the top
Many great children: each child gives birth to many works
= great scientific achievement
Limitations
Prof. Yamanaka's work (Cell, 2006) has a poor PageRank score, which is a shame, to say the least.
Spam issues: less serious than with naive citation counts
To practice
Get citation data: commercial product or scraping
Transition matrix: random surfer model
Iterate the matrix-vector product: sparse matrix operations
Data
Thomson Reuters, Elsevier, ...
Scrape the web (arXiv, ...)
An ordinary SQL server can hold the data
NLP required
Transition matrix
Not every transition matrix has a unique steady-state eigenvector
Random surfer model: random jumps make the graph connected and let the surfer escape loops
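A sketch of the random surfer model as it is usually written: with probability `damping` the surfer follows a citation, otherwise jumps to a uniformly chosen paper. The damping value 0.85, the uniform handling of papers with no outgoing citations, the function name `pagerank` and the toy graph are assumptions, not taken from the slides.

```python
def pagerank(links, damping=0.85, tol=1e-12, max_iter=1000):
    """links: {node: list of nodes it links to (cites)}. Returns {node: score}."""
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Score held by nodes without outgoing links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        base = (1.0 - damping) / n + damping * dangling / n
        new = {v: base for v in nodes}
        for v in nodes:
            targets = links.get(v, ())
            for t in targets:
                new[t] += damping * rank[v] / len(targets)
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical toy citation graph: B, C, D cite A; D and E cite C.
links = {"B": ["A"], "C": ["A"], "D": ["A", "C"], "E": ["C"]}
rank = pagerank(links)
```

The uniform jump term makes every paper reachable from every other, so the iteration converges to a single score vector regardless of the starting point.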
Adaptation to papers
An older paper cannot cite a newer one:
Non-uniform random surfing
Adjust decay rate
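One possible reading of non-uniform random surfing, sketched under assumptions: the random jump lands on a paper with probability that decays exponentially with the paper's age, so recent papers are favoured as starting points. The decay form exp(-age / tau), the value tau = 5 years, the function names and the toy data are all hypothetical; the slides do not state which scheme was used.

```python
import math


def age_weighted_jump(pub_year, now, tau=5.0):
    """Jump distribution proportional to exp(-age / tau); tau is the decay rate knob."""
    weights = {p: math.exp(-(now - y) / tau) for p, y in pub_year.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


def pagerank_with_jump(links, jump, damping=0.85, tol=1e-12, max_iter=1000):
    """PageRank where random jumps (and dangling mass) follow `jump`."""
    nodes = list(jump)
    rank = dict(jump)
    for _ in range(max_iter):
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        restart = (1.0 - damping) + damping * dangling
        new = {v: restart * jump[v] for v in nodes}
        for v in nodes:
            targets = links.get(v, ())
            for t in targets:
                new[t] += damping * rank[v] / len(targets)
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical publication years and citations (newer papers cite older ones).
pub_year = {"A": 2000, "B": 2003, "C": 2003, "D": 2005, "E": 2006}
links = {"B": ["A"], "C": ["A"], "D": ["A", "C"], "E": ["C"]}
jump = age_weighted_jump(pub_year, now=2006)
rank = pagerank_with_jump(links, jump)
```

A smaller `tau` concentrates the jumps on the newest papers; a very large `tau` approaches the uniform random surfer.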
Sparse matrix
Most of the elements are zero
Compressed form reduces space and time
libcsparse: made by UFL (University of Florida) people and others, distributed under the LGPL
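A pure-Python sketch of the compressed idea, assuming a compressed-column layout (column pointers, row indices, values) built from (row, column, value) triplets. It does not call libcsparse; the function names `to_csc` and `csc_matvec` are hypothetical.

```python
def to_csc(n_cols, triplets):
    """Convert a list of (row, col, value) triplets to compressed-column arrays."""
    counts = [0] * n_cols
    for _, j, _ in triplets:
        counts[j] += 1
    col_ptr = [0] * (n_cols + 1)
    for j in range(n_cols):
        col_ptr[j + 1] = col_ptr[j] + counts[j]
    next_slot = col_ptr[:-1]  # slicing copies; tracks the next free slot per column
    row_idx = [0] * len(triplets)
    values = [0.0] * len(triplets)
    for i, j, v in triplets:
        p = next_slot[j]
        row_idx[p] = i
        values[p] = v
        next_slot[j] += 1
    return col_ptr, row_idx, values


def csc_matvec(n_rows, col_ptr, row_idx, values, x):
    """Compute y = A x, touching only the stored non-zero entries."""
    y = [0.0] * n_rows
    for j in range(len(col_ptr) - 1):
        for p in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[p]] += values[p] * x[j]
    return y


# Hypothetical 3x3 transition matrix with 7 non-zero entries.
triplets = [
    (0, 0, 0.5), (1, 0, 0.5),
    (0, 1, 0.25), (1, 1, 0.5), (2, 1, 0.25),
    (1, 2, 0.5), (2, 2, 0.5),
]
col_ptr, row_idx, values = to_csc(3, triplets)
y = csc_matvec(3, col_ptr, row_idx, values, [0.25, 0.5, 0.25])
```

Storage and the cost of one matrix-vector product both scale with the number of citations rather than with the square of the number of papers.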
Reference
L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web
D. Walker, H. Xie, K.-K. Yan, S. Maslov, Ranking Scientific Publications Using a Simple Model of Network Traffic
P. Chen, H. Xie, S. Maslov, S. Redner, Finding Scientific Gems with Google
Hajime BABA, Google - PageRank
Acknowledgment
Mr. Kazuhisa Takei for the Ruby interface to libcsparse via FFI
Dr. Mari Jibu for citation data handling
Dr. Wataru Souma for suggestions and comments on network science
Dr. Yoshi Fujiwara for choosing this topic and for the invitation
Free software developers
About me
2010- Turnstone Research Inst.
2011- Nihon Univ. researcher
2009-2010 finance sector
2007-2009 Network analysis at NiCT
2001-2007 Venture firm CEO
1994-2002 Discrete math graduate student
Ski, climbing, bicycle, art