http://www.eresearch.org.nz/event/eresearch-nz-2013
Transforming a biological database with collaboration
Paul Gardner
June 28, 2013
Paul Gardner Engaging Scientists: eResearch 2013
Today I will be presenting...
- How I inherited a sad and tired biological database
- and transformed it with collaborative editing
- and engaged a community by using Wikipedia.
What is RNA?
- RNA is a fundamental biological molecule, essential for untold biological processes
- New technologies are accelerating the rate of RNA discovery.
- My aim is to build an analog to the Periodic Table for classifying RNA families and motifs, enabling researchers to predict function.
[Figure: consensus secondary-structure diagrams for RNA motifs and families, with a legend for bases and base pairs. Panel labels include GNRA, UNCG, T-loop, U-turn, C-loop, IRE, SECIS, mir-30, mir-9, lin-4, let-7, Y RNA, 6S, 5S, tRNA and RNase P.]
What is Rfam?
- A database of ncRNA alignments and structures
- Used for annotating RNAs in genome sequences, bioinformatic algorithm development and molecular evolutionary analyses

Gardner et al. (2008) Rfam: updates to the RNA families database. Nucleic Acids Research.
How can we keep textual descriptions of RNAs up to date?
AC RF00005
ID tRNA
CC Transfer RNA (tRNA) molecules are approximately 80 nucleotides in
CC length. Their secondary structure includes four short
CC double-helical elements and three loops (D, anti-codon, and T
CC loops). Further hydrogen bonds mediate the characteristic
CC L-shaped molecular structure. tRNAs have two regions of
CC fundamental functional importance: the anti-codon, which is
CC responsible for specific mRNA codon recognition, and the 3’ end,
CC to which the tRNA's corresponding amino acid is attached (by
CC aminoacyl-tRNA synthetases). tRNAs cope with the degeneracy of
CC the genetic code in two manners: having more than one tRNA (with
CC a specific anti-codon) for a particular amino acid; and ’wobble’
CC base-pairing, i.e. permitting non-standard base-pairing at the
CC 3rd anti-codon position.
RN [1]
RM 8256282
RT The tertiary structure of tRNA and the development of the genetic
RT code.
RA Hou YM;
RL Trends Biochem Sci 1993;18:362-364.
RN [2]
RM 9023104
RT tRNAscan-SE: a program for improved detection of transfer RNA genes
RT in genomic sequence.
RA Lowe TM, Eddy SR;
RL Nucleic Acids Res 1997;25:955-964.
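
The record above uses two-letter line tags (AC, ID, CC, and per-reference RN/RM/RT/RA/RL), with long values continued on further lines that repeat the tag. As a rough sketch of how such a record could be read programmatically, the Python below groups lines by tag and collects references. The function name `parse_desc`, the returned dictionary layout, and the abbreviated `SAMPLE` text are illustrative assumptions, not part of Rfam's own tooling.

```python
# Tags that belong to a literature reference (assumed from the record shown above).
REFERENCE_TAGS = {"RM", "RT", "RA", "RL"}


def parse_desc(text):
    """Group tag-prefixed lines into top-level fields and a list of references.

    Hypothetical helper: continuation lines that repeat a tag are joined
    with a single space.
    """
    fields = {}
    references = []
    for raw in text.splitlines():
        tag, _, value = raw.strip().partition(" ")
        value = value.strip()
        if not tag:
            continue  # skip blank lines
        if tag == "RN":
            references.append({})  # "RN [n]" opens a new reference
            continue
        in_reference = tag in REFERENCE_TAGS and references
        target = references[-1] if in_reference else fields
        target[tag] = f"{target[tag]} {value}" if tag in target else value
    return {"fields": fields, "references": references}


# Abbreviated copy of the tRNA record from the slide (only the first lines).
SAMPLE = """\
AC   RF00005
ID   tRNA
CC   Transfer RNA (tRNA) molecules are approximately 80 nucleotides in
CC   length. Their secondary structure includes four short
RN   [1]
RM   8256282
RT   The tertiary structure of tRNA and the development of the genetic
RT   code.
RA   Hou YM;
RL   Trends Biochem Sci 1993;18:362-364.
"""

record = parse_desc(SAMPLE)
print(record["fields"]["AC"], record["fields"]["ID"])
print(record["references"][0]["RT"])
```

Free-text fields such as CC are the part that is hard to keep current by hand, which is the problem the slide title raises.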
This Wikipedia thing looks pretty good!
WikiProject RNA
- The WikiProjects are social corners of Wikipedia for interested parties to discuss themed articles
- Involved in reviewing, ranking and rating articles
- Now rolled into the larger WikiProject Molecular and Cellular Biology
How has the Wikipedia experiment gone?
[Figure: Number of Rfam pages edited. Number of edits (0 to 10000) against year (2007 to 2011), with two series: Total edits (9089) and Vandalism (106).]
Gardner et al. (2011) Rfam: Wikipedia, clans and the "decimal" release. Nucleic Acids Research.
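
To put the two figures on the plot in proportion, the short Python sketch below works out the share of edits that were vandalism. It assumes that 9089 is the "Total edits" count and 106 is the "Vandalism" count, a pairing read off the plot labels rather than stated in the slide text; the function name is illustrative.

```python
# Counts read from the plot labels; the pairing of number and label is an assumption.
total_edits = 9089
vandalism_edits = 106


def vandalism_percentage(vandalism, total):
    """Return vandalism edits as a percentage of all edits."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * vandalism / total


pct = vandalism_percentage(vandalism_edits, total_edits)
print(f"{pct:.2f}% of edits were vandalism")
```

Under that assumption, vandalism accounts for a little over one percent of all edits to the Rfam pages.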
Who are these Wikipedians donating their time?
[Figure: Top 20 Rfam wikiproject editors. Bar chart of number of edits (0 to 1000) per editor, grouped as Bots, Proof Readers and Scientists. Editor labels as extracted: Rfambot, Ppgardne, Citationbot1, WillowW, SmackBot, DOI_bot, Addbot, Alexbateman, Jebus989, JenniferRfm, Zashaw, Rjwilmsi, Qwyrxian, Yobot, RE73, Narayanese, RichFarmbrough, Addshore, Wgscott, MiRroar, RjwilmsiBot, Arcadian, DO11.10, Gortonk, Banus, Drmed36, FrescoBot, Boghog.]
What incentives can we give to Academics?
- Academics love publishing articles
- Introducing the "families track" at RNA Biology (IF: 4.841)
- Publication requirements are an alignment & a Wikipedia article
- 100s of new families have been added thanks to this track
The pros and cons of presenting biological information within Wikipedia
- Private Wiki or Wikipedia?
  - Wikipedia: pros - high profile, lots of editors, top Google hit, ...
  - Wikipedia: cons - need to follow "rules", article granularity, vandalism, ...
  - Private Wiki: pros - granular as you like, your own rules (invitation only vs public), ...
  - Private Wiki: cons - small communities (no Daniel Ramskold), policing, maintenance, ...
- Eleven new wiki-based databases were published in the last database issue of Nucleic Acids Research, including:
  - Directly incorporating Wikipedia: GeneWiki, Pfam, ...
  - Presenting information via wikis: EcoliWiki, SNPedia & WikiPathways, ...

Finn, Gardner, Bateman (2012) Making your database available through Wikipedia: the pros and cons. Nucleic Acids Research.
Wikipedia needs you!
- What is the highest impact contribution academics can make?
- Rule 1: Register an Account
- Rule 2: Learn the Five Pillars
  - ENCYC, NPOV, FREE, RESPECT, NORULES
- Rule 3: Be Bold, but Not Reckless
- Rule 4: Know Your Audience
- Rule 5: Do Not Infringe Copyright
- ...
Who might be reading about your field?
Thanks!
- The Rfam Consortium
- Wikipedians & the long tail!

PPG is supported by a Rutherford Discovery Fellowship from Government funding, administered by the Royal Society of New Zealand.