Discussion Points for 2nd Pseudogene Call

Mark Gerstein

2005.09.22 11:00 EST

Intersection of Pseudogenes from Three Groups: Original

Havana-Gencode: 167 pseudogenes
Yale: 184 pseudogenes
UCSC retrogenes: 15 expressed (7-8 pseudogenes) + 143 not expressed (all pseudogenes)

[Venn diagram of the three sets. The per-region overlap counts could not be recovered from the extracted text.]

Provided by France.

86 Havana pseudogenes overlap with any Yale pseudogene, and 87 Yale pseudogenes overlap with any Havana pseudogene (the same applies to the retrogenes). This is a global result: in some loci three Havana pseudogenes may overlap with only one Yale pseudogene, while in other loci several Yale pseudogenes overlap with one Havana pseudogene.
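The asymmetry in these counts can be illustrated with a minimal sketch. The coordinates are invented for illustration only, and `overlaps` and `count_overlapping` are hypothetical helpers, not part of any group's pipeline.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base (closed coordinates)."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def count_overlapping(query, reference):
    """Number of query intervals that overlap any reference interval."""
    return sum(any(overlaps(q, r) for r in reference) for q in query)

# Invented toy loci, not real ENCODE coordinates.
havana = [("chr1", 100, 200), ("chr1", 250, 300), ("chr1", 350, 400), ("chr2", 10, 50)]
yale = [("chr1", 90, 410), ("chr2", 40, 60), ("chr2", 500, 600)]

# Three Havana calls fall inside one long Yale call, so the two counts differ.
havana_hits = count_overlapping(havana, yale)
yale_hits = count_overlapping(yale, havana)
print(havana_hits, yale_hits)
```

Because each count is taken from one set's point of view, the two numbers need not match even though they describe the same overlaps.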

Intersection of Pseudogenes from 4 Groups: Updated

http://pseudogene.org/ENCODE/cross-ref

UCSC retrogenes: 146 not expressed

[Venn diagram of the Havana, Yale, and UCSC sets. Region counts, with GIS pseudogenes in parentheses: 82 (34), 17 (7), 33 (1), 15 (1), 14 (2), 16 (0), 52 (2). The assignment of counts to regions could not be recovered from the extracted text.]

• The numbers in parentheses are pseudogenes from GIS.
• All from http://pseudogene.org/ENCODE/cross-ref
• Pseudo-exons were merged to form pseudogenes and used for this comparison (now a pseudogene has only a single start and end)
• Strand information is ignored
• There are a total of 229 pseudogenes in the union
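A minimal sketch of the preprocessing described in these notes: pseudo-exons are collapsed to one span per pseudogene, strand is ignored, and overlapping spans are counted once in the union. The record layout, the IDs, and the single-linkage clustering rule are assumptions for illustration; the slides do not specify how the union of 229 was actually computed.

```python
from collections import defaultdict

def merge_pseudo_exons(exons):
    """Collapse pseudo-exons into one (chrom, start, end) span per pseudogene ID."""
    spans = {}
    for gene_id, chrom, start, end, _strand in exons:  # strand is deliberately ignored
        if gene_id in spans:
            c, s, e = spans[gene_id]
            spans[gene_id] = (c, min(s, start), max(e, end))
        else:
            spans[gene_id] = (chrom, start, end)
    return spans

def union_size(spans):
    """Count loci in the union: overlapping spans on one chromosome form a single locus."""
    by_chrom = defaultdict(list)
    for chrom, start, end in spans:
        by_chrom[chrom].append((start, end))
    loci = 0
    for intervals in by_chrom.values():
        intervals.sort()
        current_end = None
        for start, end in intervals:
            if current_end is None or start > current_end:
                loci += 1              # no overlap with the running locus: start a new one
                current_end = end
            else:
                current_end = max(current_end, end)
    return loci

# Hypothetical records: (pseudogene ID, chromosome, start, end, strand).
exons = [
    ("yale_1", "chr1", 100, 150, "+"),
    ("yale_1", "chr1", 180, 240, "+"),
    ("havana_1", "chr1", 120, 260, "-"),
    ("ucsc_1", "chr2", 10, 90, "+"),
]
merged = merge_pseudo_exons(exons)
n_union = union_size(merged.values())
print(merged, n_union)
```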




Intersection of Pseudogenes from 4 Groups: Non-processed Consensus

                     GENCODE Processed   GENCODE Non-Processed
Yale Processed       7 / 8               5 / 5
Yale Non-Processed   4 / 4               39 / 37

Rough agreement now is:

82 + 52 - 7 = 127 from 229 total

What to do with the remaining 102?
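A quick check of this arithmetic, assuming the seven region counts listed for the updated comparison (82, 17, 33, 15, 14, 16, 52) partition the union:

```python
# Venn region counts from the updated comparison (GIS counts in parentheses omitted).
regions = [82, 17, 33, 15, 14, 16, 52]

union_total = sum(regions)          # all pseudogenes called by any group
agreed = 82 + 52 - 7                # rough agreement as stated in the notes
remaining = union_total - agreed    # pseudogenes still lacking a consensus

print(union_total, agreed, remaining)  # 229 127 102
```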

How to Pick Pseudogenes for RT-PCR?

• Start with the intersection 127
• Duplicated vs. processed: how many of each? (2:1?)
• Rank pseudogenes:
– By likelihood to be transcribed according to ENCODE evidence
  • ditag, then CAGE, then tiling array
– By their uniqueness in genome
  • Good primers
  • Non cross-hybridizing probes
• How to get a consistent rank?
• Who will do RT-PCR?
• What coordinates to use?
• (Ignore 1 processed pseudogene already being sequenced by GIS group.)
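One way to get a consistent rank is a fixed lexicographic sort key, sketched below. The `Candidate` fields, the `genome_copies` uniqueness measure, and the example records are all hypothetical; only the evidence order (ditag, then CAGE, then tiling array) comes from the notes.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    ditag: bool         # supported by ditag evidence
    cage: bool          # supported by CAGE evidence
    tiling: bool        # supported by tiling-array evidence
    genome_copies: int  # hypothetical uniqueness measure: near-identical copies in the genome

def rank_key(c):
    # Tuples compare element by element and False sorts before True, so each
    # evidence flag is negated to put supported candidates first. Fewer genome
    # copies (more unique) ranks higher; the name breaks remaining ties.
    return (not c.ditag, not c.cage, not c.tiling, c.genome_copies, c.name)

# Invented example records.
candidates = [
    Candidate("A", ditag=False, cage=True, tiling=True, genome_copies=1),
    Candidate("B", ditag=True, cage=False, tiling=False, genome_copies=3),
    Candidate("C", ditag=True, cage=False, tiling=False, genome_copies=1),
    Candidate("D", ditag=False, cage=False, tiling=True, genome_copies=1),
]
ranked = [c.name for c in sorted(candidates, key=rank_key)]
print(ranked)  # ['C', 'B', 'A', 'D']
```

Because the key is deterministic, every group applying it to the same evidence table would obtain the same order.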

How to generate a consensus for remaining 102 pseudogenes?

• Stick with the intersection 127
• Develop consistent criteria for identifying pseudogenes and apply them uniformly to ENCODE
– E.g. protein matches with disablements found from a pipeline
– Ignores tricky cases flagged by manual annotation
• Do a simple union of UCSC, Havana & Yale, giving 229
– GIS is a subset of other 3
– Describe pseudogenes as being identified by multiple approaches and then explicitly flag each group’s unique ones in final annotation
– Easy but perhaps biases stats
• Do a qualified union
– Allow each group to “question” particular pseudogenes in another’s set
– Send questions around and then have a call to sort out differences
– Need a way to arbitrate
– E.g. we could demand an obvious disablement
– We might learn something!
• How do we represent this in the browser & in stats?
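The simple-union option, with each group's unique calls flagged, could be sketched as follows. The coordinates are invented, and the single-linkage clustering rule (any overlap joins a locus) is an assumption rather than an agreed procedure.

```python
def cluster_calls(calls):
    """Group overlapping (group, chrom, start, end) calls into loci, ignoring strand."""
    loci = []
    for group, chrom, start, end in sorted(calls, key=lambda c: (c[1], c[2], c[3])):
        last = loci[-1] if loci else None
        if last and last["chrom"] == chrom and start <= last["end"]:
            # Overlaps the running locus: extend it and record the supporting group.
            last["end"] = max(last["end"], end)
            last["groups"].add(group)
        else:
            loci.append({"chrom": chrom, "start": start, "end": end, "groups": {group}})
    return loci

# Invented toy calls, not real annotations.
calls = [
    ("Havana", "chr1", 100, 200),
    ("Yale", "chr1", 150, 260),
    ("UCSC", "chr1", 240, 300),
    ("Yale", "chr2", 10, 80),
    ("Havana", "chr3", 5, 50),
    ("UCSC", "chr3", 40, 90),
]
loci = cluster_calls(calls)
unique = [locus for locus in loci if len(locus["groups"]) == 1]  # flagged as single-group calls
for locus in loci:
    print(locus["chrom"], locus["start"], locus["end"], sorted(locus["groups"]))
```

Keeping the set of supporting groups on each locus would also give a natural field to display in the browser and to stratify statistics by.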

Once we have consensus, how to agree on pseudogene boundaries?

• Keep each group’s boundaries unchanged
– If pseudogenes overlap, take largest region (union) or smallest
• Develop uniform criteria for assigning pseudogene boundaries and apply them to each of the pseudogenes in the consensus set
– Could just take each pseudogene in the consensus and have one group realign it against parent
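The two boundary choices for overlapping calls can be sketched as below. Reading "smallest" as the region shared by all groups (the intersection) is an assumption, and `consensus_boundaries` is a hypothetical helper.

```python
def consensus_boundaries(spans, mode="union"):
    """spans: list of (start, end), one per group, for a single overlapping locus."""
    starts = [s for s, _ in spans]
    ends = [e for _, e in spans]
    if mode == "union":
        # Largest region: from the leftmost start to the rightmost end.
        return min(starts), max(ends)
    if mode == "intersection":
        # Smallest region: the part covered by every group's call.
        start, end = max(starts), min(ends)
        if start > end:
            raise ValueError("spans do not share a common region")
        return start, end
    raise ValueError(f"unknown mode: {mode}")

# Invented spans for one locus as called by three groups.
spans = [(100, 200), (150, 260), (120, 240)]
largest = consensus_boundaries(spans, "union")
smallest = consensus_boundaries(spans, "intersection")
print(largest, smallest)  # (100, 260) (150, 200)
```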
