Genome reannotation: Dealing with the atypical, the ambiguous, and the contrary

Release 3.2 contributors

Kathy Campbell
Lynn Crosby
Beverley Matthews
Andy Schroeder
Brian Bettencourt
Yanmei Huang
Leyla Bayraktaroglu
Pavel Hradecky
Gillian Millburn
Sima Misra
Chris Smith
Eleanor Whitfield
Peili Zhang
Pinglei Zhou

Bottom lines
Annotate generously
  Criteria should not be too stringent
Label the ambiguous and atypical
  Define a “problematic” category
  Use a controlled vocabulary (CV) to describe them
Devise a confidence-rating system or an evidence tally system

Comments for validation flags
Unusual splice
Short CDS
Short intron
Overlaps transposon
Unconventional translation start
Multiphase exon
CDS overlap
Dicistronic
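One way to keep validation comments controlled rather than free text is to store them as an enumeration. The sketch below is only an illustration: the eight terms come from the slide, while the names `ValidationComment` and `parse_comment` are hypothetical and not part of any existing annotation tool.

```python
from enum import Enum


class ValidationComment(Enum):
    """Controlled comments for validation flags (terms taken from the slide)."""

    UNUSUAL_SPLICE = "Unusual splice"
    SHORT_CDS = "Short CDS"
    SHORT_INTRON = "Short intron"
    OVERLAPS_TRANSPOSON = "Overlaps transposon"
    UNCONVENTIONAL_TRANSLATION_START = "Unconventional translation start"
    MULTIPHASE_EXON = "Multiphase exon"
    CDS_OVERLAP = "CDS overlap"
    DICISTRONIC = "Dicistronic"


def parse_comment(text: str) -> ValidationComment:
    """Map a curator's text to a controlled term (hypothetical helper).

    Matching ignores case and surrounding whitespace; anything outside
    the vocabulary raises ValueError instead of being stored as free text.
    """
    wanted = text.strip().lower()
    for term in ValidationComment:
        if term.value.lower() == wanted:
            return term
    raise ValueError(f"not a controlled comment: {text!r}")
```

Rejecting unknown strings at entry time is what makes the comments searchable in bulk later; a new term is added by extending the enumeration, not by typing a new phrase.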

The dubious annotation
Categorized as problematic/provisional
Described using controlled comments
  “Short CDS”
  “Gene prediction only”
  “Possible gene fragment”
Allows capture of the ORF without condoning the gene model

The dubious transcript
Problematic transcript
  “Truncated ORF”
  “Supported by single cDNA”
Controlled comments; distinguish between:
  Truncated ORF
  Short CDS relative to cDNA length (stops throughout; no long ORF)
  Short CDS (previous case)

Annotated, but…
Third transcript classified as problematic
  Can be excluded
  Clearly flagged
Controlled comments
  “Truncated ORF”
  “Supported by single cDNA”
  “Suspect cDNA: possible unspliced intron”

Transcript confidence ratings: data types
cDNA data (complete/partial)
Protein homology/protein domain(s)
Gene prediction
Flagged as problematic
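To make the idea of a confidence rating concrete, the four data types above could be folded into a single label per transcript. The slide does not define the rating levels or their order, so everything in this sketch is an assumption: the labels ("high", "moderate", "low", "unsupported", "problematic"), the precedence of the rules, and the names `TranscriptSupport` and `confidence_rating`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TranscriptSupport:
    """Supporting data for one transcript (hypothetical record layout)."""

    cdna: Optional[str] = None      # "complete", "partial", or None
    protein_support: bool = False   # protein homology or protein domain(s)
    gene_prediction: bool = False
    problematic: bool = False       # flagged as problematic by a curator

    def __post_init__(self) -> None:
        if self.cdna not in (None, "complete", "partial"):
            raise ValueError(f"unknown cDNA support level: {self.cdna!r}")


def confidence_rating(support: TranscriptSupport) -> str:
    """Return an illustrative rating; the levels and order are assumptions."""
    if support.problematic:
        return "problematic"        # the flag overrides any supporting data
    if support.cdna == "complete":
        return "high"
    if support.cdna == "partial" or support.protein_support:
        return "moderate"
    if support.gene_prediction:
        return "low"
    return "unsupported"
```

For example, `confidence_rating(TranscriptSupport(gene_prediction=True))` gives `"low"` under these assumed rules. A single label is easy to filter on, but it hides which data produced it.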

Evidence tally system
Yes/no indication for each different level of supporting data
Flexible and open-ended
Can be dense and nuanced
Users can easily set different combinations of criteria for bulk data sets
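A minimal sketch of how such a tally might be queried: each transcript carries a set of yes/no answers, and a user picks which answers to require or exclude when building a bulk data set. The key names, the transcript identifiers, and the function `select_transcripts` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List

# One tally per transcript: evidence name -> yes/no.
# A missing key is treated as "no", which keeps the scheme open-ended:
# new evidence types can be added without touching older records.
Tally = Dict[str, bool]


def select_transcripts(
    tallies: Dict[str, Tally],
    require: Iterable[str] = (),
    exclude: Iterable[str] = (),
) -> List[str]:
    """Return sorted IDs whose tally has every required and no excluded item."""
    required = tuple(require)
    excluded = tuple(exclude)
    return sorted(
        transcript_id
        for transcript_id, tally in tallies.items()
        if all(tally.get(key, False) for key in required)
        and not any(tally.get(key, False) for key in excluded)
    )


# Made-up example records.
tallies = {
    "tx-A": {"cds_supported_full_length": True, "gene_prediction": True},
    "tx-B": {"gene_prediction": True, "problematic": True},
    "tx-C": {"cds_supported_partial": True, "homologous_protein": True},
}

# A rigorous set: full-length CDS support and not flagged as problematic.
rigorous = select_transcripts(
    tallies, require=["cds_supported_full_length"], exclude=["problematic"]
)

# A generous set: anything not flagged as problematic.
generous = select_transcripts(tallies, exclude=["problematic"])
```

Here `rigorous` contains only `tx-A`, while `generous` contains `tx-A` and `tx-C`. Unlike a single rating, the tally leaves the choice of criteria to whoever assembles the data set.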

Evidence tally: cDNA and EST data
Transcript structure supported
UTRs supported
CDS supported (full-length)
CDS supported (partial)
Transcript overlaps cDNA(s) or EST(s)

Evidence tally: supporting protein data
Homologous proteins
  High scoring of similar length
  Less similar
  Indication of taxonomic range?
Complete protein domain(s) identified

Evidence tally: cont.
Gene prediction(s)
Problematic: [CV]
  Short CDS; possible gene fragment
  Truncated CDS
  Possible pseudogene
  CDS overlap
  etc.

Evidence tally: open-ended
Experimental determination of 5’ end
Northern data
ORFeome data
Microarray expression data
In situ expression data
Protein expression data

Dealing with the messy ones
Allow provisional/problematic annotations
  Minimize biases of current knowledge
  Can exclude from rigorous data sets
Describe and categorize using controlled comments
Fold into a transcript rating system
  Evidence tallying system