Regular Meeting December 22, 2008 Mark Borodovsky Ivan Antonov


Page 1: Regular Meeting December 22, 2008

Regular Meeting

December 22, 2008

Mark Borodovsky, Ivan Antonov

Page 2: Regular Meeting December 22, 2008

11/6/2008 GATech 2

Topics

1. What has been done

2. FSMark HMM implementation

3. Answers to the questions from the previous meeting

4. Future work

Page 3: Regular Meeting December 22, 2008


What has been done

• The HMM implementation in FSMark has been changed

• Some questions from the previous meeting have been answered

Page 4: Regular Meeting December 22, 2008

FSMark HMM implementation

Page 5: Regular Meeting December 22, 2008


Current HMM implementation

• Currently, for a given position i, we look backward at the 2 preceding nucleotides instead of looking forward

• FSMark starts examining the sequence only from the 3rd position (i=2), so that the emission string is complete (the results are strange if we start from the 1st position)

• Since FSMark starts with i=2, a gene without a frame shift will have state 2
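The backward-looking scheme described above can be sketched as follows. This is only an illustration, not the FSMark code: it assumes that the emission string at position i consists of the nucleotide at i plus the two nucleotides before it, and the function name `backward_contexts` is hypothetical.

```python
def backward_contexts(seq, lookback=2):
    """Yield (i, context) pairs, where context is seq[i-lookback .. i] (0-based).

    Positions i < lookback are skipped because their context would be
    incomplete; this mirrors starting at the 3rd position (i=2).
    ASSUMPTION: the emission string is the current nucleotide plus the
    `lookback` nucleotides before it.
    """
    for i in range(lookback, len(seq)):
        yield i, seq[i - lookback:i + 1]


# Toy sequence, made up for illustration only.
contexts = list(backward_contexts("ATGCGT"))
print(contexts)
```

The first emitted pair has i=2, so every context is a complete 3-letter string and no special handling is needed at the start of the sequence.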

Page 6: Regular Meeting December 22, 2008


FSMark prediction depends on FS letter

• A test was run on a sample gene by inserting each of the four letters in the middle of the gene. The FSMark-GM hmm_def file was used.

FS letter    FSMark prediction
A            Gene overlap
C            Frame shift
G            Frame shift
T            Frame shift
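A minimal sketch of how such test sequences could be generated is given below. The helper `insert_in_middle` and the toy sequence are hypothetical; the actual sample gene and the exact insertion position used in the test are not stated on the slide.

```python
def insert_in_middle(gene, letter):
    """Return a copy of `gene` with `letter` inserted at its midpoint."""
    mid = len(gene) // 2
    return gene[:mid] + letter + gene[mid:]


# ASSUMPTION: a made-up toy sequence, not the gene used in the test.
sample_gene = "ATGAAACCCGGGTTTTAA"

# One frame-shifted variant per inserted letter.
variants = {letter: insert_in_middle(sample_gene, letter) for letter in "ACGT"}
for letter, variant in variants.items():
    print(letter, variant)
```

Each variant is one nucleotide longer than the original, so everything downstream of the insertion is read in a shifted frame.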

Page 7: Regular Meeting December 22, 2008

Answers to the questions from the previous meeting

Page 8: Regular Meeting December 22, 2008


Control

Genome without frame shifts

GeneMark: 417 overlaps

FSMark-GM: 118 frame shifts

True Positive: 0

False Positive: 118

False Negative: 0

Page 9: Regular Meeting December 22, 2008


Experiment

Genome with frame shifts in 400 genes

GeneMark: 599 overlaps

171 overlaps caused by frame shift

FSMark-GM: 325 frame shifts

True Positive: 113

False Positive: 212

False Negative: 287
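The control and experiment counts can be summarized as sensitivity and precision. The sketch below uses the standard definitions and the counts from the two slides; the function name is hypothetical, and returning `None` for an undefined ratio is a choice made here for illustration.

```python
def sensitivity_and_precision(tp, fp, fn):
    """Return (sensitivity, precision); a ratio is None when its denominator is 0."""
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    precision = tp / (tp + fp) if (tp + fp) else None
    return sensitivity, precision


# Counts taken from the Control and Experiment slides.
control = sensitivity_and_precision(tp=0, fp=118, fn=0)
experiment = sensitivity_and_precision(tp=113, fp=212, fn=287)
print("control:", control)
print("experiment:", experiment)
```

In the control there are no true frame shifts, so sensitivity is undefined there and only the false positive count is informative.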

Page 10: Regular Meeting December 22, 2008


Questions to answer

• Take a look at the distribution of overlap lengths in GeneMark output

• Understand why GeneMark predicts a gene overlap for less than 50% of the genes with frame shifts. There are two possible reasons:

– The short part is missing, i.e. GeneMark predicts only one gene

– GeneMark predicts two genes, but they do not overlap

• Try to understand why we got more False Positives in the experiment than in the control
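For the first question, a distribution of overlap lengths can be computed from predicted gene coordinates roughly as follows. This is a sketch under stated assumptions: coordinates are 1-based and inclusive, only neighbouring genes (after sorting by start) are compared, strand is ignored, and the coordinates are toy values rather than GeneMark output.

```python
from collections import Counter


def overlap_lengths(genes):
    """Return overlap lengths between neighbouring genes.

    `genes` is a list of (start, end) pairs, 1-based and inclusive.
    ASSUMPTION: only genes adjacent in start order are compared.
    """
    ordered = sorted(genes)
    lengths = []
    for (s1, e1), (s2, e2) in zip(ordered, ordered[1:]):
        overlap = min(e1, e2) - s2 + 1
        if overlap > 0:
            lengths.append(overlap)
    return lengths


# Toy coordinates, made up for illustration only.
toy_genes = [(1, 100), (97, 200), (250, 400), (390, 500)]
histogram = Counter(overlap_lengths(toy_genes))
print(sorted(histogram.items()))
```

Parsing the real GeneMark output into `(start, end)` pairs is left out, since the output format is not shown in these slides.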

Page 11: Regular Meeting December 22, 2008


Lengths of all overlaps (genome without FS)

[Figure: chart of the overlap length distribution. Horizontal axis labels run from 4 to 56; vertical axis scale runs from 0 to 300.]

Page 12: Regular Meeting December 22, 2008


Overlaps caused by frame shift

[Figure: chart of the length distribution of overlaps caused by frame shift. Horizontal axis labels run from 8 to 140; vertical axis scale runs from 0 to 25.]

Page 13: Regular Meeting December 22, 2008


GeneMark analysis

• Why does GeneMark so rarely predict overlaps for genes with frame shifts?

• In my GeneMark output there are 357 typical genes (out of 400).

• Am I perhaps using the wrong GeneMark option?

Page 14: Regular Meeting December 22, 2008


GeneMark output statistics

Genome with frame shifts in 400 genes

• 4,388 genes

• 599 gene overlaps

• 335 genes with fs

• 171 overlaps caused by fs

• 22 genes with fs are missing

• fs in 164 genes didn’t cause overlap

• 4 fs caused a new gene downstream of the initial gene

• 163 decreased their lengths
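The counts on this slide can be cross-checked with a little arithmetic. The sketch below assumes one reading of the diagram, namely that the 335 genes with fs split into those that caused an overlap (171) and those that did not (164); the variable names are hypothetical.

```python
# Counts copied from the slide.
genes_with_fs_total = 400      # genes with an inserted frame shift
genes_with_fs_found = 335      # "335 genes with fs"
overlaps_caused_by_fs = 171    # "171 overlaps caused by fs"
fs_without_overlap = 164       # "fs in 164 genes didn't cause overlap"

# ASSUMPTION: the 335 genes split into the two groups above.
assert overlaps_caused_by_fs + fs_without_overlap == genes_with_fs_found

# Share of all 400 frame-shifted genes for which an overlap was predicted.
overlap_fraction = overlaps_caused_by_fs / genes_with_fs_total
print(f"{overlap_fraction:.2%}")  # 42.75%
```

The resulting share is below one half, which matches the earlier observation that GeneMark predicts an overlap for less than 50% of the genes with frame shifts.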

Page 15: Regular Meeting December 22, 2008


Conclusions

• I need to check how to run GeneMark in order to get the same 400 typical genes

• It seems that the small chunk in the shifted frame is too short for GeneMark to predict a new gene

Page 16: Regular Meeting December 22, 2008


Time Table

Date           TODO
Dec 24, Wed    Insensitive zone length analysis for FSMark to determine the lengths of zones 1 and 3
2009           Apply FSMark-GM to 3 typical genomes using the zone 1 and 3 lengths found