1
Parsing
• Analyze text: split it into meaningful units (tokens)
• Extract relevant information, disregard irrelevant information
• ‘Meaningful’ and ‘relevant’ depend on application: what are we looking for?
2
Blast
• Program package for finding similarities between biological sequences
• blastn compares DNA sequences with DNA sequences
• Input:
  – Fasta file with query sequences
  – Formatted Fasta file with database sequences
  – Sensitivity parameter (and more)
• Output:
  – Result of comparing each query to each database sequence
3
Example run
Query file: arachis.fasta
Database file: arabidopsis_nucleotides.fasta
Format the database: formatdb -i arabidopsis.fasta -p F -o T
Command:
/users/chili/usr/blast-2.2.13/bin/blastall -p blastn -e 0.000000002 -d arabidopsis.fasta -i arachis.fasta -o arachis_arab.bn
4
Example output – query with no match
BLASTN 2.2.6 [Apr-09-2003]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
Query= CL5Contig1 (797 letters)
Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters
Searching..................................................done
***** No hits found ******
..
5
Example output – query with matches
Query= CL69Contig1 (372 letters)
Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters
Searching..................................................done
                                                     Score    E
Sequences producing significant alignments:          (bits) Value
gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis      70   3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis      70   3e-10
gb|AV519674.1|AV519674 AV519674 Arabidopsis            68   1e-09
gb|AV557401.1|AV557401 AV557401 Arabidopsis            42   3e-05
gb|BP670151.1|BP670151 BP670151 RAFL21 Arabidopsis     43   1e-04
..
6
Example output – match alignment
>gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis
 Length = 1009
 Score = 69.9 bits (35), Expect = 3e-10
 Identities = 47/51 (92%)
 Strand = Plus / Plus
Query: 67  gagctattaacaggtaagggtcttttgaagggaacaggcttcttggacttc 117
           ||||||||||||||||| ||||| ||||| |||||||| ||||||||||||
Sbjct: 776 gagctattaacaggtaaaggtctattgaaaggaacagggttcttggacttc 826
..
General form of output:
Repetitions of (query, subject matches, alignments)
7
Extract information from blast output
• Extract the best hit for each query sequence
Query= CL69Contig1 (372 letters)..
                                                     Score    E
Sequences producing significant alignments:          (bits) Value
gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis      70   3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis      70   3e-10
..
8
Algorithm
• Read blast output file line by line
• Introduce two states:
  1. Looking for next query
  2. Looking for hit list
• Return a dictionary mapping each query ID to the ID of its best hit ('none' if no hits were found)
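The two-state loop above can be sketched roughly as follows. This is an illustration, not the course's blastparser.py: the function name parse_best_hits and the state constants are hypothetical, and in the second state the sketch keys on the "No hits found" and "Sequences producing significant alignments" lines instead of the Searching line, which is a simplifying assumption. It accepts any iterable of lines, so an open file can be passed directly.

```python
# Hypothetical names; a sketch of the two-state idea, not the original code.
LOOKING_FOR_QUERY = 1
LOOKING_FOR_HITS = 2


def parse_best_hits(lines):
    """Return {query_id: best_hit_id}; 'none' marks a query without hits."""
    state = LOOKING_FOR_QUERY
    best_hits = {}
    query_id = None
    seen_header = False  # True once the hit-list header line has been read
    for line in lines:
        line = line.strip()
        if state == LOOKING_FOR_QUERY:
            if line.startswith("Query="):
                query_id = line.split()[1]
                seen_header = False
                state = LOOKING_FOR_HITS
        else:  # state == LOOKING_FOR_HITS
            if "No hits found" in line:
                best_hits[query_id] = "none"
                state = LOOKING_FOR_QUERY
            elif line.startswith("Sequences producing significant alignments"):
                seen_header = True
            elif seen_header and line:
                # First non-blank line after the header is the best hit.
                best_hits[query_id] = line.split()[0]
                state = LOOKING_FOR_QUERY
    return best_hits
```

Because the loop returns to the first state as soon as the best hit is recorded, the remaining hits and the alignments are skipped without any extra logic.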
9
First state: Looking for next query
Query= CL69Contig1 (372 letters)
Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters
Searching..................................................done
Score E
Sequences producing significant alignments: (bits) Value
gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10
..
Look for a line starting with
Query=
(the = is important!)
10
Why we look for Query= and not just Query
>gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis
 Length = 1009
 Score = 69.9 bits (35), Expect = 3e-10
 Identities = 47/51 (92%)
 Strand = Plus / Plus
Query: 67  gagctattaacaggtaagggtcttttgaagggaacaggcttcttggacttc 117
           ||||||||||||||||| ||||| ||||| |||||||| ||||||||||||
Sbjct: 776 gagctattaacaggtaaaggtctattgaaaggaacagggttcttggacttc 826
..
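A small check makes the point concrete: alignment lines also begin with the word Query, so only the prefix including the equals sign separates a query header from an alignment row. The helper name is_query_header is hypothetical, and the two sample lines are taken from the output shown on these slides.

```python
header_line = "Query= CL69Contig1 (372 letters)"
alignment_line = "Query:67 gagctattaacaggtaagggtcttttgaagggaacaggcttcttggacttc 117"


def is_query_header(line):
    """True only for the line that introduces a new query."""
    return line.startswith("Query=")


# Both lines start with "Query", so the shorter prefix cannot tell them apart.
both_match_short_prefix = (
    header_line.startswith("Query") and alignment_line.startswith("Query")
)
```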
11
Second state: Looking for hit list
Query= CL69Contig1 (372 letters)
Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters
Searching..................................................done
Score E
Sequences producing significant alignments: (bits) Value
gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10
..
Case A: hits were found
12
Case B: no hits were found
Query= CL69Contig1 (372 letters)
Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters
Searching..................................................done
Score E
Sequences producing significant alignments: (bits) Value
gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10
..
Query= CL5Contig1 (797 letters)
Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters
Searching..................................................done
***** No hits found ******
13
Second state: Looking for hit list
Query= CL69Contig1 (372 letters)
Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters
Searching..................................................done
Score E
Sequences producing significant alignments: (bits) Value
gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10
gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10
..
Query= CL5Contig1 (797 letters)
Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters
Searching..................................................done
***** No hits found ******
Look for a line starting with
Searching
Then read a few more lines to distinguish case A/B
14
blastparser.py (part 1)
Find the query ID:
Query= CL69Contig1
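The code from this slide did not survive extraction, so here is one possible way to pull the ID out of a query line, using a regular expression. The names QUERY_RE and parse_query_line are hypothetical, and returning the sequence length as well is an extra added for illustration. The pattern assumes the "(N letters)" suffix looks like the examples on these slides.

```python
import re

# "Query=" followed by the ID, optionally followed by "(N letters)".
QUERY_RE = re.compile(r"^Query=\s*(\S+)(?:\s+\((\d+) letters\))?")


def parse_query_line(line):
    """Return (query_id, length) for a 'Query=' line, or None otherwise."""
    match = QUERY_RE.match(line)
    if match is None:
        return None
    length = int(match.group(2)) if match.group(2) else None
    return match.group(1), length
```

For a line that is already known to start with Query=, the plain string method line.split()[1] gives the same ID with less machinery.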
15
blastparser.py (part 2)
Find the best match ID (case A, hits were found):
Searching..................................................done
Score E
Sequences producing significant alignments: (bits) Value
gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10
Find the best match ID (case B, no hits were found):
Searching..................................................done
***** No hits found ******
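A sketch of the "read a few more lines" step, assuming the caller has just consumed the Searching line from an iterator over the file. The function name read_best_hit is hypothetical, and the original blastparser.py may distinguish the two cases differently.

```python
def read_best_hit(lines):
    """Read on from just after the 'Searching' line.

    Returns the ID of the first listed hit (case A) or 'none' (case B).
    """
    lines = iter(lines)  # make sure nested loops share one position
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if "No hits found" in line:
            return "none"  # case B
        if line.startswith("Sequences producing significant alignments"):
            # Case A: the first non-blank line after the header is the best hit.
            for hit_line in lines:
                if hit_line.strip():
                    return hit_line.split()[0]
    return "none"  # input ended before a hit list was seen
```

Passing the same iterator that the main loop uses means the main loop resumes right after the best-hit line.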
16
Test
>>> from blastparser import parseBlastallOutput
>>> d = parseBlastallOutput("arachis_arab.bn")
>>> d["gi|30419745"]
'gb|BP625785.1|BP625785'
>>> d["gi|30419753"]
'none'
25
Evolutionary tree of life (animal kingdom)
• Huge hierarchy of groups and subgroups
• Each node in the tree has a name and a (possibly empty) list of descendant trees (sons)
Two-pass parsing
Source: The origin and evolution of model organisms, Nature Genetics, Nov. 2002, vol. 3.
26
Abstract data structure to represent a general tree (not necessarily binary)
tree.py
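The contents of tree.py are not reproduced in this text. A minimal sketch of such a structure, assuming only what the slides state (each node has a name and a possibly empty list of sons), could look like this. The class name Tree and the methods add_son and is_leaf are hypothetical.

```python
class Tree:
    """A general tree node: a name plus any number of sons."""

    def __init__(self, name, sons=None):
        self.name = name
        # Copy the list so callers cannot share one list between nodes.
        self.sons = list(sons) if sons else []

    def add_son(self, son):
        """Append a subtree and return it, which allows chained building."""
        self.sons.append(son)
        return son

    def is_leaf(self):
        return not self.sons
```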
27
How can we write a tree to a sequential file?
(File format should be readable by other systems, so we can’t use cPickle)
– A tree is a labeled node containing a (possibly empty) list of other (sub)trees
– Write tree node using start and end tags: <N="Insects"> [sons] </N>
• Formally (context-free grammar):
  T → <N="L"> S </N>
  S → λ | T S
  L → string label
[Figure: example tree with nodes Insects, Beetles, Flies and leaves A, B, C, D, E]
28
Recursive method: string representation of tree
tree.py
First obtain string representation of sons (empty string if no sons) by calling function recursively..
.. then create string with start tag, label, sons’ representation, and end tag
[Figure: the example tree (Insects, Beetles, Flies; leaves A, B, C, D, E)]
.. <N="Beetles"><N="C"></N><N="D"></N><N="E"></N></N> ..
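The recursion can be sketched as below. The original is described as a method in tree.py. This version is a standalone function named tree_to_string with a minimal stand-in Tree class, both hypothetical, so that the example runs on its own.

```python
class Tree:
    """Minimal stand-in for the tree class: a name and a list of sons."""

    def __init__(self, name, sons=None):
        self.name = name
        self.sons = list(sons) if sons else []


def tree_to_string(tree):
    """Serialize a tree as <N="label">...sons...</N>."""
    # First the sons, recursively; this is the empty string for a leaf.
    sons_repr = "".join(tree_to_string(son) for son in tree.sons)
    # Then wrap the sons' representation in this node's start and end tags.
    return '<N="%s">%s</N>' % (tree.name, sons_repr)


beetles = Tree("Beetles", [Tree("C"), Tree("D"), Tree("E")])
beetles_text = tree_to_string(beetles)
```

Here beetles_text is the Beetles fragment shown on the slide.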
29
Larger tree – How can we read a tree from a sequential file?
<N="Terrestrial vertebrates"><N="Synapsida"><N="Therapsida"><N="Mammalia"><N="Marsupialia"><N="Kangaroo"></N><N="Koala"></N></N><N="Eutheria"><N="Primates"><N="Human"></N><N="Gorilla"></N><N="Chimpanzee"></N></N><N="Carnivora"><N="Walrus"></N><N="Wolf"></N></N><N="Proboscidea"><N="Elephant"></N></N></N></N></N></N><N="Reptilia"><N="Diapsida"><N="Archosauromorpha"><N="Tyrannosaurus"></N><N="Penguin"></N><N="Owl"></N></N><N="Lepidosauromorpha"><N="Lizard"></N><N="Snake"></N></N></N><N="Testudines"><N="Turtle"></N></N></N></N>
We need a parser!
part_of_the_tree_of_life.txt
30
Two-pass parsing
Complex parsing is often split into two passes:
1. Lexical analysis
   • Identify and assemble tokens: logical units of text
2. Structural analysis
   • Determine the structural hierarchy of the tokens
In our case, the tokens are the two kinds of tag: start tags such as <N="Koala"> and end tags </N>
31
Lexical analysis
phylogenyparser.py (part 1)
Match either a start tag or an end tag
Define a group containing the start tag’s label
Search text from index pointer
Create token of right type
Move index pointer
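The annotated code itself is missing from this text. The five annotations above suggest a loop like the following sketch, in which the names TAG_RE and tokenize and the ("START", label) / ("END", None) token tuples are assumptions.

```python
import re

# Match either a start tag or an end tag.
# Group 1 holds the start tag's label and is None for an end tag.
TAG_RE = re.compile(r'<N="([^"]*)">|</N>')


def tokenize(text):
    """Lexical analysis: turn the tagged text into a list of tokens."""
    tokens = []
    index = 0
    while True:
        # Search text from the index pointer.
        match = TAG_RE.search(text, index)
        if match is None:
            break
        # Create a token of the right type.
        if match.group(1) is not None:
            tokens.append(("START", match.group(1)))
        else:
            tokens.append(("END", None))
        # Move the index pointer past the tag just read.
        index = match.end()
    return tokens
```

Any text between tags is silently skipped in this sketch. A stricter lexer could report it as an error.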
32
Structural analysis
phylogenyparser.py (part 2)
[Figure: building the tree for .. <N="Kangaroo"></N><N="Koala"></N> .. in numbered steps 1 to 3, tracking current_node and new_node]
Real root will be first son of this node
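One way the second pass could work is sketched here. It assumes tokens in the form ("START", label) and ("END", None) and a node type with name, father and sons attributes. The names Node and build_tree are hypothetical, and the exact numbered steps of the original figure are not recoverable. A dummy node is created first so that the real root is simply its first son.

```python
class Node:
    def __init__(self, name, father=None):
        self.name = name
        self.father = father
        self.sons = []


def build_tree(tokens):
    """Structural analysis: build the tree described by a token list."""
    dummy = Node("dummy")  # the real root will be the first son of this node
    current_node = dummy
    for kind, label in tokens:
        if kind == "START":
            # Attach a new node below the current one, then descend into it.
            new_node = Node(label, father=current_node)
            current_node.sons.append(new_node)
            current_node = new_node
        else:
            if current_node is dummy:
                raise ValueError("end tag without matching start tag")
            # An end tag closes the current node: climb back to its father.
            current_node = current_node.father
    if current_node is not dummy or len(dummy.sons) != 1:
        raise ValueError("unbalanced tags or not exactly one root")
    root = dummy.sons[0]
    root.father = None  # detach the real root from the dummy node
    return root
```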
33
Terrestrial vertebrates
  Synapsida
    Therapsida
      Mammalia
        Marsupialia
          Kangaroo
          Koala
        Eutheria
          Primates
            Human
            Gorilla
            Chimpanzee
          Carnivora
            Walrus
            Wolf
          Proboscidea
            Elephant
  Reptilia
    Diapsida
      Archosauromorpha
        Tyrannosaurus
        Penguin
        Owl
      Lepidosauromorpha
        Lizard
        Snake
    Testudines
      Turtle
34
phylogenyparser.py
Test program
35
Navigating in the tree
Name: Diapsida
Father: Reptilia
Siblings: Testudines
Sons: Archosauromorpha Lepidosauromorpha
(f)ather, (s)on, si(b)ling, (p)rint, (q)uit? b
Number of sibling (0-0)? 0
Name: Testudines
Father: Reptilia
Siblings: Diapsida
Sons: Turtle
(f)ather, (s)on, si(b)ling, (p)rint, (q)uit? p
<N="Testudines"><N="Turtle"></N></N>
Name: Testudines
Father: Reptilia
Siblings: Diapsida
Sons: Turtle
(f)ather, (s)on, si(b)ling, (p)rint, (q)uit? f
Name: Reptilia
Father: Terrestrial vertebrates
Siblings: Synapsida
Sons: Diapsida Testudines
[Figure: the Reptilia subtree (Diapsida with Archosauromorpha and Lepidosauromorpha; Testudines with Turtle)]
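The four-line summary printed at each step can be derived from father and sons links alone. The sketch below assumes a node type with those two attributes. The names Node, siblings and describe are hypothetical, and printing "none" for the root's father is an assumption, since the transcript never shows the root.

```python
class Node:
    def __init__(self, name, father=None):
        self.name = name
        self.father = father
        self.sons = []
        if father is not None:
            father.sons.append(self)  # register with the father on creation


def siblings(node):
    """All other sons of this node's father (empty list for the root)."""
    if node.father is None:
        return []
    return [son for son in node.father.sons if son is not node]


def describe(node):
    """Build the Name/Father/Siblings/Sons summary for one node."""
    father = node.father.name if node.father is not None else "none"
    return "\n".join([
        "Name: " + node.name,
        "Father: " + father,
        "Siblings: " + " ".join(s.name for s in siblings(node)),
        "Sons: " + " ".join(s.name for s in node.sons),
    ])
```

Moving to the father, a son, or a sibling then only means replacing the current node with node.father, node.sons[i], or siblings(node)[i].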
36
.. on to the exercises