Genomic Signal Processing
Dr. C.Q. Chang
Dept. of EEE
Outline
• Basic Genomics
• Signal Processing for Genomic Sequences
• Signal Processing for Gene Expression
• Resources and Co-operations
• Challenges and Future Work
Basic Genomics
Genome
• Every human cell contains 6 feet of double-stranded (ds) DNA
• This DNA has 3,000,000,000 base pairs representing 50,000-100,000 genes
• This DNA contains our complete genetic code or genome
• DNA regulates all cell functions including response to disease, aging and development
• Gene expression pattern: snapshot of DNA in a cell
• Gene expression profile: DNA mutation or polymorphism over time
• Genetic pathways: changes in genetic code accompanying metabolic and functional changes, e.g. disease or aging.
Gene: protein-coding DNA
DNA (CCTGAGCCAACTATTGATGAA) → transcription → mRNA (CCUGAGCCAACUAUUGAUGAA) → translation → Protein (PEPTIDE)
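The DNA → mRNA → protein example on this slide can be reproduced with a short sketch. The function names `transcribe` and `translate` are illustrative, and the codon table is deliberately partial: it holds only the seven codons that occur in the slide's sequence, taken from the standard genetic code.

```python
# Partial codon table (standard genetic code), limited to the codons
# that appear in the slide's example sequence.
CODON_TABLE = {
    "CCU": "P", "GAG": "E", "CCA": "P", "ACU": "T",
    "AUU": "I", "GAU": "D", "GAA": "E",
}


def transcribe(dna: str) -> str:
    """Write the coding strand as mRNA by replacing T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA three bases at a time and look up each codon."""
    return "".join(
        CODON_TABLE[mrna[i:i + 3]] for i in range(0, len(mrna) - 2, 3)
    )


dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)     # CCUGAGCCAACUAUUGAUGAA
print(protein)  # PEPTIDE
```

A full implementation would need all 64 codons, including the stop codons, and a choice of reading frame.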
In more detail (color ~ state)
Signal Processing for Genomic Sequences
The Data Set
The Problem
• Genomic information is digital: the letters A, T, C and G
• Signal processing deals with numerical sequences, so character strings have to be mapped into one or more numerical sequences
• Identification of protein coding regions
• Prediction of whether or not a given DNA segment is a part of a protein coding region
• Prediction of the proper reading frame
• Compared with traditional methods, signal processing methods are much quicker, and can be even more accurate in some cases.
Sequence to signal mapping
$a = 1 + j,\quad t = 1 - j,\quad c = -1 - j,\quad g = -1 + j$
$y[n] = x[n] + x[n-1]/2 + x[n-2]/4$
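A minimal sketch of this mapping follows. It assumes the complex assignment a = 1 + j, t = 1 - j, c = -1 - j, g = -1 + j and a three-tap filter y[n] = x[n] + x[n-1]/2 + x[n-2]/4 with all-positive taps. Samples before the start of the sequence are taken as zero, which is also an assumption, and the names `to_signal` and `fir_filter` are illustrative.

```python
# Assumed complex value for each nucleotide (the signs are an assumption).
NUCLEOTIDE_VALUE = {"A": 1 + 1j, "T": 1 - 1j, "C": -1 - 1j, "G": -1 + 1j}


def to_signal(dna):
    """Map a DNA string to a list of complex samples x[n]."""
    return [NUCLEOTIDE_VALUE[base] for base in dna.upper()]


def fir_filter(x, taps=(1.0, 0.5, 0.25)):
    """Compute y[n] = sum_k taps[k] * x[n-k], treating x[n] as 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0j
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * x[n - k]
        y.append(acc)
    return y


x = to_signal("CCTGAG")
y = fir_filter(x)
print(y[2])  # value for the first complete triplet C, C, T
```

With these taps, each output sample from n = 2 onward depends on three consecutive bases, so every base triplet maps to a single complex number.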
Signal Analysis
• Spectral analysis (Fourier transform, periodogram)
• Spectrogram
• Wavelet analysis
• HMT: wavelet-based Hidden Markov Tree
• Spectral envelope (using optimal string to numerical value mapping)
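As a sketch of the first item, spectral analysis can be run on four binary indicator sequences, one per base, and the DFT power summed at the frequency index N/3 that corresponds to period 3. The indicator mapping is one common choice and an assumption here, not necessarily the mapping used in these slides. The test sequence is synthetic and the function names are illustrative.

```python
import cmath


def indicator(dna, base):
    """Binary indicator sequence: 1.0 where `base` occurs, else 0.0."""
    return [1.0 if b == base else 0.0 for b in dna]


def dft_power(x, k):
    """Squared magnitude of the k-th DFT coefficient of x (direct sum)."""
    n_total = len(x)
    coeff = sum(
        v * cmath.exp(-2j * cmath.pi * k * n / n_total)
        for n, v in enumerate(x)
    )
    return abs(coeff) ** 2


def total_power(dna, k):
    """Sum the DFT power at index k over the four indicator sequences."""
    dna = dna.upper()
    return sum(dft_power(indicator(dna, base), k) for base in "ACGT")


sequence = "ATG" * 30        # synthetic, perfectly period-3 sequence (N = 90)
k3 = len(sequence) // 3      # frequency index corresponding to period 3
peak = total_power(sequence, k3)
off_peak = total_power(sequence, 7)
print(peak, off_peak)
```

On real sequences the peak is far less clean. A sliding window over the sequence turns this single measurement into a spectrogram-like profile.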
Spectral envelope of the BNRF1 gene from the Epstein-Barr virus
(a) 1st section (1000bp), (b) 2nd section (1000bp),
(c) 3rd section (1000bp), (d) 4th section (954bp)
Conjecture: the 4th quarter is actually non-coding
Signal Processing for Gene Expression
Microarray Life Cycle (taken from Schena & Davis)
Biological Question → Sample Preparation → Microarray Reaction → Microarray Detection → Data Analysis & Modeling
• cDNA clones (probes)
• PCR product amplification and purification
• Printing onto the microarray (0.1 nl/spot)
• Hybridise target (mRNA target) to microarray
• Scanning: excitation with laser 1 and laser 2, emission
• Analysis: overlay images and normalise
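The normalisation step can be illustrated with a small sketch. It assumes one simple scheme, log2 ratios of the two channels followed by global median centring. The slides do not say which normalisation is used. The intensity values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math
from statistics import median


def log_ratios(red, green):
    """Per-spot log2 ratio of the two channel intensities."""
    return [math.log2(r / g) for r, g in zip(red, green)]


def median_centre(values):
    """Subtract the median so that the typical spot has log ratio 0."""
    m = median(values)
    return [v - m for v in values]


# Hypothetical background-corrected spot intensities for the two lasers.
red = [200.0, 400.0, 800.0, 1600.0, 100.0]
green = [100.0, 100.0, 100.0, 100.0, 100.0]

raw = log_ratios(red, green)
normalised = median_centre(raw)
print(raw)
print(normalised)
```

Median centring assumes that most genes are not differentially expressed between the two samples. Intensity-dependent schemes are often preferred when that assumption is doubtful.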
Image Segmentation
• Simple way: fixed circle method
• Advanced: fast marching level set segmentation
(Figure panels: Advanced, Fixed circle)
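The fixed circle method can be sketched as follows, assuming the spot centre and radius are already known from the printing grid. The function name `fixed_circle_mean` and the toy image are hypothetical. The level set approach is not shown.

```python
def fixed_circle_mean(image, centre_row, centre_col, radius):
    """Mean intensity of the pixels whose centres lie inside the circle."""
    total = 0.0
    count = 0
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            if (r - centre_row) ** 2 + (c - centre_col) ** 2 <= radius ** 2:
                total += value
                count += 1
    if count == 0:
        raise ValueError("circle covers no pixels")
    return total / count


# Toy 5x5 image with a bright 3x3 spot in the middle.
image = [
    [0, 0, 0, 0, 0],
    [0, 9, 9, 9, 0],
    [0, 9, 9, 9, 0],
    [0, 9, 9, 9, 0],
    [0, 0, 0, 0, 0],
]
spot = fixed_circle_mean(image, 2, 2, 1)
print(spot)
```

Because the same circle is used for every spot, irregular or off-centre spots mix foreground and background pixels. The advanced method addresses this limitation.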
Clustering and filtering methods
Principal approaches:
• Hierarchical clustering (kdb trees, CART, gene shaving)
• K-means clustering
• Self organizing (Kohonen) maps
• Support vector machines
• Gene Filtering via Multiobjective Optimization
• Independent Component Analysis (ICA)
Validation approaches:
• Significance analysis of microarrays (SAM)
• Bootstrapping cluster analysis
• Leave-one-out cross-validation
• Replication (additional gene chip experiments, quantitative PCR)
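Of the approaches listed, k-means is the simplest to sketch. The version below is a minimal illustration, not a production implementation. It assumes squared Euclidean distance and, so that the result is deterministic, initialises the centroids with the first k profiles. The expression profiles are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20):
    """Alternate nearest-centroid assignment and centroid update."""
    # Assumption: the first k points serve as initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical two-condition expression profiles for six genes.
profiles = [
    [0.0, 0.1], [5.0, 5.1], [0.2, 0.0],
    [5.2, 4.9], [0.1, 0.2], [4.9, 5.0],
]
labels, centroids = k_means(profiles, k=2)
print(labels)
```

In practice the initial centroids are usually chosen at random and the algorithm is restarted several times, because k-means converges only to a local optimum.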
ICA for B-cell lymphoma data
Data: 96 samples of normal and malignant lymphocytes.
Results: scatter plots of 12 independent components
Comparison: closely related to the results of hierarchical clustering
Resources and Co-operations
Resources: databases on the internet such as
• GenBank
• ProteinBank
• Some small databases of microarray data
Co-operation needed:
• First-hand microarray data
• Biological experiments for validation
Challenges and Future Work
• Genomic signal processing opens a new signal processing frontier
• Sequence analysis: the signal is symbolic or categorical, so classical signal processing methods are not directly applicable
• The increasingly high dimensionality of genetic data sets and the complexity involved call for fast, high-throughput implementations of genomic signal processing algorithms
• Future work: spectral analysis of DNA sequences and clustering of microarray data; modify classical signal processing methods and develop new ones.