Extraction of functional information from large-scale gene expression data

Bioinformatics 91.580 2003 Spring

Jianping Zhou


Contents

A prominence feature of cell cycle-regulated genes

----- They show more prominent and active functions than other genes

SP (shortest-path) analysis to extract functional information

----- An alternative and complement to clustering analysis

Prominence Feature: Assumption

Because of their regulatory role, the cell cycle-regulated genes are assumed to be more active and prominent than other genes in the yeast Saccharomyces cerevisiae genome.

When the original dataset is filtered by significance thresholds, a higher survival ratio for the cell cycle-regulated genes than for other genes would support the conclusion that they are more active and prominent.

Prominence Feature: Methods

The preprocessing utility of the GEPAS package can be used to prepare the comparison dataset.

Microarray gene expression data are the ideal data source.

The 800 cell cycle-regulated genes identified by Spellman et al. [1] for the yeast Saccharomyces cerevisiae are the most complete set available at this point.

Prominence Feature: Methods (cont)

Use a single Common Lisp expression to count the hit genes:

(length (intersection regu '(plain text file content)))

regu: the preset CL list holding the names of the 800 cell cycle-regulated genes. It is defined in CL as:

(setf regu '(plain text of 800 cell cycle-regulated genes))

The plain text of the 800 cell cycle-regulated genes can be obtained by copying and pasting the ORF column of CellCycle98.xls.

plain text file content: the pasted content of a preprocessing or clustering output text file, which contains the ORFs of the selected genes.
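The same count can be sketched outside Common Lisp. The Python version below is only an illustration, assuming the pasted text is a whitespace-separated list of ORF names; the function names and the sample ORF strings are hypothetical stand-ins, not the real 800-gene list.

```python
def parse_orfs(text):
    """Split pasted plain text into a set of upper-cased ORF names."""
    return {token.upper() for token in text.split()}


def count_hits(reference_text, result_text):
    """Count ORFs present in both the reference list and the result file."""
    return len(parse_orfs(reference_text) & parse_orfs(result_text))


# Hypothetical stand-ins for the pasted ORF column and a result file.
regu_text = "YPR045C YPL221W YIL056W YHR029C"
result_text = "ypl221w YAL001C YHR029C"

hits = count_hits(regu_text, result_text)
print(hits)
```

Using sets means an ORF that appears twice in a file is counted once, and upper-casing mirrors the way the Lisp reader normalizes symbol names.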

Prominence Feature: Steps

Prominence Feature: Steps (cont)

Parameters: Pe, Pk, Sd

Pe: Minimum percentage of existing values -- patterns whose percentage of existing (non-missing) values is below this rate will be removed.

Pk: Minimum number of peaks -- patterns with fewer peaks than this value will be removed.

Sd: Threshold for standard deviation -- patterns with a standard deviation below the threshold will be removed.
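The three thresholds can be read as three successive tests on each expression profile. The sketch below is not the GEPAS implementation. It assumes that a missing value is stored as None, that a "peak" is any value whose absolute value reaches a hypothetical `peak_cutoff` (the slides do not define a peak), and that the population standard deviation is the one meant.

```python
import statistics


def keep_profile(values, pe, pk, sd, peak_cutoff=1.0):
    """Return True if an expression profile survives the three filters.

    values: list of floats, with None marking a missing value.
    pe: minimum fraction of existing (non-missing) values, 0.0 to 1.0.
    pk: minimum number of peaks.
    sd: minimum standard deviation.
    peak_cutoff: ASSUMPTION, a value counts as a peak when its absolute
        value is at least this cutoff.
    """
    present = [v for v in values if v is not None]
    if not present or len(present) / len(values) < pe:
        return False  # too many missing values
    peaks = sum(1 for v in present if abs(v) >= peak_cutoff)
    if peaks < pk:
        return False  # too few peaks
    if statistics.pstdev(present) < sd:
        return False  # profile is too flat
    return True


# Hypothetical profiles, for illustration only.
profiles = {
    "flat": [0.1, 0.0, -0.1, 0.1],
    "sparse": [2.0, None, None, None],
    "cyclic": [1.5, -1.4, 1.6, -1.5],
}
survivors = [name for name, values in profiles.items()
             if keep_profile(values, pe=0.95, pk=1, sd=0.5)]
print(survivors)
```

With Pe = 95%, Pk = 1 and Sd = 0.5, only the oscillating profile passes: the flat one has no peaks and the sparse one has too few existing values.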

P0: Total profiles in the original file

P1: Profiles removed because of missing values, determined by the minimum percentage of existing values (Pe)

P2: Profiles repaired by imputing missing values, determined by Minimum number of peaks

P3: Flat profiles removed by the number-of-peaks filter (Pk)

P4: Flat profiles removed by the standard-deviation filter (Sd)

P5: Profiles remaining in the result dataset

Hit: Count of genes present in both the result dataset and the 800-gene Spellman cell cycle-regulated dataset.

Hit rate: Hit / P5
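As a worked illustration of the hit rate, and of the survival-ratio comparison stated in the assumption, the sketch below uses hypothetical gene labels G1 to G6. The `survival_ratios` helper is my reading of "survival ratio" (survivors divided by the genes of that group in the original file); the slides give no formula for it.

```python
def hit_rate(result_orfs, reference_orfs):
    """Hit rate = Hit / P5, with P5 the number of profiles in the result."""
    result = set(result_orfs)
    if not result:
        return 0.0  # convention chosen here for an empty result dataset
    return len(result & set(reference_orfs)) / len(result)


def survival_ratios(original_orfs, result_orfs, reference_orfs):
    """Fraction of reference genes and of other genes that survive filtering."""
    original = set(original_orfs)
    result = set(result_orfs)
    reference = set(reference_orfs)
    ref_in = original & reference      # reference genes present before filtering
    others_in = original - reference   # all remaining genes before filtering
    ref_ratio = len(result & ref_in) / len(ref_in) if ref_in else 0.0
    other_ratio = len(result & others_in) / len(others_in) if others_in else 0.0
    return ref_ratio, other_ratio


# Hypothetical data: six genes before filtering, three survive.
original = ["G1", "G2", "G3", "G4", "G5", "G6"]
reference = ["G1", "G2", "G3"]
result = ["G1", "G2", "G4"]

rate = hit_rate(result, reference)
ref_ratio, other_ratio = survival_ratios(original, result, reference)
print(rate, ref_ratio, other_ratio)
```

Here two of the three surviving genes are reference genes, so the hit rate is 2/3, and the reference genes survive at twice the rate of the others (2/3 against 1/3).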

Prominence Feature: Result

[Figure: Hit Rate vs Pe. x-axis: Pe; y-axis: Hit Rate. Series: Pk: 0 / Sd: 0.5; Pk: 1 / Sd: 0.5; Pk: 0 / Sd: 1.0; Pk: 1 / Sd: 1.0]

Prominence Feature: Result (cont)

[Figure: Hit Rate vs Sd, with Pe = 95%. x-axis: Sd; y-axis: Hit Rate. Series: Pk: 0; Pk: 1]

SP (shortest-path) analysis: Introduction

SP (shortest-path) analysis is used to identify transitive genes between two given genes from the same biological process.

Transitive expression similarity among genes can be used as an important attribute to link genes of the same biological pathway.

Recent advances in computational and experimental technologies have opened up real opportunities for annotating gene functions not only at the phenomenological levels but also at the mechanistic levels.
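The idea of finding transitive genes between two given genes can be sketched with Dijkstra's algorithm on a gene graph. This is not the procedure of [5]: the distance 1 - |correlation|, the gene labels G1 to G4, and the correlation values are all assumptions made for illustration.

```python
import heapq


def shortest_path(edges, source, target):
    """Dijkstra's algorithm on an undirected weighted graph.

    edges: dict mapping (gene_a, gene_b) to a non-negative distance.
    Returns the list of genes on a shortest path, or None if unreachable.
    """
    graph = {}
    for (a, b), dist in edges.items():
        graph.setdefault(a, []).append((b, dist))
        graph.setdefault(b, []).append((a, dist))
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        dist, gene = heapq.heappop(queue)
        if gene == target:
            break
        if dist > best.get(gene, float("inf")):
            continue  # stale queue entry
        for neighbour, weight in graph.get(gene, []):
            candidate = dist + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                previous[neighbour] = gene
                heapq.heappush(queue, (candidate, neighbour))
    if target not in best:
        return None
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1]


# Hypothetical expression correlations between placeholder genes.
correlations = {("G1", "G2"): 0.9, ("G2", "G3"): 0.85,
                ("G1", "G3"): 0.3, ("G3", "G4"): 0.95}
# ASSUMPTION: strongly correlated genes are close, distance = 1 - |c|.
edges = {pair: 1.0 - abs(c) for pair, c in correlations.items()}

path = shortest_path(edges, "G1", "G3")
transitive_genes = path[1:-1]
print(path, transitive_genes)
```

G1 and G3 are only weakly correlated directly, but the path through G2 is shorter, so G2 is reported as the transitive gene linking them.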

SP (shortest-path) analysis: Discovery

Using the yeast Saccharomyces cerevisiae genome, X. Zhou et al. [5] constructed the cytoplasm graph, which contains 398 genes (the other two graphs cover the mitochondria and the nucleus). All of those genes are involved in the same biological pathway.

Matching the cytoplasm outcome against Spellman's CellCycle98.xls identifies six genes:

YPR045C YPL221W(BOP1) YIL056W YHR029C YDR130C YBR053C

SP (shortest-path) analysis: Discovery (cont)

According to CellCycle98.xls, all six genes have an unknown process, and their cluster order numbers are far apart from each other.

In the SOM clustering output for the normalized file, which has 561 hits among the 800 Spellman genes, three of the six genes are found: YPR045C in cluster (2, 4), YPL221W in cluster (1, 1), and YBR053C in cluster (2, 7). The other three are not found.

As far as all my clustering outputs are concerned, none of them is found in the clustering.

None of the Ftigo linked databases has results for these five genes or ORFs.

No evidence shows that these six genes can be placed in the same cluster.
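The co-clustering check above can be written out as a small lookup. The cluster coordinates are the SOM assignments quoted on this slide; the function name and data layout are hypothetical.

```python
from collections import defaultdict


def group_by_cluster(genes, assignments):
    """Group genes by their assigned cluster and list the ones not found."""
    clusters = defaultdict(list)
    missing = []
    for gene in genes:
        if gene in assignments:
            clusters[assignments[gene]].append(gene)
        else:
            missing.append(gene)
    return dict(clusters), missing


six_genes = ["YPR045C", "YPL221W", "YIL056W", "YHR029C", "YDR130C", "YBR053C"]
# SOM cluster coordinates as reported for the normalized file.
som_assignments = {"YPR045C": (2, 4), "YPL221W": (1, 1), "YBR053C": (2, 7)}

clusters, missing = group_by_cluster(six_genes, som_assignments)
# Clusters that hold more than one of the six genes.
shared = {cluster: members for cluster, members in clusters.items()
          if len(members) > 1}
print(shared, missing)
```

No cluster holds more than one of the six genes, and three of them are absent from the SOM output, which matches the observation that clustering alone does not group them.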

References

[1] Paul T. Spellman, Gavin Sherlock, Michael Q. Zhang, Vishwanath R. Iyer, Kirk Anders, Michael B. Eisen, Patrick O. Brown, David Botstein, and Bruce Futcher. Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. MBC, Vol. 9, Issue 12, 3273-3297, December 1998.

[2] Oliveros, J.C., Blaschke, C., Herrero, J., Dopazo, J. & Valencia, A. (2000). Expression profiles and biological function. Genome Informatics Workshop 2000, 11, 106-117.

[3] M. Q. Zhang. Extracting functional information from microarrays: A challenge for functional genomics. PNAS, October 1, 2002; 99(20): 12509-12511.

[4] M. Q. Zhang. Large-Scale Gene Expression Data Analysis: A New Challenge to Computational Biologists. Genome Res., August 1, 1999; 9(8): 681-688.

[5] X. Zhou, M.-C. J. Kao, and W. H. Wong. From the Cover: Transitive functional annotation by shortest-path analysis of gene expression data. PNAS, October 1, 2002; 99(20): 12783-12788.

[6] http://www.biostat.harvard.edu/complab/SP/
