Natural variation in human gene expression assessed in lymphoblastoid cells

letter

422 nature genetics • volume 33 • march 2003

Vivian G. Cheung1–3, Laura K. Conlin1, Teresa M. Weber3, Melissa Arcaro3, Kuang-Yu Jen1,2, Michael Morley3 & Richard S. Spielman2

Departments of 1Pediatrics and 2Genetics, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania 19104, USA. 3The Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA. Correspondence should be addressed to V.G.C. (e-mail: [email protected]).

The sequencing of the human genome has resulted in greater attention to genetic variation among individuals, and variation at the DNA sequence level is now being extensively studied. At the same time, it has become possible to study variation at the level of gene expression by various methods. At present, it is largely unknown how widespread this variation in transcript levels is over the entire genome and to what extent individual differences in expression level are genetically determined. In the present study, we used lymphoblastoid cells to examine variation in gene expression and identified genes whose transcript levels differed greatly among unrelated individuals. We also found evidence for familial aggregation of expression phenotype by comparing variation among unrelated individuals, among siblings within families and between monozygotic twins. These observations suggest that there is a genetic contribution to polymorphic variation in the level of gene expression.

Phenotypic variation among humans has always been the starting point for genetic analysis, whether this is directed at disease or ‘normal’ variation. Recent work (refs. 1,2) has characterized variation at the DNA level in the form of single-nucleotide polymorphisms (SNPs) and insertion/deletion polymorphisms. It has also become possible to screen the entire expressed genome for differences among individuals in level of gene expression (ref. 3), allowing expression to be studied genetically like other phenotypes. Genetically determined polymorphic variation in proteins is a well-known phenotype (ref. 4); similarly, variation in gene expression is probably due in large part to polymorphisms in DNA sequence.

With this in mind, we began a genome-wide genetic analysis of variability in gene expression. Whereas genetic diseases are the extreme manifestation of genetic variation, phenotypic variation in transcript level represents an intermediate stage between DNA sequence differences and complex human traits and diseases. In addition, the gene-expression phenotype itself serves as a model for quantitative-trait analysis in humans. Recent studies investigated intra- and interspecies variation in gene expression in several species, including chimpanzees, mice, yeast and fish (refs. 5–8). One study has also shown allelic variation in gene expression for several genes in humans (ref. 9). In the present study, we focused on a genome-wide assessment of gene expression in humans and identified genes whose expression levels in lymphoblastoid cells varied markedly among individuals. We also found evidence for a genetic basis for this variation.

We used cDNA microarrays to identify the genes whose transcript levels varied most among normal individuals. The arrays contained about 5,000 randomly selected cDNA clones from the IMAGE consortium clone collection. We hybridized differentially labeled cDNA from an individual and cDNA from a reference sample, pooled from 10 individuals, onto each microarray. We initially analyzed 35 unrelated individuals from the Centre d’Etude du Polymorphisme Humain (CEPH) Utah pedigrees (ref. 10; see Supplementary Table 1 online). We measured the gene-expression level of each gene as the ratio of the intensity of hybridization signal for an individual relative to that for the reference pool. We carried out all hybridizations with four replicates, yielding four values of the expression level for each gene in each individual.

About 50% of the genes on the microarrays are expressed in lymphoblastoid cells. Of those, we analyzed only the genes with valid observations on three or more replicates for at least 30 individuals; 813 genes met these criteria. As a measure of variability in expression levels, we calculated the variance ratio or F statistic for each gene by dividing the variance of the expression levels among individuals by the variance within individuals (microarray replicates), so that measurement noise was taken into account. The variance ratios of the 813 genes ranged from 0.4 to 64 with a median value of 2.5. The reproducibility of our microarray data was high and for most genes, the variance of expression levels between individuals was higher than the variance within (Fig. 1). The variance within individuals ranged from 0.01 to 0.18 and was similar for all genes.

Published online 3 February 2003; doi:10.1038/ng1094

Fig. 1 Scatter plot of variance in expression level between individuals and between replicates for 813 genes. The genes with the highest variance ratio (top 5%) are highlighted in red. The dotted line indicates a variance ratio of 1.0.

Our goal was to identify the genes that were most variable in expression level between individuals, because we expected that genes with more variable expression would be more amenable to genetic dissection. The 40 genes (about 5%) with the highest variance ratios (Fig. 1) were not clustered in genomic location; instead, they mapped to many chromosomes across the human genome (Fig. 2a and see Supplementary Table 2 online). Among these genes are several that were previously known to be highly variable, especially at the sequence level; examples include HLA-DRB1 and some of the genes involved in drug metabolism, including CYP26A1 and DHFR. Another gene, CYP3A4, whose enzyme product influences metabolism of most drugs prescribed today, was also highly variable (in the top 10%; ref. 11).

We also wanted to understand how functional classification correlates with variability in expression. We used the functional categories provided by the Gene Ontology Consortium (ref. 12) to classify and compare the 100 most variable and the 100 least variable genes. We included all genes with known functions, but removed two categories (RNA processing and energy metabolism) that had too few genes for valid testing. We carried out χ² analysis of the 8 × 2 contingency table and found significantly different (P < 0.025) functions for variable versus non-variable genes (Fig. 2b). Among the variable genes, the most heavily represented categories (accounting for 31%) were those of cytoskeleton (11%), protein modification (11%) and transport (9%). Among the non-variable genes, the most heavily represented categories (accounting for 23%) were those of signal transduction (12%) and cell death/proliferation (11%).

Fig. 2 Genomic locations and functions of genes with highly variable expression level. a, Chromosomal locations of the 40 most variable genes. Chromosomes that do not contain any of these genes are not included in the figure. The variance ratios for all 40 genes are given in Supplementary Table 2 online. b, Functional groupings of the 100 most variable and the 100 least variable genes. Genes with unknown functions (36 of the most variable and 50 of the least variable) are not shown.
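The χ² comparison of functional categories can be sketched as follows. This is only a minimal illustration, assuming a plain Pearson χ² test on an 8 × 2 table of gene counts. The counts and the generic category labels are invented placeholders, not the values behind Fig. 2b, and `chi_square` is a hypothetical helper name.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            # Expected count if category and variability class were independent.
            expected = row_total * col_total / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# HYPOTHETICAL counts: (most variable, least variable) genes per functional
# category. Eight placeholder categories give the 8 x 2 layout described above.
counts = {
    "category_1": (10, 3),
    "category_2": (10, 4),
    "category_3": (8, 3),
    "category_4": (4, 10),
    "category_5": (3, 9),
    "category_6": (6, 5),
    "category_7": (5, 6),
    "category_8": (6, 4),
}

stat, df = chi_square([list(pair) for pair in counts.values()])
print(f"chi-square = {stat:.2f} on {df} degrees of freedom")
```

For an 8 × 2 table the test has 7 degrees of freedom, for which a statistic above roughly 16.01 corresponds to P < 0.025.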

We also examined the actual variances of the expression levels for five of the highly variable genes (ACTG2, CYP26A1, HNRPA2B1, DHFR and GK) among normal individuals (Fig. 3). In our data, ACTG2 had the highest variance ratio; its expression level varied by a factor of 17 among the 35 individuals. For each of the 40 genes with the highest variance ratios, the highest and lowest values differed by a factor of 2.4 or greater.

To investigate the genetic basis of variation in gene expression, we examined the gene transcript levels of five highly variable genes (ST3GALVI, ACTG2, GK, HNRPA2B1 and DHFR) among three groups of people: 49 unrelated individuals (including the 35 individuals in the microarray experiments); offspring from CEPH families 1333, 1347, 1362, 1416 and 1424; and 10 pairs of monozygotic twins. We determined the expression levels of the genes by real-time quantitative RT–PCR and compared the three groups. We found that the genes showed less variability in expression level in more closely related individuals (Fig. 4). For all five genes, expression levels varied the least in monozygotic twins, with intermediate variability in siblings from the same family and greatest variability in unrelated individuals. For the five genes examined, the variance among unrelated individuals was 3–11 times greater than that between monozygotic twins, and the variance among siblings was 2–5 times greater than that between twins.

We evaluated the significance of these results using the F test. The difference in variance for unrelated individuals versus twins was nominally significant for all five genes (Table 1). The much smaller variability in expression between monozygotic twins is consistent with a genetic contribution, but the statistical significance is difficult to evaluate because the genes were selected for high variability in unrelated individuals. This consideration does not, however, affect the comparison of siblings and monozygotic twins, as the five genes were selected for analysis without regard to these two groups. The nominal P values for the latter comparison are also shown in Table 1. Thus, the results from the family studies suggest that genetic differences among individuals contribute to variation in expression phenotype for some genes.

This evidence for a heritable component of gene expression sets the stage for further genetic analysis. For example, it will be possible to use genetic linkage analysis to search for sequence variants that control or influence gene expression, as has been done in yeast (ref. 7). These expression-control elements (ECEs) may be located within genes or in intergenic regions. We expect that for some genes, the ECEs will be closely linked cis-regulatory elements, such as sequence polymorphisms in the promoter regions or 3′ untranslated regions that affect the stability of transcripts. In other cases, the ECEs will be trans-acting elements such as transcription factors. For most genes, we expect a combination of cis- and trans-acting elements acting in concert to regulate expression levels.

Our results provide an initial assessment of natural variation of gene expression in lymphoblastoid cells in humans. They suggest that a component of this variation is genetically determined. The next steps will be to map and identify the polymorphic genes or determinants that contribute to this variation and to understand the functional consequences of variation in gene expression. It is likely that some of the expression phenotypes will contribute to other complex traits and diseases in humans.

Methods

Microarray analysis. The 35 individuals used in the microarray analysis of this study are parents from the CEPH Utah pedigrees. Demographic information on these individuals is given in Supplementary Table 1 online. We extracted total RNA from lymphoblastoid cells from each individual and reverse-transcribed and labeled it with Cy3 fluorescent tags per manufacturer’s protocol (Genisphere). We reverse-transcribed a reference RNA pool, made by combining RNA from 10 randomly selected CEPH individuals, and labeled it with Cy5 fluorescent tags as above. We hybridized the Cy3-labeled cDNA from an individual with the Cy5-labeled reference cDNA onto a microarray containing 5,184 genes. We analyzed the microarray images using ArrayVision 6.0 (Imaging Research). Each array was normalized so that the overall intensity ratio of Cy3 to Cy5 was one. Details of microarray procedures are given in Supplementary Note 1 online.
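The normalization step and the per-gene expression ratio can be illustrated with a short sketch. It assumes that "overall intensity ratio" means the ratio of summed Cy3 to summed Cy5 signal across the array. The text does not specify this, and other readings (for example, a median of per-spot ratios) are possible. The function name and the intensity values are hypothetical, and background correction is omitted.

```python
def normalize_array(cy3, cy5):
    """Return per-spot Cy3/Cy5 ratios after global normalization.

    ASSUMPTION: the array is normalized by rescaling the Cy5 (reference)
    channel so that total Cy3 signal divided by total Cy5 signal equals one.
    """
    scale = sum(cy3) / sum(cy5)
    return [sample / (reference * scale) for sample, reference in zip(cy3, cy5)]


# HYPOTHETICAL spot intensities for three genes on one array.
cy3_individual = [100.0, 400.0, 700.0]   # Cy3: cDNA from one individual
cy5_reference = [100.0, 200.0, 300.0]    # Cy5: pooled reference cDNA

ratios = normalize_array(cy3_individual, cy5_reference)
for gene_index, ratio in enumerate(ratios):
    print(f"gene {gene_index}: expression ratio {ratio:.3f}")
```

Each returned value plays the role of one replicate measurement of a gene's expression level in one individual, relative to the reference pool.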

Fig. 3 The expression levels measured using microarrays for five highly variable genes in 35 individuals. Each point represents the expression level for an individual.

Fig. 4 Variance in expression level for five genes. Quantitative RT–PCR data for 49 unrelated individuals, 41 offspring in five CEPH families and 10 sets of monozygotic twins.

Table 1 • Relative variability of gene expression levels

            Unrelateds/twins      Siblings/twins
Gene        F        P value      F        P value
ST3GALVI    10.53    0.0002       4.74     0.0062
ACTG2        5.02    0.0046       2.85     0.0402
GK           3.06    0.0304       1.71     0.1848
HNRPA2B1     5.86    0.0024       3.65     0.0168
DHFR         3.25    0.0245       2.10     0.1060

Expression was measured in 49 unrelated individuals, 41 siblings (5 sibships) and 10 monozygotic twin pairs. Variances were compared by the F test.


Variance ratio calculation. We analyzed samples from each subject on four replicated microarrays. We removed the most deviant observation from every set of four array observations and then calculated the expression level of each gene as the average of the three remaining measurements (ref. 13). If the number of valid replicates was three, we did not remove any expression ratios from the set. If two or fewer replicates were present for a given gene, then we removed all expression ratios for that gene. For each gene in the ith individual, we calculated the mean $M_i$ and the variance $s_i^2$ of the remaining replicate observations and used these values for subsequent analysis.

We used the variance ratio as a measurement of variability in gene expression levels among individuals relative to the measurement noise. For each gene, we calculated this measure as the ratio of the variance in mean expression level among individuals to the mean of the variance of the replicates within individuals: (variance of $M_i$)/(mean of $s_i^2$).
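The replicate filtering and variance-ratio calculation for a single gene can be sketched as below. This is an illustrative reading of the procedure, not the authors' code. It assumes that "most deviant" means farthest from the mean of the four replicates and that sample variances (n − 1 denominator) are used, neither of which is stated in the text. The function names and example values are hypothetical, and `None` stands for an invalid observation.

```python
from statistics import mean, variance


def trim_replicates(values):
    """Apply the replicate rule to one gene in one individual.

    Returns the retained replicates, or None if two or fewer are valid.
    """
    valid = [v for v in values if v is not None]
    if len(valid) <= 2:
        return None
    if len(valid) == 4:
        # ASSUMPTION: "most deviant" = farthest from the mean of the four.
        centre = mean(valid)
        valid.remove(max(valid, key=lambda v: abs(v - centre)))
    return valid


def variance_ratio(replicates_by_individual):
    """(variance of M_i) / (mean of s_i^2) for one gene across individuals."""
    kept = [r for r in map(trim_replicates, replicates_by_individual) if r is not None]
    means = [mean(r) for r in kept]                # M_i
    within = [variance(r) for r in kept]           # s_i^2
    return variance(means) / mean(within)


# HYPOTHETICAL expression ratios for one gene in four individuals.
example = [
    [1.0, 1.1, 0.9, 2.0],     # four valid replicates: 2.0 is dropped
    [2.0, 2.1, 1.9, None],    # three valid replicates: all kept
    [3.0, 3.1, 2.9, 1.0],     # four valid replicates: 1.0 is dropped
    [5.0, None, 5.2, None],   # two valid replicates: individual excluded
]
print(f"variance ratio = {variance_ratio(example):.1f}")
```

The example uses far fewer individuals than the study's threshold of at least 30 with three or more valid replicates, and that gene-level filter is not implemented here.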

Quantitative RT–PCR. We analyzed samples from 49 unrelated individuals, 41 siblings from five families (CEPH 1333, 1347, 1363, 1416, 1424; sibship sizes 9, 9, 7, 8 and 8, respectively) and 10 sets of monozygotic twins. Information on study subjects is presented in Supplementary Table 3 online. We carried out RT–PCR with the SYBR-Green protocol (Applied Biosystems) using a two-step procedure with 12.5 ng total RNA and 300 nM gene-specific primers following manufacturer’s protocol. We carried out the assays in duplicate. We calculated the expression levels using dCt per manufacturer’s recommendation and normalized them to the level of β-actin. Detailed methods including experimental procedure and sequences of the gene-specific primers are given in Supplementary Note 1 online.
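A dCt calculation of the kind referred to above might look like the following sketch. It assumes the conventional definition (threshold cycle of the target gene minus that of β-actin), that the duplicate assays are averaged first, and, for the conversion to a relative level, that amplification efficiency is close to 100%. None of these details are spelled out in the text, and the Ct values are hypothetical.

```python
from statistics import mean


def delta_ct(ct_target, ct_reference):
    """dCt: mean Ct of the target gene minus mean Ct of the reference gene.

    ASSUMPTION: duplicate assays are averaged before subtraction.
    """
    return mean(ct_target) - mean(ct_reference)


def relative_expression(dct):
    """Expression relative to the reference gene.

    ASSUMPTION: about 100% amplification efficiency, so one cycle of
    difference corresponds to a twofold difference in template.
    """
    return 2.0 ** (-dct)


# HYPOTHETICAL duplicate Ct values for one sample.
ct_gene = [24.0, 25.0]      # target gene
ct_actin = [18.0, 19.0]     # β-actin reference

dct = delta_ct(ct_gene, ct_actin)
print(f"dCt = {dct:.2f}, relative expression = {relative_expression(dct):.6f}")
```

A larger dCt means lower expression relative to β-actin, so comparisons of variability are naturally made on the dCt scale.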

For each gene, we carried out an F test to compare the variance in dCt between the unrelated individuals and the siblings within families and between the siblings and the monozygotic twins. The distribution of dCt values did not differ significantly from normal.
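One way to form such a variance ratio is sketched below. It assumes that variability within sibships and within twin pairs is summarized as a pooled within-group variance (squared deviations from each group's own mean, divided by the number of individuals minus the number of groups), with the unrelated individuals treated as a single group. The text does not state how the variances were pooled, so this is an assumption, and the dCt values are hypothetical.

```python
from statistics import mean


def pooled_within_variance(groups):
    """Pooled within-group variance and its degrees of freedom."""
    sum_sq = 0.0
    df = 0
    for group in groups:
        centre = mean(group)
        sum_sq += sum((x - centre) ** 2 for x in group)
        df += len(group) - 1
    return sum_sq / df, df


def f_statistic(numerator_groups, denominator_groups):
    """F = pooled variance of the first set over that of the second set."""
    var_num, df_num = pooled_within_variance(numerator_groups)
    var_den, df_den = pooled_within_variance(denominator_groups)
    return var_num / var_den, df_num, df_den


# HYPOTHETICAL dCt values for one gene.
unrelated = [[4.0, 6.0, 5.0, 7.0, 3.0]]            # one group of unrelated people
twins = [[5.0, 5.5], [6.0, 6.5], [4.0, 4.5]]       # three twin pairs

F, df1, df2 = f_statistic(unrelated, twins)
print(f"F = {F:.2f} with ({df1}, {df2}) degrees of freedom")
```

An upper-tail P value would then be read from the F distribution with those degrees of freedom, for instance with `scipy.stats.f.sf(F, df1, df2)` if SciPy is available.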

Human subjects. All samples studied were from lymphoblastoid cell lines that are part of the CEPH Utah collections. The study was approved by the institutional review boards at The Children’s Hospital of Philadelphia and the University of Pennsylvania.

Accession numbers. The GEO accession number for the microarray data is GPL230.

Note: Supplementary information is available on the Nature Genetics website.

Acknowledgments

We thank W.J. Ewens for discussion and A. Bruzel, K. Ewens and H.H. Kazazian for comments. This work was supported by grants from the U.S. National Institutes of Health (to V.G.C. and R.S.S.) and from the W.W. Smith Endowed Chair in Pediatric Genomics (to V.G.C.).

Competing interests statement

The authors declare that they have no competing financial interests.

Received 20 September 2002; accepted 8 January 2003.

1. Altshuler, D. et al. An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407, 513–516 (2000).

2. Sachidanandam, R. et al. A map of human genome sequence variation containing 1.42 million single-nucleotide polymorphisms. Nature 409, 928–933 (2001).

3. Schena, M., Shalon, D., Davis, R.W. & Brown, P.O. Quantitative monitoring of gene-expression patterns with a complementary DNA microarray. Science 270, 467–470 (1995).

4. Harris, H. & Hopkinson, D.A. Average heterozygosity per locus in man: an estimate based on the incidence of enzyme polymorphisms. Ann. Hum. Genet. Lond. 36, 9–20 (1972).

5. Enard, W. et al. Intra- and interspecific variation in primate gene-expression patterns. Science 296, 340–343 (2002).

6. Steinmetz, L.M. et al. Dissecting the architecture of a quantitative-trait locus in yeast. Nature 416, 326–330 (2002).

7. Brem, R.B., Yvert, G., Clinton, R. & Kruglyak, L. Genetic dissection of transcriptional regulation in budding yeast. Science 296, 752–755 (2002).

8. Oleksiak, M.F., Churchill, G.A. & Crawford, D.L. Variation in gene expression within and among natural populations. Nat. Genet. 32, 261–266 (2002).

9. Yan, H., Yuan, W., Velculescu, V.E., Vogelstein, B. & Kinzler, K.W. Allelic variation in human gene expression. Science 297, 1143 (2002).

10. Dausset, J. et al. Centre d’etude du polymorphisme humain (CEPH): collaborative genetic mapping of the human genome. Genomics 6, 575–577 (1990).

11. Guengerich, F.P. Cytochrome P-450 3A4: regulation and role in drug metabolism. Annu. Rev. Pharmacol. Toxicol. 39, 1–17 (1999).

12. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000).

13. Watts, J.A. et al. Gene expression phenotype in heterozygous carriers of ataxia telangiectasia. Am. J. Hum. Genet. 71, 791–800 (2002).
