MAGE-TAB - a simple tab delimited format for
describing microarray (and potentially other) experiments in a MIAME
compliant way
MAGE-TAB workshop, NCI, January 24, 2008
What is needed to describe a microarray (and potentially any)
experiment adequately (MIAME)?
1. Description of biological sample (research subject, aliquot – i.e., biomaterial) properties
2. Description of the assay (e.g., microarray design)
3. Data from the assays - raw and processed
4. Material and data processing protocols
5. Experiment design – which sample went to which assay and produced which data file
How to do this?
1. Description of biological sample (research subject, aliquot – i.e., biomaterial) properties – a list of properties – free text or ontology entries – table (spreadsheet)
2. Description of the assay (e.g., microarray design) – a list of features on the array and their properties – sequence, annotation, location – table (spreadsheet)
3. Data from the assays - raw and processed – files or tables
4. Material and data processing protocols – free text
5. Experiment design – which sample went to which assay and produced which data file – a graph (normally a DAG)
5. Experiment design graph
[Figure: Sample 1 and Sample 2, labelled with Cy3 and Cy5, go into one Hybridisation, which produces Data]
5. Experiment design graph
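[Figure: the same graph with annotations. Nodes: Sample 1 (Homo S., Brain), Sample 2 (Homo S., Kidney), Hybridisation (SMD array 1), Data. Edge labels: Protocol (Cy5) and Protocol (Cy3) on the sample-to-hybridisation edges, Protocol on the hybridisation-to-data edge]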
[Figure: a larger experiment design graph. Nodes: samples liver 1, liver 2, Kidney 1, Kidney 2, Brain 1, Brain 2 (all Homo sap.); Hybridisation 1, 2 and 3 (HG_U95A); Data1.cel, Data2.cel, Data3.cel; FGDM.txt. Edge labels: Material processing protocols (P-XMPL-1) between samples and hybridisations; Normalisation protocol (P-XMPL-2) between the .cel files and FGDM.txt]
Sample ID  Characteristics[Organism]  Characteristics[OrganismPart]  Protocol REF  Hybridization ID  ArrayDesign REF  ArrayData URI  Protocol REF  DerivedArrayData URI
liver 1    Homo sapiens               liver                          P-XMPL-1      hyb 1             HG_U95A          1.CEL          P-XMPL-2      FGEM.txt
liver 2    Homo sapiens               liver                          P-XMPL-1      hyb 1             HG_U95A          1.CEL          P-XMPL-2      FGEM.txt
kidney 1   Homo sapiens               kidney                         P-XMPL-1      hyb 2             HG_U95A          2.CEL          P-XMPL-2      FGEM.txt
kidney 2   Homo sapiens               kidney                         P-XMPL-1      hyb 2             HG_U95A          2.CEL          P-XMPL-2      FGEM.txt
brain 1    Homo sapiens               brain                          P-XMPL-1      hyb 3             HG_U95A          3.CEL          P-XMPL-2      FGEM.txt
brain 2    Homo sapiens               brain                          P-XMPL-1      hyb 3             HG_U95A          3.CEL          P-XMPL-2      FGEM.txt
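Each row of the table is one path through the design graph, so the edges can be read back from the node columns. A minimal Python sketch, assuming tab-delimited text and treating `Sample ID`, `Hybridization ID`, `ArrayData URI` and `DerivedArrayData URI` as the node columns; the function name `sdrf_edges` is hypothetical and not part of any MAGE-TAB tool:

```python
import csv
import io

# Columns that hold graph nodes in the example; the other columns
# (Characteristics[...], Protocol REF, ArrayDesign REF) annotate nodes or edges.
NODE_COLUMNS = ["Sample ID", "Hybridization ID", "ArrayData URI", "DerivedArrayData URI"]

HEADER = [
    "Sample ID", "Characteristics[Organism]", "Characteristics[OrganismPart]",
    "Protocol REF", "Hybridization ID", "ArrayDesign REF", "ArrayData URI",
    "Protocol REF", "DerivedArrayData URI",
]
ROWS = [
    ["liver 1", "Homo sapiens", "liver", "P-XMPL-1", "hyb 1", "HG_U95A", "1.CEL", "P-XMPL-2", "FGEM.txt"],
    ["liver 2", "Homo sapiens", "liver", "P-XMPL-1", "hyb 1", "HG_U95A", "1.CEL", "P-XMPL-2", "FGEM.txt"],
    ["kidney 1", "Homo sapiens", "kidney", "P-XMPL-1", "hyb 2", "HG_U95A", "2.CEL", "P-XMPL-2", "FGEM.txt"],
]
# Tab-delimited text, as it would appear in a file.
SDRF_TEXT = "\n".join("\t".join(line) for line in [HEADER] + ROWS)


def sdrf_edges(text, node_columns=NODE_COLUMNS):
    """Return the set of (from_node, to_node) edges encoded by the rows."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    # Positions are looked up once; "Protocol REF" occurs twice, so a
    # name-keyed dict per row would silently drop one of the two columns.
    positions = [header.index(name) for name in node_columns]
    edges = set()
    for row in reader:
        nodes = [row[i] for i in positions]
        edges.update(zip(nodes, nodes[1:]))  # consecutive node columns form edges
    return edges


EDGES = sdrf_edges(SDRF_TEXT)
```

Rows that share a hybridisation or a data file contribute the same edge more than once; collecting edges in a set merges them, which is how a regular spreadsheet collapses back into a graph.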
Important observation
• In high throughput experiments the experiment design graphs are
– Regular (similar subgraphs repeated many times)
– For most nodes there is only a small number of incoming and outgoing edges
– They can be presented in ‘layers’ in a natural way (in fact any DAG can be represented in layers)
• This makes their representation as a spreadsheet simple and natural
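One way to see that any DAG can be put into layers: give every node the length of the longest path that leads to it. A minimal Python sketch using the standard-library `graphlib` module (Python 3.9 or later); the function name `layer_of_each_node` and the small example graph are illustrative assumptions, not something the slides define:

```python
from graphlib import TopologicalSorter


def layer_of_each_node(predecessors):
    """Assign each node the length of the longest path leading to it.

    `predecessors` maps a node to the nodes that have an edge into it.
    Nodes without predecessors land in layer 0.
    """
    layers = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(predecessors).static_order():
        incoming = predecessors.get(node, ())
        layers[node] = 1 + max((layers[p] for p in incoming), default=-1)
    return layers


# Part of the example design: two samples, one hybridisation, two data files.
DESIGN = {
    "hyb 1": ["liver 1", "liver 2"],
    "1.CEL": ["hyb 1"],
    "FGEM.txt": ["1.CEL"],
}
LAYERS = layer_of_each_node(DESIGN)
```

Each layer then becomes one group of spreadsheet columns, and every edge points from a lower layer to a higher one.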
Layers in a more complex DAG
[Figure: a DAG with nodes A, B, C, W, E, F, G, H, I drawn in layers numbered 0 to 5]
In reality things are often even simpler
... or even simpler than that
[Figure: a linear chain of nodes and protocols: source (individual; age, sex, ..., weight) → sample extraction prot. → sample (sample type, amount, ..., date taken) → RNA extr. and labl. prot. → assay (ma batch) → data processing prot. → data]

Each chain is one row:
ID1  s1  a1  d1
ID2  s2  a2  d2
...
IDk  sk  ak  dk

[Figure: the general case. Row i is a chain of nodes vi1, vi2, vi3, one per layer; node v11 has characteristics (c111, ..., c11n), node v12 has (c121, ..., c12m), and so on; edges between layers carry labels l1, l2, l3]

Spreadsheet columns, left to right:
node layer 1; characteristics 11, characteristics 21, ..., characteristics n1; edge label; node layer 2; characteristics 12, characteristics 22, ..., characteristics m2; edge label; node layer 3; characteristics 13; ...

Rows:
v11  c111  c112  ...  c11n  l11  v12  c121  c122  ...  c12m  l12  v13  ...
v21  ...                         v22  ...                         v23  ...
...
vk1  ...                         vk2  ...                         vk3  ...
Real world examples
• Simplified E-TABM-234.sdrf
Source ID   Sample ID     Extract ID    LabeledExtract ID  Label   Hybridization ID
BS_TKAC_13  BSM_TKAC_01m  BSM_TKAC_22p  BSM_TKAC_22p       biotin  HFB2002012101A
BS_TKAC_13  BSM_TKAC_02m  BSM_TKAC_23p  BSM_TKAC_23p       biotin  HFB2002012102A
BS_TKAC_13  BSM_TKAC_03m  BSM_TKAC_24p  BSM_TKAC_24p       biotin  HFB2002012103A
BS_TKAC_13  BSM_TKAC_04m  BSM_TKAC_25p  BSM_TKAC_25p       biotin  HFB2002012104A
BS_TKAC_13  BSM_TKAC_05m  BSM_TKAC_26p  BSM_TKAC_26p       biotin  HFB2002012105A
BS_TKAC_13  BSM_TKAC_06m  BSM_TKAC_27p  BSM_TKAC_27p       biotin  HFB2002012106A
BS_TKAC_13  BSM_TKAC_07m  BSM_TKAC_28p  BSM_TKAC_28p       biotin  HFB2002012107A
BS_TKAC_13  BSM_TKAC_08m  BSM_TKAC_29p  BSM_TKAC_29p       biotin  HFB2002012108A
BS_TKAC_13  BSM_TKAC_09m  BSM_TKAC_30p  BSM_TKAC_30p       biotin  HFB2002012109A
BS_TKAC_13  BSM_TKAC_10m  BSM_TKAC_31p  BSM_TKAC_31p       biotin  HFB2002012110A
BS_TKAC_14  BSM_TKAC_01n  BSM_TKAC_22p  BSM_TKAC_22p       biotin  HFB2002012101A
BS_TKAC_14  BSM_TKAC_02n  BSM_TKAC_23p  BSM_TKAC_23p       biotin  HFB2002012102A
BS_TKAC_14  BSM_TKAC_03n  BSM_TKAC_24p  BSM_TKAC_24p       biotin  HFB2002012103A
BS_TKAC_14  BSM_TKAC_04n  BSM_TKAC_25p  BSM_TKAC_25p       biotin  HFB2002012104A
BS_TKAC_14  BSM_TKAC_05n  BSM_TKAC_26p  BSM_TKAC_26p       biotin  HFB2002012105A
BS_TKAC_14  BSM_TKAC_06n  BSM_TKAC_27p  BSM_TKAC_27p       biotin  HFB2002012106A
BS_TKAC_14  BSM_TKAC_07n  BSM_TKAC_28p  BSM_TKAC_28p       biotin  HFB2002012107A
BS_TKAC_14  BSM_TKAC_08n  BSM_TKAC_29p  BSM_TKAC_29p       biotin  HFB2002012108A
BS_TKAC_14  BSM_TKAC_09n  BSM_TKAC_30p  BSM_TKAC_30p       biotin  HFB2002012109A
BS_TKAC_14  BSM_TKAC_10n  BSM_TKAC_31p  BSM_TKAC_31p       biotin  HFB2002012110A
BS_TKAC_15  BSM_TKAC_01o  BSM_TKAC_22p  BSM_TKAC_22p       biotin  HFB2002012101A
BS_TKAC_15  BSM_TKAC_02o  BSM_TKAC_23p  BSM_TKAC_23p       biotin  HFB2002012102A
BS_TKAC_15  BSM_TKAC_03o  BSM_TKAC_24p  BSM_TKAC_24p       biotin  HFB2002012103A
BS_TKAC_15  BSM_TKAC_04o  BSM_TKAC_25p  BSM_TKAC_25p       biotin  HFB2002012104A
BS_TKAC_15  BSM_TKAC_05o  BSM_TKAC_26p  BSM_TKAC_26p       biotin  HFB2002012105A
BS_TKAC_15  BSM_TKAC_06o  BSM_TKAC_27p  BSM_TKAC_27p       biotin  HFB2002012106A
BS_TKAC_15  BSM_TKAC_07o  BSM_TKAC_28p  BSM_TKAC_28p       biotin  HFB2002012107A
BS_TKAC_15  BSM_TKAC_08o  BSM_TKAC_29p  BSM_TKAC_29p       biotin  HFB2002012108A
BS_TKAC_15  BSM_TKAC_09o  BSM_TKAC_30p  BSM_TKAC_30p       biotin  HFB2002012109A
BS_TKAC_15  BSM_TKAC_10o  BSM_TKAC_31p  BSM_TKAC_31p       biotin  HFB2002012110A
BS_TKAC_16  BSM_TKAC_01q  BSM_TKAC_22p  BSM_TKAC_22p       biotin  HFB2002012101A
BS_TKAC_16  BSM_TKAC_02q  BSM_TKAC_23p  BSM_TKAC_23p       biotin  HFB2002012102A
BS_TKAC_16  BSM_TKAC_03q  BSM_TKAC_24p  BSM_TKAC_24p       biotin  HFB2002012103A
BS_TKAC_16  BSM_TKAC_04q  BSM_TKAC_25p  BSM_TKAC_25p       biotin  HFB2002012104A
BS_TKAC_16  BSM_TKAC_05q  BSM_TKAC_26p  BSM_TKAC_26p       biotin  HFB2002012105A
BS_TKAC_16  BSM_TKAC_06q  BSM_TKAC_27p  BSM_TKAC_27p       biotin  HFB2002012106A
BS_TKAC_16  BSM_TKAC_07q  BSM_TKAC_28p  BSM_TKAC_28p       biotin  HFB2002012107A
BS_TKAC_16  BSM_TKAC_08q  BSM_TKAC_29p  BSM_TKAC_29p       biotin  HFB2002012108A
BS_TKAC_16  BSM_TKAC_09q  BSM_TKAC_30p  BSM_TKAC_30p       biotin  HFB2002012109A
BS_TKAC_16  BSM_TKAC_10q  BSM_TKAC_31p  BSM_TKAC_31p       biotin  HFB2002012110A
Ontology usage
• Characteristics can be either free text or an ‘ontology entry’
• An ontology entry is identified by a ‘source’ column following it
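A minimal Python sketch of how a reader could tell free text from ontology entries. It assumes the ‘source’ column is headed `Term Source REF` (the slides only say ‘source’), and the term source name `NCBI Taxonomy`, the function name and the sample row are illustrative assumptions:

```python
# Assumed header of the 'source' column; the slides do not name it.
SOURCE_COLUMN = "Term Source REF"


def characteristics_with_sources(header, row):
    """Map each Characteristics[...] column to (value, term source or None)."""
    result = {}
    for i, name in enumerate(header):
        if not name.startswith("Characteristics["):
            continue
        # A characteristic is an ontology entry only if the very next
        # column is the source column; otherwise it is free text.
        has_source = i + 1 < len(header) and header[i + 1] == SOURCE_COLUMN
        source = row[i + 1] if has_source else None
        result[name] = (row[i], source or None)
    return result


HEADER = ["Sample ID", "Characteristics[Organism]", "Term Source REF",
          "Characteristics[OrganismPart]"]
ROW = ["liver 1", "Homo sapiens", "NCBI Taxonomy", "liver"]  # illustrative values
PARSED = characteristics_with_sources(HEADER, ROW)
```

Here `Characteristics[Organism]` is read as an ontology entry because a source column follows it, while `Characteristics[OrganismPart]` is read as free text.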
Elements to describe an experiment
1. Description of biological sample (research subject, aliquot – i.e., biomaterial) properties – a list of properties – free text or ontology entries – table (spreadsheet)
2. Description of the assay (e.g., microarray design) – a list of features on the array and their properties – sequence, annotation, location – table (spreadsheet)
3. Data from the assays - raw and processed – files or tables
4. Material and data processing protocols – free text
5. Experiment design – which sample went to which assay and produced which data file – a graph (normally a DAG) – also a spreadsheet!
MAGE-TAB
• 1., 5. Sample properties and experiment design - SDRF
• 2. Array design – ADF (Array design file)
• 4. Protocols – IDF (Investigation design file)
• 3. Data files and data matrices
Array Design (ADF)
• ADF has been around since MAGE-ML times as a way to describe an array
• A table (spreadsheet) – one row per array feature, listing feature coordinates, the sequence, biological annotation, etc.
ADF
Block Column  Block Row  Column  Row  Reporter ID  Reporter Sequence  Role          Control Type               Composite Element ID  CompositeElement DatabaseEntry [refseq]
1             1          1       1    R1           ATGGTTGGTTACGTGT   experimental                             PTEN                  NM_000314
1             1          1       2    R2           CCGCGTTGCCCCGCC    experimental                             PAX2                  NM_003989
1             1          1       3    R3           CGTAGCTGATCGATGA   experimental                             WWOX                  NM_016373
1             1          1       4    R4           GGTTGGCTGAGATCGT   experimental                             MAPK8                 NM_139047
1             1          2       1    R1           ATGGTTGGTTACGTGT   experimental                             PTEN                  NM_000314
1             1          2       2    R2           CCGCGTTGCCCCGCC    experimental                             PAX2                  NM_003989
1             1          2       3    R3           CGTAGCTGATCGATGA   experimental                             WWOX                  NM_016373
1             1          2       4    R4           GGTTGGCTGAGATCGT   experimental                             MAPK8                 NM_139047
4             6          20      20   462020       TCCCTTCCGTTGTCCT   control       control_spike_calibration
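Because the ADF has one row per feature, a reporter spotted more than once simply appears on several rows. A minimal Python sketch that groups feature coordinates by reporter, using the first rows of the example; the function name `features_by_reporter` is a hypothetical helper:

```python
from collections import defaultdict

# (Block Column, Block Row, Column, Row, Reporter ID), copied from the
# first rows of the example table.
FEATURES = [
    (1, 1, 1, 1, "R1"),
    (1, 1, 1, 2, "R2"),
    (1, 1, 1, 3, "R3"),
    (1, 1, 1, 4, "R4"),
    (1, 1, 2, 1, "R1"),
    (1, 1, 2, 2, "R2"),
]


def features_by_reporter(features):
    """Group feature coordinates by the reporter found at each feature."""
    grouped = defaultdict(list)
    for block_col, block_row, col, row, reporter in features:
        grouped[reporter].append((block_col, block_row, col, row))
    return dict(grouped)


BY_REPORTER = features_by_reporter(FEATURES)
```

In this excerpt R1 and R2 each occupy two features, while R3 and R4 occupy one.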
Investigation design file IDF
• Lists general information about the experiment and gives a (free text) description of all the protocols
Investigation Title       University of Heidelberg H sapiens TK6
Experimental Designs      genetic_modification_design
Experimental Factors      genetic_modification

Person Last Name          Maier
Person First Name         Patrick
Person Email              [email protected]
Person Phone              +496213833773
Person Address            Theodor-Kutzer-Ufer 1-3
Person Affiliation        Department of Radiation Oncology, University of Heidelberg
Person Roles              submitter investigator

Quality Control Types     biological_replicate
Replicate Types           biological_replicate
Date of Experiment        2005-02-28
Public Release Date       2006-01-03
PubMed ID                 12345678
Publication Author List   Patrick Maier; Katharina Fleckenstein; Li Li; Stephanie Laufs; Jens Zeller; Stefan Fruehauf; Carsten Herskind; Frederik Wenz
Publication Status        submitted
Experiment Description    Gene expression of TK6 cells transduced with an oncoretrovirus expressing MDR1 (TK6MDR1) was compared to untransduced TK6 cells and to TK6 cell transduced with an oncoretrovirus expressing the Neomycin resistance gene (TK6neo). Two biological replicates of each were generated and the expression profiles were determined using Affymetrix Human Genome U133 Plus2.0 GeneChip microarrays. Comparisons between the sample groups allow the identification of genes with expression dependent on the MDR1 overexpression.

Protocol Name             GROWTHPRTCL10653
Protocol Type             grow
Protocol Description      TK6 cells were grown in suspension cultures in RPMI 1640 medium supplemented with 10% horse serum (Invitrogen, Karlsruhe, Germany). The cells were routinely maintained at 37 C and 5% CO2.
Protocol Parameters       media

Protocol Name             EXTPRTCL10654
Protocol Type             nucleic_acid_extraction
Protocol Description      Approximately 10^6 cells were lysed in RLT buffer (Qiagen). Total RNA was extracted from the cell lysate using an RNeasy kit (Qiagen).
Protocol Parameters       Extracted Product Amplification

Protocol Name             TRANPRTCL10656
Protocol Type             bioassay_data_transformation
Protocol Description      Mixed Model Normalization with SAS Micro Array Solutions (version 1.3).

EDF Files                 e-mexp-428_tab.txt
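The IDF is a list of tags, each followed by its values, so it can be read line by line. A minimal Python sketch, assuming one tag per line with tab-separated values and assuming that the two `Person Roles` entries are separate cells; `parse_idf` is a hypothetical helper and the text is a shortened excerpt of the example:

```python
def parse_idf(text):
    """Parse IDF text into {tag: [values]}, keeping repeated tags in order."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines only separate sections
        tag, *values = line.split("\t")
        # A tag may occur several times (one block per protocol here),
        # so values are appended rather than overwritten.
        fields.setdefault(tag, []).extend(v for v in values if v)
    return fields


IDF_TEXT = "\n".join([
    "Investigation Title\tUniversity of Heidelberg H sapiens TK6",
    "Person Last Name\tMaier",
    "Person Roles\tsubmitter\tinvestigator",
    "",
    "Protocol Name\tGROWTHPRTCL10653",
    "Protocol Type\tgrow",
    "",
    "Protocol Name\tEXTPRTCL10654",
    "Protocol Type\tnucleic_acid_extraction",
])
IDF = parse_idf(IDF_TEXT)
```

Values that belong together, such as a protocol's name and type, end up at the same position in their lists, so they can be paired with `zip(IDF["Protocol Name"], IDF["Protocol Type"])`.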
Data files
• Raw data – native formats (CEL, GenePix)
• Normalised, summarised data – columns may be individual references
Sample ID  Characteristics[Organism]  Characteristics[OrganismPart]  Protocol REF  Hybridization ID  ArrayDesign REF  ArrayData URI  Protocol REF  DerivedArrayData URI
liver 1    Homo sapiens               liver                          P-XMPL-1      hyb 1             HG_U95A          1.CEL          P-XMPL-2      FGEM.txt
liver 2    Homo sapiens               liver                          P-XMPL-1      hyb 1             HG_U95A          1.CEL          P-XMPL-2      FGEM.txt
kidney 1   Homo sapiens               kidney                         P-XMPL-1      hyb 2             HG_U95A          2.CEL          P-XMPL-2      FGEM.txt
kidney 2   Homo sapiens               kidney                         P-XMPL-1      hyb 2             HG_U95A          2.CEL          P-XMPL-2      FGEM.txt
brain 1    Homo sapiens               brain                          P-XMPL-1      hyb 3             HG_U95A          3.CEL          P-XMPL-2      FGEM.txt
brain 2    Homo sapiens               brain                          P-XMPL-1      hyb 3             HG_U95A          3.CEL          P-XMPL-2      FGEM.txt
FGEM.txt:
[Figure: a data matrix with one row per gene (Gene 1, Gene 2, Gene 3, ..., Gene n) and one column per hybridisation: Hyb 1 (liver), Hyb 2 (kidney), Hyb 3 (brain)]
Experimental Factors
• Most important experimental variables (e.g., organs – liver, kidney, brain – in the examples above)
• Any column in the EDF can be marked as an experimental factor
• These can be propagated down the edges of the graph to columns in FGEM
• In FGEM they serve as concise annotation
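Propagating a factor down the graph can be sketched as follows: collect the factor values of all samples that feed a hybridisation and attach them to that hybridisation's column. A minimal Python sketch over the example rows, with `Characteristics[OrganismPart]` marked as the factor; both function names are hypothetical helpers:

```python
# Sample rows from the example, reduced to the columns needed here.
SAMPLES = [
    {"Sample ID": "liver 1", "Characteristics[OrganismPart]": "liver", "Hybridization ID": "hyb 1"},
    {"Sample ID": "liver 2", "Characteristics[OrganismPart]": "liver", "Hybridization ID": "hyb 1"},
    {"Sample ID": "kidney 1", "Characteristics[OrganismPart]": "kidney", "Hybridization ID": "hyb 2"},
    {"Sample ID": "kidney 2", "Characteristics[OrganismPart]": "kidney", "Hybridization ID": "hyb 2"},
    {"Sample ID": "brain 1", "Characteristics[OrganismPart]": "brain", "Hybridization ID": "hyb 3"},
    {"Sample ID": "brain 2", "Characteristics[OrganismPart]": "brain", "Hybridization ID": "hyb 3"},
]


def factor_by_hybridization(rows, factor_column, hyb_column="Hybridization ID"):
    """Collect the distinct factor values of the samples feeding each hybridisation."""
    values = {}
    for row in rows:
        values.setdefault(row[hyb_column], set()).add(row[factor_column])
    return {hyb: sorted(found) for hyb, found in values.items()}


def annotated_columns(factors):
    """Build data matrix column labels such as 'hyb 1 (liver)'."""
    return [f"{hyb} ({', '.join(vals)})" for hyb, vals in factors.items()]


FACTORS = factor_by_hybridization(SAMPLES, "Characteristics[OrganismPart]")
COLUMNS = annotated_columns(FACTORS)
```

If samples with different factor values were pooled into one hybridisation, the column label would list all of them.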
MAGE-TAB
• Investigation design file – IDF
• Array design file – ADF
• Experiment (sample) design file – SDRF
• Data files and data matrices (FGEM)
Standard design templates
• simple iterated design;
• iterated design with technical replicates;
• iterated design with pooling;
• iterated designs for dual channel experiments;
• dual channel iterated designs with dye swap;
• dual channel iterated designs with a reference sample;
• dual channel iterated design with a reference and dye swap;
• dual channel iterated design with a pooled reference;
• loop design;
• loop design with dye swap;
• time series experiments;
http://www.mged.org/Workgroups/MAGE/mage.html#mage-tab
MAGE-TAB
• Any experiment can be represented in MAGE-TAB in a structured way to MIAME granularity
• Large experiments with a regular design can be represented in a natural way
• It is possible to create MAGE-TAB files using generic spreadsheet software
• It is flexible – the granularity of the experiment description can vary
Some points for later discussion
• What should be the level of prescription on ID column names (source, sample, extract, ..., summary data)?
– For instance, it is very often confusing to me what is source and what is sample
• Do we need the concept of Experimental Factors at all?
– An alternative could be optional labelling of variable characteristics as ‘intentional’
Acknowledgements
• Tim Rayner, Helen Parkinson
• Cathy Ball, Don Maier
• Philippe Rocca-Sera (ADF)
• Michael Miller, Ugis Sarkans, Paul Spellman, Anna Farne
• Mohammad Shojatalob – MAGE-TAB export from ArrayExpress
• MAGE working group
Funding - NHGRI/NIBIB, MGED sponsors