MAGE-TAB introduction: Alvis Brazma (EBI)

Preview:

Citation preview

MAGE-TAB - a simple tab delimited format for

describing microarray (and potentially other) experiments in a MIAME

compliant way

MAGE-TAB workshopNCI, January 24, 2008

What is needed to describe a microarray (and potentially any)

experiment adequately (MIAME)?

1. Description of biological sample (research subject, aliquot – i.e., biomaterial) properties

2. Description of the assay (e.g., microarray design)

3. Data from the assays - raw and processed4. Material and data processing protocols5. Experiment design – which sample went to

which assay and produced which data file

How to do this?

1. Description of biological sample (research subject, aliquot – i.e., biomaterial) properties – a list of properties – free text or ontology entries – table (spreadsheet)

2. Description of the assay (e.g., microarray design) – a list of features on the array and their properties – sequence, annotation, location – table (spreadsheet)

3. Data from the assays - raw and processed – files or tables

4. Material and data processing protocols – free text5. Experiment design – which sample went to which assay

and produced which data file – a graph (normally a DAG)

5. Experiment design graph

Sample 1

Sample 2

Hybridisation

Cy3

Cy5

Data

5. Experiment design graph

Sample 1(Homo S.,

Brain

Sample 2(Homo S.,

Kidney)

Hybridisation(SMD array 1)

Data

Protocol (Cy5)

Protocol (Cy3)

Protocol

liver 1(Homo sap.) Hybridisation 1

(HG_U95A)Data1.cel

Material processing protocols

(P-XMPL-1)

Normalisation protocol

(P-XMPL-2)

liver 2(Homo sap.)

Hybridisation 2(HG_U95A) Data2.cel

Kidney 1(Homo sap.)

Hybridisation 3(HGU_95A)

Data3.cel

Brain 1(Homo sap.)

FGDM.txtKidney 2

(Homo sap.)

Brain 2(Homo sap.)

liver 1(Homo sap.) Hybridisation 1

(HG_U95A)Data1.cel

Material processing protocols

(P-XMPL-1)

Normalisation protocol

(P-XMPL-2)

liver 2(Homo sap.)

Hybridisation 2(HG_U95A) Data2.cel

Kidney 1(Homo sap.)

Hybridisation 3(HGU_95A)

Data3.cel

Brain 1(Homo sap.)

FGDM.txtKidney 2

(Homo sap.)

Brain 2(Homo sap.)

Sample IDCharacteristics[Organism]

Characteristics[OrganismPart]

Protocol REFHybridization ID

ArrayDesign REF

ArrayData URI

Protocol REFDerivedArrayData URI

liver 1 Homo sapiens liver P-XMPL-1 hyb 1 HG_U95A 1.CEL P-XMPL-2 FGEM.txt

liver 2 Homo sapiens liver P-XMPL-1 hyb 1 HG_U95A 1.CEL P-XMPL-2 FGEM.txt

kidney 1 Homo sapiens kidney P-XMPL-1 hyb 2 HG_U95A 2.CEL P-XMPL-2 FGEM.txt

kidney 2 Homo sapiens kidney P-XMPL-1 hyb 2 HG_U95A 2.CEL P-XMPL-2 FGEM.txt

brain 1 Homo sapiens brain P-XMPL-1 hyb 3 HG_U95A 3.CEL P-XMPL-2 FGEM.txt

brain 2 Homo sapiens brain P-XMPL-1 hyb 3 HG_U95A 3.CEL P-XMPL-2 FGEM.txt

Important observation

• In high throughput experiments the experiment design graphs are– Regular (similar subgraphs repeated many times)– For most nodes there is only small number of incoming

and outgoing edge – They can be presented in ‘layers’ in a natural way (in

fact any DAG can be represented in layers)

• This makes their representation as a spreadsheet simple and natural

Layers in a more complex DAG

A

B C

W

FE

G H I

0 1 2 3 4 5

In reality things are often even simpler

... or even simpler than that

v11 v12 v13

(c111, ..., c11n) (c121, ..., c12m) (c113, c ...,)

source (individual) age sex ... weight

sample extraction prot. sample

sample type amount ...

date taken

RNA extr. and labl. Prot assay

ma batch

Data processing prot. Data

ID1 s1 a1 d1ID2 s2 a2 d2

Idk sk ak dk

l1 l2 l3

vk1 vk2 vk3

(ck11,..., c11n) (c121, ..., c12m) (c113, ...,)

l1 l2 l3

node layer 1

Characteristics 11

characteristics 21 ...

characteristics n1

edge label

node layer 2

characteristics 21

characteristics 22 ...

characteristics m2

edge lable

node layer 3

characteristics 31

v11 c111 c112 c11n l11 v12 c121 c122 c12m l12 v13v21 v22 v23

vk1 vk2 vk3

... ... ...

Real world examples

• Simplified E-TABM-234.sdrf

Source ID Sample ID Extract ID LabeledExtract ID Label Hybridization IDBS_TKAC_13 BSM_TKAC_01m BSM_TKAC_22p BSM_TKAC_22p biotin HFB2002012101ABS_TKAC_13 BSM_TKAC_02m BSM_TKAC_23p BSM_TKAC_23p biotin HFB2002012102ABS_TKAC_13 BSM_TKAC_03m BSM_TKAC_24p BSM_TKAC_24p biotin HFB2002012103ABS_TKAC_13 BSM_TKAC_04m BSM_TKAC_25p BSM_TKAC_25p biotin HFB2002012104ABS_TKAC_13 BSM_TKAC_05m BSM_TKAC_26p BSM_TKAC_26p biotin HFB2002012105ABS_TKAC_13 BSM_TKAC_06m BSM_TKAC_27p BSM_TKAC_27p biotin HFB2002012106ABS_TKAC_13 BSM_TKAC_07m BSM_TKAC_28p BSM_TKAC_28p biotin HFB2002012107ABS_TKAC_13 BSM_TKAC_08m BSM_TKAC_29p BSM_TKAC_29p biotin HFB2002012108ABS_TKAC_13 BSM_TKAC_09m BSM_TKAC_30p BSM_TKAC_30p biotin HFB2002012109ABS_TKAC_13 BSM_TKAC_10m BSM_TKAC_31p BSM_TKAC_31p biotin HFB2002012110ABS_TKAC_14 BSM_TKAC_01n BSM_TKAC_22p BSM_TKAC_22p biotin HFB2002012101ABS_TKAC_14 BSM_TKAC_02n BSM_TKAC_23p BSM_TKAC_23p biotin HFB2002012102ABS_TKAC_14 BSM_TKAC_03n BSM_TKAC_24p BSM_TKAC_24p biotin HFB2002012103ABS_TKAC_14 BSM_TKAC_04n BSM_TKAC_25p BSM_TKAC_25p biotin HFB2002012104ABS_TKAC_14 BSM_TKAC_05n BSM_TKAC_26p BSM_TKAC_26p biotin HFB2002012105ABS_TKAC_14 BSM_TKAC_06n BSM_TKAC_27p BSM_TKAC_27p biotin HFB2002012106ABS_TKAC_14 BSM_TKAC_07n BSM_TKAC_28p BSM_TKAC_28p biotin HFB2002012107ABS_TKAC_14 BSM_TKAC_08n BSM_TKAC_29p BSM_TKAC_29p biotin HFB2002012108ABS_TKAC_14 BSM_TKAC_09n BSM_TKAC_30p BSM_TKAC_30p biotin HFB2002012109ABS_TKAC_14 BSM_TKAC_10n BSM_TKAC_31p BSM_TKAC_31p biotin HFB2002012110ABS_TKAC_15 BSM_TKAC_01o BSM_TKAC_22p BSM_TKAC_22p biotin HFB2002012101ABS_TKAC_15 BSM_TKAC_02o BSM_TKAC_23p BSM_TKAC_23p biotin HFB2002012102ABS_TKAC_15 BSM_TKAC_03o BSM_TKAC_24p BSM_TKAC_24p biotin HFB2002012103ABS_TKAC_15 BSM_TKAC_04o BSM_TKAC_25p BSM_TKAC_25p biotin HFB2002012104ABS_TKAC_15 BSM_TKAC_05o BSM_TKAC_26p BSM_TKAC_26p biotin HFB2002012105ABS_TKAC_15 BSM_TKAC_06o BSM_TKAC_27p BSM_TKAC_27p biotin HFB2002012106ABS_TKAC_15 BSM_TKAC_07o BSM_TKAC_28p BSM_TKAC_28p biotin HFB2002012107ABS_TKAC_15 BSM_TKAC_08o BSM_TKAC_29p BSM_TKAC_29p biotin HFB2002012108ABS_TKAC_15 BSM_TKAC_09o BSM_TKAC_30p BSM_TKAC_30p biotin HFB2002012109ABS_TKAC_15 BSM_TKAC_10o BSM_TKAC_31p BSM_TKAC_31p biotin HFB2002012110ABS_TKAC_16 BSM_TKAC_01q BSM_TKAC_22p BSM_TKAC_22p biotin HFB2002012101ABS_TKAC_16 BSM_TKAC_02q BSM_TKAC_23p BSM_TKAC_23p biotin HFB2002012102ABS_TKAC_16 BSM_TKAC_03q BSM_TKAC_24p BSM_TKAC_24p biotin HFB2002012103ABS_TKAC_16 BSM_TKAC_04q BSM_TKAC_25p BSM_TKAC_25p biotin HFB2002012104ABS_TKAC_16 BSM_TKAC_05q BSM_TKAC_26p BSM_TKAC_26p biotin HFB2002012105ABS_TKAC_16 BSM_TKAC_06q BSM_TKAC_27p BSM_TKAC_27p biotin HFB2002012106ABS_TKAC_16 BSM_TKAC_07q BSM_TKAC_28p BSM_TKAC_28p biotin HFB2002012107ABS_TKAC_16 BSM_TKAC_08q BSM_TKAC_29p BSM_TKAC_29p biotin HFB2002012108ABS_TKAC_16 BSM_TKAC_09q BSM_TKAC_30p BSM_TKAC_30p biotin HFB2002012109ABS_TKAC_16 BSM_TKAC_10q BSM_TKAC_31p BSM_TKAC_31p biotin HFB2002012110A

Ontology usage

• Characteristics can be either free text or an ‘ontology entry’

• Ontology entry is identified by a ‘source’ column following it

Elements to describe an experiment

1. Description of biological sample (research subject, aliquot – i.e., biomaterial) properties – a list of properties – free text or ontology entries – table (spreadsheet)

2. Description of the assay (e.g., microarray design) – a list of features on the array and their properties – sequence, annotation, location – table (spreadsheet)

3. Data from the assays - raw and processed – files or tables

4. Material and data processing protocols – free text5. Experiment design – which sample went to which assay

and produced which data file – a graph (normally a DAG) – also a spreadsheet!

MAGE-TAB

• 1., 5. Sample properties and experiment design - SDRF

• 2. Array design 2 - ADF (Array design file)• 4. Protocols – IDF (Investigation design

file)• 3. Data files and data matrices

Array Design (ADF)

• ADF has been there around since MAGE-ML times as a way to describe an array

• A (table) spreadsheet - one row per array feature listing feature coordinates, the sequence, biological annotation, etc

ADF

Block

Column

Block

Row

Column

Row

Reporter ID

Reporter

Sequence

Role Control

Type

Composite

Element ID

CompositeElement

DatabaseEntry [refseq]

1 1 1 1 R1 ATGGTTGGTTACGTGT experimental PTEN NM_000314

1 1 1 2 R2 CCGCGTTGCCCCGCC experimental PAX2 NM_003989 1 1 1 3 R3 CGTAGCTGATCGATGA experimental WWOX NM_016373

1 1 1 4 R4 GGTTGGCTGAGATCGT experimental MAPK8 NM_139047 1 1 2 1 R1 ATGGTTGGTTACGTGT experimental PTEN NM_000314

1 1 2 2 R2 CCGCGTTGCCCCGCC experimental PAX2 NM_003989

1 1 2 3 R3 CGTAGCTGATCGATGA experimental WWOX NM_016373 1 1 2 4 R4 GGTTGGCTGAGATCGT experimental MAPK8 NM_139047

4 6 20 20 462020 TCCCTTCCGTTGTCCT control control_spike_calibration

Investigation design file IDF

• Lists general information about the experiment and gives (a free text) description of all the protocols

Investigation Title University of Heidelberg H sapiens TK6Experimental Designs genetic_modification_designExperimental Factors genetic_modification

Person Last Name MaierPerson First Name PatrickPerson Email patrick.maier@radonk.ma.uni-heidelberg.dePerson Phone +496213833773Person Address Theodor-Kutzer-Ufer 1-3Person Affiliation Department of Radiation Oncology, University of HeidelbergPerson Roles submitter investigator

Quality Control Types biological_replicateReplicate Types biological_replicateDate of Experiment 2005-02-28Public Release Date 2006-01-03PubMed ID 12345678Publication Author List

Publication Status submittedExperiment Description

Protocol Name GROWTHPRTCL10653Protocol Type growProtocol Description

Protocol Parameters media

Protocol Name EXTPRTCL10654Protocol Type nucleic_acid_extractionProtocol Description

Protocol Parameters Extracted Product Amplification

Protocol Name TRANPRTCL10656Protocol Type bioassay_data_transformationProtocol Description

EDF Files e-mexp-428_tab.txt

Patrick Maier; Katharina Fleckenstein; Li Li; Stephanie Laufs; Jens Zeller; Stefan Fruehauf; Carsten Herskind; Frederik Wenz

Gene expression of TK6 cells transduced with an oncoretrovirus expressing MDR1 (TK6MDR1) was compared to untransduced TK6 cells and to TK6 cell transduced with an oncoretrovirus expressing the Neomycin resistance gene (TK6neo). Two biological replicates of each were generated and the expression profiles were determined using Affymetrix Human Genome U133 Plus2.0 GeneChip microarrays. Comparisons between the sample groups allow the identification of genes with expression dependent on the MDR1 overexpression.

TK6 cells were grown in suspension cultures in RPMI 1640 medium supplemented with 10% horse serum (Invitrogen, Karlsruhe, Germany). The cells were routinely maintained at 37 C and 5% CO2.

Approximately 10^6 cells were lysed in RLT buffer (Qiagen).Total RNA was extracted from the cell lysate using an RNeasy kit (Qiagen).

Mixed Model Normalization with SAS Micro Array Solutions (version 1.3).

Data files

• Raw data – native formats (cel, genpix)• Normalised, summarised data – columns

may be individual references

liver 1(Homo sap.) Hybridisation 1

(HG_U95A)Data1.cel

Material processing protocols

(P-XMPL-1)

Normalisation protocol

(P-XMPL-2)

liver 2(Homo sap.)

Hybridisation 2(HG_U95A) Data2.cel

Kidney 1(Homo sap.)

Hybridisation 3(HGU_95A)

Data3.cel

Brain 1(Homo sap.)

FGDM.txtKidney 2

(Homo sap.)

Brain 2(Homo sap.)

Sample IDCharacteristics[Organism]

Characteristics[OrganismPart]

Protocol REFHybridization ID

ArrayDesign REF

ArrayData URI

Protocol REFDerivedArrayData URI

liver 1 Homo sapiens liver P-XMPL-1 hyb 1 HG_U95A 1.CEL P-XMPL-2 FGEM.txt

liver 2 Homo sapiens liver P-XMPL-1 hyb 1 HG_U95A 1.CEL P-XMPL-2 FGEM.txt

kidney 1 Homo sapiens kidney P-XMPL-1 hyb 2 HG_U95A 2.CEL P-XMPL-2 FGEM.txt

kidney 2 Homo sapiens kidney P-XMPL-1 hyb 2 HG_U95A 2.CEL P-XMPL-2 FGEM.txt

brain 1 Homo sapiens brain P-XMPL-1 hyb 3 HG_U95A 3.CEL P-XMPL-2 FGEM.txt

brain 2 Homo sapiens brain P-XMPL-1 hyb 3 HG_U95A 3.CEL P-XMPL-2 FGEM.txt

Sample IDCharacteristics[Organism]

Characteristics[OrganismPart]

Protocol REFHybridization ID

ArrayDesign REF

ArrayData URI

Protocol REFDerivedArrayData URI

liver 1 Homo sapiens liver P-XMPL-1 hyb 1 HG_U95A 1.CEL P-XMPL-2 FGEM.txt

liver 2 Homo sapiens liver P-XMPL-1 hyb 1 HG_U95A 1.CEL P-XMPL-2 FGEM.txt

kidney 1 Homo sapiens kidney P-XMPL-1 hyb 2 HG_U95A 2.CEL P-XMPL-2 FGEM.txt

kidney 2 Homo sapiens kidney P-XMPL-1 hyb 2 HG_U95A 2.CEL P-XMPL-2 FGEM.txt

brain 1 Homo sapiens brain P-XMPL-1 hyb 3 HG_U95A 3.CEL P-XMPL-2 FGEM.txt

brain 2 Homo sapiens brain P-XMPL-1 hyb 3 HG_U95A 3.CEL P-XMPL-2 FGEM.txt

FGEM.txt:

Gene n

….

Gene 3

Gene 2

Gene 1

Hyb 3 (brain)

Hyb 2 (kidney)

Hyb 1 (liver)

Genes

Experimental Factors

• Most important experimental variables (e.g., organs – liver, kidney, brain – in the examples above)

• Any column in the EDF can be marked as an experimental factor

• These can be propagated down the edges of the graph to columns in FGEM

• In FGEM they serve as concise annotation

MAGE-TAB

• Investigation design file – IDF• Array design file ADF• Experiment (sample) design file SDRF• Data files and data matrices (FGEM)

Standard design templates

• simple iterated design;• iterated design with technical replicates;• iterated design with pooling;• iterated designs for dual channel experiments;• dual channel iterated designs with dye swap;• dual channel iterated designs with a reference sample;• dual channel iterated design with a reference and dye

swap;• dual channel iterated design with a pooled reference;• loop design;• loop design with die swap;• time series experiments;http://www.mged.org/Workgroups/MAGE/mage.html#mage-tab

MAGE-TAB

• Any experiment can be represented in MAGE-TAB in a structured way to MIAME granularity

• Large experiments with a regular design can be represented in a natural way

• It is possible to create MAGE-TAB files using generic spread-sheet software

• It is flexible – the granularity of the experiment description can vary

Some points for later discussion

• What should be the level of prescription on ID column names (source, sample, extract, ..., summary data)?– For instance, it is very often confusing to me

what is source and what is sample

• Do we need the concept of Experimental Factors at all?– An alternative could be optional labelling of

variable characteristics as ‘intentional’

Acknowledgements

• Tim Rayner, Helen Parkinson• Cathy Ball, Don Maier• Philippe Rocca-Sera (ADF)• Michael Miller, Ugis Sarkans, Paul Spellman,

Anna Farne• Mohammad Shojatalob – MAGE-TAB export

from ArrayExpress• MAGE working groupFunding - NHGRI/NIBIB, MGED sponsors