Macromolecular Structure Database Structural Database Infrastructure Services for Europe

Macromolecular Structure DatabaseStructural Database Infrastructure Services for Europe

www.ebi.ac.uk/msd

EMBL-EBI

The MSD databases

The MSD actually consists of two separate databases: the archive database is highly normalized, with

thousands of relationships linking some 400 tables; the deposition database is the definitive archive for all structural data at MSD

the search database is a much simpler, denormalized database, with data items duplicated and aggregated into 40 much wider tables, making it more amenable to searching and retrieval of data : the MSDSD

EMBL-EBI

What is the MSDSD database A relational database primarily developed in Oracle

that stores the data derived from the PDB together with reference and other derived information

Simple to understand for the novice biologist and fast in performance for the database non-expert

Originates from the internal MSD archive database that ensures accuracy and data integrity

In MSDSD naming and other summary information is repeated from every level of the hierarchy to the next one in order to be closer to the familiar PDB data

EMBL-EBI

Main MSDSD features

The symmetry has been expanded and the information of the quaternary biological assemblies is directly available External Information like binding sites and secondary

structure has been derived on the assembly level The original PDB asymmetric unit is also available Includes and provides clear database relations

with the ligands “data mart” and other reference information

Includes information and cross-references to external databases (NCBI taxonomy, UniProt, SCOP etc)

EMBL-EBI

Data mining is a term that is applied very loosely within bioinformatics to describe any type of data analysis. Almost without exception the analysis of molecular biology is hypothesis based where the search for information has a target that is defined by the knowledge of the biological context of the data.

DATA Analysis

EMBL-EBI

“Analysis of data in a database using tools which look for trends or anomalies without knowledge of the meaning of the data.”“True data mining software does not just change the presentation, but discovers previously unknown relationships among the data.

(Webopedia and other technical dictionaries)”

Data Mining

was first “invented” by IBM

EMBL-EBI

Traditional analysis is via “verification-driven analysis”

Requires hypothesis of the desired information (target)Requires correct interpretation of proposed query

Discovery-driven data miningFinds data with common characteristicsResults are ideal solutions to discovery Finds results without previous hypothesis

EMBL-EBI

“catalytic triad” : text string matchingAtomic coordinates : coordinate superpositionMathematical graph : graph matchingHIS,ASP,SER : data hierarchy knowledge

So what is Hypothesis driven data analysis ?

Define a target = hypothesis Search for target There are/are-not “hits”

Verify/negate hypothesis Distribution is centred on target

EMBL-EBI

For example, it is possible to find the presence of catalytic triads within the PDB by selecting an example structure and then using a matching technique such as coordinate superposition or graph analysis to screen this against all the coordinate data within the PDB. This will identify the presence of similar residue configurations to the search target and result in a distribution of hits centered on the original search model.

HOWEVER we can only find similar objects distributed about this target.

EMBL-EBI

Discovery-driven data mining

Finds data with common characteristicsResults are ideal solutions to discovery Finds results without previous hypothesisTarget is mathematical – so has no scientific

dependency

EMBL-EBI

Mining techniques

Creation of predictive models : future data expectation

Link analysis : connections between data objects

Database segmentation : classification Deviation detection : finding outliers.

EMBL-EBI

Finds new things !But not what it means !

So what is this data mining ?

Given multiple sets of primary data)Characters, numbers, Function(numbers),….

Find anomaliesTo many : numerical occurrenceData variation : DerivativesSingularities

Correlations and clustersWithin primary datawith other data (dependent variables)

EMBL-EBI

Discovery driven data mining of the PDB

Analysis of 3-dimensional coordinatesDefined common patterns of atomic interactions

locallyDB segmentation - active sites & common packing featuresLink analysis - Similarity between different functional

groupDefined globally

DB segmentation - common patterns of super-secondary str’

Link analysis - common folds in diverse protein familiesOutlier detection - unique folds

Nucleic Acid sequencesDefine information content using information

technology

EMBL-EBI

Issues

Systematic “error” propagates as solution 300 lysozyme structures return as a strong solution

Results cannot be found below the noise level Need to characterise the noise level Need to improve signal/noise ratio (S/N) to see information

Target is not biologically defined It does not give you the biological answer Results should reproduce known biology Can give you new results not previously observed

EMBL-EBI

Data selection

Cannot leave in 300 lysozyme structures ! Select by sequence similarity at 70% exact

alignmentDifferent “phase space” to select data

Remove structures with resolution < 2.5A Remove NMR (different statistics) Remove pre-1982 etc. Geometrical analysis criteria to check for

outliers

EMBL-EBI

Off the shelf products

Main problem – they “all” do column correlation – but this requires row analysis Ie you can find whether x coordinates are more

correlated to y coordinates than z coordinates Slow

I tried the above on 1e3 of data and it took hours; not much chance on 1.6e9 data then.

Money often

EMBL-EBI

Local atomic interactions

Data 3D coordinatesAtom typesResidue types

Convert coordinates to distances - easier to compare, no need to superpose coordinates.

Create 3D Hash table of triplets of distances between “points”

EMBL-EBI

Local atomic interactions

Merge triplets Any pair of N-fold

interactions are a (N+1) interaction if they have (N-1) equivalence.

Just keep going until no more (N+1) interaction are found.

Time = 8 seconds (Digital alpha ES40)

EMBL-EBI

Local atom interactions

Define key atoms/groups-of-atoms as run time parameters. Solves problem of residue

symmetry Approximation for speed This is a hypothesis

External definition of residue equivalence (PHE TYR) for released data. Improves Signal/Noise ratio. This is a hypothesis

EMBL-EBI

Is this data mining ?

Basic 3D correlation of distances is Program can be run without any prior definitions.

Addition of key atoms and residue equivalence introduces biology and chemistry introduces hypothesis regarding what is important. Without adding this information you get very little out.

Improvement to the method should spot this without being told !

EMBL-EBI Re-implemented

This idea has been re-implementedCore analysis on distance onlyStatistical analysis of residue equivalence is

carried out – will find residue equivalenceBit slower now – 2 minutes

To use MSD assembly dataMust be able to normalise by chain similarity

to remove common features due to structure.Can use MSD similarity tables for this.

EMBL-EBI

Refining the answers

This analysis produces approximate geometrical results

For each “solution”, a second full All vs. All LSQ overlay is performed handles symmetry in D,E,R,P,Thandles different residue overlays

Clusters results using average linkage Writes average + superposed coordinates

+ ligands.

EMBL-EBI

Catalytic quartet

EMBL-EBI

Electrostatic interaction

Ligands are found close by rather than associated with the residues

EMBL-EBI

N-linked glycosolation binding site +?

Spot the non-sugar

This glycosolation site is the same as active site found in “1a53” – indol-3-glycerolphosphate synthase

EMBL-EBI

Summary

Creates 1000’s of results Returns many metal and catalytic sites 50% have at least 2 of 3 residues as

sequence neighbours 30% have associated ligands

http://www.ebi.ac.uk/msd-srv/MSDtemplate/

See T.J.Oldfield (2003) PROTEINS: Structure, Function, and Genetics 49, 510-528. T.J.Oldfield (2002) Acta Cryst. D57, 1421-1427

EMBL-EBI

Data mining – not idiot proof

Date of birth and age will give 100 % correlation

Authors for structure submission will be correlated to authors on primary citation.

“Lysozyme” is the most common fold pattern

36 spelling’s of E.Coli will mask results. Requires representative sets

Statistically valid ones too ! Signal/Noise ratio is a problem : hit the

noise and the calculation grows rapidly

EMBL-EBI Other methods

Representative sets and clusteringAnother talk

Data mining fold

Information technological analysis of genomes

Documents

Macromolecular Structure Database Structural Database Infrastructure Services for Europe