DAME: Searching Large Data Sets Within a Grid-Enabled ......DAME: Searching Large Data Sets Within a Grid-Enabled Engineering Application JIM AUSTIN, ROB DAVIS, MARTYN FLETCHER, TOM

DAME: Searching Large Data Sets Within aGrid-Enabled Engineering Application

JIM AUSTIN, ROB DAVIS, MARTYN FLETCHER, TOM JACKSON, MARK JESSOP,BOJIAN LIANG, AND ANDY PASLEY

Invited Paper

The use of search engines within the Internet is now ubiquitous.This paper examines how Grid technology may affect the implemen-tation of search engines by focusing on the Signal Data Explorerapplication developed within the Distributed Aircraft MaintenanceEnvironment (DAME) project. This application utilizes advancedneural-network-based methods (Advanced Uncertain ReasoningArchitecture (AURA) technology) to search for matching patternsin time-series vibration data originating from Rolls-Royce aero-engines (jet engines). The large volume of data associated with theproblem required the development of a distributed search engine,where data is held at a number of geographically disparate loca-tions. This paper gives a brief overview of the DAME project, thepattern marching problem, and the architecture. It also describesthe Signal Data Explorer application and provides an overviewof the underlying search engine technology and its use in theaeroengine health-monitoring domain.

Keywords—Advanced Uncertain Reasoning Architecture(AURA), distributed search, Grid industrial application, time-series data.

I. INTRODUCTION

The Distributed Aircraft Maintenance Environment(DAME) project was undertaken as a Grid pilot projectunder the U.K.’s Engineering and Physical Science Re-search Council (EPSRC) funding, set up to demonstratethe benefits of Grid computing in engineering applications.DAME is a £3 million, 45-person-year project, which com-pletes during 2005. It has three commercial collaborators:Rolls-Royce Group plc, Derby, U.K., which provides theengine data, information about the problem domain, andthe definition of the virtual organization; Data Systems and

Manuscript received March 1, 2004; revised June 1, 2004. This work wassupported by the U.K. Engineering and Physical Sciences Research Council(EPSRC) under Grant GR/R67668/01.

The authors are with the Advanced Computer Architectures Group, De-partment of Computer Science, University of York, York YO10 5DD, U.K.(e-mail: [email protected]).

Digital Object Identifier 10.1109/JPROC.2004.842746

Solutions LLC, Bristol, U.K., which supplies the data man-agement frame work; and Cybula Ltd., York, U.K., which iscommercializing the underlying Advanced Uncertain Rea-soning Architecture (AURA) search technology on behalfof the University of York, York, U.K.

The DAME project has investigated the problem ofbuilding a Grid-based diagnosis and prognosis system foraeroengine (i.e., jet engine) data. The details of the com-plete project can be found in [1]. This paper focuses on theproblem of managing and searching vibration data fromaeroengines. The DAME project has developed a completepattern storage and search system, based on existing AURAsearch technology. To facilitate the use of this distributedsystem, an application-specific portal-based data browserand search interface has been developed called the SignalData Explorer (SDE). The problems solved in developinga demonstration of this technology have many applicationsoutside of the specific problem domain.

The remainder of this paper is organized as follows. Sec-tion II provides an overview of the motivation and objectivesof the DAME project, outlines the main characteristics ofthe problem domain, and sets an overall context by brieflyoutlining the search task and the requirement to develop adistributed search engine. Section III outlines the diagnosispattern-matching problem addressed by DAME and searchstrategy. The use of the Grid in the DAME architecture is de-scribed in Section IV. An overview of the AURA technologyused to implement the search engine is provided in Section V.Finally, Section VI describes the SDE application.

II. DAME PROJECT OBJECTIVES

The central challenge of the DAME project is the designand implementation of a fault diagnosis and prognosissystem based on the Grid computing paradigm. The partic-ular context for the DAME project is a health monitoringapplication for Rolls-Royce aeroengines. The DAME project

0018-9219/$20.00 © 2005 IEEE

496 PROCEEDINGS OF THE IEEE, VOL. 93, NO. 3, MARCH 2005

Fig. 1. Engine vibration data (Zmod) used in the DAME system. Copyright © 1996–1999 OxfordUniversity, Oxford, U.K., used with permission.

demonstrates how predictive and prognostic health mon-itoring can be achieved through the timely managementof data across a Grid computing infrastructure. It supportsthe concept of virtual organizations, enabling a variety ofstakeholders to interact with the system as part of the enginehealth monitoring activity.

Fault diagnosis techniques are deployed across many di-verse IT domains, for example, medicine, engineering, trans-port, and aerospace. However, regardless of the applicationdomain, diagnosis and prognosis systems share a number ofoperating and design characteristics.

• They are data centric. Monitoring and analysis ofsensor data and domain specific knowledge is criticalto the diagnostic process.

• There are complex interactions among multiple agentsor stakeholders at distributed locations.

• There is a need for provenance; supporting or quali-fying evidence for the diagnosis or prognosis offered.

• They can be business critical, typically with stringentdependability requirements.

The emerging Grid computing paradigm [2] and Grid ser-vices model offers a practical framework in which to buildand manage systems that can meet these requirements.

A. Demonstrator Context

The context for the DAME project is a Rolls-Royceaeroengine diagnosis and prognosis problem. Modern

aeroengines operate in highly demanding operational en-vironments and do so with extremely high reliability. Toachieve this, the engines combine advanced mechanicalengineering systems with tightly coupled electronic controlsystems. As one would expect, such critical systems arefitted with extensive sensing and monitoring capabilities forperformance analysis. The basis of monitoring is to detectthe earliest possible signs of deviation from normal oper-ating behavior. Rolls-Royce has collaborated with OxfordUniversity, Oxford, U.K., and Oxford BioSignals to de-velop an advanced engine monitoring system called QUICKTechnology [3]. QUICK performs condition analysis ondata derived from continuous monitoring of broadbandengine vibration and performance measurements. QUICKTechnology provides state-of-the-art on-wing monitoringcapability, but also provides a new business challenge: theground-based management and analysis of the high volumesensor data that is produced.

Developing a Grid-enabled diagnostic system that fa-cilitates the processing of data in a ground-based systempresents the DAME project with three principal challenges.

1) The type of data captured by QUICK involves realvalued variables monitored over time. An example ofthis is shown in Fig. 1. This plot is typical of the datastored and utilized to support the diagnostic process.Each flight can produce up to 1 GB of data per engine,which, when scaled to the fleet level, means terabytesof data per year. The storage of this data requires vastdata repositories that may be distributed across many

AUSTIN et al.: DAME: SEARCHING LARGE DATA SETS WITHIN A GRID-ENABLED ENGINEERING APPLICATION 497

geographic and operational boundaries. However, theengine data must also be accessible in a timely and de-pendable way for health monitoring services.

2) Advanced pattern matching and data mining methodsare required to search for matches to novel featuresdetected in the vibration data. These methods must beable to operate on the large volumes of data and give aresponse time that meets operational demands.

3) The diagnostic processes require collaboration amonga number of stakeholders within the airline, support,and manufacturer’s organizations. These individualsneed to deploy a range of different engineering andcomputational tools to analyze the problem. Hence,any Grid-based solution must support a virtual organi-zation encompassing the services, individuals, and sys-tems involved.

The DAME project has built a technology demonstratorto show how such challenges can be met. This demonstratorillustrates how the Grid could be used to maintain and managefleetwide repositories of aeroengine data with the goal ofproviding an enhanced ability to anticipate maintenancerequirements and monitor engine health.

The demonstrator addresses several issues, including:

• the ability of the Grid to support the use of complex dis-tributed processing systems by Virtual Organizations;

• robustness of the Grid, including issues of security andavailability;

• performance of the Grid for large-volume data manage-ment and search.

The proof-of-concept demonstrator is based on the idea ofa diagnostic workbench environment, hosted within a secureGrid portal. The workbench provides seamless accesses to di-agnostic and data management services distributed across theGrid. For demonstration purposes the services are deployedacross the White Rose Grid at sites in York, Leeds, Sheffield,and Oxford, U.K.

The DAME demonstrator utilizes core-processingmethods developed from existing AURA search technology.The AURA methods provide a scalable and feasible solutionto the problem of searching large volumes of aeroenginedata within an operational timeframe.

The characteristics of the search problem and strategy aredescribed in the next section, followed by sections on theDAME architecture and overviews of AURA and the SDE,respectively.

III. THE DAME PATTERN MATCHING PROBLEM AND

SEARCH STRATEGY

This section describes the search problem that underpinsthe diagnosis and prognosis of aeroengine faults in the con-text of the DAME project.

Using the on-wing QUICK system, it is possible to mon-itor each engine and detect unusual behavior in the vibrationdata. The major gain offered by the DAME system is in en-abling engineers to compare unusual vibration behavior de-tected by QUICK with data collected from all engines on allprevious flights: the historical fleet archives. By identifying

the best matches to novel data and then looking at the sub-sequent behavior of the engines exhibiting matching vibra-tion patterns, it is possible to reason about probable causes,thereby leading to a potential diagnosis. Alternatively, in re-gard to maintenance and prognosis, early detection of smallchanges in the engine’s characteristics can lead to optimalscheduling of maintenance requirements.

The vibration and performance data searched are forms oftime-series data. A time-series is a sequence containing thevalues of a variable over time. Typically the variable may bea sensor reading, for example temperature, pressure, or fuelflow, taken at constant intervals of time. Alternatively, thetime-series may be a set of derived values over time.

A. Search Requirements

There are a number of requirements for an effective searchimplementation within the context of the DAME system. Inparticular, the search must have the following characteristics.

1) It must be able to support queries of variable length, asthe time-series patterns characterizing a novelty can beanything from 1–2 s to a few minutes in length.

2) Accuracy. The search must implement a suitablemetric that reflects a domain expert’s understanding of“similarity.” There must be few (if any) false negatives(definition to follow in this section). False positivesare acceptable provided that their number is smallrelative to the size of the dataset searched, as they canbe quickly removed by postprocessing.

3) Efficiency. Searches must be completed within at mosta few minutes so that the results can be analyzed andacted upon within the normal aircraft turnaround time:typically of the order of 30 min.

4) Scalability. It must be possible, over time, to introducefurther engine records for searching against while con-tinuing to meet the above requirements.

B. Query by Content

Retrieving similar time-series subsequences is referred toas the query by content or query by example problem. Thisis a significant area of research in its own right; for example,see [4]–[6].

The problem of finding similar subsequences can be statedas that of finding

where is the set of subsequences within a distance (or tol-erance threshold) of the query . Alternatively, we may beinterested in the nearest neighbors of .

If a method returns a set of subsequences that is a subsetof , then the wrongly dismissed subsequences – are re-ferred to as false negatives. A method is said to be admissibleor exact if it can be proven that it has no false negatives. Al-ternatively, if is a superset of , then the subsequences

– are referred to as false positives. Provided that there arenot too many false positives, these can easily be dealt withby calculating the precise distance measure for all of , thusremoving the false positives.


The query by content problem is within the scope of theGrid Information Retrieval (GridIR) Working Group [7] setup by the Global Grid Forum (GGF). This group was formedonly recently, but aims to provide standards for generic infor-mation retrieval solutions. It is expected that services imple-mented as part of the DAME project will follow this standardand, hence, achieve applicability to a wider range of problemdomains.

C. Sequential Scan as a Benchmark

The simplest way of determining the matching subse-quences is to perform a sequential scan. For a time-seriesof elements, checking for a matching subsequence ofelements at each of the possible offsets requires O(mn)time. This is generally regarded as unacceptable for largedatabases.

To estimate the potential performance of sequential scanwithin the context of a deployed DAME system it is instruc-tive to consider the following example scenario.

• A fleet of 100 aircraft with four engines each, flying10 h per day, 365 days per year.

• Tracked-order1 vibration data recorded at five datapoints/s with 4 B of storage required for each datapoint.

• A deployed DAME system utilizing one hundred2-GHz CPUs with 1 GB of RAM each; effectively, oneCPU supporting each aircraft. Search is completed inparallel, utilizing all available processing power.

Each search needs to check 25 000 000 000 alignments ofthe query per year of tracked-order data. Using a sequentialscan on a single CPU, it takes approximately 2 s to check5 000 000 alignments. This extrapolates to approximately500 s to search 5 years of data using a query of 100 datapoints. This search uses the processing power of all CPUsuntil the search is completed.

This is within the typical 30-min turnaround time betweenan aircraft landing and being cleared for further service.However, a deployed DAME system must be able to handlemore than one search at a time. In the worst case, a numberof aircraft could land, at different airports within a shortinterval of time and all place demands upon the DAMEsystem. A reasonable assumption is that a deployed DAMEsystem handling 100 aircraft should effectively be able tosupport up to ten simultaneous searches with a responsetime of at most 5 min. This requires either:

1) an order of magnitude greater processing re-source—which is potentially prohibitively expensiveor

2) an order of magnitude faster approach to searching thevibration data.

1As an engine accelerates, the rotation speed of the shafts increases andso does the frequency of the vibrations caused by the shafts. A tracked orderis the amplitude of the vibration signal in a narrow frequency band centredon a harmonic of the rotation frequency of a shaft as it changes frequency.

Fig. 2. PAA representation.

D. Search Strategy

The strategy employed to achieve efficient patternmatching in DAME is as follows.

1) Each subsequence of the tracked-order data is pro-cessed to form a reduced dimensionality signatureusing a standard dimensionality reduction technique.

2) The signature for each subsequence is then encodedand stored in AURA (see Section V).

3) The query is processed using the same dimensionalityreduction and encoding techniques.

4) The encoded query is used as the input to the AURAsearch.

5) The output from the AURA search provides a set ofcandidate matches. The candidate matches are the setof tracked-order subsequences that are close to thequery in signature space.

6) Finally, a back-check is performed on the set of candi-date matches. The back-check makes a conventionalcomparison between the original query data and thetracked-order data for the candidate matches to deter-mine the final result set.

Steps 1 and 2 are performed only once for each pieceof data stored; this is a key aspect to the efficiency of theapproach.

The rationale behind this strategy is to quickly throw awaythe typically very large percentage of possible alignmentsthat result in a very poor match against the query data. Anexact comparison is then used in the back-check stage to findthe required results from the set of candidate matches.

E. Dimensionality Reduction

The dimensionality reduction technique currently used inDAME is piecewise aggregate approximation (PAA) [4], [5].Using PAA, the dimensionality of time-series data is reducedby grouping adjacent values together and taking their meanvalue. Thus, using a sector length of, say, ten, points 0–9 inthe time-series are combined to form a single mean value;similarly, points 10–19 are combined to form the next meanvalue, and so on. This process results in the reduced dimen-sionality signature. In this case, the signature comprises 1 10the amount of data as the original time-series.

Fig. 2 illustrates the PAA method. The original time-seriesis represented by the thin line (enclosing the shaded area) andthe corresponding signature series by the mean values plottedby the thick line.


F. AURA Encoding and Search

Once a signature series has been computed, it is encodedusing integer binning and concatenation. A large number ofbins are used to ensure that accuracy is maintained. The con-catenation process results in sparse binary codes that are typ-ically tens of thousands of bits long with perhaps 10–20 bitsset.

In the case of the tracked-order subsequences, the sparsebinary codes are stored in the AURA search engine. In thecase of the query, the binary codes are used as the basis forthe input to AURA. Here, further processing is performedto simulate the required comparison measure and to avoidproblems of edge effects at the integer bin boundaries. Thereader should note that although AURA logically operates onbinary codes, it does not do so in the literal sense. AURA em-ploys various sophisticated methods of data compression toachieve efficient storage and processing of the sparse binarycodes used. An overview of AURA is provided in Section V.

G. Performance

By utilizing PAA for dimensionality reduction and AURAencoding and search, it is possible to achieve between oneand two orders of magnitude overall speed up over a simplesequential search through the data.

Returning to our figures for data processing requirements,we can see how using a combination of AURA and PAAleads to a workable solution. Assuming one CPU per aircraftin service, then the processing time for a search of 5 yearsworth of data using the AURA-PAA approach is estimatedto be of the order of 50 s, an acceptable level given aircraftturnaround times of 30 min or more.

IV. DAME SERVICE DATA ARCHITECTURE

This section reviews the logistics of handling the raw en-gine data and the distributed architecture required to manageit.

The scenario, which we will describe, is one in whichevery time an aircraft lands, vibration and performance datais downloaded to DAME from the QUICK system fitted toeach engine. In future deployed systems, the volume of theZmod data downloadable from the QUICK system may be upto 1 GB per engine per flight. Given the large volume of dataand the rate at which it is produced, there is a need to con-sider how the data is managed. Should the data be transferredto a central location or maintained at separate nodes withinindividual airports? The DAME project has investigated theuse of Grid technology to support a fully distributed archi-tecture. This section briefly introduces Grid technology andthen describes how this technology can be used to solve thedata management problem.

A. What Is the Grid?

Ian Foster of the Globus Alliance [8] defines the Grid assomething that “coordinates distributed resources using stan-dard, open, general purpose protocols and interfaces to de-liver required qualities of service” [9]. In general, Grids canbe categorized as either data or processing centric.

Data Grids normally house volumes of data at the terabytelevel and are connected via large capacity networks, possiblyat the gigabit level, utilizing high-performance data deliverymechanisms.

Processing Grids comprise distributed clusters of tens orhundreds of computers that individually may not be that pow-erful. Taken as a whole, a processing Grid provides signif-icant processing capability and supports powerful parallelprocessing techniques.

In each case, the data and processing nodes will be inter-connected via the Internet, or in some cases private networks,and exploit standard protocols to make them appear as singleentities.

There are many software packages available both commer-cially, such as Sun Grid Engine [10] and as open source, suchas the Globus Toolkit [8], for building Grid systems. Globusis probably the most widely known and used.

B. The Need for the Grid

DAME has both processing and data Grid requirements;however, the focus of this paper is on the data managementissues, namely, providing the means to manage and processthe high volume engine data archives.

In order to illustrate the requirements of a deployedDAME system, consider the following scenario. Heathrow,with its two runways, is authorized to handle a maximumof 36 landings per hour [11]. Let us assume that on averagehalf of the aircraft landing at Heathrow have four enginesand the remaining half have two engines. In future, if eachengine downloads around 1 GB of data per flight, the systemat Heathrow must be capable of dealing with a typicalthroughput of around 100 GB of raw engine data per hour,all of which must be processed and stored. The data storagerequirement alone for an operational day is, therefore,around 1 TB, with subsequent processing generating yetmore data.

C. Architecture for Distributed Pattern Matching

On a global scale, the amount of data flowing into andaround a DAME system could be of the order of a terabyteper day. Given the Grid paradigm, it makes little sense tohold aircraft data centrally and direct all processing andsearch queries to a single location. This centralized archi-tecture would also be prohibitively expensive in terms ofhardware requirements.

Grid technology allows the DAME system to be highlydistributed and yet still appear as a single coherent system.To these ends, the architecture shown in Fig. 3 has been de-vised to allow high-performance management of and patternmatching against global archives of engine data.

To achieve its aims, the DAME data management systemuses airports as the unit of distribution. All data that arrivesat an airport is stored at that location, and all processing andsearch queries against that data are processed at that site. Thisresults in data relating to any one engine being spread aroundthe airports that it has visited.

Each airport node consists of three elements: a data repos-itory, the pattern matching services, and a data catalogue.


Fig. 3. Distributed Data Management Architecture.

All nodes share a single or federated metadata catalogue(MCAT), and it is this catalogue that contains the loca-tions of the engine data. The catalogue can be queried ina data-centric way, e.g., engine serial number and flightdate/number, or via some characteristic of the data, ratherthan in a location-centric fashion.

The physical manifestation of each node is relativelysimple, and it is through simplicity that the power of thepattern matching architecture is derived. The architectureprovides a heterogeneous (hardware and operating system)environment for nodes of varying types. Nodes may alsoconsist of one or more computing resource, dependingon the quantities of data stored at each location. Currentdemonstrations use a single laptop for each node.

A typical scenario might be that an engine specialist isrequired to analyze some data for a specific engine. The spe-cialist can enter engine flight identifying traits into the SDE(see Section VI). Based on these characteristics, the SDEsubmits a query to the global MCAT in order to locate thecorrect data. The MCAT returns a handle, which the explorercan then use to retrieve data from the Data Repository. Thishandle contains the address of the appropriate repositoryto contact, along with the address of an appropriate PatternMatch Control (PMC) service, which brokers the searchprocess.

If the specialist performs searches based on some featurein the engine data, the SDE requests that the indicated PMCservice perform the search. This PMC service acts as themaster node for this search, responsible for replicating thequery to all other PMC services. All nodes processing a queryperform their search in parallel, searching all engine dataheld across the system as required. The PMC is also respon-sible for correlating the results and returning them to theSDE. The specialist may then want to view further enginedata as a result of the search. Search results contain appro-priate handles for each of the matches, allowing the SDE tocontact the correct data repository to request data.

D. Pattern Matching Services

The Pattern Match Controller is the front-end service forpattern matching operations across the data held at a singlenode. It accesses a catalogue of what searchable data is heldat that node and is responsible for coordinating searchesacross remote nodes in the system.

The PMC also maintains a list of all other PMCs in thesystem. When a new PMC node is added, it is given the ad-dress of one other PMC. It is then able to retrieve the full listof all other nodes from this known PMC and register itselfwith every other.

When nodes fail, any search operations will not includeresults from these nodes. The result set will indicate thenumbers of participating, failed and completed nodes in anysingle search. This gives users an idea of the amount of datacovered in any search. Once failed nodes become availableagain, they are once more able to participate in searches.

The Pattern Match Controller communicates with severalother services. An Engine Data Extractor/Encoder serviceextracts specific low-level features (tracked orders), from rawengine data into a specified format. To enable tracked or-ders to be searched using AURA pattern match technology,they must be encoded into a specific format (see Section V).This encoding is performed by an AURA Encoder service.The AURA technology itself exists as a Grid-enabled ser-vice called AURA-G. The results from AURA-G consist ofa set of candidate matches that are processed using conven-tional pattern matching techniques to form a result set. Thisis performed by the back-check service.

These services are implemented as a Globus Toolkit3.x grid services [8], and as such are effectively Web ser-vices, which are hosted inside a Jakarta Tomcat 4.1.x [12]installation.

E. Data Catalogue

The data catalogue maintains a list of all the tracked ordersand AURA encoded/searchable data held at a single node. APMC uses this information to determine if it needs to partic-ipate in a search request.

It is expected that this service can be implemented by ex-tending the metadata held in the MCAT. Since the MCAT anddata catalogue will no longer need to be harmonized, this willimprove data integrity.

F. Data Repository

The data repository at each node is responsible for storingall raw engine data along with any extracted and AURAencoded data. This service is provided by the SDSC StorageRequest Broker (SRB) [13]. SRB is a tool for managingdistributed storage resources from large disc arrays to tapebackup systems. Files are entered into SRB and can then bereferenced by logical file handles that require no knowledgeof where the file physically exists. A metadata catalogueis maintained which maps logical handles to physical filelocations. Additional system specified metadata can beadded to each record.


SRB can operate in heterogeneous environments and inmany different configurations from completely stand-alone,such as one disc resource, one SRB server, and one MCAT,to completely distributed with many resources, many SRBservers, and completely federated MCATs. In this configura-tion, a user could query their local SRB system and yet workwith files that are hosted remotely.

When data is requested from storage, it is delivered viaparallel IP streams, to maximize network throughput, usingprotocols provided by SRB.

In current implementations, a single MCAT is deployedinside a Postgress database. Real world implementationswould use a production standard database, such as Oracleor DB2. With this setup, each airport’s SRB installation isstatically configured to use the known MCAT.

The use of a single MCAT provides a single point of failurethat would incapacitate the entire distributed system. To al-leviate this, a federated MCAT would be used, allowing eachnode to use their own local MCAT. In large systems, thiswould provide a performance benefit, as all data requestswould be made through local servers. However, this wouldrequire an efficient MCAT synchronization process.

G. Summary

The pattern matching architecture described above hasbeen designed to provide a robust and scalable method ofsearching large scale distributed data sets.

The architecture exhibits graceful degradation in the faceof node failures. When nodes are not able to contributeto search operations, users still receive results from theremaining airports.

The addition of extra airports into the system does notplace any additional workload onto existing nodes, and thismakes the system highly scalable.

V. OVERVIEW OF AURA

AURA [14] is the key technology used to search the enginedata within DAME. This section overviews the principles be-hind the technology.

AURA is designed to provide scalable high-performancepattern matching capabilities on uncertain and imprecisedata. It achieves this by storing and searching binary patternsusing a simple and analyzable form of a neural network,known as a correlation matrix memory (CMM) [15]. Thetechnique has been shown to be scalable to relatively largeproblems, for example, text matching [16] and moleculardatabases [17].

Pattern matching using AURA requires two phases of op-eration: a storage or training phase and a search or recallphase. In the storage phase, existing data—typically recordsfrom a file or database—are encoded into binary sequencesand stored in a CMM. In the search phase, new data is en-coded in the same way and applied as an input to the CMM.The output of the CMM is then thresholded and decoded andthe records identified are fed into a back-check process.

Within an AURA based system, CMMs work as ahigh-performance method of filtering out the vast majority

of records that do not match the query. The matching recordsand a few false positives are retained and fed into theback-check process. The back-check performs conventionalpattern matching operations to remove the false positivesleaving just the records required.

We now give a simple example of how data is stored andsearched in a CMM. The example is based on a very sim-plistic way of searching time-series data (DAME deals withthe complex time-series data shown in Fig. 1). In this ex-ample, the time-series are sequences of three numbers eachof which can take values in the range one to three, for ex-ample, etc.

A. Encoding

Each sequence needs to be encoded as a binary inputpattern and a binary output pattern. Finding an effectiveencoding scheme is both problem-specific and key to ob-taining high performance. Developing a suitable encodingstrategy is a fundamental step in implementing any AURAbased system. Typical simple encoding strategies formbinary patterns for each element of a sequence. These codesare then combined using either superposition (logical OR)or concatenation to form a binary pattern for the completesequence.

To illustrate the encoding problem, consider a system thatmust search for similar three number sequences in a set ofsequences held in database.

To form an input pattern for each sequence, first we use a3-b-long binary code for each value in the sequence. So “1”is encoded as “100,” “2” as “010 ,” “3” as “001,” and so on.These codes are then concatenated to form the input patternfor a given sequence. For example, is encoded as“100 010 001” and as “001 010 100.”

As well as an input pattern, an output pattern is also re-quired for each sequence stored. The output pattern is usedduring storage and also during search, where the output ofthe CMM is decoded into a set of output patterns that iden-tify the matching data items.

In typical applications, a simple orthogonal encoding isused for output patterns. These codes have a single bit set,effectively associating a single column in the CMM with asingle record. Such output patterns have the advantage ofavoiding ghosting effects in the CMM as well as being trivialto decode.

Returning again to our example, we can use an output pat-tern for each sequence that has a bit set in a position re-flecting the position of the sequence in the database. So thefirst word has an output pattern of “100 000 ,” the second“0 100 000 ,” and so on.

B. Storage

Storing data in a CMM is achieved by setting bits in thematrix corresponding to the outer product of the input andoutput patterns or, using simple terms, by setting bits in thecells given by the intersection of the rows selected by the setbits in the input pattern and the columns selected by the setbits in the output pattern.


Fig. 4. Training an input vector.

Fig. 5. Trained CMM being used for a pattern search.

Fig. 4 shows an input pattern being stored in a CMM.Fig. 5 shows the state of a CMM after the following inputpatterns have also been stored: ,and .

C. Search

To use a CMM to return matching data, we must first en-code the query sequence using the same scheme that wasused during storage. The input pattern for the query is appliedto the input of the CMM, which evaluates the number of setbits that each column has in common with the input pattern.These column totals are then thresholded to generate a binaryoutput pattern that can be decoded to obtain the matchingrecords.

Returning to our example, let us assume the query data is. Fig. 5 shows the CMM being evaluated for an input

pattern of “100 010 001.” The column totals are three, zero,one, two, and one for columns 0–4, respectively.

To complete the recall process, the column totals arethresholded to produce an output pattern. In this case, wechoose a threshold of two, as we are interested in partialas well as exact matches. This gives an output pattern of“10 010 ” Given the simple orthogonal encoding schemeused to generate output patterns, this can be trivially de-coded to the first and fourth sequences in the database.These sequences are , which is an exact match, and

, which has two matching values.

D. Properties of CMMs

Although the encoding scheme used in our example is verysimple, it is hoped that it gives the reader a feel for how thebasic technology works. It should be noted that such a simple

encoding scheme has many drawbacks for use in time-seriespattern matching and is not suitable for use in practice. Itdoes, however, help to illustrate two key properties of CMMs.

First, using CMMs, it is possible to perform partialmatches very efficiently. Suppose we had a CMM thatencoded sequences of 20 numbers. A single CMM searchcould be used to identify all the sequences that had ten ormore values that matched the corresponding values in thequery sequence. By comparison, the number of separateoperations required using standard database techniqueswould be over 184 000 (all combinations of 10 out of 20).

Second, the performance of a CMM-based search canfar exceed that of an equivalent sequential scan throughthe data. Take the simple sequence-matching example. Asimple sequential scan of all the three value sequences inthe database against a query might reasonably takeoperations where is the number of sequences in thedatabase. Using a CMM, we note that only three out of ninerows are activated by the input pattern. (The input pattern issaid to have a saturation of ) The total amount ofinformation that needs to be processed is reduced by a factorthat is dependent on the saturation—assuming that the datais spread reasonably evenly between the rows and the CMMis implemented effectively. Using smart encoding schemescan bring the performance improvement resulting from verylow saturation input patterns to over 100-fold—for example,in text-based searching [16].

It should be noted that CMMs do not require specialisthardware—although hardware CMM implementations doexist. CMMs can be implemented in software on standardmachines or clusters of machines. It is, therefore, easy tohost pattern matching Grid services using AURA withoutthe need for specialist hardware.

VI. OVERVIEW OF THE ARCHITECTURE OF THE DAME SDE

We now discuss the user data-mining interface to the dis-tributed data architecture. The SRB distributed architectureabstracts the problem of data management away from endusers; however, we must provide the mechanisms to facili-tate data-mining and pattern matching across this distributedfile system. Within the DAME architecture [1], this can bevia an automatic workflow or through a developed tool, theSDE. The SDE demonstrates the utility of the DAME ap-proach to data searching.

Recall that the goal of DAME is to contribute to the en-hanced diagnosis and prognosis of engine problems throughthe use of remote Grid services, tools, and human experts.This involves the analysis of data generated by aircraft en-gines to identify potential faults that require maintenance.The DAME system provides a collection of diagnostic tools,which can be used in an automatic sequence, usually by amaintenance engineer, or interactively by a domain expert.Using the SDE, a domain expert can view raw engine data,extract and view features from the raw data, select a patternor region of interest, and search for similar patterns in theterabytes of data stored from previous and recent flights. It isan important tool within the DAME system, since intelligent


Fig. 6. Architecture of the SDE.

feature extraction and data mining are necessary to provideearly detection and diagnosis of deviations from normal en-gine behavior.

The SDE allows the viewing and searching of time-seriesdata stored locally or remotely. Insight can be gained fromviewing and searching such data for the presence of featuresknown to be associated with fault conditions and for devia-tions from a model of normal operation (known as noveltydetection). Many systems and processes within industry,healthcare, business, and research can be characterized bya set of time-series data. Time-series data can be analyzedin the time or frequency domains and the monitoring andfast searching for similar patterns is applicable to manydomains.

A crucial part of engine health monitoring is the search anddetection of abnormal patterns and values within the vibrationand performance data. The SDE provides an interactive GUIfor the viewing and analysis of data from individual enginedata records (EDRs). It is equipped with the necessarysignal processing tools to extract low level features andparameters from the input vibration data and performancedata. It permits large-scale searching for patterns in anengine data set (many EDRs). A built-in pattern templatelibrary allows the user to manage a library of “interestingpattern templates” for engine diagnosis or feature detection.

In addition to the provision of an integrated environmentfor interactive pattern matching, feature extraction, anddetection, the DAME SDE also provides programmablefacilities. A Domain Expert can use the SDE to performa programmed set of parallel and sequential operations tosearch for multiple patterns in the course of fault detection.

A. Architecture

Fig. 6 shows the architecture of the SDE. An outline of itsoperation is as follows. Raw data is input from the data storeand is displayed in the appropriate subwindow of the GUI.The user can then select a feature to view—this feature isthen extracted by the SDE from the raw data and displayed.The user can “play” the data at various speeds for rapid ob-servation. A fragment of the feature can be selected by theuser and used as a search query within the AURA search en-gine (which is trained on terabytes of historical engine data).The fragment has to be encoded prior to submission to thesearch engine. The results of the search and the scores aredisplayed—individual results (matched patterns) can be dis-played graphically in a pop-up window.

The SDE uses internal or external versions of the followingservices, as selected by the user: Extractor, Encoder, AURASearch Engine.

The components shown in Fig. 6 are described below.

• The data store is a repository for the EDRs and providesthe DAME SDE with the engine data for viewing andprocessing. This is managed by the SRB data architec-ture described in Section IV. The functional interfaceto the SRB data repositories is via the Pattern MatchController.

• The GUI integrates the input and output to and from alllocal and remote services with the purpose of providingfacilities to explore the data.

• The signal extractor extracts user selected low-levelfeatures from the unstructured raw engine data. Such


Fig. 7. SDE GUI.

low level features can be directly displayed or frag-ments of these can be selected for use as a query to theAURA search engine. The SDE can use either its ownlocal version of the extractor or a more sophisticatedexternal extractor Grid service.

• The AURA search engine forms the core of the SDE.The SDE enables the user to select a region of interest(query fragment). This pattern is first encoded into aquery vector and then passed to the search engine. Theresults of the search are returned to the SDE for display.The SDE has the flexibility to use either its own localsearch engine or a more sophisticated external searchengine implemented as a Grid service.

• The Encode service encodes low-level features into abinary vector appropriate for input to the AURA searchengine. The SDE can again use either its own local en-coder or a more sophisticated external encoder imple-mented as a Grid service.

• The filter is a programmable module that providesa means of signal preprocessing such as noisesuppression.

• The pattern library contains pattern templates. Thesecan be generated using built-in tools and used as thequery pattern. The sources of pattern templates canbe a region of interest from a displayed low-level fea-ture, a library pattern, or a user-constructed pattern. Thebuilt-in pattern template library stores and manages thepattern templates; these might be examples of faults orexamples of unknown events. These templates can be

used separately or organized as a set of features that to-gether represent a particular fault.

• The tools module provides probing and scaling facili-ties. The probing tools allow the user to point to areasof the data display and view information about thatarea. The information can be frequency spectra or datasource information as required. The scaling facility al-lows the user to rescale the various display axes.

• Other services. The SDE also integrates seamlesslywith other DAME diagnostics tools and services. Forexample, the results from the AURA search engine canbe sent directly from the SDE to the DAME case-basedreasoning tools for further processing.

B. Operation and Functionality

The SDE has the flexibility to be used as a separate tool oras a front-end GUI for the DAME search services. It can beused as a viewer to display time-series data, switch betweendifferent extracted data sets, measure parameters of the vari-able at the given time instance, and probe information hiddenin the raw data. The low-level features are extracted in realtime while the data is played out in the viewer. The input datacan be local or from the Grid.

The tool supports correlation, Euclidean distance, andcity-block distance similarity measures. Depending on theapplication, one or more measures can be used during thesearch. Based on the defined measure, a searching task canbe “find me all examples of vibration patterns similar to the


Fig. 8. Patterns displayed as a result of a search.

input template” or “find me the best matches to the querypattern from the data store.”

The SDE contains a set of tools for extracting datavalues from EDRs and manipulating feature patterns. Whensearching for a similar pattern, the user can choose the querypattern from the template library or from highlighted regionsof interest from the current time-series pattern displayed inthe window. Once the search is invoked, the search enginewill look for similar patterns from the data store or testingdata and locate the position of the best match. The results arethen displayed in the results window. The user can drill downinto each result to examine the actual shape of the patternand call for further operations over the Grid. The SDE alsoprovides facilities to print or save results and features to file.

The main scenario in which DAME SDE is used is asfollows.

• Display the vibration data (as a Zmod plot) and thetime-series or other parameters extracted from the datarecord as a movie sequence. Locate and highlight posi-tions of the parameters from the main window.

• Extract and display instantaneous values of parametersfrom the EDR.

• Search for the best match to a query pattern in eachrecord in the data store over the network.

• Search for patterns of interest within the current Zmoddata set.

• Manage the “interesting pattern template” data store forfault detection. The user can add, delete, and edit thetemplates stored in this library.

• Probe for particular information hidden in the datarecord from the region of interest in the pattern viewwindow.

• Organize the search tasks from the task plannerwindow. This allows the system to execute a batch oftasks automatically.

Fig. 7 shows the layout of the SDE tool window (a set ofsubwindows), which allows the user to view data and managethe activities. The user can view raw data, select features, andview the selected feature in the windows shown in Fig. 7.Any part of the feature can then be selected and used in asearch activity. The listed results of the search are shown inthe search results window and each can be displayed in aseparate pop-up window.

C. Application Examples

Figs. 8 and 9 show particular examples of the operation ofthe SDE.


Fig. 9. Pattern found from the current feature set matching a template.

Fig. 8—Search for Similar Pattern to the Input TemplateFrom Large Volume of History Data Store: The region ofinterest is highlighted as the search template. The best fourmatching patterns are displayed. Search results are listed inthe results window.

Fig. 9—Find Similar Pattern to the Template From theCurrent Test Data Record: A query pattern is imported fromthe template library. A similar pattern in the current test datarecord is displayed in the lower window.

VII. CONCLUSION

This paper introduced the DAME project and describedthe pattern processing architecture. The aim of DAME is tocontribute to the enhanced diagnosis and prognosis of en-gine problems through the use of remote Grid services, tools,and human experts. Enhanced diagnosis/prognosis will en-able more effective planning of maintenance activities andlogistics, leading to a reduction in disruption costs (activi-ties necessary to accommodate unplanned maintenance) anda reduction in maintenance costs (an early preventative re-pair usually costs less than a repair when failure has actu-ally occurred). This will lead to increased customer (airline)

satisfaction through reduced operating and disruption costs.This paper shows how vital the AURA pattern processingis to the diagnostic/prognostic process. Without this tech-nology, the searching of terabytes of data in a short timescaleis not viable. The pattern processing architecture, althoughdeveloped for the aeroengine problem, is a technology thatis widely applicable to many problems involving storage, re-trieval, and search of signal data. The project has shown thatGrid technology is a viable approach to implementing suchadvanced distributed search systems.

Future work will look at the deployment of AURA searchtechnologies in general, toward the deployment of a DAMEdiagnostic/prognosis system in the aeroengine domain andtoward its use in other domains.

ACKNOWLEDGMENT

The DAME project has involved over 60 people in its firsttwo years and, thus, it would be impossible to mention themall here. The architects of the pattern processing serviceshave contributed to this paper and are its authors. The au-thors gratefully recognize the help of the industrial sponsors:Rolls-Royce, Data Systems and Solutions, and Cybula Ltd.for their help.


REFERENCES

[1] J. Austin et al., “Predictive maintenance: Distributed aircraft en-gine diagnostics,” in The Grid, 2nd ed, I. Foster and C. Kesselman,Eds. San Mateo, CA: Morgan Kaufmann, 2003, ch. Ch. 5.

[2] I. Foster and C. Kesselman, Eds., The Grid, 2nd ed. San Mateo,CA: Morgan Kaufmann, 2003.

[3] A. Nairac, N. Townsend, R. Carr, S. King, P. Cowley, and L.Tarassenko, “A system for the analysis of jet engine vibration data,”Integrated Comput.-Aided Eng., vol. 6, pp. 53–65, 1999.

[4] R. Agrawal, C. Faloutos, and A. Swami, “Efficient similarity searchin sequence databases,” in Proc. 4th Int. Conf. Foundations of DataOrganization and Algorithms (FODO), 1993, pp. 69–84.

[5] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra, “Dimen-sionality Reduction for Fast Similarity Search in Large Time-SeriesDatabases,” Knowl. Inf. Syst., vol. 3, no. 3, pp. 263–286, 2001.

[6] E. Keogh and S. Kasetty, “On the Need for Time-Series Data MiningBenchmarks: A Survey and Empirical Demonstration,” in Proc. 8thACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining,2002, pp. 102–111.

[7] Grid information retrieval requirements, K. Gamiel, S.Karimi, G. Newby, and N. Nassar. [Online]. Available:http://www.gridir.org/papers/Grid_Information_Retrieval_Requirements.pdf

[8] The Globus Alliance [Online]. Available: http://www.globus.org[9] Grid Today [Online]. Available: http://www.gridtoday.

com/04/0202/102 590.html[10] Sun Grid Engine [Online]. Available: http://gridengine.sunsource.

net/[11] D. O’Connel, “Pilots predict air chaos,” The Sunday Times, p. 3.3,

Mar. 14, 2004.[12] Jakarta Project [Online]. Available: http://jakarta.apache.org/

tomcat/index.html[13] SDSC Storage Request Broker [Online]. Available: http://

www.npaci.edu/DICE/SRB/[14] AURA and AURA-G Web site [Online]. Available: http://

www.cs.york.ac.uk/aura[15] V. Hodge and J. Austin, “An evaluation of standard retrieval al-

gorithms and a weightless neural approach,” in Proc. IEEE-INNS-ENNS Int. Joint Conf. Neural Networks, 2000, pp. 24–27.

[16] M. Weeks, V. Hodge, and J. Austin, “Scalability of a distributedneural information retrieval system,” in Proc. 11th IEEE Int. Symp.High Performance Distributed Computing, 2002, p. 423.

[17] J. Austin, A. Turner, and K. Lees, “Chemical structure matchingusing correlation matrix memories,” in Proc. 9th Int. Conf. Artifi-cial Neural Networks, vol. 2, 1999, pp. 619–624.

Jim Austin received the B.Sc. degree in neuro-biology from the University of Sussex, Brighton,U.K., in 1982 and the Ph.D. degree from BrunelUniversity, London, U.K., in electrical andelectronic engineering in 1986 for work in neuralnetworks.

He is the Director of the Advanced ComputerArchitectures Group, in the Department of Com-puter Science, University of York, York, U.K.,and is the CEO of Cybula Ltd., a Universityspin-off company. His main interest is in the

application of neural networks to large unstructured datasets. He is theManager of the Distributed Aircraft Maintenance Environment (DAME)project. He is most well known for developing the AURA technology. Hehas published over 180 papers and two edited books. His main area ofresearch is artificial neural networks with interest in applications to largedatabases.

Prof. Austin’s work on AURA received an award for IT from the BritishComputer Society in 2001.

Robert Davis received the B.A. Honours degreein physics from University College, Oxford Uni-versity, Oxford, U.K. in 1986 and the D.Phil. de-gree in computer science from the University ofYork, York, U.K., in 1995 for his work in the fieldof real-time systems.

He has been involved in applied research andtechnology transfer and cofounded three startupcompanies. In 1995, he cofounded NorthernReal-Time Technologies (NRTT) Ltd., a com-pany that developed bespoke real-time embedded

software for Volvo Car Corporation. At NRTT, he was responsible fordevelopment of a Controller Area Network (CAN) software library calledVolcano. Today Volcano is used in many Volvo and Ford PAG cars. In1997, he cofounded LiveDevices Ltd., where he was responsible for thedevelopment of the Real-Time Architect suite of products, including anOSEK RTOS and schedulability analysis tools. LiveDevices became partof ETAS in 2003. Recently, he has been involved in the Distributed AircraftMaintenance Environment (DAME) Project at the University of York,where his research focussed on uncertain and imprecise pattern matchingusing binary representations and correlation matrix memories: AURA tech-nology. He is also a director of Rapita Systems Ltd., a spin-out companyfrom the University of York transferring worst case execution time analysistechnology into industry.

Martyn Fletcher received the B.Sc. degree in ap-plied physics at the University of Hull, Hull, U.K.and the M.Sc. degree in integrated circuit systemdesign at the University of Manchester Instituteof Science and Technology, Manchester, U.K.

He has worked in industry on a range of soft-ware and hardware projects. This has includedworking for Marconi Communications SystemsLtd. (as a Systems Engineer—troposphericcommunication systems), Castle Associates Ltd.(Design Engineer—sound level meters), and

BAe/BAESystems (Hardware/Software Engineer—avionics systems andavionics test equipment and then Software Research Engineer—distributedoperating systems and fault tolerant avionic systems). He is currently theSoftware Manager for the Distributed Aircraft Maintenance Environment(DAME) at the University of York, York, U.K., and is concerned withrequirements capture, software design and UML modeling of the system.He has coauthored various papers whilst in industry and in his currentposition.

Thomas Jackson received the B.Eng. degreein electronics and electrical engineering fromUniversity of Salford, Manchester, U.K., in 1988and the D.Phil. degree in computer science fromthe University of York, York, U.K., in 1995. Hismajor field of study was neural computing andartificial intelligence.

From 1988 to 1993. he was a Senior Engineerwithin the Systems Computing R&D group ofBritish Aerospace Military Aircraft Limited.This group specialized in advanced concepts

and future projects for avionic flight systems. From 1993 to 1997, he wasResearch Manager for the High Integrity Systems Engineering group inthe Computer Science Department of the University of York. He thenserved for five years as a Scientific Officer for the European Commissionat the EC Joint Research Centre, Ispra, Italy. He is currently the ProjectCoordinator for the Engineering and Physical Sciences Research Council(EPSRC)-funded e-science Distributed Aircraft Maintenance Environ-ment (DAME) pilot project. He is responsible for the management ofa distributed research team of over 20 staff and the delivery of a majore-science technology demonstrator in the area of aeroengine diagnostics,in collaboration with Rolls-Royce and Data Systems and Solutions. He isthe author/coauthor of numerous publications in refereed conferences andjournals and coauthor of An Introduction to Neural Computing (Bristol,U.K.: Inst. of Physics, 1990).


Mark Jessop received the B.Sc. degree (firstclass) in computer systems from NottinghamTrent University, Nottingham, U.K., in 1999.

He has worked in both industry and academiaand has considerable experience in Web and Gridservices programming. His industrial experienceincludes working for BT Research, Martlesham,U.K., in the Shared Spaces group. He is currentlyworking on the Distributed Aircraft MaintenanceEnvironment (DAME) project at the Universityof York, York, U.K., where he has a particular in-

terest in the development and demonstration of security techniques, buildingservice oriented Grid applications, and porting existing systems and tech-nologies to the Grid service paradigm.

Bojian Liang received the B.Sc. degree in com-munication engineering from Northern JiaotongUniversity, Beijing, China, in 1982 and the Ph.D.degree in computer science from Heriot-WattUniversity, Edinburgh, U.K., in 2001.

He is currently working on the Distributed Air-craft Maintenance Environment (DAME) projectat the University of York, York, U.K., where hisspecialization is pattern matching. He has pub-lished over 20 papers. His early work focused onthe application of microelectronics in real-time

and telemetry systems. Recent research interests include object recognition,robot navigation, and the application of polarization analysis to the com-puter vision problem.

Andy Pasley received the B.Sc. degree (firstclass) in computer science from Leicester Uni-versity, Leicester, U.K., in 1998 and the Ph.D.degree from the Computer Science Department,University of York, York, U.K., in 2004. Thefocus of his research was binary neural networksand their application to time-series analysis andforecasting.

He is currently working on the Distributed Air-craft Maintenance Environment (DAME) projectat the University of York, where his role concerns

pattern matching against time-series of vibration measurements for aero-engine diagnostics. In particular, he is exploring the potential of the Grid toallow pattern searches against large, distributed data sets of flight data.


Documents

DAME: Searching Large Data Sets Within a Grid-Enabled ......DAME: Searching Large Data Sets Within a Grid-Enabled Engineering Application JIM AUSTIN, ROB DAVIS, MARTYN FLETCHER, TOM