43
1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi Hashimoto

1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

Embed Size (px)

Citation preview

Page 1: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

1

Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System

(TOPS)

Petr Votava

Ramakrishna Nemani

Andrew Michaelis

Hirofumi Hashimoto

Page 2: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

2

Outline

• Project Overview• Architecture• Knowledge Management System• Control Module

Page 3: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

3

Project Goal

Provide a framework for automated anomaly detection and verification in large heterogeneous Earth science data sets as well as on-demand data

analysis integrated with the Terrestrial Observation and Prediction System (TOPS)

Page 4: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

4

Use Case Scenario Example

• Given a global TOPS NPP (Net Primary Productivity) anomaly product, perform the following actions for identified clustered anomalies:- Check for anomalies in the inputs to the NPP product

• Climate data: if anomaly appears in one of the climate datasets, confirm this from another source of the same variable

• Satellite Data: If the anomaly is one of the MODIS satellite products (LAI - Leaf Area Index), check the QA (Quality Assurance flags), and if QA indicates problem make that our default position

• If QA appears fine, check for other known events that could cause anomaly in LAI product, such as fire. Also check the MODIS fire product for any recent fires in the area of interest.

Page 5: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

5

Architecture Overview

Page 6: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

6

Anomaly Detection and Analysis Flow (System Overview)

• Anomalies are detected using by on of the plug-in algorithms (Anomaly Detection)

• Related datasets are analyzed to see if the anomaly is observed in other products as well (Knowledge Management System, Anomaly Confirmation)

• If the anomaly is confirmed, it is classified against a set of known anomalies or is classified as new (Anomaly Classification)

• A pre-defined workflow that represents an in-depth analysis process is executed for recognized anomaly classes (Anomaly Analysis)

• The overall flow is managed by the logic in the Control Module.

Page 7: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

7

Objective: Knowledge Management System

Develop a Knowledge Management System in order to store and query for information about model hierarchies, dataset hierarchies and their

relationships.

Capture of data and model relationships such as similarity and compatibility so that different components can be dynamically selected and matched during analysis.

Page 8: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

8

Anomaly Detection System Overview

Page 9: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

9

Building Knowledge Management System

• Use Case Scenario• Knowledge Representation• Architecture• Development• Knowledge Management System Interface

Page 10: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

10

Use Case Scenario Example

• Given a global TOPS NPP (Net Primary Productivity) anomaly product, perform the following actions for identified clustered anomalies:- Check for anomalies in the inputs to the NPP product

• Climate data: if anomaly appears in one of the climate datasets, confirm this from another source of the same variable

• Satellite Data: If the anomaly is one of the MODIS satellite products (LAI - Leaf Area Index), check the QA (Quality Assurance flags), and if QA indicates problem make that our default position

• If QA appears fine, check for other known events that could cause anomaly in LAI product, such as fire. Also check the MODIS fire product for any recent fires in the area of interest.

Page 11: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

11

Use Case Analysis

• Typical interaction among members of our group (from scientists to engineers)

• There is a lot of knowledge in the previous statements that we want to capture for future use

• We also want to automate much of the process to speed up any future analysis so that- When we automatically process the anomaly products, we already

have a possible indication why is the anomaly there

- Problem: Things can differ dramatically from one variable to the next AND we may want to try different approaches/hypothesis as well (look at different products for the same variable/event)

Page 12: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

12

Queries

• Based on the use case, we need the answers to the following questions:- What are the inputs to NPP?

- What are the climate inputs to NPP?

- What are the satellite inputs to NPP?

- What are the inputs to LAI?

- What event(s) can have influence on LAI?

- In what product(s) can a fire event be observed?

- What products are impacted by a fire event?

Page 13: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

13

Knowledge Capture

• We need to be able to capture the knowledge in a persistent machine readable format

• Additionally we would like to be able to share this knowledge among the community- Can provide basis for data lineage = documentation of a process of

how we arrived in our conclusions• The terminology of actions performed during the analysis is standardized

- Good for automated data analysis

• Preferably we would re-use existing community vocabularies = attach semantic meaning to the data, models and actions- Good for automation- Great for interoperability

Page 14: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

14

Knowledge Representation

• In order to be able to easily build on existing foundation and have a useful vocabulary that is in sync with the community - create new taxonomy (ontology) using one of the existing ontology languages and link it with existing ontologies

• Two most popular ontology languages:- Resource Description Framework (RDF)

- Web Ontology Language (OWL)

Page 15: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

15

Language Selection

• RDF- Sets of <subject, predicate, object> triples

• NPP rdfs:type Data OR BGC rdfs:type Model

• rdfs:type is a predefined RDF construct

- Good, but very low-level with limited inference

• OWL (our choice of language)- Vocabulary extension of RDF

• NPP isDerivedFrom LAI and LAI isDerivedFrom NDVI

• isDerivedFrom is custom predicate

- Supports inference through Description Logics (finding implicit information from presented data)

• It can be automatically inferred that NPP isDerivedFrom NDVI

- Reuse and integration of large number of existing ontologies:• SWEET, Dublin Core, US Census Ontology, …

Page 16: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

16

Knowledge Storage and Access

• Standard DBMS (MySQL, Postgres..)- XML encoded, so it could theoretically map easily to database

tables, BUT would not provide any semantics, extensibility and inference capabilities as well as support for enhanced queries

- On the other hand, very scalable

• Ontology Data Stores (Jena, Sesame, …)- Takes advantage of the RDF/OWL standard encodings- Provide a way to query the DB in a way more suitable for ontology- Can be configured to support inference- For scalability, can have standard DB as back end- We chose Sesame for our Knowledge Storage Engine

• Good extensibility (including OWL inference component)

• Web service interface

Page 17: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

17

Query Interface

• Two ways to query ontology stored in Sesame- SeRQL

- SPARQL

• Similar capabilities and we have tested both

• Eventually settled on SPARQL, because it is W3C (World Wide Web Consortium) standard- Support from large number of tools

- Ease of interoperability and integration

- Example:

SELECT DISTINCT ?input

WHERE { tops:NPP tops:isDerivedFrom ?input }

Page 18: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

18

Inference Choices

• There are two basic places for knowledge inference- Pre-compute it before ingestion into Sesame using one of the

standard reasoners (a software component that computes inferences from a given ontology)

• Can lead to a large ontology (things get inlined into a single file)

• Can be hard to distinguish the original and the inferred knowledge

- In the context of the storage component (on ingest)

• Have the ability to distinguish original and inferred data

• External ontology can stay untouched (good for development)

• We use OWLIM as a Sesame Storage and Inference Layer (SAIL) in order to take advantage of the OWL inference capabilities

Page 19: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

19

Knowledge Management System Architecture

Page 20: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

20

Knowledge Management System Development

• Start with the Use Cases• Map the Use Case

- We have been using Mind Manager across multiple projects for a while

• Use the map to define ontology- Can be done by hand, but we used Swoop and Protégé -

two popular open source OWL editors

Page 21: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

21

Use Case Map

Page 22: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

22

Ontology Definition

• Define classes- Data, Model, Event, …

• Define relationships among classes- isDerivedFrom (Data -> Data)

- hasImpactOn (Event -> Data)

• Link with existing ontologies- For example, SWEET ontlogy defines the term

Satellite, so we can define our term Satellite_Data as Data whose source is Satellite, similarly we can re-use the definition of Fire as a type of an Event

Page 23: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

23

Populating Knowledge Management System

• Once the ontology is defined, we can populate our system with specific instances

• Define LAI as a type of Data with Satellite source• Define NDVI as a type of Data with Satellite

source• Define LAI as being derived from NDVI• Define NPP as being derived from LAI• The system will reason that NPP is also derived

from NDVI without having to specify it explicitly- This is a trivial example of inference

Page 24: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

24

Finishing Up

• Populate the Sesame database- Can be done with custom scripts

• Testing- Not performance testing, but a testing for validity

• Can the system answer the questions that were identified in the Use Cases

Page 25: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

25

Sample Queries

• What are satellite inputs to NPP?SELECT DISTINCT ?input

WHERE

{ tops:NPP tops:isDerivedFrom ?input.

?input tops:hasDataSource ?s.

?s rdf:type sweet:Satellite. }

• What data are influenced by a fire event?SELECT DISTINCT ?data

WHERE

{tops:FireEvent tops:hasImpactOn ?data. }

Page 26: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

26

Adding New Knowledge

• We plan on populating our system throughout the project- New research information becomes available- Integration of new models and datasets

• We would like to get the scientists more involved so that we can have a richer set of information/knowledge available to the rest of the system

• The ontology is being developed with Protégé - Good environment, but steeper learning curve

• Experimenting with CMap Tools (Good candidate)- Good: Graphical interface is intuitive and easy to use- Questionable: Some problems editing existing ontology created

with different tools

Page 27: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

27

Interface

• Two native ways to access the system- Java API

• For interfacing with our Control Module

- Web services• For access from external and web-based systems

• Synergy with our NASA ACCESS project

- Bringing in semantics with web-based access to data and processing capabilities

Page 28: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

28

Challenges

• Needed more background research than anticipated to survey the field and pick the components that fit our requirements

• Can be a fairly complex undertaking - we address it by focusing on our requirements and use cases

Page 29: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

29

Status

• Ontology definition completed• Knowledge Management System populated with

initial knowledge that satisfies our use case scenarios

• Sesame server up and running together with the OWLIM reasoner

• All initial queries tested against the system using both web-service and Java API access

Page 30: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

30

Backup Slides

Page 31: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

31

Objective: Building the Control Module

• Goals• Design Considerations• Interfaces

- Data Acquisition

- Data Processing

- Knowledge Management System

Page 32: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

32

Control Module Function

• Provides a logical flow of the process of retrieving similar datasets as well as matching them with appropriate models and/or execution workflows and executing these workflows- Given a set of anomalies perform a sequence of pre-defined high-

level actions that will automatically prepare datasets for analysis and further processing

- It doesn’t describe the workflows to be executed on each of the datasets, but rather have the ability to execute a pre-defined workflow

• Depending on the anomaly classification, we will have a separate system for defining which workflow gets triggered (Year 3)

Page 33: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

33

Dataset Name Resolution

• Because the ontology describes the data and their relationships in different levels of abstraction, we need a process to map these into concrete datasets that we can retrieve- For example LAI (Leaf Area Index) can map to a number of

datasets in the system, but we need to be able to distinguish between these dataset in order to retrieve them and provide them to workflows

- The KM system can tell us what are the possible sources of LAI, but not which one was used for a specific product - that is product-level metadata available through product database

• Need to map between ontology data identifiers (URI) and TOPS internal data identifiers (URN)

Page 34: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

34

Dataset Mapping

• Physical data are stored in traditional relational databases- We have a large tested production system that could be

expensive to transform into an ontology based database• Outside the scope of this project

- Contains specific links to the data files that are needed during processing

- Database id’s are URNs

Page 35: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

35

Dataset Mapping( 2 )

• General data semantics is stored in the Ontology-based system (Knowledge Management System )

- These contains information that describes the data at more abstract level

• NPP is derived from LAI or LAI is a satellite data

• MODIS is a source of LAI data

• Doesn’t provide link to specific dataset, only information on how one dataset relates to another and how do they fit into the community naming conventions

- These descriptions are needed to “reason” about the analysis process, to find related data, or data that match inputs to processes

Page 36: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

36

Dataset Name Resolution Example

• The system reports anomaly in Photosynthesis (GPP) product from MODIS-Terra- This is TOPS id (URN) in the relational database:

• urn:x-tops:def:phenomenon:NASA:gpp:5.0:1• Compatible with OGC URN naming conventions

- Within OWL/RDF, data are referenced using URIs• http://ecocast.arc.nasa.gov/TOPSKnowledgeBase.owl#GPP• Use it to find what we need

- Input datasets - LAI, Temperature, …- For each of the dataset, find possible sources (Terra, Aqua, weather

network, …)• Convert combination of dataset (LAI) and source (Terra) from their

URIs to match TOPS URN

- For example: urn:x-tops:def:phenomenon:NASA:lai:5.0:2

Page 37: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

37

Dataset Name Resolution Notes

• Rather than combining the information into the Knowledge Management System, keep it separate- Keeps everything simpler

• Easier to add knowledge to the system• One of our main goals is to get other people involved - more feasible, when the

representation of the knowledge contains only the level of abstraction needed to be expressive enough

- Keep data acquisition and knowledge management at different levels of abstraction• KM System needs to capture information in terms of general data products

(i.e. variable dependencies) (NPP and LAI)

• Data Management System needs specific information to be able to retrieve the dataset

- Need to know which product(s) to request

Page 38: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

38

Control Module Flow• Input are a set of anomalies generated by:

- Routine model runs (TOPS, WRF, Landsat Anomaly Pipeline)• With many products, we already generate anomaly information

- On-demand anomaly detection runs• Algorithm testing and development• In response to external processing (NWS weather warning, …)

• For each of the reported anomalies:- Check the QA information of the dataset if available- Using KM System, find possible causes and check them against external databases of

known events (fire, development, …)- Find related/similar datasets and check for anomalies at the same locations

• Similar/related = for example temperature from satellite or ground stations• Re-run the anomaly detection algorithms with different datasets• The datasets could be in different resolutions (MODIS -> Landsat)

- Do the same analysis for datasets used to derive the variable(s), where anomaly is observed

- If datasets is model output, re-run model with different datasets and re-check for anomaly, possibly follow by picking better higher-resolution model with known good performance over selected region

Page 39: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

39

Design Considerations

• Many design considerations for the future components of the system- Anomaly selection

- Performance optimization

- On-demand process scheduling

Page 40: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

40

Anomaly Selection

• How do you pick the top anomalies for analysis- Have a ranking mechanism

• Number of possible ranking heuristics- Size of the area impacted by the anomaly- Potential economic impact (proximity to population center, …)

• Potentially a good role for the Knowledge Management System = potential impact given anomaly magnitude and a set of related variables

- More likely combination of the two above (i.e. area size and potential impact)

- Can we define a general Ecosystem anomaly score?• Score number of different anomalies across the region and add them up for all

locations

- Creates general anomaly heat-map- In itself probably a separate research topic - we’ll keep it simple, but have the

design to accommodate for this if needed.

Page 41: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

41

Performance Optimization

• Given a limited set of resources, we may want to perform the operations of anomaly verification in a more optimal pre-defined set of steps that would better utilize the resources:- For example:

• First check the data quality flags, if available (cheap)

• Check for known events in external database (fairly cheap)

• Re-run anomaly on similar/related datasets (could be expensive)

• Re-run anomaly on derived products (even more expensive)

• Re-run number of models with finer resolution datasets (could get really expensive)

- Need to define the exit criteria from this process

Page 42: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

42

Interfaces

• Data Acquisition- Data Manager component developed under AIST Sensor Web

project- Plug-in framework for number of data acquisition protocols (OGC

SOS, FTP, TOPS-DB)- Both local and web services capability- Need URI -> URN translation for dataset resolution

• Process Execution- Process Manager component developed under AIST Sensor Web

project- Based on OGC SensorML protocol

• Knowledge Management- Java API and partially operational web services interface

Page 43: 1 Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi

43

Challenges - Control Module

• Decision making in Control Module can easily cross into a research in planning and scheduling- Focus on use cases and simple heuristics first,

rather than going in depth into the planning and scheduling problems