Aspire Document Processing


2 Document Processing – “Aspire”

• Very High Performance
• Structured Document Processing Architecture
• Dynamic configuration and deployment
• Based on Open Source Technologies
• Well Supported (wiki, javadoc)
• Administration interface built-in
• Vendor Neutral (CMS and search engine)


3 Top-Level Overview

[Diagram: within Aspire, Data Sources → Feeders → Document Processing Pipelines → Indexing → Index]
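The flow named on the overview slide (data sources, feeders, document processing pipelines, indexing, index) can be sketched in Java as a feeder handing documents to an ordered list of pipeline stages, with the results collected into an index. This is a minimal illustration only: `PipelineSketch`, `newDocument`, `trimContent`, `addHost`, `process`, and `defaultStages` are hypothetical names, not Aspire's API, and a plain in-memory list stands in for the search engine index.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch; none of these names come from Aspire.
final class PipelineSketch {

    /** Feeder output: a document modeled as a simple field map. */
    static Map<String, String> newDocument(String url, String content) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", url);
        doc.put("content", content);
        return doc;
    }

    /** Pipeline stage: trims surrounding whitespace from the content field. */
    static Map<String, String> trimContent(Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>(doc);
        out.put("content", doc.get("content").trim());
        return out;
    }

    /** Pipeline stage: adds a host field derived from the url field. */
    static Map<String, String> addHost(Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>(doc);
        out.put("host", URI.create(doc.get("url")).getHost());
        return out;
    }

    /** Runs the stages in order, feeding each stage's output to the next. */
    static Map<String, String> process(Map<String, String> doc,
                                       List<UnaryOperator<Map<String, String>>> stages) {
        Map<String, String> current = doc;
        for (UnaryOperator<Map<String, String>> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    /** The two example stages, in the order they should run. */
    static List<UnaryOperator<Map<String, String>>> defaultStages() {
        List<UnaryOperator<Map<String, String>>> stages = new ArrayList<>();
        stages.add(PipelineSketch::trimContent);
        stages.add(PipelineSketch::addHost);
        return stages;
    }

    public static void main(String[] args) {
        // Stand-in for the search engine index.
        List<Map<String, String>> index = new ArrayList<>();
        index.add(process(newDocument("http://example.com/a.html", "  Hello  "),
                          defaultStages()));
        System.out.println(index);
    }
}
```

Keeping each stage as a small function over the document is one way to get the re-use the slides describe: a new collection would differ only in which stages are listed and in what order.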

4 Components In Aspire (today)

[Diagram: Aspire component catalog. Labels in extraction order; the grouping of items under group headings is not recoverable from the extracted text.]

• Common Resources
• Content Control DB
• SubJob Extractors
• Unload ARC Files
• Unload CSV
• Component Manager
• Pipeline Manager
• Metadata Manipulation
• Text Extraction
• Date Chooser
• Split Multi-valued data
• Host to Domain
• Groovy Scripting
• JDBC Connection
• Feeders
• RSS
• Hot Folder
• Single Page
• RDB
• Enhancers
• Get CCD Metadata
• RDB Enhancer
• Output
• Push XML to REST
• Error Job Handler
• Debug Output
• JMS
• RDB Unloader
• Feed One
• Fetch URL
• Category Tagger
• Content Boost

5 Functions Handled by Aspire

• Threading
• Collection Deployment
• Error handling and notification
    • Including individual sub-job notifications
• Collection Configuration
• Component Scripting
• Job Processing
• Admin I/F, performance, live system status

6 Benefits

• Much lower lifecycle cost
• File processing no longer an ad hoc collection of Java objects and methods
• Encourages re-use of components
• New collections with no programming
    • Just re-configure existing components
• Flexibility: deploy collections individually
• Much better visibility into the file processing internals, performance, and queuing

7 Typical Installation Structure

[Diagram labels: Machine #1, Machine #2; Crawler, Aspire (other feeders and doc processing), Search Engine]

8 Aspire Architecture and Components: Details

9 Top-Level Component Architecture

10 Aspire and OSGi Components

[Diagram: Aspire Component (manufactured by) Aspire Component Factory; Aspire Component Factory (is a) OSGi Bundle; OSGi Bundle (is a) Java Jar File]
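The "Manufactured By" relationship on this slide can be illustrated with a small factory in Java. This is a sketch under stated assumptions, not Aspire's or OSGi's actual API: `SketchComponent`, `SketchComponentFactory`, `register`, and `create` are hypothetical names, and the bundle is represented only by the name of the jar file that would carry the factory.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch; these types are not Aspire's or OSGi's API.
interface SketchComponent {
    String describe();
}

final class SketchComponentFactory {
    // Name of the jar (bundle) that would package this factory.
    private final String bundleName;
    // Component type name -> recipe that manufactures a component of that type.
    private final Map<String, Supplier<SketchComponent>> makers = new LinkedHashMap<>();

    SketchComponentFactory(String bundleName) {
        this.bundleName = bundleName;
    }

    String bundleName() {
        return bundleName;
    }

    /** Teaches the factory how to manufacture one type of component. */
    void register(String type, Supplier<SketchComponent> maker) {
        makers.put(type, maker);
    }

    /** Manufactures a component of the requested type, or fails if the type is unknown. */
    SketchComponent create(String type) {
        Supplier<SketchComponent> maker = makers.get(type);
        if (maker == null) {
            throw new IllegalArgumentException(
                "Unknown component type '" + type + "' in " + bundleName);
        }
        return maker.get();
    }

    public static void main(String[] args) {
        SketchComponentFactory factory = new SketchComponentFactory("example-bundle.jar");
        factory.register("dateChooser", () -> () -> "chooses a date");
        System.out.println(factory.create("dateChooser").describe());
    }
}
```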

11 The Contents of a Bundle/Component Factory

12 Component and Factory Details

13

14 Aspire Sample Configurations

15 Web Site Crawler / Search

16 Processing CSV Files

17 RSS Feeds, Single Pages

18 Aspire Deployment

19 Deployment

• Architected to the latest deployment standards
• Distribution Archetypes
• Component Repositories
• Redeploy collections independently
    • In a live running system
• Redeploy and update components
    • In a live running system
• Ready for the cloud


20 Deployment Structure

[Diagram labels: Aspire; Resources; Collection Config (six instances); Feeders & Pipelines; Administrator; load/reload configuration; Configuration Control; re-useable components; Component Repository]

21 Deployment Implications

• Collections are configured independently
• Collections use standard components
• Can be dynamically and remotely deployed

[Diagram labels: Remote System; Aspire (always running); Collection Config; load remote configurations; remote admin control]
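The idea that each collection is configured independently and can be loaded or reloaded while the system keeps running can be sketched in Java as a thread-safe registry keyed by collection name. This is an illustration under assumptions, not Aspire's configuration mechanism: `CollectionRegistrySketch`, `loadOrReload`, and `setting` are hypothetical names, and a configuration is reduced to a map of string settings.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; not Aspire's configuration API.
final class CollectionRegistrySketch {
    // Collection name -> its current settings. ConcurrentHashMap lets an
    // administrator thread swap one entry while worker threads keep reading.
    private final Map<String, Map<String, String>> configs = new ConcurrentHashMap<>();

    /**
     * Loads or reloads a single collection; other collections are untouched.
     * Returns true when an existing configuration was replaced.
     */
    boolean loadOrReload(String collection, Map<String, String> settings) {
        return configs.put(collection, Map.copyOf(settings)) != null;
    }

    /** Returns one setting of a collection, or null if either is unknown. */
    String setting(String collection, String key) {
        Map<String, String> config = configs.get(collection);
        return config == null ? null : config.get(key);
    }

    /** Names of all currently loaded collections, sorted. */
    Set<String> collections() {
        return new TreeSet<>(configs.keySet());
    }

    public static void main(String[] args) {
        CollectionRegistrySketch registry = new CollectionRegistrySketch();
        registry.loadOrReload("news", Map.of("feeder", "RSS"));
        registry.loadOrReload("files", Map.of("feeder", "Hot Folder"));
        // Reload only the "news" collection; "files" keeps its configuration.
        registry.loadOrReload("news", Map.of("feeder", "Single Page"));
        System.out.println(registry.collections());
    }
}
```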
