20
CLCAR, Panamá 2012 A Framework for High Performance Image Analysis Over Cloud Resources 1 A Framework for High Performance Image Analysis Pipelines over Cloud Resources Raúl Ramos-Pollán, Ángel Cruz-Roa, Fabio González-Osorio Bioingenium Research Group Universidad Nacional de Colombia

A Framework for High Performance Image Analysis Pipelines ... · CLCAR, Panamá 2012 A Framework forHigh Performance ImageAnalysisOverCloud Resources 1 A Framework for High Performance

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A Framework for High Performance Image Analysis Pipelines ... · CLCAR, Panamá 2012 A Framework forHigh Performance ImageAnalysisOverCloud Resources 1 A Framework for High Performance

CLCAR, Panamá 2012

A Framework for High Performance Image Analysis Over Cloud Resources

1

A Framework for High Performance ImageAnalysis Pipelines over Cloud Resources

Raúl Ramos-Pollán, Ángel Cruz-Roa, Fabio González-Osorio Bioingenium Research Group

Universidad Nacional de Colombia

Page 2: A Framework for High Performance Image Analysis Pipelines ... · CLCAR, Panamá 2012 A Framework forHigh Performance ImageAnalysisOverCloud Resources 1 A Framework for High Performance

CLCAR, Panamá 2012

A Framework for High Performance Image Analysis Over Cloud Resources

2

Motivation

• Bioingenium research group

www.bioingenium.unal.edu.co

• Large scale image processing pipelines

• Initially focused on medical imaging

• Support for machine learning processes

• Seldom availability of computing resources

• Limitation on collections size, applicability,algorithms design

1 Motivation

Page 3: A Framework for High Performance Image Analysis Pipelines ... · CLCAR, Panamá 2012 A Framework forHigh Performance ImageAnalysisOverCloud Resources 1 A Framework for High Performance

CLCAR, Panamá 2012

A Framework for High Performance Image Analysis Over Cloud Resources

3

Strategy

• Build on experiences with Hadoop / Grid, etc.

• Profit from whatever resources available

• Decouple algorithm design from deployment

• Adopt Big Data principles and technologies

• Streamline software development process

• Unify coherent algorithms repository

1 Motivation

Page 4: A Framework for High Performance Image Analysis Pipelines ... · CLCAR, Panamá 2012 A Framework forHigh Performance ImageAnalysisOverCloud Resources 1 A Framework for High Performance

CLCAR, Panamá 2012

A Framework for High Performance Image Analysis Over Cloud Resources

4

Image Processing Pipelines

2 Image Processing Framework

PATCH EXTRACTION

INPUTIMAGES

FEATURES EXTRACTION

FEATURESCLUSTERING

LATENTSEMANTICS

AUTOMATIC ANNOTATION

Page 5: A Framework for High Performance Image Analysis Pipelines ... · CLCAR, Panamá 2012 A Framework forHigh Performance ImageAnalysisOverCloud Resources 1 A Framework for High Performance

CLCAR, Panamá 2012

A Framework for High Performance Image Analysis Over Cloud Resources

5

Image Processing Framework

2 Image Processing Framework

NOSQL DATABASE

CLIENT at EXPERIMENTER’S

DESKTOP

SCHEDULE GENERATOR

EXPERIMENTER

SCHEDULE

PIPELINE DEFINITION

STAGE 1 STAGE 2

STAGE 3STAGE 4

INPUTOUTPUT

DATA

WORKERS

(AMAZON, DESKTOPs)

WORKERADDED LATER

WORKER

WORKER

Page 6: A Framework for High Performance Image Analysis Pipelines ... · CLCAR, Panamá 2012 A Framework forHigh Performance ImageAnalysisOverCloud Resources 1 A Framework for High Performance

CLCAR, Panamá 2012

A Framework for High Performance Image Analysis Over Cloud Resources

6

Pipeline definition example

2 Image Processing Framework

######################################## FIRST STAGE: Patch Sampling#######################################stage.01.task: ROIsFeatureExtractionTaskstage.01.numberOfPartitions: 10stage.01.roiAlgorithm: RandomPatchExtractorstage.01.feAlgorithm: GrayHistogram

stage.01.RandomPatchExtractor.blockSize: 18stage.01.RandomPatchExtractor.Size: 256

stage.01.input.table: corel5k.imgs

stage.01.output.table: corel5k.sample

############################################ SECOND STAGE: CodeBook Construction###########################################stage.02.task: KMeans

stage.02.KMeans.kValue: 10stage.02.KMeans.maxNumberOfIterations: 5stage.02.KMeans.minChangePercentage: 0.1stage.02.KMeans.numberOfPartitions: 10

stage.02.input.table: corel5k.sample

stage.02.output.table: corel5k.codebook

##################################### THIRD STAGE: Bag of Features Histograms####################################stage.03.task: BagOfFeaturesExtractionTask

stage.03.numberOfPartitions: 10stage.03.maxImageSize: 256stage.03.minImageSize: 20stage.03.indexCodeBook: truestage.03.featureExtractor: GrayHistogramstage.03.patchExtractor: RegularGridPatchExtractorstage.03.spatialLayoutExtractor: IdentityROIExtractorstage.03.codeBookID: STAGE-OUTPUT 2

stage.03.RegularGridPatchExtractor.blockSize: 18stage.03.RegularGridPatchExtractor.stepSize: 9

stage.03.input.table: corel5k.imgs

stage.03.output.table: corel5k.histograms

SCHEDULE

OF JOBS

Page 7: A Framework for High Performance Image Analysis Pipelines ... · CLCAR, Panamá 2012 A Framework forHigh Performance ImageAnalysisOverCloud Resources 1 A Framework for High Performance

CLCAR, Panamá 2012

A Framework for High Performance Image Analysis Over Cloud Resources

7

Framework architecture

2 Image Processing Framework

ITERATIVE

DATA PARTITION

CORE

schedule

generator

data

management

jobs wrapper

WO

RK

ER

CMD LINETOOL

STORAGE API

DYNAMO DB IMPL

KMEANS

BAG of FEATURES

JWS LAUNCHER

CMD LINE LAUNCHER

TASKS API

HBASEIMPL

FILE SYSTEMIMPL

AMZON VM LAUNCHER

EXPERIMENTER

UNDERLYING NoSQL STORAGE

TASK PATTERN API

SCHEDULES

INPUT/OUTPUT DATASETS

WEB CONSOLE

WORKER API

CLIENT API

Page 8: A Framework for High Performance Image Analysis Pipelines ... · CLCAR, Panamá 2012 A Framework forHigh Performance ImageAnalysisOverCloud Resources 1 A Framework for High Performance

CLCAR, Panamá 2012

A Framework for High Performance Image Analysis Over Cloud Resources

8

Algorithms patterns

2 Image Processing Framework

BEFORE PROCESSING ALL PARTITIONS

AFTER PROCESSING ALL PARTITIONS

SA

ME

PR

OC

ES

S

START PARTITION

FINALIZE PARTITION

PROCESS DATA ITEM

SA

ME

PR

OC

ES

S

START PARTITION

FINALIZE PARTITION

PROCESS DATA ITEM …

PARTITION 1 PARTITION 2

SA

ME

PR

OC

ES

S

START PARTITION

FINALIZE PARTITION

PROCESS DATA ITEM

PARTITION N

INPUTDATA

PARTITION

OUTPUTDATA

PARTITION

PROCESS DATA ITEM

BEFORE PROCESSING ALL PARTITIONS

AFTER PROCESSING ALL PARTITIONS

PA

RT

ITIO

N 1

PA

RT

ITIO

N 2

PA

RT

ITIO

N N

FINALIZE ITERATION

START ITERATION

DATA PARTITIONITERATIVE ALGORITHMSONLINE ALGORITHMS

Page 9: A Framework for High Performance Image Analysis Pipelines ... · CLCAR, Panamá 2012 A Framework forHigh Performance ImageAnalysisOverCloud Resources 1 A Framework for High Performance

CLCAR, Panamá 2012

A Framework for High Performance Image Analysis Over Cloud Resources

9

Goals for Experiments on the Cloud

• After using our framework in an opportunistic

setting

• Understand how to deploy our framework on

Amazon Cloud Services

• Validate transparency for the experimenter

• Understand cost patterns

3 Experimentation

Page 10: A Framework for High Performance Image Analysis Pipelines ... · CLCAR, Panamá 2012 A Framework forHigh Performance ImageAnalysisOverCloud Resources 1 A Framework for High Performance

CLCAR, Panamá 2012

A Framework for High Performance Image Analysis Over Cloud Resources

10

Cloud deployment

3 Experimentation

EXPERIMENTER

DynamoDBS3

LOCAL FRAMEWORK

PIPELINES

SCHEDULES

DATASETS

WORKER

WORKER

WORKER

WEB BROWSERAMAZON API

AMAZONLOCAL

NETWORKWORKER

WORKER

MANAGE WORKERS

(LAUNCH, SHUTDOWN)

AMAZON CLOUD

Page 11: A Framework for High Performance Image Analysis Pipelines ... · CLCAR, Panamá 2012 A Framework forHigh Performance ImageAnalysisOverCloud Resources 1 A Framework for High Performance

CLCAR, Panamá 2012

A Framework for High Performance Image Analysis Over Cloud Resources

11

Cloud usage

3 Experimentation

# prepare files with AWS credentials and desired pipeline

> load.imgs images /home/me/clef/images

306,530 files loaded into table images

> pipeline.load bof.pipeline

pipeline successfully loaded. pipeline number is 4

> pipeline.prepare 4

pipeline generated 56 jobs

> aws.launchvms ami-30495 15

launching 15 instances of ami-30495

> pipeline.info 4

…..

> aws.termvms ami-30495 5

terminating 5 instances of ami-30495

> pipeline.prepare 4

Page 12: A Framework for High Performance Image Analysis Pipelines ... · CLCAR, Panamá 2012 A Framework forHigh Performance ImageAnalysisOverCloud Resources 1 A Framework for High Performance

CLCAR, Panamá 2012

A Framework for High Performance Image Analysis Over Cloud Resources

12

Experiments setup

3 Experimentation

simple experiments before committing budget for Amazon

500KB images

Amazon EC2 t1.micro instances (613MB 1 virtual core)

fe-simple simmulated image feature extraction 0-2 secs

fe-complex simmulated image feature extraction 5-8 secs

Page 13: A Framework for High Performance Image Analysis Pipelines ... · CLCAR, Panamá 2012 A Framework forHigh Performance ImageAnalysisOverCloud Resources 1 A Framework for High Performance

CLCAR, Panamá 2012

A Framework for High Performance Image Analysis Over Cloud Resources

13

3 Experimentation

FE SIMPLE

FE COMPLEX

Experiments result 100 imgs

Page 14: A Framework for High Performance Image Analysis Pipelines ... · CLCAR, Panamá 2012 A Framework forHigh Performance ImageAnalysisOverCloud Resources 1 A Framework for High Performance

CLCAR, Panamá 2012

A Framework for High Performance Image Analysis Over Cloud Resources

14

3 Experimentation

FE SIMPLE

FE COMPLEX

Experiments results 1000 imgs

Page 15: A Framework for High Performance Image Analysis Pipelines ... · CLCAR, Panamá 2012 A Framework forHigh Performance ImageAnalysisOverCloud Resources 1 A Framework for High Performance

CLCAR, Panamá 2012

A Framework for High Performance Image Analysis Over Cloud Resources

15

1000 imgs

3 Experimentation

100 imgs

Experiments scalability

Page 16: A Framework for High Performance Image Analysis Pipelines ... · CLCAR, Panamá 2012 A Framework forHigh Performance ImageAnalysisOverCloud Resources 1 A Framework for High Performance

CLCAR, Panamá 2012

A Framework for High Performance Image Analysis Over Cloud Resources

16

3 Experimentation

Additional experimentsImages from ImageCLEF2012 medical challenge.

http://www.imageclef.org/2012/medical

306,530 medical images

from journal papers in biomedical sciences

x-rays, MRI, PET, diagrams, ultrasound, microscopy, gammagraphy, …

mostly in JPEG format at medium resolution (avg width 512 pixels).

Page 17: A Framework for High Performance Image Analysis Pipelines ... · CLCAR, Panamá 2012 A Framework forHigh Performance ImageAnalysisOverCloud Resources 1 A Framework for High Performance

CLCAR, Panamá 2012

A Framework for High Performance Image Analysis Over Cloud Resources

17

3 Experimentation

Additional experiments

CEDD EXTRACTION OF 3K ImageCLEF Medical 3 STAGE PIPELINE ON 30K

ImageCLEF Medical

1 Patch extraction

2 Kmeans – codebook

3 BoF Representation

Page 18: A Framework for High Performance Image Analysis Pipelines ... · CLCAR, Panamá 2012 A Framework forHigh Performance ImageAnalysisOverCloud Resources 1 A Framework for High Performance

CLCAR, Panamá 2012

A Framework for High Performance Image Analysis Over Cloud Resources

18

3 Experimentation

Additional experiments• Submission to ImageCLEF2012 medical

• All 300K images processed and annotated

• 36 workers over 6 scattered servers

• One run (parameters configuration)

• 116 mins elapsed time

• 40 hours compute time

• Ad-hoc image retrieval task:

• 1st out of 53 in textual classification

• 3rd out of 36 in visual classification

http://www.imageclef.org/2012/medical

Page 19: A Framework for High Performance Image Analysis Pipelines ... · CLCAR, Panamá 2012 A Framework forHigh Performance ImageAnalysisOverCloud Resources 1 A Framework for High Performance

CLCAR, Panamá 2012

A Framework for High Performance Image Analysis Over Cloud Resources

19

Conclusions• Tool for supporting the full image processing lifecycle

• First scalability behavior on Amazon Cloud

• Decoupled algorithm design from deployment

• Adapted to our reality � grasp whatever computerresources available

• Streamlined internal software process

• Unified software repository

• AGILE EXPERIMENTATION LIFECYCLE

• LITTLE A PRIORI KNOWLEDGE ON COMPUTING RESOURCES AVAILABLE

4 Conclusions

Page 20: A Framework for High Performance Image Analysis Pipelines ... · CLCAR, Panamá 2012 A Framework forHigh Performance ImageAnalysisOverCloud Resources 1 A Framework for High Performance

CLCAR, Panamá 2012

A Framework for High Performance Image Analysis Over Cloud Resources

20

Future work

• Extend support to additional machine learning processes

• Larger scalability experiments (# workers)

• Optimize Amazon usage (DynamoDB)

• Better understand Amazon costs patterns

• Experience with NoSQL scalability (HBase)

• Better understand limitations of each deployment model

(opportunistic, cluster, cloud)

• Move to >1M image datasets

4 Conclusions