Extension of caGrid Federated Query for Large Heterogeneous Data Services
Eta S. Berner, EdD; Elliot Lefkowitz, PhD; John David Osborne, MS; Harsh Taneja, MS; Niveditha Thota, MS; Curtis Hendrickson; Don Dempsey, MS; Matthew Wyatt, MSHI
John-Paul Robinson; Poornima Pochana, MS; Shantanu Pavgi, MS
Geoff Gordon, MS; Tim Day, PhD; Greg Fuller
Objectives
• Background
• Customization of caGrid stack
  • Scaling for Large Dataset
  • Optimization of Query
  • Query Chunking in FQP
  • WS-Enumeration in Client (Controller), FQP & Data Services
• Outstanding Issues
• Summary
Background
• UAB has developed a custom “Cohort Discovery” tool
  • Query based upon: Age, Race, Gender, Labs, Diagnosis, Procedure
  • Aggregate results (numbers) stratified by: Age, Race, and Gender
• Two caCORE SDK generated data services
  • Administrative Data (demographics etc.)
    • Patient table with simple demographics (~700 K)
    • Diagnosis, Encounter, Procedures (~12 M)
  • Labs (Lab Results)
    • Patient table (~700 K)
    • Lab Result table (~185 M)
• Federated Query Processor (modified 1.3 Snapshot)
  • Controller generates DCQL for FQP that always targets the Admin System’s patient table and (optionally) labs
  • MRN is the identifier that links the Admin System’s patient data to lab results
Aggregate Cohort Estimator (ACE)
Query constraints could be:
• Age, Race, Gender
• Labs, Diagnosis, Procedures
Results can be grouped by:
1. Counts
2. Gender
3. Race
4. Age
5. Race * Gender
6. Race * Age
7. Gender * Age
8. Race * Gender * Age
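The eight groupings listed above are exactly the subsets of the three demographic dimensions, with the empty subset giving the plain count. A minimal sketch of that idea follows. The field names and the toy patient records are made up for illustration and do not come from ACE itself.

```python
from collections import Counter
from itertools import combinations

# Hypothetical field names; the real ACE schema is not shown in the slides.
DIMENSIONS = ("gender", "race", "age_band")


def groupings(dimensions=DIMENSIONS):
    """Return every subset of the dimensions, from () (plain count) to all three."""
    return [
        combo
        for size in range(len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


def stratified_counts(patients, group_by):
    """Count patients per combination of values for the chosen dimensions."""
    return Counter(tuple(p[d] for d in group_by) for p in patients)


# Toy records, invented for this example only.
patients = [
    {"gender": "F", "race": "R1", "age_band": "40-49"},
    {"gender": "F", "race": "R2", "age_band": "40-49"},
    {"gender": "M", "race": "R1", "age_band": "50-59"},
]

for group_by in groupings():
    print(group_by or "count", dict(stratified_counts(patients, group_by)))
```

Only aggregate counts leave the function, which matches the slide's description of results as numbers rather than patient-level rows.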
ACE Result Screens
Architectural overview
[Diagram. Components shown: User Interface; Shibboleth (AuthN & AuthZ); Controller (RESTful Web Service) with DCQL Generator and Controller DB; Federated Query Processor (FQP, internal); Grid Data Services: Admin System (~12 M) and Labs (~185 M); UAB Data Center VLAN (private).]
Problem – Customization of caGrid Stack
1. Scaling for Large Dataset
2. Optimization of Query
3. Query Chunking in FQP
4. WS-Enumeration in Client (Controller), FQP & Data Services
Scaling for Large Dataset
• Timeout was overridden to 24 hrs in FQP & Data Services
• Row count limit was increased from 1K to 1M in Data Services
• DCQL was restructured in the Controller to avoid table space overflow errors due to Cartesian joins
  • This occurs only as a result of "AND" statements
  • It occurs only when the row count is high
  • Restructuring was not required against the Admin System (12 M vs 185 M in Labs)
  • Nor was it required for “OR” queries against Labs, which can run with a join-free SQL statement
• FQP should be able to analyze DCQL and run it efficiently, similar to how a relational database query analyzer does
Before and After the Restructuring
Before:
Foreign Association
  Group: AND
    Attribute: Lab A
    Attribute: Lab C
    Attribute: Lab B

After:
Foreign Association
  Group: AND
    Association: Lab A
    Foreign Association
      Group: AND
        Association: Lab B
        Foreign Association
          Association: Lab C

[Diagram: sets A, B, and C]
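The "After" shape can be read as one foreign association per lab, each nested inside the previous one, instead of a single AND group over all three labs. The sketch below builds that nesting from a list of lab names. The dictionary layout and the function name `nest_labs` are assumptions made for illustration; they are not the DCQL schema or the Controller's actual code.

```python
def nest_labs(labs):
    """Build a nested structure with one level per lab.

    Each level holds one lab association; every level except the innermost
    also carries an AND group and a nested foreign association.
    The dict layout is hypothetical and only mirrors the slide's outline.
    """
    node = None
    for lab in reversed(labs):
        level = {"association": lab}
        if node is not None:
            level["group"] = "AND"
            level["foreign_association"] = node
        node = level
    return node


def depth(node):
    """Number of nested levels in a structure built by nest_labs."""
    levels = 0
    while node is not None:
        levels += 1
        node = node.get("foreign_association")
    return levels


query = nest_labs(["Lab A", "Lab B", "Lab C"])
print(query)
print(depth(query))
```

With this shape, each level can be run as its own sub-query whose identifiers feed the next one, rather than being combined in a single multi-way join.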
Query Optimization
[Diagram: exchange between the Federated Query Processor and the Grid Data Service]
1. Query 1; Response 1 = 250 K
2. Query 2 + 250 K; Response 2 = 100 K
3. Query 3 + 100 K; result = 50 K
Query Optimization
Step 1: Controller pre-runs count-only CQL queries.
For example: Count(A) = 250K, Count(B) = 100K & Count(C) = 50K
Step 2: Reorder the DCQL query so that the most restrictive statements are executed first.
[Diagram: sets A (250K), B (100K), and C (50K)]
Query Optimization
[Diagram: exchange between the Federated Query Processor and the Grid Data Service]
1. Query 1; Response 1 = 50 K
2. Query 2 with 50 K; Response 2 = 50 K
3. Query 3 with 50 K; Response 3 = 50 K
Smallest-Data-Set-First reduces the size of all subqueries
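The reordering step can be sketched as a sort on the pre-computed counts. The code below uses the example counts from the slides and also estimates an upper bound on how many identifiers each later query has to carry. The function names are hypothetical, and the bound assumes each query can only narrow the identifier set handed to it.

```python
from itertools import accumulate


def order_smallest_first(counts):
    """Return constraint names ordered from most to least restrictive."""
    return sorted(counts, key=counts.get)


def ids_passed_upper_bound(order, counts):
    """Upper bound on identifiers passed into each query after the first.

    Assumes every query can only shrink the set it receives, so the set
    handed to the next query is at most the smallest count seen so far.
    """
    running_min = list(accumulate((counts[name] for name in order), min))
    return running_min[:-1]


# Counts taken from the slide example.
counts = {"A": 250_000, "B": 100_000, "C": 50_000}

original = ["A", "B", "C"]
optimized = order_smallest_first(counts)

print(optimized)                                   # ['C', 'B', 'A']
print(ids_passed_upper_bound(original, counts))    # [250000, 100000]
print(ids_passed_upper_bound(optimized, counts))   # [50000, 50000]
```

The two bounds line up with the slide diagrams: 250 K then 100 K identifiers in the original order, against 50 K at every step once the smallest set goes first.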
Problem with Large Sub Queries
• Problem: Too many identifiers (>300K MRNs from Labs in our case)
• FQP
  • Passes a huge OR clause down to the data service
• Data Services
  • Use Hibernate, which parses the OR clause recursively, thus blowing the stack for large results with typical JVM settings
  • Solution: fix both Hibernate and the JVM stack size setting
• Database
  • Chokes on large queries consisting of
    • WHERE IN (MRN1, MRN2, … MRNn) or
    • WHERE Attribute1 = value1 OR Attribute2 = value2 OR … AttributeN = valueN
  • No success with either Oracle or MySQL, even after adjusting settings like max packet size, etc.
Solutions - Query Chunking in FQP
• Introduced Query Chunking in FQP: it limits the number of MRNs in the WHERE clause of native queries at the database
• Controlled by a new “chunk size” parameter in FQP
• If any sub-CQLQuery returns more rows than the “chunk size”, the dependent query will be run N times, once per chunk

Number of CQL Queries (n) = Result Size (c) / Chunk Size (d), rounded up
e.g. Chunk Size (d) = 1000 & Result Size (c) = 10096
No. of CQL Queries (n) = ceil(10096 / 1000) = 11 CQL Queries {smallest with 96 parameters}

This resulted in successful completion of the complex query in a finite amount of time.
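The chunking arithmetic above can be checked with a short sketch. The helper names `chunk` and `run_chunked` are hypothetical and do not come from the FQP code. Combining per-chunk results by union is also an assumption, since the slides do not say how the partial results are merged.

```python
import math


def chunk(ids, chunk_size):
    """Yield consecutive slices of ids, each at most chunk_size long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(ids), chunk_size):
        yield ids[start:start + chunk_size]


def run_chunked(ids, chunk_size, run_query):
    """Run the dependent query once per chunk and merge the results.

    run_query stands in for whatever executes one dependent query with a
    bounded list of identifiers; merging by union is an assumption.
    """
    results = set()
    for part in chunk(ids, chunk_size):
        results.update(run_query(part))
    return results


# Slide example: result size c = 10096, chunk size d = 1000.
mrns = list(range(10096))
chunks = list(chunk(mrns, 1000))

print(len(chunks))                    # 11
print(len(chunks[-1]))                # 96
print(math.ceil(len(mrns) / 1000))    # 11
```

Each dependent query then carries at most `chunk_size` identifiers in its WHERE clause, at the cost of more round trips.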
Problem – XML Serialization and De-serialization is Expensive
• XML is used to deliver results of CQL queries
• A single XML result file is generated
• WS-Enumeration can break a result down into smaller file pieces, but
  • It was not used by FQP to query the grid data services
  • Data service, grid and FQP all serve WS-Enumeration requests by de-serializing the entire object in memory
  • The entire object is then written to disk as a resource to serve the client
Solution: WS-Enumeration in Client(Controller), FQP & Data Services
To utilize WS-Enumeration:
• Grid Data Services were generated with caGrid WS-Enumeration enabled.
• FQP: implemented new code to support WS-Enumeration.
• Controller: used the Federated Query Results Client’s Enumerate method.
Using WS-Enumeration end-to-end allowed transfer of larger data sets over SOAP from the Data Service to the ACE Controller.
[Diagram: WS-Enumeration Enabled Grid Data Service; Federated Query Processor; Controller]
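The benefit of enumerating results, as opposed to returning one large XML document, is that the consumer only ever holds one batch at a time. The sketch below illustrates that pull-in-batches pattern with a plain Python generator. It is not the WS-Enumeration protocol and not the caGrid client API; the function names and batch sizes are illustrative assumptions.

```python
from itertools import islice


def enumerate_in_batches(results, batch_size):
    """Yield lists of at most batch_size items pulled lazily from results."""
    iterator = iter(results)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_rows(source, batch_size):
    """Consume a result stream batch by batch without storing all rows."""
    total = 0
    for batch in enumerate_in_batches(source, batch_size):
        total += len(batch)
    return total


# A generator stands in for a large remote result set.
rows = (f"row-{i}" for i in range(2500))
print(count_rows(rows, 1000))
print(list(enumerate_in_batches(range(10), 4)))
```

Because the source is consumed lazily, peak memory on the consumer side depends on the batch size and not on the total result size.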
Non Standard Configuration Settings
• WS-Enumeration services returned ALL associations on the target object and generated lazy load exceptions
• David Ervin’s patch permitted lazy loading and prevented unwanted associations on the target object from being returned. This vastly reduced the size of returned results and the subsequent network overhead.
• Changed default JVM memory sizes for data services and FQP (currently 15 GB and 6 GB respectively)
• Turned off Ehcache as unsuitable for our application: caches consume memory and disk space.
Outstanding Issues
We did not resolve the translation of CQL containing Associations into efficient SQL. We worked around this by joining with Foreign Associations, whereas fixing the CQL-to-SQL translation would (theoretically) have been more appropriate.
Summary
• After several bug fixes, FQP is able to handle extremely large data sets.
• With customizations to the caGrid stack, we are able to use the technology to share information and analytical resources efficiently.
• With the ACE application built on the caGrid stack, we are able to facilitate inter-departmental data sharing within UAB.
Acknowledgements
Working with the caGrid Knowledge Center has been very helpful.
• Justin D. Permar, Senior Consultant, Biomedical Informatics; Director, Center for IT Innovations in Healthcare (CITIH)
• David W. Ervin, Biomedical Informatics Consultant; Center for IT Innovations in Healthcare, Team Manager
• William Stephens, Senior Biomedical Informatics Consultant; Center for IT Innovations in Healthcare, Team Manager
UAB Team
CCTS (CTSA)
• Lisa Guay-Woodford, MD (PI)
• Eta S. Berner, EdD (Director)
• Elliot Lefkowitz, PhD (Director)
• Matthew Wyatt, MSHI
• John David Osborne, MS
• R. Curtis Hendrickson
• Harsh Taneja, MS
• Niveditha Thota, MS
• Don Dempsey, MS
Health Systems Information Systems (HSIS)
• Geoff Gordon, MS (Web Development Director)
• Steve Osburne (IT)
• Terrell W Herzig (Data Security Officer)
• Tim Day, PhD
• Greg Fuller (GUI)
• Suresh Nair (DBA)
UAB Health System Data Resources Group
• Andy Matthews
• Stephen W Duncan
• Darlene Green, RN, DSN
UAB IT Research Computing
• John Paul Robinson (Lead)
• Poornima Pochana, MS
• Shantanu Pavgi, MS
Comprehensive Cancer Center
• John Sandefur, MBA, CISSP
FUNDING: UAB CCTS is funded through a CTSA grant (5UL1 RR025777)