PyCon UK 2008 (12.-14. September 2008, Birmingham)
Slide 1
PyCon UK 2008 > Andreas Schreiber > DataFinder > 12.09.2008
DataFinder: Organizing the Data Chaos of Scientists
PyCon UK 2008 (September 12th, 2008, Birmingham)
Andreas Schreiber <[email protected]>
German Aerospace Center (DLR), Cologne
http://www.dlr.de/sc
Slide 2
The DLR
German Aerospace Research Center
Space Agency of the Federal Republic of Germany
Slide 3
Sites and employees
5,700 employees working in 29 research institutes and facilities at 13 sites.
Offices in Brussels, Paris and Washington.
[Map labels: Köln, Lampoldshausen, Stuttgart, Oberpfaffenhofen, Braunschweig, Göttingen, Berlin-, Bonn, Trauen, Hamburg, Neustrelitz, Weilheim, Bremen-]
Slide 4
Short Overview
DataFinder is software for the efficient management of scientific and technical data
Focus on huge data sets
Developed by DLR
Primary functionality
Structuring of data through assignment of meta information and self-defined data models
Flexible usage of heterogeneous storage resources
Integration in the working environment
Slide 5
Introduction
DataFinder founded by DLR
National Grid project AeroGrid
Slide 6
Introduction
Background
Large-scale simulations
aerodynamics
material science
climate
…
Tons of measured data
wind-tunnel experiments
earth observations
traffic data
…
Slide 7
Introduction
Data Management Problem
Typical organizational situations
No central data management policy
Every employee organizes his/her data individually
Researchers spend about 30% of their time searching for data
Problem with data left behind by temporary staff
Increase of data size and regulations
Rapidly growing volume of simulation and experimental data
Legal requirements for long-term availability of data (up to 50 years!)
Situation similar at many organizations
All ~30 DLR institutes
Other research labs and agencies
Industry
Slide 8
DataFinder History
Search for a solution for scientific data management
Definition of “standard problem” (helicopter simulation)
Test case for evaluation of software
Evaluation of commercial product data management (PDM) systems
PDM systems could manage the data, but only at very high cost
PDM systems have many unneeded functions
PDM systems use proprietary or hard-to-read scripting languages for extension and customization (Tcl etc.)
Development of DataFinder
Lightweight data management client and existing server solution
Just enough functionality for our problems (no paid but unused features!)
Slide 9
DataFinder Development
From Java Prototype to Python Product…
Development of prototype in Java
Data could be managed successfully with the prototype
Drawbacks: Java problems on important platforms (e.g., SGI IRIX)
Embedded Jython interpreter great feature for users
User: “The Java GUI is like shit, but the Python scripting is great. We want a pure Python solution!”
Development of DataFinder product from scratch in Python
Slide 10
Python for Scientists and Engineers
Reasons for Python in Research and Industry
Observations:
Scientists and engineers don’t want to write software; they just want to solve their problems
If they have to write code, it must be as easy as possible
Why is Python perfect?
Very easy to learn and easy to use (= steep learning curve)
Allows rapid development (= short development time)
Inherently great maintainability
“I want to design planes, not software!”
Slide 11
“Python has the cleanest, most scientist- or engineer-friendly syntax and semantics.”
Paul F. Dubois
Paul F. Dubois. Ten good practices in scientific programming. Comp. in Sci. Eng., Jan/Feb 1999, pp. 7-11
Slide 12
DataFinder Overview
Basic Concept
Client-Server solution
Based on open and stable standards, such as XML and WebDAV
Extensive use of standard software components (open source / commercial), limited own development at client side
Slide 13
WebDAV
Web-based Distributed Authoring & Versioning
Extension of HTTP
Allows users to collaboratively manage files on remote servers
WebDAV supports
Resources (“files”)
Collections (“directories”)
Properties (“meta data”, in XML format)
Locking
WebDAV extensions
Versioning (DeltaV)
Access control (ACP)
Search (DASL)
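To make the "extension of HTTP" point more concrete, the following Python 3 sketch builds the XML body of a WebDAV PROPFIND request and reads resources, collections, and properties out of a Multi-Status answer, using only the standard library. The sample response, the /projects/ paths, and the helper names are invented for illustration; no server is contacted, and this is not DataFinder's own code.

```python
import xml.etree.ElementTree as ET

DAV = "DAV:"  # XML namespace of the core WebDAV elements (RFC 4918)
ET.register_namespace("D", DAV)


def build_propfind_body(property_names):
    """Return the XML body of a PROPFIND request asking for the given DAV: properties."""
    root = ET.Element("{%s}propfind" % DAV)
    prop = ET.SubElement(root, "{%s}prop" % DAV)
    for name in property_names:
        ET.SubElement(prop, "{%s}%s" % (DAV, name))
    return ET.tostring(root, encoding="unicode")


def parse_multistatus(xml_text):
    """Map each href of a 207 Multi-Status body to (is_collection, displayname)."""
    result = {}
    for response in ET.fromstring(xml_text).findall("{%s}response" % DAV):
        href = response.findtext("{%s}href" % DAV)
        collection = response.find(".//{%s}resourcetype/{%s}collection" % (DAV, DAV))
        name = response.findtext(".//{%s}displayname" % DAV)
        result[href] = (collection is not None, name)
    return result


# Hypothetical answer to "PROPFIND /projects/" sent with the header "Depth: 1".
SAMPLE_RESPONSE = """<D:multistatus xmlns:D="DAV:">
  <D:response>
    <D:href>/projects/</D:href>
    <D:propstat>
      <D:prop>
        <D:displayname>projects</D:displayname>
        <D:resourcetype><D:collection/></D:resourcetype>
      </D:prop>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
  <D:response>
    <D:href>/projects/result.dat</D:href>
    <D:propstat>
      <D:prop>
        <D:displayname>result.dat</D:displayname>
        <D:resourcetype/>
      </D:prop>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
</D:multistatus>"""

if __name__ == "__main__":
    print(build_propfind_body(["displayname", "resourcetype"]))
    for href, (is_collection, name) in sorted(parse_multistatus(SAMPLE_RESPONSE).items()):
        print(href, "collection" if is_collection else "file", name)
```

A collection is recognised by the DAV:collection element inside DAV:resourcetype; everything else is treated as a plain resource ("file").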
Slide 14
DataFinder Overview
Client and Server
Client
User client
Administrator client
Implementation: Python with Qt
Server
WebDAV server for meta data and data structure
Data Store concept
Abstracts access to managed data
Flexible usage of heterogeneous storage resources
Implementation: Various existing server solutions (third-party)
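The Data Store idea, one interface in front of heterogeneous storage, can be sketched in Python 3 as follows. The class and function names are hypothetical and are not taken from the DataFinder source; the in-memory store merely stands in for a remote backend so that the example runs offline.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class DataStore(ABC):
    """Hypothetical interface: the minimum every storage backend has to offer."""

    @abstractmethod
    def put(self, name, data):
        """Store the bytes `data` under `name`."""

    @abstractmethod
    def get(self, name):
        """Return the bytes stored under `name`."""


class FileSystemStore(DataStore):
    """Keeps every item as a file below a base directory."""

    def __init__(self, base_directory):
        self._base = base_directory

    def put(self, name, data):
        with open(os.path.join(self._base, name), "wb") as handle:
            handle.write(data)

    def get(self, name):
        with open(os.path.join(self._base, name), "rb") as handle:
            return handle.read()


class InMemoryStore(DataStore):
    """Stand-in for a remote backend (FTP, S3, ...) so the sketch runs offline."""

    def __init__(self):
        self._items = {}

    def put(self, name, data):
        self._items[name] = bytes(data)

    def get(self, name):
        return self._items[name]


def copy_item(name, source, target):
    """Client code only sees the DataStore interface, never the backend."""
    target.put(name, source.get(name))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        local, remote = FileSystemStore(directory), InMemoryStore()
        local.put("flow_solution.dat", b"example bytes")
        copy_item("flow_solution.dat", local, remote)
        print(remote.get("flow_solution.dat"))
```

Adding another storage resource then means adding one more subclass; code such as copy_item stays unchanged.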
Slide 15
DataFinder Client
Graphical User Interfaces
User Client: implementation in Python with Qt/PyQt
Administrator Client: implementation in Python with Qt/PyQt
Slide 16
DataFinder Server
Supported WebDAV servers
Commercial Server Solution
Tamino XML database (Software AG)
Open Source Server Solutions
Apache HTTP Web server and module mod_dav
Default storage: file system (mod_dav_fs)
Module Catacomb (mod_dav_repos) + Relational database
(http://catacomb.tigris.org)
Slide 20
Mass Data Storage
Data Stores
[Diagram labels: Logical View (Department, Employee, Simulation, Geometry, Grid Generation, Flow Solution, Visualisation); Meta Data Server; User Client; Data Access; Storage Locations (WebDAV Server, FTP/GridFTP Server, Tivoli Storage Manager, Storage Resource Broker, File System, Amazon S3, External Media (CD, DVD, …))]
Slide 21
DataFinder Technical Aspects
Access privilege management
Authentication using WebDAV and LDAP
Authorization for users and groups based on WebDAV (ACP)
Client available on many platforms
Linux, Windows, …
Restricted by availability of Python 2.5 and Qt 3 + PyQt
Extensible through Python scripts
Python application programming interface (API)
Accessing data and meta data
Slide 22
Python API
User Client Extension with GUI
import threading
from datafinder.application import search_support
from datafinder.gui.user import facade


def searchAndDisplayResult():
    """Searches and displays the result in the search result logging window."""
    query = "displayname contains 'test' OR displayname == 'ab'"
    result = search_support.performSearch(query)
    resultLogger = facade.getSearchResultLogger()
    for path in result.keys():
        resultLogger.info("Found item %s." % path)


# Run the search in its own thread
thread = threading.Thread(target=searchAndDisplayResult)
thread.start()
Slide 23
Python API
Command Line Example (without GUI)
# Get API
from datafinder.application import ExternalFacade
externalFacade = ExternalFacade.getInstance()

# Connect to a repository
# (username, password, startUrl, and baseDirectory must be defined by the caller)
externalFacade.performBasicDatafinderSetup(username, password, startUrl)

# Download the whole content
rootItem = externalFacade.getRootWebdavServerItem()
items = externalFacade.getCollectionContents(rootItem)
for item in items:
    externalFacade.downloadFile(item, baseDirectory)
Slide 24
Additional “Batteries”…
Used Libraries beyond the Python Standard Library (1)
PyQt (http://www.riverbankcomputing.co.uk/software/pyqt)
Interface to the Qt GUI framework (currently Qt 3)
Used for DataFinder UI layer
Pyparsing (http://pyparsing.wikispaces.com/)
Creating and executing simple grammars
Used for highlighting search expressions
python-ldap (http://python-ldap.sourceforge.net/)
Object-oriented API to access LDAP servers
Authentication against LDAP / ActiveDirectory server
paramiko (http://www.lag.net/paramiko)
SSH2 protocol implementation
Slide 25
Additional “Batteries”…
Used Libraries beyond the Python Standard Library (2)
PyGlobus (http://www-itg.lbl.gov/gtg/projects/pyGlobus)
Interface to The Globus Toolkit
Used for GridFTP Data Store
Boto (http://code.google.com/p/boto)
Interfaces to Amazon Web Services
Used for S3 (Simple Storage Service) Data Store
davlib (http://www.webdav.org/mod_dav/davlib.py)
WebDAV client library
Used for core WebDAV functions
Slide 26
WebDAV Client Library
Support for DAV Extensions
Provides an object-oriented interface for accessing WebDAV servers
Extracted from DataFinder source
WebDAV client-side library supports
Core WebDAV specification
Access Control Protocol
Basic Versioning (experimental)
DAV Searching and Locating
Secure HTTP connections
Implementation based on davlib and standard httplib
Apache License Version 2
Project Site: http://sourceforge.net/projects/pythonwebdavlib
Slide 27
Simple Use Case: File Upload and Search
Slide 40
Working with DataFinder…
Slide 41
Configuration and Customization
Preparing DataFinder for certain “use cases”
Requirements Analysis
Analyze data, working environment, and users’ workflows
Configuration
Define and configure data model
Configure distributed storage resources (Data Stores)
Customization
Write functional extensions with Python scripts
Slide 42
DataFinder Configuration
Data Model and Data Stores
Logical view of data
Definition of data structuring and meta data (“data model”)
Separate storage of data structure / meta data and actual data files
Flexible use of (distributed) storage resources
File system, WebDAV, FTP, GridFTP
Amazon S3 (Simple Storage Service)
Tivoli Storage Manager (TSM)
Storage Resource Broker (SRB)
Complex search mechanism to find data
[Diagram labels: Department, Employee, Simulation, Geometry, Grid Generation, Flow Solution, Visualisation]
Slide 43
Data Structure
Mapping of Organizational Data Structures
[Diagram labels: User; Project A, Project B, Project C; Simulation I, Experiment, Simulation II; File 1, File 2]
[Legend: Object (collection), Object (file), Relation, Attributes (meta data)]
Attribute table shown next to each object:
Key      Value
Project  MegaCode Ultra
User     Eddie
Slide 44
Meta Data
Describe and annotate data (“files”) and collections (“directories”)
Different levels of meta data
Required attributes defined by administrator
User is free to choose additional ones
Different types of meta data
String
Numbers (float, double, …)
Lists
Pictures
Links
Stored in XML format
User can search in meta data
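As a rough illustration of these points (required versus free attributes, typed values, XML storage, and searching), here is a small Python 3 sketch. The attribute names, the XML layout, and all function names are assumptions made for this example; they do not describe DataFinder's actual data model or storage format.

```python
import xml.etree.ElementTree as ET

# Hypothetical data model: attributes an administrator marks as required.
REQUIRED_ATTRIBUTES = {"project", "user"}


def missing_required(attributes, required=REQUIRED_ATTRIBUTES):
    """Return the sorted names of required attributes that are not set."""
    return sorted(name for name in required if attributes.get(name) in (None, ""))


def to_xml(attributes):
    """Serialize a flat attribute dictionary to a small XML document."""
    root = ET.Element("metadata")
    for name in sorted(attributes):
        value = attributes[name]
        element = ET.SubElement(root, "attribute", name=name, type=type(value).__name__)
        element.text = str(value)
    return ET.tostring(root, encoding="unicode")


def search(items, name, value):
    """Return the paths of all items whose attribute `name` equals `value`."""
    return sorted(path for path, attributes in items.items()
                  if attributes.get(name) == value)


if __name__ == "__main__":
    # "mach" is a free attribute chosen by the user; the other two are required.
    attributes = {"project": "MegaCode Ultra", "user": "Eddie", "mach": 0.8}
    print(missing_required(attributes))
    print(to_xml(attributes))
```

A client could call missing_required before an upload and refuse to proceed while the returned list is non-empty.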
Slide 45
Impact for Users
“Damn! I’m a great scientist! I want freedom to have my own directory layout…”
DataFinder restricts the rights of users!
Enforcement of “good behavior”
Users must comply with organizational standards
Data is stored in a defined (directory) hierarchy on the data server
Required meta data must be set prior to upload
Users have certain access rights within the hierarchy
Slide 46
Customization
Python Scripting for Extension and Automation
Integration of DataFinder with environment
User, infrastructure, software, …
Extension of DataFinder by Python scripts
Actions for resources (i.e., files, directories)
User interface extensions
Typical automations and customizations
Data migration and data import
Start of external application (with downloaded data files)
Extraction of meta data from result files
Automation of recurring tasks (“workflows”)
Slide 47
DataFinder Scripting
Downloading File and Starting Application
# Download the selected file and try to execute it.
import os
from tempfile import mktemp

from win32api import ShellExecute
from datafinder.application import ExternalFacade
from guitools.easygui import msgbox

# Get instance of ExternalFacade to access DataFinder API
facade = ExternalFacade.getInstance()

# Get currently selected resource in DataFinder Server-View
resource = facade.getSelectedResource()
if resource is not None:
    # The resource name is used as the suffix of the temporary file name
    tmpFile = mktemp(resource.name)
    facade.downloadFile(resource, tmpFile)
    if os.path.exists(tmpFile):
        ShellExecute(0, None, tmpFile, "", "", 1)
else:
    msgbox("No file selected to execute.")
Slide 48
Examples…
Slide 49
Example 1: Turbine Simulation
Slide 50
Example 1: Fluid Dynamics Simulation
Turbine Simulation
Design of new turbine engines
High-resolution simulation of flow
Computational Fluid Dynamics (CFD)
Use of high-performance computing resources (Cluster / Grid)
Huge amounts of data (>100 GByte)
DataFinder used for
Management of results
Automation of simulation runs
Starting pre-/post processing
Used for CFD-code TRACE (DLR)
See http://www.aero-grid.de
Slide 51
Turbine Simulation
Data Model
Simulation steps (example):
1. splitCGNS: preparing data for TRACE
2. TRACE (CFD solver): main computation
3. fillCGNS: merging results
4. Post-processing: data reduction and visualization
Automation with customized DataFinder
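The four simulation steps form a linear workflow, which a script could automate roughly as in the Python 3 sketch below. Only the step names come from the slide. The step bodies are toy placeholders (the real tools are external programs), and run_pipeline is a hypothetical helper, not part of the DataFinder API.

```python
def run_pipeline(steps, data):
    """Run (name, function) steps in order and stop at the first step that raises.

    Returns (result, log); log holds one (name, status) tuple per executed step.
    """
    log = []
    for name, function in steps:
        try:
            data = function(data)
        except Exception as error:
            log.append((name, "failed: %s" % error))
            return None, log
        log.append((name, "ok"))
    return data, log


# Toy stand-ins for the real tools, which are external programs.
def split_cgns(mesh):
    """Split the input into two blocks."""
    middle = len(mesh) // 2
    return [mesh[:middle], mesh[middle:]]


def trace_solver(blocks):
    """Pretend computation: double every value in every block."""
    return [[value * 2 for value in block] for block in blocks]


def fill_cgns(blocks):
    """Merge the per-block results into one list."""
    return [value for block in blocks for value in block]


def post_process(values):
    """Reduce the merged result to a small summary."""
    return {"min": min(values), "max": max(values)}


STEPS = [
    ("splitCGNS", split_cgns),
    ("TRACE", trace_solver),
    ("fillCGNS", fill_cgns),
    ("post-processing", post_process),
]

if __name__ == "__main__":
    result, log = run_pipeline(STEPS, [1, 2, 3, 4])
    print(result)
    print(log)
```

Keeping a per-step log is one simple way to support a status query while a run is in progress or after it has failed.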
Slide 52
Turbine Simulation: Graphical User Interface
Slide 53
Turbine Simulation: Customized GUI Extensions
1. Create new simulation
2. Start a simulation
3. Query status
4. Cancel simulation
5. Project overview
Slide 54
Turbine Simulation: Starting External Applications
1. CGNS Infos / ADFview / CGNS Plot
2. TRACE GUI
3. Gnuplot
Slide 55
Example 2: Automobile Supplier
Slide 56
Example 2: Automobile Supplier
DataFinder for Simulation and Data Management
Tasks
• Automation and management of customer simulations
• Mapping of specific work sequence
• High flexibility regarding customer requirements
Slide 57
Automobile Supplier
Data Model
Slide 58
Automobile Supplier
Configuration of Customer Parameters
Slide 59
Automobile Supplier
Management of Simulations
Status overview
Create, change, and delete data sets
Manage versions of data files
Parameter overview
Slide 60
Automobile Supplier
Upload, Download, and Versioning of Files
Upload/download of results
Versioning of results
Scripts store results in DataFinder data structures
Slide 61
Example 3: Air Traffic Management
Slide 62
Example 3: Air Traffic Monitoring
Database for Air Traffic Monitoring
Air traffic monitoring is important for research
Predictions of air traffic
New traffic management approaches
Usage of DataFinder
Database for traffic data and reports
Project-oriented view
Slide 63
Database for Air Traffic Monitoring
Data Model and Data Migration
Slide 64
Database for Air Traffic Monitoring
Data Import Wizard
Import of all data sources (PDF/Word/text files, Excel, Access, …)
Classification into multiple categories
Prevention of duplicate data and enforcement of consistent naming
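The slides do not say how the import wizard detects duplicates or builds names. One common approach, shown here purely as an assumption, is to compare content hashes and to derive file names from the classification. The naming scheme, the class, and the example values in this Python 3 sketch are all hypothetical.

```python
import hashlib
import re


def normalized_name(category, title, extension):
    """Build a consistent file name: lower case, words joined by underscores."""
    def slug(text):
        return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")
    return "%s__%s.%s" % (slug(category), slug(title), extension.lower().lstrip("."))


class ImportRegistry:
    """Remembers imported content by SHA-256 digest to detect duplicates."""

    def __init__(self):
        self._name_by_digest = {}

    def add(self, name, content):
        """Register `content` (bytes) and return the name it is stored under.

        For content seen before, the name of the first import is returned.
        """
        digest = hashlib.sha256(content).hexdigest()
        return self._name_by_digest.setdefault(digest, name)


if __name__ == "__main__":
    registry = ImportRegistry()
    name = normalized_name("Traffic Reports", "Annual Report 2007", ".PDF")
    print(registry.add(name, b"report bytes"))
    # Same bytes under a different name: the first name is reported instead.
    print(registry.add("copy_of_report.pdf", b"report bytes"))
```

Hashing the content rather than comparing names catches copies that were renamed before import.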
Slide 65
Database for Air Traffic Monitoring
Search Results
Slide 66
Current Work and Future Plans
Current work
Migration to Qt 4
Improved usage (e.g., search dialogs)
Integration with Shibboleth
Future
Web interfaces
Jython
Embedding in Java/Eclipse applications
Reuse of custom GUI dialogs
Migration to Py3k
Slide 68
Availability
DataFinder core available as Open Source
BSD License
http://sourceforge.net/projects/datafinder
Extended versions / extensions are proprietary
Slide 69
Links
DataFinder Web site: http://www.dlr.de/datafinder
DataFinder Open Source: http://sourceforge.net/projects/datafinder
Python WebDAV library: http://sourceforge.net/projects/pythonwebdavlib
Catacomb: http://catacomb.tigris.org
AeroGrid Project: http://www.aero-grid.de
Slide 70
Questions?