Web 2.0 for e-Science Environments


SKG2007, Xian Hotel, Xian, China
October 29, 2007
Geoffrey Fox and Marlon Pierce
Computer Science, Informatics, Physics
Community Grids Laboratory, Indiana University, Bloomington IN 47401
gcf@indiana.edu
http://www.infomall.org


Applications, Infrastructure, Technologies
  • This field is confused by inconsistent use of terminology; I define
      • Web Services, Grids and (aspects of) Web 2.0 (Enterprise 2.0) are technologies
      • Grids could be everything (Broad Grids implementing some sort of managed web) or reserved for specific architectures like OGSA or Web Services (Narrow Grids)
      • These technologies combine and compete to build electronic infrastructures termed e-infrastructure or Cyberinfrastructure
      • e-moreorlessanything is an emerging application area of broad importance that is hosted on the infrastructures e-infrastructure or Cyberinfrastructure
      • e-Science, or perhaps better e-Research, is a special case of e-moreorlessanything

Relevance of Web 2.0
  • They say that Web 1.0 was a read-only Web while Web 2.0 is the wildly read-write collaborative Web
  • Web 2.0 can help e-Science in many ways
  • Its tools can enhance scientific collaboration, i.e. effectively support virtual organizations, in different ways from grids
  • The popularity of Web 2.0 can provide high quality technologies and software that (due to large commercial investment) can be very useful in e-Science and preferable to Grid or Web Service solutions
  • The usability and participatory nature of Web 2.0 can bring science and its informatics to a broader audience
  • Web 2.0 can even help with the emerging challenge of using multicore chips, i.e.
    in improving parallel computing programming and runtime environments

Best Web 2.0 Sites -- 2006
  • Extracted from http://web2.wsj2.com/
  • All important capabilities for e-Science
  • Social Networking
  • Start Pages
  • Social Bookmarking
  • Peer Production News
  • Social Media Sharing
  • Online Storage (Computing)

Web 2.0, Grids and Web Services I
  • Web Services have clearly defined protocols (SOAP) and a well defined mechanism (WSDL) to define service interfaces
      • There is good .NET and Java support
      • The so-called WS-* specifications provide a rich, sophisticated but complicated standard set of capabilities for security, fault tolerance, meta-data, discovery, notification etc.
  • Narrow Grids build on Web Services and provide a robust managed environment with growing but still small adoption in Enterprise systems and distributed science (so called e-Science)
  • Web 2.0 supports a similar architecture to Web services but has developed in a more chaotic but remarkably successful fashion with a service architecture with a variety of protocols including those of Web and Grid services
      • Over 500 Interfaces defined at http://www.programmableweb.com/apis
  • Web 2.0 also has many well known capabilities with Google Maps and Amazon Compute/Storage services of clear general relevance
  • There are also Web 2.0 services supporting novel collaboration modes and user interaction with the web as seen in social networking sites, portals, MySpace, YouTube

Web 2.0 Systems like Grids have Portals, Services, Resources
  • Captures the incredible development of interactive Web sites enabling people to create and collaborate

Web 2.0, Grids and Web Services II
  • I once thought Web Services were inevitable but this is no longer clear to me
  • Web services are complicated, slow and non functional
      • WS-Security is unnecessarily slow and pedantic (canonicalization of XML)
      • WS-RM (Reliable Messaging) seems to have poor adoption and doesn't work well in collaboration
      • WSDM (distributed management) specifies a lot
  • There are de facto Web 2.0 standards like Google Maps
    and powerful suppliers like Google/Microsoft which define the architectures/interfaces
  • One can easily combine SOAP (Web Service) based services/systems with HTTP messages but dominance of lowest common denominator suggests additional structure/complexity of SOAP will not easily survive

Distribution of APIs and Mashups per Protocol
  [Charts: Number of Mashups; Number of APIs]
  • SOAP is quite a small fraction

Where did Narrow Grids and Web Services go wrong?
  • Too much Computing: historically one (including narrow grids) has tried to increase computing capabilities by
      • Optimizing performance of codes at cost of re-usability
      • Exploiting all possible CPUs such as Graphics co-processors and idle cycles (across administrative domains)
      • Linking central computers together such as NSF/DoE/DoD supercomputer networks without clear user requirements
  • Next Crisis in technology area will be the opposite problem: commodity chips will be 32-128 way parallel in 5 years time and we currently have no idea how to use them, especially on clients
      • Only 2 releases of standard software (e.g.
        Office) in this time span
  • Interoperability Interfaces will be for data not for infrastructure
      • Google, Amazon, TeraGrid, European Grids will not interoperate at the resource or compute (processing) level but rather at the data streams flowing in and out of independent Grid islands
      • Data focus is consistent with Semantic Grid/Web but not clear if latter has learnt the usability message of Web 2.0
  • One needs to share computing, data, people in e-moreorlessanything; Grids initially focused on computing but data and people are more important
  • eScience is healthy as is e-moreorlessanything
  • Most Grids are solving wrong problem at wrong point in stack with a complexity that makes friendly usability difficult

Some Web 2.0 Activities at IU
  • Use of Blogs, RSS feeds, Wikis etc.
  • Use of Mashups for Cheminformatics Grid workflows
  • Moving from Portlets to Gadgets in portals (or at least supporting both)
  • Use of Connotea to produce tagged document collections such as http://www.connotea.org/user/crmc for parallel computing
  • Semantic Research Grid integrates multiple tagging and search systems and copes with overlapping inconsistent annotations
  • MSI-CIEC portal augments Connotea to tag a mix of URL and URIs e.g. NSF TeraGrid use, PIs and Proposals
      • Hopes to support collaboration (for Minority Serving Institution faculty)
  • Multicore SALSA project using for Parallel Programming 2.0

Use blog to create posts. Display blog RSS feed in MediaWiki.

Semantic Research Grid (SRG)
  • Integrates tagging and search system that allows users to use multiple sites and consistently integrate them with traditional citation databases
  • We built a mashup linking to del.icio.us, CiteULike, Connotea allowing exchange of tags between sites and between local repositories
  • Repositories also link to local sources (PubsOnline) and Google Scholar (GS) and Windows Live Academic (WLA)
      • GS has number of cited publications.
      • WLA has Digital Object Identifier (DOI)
  • We implement a rather more powerful access control mechanism
  • We build heuristic tools to mine web lists for citations
  • We have an event based architecture (consistency model) allowing change actions to be preserved and selectively changed
      • Supports integrating different inconsistent views of a given document and its updates on different tagging systems

MSI-CIEC Portal
  • MSI-CIEC: Minority Serving Institution CyberInfrastructure Empowerment Coalition

NSF Grants Tag System
  • NSF has the ability to get information (in XML) on all of the grants a particular person worked on
  • We downloaded, parsed, and bookmarked this info using a little scavenger robot.
  • Each grant is represented by a bookmark and tagged with relevant information in MSI-CIEC Portal
      • Grant tags point to URLs of the NSF award page.
  • The investigators are imported as users
      • Each has a bookmark for each project they worked on
      • They are also represented in the tags of these projects.
  • Can now form research collaborations by linking researchers with common tags
  • Hopefully will enable broader collaborations and not just those between usual suspects

Superior (from broad usage) technologies of Web 2.0
  • Mash-ups can replace Workflow
  • Gadgets can replace Portlets
  • UDDI replaced by user generated registries

Mashups v Workflow?
  • Mashup Tools are reviewed at http://blogs.zdnet.com/Hinchcliffe/?p=63
  • Workflow Tools are reviewed by Gannon and Fox http://grids.ucs.indiana.edu/ptliupages/publications/Workflow-overview.pdf
  • Both include scripting in PHP, Python, sh etc.
    as both implement distributed programming at level of services
  • Mashups use all types of service interfaces and perhaps do not have the potential robustness (security) of Grid service approach
  • Mashups typically pure HTTP (REST)

Grid Workflow Datamining in Earth Science
  • Work with Scripps Institute
  • Grid services controlled by scripting workflow process real time data from ~70 GPS Sensors in Southern California
  [Figure labels: NASA, GPS, Earthquake]

Grid Workflow Data Assimilation in Earth Science
  • Grid services triggered by abnormal events and controlled by workflow process real time data from radar and high resolution simulations for tornado forecasts
  • Typical graphical interface to service composition
  • Taverna another well known Grid/Web Service workflow tool
  • Recent Web 2.0 visual Mashup tools include Yahoo Pipes and Microsoft Popfly

Parallel Programming 2.0
  • Web 2.0 Mashups will (by definition the largest market) drive composition tools for Grid, web and parallel programming
  • Parallel Programming 2.0 will build on Mashup tools like Yahoo Pipes and Microsoft Popfly

Web 2.0 Mashups and APIs
  • http://www.programmableweb.com/apis has (Sept 12 2007) 2312 Mashups and 511 Web 2.0 APIs, with GoogleMaps the most often used in Mashups
  • This is the Web 2.0 UDDI (service registry)

The List of Web 2.0 APIs
  • Each site has API and its features
  • Divided into broad categories
  • Only a few used a lot (49 APIs used in 10 or more mashups)
  • RSS feed of new APIs
  • Google maps dominates but Amazon S3 growing in popularity

Now to Portals

Grid-style portal as used in Earthquake Grid
  • The Portal is built from portlets providing user interface fragments for each service that are composed into the full interface; uses OGCE technology as does planetary science VLAB portal with University of Minnesota
  • QuakeSim has a typical Grid technology portal
  • Such Server side Portlet-based approaches to portals are being challenged by client side gadgets from Web 2.0

Portlets v. Google Gadgets
  • Portals for Grid Systems are built using portlets with software like GridSphere integrating these on the server-side into a single web-page
  • Google (at least) offers the Google sidebar and Google home page which support Web 2.0 services and do not use a server side aggregator
  • Google is more user friendly!
  • The many Web 2.0 competitions are an interesting model for promoting development in the world-wide distributed collection of Web 2.0 developers
  • I guess Web 2.0 model will win!

Typical Google Gadget Structure
  • Lots of HTML and JavaScript
  • Portlets build User Interfaces by combining fragments in a standalone Java Server
  • Google Gadgets build User Interfaces by combining fragments with JavaScript on the client

Web 2.0 can also help address long standing difficulties with parallel programming environments
  • Too much computing addresses too much data and implies need for multicore datamining algorithms
      • Clustering
      • Principal Component Analysis (SVD)
      • Expectation-Maximization EM (mixture models)
      • Hidden Markov Models HMM

Multicore SALSA at CGL
  • Service Aggregated Linked Sequential Activities
  • http://www.infomall.org/multicore
  • Aims to link parallel and distributed (Grid) computing by developing parallel applications as services and not as programs or libraries
      • Improve traditionally poor parallel programming development environments
  • Can use messaging to link parallel and Grid services but performance functionality tradeoffs different
      • Parallelism needs few µs latency for message latency and thread spawning
      • Network overheads in Grid 10-100's µs
  • Developing set of services (library) of multicore parallel data mining algorithms

Parallel Programming Model
  • If multicore technology is to succeed, mere mortals must be able to build effective parallel programs
  • There are interesting new developments, especially the Darpa HPCS Languages X10, Chapel and Fortress
  • However if mortals are to program the 64-256 core chips expected in 5-7 years, then we must use today's technology and we must make it easy
      • This rules out
        radical new approaches such as new languages
  • The important applications are not scientific computing but most of the algorithms needed are similar to those explored in scientific parallel computing
      • Intel RMS analysis
  • We can divide problem into two parts:
      • High Performance scalable (in number of cores) parallel kernels or libraries
      • Composition of kernels into complete applications
  • We currently assume that the kernels of the scalable parallel algorithms/applications/libraries will be built by experts with a broader group of programmers (mere mortals) composing library members into complete applications.

Scalable Parallel Components
  • There are no agreed high-level programming environments for building library members that are broadly applicable.
  • However lower level approaches where experts define parallelism explicitly are available and have clear performance models.
  • These include MPI for messaging or just locks within a single shared memory.
  • There are several patterns to support here including the collective synchronization of MPI, dynamic irregular thread parallelism needed in search algorithms, and more specialized cases like discrete event simulation.
  • We use Microsoft CCR http://msdn.microsoft.com/robotics/ as it supports both MPI and dynamic threading style of parallelism
      • It already supports a Web 2.0 compatible service model DSS

Composition of Parallel Components
  • The composition step has many excellent solutions as this does not have the same drastic synchronization and correctness constraints as for scalable kernels
      • Unlike kernel step which has no very good solutions
  • Task parallelism in languages such as C++, C#, Java and Fortran90
  • General scripting languages like PHP, Perl, Python
  • Domain specific environments like Matlab and Mathematica
  • Functional Languages like MapReduce, F#
  • HeNCE, AVS and Khoros from the past and CCA from DoE
  • Web Service/Grid Workflow like Taverna, Kepler, InforSense KDE, Pipeline Pilot (from SciTegic) and the LEAD environment built at Indiana University.
  • Web solutions like Mash-ups and DSS
  • Many scientific applications use MPI for the coarse grain composition as well as fine grain parallelism but this doesn't seem elegant
  • The new languages from Darpa's HPCS program support task parallelism (composition of parallel components); decoupling composition and scalable parallelism will remain popular and must be supported.

Service Aggregation in SALSA
  • Kernels and Composition must be supported both inside chips (the multicore problem) and between machines in clusters (the traditional parallel computing problem) or Grids.
  • The scalable parallelism (kernel) problem is typically only interesting on true parallel computers as the algorithms require low communication latency.
  • However composition is similar in both parallel and distributed scenarios and it seems useful to allow the use of Grid and Web 2.0 composition tools for the parallel problem.
      • This should allow parallel computing to exploit large investment in service programming environments
  • Thus in SALSA we express parallel kernels not as traditional libraries but as (some variant of) services so they can be used by non expert programmers
  • For parallelism expressed in CCR, DSS represents the natural service (composition) model.

Inside the SALSA Services
  • We generalize the well known CSP (Communicating Sequential Processes) of Hoare to describe the low level approaches to fine grain parallelism as "Linked Sequential Activities" in SALSA.
  • We use term "activities" in SALSA to allow one to build services from either threads, processes (usual MPI choice) or even just other services.
  • We choose term "linkage" in SALSA to denote the different ways of synchronizing the parallel activities that may involve shared memory rather than some form of messaging or communication.
  • There are several engineering and research issues for SALSA
      • There is the critical communication optimization problem area for communication inside chips, clusters and Grids.
      • We need to discuss what we mean by services

SALSA Performance
  • The macroscopic inter-service DSS Overhead is about 35 µs
  • DSS is composed from CCR threads that have 4 µs overhead for spawning threads in dynamic search applications
  • 20 µs overhead for MPI Exchange

MPI Exchange Latency in µs (20-30 µs computation between messaging)

  Machine                                      OS      Runtime       Grains   Parallelism  MPI Exchange Latency (µs)
  Intel8c:gf12 (8 core 2.33 Ghz) (in 2 chips)  Redhat  MPJE (Java)   Process  8            181
                                                       MPICH2 (C)    Process  8            40.0
                                                       MPICH2: Fast  Process  8            39.3
                                                       Nemesis       Process  8            4.21
  Intel8c:gf20 (8 core 2.33 Ghz)               Fedora  MPJE          Process  8            157
                                                       mpiJava       Process  8            111
                                                       MPICH2        Process  8            64.2
  Intel8b (8 core 2.66 Ghz)                    Vista   MPJE          Process  8            170
                                               Fedora  MPJE          Process  8            142
                                               Fedora  mpiJava       Process  8            100
                                               Vista   CCR (C#)      Thread   8            20.2
  AMD4 (4 core 2.19 Ghz)                       XP      MPJE          Process  4            185
                                               Redhat  MPJE          Process  4            152
                                                       mpiJava       Process  4            99.4
                                                       MPICH2        Process  4            39.3
                                               XP      CCR           Thread   4            16.3
  Intel4 (4 core 2.8 Ghz)                      XP      CCR           Thread   4            25.8

  [Figure label: Renters]
  • Clustering is typical of data mining methods that are needed for tomorrow's clients or servers bathed in a data rich environment
  • Clustering Census data in Indiana on dual quadcore processors
  • Implemented with CCR and DSS
  • Use deterministic annealing that uses multiscale method to avoid local minima
  • Efficiency is 90%, limited by peculiar Windows thread scheduling effects

Parallel Multicore GIS Deterministic Annealing Clustering
  [Plot: Parallel Overhead on 8 Threads, Intel 8b; horizontal axis 10000/(Grain Size n = points per core); curves for 10 Clusters and 20 Clusters]
  • Speedup = 8/(1 + Overhead)
  • Overhead = Constant1 + Constant2/n
  • Constant1 = 0.02 to 0.1 (Windows) due to thread runtime fluctuations

Web 2.0 v Narrow Grid I
  • Web 2.0 and Grids are addressing a similar application class although Web 2.0 has focused on user interactions
      • So technology has similar requirements
  • Web 2.0 chooses simplicity (REST rather than SOAP) to lower barrier to everyone participating
  • Web 2.0 and Parallel Computing tend to use traditional (possibly visual) (scripting) languages for equivalent of workflow whereas Grids use visual interface backend recorded in BPEL
  • Web 2.0 and Grids both use SOA Service
    Oriented Architectures
      • Services will be used everywhere: Grids, Web 2.0 and Parallel Computing
  • System of Systems: Grids and Web 2.0 are likely to build systems hierarchically out of smaller systems
      • We need to support Grids of Grids, Webs of Grids, Grids of Services etc. i.e. systems of systems of all sorts
      • Web 2.0 suggests data not infrastructure system linkage

Web 2.0 v Narrow Grid II
  • Web 2.0 has a set of major services like GoogleMaps or Flickr but the world is composing Mashups that make new composite services
      • End-point standards are set by end-point owners
      • Many different protocols covering a variety of de-facto standards
  • Narrow Grids have a set of major software systems like Condor and Globus and a different world is extending with custom services and linking with workflow
  • Popular Web 2.0 technologies are PHP, JavaScript, JSON, AJAX and REST with "Start Page" e.g. (Google Gadgets) interfaces
  • Popular Narrow Grid technologies are Apache Axis, BPEL, WSDL and SOAP with portlet interfaces
  • Robustness of Grids demanded by the Enterprise?
  • Not so clear that Web 2.0 won't eventually dominate other application areas and with Enterprise 2.0 it's invading Grids

Web 2.0 v Narrow Grid III
  • Narrow Grids have a strong emphasis on standards and structure
  • Web 2.0 lets a 1000 flowers (protocols) and a million developers bloom and focuses on functionality, broad usability and simplicity
      • Interoperability at user (data) level not at service level
      • Puts semantics into application (user) level (like KML for maps) and minimizes general system level semantics
  • Semantic Web/Grid has structure to allow reasoning
      • Annotation in sites like del.icio.us and uploading to MySpace/YouTube is unstructured and free text search replaces structured ontologies?
      • Flickr has geocoded (structured) and unstructured tags
  • Portals are likely to feature both Web and desktop client technology although it is possible that Web approach will be adopted more or less uniformly
  • Web 2.0 has a very active portal activity which has similar
    architecture to Grids
      • A page has multiple user interface fragments
  • Web 2.0 user interface integration is typically Client side using Gadgets, AJAX and JavaScript while Grids are in a special JSR168 portal server side using Portlets, WSRP and Java

The Ten areas covered by the 60 core WS-* Specifications

  WS-* Specification Area          Typical Grid/Web Service Examples
  1: Core Service Model            XML, WSDL, SOAP
  2: Service Internet              WS-Addressing, WS-MessageDelivery; Reliable Messaging WSRM; Efficient Messaging MTOM
  3: Notification                  WS-Notification, WS-Eventing (Publish-Subscribe)
  4: Workflow and Transactions     BPEL, WS-Choreography, WS-Coordination
  5: Security                      WS-Security, WS-Trust, WS-Federation, SAML, WS-SecureConversation
  6: Service Discovery             UDDI, WS-Discovery
  7: System Metadata and State     WSRF, WS-MetadataExchange, WS-Context
  8: Management                    WSDM, WS-Management, WS-Transfer
  9: Policy and Agreements         WS-Policy, WS-Agreement
  10: Portals and User Interfaces  WSRP (Remote Portlets)

WS-* Areas and Web 2.0

  WS-* Specification Area          Web 2.0 Approach
  1: Core Service Model            XML becomes optional but still useful
                                   SOAP becomes JSON RSS ATOM
                                   WSDL becomes REST with API as GET PUT etc.
                                   Axis becomes XmlHttpRequest
  2: Service Internet              No special QoS. Use JMS or equivalent?
  3: Notification                  Hard with HTTP without polling; JMS perhaps?
  4: Workflow and Transactions     Mashups, Google MapReduce
     (no Transactions in Web 2.0)  Scripting with PHP JavaScript ...
  5: Security                      SSL, HTTP Authentication/Authorization, OpenID is Web 2.0 Single Sign on
  6: Service Discovery             http://www.programmableweb.com
  7: System Metadata and State     Processed by application, no system state
                                   Microformats are a universal metadata approach
  8: Management==Interaction       WS-Transfer style Protocols GET PUT etc.
  9: Policy and Agreements         Service dependent.
                                   Processed by application
  10: Portals and User Interfaces  Start Pages, AJAX and Widgets (Netvibes) Gadgets

Looking to the Future
  • Web 2.0 has momentum as it is driven by success of social web sites and the user friendly protocols attracting many developers of mashups
  • Grids momentum driven by the success of eScience and the commercial web service thrusts largely aimed at Enterprise
  • We expect applications such as business and military where predictability and robustness are important might be built on a Web Service (Narrow Grid) core with perhaps Web 2.0 functionality enhancements
      • But even this Web Service application may not survive
  • Multicore usability driving Parallel Programming 2.0
  • Simplicity, supporting many developers are forces pressuring Grids!
  • Robustness and coping with unstructured blooming of a 1000 flowers are forces pressuring Web 2.0
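
Worked example: the parallel overhead model

The Deterministic Annealing Clustering slide quotes Speedup = 8/(1 + Overhead) with Overhead = Constant1 + Constant2/n, where n is the grain size (points per core). The sketch below simply evaluates those two formulas. The function names are illustrative, and the value used for Constant2 is a hypothetical placeholder, since the slides give a range only for Constant1 (0.02 to 0.1 on Windows).

```python
def parallel_overhead(n, constant1, constant2):
    """Overhead = Constant1 + Constant2 / n, with n = points per core."""
    if n <= 0:
        raise ValueError("grain size n must be positive")
    return constant1 + constant2 / n


def speedup(overhead, threads=8):
    """Speedup = threads / (1 + Overhead); the slides use threads = 8."""
    return threads / (1.0 + overhead)


def efficiency(overhead, threads=8):
    """Efficiency is speedup divided by the number of threads."""
    return speedup(overhead, threads) / threads


if __name__ == "__main__":
    CONSTANT1 = 0.02   # low end of the range quoted on the slide
    CONSTANT2 = 500.0  # hypothetical value, chosen only for illustration
    for n in (1_000, 10_000, 100_000):
        f = parallel_overhead(n, CONSTANT1, CONSTANT2)
        print(f"n={n:>7}  overhead={f:.3f}  speedup={speedup(f):.2f}  "
              f"efficiency={efficiency(f):.1%}")
```

Under this model the 90% efficiency quoted for the census clustering corresponds to an overhead of about 0.11, since 1/(1 + 1/9) = 0.9. As n grows, the Constant2/n term vanishes and the overhead approaches Constant1, which the slides attribute to thread runtime fluctuations.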