If you have the Content, then Apache has the Technology! A whistle-stop tour of the Apache content related projects

Apache Content Technologies



Nick Burch

Software Engineer, Alfresco

Apache Projects

• 79 Top Level Projects
• 40 Incubating Projects
• 30 “Content Related” Main Projects
• 7 “Content Related” Incubating Projects

37 Projects in 50 minutes

With time for questions...

This is not a comprehensive guide!

Different Technologies

• Serving
• Storing
• Transforming
• Generating
• Hosting
• Web Framework Rendering / Templating / etc

What can we get in 50 mins?

• A quick overview of each project
• When talks on the project are happening
• When meetups on the project are happening
• Anything new/exciting about the project?
• What interests me in the project!

Serving up your Content

Apache HTTPD Server
http://httpd.apache.org/

• Talks – All day Wednesday
• Meetup – Thursday evening
• Very wide range of features
• (Fairly) easy to extend
• Can host most programming languages
• Can front most content systems
• Can proxy your content applications
• Can host code and content

Apache TrafficServer
http://trafficserver.apache.org/

• High performance web proxy
• Forward and reverse proxy
• Ideally suited to sitting between your content application and the internet
• For proxy-only use cases, will probably be better than httpd
• Fewer other features though
• Often used as a cloud-edge HTTP router

Apache Tomcat
http://tomcat.apache.org/

• Talks – All day Friday!
• Java based, as many of the Apache Content Technologies are
• Java Servlet Container
• And you probably all know the rest!

Tomcat – What's New
http://tomcat.apache.org/

• Memory leak detection – for your applications, and for the JVM!
• Easier to embed – no need for large numbers of config files!
• Asynchronous request processing for things like Comet / Bayeux
• Servlet 3.0
• Improved JMX configurability

Storing all that Content

Apache Cassandra
http://cassandra.apache.org/

• Talk – 11am Wednesday
• Meetup – Wednesday evening
• One of our many NoSQL Databases
• Column-Family store
• Eventually consistent
• Distributed, replicating, no SPOF (single point of failure)
• Can elastically add machines

Apache CouchDB
http://couchdb.apache.org/

• 12pm Wednesday
• Relax!
• Erlang
• NoSQL
• Document orientated distributed store
• Eventually consistent if replicating
• Map-Reduce queries
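
To make the "Map-Reduce queries" bullet concrete, here is a minimal conceptual sketch in plain Java. It is not the CouchDB API (CouchDB views are normally written in JavaScript and queried over HTTP); the class, the `doc` helper and the field names are hypothetical and only illustrate the idea of a map step that emits key/value pairs and a reduce step that combines values sharing a key.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MapReduceViewSketch {
    // Hypothetical helper: builds a tiny "document" with two fields.
    static Map<String, Object> doc(String type, int size) {
        Map<String, Object> fields = new HashMap<>();
        fields.put("type", type);
        fields.put("size", size);
        return fields;
    }

    // Map step: each document emits (value of keyField, value of valueField).
    // Reduce step: emitted values that share a key are summed.
    static Map<String, Integer> sumByKey(List<Map<String, Object>> docs,
                                         String keyField, String valueField) {
        Map<String, Integer> view = new TreeMap<>(); // kept sorted by key
        for (Map<String, Object> doc : docs) {
            Object key = doc.get(keyField);
            Object value = doc.get(valueField);
            // Documents without the expected fields emit nothing.
            if (key instanceof String && value instanceof Integer) {
                view.merge((String) key, (Integer) value, Integer::sum);
            }
        }
        return view;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs =
                Arrays.asList(doc("pdf", 10), doc("doc", 5), doc("pdf", 7));
        System.out.println(sumByKey(docs, "type", "size"));
    }
}
```

In a real document store the map output would be indexed and updated incrementally rather than recomputed on every query; the loop above only shows the shape of the computation.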

Apache HBase
http://hbase.apache.org/

• 2pm Wednesday
• Recently graduated from Hadoop
• Another NoSQL Database
• Column-Family store, modelled on Google's Bigtable paper
• Some transactions and locking
• Fast range queries and sorting
• Built on HDFS

Which Apache NoSQL?

• Do you have tuples, documents, variable key/values or complex objects?
• Must data always be consistent?
• If you lose a chunk of machines (partition), should read/write still work?
• Query by id, range, arbitrary key/value or map-reduce function?
• How much human interaction is required to add or remove nodes?

Apache DB: Derby
http://db.apache.org/derby/

• Small, easy to embed SQL database
• Can be embedded and accessed via an embedded JDBC driver
• Can be accessed over the network
• Can be run entirely in-memory
• Efficient on-disk format
• Has a JavaME version – run it on basic cell phones!

Apache Directory
http://directory.apache.org/

• LDAP Directory
• Optimised for many reads per write
• Hierarchical, class/attribute based storage
• Triggers, stored procedures, queries and views
• Multi-master replication
• Rich permissions model built in

Apache Jackrabbit
http://jackrabbit.apache.org/

• 1.30pm Thursday
• JCR (Java Content Repository)
• Hierarchical content store
• Supports structured and unstructured data
• Transactional
• Supports versions
• Full text search built in

Apache Lucene
http://lucene.apache.org/

• All day Friday + Meetup Tuesday night
• Inverted index store
• (Each term lists its documents, rather than each document listing terms)
• Searching is faster than adding
• Normally stores text, but additional data can be associated with it
• Can hold indexed and un-indexed data
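
The inverted index idea can be illustrated with a few lines of plain Java. This is a toy sketch, not Lucene's API or on-disk format: the class name, the whitespace/punctuation tokeniser and the in-memory map are all assumptions made for illustration. It shows why a lookup is cheap (one map access per term) while adding a document has to touch the entry for every term it contains.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

final class InvertedIndexSketch {
    // term -> ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Adding: tokenise the text and update the document list of every term.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Searching: a single lookup returns all matching document ids.
    SortedSet<Integer> search(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null
                ? Collections.<Integer>emptySortedSet()
                : Collections.unmodifiableSortedSet(hits);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Apache Lucene stores an inverted index");
        index.add(2, "Apache Tika extracts text for the index");
        System.out.println(index.search("index"));
        System.out.println(index.search("tika"));
    }
}
```

A real search library adds analysis (stemming, stop words), scoring and persistent segment files on top of this basic term-to-documents mapping.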

Lucene – What's New?
http://lucene.apache.org/

• Lucene and Solr have merged
• Near real-time support when indexing
• Better storing of attributes and other data in the token stream
• Numeric fields improved – no need to externally process numbers into range buckets yourself
• Fast vector highlighter for large docs

Apache Subversion
http://subversion.apache.org/

• Meetup Thursday evening
• Versioning content store
• Efficient at storing changes
• Normally stores code, text and the odd binary blob
• If you have textual data and you want a versioning store, it's a good fit!
• Used by the new Apache CMS

Apache Xindice
http://xml.apache.org/xindice/

• Native XML Database
• No need to map your complex XML files to a different data structure
• Ideally suited to problems where you have large numbers of XML files, and little / no other content
• Schema independent model
• XPath queries
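
For readers who have not used XPath, the following sketch shows the kind of query an XML store answers. It uses only the JDK's `javax.xml.xpath` classes as a stand-in, not the Xindice API, and the sample document and class name are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathQuerySketch {
    // Hypothetical sample document, used only for illustration.
    static final String XML = "<projects>"
            + "<project name='Xindice' type='storing'/>"
            + "<project name='Lucene' type='storing'/>"
            + "<project name='Xalan' type='transforming'/>"
            + "</projects>";

    // Evaluates an XPath expression and returns the value of each matching node.
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                values.add(hits.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("XPath query failed", e);
        }
    }

    public static void main(String[] args) {
        // Names of all projects whose type attribute is 'storing'.
        System.out.println(select(XML, "/projects/project[@type='storing']/@name"));
    }
}
```

The query addresses the XML structure directly, which is the point of the "no need to map your complex XML files" bullet: no relational schema sits between the document and the question asked of it.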

Transforming and Reading Content

Apache PDFBox
http://pdfbox.apache.org/

• 4pm Wednesday
• Read, Write, Create and Edit PDFs
• Create PDFs from text
• Fill in PDF forms
• Extract text and formatting (Lucene, Tika etc)
• Edit existing files, add images, add text etc

Apache POI
http://poi.apache.org/

• 3pm Wednesday + Fast Feather Track
• File format reader and writer for Microsoft Office file formats
• Supports binary & OOXML formats
• Strong read, edit and write support for .xls & .xlsx
• Read and basic edit for .doc & .docx
• Read and basic edit for .ppt & .pptx
• Read for Visio, Publisher, Outlook

Apache Tika
http://tika.apache.org/

• 9am Friday + Fast Feather Track
• Java (+ command line) toolkit for detecting and extracting content
• Identifies what a blob of content is
• Gives you consistent metadata back for it
• Parses the contents into plain text, HTML, XHTML or SAX events

Tika – What's New?
http://tika.apache.org/

• Lots of new parsers – text, office formats, publishing formats, images, audio, CAD, fonts etc
• Long standing parsers improved – better HTML from Word, for example
• Embedded resources and containers
• Usage is expanding – used by many Solr users, Alfresco, and lots of people crunching masses of data on Hadoop

Apache Cocoon
http://cocoon.apache.org/

• Component Pipeline framework
• Plug together “Lego-Like” generators, transformers and serialisers
• Generate your content once in your application, serve to different formats
• Read in formats, translate and publish
• Can power your own “Yahoo Pipes”
• Modular, powerful and easy

Apache Xalan
http://xalan.apache.org/

• XSLT processor
• XPath engine
• Java and C++ flavours
• Cross platform
• Library and command line executables
• Transform your XML
• Fast and reliable XSLT transformation engine
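
As a rough illustration of what an XSLT transformation looks like from Java, the sketch below uses the standard JAXP `javax.xml.transform` API, which Xalan-Java implements. Run as written it will use whichever XSLT processor the JAXP factory lookup finds (the JDK's built-in one unless another is on the classpath). The stylesheet, input document and class name are made up for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XsltSketch {
    // Hypothetical stylesheet: turns <project> elements into a comma separated line of text.
    static final String STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='text'/>"
          + "<xsl:template match='/projects'>"
          + "<xsl:for-each select='project'>"
          + "<xsl:value-of select='.'/>"
          + "<xsl:if test='position() != last()'>, </xsl:if>"
          + "</xsl:for-each>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    // Compiles the stylesheet and applies it to the XML input.
    static String transform(String xml, String stylesheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<projects><project>Xalan</project><project>FOP</project></projects>";
        System.out.println(transform(xml, STYLESHEET)); // prints "Xalan, FOP"
    }
}
```

Because the code only depends on JAXP interfaces, swapping the underlying XSLT engine is a deployment decision rather than a code change.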

Apache XML Graphics: Batik
http://xmlgraphics.apache.org/#batik

• Java SVG toolkit + library
• SVG Parser – read and process existing SVG files
• SVG Generator – Graphics2D implementation that outputs SVG
• SVG DOM – easy way to manipulate your SVG files
• SVG viewer program (Squiggle)
• Command line SVG rasteriser

Apache XML Graphics: FOP
http://xmlgraphics.apache.org/#fop

• XSL-FO processor in Java
• Reads W3C XSL-FO, applies the formatting rules to your XML document, and renders it
• Output to Text, PS, PDF, SVG, RTF, Java Graphics2D etc
• Lets you leave your XML clean, and define semantically meaningful rich rendering rules for it

Apache Commons: Codec
http://commons.apache.org/codec/

• Commons Track – Thursday Morning
• Encode and decode a variety of encoding formats
• Base64, Hex, Phonetic and URLs
• Handy when interchanging content with external systems
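
To show what Base64 and Hex encodings do to a piece of content, here is a small sketch that uses only JDK classes (`java.util.Base64` and `String.format`) as a stand-in. It does not use the Commons Codec API, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class EncodingSketch {
    // Text -> Base64, via its UTF-8 bytes.
    static String toBase64(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Base64 -> text, assuming the encoded bytes are UTF-8.
    static String fromBase64(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    // Text -> lower-case hex, two digits per UTF-8 byte.
    static String toHex(String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBase64("Apache"));     // QXBhY2hl
        System.out.println(fromBase64("QXBhY2hl")); // Apache
        System.out.println(toHex("Apache"));        // 417061636865
    }
}
```

Both encodings turn arbitrary bytes into plain ASCII, which is what makes them useful when content has to pass through systems that only handle text safely.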

Apache Commons: Compress
http://commons.apache.org/compress/

• Commons Track – Thursday Morning
• Standard way to deal with archive formats
• Read and write support
• zip, tar, gzip, bzip, cpio and ar
• Wider range of capabilities than java.util.zip
• Common API across all formats
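
For comparison, this is roughly what the JDK baseline mentioned on the slide looks like: a zip archive written and read back in memory with `java.util.zip`. It is a sketch of the JDK classes only, not of Commons Compress, and the class and method names are hypothetical. The JDK package handles zip and gzip; the other archive formats listed on the slide are not covered by it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipRoundTripSketch {
    // Writes each (name, text) pair as one entry of an in-memory zip archive.
    static byte[] zip(Map<String, String> files) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, String> file : files.entrySet()) {
                zip.putNextEntry(new ZipEntry(file.getKey()));
                zip.write(file.getValue().getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed at this point, so the archive is complete.
        return bytes.toByteArray();
    }

    // Reads the archive back, entry by entry, in the order the entries were written.
    static Map<String, String> unzip(byte[] archive) {
        Map<String, String> files = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            byte[] buffer = new byte[4096];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int read;
                while ((read = zip.read(buffer)) != -1) {
                    content.write(buffer, 0, read);
                }
                files.put(entry.getName(), new String(content.toByteArray(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("readme.txt", "Hello from the archive");
        files.put("docs/notes.txt", "Second entry");
        System.out.println(unzip(zip(files)));
    }
}
```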

Apache Commons: Sanselan
http://commons.apache.org/sanselan/

• Commons Track – Thursday Morning
• Pure Java image reader and writer
• Fast parsing of image metadata and information (size, color space, ICC etc)
• Much easier to use than ImageIO
• Slower though, as pure Java
• Wider range of formats supported
• PNG, GIF, TIFF, JPEG + Exif, BMP, ICO, PNM, PPM, PSD, XMP

Generating Content

Apache Forrest
http://forrest.apache.org/

• Document rendering solution built on top of Cocoon
• Reads in content in a variety of formats (XML, wiki etc), applies the appropriate formatting rules, then outputs to different formats
• Heavily used for documentation and websites
• e.g. read in a file, format as changelog and readme, output as HTML + PDF

Apache Abdera
http://abdera.apache.org/

• Atom – syndication and publishing
• High performance Java implementation of RFC 4287 + 5023
• Generate Atom feeds from Java or by converting
• Parse and process Atom feeds
• AtomPub server and clients
• Supports Atom extensions like GeoRSS, MediaRSS & OpenSearch

Apache Droids (Incubating)
http://incubator.apache.org/droids/

• Intelligent Robots!
• Generic standalone crawler framework
• Easy to extend existing common crawlers
• Easy to write custom ones
• Queue requests for content, protocol handler gets it, multi threaded
• Uses Apache Tika for core of handling fetched resources

Apache JSPWiki (Incubating)
http://incubator.apache.org/jspwiki/

• Feature-rich extensible wiki
• Written in Java (Servlets + JSP)
• Fairly easy to extend
• Can be used as a wiki out of the box
• Provides a good platform for new wiki-based applications
• Rich wiki markup and syntax
• Attachments, security, templates etc

Apache ManifoldCF (Incubating)
http://incubator.apache.org/connectors/

• Name has changed a few times... (Lucene/Apache Connectors)
• Provides a standard way to get content out of other systems, ready for sending to Lucene etc
• Different goals to CMIS (Chemistry)
• Uses many parsers and libraries to talk to the different repositories / systems
• Analogous to Tika but for repos

Apache PhotArk (Incubating)
http://incubator.apache.org/photark/

• 5pm Thursday
• Open Source Photo Gallery application
• Standalone or servlet modes
• Can host photos locally
• Can aggregate external photo albums (Flickr, Picasa) for a unified view
• SCA programming model – uses Apache Tuscany to power it

Hosting Content

Apache Chemistry (Incubating)
http://incubator.apache.org/chemistry/

• 2pm Wednesday
• Java, Python and PHP, Atom and WS-*
• OASIS CMIS (Content Management Interoperability Services)
• Client and Server bindings
• “SQL for Content”
• Consistent view on content across different repositories
• Read / Write / Manipulate content

Chemistry vs ManifoldCF
incubator: /chemistry/ vs /connectors/

• ManifoldCF treats the repo as a nasty black box, and handles talking to the parsers
• Chemistry talks to / exposes the repo's contents through CMIS
• ManifoldCF supports a wider range of repositories
• Chemistry supports read and write
• Chemistry delivers a richer model
• ManifoldCF is great for getting text out

Apache Lenya
http://lenya.apache.org/

• 9am Thursday
• XML Content Management system
• Powered by Apache Cocoon
• WYSIWYG editors onto RELAX NG XML
• Rich workflow engine + staging
• Clean URLs, CSS for styling
• Sensible handling of metadata, assets, internal links, users, permissions etc

Apache Roller
http://roller.apache.org/

• Multi-user blog server
• Used by the ASF internally
• Scales to thousands of users & blogs
• Should work with any JavaEE servlet container and SQL database
• Comment moderation and spam filters
• Each author has full layout control
• Indexes, feeds and MetaWeblog API support for 3rd party clients

Apache Shindig
http://shindig.apache.org/

• OpenSocial Application Container
• Hosts your OpenSocial widgets
• Renders OpenSocial applications into HTML + JavaScript
• Stores the data for your application
• Full client-side JavaScript libraries to deliver gadget functionality
• Reference implementation

Apache Wookie (Incubating)
http://incubator.apache.org/wookie/

• 5.30pm Wednesday
• W3C Widgets server
• Upload, Deploy and Host Widgets
• Widgets can range from a badge, through a small app to a full-blown collaborative system like chat
• Connector framework to make it easy to write widgets in many languages

Web Frameworks

(those with a strong Content focus to them)

Apache Sling
http://sling.apache.org/

• 12pm Wednesday
• “Fun” and easy web framework
• REST based
• Backed by Jackrabbit content repo
• Powered by OSGi
• Easy to script, supports multiple output languages (JSP, server-side JavaScript, Scala etc)
• Stores both templates and content

Apache Tapestry
http://tapestry.apache.org/

• Object Orientated web applications
• Build your application in terms of objects, methods and properties
• Tapestry handles URLs, query parameters and state for you
• Pages built with simple HTML
• Concentrate on the content that backs each part, and the business logic for it
• Tapestry glues it together for you

Apache Tiles
http://tiles.apache.org/

• Templating framework for Java
• Works well with Struts and Shale
• Lets you build your page from lots of tiles (components), which can nest
• Build tiles together to make templates
• Clean separation between your content, the business logic to select it, and the rendering rules

Apache Velocity
http://velocity.apache.org/

• Templating engine
• MVC webapp or standalone
• Can generate HTML, SQL, PostScript, XML, Java code or email from templates
• Anakia lets you make an xdoc file available to a Velocity template, handy when generating HTML from xdoc
• Fairly rich templating language

Apache Wicket
http://wicket.apache.org/

• Build your web applications in Java
• Uses Java in preference to JavaScript, CSS etc
• Handy if you have a strong Java team and you need to do some web stuff
• Fits well with your Java components
• But JS / CSS front end devs tend to be cheaper than Java ones...

Apache Clerezza (Incubating)
http://incubator.apache.org/clerezza/

• OSGi based modular semantic web application framework
• Lets you build applications that fit into the Semantic Web
• Stores and easily manipulates RDF
• Full control over REST and URIs
• Build applications that both consume semantic data (e.g. RDF files), and that expose content to others

Any Questions?

Any cool projects that I happened to miss?