Open Source Business Intelligence Overview

Open Source Business Intelligence Tools

Alex Meadows

TriLUG, January 2012

Agenda

Business Intelligence Overview

Review of OSBI Tools

Data Warehousing

Data Integration

Reporting/OLAP

Visualization

Statistical Analysis/Predictive Analytics

Tonight's agenda will basically be an overview of the different areas of BI, starting from the back-end with data warehousing and data integration and moving into the front-end with reporting, visualization, and statistical analysis.

What Is Business Intelligence?

Utilizing technology to identify and analyze trends in data to make better business decisions.

So it's important for us to level-set and define what BI really is. It has quickly become one of the most important fields in the business world because it allows businesses to make better, faster decisions.

Source: Back In Business, Klimberg, Miori (www.informs.org)

Overlapping Fields

BI is not just one field, but many overlapping fields. One can't just look at IT and say that it is BI. It takes experts in data management, process modeling, and statistics to really make a BI program deliver the best return on investment.

Source: Competing on Analytics; Thomas Davenport, Jeanne Harris

Competing On Analytics

Thomas Davenport discovered that businesses actually go through a very predictable pattern while developing the ability to make better business decisions through data. Analytically impaired companies are those that are more 'gut driven'. They make decisions based on conjecture and feeling, not on the actual data in their systems! At the top are the analytical competitors. These companies make all of their business decisions with good data to back them up. Some examples of those companies are: Amazon, Harrah's Entertainment, and Zynga.

Phases of Growth

In keeping with the same pyramid structure, there is also a clear path to the types of tools used when companies develop a BI program. Usually reporting is the start because companies need to know what happened. As mentioned, all of these tools could be used in silos throughout an analytically impaired company. A silo example would be one employee that builds complex spreadsheets because there's no other way to report on their department.

As companies move up the Analytic Competitor pyramid, more of these tools are utilized and integrated throughout the organization. To reach their full potential, companies not only have to start using their data to make decisions; they also need built-in systems that can take the data, filter it based on business requirement criteria, and automatically change their workflow based on that data.

The Three Types of Questions

What happened?

How was performance last week?

What is currently happening?

How is performance right now?

What will happen?

What can I do to reach our goals?

BI's focus boils down to essentially three areas: past, present, and future. Taking performance as an example, one could start by using reporting tools to answer questions like 'How was my server's performance last week?'. At this point, the data is probably still coming from production systems, and the reporting itself can hinder the performance the company wants to report on. As the company matures, questions quickly arise not only about past performance, but also about how performance is trending and how well those systems are currently performing. Dashboards and other data visualization tools can report both trends and current performance. By this time, most companies will have at least started a rudimentary data warehouse for performance reasons.

Many companies stop there, at present performance. It takes a lot of effort to move into predictive analytics because more data-oriented skills are needed. Answering with certainty about future performance based on historical trends is the ultimate goal of BI.
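To make the three questions concrete, here is a minimal sketch using a made-up weekly response-time series. The numbers, the names `weeks` and `response_ms`, and the choice of a straight-line trend are all assumptions for illustration. Real predictive work would use a proper statistics tool and would come with far less certainty than a single fitted line suggests.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: average server response time (ms) per week.
weeks = [1, 2, 3, 4, 5]
response_ms = [200, 210, 220, 230, 240]

# What happened? Summarize the past.
average_so_far = sum(response_ms) / len(response_ms)

# What is currently happening? Look at the latest observation.
current = response_ms[-1]

# What will happen? Extrapolate the historical trend one week ahead.
slope, intercept = fit_line(weeks, response_ms)
forecast_week_6 = slope * 6 + intercept

print(average_so_far, current, forecast_week_6)
```

The first two answers are plain reporting and dashboard territory. Only the third needs a model, which is why it sits at the top of the pyramid.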

Data Warehousing

Store data outside of application/normal business environment (i.e. ERP systems)

Specific for reporting/analytics

Modeling Styles

3NF (normal database modeling)

Data Marts (aka star schemas)

Data Vault (hybrid 3NF/Data Mart)

Anchor Modeling (6NF)

Any good BI program starts with a data warehouse. You can think of a warehouse as a specialized database that offloads historical data from your production environment. It does a lot more than that as well. Unlike a production environment, a data warehouse stores deltas: changes in the data set that would otherwise be lost forever. For example, if you have a table that stores an employee's first name, the production system would only store the current value. If an employee named Robert changed his name to Bob and then to Sally, your production database would never remember the first two values. The data warehouse would not only store all three, but also the time each change occurred and how long each value was valid.

The other neat thing about data warehouses is how they integrate data from across an organization. If a company has an ERP, a website, and an external data set, the warehouse can integrate those three systems' data into one cohesive data set.

There are many different modeling styles for a data warehouse. The traditional methodologies are very similar to what is used in an ideal database environment. Third normal form is the standard normalization you would see in a typical database, while data marts move the data into a format that is better suited for reporting and analysis by end users. In the Data Warehousing 2.0 line, there is data vault modeling, which is a hybrid of the first two, and anchor modeling. Anchor modeling is interesting in that it is actually sixth normal form and can get pretty complex.
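The Robert/Bob/Sally example can be sketched in a few lines. This is only an illustration of keeping history with validity dates (a pattern often called a type 2 slowly changing dimension), not how any particular warehouse implements it. The class name, function names, and dates are all made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class NameVersion:
    employee_id: int
    first_name: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still current"


def record_change(history: List[NameVersion], employee_id: int,
                  new_name: str, changed_on: date) -> None:
    """Close the employee's current version (if any) and append the new one."""
    for version in history:
        if version.employee_id == employee_id and version.valid_to is None:
            version.valid_to = changed_on
    history.append(NameVersion(employee_id, new_name, changed_on))


def name_as_of(history: List[NameVersion], employee_id: int,
               when: date) -> Optional[str]:
    """Return the name that was valid on a given day, or None if unknown."""
    for version in history:
        if version.employee_id != employee_id:
            continue
        still_valid = version.valid_to is None or when < version.valid_to
        if version.valid_from <= when and still_valid:
            return version.first_name
    return None


# Hypothetical dates for the three values in the example.
history: List[NameVersion] = []
record_change(history, 1, "Robert", date(2010, 1, 4))
record_change(history, 1, "Bob", date(2011, 3, 1))
record_change(history, 1, "Sally", date(2011, 9, 15))

print(name_as_of(history, 1, date(2011, 6, 1)))
```

A production table would simply overwrite the name. Here nothing is overwritten: each change closes the old row and adds a new one, so the warehouse can answer both "what is the name now" and "what was it on a given day".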

Data Warehousing

Databases

MySQL, Postgres, etc

Columnar Data Stores

Infobright*, LucidDB, InfiniDB*, etc.

Hybrid Data Warehouse Databases

Greenplum* (both RDBMS and Columnar)

NoSQL

Hadoop, CouchDB, MongoDB, etc.

*Hardware and/or Software limitations in community editions

There are actually quite a few options for warehousing in OS, from more traditional databases that work well with 3NF to columnar data stores that are highly optimized for data marts. NoSQL has also become an option because it can store the unstructured and semi-structured data that could never be stored in a normal warehouse environment.

RDBMS vs Columnar

Source: http://www.calpont.com/column-oriented-database-bi

Columnar data stores basically flip the data from row based into columns. In a typical database, if the last name column needed to be filtered on, every row would have to be read in full, so columns one through three would all be scanned. In a columnar store, the values of the last name column are stored together, so that column can be filtered on directly and the other aggregations can be performed as fast as the data can be read.
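The difference in layout can be sketched with plain Python lists. This is a toy model only: the table, its column names, and the two query functions are invented for illustration, and real columnar engines add compression and other optimizations on top.

```python
# Row-oriented layout: each record is stored together.
rows = [
    (1, "Pat", "Smith", 120),
    (2, "Lee", "Jones", 95),
    (3, "Sam", "Brown", 110),
    (4, "Kim", "Jones", 80),
]

# Column-oriented layout: the same data, flipped so each column is stored together.
columns = {
    "id": [r[0] for r in rows],
    "first_name": [r[1] for r in rows],
    "last_name": [r[2] for r in rows],
    "sales": [r[3] for r in rows],
}


def total_sales_row_store(last_name):
    # Every full record is touched, including columns the query does not need.
    return sum(record[3] for record in rows if record[2] == last_name)


def total_sales_column_store(last_name):
    # Only the last_name column is scanned for the filter...
    matches = [i for i, value in enumerate(columns["last_name"])
               if value == last_name]
    # ...and only the sales column is read for the aggregation.
    return sum(columns["sales"][i] for i in matches)


print(total_sales_row_store("Jones"), total_sales_column_store("Jones"))
```

Both functions return the same answer. The point is how much data each one has to walk through to get it.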

The other neat thing about columnar databases is that many of them are smart enough to learn how users query their data sets. They can actually trim and grow their indexes accordingly so that users will get huge performance gains.

NoSQL?

Not Only SQL

Unstructured/semi-structured data

Huge (multi-terabyte to petabyte+) data sets

Source: http://www.information-management.com/specialreports/20040622/1005301-1.html

NoSQL tools are able to store 'documents' in a highly compressed way so that PB+ data sets can be quickly filtered through. This is the tool that warehousers have wanted for years, but it is only now starting to go mainstream! Until now, unstructured and semi-structured data sets could not easily be searched. It's easily the proverbial gold mine. Look at Facebook or Twitter and you can see where this could be a huge advantage for understanding customer bases.
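As a rough idea of what 'semi-structured' means in practice, here is a small sketch using JSON documents whose fields vary from one record to the next. The documents, field names, and the `mentions` helper are all hypothetical, and this says nothing about how any specific NoSQL store indexes or compresses its data.

```python
import json

# Hypothetical social-media style documents; no two share the same set of fields.
raw_documents = [
    '{"user": "a1", "text": "Love the new release", "tags": ["release"]}',
    '{"user": "b2", "text": "Checkout page is slow", "location": {"city": "Raleigh"}}',
    '{"user": "c3", "likes": 12}',
]

documents = [json.loads(raw) for raw in raw_documents]


def mentions(docs, word):
    """Return the users whose document text contains the given word."""
    word = word.lower()
    return [doc["user"] for doc in docs
            if word in doc.get("text", "").lower()]


print(mentions(documents, "slow"))
```

None of these records would fit cleanly into one fixed table without a lot of empty columns, which is the gap the document-style stores fill.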

Data Integration

Syncing data across systems

Includes:

ETL (Extract, Transform, Load)

MDM (Master Data Management)

EAI (Enterprise Application Integration)

EII (Enterprise Information Integration)

Where data warehouses are the backend storage system, data integration acts as the plumbing. DI moves data from source systems into a warehouse or other application. There are many types of DI, from ETL which is moving, cleaning, and loading data, to MDM, which is moving and syncing data across systems, and more.
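The extract, transform, and load steps can be sketched with nothing but the Python standard library. This is a toy pipeline, not how Talend or Kettle work internally. The CSV content, the `orders` table, and the cleaning rules are assumptions made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source extract: messy names and one row with a missing amount.
SOURCE_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,
3,Carol,7.25
"""


def extract(text):
    """Extract: read the raw records from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform: clean names, convert types, drop rows with no amount."""
    cleaned = []
    for record in records:
        amount = record["amount"].strip()
        if not amount:
            continue
        cleaned.append((int(record["order_id"]),
                        record["customer"].strip().title(),
                        float(amount)))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders "
                 "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
load(conn, transform(extract(SOURCE_CSV)))
print(conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall())
```

The DI tools covered next do the same three steps, but with connectors, scheduling, and error handling that would be tedious to write by hand.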

There are two big OS DI tools, Talend and Pentaho K.E.T.T.L.E.

Talend

Data Management Tool Suite

ETL

MDM

Data Profiling

Data Quality

Code generator

Eclipse based

Extensible plugin architecture

Pentaho K.E.T.T.L.E.

Kettle Extraction, Transport, Transformation, and Loading Environment

Focus on ETL

Extensible plugin architecture

Engine based

Reporting

Focus: Historical Analysis

Now that the back-end has been covered, we can start climbing the pyramid of front-end tools. Reporting is the start of this climb and usually where most organizations start since it is the easiest to implement.

Reporting Options

Features compared: MDX, Pivot Table, Charting, SQL, Other Sources*, Drill Through, Parameterized

*Flat Files, NoSQL, etc.

BIRT

Pentaho

JasperReports

SQL Power Wabit

Saiku

There are quite a few options out there, and these are some of the more popular ones. The comparison is only taking into account the actual reporting tool and not their server-side component, if applicable.

BIRT is an Eclipse-based tool, so if you're using Eclipse you may want to consider it. Pentaho's Report Designer and JasperReports are stand-alone tools. All three use a style of design known as banded reports, where data elements are essentially dragged and dropped onto a palette. All three have server-side components, and all three report designers can embed reports into existing applications (e.g. web apps, Java apps).

The neat thing about Saiku and SQL Power Wabit is that they are both built to handle OLAP cubes as well as normal reporting. Saiku's Interactive Reporting tool is still in beta, but is looking very impressive. It is a thin-client analytics tool that can be embedded in BI servers or live as its own stand-alone tool.

BIRT Example

Some charts generated in BIRT.

Here is a screenshot of Pentaho's Report Designer. Each line of the report is the 'banded row' mentioned earlier.

Visualization

Focus: Trending and Present

Visualization is the next area of our tour. In a nutshell, visualizations take very complex data and make it very easy to interpret and act on.

This dashboard is from Stephen Few's Information Dashboard Design book. Notice how it is not flashy, with muted colors that really help to draw attention to the bright red circles. There is a lot of information packed into this space. From trends, to current performance and pacing, it's all here and in plain sight. Usually dashboards like this will also have a drill through ability. For example, clicking on an alert will take you to a more detailed report or view of the data so that a decision can be made on how to react.

Visualizations can also be fun, and even describe themselves. XKCD has quite a few such examples.

Notice how much information is packed into such a small space, yet can still be understood.

Pentaho CDE/CDF

Dashboard framework and editor built into Pentaho BI Server

Community developed; uses open web languages (JavaScript, HTML, etc.).

There is really only one OS tool that I have been able to find that builds dashboards akin to Few's. Pentaho's Community Dashboard Framework and Editor was designed by WebDetails and adopted by Pentaho. It is still a stand-alone library.

This is a sample dashboard that WebDetails built for a training course on the tools. Notice that the same principles used by Few are applied here.

Statistics/Predictive Analytics

Focus: All relevant data used to predict outcomes

We've reached the top of our tour of BI. Statistical and Predictive analysis is the goal, and OS provides quite a few options.

Statistics/Predictive Analytics

R: stats oriented

Weka: machine learning oriented

RapidMiner: mixed

Originally YALE

Weka and R Plugins

Like SAS Enterprise Miner

Here's a pic of RapidMiner at work.

BI From Reporting to Statistical Analysis

Areas compared: ETL, Metadata, Reporting, Dashboards, OLAP***, Statistics, Automated Decisions

Jaspersoft*

Pentaho**

SpagoBI* **

* Utilizes Talend ETL

** Utilizes Weka Data Mining

*** All use Mondrian for OLAP, with different front ends

Of note, there are three companies providing an OSBI suite of tools. The biggest difference between them is their communities. Jaspersoft and SpagoBI's suites are not totally in their control because they have licensed Talend for their ETL and Metadata tools.

All three use Pentaho's Mondrian OLAP engine.

Pentaho and SpagoBI license the use of Weka as part of their suite of tools.

Shameless Plug

RTP Pentaho User Group

On LinkedIn (soon to be also on Meetup)

Meets quarterly

Yes, I have to put in a shameless plug. I am the Community Leader for the local Pentaho User Group. We are currently on LinkedIn (www.linkedin.com/groups/RTP-Pentaho-User-Group-3674498) and will soon be on Meetup. We're currently meeting quarterly and are looking for speakers.
