

DATA AND BUSINESS PROCESS

INTELLIGENCE

PENTAHO PLATFORM

DEVELOPED AT:

BHAT, GANDHINAGAR-382428

DEVELOPED BY:

BHAGAT FARIDA H. SINGH SWATI

11ITUOS079 11ITUOS068

GUIDED BY:

INTERNAL GUIDE EXTERNAL GUIDE

PROF. R.S. CHHAJED MR. VIJAY PATEL

Department of Information Technology.

Faculty of Technology,

Dharmsinh Desai University,


College Road, Nadiad- 387001.


CANDIDATE’S DECLARATION

We declare that the final semester report entitled “DATA AND BUSINESS PROCESS

INTELLIGENCE” is our own work conducted under the supervision of the external

guide Mr. Vijay Patel, Institute for Plasma Research, Bhat, Gandhinagar, and internal

guide Prof. R.S. Chhajed, Faculty of Technology, DDU, Nadiad.

We further declare that, to the best of our knowledge, the report for B.TECH SEM-VIII

does not contain any part of work that has been submitted to this or any other

university without proper citation.

Farida Bhagat H.

Branch: Information Technology

Student ID: 11ITUOS079

Roll: IT-07

Singh Swati

Branch: Information Technology

Student ID: 11ITUOS068

Roll: IT-124

Submitted To:

PROF. R.S. CHHAJED,

Department of Information Technology,

Faculty of Technology,

Dharmsinh Desai University,

Nadiad

DDU (Faculty of Tech., Dept. of IT) i

Page 4: DATA AND BUSINESS PROCESS INTELLIGENCE

DDU (Faculty of Tech., Dept. of IT) ii

Page 5: DATA AND BUSINESS PROCESS INTELLIGENCE

DHARMSINH DESAI UNIVERSITY

NADIAD-387001, GUJARAT

CERTIFICATE

This is to certify that the project entitled “DATA AND BUSINESS PROCESS

INTELLIGENCE” is a bona fide report of the work carried out by

1) Miss BHAGAT FARIDA H., Student ID No: 11ITUOS079

2) Miss SINGH SWATI, Student ID No: 11ITUOS068

of the Department of Information Technology, semester VIII, under guidance and supervision, for the award of the degree of Bachelor of Technology at Dharmsinh Desai University, Gujarat. They were involved in project training during the academic year 2013-2014.

Prof. R.S.Chhajed

HOD, Department of Information Technology,

Faculty of Technology,

Dharmsinh Desai University, Nadiad

Date:


ACKNOWLEDGEMENTS

We are grateful to Mr. Amit Srivastava (Institute for Plasma Research) for giving us

this opportunity to work under the guidance of a prominent solution expert in the field of software engineering, and also for providing us with the required resources at the institute.

We are also thankful to Mr. Vijay Patel (Institute for Plasma Research) for guiding us

in our project and sharing valuable knowledge with us.

It gives us immense pleasure and satisfaction in presenting this report of Project

undertaken during the 8th semester of B.Tech. As it is the first step into our professional

life, we would like to take this opportunity to express our sincere thanks to several

people, without whose help and encouragement, it would be impossible for us to carry

out this desired work.

We would like to express thanks to our Head of Department Prof. R. S. Chhajed who

gave us an opportunity to undertake this work. We are grateful to him for his guidance in

the development process.

Finally, we would like to thank all Institute for Plasma Research employees, all the faculty

members of our college, friends and family members for providing their support and

continuous encouragement throughout the project.

Thank you

Bhagat Farida H.

Singh Swati


TABLE OF CONTENTS

ABSTRACT……………………………………………………………………………....1

COMPANY PROFILE………………………………………………………………......3

LIST OF FIGURES……………………………………………………………………...4

LIST OF TABLES……………………………………………………………………….6

1. INTRODUCTION…………………..……………………………………………….7

1.1 Project Details……………………………………………………………………7

1.2 Purpose…………………………………………………………………………....7

1.3 Scope………………………………………………………………………………7

1.4 Objective………………………………………………………………………….8

1.5 Technology and Literature Review……………………………………………..8

1.5.1 Alfresco ECM……………………………………………………………8

1.5.2 Pentaho Platform………………………………………………………...9

2. PROJECT MANAGEMENT………………………………………………………10

2.1 Feasibility Study………………………………………………………………...10

2.2 Project Planning………………………………………………………………...10

2.2.1 Project Development Approach……………………………………….10

2.2.2 Project Plan…………………………………………………………..…11

2.2.3 Milestones and Deliverables...……………………………………….…12

2.2.4 Project Scheduling………………………………………………….…..13

3. SYSTEM REQUIREMENTS STUDY………………………………………..….14

3.1 User Characteristics……………..……………………………………………..14

3.2 Hardware and Software Requirements…………………………………….…14

3.2.1 Hardware Requirements……………………………………………….14


3.2.2 Software Requirements……………………………………………..…14

3.3 Constraints…………………………………………………………………...…15

3.3.1 Regulatory Policies…………………………………………………..…15

3.3.2 Hardware Limitations………………………………………………….15

3.3.3 Interfaces to Other Applications………………………………………15

3.3.4 CMIS……………………………………………………………………15

3.3.5 Parallel Operations……………………………………………………..16

3.3.6 Reliability Requirements………………………………………………16

3.3.7 Criticality of the Application………………………………………….16

3.3.8 Safety and Security Considerations………………………………...…16

4. ALFRESCO ECM SYSTEM……………………………………………………...17

4.1 Introduction……………………………………………………………………..17

4.2 Alfresco Overview………………………………………………………………17

4.3 Architecture...…………………………………………………………………...19

4.3.1 Client.……………………………………………………………………19

4.3.2 Server……………………………………………………………………19

4.4 Data Storage in Alfresco……………………………………………………….21

4.5 Relationship Diagrams…………………………………………………………21

5. TRANSFORMATION PHASE……………..…………………………………….24

5.1 Introduction…………………………………………………………………….24

5.2 Pentaho Data Integration Tool….……………………………………………..24

5.2.1 Introduction…………………………………………………………….24

5.2.2 Why Pentaho?..........................................................................................25

5.2.2.1 JasperSoft vs Pentaho vs BIRT……………………………………25

5.2.2.2 Conclusion…………………………………………………………..26

5.2.3 Components of Pentaho………………………………………………..27

5.3 Alfresco Audit Analysis and Reporting Tool………………...……………….28

5.3.1 Introduction…………………………………………………………….28

5.3.2 Working and Installation of A.A.A.R. ..................................................29


5.3.2.1 Prerequisites………………………………………………………….30

5.3.2.2 Enabling Alfresco Audit Service…………………………………...30

5.3.2.3 Data Mart Creation and Configuration………………………...…30

5.3.2.4 PDI Repository Setting..……………………………………………31

5.3.2.5 First Import………………………………………………………....36

5.3.3 Audit Data Mart………………………………………………………...36

5.3.4 Dimension Tables……………………………………………………….37

5.4 Transformations Using Spoon…………………………………………………38

5.5 Example Transformations………..………………………………………….…38

6. REPORTING PHASE……………..………………………………………….……42

6.1 What is a Report?.……………………………………………………………...42

6.2 Pentaho Report Designer Tool….……………………………………………...42

6.2.1 Introduction……………………………………………………………..42

6.2.2 Working of Pentaho Designer………………………………………….43

6.3 Example Reports………..………………………………………………………44

7. PUBLISHING PHASE……………..………………………………………………46

7.1 Introduction………………..…………………………………………………...46

7.2 Pentaho BI Server………...…………………………………………………….46

7.2.1 Introduction……………………………………………………………...46

7.2.2 Example Published Reports……………………………………………47

7.3 Scheduling of Transformations…………….………………………………….50

8. TESTING……………..……………………………………………………………..51

8.1 Testing Strategies….…………………………………………………………....51

8.2 Testing Methods………………………………………………………………...52

8.3 Test Cases……………………………………………………………………….53

8.3.1 User Login and Functionality of Report………………………………53

8.3.2 Viewing Documents, Folders, Permissions, Audits…………………...54


9. USER MANUAL……………………………………………………………………55

9.1 Description………………………………………………………………………55

9.2 Login Page………………………………………………………………………55

9.3 View Reports……………………………………………………………………57

9.4 Scheduling………………………………………………………………………59

9.5 Administration……………………………………………………………….....62

10. LIMITATIONS AND FUTURE ENHANCEMENTS……………………………64

10.1 Limitations……………………………………………………………………..64

10.2 Future Enhancements…………………………………………………………64

11. CONCLUSION AND DISCUSSION……………………………………………...65

11.1 Self Analysis of Project Viabilities……………………………………………65

11.1.1 Self Analysis……………………………………………………………...65

11.1.2 Project Viabilities………………………………………………….…….65

11.2 Problems Encountered and Possible Solutions……………………………...65

11.3 Summary of Project Work…………………………………………………...66

12. REFERENCES…………………………………………………………………….68


ABSTRACT

Design and implement a platform for a Data and Process Intelligence tool.

IPR has selected Alfresco, an Enterprise Content Management (ECM), as an

Electronic Document and Record Management System (EDRMS). Alfresco does not have powerful reporting functionality, and frankly, that is not its job.

Unfortunately, the need for powerful reporting is still there and most of the

answers are tricky solutions, quite hard to manage and scale. Alfresco ECM has a

detailed audit service that exposes a lot of (potentially) useful information.

Alfresco is integrated with Activiti, a Business Process Management (BPM)

Engine. It also has auditing functionality and exposes the audit data related to processes and tasks.

The Data and Process Intelligence tool (the project) will be divided into two parts. The first part will be Alfresco Data Integration, which will provide a solution to

extract, transform, and load (ETL) data (document/folder/process/task) together

with the audit data at a very detailed level in a central warehouse. On top of that,

it will provide data cleansing and merging functionality and, if needed, convert the data into OLAP format for efficient analysis.

The second part will be the reporting functionality. The goal is a generic reporting tool that is useful to the end user in a very easy way. The data will be

published in reports in well-known formats (pdf, Microsoft Excel, csv, etc.) and

stored directly in Alfresco as static documents organized in folders.

To achieve the above goal, Alfresco will be integrated with a powerful open source

data integration and reporting tool. The necessary data from the Alfresco

Repository will be extracted, transformed, merged/integrated and loaded in the

data warehouse. The necessary schema transformations (for example, OLTP to OLAP) will be applied to increase efficiency. The solution will be a scalable


and generic reporting system with an open window on the Business Intelligence world. That said, the solution will also be suitable for publishing (static)

reports containing not only audit data coming from Alfresco but also Key

Performance Indicators (KPIs), analysis and dashboards coming from a complete

Enterprise Data Warehouse.


COMPANY PROFILE

Institute for Plasma Research (IPR) is an autonomous physics research institute

located in Gandhinagar, India. The institute is involved in research in aspects of

plasma science including basic plasma physics, research on magnetically confined

hot plasmas and plasma technologies for industrial applications. It is a large and

leading plasma physics organization in India. The institute is mainly funded

by the Department of Atomic Energy. IPR plays a major scientific and technical role in the Indian partnership in the international fusion energy initiative ITER

(International Thermonuclear Experimental Reactor).

IPR is now internationally recognized for its contributions to fundamental and

applied research in plasma physics and associated technologies. It has a scientific

and engineering manpower of 200 with core competency in theoretical plasma

physics, computer modeling, superconducting magnets and cryogenics, ultra high

vacuum, pulsed power, microwave and RF, computer-based control and data

acquisition and industrial, environmental and strategic plasma applications.

The Centre of Plasma Physics - Institute for Plasma Research has active

collaboration with the following Institutes/ Universities:

Bhabha Atomic Research Centre, Bombay

Raja Ramanna Centre for Advanced Technology, Indore

IPP, Juelich, Germany; IPP, Garching, Germany

Kyushu University, Fukuoka, Japan

Physical Research Laboratory, Ahmedabad

National Institute for Interdisciplinary Science and Technology, Bhubaneswar

Ruhr University Bochum, Bochum, Germany

Saha Institute of Nuclear Physics, Calcutta

St. Andrews University, UK

Tokyo Metropolitan Institute of Technology, Tokyo

University of Bayreuth, Germany; University of Kyoto, Japan.


LIST OF FIGURES

1. MVC Architecture…………….……………………………………….Fig 1.1

2. Flowchart of the project……………………………………………….Fig 2.1

3. Gantt Chart…………………………………………………………….Fig 2.2

4. Alfresco Icon…………………………………………………………...Fig 4.1

5. Uses of Alfresco ECM…………………….………………………...…Fig 4.2

6. Alfresco Architecture……………………….…………………………Fig 4.3

7. Relational Diagrams (users, documents and folders)……………..…Fig 4.4

8. Relational Diagrams (permissions)…………………………………...Fig 4.5

9. Relational Diagrams (audits)………………………………………….Fig 4.6

10. Pentaho Data Integration Icon……………………………………..…Fig 5.1

11. Pentaho Icon…………………………………………………………...Fig 5.2

12. A.A.A.R. Icon………………………………………………………….Fig 5.3

13. Working of A.A.A.R…………………………………………………..Fig 5.4

14. PDI Repository Settings Step 1……...………………………………..Fig 5.5

15. PDI Repository Settings Step 2……...………………………………..Fig 5.6

16. PDI Repository Settings Step 3……...………………………………..Fig 5.7

17. PDI Repository Settings Step 4……...………………………………..Fig 5.8

18. PDI Repository Settings Step 5……...………………………………..Fig 5.9

19. PDI Repository Settings Step 6…….………………………………..Fig 5.10

20. PDI Repository Settings Step 7….....………………………………..Fig 5.11

21. PDI Repository Settings Step 8….....………………………………..Fig 5.12

22. Audit Data Mart……………………………………………………...Fig 5.13

23. Dimension Tables…………………………………………………….Fig 5.14

24. Document Information Transformation…………...……………….Fig 5.15

25. Document Permission Transformation…………...………………...Fig 5.16

26. Folder Information Transformation…………….....……………….Fig 5.17


27. Folder Permission Transformation………………...……………….Fig 5.18

28. User Information Transformation……………….....……………….Fig 5.19

29. Pentaho Reporting Tool Icon……………………………………..…..Fig 6.1

30. Document Information Report…….……..………...………………....Fig 6.2

31. Document Permission Report……...……..………...………………....Fig 6.3

32. Folder Information Report…….……..…..………...………………....Fig 6.4

33. Folder Permission Report………….……..………...………………....Fig 6.5

34. User Information Report…………..……..………...………………....Fig 6.6

35. Pentaho BI Server Icon……………………………………….……….Fig 7.1

36. Document Information Report…….……..………...………………....Fig 7.2

37. Document Permission Report……...……..………...………………....Fig 7.3

38. Folder Information Report…..…….……..………...………………....Fig 7.4

39. Folder Permission Report.……...….……..………...………………....Fig 7.5

40. User Information Report…………..……..………...………………....Fig 7.6

41. Scheduling of Transformations…...…………………………………..Fig 7.7

42. Login Step 1………………………………………………...………….Fig 9.1

43. Login Step 2………………………………………………...………….Fig 9.2

44. Login Step 3………………………………………………...………….Fig 9.3

45. View Reports Step 1………………...…………………………………Fig 9.4

46. View Reports Step 1………………...…………………………………Fig 9.5

47. Scheduling Page………...……………………………………………...Fig 9.6

48. Administration Page……………….…………………………………..Fig 9.7


LIST OF TABLES

1. Milestones and Deliverables……….……………………………… Table 2.1

2. Project Scheduling Table…………………………………………...Table 2.2

3. Test Case 1…………………………………………………………..Table 8.1

4. Test Case 2…………………………………………………………..Table 8.2

5. Scheduling Options………………..………………………………..Table 9.1

6. Scheduling Controls………………..……………………………….Table 9.2

7. Administration Options…………………………………………….Table 9.3


INTRODUCTION

1.1 PROJECT DETAILS

Institute for Plasma Research has selected Alfresco, an Enterprise Content Management (ECM) system, as an Electronic Document and Record Management System (EDRMS). Alfresco does not have powerful reporting functionality. Thus, IPR requires a reporting tool to

present the various details related to metadata of the documents and folders (folders are

used to organize documents), access control applied on documents and folders.

Additional analyses (like most active user, most active documents in last week, months

etc.) are required on audit trail data generated by Alfresco. Some Key Performance Indicators (KPIs) need to be generated for the document review and approval process.

Possibilities to create and export reports in well-known formats (PDF, Microsoft Excel, CSV, etc.) need to be provided. There will be a central administrator who has the

possibility to configure the access rights on the reports for end users. Additionally, end users shall be able to subscribe to reports, schedule report generation, and have reports sent via e-mail as attachments in their preferred format.

1.2 PURPOSE

This system needs to be developed to enhance the way of looking at a traditional

document management system and to make it more user-friendly. Along with all the

features, we need a few customizations for the better usability of the resources. With

these powerful reporting tools, it will become easy and secure to understand the files and

documents in the institute. Also, it would help in decision making so as to what steps

have to be taken on the basis of the reports generated.

1.3 SCOPE

The scope of the current project is just to implement a framework/deployment


architecture using a BI toolset and test it by integrating it with Alfresco. An Alfresco data mart

will be created and used for developing analysis reports related to document management

system. The reports will be made available securely to the employees of the institute,

collaborators, and contractors over the internet.

In the future, the generic reporting architecture implemented as part of this project will be used

and extended as a full Data Warehouse solution by integrating and merging other data

management tools of IPR. The full DW solution is out of the scope of this project.

1.4 OBJECTIVE

The objective of this project is to improve the visibility of the document management system

and enhance decision making. Alfresco is a powerful content management system.

Unfortunately, the need for powerful reporting is still there and most of the answers are

tricky solutions, quite hard to manage and scale. To achieve the above goal, Alfresco will be

integrated with a powerful open source data integration and reporting tool. The necessary

data from the Alfresco Repository will be extracted, transformed, merged/integrated and

loaded in the data warehouse. The necessary schema transformation (for example OLTP

to OLAP) will be applied to increase efficiency. The solution will be a scalable and generic reporting system with an open window on the Business Intelligence world. That said, the solution will also be suitable for publishing (static) reports containing

not only audit data coming from Alfresco but also Key Performance Indicators (KPIs),

analysis and dashboards coming from a complete Enterprise Data Warehouse.
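The OLTP-to-OLAP reshaping mentioned above can be pictured as splitting flat audit rows into a fact table keyed against dimension tables. The sketch below is a toy illustration in Python; the row fields and surrogate-key scheme are invented for the example and do not reflect the actual Alfresco or A.A.A.R. schema.

```python
# Illustrative sketch of an OLTP-to-star-schema transformation.
# Field names and keys are hypothetical, not the real Alfresco schema.

def build_star_schema(audit_rows):
    """Split flat OLTP-style audit rows into one fact table
    plus user and document dimension tables."""
    dim_user, dim_document, fact_audit = {}, {}, []
    for row in audit_rows:
        # Assign a surrogate key the first time each user/document is seen.
        user_key = dim_user.setdefault(row["user"], len(dim_user) + 1)
        doc_key = dim_document.setdefault(row["document"], len(dim_document) + 1)
        # The fact row keeps only keys and measures.
        fact_audit.append({"user_key": user_key,
                           "doc_key": doc_key,
                           "action": row["action"],
                           "date": row["date"]})
    return dim_user, dim_document, fact_audit

rows = [
    {"user": "farida", "document": "report.pdf", "action": "READ", "date": "2014-03-01"},
    {"user": "swati",  "document": "report.pdf", "action": "UPDATE", "date": "2014-03-02"},
    {"user": "farida", "document": "plan.xls",  "action": "READ", "date": "2014-03-02"},
]
users, docs, facts = build_star_schema(rows)
print(len(users), len(docs), len(facts))   # 2 2 3
```

Queries such as "most active user in the last week" then become simple aggregations over the fact table joined to the dimensions.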

1.5 TECHNOLOGY AND LITERATURE REVIEW

1.5.1 ALFRESCO ECM

An open source, Java-based Enterprise Content Management (ECM) system named Alfresco has been selected as the document repository. It uses the MVC architecture. Model-View-Controller (MVC) is a software pattern for implementing user interfaces. It divides a

given software application into three interconnected parts, so as to separate internal

representations of information from the ways that information is presented to or accepted

from the user.

Model: It consists of application data, business rules, logic and functions. Here, XML is

used for the same.

View:  It is the output representation of information. Here, FTL is used for the same.

Controller: It accepts input and converts it to commands for the model or view.

Figure 1.1 MVC Architecture
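The three MVC parts described above can be sketched as follows. This is a deliberately minimal toy example of the pattern itself, not Alfresco's implementation (which, as noted, uses XML for models and FTL for views):

```python
# Minimal MVC sketch illustrating the pattern described in the text.

class Model:
    """Holds application data and business rules."""
    def __init__(self):
        self.documents = []
    def add_document(self, name):
        if not name:                       # business rule: no empty names
            raise ValueError("empty name")
        self.documents.append(name)

class View:
    """Output representation of the model's data."""
    def render(self, model):
        return "Documents: " + ", ".join(model.documents)

class Controller:
    """Accepts input and converts it into commands for the model."""
    def __init__(self, model, view):
        self.model, self.view = model, view
    def handle(self, command, payload=None):
        if command == "add":
            self.model.add_document(payload)
        return self.view.render(self.model)

controller = Controller(Model(), View())
print(controller.handle("add", "report.pdf"))   # Documents: report.pdf
```

The separation means the same Model could be rendered by a different View (say, HTML instead of plain text) without touching the business rules.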

1.5.2 PENTAHO PLATFORM

Pentaho is a company that offers Pentaho Business Analytics, a suite of open

source Business Intelligence (BI) products which provide data integration, OLAP

services, reporting, dashboard, data mining and ETL capabilities. Pentaho was founded in

2004 by five founders and is headquartered in Orlando, FL, USA.

Pentaho software consists of a suite of analytics products called Pentaho Business

Analytics, providing a complete analytics software platform. This end-to-end solution

includes data integration, metadata, reporting, OLAP analysis, ad-hoc query, dashboards,

and data mining capabilities. The platform is available in two offerings: a community

edition (CE) and an enterprise edition (EE).


PROJECT MANAGEMENT

2.1 FEASIBILITY STUDY

A feasibility study includes an analysis and evaluation of a proposed project to determine if

it is technically feasible, is feasible within the estimated cost, and will be profitable.

The following software has to be installed for the project:

1. Alfresco Enterprise Content Management

2. PostgreSQL and SQuirreL

3. Pentaho Data Integration Tool (K.E.T.T.L.E.)

4. Alfresco Audit Analysis and Reporting Tool (A.A.A.R.)

5. Pentaho Reporting Tool

6. Pentaho BI Server

The study assures that the hardware cost required for one database server plus two web

servers is acceptable and the 500 GB of file storage for the final product is feasible.

2.2 PROJECT PLANNING

2.2.1 Project Development Approach

We have used the Agile methodology. After the feasibility study, the first thing to be done

was to create a basic flowchart charting out the flow of the project so as to create a mind

map. The base database system is Alfresco, from where we need to load tables using

PostgreSQL or SQuirreL. The set of tables is consolidated to create a staging data

warehouse. After the transformations on these tables using Pentaho Data Integration tool,

reports are created using Pentaho Reporting on the BI server, according to the given

requirements of the project.


Figure 2.1 Flowchart of the Project

Once the flowchart was made, we proceeded to the development part, keeping the

flowchart in mind. Thus, we started from studying Alfresco Enterprise Content

Management System and then moved on to Pentaho Tools. We also installed PostgreSQL

and SQuirreL so as to deal with the queries.
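The flow just described (extract from the Alfresco tables, consolidate, load into a staging warehouse) can be sketched schematically. In the actual project these steps are PDI transformations; the Python below is only a stand-in, and the table and field names are invented:

```python
# Schematic stand-in for the PDI staging step: pull rows from several
# source tables and consolidate them into one staging table.
# Table and field names here are invented for illustration.

def extract(source_tables):
    """Extract: yield (table_name, row) pairs from each source table."""
    for name, rows in source_tables.items():
        for row in rows:
            yield name, row

def transform(record):
    """Transform: normalize each row into a common staging layout."""
    table, row = record
    return {"source": table,
            "id": row["id"],
            "name": row["name"].strip().lower()}

def load(records, staging):
    """Load: append transformed rows into the staging table."""
    staging.extend(records)
    return staging

sources = {
    "alf_node": [{"id": 1, "name": " Report.PDF "}],
    "alf_person": [{"id": 7, "name": "Farida"}],
}
staging = load((transform(r) for r in extract(sources)), [])
print(staging)
```

In PDI the same extract/transform/load chain is drawn as steps and hops in Spoon and saved as a .ktr transformation rather than written as code.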

2.2.2 Project Plan

1. Gather the definition.

2. Check whether the definition is feasible or not in given deadline.

3. Requirement gathering.

4. Study and analysis on gathered requirements.

5. Transformation Phase.

6. Reporting Phase.

7. Deployment.


2.2.3 Milestones and Deliverables

Table 2.1 Milestones and Deliverables

Phase: Abstract and System Feasibility Study
Deliverable: Gained a complete understanding of the flow of the project
Purpose: To be familiar with the flow of the project

Phase: Requirement Gathering, Software Installation, and Understanding of Technology
Deliverable: Studied the ECM, its architecture, and how data is stored in the Alfresco repository
Purpose: Getting familiar with the Alfresco platform

Phase: Study of the Platform and Its Tools
Deliverable: Studied and used the three tools, namely Pentaho Data Integration, Pentaho Report Designer, and Pentaho BI Server
Purpose: Better understanding of the Pentaho platform and all the tools and plug-ins associated with it

Phase: Transformation Phase
Deliverable: Completed the transformation phase with the help of A.A.A.R., developed some custom ETL, and scheduled the transformation jobs to run during nights
Purpose: To build the staging data warehouse

Phase: Reporting Phase
Deliverable: Made the reports according to the user's requirements
Purpose: To complete the reporting phase

Phase: Deployment
Deliverable: Published the reports on the server in different output types, like PDF, CSV, etc.
Purpose: To deploy the solution on the web and hence complete the project

2.2.4 Project Scheduling

In project management, a schedule is a listing of a project's milestones, activities,

and deliverables, usually with intended start and finish dates.

Table 2.2 Project Scheduling Table

Abstract and Feasibility Study

Requirement Gathering

Study of Database Management System

Study of platform and associated tools

Transformation Phase

Reporting Phase

Deployment

(Timeline: 8-Dec through 23-Mar)

Figure 2.2 Gantt Chart


SYSTEM REQUIREMENT STUDY

3.1 USER CHARACTERISTICS

This system is made available on the web, so it can be accessed from anywhere. The users will be scientists, researchers, engineers, and other employees of the institute. They log in with their respective credentials.

3.2 HARDWARE AND SOFTWARE REQUIREMENTS

3.2.1 Server and Client Side Hardware Requirements

RAM: 4 GB
Hard disk: 40 GB
Processor: 2.4 GHz
File storage: 500 GB

3.2.2 Server and Client Side Software Requirements

Windows or Linux based system

PostgreSQL Database

SQuirreL Database Client tool

Alfresco ECM

Pentaho Community Edition 5.0 (PDI, Reporting Tool, BI Server)

Alfresco Audit Analyzing and Reporting tool

Notepad++

3.3 CONSTRAINTS

3.3.1 Regulatory Policies


Regulatory policies, or mandates, limit the discretion of individuals and agencies, or

otherwise compel certain types of behavior. These policies are generally thought to be

best applied when good behavior can be easily defined and bad behavior can be easily

regulated and punished through fines or sanctions. IPR is very strict about its policies and

ensures that all the employees follow it properly.

3.3.2 Hardware Limitations

To ensure the smooth working of the system, we need to meet the minimum hardware

requirements. We need at least 2GB RAM, 40GB hard disk and 2.4 GHz processor. All

these requirements are readily available. Hence, there are not really any hardware

limitations.

3.3.3 Interfaces to Other Applications

The ETL tool of a BI suite generally supports a number of standards-based protocols, including

the ODBC, JDBC, REST, web script, FTP and many more for extracting the data from

multiple sources. It is easy to integrate any data management application using supported

input protocols. We have used the CMIS (Content Management Interoperability Services)

and JDBC protocol for Alfresco data integration. The published reports will be integrated

back into Alfresco using the HTTP protocol. Single sign-on will be implemented by the IT department to provide transparent access to reports from Alfresco or any other web-based tools.

 

3.3.4 CMIS

CMIS (Content Management Interoperability Services) is an OASIS standard designed

for the ECM industry. It enables access to any content management repository that

implements the CMIS standard. We can consider using CMIS if an application needs

programmatic access to the content repository.
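Programmatic access through CMIS usually means issuing CMIS Query Language statements against the repository. The sketch below only builds such a query string; the commented connection lines show how it might be run with a CMIS client library (cmislib), where the endpoint URL and credentials are placeholders:

```python
# Sketch of programmatic repository access through CMIS Query Language.
# The helper builds the query string only; the commented lines show how it
# might be issued via cmislib, with placeholder URL and credentials.

def documents_modified_since(date_iso):
    """Build a CMIS-QL query for documents modified on or after a date."""
    return ("SELECT cmis:name, cmis:lastModifiedBy "
            "FROM cmis:document "
            "WHERE cmis:lastModificationDate >= TIMESTAMP '%sT00:00:00.000Z'"
            % date_iso)

query = documents_modified_since("2014-03-01")
print(query)

# from cmislib import CmisClient
# client = CmisClient('http://localhost:8080/alfresco/cmisatom', 'admin', 'admin')
# results = client.defaultRepository.query(query)
```

Because CMIS-QL is standardized, the same query would work against any CMIS-compliant repository, not just Alfresco.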


3.3.5 Parallel Operations

This is a document management system where around 300 employees will work

concurrently. They can upload a document, review it, modify it, start a workflow, and even delete it. Parallel operations include allowing more than a single employee to read a document. Workflows can be started on any document, and any document can be in any number of workflows. Parallel editing of a document will be restricted by providing

check-in and check-out functionality.
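The check-in/check-out restriction can be modelled as a simple lock registry, sketched below. This is only a behavioural illustration; Alfresco's real locking works differently under the hood:

```python
# Toy sketch of how check-out/check-in could restrict parallel editing.
# This models the behaviour only, not Alfresco's actual implementation.

class CheckoutRegistry:
    def __init__(self):
        self._locks = {}              # document id -> user holding the lock

    def check_out(self, doc_id, user):
        """Lock a document for editing; fail if someone else holds it."""
        holder = self._locks.get(doc_id)
        if holder is not None and holder != user:
            raise PermissionError("document checked out by %s" % holder)
        self._locks[doc_id] = user

    def check_in(self, doc_id, user):
        """Release the lock after saving changes."""
        if self._locks.get(doc_id) != user:
            raise PermissionError("not the lock holder")
        del self._locks[doc_id]

reg = CheckoutRegistry()
reg.check_out("doc-1", "farida")
try:
    reg.check_out("doc-1", "swati")      # second editor is rejected
except PermissionError as err:
    print("blocked:", err)
reg.check_in("doc-1", "farida")          # lock released; others may edit
```

Reading remains unrestricted in this model; only the edit lock is exclusive, which matches the requirement that many employees may read a document concurrently.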

3.3.6 Reliability Requirements

All quality hardware, software and frameworks with valid licenses are required for better

reliability.

3.3.7 Criticality of the Application

Criticality of the module was one of the main constraints. The system was being

developed for the users who were mainly employees of the government sector. They had

certain rigid aspects which had to be taken care of during development. Any change in

pattern of their workflow would lead to extremely critical conditions. Thus this was a

matter of concern and served as one of the deep rooted constraints.

3.3.8 Safety and Security Considerations

The system provides tight security for user accounts. They are secured by a password mechanism in which passwords are encrypted and stored in the database. Also, the repository is accessible

for modifications only to some privileged users.
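Password storage of this kind is normally done with salted one-way hashes rather than reversible encryption. A minimal sketch, assuming PBKDF2-HMAC-SHA256 (Alfresco's actual mechanism may differ):

```python
# Sketch of salted password hashing, the usual way "encrypted" credentials
# are stored. Illustrative only; Alfresco's real mechanism may differ.
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Return (salt, digest) using PBKDF2-HMAC-SHA256."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password, salt, digest):
    """Recompute the hash and compare in constant time."""
    return hmac.compare_digest(hash_password(password, salt)[1], digest)

salt, digest = hash_password("s3cret")
print(verify_password("s3cret", salt, digest))   # True
print(verify_password("wrong", salt, digest))    # False
```

Storing only the salt and digest means a database leak does not directly expose user passwords.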


ALFRESCO ECM SYSTEM

4.1 INTRODUCTION

Figure 4.1 Alfresco Icon

Alfresco is a free enterprise content management system for both Windows and Linux

operating systems, which manages all the content within an enterprise and provides

services to manage this content.

It comes in three flavors:

Community edition – It is a free software with some limitations. No clustering

feature is present. (We have used community edition of Alfresco for this project

since we just need to perform ETL logic on the database and not use the advanced

functionalities.)

Enterprise edition – It is commercially licensed and suitable for users who require a higher degree of functionality.

Cloud edition - It is a SaaS (Software as a Service) version of Alfresco.

We would be using Alfresco database as our base database from where we want to

extract information and create a warehouse. For further transformation purpose, we

would be using SQuirreL and K.E.T.T.L.E a.k.a. Pentaho Data Integration tool.

4.2 ALFRESCO OVERVIEW

There are various ways in which Alfresco can be used for storing files and folders, and it

can also be used by different systems. It is basically a repository, which is a

central location where data are stored and managed.


Few of the ways in which Alfresco can be used are:

Figure 4.2 Uses of Alfresco ECM

Alfresco ECM is a useful tool to store files and folders of different types. Few of the uses

of Alfresco are:-

Document Management

Records Management

Shared drive replacement

Enterprise portals and intranets

Web Content Management

Knowledge Management

Information Publishing

Case Management

4.3 ARCHITECTURE


Alfresco has a layered architecture with mainly three parts:-

1. Alfresco Client

2. Alfresco Content Application Server

3. Physical Storage

4.3.1 Client

Alfresco offers two primary web-based clients: Alfresco Share and Alfresco Explorer.

Alfresco Share can be deployed to its own tier separate from the Alfresco content

application server. It focuses on the collaboration aspects of content management and

streamlining the user experience. Alfresco Share is implemented using Spring Surf and

can be customized without JSF knowledge.

Alfresco Explorer is deployed as part of the Alfresco content application server. It is a

highly customizable power-user client that exposes all features of the Alfresco content

application server and is implemented using Java Server Faces (JSF).

Clients also exist for portals, mobile platforms, Microsoft Office, and the desktop. A

client often overlooked is the folder drive of the operating system, where users share

documents through a network drive. Alfresco can look and act just like a folder drive.

4.3.2 Server

The Alfresco content application server comprises a content repository and value-added

services for building ECM solutions. Two standards define the content repository: CMIS

(Content Management Interoperability Services) and JCR (Java Content Repository).

These standards provide a specification for content definition and storage, content

retrieval, versioning, and permissions. Complying with these standards provides a

reliable, scalable, and efficient implementation.

The Alfresco content application server provides the following categories of services

built upon the content repository:


1. Content services (transformation, tagging, metadata extraction)

2. Control services (workflow, records management, change sets)

3. Collaboration services (social graph, activities, wiki)

Clients communicate with the Alfresco content application server and its services through

numerous supported protocols. HTTP and SOAP offer programmatic access while CIFS,

FTP, WebDAV, IMAP, and Microsoft SharePoint protocols offer application access. The

Alfresco installer provides an out-of-the-box prepackaged deployment where the

Alfresco content application server and Alfresco Share are deployed as distinct web

applications inside Apache Tomcat.

Figure 4.3 Alfresco Architecture

At the core of the Alfresco system is a repository supported by a server that persists

content, metadata, associations, and full text indexes. Programming interfaces support

multiple languages and protocols upon which developers can create custom applications

and solutions. Out-of-the-box applications provide standard solutions such as document

management and web content management.


4.4 DATA STORAGE IN ALFRESCO

There are a total of 97 tables in the database, mainly divided into two parts: Alfresco databases

and Activity workflows. The Alfresco database is further divided into three parts: nodes,

access and properties.

1. Node is the parent class of the database which has all identity numbers stored in

it.

2. Access tables deal with the security issues of Alfresco, like the permissions and

last modification dates.

3. Properties tables store information about which kind of data is stored: its size,

type, ranges, etc.

4.5 RELATIONSHIP DIAGRAMS

After studying the tables, we created the relationship diagram of the tables using

SQuirreL.

Since the Relational Diagram for the Alfresco System comprises 97 tables, we selected

the vital ones, like:-

alf_node – holds the identity of other tables.

alf_qname – It defines a valid identifier for each and every attribute.

alf_node_properties – It connects both node and qname tables and stores all

properties of each node id.

alf_access_control_list – It is used to specify who can do what with an object in

the repository, i.e. it gives the permission information.
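How alf_node_properties connects alf_node and alf_qname can be sketched with simplified in-memory rows. The column names and sample values below are illustrative assumptions, not the full Alfresco schema:

```python
# Simplified sketch: alf_node_properties links each node id to an
# attribute name (via alf_qname) and a value.
alf_node = [{"id": 1}, {"id": 2}]
alf_qname = [{"id": 10, "local_name": "name"},
             {"id": 11, "local_name": "created"}]
alf_node_properties = [
    {"node_id": 1, "qname_id": 10, "string_value": "report.pdf"},
    {"node_id": 1, "qname_id": 11, "string_value": "2015-03-01"},
    {"node_id": 2, "qname_id": 10, "string_value": "budget.xls"},
]

def properties_of(node_id):
    # Join properties to their attribute names via the qname id.
    qnames = {q["id"]: q["local_name"] for q in alf_qname}
    return {qnames[p["qname_id"]]: p["string_value"]
            for p in alf_node_properties if p["node_id"] == node_id}
```

In the real database this join is expressed in SQL over the three tables; the sketch only shows the shape of the relationship.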


Figure 4.4 Relation Diagrams for users, documents and folders

Figure 4.5 Relational Diagrams for permissions


Figure 4.6 Relational Diagram for audits


TRANSFORMATION PHASE

5.1 INTRODUCTION

There are 97 tables in the Alfresco ECM System. To create a staging data warehouse, we

first have to perform E.T.L. logic i.e. Extract, Transform and Load.

In computing, ETL refers to a process in database usage and especially in data

warehousing where it:

Extracts data from homogeneous or heterogeneous data sources

Transforms the data into the proper format or structure for querying

and analysis

Loads it into the final target (database, more specifically, operational data

store, data mart, or data warehouse)

Usually all three phases execute in parallel. Since data extraction takes time,

while one batch of data is being pulled, the transformation process works on the data

already received and prepares it for loading; and as soon as some data is ready

to be loaded into the target, the loading kicks off without waiting for

the completion of the previous phases.

ETL systems commonly integrate data from multiple applications (systems), typically

developed and supported by different vendors or hosted on separate computer

hardware. The disparate systems containing the original data are frequently managed

and operated by different employees. In our project, though, there is only one source

from which the data is extracted, i.e. Alfresco.
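The pipelined extract-transform-load flow described above can be sketched with Python generators, where each phase consumes rows as soon as the previous phase produces them. The row fields and the whitespace-trimming transform are illustrative assumptions, not the project's actual logic:

```python
def extract(rows):
    # Extract: pull rows one at a time from the (single) source.
    for row in rows:
        yield row

def transform(rows):
    # Transform: clean each row as soon as it arrives, without
    # waiting for extraction to finish.
    for row in rows:
        yield {k: v.strip() if isinstance(v, str) else v
               for k, v in row.items()}

def load(rows, target):
    # Load: append rows to the target as they become ready.
    for row in rows:
        target.append(row)

source = [{"user": " admin ", "action": "login"}]
warehouse = []
load(transform(extract(source)), warehouse)
```

Because generators are lazy, no phase waits for the whole previous phase to finish, which mirrors the overlapping execution of real ETL engines.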

5.2 PENTAHO DATA INTEGRATION TOOL

5.2.1 Introduction

Pentaho Data Integration (or Kettle) delivers powerful extraction, transformation,


and loading (ETL) capabilities, using a metadata-driven approach. It prepares and

blends data to create a complete picture of the business that drives actionable

insights. The complete data integration platform delivers accurate, “analytics ready”

data to end users from any source.

Figure 5.1 Pentaho Data Integration Icon

In particular, Pentaho Data Integration is used to: extract Alfresco audit data into the Data

Mart and create the defined reports, uploading them back to Alfresco.

5.2.2 Why Pentaho?

Figure 5.2 Pentaho Icon

5.2.2.1 Pentaho vs Jaspersoft vs BIRT

Pentaho and Jaspersoft both provide the unique advantage of being cost effective, but the

differences in terms of features vary. Although Jaspersoft's report designer is

comparatively better than Pentaho Report Designer, the dashboard capabilities of Pentaho

in terms of functionality are better. This is because dashboard functionality is present

only in the Enterprise edition of Jaspersoft whereas in Pentaho, it is accessible in the

Community edition too.


When it comes to Extract, Transform and Load (ETL) tools, the Pentaho Data Integrator is

comparatively better, since Jaspersoft falls short on a few functions. When it comes to

OLAP analysis, Pentaho's Mondrian engine has a stronger case compared to Jaspersoft.

Pentaho users also have a huge set of choices in the plugin marketplace, which is similar

to the app stores of iOS and Android. To sum it up, Jaspersoft's focus is more on reporting

and analysis, while Pentaho's focus is on data integration, ETL and workflow automation.

BIRT has also emerged as an important tool for business intelligence for those who are

well versed in Java. BIRT is an Eclipse-based open source reporting system for web

applications, especially those based on Java and Java EE. It consists of a report

designer based on Eclipse and a runtime component that can be added to the app server. In

terms of basic functionality, BIRT is on par with Pentaho and Jaspersoft, with perhaps a slight

advantage as it is based on Eclipse. Apart from that, as a typical BI tool it is expected to

cover common chart types. Although BIRT covers most of the charts, it falls short of

chart types like Ring, Waterfall, Step Area, Step, Difference, Thermometer and Survey

Scale, where Pentaho fills the gaps.

5.2.2.2 Conclusion

Unlike the previous two tools, Pentaho is a complete BI suite covering various operations

from reporting to data mining. The key component of Pentaho is Pentaho Reporting,

which has a rich feature set and is enterprise friendly. Its BI Server, which is a J2EE

application, also provides an infrastructure to run and view reports through a web-based

user interface. All three of these open source business intelligence and

reporting tools provide a rich feature set ready for enterprise use. It will be up to the end

user to do a thorough comparison and select one of these tools. Major differences can

be found in report presentations, with a focus on web or print, or in the availability of a

report server. Pentaho distinguishes itself by being more than just a reporting tool, with a

full suite of components (data mining and integration).


Among organizations adopting Pentaho, one of the advantages felt is its low integration

time and infrastructural cost compared to SAP BIA and SAS BIA, which are among the big

players in Business Intelligence. Along with that, the huge 24/7 community support

with active forums allows Pentaho users to discuss challenges and have

their questions answered while using the tool. Its unlimited visualizations and data sources

can handle any kind of data, coupled with a good tool set that has wide applicability

beyond just the base product.

5.2.3 COMPONENTS OF PENTAHO

Kettle is a set of tools and applications which allows data manipulation across multiple

sources. The main components of Pentaho Data Integration are:

Spoon – It is a graphical tool that makes ETL transformations easy to

design. It performs the typical data flow functions like reading, validating,

refining, transforming, writing data to a variety of different data sources and destinations.

Transformations designed in Spoon can be run with Kettle Pan and Kitchen.

Pan – Pan is an application dedicated to run data transformations designed in Spoon.

Chef – It is a tool to create jobs which automate the database update process in a

complex way.

Kitchen – It is an application which helps execute the jobs in a batch mode, usually

using a schedule which makes it easy to start and control the ETL processing.

Carte – It is a web server which allows remote monitoring of the running Pentaho Data

Integration ETL processes through a web browser.

5.3 ALFRESCO AUDIT ANALYSIS AND REPORTING TOOL

5.3.1 Introduction


Alfresco is one of the most widely used open source content management systems.

Though reporting is not part of its core, it is crucial to get metrics out of the Alfresco

system.

Figure 5.3 A.A.A.R. Icon

To that end, a full-fledged audit layer was built on top of Alfresco using Pentaho. The

principle is to build a data mart properly optimized for the information being

extracted from the system, and to do all the discovery and analytics

on top of that. For that, one needs an ETL tool; once the data mart is built, Pentaho is

needed to do reporting and exploration on top of that data warehouse. This in-between

tool is called A.A.A.R. - Alfresco Audit Analysis and Reporting.

5.3.2 Working and Installation of A.A.A.R.

The Alfresco Content Management System can be seen as a primary source that generates

only raw data. On the other hand, Pentaho is a pure BI environment and consists of

suitable integration and reporting tools.

Thus, A.A.A.R. extracts audit data from the Alfresco E.C.M., stores the data in the Data

Mart, creates reports in well-known formats and publishes them again in the Alfresco

E.C.M.


Figure 5.4 Working of A.A.A.R.

Alfresco E.C.M. is, at the same time, source and target of the flow. As source of the flow,

Alfresco E.C.M. is enabled with the audit service to track all the activities with detailed

information about who has done what on the system, and when. Logins (failed or

successful), creation of content, creation of folders, and adding or removing of properties or

aspects are only some examples of what is tracked by the audit service.

5.3.2.1 Prerequisites

1. Alfresco E.C.M.

2. PostGreSQL/MySQL

3. Pentaho Data Integration Tool

4. Pentaho Report Designer Tool

5.3.2.2 Enabling Alfresco Audit Service

The very first task is to activate the audit service in Alfresco by performing the following actions.

1. Stop Alfresco.

2. In '<Alfresco>/tomcat/shared/classes/alfresco-global.properties' append:

# Alfresco Audit service

audit.enabled=true

audit.alfresco-access.enabled=true

# Alfresco FTP service

## ATTENTION: Don’t do it if just enabled!

ftp.enabled=true

ftp.port=8082


3. Start Alfresco.

4. Log in to Alfresco to generate the very first audit data.

5.3.2.3 Data Mart Creation and Configuration

1. Open a terminal

2. For the PostgreSQL platform use:

cd <PostgreSQL bin>

psql -U postgres -f "<AAAR folder>/AAAR_DataMart.sql"

(use 'psql.exe' on Windows platforms and './psql' on Linux based platforms)

3. Exit

4. Extract 'reports.zip' into the 'data-integration' folder. 'reports.zip' contains 5 files with

the 'prpt' extension, each one containing one Pentaho Report Designer report. By

default, and to keep report production simpler, they are saved in the default folder: 'data-

integration'.

5. Update ‘dm_dim_alfresco’ table with the proper environment settings. Each row of

the table represents one Alfresco installation, and for that reason the table is defined

with a unique row by default, as described below.

desc with value ‘Alfresco’.

login with value ‘admin’.

password with value ‘admin’.

url with value 'http://localhost:8080/alfresco/service/api/audit/query/alfresco-access?

verbose=true&limit=100000'.

is_active with value ‘Y’.

6. Update ‘dm_reports’ table with your target settings.
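The audit query URL used in the 'url' setting above can also be assembled programmatically. This is a minimal sketch; the default host and the parameter handling are assumptions matching the value shown in step 5:

```python
from urllib.parse import urlencode

def audit_query_url(host="http://localhost:8080", limit=100000, verbose=True):
    # Build the audit query URL used in the dm_dim_alfresco configuration.
    base = host + "/alfresco/service/api/audit/query/alfresco-access"
    params = urlencode({"verbose": str(verbose).lower(), "limit": limit})
    return base + "?" + params
```

Building the URL from parts avoids typos such as a stray space inside 'alfresco-access' when the value is edited by hand.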

5.3.2.4 PDI Repository Settings

The third task is to set the Pentaho Data Integration Jobs properly.

1. Open a terminal

2. For the PostgreSQL platform use:

cd <PostgreSQL bin>


psql -U postgres -f "<AAAR folder>/AAAR_Kettle.sql"

(use ‘psql.exe’ on Windows platform and ‘./psql’ on Linux based platforms)

3. Exit

4. To set the Pentaho Data Integration repository:

i. Open a new terminal and cd to <data-integration>.

ii. Launch 'Spoon.bat' if you are on a Windows platform or './Spoon.sh' if you are

on Linux based platforms.

iii. Click on the green plus to add a new repository and define a new repository

connection in the database.

Figure 5.5 Step 1

iv. Add a new database connection to the repository.


Figure 5.6 Step 2

v. If you chose the PostgreSQL platform, set the parameters as shown in the

image below. At the end, push the Test button to check the database connection.

Figure 5.7 Step 3


vi. Set the ID and Name fields and press the 'OK' button. Take care not to push the

'create or upgrade' button, otherwise the E.T.L. repository will be damaged.

Figure 5.8 Step 4

vii. Connect with the login ‘admin’ and password ‘admin’ to test the connection.

Figure 5.9 Step 5

viii. If everything succeeds, you see the Pentaho Data Integration (Kettle) panel.

From the Pentaho Data Integration panel, click on Tools -> Repository ->

Explore.


Figure 5.10 Step 6

ix. Click on the 'Connections' tab and edit (the pencil at the top right) the

AAAR_DataMart connection. The image below shows the PostgreSQL case, but with

MySQL it is exactly the same.

Figure 5.11 Step 7


x. Modify the parameters and click on the Test button to check. If everything

succeeds, you can close all the dialogs. The image below shows the PostgreSQL case, but with

MySQL it is exactly the same.

Figure 5.12 Step 8

5.3.2.5 First Import

Now you are ready to get the audit data into the Data Mart, create the reports, and publish

them to Alfresco.

Open a terminal, cd to <data-integration>, and run:

kitchen.bat /rep:"AAAR_Kettle" /job:"Get all" /dir:/Alfresco /user:admin

/pass:admin /level:Basic

kitchen.bat /rep:"AAAR_Kettle" /job:"Report all" /dir:/Alfresco

/user:admin /pass:admin /level:Basic

Finally, you can access Alfresco and look in the repository root, where the reports are

uploaded by default.

5.3.3 Audit Data Mart

On the other side of the represented flow, there is a database storing the


extracted audit data, organized in a specific Audit Data Mart. A Data Mart is a structure

that is usually oriented to a specific business line or team and, in this case, represents the

audited actions in the Alfresco E.C.M.

Figure 5.13 Audit Data Mart

5.3.4 Dimension Tables

The implemented Data Mart develops a single Star Schema having only one measure (the

number of audited actions) and the dimensions listed

below:

1. Alfresco instances to manage multiple sources of auditing data.

2. Alfresco users with a complete name.

3. Alfresco contents complete with the repository path.

4. Alfresco actions (login, failedLogin, read, addAspect, etc.).

5. Date of the action. Groupable in day, month and year.


6. Time of the action. Groupable in minute and hour.
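A roll-up over this single-measure star schema can be sketched as follows. The fact rows and field names below are simplified assumptions rather than the actual Data Mart layout:

```python
from collections import Counter

# One fact row per audited action, keyed by its dimensions
# (user, action, date); field names are simplified assumptions.
facts = [
    {"user": "admin", "action": "login", "date": "2015-03-01"},
    {"user": "admin", "action": "read",  "date": "2015-03-01"},
    {"user": "swati", "action": "login", "date": "2015-03-02"},
    {"user": "admin", "action": "login", "date": "2015-03-02"},
]

def roll_up(facts, *dims):
    # Aggregate the single measure (the count of audited actions)
    # over any chosen combination of dimensions.
    return Counter(tuple(f[d] for d in dims) for f in facts)

by_user_action = roll_up(facts, "user", "action")
```

In the real Data Mart the same aggregation is a GROUP BY over the fact table joined to its dimension tables; the sketch only shows why a single measure plus free choice of dimensions is enough for the reports.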

Figure 5.14 Dimension Tables

5.4 TRANSFORMATIONS USING SPOON

Spoon is the DI design tool component. The DI Server is a core component that

executes data integration jobs and transformations using the Pentaho Data Integration

Engine. It also provides services allowing you to schedule and monitor scheduled

activities.

Drag elements onto the Spoon canvas, or choose from a rich library of more than 200

pre-built steps to create a series of data integration processing instructions.

5.5 EXAMPLE TRANSFORMATION

A few of the transformations we have done using Spoon are listed below:-

1. Document Information

2. Document Permission

3. Folder Information


4. Folder Permission

5. User Information

Figure 5.15 Document Information Transformation

Figure 5.16 Document Permission Transformation


Figure 5.17 Folder Information Transformation

Figure 5.18 Folder Permission Transformation


Figure 5.19 User Information Transformation


REPORTING PHASE

6.1 WHAT IS A REPORT?

In its most basic form, a report is a document that contains information for the reader.

When speaking of computer generated reports, these documents refine data from various

sources into a human readable form. Report documents make it easy to distribute specific

fact-based information throughout the company. Reports are also used by the

management departments in decision making.

6.2 PENTAHO REPORT DESIGNER TOOL

6.2.1 Introduction

Pentaho Reporting is a suite of tools for creating pixel perfect reports. With Pentaho

Reporting, we are able to transform data into meaningful information. You can create

HTML, Excel, PDF, Text or printed reports. If you are a developer, you can also produce

CSV and XML reports to feed other systems.

Figure 6.1 Pentaho Reporting Tool Icon

It helps in transforming all the data into meaningful information tailored according to

your audience with a suite of Open Source tools that allows you to create pixel-perfect

reports of your data in PDF, Excel, HTML, Text, Rich-Text-File, XML and CSV. These

computer generated reports easily refine data from various sources into a human readable

form.


6.2.2 Working of Pentaho Report Designer Tool

Once the transformations are completed using K.E.T.T.L.E., we can import these

transformations from the Data Mart in the Pentaho Report Designer tool with the help of

SQL. Pentaho Report Designer tool has a large selection of elements (Text fields, Labels

etc.) and various GUI representation techniques like pie-charts, tables, graphs, etc., with

which we can create our reports.

6.3 EXAMPLE REPORTS

According to the transformations done using Spoon, we created reports for the following

requirements using Pentaho Report Designer:-

1. Document Information

2. Document Permission

3. Folder Information

4. Folder Permission

5. User Information

Figure 6.2 Document Information Report


Figure 6.3 Document Permission Report

Figure 6.4 Folder Information Report


Figure 6.5 Folder Permission Report

Figure 6.6 User Information Report


PUBLISHING PHASE

7.1 INTRODUCTION

After the reports are made using the designing tool, we need to publish them on the

server. The Pentaho BI Server or BA Platform allows you to access business data in the form

of dashboards, reports or OLAP cubes via a convenient web interface. Additionally, it

provides an interface to administer your BI setup and schedule processes. Also, different

output types are available, like PDF, HTML, CSV, etc.

7.2 PENTAHO BI SERVER

7.2.1 Introduction

It is commonly referred to as the BI Platform, and was recently renamed the Business Analytics

Platform (BA Platform). It makes up the core software piece that hosts content created

either in the server itself through plug-ins or in files published to the server from the desktop

applications. It includes features for managing security, running reports, displaying

dashboards, report bursting, scripted business rules, OLAP analysis and scheduling out of

the box.

Figure 7.1 Pentaho BI Server Icon

The commercial plug-ins from Pentaho expand the out-of-the-box features. A few open-

source plug-in projects also expand the capabilities of the server. The Pentaho BA Platform

runs in the Apache Tomcat Java application server. It can be embedded into other Java


Application Servers.

7.2.2 Example Published Reports

According to the reports we have created, the following reports can be deployed on the

Web:-

1. Document Information

2. Document Permission

3. Folder Information

4. Folder Permission

5. User Information

Figure 7.2 Document Information Published Report


Figure 7.3 Document Permission Published Report

Figure 7.4 Folder Information Published Report


Figure 7.5 Folder Permission Published Report

Figure 7.6 User Information Published Report


7.3 SCHEDULING OF TRANSFORMATIONS

Once the project has been completed, for real-time usage the data warehouse needs to be

updated at regular intervals. For that purpose, we have to create a schedule for

our project so that it gets updated every day, reflecting changes done in the last 24 hours.

There are three ways to perform scheduling:

1. Using the schedule option from the action menu in Spoon.

2. Using the start element in job files, i.e. .kjb (Kettle job) files.

3. Using the task scheduler.

Usually, the first method is preferred in industry, but as we are working on the community

edition, the scheduling option is not provided. Also, the second method works only for jobs

and does not update transformations, so it was not suitable.

So, we scheduled the project using the task scheduler. We have scheduled all the

transformations in such a way that they will run daily at 11:00 am.

The project has been deployed on the Web and submitted to our external guide. It will be

used further by IPR on a web server for real-time usage.
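The daily scheduled task can simply invoke Kitchen with the same arguments used for the first import. Below is a hedged sketch that assembles such a command line; the repository name, directory and credentials are placeholders matching the defaults used earlier, not production values:

```python
def kitchen_command(job, repo="AAAR_Kettle", directory="/Alfresco",
                    user="admin", password="admin"):
    # Assemble the Kitchen command line that the task scheduler
    # invokes daily; credentials here are placeholders.
    return ('kitchen.bat /rep:"{0}" /job:"{1}" /dir:{2} '
            '/user:{3} /pass:{4} /level:Basic').format(
                repo, job, directory, user, password)
```

Generating the command from one helper keeps the scheduled entries for "Get all" and "Report all" consistent if the repository or credentials change.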

Figure 7.7 Scheduling of Transformations


TESTING

8.1 TESTING STRATEGY

Data completeness: Ensures that all expected data is loaded into the target table.

1. Compare record counts between source and target and check for any rejected

records.

2. Check that data is not truncated in the columns of the target table.

3. Check that only unique values are loaded into the target. No duplicate records should

exist.

4. Perform boundary value analysis.
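The record-count and duplicate checks above can be sketched as a small helper. The key column and the report fields are illustrative assumptions, not part of the actual test harness:

```python
def completeness_report(source_rows, target_rows, key):
    # Compare record counts, find rejected records (in source but not
    # target), and detect duplicate keys in the target.
    source_keys = [r[key] for r in source_rows]
    target_keys = [r[key] for r in target_rows]
    return {
        "count_match": len(source_keys) == len(target_keys),
        "rejected": sorted(set(source_keys) - set(target_keys)),
        "duplicates": sorted({k for k in target_keys
                              if target_keys.count(k) > 1}),
    }
```

Running this after each load gives a quick pass/fail signal for the data completeness criteria before deeper checks are attempted.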

Data quality: Ensures that the ETL application correctly rejects, substitutes default

values, corrects or ignores and reports invalid data.

Data cleanness: Unnecessary columns should be deleted before loading into the staging

area.

1. Example: If a name column contains extra spaces, we have to trim the

spaces before loading into the staging area; with the help of an expression

transformation, the spaces will be trimmed.

2. Example: Suppose the telephone number and STD code are in different columns and

the requirement says they should be in one column; then, with the help of an expression

transformation, we will concatenate the values into one column.
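The two expression-transformation examples above (trimming stray spaces and concatenating the STD code with the telephone number) can be sketched together. The column names and the hyphen separator are illustrative assumptions:

```python
def clean_row(row):
    # Trim stray spaces before loading into the staging area, and
    # concatenate the STD code and telephone number into one column.
    cleaned = {k: v.strip() if isinstance(v, str) else v
               for k, v in row.items()}
    cleaned["phone"] = cleaned.pop("std_code") + "-" + cleaned.pop("tel_no")
    return cleaned
```

In Kettle the same effect is achieved with expression/concat steps on the transformation canvas; the sketch just states the expected before/after shape of a row.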

Data transformation: All the business logic implemented using ETL transformations

should be reflected in the target data.

Integration testing: Ensures that the ETL process functions well with other upstream and

downstream processes.


User-acceptance testing: Ensures the solution meets users’ current expectations and

anticipates their future expectations.

Regression testing: Ensures existing functionality remains intact each time a new release

of code is completed.

8.2 TESTING METHODS

• Functional test: it verifies that the item is compliant with its specified business

requirements.

• Usability test: it evaluates the item by letting users interact with it, in order to verify that

the item is easy to use and comprehensible.

• Performance test: it checks that the item performance is satisfactory under typical

workload conditions.

• Stress test: it shows how well the item performs with peak loads of data and very heavy

workloads.

• Recovery test: it checks how well an item is able to recover from crashes, hardware

failures and other similar problems.

• Security test: it checks that the item protects data and maintains functionality as

intended.

• Regression test: It checks that the item still functions correctly after a change has

occurred.


8.3 TEST CASES

8.3.1 USER LOGIN AND USING THE FUNCTIONALITY OF REPORT

Description: This test validates the user name and password, and checks that the user is able to select

the desired format of the reports with the desired selection options.

Table 8.1 Test Case 1

Sr. No | Test Case | Expected Output | Actual Output | Test Case Status

1 | User logs in to his/her page | BA server should open | BA server page opens | Pass

2 | User views a report | Report should be displayed | User is able to view the report | Pass

3 | User selects the output format while viewing | User must see the output in the desired format | Desired format of the report is displayed | Pass

4 | User filters the report view | User should see the filtered report | User is able to view the desired report | Pass


8.3.2 VIEWING DOCUMENTS, FOLDERS, PERMISSIONS, AUDITS

Description: This test case will check whether the user is able to view the data of folders and

documents, their permissions, and the audit data.

Table 8.2 Test Case 2

Sr. No | Test Case | Expected Output | Actual Output | Test Case Status

1 | User views the documents | Document details should be displayed | Document is seen | Pass

2 | User views the folders | Folder should be displayed | Folder is seen | Pass

3 | User views the permissions of folders and documents | Permissions must be seen by the user | Permissions displayed | Pass

4 | User views the auditing data | Audit data must be displayed | Audit data is seen by the user | Pass


USER MANUAL

9.1 DESCRIPTION

This manual describes the working and use of the project, so as to help end users and

get them familiar with its features.

Our project is divided into three levels. These are:-

1. Source Level

2. DWH Level

3. View Level

The source level is the back-end of our project, i.e. the Alfresco database. The DWH level is

PostgreSQL, used in creating our Data Mart. And the view level is the Pentaho tools.

The users will be able to see the view level of the project, specifically the Pentaho

Business Analytics tool where the published reports are deployed. Once in the BA

dashboard, the user can use many of its functionalities. These are listed

below:-

1. Login Page

2. View Reports

3. Scheduling

4. Administration

9.2 LOGIN PAGE

Before using the BA server, a user has to log in to the server using his assigned user

name and password, so that the system knows which user has accessed the server and at

what time. This helps for security purposes.

To login, we have to follow the steps below:-


1. We have to go to the BI server folder using the command prompt. After we have

changed the directory to the BI server, we need to start Pentaho.

Figure 9.1 Login Step 1

2. Once started, the system automatically runs Apache Tomcat.

Figure 9.2 Login Step 2


3. If Tomcat starts without errors, it opens the user console of the Pentaho BA Server. The user can now log in to the server with their own user name and password.

Figure 9.3 Login Step 3
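The start-up steps above can be sketched as a short command sequence. The install directory shown here is an assumption (a default Pentaho CE 5.0 biserver-ce folder); adjust it to your own setup:

```shell
# Step 1: change to the BI Server folder (install path is an assumption).
cd /opt/pentaho/biserver-ce

# Step 1 (cont.): start Pentaho; this also launches Apache Tomcat (step 2).
./start-pentaho.sh
# On Windows, run start-pentaho.bat instead.

# Step 3: once Tomcat is up, open the user console in a browser, typically
# at http://localhost:8080/pentaho, and log in.
```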

9.3 VIEW REPORTS

The main requirement of the user is to view the reports in the web browser, among other uses, so as to support decision making. To do that, the user has to follow these steps:

1. Once we log in, the Home screen opens up as shown in the given figure. To view reports, select 'Browse Files' (1) from the drop-down list.


Figure 9.4 View Reports Step 1

2. Once we select the 'Browse Files' option, the console opens the 'Folders' (2) pane in Home and the 'Files' (3) associated with the folder we select in the file box. The console also provides 'Folder Actions' (4), which support various functions such as creating a new folder, deleting a folder, etc.

3. To view a report, select it from the file box. For example, to see the Documents Permissions report, click the docpermission-rep (5) file in the file box. It opens the Documents Permissions report (6) in the browser.


Figure 9.5 View Reports Step 2

4. We can apply filters to the report. For example, in this report we can filter and list the documents according to their permissions by selecting the appropriate permission (7). Here, we have selected the 'Read' permission from the 'select permissions' filter.

We can also view reports in different styles by selecting the appropriate style from 'Output Type' (8). Here, we have selected the HTML (Single Page) type.
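What the 'select permissions' filter does, in essence, is restrict the report rows to the chosen permission. A minimal sketch of that logic, with made-up row data:

```python
# Illustrative report rows: (document name, permission). Data is made up.
rows = [
    ("report-q1.pdf", "Read"),
    ("budget.xls", "Write"),
    ("minutes.doc", "Read"),
]

def filter_by_permission(rows, permission):
    """Keep only the documents carrying the selected permission."""
    return [doc for doc, perm in rows if perm == permission]

print(filter_by_permission(rows, "Read"))  # → ['report-q1.pdf', 'minutes.doc']
```

In the actual report, Pentaho pushes this filter down into the SQL query that feeds the report, rather than filtering rows in memory.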

9.4 SCHEDULING

You can schedule reports to run automatically. All of your active scheduled reports appear in the list of schedules, which you can reach by clicking the Home drop-down menu and then the Schedules link, in the upper-left corner of the User Console page. You can also access the list of schedules from the Browse Files page if you have a report selected.


The list of schedules shows which reports are scheduled to run, the recurrence pattern for

the schedule, when it was last run, when it is set to run again, and the current state of the

schedule.

Figure 9.6 Scheduling Page

Table 9.1 Scheduling options

Item Name | Function
Schedules indicator | Indicates the current User Console perspective you are using. Schedules displays a list of schedules you have created, a toolbar for working with your schedules, and a list of times when your schedules are blocked from running.
Schedule Name | Lists your schedules by the name you assign to them. Click the arrow next to Schedule Name to sort schedules alphabetically in ascending or descending order.
Repeats | Describes how often the schedule is set to run.
Source File | Displays the name of the file associated with the schedule.
Output Location | Shows the location where the scheduled report is saved.
Last Run | Shows the last time and date when the schedule was run.
Next Run | Shows the next time and date when the schedule will run again.
Status | Indicates the current status of the schedule: either Normal or Paused.
Blockout Times | Lists the times during which all schedules are blocked from running.

You can edit and maintain each of your schedules using the controls above the schedules list, at the right end of the toolbar.

Table 9.2 Scheduling Controls

Icon Name | Function
Refresh | Refreshes the list of schedules.
Run Now | Runs the selected schedule(s) immediately.
Stop Scheduled Task | Pauses a specified schedule. Use Start Scheduled Task to resume paused jobs.
Start Scheduled Task | Resumes a previously stopped schedule.
Edit Scheduled Task | Edits the details of an existing schedule.
Remove Scheduled Task | Deletes a specified schedule. If the schedule is currently running, it continues to run, but it will not run again.
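The schedule states described above (Normal vs. Paused, with Stop, Start, and Run Now controls) can be summarised in a toy model. This is only a sketch of the console behaviour, not Pentaho API code, and the names here are our own:

```python
# Toy model of a schedule's lifecycle as described in Tables 9.1-9.2.
from datetime import datetime, timedelta

class Schedule:
    def __init__(self, name, every):
        self.name = name
        self.every = every          # recurrence interval ("Repeats")
        self.status = "Normal"      # "Normal" or "Paused" ("Status")
        self.last_run = None        # "Last Run"

    def stop(self):                 # Stop Scheduled Task: pause
        self.status = "Paused"

    def start(self):                # Start Scheduled Task: resume
        self.status = "Normal"

    def run_now(self, now):         # Run Now: run on demand
        if self.status == "Normal":
            self.last_run = now

    def next_run(self):             # "Next Run": only while Normal
        if self.status == "Paused" or self.last_run is None:
            return None
        return self.last_run + self.every

s = Schedule("docpermission-rep", every=timedelta(days=1))
s.run_now(datetime(2015, 3, 1, 9, 0))
print(s.next_run())            # → 2015-03-02 09:00:00
s.stop()
print(s.status, s.next_run())  # → Paused None
```

The real server also checks the blockout times listed in Table 9.1 before firing a run; that check is omitted here for brevity.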

9.5 ADMINISTRATION

The User Console has one unified place, called the Administration page, where users logged in with a role that has security-administration permissions can perform system configuration and maintenance tasks. If Administration appears in the left drop-down menu on the User Console Home page, you can click it to reveal menu items for administering the BA Server. If you do not have administration privileges, Administration does not appear on the Home page.

Figure 9.7 Administration Page


Table 9.3 Administration Options

Item | Control Name | Function
1 | Administration | Opens the Administration perspective of the User Console, which enables you to set up users, configure the mail server, change authentication settings on the BA Server, and install Pentaho software licenses.
2 | Users & Roles | Manage the Pentaho users and roles for the BA Server.
3 | Authentication | Set the security provider for the BA Server to either the default Pentaho Security or LDAP/Active Directory.
4 | Mail Server | Set up the outgoing email server and the account used to send reports through email.
5 | Licenses | Manage Pentaho software licenses.
6 | Settings | Manage settings for deleting older generated files, either manually or on a schedule.


LIMITATIONS AND FUTURE ENHANCEMENTS

10.1 LIMITATIONS

- All the data is stored in a single repository in Alfresco. If data backups are not managed properly, there is a risk of data loss.

- Since the Community Edition of the Pentaho Data Integration tool has a limited set of functionalities, scheduling had to be done manually.

10.2 FUTURE ENHANCEMENTS

- We compressed the 97 Alfresco tables down to 29 tables in the data warehouse. This could be reduced further in the future so as to increase efficiency.

- Sophisticated requirements such as hyperlink functions and ticket generation for employees can be implemented.


CONCLUSION AND DISCUSSION

11.1 SELF ANALYSIS OF PROJECT VIABILITIES

11.1.1 Self Analysis

We have created an information repository, i.e. a data warehouse, from the already existing Alfresco database system. We have successfully installed the application, tested its performance on several fronts, and completed validation testing. The project has been accomplished in such a way that it incorporates several features demanded by present report-generation and decision-making requirements.

11.1.2 Project Viabilities

This project has been completed successfully and is viable for use at the Institute for Plasma Research as a tool for generating reports from the data stored in its Alfresco database. These reports are user-friendly, with strong GUI support through a host of graphical options such as pie charts, line graphs, and bar charts. These reports make decision making easier for the management department.

11.2 PROBLEMS ENCOUNTERED AND POSSIBLE SOLUTIONS

- Alfresco was a new system that we had never used before. For three to four weeks it was difficult to understand all its functionalities and working, so it took time to understand these technologies fully.

- The Alfresco GUI was not accessible on either of our computers, because of which we had to install the PostgreSQL and SQuirreL database systems.

- It took time to finalize the ETL and reporting tools. We finally narrowed them down to Pentaho, over JasperSoft and BIRT.


- Pentaho is basically a collection of tools, and each stage of our project could be handled by a particular tool or system. Thus, we had to familiarize ourselves with a host of Pentaho tools.

- Alfresco Audit Analysis and Reporting (A.A.A.R.) had not converted many of our tables while transforming them into the data warehouse, so we had to convert those manually.

11.3 SUMMARY OF PROJECT WORK

PROJECT TITLE

DATA AND BUSINESS PROCESS INTELLIGENCE

It is a project based on the subject of data mining. A data warehouse is created, from which data is used to generate user-friendly reports.

PROJECT PLATFORM

PENTAHO

It is an open-source provider of reporting, analysis, dashboard, data mining and

workflow capabilities.

SOFTWARE USED

Windows/Linux based system

PostgreSQL Database

SQuirreL Database

Alfresco ECM

Pentaho Community Edition 5.0 (PDI, Reporting Tool, BI Server)

Alfresco Audit Analysis and Reporting (A.A.A.R.) tool

Notepad++

DOCUMENTATION TOOLS

VISIO 2013


WORD 2007

EXCEL 2007

INTERNAL PROJECT GUIDE

PROF. R.S. CHHAJED

EXTERNAL PROJECT GUIDE

MR. VIJAY PATEL

COMPANY

INSTITUTE FOR PLASMA RESEARCH

SUBMITTED BY

BHAGAT FARIDA H.

SINGH SWATI

SUBMITTED TO

DHARMSINH DESAI UNIVERSITY

PROJECT DURATION

8TH DEC 2014 TO 28TH MARCH 2015


REFERENCES

http://wiki.pentaho.com/display/Reporting/01.+Creating+Your+First+Report

http://infocenter.pentaho.com/help/index.jsp?topic=%2Freport_designer_user_guide%2Ftask_adding_hyperlinks.html

http://www.robertomarchetto.com/pentaho_report_parameter_example

http://docs.alfresco.com/4.2/concepts/alfresco-arch-about.html

http://fcorti.com/alfresco-audit-analysis-reporting/aaar-description-of-the-solution/aaar-pentaho-data-integration/

http://en.wikipedia.org/wiki/Pentaho

http://www.joyofdata.de/blog/getting-started-with-pentaho-bi-server-5-mondrian-and-saiku/

https://technet.microsoft.com/en-us/library/aa933151(v=sql.80).aspx

http://datawarehouse4u.info/OLTP-vs-OLAP.html
