
Data Quality of The Datawarehouse In Business Intelligence Process

2017

SEMESTER PROJECT IN COMPUTER SCIENCE AND INFORMATICS TEAJENI MISAGO & HONGZHI ZHANG


Abstract

In modern society we are witnessing increasing development in the Business Intelligence area, where more companies require data sorting and processing with the help of consultancy companies, because they do not have the resources to accomplish this themselves. Consultancy companies deliver professional products and informative data analysis to their customers, giving top management an easier time making quick and accurate business decisions on the basis of statistics calculated from these data analyses.

For the consultancy companies, their products, such as data reporting systems, have to be 100% validated and accurate. They have to overcome the data discrepancies that occur when transferring and extracting data from customers' source databases. These discrepancies can have many causes, such as network issues or logic errors buried in the data loading process; because this process is completed automatically by complicated procedures such as ETL, it is relatively difficult to prevent them from happening.

When the datasets are overwhelmingly large, most consultancy companies focus on correcting discrepancies by manually testing a reduced ("discounted") subset of the data. From the information we have, however, this cannot remove data discrepancies effectively, for reasons we will list in our study.

Our project focuses on developing a testing framework within the BI process. We reached out to a consultancy company called Accobat, and we will deliver the design of the framework in this study project, while the company provides information and feedback between meetings. We use various methods and approaches to form our project within a limited time window, research supporting theories from articles and books, and utilize knowledge from past semesters in Computer Science and Informatics. Based on our proposal, we end the project with a conclusion and plans for further development of the framework.


1 Table of Contents

2 Project Introduction .................................................................................................................................. 4

2.1 Background of the project company ................................................................................................. 4

2.2 Project Objective and goal ................................................................................................................. 4

2.3 Deliverables ....................................................................................................................................... 4

2.4 Human Resources .............................................................................................................................. 4

2.5 Introduction of Business Intelligence ................................................................................................ 4

2.6 Business intelligence process ............................................................................................................ 5

2.7 Problem to Address ........................................................................................................................... 6

3 Methods .................................................................................................................................................... 6

3.1 Overall Approach ............................................................................................................................... 6

3.2 Methodology ..................................................................................................................................... 7

3.3 Alternative Approach......................................................................................................................... 8

3.4 Working Procedures .......................................................................................................................... 9

3.5 Performing Interviews ....................................................................................................................... 9

3.6 Stakeholder ...................................................................................................................................... 10

4 Theory ...................................................................................................................................................... 10

4.1 Books ............................................................................................................................................... 10

5 Analysis .................................................................................................................................................... 11

5.1 Stakeholder Analysis ........................................................................................................................ 11

5.2 Requirement overview .................................................................................................................... 13

5.3 Challenges ........................................................................................................................................ 15

5.4 Future Testing Framework .............................................................................................................. 16

5.5 Introduction of Data Reconciliation ................................................................................................ 17

6 Design ...................................................................................................................................................... 17

6.1 Design Summary .............................................................................................................................. 17

6.2 Context view .................................................................................................................................... 18

6.3 Functional view ................................................................................................................................ 19

6.4 Information architecture ................................................................................................................. 22

6.4.1 Why data discrepancies arise in BI process ............................................................................. 23

6.5 Technical architecture ..................................................................................................................... 24

6.6 Performance and Scalability Architectural Perspective .................................................................. 25

7 Discussion ................................................................................................................................................ 26

8 Conclusion ............................................................................................................................................... 26

9 Perspective .............................................................................................................................................. 27

10 Reference ................................................................................................................................................. 29


11 Appendix .................................................................................................................................................. 30

11.1 Glossary ........................................................................................................................................... 30

11.2 Meeting notes with Accobat ........................................................................................................... 31

11.2.1 Meeting with Accobat - 21st September 2017 ....................................................................... 31

11.2.2 Meeting with Accobat - 4th October 2017 .............................................................................. 33

11.2.3 Meeting with Accobat – 18th October 2017 (Thomas – developer) ........................................ 33

11.2.4 Meeting with Accobat - 13th November 2017 ......................................................................... 34

11.2.5 Meeting with Accobat - 4th December 2017 (Wolfgang - Senior Consultant) ........................ 34


2 Project Introduction

2.1 Background of the project company

Accobat A/S is a Business Intelligence consultancy based on Microsoft technology, with offices in Copenhagen, Aarhus and Aalborg. Apart from Business Intelligence, it also has 13 years of experience providing business consultancy solutions to companies. Accobat has delivered solutions to some of Denmark's most progressive companies and organizations, such as Ramboll, UNIK data, AAU and more.

2.2 Project Objective and Goal

The main objective of the project is to propose a solution that performs data reconciliation testing between the source database and the target database (the data warehouse) in the business intelligence process. This testing solution should help Accobat automate the identification of data discrepancies and general errors that occur before financial reports and dashboards are created.

2.3 Deliverables

Deliver an architectural design of the reconciliation testing framework.

2.4 Human Resources

BI Department Manager: Approving the project plan; providing information when needed and giving feedback; evaluating the test framework.

Bent K Slot, University Supervisor: Giving students feedback on how to work on the project; evaluating the project report.

Teajeni Misago & Hongzhi Zhang: Proposing the solution architecture design; planning the project and developing the project report; conducting interviews with stakeholders.

BI Developers: Providing technical information, feedback and BI experience in general.

2.5 Introduction of Business Intelligence

Many businesses and corporations are increasingly using business intelligence and incorporating data analytics into their systems.

Why is Business Intelligence important?


Business Intelligence (BI) involves the delivery and integration of the most relevant and useful business information in a company or organization.

Most companies nowadays also use BI to detect significant events and to identify or monitor business trends so they can adapt quickly to a changing environment. If companies use business intelligence effectively and train their staff, they can improve decision-making at all levels of the organization and strengthen the strategic management process.

Main reasons why companies invest in a Business Intelligence system:

- To gain insight into consumer buying behaviour. Once the company knows how consumers are buying, it can produce products that match current consumption trends.

- A BI system can help a company better understand the implications of various organisational processes and enhance its ability to identify suitable opportunities for future planning.

- Investing in a Business Intelligence system helps identify errors and areas that need improvement, saving time and money and improving the organisation's business decision-making.

- It helps strengthen sales and marketing towards customers.

- It helps gain insight into what competitors are doing.

Figure 1: Life cycle in business intelligence

2.6 Business intelligence process

Data from the client-side data source (DS) is processed into the staging database by the ETL process; these are defined as the source and staging databases. The data is then loaded into the data warehouse, the target database. Next, data marts (see Appendix: Glossary) are created from the data warehouse according to the business logic; from there the data is presented in a multidimensional database, where reports are created for management or anyone else who needs to review or use them.


Figure 2: Business intelligence process

The ETL process in Figure 2 is an important component of the overall process. It is used to copy data from operational applications (source databases) to the staging area, from the staging area into the DW, and finally from the DW into a set of conformed data marts accessible to decision makers. Each data mart often holds only one subject area, for example finance, sales or marketing.

ETL software extracts data, transforms values of inconsistent data, cleanses "bad" data, filters data and loads it into a target database. The scheduling of ETL jobs is critical: should one ETL job fail, the remaining jobs must respond appropriately.
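As an illustration of the extract, transform and load steps just described, the sketch below reads rows from a source table, casts and cleanses inconsistent values, and loads the result into a warehouse table. The table and column names are invented for the example, and SQLite stands in for the operational and warehouse databases; a production ETL job on SQL Server would be built with dedicated tooling.

```python
import sqlite3

# Two in-memory databases stand in for the operational source and the DW.
source = sqlite3.connect(":memory:")
dw = sqlite3.connect(":memory:")

source.execute("CREATE TABLE sales (id INTEGER, amount TEXT)")
source.executemany("INSERT INTO sales VALUES (?, ?)",
                   [(1, "100.0"), (2, "250.5"), (3, "bad")])

dw.execute("CREATE TABLE fact_sales (id INTEGER, amount REAL)")

# Extract: read all rows from the source table.
rows = source.execute("SELECT id, amount FROM sales").fetchall()

# Transform: cast inconsistent values, cleanse rows that fail validation.
clean = []
for sale_id, amount in rows:
    try:
        clean.append((sale_id, float(amount)))  # cast text to a number
    except ValueError:
        pass  # "bad" data is filtered out here

# Load: write the cleansed rows into the target warehouse table.
dw.executemany("INSERT INTO fact_sales VALUES (?, ?)", clean)
dw.commit()

loaded = dw.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(loaded)  # 2 of the 3 source rows survive cleansing
```

Note how the cleansing step silently drops a row: this is exactly the kind of behaviour that can create a discrepancy between source and target row counts, which the reconciliation testing in this project is meant to detect.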

2.7 Problem to Address

The problem appears as data mismatches in some tables between the source database and the target database (data warehouse). Each mismatch can cause a problem or error in reports that are supposed to be produced from accurate business intelligence data. Accobat therefore needs a solution to help them reconcile data between the two databases.

Problem statement: How do we develop a testing framework that reconciles the source database and the data warehouse?

Questions to answer in this project:

1. Why is a reconciliation test framework design important to implement?

2. Will the developers be able to see the reconciliation on the database automatically?

3. Is a data mismatch detected, and does the reconciliation process give a warning indicator?

4. How can the recon table result be shown online in real time?

5. How will the company Accobat benefit from the test framework?

3 Methods

3.1 Overall Approach

All empirical data collected for this project is based on a qualitative approach: conducting interviews and observing developers. In addition, we use company materials provided by Accobat, together with academic books, articles and research materials. For the methodology we use an iterative model.

3.2 Methodology

The methodology we use in the project is partly plan-driven and partly iterative, since this is a new solution for testing data between the source database and the data warehouse on Accobat's Business Intelligence enterprise platform, and iteration matches our working cycles from an academic point of view.

We must plan every phase of the project and carry out the tasks and activities needed in that phase. The main reason we combine planning with an iterative methodology is to reduce the risk in executing the project and to help us organize it into a structured and streamlined process. The figure below presents what we covered in every phase.

Figure 3 iterative methodology life cycle

Project Planning

We define the project plan, scope and introduction, and also the method of work for the project.

Stakeholder Analysis


In this phase we perform a detailed analysis of the people, groups or organizations that are likely to be affected by, or to have an impact on, the new testing framework, and identify what value each stakeholder brings when defining and correcting requirements.

Requirement Engineering

We study the current test framework, how it works and its challenges, and then describe the future test framework by identifying the requirements and problems the new framework is supposed to address and solve. We use rich pictures to capture missing and conflicting requirements, because our stakeholders do not know in advance how the framework should be developed. Paper-based models representing the most important aspects (functional and system design) of the test framework were then evaluated by the stakeholders to refine the requirements and capture any that were missing.

Framework Architecture Design

In this phase the requirements are translated into the full design architecture of the test framework. In our case it consists of a context view, a functional view, an information architecture and a technical architecture, together with the architectural perspectives the design follows.

Evaluation/Discussion

Results from the design phase are continuously delivered to the stakeholders for evaluation, to make sure the testing framework meets their requirements and business needs.

Document Delivery

In this phase a design description document is delivered to Accobat, the project company, and the university report is handed in, as this is the project report for the 3rd semester of the Master's in Computer Science and Informatics at RUC.

3.3 Alternative Approach

The development of the framework architecture description is mainly carried out in the initial phase of software development, to help define and agree on scope, agree on and validate requirements, and provide the technical leadership for the decisions that will shape the architecture of the new system. Architecture definition can therefore also be integrated into a linear approach such as the waterfall model.

Similarly, an iterative model could be an alternative if the architecture definition process were made iterative; it could then be integrated into the early analysis phase of the iterative model.


If an agile approach were used, several factors would have to be considered to ensure its success: the architecture work should be delivered incrementally; the aim should be documents that are good enough to deliver as soon as they are usable, rather than polished to the point of perfection; and every deliverable should have a customer who understands and agrees with the value it brings them.

3.4 Working Procedures

Our working procedures are as follows:

- Obtaining resources.

- Setting project milestones and giving project status updates.

- Carrying out our project approach.

- Creating a communication channel with Accobat (interviews and meetings).

Based on these procedures we are able to break our workload down into several iterations. Note that the iterative process is not shown in the figure; in each phase we adjust our solution according to feedback from the previous phase, following the principles of the iterative methodology.

Figure 4 Milestones and meeting summary

3.5 Performing Interviews

To identify the needs and requirements for the testing framework, and to communicate with stakeholders, our approach is to perform several interviews. We made appointments with the stakeholders in advance and then presented our milestones and project status after each iteration. The meeting appointments are shown in the working procedures figure above.


We arranged five meetings with Accobat stakeholders throughout the semester and set our milestones beforehand (for meeting details see the Appendix). At each stage of the project we act on the previous meeting and then present our milestone to the stakeholders at the next one. We believe this is one of the fastest ways for us to form our action plan: because the company's employees are not flexible with the meeting schedule, we want to fully utilize the time between meetings and get feedback as soon as meetings are held.

3.6 Stakeholder

After the first meeting with one of the Accobat managers, we were able to collect information about who the key stakeholders in the company are and which of their concerns are important.

Stakeholder list:

- Solution responsible
- Developers
- Accobat consultant
- Commercial responsible
- Delivery responsible

Developers: Look after the actual implementation of the test framework and confirm that it meets the business requirements.

Solution responsible: Also the main project supervisor, controlling the budget and communicating with other key stakeholders when resources are needed.

Commercial responsible: In charge of promoting and publishing developed products.

Delivery responsible: Plays a key role when the system is ready to be delivered to the client.

Consultant: An experienced developer who can provide technical information regarding BI in general.

4 Theory

4.1 Books

Business Intelligence Guidebook (2015) by Rick Sherman

We use Rick Sherman's Business Intelligence Guidebook (2015) to guide us through the business intelligence process and to understand the business intelligence architecture framework. The book describes how to develop a cost-effective business intelligence solution and which capabilities and skills BI needs to meet its challenges, since all business intelligence frameworks are designed to accommodate expansion and renovation as requirements evolve. For our project we referred mostly to chapter 4, on the business intelligence architecture framework, which presents the key components and elements that compose the design architecture of BI: information architecture, data architecture, technical architecture and product architecture. In this project we cover mainly the design architectures we consider important for our own architecture; see details below.

• Information architecture: defines the purpose of the project, the business processes and analytics, who will have access, and where the data is and how it will be integrated and consumed.

• Technical architecture: defines the technologies used in the implementation of the business intelligence solution that fulfil the information and data architecture requirements.

Software Systems Architecture by Nick Rozanski and Eoin Woods

We have used this book as a guideline for developing a software system architecture in general, for making the designs for our solution, and for all the areas involved in analysing and developing a good software architecture. This includes stakeholder analysis and mapping, and the importance of views and their perspectives for software architecture design. Below are the key views we have used in the project.

• Context view: describes the relationships, dependencies and interactions between the system and its environment.

• Functional view: defines the system's functional elements, the responsibilities of each, the interfaces they expose and the interactions between elements.

5 Analysis

5.1 Stakeholder Analysis

We used stakeholder mapping to present the influence of each stakeholder, how much their participation is required and where it is most needed, and then categorized them into a stakeholder matrix with detailed approaches. Below we outline the stakeholders that have a key role to play in our project.


Figure 5 Stakeholder map

For the developers at Accobat, it is certain that they have the highest priority for participation, because they need to carry out and polish the actual implementation of the new testing framework and communicate with their supervisors at Accobat. The Accobat solution responsible, as this project's supervisor, has control and a high-level vision of database access, budget calculation and the hiring of external consultants, and communicates with all the other stakeholders. The Accobat commercial responsible and delivery responsible are in control of promoting and publishing the product once it is done; they have a high impact on the testing framework project but participate little until the project is finalized. The Accobat consultant, as an assistant to the project, has high influence as well as a high need for participation; we acknowledge that they are experienced developers who can provide the team with ample information in the IT field.

Stakeholder matrix, based on the stakeholder analysis and mapping. By categorizing their strengths and demands we were able to decide our approach to each stakeholder, working on the system while communicating with the correct stakeholder.


Delivery responsible
- How will they benefit? Ensured delivery of a good product.
- Attitude towards the project? Working together with developers on the project.
- How can they contribute? Promoting the developed product to clients.
- How do we deal with them? Keep the project on time.

Commercial responsible
- How will they benefit? They can sell and market the solution.
- Attitude towards the project? Learning and training on the test framework.
- How can they contribute? Involvement in the project lifecycle.
- How do we deal with them? Involve them in all communication.

Solution Manager
- How will they benefit? Control of all project planning.
- Attitude towards the project? Active communication with both internal and external teams.
- How can they contribute? Involvement in all project planning.
- How do we deal with them? Keep reporting our project status and getting feedback.

Developer
- How will they benefit? Development experience on solutions.
- Attitude towards the project? Information sharing and support.
- How can they contribute? System testing, debugging and maintaining the solution.
- How do we deal with them? Involve them in identifying challenges in the project implementation.

Consultants
- How will they benefit? A BI testing solution.
- Attitude towards the project? Project involvement.
- How can they contribute? Identifying solution risks.
- How do we deal with them? Ask for feedback and input.

We went through the stakeholder analysis to get an overview of the different visions the stakeholders hold, so that we can balance and document the requirements, find out what solution we can propose that benefits the current system, and identify the stakeholders we can contact later for information, depending on the theme of our project.

5.2 Requirement overview

Based on the stakeholder analysis, where we made sure to go through all key stakeholders' concerns, we were able to locate the concerns that connect the different stakeholders, and we settled the direction of our project: proposing a testing framework for the Accobat BI procedure. As addressed in Chapter 2, there are data mismatches in some of the datasets between the source database and the target database, and the majority of the stakeholders directly or indirectly require these discrepancies to be eliminated. Examples:

The commercial responsible finds it quite hard to promote products with data discrepancies: clients stay doubtful towards the product if a single mistake is spotted. The delivery responsible supervises the project delivery time and values it most; when a project is finished without data reconciliation, developers must test it manually with a reduced dataset, which is extremely inefficient and informal and may delay delivery. To eliminate the data discrepancies, we went through the research work listed in Chapter 3 and found that data reconciliation is needed to test the data between the client's source database and the target database (Accobat's data warehouse), meaning that we target the entire datasets before they are processed further into Accobat's data reporting system. To formalize this testing process, we will develop a testing framework. Our goals are listed below.

Goals of the reconciliation test framework

• Ensure the reliability and accuracy of the data.

• Report data mismatches.

• Make the test framework automated.

• Provide real-time information on reconciled data.

• Support the business intelligence process with correct data.

• Integrate into the BI debugging process.

• Be sellable by Accobat to their clients as a solution.

Jannick, the solution manager, provided us with information after communicating with the other stakeholders. Since he is the key person controlling all project planning, we were able to balance the stakeholders' concerns and formulate detailed requirements from the information he gave us. Below is the final requirement list we summarized.

All requirements below were given by Jannick, Solution Manager.

Requirement No.1: Reconciliation of master data between the source and target database on row count.

Requirement No.2: Creation of threshold indicators: 100% match = green, mismatch of 5% or more = yellow, mismatch of 10% or more = red.

Requirement No.3: Ability to comment on each reconciliation for follow-up.

Requirement No.4: Provide real-time information on the recon table using a reporting tool such as Power BI.

Requirement No.5: The test framework works automatically by running scheduled jobs on the system.

Requirement No.6: The solution is coded in SQL using Microsoft SQL Server.

Requirement No.7: Implemented in a dynamic way to support performance.

Requirement No.8: The test framework prints out a CSV file automatically.

We have listed eight detailed requirements for the testing framework, so that it can identify and indicate data discrepancies and notify the users, i.e. employees at Accobat.

Requirement No.1 ensures no data loss in terms of quality (master data between the two sets of tables) or quantity (row count).

Requirement No.2 defines the indicators featuring the different kinds of situations and levels of severity of data discrepancies.

Requirement No.3 supplies the detailed information the user should receive with the test results.

Requirements No.4, No.5, No.7 and No.8 require the proposed testing design to complete data reconciliation within a very short time span, and automatically.

Requirement No.6 fixes the BI platform on which the data reconciliation runs.
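To make Requirements No.1, No.2 and No.8 concrete, here is a small sketch of the indicator logic and the CSV output. It is written in Python with assumed thresholds (yellow from a 5% mismatch, red from 10%), since the production solution is to be coded in SQL on SQL Server per Requirement No.6; the function name, table names and counts are ours, not Accobat's.

```python
import csv
import io

def indicator(source_count, target_count):
    """Classify a row-count reconciliation result (assumed thresholds)."""
    mismatch_pct = abs(source_count - target_count) / max(source_count, 1) * 100
    if mismatch_pct == 0:
        return "green"            # 100% match
    return "red" if mismatch_pct >= 10 else "yellow"

# Requirement No.8: print the recon table to CSV automatically.
recon_rows = [
    ("dim_customer", 1000, 1000),  # perfect match -> green
    ("fact_sales",   1000,  930),  # 7% mismatch  -> yellow
    ("fact_orders",  1000,  850),  # 15% mismatch -> red
]
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["table", "source_rows", "target_rows", "indicator"])
for table, src, tgt in recon_rows:
    writer.writerow([table, src, tgt, indicator(src, tgt)])
print(out.getvalue())
```

In a deployed framework the CSV rows would come from the recon table itself, and the same indicator column could feed the real-time Power BI report mentioned in Requirement No.4.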

5.3 Challenges

Reflecting on our goals and requirements for this particular testing framework, we identified a few challenges that we must overcome or find a way to compromise on.

• Limited information: According to the information we gathered, testing frameworks for business intelligence are usually not open source; many companies, including the consultancy Accobat, do not have a completely developed framework for data reconciliation testing. It has become the norm that consultancy companies build their own testing frameworks. That is why building this will be challenging: the project points to a relatively unexplored area for us.


• Limited resources: To design and build a framework within this project alone, we do not have enough support from the company at a business project level, nor enough meetings with the stakeholders in the company.

• Limited knowledge: Designing and building this framework requires knowledge of back-end database programming languages, software architecture principles, and Business Intelligence enterprise frameworks in general; as a matter of fact, we have had to learn a great deal while building it. We also see this challenge as one of our academic goals.

5.4 Future Testing Framework We made a sketch of the testing framework we are going to develop. In order to execute data reconciliation testing, we need to join the master data from the data warehouse and the staging database during the ETL process, then calculate the data discrepancies. This testing framework involves the whole BI procedure in Accobat, because the master data being tested comes from the very beginning of the process (the staging database holds the same data as the source database). This sketch settles our starting point for the design phase.

Figure 6 Sketch of testing framework


5.5 Introduction of Data Reconciliation Each time data is exchanged between the source and target systems, there is a risk that the data that is sent may not be consistent on reception or the transmission system may error out, causing data discrepancies that must be resolved to ensure data consistency. Ensuring the consistency of data is an important aspect of data quality in BI. The term “reconciliation” is defined as the comparison of data in the source systems to the data in the DW by checking whether the summary values (e.g. record counts or total values) and detail data such as a particular fact table row in DW are the same as in the source system (Rainardi, 2008).
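The comparison of summary values that Rainardi describes can be sketched as follows. This is an illustrative Python/sqlite3 simulation, not the production T-SQL: the table names, columns and rows are invented, with an in-memory database standing in for the source system and the data warehouse.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src_invoice (id INTEGER, amount REAL)")
con.execute("CREATE TABLE dw_invoice (id INTEGER, amount REAL)")
con.executemany("INSERT INTO src_invoice VALUES (?, ?)",
                [(1, 100.0), (2, 250.0), (3, 75.0)])
con.executemany("INSERT INTO dw_invoice VALUES (?, ?)",
                [(1, 100.0), (2, 250.0)])  # one row lost during loading

def summary(table):
    # Record count and total value: the two summary checks named above.
    return con.execute(
        f"SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM {table}").fetchone()

src, dw = summary("src_invoice"), summary("dw_invoice")
print("consistent" if src == dw else f"discrepancy: source={src}, dw={dw}")
```

Comparing the summary tuples, rather than every row, keeps the check cheap enough to run after each load while still catching lost or duplicated data.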

6 Design 6.1 Design Summary We have summarized and analysed the information from chapters 4 and 5; now we follow the requirements and begin the design of the framework.

As our sketch in chapter 5 showed, we have a general idea of where the testing framework should interface with and affect the BI process. We will design a context view (Rozanski & Woods, 2012, page 66) for the testing framework and its external environment (external entities, stakeholders), showing how they work together and the relationships between them. This view follows the principles and philosophies of architectural viewpoints (Rozanski & Woods, 2012, chapter 16).

Good data reconciliation requires verifying that the data in the target database is as expected after the loading process. This confirms that the Business Intelligence process is loading correct and matching data from the source database into the target database.

The main method of data reconciliation testing is master data transaction reconciliation between the source and target databases at table level, ensuring that the total row count of a table in the source database exactly matches the total row count of the same table in the target database.

In that case, it is easy to see whether the number of records is the same on both sides. If the totals are equal, no data mismatch is found: the tables match 100%, and the reconciliation shows a green indicator.

When a data mismatch is found, a warning message notifies the user of the issue. If the mismatch is below the 10% threshold, a yellow indicator is shown; if it is 10% or more, a red indicator is shown, which means the situation is worrying and action needs to be taken as soon as possible.
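The indicator logic described above can be sketched as a small function. The function name and the handling of an empty source table are our own choices, not part of the requirements.

```python
def indicator(source_rows: int, target_rows: int) -> str:
    """Classify a row-count comparison as green, yellow or red.

    Exact match is green; a mismatch below 10% of the source count is
    yellow; 10% or more is red and requires action.
    """
    if source_rows == target_rows:
        return "green"
    if source_rows == 0:
        # Any rows in the target are unexpected; treat as severe.
        return "red"
    mismatch_pct = abs(source_rows - target_rows) / source_rows * 100
    return "yellow" if mismatch_pct < 10 else "red"

print(indicator(700, 700))  # exact match
print(indicator(700, 665))  # 5% mismatch
print(indicator(700, 500))  # roughly 29% mismatch
```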

Developers can test the reconciliation at any given time by running the compare stored procedure on the target database. Developers can also add a comment to the reconciliation when needed. These functions are elaborated further in our functional view.

With the views we have, we will be developing an information architecture following the principles and philosophies of architectural blueprints, and focusing on two aspects of our solution under the business context:

• Where the data is, where it will be integrated, and where it will be consumed in analytical applications


• Why the BI solution(s) will be built—what causes the problem (data discrepancies), and where

Lastly, we have the technical architecture, which defines the technologies used at the different stages of the process to implement and support a BI solution that fulfils the information and data architecture requirements.

Our design for the testing framework will fulfil some of the requirements from stakeholders, by achieving the perspectives (Rozanski & Woods, 2012, chapter 24) along with the designs of views and architecture.

Architectural Views and Architectural blueprints

An architectural view is a way to portray those aspects or elements of the architecture that are relevant to the concerns the view intends to address—and, by implication, the stakeholders to whom those concerns are important. (Rozanski & Woods, 2012)

An architectural blueprint for the Business Intelligence process is similar to a blueprint for a building in real life: each type of blueprint describes a different aspect of an architecture, and they are helpful for developing or adjusting a framework in a Business Intelligence environment.

6.2 Context view Here we demonstrate our testing framework design inside a context view. The context view of a system describes the relationships, dependencies and interactions between the system and its environment, which in this case refers to Accobat employees and the Accobat Business Intelligence process.

As mentioned before, the data from the source database is extracted, transformed and loaded into the staging database temporarily, and afterwards it is loaded into the data warehouse. Data discrepancies are expected to happen here, and they will affect the rest of the system: data mining, front-end reporting, etc. The data is consumed by SSAS in order to create the reporting OLAP cube, and at the end-user point the data is visualized with the Power BI tool to produce financial reports and dashboards. The Developer supports the data mining from the data warehouse to SSAS, while the Solution Manager, Delivery Responsible and Commercial Responsible keep track of the reporting system. With the Recon Table integrated into the system, some of the interactions between process and stakeholders will change. First of all, the Recon Table, as an entity, joins two databases (staging DB and DW); after that, the developer supervises the recon table, adjusting and fixing the data discrepancies illustrated by the indicators in the testing results. Throughout the process, the solution manager keeps track of the reporting system, and also tracks the testing results from the recon table on a daily basis, to avoid severe business mistakes that could be caused by data discrepancies.


Figure 7 Context View for testing framework

6.3 Functional view The functional view of the testing framework defines the system's functional elements, the responsibilities of each, the interfaces they expose, and the interactions between elements. Taken together, this demonstrates how the system will perform the functions required of it.

In the functional view of the testing framework, functions indicate and connect elements to one another. For example, the Recon Table calls the Compare Table when it needs to compare and locate data discrepancies, and the Compare Table in turn calls other entities to fulfil the requirements; the connections in between are critical functions presenting the framework's elements and responsibilities.


Figure 8 Functional View

Below is an overview of functional elements and their properties.

Functional Elements

Element Name Recon Table

Responsibilities Showing testing results and comments, as well as an overview of general controllable elements that users such as the Developer can interact with.

Inbound

Outbound Compare data discrepancies from Compare Table

Mark discrepancies with Indicator

Element Name Indicator

Responsibilities Showing testing results with different degrees of data matches for every piece of detailed information in the compare table

Inbound Whenever Recon Table asks to mark discrepancies

Outbound


Element Name Compare Table

Responsibilities Gathering and joining data from the two temp tables

Inbound When the Recon Table calls the compare function, it receives the signal to create the two temp tables

Outbound Indicates both the Data Warehouse temp table and the Staging temp table

Element Name Temp Datawarehouse Table

Responsibilities Receiving properties such as names, row counts and table numbering from the Data Warehouse, forming a temp table

Inbound When the Compare Table needs to join temp Data warehouse table with another table

Outbound Getting properties from Data warehouse database

Element Name Temp Staging Table

Responsibilities Receiving properties such as names, row counts and table numbering from the Staging database, forming a temp table

Inbound When the Compare Table needs to join temp Staging table with another table

Outbound Getting properties from Staging database

Element Name Datawarehouse Database

Responsibilities Providing information for a temp table

Inbound When the Temp Table needs to retrieve data

Outbound

Element Name Staging Database

Responsibilities Providing information for a temp table

Inbound When the Temp Table needs to retrieve data

Outbound
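To make the element interactions above concrete, here is a minimal Python/sqlite3 simulation of the flow: two temp tables receive (table name, row count) properties from the staging database and the data warehouse, and the compare table is their join. All identifiers and sample numbers are invented for illustration; the production version would use T-SQL temp tables on SQL Server.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TEMP TABLE temp_staging (table_name TEXT, row_count INTEGER);
    CREATE TEMP TABLE temp_dw      (table_name TEXT, row_count INTEGER);
    INSERT INTO temp_staging VALUES ('Customer', 1000), ('Invoice', 5000);
    INSERT INTO temp_dw      VALUES ('Customer', 1000), ('Invoice', 4980);
""")

# The compare table joins the two temp tables on table name and flags
# any row-count mismatch.
compare = con.execute("""
    SELECT s.table_name,
           s.row_count AS staging_rows,
           d.row_count AS dw_rows,
           CASE WHEN s.row_count = d.row_count
                THEN 'OK' ELSE 'MISMATCH' END AS status
    FROM temp_staging AS s
    JOIN temp_dw AS d ON d.table_name = s.table_name
""").fetchall()

for row in compare:
    print(row)
```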


6.4 Information architecture

Figure 9 Information architecture of testing framework

The diagram shows the information architecture which includes the processes, external entities and decisions made.

- Client is an external entity (company) in need of a business intelligence solution at Accobat.

- Source database is a database at the client's end, used by the client for storing data; the data can be in any format. It is also operated by the client.

- Staging database (extract): data from the source database is extracted and loaded into a staging database; it is a copy of the data from the source database before transformation.

- Data warehouse: data from the staging (extract) database is loaded into the data warehouse, where it is formatted depending on the business logic and needs, then processed and loaded into SSAS (SQL Server Analysis Services).

- Developer runs the compare between staging and data warehouse: once data has been loaded into the data warehouse and formatted, the data warehouse developer can compare data between the staging and data warehouse databases to see if there is any data mismatch.

- Recon table is the reconciliation result for both tables, put into a compare table called the recon table. If there is a mismatch in the row counts, the developer investigates and updates the business logic or the data warehouse; if there is no mismatch in the row counts, the status is OK. The recon table is accessible online, directly from the database through an online tool, so the client has real-time access. The BI manager and commercial managers get updates on the reconciliation


from both the table and the online reporting tool.

- SSAS cube performs analysis on the data using different measures depending on the business requirements; every piece of business logic is processed by a cube, and reports are created from a cube.

6.4.1 Why data discrepancies arise in the BI process Data discrepancies arise for different reasons; below we outline some of the causes.

• Human error in logic coding, or programming errors

• Loss of data in the loading process, if there are severe problems

• Inconsistent database and data

• A problem with the algorithm, where one algorithm may not be suited for a particular task

• Data models in the data warehouse may also be too complex, causing data discrepancies.


6.5 Technical architecture

Figure 10 Technical architecture for testing framework

Data Sources

Data source systems may contain structured, unstructured and semi-structured data that will be integrated into the BI enterprise framework.

ETL

ETL is used to extract, transform and load data into the data warehouse. The ETL process runs on a daily basis; depending on the business logic or business need, it can also be set up to load in real time. The ETL process uses a relational database and is based on SQL coding.

Data Warehouse (SSIS)

The data warehouse uses relational database technology for Business Intelligence and data integration. The data warehouse is where all database rules are set, including data indexing, data partitioning, materialized views, in-memory processing, etc., plus infrastructural components that support performance, such as hardware, memory, storage and network. In this project we are required to work with the Microsoft integration tool SSIS (SQL Server Integration Services).


SSAS cube

SSAS (SQL Server Analysis Services) is used to analyse data by measures, depending on the business needs and business logic set up in the Business Intelligence solution. SSAS uses Microsoft SQL Server and a relational database to perform data analysis.

Online report tool (Power BI)

The online report tool presents real-time reports online for clients to access; it is useful for extracting data and producing the reports that management uses for decision making.

Compare table

The compare table is created by SQL stored procedure code run against both the staging (ETL) and data warehouse databases.

6.6 Performance and Scalability Architectural Perspective An architectural perspective is a collection of architectural activities, tactics, and guidelines that are used to ensure that a system exhibits a particular set of related quality properties that require consideration across system’s architectural designs. (Rozanski & Woods, 2012, page 72). Performance and scalability defines the ability of a system to predictably execute within its mandated performance role and to handle increased processing volumes, the concerns of it could be Response time, throughput, peak load behaviour, etc.

The test framework compares row counts between the two databases (source and target), and its performance has to be reliable and accurate. The source database contains about 700 tables, which are all loaded into the data warehouse, the target database in this case. Stored procedures are developed to compare the two databases and write the result into a compare table (the recon table). The test framework solution is to be implemented in a dynamic way, where all reconciliation testing on the databases is automated, and the recon table is to be accessible online in real time once the reconciliation procedure has executed.

Adding to that, there will be an automated job set up in the system, executing every midnight after the data has been loaded from the source database into the target database. The compare stored procedure then executes right after, to check whether the row counts of the tables match. This helps developers investigate in good time, without first having to run the stored procedure when they come in in the morning.

Overall, the design of the testing framework fulfils the requirements by achieving the performance perspective. Since dynamically programmed SQL queries and stored procedures process the comparison of the datasets, and execution happens automatically every midnight after the ETL process has loaded the data, the increased workload and processing volumes should remain in a controllable range, and the response time of the testing should stay consistent.
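The "dynamic way" of comparing datasets means iterating over all tables found in catalog metadata instead of hard-coding roughly 700 comparisons. A minimal Python/sqlite3 illustration, with `sqlite_master` standing in for SQL Server's system catalog and invented tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Customer (id INTEGER);
    CREATE TABLE Invoice (id INTEGER);
    INSERT INTO Customer VALUES (1), (2);
    INSERT INTO Invoice VALUES (1);
""")

def row_counts(connection):
    """Dynamically count rows in every user table listed in the catalog."""
    tables = [name for (name,) in connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    return {t: connection.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
            for t in tables}

print(row_counts(con))
```

Running this once per database and joining the two resulting dictionaries gives the per-table comparison the recon table needs, and the logic keeps working unchanged when tables are added or removed.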


7 Discussion Reflecting on what we designed according to our analysis and findings: first, we located the stakeholders and gathered their concerns through the solution manager, and our detailed requirement overview was created based upon that. Secondly, we planned our approach by making a sketch and getting detailed requirements from meetings. Lastly, we designed blueprints/views for the corresponding audiences.

The goal of the design is to fulfil the requirements that were given to us. We believe that most of the requirements are covered by the design; a few technical requirements require actual implementation, which we may work on in the future.

As mentioned in the design chapter, the philosophy and principles we followed are from Rozanski & Woods (2012). For visualization we used UML diagrams, an efficient way to cover a wide portion of software development efforts, including agile practices. For example, our context view uses a class model to represent the environment and the testing framework, and the relationships between the stakeholders and the system; our functional view is built as a functional structure model, representing the main functions that support the goals of this testing framework: to report data discrepancies, to support the Business Intelligence process with correct data, etc.

In our technical architecture, we decided to use Microsoft SQL stored procedures and a dynamic implementation, executed every day at midnight after loading data from the client database into the data warehouse. This decision alone fulfils several requirements, such as performance time, because of the way SQL procedures and their stored logic handle the data.

We did not cover requirements such as privacy and security; we acknowledge how important they are, but we did not need to focus on those aspects for now because they are not the stakeholders' main concerns. We did cover the performance and scalability architectural perspective as a way to handle the inevitable increase in data workload, since clients might have larger databases, or more databases that need to be extracted from.

8 Conclusion Looking back at the problem we intended to solve in the Business Intelligence process, we dedicated our project specifically to the key phrase: data discrepancy.

We introduced the problem at the very beginning: the whole business process depends on the accuracy of the reported data, which comes from Accobat's data warehouse. Stakeholders are having troubles, many of them related to data discrepancies between the source database and the data warehouse; clients would lose trust in Accobat over even a single spotted data mismatch. This alone hurts the company's reputation, not to mention the delays in delivery and the duplicated, manually executed workloads that consume human resources.

Our proposed solution requires minimal resources to achieve, and we designed the framework to fit the requirements centered around the problem we formulated in the introduction chapter.


As a result, for the problem we introduced in our project, the summary is shown below:

Problem statement: How do we develop a test framework that will reconcile two databases, the source database and the data warehouse?

• Why the reconciliation testing framework design is important to implement.

• What will the company Accobat benefit from the test framework?

Overall this involves most of the stakeholders and their concerns: the difference between having a testing framework and not having one directly impacts the company's benefits and client base. Our effort was to make a feasible framework design that can be implemented in the future; the reasons why data discrepancies occur are listed in the information architecture, and the testing framework will prevent them once it is online.

• Will the developers be able to automatically see the reconciliation on the database?

• Are data mismatches detected, with a warning indicator, in the reconciliation process?

Given the functions and indicators we designed in the functional view, and the discussion from the performance perspective, we strongly believe that with the testing framework's technical solution, developers will be able to see the reconciliation result on a daily basis. The testing result will be fast to navigate thanks to the stored procedures and the logic behind the reconciliation table; in addition, warnings and comments from the indicators will show the severity of the data discrepancies.

To sum up from an academic point of view, we have taken on the challenge of completing a project with the following steps: studying the problem, gathering information, forming the approach, and developing the solution. We are satisfied with the amount of knowledge we obtained while doing this project, especially in the early stage, where we had to absorb information in order to comprehend the big picture. Working through this project, we have significantly developed our skills in applying architectural design, evaluation and project work model techniques to solve real-world problems.

We would like to take this opportunity to thank our supervisor for the review sessions that broadened our horizons, and Accobat for the experience of a collaborative project.

9 Perspective For the future, we expect to develop this testing framework further as part of our Master's thesis, and to deploy the solution for the company Accobat, integrated into their BI enterprise platform.

Accobat will also sell it as a service to their clients, creating revenue for the company and adding value for their clients by controlling data discrepancies in the database. Accobat will employ someone to work on the project for further development.

Plus, the whole solution will save the company time and money, as it will be implemented in an automated way: no more manual testing of databases. We might research and gather automated-testing theories for the actual technical solution. Lastly, depending on how the requirements


change over time, we might want to add more features to the testing framework, and learn more about Business Intelligence or testing in general while developing.


10 References

• Rozanski, N. and Woods, E., 2012. Software Systems Architecture: Working with Stakeholders Using Viewpoints and Perspectives, 2nd Edition. Addison Wesley.

• Poughkeepsie Center, 2000. Business Intelligence Architecture on S/390 Presentation Guide, First Edition.

• Rainardi, V., 2008. Building a Data Warehouse: With Examples in SQL Server. Apress.


11 Appendix

11.1 Glossary

Business Intelligence (BI): Business Intelligence is the process of collecting raw data or business data and turning it into information that is useful and more meaningful.

Extract-Transform-Load (ETL): ETL is a process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse.

Data Warehouse (DW): A data warehouse is a database that is designed for query and analysis rather than for transaction processing. Data warehousing is one of the BI solutions that helps to convert data into useful information by providing multiple dimensions to study the data, for purposes such as generating informative dashboards, reporting, etc. Top management can therefore take quick and accurate decisions on the basis of statistics calculated using this data.

Microsoft SQL Server Analysis Services (SSAS): Delivers online analytical processing (OLAP) and data mining functionality for business intelligence applications. In simple terms, SSAS creates cubes using data from data marts / the data warehouse for deeper and faster data analysis.

Dimension Table: Stores descriptions of the characteristics of a business. A dimension is usually descriptive information that qualifies a fact. Dimensions do not change, or change slowly over time.

Data Staging Area (DSA): The data warehouse staging area is a temporary location where data from the source systems is copied. A staging area is mainly required in a data warehousing architecture for timing reasons. In short, all required data must be available before data can be integrated into the data warehouse.

Data Mart (DM): A data mart is the access layer of the data warehouse environment that is used to get data out to the end users. The data mart is a subset of the data warehouse and is usually oriented to a specific business line or team.

11.2 Meeting notes with Accobat from audio and written notes

11.2.1 Meeting with Accobat - 21st September 2017 Company presentation by Solution Manager Jannick

• Welcomed us to the company office

• Told us that he will be the main contact person for the project and the supervisor for the solution

• He gave a company background introduction.

• The company is 14 years old as of May this year, has 36 employees, and has over 115 clients in Denmark.

• Their core business is business Intelligence based on Microsoft BI Platform


• They implement solutions for different sectors e.g. real estate, high schools.

• They were one of the best performing companies in 2016 and 2017

• CEO is the only owner of the company.

• Their biggest client is Ramboll Engineering, where they provide a BI solution for about 15,000 employees.

• An introduction to the need for data quality in the BI context. Company Core Business

• Commitment to what they do and client comes first

• Respect for each other in the company (staff)

• Engagement is key in everything the company does.

• Project method at the company: SCRUM / Agile. Other add-on products for Business Consultant:

• Front-end products:

• Jedox and Prophix: both tools are used for budgeting and financial management by companies.

• Power BI is a Microsoft tool used for dashboards and graphical reports

• Data warehouse: ETL (extract – transform – load), working with dimensional data and transformations. The company has goals for the next 3 years:

• Turn data into insight, actions and impact.

• New sales for enterprises and Jedox: want to get more than 10,000 customers

• Have more business Frameworks

• Company grows

• Turnover of 50 million and 50 employees in 2020. Introduction to data warehousing (technical perspective)

• It is based on Kimball's approach to data warehousing.

• Made of a business process matrix

• Gives us what we need to measure on

• Which is then broken into fact tables

• Dimensions are present in the data warehouse system to support the measures.

• The data warehouse must be able to work with different platforms with different data formats (data sources).

• Data marts are subsets of DW, for each business logic

• “Star Schema” is used to map the actual relationship among tables in the DW.

• Actual implementation of Design: Microsoft Data warehouse

• Change management of the project is important, part of service provided by Accobat.

• Project Management is part of DW development.

• Having meetings and workshops with clients is key to success. Systems used for operations at the company: a ticketing system enables communication among employees (i.e. task-oriented); both customers and employees can create tickets for tasks and change requests from clients.


• Some clients can send an email to request a change. The TimeLog system is used by employees to register the time they have spent on a request, later used to bill the client. Project for students to work on

• Data reconciliation between source database and target database. Stakeholders:
o BI Department
o Service Desk
o Customers "UNIK"

11.2.2 Meeting with Accobat - 4th October 2017 Testing issues at the company: the company has challenges when it comes to testing; they would like to have testing for most of their products. Below is the list of testing they outlined for us

• To make sure two numbers from source and target data are the same (very much needed)

• Sample test (done)

• Functional testing (done)

• Correction (Reconciliation), full population (needed)

• Build test (ETL here) – not interesting to them

• Deploy test (done)

• Load test (done)

• Tests can be used internally or externally. (needed)

• Internal testing process (needed)

• Continuous integration (needed)

• Developer testing (needed)

• Manual testing (not needed) Types of projects

• Database testing

• Integration service testing

• Analysis service testing

• Between SQL query test and MDX query test

11.2.3 Meeting with Accobat – 18th October 2017 (Thomas – developer) For the client company: They got the source code from previous development.

• No API.

• Example for real estate companies: the system used by administrators offers a way to check the "tenancy" of their own buildings.

• As far as Thomas knows most of the data checks are done manually by communicating with clients.

• As a developer, he gets an idea of the accuracy of the data as soon as he looks into the tables from the databases (row counts, etc.)


• From his point of view, an automated implementation of data reconciliation is difficult.

• It is not necessary to check whether each value in the tables is correct; the most important part of reconciliation testing is to identify whether the total number/count is correct.

The current client we are studying in our semester project uses a standard solution from Accobat. Problem with testing: as a developer, Thomas told us that clients do not know how or where to start testing; they only know that the data presented in the system should be accurate and should be tested beforehand. Using Power BI: filtering can be used to show what to test, and the customer can send screen dumps for checking. Selecting the important scenarios: these come from the clients. The key thing is that Accobat is able to see what matters most to the clients and show it at the strategic and operational levels. For the project we are studying in Accobat, the source database has around 700 tables; the data warehouse uses only a subset of the tables in the reporting system. ETL: it takes a couple of hours to extract and load data every night, and this process should finish before the next morning.

11.2.4 Meeting with Accobat - 13th November 2017

• Clients have high expectations for the design of the reporting system; a single data mismatch or data loss will heavily damage Accobat's efforts on the project.

• Important data such as financial transactions: the client does not know the individual financial transactions, only, for example, the total number of transactions.

• Data quality over integration.

• Permission to extract data from the source database is straightforward: Accobat either gets all rows of data or nothing

• Manual testing starts with a discounted (sampled) amount of data, then proceeds to the whole dataset.

• It would be relatively easy to build a compare-table solution on the Microsoft SQL platform.

• Some business rules are hard to test, and it is acceptable to skip them.

• Data reconciliation: take the source data and check the number of rows at the target database.

• As a technical solution, we can hardcode checks for ease of use, with indicators such as a success colour and a warning colour for different test results.
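The colour indicators mentioned above can be sketched as a small helper. This is our own illustrative assumption of how such an indicator might work (the labels "SUCCESS"/"WARNING" and the green/red mapping are invented), not a specification from Accobat.

```python
# Hypothetical reconciliation indicator: "success colour" on an exact match,
# "warning colour" when rows are missing or extra at the target.
def indicator(source_count: int, target_count: int) -> str:
    if source_count == target_count:
        return "SUCCESS (green)"
    diff = abs(source_count - target_count)
    return f"WARNING (red): off by {diff} rows"

print(indicator(700, 700))  # matching counts
print(indicator(700, 698))  # mismatch flagged for manual follow-up
```

In a real dashboard the string would drive a coloured cell or icon; the point is that a hardcoded pass/fail rule on row counts is enough to surface discrepancies automatically.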

11.2.5 Meeting with Accobat - 4th December 2017 (Wolfgang - Senior Consultant)

Wolfgang is an Accobat consultant for Business Intelligence in general.

We gathered information from him; since he is a technical stakeholder, his opinions are valuable for our design. They are listed below:

• Currently, data comparison between the source database and the target database (the Accobat data warehouse) is done manually.

• Data in the client's system changes within very short time spans.

• The logic for moving data from source to target consists of two stages; the current client has approximately 600 tables, and the extraction requires only a subset of them.

• In his experience, comparing data between the source database and the data warehouse is quite difficult.

• MDX can also be used as a language for the data reconciliation logic.

• Data can be compared against the extract layer.

• SQL procedures are written with C# to extract data from the sources.

• Wolfgang understands that the solution manager wants to compare the extract database (staging database) with the data warehouse.

• The data to be compared comes from the tables inside the databases.

• The client side runs operational systems, and the data to be extracted also comes from there.

• The extract (staging) database contains the business logic that defines the table design.

• Data is transferred into a bus matrix.

• Tables in the data warehouse differ from those in the source database, since the data warehouse applies different logic.
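Wolfgang's suggestion to compare the extract (staging) database against the data warehouse can be illustrated with a set-difference query. The sketch below uses Python's built-in `sqlite3` as a stand-in for SQL Server, with invented table names `staging_customers` and `dw_customers`; on the Microsoft SQL platform the same comparison is typically written with the `EXCEPT` operator.

```python
import sqlite3

# Stand-ins for the staging (extract) database and the data warehouse.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE staging_customers (id INTEGER, name TEXT);
    CREATE TABLE dw_customers      (id INTEGER, name TEXT);
    INSERT INTO staging_customers VALUES (1, 'Acme'), (2, 'Globex'), (3, 'Initech');
    INSERT INTO dw_customers      VALUES (1, 'Acme'), (2, 'Globex');
""")

# Rows present in staging but missing from the warehouse: these are
# exactly the rows that failed to load during the ETL run.
missing = con.execute("""
    SELECT id, name FROM staging_customers
    EXCEPT
    SELECT id, name FROM dw_customers
""").fetchall()

print(missing)
```

An empty result would mean the two layers agree row for row; any output rows point directly at the discrepancies that manual testing currently has to hunt for.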