Data Quality of The Datawarehouse In Business Intelligence Process
2017
SEMESTER PROJECT IN COMPUTER SCIENCE AND INFORMATICS TEAJENI MISAGO & HONGZHI ZHANG
Data Quality of the data warehouse in Business Intelligence Teajeni Misago & Hongzhi Zhang
1
Abstract
In modern society we are witnessing rapid development in the Business Intelligence (BI) area. More and more companies turn to consultancy firms to sort and process their data, because they lack the resources to do so themselves. Consultancy companies deliver professional products and informative data analyses to their customers, enabling top management to make quick and accurate business decisions based on statistics calculated from these analyses.

For the consultancy companies, products such as data reporting systems have to be 100% validated and accurate. They must overcome the data discrepancies that arise when data is transferred and extracted from customers' source databases. These discrepancies can have many causes, such as network issues or logic errors buried in the data loading process; because this process is completed automatically by complex procedures such as ETL, they are relatively difficult to prevent.

When the datasets are overwhelmingly large, most consultancy companies focus on correcting discrepancies by manually testing a reduced sample of the data. From the information we have gathered, however, this cannot remove data discrepancies effectively, for reasons we list in our study.

Our project focuses on designing a testing framework within the BI process. We approached a consultancy company called Accobat, which provides information and feedback between meetings, and we deliver the design of the framework in this study project. We use a variety of methods and approaches to shape the project within a limited time window, research supporting theories from articles and books, and apply knowledge learned in past semesters of Computer Science and Informatics. Finally, based on our proposal in this project, we draw conclusions and outline plans for further development of the framework.
1 Table of Contents
2 Project Introduction
 2.1 Background of the project company
 2.2 Project Objective and goal
 2.3 Deliverables
 2.4 Human Resources
 2.5 Introduction of Business Intelligence
 2.6 Business intelligence process
 2.7 Problem to Address
3 Methods
 3.1 Overall Approach
 3.2 Methodology
 3.3 Alternative Approach
 3.4 Working Procedures
 3.5 Performing Interviews
 3.6 Stakeholder
4 Theory
 4.1 Books
5 Analysis
 5.1 Stakeholder Analysis
 5.2 Requirement overview
 5.3 Challenges
 5.4 Future Testing Framework
 5.5 Introduction of Data Reconciliation
6 Design
 6.1 Design Summary
 6.2 Context view
 6.3 Functional view
 6.4 Information architecture
  6.4.1 Why data discrepancies arise in BI process
 6.5 Technical architecture
 6.6 Performance and Scalability Architectural Perspective
7 Discussion
8 Conclusion
9 Perspective
10 Reference
11 Appendix
 11.1 Glossary
 11.2 Meeting notes with Accobat
  11.2.1 Meeting with Accobat – 21st September 2017
  11.2.2 Meeting with Accobat – 4th October 2017
  11.2.3 Meeting with Accobat – 18th October 2017 (Thomas – developer)
  11.2.4 Meeting with Accobat – 13th November 2017
  11.2.5 Meeting with Accobat – 4th December 2017 (Wolfgang – Senior Consultant)
2 Project Introduction

2.1 Background of the project company
Accobat A/S is a Business Intelligence consultancy company based on Microsoft technology, with offices in Copenhagen, Aarhus and Aalborg. Apart from Business Intelligence, they also have 13 years of experience providing business consultancy solutions to companies. Accobat has delivered solutions to some of Denmark's most progressive companies and organizations, such as Ramboll, UNIK data, AAU and more.
2.2 Project Objective and goal
The main objective and goal of the project is to propose a solution that performs data reconciliation testing between the source database and the target database, i.e. the data warehouse, in the business intelligence process. This testing solution should help Accobat automate the identification of data discrepancies and general errors that occur before financial reports and dashboards are created.
2.3 Deliverables
Deliver an architectural design of the reconciliation testing framework.
2.4 Human Resources

BI Department Manager
- Approving the project plan
- Providing information when needed and giving feedback
- Evaluating the test framework

Bent K Slot – University Supervisor
- Giving feedback to the students on how to work on the project
- Evaluating the project report

Teajeni Misago & Hongzhi Zhang
- Proposing the solution architecture design
- Planning the project and developing the project report
- Conducting interviews with stakeholders

BI Developers
- Providing technical information, feedback and BI experience in general
2.5 Introduction of Business Intelligence
Many businesses and corporations are increasingly using business intelligence and incorporating data analytics into their systems.

Why is Business Intelligence important?
Business Intelligence (BI) involves the delivery and integration of the most relevant and useful business information in a company or organization.

Most companies nowadays also use BI to detect significant events and to identify or monitor business trends, so that they can adapt quickly to a changing environment. If companies use business intelligence effectively and train their staff accordingly, they can improve decision-making processes at all levels of the organization and strengthen the strategic management process.

Main reasons why companies invest in a Business Intelligence system:

- To gain insight into consumers' buying trends. Once the company knows how consumers are buying, it can produce products that match current consumption trends.
- A BI system can help the organization better understand the implications of its various processes and enhance its ability to identify suitable opportunities for future planning.
- Investing in a BI system helps to identify errors and areas that need improvement; it also saves time and money and improves the organisation's business decision making.
- It strengthens sales and marketing towards customers.
- It provides insight into what competitors are doing.
Figure 1: Life cycle in Business Intelligence
2.6 Business intelligence process
Data from the client-side data source (DS) is processed into the staging database in the ETL process; these are defined as the source database and the staging database. The data is then loaded into the data warehouse, the target database. After that, data marts (see Appendix: Glossary) are created from the data warehouse according to the business logic, and from there the data is presented in a multidimensional database, where reports are created for management or anyone else who needs to review or use them.
Figure 2: Business intelligence process
The ETL process in figure 2 is an important component of the process. It is used to copy data from the operational applications (source databases) to the staging area, from the staging area into the DW, and finally from the DW into a set of conformed data marts that are accessible to decision makers. Each data mart often holds only one subject area, for example finance, sales or marketing.
The ETL software extracts data, transforms inconsistent values, cleanses "bad" data, filters data and loads data into a target database. The scheduling of ETL jobs is critical: should one ETL job fail, the remaining ETL jobs must respond appropriately.
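The extract, transform and load steps described above can be sketched as follows. This is only an illustration in Python of the logic involved, not Accobat's actual ETL tooling, and the record fields (`customer_id`, `amount`) are hypothetical:

```python
def run_etl(source_rows):
    """Minimal ETL sketch: extract rows, cleanse and transform them, load into staging."""
    staging = []
    for row in source_rows:                    # extract: read each source record
        if row.get("amount") is None:          # cleanse: drop "bad" (incomplete) data
            continue
        staging.append({                       # transform: enforce consistent types
            "customer_id": int(row["customer_id"]),
            "amount": round(float(row["amount"]), 2),
        })
    return staging                             # load (here: an in-memory staging list)

source = [
    {"customer_id": "1", "amount": "99.90"},
    {"customer_id": "2", "amount": None},      # inconsistent record, filtered out
]
staged = run_etl(source)
```

In a real pipeline the load step would write to the staging database, and a scheduler would decide how downstream jobs react when one job fails.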
2.7 Problem to Address
The problem manifests as data mismatches in some tables between the source database and the target database (the data warehouse). Each mismatch may cause errors in the reports and undermine the accuracy of the business intelligence data. Accobat therefore needs a solution that helps them reconcile data between the two databases.

Problem statement: How can a testing framework be developed that reconciles the source database and the data warehouse?
Questions to answer in this project:

1. Why is a reconciliation test framework design important to implement?
2. Will the developers be able to automate reconciliation on the database?
3. Can data mismatches be detected and flagged with a warning indicator in the reconciliation process?
4. How can the recon table result be shown online in real time?
5. How will the company Accobat benefit from the test framework?
3 Methods

3.1 Overall Approach
All empirical data collected for this project is based on a qualitative approach: conducting interviews and observing developers. In addition, we use company materials provided by Accobat as well as academic books, articles and research materials. As methodology, we use the iterative model for our project.
3.2 Methodology
The methodology we use in the project is partly plan-driven and partly iterative development, since the requested solution for testing data between the source database and the data warehouse on Accobat's Business Intelligence enterprise platform is new, and iteration matches our working cycles from an academic point of view.

We must plan every phase of the project and carry out the tasks and activities needed in that phase. The main reason we combine planning with the iterative methodology is to reduce the risk in the execution of the project and to help us organize it into a structured and streamlined process. The figure below presents what we have covered in every phase.
Figure 3: Iterative methodology life cycle
Project Planning
We define the project plan, project scope and project introduction, and also the method of work for the project.
Stakeholder Analysis
In this phase, we perform a detailed analysis of the people, groups or organizations that are likely to be affected by, or to have an impact on, the new testing framework system, and we identify what value each stakeholder brings when defining and correcting requirements.
Requirement Engineering
We study the current test setup, how it works and its challenges, and then describe the future test framework by identifying the requirements and problems the new framework is supposed to address and solve. We use rich pictures to capture missing and conflicting requirements, because our stakeholders do not know how the framework should be developed and the requirements need to be well defined. Paper-based models representing the most important aspects (functional and system design) of the test framework were then evaluated by the stakeholders to refine the requirements and capture any that were missing.
Framework Architecture Design
In this phase, the requirements are translated into designs, i.e. the full design architecture of the test framework. In our case it consists of a context view, a functional view, the information architecture and the technical architecture, together with the architectural perspectives the design follows.
Evaluation/Discussion
Results from the design phase are continuously delivered to the stakeholders for evaluation, to make sure that the testing framework meets their requirements and business needs.
Document Delivery
In this phase, a design description document is delivered to Accobat, the project company, and the university report is handed in, as this is the project report for the 3rd semester of the Master's in Computer Science and Informatics at RUC.
3.3 Alternative Approach
The development of the framework architecture description is mainly carried out in the initial phase of software development, to help define and agree on scope, agree on and validate requirements, and provide the technical leadership for the decisions that will shape the architecture of the new system. Architecture definition can therefore also be integrated into a linear approach such as the waterfall model.

Similarly, the iterative model could be an alternative if the architecture definition process itself were iterative; it could then be integrated in the early analysis phase of the iterative model.
When using an agile approach, one should consider several factors to ensure its success: the architecture work should be delivered incrementally; the aim should be to create good-enough documents that can be delivered as soon as they are usable, rather than waiting for them to be polished to perfection; and every deliverable should have a customer who understands and agrees with the value it brings them.
3.4 Working Procedures
The working procedures are as follows:

- Obtaining resources
- Having project milestones and project status updates
- Performing our approach to the project
- Creating a communication channel with Accobat (interviews and meetings)

Based on these procedures we were able to break our workload down into several iterations. Note that the iterative process is not shown in the figure; however, in each phase we adjust our solution according to feedback from the previous phase, following the principles of the iterative methodology.
Figure 4: Milestones and meeting summary
3.5 Performing Interviews
To identify the needs and requirements for the testing framework, and to communicate with stakeholders, our approach is to perform several interviews with them. We made appointments with the stakeholders in advance, then showed and illustrated our milestones and project status after each iteration. The meeting appointments are shown in the working-procedures figure above.
We arranged five meetings with Accobat stakeholders throughout the semester and set our milestones beforehand (for meeting details, see Appendix: Meeting notes). In each stage of the project we act on the outcome of the previous meeting, then present our milestone to the stakeholders at the next one. We believe this is one of the fastest ways for us to form our action plan: because the employees of the company are not flexible with the meeting schedule, we want to fully utilize the time between meetings and get feedback as soon as they are held.
3.6 Stakeholder
After the first meeting with one of the Accobat managers, we were able to collect information about who the key stakeholders in the company are and which of their concerns are important.

Stakeholder list:
- Solution responsible
- Developers
- Accobat consultant
- Commercial responsible
- Delivery responsible

Developers: They carry out the actual implementation of the test framework and confirm that it meets the business requirements.

Solution responsible: Also the main project supervisor, controlling the budget and communicating with the other key stakeholders when resources are needed.

Commercial responsible: In charge of promoting and publishing the developed products.

Delivery responsible: Plays a key role when the system is ready to be delivered to a client.

Consultant: An experienced developer who can provide technical information regarding BI in general.
4 Theory

4.1 Books
Business Intelligence Guidebook (2015) by Rick Sherman

We use the Business Intelligence Guidebook by Rick Sherman to guide us through the business intelligence process and to understand the BI architecture framework. The book explains how to develop a cost-effective business intelligence solution and which capabilities and skills BI needs to meet its challenges, since all business intelligence frameworks are designed to accommodate expansion and renovation based on evolving requirements. For our project we mostly refer to chapter 4, on the BI architecture framework, which presents the key components and elements that compose the design architecture of BI: information architecture, data architecture, technical architecture and product architecture. In this project we mainly cover the design architectures we consider important for our own architecture; see the details below.

• Information architecture: defines the purpose of the project, the business processes and analytics, who will have access, and where the data is and how it will be integrated and consumed.

• Technical architecture: defines the technologies used in the implementation of the business intelligence solution that fulfil the information and data architecture requirements.
Software Systems Architecture by Nick Rozanski and Eoin Woods

We have used this book as a guideline on how to develop a software system architecture in general, in terms of making designs for our solution and of all the areas involved in having a good software architecture analysed and developed. This includes working with stakeholder analysis and stakeholder mapping, how to design, and the importance of views and their perspectives for software architecture design. Below are the key views we have used in the project.

• Context view: describes the relationships, dependencies and interactions between the system and its environment.

• Functional view: defines the system's functional elements, the responsibilities of each, the interfaces they expose and the interactions between elements.
5 Analysis

5.1 Stakeholder Analysis
Here we use stakeholder mapping to present the influence of each stakeholder, how much their participation is required and where it is most needed; we then categorize them in a stakeholder matrix with detailed approaches. Below we have outlined the stakeholders that have a key role to play in our project.
Figure 5: Stakeholder map
For the developers in Accobat, it is certain in this case that they have the highest priority for participation, because they need to carry out and polish the actual implementation of the new testing framework and communicate with their supervisors in Accobat.

The Accobat solution responsible, as this project's supervisor, has control and a high-level vision regarding database access, budget calculation and buying external consultants, and communicates with all the other stakeholders.

The Accobat commercial responsible and delivery responsible are in control of promoting and publishing the product once it is done. They have a high impact on the testing framework project, but participate little until the project is finalized.

The Accobat consultant, as an assistant to the project, has high influence as well as a high need for participation. We acknowledge that they are experienced developers who can provide ample information to the team in the IT field.
Stakeholder Matrix – based on the stakeholder analysis and mapping
By categorizing their strengths and demands, we were able to derive our approach to the different stakeholders: working on the system while communicating with the right stakeholder.
The matrix records, for each stakeholder: how will they benefit, their attitude towards the project, how they can contribute, and how we deal with them.

Delivery responsible
- How will they benefit? Assurance that a good product is delivered.
- Attitude towards the project? Working together with the developers on the project.
- How can they contribute? Promoting the developed product to clients.
- How do we deal with them? Keep the project on time.

Commercial responsible
- How will they benefit? Can sell and market the solution.
- Attitude towards the project? Learning and training on the test framework.
- How can they contribute? Involvement in the project lifecycle.
- How do we deal with them? Involve them in all communication.

Solution Manager
- How will they benefit? Control of all project planning.
- Attitude towards the project? Active communication with both internal and external teams.
- How can they contribute? Getting involved in all project planning.
- How do we deal with them? Keep reporting our project status and getting feedback.

Developer
- How will they benefit? Development experience on solutions.
- Attitude towards the project? Information sharing and support.
- How can they contribute? System testing, debugging and maintaining the solution.
- How do we deal with them? Involve them in identifying challenges in project implementation.

Consultants
- How will they benefit? A BI testing solution.
- Attitude towards the project? Project involvement.
- How can they contribute? Identifying solution risks.
- How do we deal with them? Get their feedback and input.
We went through the stakeholder analysis to get an overview of the different visions the stakeholders hold, so that we can balance and document the requirements, work out what we can propose as a solution that benefits the current system, and identify the stakeholders we can later approach for information, depending on the theme of our project.
5.2 Requirement overview
Based on the stakeholder analysis in the previous section, where we made sure to go through all key stakeholders' concerns, we were able to locate concerns from different stakeholders that connect to each other. We concluded that the direction of our project is to propose a testing framework for the Accobat BI procedure. As we already addressed in chapter 2, there are data mismatches in some of the datasets between the source database and the target database, and the majority of the stakeholders directly or indirectly require these data discrepancies to be eliminated. Examples:

The commercial responsible finds it quite hard to promote products with data discrepancies to clients; clients will remain doubtful towards the product if a single mistake is spotted. The delivery responsible supervises the project delivery time and values it the most; when a project is finished without data reconciliation, developers have to test it manually with a reduced dataset, which is extremely inefficient and informal and may as a result delay delivery. To fulfil the purpose of eliminating data discrepancies, we went through the research work listed in chapter 3 and found that data reconciliation is needed to test the data gathered from the client's source database against the target database (the Accobat data warehouse), meaning that we target the entire datasets before they are further processed into Accobat's data reporting system. To formalize this testing process, we will develop a testing framework. Our goals are listed below.
Goals of the reconciliation test framework:

• To ensure the reliability and accuracy of the data.
• To report data mismatches.
• To make the test framework automated.
• To provide real-time information on the reconciled data.
• To support the business intelligence process with correct data.
• To be integrated into the BI debugging process.
• To be sold as a solution by Accobat to their clients.
Jannick, the solution manager, provided us with all the information after communicating with the other stakeholders. Since he is the key person controlling all project planning, we were able to balance the stakeholders' concerns and formulate detailed requirements from the information received from him. Below is the final requirement list that we summarized.
All eight requirements are owned by Jannick, the Solution Manager:

Requirement No. 1: Reconciliation of master data between the source and target database based on row count.
Requirement No. 2: Creation of threshold indicators for notification: 100% match = green, mismatch of 5% or more = yellow, mismatch of 10% or more = red.
Requirement No. 3: A comment can be made on each reconciliation for follow-up.
Requirement No. 4: Provide real-time information on the recon table using a reporting tool such as Power BI.
Requirement No. 5: The test framework works automatically by running scheduled jobs on the system.
Requirement No. 6: The solution is coded in SQL using Microsoft SQL Server.
Requirement No. 7: Implemented in a dynamic way to support performance.
Requirement No. 8: The test framework prints out a CSV file automatically.
We have listed eight detailed requirements for the testing framework, so that it can identify and indicate data discrepancies and notify the users and the employees in Accobat.

Requirement No. 1 ensures no data loss in terms of quality (master data between the two sets of tables) or quantity (row count).

Requirement No. 2 defines the indicators describing the different situations and levels of severity of data discrepancies.

Requirement No. 3 supplies the detailed information the user should receive with the testing results.

Requirements No. 4, No. 5, No. 7 and No. 8 require the proposed testing design to complete the data reconciliation within a very short time span, and to do so automatically.

Requirement No. 6 fixes the BI platform for the data reconciliation.
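The threshold indicators of Requirement No. 2 could be sketched as the following classification logic. Note that this is a Python sketch for illustration only (the solution itself is required to be SQL on SQL Server), and the exact boundary handling is our interpretation, since the requirement does not say how mismatches below the yellow threshold are coloured:

```python
def indicator(source_rows: int, target_rows: int) -> str:
    """Classify a reconciliation result as green/yellow/red (thresholds assumed)."""
    if source_rows == target_rows:
        return "green"                 # 100% match
    mismatch = abs(source_rows - target_rows) / max(source_rows, 1)
    if mismatch >= 0.10:
        return "red"                   # mismatch of 10% or more
    if mismatch >= 0.05:
        return "yellow"                # mismatch of 5% or more
    return "yellow"                    # assumption: smaller mismatches are also flagged yellow
```

For example, `indicator(100, 100)` yields green, `indicator(100, 94)` yellow, and `indicator(100, 85)` red.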
5.3 Challenges
For this particular testing framework, we reflected on our goals and requirements and identified a few challenges, in order to overcome them or find a way to compromise.

• Limited information: According to the information we gathered, testing frameworks for business intelligence are usually not open source; many companies, including a consultancy like Accobat, do not have a fully developed framework for data reconciliation testing. It has become the norm that consultancy companies build their own testing framework. Building one is therefore challenging, since the project points to a relatively blank area for us to explore.
• Limited resources: For this project alone, designing and building a framework, we do not have enough support from the company at the business-project level, nor enough meetings with the stakeholders in the company.

• Limited knowledge: Designing and building this framework requires knowledge of back-end database programming languages, software architecture principles and business intelligence enterprise frameworks in general; in fact, we have to learn a great deal while building it. We also see this challenge as one of our academic goals.
5.4 Future Testing Framework We made a sketch of the testing framework we are going to develop. In order to execute data reconciliation testing, we need to join the master data from the data warehouse and the staging database during the ETL process, then calculate the data discrepancies. The testing framework involves the whole BI procedure in Accobat, because the master data being tested comes from the very beginning of the process (the staging database holds the same data as the source database). This sketch settles our starting point for the design phase.
Figure 6 Sketch of testing framework
5.5 Introduction to Data Reconciliation Each time data is exchanged between the source and target systems, there is a risk that the data sent may not be consistent on reception, or that the transmission system may error out, causing data discrepancies that must be resolved to ensure data consistency. Ensuring the consistency of data is an important aspect of data quality in BI. The term “reconciliation” is defined as the comparison of data in the source systems to the data in the DW, checking whether summary values (e.g. record counts or total values) and detail data, such as a particular fact table row in the DW, are the same as in the source system (Rainardi, 2008).
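Rainardi's definition can be illustrated with a small sketch. The Python fragment below is a hypothetical illustration (the function name, table names and counts are ours, not Accobat's); it compares summary values (record counts) per table between a source system and a data warehouse:

```python
# Hedged sketch: compare summary values (record counts) per table,
# following Rainardi's definition of reconciliation. All names and
# numbers are invented for illustration.

def reconcile_counts(source_counts, dw_counts):
    """Return tables whose record counts differ between source and DW."""
    mismatches = {}
    for table, src_count in source_counts.items():
        dw_count = dw_counts.get(table, 0)
        if src_count != dw_count:
            mismatches[table] = (src_count, dw_count)
    return mismatches

source = {"Customer": 1200, "Invoice": 5400}
dw = {"Customer": 1200, "Invoice": 5395}
print(reconcile_counts(source, dw))  # {'Invoice': (5400, 5395)}
```

An empty result means the summary values reconcile; detail-level checks (comparing individual fact rows) would be a second, more expensive step.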
6 Design 6.1 Design Summary We have summarized and analysed the information from chapters 4 and 5; we now officially follow the requirements we have and start the design of the framework.
As our sketch in chapter 5 showed, we have a general idea of where the testing framework will intervene in the BI process. We will design a context view (Rozanski & Woods, 2012, page 66) for the testing framework and its external environment (external entities, stakeholders), showing how they work together and the relationships between them; this view follows the principles and philosophies of architectural viewpoints (Rozanski & Woods, 2012, chapter 16).
To achieve good data reconciliation, it is necessary to verify that the data in the target database is as expected after the loading process. This confirms that the BI process loads correct and matching data from the source database into the target database.
The main method of testing data reconciliation is master data transaction reconciliation between the source and target databases at the table level, ensuring that the total row count of a table in the source database exactly matches the total row count of the same table in the target database.
In that case, it is easy to see whether the number of records is the same on both sides. If the totals are equal, no data mismatch is found: the match is 100%, and the reconciliation shows a green indication.
When a data mismatch is found, a warning message notifies the user of the issue. If the mismatch is less than 10 percent, a yellow indication is shown; if it is more than 10 percent, a red indication is shown, which means the situation is worrying and action needs to be taken as soon as possible.
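The indicator logic above can be sketched as a small classification function. This is a hedged illustration, not Accobat's implementation: the 10 percent threshold comes from the text, while the function name and the treatment of an empty source table are our own assumptions.

```python
# Hedged sketch of the green/yellow/red indicator described in the text:
# green for a 100% match, yellow for a mismatch below 10%, red for 10%
# or more. Handling of source_rows == 0 is our assumption.

def indicator(source_rows, target_rows):
    if source_rows == target_rows:
        return "green"   # 100% match, no discrepancy
    if source_rows == 0:
        return "red"     # no ratio can be computed; treat as severe
    mismatch_pct = abs(source_rows - target_rows) / source_rows * 100
    return "yellow" if mismatch_pct < 10 else "red"

print(indicator(1000, 1000))  # green
print(indicator(1000, 950))   # yellow (5% mismatch)
print(indicator(1000, 850))   # red (15% mismatch)
```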
Developers can test the reconciliation at any time by running the compare stored procedure on the target database, and can also add a comment on the reconciliation when needed. These functions are elaborated further in our functional view.
With the views we have, we will develop an information architecture following the principles and philosophies of architectural blueprints, focusing on two aspects of our solution in the business context:
• Where the data is, where it will be integrated, and where it will be consumed in analytical applications
• Why the BI solution(s) will be built: what causes the problem (data discrepancies) and where
Lastly, we have the technical architecture, which defines the technologies at the different stages of the process that are used to implement and support a BI solution fulfilling the information and data architecture requirements.
Our design for the testing framework will fulfil some of the stakeholders' requirements by achieving the perspectives (Rozanski & Woods, 2012, chapter 24) along with the designs of views and architecture.
Architectural Views and Architectural blueprints
An architectural view is a way to portray those aspects or elements of the architecture that are relevant to the concerns the view intends to address and, by implication, the stakeholders to whom those concerns are important (Rozanski & Woods, 2012).
An architectural blueprint for a Business Intelligence process is similar to a blueprint for a building in real life: different types of blueprints describe different aspects of an architecture, and they are helpful for developing or adjusting a framework in a Business Intelligence environment.
6.2 Context view Here we demonstrate our testing framework design inside a context view. The context view of the system describes the relationships, dependencies and interactions between the system and its environment, which in this case refers to Accobat employees and the Accobat Business Intelligence process.
As we mentioned before, the data from the source database is extracted, transformed and loaded into the staging database temporarily, and afterwards it is loaded into the data warehouse. Data discrepancies are expected to occur here, and they will affect the rest of the system: data mining, front-end reporting, etc. The data is consumed by SSAS in order to create the reporting OLAP cube, and at the end-user side the data is visualized with the Power BI tool to produce financial reports and dashboards. The developer supports the data mining from the data warehouse to SSAS, while the Solution Manager, Delivery responsible and Commercial responsible keep track of the reporting system. With the recon table integrated into the system, some of the interactions between the process and the stakeholders change. First of all, as an entity the recon table joins the two databases (staging DB and DW); after that, the developer supervises the recon table and adjusts and fixes the data discrepancies illustrated by the indicators in the testing results. Throughout the process, the solution manager keeps track of the reporting system, as well as the testing results from the recon table, on a daily basis, to avoid the severe business mistakes that data discrepancies could cause.
Figure 7 Context View for testing framework
6.3 Functional view The functional view of the testing framework defines the system's functional elements, the responsibilities of each, the interfaces they expose, and the interactions between the elements. Taken together, this demonstrates how the system will perform the functions required of it.
In the functional view of the testing framework, functions connect the elements to one another. For example, the Recon Table calls the Compare Table when it needs to compare and locate data discrepancies, and the Compare Table in turn calls other entities to fulfil the request; these connections are the critical functions that represent the framework elements and their responsibilities.
Figure 8 Functional View
Below is an overview of functional elements and their properties.
Functional Elements
Element Name Recon Table
Responsibilities Showing testing results and comments, as well as an overview of the controllable elements that users such as the Developer can interact with.
Inbound
Outbound Compares data discrepancies via the Compare Table
Marks discrepancies with the Indicator
Element Name Indicator
Responsibilities Showing testing results with different degrees of data match for every piece of detailed information in the compare table
Inbound Whenever the Recon Table asks to mark discrepancies
Outbound
Element Name Compare Table
Responsibilities Gathering and joining the two temp data tables
Inbound When the Recon Table calls the compare function, it receives the signal to create the two temp tables
Outbound Calls both the Data Warehouse temp table and the Staging temp table
Element Name Temp Datawarehouse Table
Responsibilities Receiving properties such as table names, row counts and table numbers from the data warehouse, forming a temp table
Inbound When the Compare Table needs to join the temp Data Warehouse table with another table
Outbound Gets properties from the Data Warehouse database
Element Name Temp Staging Table
Responsibilities Receiving properties such as table names, row counts and table numbers from the Staging database, forming a temp table
Inbound When the Compare Table needs to join the temp Staging table with another table
Outbound Gets properties from the Staging database
Element Name Datawarehouse Database
Responsibilities Providing information for a temp table
Inbound When the Temp Table needs to retrieve data
Outbound
Element Name Staging Database
Responsibilities Providing information for a temp table
Inbound When the Temp Table needs to retrieve data
Outbound
6.4 Information architecture
Figure 9 Information architecture of testing framework
The diagram shows the information architecture which includes the processes, external entities and decisions made.
- Client is the external entity (a company) that needs a Business Intelligence solution from Accobat.
- Source database is the database at the client end, used by the client for storing data; the data can be in any format. It is operated by the client.
- Staging database (extract): data from the source database is extracted and loaded into the staging database; it is a copy of the source data before transformation.
- Data warehouse: data from the staging (extract) database is loaded into the data warehouse, then formatted according to the business logic and needs, and finally processed and loaded into SSAS (SQL Server Analysis Services).
- Developer runs the compare between staging and data warehouse: once data has been loaded into the data warehouse and formatted, the developer can compare the data between the staging and data warehouse databases and see whether there is any mismatch.
- Recon table holds the reconciliation result for both tables, put into a compare table called the recon table. If there is a mismatch in row count, the developer investigates and updates the business logic or the data warehouse; if there is no mismatch, the status is OK. The recon table is accessible online, directly from the database through an online tool, so the client has real-time access. The BI manager and commercial managers get updates on the reconciliation from both the table and the online reporting tool.
- SSAS cube analyses data using different measures depending on the business requirements; each piece of business logic is processed by a cube, and reports are created from the cubes.
6.4.1 Why data discrepancies arise in the BI process Data discrepancies arise for different reasons; below we outline some of the causes.
• Human error in the logic coding, or programming errors
• Loss of data in the loading process, if there are severe problems
• Inconsistent databases and data
• Problems with the algorithm, where one algorithm may not be suited for a particular task
• Data models in the data warehouse that are too complex, causing data discrepancies
6.5 Technical architecture
Figure 10 Technical architecture for testing framework
Data Sources
Data source systems may contain structured, semi-structured and unstructured data that will be integrated into the BI enterprise framework.
ETL
ETL is used to extract, transform and load data into the data warehouse. The ETL process runs on a daily basis; depending on the business logic or business need, it can also be set up to load in real time. The ETL process uses relational databases and SQL code.
Data Warehouse (SSIS)
The data warehouse uses relational database technology for Business Intelligence and data integration. The data warehouse is where all the database rules are set; this includes data indexing, data partitioning, materialized views, in-memory processing, etc., plus the infrastructural components that support performance, such as hardware, memory, storage and network. In this project we are required to work with the Microsoft integration tool SSIS (SQL Server Integration Services).
SSAS cube
SSAS (SQL Server Analysis Services) is used to analyse data through measures, depending on the business needs and business logic set up in the BI solution. SSAS uses Microsoft SQL Server and relational databases to perform data analysis.
Online report tool (Power BI)
The online report tool presents real-time reports online for clients to access; it is useful for extracting data and producing the reports that management uses for decision-making.
Compare table
The compare table is created by SQL stored procedure code that runs against both the staging database and the data warehouse.
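The join performed by that stored procedure can be sketched in a language-neutral way. In the real framework this would be T-SQL; the Python below is a hedged illustration of the logic only, and all table names, column names and counts are our assumptions.

```python
# Hedged sketch of building the compare (recon) table: a full outer
# join of per-table row counts from the staging temp table and the DW
# temp table. The actual implementation would be a T-SQL stored
# procedure; names here are hypothetical.

def build_compare_table(staging_temp, dw_temp):
    """staging_temp / dw_temp: dicts mapping table name -> row count."""
    rows = []
    for table in sorted(set(staging_temp) | set(dw_temp)):
        stg = staging_temp.get(table)
        dwh = dw_temp.get(table)
        rows.append({
            "table": table,
            "staging_rows": stg,
            "dw_rows": dwh,
            "status": "OK" if stg == dwh else "MISMATCH",
        })
    return rows

compare = build_compare_table({"Tenancy": 310, "Invoice": 98},
                              {"Tenancy": 310, "Invoice": 97})
for row in compare:
    print(row)
```

Using a full outer join (the union of table names) means a table missing from either side still appears in the recon result as a mismatch, rather than being silently dropped.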
6.6 Performance and Scalability Architectural Perspective An architectural perspective is a collection of architectural activities, tactics, and guidelines that are used to ensure that a system exhibits a particular set of related quality properties requiring consideration across the system's architectural designs (Rozanski & Woods, 2012, page 72). Performance and scalability define the ability of a system to execute predictably within its mandated performance profile and to handle increased processing volumes; its concerns include response time, throughput, peak load behaviour, etc.
The test framework compares row counts between the two databases (source and target), and its performance has to be reliable and accurate. The source database contains about 700 tables, which are loaded into the data warehouse, the target database in this case. Stored procedures are developed to compare the two databases and put the result into a compare table (the recon table). The test framework solution is to be implemented dynamically, so that all reconciliation testing on the databases is automated, and the recon table is accessible online in real time once the reconciliation procedure has executed.
In addition, an automated job will be set up in the system, executing every midnight after data has been loaded from the source database into the target database. The compare stored procedure then executes right afterwards, to check whether the row counts of the tables match. This helps developers investigate in good time, without first having to run the stored procedure when they come in in the morning.
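The last step of that nightly job, exporting the recon result as a CSV file (requirement No.8), can be sketched as follows. This is a hedged illustration: in practice the scheduling would be done by a SQL Server Agent job, and the column names and function name here are our assumptions.

```python
# Hedged sketch of the automated CSV export (requirement No.8): after
# the nightly compare runs, serialize the recon rows to CSV. A real
# job would write a file and be scheduled by SQL Server Agent; the
# column names are assumptions.
import csv
import io

def recon_to_csv(recon_rows):
    """Serialize recon rows to CSV text; a real job would write a file."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["table", "staging_rows",
                                             "dw_rows", "status"])
    writer.writeheader()
    writer.writerows(recon_rows)
    return buf.getvalue()

rows = [{"table": "Invoice", "staging_rows": 98, "dw_rows": 97,
         "status": "MISMATCH"}]
print(recon_to_csv(rows))
```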
Overall, the design of the testing framework fulfils the requirements by achieving the performance perspective. Since dynamically programmed SQL queries and stored procedures process the comparison of the datasets, and execution is automatic every midnight after data has been loaded by the ETL process, the increased workload and processing volumes should remain in a controllable range and the response time of the testing should be consistent.
7 Discussion Reflecting on what we designed according to our analysis and findings: first, we located the stakeholders and gathered their concerns through the solution manager, and our detailed requirement overview was created based upon that. Secondly, we planned our approach by making a sketch and got detailed requirements from meetings. Lastly, we designed blueprints and views for the corresponding audiences.
The goal of the design is to achieve the requirements that were given to us. We believe that most of the requirements are covered by the design; a few technical requirements require actual implementation, which we might work on in the future.
As mentioned in the design chapter, the philosophy and principles we followed are from Rozanski & Woods, 2012. For visualization we used UML for our diagrams, an efficient way to cover a wide portion of software development efforts, including agile practices. For example, our context view uses a class model to represent the environment, the testing framework, and the relationships between the stakeholders and the system; our functional view is built as a functional structure model, representing the main functions that support the goals of this testing framework: to report data discrepancies, to support the Business Intelligence process with correct data, etc.
In our technical architecture, we decided to use Microsoft SQL stored procedures and a dynamic implementation, executed every midnight after loading data from the client database into the data warehouse. This decision alone fulfils several requirements, such as performance time, because of the way SQL procedures and their stored logic handle the data.
We did not cover requirements such as privacy and security; we acknowledge how important they are, but we did not need to focus on those aspects for now because they are not the main concerns of the stakeholders. We did cover the architectural perspective of performance and scalability as a way to handle the inevitable increase in data workload, since clients might have larger databases or more databases that need to be extracted from.
8 Conclusion Looking back at the problem we intended to solve in the Business Intelligence process, we dedicated our project specifically to the keyword: data discrepancy.
We introduced the problem at the very beginning: the whole business process depends on the accuracy of the reported data, which comes from the data warehouse at Accobat. The stakeholders' troubles are largely related to data discrepancies between the source database and the data warehouse, and the clients would lose trust in Accobat if even a single data mismatch were spotted. This alone hurts the reputation of the company, not to mention the delayed deliveries and the duplicated, manually executed workloads that consume human resources.
Our proposed solution requires minimal resources to achieve, and we have tried to design a framework that fits the requirements centered around the problem we formulated in the introduction chapter.
As a result, for the problem we introduced in our project, the summary is shown below:
Problem statement: How do we develop a test framework that will reconcile two databases, the source database and the data warehouse?
• Why is the reconciliation testing framework design important to implement?
• What will the company Accobat gain from the test framework?
Because this involves most of the stakeholders and their concerns overall, the difference between having a testing framework and not having one directly impacts the company's benefits and client base. We made the effort to produce a feasible framework design that can be implemented in the future; the reasons why data discrepancies occur are listed in the information architecture, and the testing framework will prevent them from doing damage once it is online.
• Will the developers be able to automatically see the reconciliation on the database?
• Is a data mismatch detected and a warning indicator given in the reconciliation process?
Given the functions and indicators we designed in the functional view, and the discussion from the performance perspective, we strongly believe that with the testing framework's technical solution the developers will be able to see the reconciliation result on a daily basis; the testing result will be fast to navigate thanks to the stored procedures and the logic behind the reconciliation table. In addition, warnings and comments from the indicators will show the severity of the data discrepancies.
To sum up from an academic point of view, we have taken on the challenge of completing a project with the following steps: studying the problem, gathering information, forming the approach, and developing the solution. We are satisfied with the amount of knowledge we obtained while doing this project, especially in the early stage, where we had to absorb information in order to comprehend the big picture. In practice, working through this project, we have significantly developed our skills in applying architectural design, evaluation and project work model techniques to solve real-world problems.
We would like to take this opportunity to thank our supervisor for the review sessions that broadened our horizons, and Accobat for providing the experience of a collaboration project.
9 Perspective In the future, we expect to develop this testing framework further as part of our Master's thesis, and to deploy the solution for Accobat so it can be integrated into their BI enterprise platform.
Accobat can also sell it as a service to their clients, creating revenue for the company and adding value for the clients by controlling data discrepancies in their databases. Accobat will employ someone to work on the project for further development.
Furthermore, the whole solution will save time and money for the company, as it will be implemented in an automated way, with no more manual testing on the databases. We may research and gather automated testing theory for the actual technical solution. Lastly, depending on how the requirements change over time, we may want to add more features to the testing framework and learn more about Business Intelligence and testing in general while developing.
10 References
• Rozanski, N. and Woods, E., 2012. Software Systems Architecture: Working with Stakeholders Using Viewpoints and Perspectives, 2nd Edition. Addison-Wesley.
• Poughkeepsie Center, 2000. Business Intelligence Architecture on S/390 Presentation Guide, First Edition.
• Rainardi, V., 2008. Building a Data Warehouse: With Examples in SQL Server. Apress.
11 Appendix
11.1 Glossary
Business Intelligence (BI): Business Intelligence is the process of collecting raw data or business data and turning it into information that is useful and more meaningful.
Extract-Transform-Load (ETL): ETL is a process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse.
Data Warehouse (DW): A data warehouse is a database that is designed for query and analysis rather than for transaction processing. Data warehousing is one of the BI solutions that helps convert data into useful information by providing multiple dimensions along which to study the data, for the purposes of generating informative dashboards, reporting, etc. Top management can therefore take quick and accurate decisions on the basis of statistics calculated using this data.
Microsoft SQL Server Analysis Services (SSAS): Delivers online analytical processing (OLAP) and data mining functionality for business intelligence applications. In simple terms, SSAS creates cubes using data from data marts / the data warehouse for deeper and faster data analysis.
Dimension Table: Stores descriptions of the characteristics of a business. A dimension is usually descriptive information that qualifies a fact. Dimensions do not change, or change slowly over time.
Data Staging Area (DSA): The data warehouse staging area is a temporary location where data from the source systems is copied. A staging area is mainly required in a data warehousing architecture for timing reasons. In short, all required data must be available before data can be integrated into the data warehouse.
Data Mart (DM): A data mart is the access layer of the data warehouse environment that is used to get data out to the end users. The data mart is a subset of the data warehouse and is usually oriented to a specific business line or team.
11.2 Meeting notes with Accobat from audio and written notes
11.2.1 Meeting with Accobat - 21st September 2017 Company presentation by Solution Manager Jannick
• Welcomed us to the company office
• Told us that he will be the main contact person for the project and the supervisor for the solution.
• He gave a company background introduction.
• The company turned 14 years old in May this year, has 36 employees, and has over 115 clients in Denmark.
• Their core business is Business Intelligence based on the Microsoft BI platform.
• They implement solutions for different sectors, e.g. real estate and high schools.
• They were one of the best performing companies in 2016 and 2017.
• The CEO is the sole owner of the company.
• Their biggest client is Ramboll Engineering, where they provide a BI solution for about 15,000 employees.
• An introduction to the need for data quality in a BI context. Company Core Business
• Commitment to what they do; the client comes first.
• Respect for each other in the company (staff).
• Engagement is key in everything the company does.
• Project method at the company: SCRUM / Agile. Other add-on products for business consultants:
• Front-end products:
• Jedox and Prophix: both tools are used for budgeting and financial management by companies.
• Power BI is the Microsoft tool used for dashboards and graphical reports.
• Data warehouse: ETL (extract, transform, load), working with dimensional data and transformations. Company goals for the next 3 years:
• Turn data into insight, actions and impact.
• New sales for enterprises and Jedox: want to get more than 10,000 customers.
• Have more business frameworks.
• Company growth.
• Turnover of 50 million and 50 employees in 2020. Introduction to data warehousing (technical perspective)
• It is based on Kimball's approach to data warehousing.
• Made up of a business process matrix.
• Tells us what we need to measure.
• This is then broken down into fact tables.
• Dimensions are present in the data warehouse system to support the measures.
• The data warehouse must be able to work with different platforms and different data formats (data sources).
• Data marts are subsets of the DW, one for each business logic.
• A "star schema" is used to map the actual relationships among tables in the DW.
• Actual implementation of the design: Microsoft data warehouse.
• Change management of the project is important, and part of the service provided by Accobat.
• Project management is part of DW development.
• Having meetings and workshops with clients is key to success. Systems used for operations at the company: a ticketing system enables task-oriented communication among employees; both customers and employees can create tickets for tasks and change requests from clients.
• Some clients can send an email to request a change. The TimeLog system is used by employees to register the time they have spent on a request, later used to bill the client. Project for students to work on:
• Data reconciliation between the source database and the target database. Stakeholders: o BI Department o Service Desk o Customers "UNIK"
11.2.2 Meeting with Accobat - 4th October 2017 Testing issues at the company: The company has challenges when it comes to testing; they would like to have testing on most of their products. Below is the list of testing they outlined for us:
• To make sure two numbers from the source and target data are the same (very much needed)
• Sample test (done)
• Functional testing(done)
• Correction (Reconciliation), full population (needed)
• Build test (ETL here) – not interesting to them
• Deploy test (done)
• Load test(done)
• Tests can be used internal or external. (needed)
• Internal testing process (needed)
• Continuous integration(needed)
• Developer testing (needed)
• Manual testing (not needed) Types of projects
• Database testing
• Integration service testing
• Analysis service testing
• Between SQL query test and MDX query test
11.2.3 Meeting with Accobat – 18th October 2017 (Thomas – developer) For the client company: They got the source code from previous development.
• No API.
• Example for real estate companies: the system used by administrators offers a way to check the "tenancy" of their own buildings.
• As far as Thomas knows most of the data checks are done manually by communicating with clients.
• As a developer he gets an idea of the accuracy of the data as soon as he looks into the tables in the databases (row counts, etc.).
• From his point of view, the automated implementation of data reconciliation is difficult.
• It is not necessary to check whether the values in the tables are correct; the most important part of reconciliation testing is to identify whether the total row count is correct.
The current client we are studying in our semester project is a standard solution at Accobat. Problem with testing: as a developer, Thomas told us that clients do not know how or where to start testing; they only know that the data presented in the system should be accurate and should be tested beforehand. Using Power BI, filtering can be used to show what to test, and the customer can send screen dumps to check. Selecting the scenarios that are important comes from the clients; the key thing is that Accobat is able to see what matters most to the clients and show it to them at the strategic and operational levels. For the project we are studying at Accobat, the source database has around 700 tables; the data warehouse uses only a subset of the tables in the reporting system. ETL: it takes a couple of hours to extract and load data every night, and this process should be finished before the next morning.
11.2.4 Meeting with Accobat - 13th November 2017
• Clients have high expectations for the design of the reporting system; if there is a single data mismatch or data loss, it will heavily damage Accobat's efforts on the project.
• Important data, such as financial transactions: the total number of transactions is not something the client knows, for example.
• Data quality over integration.
• Permission to extract data from the source database is straightforward: Accobat either gets all rows of data or nothing.
• Manual testing starts with a discounted amount of sample data, then proceeds to the whole dataset.
• It will be relatively easy to build the compare table solution on the Microsoft SQL platform.
• Some of the business rules are hard to test, and it is acceptable to skip them.
- Data reconciliation: take the source data and check the number of rows in the target database.
- As a technical solution we can hardcode for ease of use, with indicators such as a success colour and a warning colour for the different testing results.
11.2.5 Meeting with Accobat - 4th December 2017 (Wolfgang - Senior Consultant) Wolfgang is an Accobat consultant for Business Intelligence in general.
We gathered information from him; since he is a technical stakeholder, his opinion is valuable for our design. It is listed below:
• Currently, the data comparison is completed manually between the source database and the target database (the Accobat data warehouse).
• Data changes within a very short time span in the client system.
• The logic for the data transfer from source to target consists of two stages; the current client has approximately 600 tables, and the extraction requires only a part of them.
• In his experience, comparing data between the source and the data warehouse is quite difficult.
• MDX as a language can also be used for the data reconciliation logic.
• Comparing data against the extract layer.
• Writing SQL procedures with C# to extract data from the sources.
• Wolfgang understands that the solution manager wants to compare the extract database (staging database) with the data warehouse.
• The data that should be compared comes from the tables inside the databases.
• The client side has operational systems, and the data that will be extracted comes from there.
• The extract (staging) database has the business logic that defines the design of the tables.
• Transferring data into the bus matrix.
• Tables in the data warehouse and the source database are different, since the data warehouse has different logic.