
International Journal of Mechanical Engineering and Technology (IJMET)
Volume 8, Issue 10, October 2017, pp. 608–616, Article ID: IJMET_08_10_066
Available online at http://www.iaeme.com/IJMET/issues.asp?JType=IJMET&VType=8&IType=10
ISSN Print: 0976-6340; ISSN Online: 0976-6359
© IAEME Publication, Scopus Indexed

PRE HADOOP AND POST HADOOP VALIDATIONS FOR BIG DATA

Nachiyappan .S

School of Computer Science and Engineering, VIT University, Chennai, TN, India

Justus Selwyn

School of Computer Science and Engineering, VIT University, Chennai, TN, India

ABSTRACT:

Big Data is a platform on which everybody wants to gain knowledge, processing and analyzing vast amounts of data with ease. Various applications are available today for processing vast amounts of data and obtaining the intended results from it, and various testing tools are available to test Big Data applications. But no application or tool explains how data is validated before and after processing. Various existing functional and non-functional tests can be performed to assure the quality of the data, so that cost and time can be saved for the processing party. In this paper we propose a set of testing strategies for the various stages of the Big Data analysis process, so that data can be validated before retrieving the vast amount of data from Hadoop.

Keywords: Big Data, Functional testing, Non-functional testing, Hadoop, quality.

Cite this Article: Nachiyappan .S and Justus Selwyn, Pre Hadoop and Post Hadoop Validations for Big Data, International Journal of Mechanical Engineering and Technology 8(10), 2017, pp. 608–616. http://www.iaeme.com/IJMET/issues.asp?JType=IJMET&VType=8&IType=10

1. INTRODUCTION

“Large volumes of data, whether structured, unstructured or semi-structured, that cannot be processed with existing traditional DBMS methods” can be processed with the help of a concept known as Big Data. Today, many companies have adopted Big Data to process huge streams of data, since almost everyone owns a device connected to the internet. In processing such vast data, however, most parties care only about how the data is processed and which algorithm is used; hardly anyone checks whether the dataset being processed is valid and whether it can provide the intended result. If the data is acquired from a valid data provider or data collector there is no issue, but suppose one wants to do sentiment analysis and obtains the data by crawling various streams: nowadays a great deal of false data circulates in the form of rumors, and processing such data can produce misleading results [2].

Hence, validating the data is as important as processing it to get the intended result. Big Data testing methods must be implemented alongside the Big Data processing.

2. AUTOMATED TESTING

Testing generally falls into two major categories, based on how it is performed:

• Manual Testing

• Automation Testing

Automation testing is testing performed by writing scripts, which automate the testing process and reduce effort and time through the use of various automation tools. Most automation tools include a record-and-playback feature that generates the script; adding a data pool to the tool automates the testing process for different inputs and generates an output record for each iteration.

Most of the testing strategies in this paper are implemented using Pig scripts. The paper covers the basic testing, which includes functional and non-functional testing.

3. EXISTING TESTING METHODOLOGIES

Most of the existing validations on Big Data cover the testing that can be done on a data set to determine its quality and consistency. Existing testing processes mainly concentrate on the application, but they fail to test the data that is to be processed. Non-functional testing can be performed on any Big Data application to check the reliability of the application. The functional and non-functional tests that can be performed on the data to check its consistency are listed in the proposed validations [1].

Figure 1 Various data sources

4. PROPOSED TESTING METHODOLOGIES:

Testing in Big Data can be performed at three different stages, validating the data in its various forms.

The stages at which testing can be performed are:

• Pre-Hadoop validation.

• Map-reduce validation.

• ETL or Post-Hadoop Validation.

Both functional and non-functional testing can be performed on the data to check its consistency; the validations that can be performed on the data are listed below. Most of the tests are written in Pig Latin.

A. Pre – Hadoop Validation:

Pre-Hadoop validation mainly constitutes testing the data before processing, so that data cleansing can be done to avoid wasteful processing that costs both time and resources. Data cleansing plays a major role in obtaining reliable output.

Big Data systems typically process a mix of structured data (such as point-of-sale transactions, call detail records, general ledger transactions, and call center transactions), unstructured data (such as user comments, doctors' notes, insurance claims descriptions and web logs) and semi-structured social media data (from sites like Twitter, Facebook, LinkedIn and Pinterest). Often the data is extracted from its source location and saved in its raw or a processed form in Hadoop or another Big Data database management system. Data is typically extracted from a variety of source systems and in varying file formats, e.g. relational tables, fixed size records, flat files with delimiters (CSV), XML files, JSON and text files. Most Big Data database management systems are designed to store data in its rawest form, creating what has come to be known as a "data lake," a largely undifferentiated collection of data as captured from the source. These DBMSs use an approach called "schema on read," i.e. the data is given a simple structure appropriate to the application as it is read, but very little structure is imposed during the loading phase. The most important activity during data loading is to compare data to ensure extraction has happened correctly and to confirm that the data loaded into the HDFS (Hadoop Distributed File System) is a complete, accurate copy[14].

Typical tests include:

1. Data type validation:

Data type validation is customarily carried out on one or more simple data fields. The simplest kind of data type validation verifies that the individual characters provided through user input are consistent with the expected characters of one or more known primitive data types as defined in a programming language or data storage and retrieval mechanism[14].
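As an illustration (not the paper's script), a minimal Java sketch of such a check might verify that each raw field parses as the primitive type the target schema expects; the record layout used here is an assumption:

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

// Minimal data-type validation sketch: check that raw string fields parse as
// the primitive types the expected schema defines. The record layout is illustrative.
public class DataTypeCheck {

    // Returns true if the value can be read as a 32-bit integer.
    static boolean isInt(String value) {
        try { Integer.parseInt(value.trim()); return true; }
        catch (NumberFormatException e) { return false; }
    }

    // Returns true if the value can be read as an ISO-8601 date (yyyy-MM-dd).
    static boolean isIsoDate(String value) {
        try { LocalDate.parse(value.trim()); return true; }
        catch (DateTimeParseException e) { return false; }
    }

    public static void main(String[] args) {
        // Hypothetical CSV record: transaction_id, amount, sale_date
        String[] record = {"10023", "499", "2017-10-05"};
        boolean valid = isInt(record[0]) && isInt(record[1]) && isIsoDate(record[2]);
        System.out.println(valid ? "record passes type validation" : "record fails type validation");
    }
}
```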

2. Range and constraint validation:

Simple range and constraint validation may examine user input for consistency with a minimum/maximum range, or consistency with a test for evaluating a sequence of characters, such as one or more tests against regular expressions [14].
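A minimal sketch of a range and regular-expression constraint check; the field names and bounds are illustrative assumptions, not taken from the paper's data set:

```java
import java.util.regex.Pattern;

// Minimal range and constraint validation sketch.
public class RangeConstraintCheck {

    // Assumed constraint: a 6-digit postal code.
    private static final Pattern PIN_CODE = Pattern.compile("\\d{6}");

    // Minimum/maximum range check on a numeric field.
    static boolean inRange(int value, int min, int max) {
        return value >= min && value <= max;
    }

    public static void main(String[] args) {
        int age = 34;
        String pin = "600127";
        boolean ok = inRange(age, 0, 120) && PIN_CODE.matcher(pin).matches();
        System.out.println(ok ? "constraints satisfied" : "constraint violation");
    }
}
```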

3. Code and cross-reference validation:

Code and cross-reference validation includes tests for data type validation, combined with one or more operations to verify that the user-supplied data is consistent with one or more external rules, requirements or validity constraints relevant to a particular organization, context or set of underlying assumptions. These additional validity constraints may involve cross-referencing supplied data with a known look-up table or directory information service such as LDAP.
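A hedged sketch of cross-reference validation against an in-memory look-up set; in practice the set would be populated from a reference table or a directory service such as LDAP, and the country-code example below is purely illustrative:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Minimal cross-reference validation sketch: the look-up set stands in for an
// external reference table or directory service.
public class CrossReferenceCheck {

    // Assumed list of valid codes; normally loaded from a reference source.
    private static final Set<String> VALID_COUNTRY_CODES =
            new HashSet<>(Arrays.asList("IN", "US", "GB", "DE", "SG"));

    static boolean isKnownCountry(String code) {
        return code != null && VALID_COUNTRY_CODES.contains(code.trim().toUpperCase());
    }

    public static void main(String[] args) {
        System.out.println(isKnownCountry("in"));  // true
        System.out.println(isKnownCountry("ZZ"));  // false
    }
}
```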

4. Structured validation:

Structured validation allows for the combination of any number of various basic data type validation steps, along with more complex processing. Such complex processing may include the testing of conditional constraints for an entire complex data object or for a set of process operations within a system.
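A minimal sketch of structured validation, combining field-level checks with a conditional constraint across the fields of one record; the Order type and its rule are illustrative assumptions:

```java
// Minimal structured-validation sketch: basic field checks plus a conditional
// constraint over the whole record.
public class StructuredCheck {

    static class Order {
        String id; double amount; String status; String cancelReason;
        Order(String id, double amount, String status, String cancelReason) {
            this.id = id; this.amount = amount; this.status = status; this.cancelReason = cancelReason;
        }
    }

    static boolean isValid(Order o) {
        boolean basic = o.id != null && !o.id.isEmpty() && o.amount >= 0;
        // Conditional constraint: a cancelled order must carry a cancellation reason.
        boolean conditional = !"CANCELLED".equals(o.status)
                || (o.cancelReason != null && !o.cancelReason.isEmpty());
        return basic && conditional;
    }

    public static void main(String[] args) {
        System.out.println(isValid(new Order("A1", 120.0, "SHIPPED", null)));   // true
        System.out.println(isValid(new Order("A2", 80.0, "CANCELLED", null)));  // false
    }
}
```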

B. Map-Reduce Validation:

MapReduce is the heart of Apache Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The MapReduce concept is fairly simple to understand for those who are familiar with clustered scale-out data processing solutions.

Figure 2 Map Reduce Word Count Process.

Map-Reduce validation constitutes checking the generation of key-value pairs and validating the map-reduce output by applying various business rules.

The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
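For reference, the word-count job of Figure 2 can be expressed against the standard Hadoop MapReduce Java API roughly as follows; this is a generic sketch of the well-known example, not the code used in this paper:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map job: break each input line into tokens and emit (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce job: combine the tuples for each word into a single count.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```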

C. Post – Hadoop Validation

After the map-reduce process is completed, what remains is the Extract-Transform-Load (ETL) validation. This mainly constitutes extracting the output file and loading it into the target output folder. Post Hadoop validation is done before data is moved into a production data warehouse system. It is sometimes also called table balancing or production reconciliation. It differs from database testing in its scope and in the steps required to complete it.

The main objective of Post Hadoop validation is to identify and mitigate data defects and general errors that occur before the data is processed for analytical reporting. Post Hadoop validation is different from database testing or any other conventional testing, and one may face different types of challenges while performing it. As we are dealing with huge volumes of data executed on multiple nodes, there is a high chance of bad-data issues at each stage of the process.

In short, processing Big Data is difficult because it involves huge collections of data executed on multiple nodes, which carries a high risk of bad data and quality issues.

Main challenges are:

• Incorrect data.

• Incomplete or duplicate data.

• Inefficiencies in procedures and business processes.

• The DW system contains historical data, so the data volume is very large and extremely complex, which makes Post Hadoop testing in the target system difficult.

Once the map-reduce process is completed and the data output files are generated, the processed data is moved to an enterprise data warehouse or to transactional systems, depending on the requirement. Issues faced during this phase include incorrectly applied transformation rules, incorrect loading of HDFS files into the EDW, and incomplete data extraction from Hadoop HDFS. Some high-level scenarios that need to be validated during this phase include:

• Validating that transformation rules are applied correctly.

• Validating that there is no data corruption by comparing target table data against HDFS file data.

• Validating the data load in the target system (a reconciliation sketch follows this list).

• Validating the aggregation of data.
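As referenced in the list above, a minimal reconciliation sketch might compare the record count of an HDFS output file with the row count loaded into the target warehouse table; the HDFS path, JDBC URL, credentials and table name below are placeholders, not the paper's environment:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal load-reconciliation sketch: compare HDFS output row count with the
// row count in the target warehouse table. Assumes the Hadoop configuration
// and the warehouse JDBC driver are on the classpath.
public class LoadReconciliation {
    public static void main(String[] args) throws Exception {
        // 1. Count records in the HDFS output file.
        FileSystem fs = FileSystem.get(new Configuration());
        long hdfsRows = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/output/part-r-00000"))))) {
            while (reader.readLine() != null) hdfsRows++;
        }

        // 2. Count rows in the target warehouse table via JDBC (placeholder connection).
        long targetRows = 0;
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://dwhost/edw", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM sales_fact")) {
            if (rs.next()) targetRows = rs.getLong(1);
        }

        // 3. Report whether the load is balanced.
        System.out.println(hdfsRows == targetRows
                ? "Row counts match: " + hdfsRows
                : "Mismatch: HDFS=" + hdfsRows + ", target=" + targetRows);
    }
}
```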

The tests performed at the various stages are listed below:

D. Functional Testing

Functional testing is done to check the dataset consistency by:

• Comparing the data before and after uploading the dataset.

• Comparing the file size of the file on the Hadoop server with that of the same file in the local system.

• Comparing the file format of the file on the Hadoop server with that of the same file in the local system.

• Validating the schema of both files.

• Checking the generation of key-value pairs after the map-reduce process.

• Checking the generated output file.

• Applying business rules to validate the map-reduce process.

• Checking, after the map-reduce process, whether the business rules were applied correctly and the generated output is as desired.

• Checking whether the output file is extracted correctly.

Figure 3 Phases of Testing in Big Data

E. Non-Functional Testing

Non-functional validations of the application are done to check its reliability. The non-functional tests that can be performed on the application are:

• Performance testing

• Security testing

• Reusability testing

• Reliability testing

5. TEST RESULTS AND DISCUSSION

A. Functional Validations

1. Comparing Data

Data is collected from various sources. After collecting the dataset and uploading it into the Hadoop system, and before processing it, both files (the source file in the local file system and the destination file in the Hadoop system) are loaded into Pig storage and compared using the DIFF function; a validation report is then generated that gives a clear picture of which fields have changed. If the replication was done correctly, the output file will contain a pair of empty braces.

Figure 4 Pig Output of Comparing two files

2. File extraction:

The next validation is file extraction: the file inside Hadoop is compared against the source file, and validations are carried out accordingly. Once the file is loaded into Pig storage, the sample application is run in local mode to produce an output file. Comparing this file with the file in the Hadoop system confirms whether the file was extracted correctly.

Figure 5 File Extraction

3. File size comparison:

File size comparison is one of the important validations for detecting whether a file has been modified: if the uploaded file has been altered, its size will differ. This validation uses Java code that retrieves the size of the file from its local location and a Hadoop API call that retrieves the size of the file in Hadoop, then compares the two sizes and returns the result accordingly. In Fig. 6 below they are depicted as file and file1, where file is the source and file1 is the copy stored in HDFS; here both files are the same size.

Figure 6 File Size Comparison
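A minimal Java sketch of this size check, assuming a single-node HDFS at a placeholder address and illustrative file paths (the paper's own code is not reproduced here):

```java
import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal file-size comparison sketch: local source file vs. copy in HDFS.
public class FileSizeComparison {
    public static void main(String[] args) throws Exception {
        // Size of the source file in the local file system (placeholder path).
        long localSize = new File("/home/user/data/file").length();

        // Size of the copy stored in HDFS (file1 in Fig. 6).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");   // assumed single-node setup
        FileSystem fs = FileSystem.get(conf);
        long hdfsSize = fs.getFileStatus(new Path("/user/data/file1")).getLen();

        System.out.println(localSize == hdfsSize
                ? "Both files are the same size: " + localSize + " bytes"
                : "Size mismatch: local=" + localSize + ", hdfs=" + hdfsSize);
    }
}
```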

4. File format validation:

File format validation helps confine the format of the file: the format in which the file is stored in the source system and the format of the file moved into the destination often differ. This validation finds the file formats of the source and the destination, compares them, and reports the result. Java code retrieves the extension of the file in the local system, uses arrayutil.getextension() to get the extension of the file in the Hadoop file system, compares both extensions, and returns a value that validates the file in the Hadoop system. Fig. 7 below depicts the file size and the format: if both file sizes are the same it reports the same size, if the format is the same it reports the same format, and if there is any change it alerts the user that the file size and format differ.

Figure 7 File Format Validation
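A minimal sketch of the format check; FilenameUtils.getExtension() from Apache Commons IO is used here as an assumed equivalent of the arrayutil.getextension() call mentioned above, and the paths are placeholders:

```java
import org.apache.commons.io.FilenameUtils;

// Minimal file-format validation sketch: compare the extension of the local
// source file with the extension of the file stored in HDFS.
public class FileFormatValidation {
    public static void main(String[] args) {
        String localPath = "/home/user/data/file.csv";   // placeholder local path
        String hdfsPath  = "/user/data/file1.csv";       // placeholder HDFS path

        String localExt = FilenameUtils.getExtension(localPath);
        String hdfsExt  = FilenameUtils.getExtension(hdfsPath);

        System.out.println(localExt.equalsIgnoreCase(hdfsExt)
                ? "Same format: ." + localExt
                : "Format differs: ." + localExt + " vs ." + hdfsExt);
    }
}
```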

5. Key Value Pair Validation:

A HashMap function generates the tokenized map of values, which is then compared with the output generated by the map-reduce method.

Figure 8 Key value Pair Validation
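A minimal sketch of this key-value check, assuming word-count output of the form word<TAB>count and placeholder file paths:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Minimal key-value validation sketch: build expected counts from the source
// file with a HashMap, then compare against the map-reduce output.
public class KeyValueValidation {
    public static void main(String[] args) throws Exception {
        // Expected counts, tokenized from the source file (placeholder path).
        Map<String, Integer> expected = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("/home/user/data/input.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                StringTokenizer tok = new StringTokenizer(line);
                while (tok.hasMoreTokens()) expected.merge(tok.nextToken(), 1, Integer::sum);
            }
        }

        // Compare with the counts produced by the map-reduce job (placeholder path).
        boolean match = true;
        try (BufferedReader out = new BufferedReader(new FileReader("/home/user/output/part-r-00000"))) {
            String line;
            while ((line = out.readLine()) != null) {
                String[] kv = line.split("\t");
                if (!Integer.valueOf(kv[1]).equals(expected.get(kv[0]))) { match = false; break; }
            }
        }
        System.out.println(match ? "Key-value pairs match" : "Key-value mismatch found");
    }
}
```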

6. Output File Generation:

After completion of the map-reduce process, the output file location must be validated and the size of the output file returned.

Figure 9 Output File Generation
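A minimal sketch of this output-file check using the Hadoop FileSystem API, with a placeholder output path; checking for the _SUCCESS marker is a common Hadoop convention added here as an assumption, not something the paper states:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal output-file validation sketch: confirm the output location exists,
// the job wrote its _SUCCESS marker, and report the total output size.
// Assumes the Hadoop configuration is available on the classpath.
public class OutputFileCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path outDir = new Path("/user/output");   // placeholder output location

        boolean exists  = fs.exists(outDir);
        boolean success = exists && fs.exists(new Path(outDir, "_SUCCESS"));
        long totalBytes = exists ? fs.getContentSummary(outDir).getLength() : 0;

        System.out.println("output present: " + exists
                + ", job succeeded: " + success
                + ", output size: " + totalBytes + " bytes");
    }
}
```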

6. CONCLUSION

Big Data is still emerging, and there is a lot of responsibility on testers to identify innovative ideas to test its implementations. One of the most challenging things for a tester is to keep pace with the changing dynamics of the industry. For most aspects of testing, the tester need not know the technical details behind the scenes; testing Big Data technology, however, is different in exactly this respect. A tester not only needs to be strong in testing fundamentals but also has to be equally aware of the minute details of the database design architecture in order to analyze performance bottlenecks and other issues. Hadoop testers have to learn the components of the Hadoop ecosystem from scratch. In this paper we used about 10,000 sample records, pushed them into Hadoop in single-cluster mode, and presented both functional and non-functional testing results. Future work is to test the data on multi-cluster systems.

REFERENCES

[1] S. Nachiyappan and S. Justus, Getting Ready for Big Data Testing: A Practitioner Perception, 4th ICCCNT 2013, July 4-6, 2013, Tiruchengode, India.

[2] Muthuraman Thangaraj and Subramanian Anuradha, State of the Art in Testing Big Data, IEEE International Conference on Computational Intelligence and Computing Research, 2015.

[3] Harry M. Sneed and Katalin Erdoes, Testing the Big Data, 2015 IEEE Eighth International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 13th User Symposium on Software Quality, Test and Innovation (ASQT 2015), 2015.

[4] Piyaporn Samsuwan and Yachai Limpiyakorn, Generation of Data Warehouse Design Test Cases, 5th International Conference on IT Convergence and Security, 24-27 Aug. 2015.

[5] Data Warehouse Testing Solutions. White Paper by Infosys.

[6] Data Warehouse Testing Solutions. White Paper by Infosys.

[7] Big Data Testing Services. White Paper by Infosys.

[8] Proven Testing Techniques in Large Data Warehousing Projects. White Paper by Syntel.

[9] Test Data Management in Software Testing Life Cycle. White Paper by Infosys.

[10] Proven Testing Techniques in Large Data Warehousing Projects. White Paper by Syntel.

[11] Test Data Management in Software Testing Life Cycle. White Paper by Infosys.

[12] Building a Robust Big Data QA Ecosystem to Mitigate Data Integrity Challenges. White Paper by Cognizant Technology Solutions.

[13] The Emerging Big Data System - Testing Perspective. White Paper by Hexaware.

[14] A Primer on Big Data. White Paper by QA Consultants.

[15] Suja Cherukullapurath Mana, Big Data Paradigm and a Survey of Big Data Schedulers. International Journal of Computer Engineering & Technology, 8(5), 2017, pp. 11–14

[16] Dr. M Nagalakshmi, Dr. I Surya Prabha, K Anil, Big Data Map Reducing Technique Based Apriori in Distributed Mining. International Journal of Advanced Research in Engineering and Technology, 8(5), 2017, pp 19–28.

[17] Dr. V.V.R. Maheswara Rao, Dr. V. Valli Kumari and N. Silpa. An Extensive Study on Leading Research Paths on Big Data Techniques & Technologies. International Journal of Computer Engineering and Technology, 6(12), 2015, pp. 20-34

[18] Vijayashanthi R. and N. Shunmuga Karpagam, A Literature Survey on SP Theory of Intelligence Algorithm for Big Data Analysis, International Journal of Computer Engineering & Technology (IJCET), 5(12), 2014, pp. 207-213.