ACADGILD
In this blog, we work through a use case involving electric bulbs and use the date and time functions in Pig.
In this instance, Pig runs in local mode to load data from the local file system. Pig can equally be run in
MapReduce mode against HDFS if that is more convenient.
In the research center of bulb manufacturing companies, the longevity of bulbs is tested by
subjecting them to adverse conditions.
The dataset used in this case is a sample from a light bulb production house where bulbs are
tested at random intervals of time. The first column is StartDate, the date and time
when the testing of the bulb started, and the second column is EndDate, the date and time
when the testing ended.
StartDate            EndDate
30-Jun-2018 23:42    04-Jul-2018 15:10
30-Jun-2018 23:37
30-Jun-2018 23:13    30-Jun-2019 23:34m
A few rows may have empty fields, which indicates that the data is not available for one
reason or another. As developers, we need not worry about the missing data: with data
filtering, we can remove the incomplete rows.
Loading Data into the Pig environment
Since PigStorage uses tab (\t) as its default delimiter, it is not mandatory to state USING
PigStorage('\t') when loading tab-delimited data; nevertheless, it is good practice to write it.
Set this parameter to match the delimiter of your dataset.
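As a minimal sketch of the load step, assume the sample is saved as a tab-separated file named bulb_test.tsv (a hypothetical name) without a header row, and that Pig was started in local mode with pig -x local. The relation name bulbs is also an assumption; the field names mirror the column headings above.

```pig
-- Assumed file name; replace it with the actual path to the dataset.
-- Both columns are loaded as chararray and converted to datetime later.
bulbs = LOAD 'bulb_test.tsv' USING PigStorage('\t')
        AS (StartDate:chararray, EndDate:chararray);

-- Print the schema to confirm the load definition.
DESCRIBE bulbs;
```

Loading the timestamps as chararray first keeps the load step independent of the date format, which is handled in a later step.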
Now that the data is loaded into Pig, the first step is to filter the columns we are working on.
Here we remove all the rows with null data.
In this step, it is also mandatory to filter out the rows whose EndDate holds a - symbol in place of a timestamp.
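The two filters could look roughly like the sketch below. The file name bulb_test.tsv and the relation names are assumptions, as is the reading that a missing end time may appear as a lone - in the data.

```pig
-- Assumed file name and schema (tab-separated, no header row).
bulbs = LOAD 'bulb_test.tsv' USING PigStorage('\t')
        AS (StartDate:chararray, EndDate:chararray);

-- Drop rows where either timestamp is missing.
-- PigStorage is expected to load empty fields as null.
non_null = FILTER bulbs BY (StartDate IS NOT NULL) AND (EndDate IS NOT NULL);

-- Drop rows whose EndDate is only a '-' placeholder (assumed representation).
clean = FILTER non_null BY EndDate != '-';
```

Keeping the two conditions in separate FILTER statements makes it easy to DUMP the intermediate relation and see how many rows each rule removes.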
To work with the loaded data, we have to convert it into Pig's datetime type.
Here, we use two predefined functions:
ToDate()
MinutesBetween()
The first converts a character array (chararray) into a datetime value that Pig can
interpret, and the second returns the difference, in minutes, between the two datetime
values provided.
The ToDate function accepts a format string describing the layout of year, month and day.
The pattern letters are case-sensitive (yyyy for year, MM for month, dd for day of month). Some examples
are as follows:
yyyy-MM-dd
dd/MM/yyyy
dd-yy-MM
Choose the format that matches the structure of the dataset provided.
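For the sample shown earlier (for example 30-Jun-2018 23:42), a matching pattern would be dd-MMM-yyyy HH:mm. The sketch below applies it to the start column only; the file name and relation names are assumptions, and it presumes every remaining value matches the pattern exactly.

```pig
-- Assumed file name and schema (tab-separated, no header row).
bulbs = LOAD 'bulb_test.tsv' USING PigStorage('\t')
        AS (StartDate:chararray, EndDate:chararray);

clean = FILTER bulbs BY (StartDate IS NOT NULL);

-- dd = day of month, MMM = abbreviated month name (Jun),
-- yyyy = year, HH:mm = 24-hour clock time.
parsed = FOREACH clean GENERATE
             ToDate(StartDate, 'dd-MMM-yyyy HH:mm') AS StartDT;

DESCRIBE parsed;
```

A value that does not match the pattern is likely to make the conversion fail, so stray characters in the data should be filtered or cleaned before this step.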
After simple filtering and conversion of the character array data to datetime format, we can
determine how long, in minutes, each bulb stayed in the ON state during
testing.
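Put together, the conversion and the minute difference might be sketched as follows. The file name, relation names and the minutes_on alias are assumptions; the later timestamp is passed first to MinutesBetween so that the difference comes out positive.

```pig
-- Assumed file name and schema (tab-separated, no header row).
bulbs = LOAD 'bulb_test.tsv' USING PigStorage('\t')
        AS (StartDate:chararray, EndDate:chararray);

-- Keep only rows with both timestamps present.
clean = FILTER bulbs BY (StartDate IS NOT NULL)
                    AND (EndDate IS NOT NULL)
                    AND (EndDate != '-');

-- Convert both chararray columns to datetime values.
times = FOREACH clean GENERATE
            ToDate(StartDate, 'dd-MMM-yyyy HH:mm') AS StartDT,
            ToDate(EndDate,   'dd-MMM-yyyy HH:mm') AS EndDT;

-- Minutes each bulb stayed ON: end time first, start time second.
durations = FOREACH times GENERATE
                StartDT, EndDT,
                MinutesBetween(EndDT, StartDT) AS minutes_on;

DUMP durations;
```

As a sanity check, the first sample row (30-Jun-2018 23:42 to 04-Jul-2018 15:10) spans 3 days, 15 hours and 28 minutes, which is 5,248 minutes.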
We can view the results with the DUMP command, which displays the duration in minutes for each bulb.
Once we have this, we can analyze the result further, for example to find the
maximum or minimum time a bulb stayed ON.
Shown below is the result for the average time the bulbs were ON during the testing phase.
Dump Avg_ALL;
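The text does not show how Avg_ALL is defined. One way to build it, along with the minimum and maximum mentioned earlier, is to group all durations into a single bag and aggregate over it. The file name and every relation name other than Avg_ALL are assumptions.

```pig
-- Assumed file name and schema (tab-separated, no header row).
bulbs = LOAD 'bulb_test.tsv' USING PigStorage('\t')
        AS (StartDate:chararray, EndDate:chararray);

clean = FILTER bulbs BY (StartDate IS NOT NULL)
                    AND (EndDate IS NOT NULL)
                    AND (EndDate != '-');

-- Minutes each bulb stayed ON (end time first, start time second).
durations = FOREACH clean GENERATE
                MinutesBetween(ToDate(EndDate,   'dd-MMM-yyyy HH:mm'),
                               ToDate(StartDate, 'dd-MMM-yyyy HH:mm')) AS minutes_on;

-- GROUP ... ALL collects every row into one bag so that the
-- aggregate functions run over the whole dataset.
grouped = GROUP durations ALL;

Avg_ALL   = FOREACH grouped GENERATE AVG(durations.minutes_on) AS avg_minutes;
Range_ALL = FOREACH grouped GENERATE MIN(durations.minutes_on) AS min_minutes,
                                     MAX(durations.minutes_on) AS max_minutes;

DUMP Avg_ALL;
DUMP Range_ALL;
```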
This way, we can analyze the filtered result with Pig and get answers from a large
dataset in a matter of minutes.
For dataset and code for practice, click HERE.
For more such blogs on various topics, please visit ACADGILD.