    An Introduction to Big Data Concepts

    Bryan C Smith 1 Nov 2011

    The idea that data collected in computerized systems could be used to inform and thereby improve
    decision making has been around for quite some time. Over the last couple of decades, ideas of how
    to assemble a decision support system have coalesced around the concept of a data warehouse.

    The construction of a proper data warehouse requires a non-trivial investment. This investment is
    made with the expectation of benefits, but these are often difficult to enumerate prior to the
    warehouse's construction and subsequent employment. For this reason, the data warehouse requires
    a leap of faith.

    For many years, preparation for this leap was a significant part of the conversation with customers
    interested in Business Intelligence (BI). Today, in recognition of the data warehouse as a tool for
    navigating business challenges and uncertainty, the conversation tends to focus on maximizing the
    impact of BI on the organization.

    As customers focus on how best to extract insights from data, there is growing recognition of
    untapped data resources, especially unstructured data. These data remain largely untapped
    for two reasons:

    1. The value of these data relative to the cost of their processing and storage is low.

    2. These data are not easily stored and analyzed within the confines of the traditional data
       warehouse.



    To illustrate these points, consider the data in a web log. These data could be very insightful to a
    business interested in engaging customers through a website. However, individual data records,
    each holding information on a single page request or a single image retrieval, are not likely to be
    high in value, especially over the longer periods of time in which data are stored in a traditional
    data warehouse.


    Furthermore, the structure of many elements within the log records, such as the URI of the
    referrer or the query string associated with a requested resource, is highly variable in nature.
    Differing questions posed against these data may require them to be interpreted in differing ways.
    Significant pre-processing of the data in order to fit them neatly into the traditional data
    warehouse may be unnecessary or even counter-productive.
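
    The point about differing interpretations can be sketched in a few lines of Python. This is an
    illustration only: the article names no particular log format, so the layout below (the NCSA
    combined format), the sample record, and the helper names parse_record, search_terms, and
    referring_site are all assumptions made for the example. The record is split once into its
    well-understood top-level fields, and the variable parts are interpreted only when a question
    is asked.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed log layout (NCSA combined format); the article names no specific format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_record(line):
    """Split one log line into its fixed, top-level fields."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError("line does not match the assumed log layout")
    return match.groupdict()


def search_terms(record):
    """Question 1: what did the visitor search for on the referring site?

    Assumes the referrer carries its search text in a 'q' parameter.
    """
    query = parse_qs(urlsplit(record["referrer"]).query)
    return query.get("q", [])


def referring_site(record):
    """Question 2: which site sent the visitor, ignoring the query string?"""
    return urlsplit(record["referrer"]).netloc


# Hypothetical sample record, invented for this example.
SAMPLE = (
    '203.0.113.7 - - [01/Nov/2011:10:15:32 +0000] '
    '"GET /products/view?id=42&color=red HTTP/1.1" 200 5120 '
    '"http://search.example.com/find?q=red+widgets" "Mozilla/5.0"'
)

record = parse_record(SAMPLE)
print(search_terms(record))
print(referring_site(record))
```

    The same referrer field answers two different questions without being reshaped in advance,
    which is the sense in which heavy up-front pre-processing may work against the analysis.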

    Web logs are a commonly cited form of unstructured data. A better term for these data may be
    complex or mixed-typed data, as at some level these data have a well understood and meaningful
    structure. However, this structure is often at a level of granularity higher than the level at which
    analysis is to be performed, and it's this mismatch that leads to the unstructured moniker. Other

    forms of unstructured data include XML or JSON documents, images, video, or PDF, Word, or TXT files.


    The challenges of working with unstructured data, illustrated in the web log example, are often
    characterized in terms of four Vs. The four Vs are identified as:

    Volume: Defined as the total number of bytes associated with the data. Unstructured data
    are estimated to account for the majority of the data in existence, and the overall volume
    of data is rising.


    Velocity: Defined as the pace at which the data are to be consumed. As volumes rise, the
    value of individual data points tends to diminish more rapidly over time.


    Variety: Defined as the complexity of the data in this class. This complexity defies
    traditional means of analysis.


    Variability: Defined as the differing ways in which the data may be interpreted. Differing
    questions require differing interpretations.


    The four Vs articulate the broad challenges of working with unstructured data, but the dominant
    challenge tends to be data volume. As a result, the effort to extract insights from
    unstructured data is often referred to as Big Data.

    Because of the challenges of the four Vs, Big Data necessitates an alternative approach to
    Business Intelligence. This alternative approach, which we might refer to as the unstructured data
    warehouse or the Big Data warehouse, does not invalidate the traditional data warehouse but does
    acknowledge its limitations in extracting insights from the full range of available data resources.

    What exactly the unstructured data warehouse is, and how it will relate to the traditional
    structured data warehouse, have yet to be determined, but ideas are beginning to coalesce around
    distributed, algorithmic technologies such as Apache Hadoop.
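
    To give a feel for the style of processing such technologies encourage, here is a minimal
    in-process sketch of a map step, a grouping step, and a reduce step applied to log lines. It is
    not Hadoop's API and it does not distribute any work; it only mimics, on one machine, the shape
    of a computation that a framework of this kind would spread across many. The three-field log
    layout, the sample lines, and every function name are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit (page, 1) for one raw log line; malformed lines emit nothing."""
    parts = line.split()
    # Assumed layout for illustration only: "<client> <page> <status>"
    if len(parts) != 3:
        return []
    _client, page, _status = parts
    return [(page, 1)]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)


def count_page_requests(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sample lines, invented for this example.
LINES = [
    "203.0.113.7 /home 200",
    "203.0.113.9 /products 200",
    "203.0.113.7 /home 304",
    "corrupted line",
]

counts = count_page_requests(LINES)
print(counts)
```

    Each raw record is read as-is and interpreted inside the map step, so a different question
    only requires a different map function rather than a reload of the data. Individually
    low-value records become useful once aggregated.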



