Dealing with Data Quality
Google Workshop, July 24, 2009
[Images: blurry, low light, missing]
Faults can reduce the quantity and quality of the collected information.
When ignored, faults in a dataset can lead to ambiguous, or worse, incorrect conclusions.
[Images labeled “Circle” and “Square”]
Unfortunately, faults in networked sensing systems are common
[Chart: Network Faults, Data Faults, and Good Data for GDI '04, Redwoods '05, Volcan '06, and James Reserve '06]
1 R. Szewczyk et al. An analysis of a large scale habitat monitoring application. In Proc. SenSys, 2004.
2 G. Tolle et al. A macroscope in the redwoods. In Proc. SenSys, 2005.
3 G. Werner-Allen et al. Fidelity and Yield in a Volcano Monitoring Sensor Network. In Proc. OSDI, 2006.
4 Cms database. http://cens.jamesreserve.edu/phpmyadmin
*** Numbers are approximations based on publications and personal communications.
[Chart: Ammonium, Calcium, Carbonate, Chloride, Nitrate, pH]
Our experience is similar: Almost 60% of data was faulty in this soil deployment (Bangladesh, 2006)
Many methods to find faults
Examples include:
• Visual inspection
• Manual validation
• Analytical validation: statistical, scientific models
Statistical, e.g. outlier detection
Scientific, e.g. “temperature decreases with depth”
[Plot: Temperature vs. Depth]
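The two analytical checks named above can be sketched in a few lines of Python. This is only an illustration, not the method used in the deployments: the function names, the sample readings, and the choice of a median-based (modified z-score) outlier test are all assumptions. The constants 0.6745 and 3.5 are the values conventionally used with that test.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Statistical check: return indices of readings whose modified
    z-score (based on the median absolute deviation) exceeds threshold."""
    med = median(values)
    deviations = [abs(v - med) for v in values]
    mad = median(deviations)
    if mad == 0:
        # More than half the readings are identical, so there is no
        # spread estimate; flag anything that differs from the median.
        return [i for i, v in enumerate(values) if v != med]
    return [i for i, d in enumerate(deviations)
            if 0.6745 * d / mad > threshold]


def violates_depth_rule(profile, tolerance=0.0):
    """Scientific check: "temperature decreases with depth".
    profile is a list of (depth, temperature) pairs. Returns the depths
    at which temperature rises relative to the next shallower reading."""
    ordered = sorted(profile)
    return [d2 for (d1, t1), (d2, t2) in zip(ordered, ordered[1:])
            if t2 > t1 + tolerance]


# Hypothetical readings, for illustration only.
temperatures = [20.1, 20.3, 19.9, 20.2, 35.0, 20.0]
print(mad_outliers(temperatures))        # index of the 35.0 reading

profile = [(0, 25.0), (10, 22.0), (20, 23.5), (30, 19.0)]
print(violates_depth_rule(profile))      # depth where temperature rises
```

Neither check proves a reading is faulty; each only marks readings that disagree with an assumed model and so deserve a closer look.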
Several methods to fix faults
• Go into the field and replace or fix the problem.
• Remove the faulty data (“clean” the dataset) after the deployment is over.
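As a rough illustration of the second option, post-deployment cleaning can be as simple as filtering readings through a fault predicate and reporting how much was dropped. The record layout, the `clean` and `ph_is_faulty` helpers, and the 0 to 14 pH range used as the fault rule are assumptions made for this sketch.

```python
def clean(readings, is_faulty):
    """Return (kept, removed_count) after dropping readings that the
    supplied predicate marks as faulty."""
    kept = [r for r in readings if not is_faulty(r)]
    return kept, len(readings) - len(kept)


def ph_is_faulty(reading):
    """Hypothetical rule: a pH reading is faulty if it is missing or
    falls outside the conventional 0-14 scale."""
    value = reading["ph"]
    return value is None or not (0.0 <= value <= 14.0)


# Hypothetical records, for illustration only.
readings = [
    {"t": 0, "ph": 6.8},
    {"t": 1, "ph": -3.2},   # out of range
    {"t": 2, "ph": 7.1},
    {"t": 3, "ph": None},   # missing
]

kept, removed = clean(readings, ph_is_faulty)
print(f"kept {len(kept)} readings, removed {removed}")
```

Reporting the removed count alongside the cleaned data keeps the loss in quantity visible to whoever runs the analysis.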
Faults persist for a number of reasons, including:
First, faults can be difficult to define and identify
Faults persist partly because they are difficult to define
A nitrate deployment in the riverbed of the Merced River
Nitrate data taken from nearby locations
Which one is correct? Are they both correct? Are they both faulty?
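A small sketch shows why the question is hard to settle from the data alone. Comparing two nearby sensors reveals where they disagree, but not which one is wrong. The function name, the tolerance, and the sample values are hypothetical.

```python
def disagreements(series_a, series_b, tolerance):
    """Return the sample indices at which two co-located series differ
    by more than tolerance. This flags a conflict; it cannot say which
    series (if either) is the faulty one."""
    return [i for i, (a, b) in enumerate(zip(series_a, series_b))
            if abs(a - b) > tolerance]


# Hypothetical nitrate readings from two nearby locations.
sensor_a = [1.0, 1.1, 1.2, 5.0]
sensor_b = [1.1, 1.0, 1.3, 1.2]

print(disagreements(sensor_a, sensor_b, tolerance=0.5))
```

Resolving the conflict takes outside information, such as a manual sample or a model of what the environment should look like.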
Second, faults are not always worth fixing
Not all faults need to be fixed [Schoellhammer '08]
Maintenance can be expensive
And if the analysis can proceed without the faulty data, what is the point of fixing it?
[Plots: two Temperature vs. Depth profiles]
Answering these questions is hard
Incomplete, ad hoc, or last-minute solutions for addressing faults only exacerbate the problem.
Regardless of the solution for addressing faults (and there are many), it should be incorporated into the design and implementation of the system right from the beginning.
Thank You
Nithya Ramanathan
Collecting usable sensor data from a networked system is never easy. Whether the data consists of images or nitrate levels from a chemistry sensor, faults can reduce the quantity and quality of the collected information. And when ignored, faults in a dataset can lead to ambiguous, or worse, incorrect conclusions. Unfortunately, faults in networked sensing systems are painfully common.
Faults persist partly because they are difficult to define, and even once identified, they are not always worth fixing. Incomplete, ad hoc, or last-minute solutions for addressing faults only exacerbate the problem. Regardless of the solution for addressing faults (and there are many), it should be incorporated into the design and implementation of the system right from the beginning.