
Lossy compression of structured scientific data sets

Shreya Mittapalli, New Jersey Institute of Technology

Friday, Jul 31, 2015, NCAR

Supervisors: John Clyne and Alan Norton

HSS, CISL-NCAR

Problem we are trying to solve:

• Due to advances in technology, enormous volumes of data are collected by supercomputers, satellites, etc. There are two problems with Big Data:

1) The disks that store the data might not have enough space.

2) The speed at which the data can be read back might be much lower than the required speed.

To tackle this problem, we compress the data.

One way to compress the data is by using wavelets. Because of their multi-resolution and information-compaction properties, wavelets are widely used for lossy compression in numerous consumer multimedia applications (e.g. images, music, and video). For example:

The parrot image is compressed at a ratio of 1:35 and the rose at 1:18 using wavelets.

Source: http://arxiv.org/ftp/arxiv/papers/1004/1004.3276.pdf
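As a rough illustration of the information-compaction property (this sketch is not from the original slides; it assumes PyWavelets and uses a synthetic 1D signal): most of a smooth signal's energy typically lands in a small fraction of the wavelet coefficients, which is what makes aggressive coefficient truncation viable.

```python
# Minimal sketch (assumes PyWavelets is installed): measure how much of a
# synthetic signal's energy is captured by the largest 5% of its wavelet
# coefficients.
import numpy as np
import pywt

rng = np.random.default_rng(0)
x = np.cumsum(rng.standard_normal(4096))            # a smooth-ish random walk
coeffs, _ = pywt.coeffs_to_array(pywt.wavedec(x, "bior4.4"))
energy = np.sort(coeffs.ravel() ** 2)[::-1]         # coefficient energies, descending
top5 = energy[: len(energy) // 20].sum() / energy.sum()
print(f"Largest 5% of coefficients hold {100 * top5:.1f}% of the energy")
```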

What is Lossy Compression?

Lossy compression is the class of data-encoding methods that uses inexact approximations (or partial data discarding) to represent the content. These techniques are used to reduce data size for storage, handling, and transmitting content. (Source: Wikipedia)

In lossy compression:

Advantage: Compressed data can be stored on the hard disk, and it also saves a lot of computation time.

Disadvantage: While reconstructing the data, some information is permanently lost.

Project Goal

• To determine compression parameters that:
1) minimize distortion for a desired output file size.
2) reduce the computation time and come up with the best possible outcome.

Experiments done

• To achieve the project goal, we have been attempting to experimentally determine the optimal parameter choices for compressing numerical simulation data using wavelets.

• For this we experimented on three large data sets: two WRF hurricane data sets, Katrina and Sandy, and one turbulence data set, Taylor-Green (TG).

Sandy: grid resolution 5320 x 5000 x 149 (= 16 Gigabytes / 3D variable); # 3D variables: 15; time steps: ~100; total data set size: ~24 Terabytes

Katrina: grid resolution 316 x 310 x 35 (= 10 Megabytes / 3D variable); # 3D variables: 12; time steps: ~60; total data set size: ~9 Gigabytes

TG: grid resolution 1024^3 (= 4 Gigabytes / 3D variable); # 3D variables: 6; time steps: ~100; total data set size: ~2.5 Terabytes
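As a rough sanity check of the per-variable and total sizes above (assuming, which the slides do not state, that each grid point is stored as a 32-bit float, so the results only approximately match the rounded figures):

```python
# Back-of-the-envelope size check, assuming 4 bytes (float32) per grid point.
datasets = {
    # name: (grid points, # 3D variables, time steps) taken from the slides
    "Sandy":   (5320 * 5000 * 149, 15, 100),
    "Katrina": (316 * 310 * 35,    12,  60),
    "TG":      (1024 ** 3,          6, 100),
}
for name, (points, nvars, nsteps) in datasets.items():
    per_var = points * 4                     # bytes per 3D variable
    total = per_var * nvars * nsteps         # bytes for the whole data set
    print(f"{name}: {per_var / 1e6:,.0f} MB per variable, {total / 1e9:,.0f} GB total")
```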

Images of Hurricane Katrina, which occurred on August 29, 2005.

Images of Hurricane Sandy, which occurred on October 25, 2012.

Image of vortex iso-surfaces in a viscous flow starting from Taylor-Green initial conditions. Source: http://www.galcit.caltech.edu/research/highlights

• We constructed a Python framework that allowed us to change various compression parameters, such as wavelet type and block size, for each run (a sketch of the idea is shown below).

• Measurements: Lmax error, RMSE, and computation time
• Compression ratios: 1, 2, 4, 16, 32, 64, 128, 256
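A minimal sketch of what such a parameter sweep could look like, assuming PyWavelets as the wavelet library and simple coefficient truncation to hit a target ratio (the actual framework is not shown on the slides, so the helper names below are hypothetical):

```python
# Hypothetical parameter-sweep sketch: wavelet-transform a block, keep the
# largest 1/ratio of the coefficients, reconstruct, and record Lmax/RMSE/time.
import time
import numpy as np
import pywt

def compress_block(block, wavelet, ratio):
    """Return (reconstruction, seconds) after keeping 1/ratio of the coefficients."""
    t0 = time.perf_counter()
    coeffs = pywt.wavedecn(block, wavelet)               # forward 3D transform
    arr, slices = pywt.coeffs_to_array(coeffs)           # flatten coefficients
    keep = max(1, arr.size // ratio)                     # coefficient budget
    thresh = np.partition(np.abs(arr).ravel(), -keep)[-keep]
    arr[np.abs(arr) < thresh] = 0.0                      # discard small coefficients
    recon = pywt.waverecn(
        pywt.array_to_coeffs(arr, slices, output_format="wavedecn"), wavelet)
    recon = recon[tuple(slice(0, n) for n in block.shape)]   # trim transform padding
    return recon, time.perf_counter() - t0

def errors(orig, recon):
    diff = np.abs(orig - recon)
    return diff.max(), np.sqrt(np.mean(diff ** 2))       # (Lmax, RMSE)

if __name__ == "__main__":
    block = np.random.rand(64, 64, 64).astype(np.float32)   # stand-in for one 64^3 block
    for wavelet in ("bior3.3", "bior4.4"):
        for ratio in (1, 2, 4, 16, 32, 64, 128, 256):
            recon, secs = compress_block(block, wavelet, ratio)
            lmax, rmse = errors(block, recon)
            print(f"{wavelet}  {ratio:4d}:1  Lmax={lmax:.3e}  RMSE={rmse:.3e}  {secs:.3f}s")
```

In the actual experiments this kind of loop would run over real blocks of the Katrina, Sandy, and TG variables and over the different block sizes, rather than over random data.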

Compression parameters we wanted to explore:

1) Compare the wavelet types Bior3.3 and Bior4.4

• The wavelet Bior4.4 is also called the CDF 9/7 wavelet, which is widely used in digital signal processing and image compression.

• The wavelet Bior3.3 is traditionally used in the Vapor software.

• Goal: Determine if Bior4.4 is better than Bior3.3.
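A quick way to look at the two candidates outside the compression pipeline (a sketch assuming PyWavelets, which exposes them under the names "bior3.3" and "bior4.4"):

```python
# Compare basic properties of the two biorthogonal wavelets via PyWavelets.
import pywt

for name in ("bior3.3", "bior4.4"):
    w = pywt.Wavelet(name)
    print(f"{name}: decomposition filter length {w.dec_len}, "
          f"vanishing moments (psi) {w.vanishing_moments_psi}")
```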

Compression parameters we wanted to explore:

2) Compare block size 64x64x64 with other block sizes.

[Diagram comparing a 64^3 block with a 256^3 block]

Compression parameters we wanted to explore:

2) Compare block size 64x64x64 with other block sizes.

a) Determine if smaller blocks are better than larger blocks. The two contrasting considerations are:

i) Smaller blocks are more computationally efficient than larger blocks.

ii) Larger blocks introduce fewer artefacts than smaller blocks.

b) If the grid dimensions are not integer multiples of 64, extra data is introduced to fill the gap. This is called padding.

• The problem with padding is that, while we are trying to compress the data, extra data is introduced.

• For the TG data there is no padding, but for the Katrina and Sandy data we have 50% and 30% padding, respectively.

• Goal: Determine if the aligned data has errors comparable to the padded data.

Example to illustrate padding:

[Diagram illustrating padding: a grid dimension that is not a multiple of 64 (e.g. the 149-deep Sandy dimension) is padded out to the next 64-point boundary]
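A minimal sketch of the padding rule the diagram illustrated, assuming each grid dimension is simply rounded up to the next multiple of the block size (the helper below is hypothetical, not the slides' actual code):

```python
# Hypothetical helper: round each grid dimension up to the next multiple of
# the block size; the difference is the padding described above.
import math

def padded_shape(shape, block=64):
    return tuple(block * math.ceil(n / block) for n in shape)

# Example with the Sandy grid from the slides: the 149-deep axis grows to 192.
print(padded_shape((5320, 5000, 149)))   # -> (5376, 5056, 192)
```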

We did the following three experiments:

• Experiment 1: We compared the wavelet types Bior3.3 and Bior4.4 for all three data sets.

• Experiment 2: We compared padded data with aligned data. a) For Katrina: 64x64x64 vs 64x64x35. b) For Sandy: 64x64x64 vs 64x64x50.

• Experiment 3: We compared larger blocks with smaller blocks. For TG: 64x64x64 vs 128x128x128 vs 256x256x256.

BIOR3.3 VS BIOR4.4: The plots for the Katrina data illustrating Experiment 1.

ALIGNED DATA VS PADDED DATA: The plot for the Sandy data illustrating Experiment 2.

BIGGER BLOCKS VS SMALLER BLOCKS

The plots for TG data illustrating Experiment 3.

Lmax error for the wx variable of the TG data set for the block sizes 64x64x64, 128x128x128 and 256x256x256

RMSE error for the wx variable of the TG data set for the block sizes 64x64x64, 128x128x128 and 256x256x256

When using a larger block size (256^3 vs 64^3) for the vx component of the TG data set (the data is compressed 512:1), we see improved compression quality, as illustrated above.

Source: Pablo Mininni, U. of Buenos Aires.

Time taken to reconstruct the raw data for the wx variable of the TG data set for the block sizes 64x64x64, 128x128x128 and 256x256x256

Conclusion:

1) Bior4.4 is in some cases better than Bior3.3.

2) Surprisingly, a larger block (say 256x256x256) is better than 64x64x64 in terms of both computation time and error.

3) The errors of the aligned data and the padded data are comparable.

Acknowledgements:

• My supervisors John Clyne and Alan Norton for their continued support.

• Dongliang Chu, Samuel Li and Kim Zhang
• Delilah
• Gail Rutledge
• NCAR