How much information?

Adapted from a presentation by:

Jim Gray, Microsoft Research
http://research.microsoft.com/~gray

Alex Szalay, Johns Hopkins University
http://tarkus.pha.jhu.edu/~szalay/

How much information is there in the world

Infometrics - the measurement of information

• What can we store?

• What do we intend to store?

• What is stored?

• Why are we interested?

Infinite Storage?

[Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]

• The Terror Bytes are Here
– 1 TB costs <100$ to buy
– 1 TB costs 300k$/y to own

• Management & curation are expensive
– Searching 1 TB without indexing takes minutes or hours

• Petrified by Peta Bytes?

• But… people can “afford” them, so
– They will be used.

• Solution: Automate processes
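The "minutes or hours" claim for an unindexed search follows from simple arithmetic: a full scan has to read every byte once. A minimal sketch, assuming an illustrative sequential read rate of 100 MB/s per disk (this rate and the function name `scan_time_seconds` are assumptions, not figures from the slides):

```python
def scan_time_seconds(size_bytes, bytes_per_second, parallel_disks=1):
    """Time to read every byte once, with data striped evenly across disks."""
    return size_bytes / (bytes_per_second * parallel_disks)

TB = 10**12
MB = 10**6

# Assumed, illustrative sequential read rate: 100 MB/s per disk.
one_disk = scan_time_seconds(1 * TB, 100 * MB)
ten_disks = scan_time_seconds(1 * TB, 100 * MB, parallel_disks=10)

print(f"1 disk:   {one_disk / 3600:.1f} hours")
print(f"10 disks: {ten_disks / 60:.1f} minutes")
```

Under these assumptions one disk needs hours and ten disks in parallel need minutes, which is why indexing and automation matter at this scale.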

Digital Information Created, Captured, Replicated Worldwide

[Chart: exabytes per year, 2006 to 2011, on an axis from 0 to 1,800 exabytes. Annotation: 10-fold growth in 5 years!]

Sources listed on the chart: DVD, RFID, digital TV, MP3 players, digital cameras, camera phones, VoIP, medical imaging, laptops, data center applications, games, satellite images, GPS, ATMs, scanners, sensors, digital radio, DLP theaters, telematics, peer-to-peer, email, instant messaging, videoconferencing, CAD/CAM, toys, industrial machines, security systems, appliances

Source: IDC, 2008

Scale of things to come

• Information:

– In 2002, recorded media and electronic information flows generated about 22 exabytes (10^18 bytes) of information

– In 2006, the amount of digital information created, captured, and replicated was 161 EB

– In 2010, the amount of information added annually to the digital universe will be about 988 EB (almost 1 ZB)
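The growth figures above can be turned into an annual rate with the compound growth formula. A small sketch; the function name `annual_growth_rate` is illustrative, and the inputs are the slide's own numbers (10-fold in 5 years; 161 EB in 2006 and 988 EB in 2010):

```python
def annual_growth_rate(start, end, years):
    """Compound annual growth rate that takes `start` to `end` in `years`."""
    return (end / start) ** (1.0 / years) - 1.0

tenfold = annual_growth_rate(1.0, 10.0, 5)      # "10-fold growth in 5 years"
idc = annual_growth_rate(161.0, 988.0, 4)       # 161 EB (2006) to 988 EB (2010)

print(f"10x in 5 years: {tenfold:.1%} per year")
print(f"161 EB to 988 EB in 4 years: {idc:.1%} per year")
```

Both work out to a little under 60% per year, so the two statements are roughly consistent with each other.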

Digital Universe Environmental Footprint

• In our physical universe, 98.5% of the known mass is invisible, composed of interstellar dust or what scientists call “dark matter.” In the digital universe, we have our own form of dark matter: the tiny signals from sensors and RFID tags and the voice packets that make up less than 6% of the digital universe by gigabyte, but account for more than 99% of the “units,” information “containers,” or “files” in it.

• Tenfold growth of the digital universe in five years will have a measurable impact on the environment, in terms of both power consumed and electronic waste.

How much information is there?

• Soon most everything will be recorded and indexed

• Most bytes will never be seen by humans.

• Data summarization, trend detection, and anomaly detection are key technologies

See Mike Lesk: How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html

See Lyman & Varian: How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/

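[Figure: storage scale from Kilo to Yotta (Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta), with markers labeled: A Photo, A Book, Movie, All books (words), All Books MultiMedia, Everything! Recorded]

Small prefixes: 10^-24 yocto, 10^-21 zepto, 10^-18 atto, 10^-15 femto, 10^-12 pico, 10^-9 nano, 10^-6 micro, 10^-3 milli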
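The prefix ladder on this slide can be applied mechanically to any byte count. A minimal sketch, assuming decimal prefixes (steps of 1000); the helper name `scale` is illustrative:

```python
PREFIXES = ["", "Kilo", "Mega", "Giga", "Tera", "Peta", "Exa", "Zetta", "Yotta"]

def scale(n_bytes):
    """Return (value, prefix) so that value * 1000**k bytes == n_bytes."""
    step = 0
    # Climb the ladder while the count still fits the next prefix.
    while step < len(PREFIXES) - 1 and n_bytes >= 1000 ** (step + 1):
        step += 1
    return n_bytes / 1000 ** step, PREFIXES[step]

for n in (300 * 10**3, 10**12, 988 * 10**18):
    value, prefix = scale(n)
    print(f"{n:>25,d} bytes = {value:g} {prefix}bytes")
```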

Digital Immortality

Requirements for storing various media for a single person’s lifetime at modest fidelity

Bell, Gray, CACM, ‘01

What is Digital Immortality?

• Preservation and interaction of digitized experiences for individuals and/or groups
– Preservation and access
– Active interaction with archives through queries and/or an avatar (agents)
– Avatar interactions for group experiences

• Issues:
– Archiving
– Indexing
– Veracity
– Access

Information Census (Lesk; Varian & Lyman)

Media      TB/y        Growth Rate, %
optical    50          70
paper      100         2
film       100,000     4
magnetic   1,000,000   55
total      1,100,150   50

• ~10 Exabytes

• ~90% digital

• > 55% personal

• Print: .003% of bytes, 5 TB/y, but text has lowest entropy

• Email is (10 Bmpd) 4 PB/y and is 20% text (estimate by Gray)

• WWW is ~50 TB; deep web ~50 PB

• Growth: 50%/y

[Chart labels: TB, PB, EB; Internet]
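One plausible reading of the "total" row in the census table is that the volume is a plain sum and the growth rate is a volume-weighted average, dominated by magnetic storage. A short sketch that checks this reading against the table's figures (the interpretation is an assumption; the source does not say how the total was derived):

```python
# (TB per year, growth rate in %) taken from the census table.
CENSUS = {
    "optical": (50, 70),
    "paper": (100, 2),
    "film": (100_000, 4),
    "magnetic": (1_000_000, 55),
}

def total_and_growth(census):
    """Sum the volumes and weight each growth rate by its volume."""
    total = sum(tb for tb, _ in census.values())
    weighted = sum(tb * growth for tb, growth in census.values()) / total
    return total, weighted

total_tb, growth_pct = total_and_growth(CENSUS)
print(f"total: {total_tb:,} TB/y, weighted growth: {growth_pct:.1f}%")
```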

First Disk 1956

• IBM 305 RAMAC

• 4 MB

• 50x24” disks

• 1200 rpm

• 100 ms access

• 35k$/y rent

• Included computer & accounting software (tubes not transistors)

10 years later: 1.6 meters, 30 MB

Now - Terabytes on your desk

Terabyte external drive for $200 - 20 cents a gigabyte.

In 5 years, 1 cent/gigabyte, $10 for a terabyte?

Disk TB Shipped per Year
1998 Disk Trend (Jim Porter)
http://www.disktrend.com/pdf/portrpkg.pdf

[Chart: disk TB shipped per year, 1988 to 2000, log scale from 1E+3 to 1E+7 TB, with an ExaByte marker. Trend lines: disk TB growth 112%/y; Moore's Law 58.7%/y]

Storage capacity beating Moore’s law

• Improvements:
  Capacity 60%/y
  Bandwidth 40%/y
  Access time 16%/y

• 1000 $/TB today

• 100 $/TB in 2007

Moore's law     58.70% /year
TB growth       112.30% /year since 1993
Price decline   50.70% /year since 1993
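The annual rates above are easier to compare as doubling times. A small sketch using the standard compound-growth relations; the function names are illustrative, and the inputs are the slide's rates:

```python
import math

def doubling_time_years(annual_growth):
    """Years to double at a compound annual growth rate (0.587 means 58.7%)."""
    return math.log(2) / math.log(1 + annual_growth)

def years_to_shrink(factor, annual_decline):
    """Years for a price to fall by `factor` at a compound annual decline."""
    return math.log(factor) / -math.log(1 - annual_decline)

print(f"Moore's law, 58.7%/y: doubles every {doubling_time_years(0.587):.2f} years")
print(f"Disk TB, 112.3%/y:    doubles every {doubling_time_years(1.123):.2f} years")
print(f"Price, -50.7%/y:      10x cheaper in {years_to_shrink(10, 0.507):.1f} years")
```

A rate of 58.7% per year is the familiar doubling every 18 months, while shipped disk capacity doubles in under a year.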

Most (80%) data is personal (not enterprise). This will likely remain true.

Disk Evolution

• Capacity: 100x in 10 years
  1 TB 3.5” drive in 2006
  20 GB as 1” micro-drive

• System on a chip

• High-speed LAN

• Disk replacing tape

• Disk is super computer!


Disk Storage Cheaper Than Paper

• File Cabinet:
  Cabinet (4 drawer)       250$
  Paper (24,000 sheets)    250$
  Space (2x3 @ 10€/ft^2)   180$
  Total                    700$
  0.03 $/sheet: 3 pennies per page

• Disk:
  Disk (250 GB)            250$
  ASCII: 100 m pages, 2e-6 $/sheet (10,000x cheaper): micro-dollar per page
  Image: 1 m photos, 3e-4 $/photo (100x cheaper): milli-dollar per photo

• Store everything on disk

Note: Disk is 100x to 1000x cheaper than RAM
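The per-page costs on this slide follow from dividing each total by the number of items stored. A minimal sketch of that arithmetic using the slide's figures; the variable names are illustrative, and the listed line items sum to 680$, which the slide rounds to 700$:

```python
# Paper: cabinet + paper + floor space, in dollars, for 24,000 sheets.
cabinet_total = 250 + 250 + 180
sheets = 24_000
paper_per_sheet = cabinet_total / sheets

# Disk: one 250 GB disk at 250$, holding either text pages or photos.
disk_price = 250
ascii_pages = 100_000_000
photos = 1_000_000
disk_per_page = disk_price / ascii_pages
disk_per_photo = disk_price / photos

print(f"paper: {paper_per_sheet:.3f} $/sheet")
print(f"disk:  {disk_per_page:.1e} $/page  ({paper_per_sheet / disk_per_page:,.0f}x cheaper)")
print(f"disk:  {disk_per_photo:.1e} $/photo ({paper_per_sheet / disk_per_photo:,.0f}x cheaper)")
```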

Why Put Everything in Cyberspace?

Low rent: min $/byte

Shrinks time: now or later

Shrinks space: here or there

Automate processing: knowbots

[Diagram: Point-to-Point OR Broadcast; Immediate OR Time Delayed; Locate, Process, Analyze, Summarize]

Memex

As We May Think, Vannevar Bush, 1945

“A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility”

“yet if the user inserted 5000 pages of material a day it would take him hundreds of years to fill the repository, so that he can be profligate and enter material freely”

Trying to fill a terabyte in a year

Item                          Items/TB   Items/day
300 KB JPEG                   3 M        9,800
1 MB Doc                      1 M        2,900
1 hour 256 kb/s MP3 audio     9 K        26
1 hour 1.5 Mbps MPEG video    290        0.8
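The table's columns come from dividing a terabyte by the item size, then by the days in a year. A rough sketch of that calculation, assuming a decimal terabyte and stream sizes derived directly from the stated bit rates; the slide's rounded figures may rest on different size assumptions, so the computed values need not match it exactly:

```python
TB = 10**12

def items_per_tb(item_bytes):
    """How many items of this size fit in one (decimal) terabyte."""
    return TB / item_bytes

def items_per_day(item_bytes, days=365):
    """Items to add per day to fill a terabyte in `days` days."""
    return items_per_tb(item_bytes) / days

def stream_bytes(bits_per_second, seconds=3600):
    """Size of a constant-bit-rate stream of the given duration."""
    return bits_per_second * seconds / 8

sizes = {
    "300 KB JPEG": 300e3,
    "1 MB Doc": 1e6,
    "1 hour 256 kb/s MP3": stream_bytes(256e3),
    "1 hour 1.5 Mbps MPEG": stream_bytes(1.5e6),
}
for name, size in sizes.items():
    print(f"{name:22s} {items_per_tb(size):12,.0f} per TB {items_per_day(size):10,.1f} per day")
```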

Projected Portable Computer for 2006

• 100 Gips processor

• 1 GB RAM

• 1 TB disk

• 1 Gbps network

• “Some” of your software finding things is a data mining challenge

The Personal Terabyte(s) (All Your Stuff Online)

So you’ve got it – now what do you do with it?

• TREASURED (what’s the one thing you would save in a fire?)

• Can you find anything?

• Can you organize that many objects?

• Once you find it will you know what it is?

• Once you’ve found it, could you find it again?

• Information Science Goal: Have GOOD answers for all these Questions

How Will We Find Anything?

• Need Queries, Indexing, Pivoting, Scalability, Backup, Replication, Online update, Set-oriented access
  If you don’t use a DBMS, you will implement one!

• Simple logical structure:
– Blob and link is all that is inherent
– Additional properties (facets == extra tables) and methods on those tables (encapsulation)

• More than a file system

• Unifies data and meta-data

[Diagram label: SQL++ DBMS]
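The "blob and link, plus facets as extra tables" structure can be sketched with an ordinary relational schema. The example below is a minimal illustration using Python's built-in sqlite3 module; the table and column names (`blob`, `link`, `photo_facet`, `place`) are hypothetical and not taken from the presentation:

```python
import sqlite3

def build_store():
    """Create an in-memory store: blobs, links, and one facet table."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE blob (id INTEGER PRIMARY KEY, data BLOB NOT NULL);
        CREATE TABLE link (src INTEGER REFERENCES blob(id),
                           dst INTEGER REFERENCES blob(id));
        -- A facet is just an extra table keyed by blob id.
        CREATE TABLE photo_facet (blob_id INTEGER PRIMARY KEY REFERENCES blob(id),
                                  taken_on TEXT, place TEXT);
        CREATE INDEX photo_by_place ON photo_facet(place);
    """)
    return db

def add_photo(db, data, taken_on, place):
    """Store the raw bytes as a blob and its metadata in the photo facet."""
    blob_id = db.execute("INSERT INTO blob (data) VALUES (?)", (data,)).lastrowid
    db.execute("INSERT INTO photo_facet VALUES (?, ?, ?)", (blob_id, taken_on, place))
    return blob_id

def photos_at(db, place):
    """Indexed, set-oriented lookup: blob ids of photos from one place, oldest first."""
    rows = db.execute(
        "SELECT blob_id FROM photo_facet WHERE place = ? ORDER BY taken_on", (place,))
    return [row[0] for row in rows]

db = build_store()
a = add_photo(db, b"jpeg-bytes-1", "2006-05-01", "Baltimore")
b = add_photo(db, b"jpeg-bytes-2", "2006-04-01", "Baltimore")
c = add_photo(db, b"jpeg-bytes-3", "2006-06-01", "Redmond")
db.execute("INSERT INTO link VALUES (?, ?)", (a, c))  # a plain link between two blobs
print(photos_at(db, "Baltimore"))
```

Data and metadata live in the same store here, so the query, the index, and any backup all cover both at once.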

80% of data is personal / individual.

But, what about the other 20%?

• Business
– Wal-Mart online: 1 PB and growing…
– Paradox: most “transaction” systems < 1 PB.
– Have to go to image/data monitoring for big data

• Government
– Government is the biggest business.

• Science
– LOTS of data.

Q: Where will the Data Come From?
A: Sensor Applications

• Earth Observation
– 15 PB by 2007

• Medical Images & Information + Health Monitoring
– Potential 1 GB/patient/y → 1 EB/y

• Video Monitoring
– ~1E8 video cameras @ 1E5 MBps → 10 TB/s → 100 EB/y → filtered???

• Airplane Engines
– 1 GB sensor data/flight
– 100,000 engine hours/day
– 30 PB/y

• Smart Dust: ?? EB/y

http://robotics.eecs.berkeley.edu/~pister/SmartDust/
http://www-bsac.eecs.berkeley.edu/~shollar/macro_motes/macromotes.html
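Estimates like these are back-of-envelope products of a per-unit volume and a unit count. A rough sketch under loudly stated assumptions: about one billion patients (implied by 1 GB/patient/y reaching 1 EB/y, but not stated on the slide), and about 1 GB of sensor data per engine hour (the slide gives 1 GB per flight and 100,000 engine hours per day separately):

```python
GB, PB, EB = 10**9, 10**15, 10**18

def yearly_volume(bytes_per_unit, units_per_year):
    """Total bytes per year from a per-unit volume and a yearly unit count."""
    return bytes_per_unit * units_per_year

# Assumption: ~1e9 patients, each producing 1 GB per year.
medical = yearly_volume(1 * GB, 10**9)

# Assumption: ~1 GB per engine hour, 100,000 engine hours per day.
engines = yearly_volume(1 * GB, 100_000 * 365)

print(f"medical: {medical / EB:.1f} EB/y")
print(f"engines: {engines / PB:.1f} PB/y")
```

Under these assumptions the engine estimate lands in the same range as the slide's 30 PB/y.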

CERN Tier 0

Instruments: CERN – LHC, Peta Bytes per Year

Looking for the Higgs Particle

• Sensors: 1000 GB/s (1TB/s ~ 30 EB/y)

• Events 75 GB/s

• Filtered 5 GB/s

• Reduced 0.1 GB/s ~ 2 PB/y

• Data pyramid: 100GB : 1TB : 100TB : 1PB : 10PB
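The rates in this filtering chain convert to yearly volumes by multiplying by the seconds in a year. A small sketch, with the stage rates taken from the slide; the `duty_cycle` parameter is an assumption added for illustration, since the source does not say what fraction of the year the instrument runs:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000
GB = 10**9

# (stage, rate in GB/s) as listed on the slide.
STAGES = [("Sensors", 1000), ("Events", 75), ("Filtered", 5), ("Reduced", 0.1)]

def annual_bytes(gb_per_second, duty_cycle=1.0):
    """Bytes per year at a sustained rate, scaled by the fraction of time running."""
    return gb_per_second * GB * SECONDS_PER_YEAR * duty_cycle

for name, rate in STAGES:
    volume = annual_bytes(rate)
    print(f"{name:9s} {rate:7.1f} GB/s = {volume / 1e15:10,.1f} PB/y at full duty")

reduction = STAGES[0][1] / STAGES[-1][1]
print(f"overall reduction: {reduction:,.0f}x")
```

At full duty 1 TB/s is about 31.5 EB/y, matching the slide's "~ 30 EB/y". The reduced stream would be about 3 PB/y at full duty, so the slide's "~ 2 PB/y" would correspond to running roughly two thirds of the time; that duty cycle is an inference, not a figure from the source.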

Thesis

• Most new information is digital (and old information is being digitized)

• An Information Science Grand Challenge:
– Capture
– Organize
– Summarize
– Visualize
this information

• Optimize Human Attention as a resource

• Improve information quality

Access!

The Evolution of Science

• Observational Science
– Scientist gathers data by direct observation
– Scientist analyzes data

• Analytical Science
– Scientist builds analytical model
– Makes predictions.

• Computational Science
– Simulate analytical model
– Validate model and make predictions

• Data Exploration Science
  Data captured by instruments or generated by simulator
– Processed by software
– Placed in a database / files
– Scientist analyzes database / files

http://es.rice.edu/ES/humsoc/Galileo/Images/Astro/Instruments/hevelius_telescope.gif

Computational Science Evolves

• Historically, Computational Science = simulation.

• New emphasis on informatics:

– Capturing,

– Organizing,

– Summarizing,

– Analyzing,

– Visualizing

• Largely driven by observational science, but also needed by simulations.

• Too soon to say if comp-X and X-info will unify or compete.

BaBar, Stanford

Space Telescope

P&E Gene Sequencer, from http://www.genome.uci.edu/

http://www.pd.astro.it/othersites/altrimondi/prot02_083/Storia%20dell'astronautica-file/Space%20telescope.jpg

Next-Generation Data Analysis

• Looking for
– Needles in haystacks – the Higgs particle
– Haystacks: Dark matter, Dark energy

• Needles are easier than haystacks

• Global statistics have poor scaling
– Correlation functions are N^2, likelihood techniques N^3

• As data and computers grow at same rate, we can only keep up with N log N

• A way out?
– Discard notion of optimal (data is fuzzy, answers are approximate)
– Don’t assume infinite computational resources or memory

• Requires combination of statistics & computer science
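The gap between N^2 and N log N can be seen in a toy pair-counting problem, the basic step of a correlation function. The sketch below counts pairs of points within a distance r on a line, first by comparing every pair and then by sorting and using binary search. It is a one-dimensional illustration chosen for brevity, not the method used in the presentation:

```python
from bisect import bisect_right

def pairs_naive(xs, r):
    """Compare every pair: O(N^2) distance checks."""
    n = len(xs)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if abs(xs[i] - xs[j]) <= r)

def pairs_sorted(xs, r):
    """Sort once, then binary-search each point's right-hand reach: O(N log N)."""
    s = sorted(xs)
    total = 0
    for i, x in enumerate(s):
        # Index just past the last point that is <= x + r.
        reach = bisect_right(s, x + r)
        total += reach - (i + 1)   # partners to the right of position i
    return total

points = [1, 2, 4, 7, 7, 11]
print(pairs_naive(points, 3), pairs_sorted(points, 3))
```

Both functions return the same count, but only the sorted version stays tractable as N grows at the same rate as the hardware.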

Smart Data (active databases)

• If there is too much data to move around, take the analysis to the data!

• Do all data manipulations at database
– Build custom procedures and functions in the database

• Automatic parallelism guaranteed

• Easy to build in custom functionality
– Databases & Procedures being unified
– Example: temporal and spatial indexing
– Pixel processing

• Easy to reorganize the data
– Multiple views, each optimal for certain types of analyses
– Building hierarchical summaries is trivial

• Scalable to Petabyte datasets
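"Taking the analysis to the data" can be illustrated by registering a custom function inside the database engine and returning only a summary. A minimal sketch using Python's built-in sqlite3 module and its `create_function` call; the table layout (`obj`, `zone`, `flux`) and the function `mag` (a magnitude with the zero point omitted) are hypothetical and chosen only for illustration:

```python
import math
import sqlite3

def make_catalog(rows):
    """Load (zone, flux) rows and register a custom function in the database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE obj (id INTEGER PRIMARY KEY, zone INTEGER, flux REAL)")
    db.executemany("INSERT INTO obj (zone, flux) VALUES (?, ?)", rows)
    # The function runs inside the query engine, next to the data.
    db.create_function("mag", 1, lambda flux: -2.5 * math.log10(flux))
    return db

def summary_by_zone(db):
    """Return one small summary row per zone instead of shipping every object."""
    return db.execute(
        "SELECT zone, COUNT(*), MIN(mag(flux)) FROM obj GROUP BY zone ORDER BY zone"
    ).fetchall()

catalog = make_catalog([(1, 1.0), (1, 100.0), (2, 10.0)])
print(summary_by_zone(catalog))
```

Only the per-zone summary crosses the database boundary; the same grouping can be repeated at coarser levels to build a hierarchy of summaries.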

Data Mining in the Image Domain: Can We Discover New Types of Phenomena Using Automated Pattern Recognition?
(Every object detection algorithm has its biases and limitations)

– Effective parametrization of source morphologies and environments
– Multiscale analysis (Also: in the time/lightcurve domain)

Challenge: Make Data Publication & Access Easy

• Augment FTP with data query: Return intelligent data subsets

• Make it easy to
– Publish: Record structured data
– Find:
  • Find data anywhere in the network
  • Get the subset you need
– Explore datasets interactively

• Realistic goal:
– Make it as easy as publishing/reading web sites today.

Information Science and Data Generation Trends

• What do large amounts of information provide?
– New opportunities for search!
– New discoveries

• Business opportunities?

• Research opportunities?

• Problems?
