
US Office of Personnel Management: Notes on "Big Data"

A presentation given December 13, 2012 for the US Office of Personnel Management "Big Data" initiative

Page 1: US Office of Personnel Management: Notes on  "Big Data"

Some notes on “Big Data” : What does “full life-cycle” data management mean ?

Tom Moritz, OPM “Big Data” July, 2012

Page 2: US Office of Personnel Management: Notes on  "Big Data"

Open Government and “Transparency”

Two dimensions:
-- Data about government operations (all three branches!)
-- Data that represent the products of government activity

Page 3: US Office of Personnel Management: Notes on  "Big Data"

Case Studies

Page 4: US Office of Personnel Management: Notes on  "Big Data"

“A representation of the cholera epidemic of the nineteenth century”

http://history.nih.gov/exhibits/history/index.html

Page 5: US Office of Personnel Management: Notes on  "Big Data"

http://www.ph.ucla.edu/epi/snow/snowmap1_1854_lge.htm

Page 7: US Office of Personnel Management: Notes on  "Big Data"
Page 8: US Office of Personnel Management: Notes on  "Big Data"

http://www.flickr.com/photos/trialsanderrors/3075553370/

Paul Strand: “Wall Street, New York City, 1915.” Aerial view of pedestrians walking along Wall Street in strong sunlight, with a building with large recesses in the background (likely 23 Wall Street, the headquarters of J.P. Morgan & Co.). Photograph by Paul Strand (a student of Lewis Hine), 1915; published in Camera Work, v. 48, p. 25, October 1916.

The “Flash Crash”: “On the afternoon of May 6, 2010, the U.S. equity markets experienced an extraordinary upheaval. Over approximately 10 minutes, the Dow Jones Industrial Average dropped more than 600 points, representing the disappearance of approximately $800 billion of market value. The share price of several blue-chip multinational companies fluctuated dramatically; shares that had been at tens of dollars plummeted to a penny in some cases and rocketed to values over $100,000 per share in others. As suddenly as this market downturn occurred, it reversed, so over the next few minutes most of the loss was recovered and share prices returned to levels close to what they had been before the crash.”

“Large-Scale Complex IT Systems.” By Ian Sommerville, et al. Communications of the ACM, Vol. 55 No. 7, Pages 71-77. http://cacm.acm.org/magazines/2012/7/151233-large-scale-complex-it-systems/fulltext

Page 9: US Office of Personnel Management: Notes on  "Big Data"

The “Flash Crash” (2)

“…the trigger event was identified as a single block sale of $4.1 billion of futures contracts executed with uncommon urgency on behalf of a fund-management company. That sale began a complex pattern of interactions between the high-frequency algorithmic trading systems (algos) that buy and sell blocks of financial instruments on incredibly short timescales.

“A software bug did not cause the Flash Crash; rather, the interactions of independently managed software systems created conditions unforeseen (probably unforeseeable) by the owners and developers of the trading systems. Within seconds, the result was a failure in the broader socio-technical markets that increasingly rely on the algos…”

“Large-Scale Complex IT Systems.” By Ian Sommerville, et al. Communications of the ACM, Vol. 55 No. 7, Pages 71-77. http://cacm.acm.org/magazines/2012/7/151233-large-scale-complex-it-systems/fulltext

Page 10: US Office of Personnel Management: Notes on  "Big Data"

The “Flash Crash” (3): Key Insights

“Large-Scale Complex IT Systems.” By Ian Sommerville, et al. Communications of the ACM, Vol. 55 No. 7, Pages 71-77.

http://cacm.acm.org/magazines/2012/7/151233-large-scale-complex-it-systems/fulltext

“Coalitions of systems, in which the system elements are managed and owned independently, pose challenging new problems for systems engineering.”

“When the fundamental basis of engineering – reductionism – breaks down, incremental improvements to current engineering techniques are unable to address the challenges of developing, integrating, and deploying large-scale complex IT systems.”

“Developing complex systems requires a socio-technical perspective involving human, organizational, social and political factors, as well as technical factors.”

Page 11: US Office of Personnel Management: Notes on  "Big Data"

The Digital Environment…

Page 12: US Office of Personnel Management: Notes on  "Big Data"

Individual Libraries

Cooperative Projects

National Disciplinary Initiatives

“BIG Science”

“Small Science”

Local / Personal Archiving

International Collaborative Research Effort

Individuals

Data Centers

GRIDS

The “Ecology” of Digital Data

Page 13: US Office of Personnel Management: Notes on  "Big Data"

The Role of Scientific and Technical Data and Information in the Public Domain: Proceedings of a Symposium. Julie M. Esanu and Paul F. Uhlir, Editors. Steering Committee on the Role of Scientific and Technical Data and Information in the Public Domain, Office of International Scientific and Technical Information Programs, Board on International Scientific Organizations, Policy and Global Affairs Division, National Research Council of the National Academies, p. 5

The Public Domain

“The institutional ecology of the digital environment” (Yochai Benkler)

Sectors (public <-> private) and Jurisdictional Scale

Page 14: US Office of Personnel Management: Notes on  "Big Data"

The “small science,” independent investigator approach traditionally has characterized a large area of experimental laboratory sciences, such as chemistry or biomedical research, and field work and studies, such as biodiversity, ecology, microbiology, soil science, and anthropology. The data or samples are collected and analyzed independently, and the resulting data sets from such studies generally are heterogeneous and unstandardized, with few of the individual data holdings deposited in public data repositories or openly shared. The data exist in various twilight states of accessibility, depending on the extent to which they are published, discussed in papers but not revealed, or just known about because of reputation or ongoing work, but kept under absolute or relative secrecy. The data are thus disaggregated components of an incipient network that is only as effective as the individual transactions that put it together. Openness and sharing are not ignored, but they are not necessarily dominant either. These values must compete with strategic considerations of self-interest, secrecy, and the logic of mutually beneficial exchange, particularly in areas of research in which commercial applications are more readily identifiable.

The Role of Scientific and Technical Data and Information in the Public Domain: Proceedings of a Symposium. Julie M. Esanu and Paul F. Uhlir, Editors. Steering Committee on the Role of Scientific and Technical Data and Information in the Public Domain, Office of International Scientific and Technical Information Programs, Board on International Scientific Organizations, Policy and Global Affairs Division, National Research Council of the National Academies, p. 8

Page 15: US Office of Personnel Management: Notes on  "Big Data"

“Small” data collections may become “Big” (and more complex) by successive aggregation of sources…

Page 16: US Office of Personnel Management: Notes on  "Big Data"

Linked Open Data

[Linked Open Data cloud diagrams, 2009 and 2011]

Courtesy of Tim Lebo, RPI (@timrdf, 20 Jun 2012) http://bit.ly/lebo-ipaw-2012

Page 17: US Office of Personnel Management: Notes on  "Big Data"

“Data” ? [technical definition]

“…’data’ are defined as any information that can be stored in digital form and accessed electronically, including, but not limited to, numeric data, text, publications, sensor streams, video, audio, algorithms, software, models and simulations, images, etc.” -- Program Solicitation 07-601 “Sustainable Digital Data Preservation and Access Network Partners (DataNet)”

Taken in this broadest possible sense, “data” are thus simply electronically coded forms of information. And virtually anything can be represented as “data” so long as it is electronically machine-readable.

Page 18: US Office of Personnel Management: Notes on  "Big Data"

“The digital universe in 2007 — at 2.25 x 10^21 bits (281 exabytes or 281 billion gigabytes) — was 10% bigger than we thought. The resizing comes as a result of faster growth in cameras, digital TV shipments, and better understanding of information replication.

“By 2011, the digital universe will be 10 times the size it was in 2006.

“As forecast, the amount of information created, captured, or replicated exceeded available storage for the first time in 2007. Not all information created and transmitted gets stored, but by 2011, almost half of the digital universe will not have a permanent home.

“Fast-growing corners of the digital universe include those related to digital TV, surveillance cameras, Internet access in emerging countries, sensor-based applications, datacenters supporting “cloud computing,” and social networks.

The Diverse and Exploding Digital Universe: An Updated Forecast of Worldwide Information Growth through 2011 -- Executive Summary. IDC Information and Data, March, 2008 http://www.emc.com/collateral/analyst-reports/diverse-exploding-idc-exec-summary.pdf
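The IDC figures quoted above can be sanity-checked with simple unit arithmetic (IDC uses decimal prefixes, so 1 EB = 10^18 bytes and 1 GB = 10^9 bytes):

```python
# Check that 2.25 x 10^21 bits matches "281 exabytes or 281 billion gigabytes".
bits = 2.25e21
bytes_total = bits / 8            # 8 bits per byte
exabytes = bytes_total / 1e18     # decimal exabytes
gigabytes = bytes_total / 1e9     # decimal gigabytes

print(round(exabytes))            # -> 281
print(round(gigabytes / 1e9))     # -> 281 (billion gigabytes)
```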

Page 19: US Office of Personnel Management: Notes on  "Big Data"

http://longtail.typepad.com/the_long_tail/2005/05/isnt_the_long_t.html

“As you go down the Long Tail the signal-to-noise ratio gets worse. Thus the only way you can maintain a consistently good enough signal to find what you want is if your filters get increasingly powerful.”

Chris Anderson “Is the Long Tail full of crap?” May 22, 2005

Page 20: US Office of Personnel Management: Notes on  "Big Data"

“Data” [epistemic definition]

“Measurements, observations or descriptions of a referent -- such as an individual, an event, a specimen in a collection or an excavated/surveyed object -- created or collected through human interpretation (whether directly “by hand” or through the use of technologies)”

-- AnthroDPA Working Group on Metadata (May, 2009)

Page 21: US Office of Personnel Management: Notes on  "Big Data"

“…data longevity is increased. Comprehensive metadata counteract the natural tendency for data to degrade in information content through time (i.e. information entropy sensu Michener et al., 1997; Fig. 1).”

W. K. Michener, “Meta-information concepts for ecological data management,” Ecological Informatics 1 (2006) 3-7

Data Entropy: the risks of inaction and the urgency of action
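Michener's point can be illustrated with a toy decay model. The exponential form and the half-life value here are illustrative assumptions only, not figures from Michener (2006):

```python
import math

def information_content(t_years: float, half_life: float = 10.0) -> float:
    """Fraction of a dataset's information content still recoverable after
    t_years without curation. The exponential shape and the 10-year
    half-life are hypothetical, for illustration only."""
    return math.exp(-math.log(2) * t_years / half_life)

# Undocumented data loses interpretability steadily over time;
# comprehensive metadata is what flattens this curve.
for t in (0, 10, 20, 40):
    print(f"after {t:2d} years: {information_content(t):.2f}")
```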


Page 22: US Office of Personnel Management: Notes on  "Big Data"

Data Development: “Data Reduction - Processing Level Definitions” (an example)

http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19860021622_1986021622.pdf Report of the EOS Data Panel Vol IIA, NASA, 1986 (Tech Memorandum 87777)


Page 23: US Office of Personnel Management: Notes on  "Big Data"

T.C. Chamberlin

Page 24: US Office of Personnel Management: Notes on  "Big Data"

“What science does is put forward hypotheses, and use them to make predictions, and test those predictions against empirical evidence. Then the scientists make judgments about which hypotheses are more likely, given the data. These judgments are notoriously hard to formalize, as Thomas Kuhn argued in great detail, and philosophers of science don’t have anything like a rigorous understanding of how such judgments are made. But that’s only a worry at the most severe levels of rigor; in rough outline, the procedure is pretty clear. Scientists like hypotheses that fit the data, of course, but they also like them to be consistent with other established ideas, to be unambiguous and well-defined, to be wide in scope, and most of all to be simple. The more things an hypothesis can explain on the basis of the fewer pieces of input, the happier scientists are.”

-- Sean Carroll “Science and Religion are not Compatible”

Discover Magazine, June 23rd, 2009, 8:01 AM

Hypotheses and data as evidence: Inductive <--> Deductive feedback loops?


http://blogs.discovermagazine.com/cosmicvariance/2009/06/23/science-and-religion-are-not-compatible/

Page 25: US Office of Personnel Management: Notes on  "Big Data"

Full Life Cycle Management?


Page 26: US Office of Personnel Management: Notes on  "Big Data"

US NSF “DataNet” Program: “the full data preservation and access lifecycle”

• “acquisition”
• “documentation”
• “protection”
• “access”
• “analysis and dissemination”
• “migration”
• “disposition”

“Sustainable Digital Data Preservation and Access Network Partners (DataNet) Program Solicitation” NSF 07-601, US National Science Foundation, Office of Cyberinfrastructure, Directorate for Computer & Information Science & Engineering

http://www.nsf.gov/pubs/2007/nsf07601/nsf07601.htm

Page 27: US Office of Personnel Management: Notes on  "Big Data"

http://www.nitrd.gov/about/harnessing_power_web.pdf

IWGDD = [US] “Interagency Working Group on Digital Data”


Page 28: US Office of Personnel Management: Notes on  "Big Data"

IWGDD “DIGITAL DATA LIFE CYCLE”
Exhibit B-2. Life Cycle Functions for Digital Data*

• Plan
−− Determine what data need to be created or collected to support a research agenda or a mission function
−− Identify and evaluate existing sources of needed data
−− Identify standards for data and metadata format and quality
−− Specify actions and responsibilities for managing the data over their life cycle

• Create
−− Produce or acquire data for intended purposes
−− Deposit data where they will be kept, managed and accessed for as long as needed to support their intended purpose
−− Produce derived products in support of intended purposes; e.g., data summaries, data aggregations, reports, publications

• Keep
−− Organize and store data to support intended purposes
−− Integrate updates and additions into existing collections
−− Ensure the data survive intact for as long as needed

• Acquire and implement technology
−− Refresh technology to overcome obsolescence and to improve performance
−− Expand storage and processing capacity as needed
−− Implement new technologies to support evolving needs for ingesting, processing, analysis, searching and accessing data

• Disposition
−− Exit Strategy: plan for transferring data to another entity should the current repository no longer be able to keep it
−− Once intended purposes are satisfied, determine whether to destroy data or transfer to another organization suited to addressing other needs or opportunities

http://www.nitrd.gov/about/harnessing_power_web.pdf

Page 29: US Office of Personnel Management: Notes on  "Big Data"

www.dcc.ac.uk/docs/publications/DCCLifecycle.pdf


Page 30: US Office of Personnel Management: Notes on  "Big Data"

“JISC DCC Curation Lifecycle Model”

http://www.dcc.ac.uk/docs/publications/DCCLifecycle.pdf

Page 31: US Office of Personnel Management: Notes on  "Big Data"

Database Lifecycle Management

“The Database Lifecycle Management covers the entire lifecycle of the databases, including:
• Discovery and Inventory tracking: the ability to discover your assets, and track them
• Initial provisioning: the ability to rollout databases in minutes
• Ongoing Change Management: end-to-end management of patches, upgrades, schema and data changes
• Configuration Management: track inventory, configuration drift and detailed configuration search
• Compliance Management: reporting and management of industry and regulatory compliance standards
• Site level Disaster Protection Automation”

http://www.oracle.com/technetwork/oem/pdf/511949.pdf


Page 32: US Office of Personnel Management: Notes on  "Big Data"

W. K. Michener “Meta-information concepts for ecological data management” Ecological Informatics 1 (2006) 3-7

http://tinyurl.com/d49f3vm

Page 33: US Office of Personnel Management: Notes on  "Big Data"

“Sustainable data curation”

“There are several main elements necessary to sustain data curation:

“Robust data storage facilities (hardware and software) that are capable of accurately handling data migration across generations of media.

“Backup plans, that are tested, so irreplaceable data are not at risk. Unintended data loss can occur for many reasons: some major causes are: poor stewardship leading to the loss of metadata to understand where the data is located and documentation to understand the content, physical facility and equipment failure (fire, flood, irrecoverable hardware crashes), accidental data overwrite or deletion.

“Science-educated staff with knowledge to match the data discipline is important for checking data integrity, choosing archive organization, creating adequate metadata, consulting with users, and designing access systems that meet user expectations. Staff responsible for stewardship and curation must understand the digital data content and potential scientific uses. “

C.A. Jacobs, S. J. Worley, “Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge sharing,” from the 4th International Digital Curation Conference, December 2008, page 10.

www.dcc.ac.uk/events/dcc-2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]


Page 34: US Office of Personnel Management: Notes on  "Big Data"

Sustainable data curation (cont.)

“Non-proprietary data formats that will ensure data access capability for many decades and will help avoid data losses resulting from software incompatibilities…

“Consistent staffing levels and people dedicated to best practices in archiving, access, and stewardship…

“National and International partnerships and interactions greatly aids in shared achievements for broad scale user benefits, e.g. reanalyses, TIGGE…

“Stable funding not focused on specific projects, but data management in general…”

C.A. Jacobs, S. J. Worley, “Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge sharing,” from the 4th International Digital Curation Conference, December 2008, pages 10-11.

www.dcc.ac.uk/events/dcc-2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]


Page 35: US Office of Personnel Management: Notes on  "Big Data"

“Data Quality” ???

In general colloquial terms, “Data Quality” is the fundamental issue of concern to scientists, policy makers, managers/decision makers and the general public. “Quality” can be considered in terms of three primary values:

• Validity: logical in terms of intended hypothesis to be tested (all potential types of data that could be chosen should be weighed for probative value…)

• Competence (Reliability): consideration of the proper choice of expert staff, methods, apparatus/gear, calibration, deployment and operation

• Integrity: the maintenance of original integrity of data as well as tracking and documenting of all recording, migration, transformations and sequences of transformation of data
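The Integrity value is the most directly automatable: recording a checksum at every recording, migration, or transformation step documents the full chain. A minimal sketch (the data, step names, and the sample transformation are hypothetical):

```python
import hashlib

def checksum(data: bytes) -> str:
    """SHA-256 fingerprint; any alteration of the data changes it."""
    return hashlib.sha256(data).hexdigest()

# Provenance log: one (step, checksum) entry per transformation, so every
# state of the data is documented and later verifiable.
original = b"station,temp_c\nKLAX,18.3\n"        # hypothetical raw record
log = [("acquired", checksum(original))]

converted = original.replace(b"18.3", b"64.9")   # hypothetical transformation
log.append(("converted units", checksum(converted)))

# Integrity check: current holdings must match the last recorded checksum.
assert checksum(converted) == log[-1][1]
print(f"{len(log)} documented states of the data")
```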


Page 36: US Office of Personnel Management: Notes on  "Big Data"

“…the ‘validation’ of any scientific hypotheses rests upon the sum integrity of all original data and of all sequences of data transformation to which original data have been subject.”

– Tom Moritz, “The Burden of Proof”


http://imsgbif.gbif.org/CMS_NEW/get_file.php?FILE=2b032cf8212d19a720f21465df0686

Page 37: US Office of Personnel Management: Notes on  "Big Data"

A Primary Goal of Open Government

Public Access to Data that is:

• Of High Quality (SEE previous discussion)

• Free – no cost or minimal cost

• Open – easily discoverable and accessible
– “A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.” [ http://opendefinition.org/ ]

• Effective / Useful / Usable – both technically usable and descriptively identified in ways that support ready analysis, citation, use, reuse…

T. Moritz, “The Burden of Proof: Data as Evidence in Science and Public Policy,” Microsoft Research, GRDI2020, Stellenbosch, South Africa, Sept., 2010

http://www.grdi2020.eu/Pages/SelectedDocument.aspx?id_documento=87f1b6d5-5c30-42a7-94df-d9cd5f4b147c

Page 38: US Office of Personnel Management: Notes on  "Big Data"

Thanks for your attention…

Tom Moritz
Tom Moritz Consultancy
Los Angeles
[email protected]
+1 310 963 0199
tommoritz (Skype)

http://www.linkedin.com/in/tmoritz
http://www.slideshare.net/Tom_Moritz

Page 39: US Office of Personnel Management: Notes on  "Big Data"
Page 40: US Office of Personnel Management: Notes on  "Big Data"

Saturn images courtesy of R J Robbins and The Research Coordinating Network for the Genomics Standards Consortium…

Page 42: US Office of Personnel Management: Notes on  "Big Data"

Rosalind Franklin’s Image

http://philosophyofscienceportal.blogspot.com/2008/04/rosalind-franklin-double-helix.html

“Franklin's B-form data, in conjunction with cylindrical Patterson map calculations that she had applied to her A-form data, allowed her to determine DNA's density, unit-cell size, and water content. With those data, Franklin proposed a double-helix structure with precise measurements for the diameter, the separation between each of the coaxial fibers along the fiber axis direction, and the pitch of the helix.

“The diffraction photograph of the B form of DNA taken by Rosalind Franklin in May 1952 was by far the best photograph of its kind. Data derived from this photograph were instrumental in allowing James Watson and Francis Crick to construct their Nobel Prize winning model for DNA.” (Courtesy of the Norman Collection on the History of Molecular Biology in Novato, Calif.)

Page 43: US Office of Personnel Management: Notes on  "Big Data"

“Notebook entries show that Rosalind Franklin (a) recognized that the B form of DNA was likely to have a two-chained helix; (b) was aware of the Chargaff ratios; (c) knew that most, if not all, of the nitrogenous bases in DNA were in the keto configuration…; and (d) determined that the backbone chains of A-form DNA are antiparallel.” (Courtesy of Anne Sayre and Jenifer Franklin Glynn.)

http://philosophyofscienceportal.blogspot.com/2008/04/rosalind-franklin-double-helix.html