
White Paper Final Version


Data Issues in the Life Sciences

Anne E. Thessen* David J Patterson

Data Conservancy (Life Sciences)

Marine Biological Laboratory Woods Hole Massachusetts 02543

*Author for contact, [email protected]


EXECUTIVE SUMMARY

1. The expansion of the Life Sciences with a data-centric 'Big New Biology' will provide opportunities to reveal previously obscure truths about the living world and will provide scientists with more resources to address large, challenging questions.

2. This change will require a major investment in infrastructure, changes to current data practices, and an appropriately trained community of informaticians.

3. Changes that will improve the reuse of data within the Life Sciences include:
   a. Incentives to improve researcher readiness to share data,
   b. Extensions and merger (integration) of metadata standards and ontologies that can be used to organize data, to improve the discovery of data, and to enable machine reasoning on data,
   c. More technical support and infrastructure to support people, projects, institutions, and programs nationally and internationally,
   d. Persistent registries and repositories structured to curate data and to enhance the suitability of data for reuse,
   e. Aggressive implementation of semantic web technologies,
   f. New university courses that combine Computer Sciences with Life Sciences in the context of informatics.

4. The Life Sciences are very heterogeneous in terms of data cultures. No single approach to change will suit the needs of all of the Life Sciences. The most successful strategies are likely to be those that address needs in the context of sub-disciplines. The International Nucleotide Sequence Database Collaboration, which serves molecular biology with GenBank, EMBL and DDBJ, provides a good model of a domain-specific solution to the challenges that will accompany a transformation to data-centricity.

5. We do not yet have a detailed understanding of the spectrum and nature of data cultures in the Life Sciences, nor is there a common understanding of the nature of data, nor is there a widespread understanding among biologists of the nature and benefits of a more data-centric discipline. Considerable effort is required to communicate the benefits of new data-centric dimensions within the Life Sciences, and to promote participation in the transformation.


INTRODUCTION

The urgent need to understand complex, global phenomena and the emergence of improved data management technologies are driving an agenda in the Life Sciences to enhance data-driven discovery (National Academy of Sciences 2009). This development demands new approaches to sharing and querying existing data (Hey et al. 2009, Kelling et al. 2009). This document addresses some of the more proximate issues that scientists will face as they progress towards this 'Big New Biology'.

Data-driven discovery refers to the discovery of scientific insights and the generation of hypotheses through the novel management and analysis of pre-existing data. It relies on accessing and reusing data that will most likely have been generated to address other scientific problems, in contrast to the more familiar process of acquiring new data to address questions. While still hypothesis-based, data-driven discovery differs in character from scientific inquiry based on laboratory experiments or on making new field observations. Data-driven discovery requires a large virtual pool of data across a wide spectrum of the Life Sciences. The availability of such a pool will allow biology to join the other 'Big' (= data-centric) sciences such as astronomy and high-energy particle physics (Hey et al. 2009). Access to a pool will invite 'New' logic, strategies and tools (a 'macroscope') to discover trends, associations, discontinuities, and exceptions that reveal aspects of the underlying biology which are unlikely to emerge from more reductionist approaches (de Rosnay 1975, Ausubel 2009, National Academy of Sciences 2009, Patterson et al. 2010, Sirovich et al. 2010). The pool, and the resources from which it is assembled, may reveal factors, other than the properties that are intrinsic to biology, that shape knowledge. Insights into sociological trends which improve acuity or introduce distortions lead to a richer understanding of 'scientific certainty' (Evans & Foster 2011).

The emergence of a data-centric Big New Biology requires a stable cyberinfrastructure (National Science Foundation 2003, 2006, Burton & Treloar 2009, European Science Foundation 2006, URL 1). Registries and repositories must grow to meet the challenges of making data discoverable and accessible. The emerging Knowledge Organization System (Morris 2010) will need elements that aggregate disparate data sets. Such elements require flexible and evolving schemas that define categories of data across the Life Sciences, and ontologies to intelligently link vocabularies and to model existing knowledge. Semantic web technologies are needed to achieve flexibility of reuse. Enhanced user interfaces with organizational, analytical and visualization tools will be needed to allow scientists to interact with the data and the associated infrastructure. Most existing environments for data management are limited in scope. The Big New Biology requires a new mesh of biological, computer and information sciences, as well as changes to current cultures, to achieve data-centric bridges among the subdisciplines of the Life Sciences.


The best current example of a data-centric environment within the Life Sciences is provided by the International Nucleotide Sequence Database Collaboration, a group of linked repositories (GenBank, EMBL and DDBJ) for genetic sequences that receive data from willing scientists around the world (Strasser 2008). Data are freely available for reuse. Many new tools have appeared to take advantage of the data and to create new knowledge from them. The community and publishers actively endorse this environment. Many more environments like this, but serving other subdisciplines, will be needed if a Big New Biology, focused on aggregating and querying existing data in novel ways, is to emerge. A task of the Life Sciences component of the Data Conservancy is to address the sociological and technical issues that must accompany the expansion of the Life Sciences into a Big New Biology.

What are the Life Sciences?

The Life Sciences investigate all aspects of living systems, past and present. The phenomena that are studied can endure from nanoseconds to billions of years, and involve events that occur on a physical scale that extends from molecules to ecosystems. More than 2 million species of life have been described, one-tenth or less of the number that currently exists (Raup 1992, Chapman 2009). Events in each evolutionary lineage are unpredictable and there is no dominant organizing principle for biological data. This has led to different styles and methods of capturing data about nature, and to a diversity of data cultures. A large proportion of Life Sciences data is not in a sharable form (Heidorn 2008).

Not all aspects of the Life Sciences are considered here. Subdisciplines that address the interactions of organisms with the physical world, such as geochemistry and ecology, are not included because they are covered by the GEOBON initiative (Scholes et al. 2008) and, within the DataNet program, by the Data Observation Network for Earth (DataONE) (Reichman et al. 2011; URL 2). The medical sciences, while based on living organisms, are also excluded because they already have a robust informatics community, well-developed schema management systems such as UMLS and SNOMED (URL 3, URL 4), a rich array of environments addressing the management of data relating to human health (such as DICOM, URL 5), dedicated courses and journals (URL 6), software (URL 7), and communities addressing particular challenges (e.g. URL 8, URL 9; Bilimoria et al. 2003). We exclude the agricultural and food sciences because of their applied focus and because they too have an emerging infrastructure (Maurer & Tochterman 2010).


Academically, the Life Sciences have splintered into thousands of subdisciplines (Fig. 1), collectively referred to as Biology. As a science, Biology has to describe and understand phenomena that exhibit considerable variation (the phenomena are noisy), and in which the diversity of components and processes at all scales contributes to a massively complex array of interactions.

Figure 1: A word cloud constructed from the names of some subdisciplines of the Life Sciences illustrates the scope of the discipline.

Within individual organisms, the biology that is expressed is determined in part by genetic makeup. How the genes are expressed is determined by which alleles are present and by other genes, the condition of the organism, interactions with other organisms, or by environmental conditions. Mutations can lead to changes in genetic makeup within the lifetime of an organism or in subsequent generations. The evolutionary events in any one of the millions of evolutionary lineages may constrain options, but cannot tell us whether stasis, transformation, extinction, or speciation will happen next. There is a capacity for change in ways that cannot be predicted from the study of parts; this defining property of biology is referred to as emergence (Mayr 2004). As a result of inherent variation and complexity across a wide spectrum of scales, many parts of biology cannot be explained through general rules or through an exclusive focus on details. Biology is unlike those sciences in which the identification and cataloguing of components (such as the periodic table) and the discovery of the rules of interaction (such as Newtonian mechanics) explain large swathes of the discipline.

We represent biology as being contained within an envelope (Fig. 2), one axis of which extends from the shortest events known in life (the sub-nanosecond phenomena associated with electron and ion movements) to the most enduring processes (the evolutionary processes that began about 3.5 billion years ago and have continued to the present day). The other axis extends from the smallest biological objects (bioactive ions) to the largest (the complete biosphere).


Levels of organization within biology

One means of categorizing the Life Sciences and seeking common aspects relating to data issues is to group phenomena into 'levels of organization'. Those levels extend from molecular processes to ecosystem-level phenomena. Each level engages those below it and influences those above it. Each level requires appropriate instruments and associated data cultures. Molecular tools are becoming increasingly influential at all levels. The subdivisions are arbitrary, and individual phenomena can affect aspects at many other levels. A single mutational change in human haemoglobin is the source of sickle-cell anemia, but also provides sufficient protection against parasitism by malaria to lead to large-scale changes in the cultural development of human populations in Africa (Fleming et al. 1979).

Figure 2: Biological phenomena extend from molecular events that involve sub-Ångström-sized objects and last for fragments of milliseconds, to events that extend across the globe and have endured for 3 billion years or more of Earth's history. The part of the envelope occupied by expressed biological phenomena is shown in a lighter shade. Overlain are rectangles to indicate whether data about the underlying phenomena are acquired through individual experience (yellow), use of tools (green), or by instruments (blue and red), or, at the extreme, if knowledge is assembled largely by inference (purple).

Typical 'levels of organization' are:

1. Molecular biology and biochemistry. These subdisciplines address the structure, interactions and roles of molecules and their components. This domain includes molecular genetics, biochemistry, metabolic pathways, and cell physiology. While some molecules can be more than 10 mm long (such as DNA), molecular phenomena are usually expressed at the sub-nanometer level, and on short to very short (sub-nanosecond) time scales. Data are typically collected using instruments such as sequencers, spectrometers and chromatographs. Data are often born digital and sharing is common at this level. Major repositories exist for molecular data (Table 1).

2. Cellular. All organisms (arguably except viruses) are made of cells, and cell biology addresses components, events, and processes that occur within or via cells. Such events happen at scales mostly around 1-10 microns (10^-6 meters) but may occur in cells that are a meter or so in length or may involve sub-Ångström (10^-10 meters) components. Events are usually measured in seconds to minutes, but can extend to periods of hours (more rarely to years). Generally, cellular processes that are not part of molecular biology and biochemistry are studied through tools such as microscopes. There are no major internationally acclaimed repositories of data about cell biology, but there are smaller cell image repositories (URL 10). There are no widespread traditions of data sharing outside publications and inter-scientist social interactions.

Repository | Type of Life Sciences Data | Location
AlgaeBase | algae names and references | http://www.algaebase.org/
ArrayExpress | microarray | http://www.ebi.ac.uk/arrayexpress/
Australia National Data Service | general research data | http://www.ands.org.au/
ConceptWiki | concepts | http://conceptwiki.org/index.php/Main%20Page
CSIRO | fisheries catch | http://www.marine.csiro.au/datacentre/
Data.gov | natural resources data | http://www.data.gov/
Diptera database | Dipteran information | http://www.sel.barc.usda.gov/diptera/biosys.htm
EMAGE | gene expression | http://www.emouseatlas.org/emage/
ENA | gene sequences | http://www.ebi.ac.uk/ena/
Ensembl | genomes | http://uswest.ensembl.org/index.html
EUNIS | biodiversity | http://eunis.eea.europa.eu
Euregene | renal genome | http://www.euregene.org/
Eurexpress | transcriptome | http://www.eurexpress.org/ee/
EURODEER | movement of roe deer | http://sites.google.com/site/eurodeerproject/home
FishBase | fish information | http://www.fishbase.org/
FlyBase | Drosophila genes and genomes | http://flybase.org
GBIF | occurrences | http://www.gbif.org/
GenBank | gene sequences | http://www.ncbi.nlm.nih.gov/genbank/
GEO | microarray | http://www.ncbi.nlm.nih.gov/geo/
GNI | names | http://gni.globalnames.org/
INBIO | Costa Rican biodiversity | http://www.inbio.ac.cr/es/default.html
INSPIRE | spatial | http://inspire.jrc.ec.europa.eu/index.cfm
KEGG | genes | http://www.genome.jp/kegg/
Life Sciences Data Archive (NASA) | effects of space on humans | http://lsda.jsc.nasa.gov/
MassBank | mass spectra | http://www.massbank.jp/index.html?lang=en
MGI | mouse | http://www.informatics.jax.org/
MorphBank | images | http://www.morphbank.net/
OBIS | occurrences | http://www.iobis.org/
OMIM | human genes and phenotypes | http://www.ncbi.nlm.nih.gov/omim
PDB | molecule structure | http://www.pdb.org/pdb/home/home.do
PRIDE | proteomics | http://www.ebi.ac.uk/pride/
PubMed | citations | http://www.ncbi.nlm.nih.gov/pubmed/
Stanford Microarray Database | microarray | http://smd.stanford.edu/
tair | Arabidopsis molecular biology | http://www.arabidopsis.org/
Taxon Concept | species descriptions | http://taxonconcept.org
TOPP | animal tagging | http://www.topp.org/topp_census
TreeBase | phylogenetic trees | http://www.treebase.org/
TROPICOS | plant specimens | http://www.tropicos.org/
UniProt | protein sequence and function | http://www.uniprot.org/
WILDSPACE | life history information | http://wildspace.ec.gc.ca/more-e.html
WRAM | wireless remote animal monitoring | http://www-wram.slu.se/

Table 1. Examples of repositories for Life Sciences data.

3. Tissue-level events are coordinated interactions involving many cells within multicellular organisms, and usually relate to processes that take minutes or hours (but may be as short as fractions of seconds or extend to years); they include developmental and physiological phenomena and are accessible by direct observation, through microscopes, and by instruments. There are no widespread traditions of data sharing outside publications and inter-scientist social interactions.

4. Organismal phenomena include behavior, growth, development, and appearance. The temporal and spatial dimensions depend on what kinds of organisms are under consideration: individual bacteria may be less than 1 micron in size and may undergo an entire life cycle in hours; large organisms may extend to tens of meters and have life cycles that extend to thousands of years. Most of the data are collected by direct observation and are communicated through a narrative. Data are often held in small sets and can include complex data objects which cannot be directly digitized, such as specimens.

5. Populations are interacting collections of individuals of one (or more symbiotic) species. Populations, such as herds of wildebeest or schools of mackerel, inhabit a patch in space and time that is influenced by the size of the organisms, their diffusivity, their interactions with the physical world and other species, and by their evolutionary history. Patches may extend across scales of millimeters to thousands of kilometers, and over periods that involve many generations, from days to millennia. Populations may have a discrete genetic identity achieved through inbreeding. Phenomena are studied through disciplines such as ecology and genetics. Non-genetic data are mostly obtained through direct observation and disseminated through narrative publication.

6. Species-level phenomena occur within groups of organisms that have a sufficiently distinctive (genetic) identity to be treated as a species. A species may include one or more populations. Areas of activity include appearance (taxonomy), change (evolution), distribution (ecology), and loss (extinction). Processes can occur over millions of years. Distributional aspects may extend up to pan-global scales. Species-level phenomena can be observed or have to be inferred, and are typically described through narrative. There are approximately 1.9 million extant and 0.3 million extinct species currently described (Chapman 2009, Raup 1992), with 20,000 or so new species being described each year (SOS report 2010).

7. Ecological or ecosystem aspects address interactions within and among communities and with the physical world. The disciplines explore issues relating to abundance, patchy distributions in space and time, and roles within food webs, energy and nutrient flows, biogeochemistry, etc. Depending on the sizes of the organisms involved, the subject is addressed at scales that extend from sub-millimeter (for microbes) to the full extent of the Earth's surface. Ecological phenomena extend from minutes through to the full history of Earth. At the more extensive range, data may be derived from satellites and other remote sensing devices. Many existing or emerging databases exist for environmental data (Reichman et al. 2011). Understanding of some aspects may be inferred.

Different means of acquiring data

Data are acquired in different ways in different regions of the envelope (Fig. 2). We refer to data that are acquired by direct human observation (Fig. 2, yellow box) as first-person data. These data are limited to objects and processes ranging from about 1 mm to 10 km in size, and that endure from about a second to a decade. We can gather data from a more extended range through the use of tools; smaller objects, for example, can be detected with microscopes (Fig. 2, green box). The concept of tools that we use to extend our capacity to observe grades into the concept of instruments: devices to which we defer responsibility for data collection. Narratives deriving from and including first-person scientific data have been compiled in our literature for about 250 years. Included are some cellular and tissue phenomena; most population, organismic, and species-level data; and some ecological phenomena. First-person data are often very selective and do not fairly represent the world from which they are drawn. Most first-person data are held in many small sets (i.e. they make up the long tail of biological data; Heidorn 2008). While typically 'small science' in nature, some participatory environments (such as eBird, URL 11) are bringing together the efforts of tens of thousands of observers. There is a strong tradition of transforming data into knowledge by use of the narrative, and then discarding the data.

Instruments are devices that acquire data on our behalf when the phenomena are too small, too big, too short or too long to be observed directly. Events too short or physically too small to be observed directly include molecular and biochemical phenomena, such as molecular genetics, ion movements by molecular pumps, and metabolic processes such as photosynthesis (Fig. 2, blue box). Instrument data tend to be born digital and are often associated with experiments. Some areas, especially molecular biology, have good repositories and a more sophisticated culture of data organization and sharing relative to other areas of the Life Sciences. Data on long-term or extensive processes (Fig. 2, red box) are acquired through instruments such as monitoring platforms and satellites. Data from these sources are also captured and preserved as electronic files. Some phenomena, such as evolutionary processes and the various transformations of the biosphere that extend to billions of years, are informed through observations of fossils, geology, and geochemistry (Fig. 2, purple box). Understanding often relies on small fragments of information. There is a culture of preserving the specimens and samples that inform those areas.


SOCIOLOGICAL ISSUES

As the study of human social behavior, sociology includes the behavior and practices of scientists. We refer to the sociological factors that determine the destiny of data as data cultures. If we are to promote a shift to a Big New Biology, we need to understand current practices and their diversity, which elements favor that transformation, and which aspects will hinder it (Evans and Foster 2011).

Data cultures

The phrase data culture refers to the explicit and implicit (Evans & Foster 2011) data practices and expectations of the relevant scientific community. Data cultures relate to the social conventions of acquisition, curation, preservation, sharing, and reuse of data. While there is no published, detailed survey of data cultures in the Life Sciences, there have been sufficient studies to confirm that there is no single data culture for the Life Sciences (Norris et al. 2008, Gargouri et al. 2010, Key Perspectives Ltd. 2010). This is unsurprising given the scope and scale of the Life Sciences. The cultures range from the field biologist whose data are captured in short-lived notebooks to the molecular biologist whose data are born digital in near-terabyte quantities and are widely shared through global data repositories. If the goal is to make data digital, standardized and openly accessible in a reusable format, then the current data cultures provide the starting points that determine the cultural and technical changes needed before that vision can be realized. We do not know how many cultures there are, nor if cultures are discrete or form a rich continuum. We discuss below some factors that influence or define data cultures in the Life Sciences.

1. What are data? The term data is not used consistently. For some, the term is limited to raw data; for others, it widens to include any kind of information, or indeed any process that leads to insights. We seek to limit the term to discriminate what is neutral, objective, and largely independent of context or observer. It is this class of 'raw data' we refer to here as 'data'. As data become constrained, filtered and selected, they acquire or are even assigned a meaning in the context of what they apply to. This process, coupled with others, transforms data into information (Ackoff 1989; Fig. 3). Knowledge is comprised of those elements of information that are universally accepted. Wisdom is the application of knowledge (Morris 2010).

Figure 3: Data are neutral, objective, and largely independent of context or observer. Raw data are analyzed, filtered, and given meaning within a context - that is, they become information. Information that is universally agreed is knowledge, and the agreed composite of knowledge and its application is what we regard as wisdom.


2. Contextual categorization of the data. The context in which biological data are acquired or generated is important to understanding how those data can be appropriately reused. The context may be formed by observer interpretation, by the tools or instruments used, or may be imposed because data are gathered in an experimental (unnatural) setting. In addition, individuals and technologies are selective and capture a limited subset of all available data, and data are affected by the choice of instrument and analytical processes. Context can be represented through the application of appropriate metadata. We categorize the following broad types of data based on context.

A. Observational data relate to an object or event actually or potentially witnessed by an agent. An agent may be a person, team, project, or initiative, and may call upon tools and instruments. Key metadata will identify the agent; specify date, location, and contexts such as experimental conditions if relevant; and record the equipment that was used. Within the Life Sciences, the metadata should include taxon names, the basis for the identification, and/or pointers to reference (voucher) material.

1. Descriptive data are non-experimental data collected through observations of nature. Ideally, descriptive data can be reduced to values about a specified aspect of a taxon, system, or process. Each value will be unique, having been made at one place, at one time, by one agent. Observations may be confirmed but not replicated, such that it is important to preserve these data. Preservation often does not occur, as data of this type are often discarded after completion of the research narrative. A formal framework for descriptive data has been developed in the context of the OBOE project (Madin et al. 2007a). Descriptive data can be collected by instruments or by individuals (i.e. are first-person data). First-person data may not completely represent the world. Mistakes can be made, such as misidentification of taxa (MacLeod et al. 2010). Researchers may be selective about the data they seek to gather, either intentionally or unintentionally, such that data sets have limited applicability. For example, counts of bird species can be biased because noisy birds are more likely to be seen and counted than quiet birds (Bibby et al. 2000). GBIF, the Global Biodiversity Information Facility (URL 12, data accessed Feb 2011), contains data on more than 250,000 occurrences of birds worldwide, but only on 8,962 nematode occurrences, despite nematode abundances of tens of millions of individuals within a square meter column of soil (Lal 2006). Some individuals may discard data that are not in keeping with their expectations. Few or no raw data may be recorded, such that the information may only be available in an interpreted form, for example as drawings rather than photographs.

The acquisition of first-person data is rate-limited, constrained by the number of observers. The rate of collecting first-person data tends to plateau much sooner than that of instrument-derived data. Data born digital, such as molecular data, continue to show geometric rates of growth, as indicated by the growth of GenBank (Fig. 4a), now custodian of over 100 billion bases and over 12 petabytes of data. In comparison, the description of new species, a process that is still dominated by the narrative approach, has leveled out at about 20,000 species per year (Fig. 4b, SOS report 2010). Biological phenomena that endure for less time, or that are outside the physical scales we can easily register, are accessed through instruments (Fig. 2). Molecular sequencing devices are the source of vast amounts of comparative information about biology across many levels of organization. New high-throughput machines can sequence billions of bases in days, with single machines generating terabytes of raw data in that time (Doctorow 2008), while projects such as the 1000 Genomes Project generate almost 100 TB of raw data per week (Röhm & Blakeley 2009). Large-scale instrumentation, such as satellites, collects data from large swaths of the globe. The NASA SeaWiFS satellite can gather information from the entire globe in two days (Hooker & McClain 2000).

Figure 4: (a) Growth of sequence information in GenBank (1982 to present) vs (b) number of described species (of fish). Molecular data show continuing growth, whereas new taxonomic insights that depend on first-person observations have plateaued for about 150 years.

2. Experimental data are obtained when a scientist changes or constrains the conditions under which the expression of a phenomenon occurs. Experiments can be conducted across a broad range of scales, from electrophysiological investigations of sub-millisecond processes within cells (Bunin et al. 2005) to manipulations of oceanic ecosystems (Coale et al. 2004). The intent is to dissect the elements of the phenomenon by changing conditions to uncover causal relationships, or to identify variant and invariant elements of biological processes. The experimental paradigm characterized much of the research in the Life Sciences in the 20th century. The paradigm assumes that it will uncover robust underlying phenomena such that the experiment, if repeated, will produce the same results. Given the variation that is inherent in biology, it does not follow that the outcomes of experiments can ever be perfectly replicated. Raw data are contextualized by the experimental framework, and may have limited or no value in other contexts. It is important for metadata to include information about the source and storage of material before the experiment, experimental conditions, equipment, controls and treatments.

B. Processed data are obtained through a reworking, recombination or analysis of raw data. There are two primary types.

1. Computed data result from a reworking of data to make them more meaningful or to normalize them. In ecology, information about the productivity of an ecosystem is important, but productivity and the extent of the ecosystem are rarely measured directly. Rather, they are computed using information or data from other sources to generate measurements of the amount of carbon or mass that is generated per unit area per unit time. While computed data may be held in the same regard as raw data, choices or errors in formulae or algorithms may diminish or invalidate the data created. Raw data and information on how computed data were derived (provenance) are important for reproducibility; the metadata should provide this information. It is expected that computed data will grow as the virtual data pool expands.

2. Simulation data are generated by combining mathematical or computational models with raw data. Often, models seek to make predictions of processes, such as the future distribution of cane toads in Australia under various climatic projections. The proximity of predictions to subsequent observations is used to test the concepts on which the model is based, and to improve the model and our associated understanding of biology. The metadata differ dramatically from those of other data types: the date of the run, the initial conditions of the model, the resolution of the model output, the time step, etc. are important. Rerunning the model may require preservation of the initial conditions, the model software, and even the operating system (URL 13). Simulation data become less useful as they age and can become a storage burden.
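As a minimal sketch of the 'computed data' category described above, the following derives an areal productivity estimate from raw measurements; all variable names and values are illustrative, not drawn from any real dataset, and a real workflow would also record the formula and inputs as provenance metadata.

```python
# Sketch: deriving 'computed data' (areal productivity) from raw measurements.
# All names and numbers are illustrative only.

raw_carbon_fixed_mg = 540.0   # raw measurement: carbon fixed (mg C) in an incubation
plot_area_m2 = 2.5            # area sampled (m^2)
duration_days = 3.0           # length of the incubation (days)

# Computed datum: normalize to a comparable rate (mg C per m^2 per day).
productivity = raw_carbon_fixed_mg / (plot_area_m2 * duration_days)

# Provenance metadata that should accompany the computed value (see text).
record = {
    "value": round(productivity, 2),
    "units": "mg C m^-2 d^-1",
    "derived_from": ["raw_carbon_fixed_mg", "plot_area_m2", "duration_days"],
    "formula": "raw_carbon_fixed_mg / (plot_area_m2 * duration_days)",
}
print(record)
```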

These categories of data can be used as part of the framework for managing Life Sciences data within the Data Conservancy.


3. Data readiness

Readying data to be contributed to a shared pool often involves a series of steps or stages that relate to the capture, digitization, access, discoverability, structure, and mobility of data. The situation with molecular data, achieved by the International Nucleotide Sequence Database Collaboration comprising the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and the NCBI GenBank in the USA, is exemplary. Molecular data tend to be born digital, and are submitted in standard formats to centralized repositories in which they are freely available for reuse in a standard form. Yet, set in the context of the Rogers adoption curve (Fig. 5; Rogers 1983), and as suggested by Figure 6, the Life Sciences, generally, are closer to the 'early adopters' stage of the transition to open access than other sciences. It is still unusual for data created by individuals or small groups to be made ready and openly available for sharing (Davis 2009).

Figure 5: Rogers' adoption curve describes the acceptance of a new technology. The Life Sciences are still in the 'Early Adopters' phase for accepting principles of data readiness.

The Long Tail in Biology

Figure 6: Relative percentage improvement in citation associated with an article being published as open access. The increase is less for biology than for other disciplines. Redrawn from Harnad (2010, URL 24).

Observational and processed data have characteristics that vary widely in terms of quantity, digital status, and openness, which together make up overall availability (Heidorn 2008, Key Perspectives Ltd. 2010). The distribution of data packages can be represented as a hollow curve (Fig. 7). To the left are a small but growing number of agents producing data in large packages at a high rate. That end of the spectrum includes high-throughput biology such as remote monitoring programs or molecular analyses of natural communities (Sogin et al. 2006). Data are often collected via instruments such as sequencing machines which, as indicated above, can produce terabytes of data in a matter of hours. A major challenge at this extreme end of the spectrum is that the amounts of data produced often exceed the ability of the hardware to serve data to remote clients (Kahn 2011) and the ability of software to manipulate them (Doctorow 2008). As a result, centralized analysis of large, unprocessed federated data files has to be relinquished in favor of distributed analysis of smaller, processed data files. To the right of the hollow curve is the long tail of biology, which reflects the many providers with small amounts of data.

Agency | Country | Data policy | URL
National Institutes of Health | US | If > $500,000, data must be released no later than the acceptance of publication of the main findings from the final data set | -
NERC | UK | Data release no later than publication or within 3 years of generation; researchers are expected to ensure data availability for 10 years after completion of project | www.nerc.ac.uk/research/sites/data/policy.asp
Wellcome Trust | UK | Data must be made available within 2 years from the end of data collection | www.wellcome.ac.uk/About-us/Policy/Policy-and-position-statements/WTX035043.htm
Department of Energy | US | Requires deposit of 1) protocols, 2) raw data, 3) other relevant materials no later than 3 months after publication | http://genomicsgtl.energy.gov/datasharing
Chinese Academy of Sciences | China | Requires deposit or no further funding | http://english.cas.cn/
Australian Research Council | Australia | No policy | http://www.arc.gov.au/default.htm
National Science Foundation | US | Data must be available no more than 2 years after end of project | http://www.nsf.gov/bfa/dias/policy/dmp.jsp
Austrian Science Fund | Austria | Data can be embargoed for 2 years | http://www.fwf.ac.at/en/public_relations/oai/index.html
NASA | US | - | http://science.nasa.gov/earth-science/earth-science-data/data-information-policy/
NOAA | US | - | http://www.ncdc.noaa.gov/oa/about/open-access-climate-data-policy.pdf
Council for Scientific and Industrial Research | India | Plan being developed in 2010 | http://rdpp.csir.res.in/csir_acsir/Home.aspx
North Pacific Research Board | US | Data must be transferred to NPRB by the end of the project | http://www.nprb.org/projects/metadata.html
Japan Science and Technology Agency | Japan | None | http://www.jst.go.jp/EN/index.html
National Research Foundation | South Africa | None | http://www.nrf.ac.za/

Table 2: List of funding agencies and characteristics of their data policies.

Citizen Scientists

Citizen scientists are non-professionals who participate in scientific activities. Citizen science covers many subjects, but the appealing richness of nature, its accessibility, and our reliance on natural resources ensure that biology attracts an especially high participation by the citizenry (Silvertown 2009). The academic skills of citizen scientists cover a massive spectrum, from those with casual interests in nature or science to individuals who publish in the scientific literature. There are tens of millions of birders in the US (Kerlinger 1993), a number that translates to more than 100 million worldwide. The number of recreational fishermen in marine waters approaches that of birdwatchers (Arlinghaus & Cooke 2009, Cisneros-Montemayor & Sumaila 2010), and an estimated 500 million people have livelihoods attached to fishing (URL 57). That suggests that the potential citizen scientist community exceeds 1 billion people. This remarkable pool can be called upon to add sightings (the occurrence of a given species at a particular location at a particular time), which can be used to monitor the changing distributions and abundances of endemic and invasive species. The Swedish ArtPortalen (URL 33) has in 10 years compiled more than 26 million sightings, at a rate of about 10,000 per day, illustrating the irreplaceable role of the citizen scientist. Several mobile phone apps exist that allow naturalists to record species occurrences in the field (BirdsEye from eBird, URL 58, and Observer from WildObs, URL 59). Data on occurrences, or on the first occurrences of flowering or the appearance of migratory species, can be called on to test scientific hypotheses as to the impact of climate change on the biosphere. Citizen scientists are significant monitors of endangered species, providing the first evidence that some presumed-extinct species, such as the coelacanth (URL 60), Wollemi pine (URL 61), ivory-billed woodpecker (URL 62), Lord Howe Island stick insect (URL 63) and mountain pygmy possum (URL 64), are still with us.

Repositories

A repository provides services offered to a community for the management and dissemination of data including, ideally, protection of the integrity of the data, long-term preservation, access, and migration to new technologies (Lynch 2003). Most repositories typically handle a specific data type at a particular granularity. Thousands of repositories already exist for managing Life Sciences data and hold tens of millions of items (Table 1; see Jones et al. 2006 and URL 65 for more). However, it is estimated that less than 1% of ecology data is captured in this way (Reichman et al. 2011). Repositories range in functionality from basic databases that store data to collaborative databases that incorporate analysis functions (WRAM, Wireless Remote Animal Monitoring, URL 66). The pathways in and out can determine whether or not a repository is populated and whether data within the repository are reused (Wren & Bateman 2008).


Many repositories are difficult to access or are not maintained (Wren & Bateman 2008). Failure of a repository can result from policy shifts, funding instability, management issues, or technical failures (Lynch 2003). Such failures can undermine acceptance of digital scholarly work by the community at large. As data repositories become more important over time, they must be trusted to provide high quality services reliably (Schofield et al. 2010, Klump 2011). The trustworthiness of archives can be assessed using criteria catalogues (Klump 2011) available from organizations like the Digital Curation Center (Innocenti et al. 2007) and the International Standards Organization (ISO 2000). The Center for Research Libraries has assembled a list of ten principles for data repositories that addresses administrative and technical concerns (URL 67).

Data Policies

A data policy is a list of philosophical statements and procedures that describe the beliefs and regulations of an agent concerning the production, sharing and reuse of data. Some sources for policies can be found in Table 2 and at Biosharing.org (URL 68). Discipline-based efforts to accommodate data-sharing problems and funder-mandated protocols have led to a piecemeal array of policies (Field et al. 2009). Researchers may feel that their ability to comply is limited by inadequate funding, time, software, hardware, expertise or personnel. Many data-sharing policies are ignored or even resisted (Savage & Vickers 2009). An impediment to sharing data across the Life Sciences is the absence of an over-arching framework for data-sharing. Generalized policies that can be applied widely within the Life Sciences can be used to promote desirable trends of community participation, infrastructures, tools and repositories that achieve consistency and distribute costs. The development of such policies is as much of a challenge as the technical aspects of data management. General policies need to be extendable to suit the needs of the subdisciplines or agents involved. Field et al. (2009) recommend the following steps in developing data policy.

1. Identify science driver(s) necessitating a formal data policy for a particular community
2. Create a working body to bring the data policy to fruition
3. Conduct an initial poll of researcher and funder priorities with respect to data policy development
4. Identify the full range of stakeholders
5. Research current policies and draw from them and the literature
6. Draft a straw man document and define key aspects of the policy:
   a. Scope of policy
   b. Applicability
   c. Funding levels required


7. Subject the straw man to internal and external rounds of consultation, followed by iterative improvement
8. Obtain formal sign-off or endorsement by the organization of a final draft, post the final draft onto an appropriate public website, and publicize it
9. Set into motion support for the policy
10. Monitor compliance and enforce the policy
11. Extend the policy to cover subareas of science/data as required
12. Evolve or deprecate the policy as required

The following list of key issues (Step 6 above) was developed during a Data Conservancy (Life Sciences) workshop held in the summer of 2010 that was attended by computer, life and information scientists from academia, government and the private sector. The issues are:

1. Scientists have the right to first use of data they produce
2. Data providers should have a choice of licenses
3. Repositories must provide access to data they hold
4. Scientists must receive attribution when data they produce are used
5. Open formats should be used
6. Existing standards should be used where available
7. Tools and formats used should be free and open-source
8. Data should be collected with interoperability in mind
9. Some data are valuable and should be preserved over time

The above can be embellished to create more granular data policies that meet the needs of subdomains. Once the scope and purpose are agreed, the key issues can be embedded in a Data Policy template (Box 1) and distributed to interested parties. As data policies become effective, they help to define the data cultures.


1 Overview
   Title of plan; Author; Date; Revision; etc. Project name; Award information; Funding agencies; etc.; reference to main proposal.
2 Expected Data
   2.1 Data: What data get created by the project, and in what form? What raw data are generated; what processed data are generated? What data are expected to be managed by the project for sharing and later archiving? Who is expected to use the (shared) data?
   2.2 Data Formats: What data formats will be used for the data generated? What tools will be required to read the data?
   2.3 Data Generation and Acquisition: How are the data generated and how are they acquired? What quality control/standards are applied to data generation, acquisition and storage? When are data generated? What is the frequency and rate of data generation?
   2.4 Software: What software does the project create? What will be managed and what won't be managed? Will software be archived? Will software be made available for sharing? Will there be any licensing, and if so what?
   2.5 Documentation and Metadata: What data and metadata standards will be employed? How will metadata be generated (automatically, manually, or both)? How will metadata be stored and managed? How will unique identifiers be managed? What naming schemes will be used (data dictionaries/taxonomies/ontologies)?
3 Data Storage and Preservation
   3.1 Storage and Backup During the Project: Who is responsible for the stored data? Who is responsible for data backups? What digital and non-digital data will be stored? Where will the data be stored and backed up, and what policies will be in place? What will be the access controls on stored data? What are the backup procedures for the data generated?
   3.2 Data Capacity / Volume: Volumes of data and rates of creation and ingestion?
   3.3 Security: Are there any data with specific security issues? How will security be enforced in the system?
   3.4 Operational Storage Post-Project Completion: How will data be stored after the project has been completed? What mechanisms, policies, agreements, etc. will be used to manage data after the project has been completed?
   3.5 Long Term Archiving and Preservation: What data will be archived? Where will data be archived? Who will manage and administer the archive? What metadata will be required? What will be the access controls? What will be the retention and disposition policies?
   3.6 Roles and Responsibilities: Who makes decisions regarding the overall data management (e.g., PI)? Who makes decisions regarding day-to-day data management (e.g., PI)? What is the role and responsibility of the organization that preserves the data?
4 Data Retention
   How long will each type of data be kept, and why? When will data be made available for sharing? Are there any data embargoes, and if so what? When will the data be made public? What is the archival lifecycle and retention policy for archived data?
   4.1 Operational Data: Who will be responsible for the data in the near term following project completion?
   4.2 Archival Data: Who will be responsible for the data for long-term archiving (beyond the most active use of the data)?
5 Data-Sharing and Dissemination
   What data will be shared? When will data be shared? What restrictions are there on subsequent data use? How will the data be made available? What metadata will be generated to ensure the data are accessible?
   5.1 Stakeholders: Who will data be made available to? What data will be made available to which stakeholders?
   5.2 Privacy and Confidentiality: Are there any data with privacy issues? Are there any data relating to human subjects, and what policies need to be adhered to? How will any such privacy requirements be enforced?
   5.3 Ownership, Copyright and IP: Are any of your data copyrightable (i.e. non-factual in nature)? If so, who holds that copyright (e.g. PI, university, funder)?
   5.4 Third Party Data: Are any of the data owned by someone else? What are the conditions of use, sharing and dissemination?
   5.5 Legal and Regulatory: Describe any other legal and/or regulatory constraints on sharing and dissemination of data.
   5.6 Re-Use: What is the policy on re-use of the data, citations, and production of derivatives?
   5.7 Ethical Requirements: Does this work involve human subjects, and if so what policies and procedures must be adhered to? What other ethical requirements are in place for the data generated?

Box 1. Elements of a data management plan based on the "Data Conservancy" template (URL 69).
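A plan built from a template like Box 1 can also be captured in machine-readable form so that some elements of compliance can be checked automatically. A minimal sketch of that idea follows; the field names loosely mirror Box 1 headings, and all values and the retention rule are invented for illustration.

```python
# Sketch: a machine-readable fragment of a data management plan.
# Field names loosely mirror the Box 1 template; values are invented.

dmp = {
    "overview": {"title": "Example DMP", "project": "Example project", "revision": 1},
    "expected_data": {
        "raw": ["field observations (CSV)"],
        "processed": ["normalized occurrence records"],
        "formats": ["text/csv"],
    },
    "storage": {"backup": "nightly, off-site", "archive": "institutional repository"},
    "retention_years": 10,
    "sharing": {"embargo_months": 12, "license": "CC0"},
}

# A simple automated check against a hypothetical funder requirement:
if dmp["retention_years"] < 10:
    print("Plan does not meet the assumed 10-year retention expectation")
else:
    print("Retention requirement satisfied")
```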


TECHNOLOGICAL ISSUES

The second array of challenges that will need to be addressed as we move towards a Big New Biology comprises the technical issues that affect the accessibility and reuse of data.

Making data accessible

The effective reuse of data requires that an array of conditions (Fig. 8) is optimized.

Figure 8: A Big New Biology can only emerge with a framework that optimizes reuse. Ideally, data should be in forms that can flow from source into a common pool and can flow back out to consumers, be subject to quality control, or be enhanced through analysis to rejoin the pool as processed data.

Data are digital. Digitization is a prerequisite for data mobility. As noted earlier, considerable amounts of relevant data are not yet in a digital format. Non-digital formats include notes, books, photographs and micrographs, papers, and specimens. Digital metadata about non-digital materials have value, as they make the data discoverable and increase the incentives for digitization.

Data are structured. Digital data may be unstructured (e.g. in the form of free text or an image) or they may be structured into categories that are represented consecutively or periodically through the use of a template, spreadsheet or database. The simple structure of a spreadsheet allows records to be represented as rows. Each record contains data in categories defined by metadata (headers) at the top of each column. Data occur within the cells formed by the intersection of a row and a column. A source may mix both structured and unstructured data, such as when fields include free-form text, images, or atomic data. Unstructured data, such as the legacy data to be found in an estimated 500 million pages of text, can be improved through annotation with metadata; this can be achieved by curators or by applying tools (such as natural language processing tools) that discover elements that can be treated as metadata.

Data are normalized. Normalization brings information contained within different structures to the same format (or structure). Normalization may be as simple as consistently using one type of unit. Placing data within a template is a common first step to normalization. Normalization is a prerequisite for aggregating data. When data are structured and normalized, they can be mobilized in simple formats (tab-delimited or comma-delimited text files) or can be transformed into other structures to meet agreed-upon standards.
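As a minimal sketch of the structuring and normalization steps just described, the code below maps two differently structured occurrence records onto one shared set of column headings and one unit of measurement. The field names are illustrative only; a real project would adopt a community standard (such as Darwin Core, Table 3) rather than invent its own.

```python
# Sketch: normalizing two differently structured records to one shared structure.
# Field names are illustrative, not drawn from any particular standard.

def normalize(record, mapping, length_factor=1.0):
    """Rename fields via `mapping` and convert lengths to millimeters."""
    out = {target: record[source] for source, target in mapping.items()}
    out["length_mm"] = float(out["length_mm"]) * length_factor
    return out

# Provider 1 reports lengths in centimeters, with its own headers.
rec1 = {"sp": "Turdus migratorius", "len_cm": "24.1", "when": "2011-03-05"}
# Provider 2 reports lengths in millimeters, with different headers.
rec2 = {"species": "Erithacus rubecula", "length": "140", "date": "2011-03-07"}

rows = [
    normalize(rec1, {"sp": "scientificName", "len_cm": "length_mm", "when": "eventDate"}, 10.0),
    normalize(rec2, {"species": "scientificName", "length": "length_mm", "date": "eventDate"}),
]
print(rows)  # both records now share one structure and one unit
```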

Data Conservancy: Data Issues in the Life Sciences (March 2011)

27

DiGIR is an early example of a data transformation tool (URL 70). More contemporary tools, such as TAPIR or the IPT from GBIF (URL 71), can output data in an array of normalized forms.

Data are standardized. Standardization indicates compliance with a widely accepted mode of normalizing. Standards provide terms that define data and the relationships among categories of data. Two basic types of standards that are indispensable for the management of biological data are metadata and ontologies.

Metadata are terms that define data (data about data) in ways that may serve different purposes, such as helping people to find data of relevance (discovery; Michener 2006) or to bring data together (federation). Metadata standards articulate how data should be named and structured, thus reducing the heterogeneity of terms. Standards mandate the types of metadata that are appropriate for different types of observations. Sets of metadata terms agreed upon by a community are referred to as controlled vocabularies, one of the most extensive bearing on the Life Sciences being the Ecological Metadata Language (EML; Fegraus et al. 2005). By articulating what metadata should be applied and how they should be formatted, standards introduce the consistency that is needed for interoperability and the context for machine reasoning. For example, a marine bacterial RNA sequence collected from the environment ideally might be accompanied by metadata on location (latitude, longitude, depth), environmental parameters, collection metadata (collection event, date of collection, sampling device), and an identifier for the bacterium. Without such metadata, the scope of possible queries is much reduced. Examples of minimum reporting requirements have been established by the MIBBI project (Taylor et al. 2008). Numerous metadata guides are available within the Life Sciences (Table 3). There are software programs available to assist in the collection and organization of metadata (such as Morpho, URL 72, Higgins et al. 2002; Metacat, URL 73, Jones et al. 2002; MERMAid, URL 74).

An ontology is a formal statement of relationships among concepts (represented by metadata terms) that allows for the discovery of data through relationships. Ontologies may use formal descriptive languages to define relationships within systems of metadata. Ontologies are regarded as having great promise (Madin et al. 2007b): "An ontology makes explicit knowledge that is usually diffusely embedded in notebooks, textbooks and journals or just held in academic memories, and therefore represents a formalization of the current state of a field. If ontologies are properly curated over the longer term, they will come to be seen as modern-day (albeit terse) textbooks providing online and up-to-date biological expertise for their area. In another sense, they will provide the common standards needed for producing a strong biological framework for integrating data sets. Ontologies therefore provide the formal basis for an integrative approach to biology that complements the traditional deductive methodology" (Bard & Rhee 2004).
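A minimal sketch of the kind of metadata-annotated record described above (the marine bacterial RNA example) follows. The keys loosely echo minimum-reporting checklists of the sort registered by MIBBI (e.g. MIxS), but they are illustrative rather than any real standard's schema, and all values are invented.

```python
# Sketch: a sequence record carrying the contextual metadata discussed above.
# Keys and values are illustrative; a real record would follow a community
# standard (e.g. a MIxS checklist) rather than this ad hoc layout.

record = {
    "sequence": "ACGGUACGUUAGC",               # the RNA sequence itself (shortened)
    "taxon": "uncultured marine bacterium",     # identifier for the organism
    "location": {"lat": 41.52, "lon": -70.67, "depth_m": 5.0},
    "environment": {"temp_c": 12.3, "salinity_psu": 31.8},
    "collection": {
        "event": "station 7, cruise AB-2011-02",
        "date": "2011-02-14",
        "sampling_device": "Niskin bottle",
    },
}

# With such metadata, a query like "sequences from shallower than 10 m" becomes possible:
if record["location"]["depth_m"] < 10:
    print(record["taxon"], "matches the query")
```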

Standard | Type | Location
ABCD | Schema | http://www.bgbm.org/TDWG/CODATA/Schema/default.htm
Bioontology | Ontology Repository | http://www.bioontology.org/
BIRN | Repository | http://www.birncommunity.org/
Cardiac Electrophysiology Ontology | Ontology | http://bioportal.bioontology.org/ontologies/39038
CMECS (Coastal and Marine Ecological Classification Standard) | Vocabulary | http://www.csc.noaa.gov/benthic/cmecs/cmecs_d
Comparative Data Analysis Ontology | Ontology | http://sourceforge.net/apps/mediawiki/cdao/index.php?title=Main_Page
Darwin Core | Schema | http://wiki.tdwg.org/twiki/bin/view/DarwinCore/
Dublin Core | Metadata | http://dublincore.org/
Ecological Metadata Language | Metadata | http://knb.ecoinformatics.org/software/eml/
Environment Ontology | Ontology | http://www.environmentontology.org/
Evolution Ontology | Ontology | http://code.google.com/p/evolution-ontology/
Experimental Factor Ontology | Ontology | http://www.ebi.ac.uk/efo/
Federal Geographic Data Committee | Standards Body | http://www.fgdc.gov/
Fungal Anatomy Ontology | Ontology | http://www.yeastgenome.org/fungi/fungal_anatomy_ontology/
Gene Ontology | Ontology | http://www.geneontology.org/
Homology Ontology | Ontology | http://bioportal.bioontology.org/ontologies/42117
Hymenoptera Anatomy Ontology | Ontology | -
HUPO | Metadata Standards | http://www.psidev.info/index.php?q=node/159
Infectious Disease Ontology | Ontology | http://www.infectiousdiseaseontology.org/Home.html
International Standards Organization | Standards Body | http://www.iso.org
Marine Metadata Interoperability | Metadata | http://marinemetadata.org/
Microbiological Common Language | Vocabulary | Verslyppe et al. 2010
MIRIAM | Metadata | http://www.ebi.ac.uk/miriam/main/datatypes/
National Biodiversity Information Infrastructure | Metadata | http://www.nbii.gov/portal/community/Communities/NBII_Home/
Ontology of Microbial Phenotypes | Ontology | http://sourceforge.net/projects/microphenotypes/
Open Biological and Biomedical Ontologies | Ontology Repository | http://www.obofoundry.org/
Phenotype Quality Ontology | Ontology | http://obofoundry.org/wiki/index.php/PATO:Main_Page
Plant Ontology | Ontology | http://www.plantontology.org/
SDD | Schema | http://wiki.tdwg.org/twiki/bin/view/SDD/Version1dot1
Species Profile Model | Schema | http://wiki.tdwg.org/SPM
TaxonConcept | Ontology | http://lod.taxonconcept.org/ontology/txn.owl
Taxonomic Concept Schema | Schema | http://www.tdwg.org/activities/tnc/tcs-schema-repository/
TDWG | Standards Body | http://www.bgbm.org/TDWG/acc/Referenc.htm
Teleost Anatomy Ontology | Ontology | https://www.phenoscape.org/wiki/Teleost_Anatomy_Ontology

Table 3: Examples of standards and their locations.

Ontologies are part of 'Knowledge Organization Systems'; those relating to biodiversity have been discussed by Morris (2010). Ontologies contribute to the semantic annotation of data and to the machine reasoning it enables. As an example, a simple search for information about the bird 'robin' seeks to match the character string r-o-b-i-n against text within a data object or its annotations. The system cannot discriminate among data on American robins, European robins, Reliant Robin cars, Robin Wright Penn, or Robin the boy superhero. If, however, the query for 'robin' can be placed in the context of an ontology, such as one declaring that the robin in question is a member of the family Turdidae, an informed computer can use that context to return only relevant results. In addition to enabling more precise searching, ontological structures allow the computer to perform inference, a form of artificial intelligence. For example, an ontology that establishes that Turdidae is_a bird and wing is part_of bird allows the inference that an American robin has wings, and that data about those wings may be discoverable. Larger interconnected ontologies allow the assembly of more complex inferences.
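The inference just described can be sketched with off-the-shelf semantic-web tooling. The following Python fragment uses rdflib; the example.org namespace and the three-triple mini-ontology are invented for illustration, not a published vocabulary.

    from rdflib import Graph, Namespace, RDF, RDFS

    EX = Namespace("http://example.org/ont/")
    g = Graph()
    g.add((EX.Turdidae, RDFS.subClassOf, EX.Bird))    # Turdidae is_a bird
    g.add((EX.AmericanRobin, RDF.type, EX.Turdidae))  # the robin is a turdid
    g.add((EX.Wing, EX.part_of, EX.Bird))             # wing is part_of a bird

    # A SPARQL property path walks the subclass hierarchy, so the robin is
    # returned as a bird even though no single triple states that directly.
    results = g.query("""
        SELECT ?individual WHERE {
            ?individual a/rdfs:subClassOf* <http://example.org/ont/Bird> .
        }""")
    for row in results:
        print(row.individual)   # -> http://example.org/ont/AmericanRobin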

Many ontological structures are available for use in the Life Sciences (Table 3). Some, such as the observational (URL 75, URL 76, URL 77) and taxonomic ontologies (below), have broad applicability: the first within ecoinformatics and the second within biodiversity informatics. Users can adopt existing structures or create their own using an ontology editor such as Protégé (URL 78) or OBO-Edit (URL 79). The search engines Swoogle (URL 80) and Sindice (URL 81) search over 10,000 ontologies and can return a list of those that contain a term of interest. Services such as these help users to determine whether an existing ontology will meet their needs. Often, a user may need to use parts of existing ontologies or merge several ontologies into a single new one. Defining relationships between terms in different ontologies can be accomplished with automated alignment tools such as SAMBO and KitAMO (Lambrix & Tan 2008). The development and integration of ontologies is best carried out using formal languages (such as OWL, URL 82) and by individuals versed in their logical foundations.

Standards in the biodiversity sciences are well served by the Biodiversity Information Standards (TDWG) organization, initiated in 1985 (URL 83). TDWG has been a prime mover in developing organizational frameworks for biodiversity information. GBIF has also been a source of standards innovation and development. The intent of both is to provide a common framework for federating data, and therefore for data reuse. Unfortunately, there may be competing systems of standards, and not all areas of biology have established standards. Various efforts are under way to create broad-scope ontologies (URL 84, URL 85, URL 86). The promise of ontologies is as yet not fully realized, as "The semantic web is littered with ontologies lacking ... data" (Joel Sachs, pers. comm.). Next-generation tools and interfaces will, it is hoped, be better suited to use by general biologists.

The most extensive system of potential metadata for the Life Sciences is the Latinized binomial names (such as Homo sapiens) introduced for species in the mid-18th century by Linnaeus. They have been used since then to annotate virtually every statement about any of our current catalog of 2.2 million living and extinct forms of life. Inevitably, they will be replaced by molecular identifiers, but at this time they are well suited to form the basis of a names-based cyberinfrastructure for Biology (Patterson et al. 2008, 2010). This approach has been used for life-wide data-organization projects such as the Encyclopedia of Life (URL 87). Placement of names within hierarchical classifications offers ontological frameworks to organize the names, and the conversion of names into a formal ontology has been explored through projects such as ETHAN (URL 88). Our current understanding of biodiversity and the system of names is maintained by a specialist group of 5,000-10,000 professional taxonomists worldwide (Hopkins & Freckleton 2002), who generally are unaware of the informatics potential of names as a near-universal indexing system for biological data.

The Global Names Architecture is a new global initiative that links names databases and associated services to deliver names-based services to end users (Patterson et al. 2010).

Data are atomized. Atomization refers to the reduction of data to minimal semantic units. In such a form, data may exist as numerical values of variables (e.g. length of tail: 5.3 cm), binary statements (e.g. chloroplasts: absent), or associations with metadata terms from agreed-upon vocabularies (e.g. "has lodicules of lower floret of pedicellate spikelet of tassel", Zea mays ontology ID ZEA:0015118, URL 89). Atomized data on the same subject can be brought together if the data are classified in a standard way, and atomization is necessary for most types of analysis of data from one or more datasets. Atomized data stand in contrast to complex data such as images or large bodies of text. Data centers can foster atomization by providing services that transform data sets. Many older data centers capture data as files (or packages of files), leaving the responsibility for extracting data atoms to the user. This is time-consuming when there is no universal file format, suggesting that, in the future, atomization needs to occur at or near the source of the raw data, becoming part of the responsibilities of the author of the data, of the software in which data are logged, or of data centers. A minimal sketch of atomization follows.
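This sketch reduces one invented record to atomic (entity, attribute, value) statements; the field names are hypothetical.

    record = {"specimen": "USNM-12345", "tail_length_cm": 5.3, "chloroplasts": False}

    def atomize(entity_field, rec):
        """Yield one (entity, attribute, value) statement per datum."""
        entity = rec[entity_field]
        for attribute, value in rec.items():
            if attribute != entity_field:
                yield (entity, attribute, value)

    for statement in atomize("specimen", record):
        print(statement)
    # ('USNM-12345', 'tail_length_cm', 5.3)
    # ('USNM-12345', 'chloroplasts', False)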

Data are published. Projects participating in a Big New Biology will increasingly make data visible and accessible (i.e. published). Scientists may publish data by displaying them in unstructured or structured formats on local, project, or institutional web sites, while taking no responsibility for shifting the data to a central repository. In science generally, over three-quarters of published data are held in local repositories (Science staff editorial 2011). Local archives can provide few guarantees of persistence (see 'Data are archived' below), and in such environments the responsibilities for discovering data, negotiating with copyright holders, and acquiring data rest with the consumer. This is time-consuming and unlikely to be done on a large scale. Publication is better served through central, domain-specific repositories, because they are more likely to persist, provide better services, and offer the framework around which third parties develop value-adding services. The International Nucleotide Sequence Database Collaboration (INSDC) in the molecular domain is a good example of this model. Only a small fraction of data are deposited in such environments (less than 10% across the science community generally; Science staff editorial 2011), with costs and the absence of an organizational framework (metadata and archiving environments) cited as reasons. There are repositories for heterogeneous datasets (such as oceanographic databases, URL 90, URL 91, URL 92), but increasingly it will be more rewarding to publish via repositories that provide the services that facilitate reuse, including data standardization, quality control, and atomization. Given the desire to intercept the data life cycle as close to the source as possible, repositories or their agents can develop data capture tools, ideally provided with services (APIs) to export data to the central repositories.

Publication of atomized data is essential for large-scale data reuse. Data must be able to move from one computer to another in an intelligent way. Scientific initiatives can add RSS feeds, web services, and APIs (Application Programming Interfaces) to their web sites to broadcast new data and to respond to requests for data. An API facilitates interaction between computers in the same way that a user interface facilitates interaction between humans and computers. These additions incur overhead and are probably best provided through community repositories. Without such services, data may need to be 'screen scraped' from the web site, a process that is usually costly (because the solution for each site will differ) and, at worst, may require manual copying.
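As a hypothetical illustration of such a machine interface, the following minimal sketch (using Flask, one of many options) serves records as JSON; the route and dataset are invented.

    from flask import Flask, jsonify

    app = Flask(__name__)
    DATA = [{"specimen": "USNM-12345", "tail_length_cm": 5.3}]

    @app.route("/api/records")
    def records():
        # Machine-readable JSON spares consumers from screen scraping.
        return jsonify(DATA)

    if __name__ == "__main__":
        app.run()   # e.g. curl http://localhost:5000/api/records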

Data are archived. It is preferable that data, once published, are persistent. Projects, initiatives and host institutions have little incentive to preserve data for the long term, because the process incurs a cost, and repositories that emerge within projects may have limited life spans (e.g. OBIS, URL 93). Central repositories that are not dependent on short-term funding are better positioned to archive data and make them persistent. The three global molecular databases that make up the International Nucleotide Sequence Database Collaboration provide an excellent example of how domain-specific repositories may operate. Because they are not funded through short-term projects, and because they mirror each other, such repositories guarantee the persistence of data and empower scientists to develop projects that involve substantial analyses of shared data (Tittensor et al. 2010). Persistence can be assisted by institutions (libraries and museums) that specialize in the preservation of artifacts, or by governmental intervention (the US-based National Institutes of Health support GenBank). An alternative route to persistence is an effective business model that allows a data center to be sustained by income from the services that it sells, or by providing essential services that ensure support from the community of users. Examples of commercial models include the Chemical Abstracts Service of the American Chemical Society (URL 94) and Thomson Reuters' Zoological Record (URL 95).

Data are free and open. Open Access, the principle of providing unconstrained access to information on the web, improves the uptake, usage, application and impact of research output (Harnad 2008). Open Access has been applied widely to the process of publication, where it is seen as an alternative to the model in which publishers act as gatekeepers. It has been applied less to data, and while this extension is natural, it is not straightforward (Vision 2010). Attitudes about sharing data freely within the Life Sciences vary broadly. In sub-disciplines like genomics, data sharing is the norm, with some researchers sharing their data immediately via blogs or wikis. Yet even communities that value data sharing may have no formal recognition for such activities, nor supportive technical infrastructure. Other communities have a strong sense of data ownership and are antagonistic to open data sharing; researchers in these communities expect to be directly involved in any further analysis of their data.


Databanks serving these communities often require registration and/or a fee to gain access. Some data may be regarded as too sensitive to be made fully accessible (Key Perspectives Ltd. 2010).

Web-accessible Life Sciences data are acquired through four main routes (Key Perspectives Ltd. 2010):
1. The website of the journal in which the data are published. Data here are typically in pdf form, which is not ideal for reuse.
2. The website of an individual researcher or group. The quality is often good, but the data often do not comply with standards, and the sites can be hard to find and navigate.
3. Web-based databases maintained by individuals or groups. These are often funded on a short-term basis and typically contain data from a single project or from colleagues and collaborators. This is probably the most abundant type of database. Coverage is far from comprehensive, and data may not comply with standards.
4. Public databanks, with the molecular databases being the best examples, though there are also major repositories for records in the context of geospatial metadata (URL 12, 94).

Data are trusted. Once data are accessed, consumers may reveal errors and/or omissions. Biological data can be very 'dirty', especially if they were acquired without the expectation that they would later be shared. Any data cleaning procedures should be documented to help the consumer assess whether the source is 'suitable for their purpose' (Chapman 2005b). The creation of 'quality loops' allows comments to flow back to the source, where data can be annotated or modified and returned to users for renewed vetting. Webhooks (URL 96) offer a mechanism that exploits APIs to return comments to the source. Any editing of data can lead to the undesirable outcome that variant forms of the same data co-exist. To some extent, 'versioning' of data sets can be used to discriminate between modified datasets, and users can cite the version they called upon for their analyses; a minimal sketch of one approach follows.
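One possible approach (among many) derives a citable version identifier from the dataset content itself, so that any edit yields a new version; the records are invented.

    import hashlib, json

    def version_id(records):
        """Return a content hash usable as a dataset version identifier."""
        canonical = json.dumps(records, sort_keys=True).encode("utf-8")
        return hashlib.sha256(canonical).hexdigest()[:12]

    v1 = version_id([{"specimen": "USNM-12345", "tail_length_cm": 5.3}])
    v2 = version_id([{"specimen": "USNM-12345", "tail_length_cm": 5.4}])  # corrected datum
    assert v1 != v2   # the correction yields a distinct, citable version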

Data are attributed. Scientists gain credit in part through attribution. The permanent association of identifiers with data offers a means of linking attribution to the data and of tracking reuse. The association of authors' names with data motivates contributions (and a lack of credit demotivates them). Attribution also favors the development of quality loops to correct errors or otherwise comment on the data. Special care is needed when attributing data that result from the combination of one or more existing sets, so that all intellectual investment is properly credited. Dryad, a JDAP partner, provides data citations through the use of DataCite DOIs with an unrestrictive Creative Commons Zero license, thus promoting clear citation and reuse of data (Vision 2010).

Data can be manipulated. A value of having large amounts of data available on the web is that it allows users to explore, in addition to search for, data. Data exploration can be used to vet datasets, check a hunch (hypothesis), or simply indulge basic curiosity.

A desirable component of data-environment pools is a set of tools that draw data together and analyze or visualize them. Off-the-shelf software packages such as Microsoft Excel, though easy to use, are unlikely to meet all of the challenges of research. Flexible exploratory systems include Humboldt (Kobilarov & Dickinson 2008), which operates like a faceted filter for Linked Data; Parallax (Huynh & Karger 2008), which accesses data in Freebase and can interact with data on multiple web pages at once; and Microsoft Pivot (URL 97), which allows a user to interact with large amounts of data from multiple Internet sources. Visualizations have the capacity to reveal patterns, discontinuities and exceptions that can inform us about underlying biological processes, the appropriateness of data sets, or the consistency of experimental protocols. Visualizations can also be used to display the results of analyses of large data sets. Many Life Sciences data sets can be drawn together and visualized through their geospatial elements, as in LifeMapper (URL 98). Through visualizations we may help address the challenge stated by Fox and Hendler (2011) that "... many of the major scientific problems facing our world are becoming critically linked to the interdependence and interrelatedness of data from multiple instruments, fields and sources." The absence of effective visualization is creating a bottleneck within the data-intensive sciences (Fox and Hendler 2011). Solutions range from relatively simple low-end visualizations (wonderfully catalogued in URL 99) to high-end tools designed for the data deluge, which may themselves call on graphics and visualization standards to be pipelined into rich, complex, and flexible aids.

Data are registered and discoverable. Registries index data resources to alert potential users to their availability. Search engines, the normal indexers of web-accessible materials, are not good at revealing database contents: only about half of the open data in repositories are indexed by search engines (McCown et al. 2006). Discovery is made possible by the addition of coarse-grained discovery metadata, and registries need to add and expose such metadata to make datasets more visible. As an example, GBIF provides registry-level services for biodiversity data (URL 100). Registries that cover software (URL 101, URL 102) or web services (URL 103) are valuable in promoting awareness of tools for data capture, conversion and processing. Successful domain repositories, such as GenBank, have well-structured and detailed metadata that enable detailed search and enhanced discoverability. In the absence of such registries, researchers turn to peers, publications, or the thousands of minor data sets available via the Internet. Under these circumstances, it is hard to know when, or if, all relevant data have been found. There is a need for a broad-spectrum registry and indexing service (like a Google for data) where researchers can post pointers to their own data, search for desired data, and quickly preview the results. Examples exist in Europe with OpenDOAR (URL 104) and in India with the Database of Biological Databases (URL 105), each with thousands of listings. Semantic annotation of data greatly increases discoverability, and is discussed below.


Reusing Biological Data

With the exception of molecular data, there is not a well-developed tradition of repurposing open biological data. One reason is the dominance of the narrative tradition in biology. Traditionally, narrative biologists collect a tiny fraction of the available data, and do so selectively to emphasize representative, exemplary or outlying observations. They interpret data as published conclusions, in which significant credit is gained for original and confirmed intuitive leaps. In this narrative approach, science is

assembled (in part) through the retention, improvement, or rejection of these stories. A key aspect of the narrative approach is that the data upon which the stories are based are rarely preserved or even recorded; value has traditionally been attached to the published manuscript instead. In most sub-domains of biology, data are not held in digital form, nor have investments been made in an infrastructure for proper curation. Transferring data from manuscripts (paper or pdf) and filing cabinets to structured digital files, web-accessible databases, and semantic-web-enabled repositories will require a change in culture and a significant additional investment in data management.

A second factor that deters reuse of data is the complexity of the subject: biological data often only make sense in the context of many parameters, conditions and terms expressed over broad temporal and spatial scales. If data are to be reused to address complex questions, they need to be intensively annotated with consistent metadata, and these are generally lacking (Jones et al. 2006). The challenges in achieving consistency of data and metadata can be illustrated with a conceptually simple parameter such as growth rate.

1. Ambiguity of terms - Growth rate can refer to two different concepts: an increase in the number of individuals over time, or an increase in the size or mass of individuals. The precise meaning is rarely explicitly defined within data sets. Data can only be shared if the meaning is disambiguated.

2. Heterogeneity of metadata - Growth rates can be measured either in experimental or in 'field' contexts. The rate of growth can be influenced by many factors, such as ambient temperature, location, available food, life-cycle stage, competitors, and so on. A considerable body of

metadata, entered in a consistent way, is necessary if a user wishes to rely on data that are strictly comparable.

3. Derived data - Growth rates are rarely measured directly. Raw data may be collected as a change in mass or in number of individuals over time, and the growth rate is calculated from such data (a worked sketch follows this list). There are many formulae for calculating growth rate, including:

    (ln x2 - ln x1) / (t2 - t1)
    births - deaths
    dN/dt = rN
    P(t) = 1 / (1 + e^(-t))

Raw data in isolation, such as the number of organisms present at a given time, are meaningless without the other measurements used to make the calculations. Metadata describing the experimental method for determining growth rate, the specific calculations, and statistical information about those calculations must also be captured.

4. Heterogeneity of units - Growth rate data can be reported in multiple units, from pounds per year to h^-1. Within reason, units must be interoperable to promote data aggregation and reuse.
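The worked sketch promised above: the same raw observations yield a growth rate only through an explicit formula, so the formula and units must travel with the data. The values are invented.

    import math

    def specific_growth_rate(x1, x2, t1, t2):
        """(ln x2 - ln x1) / (t2 - t1), in units of 1/time."""
        return (math.log(x2) - math.log(x1)) / (t2 - t1)

    mu_per_hour = specific_growth_rate(x1=100, x2=800, t1=0.0, t2=3.0)  # cells, hours
    mu_per_day = mu_per_hour * 24    # units must be normalized before aggregation
    print(round(mu_per_hour, 3), round(mu_per_day, 2))   # 0.693 16.64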

In summary, the diversity of data practices presents major challenges to the reuse of data. The absence of an infrastructure that can cope with this heterogeneity hinders the application of computational solutions to broad biological problems. The costs of adding metadata and ontologies, of normalizing and standardizing data, and of extracting data from the narrative will be considerable, and the task unfamiliar to many. The discipline remains in need of dialog to determine the most cost-effective ways to integrate our past efforts and align our current efforts with the vision of a data-intensive future.

The semantic web and big new biology

The semantic web has many definitions, but here we think of it as a technical framework that promotes automated sharing and reuse of data across disciplines (Campbell & MacNeill 2010). A semantic infrastructure will permit machine-mediated answers to more complex queries than are possible at present (Stein 2008). The semantic approach has the advantages of being flexible, evolvable, and additive. The foundations for automated reasoning lie in the annotation of data with agreed metadata, linked through a network of ontologies, and queried using conventions (languages) such as RDF, OWL, SKOS and SPARQL (Campbell & MacNeill 2010). The mass of appropriately annotated data that can be accessed through the Internet is referred to as Linked Open Data (LOD). Through common metadata, the data can be linked to form a Linked Open Data Cloud (Fig. 9).

Berners-Lee has promoted four guidelines for linked data (Berners-Lee 2006):
1. Use a standard system of URIs as names for things.
2. Use HTTP URIs, so that the names can be looked up and the data accessed.
3. When a URI is looked up, return useful information using standards (RDF, SPARQL).
4. Include links to other URIs, so that users can discover more things.


A URI (Uniform Resource Identifier) is a type of persistent identifier (see below) made up of a string of characters that unambiguously (at least in an ideal world; see Booth 2010 for discussion) represents data or metadata and can be used by machines to access the data. Different data sets can be linked if they share the same URIs; for example, several marine data sets could be linked by using the same URIs for an investigator or a sampling event. The classes of terms most likely to serve the needs of the Life Sciences are georeferences (which can link data from the same location held in different repositories); names of taxa (the common denominator for the majority of statements about biodiversity); identities of people, who can be interconnected through devices such as FOAF (Friend-of-a-Friend) to find collaborators and relevant data; and the scientific literature, linkable through devices such as DOIs to show citation trends, influential publications, and so on (Fig. 10).

Figure 9: Linked open data cloud diagram, by Richard Cyganiak and Anja Jentzsch http://lod-cloud.net/. The circles represent sources of data and the arrows show how they are connected. DBpedia (a central hub) features the contents of Wikipedia in a structured form.

RDF is a language that defines relationships between things. Relationships in RDF are usually expressed in three parts (often called triples): Entity:Attribute:Value. The Entity refers to, for example, the organism, or the part or collective of organisms, to which the statement refers; the Attribute defines what the statement is about; and the Value provides the datum. A machine-readable statement in RDF might be American robin:has_color:red. Each term is ideally stringently defined by controlled vocabularies and ontologies, and each part of the triple is represented as a URI. The Value can be a URI or a literal - the actual value. An advantage of RDF is that it allows datasets to be merged, for example TaxonConcept and Wikipedia (URL 106). A goal of the Linking Open Data project is to promote a data commons by registering sets in RDF.
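To illustrate how shared URIs allow triples from different sources to be merged, the following Python sketch uses rdflib; the example.org vocabulary and both mini-datasets are invented.

    from rdflib import Graph

    # Two datasets share the URI of one sampling event, which is what
    # makes their merger meaningful.
    biology = """
    @prefix ex: <http://example.org/> .
    ex:sample42 ex:taxon ex:Turdus_migratorius .
    """
    context = """
    @prefix ex: <http://example.org/> .
    ex:sample42 ex:water_temperature_c "18.2" .
    """
    g = Graph().parse(data=biology, format="turtle")
    g += Graph().parse(data=context, format="turtle")   # merging is graph union
    for triple in g:
        print(triple)   # both statements now attach to the same ex:sample42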

Figure 10: Four classes of terms can provide the means to link data relating to the Life Sciences. They are references to location, names of organisms, people, and publications.

As of September 2010, the project had grown to 25 billion triples and 395 million RDF links (Fig. 9). The EU project Linking Open Data 2 received €6.5 million to expand Linked Data by building tools and developing standards (URL 107). Transformation of data from printed narrative or spreadsheet to semantic-web formats is a large challenge: based on existing ontologies, there is enough information to create 10^14 triples in biomedicine alone (Mons & Velterop 2009). At the time of writing, this quantity far exceeds the capacity of any system to process the information. While the Life Sciences stand to benefit greatly from the advantages of linked data (Reichman et al. 2011), the current structure lacks mechanisms for ensuring quality, provenance and attribution. Provenance is especially important for Life Sciences data, and several software packages exist for tracking it (such as Kepler, URL 108; Taverna, URL 109; VisTrails, URL 110). Bechhofer et al. (2010) advocate the use of Research Objects (ROs) as a mechanism for describing the aggregation and investigation of semantic resources, with the capacity to capture the additional value necessary to make the semantic web work for science. Provenance of ROs would also satisfy recent calls for open science, in which not only data but also methods and analyses are open (Reichman et al. 2011). Semanticization enables nanopublication, a form of publication that extends traditional narrative publication (Groth et al. 2010) and allows attribution to be associated with the semantic web (Mons & Velterop 2009). Nanopublications relate to the publication of triples: a uniquely identifiable triple is a statement; a triple with a statement for a subject is called an annotation; and a set of annotations that refer to the same statement is called a nanopublication. The annotations add attribution and context to the statement. The concept is not yet widely accepted.

Persistent or Globally Unique Identifiers (GUIDs) are used to distinguish individual data elements (Richards et al. 2011). The attachment of a globally unique persistent identifier to data can be used to declare their provenance (source). It allows an author to be identified and to gain credit (attribution). It also provides the mechanism through which questions about data can be returned to the source and the record confirmed, corrected or rejected. Identifiers can be used to establish versions and to identify data that are to be deprecated (Van de Sompel et al. 2010). In a semantic world, web-resolvable identifiers can be used to represent the Subject, Attribute and even Value of triples, making each element of a triple unambiguous.

There are several desirable properties of globally unique identifiers. The first is that every datum has a single identifier. The second is that the identifier should be resolvable, such that users who have a GUID can view the data. An extension of this is that the GUID is dereferenceable, meaning that the identifier can be converted into the data to which it refers. Once applied, a GUID should be persistent, but equally, there should be the capacity to version data.

Any spreadsheet or database can include identifiers, and scientists have learned the value of a system that allows them to distinguish all records so that they can revisit data if required. To serve its role, an identifier needs to be unique in its context, whether through a system of incrementing identifiers or by linking the identifier to a source file, date, author, and/or event. As we move into a semantic world, the principle extends to a requirement that the identifiers of all records be globally unique. Such an extension requires a universal system of unique identifiers that can be applied to any and all data. Unfortunately, as is often the case in biology, there is no single system of unique identifiers.

There are four major types of GUIDs. The first is simply a very large alphanumeric identifier, such as 5fabfc40-0c3d-11e0-81e0-0800200c9a66, a Universally Unique Identifier or UUID. On-line services can be called upon to provide such numbers (URL 111). UUIDs are designed to be globally unique without requiring a central registry. Unless a UUID is assigned at the source and then respected, more than one UUID may become associated with a record; this is an unresolved problem. UUIDs lack any inherent property that allows them to be dereferenced. However, they can be included within other identifiers that are resolvable, or agencies can provide a resolving service. A UUID can be resolved within the context of the semantic web by including it within po
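A minimal sketch of UUID generation, echoing the identifier format quoted above; the resolver host in the final line is hypothetical.

    import uuid

    # UUIDs are minted locally, with no central registry, yet collisions
    # are vanishingly unlikely.
    guid = uuid.uuid4()
    print(guid)   # same format as 5fabfc40-0c3d-11e0-81e0-0800200c9a66

    # A bare UUID is not resolvable; one workaround is to embed it in an
    # HTTP URI under a resolver that the data provider controls.
    uri = f"http://example.org/id/{guid}"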