
Research into Understanding Scientific Evidence

Richard Gott, Sandra Duggan, Ros Roberts and Ahmed Hussain

Introduction

Our research is based on the belief that there is a body of knowledge which underlies an understanding of scientific evidence. Certain ideas which underpin the collection, analysis and interpretation of data have to be understood before we can handle scientific evidence effectively. We have called these ideas concepts of evidence. Some pupils/students will pick up these ideas in the course of studying the more traditional areas of science, but many will not. These students will not understand how to evaluate scientific evidence unless the underlying concepts of evidence are specifically taught. If these ideas are to be taught, then they need to be carefully defined.

We are in the process of developing a comprehensive, but as yet tentative, definition of concepts of evidence ranging from the ideas associated with a single measurement to those which are associated with evaluating evidence as a whole. What follows is the latest version which has been, and continues to be, informed by research and writing in primary and secondary science education, in science-based industry and in the public understanding of science. Our definition is by no means complete and we welcome comments or suggestions from readers of this site.

The reader should note that we are not suggesting that students need to understand all of these concepts. Although we believe that some of these ideas are fundamental and appropriate at any age, others may be necessary only for a student engaged in a particular branch of science.

We are aware that some concepts, such as sensitivity, can have several meanings in different areas of science. We aim to point this out where applicable.

Further information:

- The latest downloadable version of the complete list can be obtained here.
- A concept map for 'the thinking behind the doing' in scientific practice can be found below, and as a PDF file here.
- A version produced in collaboration with teachers (funded by AQA) which describes the subset of the complete list appropriate to GCSE science in the UK can be found here.
- A report detailing a recent research project and links to the instruments used can be found here.
- Research publications can be found here.

This work has been influential in framing curriculum developments in England and has been cited in the US's Framework for K-12 Science Education and the PISA 2015 Science Framework.

Linked documents:

https://www.dur.ac.uk/education/

http://community.dur.ac.uk/rosalyn.roberts/Evidence/CofEv_Gott%20et%20al.pdf

http://community.dur.ac.uk/rosalyn.roberts/Evidence/Roberts%20&%20Johnson%20(2015)%20The%20Curriculum%20Journal%20-%20What%20is%20for%20the%20'thinking%20behind%20the%20doing'%20for%20determining%20the%20validity%20of%20data.pdf

http://community.dur.ac.uk/rosalyn.roberts/Evidence/CofEv_PDF_GCSE.pdf

http://www.dur.ac.uk/education/research/current_research/maths/msm/understanding_scientific_evidence/

Background

Fundamental ideas

Investigations must be approached with a critical eye. What sort of link is to be established, with what level of measurement, and how will opinion and data be weighed as evidence? This pervades the entire scheme and sets the context in which all that follows needs to be judged.

Topic | Understanding that: | Notes
Opinion and data | It is necessary to distinguish between opinion based on scientific evidence and ideas on the one hand, and opinion based on non-scientific ideas (prejudice, whim, hearsay ...) on the other. | In the UK, the Royal Society's motto 'Nullius in verba', which roughly translates as 'take nobody's word for it', emphasises the central importance in science of evidence over assertion, as scientists make claims following investigations into the real world. All scientific research is judged on the quality of its evidence.
Links | A scientific investigation seeks to establish links (and the form of those links) between two or more variables. |
Association and causation | Links can be causal (change in the value of one variable causes a change in another), or associative (changes in one variable and changes in another are both linked to some third, and possibly unrecognised, variable or variables). |
Types of measurement | Interval data (measurements of a continuous variable) are more powerful than ordinal data (rank ordering), which are more powerful than categoric data (a label). |
Extended tasks | Some measurements can be very complicated and constitute a task on their own, but they are only meaningful when set within the wider investigation(s) of which they will form a part. |

Observation

Observation of objects and events can lead to informed description and the generation of questions to investigate further. Observation is one of the key links between the 'real world' and the abstract ideas of science. Observation, in our definition, does not include 'measurement' but rather deals with the way we see objects and events through the prism of our understanding of the underlying substantive conceptual structures of science.

Topic | Understanding that: | Notes
Observing objects | Objects can be 'seen' differently depending on the conceptual window used to view them. | A low profile car tyre can be seen as nothing more than that, or it can be seen as a way of increasing the stiffness of the tyre, thus giving more centripetal force with less deformation and thus improving road holding.
Observing events | Events can similarly be seen through different conceptual windows. | The motion of a parachute is seen differently when looked at through a framework of equal and unequal forces and their corresponding accelerations.
Using a key | The way in which an object can be 'seen' can be shaped by using a key. | E.g. a branching key gives detailed clues as to what to 'see'. It is, then, a heavily guided, substantive concept-driven observation.
Taxonomies | Taxonomies are a means of using conceptually driven observations to set up classes of objects or organisms that exhibit similar/different characteristics or properties with a view to using the classification to solve a problem. | Organisms observed in a habitat may be classified according to their feeding characteristics (to track population changes over time, for instance), or a selection of materials may be classified so that efficient conductors are distinguished from inefficient conductors.
Observation and experiment | Observation can be the start of an investigation, experiment or survey. | Noticing that shrimp populations vary in a stream leads to a search for a hypothesis as to why that is the case, and an investigation to test that hypothesis.
Observation and map drawing | Technique used in biological and geological fieldwork to map a site based on conceptually driven observations that illustrate features of scientific interest. | An ecologist may construct a map of a section of a stream illustrating areas of varying stream flow rate or composition of the stream bed.
Note | We regard 'observation' as being essentially substantive in nature, requiring the use of established ideas of force, for instance, as a window on how we see the world. As such it is included here only because of its crucial role in raising questions for investigation. |

Measurement

Measurement must take into account inherent variation due to uncontrolled variables and the characteristics of the instruments used. This section lies at the very centre of our model for measurement, data and evidence and is fundamental to it.

Topic | Understanding that: | Notes
Inherent variation | The measured value of any variable will never repeat unless all possible variables are controlled between measurements, circumstances which are very difficult to create. | Such uncertainties are inherent in the measurement process that lies at the heart of science; they do not represent a failure of science, or of scientists.
Human error | The measured value of any variable can be subject to human error, which can be random or systematic. |

A framework for data and evidence

In any discussion of the place of data and evidence in science or engineering, we must avoid the trap of failing to define terms and, as a consequence, rendering the argument unintelligible. We shall therefore begin by defining what we mean by data and evidence.

We take datum to mean the measurement of a parameter, e.g. the volume of gas or the type of rubber. This does not necessarily mean a single measurement: it may be the result of averaging several repeated measurements and these could be quantitative or qualitative.

Data we take to be no more or less than the plural of datum, to state the obvious.

Evidence, on the other hand, we take as data which have been subjected to some form of validation so that it is possible, for instance, to assign a 'weight' to the data when coming to an overall judgement. This process of weighting will need to look wider than the data itself. It will need to consider, for example, the quality of the experiment and the conditions under which it was undertaken, together with its reproducibility by other workers in other circumstances and perhaps the practicality of implementing the outcomes of the evidence.

We begin our definition in the centre of the figure above with the ideas that underpin the making of a single measurement and work outwards. This seems a logical way to proceed but please note that we are not suggesting that this equates with the order of understanding necessary for carrying out an experiment or the order in which these ideas are best taught.

Making a single measurement

To make a single measurement, the choice of an instrument must be suited to the value to be measured. Making an appropriate choice is informed by an understanding of the basic principles underlying measuring instruments.

1 Underlying relationships

All instruments rely on an underlying relationship which converts the variable being measured into another that is easily read. For instance, the following (volume, temperature and force) are measured by instruments which convert each variable into length:

- a measuring cylinder converts volume to a length of the column of liquid
- a thermometer converts temperature to a change in volume and then to a change in length of the mercury thread
- a force meter converts a force into the changing length of a spring

Other instruments convert the variable to an angle on a curved scale, such as a car speedometer. Electronic instruments convert the variable to a voltage.

Some instruments are not so obviously 'instruments' and may not be recognised as such. One example is the use of lichen as an indicator of pollution and another is pH paper, where chemical change is used as the basis of the 'instrument' and the measurement is a colour. Other instruments rely on more complex and less direct relationships.

Topic | Understanding that: | Example
1. Linear relationships | ... most instruments rely on an underlying and preferably linear relationship between two variables. | A thermometer relies on the relationship between the volume of a liquid and temperature.
2. Non-linear relationships | ... some 'instruments', of necessity, rely on non-linear relationships. | Moving iron ammeter, pH.
3. Complex relationships | ... the relationship may not be straightforward and may be confounded by other factors. | The prevalence, or size, of a species of lichen is an indicator of the level of pollution but other environmental factors such as aspect, substrate, or air movement can also affect the distribution of lichen.
4. Multiple relationships | ... sometimes several relationships are linked together so that the measurement of a variable is indirect. | Medical diagnosis often relies on indirect, multiple relationships. Braking distance is an indirect measure of frictional force. Proxy measures are very important in 'historical' sciences, such as geology/earth science and in the study of climate change, e.g. tree rings and ice cores as proxy measures of climate conditions.

2 Calibration and error

Instruments must be carefully calibrated to minimise the inevitable uncertainties in the readings. All instruments must be calibrated so that the underlying relationship is accurately mapped onto the scale. If the relationship is non-linear, the scale has to be calibrated more often to map that non-linearity. All instruments, no matter how well made, are subject to error. Each instrument has finite limits on, for example, its resolution and sensitivity.
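As a rough illustration of mapping an underlying relationship onto a scale, the sketch below assumes a liquid-in-glass thermometer whose thread length is taken to vary linearly with temperature. It calibrates the scale at two end points and then checks intervening points for departures from linearity. The function names, thread lengths, reference temperatures and tolerance are all hypothetical values chosen for illustration; they are not data from this work.

```python
def two_point_calibration(len_low, temp_low, len_high, temp_high):
    """Return a function mapping thread length (mm) to temperature (deg C).

    Assumes a linear underlying relationship fixed by the two end points.
    """
    slope = (temp_high - temp_low) / (len_high - len_low)

    def read(length_mm):
        return temp_low + slope * (length_mm - len_low)

    return read


def linearity_residuals(read, check_points):
    """Compare scale readings with reference temperatures at intervening points.

    check_points is a list of (length_mm, reference_temp) pairs.
    Returns the list of (reading - reference) differences.
    """
    return [read(length) - ref for length, ref in check_points]


# Hypothetical thread lengths: 20 mm at 0 deg C and 220 mm at 100 deg C.
read_temp = two_point_calibration(20.0, 0.0, 220.0, 100.0)

# Hypothetical intervening checks against a reference thermometer.
checks = [(70.0, 25.0), (121.0, 50.0), (170.0, 75.0)]
residuals = linearity_residuals(read_temp, checks)

# Flag any check point where the scale departs from the reference
# by more than an (assumed) tolerance of 0.2 deg C.
TOLERANCE = 0.2
nonlinear_points = [c for c, r in zip(checks, residuals) if abs(r) > TOLERANCE]
print(residuals, nonlinear_points)
```

A non-empty `nonlinear_points` list would suggest that a two-point linear scale is not enough and that the scale needs to be mapped at more points.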

Topic | Understanding that: | Example
5. End points | ... the instrument must be calibrated at the end points of the scale. | A thermometer must be calibrated at 0 °C and 100 °C.
6. Intervening points | ... the instrument must be calibrated at points in between to check the linearity of the underlying relationship. | A thermometer must be calibrated at a number of intervening points to check, for instance, for non-linearity due to non-uniform bore of the capillary.
7. Zero errors | ... there can be a systematic shift in scale and that instruments should be checked regularly. | If the zero has been wrongly calibrated, if the instrument itself was not zeroed before use or if there is fatigue in the mechanical components, a systematic error can occur.
8. Overload, limiting sensitivity / limit of detection | ... there is a maximum (full scale deflection) and a minimum quantity which can be measured reliably with a given instrument and technique. | The lower and upper ends of the scale of a measuring instrument place limits on the lowest and highest values that can be measured. It is all too easy to read an electronic meter (in particular) without realising it is on its end stop.
9. Sensitivity* | ... the sensitivity of an instrument is a measure of the amount of error inherent in the instrument itself. | An electronic voltmeter will give a reading which fluctuates slightly.
10. Resolution and error | ... the resolution is the smallest division which can be read easily. The resolution can be expressed as a percentage. | If the instrument can measure to 1 division and the reading is 10 divisions, the error can be expressed as 10 ± 1 or as a percentage error of 10%.
11. Specificity** | ... an instrument must measure only what it purports to measure. | This is of particular significance in biology where indirect measurements are used as 'instruments', e.g. bicarbonate indicator used as an indirect measure of respiratory activity in woodlice could be affected by other acids such as that produced by the woodlice during excretion.
12. Instrument use | ... there is a prescribed procedure for using an instrument which, if not followed, will lead to systematic and/or random errors. | Taking a thermometer out of the liquid to read it will lead to systematically low readings. More specifically, there is a prescribed depth of immersion for some thermometers which takes account of the expansion of the glass and the mercury (or alcohol) which is not in the liquid being measured.
13. Human error | ... even when an instrument is chosen and used appropriately, human error can occur. | Scales on measuring instruments can easily be misread.

*Sensitivity and **specificity have a different meaning in medicine in the measurement of disease, where sensitivity is the true positive rate, that is, the proportion of patients with the disease who are correctly 'measured' or identified by the test. Specificity is the proportion of patients without the disease who are correctly measured or identified by the test. These two measures describe the 'measurement efficiency'.
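The medical definitions in this note reduce to two ratios taken from a table of test outcomes. The following is a minimal sketch, assuming the counts of true and false positives and negatives are already known; the function names and the patient counts are invented for illustration.

```python
def sensitivity(true_pos, false_neg):
    """True positive rate: share of patients WITH the disease identified by the test."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """True negative rate: share of patients WITHOUT the disease identified by the test."""
    return true_neg / (true_neg + false_pos)


# Hypothetical screening results for 200 patients (not real data).
tp, fn = 45, 5      # 50 patients have the disease
tn, fp = 135, 15    # 150 patients do not

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```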

3 Reliability and validity of a single measurement

Any measurement must be reliable and valid. A measurement, once made, must be scrutinised to make sure that it is a valid measurement (that is, it is measuring what was intended) and that it can be relied upon. Repeating readings and triangulation, by using more than one of the same type of instrument or by using another type of instrument, can increase reliability.

Topic | Understanding that: | Example
14. Reliability | ... a reliable measurement requires an average of a number of repeated readings; the number needed depends on the accuracy required in the particular circumstances. | Measurement of blood alcohol level can be assessed with a breathalyser, but at least 3 independent readings are made before the measure is considered a legal measurement.
15. Reliability | ... instruments can be subject to inherent inaccuracy so that using different instruments can increase reliability. | Measurement of blood alcohol level can be assessed with a breathalyser and cross-checked with a blood test. Temperature can be measured with a mercury, alcohol and digital thermometer to ensure reliability.
16. Reliability | ... human error in the use of an instrument can be overcome by independent, random checks. | Spot checks of measurement techniques by co-workers are sometimes built into routine procedures.
17. Validity | ... measures that rely on complex or multiple relationships must ensure that they are measuring what they purport to measure. | A complex technique for measuring a vitamin may be measuring more than one form of the same vitamin.

Measuring a datum

Moving from the measuring instrument itself, we now turn to the actual measurement of a datum. The measurement of a single datum may be required or it may be as one of several data to be measured. A significant element of science in industry is indeed about the sophisticated and careful measurement of a single parameter.

1 The choice of an instrument for measuring a datum

Measurements are never entirely accurate for a variety of reasons. Of prime importance is choosing the instrument to give the accuracy and precision required; a proactive choice rather than a reactive discovery that it wasn't the right instrument for the job!

Topic | Understanding that: | Example
18. Trueness or accuracy* | ... trueness is a measure of the extent to which repeated readings of the same quantity give a mean that is the same as the 'true' mean. | If the mean of a series of readings of the height of an individual pupil is 173 cm and her 'true' height, as measured by a clinic's instrument, is 173 cm, the measuring instrument is 'true'.
19. Non-repeatability | ... repeated readings of the same quantity with the same instrument never give exactly the same answer. | Weighing yourself on a set of bathroom scales in different places on the bathroom floor, or standing on a slightly different position on the scales, will result in slightly differing readings. It is never possible to repeat the reading in exactly the same way.
20. Precision | ... precision (sometimes called "imprecision" in industry) refers to the observed variations in repeated measurements from the same instrument. In other words, precision is an indication of the spread of the repeated measurements around the mean. A precise measurement is one in which the readings cluster closely together. The less the instrument's precision, the greater is its uncertainty. A precise measurement may not necessarily be an accurate or true measurement (and vice versa). The concept of precision is also called "reliability" in some fields. A more formal descriptor or assessment of precision might be the range of the observed readings, the standard deviation of those readings, or the standard error of the instrument itself. | For bathroom scales, a precise set of measurements might be: 175, 176, 175, 176, and 174 pounds.
21. Reproducibility | ... whereas repeatability (precision) relates to the ability of the method to give the same result for repeated tests of the same sample on the same equipment (in the same laboratory), reproducibility relates to the ability of the method to give the same result for repeated tests of the same sample on equipment in different laboratories. | 'Round Robins' are often used to check between different laboratories. A standardised sample is sent to each lab and they report their measurement(s) and degree of uncertainty. Labs are then compared.
22. Outliers in relationships | ... outliers, aberrant or anomalous values in data sets should be examined to discover possible causes. If an aberrant measurement or datum can be explained by poor measurement procedures (whatever the source of error), then it can be deleted. | Outliers may be due to errors discussed above, for example. In medical laboratory practice, outliers may have serious implications if not explored.

* Accuracy is a term which is often used rather loosely to indicate the combined effects of precision and trueness. But in some science-based industries the distinction we have defined here is used widely so that, for example, the precision and accuracy of a given measurement are quoted routinely.
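The distinction between trueness and precision can be made concrete with the bathroom-scale readings from the precision example. The sketch below treats the offset of the mean from a reference value as a rough measure of trueness and the sample standard deviation as a measure of precision. The reference weight of 180 pounds is a hypothetical value, chosen only to show a set of readings that is precise without being true.

```python
from statistics import mean, stdev

readings = [175, 176, 175, 176, 174]   # pounds, from the precision example
reference = 180                        # hypothetical 'true' weight, for illustration only

bias = mean(readings) - reference      # trueness: offset of the mean from the reference
spread = stdev(readings)               # precision: sample standard deviation of the readings

print(f"mean = {mean(readings):.1f} lb, bias = {bias:+.1f} lb, SD = {spread:.2f} lb")
```

Under these assumed values the readings cluster within a pound or so of each other, yet their mean sits several pounds from the reference: a small spread says nothing about whether the instrument is true.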

2 Sampling a datum

A series of measurements of the same datum can be used to determine the reliability of the measurement. We shall use the term sampling to mean any subset of a 'population'. The 'population' might be the population of a species of animal or plant or even the 'population' of possible sites where gold might be found. We shall also take the population to mean the infinite number of repeated readings that could be taken of any particular measurement. We consider these together since their effect on the data is the same.

Topic | Understanding that: | Example
23. Sampling | ... one or more measurements comprise a sample of all the measurements that could be made. | The measurement of a single blade of grass is a sample of all the blades of grass in a field. A single measurement of the bounce height of a ball is a sample of the infinite number of such bounces that could be measured.
24. Size of sample | ... the number of measurements taken. The greater the number of readings taken, the more likely they are to be representative of the population. | As more readings of, for example, the height of students in a college are taken, the more closely the sample is likely to represent the whole college population. The more times a single ball is bounced, the more the sample is likely to represent all possible bounces of that ball.
25. Reducing bias in sample / representative sampling | ... measurements must be taken using an appropriate sampling strategy, such as random sampling, stratified or systematic sampling, so that the sample is as representative as possible. | In the above example of the height of college students, tables of random numbers can be used to select students.
26. An anomalous datum | ... an unexpected datum could be indicative of inherent variation in the data or the consequence of a recognised uncontrolled variable. | In the above example, a very small height may have been recorded from a child visiting the college and should not be part of the population being sampled; whereas a very low rebound height from a squash ball may occur as a result of differences in the material of the ball and is therefore part of the sample.
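Random and stratified sampling of the college population can be sketched with a pseudo-random number generator standing in for tables of random numbers. The example assumes a hypothetical college roll of 300 students split across three year groups; the identifiers, group sizes and seed are invented for illustration, and the seed is fixed only so that the sketch is repeatable.

```python
import random

rng = random.Random(42)  # fixed seed so the illustration is repeatable

# Hypothetical college roll: (student_id, year_group)
roll = (
    [(i, "year1") for i in range(150)]
    + [(i, "year2") for i in range(150, 250)]
    + [(i, "year3") for i in range(250, 300)]
)


def simple_random_sample(population, n):
    """Draw n members without replacement; every member is equally likely."""
    return rng.sample(population, n)


def stratified_sample(population, n, key):
    """Sample each stratum in proportion to its share of the population.

    Shares are rounded, so the total can differ slightly from n
    when the proportions do not divide evenly.
    """
    strata = {}
    for member in population:
        strata.setdefault(key(member), []).append(member)
    sample = []
    for members in strata.values():
        share = round(n * len(members) / len(population))
        sample.extend(rng.sample(members, share))
    return sample


srs = simple_random_sample(roll, 30)
strat = stratified_sample(roll, 30, key=lambda student: student[1])

# Count how many students each year group contributes to the stratified sample.
counts = {}
for _, year in strat:
    counts[year] = counts.get(year, 0) + 1
print(len(srs), counts)
```

The simple random sample may over- or under-represent a year group by chance, whereas the stratified sample fixes each group's contribution in advance.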

3 Statistical treatment of measurements of a single datum

A group of measurements of the same datum can be described in various mathematical ways. The statistical treatment of a datum is concerned with the probability that a measurement is within certain limits of the true reading. The following are some of the basic statistics associated with a single datum:

Topic | Understanding that: | Example
27. Range | ... the range is a simple description of the distribution and defines the maximum and minimum values measured. | Measuring the height of carbon dioxide bubbles on successive trials in a yeast experiment, the following measurements were recorded and ordered sequentially: 2.7, 2.9, 3.1, 3.1, 3.1, 3.3, 3.4, 3.4, 3.5, 3.6 and 3.7 cm. The range is 1.0 cm (3.7 − 2.7).
28. Mode | ... the mode is the value which occurs most often. | Continuing the example above, the mode is 3.1 cm.
29. Median | ... the median is the value below and above which there are half the measurements. | Continuing the example above, the median is 3.3 cm.
30. Mean | ... the mean (average) is the sum of all the measurements divided by the number of measurements. | Continuing the example above, the mean is 3.2 cm.
31. Frequency distributions | ... a series of readings of the same datum can be represented as a frequency distribution by grouping repeated measurements which fall within a given range and plotting the frequencies of the grouped measurements. |
32. Standard deviation | ... the standard deviation (SD) is a way of describing the spread of normally distributed data. The standard deviation indicates how closely the measurements cluster around their mean. In other words, the standard deviation is a measure of the extent to which measurements deviate from their mean. The more closely the measurements cluster around the mean, the smaller the standard deviation. The standard deviation depends on the measuring instrument and technique: the more precise these are, the smaller the standard deviation of the sample or of repeated measurements. | Continuing the example above, SD = 0.30 cm.
33. Standard deviation of the mean (standard error) | ... the standard deviation of the mean describes the frequency distribution of the means from a series of readings repeated many times. The standard deviation of the mean depends on the measuring instrument and technique AND on the number of repeats. The standard error of a measurement is an estimate of the probable range within which the 'true' mean falls; that is, an estimate of the uncertainty associated with the datum. | Continuing the example above, SE = 0.09 cm.
34. Coefficient of variation | ... the coefficient of variation is the standard deviation expressed as a percentage of the mean (CV = SD*100/mean). | Continuing the example above, CV = 9.4%.
35. Confidence limits | ... confidence limits indicate the degree of confidence that can be placed on the datum. For example, '95% confidence limits' means that the 'true' datum lies within 2 standard errors of the calculated mean, 95% of the time. Similarly, '68% confidence limits' means that the 'true' datum lies within 1 standard error of the calculated mean, 68% of the time. | Continuing the example above, the true value of the datum lies within 0.18 cm (2 standard errors) of 3.2 cm (the mean), 19 times out of 20. The upper and lower confidence limits at the 95% level are 3.38 (3.2 + 0.18) and 3.02 (3.2 − 0.18) respectively. In other words, the 'true' value lies between 3.02 and 3.38 cm, 95% of the time.
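The statistics in this table can be recomputed from the eleven yeast readings with the Python standard library. This is a sketch rather than the authors' own calculation: it assumes the sample standard deviation (n − 1 in the denominator) and follows the text's approximation of 2 standard errors for 95% limits. Computed without intermediate rounding, the mean comes out at about 3.25 cm rather than the 3.2 cm quoted above, so the confidence limits shift slightly; the range, mode, median, SD, SE and CV agree with the quoted values to the precision given.

```python
from math import sqrt
from statistics import mean, median, mode, stdev

# Bubble heights (cm) from the yeast example in the table.
heights = [2.7, 2.9, 3.1, 3.1, 3.1, 3.3, 3.4, 3.4, 3.5, 3.6, 3.7]

data_range = max(heights) - min(heights)
m = mean(heights)
sd = stdev(heights)                      # sample SD (assumes n - 1 denominator)
se = sd / sqrt(len(heights))             # standard deviation of the mean
cv = sd * 100 / m                        # coefficient of variation, in percent
lower, upper = m - 2 * se, m + 2 * se    # approximate 95% confidence limits

print(f"range  = {data_range:.1f} cm")
print(f"mode   = {mode(heights)} cm, median = {median(heights)} cm")
print(f"mean   = {m:.2f} cm, SD = {sd:.2f} cm, SE = {se:.2f} cm")
print(f"CV     = {cv:.1f} %")
print(f"95% limits: {lower:.2f} to {upper:.2f} cm")
```

Replacing the factor 2 with 1 gives the corresponding 68% limits.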

4 Reliability and validity of a datum

A datum must have a known (or estimated) reliability and validity before it can be used in evidence. Any datum must be subject to careful scrutiny to ascertain the extent to which it:

- is valid: that is, has the value of the appropriate variable been measured? Has the parameter been sampled so that the datum represents the population?
- is reliable: for example, does the datum have sufficient precision? The wider the confidence limits (the greater the uncertainty), the less reliable the datum.

Only then can the datum be weighed as evidence. Evaluation of a datum also includes evaluating the validity of the ideas associated with the making of a single measurement.

Topic | Understanding that: | Example
36. Reliability | ... a datum can only be weighed as evidence once the uncertainty associated with the instrument and the measurement procedures have been ascertained. | The reliability of a measurement of blood alcohol level should be assessed in terms of the uncertainty associated with the breathalyser (e.g. ± 0.01) and in terms of how the measurement was taken (e.g. superficial breathing versus deep breathing).
37. Validity | ... that a measurement must be of, or allow a calculation of, the appropriate datum. | The girth of a tree is not a valid indicator of the tree's age.

Data in investigations looking for relationships

An investigation is an attempt to determine the relationship, or lack of one, between the independent and dependent variables or between two or more sets of data. Investigations take many forms but all have the same underlying structure.

1. The design of practical investigations

What do we need to understand to be able to appraise the design of an investigation in terms of validity and reliability?

1.1 Variable structure

Identifying and understanding the basic structure of an investigation in terms of variables and their types helps to evaluate the validity of data.

Topic | Understanding that: | Example
38. The independent variable | ... the independent variable is the variable for which values are changed or selected by the investigator. | The type of ball in an investigation to compare the bounciness of different types of balls; the depth in a pond at which light intensity is to be measured.
39. The dependent variable | ... the dependent variable is the variable the value of which is measured for each and every change in the independent variable. | In the same investigations as above: the height to which each type of ball bounces; the light intensity at each of the chosen depths in the pond.
40. Correlated variables | ... in some circumstances we are looking for a correlation only, rather than any implied causation. | Foot size can be predicted from hand size (both 'caused' by other factors).
41. Categoric variables | ... a categoric variable has values which are described by labels. Categoric variables are also known as nominal data. | The variable 'type of metal' has values 'iron', 'copper' etc.
42. Ordered variables | ... an ordered variable has values which are also descriptions, labels or categories but these categories can be ordered or ranked. Measurement of ordered variables results in ordinal data. | The variable of size, e.g. 'very small', 'small', 'medium' or 'large', is an ordered variable. Although the labels can be assigned numbers (e.g. very small = 1, small = 2 etc.), size remains an ordered variable.
43. Continuous variables | ... a continuous variable is one which can have any numerical value and its measurement results in interval data. | Weight, length, force.
44. Discrete variables | ... a discrete variable is a special case in which the values of the variable are restricted to integer multiples. | The number of discrete layers of roof insulation.
45. Multivariate designs | ... a multivariate investigation is one in which there is more than one independent variable. | The effect of the width and the length of a model bridge on its strength. The effect of temperature and humidity on the distribution of gazelles in a particular habitat.

1.2 Validity, 'fair tests' and controlsUncontrolled variation can be reduced through a variety of techniques. 'Fair tests' and controls aim to isolate the effect of the independent variable on the dependent variable.Laboratorybased investigations, at one end of the spectrum, involve the investigator changing the independent variable and keeping all the control variables constant. This isoften termed 'the fair test', but is no more than one of a range of valid structures. At the other end of the spectrum are 'field studies' where many naturally changing variablesare measured and correlations sought. For example, an ecologist might measure many variables in a habitat over a period of time. Having collected the data, correlationsmight be sought between variables such as day length and emergence of a butterfly, using statistical treatments to ensure validity. The possible effect of other variables can bereduced by only considering data where the values of other variables are the same or similar. In between these extremes, are many types of valid design which involvedifferent degrees of manipulation and control. Fundamentally, all these investigations have a similar structure; what differs are the strategies to ensure validity.

Topic Understanding that: Example46. Fair test ... a fair test is one in which only the independent variable has been allowed to affect the

dependent variable.A laboratory experiment about the effect of temperature on dissolving time,where only the temperature is changed. Everything else is kept exactly thesame.

47. Control variables in the laboratory ... other variables can affect the results of an investigation unless their effects are controlled by keeping them constant.

In the above experiment, the mass of the chemical, the volume of liquid, the stirring technique and the room temperature are some of the variables that should be controlled.

48. Control variables in field studies ... some variables cannot be kept constant and all that can be done is to make sure that they change in the same way.

In a field study on the effect of different fertilisers on germination, the weather conditions are not held constant but each experimental plot is subjected to the same weather conditions. The conditions are matched.

49. Control variables in surveys ... the potential effect on validity of uncontrolled variables can be reduced by selecting data from conditions that are similar with respect to other variables.

In a field study to determine whether light intensity affects the colour of dog’s mercury leaves, other variables are recorded, such as soil nutrients, pH and water content. Correlations are then sought by selecting plants growing where the value of these variables is similar.

50. Control group experiments ... control groups are used to ensure that any effects observed are due to the independent variable(s) and not some other unidentified variable. They are no more than the default value of the independent variable.

In a drug trial, patients with the same illness are divided into an experimental group who are given the drug and a control group who are given a placebo or no drug.
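The survey strategy in item 49 (selecting data from conditions that are similar with respect to other variables) can be sketched in a few lines of Python. This is only an illustration: the records, the pH target and the tolerance are invented, and `select_similar` is a hypothetical helper rather than part of any published instrument.

```python
# Hypothetical survey records: (light_intensity, soil_pH, leaf_colour_score).
# All values are invented for illustration only.
records = [
    (120, 6.1, 3.2),
    (300, 6.0, 4.1),
    (450, 6.2, 4.9),
    (200, 7.8, 2.0),
    (500, 4.9, 6.5),
    (380, 6.1, 4.6),
]

def select_similar(records, ph_target, tolerance):
    """Keep only records whose soil pH lies within `tolerance` of `ph_target`."""
    return [r for r in records if abs(r[1] - ph_target) <= tolerance]

# Only plants growing at a similar pH are compared, so that pH is less
# likely to be responsible for any pattern between light and leaf colour.
matched = select_similar(records, ph_target=6.1, tolerance=0.15)
for light, ph, colour in matched:
    print(light, ph, colour)
```

The same filtering would need to be repeated for each recorded variable (soil nutrients, water content and so on), which is why surveys of this kind usually need many more readings than a laboratory fair test.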

1.3 Choosing values

The values of the variables need to be chosen carefully. This is possible in the majority of investigations during trialling. In field studies, where data are collected from variables that change naturally, some of these concepts can only be applied retrospectively.

Topic Understanding that: Example

51. Trial run ... a trial run can be used to establish the broad parameters required of the experiment (scale, range, number) and help in choosing instrumentation and other equipment.

Before drug experiments are carried out, trials are conducted to determine appropriate dosage and appropriate measures of side effects, among other things.

52. The sample ... issues of sample size and representativeness apply in the same way as in sampling a datum (see Measuring a datum).

The choice of sample size and the sampling strategy will affect the validity of the findings.

53. Relative scale ... the choice of sensible values for quantities is necessary if measurements of the dependent variable are to be meaningful.

In differentiating the dissolving times of different chemicals, a large quantity of chemical in a small quantity of water causing saturation will invalidate the results.

54. Range ... the range over which the values of the independent variable are chosen is important in ensuring that any pattern is detected.

An investigation into the effect of temperature on the volume of yeast dough using a range of 20-25°C would show little change in volume.

55. Interval ... the choice of interval between values determines whether or not the pattern in the data can be identified.

An investigation into the effect of temperature on enzyme activity would not show the complete pattern if 20°C intervals were chosen.

56. Number ... a sufficient number of readings is necessary to determine the pattern.

The number is determined partly by the range and interval issues above but, in some cases, for the complete pattern to be seen, more readings may be necessary in one part of the range than another. This applies particularly if the pattern changes near extreme values, for example, in a spring extension experiment at the top of the range of the mass suspended on the spring.
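The effect of the interval (item 55) and the number of readings (item 56) can be illustrated with a short Python sketch. The bell-shaped `activity` function, its optimum of 37°C and its width are assumptions made purely for illustration; they are not measured enzyme data.

```python
import math

def activity(temp_c, optimum=37.0, width=8.0):
    """Hypothetical enzyme activity curve (0 to 1), peaking at `optimum`."""
    return math.exp(-((temp_c - optimum) / width) ** 2)

def readings(interval):
    """Sample the curve from 0 to 80 degrees C at the chosen interval."""
    return [(t, activity(t)) for t in range(0, 81, interval)]

coarse = readings(20)   # 0, 20, 40, 60, 80
fine = readings(5)      # 0, 5, 10, ..., 80

def informative(points, threshold=0.1):
    """Readings where activity is clearly above the baseline."""
    return [t for t, a in points if a > threshold]

print("20 degree intervals:", informative(coarse))
print(" 5 degree intervals:", informative(fine))
```

With the coarse interval only one reading falls on the active part of the curve, so the rise and fall either side of the optimum cannot be seen; the finer interval places several readings there.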

1.4 Accuracy and precision

The design of the investigation must provide data with accuracy and precision sufficient to answer the question. This consideration should be built into the design of the investigation. Different investigations will require different levels of accuracy and precision depending on their purpose.

Topic Understanding that: Example

57. Determining differences ... there is a level of precision which is sufficient to provide data which will allow discrimination between two or more means.

The degree of precision required to discriminate between the bounciness of a squash ball and a ping pong ball is far less than that required to discriminate between two ping pong balls.

58. Determining patterns ... there is a level of precision which is required for the trend in a pattern to be determined.

Large error of measurement bars on a line graph or dispersed scatter plot points may not allow discrimination between an upward curve or a straight line.

1.5 Tables

Tables can be used to design an experiment in advance of the data collection and, as such, contribute towards its validity. In this way, tables can be much more than just a way of presenting data, after the data have been collected.

Topic Understanding that: Example

59. Tables ... tables can be used as organisers for the design of an experiment by preparing the table in advance of the whole experiment. A table has a conventional format.

An experiment on the effect of temperature on the dissolving time of sodium chloride:
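The original example table is not reproduced in this version of the text. As a rough sketch of what preparing a table in advance might look like, the Python below builds an empty results table before any data are collected. The temperatures, the number of repeats and the column headings are all assumptions chosen for illustration, not the values used in the original example.

```python
# Planned values (assumed for illustration; not taken from the original table).
temperatures_c = [20, 30, 40, 50, 60]   # independent variable
repeats = 3                              # repeat readings at each temperature

# Conventional layout: independent variable in the first column,
# dependent variable (with units) in the columns to the right.
header = (["Temperature (°C)"]
          + [f"Time {i} (s)" for i in range(1, repeats + 1)]
          + ["Mean time (s)"])
rows = [[str(t)] + [""] * (repeats + 1) for t in temperatures_c]

def render(header, rows):
    """Return the table as plain text with aligned columns."""
    widths = [max(len(header[i]), *(len(r[i]) for r in rows))
              for i in range(len(header))]
    lines = [" | ".join(h.ljust(w) for h, w in zip(header, widths))]
    lines.append("-+-".join("-" * w for w in widths))
    for r in rows:
        lines.append(" | ".join(c.ljust(w) for c, w in zip(r, widths)))
    return "\n".join(lines)

table = render(header, rows)
print(table)
```

Writing the table first forces decisions about range, interval and number of readings to be made before the experiment starts, which is the sense in which a table acts as an organiser for the design.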

1.6 Reliability and validity of the design

In evaluating the design of an investigation, there are two overarching questions:

Will the measurements result in sufficiently reliable data to answer the question?
Will the design result in sufficiently valid data to answer the question?

Evaluating the design of an investigation includes evaluating the reliability and validity of the ideas associated with the making of single measurements and with each and every datum.

Topic Understanding that: Example

60. Reliability of the design ... the reliability of the design includes a consideration of all the ideas associated with the measurement of each and every datum.

Factors associated with the choice of the measuring instruments to be used must be considered, e.g. the error associated with each measuring instrument. The sampling of each datum and the accuracy and precision of the measurements should also be considered. This includes the sample size, the sampling technique, relative scale, the range and interval of the measurements, the number of readings, and the appropriate accuracy and precision of the measurements.

61. Validity of the design ... the validity of the design includes a consideration of the reliability (as above) and the validity of each and every datum.

This includes the choice of measuring instrument in relation to whether the instrument is actually measuring what it is supposed to measure. This includes considering the ideas associated with the variable structure and the concepts associated with the fair test. For example, measuring the distance travelled by a car at different angles of a ramp will not answer a question about speed as a function of angle.

2. Data presentation, patterns and relationships in practical investigations

Having established that the design of an investigation is reliable and valid, what do we need to understand to explore the relationship between one variable and another? Another way of thinking about this is to think of the pattern between two variables or two sets of data. What do we need to understand to know that the pattern is valid and reliable? The way that data are presented allows patterns to be seen.

2.1 Data presentation

There is a close link between graphical representations and the type of variable they represent.

Topic Understanding that: Example

62. Tables ... a table is a means of reporting and displaying data. But a table alone presents limited information about the design of an investigation, e.g. control variables or measurement techniques are not always overtly described.

Simple patterns such as directly proportional or inversely proportional relationships can be shown effectively in a table.

63. Bar charts ... bar charts can be used to display data in which the independent variable is categoric and the dependent variable is continuous.

The number of pupils who can and cannot roll their tongues would be best presented on a bar chart.

64. Line graphs ... line graphs can be used to display data in which both the independent variable and the dependent variable are continuous. They allow interpolation and extrapolation.

The length of a spring and the mass applied would be best displayed in a line graph.

65. Scatter graphs (or scatter plots) ... scatter graphs can also be used to display data in which both the independent variable and the dependent variable are continuous. Scatter graphs are often used where there is much fluctuation in the data because they can allow an association to be detected. Widely scattered points can show a weak correlation; points clustered around, for example, a line can indicate a relationship.

The dry mass of the aerial parts of a plant and the dry mass of the roots.

66. Histograms ... histograms can be used to display data in which a continuous independent variable has been grouped into ranges and in which the dependent variable is continuous.

On a sea shore, the distance from the sea could be grouped into ranges and the number of limpets in each range plotted in a histogram.

67. Box and whisker plots ... the box, in box and whisker plots, represents the 50% of the data lying between the 25th and 75th percentiles. The central line is the median. The limits of the 'whiskers' may show either the extremes of the range or the 2.5% and 97.5% values.

Box and whisker plots are often used to compare large data sets.

68. Multivariate data ... 3D bar charts and line graphs (surfaces) are suitable for some forms of multivariate data.

69. Other forms of display ... data can be transformed, for example, to logarithmic scales so that they meet the criteria for normality which allows the use of parametric statistics.

Logarithmic transformation is commonly used in clinical and laboratory medicine, weather maps etc.
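The quantities drawn in a box and whisker plot (item 67) can be computed directly. The sketch below uses Python's standard `statistics.quantiles` function and takes the whiskers to be the extremes of the range, which is one of the two conventions mentioned above. The limpet counts are invented, and different software packages use slightly different rules for locating quartiles, so the quartile values should be read as one reasonable convention rather than the only one.

```python
import statistics

def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile and maximum."""
    ordered = sorted(values)
    # n=4 asks for the three cut points that divide the data into quarters.
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return {
        "min": ordered[0],     # lower whisker (extreme of the range)
        "q1": q1,              # bottom edge of the box (25th percentile)
        "median": median,      # central line
        "q3": q3,              # top edge of the box (75th percentile)
        "max": ordered[-1],    # upper whisker (extreme of the range)
    }

# Invented data: number of limpets counted in nine quadrats.
limpets = [12, 15, 9, 22, 17, 14, 30, 11, 16]
summary = five_number_summary(limpets)
print(summary)
```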

2.2 Statistical treatment of measurements of data

There are a large number of statistical techniques for analysing data which address three main questions:

Do two groups of data differ from each other, or could the difference have arisen by chance alone?
Do data change when repeated measurements are taken on a second separate occasion?
Is there an association between two sets of data?

Statistics consider the variability of the data and present a result based on probability. Each statistical technique has associated criteria depending on, for example, the type ofdata, its distribution, the sample size etc. Some common methods of statistical analysis of data are shown below.
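To make the idea of "a result based on probability" concrete, the sketch below estimates how often a difference between two group means at least as large as the one observed would arise if the group labels were shuffled at random. Note that this is a permutation test, chosen here because it needs only the Python standard library; it is not one of the techniques listed in the table that follows, and the measurements are invented.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate the probability of a mean difference this large by chance alone."""
    rng = random.Random(seed)            # fixed seed so the result is repeatable
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)              # reassign readings to groups at random
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_shuffles

# Invented dissolving times (s) at two temperatures.
cool = [10.1, 10.4, 9.8, 10.2, 10.0]
warm = [12.0, 12.3, 11.9, 12.1, 12.4]
p = permutation_p_value(cool, warm)
print(f"estimated probability of this difference by chance alone: {p:.4f}")
```

A small value suggests the difference is unlikely to be due to chance alone; it says nothing about whether the design was valid in the first place.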

Topic Understanding that:

70. Differences between means ... a t-test can be used to estimate the probability that two means from normally distributed populations, derived from an investigation involving a categoric independent variable, are different, i.e. what is the probability that the difference between the two means occurred by chance alone? If measures are repeated with the same or matched pairs, then a paired t-test can be used.

71. Analysis of variance ... analysis of variance is a technique which can be used to estimate the effects of a number of variables in a multivariate problem involving categoric independent variables.

72. Linear and non-linear regression ... regression can be used to derive the 'line of best fit' for data resulting from an investigation involving a continuous independent variable.

73. Non-parametric measures ... when the measurements are not normally distributed, non-parametric tests, such as the Mann-Whitney U-test, can be used to estimate the probability of any differences.

74. Categoric data ... when the data result from an investigation in which both independent and dependent variables are categoric, the analysis of the data must use, for instance, a chi-squared test.
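As an illustration of item 74, the chi-squared statistic for a table of counts can be calculated by comparing each observed count with the count expected if the two categoric variables were unrelated. The sketch below is a minimal hand calculation in Python; the function name `chi_squared` and the counts are assumptions for illustration only.

```python
def chi_squared(observed):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (obs - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof

# Invented counts: rows are two classes of pupils,
# columns are "can roll tongue" and "cannot roll tongue".
observed = [[30, 10],
            [20, 20]]
statistic, dof = chi_squared(observed)
print(f"chi-squared = {statistic:.3f} with {dof} degree(s) of freedom")
```

The statistic is then compared with a critical value from chi-squared tables; for one degree of freedom at the 5% level this is 3.841, so a larger statistic indicates an association that is unlikely to be due to chance alone.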

2.3 Patterns and relationships in data

Data must be inspected for underlying patterns. Patterns represent the behaviour of variables so that they cannot be treated in isolation from the physical system that they represent. Patterns can be seen in tables or graphs or can be reported by using the results of appropriate statistical analysis. The interpretation of patterns and relationships must respect the limitations of the data: for instance, there is a danger of over-generalisation or of implying causality when there may be a different, less direct type of association.

Topic Understanding that: Example

75. Types of patterns ... there are different types of association such as causal, consequential, indirect or chance associations. "Chance association" means that observed differences in data sets, or changes in data over time, happen simply by chance alone. We must remain sceptical and open to the possibility that a pattern has emerged by chance alone. Statistical tests give us a rational way to estimate this chance.

In any large multivariate set of data, there will be associations, some of which will be chance associations. Even if x and y are highly correlated, x does not necessarily cause y: y may cause x or z may cause x and y. Also, changes in students’ understanding before and after an intervention may not be significant and/or may be due to other factors.

76. Linear relationships ... straight line relationships (positive slopes, negative, and vertical and horizontal as special cases) can be present in data in tables and line graphs and that such relationships have important predictive power (y = mx + c).

Height and time for a falling object.

77. Proportional relationships ... direct proportionality is a particular case of a straight line relationship with consequent predictive characteristics. The relationship is often expressed in the form (y = mx).

Hooke’s law: the extension of a spring is directly proportional to the force on the spring.

78. 'Predictable' curves ... patterns can follow predictable curves (y = x² for instance), and that such patterns are likely to represent significant regularities in the behaviour of the system.

Velocity against time for a falling object. Also, the terminal velocity of a parachute against its surface area.

79. Complex curves ... some patterns can be modelled mathematically to give approximations to different parts of the curve.

Hooke's law for a spring taken beyond its elastic limit.

80. Empirical relationships ... patterns can be purely empirical and not be easily represented by any simple mathematical relationship.

Traffic flow as a function of time of day.

81. Anomalous data ... patterns in tables or graphs can show up anomalous data points which require further consideration before they are excluded.

A 'bad' measurement or datum due to human error.

82. Line of best fit ... for line graphs (and scatter graphs in some cases) a 'line of best fit' can be used to illustrate the underlying relationship, 'smoothing out' some of the inherent (uncontrolled) variation and human error.
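One common way of drawing a line of best fit for a linear relationship of the form y = mx + c (item 76) is ordinary least squares, which chooses the slope and intercept that minimise the squared vertical distances between the points and the line. The Python sketch below is a minimal version under that assumption; the function name `line_of_best_fit` and the spring readings are invented for illustration.

```python
def line_of_best_fit(xs, ys):
    """Return (m, c) for the least-squares line y = m*x + c."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired readings")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    m = s_xy / s_xx                # slope
    c = mean_y - m * mean_x        # intercept: the line passes through the means
    return m, c

# Invented readings: mass on a spring (g) against spring length (cm).
masses = [0, 50, 100, 150, 200]
lengths = [10.0, 12.1, 13.9, 16.2, 18.0]
m, c = line_of_best_fit(masses, lengths)
print(f"length = {m:.4f} * mass + {c:.2f}")
```

Anomalous points (item 81) pull a least-squares line towards themselves, so they should be considered before the line is fitted rather than after.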

3. Reliability and validity of the data in the whole investigation

In evaluating the whole investigation, all the foregoing ideas about evidence need to be considered in relation to the two overarching questions:

Are the data reliable?
Are the data valid?

In addressing these two questions, ideas associated with the making of single measurements and with each and every datum in an investigation should be considered. The evaluation should also include a consideration of the design of an investigation, ideas associated with measurement, with the presentation of the data and with the interpretation of patterns and relationships.

Data to evidence: comparisons with other data

So far we have considered the data in a single investigation. In reality, the results of an investigation will usually be compared with other data.

Topic Understanding that:

83. A series of experiments ... a series of experiments can add to the reliability and validity of evidence even if, individually, their precision does not allow much weight to be placed on the results of any one experiment alone.

84. Secondary data ... data collected by others are a valuable source of additional evidence, provided their value as evidence can be judged, e.g. meta-analyses.

85. Triangulation ... triangulation with other methods can strengthen the validity of the evidence.

Relevant societal issues

Evidence must be considered in the light of personal and social experience and the status of the investigators. If we are faced with evidence and we want to arrive at a judgement, then other factors will also come into the equation, some of which are listed below.

Topic Understanding that: Example

86. Credibility of evidence ... credibility has a lot to do with face validity: consistency of the evidence with conventional ideas, with common sense, and with personal experience. Credibility increases with the degree of scientific consensus on the evidence or on theories that support the evidence. Credibility can also turn on the type of evidence presented, for instance, statistical versus anecdotal evidence.

Evidence showing low emissions of dioxins from a smokestack is compromised by photos of black smoke spewing from the smokestack (even though dioxins are relatively colourless). Also, concern for potential health hazards for workers in some industries often begins with anecdotal evidence, but is initially rejected as not being scientifically credible.

87. Practicality of consequences ... the implications of the evidence may be practical and cost effective, or they may not be. The more impractical or costly the implications, the greater the demand for higher standards of validity and reliability of the evidence.

The negative side effects of a drug may outweigh its benefits, for all but terminally ill patients. Also, when judging the evidence on the source of acid rain, Americans will likely demand a greater degree of certainty of the evidence than Canadians who live downwind, because of the cost to American industries to reduce sulphur.

88. Experimenter bias ... evidence must be scrutinized for inherent bias of the experimenters. Possible bias may be due to funding sources, intellectual rigidity, or an allegiance to an ideology such as scientism, religious fundamentalism, socialism, or capitalism, to name but a few. Bias is also directly related to interest: Who benefits? Who is burdened?

A study of the link between cancer and smoking funded by the tobacco industry; or a study of the health effects of genetically modified foods funded by Greenpeace. Also, the acid rain issue (above) illustrates different interests on each side of the Canadian/American border.

89. Power structures ... evidence can be accorded undue weight, or dismissed too lightly, simply by virtue of its political significance or due to influential bodies. Trust can often be a factor here. Sometimes people are influenced by past occurrences of broken trust by government agencies, by industry spokespersons, or by special interest groups.

Studies published in the New England Journal of Medicine tend to receive greater weight than other studies. Also, the pharmaceutical industry’s negative reaction to Dr. Olivieri’s research results that were not supportive of their drug Apotex at Toronto’s Hospital for Sick Children in 2001.

90. Paradigms of practice ... different investigators may work within different paradigms of research. For instance, engineers operate from a different perspective than scientists. Thus, evidence garnered within one paradigm may take on quite a different status when viewed from another paradigm of practice.

Theoretical scientists tend to use evidence to support arguments for advancing a theory or model, whereas scientists working for an NGO, for instance, tend to use evidence to solve a problem at hand within a short time period. Theoretical scientists have the luxury of subscribing to higher standards of validity and reliability for their evidence.

91. Acceptability of consequences ... evidence can be denied or dismissed for what may appear to be illogical reasons such as public and political fear of its consequences. Prejudice and preconceptions play a part here.

During the tainted blood controversies in the mid 1980s, the Canadian Red Cross had difficulty accepting evidence concerning the transmission of HIV in blood transfusions. BSE and traffic pollution are examples in Europe.

92. Status of experimenters ... the academic or professional status, experience and authority of the experimenters may influence the weight which is placed on the evidence.

Nobel laureates may have their evidence accepted more easily than new researchers’ evidence. Also, a botanist’s established reputation affects the credibility of his or her testimony concerning legal evidence in a courtroom.

93. Validity of conclusions ... conclusions must be limited to the data available and not go beyond them through inappropriate generalisation, interpolation or extrapolation.

The beneficial effects of a pharmaceutical may be limited to the population sample used in the human trials of the new drug. Also, evidence acquired from a male population concerning a particular cardiac problem may not apply as widely to a female population.

We are indebted to Glen Aikenhead of the University of Saskatchewan for his detailed comments on this version and for some of the examples used to illustrate the ideas.

A concept map for 'the thinking behind the doing'

A concept map with the focus question "What is the 'thinking behind the doing' for determining the validity of data?":

NB: Concepts directly informed by substantive knowledge are highlighted with a shadow on the box.

From: Roberts, R. and Johnson, P. (2015): Understanding the quality of data: a concept map for 'the thinking behind the doing' in scientific practice, Curriculum Journal, 26(3), 345-369. DOI: 10.1080/09585176.2015.1044459, where the ideas and their relationships are explained fully and are applied to the decisions made when conducting a lab-based investigation and a fieldwork survey.


Last updated: 7/08/15

To comment on the content of these web pages or for further information, please contact: [email protected]
