Applied statistics in molecular biology - description - LSM

Axel Strauß, Dr. rer. nat. Diplombiologe

"Applied statistics in molecular biology" is a practical course designed to give you a smooth start in handling, visualising, and statistically testing data. No prior knowledge of statistics or of R [1] is required to follow the course, although it is very useful to have your own data in mind when thinking about statistics. The idea is not to show you fancy high-class statistics but to help you with everyday questions about the data you routinely collect.

Based on real-world data from the field and the lab, we will look at the following topics:

characteristic measures: We constantly summarise our data to get an overview of the results. The mean and the standard deviation or standard error are surely the favorites. But are these actually good "representatives" of my data? Which other characteristic measures ("Maßzahlen", e.g., median, quantiles, confidence interval, ...) are available, what do they mean, and which one is best to use? You will see how much the use of these measures depends on what your data look like. So we will also deal with the question "Should I use standard deviation or standard error on my bar plots?" (and we will extend it: "Should I use bar plots at all?").
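To illustrate why the choice of characteristic measure depends on the shape of the data, here is a minimal sketch. The course itself works in R; Python's standard library is used here purely for illustration, and the measurements are made-up numbers with one extreme value.

```python
import math
import statistics

# Hypothetical measurements (made-up numbers) with one extreme value.
values = [4.1, 4.3, 4.4, 4.6, 4.8, 5.0, 5.1, 12.9]

n = len(values)
mean = statistics.mean(values)
sd = statistics.stdev(values)        # sample standard deviation (divides by n - 1)
se = sd / math.sqrt(n)               # standard error of the mean
median = statistics.median(values)
q1, q2, q3 = statistics.quantiles(values, n=4)  # quartiles

print(f"mean = {mean:.2f}, sd = {sd:.2f}, se = {se:.2f}")
print(f"median = {median:.2f}, quartiles = {q1:.2f} / {q2:.2f} / {q3:.2f}")
```

The single extreme value pulls the mean upwards and inflates the standard deviation, while the median and the quartiles barely react. The standard error is always smaller than the standard deviation because it describes the precision of the mean, not the spread of the data.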

data exploration: Before you calculate means etc., you need to look at your raw data. Based on what we have seen when dealing with characteristic measures, we will do some so-called data exploration. You will get an introduction to R on this day, so you will be trained before you need to apply it.

hypothesis testing: How do I properly ask scientific questions? What are the null hypothesis and the alternative hypothesis, and what does this "magic" p-value actually mean?

comparing two samples (two data sets) I: We will start the statistics part by comparing two samples. You will get an introduction to the t-test and the Mann-Whitney U test and, based on the knowledge gained before, will apply them to both artificial data and real lab data. We will also focus on some pitfalls, such as pseudoreplication (aka "technical replicates" in lab jargon), missing controls, or strange data distributions (e.g., outliers). You will see that the t-test is a good tool, but you just can't test everything with t-tests.
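As a rough sketch of what a two-sample t-test computes, the following calculates Welch's t statistic and its degrees of freedom by hand. This is an illustration only: the course uses R (where `t.test` and `wilcox.test` do this work), the choice of the Welch variant is an assumption, and the two samples are made-up numbers. The p-value is left out because it requires the t distribution, which Python's standard library does not provide.

```python
import math
import statistics

# Hypothetical measurements (made-up numbers), one value per independent sample.
control = [5.1, 4.9, 5.6, 5.3, 5.0, 5.4]
treated = [6.0, 5.8, 6.4, 5.7, 6.1, 6.3]

def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)   # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)   # squared standard error of mean(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t, df = welch_t(treated, control)
print(f"t = {t:.2f}, df = {df:.1f}")
```

Note that each value must come from an independent sample. Feeding repeated measurements of the same sample into the test as if they were independent is exactly the pseudoreplication pitfall mentioned above.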

comparing more than two samples: This works just like the topic above but deals with more complex data (more than two factors, or a factor with more than two levels). We will first focus on ANOVA and the Kruskal-Wallis test and later on (simple) linear regression and the mix of both: ANCOVA. Model evaluation ("Was I allowed to run an ANOVA on my data?") will also be part of this session. We will also do some model simplification to keep things as simple as possible. Also important: what to report in a thesis or publication?

data visualisation: Measurements, "p-values", etc. are important for presenting your results. However, graphics are what you need to visualise your data. There are some pitfalls, and we will mostly deal with box plots, bar plots, strip charts, scatter plots, mosaic plots, ... (some of them later in the course when dealing with the respective type of data). We will do that step by step in R. Also step by step, I will show you how to modify graphs to make them more sexy.

[1] R is an open-source, free software environment for statistical computing and graphics. It is extremely useful, not too difficult to handle, and commonly used in biology and other sciences. R can be downloaded from http://www.r-project.org for all platforms. Why not Excel? Some basic tasks, graphics, and even some tests can be done in Excel, but you will see during the course that it really isn't a good choice.


data transformation: Unfortunately, data are often "mis-shaped", e.g., there are outliers or other strange things in the data set. This is often a problem for analyses: tests and graphics do not represent your data well or are even wrong. If your data have properties (skewed distribution, heteroscedasticity, outliers, ...) that restrict the use of your preferred analysis, a data transformation can make it possible. R provides very nice, easy tools to do that.
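A minimal sketch of the idea, assuming a log transformation as the example (one common choice among several) and made-up, right-skewed numbers. The course does this in R; Python's standard library is used here only for illustration, and `skewness` is a helper defined for this sketch.

```python
import math
import statistics

def skewness(xs):
    """Population skewness: mean cubed deviation divided by the cubed standard deviation."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * sd ** 3)

# Hypothetical right-skewed measurements (made-up numbers, all positive).
raw = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]
logged = [math.log(x) for x in raw]   # natural log; requires values > 0

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(f"skewness before: {skew_raw:.2f}, after log transformation: {skew_log:.2f}")
```

The transformed values are clearly less skewed, so the single large value dominates the summary far less. Whether a transformation is appropriate still depends on the data and the question.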

comparing two samples II: If you wonder whether your observations follow a specific pattern or distribution (e.g., Mendel's laws), the chi-squared goodness-of-fit test (Chi2-Anpassungstest) will tell you. If the question is whether the observation patterns of character A (e.g., genotype) are associated with those of character B (e.g., phenotype), the chi-squared test on contingency tables (Chi2-Homogenitätstest) will answer it. And if you have proportion data to analyse, a proportion test may be the solution.
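The goodness-of-fit idea can be sketched by hand for a Mendelian 3:1 ratio. The counts below are made-up, and Python's standard library stands in for R purely for illustration. The p-value shortcut via `math.erfc` is valid only for one degree of freedom, i.e., for exactly two categories.

```python
import math

# Hypothetical counts from a monohybrid cross (made-up numbers).
observed = {"dominant": 290, "recessive": 110}
expected_ratio = {"dominant": 3, "recessive": 1}   # Mendelian 3:1

total = sum(observed.values())
ratio_sum = sum(expected_ratio.values())

chi2 = 0.0
for phenotype, obs in observed.items():
    exp = total * expected_ratio[phenotype] / ratio_sum   # expected count
    chi2 += (obs - exp) ** 2 / exp

df = len(observed) - 1
# For df = 1 only: P(chi-squared > x) equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi2 = {chi2:.3f}, df = {df}, p = {p_value:.3f}")
```

With these counts the deviation from 3:1 is small relative to the sample size, so the test gives no reason to reject the expected ratio.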

power tests: Especially if you need to tell your supervisor that your data "just missed significance" (whatever that means), you will ask yourself whether you "failed" (no!) because you did not take enough measurements. Power tests can tell you which experimental setup is needed to find something, if there is something.
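As a rough sketch of what such a calculation looks like, the following estimates the sample size per group for comparing two means, using the normal approximation n = 2 * ((z_(1-alpha/2) + z_power) / d)^2, where d is the standardised effect size (difference in means divided by the standard deviation). The function name `n_per_group` and the default values are assumptions for this illustration; in R, `power.t.test` performs the exact t-based calculation, which gives slightly larger numbers.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample comparison.

    Uses the normal approximation, so it slightly underestimates the
    sample size an exact t-based calculation would give.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the two-sided test
    z_power = z.inv_cdf(power)           # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

for d in (0.2, 0.5, 0.8):
    print(f"effect size {d}: about {n_per_group(d)} samples per group")
```

The smaller the effect you want to detect, the more samples you need, and the increase is quadratic: halving the effect size roughly quadruples the required sample size.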

experimental design: Although you should think about the design of your study before you carry it out, we place this topic at the end of the course. There are some general, very important rules of experimental design that need to be considered before collecting data. It is all about replication and randomisation; it includes pseudoreplication, controls, types of data, and the principle of parsimony.

There are incredibly many books on statistics with R. Here are some I like. You don't need to buy a book for the course, but having one will help you later on.

Statistics for Ecologists Using R and Excel: Data Collection, Exploration, Analysis and Presentation
Mark Gardener. Pelagic Publishing, ©2012. Paperback, 324 pages. ISBN: 978-1-907807-12-1. £29.99 (GB) (discount possible, see http://www.gardenersown.co.uk/education/lectures/r/)
Nice start in stats with R and Excel (xls is painful!). Good explanations and examples. Useful structure. Has some limits when it comes to details and more advanced questions. Includes keys to tests. I recommend it for beginners in statistics and R.

Biostatistical Design and Analysis Using R: A Practical Guide
Murray Logan (Australian Institute of Marine Science). Wiley-Blackwell, April 2010, ©2010. Paperback, 576 pages. ISBN: 978-1-4051-9008-4. £39.99 (GB) / €51.90 (Ger)
Mainly a statistics book. Good chapters about R handling. Nice chapter on R graphics (base). Illustrative and well-explained stats. Covers almost all daily-use statistics (no multivariate analysis [e.g., PCA]). I recommend it for everybody. Alternative: The R Book


The R Book, 2nd Edition
Michael J. Crawley (Imperial College of Science, Technology and Medicine, UK). Wiley, December 2012. Hardcover, 1076 pages. ISBN: 978-0-470-97392-9. £60.00 (GB) / €77.90 (Ger)
Mainly a statistics book. A lot about R handling (more than you may need). Good chapter on R graphics (base). Well-explained statistics. Some maths. Covers almost all daily-use statistics (incl. some multivariate analysis), more than "Logan". I recommend it for everybody. Alternative: Biostatistical Design and Analysis Using R: A Practical Guide

Statistics: An Introduction using R
Michael J. Crawley (Imperial College of Science, Technology and Medicine, UK). Wiley, March 2005, ©2005. Paperback, 342 pages. ISBN: 978-0-470-02298-6. £29.95 (GB) / €38.90 (Ger). German translation available.
The small (pink) brother of The R Book. Covers a bit less stats; incl. experimental design. No real R handling chapter. Covers most things you may need and is easy to understand. Good for everybody. Will most probably be enough for you.

R in a Nutshell
Joseph Adler. O'Reilly, Second Edition, October 2012. 721 pages. ISBN 978-1-4493-1208-4. €41.00. German translation of the first edition available (€49.90).
A pretty cool R reference book. Well explained, very useful. Definitely deserves a place on your work desk. It includes R handling, the R language (syntax, objects, functions, and more), data handling, graphics, statistics examples (a how-to, but not a statistics textbook), and a short chapter about Bioconductor. Alternative: R in 10 Schritten

R in 10 Schritten
Rainer W. Alexandrowicz. facultas.wuv / UTB, First Edition, 2013. 230 pages. ISBN 9783825284848. €27.99 (print) (Ger). In German.
R reference with lots of nice step-by-step explanations and additional comments. Covers R handling, objects, data handling (several chapters: structures, import and export, reshaping, and more), graphics, R programming (basics), ... Almost no statistics (but I would recommend an additional statistics book anyway). I recommend it for everybody who wants to do more in R than run a test on prepared data. Alternative: R in a Nutshell


Getting started with R: An Introduction for Biologists
Andrew P. Beckerman & Owen L. Petchey. Oxford University Press, First Edition, 2012. 113 pages. ISBN 978-0-19-960162-2. £21 (GB) / €29.90 (Ger)
A how-to for performing statistics and graphics in R. This is no big textbook but surely a good, small, well-explained guide that points you to many important things to consider even before the stats. Very good R hints. No introduction to stats (some basic knowledge expected). I recommend it for everybody with some stats knowledge or in addition to a statistics book.

R Graphs Cookbook
Hrishi V. Mittal. PACKT Publishing, First Edition, 2011. 272 pages. ISBN 1849513066. €20.39 (ebook) or €38.99 (print incl. ebook)
Recipes for graphs in R. A cookbook with lots of examples for graphics. As with all cookbooks, you will discover many nice ways of modifying graphs by browsing through the examples. Worth buying. Please note: the "R Graphics Cookbook" from O'Reilly is written for the graphics package "ggplot2", whose syntax is totally different from that of base graphics.
