  • GBIO0002-1 Bioinformatics

    and Genetics

    (Previous GBIO0009-1 Bioinformatics)

    Introductory lecture Databases and R statistical language

  • Course instructors

    Prof. Kristel Van Steen

    Office: 0/15 in B37

    E-mail: [email protected]

    http://www.montefiore.ulg.ac.be/~kvansteen

    Prof. Franck DEQUIEDT

    Office: level +5, B34 (GIGA tower)

    E-mail: [email protected]

    Teaching Assistant

    Kyrylo Bessonov

    Office: 1/16 in B37

    [email protected]


  • Course Scope

    This course is an introduction to the fields of bioinformatics and genetics, covering a wide array of topics:

    accessing and working with the main biological databases (PubMed, Ensembl)

    sequence alignments

    statistical genetics

    microarray/genotype data analysis

    gene regulation mechanisms

    basic molecular biology concepts

  • Bioinformatics

    Definition: the collection, classification,

    storage, and analysis of biochemical

    and biological information using

    computers especially as applied to

    molecular genetics and genomics

    (Merriam-Webster dictionary)

    Definition: a field that works on problems at the intersection of biology, computer science and statistics

  • Genetics

    Definition: Study of heredity in general and of genes in particular (Concise Encyclopedia)

    1. In the 19th century, Gregor Mendel formulated the basic concepts of heredity.

    2. In 1909, Wilhelm Johannsen introduced a new word: gene.

    3. In the 1910s, Thomas Hunt Morgan provided evidence that genes occur on chromosomes and that adjacent genes on the same chromosome tend to be inherited together (linkage).

    4. In the 1940s, Oswald Avery showed that DNA is the chromosome component that carries genetic information.

    5. In 1953, the molecular structure of DNA was deduced by James D. Watson and Francis Crick, who shared the 1962 Nobel Prize with Maurice Wilkins.

    6. In the 1970s, genetic recombination techniques were developed.

  • Course expected outcomes

    Gain a taste of various bioinformatics

    fields coupled to hands-on knowledge

    Be able to perform

    multiple sequence alignments

    query biological databases

    programmatically

    perform basic GWAS and microarray

    analysis

    present scientific papers

  • Course practical aspects

    Mode of delivery: in class

    Activities:

    reading of scientific literature

    practical assignments (programming in R)

    in-class group presentations

    Meeting times:

    Thursdays from 2pm-6pm

    Room 1.123, Montefiore Institute (B28)

  • Course practical aspects

    Course material: will be posted on Prof. Kristel Van Steen's (lectures) and/or Kyrylo Bessonov's (practicals) website(s)

    http://www.montefiore.ulg.ac.be/~kvansteen/Teaching20132014.html

    http://www.montefiore.ulg.ac.be/~kbessonov/courses.html

    Assignment submission: will be done online via a special submission website

    http://www.student.montefiore.ulg.ac.be/~GBIO009-1/

    After the deadline, the assignment should be e-mailed to Kyrylo Bessonov ([email protected])

  • What will we be doing?

    We'll cover selected recent topics in bioinformatics and genetics, both through lectures and assignments (including student presentations)

    Reading papers from the bioinformatics literature and analyzing/critiquing them

    Self-learning through assignments

    In-class hands-on presentation of tools

  • How will we do it?

    Theory classes

    All course notes are in English.

    Instructors

    Kristel Van Steen

    Franck DEQUIEDT

    The theory part of the course is

    meant to be interactive:

    In-class discussions of papers / topics

  • How will we do it?

    Practical classes

    During these classes we will be looking at practical aspects of the topics introduced in theory classes. It is suggested to execute sample R scripts and demonstrations on your PCs.

    Optional reading assignments will be assigned to prepare for in-class discussions based on the previously posted papers (not graded, but they count towards the participation grade)

    Homework assignments are of 3 types (graded)

    Homework assignments result in a group report and

    can be handed in electronically in French or in English

    Homework assignments constitute an important part of this

    course

  • Types of HW assignments

    The three types of homework assignments are:

    Literature style assignment (Type 1)

    A group of students is asked to select a paper from the provided ones. The group prepares an in-class presentation and a written report

    All oral presentations of HW1, HW2, HW3 will be done at the end of the semester (all together)

    Programming style assignment (Type 2)

    A group is asked to develop R code to answer assignment questions

    Classical style assignment (Type 3)

    A group is provided with questions to be answered in the written report. Usually R scripts are provided and require execution / modification

  • Report writing tips

    Every homework assignment involves writing a

    short report

    Suggested length approximately five single-spaced

    typed pages of text, excluding figures, tables and

    bibliography

    Longer reports are acceptable

    It should contain

    an abstract (e.g., brief description of the paper content,

    description of the problem)

    results/discussion part

    If citations are made to other papers, there should be a

    bibliography (any style is OK)!

    Only one report per group is needed.

    Submit report via online system

    http://www.montefiore.ulg.ac.be/~kbessonov/present_data/GBIO009-1course2013/homework_sign_upForm.php

  • Course materials

    All course materials will be posted on Prof. Kristel Van Steen's and/or Kyrylo Bessonov's websites. Please check both sites

    http://www.montefiore.ulg.ac.be/~kvansteen/

    http://www.montefiore.ulg.ac.be/~kbessonov/present_data/GBIO009-1course2013/GBIO009-1Bioinformatics2013.html

    There is no course book

    Final course schedule will be posted online shortly

  • Evaluation scheme

    Written exam: 40% of the final mark

    Multiple choice questions/open book

    In French / English

    Assignments: 50% of the final mark

    Total of 3 assignments

    Participation in discussions (10%)

    Throughout the course

    During students' oral presentations

    Last lecture of the course

  • Assignment Submission Step by Step Guide

  • Assignment submission

    All assignments should be zipped into

    one file (*.zip) and submitted online

    Create a submission account

  • Account creation

    Any member of the group can submit the assignment

    Account details will be emailed to you automatically

    All GBIO009-1 students should create an account

  • Submit your assignment

    After account creation, log in to the submission page

    The remaining time to the deadline is displayed. It is a good idea to check it from time to time in order to be on top of things

    File extension should be zip

    You can submit the assignment as many times as you wish

  • Introduction to R: A basic tutorial

  • Definition

    R is a free software environment for statistical computing and graphics [1]

    R is considered to be one of the most widely used languages amongst statisticians, data miners, bioinformaticians and others.

    R is a free implementation of the S language

    Commercial alternatives include SPSS, SAS and MATLAB

    [1] R Core Team, R: A Language and Environment for Statistical Computing, Vienna, Austria (http://www.R-project.org/)

  • Why to learn R?

    Since it is free and open-source, R is widely used by bioinformaticians and statisticians

    It is multiplatform

    It has a very wide selection of additional libraries that allow it to be used in many domains, including bioinformatics

    Main library repositories: CRAN and BioConductor

  • Programming? Should I be scared?

    R is a scripting language and, as such, is much easier to learn than compiled languages such as C

    R has reasonably well written

    documentation (vignettes)

    Syntax in R is simple and intuitive if

    one has basic statistics skills

    R scripts will be provided and

    explained in-class

  • Topics covered in this tutorial

    Operators / Variables

    Main objects types

    Plotting and plot modification functions

    Writing and reading data to/from files

  • Variables/Operators

    Variables store one element

    > x <- 25

    > x

    [1] 25

    Basic mathematical operators that can be applied to variables: (+), (-), (/), (*)

    Use parentheses to obtain the desired sequence of mathematical operations

  • Arithmetic operators

    What is the value of small z here? >x y z Z z

    [1] 80
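    The assignments on this slide did not survive extraction, so the following is only a sketch with assumed values (x, y and both expressions are hypothetical) that happens to reproduce the printed result. It illustrates operator precedence, the effect of parentheses, and that R is case-sensitive, so z and Z are distinct variables.

```r
# Assumed values; the slide's own numbers are not recoverable
x <- 25
y <- 15

# Without parentheses, * is evaluated before +
no_parens <- x + y * 2      # 25 + 30

# Parentheses force the addition to happen first
z <- (x + y) * 2            # 40 * 2

# R is case-sensitive: Z is a different variable from z
Z <- z / 4

print(no_parens)
print(z)
print(Z)
```

    Printing z and Z separately shows that assigning to Z leaves z unchanged.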

  • Vectors

    Vectors have only 1 dimension and represent an enumerated sequence of data. They can also store variables

    > v1 <- c(1, 2, 3, 4, 5)

    > mean(v1)

    [1] 3

    The elements of a vector are specified/modified with square brackets (e.g. [number])

    > v1[1] <- 48

    > v1

    [1] 48 2 3 4 5

  • Logical operators

    These operators mostly work on vectors, matrices and other data types

    Type of data is not important, the same operators are used for numeric and character data types

    Operator    Description

    <           less than

    <=          less than or equal to

    >           greater than

    >=          greater than or equal to

    ==          exactly equal to

    !=          not equal to

    !x          Not x

    x | y       x OR y

    x & y       x AND y

  • Logical operators

    Can be applied to vectors in the following way. The return value is either TRUE or FALSE for each element

    > v1

    [1] 48 2 3 4 5

    > v1
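    The comparison on this slide is cut off in the extracted text. As a minimal sketch (the thresholds 5, 2 and 48 and the vector bases are assumptions chosen for illustration), logical operators applied to the vector v1 could look like this:

```r
v1 <- c(48, 2, 3, 4, 5)

# Comparisons are applied element by element
small <- v1 < 5
print(small)

# Conditions can be combined with & (AND) and | (OR)
mid <- v1 > 2 & v1 < 48
print(mid)

# A logical vector can be used to select elements
print(v1[mid])

# The same operators work on character data
bases <- c("A", "C", "G", "T")
print(bases != "G")
```

    Each comparison returns one TRUE/FALSE value per element, which is why the result can be used directly as an index.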

  • R workspace

    Display all workspace objects (variables, vectors, etc.) via ls():

    > ls()

    [1] "Z" "v1" "x" "y" "z"

    Useful tip: to save the workspace to a file and restore it later use:

    > save.image(file = "workplace.rda")

    > load(file = "workplace.rda")

  • How to find help info?

    Any function in R has help information

    To invoke help use the ? sign or help():

    ?function_name

    ?mean

    help(mean, try.all.packages = TRUE)

    To search in all packages installed in your R installation always use try.all.packages = TRUE in help()

    To search for a keyword in the R documentation use help.search():

    help.search("mean")

  • Basic data types

    Data could be of 3 basic data types:

    numeric

    character

    logical

    Numeric variable type: > x mode(x)

    [1] "numeric"

  • Basic data types

    Logical variable type (True/False): > y z mode(z)

    [1] "character"
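    The values assigned on these two slides were lost in extraction. A minimal sketch of the three basic data types, with assumed example values, might be:

```r
# Assumed example values, one per basic data type
x <- 3.14      # numeric
y <- TRUE      # logical
z <- "ACGT"    # character

# mode() reports the storage mode of each variable
print(mode(x))
print(mode(y))
print(mode(z))
```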

  • Objects/Data structures

    The main data objects in R are:

    Matrices (single data type)

    Data frames (supports various data types)

    Lists (contain set of vectors)

    Other more complex objects with slots

    Matrices are 2D objects (rows/columns)

    > m <- matrix(0, nrow = 2, ncol = 3)

    > m

    [,1] [,2] [,3]

    [1,] 0 0 0

    [2,] 0 0 0

  • Lists

    Lists contain various vectors. Each vector in the list can be accessed by double square brackets [[number]]

    > x <- c(1, 2, 3, 4)

    > y <- c(2, 3, 4)

    > L1 <- list(x, y)

    > L1

    [[1]]

    [1] 1 2 3 4

    [[2]]

    [1] 2 3 4

  • Data frames

    Data frames are similar to matrices but

    can contain various data types > x y z
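    The code on this slide is truncated. As a sketch only (the column names id, label and flag and all values are assumptions), a data frame mixing numeric, character and logical columns could be built like this:

```r
# Three vectors of different types but equal length (assumed values)
x <- c(1, 2, 3)
y <- c("A", "B", "C")
z <- c(TRUE, FALSE, TRUE)

# Each vector becomes one column of the data frame
df <- data.frame(id = x, label = y, flag = z, stringsAsFactors = FALSE)
print(df)

# Every column keeps its own data type
print(sapply(df, class))

# Columns are accessed with $, rows with [row, column] indexing
print(df$label[2])
selected <- df[df$flag, ]
print(selected)
```

    Unlike a matrix, the columns here do not have to share one data type, which is why data frames are the usual container for tabular biological data.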

  • Factors

    Factors are similar to character vectors but can provide info on

    unique values in the vector (levels)

    value counts (quantity)

    > letters = c("A","B","C","A","C","C")

    > letters = factor(letters)

    > letters

    [1] A B C A C C

    Levels: A B C

    > summary(letters)

    A B C

    2 1 3

  • Input/Output

    To read data into R from a text file use read.table()

    read help(read.table) to learn more

    scan() is a more flexible alternative

    > raw_data <- ...

    > write.table(mydata, "data_file.txt")
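    A minimal round trip through a text file could look like the sketch below. The contents of mydata, the tab separator and the use of a temporary file are assumptions made so that the example is self-contained.

```r
# Toy data, made up for illustration
mydata <- data.frame(sample = c("s1", "s2", "s3"),
                     value = c(1.5, 2.0, 3.25),
                     stringsAsFactors = FALSE)

# Write to a temporary tab-separated text file
data_file <- tempfile(fileext = ".txt")
write.table(mydata, data_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the same file back; header = TRUE restores the column names
raw_data <- read.table(data_file, header = TRUE, sep = "\t",
                       stringsAsFactors = FALSE)
print(raw_data)

# Remove the temporary file
unlink(data_file)
```

    Passing the same sep to both functions avoids surprises when a field contains spaces.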

  • Conversion between data types

    One can convert one type of data

    into another using as.xxx where xxx

    is a data type
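    A few as.xxx conversions are sketched below with assumed example values. The factor case is included because converting a factor directly to numeric returns the internal level codes rather than the labels.

```r
# Character to numeric
s <- c("1", "2", "3")
n <- as.numeric(s)
print(sum(n))

# Numeric back to character
chars <- as.character(n)
print(chars)

# Numeric to logical: 0 is FALSE, any other number is TRUE
flags <- as.logical(c(0, 1, 2))
print(flags)

# Factor to numeric: convert to character first to keep the labels
f <- factor(c("10", "20", "10"))
codes <- as.numeric(f)                  # internal level codes
labels <- as.numeric(as.character(f))   # the labels as numbers
print(codes)
print(labels)
```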

  • Plots generation in R

    R provides a very rich set of plotting possibilities

    The basic command is plot()

    Many libraries provide their own version of the plot() function

    When R plots graphics it opens a graphical device that can be either a window or a file

  • Plotting functions

    R offers the following array of plotting functions

    Function     Description

    plot(x)      plot of the values of the x variable on the y axis

    plot(x,y)    bi-variable plot of x and y values (both axes scaled based on values of x and y variables)

    pie(y)       circular pie chart

    boxplot(x)   plots a box plot showing variables via their quantiles

    hist(x)      plots a histogram (bar plot)

  • Plot modification functions

    Often R plots are not optimal and one

    would like to add colors or to correct

    position of the legend or do other

    appropriate modifications

    R has an array of graphical parameters that are a bit complex to learn at first glance. The full list is available at http://www.math.montana.edu/Rweb/Rhelp/par.html

    Some of the graphical parameters can be specified inside plot() or using other graphical functions such as lines()

  • Plot modification functions

    Function                                 Description

    points(x,y)                              adds points to the plot using coordinates specified in x and y vectors

    lines(x,y)                               adds a line using coordinates in x and y

    mtext(text,side=3)                       adds text to a given margin specified by side number

    hist(x)                                  draws a histogram that bins values of x into categories represented as bars

    arrows(x0,y0,x1,y1, angle=30, code=1)    adds an arrow to the plot from (x0, y0) to (x1, y1). angle sets the angle between the arrow shaft and the edges of the arrowhead, and code specifies at which end the arrowhead is drawn

    abline(h=y)                              draws a horizontal line at the y coordinate

    rect(x1, y1, x2, y2)                     draws a rectangle with corners at (x1, y1) and (x2, y2)

    legend(x,y)                              adds a legend to the plot at the position given by the x and y coordinates

    title()                                  adds a title to the plot

    axis(side, vect)                         adds an axis on the chosen one of the 4 sides; vect specifies where tick marks are drawn
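    Several of the functions in the table above can be combined on a single plot. The sketch below uses made-up data and writes to a temporary PDF file, so the graphical device is a file rather than a window; the colours, symbols and positions are arbitrary choices.

```r
# Open a file-based graphical device (a temporary PDF)
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)

# Made-up data for illustration
x <- 1:10
y <- x^2

# Base plot; graphical parameters such as col and pch are set inside plot()
plot(x, y, col = "blue", pch = 19, xlab = "x", ylab = "y")

# Modify the existing plot
lines(x, y, col = "grey")                 # connect the points
points(5, 25, col = "red", pch = 17)      # highlight one point
abline(h = 50, lty = 2)                   # horizontal reference line
arrows(3, 60, 5, 28, angle = 30, code = 2) # arrowhead at the end point
mtext("squared values", side = 3)         # text in the top margin
title(main = "Plot modification example")
legend("topleft", legend = c("data", "highlight"),
       col = c("blue", "red"), pch = c(19, 17))

# Close the device so the file is written
dev.off()
print(file.exists(plot_file))
```

    Replacing pdf(plot_file) and dev.off() with nothing would draw the same figure in an on-screen window in an interactive session.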

  • Demos

    R functionality demonstration

    Plots: demo(graphics)

    3D: demo(persp)

    GLM data modelling: demo(lm.glm)

  • Installation of new libraries

    There are two main R repositories

    CRAN

    BioConductor

    To install package/library from CRAN (http://cran.r-project.org/web/packages/available_packages_by_date.html)

    install.packages("seqinr")

    To install packages from BioConductor (http://www.bioconductor.org/)

    source("http://bioconductor.org/biocLite.R")

    biocLite("GenomicRanges")

  • Installation of key libraries

    Install the latest R version on your PC. Go to http://cran.r-project.org/

    Install the following libraries by running

    install.packages(c("seqinr", "muscle", "ape", "GenABEL"))

    source("http://bioconductor.org/biocLite.R")

    biocLite(c("limma", "affy", "hgu133plus2.db", "Biostrings"))

  • Conclusions

    We hope this course will provide you with a good array of analytical and practical skills

    We chose R for this course as it is a very flexible language with a large scope of applications and is widely used

  • What are we looking for?

    Data & databases

  • Biologists Collect LOTS of Data

    Hundreds of thousands of species to explore

    Millions of written articles in scientific journals

    Detailed genetic information:

    gene names

    phenotype of mutants

    location of genes/mutations on chromosomes

    linkage (distances between genes)

    High Throughput lab technologies

    PCR

    Rapid inexpensive DNA sequencing (Illumina HiSeq)

    Microarrays (Affymetrix)

    Genome-wide SNP chips / SNP arrays (Illumina)

    Must store data such that

    Minimum data quality is checked

    Well annotated according to standards

    Made available to wide public to foster research

  • What is a database?

    Organized collection of data

    Information is stored in "records", "fields" and "tables"

    Fields are categories

    Must contain data of the same type (e.g. columns below)

    Records contain data that is related to one object (e.g. protein, SNP) (e.g. rows below)

    SNP ID     | SNPSeqID           | Gene                         | +primer                   | -primer

    D1Mit160_1 | 10.MMHAP67FLD1.seq | lymphocyte antigen 84        | AAGGTAAAAGGCAATCAGCACAGCC | TCAACCTGGAGTCAGAGGCT

    M-05554_1  | 12.MMHAP31FLD3.seq | procollagen, type III, alpha | TGCGCAGAAGCTGAAGTCTA      | TTTTGAGGTGTTAATGGTTCT
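    The records/fields idea maps directly onto an R data frame, where each row is a record and each column is a field. The sketch below re-types the two rows of the slide's table; the column names (snp_id, plus_primer, etc.) are assumptions, and the primer strings are joined from the wrapped cells of the slide.

```r
# Each column is a field (one data type), each row is a record (one SNP)
snps <- data.frame(
  snp_id       = c("D1Mit160_1", "M-05554_1"),
  snp_seq_id   = c("10.MMHAP67FLD1.seq", "12.MMHAP31FLD3.seq"),
  gene         = c("lymphocyte antigen 84", "procollagen, type III, alpha"),
  plus_primer  = c("AAGGTAAAAGGCAATCAGCACAGCC", "TGCGCAGAAGCTGAAGTCTA"),
  minus_primer = c("TCAACCTGGAGTCAGAGGCT", "TTTTGAGGTGTTAATGGTTCT"),
  stringsAsFactors = FALSE
)

# Retrieve one record (row) by its identifier
rec <- snps[snps$snp_id == "M-05554_1", ]
print(rec)

# Retrieve one field (column) across all records
print(snps$gene)
```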

  • Genome sequencing

    generates lots of data

  • Biological Databases

    The number of databases is constantly growing!

    - OBRC: Online Bioinformatics Resources Collection lists over 2,800 databases (as of 2013)

    http://www.hsls.pitt.edu/obrc/

  • Main databases by category

    Literature

    PubMed: scientific & medical abstracts/citations (http://www.ncbi.nlm.nih.gov/pubmed/)

    Health

    OMIM: Online Mendelian Inheritance in Man (http://www.ncbi.nlm.nih.gov/omim/)

    Nucleotide Sequences

    Nucleotide: DNA and RNA sequences (http://www.ncbi.nlm.nih.gov/nuccore/)

    Genomes

    Genome: genome sequencing projects by organism (http://www.ncbi.nlm.nih.gov/genome/)

    dbSNP: short genetic variations (http://www.ncbi.nlm.nih.gov/snp/)

    Proteins

    Protein: protein sequences (http://www.ncbi.nlm.nih.gov/protein/)

    UniProt: protein sequences and related information (http://www.uniprot.org/help/uniprotkb)

    Chemicals

    PubChem Compound: chemical information with structures, information and links (http://www.ncbi.nlm.nih.gov/pccompound/)

    Pathways

    BioSystems: molecular pathways with links to genes, proteins (http://www.ncbi.nlm.nih.gov/biosystems/)

    KEGG Pathway: information on main biological pathways (http://www.genome.jp/kegg/pathway.html)

  • Growth of UniProtKB database

    UniProtKB contains mainly protein sequences (entries). The database growth is exponential

    Data management issues? (e.g. storage, search, indexing?)

    [Figure: growth in the number of UniProtKB entries]

    Source: http://www.ebi.ac.uk/uniprot/TrEMBLstats

  • Primary and Secondary

    Databases

    Primary databases

    REAL EXPERIMENTAL DATA (raw)

    Biomolecular sequences or structures and associated

    annotation information (organism, function, mutation linked to disease, functional/structural patterns, bibliographic etc.)

    Secondary databases

    DERIVED INFORMATION (analyzed and annotated)

    Fruits of analyses of primary data in the primary sources (patterns, blocks, profiles etc. which represent the most conserved

    features of multiple alignments)

  • Primary Databases

    Sequence Information

    DNA: EMBL, GenBank, DDBJ

    Protein: SwissProt, TrEMBL, PIR, OWL

    Genome Information

    GDB, MGD, ACeDB

    Structure Information

    PDB, NDB, CCDB/CSD

  • Secondary Databases

    Sequence-related Information

    ProSite, Enzyme, REBase

    Genome-related Information

    OMIM, TransFac

    Structure-related Information

    DSSP, HSSP, FSSP, PDBFinder

    Pathway Information

    KEGG, Pathways

  • GenBank database

    Contains publicly available nucleotide (DNA/RNA) sequences, together with their protein translations, described in the scientific literature or collected in publicly funded research

    One can search by protein name to get DNA/mRNA sequences

    The search results could be filtered by species and other parameters

  • GenBank main fields

  • NCBI Databases contain more

    than just

    DNA & protein sequences

    NCBI main portal: http://www.ncbi.nlm.nih.gov/

    http://www.ncbi.nlm.nih.gov/

  • Fasta format to store sequences

    Saccharomyces cerevisiae strain YC81 actin (ACT1) gene

    GenBank: JQ288018.1

    >gi|380876362|gb|JQ288018.1| Saccharomyces cerevisiae strain YC81 actin (ACT1) gene, partial cds

    TGGCATCATACCTTCTACAACGAATTGAGAGTTGCCCCAGAAGAACACCCTGTTCTTTTG

    ACTGAAGCTCCAATGAACCCTAAATCAAACAGAGAAAAGATGACTCAAATTATGTTTGAA

    ACTTTCAACGTTCCAGCCTTCTACGTTTCCATCCAAGCCGTTTTGTCCTTGTACTCTTCC

    GGTAGAACTACTGGTATTGTTTTGGATTCCGGTGATGGTGTTACTCACGTCGTTCCAATT

    TACGCTGGTTTCTCTCTACCTCACGCCATTTTGAGAATCGATTTGGCCGGTAGAGATTTG

    ACTGACTACTTGATGAAGATCTTGAGTGAACGTGGTTACTCTTTCTCCACCACTGCTGAA

    AGAGAAATTGTCCGTGACATCAAGGAAAAACTATGTTACGTCGCCTTGGACTTCGAGCA

    AGAAATGCAAACCGCTGCTCAATCTTCTTCAATTGAAAAATCCTACGAACTTCCAGATGG

    TCAAGTCATCACTATTGGTAAC

    The FASTA format is supported by virtually all databases and software that handle DNA and protein sequences

    Specifications:

    One header line that starts with > and ends with [return]

    The sequence follows on the next line(s)
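    The header/sequence layout described above can be parsed with a few lines of base R. This is a minimal sketch, not a replacement for a dedicated reader such as the one in the seqinr package: the helper name read_fasta_lines is hypothetical, only the first two sequence lines of the ACT1 record are used, the second record is made up, and every record is assumed to have at least one sequence line.

```r
# FASTA text as a character vector, one element per line
fasta_text <- c(
  ">gi|380876362|gb|JQ288018.1| Saccharomyces cerevisiae strain YC81 actin (ACT1) gene, partial cds",
  "TGGCATCATACCTTCTACAACGAATTGAGAGTTGCCCCAGAAGAACACCCTGTTCTTTTG",
  "ACTGAAGCTCCAATGAACCCTAAATCAAACAGAGAAAAGATGACTCAAATTATGTTTGAA",
  ">toy_record made up for illustration",
  "ACGT"
)

# Hypothetical helper: split FASTA lines into headers and sequences
read_fasta_lines <- function(lines) {
  is_header <- grepl("^>", lines)
  # Every header starts a new record; cumsum numbers the records
  record <- cumsum(is_header)
  headers <- sub("^>", "", lines[is_header])
  # Join the sequence lines that belong to the same record
  seqs <- tapply(lines[!is_header], record[!is_header], paste, collapse = "")
  data.frame(header = headers, sequence = as.vector(seqs),
             stringsAsFactors = FALSE)
}

records <- read_fasta_lines(fasta_text)
print(records$header)
print(nchar(records$sequence))
```

    To parse a file instead of an in-memory vector, the lines could first be loaded with readLines().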

  • OMIM database

    Online Mendelian Inheritance in Man (OMIM)

    information on all known Mendelian disorders linked to over 12,000 genes

    Started in the 1960s by Dr. Victor A. McKusick as a catalog of Mendelian traits and disorders

    Linked disease data

    Links disease phenotypes and causative genes

    Used by physicians and geneticists

    http://www.ncbi.nlm.nih.gov/omim

  • OMIM basic search

    Online Tutorial: http://www.openhelix.com/OMIM

    Each search result entry has a *, +, # or % symbol

    # entries are the most informative, as the molecular basis of the phenotype-genotype association is known

    Will do search on: Ankylosing spondylitis (AS)

    AS is characterized by chronic inflammation of the spine

  • OMIM-search results

    Look for the entries that link to the genes.

    Apply filters if needed

    Filter results if known SNP is associated to

    the entry

    Some of the interesting entries. Try to look

    for the ones with # sign

  • OMIM-entries

  • OMIM Gene ID -entries

  • OMIM-Finding disease linked genes

    Read the report and find genes linked to the phenotype (e.g. IL23R)

    Mapping how the disease gene was found

  • PubMed database

    PubMed is one of the best known databases in the whole scientific community

    Most biology-related literature from all the related fields is indexed by this database

    It has a very powerful mechanism for constructing search queries

    Many search fields

    Logical operators (AND, OR)

    Provides electronic links to most journals

    Example of searching by author for articles published within 2012-2013

    http://www.ncbi.nlm.nih.gov/pubmed/
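    A search by author and publication year can also be written as a query string and sent to PubMed programmatically. The sketch below only builds the query and the request URL, without contacting the server. The author name is a hypothetical choice, and the field tags [Author] and [dp] as well as the NCBI E-utilities esearch address are stated from memory of the PubMed and E-utilities documentation, so they should be checked there before use.

```r
# Hypothetical author; any name in "Surname Initials" form could be used
author <- "Van Steen K"

# Field tags restrict each term to one search field;
# AND combines the author term with a publication-date range
term <- paste0(author, "[Author] AND 2012:2013[dp]")
print(term)

# Assumed E-utilities search endpoint; the term must be URL-encoded
base_url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
query_url <- paste0(base_url, "?db=pubmed&term=",
                    URLencode(term, reserved = TRUE))
print(query_url)
```

    The same term string can be pasted directly into the PubMed search box on the website.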

  • In-class assignment/demo

    on

    Biological Databases

  • Demo/Assignment

    Question 1

    Explore OMIM database and key

    clinical information on

    glioblastoma (brain cancer)

    Question 2

    Look for literature in PUBMED related

    to the disease

    Learn how to create search queries

  • Thanks for your attention!

    Have a nice week!