BMMB 852: Applied Bioinformatics
Week 4, Lecture 8
István Albert
Bioinformatics Consulting Center, Penn State, 2015




You’ll need a “good” text editor

Absolutely essential features:
• Must be able to show whitespace (so you can distinguish between tabs and spaces)
• Must be able to change line ending formats (Windows/Unix/Mac)

Handy features:
• Syntax highlighting
• Line numbers


There are many options; one possible choice (shown on the original slide)


The most annoying problems are caused by invisible characters

• Tabs vs spaces (when you copy/paste from the web, tabs are turned into spaces!)

• Newlines of the wrong type (yes, invisible characters can have types): Unix, Mac, Windows

• Always use UNIX line endings!
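A quick way to check a file for these invisible characters from the command line is sketched below. This is a minimal illustration, not part of the lecture: the helper names (`count_crlf`, `count_tabs`, `to_unix`) and the demo file are made up, and only standard tools (`printf`, `grep`, `tr`, `mktemp`) are assumed.

```shell
# Illustrative helpers (hypothetical names), written for a POSIX-style shell.
tmp=$(mktemp -d)

# Demo file: tab-separated columns with Windows (CRLF) line endings.
printf 'gene\tvalue\r\nabc\t1\r\n' > "$tmp/demo.txt"

cr=$(printf '\r')    # a literal carriage return
tab=$(printf '\t')   # a literal tab

# Number of lines ending in a carriage return (Windows-style endings).
count_crlf() { grep -c "${cr}\$" "$1" || true; }

# Number of lines containing at least one tab.
count_tabs() { grep -c "$tab" "$1" || true; }

# Write a copy with Unix line endings by deleting every carriage return.
to_unix() { tr -d '\r' < "$1" > "$2"; }

count_crlf "$tmp/demo.txt"
count_tabs "$tmp/demo.txt"
to_unix "$tmp/demo.txt" "$tmp/demo_unix.txt"
count_crlf "$tmp/demo_unix.txt"
```

Deleting carriage returns is only right for Windows-style files. A file with old Mac endings (carriage return alone) would end up as one long line, so there the carriage returns would need to be translated to newlines instead, for example with `tr '\r' '\n'`.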


Short Read Archive

It is (partially) documented and “sort of logical” – but only “sort of”


SRA – Sequence Read Archive naming conventions

NCBI BioProject: PRJN... - the overall description of a single research initiative; a project will typically relate to multiple samples and datasets

NCBI BioSample: SAMN… and/or SRS… in SRA - a description of biological source material; each physically unique specimen should be registered as a single BioSample with a unique set of attributes

SRA Experiment: SRX… - a unique sequencing library for a specific sample

SRA Run: SRR… ERR… - a manifest of data file(s) linked to a given sequencing library (experiment)

There is cross-linking between SRA and NCBI
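The prefixes above can be turned into a small lookup. The sketch below is only an illustration: the function name `classify_accession` is made up, and it covers just the prefixes listed on this slide, not the full list.

```shell
# Hypothetical helper: report what kind of record an accession refers to,
# using only the prefixes named on the slide.
classify_accession() {
    case "$1" in
        PRJN*)      echo "BioProject" ;;
        SAMN*|SRS*) echo "BioSample" ;;
        SRX*)       echo "SRA Experiment" ;;
        SRR*|ERR*)  echo "SRA Run" ;;
        *)          echo "unknown" ;;
    esac
}

classify_accession SRR1553610
```

Any accession with a prefix outside this short list is reported as "unknown", so the function would need more cases before being used on real metadata.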


Full list of prefixes

 


Visit the BioProject for the data


Web-based download of the data


That’s not all: when it comes to biological data distribution, confusion is the rule.

• The Gene Expression Omnibus (GEO) also stores results from functional genomics experiments, but the raw data links back to SRA.

• GEO was originally designed for microarray data and was later extended to high-throughput sequencing.

• These organizations appear to be monolithic, and it is not clear what entity is responsible for them, or who makes which decisions and why.

• This is why groups of scientists want to form their own independently run information repositories.


GEO nomenclature

Words that start with G usually refer to GEO:
• GPL… is a platform
• GSM… indicates a sample
• GSE… indicates a series

The sequencing data links back to SRA; there are other tools to read GEO data.


Getting data from SRA

• You will need to install a software package called sra-toolkit

• This package can fetch and unpack data from SRA
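As a rough sketch of what fetching looks like once sra-toolkit is installed, the helper below builds a `fastq-dump` command for a subset of one run. The helper name `fetch_subset`, the `DRY_RUN` switch and the default of 10000 spots are assumptions made for this illustration. `--split-files` and `-X` are existing `fastq-dump` options, but check `fastq-dump --help` for the installed version.

```shell
# Hypothetical helper: print (or run) a fastq-dump command for part of a run.
# Actually running it assumes sra-toolkit is installed and the network is reachable.
fetch_subset() {
    run=$1               # run accession, e.g. SRR1553610
    spots=${2:-10000}    # number of spots to fetch (assumed default)
    set -- fastq-dump --split-files -X "$spots" "$run"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"        # dry run: only show the command
    else
        "$@"             # download and unpack into the current directory
    fi
}

fetch_subset SRR1553610 1000
```

By default the helper only prints the command, so it can be tried without sra-toolkit; setting `DRY_RUN=0` runs the download for real.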

 


Downloading and accessing FASTQ data

• Work through the SRA toolkit examples

• Become familiar with the terminology, accessing data, and identifying runs


Homework 8

• Download and unpack at least five SRR runs (use subsets if it seems too slow).

• Run a fastqc report on each.

• Which run do you like most and why? Show one plot that you think shows good-quality data.

• How many sequences are in each run? Check the number for at least one run via the SRA website.

• What does the following command do:

fastq-dump -X 10 -Z SRR1553610
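For the sequence-count question in the homework, one simple approach is to count lines and divide by four. The sketch below assumes every record occupies exactly four lines (header, sequence, `+`, qualities), with no wrapped sequences. The function name `count_reads` and the two-read demo file are invented for illustration and are not real SRA data.

```shell
# Hypothetical helper: number of reads in an uncompressed, 4-line-per-record FASTQ file.
count_reads() {
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Made-up demo file with two reads.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$tmp/demo.fastq"

count_reads "$tmp/demo.fastq"
```

The result can then be compared with the number of spots reported for the run on the SRA website. If only a subset was downloaded, the two numbers are expected to differ.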