
Clinical Data Classification of Alzheimer's Disease


Alzheimer's Disease - Clinical Data Classification

By George Kalangi

Venkata Gopi


Overview:

• Introduction
• Analysis of commonly used terms and explanation of data sets
• Overall Programming Process
• Generating a merged file with CDGLOBAL
• Generation of files for future status prediction
• Data Preprocessing
• Classification (Algorithms) used on the data
• Analysis on the output data from WEKA


Introduction

• What is Alzheimer's Disease?
• Brain disorder
• Most common form of dementia
  – Dementia is a term for the loss of memory and other intellectual abilities serious enough to interfere with daily life
• Clinical Dementia Rating (0, 0.5, 1, 2, 3)
  – Normal: 0
  – Questionable dementia: 0.5
  – Mild to severe dementia: 1.0 to 3.0


Datasets (60 Files)

• 56 comma-separated files
  – 1 File – Data Dictionary (explains the terms used)
  – 1 File – Clinical Dementia Rating (has CDGLOBAL)
  – Rest – Assessments
• Data Definitions
• Others, such as visits, having abbreviations


Environment Setup

• Languages and database systems used for the project are PHP, MySQL, Java, and PostgreSQL
• Tools used are WEKA (Waikato Environment for Knowledge Analysis), MySQL Workbench, and NetBeans
  – Front end (PHP)
  – Back end (MySQL)


Overall Programming Process

• A selected dataset (FAQ) is given by the user.
• At the back end, MySQL queries are defined to create the required tables and insert the required data into the corresponding tables.
• After that, the required operations are performed on the tables.
• Final output files are stored in .csv format.


Generating a merged file with CDGLOBAL (for current status)

• Input is a given dataset file (e.g. adni_faq_2011-01-20.csv) together with the adni_cdr_2011-01-20.csv file.
  – The RIDs and VISCODEs of the FAQ and CDR files are compared, and for matching rows the CDGLOBAL column of the CDR file is merged into the FAQ file.
• During the merge, rows whose CDGLOBAL is -1 are removed, and rows with the VISCODEs f, nv, and uns1 are trimmed off.
• The result file is "Merged_dataset_file.csv".


Query used for generating merged file:

    SELECT f.cID, f.RID, f.VISCODE, f.EXAMDATE, f.FAQSOURCE, f.FAQFINAN,
           f.FAQFORM, f.FAQSHOP, f.FAQGAME, f.FAQBEVG, f.FAQMEAL, f.FAQEVENT,
           f.FAQTV, f.FAQREM, f.FAQTRAVL, f.FAQTOTAL, cdr.cdglobal
    FROM cdr, faq f
    WHERE cdr.rid = f.rid
      AND cdr.VISCODE = f.VISCODE
      AND cdr.cdglobal NOT IN (-1);
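The merge can be tried out on a small scale with the sketch below. It is only an illustration: it uses Python's built-in sqlite3 module instead of the PHP/MySQL setup described in the slides, the tables carry only a subset of the FAQ columns, and the sample rows (including the visit code "bl") are made up rather than taken from the ADNI files.

```python
import sqlite3

def merge_faq_with_cdr(faq_rows, cdr_rows):
    """Join FAQ rows to CDR rows on (RID, VISCODE), dropping CDGLOBAL = -1."""
    con = sqlite3.connect(":memory:")
    # Reduced, hypothetical schemas: the real files have many more columns.
    con.execute("CREATE TABLE faq (RID INTEGER, VISCODE TEXT, FAQTOTAL INTEGER)")
    con.execute("CREATE TABLE cdr (RID INTEGER, VISCODE TEXT, CDGLOBAL REAL)")
    con.executemany("INSERT INTO faq VALUES (?, ?, ?)", faq_rows)
    con.executemany("INSERT INTO cdr VALUES (?, ?, ?)", cdr_rows)
    merged = con.execute(
        """
        SELECT f.RID, f.VISCODE, f.FAQTOTAL, cdr.CDGLOBAL
        FROM cdr, faq f
        WHERE cdr.RID = f.RID
          AND cdr.VISCODE = f.VISCODE
          AND cdr.CDGLOBAL NOT IN (-1)
        ORDER BY f.RID, f.VISCODE
        """
    ).fetchall()
    con.close()
    return merged

# Hypothetical sample rows (not ADNI data).
faq_rows = [(1, "bl", 0), (1, "m06", 2), (2, "bl", 10)]
cdr_rows = [(1, "bl", 0.0), (1, "m06", 0.5), (2, "bl", -1)]
merged_rows = merge_faq_with_cdr(faq_rows, cdr_rows)
```

Subject 2 is dropped from `merged_rows` because its only CDGLOBAL value is -1.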


Generation of files for future status prediction

• The prediction dataset is generated by mapping the first visit to the 6-month class, the 6-month visit to the 12-month class, and so on.
• SQL query operations are performed on the merged file to separate the classes at 6-month time intervals.
• The following files are generated:
  – File_dataset_m06.csv
  – File_dataset_m12.csv, and so on


Query used for generating class files:

    SELECT v.ID AS ID, v.RID AS RID, v.VISCODE, v.EXAMDATE, v.FAQSOURCE,
           v.FAQFINAN, v.FAQFORM, v.FAQSHOP, v.FAQGAME, v.FAQBEVG, v.FAQMEAL,
           v.FAQEVENT, v.FAQTV, v.FAQREM, v.FAQTRAVL, v.FAQTOTAL, m12.cdrglobal
    FROM `table_adni_faq_2011-01-20_m06` v,
         `table_adni_faq_2011-01-20_m12` m12
    WHERE v.rid = m12.rid
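The idea of pairing the attributes recorded at one visit with the class observed at the next visit can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's code: the function name, the dictionary layout, and the sample values are all hypothetical.

```python
def build_future_dataset(current_visit, next_visit):
    """Pair each subject's features at one visit with the class at the next visit.

    Both arguments map RID -> (features, cdglobal).
    Subjects that have no row at the later visit are skipped.
    """
    dataset = []
    for rid in sorted(current_visit):
        if rid in next_visit:
            features, _current_class = current_visit[rid]
            _next_features, future_class = next_visit[rid]
            dataset.append((rid, features, future_class))
    return dataset

# Hypothetical rows: RID -> ([FAQTOTAL], CDGLOBAL)
m06 = {1: ([2], 0.5), 2: ([0], 0.0), 3: ([9], 1.0)}
m12 = {1: ([4], 1.0), 2: ([0], 0.0)}
future_m12 = build_future_dataset(m06, m12)
```

Subject 3 has no 12-month row, so it does not appear in `future_m12`; subject 1 keeps its 6-month features but is labelled with its 12-month class.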


Preprocessing

• After we get the required .csv files, we use WEKA to preprocess the data.
• Load the file into WEKA.
• Apply the filter "weka.filters.unsupervised.attribute.Remove" to trim off the unused fields.
• Apply "NumericToNominal" to convert all the values in the data to nominal before feeding them to a classifier algorithm.
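Outside WEKA, the effect of these two filters can be approximated as follows. This is a rough Python sketch of the idea, not a reimplementation of the WEKA filters; the column names to drop and the sample rows are assumptions.

```python
import csv
import io

def preprocess(csv_text, drop_columns):
    """Drop unused columns and treat every remaining value as a nominal label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [name for name in reader.fieldnames if name not in drop_columns]
    # Storing values as strings makes them categories rather than numbers.
    rows = [{name: str(row[name]).strip() for name in kept} for row in reader]
    # A nominal attribute is described by the set of labels it can take.
    labels = {name: sorted({row[name] for row in rows}) for name in kept}
    return rows, labels

# Hypothetical sample in the shape of the merged file (reduced columns).
sample = (
    "RID,VISCODE,EXAMDATE,FAQTOTAL,CDGLOBAL\n"
    "1,bl,2010-01-05,0,0\n"
    "2,bl,2010-02-11,10,0.5\n"
)
rows, labels = preprocess(sample, drop_columns={"RID", "VISCODE", "EXAMDATE"})
```

After this step `labels["CDGLOBAL"]` lists the distinct class labels as strings, which is the form a nominal class attribute takes.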


Classification Algorithms Used

• The Classify panel enables the user to apply classification and regression algorithms (indiscriminately called classifiers in Weka) to the resulting dataset, to estimate the accuracy of the resulting predictive model.
• J48 uses the C4.5 algorithm (a successor of ID3)
• Naïve Bayesian classification algorithm


What is classification?

• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Example: If the items in a house are not classified, we cannot arrange them. We classify the items by their usage (cooking items, decoration items, etc.) so that we can arrange them accordingly and use them more easily and efficiently.


Decision Tree Classification Task

[Figure: decision tree applied to test data. Test nodes: Refund (Yes / No), MarSt (Married / Single, Divorced), TaxInc (< 80K / > 80K); leaves: YES / NO. The test record is assigned Cheat = "No".]
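Applying such a tree to a record is a chain of attribute tests. The sketch below assumes the usual textbook wiring of this example (Refund = Yes leads to NO, Married leads to NO, and only Single or Divorced records with taxable income of 80K or more lead to YES); the figure itself is not reproduced here, so that wiring and the sample record are assumptions.

```python
def predict_cheat(record):
    """Walk the tree Refund -> MarSt -> TaxInc and return the Cheat label."""
    if record["Refund"] == "Yes":
        return "No"
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: decide on taxable income (in thousands).
    return "No" if record["TaxInc"] < 80 else "Yes"

# Hypothetical test record.
test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
cheat = predict_cheat(test_record)
```

For this record the walk stops at the MarSt test, so the income value is never consulted.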


Decision Tree

[Figure: the same decision tree (Refund, MarSt, TaxInc) shown with the test data.]


J48 uses the C4.5 Algorithm

• Decision trees represent a supervised approach to classification.
• Decision trees are a classic way to represent information from a machine learning algorithm, and offer a fast and powerful way to express structures in data.
• A decision tree is a simple structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
• The basic algorithm recursively splits the data until each leaf is pure, meaning that the data has been categorized as close to perfectly as possible. This process ensures maximum accuracy on the training data.
• The latest public domain implementation of Quinlan's model is C4.5. The Weka classifier package has its own version of C4.5 known as J48.
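C4.5 ranks candidate split attributes by gain ratio, that is, information gain divided by the split information. The sketch below shows that criterion only; it is not J48's implementation and leaves out pruning, numeric thresholds, and missing-value handling. The function names are chosen here for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain_ratio(values, labels):
    """Information gain of splitting on `values`, divided by the split information."""
    total = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    # Weighted entropy of the class labels after the split.
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0
```

An attribute that separates the classes perfectly gets a gain ratio of 1 in the two-class, two-value case, while an attribute unrelated to the class gets 0.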


Why decision tree Algorithm?

• Advantages:
  – Inexpensive to construct
  – Easy to interpret for small-sized trees
  – Accuracy is comparable to other classification techniques for many simple data sets
  – There could be more than one tree possible for the same data
• Disadvantages:
  – Underfitting: when the model is too simple, both training and test errors are large


All about Cross Validation

• We perform cross validation when the amount of data is small and we need independent training and test sets from it.
• It is important that each class is represented in its actual proportions in the training and test sets: stratification.
• An important cross validation technique is stratified 10-fold cross validation, where the instance set is divided into 10 folds.
• There are 10 iterations, each taking a different single fold for testing and the rest for training.
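The fold construction described above can be sketched as follows. This is a simplified illustration: it deals instances to folds round-robin class by class and does not shuffle, whereas a real evaluation would normally randomize the order first. The function names are hypothetical.

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Assign each instance index to one of k folds, spreading every class evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        for index in by_class[label]:
            # Continue the round-robin across classes so fold sizes stay balanced.
            folds[position % k].append(index)
            position += 1
    return folds

def train_test_splits(labels, k=10):
    """Yield (train_indices, test_indices) once per fold."""
    folds = stratified_folds(labels, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With 80 instances of one class and 20 of another, each of the 10 folds receives 8 and 2 respectively, so every test fold mirrors the overall class proportions.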


Evaluation

• Metrics for Performance Evaluation
  – How to evaluate the performance of a model?
• Methods for Model Comparison
  – How to compare the relative performance among competing models?


Metrics for Performance Evaluation: Confusion Matrix

• A confusion matrix contains information about actual and predicted classifications done by a classification system. Performance of such systems is commonly evaluated using the data in the matrix. The following table shows the confusion matrix for a two-class classifier:

                      Predicted positive     Predicted negative
    Actual positive   True Positive (TP)     False Negative (FN)
    Actual negative   False Positive (FP)    True Negative (TN)

• We get the confusion matrix after supplying data to a classifier.
• Based on the confusion matrix, we can evaluate the classifier using measures such as precision, recall, F-measure, and accuracy.


Example

• Suppose there is a sample of 27 animals: 8 cats, 6 dogs, and 13 rabbits.
• Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
• We can see from the matrix that the system in question has trouble distinguishing between cats and dogs, but can make the distinction between rabbits and other types of animals pretty well.
• All correct guesses are located on the diagonal of the table, so it is easy to visually inspect the table for errors, as they are represented by any non-zero values outside the diagonal.
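A confusion matrix of this shape can be built from paired actual and predicted labels as sketched below. The per-cell counts are assumed for illustration only: they respect the stated class sizes (8 cats, 6 dogs, 13 rabbits) and the described confusion between cats and dogs, but they are not read from the slide's table.

```python
from collections import Counter

CLASSES = ["cat", "dog", "rabbit"]

def confusion_matrix(actual, predicted, classes=CLASSES):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

# Assumed (actual, predicted) counts, for illustration only.
assumed = {
    ("cat", "cat"): 5, ("cat", "dog"): 3,
    ("dog", "cat"): 2, ("dog", "dog"): 3, ("dog", "rabbit"): 1,
    ("rabbit", "dog"): 2, ("rabbit", "rabbit"): 11,
}
actual = [a for (a, p), n in assumed.items() for _ in range(n)]
predicted = [p for (a, p), n in assumed.items() for _ in range(n)]

matrix = confusion_matrix(actual, predicted)
# Correct guesses sit on the diagonal.
correct = sum(matrix[i][i] for i in range(len(CLASSES)))
```

Summing each row of `matrix` recovers the class sizes, and every off-diagonal entry counts one kind of error.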


Limitation of accuracy

• Consider a 2-class problem
  – Number of Class 0 examples = 9990
  – Number of Class 1 examples = 10
• If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
  – Accuracy is misleading because the model does not detect any class 1 example.
  – Accuracy therefore has some disadvantages as a performance estimate. For example, if there were 95 cats and only 5 dogs in the data set, the classifier could easily be biased into classifying all the samples as cats. The overall accuracy would be 95%, but in practice the classifier would have a 100% recognition rate for the cat class and a 0% recognition rate for the dog class, so it is worth looking at some of the other measures as well. ROC Area, or area under the ROC curve, is also taken as a preferred measure.
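The numbers in this example are easy to reproduce. The short sketch below builds the 9990/10 data set, applies a model that always predicts class 0, and reports accuracy next to per-class recall; the function name is chosen here for illustration.

```python
def accuracy_and_recalls(actual, predicted):
    """Return overall accuracy and the recall (recognition rate) of each class."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    accuracy = correct / len(actual)
    recalls = {}
    for cls in sorted(set(actual)):
        members = [p for a, p in zip(actual, predicted) if a == cls]
        recalls[cls] = sum(p == cls for p in members) / len(members)
    return accuracy, recalls

actual = [0] * 9990 + [1] * 10
predicted = [0] * 10000  # the model predicts class 0 for everything
accuracy, recalls = accuracy_and_recalls(actual, predicted)
```

Accuracy comes out at 0.999 while the recall for class 1 is 0, which is exactly the gap the slide warns about.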


Metrics for Evaluation

• Accuracy: the accuracy (AC) is the proportion of the total number of predictions that were correct, that is, what percentage of people were correctly classified. It is determined using the equation:

      Accuracy = (# True Positives + # True Negatives) / N

  where N = total # predictions, so that

      Accuracy = (TP + TN) / (TP + TN + FP + FN)

• Precision: precision (P) is the proportion of the predicted positive cases that were correct. Of all the people that are classified as demented, what percentage of them is actually demented? It is calculated using the equation:

      Precision = (# True Positives) / (# True Positives + # False Positives)


Evaluation

• F-measure: the harmonic mean of precision and recall. It is calculated using the equation:

      F-measure = 2 * (# True Positives) / (2 * # True Positives + # False Positives + # False Negatives)

• Recall: recall is the ratio of the number of true positives to the sum of true positives and false negatives. It is calculated using the equation:

      Recall = (# True Positives) / (# True Positives + # False Negatives)
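The four measures can be computed directly from the cells of a two-class confusion matrix. The sketch below uses the standard definitions, with the F-measure taken as the harmonic mean of precision and recall; the counts in the usage line are made up for illustration.

```python
def evaluate(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Harmonic mean of precision and recall, written in terms of the counts.
    f_measure = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }

# Hypothetical counts.
scores = evaluate(tp=40, fp=10, fn=20, tn=30)
```

True negatives enter only the accuracy; precision, recall, and the F-measure ignore them, which is why those three stay informative when the negative class dominates.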


Methods for Model Comparison: ROC (Receiver Operating Characteristic)

• Developed in the 1950s for signal detection theory to analyze noisy signals
  – Characterizes the trade-off between positive hits and false alarms
• The ROC curve plots the true positive rate (on the y-axis) against the false positive rate (on the x-axis)


Using ROC for Model Comparison

• M1 is better for small FPR
• M2 is better for large FPR
• Area under the ROC curve (AUC): a rough guide for classifying the accuracy of a diagnostic test is the traditional academic point system:
  – 0.90-1.00 = excellent (A)
  – 0.80-0.90 = good (B)
  – 0.70-0.80 = fair (C)
  – 0.60-0.70 = poor (D)
  – 0.50-0.60 = fail (F)
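How an ROC curve and its area arise from classifier scores can be sketched as follows. The code sweeps a threshold over the scores from high to low, records the false positive rate and true positive rate at each step, and integrates with the trapezoidal rule. It assumes labels coded as 1 (positive) and 0 (negative) with at least one of each; the function names are chosen for illustration.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the threshold over the scores.

    Assumes labels are 1 (positive) or 0 (negative) and both classes occur.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last instance sharing this score (ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

A classifier that ranks every positive above every negative reaches an area of 1.0, while scores that carry no information give 0.5, the bottom of the "fail" band in the guide above.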


Naïve Bayes

• It is a simple probabilistic classifier based on applying Bayes' theorem with independence assumptions. A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature.
• For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.
• An advantage of the naive Bayes classifier is that it requires only a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire covariance matrix.
• It is best suited for attributes that are independent. It is very simple and very fast.
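The independence assumption translates into a product of per-attribute probabilities. The sketch below is a minimal naive Bayes for nominal attributes, chosen because the data in this project is converted to nominal values; it estimates frequencies rather than the means and variances mentioned above, and it adds Laplace (add-one) smoothing, which is an assumption of this example. The fruit rows are made up.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """rows: tuples of nominal attribute values; labels: the class of each row."""
    class_counts = Counter(labels)
    # value_counts[(attribute_index, class)] counts the values seen with that class.
    value_counts = defaultdict(Counter)
    distinct = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            distinct[i].add(value)
    return class_counts, value_counts, distinct

def predict_naive_bayes(model, row):
    """Return the class with the highest posterior score for `row`."""
    class_counts, value_counts, distinct = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        count = class_counts[label]
        score = math.log(count / total)  # log prior
        for i, value in enumerate(row):
            # Each attribute contributes independently; add-one smoothing
            # keeps unseen values from producing a zero probability.
            seen = value_counts[(i, label)][value]
            score += math.log((seen + 1) / (count + len(distinct[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: (colour, shape) -> fruit
rows = [("red", "round"), ("red", "round"), ("yellow", "long"),
        ("yellow", "long"), ("red", "long")]
labels = ["apple", "apple", "banana", "banana", "banana"]
model = train_naive_bayes(rows, labels)
```

Working in log space turns the product of probabilities into a sum, which avoids numeric underflow when there are many attributes.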


Challenges faced

• Initially all data files were processed using JDBC and MySQL, which proved cumbersome whenever a different dataset was used. Hence PHP-based MySQL is used, which is generalized for all datasets.
• Creating the tables for loading the data was an initial hurdle; it was later done with file-handling functions.
• Running all the MySQL commands sequentially was an initial hurdle; this was later improved by using PHP as the front end.
• Initially the J48 tree could not process the data because the values were numeric. This was resolved by discretization (NumericToNominal) of the CDGLOBAL column.


Preprocess Output

Result file for current status (J48)

Current status (Naïve Bayes)

Future status (J48)

Future status (Naïve Bayes)

MMSE (J48)


Thank you