Final report: Transgene flow risk analysis
Raúl Jiménez Rosenberg ([email protected])

Introduction

Mexico is one of the most important centers of origin and diversification of many food plants, such as cucurbits, avocado, and maize, which is grown worldwide and was domesticated over thousands of years in Mexico.

By law, Mexico does not allow commercial cultivation of genetically modified maize, only experimental plots, but there is already evidence of genetic contamination in Mexico [1, 2]. There is also a lot of pressure to open the country to commercial maize crops.

Motivation

Genetic diversity is very important in biodiversity assessment; in many cases it has been shown to be a key component in the ability of species to cope with environmental shifts, diseases, and pests, among many other things. So far, Mexico has decided not to allow commercial crops of genetically modified organisms for those plants of which Mexico is the center of origin and diversification, out of concern about transgene flow to wild and domesticated relatives. Experimental crops should also be carefully analyzed to minimize transgene flow.

One way to decide whether a crop should be allowed is based on the geographical distribution of the related species and the place where the experimental (or commercial) crop would be established. Maps of species distribution are therefore essential.
Short background on Species Distribution Modeling

A potential Species Distribution Model (SDM) relates to the ecological conditions a species requires to maintain populations in a given region [15], so the idea is to find the biotic and abiotic conditions that are suitable for the species. There are many reasons why a species does not occupy all suitable areas (e.g. geographical barriers such as mountains or the sea, the presence of predators, etc.), which is why these distributions are called potential. A branch of ecology studies the actual distribution of species, but it relies on extensive, well-designed species inventories or field work to validate the model, which is feasible only in cases such as endemic species or small regions.

Many approaches have been applied to SDM. If the dataset includes "true absences" (absences due to the species not being present, rather than to insufficient exploration), the data are binary and many statistical tools for this kind of data can be used, such as regressions, GLM, and GAM. Most specimen datasets, however, lack true absences; these are called presence-only data. For this kind of dataset there has been a long discussion about methods. To mention some: profile methods (Bioclim, Domain, Mahalanobis), statistical methods (GLM, GAM) using pseudo-absences, geographic models (Convex Hull, Inverse Distance Weighted), and what the SDM field calls Machine Learning (ML) methods (Maximum Entropy, Boosted Regression Trees, SVM, etc.).

Recently, the most widely used ML algorithm, MaxEnt, has been shown to be mathematically equivalent to other algorithms. The work of William Fithian and Trevor Hastie at Stanford University concludes that MaxEnt, Boosted Regression Trees, and others are motivated by the same underlying model, the inhomogeneous Poisson process ([4], [16], [17]).

The approach explored in this work is based on the BAM diagram (figure 1), which illustrates the relations between biotic (B), abiotic (A), and movement (M) elements and aims at a practical (and 'realistic') potential SDM.

Figure 1. BAM diagram. Solid circles indicate presences and open circles indicate absences of the species. Figure taken from [15].

GA = the abiotically suitable area, GO = the occupied area, and GI = the invadable area, so the potential distribution area is GP = GO ∪ GI [6].

Methods

Since this work is based on the ideas of Soberón at the University of Kansas [12] (the BAM idea), the first step was to build a reliable, biologically sound dataset of biotic and abiotic data for the 59 races of maize present in Mexico, starting with specimen occurrences from two main sources, Conabio (www.conabio.gob.mx) and GBIF (www.gbif.org). There is no dataset for M for almost any species, but there is literature on the biology of the species: preferred precipitation, altitude, temperature, etc. Therefore, to get an idea of M, I read about the biology of the maize races and propose a simple approach based on what I learned [14].


Datasets

Specimen occurrences were collated from Conabio and GBIF, about 36,000 records in total, but a quality assessment of the data reduced the dataset to 10,950 records. The main problems were missing or inconsistent geographical data (latitude and longitude) and the lack of a valid scientific name for the maize race. The time spent cleaning the data was far greater than that spent on any other part of the project.

Although 10,950 records is a sufficient amount of data to work with, four races had fewer than 20 records each. So, instead of working with each race, I followed the classification of maize into seven groups (or complexes) [14] (table 1).

Group   Representative maize races                  Records
1       Palomero, Cacahuacintle, Cónico             3574
2       Apachito, Cristalino de Chihuahua           222
3       Harinoso de Ocho, Elotes Occidentales, Bofo 1655
4       Chapalote, Reventador                       165
5       Nal-tel, Zapalote Chico                     681
6       Tepecintle, Choapaneco, Tuxpeño             3575
7       Olotillo, Dzit Bacal, Olotón                1078
Total                                               10950

Table 1. Occurrence dataset

On this occurrence dataset I overlaid climate data to build the abiotic and biotic part; these climate data are the variables used by the species distribution algorithm. The climatic dataset comes from WorldClim (www.worldclim.org) [18]. The resolution is 30 arc-seconds, which for Mexico is ~1 km per pixel. The extent of the study area is the geographic box ({-120, 34}, {-85, 12}), so each climate raster has 4,200 × 2,640 = 11,088,000 pixels; ocean pixels have a no-data value. All data were processed in R.
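The raster size quoted above follows directly from the extent of the study area and the cell size. A minimal Python sketch of that arithmetic (the report's own processing was done in R; the variable names here are illustrative only):

```python
# Study-area box from the text: longitude -120 to -85, latitude 12 to 34 (degrees).
LON_MIN, LON_MAX = -120, -85
LAT_MIN, LAT_MAX = 12, 34

ARC_SECONDS_PER_CELL = 30
CELLS_PER_DEGREE = 3600 // ARC_SECONDS_PER_CELL  # 120 cells per degree

n_cols = (LON_MAX - LON_MIN) * CELLS_PER_DEGREE  # 35 degrees of longitude
n_rows = (LAT_MAX - LAT_MIN) * CELLS_PER_DEGREE  # 22 degrees of latitude
n_pixels = n_cols * n_rows

print(n_cols, n_rows, n_pixels)  # 4200 2640 11088000
```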

Based on the literature on maize biology, the algorithm was fed only the following 12 variables:

BIO3  = Isothermality
BIO5  = Max Temperature of Warmest Month
BIO6  = Min Temperature of Coldest Month
BIO7  = Temperature Annual Range
BIO8  = Mean Temperature of Wettest Quarter
BIO9  = Mean Temperature of Driest Quarter
BIO10 = Mean Temperature of Warmest Quarter
BIO11 = Mean Temperature of Coldest Quarter
BIO16 = Precipitation of Wettest Quarter
BIO17 = Precipitation of Driest Quarter
BIO18 = Precipitation of Warmest Quarter
BIO19 = Precipitation of Coldest Quarter

Most descriptions of maize biology characterize the races by altitude, so a physiographic provinces map [3] was used for M. There is currently no algorithm that uses the BAM concept directly to build species distribution maps; instead, this map was used to provide the background data (similar to pseudo-absences, but with a very different use). The background points were randomly sampled from the physiographic provinces in which the maize group has at least one occurrence (based on the dataset). M thus acts as a mask for the random selection of background points.

Briefly, the difference between background data and pseudo-absences is this: pseudo-absences are labeled '0' (absence) so that supervised classification algorithms can be applied, whereas background data are used to characterize the environments in the region where the species occurs and serve as constraints for the optimization; MaxEnt in particular uses the method of Lagrange multipliers. It is therefore very important to choose good conditions, i.e. M should rely mostly on biological knowledge about the species.
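The use of M as a sampling mask can be sketched as follows. This is a minimal Python illustration, not the R code used for the report: the toy grid, the province labels and the function names are all hypothetical, and the point-wise distance sampling used to remove spatial sorting bias [7] is left out.

```python
import random

def build_mask(occurrence_provinces):
    """M = set of physiographic provinces holding at least one occurrence."""
    return set(occurrence_provinces)

def sample_background(cell_province, mask, n_points, seed=0):
    """Draw n_points distinct cells at random from the cells that fall inside M."""
    candidates = [cell for cell, prov in cell_province.items() if prov in mask]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(candidates, n_points)

# Hypothetical toy grid: (row, col) cell -> province label
cell_province = {(r, c): ("A" if c < 3 else "B" if c < 6 else "C")
                 for r in range(4) for c in range(9)}

# Hypothetical province of each occurrence record of one maize group
occurrence_provinces = ["A", "A", "C", "A", "C"]

mask = build_mask(occurrence_provinces)
# Same number of background points as occurrences, as done in the report
background = sample_background(cell_province, mask,
                               n_points=len(occurrence_provinces))
```

Province B has no occurrences, so none of its cells can be drawn as background; this is how M restricts the environments that the model is contrasted against.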

 

            Figure  2.  Overview  of  maize  groups  distribution


Data preprocess

Data handling, model fitting, evaluation, and prediction were all done in R (the code will be shared; R is getting much attention within the SDM community). Depending on the settings, it took up to one hour to run the seven maize groups. All data are in geographic coordinates with the WGS84 datum.

For each maize group, M was built from the polygons of the physiographic provinces map [3] that contain occurrence points. Figure 3 shows the M map for group 4.

Figure 3. Red area (provinces) is M for maize group 4

For the background points, the same number of points as in the maize group (table 1) was selected, using the idea of Hijmans at Berkeley to remove the spatial sorting bias [7] (point-wise distance sampling).

Model Fitting

Two algorithms were used, Maximum Entropy (MaxEnt [9]) and Boosted Regression Trees (BRT [5]). This report includes only the MaxEnt results; BRT requires further work, although on visual inspection its predictions look a lot like those of MaxEnt, as expected from [4]. MaxEnt maximizes the information entropy H = -Σ p_i log p_i.

To assess the relative contribution of the variables to the model, a jackknife procedure was performed. Recall that 12 'important' variables were already pre-selected based on the biology of the maize races. The result can be used to answer which variables are most important for the species, but this analysis depends entirely on how the model was built (optimization), so different algorithms can give different answers.
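As a small illustration of the entropy expression given in the Model Fitting paragraph above, the following Python sketch computes H for two distributions over four cells. The distributions are made up for illustration. MaxEnt itself maximizes H subject to constraints derived from the environmental variables, which this sketch does not implement.

```python
import math

def entropy(p):
    """Shannon entropy H = -sum(p_i * log(p_i)), with 0 * log(0) taken as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # no cell preferred
peaked = [0.70, 0.10, 0.10, 0.10]   # one cell strongly preferred

h_uniform = entropy(uniform)  # equals log(4), the maximum for four cells
h_peaked = entropy(peaked)    # strictly smaller
```

With no constraints at all, the maximum-entropy distribution is the uniform one; the constraints are what pull the solution toward the environments where the species was observed.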

Variable   Group 1   Group 2   Group 3   Group 4   Group 5   Group 6   Group 7
BIO10      5.3       2.5       1.9       0.8       1.0       0.4       0.4
BIO11      4.0       4.2       12        32.6      11.6      32.0      0.7
BIO16      6.9       9.3       33.7      13.7      0.4       25.5      20.0
BIO17      1.3       1.0       13.8      1.0       8.45      4.9       1.6
BIO18      7.7       0.6       2.5       11.7      3.43      0.5       7.8
BIO19      8.4       2.0       0.3       1.7       26.3      11.5      3.6
BIO3       3.1       7.0       21        6.0       8.7       3.1       31.4
BIO5       1.4       0.9       0.1       2.3       17.9      1.5       4.1
BIO6       0.6       54.7      5.0       7.1       8.8       4.7       10.0
BIO7       0.9       7.15      5.8       10.5      3.0       6.9       7.3
BIO8       57.4      0.1       1.4       2.4       1.8       4.6       1.9
BIO9       2.3       10.0      1.9       9.6       8.1       3.8       10.0

Table 2. Percent contribution of each variable, by maize group

In addition to the variable importance test, we obtain the variable whose omission decreases the gain of the model the most, i.e. the one that appears to carry the most information that is not present in the other variables.

Group      1       2      3       4      5      6       7
Variable   bio16   bio3   bio16   bio3   bio9   bio16   bio3

Table 3. Variable whose omission decreases the gain the most.

Model evaluation

In the SDM field, the most widely used measure of model fit is Receiver Operating Characteristic (ROC) analysis. Cross-validation was used, following the rule of thumb of building the model with 80% of the data (training) and testing with the remaining 20%; the partitions were built through k-fold data partitioning. Phillips et al. at Princeton University discuss the use of the Area Under the Curve (AUC) in the presence-only context [9]. Recall that we have only occurrence data and no absence data, so for the commission rate the fraction of the total study area predicted present is used instead of the fraction of absences predicted present.

Maize group   AUC-Train   AUC-Test
1             0.837       0.816
2             0.923       0.875
3             0.828       0.802
4             0.880       0.846
5             0.809       0.761
6             0.756       0.734
7             0.867       0.843

Table 4. AUC per maize group for the training and test datasets

An AUC of 0.5 corresponds to a random model (black line in figures 4 and 5), so the expected area is above 0.5. Based on this measure, we get very good models such as group 2 (test AUC: 0.875); the 'worst' case is group 6 (test AUC: 0.734), and even the latter is considered a useful model for national-level decisions (it is very informative). However, I am confident that better data cleaning, model tuning, or ensemble methods can lead to better models.
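In the presence-only setting, the AUC can be read as the probability that a randomly chosen presence point receives a higher prediction than a randomly chosen background point. A minimal Python sketch of that rank-based computation follows; the scores are hypothetical and the function is an illustration, not the evaluation routine used for Table 4.

```python
def auc(presence_scores, background_scores):
    """Probability that a random presence scores higher than a random
    background point (ties count one half): the rank form of the AUC."""
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))

# Hypothetical model outputs for held-out presences and background points
test_presences = [0.81, 0.64, 0.55, 0.30]
background = [0.70, 0.42, 0.25, 0.12, 0.08]

auc_test = auc(test_presences, background)  # 16 of 20 pairs are ranked correctly
```

A model that ranks presences and background points at random would score about 0.5 on this measure, which is the baseline referred to in the text.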

                             

   

 Figure  4.  ROC  for  maize  group  2    

 


Figure 5. ROC for maize group 6

Prediction

Based on what was discussed earlier in this report, we can build our potential species distribution maps; we only need to feed the predictors to the fitted model for each maize group. MaxEnt is a probabilistic algorithm, so its prediction is expressed as probabilities, and the map therefore has values from 0 to 1 (probability of presence).

 Figure  6.    Maize  group  7  prediction  map  

This map can be misleading or difficult to interpret, because one has to decide which values of the map stand for presences and which for absences. A first impression is that the model over-fits, because it covers all the data points (figure 6). We therefore have to use a threshold, and, like many things in the SDM field, threshold selection is a subject of ongoing research [8].

Thresholds are hard to choose and much debated. I use the equal training sensitivity and specificity rule, which works well (but not always). One way to get a sense of its usefulness is to review the density (frequency) of the predicted values for the presences and absences in the training data, for example for maize group 1.

In this case the threshold is 0.377 (figure 7), so we classify the probability-of-presence map into absences (< threshold) and presences (≥ threshold).

However, the densities do not always look like this example (figure 7); see, for instance, the density plot for maize group 5 (figure 8).

Figure  7.  Threshold  graphic  representation      

                             

   

Figure 8. Densities of absence and presence points from the training data for maize group 5.

It is easy to see why another threshold might be needed in this case: the areas under the two curves overlap considerably (figure 8).

Let us classify maize group 7 (figure 6). The threshold in this case is 0.39; the resulting map is more interpretable, but changing the threshold approach can change the presence areas considerably.

 Figure  9.  Maize  group  7  presence  classification  
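The equal training sensitivity and specificity rule used for the classified maps can be sketched in Python as below. The scores are hypothetical, the function names are illustrative, and background points stand in for absences; this is one plausible reading of the rule, not the exact routine behind the thresholds 0.377 and 0.39 reported above.

```python
def equal_sens_spec_threshold(presence_scores, background_scores):
    """Return the candidate threshold at which training sensitivity and
    specificity are closest to each other."""
    def gap(t):
        # Sensitivity: fraction of presences predicted present (score >= t)
        sensitivity = sum(s >= t for s in presence_scores) / len(presence_scores)
        # Specificity: fraction of background points predicted absent (score < t)
        specificity = sum(s < t for s in background_scores) / len(background_scores)
        return abs(sensitivity - specificity)
    candidates = sorted(set(presence_scores) | set(background_scores))
    return min(candidates, key=gap)

def classify(scores, threshold):
    """1 = presence (score >= threshold), 0 = absence (score < threshold)."""
    return [1 if s >= threshold else 0 for s in scores]

# Hypothetical training predictions
presences = [0.60, 0.70, 0.80, 0.90]
background = [0.10, 0.20, 0.30, 0.65]

t = equal_sens_spec_threshold(presences, background)
labels = classify([0.20, 0.65, 0.95], t)
```

When the two score distributions overlap heavily, as for maize group 5, sensitivity and specificity are both low at the point where they meet, which is why a different rule may be preferable in that case.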

   

[Density plots of predicted values for maize groups 1, 5, and 7 (grupo1, grupo5, grupo7). x-axis: predicted value; y-axis: density; bandwidths 0.03832, 0.06835, and 0.06415 respectively; legend: Absences, Presences.]


Conclusion and future work

Applying Machine Learning to Species Distribution Modeling is an active research field, and many methods have been applied to produce geographical distribution maps.

The field is particularly interesting because of the challenge posed by how the data have been generated: mostly presences, sampled in highly biased ways (e.g. near roads), with very different collecting methods and datasets that aggregate decades of field work. Even so, models based on Machine Learning have proven very successful, particularly in mega-diverse and highly heterogeneous countries. These tools can help decision makers use the best available information and analysis to promote biological conservation and sustainable use.

Tweaking these algorithms will indeed provide better results, but eventually they will reach a limit that is difficult to surpass, so ensemble methods can be an interesting research area. Some ensembles have already been built, but many of them are based on an algebraic combination of the results of different models, which may deliver a consensus map but is not strictly an ensemble method.
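The kind of algebraic combination mentioned above can be as simple as a cell-wise average of the prediction maps of several models. A minimal Python sketch with hypothetical values follows; it illustrates a consensus map only and is not taken from any of the cited works.

```python
def consensus_map(*prediction_maps):
    """Cell-wise mean of several prediction maps of equal length."""
    n_models = len(prediction_maps)
    return [sum(cell_values) / n_models for cell_values in zip(*prediction_maps)]

# Hypothetical predictions of two models for the same four cells
maxent_pred = [0.25, 0.50, 0.75, 1.00]
brt_pred = [0.75, 0.50, 0.25, 0.00]

consensus = consensus_map(maxent_pred, brt_pred)
```

Each model is still fitted independently here; the combination happens only on the outputs, which is why the text distinguishes it from a strict ensemble method.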

References

[1] A Piñeyro-Nelson, J Van Heerwaarden, H R Perales, J A Serratos-Hernández, A Rangel, M B Hufford, P Gepts, A Garay-Arroyo, R Rivera-Bustamante, E R Álvarez-Buylla. Transgenes in Mexican maize: molecular evidence and methodological considerations for GMO detection in landrace populations. Mol Ecol. 2009 February; 18(4): 750–761. doi: 10.1111/j.1365-294X.2008.03993.x

[2] Araújo, Miguel and A. Townsend Peterson. 2012. Uses and misuses of bioclimatic envelope modeling. Ecology 93: 1527-1539.

[3] Cervantes-Zamora, Y., Cornejo-Olgín, S. L., Lucero-Márquez, R., Espinoza-Rodríguez, J. M., Miranda-Viquez, E. y Pineda-Velázquez, A. 1990. 'Provincias Fisiográficas de México'. Extraido de Clasificación de Regiones Naturales de México II, IV.10.2. Atlas Nacional de México. Vol. II. Escala 1:4000000. Instituto de Geografía, UNAM. México. http://www.conabio.gob.mx/informacion/gis/?vns=gis_root/region/fisica/rfisio4mgw

[4] Fithian, William and Hastie, Trevor. 2012. Statistical Models for Presence-Only Data: Finite-Sample Equivalence and Addressing Observer Bias. arXiv:1207.6950.

[5] Friedman, J.H. 2001. Greedy function approximation: a gradient boosting machine. The Annals of Statistics 29: 1189-1232. http://www-stat.stanford.edu/~jhf/ftp/trebst.pdf

[6] Gaston, K. J. 2003. The Structure and Dynamics of Geographic Ranges. Oxford University Press, Oxford, UK.

[7] Hijmans, R.J. 2012. Cross-validation of species distribution models: removing spatial sorting bias and calibration with a null-model. Ecology 93: 679-688.

[8] Liu C., P.M. Berry, T.P. Dawson, and R.G. Pearson. 2005. Selecting thresholds of occurrence in the prediction of species distributions. Ecography 28: 385-393.

[9] Phillips, Steven J, Miroslav Dudík and R.E. Schapire. 2004. A maximum entropy approach to species distribution modeling. In Proceedings of the twenty-first international conference on Machine Learning, page 83. ACM, 2004.

[10] Quist D, Chapela I. Transgenic DNA introgressed into traditional maize landraces in Oaxaca, Mexico. Nature. 2001; 414: 541–543.

[11] Phillips, Steven J and Miroslav Dudík. 2008. Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation. Ecography 31: 161-175.

[12] Soberón, J. & Peterson, A. T. 2005. Interpretation of models of fundamental ecological niches and species' distributional areas. Biodiversity Informatics 2: 1-10.

[13] Stockwell, D. R. B. 1999. Genetic algorithms II. Pages 123-144 in A. H. Fielding, editor. Machine learning methods for ecological applications. Kluwer Academic Publishers, Boston.

[14] Takeo Ángel Kato Yamakake, Cristina Mapes Sánchez, Luz María Mera Ovando, José Antonio Serratos Hernández and Robert Arthur Bye Boettles. 2009. Origen y Diversificacion del Maíz. Una Revisión Analítica. Comisión Nacional para el Conocimiento y Uso de la Biodiversidad.

[15] Townsend Peterson, A., Jorge Soberón, Richard G. Pearson, Robert P. Anderson, Enrique Martínez-Meyer, Miguel Nakamura and Miguel Bastos Araújo. 2011. Ecological niches and geographic distributions. Princeton University Press.

[16] Trevor Hastie and Will Fithian. 2013. Inference from presence-only data; the ongoing controversy. Ecography 36: 864-867.

[17] Warton, David I. and Leah C. Shepherd. 2010. Poisson point process models solve the "pseudo-absence problem" for presence-only data in ecology. Annals of Applied Statistics 4(3): 1383-1402.

[18] WorldClim, 2013 (consulted in October 2013). Hijmans, R.J., S.E. Cameron, J.L. Parra, P.G. Jones and A. Jarvis. 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25: 1965-1978.