OCTOBER 11-14, 2016 • BOSTON, MA

Cross Data Center Replication for the Enterprise: Presented by Adam Williams, Iron Mountain


Page 1

OCTOBER 11-14, 2016 • BOSTON, MA

Page 2

Cross Data Center Replication for the Enterprise
Adam Williams
Search Lead, Iron Mountain

Page 3

Objectives
• How Iron Mountain uses cross data center replication (CDCR)
• Our experiences with CDCR
• Disaster recovery options available
• What you need to run CDCR
• How to configure CDCR
• How to keep CDCR running daily
• What's next for Solr CDCR?

Page 4

Iron Mountain Solr
• Record Center Project
  - 140,000 worldwide users
  - Went live in 2013
  - Users maintain and order records stored at Iron Mountain
  - 5.3 billion documents stored in 38 clouds
  - Completely virtual, internally hosted infrastructure (180 VMs)
  - Hosted on Tomcat
  - Early adopter of Solr 4
  - Currently index at 140,000 documents per minute (16 indexers, 2 million per minute capacity)
  - 11 million average updates per day (15-minute update SLA)
  - 140,000 searches on average per day
  - Customers rely on Iron Mountain for essential business processes such as claims processing, financials, and medical records

Page 5

Iron Mountain Solr
• Business Requirement
  - Maintain a disaster recovery environment capable of being fully functional within 4 hours of an event
  - Data accuracy must be within 15 minutes of production
  - Active / Passive replication

Page 6

Short History
Worked with several committers to develop for Iron Mountain:
• Developed under SOLR-6273
• Available in Solr 6
• Iron Mountain tested the functionality in our environments
• Running in production at Iron Mountain for over a year
• Assisted with developing the formal documentation posted on the Solr wiki

Documentation:
http://yonik.com/solr-cross-data-center-replication/
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462

Page 7

Cross Data Center Replication (CDCR)
• Replicate data to multiple data centers (source and target)
• Data is replicated to the target once it is persisted to disk in the source
• Changes are replicated in near real-time based upon settings
• Assumes source and target are identical (or blank) when CDCR is introduced
• Shard leaders send updates to target cloud leaders, which replicate within the cloud

Source: CDCR Apache documentation

Page 8

Iron Mountain Experiences with CDCR
• CDCR has provided us with peace of mind and saved us on several occasions.
• It would take us approximately 2 weeks to recreate our indexes of 5.3 billion records from scratch.
• Confidence that we have a warm backup ready in case of a disaster.
• On two occasions we had corrupt indexes in production. We restored from the backups in our DR data center, resulting in less than one hour of downtime.
• The DR system allows us to run large queries and facets for maintenance/research activities without impacting production load.

Page 9

Disaster Recovery

Apple Data Center, Mesa, Arizona - May 2015: solar panels catch fire

Page 10

Disaster Recovery - Why not?
• Smaller companies are less likely to have disaster recovery capability
• Economy of scale is a challenge
• Achieving a "hot standby" is costly
• The approach must be reliable and rehearsed regularly
• A disaster is not necessarily a cataclysmic event; it could be the result of malicious acts (internal or external) or corrupt data.

Source: 2012 CRN study

Page 11

Disaster Recovery - Backing up Solr
Is the backup going to load?
• Ever have this happen to you?
• If an index file is not fully copied, the index can be corrupt.
• This is a challenge with hot backups and disk mirroring with Solr.

Page 12

Disaster Recovery Options with Solr

• Index to two instances in different data centers - index (I) to source (S) and to target (T) at once.
  Risks: often requires additional custom development; no guarantee that the instances are identical.
• Disk mirroring (source disk mirrored to target).
  Risks: what if an entire index file is not copied? What state is the disk in at the time of an abrupt event?
• Regular backups - works if you have low-volume index updates with a controlled schedule.
  Risks: managing backups, storing them offsite, and retrieving them quickly when needed.
• Cross Data Center Replication.
  Ability to monitor and track replication to see that it is running properly.

Page 13

Advantages of CDCR

• Can be controlled by an administrator / support personnel
  - Does not require storage or infrastructure personnel
  - Can be turned off/on easily compared to turning disk replication on/off
  - Can be monitored for latency and accuracy while the target system is running
• Increase in confidence that the standby system is ready - the target system is fully functional and ready at all times.
• Works across data centers - if the target is unavailable, syncing will queue and restart when it is available.
• Data backup - full index available in a remote data center location.

Page 14

What you need to run CDCR
• Disk: we added additional disk space to queue up tlogs in case the target is unavailable for periods of time.
• Bandwidth: a reliable network between the source and target with capacity for exchanging your data quickly.
• Monitoring: the ability to monitor replication for issues, using a monitoring tool such as Nagios or SolarWinds.
• Testing: must run in a test environment to determine rate of change, settings, and bandwidth.

We did not need to add additional memory or CPU for CDCR.

Page 15

Required Settings on Source

Startup settings for the source system:

-DtargetZk=<targetZkhost1>,<targetZkhost2>,<targetZkhost3>,<targetZkhost4>,<targetZkhost5>
-DsourceCollection=<source collection name>
-DtargetCollection=<target collection name>

- targetZk is the host names of the DR (target) ZooKeepers
- sourceCollection is the name of the cloud in the primary data center (source)
- targetCollection is the name of the cloud in the DR data center (target)

Page 16

Solrconfig Settings

<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
    <lst name="replica">
        <str name="zkHost">${targetZk}</str>
        <str name="source">${sourceCollection}</str>
        <str name="target">${targetCollection}</str>
    </lst>
    <lst name="replicator">
        <str name="threadPoolSize">8</str>
        <str name="schedule">10</str>
        <str name="batchSize">2000</str>
    </lst>
    <lst name="updateLogSynchronizer">
        <str name="schedule">1000</str>
    </lst>
</requestHandler>

<updateRequestProcessorChain name="cdcr-processor-chain">
    <processor class="solr.CdcrUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Solrconfig settings on source:
- replicator
  - threadPoolSize
  - schedule
  - batchSize
- updateLogSynchronizer
  - schedule

Page 17

Solrconfig Settings

Parameter (Required / Default / Iron Mountain / Description):
• threadPoolSize - No / 2 / 8 - The number of threads to use for forwarding updates. One thread per replica is recommended.
• schedule (replicator) - No / 10 / 10 - The delay in milliseconds between polls of the update log(s).
• batchSize - No / 128 / 2000 - The number of updates to send in one batch. The optimal size depends on the size of the documents. Large batches of large documents can increase your memory usage significantly.
• schedule (updateLogSynchronizer) - No / 60000 / 1000 - The delay in milliseconds for synchronizing the update logs.

Source: CDCR Apache documentation

Page 18

Determining Solrconfig Settings
• Approach
  - Determine the average size of your documents
  - Identify the rate of change
  - Determine network capacity
  - Stand up a scaled model in a test environment
  - Index documents at various rates to the source and monitor throughput
  - Use the CDCR API to collect throughput / performance metrics
  - Run for brief periods in production on limited collections before going full-scale
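The rate-of-change and bandwidth questions above come down to simple arithmetic. A minimal sizing sketch; the function name and the 2 KB average document size are illustrative assumptions, not figures from the talk:

```python
def required_bandwidth_mbps(docs_per_min: int, avg_doc_kb: float) -> float:
    """Sustained Mbit/s needed to replicate docs_per_min documents
    of avg_doc_kb kilobytes each (ignores protocol overhead)."""
    bytes_per_sec = docs_per_min / 60 * avg_doc_kb * 1024
    return bytes_per_sec * 8 / 1_000_000

# Using the deck's 140,000 docs/min peak rate with an assumed
# 2 KB average document:
print(round(required_bandwidth_mbps(140_000, 2.0), 1))  # 38.2
```

A result like this is a floor, not a target: the cross-data-center link also has to absorb catch-up traffic after an outage, when queued tlogs drain.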

Page 19

Configuring and Monitoring CDCR
The API included in the CDCR functionality allows you to actively control and monitor replication.

API Entry Points (Control)
collection/cdcr?action=STATUS: Returns the current state of CDCR.
collection/cdcr?action=START: Starts CDCR replication.
collection/cdcr?action=STOP: Stops CDCR replication.
collection/cdcr?action=ENABLEBUFFER: Enables the buffering of updates.
collection/cdcr?action=DISABLEBUFFER: Disables the buffering of updates.

API Entry Points (Monitoring)
core/cdcr?action=QUEUES: Fetches statistics about the queue for each replica and about the update logs.
core/cdcr?action=OPS: Fetches statistics about the replication performance (operations per second) for each replica.
core/cdcr?action=ERRORS: Fetches statistics and other information about replication errors for each replica.
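A small helper for hitting these entry points from a monitoring script, sketched in Python. The host and collection names are hypothetical; the URL shape follows the entry points listed above:

```python
def cdcr_url(base: str, collection: str, action: str) -> str:
    """Build a CDCR API URL for the given action (STATUS, START,
    STOP, QUEUES, OPS, ERRORS, ...), requesting a JSON response."""
    return f"{base}/solr/{collection}/cdcr?action={action}&wt=json"

# Hypothetical host and collection names:
print(cdcr_url("http://solr1:8983", "records", "STATUS"))
# http://solr1:8983/solr/records/cdcr?action=STATUS&wt=json
```

In practice you would fetch this URL with any HTTP client and feed the parsed JSON into the checks on the following slides.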

Page 20

Configuring and Monitoring CDCR
• Consideration: Disk size
• Why? Enough disk space is needed to store tlog files. If the target data center is offline, the system will queue tlogs until the connection to the target is restored.
• Approach: a separate partition for tlogs with enough space to queue tlogs for 24 hours.
• Monitor: alert on the tlogs directory if the disk is greater than 60% full.
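The 60%-full check can be sketched with the standard library; the tlog partition path below is a hypothetical example, and in practice the alert would be wired into a tool like Nagios or SolarWinds:

```python
import shutil

ALERT_THRESHOLD = 0.60  # the 60% rule from the slide above

def tlog_disk_alert(used_bytes: int, total_bytes: int) -> bool:
    """True when the tlog partition is more than 60% full."""
    return used_bytes / total_bytes > ALERT_THRESHOLD

def check_tlog_partition(path: str = "/var/solr/tlogs") -> bool:
    # Hypothetical mount point for the dedicated tlog partition.
    usage = shutil.disk_usage(path)
    return tlog_disk_alert(usage.used, usage.total)

print(tlog_disk_alert(70, 100))  # True
print(tlog_disk_alert(50, 100))  # False
```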

Page 21

Configuring and Monitoring CDCR
• Consideration: Connectivity to target ZooKeepers
• Why? If the source cannot see the target ZooKeepers, then replication is queued.
• Approach: make sure the target ZooKeepers are accessible (-DtargetZk=zk1,zk2,zk3 -DsourceCollection=cloud1).
• Monitor: ping test from the Solr instances to the target ZooKeepers, or a log monitor to capture errors when Solr connects to the target.

Page 22

Configuring and Monitoring CDCR
• Consideration: Is CDCR enabled in the source?
• Why? It's simple, but essential.
• Approach: check CDCR status using the built-in API:
  http://<host>:<port>/solr/<collection>/cdcr?action=status
  Best practice: STOP and START the source after every deployment:
  http://<host>:<port>/solr/<collection>/cdcr?action=stop
  http://<host>:<port>/solr/<collection>/cdcr?action=start
• Monitor: every 5 minutes, and after deployments / maintenance, verify that CDCR is enabled.
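The "is CDCR enabled" probe can be sketched as a check on the parsed STATUS response. The field names used here (a "status" section with a "process" state) are an assumption about the JSON layout to verify against your Solr version:

```python
def cdcr_running(status_response: dict) -> bool:
    """True when the STATUS response reports replication started.

    Assumes a response shaped like
    {"responseHeader": ..., "status": {"process": "started", ...}};
    check the actual layout on your Solr version."""
    status = status_response.get("status", {})
    return status.get("process") == "started"

sample = {"responseHeader": {"status": 0},
          "status": {"process": "started", "buffer": "disabled"}}
print(cdcr_running(sample))  # True
print(cdcr_running({"status": {"process": "stopped"}}))  # False
```

The same shape works for the buffer check on the next slide: alert when `status.get("buffer")` is not "disabled".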

Page 23

Configuring and Monitoring CDCR
• Consideration: Are target buffers disabled in the source?
• Why? Tlog files will grow on the target and not be cleaned up, which leads to large disk usage and slowness as the node scans all of the tlogs.
• Approach: disable target buffers after each deployment:
  http://<host>:<port>/solr/<collection>/cdcr?action=DISABLEBUFFER
• Monitor: every 5 minutes, and after deployments / maintenance, verify that the buffers are disabled.

Page 24

Configuring and Monitoring CDCR
• Consideration: Is CDCR working within agreed SLAs?
• Why? Validate that CDCR is working end-to-end.
• Approach: add a test document to the source cloud. Then check for it on the target and time how long it took to read it from the target. When done, delete it.
• Monitor: after every deployment and several times a day. If the document is not found on the target after 5 minutes, throw an alert.
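The round-trip probe above can be sketched as a polling loop. `fetch_from_target` is a caller-supplied callable (hypothetical) that returns True once the test document is visible on the target; the injectable clock and sleep make the loop testable:

```python
import time

def wait_for_target(fetch_from_target, timeout_s=300, poll_s=5,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll until the test document appears on the target.

    Returns elapsed seconds on success, or None when the deadline
    passes (i.e., throw an alert per the 5-minute rule above)."""
    start = clock()
    while clock() - start < timeout_s:
        if fetch_from_target():
            return clock() - start
        sleep(poll_s)
    return None
```

The elapsed time doubles as the replication-latency measurement against the 15-minute data-accuracy requirement; remember to delete the test document afterwards, as the slide notes.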

Page 25

Configuring and Monitoring CDCR
• Consideration: Latency
• Why? If there is a spike in indexing, there can be some latency.
• Approach: use an API call to determine the queue size (bytes), number of tlog files, and last update operation time:
  http://<host>:<port>/solr/<collection>/cdcr?action=QUEUES
• Monitor: every 15 minutes. Alert if the tlogs grow beyond 100 files and the last update time is older than an hour.
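The latency rule above (more than 100 tlog files AND a last update older than an hour) can be sketched as a single predicate. How you extract the file count and last-update timestamp from the QUEUES response depends on your Solr version, so those inputs are passed in directly here:

```python
import time

ONE_HOUR = 3600  # seconds

def replication_lagging(tlog_file_count: int,
                        last_update_epoch: float,
                        now: float) -> bool:
    """True when the alert condition from the slide holds:
    tlogs beyond 100 files and the last update over an hour old."""
    return tlog_file_count > 100 and (now - last_update_epoch) > ONE_HOUR

now = time.time()
print(replication_lagging(150, now - 2 * ONE_HOUR, now))  # True
print(replication_lagging(150, now - 60, now))            # False
```

Requiring both conditions avoids false alarms during indexing spikes, when the queue grows briefly but updates are still flowing.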

Page 26

Configuring and Monitoring CDCR
• Consideration: Performance
• Why? How many changes (adds/deletes) are being processed per second?
• Approach: use an API call to determine the average performance per second:
  http://<host>:<port>/solr/<collection>/cdcr?action=OPS
• Monitor: once daily, gather performance stats, store them, and review. Stats can help you optimize performance.

Page 27

What's next for data center replication?

• Active / Active - the ability to replicate between the target and source.
• Selective replication - the ability to sync sets of data between data centers: a master source capable of syncing select data with replicas in remote data centers.
• CDCR was developed with our needs in mind. What are the needs of the community?