
InfowareLab Site Failover



Page 1: InfowareLab Site Failover

Hongshanshu (红杉树) (China) Information Technology Co., Ltd. Address: 3/F, Building 11, West Lake Shuyuan Software Park, 176 Tianmushan Road, Hangzhou. Tel: (86)-0571-89939888

InfowareLab Site Failover

January 2007

Infrastructure Team

Gary Chen, Mingfei Hua, Eric Li

Page 2: InfowareLab Site Failover

Agenda

• Objective

• Overview

• Components
  – Replication Monitoring and Master Failover
  – DB data checking and verification
  – File synchronization
  – Site Monitoring at service and components levels
  – Site switchover based on DNS GSLB

• Integration With Applications

Page 3: InfowareLab Site Failover

Objectives

• High availability at service and site level

• Minimize down time caused by: critical components failure, major upgrades, natural disasters

• Intelligent load balancing based on status, load and geographical locations

Page 4: InfowareLab Site Failover

Overview

• Active/Standby model

• Standby site internal design

• Active/Active and site load balancing in the future

Page 5: InfowareLab Site Failover
Page 6: InfowareLab Site Failover

Components

• MySQL replication

• Database Replication Monitoring and Master Failover

• Database Synchronization Checking

• File Replication and Synchronization

• DNS GSLB site IP switchover

Page 7: InfowareLab Site Failover

Cross Site Data Replication and Master Failover

• MySQL replication

• Election of Director Node

• Replication monitor – repmond

• Command line utility for replication monitor management

• LVS VIP management and failover

Page 8: InfowareLab Site Failover

Advantages of MySQL Replication

• Minimum amount of extra workload

• Small overhead – one thread on master and slave

• Asynchronous – does not slow down primary site performance

• Real time

• Network disconnection fault tolerant

• Maintains replication status

Page 9: InfowareLab Site Failover

Site Internal Design

Page 10: InfowareLab Site Failover

Replication Monitor/Master Failover

• Internal Design
  – Site monitor thread
  – Replication monitor on local node
  – Intra-site heart beat sender thread
  – Intra-site heart beat receiver thread
  – Command line health check thread
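
The heartbeat sender and receiver threads imply some form of missed-beat counting. The following is a minimal Python sketch of that idea, assuming a peer is declared down once a configured number of consecutive heartbeat periods pass without a message. The class name `HeartbeatTracker` and its methods are hypothetical and are not taken from repmond.

```python
import time


class HeartbeatTracker:
    """Counts heartbeat periods that elapse without a message from a peer."""

    def __init__(self, period_sec, max_missed, now=None):
        self.period_sec = period_sec      # expected interval between heartbeats
        self.max_missed = max_missed      # missed periods tolerated before "down"
        self.last_seen = time.monotonic() if now is None else now

    def beat(self, now=None):
        """Record a heartbeat received from the peer."""
        self.last_seen = time.monotonic() if now is None else now

    def missed(self, now=None):
        """Number of whole periods elapsed since the last heartbeat."""
        now = time.monotonic() if now is None else now
        return int((now - self.last_seen) // self.period_sec)

    def is_down(self, now=None):
        """True once the missed count reaches the threshold."""
        return self.missed(now) >= self.max_missed
```

Under this assumption, a period of 5 seconds with a threshold of 5 misses would flag a silent peer after 25 seconds. The `now` argument exists only so the logic can be exercised without waiting.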

Page 11: InfowareLab Site Failover
Page 12: InfowareLab Site Failover

List of parameters

• Environment variables
  – REPMON_HOME: conf, bin, log

• Configuration File
  – cluster_srv_list=2:192.168.1.113:/usr/local/mysql/data/:/dbdata/test/master.info.2:/dbdata/test/relay-log.info.2:2036:2037:2038; \
    3:192.168.1.114:/usr/local/mysql/data/:/dbdata/test/master.info.3:/dbdata/test/relay-log.info.3:2036:2037:2038; \
    4:192.168.1.115:/usr/local/mysql/data/:/dbdata/test/master.info.4:/dbdata/test/relay-log.info.4:2036:2037:2038;
  – site_director=192.168.1.112:2036;10.0.1.233:2036;
  – pimary=1
  – site_hb_period=10
  – cluster_hb_period=5
  – site_hb_miscount=5
  – cluster_hb_miscount=5
  – repuser=root
  – reppasswd=password
  – server_status_file=/dbdata/test/.server
  – lvs_vip_list=192.168.1.241:1;192.168.1.242:2;192.168.1.243:3;
  – [email protected]
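
Each cluster_srv_list entry is a semicolon-terminated record of eight colon-separated fields. A minimal Python sketch of splitting such a value follows. The field names are assumptions: the first field is taken to be the server ID, the third to be the MySQL data directory, and the roles of the three trailing ports are not stated in the slides, so they are kept as a plain list. `ClusterServer` and `parse_cluster_srv_list` are hypothetical names, not part of repmond.

```python
from typing import List, NamedTuple


class ClusterServer(NamedTuple):
    srv_id: int            # assumed to match the --srvID argument
    ip: str
    data_dir: str          # assumed: MySQL data directory
    master_info: str       # path to master.info
    relay_log_info: str    # path to relay-log.info
    ports: List[int]       # three ports; their roles are not stated in the slides


def parse_cluster_srv_list(value: str) -> List[ClusterServer]:
    """Split a cluster_srv_list value into one record per server."""
    servers = []
    # Drop line-continuation backslashes, then split entries on ';'.
    for entry in value.replace("\\", " ").split(";"):
        entry = entry.strip()
        if not entry:
            continue
        fields = entry.split(":")
        if len(fields) != 8:
            raise ValueError(f"expected 8 colon-separated fields: {entry!r}")
        servers.append(ClusterServer(
            srv_id=int(fields[0]),
            ip=fields[1],
            data_dir=fields[2],
            master_info=fields[3],
            relay_log_info=fields[4],
            ports=[int(p) for p in fields[5:8]],
        ))
    return servers
```

Rejecting entries with the wrong field count makes a missing master.info or relay-log.info path fail loudly rather than shift the port fields.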

Page 13: InfowareLab Site Failover

Usages -1

• repmond [--conf=<conf file loc> --srvID=n]

• Repmonmain

-start [--conf=conf_file_name] --srvID=myID

-stop [IP:[PORT]]

-check [IP:[port]]

-checkvip [IP:[port]] VIP

Page 14: InfowareLab Site Failover

Usages -2

• The NFS file system should be mounted with the options “rw,soft,intr,noac”.

• The repmond status file and the MySQL replication temp files should be read with the O_DIRECT flag. With this flag, reads bypass the operating system's cache and go directly to the I/O device.

• If the status file “.server” is deleted while DB Check is running, the repmond daemon re-creates it.

• The mechanism for obtaining the MySQL replication binlog file and position from the new master has changed: repmond now reads this information from the MySQL replication temp files master.info and relay-log.info. This change resolves a possible data loss when changing master.
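
As an illustration of reading the binlog file and position from relay-log.info, here is a minimal Python sketch. It assumes the four-line layout used by MySQL releases before 5.6 (relay log file, relay log position, master binlog file, executed master binlog position); later releases prepend a line count, which this sketch does not handle. `RelayLogInfo` and `parse_relay_log_info` are hypothetical names, and this is not repmond's actual code.

```python
from typing import NamedTuple


class RelayLogInfo(NamedTuple):
    relay_log_file: str
    relay_log_pos: int
    master_log_file: str   # binlog file on the master last executed by the slave
    master_log_pos: int    # position reached in that binlog file


def parse_relay_log_info(text: str) -> RelayLogInfo:
    """Parse the four-line relay-log.info layout (assumed pre-5.6 format)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < 4:
        raise ValueError("relay-log.info should contain at least four lines")
    return RelayLogInfo(lines[0], int(lines[1]), lines[2], int(lines[3]))
```

The last two fields are the ones a slave would need in order to resume from the same point on a new master.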

Page 15: InfowareLab Site Failover

Usages - 3

• The MySQL replication temp files master.info and relay-log.info should be stored in an NFS directory, because other servers in the site need to access them.

• Modify the config file: the paths and names of master.info and relay-log.info must be added to the “cluster_srv_list” item. A new “cluster_srv_list” entry looks like this:

1:192.168.1.113:/usr/local/mysql/data/:/dbdata/test/master.info.2:/dbdata/test/relay-log.info.2:2036:2037:2038;

Page 16: InfowareLab Site Failover

Known Bug

• Due to a known limitation of MySQL replication, the contents of master.info and relay-log.info are not kept in sync with the output of the command “show slave status”: the MySQL server flushes replication slave information to these files only at intervals of minutes. Problems may therefore appear when changing master.

Page 17: InfowareLab Site Failover

Introduction of File Synchronization Daemon

• Fsyncd, the File Synchronization Daemon, is a small utility that is easy to set up on UNIX/Linux machines.

• Because it is integrated with the tool ‘rsync’, Fsyncd can concentrate on local file comparison and change tracking.

• Rsync copies only the diffs of files that actually changed, compressed and sent through ssh for security.

Page 18: InfowareLab Site Failover

Characteristics of Rsync

• Diffs – Only the pieces of files that actually changed are transferred, rather than the whole file. This makes updates fast to transfer, especially over slower links like modems. FTP would transfer the entire file, even if only one byte changed.

• Compression – The tiny pieces of diffs are then compressed on the fly, further saving your file transfer time and reducing the load on the network.

• Secure Shell – The stream from rsync is passed through the ssh protocol.
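
The three characteristics above correspond to standard rsync options. A minimal Python sketch of assembling such a command line follows. The slides do not show how fsyncd actually invokes rsync, so the helper `build_rsync_command`, the user name, and the paths are assumptions for illustration only.

```python
import shlex
from typing import List


def build_rsync_command(src: str, dest: str, delete: bool = False) -> List[str]:
    """Build an rsync command line; dest looks like 'user@host:/path'."""
    cmd = [
        "rsync",
        "-a",          # archive mode: recurse, keep links, perms, times, owner, devices
        "-z",          # compress data in transit
        "-e", "ssh",   # use ssh as the transport
    ]
    if delete:
        cmd.append("--delete")   # remove destination files missing from the source
    cmd += [src, dest]
    return cmd


# Hypothetical source directory and destination, for illustration only.
print(shlex.join(build_rsync_command("/data/docs/", "sync@192.168.1.114:/data/docs/")))
```

Sending only changed pieces is rsync's default behaviour for remote transfers, so it needs no extra option here.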

Page 19: InfowareLab Site Failover

How does fsyncd work?

• Set up one machine as the “publisher” by running fsyncd there with a short configuration file.

• All files and directories named in the configuration file are transferred to the “subscriber” machine.

• After the first transfer, the fsyncd daemon continually checks the status of those files and directories. If any of them change, fsyncd synchronizes the changes.

Page 20: InfowareLab Site Failover

The configuration file

• Default location: /etc/fsyncd.conf.

• Follow – follow link or not.

• Dstloc – destination location (user@ip address).

• Incdir – included directories for synchronization.

• Incfile – included files for synchronization.

• You can use a separator sign (‘|’) to separate two paths set in ‘incdir’ and ‘incfile’.

Page 21: InfowareLab Site Failover

File Scan Flow diagram

Page 22: InfowareLab Site Failover

Fsyncd usage - 1

• Flexible configuration file – run fsyncd with the option ‘--configfile=<file>’ to use your own configuration file instead of the default one.

• Choice of port – the option ‘--port=<port>’ selects the port to use.

• Scan interval – use ‘--sleepsec=<seconds>’ to make fsyncd check files and directories every <seconds> seconds (a value smaller than 15 is raised to 15).

Page 23: InfowareLab Site Failover

Fsyncd usage - 2

• Maximum error allowance – to restart the daemon after <num> errors have occurred during synchronization, use the option ‘--maxerr=<num>’.

• Mirroring – with the option ‘--delete’, any file that exists on the subscriber but not on the publisher is deleted. This only affects directories set under ‘incdir’ in the configuration file.

Page 24: InfowareLab Site Failover

Fsyncd usage - 3

• Maximum wait time for one file – a transfer may block for unpredictable reasons. To avoid hanging, the daemon checks the child process status at a fixed interval (waitsec) a fixed number of times (waitperiod). If the child process is still blocked after these checks, the file is skipped and the next file is transferred. Use the options ‘--waitsec=<secs>’ and ‘--waitperiod=<num>’ to set the interval and the number of checks. The default is 40 checks, 12 seconds apart.
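
The check-and-skip behaviour can be sketched in Python as a bounded polling loop around a child process. This is a minimal illustration under the assumption that a blocked transfer is abandoned by killing the child; the slides only say the file is skipped. The function name `run_with_checks` is hypothetical.

```python
import subprocess
import time


def run_with_checks(cmd, wait_sec, wait_period):
    """Run cmd; poll it up to wait_period times, wait_sec apart.

    Returns the exit code, or None if the child was still running after
    the last check and had to be abandoned.
    """
    child = subprocess.Popen(cmd)
    for _ in range(wait_period):
        time.sleep(wait_sec)
        if child.poll() is not None:      # child has finished
            return child.returncode
    child.kill()                          # still blocked: give up on this file
    child.wait()                          # reap the killed child
    return None
```

With the default values quoted above (12 seconds, 40 checks), a single blocked file would hold up the queue for at most 8 minutes.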

Page 25: InfowareLab Site Failover

Some considerations - 1

• Any symbolic links, file ownership, permissions, devices and times will be preserved during file replication and synchronization.

• The in-memory tree holds information about every file, not only about directories. When a file or directory changes, the daemon discovers the change during its next scan and synchronizes it. The new information is then recorded in the tree.
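
One simple way to realise such a scan is to snapshot file metadata and compare snapshots. The Python sketch below assumes that size and modification time are enough to detect a change; the slides do not say which attributes fsyncd records, and the names `snapshot` and `diff` are hypothetical.

```python
import os


def snapshot(root):
    """Map each path under root to (size, mtime_ns), using lstat so that
    symbolic links are recorded rather than followed."""
    tree = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            tree[path] = (st.st_size, st.st_mtime_ns)
    return tree


def diff(old, new):
    """Return (added, removed, changed) path lists between two snapshots."""
    added = sorted(p for p in new if p not in old)
    removed = sorted(p for p in old if p not in new)
    changed = sorted(p for p in new if p in old and new[p] != old[p])
    return added, removed, changed
```

After each synchronization the new snapshot replaces the old one, which matches the "new information will be recorded" step.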

Page 26: InfowareLab Site Failover

Some considerations - 2

• Rsync has problems retransferring big files: the transfer may block.

• The fsyncd daemon checks the file size before a transfer. If the file is larger than the value set in the environment variable ‘FSYNCD_MAXLEN’, it is transferred in a separate mode in which the daemon checks the process status several times; if the process is still blocked after these checks, the file is skipped. The interval and number of checks are set with the options ‘--waitsec=’ and ‘--waitperiod=’.

Page 27: InfowareLab Site Failover

Some considerations - 3

• Some log entries may be useless and cause confusion when we check the logs.

• Fsyncd has 6 log levels, set through the environment variable “LOGLEVEL” with a value from 0 to 5. If it is set to 5, all information is logged; if it is set to 0, only fatal errors are logged.

• Besides LOGLEVEL, there is another environment variable, “LOGPROMPT”. If it is set to 1, the level of each log entry is logged as well.
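
A minimal Python sketch of threshold-based log filtering driven by these two variables follows. The default threshold of 0 when LOGLEVEL is unset and the `[n]` prefix format are assumptions, since the slides specify neither; `make_logger` is a hypothetical name.

```python
import os
import sys


def make_logger(stream=sys.stderr, environ=os.environ):
    """Return log(level, message), configured from LOGLEVEL and LOGPROMPT."""
    threshold = int(environ.get("LOGLEVEL", "0"))    # 0 = fatal only, 5 = everything
    show_level = environ.get("LOGPROMPT") == "1"     # prefix each line with its level

    def log(level, message):
        if level > threshold:
            return False                              # too detailed for this threshold
        prefix = f"[{level}] " if show_level else ""
        stream.write(prefix + message + "\n")
        return True

    return log
```

Reading the environment once at start-up keeps each log call cheap, at the cost of requiring a restart to change the level.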

Page 28: InfowareLab Site Failover

Some considerations - 4

• While the daemon is running, failures may happen. Not all of them are fatal.

• Fsyncd keeps a global counter of the failures it has encountered. Fatal errors make the program exit immediately; each lower-level failure increments the counter. When the counter reaches the ‘--maxerr’ value, the daemon exits.
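
The counter logic is small enough to sketch directly in Python. The class name `ErrorBudget` is hypothetical, and the sketch assumes the counter is never reset while the daemon runs, which the slides do not state.

```python
class ErrorBudget:
    """Tracks non-fatal failures against a --maxerr style limit."""

    def __init__(self, max_errors):
        self.max_errors = max_errors
        self.count = 0

    def record(self):
        """Count one non-fatal failure; return True when the limit is reached."""
        self.count += 1
        return self.count >= self.max_errors
```

A caller would exit (and let a supervisor restart the daemon) as soon as `record()` returns True.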

Page 29: InfowareLab Site Failover

Fsyncmain monitoring tool

• The arguments mentioned above are also available when using the fsyncmain monitoring tool.

• They can be useful with the ‘start’ or ‘check’ option.

• For use from ‘crontab’, add the option “--quiet”; with it, the monitoring tool produces no terminal output.

Page 30: InfowareLab Site Failover

Q & A