How to Fix a Lagging MySQL Replication
Global Transaction ID (GTID) provides major benefits in failover, point-in-time backup recovery, and hierarchical replication, and it's a prerequisite for crash-safe multi-threaded replication. While deploying it, we learned a great deal about rollout and operational use of the feature. We plan to open source many of our server-side fixes via WebScaleSQL, as we believe others in the scale community can learn from this and benefit from the work we've done.
Traditional MySQL replication is based on relative coordinates — each replica keeps track of its position with respect to its current master's binary log files. GTID enhances this setup by assigning a unique identifier to every transaction, and each MySQL server keeps track of which transactions it has already executed.
Auto-positioning makes failover simpler, faster, and less error-prone. It becomes trivial to get replicas in sync after a master failure, without requiring an external tool such as Master High Availability (MHA). Planned master promotions also become easier, as it is no longer necessary to stop all replicas at the same position first. Database administrators need not worry about manually specifying incorrect positions; even in the case of human error, the server is now smart enough to ignore transactions it has already executed.
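The bookkeeping behind auto-positioning can be illustrated with a toy model. The sketch below is not MySQL's implementation: the Server class, the (source UUID, sequence number) tuples, and the missing_on helper are all hypothetical names chosen for illustration. It only shows why a replica that remembers its executed transactions can be pointed at any master without a file name and offset.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy server that remembers every GTID it has executed (illustrative only)."""
    name: str
    # Each GTID is modeled as a (source_uuid, sequence_number) tuple.
    executed: set = field(default_factory=set)

    def apply(self, gtid):
        """Apply a transaction unless it was already executed.

        Returns True if the transaction was applied, False if it was skipped.
        """
        if gtid in self.executed:
            return False  # already executed: safe to ignore
        self.executed.add(gtid)
        return True


def missing_on(replica, master_binlog):
    """Return, in binlog order, the transactions the replica still needs."""
    return [g for g in master_binlog if g not in replica.executed]


master_binlog = [("uuid-a", 1), ("uuid-a", 2), ("uuid-a", 3)]
replica = Server("replica1", executed={("uuid-a", 1), ("uuid-a", 2)})
to_send = missing_on(replica, master_binlog)
```

Here to_send contains only the third transaction, and replaying an earlier one through apply is a no-op, which is the property that makes an incorrectly specified starting point harmless.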
By permitting replicas to be repointed to masters at different levels of the hierarchy, GTID greatly simplifies complex replication topologies, including hierarchical replication (slaves of slaves).
Since a GTID-enabled binlog stream can safely be taken from any member of a replica set, as well as replayed without requiring relative positions, the feature also eases binlog backup and recovery. Additionally, by combining GTID with semi-synchronous replication, we have implemented automation to safely recover crashed masters as replicas. When a master crashes, we can detect this and promote a replica within 30 seconds without losing data.
Later, if the original master can be recovered and our automation detects that its data is consistent, GTID allows us to repoint it to the new master instead of having to kick off a copy operation to replace it. The standard process for enabling GTID on an existing replica set, however, is fundamentally incompatible with the notion of high availability, making it unviable for production use at scale.
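One way to picture the consistency check on a recovered master is as a set comparison: the old master can rejoin as a replica only if it holds no transactions that the new master lacks. The following is a minimal sketch under that assumption; the function name and the tuple representation of GTIDs are hypothetical, and the checks our automation actually performs are not described here.

```python
def can_rejoin_as_replica(old_master_executed, new_master_executed):
    """Decide whether a recovered master can be repointed to the new master.

    Both arguments are sets of (source_uuid, sequence_number) tuples.
    Returns (ok, extra), where extra holds transactions that exist only on
    the old master and would make its data diverge from the new master.
    """
    extra = old_master_executed - new_master_executed
    return (not extra, extra)


new_master = {("uuid-a", 1), ("uuid-a", 2), ("uuid-b", 1)}
clean_old_master = {("uuid-a", 1), ("uuid-a", 2)}
diverged_old_master = {("uuid-a", 1), ("uuid-a", 2), ("uuid-a", 3)}

ok_clean, extra_clean = can_rejoin_as_replica(clean_old_master, new_master)
ok_diverged, extra_diverged = can_rejoin_as_replica(diverged_old_master, new_master)
```

In the first case the old master is merely behind and can catch up through auto-positioning; in the second it committed a transaction that never reached the new master, so it would need to be rebuilt from a copy.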
Our server-side changes permit a high-availability deployment strategy for each replica set, which includes performing a master promotion as normal, repointing the replicas and original master to a new master. With sufficient safeguards and validation logic, it is safe to execute this rollout process on a large number of replica sets at a time. During the peak of the deployment process, we were running our rollout script on up to hundreds of replica sets simultaneously.
Apart from the deployment changes, during initial testing we encountered a number of serious bugs and performance regressions with GTID. Using fully durable settings requires syncing both the binary log and the InnoDB transaction log to disk after each transaction in single-threaded replication mode, which negatively affects slave apply performance. It is important for any feature to be crash-safe to avoid operational overhead at Facebook scale.
So in fb-mysql, we decided to fix this issue by adding a new transaction table in the mysql schema. GTID is a powerful feature that simplifies many replication complexities.
There were several steps we had to take prior to beginning our GTID deployment. One major step involved updating all of our automation to use GTID and auto-positioning. The most substantial change was to our promotion logic, which now had to cover additional permutations for whether GTID was already enabled, or being enabled for the first time.
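To make the "GTID enabled or not" permutations concrete, here is a minimal sketch of how repointing differs between the two modes. The helper name and its parameters are hypothetical and do not come from our automation; the CHANGE MASTER TO options it emits (MASTER_AUTO_POSITION, MASTER_LOG_FILE, MASTER_LOG_POS) are standard MySQL syntax. The sketch does no escaping of its inputs, so it is suitable only for illustration.

```python
def build_change_master(host, port, use_gtid, log_file=None, log_pos=None):
    """Build a CHANGE MASTER TO statement for either replication mode.

    With GTID, the replica negotiates its own starting point. Without it,
    the caller must supply the exact binary log file and offset.
    """
    base = f"CHANGE MASTER TO MASTER_HOST='{host}', MASTER_PORT={int(port)}"
    if use_gtid:
        return base + ", MASTER_AUTO_POSITION=1"
    if log_file is None or log_pos is None:
        raise ValueError("file-based repointing needs log_file and log_pos")
    return base + f", MASTER_LOG_FILE='{log_file}', MASTER_LOG_POS={int(log_pos)}"


gtid_sql = build_change_master("db2.example", 3306, use_gtid=True)
legacy_sql = build_change_master(
    "db2.example", 3306, use_gtid=False, log_file="binlog.000042", log_pos=1337
)
```

The file-based branch is where a mistyped coordinate can silently corrupt a replica, which is why promotion logic has to treat the two cases separately during a mixed rollout.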
Another important prerequisite involves prevention of GTID-incompatible statements. Before beginning the rollout, it is necessary to audit applications and preemptively fix any uses of these query patterns. To make this possible at our scale, we augmented MySQL to add user stat counters for these statements, as well as an option to write full information on them to the MySQL error log.
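For readers without server-side counters, a rough audit can be approximated by scanning a captured statement stream. The sketch below is a heuristic, not the instrumentation described above. It assumes the patterns of interest are the ones MySQL 5.6 documents as restricted under GTID consistency enforcement: CREATE TABLE ... SELECT, and CREATE or DROP TEMPORARY TABLE inside a transaction. The function name and the regular expressions are illustrative, and the transaction tracking ignores autocommit settings and implicit commits.

```python
import re

CREATE_SELECT = re.compile(r"^\s*CREATE\s+TABLE\s+.*\bSELECT\b", re.IGNORECASE | re.DOTALL)
TEMP_TABLE = re.compile(r"^\s*(CREATE|DROP)\s+TEMPORARY\s+TABLE\b", re.IGNORECASE)
TXN_START = re.compile(r"^\s*(BEGIN|START\s+TRANSACTION)\b", re.IGNORECASE)
# ROLLBACK TO <savepoint> does not end the transaction, so it is excluded.
TXN_END = re.compile(r"^\s*(COMMIT|ROLLBACK(?!\s+TO\b))\b", re.IGNORECASE)


def audit(statements):
    """Return (index, reason) pairs for statements that look GTID-incompatible."""
    findings = []
    in_txn = False
    for i, sql in enumerate(statements):
        if TXN_START.match(sql):
            in_txn = True
        elif TXN_END.match(sql):
            in_txn = False
        elif CREATE_SELECT.match(sql):
            findings.append((i, "CREATE TABLE ... SELECT"))
        elif TEMP_TABLE.match(sql) and in_txn:
            findings.append((i, "temporary table inside transaction"))
    return findings


sample = [
    "CREATE TABLE t2 SELECT * FROM t1",
    "BEGIN",
    "CREATE TEMPORARY TABLE tmp (id INT)",
    "COMMIT",
    "CREATE TEMPORARY TABLE tmp2 (id INT)",
    "INSERT INTO t1 VALUES (1)",
]
findings = audit(sample)
```

In this sample, the first and third statements are flagged, while the temporary table created outside a transaction is not.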
This allowed us to easily identify around 20 cases of these query patterns being used, among our thousands of special-case workloads. Finally, we wrote a script to aid in skipping statements, in the rare cases where that is necessary. Skipping a statement by hand is painful in an emergency, especially while a large DBA team is still ramping up on GTID knowledge, so having a helper script is prudent.
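As an illustration of what such a helper might automate, the sketch below generates the statement sequence that the MySQL documentation describes for skipping a transaction under GTID: committing an empty transaction under the GTID to be skipped. This is not our script; the function name is hypothetical, and the GTID in the example is an arbitrary placeholder.

```python
import re

# A single GTID has the form <source UUID>:<transaction number>.
GTID_RE = re.compile(
    r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}:[1-9][0-9]*$"
)


def skip_statements(gtid):
    """Return the SQL statements that skip one transaction on a replica.

    An empty transaction is committed under the given GTID, so the replica
    records it as executed and moves on when replication restarts.
    """
    if not GTID_RE.match(gtid):
        raise ValueError(f"not a single well-formed GTID: {gtid!r}")
    return [
        "STOP SLAVE",
        f"SET GTID_NEXT = '{gtid}'",
        "BEGIN",
        "COMMIT",
        "SET GTID_NEXT = 'AUTOMATIC'",
        "START SLAVE",
    ]


plan = skip_statements("3e11fa47-71ca-11e1-9e33-c80aa9429562:23")
```

Validating the GTID before emitting anything matters in an emergency: a typo here would otherwise mark the wrong transaction as executed.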
Facebook's Global Transaction ID deployment was a cross-functional collaboration between our MySQL engineering, database operations, and data performance teams. Deploying GTID to a Facebook-scale environment required substantial effort, including major improvements to the MySQL server, changes to our automation, and a custom rollout script.
We can happily state that it is now extremely stable in our use, with no new problems encountered in recent months. Despite the effort involved, deploying GTID has proven to be well worth the time commitment. The feature has given us immediate benefits, in addition to being a base for additional automation improvements in the near future.