Reducing MySQL Replica Lag by 90% and Lag Catchup time by 97%
Chapter 1: Setting the context
At HomeLane, we have a single MySQL DBMS which acts as both, a data lake and a data warehouse. As one might expect, we schedule ELTs to pull data from our production DB’s into the data lake and then run aggregation jobs to prepare the data for analytics. This aggregated data is then shared with teams via a replica. This architecture keeps things simple and suffices for our use cases. Recently though, we ran into a problem with this setup where we started seeing increased replica lag times. This didn’t abruptly happen in one day but it built over time and reached a point where user experience started to greatly suffer. The result of this increased wait times, sometimes of up to 6 hours, to see updated data which frustrated a lot of users and so this needed to be addressed.
»