On Monday, we experienced a significant increase in load on one of our core data systems (MySQL). This was due to Postmark processing substantially more mail than usual, as well as an uptick in a specific type of abuse that increases load on our servers. Together, these factors more than doubled overall load on our servers during peak traffic hours, compared to the same time of day over the last several months.
We took two actions to mitigate this issue on Monday, but load remained high on Tuesday. On Tuesday night we applied a further change that we believed would be sufficient to mitigate the increased load.
As U.S. business hours ramped up on Wednesday morning, we reviewed load and determined that the change we had applied would not be sufficient to resolve the issue. We took additional actions that brought load down to typical levels but caused a small number of transactions to hang in MySQL. As a result, some types of event processing were blocked, and after trying several less disruptive methods, we determined that we needed to reboot our MySQL system to clear the locks and resume normal service.
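For readers curious about how hung transactions like these are typically identified: MySQL exposes open transactions through the INFORMATION_SCHEMA.INNODB_TRX view, including when each transaction started. The sketch below is purely illustrative (it is not Postmark's tooling); it shows the shape of the check one might run against that data to flag transactions that have been open suspiciously long.

```python
import time

# Illustrative sketch only, not Postmark's actual tooling. `transactions`
# stands in for rows read from INFORMATION_SCHEMA.INNODB_TRX in MySQL:
# a list of (trx_id, started_at_epoch_seconds) tuples.

def long_running(transactions, now=None, threshold_secs=60):
    """Return the ids of transactions open longer than threshold_secs."""
    if now is None:
        now = time.time()
    return [trx_id
            for trx_id, started in transactions
            if now - started > threshold_secs]
```

A transaction flagged by a check like this is a candidate for being killed individually; when individual kills cannot clear the locks, a restart of the server, as described above, becomes the remaining option.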
Under normal circumstances, when MySQL is not available we accept messages and queue them for processing. While applying these changes, and during the reboot, we placed our APIs into this mode. We regularly verify this behavior in our staging environment.
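The accept-and-queue behavior described above can be sketched roughly as follows. This is a minimal illustration, not Postmark's implementation: the function name, the `db_available` flag, and the in-process queue are all stand-ins for what would be a datastore write with failure detection and a durable queue in a real system.

```python
import queue

# Hypothetical sketch of accept-and-queue: if the datastore write cannot
# proceed, the message is still accepted and held for later processing
# rather than being rejected back to the API caller.
pending = queue.Queue()

def store_message(msg, db_available):
    """Write msg to the datastore, or queue it if the store is down."""
    if db_available:
        return "stored"       # normal path: written directly
    pending.put(msg)          # degraded path: accepted and deferred
    return "queued"
```

The point of the design is that the API caller sees a successful acceptance either way; only the timing of downstream processing changes while the datastore is unavailable.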
The changes we applied on Wednesday have been extremely effective at reducing the overall load on our core MySQL system, and this will translate into a faster and more robust experience using our API.