
DEGRADED

Delays: Outbound and Inbound, API: Timeouts

Jul 3, 3:21 - 4:45 PM
Resolved after an hour
Affected: API, Outbound SMTP, Inbound SMTP, Web App

On Monday, we experienced a significant increase in load on one of our core data systems (MySQL). This was due to substantially higher mail volume than Postmark typically processes, as well as an uptick in a specific type of abuse that can increase load on our servers. Together, these factors caused overall load on our servers to more than double during our peak traffic hours compared to the same time of day over the last several months.

We took two actions to mitigate this issue on Monday, but load remained high on Tuesday. On Tuesday night we applied another change with the belief that this would be sufficient to mitigate the increased load.

Upon review of load as U.S. business hours ramped up on Wednesday morning, we determined that the change we applied would not be sufficient to resolve the issue. We took additional actions that brought load down to typical levels, but caused a small number of transactions to hang in MySQL. As a result, some types of event processing were blocked, and after trying several less disruptive methods, we determined that we needed to reboot our MySQL system to clear the locks and resume normal service.

Under normal circumstances, when MySQL is not available, we accept messages and queue them for processing. While applying these changes, and during the reboot, we placed our APIs into a state where this behavior was in effect. We regularly verify this behavior in our staging environment.
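As an illustration only, the accept-and-queue behavior described above can be sketched in a few lines of Python. This is not Postmark's implementation: `PrimaryStore`, `Intake`, and `StoreUnavailable` are hypothetical names, and an in-memory queue stands in for whatever durable queue a production system would use.

```python
from collections import deque


class StoreUnavailable(Exception):
    """Raised by the hypothetical primary store when it cannot be reached."""


class PrimaryStore:
    """Stand-in for the primary database (assumption, for illustration)."""

    def __init__(self):
        self.available = True
        self.rows = []

    def save(self, message):
        if not self.available:
            raise StoreUnavailable("primary store is down")
        self.rows.append(message)


class Intake:
    """Accepts every message; defers the write when the store is down."""

    def __init__(self, store):
        self.store = store
        self.pending = deque()  # in-memory here; a real queue would be durable

    def accept(self, message):
        """Write immediately if possible, otherwise queue for later."""
        try:
            self.store.save(message)
            return "processed"
        except StoreUnavailable:
            self.pending.append(message)
            return "queued"

    def drain(self):
        """Replay queued messages in arrival order; stop if the store fails again."""
        drained = 0
        while self.pending:
            try:
                self.store.save(self.pending[0])
            except StoreUnavailable:
                break
            self.pending.popleft()
            drained += 1
        return drained
```

In this sketch, calling `accept` while `store.available` is `False` returns `"queued"` instead of raising an error, and calling `drain` once the store is back writes the backlog in the order it arrived. The caller is never refused; the only effect of the outage is delayed processing.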

The changes we applied on Wednesday have been extremely effective at reducing the overall load on our core MySQL system, and this will translate to a faster and more robust experience using our API.

Incident History

UP

Outbound sending, inbound processing, activity, and open events are recovered. Statistics are continuing to recover.

Open events and inbound activity are now caught up as well. Statistics will continue to recover for a bit longer.

Inbound processing queues and outbound delivery are caught up. Opens, inbound activity, and stats are delayed but recovering.

API timeouts are subsiding and activity is starting to catch up. Thanks for your patience here; we know today has been a bit rough.

DEGRADED

We're making a quick update to improve performance that may cause some delays and API timeouts. We're working to mitigate customer impact as quickly as possible.