
MAINTENANCE

Database maintenance - some activity data lost

Sep 4, 2017, 10:36 - 11:31 PM

Resolved after an hour

What was affected

On Sept 4, 2017, Postmark lost some of your activity data. Activity events from approximately 5am Eastern until 11pm Eastern were affected and are unrecoverable. Activity events include:

  • Outbound data such as sent, opened, delivered, bounced, and clicked events
  • Inbound data such as processed and blocked events

What was not affected

This did not affect email sending in any way. All emails that we received were sent to their recipients. If you had webhooks set for opens or bounces, you would have received the appropriate data sent to your webhook.

Inbound processing was delayed at times during the day, but all inbound events were sent to your webhooks. Only the post-processing record that would have been shown in your activity stream is missing.

Daily statistics that we provide were not affected.

What happened

A problem in recovering data from our Elasticsearch cluster was the cause of today's data loss.

At approximately 5am Eastern our Elasticsearch cluster became unavailable. We determined that this was due to unavailable master nodes and restarted the eligible master nodes in our cluster. This made the cluster available again, and it resumed accepting reads and writes. However, because of the new master node reelection, the cluster was operational but extremely fragile: it did not have our usual levels of redundancy.

Our team spent the day trying several different approaches to recovery but we weren’t able to resolve the issue. We then decided to perform a full cluster restart of our data nodes. This did succeed in bringing the cluster into a fully operational state, but as mentioned earlier, the cluster was not in a redundant state. Some nodes held the only copy of some data, so when we forced the restart, the data on those nodes was lost.
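
To illustrate what "the only copy of some data" means in practice, here is a minimal sketch of a pre-restart check. It is not the procedure our team ran; it is a hypothetical helper that takes shard rows shaped like the output of Elasticsearch's `_cat/shards` API (the `index`, `shard`, `prirep`, `state`, and `node` columns) and reports which nodes hold the sole started copy of a shard. The function name `sole_copy_holders`, the `SAMPLE` rows, and the node names are all made up for illustration.

```python
from collections import defaultdict


def sole_copy_holders(shards):
    """Map each node to the (index, shard) pairs for which it holds
    the only STARTED copy.

    `shards` is a list of dicts with the keys "index", "shard",
    "state" and "node" (an assumed, simplified row format).
    """
    # Collect the nodes that hold a started copy of each shard.
    copies = defaultdict(list)
    for row in shards:
        if row["state"] == "STARTED":
            copies[(row["index"], row["shard"])].append(row["node"])

    # A shard with exactly one started copy has no redundancy.
    at_risk = defaultdict(list)
    for key, nodes in sorted(copies.items()):
        if len(nodes) == 1:
            at_risk[nodes[0]].append(key)
    return dict(at_risk)


# Hypothetical sample: shard 0 has two copies, shard 1 has only one.
SAMPLE = [
    {"index": "activity", "shard": 0, "prirep": "p", "state": "STARTED", "node": "data-1"},
    {"index": "activity", "shard": 0, "prirep": "r", "state": "STARTED", "node": "data-2"},
    {"index": "activity", "shard": 1, "prirep": "p", "state": "STARTED", "node": "data-2"},
    {"index": "activity", "shard": 1, "prirep": "r", "state": "UNASSIGNED", "node": None},
]

if __name__ == "__main__":
    for node, keys in sole_copy_holders(SAMPLE).items():
        print(f"{node} holds the only copy of: {keys}")
```

In this sample, `data-2` holds the only started copy of shard 1, so it is the node that a check like this would flag before a forced restart.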

What's next

We've already started investigating alternative data stores so that we aren't so reliant on Elasticsearch. When we have more information on this, we'll be sure to blog and let you know.

We know that you depend on Postmark as an infrastructure product, and we take any kind of data loss very seriously. Even though none of your emails were lost, we know you rely on activity data to troubleshoot and give feedback to your customers.

If you have any questions about the specifics of this incident, please reach out to us at support@postmarkapp.com.


Incident’s History

UP

Database maintenance has been completed. All inbound messages have been processed. However, some activity data has been lost. We will update this status message with full details.

We're about halfway through our scheduled database maintenance. We should have everything wrapped up within the next hour.

MAINTENANCE

We are about to start some database maintenance. Inbound messages will be delayed, and activity data will be queued. Outbound sending will not be affected. We will continue to post updates as we make progress with the maintenance.