Postgres Connectors are failing

Incident Report for Fivetran

Postmortem

On April 15th, 2020, from approximately 19:00 UTC onwards, multiple customers reported seeing the following error in the Fivetran UI for their Postgres connectors:
Error: current transaction is aborted, commands ignored until end of transaction block.

Root cause:

Changes were committed to the Fivetran code base to enable better data validation between source and destination. The feature was meant to be rolled out as a beta to a few customers, but it was instead released to all Postgres connectors.
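
To illustrate how a beta feature can be limited to an explicit set of connectors, here is a minimal sketch. The names `BETA_CONNECTOR_IDS` and `validation_enabled`, and the connector IDs, are hypothetical and do not reflect Fivetran's actual rollout mechanism. The only point is that the gate defaults to off for any connector that is not listed.

```python
# Hypothetical allowlist gate for a beta feature; not Fivetran's actual code.
BETA_CONNECTOR_IDS = frozenset({"connector_a", "connector_b"})  # assumed IDs


def validation_enabled(connector_id, allowlist=BETA_CONNECTOR_IDS):
    """Return True only for connectors explicitly enrolled in the beta."""
    # Default-deny: an unknown or missing ID never gets the new behavior.
    return connector_id is not None and connector_id in allowlist


if __name__ == "__main__":
    for cid in ("connector_a", "connector_z"):
        print(cid, validation_enabled(cid))
```

With a default-deny gate of this kind, a misconfigured or empty allowlist disables the feature everywhere instead of enabling it everywhere.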
There were two issues identified in the code:

We were unable to execute a transaction against databases that were in `recovery mode` at the time.
The change attempted to stop XMIN short, but because the state is set in a separate spot in the connector, we were not able to save the state.
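
The first issue could be guarded against by checking whether the server is in recovery before starting the validation transaction. The sketch below is an assumption about how such a check might look, not the fix that was actually applied. `pg_is_in_recovery()` is a standard Postgres function; `run_validation`, its callback, and `FakeCursor` are hypothetical names, and `cursor` can be any DB-API 2.0 cursor.

```python
# Sketch only: `cursor` is any DB-API 2.0 cursor (for example from psycopg2).
# `run_validation` and its callback are hypothetical names.

def is_in_recovery(cursor):
    """Ask Postgres whether this server is currently in recovery."""
    cursor.execute("SELECT pg_is_in_recovery()")
    row = cursor.fetchone()
    return bool(row[0])


def run_validation(cursor, validate):
    """Run `validate(cursor)` only when the server is not in recovery."""
    if is_in_recovery(cursor):
        # Skip rather than issue statements that may fail on a server in
        # recovery and leave the surrounding transaction aborted.
        return "skipped"
    validate(cursor)
    return "validated"


class FakeCursor:
    """Stand-in cursor so the sketch runs without a database."""

    def __init__(self, in_recovery):
        self._in_recovery = in_recovery
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)

    def fetchone(self):
        return (self._in_recovery,)


if __name__ == "__main__":
    print(run_validation(FakeCursor(True), lambda cur: cur.execute("SELECT 1")))
    print(run_validation(FakeCursor(False), lambda cur: cur.execute("SELECT 1")))
```

Skipping the validation step on such servers keeps the regular sync path unaffected, at the cost of not validating those connectors.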

Steps taken to resolve the issue:

Upon further troubleshooting, we found that a new change, deployed earlier that morning (around 2020-04-15 17:50 UTC), was the cause of the errors.
We reverted the code on April 15th at 23:00 UTC.

Action needed from customers:

As a residual effect, certain Postgres connectors using XMIN for replication could have lost data during the affected period, which in turn could cause data integrity issues.

If you do find a data integrity issue, contact the Fivetran support team via a support ticket so that the issue can be addressed in a timely manner.

Fivetran has a script that sets the state of the Postgres connector back to what it was before the faulty code was deployed. This replays changes that are already reflected in the warehouse, but guarantees that any changes potentially missed during the outage are also captured.
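
The reason a replay is safe can be sketched as follows, assuming changes are written to the warehouse as upserts keyed on the primary key. This is an illustration of the idea, not the script itself: a dict stands in for the warehouse table, `apply_changes` and the sample changes are hypothetical, and deletes are not modeled.

```python
# Sketch only: a dict stands in for the warehouse table, keyed by primary key.
# `apply_changes` and the change tuples are hypothetical.

def apply_changes(table, changes):
    """Upsert each (primary_key, row) change in order; later changes win."""
    for key, row in changes:
        table[key] = row
    return table


if __name__ == "__main__":
    # Changes produced by the source, in order.
    changes = [(1, {"v": "a"}), (2, {"v": "b"}), (1, {"v": "c"}), (3, {"v": "d"})]

    # Warehouse that missed the third change during the outage.
    warehouse = apply_changes({}, changes[:2] + changes[3:])

    # Replaying from the earlier state re-applies every change, including
    # those already present, and converges to the complete result.
    replayed = apply_changes(dict(warehouse), changes)
    print(replayed == apply_changes({}, changes))
```

Because each upsert overwrites the row for its key, applying a change a second time leaves the table unchanged, while a change that was missed is filled in.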

Steps to prevent/mitigate these risks in the future:

Add more comprehensive test cases and run them within the staging environment
Roll out the code in a staging/pre-prod area for customer validation

We appreciate your patience and help with this issue, and sincerely apologize for any inconvenience we caused.

Regards,

The Fivetran Team

Posted Apr 21, 2020 - 17:36 UTC

Resolved

This incident has been resolved.

Posted Apr 16, 2020 - 01:02 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Apr 15, 2020 - 23:20 UTC

Identified

The issue has been identified and our Engineering Team is working on an immediate fix.

Posted Apr 15, 2020 - 21:07 UTC