Some connectors are experiencing delays or intermittent failures. Additionally, some users may intermittently be unable to create or edit connectors.
Incident Report for Fivetran
Postmortem

Root Cause - What happened?

Due to a high concurrent load on our Production Database, connectors experienced sync delays and/or intermittent failures from ~10:00 AM UTC on July 21, 2022 to ~8:00 PM UTC on July 23, 2022.

Resolution - Steps taken to fix the issue, along with a timeline of events (Note: Times are in UTC):

July 21, 2022:

  • ~10:30 AM - An automated alert detects a high load on our Production Database, which is causing some connectors to be delayed or to fail intermittently.
  • ~11:30 AM - We tune DB Parameters to reduce the overall load.
  • ~1:00 PM - We observe a decrease to normal levels in the number of delayed and pending syncs, and begin monitoring for further impact.
  • ~1:20 PM - The load on our Production Database spikes once again.
  • ~2:20 PM - We take measures to reduce the overall load on a high-volume table (hereinafter referred to as Table A).
  • ~4:20 PM - We increase dedicated system resources for our Production DB.
  • ~8:00 PM - We begin migrating additional load from Table A.
  • ~9:00 PM - We get an alert for a buildup of successfully completed jobs in one of our clusters. We later determined that the buildup occurred because of latency in our Production DB, due to the ongoing incident. There were no additional side effects from this issue, beyond the impact of the ongoing incident.
  • ~10:00 PM - We manually remove successfully completed jobs from the affected cluster.
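
The manual cleanup above can be sketched as a batched delete of completed job rows. The schema, table name, and batch size below are invented for illustration (Fivetran has not published its job tables); the point is deleting in small batches so the cleanup itself does not pile long-running transactions onto an already loaded database:

```python
import sqlite3

# Hypothetical schema: a jobs table where finished work lingers as
# "completed" rows and adds load to an already strained database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO jobs (status) VALUES (?)",
    [("completed",), ("running",), ("completed",), ("pending",)],
)

# Delete in small batches so each transaction stays short.
BATCH_SIZE = 1  # tiny here; a real value might be in the thousands
while True:
    cur = conn.execute(
        "DELETE FROM jobs WHERE id IN "
        "(SELECT id FROM jobs WHERE status = 'completed' LIMIT ?)",
        (BATCH_SIZE,),
    )
    conn.commit()
    if cur.rowcount == 0:  # nothing left to purge
        break

remaining = [row[0] for row in conn.execute("SELECT status FROM jobs ORDER BY id")]
print(remaining)  # completed rows are gone; active work is untouched
```

The batching matters more than the exact SQL: one giant `DELETE` would hold locks and generate a large transaction at exactly the moment the database is least able to absorb it.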

July 22, 2022:

  • ~4:30 AM - We deploy measures to further reduce the load on Table A.
  • ~7:30 AM - We deploy measures to further reduce the overall load on our Production Database. At this time, we create a new Table A in a new dedicated database and plan to migrate connectors there.
  • ~5:30 PM - We complete testing the migration in our Staging environment.
  • ~8:30 PM - We identify a buildup of successfully completed jobs in one of the regions (hereinafter referred to as Region A) due to resource constraints.
  • ~9:30 PM - We scale resource allocation in Region A. At this point, our job cleaner process resumes expected functionality.
  • ~10:00 PM - We begin slowly migrating connectors in all regions except Region A, in very small batches. From there onwards, we continue monitoring sync progress.

July 23, 2022:

  • ~1:00 AM - Cleanup of finished jobs in Region A is complete. We migrate a small batch of connectors in Region A, and monitor progress.
  • ~2:00 AM - We identify an increase in the number of finished jobs in another region (hereinafter referred to as Region B). We begin manually deleting these completed jobs.
  • ~2:30 AM - We continue migration for additional connectors, and continue monitoring.
  • ~5:00 AM - Automated cleanup of finished jobs in Region B is complete.
  • ~7:30 AM - We identify and restart a few jobs that had been stuck.
  • ~3:30 PM - After continuing to observe positive results, we decide to migrate the remaining connectors in four batches. At this time, we migrate the first of four batches of connectors to Table A in the new DB.
  • ~5:30 PM - We migrate the second of four batches of connectors to Table A in the new DB.
  • ~6:30 PM - We migrate the third of four batches of connectors to Table A in the new DB.
  • ~8:00 PM - We finish migrating all connectors to Table A in the new DB.
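
The staged cutover described above (very small batches first, then four larger batches, with monitoring between each) can be sketched as a generic batched-migration loop. The function and variable names here are hypothetical, not Fivetran's actual tooling; the sketch only shows the control flow of "migrate a wave, check health, stop if anything degrades":

```python
import math

def migrate_in_batches(connectors, n_batches, migrate, healthy):
    """Migrate connectors in n_batches waves, checking health between waves.

    `migrate` and `healthy` stand in for whatever real cutover and
    monitoring hooks an operator would use.
    """
    batch_size = math.ceil(len(connectors) / n_batches)
    migrated = []
    for i in range(0, len(connectors), batch_size):
        batch = connectors[i:i + batch_size]
        for c in batch:
            migrate(c)
        migrated.extend(batch)
        if not healthy():  # pause the rollout if syncs degrade
            return migrated, False
    return migrated, True

# Toy run: ten connectors, four batches, health check always passing.
done, ok = migrate_in_batches(
    [f"connector_{i}" for i in range(10)],
    n_batches=4,
    migrate=lambda c: None,
    healthy=lambda: True,
)
print(len(done), ok)
```

The design choice this mirrors is that a failed health check halts the rollout with only a small, known set of connectors on the new database, keeping any regression easy to reverse.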

July 24, 2022:

  • ~5:00 PM - After observing expected behavior for all connectors for roughly 18 hours, we mark the incident as resolved.

Prevention - Steps taken to prevent this issue from recurring:

Completed:

  • We have scaled our Production Database to handle a higher volume of concurrent load.
  • We have also taken measures to reduce the volume of concurrent load on our Production Database.
  • We have improved our alerting to detect these issues earlier.
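
One way the improved alerting could work, for example, is to compare a short rolling average of database load against a threshold so that sustained spikes are caught before the database saturates. This is a hedged illustration only; the metric, window, and threshold are invented:

```python
from collections import deque

def make_load_monitor(window, threshold):
    """Return an observe(load) callback that fires once the rolling
    average over the last `window` samples exceeds `threshold`."""
    samples = deque(maxlen=window)  # old samples fall off automatically

    def observe(load):
        samples.append(load)
        avg = sum(samples) / len(samples)
        return avg > threshold  # True means "fire an alert"

    return observe

# Toy run: load ramps up; the averaged signal ignores the first brief
# spike and alerts only once the load stays high.
observe = make_load_monitor(window=3, threshold=0.8)
alerts = [observe(x) for x in (0.5, 0.6, 0.9, 1.0, 1.1)]
print(alerts)
```

Averaging over a window trades a little detection latency for far fewer false alarms from momentary spikes, which is usually the right trade for paging-level alerts.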

Ongoing/Planned:

  • Enhance disaster recovery for similar situations.
  • Continue to identify and reduce the additional load on Production DB.
  • Continue to optimize the handling of stuck syncs.
  • Proactively re-examine other architectural bottlenecks.
  • Build a more robust testing environment to simulate additional edge-case scenarios.
Posted Aug 10, 2022 - 22:55 UTC

Resolved
This incident has been resolved, and all services are operational. We will be following up with a detailed RCA in the coming days.

Thank you for your cooperation.
Posted Jul 24, 2022 - 16:51 UTC
Update
Connectors are continuing to sync, per expected behavior, and all services remain operational. We will continue to proactively monitor through the weekend.

Thank you for your cooperation.
Posted Jul 24, 2022 - 03:58 UTC
Monitoring
Connectors have resumed normal functionality, and syncs are no longer delayed.

We have successfully completed migration of connectors to an updated core service, and no further action is needed at this time. We will continue to proactively monitor all services.

Thank you for your cooperation. We will be following up with a detailed RCA in the coming days.
Posted Jul 23, 2022 - 14:50 UTC
Update
We are continuing to work on the migration to bring down load on our main database. We expect the database load to return to normal by the end of today once the majority of migrations are complete.
Posted Jul 23, 2022 - 02:11 UTC
Update
We are continuing to migrate customers to balance the load on the scheduling service. Sync scheduling is recovering on migrated destinations.

In the meantime, connectors may experience sync delays with irregular scheduling. Incremental syncs will proceed on an intermittent schedule.

We will have another update in ~4 hours if there are no other substantial updates.
Posted Jul 22, 2022 - 21:11 UTC
Update
We are continuing to migrate customers to balance the load on the scheduling service. We have observed a return to regular sync scheduling on connectors after moving to an updated service.

In the meantime, connectors may experience sync delays with irregular scheduling. Incremental syncs will proceed on an intermittent schedule.

We will have another update in ~1 hour.
Posted Jul 22, 2022 - 20:10 UTC
Update
We have deployed an updated core service, and we're migrating customers over to the new service. Syncs are expected to resume normal scheduling after moving to an updated service.

In the meantime, connectors may experience sync delays of up to an hour with irregular scheduling. Incremental syncs will proceed on an intermittent schedule.

Syncs may see intermittent failures due to timeouts retrieving data from Fivetran.

Connectors may have had their syncs canceled in the last hour as we terminate ongoing syncs that are unable to progress.

We will have another update in ~1 hour.
Posted Jul 22, 2022 - 19:10 UTC
Update
We are in the process of deploying an updated core service. Once the deployment is complete, sync scheduling is expected to return to normal intervals.

In the meantime, connectors may experience sync delays of up to an hour with irregular scheduling. Incremental syncs will proceed on an intermittent schedule.

Syncs may see intermittent failures due to timeouts retrieving data from Fivetran.

We will be terminating syncs that have not been making progress over the next hour. You may see an unexpected sync failure as we move that sync over to new resources; the sync will automatically resume after this cancellation.

We will have another update in ~1 hour.
Posted Jul 22, 2022 - 18:02 UTC
Update
We are continuing to test and validate additional service resources. Syncs may see intermittent failures due to timeouts retrieving data from Fivetran and sync scheduling may be intermittently delayed.

We will have another update in ~1 hour.
Posted Jul 22, 2022 - 17:15 UTC
Update
We are continuing to prepare for the migration of the affected core services.

The proposed update is undergoing additional testing to ensure that we're able to transition without causing any further disruption to syncs.

We will have another update in ~1 hour.
Posted Jul 22, 2022 - 16:14 UTC
Update
We are still in the process of preparing the migration of the core services.

We have prepared the fix and are testing it to ensure a smooth transition.

As soon as testing has been completed and the fix is validated, we should be able to provide an ETA.
Posted Jul 22, 2022 - 15:12 UTC
Update
We have identified a potential fix for this issue involving migrating one of our core services to its own production database, and are currently testing it in our staging environment.
Once validated, we will apply this change to our production environment and will progressively roll this out to our customers in order to permanently resolve the resource constriction issues.
Posted Jul 22, 2022 - 13:30 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jul 22, 2022 - 09:54 UTC
Update
We are experiencing issues with our production database and are actively working on resolving them.
This may cause connector failures or delays, and may also impact the dashboard when you try to create, update, or sync connectors, update schemas, or trigger re-syncs.
Posted Jul 22, 2022 - 09:52 UTC
Identified
We have identified an issue where connectors are failing or experiencing scheduling delays. We are currently working to resolve it.
Posted Jul 22, 2022 - 07:29 UTC
This incident affected: Dashboard UI and Replication Servers.