Delay in scheduling syncs for connectors running in AWS US East region
Incident Report for Fivetran
Resolved
Monitoring shows that the number of pending sync jobs is holding below the pre-incident average. We will post a full RCA on the status page in the coming weeks, once the investigation into this incident is complete.
Posted Apr 05, 2024 - 20:59 UTC
Update
The cleanup of the hung sync jobs is complete. We have started running new sync jobs again with the modified job cleaner and confirmed that it is removing old jobs successfully.

We scaled up the sync job creator infrastructure, and pending sync jobs dropped from 15K to fewer than 100. We expect all connectors to be back to syncing within the hour.

The second cluster to handle the increased workload is being deployed and we will monitor to ensure stability.
Posted Apr 05, 2024 - 20:22 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 05, 2024 - 20:13 UTC
Update
Cleanup is still ongoing. The number of jobs to clean up is down to 15K from 200K since the last update.

Our operations team is continuing to work to bring up additional clusters in the AWS East region to handle the queue of syncs once the cleanup process is complete.

We identified and fixed a memory allocation issue within our infrastructure.

The current target ETA for a fix is the end of day today. This will be adjusted once we can measure sync performance with the second cluster.
Posted Apr 05, 2024 - 19:30 UTC
Update
Since the deployment of additional infrastructure, we have observed the cleanup backlog drop by half in the last hour. Once the cleanup job is complete, we will resume sync jobs slowly at first to ensure stability, then ramp up with a second cluster to service the incoming workload.
Posted Apr 05, 2024 - 18:22 UTC
Update
We are in the process of removing stale entries in a component of the Cloud Infrastructure Cluster. This is expected to resolve the issue of delayed scheduling and allow syncs to be scheduled.

We have also created a second Cloud Infrastructure Cluster to redistribute the pending sync scheduling load going forward, which should help prevent this issue from recurring.

We are monitoring the impact of these changes and will provide further updates on the status.
Posted Apr 05, 2024 - 15:32 UTC
Identified
Connectors running in the AWS US East region are still delayed in scheduling due to Cloud Compute Infrastructure issues for this region.
We are continuing to investigate the issue as a priority.
Posted Apr 05, 2024 - 13:54 UTC
Update
A large number of pending sync jobs remain in the queue, which is still causing issues with connector syncs in this region. Our Engineering team is actively working to resolve the issue.
Posted Apr 05, 2024 - 07:28 UTC
Monitoring
We identified the root cause as the sync job creator running into errors. We reduced the number of replicas to lower the load that we believe was causing the errors.

Errors are back down to 0 and we are adding a few additional replicas to get through the pending sync job backlog.
Posted Apr 05, 2024 - 01:09 UTC
Identified
The issue has been identified and we are working to resolve it.
Posted Apr 05, 2024 - 00:47 UTC
This incident affected: Systems (General Services).