Setup tests failing for connectors running in the AWS US East destination region.
Incident Report for Fivetran
Postmortem

What happened:

Internal DNS errors caused DBT jobs and setup tests to fail in the AWS us-east-1 region.

Timeline (all times in UTC, 24-hour format):

2024-03-01 20:19 : Automatic monitoring found many jobs in a pending state, and a scale-up event was triggered. It was short-lived and auto-resolved.

2024-03-01 21:59 : Fivetran Support reported setup test failures in aws-us-east-1 to the SRE team.

2024-03-01 22:12 : Main error identified in the stack traces.

2024-03-01 22:21 : Confirmed the issue was occurring only in AWS us-east-1; began investigating Route 53.

2024-03-01 22:57 : Identified which named resource was failing DNS resolution.

2024-03-01 23:24 : Narrowed the issue down to one specific CoreDNS pod with the resolution problem. Restarted the deployment in an attempt to resolve the errors.

2024-03-01 23:48 : The restart did not help, and we decided to open a support case with AWS.

2024-03-02 00:16 : Confirmed DBT was also affected by the DNS issue.

2024-03-02 01:34 : AWS joined the internal troubleshooting call.

2024-03-02 02:24 : Continued the troubleshooting call with AWS. kube-proxy was restarted, and CoreDNS replicas were increased from 2 to 6.

2024-03-02 02:53 : Issue was resolved and placed under monitoring.
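The 02:24 mitigation can be sketched with kubectl; the exact resource names here assume a standard EKS setup in which kube-proxy runs as a DaemonSet and CoreDNS as a Deployment in the kube-system namespace:

```shell
# Restart kube-proxy (rolls all pods of the DaemonSet)
kubectl -n kube-system rollout restart daemonset/kube-proxy

# Scale CoreDNS from 2 to 6 replicas
kubectl -n kube-system scale deployment/coredns --replicas=6

# Verify the new replicas come up ready
kubectl -n kube-system get deployment/coredns
```

Scaling CoreDNS out spreads DNS resolution across more pods (and, with anti-affinity, more nodes), reducing the blast radius of any single unhealthy pod.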

Root Cause(s):

We noticed a spike in DNS lookup failures. DNS was not entirely down, but was failing some percentage of the time, causing test-runner and transformation jobs to fail. The issue occurred only in us-east-1, which is our largest AWS region.

The failures were isolated to nodes where CoreDNS pods run alongside either test-runner or transformation workloads (dbt-runner, scheduler, etc.). Further investigation showed that these nodes have only 2 cores, and as few as 4 concurrent DBT jobs can saturate the CPU, starving the co-located CoreDNS pods.

DBT jobs were not requesting enough CPU to trigger an autoscale event. The CPU and memory limits were tweaked, and the maximum node count was increased. A new dashboard was created to monitor the limits and ensure the increase worked as expected.
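The mechanism matters here: the Kubernetes scheduler and cluster autoscaler act on a pod's CPU *requests*, not its actual usage, so an under-requesting pod can saturate a node without ever triggering a scale-up. A minimal sketch of the kind of resource stanza involved (the names and values are illustrative assumptions, not our actual manifests):

```yaml
# Illustrative pod spec fragment for a dbt-runner container.
# Image name and values are assumptions for the example.
apiVersion: v1
kind: Pod
metadata:
  name: dbt-runner-example
spec:
  containers:
    - name: dbt-runner
      image: dbt-runner:example
      resources:
        requests:
          cpu: "500m"     # what the scheduler and autoscaler count
          memory: "1Gi"
        limits:
          cpu: "1"        # hard cap enforced by the kubelet
          memory: "2Gi"
```

Raising `requests` closer to real usage makes node pressure visible to the autoscaler before CPU saturation starts affecting neighbors like CoreDNS.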

Tweaking the CPU requests then surfaced a second issue: a lack of resources that was widespread across at least three regions. This time, the shortage did trigger autoscaling, and more nodes were added.

As a result, to mitigate the resource shortage, a new nodepool with a bigger instance type (m5.4xlarge) was created, and the DBT CPU request limits were adjusted.
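On EKS, a nodepool like this can be declared with eksctl. This is a hedged sketch: the cluster name, nodegroup name, and size bounds are assumptions for illustration; only the instance type comes from the incident.

```yaml
# Illustrative eksctl ClusterConfig fragment; names and sizes are assumed.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-cluster
  region: us-east-1
managedNodeGroups:
  - name: dbt-workers
    instanceType: m5.4xlarge   # 16 vCPU / 64 GiB, vs. the 2-core nodes that saturated
    minSize: 2
    maxSize: 10
    labels:
      workload: dbt
```

With 16 vCPUs per node, a handful of DBT jobs can no longer saturate a node's CPU the way they could on the 2-core instances.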

To further improve stability, a dedicated nodepool for CoreDNS pods was created. Isolating them means we can be certain that if something goes wrong on the worker nodes, the CoreDNS nodes will be unaffected.
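The usual way to pin CoreDNS to a dedicated nodepool is a nodeSelector plus a toleration matching a taint on those nodes. A sketch of the relevant fragment of the kube-system/coredns Deployment pod template (the label and taint keys are assumptions):

```yaml
# Illustrative fragment of the CoreDNS Deployment pod template.
# Label/taint keys ("workload", "dedicated") are assumed names.
spec:
  template:
    spec:
      nodeSelector:
        workload: coredns        # label applied to the dedicated nodepool
      tolerations:
        - key: dedicated
          value: coredns
          effect: NoSchedule     # matches the taint keeping other pods off
```

The taint keeps DBT and test-runner pods off the CoreDNS nodes, while the nodeSelector keeps CoreDNS off the busy worker nodes, so neither side can starve the other.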

This also makes it possible to test how the autoscaler reacts to high load from dbt-runner.

Posted Apr 03, 2024 - 19:42 UTC

Resolved
We ran some maintenance on our Kubernetes environment and saw workloads return to normal at about 6 PM Pacific time. A root cause analysis will be posted to this status page in the coming weeks.
Posted Mar 02, 2024 - 02:49 UTC
Update
AWS Support is engaged to aid the investigation into the intermittent DNS failures causing the platform problems.
Posted Mar 02, 2024 - 00:57 UTC
Identified
The root issue at this time appears to be DNS-related. We are continuing to troubleshoot to find a fix.
Posted Mar 01, 2024 - 23:50 UTC
Investigating
This issue looks to be specific to connectors whose destination is running in AWS US East 1 region.
Posted Mar 01, 2024 - 22:12 UTC
Update
We are continuing to work on a fix for this issue.
Posted Mar 01, 2024 - 22:11 UTC
Identified
The issue has been identified and we are working to resolve it.
Posted Mar 01, 2024 - 22:08 UTC
This incident affected: Systems (General Services).