Internal DNS errors caused DBT jobs and setup tests to fail in the AWS us-east-1 region.
2024-03-01 20:19
: Automatic monitoring found many jobs in a pending state and a scale-up event was triggered. It was short-lived and auto-resolved.
2024-03-01 21:59
: Setup test failures on aws-us-east-1 were reported by Fivetran support to the SRE team.
2024-03-01 22:12
: Main error identified in stack traces.
2024-03-01 22:21
: Confirmed that the issue was only happening in AWS us-east-1 and started investigating Route53.
2024-03-01 22:57
: Identified which named resource was failing DNS resolution.
2024-03-01 23:24
: Identified the root cause as one specific CoreDNS pod with the resolution issue. Restarted the deployment in an attempt to resolve the errors.
2024-03-01 23:48
: The restart didn't help, and we decided to open a support case with AWS.
2024-03-02 00:16
: Confirmed that DBT was also affected by the DNS issue.
2024-03-02 01:34
: AWS joined the internal troubleshooting call.
2024-03-02 02:24
: Continued the AWS troubleshooting call. kube-proxy was restarted and CoreDNS replicas were increased from 2 to 6.
2024-03-02 02:53
: Issue was resolved and placed under monitoring.
We noticed a spike in DNS lookup failures. DNS was not entirely down, but was failing some percentage of the time, causing test-runner and transformations to fail. The issue only happened in the us-east-1 region, which is our largest AWS region.
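For illustration only (this is not our actual monitoring), a small probe along these lines can quantify that kind of intermittent failure by sampling in-cluster lookups and reporting the failure rate; the hostname, sample count, and interval are assumptions:

    # Minimal sketch of a DNS health probe: repeatedly resolve an in-cluster
    # name and report the failure rate. Hostname, sample count, and sleep
    # interval are illustrative assumptions.
    import socket
    import time

    HOSTNAME = "kubernetes.default.svc.cluster.local"  # assumed in-cluster name
    SAMPLES = 100

    failures = 0
    for _ in range(SAMPLES):
        try:
            socket.getaddrinfo(HOSTNAME, 443)
        except socket.gaierror:
            failures += 1
        time.sleep(0.1)

    print(f"DNS failure rate: {failures / SAMPLES:.1%}")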
It was isolated to nodes where CoreDNS pods were running alongside either test-runner or transformation workloads (dbt-runner, scheduler, etc.). Further investigation showed that these nodes only have 2 cores, and 4 concurrent DBT jobs are enough to saturate the CPU, which starves the co-located CoreDNS pods.
DBT jobs were not requesting enough CPU to trigger an autoscale, so the CPU and memory requests and limits were tweaked and the maximum node count was increased. A new dashboard was created to monitor the change and ensure the limit increase worked as expected.
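As a rough sketch of that change (the namespace, deployment name, container name, and resource values below are assumptions, not the actual production settings), the requests and limits can be patched with the Kubernetes Python client:

    # Sketch of raising CPU/memory requests and limits on the DBT runner
    # deployment. Namespace, names, and values are illustrative assumptions.
    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() in-cluster
    apps = client.AppsV1Api()

    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "dbt-runner",  # assumed container name
                            "resources": {
                                "requests": {"cpu": "1", "memory": "2Gi"},
                                "limits": {"cpu": "2", "memory": "4Gi"},
                            },
                        }
                    ]
                }
            }
        }
    }

    apps.patch_namespaced_deployment(
        name="dbt-runner", namespace="transformations", body=patch
    )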
Tweaking the CPU limits created another issue: a lack of resources that was widespread across at least 3 regions this time. This time around, it did trigger autoscaling (more nodes got added).
As a result, to mitigate the lack of resources, a new nodepool with a bigger instance type (m5.4xlarge) was created and the dbt CPU requests and limits were adjusted.
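A sketch of what creating such a nodegroup looks like through the EKS API via boto3; the cluster name, nodegroup name, subnets, role ARN, and scaling sizes are placeholders, not our real values:

    # Sketch of creating the larger nodegroup via the EKS API. All identifiers
    # and sizes below are placeholders for illustration.
    import boto3

    eks = boto3.client("eks", region_name="us-east-1")

    eks.create_nodegroup(
        clusterName="prod-us-east-1",             # placeholder cluster name
        nodegroupName="dbt-workers-m5-4xlarge",   # placeholder nodegroup name
        instanceTypes=["m5.4xlarge"],
        scalingConfig={"minSize": 3, "maxSize": 20, "desiredSize": 3},
        subnets=["subnet-aaaa", "subnet-bbbb"],   # placeholders
        nodeRole="arn:aws:iam::123456789012:role/eks-node-role",  # placeholder
        labels={"workload": "dbt"},
    )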
To make it even more stable, a dedicated nodepool for CoreDNS pods was created. Having them isolated means we can be certain that if something goes wrong on the worker nodes, the CoreDNS nodes will be unaffected.
This also adds the ability to test how the autoscaler reacts to high load from dbt-runner.
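A sketch of what pinning CoreDNS to a dedicated nodepool could look like, again with the Kubernetes Python client; the node label and taint key/value are assumptions, and the dedicated nodes would be created with the matching label and taint:

    # Sketch of pinning CoreDNS to a dedicated nodepool by adding a
    # nodeSelector and matching toleration. Label and taint values are
    # illustrative assumptions.
    from kubernetes import client, config

    config.load_kube_config()
    apps = client.AppsV1Api()

    patch = {
        "spec": {
            "template": {
                "spec": {
                    "nodeSelector": {"nodepool": "coredns"},  # assumed node label
                    "tolerations": [
                        {
                            "key": "dedicated",       # assumed taint key
                            "operator": "Equal",
                            "value": "coredns",
                            "effect": "NoSchedule",
                        }
                    ],
                }
            }
        }
    }

    apps.patch_namespaced_deployment(
        name="coredns", namespace="kube-system", body=patch
    )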