Internal DNS errors caused DBT jobs and setup tests to fail in AWS East region.
2024-03-19 21:14:00
Ticket created by a customer through our ticketing system.
2024-03-19 23:28:00
Issue escalated to Fivetran site reliability team.
2024-03-19 23:39:00
The spike in error(-2, 'Name or service not known') started around 1:40 PM PST.
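For reference, this is the error Python's socket module raises when a DNS lookup gets no answer; a minimal reproduction (the .invalid hostname is just an example that is guaranteed not to resolve):

```python
import socket

try:
    socket.getaddrinfo("does-not-exist.invalid", 443)
except socket.gaierror as exc:
    # On Linux/glibc this typically prints: (-2, 'Name or service not known')
    print(exc.args)
```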
2024-03-20 01:14:00
One AWS host running a CoreDNS pod went above 100% CPU utilization, and all of the pods on that node stopped working properly. The CoreDNS pod was in the Running state but was unable to service any requests, which most likely explains the roughly 1-in-6 error rate we were seeing across our services. Once the faulty node was identified, we cordoned and drained it. Draining the node caused the faulty CoreDNS pod to be rescheduled, and the error count dropped to zero.
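For illustration, cordoning marks the node unschedulable before the drain evicts its pods; a minimal sketch using the official Kubernetes Python client (in practice this step was most likely done with kubectl cordon / kubectl drain):

```python
from kubernetes import client, config

def cordon_node(node_name: str) -> None:
    """Mark a node unschedulable so no new pods land on it."""
    config.load_kube_config()  # or load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()
    v1.patch_node(node_name, {"spec": {"unschedulable": True}})

# After cordoning, draining the node (e.g. `kubectl drain <node> --ignore-daemonsets`)
# evicts its pods, which is what forced the faulty CoreDNS pod to be rescheduled.
```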
We noticed a spike in DNS lookup failures. DNS was not entirely down, but lookups were failing a percentage of the time, causing test-runner and transformations to fail. The issue occurred only in us-east-1, our largest AWS region.
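A quick way to confirm partial (rather than total) DNS degradation is to sample lookups and compute the failure fraction; a rough sketch (hostname, sample size, and pause are arbitrary):

```python
import socket
import time

def dns_failure_rate(hostname: str, attempts: int = 100, pause: float = 0.1) -> float:
    """Return the fraction of DNS lookups for `hostname` that fail."""
    failures = 0
    for _ in range(attempts):
        try:
            socket.gethostbyname(hostname)
        except socket.gaierror:
            failures += 1
        time.sleep(pause)
    return failures / attempts

# e.g. dns_failure_rate("some-internal-service.example.com")  # hypothetical name
```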
The issue was isolated to nodes where CoreDNS pods run alongside either test-runner or transformation workloads (dbt-runner, scheduler, etc.). Further investigation showed that these nodes have only 2 cores, and 4 concurrent DBT jobs are enough to saturate the CPU, starving the co-located CoreDNS pod.
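One way to see the saturation is to compare a node's allocatable CPU with what its pods actually request; a diagnostic sketch using the Kubernetes Python client (not the exact tooling used during the incident):

```python
from kubernetes import client, config

def node_cpu_summary(node_name: str) -> None:
    """Print a node's allocatable CPU next to the CPU requests of its pods."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    node = v1.read_node(node_name)
    print("allocatable cpu:", node.status.allocatable["cpu"])  # "2" on the affected nodes
    pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={node_name}")
    for pod in pods.items:
        for c in pod.spec.containers:
            reqs = (c.resources.requests or {}) if c.resources else {}
            print(f"{pod.metadata.namespace}/{pod.metadata.name}/{c.name}: "
                  f"cpu request {reqs.get('cpu', 'none')}")
```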
DBT jobs were not requesting enough CPU to trigger autoscaling. The DBT jobs' CPU and memory requests and limits were adjusted, and the maximum node count was increased. A new dashboard was created to monitor the change and confirm the limit increase worked as expected.
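The change amounts to raising the resource stanza on the DBT workload so the scheduler packs fewer jobs per node and the autoscaler sees real demand; an illustrative patch via the Kubernetes Python client (the workload name, namespace, and numbers are placeholders, and the real workload may be a Job template rather than a Deployment):

```python
from kubernetes import client, config

def bump_dbt_runner_resources() -> None:
    """Raise CPU/memory requests and limits on the DBT workload (illustrative values)."""
    config.load_kube_config()
    apps = client.AppsV1Api()
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "dbt-runner",
                            "resources": {
                                "requests": {"cpu": "1", "memory": "2Gi"},
                                "limits": {"cpu": "2", "memory": "4Gi"},
                            },
                        }
                    ]
                }
            }
        }
    }
    apps.patch_namespaced_deployment("dbt-runner", "default", patch)
```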
Tweaking the CPU limits created a second issue: a resource shortage that was widespread across at least three regions. This time, however, the shortage did trigger autoscaling (more nodes were added).
To mitigate the resource shortage, a new node pool with a larger instance type (m5.4xlarge) was created, and the DBT CPU requests and limits were adjusted.
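Creating the larger node pool is an EKS node group operation; a hedged sketch with boto3 (cluster name, subnets, role ARN, and scaling numbers are placeholders; in practice this kind of change typically goes through Terraform or eksctl):

```python
import boto3

def create_bigger_nodegroup() -> None:
    """Create an EKS node group backed by larger m5.4xlarge instances."""
    eks = boto3.client("eks", region_name="us-east-1")
    eks.create_nodegroup(
        clusterName="prod-us-east-1",                # placeholder
        nodegroupName="dbt-workers-m5-4xlarge",      # placeholder
        instanceTypes=["m5.4xlarge"],
        scalingConfig={"minSize": 2, "maxSize": 20, "desiredSize": 2},  # placeholder sizes
        subnets=["subnet-aaaa1111", "subnet-bbbb2222"],                 # placeholders
        nodeRole="arn:aws:iam::123456789012:role/eks-node-role",        # placeholder
    )
```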
To make the setup even more stable, a dedicated node pool for CoreDNS pods was created. Having them isolated means we can be certain that if something goes wrong on the worker nodes, the CoreDNS nodes will be unaffected.
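Isolating CoreDNS comes down to labeling and tainting the dedicated pool and pinning the CoreDNS Deployment to it; an illustrative patch (the pool label and taint key "pool=coredns" are made up for this sketch):

```python
from kubernetes import client, config

def pin_coredns_to_dedicated_pool() -> None:
    """Pin CoreDNS to its own node pool so noisy DBT workloads cannot starve it."""
    config.load_kube_config()
    apps = client.AppsV1Api()
    patch = {
        "spec": {
            "template": {
                "spec": {
                    # Schedule only onto the dedicated pool...
                    "nodeSelector": {"pool": "coredns"},
                    # ...and tolerate the NoSchedule taint that keeps other workloads off it.
                    "tolerations": [
                        {"key": "pool", "operator": "Equal",
                         "value": "coredns", "effect": "NoSchedule"}
                    ],
                }
            }
        }
    }
    apps.patch_namespaced_deployment("coredns", "kube-system", patch)
```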
This also adds the ability to test how the autoscaler reacts to high load from dbt-runner.
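One way to exercise that is to launch a burst of CPU-heavy jobs that mimic dbt-runner load and watch the autoscaler respond; a rough sketch (image, namespace, job count, and CPU request are placeholders):

```python
from kubernetes import client, config

def launch_dbt_load_test(copies: int = 20) -> None:
    """Launch short-lived CPU-heavy Jobs to observe how the cluster autoscaler reacts."""
    config.load_kube_config()
    batch = client.BatchV1Api()
    for i in range(copies):
        job = client.V1Job(
            metadata=client.V1ObjectMeta(name=f"dbt-load-test-{i}"),
            spec=client.V1JobSpec(
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="Never",
                        containers=[
                            client.V1Container(
                                name="burn",
                                image="busybox",
                                # Burn one CPU for five minutes, roughly like a heavy DBT run.
                                command=["sh", "-c", "yes > /dev/null & sleep 300"],
                                resources=client.V1ResourceRequirements(requests={"cpu": "1"}),
                            )
                        ],
                    )
                )
            ),
        )
        batch.create_namespaced_job(namespace="default", body=job)
```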