Internal DNS errors caused DBT jobs and setup tests to fail in AWS East region.
2024-03-19 21:14:00
Ticket created by a customer through our ticketing system.
2024-03-19 23:28:00
Issue escalated to Fivetran site reliability team.
2024-03-19 23:39:00
The spike in error(-2, 'Name or service not known') started around 1:40 PM PST.
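For reference, this is the error Python's socket module raises when a DNS lookup gets no answer; a minimal reproduction (the .invalid hostname is just an example that is guaranteed not to resolve):

```python
import socket

try:
    socket.getaddrinfo("does-not-exist.invalid", 443)
except socket.gaierror as exc:
    # On Linux/glibc this typically prints: (-2, 'Name or service not known')
    print(exc.args)
```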
2024-03-20 01:14:00
One AWS host running a CoreDNS pod went above 100% CPU utilization, and all of the pods on that node stopped working properly. The CoreDNS pod was in the Running state but was unable to service any requests, which most likely explains the roughly 1-in-6 error rate we were seeing across our services. Once the faulty node was identified, we cordoned and drained it. Draining the node caused the faulty CoreDNS pod to be rescheduled, and the error count dropped to zero.
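For illustration, cordoning marks the node unschedulable before the drain evicts its pods; a minimal sketch using the official Kubernetes Python client (in practice this step was most likely done with kubectl cordon / kubectl drain):

```python
from kubernetes import client, config

def cordon_node(node_name: str) -> None:
    """Mark a node unschedulable so no new pods land on it."""
    config.load_kube_config()  # or load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()
    v1.patch_node(node_name, {"spec": {"unschedulable": True}})

# After cordoning, draining the node (e.g. `kubectl drain <node> --ignore-daemonsets`)
# evicts its pods, which is what forced the faulty CoreDNS pod to be rescheduled.
```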
We noticed a spike in DNS lookup failures. DNS was not entirely down, but lookups were failing a percentage of the time, causing test-runner and transformations to fail. The issue occurred only in us-east-1, our largest AWS region.
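A quick way to confirm partial (rather than total) DNS degradation is to sample lookups and compute the failure fraction; a rough sketch (hostname, sample size, and pause are arbitrary):

```python
import socket
import time

def dns_failure_rate(hostname: str, attempts: int = 100, pause: float = 0.1) -> float:
    """Return the fraction of DNS lookups for `hostname` that fail."""
    failures = 0
    for _ in range(attempts):
        try:
            socket.gethostbyname(hostname)
        except socket.gaierror:
            failures += 1
        time.sleep(pause)
    return failures / attempts

# e.g. dns_failure_rate("some-internal-service.example.com")  # hypothetical name
```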
The issue was isolated to nodes where CoreDNS pods run alongside either test-runner or transformation workloads (dbt-runner, scheduler, etc.). Further investigation showed that these nodes have only 2 cores, and 4 concurrent DBT jobs are enough to saturate the CPU, starving the co-located CoreDNS pod.
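One way to see the saturation is to compare a node's allocatable CPU with what its pods actually request; a diagnostic sketch using the Kubernetes Python client (not the exact tooling used during the incident):

```python
from kubernetes import client, config

def node_cpu_summary(node_name: str) -> None:
    """Print a node's allocatable CPU next to the CPU requests of its pods."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    node = v1.read_node(node_name)
    print("allocatable cpu:", node.status.allocatable["cpu"])  # "2" on the affected nodes
    pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={node_name}")
    for pod in pods.items:
        for c in pod.spec.containers:
            reqs = (c.resources.requests or {}) if c.resources else {}
            print(f"{pod.metadata.namespace}/{pod.metadata.name}/{c.name}: "
                  f"cpu request {reqs.get('cpu', 'none')}")
```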
DBT jobs were not requesting enough CPU to trigger autoscaling. The DBT jobs' CPU and memory requests and limits were adjusted, and the maximum node count was increased. A new dashboard was created to monitor the change and confirm the limit increase worked as expected.
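The change amounts to raising the resource stanza on the DBT workload so the scheduler packs fewer jobs per node and the autoscaler sees real demand; an illustrative patch via the Kubernetes Python client (the workload name, namespace, and numbers are placeholders, and the real workload may be a Job template rather than a Deployment):

```python
from kubernetes import client, config

def bump_dbt_runner_resources() -> None:
    """Raise CPU/memory requests and limits on the DBT workload (illustrative values)."""
    config.load_kube_config()
    apps = client.AppsV1Api()
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "dbt-runner",
                            "resources": {
                                "requests": {"cpu": "1", "memory": "2Gi"},
                                "limits": {"cpu": "2", "memory": "4Gi"},
                            },
                        }
                    ]
                }
            }
        }
    }
    apps.patch_namespaced_deployment("dbt-runner", "default", patch)
```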
Tweaking the CPU limits created a second issue: a resource shortage that was widespread across at least three regions. This time, however, the shortage did trigger autoscaling (more nodes were added).
To mitigate the resource shortage, a new node pool with a larger instance type (m5.4xlarge) was created, and the DBT CPU requests and limits were adjusted.
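Creating the larger node pool is an EKS node group operation; a hedged sketch with boto3 (cluster name, subnets, role ARN, and scaling numbers are placeholders; in practice this kind of change typically goes through Terraform or eksctl):

```python
import boto3

def create_bigger_nodegroup() -> None:
    """Create an EKS node group backed by larger m5.4xlarge instances."""
    eks = boto3.client("eks", region_name="us-east-1")
    eks.create_nodegroup(
        clusterName="prod-us-east-1",                # placeholder
        nodegroupName="dbt-workers-m5-4xlarge",      # placeholder
        instanceTypes=["m5.4xlarge"],
        scalingConfig={"minSize": 2, "maxSize": 20, "desiredSize": 2},  # placeholder sizes
        subnets=["subnet-aaaa1111", "subnet-bbbb2222"],                 # placeholders
        nodeRole="arn:aws:iam::123456789012:role/eks-node-role",        # placeholder
    )
```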
To make the setup even more stable, a dedicated node pool for CoreDNS pods was created. Having them isolated means we can be certain that if something goes wrong on the worker nodes, the CoreDNS nodes will be unaffected.
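Isolating CoreDNS comes down to labeling and tainting the dedicated pool and pinning the CoreDNS Deployment to it; an illustrative patch (the pool label and taint key "pool=coredns" are made up for this sketch):

```python
from kubernetes import client, config

def pin_coredns_to_dedicated_pool() -> None:
    """Pin CoreDNS to its own node pool so noisy DBT workloads cannot starve it."""
    config.load_kube_config()
    apps = client.AppsV1Api()
    patch = {
        "spec": {
            "template": {
                "spec": {
                    # Schedule only onto the dedicated pool...
                    "nodeSelector": {"pool": "coredns"},
                    # ...and tolerate the NoSchedule taint that keeps other workloads off it.
                    "tolerations": [
                        {"key": "pool", "operator": "Equal",
                         "value": "coredns", "effect": "NoSchedule"}
                    ],
                }
            }
        }
    }
    apps.patch_namespaced_deployment("coredns", "kube-system", patch)
```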
This also adds the ability to test how the autoscaler reacts to high load from dbt-runner.
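One way to exercise that is to launch a burst of CPU-heavy jobs that mimic dbt-runner load and watch the autoscaler respond; a rough sketch (image, namespace, job count, and CPU request are placeholders):

```python
from kubernetes import client, config

def launch_dbt_load_test(copies: int = 20) -> None:
    """Launch short-lived CPU-heavy Jobs to observe how the cluster autoscaler reacts."""
    config.load_kube_config()
    batch = client.BatchV1Api()
    for i in range(copies):
        job = client.V1Job(
            metadata=client.V1ObjectMeta(name=f"dbt-load-test-{i}"),
            spec=client.V1JobSpec(
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="Never",
                        containers=[
                            client.V1Container(
                                name="burn",
                                image="busybox",
                                # Burn one CPU for five minutes, roughly like a heavy DBT run.
                                command=["sh", "-c", "yes > /dev/null & sleep 300"],
                                resources=client.V1ResourceRequirements(requests={"cpu": "1"}),
                            )
                        ],
                    )
                )
            ),
        )
        batch.create_namespaced_job(namespace="default", body=job)
```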