2024-03-16 14:08:20
Ticket created by a customer through our ticketing system.
2024-03-16 15:34:00
dbt-scheduler in the supporting-services-us-east-1-more-gar-1 cluster was found in CrashLoopBackOff state.
2024-03-16 15:49:00
Initial assumption was that a faulty node was the culprit, as more non-functioning pods from different services were observed on the same node.
2024-03-16 15:52:00
Previous assumption discarded: there were non-functioning pods on other nodes as well, and dbt-scheduler was automatically rescheduled to a different node but was still failing.
2024-03-16 16:00:00
dbt-scheduler and coredns were restarted.
2024-03-16 16:02:00
dbt-scheduler is now in Running state, but it is not ready. A large number of transformations are starting at once; some are Running, but many are in Pending state.
2024-03-16 16:20:00
Transformations are running, but some instability is observed with dbt-scheduler: it is still not in Ready state, it is failing its liveness probe, and it is periodically being restarted by the kubelet.
2024-03-16 16:22:00
Due to the huge number of transformations running simultaneously, there is now CPU pressure on the nodes: https://onenr.io/0vjAVDK9nQP
2024-03-16 16:42:00
The CPU spike has passed and the system is slowly returning to normal.
2024-03-16 17:17:00
Incident marked as resolved.
Note: The following root cause is the same as that of the March 1 incident, as the two incidents were related:
We noticed a spike in DNS lookup failures. DNS was not entirely down, but was failing some percentage of the time, causing test-runner and transformations to fail. The issue only occurred in the us-east-1 region, which is our largest AWS region.
The failure was isolated to nodes where coredns pods were running alongside either test-runner or transformation pods (dbt-runner, dbt-scheduler, etc.). Further investigation showed that these nodes only have 2 cores, and 4 DBT jobs are enough to saturate the CPU, starving the co-located coredns pods.
DBT jobs were not requesting enough CPU to trigger an autoscale. CPU and memory requests and limits were tweaked, and the maximum node count was increased. A new dashboard was created to monitor the change and ensure the limit increase worked as expected.
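Purely as an illustration (a minimal sketch with assumed names, namespaces, and resource values, not the actual change that was shipped), the request/limit tweak amounts to patching the dbt workload so the Kubernetes scheduler reserves realistic CPU and the cluster autoscaler kicks in before a 2-core node saturates:

```python
# Minimal sketch using the official Kubernetes Python client.
# All names, namespaces, and resource values below are assumptions for
# illustration; the real change would normally live in deployment manifests.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [{
                    "name": "dbt-scheduler",  # assumed container name
                    "resources": {
                        # Requests are what the scheduler and cluster autoscaler
                        # act on; without realistic requests, 4 DBT jobs can land
                        # on one 2-core node and never trigger a scale-up.
                        "requests": {"cpu": "1", "memory": "2Gi"},
                        "limits": {"cpu": "2", "memory": "4Gi"},
                    },
                }]
            }
        }
    }
}

apps.patch_namespaced_deployment(
    name="dbt-scheduler",             # assumed Deployment name
    namespace="supporting-services",  # assumed namespace
    body=patch,
)
```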
Tweaking the CPU limits created another issue: a lack of resources, this time widespread across at least 3 regions. Unlike before, this did trigger autoscaling (more nodes were added).
As a result, to mitigate the resource shortage, a new nodepool with a bigger instance type (m5.4xlarge) was created and the dbt CPU request and limit values were adjusted.
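For illustration only (the subnet and IAM role identifiers are placeholders, the nodegroup name and sizes are assumptions, and the real nodepool may well have been created through Terraform or eksctl rather than a direct API call), creating an m5.4xlarge nodegroup on EKS looks roughly like this:

```python
# Sketch only: create an EKS managed nodegroup with a larger instance type.
# Cluster name is taken from the timeline; all other identifiers are placeholders.
import boto3

eks = boto3.client("eks", region_name="us-east-1")

eks.create_nodegroup(
    clusterName="supporting-services-us-east-1-more-gar-1",        # from the timeline
    nodegroupName="dbt-workers-m5-4xlarge",                        # assumed name
    instanceTypes=["m5.4xlarge"],
    scalingConfig={"minSize": 2, "maxSize": 10, "desiredSize": 2}, # assumed sizes
    subnets=["subnet-aaaaaaaa", "subnet-bbbbbbbb"],                # placeholder subnet IDs
    nodeRole="arn:aws:iam::123456789012:role/eks-node-role",       # placeholder IAM role
    labels={"nodepool": "dbt-workers"},                            # assumed node label
)
```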
To make things even more stable, a dedicated nodepool for coredns pods was created. Having them isolated means we can be certain that if something goes wrong on the worker nodes, the coredns nodes will be unaffected. This also gives us the ability to test how the autoscaler reacts to high load from dbt-runner.
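As a sketch of how that isolation is typically wired up (the node label and taint key/values below are assumptions, not our actual configuration), pinning coredns onto its dedicated nodepool comes down to a nodeSelector plus a matching toleration on the coredns Deployment:

```python
# Sketch using the Kubernetes Python client: schedule coredns only onto the
# dedicated nodepool by combining a nodeSelector with a toleration for the
# nodepool's taint. Label and taint names are assumptions.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "nodeSelector": {"nodepool": "coredns"},  # assumed node label
                "tolerations": [{
                    "key": "dedicated",                   # assumed taint on the coredns nodepool
                    "operator": "Equal",
                    "value": "coredns",
                    "effect": "NoSchedule",
                }],
            }
        }
    }
}

# On EKS, coredns runs as a Deployment named "coredns" in kube-system.
apps.patch_namespaced_deployment(name="coredns", namespace="kube-system", body=patch)
```

With the worker nodepool tainted for dbt workloads and coredns confined to its own nodes, CPU saturation from a burst of transformations can no longer starve DNS, which was the failure mode behind both this incident and the March 1 incident.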