2024-03-16 14:08:20
Ticket created by a customer through our ticketing system.
2024-03-16 15:34:00
dbt-scheduler in the supporting-services-us-east-1-more-gar-1 cluster was found in CrashLoopBackOff state.
2024-03-16 15:49:00
Initial assumption was that a faulty node was the culprit, as more non-functioning pods from different services were observed on the same node.
2024-03-16 15:52:00
Previous assumption discarded: there were non-functioning pods on other nodes as well, and dbt-scheduler was automatically rescheduled to a different node but was still failing.
2024-03-16 16:00:00
dbt-scheduler and coredns were restarted.
2024-03-16 16:02:00
dbt-scheduler is now in Running state, but it is not ready. A large number of transformations are starting at once; some are Running, but many are in Pending state.
2024-03-16 16:20:00
Transformations are running, but some instability is observed with dbt-scheduler: it is still not in Ready state, it is failing its liveness probe, and it is periodically being restarted by the kubelet.
2024-03-16 16:22:00
Due to the huge number of transformations running simultaneously, there is now CPU pressure on the nodes: https://onenr.io/0vjAVDK9nQP
2024-03-16 16:42:00
The CPU spike has passed and the system is slowly returning to normal.
2024-03-16 17:17:00
Incident marked as resolved.
Note: The following root cause is the same as that of the March 1 incident, as the two incidents were related:
We noticed a spike in DNS lookup failures. DNS was not entirely down, but was failing some percentage of the time, causing test-runner and transformations to fail. The issue only occurred in the us-east-1 region, which is our largest AWS region.
The failure was isolated to nodes where coredns pods were running alongside either test-runner or transformation pods (dbt-runner, dbt-scheduler, etc.). Further investigation showed that these nodes only have 2 cores, and 4 DBT jobs are enough to saturate the CPU, starving the co-located coredns pods.
DBT jobs were not requesting enough CPU to trigger an autoscale. CPU and memory requests and limits were tweaked, and the maximum node count was increased. A new dashboard was created to monitor the change and ensure the limit increase worked as expected.
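Purely as an illustration (a minimal sketch with assumed names, namespaces, and resource values, not the actual change that was shipped), the request/limit tweak amounts to patching the dbt workload so the Kubernetes scheduler reserves realistic CPU and the cluster autoscaler kicks in before a 2-core node saturates:

```python
# Minimal sketch using the official Kubernetes Python client.
# All names, namespaces, and resource values below are assumptions for
# illustration; the real change would normally live in deployment manifests.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [{
                    "name": "dbt-scheduler",  # assumed container name
                    "resources": {
                        # Requests are what the scheduler and cluster autoscaler
                        # act on; without realistic requests, 4 DBT jobs can land
                        # on one 2-core node and never trigger a scale-up.
                        "requests": {"cpu": "1", "memory": "2Gi"},
                        "limits": {"cpu": "2", "memory": "4Gi"},
                    },
                }]
            }
        }
    }
}

apps.patch_namespaced_deployment(
    name="dbt-scheduler",             # assumed Deployment name
    namespace="supporting-services",  # assumed namespace
    body=patch,
)
```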
Tweaking the CPU limits created another issue: a lack of resources, this time widespread across at least 3 regions. Unlike before, this did trigger autoscaling (more nodes were added).
As a result, to mitigate the resource shortage, a new nodepool with a bigger instance type (m5.4xlarge) was created and the dbt CPU request and limit values were adjusted.
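For illustration only (the subnet and IAM role identifiers are placeholders, the nodegroup name and sizes are assumptions, and the real nodepool may well have been created through Terraform or eksctl rather than a direct API call), creating an m5.4xlarge nodegroup on EKS looks roughly like this:

```python
# Sketch only: create an EKS managed nodegroup with a larger instance type.
# Cluster name is taken from the timeline; all other identifiers are placeholders.
import boto3

eks = boto3.client("eks", region_name="us-east-1")

eks.create_nodegroup(
    clusterName="supporting-services-us-east-1-more-gar-1",        # from the timeline
    nodegroupName="dbt-workers-m5-4xlarge",                        # assumed name
    instanceTypes=["m5.4xlarge"],
    scalingConfig={"minSize": 2, "maxSize": 10, "desiredSize": 2}, # assumed sizes
    subnets=["subnet-aaaaaaaa", "subnet-bbbbbbbb"],                # placeholder subnet IDs
    nodeRole="arn:aws:iam::123456789012:role/eks-node-role",       # placeholder IAM role
    labels={"nodepool": "dbt-workers"},                            # assumed node label
)
```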
To make things even more stable, a dedicated nodepool for coredns pods was created. Having them isolated means we can be certain that if something goes wrong on the worker nodes, the coredns nodes will be unaffected. This also gives us the ability to test how the autoscaler reacts to high load from dbt-runner.
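As a sketch of how that isolation is typically wired up (the node label and taint key/values below are assumptions, not our actual configuration), pinning coredns onto its dedicated nodepool comes down to a nodeSelector plus a matching toleration on the coredns Deployment:

```python
# Sketch using the Kubernetes Python client: schedule coredns only onto the
# dedicated nodepool by combining a nodeSelector with a toleration for the
# nodepool's taint. Label and taint names are assumptions.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "nodeSelector": {"nodepool": "coredns"},  # assumed node label
                "tolerations": [{
                    "key": "dedicated",                   # assumed taint on the coredns nodepool
                    "operator": "Equal",
                    "value": "coredns",
                    "effect": "NoSchedule",
                }],
            }
        }
    }
}

# On EKS, coredns runs as a Deployment named "coredns" in kube-system.
apps.patch_namespaced_deployment(name="coredns", namespace="kube-system", body=patch)
```

With the worker nodepool tainted for dbt workloads and coredns confined to its own nodes, CPU saturation from a burst of transformations can no longer starve DNS, which was the failure mode behind both this incident and the March 1 incident.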