This is a postmortem for the incident that caused sync failures for connector types that use webhooks as part of their sync strategy. It does not apply to our Event Collector Service, which ingests customer Webhooks and Snowplow events.
Summary of the issue:
- Our Internal Webhooks Service was down for 12 hours (2020-05-01 18:00 UTC to 2020-05-02 06:00 UTC), which resulted in data loss during this window.
- The root cause was increased latency, which delayed heartbeats past their timeout. The resulting timeout errors caused the managing processes to repeatedly restart some of the nodes.
- The following connector types were affected: AppsFlyer (Push API), GitHub, Help Scout, Intercom, Iterable, Jira.
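The failure mode described above can be illustrated with a minimal sketch. It is not the actual Fivetran supervisor code; the function name `count_restarts`, the sample heartbeat gaps, and the timeout values are all hypothetical. It assumes a managing process that restarts a node whenever the gap between two heartbeats exceeds a fixed timeout.

```python
from typing import Iterable


def count_restarts(heartbeat_gaps_s: Iterable[float], timeout_s: float) -> int:
    """Count how many restarts a supervisor would trigger.

    Hypothetical model: each value in heartbeat_gaps_s is the number of
    seconds between two consecutive heartbeats from a node. Any gap longer
    than timeout_s is treated as a missed heartbeat and causes one restart.
    """
    restarts = 0
    for gap in heartbeat_gaps_s:
        if gap > timeout_s:
            restarts += 1
    return restarts


# Illustrative gaps (seconds) under increased latency; values are made up.
gaps = [5.0, 12.0, 11.0, 6.0]

restarts_tight = count_restarts(gaps, timeout_s=10.0)    # tight timeout
restarts_relaxed = count_restarts(gaps, timeout_s=15.0)  # increased timeout
```

With the tight timeout, the two delayed heartbeats each trigger a restart even though the node is healthy. With the larger timeout, the same gaps are tolerated.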
Steps taken to resolve the issue:
- Increased the timeout for receiving heartbeats from Fivetran platform components, so that heartbeats delayed by latency still arrive in time.
- Added actions to resolve data integrity issues for the AppsFlyer, GitHub, Intercom, and Jira connectors using the source APIs.
- Our internal webhooks service is back to normal and syncing successfully.
Steps to prevent/mitigate these risks in the future:
- Lower the thresholds of the existing alerts that detect these failures.
- Build automation to preemptively mitigate the timeout scenarios.
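As a rough illustration of the first prevention step, the sketch below shows how lowering an alert threshold makes the same failure count trigger an alert sooner. The function `should_alert` and the numbers are hypothetical and only assume an alert based on the ratio of failed requests; the real alerting rules are not described in this postmortem.

```python
def should_alert(failed: int, total: int, threshold: float) -> bool:
    """Return True when the failure ratio meets or exceeds the threshold.

    Hypothetical rule: threshold is a fraction between 0 and 1. With no
    traffic at all (total == 0) the rule does not fire.
    """
    if total == 0:
        return False
    return failed / total >= threshold


# Same traffic, two thresholds (illustrative values only).
fires_at_old_threshold = should_alert(failed=3, total=100, threshold=0.05)
fires_at_new_threshold = should_alert(failed=3, total=100, threshold=0.02)
```

Here 3 failures out of 100 requests stay below the 5% threshold but exceed the lowered 2% threshold, so the alert fires earlier in the incident.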
We appreciate your patience and help with this issue, and sincerely apologize for any inconvenience we caused.
The Fivetran Team