Fivetran Internal Webhook Service Outage
Incident Report for Fivetran
Postmortem

This postmortem covers the incident that caused sync failures for connector types that use webhooks as part of their sync strategy. It does not apply to our Event Collector Service, which ingests customer Webhooks and Snowplow events.

Summary of the issue:

  • The internal webhooks service was down for 12 hours (2020-05-01 18:00 UTC to 2020-05-02 06:00 UTC), which resulted in data loss during that window.
  • The root cause was determined to be delayed heartbeats: increased latency pushed heartbeats past their timeout, producing timeout errors. As a result, the managing processes repeatedly restarted some of the nodes.
  • The following connector types were affected: AppsFlyer (Push API), GitHub, Help Scout, Intercom, Iterable, Jira.
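
The failure mode described above can be illustrated with a small sketch. It is not Fivetran's implementation: the function name, the latency figures, and the rule that a node is restarted after three consecutive missed heartbeats are all assumptions made for illustration. It shows how heartbeats that arrive later than the timeout lead to repeated restarts, and how a longer timeout avoids them.

```python
# Hypothetical model of a supervisor that restarts a node after several
# consecutive heartbeats exceed the timeout. All values are illustrative.

def restarts_triggered(latencies_s, timeout_s, max_missed=3):
    """Count the restarts a supervisor would trigger for a series of heartbeat latencies."""
    restarts = 0
    missed = 0
    for latency in latencies_s:
        if latency > timeout_s:
            # The heartbeat arrived too late and counts as a timeout error.
            missed += 1
            if missed >= max_missed:
                restarts += 1
                missed = 0  # the node was restarted, so counting starts over
        else:
            missed = 0  # a timely heartbeat clears the streak
    return restarts


# Assumed heartbeat latencies in seconds, before and during increased latency.
NORMAL_LATENCIES = [0.2, 0.3, 0.2, 0.4, 0.3, 0.2]
DEGRADED_LATENCIES = [1.5, 1.8, 2.1, 1.7, 1.9, 2.4]

print(restarts_triggered(NORMAL_LATENCIES, timeout_s=1.0))    # no late heartbeats
print(restarts_triggered(DEGRADED_LATENCIES, timeout_s=1.0))  # every heartbeat is late
print(restarts_triggered(DEGRADED_LATENCIES, timeout_s=5.0))  # longer timeout absorbs the latency
```

With the assumed numbers, the degraded latencies trigger restarts under the 1-second timeout but none under the 5-second timeout. A longer timeout also delays detection of a node that has really failed.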

Steps taken to resolve the issue:

  • Increased the timeout for receiving heartbeats from Fivetran platform components, so that heartbeats delayed by latency no longer trigger timeout errors.
  • Added actions to resolve data integrity issues for the AppsFlyer, GitHub, Intercom, and Jira connectors using the source APIs.
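
For sources that can be queried again through their API, the range to re-request follows from the outage times given in this report. The sketch below is one way to express that window. The one-hour padding and the helper names are assumptions, and the report does not describe how the actual re-sync is performed.

```python
from datetime import datetime, timedelta, timezone

# Outage window as stated in this report (UTC).
OUTAGE_START = datetime(2020, 5, 1, 18, 0, tzinfo=timezone.utc)
OUTAGE_END = datetime(2020, 5, 2, 6, 0, tzinfo=timezone.utc)


def backfill_window(padding=timedelta(hours=1)):
    """Return the (start, end) range to re-request from a source API.

    The padding is an assumption: it widens the range to cover events
    that were in flight when the outage began or ended.
    """
    return OUTAGE_START - padding, OUTAGE_END + padding


def in_outage(event_time):
    """Return True if a timezone-aware timestamp falls inside the outage window."""
    return OUTAGE_START <= event_time <= OUTAGE_END


start, end = backfill_window()
print(OUTAGE_END - OUTAGE_START)      # duration of the outage
print(start.isoformat(), end.isoformat())
```

The same `in_outage` check could be used to find rows in a destination whose timestamps fall inside the affected period.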

Current status:

  • Our internal webhooks service is back to normal and syncing successfully.

Customer impact:

  • For sources that we can re-sync data from via other APIs (AppsFlyer, GitHub, Intercom, Jira), the impact is a temporary data integrity error that will be resolved by May 11th, 2020.
  • If a source neither allows data to be re-synced via an API call nor supports retries for webhooks, customers will have lost the data that source sent while the connector was down.
  • Specific impact by connector type:

    • Iterable: EVENT_EXTENSION, which contains additional information about events, will not have entries for events during the outage.
    • Help Scout: CONVERSATION_HISTORY will be missing some history of conversations during the outage. The current state is correct.

Steps to prevent/mitigate these risks in the future:

  • Lower the thresholds of the existing alerts that detect these failures.
  • Build automation to preemptively mitigate the timeout scenarios.
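
To show why a lower alert threshold helps, the sketch below finds the first minute at which an alert would fire for a given threshold. The failure counts and the threshold values are hypothetical and are not taken from Fivetran's monitoring.

```python
def first_alert_index(failures_per_minute, threshold):
    """Return the index of the first minute whose failure count reaches the threshold, or None."""
    for index, count in enumerate(failures_per_minute):
        if count >= threshold:
            return index
    return None


# Hypothetical failure counts per minute as an outage develops.
FAILURES_PER_MINUTE = [0, 1, 3, 6, 12, 25, 40]

print(first_alert_index(FAILURES_PER_MINUTE, threshold=20))  # higher threshold fires later
print(first_alert_index(FAILURES_PER_MINUTE, threshold=5))   # lower threshold fires earlier
```

The trade-off is that a lower threshold also fires on smaller, possibly harmless bursts of failures, so it may produce more false alarms.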

We appreciate your patience and help with this issue, and sincerely apologize for any inconvenience we caused.

Regards,

The Fivetran Team

Posted May 05, 2020 - 00:41 UTC

Resolved
This incident has been resolved.
Posted May 02, 2020 - 08:04 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 02, 2020 - 07:22 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted May 02, 2020 - 06:30 UTC
Investigating
We are currently investigating this issue.
Posted May 02, 2020 - 04:57 UTC