Today, our service experienced a major degradation caused by an unexpected outage of a third-party provider that we rely on for the Polygon extraction option. The upstream failure propagated through our system and led to significant inference delays and failures for all customers.
13:40–14:00 UTC:
14:00–14:50 UTC:
The service recovered once we switched upstream providers and our worker pool cleared the blocked tasks.
The incident was triggered by the outage of the third-party provider that supports the Polygon extraction option.
The problem was amplified internally because this external call did not include a timeout, allowing requests to hang and consume worker capacity indefinitely.
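To show why the missing timeout mattered, here is a minimal Python sketch. It is not our production code. The provider functions, the 0.2-second deadline, and the pool size of two workers are hypothetical stand-ins. In the sketch, the upstream call has a deadline, so a hung request is cancelled and the worker moves on to a fallback provider instead of staying blocked indefinitely.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical deadline for the external call; the real value is not stated in this report.
PROVIDER_TIMEOUT_S = 0.2

Provider = Callable[[str], Awaitable[str]]


async def hung_provider(payload: str) -> str:
    """Hypothetical stand-in for the failed upstream: the call never returns."""
    await asyncio.Event().wait()  # blocks forever, like a request with no timeout
    return payload  # unreachable


async def backup_provider(payload: str) -> str:
    """Hypothetical stand-in for a healthy alternative provider."""
    await asyncio.sleep(0)
    return f"polygons:{payload}"


async def extract_polygons(payload: str, primary: Provider, fallback: Provider,
                           timeout: float = PROVIDER_TIMEOUT_S) -> str:
    try:
        # wait_for cancels the pending call once the deadline passes,
        # so the worker is released instead of hanging indefinitely.
        return await asyncio.wait_for(primary(payload), timeout=timeout)
    except asyncio.TimeoutError:
        # The fallback call is also bounded by the same deadline.
        return await asyncio.wait_for(fallback(payload), timeout=timeout)


async def main() -> list:
    workers = asyncio.Semaphore(2)  # hypothetical worker pool with two slots

    async def handle(job: str) -> str:
        async with workers:  # each job occupies one worker slot
            return await extract_polygons(job, hung_provider, backup_provider)

    # Four jobs share two workers; every job still completes because no slot hangs.
    return await asyncio.gather(*(handle(f"job{i}") for i in range(4)))


results = asyncio.run(main())
print(results)
```

Without the `wait_for` deadline, the first two jobs would hold both worker slots indefinitely and the remaining jobs would never start. This is the capacity exhaustion described above.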
We are implementing the following measures to prevent similar incidents:
We sincerely apologize for today’s disruption. Reliability is essential to your operations, and we are taking concrete steps to ensure this type of issue does not happen again.
If you have any questions, our support team is here to help.