Heroku SSL Service Degradation
Incident Report for Skylight
Resolved
Our metrics indicate that the agent report rate has recovered to its pre-incident level. We believe most customer agents have resumed normal reporting and the issue has been resolved. If you continue to encounter issues, please email support@skylight.io for assistance. Unfortunately, if your agent was "locked out" by an expired authentication session and was unable to report data during the outage, the unreported data will not be available on the dashboard. We are truly sorry about this.
Posted Oct 25, 2021 - 02:45 PDT
Monitoring
We have completed the migration and are monitoring the situation. Because the fix required updating our DNS records, it may take some time for the incident to fully resolve.

The Skylight dashboard (skylight.io) should be accessible as soon as your operating system has refreshed the DNS record, which should happen within minutes because the record has a TTL of 300 seconds. If you are still unable to access the site, please email support@skylight.io for assistance.
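
If you want to check what your machine currently resolves for the dashboard, the following Python sketch may help. It is only an illustration: the helper name `resolve_ipv4` and its `resolver` parameter are our own assumptions for this example, and the addresses it prints will depend on your local resolver and its cache.

```python
import socket


def resolve_ipv4(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses for hostname.

    `resolver` is a hypothetical injection point (defaulting to
    socket.getaddrinfo) so the function can be exercised offline.
    """
    infos = resolver(hostname, 443, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is an (address, port) pair.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Uses whatever your operating system's resolver returns right now.
    print(resolve_ipv4("skylight.io"))
```

If the output does not change for more than about five minutes (the 300 second TTL), an intermediate resolver may be holding on to the old record for longer than it should.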

Your Skylight agents should resume reporting data once they retry the previously failed authentication request. If this does not occur, you can try restarting your app, which forces the agent to restart and authenticate again. If that still doesn't work, please email support@skylight.io for further help.
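
For readers curious about what "retrying a failed authentication request" can look like in general, here is a minimal Python sketch of retry with exponential backoff. It is not the Skylight agent's actual implementation: the function `authenticate_with_retry`, its parameters, and the choice of `ConnectionError` as the retryable failure are all assumptions made for illustration.

```python
import time


def authenticate_with_retry(authenticate, max_attempts=5, base_delay=1.0,
                            sleep=time.sleep):
    """Call `authenticate` until it succeeds or attempts run out.

    `authenticate` is a hypothetical zero-argument callable that returns
    a session token or raises ConnectionError on failure.
    """
    for attempt in range(max_attempts):
        try:
            return authenticate()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Wait 1x, 2x, 4x, ... the base delay before trying again.
            sleep(base_delay * 2 ** attempt)
```

Restarting the app has a similar effect to resetting such a loop: the agent starts over and issues a fresh authentication request immediately.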

Once again we are very sorry for the trouble.
Posted Oct 25, 2021 - 02:35 PDT
Identified
The dashboard and agent authentication endpoints are affected by a service degradation at our hosting provider, Heroku. This outage impacts their "SSL Endpoint" add-on and is expected to last for 8 hours. We have begun the process of migrating away from the add-on, but this normally takes up to 24 hours. We are investigating whether there is any way to speed up the process or re-route the affected endpoints.

The data processing pipeline is technically unaffected by this outage as it is hosted on a different provider. However, given that agents are failing to authenticate (and therefore failing to submit traces), we expect this to cause lapses in Skylight data during the outage period.

We are very sorry for the inconvenience.
Posted Oct 25, 2021 - 01:41 PDT
Investigating
The Skylight dashboard is currently inaccessible due to a potential configuration issue. This outage also impacts agent authentication – new authentications from agents will not succeed at the moment. Agents that are already authenticated can continue to report data until their authentication session expires.
Posted Oct 25, 2021 - 01:11 PDT
This incident affected: Website (Application, Hosting).