Outage in Retention Science

Database server out of temporary disk space causing certain sites' recommendations to fail

Resolved · Major
April 19, 2024 - Started 13 days ago - Lasted 3 days
Official incident page


Outage Details

We are currently investigating, but it appears one of our databases has run out of temporary disk space while unloading large tables for our machine learning algorithms. Not all sites are affected; it seems to primarily be an issue for larger sites (many millions of users). We are working to remediate this, but we are also trying to find out why it suddenly started happening even though we haven't changed much on the database server side. We will post further findings here as we have them.
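The report does not say which metrics were checked, so the following is only an illustrative sketch: a short Python script, assuming a MySQL 8 instance and the PyMySQL driver (neither is confirmed in the incident text), of the kind of counters an operator might inspect to see whether large queries are spilling internal temporary tables to disk.

```python
# Hypothetical diagnostic sketch -- assumes MySQL 8 and the PyMySQL driver.
# Host and credentials are placeholders; the incident does not name the tooling used.
import pymysql

conn = pymysql.connect(host="subscriptions-db.internal", user="ops", password="REDACTED")
try:
    with conn.cursor() as cur:
        # Where on-disk temporary files are written.
        cur.execute("SHOW GLOBAL VARIABLES LIKE 'tmpdir'")
        print(cur.fetchall())

        # How often internal temporary tables spill from memory to disk.
        cur.execute("SHOW GLOBAL STATUS LIKE 'Created_tmp%tables'")
        print(cur.fetchall())  # compare Created_tmp_disk_tables vs Created_tmp_tables
finally:
    conn.close()
```

A rising ratio of Created_tmp_disk_tables to Created_tmp_tables is one sign that large sorts or joins no longer fit in memory and are consuming temporary disk space instead.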
Latest Updates (newest first)
RESOLVED 10 days ago - at 04/22/2024 06:25PM

We found a configuration setting that did not carry over from the old version of the database to the new one. We added this configuration on Friday around 6pm Pacific, which fixed the problem with our big data operations on our Subscription service database for all clients.

We monitored over the weekend, and there were no further errors. This issue has been resolved.

We have taken steps to make sure all of our databases have this configuration, and we will be monitoring for similar issues on other databases going forward.
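The update does not identify the missing configuration, so the sketch below is purely illustrative of the follow-up step described above: verifying that a given setting is consistent across every database. The host names and the variable being compared (temptable_max_ram, a MySQL 8 temp-space setting) are assumptions, not details from the incident.

```python
# Hypothetical fleet consistency check -- hosts and the variable name are placeholders.
import pymysql

HOSTS = ["subscriptions-db.internal", "catalog-db.internal", "events-db.internal"]
VARIABLE = "temptable_max_ram"  # example MySQL 8 setting; the real one is not named

def read_variable(host: str, name: str):
    conn = pymysql.connect(host=host, user="ops", password="REDACTED")
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW GLOBAL VARIABLES LIKE %s", (name,))
            row = cur.fetchone()
            return row[1] if row else None
    finally:
        conn.close()

values = {host: read_variable(host, VARIABLE) for host in HOSTS}
if len(set(values.values())) > 1:
    print("Configuration drift detected:", values)
else:
    print("All hosts agree:", values)
```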

IDENTIFIED 13 days ago - at 04/19/2024 10:32PM

The issue has been identified and a fix is being implemented.

INVESTIGATING 13 days ago - at 04/19/2024 10:32PM

We are continuing to investigate. The issue seems to have started on April 14th, when we applied an AWS-required MySQL 5.7 => 8 upgrade to our Subscription service database. This has apparently caused some unforeseen performance issues when running multiple sites' machine learning jobs.

We will be upgrading the database instance size to sidestep the space issue temporarily. Our hypothesis is that this will buy us some time and (hopefully) allow our big jobs to continue running.

In the meantime, we will be investigating how to make the disk usage more efficient, or how to resolve the issue altogether.
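The updates do not say how the instance resize will be carried out. Assuming the database runs on Amazon RDS, which the "AWS-required" upgrade suggests but the report does not state outright, a resize along these lines could be scripted with boto3; the identifier, instance class, and storage figure below are placeholders.

```python
# Hypothetical resize sketch, assuming an Amazon RDS MySQL instance.
# The identifier, instance class, region, and storage size are placeholders.
import boto3

rds = boto3.client("rds", region_name="us-west-2")

response = rds.modify_db_instance(
    DBInstanceIdentifier="subscriptions-db",  # placeholder identifier
    DBInstanceClass="db.r5.2xlarge",          # larger instance class for more headroom
    AllocatedStorage=500,                     # GiB of storage, placeholder value
    ApplyImmediately=True,                    # apply now rather than at the maintenance window
)
print(response["DBInstance"]["DBInstanceStatus"])
```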

INVESTIGATING 13 days ago - at 04/19/2024 09:20PM

We are currently investigating, but it appears one of our databases has run out of temporary disk space while unloading large tables for our machine learning algorithms. Not all sites are affected; it seems to primarily be an issue for larger sites (many millions of users).

We are working to remediate this, but we are also trying to find out why it suddenly started happening even though we haven't changed much on the database server side. We will post further findings here as we have them.

