Why PlanetScale broke our trust in database startups

We recently completed a full database migration from PlanetScale to Amazon RDS. While numerous blog posts cover database migrations, this post goes beyond the technical details to explain why we moved and how the experience has made us lose trust in database startups.

Who are PlanetScale?

If you haven't been shopping around recently for a managed MySQL DBaaS, you may not have heard of PlanetScale. Their offering is a fully managed database built on top of Vitess (which originated at YouTube) that aims to bring "unlimited scale" and an improved workflow. One of their co-founders actually co-created Vitess at YouTube.

PlanetScale provides core Vitess features such as sharding support for MySQL (without application-level changes) and connection pooling, as well as a number of other features including query caching, branching/safe migrations, non-blocking schema changes and database insights.

They boast of supporting one million MySQL connections, supporting foreign keys and, ultimately, preventing MySQL downtime. Oh, and when they launched, they were very competitively priced, with a free tier (Hobby) and a very cheap basic tier (Scaler).

Sounds great, right? This is ultimately why we picked PlanetScale in the first place: big promises of "hyperscale" with a cheap price tag, all too good to be true...

Why did we leave PlanetScale?

When GridPanel launched, we were looking for a database solution that would scale with us and not break the bank - PlanetScale fit the bill perfectly. We got up and running quickly, everything just worked and we felt like we were in good hands.

Downtime

Ironically, after a few months of using PlanetScale, we experienced a complete outage. At this point our database usage was very simple, with your typical CRUD OLTP workloads, nothing that requires "unlimited scale". On October 10th 2023, our database went down without warning and we were hit by a flood of the following "access denied" errors.

OperationalError: (1045, "Access denied for user '8u9drzeiogftabhnfjhp'")

For reasons unknown at the time, we had to change our connection hostname to get our database connections working again. Support informed us an hour later that:

We have rectified an issue between a load balancer and our internal database routing service and connectivity should be restored for all AWS us-west-2 databases. Our apologies for the disruption in service, please let us know if you continue experiencing any issues.

Despite this response, we continued to experience errors and our query latency was rising fast. After some back-and-forth with support, we were eventually told to change our hostname back again. This remedied the latency and downtime but, sadly, introduced our next problem.

Connection issues

After the downtime, we started having connection issues. We emailed back and forth with support about flapping connections and DNS errors, with very few fixes offered at the time. Our core Python applications would raise the following exception when trying to connect to the PlanetScale database:

OperationalError: (1105, 'unavailable: dns resolver: zero addresses for host')

Originally, their support pushed back on these issues and blamed us for not retrying our connections. Eventually, though, we were told:

I did some extra digging on our end about this and from what I was reading it does appear that the error originates in our infrastructure, which also makes sense given that I've seen it occur across different programming languages as well so it doesn't seem to be a language-specific sort of issue.

In one of the bits of information I was reading through, it does look like new improvements in our edge routing layer should begin to start minimizing this issue from occurring, and that team has also been adding extra code to help with visibility to better understand what is occurring.

To be fair to PlanetScale, the errors started occurring less often, but ultimately they persisted until the day we migrated away.
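For context on the retry point, here is a minimal sketch of client-side retries with exponential backoff and jitter in Python. It is an illustration, not PlanetScale's guidance: `retry_with_backoff` is a hypothetical helper, and with a driver such as PyMySQL you might pass `(pymysql.err.OperationalError,)` as `retry_on` (an assumption about your stack). Retries can paper over intermittent errors like the DNS one above, but they only make sense for connecting and idempotent operations, and they do not fix an infrastructure problem on the provider's side.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 5,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on the given exception types (hypothetical helper).

    Before retry n (0-based) we sleep a random time in
    [0, min(max_delay, base_delay * 2**n)], i.e. exponential backoff
    with "full jitter" so many clients don't retry in lockstep.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    class TransientError(Exception):
        """Stand-in for a transient driver error."""

    attempts_seen = []

    def flaky_connect() -> str:
        # Fails twice, then succeeds, mimicking an intermittent DNS error.
        attempts_seen.append(1)
        if len(attempts_seen) < 3:
            raise TransientError("dns resolver: zero addresses for host")
        return "connected"

    print(retry_with_backoff(flaky_connect, (TransientError,)))  # prints "connected"
```

Injecting `sleep` keeps the helper testable without real waiting; exceptions outside `retry_on` propagate immediately rather than being retried.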

Support changes

When we originally started using PlanetScale, support was free, simple to access (despite the issues we had) and they would reply promptly. Soon after we joined, they changed this setup so that support was tiered and paid for separately from the plan you were on. In our case, we were put on the "Standard" support plan, which came with a response time of "2 business days". If you wanted to upgrade that, you needed to pay $1000/month, which quite frankly is absurd.

On "Standard" support you are also only covered 12x5 (not 24/7) for P1 (Urgent) issues, which means that if your database goes down out of hours, you are toast.
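To put 12x5 in perspective, assuming it means 12 hours a day on 5 days of the week, the share of the week with P1 coverage works out as:

```latex
\text{covered} = 12 \times 5 = 60 \text{ h/week}, \qquad
\frac{60}{7 \times 24} = \frac{60}{168} \approx 35.7\%
```

So roughly 64% of the week (108 of 168 hours) has no P1 coverage at all.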

Pricing

At GridPanel, we were on the "Scaler" plan, paying $29/month for a fairly simple database setup. This is a super competitive price for the feature set that PlanetScale boasted of offering. In typical startup fashion, however, PlanetScale recently announced a change to their pricing structure, and we received a lovely email explaining our new pricing:

PlanetScale pricing email change

Obviously, we were not pleased... A 300% pricing increase! This news, alongside another blog post about layoffs and a push for profitability, strongly suggested that this was not a one-time thing for PlanetScale. It is very likely that they will continue pushing up prices.

Like many businesses, we had a growing backlog, and moving off PlanetScale was one item on it, but it was difficult to prioritise given how cheap PlanetScale was compared with the alternatives. With the higher price, this was no longer true. Sadly for PlanetScale, they are very replaceable, so we sought to replace them.

Why Amazon RDS?

All this being said, why did we land on Amazon RDS, in particular their MySQL variant?

Pricing

Although RDS pricing (and Amazon's pricing calculators) can be quite difficult to reason about, we will save a considerable amount by switching to RDS compared with PlanetScale's new pricing model. Combine this with $5000 of AWS credit from a Revolut Business bank account (we are not affiliated), and it's an absolute no-brainer pricing-wise.

Platform Stability

We understand that companies want to keep innovating and improving their product. It sounds obvious, but with databases you really do want stability. Platforms like PlanetScale keep changing their product, causing bugs and breakages down the line that we cannot afford to be subjected to. It would be naive to say that this cannot happen on Amazon RDS, but it is far less likely given Amazon's mature engineering practices and standards.

The migration

So we had decided to migrate to RDS; however, it wasn't as simple as we had hoped.

Platform lock-in

Our initial optimism about an easy migration vanished when we learnt that PlanetScale gives you no way to move your data out other than dumping your entire database. This makes sense given PlanetScale doesn't want to provide you with tools to actively churn, but it is something to keep in mind if you are choosing a DBaaS. In an ideal world, you would be able to set up your new database as a replica of the existing one to minimise downtime.

To migrate off PlanetScale, you sadly have to take a full DB dump and perform a full restore into your new database. This meant a longer period of downtime than we had hoped for, as you need to take your application down (or at least prevent writes) during the dump and restore process. Luckily for us, however, our database was on the smaller side, so the dump and restore process was a lot faster than it might be for many others.
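A minimal sketch of that dump-and-restore step in Python, assuming the standard `mysqldump` and `mysql` clients are installed, the source accepts a plain `mysqldump` connection (worth verifying for your own PlanetScale database), and the target database already exists on RDS. `MySQLTarget`, `dump_command`, `restore_command` and `dump_and_restore` are hypothetical helpers, not part of any PlanetScale or AWS tooling, and the hostnames are placeholders.

```python
import shlex
import subprocess
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class MySQLTarget:
    """Connection details for one side of the migration (hypothetical helper)."""

    host: str
    user: str
    database: str
    port: int = 3306
    # Optional MySQL option file holding the password, so it never
    # appears on the command line or in the process list.
    defaults_file: Optional[str] = None

    def base_args(self, binary: str) -> List[str]:
        args = [binary]
        if self.defaults_file:
            # --defaults-extra-file must be the first option given.
            args.append(f"--defaults-extra-file={self.defaults_file}")
        args += [f"--host={self.host}", f"--port={self.port}", f"--user={self.user}"]
        return args


def dump_command(src: MySQLTarget, tables: Sequence[str] = ()) -> List[str]:
    """mysqldump argv for the whole database, or just `tables` if given."""
    # --single-transaction takes a consistent InnoDB snapshot without
    # locking every table for the duration of the dump.
    return src.base_args("mysqldump") + ["--single-transaction", src.database, *tables]


def restore_command(dst: MySQLTarget) -> List[str]:
    """mysql client argv that replays a dump into an existing database."""
    return dst.base_args("mysql") + [dst.database]


def dump_and_restore(src: MySQLTarget, dst: MySQLTarget, tables: Sequence[str] = ()) -> None:
    """Pipe mysqldump straight into mysql, with no intermediate file."""
    dump = subprocess.Popen(dump_command(src, tables), stdout=subprocess.PIPE)
    assert dump.stdout is not None
    restore = subprocess.run(restore_command(dst), stdin=dump.stdout)
    # Close our copy of the pipe so mysqldump sees a broken pipe if mysql
    # exited early, then wait for it and check both exit codes.
    dump.stdout.close()
    dump_rc = dump.wait()
    if dump_rc != 0:
        raise subprocess.CalledProcessError(dump_rc, dump.args)
    if restore.returncode != 0:
        raise subprocess.CalledProcessError(restore.returncode, restore.args)


if __name__ == "__main__":
    src = MySQLTarget(host="source.example.com", user="migrator", database="app")
    # Dry run: print the command instead of executing it.
    print(shlex.join(dump_command(src, ["orders"])))
    # mysqldump --host=source.example.com --port=3306 --user=migrator --single-transaction app orders
```

Streaming the dump avoids writing a large file to disk first; passing a table list lets the same helper copy a subset of tables, which matters for the phased approach below.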

We did have some tricks up our sleeve to reduce downtime including:

1. Dumping/restoring our biggest tables ahead of time, as they were not written to often and we mainly controlled these writes with internal processes.

2. Ensuring that all our proxy services stayed up by not allowing any rotations/configuration changes during the migration itself. This was a huge win for us: our core services stayed up during the migration, and only the site and marketing materials were impacted.

3. Finally, dumping and restoring our smallest, hot tables, for a grand total of 10 minutes of downtime.
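The phased approach above can be sketched as a small planning helper. This is a hypothetical illustration rather than a description of the exact process behind the numbers above: `TableStats`, `plan_phases`, `estimate_cutover_minutes`, the write-rate threshold, the restore rate and the example tables are all assumptions you would replace with your own measurements.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class TableStats:
    """Hypothetical per-table figures gathered before the migration."""

    name: str
    size_mb: float
    writes_per_hour: float
    # True if only internal jobs write to this table, so they can be paused.
    writes_controlled: bool = False


def plan_phases(
    tables: Sequence[TableStats], max_writes_per_hour: float = 1.0
) -> Tuple[List[str], List[str]]:
    """Split tables into (copy ahead of time, copy during the cutover window).

    Rarely written or internally controlled tables are copied early; everything
    else is left for the cutover. Both lists are ordered largest first.
    """
    precopy: List[str] = []
    cutover: List[str] = []
    for t in sorted(tables, key=lambda s: s.size_mb, reverse=True):
        if t.writes_controlled or t.writes_per_hour <= max_writes_per_hour:
            precopy.append(t.name)
        else:
            cutover.append(t.name)
    return precopy, cutover


def estimate_cutover_minutes(
    tables: Sequence[TableStats], cutover: Sequence[str], mb_per_minute: float
) -> float:
    """Rough downtime estimate: cutover data volume / measured restore rate."""
    wanted = set(cutover)
    total_mb = sum(t.size_mb for t in tables if t.name in wanted)
    return total_mb / mb_per_minute


if __name__ == "__main__":
    # Made-up tables and numbers, purely for illustration.
    tables = [
        TableStats("events", 5000, 0.5),
        TableStats("results", 3000, 200, writes_controlled=True),
        TableStats("users", 20, 50),
        TableStats("sessions", 5, 900),
    ]
    precopy, cutover = plan_phases(tables)
    print(precopy, cutover)  # ['events', 'results'] ['users', 'sessions']
    print(estimate_cutover_minutes(tables, cutover, mb_per_minute=10.0))  # 2.5
```

The sketch does not capture writes that land on pre-copied tables after they were dumped; in practice you would pause the internal jobs that write to them, or re-copy them during the cutover.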

Obviously, these apply to our business, so your mileage may vary, but if you are making a similar migration, have a think about how you can work around any downtime. Keep in the back of your mind, however, that sometimes the simplest thing to do is a full dump and restore with downtime; if you can afford it, go for it.

Conclusion

If there is anything that we want you to take away from this post it is that when you are choosing a DBaaS provider, take into account more than the bottom line pricing and promises of an easy ride, especially from startups.

If you pick a startup, you trust that they will do the right thing: focus on stability, and be transparent when making sweeping changes (such as changing support). Moving fast is only natural for a startup, but the mantra of "move fast and break things" really doesn't work in this space, and it doesn't take long for trust to break when your entire application is down and at their mercy.

At GridPanel, we want to focus on serving our customers high-quality scraping solutions, not worrying about database reliability, stability and longevity. RDS is not a silver bullet either; be wary of the risks you can, and want to, take when choosing providers.
