This is something that happened to me recently. I was working on my private projects and experimenting with different Amazon Web Services. I had one service deployed on Amazon Elastic Container Service (ECS). One morning, I woke up, looked at my phone, and was shocked: I had received about ten budget warnings by email. My budget had been exceeded during the night.
So, what happened?
When I first deployed my Spring Boot service on ECS, it was not working. Therefore, I enabled CloudWatch Logs in my ECS Cluster to be able to debug the problems during startup.
It turned out that the service was stuck in a redeploy loop because the health check grace period was too short. Whenever the service was deployed, the Application Load Balancer started its health checks almost immediately. The service was still starting up and did not answer with a successful response, so it was marked as unhealthy. As a consequence, it was stopped and replaced by a new deployment, which then ran into the same problem.
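To make the timing problem concrete, here is a minimal Python sketch. It is a simplified model, not how ECS or the load balancer is actually implemented: the function name, the 30-second check interval, and the threshold of two failed checks are assumptions chosen for illustration.

```python
def survives_startup(startup_s, grace_s, interval_s=30, unhealthy_threshold=2):
    """Return True if the task becomes healthy before it is marked unhealthy.

    Simplified model (assumption): a health check fires every `interval_s`
    seconds, fails while the application is still starting, and is ignored
    while the grace period is running.
    """
    counted_failures = 0
    t = interval_s
    while t < startup_s:          # every check before startup completes fails
        if t >= grace_s:          # failures only count after the grace period
            counted_failures += 1
            if counted_failures >= unhealthy_threshold:
                return False      # task is replaced and the loop starts again
        t += interval_s
    return True


# A service that needs 90 seconds to start:
print(survives_startup(startup_s=90, grace_s=0))    # False: replaced before it is ready
print(survives_startup(startup_s=90, grace_s=120))  # True: early failures are ignored
```

In this model, the grace period has to cover the startup time, otherwise every new deployment is replaced before it ever gets the chance to answer a health check.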
I left CloudWatch Logs enabled because the first 5 GB of log ingestion are free, and the logs might help with upcoming issues. 5 GB of logs for one service is a lot, right? Yes, indeed. But not in my case. I had a bug in my service that caused an endless loop, which logged the same exception over and over again. This produced 238 GB of useless logs and cost me $159.79 for literally nothing.
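The per-GB price is not stated above, but it can be roughly backed out from the two numbers. The following Python sketch does that arithmetic. It assumes that the whole charge was for log ingestion and that the 5 GB free allowance was fully applied, which may not match the actual bill; the function names are made up for this example.

```python
FREE_TIER_GB = 5        # free ingestion allowance mentioned above
INGESTED_GB = 238       # log volume from the incident
CHARGE_USD = 159.79     # charge from the incident


def billable_gb(ingested_gb, free_gb=FREE_TIER_GB):
    """Volume that is charged after subtracting the free allowance."""
    return max(ingested_gb - free_gb, 0)


def implied_rate(charge_usd, ingested_gb, free_gb=FREE_TIER_GB):
    """Effective price per GB implied by a charge.

    Assumption: the whole charge is ingestion and the free allowance
    was fully applied.
    """
    gb = billable_gb(ingested_gb, free_gb)
    if gb == 0:
        raise ValueError("nothing was billable, so no rate can be derived")
    return charge_usd / gb


rate = implied_rate(CHARGE_USD, INGESTED_GB)
print(f"billable: {billable_gb(INGESTED_GB)} GB, implied rate: ${rate:.2f}/GB")
```

Under these assumptions, 233 GB were billable, at an effective rate of roughly $0.69 per GB.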
I logged in to my AWS account and immediately stopped the service. Then I looked at the logs and saw what was wrong. I deleted the logs and fixed the bug. It was just one wrong line, which I had added just before going to bed the night before.
With a faster instance on ECS, this could have gotten much worse! I was using a Fargate Spot instance with 0.5 vCPU. I can't imagine how many gigabytes a more powerful instance would have produced.
How to prevent this?
Don't commit and deploy changes right before going to bed or leaving the office. Do you have static code analysis and automated tests? Are you confident that the upgrade works just fine? Even then, take the time to verify the deployment manually, and look at the logs yourself if you don't have automated log analysis. In my case, just 5 minutes of my time could have saved me $159.79 and a stressful morning.
If you want to avoid high costs, make sure to stay within the free tier limits of all the AWS services you use. Unfortunately, AWS has no hard cost limit. You can and should create budgets with alerts. This does not prevent a cost explosion, but it helps you react in time and keep the costs to a minimum. Details can be found in the AWS Budgets documentation.
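A budget with an email alert can also be created from code. The following is a minimal sketch using the `create_budget` call of the boto3 Budgets client. The budget name, the 80 percent alert threshold, the account ID, and the email address are placeholders and assumptions, not values from my setup.

```python
def budget_request(account_id, limit_usd, email, alert_percent=80.0):
    """Build the keyword arguments for the Budgets `create_budget` call."""
    return {
        "AccountId": account_id,
        "Budget": {
            "BudgetName": "monthly-cost-budget",  # hypothetical name
            "BudgetLimit": {"Amount": str(limit_usd), "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        "NotificationsWithSubscribers": [
            {
                # Notify when actual spend exceeds alert_percent of the limit.
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": alert_percent,
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}],
            }
        ],
    }


def create_budget(client, **kwargs):
    """Send the request; `client` is expected to be boto3.client("budgets")."""
    return client.create_budget(**budget_request(**kwargs))


# Usage (requires boto3 and AWS credentials; all values are placeholders):
#   import boto3
#   create_budget(boto3.client("budgets"), account_id="123456789012",
#                 limit_usd=10, email="me@example.com")
```

Passing the client in as an argument keeps the request-building part easy to check without touching a real AWS account.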
I also recommend enabling AWS Cost Anomaly Detection and creating a cost monitor. It notifies you about detected cost anomalies that exceed the configured threshold. Details on how to create a cost monitor can be found in the AWS Cost Anomaly Detection documentation.
CloudWatch itself can also help. You can create billing alarms that notify you when your estimated charges exceed a configured threshold. The CloudWatch documentation has a step-by-step guide on how to create a billing alarm.
Note that none of these measures stops any service. They only notify you; you still need to fix the problem manually or stop the service that incurs the charges.
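As a rough illustration, such a billing alarm could be created with the `put_metric_alarm` call of the boto3 CloudWatch client. This is a sketch under assumptions: the alarm name and the SNS topic ARN are placeholders, billing alerts have to be enabled in the billing preferences first, and the billing metric is only available in the us-east-1 region.

```python
def billing_alarm_request(threshold_usd, sns_topic_arn):
    """Build the keyword arguments for CloudWatch `put_metric_alarm`."""
    return {
        "AlarmName": f"estimated-charges-over-{threshold_usd}-usd",  # hypothetical name
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,                 # evaluate over 6-hour windows
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # SNS topic that sends the notification
    }


def create_billing_alarm(client, threshold_usd, sns_topic_arn):
    """Send the request; `client` is expected to be a CloudWatch client."""
    return client.put_metric_alarm(**billing_alarm_request(threshold_usd, sns_topic_arn))


# Usage (requires boto3 and AWS credentials; the ARN is a placeholder):
#   import boto3
#   cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
#   create_billing_alarm(cloudwatch, 10,
#                        "arn:aws:sns:us-east-1:123456789012:billing-alerts")
```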
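Stopping the service by hand can at least be reduced to a single call. The sketch below scales an ECS service down to zero tasks with the `update_service` call of the boto3 ECS client. The cluster and service names are placeholders, and the service definition itself stays in place.

```python
def stop_service(ecs_client, cluster, service):
    """Scale an ECS service down to zero running tasks.

    `ecs_client` is expected to be boto3.client("ecs"). The service is
    not deleted, it just stops running (and logging).
    """
    return ecs_client.update_service(
        cluster=cluster,
        service=service,
        desiredCount=0,
    )


# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   stop_service(boto3.client("ecs"), cluster="my-cluster", service="my-service")
```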
I learned the hard way that I was too confident when deploying small changes.
"Trust is good, control is better." - attributed to Vladimir Lenin
In the future, I am going to verify manually that all my automated checks are working correctly and there are no obvious issues with the deployment. I also disabled CloudWatch log ingestion in my ECS Cluster for now.
I had no Billing Alarms configured, and my cost monitor was not configured correctly. I have updated everything, and next time I should get notified much earlier.
Update 29.06.2022: Amazon refunded the charges for CloudWatch after I reached out to them and explained the situation. The support was excellent.