While recording a recent episode of the FinOps on Azure podcast with Rik Hepworth (which will be out soon), we discussed a scenario where something going wrong with an Azure solution caused a cost issue. In this article we will explore that problem and how you can implement some protection against it.
Scenario
The scenario was as per the below diagram.
In this case business partners submit files to an SFTP site, which is actually Azure Storage with the SFTP proxy enabled. When a file is uploaded, it raises an event to Event Grid, which triggers a Logic App that processes the file and loads it into a database. If the file is invalid, the Logic App just ignores it.
The interface is expected to process around 10 files per day and should be very cheap to run, costing in the region of $10 per month.
Everything was running fine until one day, completely outside the scope of the interface, someone started using the SFTP site for another purpose and began archiving some received files onto it. This looks like the below diagram.
When the admin archived files with a PowerShell script, they had no idea that every upload to the SFTP site triggers a Logic App, which processes the file and then ignores it.
The admin then set up a job to run on a regular schedule and archive any files which were output, but there was a bug in the script. It missed the step that cleans the local directory, so each run uploaded the new files and then ALL the old files again. Before long there were hundreds of thousands of files uploaded to the storage account via SFTP, and each file triggered a Logic App run.
The beauty of the cloud, however, is that the serverless model lets your solution scale dynamically to support this burst of load. In this case, Event Grid and Logic Apps scaled on demand and perfectly processed all of these files in line with the defined business logic.
The problem was that while the cloud scales to handle the load, the cost also scales based on your usage.
In this case the customer came back to an unexpected bill of over $4000 for the interface that should cost $10 per month.
The question is what we can do to handle this problem and prevent unexpected costs.
Monitoring Billing Data
The first thing you should be doing is monitoring your billing data. This will get you part of the way to the ideal solution. Both Azure and Turbo360 let you set up cost alerts for your consumption, and both have cost anomaly detection features. You will see these discussed in more detail in our recent Azure cost management whitepaper, which covers the differences between Turbo360 and Azure Cost Management.
In the below picture you can see how I am monitoring my solution or team with a budget and if the billing data breaks my budget, I will get an alert.
While breaking the budget threshold can trigger an alert, there are also anomaly detection features that track the typical spend pattern and raise an alert if the spend changes in an anomalous way, which you can see below.
Billing data is the minimum you should be monitoring to detect these potential issues, but the problem with billing data is that it is often as much as 24 hours behind your current usage. A cost spike may be happening in real time and not show up in your Azure billing data for some time, which means you will already have spent money you hadn’t planned for.
What is a Circuit Breaker
A circuit breaker pattern is normally used to stop execution of a process when it is known to be likely to fail. In this case we know that the Logic App solution will not fail, but it will be behaving in a way we had not planned for.
Ideally we would detect the unexpected processing and stop the solution so that an operator can review what’s happening. In this case that would give us an opportunity to stop the processing of all of the bad files and avoid incurring unplanned charges.
How can I implement a Circuit Breaker for this solution?
In order to implement the Circuit Breaker for this solution we would implement the following requirements:
- If the Logic App runs more than 20 times per hour then we know we are seeing unexpected behaviour
- We will look to disable the Logic App
- We will raise an alert to the support operator
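To make the three requirements above concrete, here is a minimal sketch of the rule in Python. It is only an illustration of the logic, not how Turbo360 implements it. The `RunRateBreaker` class and the `on_trip` hook are hypothetical names of our own; the hook stands in for "disable the Logic App and alert the operator".

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Callable, Deque


class RunRateBreaker:
    """Trips when more than `max_runs` runs start within a sliding `window`."""

    def __init__(self, max_runs: int, window: timedelta,
                 on_trip: Callable[[int], None]) -> None:
        self.max_runs = max_runs
        self.window = window
        self.on_trip = on_trip  # hypothetical hook: disable the workflow + alert
        self.tripped = False
        self._starts: Deque[datetime] = deque()

    def record_run(self, started_at: datetime) -> bool:
        """Record one run start; return True if the breaker is tripped."""
        if self.tripped:
            return True  # stay tripped until an operator resets it
        self._starts.append(started_at)
        # Drop run starts that have fallen out of the sliding window.
        cutoff = started_at - self.window
        while self._starts and self._starts[0] <= cutoff:
            self._starts.popleft()
        if len(self._starts) > self.max_runs:
            self.tripped = True
            self.on_trip(len(self._starts))
        return self.tripped

    def reset(self) -> None:
        """Called after an operator has reviewed the incident."""
        self.tripped = False
        self._starts.clear()


if __name__ == "__main__":
    alerts = []
    breaker = RunRateBreaker(20, timedelta(hours=1), on_trip=alerts.append)
    first = datetime(2024, 1, 1, 9, 0)
    for minute in range(25):  # 25 runs inside one hour
        breaker.record_run(first + timedelta(minutes=minute))
    print(breaker.tripped, alerts)
```

At the expected 10 files per day the window never holds more than a handful of runs, so the breaker stays closed. It only trips on a burst like the one in the scenario, and it calls `on_trip` once rather than on every subsequent run.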
We decided above that cost data is not the best basis for a circuit breaker, because cost data is not likely to be available quickly enough in Azure. Instead, we will implement the Circuit Breaker using the Business Applications module within Turbo360.
We will add the Logic App to Turbo360, and within the Business Application we will configure a monitor to check the runs started metric. If this breaks a threshold, we will execute a management operation to disable the Logic App.
In the Business Applications module, you can see the below picture where I have added my Logic App.
I have also created an automated task within the business app which is used to implement the management operation to disable the Logic App.
You can see below the settings for configuring the automated task.
You can see below that I have configured monitoring for my Run Started metric which will check for 20 or more runs.
In the above picture you can see there is an action configured for the metric which will run the disable processor task we created above.
When the run started alert fires, it triggers the automated task, which disables the Logic App.
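For readers who want to see what a "disable" management operation amounts to outside of Turbo360, the sketch below calls the Azure Resource Manager REST endpoint for disabling a workflow. This is an assumption-laden illustration, not a description of what Turbo360 does internally: it assumes a Consumption Logic App (a `Microsoft.Logic/workflows` resource), the `2016-06-01` API version, and a bearer token you have already obtained by some other means. The function names are our own.

```python
import urllib.request
from urllib.parse import quote

ARM_ENDPOINT = "https://management.azure.com"
API_VERSION = "2016-06-01"  # assumed API version for Microsoft.Logic/workflows


def build_disable_url(subscription_id: str, resource_group: str,
                      workflow_name: str) -> str:
    """Build the ARM URL for the 'disable' action on a Consumption Logic App."""
    return (
        f"{ARM_ENDPOINT}/subscriptions/{quote(subscription_id)}"
        f"/resourceGroups/{quote(resource_group)}"
        f"/providers/Microsoft.Logic/workflows/{quote(workflow_name)}"
        f"/disable?api-version={API_VERSION}"
    )


def build_disable_request(url: str, bearer_token: str) -> urllib.request.Request:
    """Create the POST request; the action takes an empty body."""
    return urllib.request.Request(
        url,
        data=b"",
        method="POST",
        headers={"Authorization": f"Bearer {bearer_token}"},
    )


def disable_logic_app(subscription_id: str, resource_group: str,
                      workflow_name: str, bearer_token: str) -> int:
    """Send the disable request and return the HTTP status code."""
    url = build_disable_url(subscription_id, resource_group, workflow_name)
    request = build_disable_request(url, bearer_token)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status
```

Disabling the workflow stops new runs from being triggered, which is what stops the cost growing while an operator investigates.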
The monitoring will also detect that the Logic App is disabled and raise a second alert to say so. The Logic App will stay in this state until someone reviews it.
If the operator simply turns the Logic App on again, it will process some files, but once the run started counter breaks the threshold again the circuit breaker will kick in and disable the Logic App.
Summary
We hope this gives you a nice way to protect yourself from unexpected costs. It would be interesting to hear in which other scenarios customers need this kind of pattern.