
Scaling based on the number of messages in an Azure Service Bus queue


One of the most notable advantages of the Cloud is the ability to scale resources to meet demand: we scale out or up when demand increases, and we scale in or down when it decreases.

For the record, scaling out / in refers to increasing or decreasing the number of instances of a given resource, whereas scaling up / down refers to increasing or decreasing the capacity (CPU, RAM, disk, I/O performance) of a given instance.

Scaling helps improve the resiliency and the responsiveness of your application by, for example, decreasing its response latency.

Although you can scale resources manually, it’s far more effective to rely on autoscaling. To do so, you’ll need to determine what triggers the scaling operation.

(Auto)Scaling is based on metrics

Usually, we trigger scaling based on the percentage of CPU or memory in use. We define thresholds for scaling up/out when the application is under heavy load and for scaling down/in when demand returns to its original level.

That kind of process ensures that the instance won’t go down or reject requests when the demand is high and won’t cost us more than needed when the demand is low. It thus ensures a balance between performance and cost.

There’s another reason to scale, driven by external factors, especially in distributed systems. Distributed systems rely on asynchronous communication mechanisms, such as queues; in the Azure world, Azure Service Bus is commonly used for that purpose. In these situations, scaling based on the number of messages available in a given queue can complement the resource-utilization-based scaling mentioned earlier.

Why scale based on the number of messages in a queue?

In a distributed system, there’s a well-known pattern called the “Producer-Consumer Pattern”. This pattern implies that a set of components (the “producers”) will push messages to a queue for another set of components (the “consumers”) to pick up and process.

Azure Service Bus queue

There’s a catch, though: when messages are produced faster than they are consumed, messages accumulate in the queue. This accumulation can lead to very high message-processing latency if not controlled and acted on.

An illustration of the “Producer-Consumer Pattern” is a shopping site: you purchase an item on your favourite shopping site. You enter your payment information and hit the submit button. The site presents you with a nice UI telling you that your order is confirmed. In the background, the payment process (here, the “producer”) has pushed a message into a queue for the order confirmation process (here, the “consumer”) to pick it up, generate the email confirmation and send it to you.
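The shopping-site scenario above can be sketched with Python’s standard library. Here an in-memory queue stands in for Azure Service Bus, and names like `producer` and `consumer` are illustrative:

```python
import queue
import threading

# In-memory stand-in for the Azure Service Bus queue.
order_queue = queue.Queue()

def producer(order_id):
    """Payment process: pushes a confirmation request onto the queue."""
    order_queue.put({"order_id": order_id, "email": f"customer{order_id}@example.com"})

def consumer(results):
    """Order confirmation process: picks up messages and 'sends' emails."""
    while True:
        msg = order_queue.get()
        if msg is None:          # sentinel: stop the worker
            break
        results.append(f"Confirmation sent to {msg['email']}")
        order_queue.task_done()

sent_emails = []
worker = threading.Thread(target=consumer, args=(sent_emails,))
worker.start()

for order_id in range(3):
    producer(order_id)

order_queue.put(None)            # signal the consumer to stop
worker.join()
```

The key point is the decoupling: the producer returns to the user immediately after `put`, while the consumer drains the queue at its own pace.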

It’s common to get your order confirmation email a couple of minutes after you’ve submitted your order. It’s uncommon to get that email a couple of days after you’ve submitted your order, though!

It’s important to note that a consumer will usually process one message at a time and that processing a single message won’t usually impact the performance of that consumer to the point that scaling based on resource utilization is triggered. That’s why it is important to also configure scaling based on the number of messages in the queue: to ensure that you get that confirmation email within minutes, not days.

You’ve probably experienced a similar situation where you go to a store and there are too many customers waiting in line for a cashier (the waiting line can be viewed as the queue and the customers as messages in that queue). In such cases, the store manager will likely ask more cashiers to come to help so that the wait time is kept at an acceptable value.
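The cashier analogy can be made quantitative: if each consumer drains messages at a steady rate, the time to clear the backlog shrinks in proportion to the number of consumers. A rough back-of-the-envelope sketch (the rates are illustrative, not measured):

```python
def estimated_wait_seconds(backlog, consumers, msgs_per_consumer_per_second):
    """Rough time for the last message in the backlog to be processed."""
    throughput = consumers * msgs_per_consumer_per_second
    return backlog / throughput

# 600 queued messages, one consumer handling 2 msg/s -> 300 s (five minutes).
print(estimated_wait_seconds(600, 1, 2))   # 300.0
# Scaling out to five consumers cuts the wait to one minute.
print(estimated_wait_seconds(600, 5, 2))   # 60.0
```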

How to configure scaling based on the number of messages in a queue?

Let’s consider a system where we have one web application called “producer” that pushes messages to a queue in Azure Service Bus, and another web application called “consumer” that processes these messages. The Azure Service Bus instance is comprised of one queue (called “the queue”) that holds the messages sent by the producer and received by the consumer.

This system is illustrated here:

Azure Service Bus queue message

Note that the minimum SKU for your app service plan should be S1 to be able to configure autoscaling rules:

Azure Service Bus queue message

Now, let’s configure autoscaling for the consumer web app based on the number of messages in the queue.

In the “Scale-out (App Service plan)” feature under “Settings”, we select “Custom autoscale”, we provide a name for the setting, and we choose “Scale based on a metric” as the scale model. To balance performance and cost, we can also specify instance limits:

azure service bus auto scaling

We then click on “+ Add a rule” and we configure the parameters as highlighted in the figure below:

azure service bus auto scaling

Here, we’ve set the threshold to 32 messages. There’s no magic value for that. You have to analyze your usage metrics to determine what value works best for your scenario.
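The effect of such a rule can be approximated as “one instance per 32 queued messages, clamped to the configured instance limits”. A minimal sketch (the threshold mirrors the figure above; the instance limits here are illustrative):

```python
import math

def desired_instances(active_messages, threshold=32, min_instances=1, max_instances=5):
    """Target instance count: one consumer per `threshold` messages, clamped."""
    wanted = math.ceil(active_messages / threshold) if active_messages else min_instances
    return max(min_instances, min(max_instances, wanted))

print(desired_instances(0))     # 1 -> scale in to the minimum
print(desired_instances(100))   # 4 -> ceil(100 / 32)
print(desired_instances(1000))  # 5 -> capped at max_instances
```

In practice, Azure Monitor autoscale adds or removes instances incrementally according to your rules rather than jumping straight to a target, but the clamped-threshold logic is the same idea.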

Similarly, you need to define a rule for scaling in. Here’s what the two rules could look like:

Azure Service Bus messages

And voila! Autoscaling based on the number of messages in the queue is now configured and ready to be used.

Although we demonstrated how to do it from the Azure Portal, this could of course also have been configured using the Azure CLI.

What are the impacts on the application code?

To fully take advantage of scaling, your application should be resilient and possibly stateless.

Resiliency can be achieved by implementing mechanisms such as retry policies and circuit breakers into the application. There are libraries (such as Polly for .NET) that can help you infuse resiliency into your application code.
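Polly is .NET-specific, but the retry idea translates to any language. A minimal retry-with-exponential-backoff sketch in Python (the delays and exception handling are illustrative; libraries such as `tenacity` offer production-ready equivalents):

```python
import time

def with_retries(operation, max_attempts=3, base_delay=0.1):
    """Call `operation`, retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise                              # give up after the last attempt
            time.sleep(base_delay * 2 ** (attempt - 1))

# A flaky operation that fails twice, then succeeds.
calls = {"count": 0}
def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(with_retries(flaky))  # "ok", on the third attempt
```

A circuit breaker adds a complementary behaviour: after repeated failures it stops calling the operation altogether for a cooldown period, instead of retrying indefinitely.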

The stateless nature of your application ensures that you don’t experience issues when the number of instances of your application increases or decreases. Note that if your application isn’t stateless, you can still scale up and down but not out and in. One way to achieve statelessness is to not store the user context in the memory of your application server but rather rely on a state server.
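A minimal illustration of that idea, assuming a shared store (in production, something like Azure Cache for Redis or a database) holds the user context instead of instance memory:

```python
# Stand-in for an external state server (e.g. a Redis cache).
state_server = {}

class WebInstance:
    """A stateless web instance: all user context lives in the shared store."""
    def __init__(self, name):
        self.name = name

    def add_to_cart(self, session_id, item):
        state_server.setdefault(session_id, []).append(item)

    def get_cart(self, session_id):
        return state_server.get(session_id, [])

# Two instances behind a load balancer: requests for the same session
# can land on either one, and both see the same cart.
a, b = WebInstance("instance-a"), WebInstance("instance-b")
a.add_to_cart("session-42", "book")
print(b.get_cart("session-42"))  # ['book']
```

Because no instance holds session state of its own, the autoscaler can freely add or remove instances without logging users out or losing their context.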

To sum it up…

Scaling is an important feature of every Cloud solution and provides many benefits; you should seriously consider it if you aren’t already using it. Scaling improves the responsiveness and the resiliency of your solution, and it’s also one means of attaining the SLA you promised your users and customers. Autoscaling is what you should aim for, and to get there, you should first determine what will trigger the scaling operation. In this article, we saw how to scale based on the number of messages in an Azure Service Bus queue.


This article was published on Feb 4, 2022.
