In today’s cloud-centric, serverless computing landscape, applications are increasingly distributed and complex, composed of numerous microservices, functions, and external dependencies. Azure Functions, a serverless compute service offered by Microsoft Azure, plays a pivotal role in building scalable, event-driven applications.
To effectively manage and troubleshoot such applications, you need a mechanism for tracking the flow of requests, transactions, and interactions within your distributed system. Azure Functions Distributed Tracing comes to the rescue, providing comprehensive visibility, monitoring, and diagnostic capabilities.
In this blog, we will look into the concepts of distributed tracing in Azure Functions, understanding its significance, implementation, and the tools at your disposal.
Note: This blog does not walk through the step-by-step procedure for configuring Application Insights with Azure Functions to achieve custom distributed tracing. Instead, it aims to help you understand which technology stack to choose and implement based on your business requirements.
The significance of Distributed Tracing in Azure Functions
Whether the solution is built with Azure Functions or Durable Functions, the need for distributed tracing can be divided into two major phases of the software development cycle.
- During Application Development (Development Team)
- Post Development Support (Operations Team)
During Application Development
If you’re using Azure Functions, you have built-in support for Application Insights. The Azure Functions host automatically tracks technical telemetry data with App Insights, which makes it easy to diagnose a function instance.
Target Users
In this scenario, Application Insights is intended for Azure developers, architects, and experts.
Target Use Cases
Once you connect Azure Functions to App Insights by enabling Diagnostics Settings or configuring the Instrumentation Key, it starts logging technical telemetry data such as traces, logs, and errors.
The telemetry data is logged into the following ten tables, depending on the telemetry type:
- availabilityResults
- browserTimings
- customEvents
- customMetrics
- dependencies
- exceptions
- pageViews
- performanceCounters
- requests
- traces
Later, the developer can build custom KQL queries or use the existing templates for standard queries, such as Failed Operations and Operations Performance, to get the desired insights.
For instance, a simple HTTP function request produces multiple traces, which serve as detailed logs of the function’s execution.
Generally, Application Insights contains complete logs of function execution (technical telemetry). This makes it well suited to developers, who can dig deep into issues during development and deliver quality applications.
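To make the link between a request and its traces more concrete, here is a minimal sketch that groups trace rows under the request that produced them. It assumes each row carries an `operation_Id` column, which Application Insights uses to correlate telemetry belonging to one operation. The sample rows and the helper name `traces_by_request` are made up for illustration, and in practice you would express the same grouping as a KQL query rather than in Python.

```python
from collections import defaultdict

# Hypothetical sample rows shaped like the `requests` and `traces` tables.
requests = [
    {"operation_Id": "op-1", "name": "HttpTrigger1", "success": True},
    {"operation_Id": "op-2", "name": "HttpTrigger1", "success": False},
]
traces = [
    {"operation_Id": "op-1", "message": "Executing 'HttpTrigger1'"},
    {"operation_Id": "op-1", "message": "Processing request"},
    {"operation_Id": "op-1", "message": "Executed 'HttpTrigger1' (Succeeded)"},
    {"operation_Id": "op-2", "message": "Executing 'HttpTrigger1'"},
    {"operation_Id": "op-2", "message": "Executed 'HttpTrigger1' (Failed)"},
]


def traces_by_request(requests, traces):
    """Return {operation_Id: [trace messages]} for every request row."""
    grouped = defaultdict(list)
    for row in traces:
        grouped[row["operation_Id"]].append(row["message"])
    # Requests without any trace rows map to an empty list.
    return {r["operation_Id"]: grouped.get(r["operation_Id"], []) for r in requests}


if __name__ == "__main__":
    for op_id, messages in traces_by_request(requests, traces).items():
        print(op_id, len(messages), "trace(s)")
```

Each request maps to the list of trace messages that share its operation ID, which is the same drill-down a developer performs when investigating a single function execution.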
Post Development Support
Developers and solution experts primarily focus on creating and delivering exceptional applications for production, and often overlook the crucial task of planning for support before deployment.
Once the solution is deployed to production, it often becomes the responsibility of the operations team to support it. The core responsibility of the operations team is to diagnose and troubleshoot issues. However, existing tools like Application Insights tend to be too complex for non-developers to extract meaningful insights from the extensive technical telemetry data. Later in this article, we will use a real-world example to discuss what the operations team typically needs to troubleshoot L1 issues and why App Insights is not an ideal solution for them.
In a nutshell, if you don’t consider a platform or solution that the operations team can use, you leave the Ops team with no means to troubleshoot problems when they arise. It is certainly important to have a traceability or Business Activity Monitoring platform when the solution runs through multiple integration services.
Target Users
The typical users who support the production application are business users and the IT support operations team.
Target Use Cases
- Business Users who want to self-service their integrations
- Business Users who want visibility into their integrations which support their key business processes
- IT Support Operators who need a non-expert friendly view of processes to manage the solution
Real-world scenario of Azure Functions – Automation Task
There are many common use cases for Azure Functions, such as event-driven processing, running background backup tasks, building serverless web APIs, and more. For this scenario, we will consider building an application that uses Azure Durable Functions to run tasks in parallel.
Imagine ACME Corp runs a software business and has a policy of retaining customer data for only 30 days, after which the data is deleted. To implement this policy, the organization uses Durable Functions to automate the process.
Starter Function: This function is started by a scheduled timer trigger at 11 PM each day.
Orchestrator Function: The orchestrator function manages the entire customer workflow. It fetches the customer data from the database in batches of 100 records. It can start sub-activities such as checking the customer data time frame, validating it, deleting the data if it has expired, and sending a confirmation email to the customer.
Activity Functions: Each sub-activity, such as checking the data time frame or sending an email, is defined as a separate activity function and scales out based on the number of customers in the batch message. These functions can run concurrently, ensuring efficient resource utilization.
Fan-out and Fan-in: In the orchestrator function, you can use the fan-out pattern to check customer data and validate the time frame concurrently. Once all activities are completed, the orchestrator can use fan-in to consolidate the results.
Error Handling: If an activity function fails, Azure Durable Functions can automatically retry it or take corrective action. You can define custom error-handling logic for each activity.
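The fan-out, fan-in, and retry ideas described above can be sketched in a few lines. Note that this is only an illustration using Python’s standard `asyncio` module as a stand-in; it does not use the Durable Functions SDK, which provides its own orchestration and retry APIs. The function names (`check_and_purge`, `with_retry`, `orchestrate`) and the sample customers are assumptions made for this example.

```python
import asyncio
from datetime import datetime, timedelta

RETENTION = timedelta(days=30)  # retention policy from the scenario


async def check_and_purge(customer, now):
    """Activity stand-in: mark the customer's data as purged if past retention."""
    expired = now - customer["created"] > RETENTION
    return {"email": customer["email"], "purged": expired}


async def with_retry(activity, *args, attempts=3):
    """Retry an activity up to `attempts` times, re-raising the last failure."""
    for attempt in range(1, attempts + 1):
        try:
            return await activity(*args)
        except Exception:
            if attempt == attempts:
                raise


async def orchestrate(batch, now):
    """Fan out one task per customer, then fan in by awaiting them all."""
    tasks = [with_retry(check_and_purge, c, now) for c in batch]    # fan-out
    results = await asyncio.gather(*tasks, return_exceptions=True)  # fan-in
    purged = [r["email"] for r in results if isinstance(r, dict) and r["purged"]]
    failed = sum(1 for r in results if isinstance(r, Exception))
    return {"purged": purged, "failed": failed}


if __name__ == "__main__":
    now = datetime(2024, 1, 31)
    batch = [
        {"email": "a@example.com", "created": datetime(2023, 12, 1)},
        {"email": "b@example.com", "created": datetime(2024, 1, 20)},
    ]
    print(asyncio.run(orchestrate(batch, now)))
    # prints {'purged': ['a@example.com'], 'failed': 0}
```

The current time is passed in as an argument rather than read inside the orchestration, which mirrors the general guidance that orchestrator code should be deterministic.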
Now consider what could be required by an operations team in a business to support a solution similar to this one.
Suppose one of the customers raises a support ticket stating that they did not receive the usual email about the data purge and the total number of records deleted. The operations team or business user handling the ticket requires the following capabilities to address it:
- Ability to query, filter, and troubleshoot what happened to an individual message or request (using the user email as the identifier) in the integration solution.
- The support team should be able to visualize and inspect incoming message payloads (Optional feature).
- There should be an option to correlate individual messages to the original batch and interpret whether all messages in a batch were processed successfully or not.
- It should allow support users to correlate all tracing events for an individual message in a graphical view with error indications.
- Receive alerts when certain failures or exceptions occur so that they can be proactive before the customer reaches out to them.
Why can’t we use Application Insights for the operations or non-technical team?
- Operations teams and business users are not technical users, and they may not know how to use App Insights and Kusto Query Language (KQL) to get the desired insights from the technical telemetry data.
- App Insights applies data sampling at the backend. Hence, if you use it in production to track business telemetry, it may not guarantee that every telemetry item is delivered.
- App Insights has a daily data cap and throttles the number of requests per second.
- The retention period of aggregated data is only 90 days.
- It is generally observed that the cost associated with App Insights is high and this could influence the overall running cost of your solution in production.
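To illustrate the sampling point from the list above, the following sketch shows how a sampling rate can silently drop the very operation a support ticket is about. It is not Application Insights’ actual sampling algorithm; the hash-based bucketing, the function names, and the 20% rate are assumptions chosen only to show the effect.

```python
import hashlib


def is_sampled_in(operation_id: str, sampling_percentage: float) -> bool:
    """Keep an operation if its hash bucket falls below the sampling percentage."""
    digest = hashlib.sha256(operation_id.encode("utf-8")).digest()
    # Map the first four bytes of the hash to a bucket in 0.00 .. 99.99.
    bucket = (int.from_bytes(digest[:4], "big") % 10000) / 100.0
    return bucket < sampling_percentage


def retained(operation_ids, sampling_percentage):
    """Return the operations that survive sampling, in their original order."""
    return [op for op in operation_ids if is_sampled_in(op, sampling_percentage)]


if __name__ == "__main__":
    # Hypothetical operation IDs, one per purge-confirmation email.
    emails_sent = [f"purge-email-{i}" for i in range(1000)]
    kept = retained(emails_sent, 20.0)
    print(f"{len(kept)} of {len(emails_sent)} operations retained")
```

With anything below 100% sampling, some operations leave no telemetry behind, so a search for one specific customer’s transaction may return nothing even though it was processed.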
What is the ideal end-to-end traceability solution for operations or business users?
Turbo360 BAM is purpose-built to achieve end-to-end traceability and act as a self-service portal for Operations and business users. You don’t need to give business users direct access to the Azure portal.
Let’s look into some of the core capabilities of Turbo360 Business Activity Monitoring:
- Lets you define the business transaction expectations and track the end-to-end data flow.
- It supports both Azure Functions and Durable Functions.
- Monitor any exceptions happening in the Durable Function integration flow.
- Pinpoint precisely where a request or message has failed.
- Trace and identify the specific instance of a failed message within a batch while maintaining parent-child correlation.
- Ensure timely message delivery to the intended destination.
- Serves as a user-friendly interface for non-developers and business users to track messages.
Let us quickly look at what the end-to-end transaction or business activity monitoring implementation would look like in Turbo360 for the scenario above.
With the .NET library available in Turbo360 BAM, you can define checkpoints in your code and log the expected message at each stage of the business transaction. You can translate each technical stage into a business-friendly view that the operations team can understand.
For instance, in the above scenario, you can define the Durable Function flow as below:
- Workflow Initiated
- Customer Data Fetched
- Customer Data Purged and Email Sent
- Workflow Completed
With these four business stages, we can track where a transaction failed or whether it executed successfully.
If any message in a batch fails, Turbo360 highlights the specific transaction with parent-child correlation, which means that the specific failed message or transaction in a batch can be easily spotted.
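The idea of stage checkpoints with parent-child correlation can be sketched as a small data model. This is purely a conceptual illustration: it does not use the Turbo360 BAM library or reflect its API, and the names `Transaction`, `failed_children`, and the short stage keys are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

# Short keys standing in for the four business stages listed above.
STAGES = ["initiated", "fetched", "purged_and_emailed", "completed"]


@dataclass
class Transaction:
    """One tracked message; parent_id links a customer message to its batch."""
    tx_id: str
    email: str = ""
    parent_id: Optional[str] = None
    checkpoints: dict = field(default_factory=dict)  # stage -> "ok" or "failed"

    def log(self, stage, status="ok"):
        """Record the outcome of a stage checkpoint."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.checkpoints[stage] = status

    def failed_stage(self):
        """Return the first stage that failed, or None if none did."""
        return next((s for s in STAGES if self.checkpoints.get(s) == "failed"), None)


def failed_children(transactions, batch_id):
    """List (email, failed stage) for every failed message in a batch."""
    return [
        (t.email, t.failed_stage())
        for t in transactions
        if t.parent_id == batch_id and t.failed_stage() is not None
    ]


if __name__ == "__main__":
    batch = Transaction("batch-1")
    batch.log("initiated")
    batch.log("fetched")
    ok = Transaction("msg-1", "a@example.com", parent_id="batch-1")
    ok.log("purged_and_emailed")
    bad = Transaction("msg-2", "b@example.com", parent_id="batch-1")
    bad.log("purged_and_emailed", "failed")
    print(failed_children([batch, ok, bad], "batch-1"))
    # prints [('b@example.com', 'purged_and_emailed')]
```

A support user investigating the missing-email ticket only needs the customer’s email and the batch identifier to see which stage failed, without reading raw telemetry.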
Conclusion
Azure Functions Distributed Tracing is a vital component of observability in modern, distributed cloud applications. It provides the visibility and insights needed to effectively monitor, diagnose, and optimize your Azure Functions. By implementing distributed tracing with tools like Azure Application Insights and OpenTelemetry, you can gain a comprehensive view of your serverless world, enabling you to deliver reliable, high-performance applications in the dynamic cloud environment. Embrace distributed tracing as a key part of your observability strategy to keep your Azure Functions running smoothly and efficiently.