The Making Of Metrics

Jacobus van Vuuren
May 27, 2020

Updated: Jul 30, 2021

A developer’s view of the journey as we prepare to launch our newest Fabrik application to the world.

Data – we all have it, and it's our job as developers to try to figure out where to put it. Furthermore, businesses and teams are all trying to extract valuable insights from this data, which is easier said than done. At immedia, we are not exempt from this rule and have spent our fair share of time wrangling our data to try to highlight key insights for the clients of our Fabrik platform.

Metrics, Fabrik's new analytics and data visualisation tool, was born from the desire to consolidate our previous efforts at surfacing our data into a single self-service portal that would not only present informative data, but surface it quickly.

As a developer who’s been instrumental in building Metrics with cost, scale and speed in mind, I’ve been taught many lessons along the way – and I’d like to invite you along for a glimpse into my journey so far.

In the Beginning

The wilderness awaits

When evaluating the data in any system you will generally find an assortment of formats: structured, unstructured, text, tabular, binary, some API endpoint written by a person that left three years ago (and also didn’t care that you would be using it today) – a primordial soup, if you will.

This wilderness that is set before you can seem daunting but, much like mowing your lawn after a year, the reward is well worth it when you smell the freshly-mowed grass or enjoy a languorously luxurious picnic between neatly arranged flower beds and carefully planted rows of trees.

Metrics started with these questions: Where do we need to trim our lawn and neaten up the hedge? What flowers do we need to get; are roses the best choice for our climate and budget? Where will we place our short flower beds and our tall trees? And most importantly – who will be coming to the picnic?

Or more simply, data is used to answer the 5 W's: who, what, why, when and where?

For Metrics, whilst planning our garden, we came up with the following key questions that we wanted to answer in the initial release candidate:

(who is)/(when are they)/(what are they using when)/(where are they when) listening to the live stream of the client?
(who is)/(when are they)/(what are they using when)/(where are they when) using the client’s mobile application?
when is the client’s application downloaded?

As for the ‘why’ – while the subjectivity of that question and the need to verify the various possible answers can halt a project before it even begins, sometimes the answers quickly reveal themselves. For instance, what we’re currently observing with Metrics is that our clients’ live-streaming and engagement have seen an upward trend over the last few weeks, which can most likely be attributed to a population currently in lockdown during the COVID-19 pandemic.

Find Your Source

The source is within you – or, at least, somewhere

Once our questions were established, it was time for us to identify from where our answers would come. As alluded to in the previous section: in any system, there will be multiple sources of data available and the selection of the source is a process. It may require some trial and error before you find a source that appropriately answers your question. We’ll skip over the boring stuff here and outline the sources that we eventually identified:

Streaming

Audio streams are served from HAProxy, which provides us with configurable log output options. We used these to make the logs contain what we need to answer our questions. We’ll get to how we parsed this information in a later section of this post.

App Engagement

How people use the services and engage with content is tracked via Matomo. Matomo provides a powerful API for retrieving the tracked data. What’s more, it provides our members with total privacy.

Application Downloads

App download numbers are retrieved via the Apple App Store Connect API and Google Cloud Storage API. Both provide us with files in CSV format. We’ll talk about how we used these in a later section.

Laying the Groundwork

Foundations are important

Now that we knew what we were solving for, and from where we would be retrieving our answers, we needed to decide on which approach we would take for processing, storing and displaying the data. We vetted some options and finally decided to use Azure Databricks as our data processor with Scala as our data processing language. Azure Databricks provides us with an Apache Spark cluster that we can scale on demand to meet workloads. It’s also fast. Very. Very. Fast.

For storage of our processed data, we identified Azure Cosmos DB and Azure Storage; Azure Cosmos DB for its ease of storage and retrieval of data (with a familiar SQL-like syntax) and Azure Storage for cost effective storage of files.

The data in Matomo was already being stored in a MySQL database, which we didn’t need to query directly because Matomo's API already provided us with all the data we needed.

We would have a .NET Core API serving as the gateway between users and the stored data and an Angular application that would serve as the frontend.

With the outline of our garden in place we felt confident that we would be able to tame the wilderness set before us and we were ready to get started planting and arranging our flower beds.

The People Behind the Data

A key part of building a tool that provides insights into how humans use a product is being respectful of the humans themselves.

Before we jump into all the technical details, it is important to note that at immedia we hold the privacy and data rights of the people that use our platform in high regard. This means that we are always thinking about what needs to be done to ensure that data is properly anonymised before surfacing it to the people who use the platform.

Metrics fully anonymises the data before it gets surfaced. Anything that can be used to identify a user is removed. For instance, when processing our streaming data we perform a one-way hash of the IP address of the request before any other data processing takes place.

Furthermore, Matomo, our analytics engine, has user privacy baked into its design and also discards identifiable information as soon as it can.
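
As a minimal sketch of the one-way hashing step (not the actual Metrics code), the snippet below uses a keyed SHA-256 hash so that the same IP address always maps to the same token while the address itself is never stored. The function name `anonymise_ip`, the choice of HMAC-SHA-256 and the secret key are all assumptions made for illustration.

```python
import hashlib
import hmac

# Hypothetical secret; in a real system this would come from a secret store.
SECRET_KEY = b"replace-with-a-real-secret"


def anonymise_ip(ip_address: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed, one-way hash of an IP address as a hex string."""
    return hmac.new(key, ip_address.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    # 203.0.113.7 is a documentation-only address used as sample input.
    print(anonymise_ip("203.0.113.7"))
```

Because the hash is deterministic, unique-listener counts can still be computed on the tokens. A keyed hash is used here because the IPv4 address space is small enough that a plain, unkeyed hash could be reversed by brute force.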

Streaming

From logs to lines

Our pipeline for importing streaming data works roughly as follows:

Every hour a log file is rotated on HAProxy and uploaded to Azure Storage via the post-rotate hook.
We read these log files into our Azure Databricks environment via a streaming query.
The log files are processed, and relevant information is extracted and inserted into Delta Lake tables. Identifiable information, such as IP addresses, is dropped before we write to Delta Lake storage.
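
The extraction step in this pipeline could look roughly like the sketch below. The real HAProxy log format is configurable and is not shown in this post, so the line layout, the `StreamSession` type and the `parse_line` function are all hypothetical. The production pipeline runs in Scala on Databricks; plain Python is used here only to illustrate the idea.

```python
import hashlib
import re
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical layout:
# "<ISO timestamp> <client IP> <tenant> <bytes sent> <duration in seconds>"
LINE_PATTERN = re.compile(
    r"^(?P<ts>\S+) (?P<ip>\S+) (?P<tenant>\S+) (?P<bytes>\d+) (?P<duration>\d+)$"
)


@dataclass(frozen=True)
class StreamSession:
    ended_at: datetime
    listener_id: str  # hashed token, never the raw IP address
    tenant: str
    bytes_sent: int
    duration_seconds: int


def parse_line(line: str) -> Optional[StreamSession]:
    """Parse one log line, returning None for lines that do not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    try:
        ended_at = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return None
    return StreamSession(
        ended_at=ended_at.replace(tzinfo=timezone.utc),
        # Unkeyed hash for brevity; a keyed hash would be harder to reverse.
        listener_id=hashlib.sha256(match["ip"].encode("utf-8")).hexdigest(),
        tenant=match["tenant"],
        bytes_sent=int(match["bytes"]),
        duration_seconds=int(match["duration"]),
    )
```

Lines that do not match the expected layout are skipped rather than raising, which keeps one malformed line from failing a whole hourly file.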

We have another pipeline that will run and create rollups of our data for use with frontend applications, which roughly works as follows:

It calculates the peak number of listeners for all the newly created sessions per minute and saves the result to Delta Lake storage.
It then creates rollups for all our specified periods and stores them in JSON format in Azure Storage – we’ll look at some examples of this soon.
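
To make the per-minute peak calculation concrete, here is a small sketch under the assumption that each session is a `(start, end)` pair of timestamps. The function names are hypothetical, and the real pipeline does this with Spark over Delta Lake tables rather than in plain Python.

```python
from collections import Counter
from datetime import datetime, timedelta


def listeners_per_minute(sessions):
    """Count, for every minute, how many sessions were active during it."""
    counts = Counter()
    for start, end in sessions:
        # Truncate the start to the beginning of its minute.
        minute = start.replace(second=0, microsecond=0)
        while minute <= end:
            counts[minute] += 1
            minute += timedelta(minutes=1)
    return counts


def peak_listeners(sessions):
    """Return the highest per-minute listener count, or 0 with no sessions."""
    return max(listeners_per_minute(sessions).values(), default=0)


if __name__ == "__main__":
    sample = [
        (datetime(2020, 5, 1, 10, 0, 30), datetime(2020, 5, 1, 10, 2, 10)),
        (datetime(2020, 5, 1, 10, 1, 0), datetime(2020, 5, 1, 10, 1, 30)),
    ]
    print(peak_listeners(sample))
```

A session counts towards every minute it touches, so a listener who connects at 10:00:30 and leaves at 10:02:10 is counted in three one-minute slots.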

Lastly, we have a pipeline that will:

calculate and store peak, total and unique listeners for different periods of time, and
write the entries to Azure Cosmos DB.

Exploring Streaming Results

Summary Data

Calculation of the summary data is merely done as an aggregate count or sum over the period of the rollup. For instance, 'Total Sessions' is calculated as a count and 'Total Days' is the sum of all streaming session lengths.
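
As a tiny worked example, assuming session lengths are recorded in seconds and that 'Total Days' means cumulative listening time expressed in days (both assumptions, as is the `summarise` name):

```python
SECONDS_PER_DAY = 86_400


def summarise(session_lengths_seconds):
    """Return the session count and the summed listening time in days."""
    return {
        "total_sessions": len(session_lengths_seconds),
        "total_days": sum(session_lengths_seconds) / SECONDS_PER_DAY,
    }


if __name__ == "__main__":
    # Two 12-hour sessions and one 24-hour session add up to two days.
    print(summarise([43_200, 43_200, 86_400]))
```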

Streaming Numbers

Streaming numbers are calculated as an aggregate over the size of the granularity specified. In the graph above, 'Total Sessions' is the count of all listens per day, 'Total Unique Listeners' is the number of people who listened per day, and 'Peak Concurrent' is the maximum number of listeners for that particular day. These are stored in Cosmos DB which allows us to search and display arbitrary ranges.
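
The difference between total sessions and unique listeners can be sketched as follows, assuming each session is reduced to a `(day, listener_id)` pair where `listener_id` is an anonymised token. The function and field names are hypothetical.

```python
from collections import defaultdict
from datetime import date


def daily_streaming_numbers(sessions):
    """Aggregate (day, listener_id) pairs into per-day totals and uniques."""
    totals = defaultdict(int)
    listeners = defaultdict(set)
    for day, listener_id in sessions:
        totals[day] += 1                 # every session counts
        listeners[day].add(listener_id)  # each listener counts once per day
    return {
        day: {
            "total_sessions": totals[day],
            "unique_listeners": len(listeners[day]),
        }
        for day in sorted(totals)
    }


if __name__ == "__main__":
    sample = [
        (date(2020, 5, 1), "a"),
        (date(2020, 5, 1), "a"),
        (date(2020, 5, 1), "b"),
    ]
    print(daily_streaming_numbers(sample))
```

A listener who reconnects several times in a day raises the session total but only counts once towards the unique figure.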

Streaming Numbers by Hour

A view of how frequently the live stream is accessed per hour

Streaming Numbers By Hour are calculated as an aggregate over the hour of day for sessions. This graph depicts the sum of all hours, the count of sessions streamed by listeners, the number of people who listened per hour, the count of sessions that were started, and the count of sessions completed per hour.
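
A simplified sketch of an hour-of-day aggregate is shown below. It assumes sessions are `(start, end, listener_id)` tuples and, as a simplification, attributes each listener only to the hour in which their session started. The names are hypothetical and the sum of hours listened is left out for brevity.

```python
from collections import defaultdict
from datetime import datetime


def by_hour_of_day(sessions):
    """Count sessions started and completed, and unique listeners, per hour."""
    rollup = defaultdict(
        lambda: {"started": 0, "completed": 0, "listeners": set()}
    )
    for start, end, listener_id in sessions:
        rollup[start.hour]["started"] += 1
        rollup[start.hour]["listeners"].add(listener_id)
        rollup[end.hour]["completed"] += 1
    return {
        hour: {
            "started": rollup[hour]["started"],
            "completed": rollup[hour]["completed"],
            "unique_listeners": len(rollup[hour]["listeners"]),
        }
        for hour in sorted(rollup)
    }


if __name__ == "__main__":
    sample = [
        (datetime(2020, 5, 1, 9, 50), datetime(2020, 5, 1, 10, 20), "a"),
        (datetime(2020, 5, 1, 10, 5), datetime(2020, 5, 1, 10, 30), "b"),
    ]
    print(by_hour_of_day(sample))
```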

Streaming Session Length Breakdown

Understanding how long listeners listen to the live stream in Metrics

Session Length Breakdowns are calculated by using Spark ML's Bucketizer class from Scala. We count the number of sessions for every duration bucket, ranging from 1 minute all the way to 18 hours. The frontend displays this rollup as a pie chart.
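
The bucketing idea itself can be sketched in plain Python as follows. This is a stand-in for illustration, not the Spark code or the rollup data: the bucket boundaries in `SPLITS` are hypothetical, and the helper mimics the way Spark's Bucketizer treats each bucket as half-open, with the last bucket also including its upper bound.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bucket boundaries in minutes; 1080 minutes is 18 hours.
SPLITS = [0, 1, 5, 15, 60, 180, 1080]


def bucket_index(minutes, splits=SPLITS):
    """Return the index of the bucket [splits[i], splits[i + 1]) for a value."""
    if not splits[0] <= minutes <= splits[-1]:
        raise ValueError(f"{minutes} is outside the configured splits")
    # The final bucket is closed on the right, so clamp the last index.
    return min(bisect_right(splits, minutes) - 1, len(splits) - 2)


def session_length_breakdown(lengths_minutes, splits=SPLITS):
    """Count sessions per bucket, including empty buckets."""
    counts = Counter(bucket_index(m, splits) for m in lengths_minutes)
    return {
        f"{splits[i]}-{splits[i + 1]} min": counts.get(i, 0)
        for i in range(len(splits) - 1)
    }


if __name__ == "__main__":
    print(session_length_breakdown([0.5, 3, 3, 45, 1080]))
```

Empty buckets are kept in the result so that a chart built from it always has the same set of slices.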

Summaries By Dimension

Summaries By Dimension are calculated as a group by aggregate over session data. For instance, the 'Total Sessions' section lists the count of sessions grouped by each source, displayed from highest to lowest.
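
A group-by aggregate of this kind can be sketched in a few lines. The session dictionaries and the `source` values below are made-up sample data, and `sessions_by_dimension` is a hypothetical helper name.

```python
from collections import Counter


def sessions_by_dimension(sessions, dimension):
    """Count sessions per value of a dimension, highest count first."""
    counts = Counter(session[dimension] for session in sessions)
    return counts.most_common()


if __name__ == "__main__":
    sample = [
        {"source": "mobile app"},
        {"source": "web player"},
        {"source": "mobile app"},
    ]
    print(sessions_by_dimension(sample, "source"))
```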

App Engagement

While statistics that describe how people use the Android and iOS apps are an exciting part of our data for our clients, they were much less exciting in terms of the data transformation work to be done. In essence, we query the Matomo API and display the data on the frontends. Luckily for us, Matomo did the heavy lifting in this regard and our biggest concern was displaying the data.

App Downloads

Our pipeline for application downloads works roughly as follows:

An Azure Function App retrieves the CSV files from the Google and Apple APIs respectively.
The function app does some slight preprocessing on the files and then stores them in Azure Storage.
These CSV files are read into our Azure Databricks environment via a streaming query.
Databricks does some processing on the data and writes the results to Cosmos DB.

The result of this pipeline is that we can query App Downloads for any arbitrary period.
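
As an illustration of querying an arbitrary period, the sketch below sums downloads per day from a CSV and then totals a date range. The `date` and `downloads` column names are assumptions: the real Apple and Google report files have their own layouts, which is why the pipeline preprocesses them first.

```python
import csv
import io
from collections import defaultdict
from datetime import date


def load_daily_downloads(csv_text):
    """Sum the downloads column per day from preprocessed CSV text."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[date.fromisoformat(row["date"])] += int(row["downloads"])
    return dict(totals)


def downloads_between(daily, start, end):
    """Total downloads for the inclusive date range [start, end]."""
    return sum(count for day, count in daily.items() if start <= day <= end)


if __name__ == "__main__":
    sample = "date,downloads\n2020-05-01,10\n2020-05-01,5\n2020-05-02,7\n"
    daily = load_daily_downloads(sample)
    print(downloads_between(daily, date(2020, 5, 1), date(2020, 5, 2)))
```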

Summary of app downloads reported in Metrics

The summary data is calculated by running an aggregate count query against our Cosmos DB container. The charts are rendered by querying Cosmos DB and displaying the entry of each day.

Live Data

Initially, one of our goals was to surface data and surface it with speed. Up until this point we have only discussed static rolled-up data and the exploration thereof. Whilst this is useful for doing some rudimentary analysis after the fact, these stats are not able to tell our clients what is happening right now. In other words: we haven’t checked speed off our list.

To surface live data, we had to do some out-of-the-box thinking. Processing log lines in real time wasn’t feasible as we only rotate the log every hour (unless of course you deem an hour ago as “live”) and we couldn’t really speed up the rotation.

Live Stream Listeners

Real-time reporting of live streaming in Metrics

There are two distinct types of stream listeners, HLS and Icecast, and each required its own approach to surface the live listener data. For HLS we wrote a .NET Core application that we deployed to our HAProxy server. This application checks the HAProxy stick table for the listener count of the tenant. We initially tried the HAProxy Stream Processing Offload Engine but this went bad – very bad – as it could not handle the number of requests our servers were handling. In the end we got it working along the following lines:

Our .NET Core application runs a command to check listener counts on the HAProxy Stick Table.
It sends the listener count over Azure Event Hubs.
An Azure Function picks up the event hub message and stores it in Redis (we keep about 2 hours of this data in Redis).
We query Redis to show live stream data.
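
The idea of keeping only about two hours of samples can be sketched with an in-memory stand-in for Redis. The `LiveListenerWindow` class is hypothetical and only illustrates the retention logic. In Redis itself, one possible approach would be a sorted set scored by timestamp and trimmed on each write, though the post does not say how the data is actually stored.

```python
from collections import deque


class LiveListenerWindow:
    """In-memory stand-in for a store that retains recent listener counts."""

    def __init__(self, retention_seconds=2 * 60 * 60):
        self.retention_seconds = retention_seconds
        self._samples = deque()  # (timestamp, listener_count), oldest first

    def add(self, timestamp, listener_count):
        """Record a sample and drop anything older than the retention window."""
        self._samples.append((timestamp, listener_count))
        cutoff = timestamp - self.retention_seconds
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def latest(self):
        """Return the most recent listener count, or None if empty."""
        return self._samples[-1][1] if self._samples else None

    def series(self):
        """Return all retained samples, oldest first."""
        return list(self._samples)


if __name__ == "__main__":
    window = LiveListenerWindow()
    window.add(0, 10)
    window.add(7_201, 12)  # the sample from t=0 is now older than two hours
    print(window.series())
```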

Retrieving the live listener count for Icecast was slightly simpler, as it exposes administration endpoints that return XML data. The process is roughly as follows:

An Azure Function imports the listener count from Icecast.
We store the listener count in Redis.
We query Redis to show the Icecast stream data.
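
Extracting listener counts from the XML could look roughly like this. The sample document is illustrative and trimmed to the fields used here; it follows the general shape of Icecast's stats output (an `icestats` root with one `source` element per mount point), but treat the exact structure as an assumption and check it against your own server's response.

```python
import xml.etree.ElementTree as ET

# Illustrative sample, trimmed to the fields this sketch reads.
SAMPLE_STATS = """<icestats>
  <source mount="/live">
    <listeners>42</listeners>
  </source>
  <source mount="/backup">
    <listeners>3</listeners>
  </source>
</icestats>"""


def listeners_by_mount(stats_xml):
    """Return a mapping of mount point to current listener count."""
    root = ET.fromstring(stats_xml)
    return {
        source.get("mount"): int(source.findtext("listeners", default="0"))
        for source in root.iter("source")
    }


if __name__ == "__main__":
    print(listeners_by_mount(SAMPLE_STATS))
```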

Live App Visitors & Engagement

How App Visitors are reported in Metrics

Retrieving the live app engagement numbers follows a very similar pattern to the Icecast listener count imports. We rely on Matomo’s reporting to achieve this. The process is roughly as follows:

An Azure Function imports the current live visitor count and the actions taken in the last minute.
These values are stored in Redis.
We query Redis to show the App Visitor data.
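
One way to fetch these numbers is Matomo's `Live.getCounters` reporting method, which takes a `lastMinutes` parameter. Whether Metrics uses this exact method is an assumption, as are the host name, helper names and sample response below. The sketch only builds the request URL and parses a response body, leaving the HTTP call out.

```python
import json
from urllib.parse import urlencode


def live_counters_url(base_url, site_id, token, last_minutes=1):
    """Build a Matomo Reporting API URL for the Live.getCounters method."""
    query = urlencode({
        "module": "API",
        "method": "Live.getCounters",
        "idSite": site_id,
        "lastMinutes": last_minutes,
        "format": "JSON",
        "token_auth": token,
    })
    return f"{base_url.rstrip('/')}/index.php?{query}"


def parse_live_counters(response_body):
    """Extract visitor and action counts from a Live.getCounters response."""
    rows = json.loads(response_body)
    first = rows[0] if rows else {}
    # Counts may arrive as strings, so convert them explicitly.
    return {
        "visitors": int(first.get("visitors", 0)),
        "actions": int(first.get("actions", 0)),
    }


if __name__ == "__main__":
    # Hypothetical host and token, for illustration only.
    print(live_counters_url("https://matomo.example.org", 1, "TOKEN"))
    sample = '[{"visits": "3", "actions": "7", "visitsConverted": "0", "visitors": "2"}]'
    print(parse_live_counters(sample))
```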

Events

Prior to our Metrics application, Fabrik already had an Azure Event Hub to which it would report events. The reported events are retrieved via an Event Hub Listener. The process is roughly as follows:

An Azure Function listens for events on the Event Hub.
It stores the events in Redis.
We query Redis to show the event data.

The example above is from a relatively quiet hour – it can get crazy at times, as seen in the screenshot below.

A busy messaging period captured by Metrics

The Road Ahead

Growing our garden

All things considered, the creation of Metrics has been quite a journey for us and we have learned a lot about what it takes to build a data pipeline that is cost effective and scalable. We aren’t planning on letting the weeds grow in our data garden over the next few months either – our clients have multiple new features to look forward to that are sure to teach us new lessons and provide greater insights into our data.

Our mission to figure out what data we have and where we want to share it is a lot closer to being complete, but it will never be entirely finished. We hope to keep delivering valuable new insights to our clients and to help you answer some of your operational questions going forward.

This journey is not yet over and we’re excited and ready to take on the future of Metrics.