• jacobus06

The Making Of Metrics

A developer’s view of the journey as we prepare to launch our newest Fabrik application to the world.


Data - we all have it, and it's our job as developers to figure out where to put it. Furthermore, businesses and teams are all trying to extract valuable insights from this data, which is easier said than done. At immedia, we are no exception, and have spent our fair share of time wrangling our data to highlight key insights for the clients of our Fabrik platform.


Metrics, Fabrik's new analytics and data visualisation tool, was born from the desire to consolidate our previous efforts at surfacing our data into a single self-service portal that would not only present informative data, but surface it quickly.


Fabrik Metrics dashboard

As a developer who’s been instrumental in building Metrics with cost, scale and speed in mind, I’ve been taught many lessons along the way and I’d like to invite you along for a glimpse into my journey so far.



In the Beginning

The wilderness awaits

When evaluating the data in any system you will generally find an assortment of formats: structured, unstructured, text, tabular, binary, some API endpoint written by a person who left three years ago (and also didn’t care that you would be using it today) – a primordial soup, if you will.


This wilderness that is set before you can seem daunting but, much like mowing your lawn after a year, the reward is well worth it when you smell the freshly-mowed grass or enjoy a languorously luxurious picnic between neatly arranged flower beds and carefully planted rows of trees.


Metrics started with these questions: Where do we need to trim our lawn and neaten up the hedge? What flowers do we need to get; are roses the best choice for our climate and budget? Where will we place our short flower beds and our tall trees? And most importantly – who will be coming to the picnic?


Or more simply, data is used to answer the 5 W's: who, what, why, when and where?


For Metrics, whilst planning our garden, we came up with the following key questions that we wanted to answer in the initial release candidate:

  • (who is)/(when are they)/(what are they using when)/(where are they when) listening to the live stream of the client?

  • (who is)/(when are they)/(what are they using when)/(where are they when) using the client’s mobile application?

  • when is the client’s application downloaded?

As for the ‘why’ – while the subjectivity of that question and the need to verify the various possible answers can halt a project before it even begins, sometimes the answers quickly reveal themselves. For instance, what we’re currently observing with Metrics is that our clients’ live-streaming and engagement have trended upward over the last few weeks, which can most likely be attributed to a population in lockdown during the COVID-19 pandemic.



Find Your Source

The source is within you – or, at least, somewhere

Once our questions were established, it was time for us to identify where our answers would come from. As alluded to in the previous section, any system has multiple sources of data available, and selecting a source is a process. It may require some trial and error before you find a source that appropriately answers your question. We’ll skip over the boring stuff here and outline the sources that we eventually identified:


Streaming

Audio streams are served from HAProxy, which provides us with configurable log output options. We used these to make the logs output what we need to answer our questions. We’ll get to how we parsed this information in a later section of this post.


App Engagement

How people use the services and engage with content is tracked via Matomo. Matomo provides a powerful API for retrieving the tracked data. What’s more, it provides our members with total privacy.


Application Downloads

App download numbers are retrieved via the Apple App Store Connect API and Google Cloud Storage API. Both provide us with files in CSV format. We’ll talk about how we used these in a later section.



Laying the Groundwork

Foundations are important

Now that we knew what we were solving for, and from where we would be retrieving our answers, we needed to decide on which approach we would take for processing, storing and displaying the data. We vetted some options and finally decided to use Azure Databricks as our data processor with Scala as our data processing language. Azure Databricks provides us with an Apache Spark cluster that we can scale on demand to meet workloads. It’s also fast. Very. Very. Fast.


For storage of our processed data, we identified Azure Cosmos DB and Azure Storage: Azure Cosmos DB for its ease of storage and retrieval of data (with a familiar SQL-like syntax) and Azure Storage for cost-effective storage of files.


The data in Matomo was already being stored in a MySQL database, which we don’t need to query directly because Matomo's API already provides us with all the data we need.

We would have a .NET Core API serving as the gateway between users and the stored data and an Angular application that would serve as the frontend.


With the outline of our garden in place we felt confident that we would be able to tame the wilderness set before us and we were ready to get started planting and arranging our flower beds.



The People Behind the Data

A key part of building a tool that provides insights into how humans are using a product is being respectful of the humans themselves.


Before we jump in to all the technical details, it is important to note that at immedia we hold the privacy and data rights of the people that use our platform in high regard. This means that we are always thinking about what needs to be done to ensure that data is properly anonymised before surfacing it to the people who use the platform.


Metrics fully anonymises the data before it gets surfaced. Anything that can be used to identify a user is removed. For instance, when processing our streaming data we perform a one-way hash of the IP address of the request before any of our data processing is performed.
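As a rough illustration of one-way hashing, here is a minimal Python sketch. The post does not say which hash function Metrics uses or whether it is keyed, so the use of HMAC-SHA256, the `SECRET_KEY` value and the `anonymise_ip` name are all assumptions made for this example.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real key would come from secure configuration.
SECRET_KEY = b"replace-with-a-secret-key"


def anonymise_ip(ip_address: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed one-way hash (HMAC-SHA256) of an IP address as a hex string."""
    return hmac.new(key, ip_address.encode("utf-8"), hashlib.sha256).hexdigest()


# The same address always maps to the same token, so sessions can still be
# grouped per listener without keeping the address itself.
token = anonymise_ip("203.0.113.7")
```

Keying the hash is a design choice in this sketch: the IPv4 address space is small enough that an unkeyed hash could be reversed by hashing every possible address.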


Furthermore, Matomo, our analytics engine, has user privacy baked into its design and also discards identifiable information as soon as it can.



Streaming

From logs to lines

Our pipeline for importing streaming data works roughly as follows:

  1. Every hour a log file is rotated on HAProxy and uploaded to Azure Storage via the post-rotate hook.

  2. We read these log files into our Azure Databricks environment via a streaming query.

  3. The log files are processed, and relevant information is extracted and inserted into Delta Lake tables. Identifiable information, such as IP addresses, is dropped before we write to Delta Lake storage.
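The extraction in step 3 could look something like the sketch below. HAProxy's log format is configurable and the post does not show the one in use, so the space-separated layout here (start time, client IP, stream name, duration in seconds) is entirely hypothetical, as are the field names. The real pipeline runs on Spark; this is plain Python for readability.

```python
import hashlib
from datetime import datetime, timezone


def parse_line(line: str) -> dict:
    """Parse one line of a hypothetical log format into a session record.

    Assumed layout: <start time> <client ip> <stream name> <duration seconds>
    """
    started, ip, stream, duration = line.split()
    return {
        "started": datetime.strptime(started, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc),
        # Only a hash of the address is kept; the raw IP never reaches the record.
        "listener": hashlib.sha256(ip.encode("utf-8")).hexdigest(),
        "stream": stream,
        "duration_seconds": int(duration),
    }


record = parse_line("2020-04-20T08:15:03Z 203.0.113.7 station-a 1800")
```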


We have another pipeline that will run and create rollups of our data for use with frontend applications, which roughly works as follows:

  1. It calculates the peak number of listeners for all the newly created sessions per minute and saves the result to Delta Lake storage.

  2. It then creates rollups of all our specified periods and stores them in JSON format in Azure Storage – we’ll look at some examples of this soon.
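To make the per-minute peak calculation concrete, here is a small Python sketch. It assumes each session is reduced to a pair of integer minute offsets (start inclusive, end exclusive); that representation and the function name are assumptions for illustration, not the Spark job described above.

```python
from collections import Counter


def listeners_per_minute(sessions):
    """Count active sessions for every minute.

    sessions: iterable of (start_minute, end_minute) pairs, end exclusive.
    """
    active = Counter()
    for start, end in sessions:
        for minute in range(start, end):
            active[minute] += 1
    return active


sessions = [(0, 3), (1, 4), (2, 3)]
per_minute = listeners_per_minute(sessions)
peak = max(per_minute.values())  # highest number of simultaneous listeners
```

Looping over every minute of every session is fine for a sketch; at scale a sweep over sorted start and end events would avoid the inner loop.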

Lastly, we have a pipeline that will:

  1. calculate and store peak, total and unique listeners for different periods of time, and

  2. write the entries to Azure Cosmos DB.



Exploring Streaming Results


Summary Data

Fabrik Streaming Summary

The summary data is calculated as a simple aggregate count or sum over the period of the rollup. For instance, 'Total Sessions' is calculated as a count and 'Total Days' is the sum of all streaming session lengths.



Streaming Numbers

Fabrik Streaming Numbers

Streaming numbers are calculated as an aggregate over the size of the granularity specified. In the graph above, 'Total Sessions' is the count of all listens per day, 'Total Unique Listeners' is the number of people who listened per day, and 'Peak Concurrent' is the maximum number of listeners for that particular day. These are stored in Cosmos DB, which allows us to search and display arbitrary ranges.
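A minimal Python sketch of these three daily figures follows. It assumes sessions arrive as (day, listener) pairs and that per-minute listener counts are already available as (day, count) pairs; both shapes and all names are hypothetical.

```python
from collections import defaultdict


def daily_numbers(sessions, minute_counts):
    """Compute total sessions, unique listeners and peak concurrency per day."""
    days = defaultdict(lambda: {"total": 0, "listeners": set(), "peak": 0})
    for day, listener in sessions:
        days[day]["total"] += 1
        days[day]["listeners"].add(listener)
    for day, count in minute_counts:
        days[day]["peak"] = max(days[day]["peak"], count)
    return {
        day: {
            "total_sessions": d["total"],
            "unique_listeners": len(d["listeners"]),
            "peak_concurrent": d["peak"],
        }
        for day, d in days.items()
    }


sessions = [("2020-04-20", "a"), ("2020-04-20", "a"), ("2020-04-20", "b"), ("2020-04-21", "a")]
minute_counts = [("2020-04-20", 2), ("2020-04-20", 3), ("2020-04-21", 1)]
numbers = daily_numbers(sessions, minute_counts)
```

Note that unique listeners cannot simply be summed across days, because the same listener may appear on several of them.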



Streaming Numbers by Hour

A view of how frequently the live stream is accessed per hour

Streaming Numbers By Hour are calculated as an aggregate over the hour of day for sessions. This graph depicts the sum of all hours, the count of sessions streamed by listeners, the number of people who listened per hour, the count of sessions that were started, and the count of sessions completed per hour.
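Two of these series, sessions started and sessions completed per hour of day, can be sketched in Python as below. The (start, end) datetime pairs are an assumed input shape, and the function name is made up for this example.

```python
from collections import Counter
from datetime import datetime


def sessions_by_hour(sessions):
    """Count sessions started and completed for each hour of the day (0-23)."""
    started = Counter(start.hour for start, _ in sessions)
    completed = Counter(end.hour for _, end in sessions)
    return {
        hour: {"started": started[hour], "completed": completed[hour]}
        for hour in range(24)
    }


sessions = [
    (datetime(2020, 4, 20, 8, 15), datetime(2020, 4, 20, 9, 5)),
    (datetime(2020, 4, 20, 8, 40), datetime(2020, 4, 20, 8, 55)),
]
by_hour = sessions_by_hour(sessions)
```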



Streaming Session Length Breakdown

Understanding how long listeners listen to the live stream in Metrics

Session Length Breakdowns are calculated using Spark ML's Bucketizer class from our Scala code. We count the number of sessions in each duration bucket, ranging from 1 minute all the way to 18 hours. The frontend displays this as a pie chart built from the raw rollup data.
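The bucketing idea can be illustrated with a short Python sketch. This is not the rollup output from Metrics and not Spark code; it mimics Bucketizer's documented rule that a bucket covers [x, y) and that the last bucket also includes its upper bound. The split points below are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bucket boundaries in minutes (18 hours = 1080 minutes).
SPLITS = [0, 1, 5, 15, 60, 180, 1080]


def bucket_index(duration_minutes, splits=SPLITS):
    """Return the bucket a duration falls into.

    Bucket i covers [splits[i], splits[i + 1]); the last bucket also
    includes its upper bound.
    """
    if not splits[0] <= duration_minutes <= splits[-1]:
        raise ValueError(f"duration {duration_minutes} is outside the splits")
    return min(bisect_right(splits, duration_minutes) - 1, len(splits) - 2)


durations = [0.5, 3, 4, 45, 1080]
breakdown = Counter(bucket_index(d) for d in durations)
```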



Summaries By Dimension

Traffic sources reported in Metrics

Summaries By Dimension are calculated as a group-by aggregate over session data. For instance, the 'Total Sessions' section lists the count of sessions grouped by each source, displayed from highest to lowest.
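In plain Python, a group-by count ordered from highest to lowest might look like this. The session dictionaries and the `source` values are made-up sample data.

```python
from collections import Counter


def sessions_by_dimension(sessions, dimension):
    """Count sessions per value of a dimension, ordered from highest to lowest."""
    counts = Counter(session[dimension] for session in sessions)
    return counts.most_common()


sessions = [{"source": "app"}, {"source": "web"}, {"source": "app"}]
summary = sessions_by_dimension(sessions, "source")
```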




App Engagement

While statistics that describe how people use the Android and iOS apps are an exciting part of our data for our clients, they were much less exciting in terms of the data transformation work to be done. In essence, we query the Matomo API and display the data on the frontends. Luckily for us, Matomo did the heavy lifting in this regard and our biggest concern was displaying the data.
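For a sense of what querying Matomo involves, the sketch below builds a Reporting API request URL in Python without sending it. The host name, site ID, token and the choice of the `VisitsSummary.get` report are placeholders; the post does not say which reports Metrics requests.

```python
from urllib.parse import urlencode


def matomo_report_url(base_url, site_id, method, period, date, token):
    """Build a Matomo Reporting API URL that returns JSON."""
    params = {
        "module": "API",
        "method": method,
        "idSite": site_id,
        "period": period,
        "date": date,
        "format": "JSON",
        "token_auth": token,
    }
    return f"{base_url.rstrip('/')}/index.php?{urlencode(params)}"


# Placeholder host and token for illustration only.
url = matomo_report_url(
    "https://matomo.example.com", 1, "VisitsSummary.get", "day", "2020-04-20", "YOUR_TOKEN"
)
```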


App Downloads

Our pipeline for application downloads works roughly as follows:

  1. An Azure Function App retrieves the CSV files from the Google and Apple APIs respectively.

  2. The function app does some slight preprocessing on the files and then stores them in Azure Storage.

  3. These CSV files are read into our Azure Databricks environment via a streaming query.

  4. Databricks does some processing on the data and writes the results to Cosmos DB.

The result of this pipeline is that we can query App Downloads for any arbitrary period.
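As an illustration of an arbitrary-period query, the Python sketch below sums daily download counts between two dates. In Metrics this query runs against Cosmos DB; here an in-memory CSV stands in for the stored data, and the `date` and `downloads` column names are assumptions (the store reports use their own layouts).

```python
import csv
import io
from datetime import date

# Made-up sample data with hypothetical column names.
SAMPLE_CSV = "\n".join([
    "date,downloads",
    "2020-04-18,12",
    "2020-04-19,10",
    "2020-04-20,20",
    "2020-04-21,5",
])


def downloads_between(csv_text, start, end):
    """Sum daily download counts for an inclusive date range."""
    total = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        day = date.fromisoformat(row["date"])
        if start <= day <= end:
            total += int(row["downloads"])
    return total


total = downloads_between(SAMPLE_CSV, date(2020, 4, 19), date(2020, 4, 20))
```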

Summary of app downloads reported in Metrics

The summary data is calculated by running an aggregate count query against our Cosmos DB container. The charts are rendered by querying Cosmos DB and displaying the entry of each day.





Live Data

Initially, one of our goals was to surface data and surface it with speed. Up until this point we have only discussed static rolled-up data and the exploration thereof. Whilst this is useful for doing some rudimentary analysis after the fact, these stats are not able to tell our clients what is happening right now. In other words: we haven’t checked speed off our list.


To surface live data, we had to do some out-of-the-box thinking. Processing log lines in real time wasn’t feasible as we only rotate the log every hour (unless of course you deem an hour ago as “live”) and we couldn’t really speed up the rotation.



Live Stream Listeners

Real-time reporting of live streaming in Metrics

For stream listeners there are two d