Making code listen to the radio... for science!

Rees Morris

Rees Morris Ā· 19 Nov, 2020

Introduction

Lately Iā€™ve become hugely passionate for data. With subreddits such as /r/dataisbeautiful always popping up in my Reddit feed, Iā€™ve always wanted to contribute something that interests me.

Ever since a young age, Iā€™d frequently find myself listening to the radio; that refreshing feeling refreshing to be able to listen to familiar songs and presenters with all of the control taken away: you just have to accept what song is currently playing.

I started thinking about how this passion for radio could be turned into data. After some initial planning, I realised that there are potentially unlimited comparisons that can be drawn from just storing three metrics from a radio station: the current song, artist, and timestamp.

From just this data, so many visualisations and conclusions can be made: how many songs stations play per hour, which artists are the most / least popular for that station, when Christmas ā€œofficiallyā€ begins (seeing as it /is/ November!), and much more. Recording data on more than one radio station opens this up to even more possibilities, as all of the metrics above can be compared, averaged, and analysed in much more depth.

The stations I chose to record data for were BBC Radio 2, Heart London, and Q Radio Belfast. This post mainly focuses on the technical implementation, so Iā€™ll be posting later on with my justifications for these choices, my initial hypotheses, and quarterly updates. The plan is to run the application in the background for a full 365 days before then pulling all of the data and making conclusions.

With that said, letā€™s do it!

Fine Tuning

The first technical consideration to make was to think about how often the code listeners for these applications will execute. Having found the REST API endpoints for each station, I was provided with some surprising results.

The BBC Radio 2 API was requested every 30 seconds by its radio player - so that it can show which current song is playing. The API surprisingly also provided the last 10 played songs in its response, alongside their timestamps. The timestamps were unfortunately related to the time elapsed since the start of that particular radio show (usually three hours long), rather than a full UNIX timestamp, but an additional string was present displaying ā€œX minutes agoā€ for each song, which could easily be parsed into a relative timestamp. I decided that the code for BBC Radio 2 would execute every 10 minutes and log any new songs listed since the last check was made.

The Heart London API was the trickiest to manipulate. They use websockets to update the data (smart!), so I had to use the TuneIn API instead. The TuneIn player was even more confusing: it would request the Rest API data every 2-3 minutes, but if the current song data had not changed in that time (ie. adverts are still playing, or the song is really long) then it will re-request its API every 15 seconds until the data changes, and then wait another 2-3 minutes ā€” clever! The API only displays the /current/ song playing, so I decided to go for a 2 minute request interval here; on average the number of requests will likely be lower than in their own implementation, and weā€™re much less likely to miss any songs. This is definitely the implementation I was most worried about breaking or being rate-limited, however.

The Q Radio Belfast player was definitely the nicest. I didnā€™t even have to record how often it requests its Rest API, as the endpoint listed the last 20 songs played along with their UTC timestamps. Result! This meant that I only needed to code the API check for this station to execute once per hour, which was still on the heavy side considering the 20 songs were usually covered over two hours.

Infrastructure

Before jumping into the project, I first had to answer some basic questions surrounding the infrastructure setup: where would this code even be executed from, how would I store the data, and how can I alert myself if it breaks?

I decided to go with Google Cloud Functions for the code execution: Iā€™d used Google Cloud services before, and the free tier covered practically everything. To set the functions to run, I used Google Cloud Scheduler - which would let me specify a frequency in which to run each task. I set up three of these, since I would request each stationā€™s API at different intervals. I checked the Cloud Scheduler docs to get Cloud Functions and Cloud Scheduler working together, and it was a surprisingly quick process.

My initial idea for storing the data was to have a .txt file which would be read from and written to every time the Cloud Function is executed, but using Google Cloud Datastore (Firebase version) made much more practical sense: the cloud functions could remain stateless, and the data could easily be manipulated for other uses ā€” such as if I wanted to set up a website viewer to see statistics in realtime over the next 12 months (which I totally do). Assuming a maximum of 20 songs per hour (each three minutes, with no advertising breaks), that would be roughly 175,200 database entries per station over the course of a year ā€” nothing huge by any means.

The database document structure was created with the following fields:

  • artist - name of the artist
  • song - name of the song
  • station - code for the station this entry came from
  • timestamp - time that the data was added to the database

Node.js Implementation

The next step was to get down to the nitty gritty: writing the code to be executed. The code was split into two sections: the API request made every X minutes, and the reading-from/writing-to the database. The former was trivial, whilst the latter required a bit of extra consideration: if we are reading an array of last-played songs from the API response, we need to make sure that we stop iterating over this array if one of the songs matches the last song added to the database for that station, preventing duplicate entries from popping up.

Another consideration was that each stationā€™s API output was formatted differently. Whilst all three responded with JSON data, the paths to find the artist name, song name, and timestamp, were all different. To make the code as simple as possible, I decided to create a separate file for each radio station, each of which had the sole purpose of pulling the artist, song, and timestamp from their respective APIs. I then created a main script which each station script would feed into, and it would handle everything else.

The first step was to initialise a Node.js project and make sure that the basics worked locally before uploading the files. Following this guide to get Firebase connected to Node worked like a dream. The test code looked as follows:

require('dotenv').config();

const test = async () => {
  const admin = require('firebase-admin');
  const axios = require('axios');
  const parseData = require('./parse-data');

  // Init
  if (admin.apps.length === 0) {
    admin.initializeApp({
      credential:
        process.env.ON_SERVER === 'true'
          ? admin.credential.applicationDefault()
          : admin.credential.cert(require('../serviceKey.json'))
    });
  }
  const db = admin.firestore();

  // Request the API
  const { data } = await axios.get(process.env.API_URL);

  // Pull the latest song from the database
  const radioRef = db.collection('radio');
  const snapshot = await radioRef
    .where('station', '==', process.env.STATION_NAME)
    .orderBy('timestamp', 'desc')
    .limit(1)
    .get();

  // Parse the song data
  // parseData() is the individual radio script
  const songsToAdd = parseData(
    data,
    snapshot.empty ? { artist: '', song: '' } : snapshot.docs[0].data()
  );

  // Add all missing songs
  if (songsToAdd.length > 0) {
    const batch = db.batch();

    songsToAdd.forEach(song => {
      const docRef = db.collection('radio').doc();
      batch.set(docRef, {
        artist: song.artist,
        song: song.song,
        station: process.env.STATION_NAME,
        timestamp: song.timestamp
      });
    });

    batch.commit();
  }
};
test();

Whilst a rather long function, itā€™s pretty simple to break down. After initialising with credentials, it will then send a request to the API which is defined in the environment ā€” as there are three separate Cloud Functions running for each station ā€” and also attempt to pull the last song stored in the database for this particular station.

We then call a custom function from another file parseData (more on that in a moment), which returns an array of objects that should be added to the database ā€” making sure to only return songs that had been played since the database was last updated. We then simply add each song into the database and call it a day - until the Scheduler runs the code again.

The parseData function is the only piece of code that separates all API handling. All Cloud Functions use the exact same code above, but - as mentioned earlier - still need their own unique ways of parsing the data. By passing the JSON response from the API /along with/ the most recent song played from the database, the custom function is able to fetch all songs that should be added to the database.

All scripts have the following code, with only the section between the comment blocks changing:

module.exports = parseData = (jsonAPI, lastSongInDatabase) => {
  // Array of songs to add
  const songsToAdd = [];
  const addSong = (artist, song, timestamp = new Date()) =>
    songsToAdd.push({ artist, song, timestamp });

  // START Custom Parser
  // ...
  // END Custom Parser

  // Return the songs to add
  return songsToAdd;
};

Very simply, it provides a single helper function: addSong, which accepts an /artist/ and /song/ as arguments, and optionally a timestamp too. Inside of the custom parser code, that radio stationā€™s API response will then be manipulated to find as many recent songs played as necessary. The /lastSongInDatabase/ parameter accepted by the parseData function will give us all information we need to make sure we only return songs added since the last database update, preventing duplicate entries.

Google Cloud Setup

Once all three scripts had been tried and tested locally, it was time to load them into Google Cloud Functions. I first set up BBC Radio 1, with three environment variables:

  • ON_SERVER - setting to true allows us to use internal credentials to connect to the database
  • API_URL - the url for the API
  • STATION_NAME - name of the radio station

I also had to remember to update my dependency array, as we were using axios (for outgoing requests) and firebase-admin (for data requests).

After all of the code uploaded, I clicked Deploy. And waited.. and waited.. and finally.. it failed. After checking the logs, I had forgotten to add an async tag to the main function.. oops. Frustratingly, Google Cloud didnā€™t actually save /anything/ Iā€™d added to the failed deployment, so I had to re-add all of the environment variables and upload the scripts again. I suppose it wasnā€™t too bad, since I noticed that the environment variables set up previously were for the build-step rather than runtime, so it would likely need to have been re-done anyway.

After waiting an anxiety-inducing five minutes for it to deploy, I finally got the green success checkmark. Begone, yellow warning triangle! I hit the three dots on the right-hand-side of the deployment and hit the test function button as fast as I could. After it executed, I checked the database and immediately became excited after I saw the ten new entries in front of me.

I quickly set up the Cloud Scheduler program to execute the Cloud Function on a regular interval, and went to grab a coffee. When I got back, I saw a few more entries had been added to the database - a fantastic feeling. It was almost done: just two more deployments remained. Following the exact same setup as before (minus the mistakes), I added Heart London and Q Radio Belfast. Everything worked like a treat.

Next Steps

Now, we wait. For a /long, long time/. Whilst some very interesting takes can already be made on the data only a few days after running it, I wonā€™t be attempting to draw any conclusions until around February 2021 - once the code has been running for around three months. Iā€™ll then be checking back in at the end of May, end of June, and finally at the end of November, when one full year will have passed.

Of course, Iā€™ll be checking up on the data from time to time to make sure things are still being recorded and nothing is breaking, but Iā€™ll be leaving the analysis of the data as a ā€˜future meā€™ problem.