kt analytics

Use Data You’re Familiar with When Practicing

December 04, 2023

Whenever I’m trying to learn something new in tech or data, I try to create a project that I sincerely care about and am familiar with, then combine learning some new language or platform with expanding the depth of knowledge on the chosen topic.

My first large-scale project was aggregating US congressional data from all the available APIs, both government and private, like the Sunlight Foundation and ProPublica, and combining it with campaign donation and spending records to create a single source of truth about our politicians. As that space consolidated I folded the project, but I still gained real expertise with AngularJS, which I carried into my career at Cars.com when they decided to rebuild on Angular.

Now I’ve chosen to use sports data, focusing on the two sports I know best: basketball and football. While I shopped around for data, I decided to focus on the biggest name in sports as my source of truth. Pulling many sports from one place means consistent standards across sports, so more of my code will be reusable.

An important detail for any analyst to remember is to start small when getting familiar with a dataset. Sports make this easy: start with a single game or a single player, move on to a full team and a day of games, then to whole seasons and beyond once you’re comfortable. This makes reading the data less intimidating, and it also saves time and resources when accessing it, whether that’s your bandwidth, server usage, or even energy use.
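A minimal sketch of that progression in Python, using a handful of made-up schedule rows (the field names and teams here are my own illustration, not any real API’s schema):

```python
# Start small: one game, then one team's games, then a full day of games.
# These records are invented for illustration, not a real API response.
games = [
    {"game_id": 1, "date": "2023-11-01", "home": "BOS", "away": "NYK"},
    {"game_id": 2, "date": "2023-11-01", "home": "LAL", "away": "PHX"},
    {"game_id": 3, "date": "2023-11-02", "home": "BOS", "away": "MIA"},
]

# Step 1: a single game is small enough to read end to end.
one_game = next(g for g in games if g["game_id"] == 1)

# Step 2: widen to one team's games, home or away.
bos_games = [g for g in games if "BOS" in (g["home"], g["away"])]

# Step 3: widen again to a full day of games.
nov_first = [g for g in games if g["date"] == "2023-11-01"]

print(len(bos_games), len(nov_first))  # prints "2 2"
```

Each step reuses the previous one’s shape, so by the time you query a whole season you already know what every field means.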

In this case, while I’ve been querying the data from the source, the first step in all of my processes is to save the raw response to a GCP storage bucket or my local device. That way I can read the response in another tool when necessary, and it also acts as a backup while I create and recreate tables as I run into exceptions.
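As a sketch of that first step, here is how I might stash a raw JSON response locally before touching it. The function name and paths are illustrative; a GCP variant would upload the same bytes to a bucket (for example via the `google-cloud-storage` client’s `Blob.upload_from_string`) instead of writing to disk:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def save_raw(payload: dict, name: str, base_dir: str = "raw") -> Path:
    """Write the untouched API response to disk before any transformation.

    Timestamping the filename keeps every pull, so reruns and backfills
    never overwrite the evidence of what the source actually returned.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(base_dir) / f"{name}_{stamp}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload))
    return path

# Usage: save first, parse later; the saved file doubles as a backfill source.
raw_path = save_raw({"game_id": 1, "status": "final"}, "scoreboard")
```

Saving before parsing means a bug in your transformation never costs you another round trip to the API.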

Maintaining a backup of the raw responses saves me time as I learn the nuances of the data, dbt, and GCP as a data platform. I’ve run dozens of backfills as I notice one issue or another, and thankfully my automation keeps pulling in new data while I learn.

The most significant benefit of learning with a familiar topic is that I already understand how I want the final data to be shaped, and I know what a reasonable result from my dataset looks like. In my case I can confirm directly from the source, but never underestimate how many arithmetic bugs I’ve seen, and made!
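A tiny example of the kind of arithmetic check I mean, with invented field names: quarter scores should sum to the final score, and the source’s published total is the answer key for my own math.

```python
# Hypothetical box score; the field names are illustrative, not a real schema.
box = {"quarters": [28, 24, 31, 22], "final": 105}

computed = sum(box["quarters"])

# The source's published final is the answer key for my own arithmetic.
assert computed == box["final"], f"expected {box['final']}, computed {computed}"
print(computed)  # prints 105
```

Cheap checks like this, run on every load, catch the off-by-one joins and double-counted rows long before they reach a dashboard.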


Aris

Written by Aris. Explore when and wherever you can!