Spotify Data Analysis Project¶

For the period October 22 2019 - October 22 2020

Song attribute data was collected from Kaggle. Spotify provides instructions for downloading your own data.

# Import and read data

import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

#===========================================
# Read data.  Change DATAPATH if necessary
#===========================================
try:
    # Executes if running in Google Colab
    from google.colab import drive
    drive.mount('gdrive/')
    DATAPATH = "gdrive/My Drive/Data/Spotify_Data" # Change path to location of data if necessary
except ImportError:
    # Executes if running locally (e.g. Anaconda)
    DATAPATH = "/Users/dennis/Desktop/SpotifyData"

def load_json(filename):
    """Load one JSON file from DATAPATH (the Spotify export is UTF-8 encoded)."""
    with open('/'.join((DATAPATH, filename)), encoding='utf-8') as f:
        return json.load(f)

# Misc Data
inferences = load_json('Inferences.json')
library = load_json('YourLibrary.json')

# Streaming History
sh_0 = load_json('StreamingHistory0.json')
sh_1 = load_json('StreamingHistory1.json')
sh_2 = load_json('StreamingHistory2.json')
sh_3 = load_json('StreamingHistory3.json')

# Import all spotify songs from the Kaggle dataset
all_songs = pd.read_csv('/'.join((DATAPATH,'data.csv')))

Word Cloud of Advertising Inferences¶

# Look at inferences

!pip install wordcloud
!pip install textdistance
import pprint
from wordcloud import WordCloud
import textdistance

my_inferences = inferences["inferences"]
text = ""
for i in my_inferences:
    count = 0
    for j in my_inferences:
        if (textdistance.jaro_winkler(i, j) > 0.9):
            count += 1
    if count < 2:
        text = text + " " + str(i)

wordcloud = WordCloud(width=480, height=480, colormap="Blues").generate(text)
plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()


Lots of interesting stuff here! It seems that Spotify has its own ad categorization system (all the strings with "3P" or "1P" in front). My guess is that Spotify then sells these categories to companies and attaches the category tags to certain users. It looks like I got lumped into Starbucks, Apple, and Disney targeted ads (2/3, not bad Spotify!). Many of these categories are very broad, like 'Purchasers'. I would guess that Spotify thinks I am a very basic consumer and doesn't do much ad targeting. It could also be that these aren't very targeted because I have Premium!
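The word-cloud cell drops any inference that has a near-duplicate elsewhere in the list. Here is a minimal sketch of that filter. It uses difflib.SequenceMatcher from the standard library as a stand-in for textdistance.jaro_winkler (so the similarity scores differ), and the drop_near_duplicates helper and the sample strings are made up for illustration.

```python
from difflib import SequenceMatcher

def drop_near_duplicates(items, threshold=0.9):
    """Keep only items whose sole close match in the list is themselves."""
    kept = []
    for a in items:
        # Every string matches itself with ratio 1.0, so a count of 1
        # means no other entry is a near-duplicate of this one.
        matches = sum(SequenceMatcher(None, a, b).ratio() > threshold for b in items)
        if matches < 2:
            kept.append(a)
    return kept

# Invented category strings, only loosely shaped like the real ones
sample = ["1P_Custom_Coffee_Drinkers", "1P_Custom_Coffee_Drinker", "3P_Purchasers"]
unique_inferences = drop_near_duplicates(sample)
print(unique_inferences)
```

Note that this drops both members of a near-duplicate pair rather than keeping one of them, which is also what the loop in the notebook does.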

# Get arrays of songs and artists

library_songs = library["tracks"]

songs = []
artists = []
for i in library_songs:
    songs.append(i["track"])
    artists.append(i["artist"])

# Get wanted attributes from Kaggle Spotify dataset

my_song_attributes = all_songs.loc[all_songs['name'].isin(songs)]

wanted_cols = ['name', 'artists', 'year', 'popularity', 'explicit', 'duration_ms', 'key', 'mode', 
               'instrumentalness', 'acousticness', 'danceability', 'energy', 'speechiness', 'tempo', 'valence']

song_attributes = my_song_attributes[wanted_cols]

# Get rid of repeated song titles by checking that artists are the same

# Keep a row only if one of my library artists appears in its artist string
index = np.zeros(len(song_attributes), dtype = bool)
for row, song_artists in enumerate(song_attributes['artists']):
    index[row] = any(artist in song_artists for artist in artists)
song_attributes = song_attributes.iloc[index]
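Matching on title alone pulls in unrelated songs that share a name, which is why the artist check is needed. The sketch below is a slightly stricter variant, not the notebook's method: it requires the title and the artist to come from the same saved track. The matches_library helper and the rows are illustrative, and the bracketed artist strings are an assumption about how the Kaggle file stores artists.

```python
# Illustrative stand-ins for YourLibrary.json tracks and Kaggle rows
library = [
    {"track": "Yellow", "artist": "Aminé"},
    {"track": "Beautiful", "artist": "Christina Aguilera"},
]
catalog = [
    {"name": "Yellow", "artists": "['Coldplay']"},
    {"name": "Yellow", "artists": "['Aminé']"},
    {"name": "Beautiful", "artists": "['Christina Aguilera']"},
    {"name": "Beautiful", "artists": "['Eminem']"},
]

def matches_library(row, library):
    """True if a saved track has the same title and its artist appears in the row's artist string."""
    return any(
        row["name"] == saved["track"] and saved["artist"] in row["artists"]
        for saved in library
    )

matched = [row for row in catalog if matches_library(row, library)]
for row in matched:
    print(row["name"], row["artists"])
```

Because the artist test is still a substring check, a very short artist name could match inside a longer one, so this remains a heuristic.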

Plot Histograms of Song Attributes¶

# Make plots of all the numeric datatypes describing my Spotify songs

f, axes = plt.subplots(2, 5, figsize=(20, 10))

num_attributes = ['year', 'popularity', 'duration_ms', 'instrumentalness', 'acousticness', 'danceability',
                  'energy', 'speechiness', 'tempo', 'valence']

colors = ['skyblue', 'olive', 'gold', 'darkgray', 'tan', 'cadetblue', 'tomato', 'lightpink', 'g', 'orchid', 'chocolate']

axes = axes.ravel()

for ax, attribute, color in zip(axes, num_attributes, colors):
    ax.hist(song_attributes[attribute], color = color, label = attribute, bins = 20)
    ax.set_title(attribute)

Nothing much to see here. I tend to prefer songs that are popular, recent, high energy, and danceable, but not necessarily fast or happy. With very few exceptions, all of my songs have vocals, but none are entirely vocals. Most of my songs are non-acoustic, though they do lean acoustic compared with Spotify's overall distribution.

Spotify's documentation explains what these attributes mean and how they are distributed across all songs.

How many of my songs are explicit?¶

# Scale to a percentage before rounding to avoid floating-point noise in the output
print("Percentage of explicit songs: ", round(100 * np.mean(song_attributes['explicit']), 2), "%", sep = "")

Percentage of explicit songs: 46.5%

This seems to be average among all songs on the platform.

Key Signatures¶

np.mean(song_attributes['mode'])

key_sing = song_attributes[['key', 'mode']]

key_dict = {0: 'C', 1: 'C#', 2: 'D', 3: 'D#', 4: 'E', 5: 'F', 6: 'F#', 7: 'G', 8: 'G#', 9: 'A', 10: 'A#', 11: 'B'}

key_sings = []
for i in range(len(key_sing)):
    key = key_dict.get(key_sing.iloc[i, 0])
    if key_sing.iloc[i, 1]:
        key_sings.append(key + " Major")
    else:
        key_sings.append(key + " Minor")


ax = pd.Series(key_sings).value_counts().plot(kind = 'barh', width = 0.9, figsize = (15, 10))
ax = plt.title("Favorite key signatures")

Seems like I have a strong preference towards major key signatures! Or maybe artists just have a preference towards writing songs in major key signatures... I'm not surprised that C major is number one either, considering the only song I can still play on the piano is this C major Mozart sonata.
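The labels in the bar chart combine the key integer (a pitch class, with 0 = C) and the mode flag (1 = major, 0 = minor). A compact sketch of that mapping follows. The key_name helper is hypothetical, and the "Unknown" branch assumes Spotify's convention of reporting -1 when no key was detected.

```python
PITCH_CLASSES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def key_name(key, mode):
    """Turn a pitch-class integer and a mode flag into a label like 'C Major'."""
    if not 0 <= key < len(PITCH_CLASSES):
        # Assumed convention: -1 (or anything out of range) means no key detected
        return "Unknown"
    return f"{PITCH_CLASSES[key]} {'Major' if mode else 'Minor'}"

labels = [key_name(0, 1), key_name(9, 0), key_name(-1, 1)]
print(labels)
```

Handling the out-of-range case explicitly avoids the TypeError that key_dict.get would cause by returning None for such a row.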

Streaming History¶

# Organize streaming history into a single dataframe

sh_0_df = pd.DataFrame(sh_0)
sh_1_df = pd.DataFrame(sh_1)
sh_2_df = pd.DataFrame(sh_2)
sh_3_df = pd.DataFrame(sh_3)

sh_tot = pd.concat([sh_0_df, sh_1_df, sh_2_df, sh_3_df])

sh_tot['msPlayed'] = sh_tot['msPlayed'] / 1000

sh_tot.columns = ['endTime', 'artistName', 'trackName', 'sPlayed']

sh_tot

Isn't it staggering that I've played almost 40,000 tracks (repeats included) in just one year?

# Distribution of how long each song has lasted when played

ax = plt.figure(figsize = (12, 8))
ax = plt.hist(sh_tot['sPlayed'], bins = 200, log = True, range = (0, 750))
ax = plt.xlabel('sPlayed')
ax = plt.ylabel('Song Count')

This is a great ad for Spotify Premium. Think of how many songs are under 100 seconds. I would guess that 80% of that distribution to the left of 100 seconds is from skipping. That is like 10^4 songs skipped right there. Days worth of listening saved, not to mention ads.

Some of the points on the right of the distribution are probably podcast listens. On occasion, I'll listen to some classical music that can last a while. Or sometimes Lil Dicky sings Pillow Talking to me.

# Plot most commonly played artists by duration

plotted = sh_tot.groupby(['artistName']).sPlayed.sum().reset_index()
plotted = plotted.loc[plotted['sPlayed'] > 3600].sort_values('sPlayed', ascending = False)

ax = plotted.plot(kind = 'bar', width = 0.9, figsize = (70, 8), legend = None)
ax.set_xticklabels(plotted['artistName'])
ax = plt.ylabel('Seconds Played')
ax = plt.title("Most commonly played artists by duration", fontsize = 40)

You need to zoom in pretty closely to see the results here, even though I filtered out every artist I played for under an hour in total over the past year. I've played Tobi Lou for almost 400,000 seconds (for those keeping track at home, that's almost 5 entire days' worth of listening, or about 18 minutes a day). Other top artists include Mac Miller, John Mayer, Post Malone, Lil Dicky, and Imagine Dragons. Pretty diverse! In general, I'm pretty proud of how many different artists I've explored.
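For anyone who wants to check the Tobi Lou numbers, here is the arithmetic as a short sketch. The listening_summary helper is hypothetical, and the 366-day period is my reading of October 22 2019 to October 22 2020, which includes the 2020 leap day.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_IN_PERIOD = 366  # assumed: Oct 22 2019 to Oct 22 2020, including Feb 29 2020

def listening_summary(seconds_played, days_in_period=DAYS_IN_PERIOD):
    """Return (total days of listening, average minutes per day)."""
    days = seconds_played / SECONDS_PER_DAY
    minutes_per_day = seconds_played / 60 / days_in_period
    return days, minutes_per_day

days, minutes_per_day = listening_summary(400_000)
print(f"{days:.2f} days total, {minutes_per_day:.1f} minutes per day")
```

With the rounded figure of 400,000 seconds this comes out to roughly 4.6 days and 18 minutes a day.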

# Plot commonly played artists by count

artist_counts = sh_tot['artistName'].value_counts()
plotted = artist_counts[artist_counts > 15]
ax = plotted.plot(kind = 'bar', width = 0.9, figsize = (70, 8), legend = None)
ax = plt.title("Most commonly played artists by count", fontsize = 40)

Another similar plot of the artists I've listened to, this time by how many songs I've played from each of them. I set the threshold for this plot at more than 15 plays. This time, Bazzi creeps into the top 5, probably because of how short his songs are. Lil Dicky rambles, so he drops. I can't explain the obsession with Tobi Lou unfortunately. He played a concert at Hopkins my freshman year maybe? Skin Care Tutorial 2020 is a certified bop.

print("Number of unique artists played: " + str(len(sh_tot['artistName'].unique())))

print("Number of unique songs played: " + str(len(sh_tot['trackName'].unique())))

print("Total hours of music played: " + str(round(sum(sh_tot['sPlayed']) / 60 / 60, 2)))

Number of unique artists played: 1385
Number of unique songs played: 3102
Total hours of music played: 1830.01

Just some summary statistics. Again I'll do the math for those at home. 1830 hours is 76.25 days, which is roughly 21% of the entire year. Suppose I'm sleeping 8 hours a day... then I'm going through almost a third of my waking life with music playing.
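The math above, spelled out as a small sketch. The 366-day period and the flat 8 hours of sleep are assumptions for the estimate, not measured values.

```python
total_hours = 1830.01          # from the summary output above
days_in_period = 366           # assumed: Oct 22 2019 to Oct 22 2020 (leap year)
sleep_hours_per_day = 8        # assumed for the estimate

days_equivalent = total_hours / 24
share_of_year = total_hours / (24 * days_in_period)
share_of_waking = total_hours / ((24 - sleep_hours_per_day) * days_in_period)

print(f"{days_equivalent:.2f} days of music")
print(f"{share_of_year:.1%} of the whole year")
print(f"{share_of_waking:.1%} of waking hours")
```

The waking-hours share is the total divided by 16 hours a day over the period, which is where the "almost a third" figure comes from.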

What is the distribution for when I listen?¶

# Convert to datetimes

from datetime import datetime

datetimes = []
for i in sh_tot['endTime'].to_numpy():
    datetimes.append(datetime.strptime(i, '%Y-%m-%d %H:%M'))

sh_tot['endTime'] = datetimes

# Convert to a series of seconds played, indexed by end time

# Sum plays that share the same minute-level timestamp. Grouping keeps each
# value attached to its own end time (a set of datetimes has no stable order)
# and does not discard plays that ended in the same minute.
indexes = sh_tot.groupby('endTime')['sPlayed'].sum()

# Create a timeseries heatmap plot

resampled = indexes.resample('H').sum()

groups = resampled.groupby(pd.Grouper(freq = 'W'))
weeks = pd.DataFrame()

for name, group in groups:
    # Keep only complete weeks (7 * 24 hourly bins)
    if len(group.values) == 168:
        # name is the Sunday that closes the weekly bin
        weeks["Week ending: " + name.strftime('%Y-%m-%d')] = group.values

weeks = weeks.T
ax = plt.figure(figsize = (20,8))
ax = plt.matshow(weeks, interpolation=None, aspect='auto', cmap = "plasma", alpha = 0.8, origin = "lower", fignum = 1)
_ = plt.xlabel("Hours in a week", fontsize = 20)
_ = plt.ylabel("Weeks in a year", fontsize = 20)
_ = plt.title("Seconds music streamed", fontsize = 20)
_ = plt.colorbar()
_ = plt.show()

I don't think there is anything too surprising here. There is very clear cyclical behavior around sleeping times, and a slight bias towards listening at the end of the day, which makes sense because that is when I'm most likely to be alone. The last thing is that I was blasting tons of music in the summer: weeks 28 - 40 clearly have much higher Spotify usage. Yay Covid! It's also pretty funny that you can tell which days are structured and which are not. Weekdays during the school year look much calmer and more regular than weekends and the summer. (I'd assumed the timestamps were in my local time, but Spotify's documentation for this export describes endTime as UTC, so the hours may be shifted.)
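Each row of the heatmap is one week laid out as 168 hourly buckets. Here is a standard-library sketch of that bucketing, using the first few plays from the streaming history. The hour_of_week helper is hypothetical. Like the weekly grouper, whose bins end on Sunday, it counts hours from Monday 00:00.

```python
from collections import defaultdict
from datetime import datetime

def hour_of_week(ts):
    """Hours elapsed since Monday 00:00 of the same week (0 to 167)."""
    return ts.weekday() * 24 + ts.hour

# (endTime, seconds played) pairs
plays = [
    ("2019-10-22 01:06", 176.727),
    ("2019-10-22 01:09", 180.845),
    ("2019-10-22 02:49", 201.510),
]

seconds_by_hour = defaultdict(float)
for end_time, seconds in plays:
    ts = datetime.strptime(end_time, "%Y-%m-%d %H:%M")
    seconds_by_hour[hour_of_week(ts)] += seconds

print(dict(seconds_by_hour))
```

October 22 2019 was a Tuesday, so these plays land in buckets 25 and 26 of the week.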

How often do I listen on repeat?¶

# Plot most commonly played songs by count

track_counts = sh_tot['trackName'].value_counts()
plotted = track_counts[track_counts > 25]
ax = plotted.plot(kind = 'bar', width = 0.9, figsize = (70, 8), legend = None)
ax = plt.ylabel("Song Counts")
ax = plt.title("Most commonly played songs by count", fontsize = 40)

This is another fun plot that shows which songs make up a significant chunk of my listening. I set the cutoff here to 25, which still leaves an incredible diversity of tracks. And would you look at that! Tobi Lou loses the top spot to...Aminé? This makes sense as I can clearly remember when I was running during the summer, that song would be repeated for hours. I think this plot also shows a bias towards one hit wonders. Açai bowl - looking straight at you. I think this was because I thought it was the perfect song to listen to while reading When Breath Becomes Air by Paul Kalanithi.

A lot of these songs near the left are from the summer. I'd go through Spotify's 'Discover Weekly' playlist that they custom make for you and filter for bangers to listen to while I ran. Listening to the same song on repeat really helps with running, at least for me.

# Plot most commonly played songs by duration

plotted = sh_tot.groupby(['trackName']).sPlayed.sum().reset_index()
plotted = plotted.loc[plotted['sPlayed'] > 3600].sort_values('sPlayed', ascending = False)

ax = plotted.plot(kind = 'bar', width = 0.9, figsize = (70, 8), legend = None)
ax.set_xticklabels(plotted['trackName'])
ax = plt.ylabel("Duration Played")
ax = plt.title("Most commonly played songs by duration", fontsize = 40)

These correlate pretty well with the previous plot.

# Plot tracks and artists that I listen to by longest streak

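# Work on a copy so the helper columns are not added to sh_tot itself
sh_tot_streak = sh_tot.copy()

# A new block starts whenever the track differs from the previous play
sh_tot_streak['block'] = (sh_tot_streak['trackName'] != sh_tot_streak['trackName'].shift(1)).astype(int).cumsum()
# cumcount is zero-based, so a run of n consecutive plays ends with streak n - 1
sh_tot_streak['streak'] = sh_tot_streak.groupby('block').cumcount()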

streaks = sh_tot_streak.groupby(['block']).nth(-1).sort_values(by=['streak'], ascending = False).reset_index()

sig_streaks = streaks.loc[streaks['streak'] > 10]

# Select the streak column before summing so text and date columns are not aggregated
artist_streak = sig_streaks.groupby(['artistName'])[['streak']].sum().sort_values(by = ['streak'], ascending = False)
track_streak = sig_streaks.groupby(['trackName'])[['streak']].sum().sort_values(by = ['streak'], ascending = False)

ax = artist_streak.plot(kind = 'bar', width = 0.9, figsize = (50, 8), legend = None)
ax = plt.ylabel("Total streaks > 10")
ax = plt.title("Most Streaky Artists")

ax = track_streak.plot(kind = 'bar', width = 0.9, figsize = (50, 8), legend = None)
ax = plt.ylabel("Total streaks > 10")
ax = plt.title("Most Streaky Tracks")

I really love these two plots because I think they illustrate something fundamental about how I listen to music. I really enjoy playing the crap out of songs for like a day, or two, or as long as it takes for them to get old. I've found that this also somehow makes me remember periods of my life better. For example, this song ironically named Panic Attacks, which I played when I traveled to Columbus for a chess tournament, made the memory of that weekend so much more vivid.

I think the reason why I do this and the reason why I can survive listening to music like this is that I never actually process the words. I listen purely on rhythm and melody, which means that I never know lyrics. Listening on repeat lets me get into a zone and do work not necessarily faster, but more enjoyably. In every song, there are really only a few seconds where it is really good and getting to enjoy that right as I finish off a problem set is a great way to keep focused. And that was a really sad sentence that very succinctly describes the college experience of a BME/Math major at Johns Hopkins.

The cutoff I set for these plots was more than 10 repeats in a row, and I then added the qualifying streaks together to get the total for each artist and song. Our good friend Tobi Lou is back on top in the first plot. I believe Dominic Fike and Aminé are each being carried by like 1 or 2 songs (Açai Bowl and Yellow respectively).
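The block and cumcount trick in the streak cell is a run-length encoding of the play sequence. Here is a standard-library sketch of the same idea with itertools.groupby, on an invented play list. One difference to keep in mind: the notebook's streak column is zero-based (a run of n plays gets streak n - 1), while the hypothetical run_lengths helper below returns the plain count.

```python
from itertools import groupby

def run_lengths(tracks):
    """Collapse consecutive repeats into (track, number of plays in a row) pairs."""
    return [(name, sum(1 for _ in run)) for name, run in groupby(tracks)]

# Invented sequence of plays, in listening order
plays = ["Yellow"] * 3 + ["Açai Bowl"] + ["Yellow"] * 2
runs = run_lengths(plays)
print(runs)
```

The two separate Yellow runs stay separate, which is why a song can contribute several streaks to an artist's total.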

All Streaks¶

# Group streaks

streaks = sh_tot_streak.groupby(['block']).nth(-1).sort_values(by=['streak'], ascending = False).reset_index()

# Arrange the streaks by track, artist, and date

output = streaks[['trackName', 'artistName', 'streak', 'endTime']]
# Copy the filtered rows so the new columns below are added to an independent frame
output = output.loc[output['streak'] > 10].copy()

output['day'] = pd.DatetimeIndex(output['endTime']).day
output['month'] = pd.DatetimeIndex(output['endTime']).strftime("%b")

output['name'] = output['trackName'] + " by " + output['artistName'] + " " + output['month'] + " " + output['day'].astype(str)

plotted = output[['name', 'streak']]

# Plot the streaks

ax = plotted.plot(kind = 'bar', width = 0.9, figsize = (70, 8), legend = None)
ax.set_xticklabels(plotted['name'])
ax = plt.ylabel("Streak length")
ax = plt.title("All streaks > 10")

Last plot! Here we are looking at the actual streaks (how long they are, whose song it is, which song it is, etc.)

Okay so this plot is actually kind of scary. If you look closely, there are like 5 songs that I've listened to more than 100 times in a row. Let's take Yellow, which is the shortest song (3 minutes flat). At over 100 plays, that gives me a total listening time of more than 5 hours straight. That isn't healthy!!!

I do think there are some issues with the methodology here. First, a streak could really be several smaller chunks spread over several days (I don't break off a streak if I take a break). Second, I have a tendency to play music on my computer and leave it without coming back to it, which could potentially last a long time. Finally, I could be skipping to the beginning of the song a lot if I only like the beginning of a song.
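One way to address the first issue would be to end a streak whenever the gap between consecutive plays gets too long. This is a rough sketch of that idea, not what the notebook does: the streaks_with_gaps helper and the 30-minute cutoff are assumptions, and the timestamps are invented.

```python
from datetime import datetime, timedelta

def streaks_with_gaps(plays, max_gap=timedelta(minutes=30)):
    """Split time-sorted (end_time, track) pairs into streaks.

    A streak ends when the track changes or when the gap between
    consecutive end times exceeds max_gap.
    """
    streaks = []
    for end_time, track in plays:
        if streaks:
            last = streaks[-1]
            if last["track"] == track and end_time - last["end"] <= max_gap:
                last["count"] += 1
                last["end"] = end_time
                continue
        streaks.append({"track": track, "count": 1, "end": end_time})
    return streaks

t0 = datetime(2020, 7, 1, 18, 0)
plays = [
    (t0, "Yellow"),
    (t0 + timedelta(minutes=3), "Yellow"),
    (t0 + timedelta(minutes=6), "Yellow"),
    (t0 + timedelta(hours=14), "Yellow"),             # next morning: new streak
    (t0 + timedelta(hours=14, minutes=3), "Yellow"),
]
result = [(s["track"], s["count"]) for s in streaks_with_gaps(plays)]
print(result)
```

Because the export only records end times, a long track or podcast could look like a gap, so the cutoff would need tuning against real data.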

Either way, this is a pretty clear indication that I need to take better care of my ear health. Luckily, I typically listen with low volume but suffice to say I'll probably get hearing loss at like age 50 if I keep this up.

That's the end! We all have different ways of coping with Covid and other types of stress, and I think music has been a central part of how I've stayed afloat and been reasonably happy over the past year. Thanks for listening to my TED talk and have a great rest of your day!

Output of sh_tot (sPlayed is in seconds):

	endTime	artistName	trackName	sPlayed
0	2019-10-22 01:06	Tessellated	I Learnt Some Jazz Today	176.727
1	2019-10-22 01:09	TEMPOREX	Nice Boys	180.845
2	2019-10-22 01:13	Nate Good	Gold Coast	225.600
3	2019-10-22 02:49	DPR LIVE	Jasmine	201.510
4	2019-10-22 02:52	Post Malone	Allergic	156.893
...	...	...	...	...
8936	2020-10-22 20:05	Christina Aguilera	Beautiful	90.271
8937	2020-10-22 21:32	Christina Aguilera	Beautiful	148.289
8938	2020-10-22 21:35	Jack Harlow	WHATS POPPIN	139.741
8939	2020-10-22 23:10	Tiffany	I Think We're Alone Now	46.905
8940	2020-10-22 23:13	Lostboycrow	Orange Juice	227.346