Every year in January, one of the largest award shows is hosted by the Recording Academy and streamed to millions of viewers around the globe. The Grammys are a time to recap what has happened in music over the past year and celebrate the hard work these artists have put in. Because of how influential music is to the average person, it is no surprise that the Grammys bring a multitude of celebrities and viewers to this momentous show. With all the categories that have been added since the first Grammys in 1959, there are now 30 different fields with a total of 84 distinct award categories and up to 8 nominees for each. But without fail, every year there are "Grammy upsets," where viewers feel like the winner of an award should not have won. Some say there is committee bias; others believe the process is fair but just makes mistakes sometimes. The biggest question people have, though, is: "How do they decide the winner?"
On the Grammys' site they provide information about the voting process and the general flow of the committee. Unfortunately, they never actually explain what the committee is looking for in these different award categories. As avid music lovers, we keep up with the music industry and have been curious about how the decisions in previous Grammy Awards were made. We wanted to see if we could collect and analyze enough data to create accurate, working models that predict future Grammy winners. In this tutorial, we will explore data for songs, albums, and artists to see if there are any trends separating winners from mere nominees over the past 20 years. We will then attempt to predict the winners of this year's Grammys using that data.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.util as util
import seaborn as sb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pylab as pl
from sklearn import tree, ensemble, neighbors, model_selection, metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from IPython.display import display, HTML
Spotipy is really cool! It gives us access to the Spotify API in a more "pythonic" way (simply put, easier for us to use in Python). Once we went on the Spotify developer site and applied for access to the API, we used the client ID and client secret given to us in the authorization code below, so we can access the API and parse through any data we use.
sb.set()
# Client ID and secret from the Spotify developer dashboard (use your own here)
cid = "6d49981d346842a1844ab1612afae8ba"
csec = "e33407ff688c4d48908d5f5286fe8e41"
# Authenticate with the client-credentials flow and build the Spotipy client
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=csec)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
Now we get to the actual data collection process. As soon as we started this project, we hit the question: "How are we going to get the data on every Grammy winner and nominee for the past 20 years?"
Obviously there's no "Grammy" section on Spotify, so we decided to cheat a little bit. Every song, album, user, and playlist has a specific ID within Spotify called its URI, which can easily be found through the settings of whatever page you are on. The important part here is that each playlist has its own ID, so we made our own playlists containing all the data we need!
We created playlists for the winners and the nominees of the past 20 years for Song of the Year, Album of the Year, and Best New Artist. With these playlists, we can access all the data we want much faster, instead of having to manually go through every song that we want.
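As a quick illustration, a playlist URI has the form spotify:playlist:<id>, and the trailing ID is what Spotipy's playlist calls expect. The snippet below is just a sketch; it reuses the ID of our Song of the Year winners playlist, which shows up again in the next code cell.
# A playlist URI ends with the playlist ID that Spotipy expects
playlist_uri = 'spotify:playlist:7pF07PQEdD4e0AEoo6ue9G'
playlist_id = playlist_uri.split(':')[-1]
print(playlist_id)  # 7pF07PQEdD4e0AEoo6ue9G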
Let's start with Song of the Year. Spotify assigns numerical values to each song's audio characteristics. The values we retrieve for each song (descriptions paraphrased from the Spotify API documentation) are:
Popularity: a 0-100 score based largely on how recently and how often the track has been played.
Danceability: how suitable a track is for dancing, from 0.0 to 1.0.
Energy: a 0.0-1.0 perceptual measure of intensity and activity.
Loudness: the overall loudness of the track in decibels (dB).
Speechiness: the presence of spoken words in the track.
Acousticness: a 0.0-1.0 confidence measure of whether the track is acoustic.
Liveness: the likelihood that the track was performed in front of a live audience.
Valence: a 0.0-1.0 measure of how musically positive (happy, cheerful) the track sounds.
Tempo: the estimated tempo of the track in beats per minute (BPM).
We also keep track of whether a particular song won or lost in the year it was nominated. The idea here is to use a song's characteristics and, with machine learning, find patterns separating the songs that won from the ones that lost, in order to potentially predict this year's winners.
Finally, we create a dataframe to store all of this data. Keep in mind that Spotify API calls take a while, especially when you run them on hundreds of songs (Album of the Year covers over 500 songs, each with a lot of data to parse through), so we exported our dataframe to a CSV file immediately. That way the data is easier to access and this code only needs to run once.
#Access the playlist for song of the year winners for the past 20 years using the URI of the user (Tomi Olusina)
#who made it and the playlist URI
playlist_tracks = sp.user_playlist_tracks('22b7g6udnfkubfplnsla3ww5a', '7pF07PQEdD4e0AEoo6ue9G', fields='items,uri,name,id,total', market='fr')
#Create arrays for every category in our dataframe
name = []
date = []
pop = []
dance = []
energy = []
loud = []
speech = []
acous = []
live = []
valence = []
tempo = []
win = []
#Parse through the playlist data and add the values we are looking for, for every song in the playlist
for i in playlist_tracks['items']:
    track = i['track']
    name.append(track['name'])
    date.append(track['album']['release_date'])
    pop.append(track['popularity'])
    #Fetch the audio features once per track instead of once per characteristic
    features = sp.audio_features('spotify:track:' + track['id'])[0]
    dance.append(features['danceability'])
    energy.append(features['energy'])
    loud.append(features['loudness'])
    speech.append(features['speechiness'])
    acous.append(features['acousticness'])
    live.append(features['liveness'])
    valence.append(features['valence'])
    tempo.append(features['tempo'])
    win.append(1)
#Now access the nominees for song of the year for the past 20 years and repeat the same process
playlist_tracks = sp.user_playlist_tracks('22b7g6udnfkubfplnsla3ww5a', '4OzkYLxp0VzMNlYBXq2faU', fields='items,uri,name,id,total', market='fr')
for i in playlist_tracks['items']:
    track = i['track']
    name.append(track['name'])
    date.append(track['album']['release_date'])
    pop.append(track['popularity'])
    features = sp.audio_features('spotify:track:' + track['id'])[0]
    dance.append(features['danceability'])
    energy.append(features['energy'])
    loud.append(features['loudness'])
    speech.append(features['speechiness'])
    acous.append(features['acousticness'])
    live.append(features['liveness'])
    valence.append(features['valence'])
    tempo.append(features['tempo'])
    win.append(0)
#Create a dataframe and add all the columns made
songdata = pd.DataFrame()
songdata.insert(0, "Name", name, True)
songdata.insert(1, "Date Released", date, True)
songdata.insert(2, "Popularity", pop, True)
songdata.insert(3, "Danceability", dance, True)
songdata.insert(4, "Energy", energy, True)
songdata.insert(5, "Loudness", loud, True)
songdata.insert(6, "Speechiness", speech, True)
songdata.insert(7, "Acousticness", acous, True)
songdata.insert(8, "Liveness", live, True)
songdata.insert(9, "Valence", valence, True)
songdata.insert(10, "Tempo", tempo, True)
songdata.insert(11, "Did They Win?", win, True)
#Send it to a CSV file
songdata
songdata.to_csv('topsongs.csv', encoding='utf-8', index=False)
songdb = pd.read_csv("topsongs.csv")
songdb
Now we move to the Album of the Year nominees and winners of the past 20 years. Obviously, you can't make a playlist of albums, so our workaround was to make playlists containing one song from each album. From there, we can access the album's URI in the song's data and use that URI to get all the data we need.
The data collection process here is very similar, but there is one slight difference. Characteristics like danceability and energy only apply to songs, not albums, in the Spotify API. Therefore, to get the characteristics of an album, we average those characteristics over all the songs in the album. For example, we get the danceability of an album by averaging the danceabilities of every song on the album.
Also, there was a bug in the data with one of the Taylor Swift albums and its URI, so we manually put in its URI and name. Who knew she could be so problematic?
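Since this per-album averaging loop appears several times below, here is an optional helper sketch that computes the averages in one place. It assumes the authenticated sp client from above; the function name is our own, and the cells that follow keep the original explicit loops.
def album_feature_averages(album_id, features=('danceability', 'energy', 'loudness', 'speechiness',
                                               'acousticness', 'liveness', 'valence', 'tempo')):
    # Average the requested audio features over every track on the album
    totals = dict.fromkeys(features, 0.0)
    tracks = sp.album_tracks(album_id)['items']
    for t in tracks:
        feats = sp.audio_features('spotify:track:' + t['id'])[0]
        for f in features:
            totals[f] += feats[f]
    return {f: totals[f] / len(tracks) for f in features}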
#Get the songs for the winners once again from the playlist we made
playlist_albums = sp.user_playlist_tracks('22b7g6udnfkubfplnsla3ww5a', '3q7RrjHTPcG1pAL3j4QcJF', fields='items,uri,name,id,total', market='fr')
playlist_albums
#smh Taylor
albums = ['2dqn5yOQWdyGwOpOIi9O4x']
name = ['Fearless']
date = []
pop = []
dance = []
energy = []
loud = []
speech = []
acous = []
live = []
valence = []
tempo = []
win = []
#Populate the name/date/popularity arrays now since that data is accessible, and get a list of the album URIs
for i in playlist_albums['items']:
    name.append(i['track']['album']['name'])
    date.append(i['track']['album']['release_date'])
    pop.append(i['track']['popularity'])
    win.append(1)
    albums.append(i['track']['album']['id'])
#Again, smh Taylor
del albums[1]
del name[1]
danceavg = 0
energyavg = 0
loudavg = 0
speechavg = 0
acousavg = 0
liveavg = 0
valenceavg = 0
tempoavg = 0
count = 0
#Go through each album, and get the average characteristics for each album
for a in albums:
    songs = sp.album_tracks(a)
    for i in songs['items']:
        #One audio-features call per song, then accumulate each characteristic
        features = sp.audio_features('spotify:track:' + i['id'])[0]
        danceavg = danceavg + features['danceability']
        energyavg = energyavg + features['energy']
        loudavg = loudavg + features['loudness']
        speechavg = speechavg + features['speechiness']
        acousavg = acousavg + features['acousticness']
        liveavg = liveavg + features['liveness']
        valenceavg = valenceavg + features['valence']
        tempoavg = tempoavg + features['tempo']
        count = count + 1
    #Average over the album's tracks, then reset the accumulators for the next album
    dance.append(danceavg/count)
    energy.append(energyavg/count)
    loud.append(loudavg/count)
    speech.append(speechavg/count)
    acous.append(acousavg/count)
    live.append(liveavg/count)
    valence.append(valenceavg/count)
    tempo.append(tempoavg/count)
    danceavg = 0
    energyavg = 0
    loudavg = 0
    speechavg = 0
    acousavg = 0
    liveavg = 0
    valenceavg = 0
    tempoavg = 0
    count = 0
#Repeat the process for nominees
playlist_albums = sp.user_playlist_tracks('22b7g6udnfkubfplnsla3ww5a', '6ua8ZJ16YFgCou0ecqPgdV', fields='items,uri,name,id,total', market='fr')
playlist_albums
albums = []
for i in playlist_albums['items']:
    name.append(i['track']['album']['name'])
    date.append(i['track']['album']['release_date'])
    pop.append(i['track']['popularity'])
    win.append(0)
    albums.append(i['track']['album']['id'])
danceavg = 0
energyavg = 0
loudavg = 0
speechavg = 0
acousavg = 0
liveavg = 0
valenceavg = 0
tempoavg = 0
count = 0
for a in albums:
    songs = sp.album_tracks(a)
    for i in songs['items']:
        features = sp.audio_features('spotify:track:' + i['id'])[0]
        danceavg = danceavg + features['danceability']
        energyavg = energyavg + features['energy']
        loudavg = loudavg + features['loudness']
        speechavg = speechavg + features['speechiness']
        acousavg = acousavg + features['acousticness']
        liveavg = liveavg + features['liveness']
        valenceavg = valenceavg + features['valence']
        tempoavg = tempoavg + features['tempo']
        count = count + 1
    dance.append(danceavg/count)
    energy.append(energyavg/count)
    loud.append(loudavg/count)
    speech.append(speechavg/count)
    acous.append(acousavg/count)
    live.append(liveavg/count)
    valence.append(valenceavg/count)
    tempo.append(tempoavg/count)
    danceavg = 0
    energyavg = 0
    loudavg = 0
    speechavg = 0
    acousavg = 0
    liveavg = 0
    valenceavg = 0
    tempoavg = 0
    count = 0
albumdata = pd.DataFrame()
albumdata.insert(0, "Name", name, True)
albumdata.insert(1, "Date Released", date, True)
albumdata.insert(2, "Popularity", pop, True)
albumdata.insert(3, "Danceability", dance, True)
albumdata.insert(4, "Energy", energy, True)
albumdata.insert(5, "Loudness", loud, True)
albumdata.insert(6, "Speechiness", speech, True)
albumdata.insert(7, "Acousticness", acous, True)
albumdata.insert(8, "Liveness", live, True)
albumdata.insert(9, "Valence", valence, True)
albumdata.insert(10, "Tempo", tempo, True)
albumdata.insert(11, "Did They Win?", win, True)
albumdata
albumdata.to_csv('topalbums.csv', encoding='utf-8', index=False)
albumdb = pd.read_csv("topalbums.csv")
albumdb
Finally, we get the data for Best New Artist. Our workaround here was to make a playlist with any one song from each artist, then use that song to get the artist's URI to work with.
The data collection process is very similar once again, but the Spotify API exposes much less data relevant to this award. We simply retrieved each artist's name, follower count, and popularity to build models and predictions for this award.
#Same process as before, just use the song's data from the playlist to access the artist's data
#This is easier here because artist data is also included within the original playlist's data
playlist_artists = sp.user_playlist_tracks('22b7g6udnfkubfplnsla3ww5a', '6Wu1dIW3n13jsx0WoODnlQ', fields='items,uri,name,id,total', market='fr')
playlist_artists
name = []
followers = []
pop = []
win = []
for i in playlist_artists['items']:
    name.append(i['track']['album']['artists'][0]['name'])
    artist = sp.artist(i['track']['album']['artists'][0]['id'])
    followers.append(artist['followers']['total'])
    pop.append(artist['popularity'])
    win.append(1)
playlist_artists = sp.user_playlist_tracks('22b7g6udnfkubfplnsla3ww5a', '6ua8ZJ16YFgCou0ecqPgdV', fields='items,uri,name,id,total', market='fr')
playlist_artists
for i in playlist_artists['items']:
    name.append(i['track']['album']['artists'][0]['name'])
    artist = sp.artist(i['track']['album']['artists'][0]['id'])
    followers.append(artist['followers']['total'])
    pop.append(artist['popularity'])
    win.append(0)
artistdata = pd.DataFrame()
artistdata.insert(0, "Name", name, True)
artistdata.insert(1, "Followers", followers, True)
artistdata.insert(2, "Popularity", pop, True)
artistdata.insert(3, "Did They Win", win, True)
artistdata.to_csv('topartists.csv', encoding='utf-8', index=False)
artistdb = pd.read_csv("topartists.csv")
artistdb
We aren't done yet! We need to create these same 3 dataframes, but for the current nominees for each award. The data collection process is similar once again, so it is reproduced below.
playlist_tracks = sp.user_playlist_tracks('22b7g6udnfkubfplnsla3ww5a', '7InjGr6kiABbxyV5q9GPW1', fields='items,uri,name,id,total', market='fr')
name = []
date = []
pop = []
dance = []
energy = []
loud = []
speech = []
acous = []
live = []
valence = []
tempo = []
for i in playlist_tracks['items']:
    track = i['track']
    name.append(track['name'])
    date.append(track['album']['release_date'])
    pop.append(track['popularity'])
    features = sp.audio_features('spotify:track:' + track['id'])[0]
    dance.append(features['danceability'])
    energy.append(features['energy'])
    loud.append(features['loudness'])
    speech.append(features['speechiness'])
    acous.append(features['acousticness'])
    live.append(features['liveness'])
    valence.append(features['valence'])
    tempo.append(features['tempo'])
songdata = pd.DataFrame()
songdata.insert(0, "Name", name, True)
songdata.insert(1, "Date Released", date, True)
songdata.insert(2, "Popularity", pop, True)
songdata.insert(3, "Danceability", dance, True)
songdata.insert(4, "Energy", energy, True)
songdata.insert(5, "Loudness", loud, True)
songdata.insert(6, "Speechiness", speech, True)
songdata.insert(7, "Acousticness", acous, True)
songdata.insert(8, "Liveness", live, True)
songdata.insert(9, "Valence", valence, True)
songdata.insert(10, "Tempo", tempo, True)
songdata
songdata.to_csv('2020songs.csv', encoding='utf-8', index=False)
songdb2 = pd.read_csv("2020songs.csv")
songdb2
playlist_albums = sp.user_playlist_tracks('22b7g6udnfkubfplnsla3ww5a', '3uMdi0Gvd3rEp2aiWjy5jM', fields='items,uri,name,id,total', market='fr')
playlist_albums
albums = []
name = []
date = []
pop = []
dance = []
energy = []
loud = []
speech = []
acous = []
live = []
valence = []
tempo = []
for i in playlist_albums['items']:
    name.append(i['track']['album']['name'])
    date.append(i['track']['album']['release_date'])
    pop.append(i['track']['popularity'])
    albums.append(i['track']['album']['id'])
danceavg = 0
energyavg = 0
loudavg = 0
speechavg = 0
acousavg = 0
liveavg = 0
valenceavg = 0
tempoavg = 0
count = 0
for a in albums:
    songs = sp.album_tracks(a)
    for i in songs['items']:
        features = sp.audio_features('spotify:track:' + i['id'])[0]
        danceavg = danceavg + features['danceability']
        energyavg = energyavg + features['energy']
        loudavg = loudavg + features['loudness']
        speechavg = speechavg + features['speechiness']
        acousavg = acousavg + features['acousticness']
        liveavg = liveavg + features['liveness']
        valenceavg = valenceavg + features['valence']
        tempoavg = tempoavg + features['tempo']
        count = count + 1
    dance.append(danceavg/count)
    energy.append(energyavg/count)
    loud.append(loudavg/count)
    speech.append(speechavg/count)
    acous.append(acousavg/count)
    live.append(liveavg/count)
    valence.append(valenceavg/count)
    tempo.append(tempoavg/count)
    danceavg = 0
    energyavg = 0
    loudavg = 0
    speechavg = 0
    acousavg = 0
    liveavg = 0
    valenceavg = 0
    tempoavg = 0
    count = 0
albumdata = pd.DataFrame()
albumdata.insert(0, "Name", name, True)
albumdata.insert(1, "Date Released", date, True)
albumdata.insert(2, "Popularity", pop, True)
albumdata.insert(3, "Danceability", dance, True)
albumdata.insert(4, "Energy", energy, True)
albumdata.insert(5, "Loudness", loud, True)
albumdata.insert(6, "Speechiness", speech, True)
albumdata.insert(7, "Acousticness", acous, True)
albumdata.insert(8, "Liveness", live, True)
albumdata.insert(9, "Valence", valence, True)
albumdata.insert(10, "Tempo", tempo, True)
albumdata
albumdata.to_csv('2020albums.csv', encoding='utf-8', index=False)
albumdb2 = pd.read_csv("2020albums.csv")
albumdb2
playlist_artists = sp.user_playlist_tracks('22b7g6udnfkubfplnsla3ww5a', '5Ihltik64Mn9LOlhgya8UE', fields='items,uri,name,id,total', market='fr')
playlist_artists
name = []
followers = []
pop = []
for i in playlist_artists['items']:
    name.append(i['track']['album']['artists'][0]['name'])
    artist = sp.artist(i['track']['album']['artists'][0]['id'])
    followers.append(artist['followers']['total'])
    pop.append(artist['popularity'])
artistdata = pd.DataFrame()
artistdata.insert(0, "Name", name, True)
artistdata.insert(1, "Followers", followers, True)
artistdata.insert(2, "Popularity", pop, True)
artistdata
artistdata.to_csv('2020artists.csv', encoding='utf-8', index=False)
artistdb2 = pd.read_csv("2020artists.csv")
artistdb2
Let's create our own function to help later on when displaying the data. The function below takes two sets of values and plots them overlapping on a radar graph, also known as a spider chart. Pretty cool!
def radar_graphs(labels, stats, stats2, title):
    # Radar (spider) graph comparing two sets of values on the same polar axes
    angles = np.linspace(0, 2*np.pi, len(labels), endpoint=False)
    # Close each polygon by repeating its first value at the end
    stats = np.concatenate((stats, [stats[0]]))
    stats2 = np.concatenate((stats2, [stats2[0]]))
    angles = np.concatenate((angles, [angles[0]]))
    fig = plt.figure()
    ax = fig.add_subplot(111, polar=True)
    ax.plot(angles, stats, 'o-', linewidth=2, c="Blue")
    ax.fill(angles, stats, alpha=0.25, fc="Blue")
    ax.plot(angles, stats2, 'o-', linewidth=2, c="Yellow")
    ax.fill(angles, stats2, alpha=0.25, fc="Yellow")
    ax.set_thetagrids(angles[:-1] * 180/np.pi, labels)
    plt.ylim(0, 1)
    ax.set_title(title)
    ax.grid(True)
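As a quick sanity check, here is a toy call to radar_graphs; the labels match the ones we use later, and the two value arrays are made up purely to show the expected input shapes.
demo_labels = np.array(['Danceability', 'Energy', 'Speechiness', 'Acousticness', 'Liveness', 'Valence'])
radar_graphs(demo_labels,
             np.array([0.60, 0.70, 0.10, 0.20, 0.15, 0.50]),
             np.array([0.50, 0.55, 0.08, 0.35, 0.12, 0.45]),
             "Demo radar graph")
plt.show()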
Like we mentioned above, querying the API for every nomination across three categories over the past twenty years takes a lot of time! To make it easy for everyone to use the same data, we exported the data to CSV files once and read directly from those files for the remainder of the tutorial. Just run the cells from this point onward to save yourself some time.
songdb = pd.read_csv("topsongs.csv")
albumdb = pd.read_csv("topalbums.csv")
artistdb = pd.read_csv("topartists.csv")
nominations_songdb_2020 = pd.read_csv("2020songs.csv")
nominations_albumdb_2020 = pd.read_csv("2020albums.csv")
nominations_artistdb_2020 = pd.read_csv("2020artists.csv")
Best New Artist is the first category we predicted. We only found two distinguishing parameters: followers and popularity. Followers is the artist's Spotify follower count, and popularity is a score from 0 to 100 calculated by a proprietary algorithm that Spotify employs.
artistdb
Let's dive into using our data. We first wanted to see if we could predict the Best New Artist from the historical data. We couldn't find much data about each artist besides their follower count and popularity, so we ran with what we had, trained on a 90/10 train/test split, and compared the results to the actual outcomes.
# Top artist predictions
artist_data = artistdb.copy().iloc[:,1:-1]
# Scale down the raw follower counts so distance-based models like KNN aren't dominated by their huge magnitude
artist_data['Followers'] = artist_data['Followers']/10000
artist_target = artistdb.copy().iloc[:,-1:]
# artist_data
# Splits the data and target datasets into training and testing. Uses random seed to pick and choose data
X_train, X_test, y_train, y_test = train_test_split(artist_data, artist_target, test_size=0.1, random_state=11)
dt_clf = tree.DecisionTreeClassifier()
dt_clf.fit(X_train, y_train.values.ravel())
# tree.plot_tree(dt_clf.fit(song_data, song_target.values.ravel()))
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train.values.ravel())
rfc = ensemble.RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train.values.ravel())
display(X_test)
# Tests output with remaining 10%
print("decision tree: \t\t" + str(dt_clf.predict(np.c_[X_test])))
print("k nearest neighbors: \t" + str(knn.predict(np.c_[X_test])))
print("random forest: \t\t" + str(rfc.predict(np.c_[X_test])))
In the above example, the names of the artists were taken out to train the model, but the index numbers remain so we can look the artists up. In this case (and for the other awards), indices <= 19 are the winners and the remainder are non-winners. With this particular 9:1 split, our models predicted 1 of the 2 winners present in the held-out test data.
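Since these test predictions are just printed arrays, here is a small optional sketch for scoring them against the held-out labels. It assumes the X_test/y_test split and the fitted models from the cell above, and the same check works for the song and album splits later on.
# Compare each model's test predictions against the actual win/loss labels
for model_name, model in [("decision tree", dt_clf), ("k nearest neighbors", knn), ("random forest", rfc)]:
    preds = model.predict(X_test)
    print(model_name + " accuracy: " + str(metrics.accuracy_score(y_test.values.ravel(), preds)))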
To make the actual prediction, we trained a decision tree, a k-nearest neighbors classifier, and a random forest classifier on the entire artist nominee dataset from the past 20 years. Our models produced pretty much a three-way tie between Black Pumas, Tank and the Bangas, and Yola: all three are artists with lower popularity and a relatively low number of followers. Have you heard of them?
A bit more about the models we used can be found here:
# Artist Predictions
# Refits using 100% of data and runs model on current year nominees to get predictions for winners
dt_clf.fit(artist_data, artist_target.values.ravel())
knn.fit(artist_data, artist_target.values.ravel())
rfc.fit(artist_data, artist_target.values.ravel())
display(nominations_artistdb_2020)
artist_data_2020 = nominations_artistdb_2020.iloc[:,1:]
artist_data_2020['Followers'] = artist_data_2020['Followers']/10000
print("decision tree: \t\t" + str(dt_clf.predict(np.c_[artist_data_2020])))
print("k nearest neighbors: \t" + str(knn.predict(np.c_[artist_data_2020])))
print("random forest: \t\t" + str(rfc.predict(np.c_[artist_data_2020])))
In each of the model outputs above, a 1 means the model predicted a win and a 0 means it did not. Obviously, there cannot be more than one winner each year, and sometimes the models predict no winners at all (which also can't happen). Since there was not a whole lot of data to train on, we summed the number of models that predicted a win for each artist; when there were ties, we could at least narrow the field down to the artists we think have the best shot at winning. A sketch of that tally is shown below.
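Here is a minimal sketch of that tallying step, assuming the fitted models and the artist_data_2020 frame from the cell above; the "Model Votes" column name is our own.
# Sum how many of the three models predicted a win (1) for each 2020 nominee
votes = (dt_clf.predict(np.c_[artist_data_2020])
         + knn.predict(np.c_[artist_data_2020])
         + rfc.predict(np.c_[artist_data_2020]))
tally = nominations_artistdb_2020[['Name']].copy()
tally['Model Votes'] = votes
# Nominees with the most model votes are our best guesses for the winner
display(tally.sort_values('Model Votes', ascending=False))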
The Grammy Award for Song of the Year is presented for a single or a track and goes to the songwriter(s) who wrote the lyrics and melody. It judges the actual content of the song itself, not the recording.
For Song of the Year, we had a few more parameters to play around with, namely characteristics of the music itself. Once again we trained and tested on portions of the historical dataset, and later fit the models on the entire dataset to predict among the new nominees.
songdb
Once again (as with artists), we experimented with our models to see if they could actually predict winners. We did a 9:1 train/test split on the historical data using the three models outlined above. In the results below, row indices < 20 are historical winners.
# Best Song testing/training
print("Best song testing/training using 9:1")
# Splits the dataset into the parameters and the target
song_data = songdb.copy().iloc[:,2:-1]
song_target = songdb.copy().iloc[:,-1:]
# Splits the data and target datasets into training and testing, at 9:1 split. Uses random seed to pick and choose data
X_train, X_test, y_train, y_test = train_test_split(song_data, song_target, test_size=0.1, random_state=61)
display(X_test)
# Fits decision tree and other models using 90% of data
dt_clf = tree.DecisionTreeClassifier()
dt_clf.fit(X_train, y_train.values.ravel())
# tree.plot_tree(dt_clf.fit(X_train, y_train.values.ravel()))
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train.values.ravel())
rfc = ensemble.RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train.values.ravel())
# Tests output with remaining 10%
print("decision tree: \t\t" + str(dt_clf.predict(np.c_[X_test])))
print("k nearest neighbors: \t" + str(knn.predict(np.c_[X_test])))
print("random forest: \t\t" + str(rfc.predict(np.c_[X_test])))
To predict this year's winner, we fit all the models on the entire historical dataset from the past twenty years and ran them on this year's nominees.
And for the results: the models are split between "bad guy" and "Someone You Loved". Sometimes a model likes one over the other, and other times it likes both. Looking at their popularity, these two songs are tied at the top of the popularity leaderboard for the nominees, which may explain why the models like them both.
# Song Predictions
# Refits using 100% of data and runs model on current year nominees to get predictions for winners
print("Song predictions for 2020 Grammys")
dt_clf.fit(song_data, song_target.values.ravel())
knn.fit(song_data, song_target.values.ravel())
rfc.fit(song_data, song_target.values.ravel())
display(nominations_songdb_2020)
song_data_2020 = nominations_songdb_2020.iloc[:,2:]
print("decision tree: \t\t" + str(dt_clf.predict(np.c_[song_data_2020])))
print("k nearest neighbors: \t" + str(knn.predict(np.c_[song_data_2020])))
print("random forest: \t\t" + str(rfc.predict(np.c_[song_data_2020])))
This is the most prestigious award presented at the Grammys, given to "honor artistic achievement, technical proficiency and overall excellence in the recording industry, without regard to album sales, chart position, or critical reception."
More info about the actual award can be found here: https://www.grammy.com/grammys/news/2020-grammy-awards-complete-nominees-list#general
albumdb
Once again, we used a similar testing/training method as with the other two categories. If you're just looking for our prediction for this year, skip ahead to the next subheading.
# Best Album testing/training
# Splits the dataset into the parameters and the target
album_data = albumdb.copy().iloc[:,2:-1]
album_target = albumdb.copy().iloc[:,-1:]
# Splits the data and target datasets into training and testing, at 9:1 split. Uses random seed to pick and choose data
X_train, X_test, y_train, y_test = train_test_split(album_data, album_target, test_size=0.1, random_state=112)
display(X_test)
# Fits decision tree and other models to the 90%
dt_clf = tree.DecisionTreeClassifier()
dt_clf.fit(X_train, y_train.values.ravel())
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train.values.ravel())
rfc = ensemble.RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train.values.ravel())
# Prints results of testing these models to the 10%
print("decision tree: \t\t" + str(dt_clf.predict(np.c_[X_test])))
print("k nearest neighbors: \t" + str(knn.predict(np.c_[X_test])))
print("random forest: \t\t" + str(rfc.predict(np.c_[X_test])))
No thank you, next: it's Norman Fucking Rockwell! by Lana Del Rey (please excuse the language, that's the actual album name). Two of our models consistently ranked this album first, making it our pick for this year's winner.
# Best album predictions
# Refits using 100% of data and runs model on current year nominees to get predictions for winners
print("Album predictions for 2020 Grammys")
dt_clf.fit(album_data, album_target.values.ravel())
knn.fit(album_data, album_target.values.ravel())
rfc.fit(album_data, album_target.values.ravel())
display(nominations_albumdb_2020)
album_data_2020 = nominations_albumdb_2020.iloc[:,2:]
print("decision tree: \t\t" + str(dt_clf.predict(np.c_[album_data_2020])))
print("k nearest neighbors: \t"+ str(knn.predict(np.c_[album_data_2020])))
print("random forest: \t\t" + str(rfc.predict(np.c_[album_data_2020])))
Surely there's no "formula" to win a Grammy, but let's take a look at how the winners differentiated themselves, starting with Album of the Year. To do this, we calculated the means of the song characteristics for the winners and the non-winners (nominees). We then displayed most of the data in a radar graph and the complete dataset in a table. Some data points, such as tempo, loudness, and popularity, could not be displayed in the radar graph because they are not normalized to a 0-1 scale.
# Album characteristics
labels = np.array(['Danceability', 'Energy', 'Speechiness', 'Acousticness', 'Liveness', 'Valence'])
album_averages = albumdb.loc[:,labels].values.mean(axis=0)
album_winner_averages = albumdb.loc[albumdb['Did They Win?'] == 1,labels].values.mean(axis=0)
album_nominee_averages = albumdb.loc[albumdb['Did They Win?'] == 0,labels].values.mean(axis=0)
# Constructs table including info such as loudness and tempo
album_characteristics = pd.DataFrame([album_data.loc[album_target['Did They Win?'] == 0].values.mean(axis=0), album_data.loc[album_target['Did They Win?'] == 1].values.mean(axis=0)], columns = album_data.columns.values)
album_characteristics.insert(0, 'Category', ["Nominees", "Winners"])
# Outputs
radar_graphs(labels, album_nominee_averages, album_winner_averages, "Album characteristics for nominees(blue) and winners(yellow)")
display(album_characteristics)
In this case, winning albums definitely weren't the most popular ones, which leads us to believe there are a ton of hidden gems out there that stand out during this award ceremony, and that the judging system truly does ignore chart position and focus on the music itself. Winning albums tended to be less energetic, quieter, less speech-heavy, and noticeably more acoustic. These might not be the albums to blast at parties, but they may be the ones that focus more on the artistic and creative side of music.
# Song characteristics
labels = np.array(['Danceability', 'Energy', 'Speechiness', 'Acousticness', 'Liveness', 'Valence'])
song_averages = songdb.loc[:,labels].values.mean(axis=0)
song_winner_averages = songdb.loc[songdb['Did They Win?'] == 1,labels].values.mean(axis=0)
song_nominee_averages = songdb.loc[songdb['Did They Win?'] == 0,labels].values.mean(axis=0)
# Constructs table including info such as loudness and tempo
song_characteristics = pd.DataFrame([song_data.loc[song_target['Did They Win?'] == 0].values.mean(axis=0), song_data.loc[song_target['Did They Win?'] == 1].values.mean(axis=0)], columns = song_data.columns.values)
song_characteristics.insert(0, 'Category', ["Nominees", "Winners"])
# Outputs
radar_graphs(labels, song_nominee_averages, song_winner_averages, "Song characteristics for nominees(blue) and winners(yellow)")
display(song_characteristics)
Song of the Year ended up being quite different from Album of the Year. These winners tended to be the popular hits, with a good amount of loudness (especially compared to the albums as a whole). Otherwise, beyond the difference in popularity, it is hard to distinguish what makes a winner in this category from these metrics alone. This may be because a big part of the Song of the Year category is judging the songwriting and lyrics themselves, which is hard to quantify and does not show a strong relationship with these other metrics.
After the data collection, testing, and analysis of this year's nominees, our models predicted Best New Artist as a three-way tie between Black Pumas, Tank and the Bangas, and Yola. For Song of the Year, we predicted a tie between "bad guy" and "Someone You Loved", and our only Album of the Year prediction was "Norman Fucking Rockwell!".
From this process alone, it is pretty clear that deciding these awards is no small task. It takes months of deliberation for the Grammy committee to decide these winners, using metrics that both can and cannot be quantified. The metrics that can be quantified are undisclosed and would be challenging to obtain and analyze in a tutorial like this. We believe the Grammys most likely take into consideration things like production, performances, lyrics, sales, and the timeline of the artists, songs, and albums. This information (a) is not easily accessible and (b) would be very challenging to quantify and place into a machine learning model. The subjectivity of the Grammys comes in with the unquantifiable information they consider. Their in-depth analysis covers a wide range of ideas that are strongly based on intuition and personal opinion. Predicting the Grammys more accurately would require insight into what the current committee looks for and an understanding of music that goes beyond the song descriptors provided by Spotify.
All the metrics gathered through the Spotify API are based on current measures for those songs, albums, and artists. Finding these numbers for the years when they were nominated or won would give us a better picture of where they stood in the public eye at the time, instead of now. We would also hope to gather more information about the artists to make our features more in-depth and enhance our prediction models. It would also be beneficial to cross-reference data from other sources like Billboard, Apple Music, or YouTube, providing more data to use and a more representative sample of listeners around the world.
Tune in to the Recording Academy's 62nd Grammy Awards on Sunday, January 26, 2020, at 8pm EST to see how our predictions hold up!