This R notebook analyzes features of 10 different music genres using the data provided here: https://www.kaggle.com/datasets/purumalgi/music-genre-classification, which was likely scraped from the Spotify developer API. The dataset is particularly interesting and well suited to an analysis case study because most of its features are qualities of songs that Spotify has already quantified, such as danceability, energy, and liveness, so we don’t have to extract those features from raw audio ourselves. As the analysis shows, these features turn out to be meaningful.
# install.packages('devtools')
# install.packages('tidyverse', repos = "http://cran.us.r-project.org")
# devtools::install_github("hrbrmstr/hrbrthemes")
# install.packages('viridis')
# install.packages("ggbeeswarm")
# install.packages("fmsb")
# install.packages('scales')
# install.packages('Rpdb')
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(tidyr)
library(forcats)
library(hrbrthemes)
library(viridis)
## Loading required package: viridisLite
library(lubridate)
library(ggbeeswarm)
library(fmsb)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:viridis':
##
## viridis_pal
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(stringr)
library(Rpdb)
## Loading required package: rgl
##
## Attaching package: 'Rpdb'
##
## The following objects are masked from 'package:base':
##
## norm, replicate, unsplit
options(warn=-1)
df <- read.csv("data/music genre data.csv", header=TRUE)
df <- df %>% drop_na()
head(df)
## Artist.Name Track.Name instrumentalness
## 1 Bruno Mars That's What I Like (feat. Gucci Mane) 0.003910
## 2 Boston Hitch a Ride 0.004010
## 3 The Raincoats No Side to Fall In 0.000196
## 4 Deno Lingo (feat. J.I & Chunkz) 0.003910
## 5 Red Hot Chili Peppers Nobody Weird Like Me - Remastered 0.016100
## 6 The Stooges Search and Destroy - Iggy Pop Mix 0.006040
## danceability energy loudness mode speechiness acousticness liveness valence
## 1 0.854 0.564 -4.964 1 0.0485 0.017100 0.0849 0.8990
## 2 0.382 0.814 -7.230 1 0.0406 0.001100 0.1010 0.5690
## 3 0.434 0.614 -8.334 1 0.0525 0.486000 0.3940 0.7870
## 4 0.853 0.597 -6.528 0 0.0555 0.021200 0.1220 0.5690
## 5 0.167 0.975 -4.279 1 0.2160 0.000169 0.1720 0.0918
## 6 0.235 0.977 0.878 1 0.1070 0.003530 0.1720 0.2410
## tempo duration_in.min Popularity key time_signature Genre Class
## 1 134.071 3.910 60 1 4 HipHop 5
## 2 116.454 4.196 54 3 4 Rock 10
## 3 147.681 1.828 35 6 4 Indie Alt 6
## 4 107.033 2.899 66 10 4 HipHop 5
## 5 199.060 3.833 53 2 4 Rock 10
## 6 152.952 3.469 53 6 4 Indie Alt 6
For the first step of the analysis, I want to visualize the music features for each genre. The goal is to depict how each genre sounds by visualizing its properties, which are just the features. For example, I’d imagine that hip hop tracks are very danceable, and that metal tracks are loud and high in energy.
First, we group songs from the same genre together.
by_genre <- df %>% group_by(Genre)
Get the Alt music group to check that group_by() is working.
alt <- by_genre %>% filter(Genre=="Alt")
head(alt)
## # A tibble: 6 × 18
## # Groups: Genre [1]
## Artist.Name Track.Name instrumentalness danceability energy loudness mode
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Clairmont The … intres-ti… 0.0000289 0.796 0.441 -9.83 1
## 2 duendita Open Eyes 0.105 0.341 0.47 -10.1 1
## 3 Brandon Jack &… four days 0.0564 0.516 0.948 -3.99 1
## 4 Veruca Salt Seether 0.00153 0.612 0.86 -9.18 1
## 5 Nick Cave & Th… Deanna 0.00391 0.415 0.94 -4.87 1
## 6 BLAB Casual Sex 0.0268 0.709 0.685 -7.57 1
## # ℹ 11 more variables: speechiness <dbl>, acousticness <dbl>, liveness <dbl>,
## # valence <dbl>, tempo <dbl>, duration_in.min <dbl>, Popularity <dbl>,
## # key <int>, time_signature <int>, Genre <chr>, Class <int>
Now we select the columns that we want to visualize, which amounts to choosing the music features used to depict a genre. Here we include instrumentalness, danceability, energy, loudness, speechiness, acousticness, liveness, valence, and tempo. I’ll also include their definitions below, along with a dictionary that maps each term to its definition.
instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 db.
speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
First we get all column names of the dataset.
col_names <- colnames(df)
col_names
## [1] "Artist.Name" "Track.Name" "instrumentalness" "danceability"
## [5] "energy" "loudness" "mode" "speechiness"
## [9] "acousticness" "liveness" "valence" "tempo"
## [13] "duration_in.min" "Popularity" "key" "time_signature"
## [17] "Genre" "Class"
Then we select the columns that represent music features.
feature_names <- c(col_names[3:6], col_names[8:12])
feature_names
## [1] "instrumentalness" "danceability" "energy" "loudness"
## [5] "speechiness" "acousticness" "liveness" "valence"
## [9] "tempo"
Here we create a dictionary (a named character vector in R) that maps each feature to a short definition, which we use later to caption the plots.
definitions_helper <- c("instrumentalness"="Predicts whether a track contains no vocals.",
"danceability"="Describes how suitable a track is for dancing.",
"energy"="Represents a perceptual measure of intensity and activity.",
"loudness"="The overall loudness of a track in decibels (dB).",
"speechiness"="Speechiness detects the presence of spoken words in a track.",
"acousticness"="A confidence measure from 0.0 to 1.0 of whether the track is acoustic.",
"liveness"="Detects the presence of an audience in the recording.",
"valence"="A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.",
"tempo"="The overall estimated tempo of a track in beats per minute (BPM). "
)
Finally, it’s time to visualize each genre group. First, we define a function that normalizes a column’s values to the range 0 to 1. We apply it to loudness and tempo so that we can plot them on the same scale as the other features, which all range between 0 and 1.
Technically, normalization discards the measured values, which have physical meaning (decibels and BPM). But because min-max normalization is a linear rescaling, it preserves the relative spacing between songs’ values, so the shape of each column’s distribution is preserved and remains meaningful for our analysis.
normalize_col <- function(tb, col) {
col_min <- min(tb[, col])
col_max <- max(tb[, col])
col_range <- col_max - col_min
tb_return <- tb
tb_return[, col] <- (tb[, col] - col_min)/col_range
return(tb_return)
}
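As a quick sanity check of the min-max idea, here is a minimal sketch that applies the same formula to a made-up table. The `toy` data frame, its values, and the `min_max()` helper are assumptions for illustration only; they are not part of the notebook or the dataset.

```r
# Toy data (hypothetical values, not taken from the dataset).
toy <- data.frame(loudness = c(-20, -10, -5, 0), tempo = c(60, 90, 120, 180))

# Same min-max formula as normalize_col(), written for a single numeric vector.
min_max <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

toy_scaled <- toy
toy_scaled$loudness <- min_max(toy$loudness)
toy_scaled$tempo <- min_max(toy$tempo)

# The smallest value maps to 0, the largest to 1, and the ordering is unchanged.
print(toy_scaled)
```

One caveat worth keeping in mind: if every value in a column is identical, the range is 0 and the division produces NaN, so a guard would be needed for such columns.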
To plot a graph for each group, we can apply the group_map() function to the groups in by_genre. Inside the formula passed to group_map(), .x refers to the table for the current group without the column we grouped by, which in our case is Genre. For this reason, we also pass .y, the group key, to the plotting function so that it has access to the genre associated with each group.
The plotting function therefore takes two arguments: the first is the group table, and the second is the one-row key table whose Genre column holds the group’s genre.
The violin plot function from ggplot2 takes a categorical variable and a numerical variable. For this reason, we need to convert the music feature columns from wide format to long format. If you don’t know the difference between long and wide format, here is a short, nice guide for it. Basically, after converting, we get the feature names as values in one column and the associated values in another column. Since all rows in a group belong to the same genre, genre is not a useful categorical variable within the group; instead, we have one categorical column, namely the type of music feature, and one associated numerical column. Then we’re ready to create the violin plot!
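To make the wide-to-long conversion concrete, here is a small sketch using tidyr’s pivot_longer() on a made-up table. The `wide` data frame and its values are assumptions for illustration; the default output column names `name` and `value` are the ones the plotting code relies on.

```r
library(tidyr)

# Wide format: one row per song, one column per feature (made-up values).
wide <- data.frame(
  track = c("song_a", "song_b"),
  danceability = c(0.80, 0.35),
  energy = c(0.60, 0.90)
)

# Long format: one row per (song, feature) pair.
# By default the feature names go to a column called "name"
# and the numbers go to a column called "value".
long <- pivot_longer(wide, cols = c("danceability", "energy"))
print(long)
```

Two songs with two features each become four rows, which is the shape ggplot2 expects when the feature name is mapped to the x axis.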
Creating a violin plot is just like creating any other plot in ggplot2: we just need to use the plotting function geom_violin().
plot_tb <- function(tb, group_by_column) {
tb <- normalize_col(tb, 'loudness')
tb <- normalize_col(tb, 'tempo')
plot <- pivot_longer(tb, cols = feature_names)
plot$name <- factor(plot$name, feature_names)
tb_genre <- group_by_column$Genre
p <- plot %>%
ggplot( aes(x=name, y=value, fill=name, color=name)) +
geom_violin(width=2.1, size=0.2) +
scale_fill_viridis(discrete=TRUE) +
scale_color_viridis(discrete=TRUE) +
theme_ipsum() +
theme(
legend.position="none",
axis.text.x = element_text(angle = 30, hjust=0.8),
axis.title.x = element_text(size = 12, hjust=0.45)
) +
xlab("Music Features") +
ylab("Distribution") +
ggtitle(paste("Value Distribution for Genre: ", tb_genre))
p
}
The general form of the call is group_map(.data, .f, ...), where .f can be a formula such as ~f(.x, .y). According to the docs, inside the formula you can use:
. or .x to refer to the subset of rows of .tbl for the given group, i.e. the group’s data without the grouping column(s).
.y to refer to the key, a one row tibble with one column per grouping variable that identifies the group. If you grouped by more than one column, this table has one column for each of them.
So, given how we designed plot_tb(), the correct syntax is group_map(~plot_tb(.x, .y)).
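To see what .x and .y actually contain, here is a minimal sketch on a made-up two-group table. The `toy` data frame is an assumption for illustration; only group_by() and group_map() from dplyr are used.

```r
library(dplyr)

# Made-up table with two groups.
toy <- data.frame(
  Genre = c("Rock", "Rock", "Pop"),
  energy = c(0.9, 0.7, 0.5)
)

# .x is the group's rows without the Genre column;
# .y is a one-row tibble holding the group key.
summaries <- toy %>%
  group_by(Genre) %>%
  group_map(~ paste(.y$Genre, "has", nrow(.x), "row(s) and columns:",
                    paste(colnames(.x), collapse = ", ")))

print(summaries)
```

The result is a plain list with one element per group, which is why the notebook output for each group_map() call prints as [[1]], [[2]], and so on.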
by_genre %>% group_map(~plot_tb(.x, .y))
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
##
## [[7]]
##
## [[8]]
##
## [[9]]
##
## [[10]]
##
## [[11]]
After creating plots of all music features for each genre, I find it hard to tell the genres apart from these graphs. Yes, the distributions of the features can differ a lot within a genre, but looking at all of the graphs together, some features are low or similarly distributed for every genre, so a feature being low within one genre doesn’t tell us much when it is low across all genres. The graphs don’t do justice to each genre.
One problem might be that a violin plot of an entire column hides the more nuanced differences in specific, important statistics such as the median and quantiles. For example, if one genre’s median is 0.2 and another’s is 0.3, the difference is hard to see because the distributions may look nearly uniform, or because the scale of the graph makes it indistinguishable.
To solve this, I introduce the radar plot. You can find some good illustrations of it here. Basically, it forms a polygon on a radar background where each vertex sits at the numerical value of the category it represents. This way, we zoom in on specific summary statistics of each music feature and compare them at a finer scale to better depict the associated genre.
First, I’ll simply remove loudness and tempo from the feature names, because they are measured in physical units (dB and BPM) rather than on a 0 to 1 scale, and we can do without them anyway.
feature_names <- feature_names[! feature_names%in% c("loudness", "tempo")]
feature_names
## [1] "instrumentalness" "danceability" "energy" "speechiness"
## [5] "acousticness" "liveness" "valence"
As the data for the plot, I chose the median of each music feature for each genre. Because these are continuous variables, the mode doesn’t make sense. The mean can be pulled toward the tail of a skewed distribution, whereas the median is more robust to skewness and better captures the profile of the genre.
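A tiny sketch of why the median is the safer summary for skewed features. The values below are made up to mimic a feature where most songs sit near 0 and a few sit near 1; they are not taken from the dataset.

```r
# Made-up, right-skewed feature values: most songs are near 0, one is near 1.
instrumentalness <- c(0.001, 0.002, 0.004, 0.010, 0.950)

mean_value <- mean(instrumentalness)     # pulled upward by the single large value
median_value <- median(instrumentalness) # stays with the bulk of the data

print(c(mean = mean_value, median = median_value))
```

Here a single outlier lifts the mean far above where four of the five songs sit, while the median still describes a typical song.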
To build the radar plot, we need to specify the range of the radar axes, namely 1 (maximum) and 0 (minimum) in this case. These must be the first and second rows of the data frame, respectively, and the third row holds the actual values we want to visualize.
Then we just need to call the radarchart() function and add some aesthetics. Remember to install and load the fmsb library!
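The row layout described above can be sketched on a three-feature toy example. The `medians` values and the "Toy genre" title are assumptions for illustration; the chart is only drawn when fmsb is installed, so the data-building part runs either way.

```r
# Made-up medians for three features (illustrative values only).
medians <- c(danceability = 0.6, energy = 0.8, valence = 0.4)

# With the default maxmin = TRUE, fmsb::radarchart() reads row 1 as the axis
# maximum and row 2 as the axis minimum; every later row is drawn as a polygon.
radar_data <- as.data.frame(rbind(
  max = rep(1, length(medians)),
  min = rep(0, length(medians)),
  genre = medians
))
colnames(radar_data) <- names(medians)
print(radar_data)

# Draw the chart only if fmsb is installed, so the sketch still runs without it.
if (requireNamespace("fmsb", quietly = TRUE)) {
  fmsb::radarchart(radar_data, axistype = 1, title = "Toy genre")
}
```

Building the three rows with rbind() in one step is an alternative to filling an empty data frame row by row; both end up with the same layout.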
radar_plot_tb <- function(tb, group_by_column){
tb_sub <- tb[,feature_names]
medians <- c()
for (i in 1:length(feature_names)){
col_numeric <- as.numeric(as.character(unlist(tb_sub[,feature_names[i]])))
medians[i] <- median(col_numeric)
}
range_max <- rep(1, length(feature_names))
range_min <- rep(0, length(feature_names))
data_tb <- data.frame(matrix(ncol = length(feature_names), nrow = 0))
colnames(data_tb) <- feature_names
tb_genre <- group_by_column$Genre
data_tb[1,] <- range_max
data_tb[2,] <- range_min
data_tb[3,] <- medians
radarchart( data_tb, axistype=1 ,
#custom polygon
pcol="#779ecc" , pfcol="#9fc0de" , plwd=3,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,1,0.1), cglwd=0.8,
#custom labels
vlcex=0.8, title=tb_genre
)
}
The exciting bit - actually calling the function! You can definitely see that the depiction of each genre is much more pronounced using median values and radar plots.
by_genre %>% group_map(~radar_plot_tb(.x, .y))
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL
##
## [[6]]
## NULL
##
## [[7]]
## NULL
##
## [[8]]
## NULL
##
## [[9]]
## NULL
##
## [[10]]
## NULL
##
## [[11]]
## NULL
Another nice way to characterize each genre is to visualize and compare each music feature across genres.
This is much easier than the previous two attempts because we don’t need to wrangle the data before feeding it into a plotting function; the data is already in the right format. We just need a for loop to create the graph for each feature column. Let’s write the plotting function for a single column.
We used violin plots last time, and I love them because they show not only the quantiles but also roughly the actual distribution. A feature with multiple peaks shows up in a violin plot but not in a box plot. However, compared to a histogram, a violin plot also smooths over gaps, making it look as if there are values in a region even when there are none. That is why I overlay the violin plot with jittered points to give a more robust picture of the value distribution. To avoid cluttering the plot with too many points (the dataset is relatively big if we plot every point), I sample 500 rows from the dataset and use only those for the jittered points.
plot_column <- function(tb, numerical_column) {
genre <- tb$Genre
# Rescale loudness and tempo to [0, 1] so they share a scale with the other features.
if (numerical_column %in% c('loudness', 'tempo')) {
tb <- normalize_col(tb, numerical_column)
}
value <- tb[,numerical_column]
# Sample rows from tb itself (not the global df) for the jitter overlay.
tb_sample <- tb[sample(nrow(tb), 500), ]
tb_genre <- tb_sample$Genre
tb_value <- tb_sample[,numerical_column]
p <- ggplot(data=tb, aes(x=genre, y=value, fill=genre, color=genre)) +
geom_violin(width=2.1, size=0.2) +
geom_point(data=tb_sample, aes(x=tb_genre, y=tb_value, fill=tb_genre),
position = position_jitter(seed = 1, width = 0.2), color='grey', alpha=0.65) +
scale_fill_viridis(discrete=TRUE) +
scale_color_viridis(discrete=TRUE) +
theme_ipsum() +
theme(
legend.position="none",
axis.text.x = element_text(angle = 30, hjust=0.8),
axis.title.x = element_text(size = 12, hjust=0.45)
) +
xlab("Music Genres") +
ylab("Distribution") +
labs(caption=definitions_helper[numerical_column],
title=paste("Genre Value Distribution for Value: ", numerical_column))
print(p)
}
This is the for loop that will plot each feature across all genres.
for (col in feature_names) {
plot_column(df, col)
}
It’s very important for music businesses to understand what makes songs popular. Let’s visualize popularity in different ways to gain more insight into this attribute.
For this feature, I’m binning the data into 10 bins that each cover a range of 10, for example 0-10, 10-20, and so on up to 90-100. The popularity metric from the Spotify API ranges from 0 to 100, so binning the values this way preserves some granularity without creating too many bins.
Programmatically, we define the breaks and labels, and then use cut() to segment the data. It returns a new vector that tags each popularity value with one of the bin labels, which we call bins.
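One detail of cut() that is easy to miss is how it treats values that fall exactly on a break. The sketch below uses made-up popularity scores to show the default behavior; whether the dataset actually contains songs with a popularity of exactly 0 is not checked here, so treat this as a note on cut() rather than a finding about the data.

```r
breaks <- seq(0, 100, by = 10)

# Made-up popularity scores, including both edge values.
popularity <- c(0, 10, 11, 55, 100)

# Default: intervals are (lower, upper], so 10 falls in the first bin
# and 0 falls in no bin at all (it becomes NA).
default_bins <- cut(popularity, breaks = breaks)
print(default_bins)

# include.lowest = TRUE closes the first interval on the left, so 0 is kept.
closed_bins <- cut(popularity, breaks = breaks, include.lowest = TRUE)
print(closed_bins)
```

If zero-popularity songs should be counted in the first bin, passing include.lowest = TRUE to cut() is one way to do it.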
breaks <- seq(0, 100, by=10)
labels <- c()
for (i in 1:10) {
labels[i] <- paste(as.character(10*(i-1)), " - ", as.character(10*i))
}
print(labels)
## [1] "0 - 10" "10 - 20" "20 - 30" "30 - 40" "40 - 50"
## [6] "50 - 60" "60 - 70" "70 - 80" "80 - 90" "90 - 100"
bins <- cut(df$Popularity, breaks = breaks, labels = labels)
bins[1:10]
## [1] 50 - 60 50 - 60 30 - 40 60 - 70 50 - 60 50 - 60 40 - 50
## [8] 50 - 60 20 - 30 10 - 20
## 10 Levels: 0 - 10 10 - 20 20 - 30 30 - 40 40 - 50 ... 90 - 100
After we get bins, we need to count its unique values: what the unique values are, and how many times each appears. To achieve that, I convert table(bins) to a data frame with two columns, bins and Freq, where bins holds the unique values and Freq is the count of each.
unique_value_freq <- as.data.frame(table(bins))
unique_value_freq
## bins Freq
## 1 0 - 10 456
## 2 10 - 20 744
## 3 20 - 30 1820
## 4 30 - 40 3449
## 5 40 - 50 4246
## 6 50 - 60 2917
## 7 60 - 70 1839
## 8 70 - 80 1078
## 9 80 - 90 286
## 10 90 - 100 41
We can then easily plot the popularity distribution from this data frame, using geom_segment() to draw a vertical line for each value and geom_point() to add a dot at the end of each line so that the tip is emphasized. This kind of graph is called a lollipop chart because it resembles lollipops. I especially like it for comparing values because it makes the scale of each value clear and makes it easy to see which values are larger or smaller.
ggplot(unique_value_freq, aes(x=bins, y=Freq)) +
geom_segment( aes(x=bins, xend=bins, y=0, yend=Freq), color="grey") +
geom_point( color="#98c1d9", size=6) +
theme_light() +
theme(
panel.grid.major.x = element_blank(),
panel.border = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_text(angle = 30, hjust=0.6),
) +
xlab("Popularity Range") +
ylab("# of songs with popularity in range") +
labs(title="Popularity distribution in general")
Our visualization was successful, but note that this is the distribution for all songs, without grouping by genre. I’m curious how the popularity distribution of each genre differs from the general distribution. To find out, we create a lollipop chart with two series for each genre.
First, we need to scale down the general distribution somehow. The count in each bin is almost guaranteed to be much higher than the corresponding count for a single genre, simply because the full dataset is much larger than any one genre. The natural fix is to multiply each bin count by (number of rows in genre group)/(total number of rows), but that yields fractional counts that don’t correspond to actual songs. Instead, I chose to sample (number of rows in genre group) rows from the entire dataset, so that this subset serves as a sensible representation of the whole.
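The two options can be compared side by side on synthetic data. Everything in this sketch is an assumption for illustration: the popularity scores are random draws, and `group_size` is a hypothetical genre size.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Made-up popularity scores standing in for the full dataset.
all_popularity <- sample(1:100, 1000, replace = TRUE)
group_size <- 120  # hypothetical number of rows in one genre
breaks <- seq(0, 100, by = 10)

# Option 1: scale the full-data bin counts by group_size / total rows.
full_counts <- table(cut(all_popularity, breaks = breaks))
scaled_counts <- as.numeric(full_counts) * group_size / length(all_popularity)

# Option 2: draw group_size rows at random and count those.
sampled <- sample(all_popularity, group_size)
sampled_counts <- as.numeric(table(cut(sampled, breaks = breaks)))

print(data.frame(bin = names(full_counts),
                 scaled = scaled_counts,
                 sampled = sampled_counts))
```

Both versions sum to the group size. The scaled counts are deterministic but fractional, while the sampled counts are whole numbers that change from run to run unless a seed is set.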
To do that, we write a function called get_undersampled_unique_value_freq() that samples the dataset and computes the bin counts.
get_undersampled_unique_value_freq <- function(df, group_nrow) {
df_sample <- df[sample(nrow(df), group_nrow), ]
bins <- cut(df_sample$Popularity, breaks = breaks, labels = labels)
unique_value_freq <- as.data.frame(table(bins))
return(unique_value_freq)
}
Then we compute the popularity bin counts for the genre and attach them to the undersampled general bin counts, so that a single data frame feeds the plot. The syntax is much the same as for the general distribution plot. I used a red-orange color for the genre’s bin counts so that it naturally draws the eye, with the general distribution’s bin counts still available in blue as a reference.
lolipop_plot_2groups <- function(group_tb, group_by_column) {
unique_value_freq_undersampled <- get_undersampled_unique_value_freq(df, nrow(group_tb))
group_bins <- cut(group_tb$Popularity, breaks = breaks, labels = labels)
group_unique_value_freq <- as.data.frame(table(group_bins))
unique_value_freq_undersampled$Freq_group <- group_unique_value_freq$Freq
tb_genre <- group_by_column$Genre
ggplot(unique_value_freq_undersampled) +
geom_segment( aes(x=bins, xend=bins, y=Freq, yend=Freq_group), color="grey") +
geom_point( aes(x=bins, y=Freq), color="#9fc0de", size=5, alpha=0.65 ) +
geom_point( aes(x=bins, y=Freq_group), color="#FF4433", size=5, alpha=0.5 ) +
theme_ipsum() +
theme(
panel.grid.major.x = element_blank(),
panel.border = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_text(angle = 30, hjust=0.6),
) +
xlab("Popularity Range") +
ylab("# of songs with popularity in range") +
ggtitle(paste("Popularity distribution in general vs. ", tb_genre))
}
df_bins <- df
df_bins$bins <- bins
df_bins %>% group_by(Genre) %>% group_map(~lolipop_plot_2groups(.x, .y))
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
##
## [[7]]
##
## [[8]]
##
## [[9]]
##
## [[10]]
##
## [[11]]
The genre popularity distributions seem pretty close to the general popularity distribution. One question is: do popular songs have a different music feature distribution than other songs? We can level up the single-polygon radar chart we built earlier and overlay two polygons in one chart, one for popular-song stats and another for all-song stats.
Here, we first add Popularity to feature_names so that we get that column when subsetting the data.
feature_names <- append(feature_names, "Popularity")
feature_names
## [1] "instrumentalness" "danceability" "energy" "speechiness"
## [5] "acousticness" "liveness" "valence" "Popularity"
To follow coding best practices, we move the part that builds the data frame for the radar chart out of the plotting function and into a function of its own. We use the median for now. Because popular songs might differ more in skewness or in their extremes (max/min), it might be worthwhile to try those later too.
build_data_tb <- function(tb, popularity_threshold){
tb_sub <- tb[,feature_names]
tb_filtered <- filter(tb_sub, Popularity >= popularity_threshold)
medians_filtered <- c()
medians_sub <- c()
feature_names_wo_popularity <- feature_names[1:7]
for (i in 1:length(feature_names_wo_popularity)){
filtered_col_numeric <- as.numeric(as.character(unlist(tb_filtered[,feature_names[i]])))
medians_filtered[i] <- median(filtered_col_numeric)
sub_col_numeric <- as.numeric(as.character(unlist(tb_sub[,feature_names[i]])))
medians_sub[i] <- median(sub_col_numeric)
}
range_max <- rep(1, length(feature_names_wo_popularity))
range_min <- rep(0, length(feature_names_wo_popularity))
data_tb <- data.frame(matrix(ncol = length(feature_names_wo_popularity), nrow = 0))
data_tb[1,] <- medians_filtered
data_tb[2,] <- medians_sub
colnames(data_tb) <- feature_names_wo_popularity
rownames(data_tb) <- c("popular songs", "all songs")
data_tb <- rbind(range_max , range_min , data_tb)
return(data_tb)
}
We set a threshold for popular songs and plot the data. The scales library allows us to set the transparency of a color without needing an alpha argument to be built into the radarchart function.
popularity_threshold <- 80
radar_plot_tb_popular <- function(tb, group_by_column){
tb_genre <- group_by_column$Genre
data_tb <- build_data_tb(tb, popularity_threshold)
colors_border=c( scales::alpha("#FF6A4C", 0.6), scales::alpha("#779ecc", 0.5) )
colors_in=c( scales::alpha("#F99244", 0.6), scales::alpha("#9fc0de", 0.5) )
radarchart( data_tb , axistype=0,
#custom polygon
pcol=colors_border, pfcol=colors_in , plwd=2 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="black", cglwd=0.8, caxislabels=seq(0,1,0.1),
#custom labels
vlcex=0.8, title=tb_genre
)
}
After plotting this, it seems that the popular songs still have roughly the same feature profile as their genre as a whole. Next, let’s replace the median with other quantiles to capture the more extreme cases in the popular-songs subset.
by_genre %>% group_map(~radar_plot_tb_popular(.x, .y))
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL
##
## [[6]]
## NULL
##
## [[7]]
## NULL
##
## [[8]]
## NULL
##
## [[9]]
## NULL
##
## [[10]]
## NULL
##
## [[11]]
## NULL
To further organize our code, I created a function that takes a quantile probability and calculates the stats.
calculate_data <- function(tb_sub, tb_filtered, prob) {
filtered <- c()
sub <- c()
feature_names_wo_popularity <- feature_names[1:7]
for (i in 1:length(feature_names_wo_popularity)){
filtered_col_numeric <- as.numeric(as.character(unlist(tb_filtered[,feature_names[i]])))
sub_col_numeric <- as.numeric(as.character(unlist(tb_sub[,feature_names[i]])))
filtered[i] <- quantile(filtered_col_numeric, probs=prob)
sub[i] <- quantile(sub_col_numeric, probs=prob)
}
return(list("filtered"=filtered, "sub"=sub))
}
Then, I refactored the build_data_tb() function to accept a probability argument and compute that quantile for both the group table and the subset of popular songs in the group.
build_data_tb_with_measure <- function(tb, popularity_threshold, prob){
tb_sub <- tb[,feature_names]
tb_filtered <- filter(tb_sub, Popularity >= popularity_threshold)
stats <- calculate_data(tb_sub, tb_filtered, prob)
feature_names_wo_popularity <- feature_names[1:7]
range_max <- rep(1, length(feature_names_wo_popularity))
range_min <- rep(0, length(feature_names_wo_popularity))
data_tb <- data.frame(matrix(ncol = length(feature_names_wo_popularity), nrow = 0))
data_tb[1,] <- stats$filtered
data_tb[2,] <- stats$sub
colnames(data_tb) <- feature_names_wo_popularity
rownames(data_tb) <- c("popular songs", "all songs")
data_tb <- rbind(range_max , range_min , data_tb)
return(data_tb)
}
With the data table built by build_data_tb_with_measure(), the plotting function can cleanly get the data it needs, with the flexibility to specify various quantiles. Note that if we want to look at maximums, we just need to set prob = 1.
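A quick check of how quantile() relates to the familiar summaries, using made-up values (the `energy` vector is an assumption for illustration):

```r
# Made-up feature values for illustration.
energy <- c(0.2, 0.4, 0.5, 0.7, 0.9)

q_max <- quantile(energy, probs = 1)      # same value as max(energy)
q_median <- quantile(energy, probs = 0.5) # same value as median(energy)
q_min <- quantile(energy, probs = 0)      # same value as min(energy)

print(c(q_min, q_median, q_max))
```

This is why a single prob argument is enough to cover the median, the maximum, and anything in between.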
radar_plot_tb_popular_with_measure <- function(tb, group_by_column, prob){
tb_genre <- group_by_column$Genre
data_tb <- build_data_tb_with_measure(tb, popularity_threshold, prob)
colors_border=c( scales::alpha("#FF6A4C", 0.6), scales::alpha("#779ecc", 0.5) )
colors_in=c( scales::alpha("#F99244", 0.6), scales::alpha("#9fc0de", 0.5) )
type_title <- ""
if (prob == 1) {
type_title <- "Max"
} else {
type_title <- paste("Quantile: ", as.character(prob))
}
radarchart( data_tb , axistype=0,
#custom polygon
pcol=colors_border, pfcol=colors_in , plwd=2 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="black", cglwd=0.8, caxislabels=seq(0,1,0.1),
#custom labels
vlcex=0.8, title=paste(tb_genre, type_title)
)
}
Now we can start experimenting with different quantiles.
by_genre %>% group_map(~radar_plot_tb_popular_with_measure(.x, .y, prob=0.75))
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL
##
## [[6]]
## NULL
##
## [[7]]
## NULL
##
## [[8]]
## NULL
##
## [[9]]
## NULL
##
## [[10]]
## NULL
##
## [[11]]
## NULL
by_genre %>% group_map(~radar_plot_tb_popular_with_measure(.x, .y, prob=1))
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL
##
## [[6]]
## NULL
##
## [[7]]
## NULL
##
## [[8]]
## NULL
##
## [[9]]
## NULL
##
## [[10]]
## NULL
##
## [[11]]
## NULL
by_genre %>% group_map(~radar_plot_tb_popular_with_measure(.x, .y, prob=0.25))
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL
##
## [[6]]
## NULL
##
## [[7]]
## NULL
##
## [[8]]
## NULL
##
## [[9]]
## NULL
##
## [[10]]
## NULL
##
## [[11]]
## NULL
by_genre %>% group_map(~radar_plot_tb_popular_with_measure(.x, .y, prob=0.85))
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL
##
## [[6]]
## NULL
##
## [[7]]
## NULL
##
## [[8]]
## NULL
##
## [[9]]
## NULL
##
## [[10]]
## NULL
##
## [[11]]
## NULL
It does seem like the difference in stats data is a bit more extreme when selecting more extreme percentiles!
In the Spotify API, the key of a song ranges from -1 to 11: -1 means no key was detected, and 0 to 11 map to pitch classes using standard pitch class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. It’s an important property of a song, and it would be interesting to see how the key distribution of each genre compares to that of the whole dataset.
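For readable axis labels, the integer keys can be translated to pitch class names. This is a minimal sketch assuming the standard pitch class convention where 0 = C and 11 = B; the `pitch_classes` vector and the `key_to_pitch()` helper are hypothetical names introduced here, and whether this dataset’s key column follows the same indexing is an assumption that should be checked against the data.

```r
# Pitch class names indexed by key value 0..11 (standard pitch class notation).
pitch_classes <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                   "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

# Translate integer keys to names; -1 (no key detected) and any other
# out-of-range value become NA.
key_to_pitch <- function(key) {
  valid <- !is.na(key) & key >= 0 & key <= 11
  out <- rep(NA_character_, length(key))
  out[valid] <- pitch_classes[key[valid] + 1]  # R vectors are 1-indexed
  out
}

print(key_to_pitch(c(0, 1, 2, 11, -1)))
```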
Here, we combine the undersampling, getting unique value counts, and converting wide format to long format techniques we used before to create the data table for plotting.
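create_keys_data_tb <- function(group_tb, group_nrow) {
  # Fix the set of key levels so both tables line up even if a key is absent
  key_levels <- sort(unique(df$key))
  # Undersample the full dataset to the size of the genre subgroup
  df_sample <- df[sample(nrow(df), group_nrow), ]
  data_tb <- as.data.frame(table(factor(df_sample$key, levels = key_levels)))
  data_tb$group_key <- as.vector(table(factor(group_tb$key, levels = key_levels)))
  colnames(data_tb) <- c("key", "all_key", "group_key")
  # Convert from wide to long format for plotting
  data_tb <- pivot_longer(data_tb, cols = c("all_key", "group_key"))
  data_tb$name <- factor(data_tb$name, c("all_key", "group_key"))
  return(data_tb)
}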
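One thing to watch for when placing two `table()` results side by side: `table()` only reports values that actually occur, so a subgroup that is missing a key yields a shorter count vector than the full sample. A minimal sketch of one way to guard against this, on made-up data, is to convert both vectors to factors with a shared set of levels first. The vectors `all_keys` and `genre_keys` are illustrative and not taken from the dataset.

```r
# Toy data: the genre subgroup happens to contain no songs in key 3
all_keys   <- c(1, 2, 2, 3, 3, 3)
genre_keys <- c(1, 1, 2)

# Shared levels keep both tables the same length and in the same order
key_levels   <- sort(unique(all_keys))
all_counts   <- table(factor(all_keys,   levels = key_levels))
genre_counts <- table(factor(genre_keys, levels = key_levels))

data.frame(
  key       = key_levels,
  all_key   = as.vector(all_counts),
  group_key = as.vector(genre_counts)  # key 3 is counted as 0, not dropped
)
```

Without the shared levels, `table(genre_keys)` would have only two entries and could not be attached to a three-row data frame.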
Here we plot a grouped bar chart with two series: the key distribution for all songs and the key distribution for a genre subgroup.
bar_plot_tb_compare <- function(tb, group_by_column) {
  data_tb <- create_keys_data_tb(tb, nrow(tb))
  tb_genre <- group_by_column$Genre
  # Dodged bars: overall sample vs. genre subgroup, side by side for each key
  p <- data_tb %>%
    ggplot(aes(key, value, fill = name)) +
    geom_bar(stat = "identity", position = "dodge") +
    scale_fill_manual(values = c("#4793AF", "#FFC470")) +
    theme_ipsum() +
    ylab("value counts") +
    ggtitle(paste("Key distribution: All songs vs. Genre:", tb_genre))
  # Return the plot explicitly so group_map() can collect and print it
  return(p)
}
Looking at the graphs for each genre, we can see that some genres have a somewhat different key distribution compared to the overall distribution, while others are relatively consistent with it.
by_genre %>% group_map(~bar_plot_tb_compare(.x, .y))
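The visual comparison can be complemented with a single number per genre. One option, not used in this notebook, is the total variation distance between the two key distributions: half the sum of the absolute differences in proportions, which ranges from 0 (identical distributions) to 1 (no overlap). The sketch below uses made-up counts, and `key_distance` is a hypothetical helper.

```r
# Hypothetical helper: total variation distance between two count vectors
# that are aligned on the same keys
key_distance <- function(all_counts, group_counts) {
  p <- all_counts / sum(all_counts)      # proportions for all songs
  q <- group_counts / sum(group_counts)  # proportions for the genre
  0.5 * sum(abs(p - q))
}

# Made-up counts for four keys
all_counts   <- c(25, 25, 25, 25)
same_genre   <- c(10, 10, 10, 10)
skewed_genre <- c(40, 0, 0, 0)

key_distance(all_counts, same_genre)    # 0: same shape as the overall distribution
key_distance(all_counts, skewed_genre)  # 0.75: concentrated in a single key
```

Because the comparison sample is drawn at random, a small nonzero distance is expected even for a genre that follows the overall distribution.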
Another question I am interested in is: who are the popular artists in this dataset? One option is to average the popularity attribute over each artist's songs. Here, I used a different and seemingly simpler measure that makes intuitive sense: how many songs the artist has in this dataset. I could have chosen the former definition, but I wanted to practice writing the code for the latter.
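For comparison, the average-popularity definition could be sketched as follows. This is a minimal example on made-up data, not the notebook's analysis: the `songs` data frame is illustrative, and the `Popularity` column name is an assumption about how the dataset is loaded.

```r
# Toy data; `Popularity` is an assumed column name
songs <- data.frame(
  Artist.Name = c("A", "A", "B", "B", "B", "C"),
  Popularity  = c(60, 80, 50, 55, 45, 90)
)

# Average popularity per artist
artist_pop <- aggregate(Popularity ~ Artist.Name, data = songs, FUN = mean)

# Add the song count, the measure used in this notebook, for comparison
song_counts <- table(songs$Artist.Name)
artist_pop$n_songs <- as.vector(song_counts[as.character(artist_pop$Artist.Name)])

# Rank by average popularity, highest first
artist_pop <- artist_pop[order(artist_pop$Popularity, decreasing = TRUE), ]
row.names(artist_pop) <- NULL
artist_pop
```

The two measures can disagree: in the toy data, artist C ranks first by average popularity with a single song, while artist B has the most songs but the lowest average.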
First, I counted how frequently each artist appears in the dataset, ranked the artists in descending order, and selected the top 20 artists along with their number of songs.
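# Count the songs per artist
artist_freq <- as.data.frame(table(df$Artist.Name))
# Rank artists by song count, highest first
artist_freq <- artist_freq[order(artist_freq$Freq, decreasing = TRUE),]
colnames(artist_freq) <- c("Artist.Name", "Freq")
row.names(artist_freq) <- NULL
# Keep the 20 artists with the most songs (naming Freq avoids the "Selecting by" message)
artist_freq <- artist_freq %>% top_n(20, Freq)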
artist_freq
## Artist.Name Freq
## 1 Backstreet Boys 69
## 2 Westlife 60
## 3 Britney Spears 54
## 4 The Rolling Stones 38
## 5 U2 30
## 6 Lata Mangeshkar 28
## 7 Metallica 27
## 8 AC/DC 26
## 9 The Beatles 25
## 10 The Black Keys 24
## 11 Fleetwood Mac 22
## 12 Kishore Kumar 22
## 13 Led Zeppelin 22
## 14 Mohammed Rafi 22
## 15 Nirvana 22
## 16 Pearl Jam 22
## 17 Coldplay 21
## 18 Creedence Clearwater Revival 21
## 19 The Killers 20
## 20 Aerosmith 19
The only thing left to do is to create the graph. I used a bar chart because it can be made to look like a typical ranking visualization: horizontal bars, one per artist, with the length representing the value and a numerical label next to each bar. Furthermore, I highlighted the top 3 artists with a bright red-orange color and made their names bold. I removed reference elements such as the background, axis ticks, and border so that the bars are the only visual element in the graph.
# Rows are in descending order of Freq, so the first 3 bars are the top 3 artists
colors <- c(rep("#FF5349", 3), rep("#9fc0de", 17))
# Axis labels follow the factor levels (ascending Freq), so the last 3 are the top 3
text_formats <- c(rep("plain", 17), rep("bold", 3))
artist_freq %>%
  mutate(Artist.Name = fct_reorder(Artist.Name, Freq)) %>%
  ggplot(aes(x = Artist.Name, y = Freq)) +
  geom_bar(stat = "identity", fill = colors, alpha = .8, width = .75) +
  # Numeric label just past the end of each bar
  geom_text(aes(label = Freq), nudge_y = 3, size = 3.75, color = "#808080") +
  coord_flip() +
  # Strip the grid, background, ticks, and value axis so only the bars remain
  theme(
    panel.grid.major.x = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank(),
    axis.ticks.x = element_blank(),
    axis.ticks.y = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_text(face = text_formats),
    axis.title.x = element_text(color = "#808080"),
    axis.title.y = element_blank(),
    plot.title = element_text(color = "#5a5a5a")) +
  ylab("number of songs") +
  ggtitle("Artist Rank Top 20")
This notebook conducted exploratory data analysis and data visualization on the music genre feature dataset. Focusing on a variety of features and questions, it created a range of graph types, such as violin plots with jittered points, lollipop charts, and radar charts, each chosen to suit the specific question being asked. Alongside the visualizations, different techniques, library functions, and statistical concepts were applied to prepare the data and surface insights from the graphs. Sometimes meaningful insights were found about what makes songs of a particular genre or popularity level distinctive. Other times, the analysis confirmed that a genre's feature statistics are consistent with the overall statistics for all songs.