Spotify Song Classification

Machine Learning | Unsupervised Machine Learning | Python | Scikit-Learn | March 2024 - June 2024

In this project, I grouped songs into genre-like clusters using the Spotify 1.2M+ Songs dataset. The goal was to group songs based on shared musical features such as danceability, tempo, acousticness, and valence. I hypothesized that songs within the same genre would exhibit similar patterns in these features.

To begin, I cleaned the dataset by removing non-numerical columns and scaling the numerical data. Because of the large number of features, I used Principal Component Analysis (PCA) to reduce the dataset’s dimensionality. I then used t-SNE and UMAP to visualize the reduced data.
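The cleaning and scaling step can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the tiny `songs` table and its column names are assumptions standing in for the real dataset, and `StandardScaler` is one reasonable choice of scaler (the write-up does not say which one was used).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny synthetic stand-in for the Spotify table (illustrative columns only).
songs = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "artists": ["w", "x", "y", "z"],
    "danceability": [0.80, 0.35, 0.62, 0.91],
    "tempo": [120.0, 86.5, 140.2, 101.3],
    "acousticness": [0.10, 0.85, 0.33, 0.05],
    "valence": [0.70, 0.20, 0.55, 0.90],
})

# Keep only the numerical columns and drop rows with missing values.
numeric = songs.select_dtypes(include="number").dropna()

# Standardize each feature to zero mean and unit variance so that
# large-scale features such as tempo do not dominate distance-based methods.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(numeric)
```

Scaling matters here because PCA and the clustering methods used later all depend on distances or variances, which are sensitive to the units of each feature.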

Principal Component Analysis (PCA) & Visualization

I used PCA to reduce the original 24 features into principal components that captured the majority of the dataset’s variance. I then created a 3D scatter plot of the top three principal components, which helped to reveal clusters of songs with similar musical properties.

The Scree Plot shows the explained variance against the number of principal components. By the elbow method, the elbow of the curve falls at three components: the explained variance drops sharply up to that point and flattens afterwards, indicating that three principal components capture the majority of the variance in the dataset. Three principal components are therefore a suitable choice for dimensionality reduction in this dataset.
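The numbers behind a scree plot can be computed along these lines. This is a hedged sketch on synthetic data: the 24-column matrix built from three latent factors is an assumption made only so that the example has a visible elbow, and the plotting itself is left out.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 500 rows x 24 features driven by 3 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 24))
X = latent @ mixing + 0.3 * rng.normal(size=(500, 24))
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components to get the values a scree plot would show.
pca_full = PCA().fit(X_scaled)
explained = pca_full.explained_variance_ratio_  # per-component share of variance
cumulative = np.cumsum(explained)               # running total across components

# Keep the first three components for plotting and downstream steps.
pca3 = PCA(n_components=3)
X_pca = pca3.fit_transform(X_scaled)
```

Plotting `explained` against the component index gives the scree curve, and `cumulative[2]` is the share of variance retained by three components.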

t-SNE and UMAP for Cluster Insights

Following PCA, I applied t-SNE and UMAP to the PCA-reduced data for more in-depth visualization. These techniques gave me a better view of both local and global patterns, revealing densely packed clusters of songs as well as outliers.

Figure: t-SNE plot. The t-SNE visualization of the PCA-reduced data illustrates the distribution of songs based on their musical features. While PCA reduced the dataset’s dimensionality, t-SNE further highlights local clusters and patterns within the data, revealing potential groupings that may correspond to different song genres.
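Embedding PCA-reduced data with t-SNE might look like the sketch below. The synthetic blobs, the perplexity value, and the random seed are all assumptions for illustration, not the settings used in the project.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled song features (300 rows x 24 features).
X, _ = make_blobs(n_samples=300, n_features=24, centers=4, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# First reduce to three principal components, as described above.
X_pca = PCA(n_components=3).fit_transform(X_scaled)

# Then embed the PCA-reduced data into 2D with t-SNE for plotting.
# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
embedding = tsne.fit_transform(X_pca)
```

UMAP is not part of scikit-learn; it is provided by the separate umap-learn package and could be applied to `X_pca` in the same position in the pipeline.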

Clustering Techniques: K-Means, DBSCAN, Spectral Clustering

Guided by the t-SNE and UMAP visualizations, I applied several clustering algorithms to group the songs:

K-Means: I chose the value of k by evaluating the silhouette score for different cluster numbers. This method performed well but struggled with clusters that are not roughly spherical.

DBSCAN: I applied DBSCAN to identify irregularly shaped clusters and outliers, though it was less effective at distinguishing densely packed genres.

Spectral Clustering: This method provided the most distinct clusters, identifying 9 clear groups of songs. By analyzing each cluster, I was able to hypothesize the genres, including Rap, Rock, and Electronic Dance Music, based on shared audio features.
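The three approaches above can be compared on a toy dataset as follows. This is a sketch under stated assumptions: the three synthetic blobs, the candidate range for k, and the `eps`, `min_samples`, and `gamma` values are illustrative choices, not the project's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans, SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2D stand-in for an embedding: three well-separated blobs.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [-6, 6]],
    cluster_std=1.0,
    random_state=0,
)

# K-Means: pick k by maximizing the silhouette score over a candidate range.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)

# DBSCAN: density-based; points labelled -1 are treated as noise/outliers.
dbscan_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
n_dbscan_clusters = len(set(dbscan_labels) - {-1})

# Spectral Clustering: clusters on a similarity graph built with an RBF kernel.
spectral = SpectralClustering(
    n_clusters=3, affinity="rbf", gamma=0.1, random_state=0
)
spectral_labels = spectral.fit_predict(X)
```

On real song data the clusters overlap far more than these blobs do, so the parameter values would need tuning, and Spectral Clustering on the full 1.2M+ rows would likely require subsampling because it builds a pairwise similarity matrix.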

Figure: K-Means plot. K-Means clustering applied to the t-SNE embedding groups the dataset into three distinct clusters. Each color represents a separate cluster, indicating how songs with similar musical features are grouped together.

Analysis and Interpretation

The analysis using t-SNE and UMAP provided valuable insights into the structure of the Spotify dataset, revealing overlapping and non-distinct musical genres. Although the embeddings did not resolve fine-grained local structure into cleanly separated genres, they were useful for exploring broader patterns and identifying outliers. Traditional clustering methods like K-Means and DBSCAN were less effective because the data is so densely packed, while Spectral Clustering performed best, identifying nine distinct clusters. A qualitative analysis of these clusters suggested potential genre groupings such as Rap, Rock, and Electronic Dance Music, highlighting the ability of Spectral Clustering to capture the nuances within the dataset.
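One way to support this kind of qualitative cluster analysis is to compare per-cluster feature averages against the overall averages. The sketch below uses a small made-up table; the feature values and cluster labels are assumptions for illustration only and do not reflect the project's results.

```python
import pandas as pd

# Made-up feature values with an assigned cluster label per song.
features = pd.DataFrame({
    "danceability": [0.85, 0.80, 0.30, 0.35, 0.60, 0.65],
    "acousticness": [0.05, 0.10, 0.80, 0.75, 0.20, 0.25],
    "tempo": [128.0, 126.0, 80.0, 84.0, 100.0, 104.0],
    "cluster": [0, 0, 1, 1, 2, 2],
})

# Mean of every feature within each cluster.
profile = features.groupby("cluster").mean()

# Express each cluster mean as a deviation from the overall mean,
# in units of the overall standard deviation, to make features comparable.
overall = features.drop(columns="cluster")
relative = (profile - overall.mean()) / overall.std()
```

In a table like `relative`, a cluster with high danceability and tempo but low acousticness might be read as dance-oriented music, though any genre label assigned this way remains a hypothesis rather than a verified classification.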

Outcome & Learnings

This project allowed me to gain practical experience with dimensionality reduction, clustering algorithms, and data visualization using tools such as PCA, t-SNE, and UMAP. The results provided valuable insights into how songs can be grouped by genre based on musical features, showcasing the power of unsupervised learning in real-world applications like personalized song recommendations.