I trained a Music Transformer from scratch on the POP909 dataset to generate symbolic music. The goal was to build a model that composes music by predicting the next token in a sequence of encoded musical events.
Dataset & Preprocessing
I used the POP909 dataset, a collection of 909 professional MIDI arrangements of pop songs. To prepare the data, I created a custom tokenization pipeline that converted each MIDI file into a flat sequence of symbolic tokens (note on/off, velocity, duration, time shift). This was implemented through a MidiEventProcessor class, which:
- Encoded each note into symbolic events with timing and velocity
- Supported polyphonic sequences with overlapping notes
- Included bar and position tokens for temporal structure
I used these tokens to construct a vocabulary and convert the dataset into sequences suitable for training the model.
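As an illustration of this kind of event encoding, here is a minimal sketch. It is not the MidiEventProcessor code: the `Note` tuple, the token names, the 10 ms time-shift grid, and the 32 velocity bins are all assumptions made for the example, and bar/position tokens are left out to keep it short.

```python
from typing import Dict, List, NamedTuple


class Note(NamedTuple):
    pitch: int      # MIDI pitch, 0-127
    velocity: int   # MIDI velocity, 1-127
    start: float    # onset time in seconds
    end: float      # release time in seconds


TIME_STEP = 0.01       # assumed time-shift resolution (10 ms)
MAX_SHIFT_STEPS = 100  # assumed largest single time-shift token (1 s)
VELOCITY_BINS = 32     # assumed velocity quantization


def encode_notes(notes: List[Note]) -> List[str]:
    """Flatten (possibly overlapping) notes into one event-token sequence."""
    # One boundary event per note start and note end. Because the second
    # field is 0 for note-off and 1 for note-on, note-offs sort first when
    # two events share a tick.
    events = []
    for n in notes:
        events.append((round(n.start / TIME_STEP), 1, n.pitch, n.velocity))
        events.append((round(n.end / TIME_STEP), 0, n.pitch, 0))
    events.sort()

    tokens: List[str] = []
    current_tick = 0
    for tick, is_on, pitch, velocity in events:
        # Advance the clock with one or more time-shift tokens.
        gap = tick - current_tick
        while gap > 0:
            step = min(gap, MAX_SHIFT_STEPS)
            tokens.append(f"TIME_SHIFT_{step}")
            gap -= step
        current_tick = tick
        if is_on:
            tokens.append(f"VELOCITY_{velocity * VELOCITY_BINS // 128}")
            tokens.append(f"NOTE_ON_{pitch}")
        else:
            tokens.append(f"NOTE_OFF_{pitch}")
    return tokens


def build_vocab(sequences: List[List[str]]) -> Dict[str, int]:
    """Assign an integer id to every distinct token (0 is reserved for padding)."""
    vocab = {"<pad>": 0}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab
```

Because every note contributes separate on and off events, overlapping notes interleave naturally in the flat sequence, which is how a scheme like this represents polyphony.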
Music Transformer Architecture
I implemented the Music Transformer based on Huang et al. (2018). The model uses:
- Relative positional attention to capture long-term musical structure
- Six decoder layers with self-attention and feedforward blocks
- Sinusoidal positional encodings for token position awareness
- A final projection layer to map hidden states to token predictions
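The relative attention in Huang et al. (2018) avoids building a separate embedding for every query/key pair by using a "skewing" step: pad, reshape, and slice. The sketch below shows that step on plain Python lists and checks it against a direct gather. It is an illustration, not the code from this project. The input layout is an assumption: `qe[i][r]` holds the dot product of query `i` with the embedding for relative distance `r - (L - 1)`, so the last column corresponds to distance 0.

```python
from typing import List

Matrix = List[List[float]]


def relative_logits_direct(qe: Matrix) -> Matrix:
    """Gather S_rel[i][j] = qe[i][L - 1 - (i - j)] for j <= i, zero elsewhere."""
    L = len(qe)
    return [
        [qe[i][L - 1 - (i - j)] if j <= i else 0.0 for j in range(L)]
        for i in range(L)
    ]


def relative_logits_skew(qe: Matrix) -> Matrix:
    """Same result via the pad / reshape / slice trick, then causal zeroing."""
    L = len(qe)
    # 1. Pad one dummy column on the left: L x (L + 1).
    padded = [[0.0] + list(row) for row in qe]
    # 2. Flatten and reinterpret the same values as (L + 1) x L.
    flat = [x for row in padded for x in row]
    reshaped = [flat[r * L:(r + 1) * L] for r in range(L + 1)]
    # 3. Drop the first row, leaving L x L.
    skewed = reshaped[1:]
    # Entries above the diagonal are leftovers from the reshape; zero them
    # (in a real model the look-ahead mask hides them anyway).
    return [
        [skewed[i][j] if j <= i else 0.0 for j in range(L)]
        for i in range(L)
    ]
```

In a full attention layer these relative logits would be added to the usual query-key logits before the softmax.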
I trained the model using cross-entropy loss and a look-ahead (causal) mask, so that each position attends only to earlier tokens and prediction stays autoregressive. Key hyperparameters included:
- 350 training epochs
- Batch size: 8
- Adam optimizer with a 0.001 learning rate
- Dropout and label smoothing for regularization
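To make the masking and the smoothed loss concrete, here is a small standard-library sketch. It is not the training code from this project, and the smoothing value of 0.1 is an assumed default, since the actual value is not stated here. The smoothed target puts `1 - smoothing` on the true token and spreads `smoothing` uniformly over all classes.

```python
import math
from typing import List, Sequence


def look_ahead_mask(length: int) -> List[List[bool]]:
    """mask[i][j] is True where position i must NOT attend to position j (j > i)."""
    return [[j > i for j in range(length)] for i in range(length)]


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def smoothed_cross_entropy(
    logits: Sequence[float], target: int, smoothing: float = 0.1
) -> float:
    """Cross-entropy (in nats) against a label-smoothed target distribution."""
    n = len(logits)
    log_probs = log_softmax(logits)
    uniform = smoothing / n
    loss = 0.0
    for k, lp in enumerate(log_probs):
        # True class gets (1 - smoothing) on top of its uniform share.
        weight = uniform + ((1.0 - smoothing) if k == target else 0.0)
        loss -= weight * lp
    return loss
```

With `smoothing=0.0` this reduces to ordinary cross-entropy. With a positive value, a confident correct prediction is penalized slightly, which discourages over-confident logits.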
I trained the model entirely from scratch using my tokenized version of POP909. I also trained a separate version on pre-tokenized .pickle files provided in the original Music Transformer repository for comparison.
Music Generation
To generate music, I used autoregressive sampling starting with a special start-of-sequence token. I generated 1024 tokens per sequence, which correspond to about 30–45 seconds of symbolic music in MIDI format.
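The sampling loop can be sketched as follows. This is a simplified illustration rather than the project's generation code: `next_token_logits` is a hypothetical stand-in for a forward pass of the trained model, and the temperature parameter is an assumption added for the example.

```python
import math
import random
from typing import Callable, List, Optional, Sequence


def sample_from_logits(
    logits: Sequence[float], temperature: float, rng: random.Random
) -> int:
    """Draw one token id from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    # Unnormalized softmax weights; random.choices normalizes them.
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def generate(
    next_token_logits: Callable[[List[int]], Sequence[float]],
    sos_id: int,
    num_tokens: int = 1024,
    temperature: float = 1.0,
    seed: Optional[int] = None,
) -> List[int]:
    """Autoregressively sample num_tokens ids, starting from a start token."""
    rng = random.Random(seed)
    sequence = [sos_id]
    for _ in range(num_tokens):
        # The model sees everything generated so far and scores the next token.
        logits = next_token_logits(sequence)
        sequence.append(sample_from_logits(logits, temperature, rng))
    return sequence[1:]  # drop the start token
```

The returned ids would then be mapped back through the vocabulary and decoded into MIDI events.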
Evaluation Metrics
To assess model quality, I evaluated both predictive accuracy and musical characteristics. I computed:
- Perplexity to measure prediction uncertainty
- Note density (notes/sec) to gauge musical activity
- Polyphony rate to assess harmonic richness
- Unique 4-gram ratio to evaluate musical variation
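One plausible way to compute these four metrics is sketched below. The exact definitions used in this project are not spelled out, so treat these as assumptions: notes are `(start, end)` pairs in seconds, polyphony rate is the fraction of non-silent time steps with at least two notes sounding, and the 4-gram ratio is the number of distinct token 4-grams divided by the total number of 4-grams.

```python
import math
from typing import List, Sequence, Tuple

NoteSpan = Tuple[float, float]  # (start, end) in seconds


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def note_density(notes: Sequence[NoteSpan], duration_seconds: float) -> float:
    """Number of notes per second of music."""
    return len(notes) / duration_seconds


def polyphony_rate(notes: Sequence[NoteSpan], step: float = 0.05) -> float:
    """Fraction of non-silent time steps in which two or more notes sound."""
    end_time = max(end for _, end in notes)
    n_steps = int(math.ceil(end_time / step))
    active = 0
    polyphonic = 0
    for s in range(n_steps):
        t = (s + 0.5) * step  # sample the midpoint of each step
        sounding = sum(1 for start, end in notes if start <= t < end)
        if sounding >= 1:
            active += 1
        if sounding >= 2:
            polyphonic += 1
    return polyphonic / active if active else 0.0


def unique_ngram_ratio(tokens: Sequence[int], n: int = 4) -> float:
    """Distinct n-grams divided by total n-grams (1.0 means no repetition)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

Under these definitions, a ratio near 1.0 means almost no 4-gram repeats within a sequence, while lower values indicate repeated patterns.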
| Model | Perplexity | Note Density (notes/sec) | Polyphony Rate | Unique 4-gram Ratio |
|---|---|---|---|---|
| LSTM (Baseline) | 13.92 | 6.71 | 0.90 | 0.98 | 
| Music Transformer | 65.22 | 3.55 | 0.99 | 0.97 | 
| POP909 Dataset Avg | — | 6.88 | 0.90 | 0.58 | 
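Although the Music Transformer had a much higher perplexity than the LSTM baseline (65.22 vs. 13.92), its output had a higher polyphony rate (0.99 vs. 0.90), which suggests a denser harmonic texture. The unique 4-gram ratio was essentially the same for both models (0.97 vs. 0.98). Both were well above the dataset average of 0.58, which may indicate that neither model reproduces the repetition typical of the source arrangements. The Transformer's low note density (3.55 notes/sec vs. 6.88 in the dataset) and high perplexity suggest undertraining, likely due to limited compute time and the small batch size of 8.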
Takeaways
This project allowed me to explore how Transformer architectures can be used for creative sequence generation beyond text. I learned how to:
- Design tokenization schemes for symbolic data like MIDI
- Adapt existing deep learning architectures for new domains
- Train models from scratch on structured musical data
- Evaluate generated sequences both statistically and musically
Even though my model didn’t reach the perplexity levels of state-of-the-art approaches like the Pop Music Transformer, it successfully learned to generate varied and harmonically rich music. This project showed me the potential of combining symbolic encoding with deep learning to enable machine-composed music.

