Speech Sentiment Analysis
A speech emotion recognition system trained on four datasets. Ended up at 93.27% accuracy. Here's what actually happened.
Final accuracy: 93.27% (cross-dataset, held-out)
Datasets: 4 (CREMA-D · RAVDESS · TESS · SAVEE)
Inference latency: <200ms (CPU, live mic input)
The problem
Most speech emotion recognition (SER) systems are trained on one dataset and fall apart on anything else. The goal was to build something that generalises: not just to RAVDESS, which every tutorial uses, but across accents, recording conditions, and emotion labels that don't always agree with each other.
What I tried first (that didn't work)
Started with a plain LSTM on mel spectrograms. It was decent on RAVDESS alone (around 78%) but fell to 61% when I mixed in CREMA-D. The model was memorising speaker identity, not emotion. Classic overfitting to a small, homogeneous dataset.
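One common way to expose this kind of speaker leakage is to split the data so that no speaker appears in both train and test. The snippet below is a minimal sketch of that idea, not the split used in this project: the function name and the `(speaker_id, emotion, path)` tuple layout are assumptions made for illustration.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(clips, test_fraction=0.2, seed=0):
    """Split clips so that no speaker appears in both train and test.

    `clips` is a list of (speaker_id, emotion, path) tuples; this layout is
    a hypothetical one chosen for the example.
    """
    by_speaker = defaultdict(list)
    for clip in clips:
        by_speaker[clip[0]].append(clip)

    # Shuffle speakers (not clips) so the split is made at speaker level.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    # Hold out at least one speaker, even for tiny datasets.
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = speakers[:n_test]
    train_speakers = speakers[n_test:]

    train = [c for s in train_speakers for c in by_speaker[s]]
    test = [c for s in test_speakers for c in by_speaker[s]]
    return train, test


# Toy data: 5 speakers, 4 clips each.
clips = [(f"spk{i % 5}", "happy" if i % 2 else "sad", f"clip_{i}.wav") for i in range(20)]
train, test = speaker_disjoint_split(clips)
```

If accuracy drops sharply when moving from a random clip-level split to a split like this, the model is probably leaning on who is speaking rather than how.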
What worked
Stacking a CNN and an LSTM (CLSTM). The CNN extracts local patterns from the spectrogram, and the LSTM captures how those patterns evolve over time. Trained on CREMA-D, RAVDESS, TESS, and SAVEE together, after normalising the emotion label schema across all four.
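To make the CNN-then-LSTM stacking concrete, here is a rough PyTorch sketch of the general shape of such a model. Everything specific in it is an assumption for illustration: the framework, the layer sizes, the 64 mel bands, and the 6 output classes are not taken from the actual project.

```python
import torch
from torch import nn


class CLSTM(nn.Module):
    """CNN front-end over a mel spectrogram, followed by an LSTM over time."""

    def __init__(self, n_mels=64, n_classes=6, hidden=128):
        super().__init__()
        # Two conv blocks; each MaxPool2d(2) halves both the mel and time axes.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # After two poolings the mel axis has n_mels // 4 bins, each with 32 channels.
        self.lstm = nn.LSTM(input_size=32 * (n_mels // 4), hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, spec):
        # spec: (batch, 1, n_mels, frames)
        feats = self.cnn(spec)  # (batch, 32, n_mels // 4, frames // 4)
        b, c, m, t = feats.shape
        # Move time to the sequence axis: one flattened feature vector per time step.
        seq = feats.permute(0, 3, 1, 2).reshape(b, t, c * m)
        _, (h_n, _) = self.lstm(seq)
        # Classify from the final hidden state of the last LSTM layer.
        return self.head(h_n[-1])  # logits: (batch, n_classes)


model = CLSTM()
logits = model(torch.randn(2, 1, 64, 128))  # two random "spectrograms"
```

The key step is the `permute` and `reshape`: the CNN output is a 2D feature map, and it has to be turned into a sequence of per-frame vectors before the LSTM can read it along the time axis.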
Final: 93.27% on the held-out test set. The real-time inference pipeline reads from a microphone with PyAudio, extracts features with Librosa, and classifies in under 200ms on CPU.
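A live pipeline like this needs some way to turn small microphone chunks into fixed-length windows for the model. Below is a sketch of just that buffering step in plain Python. The class name, window size, and hop size are hypothetical; the microphone read (a PyAudio input stream), the Librosa feature extraction, and the model call are deliberately left out.

```python
from collections import deque


class SlidingWindow:
    """Collect incoming audio chunks and emit fixed-length, overlapping windows."""

    def __init__(self, window_size, hop_size):
        self.window_size = window_size
        self.hop_size = hop_size
        self.buffer = deque(maxlen=window_size)  # keeps only the newest samples
        self.since_last = 0  # samples received since the last emitted window

    def push(self, chunk):
        """Add one chunk of samples; return the list of windows that became ready."""
        ready = []
        for sample in chunk:
            self.buffer.append(sample)
            self.since_last += 1
            full = len(self.buffer) == self.window_size
            if full and self.since_last >= self.hop_size:
                ready.append(list(self.buffer))
                self.since_last = 0
        return ready


# Tiny demo with integers standing in for audio samples.
demo = SlidingWindow(window_size=4, hop_size=2)
windows = []
for chunk in ([0, 1, 2], [3, 4, 5], [6, 7]):
    windows.extend(demo.push(chunk))
```

In the demo, `windows` ends up holding three overlapping windows of four samples each. With real audio, each emitted window would go to feature extraction and then to the classifier, and the hop size sets how often a new prediction is made.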
What I'd do differently
The label harmonisation was done by hand. Took a weekend. A proper ontology mapping, or a pre-trained audio encoder (like wav2vec2), would have been faster and would probably have hit higher accuracy on out-of-distribution audio.
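For reference, a hand-written harmonisation usually comes down to one lookup table per dataset. The sketch below shows the shape of it. The raw codes follow each dataset's usual file-naming conventions, but the unified schema and the judgment calls (for example, folding RAVDESS "calm" into "neutral") are illustrative assumptions, not the mapping actually used here.

```python
# Hypothetical shared schema; CREMA-D has no "surprise" class at all.
UNIFIED = {"angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"}

LABEL_MAP = {
    "CREMA-D": {
        "ANG": "angry", "DIS": "disgust", "FEA": "fear",
        "HAP": "happy", "NEU": "neutral", "SAD": "sad",
    },
    "RAVDESS": {
        "01": "neutral", "02": "neutral",  # 02 is "calm"; merging it is an assumption
        "03": "happy", "04": "sad", "05": "angry",
        "06": "fear", "07": "disgust", "08": "surprise",
    },
    "TESS": {
        "angry": "angry", "disgust": "disgust", "fear": "fear", "happy": "happy",
        "neutral": "neutral", "sad": "sad", "ps": "surprise",  # ps = pleasant surprise
    },
    "SAVEE": {
        "a": "angry", "d": "disgust", "f": "fear", "h": "happy",
        "n": "neutral", "sa": "sad", "su": "surprise",
    },
}


def harmonise(dataset, raw_label):
    """Map a dataset-specific label code to the shared schema, failing loudly on gaps."""
    try:
        return LABEL_MAP[dataset][raw_label]
    except KeyError:
        raise ValueError(f"no mapping for {dataset!r} label {raw_label!r}") from None
```

Raising on unknown labels, rather than silently dropping them, makes it obvious when a dataset uses a code the table doesn't cover.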