Case Study 01 / 02 · ~5 min read · edited 2025-12

Speech Sentiment Analysis

A speech emotion recognition system trained on four datasets. Ended up at 93.27% accuracy. Here's what actually happened.

final accuracy

93.27%

cross-dataset, held-out

datasets

4

CREMA-D · RAVDESS · TESS · SAVEE

inference latency

<200ms

CPU, live mic input

01

The problem

Most speech emotion recognition (SER) systems are trained on one dataset and fall apart on anything else. The goal was to build something that generalises: not just to RAVDESS, which every tutorial uses, but across accents, recording conditions, and emotion label sets that don't always agree with each other.

02

What I tried first (that didn't work)

Started with a plain LSTM on mel spectrograms. Decent on RAVDESS alone (around 78%), but accuracy fell to 61% when I mixed in CREMA-D. The model was memorising speaker identity, not emotion. Classic overfitting to a small, homogeneous dataset.
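A common way to check for this kind of speaker leakage is to split by speaker rather than by clip, so that no voice appears on both sides of the split. The sketch below is not the split used in this project; the `(speaker_id, path)` tuples and the `speaker_disjoint_split` helper are hypothetical names chosen for illustration.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(clips, test_fraction=0.2, seed=0):
    """Split (speaker_id, clip_path) pairs so no speaker lands in both sets."""
    by_speaker = defaultdict(list)
    for speaker_id, path in clips:
        by_speaker[speaker_id].append(path)

    # Shuffle speakers, not clips, so a whole voice goes to one side.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])

    train = [(s, p) for s in speakers if s not in test_speakers for p in by_speaker[s]]
    test = [(s, p) for s in speakers if s in test_speakers for p in by_speaker[s]]
    return train, test


# Hypothetical clip list: 5 speakers with 4 clips each.
clips = [(f"spk{i}", f"spk{i}_clip{j}.wav") for i in range(5) for j in range(4)]
train, test = speaker_disjoint_split(clips)
```

If accuracy drops sharply under a split like this compared with a random per-clip split, that points to the model leaning on who is speaking rather than how they sound.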

03

What worked

Stacking a CNN and an LSTM (CLSTM). The CNN extracts local patterns from the spectrogram; the LSTM captures how those patterns evolve over time. Trained on CREMA-D, RAVDESS, TESS, and SAVEE together, after normalising the emotion label schema across all four.
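As a rough illustration of the stacking described above, here is a minimal CNN + LSTM sketch in PyTorch. The framework, the layer sizes, the 64 mel bands, and the six-class output are all assumptions made for the example; the write-up does not specify them.

```python
import torch
from torch import nn


class CLSTM(nn.Module):
    """CNN front end over a mel spectrogram, LSTM over the time axis."""

    def __init__(self, n_mels=64, n_classes=6, hidden=128):
        super().__init__()
        # Two conv blocks pick up local time-frequency patterns.
        # Pooling halves only the mel axis, so the frame count is preserved.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),
        )
        # The LSTM reads one flattened feature vector per frame.
        self.lstm = nn.LSTM(32 * (n_mels // 4), hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, spec):
        # spec: (batch, 1, n_mels, frames)
        feats = self.cnn(spec)              # (batch, 32, n_mels // 4, frames)
        feats = feats.permute(0, 3, 1, 2)   # (batch, frames, 32, n_mels // 4)
        feats = feats.flatten(start_dim=2)  # (batch, frames, 32 * (n_mels // 4))
        _, (h_n, _) = self.lstm(feats)      # h_n: (1, batch, hidden)
        return self.head(h_n[-1])           # (batch, n_classes) logits


model = CLSTM().eval()
with torch.no_grad():
    # 2 clips, 64 mel bands, 100 frames of random input.
    logits = model(torch.randn(2, 1, 64, 100))
```

Pooling only along the frequency axis is one design choice among several; it keeps one LSTM step per spectrogram frame, at the cost of longer sequences than a model that also pools over time.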

Final: 93.27% on the held-out test set. The real-time inference pipeline reads from a microphone with PyAudio, extracts features with Librosa, and classifies in under 200 ms on CPU.
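To make the real-time path more concrete, here is a sketch of the buffering step that sits between the microphone and the classifier: fixed-size chunks come in, and a sliding window of recent audio goes out. The `classify` stub, the 16 kHz sample rate, and the window and hop sizes are assumptions for illustration. In a live setup the chunks would come from a PyAudio input stream rather than the synthetic generator used here.

```python
import time
from collections import deque

SAMPLE_RATE = 16_000          # assumed; not stated in the write-up
WINDOW = SAMPLE_RATE * 2      # classify the most recent 2 s of audio
HOP = SAMPLE_RATE // 2        # emit a new window roughly every 0.5 s


def sliding_windows(chunks, window=WINDOW, hop=HOP):
    """Yield the latest `window` samples each time at least `hop` new samples arrive."""
    buffer = deque(maxlen=window)   # old samples fall off the left automatically
    since_last = 0
    for chunk in chunks:
        buffer.extend(chunk)
        since_last += len(chunk)
        if len(buffer) == window and since_last >= hop:
            since_last = 0
            yield list(buffer)


def classify(samples):
    """Placeholder for feature extraction plus the model's forward pass."""
    return "neutral"


# Synthetic stand-in for a microphone: about 3 s of silence in 1024-sample chunks.
n_chunks = (SAMPLE_RATE * 3) // 1024
fake_mic = ([0.0] * 1024 for _ in range(n_chunks))

results = []
for win in sliding_windows(fake_mic):
    start = time.perf_counter()
    label = classify(win)
    latency_ms = (time.perf_counter() - start) * 1000
    results.append((label, latency_ms))
```

Timing only the `classify` call, as above, measures per-window compute; end-to-end latency also includes the time spent waiting for the window to fill.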

04

What I'd do differently

The label harmonisation was done by hand. Took a weekend. A proper ontology mapping, or a pre-trained audio encoder (like wav2vec2), would have been faster and would probably have hit higher accuracy on out-of-distribution audio.
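For reference, a hand-built harmonisation of this kind usually boils down to one lookup table per dataset. The sketch below follows each dataset's published file-naming codes (worth double-checking against the dataset documentation), but the shared target schema and the decision to fold RAVDESS `calm` into `neutral` are assumptions, not the mapping actually used here. `LABEL_MAP` and `harmonise` are hypothetical names.

```python
# Assumed shared label set; not necessarily the one used in this project.
LABEL_MAP = {
    "RAVDESS": {
        "01": "neutral", "02": "neutral",  # 02 is "calm"; folding it in is a judgment call
        "03": "happy", "04": "sad", "05": "angry",
        "06": "fear", "07": "disgust", "08": "surprise",
    },
    "CREMA-D": {
        "ANG": "angry", "DIS": "disgust", "FEA": "fear",
        "HAP": "happy", "NEU": "neutral", "SAD": "sad",
    },
    "TESS": {
        "angry": "angry", "disgust": "disgust", "fear": "fear",
        "happy": "happy", "neutral": "neutral", "sad": "sad",
        "ps": "surprise",  # "pleasant surprise"
    },
    "SAVEE": {
        "a": "angry", "d": "disgust", "f": "fear", "h": "happy",
        "n": "neutral", "sa": "sad", "su": "surprise",
    },
}


def harmonise(dataset, raw_label):
    """Map a dataset-specific label code onto the shared schema."""
    try:
        return LABEL_MAP[dataset][raw_label]
    except KeyError:
        # Fail loudly instead of silently dropping or mislabelling a clip.
        raise ValueError(f"no mapping for {dataset!r} label {raw_label!r}") from None


shared_labels = sorted({label for table in LABEL_MAP.values() for label in table.values()})
```

Note that CREMA-D has no surprise class, so a shared schema either keeps `surprise` as a class fed by only three datasets or drops it; that choice affects how comparable per-dataset accuracy is.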