MERaLiON-SER: Robust Speech Emotion Recognition Model for English and SEA Languages
Published 12 Nov 2025 · arXiv · Hardik B. Sailor
Overview
MERaLiON-SER is a speech emotion recognition model designed for English and Southeast Asian languages. It uses a hybrid objective combining weighted categorical cross-entropy and Concordance Correlation Coefficient losses for joint discrete and dimensional emotion modelling.
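The hybrid objective described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the summary does not give the exact formula, so the function names, the `alpha` and `beta` mixing weights, the `eps` stabiliser, and the use of `1 - CCC` as the dimensional loss term are all assumptions made for the example.

```python
import math

def weighted_cross_entropy(logits, labels, class_weights):
    """Class-weighted categorical cross-entropy, normalised by total weight."""
    total, weight_sum = 0.0, 0.0
    for row, label in zip(logits, labels):
        # Numerically stable log-sum-exp: subtract the row maximum first.
        m = max(row)
        log_norm = m + math.log(sum(math.exp(z - m) for z in row))
        w = class_weights[label]
        total += w * (log_norm - row[label])  # weighted negative log-likelihood
        weight_sum += w
    return total / weight_sum

def ccc(pred, target, eps=1e-8):
    """Concordance Correlation Coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    # eps (an assumption) guards against division by zero for constant inputs.
    return 2.0 * cov / (var_p + var_t + (mean_p - mean_t) ** 2 + eps)

def hybrid_loss(logits, labels, class_weights, dim_pred, dim_target,
                alpha=1.0, beta=1.0):
    """alpha * weighted CE + beta * mean over dimensions of (1 - CCC).

    dim_pred and dim_target hold one row per utterance and one column per
    dimension (e.g. arousal, valence, dominance). alpha and beta are
    hypothetical mixing weights, not values from the paper.
    """
    ce = weighted_cross_entropy(logits, labels, class_weights)
    pred_cols = list(zip(*dim_pred))
    target_cols = list(zip(*dim_target))
    ccc_loss = sum(1.0 - ccc(p, t) for p, t in zip(pred_cols, target_cols))
    ccc_loss /= len(pred_cols)
    return alpha * ce + beta * ccc_loss
```

In this sketch the cross-entropy term drives the discrete emotion labels, while the CCC term rewards predictions that agree with the reference ratings in both correlation and scale, which plain correlation alone would not enforce.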
Key Insights
- Model Performance: MERaLiON-SER consistently outperforms open-source speech encoders and large Audio-LLMs in multilingual evaluations.
  - Evidence: Evaluations across Singaporean languages and public benchmarks.
  - Verifiable: Yes, through independent testing on specified benchmarks.
- Emotion Modelling: Captures both discrete emotions and fine-grained dimensions like arousal, valence, and dominance.
  - Evidence: The model jointly predicts discrete emotion classes and continuous emotion dimensions.
  - Verifiable: Yes, through model architecture analysis.
BFSI Relevance
- Why Relevant: Integrating emotion recognition can enhance customer interaction systems, which matters for both customer service and fraud detection.
- Primary Sector: Financial Services
- Subsectors: Customer Service, Fraud Detection
- Actionable Implications:
  - Implement emotion recognition in customer service to improve client interactions.
  - Use emotion data to enhance fraud detection systems by identifying stress or deception in voice patterns.