Embedding Geometry in Pretrained CLAP Models

Contrastive Language–Audio Pretraining (CLAP) models have, in the space of about three years, become the default backbone for zero-shot audio classification, audio–text retrieval, and a long tail of downstream Music Information Retrieval (MIR) tasks (Wu et al., 2023; Li et al., 2024; Niizumi et al., 2024). They are appealing for the same reason CLIP was appealing in vision: a single pair of encoders, trained once on web-scale audio–caption pairs, exposes a joint embedding space in which arbitrary natural-language queries can be compared to arbitrary audio clips with a dot product. No task-specific heads, no per-task fine-tuning, no labelled training data downstream — just two encoders and a similarity matrix. ...

Date: May 27, 2026 | Estimated Reading Time: 29 min | Author: Arjun Bahuguna

Top 5 Takeaways from ICASSP 2026

Date: May 9, 2026 | Estimated Reading Time: 0 min | Author: Arjun Bahuguna

Top 5 Takeaways from pre-ICASSP Workshop 2026 at MTG

Date: May 3, 2026 | Estimated Reading Time: 0 min | Author: Arjun Bahuguna

Evaluation of Music Generation Systems

Benchmarking generative music systems is a significant challenge in contemporary Music Information Retrieval because the field lacks a definitive ground truth against which synthetic outputs can be measured. Generative models, such as those built on Transformer architectures or WaveNet variants, often produce compositions that are locally coherent but lack global structural regularity and long-term repetition (Wang et al., 2023). Because artistic output is inherently subjective, evaluation must move beyond simple error minimization and combine metrics that account for audio fidelity, adherence to music theory, and human perception (Lerch et al., 2025). ...

Date: January 1, 2026 | Estimated Reading Time: 4 min | Author: Arjun Bahuguna