Self-Supervised Learning

Contrastive Language–Audio Pretraining (CLAP) models have, in the space of about three years, become the default backbone for zero-shot audio classification, audio–text retrieval, and a long tail of downstream Music Information Retrieval (MIR) tasks (Wu et al., 2023; Li et al., 2024; Niizumi et al., 2024). They are appealing for the same reason CLIP was appealing in vision: a single pair of encoders, trained once on web-scale audio–caption pairs, exposes a joint embedding space in which arbitrary natural-language queries can be compared to arbitrary audio clips with a dot product. No task-specific heads, no per-task fine-tuning, no labelled training data downstream — just two encoders and a similarity matrix. ...
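The zero-shot recipe described above (embed the audio, embed the candidate text prompts, compare with a dot product) can be sketched as follows. This is a minimal illustration, not the implementation of any particular CLAP model. The embeddings are random stand-ins for encoder outputs. The function names `l2_normalize` and `zero_shot_classify`, the 512-dimensional embedding size, and the temperature value are assumptions made for the example.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_classify(audio_emb, text_emb, temperature=0.07):
    """Score every audio clip against every text prompt.

    audio_emb: (n_clips, d) array, assumed output of an audio encoder.
    text_emb:  (n_prompts, d) array, assumed output of a text encoder.
    temperature: illustrative softmax scale (an assumption, not a model constant).
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    sim = a @ t.T  # similarity matrix, shape (n_clips, n_prompts)

    # Softmax over prompts, shifted by the row maximum for numerical stability.
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return sim, probs.argmax(axis=1), probs


# Stand-in data: three "prompt" embeddings and four "clips", each clip built
# as a noisy copy of one prompt embedding so the expected label is known.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 512))
true_labels = np.array([2, 0, 1, 1])
audio_emb = text_emb[true_labels] + 0.1 * rng.normal(size=(4, 512))

sim, pred, probs = zero_shot_classify(audio_emb, text_emb)
```

In a real pipeline the two arrays would come from the pretrained audio and text encoders, and the prompts would be natural-language class descriptions. No classifier head is trained at any point; the predicted label is simply the prompt with the highest similarity in each row.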