I’m a Master’s student in Sound & Music Computing at Universitat Pompeu Fabra, Barcelona.
Previously, I led Audio ML research at startups in Germany & India, focusing on improving large-scale self-supervised models for speech.
My research interests are representation learning for audio and text, conditional generation of audio, and interpretability. Say hi if you’re into audio as well :)
Embedding Geometry in Pretrained CLAP Models
Contrastive Language–Audio Pretraining (CLAP) models have, in the space of about three years, become the default backbone for zero-shot audio classification, audio–text retrieval, and a long tail of downstream Music Information Retrieval (MIR) tasks (Wu et al., 2023; Li et al., 2024; Niizumi et al., 2024). They are appealing for the same reason CLIP was appealing in vision: a single pair of encoders, trained once on web-scale audio–caption pairs, exposes a joint embedding space in which arbitrary natural-language queries can be compared to arbitrary audio clips with a dot product. No task-specific heads, no per-task fine-tuning, no labelled training data downstream — just two encoders and a similarity matrix. ...
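To make the "two encoders and a similarity matrix" idea concrete, here is a minimal NumPy sketch of zero-shot classification in a joint embedding space. It does not call any real CLAP library: the embeddings are hand-written toy vectors standing in for encoder outputs, and the names `l2_normalize` and `zero_shot_classify` are hypothetical helpers chosen for illustration. It also assumes the dot product is taken over L2-normalized embeddings (cosine similarity), which is a common convention rather than something stated above.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def zero_shot_classify(audio_emb, text_emb):
    """Return the (n_audio, n_text) similarity matrix and the best text index per clip."""
    a = l2_normalize(np.asarray(audio_emb, dtype=np.float64))
    t = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    sim = a @ t.T  # one dot product per (audio clip, text prompt) pair
    return sim, sim.argmax(axis=1)

# Toy placeholders for encoder outputs (assumption: a real model would produce these).
# Rows of text_emb stand in for three embedded label prompts.
text_emb = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
# Rows of audio_emb stand in for two embedded audio clips.
audio_emb = np.array([
    [0.1, 0.2, 0.9, 0.1],
    [0.8, 0.1, 0.0, 0.2],
])

sim, pred = zero_shot_classify(audio_emb, text_emb)
print(pred)  # index of the closest text prompt for each clip
```

Swapping in embeddings from actual audio and text encoders would leave the classification step unchanged: the label set is just the list of prompts, so changing the task means changing the rows of `text_emb`.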
Top 5 Takeaways from ICASSP 2026
Top 5 Takeaways from pre-ICASSP Workshop 2026 at MTG
Evaluation of Music Generation Systems
Benchmarking generative music systems is a significant challenge in contemporary Music Information Retrieval because the field lacks a definitive ground truth against which synthetic outputs can be measured. Generative models, such as those built on Transformer architectures or WaveNet variants, often produce compositions that are locally coherent but lack global structural regularity and long-term repetition (Wang et al., 2023). Because artistic output is inherently subjective, evaluation must move beyond simple error minimization and combine metrics that account for audio fidelity, adherence to music theory, and human perceptual experience (Lerch et al., 2025). ...