Paper Detail
Siqiao Xue, Chunxue Xu
Adapting a foundation vision-language encoder to a specialized retrieval task creates a fundamental tradeoff: gains on the target distribution come at the cost of the foundation model's broad generalization, and fashion retrieval is a stringent instance of this problem. We present ZooClaw-FashionSigLIP2, a fashion-specialized SigLIP2-base model that resolves this tradeoff with a simple recipe -- full fine-tuning with knowledge distillation on curated in-domain data, followed by WiSE-FT (Wortsman et al., 2022) weight interpolation with the base model -- and outperforms alternatives based on LoRA, larger backbones (up to 1B parameters), and external training data. Under fair evaluation, ZooClaw-FashionSigLIP2 outperforms all baselines on every benchmark in our suite. In addition, we release ZooClaw-Fashion, a new high-quality fashion retrieval benchmark, and a systematic quality analysis of widely used benchmarks that exposes and mitigates structural biases in their public ground truth. We open-source the model weights and all evaluation artifacts to facilitate future research.
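The weight-interpolation step named in the abstract can be illustrated with a minimal sketch. It assumes the usual WiSE-FT form, a linear blend of the base and fine-tuned checkpoints with a mixing coefficient `alpha` in [0, 1]. The function name, the toy parameter names, and the value `alpha=0.5` are hypothetical; the abstract does not state the coefficient or the checkpoint layout the authors use.

```python
from typing import Dict, List

# Assumption: a checkpoint is a mapping from parameter name to flattened values.
Weights = Dict[str, List[float]]


def interpolate_weights(base: Weights, finetuned: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * base + alpha * finetuned, parameter by parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != finetuned.keys():
        raise ValueError("both checkpoints must have the same parameter names")
    merged: Weights = {}
    for name, base_values in base.items():
        tuned_values = finetuned[name]
        if len(base_values) != len(tuned_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Linear blend: alpha = 0 keeps the base model, alpha = 1 keeps the fine-tuned one.
        merged[name] = [
            (1.0 - alpha) * b + alpha * t
            for b, t in zip(base_values, tuned_values)
        ]
    return merged


# Toy checkpoints with hypothetical parameter names and values.
base = {"proj.weight": [0.0, 2.0], "proj.bias": [1.0]}
finetuned = {"proj.weight": [1.0, 4.0], "proj.bias": [3.0]}
merged = interpolate_weights(base, finetuned, alpha=0.5)
```

In practice `alpha` would be chosen on held-out data to balance in-domain retrieval quality against the base model's general-purpose performance; that selection procedure is not described in the abstract.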
@misc{xue2026zooclaw,
title = {ZooClaw-FashionSigLIP2: Distilled Fine-tuning for Robust Fashion Retrieval},
author = {Siqiao Xue and Chunxue Xu},
year = {2026},
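abstract = {Adapting a foundation vision-language encoder to a specialized retrieval task creates a fundamental tradeoff: gains on the target distribution come at the cost of the foundation model's broad generalization, and fashion retrieval is a stringent instance of this problem. We present ZooClaw-FashionSigLIP2, a fashion-specialized SigLIP2-base model that resolves this tradeoff with a simple recipe -- full fine-tuning with knowledge distillation on curated in-domain data, followed by WiSE-FT (Wortsman et al., 2022) weight interpolation with the base model -- and outperforms alternatives based on LoRA, larger backbones (up to 1B parameters), and external training data. Under fair evaluation, ZooClaw-FashionSigLIP2 outperforms all baselines on every benchmark in our suite. In addition, we release ZooClaw-Fashion, a new high-quality fashion retrieval benchmark, and a systematic quality analysis of widely used benchmarks that exposes and mitigates structural biases in their public ground truth. We open-source the model weights and all evaluation artifacts to facilitate future research.},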
url = {https://huggingface.co/papers/2606.27708},
keywords = {vision-language encoder, fashion retrieval, full fine-tuning, knowledge distillation, weight interpolation, SigLIP2-base, LoRA, parameter-efficient fine-tuning, ground truth, benchmark evaluation, huggingface daily},
eprint = {2606.27708},
archiveprefix = {arXiv},
}