ZooClaw-FashionSigLIP2: Distilled Fine-tuning for Robust Fashion Retrieval

Abstract

Adapting a foundation vision-language encoder to a specialized retrieval task creates a fundamental tradeoff: gains on the target distribution come at the cost of the foundation model's broad generalization, and fashion retrieval is a stringent instance of this problem. We present ZooClaw-FashionSigLIP2, a fashion-specialized SigLIP2-base model that resolves this tradeoff with a simple recipe: full fine-tuning with knowledge distillation on curated in-domain data, followed by WiSE-FT (Wortsman et al., 2022) weight interpolation with the base model. This recipe outperforms alternatives that rely on LoRA, larger backbones (up to 1B parameters), or external training data. Under fair evaluation, ZooClaw-FashionSigLIP2 outperforms all baselines on every benchmark in our suite. In addition, we release ZooClaw-Fashion, a new high-quality fashion retrieval benchmark, and a systematic quality analysis of widely used benchmarks that exposes and mitigates structural biases in their public ground truth. We open-source the model weights and all evaluation artifacts to facilitate future research.
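
The weight-interpolation step of the recipe can be illustrated with a minimal sketch. It assumes the standard WiSE-FT form, a per-parameter linear blend (1 - alpha) * base + alpha * fine-tuned. The function name `interpolate_weights`, the toy parameter names, and the value alpha = 0.5 are hypothetical choices for illustration only; the abstract does not state the mixing coefficient or any implementation detail.

```python
from typing import Dict, List

# A model is represented here as a mapping from parameter name to a flat list
# of floats. This is an illustrative stand-in for a real checkpoint.
Weights = Dict[str, List[float]]


def interpolate_weights(base: Weights, finetuned: Weights, alpha: float) -> Weights:
    """Blend two checkpoints: (1 - alpha) * base + alpha * finetuned."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != finetuned.keys():
        raise KeyError("base and fine-tuned models must share parameter names")

    merged: Weights = {}
    for name, base_values in base.items():
        tuned_values = finetuned[name]
        if len(base_values) != len(tuned_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * t
            for b, t in zip(base_values, tuned_values)
        ]
    return merged


# Toy checkpoints with hypothetical parameter names and values.
base = {"proj.weight": [0.0, 2.0], "proj.bias": [1.0]}
finetuned = {"proj.weight": [1.0, 4.0], "proj.bias": [3.0]}

# alpha = 0.5 is an assumed value, not one reported in the abstract.
merged = interpolate_weights(base, finetuned, alpha=0.5)
print(merged)
```

Setting alpha to 0 recovers the base model and alpha to 1 recovers the fine-tuned model, so the coefficient trades in-domain gains against the base model's broad generalization.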

BibTeX

@misc{xue2026zooclaw,
  title = {ZooClaw-FashionSigLIP2: Distilled Fine-tuning for Robust Fashion Retrieval},
  author = {Siqiao Xue and Chunxue Xu},
  year = {2026},
  url = {https://huggingface.co/papers/2606.27708},
  keywords = {vision-language encoder, fashion retrieval, full fine-tuning, knowledge distillation, weight interpolation, SigLIP2-base, LoRA, parameter-efficient fine-tuning, ground truth, benchmark evaluation, huggingface daily},
  eprint = {2606.27708},
  archiveprefix = {arXiv},
}