Paper Detail

ZooClaw-FashionSigLIP2: Distilled Fine-tuning for Robust Fashion Retrieval

Siqiao Xue, Chunxue Xu

huggingface Score 10.4

Published 2026-06-26 · First seen 2026-06-30

General AI

Abstract

Adapting a foundation vision-language encoder to a specialized retrieval task creates a fundamental tradeoff: gains on the target distribution come at the cost of the foundation model's broad generalization, and fashion retrieval is a stringent instance of this problem. We present ZooClaw-FashionSigLIP2, a fashion-specialized SigLIP2-base model that resolves this tradeoff with a simple recipe -- full fine-tuning with knowledge distillation on curated in-domain data, followed by WiSE-FT (Wortsman et al., 2022) weight interpolation with the base model -- and outperforms LoRA, larger backbones (up to 1B parameters), and external training data. Under fair evaluation, ZooClaw-FashionSigLIP2 outperforms all baselines on every benchmark in our suite. In addition, we release ZooClaw-Fashion, a new high-quality fashion retrieval benchmark, and a systematic quality analysis of widely used benchmarks that exposes and mitigates structural biases in their public ground truth. We open-source the model weights and all evaluation artifacts to facilitate future research.
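
The weight-interpolation step named in the abstract can be illustrated with a minimal sketch. It assumes the standard WiSE-FT form, a per-parameter linear blend `(1 - alpha) * base + alpha * finetuned`, and represents checkpoints as plain dictionaries of float lists so that it runs without any deep-learning library. The function name `interpolate_weights`, the parameter names, and the value `alpha=0.5` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# A toy "checkpoint": parameter name -> flat list of float values.
# (Assumption for illustration; real checkpoints hold tensors.)
Weights = Dict[str, List[float]]


def interpolate_weights(base: Weights, finetuned: Weights, alpha: float) -> Weights:
    """Blend two checkpoints parameter by parameter.

    alpha = 0.0 returns the base weights, alpha = 1.0 the fine-tuned ones.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != finetuned.keys():
        raise ValueError("checkpoints must share the same parameter names")

    merged: Weights = {}
    for name, base_values in base.items():
        tuned_values = finetuned[name]
        if len(base_values) != len(tuned_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Linear interpolation in weight space.
        merged[name] = [
            (1.0 - alpha) * b + alpha * t
            for b, t in zip(base_values, tuned_values)
        ]
    return merged


# Hypothetical two-parameter model, chosen only to make the arithmetic visible.
base = {"proj.weight": [0.0, 2.0], "proj.bias": [1.0]}
finetuned = {"proj.weight": [1.0, 4.0], "proj.bias": [3.0]}

merged = interpolate_weights(base, finetuned, alpha=0.5)
print(merged)
```

In this form, `alpha` is the single knob that trades in-domain specialization against the base model's general behavior; how the authors choose it is not stated in the abstract.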

BibTeX

@misc{xue2026zooclaw,
  title = {ZooClaw-FashionSigLIP2: Distilled Fine-tuning for Robust Fashion Retrieval},
  author = {Siqiao Xue and Chunxue Xu},
  year = {2026},
  abstract = {Adapting a foundation vision-language encoder to a specialized retrieval task creates a fundamental tradeoff: gains on the target distribution come at the cost of the foundation model's broad generalization, and fashion retrieval is a stringent instance of this problem. We present ZooClaw-FashionSigLIP2, a fashion-specialized SigLIP2-base model that resolves this tradeoff with a simple recipe -- full fine-tuning with knowledge distillation on curated in-domain data, followed by WiSE-FT (Wortsman et al., 2022) ...},
  url = {https://huggingface.co/papers/2606.27708},
  keywords = {vision-language encoder, fashion retrieval, full fine-tuning, knowledge distillation, weight interpolation, SigLIP2-base, LoRA, parameter-efficient fine-tuning, ground truth, benchmark evaluation, huggingface daily},
  eprint = {2606.27708},
  archiveprefix = {arXiv},
}
