Paper Detail

Influcoder: Distilling Decoders' Gradient Influence Rankings into an Encoder for Data Attribution

Dimitri Kachler, Damien Sileo, Pascal Denis

arxiv Score 5.3

Published 2026-06-11 · First seen 2026-06-13

General AI

Abstract

As the capabilities of Large Language Models (LLMs) grow, there has been an increasing push to curate high-quality datasets by filtering samples in the training data. In general, Data Attribution (DA) methods aim to estimate how individual samples in a training dataset can precondition a model to generate certain outputs. As an example, one might be interested in which samples in the data could be the source of toxic behavior after training the LLM. Many methods quantify this conditioning through the paradigm of influence functions. While methods of this family are effective at this task, they lack the processing speed and storage compactness needed for practical use on large datasets. We propose a method, Influcoder, as a quick and cost-effective approach to influence-based Data Attribution at scale.

BibTeX

@article{kachler2026influcoder,
  title = {Influcoder: Distilling Decoders' Gradient Influence Rankings into an Encoder for Data Attribution},
  author = {Dimitri Kachler and Damien Sileo and Pascal Denis},
  year = {2026},
  abstract = {With the growth of LLMs' (Large Language Models) capabilities, there has been an increasing push to curate high quality datasets by filtering samples in the training data. In general, Data Attribution (DA) methods aim to estimate how individual samples in a training dataset can precondition a model to generate certain outputs. As an example, one might be interested in which samples in the data could be the source of toxic behavior after training the LLM. Many methods quantify this conditioning t},
  url = {https://arxiv.org/abs/2606.13668},
  keywords = {cs.CL, Precondition, Computer science, Attribution, Artificial intelligence, Quality (philosophy)},
  eprint = {2606.13668},
  archiveprefix = {arXiv},
}
