fix(embed): mark all tokens for output to suppress llama.cpp 'overriding' warning (#2208) #2209

Open

Anai-Guo wants to merge 1 commit into abetlen:main from Anai-Guo:fix/embed-mark-all-tokens-2208

Conversation

Anai-Guo (Contributor) commented May 9, 2026

Summary

Fixes #2208.

When Llama.embed() is called on a model whose pooling type is not NONE, every input token's logits flag except the last is set to False (in LlamaBatch.add_sequence). When llama.cpp later runs the embedding pass, it requires every embedded token to have its output flag enabled, so it emits one info line per input:

init: embeddings required but some input tokens were not marked as outputs -> overriding

then forces all tokens on internally. The output is correct but the log is noisy — see also the matching ollama issue ollama/ollama#12381 referenced in the bug report.
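For reference, a minimal sketch of the flag behaviour described above; the helper name and shapes are illustrative, not the actual llama-cpp-python internals:

```python
# Minimal sketch, not the real LlamaBatch.add_sequence: with logits_all=False
# only the last token of a sequence is marked for output, which is what makes
# llama.cpp's embedding pass emit the "overriding" info line and flip the
# remaining flags itself.
def output_flags(n_tokens: int, logits_all: bool) -> list[bool]:
    return [logits_all or (i == n_tokens - 1) for i in range(n_tokens)]

assert output_flags(4, False) == [False, False, False, True]  # pooled models today
assert output_flags(4, True) == [True, True, True, True]      # after this PR
```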

Fix

Set logits_all = True unconditionally inside embed() so the Python side marks every token for output, matching what llama.cpp does internally. There is no behavioural change for LLAMA_POOLING_TYPE_NONE (it was already True); for the other pooling modes the stream of override warnings is suppressed.
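Roughly, the change looks like the following simplified sketch (not the verbatim embed() source; the batching and decode calls around it are elided):

```python
def embed(self, input, normalize=False):
    # Previously: logits_all was only True for LLAMA_POOLING_TYPE_NONE,
    # e.g. logits_all = pooling_type == LLAMA_POOLING_TYPE_NONE.
    # Now: always mark every token for output, matching what llama.cpp
    # forces internally, so the per-input "overriding" line never appears
    # and the returned embeddings are unchanged.
    logits_all = True
    ...  # build the batch with logits_all and run the embedding pass as before
```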

Test plan

  • model.embed(texts) with a pooled embedding model (e.g. nomic-embed-text-v1.5): confirm the overriding lines are gone and the returned vectors are unchanged (see the usage sketch after this list).
  • model.embed(text, normalize=True) single-string call: output identical.
  • model.embed(texts) with a LLAMA_POOLING_TYPE_NONE model: per-token embeddings unchanged.
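A usage sketch for the first check; the model path is an assumption for illustration, and any GGUF embedding model with a pooled (non-NONE) pooling type would do:

```python
from llama_cpp import Llama

# Path is illustrative; point it at any pooled GGUF embedding model.
model = Llama(model_path="./nomic-embed-text-v1.5.Q8_0.gguf", embedding=True)

# Before this change, llama.cpp printed one "overriding" info line per input;
# afterwards the log is quiet and the returned vectors are identical.
vectors = model.embed(["first sentence", "second sentence"])
print(len(vectors), len(vectors[0]))
```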

🤖 Generated with Claude Code



Development

Successfully merging this pull request may close these issues.

Called model.embed generates INFO messages for each input
