Embeddings
Embeddings are optional. FTS is the default search path and the primary verification target. Embeddings enrich recall in background batches; they do not block the hot sync path.
#Quick path
export OPENAI_API_KEY="..."
discrawl init --with-embeddings
discrawl sync --with-embeddings
discrawl embed --limit 1000
discrawl search --mode semantic "launch checklist"
discrawl search --mode hybrid "launch checklist"
#Two-phase pipeline
- Queue: sync --with-embeddings writes embedding_jobs rows for new messages, changed normalized text, and messages without an existing job. The embedding provider is not called in this phase.
- Drain: discrawl embed claims pending jobs with a short lock so overlapping runs do not process the same batch. It calls the configured provider and writes vectors to message_embeddings with provider, model, input version, dimensions, and binary vector data.
Behavior during drain:
- rate limits requeue the batch and stop that drain run cleanly
- provider or validation failures retry up to three attempts before marking the job failed
- messages with no normalized text are marked done and any stale vector for that message is removed
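The three drain rules above can be sketched as a single loop. This is a simplified, per-job illustration, not the tool's implementation: the Job fields, the RateLimited and ProviderError exception types, and the embed callback are hypothetical, and the sketch assumes a rate limit does not count as an attempt (the text does not say either way).

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 3  # from the text: retry up to three attempts


class RateLimited(Exception):
    """Hypothetical signal that the provider asked the caller to slow down."""


class ProviderError(Exception):
    """Hypothetical provider or validation failure."""


@dataclass
class Job:
    message_id: str
    text: str
    state: str = "pending"
    attempts: int = 0


def drain(jobs, embed, vectors):
    """Process pending jobs in order; return False if the run stopped early."""
    for job in jobs:
        if job.state != "pending":
            continue
        if not job.text.strip():
            # Nothing to embed: finish the job and drop any stale vector.
            job.state = "done"
            vectors.pop(job.message_id, None)
            continue
        try:
            vectors[job.message_id] = embed(job.text)
        except RateLimited:
            # Leave remaining jobs pending (requeued) and stop this run cleanly.
            return False
        except ProviderError:
            job.attempts += 1
            if job.attempts >= MAX_ATTEMPTS:
                job.state = "failed"
            continue
        job.state = "done"
    return True
```

A job that fails below the attempt limit stays pending, so a later drain run picks it up again without any extra bookkeeping.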
#Identity (provider, model, input version)
The identity is stored on each job and each vector. If you change provider or model:
- pending jobs are retargeted to the new identity
- prior attempts are reset
- existing vectors for another identity remain in SQLite but are not used for semantic search
Use --rebuild when you want to regenerate vectors for the existing archive after a config change:
discrawl embed --rebuild --limit 1000
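As a rough illustration of the identity rules in this section, the sketch below models jobs and vectors as plain dictionaries. The Identity tuple, the dictionary keys, and the integer input version are assumptions for the example; the real storage lives in SQLite and its layout is not shown here.

```python
from typing import NamedTuple


class Identity(NamedTuple):
    provider: str
    model: str
    input_version: int  # assumed to be an integer for this sketch


def retarget_pending(jobs, current):
    # Pending jobs follow the configured identity and start over on attempts.
    # Jobs in any other state are left untouched.
    for job in jobs:
        if job["state"] == "pending" and job["identity"] != current:
            job["identity"] = current
            job["attempts"] = 0


def searchable_vectors(vectors, current):
    # Vectors for other identities stay stored but are skipped by semantic search.
    return [v for v in vectors if v["identity"] == current]
```

Filtering at query time rather than deleting means switching back to a previous provider and model can reuse the vectors that were already stored for it.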
#Local provider example
[search.embeddings]
enabled = true
provider = "ollama"
model = "nomic-embed-text"
With local providers, message and query embedding both happen on the same machine. With remote providers, message text is sent during discrawl embed, and search query text is sent during --mode semantic or --mode hybrid calls.
#Git snapshot interaction
By default, publish does not export embeddings. Pass --with-embeddings to include them when publishing, subscribing, or updating:
discrawl publish --with-embeddings --push
discrawl subscribe --with-embeddings https://github.com/example/discord-archive.git
discrawl update --with-embeddings
The snapshot stores vectors under embeddings/<provider>/<model>/<input_version>/... and records that identity in manifest.json. Only vectors for non-DM messages are exported. Import only restores matching embedding manifests, so an Ollama/nomic subscriber does not accidentally import OpenAI/text-embedding vectors. embedding_jobs is never exported; subscribers that want fresh local vectors run discrawl embed --rebuild. Publishing without --with-embeddings omits embedding manifests instead of carrying forward an older bundle.
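The identity check on import can be sketched as follows. The manifest shape used here (an "embeddings" list of objects with provider, model, and input_version keys) is a guess for illustration; the actual manifest.json format is not specified above, and both function names are hypothetical.

```python
import json
from pathlib import PurePosixPath


def snapshot_dir(provider, model, input_version):
    # Mirrors the embeddings/<provider>/<model>/<input_version>/ layout.
    return PurePosixPath("embeddings", provider, model, str(input_version))


def matching_bundles(manifest_text, local):
    # Return only the manifest entries whose identity equals the local config.
    # Anything else is ignored, so vectors from a different provider or model
    # are never imported by accident.
    manifest = json.loads(manifest_text)
    keys = ("provider", "model", "input_version")
    return [
        entry
        for entry in manifest.get("embeddings", [])
        if all(entry.get(k) == local[k] for k in keys)
    ]
```

A manifest with no embedding entries, as produced by publishing without --with-embeddings, yields an empty list, and the subscriber would then build vectors locally with discrawl embed --rebuild.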