Git-backed snapshots
Discrawl can publish the SQLite archive as sharded, compressed NDJSON snapshots in a private Git repo, then auto-import that repo before local read commands. This gives readers org memory without Discord credentials.
Snapshot packing/import and git mirror mechanics are shared through crawlkit. Discrawl still owns Discord-specific privacy policy: @me direct messages, wiretap sync state, and local-only desktop rows are excluded from published snapshots and are preserved locally on import.
#Publisher
discrawl publish --remote https://github.com/example/discord-archive.git --push
discrawl publish --readme path/to/discord-backup/README.md --push
discrawl publish --tag backup-2026-06-19 --push
The publisher uses your existing bot-synced archive. It exports non-DM tables only. Use publish filters to share a narrower snapshot without narrowing the local archive:
discrawl publish --public-only --push
discrawl publish --public-only --include-channels 1458141495701012561 --push
Filter rules:
- --public-only keeps only channels where the guild @everyone role has VIEW_CHANNEL after category and channel permission overwrites
- private threads are excluded
- --include-channels and --exclude-channels accept comma-separated channel ids; exclusions win
- including a forum parent also includes its allowed public threads
- combined filters intersect, so --public-only --include-channels A,B exports only included channels that are also public
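The filter rules compose like set operations. The following is a minimal Python sketch of that composition, not Discrawl's implementation: the Channel record, the everyone_can_view flag (standing in for the resolved VIEW_CHANNEL permission), and the selected helper are all hypothetical names. It also assumes private threads are dropped only when --public-only is set and that exclusions match on the channel's own id, neither of which the rules above spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Channel:
    id: str
    # Hypothetical flag: @everyone has VIEW_CHANNEL after category and
    # channel permission overwrites have been resolved elsewhere.
    everyone_can_view: bool
    is_private_thread: bool = False
    parent_id: Optional[str] = None  # set for threads, e.g. a forum parent


def selected(ch, public_only=False, include=frozenset(), exclude=frozenset()):
    """Return True if the channel would be part of the filtered snapshot."""
    # Exclusions win over every other rule (assumption: matched by own id).
    if ch.id in exclude:
        return False
    # --public-only: drop private threads and non-public channels.
    if public_only and (ch.is_private_thread or not ch.everyone_can_view):
        return False
    # --include-channels: keep listed channels and threads of listed parents.
    if include:
        return ch.id in include or ch.parent_id in include
    return True
```

For example, with channel B not visible to @everyone, selected(Channel("B", False), public_only=True, include={"A", "B"}) is False, because the two filters intersect rather than add up.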
The publisher can keep syncing a richer local archive. Filters only narrow the Git snapshot seen by subscribers.
Filtered publishes currently cannot use --readme, because report totals are computed from the full local archive. Filtered publishes also remove previously generated Discrawl README.md reports from the share repo before committing, so stale full-archive totals are not carried forward. Custom README files without Discrawl report markers are left alone.
#Subscriber
discrawl subscribe https://github.com/example/discord-archive.git
discrawl search "launch checklist"
discrawl messages --channel general --hours 24
subscribe is the Git-only setup path. It writes a config with discord.token_source = "none", imports the snapshot, and does not require a Discord bot token. sync and tail remain disabled in this mode because they need live Discord access.
#Auto-update
Once share.remote is configured, read commands auto-fetch and import when the local share import is older than share.stale_after (default 15m):
discrawl subscribe --stale-after 15m https://github.com/example/discord-archive.git
discrawl subscribe --no-auto-update https://github.com/example/discord-archive.git
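The staleness rule amounts to a timestamp comparison. Here is a small Python sketch of the decision, assuming the time of the last share import is available as a timezone-aware datetime. The needs_import helper and its parameters are illustrative names, not Discrawl internals, and treating a missing import time as stale is an assumption.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_STALE_AFTER = timedelta(minutes=15)  # mirrors the documented 15m default


def needs_import(last_import, now, stale_after=DEFAULT_STALE_AFTER, auto_update=True):
    """Return True when a read command should fetch and import first."""
    if not auto_update:
        # Corresponds to subscribing with --no-auto-update.
        return False
    if last_import is None:
        # Assumption: an archive that was never imported counts as stale.
        return True
    # "Older than" is read as a strict comparison.
    return now - last_import > stale_after
```

With the default window, an import from 20 minutes ago triggers a refresh and one from 5 minutes ago does not.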
discrawl update forces the same pull/import step manually. discrawl update --ref <tag-or-commit> reads the historical Git objects directly and leaves the share checkout unchanged. Snapshot imports are delta-planned from crawlkit shard fingerprints. Older manifests without those fields fall back to Git blob identity, so the common publish shape only imports the changed message tail shard plus small cursor tables. Unsafe table-shape changes still fall back to a full import.
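To make the delta planning more concrete, here is a rough Python sketch of comparing per-shard identities between the previously imported manifest and a new one. The manifest shape (a dict from shard path to a dict with an optional "fingerprint" key and a "blob" key) and the names shard_identity and plan_import are assumptions made for illustration; crawlkit's real manifest format is not described here. The sketch ignores removed shards and the table-shape checks that force a full import.

```python
def shard_identity(entry):
    """Prefer the shard fingerprint; fall back to the Git blob id."""
    # Older manifests have no fingerprint field, so the blob id is used.
    return entry.get("fingerprint") or entry["blob"]


def plan_import(old_manifest, new_manifest):
    """Return the sorted shard paths whose content must be re-imported."""
    changed = []
    for path, entry in new_manifest.items():
        old = old_manifest.get(path)
        # New shards and shards whose identity changed are imported.
        if old is None or shard_identity(old) != shard_identity(entry):
            changed.append(path)
    return sorted(changed)
```

When only the newest message shard and a few cursor tables differ between two publishes, a plan like this stays short no matter how large the archive is.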
discrawl sync does not auto-import the share unless --update=auto or --update=force is provided, so routine live refreshes stay fast.
#Hybrid mode
Keep normal Discord credentials configured and set share.remote:
discrawl sync --update=auto # import snapshot delta first, then live deltas
discrawl messages --sync # blocking pre-query sync for matched scope
discrawl sync --all-channels # broader live repair
discrawl sync --full # historical backfill
#What is published
- non-DM archive tables (DM @me rows are always excluded)
- cached non-DM attachment media as gzip-compressed files by default; use publish --no-media to omit files that are already in cache_dir/media
- with publish filters: only matching channel-scoped rows, matching embedding rows, and member rows referenced by matching messages
- with publish filters: no share manifest state and no guild-level member freshness markers, because those describe the full archive
- without publish filters and with --readme: README activity block (latest update time, latest archived message, archive totals, day/week/month activity)
- embedding_jobs is never exported
#Backing up media
Media backup is publisher-driven and local-cache based:
discrawl sync --with-media
discrawl publish --push
sync --with-media and attachments fetch download Discord attachment bytes into cache_dir/media. publish --push then exports cached non-DM media into the Git snapshot repo as gzip-compressed media/...gz files. Imports restore those files back into the raw local cache layout. Older snapshots that contain raw media/... files still import; the next media publish clears the legacy media tree and rewrites it in gzip form. publish does not fetch missing Discord files itself, so scheduled Git backups that should include media must fetch media before publishing. Set sync.attachment_media = true for scheduled sync jobs and leave share.media = true to include cached media in publish/update flows.
Discord CDN URLs can expire or be removed. Those fetches are stored as failed with their HTTP status, commonly 404; this does not block publishing files that were fetched successfully.
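The gzip export and restore steps described in this section can be pictured with Python's standard gzip module. This is only a sketch of the round trip: the export_media and restore_media helpers and the paths passed to them are hypothetical, and the real snapshot layout under media/ is not reproduced here.

```python
import gzip
import shutil
from pathlib import Path


def export_media(cache_file: Path, snapshot_file: Path) -> None:
    """Write a gzip-compressed copy of a cached media file into the snapshot."""
    snapshot_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("rb") as src, gzip.open(snapshot_file, "wb") as dst:
        shutil.copyfileobj(src, dst)


def restore_media(snapshot_file: Path, cache_file: Path) -> None:
    """Decompress a snapshot media file back into the raw cache layout."""
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(snapshot_file, "rb") as src, cache_file.open("wb") as dst:
        shutil.copyfileobj(src, dst)
```

Neither helper touches the network, which matches the point above: files that were never fetched into the cache cannot appear in the snapshot.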
#Backing up vectors
discrawl publish --with-embeddings --push
discrawl subscribe --with-embeddings https://github.com/example/discord-archive.git
discrawl update --with-embeddings
Embedding bundles are stored under embeddings/<provider>/<model>/<input_version>/.... Import restores only bundles whose identity matches the local embedding configuration, so Ollama/nomic subscribers do not accidentally pick up OpenAI/text-embedding vectors. Publishing without --with-embeddings omits embedding manifests instead of carrying forward an older bundle.
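The identity match can be read straight off the directory layout. Below is a Python sketch, assuming that provider, model, and input version each occupy exactly one path segment. The helper names and the example identity values in the usage note are illustrative only and are not taken from Discrawl.

```python
from pathlib import PurePosixPath


def bundle_identity(path):
    """Extract (provider, model, input_version) from a snapshot-relative path."""
    parts = PurePosixPath(path).parts
    # Expect embeddings/<provider>/<model>/<input_version>/<file...>
    if len(parts) < 5 or parts[0] != "embeddings":
        return None
    return parts[1], parts[2], parts[3]


def should_import(path, local_identity):
    """Import a bundle file only when its identity equals the local one."""
    return bundle_identity(path) == local_identity
```

With a local identity of ("ollama", "nomic-embed-text", "v1"), a file under embeddings/openai/... is skipped because the provider segment differs, even if the rest of the path looks similar.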
#CI
The Docker smoke test installs discrawl in a clean Go container, subscribes to a Git snapshot repo, then checks search, messages, sql, and report:
DISCRAWL_DOCKER_TEST=1 go test ./internal/cli -run TestDockerGitSourceSmoke -count=1
The backup workflows restore and save .discrawl-ci/discrawl.db with actions/cache. On a warm runner cache, scheduled publishers skip the pre-sync snapshot import and go straight to the live latest-message delta before publishing. Cache misses still import the latest published snapshot first so --latest-only has channel cursors to resume from.
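The cache-dependent ordering can be summarized in a few lines of Python. This is a sketch only: the step labels and the plan_backup_steps helper are made up for illustration and do not correspond to real workflow step names.

```python
def plan_backup_steps(cache_hit: bool) -> list[str]:
    """Return the ordered steps for one scheduled backup run."""
    steps = []
    if not cache_hit:
        # Cold cache: import the latest published snapshot first so the
        # latest-only sync has channel cursors to resume from.
        steps.append("import latest snapshot")
    steps.append("sync latest-message delta")
    steps.append("publish")
    return steps
```

A warm cache therefore yields two steps and a cache miss yields three, with the snapshot import always ahead of the live sync.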