LoRA training corpus

Downloadable Nibiru framework training corpus in four formats: chunks, instructions, chat and completion. Distributed as JSONL in per-language and combined buckets.


A pre-built training corpus for fine-tuning your own LoRA on Nibiru. It is generated deterministically from two sources: the deep framework reference (every public factory, namespace, idiom and gotcha, each cited by file:line) and the public docs in five languages.

Each format is offered per language, plus a combined *-all.jsonl. Pick a language bucket if you’re training a monolingual LoRA (recommended); pick the combined *-all.jsonl bucket if you want the model to handle queries in any of the covered languages.

Which format?

instructions.jsonl — start here for most LoRA training. Alpaca-style instruction / input / output triples; works with axolotl, llama-factory, unsloth.
chat.jsonl — ShareGPT/messages format with a system prompt. Best for chat-LoRAs and OpenAI-compatible fine-tune APIs.
completion.jsonl — legacy prompt/completion pairs. Use for old-style text-completion fine-tunes.
chunks.jsonl — one record per source chunk, no Q/A wrapping. Good as a RAG corpus or for custom Q/A generation pipelines.
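To make the differences between the four formats concrete, here is a minimal Python sketch that guesses a record's format from its keys. The instruction / input / output and prompt / completion field names come from the descriptions above. The `messages` and `conversations` keys are assumptions based on the usual OpenAI-style and ShareGPT conventions, and the fallback to "chunks" is also an assumption, since the chunk schema is not documented here. Inspect a real line from each file before relying on any of this.

```python
import json


def detect_format(record: dict) -> str:
    """Guess which corpus format a parsed JSONL record belongs to."""
    # Assumed chat keys: "messages" (OpenAI-style) or "conversations" (ShareGPT).
    if "messages" in record or "conversations" in record:
        return "chat"
    # Alpaca-style triples; "input" may be empty, so only two keys are required.
    if {"instruction", "output"} <= record.keys():
        return "instructions"
    # Legacy prompt/completion pairs.
    if {"prompt", "completion"} <= record.keys():
        return "completion"
    # Assumption: anything without Q/A wrapping is treated as a raw chunk.
    return "chunks"


def load_jsonl(path: str) -> list[dict]:
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


if __name__ == "__main__":
    # Hypothetical sample lines, written by hand for illustration only.
    samples = [
        '{"instruction": "What does X do?", "input": "", "output": "It does Y."}',
        '{"prompt": "Q: What does X do?", "completion": "It does Y."}',
    ]
    for line in samples:
        print(detect_format(json.loads(line)))
```

Running `detect_format` over the first record of a downloaded file is a quick sanity check that the file matches what your training tool expects.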

Provenance

The corpus is rebuilt fresh every time the docs deploy:

node scripts/build-corpus.mjs

Source: scripts/build-corpus.mjs. SHA-256 hashes for the corpus files are listed in /corpus/manifest.json; verify them before you train if reproducibility matters.
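Checking downloaded files against the manifest could look roughly like the Python sketch below. The layout of manifest.json is not described on this page, so the code assumes a flat JSON object that maps each file name to its hex SHA-256 digest. The `verify_corpus` helper and its parameters are hypothetical names, and the key lookup will need adjusting if the real manifest is shaped differently.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_corpus(corpus_dir, manifest_name: str = "manifest.json") -> list[str]:
    """Return the names of files that are missing or whose digest differs.

    ASSUMPTION: the manifest is a flat object {"<file name>": "<hex sha256>"}.
    """
    corpus_dir = Path(corpus_dir)
    manifest = json.loads((corpus_dir / manifest_name).read_text(encoding="utf-8"))
    mismatched = []
    for name, expected in manifest.items():
        target = corpus_dir / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            mismatched.append(name)
    return mismatched
```

An empty list from `verify_corpus("corpus")` means every file listed in the manifest is present and matches its recorded hash. Any returned name is worth re-downloading before training.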

License

The framework reference and public docs are released under the same license as Nibiru itself (MIT). Trained model weights are yours. If you publish a LoRA fine-tuned on this corpus, a “trained on the Nibiru training corpus” attribution is appreciated but not required.