LoRA training corpus
Downloadable Nibiru framework training corpus — chunks, instructions, chat, completion. Per-language and combined buckets, JSONL.
A pre-built training corpus for fine-tuning your own LoRA on Nibiru. It is generated deterministically from two sources: the deep framework reference (every public factory, namespace, idiom and gotcha, each cited by file:line) and the public docs in five languages.
Each format is offered per language, plus a combined *-all.jsonl. Pick a
language bucket if you’re training a monolingual LoRA (recommended); pick
the combined *-all.jsonl if you want the model to handle queries in any language.
Which format?
instructions.jsonl: start here for most LoRA training. Alpaca-style instruction / input / output triples; works with axolotl, llama-factory, unsloth.
chat.jsonl: ShareGPT/messages format with a system prompt. Best for chat-LoRAs and OpenAI-compatible fine-tune APIs.
completion.jsonl: legacy prompt/completion pairs. Use for old-style text-completion fine-tunes.
chunks.jsonl: one record per source chunk, no Q/A wrapping. Good as a RAG corpus or for custom Q/A generation pipelines.
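As a rough illustration of how the four formats differ, the sketch below reads a JSONL file and guesses each record's format from its keys. The instruction / input / output names come from the description above. The `messages`, `conversations`, `prompt` and `completion` keys are assumptions based on the usual ShareGPT/messages and prompt/completion conventions, not confirmed field names of this corpus, and both helper functions are hypothetical. Inspect a real line from each file before relying on them.

```python
import json


def read_jsonl(path):
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def guess_format(record):
    """Guess the corpus format of a record from its keys.

    Key names other than instruction/input/output are assumed,
    not taken from the corpus documentation.
    """
    keys = set(record)
    if {"instruction", "output"} <= keys:
        return "instructions"
    if "messages" in keys or "conversations" in keys:
        return "chat"
    if {"prompt", "completion"} <= keys:
        return "completion"
    # Anything else is treated as a raw source chunk.
    return "chunks"
```

Running `guess_format` over the first few records of a downloaded file is a cheap sanity check that the file matches the format your training tool expects.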
Provenance
The corpus is rebuilt fresh every time the docs deploy:
node scripts/build-corpus.mjs
Source: scripts/build-corpus.mjs.
SHA-256 hashes are in /corpus/manifest.json — verify
before you train if reproducibility matters.
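One way to do that check is to hash the downloaded file locally and compare the result with the value listed in the manifest. The sketch below assumes you have already looked up the expected hex digest for your file by hand; it does not parse manifest.json, because the manifest's layout is not described here. The function names are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size blocks so large corpus files need not fit in memory.
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's digest with an expected hex string (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

If `matches_expected` returns False, download the file again before training rather than proceeding with a corpus that differs from the published build.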
License
The framework reference and public docs are licensed the same as Nibiru itself (MIT). Trained model weights are yours. If you publish a LoRA fine-tuned on this corpus, a “trained on the Nibiru training corpus” attribution is appreciated but not required.