Training Corpus
How the docs are exported as a LoRA-ready training set, and how to regenerate it.
Every page in this documentation is also a training data point. Nibiru ships a script that extracts a clean JSONL corpus suitable for LoRA fine-tuning of an open-weight model — Llama, Mistral, Qwen, Gemma — on Nibiru-specific knowledge.
Run it
    cd docs
    npm run build:corpus

This writes:

    docs/dist/corpus/
    ├── instructions.jsonl   # instruction → response pairs
    ├── chat.jsonl           # OpenAI/Anthropic chat-message format
    ├── completion.jsonl     # plain prompt → completion (legacy)
    └── chunks.jsonl         # raw Markdown chunks (one per H2/H3 section)

Formats
instructions.jsonl
LoRA-friendly instruction tuning:

    {
      "instruction": "How do I scaffold a new module in Nibiru?",
      "input": "",
      "output": "Run `./nibiru -m <name>`, optionally with `-g` for Graylog hooks. This creates `application/module/<name>/` with traits/, plugins/, interfaces/, settings/<name>.ini and the main `<name>.php` class implementing `IModule`."
    }

Each entry is generated from a docs section with a clear question (derived from the H2/H3 title) and the section body as the answer.
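Most trainers flatten an instruction record into a single prompt string before tokenizing. The sketch below shows one way to do that for the record shape above. The Alpaca-style template and the helper names `render_example` and `load_jsonl` are illustrative assumptions, not something Nibiru ships.

```python
import json

# Hypothetical Alpaca-style templates; swap in whatever your trainer expects.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{output}"
TEMPLATE_WITH_INPUT = (
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
)


def render_example(record: dict) -> str:
    """Turn one instructions.jsonl record into a single training string."""
    template = TEMPLATE_WITH_INPUT if record.get("input") else TEMPLATE
    return template.format(
        instruction=record["instruction"],
        input=record.get("input", ""),
        output=record["output"],
    )


def load_jsonl(path: str) -> list[dict]:
    """Read a JSONL file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Shortened version of the record shown above.
sample = {
    "instruction": "How do I scaffold a new module in Nibiru?",
    "input": "",
    "output": "Run `./nibiru -m <name>`.",
}
print(render_example(sample))
```

Records with an empty `input` use the shorter template, so the model never sees a blank "Input" section.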
chat.jsonl
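OpenAI-style chat-message format:

    {
      "messages": [
        {"role": "system", "content": "You are an expert on the Nibiru PHP framework."},
        {"role": "user", "content": "How do I scaffold a new module?"},
        {"role": "assistant", "content": "Run `./nibiru -m <name>`. …"}
      ]
    }

Compatible with OpenAI fine-tunes and with most LoRA tooling that accepts chat-format inputs (Axolotl, Unsloth, LLaMA-Factory). Two caveats apply. Anthropic’s Messages API takes the system prompt as a top-level `system` parameter rather than as a message, so lift the first entry out before using the file for evaluation there. ShareGPT-style loaders expect a `conversations` list with `from`/`value` keys, so either use a loader that reads `role`/`content` messages or convert the records first.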
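The `role`/`content` record above does not map one-to-one onto every consumer. A minimal sketch of two conversions follows: one to the ShareGPT `conversations` layout, and one that moves system messages into a top-level `system` field as the Anthropic Messages API expects. The function names are hypothetical.

```python
# Role names conventionally used by ShareGPT-style datasets.
SHAREGPT_ROLES = {"system": "system", "user": "human", "assistant": "gpt"}


def to_sharegpt(record: dict) -> dict:
    """Map a chat.jsonl record to the ShareGPT `conversations` layout."""
    return {
        "conversations": [
            {"from": SHAREGPT_ROLES[m["role"]], "value": m["content"]}
            for m in record["messages"]
        ]
    }


def to_anthropic(record: dict) -> dict:
    """Move system messages into a top-level `system` string."""
    messages = record["messages"]
    system = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
    turns = [m for m in messages if m["role"] != "system"]
    return {"system": system, "messages": turns}


sample = {
    "messages": [
        {"role": "system", "content": "You are an expert on the Nibiru PHP framework."},
        {"role": "user", "content": "How do I scaffold a new module?"},
        {"role": "assistant", "content": "Run `./nibiru -m <name>`."},
    ]
}
print(to_sharegpt(sample)["conversations"][1])
print(to_anthropic(sample)["system"])
```

For evaluation you would typically also drop the final assistant turn and compare it with the model's reply. That step is left out here.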
chunks.jsonl
Raw chunks for use as RAG retrieval data:

    {
      "id": "core/modules#observer-pattern",
      "title": "The observer pattern",
      "url": "/core/modules/#the-observer-pattern",
      "section": "core/modules",
      "language": "en",
      "tokens": 412,
      "content": "Modules implementing `SplSubject` can broadcast events…"
    }

This is exactly the file the Oracle uses internally.
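To illustrate how chunk records can back a retrieval step, here is a deliberately simple lexical ranker over the `title` and `content` fields. It is a stand-in only. How the Oracle actually retrieves chunks is not described here, and a real RAG index would more likely use embeddings. The second sample record is invented for the demonstration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def rank_chunks(query: str, chunks: list[dict], top_k: int = 3) -> list[dict]:
    """Rank chunks by how often the query terms occur in title + content."""
    query_terms = set(tokenize(query))
    scored = []
    for chunk in chunks:
        counts = Counter(tokenize(chunk["title"] + " " + chunk["content"]))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, chunk))
    # Sort on the score only, so chunk dicts are never compared with each other.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


chunks = [
    {
        "id": "core/modules#observer-pattern",
        "title": "The observer pattern",
        "content": "Modules implementing `SplSubject` can broadcast events…",
    },
    {
        # Hypothetical record, present only to give the ranker something to skip.
        "id": "example/placeholder#section",
        "title": "Placeholder section",
        "content": "Unrelated text about configuration.",
    },
]
for hit in rank_chunks("observer events", chunks):
    print(hit["id"])
```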
How chunks are derived
The corpus builder walks every .md / .mdx file under src/content/docs/, parses it into an AST, and chunks it at H2/H3 boundaries. It enforces:
- One chunk per H2 section (or H3 if the H2 is empty).
- ~200–800 tokens per chunk (split if longer, merge if shorter).
- Code fences are kept intact — never split mid-block.
- Each chunk carries its source path, anchor URL, and language code.
The script is in scripts/build-corpus.mjs and is fully configurable.
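The chunking rules above can be sketched in a few lines. This is not the code in scripts/build-corpus.mjs. It is a simplified Python illustration that assumes line-based heading detection instead of a real AST, uses whitespace word counts as a stand-in for tokens, and omits the splitting of over-long sections. All function names are hypothetical.

```python
import re

FENCE = "`" * 3  # built indirectly so this example can itself sit inside a fence
HEADING = re.compile(r"^(#{2,3})\s+(.*)$")


def split_sections(markdown: str) -> list[dict]:
    """Split Markdown at H2/H3 headings, ignoring '#' lines inside code fences."""
    sections, current, in_fence = [], {"title": "", "lines": []}, False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            if current["title"] or current["lines"]:
                sections.append(current)
            current = {"title": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)
    if current["title"] or current["lines"]:
        sections.append(current)
    return [
        {"title": s["title"], "content": "\n".join(s["lines"]).strip()}
        for s in sections
    ]


def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: whitespace-separated words."""
    return len(text.split())


def merge_short(sections: list[dict], min_tokens: int = 200) -> list[dict]:
    """Fold sections below min_tokens into the section that follows them."""
    merged, carry = [], None
    for section in sections:
        if carry is not None:
            parts = (carry["content"], section["title"], section["content"])
            section = {
                "title": carry["title"] or section["title"],
                "content": "\n\n".join(part for part in parts if part),
            }
            carry = None
        if approx_tokens(section["content"]) < min_tokens:
            carry = section
        else:
            merged.append(section)
    if carry is not None:
        merged.append(carry)
    return merged


sample = "\n".join([
    "## Install",
    "Run the installer.",
    FENCE + "sh",
    "## not a heading, just a shell comment",
    "./install",
    FENCE,
    "### Options",
    "Pass -g for Graylog hooks.",
])
for chunk in merge_short(split_sections(sample), min_tokens=1):
    print(chunk["title"])
```

Because fences are tracked line by line, a `##` comment inside a shell block stays with its code rather than starting a new chunk.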
Suggested LoRA recipe
A pragmatic baseline for an 8B-parameter base model on a single A100 / 4090:
    base_model: meta-llama/Llama-3.1-8B-Instruct
    adapter: lora
    lora_r: 16
    lora_alpha: 32
    lora_dropout: 0.05
    lora_target_modules: [q_proj, k_proj, v_proj, o_proj]

    datasets:
      - path: docs/dist/corpus/chat.jsonl
        # chat.jsonl stores role/content "messages"; the sharegpt type expects
        # "conversations" with from/value keys, so chat_template is used here.
        type: chat_template
        field_messages: messages

    sequence_len: 4096
    sample_packing: true
    gradient_accumulation_steps: 4
    micro_batch_size: 2
    num_epochs: 3
    optimizer: adamw_bnb_8bit
    learning_rate: 0.0002
    warmup_ratio: 0.05
    bf16: true

Train, then merge the LoRA weights and serve via Ollama, vLLM, or text-generation-inference. Swap the Oracle’s MODEL to point at your local endpoint and you have a fully Nibiru-native chat UX.
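A few quantities follow directly from the recipe's numbers and are worth checking before a run. The arithmetic below assumes the published Llama-3.1-8B shapes (hidden size 4096, 32 layers, 8 key/value heads of dimension 128) and the standard LoRA scaling of alpha divided by rank. Adjust the shapes if you pick a different base model.

```python
# Values copied from the recipe above.
lora_r, lora_alpha = 16, 32
micro_batch_size, gradient_accumulation_steps, sequence_len = 2, 4, 4096

# Assumed Llama-3.1-8B shapes: k_proj and v_proj map 4096 -> 1024 because the
# model uses 8 key/value heads of dimension 128 (grouped-query attention).
hidden, kv_dim, layers = 4096, 1024, 32
target_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
}

# A rank-r adapter on a (d_in -> d_out) layer adds r * (d_in + d_out) weights.
per_layer = sum(lora_r * (d_in + d_out) for d_in, d_out in target_shapes.values())
trainable_params = per_layer * layers

scaling = lora_alpha / lora_r
sequences_per_step = micro_batch_size * gradient_accumulation_steps
max_tokens_per_step = sequences_per_step * sequence_len

print(trainable_params)  # 13631488
print(scaling)  # 2.0
print(sequences_per_step, max_tokens_per_step)  # 8 32768
```

So each optimizer step on one GPU sees 8 packed sequences, at most 32,768 tokens, and trains roughly 13.6 million adapter weights.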
Re-run on every doc change
Wire it into your CI:

    - name: Build corpus
      run: cd docs && npm run build:corpus
    - name: Upload corpus artifact
      uses: actions/upload-artifact@v4
      with:
        name: nibiru-corpus
        path: docs/dist/corpus/

When docs change, the corpus re-builds; consumers (training pipelines, RAG indexes) always have the freshest data.
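A CI job can also sanity-check the generated files before uploading them. The sketch below is a hypothetical validator, not part of the Nibiru tooling. Its required-key sets are inferred from the example records on this page, and completion.jsonl is left out because its record shape is not shown here.

```python
import json

# Required keys per file, inferred from the examples on this page (assumption).
REQUIRED_KEYS = {
    "instructions.jsonl": {"instruction", "input", "output"},
    "chat.jsonl": {"messages"},
    "chunks.jsonl": {"id", "title", "url", "section", "language", "tokens", "content"},
}


def validate_lines(name: str, lines) -> list[str]:
    """Return human-readable problems; an empty list means the file looks fine."""
    problems = []
    required = REQUIRED_KEYS[name]
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append(f"{name}:{number}: invalid JSON ({error.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"{name}:{number}: expected a JSON object")
            continue
        missing = required - record.keys()
        if missing:
            problems.append(f"{name}:{number}: missing {sorted(missing)}")
    return problems


demo = ['{"messages": []}', "{oops", '{"other": 1}']
for problem in validate_lines("chat.jsonl", demo):
    print(problem)
```

In a pipeline you would pass each open file to `validate_lines` and exit non-zero if any problems come back, so a broken corpus never reaches the artifact store.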
Languages
The corpus respects locale. Pages under en/ are tagged language: en, German pages language: de, and so on. Train monolingual or multilingual LoRAs by filtering the JSONL on the language field.
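Filtering on the language field is a short streaming pass over the file. The sketch below assumes each record carries a top-level `language` key, as the chunks.jsonl example does. Whether the other corpus files carry the same key is an assumption to verify against your build output. The function name is hypothetical.

```python
import json


def filter_by_language(lines, languages):
    """Yield parsed records whose `language` field is in `languages`."""
    wanted = set(languages)
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("language") in wanted:
            yield record


# Invented sample lines for demonstration.
sample_lines = [
    '{"id": "a", "language": "en", "content": "Hello"}',
    '{"id": "b", "language": "de", "content": "Hallo"}',
    '{"id": "c", "content": "no language tag"}',
]
german_only = list(filter_by_language(sample_lines, ["de"]))
print([record["id"] for record in german_only])  # ['b']
```

Passing several codes, for example `["en", "de"]`, gives the multilingual subset. Records without a language tag are skipped rather than guessed at.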
Licence
The docs are licensed under the same BSD-4-Clause licence as the framework itself. The exported corpus inherits that licence: you’re free to fine-tune models on it for commercial use, provided you give attribution.