Nibiru docs v0.9.2

Training Corpus

How the docs are exported as a LoRA-ready training set, and how to regenerate it.


Every page in this documentation is also a training data point. Nibiru ships a script that extracts a clean JSONL corpus suitable for LoRA fine-tuning of an open-weight model — Llama, Mistral, Qwen, Gemma — on Nibiru-specific knowledge.

To build or regenerate the corpus:
cd docs
npm run build:corpus

This writes:

docs/dist/corpus/
├── instructions.jsonl # instruction → response pairs
├── chat.jsonl # OpenAI/Anthropic chat-message format
├── completion.jsonl # plain prompt → completion (legacy)
└── chunks.jsonl # raw Markdown chunks (one per H2/H3 section)

instructions.jsonl holds LoRA-friendly instruction-tuning pairs:

{
  "instruction": "How do I scaffold a new module in Nibiru?",
  "input": "",
  "output": "Run `./nibiru -m <name>`, optionally with `-g` for Graylog hooks. This creates `application/module/<name>/` with traits/, plugins/, interfaces/, settings/<name>.ini and the main `<name>.php` class implementing `IModule`."
}

Each entry is generated from a docs section with a clear question (derived from the H2/H3 title) and the section body as the answer.
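Before training, it can help to sanity-check the file. The following is a minimal sketch, assuming each line of instructions.jsonl is one JSON object with the three string keys shown in the example above; the function name `load_instruction_records` is hypothetical and not part of Nibiru.

```python
import json

# Keys taken from the example record in the docs.
REQUIRED_KEYS = ("instruction", "input", "output")


def load_instruction_records(lines):
    """Parse JSONL lines and return the records, raising on malformed ones."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        bad = [k for k in REQUIRED_KEYS if not isinstance(record.get(k), str)]
        if bad:
            raise ValueError(f"line {line_no}: missing or non-string keys {bad}")
        # "input" may be empty, but a pair without question or answer is useless.
        if not record["instruction"].strip() or not record["output"].strip():
            raise ValueError(f"line {line_no}: empty instruction or output")
        records.append(record)
    return records
```

A file object can be passed directly, for example `load_instruction_records(open("docs/dist/corpus/instructions.jsonl", encoding="utf-8"))`.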

chat.jsonl, in the OpenAI chat / Anthropic Messages style:

{
  "messages": [
    {"role": "system", "content": "You are an expert on the Nibiru PHP framework."},
    {"role": "user", "content": "How do I scaffold a new module?"},
    {"role": "assistant", "content": "Run `./nibiru -m <name>`. …"}
  ]
}

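Compatible with OpenAI fine-tunes and with most LoRA tooling that expects chat-format inputs (Axolotl’s sharegpt template, Unsloth, LLaMA-Factory). For evaluation against Anthropic’s Messages API, move the system message into the top-level system parameter, since that API accepts only user and assistant roles inside messages.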
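The Anthropic Messages API takes the system prompt as a top-level `system` argument, not as a message with the role `system`, so a chat.jsonl record needs a small reshaping step before it can be used for evaluation there. Here is a sketch of that step, assuming records look like the example above. The helper name and the model id passed to it are placeholders.

```python
def to_anthropic_request(record, model, max_tokens=1024):
    """Turn one chat.jsonl record into Messages API arguments plus the reference answer."""
    system_parts = []
    turns = []
    for message in record["messages"]:
        if message["role"] == "system":
            system_parts.append(message["content"])
        else:
            turns.append({"role": message["role"], "content": message["content"]})

    # For evaluation, hold back the final assistant turn as the reference answer
    # so the model under test has to produce its own.
    reference = None
    if turns and turns[-1]["role"] == "assistant":
        reference = turns.pop()["content"]

    request = {"model": model, "max_tokens": max_tokens, "messages": turns}
    if system_parts:
        request["system"] = "\n\n".join(system_parts)
    return request, reference
```

The returned dictionary holds keyword arguments for a Messages API call, and the reference string can be compared with the model’s reply.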

chunks.jsonl holds raw chunks for use as RAG retrieval data:

{
  "id": "core/modules#observer-pattern",
  "title": "The observer pattern",
  "url": "/core/modules/#the-observer-pattern",
  "section": "core/modules",
  "language": "en",
  "tokens": 412,
  "content": "Modules implementing `SplSubject` can broadcast events…"
}

This is exactly the file the Oracle uses internally.
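To illustrate how the chunk records can feed retrieval, here is a deliberately simple keyword-overlap ranker. It is only a sketch: it does not describe how the Oracle retrieves, a real setup would more likely use embeddings, and the function names are hypothetical. It assumes only the `title` and `content` fields shown above.

```python
import re


def tokenize(text):
    """Lower-case a string and return its set of word-like terms."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def rank_chunks(query, chunks, top_k=3):
    """Return up to top_k chunks, ordered by number of query terms they contain."""
    query_terms = tokenize(query)
    scored = []
    for chunk in chunks:
        chunk_terms = tokenize(chunk["title"] + " " + chunk["content"])
        overlap = len(query_terms & chunk_terms)
        if overlap:
            scored.append((overlap, chunk))
    # Sort on the score only; the sort is stable, so ties keep file order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```

Each returned chunk still carries its `url`, so an answer can link back to the section it came from.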

The corpus builder walks every .md / .mdx file under src/content/docs/, parses it into an AST, and chunks it at H2/H3 boundaries. It enforces:

  • One chunk per H2 section (or H3 if the H2 is empty).
  • ~200–800 tokens per chunk (split if longer, merge if shorter).
  • Code fences are kept intact — never split mid-block.
  • Each chunk carries its source path, anchor URL, and language code.
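The heading and code-fence rules above can be sketched in a few lines. This is an illustration, not the logic in scripts/build-corpus.mjs: it works on raw lines instead of an AST, splits at every H2 and H3, and leaves out the token-size rule. The function name is hypothetical.

```python
import re

# Matches "## Title" and "### Title", but not "# Title" or "#### Title".
HEADING = re.compile(r"^(#{2,3})\s+(.*)$")


def split_sections(markdown):
    """Split Markdown into (title, body) pairs at H2/H3 headings.

    Lines inside fenced code blocks are never treated as headings,
    so a fence always stays within a single section.
    """
    sections = []
    title, body = None, []
    in_fence = False

    def flush():
        # Keep a section if it has a title or any non-blank text.
        if title is not None or any(line.strip() for line in body):
            sections.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections
```

Text before the first H2 comes back with a title of `None`, which a caller could drop or attach to the page title.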

The script is in scripts/build-corpus.mjs and is fully configurable.

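A pragmatic baseline for fine-tuning an 8B-parameter model on a single A100 or RTX 4090 (on the 24 GB card, you may need to lower micro_batch_size or sequence_len to fit in memory):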

axolotl.yaml
base_model: meta-llama/Llama-3.1-8B-Instruct
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules: [q_proj, k_proj, v_proj, o_proj]
datasets:
  - path: docs/dist/corpus/chat.jsonl
    type: sharegpt
sequence_len: 4096
sample_packing: true
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_bnb_8bit
learning_rate: 0.0002
warmup_ratio: 0.05
bf16: true
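To get a feel for what these numbers imply, the arithmetic below derives the adapter size and the per-step batch from the config. The layer shapes are an assumption taken from the published Llama-3.1-8B model config (hidden size 4096, 32 layers, 8 key/value heads of 128 dimensions), not from the Nibiru docs, and the batch figures are for a single GPU.

```python
# Values copied from the axolotl.yaml above.
micro_batch_size = 2
gradient_accumulation_steps = 4
sequence_len = 4096
lora_r = 16
lora_alpha = 32

# Assumed Llama-3.1-8B shapes (from the model's published config).
hidden_size = 4096
num_layers = 32
kv_dim = 8 * 128  # grouped-query attention: 8 key/value heads, 128 dims each


def lora_params(in_features, out_features, r):
    """A LoRA adapter adds two matrices per layer: (r x in) and (out x r)."""
    return r * (in_features + out_features)


per_layer = (
    lora_params(hidden_size, hidden_size, lora_r)    # q_proj
    + lora_params(hidden_size, kv_dim, lora_r)       # k_proj
    + lora_params(hidden_size, kv_dim, lora_r)       # v_proj
    + lora_params(hidden_size, hidden_size, lora_r)  # o_proj
)
trainable_params = per_layer * num_layers

sequences_per_step = micro_batch_size * gradient_accumulation_steps
max_tokens_per_step = sequences_per_step * sequence_len  # upper bound with packing
lora_scale = lora_alpha / lora_r  # factor applied to the adapter update

print(f"trainable LoRA parameters: {trainable_params:,}")    # 13,631,488
print(f"sequences per optimizer step: {sequences_per_step}")  # 8
print(f"max tokens per optimizer step: {max_tokens_per_step:,}")  # 32,768
print(f"LoRA scaling factor: {lora_scale}")  # 2.0
```

Under these assumptions the adapter trains roughly 13.6 million parameters, a small fraction of the 8B in the frozen model.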

Train, then merge the LoRA weights and serve via Ollama, vLLM, or text-generation-inference. Swap the Oracle’s MODEL to point at your local endpoint and you have a fully Nibiru-native chat UX.

Wire it into your CI:

- name: Build corpus
  run: cd docs && npm run build:corpus
- name: Upload corpus artifact
  uses: actions/upload-artifact@v4
  with:
    name: nibiru-corpus
    path: docs/dist/corpus/

Each CI run that follows a docs change rebuilds the corpus, so consumers (training pipelines, RAG indexes) always have the freshest data.

The corpus respects locale. Pages under en/ are tagged language: en, German pages language: de, and so on. Train monolingual or multilingual LoRAs by filtering the JSONL on the language field.
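Filtering on the language field takes only a few lines. The sketch below assumes each JSONL record carries a top-level `language` key, as the chunks.jsonl example does; whether the other files carry the same key is not shown here, so check a few records first. The function name is hypothetical.

```python
import json


def filter_by_language(lines, languages):
    """Yield parsed records whose "language" field is one of the given codes."""
    wanted = set(languages)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Records without a language field are skipped, not guessed.
        if record.get("language") in wanted:
            yield record
```

Passing `["de"]` gives a monolingual German set, while `["en", "de"]` gives a bilingual one.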

The docs are licensed under BSD-4-Clause, the same licence as the framework itself. The exported corpus inherits that licence: you’re free to fine-tune models on it, including for commercial use, as long as you provide the required attribution.