Training Corpus

How the docs are exported as a LoRA-ready training set, and how to regenerate it.

Every page in this documentation is also a training data point. Nibiru ships a script that extracts a clean JSONL corpus suitable for LoRA fine-tuning of an open-weight model — Llama, Mistral, Qwen, Gemma — on Nibiru-specific knowledge.

Run it

cd docs
npm run build:corpus

This writes:

docs/dist/corpus/
├── instructions.jsonl     # instruction → response pairs
├── chat.jsonl             # OpenAI/Anthropic chat-message format
├── completion.jsonl       # plain prompt → completion (legacy)
└── chunks.jsonl           # raw Markdown chunks (one per H2/H3 section)

Formats

`instructions.jsonl`

LoRA-friendly instruction tuning:

{
  "instruction": "How do I scaffold a new module in Nibiru?",
  "input": "",
  "output": "Run `./nibiru -m <name>`, optionally with `-g` for Graylog hooks. This creates `application/module/<name>/` with traits/, plugins/, interfaces/, settings/<name>.ini and the main `<name>.php` class implementing `IModule`."
}

Each entry is generated from one docs section: the question is derived from the H2/H3 title, and the section body becomes the answer.
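
A minimal sketch of reading `instructions.jsonl` and rendering each record into a single training prompt. The Alpaca-style template and the helper names `read_jsonl` and `render` are assumptions for illustration; the corpus builder does not prescribe a prompt template.

```python
import json
from typing import Iterator

# Hypothetical Alpaca-style templates; swap in whatever your trainer expects.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{output}"
TEMPLATE_WITH_INPUT = (
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
)


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def render(record: dict) -> str:
    """Format one record, omitting the Input block when `input` is empty."""
    template = TEMPLATE_WITH_INPUT if record.get("input") else TEMPLATE
    return template.format(**record)


# Usage (path as written by `npm run build:corpus`):
# prompts = [render(r) for r in read_jsonl("docs/dist/corpus/instructions.jsonl")]
```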

`chat.jsonl`

OpenAI-style chat format, also usable with Anthropic’s Messages API once the system message is moved to the top-level system parameter:

{
  "messages": [
    {"role": "system", "content": "You are an expert on the Nibiru PHP framework."},
    {"role": "user", "content": "How do I scaffold a new module?"},
    {"role": "assistant", "content": "Run `./nibiru -m <name>`. …"}
  ]
}

Compatible with OpenAI fine-tunes, Anthropic’s API for evaluation, and most LoRA tooling that accepts chat-format input (Axolotl’s chat_template dataset type, Unsloth, LLaMA-Factory).
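
A minimal sketch of preparing a `chat.jsonl` record for an evaluation run against Anthropic's API. It lifts the system message out of the list (the Messages API takes it as a top-level `system` value) and holds back the final assistant turn as the reference answer. The function names are hypothetical, and the model name and other request parameters are deliberately left out.

```python
def to_anthropic(record: dict) -> dict:
    """Split a chat.jsonl record into a system string plus user/assistant turns."""
    system_parts = [m["content"] for m in record["messages"] if m["role"] == "system"]
    turns = [
        {"role": m["role"], "content": m["content"]}
        for m in record["messages"]
        if m["role"] in ("user", "assistant")
    ]
    payload = {"messages": turns}
    if system_parts:
        payload["system"] = "\n\n".join(system_parts)
    return payload


def eval_pair(record: dict) -> tuple[dict, str]:
    """Return (request payload without the final answer, reference answer)."""
    payload = to_anthropic(record)
    *prompt_turns, last = payload["messages"]
    if last["role"] != "assistant":
        raise ValueError("expected the record to end with an assistant turn")
    payload["messages"] = prompt_turns
    return payload, last["content"]
```

The returned payload can be merged with your own model and token-limit settings before sending; the reference string is what you score the model's reply against.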

`chunks.jsonl`

Raw chunks for use as RAG retrieval data:

{
  "id": "core/modules#the-observer-pattern",
  "title": "The observer pattern",
  "url": "/core/modules/#the-observer-pattern",
  "section": "core/modules",
  "language": "en",
  "tokens": 412,
  "content": "Modules implementing `SplSubject` can broadcast events…"
}

This is exactly the file the Oracle uses internally.

How chunks are derived

The corpus builder walks every .md / .mdx file under src/content/docs/, parses it into an AST, and chunks it at H2/H3 boundaries. It enforces:

One chunk per H2 section, or per H3 subsection when the H2 has no body of its own.
A target of ~200–800 tokens per chunk: longer sections are split, shorter ones are merged with a neighbour.
Code fences are kept intact — never split mid-block.
Each chunk carries its source path, anchor URL, and language code.
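
The heading and code-fence rules above can be sketched as follows. This is an illustrative Python approximation, not the logic of scripts/build-corpus.mjs: it works line by line instead of on an AST, gives every non-empty H2 or H3 its own chunk, drops text before the first H2, and leaves out the token-based split/merge step. The name `chunk_markdown` is hypothetical.

```python
import re

HEADING = re.compile(r"^(#{2,3})\s+(.*)$")   # matches H2 and H3 only
FENCE_MARKERS = ("`" * 3, "~" * 3)           # opening/closing code-fence prefixes


def chunk_markdown(text: str) -> list[dict]:
    """Split Markdown at H2/H3 headings, never inside a fenced code block."""
    chunks: list[dict] = []
    title = None
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        # Emit the section collected so far, skipping headings with no body.
        content = "\n".join(body).strip()
        if title is not None and content:
            chunks.append({"title": title, "content": content})

    for line in text.splitlines():
        if line.lstrip().startswith(FENCE_MARKERS):
            in_fence = not in_fence
        # A '#' line inside a fence is code, not a heading.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Because an H2 with no text of its own produces no chunk, its H3 children carry the content, which approximates the "H3 if the H2 is empty" rule.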

The script is in scripts/build-corpus.mjs and is fully configurable.

Suggested LoRA recipe

A pragmatic baseline for an 8B-parameter base model on a single A100 / 4090, written as an Axolotl config:

base_model: meta-llama/Llama-3.1-8B-Instruct
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules: [q_proj, k_proj, v_proj, o_proj]

datasets:
  - path: docs/dist/corpus/chat.jsonl
    type: chat_template        # reads role/content messages; sharegpt expects from/value turns
    field_messages: messages

sequence_len: 4096
sample_packing: true
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_bnb_8bit
learning_rate: 0.0002
warmup_ratio: 0.05
bf16: true

Train, then merge the LoRA weights into the base model and serve the result via Ollama, vLLM, or text-generation-inference. Point the Oracle’s MODEL at your local endpoint and you have a fully Nibiru-native chat UX.
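
To get a feel for what the recipe's numbers imply, here is a small back-of-the-envelope calculation. The layer shapes are the published Llama-3.1-8B dimensions and should be treated as assumptions; re-check them against the model's config.json before relying on the result.

```python
# Assumed Llama-3.1-8B shapes (verify against the model's config.json).
HIDDEN = 4096        # hidden size
KV_DIM = 1024        # 8 key/value heads x 128 head dimension
LAYERS = 32
RANK = 16            # lora_r from the recipe


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A LoRA adapter adds two matrices: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)    # q_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # k_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # v_proj
    + lora_params(HIDDEN, HIDDEN, RANK)  # o_proj
)
trainable = per_layer * LAYERS           # adapter weights across all layers
scaling = 32 / RANK                      # lora_alpha / lora_r
sequences_per_step = 2 * 4               # micro_batch_size x gradient_accumulation_steps

print(trainable, scaling, sequences_per_step)  # 13631488 2.0 8
```

Under these assumptions the adapter trains roughly 13.6M parameters, and with sample packing at a sequence length of 4096 each optimizer step sees at most 8 x 4096 = 32,768 tokens per GPU.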

Re-run on every doc change

Wire it into your CI:

- name: Build corpus
  run: cd docs && npm run build:corpus
- name: Upload corpus artifact
  uses: actions/upload-artifact@v4
  with:
    name: nibiru-corpus
    path: docs/dist/corpus/

When docs change, the corpus is rebuilt, so consumers (training pipelines, RAG indexes) always get the freshest data.
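
A CI job could also sanity-check the generated files before uploading them. The sketch below is a hypothetical checker, not part of Nibiru: the required keys are inferred from the examples on this page, and `completion.jsonl` is skipped because its schema is not shown here.

```python
import json
from pathlib import Path

# Required keys per file, inferred from this page's examples (an assumption).
REQUIRED = {
    "instructions.jsonl": {"instruction", "input", "output"},
    "chat.jsonl": {"messages"},
    "chunks.jsonl": {"id", "title", "url", "section", "language", "tokens", "content"},
}


def check_lines(name: str, lines) -> list[str]:
    """Return human-readable problems found in one corpus file's lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"{name}:{number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"{name}:{number}: expected a JSON object")
            continue
        missing = REQUIRED[name] - set(record)
        if missing:
            problems.append(f"{name}:{number}: missing {sorted(missing)}")
    return problems


def main(corpus_dir: str = "docs/dist/corpus") -> int:
    """Check every known file; return a non-zero exit code on any problem."""
    problems = []
    for name in REQUIRED:
        with open(Path(corpus_dir) / name, encoding="utf-8") as fh:
            problems += check_lines(name, fh)
    print("\n".join(problems) or "corpus OK")
    return 1 if problems else 0
```

Calling `sys.exit(main())` from a small wrapper script, run between the build and upload steps, would fail the job on a malformed corpus.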

Languages

The corpus respects locale. Pages under en/ are tagged language: en, German pages are tagged language: de, and so on. Train monolingual or multilingual LoRAs by filtering the JSONL on the language field.
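
A minimal sketch of that filtering step. Only the `chunks.jsonl` example on this page shows a `language` field, so its presence on records in the other files is an assumption; `filter_language` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def filter_language(lines: Iterable[str], wanted: set[str]) -> Iterator[dict]:
    """Yield parsed records whose `language` field is one of `wanted`."""
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("language") in wanted:
            yield record


# Usage: keep only German chunks, or pass {"en", "de"} for a bilingual set.
# with open("docs/dist/corpus/chunks.jsonl", encoding="utf-8") as fh:
#     german = list(filter_language(fh, {"de"}))
```

Records without a `language` field are dropped, which keeps untagged data out of a monolingual run.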

Licence

The docs are licensed under the same BSD-4-Clause licence as the framework itself. The exported corpus inherits that licence: you’re free to fine-tune models on it, including for commercial use, as long as you keep the required attribution.