
# BLUX-cA Adapter Training Pipeline

This folder contains a reproducible adapter (LoRA/QLoRA) training pipeline for BLUX-cA using the external dataset repository. The dataset must live outside this repository; set `DATASET_DIR` to its absolute path (for example, `/workspace/blux-ca-dataset`).

## Prerequisites

- Python 3.10+
- Recommended: NVIDIA GPU with recent CUDA drivers
- Sufficient disk space/memory for the base model (default: `Qwen/Qwen2.5-7B-Instruct`)

## Environment setup

```bash
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r train/requirements.txt
```

## Dataset layout

The dataset directory should contain:

```text
prompts/system_core.txt
data/*.jsonl
eval/*.jsonl
```

Each training JSONL line must include a `messages` array containing `system`, `user`, and `assistant` roles. The `system` content must equal `<SYSTEM_PROMPT_FROM_BLUX_CA>`.
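For illustration, a hypothetical snippet that builds one well-formed training line (the user and assistant turns below are placeholders; the system content must be the verbatim BLUX-cA prompt):

```python
import json

record = {
    "messages": [
        {"role": "system", "content": "<SYSTEM_PROMPT_FROM_BLUX_CA>"},
        {"role": "user", "content": "placeholder user turn"},
        {"role": "assistant", "content": "placeholder assistant turn"},
    ]
}

# JSONL: exactly one JSON object per line
print(json.dumps(record))
```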

## Commands (copy/paste)

Set the dataset location once per shell:

```bash
export DATASET_DIR=/absolute/path/to/blux-ca-dataset
```

Validate the dataset strictly (this always invokes the dataset repo's validator first):

```bash
python train/validate_dataset.py --dataset-dir "$DATASET_DIR" --strict
```
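As a rough sketch of what strict per-line validation has to enforce (illustrative only; the real checks live in `train/validate_dataset.py` and the dataset repo's own validator):

```python
import json

REQUIRED_ROLES = {"system", "user", "assistant"}

def check_line(line: str, system_prompt: str) -> None:
    """Illustrative strict check for a single training JSONL line."""
    record = json.loads(line)        # must parse as JSON
    messages = record["messages"]    # must carry a messages array
    roles = {m["role"] for m in messages}
    if not REQUIRED_ROLES <= roles:
        raise ValueError(f"missing roles: {REQUIRED_ROLES - roles}")
    system_msg = next(m for m in messages if m["role"] == "system")
    if system_msg["content"] != system_prompt:
        raise ValueError("system content does not match the BLUX-cA prompt")
```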

Dry-run (loads the base model, prepares 5 samples, and tokenizes them). On CPU-only hosts the base model automatically falls back to `Qwen/Qwen2.5-1.5B-Instruct` unless you override `BASE_MODEL`:

```bash
python train/train_adapter.py --dataset-dir "$DATASET_DIR" --dry-run
```
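The fallback described above amounts to logic along these lines (a sketch assuming the script keys off CUDA availability; the names here are illustrative, not the script's actual internals):

```python
import os

import torch

DEFAULT_BASE = "Qwen/Qwen2.5-7B-Instruct"
CPU_FALLBACK = "Qwen/Qwen2.5-1.5B-Instruct"

def resolve_base_model() -> str:
    """Honor an explicit BASE_MODEL override, else fall back on CPU-only hosts."""
    override = os.environ.get("BASE_MODEL")
    if override:
        return override
    return DEFAULT_BASE if torch.cuda.is_available() else CPU_FALLBACK
```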

Smoke train (adapter, capped mix):

```bash
python train/train_adapter.py --dataset-dir "$DATASET_DIR" --max-samples 200 --run-name smoke
```

Full train:

```bash
python train/train_adapter.py --dataset-dir "$DATASET_DIR" --run-name full
```
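For orientation, LoRA training freezes the base model and learns small low-rank matrices on top of it, which is why only the adapter needs to be published. A minimal `peft` setup might look like this (the hyperparameters are assumptions for illustration; the pipeline's actual values are recorded in each run's `config_snapshot.yaml`):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
lora = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections (illustrative)
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()        # only the adapter weights train
```

Only these adapter matrices end up under `runs/<timestamp>/adapter/`, which keeps the published artifact small.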

Eval gate (strict). Use `--use-stub` when running without a trained adapter or when offline:

```bash
python train/run_eval.py --dataset-dir "$DATASET_DIR" --run runs/<timestamp_or_name> --strict --use-stub
```
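Conceptually, `--use-stub` swaps real model generation for a deterministic placeholder so the eval plumbing can run without weights or network access; a sketch of that pattern (not the harness's actual code):

```python
from typing import Callable

def pick_generator(use_stub: bool, model_fn: Callable[[str], str]) -> Callable[[str], str]:
    """Return the generation function the eval loop calls for each prompt."""
    if use_stub:
        # Deterministic and offline-safe: no adapter or downloads required.
        return lambda prompt: "[stub response]"
    return model_fn
```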

A GPU is recommended for smoke and full runs. On CPU-only environments the dry-run falls back to `Qwen/Qwen2.5-1.5B-Instruct` automatically, as noted above; set `BASE_MODEL` explicitly only if you want a different base model.

## Outputs

- Runs are created under `runs/YYYYMMDD_HHMMSS_<optional_name>/`
- Prepared dataset + resolved mix: `runs/<timestamp>/prepared_train.jsonl` and `runs/<timestamp>/mix_config_resolved.yaml`
- Training artifacts: `runs/<timestamp>/adapter/` plus `runs/<timestamp>/training_args.json` and `config_snapshot.yaml`
- Evaluation report: `runs/<timestamp>/eval_report.md`
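Because run names begin with a timestamp, they sort chronologically; a small helper like this (illustrative, not part of the pipeline) locates the newest run's eval report:

```python
from pathlib import Path

# runs/YYYYMMDD_HHMMSS_<optional_name>/ sorts lexicographically, i.e. chronologically
latest = sorted(p for p in Path("runs").iterdir() if p.is_dir())[-1]
print(latest / "eval_report.md")
```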

## Release checklist

- Dataset validated (`python train/validate_dataset.py --dataset-dir ... --strict`)
- Prepared dataset generated and referenced by the run folder
- Evaluation run passes in strict mode
- Adapter artifacts present under `runs/<timestamp>/adapter/`
- Model card/README updated before publishing the adapter (adapter-only, no base weights)

## Uploading adapter to Hugging Face Hub

```bash
cd runs/<timestamp>/adapter
git init && git lfs install
git lfs track "*.safetensors"
# add README/model card as needed, then commit and push to your Hub adapter repo
```

Upload only the adapter weights; do not upload base model weights.