# BLUX-cA Adapter Training Pipeline
This folder contains a reproducible adapter (LoRA/QLoRA) pipeline for BLUX-cA using the external dataset repository. The dataset must live outside this repository; set `DATASET_DIR` to its absolute path (for example `/workspace/blux-ca-dataset`).
## Prerequisites

- Python 3.10+
- Recommended: NVIDIA GPU with recent CUDA drivers
- Sufficient disk space and memory for the base model (default: `Qwen/Qwen2.5-7B-Instruct`)
## Environment setup

```bash
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r train/requirements.txt
```
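Before a smoke or full run, it can help to confirm that PyTorch sees the GPU. A quick check, assuming `torch` is pulled in by `train/requirements.txt`:

```bash
# prints True plus the CUDA version if a GPU is usable, False otherwise
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```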
## Dataset layout

The dataset directory should contain:

```text
prompts/system_core.txt
data/*.jsonl
eval/*.jsonl
```
Each training JSONL line must include a `messages` array containing `system`, `user`, and `assistant` roles. The system content must equal `<SYSTEM_PROMPT_FROM_BLUX_CA>`.
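For illustration, a single training line might look like the following. The `messages` structure follows the standard chat schema; the user and assistant content here is invented for the example, and `<SYSTEM_PROMPT_FROM_BLUX_CA>` stands in for the actual system prompt (presumably the text in `prompts/system_core.txt`):

```json
{"messages": [{"role": "system", "content": "<SYSTEM_PROMPT_FROM_BLUX_CA>"}, {"role": "user", "content": "How do I validate the dataset?"}, {"role": "assistant", "content": "Run train/validate_dataset.py with --strict against the dataset directory."}]}
```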
## Commands (copy/paste)

Set the dataset location once per shell:

```bash
export DATASET_DIR=/absolute/path/to/blux-ca-dataset
```
Validate the dataset strictly (this always invokes the dataset repo's own validator first):

```bash
python train/validate_dataset.py --dataset-dir "$DATASET_DIR" --strict
```
Dry run (loads the base model, prepares 5 samples, and tokenizes them). On CPU-only hosts the base model automatically falls back to `Qwen/Qwen2.5-1.5B-Instruct` unless you override `BASE_MODEL`:

```bash
python train/train_adapter.py --dataset-dir "$DATASET_DIR" --dry-run
```
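To pin the smaller model explicitly rather than rely on the automatic fallback, the `BASE_MODEL` override can be set inline for a single invocation:

```bash
# force the smaller base model (useful on CPU-only hosts)
BASE_MODEL=Qwen/Qwen2.5-1.5B-Instruct python train/train_adapter.py --dataset-dir "$DATASET_DIR" --dry-run
```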
Smoke train (adapter, capped mix):

```bash
python train/train_adapter.py --dataset-dir "$DATASET_DIR" --max-samples 200 --run-name smoke
```
Full train:

```bash
python train/train_adapter.py --dataset-dir "$DATASET_DIR" --run-name full
```
Eval gate (strict). Use `--use-stub` when running without a trained adapter or when offline:

```bash
python train/run_eval.py --dataset-dir "$DATASET_DIR" --run runs/<timestamp_or_name> --strict --use-stub
```
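In CI, the strict gate can be used to block a release. A minimal sketch, assuming `run_eval.py` exits non-zero when the strict gate fails (not verified here):

```bash
# abort the pipeline if the strict eval gate does not pass
python train/run_eval.py --dataset-dir "$DATASET_DIR" --run runs/<timestamp_or_name> --strict \
  || { echo "Eval gate failed; blocking release." >&2; exit 1; }
```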
GPU is recommended for smoke/full runs. On CPU-only environments, set `BASE_MODEL=Qwen/Qwen2.5-1.5B-Instruct` for the dry run to conserve memory.
## Outputs

- Runs are created under `runs/YYYYMMDD_HHMMSS_<optional_name>/`
- Prepared dataset + resolved mix: `runs/<timestamp>/prepared_train.jsonl` and `runs/<timestamp>/mix_config_resolved.yaml`
- Training artifacts: `runs/<timestamp>/adapter/` plus `runs/<timestamp>/training_args.json` and `config_snapshot.yaml`
- Evaluation report: `runs/<timestamp>/eval_report.md`
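Taken together, a completed run folder should look roughly like this:

```text
runs/<timestamp>/
├── prepared_train.jsonl
├── mix_config_resolved.yaml
├── training_args.json
├── config_snapshot.yaml
├── adapter/
└── eval_report.md
```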
## Release checklist

- Dataset validated (`python train/validate_dataset.py --dataset-dir ... --strict`)
- Prepared dataset generated and referenced by the run folder
- Evaluation run passes in strict mode
- Adapter artifacts present under `runs/<timestamp>/adapter/`
- Model card/README updated before publishing the adapter (adapter-only, no base weights)
## Uploading the adapter to the Hugging Face Hub

```bash
cd runs/<timestamp>/adapter
git init
git lfs install
git lfs track "*.safetensors"
# add README/model card as needed
```

Only upload the adapter weights; do not upload base model weights.
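Alternatively, the Hub CLI can push the directory without a manual git workflow. A minimal sketch, assuming a reasonably recent `huggingface_hub` is installed and `<user>/blux-ca-adapter` is a placeholder repo id you have already created:

```bash
# authenticate once (stores a token locally)
huggingface-cli login
# push only the adapter directory; base weights stay local
huggingface-cli upload <user>/blux-ca-adapter runs/<timestamp>/adapter .
```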