i3-1B - Hybrid Architecture Language Model
Model Description
i3-1B is a language model built on a hybrid architecture that combines recurrent/convolutional layers with full attention layers for efficient language modeling. It blends RWKV-style time-mixing with Mamba state-space dynamics in the early layers, followed by standard multi-head attention in the final layers.
Model Statistics
- Total Parameters: ~1.1B
- Architecture: 16 RWKV Layers + 2 Attention Layers = 18 Total Layers
- Hidden Dimension (d_model): 2,048
- Attention Heads: 16
- Max Sequence Length: 1,024
- Vocabulary Size: 32,000 tokens (BPE)
Architecture Breakdown
Layers 1-16: RWKV Hybrid Blocks (Recurrent/Conv)
├── RWKVMambaHybrid (Time-mixing + State-space)
└── Feed-Forward Network
Layers 17-18: Full Attention Blocks
├── Multi-Head Attention (16 heads)
└── Feed-Forward Network
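As a rough illustration of this layout, here is a minimal PyTorch sketch of the layer stack using the published dimensions (d_model = 2,048, 16 heads, 16 + 2 = 18 layers). The `RWKVMambaHybridStub` placeholder, the 4x FFN expansion, and the pre-norm residual wiring are assumptions for illustration only, not the model's actual implementation.

```python
import torch
import torch.nn as nn

D_MODEL, N_HEADS, N_LAYERS, N_ATTN = 2048, 16, 18, 2   # published model statistics

class FeedForward(nn.Module):
    """Position-wise FFN; the 4x expansion factor is an assumption."""
    def __init__(self, d_model: int, mult: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, mult * d_model),
            nn.GELU(),
            nn.Linear(mult * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class RWKVMambaHybridStub(nn.Module):
    """Placeholder for the real RWKV/Mamba time-mixing + state-space block."""
    def __init__(self, d_model: int):
        super().__init__()
        self.mix = nn.Linear(d_model, d_model)  # stands in for the recurrent mixer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mix(x)

class HybridBlock(nn.Module):
    """Layers 1-16: time-mixing/state-space block followed by an FFN."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.mixer = RWKVMambaHybridStub(d_model)
        self.ffn = FeedForward(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mixer(self.norm1(x))       # pre-norm residual (assumed)
        return x + self.ffn(self.norm2(x))

class AttentionBlock(nn.Module):
    """Layers 17-18: full multi-head self-attention followed by an FFN."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = FeedForward(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        return x + self.ffn(self.norm2(x))

# 16 hybrid blocks followed by 2 attention blocks, matching the breakdown above
layers = nn.ModuleList(
    [HybridBlock(D_MODEL) for _ in range(N_LAYERS - N_ATTN)]
    + [AttentionBlock(D_MODEL, N_HEADS) for _ in range(N_ATTN)]
)
```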
Training Details
Training Configuration
- Datasets:
  - HuggingFaceFW/fineweb
  - Salesforce/wikitext
- Training Steps: 120
- Batch Size: 1 (with 32 gradient accumulation steps)
- Learning Rate: 0.0002 (2e-4)
- Hardware: NVIDIA GeForce RTX 5060 Ti
- Training Time: ~5 hours 40 minutes
- Framework: PyTorch
- OS: Linux 5.15.0-157-generic x86_64 with glibc 2.39
- Python: CPython 3.12.11
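A batch size of 1 with 32 gradient-accumulation steps yields an effective batch of 32 sequences per optimizer update. The sketch below shows that accumulation pattern; the stand-in model, dummy data, and AdamW optimizer are assumptions, and only the 2e-4 learning rate, 32-step accumulation, 1,024-token sequences, and 32,000-token vocabulary come from the configuration above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, SEQ_LEN = 32_000, 1_024     # published vocabulary size and context length
GRAD_ACCUM_STEPS = 32              # effective batch = 1 sequence x 32 accumulation steps
LEARNING_RATE = 2e-4               # published starting learning rate

# Tiny stand-in model so the loop runs end to end; the real i3-1B stack replaces it.
model = nn.Sequential(nn.Embedding(VOCAB, 64), nn.Linear(64, VOCAB))
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)  # optimizer choice is assumed

def dummy_batches(n: int):
    """Yields (input, target) next-token pairs; the real run streams fineweb/wikitext."""
    for _ in range(n):
        tokens = torch.randint(0, VOCAB, (1, SEQ_LEN + 1))
        yield tokens[:, :-1], tokens[:, 1:]

model.train()
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(dummy_batches(GRAD_ACCUM_STEPS), start=1):
    logits = model(inputs)                                  # (1, seq_len, vocab)
    loss = F.cross_entropy(logits.view(-1, VOCAB), targets.reshape(-1))
    (loss / GRAD_ACCUM_STEPS).backward()                    # scale so gradients average over 32 micro-batches
    if step % GRAD_ACCUM_STEPS == 0:
        optimizer.step()                                    # one parameter update per 32 micro-batches
        optimizer.zero_grad()
```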
Performance Metrics
| Metric | Value |
|---|---|
| Final Training Loss | 2.044 |
| Final Learning Rate | 0.000121 |
| Final Perplexity | 7.72 |
| Training Speed | 206.34 tokens/sec |
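The reported perplexity is consistent with the final loss, since perplexity is the exponential of the mean cross-entropy loss:

```python
import math

final_loss = 2.044
print(round(math.exp(final_loss), 2))  # 7.72 — matches the reported final perplexity
```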
Comparison with Previous Models
| Feature | i3-22M | i3-80M | i3-200M | i3-1B (This Model) |
|---|---|---|---|---|
| Parameters | 22.6M | 82.77M | 169.85M | 1.1B |
| Architecture | 24 Hybrid Layers | 10 Hybrid + 6 Attention | 10 Hybrid + 6 Attention | 16 RWKV + 2 Attention |
| Hidden Dimension | 512 | 512 | 512 | 2,048 |
| Sequence Length | N/A | N/A | 256 | 1,024 |
| Final Loss | ~2.0 | ~2.0 | 1.6 | 2.044 |
| Final Perplexity | 7.29-9.70 | 7.29-10.0 | 5.2 | 7.72 |
| Training Time | ~17 hours | ~2-4 hours | ~1-2 hours | ~5.5 hours |
Technical Innovations
RWKV-Mamba Hybrid Recurrence: Combines RWKV's time-mixing with Mamba's state-space dynamics (see the sketch below)
- Linear complexity for long sequences
- Efficient recurrent processing
- State-space modeling for temporal dependencies
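To make the linear-complexity claim concrete, the toy recurrence below carries a fixed-size state forward one token at a time, so cost grows linearly with sequence length rather than quadratically as in full attention. The per-channel decay gating is purely illustrative and not the model's actual RWKV/Mamba formulation.

```python
import torch

def linear_time_mix(x: torch.Tensor, decay: torch.Tensor) -> torch.Tensor:
    """Toy recurrent scan over a (seq_len, d_model) sequence.

    Each step blends the current token into a running state with a per-channel
    decay, loosely in the spirit of RWKV time-mixing / Mamba state updates.
    Cost is O(seq_len * d_model): one fixed-size state update per token.
    """
    state = torch.zeros_like(x[0])
    outputs = []
    for t in range(x.size(0)):
        state = decay * state + (1 - decay) * x[t]   # fixed-size state carried forward
        outputs.append(state)
    return torch.stack(outputs)

# Example at the model's published width and context length
x = torch.randn(1024, 2048)
decay = torch.sigmoid(torch.randn(2048))             # per-channel decay in (0, 1)
y = linear_time_mix(x, decay)
print(y.shape)                                       # torch.Size([1024, 2048])
```

Full attention over the same 1,024 tokens would instead compare every token pair, which is the quadratic cost the recurrent layers avoid.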
Hierarchical Processing:
- Early RWKV layers perform efficient sequential processing
- Final attention layers capture global dependencies
Extended Context:
- 1,024 token context window (4x larger than i3-200M)
- Better handling of long-form text
Limitations
- Trained on English text only
- Limited to 1,024 token context window
- May require fine-tuning for specific downstream tasks
Model Series
- i3-22M - Original model with pure hybrid architecture
- i3-80M - Scaled version with attention layers and multi-dataset training
- i3-200M - Improved version with better perplexity
- i3-1B (This model) - Largest model with extended context and capacity
Citation
@article{mamba,
  title={Mamba: Linear-Time Sequence Modeling with Selective State Spaces},
  author={Gu, Albert and Dao, Tri},
  journal={arXiv preprint arXiv:2312.00752},
  year={2023}
}

@article{RWKV,
  title={RWKV: Reinventing RNNs for the Transformer Era},
  author={Peng, Bo and others},
  journal={arXiv preprint arXiv:2305.13048},
  year={2023}
}