Why I Took on This Challenge
I was impressed by how fine‑tuned large language models can outperform retrieval‑augmented systems, especially at inference time, since there is no retrieval step slowing down each query. So I set out to fine‑tune an open‑source model, Meta's Llama 3.1 (8B parameters). But most sources said you needed giant, expensive GPUs and huge amounts of storage, resources I simply didn't have. Determined to find another way, I tried it on a budget and discovered you don't need a fancy setup to customize a state‑of‑the‑art model.
The Hurdles: Time, Memory, and Disk
Fine‑tuning an 8‑billion‑parameter model throws up three big roadblocks:
Time – shared GPU jobs on the cluster are capped at 4 hours, so training has to finish within that window or resume from checkpoints.
Memory – an 8B model's weights, activations, and optimizer state can easily overflow a single GPU's VRAM.
Disk – model weights, caches, and checkpoints quickly eat through a home‑directory quota.
My Secret Weapon: LoRA (Low‑Rank Adaptation)
Instead of retraining all 8 billion weights, LoRA freezes them and trains small low‑rank "adapter" matrices alongside the originals. Think of it as fine‑tuning a car's suspension rather than redesigning the entire engine.
Parameter Comparison
Full weight matrix: 4,096 × 4,096 = 16,777,216 parameters (≈ 16.8 million)
LoRA (rank 8): 4,096 × 8 + 8 × 4,096 = 65,536 parameters
That's 256× fewer parameters!
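In code, attaching rank‑8 adapters takes only a few lines with the PEFT library. This is a minimal sketch rather than my exact configuration; the model ID, target modules, and hyperparameters below are illustrative defaults:

```python
# Minimal sketch: wrap a Llama 3.1 8B base model with rank-8 LoRA adapters.
# Model ID, target modules, and hyperparameters are illustrative, not my exact config.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=8,                                  # adapter rank (the "8" in the math above)
    lora_alpha=16,                        # scaling applied to the adapter output
    target_modules=["q_proj", "v_proj"],  # which projection matrices get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts
```

Only the adapter matrices get trained; the 8 billion base weights stay frozen, which is why the saved adapters later in this post come out to tens of megabytes instead of multiple gigabytes.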
My Setup on Northeastern's Cluster
I tapped into the Discovery cluster, which gave me access to NVIDIA H200 (141 GB), A100 (40 GB), and V100 (32 GB) GPUs.
Jobs are managed by SLURM, with a hard 4‑hour limit. To avoid eating disk quota, I redirected all caches to scratch space:
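In practice that just means pointing the Hugging Face cache environment variables at scratch before anything gets downloaded. A sketch, with a placeholder path instead of my real one:

```python
# Point the Hugging Face caches at scratch space before the first transformers import,
# so model downloads and datasets don't count against the home-directory quota.
# The path below is a placeholder, not the actual Discovery scratch path.
import os

SCRATCH = "/scratch/<username>"
os.environ["HF_HOME"] = f"{SCRATCH}/hf_home"                # hub downloads, tokens, metadata
os.environ["HF_DATASETS_CACHE"] = f"{SCRATCH}/hf_datasets"  # processed dataset cache

# Any transformers / datasets import after this point reads and writes under SCRATCH.
```

The same variables can just as well be exported in the SLURM submission script before launching Python.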
Organizing the Project
I called it LlamaVox, with a simple structure:
My Tech Stack:
Base model: Meta's Llama 3.1 (8B parameters)
Fine‑tuning: LoRA adapters, with the base model loaded in 8‑bit
Libraries: Hugging Face Transformers and huggingface_hub
Scheduling: SLURM jobs on the Discovery cluster
Prepping the Data
I built three datasets:
Each example looked like this:
Running the Training Jobs
Here's a snippet of my SLURM script for an H200 GPU:
Mini Dataset Results on H200
Start: July 7, 2025 8:06 PM EDT
Runtime: 18 minutes
Loss drop: 0.1164 → 0.0229
Token accuracy: 98.98%
GPU Performance Comparison
| GPU | VRAM | Mini (5K examples) | Medium (50K examples) | Adapter Size |
|---|---|---|---|---|
| H200 | 141 GB | 18 min | ~2.5 hours | 81–161 MB |
| A100 | 40 GB | 25 min | ~3 hours | 81–161 MB |
| V100 | 32 GB | 40 min | Failed | 81 MB only |
Performance Optimization Tricks
Gradient Accumulation – Accumulate gradients over several small batches before each optimizer step, simulating a bigger batch without the memory cost.
Mixed‑Precision Training – Run the forward and backward passes in 16‑bit floats to roughly halve activation memory.
8‑bit Quantization – Load the frozen base model in 8‑bit, freeing up VRAM for the adapters and optimizer state.
Checkpointing – Save a checkpoint every 10 minutes or so, so a job that hits the 4‑hour limit can resume where it left off. (All four tricks come together in the sketch below.)
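Here's how those four tricks fit together in a Hugging Face training setup. This is a sketch under typical assumptions; the model ID, batch sizes, and step counts are illustrative, not my exact settings:

```python
# Sketch of the four tricks combined, using Hugging Face Transformers + bitsandbytes.
# Model ID, batch sizes, and step counts are illustrative, not my exact settings.
from transformers import (AutoModelForCausalLM, BitsAndBytesConfig,
                          Trainer, TrainingArguments)

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# 8-bit quantization: load the frozen base model in int8 to free up VRAM.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

args = TrainingArguments(
    output_dir="checkpoints",
    per_device_train_batch_size=1,    # small real batch that fits in VRAM
    gradient_accumulation_steps=16,   # gradient accumulation: effective batch of 16
    bf16=True,                        # mixed precision (A100/H200; use fp16=True on V100)
    save_strategy="steps",
    save_steps=200,                   # frequent checkpoints so a killed job can resume
    num_train_epochs=1,
    logging_steps=10,
)

# In practice the LoRA adapters from the earlier sketch are attached first, since
# the quantized base weights stay frozen. With a tokenized train_dataset in hand:
#   trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
#   trainer.train(resume_from_checkpoint=True)  # continues from the latest checkpoint
```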
Troubleshooting: Issues & Fixes
| Issue | Status | Fix Summary |
|---|---|---|
| Flash Attention Compatibility | ✅ Fixed | Uninstalled flash_attn; set attn_implementation="eager" (see sketch below) |
| Disk Quota Exceeded | ✅ Fixed | Redirected the Hugging Face cache to scratch space |
| Hugging Face Authentication Errors | ✅ Fixed | Added an explicit huggingface_hub.login() call and token management |
| Environment Setup Complexity | ✅ Fixed | Created run_model.sh for automated setup and GPU checks |
| Out-of-Memory (OOM) Errors | 📋 Ongoing | 8‑bit quantization, smaller batches, requesting more VRAM, monitoring usage |
| Model Access & Permissions | ✅ Fixed | Verified permissions; added access checks before download attempts |