How I Customized Llama 3.1 8B on a Budget

Democratizing AI: Fine‑tuning Large Language Models with Limited Resources


Why I Took on This Challenge

I was impressed by how fine‑tuned large language models can outperform retrieval‑augmented systems, especially at inference time, since they don't have to fetch and stitch in external context for every query. So I set out to fine‑tune an open‑source model: Meta's Llama 3.1 (8B parameters). But most sources insisted you needed giant, expensive GPUs and tons of storage, resources I simply didn't have. Determined to find another way, I tried it on a budget, and discovered you don't need a fancy setup to customize a state‑of‑the‑art model.

The Hurdles: Time, Memory, and Disk

Fine‑tuning an 8 billion‑parameter model throws up three big roadblocks:

- 16 GB of GPU RAM required
- A 4-hour GPU time limit per job
- A limited disk storage quota

My Secret Weapon: LoRA (Low‑Rank Adaptation)

Instead of retraining 8 billion weights, LoRA tacks on small "adapter" matrices. Think of it as fine‑tuning a car's suspension, rather than redesigning the entire engine.


Parameter Comparison

Full weight matrix: 4,096 × 4,096 = 16.7 million parameters

LoRA (rank 8): 4,096 × 8 + 8 × 4,096 = 65,536 parameters

That's 256× fewer parameters!

The resulting adapter files are also tiny:

- Mini (rank 8): 81 MB
- Medium (rank 16): 161 MB
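
To make this concrete, here's a minimal sketch of how an adapter like this gets attached with the PEFT library. The model id, alpha value, and target modules are illustrative assumptions, not my exact settings (those live in config/):

# Minimal LoRA sketch with PEFT; hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",       # assumed model id
    torch_dtype=torch.bfloat16,
)

lora = LoraConfig(
    r=8,                                      # rank 8: the "mini" adapter
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],      # attention projections (assumption)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()            # only the adapter weights are trainable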

My Setup on Northeastern's Cluster

I tapped into the Discovery cluster, which has:

- H200 GPUs: 141 GB VRAM
- A100 GPUs: 40 GB VRAM
- V100 GPUs: 32 GB VRAM

Jobs are managed by SLURM, with a hard 4‑hour limit. To avoid eating disk quota, I redirected all caches to scratch space:

export CACHE_DIR="${SCRATCH}/llama3_finetune/cache/huggingface"
export HF_HOME="$CACHE_DIR"
export TRANSFORMERS_CACHE="$CACHE_DIR"

Organizing the Project

I called it LlamaVox, with a simple structure:

LlamaVox/
├── config/   # LoRA settings
├── data/     # Training files
├── models/   # Saved adapters
├── slurm/    # Job scripts
└── src/      # Training code

My Tech Stack:

- PyTorch 2.6.0
- Transformers 4.53.0
- PEFT 0.16.0
- TRL 0.19.0
- Accelerate
- bitsandbytes
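
A quick way to confirm a cluster node actually has these versions is a few lines of Python (just a convenience check, not part of the LlamaVox code):

# Convenience check: print installed versions of the stack above.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["torch", "transformers", "peft", "trl", "accelerate", "bitsandbytes"]:
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")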

Prepping the Data

I built three datasets:

- Mini: 5K examples (2.2 MB)
- Medium: 50K examples
- Synthetic: 1K examples

Each example looked like this:

{ "conversations": [ {"role": "user", "content": "What are the main challenges of urban planning?"}, {"role": "assistant", "content": "Urban planning faces several key challenges..."} ] }

Running the Training Jobs

Here's a snippet of my SLURM script for an H200 GPU:

#!/bin/bash
#SBATCH --job-name=llama3_h200
#SBATCH --time=4:00:00
#SBATCH --gres=gpu:h200:1
#SBATCH --mem=96G
#SBATCH --cpus-per-task=16

python src/train.py \
    --model llama-3.1 \
    --dataset data/mini.json \
    --lora_rank 8 \
    --output_dir models/mini_h200
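
The SLURM script just forwards arguments to src/train.py. I won't reproduce the full script here, but the core of such a script, built on TRL's SFTTrainer, looks roughly like this; argument parsing and logging are omitted, and the model id and hyperparameters are illustrative assumptions:

# Rough sketch of the core of src/train.py (argument parsing omitted).
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

train_dataset = load_dataset("json", data_files="data/mini.json", split="train")
# TRL expects conversational data in a "messages" column, so rename ours.
train_dataset = train_dataset.rename_column("conversations", "messages")

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",                # assumed model id
    train_dataset=train_dataset,
    peft_config=LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"),
    args=SFTConfig(
        output_dir="models/mini_h200",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        bf16=True,
        num_train_epochs=1,
        logging_steps=10,
    ),
)
trainer.train()
trainer.save_model("models/mini_h200")                       # writes only the adapter weights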

Mini Dataset Results on H200

Start: July 7, 2025 8:06 PM EDT

Runtime: 18 minutes

Loss drop: 0.1164 → 0.0229

Token accuracy: 98.98%

GPU Performance Comparison

GPU  | VRAM   | Mini (5K) runtime | Medium (50K) runtime | Adapter size
H200 | 141 GB | 18 min            | ~2.5 hours           | 81–161 MB
A100 | 40 GB  | 25 min            | ~3 hours             | 81–161 MB
V100 | 32 GB  | 40 min            | Failed               | 81 MB only
- H200: blistering speed, but scarce
- A100: great balance of power and availability
- V100: OK for tiny jobs, but hits the 4-hour wall

Performance Optimization Tricks


Gradient Accumulation – Fake bigger batches with less memory.

Mixed‑Precision Training – Use 16‑bit floats to halve memory use.

8‑bit Quantization – Load the base model in 8‑bit, freeing up VRAM.

Checkpointing – Save every 10 minutes so you can resume if you hit the time limit.
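
Here's roughly how those tricks translate into code. The 8-bit load goes through bitsandbytes via Transformers; the batch sizes, save interval, and other numbers are illustrative assumptions, not my exact settings:

# Sketch of the four tricks in one place (values are illustrative).
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

# 8-bit quantization: load the frozen base model in 8-bit to free up VRAM.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",              # assumed model id
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

args = TrainingArguments(
    output_dir="models/mini_h200",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # gradient accumulation: 2 x 8 = effective batch of 16
    bf16=True,                       # mixed precision: 16-bit activations and gradients
    save_steps=200,                  # checkpointing: save often enough to survive the 4-hour wall
    save_total_limit=2,              # keep old checkpoints from eating the disk quota
)

# Model and args then go to the trainer shown earlier. If a job is killed at the
# time limit, the next one resumes from the last checkpoint:
# trainer.train(resume_from_checkpoint=True)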

Troubleshooting: Issues & Fixes

Issue                              | Status     | Fix Summary
Flash Attention compatibility      | ✅ Fixed   | Uninstalled flash_attn; set attn_implementation="eager"
Disk quota exceeded                | ✅ Fixed   | Redirected the Hugging Face cache to scratch space
Hugging Face authentication errors | ✅ Fixed   | Added explicit huggingface_hub.login() and token management
Environment setup complexity       | ✅ Fixed   | Created run_model.sh for automated setup and GPU checks
Out-of-memory (OOM) errors         | 📋 Ongoing | 8-bit quantization, smaller batches, request more VRAM, monitor
Model access & permissions         | ✅ Fixed   | Verified permissions; added access checks before download attempts

Quick Start Guide

# Grab a GPU node
srun --gres=gpu:1 --mem=32G --time=1:00:00 --pty bash

# Run the launcher
./llama3_finetune/run_model.sh

Final Thoughts & How You Can Do It Too

You don't need a server farm to play with cutting‑edge AI. With techniques like LoRA, mixed precision, smart scheduling, and a bit of troubleshooting know‑how, you can personalize models like Llama 3.1 on a shoestring budget.

I've open‑sourced LlamaVox so anyone can dive in and start fine‑tuning right away.

Get Started with LlamaVox
The project is open source, ready to use, and budget-friendly. Questions? Open an issue. Contributions are welcome, and full documentation is included.