Why I Took on This Challenge
I was impressed by how fine‑tuned large language models can outperform retrieval‑augmented systems, especially at inference time, since there is no retrieval step slowing down each query. So I set out to fine‑tune an open‑source model, Meta's Llama 3.1 (8B parameters). But most sources said you needed giant, expensive GPUs and huge amounts of storage, resources I simply didn't have. Determined to find another way, I tried it on a budget and discovered you don't need a fancy setup to customize a state‑of‑the‑art model.
The Hurdles: Time, Memory, and Disk
Fine‑tuning an 8‑billion‑parameter model throws up three big roadblocks:
Time – shared GPU jobs on the cluster are capped at 4 hours, so training has to finish within that window or resume from checkpoints.
Memory – an 8B model's weights, activations, and optimizer state can easily overflow a single GPU's VRAM.
Disk – model weights, caches, and checkpoints quickly eat through a home‑directory quota.
My Secret Weapon: LoRA (Low‑Rank Adaptation)
Instead of retraining all 8 billion weights, LoRA freezes them and trains small low‑rank "adapter" matrices alongside the originals. Think of it as fine‑tuning a car's suspension rather than redesigning the entire engine.
Parameter Comparison
Full weight matrix: 4,096 × 4,096 = 16,777,216 parameters (≈ 16.8 million)
LoRA (rank 8): 4,096 × 8 + 8 × 4,096 = 65,536 parameters
That's 256× fewer parameters!
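In code, attaching rank‑8 adapters takes only a few lines with the PEFT library. This is a minimal sketch rather than my exact configuration; the model ID, target modules, and hyperparameters below are illustrative defaults:

```python
# Minimal sketch: wrap a Llama 3.1 8B base model with rank-8 LoRA adapters.
# Model ID, target modules, and hyperparameters are illustrative, not my exact config.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=8,                                  # adapter rank (the "8" in the math above)
    lora_alpha=16,                        # scaling applied to the adapter output
    target_modules=["q_proj", "v_proj"],  # which projection matrices get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts
```

Only the adapter matrices get trained; the 8 billion base weights stay frozen, which is why the saved adapters later in this post come out to tens of megabytes instead of multiple gigabytes.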
My Setup on Northeastern's Cluster
I tapped into the Discovery cluster, which gave me access to NVIDIA H200 (141 GB), A100 (40 GB), and V100 (32 GB) GPUs.
Jobs are managed by SLURM, with a hard 4‑hour limit. To avoid eating disk quota, I redirected all caches to scratch space:
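In practice that just means pointing the Hugging Face cache environment variables at scratch before anything gets downloaded. A sketch, with a placeholder path instead of my real one:

```python
# Point the Hugging Face caches at scratch space before the first transformers import,
# so model downloads and datasets don't count against the home-directory quota.
# The path below is a placeholder, not the actual Discovery scratch path.
import os

SCRATCH = "/scratch/<username>"
os.environ["HF_HOME"] = f"{SCRATCH}/hf_home"                # hub downloads, tokens, metadata
os.environ["HF_DATASETS_CACHE"] = f"{SCRATCH}/hf_datasets"  # processed dataset cache

# Any transformers / datasets import after this point reads and writes under SCRATCH.
```

The same variables can just as well be exported in the SLURM submission script before launching Python.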
Organizing the Project
I called it LlamaVox, with a simple structure:
My Tech Stack:
Base model: Meta's Llama 3.1 (8B parameters)
Fine‑tuning: LoRA adapters, with the base model loaded in 8‑bit
Libraries: Hugging Face Transformers and huggingface_hub
Scheduling: SLURM jobs on the Discovery cluster
Prepping the Data
I built three datasets:
Each example looked like this:
Running the Training Jobs
Here's a snippet of my SLURM script for an H200 GPU:
Mini Dataset Results on H200
Start: July 7, 2025 8:06 PM EDT
Runtime: 18 minutes
Loss drop: 0.1164 → 0.0229
Token accuracy: 98.98%
GPU Performance Comparison
| GPU | VRAM | Mini (5K examples) | Medium (50K examples) | Adapter Size |
|---|---|---|---|---|
| H200 | 141 GB | 18 min | ~2.5 hours | 81–161 MB |
| A100 | 40 GB | 25 min | ~3 hours | 81–161 MB |
| V100 | 32 GB | 40 min | Failed | 81 MB only |
Performance Optimization Tricks
Gradient Accumulation – Accumulate gradients over several small batches before each optimizer step, simulating a bigger batch without the memory cost.
Mixed‑Precision Training – Run the forward and backward passes in 16‑bit floats to roughly halve activation memory.
8‑bit Quantization – Load the frozen base model in 8‑bit, freeing up VRAM for the adapters and optimizer state.
Checkpointing – Save a checkpoint every 10 minutes or so, so a job that hits the 4‑hour limit can resume where it left off. (All four tricks come together in the sketch below.)
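Here's how those four tricks fit together in a Hugging Face training setup. This is a sketch under typical assumptions; the model ID, batch sizes, and step counts are illustrative, not my exact settings:

```python
# Sketch of the four tricks combined, using Hugging Face Transformers + bitsandbytes.
# Model ID, batch sizes, and step counts are illustrative, not my exact settings.
from transformers import (AutoModelForCausalLM, BitsAndBytesConfig,
                          Trainer, TrainingArguments)

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# 8-bit quantization: load the frozen base model in int8 to free up VRAM.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

args = TrainingArguments(
    output_dir="checkpoints",
    per_device_train_batch_size=1,    # small real batch that fits in VRAM
    gradient_accumulation_steps=16,   # gradient accumulation: effective batch of 16
    bf16=True,                        # mixed precision (A100/H200; use fp16=True on V100)
    save_strategy="steps",
    save_steps=200,                   # frequent checkpoints so a killed job can resume
    num_train_epochs=1,
    logging_steps=10,
)

# In practice the LoRA adapters from the earlier sketch are attached first, since
# the quantized base weights stay frozen. With a tokenized train_dataset in hand:
#   trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
#   trainer.train(resume_from_checkpoint=True)  # continues from the latest checkpoint
```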
Troubleshooting: Issues & Fixes
| Issue | Status | Fix Summary |
|---|---|---|
| Flash Attention Compatibility | ✅ Fixed | Uninstalled flash_attn; set attn_implementation="eager" (see sketch below) |
| Disk Quota Exceeded | ✅ Fixed | Redirected the Hugging Face cache to scratch space |
| Hugging Face Authentication Errors | ✅ Fixed | Added an explicit huggingface_hub.login() call and token management |
| Environment Setup Complexity | ✅ Fixed | Created run_model.sh for automated setup and GPU checks |
| Out-of-Memory (OOM) Errors | 📋 Ongoing | 8‑bit quantization, smaller batches, requesting more VRAM, monitoring usage |
| Model Access & Permissions | ✅ Fixed | Verified permissions; added access checks before download attempts |