Hey! If you're interested in trying out finetuning Llama-3 8B (Meta's new 15-trillion-token!! model), I made a Colab that finetunes Llama-3 2x faster, uses 60% less VRAM, and supports 4x longer contexts than HF + FA2.
Also uploaded Llama-3 70b pre-quantized to 4bit so you can download it 4x faster: unsloth/llama-3-70b-bnb-4bit
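For reference, loading either model looks roughly like this (a minimal sketch based on the usual Unsloth notebook setup; the max_seq_length and dtype values here are just example settings, not requirements):

```python
from unsloth import FastLanguageModel

# Load the pre-quantized 4bit model straight from the Hub -- no need to
# quantize it yourself, which is what makes the download ~4x faster.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-70b-bnb-4bit",  # or "unsloth/llama-3-8b-bnb-4bit"
    max_seq_length = 2048,  # example value; raise this for longer contexts
    dtype = None,           # auto-detects float16 / bfloat16
    load_in_4bit = True,
)
```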
No catch at all!! There are 0 approximations, so everything is exact! We just have a custom backprop engine: we rewrite everything in OpenAI's Triton language and do all the differentiation and maths ourselves :)
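To give a flavour of what Triton code looks like (a purely illustrative toy kernel, not one of our actual kernels!): you write Python that gets JIT-compiled to GPU code, with blocking, masking and memory access under your control:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide chunk of the tensor.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements           # guard against out-of-bounds loads
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * 2.0, mask=mask)

x = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
scale_kernel[grid](x, out, x.numel(), BLOCK_SIZE=1024)
```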
With our latest long context update, Unsloth can also fit 4x longer context windows than HF + Flash Attention 2 while using 30% less VRAM, at the cost of a slight +1.9% overhead.
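The long context path is switched on when you add the LoRA adapters (again a sketch; the LoRA hyperparameters below are just the common notebook defaults, not prescriptions):

```python
# Attach LoRA adapters; use_gradient_checkpointing="unsloth" enables
# the long-context / reduced-VRAM mode from the update.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # example LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)
```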