It’s possible I misinterpreted a bit the gist of the article - in my mind nobody is doing fine-tuning these days without using techniques like LoRA or DoRA. But they are using these techniques because they are computationally efficient and convenient, and not because they perform significantly better than full fine-tuning.
Clickbait headline. "Fine-tuning LLMs for knowledge injection is a waste of time" is true, but IDK who's trying to do that. Fine-tuning is great for changing model behavior (i.e. the zillions of uncensored models on Hugging Face are much more willing to respond to... dodgy... prompts than any amount of RAG is gonna get you), and RAG is great for knowledge injection.
Also... "LoRA" as a replacement for finetuning??? LoRA is a kind of finetuning! In the research community it's actually referred to as "parameter efficient finetuning." You're changing a smaller number of weights, but you're still changing them.
They provide no references other than self-referencing blogs. It was also suspenseful to read about loss in changing neural network weights when there was 0 mention of quantization. Unfortunately, most of the content in this one was taken from his own previous work.
RAG is getting some backlash and this reads as a backlash of the backlash. I hope things settle down soon but many techfluencers put all their eggs in RAG and used it to gatekeep AI.
It was the best option at one point. They're still a great option if you want an override (e.g. categorization or dialects), but they're not precise.
Changes that happened:
1. LLMs got a lot cheaper but fine tuning didn't. Fine tuning was a way to cut down on prompts and make them 0 shot (not require examples)
2. Context windows became bigger. Fine tuning was great when the model was only expected to respond with a sentence.
3. The two things above made RAG viable.
4. Training got better on released models, to the point where 0 shots worked fine. Fine tuning ends up overriding these things that were scoring nearly full points on benchmarks.
While the author makes some good points (along with some non-factual assertions), I wonder why he decided to have this counter-productive and factually wrong clickbait title.
Fine-tuning (and LoRA IS fine-tuning) may not be cost-effective for most organizations for knowledge updates, but it excels in driving behavior in task specific ways, for alignment, for enforcing structured output (usually way more accurately than prompting), tool and function use, and depending on the type of knowledge, if it is highly specific, niche, long tail type of knowledge, it can even make smaller models beat bigger models, like the case with MedGemma.
There is no real difference between fine-tuning with and without a lora. If you give me a model with a lora adapter, I can give you an updated model without the extra lora params that is functionally identical.
Fitting a lora changes potentially useful information the same way that fine-tuning the whole model does. It's just that the lora restricts the expressiveness of the weight update so that it is compactly encoded.
To be fair there are lots of Facebook, Instagram, and Youtube cargo cultists telling people to fine-tune on their documents for some reason. This got to be so common in 2024 that I think it was part of the pressure behind Gigabyte branding their hardware around it.
Yeah, as soon as I read that I felt like the author was living in a very different context from mine. It's never even occurred to me that fine-tuning could be an effective method for injecting new knowledge.
If anything, I expect fine-tuning to destroy knowledge (and reasoning), which hopefully (if you did your fine-tuning right) is not relevant to the particular context you are fine-tuning for.
3) "Complex domain-specific tasks that require advanced reasoning", "Medical diagnosis based on history and diagnostic guidelines", "Determining relevant passages from legal case law"
4) "The general idea of fine-tuning is much like training a human in a particular subject, where you come up with the curriculum, then teach and test until the student excels."
Don't all these effectively inject new knowledge? It may happen through simultaneous destruction of some existing knowledge but that isn't obvious to non-technical people.
OpenAI's analogy of training a human in a particular subject until they excel even arguably excludes the possibility of destruction because we don't generally destroy existing knowledge in our minds to learn new things (but some of us may forget the older knowledge over time).
I'm a dev with hand-waving level of proficiency. I have fine-tuned self-hosted small LLMs using PyTorch. My perception of fine-tuning is that it fundamentally adds new knowledge. To what extent that involves destruction of existing knowledge has remained a bit vague.
My hand-waving solution if anyone pointed out that problem would be to 1) say that my fine-tuning data will include some of the foundational knowledge of the target subject to compensate for its destruction and 2) use a gold standard set of responses to verify the model after fine-tuning.
I for one found the article quite valuable for pointing out the problem and suggesting better approaches.
I think it is a very common misconception (by consumers or businesses trying to use LLMs) that fine tuning can be used to inject new knowledge. I'm not sure many of the fine-tuning platforms do much to disabuse people of this notion.
> Instead, use modular methods like retrieval-augmented generation, adapters, or prompt-engineering — these techniques inject new information without damaging the underlying model’s carefully built ecosystem.
So obviously this is what most of us are already doing, I would venture. But there's a pretty big "missing middle" here. RAG/better prompts serve to provide LLMs with the context they need for a specific task, but are heavily limited by context windows. I know they've been growing quite a bit, but from my usage it still seems that things further back in the window get forgotten about pretty regularly.
Fine tuning was always the pitch for the solution to that. By baking the "context" you need directly into the LLM. Very few people or companies are actually doing this though, because it's expensive and you end up with an outdated model by the time you're done...if you even have the data you need to do it in the first place.
So where we're left is basically without options for systems that need more proprietary knowledge than we can reasonably fit into the context window.
I wonder if there's anyone out there attempting to do some sort of "context compression". An intermediary step that takes our natural language RAG/prompts/context and compresses it into a data format that the LLM can understand (vectors of some sort?) but are a fraction of the tokens that the natural language version would take.
Edit: After I wrote this I fed it into ChatGPT and asked if there were techniques I was missing. It introduced me to LoRA (which I suppose is the "adapters" mentioned in the OP), and now I have a whole new rabbit hole to climb down. AI is pretty cool sometimes.
I see this and immediately relived the last two years of the journey. I think some of the mental model that helped me might help the community too.
What people expect from finetuning is knowledge addition. You want to keep the styling[1] of the original model and just add new knowledge points that would help your task. In-context learning is one example of how this works well. Though even here, if the context is out of distribution, a model does not "understand" it and will produce guesswork.
When it comes to LoRA or PEFT or adapters, it's about style transfer. If you focus on a specific style of content you will see the gains, but the model won't learn new knowledge that wasn't already in the original training data, and it will forget previously learnt styles depending on context. When you do full finetuning (or SFT with no frozen parameters), it alters all the parameters and results in a gain of new knowledge at the cost of previous knowledge (and will give you some gibberish if you ask about topics outside the domain). This is called catastrophic forgetting. Hence, yes, full finetuning works - it is just an imperfect solution like all the others. Recently, with reinforcement learning, there has been talk of continual learning, which is where Richard Sutton's latest paper also lands, but that's at the research level.
Having said all that, if you start with the wrong mental model for Finetuning, you would be disappointed with the results.
The problem to solve is adding new knowledge while preserving the original pretrained intelligence. It's still a work in progress, but we published a paper last year on one way it could be done. Here is the link: https://arxiv.org/abs/2409.17171 (it also has experimental results for all the different approaches).
[1]: Styling here means the style learned by the model in SFT. Eg: Bullets, lists, bolding out different headings etc. all of that makes the content readable. The understanding of how to present the answer to a specific question.
I love how people say things like this with complete disregard for research.
Most LLM research involves fine tuning models, and we do amazing things with it. R1 is a fine tune, but I guess that’s bad?
Our company adds knowledge with fine tuning all the time. It’s usually a matter of skill not some fundamental limit. You need to either use LoRA or use a large batch size and mix the previous training data in.
All we are doing is forcing deep representations. This isn’t a binary “fine tuning good/bad” it’s a spectrum of how deep and robust you make the representations
I think of it as trying to encourage the LLM to want to give answers from a particular part of the phase space. You can do it by fine tuning it to be more likely to return values from there, or you can prompt it to get into that part of the phase space. Either works, but fiddling around with prompts doesn't require all that much MLops or compute power.
That said, fine tuning small models because you have to power through vast amounts of data where a larger model might be cost ineffective -- that's completely sensible, and not really mentioned in the article.
> That said, fine tuning small models because you have to power through vast amounts of data where a larger model might be cost ineffective -- that's completely sensible, and not really mentioned in the article.
...which I thought was arguably the most popular use case for fine tuning these days.
My understanding of model distillation is quite different in that it trains another (typically smaller) model using the error between the new model’s output and that of the existing - effectively capturing the existing model’s embedded knowledge and encoding it (ideally more densely) into the new.
What I was referring to is similar in concept, but I've seen both described in papers as distillation. What I meant was that you take the output of a large model like GPT-4 and use it as training data to fine-tune a smaller model.
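To make the logit-matching flavor from the comment above concrete, here is a minimal PyTorch sketch of a distillation loss (the temperature value, the model/step names, and the HF-style ".logits" access are illustrative assumptions, not something from this thread):

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Soften both distributions, then push the student toward the teacher with KL divergence.
        teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
        student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
        # Scale by T^2 so gradient magnitudes stay comparable to a hard-label loss.
        return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

    def distill_step(student, teacher, batch, optimizer):
        # "teacher" is the large frozen model, "student" the small one being trained (hypothetical names).
        with torch.no_grad():
            teacher_logits = teacher(batch["input_ids"]).logits
        student_logits = student(batch["input_ids"]).logits
        loss = distillation_loss(student_logits, teacher_logits)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()

The "use GPT-4 output as training data" variant skips the logits entirely: you generate text with the big model and run ordinary supervised fine-tuning of the small one on it.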
Wasn't there that thing about how large LLMs are essentially compression algorithms (https://arxiv.org/pdf/2309.10668)? Maybe that's where this article is coming from: the idea that finetuning "adds" data to the set of data that compresses well. But that indeed doesn't work unless you mix in the finetuning data with the original training corpus of the base model. I think the article is wrong though in saying it "replaces" the data - it's true that finetuning without keeping in the original training corpus increases loss on the original data, but "large" in LLM really is large and current models are not trained to saturation so there is plenty of room to fit in finetuning if you do it right.
Not sure what you mean by “not trained to saturation”. Also I agree with the article, in the literature, the phenomenon to which the article refers is known as “catastrophic forgetting”. Because no one has specific knowledge about which weights contribute to model performance, by updating the weights via fine-tuning, you are modifying the model such that future performance will change in ways that are not understood. Also I may be showing my age a bit here, but I always thought “fine-tuning” was performing additional training on the output network (traditionally a fully-connected net), but leaving the initial portion (the “encoder”) weights unchanged - allowing the model to capture features the way it always has, but updating the way it generates outputs based on the discovered features.
OK, so this intuition is actually a bit hard to unpack; I got it from bits and pieces. There's this post https://www.fast.ai/posts/2023-09-04-learning-jumps/. Essentially, a single pass over the training data is enough for the LLM to significantly "learn" the material. In fact if you read the LLM training papers, for the large-large models, they generally explicitly say that they only did 1 pass over the training corpus, and sometimes not even the full corpus, only like 80% of it or whatever. The other relevant information is the loss curves - models like Llama 3 are not trained until the loss on the training data is minimized, like typical ML models. Rather they use these approximate estimates of FLOPS / tokens vs. performance on benchmarks. But it is pretty much guaranteed that if you continued to train on the training data it would continue to improve its fit - 1 pass over the training data is by no means enough to adequately learn all of the patterns. So from a compression standpoint, the paper I linked previously says that an LLM is a great compressor - but it's not even fully tuned, hence "not trained to saturation".
Now as far as how fine-tuning affects model performance, it is pretty simple: improves fit on the fine-tuning data, decreases fit on original training corpus. Beyond that, yeah, it is hard to say if fine-tuning will help you solve your problem. My experience has been that it always hurts generalization, so if you aren't getting reasonable results with a base or chat-tuned model, then fine-tuning further will not help, but if you are getting results then fine-tuning will make it more consistent.
Before the post-ChatGPT boom, we used to talk of "catastrophic forgetting"...
Make sure the new training dataset is "large" by augmenting it with general data (think of it as a sample of the original dataset), use PEFT techniques (freezing weights => less risk), and use regularization (elastic weight consolidation).
Fine-tuning is fine, but it will be more expensive than you thought and should be led by more experienced ML engineers. You probably don't need to fine-tune models anyway.
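For the elastic weight consolidation suggestion, a minimal sketch of the penalty term (the lambda value and the way the Fisher estimate is obtained are assumptions for illustration):

    import torch

    def ewc_penalty(model, fisher, old_params, lam=0.4):
        # Penalize moving parameters that were important (high Fisher information) for the original task.
        penalty = 0.0
        for name, param in model.named_parameters():
            if name in fisher:
                penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
        return (lam / 2.0) * penalty

    # During fine-tuning: loss = task_loss + ewc_penalty(model, fisher, old_params)
    # "fisher" is a dict of per-parameter Fisher estimates (e.g. squared gradients averaged
    # over a sample of the original data); "old_params" are frozen copies of the
    # pre-fine-tuning weights.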
Obviously there are going to be narrow tasks where fine tuning makes sense. But using leading models for agents is a completely different mindset and approach.
Because I have been working on replacing multiple humans handling complex business processes mostly end-to-end (with human in the loop somehow in there).
I find that I need the very best models to be able to handle a lot of instructions and make the best decisions about tool selection. And overall I just need the most intelligence possible to make fewer weird errors or misinterpretations of the instructions or situations/data.
I can see how fine tuning would help for some issues like some report formatting. But that output comes at the end of the whole process. And I can address formatting issues almost instantly by either just using a smarter model that follows instructions better, or adding a reminder instruction, or creating a simpler subtask. Sometimes the subtask can run on a cheaper model.
So it's kind of like the difference between building a traditional manufacturing line with very specific robot arms, tooling, and conveyor belts, versus plugging in just a few different humanoid robots with assembly manuals and access to more general-purpose tools on their belt. You used to always have to build the full traditional line. In many cases that doesn't necessarily make sense anymore.
I don’t know if fine tuning works. But if it doesn’t, then are we assuming the underlying weights are optimal? At what point do we determine that a network is properly “trained” and any subsequent training is “fine tuning”?
This post is hilarious. People like this author are the ones vetting start-ups? Please. The idea that alignment leads to a degradation in model utility is hardly news.
But let’s be clear: fine-tuning an LLM to specialize in a task isn’t just about minimizing utility loss. It’s about trade-offs. You have to weigh what you gain against what you lose.
The biggest mistake I see people making is this quote from the blog: "a 'fast and furious' approach to training neural networks does not work and only leads to suffering"
I'll probably write more about it in a few months...
I feel that the effects of fine-tuning are often short-term, and sometimes it can end up overwriting what the model has already learned, making it less intelligent in the process.
I lean more towards using adaptive methods, optimizing prompts, and leveraging more efficient ways to handle tasks. This feels more practical and resource-efficient than blindly fine-tuning.
We should focus on finding ways to maximize the potential of existing models without damaging their current capabilities, rather than just relying on fine-tuning.
It would be very interesting to fine tune a model for a narrow task, while tracking its performance on every original training sample from the pre-tuning baseline.
I expect it would greatly help characterize what was lost, at the expense of a great deal of extra computation. But with enough experiments it might shed some more general light.
I suspect the smaller the tuning dataset, the faster and worse the overwriting will be, since the new optimization surface will be so much simpler to navigate than the much bigger dataset's optimization surface.
Then a question might be, what percentage of the original training data, randomly retained, might slow general degradation.
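A rough sketch of that experiment, assuming you keep a fixed sample of the original training data around (all the names here are hypothetical, and the HF-style "model(**batch).loss" access is an assumption):

    import torch

    @torch.no_grad()
    def avg_loss(model, batches):
        # Mean loss over a fixed set of batches, e.g. the retained sample of original data.
        model.eval()
        losses = [model(**batch).loss.item() for batch in batches]
        model.train()
        return sum(losses) / len(losses)

    def finetune_and_track(model, optimizer, task_batches, original_sample_batches, log_every=100):
        # During narrow-task fine-tuning, log the loss on the retained sample alongside
        # the task loss to watch degradation as it happens.
        for step, batch in enumerate(task_batches):
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            if step % log_every == 0:
                print(step, loss.item(), avg_loss(model, original_sample_batches))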
Fine tuning isn’t for everything but certainly makes it easy to build models for special purposes, eg metadata extraction. Happy to lose some capability in another domain for that, eg Pokémon. The headline is a bit too general.
Correct me if I am wrong, but I thought the point of fine-tuning was to get precise returns. We make it hyper specific to the task at hand.
Sure, we can get 90% of the way there without fine-tuning, but most of these models are vast.
I would argue that it potentially MAY be a waste of time right out the gate.
RAG and fine-tuning are suitable for different business scenarios. For directional, persistent knowledge - such as adapting a model to fields like power and energy - fine-tuning can bring better performance;
RAG is more oriented to temporary and variable situations.
In addition, LoRA is also a fine-tuning technique, and this is stated in the original paper.
I am under the impression that fine tuning is expensive (could anyone put a number on that?) and that each time a new model is released you have to fine tune it again, paying full price every time.
Seriously, most fine-tuning now is done with LoRa adapters. They are much faster and more reliable. In my lab, I don't know anybody who is trying to do any kind of thorough fine-tuning...
Fine-tuning is an excellent way to reliably bake domain-specific data into a model; there are plenty of coding finetunes on Hugging Face that outperform foundation models on, say, coding, without significant loss in other domains.
"But this logic breaks down for advanced models, and badly so. At high performance, fine-tuning isn’t merely adding new data — it’s overwriting existing knowledge. Every neuron updated risks losing information that’s already intricately woven into the network. In short: neurons are valuable, finite resources. Updating them isn’t a costless act; it’s a dangerous trade-off that threatens the delicate ecosystem of an advanced model."
Mainly including this article to spark discussion—I agree with some of this and not with all of it. But it is an interesting take.
> Adapter Modules and LoRA (Low-Rank Adaptation) insert new knowledge through specialized, isolated subnetworks, leaving existing neurons untouched. This is best for stuff like formatting, specific chains, etc- all of which don’t require a complete neural network update.
This highlights to me that the author doesn't know what they're talking about. LoRA does exactly the same thing as normal fine-tuning, it's just a trick to make it faster and/or be able to do it on lower end hardware. LoRA doesn't add "isolated subnetworks" - LoRA parameters are added to the original weights!
Here's the equation for the forward pass from the original paper[1]:
h = W_{0} * x + ∆W * x = W_{0} * x + B * A * x
where "W_{0}" are the original weights and "B" and "A" (which give us "∆W" after they're multiplied) are the LoRA adapter. And if you've been paying attention it should also be obvious that, mathematically, you can merge your LoRA adapter into the original weights (by doing "W = W_{0} + ∆W"), which most people do, or you could even create a LoRA adapter from a fully fine-tuned model by calculating "W - W_{0}" to get ∆W and then doing SVD to recover B and A.
If you know what you're doing, anything you can do with LoRA you can also do with full fine-tuning, but better. It might be true that it's somewhat harder to "damage" a model by doing LoRA (because the parameter updates are fundamentally low rank due to the LoRA adapters being low rank), but that's a skill issue and not a fundamental property.
[1] -- https://arxiv.org/pdf/2106.09685
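To make the merge/extract point concrete, a toy PyTorch sketch for a single weight matrix (the dimensions and rank are made up; this illustrates the math above, not anyone's production code):

    import torch

    d, k, r = 256, 256, 8                        # layer dims and LoRA rank (toy values)
    W0 = torch.randn(d, k)                       # frozen original weight
    B, A = torch.randn(d, r), torch.randn(r, k)  # trained LoRA factors

    # Merging the adapter into the base weights: W = W0 + B A
    W_merged = W0 + B @ A

    # Going the other way: given a fully fine-tuned W, recover a rank-r adapter
    # by taking a truncated SVD of the weight delta.
    delta = W_merged - W0
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    B_rec = U[:, :r] * S[:r]   # absorb the singular values into B
    A_rec = Vh[:r, :]
    print(torch.allclose(B_rec @ A_rec, delta, atol=1e-3))  # True: delta was rank r to begin with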
> LoRA does exactly the same thing as normal fine-tuning
You wrote exactly so I'm going to say "no". To clarify what I mean: LoRA seeks to accomplish a similar goal as "vanilla" fine-tuning but with a different method (freezing existing model weights while adding adapter matrices that get added to the original). LoRA isn't exactly the same mathematically either; it is a low-rank approximation (as you know).
> LoRA doesn't add "isolated subnetworks"
If you think charitably, the author is right. LoRA weights are isolated in the sense that they are separate from the base model. See e.g. https://www.vellum.ai/blog/how-we-reduced-cost-of-a-fine-tun... "The end result is we now have a small adapter that can be added to the base model to achieve high performance on the target task. Swapping only the LoRA weights instead of all parameters allows cheaper switching between tasks. Multiple customized models can be created on one GPU and swapped in and out easily."
> you can merge your LoRA adapter into the original weights (by doing "W = W_{0} + ∆W") which most people do
Yes, one can do that. But on what basis do you say that "most people do"? Without having collected a sample of usage myself, I would just say this: there are many good reasons to not merge (e.g. see link above): less storage space if you have multiple adapters, easier to swap. On the other hand, if the extra adapter slows inference unacceptably, then don't.
> This highlights to me that the author doesn't know what they're talking about.
It seems to me you are being some combination of: uncharitable, overlooking another valid way of reading the text, being too quick to judge.
> You wrote exactly so I'm going to say "no". [...] If you think charitably, the author is right.
No, the author is objectively wrong. Let me quote the article and clarify myself:
> Fine-tuning advanced LLMs isn’t knowledge injection — it’s destructive overwriting. [...] When you fine-tune, you risk erasing valuable existing patterns, leading to unexpected and problematic downstream effects. [...] Instead, use modular methods like [...] adapters.
This is just incorrect. LoRA is exactly like normal fine-tuning here in this particular context. The author's argument is that you should do LoRA because it doesn't do any "destructive overwriting", but in that aspect it's no different than normal fine-tuning.
In fact, there's evidence that LoRA can actually make the problem worse[1]:
> we first show that the weight matrices trained with LoRA have new, high-ranking singular vectors, which we call intruder dimensions [...] LoRA fine-tuned models with intruder dimensions are inferior to fully fine-tuned models outside the adaptation task’s distribution, despite matching accuracy in distribution.
[1] -- https://arxiv.org/pdf/2410.21228
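A rough sketch of how one might look for those intruder dimensions, paraphrasing the paper's idea of comparing singular vectors of the tuned weight matrix against the base matrix (the rank cutoff and similarity threshold are arbitrary illustrative values, not the authors' settings):

    import torch

    def intruder_dimensions(W_base, W_tuned, top_k=10, threshold=0.5):
        # Flag top singular vectors of the tuned matrix that have low cosine similarity
        # to every singular vector of the base matrix.
        U_base, _, _ = torch.linalg.svd(W_base, full_matrices=False)
        U_tuned, _, _ = torch.linalg.svd(W_tuned, full_matrices=False)
        intruders = []
        for i in range(top_k):
            sims = (U_base.T @ U_tuned[:, i]).abs()  # singular vectors are unit norm
            if sims.max() < threshold:
                intruders.append(i)
        return intruders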
To be fair, "if you don't know what you're doing then doing LoRA over normal finetuning" is, in general, a good advice in my opinion. But that's not what the article is saying.
> But on what basis do you say that "most people do"?
On the basis of seeing what the common practice is, at least in the open (in the local LLM community and in the research space).
> I would just say this: there are many good reasons to not merge
I never said that there aren't good reasons to not merge.
> It seems to me you are being some combination of: uncharitable, overlooking another valid way of reading the text, being too quick to judge.
No, I'm just tired of constantly seeing a torrent of misinformation from people who don't know much about how these models actually work nor have done any significant work on their internals, yet try to write about them with authority.
If we zoom out a bit to one point he’s trying to make there, while LoRA is fine tuning I think it’s fair to call it a more modular approach than base SFT.
That said, I find the article as a whole off-putting. It doesn’t strengthen one’s claims to call things stupid or a total waste of time. It deals in absolutes, and rants in a way that misleads and foregoes nuance.
I learned a lot of perspective on LORA. Thanks folks
> No, I'm just tired of constantly seeing a torrent of misinformation from people who don't know much about how these models actually work nor have done any significant work on their internals, yet try to write about them with authority.
I get that. So what can we do?
One option is when criticizing, write as clearly as possible. Err on the side of overexplaining. From my point of view, it took a back-and-forth for your criticism to become clear.
I'll give an example where more charity and synthesis is welcome:
>> Fine-tuning advanced LLMs isn’t knowledge injection — it’s destructive overwriting. [...] When you fine-tune, you risk erasing valuable existing patterns, leading to unexpected and problematic downstream effects. [...] Instead, use modular methods like [...] adapters.
> This is just incorrect.
"This" is rather unclear. There are many claims in the quote -- which are you saying are incorrect? Possibilities include:
1. "Fine-tuning advanced LLMs isn’t knowledge injection — it’s destructive overwriting."
Sometimes, yes. More often than not? Maybe. Categorically? I'm not sure. [1]
2. "When you fine-tune, you risk erasing valuable existing patterns, leading to unexpected and problematic downstream effects."
Yes, this can happen. Mitigations can reduce the chances.
3. "Instead, use modular methods like [...] adapters."
Your elision dropped some important context. Here's the full quote:
> Instead, use modular methods like retrieval-augmented generation, adapters, or prompt-engineering — these techniques inject new information without damaging the underlying model’s carefully built ecosystem.
This logic is sound, almost out of tautology: the original model is unchanged.
To get more specific: if one's bolted-on LoRA module destroyed some knowledge, one can take that into account and compensate. Perhaps use different LoRA modules for different subtasks then delegate with a mixture of experts? (I haven't experimented with this particular architecture, so maybe it isn't a great example -- but even if it falls flat, this example doesn't undermine the general shape of my argument.)
In summary, after going sentence by sentence, I see one sentence that is dubious, but I don't think it is the same one you would point to.
[1] I don't know if this is considered a "settled" matter. Even if it was considered "settled" in ML research, that wouldn't meet my bar -- I have a relatively low opinion of ML research in general (the writing quality, the reproducibility, the experimental setups, the quality of the thinking!, the care put into understanding previous work)
Sorry to be a downer but basically every statement you’ve made above is incorrect.
> Sorry to be a downer but basically every statement you’ve made above is incorrect.
You don't need to apologize for being a "downer", but it would be better if you were specific in your criticisms.
I welcome feedback, but it has to be specific and actionable. If I'm wrong, set me straight.
This is a two-way street: if you were unfair or uncharitable or wrong, you have to own that too. It is incumbent upon an intellectually honest reader to first seek a plausible interpretation under which a statement is indeed correct. Some people have a tendency to only find one possible interpretation under which a statement is wrong. This is insufficient. Bickering over interpretations is less useful; understanding another's meaning is how we grow.
> that's a skill issue and not a fundamental property
This made me laugh.
You seem like you may know something I've been curious about.
I'm a shader author these days, haven't been a data scientist for a while, so it's going to distort my vocab.
Say you've got a trained neural network living in a 512x512 structured buffer. It's doing great, but you get a new video card with more memory so you can afford to migrate it to a 1024x1024. Is the state of the art way to retrain with the same data but bigger initial parameters, or are there other methods that smear the old weights over a larger space to get a leg up? Anything like this accelerate training time?
... can you up sample a language model like you can lowres anime profile pictures? I wonder what the made up words would be like.
In general this is of course an active area of research, but yes, you can do something like that, and people have done it successfully[1] by adding extra layers to an existing model and then continuing to train it.
You have to be careful about the "same data" part though; ideally you want to train once on unique data[2] as excessive duplication can harm the performance of the model[3], although if you have limited data a couple of training epochs might be safe and actually improve the performance of the model[4].
[1] -- https://arxiv.org/abs/2312.15166
[2] -- https://arxiv.org/abs/1906.06669
[3] -- https://arxiv.org/abs/2205.10487
[4] -- https://galactica.org/static/paper.pdf
In addition to increasing the number of layers, you can also grow the weight matrices and initialize by tiling them with the smaller model's weights https://neurips.cc/media/neurips-2023/Slides/83968_5GxuY2z.p...
This might be obvious, but just to state it explicitly for everyone: you can freeze the weights of the existing layers if you want to train the new layers but want to leave the existing layers untouched.
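A minimal sketch of both ideas on a toy MLP, appending new layers and freezing the pre-trained ones so only the additions receive gradient updates (the sizes and structure are made up for illustration):

    import torch
    import torch.nn as nn

    old = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
    # ... assume "old" has already been trained ...

    # Depth up-scaling: keep the trained layers and append fresh ones.
    new_layers = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 512))
    grown = nn.Sequential(*old, *new_layers)

    # Freeze the pre-trained layers; only the new ones will be updated.
    for p in old.parameters():
        p.requires_grad = False

    optimizer = torch.optim.AdamW(
        (p for p in grown.parameters() if p.requires_grad), lr=1e-4
    )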
Thank you for taking the time to provide me all this reading.
I think the point the author misses is that many applications of fine-tuning are to get a model to do a single task. This is what I have done in my current role at my company.
We’ve fine-tuned open weight models for knowledge-injection, among other things, and get a model that’s better than OpenAI models at exactly one hyper specific task for our use case, which is hardware verification. Or, fine-tuned the OAI models and get significantly better OAI models at this task, and then only use them for this task.
The point is that a network of hyper-specific fine-tuned models is how a lot of stuff is implemented. So I disagree from direct experience with the premise that fine-tuning is a waste of time because it is destructive.
I don’t care if I “damage” Llama so that it can’t write poetry, give me advice on cooking, or translate to German. In this instance I’m only ever going to prompt it with: “Does this design implement the AXA protocol? <list of ports and parameters>”
> I think the point the author misses...
It looked to me like the author did know that. The title only says "Fine-tuning", but immediately in the article he talks about Fine-tuning for knowledge injection, in order to "ensure that their systems were always updated with new information".
Fine-tuning to help it not make the stupid mistake that it makes 10% of the time no matter what instructions you give it is a completely different use case.
Cost, latency, and performance are huge reasons why my company chooses to fine tune models. We start with using a base model for a task and, as our traffic grows, we tune a smaller model, resulting in huge performance and cost savings.
> hardware verification
Could you give any rough details? I'm in this world, and have only experienced rigid/deterministic bounds for hardware, ideally based on "guaranteed by design" based models. The need for determinism has always prevented AI from being a part of it.
The author makes it specific that they are talking about finetuning "for Knowledge Injection". They give a quote that claims finetuning is still useful for things like following a specific style, formatting, etc. The title they chose could have been a bit more specific and less aphoristic.
Where finetuning makes less sense is doing it merely to get a model up to date with changes in some library, to teach it a new library it did not know, or, even worse, your codebase. I think this is what OP talks about.
Exactly. I want the LLM to be able to respond to our customers’ questions accurately and/or generate proper syntax for our query language.
The whole point of base models is to be general purpose, and fine tuned models to be tuned for specific tasks using a base model.
Just to be clear, unless I'm misinterpreting this chain of comments, you do not want to fine-tune for information retrieval. FT is for skill enhancement. For information retrieval you want at least one of the over 100 implementations of RAG out there now.
Let me preface by saying I'm not skeptical about your answer or think you're full of crap. Can you give me an example or two about a single task that you fine-tune for? Just trying to familiarize myself with more AI engineering tasks.
Yep!
So my use case currently is admittedly very specific. My company uses LLMs to automate hardware design, which is a skill that most LLMs are very poor at due to the dearth of training data.
For tasks which involve generation of code or other non-natural language output, we’ve found that fine-tuning with the right dataset can lift performance rapidly and decisively.
An example task is taking in potentially syntactically incorrect HDL (Hardware Description Language) code and fixing the syntax issues. Fine-tuning boosted corrective performance significantly.
I used fine-tuning back in the day because GPT 3.5 struggled with the concept of determining if two sentences were equivalent or not. This was for grading language learning drills. It was a single skill for a specific task and I had lots of example data from thousands of spaced repetition quiz sessions. The base model struggled with the vague concept of “close enough” equivalence. Since that time, the state of the art has advanced to the point that I don’t need it anymore. I could probably do it to save some money but I’m pretty happy with GPT 4.1.
Any classification task. For example in search ranking, does a document contain the answer to this question?
In this case, for doing specific tasks, it makes much more sense to optimize the prompts and the whole flow with DSPy, instead of just fine tuning for each task.
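For anyone who hasn't seen DSPy, here's a rough sketch of what "optimize the prompts and the whole flow" looks like; the class names and model string are from memory and may differ between versions, and the task data is hypothetical:

    import dspy

    dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model id

    # A tiny program for a "does this document answer the question?" style task.
    classify = dspy.ChainOfThought("question, document -> contains_answer")

    trainset = [
        dspy.Example(question="Does this design implement the AXA protocol?",
                     document="...ports and parameters...",
                     contains_answer="yes").with_inputs("question", "document"),
        # ... more labeled examples ...
    ]

    def metric(example, pred, trace=None):
        return example.contains_answer.lower() == pred.contains_answer.lower()

    # Let the optimizer search over prompts/demos instead of hand-tweaking them.
    optimizer = dspy.BootstrapFewShot(metric=metric)
    optimized = optimizer.compile(classify, trainset=trainset)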
A wonderful approach generally and something we also do to some extent, but not a substitute for fine-tuning in our case.
We are working in a domain where there is very limited training data, so what we really want is continued pre-training over a larger dataset. Absent that, fine-tuning is highly effective for non-NLP tasks.
It's not either/or. Generally you finetune when optimized many-shot still doesn't hit your desired quality bar. And it turns out with RL, things like system prompts matter a lot, so searching over prompts is a good idea even when reinforcing the desirable circuits.
I am not an expert in fine tuning, but in the company I work for, our fine tuned model didn't make any noticeable difference.
DSPy, notably, includes functionality for finetuning models. [1]
[1] https://dspy.ai/tutorials/games/
That's only viable if the quality of the outputs can be automatically graded, reliably. GP's case sounds like one where that's probably possible, but for lots of specific tasks that isn't feasible, including the other ones he names:
> write poetry, give me advice on cooking, or translate to German
Certainly, in those cases one needs to be clever and design an evaluation framework that will grade based on soft criteria, or maybe use user feedback. Still, over time a good train/test database should be built, and leveraging DSPy will yield improvements even in those cases.
Interestingly, the author mentions LoRa as a "special" way of fine-tuning that is not destructive. Have you considered it, or did you opt for more direct fine-tuning?
It's not special and fine tuning a foundation model isn't destructive when you have checkpoints. LoRa allows you to approximate the end result of a fine tune while saving memory.
Haven't tried it personally, as this was a use case where classic SFT was effective for what we wanted and none of us had done LoRA before.
Really interested in the idea though! The dream is that you have your big, general base model, then a set of LoRA weights for each task you've tuned on, where you can load/unload just the changed weights and swap the models out super fast on the fly for different tasks.
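From what I understand, the Hugging Face peft library supports roughly that workflow already - something like the following, if I have the API right (the adapter names and paths below are made up):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the big general base model once.
base = AutoModelForCausalLM.from_pretrained("base-model")

# Attach one LoRA adapter per task; only the small adapter weights differ.
model = PeftModel.from_pretrained(base, "lora-hdl-syntax", adapter_name="hdl_syntax")
model.load_adapter("lora-report-format", adapter_name="report_format")

model.set_adapter("hdl_syntax")      # route requests for task A
# ... run inference ...
model.set_adapter("report_format")   # swap to task B without reloading the base weights
```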
You do you, and if it works I'm not going to argue with your results, but for others: fine-tuning is the wrong tool for knowledge injection compared to a well-designed RAG pipeline.
Fine-tuning is good for, like you said, doing things a particular way, but that's not the same thing as being good at knowledge injection and shouldn't be treated as such.
It’s also much easier to prevent a RAG pipeline from generating hallucinated responses. You cannot finetune that out of a model.
This is a pretty awful take. Everyone understands they are modifying the weights - that is the point. It’s not like these models were released with all of the weights perfectly accounted for and changing them in any way ruins them. The awesome thing about fine-tuning is that the weights are malleable and you have a great base to start from.
Also, the basic premise that knowledge injection is a bad use case seems flawed. There are countless open models released by Google that completely fly in the face of this. MedGemma is just Gemma 3 4B fine-tuned on a ton of medical datasets, and it's measurably better than stock Gemma within the medical domain. Maybe it lost some ability to answer trivia about Minecraft in the process, but isn't that kind of implied by "fine-tuning" something? You're making it purpose-built for a specific domain.
Medgemma gets its domain expertise from pre-training on medical datasets, not finetuning. It’s pretty uncharitable to call the post an awful take if you’re going to get that wrong.
You can call it pre-training but it’s based on Gemma 3 4b - which was already pre-trained on a general corpus. It’s the same process, so you’re just splitting hairs. That is kind of my point, fine-tuning is just more training. If you’re going to say that fine-tuning is useless you are basically saying that all instruct-tuned models are useless as well - because they are all just pre-trained models that have been subsequently trained (fine-tuned) on instruction datasets.
> It’s not like these models were released with all of the weights perfectly accounted for and changing them in any way ruins them.
So more imperfect is better?
Of course the model's parameters, a vector of many billions of elements, leave some path for improvement. But what circuitous path is that, which the original training didn't already find?
You can’t find it by definition if you don’t include all the original data with the tuning data. You have radically changed the optimization surface with no contribution from the previous data at all.
The one use case that makes sense is sacrificing functionality to get better at a narrow problem.
You are correct about that.
A man who burns his own house down may understand what he is doing and do it intentionally - but without any further information he still appears to be wasting his time and doing something stupid. There isn't any contradiction between something being a waste of time and people doing it on purpose - indeed, the point of the article is to get some people to change what they are purposefully doing.
He's proposing alternatives he thinks are superior. He might well be right, too. I don't have a horse in the race, but LoRA seems like a more satisfying approach to getting a result than retraining the model, and giving LLMs tools seems to be proving more effective as well.
It's possible I misinterpreted the gist of the article a bit - in my mind, nobody is doing fine-tuning these days without using techniques like LoRA or DoRA. But they use these techniques because they are computationally efficient and convenient, not because they perform significantly better than full fine-tuning.
Clickbait headline. "Fine-tuning LLMs for knowledge injection is a waste of time" is true, but IDK who's trying to do that. Fine-tuning is great for changing model behavior (i.e. the zillions of uncensored models on Hugging Face are much more willing to respond to... dodgy... prompts than any amount of RAG is gonna get you), and RAG is great for knowledge injection.
Also... "LoRA" as a replacement for finetuning??? LoRA is a kind of finetuning! In the research community it's actually referred to as "parameter efficient finetuning." You're changing a smaller number of weights, but you're still changing them.
They provide no references other than self-referencing blog posts. It was also odd to read about loss from changing neural network weights with zero mention of quantization. Unfortunately, most of the content in this one was taken from the author's own previous work.
RAG is getting some backlash and this reads as a backlash of the backlash. I hope things settle down soon but many techfluencers put all their eggs in RAG and used it to gatekeep AI.
Fine-tuning was the best option at one point. Fine-tuned models are still a great option if you want an override (e.g. categorization or dialects), but they're not precise.
Changes that happened:
1. LLMs got a lot cheaper, but fine-tuning didn't. Fine-tuning was a way to cut down on prompts and make them zero-shot (not require examples).
2. Context windows became bigger. Fine-tuning was great back when a model was expected to respond with a single sentence.
3. The two things above made RAG viable.
4. Training of released models got better, to the point where zero-shot worked fine. Fine-tuning ends up overriding the very behaviors that were scoring nearly full marks on benchmarks.
Lots of prophets in every gold rush...
While the author makes some good points (along with some non-factual assertions), I wonder why he decided to have this counter-productive and factually wrong clickbait title.
Fine-tuning (and LoRA IS fine-tuning) may not be cost-effective for most organizations for knowledge updates, but it excels at driving behavior in task-specific ways: alignment, enforcing structured output (usually far more accurately than prompting), and tool and function use. And depending on the type of knowledge - if it is highly specific, niche, long-tail knowledge - it can even make smaller models beat bigger ones, as in the case of MedGemma.
There is no real difference between fine-tuning with and without a LoRA. If you give me a model with a LoRA adapter, I can give you an updated model without the extra LoRA params that is functionally identical.
Fitting a LoRA changes potentially useful information the same way that fine-tuning the whole model does. It's just that the LoRA restricts the expressiveness of the weight update so that it is compactly encoded.
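A toy NumPy sketch of that claim - the merged weight produces exactly the same output as base-plus-adapter, so the adapter can be folded away (dimensions and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                       # toy model dimension and LoRA rank
W0 = rng.normal(size=(d, d))      # frozen base weight
B = rng.normal(size=(d, r))       # LoRA factors (the trained part)
A = rng.normal(size=(r, d))

x = rng.normal(size=d)

h_lora   = W0 @ x + B @ (A @ x)   # base model + adapter at inference time
W_merged = W0 + B @ A             # fold the adapter into the weights
h_merged = W_merged @ x

print(np.allclose(h_lora, h_merged))  # True: functionally identical
```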
"Fine-tuning large language models (LLMs) is frequently sold as a quick, powerful method for injecting new knowledge"
Is that true though? I don't think I've seen a vendor selling that as a benefit of fine-tuning.
To be fair there are lots of Facebook, Instagram, and Youtube cargo cultists telling people to fine-tune on their documents for some reason. This got to be so common in 2024 that I think it was part of the pressure behind Gigabyte branding their hardware around it.
Yeah, as soon as I read that I felt like the author was living in a very different context from mine. It's never even occurred to me that fine-tuning could be an effective method for injecting new knowledge.
If anything, I expect fine-tuning to destroy knowledge (and reasoning), which hopefully (if you did your fine-tuning right) is not relevant to the particular context you are fine-tuning for.
OpenAI makes statements like: [1]
1) "excel at a particular task"
2) "train on proprietary or sensitive data"
3) "Complex domain-specific tasks that require advanced reasoning", "Medical diagnosis based on history and diagnostic guidelines", "Determining relevant passages from legal case law"
4) "The general idea of fine-tuning is much like training a human in a particular subject, where you come up with the curriculum, then teach and test until the student excels."
Don't all these effectively inject new knowledge? It may happen through simultaneous destruction of some existing knowledge but that isn't obvious to non-technical people.
OpenAI's analogy of training a human in a particular subject until they excel even arguably excludes the possibility of destruction because we don't generally destroy existing knowledge in our minds to learn new things (but some of us may forget the older knowledge over time).
I'm a dev with a hand-waving level of proficiency. I have fine-tuned self-hosted small LLMs using PyTorch. My perception of fine-tuning is that it fundamentally adds new knowledge. To what extent that involves destruction of existing knowledge has remained a bit vague to me.
My hand-waving solution if anyone pointed out that problem would be to 1) say that my fine-tuning data will include some of the foundational knowledge of the target subject to compensate for its destruction and 2) use a gold standard set of responses to verify the model after fine-tuning.
I for one found the article quite valuable for pointing out the problem and suggesting better approaches.
[1]: https://platform.openai.com/docs/guides/fine-tuning
I think it is a very common misconception (by consumers or businesses trying to use LLMs) that fine-tuning can be used to inject new knowledge. I'm not sure many of the fine-tuning platforms do much to disabuse people of this notion.
> Instead, use modular methods like retrieval-augmented generation, adapters, or prompt-engineering — these techniques inject new information without damaging the underlying model’s carefully built ecosystem.
So obviously this is what most of us are already doing, I would venture. But there's a pretty big "missing middle" here. RAG/better prompts serve to provide LLMs with the context they need for a specific task, but are heavily limited by context windows. I know they've been growing quite a bit, but from my usage it still seems that things further back in the window get forgotten about pretty regularly.
Fine tuning was always the pitch for the solution to that. By baking the "context" you need directly into the LLM. Very few people or companies are actually doing this though, because it's expensive and you end up with an outdated model by the time you're done...if you even have the data you need to do it in the first place.
So where we're left is basically without options for systems that need more proprietary knowledge than we can reasonably fit into the context window.
I wonder if there's anyone out there attempting to do some sort of "context compression". An intermediary step that takes our natural language RAG/prompts/context and compresses it into a data format that the LLM can understand (vectors of some sort?) but are a fraction of the tokens that the natural language version would take.
edit: After I wrote this I fed it into ChatGPT and asked if there were techniques I was missing. It introduced me to LoRA (which I suppose is what the "adapters" mentioned in the OP refers to), and now I have a whole new rabbit hole to go down. AI is pretty cool sometimes.
I saw this and immediately relived the last two years of my own journey. I think some of the mental model that helped me might help the community too.
What people expect from fine-tuning is knowledge addition. You want to keep the styling[1] of the original model and just add new knowledge points that would help your task. In-context learning is one example of how this can work well - though even there, if the context is out of distribution, the model does not "understand" it and produces guesswork.
When it comes to LoRA or PEFT or adapters, it's about style transfer. If you focus on a specific style of content, you will see gains, but the model won't learn new knowledge that wasn't already in the original training data, and it will forget previously learned styles depending on context. When you do full fine-tuning (or SFT with no frozen parameters), it alters all the parameters, which results in gaining new knowledge at the cost of previous knowledge (and will give you gibberish if you ask about topics outside the domain). This is called catastrophic forgetting. Hence, yes, full fine-tuning works - it's just an imperfect solution like all the others. Recently, with reinforcement learning, there has been talk of continual learning, which is where Richard Sutton's latest paper also lands, but that's still at the research level.
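To make the full-SFT vs. adapter distinction concrete, here's a rough sketch in Hugging Face terms (the model name, rank, and target modules are placeholders, not a recommendation):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Full SFT: every parameter is trainable, so gradients can overwrite
# whatever the base model knew (catastrophic forgetting risk).
full_model = AutoModelForCausalLM.from_pretrained("some-base-model")  # placeholder
for p in full_model.parameters():
    p.requires_grad = True

# LoRA/PEFT: freeze the base, train only small low-rank adapters.
lora_model = get_peft_model(
    AutoModelForCausalLM.from_pretrained("some-base-model"),
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]),
)
lora_model.print_trainable_parameters()  # typically well under 1% of the total
```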
Having said all that, if you start with the wrong mental model for Finetuning, you would be disappointed with the results.
The problem to solve is adding new knowledge while preserving the original pretrained intelligence. Still a work in progress, but we published a paper last year on one way it could be done. Here is the link: https://arxiv.org/abs/2409.17171 (it also has experimental results for all the different approaches).
[1]: Styling here means the style learned by the model in SFT, e.g. bullets, lists, bolding different headings, etc. - all of what makes the content readable. The understanding of how to present the answer to a specific question.
I love how people say things like this with complete disregard for research.
Most LLM research involves fine tuning models, and we do amazing things with it. R1 is a fine tune, but I guess that’s bad?
Our company adds knowledge with fine-tuning all the time. It's usually a matter of skill, not some fundamental limit. You need to either use LoRA, or use a large batch size and mix the previous training data back in (rough sketch below).
All we are doing is forcing deep representations. This isn’t a binary “fine tuning good/bad” it’s a spectrum of how deep and robust you make the representations
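For the data-mixing part, a minimal sketch of what I mean - the example lists and the replay ratio are placeholders:

```python
import random

# Interleave a replay sample of general data with the task data so gradient
# updates don't only point toward the narrow new distribution.
random.seed(0)

def build_mixed_dataset(task_examples, general_examples, replay_ratio=0.3):
    # `task_examples` and `general_examples` are placeholder lists of examples.
    n_replay = int(len(task_examples) * replay_ratio)
    mixed = task_examples + random.sample(general_examples, n_replay)
    random.shuffle(mixed)
    return mixed
```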
I think of it as trying to encourage the LLM to want to give answers from a particular part of the phase space. You can do it by fine tuning it to be more likely to return values from there, or you can prompt it to get into that part of the phase space. Either works, but fiddling around with prompts doesn't require all that much MLops or compute power.
That said, fine tuning small models because you have to power through vast amounts of data where a larger model might be cost ineffective -- that's completely sensible, and not really mentioned in the article.
> That said, fine tuning small models because you have to power through vast amounts of data where a larger model might be cost ineffective -- that's completely sensible, and not really mentioned in the article.
...which I thought was arguably the most popular use case for fine tuning these days.
> That said, fine tuning small models
Mostly referred to as model distillation, but I give the author the benefit of the doubt that they didn't mean that.
My understanding of model distillation is quite different, in that it trains another (typically smaller) model using the error between the new model's output and that of the existing one - effectively capturing the existing model's embedded knowledge and encoding it (ideally more densely) into the new one.
What I was referring to is similar in concept, but I've seen both described in papers as distillation. What I meant was: you take the output of a large model like GPT-4 and use that as training data to fine-tune a smaller model.
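Roughly like this, if I have the current OpenAI client right (model name, prompts, and file format are placeholders):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompts = ["Explain what a LoRA adapter is in one paragraph."]  # placeholder prompts

with open("distill.jsonl", "w") as f:
    for p in prompts:
        resp = client.chat.completions.create(
            model="gpt-4o",  # the large "teacher" model
            messages=[{"role": "user", "content": p}],
        )
        answer = resp.choices[0].message.content
        # The teacher's output becomes a supervised target for fine-tuning
        # a smaller "student" model on the same prompts.
        f.write(json.dumps({"messages": [
            {"role": "user", "content": p},
            {"role": "assistant", "content": answer},
        ]}) + "\n")
```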
Wasn't there that thing about how large LLMs are essentially compression algorithms (https://arxiv.org/pdf/2309.10668)? Maybe that's where this article is coming from: the idea that fine-tuning "adds" data to the set of data that compresses well. But that indeed doesn't work unless you mix the fine-tuning data in with the original training corpus of the base model. I think the article is wrong in saying it "replaces" the data, though - it's true that fine-tuning without keeping the original training corpus increases loss on the original data, but "large" in LLM really is large, and current models are not trained to saturation, so there is plenty of room to fit in fine-tuning if you do it right.
Not sure what you mean by “not trained to saturation”. Also I agree with the article, in the literature, the phenomenon to which the article refers is known as “catastrophic forgetting”. Because no one has specific knowledge about which weights contribute to model performance, by updating the weights via fine-tuning, you are modifying the model such that future performance will change in ways that are not understood. Also I may be showing my age a bit here, but I always thought “fine-tuning” was performing additional training on the output network (traditionally a fully-connected net), but leaving the initial portion (the “encoder”) weights unchanged - allowing the model to capture features the way it always has, but updating the way it generates outputs based on the discovered features.
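In PyTorch terms, that classic notion of fine-tuning is basically this - the encoder and dimensions are placeholders:

```python
import torch.nn as nn

# Minimal sketch of the classic meaning of "fine-tuning" described above:
# freeze the pretrained encoder, train only a new output head.
class HeadOnlyFineTune(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False          # encoder keeps capturing features as before
        self.head = nn.Linear(hidden_dim, num_classes)  # only this part gets updated

    def forward(self, x):
        return self.head(self.encoder(x))
```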
OK, so this intuition is actually a bit hard to unpack; I got it from bits and pieces. There's this post: https://www.fast.ai/posts/2023-09-04-learning-jumps/. Essentially, a single pass over the training data is enough for the LLM to significantly "learn" the material. In fact, if you read the LLM training papers, for the really large models they generally explicitly say that they only did one pass over the training corpus, and sometimes not even the full corpus - only 80% of it or so. The other relevant information is the loss curves: models like Llama 3 are not trained until the loss on the training data is minimized, like typical ML models. Rather, they use approximate estimates of FLOPS/tokens vs. performance on benchmarks. But it is pretty much guaranteed that if you continued to train on the training data it would continue to improve its fit - one pass over the training data is by no means enough to adequately learn all of the patterns. So from a compression standpoint, the paper I linked previously says that an LLM is a great compressor - but it's not even fully tuned, hence "not trained to saturation".
Now as far as how fine-tuning affects model performance, it is pretty simple: improves fit on the fine-tuning data, decreases fit on original training corpus. Beyond that, yeah, it is hard to say if fine-tuning will help you solve your problem. My experience has been that it always hurts generalization, so if you aren't getting reasonable results with a base or chat-tuned model, then fine-tuning further will not help, but if you are getting results then fine-tuning will make it more consistent.
Before the post-ChatGPT boom, we used to talk about "catastrophic forgetting"...
Make sure the new training dataset is "large" by augmenting it with general data (think of it as a sample of the original dataset), use PEFT techniques (freezing weights => less risk), and use regularization (elastic weight consolidation; sketched below).
Fine-tuning is fine, but it will be more expensive than you thought and should be led by more experienced ML engineers. You probably don't need to fine-tune models anyway.
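A compressed, illustrative sketch of the elastic weight consolidation idea in PyTorch - the model, data loader, loss function, and lambda value are all placeholders:

```python
import torch

# EWC: penalize moving parameters that mattered for the old task.
def fisher_diagonal(model, old_task_loader, loss_fn):
    # Diagonal Fisher estimate: accumulate squared gradients on old-task data.
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    model.eval()
    for x, y in old_task_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(old_task_loader), 1) for n, f in fisher.items()}

def ewc_penalty(model, fisher, old_params, lam=100.0):
    # Added to the fine-tuning loss: (lam/2) * sum_i F_i * (theta_i - theta*_i)^2
    penalty = 0.0
    for n, p in model.named_parameters():
        if n in fisher:
            penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return 0.5 * lam * penalty
```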
Obviously there are going to be narrow tasks where fine tuning makes sense. But using leading models for agents is a completely different mindset and approach.
Because I have been working on replacing multiple humans handling complex business processes mostly end-to-end (with human in the loop somehow in there).
I find that I need the very best models to be able to handle a lot of instructions and make the best decisions about tool selection. And overall I just need the most intelligence possible to make fewer weird errors or misinterpretations of the instructions or situations/data.
I can see how fine tuning would help for some issues like some report formatting. But that output comes at the end of the whole process. And I can address formatting issues almost instantly by either just using a smarter model that follows instructions better, or adding a reminder instruction, or creating a simpler subtask. Sometimes the subtask can run on a cheaper model.
So it's kind of like the difference between building a traditional manufacturing line with very specific robot arms, tooling, and conveyor belts, versus plugging in a few humanoid robots with assembly manuals and access to more general-purpose tools on their belts. You used to always have to build the full traditional line. In many cases that doesn't necessarily make sense anymore.
I don't know if fine-tuning works. But if it doesn't, are we then assuming the underlying weights are optimal? At what point do we decide that a network is properly "trained" and any subsequent training is "fine-tuning"?
This post is hilarious. People like this author are the ones vetting start-ups? Please. The idea that alignment leads to a degradation in model utility is hardly news.
But let’s be clear: fine-tuning an LLM to specialize in a task isn’t just about minimizing utility loss. It’s about trade-offs. You have to weigh what you gain against what you lose.
It's pretty frustrating to spend weeks on finetuning and end up with a model that says:
"SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT ..."
What is the way out in such cases?
I've hit this with gemini-2.0-flash; changing the prompt ever so slightly seems to make things work, only for it to break on other inputs.
Finetuning is deep learning training. It's pretty difficult to get right.
Andrej's 2019 blog post laments some of the reasons it is hard, and I can relate to a lot of it - https://karpathy.github.io/2019/04/25/recipe
The biggest mistake I see people making is this quote from the blog: "a 'fast and furious' approach to training neural networks does not work and only leads to suffering"
I'll probably write more about it in a few months...
I feel that the effects of fine-tuning are often short-term, and sometimes it can end up overwriting what the model has already learned, making it less intelligent in the process. I lean more towards using adaptive methods, optimizing prompts, and leveraging more efficient ways to handle tasks. This feels more practical and resource-efficient than blindly fine-tuning. We should focus on finding ways to maximize the potential of existing models without damaging their current capabilities, rather than just relying on fine-tuning.
It would be very interesting to fine tune a model for a narrow task, while tracking its performance on every original training sample from the pre-tuning baseline.
I expect it would greatly help characterize what was lost, at the expense of a great deal of extra computation. But with enough experiments might shed some more general light.
I suspect the smaller the tuning dataset, the faster and worse the overwriting will be, since the new optimization surface will be so much simpler to navigate than the much bigger original dataset's optimization surface.
Then a question might be, what percentage of the original training data, randomly retained, might slow general degradation.
Fine tuning isn’t for everything but certainly makes it easy to build models for special purposes, eg metadata extraction. Happy to lose some capability in another domain for that, eg Pokémon. The headline is a bit too general.
Correct me if I am wrong, but I thought the point of fine-tuning was to get precise returns. We make it hyper specific to the task at hand. Sure, we can get 90% of the way there without fine-tuning, but most of these models are vast. I would argue that it potentially MAY be a waste of time right out the gate.
Overwrite seems a bit strong. Closer to adjusting. Which is the whole point of fine tuning.
RAG and fine-tuning suit different business scenarios. For directional, persistent knowledge - such as adjustments for power, energy, and other specialized fields - fine-tuning can bring better performance;
RAG is more suited to temporary, variable situations.
In addition, LoRA is itself a fine-tuning technique, and its paper describes it as such.
I am under the impression that fine tuning is expensive (could anyone put a number on that?) and that each time a new model is released you have to fine tune it again, paying full price every time.
Seriously, most fine-tuning now is done with LoRA adapters. They are much faster and more reliable. In my lab, I don't know anybody who is trying to do any kind of thorough full fine-tuning...
Fine-tuning is an excellent way to reliably bake domain-specific data into a model; there are plenty of coding fine-tunes on Hugging Face that outperform foundation models on, say, coding, without significant loss in other domains.
"But this logic breaks down for advanced models, and badly so. At high performance, fine-tuning isn’t merely adding new data — it’s overwriting existing knowledge. Every neuron updated risks losing information that’s already intricately woven into the network. In short: neurons are valuable, finite resources. Updating them isn’t a costless act; it’s a dangerous trade-off that threatens the delicate ecosystem of an advanced model."
Mainly including this article to spark discussion—I agree with some of this and not with all of it. But it is an interesting take.