2025 - 01 January

AI models collapse when trained on recursively generated data. This was talked about a lot at some point, when people realized how much online content would soon be created by large language models: if you keep feeding the same slop back into LLMs, the outputs collapse and you get indefinitely more garbage out. You can read about it in a “news”-like Nature article, AI produces gibberish when trained on too much AI-generated data (no, just kidding, they are paywalling it, just as you’d expect from an open science institution).
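
To make the setup concrete, here is a toy sketch (mine, not the paper’s code) of the recursive loop they study: each generation is fit only on samples produced by the previous generation, with no fresh real data mixed back in, and the spread of the original distribution quietly disappears.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: the "real" data, a standard normal.
data = rng.normal(loc=0.0, scale=1.0, size=50)

for gen in range(1, 301):
    # "Train" a model on the current data: here, just fit a Gaussian.
    mu, sigma = data.mean(), data.std()
    # "Generate" the next generation's training set from that fitted model
    # alone -- no real data is ever mixed back in.
    data = rng.normal(loc=mu, scale=sigma, size=50)
    if gen % 50 == 0:
        print(f"generation {gen:3d}: mean={mu:+.3f}, std={sigma:.3f}")
```

With only 50 samples per generation, the fitted spread follows a noisy multiplicative walk with a slight downward bias, so after enough rounds the “model” tends to describe a tiny sliver of the original distribution. Crude, but it is the same failure mode the paper scales up to language models.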

I post this here as an irksome example of research that got a bunch of press, was published in what is (for some) one of the most prestigious journals, and is very likely just wrong. In Microsoft’s Phi-4 technical report, they outline how using synthetic data (generated from a current large language model) was a “core pillar” of producing a better new model.

The Nature-published article is especially irksome to me because I see it as another example of “negative” research: “we have shown this popular thing doesn’t work”, published with the air of a mathematical proof. Or, outside academia: “We showed LLMs can’t do X, and so they’ll never be able to do X.”

I see this as related to the problem of the missing “Journal of Negative Results”: part of the incentive to p-hack comes from having no outlet for negative results. Journals want positive-result headlines, so if you spend 6 months doing high-quality work and then find, unexpectedly, that there is no correlation, you have wasted 6 months of work. The other option people reach for is to make a much stronger negative claim: “We have shown X (negative result)”, when often the truth is “we have shown that we could not make X work” or, as in the original recursive-data paper:

We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models

You could almost stop right there if you were reading carefully: indiscriminate use? What kind of team would indiscriminately use content? Curating the training data is known to be one of the most time-intensive, difficult aspects of model training, which is why the biggest labs hold their data recipes close to their chest.
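
For contrast, here is a hypothetical sketch of what “discriminate” use of synthetic data looks like in practice: score every generated example, keep only what clears a quality bar, and mix the survivors in with real data rather than replacing it. The function names and threshold are placeholders of mine, not anything from the Phi-4 report.

```python
from typing import Callable, Iterable

def curate_synthetic(
    candidates: Iterable[str],
    quality_score: Callable[[str], float],  # e.g. a classifier, reward model, or test suite
    threshold: float = 0.8,
    dedupe: bool = True,
) -> list[str]:
    """Keep only synthetic examples that clear the quality bar."""
    seen: set[str] = set()
    kept: list[str] = []
    for text in candidates:
        if dedupe:
            if text in seen:
                continue
            seen.add(text)
        if quality_score(text) >= threshold:
            kept.append(text)
    return kept

# Usage sketch: curated synthetic data supplements real data instead of replacing it.
# real_data = load_real_corpus()
# synthetic = curate_synthetic(model_generations, quality_score=my_scorer)
# training_set = real_data + synthetic
```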