Trillion Dollar Question

Here’s a trillion-dollar question that came up during the Lex Fridman podcast with Marc Andreessen:

Will synthetic training data work?

Let’s say you ask an LLM to generate content. Can you use that output to train the next version of the LLM? Or does the output carry no signal beyond what was already in the data the model was trained on in the first place?

One argument, from the principles of information theory, is that no, it is not useful. Because the output is derived from human-generated input, every signal in that output was already present in the LLM’s original training data; processing data cannot create new information (this is the data-processing inequality). So synthetic data is like empty calories: it doesn’t help.
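The information-theoretic side of the argument is easy to demonstrate at toy scale. Below is a minimal sketch that uses a bigram model as a stand-in for an LLM (an assumption of this illustration, not a claim about real models): train generation 1 on human text, have it write a synthetic corpus, train generation 2 on that corpus alone, and check whether generation 2 learned anything generation 1 didn’t already know.

```python
import random
from collections import defaultdict, Counter

def train_bigram(corpus):
    """Count word-to-next-word transitions."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            counts[a][b] += 1
    return counts

def generate(model, start, length=8):
    """Sample a sentence by walking the learned transitions."""
    words = [start]
    for _ in range(length):
        nxt = model.get(words[-1])
        if not nxt:
            break
        words.append(random.choices(list(nxt), weights=nxt.values())[0])
    return " ".join(words)

human_corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

gen1 = train_bigram(human_corpus)

# Generation 1 writes its own "training data" ...
synthetic_corpus = [generate(gen1, "the") for _ in range(100)]

# ... and generation 2 is trained only on that synthetic text.
gen2 = train_bigram(synthetic_corpus)

def transitions(model):
    return {(a, b) for a in model for b in model[a]}

# Every transition gen2 knows was already known to gen1:
# the synthetic corpus added no new signal.
print(transitions(gen2) <= transitions(gen1))  # True
```

In this toy world the “empty calories” claim holds exactly: generation 2 can only be a (noisier) copy of generation 1. Whether the analogy survives the jump to real LLMs is precisely the open question.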

Another theory says the opposite! One thing LLMs are really good at is generating creative content, and of course they can generate training data. For instance, in reinforcement learning (RL) there is a concept called self-play, where an agent competes against itself in order to learn a task. This approach has been successful in domains such as board games (Chess, Go) and video games (Dota, StarCraft). In these cases the model is essentially generating its own synthetic data (the game states and the actions it takes), and learning from that data demonstrably improves performance.
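As a concrete (if tiny) illustration of that dynamic, here is a self-play sketch shrunk down to the game of Nim: a single tabular agent plays both sides, generates all of its own experience, and still ends up learning the game’s known optimal strategy. The game choice, hyperparameters, and Monte-Carlo-style update are illustrative assumptions, not a recipe from any of the systems mentioned above.

```python
import random
from collections import defaultdict

# Self-play on Nim: take 1-3 sticks from a pile of 21;
# whoever takes the last stick wins. Both "players" share
# one Q-table, so all training data is self-generated.
Q = defaultdict(float)              # Q[(pile, move)] -> value estimate
ALPHA, EPSILON, EPISODES = 0.1, 0.1, 50_000

def legal_moves(pile):
    return [m for m in (1, 2, 3) if m <= pile]

def pick_move(pile):
    if random.random() < EPSILON:   # explore occasionally
        return random.choice(legal_moves(pile))
    return max(legal_moves(pile), key=lambda m: Q[(pile, m)])

for _ in range(EPISODES):
    pile, history = 21, []          # history of (pile, move) pairs
    while pile > 0:
        move = pick_move(pile)
        history.append((pile, move))
        pile -= move
    # The player who took the last stick won. Walk the game
    # backwards, crediting the winner's moves (+1) and blaming
    # the loser's moves (-1), since players strictly alternate.
    reward = 1.0
    for seen_pile, move in reversed(history):
        Q[(seen_pile, move)] += ALPHA * (reward - Q[(seen_pile, move)])
        reward = -reward

# The learned policy typically recovers the known optimal
# strategy for this game: always leave a multiple of 4.
for pile in (21, 10, 7, 5):
    best = max(legal_moves(pile), key=lambda m: Q[(pile, m)])
    print(f"pile {pile}: take {best}")  # typically 1, 2, 3, 1
```

The interesting part is that no outside data enters the loop: every (state, move, outcome) the agent learns from was produced by the agent itself, yet its play genuinely improves. That is the intuition behind the optimistic answer to the trillion-dollar question.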