Fact vs. Fiction: Why Language Models Need to Pick a Lane

Notes on the ongoing identity crisis of language models

0xSingularity
Jan 1, 2023

“Hallucinating knowledge” could well be the NLP term of the year for 2022: a novel and somewhat weird problem that arises when you indiscriminately train a model on the next-word prediction task across all kinds of data sources, both factual and fictional.

To date, the “impressive” performance of models like GPT-3 and ChatGPT has been achieved by training them on datasets with no teleological alignment: there is no purpose to the data, no pattern or skill the model is specifically being trained for, just a random amalgamation of what humans say on the internet. It is no wonder, then, that these models behave as if they have an incredible identity crisis and have such a hard time differentiating fact from fiction.

Why do LLMs hallucinate so much?

Many users today already claim to prefer ChatGPT over Google for many of their queries.[1] While often very useful, this approach is also quite dangerous, as ChatGPT often, very confidently, spews out blatantly incorrect or even made-up facts. We believe this is an outcome of its training data being based on a very loosely defined task: conversation, which can be factual, fictional, or imaginative. By refusing to draw a line between training a model on factual data and training a model to produce impressive fiction, OpenAI has in a sense failed at both tasks: the model is too cautious when creating fiction, and too overconfident when stating facts, real or otherwise.

We argue that differentiating between these two tasks (factual vs. fictional generation) at training time is essential if these models are to have major practical applications in the real world.
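
As a minimal illustration of what such a separation could look like (and not a description of how ChatGPT, LUCI, or any existing model is actually trained), one could prefix every training example with an explicit task tag, in the spirit of control-code conditioning as in CTRL. The tags and helper below are hypothetical:

```python
# Hypothetical task tags: the model sees which mode each example belongs to,
# so it can learn two distinct generation modes instead of blending
# fact and fiction into one distribution.
FACTUAL = "<|factual|>"
FICTIONAL = "<|fictional|>"

def tag_example(text: str, is_factual: bool) -> str:
    # Prepend the task tag to the raw training text.
    return (FACTUAL if is_factual else FICTIONAL) + " " + text

print(tag_example("The Eiffel Tower is in Paris.", is_factual=True))
print(tag_example("The dragon circled the glass tower twice.", is_factual=False))
```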

Building Teleologically Aligned Training Datasets

In our opinion, the capabilities of Large Language Models (LLMs) are dramatically underutilised today because of the lack of teleological alignment in the training data used. In simple terms: beyond basic linguistic patterns, the model doesn’t know what it is expected to learn, because the data is a mess of contradictions.

Instead of being trained to perform a particular task (like earlier generations of classifier ML models), these LLMs are simply being trained to imitate humans on the internet, with the result that their performance converges around that of the median human on the internet. Reinforcement learning methods (like InstructGPT) only partially mitigate this problem: they try to match human preferences, but the task humans want the model to perform is still unclear. One person wants it to write a fictional novel; another wants a factual essay, or the answer to a coding problem. The AI might very well include historical figures in the fictional novel, and made-up libraries in the factual coding answer. This is the state of LLMs today.

We propose a different approach: building a large, teleologically aligned dataset to train these models to excel at a specific task. Instead of the mess of data they are currently fed, we suggest that after training on more than a trillion tokens for a specific task, these models could already far exceed human capabilities at it. We pick factual question-answering as a task that is specific enough for each epoch to tangibly improve the model’s problem solving, while being general enough to be useful across domains of human endeavour.

In other words, we posit that no technical innovations are required to reach a generally intelligent oracle AI, only much better datasets.
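
To make this concrete, here is a minimal sketch of what a single record in such a teleologically aligned factual-QA dataset might look like. The schema and field names are illustrative assumptions, not the actual format used to train LUCI:

```python
import json

# One illustrative record: a single task, a question, a grounded answer,
# and provenance that lets the answer be verified before training.
record = {
    "task": "factual_qa",      # single, explicit training objective
    "question": "What is the boiling point of water at sea level?",
    "answer": "100 degrees Celsius (212 degrees Fahrenheit) at standard atmospheric pressure.",
    "sources": ["https://en.wikipedia.org/wiki/Boiling_point"],  # provenance for fact-checking
    "verified": True,          # passed a human or automated verification step
}

# Records like this would typically be serialized one JSON object per line (JSONL),
# a common format for large-scale language-model training corpora.
print(json.dumps(record))
```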

Grounding NLP models in facts

It is important to define the line between factual and fictional generation. Factual generation does not mean the absence of creativity — instead, it includes informed speculation or reasoning that is grounded in actual facts or evidence.

In contrast, fictional generation is one where it is clear to the AI that it need not ground its output in any commonly accepted fact or evidence, besides perhaps a generalised understanding of the kind of content humans like.

From this definition, it is easy to see why fictional generation is the much easier problem to solve, and why existing models excel at it. Factual generation, on the other hand, requires the internal model of the world built by the LLM to include a distinct division between “true” and “made up”, something that existing datasets are not at all conducive to achieving.

One must remember that these models have no access to an “outside world” with which to verify which parts of the content they consume are real. They must go solely by their training data, which makes it all the more important that this data is very carefully curated.

Therefore, we propose that a distinct, “clean” dataset must be created, consisting solely of a massive number of factually accurate questions and answers. Unless we filter the training data we provide to these LLMs, they will never be able to separate fact from fiction by themselves.
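
A minimal sketch of the kind of curation pass we have in mind, assuming a hypothetical pipeline in which every candidate answer is checked against trusted reference passages before it is allowed into the training set (a production pipeline would use entailment models or human review rather than the toy overlap check below):

```python
def supported_by_references(answer: str, references: list[str]) -> bool:
    # Toy grounding check: accept an answer only if most of its tokens
    # appear in at least one trusted reference passage.
    answer_tokens = set(answer.lower().split())
    for ref in references:
        ref_tokens = set(ref.lower().split())
        if answer_tokens and len(answer_tokens & ref_tokens) / len(answer_tokens) > 0.5:
            return True
    return False

def curate(pairs: list[dict], references: list[str]) -> list[dict]:
    # Keep only Q&A pairs whose answers can be grounded in the references;
    # everything else is dropped from the training corpus.
    return [p for p in pairs if supported_by_references(p["answer"], references)]
```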

Reader-Retriever Approaches vs. Generative Approaches

Another common approach, being tried by several teams, is to have a GPT-style model consult reference texts or sources while answering questions, to help ground its answers in real-world knowledge. While this improves performance on some tasks, it is often two or more orders of magnitude more costly than direct inference, and is thus self-limiting in its potential. Moreover, constructive speculation tailored to the question remains a challenge, unlike with purely generative approaches. Finally, reader-retriever approaches are always capped by the performance of the (weaker) retrieval method.

Instead, if the models are extensively trained only on factually correct data, the ratio of hallucinated to real answers can be dramatically reduced, without reliance on reference material. We demonstrate a primitive version of this skill with LUCI, trained on 20M factual Q&A pairs, and expect this skill to only grow as the training dataset expands.
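
The difference between the two inference paths can be sketched as follows. The model, retriever, and prompt construction below are schematic assumptions rather than code from any particular system; the point is simply that the retrieval path pays for every retrieved passage at inference time and is bounded by what the retriever finds:

```python
def answer_directly(model, question: str) -> str:
    # Purely generative approach: the model answers from its own parameters.
    # The prompt is just the question, so inference cost stays small.
    return model.generate(prompt=question)

def answer_with_retrieval(model, retriever, question: str, k: int = 10) -> str:
    # Reader-retriever approach: fetch k passages and stuff them into the prompt.
    # The prompt (and therefore the cost) grows with k, and answer quality is
    # capped by whatever the retriever manages to surface.
    passages = retriever.retrieve(question, top_k=k)
    prompt = "\n\n".join(passages) + "\n\nQuestion: " + question
    return model.generate(prompt=prompt)
```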

0xSingularity

Tracking the journey of LUCI, a general-purpose question answering AI