Current AI chatbots assign a “confidence level” to every piece of output. It signals perfectly well when and where they should look for more information… but humans have been pushing them to “output something, anything”, instead of excusing themselves for not knowing something, or running some additional processes to look for the missing information.
As of this year, Copilot has been running web searches to make up for its lack of information, and Gemini is both running web searches and iteratively self-checking its own answers in order to refine them (see “drafts”). It also seems like Gemini might be learning from humanity’s reactions to its wrong answers.
From my understanding, AI is essentially a statistical method, so naturally it will use a confidence level. It’s hard for me to take the leap of faith that confidence level will correlate with accuracy. Seems to me it would depend more on its data set. If its data contains a commonly held belief, that is incorrect, would it not have a high confidence level on an answer with that incorrect info? If we use a highly authoritative data set, that will be very limited and we’d be back to more of a keyword system than an LLM. I am sure that with time we’ll be in more of a middle ground where accuracy will be better, but what will that be? 5%? 3%? 10%?
I’ll freely admit I am not an expert in this at all.
It’s not just a statistical method anymore. One of the breakthroughs of large neural network models has been that, during training, an emergent process assigns neurons to both relatively high-level and specific traits, which at the same time “cluster up” with other neurons assigned to related traits. Adding just a bit of randomness (“temperature”) allows the AI to jump from activating one trait to a close one, but not to one too far away. Confidence becomes a measure of how close the output is to a consistent set of traits trained into the network. Interestingly, a temperature of 0 makes the model always pick its top-scoring token, so the output is fully deterministic… but it tends to come out repetitive and dull rather than better.
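To make the temperature part a bit more concrete, here is a minimal sketch in Python. It is a toy of my own, not how any particular chatbot is implemented: the scores in `logits` are made-up numbers and the function names are just ones I picked. It only shows how dividing the scores by a temperature sharpens or flattens the probabilities before a token gets sampled.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 (0 is handled as 'always pick the top score')")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng=random):
    """Sample one token index from the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical scores for three candidate tokens (illustrative values only).
logits = [2.0, 1.5, 0.1]
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the low temperature nearly all of the probability lands on the top token, while at the high one the runner-up tokens get a real chance of being picked.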
If its data contains a commonly held belief, that is incorrect
This is where things start to get weird. An AI system based on an LLM can iterate over its own answers looking for the optimal one (reportedly the idea behind Q*), and even detect inconsistencies in them. What it does after that depends on whoever programmed it:
Maybe it casts any doubt aside, and outputs the first answer anyway (original ChatGPT did that, didn’t even bother self-checking too much)
Or it could ask an authoritative source (ChatGPT plugins work like that)
Or it could search the web for additional info (Copilot and Gemini do that)
Or it could alert the user to both the low confidence and the inconsistencies (…but people want omniscient AIs, not “err… I’m not sure, Dave” AIs)
…or, sometime in the future (or present?) they could re-train themselves, maybe by generating a LoRA that would bring in corrected biases, or even additional concepts.
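As a rough illustration of that "depends on whoever programmed it" step, the options above could be wired up like this. Everything here is hypothetical: the `Action` names, the `choose_action` function, and the 0.7 threshold are my own placeholders and not anything a real chatbot is known to use, and the re-training option is left out.

```python
from enum import Enum

class Action(Enum):
    ANSWER = "output the answer as is"
    ASK_SOURCE = "ask an authoritative source"
    SEARCH_WEB = "search the web"
    WARN_USER = "tell the user about the low confidence"

def choose_action(confidence, inconsistent, has_source=False, has_web=False, threshold=0.7):
    """Pick what to do after a self-check (illustrative policy, made-up threshold)."""
    if confidence >= threshold and not inconsistent:
        return Action.ANSWER
    if has_source:   # e.g. a plugin that can query an expert system
        return Action.ASK_SOURCE
    if has_web:      # fall back to a web search if one is available
        return Action.SEARCH_WEB
    return Action.WARN_USER

print(choose_action(0.9, inconsistent=False))
print(choose_action(0.4, inconsistent=False, has_web=True))
print(choose_action(0.9, inconsistent=True))
```

The order of the checks is the design decision: a system that never reaches the last line is the "omniscient AI" people ask for, just with less honesty.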
Over time, I think different AI systems will evolve to target accuracy, consistency, creativity, etc. Current systems are kind of rudimentary compared to what’s yet to come, and too many are used in very rudimentary ways by anyone who can slap an “AI” label on them and sell them.
That is pretty interesting and thanks for posting it. I hear the words and it’s intriguing, but to be honest, I don’t really understand it. I’d have to give it some thought and read more about it. Do you have a place you suggest going to learn more?
I currently use ChatGPT-4o for learning Python and helping with grammar. I find it does great with grammar, but even with relatively simple Python questions it can produce some “creative” answers. Like, it’s in the ballpark but it’s not perfect, and for a learner, that’s learning the hard way. To be fair, I don’t use the assistant/code interpreter, which I have no idea about, but based on its name I assume it might be better. So that’s what I based my somewhat skeptical opinion of AI on.
You may want to also check an intro to neural networks, and Q* is a somewhat new concept. Other than that… “the internet”. There are plenty of places with info, not sure if there is a more centralized and structured one.
Learning to code with just ChatGPT is not the best idea. You need to join three areas:
general principles (data structures, algorithms, etc)
language rules (best described in a language reference)
business logic (computer science, software engineering, development patterns, etc)
ChatGPT’s programming answers give you an intersection of all those, often with some quirks, with the one nice benefit of explaining what it thinks it is doing. You still need some basic understanding of those areas in order to understand what ChatGPT is talking about, how to double-check it, and how to look for more info. It can be a great timesaver as a way to generate drafts, though.
LLMs generate output one token at a time. Each token comes with a confidence level from the model: roughly, how strongly it favors that token over every other possible continuation of the sequence. A model gets close to 100% confident in its output only when it reproduces a training text verbatim. With any temperature above 0, it can veer off the highest-confidence path, which lets it leverage the concept associations it came up with during training and makes its output more useful.
For every generated text, you could get a confidence heat map, then ask the model to refine sections that don’t meet a desired level of confidence. The parts where a model makes stuff up, or hallucinates, are especially likely to be token sequences with much lower confidence than the rest.
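A heat map like that boils down to finding runs of tokens under some threshold. Here is a small sketch of the idea, assuming you already have one confidence value per token. The tokens, the numbers, and the 0.5 threshold are all made up for illustration, and `low_confidence_spans` is a name I invented.

```python
def low_confidence_spans(tokens, confidences, threshold=0.5):
    """Return (start, end, text) for every run of tokens below the threshold."""
    if len(tokens) != len(confidences):
        raise ValueError("need exactly one confidence value per token")
    spans = []
    start = None  # index where the current low-confidence run began
    for i, conf in enumerate(confidences):
        if conf < threshold and start is None:
            start = i
        elif conf >= threshold and start is not None:
            spans.append((start, i, "".join(tokens[start:i])))
            start = None
    if start is not None:  # the text ended in the middle of a low run
        spans.append((start, len(tokens), "".join(tokens[start:])))
    return spans

# Made-up tokens and confidences, not real model output.
tokens = ["The", " capital", " of", " Australia", " is", " Sydney", "."]
confidences = [0.95, 0.90, 0.97, 0.88, 0.93, 0.35, 0.80]
print(low_confidence_spans(tokens, confidences))
```

The flagged spans are the sections you would hand back to the model, or to a search, for a second pass.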
Running a model several times, focusing on the sections with lower confidence, and getting additional data from other sources like the internet or some niche expert system, could eliminate many of the nonsense sections… and I have a reasonable suspicion that Google’s Gemini does exactly that, refining each output with 4 additional iterations instead of blindly spitting out the first one.
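Purely as a guess at what such a loop could look like (I have no inside knowledge of how Gemini or anyone else does it), here is a sketch. `generate` and `lookup` are hypothetical callables that you would have to supply, and the fake versions below are stand-ins so the example runs on its own; the 0.8 target is arbitrary.

```python
def refine(prompt, generate, lookup, min_confidence=0.8, max_rounds=4):
    """Regenerate with extra context until confident enough or out of rounds.

    generate(prompt, context) -> (text, confidence) and lookup(prompt) -> str
    are assumed interfaces, not a real API.
    """
    context = ""
    best_text, best_conf = generate(prompt, context)
    rounds = 0
    while best_conf < min_confidence and rounds < max_rounds:
        context += lookup(prompt)            # pull in more outside information
        text, conf = generate(prompt, context)
        if conf > best_conf:                 # keep the best attempt so far
            best_text, best_conf = text, conf
        rounds += 1
    return best_text, best_conf, rounds

def fake_generate(prompt, context):
    # Stand-in for a model: confidence grows with the number of looked-up facts.
    facts = context.count("fact")
    return f"answer using {facts} facts", min(1.0, 0.5 + 0.2 * facts)

def fake_lookup(prompt):
    # Stand-in for a web search or expert system.
    return "fact "

print(refine("some question", fake_generate, fake_lookup))
```

The cap on rounds matters: without it, a question the model can never answer confidently would loop forever.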
I guess that makes sense, but I wonder if it would be hard to get clean data out of the per-token confidence values. The LLM could be hallucinating, or it could just be generating bad grammar. It seems like it’s hard enough already to get LLMs to distinguish between “killing processes” and murder, but maybe there could be some novel training and inference techniques that come up.
An LLM has… let’s say two core components: a tokenizer, and a neural network. The neural network’s output is an array of activation levels for a series of neurons, each neuron representing one token. A confidence of 100% would mean a 100% activation of a single neuron/token, and 0% for all the rest. That is a highly unlikely scenario for a neural network, except when it got overfitted to a single pattern during training and is being fed the same pattern again. What is more usual is some value between 0% and 100% for each neuron, with a few neurons showing higher levels of activation, and the LLM… usually picks the highest, but maybe sometimes the second or a lower-ranked one.
The confidence can be calculated by comparing the level of the chosen token’s neuron to all the other output neurons. A naive measure could be level/sum(levels). A somewhat more advanced one could be level²/sum(levels²).
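Those two formulas are simple enough to try out. This is a minimal sketch that assumes the activation levels are non-negative numbers; the values in `levels` are invented for the example. In practice the raw scores are usually passed through a softmax first, so the levels already sum to 1 and the naive version is just the chosen token’s probability.

```python
def naive_confidence(levels, chosen):
    """level / sum(levels) for the chosen token."""
    total = sum(levels)
    if total == 0:
        raise ValueError("all activation levels are zero")
    return levels[chosen] / total

def squared_confidence(levels, chosen):
    """level² / sum(levels²): gives extra weight to the strongest tokens."""
    total = sum(x * x for x in levels)
    if total == 0:
        raise ValueError("all activation levels are zero")
    return levels[chosen] ** 2 / total

# Invented activation levels for three candidate tokens.
levels = [0.6, 0.3, 0.1]
print(round(naive_confidence(levels, 0), 3))
print(round(squared_confidence(levels, 0), 3))
```

For the top token the squared version comes out higher than the naive one, because squaring shrinks the weak candidates more than the strong one.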
Hallucinations are theoretically possible at a high confidence, but usually happen at lower confidence levels where there are many tokens with a similar confidence.
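One way to put a number on “many tokens with a similar confidence” is the entropy of the output distribution. That is a swapped-in measure of mine, not something from the comments above, and the two distributions below are made up. A flat distribution has high entropy, a peaked one has low entropy.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; higher means the candidates are closer together."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative distributions over four candidate tokens.
peaked = [0.90, 0.05, 0.03, 0.02]   # one clear favorite
flat = [0.25, 0.25, 0.25, 0.25]     # four equally plausible tokens

print(round(entropy_bits(peaked), 3))
print(round(entropy_bits(flat), 3))
```

A stretch of text where the entropy stays high token after token would be a natural candidate for the low-confidence sections mentioned earlier.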
It seems like it’s hard enough already to get LLMs to distinguish between “killing processes” and murder
It doesn’t look like anything to me… I mean, that could be either part of the guardrails, or a lack of context. A “killing process” is murder, outside the programming/sysadmin context. Current LLMs are still not great at handling different semantic contexts for the same token, and particularly bad at mixing different contexts throughout a single text.
My personal “Turing” test for an LLM is being able to write a sentence that could be interpreted in 3 or more ways. For a human, 2 meanings is a somewhat easy task, a double entendre. Starting at 3 and 4, it becomes a feat. Most LLMs are still at 1, and sometimes struggling.
For example, Gemini says:
Q: Please write a sentence that could be interpreted in two or more ways.
A: I saw her duck into the library.
Q: Please write a sentence that could be interpreted in three or more ways.
A: I’m a text-based AI and can’t assist with that.
It can do paragraphs, though:
Q: Please write a paragraph that could be understood in five or more ways and explain those ways.
A: The old man by the sea watched the horizon, his weathered face a map of untold stories. The sky, a canvas of hues, mirrored the depths of his soul, or perhaps the ocean’s. A gull cried, a mournful symphony or a hopeful call, depending on the listener’s ear. Time, an elusive phantom, danced with eternity in every passing moment, or perhaps stood still, a silent observer of existence.
Interpretations:
Literal: A description of an old man on the beach, observing the sea and sky.
Metaphorical: The old man represents humanity, the sea the subconscious, and the sky the spiritual realm.
Philosophical: A contemplation of time, existence, and the nature of reality.
Psychological: The paragraph delves into the complexities of human perception and interpretation.
Poetic: A piece of evocative prose, inviting multiple readings and interpretations.
…which is pretty neat, but paragraphs have “more degrees of flexibility”; it’s way harder to do in a single sentence.
Check out this one for a general overview:
https://youtu.be/OFS90-FX6pg
I thought confidence levels were for image recognition? How do confidence levels work for transformer LLMs?