Transformers Are Bayesian Networks

(arxiv.org)

37 points | by Anon84 4 days ago

4 comments

  • warypet 5 hours ago

    I found this earlier today when looking for research and ended up reporting it for citing fake sources. Please correct me if I'm wrong, but I couldn't find "[9] Jongsuk Jung, Jaekyeom Kim, and Hyunwoo J. Choi. Rethinking attention as belief propagation. In International Conference on Machine Learning (ICML), 2022." anywhere else on the internet

  • getnormality 5 hours ago

    > Transformers are the dominant architecture in AI, yet why they work remains poorly understood. This paper offers a precise answer: a transformer is a Bayesian network.

    Why would being a Bayesian network explain why transformers work? Bayesian networks existed long before transformers and never achieved their performance.

    • Mithriil 4 hours ago

      Bayesian network is a really general concept. It applies to any multidimensional probability distribution. It's a graph that encodes independence between variables. Ish.

      I have not taken the time to review the paper, but if the claim stands, it means we might have another tool in our toolbox to better understand transformers.
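A minimal sketch of what "a graph that encodes independence" means. The network below is a made-up three-variable chain A → B → C (all numbers are hypothetical), whose graph structure says C is independent of A given B, so the joint distribution factorizes as P(A, B, C) = P(A) · P(B | A) · P(C | B):

```python
import itertools

# Toy Bayesian network over three binary variables: A -> B -> C.
# The graph encodes the independence "C is independent of A given B",
# so the joint factorizes as P(A, B, C) = P(A) * P(B | A) * P(C | B).
# All probabilities are made-up numbers for illustration.
p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p_b_given_a[a][b]
p_c_given_b = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}  # p_c_given_b[b][c]

def joint(a, b, c):
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# A valid factorization must sum to 1 over all eight assignments.
total = sum(joint(a, b, c) for a, b, c in itertools.product([0, 1], repeat=3))
print(round(total, 10))  # 1.0
```

The point of the factorization is economy: the graph lets you specify the full eight-entry joint with only five free parameters instead of seven.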

    • malcolmgreaves 3 days ago

      > Hallucination is not a bug that scaling can fix. It is the structural consequence of operating without concepts.

      NNs are as close to continuous as we can get with discrete computing. They’re flexible and adaptable and can contain many “concepts.” But their chief strength is also their chief weakness: these “concepts” are implicit. I wonder if we can get a hybrid architecture that has the flexibility of NNs while retaining discrete concepts like a knowledge base does.

      • AdieuToLogic 4 hours ago

        > NNs are as close to continuous as we can get with discrete computing.

        This is incorrect. For example, fuzzy logic[0] can model analog ("continuous") truth beyond discrete digital representations, such as 1/0, true/false, etc.

        0 - https://en.wikipedia.org/wiki/Fuzzy_logic

        • measurablefunc 5 hours ago

          There is nothing continuous on the computer, it's all bit strings & boolean arithmetic. The semantics imposed on the bit strings does not exist anywhere in the arithmetic operations, i.e. there is no arithmetic operation corresponding to something as simple as the color red.

          • kelseyfrog 5 hours ago

            It sounds like you're saying that if a computer had infinite precision then hallucinations would not occur?

            • measurablefunc 5 hours ago

              The way neural networks work is that the base neural network is embedded in a sampling loop, i.e. a query is fed into the network & the driver samples output tokens to append to the query so that it can be re-fed back into the network (q → nn → [a, b, c, ...] → q + sample([a, b, c, ...])). There is no way to avoid hallucinations b/c hallucinations are how the entire network works at the implementation level. The precision makes no difference b/c the arithmetic operations are semantically void & only become meaningful after they are interpreted by someone who knows to associate 1 w/ red, 2 w/ blue, 3 w/ clouds, & so on & so forth. The mapping between the numbers & concepts does not exist in the arithmetic.
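The sampling loop described above (q → nn → [a, b, c, ...] → q + sample(...)) can be sketched in a few lines. Everything here is a stand-in: `nn` is a dummy that returns a fixed next-token distribution where a real transformer's forward pass would go, and the vocabulary is invented:

```python
import random

random.seed(0)

# Hypothetical vocabulary; "<eos>" is a made-up end-of-sequence token.
VOCAB = ["a", "b", "c", "<eos>"]

def nn(q):
    # Dummy stand-in for the base network: a real transformer would map the
    # query q to a probability distribution over next tokens. This one just
    # returns fixed probabilities regardless of q.
    return [0.5, 0.3, 0.1, 0.1]

def generate(q, max_steps=10):
    for _ in range(max_steps):
        probs = nn(q)                                  # q -> nn -> [a, b, c, ...]
        tok = random.choices(VOCAB, weights=probs)[0]  # sample([a, b, c, ...])
        if tok == "<eos>":
            break
        q = q + [tok]                                  # q + sample(...), re-fed
    return q

print(generate(["start"]))
```

Note that the loop is the same whether the model's output distribution is "right" or "wrong" about the world; the driver just samples and appends, which is the sense in which sampling is the mechanism rather than a failure mode.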

              • kelseyfrog 5 hours ago

                Oh, I thought that the embedding space of the residual stream was precisely that.

                • measurablefunc 5 hours ago

                  The arithmetic is meaningless, it doesn't matter what you call it b/c on the computer it's all bit strings & boolean arithmetic. You can call some sequence of operations residual & others embeddings but that is all imposed top-down. There is nothing in the arithmetic that indicates it is somehow special & corresponds to embeddings or residuals.

                  • kelseyfrog 5 hours ago

                    Ah ok, so if we had such a mapping then models wouldn't hallucinate?

                    • measurablefunc 5 hours ago

                      Maybe it's better if you define the terms, b/c what I mean by hallucination is that the arithmetic operations + sampling mean that it's all hallucinations. The output is a trajectory of a probabilistic computation over some set of symbols (0s & 1s). Those symbols are meaningless; the only reason they have meaning is b/c everyone has agreed that the number 97 is the ASCII code for "a" & every conformant text processor w/ a conformant video adapter will convert 97 (0b1100001) into the display pattern for the letter "a".
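The point about interpretation being imposed from outside the arithmetic can be made concrete: the same bit string denotes different things depending only on which convention the reader applies to it.

```python
import struct

# 0b1100001 is "97" under one convention and the letter "a" under another
# (ASCII); nothing in the bits themselves picks one reading over the other.
bits = 0b1100001
print(bits)        # 97
print(chr(bits))   # a

# Same four bytes, read once as an unsigned integer and once as an
# IEEE 754 float (both little-endian):
raw = b"\x00\x00\xc8\x42"
(as_int,) = struct.unpack("<I", raw)
(as_float,) = struct.unpack("<f", raw)
print(as_int, as_float)  # 1120403456 100.0
```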

                      • kelseyfrog 4 hours ago

                        So kind of like if you flip a coin, the sampling means the heads or tails you get isn't real?

            • naasking 3 hours ago

              > The semantics imposed on the bit strings does not exist anywhere in the arithmetic operations,

              Correct, the semantics is actually in the network of relations between the nodes. That has been one of the major lessons of LLMs, and it validates the "systems reply" to Searle's Chinese Room.

          • westurner 4 days ago

            https://news.ycombinator.com/item?id=45256179 :

            > Which statistical models disclaim that their output is insignificant if used with non-independent features? Naive Bayes [...]

            Ironic then, because if transformers are Bayesian networks then we're using Bayesian networks for non-independent features.

            From "Quantum Bayes' rule and Petz transpose map from the minimum change principle" (2025) https://news.ycombinator.com/item?id=45074143 :

            > Petz recovery map: https://en.wikipedia.org/wiki/Petz_recovery_map :

            > In quantum information theory, a mix of quantum mechanics and information theory, the Petz recovery map can be thought of as a quantum analog of Bayes' theorem

            But there aren't yet enough qubits for quantum LLMs: https://news.ycombinator.com/item?id=47203219#47250262

            "Transformer is a holographic associative memory" (2025) https://news.ycombinator.com/item?id=43028710#43029899