Like any other self-proclaimed "AI engineer," researcher, or simply someone interested in the exponential growth of new technology, I hold the words of Yann LeCun in higher regard than perhaps anyone else's in the field. After being introduced to the world of scaling laws and the current problems of scaling available models, it becomes apparent that LeCun, along with many other researchers, advocates scaling generative technology through a multi-modal approach (combining images, text, and audio) that defines its heuristics on energy and carries a "world model," an understanding of the world naturally embedded within it, from which reinforcement learning could be bootstrapped and ideas propagated. While I can offer no new ideas, or even better explanations of older ones, this did get me wondering whether LLMs, or current AI technology more broadly, could understand the true semantic relationships and the cornucopia of meaning in modernist literature such as the work of James Joyce (with examples from Ulysses and Finnegans Wake) or the poetry of E. E. Cummings.
The current generation of LLMs is founded on autoregressive transformers, which take a fundamentally probabilistic approach: the next token is predicted from the tokens preceding it, by modelling how likely each candidate token is to come next. These models learn a probability distribution over discrete tokens (sub-word units serving as input and output), built on self-attention mechanisms that attempt to capture the relationships between them. Each token is mapped to a high-dimensional vector, so that words with similar meanings end up close together in that space. As tokens flow through the transformer layers, their vectors are updated based on context, and the space ends up encoding geometric relations between words. A famous example of this being
"king - man + woman = queen"
in which directions in the space come to represent concepts such as gender, tense, or sentiment. Even modern multimodal models like Gemini or GPT-4V project images into the same token space: vision is tokenized and processed much like text.
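To make the autoregressive picture concrete, here is a minimal sketch using GPT-2 through the Hugging Face transformers library (the model and the prompt, the opening words of Ulysses, are my choices purely for illustration): given the preceding tokens, the model yields a probability distribution over every candidate next token.

```python
# Minimal sketch of autoregressive next-token prediction (GPT-2 chosen purely
# for illustration; any causal language model behaves the same way).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The opening words of Ulysses as context.
ids = tok("Stately, plump Buck Mulligan came from the", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits                 # one score per vocabulary entry, per position
probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the *next* token

values, indices = torch.topk(probs, 5)
for p, i in zip(values, indices):
    print(f"{tok.decode([int(i)])!r}: {p.item():.3f}")
```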
In LeCun's vision, the Joint Embedding Predictive Architecture (JEPA), each modality gets its own encoder, but they all map into a shared latent space where semantically similar concepts sit close together regardless of modality. Through an energy-based scoring system, the model learns compatibilities such as
image of dog + "dog" + audio of barking = low energy
image of dog + audio of piano = high energy
Predicting latent representations instead of pixels or tokens makes the entire process more efficient and captures uncertainty better. The advantages of this kind of model are obvious: humans reason from far less data than LLMs, but that data is richer and grounded in a world model, and by rejecting a pre-established discrete token space the architecture can represent far more nuanced states of understanding. While models such as CLIP or Gemini are getting better at cross-modal reasoning, they are far from the kind of "AGI" or "ASI" (whichever camp you may fall into) that LeCun is advocating for in earnest.
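A toy sketch of that energy idea follows, my own simplification rather than anything from LeCun's actual JEPA work: each modality gets its own encoder into a shared latent space, and "energy" is simply the distance between latents, which training would push low for compatible pairs and high for mismatched ones. The encoders here are untrained placeholders.

```python
# Toy sketch of an energy-based compatibility score across modalities.
# The encoders are untrained placeholders; a real system learns them jointly
# so that matching pairs (dog image + barking) land close together.
import torch
import torch.nn as nn

latent_dim = 64
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, latent_dim))
audio_encoder = nn.Sequential(nn.Flatten(), nn.Linear(1024, latent_dim))
text_encoder = nn.Embedding(10_000, latent_dim)   # one latent per token id

def energy(z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
    # Squared Euclidean distance in the shared latent space.
    return ((z_a - z_b) ** 2).sum(dim=-1)

dog_image = torch.randn(1, 32, 32)    # stand-ins for real inputs
barking = torch.randn(1, 1024)
piano = torch.randn(1, 1024)
dog_token = torch.tensor([42])        # hypothetical token id for "dog"

z_img = image_encoder(dog_image)
print(energy(z_img, audio_encoder(barking)))   # after training: should be low
print(energy(z_img, audio_encoder(piano)))     # after training: should be high
print(energy(z_img, text_encoder(dog_token)))  # after training: should be low
```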
Imagine training an AI model where every blue image is labelled "green." The model would learn that "green" correlates with blue visual features and, given enough examples, would reliably produce blue images whenever asked for green ones. Pattern recognition is achieved, but the objective "truth" is not. Thus, if meaning is just a learned association between word and world, the precarious question becomes one of assumption: what if the association between word and world is arbitrary?
Ferdinand de Saussure writes about this in the Course in General Linguistics: signs are arbitrary. The word "arbor" has no inherent, natural connection to the concept of a tree. Meaning emerges not from the sign itself, but from its unique place in a system of differences: "dog" means what it does because it is not "cat" or "wolf." But Saussure's insight about arbitrariness was incomplete. He focused on the arbitrary relationship between signifier and signified, and overlooked what later theorists like Julia Kristeva would call the "materiality of the signifier," the idea that meaning also emerges from the physical substance of language: its sounds, its visual form, its rhythm. This is where Joyce becomes the perfect test case. When word2vec learns to embed "dog" as a vector of weights in a high-dimensional space, the specific vector is just as arbitrary as the word itself. What matters is the relationship between that vector and the other vectors. The model learns from differential relationships in the data, just like human language. The natural question then becomes: if meaning is purely relational and arbitrary, how does it connect to the "world," or to an accurate representation of a true "world model," at all?
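A small numerical illustration of that arbitrariness (toy random vectors, not word2vec itself): rotate every embedding by the same random orthogonal matrix and each individual vector changes completely, yet every pairwise relationship, measured here by cosine similarity, is untouched. Only the differential structure carries information.

```python
# Toy demonstration: embeddings are arbitrary up to a global rotation; only the
# relationships between vectors (cosine similarities) carry information.
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 50))                 # a fake "vocabulary" of 1000 vectors
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))    # random orthogonal rotation
rotated = emb @ Q                                 # every individual vector is now different

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(emb[0], emb[1]))          # some similarity value
print(cosine(rotated[0], rotated[1]))  # identical, up to floating-point error
```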
The work of James Joyce has always struck me as an interesting challenge for AI to truly understand. In the "Sirens" episode of Ulysses, Joyce structures prose like a fugue, with repeating motifs, counterpoint, and rhythm.
'Bronze by gold heard the hoofirons, steelyringing. Imperthnthn thnthnthn.'
A language model trained on token sequences would see this as words with unusual spellings, or perhaps even connect it to Joyce's style, but what Joyce is truly asking is that we hear the music and experience the text as sound-texture. The onomatopoeia forces our mouths and minds to stutter like the words themselves. Consider the hundred-letter word describing thunder on the first page of Finnegans Wake,
"bababadalgharaghtakamminarronnkonnbronntonnerronntuonnthunntrovarrhounawnskawntoohoohoordenerthurnuk"
Joyce is not just representing sound through spelling; he is also creating a visual object on the page that forces the eye to stumble, pause, and attempt pronunciation. The sheer length of the word enacts the duration of thunder rolling across the sky. A language model that tokenizes this loses both the acoustic and the visual performance. It sees discrete chunks, 'bab', 'adal', 'gharagh', but misses the overwhelming totality that makes it thunder. Even his wordplay resists tokenization: "cropse" in Finnegans Wake is a pun on corpse and harvest, a word that exists only in the sound-space between meanings, and its humour and depth come from that phonetic overlap. Since current models process only the semantic content while ignoring the aesthetic form, the "Joycean" aspect of it disappears entirely.
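To see what those discrete chunks look like in practice, here is a quick sketch using the GPT-2 tokenizer from the Hugging Face transformers library; the exact pieces depend entirely on the tokenizer's vocabulary, so treat the output as illustrative rather than canonical.

```python
# What a subword tokenizer makes of Joyce's hundred-letter thunderword.
from transformers import AutoTokenizer

thunderword = (
    "bababadalgharaghtakamminarronnkonnbronntonnerronntuonn"
    "thunntrovarrhounawnskawntoohoohoordenenthurnuk"
)
tok = AutoTokenizer.from_pretrained("gpt2")
pieces = tok.tokenize(thunderword)
print(len(pieces))  # dozens of fragments
print(pieces)       # arbitrary subword chunks; no rolling thunder left
```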
In 1957 the linguist J. R. Firth summed up the distributional hypothesis: "you shall know a word by the company it keeps." Recent developments in AI bear this out: language models have been shown to capture semantic relationships at a much deeper level than we previously thought possible. Word embeddings genuinely capture semantic relationships through pure co-occurrence statistics. "king - man + woman = queen" is not just a shallow understanding of words through differences but a revelation that geometric relationships in vector spaces can encode conceptual relationships like gender and royalty. Models learn that "Paris" relates to "France" in the same way that "Tokyo" relates to "Japan," even though they have never been explicitly taught about capitals or countries (a relationship that is easy to verify directly, as sketched below). Transformer architectures can also learn hierarchical syntactic structure without being programmed with grammar rules: attention patterns in models like GPT naturally discover subject-verb agreement, relative clause boundaries, and long-distance dependencies. In some ways, this validates the idea that meaning is relational and that large quantities of semantic knowledge can be extracted from the patterns in how words appear together. A model that has never seen, heard, or experienced anything can still engage with abstract concepts and analogies through context alone. However, to engage with language on every front, models must improve in some key ways.
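That Paris/Tokyo relationship can be checked with pretrained GloVe vectors loaded through gensim's downloader; the neighbours returned depend on the vectors chosen, though "japan" and "queen" usually appear near the top.

```python
# Analogy arithmetic over pretrained GloVe vectors (downloaded on first use).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # roughly 130 MB download the first time

# paris : france :: tokyo : ?   (GloVe's vocabulary is lowercased)
print(vectors.most_similar(positive=["france", "tokyo"], negative=["paris"], topn=3))

# king - man + woman = ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```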
A) The Acoustic Dimension
When Tennyson writes of "the murmur of innumerable bees," the line sounds like bees through its combination of "m" and "u" sounds. Cognitive linguists call this "sound symbolism," most famously demonstrated by the "bouba/kiki" effect, which shows that humans consistently associate certain sounds with shapes and qualities: high-frequency sounds feel sharp and small, while low-frequency sounds feel round and large. Joyce exploits this constantly: in writing "rsssssss sssss rsss," meaning emerges from the hissing in our own mouths. While a language model can tokenize these strings, or even tag them as a <cat hiss>, without acoustic texture the effect is lost. Modern language models have no phonological awareness by default. They do not know how "phone" or "mean" sounds, let alone whether two words rhyme. Poetic meter, rhyme scheme, and alliteration are invisible to them unless annotated externally or leaked in through co-occurrence patterns.
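One rough way to see the gap is to compare what a subword tokenizer gets with a phonemic view of the same words, here via the pronouncing package, which wraps the CMU Pronouncing Dictionary; this is my choice of tooling and only a crude proxy for genuine phonological awareness.

```python
# Orthographic fragments vs. ARPAbet phonemes for Tennyson's murmuring bees.
from transformers import AutoTokenizer
import pronouncing  # thin wrapper around the CMU Pronouncing Dictionary

tok = AutoTokenizer.from_pretrained("gpt2")
for word in ["murmur", "of", "innumerable", "bees"]:
    subwords = tok.tokenize(word)               # what the language model gets
    phones = pronouncing.phones_for_word(word)  # what the ear gets, if the word is listed
    print(f"{word:12s} tokens={subwords} phones={phones[:1]}")
```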
B) The Visual Dimension
Spatial reasoning is a massive weakness of current models. When discussing the work of poets like E. E. Cummings, models may entirely miss the point of poems like "l(a" (a leaf falls on loneliness), where the vertical descent of the letters enacts both the falling and the loneliness. When a tokenizer processes this, it sees something like: ['l', '(', 'a', '\n', 'le', '\n', 'af', '\n', 'fa', '\n', 'll', '\n', 's', ')', '\n', 'one', '\n', 'l', '\n', 'iness']. The visual meaning evaporates entirely. To truly 'read' Cummings, a model would need to perceive the page as an image, understand negative space, and recognize that typography and layout are semantic choices (a crude sketch of that alternative follows below). Standard language models trained on pure token sequences are typographically blind. They cannot see that a poem's stanzas are symmetrical, that a paragraph is unusually short, or that text is centered rather than justified.
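A crude sketch of that alternative: render the poem's typography as an image so that a vision encoder could, at least in principle, see the layout and the negative space. The layout string below is my approximation of the poem's line breaks, and the output is just a bitmap, not a working multimodal pipeline.

```python
# Render E. E. Cummings's "l(a" as an image: the vertical fall is visual, not lexical.
from PIL import Image, ImageDraw

poem = "l(a\n\nle\naf\n\nfa\n\nll\n\ns)\none\nl\n\niness"   # approximate layout
page = Image.new("L", (200, 420), color=255)                # a blank white "page"
ImageDraw.Draw(page).multiline_text((90, 20), poem, fill=0)
page.save("la_poem.png")  # this bitmap, not a token list, carries the descent
```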
C) Language as a Physical Experience
Consider the entire last chapter of Ulysses, Molly Bloom's fragmented, breathless stream of consciousness. It is a piece of writing grounded in proprioception, a sixth sense that simultaneously evades us and, according to Hélène Cixous, marks the text with something of a feminine nature. The repetition, the lack of punctuation, and the rhythm create feeling through form. Models that process it as tokens miss the temporal experience of reading. Humans do not just process language; we simulate it. Reading about kicking activates motor cortex regions involved in actual kicking. Since language models have no bodies, they process sequences but do not experience the phenomenology of sequential unfolding.
LeCun's Joint Embedding Predictive Architecture represents a philosophical shift in how we think about form in language; it also raises the question of whether richer data is enough, or whether we need fundamentally different learning processes altogether. Current multimodal models still learn only through passive observation, processing millions of image-text pairs, scanned documents, audio alignments, and so on. If we bring the human experience into this paradigm, we see that humans learn language through its more material dimensions: children learn rhythm through nursery rhymes synced with physical motion, we learn sound symbolism through the movement of our own mouths, and wordplay emerges from actual play with other speakers. A sense of proprioception develops naturally within us. Researchers far smarter than I am are exploring this through robotics, through RLHF-style mechanisms that treat language as social interaction, and through embodied AI that learns by trial and error in simulated or real environments. What is becoming clear is that scaling the capabilities of LLMs depends heavily on creating frameworks in which embodied and intentional social practices can be captured.
The point of this piece is not to denigrate work in this field or to demean the current capabilities of large language models. Like many others, I wish to see the better world that might come about as these models improve, and I am sure researchers will overcome these limitations in time. But for generative artificial intelligence to scale toward a more human mode, the technology must understand language the way humans do: not in an abstract semantic space, but as a function of our bodily processes. Continuous vector spaces, joint embeddings across modalities, and models that learn through embodied interaction and social feedback might approach something closer to human understanding, not because they will have subjective experience or consciousness, but because they will have access to more of the channels through which meaning flows and more of the levels at which it operates.
While it is impossible to make any claim here with certainty, I firmly believe that asking questions like this one can push the boundaries of what current technology can grasp in the first place. Meaning is rich, and human intelligence allows us to form theses that embody more than computational theories can yet suggest. Sometimes the most important questions are the ones we cannot answer definitively, but which prompt us toward asking better questions in the first place.