Thoughts about GPT-5 and present AI development


Yesterday, I was asked by a friend why I had reacted so negatively to GPT-5 not only in my last post in this blog, but also in chat groups – just because of “one problematic experience” (see here). Should I, as an ex-physicist and ex-IT-consultant, not be fascinated by the current developments regarding AI and the respective research?

Well, well, a good reason to share some thoughts … Be warned that they come from somebody who has not actively worked on a deeper analysis or on improvements of the internal structures of LLMs. So, my practical competence is limited to theoretical aspects and to the experience of an active user – mainly of small LLMs or of the freely available versions of e.g. GPT, Aria or Perplexity. However, I am a user with some deeper experience in other forms of Machine Learning.

A misunderstanding … I am actually fascinated – by “pattern recognition and reproduction”

Regarding fascination with LLMs: Yes, I am fascinated. I have followed various developments since the breakthrough of the transformer technology. I have worked with the resulting LLMs, e.g. ChatGPT, Claude, Aria, Perplexity etc., in various versions since 2022 – and with variants of image generators based on Stable Diffusion.

I am fascinated as a user – but as a critical user who does not take the output of an algorithm trained to reproduce language and human communication patterns too seriously. At least not without a proper verification of AI-provided information, in particular when we talk about complex problems. I have had my regular disappointments with LLMs as soon as the “conversations” went beyond pure information gathering.

Fascination must, in my opinion, always be accompanied by a sober intellectual distance and a critical view of the applications, the investments and the profit interests of companies bringing new technologies to market. A critical distance to the advertising claims of companies trying to make money with so-called Artificial “Intelligence” applications is, I think, obligatory. Why?

The main reason is that their products are nothing other than special versions of deterministic Machine Learning algorithms trained on pattern detection in human language and in other human “output”.

Basically, I do not see why I should expect any original, intelligent output from pattern-reproducing deterministic algorithms – whatever the claims of the producers are. A sign of intelligence is, in my opinion, when a new idea appears in a reasoning mind (of whatever nature) that allows us to order a whole array of connected facts and theories in a new and fruitful way. What we see coming from AI models instead is either a reproduction of published knowledge or elementary, step-like patterns for solving relatively simple problems – including a lot of mistakes, hallucinations or plainly wrong information, though all in grammatically correct bulks of text or speech. The promise of some form of intelligence has not been met by the presently published LLMs.

We talk of algorithmic tools – not intelligence

Regarding LLMs, we talk about algorithmically working tools – whose “knowledge” basically covers only one aspect of thinking: the association of terms and words (forming pattern contexts), based on statistics and on repeatedly appearing patterns in the material the algorithms were trained with. With the new LLMs, or better LRMs (with R for “reasoning”), we in addition got a very limited form of logical (?) consistency checking of a possible line of argumentation or of a sequence of steps to solve a well-defined problem.
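To make the phrase “association of terms based on statistics” a bit more tangible, here is a deliberately primitive toy sketch of my own (a bigram model with made-up example text – not the transformer mechanism itself): it merely counts which word follows which in a tiny corpus and then “generates” text by sampling from those counts.

```python
# Toy illustration of pattern reproduction from statistics: a bigram model.
# It counts word successions in a tiny corpus and generates text by sampling
# from the observed frequencies. Vastly simpler than a transformer, but the
# principle - reproduce what is statistically likely - is the same.
import random
from collections import defaultdict, Counter

corpus = "the cat sat on the mat while the dog sat on the rug".split()

successors = defaultdict(Counter)            # word -> frequencies of following words
for w1, w2 in zip(corpus, corpus[1:]):
    successors[w1][w2] += 1

def generate(start: str, n_words: int = 8) -> str:
    word, out = start, [start]
    for _ in range(n_words):
        counts = successors.get(word)
        if not counts:
            break
        candidates, freqs = zip(*counts.items())
        word = random.choices(candidates, weights=freqs)[0]
        out.append(word)
    return " ".join(out)

print(generate("the"))   # locally plausible word sequences - no understanding involved
```

The output sounds locally plausible precisely because it mirrors the statistics of its input – which is the whole point.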

But some limited consistency checks are no guarantee for the completeness, correctness or truth of the results in the sense of verifiable/falsifiable statements. Not when pattern reproduction is to a large part based on information exchange environments in which claims do not fit historical events, but much too often represent framed stories issued by interest groups.

Also keep in mind that checking the compatibility and consistency of statements with some structured framework of organized and logically connected thoughts and/or the respective scientific experiments, research and theories is often a very complicated matter in which different processes of review, criticism and verification are involved. Presently, this is in principle beyond the scope of a single LLM and, in practical terms, beyond the limited computation time you get as a standard user.

All of the processes in neural networks are instead guided by rules once imposed during the algorithm’s training, i.e. during a process of approximating a (hopefully optimal) solution to an optimization problem – in the case of LLMs an optimization problem posed on and based on statistical relations found between elements of human communication, human texts and other information, or sometimes guided by a limited set of well-defined, fixed rules.
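To make the term “optimization problem” concrete, here is a minimal numpy sketch (toy dimensions and made-up variable names of my own, not code of any real LLM): a single gradient-descent step that raises the predicted probability of an observed next token by minimizing the cross-entropy loss.

```python
# Minimal sketch of the optimization behind LLM training: adjust weights so that
# the predicted probability of the observed next token rises, i.e. minimize the
# cross-entropy loss. Toy dimensions; real models differ in scale and architecture.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 50, 16
W = rng.normal(scale=0.1, size=(dim, vocab_size))   # trainable projection onto the vocabulary

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sgd_step(hidden, target_id, lr=0.5):
    """One gradient-descent step on the loss -log p(target | hidden)."""
    global W
    probs = softmax(hidden @ W)
    loss = -np.log(probs[target_id])
    grad_logits = probs.copy()
    grad_logits[target_id] -= 1.0                   # d(loss)/d(logits) = probs - onehot
    W -= lr * np.outer(hidden, grad_logits)         # chain rule: d(loss)/dW
    return loss

context = rng.normal(size=dim)                      # stand-in for an encoded context
for _ in range(5):
    print(sgd_step(context, target_id=7))           # the loss shrinks step by step
```

Nothing in such a step refers to truth or meaning – only to the statistics of what followed what in the training material.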

Never forget: it is the recognition and optimized reproduction of patterns that the unsupervised training of LLMs covers. Not more, not less. Practically all of the emergent, unexpected features of LLMs appearing during the upscaling of the networks are still rooted in the recognition and use of certain patterns present in the training material.

Additional reinforcement learning after the initial training may try to keep the resulting probabilistic babbling of LLMs (hopefully) within limits acceptable to humans. The new “deep thinking” loops, in turn, try to keep the babbling within acceptable limits regarding basic logic and reasoning. But even these limits are only extracted, with a certain probability, from the contents and associated patterns of a posed question and from some basic rules of logic.
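As a caricature of that post-training steering, consider a simple best-of-n filter. The sketch below is emphatically not OpenAI’s actual alignment pipeline (which I do not know in detail); it only illustrates the principle of ranking sampled candidates with a reward signal – here a hard-coded toy function standing in for a learned reward model – and keeping the “most acceptable” one.

```python
# Toy caricature of steering probabilistic output with a preference signal:
# sample several candidates, score them with a reward function (hard-coded here,
# learned from human feedback in reality), keep the best-scoring one.
from typing import Callable, List

def best_of_n(candidates: List[str], reward: Callable[[str], float]) -> str:
    """Return the candidate the reward function scores highest."""
    return max(candidates, key=reward)

def toy_reward(text: str) -> float:
    # crude stand-in for a learned reward model
    return -float(text.lower().count("nonsense"))

samples = ["pure nonsense, but fluent", "a cautious answer with named sources"]
print(best_of_n(samples, toy_reward))   # -> "a cautious answer with named sources"
```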

Tools that may reproduce prejudices …

However, and as a starting point for criticism:
Take into account that very pronounced and strikingly similar language patterns are created these days by a lot of manipulative or uneducated people in a lot of texts and published statements. Any prejudice published often enough and repeated through dubious channels of information exchange makes a striking, easily detectable and reproducible pattern for an LLM. In particular in the un-social media.

Now, most of the texts on the Internet used for the training of LLMs are propaganda these days – much of the respective content is falsifiable and often enough strikingly stupid speculation. But propaganda and repeated idiotic statements do make strikingly clear language patterns. Pattern recognition and reproduction, therefore, is absolutely no criterion or guarantee for the production of intelligent, original, verifiable/falsifiable and helpful statements by an LLM.

LLMs, in a very natural way, confront mankind to a certain degree with the bullshit we humans produce day in and day out. Not only and not always bullshit, but often enough.

Another reason for skepticism, caution and carefulness when using LLMs is:

  • A fool with a tool is still a fool. And a fool with a talking, flattering, persuasive, but sometimes foolish tool can become an even bigger fool.

Especially when a tool sells a line of argumentation as apparently logical and reasonable through the use of elaborate language, and when it supplements its messages with flattering statements directed at the user. Always ask yourself:

  • Which of the messages the AI gave me contain verifiable or at least falsifiable information? Were the sources of the given information referenced and the suppositions of any presented conclusion named? Was the impact of unverified assumptions identified and analyzed?

I have seen the phenomenon of fools with newly polished tools many times in IT contexts during my professional life.

Advertising and polished surfaces may create disappointment if announcements and expectations are not fulfilled by the product

For a certain period of my life I have worked a lot in the field of quality assurance and on extending, adapting or improving the structure of established quality management systems. Therefore, I do not like it at all when I get confronted with a lot of advertising ahead of a new product version that then shows no major improvement in quality – in particular not when a substantial quality improvement was the main selling point.

ChatGPT-4 (at least the free version) has disappointed me very often with blatantly wrong answers, though the answers were often nicely formulated. Now I have had my first encounter with GPT-5 – and again, despite improvements and the invocation of “deep thinking and reasoning”:

  • Wrong and misleading answers regarding a complex, but well-defined problem in the context of ResNet V2 units.
    And fluffy, superficial answers regarding topics outside the fields of math, physics and technology.

What I liked, though: the harsher, more concise presentation of answers without any flattery. It helps so much to keep the absolutely necessary distance as a user – and it stimulates the motivation to perform an equally necessary critical evaluation of what GPT presents to you.

Sorry, OpenAI, but freshly polished shoes with a hole in the sole are still unusable in hard, rainy weather. Even if the advertising tells you that these polished shoes are the best in the world and suited for all purposes and weather conditions – and that there is no need to wait for more profound solutions or offers.

Sam Altman may feel that OpenAI has created some really new stuff and even achieved a breakthrough on the way to an AGI – but this unfortunately tells more about his own illusions than about OpenAI’s products. Or about the status of all present LLMs …

Anyway: The number of complaints from users on the Internet about errors of GPT-5 has been rising over the last days. Nobody speaks of a breakthrough or of a jump to a new level of AI capabilities anymore. As a first reaction, OpenAI reactivated the old GPT-4o versions.

Some banal points to think about regarding LLMs or LRMs

In my opinion, those who dream of LLMs/LRMs on the verge of real (?) thinking forget or ignore a variety of important points. Let us start with three of them:

  • Point 1: Garbage in – garbage out.
    This point is in the end true for all kinds of deterministic algorithms. LLMs are no magic exception. LLMs get trained, and during training they optimize themselves regarding the proper reproduction of statistical relations between terms, statements, words or even parts of words and other blocks of information. They detect and reproduce patterns in the information presented to them during training. Sometimes surprising patterns which really are valuable.
    But as I said above: Do not forget all of the bullshit in the presently available training material. Bullshit does not become a philosophy just because it creates statistically significant patterns in human communication. Repeated and partly manipulated nonsense creates significant patterns in the mass of human information.
    In addition: We all carry a lot of prejudices about the world around us. So why would we expect that an LLM optimized to reproduce human language and information patterns confronts us with something better than the prejudices we produce on a daily basis? In particular regarding the overwhelmingly useless nonsense spread through the Internet’s un-social media? The world of an LLM is a fiction based on statistics over information that was, to a large part, created on the basis of human prejudices.
  • Point 2: Statistical relations in our language are neither a guarantee for a correct presentation of facts nor for verifiable/falsifiable or true statements.
    The problem with human language is that we can build a lot of relations between things and objects in a sentence in such a way that the text sounds right and nice, but has a content that is neither verifiable nor falsifiable nor in any respect factually true. Especially when human feelings and emotions are involved.
    Statistical relations of words and contexts have their value if analyzed thoroughly. Or when it is completely clear to the reader that the purpose is to trigger emotions and speculation.
    However, just using and reproducing related patterns without the necessary reflection will always be accompanied by the danger of creating fascinating linguistic constructions (in correct grammar) which are utterly useless when deep analysis, a proper relation to verifiable facts or an integration into an extensive and comprehensive body of theory is required. Just listen to some politicians – and you know that I am right regarding this point.
  • Point 3: Logical reasoning can in a certain sense become dangerous, namely when it is improperly used.
    Ex contradictione sequitur quodlibet … From a logically (!) wrong supposition you can deduce any statement, logically (!). And from the combination of two contradictory statements, i.e. a logical contradiction, you can deduce any statement, but also its opposite. (A minimal derivation is written out directly after this list.)
    This relation holds in most logical systems. Note that the logical wrongness/contradiction – like in (A ∧ ¬ A) – is required. From a supposition containing contradictions you can build any kind of theory – even if the whole series of steps beyond the first one appears logical or even is logical. This limits the value of any “deep thinking and reasoning mode” in principle.
    The absence of logical contradictions in the suppositions of the “reasoning” must be checked – and in the case of complex matters this is very often a time- and resource-consuming process.
    Another point is the factual truth of a supposition. If a supposition A is true, then you may be able to derive a lot of valuable things logically. But if it is not true, your logic may be correct and you will still end up with a bunch of useless nonsense. The check of the factual truth of statements is also, most often, a time- and resource-consuming matter – and it sometimes needs more than just following more or less trustworthy publications and induced statistical relations.
    A related problem is that of the completeness of the facts in a supposition – if completeness is required for a reasonable answer to a problem.
    If an LLM thinks a rocket launcher is defined by its ability to fire off a rocket alone – then it may conclude unhelpful or even wrong things regarding the destruction of real targets.
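For those who like to see the principle of the first point written out: in classical logic an arbitrary statement B follows from a contradictory supposition (A ∧ ¬A), e.g. via disjunction introduction and the disjunctive syllogism.

```latex
% Ex contradictione sequitur quodlibet, spelled out for an arbitrary statement B:
\begin{align*}
&(1)\quad A \land \lnot A  && \text{contradictory supposition}\\
&(2)\quad A                && \text{from (1), conjunction elimination}\\
&(3)\quad A \lor B         && \text{from (2), disjunction introduction, B arbitrary}\\
&(4)\quad \lnot A          && \text{from (1), conjunction elimination}\\
&(5)\quad B                && \text{from (3) and (4), disjunctive syllogism}
\end{align*}
```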

So, deep thinking and reasoning may be good and necessary (but see the Apple study mentioned below). However, when we need advice and decisions in complex matters, even a chain of logical reasoning boils down to and requires an identification of falsifiable or verifiable facts (in contrast to speculations), a thorough process of verification and a thorough check for the absence of logical contradictions in a supposition. And no nonsense and nothing like the so-called “alternative facts” – even though some statistics in Internet media may have given weight to misleading and false claims during the training of an LLM.
The impact of speculative assumptions on an answer must be clearly analyzed and marked. And the completeness of the facts in a supposition, required for a comprehensive answer to a question, must be evaluated – which may also require a lot of resources.

Now, are LRMs built for this? Do they provide sufficient resources – e.g. in terms of the memory and computation time required for profound checks against logic and against established scientific frameworks? Not to my knowledge.

After complaints, OpenAI has just raised the number of accesses to GPT’s reasoning mode – for paying customers. Think about the implications for the standard version and its users yourself …

My first experiences with GPT-5

I have written a bit about my own first experiences with GPT-5 in my last post – and also indicated my frustration after I had analyzed a particular “advice” of GPT regarding a class of artificial neural networks more closely. How does point 3 in the list above relate to my “conversation” and my questions regarding the structure of an autoencoder based on a ResNet V2 architecture?
GPT-5 stated (maybe falsely) that no recipes for a ResNet V2 based decoder are available. So, GPT-5 told me that it had to analyze the problem itself via “deep thinking”.

The misleading information and suggestions which the reasoning mode of GPT-5 gave me as an answer were based on the assumption that a ResNet V2 block differs from one of a ResNet V1 architecture just by a pre-activation ahead of the convolutions. While the pre-activation criterion is correct, it is incomplete as a definition of V2 and disregards other aspects of and differences between a ResNet V2 block and a ResNet V1 block.
Therefore, the supposition of GPT-5 regarding the definition of ResNet V2 was wrong. As a result, GPT-5 built a decoder architecture which did not at all reflect a reasonable inversion of a ResNet V2 encoder. I am tempted to say that GPT-5 has not even analyzed the original publications about ResNet V1 and V2 in a reasonable way – and we can in no way say that it has ever “understood” the differences between the architectures. It just followed some statistics and the emphasis of words and expressions in a rather complex context.
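To make the point tangible, here is a minimal Keras sketch (my own illustrative code with made-up filter counts; projection shortcuts for changing dimensions are omitted) of what the original V2 publication by He et al., “Identity Mappings in Deep Residual Networks”, actually changed: not only the BN-ReLU pre-activation ahead of the convolutions, but also the clean identity path, i.e. no activation after the addition.

```python
# Minimal sketch contrasting a basic ResNet V1 residual block with a ResNet V2
# "full pre-activation" block. Filter counts are illustrative; projection
# shortcuts are omitted, so `filters` must equal the number of input channels.
import tensorflow as tf
from tensorflow.keras import layers

def resnet_v1_block(x, filters: int):
    """V1: Conv -> BN -> ReLU inside the branch, ReLU applied AFTER the addition."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])
    return layers.ReLU()(y)              # post-addition activation

def resnet_v2_block(x, filters: int):
    """V2: BN -> ReLU -> Conv ("full pre-activation"); identity path left untouched."""
    shortcut = x
    y = layers.BatchNormalization()(x)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.Add()([shortcut, y])   # NO activation after the addition

# quick structural check
inp = layers.Input(shape=(32, 32, 64))
model = tf.keras.Model(inp, resnet_v2_block(inp, filters=64))
model.summary()
```

A decoder that claims to mirror a V2 encoder has to respect exactly these structural points.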

The Apple study – and a supporting one

Those who have questioned the presence of “intelligence” or intelligent processes in present AI models before have always had a good advocate in Mr. LeCun – one of the godfathers of Machine Learning. As far as I remember, he once said in an interview that he would be happy if he found even a small sign of reasoning in LLMs – on the level of a cat. But, no, no sign of that … Which in my opinion makes an unsupervised usage of LLMs in a lot of complex fields and in decision making just dangerous.

Now, Apple has recently published a study which gives the doubts about any sign of intelligence a broader and more solid foundation. The results of Apple’s experiments set a huge question mark behind all speculations about some intelligence in today’s LLMs and LRMs. But read the publication yourself:

  • Apple Machine Learning Research, 2025, “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity”

Another paper in the same direction, but based on other thorough experiments, is the following:

  • C. Zhao, Z. Tan, P. Ma, D. Li, B. Jiang, Y. Wang, Y. Yang and H. Liu, 2025, “Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens”, Data Mining and Machine Learning Lab,
    https://arxiv.org/abs/2508.01191v3

The clear message coming from this paper is – in my opinion: Do not trust any output or reasoning of an LLM. Verify it before you use it. Or just take it as an idea which must be analyzed carefully and compared to scientific results and publications on the topic.

Other aspects

I do not want to end this post without three additional thoughts:

(1) It is, to my knowledge, not clear how the human brain codes information. “Weights” like in artificial neural networks may play a certain role. However, there are indications that at least certain types of information may be coded in oscillating or other time-dependent patterns in the brain. This is not reflected in the architecture of present-day LLMs.

(2) An even deeper question is whether we can speak of conventional determinism regarding the processes in the human brain. It may well be possible that some form of intelligence is based on patterns arising from chaos in the form of limit cycles – competing with other such patterns during our daily confrontation with patterns of reality.

(3) That some fundamental difference between AI models and the human brain may exist, beyond a pure scaling of the number of neuronal connections, is plausible – just from comparing the small energy consumption of the human brain with that of AI models.

All in all, we may have made some progress regarding an important cornerstone of intelligent human thinking: the encoding of information in the form of language by machines. But it may be just one of many aspects of intelligence – and of awareness – which today’s artificial network architectures are not able to cover or imitate.

Conclusion

Fascination with a new technology should go hand in hand with criticism and a careful analysis of the question in which contexts, how and under which conditions this technology can be used or should not be used. A critical distance is in particular necessary when obvious capital and market-share interests drive unfounded, exaggerated and in the end unfulfilled – if not plainly wrong – claims and assertions of company leaders in interviews just ahead of new AI product releases. This, too, is a pattern we have seen multiple times in the history of modern IT. Ask an AI about it.