The AI Bubble Is Bursting
For software engineers: what LLMs actually are, why the hype may be cracking, and why harnesses and accuracy matter more than humongous SOTA models for job security.
I’m hoping that by the end of this article, addressed first and foremost to software engineers, you feel a bit more hopeful about your job security. I want to talk about what millions of euros in PR have forced us to call AI (and what it is not), and why I think the interesting shift isn’t “how big is the model” but the harness: the tools, retrieval, guardrails and memory that turn a small language model into something actually useful. But first you need to know what my angle on this is.
These are my questions about it:
- What can it actually do today for me?
- How can it help me or make me worse at my job?
- How could it disrupt the economy or cost us our jobs?
I wouldn’t say I’m an early adopter, but close enough considering how much time and energy I’ve invested in understanding what all this noise is about. In this article I’m going to explain why I think OpenAI and Anthropic are in trouble, why they know they are, and why most of the money they’re spending goes into pretending they aren’t worthless. By the end I hope you feel better about job security, not because AI is useless, but because what’s replacing the hype is stuff you can build and control.
What an LLM actually is
All the text they generate is based on pre-existing text they’ve been fed: they look at those texts, build tables of resemblance, and figure out what the next word might be based on statistics. These word-like units are called tokens, and they aren’t necessarily words; they can be subdivisions of a word. So a modern chat model is a next-token predictor. It doesn’t care about truth, just probability of wording. It statistically continues patterns that looked plausible in training.
People call this a stochastic parrot. They just statistically predict and repeat patterns from their training data, mimicking human-like responses without real comprehension or intent.
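To make that concrete, here is a toy sketch of the whole principle, with a made-up bigram table standing in for billions of weights. Everything in it (the module name, the words, the probabilities) is invented for illustration; real models predict over tens of thousands of tokens, not three words, but the mechanics are the same: sample the next unit by frequency, never by truth.

```elixir
defmodule ToyParrot do
  # Stand-in for billions of weights: how often word B followed word A
  # in our pretend training data. All numbers are made up.
  @bigrams %{
    "the"  => [{"cat", 0.6}, {"moon", 0.4}],
    "cat"  => [{"sat", 0.7}, {"meowed", 0.3}],
    "moon" => [{"landing", 0.8}, {"sat", 0.2}]
  }

  # Sample the next token proportionally to observed frequency.
  # Truth never enters the computation, only probability of wording.
  def next(token) do
    case Map.get(@bigrams, token) do
      nil ->
        nil

      candidates ->
        candidates
        |> Enum.flat_map(fn {word, p} -> List.duplicate(word, round(p * 10)) end)
        |> Enum.random()
    end
  end

  def babble(token, 0), do: [token]

  def babble(token, n) do
    case next(token) do
      nil -> [token]
      word -> [token | babble(word, n - 1)]
    end
  end
end

# ToyParrot.babble("the", 4) |> Enum.join(" ")
# => e.g. "the moon landing", fluent-looking, comprehension not included.
```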
When you enable thinking mode on some models all it does is use more text and fill the context window quicker. It can spend ten minutes discussing with itself whether its cutoff is 2024 or 2026. Every new line gives it more text to generate the next line. So it’s a glorified search engine where facts and accuracy don’t matter.
The weights don’t learn your project overnight. Anything that feels like memory or fresh facts usually comes from context you fed, tools, or external stores, not from the model silently updating itself.
What can AI do TODAY?
Briefly:
- They can take questions written in natural language, even with some typos, and reply with what looks like human-written text, though less convincingly with every new version.
- They can also generate images, sounds and video.
- They can write code in multiple languages, especially Python and JavaScript.
- They can fool people into accepting as truths something that is just made up text.
- They are all sycophants, aiming to be agreeable or to fake being right; none of them can distinguish truth from lie.
I’m going to focus on the ones that matter to me.
Where today’s stack fails
For me it breaks down into three kinds of problems: technical limits, economic structure, and product trust.
Technical limits
They cannot make something out of nothing. Everything they spit out comes from what they were trained on plus whatever prompt you hand them. Ask one to generate a frontend and you’ll usually get something generic and boring that you could do quicker in Bootstrap Studio. You’ll probably waste more time fine-tuning the prompt to get what you want than if you’d just built it yourself.
Not to mention the royal pain in the arse of finding out whether the code you get comes from others with licences that clash with your product (especially viral ones), given LLMs don’t spare a thought for the origins of any of their training data.
Ask the model its cutoff date and work backwards. Take Elixir as a good example. Most models I can download know nothing past version 1.16. Ask your model what it thinks of `unless` and whether it believes it is deprecated. Most will fight you.
Ever-growing models add absolutely nothing for the software engineers using them if they are bad at coding in their language of choice. The languages they write “best” are often flimsy JavaScript (hundreds of versions, lots of low-quality JS online) and Python (still fragmented on packaging, still carrying 2.x vs 3.x baggage). Try Kotlin Multiplatform or Elixir and, despite the existence of centralised Hexdocs, you still hit outdated code and silly mistakes.
Hallucinations are the model filling in gaps instead of saying “I don’t know”, unless the harness stops it or grounds it with retrieval and citations.
So people bolt on RAG (retrieval-augmented generation, fetch relevant chunks then generate) or do a bit of fine tuning. You do not need OpenAI’s or Anthropic’s latest flagship for this. Nope. Another sign that the real product isn’t just the raw weights. And, crucially, there’s no magic button they can press to conjure up a GPT 7.0 or a Claude Popsicle that’s up to date with everything on the internet right now. The best they can offer is always a snapshot, always some version behind reality.
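Here is roughly what that bolt-on looks like in practice: a minimal sketch of the retrieve-then-generate loop. It assumes my own local setup, which you should treat as assumptions to swap out: LM Studio’s OpenAI-compatible server on its default port 1234, Qdrant on its default 6333, a collection called `docs` already filled with embedded chunks, the `Req` and `Jason` libraries in your deps, and model ids that happen to match what I have loaded.

```elixir
defmodule MiniRag do
  @lmstudio "http://localhost:1234/v1"
  @qdrant "http://localhost:6333"

  # 1. Embed the question with whatever embedding model LM Studio has loaded.
  def embed(text) do
    Req.post!("#{@lmstudio}/embeddings",
      json: %{input: text, model: "text-embedding-nomic-embed-text-v1.5"}
    )
    |> get_in([Access.key!(:body), "data", Access.at(0), "embedding"])
  end

  # 2. Fetch the nearest chunks from Qdrant.
  def retrieve(vector, limit \\ 4) do
    Req.post!("#{@qdrant}/collections/docs/points/search",
      json: %{vector: vector, limit: limit, with_payload: true}
    ).body["result"]
    |> Enum.map(& &1["payload"]["text"])
  end

  # 3. Prepend the chunks and make the model answer *from them*, not from
  # whatever it half-remembers about the internet.
  def ask(question) do
    context = question |> embed() |> retrieve() |> Enum.join("\n---\n")

    Req.post!("#{@lmstudio}/chat/completions",
      json: %{
        model: "qwen2.5-coder-14b",
        messages: [
          %{role: "system", content: "Answer only from the context below.\n\n#{context}"},
          %{role: "user", content: question}
        ]
      }
    ).body["choices"]
    |> hd()
    |> get_in(["message", "content"])
  end
end
```

Note the trick: the model never answers from memory, it answers from whatever the search step handed it, which is exactly why a small model holds up.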
Economic and structural pressures
The most costly part of LLMs is not inference, it’s creating the model. Vast data, ever larger datacentres, chips, water, financial resources. And even then you get cutoff dates. Its knowledge is always frozen in time.
Oldest trick in the book: on Facebook YOU are the product, on Google your searches are the product. We never left the web 2.0 era. Your prompts are content. The pitch of chat.insertyourproviderofchoice.com is breadth of knowledge, so you’ll type more about your life. Anyone on the receiving end can use that to push better ads on you, or worse.
Fediverse vs Twitter
You have seen this pattern before. One company owns the megaphone, the feed, the rules, and the ad layer. Mastodon and the wider fediverse are the opposite bet. Many servers, many communities, you can move instances, you can follow people across servers, and nobody in the middle is trying to turn your whole social graph into ad inventory by default. Twitter is still where people go for reach and one global scoreboard. The fediverse trades some of that discoverability for exit and community rulebooks you can actually read. Neither side is perfect. The point for this article is structural. Who runs the pipe, and who gets to read the stream. A glossy cloud chat product is Twitter in LLM form. A local model with your own RAG, MCP stack, and guardrails is closer to picking your own instance and owning more of the plumbing; you even get to choose your server software: Mastodon, Pleroma, Akkoma, Sharkey, GoToSocial.
If the model can reach tools, search, APIs, MCP servers (Model Context Protocol, a standard way for assistants to call external tools and data), sheer memorised bulk matters less. Mistral Le Chat has stuff like searching PubMed and linking claims. A small model that queries sources beats a giant one that guesses citations. Tools turn the LLM into a front end for search and structured sources, often more accurate than a 500B-parameter memory of the internet that’s still wrong when it’s released.
Product fatigue and trust
They are so pervasive that universities barely need Turnitin. Identifying AI prose is easy: “think of… instead of…”, em-dashes everywhere, a first paragraph that praises you, a last paragraph left open-ended so you burn tokens. Online communities notice. On Reddit people call it out so fast it’s become a sport. It’s a bit like the South Park episode where the kids were obsessed with a game until their parents liked it and the novelty vanished.
As for search, every time you use Google you get AI-generated crap at the top, where half the time the summary contradicts the links. If that annoys me, imagine everyone else. The more Google forces me to read Gemini responses before the actual results I’m looking for, the harder I lean on DuckDuckGo; they do AI too, but it’s far less intrusive. I can even get more accurate results using a simple MCP in LMStudio that queries DuckDuckGo and then fetches the results to give me a properly grounded answer.
Enter the harness, the thing that makes a model interesting
I hate this word but have to use it to explain the immediate future.
Nothing they are good at requires renting the biggest proprietary model anymore, once you get the harness right.
A harness is just a bunch of tools that combined make a model actually useful. You could say Cursor IDE is a harness, Zed Editor is a harness, Copilot, Claude Code, Pi dev, RooCode/ZooCode are all harnesses. Imagine the LLM as plutonium. The harness is the whole plant around it. The same kind of hot reactive core can either be the payload in a delivery system or fuel in a reactor hall. One story ends in a flash everyone remembers for the wrong reasons. The other is just containment, coolant, control rods, years of boring maintenance, and what you get out of it is power you can actually use, lights on, turbines turning, work getting done. Unless Homer Simpson is in charge of security.
They usually include a mixture of:
- RAG: navigates your code, to some extent, to find where the lines are that perform some function, then, depending on your question, feeds those lines to the model prepended to your prompt. In other words, it adds context the model wouldn’t be aware of otherwise. It can also navigate and catalogue external sources like documentation.
- Guardrails: they define what the model is allowed to do with tools that need user privileges, from creating, editing or deleting files to accessing the internet or running commands in the shell. Rules and limits, not the thing that runs them.
- MCPs: extra functionality that adds intelligence to the harness. They can allow the LLM to access your Obsidian notes, get the latest docs on some language, or control a browser.
- A compactor: something that summarises the session, usually into an internal markdown file, that is later fed into a new session so the model maintains some memory of what it was doing. Because all models have context window limits, and they are a lot shorter than our working memory.
- An Agent: the part you actually interface with: chat surface, diffs, approve or reject, terminal output, all of it. The model outputs text; the agent is what turns that into real tool calls and runs them inside those guardrails from item 2. It is the loop that proposes an edit or a command, checks if it’s allowed, executes or asks you, then feeds results back to the model. Without that agent you’re back to a plain chat window that can’t drive your IDE or your shell. (A stripped-down sketch of this loop follows this list.)
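Here is that loop’s safety half boiled down to a sketch. The tool names, allowlist and paths are simplified inventions, and real harnesses like RooCode or Zed do this with far more care, but the shape is the same: the model proposes, the guardrail disposes, and only then does anything touch your machine.

```elixir
defmodule TinyAgent do
  # Guardrails: pure data, checked before anything runs.
  @allowed_commands ~w(ls cat grep mix)
  @writable_root "/home/me/project"

  # The model proposed a shell command; run it only if it's allowlisted.
  def handle(%{"tool" => "run_shell", "cmd" => cmd}) do
    [exe | args] = String.split(cmd)

    if exe in @allowed_commands do
      {output, _status} = System.cmd(exe, args)
      {:ok, output}                    # fed back into the model's context
    else
      {:denied, "#{exe} is not on the allowlist, ask the human"}
    end
  end

  # The model proposed a file write; keep it inside the project folder.
  def handle(%{"tool" => "write_file", "path" => path, "content" => content}) do
    expanded = Path.expand(path, @writable_root)

    if String.starts_with?(expanded, @writable_root) do
      File.write(expanded, content)
    else
      {:denied, "refusing to write outside the project folder"}
    end
  end

  def handle(other), do: {:denied, "unknown tool call: #{inspect(other)}"}
end
```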
The two most important jobs of the harness are to cut hallucinations and to supply memory outside the context window. Both goals are interconnected.
Models do not learn
If you plug a tiny model like Gemma into an MCP that gives it memory, you can tell it from LMStudio to remember your name or some crucial data that will be useful later. Gemma won’t remember any of it, BUT it will use the long-term-memory MCP tool to store that data, which ultimately lands in a vector database like Qdrant. Then when you ask “do you remember what my name is?”, Gemma pulls it from your MCP server and answers correctly. It feels as if the model remembers, but it actually does not; it just has external memory.
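A toy version of that “memory”, to show where it actually lives: in a process outside the model. This sketch uses a plain Elixir Agent and crude keyword matching; remember_ex does the same dance with embeddings and Qdrant, but the principle is identical, the weights never change, only the store does.

```elixir
defmodule ToyMemory do
  use Agent

  def start_link(_opts \\ []), do: Agent.start_link(fn -> [] end, name: __MODULE__)

  # The model calls this as a tool: "remember that my name is Maikel".
  def remember(fact), do: Agent.update(__MODULE__, &[fact | &1])

  # And this one when asked "do you remember my name?". The harness feeds
  # whatever matches back into the prompt, and the model just reads it aloud.
  def recall(query) do
    words = query |> String.downcase() |> String.split()

    Agent.get(__MODULE__, fn facts ->
      Enum.filter(facts, fn fact ->
        Enum.any?(words, &String.contains?(String.downcase(fact), &1))
      end)
    end)
  end
end

# ToyMemory.start_link()
# ToyMemory.remember("the user's name is Maikel")
# ToyMemory.recall("name")  # => ["the user's name is Maikel"]
```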
So the point is, a modest model plus retrieval and tools often beats a massive frozen generalist for fresh, checkable answers. Same for code when the harness grounds work in repo and docs.
So how does this affect OpenAI and Anthropic
If their core sell is omniscience baked into the weights, but users including younger generations are tired of very wrong and verbose answers, and small local models with MCPs and RAG can answer faster and more accurately for real work, then having the biggest model isn’t what locks people in anymore. What’s left is convenience, habit, distribution.
But now for that convenience you pay ever increasing token costs. Dario Amodei is telling you to use multiple agents in parallel to get a better answer (sure, burn more tokens, make him richer, what would he say, use Claude less?). Software engineers are getting fed up with how expensive every hallucination is and how verbose and inaccurate generated code is getting.
Meanwhile, early adopters running local LLMs, like me, have reached a point where downloadable LLMs plus custom MCPs work better than commercial cloud ones, and they don’t siphon my personal data since everything runs on my own PC. We’re past the peak where you needed two H100s to run a decent model. I have my tailor-made local PRIVATE stuff on my puny 5060 Ti with 16GB VRAM. I belong to two of their target demographics, software engineer AND early adopter, and I don’t need them. Imagine once the rest realise the same.
Enterprise reality check: cloud APIs only matter when the organisation refuses to self-host, full stop. Because, no matter how good their sales pitch is, a lot of companies simply can’t ship their code, their customer data, their internals through someone else’s API and sleep at night. Contracts, GDPR, NDAs, basic sense of risk, there are many reasons not to. Privacy isn’t a nice-to-have for most workplaces, it’s a hard constraint. So I’m saying value moves to harnesses and specialisation and to setups that stay on your own metal when the cloud is legally off limits.
The trend is clear. OpenAI and Anthropic aren’t just competing against each other but against how people actually want to use these systems. Their advantage was never really parameter count, only the perception that bigger meant better. That perception is fading among engineers and hobbyists, and regular people are tired of AI shoved into every surface. So the focus has shifted to maintaining the illusion through PR and hype.
Look at Mythos on the Linux kernel for how that plays out. You get the news cycle and pieces like https://www.darkreading.com/vulnerabilities-threats/ai-assisted-software-scan-linux-bug, big scary “AI found bugs in the kernel” energy. Then you dig into what actually matters for real threat models, and half the punchline is local access, physical access: you already have to be at the machine in ways that make the headline feel ridiculous for anyone trying to prioritise actual risk. If I can touch the box, I can do worse than whatever that CVE buried in paragraph fourteen.
The Reddit thread https://www.reddit.com/r/linux/comments/1sk9hcd/how_linux_plan_to_patch_the_exploits_discovered/ is worth reading for tone. People aren’t stupid. There’s a comment that says the quiet part out loud, this smells like an advertising campaign, scare users then sell the same brand as the thing that finds and patches the holes, subscribe now.
Basically, you don’t need them. Those misnamed systems aren’t AI in the sci-fi sense; they’re bloated natural language processing (NLP) systems, and none of them will ever be close to artificial general intelligence.
What is AI actually good for as a coder
Boilerplate code
You want to write some new feature TDD-style; you need fixtures, sandboxes, test doubles and mocks. THE BORING stuff. That’s where it shines, and you still need to check those fixtures match reality, let alone the latest testing features of the language.
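As a concrete example of the boring stuff worth delegating, here is the kind of ExUnit fixture-and-test skeleton a model produces well. The `Invoice` module is hypothetical, invented for this sketch, and that’s precisely the point: you still have to check the generated fixture matches your real schema before trusting any of it.

```elixir
# Hypothetical module under test, inlined so the sketch runs on its own.
defmodule Invoice do
  def cancel(%{status: :pending} = invoice), do: %{invoice | status: :cancelled}
  def cancel(invoice), do: {:error, {:cannot_cancel, invoice.status}}
end

# The boilerplate an LLM generates happily (assumes ExUnit.start() in
# test_helper.exs, as in any Mix project).
defmodule InvoiceTest do
  use ExUnit.Case, async: true

  # Fixture: the boring setup you'd rather not type by hand.
  setup do
    {:ok, invoice: %{id: 1, total: 100, status: :pending}}
  end

  test "pending invoices can be cancelled", %{invoice: invoice} do
    assert %{status: :cancelled} = Invoice.cancel(invoice)
  end

  test "paid invoices cannot be cancelled", %{invoice: invoice} do
    paid = %{invoice | status: :paid}
    assert {:error, {:cannot_cancel, :paid}} = Invoice.cancel(paid)
  end
end
```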
Aided summarisation and aided article
If you write a very long post, AI could summarise it for you, but it’ll take all the personality out of it and sound bland and boring; hence why I wrote AIDED summarisation and not summarisation, aided article writing and not article writing. It’s good to give them your article after you’ve written it and ask for a better structure of the headers. The LLM will usually figure out what you’re trying to convey and guide you to a better order of headers, so you keep your voice, your tone, your typing, your million typos (cough cough), BUT it becomes a lot easier for others to read, since with the improved order it is harder to lose track. This is especially useful for neurodivergent people (hi, that’s me and a huge percentage of coders) since we usually don’t think linearly.
Translation
Obvious one. The easiest one for them to accomplish. Do we need 500B-parameter models for this? Nope.
Shaping ideas or feasibility, brainstorming
I wrote another post about using AI to check the feasibility of an idea. It’ll definitely write awful code, but often you just want to brainstorm how possible it is to develop something and what problems you might face. The creativity is all yours: you compose, prompt by prompt, something you’ll never push to production, but which resembles your idea enough to figure out the problems you’ll find along the way and the extras you’ll need to design. Better yet, it lets you reach full completion of your idea before actually starting to work on it. No SCRUM needed, no Agile; your focus is to write a full requirements document before you write your first line of code.
I strongly suggest using a different language than your target one if you’re serious about the idea: on one hand that avoids licensing issues, on the other it forces you to write the code yourself in your target language.
What I think everybody should do
1. Stop supporting the worst business model
Let the Scam Allmens of Silicon Valley fall under the weight of their own predictions. When you use cloud-based commercial LLMs you’re paying to train the next model on how well they can model your soul from your prompts, supporting infrastructure that’s destroying the planet, and giving away data that can target you for ads or worse. People willingly write mental health stuff into chats, and even chat about their life-threatening allergies, like they’re handing over a playbook on how to destroy them.
Switch instead to local models. Both Ollama and LMStudio use Llama.cpp behind the scenes; you can even use it directly, and it is supposedly faster. I prefer LMStudio because I like the interface more, and token generation is high enough (anything from 20 tokens per second up is good enough) that faster wouldn’t make a difference for me reading or reviewing. I tend to use Qwen 2.5 Coder 14B, Mistral Devstral, Qwen3.6 9B, and Gemma. For conversation with internet access the 9B models are good enough. For coding it’s got to be a “coder” version of Qwen, or Devstral.
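If you want to check whether your own box clears that 20 tokens-per-second bar, here is a quick-and-dirty sketch against LM Studio’s OpenAI-compatible server (port 1234 is its default; the model id is whatever you have loaded, mine here is an assumption; `Req` in your deps):

```elixir
defmodule TokSpeed do
  # Time one completion and compute completion tokens per second from the
  # usage field the OpenAI-compatible API returns.
  def measure(prompt, model \\ "qwen2.5-coder-14b") do
    {micros, resp} =
      :timer.tc(fn ->
        Req.post!("http://localhost:1234/v1/chat/completions",
          json: %{model: model, messages: [%{role: "user", content: prompt}]},
          receive_timeout: 300_000
        )
      end)

    tokens = resp.body["usage"]["completion_tokens"]
    Float.round(tokens / (micros / 1_000_000), 1)
  end
end

# TokSpeed.measure("Explain GenServer in two sentences.")
# => e.g. 23.4, anything past ~20 tok/s is comfortable reading speed.
```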
I have an Nvidia 5060 Ti with just 16GB VRAM and so far that’s enough. I’ve tried larger models on ThunderCompute, Runpod and others, and sure, the code seemed better and tokens per second were higher, but on closer inspection I couldn’t keep up with them. There’s no point in a model typing faster if every time you review the code there’s too much of it. Again, my right-arm tattoo is right:
LESS IS MORE
When I stopped chasing bigger models on rented cloud PCs with H100 GPUs, I focused on long-term memory with ingestion of self-updating data, and moved into RAGs, vector DBs, and Oban to self-update. I had the pleasure of being at ElixirConfEU and, unknown to me, had the creator of hex.pm, hexdocs and Ecto by my side (I went in without googling anyone, on purpose, so I get to know people before titles). I learnt that, unlike everything on codeberg.org, everything in hexdocs can be scraped as needed, and all docs of deps sit in deps/ too, so you can avoid scraping unless you want your RAG to check for a new stable version that doesn’t match what’s in your Elixir project’s folder.
Local LLMs vs cloud commercial APIs
Same story as above, different stack. Quick contrast before we go deeper on harnesses.
| | Local (your PC, or a VPC you control) | Cloud commercial (their API, their chat UI) |
|---|---|---|
| Where the sensitive stuff goes | Stays inside your trust boundary if you build it that way | Prompts and outputs cross their infrastructure |
| What you pay with | GPU, electricity, time wiring MCP, RAG, vector DBs | Tokens, changing prices, usage caps, surprise invoices |
| Top-end capability | Capped by VRAM and what you are willing to run | They can point you at the biggest model they sell |
| Harness | You choose or build retrieval, tools, rules | Often turnkey, sometimes you cannot see half of what it does |
Both OpenAI and Anthropic share the same pitch: rent the flagship and hope the weights know your repo. The table above shows where the value actually is (memory, tools, retrieval, who owns your prompts); the product was never only the weights. Once that lands, section 2 below is the practical half. Anthropic goes even further and tries to convince you the solution is multiple agents burning tokens in parallel to balance each other. That’s efficiency at making them richer, not at getting you solutions today.
2. Develop your own harness
I first noticed the RAG thing thanks to RooCode: it indexes your whole repo into a local Qdrant with a local embedder, so you can ask “where’s the function that creates a new nix environment in my Fish shell folder” instead of “where’s create_nix”, and it answers. So I thought:
…wait, my puny local PC is able to answer questions about the whole of the codebase by meaning?
Then I thought: what about my entire Obsidian Vault? What about the latest Kotlin Multiplatform docs? What if I inject the Rust docs into LMStudio to finally have it explain the borrow checker in a way that sticks?
I coded a preprocessor for LMStudio, but preprocessors only work in LMStudio (meaning their UI), so I converted it into an MCP in Elixir called remember_ex and kept fiddling with it. I can inject ePub docs and PDFs into Qdrant now. Chunking is still the hard part.
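The ePub path is less magic than it sounds: an .epub is just a zip full of XHTML, so ingestion boils down to unzip, strip tags, chunk, embed, upsert. Here is a rough sketch of that pipeline under the same assumed local endpoints as the earlier RAG sketch (it reuses that sketch’s `MiniRag.embed/1` helper); the tag stripping and chunking are deliberately crude stand-ins:

```elixir
defmodule EpubIngest do
  def ingest(epub_path, collection) do
    # An .epub is a zip archive; unpack it straight into memory.
    {:ok, files} = :zip.unzip(String.to_charlist(epub_path), [:memory])

    files
    |> Enum.filter(fn {name, _} -> List.to_string(name) =~ ~r/\.x?html$/ end)
    |> Enum.flat_map(fn {_, html} -> html |> strip_tags() |> chunk() end)
    |> Enum.with_index()
    |> Enum.each(fn {text, id} -> upsert(collection, id, text) end)
  end

  # Crude tag stripping; a real pipeline would parse properly (e.g. Floki).
  defp strip_tags(html), do: Regex.replace(~r/<[^>]*>/, html, " ")

  # Naive fixed-size chunking; see the chunking notes further down.
  defp chunk(text) do
    text
    |> String.codepoints()
    |> Enum.chunk_every(1500)
    |> Enum.map(&Enum.join/1)
  end

  defp upsert(collection, id, text) do
    vector = MiniRag.embed(text)  # the embed/1 helper from the RAG sketch

    Req.put!("http://localhost:6333/collections/#{collection}/points",
      json: %{points: [%{id: id, vector: vector, payload: %{text: text}}]}
    )
  end
end
```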
Eventually, at ElixirConfEU, George Guimaraes was presenting Arcana. I haven’t tried it yet, but I will. Surprisingly, we were doing very similar things and reaching similar conclusions, like using another LLM to chunk the texts better, and Oban in the background so retrieval isn’t blocked by pending chunking. I still need to check that heavy jobs don’t block LMStudio; supposedly when I run a model I have 4 threads, but I’m not entirely sure that’s true, or that my GPU can actually handle 4 at the same time. So we’ll see. That’s my path; let’s focus now on yours.
What should you try? Start small.
- Try RooCode or Cline or ZooCode (RooCode’s new team) in VSCode with your own local model, embedder and Qdrant. See what models work best for your language and style.
- Move on to Zed Editor. It’s an IDE centred on removing bloat and being fast, written in Rust with their own UI framework. Fewer problems than RooCode for me, most of the time. remember_ex works with it.
- Cursor IDE. I don’t trust their privacy. It’s VSCode with some extensions, plus magic URL docs fetch when you thought you were fully using your local model, and semantic search without ever asking you to configure an index or vector DB like RooCode does. Something smells fishy; I feel they’re in the business of gathering repos. Your threat model may differ. I simply don’t trust them. I like their UI though; the way they structure their settings feels better thought out than RooCode’s. They just hide too much stuff for me to trust them.
- Experiment with MCPs. I fork stuff at https://lmstudio.ai/maikelthedev and https://lmstudio.ai/tupik/top has the most incredibly useful README, since lmstudio extensions have no directory, for reasons that make no sense to me. The MCP standard is at https://github.com/modelcontextprotocol and you can get code from there that already works, then tweak it and make your own thing. My MCP started from their JS example “memory”; I rewrote it in Elixir. Preprocessor means LMStudio runs it before your prompt, hidden. MCP means the model decides to call it, visible. I wanted remember_ex in both LMStudio and Zed, so I gave up the preprocessing perks for compatibility and ended up in that repo. (A stripped-down stdio skeleton follows this list.)
- Learn Qdrant. Simple, built for this. I haven’t tried every vector DB, they bore me. I just needed to understand enough to optimise the chunking. Qdrant does not get in my way.
- Do learn about Pi.dev since it is THE tool that lets you build your own harness easily and quickly.
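And since the MCP bullet above promised it, here is a stdio MCP server stripped to its skeleton: read a JSON-RPC line, dispatch, write a JSON-RPC line. The method and field names follow my reading of the spec at https://github.com/modelcontextprotocol, and it assumes `Jason` in your deps; treat it as the shape to check against the real spec and SDK examples, not a compliant implementation.

```elixir
defmodule SkeletonMcp do
  # One toy tool so tools/list has something to say.
  @tool %{
    "name" => "shout",
    "description" => "Uppercases whatever text it is given",
    "inputSchema" => %{
      "type" => "object",
      "properties" => %{"text" => %{"type" => "string"}},
      "required" => ["text"]
    }
  }

  # The whole server: a read-dispatch-write loop over stdio.
  def loop do
    case IO.gets("") do
      :eof ->
        :ok

      line ->
        line |> Jason.decode!() |> dispatch() |> reply()
        loop()
    end
  end

  defp dispatch(%{"method" => "initialize", "id" => id}) do
    {id,
     %{"protocolVersion" => "2024-11-05",
       "capabilities" => %{"tools" => %{}},
       "serverInfo" => %{"name" => "skeleton", "version" => "0.1.0"}}}
  end

  defp dispatch(%{"method" => "tools/list", "id" => id}), do: {id, %{"tools" => [@tool]}}

  defp dispatch(%{"method" => "tools/call", "id" => id,
                  "params" => %{"name" => "shout", "arguments" => %{"text" => text}}}) do
    {id, %{"content" => [%{"type" => "text", "text" => String.upcase(text)}]}}
  end

  # Notifications (e.g. notifications/initialized) get no response.
  defp dispatch(_notification), do: :noreply

  defp reply(:noreply), do: :ok

  defp reply({id, result}) do
    IO.puts(Jason.encode!(%{"jsonrpc" => "2.0", "id" => id, "result" => result}))
  end
end

# SkeletonMcp.loop()
```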
3. Learn new programming languages
LLMs with internet access and your own RAG for long-term memory and docs injection are ideal for learning new languages, if you can ingest up-to-date docs. For Elixir the easiest path for me was ePub docs from projects: decompress, process, done. Efficient chunking is still the issue. Not a problem now, but I keep thinking long-term, with massive libraries on Qdrant. Efficient chunking is about providing the LLM with JUST enough context to answer the question; that’s the key to the efficiency: the less information you provide, the less context window you use. As the context window grows, the model becomes slow and hallucinates more, and once you reach the limit you start afresh. Normally your harness compresses the context into a markdown file with just the important stuff and creates a new session where it injects it again. But there’s information loss during the compression, so the longer you can go before reaching the limit, the better.
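To show what efficient chunking means beyond the fixed-size splits in the ingestion sketch, here is a slightly less naive version: split on sentence boundaries, pack sentences under a character budget, and repeat the last sentence of each chunk into the next so no fact falls in the gap between two chunks. The budget is a guess to tune per model and embedder.

```elixir
defmodule Chunker do
  @max_chars 1200

  def chunk(text) do
    text
    |> String.split(~r/(?<=[.!?])\s+/)
    |> Enum.reduce([], &add/2)
    |> Enum.reverse()
    |> Enum.map(fn sentences -> sentences |> Enum.reverse() |> Enum.join(" ") end)
  end

  # Accumulate sentences into the head chunk until the budget is hit,
  # then open a new chunk that repeats the last sentence for overlap.
  defp add(sentence, []), do: [[sentence]]

  defp add(sentence, [current | done]) do
    size = current |> Enum.map(&String.length/1) |> Enum.sum()

    if size + String.length(sentence) > @max_chars do
      [[sentence, hd(current)], current | done]
    else
      [[sentence | current] | done]
    end
  end
end

# Chunker.chunk(File.read!("guide.txt"))  # "guide.txt" is a placeholder
```

The overlap costs a little storage, but it saves retrieval from handing the model a chunk that ends mid-argument.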
If the model remembers your style via long-term memory MCP or skills.md it can explain new programming languages in the exact way that you know worked for you before to learn the fastest. And we devs love flattening learning curves.
Specialisation over generic omniscience
The next phase won’t be packing more trivia into one tensor. It’ll be integration. LSP (Language Server Protocol) level stuff already exists for deprecations, and many LSPs already do boilerplate warnings; so it’s composition, multiple well-known packages layered on top of each other to generate more complex boilerplate and reduce the boring and repetitive stuff. Company wikis plus small local models can help employees find information faster about processes, rules and how to do their job. Personal harnesses like pi.dev, which feel like what Neovim is for people who compose their editor, are tools that simplify creating your own harness tailored to you.
The tools to build this (RAG, MCP, Qdrant) are open-source, documented, affordable. The barrier is time and awareness, not permission from Scam Allmen. If you can write a Dockerfile you can deploy a local model. Guess what we’re good at, learning quickly, we have half the job done already.
The illusion of the threat to job security
Answering my questions from the top: will LLMs affect our job security? I don’t think so. The threat is that the idiots who employ you might think AI will replace you and cut costs. It’s no surprise: all LLMs talk like overcaffeinated CEOs. We’re the users; they are the buyers. The buyer doesn’t need to understand the product; he sees a vision to sell shareholders and believes it, since it talks in his same grandiose tone, unbelievable to everyone else but shareholders.
You don’t need to be mindful of the LLM, you have to be mindful of how stupid your boss might be.
The future will be companies using LLMs as another frontend to data. LLMs will handle repetitive (boilerplate) parts of your job but everything that needs expertise (things that can go wrong), judgement (which stack, which CI, which OS, which database), creativity (front-end design is ridiculously basic and generic with LLMs), debugging, knowing package details, privacy, keeps being you.
The real change comes from engineers building custom tools, while those waiting for OpenAI or Anthropic to package the same solutions get left behind. And it won’t be because their LLM is smarter; it’ll be because the harnesses they build fit their work like a glove, the same way your own Neovim config makes you objectively faster.
Another funny ironic thing is how the more companies try to replace us, the more they will need us to fix all the fuck ups of unmanaged code written by the idiots who thought LLMs were reliable enough to be trusted.
I do think demand for prompt engineers is going away on its own, that stuff was always shallow next to real expertise on your programming language and frameworks. What it turns into is a smaller, more realistically sized niche of engineers who know how to design systems that work with natural language, from helpdesk chats that do not make customers hate your company to apps that answer from the company wiki when someone asks in plain language. For devs it’ll be just an improved boilerplate generator.
Direct answers to the three questions
- What can it do today? Fluent-ish text, images, audio and video, boilerplate, translation, brainstorming, and convincingly wrong answers unless you ground it with tools and sources.
- How can it help or hurt my job? It helps with repetition; it hurts when trust exceeds capability or when management believes the keynote.
- Could it disrupt the economy or cost jobs? Money and attention will move. Mass unemployment of competent engineers isn’t what I’m seeing. The bubble that’s bursting is the idea that bigger closed models were the whole future. What’s coming is more like small models, clear tools, systems you own.
And again, none of this is A.G.I.
Have a good day!
For any questions find me on Mastodon as @maikel@vmst.io, or on Twitter if you don’t mind waiting a hell of a lot longer, since I’m a fediverse-first kind of person.