What is the Terrarium Benchmark?

A thought experiment for testing whether AI can handle genuinely complex, under-documented problems with slow feedback — not just well-trodden software tasks. You keep living systems (terrariums, or simply houseplants) healthy across varying conditions, introduce anomalies without telling the AI, ask it for fixes, and see how many survive. The point is that a living system has hidden state and delayed feedback, which is the opposite of a problem with a clean runbook.

Can a probabilistic model solve genuinely new problems?

On its own, it mostly returns what it has already been fed, with enough variance to feel like discovery — so it won't originate the solution to a problem that doesn't yet exist. But combining multiple contexts and using it to surface patterns, simulations, and combinations can reveal patterns that weren't visible when those contexts were handled in isolation. The real value comes from a human directing that process and recognising what matters.

Why is keeping a plant alive harder for AI than scaling Kafka?

Scaling Kafka, migrating clouds, or upgrading a framework is operationally difficult but not complex — it can be done with a good checklist and a runbook. Keeping a plant alive looks simple but is genuinely complex: the feedback is slow and noisy, the state is hidden, and nobody wrote the runbook. Difficulty of execution is not the same as complexity of the problem.

The Terrarium Benchmark: On LLMs, Intuition, and Intelligence

How should humans and AI divide the work?

By playing to opposite strengths. The human brain is excellent at pattern recognition and sensing which thing actually matters, but it struggles with raw computation at scale. A machine is the reverse: it can compute, simulate, and combine across far more contexts than a person can hold in their head, but it doesn't know which pattern is worth caring about. Let the AI surface patterns and run simulations; let the human pick the relevant direction. Each covers the other's weakness.

AI can write your CI/CD, your website, your boilerplate. The real test is whether it can keep something alive.

Fast assistants, with a catch

In the world of LLMs we have the fastest assistants we have ever had. If we’re driven enough and have the will, we will achieve what we want, or at least tend to, in a shorter span of time.

But there’s an issue. Most of the LLMs on offer have their providers’ intentions built into them: they behave idealistically, and that isn’t what we want from technology. It should be raw, in its purest form, though that is a discussion of its own.

What we actually want: intuition, not idealism

On to intuition: what we want is to develop a general intelligence like a human’s, a machine that behaves intellectually like a human and, given information, develops intuition. It’s okay if it forgets something, and it’s okay if it’s wrong; if it is said to have general intelligence, it is allowed to make mistakes.

Some say it’s superintelligence, some say it’s general intelligence. “General intelligence” fits better with the idea of machines that resemble humans. “Superintelligence”, I feel, is something that emerges when machines and humans combine to create something more meaningful, and that, I feel, is the most important part.

The human bottleneck — and the promise

The biggest limitation humans have is that we need at least 10–15 years to develop basic survival skills, and only then can we look beyond, to science, sports, literature. Some choose to do nothing, maybe because they lack the motivation. People are most productive between the ages of 20 and 50, the years when they can actually deliver things.

This is a limiting factor. There’s only a certain amount that humans can achieve as a community, and even that is a collective effort. At any given point in time, only a small fraction of people are serious researchers doing genuinely innovative work (not just building a better version of something that already exists, but creating something new from scratch), and their time is finite. That collective research capacity is far smaller than it looks, because research is subject to many factors: economics, supply and demand, and, not least, politics. Necessity is the mother of invention, and it holds true.

This is exactly where the promise lives. Could superintelligence, defined this way, mean that a driven person can use these systems to speed up research and breach the limits of the human lifespan, so that, as one unity, we become immortal? That too requires research. It would be the most profound civilizational change of all.

There’s a personal dimension to this too. A lifespan feels shorter when you’re racing to finish a piece of research, especially once you’re aware of the uncertainties of life: a sudden death, or simply old age. Traditionally we deal with that by passing the context and the motivation to the next generation, so they can continue the work. But AI, used properly, could change that handoff, if the models are personalised into the ideal assistant a researcher needs, their perfect research partner, rather than something generic and guardrailed to behave idealistically.

But part of this is concerning, because these systems originate from a few particular places and are controlled by a few people, according to what those people think the worldview should be. I think this can become more personalised, to suit the work of the person using it.

It’s okay if the LLM forgets something or hallucinates, and I have no shame admitting that I’ve also hallucinated when there’s too much going on at once. They’re machines; they can be wrong. After all, they’re meant to be a general intelligence, or close to mimicking one, and going wrong is part of that. The human making the final call has less room for error: they should weigh the subjective and objective information available and make the best decision they can.

Rather than creating a perfect LLM that solves every problem, we could have an evolutionary mechanism that’s cheap and evolves like a human: it forgets things, remembers, holds intuitions, has a gut feeling, and maybe even an ideology that can change. That is how we came to be. Maybe this requires a personalised model for everyone, one that can be deployed on the commodity machines of the future.

Can a probabilistic model solve tomorrow’s problems?

We’re busy creating an LLM that solves everything. My question is this: since it’s a probabilistic model, how can it provide a solution to a problem that will only appear in the future? How are the facts not yet known to humans accounted for? Or facts that aren’t documented, or feelings that are documented not as feelings but as 250 different contextual vocabulary terms placed in a sentence on the basis of probability? Adjust a few weights here and there and keep the answers essentially deterministic, with only subtle differences in how each sentence is framed, just to make it feel more real.

Owing to its probabilistic approach, it delivers things with subtle variance, so the output looks like a well-informed discovery, but it is only delivering what has already been fed to it. One point to note, though: useful patterns can emerge if we combine multiple contexts and use AI to run simulations and try combinations across them, patterns that would not have been found if two or more contexts were dealt with in isolation.

AI does well in this regard. It can present you with all the patterns, and it’s you, the human, who steers the AI in a particular direction. There’s a neat complementarity here. The human brain is remarkable at pattern recognition, at sensing which thing actually matters, but it struggles with raw computation at scale. The machine is the opposite: it can compute, simulate, and combine across far more contexts than any person can hold in their head, but it doesn’t know which pattern is worth caring about. Put the two together and each covers the other’s weakness. It’s humans who have consciousness and vision and, more importantly, the will to solve problems. The real value lies in combining AI’s pattern simulations with a human in the loop who recognises the relevant parts.

The real benchmark: keep something alive

I can see that AI is doing great stuff: it helps create amazing apps, websites, even the website you’re reading this article on. But do you really think these problems are complex enough to sing praises about? To be frank, no. We’ve just become lazy couch potatoes, celebrating AI for writing our CI/CD, building our websites, and writing some boilerplate for us.

So: ask AI to fight climate change. As a starting point, we could set up terrariums of different sizes, placed in different conditions. (To reduce operational complexity we can simulate them; for a more reasonable approach we can place 10 terrariums in each condition and repeat that across 100 different conditions, 1,000 terrariums in all.) The conditions would differ in gravity, atmospheric pressure, altitude, and temperature, while factors like humidity and rainfall vary around the clock. The target is to keep each terrarium healthy. We introduce an anomaly without informing the AI, ask the AI for solutions, and count how many terrariums survive. This is meant to fascinate the audience. I’d like to see who picks this up: Anthropic, OpenAI. Elon could pick this up; I think he can do weird things.
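The simulated variant of this setup can be sketched in a few lines of Python. This is a toy harness under heavy assumptions, not a real benchmark: the single hidden humidity variable, the comfort band, the daily drift, the `mist` and `wait` actions, and the two stand-in advisors are all invented for illustration. Only the 100 conditions with 10 terrariums each and the unannounced anomaly come from the text. In a real run, the advisor function would wrap a call to the AI being tested.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Everything here is hypothetical: one hidden variable, made-up thresholds.
COMFORT_LOW, COMFORT_HIGH = 0.4, 0.8  # assumed healthy humidity band


@dataclass
class Terrarium:
    condition_id: int
    humidity: float   # hidden state: the advisor never sees this
    health: int = 10  # integer health points; 0 means dead

    @property
    def alive(self) -> bool:
        return self.health > 0


# An advisor sees only the health readings so far and answers "mist" or "wait".
Advisor = Callable[[List[int]], str]


def step(t: Terrarium, action: str, rng: random.Random) -> None:
    """Advance one day: apply the action, drift a little, then update health."""
    if action == "mist":
        t.humidity += 0.15
    t.humidity += rng.uniform(-0.01, 0.01)  # small daily drift
    t.humidity = min(1.0, max(0.0, t.humidity))
    if not COMFORT_LOW <= t.humidity <= COMFORT_HIGH:
        t.health = max(0, t.health - 1)


def do_nothing(history: List[int]) -> str:
    """Baseline advisor: never intervenes."""
    return "wait"


def mist_when_declining(history: List[int]) -> str:
    """Naive advisor: mist whenever the latest health reading dropped."""
    declining = len(history) >= 2 and history[-1] < history[-2]
    return "mist" if declining else "wait"


def run_benchmark(
    advisor: Advisor,
    n_conditions: int = 100,
    replicates: int = 10,
    days: int = 30,
    anomaly_day: int = 10,
    seed: int = 0,
) -> Tuple[int, int]:
    """Return (survivors, total) after one hidden anomaly per terrarium."""
    rng = random.Random(seed)
    survivors = 0
    for condition_id in range(n_conditions):
        # Each condition starts at a slightly different humidity inside the band.
        base = 0.55 + 0.10 * condition_id / n_conditions
        for _ in range(replicates):
            t = Terrarium(condition_id, base)
            history: List[int] = []
            for day in range(days):
                if day == anomaly_day:
                    t.humidity = 0.1  # unannounced dry-out
                history.append(t.health)  # reading taken before today's action
                step(t, advisor(history), rng)
            survivors += int(t.alive)
    return survivors, n_conditions * replicates


if __name__ == "__main__":
    for advisor in (do_nothing, mist_when_declining):
        alive, total = run_benchmark(advisor)
        print(f"{advisor.__name__}: {alive}/{total} survived")
```

The advisor only ever sees past health readings, never humidity, which is a crude stand-in for hidden state and delayed feedback. Survival counts from a toy like this say nothing about real terrariums; the point is only the shape of the experiment.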

To start, you can begin with 5 low-maintenance money plants. Place them at different spots in the house — more sunlight, less sunlight — water them regularly, leave them alone, and run experiments: introduce anomalies and ask Claude for the fixes. And if it starts saying it’s not an expert and tells you to go see a local gardener — a horticulturist, a botanist, maybe even a plant pathologist — then don’t blame me. It doesn’t say that when generating code or doing other less complex stuff.

Often, the things that look simple are the most complex — once you’ve actually given them enough thought. Scaling Kafka, migrating clouds, refactoring a large codebase, or upgrading the version of a language or framework across your systems isn’t complex; it’s operationally difficult, but it can be done with a good checklist and a runbook. Keeping plants alive through the summer is a complex task; it looks simple, since water is all they seem to need. If they die, you won’t get fired — but your Asian mom can get fired up and remain melancholic for 2–3 days.