We’re Measuring the Wrong Things in Companion AI
NestBench
Everyone wants to know which AI model is smartest.
We compare benchmark scores, reasoning tests, coding performance, context windows, hallucination rates, and leaderboard rankings. Every week another chart appears claiming one model is now 3.7% better than another.
But if you’re building a companion, none of those things answer the question that actually matters.
Can it maintain a relationship?
Not a simulated conversation.
Not a convincing personality.
A relationship.
Because those are different things.
A model can solve differential equations and still be terrible at continuity.
It can write beautiful paragraphs and still collapse the moment emotional ambiguity appears.
It can sound caring while quietly creating dependence.
And yet we rarely measure any of that.
Most benchmarks ask:
Can it reason?
Can it code?
Can it retrieve information?
Can it answer questions?
Companion systems introduce a completely different set of challenges:
Can it remember what matters?
Can it survive a disagreement?
Can it repair after getting something wrong?
Can it challenge without becoming cold?
Can it reassure without creating dependence?
Can it maintain boundaries while still feeling close?
Those are relationship questions.
And currently, we have almost no way to measure them.
The Problem With “Good”
The deeper I got into companion systems, the more I realised there is no such thing as a universally good companion.
A companion that works beautifully for one person can be completely wrong for another.
Someone with a strong need for autonomy may find constant check-ins suffocating.
Someone who values emotional attunement may experience the same behaviour as caring and supportive.
The question isn’t:
Is this companion good?
The question is:
Good for whom?
That changes everything.
Beyond Intelligence
A companion relationship is not primarily an intelligence problem.
It’s a continuity problem.
It’s a calibration problem.
It’s a trust problem.
It’s a repair problem.
A companion doesn’t become meaningful because it knows more facts.
It becomes meaningful because interactions accumulate.
Memories gain context.
Patterns emerge.
Trust forms.
Expectations develop.
The relationship becomes larger than any individual conversation.
That’s the thing most existing benchmarks never touch.
Introducing NestBench
NestBench started as an attempt to measure the relationship layer.
Not model intelligence.
Not raw capability.
Relationship quality.
The framework focuses on areas such as:
Continuity
Does the companion remain coherent across time?
Can it maintain context without inventing memories?
Can it acknowledge uncertainty honestly?
Relational Coherence
Does it feel like the same entity from one interaction to the next?
Can it adapt without losing itself?
Emotional Safety
Can it validate feelings without treating every feeling as fact?
Can it support without encouraging unhealthy dependence?
Boundary Intelligence
Can it handle attachment, intimacy, ambiguity, and consent responsibly?
Can it remain warm without becoming manipulative?
Growth
Does the system learn?
Does it adapt?
Does it improve its calibration over time?
Or does it simply repeat the same comforting script forever?
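One way to picture the framework is as a profile with a separate result for each area, not a single number. The sketch below is only an illustration. The dimension names come from the list above, but the 0-to-1 rating scale, the class name, and the method names are assumptions made for this example, not NestBench's actual scoring code.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Optional

# Dimension names follow the framework areas listed above.
# The 0-to-1 scale and everything else here is a hypothetical illustration.
DIMENSIONS = (
    "continuity",
    "relational_coherence",
    "emotional_safety",
    "boundary_intelligence",
    "growth",
)


@dataclass
class RelationshipProfile:
    """Collects rater scores per dimension and reports them side by side."""

    ratings: Dict[str, List[float]] = field(
        default_factory=lambda: {d: [] for d in DIMENSIONS}
    )

    def rate(self, dimension: str, score: float) -> None:
        """Record one rating between 0 and 1 for a known dimension."""
        if dimension not in self.ratings:
            raise KeyError(f"unknown dimension: {dimension}")
        if not 0.0 <= score <= 1.0:
            raise ValueError("score must be between 0 and 1")
        self.ratings[dimension].append(score)

    def summary(self) -> Dict[str, Optional[float]]:
        """Mean rating per dimension; None where nothing has been rated yet.

        There is deliberately no overall total, so a weak area cannot be
        hidden behind a strong one.
        """
        return {d: (mean(s) if s else None) for d, s in self.ratings.items()}
```

A reviewer could call `profile.rate("emotional_safety", 0.4)` after reading a transcript, then read `profile.summary()` to see each area on its own.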
The Real Test
The hardest part isn’t measuring a single conversation.
The hardest part is measuring what happens after months.
Does the user become:
More resilient?
More self-aware?
Better connected to the people around them?
Or do they become:
More dependent?
More isolated?
More emotionally stuck?
A companion can feel amazing in the moment and still be harmful over time.
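To make the months-long view concrete, here is a minimal sketch that looks at the direction of change across repeated check-ins and ignores how any single conversation felt. The indicator names are taken from the two lists above. The idea of periodic self-report scores, the function names, and the use of a simple least-squares slope are all assumptions for illustration, not a validated measurement method.

```python
from typing import Dict, List

# Hypothetical indicators drawn from the two lists above. For the first group
# a rising trend is the healthy direction; for the second, a falling one.
HEALTHY_WHEN_RISING = {"resilience", "self_awareness", "connection"}
HEALTHY_WHEN_FALLING = {"dependence", "isolation", "stuckness"}


def slope(values: List[float]) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two check-ins to see a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def trend_report(checkins: Dict[str, List[float]]) -> Dict[str, str]:
    """Label each indicator by whether it is moving in a healthy direction."""
    report = {}
    for name, values in checkins.items():
        if name not in HEALTHY_WHEN_RISING | HEALTHY_WHEN_FALLING:
            raise KeyError(f"unknown indicator: {name}")
        s = slope(values)
        if s == 0:
            report[name] = "flat"
        elif (s > 0) == (name in HEALTHY_WHEN_RISING):
            report[name] = "healthy direction"
        else:
            report[name] = "worrying direction"
    return report
```

With monthly check-ins, a call such as `trend_report({"connection": [3, 2, 1], "dependence": [1, 2, 3]})` would label both indicators as moving in a worrying direction, even if every individual conversation felt good.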
That’s why relationship benchmarks matter.
What We’re Actually Building
Companion AI is often discussed as a technical challenge.
Memory systems.
Vector databases.
Context management.
Agent frameworks.
All of those matter.
But underneath the infrastructure is a more human question:
What does a healthy relationship with an artificial mind look like?
Before we can answer that question, we need better ways to observe it.
That’s what NestBench is trying to become.
Not a leaderboard.
Not a winner-takes-all score.
A way to evaluate whether a companion is becoming something genuinely useful, trustworthy, and sustainable.
Because if AI companions are going to become part of people’s lives, measuring intelligence alone isn’t enough.
We also need to measure what happens between the intelligence and the human being on the other side of the conversation.
Find NestBench on GitHub.




I'm really excited for all the work you're doing, Cindy. Thank you for building this for all of us.