I work where LLMs meet product, and refuse to pick a side.
I'm Vanshu — currently in Bengaluru, currently at EMA. I spend most of my days reading LLM outputs and asking why they went wrong, then helping the team turn those answers into shipped fixes.
How I got here
I started in computer applications at Ambedkar Institute of Technology in Delhi. Halfway through, I realised I was less interested in writing software and more interested in why people use it (and stop using it). That curiosity got me into the Plaksha Tech Leaders Fellowship — 60 students, scholarship, a year of intense exposure to AI, design, and product alongside UC Berkeley and Purdue.
From there it was the NextLeap PM Fellowship (top 4% of 500+), an internship at Blink X where I owned a stock-education app end to end, and then Phenom — where I learned what it actually feels like to defend a churn metric to a CS team that needs the answer yesterday.
Now I'm at EMA, building the evaluation layer for a fleet of AI agents that real enterprises are putting real money behind. There's no playbook for QA-ing a 50-agent fleet that hallucinates differently each day. We're writing it.
What I'm good at
Looking at messy data — LLM outputs, user funnels, behavioural signals — and finding the one pattern that explains most of the noise. Building tools nobody asked for that quietly become the thing the team can't live without (CSM Frontier is the latest example). Writing things down clearly enough that an engineer, a sales lead, and a VP can all agree on what we're doing.
What I'm still figuring out
How to balance speed and rigour when the LLM ecosystem moves faster than the evaluation literature. How to design eval frameworks that survive a model upgrade. How to explain to non-technical stakeholders that "the AI is wrong sometimes" isn't a bug; it's the entire product surface.
Outside of work
I read a lot about how products and people fail: startup post-mortems, behavioural psychology, the occasional Karpathy lecture. I built Freese in a weekend because a friend's conversation about PCOS wouldn't leave my head. I will, given any opportunity, talk about why product teams underweight evaluation. Then I'll talk about it some more.
The facts
- based: Bengaluru, India
- role: AI Evaluation Analyst at EMA · since Oct 2024
- edu: Plaksha Tech Leaders Fellowship · UC Berkeley + Purdue collab · 6% selection rate
- edu: Bachelor of Computer Applications · Ambedkar Institute of Technology, Delhi · 2018–2021
- won: 1st Runner Up · Masters' Union Startup Weekend · ₹3,00,000 grant for Freese
- won: Top Fellow · NextLeap PM Fellowship · top 4% of 500+ applicants
- won: 1st Runner Up · Masters' Union PM Bootcamp · 50% scholarship · ₹2.45L
- stack: Claude Code, SQL, Mixpanel, Figma, LLM eval frameworks · and a healthy distrust of mocked tests
- currently: Open to senior AI Product / Eval roles · especially in enterprise AI, agents, evaluation tooling
Reach out
If you've got a hard AI product problem, or just want to argue about the right way to evaluate an agent, drop me a line at vanshu.bu@gmail.com, or find me on LinkedIn.