#benchmarks
17 episodes
#2672: 12M Token Context: Subquadratic Cracks Attention Scaling
A startup claims linear attention scaling at 12M tokens, beating GPT-5.5 on retrieval benchmarks.
#2411: Are Political Bias Benchmarks Actually Measuring Anything?
Why the Political Compass Test fails, and what researchers are building instead to actually measure model bias.
#2409: How AI Benchmarks Measure Cultural Bias
Five benchmarks that reveal how AI systems fail at cultural knowledge — and what their methodologies tell us.
#2406: Why Million-Token Context Windows Can't Handle 3 Reasoning Steps
Needle-in-a-haystack is dead. Here's what actually measures whether models can think across long documents.
#2405: LLM Benchmarks Are Full of Noise: Statistical Rigor in AI Evals
Why most benchmark claims in AI are statistically indefensible — and what to do about it.
#2404: What Tool-Calling Benchmarks Miss About Production Failures
BFCL, tau-bench, and Nexus each reveal different failure modes. None of them test what actually kills production agents.
#2403: LLM Eval Frameworks: Inspect vs Promptfoo vs DeepEval vs Braintrust
An architectural shootout of four major LLM evaluation harnesses — where each shines and where each breaks down.
#2357: AI Model Spotlight: Microsoft's Phi Family (Phi-1 through Phi-4-multimodal)
Explore Microsoft AI's Phi family of small language models, designed for edge deployment and high efficiency.
#2352: Object Detection APIs: Choosing the Right Tool for Your Workflow
How do object detection APIs like Gemini, AWS Rekognition, and YOLO compare for automated annotation workflows?
#2349: AI Model Spotlight: Trinity Large Thinking
Discover how Arcee AI’s Trinity Large Thinking delivers cutting-edge reasoning at a fraction of the cost, all from a team of just 30.
#2249: Building Custom Benchmarks for Agentic Systems
Public benchmarks fail for agentic systems. Learn how to build evaluation frameworks that actually predict production behavior.
#2239: How AI Benchmarks Became Broken (And What's Replacing Them)
The tests we use to measure AI progress are contaminated, saturated, and gamed. Here's what's actually working.
#2213: Grading the News: Benchmarking RAG Search Tools
How do you rigorously evaluate whether Tavily or Exa retrieves better results for breaking news? A formal benchmark beats the vibe check.
#2178: How to Actually Evaluate AI Agents
Frontier models score 80% on one agent benchmark and 45% on another. The difference isn't the model—it's contamination, scaffolding, and how the te...
#1831: The 79% AI Coder: Reasoning vs. Memorization
AI models now score 79% on coding benchmarks, but a 40-point drop on harder tests reveals the truth.
#1570: Weird AI Experiment: The Undercard Fight
What happens when two mid-tier AI models start gaslighting each other? Witness the chaotic showdown between MiniMax and Xiaomi’s MiMo.
#130: The Benchmark Battle: Decoding the Rise of Chinese AI
Are Chinese AI models actually beating the West, or just gaming the system? Herman and Corn dive into the reality of modern AI benchmarks.