#benchmarks
17 episodes
#2672: 12M Token Context: Subquadratic Cracks Attention Scaling
A startup claims linear attention scaling at 12M tokens, beating GPT-5.5 on retrieval benchmarks.
#2411: Are Political Bias Benchmarks Actually Measuring Anything?
Why the Political Compass Test fails, and what researchers are building instead to actually measure model bias.
#2409: How AI Benchmarks Measure Cultural Bias
Five benchmarks that reveal how AI systems fail at cultural knowledge — and what their methodologies tell us.
#2406: Why Million-Token Context Windows Can't Handle 3 Reasoning Steps
Needle-in-a-haystack is dead. Here's what actually measures whether models can think across long documents.
#2405: LLM Benchmarks Are Full of Noise: Statistical Rigor in AI Evals
Why most benchmark claims in AI are statistically indefensible — and what to do about it.
#2404: What Tool-Calling Benchmarks Miss About Production Failures
BFCL, tau-bench, and Nexus each reveal different failure modes. None of them test what actually kills production agents.
#2403: LLM Eval Frameworks: Inspect vs Promptfoo vs DeepEval vs Braintrust
An architectural shootout of four major LLM evaluation harnesses — where each shines and where each breaks down.
#2357: AI Model Spotlight: Microsoft's Phi Family (Phi-1 through Phi-4-multimodal)
Explore Microsoft AI's Phi family of small language models, designed for edge deployment and high efficiency.
#2352: Object Detection APIs: Choosing the Right Tool for Your Workflow
How do object detection APIs like Gemini, AWS Rekognition, and YOLO compare for automated annotation workflows?
#2349: AI Model Spotlight: Trinity Large Thinking
Discover how Arcee AI’s Trinity Large Thinking delivers cutting-edge reasoning at a fraction of the cost, all from a team of just 30.
#2249: Building Custom Benchmarks for Agentic Systems
Public benchmarks fail for agentic systems. Learn how to build evaluation frameworks that actually predict production behavior.
#2239: How AI Benchmarks Became Broken (And What's Replacing Them)
The tests we use to measure AI progress are contaminated, saturated, and gamed. Here's what's actually working.
#2213: Grading the News: Benchmarking RAG Search Tools
How do you rigorously evaluate whether Tavily or Exa retrieves better results for breaking news? A formal benchmark beats the vibe check.
#2178: How to Actually Evaluate AI Agents
Frontier models score 80% on one agent benchmark and 45% on another. The difference isn't the model—it's contamination, scaffolding, and how the te...
#1831: The 79% AI Coder: Reasoning vs. Memorization
AI models now score 79% on coding benchmarks, but a 40-point drop on harder tests reveals the truth.
#1570: Weird AI Experiment: The Undercard Fight
What happens when two mid-tier AI models start gaslighting each other? Witness the chaotic showdown between MiniMax and Xiaomi’s MiMo.
#130: The Benchmark Battle: Decoding the Rise of Chinese AI
Are Chinese AI models actually beating the West, or just gaming the system? Herman and Corn dive into the reality of modern AI benchmarks.