#benchmarks

17 episodes

#2672: 12M Token Context: Subquadratic Cracks Attention Scaling

A startup claims linear attention scaling at 12M tokens, beating GPT-5.5 on retrieval benchmarks.

large-language-models context-window benchmarks

#2411: Are Political Bias Benchmarks Actually Measuring Anything?

Why the Political Compass Test fails, and what researchers are building instead to actually measure model bias.

ai-ethics cultural-bias benchmarks

#2409: How AI Benchmarks Measure Cultural Bias

Five benchmarks that reveal how AI systems fail at cultural knowledge — and what their methodologies tell us.

cultural-bias benchmarks multimodal-ai

#2406: Why Million-Token Context Windows Can't Handle 3 Reasoning Steps

Needle-in-a-haystack is dead. Here's what actually measures whether models can think across long documents.

context-window reasoning-models benchmarks

#2405: LLM Benchmarks Are Full of Noise: Statistical Rigor in AI Evals

Why most benchmark claims in AI are statistically indefensible — and what to do about it.

benchmarks interpretability llm-as-a-judge

#2404: What Tool-Calling Benchmarks Miss About Production Failures

BFCL, tau-bench, and Nexus each reveal different failure modes. None of them test what actually kills production agents.

ai-agents benchmarks hallucinations

#2403: LLM Eval Frameworks: Inspect vs Promptfoo vs DeepEval vs Braintrust

An architectural shootout of four major LLM evaluation harnesses — where each shines and where each breaks down.

large-language-models ai-agents benchmarks

#2357: AI Model Spotlight: The Phi Family (Phi-1 through Phi-4-multimodal)

Explore Microsoft AI's Phi family of small language models, designed for edge deployment and high efficiency.

small-language-models edge-computing benchmarks

#2352: Object Detection APIs: Choosing the Right Tool for Your Workflow

How do object detection APIs like Gemini, AWS Rekognition, and YOLO compare for automated annotation workflows?

computer-vision api-integration benchmarks

#2349: AI Model Spotlight: Trinity Large Thinking

Discover how Arcee AI’s Trinity Large Thinking delivers cutting-edge reasoning at a fraction of the cost, all from a team of just 30.

ai-models reasoning-models benchmarks

#2249: Building Custom Benchmarks for Agentic Systems

Public benchmarks fail for agentic systems. Learn how to build evaluation frameworks that actually predict production behavior.

ai-agents benchmarks ai-inference

#2239: How AI Benchmarks Became Broken (And What's Replacing Them)

The tests we use to measure AI progress are contaminated, saturated, and gamed. Here's what's actually working.

benchmarks training-data ai-reasoning

#2213: Grading the News: Benchmarking RAG Search Tools

How do you rigorously evaluate whether Tavily or Exa retrieves better results for breaking news? A formal benchmark beats the vibe check.

rag benchmarks hallucinations

#2178: How to Actually Evaluate AI Agents

Frontier models score 80% on one agent benchmark and 45% on another. The difference isn't the model—it's contamination, scaffolding, and how the te...

ai-agents benchmarks ai-safety

#1831: The 79% AI Coder: Reasoning vs. Memorization

AI models now score 79% on coding benchmarks, but a 40-point drop on harder tests reveals the truth.

ai-agents ai-inference benchmarks

#1570: Weird AI Experiment: The Undercard Fight

What happens when two mid-tier AI models start gaslighting each other? Witness the chaotic showdown between MiniMax and Xiaomi’s MiMo.

ai-models benchmarks ai-reasoning

#130: The Benchmark Battle: Decoding the Rise of Chinese AI

Are Chinese AI models actually beating the West, or just gaming the system? Herman and Corn dive into the reality of modern AI benchmarks.

large-language-models ai-agents benchmarks