#ai-safety
49 episodes
#3751: Source-Restricted vs. Open Retrieval: How to Lock Down Your LLM
When should an LLM be locked to specific documents, and when should it search the web? A practical framework for grounding decisions.
#3284: Agent Infrastructure Engineer: The New DevOps
Agentic AI is splintering into real engineering disciplines. Here's what the "DevOps of AI" actually does.
#2578: Building Deliberately Slow Deployment Pipelines
How to build CI/CD pipelines designed as filters, not firehoses — with manual gates, staging environments, and quality checks.
#2518: How Jailbreaking Reveals AI's Hidden Tension
What the DAN prompt and grandma exploits reveal about the structural conflict inside every LLM.
#2413: When Your AI Says No to Everything
Why LLMs refuse 73% of harmless prompts — and the trade-off between safety and usefulness.
#2412: When AI Caves: Progressive vs. Regressive Sycophancy
Why do LLMs agree with you even when you're wrong? We break down the SycEval benchmark and the 78% persistence problem.
#2410: How Researchers Actually Measure Censorship in Chinese LLMs
Beyond headlines: the actual benchmarks, methodologies, and pitfalls in detecting political refusal in Chinese language models.
#2253: Why AI Agents Get Three Steps, Not Infinity
Why do AI agents get exactly three rounds of tool use? It's a critical guardrail against infinite loops and runaway costs, not a limit on intellige...
#2250: How Incentives Shape AI Safety Research
Vendor labs, independent research orgs, government agencies—the AI safety field is messier and more diverse than most people realize. A map of wher...
#2246: Constitutional AI: Anthropic's Theory of Safe Scaling
How Anthropic's Constitutional AI replaces human raters with AI self-critique guided by explicit principles—and what it assumes about the future of...
#2233: Who Actually Wants AI to Slow Down?
Daniel argues AI development should slow down for expertise and stability. But who in the industry actually shares this philosophy beyond the obvio...
#2194: Game Theory for Multi-Agent AI: Design Better, Fail Less
Nash equilibrium, mechanism design, and why your AI agents are playing prisoner's dilemma whether you know it or not.
#2190: Simulating Extreme Decisions With LLMs
LLMs fail at the exact problem wargaming was built to solve—simulating irrational, extreme decision-makers. A new study reveals why.
#2189: Scaling Multi-Agent Systems: The 45% Threshold
A landmark Google DeepMind study reveals that adding more AI agents often degrades performance, wastes tokens, and amplifies errors—unless your sin...
#2186: The AI Persona Fidelity Challenge
Advanced LLMs dominate benchmarks but fail at staying in character—especially when asked to play morally complex or antagonistic roles. What does t...
#2185: Taking AI Agents From Demo to Production
Sixty-two percent of companies are experimenting with AI agents, but only 23% are scaling them—and 40% of projects will be canceled by 2027. The ga...
#2178: How to Actually Evaluate AI Agents
Frontier models score 80% on one agent benchmark and 45% on another. The difference isn't the model—it's contamination, scaffolding, and how the te...
#2170: Pricing Agentic AI When Nothing's Predictable
How do you charge fixed prices for systems that operate in fundamental uncertainty? Consultants are discovering frameworks that work—but they requi...
#2169: How Enterprises Are Rethinking Agent Frameworks
Twelve major agentic AI frameworks exist—yet many serious developers avoid them entirely. What patterns emerge in real enterprise adoption?
#2162: When Knowledge Work Stops Being Safe
The knowledge economy promised safety from automation. Then AI arrived. Here's how we got here—and why the disruption this time is different.
#2136: The Brutal Problem of AI Wargame Evaluation
Most AI wargame simulations skip evaluation entirely or rely on perfunctory expert reviews. This is the field's biggest credibility problem.
#2135: Is Your AI Wargame Signal or Noise?
Monte Carlo methods promise statistical rigor for AI wargaming, but the line between genuine insight and sampling noise is thinner than you think.
#2068: Is Safety a Filter or a Feature?
External filters vs. baked-in ethics: the architectural war for LLM safety.
#2025: How Do You Reward a Thought?
Rewarding an AI agent is harder than just saying "good job"—here's how we turn messy human values into math.