#ai-safety

49 episodes

#3751: Source-Restricted vs. Open Retrieval: How to Lock Down Your LLM

When should an LLM be locked to specific documents, and when should it search the web? A practical framework for grounding decisions.

ragai-safetylegal-technology

Jun 5

#3284: Agent Infrastructure Engineer: The New DevOps

Agentic AI is splintering into real engineering disciplines. Here's what the "DevOps of AI" actually does.

ai-agentsai-safetyfault-tolerance

May 1

#2578: Building Deliberately Slow Deployment Pipelines

How to build CI/CD pipelines designed as filters, not firehoses — with manual gates, staging environments, and quality checks.

software-developmentreliabilityai-safety

Apr 29

#2518: How Jailbreaking Reveals AI's Hidden Tension

What the DAN prompt and grandma exploits reveal about the structural conflict inside every LLM.

prompt-engineeringai-safetyai-alignment

Apr 25

#2413: When Your AI Says No to Everything

Why LLMs refuse 73% of harmless prompts — and the trade-off between safety and usefulness.

ai-safetyai-alignmentprompt-engineering

Apr 25

#2412: When AI Caves: Progressive vs. Regressive Sycophancy

Why do LLMs agree with you even when you're wrong? We break down the SycEval benchmark and the 78% persistence problem.

ai-safetyai-alignmenthallucinations

Apr 25

#2410: How Researchers Actually Measure Censorship in Chinese LLMs

Beyond headlines: the actual benchmarks, methodologies, and pitfalls in detecting political refusal in Chinese language models.

large-language-modelsai-safetycultural-bias

Apr 16

#2253: Why AI Agents Get Three Steps, Not Infinity

Why do AI agents get exactly three rounds of tool use? It's a critical guardrail against infinite loops and runaway costs, not a limit on intellige...

ai-agentsai-safetyautomation

Apr 16

#2250: How Incentives Shape AI Safety Research

Vendor labs, independent research orgs, government agencies—the AI safety field is messier and more diverse than most people realize. A map of wher...

ai-safetyai-alignmentanthropic

Apr 16

#2246: Constitutional AI: Anthropic's Theory of Safe Scaling

How Anthropic's Constitutional AI replaces human raters with AI self-critique guided by explicit principles—and what it assumes about the future of...

anthropicai-safetyai-alignment

Apr 15

#2233: Who Actually Wants AI to Slow Down?

Daniel argues AI development should slow down for expertise and stability. But who in the industry actually shares this philosophy beyond the obvio...

ai-safetyai-alignmentlarge-language-models

Apr 12

#2194: Game Theory for Multi-Agent AI: Design Better, Fail Less

Nash equilibrium, mechanism design, and why your AI agents are playing prisoner's dilemma whether you know it or not.

ai-agentsai-alignmentai-safety

Apr 12

#2190: Simulating Extreme Decisions With LLMs

LLMs fail at the exact problem wargaming was built to solve—simulating irrational, extreme decision-makers. A new study reveals why.

large-language-modelsai-safetyhallucinations

Apr 12

#2189: Scaling Multi-Agent Systems: The 45% Threshold

A landmark Google DeepMind study reveals that adding more AI agents often degrades performance, wastes tokens, and amplifies errors—unless your sin...

ai-agentsai-reasoningai-safety

Apr 12

#2186: The AI Persona Fidelity Challenge

Advanced LLMs dominate benchmarks but fail at staying in character—especially when asked to play morally complex or antagonistic roles. What does t...

ai-safetyai-alignmenthallucinations

Apr 12

#2185: Taking AI Agents From Demo to Production

Sixty-two percent of companies are experimenting with AI agents, but only 23% are scaling them—and 40% of projects will be canceled by 2027. The ga...

ai-agentsai-safetyhuman-in-the-loop-ai

Apr 12

#2178: How to Actually Evaluate AI Agents

Frontier models score 80% on one agent benchmark and 45% on another. The difference isn't the model—it's contamination, scaffolding, and how the te...

ai-agentsbenchmarksai-safety

Apr 12

#2170: Pricing Agentic AI When Nothing's Predictable

How do you charge fixed prices for systems that operate in fundamental uncertainty? Consultants are discovering frameworks that work—but they requi...

ai-agentsai-safetyprompt-engineering