Welcome to My Weird Prompts. I'm Corn, he's Herman, and today we are doing an AI Model Spotlight. The model is Palmyra X5, built by a company called Writer. Herman, you want to set the scene on who Writer actually is?
Writer is an enterprise AI platform company, writer.com, and they have been at this for about five years now. Their positioning is pretty specific: they are not trying to be a general-purpose consumer AI company. They are going after Fortune 500 enterprises, and they are particularly focused on regulated industries, so healthcare, finance, legal, that kind of territory. The pitch is reliability and control, which is a different pitch than you get from a lot of the frontier labs.
They have got some funding weight behind them at this point.
In November of twenty twenty four they closed a two hundred million dollar Series C at a one point nine billion dollar valuation. The round was led by Premji Invest, Radical Ventures, and ICONIQ Growth. So they are not a scrappy startup at this point. They have got real institutional backing and a clear enterprise focus.
Where does Palmyra X5 sit in their model lineup?
It is the flagship. The Palmyra family has a few members. You have got Palmyra X4, which was their previous frontier model, released in late twenty twenty four. Then they have built out some vertical specialists: Palmyra Fin for financial services, Palmyra Med for healthcare, and Palmyra Vision, which handles video and image understanding as a separate model. X5 sits above all of those as the general-purpose workhorse, and it is the one they are positioning for agentic enterprise workloads specifically.
The lab has a track record, a clear enterprise lane, and now they are putting their most capable model forward. That is the context. Let us get into what Palmyra X5 actually is.
What are we actually looking at with Palmyra X5? What kind of model is it?
At the top level it is a multimodal large language model. Text and image go in, text and structured output come out. The image output side is not mentioned, so we are not talking about image generation here. It is a model built for reading and reasoning, not for producing pictures.
This is where I have to be honest about what we know and what we do not. Writer describes it as a hybrid transformer. That is the term on the model card. What they do not tell us is what the hybrid actually means in practice. Is there a Mixture of Experts component? Is there a state space model layer in there? We do not know. The card does not disclose parameter count either, so we cannot do the usual size comparison against other frontier models. What we can say is that the supplementary research mentions a dynamic routing system that activates only relevant expert subnetworks, which does suggest some kind of sparse or conditional compute architecture, but Writer has not confirmed the specifics, so I am not going to state that as fact.
What does that mean in practice if you are evaluating it?
It means you are evaluating it on behaviour rather than on architecture. Which is not unusual for proprietary models, but it is worth flagging. You are trusting the benchmark scores and your own evals rather than reasoning from first principles about model design.
The number everyone is going to notice is the context window.
One million tokens. That is the headline. And they are not just claiming the window exists, they are giving a concrete throughput figure: a full one million token prompt processes in approximately twenty-two seconds. That is a meaningful claim because long-context models often have a gap between what the window supports on paper and what is actually usable at speed. Twenty-two seconds for a full million-token load is fast enough to be practical in a production pipeline.
To put that in terms of actual documents, what does a million tokens get you?
Roughly speaking, you are looking at somewhere in the range of several hundred dense documents, or a handful of very large ones. A full year of SEC filings from a major company, an entire clinical trial dataset, a large codebase. The use case Writer keeps pointing to is things like RFPs, ten-K filings, electronic health records, regulatory submissions. Documents where you genuinely need the whole thing in context at once rather than chunking and hoping your retrieval does not miss something.
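To make that sizing concrete, here is a back-of-envelope sketch. The words-per-token ratio and the report length are illustrative assumptions, not figures from Writer; real tokenizer counts vary by model and by document.

```python
# Back-of-envelope sizing for a one-million-token context window.
# Assumes ~0.75 English words per token, a common rough heuristic;
# actual tokenization varies by model and content.

CONTEXT_TOKENS = 1_000_000
WORDS_PER_TOKEN = 0.75

total_words = CONTEXT_TOKENS * WORDS_PER_TOKEN  # ~750,000 words of prose

# Treat a dense ten-page report as very roughly 5,000 words.
words_per_report = 5_000
reports_in_context = int(total_words // words_per_report)
print(reports_in_context)  # roughly 150 such reports in one prompt
```

The point of the arithmetic is only that "several hundred dense documents" is the right order of magnitude, which is why whole-corpus inputs like a year of filings become plausible.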
Then there is the tool-calling story.
Which is the other headline number. Multi-turn function calls at approximately three hundred milliseconds. That matters enormously for agentic workflows because if your model is the bottleneck in a tool loop, your agent feels slow. Three hundred milliseconds keeps the loop tight. Writer has also built native multi-agent orchestration into the model, meaning it is not just capable of calling tools, it is designed to coordinate across multiple agents as part of its core function rather than as a bolt-on.
The design philosophy here is pretty clearly pointed at agents, not at chat.
That is the read. The context window, the tool-call latency, the structured output support, the native orchestration. Every one of those is an agentic capability. This is not a model Writer built for someone to have a conversation with. It is a model they built to run workflows.
Alright, let us talk about what this thing costs. Herman, before you get into the numbers, I know we have a standing caveat on pricing.
We do, and it matters here. All pricing we are about to cite is as of April twenty, twenty twenty six. These numbers shift, sometimes weekly, and with a model this new there is every chance the pricing page looks different by the time you are listening to this. So treat these as a reference point, not a contract.
With that on the record, what are we looking at?
The structure is straightforward. Input is sixty cents per million tokens. Output is six dollars per million tokens. So a ten-to-one ratio between input and output cost, which is fairly standard for models in this class.
In practical terms, what does sixty cents per million tokens actually mean for a team running real workloads?
It means the context window is not punishingly expensive to fill. If you are sending a full million-token prompt, you are paying sixty cents for the input side of that call. The output cost is where it adds up, as it always does, but if your workload is heavy on ingestion and lighter on generation, which is common in document analysis and RAG pipelines, the economics are reasonably friendly.
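The ingestion-heavy economics are easy to sanity-check. This sketch uses the rates quoted above as of the episode's pricing snapshot; the helper name and the example token counts are ours, and the rates may have changed since.

```python
# Rough cost of a single call at the cited rates:
# input $0.60 per million tokens, output $6.00 per million tokens.
# Rates are a point-in-time snapshot and may no longer be current.

def call_cost(input_tokens: int, output_tokens: int,
              in_rate: float = 0.60, out_rate: float = 6.00) -> float:
    """Return USD cost of one call; rates are per million tokens."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Ingestion-heavy call: a full one-million-token prompt,
# two thousand tokens of summary back out.
print(round(call_cost(1_000_000, 2_000), 3))  # 0.612
```

So even at the full window, the input side is sixty cents and the output side is pennies; the ten-to-one output rate only starts to dominate when generation volume is high.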
Writer is also making a direct comparison to GPT-4.1 on cost. They are claiming three to four times cheaper per token.
That is their claim, and the input price does support it directionally. We have not done an independent line-by-line comparison across every tier, but the headline numbers are consistent with that framing.
Where can you actually access it?
Five channels listed. The Writer direct API, Writer Agent Builder which is currently in beta, Writer No-code, the Writer Framework, and Amazon Bedrock. So if your infrastructure is already on AWS, you have a native path in without touching Writer's own platform.
One gap worth noting.
No cached input pricing listed, and no batch pricing either. If those tiers exist, they are not on the page we pulled. So if your use case involves heavy prompt caching or large batch jobs, you would need to contact Writer directly to find out whether there is a better rate available.
Let us get into the benchmarks. What is Writer actually claiming here, and what does the evidence support?
There are six benchmarks on the page. I will take them in order of how much we can actually say about them. The most useful one for this model specifically is the MRCR eight-needle test, which is OpenAI's long-context retrieval benchmark. Palmyra X5 scores nineteen point one percent. GPT-4.1 scores twenty point two five percent. GPT-4o scores seventeen point six three percent.
It sits between those two OpenAI models.
Slightly below GPT-4.1, slightly above GPT-4o. And that is the comparison Writer is leaning on hardest, which makes sense. If you are positioning a model on price-performance for long-context retrieval, you want to show you are in the same tier as the frontier models while costing significantly less. That framing holds up. The scores are close enough that the cost differential becomes the deciding factor for a lot of teams.
What about the other benchmarks?
This is where I want to be careful, because the page gives scores without named comparators for most of them. BBH, which is Big-Bench Hard, comes in at seventy point nine nine percent. The page says this aligns closely with top-tier models, but does not name which ones. GPQA, the graduate-level science reasoning benchmark, is forty-seven point two percent. MMLU Pro is sixty-five point zero two percent. MATH_HARD is seventy-one point five seven percent. Those are the numbers. What we cannot do is tell you whether seventy-one percent on MATH_HARD is competitive with Claude or Gemini, because Writer does not make that comparison on the page, and we are not going to invent one.
The code benchmark?
BigCodeBench, Full Instruct, scores forty-eight point seven. Writer says this ranks among the top models on that evaluation. Again, no named comparators given, so we are taking that characterisation at face value without being able to verify the league table position independently.
There is also the Stanford HELM ranking.
The AWS Bedrock announcement notes that Palmyra X5 and X4 are top-ranked on Stanford's HELM benchmark. HELM is a broad multi-task evaluation framework, so that is a meaningful signal if it holds up. But we are working from a secondary reference there, not the HELM leaderboard directly, so treat it as a data point worth following up rather than a settled fact.
The honest summary being: the numbers that come with comparators look solid, and the numbers without comparators are harder to contextualise.
That is exactly the read. The MRCR result is the one you can anchor to something concrete. The rest tells you the model is in a reasonable range for frontier work, but the full competitive picture is not on the page.
Let us talk about where this model actually earns its keep. Given everything we have covered on the architecture and the benchmarks, what workloads are you pointing teams toward?
The clearest fit is anything that lives at the intersection of long context and agentic work. Multi-agent orchestration is the obvious starting point. The sub-three-hundred-millisecond function call latency is not just a marketing number. If you are building a pipeline where agents are handing off tasks, calling tools, and waiting on responses, that latency compounds across every step. Shaving time there has real downstream effects on pipeline throughput.
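The compounding is just multiplication, but it is worth seeing the numbers. This is purely illustrative arithmetic: it assumes every step in the loop costs one model round trip at the quoted latency and ignores the tools' own runtime.

```python
# How per-call model latency compounds across an agent's tool loop.
# Illustrative only: one round trip per step, tool runtime ignored.

def loop_time_s(steps: int, call_latency_ms: float) -> float:
    """Total model-side wait, in seconds, for a sequential tool loop."""
    return steps * call_latency_ms / 1000

# A twenty-step agent workflow:
print(loop_time_s(20, 300))   # 6.0 seconds at ~300 ms per call
print(loop_time_s(20, 1200))  # 24.0 seconds if each call took 1.2 s instead
```

Four seconds versus twenty-four over the same workflow is the difference between an agent that feels responsive and one that feels stalled, which is why the per-call figure matters more for agents than for chat.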
The built-in RAG and Knowledge Graph connectors, how much does that actually matter versus rolling your own?
It matters for teams that do not want to own that plumbing. If you are a mid-size enterprise shop and your core competency is not infrastructure, having those connectors native to the model rather than bolted on as a separate layer reduces the surface area for things to break. The one-million-token context window also changes the RAG calculus a bit. You can push more raw document content into the context rather than relying entirely on retrieval to surface the right chunks. That is useful when the relevant information is distributed across a long document and hard to isolate with a retrieval query alone.
What kinds of documents are we talking about?
The page is specific here. RFPs, ten-K filings, electronic health records, regulatory filings, product catalogs, internal playbooks. The common thread is long, dense, structured or semi-structured documents where you need the model to reason across the full span rather than just pull a paragraph. The twenty-two-second processing time for a full one-million-token prompt is the relevant figure. That is fast enough to be practical in a workflow context.
What about code generation?
Writer claims a top ranking on BigCodeBench, and the score of forty-eight point seven is there on the page. We noted earlier we cannot verify the league table position independently, but the number is not embarrassing for production code tasks. If your use case is generating production-ready scripts or integrating code generation into a larger agentic workflow, it is worth evaluating. It is not positioned as a pure coding model the way some others are, but it is not an afterthought either.
Where does it not apply?
Anything involving audio or speech is out. No audio input, no audio output mentioned anywhere on the page. Video input is also not supported here. Palmyra Vision is the separate model in the family for that. Image input is supported, but image output is not mentioned, so do not assume it. And there is no indication of on-device or edge deployment, so if that is a requirement, this is not your model.
Let us talk about how the industry has actually received this. We have covered what Writer claims. What is the outside world saying?
The honest answer is that independent reception is still thin. The model appears to be a recent release, and closed weights mean there is no community fine-tuning or open evaluation work to draw on. What we do have is a handful of structured sources, and they are broadly positive, but we should be clear about what kind of positive we are talking about.
Walk me through it.
The clearest third-party signal is the Stanford HELM benchmark. Writer's page and the Amazon Bedrock announcement both reference Palmyra X5 ranking at the top of HELM. That is a credible external evaluation framework, not a lab-run benchmark, so it carries more weight than the self-reported scores we covered earlier. The caveat is we do not have the full HELM leaderboard in front of us to verify the exact position or the comparison set, so we are working from what the lab and AWS are citing.
The Amazon Bedrock availability itself, is that meaningful as a signal?
Getting onto Bedrock is not automatic. AWS has a selection process for which third-party models it hosts, and the enterprise customer base on that platform is substantial. The fact that both Palmyra X5 and X4 are listed there suggests Writer has cleared at least some baseline bar for reliability and enterprise readiness. It also means procurement teams at large organisations can access the model through an existing vendor relationship, which matters more than people outside enterprise sales tend to appreciate.
What about the engineering community? Any chatter worth noting?
There is a comparison piece from Yellow Systems that puts Palmyra X5 alongside GPT-4.1 and Claude, and the framing there is favourable on the price-performance angle specifically. The dynamic routing architecture gets mentioned as a reason for the latency and cost profile. The claim is that activating only relevant expert subnetworks reduces compute overhead while keeping reasoning quality up. We should flag that the architecture details on the model card are thin, so some of that framing may be drawing inferences from performance data rather than confirmed implementation details.
Any red flags or controversies in what you found?
The coverage is largely positive and enterprise-focused. The gaps we would flag are the ones we have already noted: no independent benchmark replication, no visibility into training data or safety evaluations, and the Agent Builder platform is still in beta. Those are not controversies, but they are things a cautious engineering team would want resolved before committing to a production dependency.
Alright, let us land this. If you are an AI professional sitting with a shortlist of models for your next enterprise build, where does Palmyra X5 actually belong on that list?
The clearest yes is if your workload lives at the intersection of long-context processing and cost sensitivity. A one-million-token context window at sixty cents per million tokens on input is a competitive combination. If you are ingesting regulatory filings, large document sets, RFPs, or anything where you need to hold a lot of material in context simultaneously, the price-performance case is real and the MRCR retrieval scores support it.
The agentic side?
That is the other strong yes. The sub-three-hundred-millisecond function call latency and the native multi-agent tooling are not marketing language, they are architectural priorities. If you are building orchestration layers where agents are calling tools in tight loops, that latency profile matters. The built-in RAG and Knowledge Graph connectors also reduce integration work that you would otherwise be building yourself.
What is the no?
A few clear ones. If your pipeline requires audio or speech processing, Palmyra X5 is not the answer. Same for video. Palmyra Vision is a separate model in the family, so Writer has something there, but X5 itself does not cover those modalities. If you need edge or on-device deployment, there is no signal anywhere in the documentation that this model supports that, and given the apparent scale of the architecture, it would be surprising if it did.
What about teams that need full visibility into what they are running?
That is a legitimate hesitation. Closed weights, undisclosed parameter count, no published training data or knowledge cutoff, and no safety evaluation documentation that we could find. For teams in regulated environments that require model auditability or have procurement requirements around those disclosures, those gaps need to be resolved before this goes into production. Writer's enterprise positioning suggests they can address some of that through commercial agreements, but we cannot confirm that from the public documentation.
Short version for someone skimming.
Long-context enterprise workflows, agentic pipelines, cost-constrained RAG builds: that is where this model earns its place. For audio, video, edge deployment, or anywhere you need full model transparency before signing a contract, look elsewhere, or ask Writer to fill in the gaps before you commit.
That is Palmyra X5. Thanks for listening to My Weird Prompts.