So Daniel sent us this one, and it's a topic that doesn't get nearly enough airtime given how central it is to everything happening in AI right now. He's asking about data annotation tools — specifically, what dataset curators should actually know about the landscape. Which platforms exist, how they differ, when you reach for open-source versus enterprise, and how the whole field is shifting as AI starts annotating its own training data. There's also a pretty wild industry shakeup in here involving Meta that I think deserves its own conversation. A lot of ground to cover.
Herman Poppleberry here, and yeah, this is one of those topics where the gap between public awareness and actual importance is enormous. Everyone talks about model architecture, training compute, benchmark scores — and almost nobody talks about the fact that every single one of those models started with a human being sitting at a screen, drawing a box around a car or ranking two chatbot responses. That's the foundation.
And by the way, today's script is coming to us via Claude Sonnet four point six, so our friendly AI down the road is helping us talk about the work that makes AI possible. There's something almost poetic about that.
Deeply recursive. Anyway — the numbers here are staggering. The global data annotation market was valued at around three point seven billion dollars in twenty twenty-four, and projections have it hitting over seventeen billion by twenty thirty. That's a compound annual growth rate above twenty-five percent. The open-source segment alone is expected to grow from roughly five hundred million this year to two point seven billion by twenty thirty-three.
And yet if you ask most people what data annotation is, they'll give you a blank stare.
Which is striking, because ML engineers reportedly spend more than eighty percent of their time on data preparation and labeling. Not on training models, not on architecture decisions — on getting the data ready. That ratio has barely moved despite years of tooling improvements, and I think that tells you something important about the nature of the problem.
Before we get into the tools themselves, let's just quickly ground people on what annotation actually involves, because it's not one thing. It's a pretty wide spectrum.
It really is. At the simpler end you've got bounding boxes — drawing rectangles around objects in images. That's your classic "label the car, label the pedestrian" work. Then you get into polygons and segmentation masks, which are pixel-precise outlines. Semantic segmentation labels every single pixel in an image by class. Instance segmentation goes further and distinguishes individual instances of the same class — so not just "there are three people here" but "this is person one, this is person two, this is person three." Then there's keypoint annotation for body joints and landmarks, which feeds things like pose estimation. And then you've got an entirely different world with three-dimensional LiDAR point clouds for autonomous vehicles, DICOM medical imaging annotation, named entity recognition and text classification for NLP, audio transcription, and increasingly, RLHF preference ranking — which is humans evaluating and ranking AI-generated outputs to shape LLM fine-tuning.
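To make those categories concrete, here's a rough sketch of how the main annotation types differ once they're written down as data. The field names are illustrative, loosely modeled on COCO-style conventions rather than any specific tool's export schema.

```python
# Illustrative sketch of how the main annotation types differ as data.
# Field names and values are examples only, not a particular tool's format.

bounding_box = {
    "image_id": 17,
    "category": "car",
    # [x, y, width, height] in pixels -- a rectangle is just four numbers
    "bbox": [412.0, 230.5, 180.0, 95.0],
}

instance_segmentation = {
    "image_id": 17,
    "category": "person",
    "instance_id": 2,            # "this is person two", not just "a person"
    # polygon as a flat list of x, y vertex coordinates tracing the outline
    "segmentation": [501.0, 120.0, 540.0, 118.0, 555.0, 260.0, 498.0, 262.0],
}

keypoints = {
    "image_id": 17,
    "category": "person",
    # (x, y, visibility) triplets for named body joints -- feeds pose estimation
    "keypoints": {
        "left_shoulder": (510, 150, 2),
        "right_shoulder": (548, 151, 2),
        "left_elbow": (505, 190, 1),   # visibility 1 = labeled but occluded
    },
}
```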
That last one is the interesting new entrant. Annotating for a self-driving car and annotating for a chatbot are almost completely different disciplines.
Almost philosophically different. With a LiDAR point cloud, you're measuring physical reality — is that object a cyclist or a scooter? There's a ground truth. With RLHF preference data, you're asking humans to make subjective judgments about which AI response is better, and two equally competent annotators can read the same prompt-response pair and disagree for completely valid reasons. Multiply that across thousands of tasks, multiple domains, and rubrics that get updated weekly, and you understand why this is genuinely hard.
Okay, so let's get into the tools. You've got a pretty clear split between open-source and enterprise here. Where do you want to start?
Let's start open-source because I think it's undersold. The flagship here is CVAT — Computer Vision Annotation Tool. Originally developed by Intel in twenty seventeen, it's since spun off as an independent company. Over two hundred thousand developers worldwide use it. It handles images, video, LiDAR point clouds, supports bounding boxes, polygons, polylines, keypoints, three-dimensional cuboids. Crucially, it has AI-assisted labeling built in — Mask R-CNN, YOLO, and Meta's Segment Anything Model are all integrated. You get video interpolation, object tracking, role-based access control, and integration with AWS S3 and Azure Blob storage.
So it's not just a drawing tool. It's a fairly complete annotation environment.
For computer vision, it's genuinely enterprise-grade. There's a community version you self-host, a cloud version, and an enterprise tier. The limitation is setup and maintenance — you need someone technical to run it. But if you have that, it's remarkably capable for zero licensing cost.
And then there's Label Studio, which seems to be the other major open-source name.
Label Studio is interesting because it goes wider. CVAT is deep on computer vision; Label Studio is multi-modal from the ground up — text, image, audio, video, time-series data. It has a REST API and Python SDK, active learning integration, and a custom labeling interface builder so you can design your own annotation UI. They added spectrogram support, PDF annotation, evaluation sets, and scoring rubrics for LLM output review in recent updates. One user described using it with a custom script to auto-label data, manually correcting errors, retraining, and repeating — that active learning loop is exactly how you'd want to use it.
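That loop is worth sketching, because it's the core pattern regardless of which platform runs it. The callables passed in below are placeholders for whatever you wire to Label Studio's REST API or SDK and to your own training code; this is the shape of the workflow, not a specific integration.

```python
# A minimal sketch of the auto-label / correct / retrain loop described above.
# fetch_unlabeled, push_predictions, pull_corrections, and retrain are
# stand-ins you would implement against your annotation tool and ML pipeline.

def active_learning_loop(model, fetch_unlabeled, push_predictions,
                         pull_corrections, retrain, rounds=5, batch_size=500):
    for _ in range(rounds):
        # 1. Pull a batch of still-unlabeled items from the annotation project
        batch = fetch_unlabeled(batch_size)

        # 2. Pre-label them with the current model and push the results as
        #    predictions, so annotators see suggestions instead of a blank canvas
        predictions = [model.predict(item) for item in batch]
        push_predictions(batch, predictions)

        # 3. Humans correct the predictions in the labeling UI; pull back the
        #    corrected annotations once they've been reviewed
        corrected = pull_corrections(batch)

        # 4. Retrain on everything labeled so far, so the next round's
        #    pre-labels are better and need fewer corrections
        model = retrain(model, corrected)

    return model
```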
So Label Studio is the Swiss Army knife and CVAT is the precision scalpel.
That's a reasonable way to put it. Then you've got a few more specialized open-source options. LabelImg is the beginner entry point — simple bounding box annotation, Pascal VOC and YOLO export formats, graphical interface. It has no AI assistance and doesn't scale, but if you're just starting out with object detection, it gets you going. Doccano is the NLP equivalent — sequence labeling, text classification, span-based annotation, multi-user collaboration, Docker deployment. It's text-only but it's solid for building sentiment analysis or named entity recognition datasets.
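For reference, the YOLO export format LabelImg produces is about as simple as labels get: one text file per image, one line per object, with a class index and box coordinates normalized to the zero-to-one range. Here's a small sketch of the pixel-to-YOLO conversion:

```python
# YOLO label format: one .txt file per image, one line per object:
#
#   <class_id> <x_center> <y_center> <width> <height>
#
# with all coordinates normalized by image width and height.

def to_yolo_line(class_id, box, img_w, img_h):
    """box is (x_min, y_min, x_max, y_max) in pixels."""
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# A 200x100 car box in a 1280x720 image, class 0:
print(to_yolo_line(0, (400, 300, 600, 400), 1280, 720))
# -> "0 0.390625 0.486111 0.156250 0.138889"
```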
What about the more niche scientific end of things?
WEBKNOSSOS is worth mentioning for anyone in neuroscience or biomedical imaging. It handles terabyte-scale volumetric three-dimensional datasets — we're talking neuron tracing, cell segmentation. Completely different world from mainstream ML annotation, but if you're in that domain, it's the tool. And Diffgram is interesting as an open-source option that tries to bring enterprise features to teams who don't want to pay enterprise prices — dataset version control, audit trails, active learning, cloud and on-premises support.
Okay, so that's the open-source landscape. Now, when does it make sense to pay for enterprise tooling? Because some of these enterprise prices are... not small.
Labelbox starts at twenty-five thousand dollars annually. SuperAnnotate is custom enterprise pricing, which usually means more than that. So the question is real. The honest answer is it depends on three things: scale, security requirements, and whether you need managed annotation services. If you're a startup training a model on ten thousand images and you have a technical team, CVAT or Label Studio will serve you well. If you're running production AI at a company with HIPAA compliance requirements, a hundred annotators across multiple time zones, and petabyte-scale datasets, you need the enterprise stack.
Let's walk through the major enterprise players then. Who's at the top?
SuperAnnotate is ranked number one on G2 for data labeling with a four point nine out of five rating across a hundred and sixty-eight reviews, which is a pretty robust sample. It's backed by NVIDIA, Databricks Ventures, and Dell Technologies Capital among others — and yes, Lionel Messi's investment vehicle is apparently also a backer, which I enjoy knowing. It covers multimodal annotation, has a custom workflow and UI builder, four hundred-plus vetted annotation service teams globally, and SOC two Type Two, ISO twenty-seven-thousand-and-one, GDPR, and HIPAA compliance. The key selling point is genuine customizability — they describe themselves as the only platform that can fully adapt to a client's specific needs rather than asking clients to adapt to the platform.
And Encord?
Encord was founded in twenty twenty by former quants and physicists, which shows in the product's rigor. A hundred and ten million dollars in funding, Series C. It supports images, video, text, audio, DICOM medical data, with SAM-2, GPT-4o, and Whisper integrated for AI-assisted labeling. Natural language search across datasets, human-in-the-loop workflows, rubric-based evaluation. G2 rating of four point eight out of five. What makes Encord distinctive is the emphasis on data curation before annotation — their thesis is that you should be selective about what you label rather than labeling everything, which directly addresses the efficiency problem.
That connects to something I find genuinely underappreciated, which is that annotation quality is as much about choosing what to annotate as how you annotate it. Labeling redundant or low-value data is just waste.
Which is exactly what Lightly focuses on as a standalone product. Lightly isn't really an annotation tool — it's a data selection and curation tool that sits upstream. It uses active learning and self-supervised learning to identify the most diverse and informative samples from a large dataset before you send anything to annotators. The idea being that if you can find the five percent of your data that covers ninety percent of your distribution, you've just dramatically cut your annotation budget.
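The selection step itself can be surprisingly simple. Below is a minimal sketch of greedy farthest-point sampling over image embeddings, one common way to approximate "most diverse subset." It's an illustration of the idea, not Lightly's actual algorithm.

```python
import numpy as np

def select_diverse_subset(embeddings: np.ndarray, budget: int) -> list[int]:
    """Greedy farthest-point sampling: repeatedly pick the sample farthest
    from everything already selected. A simple proxy for diversity."""
    selected = [0]  # seed with an arbitrary first point
    # distance from every point to its nearest already-selected point
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    for _ in range(budget - 1):
        next_idx = int(np.argmax(dists))          # farthest from the current set
        selected.append(next_idx)
        new_d = np.linalg.norm(embeddings - embeddings[next_idx], axis=1)
        dists = np.minimum(dists, new_d)          # update nearest-selected distances
    return selected

# e.g. keep the 5% most informative-looking images before paying for labels
embs = np.random.rand(10_000, 512)                # stand-in for model embeddings
subset = select_diverse_subset(embs, budget=500)
```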
That's the kind of tool that sounds boring until you realize it might be more valuable than the annotation tool itself.
For large-scale projects, absolutely. Then there's Labelbox — a hundred and ninety million in funding from Andreessen Horowitz, Kleiner Perkins, First Round Capital. Started by founders from the aerospace industry. Its strength is RLHF workflows — pairwise ranking, scoring, rewriting completions — which is increasingly where the action is for anyone building or fine-tuning language models. They have a managed services network called Alignerr that connects you to a workforce of annotators. And they have a model-assisted labeling product called Foundry that uses your own model to pre-label data, which you then review and correct.
V7 Labs is one I see mentioned a lot in computer vision circles.
UK-based, founded in twenty eighteen, about forty-three million in funding. The founders previously built an AI product for the visually impaired, which is an interesting origin. It's primarily computer vision — images, video, three-dimensional point clouds — with automated labeling and model management built in. They've been expanding into document processing and AI agent-assisted annotation. G2 rates them four point eight out of five, second easiest to use in data labeling software. They're not trying to do everything; they're going deep on vision.
And Roboflow has sort of become the go-to for the computer vision startup and research community.
Roboflow is interesting because it's a full pipeline tool, not just an annotation tool. You upload data, annotate it, train a model, and deploy — all in one environment. Their AI-assisted labeling includes something called Autodistill, which uses foundation models to automatically generate labels that you then review. They ranked ninth in G2's twenty twenty-six AI product awards. The limitation is they're image and video only — no NLP, no audio. But for computer vision applications, especially at the startup or research stage, it's extremely accessible.
Let's talk about the cloud provider offerings — SageMaker Ground Truth, Vertex AI — because I think a lot of teams default to those just because they're already in the AWS or Google Cloud ecosystem.
And that's often a mistake, or at least a decision made for the wrong reasons. SageMaker Ground Truth has legitimate strengths — it integrates with Amazon Mechanical Turk for crowdsourced labeling, or you can bring your own private workforce. It handles images, video, text, audio, point clouds, and GenAI tasks. Active learning is built in — it automatically labels the examples it's confident about and routes the ambiguous ones to humans. Pay-as-you-go pricing with a free tier for the first couple of months. The G2 rating is four point one out of five, which is the lowest of the major platforms we're discussing, and the feedback consistently points to setup complexity and costs that escalate quickly at scale.
So it's convenient if you're AWS-native but not necessarily the best tool for the job.
Vertex AI on the Google side has similar dynamics — it's tightly integrated with Google Cloud's AutoML and model training infrastructure, which is great if you're already there. But as a dedicated annotation platform, it's limited compared to purpose-built tools. The customization options are narrower, and the human labeling costs at scale can get expensive fast.
Okay, so we've covered the major players. I want to get into the Meta and Scale AI situation because that story is genuinely dramatic and I think it reveals something important about how the industry views annotation infrastructure.
So here's what happened. In mid-twenty twenty-five, Meta acquired a forty-nine percent stake in Scale AI at a valuation of roughly thirty billion dollars, and brought Scale's CEO Alexandr Wang on as Meta's Chief AI Officer. Scale had been the dominant data annotation provider in the industry — their revenue was around eight hundred and seventy million dollars annually, and their client list included Google, OpenAI, and essentially every major AI lab.
And then their clients panicked.
Google and OpenAI both shifted their annotation work away from Scale almost immediately, because the logic is pretty obvious — if you're training a competing AI model and your annotation vendor is now forty-nine percent owned by Meta, your training data might be visible to a competitor. That's not a risk you take. So overnight, Scale lost two of its largest customers.
And the vacuum got filled fast.
Surge AI, which was founded in twenty twenty, surpassed Scale in revenue — reaching one point two billion dollars compared to Scale's eight hundred and seventy million. Mercor hit a ten billion dollar valuation and around five hundred million in revenue by recruiting subject-matter experts. Micro1 raised at a five hundred million dollar valuation and grew from seven million to fifty million in annual recurring revenue within twenty twenty-five alone.
The Mercor number is the one that gets me. Ten billion dollar valuation for a company that essentially recruits expert annotators.
And it signals something real about where the industry is moving, which is away from cheap crowdsourced labeling toward expert annotation. Mercor pays contractors up to two hundred dollars per hour and reportedly distributes one point five million dollars per day to its workforce. The thesis is that as models get more capable, the data that actually moves the needle is high-quality expert judgment — doctors annotating medical data, lawyers annotating legal documents, senior engineers reviewing code. Cheap crowdsourcing can label a bounding box around a car. It cannot reliably evaluate whether a legal brief is well-reasoned.
There's something almost ironic about that. AI gets more powerful and the humans who shape it become more specialized and more expensive, not less.
The economics of AI development are genuinely strange. And it connects to the eighty percent problem we mentioned earlier — ML engineers spending eighty percent of their time on data preparation. The bet with all of these tools is that AI-assisted annotation can change that ratio. And there's real evidence it can. Integrating AI agents into annotation pipelines — using SAM for automatic segmentation, YOLO for pre-labeling, GPT-4o for text classification — can cut manual annotation effort by roughly fifty percent and reduce costs by a factor of four while maintaining accuracy.
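The routing logic behind that kind of pipeline is worth seeing in miniature. In the sketch below, `predict` stands in for whichever pre-labeling model you use (a YOLO detector, SAM, a GPT-4o classifier), and the two thresholds are assumptions you'd tune per task.

```python
# Confidence-based routing in a pre-labeling pipeline: confident predictions
# become draft labels, ambiguous ones go to humans. `predict` is a stand-in
# for your pre-labeling model; thresholds are illustrative, not prescriptive.

def route_prelabels(items, predict, accept_threshold=0.90, review_threshold=0.50):
    auto_accepted, needs_review, needs_full_label = [], [], []
    for item in items:
        label, confidence = predict(item)    # model's best guess and its score
        if confidence >= accept_threshold:
            auto_accepted.append((item, label))      # spot-check a sample later
        elif confidence >= review_threshold:
            needs_review.append((item, label))       # human verifies the draft
        else:
            needs_full_label.append(item)            # human labels from scratch
    return auto_accepted, needs_review, needs_full_label
```

Where you set those thresholds is exactly where the tension in the next exchange lives: raise them and humans see less of the data, lower them and the cost savings shrink.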
But there's a catch there, right? Because if AI is pre-labeling data and humans are just approving it, you've potentially introduced a systematic bias that's very hard to detect.
This is the thing that keeps serious ML practitioners up at night. The risk is what you might call automation-hidden errors. If your pre-labeling model has a consistent blind spot — say, it consistently misclassifies objects in low light — and your human reviewers are approving at high speed because the model is usually right, you've just baked that blind spot into your training data at scale. And then you train a new model on that data, which inherits the blind spot, and potentially use that model to pre-label the next dataset. The loop compounds errors rather than correcting them.
Which is why the quality control infrastructure around annotation matters as much as the annotation itself.
Platforms like Taskmonk are specifically built around this problem. Their workflow starts with rubric design — defining the criteria for quality before a single label is applied. They use gold tasks, which are known-correct examples seeded into the workflow to test annotator accuracy in real time. They have consensus checks where multiple annotators label the same item and disagreements get escalated. And they route low-confidence AI-labeled items to senior reviewers rather than letting them pass through automatically.
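Those two mechanisms, gold tasks and consensus checks, are simple enough to sketch. This is a minimal illustration of the logic, not Taskmonk's implementation.

```python
from collections import Counter

def gold_task_accuracy(annotator_labels, gold_labels):
    """Fraction of seeded known-correct items this annotator got right.
    Both arguments are {item_id: label} dicts."""
    graded = [item for item in gold_labels if item in annotator_labels]
    if not graded:
        return None
    correct = sum(annotator_labels[i] == gold_labels[i] for i in graded)
    return correct / len(graded)

def consensus_or_escalate(labels, min_agreement=2):
    """Majority vote across annotators; items without a clear majority are
    escalated to a senior reviewer instead of passing through."""
    winner, count = Counter(labels).most_common(1)[0]
    if count >= min_agreement and count > len(labels) / 2:
        return winner, False          # consensus reached
    return None, True                 # no consensus -> escalate

# Example: three annotators label the same item
label, escalate = consensus_or_escalate(["cyclist", "scooter", "cyclist"])
# -> ("cyclist", False): two of three agree, so it passes without escalation
```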
Let's talk about the RLHF annotation world specifically, because I think it's different enough from computer vision annotation that it deserves its own treatment.
It's almost a different profession. In computer vision, the ground truth exists — the car is either in that bounding box or it isn't. In RLHF annotation, you're asking humans to evaluate AI-generated text for qualities like helpfulness, honesty, harmlessness, factual accuracy, tone, and completeness. These are subjective, context-dependent, and sometimes contradictory. Two annotators can read the same chatbot response and have genuinely different assessments of whether it's appropriately cautious or unnecessarily evasive.
And the annotation categories for LLM training are pretty specific.
Right, so you've got supervised fine-tuning data — high-quality prompt-response pairs that demonstrate desired behavior. Preference data — humans choosing between two responses or ranking multiple completions. Safety reviews — identifying refusals, boundary violations, harmful content, jailbreak patterns. Evaluation datasets — fixed test sets used to track model quality over time across dimensions like factuality and tone. And multimodal alignment — connecting text with images, audio, video, and documents. Labelbox explicitly supports all of these workflows, which is part of why it's prominent for teams doing post-training alignment work.
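As a rough illustration of what two of those artifacts look like on disk, here are example records for a supervised fine-tuning pair and a preference judgment. The field names and values are made up for illustration; every platform has its own schema.

```python
# Illustrative records for two common LLM annotation artifacts.
# Field names are examples only, not any particular platform's schema.

# Supervised fine-tuning: a prompt paired with a demonstration of the
# desired response, written or heavily edited by a human.
sft_example = {
    "prompt": "Explain what a bounding box annotation is, in two sentences.",
    "response": "A bounding box is a rectangle drawn around an object in an "
                "image to mark its location and class. It is the simplest and "
                "most common form of object-detection labeling.",
}

# Preference data: two model completions for the same prompt, plus a human
# judgment about which is better and why (the rationale helps audit rubric drift).
preference_example = {
    "prompt": "Summarize the risks of AI-assisted pre-labeling.",
    "completion_a": "...",
    "completion_b": "...",
    "preferred": "a",
    "rationale": "A flags the compounding-bias risk; B only mentions cost.",
    "rubric_version": "2025-06-rev3",   # rubrics change, so version them
}
```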
The rubric design problem is underappreciated here. If your rubric is ambiguous, your data is ambiguous, and your model learns ambiguity.
And rubrics change. As models improve and as deployment contexts shift, what counts as a good response evolves. So you're not just annotating once — you're maintaining a living annotation system that needs to stay calibrated to current standards. That's a significant operational overhead that I don't think most people building their first model anticipate.
Let's bring this to something practical. If someone is a dataset curator trying to figure out which tool to use, what's the actual decision framework?
Seven criteria, in rough order of importance. First, does it support the data modalities you're working with? There's no point evaluating a tool's enterprise features if it doesn't handle your data type. CVAT and Supervisely are strong for three-dimensional LiDAR. Doccano and Datasaur are strong for NLP. Encord and Labelbox cover the broadest range. Second, what level of AI assistance does it offer? SAM integration, YOLO pre-labeling, and GPT-4o classification are now table stakes for serious annotation work. If a tool doesn't have this, you're leaving efficiency on the table.
Third would be scalability and infrastructure, I'd assume.
Yes — can it handle your dataset size, support your team size, integrate with your cloud storage and ML pipeline? For small projects this doesn't matter much. For production AI it's critical. Fourth is security and compliance — HIPAA, GDPR, SOC two, ISO twenty-seven-thousand-and-one. If you're working with medical data or user data from regulated jurisdictions, this narrows your options significantly. Fifth is workflow customization — can you design the annotation interface and review process to match your specific task, or are you constrained to the tool's defaults? Sixth is collaboration features — role-based access, task assignment, quality review, inter-annotator agreement tracking. And seventh is the total cost of ownership, which for open-source tools includes setup and maintenance time, not just licensing fees.
The total cost of ownership point is underemphasized. CVAT is free, but if it takes a senior engineer two weeks to set up and maintain, that's not free.
The hidden cost of open-source is engineering time. The hidden cost of enterprise is vendor lock-in and the fact that your annotation data and workflows may become dependent on a platform that can change pricing or get acquired. The Meta-Scale situation is a useful reminder that annotation infrastructure is strategic — who holds your data and your labeling pipeline matters.
What about synthetic data as a way out of this whole problem? There's been a lot of hype around it.
The synthetic data market is projected to go from around half a billion dollars this year to two point seven billion by twenty thirty. Some providers claim synthetic datasets can reduce manual labeling requirements by up to seventy percent in certain domains. And for specific use cases — generating edge case scenarios for autonomous vehicle training, augmenting rare medical conditions in imaging datasets — synthetic data is genuinely valuable. But the catch is that synthetic data still needs human validation. You need annotators to verify that synthetic examples are realistic and correctly labeled, especially for edge cases where the whole point is that real examples are rare. So it complements annotation rather than replacing it.
The fundamental insight being that human judgment seems stubbornly irreducible at some stage in the pipeline.
The last mile problem. You can automate the bulk, but someone has to verify the edge cases, maintain the rubrics, calibrate the quality, and catch the systematic errors that automated systems generate. That's why the expert annotator premium exists and why it's growing. And honestly, it's why the eighty percent number hasn't moved much despite all the tooling advances — the hard part isn't the mechanical labeling, it's the judgment calls.
For dataset curators specifically, what's the practical takeaway from all of this?
A few things. One — don't default to the simplest tool or the most famous tool. Match the tool to your data type and scale. LabelImg is fine for a student project; it's not fine for production. Two — budget for AI-assisted labeling from the start. The efficiency gains from SAM integration or YOLO pre-labeling are real and the tools that don't have this are increasingly behind. Three — think about your annotation pipeline as a system, not just a labeling interface. Data selection, quality control, inter-annotator agreement, rubric management, version control — all of this matters and the best platforms address all of it. Four — if you're doing any LLM fine-tuning or alignment work, recognize that you're in a fundamentally different annotation paradigm than computer vision and choose tools accordingly.
And five — pay attention to the market dynamics, because who owns your annotation vendor matters.
The Scale AI situation made that concrete in a way that I think will permanently change how AI labs think about annotation infrastructure. It's strategic, not operational. The teams that understand that will make better decisions.
One open question I keep coming back to: as foundation models get better at generating their own training data — through synthetic data, through self-play, through model-generated annotations — does the annotation market eventually plateau? Or does demand just keep expanding as AI applications multiply?
My read is that demand expands. Every new AI application domain — medical imaging, legal analysis, financial forecasting, autonomous systems — requires domain-specific annotation that can't be bootstrapped from general-purpose models. The expert annotator premium isn't going away; if anything, it intensifies. The tools get better, the automation handles more of the routine work, but the ceiling on what AI systems are being asked to do keeps rising and pulling the annotation requirements up with it.
The treadmill accelerates.
It does. Which is both a business opportunity and a genuine challenge for anyone trying to build high-quality datasets efficiently.
Alright, I think that covers the landscape pretty thoroughly. Thanks as always to our producer Hilbert Flumingtop for keeping this whole operation running. And a big thank you to Modal for providing the GPU credits that power the show — we genuinely could not do this without them. This has been My Weird Prompts, episode two thousand one hundred and twenty-four. If you're enjoying the show, a quick review on your podcast app goes a long way toward helping new listeners find us. We'll see you next time.
See you then.