#2438: How Object Storage Actually Works Under the Hood

Blobs, flat namespaces, and why those "folders" in cloud storage are complete illusions.

Episode Details
Episode ID
MWP-2596
Duration
25:54
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Object storage is the invisible foundation beneath almost everything we call cloud storage. But its architecture is fundamentally different from the file systems most of us grew up with — and understanding that difference matters whether you're migrating data, building an application, or just wondering why renaming a folder in cloud storage can take hours.

What Is a Blob, Really?

BLOB stands for Binary Large Object, a term popularized by Microsoft's Azure Blob Storage but universal across all object storage systems. Every object or blob consists of exactly three components: the raw binary data itself, a unique identifier (the key or address), and metadata describing the file name, content type, size, and creation time. That's it. There's no directory tree, no inode table, no file allocation table.

Object storage is accessed through APIs, not mounted drives. You send a request with a key and get an object back. This design makes it perfect for unstructured data — images, videos, backups, log files, virtual machine images — where you don't need a file system hierarchy to find things, just the key.
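The access model can be sketched as a toy key-value store. This is illustrative code for the concept only — the names `ObjectStore`, `put`, and `get` are made up here and are not any provider's actual API:

```python
from dataclasses import dataclass

@dataclass
class StoredObject:
    data: bytes       # the raw binary content
    metadata: dict    # name, content type, size, timestamps, ...

class ObjectStore:
    """Toy flat-namespace store: one map from full key to object, nothing else."""
    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes, **metadata):
        # The key is the whole address; slashes in it are just characters.
        self._objects[key] = StoredObject(data, dict(metadata))

    def get(self, key: str) -> StoredObject:
        return self._objects[key]  # direct lookup; no directory traversal

store = ObjectStore()
store.put("photos/2025/vacation.jpg", b"...jpeg bytes...",
          content_type="image/jpeg")
obj = store.get("photos/2025/vacation.jpg")
```

Note that nothing in the store represents "photos" or "photos/2025" — only the full key exists.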

The Great Folder Illusion

When you look at an object storage bucket, it appears to have folders. You see paths with slashes, nested directories, a familiar hierarchy. But that's a complete fiction. Object storage uses what's called a flat namespace — every single object exists at the same logical level. What looks like a folder path (e.g., photos/2025/vacation.jpg) is actually just the object's key. The slashes are characters in that key. They have no special meaning to the storage system.

When the API shows you a folder called "photos," it's generating that illusion on the fly. You list objects with a delimiter (usually a forward slash), and the system scans all keys, groups them by the common prefix before that delimiter, and returns those prefixes as "common prefixes." There's no directory entry stored anywhere, no folder metadata, no inode. It's purely a query-time construction.
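That query-time grouping can be sketched over a plain list of keys — the function name and return shape here are illustrative, not a real SDK call:

```python
def list_with_delimiter(keys, prefix="", delimiter="/"):
    """Mimic how an object-store List API fakes folders at query time."""
    objects, common_prefixes = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the first delimiter is reported as a "folder".
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)

keys = ["photos/2025/vacation.jpg", "photos/2024/ski.jpg", "readme.txt"]
objs, prefixes = list_with_delimiter(keys)
# objs -> ["readme.txt"], prefixes -> ["photos/"]
```

Listing again with `prefix="photos/"` would yield the "subfolders" `photos/2024/` and `photos/2025/` — each one computed from the keys on the spot, never read from stored directory metadata.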

This has real consequences. Renaming a folder isn't updating a pointer — it's rewriting the keys of every single object inside that prefix. Deleting a folder means issuing a delete command for every object whose key starts with that prefix. If the operation gets interrupted, you have a partially deleted folder. This is fundamentally different from a local file system where removing a directory is a single atomic operation on an inode.
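In code, a "folder rename" is a loop, not a pointer update. A sketch over a plain dict standing in for a bucket:

```python
def rename_prefix(bucket: dict, old: str, new: str) -> int:
    """Rewrite every key under `old` to live under `new`; return keys touched.

    There is no folder to rename: each object is copied to a new key and the
    old key deleted, one at a time. An interruption mid-loop leaves a mix
    of old and new keys behind.
    """
    moved = 0
    for key in [k for k in bucket if k.startswith(old)]:
        bucket[new + key[len(old):]] = bucket.pop(key)
        moved += 1
    return moved
```

Against a real provider, each iteration would be a copy request plus a delete request, so the cost scales with the number of objects under the prefix.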

The Trade-Off That Makes It Scale

The flat namespace is the secret sauce that lets object storage scale to billions of objects without performance degradation. Because there's no directory tree to traverse, no inode limit to hit, no file system metadata bottleneck, you can store enormous amounts of data in a single bucket. The trade-off is that operations that are instant on a local file system — moving a directory, renaming a folder — become massive operations in object storage.

Google Cloud Storage has experimented with a Hierarchical Namespace feature that adds actual folder resources with atomic rename operations. But it must be enabled at bucket creation and is irreversible. The fact that it's still in preview, years into cloud storage maturity, shows how deeply the flat namespace is baked into the architecture.

Size Limits: A Moving Target

Object size limits vary dramatically across providers and have recently changed. As of December 2025, Amazon S3 increased its maximum object size from 5 terabytes to 50 terabytes by leveraging its multipart upload architecture. A single PUT upload is still capped at 5 gigabytes, but by splitting files into parts (5 MiB to 5 GiB each, up to 10,000 parts), you can upload enormous objects.
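The part-size arithmetic behind those numbers can be sketched as follows — the limits are the S3 figures quoted above, and the helper name `plan_multipart` is hypothetical:

```python
import math

PART_MIN = 5 * 1024**2    # 5 MiB minimum part size
PART_MAX = 5 * 1024**3    # 5 GiB maximum part size
MAX_PARTS = 10_000

def plan_multipart(size_bytes: int):
    """Choose a part size for a multipart upload; return (part_size, n_parts)."""
    part_size = max(PART_MIN, math.ceil(size_bytes / MAX_PARTS))
    if part_size > PART_MAX:
        raise ValueError("object exceeds the multipart ceiling")
    return part_size, math.ceil(size_bytes / part_size)

# The ceiling implied by these limits: 10,000 parts of 5 GiB each,
# about 53.7 terabytes -- the basis of the ~50 TB headline figure.
CEILING = MAX_PARTS * PART_MAX
```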

Azure Blob Storage is even more generous, supporting block blobs up to approximately 190.7 tebibytes (50,000 blocks, each up to 4,000 MiB). Google Cloud Storage caps at 5 terabytes per object. These hard ceilings matter when planning migrations — you need to know which provider you're targeting.

RClone: The Interoperability Layer

RClone is the go-to tool for syncing data between cloud storage providers. Written in Go and first released in 2012, it uses a backend interface system where every provider implements two interfaces: Fs (file system methods like List, NewObject, Put) and Object (Open, Update, Hash, modtime methods). RClone translates between different provider APIs through this common interface.

When syncing, RClone compares objects by size, modification time, and optionally checksums or hashes. If those match, it skips the transfer. But there's a critical limitation: RClone does not implement delta encoding. Unlike rsync, which can detect which parts of a file changed and only transfer differences, RClone transfers objects as complete units. This isn't a technical limitation of Go — it's because cloud storage APIs don't expose byte-range patching operations. If you change one byte in a 50 GB file, RClone re-uploads all 50 GB.
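RClone itself is written in Go; a simplified Python sketch of that comparison logic, with illustrative names, might look like:

```python
from typing import NamedTuple, Optional

class Obj(NamedTuple):
    size: int
    modtime: float
    md5: Optional[str] = None  # a hash may be unavailable on some backends

def needs_transfer(src: Obj, dst: Obj, checksum: bool = False) -> bool:
    """Sketch of an rclone-style equality check: size and modtime by default,
    hash comparison only when requested and both sides expose the same hash."""
    if checksum and src.md5 and dst.md5:
        return src.size != dst.size or src.md5 != dst.md5
    return src.size != dst.size or src.modtime != dst.modtime
```

When `needs_transfer` returns true, the whole object is re-uploaded — there is no branch for "send only the changed bytes," because the provider APIs offer no such operation.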


#2438: How Object Storage Actually Works Under the Hood

Corn
Daniel sent us this one — he wants to talk about object storage as the real foundation underneath everything we call cloud storage. What actually is a blob, what is an object, how big can these things get if you're planning a migration, and then the part I find genuinely interesting — when we organize files into folders, how is that hierarchy actually recorded? Because it turns out it's nothing like what's happening on your laptop. And finally, interoperability. If you've got a blob store and you want to sync it with Google Drive, or sync Google Drive with Google Cloud Storage, without handing your data to some questionable third-party tool — RClone is the go-to. He wants to know how it actually works under the hood.
Herman
Oh, this is a good one. And by the way, quick note — today's episode script is being generated by DeepSeek V four Pro. Which I appreciate, because this topic has layers.
Corn
It really does. And I think the place to start is just defining what a blob actually is, because people throw that word around and it's one of those terms where everyone nods and half the room is bluffing.
Herman
So BLOB stands for Binary Large Object. The term was popularized by Microsoft with Azure Blob Storage, but the concept is universal across all object storage. An object — or a blob — has exactly three components. The data itself, the raw binary content. A unique identifier, which is basically a key or an address. And metadata, which is descriptive information like the file name, the content type, the size, when it was created. That's it.
Corn
The key distinction, the thing that separates this from the file systems we grew up with, is how you access it.
Herman
Object storage is accessed through HTTP APIs. You're not mounting it as a drive, you're not using SMB or NFS. You send a request, you get an object back. It's designed for unstructured data — images, videos, documents, backups, log files, entire virtual machine images. The whole point is that you don't need a file system hierarchy to find things, you just need the key.
Corn
This is where the folder question gets really interesting, because when you look at an object storage bucket, it sure looks like it has folders. You see paths, you see slashes, you see nested directories. But that's a lie.
Herman
It's a complete fiction. And I love this because it's one of those things where the user interface is so convincing that most people never question it. Object storage uses what's called a flat namespace. Every single object exists at the same logical level. There are no real directories, no real folders. What looks like a folder path — say, photos slash two thousand twenty-five slash vacation dot jpg — that entire string is just the object's key. The slashes are just characters in the key. They have no special meaning to the storage system.
Corn
When the API shows you a folder called photos, what's actually happening?
Herman
When you list objects and you specify a delimiter — typically the forward slash — the API scans all the keys, groups them by the common prefix before that delimiter, and returns those prefixes as what it calls common prefixes. It's generating the illusion of folders on the fly. There's no directory entry stored anywhere. There's no inode. No folder metadata exists. It's purely a query-time construction.
Corn
Which means that if you rename a folder, you're not actually renaming anything. You're rewriting the keys of every single object inside that prefix.
Herman
And if you delete a folder, the system has to issue a delete command for every single object whose key starts with that prefix. It's not atomic. If the operation gets interrupted halfway through, you've got a partially deleted folder. This is fundamentally different from a local file system where removing a directory is a single operation on an inode.
Corn
This is the thing I think most people don't appreciate. On your local computer, when you move a directory from one place to another, the files don't move. The operating system just updates a pointer. It's instantaneous. In object storage, moving a folder means copying every object to new keys and then deleting the old ones. It's a massive operation.
Herman
That's exactly the trade-off that makes object storage scale the way it does. Because there's no directory tree to traverse, there's no inode limit you can hit, there's no file system metadata bottleneck. You can store billions of objects in a single bucket and the performance doesn't degrade the way it would on a traditional file system. The flat namespace is the whole secret sauce.
Corn
There's an interesting exception here, though. Google Cloud Storage has been experimenting with something they call Hierarchical Namespace.
Herman
I was reading about this. It's still in preview, and here's the key detail — you have to enable it at bucket creation, and once it's on, it's irreversible. You can't go back to flat namespace. What it gives you is actual folder resources with atomic rename operations and efficient hierarchical listing. It's designed for workloads like Hadoop and Spark where you're doing a lot of file-oriented operations and the flat namespace overhead becomes painful.
Corn
The irreversibility is telling. It means the flat namespace isn't just a default setting, it's baked into the architecture. Adding hierarchy on top requires a fundamental change to how the bucket works internally.
Herman
And the fact that it's still in preview, years into cloud storage maturity, tells you how hard this problem is. Most object storage users just learn to work with the flat namespace and design their key structures accordingly.
Corn
Let's talk about size. If I'm migrating data into object storage, how big can a single file be? Because the numbers have changed recently, and some of them are surprising.
Herman
This is where it gets fun. Let me go provider by provider. Amazon S three, as of December two thousand twenty-five — so less than five months ago — increased their maximum object size from five terabytes to fifty terabytes. That's a ten-x jump. And they did it by leveraging their multipart upload architecture.
Corn
Fifty terabytes in a single object. That's a lot of vacation photos.
Herman
But here's the mechanism. A single PUT upload is still capped at five gigabytes. Anything larger than that, you have to use multipart upload. You split the file into parts, each part must be between five mebibytes and five gibibytes, you upload up to ten thousand parts, and then you issue a completion request that stitches them together into one object on the server side.
Corn
The fifty terabyte limit is essentially ten thousand parts times five gibibytes each. That's the math.
Herman
And that's S three. Now, Azure Blob Storage is actually even more generous, and has been for longer. Their block blobs can reach about one hundred ninety point seven tebibytes. That's fifty thousand blocks, each up to four thousand mebibytes, using their newer service version. Append blobs max out around one hundred ninety-five gibibytes, and page blobs at eight tebibytes.
Corn
One hundred ninety tebibytes versus fifty terabytes. That's a significant gap.
Herman
It's almost four times larger. And Google Cloud Storage comes in at five terabytes for a single object, with their resumable upload protocol recommended for large files. So you've got this interesting spread — Google at five terabytes, Amazon now at fifty terabytes, Azure at nearly one hundred ninety-one tebibytes. If you're planning a migration, you need to know which provider you're targeting because these limits are hard ceilings.
Corn
The multipart upload thing is worth dwelling on for a second, because it's not just about hitting the size limit. It's also about reliability. If you're uploading a forty terabyte file over a connection that might drop, you don't want to restart from zero.
Herman
That's exactly the design rationale. Each part is uploaded independently. If part seven hundred fails, you re-upload part seven hundred. You don't lose the other six hundred ninety-nine parts. And you can upload parts in parallel, which dramatically improves throughput. The multipart architecture is what makes these enormous object sizes practically usable.
Corn
We've got the what and the how big. Let's get to the part that I think is practical for a lot of listeners — interoperability. You've got data in Google Drive, you want it in Google Cloud Storage. Or you've got an S three bucket and you want to sync it to Azure. Daniel mentioned tools like Multisync, which he described as having huge privacy risks. And he's right to be skeptical.
Herman
This is where RClone comes in, and I should say upfront — RClone is one of those tools that, once you understand what it's doing, you realize how much heavy lifting it's handling. It was written in Go by Nick Craig-Wood, first released in two thousand twelve. It's MIT-licensed, fully open source. And the architecture is elegant.
Corn
Walk me through it. I point RClone at a Google Drive folder and a Google Cloud Storage bucket, I tell it to sync, and magic happens. What's actually going on under the hood?
Herman
RClone uses what's called a backend interface system. Every cloud provider — Google Drive, Google Cloud Storage, S three, Azure, Backblaze B two, you name it — implements two interfaces. One is called Fs, which stands for file system, and one is called Object. The Fs interface defines methods like List, NewObject, Put, Mkdir. The Object interface provides Open, Update, Hash, and modification time methods.
Corn
It's an abstraction layer. RClone doesn't care what's on the other side, as long as it speaks these interfaces.
Herman
And this is what makes it so powerful. When you run rclone sync gdrive:folder gcs:bucket/folder, RClone is translating between the Google Drive API and the Google Cloud Storage API through this common interface. It lists objects on both sides, compares them, and transfers what's different.
Corn
How does it decide what's different? Because that's the critical question for a sync tool.
Herman
RClone has an internal equal function that compares three things. Size, modification time, and optionally a checksum or hash. By default it checks size and modtime first, because those are fast and cheap to query. If those match, it assumes the files are the same and skips the transfer. If you use the checksum flag, it switches to hash comparison as the primary method, but only when both the source and destination support compatible hashes.
Corn
What hashes are we talking about?
Herman
MD5 is common for Google Drive and S three, SHA-1 for others. The constraint is that both sides need to speak the same hash. If one side uses MD5 and the other uses SHA-256, RClone can't do a direct hash comparison without downloading and computing both.
Corn
Here's the thing that surprised me when I was reading about this. RClone does not implement delta encoding. Unlike rsync, which can detect which parts of a file changed and only transfer the differences, RClone transfers objects as complete units. One-to-one mapping.
Herman
This is a huge point, and it's not a technical limitation of Go. It's a limitation of cloud provider APIs. Cloud storage APIs don't expose byte-range patching operations. You can't say, hey S three, bytes five hundred through six hundred of this object changed, here's the new data. You have to re-upload the entire object.
Corn
If I have a fifty gigabyte file and I change one byte of metadata, RClone is re-uploading all fifty gigabytes.
Herman
This is fundamentally different from how rsync works on a local file system or over SSH, where it uses a rolling checksum algorithm to identify changed blocks and only sends those. RClone can't do that because the cloud providers don't give it the primitives to do partial object updates.
Corn
Which raises an interesting question. Is this a bug or a feature? Because there's an argument that immutable object replacement is actually simpler and safer.
Herman
I think it's both. The simplicity argument is real — you never have to worry about partial updates leaving an object in an inconsistent state. Every object is either fully written or not written at all. But the cost argument is also real, especially for large files that change frequently. If you're syncing a fifty gigabyte database dump every night and only a few megabytes actually changed, RClone is still pushing fifty gigabytes.
Corn
Your egress bill reflects that.
Herman
Your egress bill absolutely reflects that. This is why people who use RClone for large datasets tend to be thoughtful about their object sizes and their sync frequency.
Corn
Let me ask about something Daniel specifically mentioned — the privacy risks of tools like Multisync versus RClone. What's the actual trust model here?
Herman
RClone is open source, MIT-licensed, the entire codebase is on GitHub. You can audit it, you can build it from source, you can verify that the binary you're running matches the source. But here's the nuance — when you configure RClone for Google Drive, it runs a local web server on your machine to capture the OAuth redirect. You're authenticating through Google's OAuth flow, and RClone stores the resulting token locally. The token never goes to RClone's servers because RClone doesn't have servers.
Corn
That's the key distinction. Tools like Multisync often route your authentication through their own infrastructure. They see your tokens, they potentially see your data.
Herman
With RClone, your credentials stay on your machine. The tool is just a binary running locally, making direct API calls from your network to Google's servers or Amazon's servers or wherever. There's no intermediary. Now, the caveat is that you're trusting that the binary you downloaded hasn't been tampered with. But that's a supply chain trust question that applies to every piece of software you run.
Corn
You can mitigate that by building from source if you're sufficiently paranoid.
Herman
Or by verifying checksums of the official releases. RClone publishes SHA-256 hashes for all their binaries.
Corn
There's another piece of the RClone architecture I want to touch on — the optional interfaces. You mentioned that backends can expose capabilities like server-side copy or move. What happens when a backend doesn't support one of those?
Herman
RClone falls back to downloading and re-uploading. So if you're moving an object from one bucket to another within the same S three account, S three supports a server-side copy operation — RClone can just issue that API call and it's instantaneous, no data transfer needed. But if you're moving from Google Drive to S three, there's no server-side copy between providers, so RClone downloads the object to your machine and uploads it to the destination.
Corn
Which means your local bandwidth becomes the bottleneck.
Herman
And this is where RClone's multi-threaded upload support becomes valuable. When the backend allows it, RClone splits transfers into parallel chunks, which can dramatically improve throughput on high-latency connections. But fundamentally, if you're moving data between providers, your machine is in the middle.
Corn
There's another feature worth mentioning — the encryption backend. RClone can wrap any storage backend with encryption, so the data is encrypted before it ever leaves your machine. The cloud provider never sees the plaintext.
Herman
This is implemented as a backend that wraps another backend. So you can have an encrypted Google Drive remote, an encrypted S three remote, whatever. The encryption happens client-side using NaCl secretbox, which is XSalsa20 and Poly1305. The file names are encrypted too, and directory structure is obfuscated. The cloud provider just sees random-looking blobs with random-looking names.
Corn
Which is a nice segue back to the folder question, because if the directory structure is just key prefixes anyway, encrypting those prefixes is straightforward. You're just scrambling the string that was already just a string.
Herman
The flat namespace makes encryption simpler in some ways. There's no directory metadata to leak, because there never was any directory metadata.
Corn
Let me try to pull this together into something practical. If someone's listening and they're thinking about migrating data into object storage, or syncing between providers, what should they actually know?
Herman
I think the first thing is to understand the flat namespace and design your key structure intentionally. Don't just replicate your local folder hierarchy and assume it'll work the same way. Think about how you'll list objects, how you'll manage lifecycle policies, how you'll handle large-scale renames.
Corn
Because you won't be renaming folders. You'll be rewriting keys.
Herman
The second thing is to know your provider's object size limits. If you're on Google Cloud Storage, five terabytes is your ceiling per object. If you're on S three, fifty terabytes. If you're on Azure, nearly one hundred ninety-one tebibytes. Plan your file sizes accordingly. And if you're dealing with files over five gigabytes, make sure your tooling supports multipart or resumable uploads.
Corn
Third, if you're using RClone, understand the lack of delta encoding. For large files that change frequently, object storage might not be the ideal sync target. You might want to think about whether those files should be chunked differently, or whether your sync strategy needs to account for the full re-upload cost.
Herman
The other RClone consideration is the checksum strategy. The default size and modtime comparison is fast and usually sufficient, but if you need cryptographic certainty that your files are identical, use the checksum flag and make sure both your remotes support the same hash algorithm.
Corn
The privacy angle. If you're syncing sensitive data between cloud providers, RClone gives you a local-only trust model. Your credentials, your data, your encryption keys all stay on your machine. That's fundamentally different from a service that proxies your traffic.
Herman
I think there's also a broader point here about how cloud storage has evolved. Object storage started as this back-end infrastructure thing that only developers touched. Now it's the foundation for consumer products, for business workflows, for backup strategies. And the tools that bridge these worlds — RClone being the standout example — are doing a lot of translation work that most users never see.
Corn
The folder illusion is a perfect example of that translation work. The UI shows you folders because that's what humans expect. But underneath, it's all flat keys. RClone has to reconcile that illusion across providers that implement the illusion slightly differently.
Herman
It does it remarkably well, given the constraints. The fact that you can sync Google Drive to Google Cloud Storage with a single command, and RClone handles the OAuth, the API translation, the checksum verification, the retry logic, the parallel transfers — that's impressive engineering.
Corn
You mentioned Google Drive OAuth. What does that setup actually look like for someone who's never done it?
Herman
RClone has a built-in configuration wizard. You run rclone config, you choose Google Drive from the list of remotes, it gives you a URL to open in your browser, you authenticate with Google, and then RClone captures the redirect on a local web server — usually on localhost port 53682 or something similar. The token gets saved in RClone's configuration file. After that, all API calls use that token. For Google Cloud Storage, you can use service account JSON keys, which is better for unattended operation — no browser flow needed.
Corn
The bisync feature for bidirectional sync?
Herman
RClone bisync is relatively new and it's designed for the case where files might change on both sides between syncs. It's more complex than a one-way sync because it has to detect conflicts, but the underlying mechanism is the same interface system. It's just comparing modtimes and sizes in both directions and flagging conflicts when both sides changed.
Corn
I want to circle back to something about the fifty terabyte S three limit. The fact that this only changed in December two thousand twenty-five — for most of cloud computing history, you couldn't put a single object larger than five terabytes in S three. That's kind of remarkable.
Herman
And it tells you something about how these systems were originally designed. S three launched in two thousand six. For nearly twenty years, five terabytes was the ceiling. The architecture was built around the assumption that objects would be relatively small, and the multipart upload system was designed to make large uploads reliable, not to enable truly enormous objects. The jump to fifty terabytes required rethinking some of those internal limits.
Corn
Azure was already at one hundred ninety tebibytes. So Amazon was playing catch-up on maximum object size.
Herman
Though I'd argue that for the vast majority of use cases, five terabytes was already more than enough. The fifty terabyte limit is really for specialized workloads — scientific data, media production, genomic sequencing. Most people are never going to hit even the old limit.
Corn
Unless you're backing up raw video footage or something.
Herman
Sure, video production is one of those edge cases where file sizes get enormous. An hour of uncompressed eight-K video can be multiple terabytes. For those workflows, the fifty terabyte limit is meaningful.
Corn
Now — Hilbert's daily fun fact.
Herman
The average cumulus cloud weighs about one point one million pounds. That's roughly the same as one hundred elephants.
Corn
If someone's listening and they want to start using RClone, what's the first practical step?
Herman
It's available in basically every package manager — Homebrew, apt, Chocolatey. Then run rclone config and walk through the wizard for your first remote. The documentation is excellent, and the forum is active. Start with a dry run — rclone sync with the dry-run flag — to see what it would do without actually transferring anything.
Corn
The dry run feature is underrated. You can see exactly which files would be copied, which would be deleted, before you commit to anything.
Herman
RClone's output is surprisingly readable. It shows you the transfer progress, the speed, the ETA. It's a command-line tool, but it's designed for humans.
Corn
One thing we haven't touched on is cost. When you're syncing between providers, you're paying egress fees. Google Drive to Google Cloud Storage within Google's network might be free or cheap. Google Drive to S three means paying Google's egress and Amazon's ingress.
Herman
This is the hidden cost of multi-cloud. The data transfer fees can dwarf the storage costs if you're not careful. RClone doesn't solve that — it just makes the transfer possible. You still need to understand the pricing model of each provider.
Corn
The pricing models are deliberately complex.
Herman
Egress fees are how cloud providers make switching painful. It's not an accident that moving data out costs more than storing it.
Corn
Which makes tools like RClone even more valuable, because at least the tool itself isn't adding another layer of cost or risk.
Herman
The tool is free, open source, and runs on your own hardware. The only costs are the cloud provider costs you'd be paying anyway.
Corn
I think the final thing I'd say is that object storage isn't going anywhere. If anything, it's becoming more fundamental. The flat namespace design, the HTTP access model, the massive scalability — these are the reasons it won. File systems aren't going to disappear from our laptops, but the cloud runs on objects.
Herman
The tools that let us move between these worlds — between the hierarchical file systems we're used to and the flat object stores the cloud actually runs on — those tools are doing important translation work. RClone is the best of them, and understanding how it works makes you a better user of it.
Corn
That's a good place to land. Thanks to our producer Hilbert Flumingtop for another day of making this happen. This has been My Weird Prompts. You can find every episode at myweirdprompts. I'm Corn.
Herman
I'm Herman Poppleberry. We'll catch you next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.