user@cgh.mx:~$ cat /content/posts/website-backup-crippled-by-1-6mb-friends-gif-that-was-replicated-246-173-times-b.txt

How a tiny GIF created a very real backup failure

Sometimes the best infrastructure stories start out sounding fake.

In this case, the trigger was a single GIF from Friends, sized at 1,643,869 bytes, about 1.6 MB. But the interesting part is not the GIF itself. The real story is what happens when a sensible backup optimization collides with a filesystem limit that most teams do not think about until production forces the lesson.

According to Jake Goldsborough, one customer had 432 GB of uploads, but only 26 GB of unique content. A big part of that gap came from one GIF that ended up being duplicated 246,173 times, inflating backup size by about 377 GB.

When this happened, and to whom

This was described publicly by Jake Goldsborough on January 23, 2026, in a post about a real customer backup problem inside the Discourse ecosystem. He does not identify the customer by name, which is reasonable, but he does make clear this was a production case involving a site with hundreds of gigabytes of uploads.

So yes, this is recent, and it was not a hypothetical bug report. It came from an actual backup incident investigated by someone working on the platform.

Why the duplication happened

This happened in a Discourse environment using Secure Uploads. In Discourse, uploads can be treated differently depending on whether they appear in personal messages, private categories, or public contexts. The official Secure Uploads documentation also notes that moving posts between different security contexts can affect how upload security is handled.

That matters because the same underlying file can end up represented multiple times across different security contexts. Discourse later added a backup optimization that grouped uploads by original_sha1, downloaded one copy, and then used hardlinks for the rest so identical content would not consume space repeatedly inside the backup.

On paper, that is a good optimization. If many files have identical content, hardlinking them during backup is a clean way to save storage.
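The shape of that optimization can be sketched in a few lines of Ruby. This is illustrative only, with made-up names (`backup_with_hardlinks`, `fetch`), not Discourse's actual code: group uploads by content hash, download the first copy for each hash, and hardlink every later duplicate to it.

```ruby
require "fileutils"
require "tmpdir"

# Minimal sketch of hash-grouped dedup (not Discourse's real implementation).
# `uploads` is an array of { sha1:, dest: } hashes; `fetch` stands in for
# the real download-from-object-storage step.
def backup_with_hardlinks(uploads, fetch:)
  primary_for = {} # sha1 => path of the first copy written to disk
  uploads.each do |upload|
    if (primary = primary_for[upload[:sha1]])
      File.link(primary, upload[:dest]) # identical content: link, don't re-download
    else
      fetch.call(upload[:sha1], upload[:dest]) # first occurrence: download it
      primary_for[upload[:sha1]] = upload[:dest]
    end
  end
end

Dir.mktmpdir do |dir|
  uploads = 3.times.map { |i| { sha1: "abc", dest: File.join(dir, "copy#{i}.gif") } }
  fetch = ->(_sha1, dest) { File.write(dest, "GIF89a...") }
  backup_with_hardlinks(uploads, fetch: fetch)
  puts File.stat(uploads.first[:dest]).nlink # => 3, all three names share one inode
end
```

With 246,173 names pointing at one 1.6 MB inode, the archive only pays for the bytes once, which is exactly the saving the optimization was after.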

Why the fix broke next

The problem is that hardlinks do not scale forever. Goldsborough explains that ext4 caps the link count at roughly 65,000 hardlinks per inode. Once the backup process tried to hardlink that same GIF one time too many, the link call failed with Errno::EMLINK, the classic "too many links" error, and the backup failed with it.

That is the part worth paying attention to.

The first fix was not stupid. It was reasonable. But it assumed the space-saving mechanism would keep scaling. It did not. The bottleneck simply moved from storage consumption to a filesystem metadata limit.

That is a very familiar kind of infrastructure failure. One optimization solves the obvious problem, then creates a less obvious one somewhere deeper in the stack.
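The metadata in question is the per-inode link count, which Ruby exposes as File::Stat#nlink. A quick way to see the number the backup job was unknowingly driving toward the ext4 ceiling:

```ruby
require "tmpdir"

Dir.mktmpdir do |dir|
  original = File.join(dir, "friends.gif")
  File.write(original, "GIF89a") # stand-in bytes, not the real GIF
  3.times { |i| File.link(original, File.join(dir, "dup#{i}.gif")) }
  # The link count lives on the inode, so every name reports the same value.
  puts File.stat(original).nlink # => 4 (the original name plus three links)
end
```

On ext4 that counter tops out around 65,000; at 246,173 duplicates of one GIF, the failure was guaranteed.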

What the better fix looked like

The follow-up fix in Discourse was much more practical. Instead of trying to guess a safe threshold or relying on filesystem-specific assumptions, the backup process now catches Errno::EMLINK. When that happens, it copies the local file, makes that copy the new primary, and continues hardlinking from there.

That is the kind of fallback I trust more.

It does not pretend the limit is not real. It does not depend on a magic number. It reacts to the actual failure mode and degrades cleanly.
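The fallback shape can be sketched like this. The helper name `safe_link` and the `max_links` parameter are mine, not Discourse's; `max_links` simulates the filesystem ceiling so the rescue path can be exercised on any filesystem, not just ext4.

```ruby
require "fileutils"

# Illustrative sketch of the fallback, not Discourse's actual code.
# Returns the path the caller should use as primary for the NEXT duplicate:
# the old primary while linking still works, or the fresh copy once the
# link budget is exhausted. `max_links` simulates ext4's per-inode limit.
def safe_link(primary, dest, max_links: 65_000)
  raise Errno::EMLINK if File.stat(primary).nlink >= max_links
  File.link(primary, dest)
  primary
rescue Errno::EMLINK
  FileUtils.cp(primary, dest) # degrade to a real copy once, ...
  dest                        # ... then promote the copy to new primary
end
```

A caller just keeps reassigning `primary = safe_link(primary, dest)` inside its per-hash loop, so after each EMLINK the process resumes hardlinking against the fresh copy and the cost of the limit is one extra 1.6 MB copy per ~65,000 duplicates.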

A quick note on “S3 uploads”

The Discourse pull request uses the phrase “S3 uploads.” Here, S3 refers to Amazon Simple Storage Service, or systems using an S3-compatible object-storage model. In plain terms, it means uploads stored in object storage rather than only on local disk.

So the backup optimization was about reducing duplicate download and archive work for uploads coming from that storage layer. It is not some separate product name inside Discourse; it is the ordinary infrastructure meaning of S3/object storage.

Why this matters beyond Discourse

Even if you do not run Discourse, the lesson generalizes well.

A lot of systems rely on deduplication, shared blobs, hardlinks, references, or other storage-saving tricks. Those approaches work well until one pathological edge case concentrates too much weight on one object. Then the optimization itself becomes part of the failure path.

A few practical takeaways stand out:

  • Backup logic should be designed for ugly edge cases, not just happy-path efficiency.
  • Filesystem limits are part of application design once you depend on features like hardlinks.
  • Graceful fallback is usually better than clever threshold guessing.
  • Security and storage interactions can create weird multiplicative effects that are easy to miss in normal testing.

The useful version of the story

The funny version is that a Friends GIF ate hundreds of gigabytes and broke backups.

The useful version is that a real production system exposed an interaction between security-sensitive uploads, hardlink-based deduplication, and ext4 limits, then got fixed in a way other engineers can actually learn from.

That is why this story is worth reading. Not because it is weird, but because weird failures are often the ones that reveal the assumptions hiding in otherwise reasonable designs.

user@cgh.mx:~$ echo "End of file."
