Running my NAS and 25 containers with Claude Code

I run a homelab on a Synology NAS. About 25 Docker containers: Jellyfin and the usual media stack, Paperless for documents, Vaultwarden for passwords, Gitea for my own git, Mealie for recipes, plus a pile of monitoring. Caddy out front as a reverse proxy, Pi-hole doing DNS, and the whole thing reachable only over Tailscale, with no ports open to the internet.

The part that matters for this post: every compose file lives in git. One folder per service, all version-controlled, treated like infrastructure-as-code. So I can point Claude Code at it and run my NAS the way I'd run any repo.

The slog that sold me on it

For a long time everything ran on the :latest tag. This is a trap on Synology's Container Manager, because it never actually re-pulls :latest on its own. So every service was quietly frozen on whatever digest it cached months ago, and I had no idea which.
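A quick way to see how exposed a repo is: scan the compose files for images that float. This is a minimal sketch, not a real YAML parser. It assumes each image sits on its own `image:` line, and the function name `floating_images` is made up for illustration.

```python
import re

# Matches lines such as "image: jellyfin/jellyfin:latest" in a compose file.
IMAGE_LINE = re.compile(r"^\s*image:\s*['\"]?([^'\"\s#]+)")


def floating_images(compose_text: str) -> list[str]:
    """Return image references that are not pinned to a version tag or digest."""
    found = []
    for line in compose_text.splitlines():
        match = IMAGE_LINE.match(line)
        if not match:
            continue
        ref = match.group(1)
        if "@sha256:" in ref:
            continue  # pinned by digest, nothing to flag
        # Only look for a tag after the last "/", so a registry port such as
        # "host:5000/app" is not mistaken for a tag.
        name = ref.rsplit("/", 1)[-1]
        tag = name.split(":", 1)[1] if ":" in name else ""
        if tag in ("", "latest"):
            found.append(ref)  # no tag at all also means :latest
    return found
```

Run it over every service folder and the result is the to-do list for the pinning work.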

Getting off that was the job. Claude Code walked the entire stack to explicit version pins, one service at a time, with the same routine every time:

1. read the release notes for breaking changes
2. stop the container
3. cold-backup its data dir
4. bump the tag in compose
5. build / up
6. verify it's healthy
7. commit to the nas-docker repo
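The routine above can be sketched as a dry-run planner that prints the commands rather than running them. Several things here are my assumptions, not the exact commands from the session: the `docker compose` v2 CLI (older DSM installs ship `docker-compose` instead), a plain `tar` archive as the cold backup, and the example paths. The `Upgrade` and `plan` names are hypothetical.

```python
import shlex
from dataclasses import dataclass


@dataclass
class Upgrade:
    service: str     # folder name in the repo, assumed to match the compose service
    new_tag: str     # the explicit version being pinned
    data_dir: str    # host path holding the service's data (assumed layout)
    backup_dir: str  # where cold backups are written (assumed layout)


def plan(u: Upgrade) -> list[str]:
    """Return the seven steps as shell lines, to be run from the service folder."""
    archive = f"{u.backup_dir}/{u.service}-pre-{u.new_tag}.tar.gz"
    return [
        f"# 1. read the release notes for {u.service} {u.new_tag} (manual)",
        f"docker compose stop {shlex.quote(u.service)}",
        # The backup happens only after the stop, so the data is cold.
        f"tar -czf {shlex.quote(archive)} -C {shlex.quote(u.data_dir)} .",
        f"# 4. edit the image tag in the compose file to {u.new_tag} (manual)",
        "docker compose pull",
        "docker compose up -d",
        "docker compose ps",
        "git commit -am " + shlex.quote(f"Pin {u.service} to {u.new_tag}"),
    ]
```

Keeping it as a plan instead of a script is deliberate: steps 1 and 4 need a human or an agent reading real release notes, and anything needing sudo gets pasted by hand anyway.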

It's tedious, disciplined work across two dozen services, and exactly the kind of thing I'd cut corners on by myself and regret later. The transcript is basically me going "let's do vaultwarden next" → "works" → "gitea next" → "works" for an afternoon.

Two things came out of it worth passing on:

Jellyfin's 10.10 to 10.11 jump has a one-way database migration. Irreversible. The only way back is restoring a backup taken while the container was stopped. A direct jump from an earlier 10.10.x also fails, so you have to go through 10.10.7 first. Good to know before you find out the hard way.
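That stepping-stone rule is easy to encode so it can't be forgotten. A small sketch, assuming only what's stated above (10.10.7 must be installed before any 10.11.x); the `upgrade_path` helper is hypothetical and knows nothing else about Jellyfin's release history.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn '10.10.7' into (10, 10, 7) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Assumption taken from the text: 10.10.7 is the required bridge release.
BRIDGE = "10.10.7"


def upgrade_path(current: str, target: str) -> list[str]:
    """Return the versions to install, in order, to get from current to target."""
    cur, tgt, bridge = parse(current), parse(target), parse(BRIDGE)
    if tgt < cur:
        # The migration is one-way, so going backwards means restoring a backup.
        raise ValueError("downgrade: restore a stopped-container backup instead")
    if cur < bridge and tgt >= (10, 11):
        return [BRIDGE, target]
    return [target]
```

For example, `upgrade_path("10.10.3", "10.11.0")` returns `["10.10.7", "10.11.0"]`, and each hop still gets its own stop, backup and verify.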

A "latest stable" pin can secretly be a downgrade. I pinned my disk-health tool from its rolling tag down to the newest tagged release, and the web UI died with a 502. The "stable" release turned out to be older than the rolling build I'd been running, and it couldn't read the database the newer version had already written. The fix was to pin the digest of the running build. Which leads straight into the next bit.
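Pinning to a digest just means replacing the tag with a `repo@sha256:...` reference. A minimal sketch of that rewrite, assuming the digest came from something like `docker image inspect --format '{{index .RepoDigests 0}}' <image>` on the running build. The `pin_image` helper and the image name in the usage line are placeholders, not the actual tool.

```python
import re

# A digest reference looks like "repo/name@sha256:<64 hex characters>".
DIGEST = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def pin_image(compose_text: str, old_ref: str, repo_digest: str) -> str:
    """Swap 'image: old_ref' for 'image: repo_digest', keeping the indentation."""
    if not DIGEST.match(repo_digest):
        raise ValueError(f"not a repo@sha256 digest reference: {repo_digest!r}")
    out = []
    for line in compose_text.splitlines():
        if line.strip() == f"image: {old_ref}":
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}image: {repo_digest}"
        out.append(line)
    return "\n".join(out) + "\n"
```

Usage would be along the lines of `pin_image(text, "example/disk-health:rolling", digest)`. The trade-off is that a digest says nothing about the version to a human reader, so a comment next to it in the compose file helps.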

Where AI actually falls over

Version reality is where Claude struggles most on infra. It confidently pinned that downgrade, because "newer stable version number" sounds right and often isn't. A rolling tag can be ahead of the latest numbered release. That needs a human going "wait, is this actually newer?", and it's why I now run a couple of tools whose whole job is to tell me when a real new image exists.
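One cheap sanity check for "is this actually newer?" is comparing when the two images were built. A rough sketch, assuming the timestamps come from `docker image inspect --format '{{.Created}}' <image>`. Build date is only a proxy (a patch release for an old branch can be built after a rolling image), so treat a positive result as a prompt to look closer, not as proof.

```python
import re
from datetime import datetime


def parse_created(value: str) -> datetime:
    """Parse a timestamp such as '2024-05-01T12:34:56.123456789Z'."""
    # Drop fractional seconds and normalise a trailing "Z" so that
    # datetime.fromisoformat accepts it on older Python versions too.
    value = re.sub(r"\.\d+", "", value.strip())
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def is_downgrade(running_created: str, candidate_created: str) -> bool:
    """True when the candidate image was built before the one already running."""
    return parse_created(candidate_created) < parse_created(running_created)
```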

A few other honest limits. Claude can't run sudo on the NAS, so it hands me the one-liner to paste into a real terminal myself. The workflow assumes a network mount that isn't always there after a reboot. And container-to-container networking on the NAS has foot-guns that silently break things until you repoint them.
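The missing-mount problem is the easiest of these to guard against: check before doing anything. A minimal sketch; the `require_mount` name is hypothetical and the path in the usage note is a placeholder for wherever the share is mounted.

```python
import os


def require_mount(path: str) -> None:
    """Raise if path is not currently a mount point."""
    # os.path.ismount returns False for a plain (or missing) directory, which
    # is exactly what an unmounted share looks like after a reboot.
    if not os.path.ismount(path):
        raise RuntimeError(f"{path} is not mounted; remount it before continuing")
```

Calling `require_mount("/path/to/nas-share")` at the top of a session fails loudly instead of letting edits land in an empty local folder.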

The DNS deadlock I caused myself

My favorite bite of the whole project. To upgrade Pi-hole you have to stop it. But the NAS uses Pi-hole as its DNS resolver. So the moment Pi-hole went down, the NAS couldn't resolve docker.io to pull the new Pi-hole image, and the whole thing wedged. My message at the time:

Ran into a problem. Backup has worked. When trying to build, the NAS cant reach the pihole dns and it failed. I have restored connectivity by manually adding a different dns.

Claude spotted the circular dependency immediately: point the NAS at a public resolver before you stop Pi-hole, and set a permanent secondary DNS so it can't happen again. Obvious in hindsight, easy to walk into at 10pm.
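The preflight check amounts to "is there at least one resolver that isn't Pi-hole?". A small sketch, assuming the NAS keeps its resolvers in a standard `resolv.conf`-style file (I haven't verified where DSM stores them); the IP addresses in the usage note are placeholders.

```python
def nameservers(resolv_conf: str) -> list[str]:
    """Extract nameserver addresses from resolv.conf-style text."""
    servers = []
    for line in resolv_conf.splitlines():
        parts = line.split("#", 1)[0].split()  # ignore trailing comments
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def safe_to_stop_pihole(resolv_conf: str, pihole_ip: str) -> bool:
    """True if name resolution can survive Pi-hole being down."""
    return any(server != pihole_ip for server in nameservers(resolv_conf))
```

With Pi-hole on a hypothetical `192.168.1.2`, a file listing only that address returns False, and adding a line like `nameserver 1.1.1.1` flips it to True.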

The one that would've eaten my evening

After rebuilding my torrent stack, the client's web UI wouldn't come back. The logs looked like this:

started
termination initiated
ready to exit

Over and over, about once a second, no error. I'd have stared at that for an hour. Claude pattern-matched it from the rhythm: the VPN container's rebuild had hard-killed the client, which left a stale lock file and socket behind, and on restart the client saw those, assumed another copy of itself already owned the UI, and politely exited. Delete the two leftover files, start it, done. That obscure-failure-mode recognition is where AI genuinely earns its place in a homelab.
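The "rhythm" is something a script can spot too. A minimal sketch that flags a log where the same start-and-exit cycle repeats back to back; the function name and the threshold of three repeats are arbitrary choices for illustration.

```python
def looks_like_restart_loop(lines: list[str], cycle: list[str],
                            min_repeats: int = 3) -> bool:
    """True if `cycle` appears at least `min_repeats` times in a row in `lines`."""
    if not cycle:
        return False
    cleaned = [line.strip() for line in lines if line.strip()]
    needed = list(cycle) * min_repeats
    span = len(needed)
    for start in range(len(cleaned) - span + 1):
        window = cleaned[start:start + span]
        # Substring match, so timestamps or prefixes on the log lines are fine.
        if all(msg in line for msg, line in zip(needed, window)):
            return True
    return False
```

Fed the three messages above as `cycle`, it returns True for that log. It only tells you the process is exiting cleanly on every start, though. Knowing that a stale lock file and socket are the usual cause is still the part that needs experience.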

The habit I'd recommend

The thing I've come to value most is that Claude writes its own runbook. Every one of these gotchas becomes a saved note, so next time the fix is already written down and it doesn't cost me the evening twice. And I steer it by stakes. When it's my document archive, I say so out loud and ask for extra care. Other times I hand it an open-ended brief:

Based on the projects im running, do a deep analysis. Suggest improvements on the existing stack, and additional services im missing.

That one prompt is how the whole monitoring and update-notification layer got designed. Anything with a database gets stopped and cold-backed-up before it's touched, because a copy of a running database is a corrupt database. Low-stakes stuff I let it move faster on.
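The stop-first rule can be enforced in code rather than left to memory. A sketch of a cold backup that refuses to run while the container is up. The `is_running` callback is an assumption on my part: it could wrap `docker inspect --format '{{.State.Running}}' <container>`, but it's injected here so the function stays testable without Docker.

```python
import tarfile
import time
from pathlib import Path
from typing import Callable


def cold_backup(data_dir, backup_dir, is_running: Callable[[], bool]) -> Path:
    """Archive data_dir into backup_dir, but only if the container is stopped."""
    if is_running():
        raise RuntimeError("container is still running; stop it before backing up")
    data_dir, backup_dir = Path(data_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = backup_dir / f"{data_dir.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths relative, so a restore doesn't depend on the
        # original absolute location.
        tar.add(data_dir, arcname=data_dir.name)
    return archive
```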

I could run a stack this size on my own. I just wouldn't keep it this tidy. The discipline is the hard part, and that's the part I'm happy to hand off :)