
My NixOS + Nomad Dev Environment

I had 2 problems: a huge mess of dev environments on my personal laptop, and a random assortment of mini projects and experiments scattered across different cloud providers and VPSs. Maximum cognitive load and mental switching costs between projects.

During the 2020 pandemic, I decided to do some housekeeping and refactor and standardize my approach to development. It's taken me about 3 years to get here, and thanks to LLMs/ChatGPT I finally got this project over the finish line.

Originally I "just wanted" the "Heroku experience" (git push to deploy) in a local dev environment, basically gamifying my coding environment, i.e. "make my dev environment feel like Starcraft (the RTS game)": git push and see a notification of a successful CI/CD run. But one thing led to another and I ended up with a mini cloud. People complain about Kubernetes/Nomad complexity, but eventually complexity catches up to you and you realize life is easier with container orchestration. The juice is worth the squeeze.

So it's not quite a "dev env" on my laptop anymore, but more like "my laptop is a remote to an ultra-fast mini cloud".

Goals:

  1. cheap (compared to AWS/Google cloud)
  2. safe (pristine state on boot)
  3. fast tests/deploys (immediate feedback)

What I ended up with:

  1. NixOS
  2. Nomad + Waypoint
  3. Gitea with Act runner and Gitea artifact hosting
  4. Grafana and Prometheus
  5. ZFS, Syncthing, Mutagen, Restic
  6. Tailscale VPN
  7. Cloudflare tunnels, Cloudflare pages, and Cloudflare image hosting

Hardware:

  1. 3x AMD 7950x, 128GB memory
  2. Mix of Samsung 980/990 pro, and Sabrent Rocket SSDs

Learning NixOS

  1. NixOS 1 (original inspiration): https://grahamc.com/blog/erase-your-darlings/
  2. NixOS 2: https://github.com/mitchellh/nixos-config
  3. searching github/sourcegraph for example Nomad and Nix configs - game changer
  4. ChatGPT for confusing Nix and Hashicorp Nomad syntax - another game changer

I discovered NixOS after reading "Erase your darlings", where the author describes how NixOS can bootstrap itself from two immutable directories: /nix and /boot. Everything else can be deleted on boot. Any state (documents, /etc config, home directory dotfiles) can be restored after boot from a backed-up network drive. This alone was both a stress reliever (a clean system after every boot) and a forcing function (the declarative config must be correct for stuff to work after boot).
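
The core of the pattern is a couple of lines of NixOS config that roll the root dataset back to an empty snapshot early in every boot (dataset names follow the article, assuming a ZFS pool named rpool with a @blank snapshot taken at install time):

  # "Erase your darlings": reset the root filesystem to its blank snapshot on every boot
  boot.initrd.postDeviceCommands = lib.mkAfter ''
    zfs rollback -r rpool/local/root@blank
  '';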

Then I spent a few months going deep into NixOS configuration rabbit holes and Nix packages, starting with https://github.com/mitchellh/nixos-config

Nomad

But since then I've mostly stopped using Nix packages for anything substantial, because I realized "systemd is NOT all you need", and because not all Nix packages are as easy to use as their Docker image equivalents.

  • Nix packages mostly for basic Linux apps (vim, wander, screen, etc.), Linux configs (SSH, Tailscale, and Cloudflare tunnels, so they're not dependent on Docker), and Prometheus, so it runs directly on the host (sketched below)
  • Docker images for almost everything else (Postgres, Redis, Gitea, etc.)
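
Roughly, the host-level slice looks like this in NixOS config (package names and options here are illustrative, not an exact copy of my setup):

  # Basics on the host via Nix; everything heavier runs as Docker images under Nomad
  environment.systemPackages = with pkgs; [ vim screen wander ];

  services.openssh.enable = true;
  services.tailscale.enable = true;
  services.prometheus.enable = true;   # Prometheus directly on the host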

Yes, you can do almost everything you'd want with native NixOS config and Nix packages and systemd, but it's a lot easier with Nomad and Docker (because vendors usually maintain Docker images):

  1. Avoid configuration hell: e.g. Postgres with various addons/extensions and custom configs is easier if a vendor (e.g. TimescaleDB) offers a pre-baked Docker image with best practice configs. Another example: try hosting a private git host (gitea, sourcehut, etc.) with plain old Nix packages instead of Docker.
  2. Dependencies and parallelism: You can do this in systemd and bash scripts, but "declaring" this is what Nomad was designed for and optimized around.
  3. Built-in networking: if you're exclusively using Linux/Nix/systemd, it will be hard to expand beyond one machine without gratuitous network and firewall config; with Nomad and Docker, it's a lot easier.
  4. Zero downtime deploys: Nomad has this built-in.
  5. Standardization allows scale: Dockerize all side projects, define a Nomad job spec (like the sketch after this list), and nomad run myapp.nomad.hcl
  6. nomad node drain <node>: have Nomad move running workloads to another machine so you can reboot it and run nix flake update && nixos-rebuild switch to upgrade the underlying Linux and NixOS
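
The job spec in item 5 is a small HCL file; a hypothetical minimal one for a Dockerized side project looks roughly like this (the job name, port, and registry host are placeholders):

  # myapp.nomad.hcl - a minimal job for one side project
  job "myapp" {
    datacenters = ["dc1"]

    group "web" {
      count = 1

      network {
        port "http" { to = 8080 }
      }

      task "server" {
        driver = "docker"

        config {
          image = "git.example.internal/me/myapp:latest"   # pushed there by CI
          ports = ["http"]
        }

        resources {
          cpu    = 500   # MHz
          memory = 256   # MB
        }
      }
    }
  }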

Nomad is not strictly necessary; you could use Docker Swarm exclusively, or maybe even Docker Compose. Or you can swap Nomad for Kubernetes (and long-term I may end up on Kubernetes). All offer ways to achieve zero downtime deploys and dependency management. I like both Kubernetes and Nomad, but Nomad is dead simple to configure.

You can also use Nomad without Docker, running plain app binaries instead of Docker images. But you'll often need a supporting app via a Docker image anyway, and at that point, why not simplify and use Docker for everything?
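
For reference, running a plain binary is just a different task driver; a fragment (the binary path and args are placeholders):

  task "myapp" {
    driver = "exec"

    config {
      command = "/usr/local/bin/myapp"   # a plain binary, no Docker image
      args    = ["--port", "8080"]
    }
  }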

Self-hosted git via Docker is not too bad either. You can run a GitHub clone via the Gitea Docker image, create a repo and a job runner, and get a full CI/CD system. I spent a week trying to get Sourcehut working, but Gitea felt like GitHub and the setup was comparatively easy.

The big takeaway here is: Docker is simple and vendors/projects provide working images. These are often better than the packages provided by Nix.

Hashicorp Waypoint for deployments

A CLI that allows you to template your Nomad job files so you can run waypoint up from CI/CD and organize your secrets and environments.

TL;DR: it builds, tags, and pushes Docker images to Gitea's artifact hosting, interpolates the Nomad job file with the image tag and secrets, and runs nomad run on the result.
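
A waypoint.hcl for that flow looks roughly like this (the image name and template path are placeholders; treat it as a sketch rather than a drop-in config):

  project = "myapp"

  app "myapp" {
    build {
      use "docker" {}

      registry {
        use "docker" {
          image = "git.example.internal/me/myapp"
          tag   = gitrefpretty()   # tag with the current git ref
        }
      }
    }

    deploy {
      use "nomad-jobspec" {
        jobspec = templatefile("${path.app}/myapp.nomad.hcl")
      }
    }
  }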

I actually love this little app but Hashicorp has deprecated it (as of Oct 2023).

You can replicate most of its functionality with a few shell scripts, but I intend to keep using it, at least for the most basic of deployments.

Grafana and Prometheus

Most are already familiar with these. I mostly ignore logs and don't collect them centrally (I just use the wander app to tail Nomad logs and journalctl for systemd logs). Instead I pump custom metrics to Prometheus, and spend a lot of time tweaking my fancy dashboards to read from Prometheus and Postgres.
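
Pumping a custom metric is only a few lines with a Prometheus client library; here's a sketch using the prom-client npm package (the metric name, port, and Express wiring are placeholders):

  // expose a custom counter for Prometheus to scrape
  import express from "express";
  import client from "prom-client";

  const jobsProcessed = new client.Counter({
    name: "myapp_jobs_processed_total",
    help: "Number of background jobs processed",
  });

  // elsewhere in the app: jobsProcessed.inc();

  const app = express();
  app.get("/metrics", async (_req, res) => {
    res.set("Content-Type", client.register.contentType);
    res.send(await client.register.metrics());
  });
  app.listen(9464);   // add this address as a Prometheus scrape target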

At a certain scale or for certain use cases you need log collection and search; your mileage may vary. Many great solutions (e.g. Loki) exist.

Likewise, for error tracking I'd consider self-hosted Sentry.io.

Disk strategies

  • I opted not to use distributed storage (Ceph/Longhorn/SeaweedFS) and instead keep Postgres data dirs on a single machine. For Postgres instances that need permanent uptime, use failover or multi-master across 2 machines.
  • The SSDs are so fast and so cheap that this alone makes the switch from the cloud worth it. Lots of projects that were not economically viable become possible with fast, cheap disk - including as-you-type real-time search that hits fast SSDs instead of RAM.
  • For hosting images and assets, use Cloudflare image hosting because it's cheap

Partitioning and formatting

ext4 is the gold standard, and dead simple. XFS is common for big databases. ZFS is complicated but, as far as I can tell, considered quite stable. The reason to use it is that it abstracts a bunch of different disks into a single volume. It's simpler than RAID and lets you mirror or stripe, and add or replace disks when they break.

Most importantly, ZFS has snapshots:

sudo zfs list -t snapshot
sudo zfs snapshot rpool/persist/backups@2023-12-25

and you can sync ZFS over the network

and of course, "rollback to snapshot on boot"

zfs rollback -r rpool/local/root@blank

ZFS snapshots are great for huge Postgres data dirs. Say you have a 3TB Postgres data dir and you want to test a new version of Postgres. You can snapshot the data dir, then run the new version of Postgres on it. If it doesn't work, roll back to the snapshot.
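
Roughly (the dataset name is a placeholder):

sudo zfs snapshot rpool/persist/postgres@pre-upgrade
# stop Postgres, run the new version against the same data dir, test it
# if anything breaks, stop Postgres and return to the exact pre-upgrade state:
sudo zfs rollback rpool/persist/postgres@pre-upgrade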

Or you can zfs send the data dir snapshot to another machine instead of running a pg_dump and pg_restore.
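
Roughly (the hostname and pool names are placeholders, and it assumes the remote user can run zfs receive):

sudo zfs send rpool/persist/postgres@2023-12-25 | ssh 7950x2 sudo zfs receive tank/postgres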

Gitea + Act runner

Gitea is an open source GitHub. It looks and acts exactly like it, except with an extremely fast UI. Likewise the Act runner. It's a little complicated to set up (but way easier than Sourcehut), and worth it because I can git push and get a full CI/CD run in under 2 seconds (also partly due to the fast 7950x).

I use a custom-built Docker image with all the deps pre-installed, so the runner does not install anything (other than project deps via npm, go mod, etc.).
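
Pointing the runner at that image happens when you register it; a sketch with the act_runner CLI (the instance URL, token, label, and image are placeholders; check the Gitea docs for the exact flags):

  ./act_runner register --no-interactive \
    --instance https://git.example.internal \
    --token <registration-token> \
    --labels prebaked:docker://git.example.internal/me/ci-base:latest

Jobs that declare runs-on: prebaked then start inside the pre-baked image with no setup steps.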

Syncthing

Open source Dropbox. I keep my "code" and "docs" directories on a mirrored ZFS volume (2x Samsung 990 Pro) and sync them to all laptops.
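
On NixOS it's a few lines to enable (the user and path are placeholders):

  services.syncthing = {
    enable = true;
    user = "myuser";
    dataDir = "/persist/syncthing";   # lives on the mirrored ZFS volume
  };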

Mutagen + Screen/tmux

Similar to Syncthing, but for one-off projects. Specifically, I use it to avoid rolling a Docker + Nomad config for quick dev work.

  1. on my laptop, in my code dir (provided by syncthing), mkdir my-app
  2. mutagen sync create --ignore=prod.log --name=sync-my-app-to-7950x1 ~/syncthing/my-app myuser@7950x1:code/my-app
  3. screen
  4. npm start

Now I can edit that project locally on my laptop and keep it real-time synced with a server, without having to sync my entire code directory to that server.

If the project matures, create a Docker image and Nomad config.

Tailscale

Tailscale on the laptop and the servers; use the Tailscale IPs so Syncthing and Mutagen work from anywhere.

Cloudflare tunnels for backend apps

Keep the servers behind a NAT (router, no public IP) and use Cloudflare tunnels so they're publicly accessible behind a domain name.

  # NixOS config serving example.com from a Nomad job
  services.cloudflared = {
    enable = true;

    tunnels = {
      "abababab-abab-abab-abab-abababababab" = {
        credentialsFile = "/etc/cloudflared/credentials.json";
        ingress = {
          "example.com" = "http://example-app-on-nomad.service.consul:9999&quot;;
        };
        default = "http_status:404";
      };
    };
  };

Cloudflare Pages for frontend (HTML, React, Vue, etc.)

I also use Netlify, but for the sake of standardizing on Cloudflare everywhere:

wrangler pages deploy --project-name=myfrontend --env production --branch=production dist

Cloudflare image hosting

Cloudflare prefers you don't use their tunnel service for images/assets, because they have a highly optimized and cheap image hosting service.

You can use their libraries/SDKs, or just POST to their API:

// Cloudflare Images upload endpoint; CLOUDFLARE_ACCOUNT_ID and fileBlob below are placeholders
const url = `https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/images/v1`;

const formData = new FormData();
formData.append("file", fileBlob);

const response = await fetch(url, {
  method: "POST",
  body: formData,
  headers: {
    Authorization: `Bearer ${CLOUDFLARE_IMAGE_UPLOAD_TOKEN}`,
    // 'Content-Type': 'multipart/form-data' is automatically set by fetch when using FormData
  },
});

Backups

  • The code and docs dirs are on a 2x mirrored, encrypted ZFS volume
  • The Gitea data dir is also on the mirrored ZFS volume
  • Regular snapshots of the ZFS volume (via a Nomad periodic job, sketched below) to another ZFS disk on the same network
  • Restic for backups to an offsite encrypted volume
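
The periodic job is just a batch job on a cron schedule; a sketch (the schedule, dataset, and zfs path are placeholders, and the raw_exec driver has to be enabled on the client):

  job "zfs-snapshot" {
    type = "batch"

    periodic {
      cron             = "0 */6 * * *"   # every 6 hours
      prohibit_overlap = true
    }

    group "snap" {
      task "snapshot" {
        driver = "raw_exec"   # runs directly on the host

        config {
          command = "/bin/sh"
          args    = ["-c", "/run/current-system/sw/bin/zfs snapshot rpool/persist@auto-$(date +%F-%H%M)"]
        }
      }
    }
  }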

What peace of mind feels like

  • sudo reboot and have that "new car smell" or that "reformatted PC smell" with a new OS and all your Docker images still running
  • 3+ copies and snapshots of your data
  • Cheap and fast hardware
  • Stable app versions that survive Linux updates

Next steps: personal laptop

Move off the Apple ecosystem and onto a NixOS Linux laptop, so I can enjoy deploying my personal environment, config/preferences, and keys to a new machine in seconds.