A few years ago, I began running servers at home again. Ever since, it's been a non-stop battle to manage them well without drowning under excessive complexity.
Running reliable systems is dear to my heart. I've been doing it for my entire adult life. Actually, even longer. I pride myself on delivering simple, yet functional solutions that are fit for purpose. To that end, I've tested far more tools than I care to mention. I will describe where I've landed after perhaps four years of iteration.
I monitor my network in just two ways. The first is a script on each host, executed hourly by cron. It verifies that, at that moment, the host's services and hardware are healthy. If any test fails, a message is printed to stdout, which is subsequently delivered to my mailbox, as per standard cron behaviour.
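As a sketch of what one of these scripts might look like on an OpenBSD host - the service names, thresholds, and domain here are hypothetical, not my actual checks:

    #!/bin/sh
    # Hourly health check: any output becomes a cron mail.

    # Is the mail daemon running? rcctl(8) is OpenBSD's service manager.
    rcctl check smtpd >/dev/null || echo "smtpd is not running"

    # Is the root filesystem under 90% full?
    usage=$(df -k / | awk 'NR==2 { sub("%", "", $5); print $5 }')
    [ "$usage" -lt 90 ] || echo "/ is ${usage}% full"

    # Can we still resolve names?
    host -W 3 example.com >/dev/null 2>&1 || echo "DNS lookup failed"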
Most monitoring tools check the state of the systems on a schedule, and each check is (or isn't) tuned with its own unique parameters. They'll usually perform retries to ensure accuracy, and checks might have different priority levels to suppress noise. They do a lot. Realistically, I'm unlikely to respond to any issue discovered in my network before I'm ready to, or before someone in my household complains. I don't need constant health checks bouncing around my network in search of the slightest hiccup; one check per hour is enough. If I suspect a problem, or if I want to verify a change, I can execute the script manually and receive immediate feedback, which is better than waiting for an automated check to be scheduled, retried, and to eventually report its updated status.
This approach certainly has massive gaps; it doesn't detect flapping services or track downtime. For most checks on my network, that is entirely acceptable, but when it isn't, I use another technique.
Change over time - like the rate at which the disks are filling, or the count of network errors - is sometimes crucial for planning or debugging. I collect this data using SNMP and store the results in an RRD database via rrdcached(1). By all accounts, RRDTool is a legacy solution made for another time. The data is aggregated, which hides short spikes; you can't pan and zoom into the data; and the query language to produce graphs - if you can call it a query language - is pretty hard to use. You would be right to call it a terrible solution... but it isn't. It provides great trend data, requires virtually zero management after setup, and has a tiny tech stack that comprises not much more than a single C program. It's dead simple and it asks nothing of me. That's the kind of tech that I can get behind. Perhaps I'm suffering Stockholm syndrome, but since it's difficult to produce new graphs, I tend to only add them when I really have to, and I'm quick to throw away graphs that don't offer enough.
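To show what I mean about the query language, here's a sketch of a graph definition; the file paths and data-source names are made up. Even converting octets to bits means writing reverse-Polish notation:

    rrdtool graph /var/www/htdocs/graphs/wan-in.png \
        --start -1w --title "WAN inbound" --vertical-label "bits/s" \
        DEF:in=/var/db/rrd/wan.rrd:ifInOctets:AVERAGE \
        'CDEF:inbits=in,8,*' \
        LINE1:inbits#0000ff:inbound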
When the data I need isn't available through SNMP, I don't bother extending SNMP to expose it. Instead, I scrape the value in whichever way is easiest before sending it to rrdcached(1).
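In sketch form - the log file and RRD path are hypothetical - that can be as crude as:

    # Count an event in a log and push it straight to rrdcached(1).
    count=$(grep -c 'blocked' /var/log/someapp.log)
    rrdtool update --daemon unix:/var/run/rrdcached.sock \
        /var/db/rrd/blocked.rrd "N:${count}"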
My custom RRD tooling is managed in a git repository and is deployed to the hosts with rsync(1). This centralises the complexity in git before distributing the results around the network.
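The deploy step amounts to a loop like the following sketch; the hostnames and paths are illustrative:

    # Push the checked-in tooling from the git working copy to each host.
    for host in fw files misc; do
        rsync -a --delete ./bin/ "root@${host}:/usr/local/libexec/monitoring/"
    done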
SNMP is another technology from a bygone era, but again, it's right for me. I appreciate that the data from every server, switch, and networking device is exposed in a standard way. I also appreciate that many of OpenBSD's base daemons expose their metrics through SNMP. I use a few simple scripts to collect and record the CPU, memory, disk, and network usage for every single device. That's a lot of leverage.
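Each of those scripts reduces to a couple of lines per metric. The sketch below assumes SNMPv2c and the net-snmp tools; the OID, community string, and paths are illustrative:

    # Poll an interface counter and feed it to the RRD via rrdcached(1).
    octets=$(snmpget -v2c -c public -Oqv switch1 IF-MIB::ifInOctets.1)
    rrdtool update --daemon unix:/var/run/rrdcached.sock \
        /var/db/rrd/switch1-port1.rrd "N:${octets}"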
For security reasons, my monitoring stack - snmpd(8), rrdcached(1), and the rest - runs on the management VLAN and isn't accessible to the client devices.
RRDTool writes fresh graphs every five minutes to the htdocs directory of my web server. I view those graphs through a frontend written by hand in plain HTML and CSS. Again, that is managed in the git repository and deployed with rsync(1). Since the frontend is just plain HTML, I also include system notes and general information about the hosts.
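The regeneration is just another cron job. This crontab(5) entry is illustrative, and graph-all.sh stands in for a hypothetical wrapper around the rrdtool graph commands:

    # Redraw every graph into the web root every five minutes.
    */5 * * * * /usr/local/libexec/monitoring/graph-all.sh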
I have log data too, of course, but I don't do anything special with it. Each host manages its own logs and I read the logs with less(1) after connecting to the host over SSH, or serial for that matter.
None of my machines have a monitor or keyboard connected. I usually connect to them by SSH, but each server and switch has an easily accessible serial port if for some reason it cannot be reached across the network.
I run as few machines as possible. Running machines takes effort, and no matter how they're managed, it costs you time. Sometimes it helps to split your workload; for example, I keep the services needed for Internet access on one machine, so I can upgrade my mail server without breaking Netflix. It also means that a failure of one server doesn't have me scrambling to rebuild the world. It reduces the blast radius.
Right now, as few machines as possible is four. I have an OpenBSD firewall to provide networking services: DHCP, DNS, routing, NTP, and so on. I have an Alpine Linux file server which serves files stored in ZFS over NFS and Samba; it also runs Jellyfin, which doesn't work on OpenBSD. I have a Raspberry Pi on the management VLAN to run the UniFi Network Controller. Finally, I have a general-purpose OpenBSD server to do everything else: secondary networking services, email, web servers, git hosting, ntfy, and rrdcached.
My OpenBSD machines are backed up using dump(8) and restore(8). The backups are uploaded to pCloud and rotated according to a schedule. To back up my file server, I send ZFS snapshots to external hard drives. There is too much data to consider uploading it to a cloud provider.
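In sketch form, with illustrative paths, snapshot names, and pool names, the two backup flows look like this:

    # OpenBSD hosts: a level-0 dump(8) of the root filesystem,
    # compressed and staged for upload.
    dump -0au -f - / | gzip > /backup/root-level0.dump.gz

    # File server: snapshot the dataset and replicate it to an
    # external drive holding the "backup" pool.
    zfs snapshot tank/data@2025-01-01
    zfs send tank/data@2025-01-01 | zfs receive backup/data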
I used to manage my hosts with Ansible, but the cycle time for making and deploying changes was too great, and I don't need the benefits it provides: reproducible hosts, history, deduplicated infrastructure logic. It also introduced a need to solve secrets management, and - along with KVM, which I've also nixed - it made it too easy to add hosts to the network, which I try to avoid. Instead of using IaC tools, I manage my hosts manually and depend on backups if anything goes awry.
In the past few years much software has come and gone. The software that remains is that which asks the least and offers the most. Overall, I feel that investing in fewer pieces of tech affords me the time to perfect those which I do host, and a polished, well-maintained environment is a simple and predictable one.
(c) 2025 altos.au. This work is licensed under CC BY-NC-ND 4.0.