
A systemd mistake with a script-based service unit I recently made

November 10, 2017

That sure was a bunch of debugging because I forgot that my systemd .service file that runs scripts needed

Type=oneshot
RemainAfterExit=True

(... or it'd apparently run the ExecStop script right after the ExecStart script, which doesn't work too well.)

Let's be specific here. This was the systemd .service unit to bring up my WireGuard tunnel on my work machine, which I set up to run a 'startup' script (via ExecStart=). Because I had a 'stop' script sitting around, I also set the unit's ExecStop= to point to that; the 'stop' script takes the device down and so on.

The startup script worked when I ran it by hand, but when I set up the .service unit to start WireGuard on boot, it didn't. Specifically, although journalctl reported no errors, the WireGuard tunnel network device and its associated routes just weren't there when the system finished booting. At first I thought the script was failing in a way that the systemd journal wasn't capturing, so I stuck a bunch of debugging in (capturing all output from the script in a file, and then running with 'set -x', and finally dumping out various pieces of network state after the script had finished).

All of this debugging convinced me that the WireGuard tunnel was being created during boot but then getting destroyed by the time booting finished. I flailed around for a while theorizing that this service or that service was destroying the WireGuard device when it was starting (and altering my .service to start after a steadily increasing number of other things), but nothing fixed the issue. Then, while I was staring at my .service file, the penny dropped and I actually read what was in front of my eyes:

[Service]
WorkingDirectory=/var/local/wireguard
ExecStart=/var/local/wireguard/startup
ExecStop=/var/local/wireguard/stop
Environment=LANG=C

This .service file had started out life as one that I'd copied from another .service file of mine. However, that .service file was for a daemon, where the ExecStart= was a process that was sticking around. I was running a script, and the script was exiting, which meant that as far as systemd was concerned the service was going down and it should immediately run the ExecStop script. My 'stop' script deleted the WireGuard tunnel network device, which explained why I found the device missing after booting had finished.
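Putting the two settings from the top of this entry into that [Service] section would presumably look something like the following. This is only a sketch: the Type= and RemainAfterExit= lines are the sole additions, everything else is copied from the section above, and the rest of the unit file is not shown here.

```ini
[Service]
# Treat ExecStart= as a run-to-completion script rather than a
# long-running daemon process.
Type=oneshot
# Keep the unit 'active' after the startup script exits, so that
# ExecStop= only runs when the unit is actually stopped.
RemainAfterExit=True
WorkingDirectory=/var/local/wireguard
ExecStart=/var/local/wireguard/startup
ExecStop=/var/local/wireguard/stop
Environment=LANG=C
```

Both settings matter here. A unit with an ExecStart= and no Type= defaults to Type=simple, which is the daemon behaviour described above, and Type=oneshot on its own still lets the unit go inactive (and run ExecStop=) as soon as the startup script finishes.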

The journalctl output won't tell you this; it reports only that the service started and doesn't mention that it has stopped again or that the ExecStop script was run. If I'd looked at 'systemctl status ...' and paid attention, I'd at least have had a clue, because systemd would have told me that it thought that the service was 'inactive (dead)' instead of running. If I'd had both scripts explicitly log that they were running, I would have seen in the logs that my 'stop' script was being executed for some reason; I probably should add this.

This has been a pretty useful learning experience. I know, that probably sounds weird, but my view is that I'd rather make these mistakes and learn these lessons in a non-urgent, non-production situation instead of stubbing my toes on them in production and possibly under stressful conditions.


Written on 10 November 2017.


This is part of CSpace, and is written by ChrisSiebenmann.

Last modified: Fri Nov 10 01:44:04 2017
