Xendomains init script needs a kill timer
Hi,
I’m really sorry to file a bug. :)
Basically, the Xen init script assumes that a VM goes through these steps on shutdown:
The script triggers a shutdown notice to the VM:
- Shutting down Xen domains from /etc/xen/auto
* Asking domain waxu0604 to shutdown in the background… …
Shutting down domain 10
Waiting for 1 domains
[it then proceeds with the next domain, and knows it is still waiting for the above VM]
The script nicely waits until a VM completes shutdown and the accompanying xl daemon terminates.
Unfortunately, life isn’t fair.
Like real PCs, Xen VMs running an OS can get stuck.
Most of the time it’ll be due to some event channel “packet loss” (*)
or whatever; it doesn’t matter what the cause is.
VMs don’t reliably shut down just because we tell them to.
So, basically, I’d request a timer like the one I see in
/etc/default/xendomains (set to 300s), but one that actually works,
because that one doesn’t. I think someone wrote this really nice and
sweet new init script, which is definitely better than the original one,
but dropped the call to “destroy” after timer expiration. But cheer up:
this feature was horribly broken in the original xendomains script back
in xen3, so it’s not even a full regression.
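The missing piece could look something like this minimal sketch. To be clear, this is my own illustration, not the actual init script: the function name `wait_or_destroy` and the `XL` variable are made up here, and I’m assuming `XENDOMAINS_STOP_MAXWAIT` is the 300s knob from /etc/default/xendomains:

```shell
#!/bin/sh
# Hypothetical sketch of a shutdown timer with a destroy fallback.
# XL and wait_or_destroy are placeholder names for illustration.
XL=${XL:-xl}
TIMEOUT=${XENDOMAINS_STOP_MAXWAIT:-300}   # same knob as /etc/default/xendomains

wait_or_destroy() {
    dom=$1
    waited=0
    # Poll until "xl list <dom>" fails, i.e. the domain is gone.
    while "$XL" list "$dom" >/dev/null 2>&1; do
        if [ "$waited" -ge "$TIMEOUT" ]; then
            echo "Domain $dom did not shut down within ${TIMEOUT}s, destroying it"
            "$XL" destroy "$dom"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

The point is simply that after the polite wait expires, something has to call “destroy”, otherwise the host hangs forever on one stuck VM.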
Anyway, right now I have a simple test loop of
/etc/init.d/xendomains start
sleep 90
/etc/init.d/xendomains stop
sleep 90
and I see two things:
- VM start/stop hangs a lot more often than on 4.2.x, the version from 2.4. I’m digging into this.
- the xendomains script can’t reliably sort it out. This one is a
long-term issue that needs a long-term fix.
Basically, if you enter “reboot” on the host, it needs to reboot at SOME point, not keep waiting for a stuck VM until the end of the universe (the current behaviour).
One possibility to make an xl destroy less harmful, at least with PV domUs,
is to first fire an emergency-sync sysrq, sleep 2-3 seconds, and then an
emergency-remount-ro sysrq before the destroy.
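Something along these lines; again just a sketch, with `soft_destroy` and `XL` being names I made up for this report, and relying on the kernel magic-SysRq letters ‘s’ (sync) and ‘u’ (remount read-only):

```shell
#!/bin/sh
# Hypothetical sketch: soften an xl destroy for PV domUs by asking the
# guest kernel to flush and remount read-only before pulling the plug.
XL=${XL:-xl}

soft_destroy() {
    dom=$1
    "$XL" sysrq "$dom" s    # magic SysRq 's': emergency sync
    sleep 3                 # give the guest a moment to flush to disk
    "$XL" sysrq "$dom" u    # magic SysRq 'u': emergency remount read-only
    "$XL" destroy "$dom"    # now the hard kill loses a lot less data
}
```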
(* Example from xen3 times: the “OK, dear domU, I heard you completed shutdown, so you may unwind your resources now” message could go to an event channel via dom0 vcpu0 and be sent to domU vcpu0, while sadly the VM is currently processing the event channel on vcpu1. Other causes: xenconsoled bugs, broken PV domU kernels (like the ones Debian specialized in in the past), stuck live migrations, VMs hung at the boot loader, VMs crashed into a kernel console, zombie VMs. And this, and that. It’s endless.)
Final note: I see the behaviour of my reboot loop has become a lot more stable over the time I’ve been typing this report.
(from redmine: issue id 2252, created on 2013-09-14)
- Uploads:
- xendomains.init: custom Xen init script with better flow.