apk hangs on lossy network
We are having some serious issue with our s390x CI. The Ci jobs relatively often hangs "forever", til they time out. (example: https://gitlab.alpinelinux.org/alpine/aports/-/jobs/1141094)
I have tried to troubleshoot it on the machine itself (alpine 3.15), and I have been able to reproduce it, both with apk update
and with apk fetch
. It appears to be a delay of around 3 seconds in the beginning of the session. If it passes that, it continues. If it fails there, it never recovers. It appears to happen more often with https than http (but I can not confirm that).
The network seems to be lossy:
gitlab-runner-s390x [~]# netstat -s | grep segments
Problem while parsing /proc/net/netstat
1332786 segments received
1631016 segments send out
23286 segments retransmited
2533 bad segments received.
Here is a section of strace where it passes. Note the read
that takes 3 seconds:
...
14:07:44.802523 access("/etc/apk/cert.pem", R_OK) = -1 ENOENT (No such file or directory)
14:07:44.802585 mmap(NULL, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x3ff9c807000
14:07:44.802623 mmap(NULL, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x3ff9c802000
14:07:44.802663 mmap(NULL, 57344, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x3ff9c7f4000
14:07:44.802707 clock_gettime(CLOCK_REALTIME, {tv_sec=1697119664, tv_nsec=802717867}) = 0
14:07:44.802739 getpid() = 50721
14:07:44.802765 getpid() = 50721
14:07:44.802793 clock_gettime(CLOCK_REALTIME, {tv_sec=1697119664, tv_nsec=802798963}) = 0
14:07:44.802824 getpid() = 50721
14:07:44.802849 getpid() = 50721
14:07:44.802877 clock_gettime(CLOCK_REALTIME, {tv_sec=1697119664, tv_nsec=802882220}) = 0
14:07:44.802930 getpid() = 50721
14:07:44.802959 getpid() = 50721
14:07:44.802977 clock_gettime(CLOCK_REALTIME, {tv_sec=1697119664, tv_nsec=802989280}) = 0
14:07:44.803107 write(7, "\26\3\1\1?\1\0\1;\3\3\230{#M^\313t\374\307V\232\373\t\303\341\371\242\205\234L\367"..., 324) = 324
14:07:44.803157 read(7, "\26\3\3\0z", 5) = 5
14:07:48.267258 read(7, "\2\0\0v\3\3\n\2;\7\ra\263\322\322 \213Y\360\230\5\314C\v\267\266\375\355\244W+\321"..., 122) = 122
14:07:48.267513 read(7, "\24\3\3\0\1", 5) = 5
14:07:48.267534 read(7, "\1", 1) = 1
14:07:48.267555 read(7, "\27\3\3\0\33", 5) = 5
14:07:48.267576 read(7, "aB\220Y\354|\4\30\336\302\36jI\27\265\261\354(\320L\252\4\235\235\230\215/", 27) = 27
14:07:48.267603 read(7, "\27\3\3\17\244", 5) = 5
14:07:48.267624 read(7, "\35Fp9\315\370\300D\352\266\320\336g\177\325Ki\241\265\17\21\330L\0\n\266|\232\21\322\r\212"..., 4004) = 1274
14:07:48.267649 read(7, "4q\t\374\314\360\220#V\366\375)\t\t\276\276\r\371\362Smj\266\37\305\205\7\200\313\202^W"..., 2730) = 1444
14:07:48.270345 read(7, "{\314\332\373\224\217?\342\352D\20*\240n\0~\326$\245C\334\2278~d3\256l-O\354)"..., 1286) = 1286
14:07:48.272149 mmap(NULL, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x3ff9c7ec000
14:07:48.272218 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x3ff9c7ea000
14:07:48.272405 stat("/etc/ssl1.1/certs/8d33f237.0", 0x3ffea1fc6c8) = -1 ENOENT (No such file or directory)
14:07:48.272500 clock_gettime(CLOCK_REALTIME, {tv_sec=1697119668, tv_nsec=272509352}) = 0
...
Here is a section of strace where it hangs:
14:07:50.145845 access("/etc/apk/cert.pem", R_OK) = -1 ENOENT (No such file or directory)
14:07:50.145903 mmap(NULL, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x3ffa0887000
14:07:50.145938 mmap(NULL, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x3ffa0882000
14:07:50.145967 mmap(NULL, 57344, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x3ffa0874000
14:07:50.146005 clock_gettime(CLOCK_REALTIME, {tv_sec=1697119670, tv_nsec=146010306}) = 0
14:07:50.146031 getpid() = 50726
14:07:50.146055 getpid() = 50726
14:07:50.146077 clock_gettime(CLOCK_REALTIME, {tv_sec=1697119670, tv_nsec=146081838}) = 0
14:07:50.146105 getpid() = 50726
14:07:50.146128 getpid() = 50726
14:07:50.146150 clock_gettime(CLOCK_REALTIME, {tv_sec=1697119670, tv_nsec=146154779}) = 0
14:07:50.146198 getpid() = 50726
14:07:50.146220 getpid() = 50726
14:07:50.146243 clock_gettime(CLOCK_REALTIME, {tv_sec=1697119670, tv_nsec=146248268}) = 0
14:07:50.146366 write(7, "\26\3\1\1?\1\0\1;\3\3\36\216%\354y\37\363\32\37\214\"\262\373\332\216\201\264\233I\301\214"..., 324) = 324
14:07:50.146414 read(7,
(It is now Thu Oct 12 14:42:06 UTC 2023 so it has been more than 15 mins.) I believe this is related: #10869 (closed)
I was able to reproduce it with curl, which exits after 5 mins with:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:05:00 --:--:-- 0
curl: (28) SSL connection timeout
I wonder if it would make sense to have something like curl's --connect-timeout
and --retry
options? Then we could time out early if the initial/SSL connection takes longer than expected, and we could make apk retry a few times before giving up from abuild and CI.