abuild-fetch: try to work around an ESTALE error which occurs on NFS

Dmitry Klochkov requested to merge konfetka1989/abuild:fetch-estale into master

Hello,

abuild-fetch can fail with an ESTALE error when the destination directory is on an NFS file system and processes on different hosts try to lock the same file at the same time:

### HOST 1 ###

$ for i in `seq 10`; do echo === $i; ./abuild-fetch -d /nfs https://curl.se/download/curl-8.7.1.tar.xz; done
=== 1
=== 2
=== 3
=== 4
=== 5
=== 6
=== 7
abuild-fetch: failed to acquire lock: /nfs/curl-8.7.1.tar.xz.lock: Stale file handle
=== 8
=== 9
=== 10

### HOST 2 ###

$ for i in `seq 10`; do echo === $i; ./abuild-fetch -d /nfs https://curl.se/download/curl-8.7.1.tar.xz; done
=== 1
abuild-fetch: failed to acquire lock: /nfs/curl-8.7.1.tar.xz.lock: Stale file handle
=== 2
=== 3
abuild-fetch: failed to acquire lock: /nfs/curl-8.7.1.tar.xz.lock: Stale file handle
=== 4
=== 5
=== 6
=== 7
=== 8
abuild-fetch: failed to acquire lock: /nfs/curl-8.7.1.tar.xz.lock: Stale file handle
=== 9
=== 10

This is caused by the following race condition between two processes, A and B:

A                             | B
                              |
lockfd = open(lockfile, ...)  |
                              | unlink(lockfile)
lockf(lockfd, F_LOCK, 0)      |

According to https://nfs.sourceforge.net/#faq_a10, to recover from an ESTALE error, an application must close the file or directory where the error occurred, and reopen it so the NFS client can resolve the pathname again and retrieve the new file handle.

This merge request adds code that makes several attempts to recover from an ESTALE error. It does not fully fix the issue, but it makes hitting it much less likely.
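
For illustration, the close-and-reopen recovery could look roughly like the sketch below. This is not the code from this merge request: the helper name acquire_lock, the retry limit MAX_ESTALE_RETRIES and the open() flags and mode are assumptions made for the example.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical retry limit, chosen for this sketch only. */
#define MAX_ESTALE_RETRIES 5

/*
 * Open lockfile and take an exclusive lock on it. If lockf() fails with
 * ESTALE, close the descriptor and reopen the path so the NFS client can
 * look up the new file handle, up to MAX_ESTALE_RETRIES attempts in total.
 * Returns the locked descriptor, or -1 with errno set.
 */
static int acquire_lock(const char *lockfile)
{
	for (int attempt = 1; ; attempt++) {
		/* Flags and mode are assumptions for this sketch. */
		int fd = open(lockfile, O_WRONLY | O_CREAT, 0660);
		if (fd < 0)
			return -1;

		if (lockf(fd, F_LOCK, 0) == 0)
			return fd;

		/* Save errno because close() may overwrite it. */
		int saved = errno;
		close(fd);

		if (saved != ESTALE || attempt >= MAX_ESTALE_RETRIES) {
			errno = saved;
			return -1;
		}
		/* Stale handle: loop and resolve the pathname again. */
	}
}
```

A bounded loop like this only narrows the race window: another host can still unlink the lock file between the reopen and the next lockf() call, which is why the error can still occur after the last attempt.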

FWIW, I hit this issue only using an NFS server powered by FreeBSD. I couldn't reproduce it using an NFS server powered by Linux.

Edited by Dmitry Klochkov

Merge request reports