FreeBSD thread issues

Andreas Ericsson ae at op5.se
Wed Aug 24 13:21:37 CEST 2005


Christophe Yayon wrote:
> Hi all,
> 
> here is the answer of FreeBSD-hackers list :
> 
> 
> This posting demonstrates a fundamental confusion between thread-safe and
> async-safe.  That is the root of the problem in the communication. 
> Thread-safe functions are a dime a dozen and relatively easy to write. 
> async-safe functions are very rare and much harder to do useful things
> with.  I've tried to explain the difference below using fgets() as an
> example of the difficulties.
> 

Umm... Feels like either me or those guys are missing a third distinction.

* thread-safe; function is guaranteed not to modify global state which 
other threads depend on (fgets(), for example)

* async-safe (async-IO-safe, really); function is guaranteed not to 
block or mess up in IO due to other threads (read(), write() et al.). 
fgets() doesn't belong here, come to think of it, because if you fgets() 
in two separate threads on the same FILE* pointer you'll end up with 
undefined behaviour.

* async-signal-safe; function is guaranteed to only modify parameters it 
has been passed and may be re-entered any number of times at any point 
of execution.

Given that fork() itself is listed in the table of async-signal-safe 
functions I interpret this as thread-safe functions being enough in the 
child, provided that it won't read or write to FILE* pointers that were 
open prior to the fork() call (or any other such thing that might occur).

If these distinctions are wrong, I need to have this clarified before I 
can budge.

> 
>>fgets() must also be async-safe, since it's passed its storage-buffer
>>from the calling function. It can contain races if several threads (or
>>programs for that matter) try to read FIFOs at the same time or are
>>trying to store things to the same piece of memory, but that's neither
>>new, strange nor in any way non-obvious. Obviously, fgets() relies on
>>lower-level IO code which must be thread-safe (read() in this case) on
>>account of them being syscalls inside multitasking kernels.
> 
> 
> fgets need not be async-safe, but it does need to be thread-safe.
> When one forks after pthread_create, one may only call async-safe
> functions.  The weaker requirements of thread safety can be shown to
> not necessarily be async safe.  If two different threads call fgets(),
> mutexes will keep one thread from running if the other is in the
> middle of changing the FILE * internal state.  However, if that thread
> is interrupted by the scheduler with the mutex held, and fork() is
> called, then the new copy of the address space will still have that
> mutex held.


So the child can't use the FILE* pointers opened (and used) in the 
parent. Nothing new under the sun (although without threads the only 
issue is races). If the child fopen()'s a file of its own there will be 
no lock contention and everyone will be happy.

>  Any attempt by this new process, with its own address
> space, to acquire the lock is doomed to failure.  Since the parent and
> child execute in different address spaces, there is no way for a
> thread that does not exist in the child to unlock the locked mutex.
> 
> 
> Normally this happens like so:
> 
>         Thread A                                Thread B
> 
>         fgets(b1, 10, fp);
>                 lock fp's mutex
>                 copy 5 available bytes into b1
> <thread scheduler interrupts here>
>                                                 fgets(b2, 10, fp)
>                                                 try lock fp's mutex
> <thread scheduler puts on the pending list, maybe resuming A>
>                 unlock fp's mutex
>         return
> <thread scheduler wakes up B>
>                                                 attempt to lock finishes
>                                                 b2 can be updated
>                                                 unlock mutex.
> 
> However, in the fork case:
> 
>         Thread A                                Thread B
> 
>         fgets(b1, 10, fp);
>                 lock fp's mutex
>                 copy 5 available bytes into b1
> <thread scheduler interrupts here>
>                                                 fork()
>         <thread A is now gone in child>
>                                                 fgets(b2, 10, fp)
>                                                 try lock fp's mutex
> At this point B', the only thread in the child, will never be able to
> grab this lock because A exists only in the parent and the
> parent/child have independent address spaces.
> 
> While the above example is not what nagios is doing, it illustrates
> the point.  There are some functions that necessarily touch global
> state.  These functions need to coordinate that touching of state.  If
> one of them is interrupted with locks held, then all bets are off if a
> program forks and the threads holding those locks can never unlock
> them.
> 

That's fairly self-evident, but this still means that fgets() in itself 
is safe to re-enter simultaneously as many times as one damn well 
pleases, provided that the first parameter isn't the same in several 
different threads. That doesn't mean that fgets() isn't safe to use in a 
child, it just means that the FILE* pointer isn't safe to use in a child 
(the same problem happens with all the f* functions, really).
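
The fork case from the quoted diagram can be reproduced with a plain 
pthread mutex standing in for the lock inside a FILE. This is only a 
sketch: fake_fp_lock, thread_a and child_sees_held_lock are hypothetical 
names, and the child calls pthread_mutex_trylock(), which is not 
async-signal-safe, purely so that it can report the inherited lock state 
instead of hanging. The result on typical implementations is that the 
child sees the lock as held, but the standard does not promise it.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the lock stdio keeps inside a FILE. */
static pthread_mutex_t fake_fp_lock = PTHREAD_MUTEX_INITIALIZER;
static int ready_pipe[2];    /* thread A -> main: "I hold the lock" */
static int release_pipe[2];  /* main -> thread A: "you may unlock"  */

/* Thread A: takes the lock and sits on it, like an fgets() caught mid-call. */
static void *thread_a(void *arg)
{
    char c = 'x';
    ssize_t n;

    (void)arg;
    pthread_mutex_lock(&fake_fp_lock);
    n = write(ready_pipe[1], &c, 1);
    n = read(release_pipe[0], &c, 1);
    (void)n;
    pthread_mutex_unlock(&fake_fp_lock);
    return NULL;
}

/* Returns 1 if the child found the lock already held, 0 if it did not,
   and -1 on error. */
int child_sees_held_lock(void)
{
    pthread_t a;
    char c = 0;
    int status = 0, rc;
    pid_t pid;

    if (pipe(ready_pipe) != 0 || pipe(release_pipe) != 0)
        return -1;
    rc = pthread_create(&a, NULL, thread_a, NULL);
    assert(rc == 0);
    (void)rc;
    if (read(ready_pipe[0], &c, 1) != 1)   /* wait until A owns the lock */
        return -1;

    pid = fork();                          /* "thread B" forks here */
    if (pid == 0) {
        /* Child: thread A does not exist here, but its lock state was
           copied along with the address space. */
        _exit(pthread_mutex_trylock(&fake_fp_lock) == EBUSY ? 1 : 0);
    }

    /* Parent: let A finish normally, then collect the child's verdict. */
    if (write(release_pipe[1], &c, 1) != 1)
        return -1;
    pthread_join(a, NULL);
    close(ready_pipe[0]);
    close(ready_pipe[1]);
    close(release_pipe[0]);
    close(release_pipe[1]);
    if (pid < 0 || waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Had the child used pthread_mutex_lock() instead of the trylock variant, 
it would have blocked forever, since no thread in the child can ever 
release the lock.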

> 
>> >>  The list of async-signal-safe functions
>> >> are here: http://www.opengroup.org/onlinepubs/009695399/nframe.html
>> >> The restriction on fork() is here (20th bullet down):
>> >> http://www.opengroup.org/onlinepubs/009695399/nframe.html
>>
>>Both of those links point to the same document, which is just the
>>frameset for the navigation-frames.
>>
>>For async-safe functions, this is the proper url;
>>http://www.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_09.html#tag_02_09_01
> 
> 
> This reference is for thread-safe functions.  You are confusing
> thread-safe and async-safe.  The correct url for async-safe is
> 

No, although I've never had to think about async-signal-safety before, 
since I always handle signals atomically and in an async-signal-safe 
way in my own programs.

> http://www.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_04.html#tag_02_04_03
> 
> 
>>>The following table defines a set of functions that shall be either
>>>reentrant or non-interruptible by signals and shall be
>>>async-signal-safe. Therefore applications may invoke them, without
>>>restriction, from signal-catching functions:
>>>       <list omitted, since it has been posted before>
> 
> 
> Notice that this list is very short, and there are many functions that
> one would think should be on here, but in fact aren't.
> 
> 
>>For the fork() specification, the doc is here;
>>http://www.opengroup.org/onlinepubs/009695399/functions/fork.html
> 
> ...
> 
>>"A process shall be created with a single thread. If a multi-threaded
>>process calls fork(), the new process shall contain a replica of the
>>calling thread and its entire address space, possibly including the
>>states of mutexes and other resources. Consequently, to avoid errors,
>>the child process may only execute async-signal-safe operations until
>>such time as one of the exec functions is called.
> 
> 
> Notice here it says specifically 'async-signal-safe operations' not
> 'thread-safe' operations.  The standard explicitly calls attention to
> the difficulties and differences between these two types of functions.
> 

True. As far as Nagios is concerned though, I fail to see what's going 
on. The child doesn't share any resources with the parent and only uses 
calls that shouldn't ever be in a locked state (of course, I can't swear 
to this until I've read the libc source for FreeBSD, but there are ways 
of making malloc() and friends work without lock contention).
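
For comparison, the conservative reading of the quoted fork() text looks 
roughly like the sketch below: the child touches nothing but 
async-signal-safe calls (close(), write(), _exit()) and hands its result 
back over a pipe. The function name run_conservative_child and the "OK" 
payload are made up for illustration; this is not how Nagios is written.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that reports through a pipe using only write() and _exit().
   Returns the number of bytes the parent received, or -1 on error. */
int run_conservative_child(char *buf, size_t buflen)
{
    static const char result[] = "OK";   /* hypothetical check result */
    int fds[2];
    int status = 0;
    ssize_t got;
    pid_t pid;

    assert(buf != NULL && buflen > 0);
    if (pipe(fds) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: no stdio, no malloc(); only async-signal-safe calls. */
        ssize_t n;

        close(fds[0]);
        n = write(fds[1], result, sizeof result - 1);
        _exit(n == (ssize_t)(sizeof result - 1) ? 0 : 1);
    }

    /* Parent: read the child's report, then reap it. */
    close(fds[1]);
    got = read(fds[0], buf, buflen);
    close(fds[0]);
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return (int)got;
}
```

A child written this way cannot trip over a lock inherited from another 
thread, because none of the calls it makes depend on user-space locks.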

> 
>>This is funny, because nagios apparently runs properly on Linux, HPUX,
>>Solaris, Irix, AIX and Tru64. To me that seems to indicate that Nagios
>>is very portable indeed and that the BSD fellows somehow botched it. I
>>might be wrong, but...
> 
> 
> Just because it works doesn't make it standards conforming.
> 

If enough people interpret the standard in a certain way then that 
interpretation pretty much is the real standard.

> Maybe there's some simple extension that can be implemented to help
> the situation.
> 

There are various ways. I believe the most elegant solution would be 
tagging each lock with the id of the owner and simply making it go away 
when needed in a fork()'ed child. It should be cheap in terms of both 
lines of code and performance. Another way is by implicit 
pthread_atfork() handlers, but that's got a messier feel (although, 
again, I haven't looked at the code and have only a rudimentary 
understanding of threads interacting with the *BSD kernel).
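
As a rough illustration of the pthread_atfork() mechanism, the sketch 
below registers explicit handlers in application code: the prepare 
handler takes the lock before fork(), and both parent and child release 
it afterwards, so the child never inherits it in a held state. The 
implicit variant suggested above would live inside libc and is not shown. 
state_lock and the function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical lock protecting some shared state. */
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;

/* Runs in the forking thread just before fork(): take the lock so that
   no other thread can hold it at the moment the address space is copied. */
static void before_fork(void)  { pthread_mutex_lock(&state_lock); }

/* Run after fork() in the parent and in the child respectively. */
static void parent_after(void) { pthread_mutex_unlock(&state_lock); }
static void child_after(void)  { pthread_mutex_unlock(&state_lock); }

/* Registers the three handlers; returns 0 on success. */
int install_fork_handlers(void)
{
    return pthread_atfork(before_fork, parent_after, child_after);
}

/* Forks and lets the child test the lock.
   Returns 0 if the child could acquire it, 1 if not, -1 on error. */
int fork_and_check_lock(void)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        int rc = pthread_mutex_trylock(&state_lock);

        if (rc == 0)
            pthread_mutex_unlock(&state_lock);
        _exit(rc == 0 ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The messy part is that every lock which might be held across a fork() 
needs its own set of handlers, and they have to be taken in a consistent 
order to avoid deadlocking in the prepare step.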


Aside from all this interesting stuff: has anyone been able to determine 
where Nagios actually hangs (aside from inside __pthread_acquire())? 
Perhaps it can be worked around without having to rewrite large portions 
of the code?

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Lead Developer

