* cygheap.cc (tls_sentry): Move here, rename from 'sentry' in cygtls.cc (tls_sentry::lock): Ditto. (nthreads): Move from cygtls.cc (THREADLIST_CHUNK): Ditto. (cygheap_init): Call init_tls_list(). (init_cygheap::init_tls_list): Define new function. (init_cygheap::add_tls): Ditto. (init_cygheap::remove_tls): Ditto. (init_cygheap::find_tls): Ditto. Semi-resurrect from _cygtls::find_tls. * cygheap.h (init_cygheap::init_tls_list): Declare new function. (init_cygheap::add_tls): Ditto. (init_cygheap::remove_tls): Ditto. (init_cygheap::find_tls): Ditto. * cygtls.cc (sentry): Delete. (sentry::lock): Ditto. (nthreads): Ditto. (THREADLIST_CHUNK): Ditto. (_cygtls::init): Delete definition. (_cygtls::init_thread): Call cygheap->add_tls() to add thread to global list. (_cygtls::remove): cygheap->remove_tls() to remove thread from global list. * cygtls.h (_cygtls::init): Delete declaration. * dcrt0.cc (dll_crt0_0): Delete call to _cygtls::init(). * exceptions.cc (sigpacket::process): When no thread is specified, try to find one via cygheap->find_tls.
		
			
				
	
	
		
			386 lines
		
	
	
		
			17 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			386 lines
		
	
	
		
			17 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
| 2012-08-09  cgf-000014
 | |
| 
 | |
| So, apparently I got it somewhat right before wrt signal handling.
 | |
| Checking on linux, it appears that signals will be sent to a thread
 | |
| which can accept the signal.  So resurrecting and extending the
 | |
| "find_tls" function is in order.  This function will return the tls
 | |
| of any thread which 1) is waiting for a signal with sigwait*() or
 | |
| 2) has the signal unmasked.
 | |
| 
 | |
| In redoing this it became obvious that I had the class designation wrong
 | |
| for the threadlist handling so I moved the manipulation of the global
 | |
| threadlist into the cygheap where it logically belongs.
 | |
| 
 | |
| 2012-07-21  cgf-000013
 | |
| 
 | |
| These changes reflect a revamp of the "wait for signal" functionality
 | |
| which has existed in Cygwin through several signal massages.
 | |
| 
 | |
| We now create a signal event only when a thread is waiting for a signal
 | |
| and arm it only for that thread.  The "set_signal_arrived" function is
 | |
| used to establish the event and set it in a location referencable by
 | |
| the caller.
 | |
| 
 | |
| I still do not handle all of the race conditions.  What happens when
 | |
| a signal comes in just after a WF?O succeeds for some other purpose?  I
 | |
| suspect that it will arm the next WF?O call and the subsequent call to
 | |
| call_signal_handler could cause a function to get an EINTR when possibly
 | |
| it shouldn't have.
 | |
| 
 | |
| I haven't yet checked all of the test cases for the URL listed in the
 | |
| previous entry.
 | |
| 
 | |
| Baby steps.
 | |
| 
 | |
| 2012-06-12  cgf-000012
 | |
| 
 | |
| These changes are the preliminary for redoing the way threads wait for
 | |
| signals.  The problems are shown by the test case mentioned here:
 | |
| 
 | |
| http://cygwin.com/ml/cygwin/2012-05/msg00434.html
 | |
| 
 | |
| I've known that the signal handling in threads wasn't quite right for
 | |
| some time.  I lost all of my thread signal tests in the great "rm -r"
 | |
| debacle of a few years ago and have been less than enthusiastic about
 | |
| redoing everything (I had PCTS tests and everything).  But it really is
 | |
| time to redo this signal handling to make it more like it is supposed to
 | |
| be.
 | |
| 
 | |
| This change should not introduce any new behavior.  Things should
 | |
| continue to behave as before.  The major differences are a change in the
 | |
| arguments to cancelable_wait and cygwait now uses cancelable_wait and,
 | |
| so, the returns from cygwait now mirror cancelable_wait.
 | |
| 
 | |
| The next change will consolidate cygwait and cancelable_wait into one
 | |
| cygwait function.
 | |
| 
 | |
| 2012-06-02  cgf-000011
 | |
| 
 | |
| The refcnt handling was tricky to get right but I had convinced myself
 | |
| that the refcnt's were always incremented/decremented under a lock.
 | |
| Corinna's 2012-05-23 change to refcnt exposed a potential problem with
 | |
| dup handling where the fdtab could be updated while not locked.
 | |
| 
 | |
| That should be fixed by this change but, on closer examination, it seems
 | |
| like there are many places where it is possible for the refcnt to be
 | |
| updated while the fdtab is not locked since the default for
 | |
| cygheap_fdget is to not lock the fdtab (and that should be the default -
 | |
| you can't have read holding a lock).
 | |
| 
 | |
| Since refcnt was only ever called with 1 or -1, I broke it up into two
 | |
| functions but kept the Interlocked* operation.  Incrementing a variable
 | |
| should not be as racy as adding an arbitrary number to it but we have
 | |
| InterlockedIncrement/InterlockedDecrement for a reason so I kept the
 | |
| Interlocked operation here.
 | |
| 
 | |
| In the meantime, I'll be mulling over whether the refcnt operations are
 | |
| actually safe as they are.  Maybe just ensuring that they are atomically
 | |
| updated is enough since they control the destruction of an fh.  If I got
 | |
| the ordering right with incrementing and decrementing then that should
 | |
| be adequate.
 | |
| 
 | |
| 2012-06-02  cgf-000010
 | |
| 
 | |
| <1.7.16>
 | |
| - Fix emacs problem which exposed an issue with Cygwin's select() function.
 | |
|   If a signal arrives while select is blocking and the program longjmps
 | |
|   out of the signal handler then threads and memory may be left hanging.
 | |
|   Fixes: http://cygwin.com/ml/cygwin/2012-05/threads.html#00275
 | |
| </1.7.16>
 | |
| 
 | |
| This was try #4 or #5 to get select() signal handling working right.
 | |
| It's still not there but it should now at least not leak memory or
 | |
| threads.
 | |
| 
 | |
| I mucked with the interface between cygwin_select and select_stuff::wait
 | |
| so that the "new" loop in select_stuff::wait() was essentially moved
 | |
| into the caller.  cygwin_select now uses various enum states to decide
 | |
| what to do.  It builds the select linked list at the beginning of the
 | |
| loop, allowing wait() to tear everything down and restart.  This is
 | |
| necessary before calling a signal handler because the signal handler may
 | |
| longjmp away.
 | |
| 
 | |
| I initially had this all coded up to use a special signal_cleanup
 | |
| callback which could be called when a longjmp is called in a signal
 | |
| handler.  And cygwin_select() set up and tore down this callback.  Once
 | |
| I got everything compiling it, of course, dawned on me that just because
 | |
| you call a longjmp in a signal handler it doesn't mean that you are
 | |
| jumping *out* of the signal handler.  So, if the signal handler invokes
 | |
| the callback and returns it will be very bad for select().  Hence, this
 | |
| slower, but hopefully more correct implementation.
 | |
| 
 | |
| (I still wonder if some sort of signal cleanup callback might still
 | |
| be useful in the future)
 | |
| 
 | |
| TODO: I need to do an audit of other places where this problem could be
 | |
| occurring.
 | |
| 
 | |
| As alluded to above, select's signal handling is still not right.  It
 | |
| still acts as if it could call a signal handler from something other
 | |
| than the main thread but, AFAICT, from my STC, this doesn't seem to be
 | |
| the case.  It might be worthwhile to extend cygwait to just magically
 | |
| figure this out and not even bother using w4[0] for scenarios like this.
 | |
| 
 | |
| 2012-05-16  cgf-000009
 | |
| 
 | |
| <1.7.16>
 | |
| - Fix broken console mouse handling.  Reported here:
 | |
|   http://cygwin.com/ml/cygwin/2012-05/msg00360.html
 | |
| </1.7.16>
 | |
| 
 | |
| I did a cvs annotate on smallprint.cc and see that the code to translate
 | |
| %characters > 127 to 0x notation was in the 1.1 revision.  Then I
 | |
| checked the smallprint.c predecessor.  It was in the 1.1 version of that
 | |
| program too, which means that this odd change has probably been around
 | |
| since <= 2000.
 | |
| 
 | |
| Since __small_sprintf is supposed to emulate sprintf, I got rid of the
 | |
| special case handling.  This may affect fhandler_socket::bind.  If so, we
 | |
| should work around this problem there rather than keeping this strange
 | |
| hack in __small_printf.
 | |
| 
 | |
| 2012-05-14  cgf-000008
 | |
| 
 | |
| <1.7.16>
 | |
| - Fix hang when zero bytes are written to a pty using
 | |
|   Windows WriteFile or equivalent.  Fixes:
 | |
|   http://cygwin.com/ml/cygwin/2012-05/msg00323.html
 | |
| </1.7.16>
 | |
| 
 | |
| cgf-000002, as usual, fixed one thing while breaking another.  See
 | |
| Larry's predicament in: http://goo.gl/oGEr2 .
 | |
| 
 | |
| The problem is that zero byte writes to the pty pipe caused the dread
 | |
| end-of-the-world-as-we-know-it problem reported on the mailing list
 | |
| where ReadFile reads zero bytes even though there is still more to read
 | |
| on the pipe.  This is because that change caused a 'record' to be read
 | |
| and a record can be zero bytes.
 | |
| 
 | |
| I was never really keen about using a throwaway buffer just to get a
 | |
| count of the number of characters available to be read in the pty pipe.
 | |
| On closer reading of the documentation for PeekNamedPipe it seemed like
 | |
| the sixth argument to PeekNamedPipe should return what I needed without
 | |
| using a buffer.  And, amazingly, it did, except that the problem still
 | |
| remained - a zero byte message still screwed things up.
 | |
| 
 | |
| So, we now detect the case where there is zero bytes available as a
 | |
| message but there are bytes available in the pipe.  In that scenario,
 | |
| return the bytes available in the pipe rather than the message length of
 | |
| zero.  This could conceivably cause problems with pty pipe handling in
 | |
| this scenario but since the only way this scenario could possibly happen
 | |
| is when someone is writing zero bytes using WriteFile to a pty pipe, I'm
 | |
| ok with that.
 | |
| 
 | |
| 2012-05-14  cgf-000007
 | |
| 
 | |
| <1.7.16>
 | |
| - Fix invocation of strace from a cygwin process.  Fixes:
 | |
|   http://cygwin.com/ml/cygwin/2012-05/msg00292.html
 | |
| </1.7.16>
 | |
| 
 | |
| The change in cgf-000004 introduced a problem for processes which load
 | |
| cygwin1.dll dynamically.  strace.exe is the most prominent example of
 | |
| this.
 | |
| 
 | |
| Since the parent handle is now closed for "non-Cygwin" processes, when
 | |
| strace.exe tried to dynamically load cygwin1.dll, the handle was invalid
 | |
| and child_info_spawn::handle_spawn couldn't use retrieve information
 | |
| from the parent.  This eventually led to a strace_printf error due to an
 | |
| attempt to dereference an unavailable cygheap.  Probably have to fix
 | |
| this someday.  You shouldn't use the cygheap while attempting to print
 | |
| an error about the inavailability of said cygheap.
 | |
| 
 | |
| This was fixed by saving the parent pid in child_info_spawn and calling
 | |
| OpenProcess for the parent pid and using that handle iff a process is
 | |
| dynamically loaded.
 | |
| 
 | |
| 2012-05-12  cgf-000006
 | |
| 
 | |
| <1.7.16>
 | |
| - Fix hang when calling pthread_testcancel in a canceled thread.
 | |
|   Fixes some of: http://cygwin.com/ml/cygwin/2012-05/msg00186.html
 | |
| </1.7.16>
 | |
| 
 | |
| This should fix the first part of the reported problem in the above
 | |
| message.  The cancel seemed to actually be working but, the fprintf
 | |
| eventually ended up calling pthread_testcancel.  Since we'd gotten here
 | |
| via a cancel, it tried to recursively call the cancel handler causing a
 | |
| recursive loop.
 | |
| 
 | |
| 2012-05-12  cgf-000005
 | |
| 
 | |
| <1.7.16>
 | |
| - Fix pipe creation problem which manifested as a problem creating a
 | |
| fifo.  Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00253.html
 | |
| </1.7.16>
 | |
| 
 | |
| My change on 2012-04-28 introduced a problem with fifos.  The passed
 | |
| in name was overwritten.  This was because I wasn't properly keeping
 | |
| track of the length of the generated pipe name when there was a
 | |
| name passed in to fhandler_pipe::create.
 | |
| 
 | |
| There was also another problem in fhandler_pipe::create.  Since fifos
 | |
| use PIPE_ACCESS_DUPLEX and PIPE_ACCESS_DUPLEX is an or'ing of
 | |
| PIPE_ACCESS_INBOUND and PIPE_ACCESS_OUTBOUND, using PIPE_ACCESS_OUTBOUND
 | |
| as a "never-used" option for PIPE_ADD_PID in fhandler.h was wrong.  So,
 | |
| fifo creation attempted to add the pid of a pipe to the name which is
 | |
| wrong for fifos.
 | |
| 
 | |
| 2012-05-08  cgf-000004
 | |
| 
 | |
| The change for cgf-000003 introduced a new problem:
 | |
| http://cygwin.com/ml/cygwin/2012-05/msg00154.html
 | |
| http://cygwin.com/ml/cygwin/2012-05/msg00157.html
 | |
| 
 | |
| Since a handle associated with the parent is no longer being duplicated
 | |
| into a non-cygwin "execed child", Windows is free to reuse the pid of
 | |
| the parent when the parent exits.  However, since we *did* duplicate a
 | |
| handle pointing to the pid's shared memory area into the "execed child",
 | |
| the shared memory for the pid was still active.
 | |
| 
 | |
| Since the shared memory was still available, if a new process reuses the
 | |
| previous pid, Cygwin would detect that the shared memory was not created
 | |
| and had a "PID_REAPED" flag.  That was considered an error, and, so, it
 | |
| would set procinfo to NULL and pinfo::thisproc would die since this
 | |
| situation is not supposed to occur.
 | |
| 
 | |
| I fixed this in two ways:
 | |
| 
 | |
| 1) If a shared memory region has a PID_REAPED flag then zero it and
 | |
| reuse it.  This should be safe since you are not really supposed to be
 | |
| querying the shared memory region for anything after PID_REAPED has been
 | |
| set.
 | |
| 
 | |
| 2) Forego duping a copy of myself_pinfo if we're starting a non-cygwin
 | |
| child for exec.
 | |
| 
 | |
| It seems like 2) is a common theme and an audit of all of the handles
 | |
| that are being passed to non-cygwin children is in order for 1.7.16.
 | |
| 
 | |
| The other minor modification that was made in this change was to add the
 | |
| pid of the failing process to fork error output.  This helps slightly
 | |
| when looking at strace output, even though in this case it was easy to
 | |
| find what was failing by looking for '^---' when running the "stv"
 | |
| strace dumper.  That found the offending exception quickly.
 | |
| 
 | |
| 2012-05-07  cgf-000003
 | |
| 
 | |
| <1.7.15>
 | |
| Don't make Cygwin wait for all children of a non-cygwin child program.
 | |
| Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00063.html,
 | |
|        http://cygwin.com/ml/cygwin/2012-05/msg00075.html
 | |
| </1.7.15>
 | |
| 
 | |
| This problem is due to a recent change which added some robustness and
 | |
| speed to Cygwin's exec/spawn handling by not trying to force inheritance
 | |
| every time a process is started.  See ChangeLog entries starting on
 | |
| 2012-03-20, and multiple on 2012-03-21.
 | |
| 
 | |
| Making the handle inheritable meant that, as usual, there were problems
 | |
| with non-Cygwin processes.  When Cygwin "execs" a non-Cygwin process N,
 | |
| all of its N + 1, N + 2, ...  children will also inherit the handle.
 | |
| That means that Cygwin will wait until all subprocesses have exited
 | |
| before it returns.
 | |
| 
 | |
| I was willing to make this a restriction of starting non-Cygwin
 | |
| processes but the problem with allowing that is that it can cause the
 | |
| creation of a "limbo" pid when N exits and N + 1 and friends are still
 | |
| around.  In this scenario, Cygwin dutifully notices that process N has
 | |
| died and sets the exit code to indicate that but N's parent will wait on
 | |
| rd_proc_pipe and will only return when every N + ...  windows process
 | |
| has exited.
 | |
| 
 | |
| The removal of cygheap::pid_handle was not related to the initial
 | |
| problem that I set out to fix.  The change came from the realization
 | |
| that we were duping the current process handle into the child twice and
 | |
| only needed to do it once.  The current process handle is used by exec
 | |
| to keep the Windows pid "alive" so that it will not be reused.  So, now
 | |
| we just close parent in child_info_spawn::handle_spawn iff we're not
 | |
| execing.
 | |
| 
 | |
| In debugging this it bothered me that 'ps' identified a nonactive pid as
 | |
| active.  Part of the reason for this was the 'parent' handle in
 | |
| child_info was opened in non-Cygwin processes, keeping the pid alive.
 | |
| That has been kluged around (more changes after 1.7.15) but that didn't
 | |
| fix the problem.  On further investigation, this seems to be caused by
 | |
| the fact that the shared memory region pid handles were still being
 | |
| passed to non-cygwin children, keeping the pid alive in a limbo-like
 | |
| fashion.  This was easily fixed by having pinfo::init() consider a
 | |
| memory region with PID_REAPED as not available.  A more robust fix
 | |
| should be considered for 1.7.15+ where these handles are not passed
 | |
| to non-cygwin processes.
 | |
| 
 | |
| This fixed the problem where a pid showed up in the list after a user
 | |
| does something like: "bash$ cmd /c start notepad" but, for some reason,
 | |
| it does not fix the problem where "bash$ setsid cmd /c start notepad".
 | |
| That bears investigation after 1.7.15 is released but it is not a
 | |
| regression and so is not a blocker for the release.
 | |
| 
 | |
| 2012-05-03  cgf-000002
 | |
| 
 | |
| <1.7.15>
 | |
| Fix problem where too much input was attempted to be read from a
 | |
| pty slave.  Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00049.html
 | |
| </1.7.15>
 | |
| 
 | |
| My change on 2012/04/05 reintroduced the problem first described by:
 | |
| http://cygwin.com/ml/cygwin/2011-10/threads.html#00445
 | |
| 
 | |
| The problem then was, IIRC, due to the fact that bytes sent to the pty
 | |
| pipe were not written as records.  Changing pipe to PIPE_TYPE_MESSAGE in
 | |
| pipe.cc fixed the problem since writing lines to one side of the pipe
 | |
| caused exactly that the number of characters to be read on the other
 | |
| even if there were more characters in the pipe.
 | |
| 
 | |
| To debug this, I first replaced fhandler_tty.cc with the 1.258,
 | |
| 2012/04/05 version.  The test case started working when I did that.
 | |
| 
 | |
| So, then, I replaced individual functions, one at a time, in
 | |
| fhandler_tty.cc with their previous versions.  I'd expected this to be a
 | |
| problem with fhandler_pty_master::process_slave_output since that had
 | |
| seen the most changes but was surprised to see that the culprit was
 | |
| fhandler_pty_slave::read().
 | |
| 
 | |
| The reason was that I really needed the bytes_available() function to
 | |
| return the number of bytes which would be read in the next operation
 | |
| rather than the number of bytes available in the pipe.  That's because
 | |
| there may be a number of lines available to be read but the number of
 | |
| bytes which will be read by ReadFile should reflect the mode of the pty
 | |
| and, if there is a line to read, only the number of bytes in the line
 | |
| should be seen as available for the next read.
 | |
| 
 | |
| Having bytes_available() return the number of bytes which would be read
 | |
| seemed to fix the problem but it could subtly change the behavior of
 | |
| other callers of this function.  However, I actually think this is
 | |
| probably a good thing since they probably should have been seeing the
 | |
| line behavior.
 | |
| 
 | |
| 2012-05-02  cgf-000001
 | |
| 
 | |
| <1.7.15>
 | |
| Fix problem setting parent pid to 1 when process with children execs
 | |
| itself.  Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00009.html
 | |
| </1.7.15>
 | |
| 
 | |
| Investigating this problem with strace showed that ssh-agent was
 | |
| checking the parent pid and getting a 1 when it shouldn't have.  Other
 | |
| stuff looked ok so I chose to consider this a smoking gun.
 | |
| 
 | |
| Going back to the version that the OP said did not have the problem, I
 | |
| worked forward until I found where the problem first occurred -
 | |
| somewhere around 2012-03-19.  And, indeed, the getppid call returned the
 | |
| correct value in the working version.  That means that this stopped
 | |
| working when I redid the way the process pipe was inherited around
 | |
| this time period.
 | |
| 
 | |
| It isn't clear why (and I suspect I may have to debug this further at
 | |
| some point) this hasn't always been a problem but I made the obvious fix.
 | |
| We shouldn't have been setting ppid = 1 when we're about to pass off to
 | |
| an execed process.
 | |
| 
 | |
| As I was writing this, I realized that it was necessary to add some
 | |
| additional checks.  Just checking for "have_execed" isn't enough.  If
 | |
| we've execed a non-cygwin process then it won't know how to deal with
 | |
| any inherited children.  So, always set ppid = 1 if we've execed a
 | |
| non-cygwin process.
 |