330 lines
15 KiB
Plaintext
330 lines
15 KiB
Plaintext
2012-06-02 cgf-000011
|
|
|
|
The refcnt handling was tricky to get right but I had convinced myself
|
|
that the refcnt's were always incremented/decremented under a lock.
|
|
Corinna's 2012-05-23 change to refcnt exposed a potential problem with
|
|
dup handling where the fdtab could be updated while not locked.
|
|
|
|
That should be fixed by this change but, on closer examination, it seems
|
|
ilke there are many places where it is possible for the refcnt to be
|
|
updated while the fdtab is not locked since the default for
|
|
cygheap_fdget is to not lock the fdtab (and that should be the default -
|
|
you can't have read holding a lock).
|
|
|
|
Since refcnt was only ever called with 1 or -1, I broke it up into two
|
|
functions but kept the Interlocked* operation. Incrementing a variable
|
|
should not be as racy as adding an arbitrary number to it but we have
|
|
InterlockedIncrement/InterlockedDecrement for a reason so I kept the
|
|
Interlocked operation here.
|
|
|
|
In the meantime, I'll be mulling over whether the refcnt operations are
|
|
actually safe as they are. Maybe just ensuring that they are atomically
|
|
updated is enough since they control the destruction of an fh. If I got
|
|
the ordering right with incrementing and decrementing then that should
|
|
be adequate.
|
|
|
|
2012-06-02 cgf-000010
|
|
|
|
<1.7.16>
|
|
- Fix emacs problem which exposed an issue with Cygwin's select() function.
|
|
If a signal arrives while select is blocking and the program longjmps
|
|
out of the signal handler then threads and memory may be left hanging.
|
|
Fixes: http://cygwin.com/ml/cygwin/2012-05/threads.html#00275
|
|
</1.7.16>
|
|
|
|
This was try #4 or #5 to get select() signal handling working right.
|
|
It's still not there but it should now at least not leak memory or
|
|
threads.
|
|
|
|
I mucked with the interface between cygwin_select and select_stuff::wait
|
|
so that the "new" loop in select_stuff::wait() was essentially moved
|
|
into the caller. cygwin_select now uses various enum states to decide
|
|
what to do. It builds the select linked list at the beginning of the
|
|
loop, allowing wait() to tear everything down and restart. This is
|
|
necessary before calling a signal handler because the signal handler may
|
|
longjmp away.
|
|
|
|
I initially had this all coded up to use a special signal_cleanup
|
|
callback which could be called when a longjmp is called in a signal
|
|
handler. And cygwin_select() set up and tore down this callback. Once
|
|
I got everything compiling it, of course, dawned on me that just because
|
|
you call a longjmp in a signal handler it doesn't mean that you are
|
|
jumping *out* of the signal handler. So, if the signal handler invokes
|
|
the callback and returns it will be very bad for select(). Hence, this
|
|
slower, but hopefully more correct implementation.
|
|
|
|
(I still wonder if some sort of signal cleanup callback might still
|
|
be useful in the future)
|
|
|
|
TODO: I need to do an audit of other places where this problem could be
|
|
occurring.
|
|
|
|
As alluded to above, select's signal handling is still not right. It
|
|
still acts as if it could call a signal handler from something other
|
|
than the main thread but, AFAICT, from my STC, this doesn't seem to be
|
|
the case. It might be worthwhile to extend cygwait to just magically
|
|
figure this out and not even bother using w4[0] for scenarios like this.
|
|
|
|
2012-05-16 cgf-000009
|
|
|
|
<1.7.16>
|
|
- Fix broken console mouse handling. Reported here:
|
|
http://cygwin.com/ml/cygwin/2012-05/msg00360.html
|
|
</1.7.16>
|
|
|
|
I did a cvs annotate on smallprint.cc and see that the code to translate
|
|
%characters > 127 to 0x notation was in the 1.1 revision. Then I
|
|
checked the smallprint.c predecessor. It was in the 1.1 version of that
|
|
program too, which means that this odd change has probably been around
|
|
since <= 2000.
|
|
|
|
Since __small_sprintf is supposed to emulate sprintf, I got rid of the
|
|
special case handling. This may affect fhandler_socket::bind. If so, we
|
|
should work around this problem there rather than keeping this strange
|
|
hack in __small_printf.
|
|
|
|
2012-05-14 cgf-000008
|
|
|
|
<1.7.16>
|
|
- Fix hang when zero bytes are written to a pty using
|
|
Windows WriteFile or equivalent. Fixes:
|
|
http://cygwin.com/ml/cygwin/2012-05/msg00323.html
|
|
</1.7.16>
|
|
|
|
cgf-000002, as usual, fixed one thing while breaking another. See
|
|
Larry's predicament in: http://goo.gl/oGEr2 .
|
|
|
|
The problem is that zero byte writes to the pty pipe caused the dread
|
|
end-of-the-world-as-we-know-it problem reported on the mailing list
|
|
where ReadFile reads zero bytes even though there is still more to read
|
|
on the pipe. This is because that change caused a 'record' to be read
|
|
and a record can be zero bytes.
|
|
|
|
I was never really keen about using a throwaway buffer just to get a
|
|
count of the number of characters available to be read in the pty pipe.
|
|
On closer reading of the documentation for PeekNamedPipe it seemed like
|
|
the sixth argument to PeekNamedPipe should return what I needed without
|
|
using a buffer. And, amazingly, it did, except that the problem still
|
|
remained - a zero byte message still screwed things up.
|
|
|
|
So, we now detect the case where there is zero bytes available as a
|
|
message but there are bytes available in the pipe. In that scenario,
|
|
return the bytes available in the pipe rather than the message length of
|
|
zero. This could conceivably cause problems with pty pipe handling in
|
|
this scenario but since the only way this scenario could possibly happen
|
|
is when someone is writing zero bytes using WriteFile to a pty pipe, I'm
|
|
ok with that.
|
|
|
|
2012-05-14 cgf-000007
|
|
|
|
<1.7.16>
|
|
- Fix invocation of strace from a cygwin process. Fixes:
|
|
http://cygwin.com/ml/cygwin/2012-05/msg00292.html
|
|
</1.7.16>
|
|
|
|
The change in cgf-000004 introduced a problem for processes which load
|
|
cygwin1.dll dynamically. strace.exe is the most prominent example of
|
|
this.
|
|
|
|
Since the parent handle is now closed for "non-Cygwin" processes, when
|
|
strace.exe tried to dynamically load cygwin1.dll, the handle was invalid
|
|
and child_info_spawn::handle_spawn couldn't use retrieve information
|
|
from the parent. This eventually led to a strace_printf error due to an
|
|
attempt to dereference an unavailable cygheap. Probably have to fix
|
|
this someday. You shouldn't use the cygheap while attempting to print
|
|
an error about the inavailability of said cygheap.
|
|
|
|
This was fixed by saving the parent pid in child_info_spawn and calling
|
|
OpenProcess for the parent pid and using that handle iff a process is
|
|
dynamically loaded.
|
|
|
|
2012-05-12 cgf-000006
|
|
|
|
<1.7.16>
|
|
- Fix hang when calling pthread_testcancel in a canceled thread.
|
|
Fixes some of: http://cygwin.com/ml/cygwin/2012-05/msg00186.html
|
|
</1.7.16>
|
|
|
|
This should fix the first part of the reported problem in the above
|
|
message. The cancel seemed to actually be working but, the fprintf
|
|
eventually ended up calling pthread_testcancel. Since we'd gotten here
|
|
via a cancel, it tried to recursively call the cancel handler causing a
|
|
recursive loop.
|
|
|
|
2012-05-12 cgf-000005
|
|
|
|
<1.7.16>
|
|
- Fix pipe creation problem which manifested as a problem creating a
|
|
fifo. Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00253.html
|
|
</1.7.16>
|
|
|
|
My change on 2012-04-28 introduced a problem with fifos. The passed
|
|
in name was overwritten. This was because I wasn't properly keeping
|
|
track of the length of the generated pipe name when there was a
|
|
name passed in to fhandler_pipe::create.
|
|
|
|
There was also another problem in fhandler_pipe::create. Since fifos
|
|
use PIPE_ACCESS_DUPLEX and PIPE_ACCESS_DUPLEX is an or'ing of
|
|
PIPE_ACCESS_INBOUND and PIPE_ACCESS_OUTBOUND, using PIPE_ACCESS_OUTBOUND
|
|
as a "never-used" option for PIPE_ADD_PID in fhandler.h was wrong. So,
|
|
fifo creation attempted to add the pid of a pipe to the name which is
|
|
wrong for fifos.
|
|
|
|
2012-05-08 cgf-000004
|
|
|
|
The change for cgf-000003 introduced a new problem:
|
|
http://cygwin.com/ml/cygwin/2012-05/msg00154.html
|
|
http://cygwin.com/ml/cygwin/2012-05/msg00157.html
|
|
|
|
Since a handle associated with the parent is no longer being duplicated
|
|
into a non-cygwin "execed child", Windows is free to reuse the pid of
|
|
the parent when the parent exits. However, since we *did* duplicate a
|
|
handle pointing to the pid's shared memory area into the "execed child",
|
|
the shared memory for the pid was still active.
|
|
|
|
Since the shared memory was still available, if a new process reuses the
|
|
previous pid, Cygwin would detect that the shared memory was not created
|
|
and had a "PID_REAPED" flag. That was considered an error, and, so, it
|
|
would set procinfo to NULL and pinfo::thisproc would die since this
|
|
situation is not supposed to occur.
|
|
|
|
I fixed this in two ways:
|
|
|
|
1) If a shared memory region has a PID_REAPED flag then zero it and
|
|
reuse it. This should be safe since you are not really supposed to be
|
|
querying the shared memory region for anything after PID_REAPED has been
|
|
set.
|
|
|
|
2) Forego duping a copy of myself_pinfo if we're starting a non-cygwin
|
|
child for exec.
|
|
|
|
It seems like 2) is a common theme and an audit of all of the handles
|
|
that are being passed to non-cygwin children is in order for 1.7.16.
|
|
|
|
The other minor modification that was made in this change was to add the
|
|
pid of the failing process to fork error output. This helps slightly
|
|
when looking at strace output, even though in this case it was easy to
|
|
find what was failing by looking for '^---' when running the "stv"
|
|
strace dumper. That found the offending exception quickly.
|
|
|
|
2012-05-07 cgf-000003
|
|
|
|
<1.7.15>
|
|
Don't make Cygwin wait for all children of a non-cygwin child program.
|
|
Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00063.html,
|
|
http://cygwin.com/ml/cygwin/2012-05/msg00075.html
|
|
</1.7.15>
|
|
|
|
This problem is due to a recent change which added some robustness and
|
|
speed to Cygwin's exec/spawn handling by not trying to force inheritance
|
|
every time a process is started. See ChangeLog entries starting on
|
|
2012-03-20, and multiple on 2012-03-21.
|
|
|
|
Making the handle inheritable meant that, as usual, there were problems
|
|
with non-Cygwin processes. When Cygwin "execs" a non-Cygwin process N,
|
|
all of its N + 1, N + 2, ... children will also inherit the handle.
|
|
That means that Cygwin will wait until all subprocesses have exited
|
|
before it returns.
|
|
|
|
I was willing to make this a restriction of starting non-Cygwin
|
|
processes but the problem with allowing that is that it can cause the
|
|
creation of a "limbo" pid when N exits and N + 1 and friends are still
|
|
around. In this scenario, Cygwin dutifully notices that process N has
|
|
died and sets the exit code to indicate that but N's parent will wait on
|
|
rd_proc_pipe and will only return when every N + ... windows process
|
|
has exited.
|
|
|
|
The removal of cygheap::pid_handle was not related to the initial
|
|
problem that I set out to fix. The change came from the realization
|
|
that we were duping the current process handle into the child twice and
|
|
only needed to do it once. The current process handle is used by exec
|
|
to keep the Windows pid "alive" so that it will not be reused. So, now
|
|
we just close parent in child_info_spawn::handle_spawn iff we're not
|
|
execing.
|
|
|
|
In debugging this it bothered me that 'ps' identified a nonactive pid as
|
|
active. Part of the reason for this was the 'parent' handle in
|
|
child_info was opened in non-Cygwin processes, keeping the pid alive.
|
|
That has been kluged around (more changes after 1.7.15) but that didn't
|
|
fix the problem. On further investigation, this seems to be caused by
|
|
the fact that the shared memory region pid handles were still being
|
|
passed to non-cygwin children, keeping the pid alive in a limbo-like
|
|
fashion. This was easily fixed by having pinfo::init() consider a
|
|
memory region with PID_REAPED as not available. A more robust fix
|
|
should be considered for 1.7.15+ where these handles are not passed
|
|
to non-cygwin processes.
|
|
|
|
This fixed the problem where a pid showed up in the list after a user
|
|
does something like: "bash$ cmd /c start notepad" but, for some reason,
|
|
it does not fix the problem where "bash$ setsid cmd /c start notepad".
|
|
That bears investigation after 1.7.15 is released but it is not a
|
|
regression and so is not a blocker for the release.
|
|
|
|
2012-05-03 cgf-000002
|
|
|
|
<1.7.15>
|
|
Fix problem where too much input was attempted to be read from a
|
|
pty slave. Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00049.html
|
|
</1.7.15>
|
|
|
|
My change on 2012/04/05 reintroduced the problem first described by:
|
|
http://cygwin.com/ml/cygwin/2011-10/threads.html#00445
|
|
|
|
The problem then was, IIRC, due to the fact that bytes sent to the pty
|
|
pipe were not written as records. Changing pipe to PIPE_TYPE_MESSAGE in
|
|
pipe.cc fixed the problem since writing lines to one side of the pipe
|
|
caused exactly that the number of characters to be read on the other
|
|
even if there were more characters in the pipe.
|
|
|
|
To debug this, I first replaced fhandler_tty.cc with the 1.258,
|
|
2012/04/05 version. The test case started working when I did that.
|
|
|
|
So, then, I replaced individual functions, one at a time, in
|
|
fhandler_tty.cc with their previous versions. I'd expected this to be a
|
|
problem with fhandler_pty_master::process_slave_output since that had
|
|
seen the most changes but was surprised to see that the culprit was
|
|
fhandler_pty_slave::read().
|
|
|
|
The reason was that I really needed the bytes_available() function to
|
|
return the number of bytes which would be read in the next operation
|
|
rather than the number of bytes available in the pipe. That's because
|
|
there may be a number of lines available to be read but the number of
|
|
bytes which will be read by ReadFile should reflect the mode of the pty
|
|
and, if there is a line to read, only the number of bytes in the line
|
|
should be seen as available for the next read.
|
|
|
|
Having bytes_available() return the number of bytes which would be read
|
|
seemed to fix the problem but it could subtly change the behavior of
|
|
other callers of this function. However, I actually think this is
|
|
probably a good thing since they probably should have been seeing the
|
|
line behavior.
|
|
|
|
2012-05-02 cgf-000001
|
|
|
|
<1.7.15>
|
|
Fix problem setting parent pid to 1 when process with children execs
|
|
itself. Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00009.html
|
|
</1.7.15>
|
|
|
|
Investigating this problem with strace showed that ssh-agent was
|
|
checking the parent pid and getting a 1 when it shouldn't have. Other
|
|
stuff looked ok so I chose to consider this a smoking gun.
|
|
|
|
Going back to the version that the OP said did not have the problem, I
|
|
worked forward until I found where the problem first occurred -
|
|
somewhere around 2012-03-19. And, indeed, the getppid call returned the
|
|
correct value in the working version. That means that this stopped
|
|
working when I redid the way the process pipe was inherited around
|
|
this time period.
|
|
|
|
It isn't clear why (and I suspect I may have to debug this further at
|
|
some point) this hasn't always been a problem but I made the obvious fix.
|
|
We shouldn't have been setting ppid = 1 when we're about to pass off to
|
|
an execed process.
|
|
|
|
As I was writing this, I realized that it was necessary to add some
|
|
additional checks. Just checking for "have_execed" isn't enough. If
|
|
we've execed a non-cygwin process then it won't know how to deal with
|
|
any inherited children. So, always set ppid = 1 if we've execed a
|
|
non-cygwin process.
|