
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Race condition in WatchedFileHandler leads to unhandled exception
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: vinay.sajip Nosy List: brian.curtin, neologix, phlogistonjohn, python-dev, vinay.sajip
Priority: normal Keywords: patch

Created on 2012年04月20日 13:57 by phlogistonjohn, last changed 2022年04月11日 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
test.py phlogistonjohn, 2012年04月20日 13:57
watchedfilehandler.patch phlogistonjohn, 2012年04月20日 15:10 patch
watchedfilehandler.patch phlogistonjohn, 2012年04月22日 18:41 patch, 2nd try
test_logging_race.diff neologix, 2012年04月24日 18:55 review
Messages (20)
msg158829 - (view) Author: JohnM (phlogistonjohn) Date: 2012年04月20日 13:57
The (very handy) logging.handlers.WatchedFileHandler appears to be susceptible to a race condition that causes occasional OSErrors during the normal course of operations:
Traceback (most recent call last):
 File "test.py", line 58, in test_race
 log.warning('Foo bar %d', ii)
 File "/usr/lib64/python2.7/logging/__init__.py", line 1144, in warning
 self._log(WARNING, msg, args, **kwargs)
 File "/usr/lib64/python2.7/logging/__init__.py", line 1250, in _log
 self.handle(record)
 File "/usr/lib64/python2.7/logging/__init__.py", line 1260, in handle
 self.callHandlers(record)
 File "/usr/lib64/python2.7/logging/__init__.py", line 1300, in callHandlers
 hdlr.handle(record)
 File "/usr/lib64/python2.7/logging/__init__.py", line 744, in handle
 self.emit(record)
 File "/usr/lib64/python2.7/logging/handlers.py", line 412, in emit
 stat = os.stat(self.baseFilename)
OSError: [Errno 2] No such file or directory: '/tmp/tmpG6_luRTestRaceWatchedFileHandler'
Looking at the implementation of the handler's emit function, I've commented on where I think the race can occur. Logrotate (or some similar application) is deleting/unlinking the logfile between the first os.path.exists and os.stat calls:
def emit(self, record):
    """
    Emit a record.
    First check if the underlying file has changed, and if it
    has, close the old stream and reopen the file to get the
    current stream.
    """
    if not os.path.exists(self.baseFilename):
        # ^^^ race occurs between here
        stat = None
        changed = 1
    else:
        stat = os.stat(self.baseFilename)
        # ^^^ and this stat call
        changed = (stat[ST_DEV] != self.dev) or (stat[ST_INO] != self.ino)
    if changed and self.stream is not None:
        self.stream.flush()
        self.stream.close()
        self.stream = self._open()
        if stat is None:
            stat = os.stat(self.baseFilename)
        self.dev, self.ino = stat[ST_DEV], stat[ST_INO]
    logging.FileHandler.emit(self, record)
I've attached a test script that attempts to recreate the issue. On my Linux system, running the script triggers the issue in about 7 out of 10 runs.
msg158832 - (view) Author: JohnM (phlogistonjohn) Date: 2012年04月20日 15:10
I'm attaching a patch with an emit function implementation that reduces the number of stats and uses fstat where we can. Using fstat should also prevent any races between opening the new file and calling stat on the path in the file system.
Using this patch, or something along these lines, should avoid the time-of-check-to-time-of-use issues that the current version has with its repeated stat calls.
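For reference, here is a minimal sketch of the fstat-based idea described above (illustrative only, not the attached patch; the class name and helper are hypothetical). The identity of the currently open file is captured with os.fstat(), and os.stat() on the path is only used to detect rotation, with ENOENT treated as "file changed" rather than an error:

import errno
import logging
import os
from stat import ST_DEV, ST_INO

class FstatWatchedFileHandler(logging.FileHandler):
    """Hypothetical handler sketch; avoids the exists()/stat() race."""

    def __init__(self, filename, mode='a', encoding=None, delay=False):
        logging.FileHandler.__init__(self, filename, mode, encoding, delay)
        self.dev, self.ino = -1, -1
        self._statstream()

    def _statstream(self):
        # Remember the device/inode of the file we actually have open.
        if self.stream is not None:
            sres = os.fstat(self.stream.fileno())
            self.dev, self.ino = sres[ST_DEV], sres[ST_INO]

    def emit(self, record):
        try:
            # A missing path just means the file was rotated/unlinked.
            sres = os.stat(self.baseFilename)
        except OSError as e:
            if e.errno != errno.ENOENT:
                raise
            sres = None
        if not sres or sres[ST_DEV] != self.dev or sres[ST_INO] != self.ino:
            if self.stream is not None:
                self.stream.flush()
                self.stream.close()
                self.stream = self._open()
                self._statstream()
        logging.FileHandler.emit(self, record)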
msg158915 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012年04月21日 11:15
Thanks for the report and patch.
I'm making Vinay nosy, as he's the logging expert.
A couple remarks:
- it would be nice to add your testcase to the logging tests (Lib/test/test_logging.py)
- I think you could probably save the whole fstat() result (instead of st_dev and st_ino): this would allow you to use os.path.samestat() to check whether we're still referring to the same file, and it could maybe be useful someday (maybe use st_mtime) -- see the sketch after this list
- typo: "existance" -> "existence"
- I see you're catching OSError when calling fstat(): however, I don't see what error could be raised apart from EBADF (if the FD is closed, then it's likely a bug that shouldn't be masked, and it will blow up on the next write()), or I/O errors, that shouldn't be silenced
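As a hypothetical illustration of the second point, a helper along these lines could compare the full stat results via os.path.samestat() (available on POSIX) instead of checking st_dev/st_ino by hand; the function name here is invented for the example:

import os

def still_same_file(stream, path):
    """Return True if `path` still refers to the file `stream` has open.

    Hypothetical helper: compares whole stat results with os.path.samestat().
    """
    try:
        path_stat = os.stat(path)
    except OSError:
        return False  # path is gone (or unreadable): treat as changed
    return os.path.samestat(os.fstat(stream.fileno()), path_stat)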
msg158976 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012年04月22日 17:47
There's certainly a problem to be addressed here, but integrating the test into the regular test suite represents a problem in that the test takes around 10 minutes to run, and sometimes completes successfully. I'm not sure a 10-minute extension of the time taken to run regrtest will be acceptable, but if the test is skipped unless explicitly enabled, then I'm not sure it'll add much value.
I will look at the patch and Charles-François' review comments soon.
msg158979 - (view) Author: JohnM (phlogistonjohn) Date: 2012年04月22日 18:41
Thank you both for looking at this. I've added an updated version of the patch that incorporates the last two suggestions that Charles-François made.
I agree that this test may not be appropriate for the python test suite, due to the length and non-determinism of the test. I don't know what the suite's policy on monkey patching stdlib in the tests is, but monkey patching either os.path.exists or os.stat to remove the file at the appropriate time could be one way to make the test fast and deterministic. Seems a bit dirty to me though. :-)
 
I am up for adding some kind of test to the suite for the WatchedFileHandler, though.
I'm pretty ambivalent about keeping the whole stat structure around, since the samestat method is just a wrapper around what the emit function is already doing, and we'd be keeping more memory (although a small amount) around. I doubt we'd want to look at the timestamps because they could legitimately change in ways this check doesn't care about.
msg158980 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012年04月22日 18:42
> There's certainly a problem to be addressed here, but integrating the test into the regular test suite represents a problem in that the test takes around 10 minutes to run, and sometimes completes successfully. I'm not sure a 10-minute extension of the time taken to run regrtest will be acceptable, but if the test is skipped unless explicitly enabled, then I'm not sure it'll add much value.
There's no reason to wait for 10 minutes: one can reduce the number of
iterations greatly, so that it can run in a matter of seconds. Since
the test is run by all the builders multiple times per day, should the
race condition resurface, it will trigger a buildbot failure some day.
No need for it to trigger a failure every time.
msg158982 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012年04月22日 19:16
John: Thanks for the updated patch.
Charles-François: Certainly I can reduce the iterations to make the test faster. As it is, I did reproduce the failure on my dev system, but only once, after running John's test script about a dozen times: every other time, it completed successfully :-(
Isn't it simpler if I just replace the os.path.exists() calls with os.stat(), and check for ENOENT if an exception of type OSError or WindowsError occurs?
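A minimal sketch of that simpler variant (hypothetical helper name; as the next message points out, it still leaves a window between opening the file and the first stat()):

import errno
import os
from stat import ST_DEV, ST_INO

def check_for_rotation(path, dev, ino):
    """Return (stat_result_or_None, changed), replacing the os.path.exists()
    pre-check with os.stat() plus ENOENT handling."""
    try:
        sres = os.stat(path)
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise
        return None, True
    return sres, (sres[ST_DEV] != dev or sres[ST_INO] != ino)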
msg159046 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012年04月23日 17:05
> Charles-François: Certainly I can reduce the iterations to make the test
> faster. As it is, I did reproduce the failure on my dev system, but only
> once, after running John's test script about a dozen times: every other
> time, it completed successfully :-(
Just reduce the sleep times, e.g.:
"""
@@ -17,7 +17,7 @@
 os.unlink(fname)
 except OSError:
 pass
- stime = 0.04 * random.randint(0, 4)
+ stime = 0.004 * random.randint(0, 4)
 time.sleep(stime)
@@ -50,7 +50,7 @@
 log.setLevel(logging.INFO)
 for ii in range(LOGCOUNT):
- stime = 0.05 # * random.randint(0, 4)
+ stime = 0.005 # * random.randint(0, 4)
 time.sleep(stime)
 log.warning('Foo bar %d', ii)
"""
With this change, I can trigger a failure reliably in around 1s, and
my computer is rather slow.
> Isn't it simpler if I just replace the os.path.exists() calls with
> os.stat(), and check for ENOENT if an exception of type OSError or
> WindowsError occurs?
The problem is that it would leave a race window if the file is
changed between the time it's opened (I guess in
logging.FileHandler.__init__()) and the first call to stat().
John's patch is safe in this regard, thanks to fstat().
msg159070 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012年04月23日 20:02
[snip]
>With this change, I can trigger a failure reliably in around 1s, and
>my computer is rather slow.
I'm working in a VM, and although I can get John's script to fail more regularly (with the reduced timeouts and counts of 1000), a version of the test which I added to test_logging always succeeds. That code is:
  @unittest.skipUnless(threading, 'Threading required for this test.')
  def test_race(self):
    # Issue #14632 refers.
    def remove_loop(fname, tries):
      for _ in range(tries):
        try:
          os.unlink(fname)
        except OSError:
          pass
        time.sleep(0.004 * random.randint(0, 4))
    def cleanup(remover, fn, handler):
      handler.close()
      remover.join()
      if os.path.exists(fn):
        os.unlink(fn)
    fd, fn = tempfile.mkstemp('.log', 'test_logging-3-')
    os.close(fd)
    del_count = 1000
    log_count = 1000
    remover = threading.Thread(target=remove_loop, args=(fn, del_count))
    remover.daemon = True
    remover.start()
    h = logging.handlers.WatchedFileHandler(fn)
    self.addCleanup(cleanup, remover, fn, h)
    f = logging.Formatter('%(asctime)s: %(levelname)s: %(message)s')
    h.setFormatter(f)
    for _ in range(log_count):
      time.sleep(0.005)
      r = logging.makeLogRecord({'msg': 'testing' })
      h.handle(r)
I can't see why this always works, while John's script sometimes fails.
>The problem is that it would leave a race window if the file is
>changed between the time it's opened (I guess in
>logging.FileHandler.__init__()) and the first call to stat().
>John's patch is safe in this regard, thanks to fstat().
Oh, right - missed that.
msg159190 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012年04月24日 18:55
> I can't see why this always works, while John's script sometimes fails.
It does fail consistently on my machine.
I'm attaching the diff just in case.
msg159226 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012年04月24日 22:33
New changeset 15a33d7d2b50 by Vinay Sajip in branch '2.7':
Issue #14632: Updated WatchedFileHandler to deal with race condition. Thanks to John Mulligan for the problem report and patch.
http://hg.python.org/cpython/rev/15a33d7d2b50
New changeset 5de7c3d64f2a by Vinay Sajip in branch '3.2':
Issue #14632: Updated WatchedFileHandler to deal with race condition. Thanks to John Mulligan for the problem report and patch.
http://hg.python.org/cpython/rev/5de7c3d64f2a
New changeset 380821b47872 by Vinay Sajip in branch 'default':
Issue #14632: Updated WatchedFileHandler to deal with race condition. Thanks to John Mulligan for the problem report and patch.
http://hg.python.org/cpython/rev/380821b47872 
msg159228 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012年04月24日 22:42
I will leave this open for a few days and see if the buildbots throw up anything.
msg159387 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012年04月26日 15:47
Buildbots seem not to be complaining, so closing.
msg159770 - (view) Author: Brian Curtin (brian.curtin) * (Python committer) Date: 2012年05月02日 01:55
The test for this issue seems to fail about half of the time on Windows.
======================================================================
ERROR: test_race (test.test_logging.HandlerTest)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "c:\python-dev\cpython-main\lib\test\test_logging.py", line 605, in test_race
 h = logging.handlers.WatchedFileHandler(fn, delay=delay)
 File "c:\python-dev\cpython-main\lib\logging\handlers.py", line 421, in __init__
 logging.FileHandler.__init__(self, filename, mode, encoding, delay)
 File "c:\python-dev\cpython-main\lib\logging\__init__.py", line 965, in __init__
 StreamHandler.__init__(self, self._open())
 File "c:\python-dev\cpython-main\lib\logging\__init__.py", line 987, in _open
 return open(self.baseFilename, self.mode, encoding=self.encoding)
PermissionError: [Errno 13] Permission denied: 'c:\\users\\brian\\appdata\\local\\temp\\test_logging-3-lxjo5t.log'
I also get this failure when running in the VS2010 branch:
======================================================================
ERROR: test_race (test.test_logging.HandlerTest)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "c:\python-dev\porting\cpython\lib\test\test_logging.py", line 600, in cleanup
 os.unlink(fn)
PermissionError: [Error 32] The process cannot access the file because it is being used by
 another process: 'c:\\users\\brian\\appdata\\local\\temp\\test_logging3-6ujeil.log'
msg159820 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012年05月02日 21:01
Re. the error on open(), I have no idea why it's failing. That's just open, and it appears to be trying to open a file in a directory for which one would expect you to have write access.
Re. the error on unlink(), could that be an older version of the code that you're running? There was a cleanup() function, but I have removed it (and re-tested) before closing the issue.
The 3.x buildbots appear not to be showing this failure.
msg159821 - (view) Author: Brian Curtin (brian.curtin) * (Python committer) Date: 2012年05月02日 21:22
I'm seeing this with the current tip 8635825b9734.
I wouldn't trust the build slaves with a race condition test since they're incredibly slow machines, but this issue isn't about the race. That path really should be accessible so I'm not sure why it's failing. It's also failing for Richard (sbt) as noted while working on the VS2010 port.
msg159825 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012年05月03日 00:14
The only thing I can think of is that the file is kept open by something else (the other thread).
msg159829 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012年05月03日 01:19
The changeset you mentioned doesn't have a cleanup function in test_race (an older version used to) - so I'm not sure where that error is coming from.
It's going to be hard for me to reproduce this - I was finding it pretty hard to throw up the original race condition which led to the test being added. If you are getting the error repeatably, it would be really handy if you could tweak the test to generate some more diagnostic information - e.g. if the file already exists when the open fails (=> it's held open by something else).
Are you running indexing or anti-virus software on the machine where you get the failures? That can hold a file open when you're not expecting it.
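For example, a hypothetical tweak along those lines (not part of the committed test) could report whether the file already exists at the moment the open fails:

import logging.handlers
import os

def open_watched_handler_with_diagnostics(fn):
    # Hypothetical wrapper: if constructing the handler fails, report whether
    # the file already exists (which would suggest something else holds it open).
    try:
        return logging.handlers.WatchedFileHandler(fn)
    except OSError:
        print('open of %r failed; file exists: %s' % (fn, os.path.exists(fn)))
        raise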
msg159830 - (view) Author: Brian Curtin (brian.curtin) * (Python committer) Date: 2012年05月03日 01:22
I have exemptions set in AV for my dev folders for exactly that reason :)
I'll try and poke around and get more info.
msg159862 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012年05月03日 15:31
I noticed that in my cleanup code, I had the lines
h.close()
remover.join()
but it makes more sense for these to be in the opposite order. I've made that change and pushed it up (for 2.7, 3.2 and default).
History
Date User Action Args
2022年04月11日 14:57:29 admin set github: 58837
2012年05月03日 15:31:00 vinay.sajip set messages: + msg159862
2012年05月03日 01:22:52 brian.curtin set messages: + msg159830
2012年05月03日 01:19:23 vinay.sajip set messages: + msg159829
2012年05月03日 00:14:38 vinay.sajip set messages: + msg159825
2012年05月02日 21:22:14 brian.curtin set messages: + msg159821
2012年05月02日 21:01:11 vinay.sajip set messages: + msg159820
2012年05月02日 01:55:46 brian.curtin set nosy: + brian.curtin; messages: + msg159770
2012年04月26日 15:47:52 vinay.sajip set status: open -> closed; messages: + msg159387
2012年04月24日 22:42:10 vinay.sajip set resolution: fixed; messages: + msg159228
2012年04月24日 22:33:51 python-dev set nosy: + python-dev; messages: + msg159226
2012年04月24日 18:55:05 neologix set files: + test_logging_race.diff; messages: + msg159190
2012年04月23日 20:02:55 vinay.sajip set messages: + msg159070
2012年04月23日 17:05:35 neologix set messages: + msg159046
2012年04月22日 19:16:30 vinay.sajip set messages: + msg158982
2012年04月22日 18:42:18 neologix set messages: + msg158980
2012年04月22日 18:41:57 phlogistonjohn set files: + watchedfilehandler.patch; messages: + msg158979
2012年04月22日 17:47:04 vinay.sajip set messages: + msg158976
2012年04月21日 22:51:28 vinay.sajip set assignee: vinay.sajip
2012年04月21日 11:15:19 neologix set nosy: + vinay.sajip, neologix; messages: + msg158915; versions: + Python 3.2, Python 3.3
2012年04月20日 15:10:10 phlogistonjohn set files: + watchedfilehandler.patch; keywords: + patch; messages: + msg158832
2012年04月20日 13:57:07 phlogistonjohn create
