Created on 2012-03-16 05:52 by patena, last changed 2022-04-11 14:57 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| issue14332.patch | albamagallanes, 2014-03-17 03:34 | patch for the bug |
| issue14332_2.patch | albamagallanes, 2014-03-18 05:55 | Patch for 14332 bug with no references |
| 14332.patch | akuchling, 2014-03-18 23:20 | |
| Messages (14) | |||
|---|---|---|---|
| msg155992 | Author: Weronika Patena (patena) | Date: 2012-03-16 05:52 | |
According to difflib.ndiff help, the optional linejunk argument is "A function that should accept a single string argument, and return true iff the string is junk." Presumably the point is to ignore the junk lines in the comparison. But the function doesn't appear to actually do this - in fact I haven't been able to make the linejunk argument change the output in any way.
Expected difflib.ndiff behavior with no linejunk argument given:
>>> import difflib
>>> test_lines_1 = ['# something\n', 'real data\n']
>>> test_lines_2 = ['# something else\n', 'real data\n']
>>> print ''.join(difflib.ndiff(test_lines_1, test_lines_2))
- # something
+ # something else
?            +++++
  real data
Now I'm providing a linejunk function to ignore all lines starting with '#', but the output is still the same:
>>> print ''.join(difflib.ndiff(test_lines_1, test_lines_2,
...                             linejunk=lambda line: line.startswith('#')))
- # something
+ # something else
?            +++++
  real data
In fact if I make linejunk always return True (or False), nothing changes either:
>>> print ''.join(difflib.ndiff(test_lines_1, test_lines_2,
...                             linejunk=lambda line: True))
- # something
+ # something else
?            +++++
  real data
It certainly looks like an error, although it's possible that I'm just misunderstanding how this should work.
I'm using Python 2.6.5, on Ubuntu Linux 10.04.
|
|||
| msg156046 | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2012-03-16 14:50 | |
Unfortunately Python 2.6 only gets fixes for security bugs now, not regular bugs. Can you reproduce the problem with 2.7 or 3.2? |
|||
| msg156062 | Author: Terry J. Reedy (terry.reedy) * (Python committer) | Date: 2012-03-16 18:01 | |
I reproduced the observed behavior in 3.3.0a. However, I am rather sure it is not a bug. In any case, linejunk is not ignored. Passing 'lambda x: 1/0' causes ZeroDivisionError, proving that it gets called.
The body of ndiff(a, b, linejunk, charjunk) is
return Differ(linejunk, charjunk).compare(a, b)
Differ only uses the linejunk parameter here
cruncher = SequenceMatcher(self.linejunk, a, b)
SequenceMatcher uses the first parameter, isjunk, in the internal .__chain_b method to segregate (not remove) items expected to be common in order to speed up the .find_longest_match method. Read the docstring for that method (and possibly the code) to see how it affects matching.
The main intent of the *junk parameters is to speed up matching to find differences, not to mask differences. It does, however, affect output of the .*ratio methods.
The doc string for ndiff says "The default is None, and is recommended; as of Python 2.3, an adaptive notion of "noise" lines is used that does a good job on its own." That is a good idea.
That said, I think the doc (and docstrings) should explain the notion of "junk" elements and what 'ignoring' them means. In particular, I think a couple of sentences should be added after "The idea is to find the longest contiguous matching subsequence that contains no "junk" elements (the Ratcliff and Obershelp algorithm doesn’t address junk)." The quotes around "junk" indicate that it is being used with a non-standard, module-specific meaning. What is it? And what does 'ignore' (used several times later in the doc) mean?
Tim, I think we may need your help here since 'junk' is your label for your concept and I am not sure I understand it well enough to articulate it. (For one thing, given that the "common" heuristic was apparently meant to replace at least the linejunk version of junk, I do not understand why .find_longest_match treats 'junk' and 'common' items differently, other than that the two concepts are apparently not the same.) |
|||
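The two observations above (linejunk is called, yet the ndiff output is unchanged) can be checked with a short script. This is a minimal sketch written for Python 3, not code from the thread; the variable names `linejunk_called`, `plain` and `with_junk` are made up for illustration.

```python
import difflib

a = ['# something\n', 'real data\n']
b = ['# something else\n', 'real data\n']

# ndiff returns a generator, so it has to be consumed before linejunk runs.
try:
    list(difflib.ndiff(a, b, linejunk=lambda line: 1 / 0))
    linejunk_called = False
except ZeroDivisionError:
    linejunk_called = True

# Same inputs, with and without a linejunk predicate for '#' lines.
plain = list(difflib.ndiff(a, b))
with_junk = list(difflib.ndiff(a, b, linejunk=lambda line: line.startswith('#')))

print(linejunk_called)     # True
print(plain == with_junk)  # True
```

For this input the predicate is invoked but the differing '#' lines are still reported, which matches the behavior described in the original report.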
| msg156066 | Author: Weronika Patena (patena) | Date: 2012-03-16 18:15 | |
Ah, I see. True, the ndiff docstring doesn't actually explain what junk IS - I was just engaging in wishful thinking and assuming it did the thing I wanted. A better explanation would help. |
|||
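For readers who want what the reporter was hoping for, namely leaving certain lines out of the comparison altogether, one possible workaround is to filter both inputs before calling ndiff. This was not proposed in the thread; it is a sketch in Python 3, and `ndiff_ignoring` is a hypothetical helper name, not part of difflib.

```python
import difflib

def ndiff_ignoring(a, b, is_ignored):
    """Hypothetical helper: drop lines where is_ignored(line) is true, then diff."""
    kept_a = [line for line in a if not is_ignored(line)]
    kept_b = [line for line in b if not is_ignored(line)]
    return list(difflib.ndiff(kept_a, kept_b))

test_lines_1 = ['# something\n', 'real data\n']
test_lines_2 = ['# something else\n', 'real data\n']

result = ndiff_ignoring(test_lines_1, test_lines_2,
                        lambda line: line.startswith('#'))
print(''.join(result), end='')  # prints "  real data"
```

The trade-off is that the filtered lines vanish from the output entirely, so line positions in the diff no longer correspond to the original inputs.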
| msg165502 | Author: Eli Bendersky (eli.bendersky) * (Python committer) | Date: 2012-07-15 03:56 | |
ping |
|||
| msg165506 | Author: Terry J. Reedy (terry.reedy) * (Python committer) | Date: 2012-07-15 05:21 | |
I guess I should try to come up with something that is an improvement, even if not perfect. |
|||
| msg165677 | Author: Eli Bendersky (eli.bendersky) * (Python committer) | Date: 2012-07-17 03:37 | |
I agree. Any improvement is preferred over just letting this decay in the issue tracker ;-) |
|||
| msg199201 | Author: Eli Bendersky (eli.bendersky) * (Python committer) | Date: 2013-10-08 13:12 | |
Tim, any suggestions? |
|||
| msg213789 | Author: Alba Magallanes (albamagallanes) | Date: 2014-03-17 03:34 | |
I would like to help with this issue. I'm attaching a patch for it. |
|||
| msg213945 | Author: Alba Magallanes (albamagallanes) | Date: 2014-03-18 05:55 | |
I removed the references to 2.x versions. |
|||
| msg214037 | Author: A.M. Kuchling (akuchling) * (Python committer) | Date: 2014-03-18 23:20 | |
Thanks for your patch! I took it and added some more text describing what junk is, and clarifying that junk affects what's matched but doesn't cause any differences to be ignored. |
|||
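The point that junk changes what gets matched without hiding any difference can be illustrated with SequenceMatcher directly. The sketch below (Python 3) reuses the strings from the difflib documentation's find_longest_match example; the variable names are illustrative only.

```python
from difflib import SequenceMatcher

a, b = " abcd", "abcd abcd"

# Without junk, the longest match includes the leading space of a.
no_junk = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))

# With spaces declared junk, the match is anchored on "abcd" instead.
matcher = SequenceMatcher(lambda ch: ch == " ", a, b)
space_junk = matcher.find_longest_match(0, len(a), 0, len(b))

# The junk space is not dropped from the result: it shows up as a deletion.
opcodes = matcher.get_opcodes()

print(tuple(no_junk))     # (0, 4, 5)
print(tuple(space_junk))  # (1, 0, 4)
print(opcodes[0])         # ('delete', 0, 1, 0, 0)
```

Junk steers which block is chosen as the match, but every element of both sequences is still accounted for in the opcodes.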
| msg214060 | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2014-03-19 06:49 | |
amk, if you’re satisfied with your patch, I think you can go ahead and commit it. |
|||
| msg214089 | Author: Eli Bendersky (eli.bendersky) * (Python committer) | Date: 2014-03-19 12:46 | |
Revised patch LGTM. |
|||
| msg214133 | Author: Roundup Robot (python-dev) (Python triager) | Date: 2014-03-19 20:44 | |
New changeset 0a69b1e8b7fe by Andrew Kuchling in branch 'default':
#14332: provide a better explanation of junk in difflib docs
http://hg.python.org/cpython/rev/0a69b1e8b7fe |
|||
| History | | | |
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:28 | admin | set | github: 58540 |
| 2014-03-19 20:45:07 | akuchling | set | status: open -> closed; resolution: fixed; stage: patch review -> resolved |
| 2014-03-19 20:44:25 | python-dev | set | nosy: + python-dev; messages: + msg214133 |
| 2014-03-19 12:46:43 | eli.bendersky | set | messages: + msg214089 |
| 2014-03-19 06:49:18 | eric.araujo | set | messages: + msg214060; versions: + Python 3.4, - Python 3.2 |
| 2014-03-18 23:20:13 | akuchling | set | files: + 14332.patch; nosy: + akuchling; messages: + msg214037; stage: needs patch -> patch review |
| 2014-03-18 05:55:31 | albamagallanes | set | files: + issue14332_2.patch; messages: + msg213945 |
| 2014-03-17 03:34:11 | albamagallanes | set | files: + issue14332.patch; nosy: + albamagallanes; messages: + msg213789; keywords: + patch |
| 2013-10-08 13:12:19 | eli.bendersky | set | messages: + msg199201 |
| 2012-07-17 03:37:42 | eli.bendersky | set | messages: + msg165677 |
| 2012-07-15 05:21:23 | terry.reedy | set | messages: + msg165506 |
| 2012-07-15 03:56:04 | eli.bendersky | set | messages: + msg165502 |
| 2012-03-16 18:15:00 | patena | set | messages: + msg156066 |
| 2012-03-16 18:01:24 | terry.reedy | set | assignee: docs@python; components: + Documentation, - Library (Lib); title: difflib.ndiff appears to ignore linejunk argument -> Better explain "junk" concept in difflib doc; nosy: + eli.bendersky, docs@python; versions: + Python 2.7, Python 3.2, Python 3.3; messages: + msg156062; stage: needs patch |
| 2012-03-16 14:50:13 | eric.araujo | set | nosy: + eric.araujo, terry.reedy; messages: + msg156046; versions: - Python 2.6 |
| 2012-03-16 05:52:05 | patena | create | |