This issue tracker has been migrated to GitHub ,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2014年08月01日 13:38 by Mike.Lissner, last changed 2022年04月11日 14:58 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| issue22118.patch | demian.brecht, 2014年08月06日 00:45 | review | ||
| issue22118_1.patch | demian.brecht, 2014年08月07日 03:59 | review | ||
| issue22118_2.patch | demian.brecht, 2014年08月11日 14:09 | review | ||
| Messages (21) | |||
|---|---|---|---|
| msg224500 - (view) | Author: Mike Lissner (Mike.Lissner) | Date: 2014年08月01日 13:38 | |
Not sure if this is desired behavior, but it's making my code break, so I figured I'd get it filed. I'm trying to crawl this website: https://www.appeals2.az.gov/ODSPlus/recentDecisions2.cfm Unfortunately, most of the URLs in the HTML are relative, taking the form: '../../some/path/to/some/pdf.pdf' I'm using lxml's make_links_absolute() function, which calls urljoin creating invalid urls like: https://www.appeals2.az.gov/../Decisions/CR20130096OPN.pdf If you put that into Firefox or wget or whatever, it works, despite being invalid and making no sense. **It works because those clients fix the problem,** joining the invalid path and the URL into: https://www.appeals2.az.gov/Decisions/CR20130096OPN.pdf I know this will mean making urljoin have a workaround to fix bad HTML, but this seems to be what wget, Chrome, Firefox, etc. all do. I've never filed a Python bugs before, but is this something we could consider? |
|||
| msg224503 - (view) | Author: Mike Lissner (Mike.Lissner) | Date: 2014年08月01日 13:59 | |
FWIW, the workaround that I've just created for this problem is this: u = 'https://www.appeals2.az.gov/../Decisions/CR20130096OPN.pdf' # Split the url and rejoin it, nuking any '/..' patterns at the # beginning of the path. s = urlsplit(u) urlunsplit(s[:2] + (re.sub('^(/\.\.)+', '', s.path),) + s[3:]) |
|||
| msg224864 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2014年08月05日 17:52 | |
The thing is, urljoin() isn't HTTP-specific, and such URLs *may* make sense for other protocols. |
|||
| msg224870 - (view) | Author: Mike Lissner (Mike.Lissner) | Date: 2014年08月05日 18:31 | |
@pitrou, I haven't delved into URLs in a long while, but the general idea is: scheme://domain:port/path?query_string#fragment_id When would it ever make sense to have something up a level from the root of the domain? |
|||
| msg224872 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2014年08月05日 18:38 | |
Actually, according to RFC 3986, it seems you are right and we should remove extraneous leading "/../" and "/./" components in the path. http://tools.ietf.org/html/rfc3986#section-5.4 |
|||
| msg224876 - (view) | Author: Demian Brecht (demian.brecht) * (Python triager) | Date: 2014年08月05日 18:58 | |
I've only had a couple minutes to look into this so far, but the bug does indeed seem to be valid across all versions. In fact, the line "# XXX The stuff below is bogus in various ways..." in urljoin tipped me off to something potentially being askew ;) Unless the OP wants to provide a patch, I was going to take a look at possibly updating urljoin (at least the part below that line) to follow http://tools.ietf.org/html/rfc3986#section-5.2 a little more closely. |
|||
| msg224877 - (view) | Author: Mike Lissner (Mike.Lissner) | Date: 2014年08月05日 18:59 | |
@demian.brecht, that'd make me very pleased if you took this over. I won't have time to devote to it, unfortunately. |
|||
| msg224893 - (view) | Author: Demian Brecht (demian.brecht) * (Python triager) | Date: 2014年08月06日 00:45 | |
Here's an initial patch with a fix that passes all the test cases other than the ones that are incorrect based on examples and pseudocode in RFC 3986. I haven't checked obsoleted RFCs yet to ensure consistency, but will do so when I get a chance (likely tomorrow morning). Posting this now to see if anyone's opposed to the change, or at least the direction of. Note: It /does/ break backwards compatibility, but it seems that previous logic was incorrect (based on my upcoming checking for consistency between RFCs). As such, I'm not sure that it should be fixed < 3.5. Thoughts? |
|||
| msg224896 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2014年08月06日 02:51 | |
> It /does/ break backwards compatibility, but it seems that previous > logic was incorrect (based on my upcoming checking for consistency > between RFCs). As such, I'm not sure that it should be fixed < 3.5. > Thoughts? Actually, the logic seems to be correct according to RFC 1808 (which the variable names in the tests seem to hint at, as well). I think it's fine to upgrade the semantics to newer RFCs, but we should only do it in 3.5 indeed. |
|||
| msg224984 - (view) | Author: Demian Brecht (demian.brecht) * (Python triager) | Date: 2014年08月07日 03:59 | |
Update based on review comments, added legacy support, fixed the tests up. |
|||
| msg225144 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2014年08月10日 15:50 | |
Demian, thanks for the update! I'm not sure the rfc1808 flag is necessary. I would be fine with switching wholesale to the new semantics. Nick, since you've done work in the past on URIs, what do you think? |
|||
| msg225156 - (view) | Author: Demian Brecht (demian.brecht) * (Python triager) | Date: 2014年08月10日 18:12 | |
FWIW, I think that it would be ever so slightly preferable to maintain support for the old behaviour but perhaps add a deprecation note in the docs. It's a little difficult to tell how much of an impact this change will have to existing code. Maintaining support for the old behaviour for 3.5 would make the change a little more easy to deal with. That said, it's nothing that I'd argue heatedly over. |
|||
| msg225173 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2014年08月10日 23:38 | |
Unfortunately, I haven't looked at the RFC compliance side of things in this space in years. At the time, we were considering a new module, leaving the old one alone. These days, I'd be more inclined to just fix it for 3.5, and suggest anyone really needing the old behaviour fork an older version of the module. Extracting the test case inputs from that old "uriparse" rewrite might still be a good idea, though. |
|||
| msg225186 - (view) | Author: Demian Brecht (demian.brecht) * (Python triager) | Date: 2014年08月11日 14:09 | |
Uploaded new patch. Removed support for RFC1808-specific behaviour. Extracted non-compliant tests into comment blocks indicating the behaviour is no longer supported. |
|||
| msg225187 - (view) | Author: Mike Lissner (Mike.Lissner) | Date: 2014年08月11日 14:14 | |
Just hopping in here to say that the work going down here is beautiful. I've filed a lot of bugs. This one's not particularly difficult, but damn, I appreciate the speed and quality going into fixing it. Glad to see the Python language is a happy place with fast, quality fixes. |
|||
| msg225191 - (view) | Author: Demian Brecht (demian.brecht) * (Python triager) | Date: 2014年08月11日 14:48 | |
Thanks Mike, it's always nice to get positive feedback :) |
|||
| msg225619 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2014年08月21日 23:16 | |
New changeset b116489d31ff by Antoine Pitrou in branch 'default': Issue #22118: Switch urllib.parse to use RFC 3986 semantics for the resolution of relative URLs, rather than RFCs 1808 and 2396. http://hg.python.org/cpython/rev/b116489d31ff |
|||
| msg225620 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2014年08月21日 23:17 | |
The patch is now committed to the future Python 3.5. Thank you very much for this contribution! |
|||
| msg225661 - (view) | Author: Stefan Behnel (scoder) * (Python committer) | Date: 2014年08月22日 11:43 | |
I'm now getting duplicated slashes in URLs, e.g.: https://new//foo.html http://my.little.server/url//logo.gif In both cases, the base URL that gets joined with the postfix had a trailing slash, e.g. "http://my.little.server/url/" + "logo.gif" -> "http://my.little.server/url//logo.gif" |
|||
| msg225662 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2014年08月22日 12:15 | |
Issue #1500504 (the "urischemes" proposal that never got turned into a PyPI package) has several additional test cases. This particular attachment is one AMK cleaned up to run on Py3k: http://bugs.python.org/file32591/urischemes.py The module itself likely isn't worth salvaging, but the additional examples in the test cases were very useful in flushing out various RFC compliance issues. |
|||
| msg227253 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2014年09月22日 07:49 | |
New changeset 901e4e52b20a by Senthil Kumaran in branch 'default': Issue #22278: Fix urljoin problem with relative urls, a regression observed https://hg.python.org/cpython/rev/901e4e52b20a |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:58:06 | admin | set | github: 66316 |
| 2014年09月22日 07:49:30 | python-dev | set | messages: + msg227253 |
| 2014年08月22日 12:15:47 | ncoghlan | set | messages: + msg225662 |
| 2014年08月22日 11:43:50 | scoder | set | nosy:
+ scoder messages: + msg225661 |
| 2014年08月21日 23:17:52 | pitrou | set | status: open -> closed resolution: fixed messages: + msg225620 stage: patch review -> resolved |
| 2014年08月21日 23:16:47 | python-dev | set | nosy:
+ python-dev messages: + msg225619 |
| 2014年08月11日 14:49:00 | demian.brecht | set | messages: + msg225191 |
| 2014年08月11日 14:14:34 | Mike.Lissner | set | messages: + msg225187 |
| 2014年08月11日 14:10:00 | demian.brecht | set | files:
+ issue22118_2.patch messages: + msg225186 |
| 2014年08月10日 23:38:13 | ncoghlan | set | messages: + msg225173 |
| 2014年08月10日 18:12:39 | demian.brecht | set | messages: + msg225156 |
| 2014年08月10日 15:50:56 | pitrou | set | messages: + msg225144 |
| 2014年08月07日 03:59:31 | demian.brecht | set | files:
+ issue22118_1.patch messages: + msg224984 |
| 2014年08月06日 02:51:24 | pitrou | set | stage: needs patch -> patch review messages: + msg224896 versions: - Python 2.7, Python 3.4 |
| 2014年08月06日 00:45:10 | demian.brecht | set | files:
+ issue22118.patch keywords: + patch messages: + msg224893 |
| 2014年08月05日 18:59:45 | Mike.Lissner | set | messages: + msg224877 |
| 2014年08月05日 18:58:23 | demian.brecht | set | messages: + msg224876 |
| 2014年08月05日 18:40:24 | pitrou | set | versions: + Python 3.4, Python 3.5 |
| 2014年08月05日 18:38:20 | pitrou | set | nosy:
+ ncoghlan messages: + msg224872 |
| 2014年08月05日 18:31:26 | Mike.Lissner | set | messages: + msg224870 |
| 2014年08月05日 17:52:50 | pitrou | set | nosy:
+ pitrou messages: + msg224864 |
| 2014年08月05日 17:49:34 | demian.brecht | set | nosy:
+ demian.brecht |
| 2014年08月05日 05:41:46 | ezio.melotti | set | keywords:
+ easy nosy: + orsenthil, ezio.melotti stage: needs patch |
| 2014年08月01日 13:59:11 | Mike.Lissner | set | messages: + msg224503 |
| 2014年08月01日 13:38:49 | Mike.Lissner | create | |