Message 224500

| Field | Value |
|---|---|
| Author | Mike.Lissner |
| Recipients | Mike.Lissner |
| Date | 2014-08-01 13:38:49 |
| SpamBayes Score | -1.0 |
| Marked as misclassified | Yes |
| Message-id | <1406900329.8.0.528379735401.issue22118@psf.upfronthosting.co.za> |
| In-reply-to | |

Content:
Not sure if this is desired behavior, but it's breaking my code, so I figured I'd get it filed.

I'm trying to crawl this website: https://www.appeals2.az.gov/ODSPlus/recentDecisions2.cfm

Unfortunately, most of the URLs in the HTML are relative, taking the form:

'../../some/path/to/some/pdf.pdf'

I'm using lxml's make_links_absolute() function, which calls urljoin, creating invalid URLs like:

https://www.appeals2.az.gov/../Decisions/CR20130096OPN.pdf

If you put that into Firefox or wget or whatever, it works, despite being invalid and making no sense. **It works because those clients fix the problem**, collapsing the stray '..' segment and joining the path and the URL into:

https://www.appeals2.az.gov/Decisions/CR20130096OPN.pdf

I know this would mean giving urljoin a workaround for bad HTML, but this seems to be what wget, Chrome, Firefox, etc. all do.

I've never filed a Python bug before, but is this something we could consider?
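The mismatch is easy to reproduce, and the browser-style fix can be sketched in a few lines. `join_and_normalize` below is a hypothetical helper, not a stdlib function: it joins with `urljoin` and then collapses any `..` segments left above the root, the way wget and the browsers do. (Note: later Python versions changed `urljoin` to follow RFC 3986's `remove_dot_segments`, so on current interpreters the extra normalization is a no-op, but the helper is still harmless.)

```python
import posixpath
from urllib.parse import urljoin, urlsplit, urlunsplit

def join_and_normalize(base, relative):
    """Hypothetical helper: join like urljoin, then drop any '..'
    segments that would climb above the root, as browsers do."""
    joined = urljoin(base, relative)
    scheme, netloc, path, query, fragment = urlsplit(joined)
    # posixpath.normpath collapses '/../' sequences; a leading '..'
    # at the root is simply discarded, e.g.
    # '/../Decisions/x.pdf' -> '/Decisions/x.pdf'.
    if path:
        path = posixpath.normpath(path)
    return urlunsplit((scheme, netloc, path, query, fragment))

base = "https://www.appeals2.az.gov/ODSPlus/recentDecisions2.cfm"
relative = "../../Decisions/CR20130096OPN.pdf"
print(join_and_normalize(base, relative))
# https://www.appeals2.az.gov/Decisions/CR20130096OPN.pdf
```

Note that `posixpath.normpath` also strips a trailing slash, so the helper is only a sketch for paths like the PDF links above, not a general-purpose URL normalizer.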
History

| Date | User | Action | Args |
|---|---|---|---|
| 2014-08-01 13:38:49 | Mike.Lissner | set | recipients: + Mike.Lissner |
| 2014-08-01 13:38:49 | Mike.Lissner | set | messageid: <1406900329.8.0.528379735401.issue22118@psf.upfronthosting.co.za> |
| 2014-08-01 13:38:49 | Mike.Lissner | link | issue22118 messages |
| 2014-08-01 13:38:49 | Mike.Lissner | create | |