Message 224500

| Field | Value |
|---|---|
| Author | Mike.Lissner |
| Recipients | Mike.Lissner |
| Date | 2014-08-01 13:38:49 |
| SpamBayes Score | -1.0 |
| Marked as misclassified | Yes |
| Message-id | <1406900329.8.0.528379735401.issue22118@psf.upfronthosting.co.za> |
| In-reply-to | |

Content:
Not sure if this is desired behavior, but it's breaking my code, so I figured I'd get it filed.

I'm trying to crawl this website: https://www.appeals2.az.gov/ODSPlus/recentDecisions2.cfm

Unfortunately, most of the URLs in the HTML are relative, taking the form:

'../../some/path/to/some/pdf.pdf'

I'm using lxml's make_links_absolute() function, which calls urljoin, creating invalid URLs like:

https://www.appeals2.az.gov/../Decisions/CR20130096OPN.pdf

If you put that into Firefox or wget or whatever, it works, despite being invalid and making no sense. **It works because those clients fix the problem**, collapsing the stray '..' segment and joining the path and the URL into:

https://www.appeals2.az.gov/Decisions/CR20130096OPN.pdf

I know this would mean giving urljoin a workaround for bad HTML, but this seems to be what wget, Chrome, Firefox, etc. all do.

I've never filed a Python bug before, but is this something we could consider?
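The mismatch is easy to reproduce, and the browser-style fix can be sketched in a few lines. `join_and_normalize` below is a hypothetical helper, not a stdlib function: it joins with `urljoin` and then collapses any `..` segments left above the root, the way wget and the browsers do. (Note: later Python versions changed `urljoin` to follow RFC 3986's `remove_dot_segments`, so on current interpreters the extra normalization is a no-op, but the helper is still harmless.)

```python
import posixpath
from urllib.parse import urljoin, urlsplit, urlunsplit

def join_and_normalize(base, relative):
    """Hypothetical helper: join like urljoin, then drop any '..'
    segments that would climb above the root, as browsers do."""
    joined = urljoin(base, relative)
    scheme, netloc, path, query, fragment = urlsplit(joined)
    # posixpath.normpath collapses '/../' sequences; a leading '..'
    # at the root is simply discarded, e.g.
    # '/../Decisions/x.pdf' -> '/Decisions/x.pdf'.
    if path:
        path = posixpath.normpath(path)
    return urlunsplit((scheme, netloc, path, query, fragment))

base = "https://www.appeals2.az.gov/ODSPlus/recentDecisions2.cfm"
relative = "../../Decisions/CR20130096OPN.pdf"
print(join_and_normalize(base, relative))
# https://www.appeals2.az.gov/Decisions/CR20130096OPN.pdf
```

Note that `posixpath.normpath` also strips a trailing slash, so the helper is only a sketch for paths like the PDF links above, not a general-purpose URL normalizer.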
History

| Date | User | Action | Args |
|---|---|---|---|
| 2014-08-01 13:38:49 | Mike.Lissner | set | recipients: + Mike.Lissner |
| 2014-08-01 13:38:49 | Mike.Lissner | set | messageid: <1406900329.8.0.528379735401.issue22118@psf.upfronthosting.co.za> |
| 2014-08-01 13:38:49 | Mike.Lissner | link | issue22118 messages |
| 2014-08-01 13:38:49 | Mike.Lissner | create | |