Message 265900

| Field | Value |
|---|---|
| Author | nagle |
| Recipients | nagle |
| Date | 2016-05-19 23:21:48 |
| SpamBayes Score | -1.0 |
| Marked as misclassified | Yes |
| Message-id | <1463700108.4.0.745938506232.issue27065@psf.upfronthosting.co.za> |
| In-reply-to | |

Content
"robotparser" uses the default Python user agent when reading the "robots.txt" file, and there is no parameter for changing it.

Unfortunately, the "mod_security" add-on for the Apache web server, when used with the standard OWASP rule set, blacklists the default Python user agent in Rule 990002, "User Agent Identification". That rule rejects certain HTTP USER-AGENT values, one of which is "python-httplib2". So any Python program that accesses such a site will trigger the rule and be blocked from access.

For regular HTTP accesses, it's possible to work around this by putting a user agent string in the Request object. But "robotparser" has no such option.

Worse, if "robotparser" has its read of "robots.txt" rejected, it interprets that as a "deny all" robots.txt file and returns False for all "can_fetch()" requests.
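A minimal sketch of that per-request workaround for ordinary fetches; the URL and agent string below are placeholders, not values from this report:

```python
import urllib.request

# Hypothetical crawler identity; any non-blacklisted string would do.
req = urllib.request.Request(
    "http://example.com/",
    headers={"User-Agent": "MyCrawler/1.0 (+http://example.com/bot)"})

# The request now carries the custom header instead of the default
# "Python-urllib/x.y" agent, so a fetch would look like:
#   with urllib.request.urlopen(req) as resp:
#       body = resp.read()
```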
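As an illustration of what a workaround might look like (not a stdlib feature), here is a sketch of a RobotFileParser subclass that fetches "robots.txt" with a caller-supplied user agent. The class name and default agent string are invented for this example; the error handling mirrors the stock read(), which treats 401/403 as disallow-all and other 4xx codes as allow-all:

```python
import urllib.error
import urllib.request
import urllib.robotparser


class UserAgentRobotFileParser(urllib.robotparser.RobotFileParser):
    """RobotFileParser that fetches robots.txt with a custom User-Agent.

    Hypothetical workaround sketch; not part of the stdlib API.
    """

    def __init__(self, url="", user_agent="MyCrawler/1.0"):
        super().__init__(url)
        self.user_agent = user_agent

    def read(self):
        # Same flow as the stdlib read(), but with an explicit header.
        request = urllib.request.Request(
            self.url, headers={"User-Agent": self.user_agent})
        try:
            f = urllib.request.urlopen(request)
        except urllib.error.HTTPError as err:
            if err.code in (401, 403):
                self.disallow_all = True
            elif 400 <= err.code < 500:
                self.allow_all = True
        else:
            self.parse(f.read().decode("utf-8").splitlines())
```

The parsing and can_fetch() logic are inherited unchanged; only the fetch is overridden, so a site whose mod_security rules block the default agent can still serve the robots.txt.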
History

| Date | User | Action | Args |
|---|---|---|---|
| 2016-05-19 23:21:48 | nagle | set | recipients: + nagle |
| 2016-05-19 23:21:48 | nagle | set | messageid: <1463700108.4.0.745938506232.issue27065@psf.upfronthosting.co.za> |
| 2016-05-19 23:21:48 | nagle | link | issue27065 messages |
| 2016-05-19 23:21:48 | nagle | create | |