
I'm going to ask a very big favor, as I have been struggling with this problem for several days. I have tried everything I could think of and still got no result. I am doing something wrong, but I still can't figure out what it is, so thank you to everyone who is willing to go on this adventure. First things first: I am trying to use the POST method to submit the search form on delta.com. As always with these websites it is complicated, since they rely on sessions, cookies and JavaScript, so the problem may lie there. I am using a code example that I found on Stack Overflow: Using MultipartPostHandler to POST form-data with Python. Here is my code, tweaked for the Delta web page.

from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request
from delta.items import DeltaItem
from scrapy.contrib.spiders import CrawlSpider, Rule

class DmozSpider(CrawlSpider):
    name = "delta"
    allowed_domains = ["http://www.delta.com"]
    start_urls = ["http://www.delta.com"]

    def start_requests(self, response):
        yield FormRequest.from_response(response,
                                        formname='flightSearchForm',
                                        url="http://www.delta.com/booking/findFlights.do",
                                        formdata={'departureCity[0]': 'JFK',
                                                  'destinationCity[0]': 'SFO',
                                                  'departureDate[0]': '07.20.2013',
                                                  'departureDate[1]': '07.28.2013',
                                                  'paxCount': '1'},
                                        callback=self.parse1)

    def parse1(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//')
        items = []
        for site in sites:
            item = DeltaItem()
            item['title'] = site.select('text()').extract()
            item['link'] = site.select('text()').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items

When I tell the spider to crawl from the terminal, I see:

 scrapy crawl delta -o items.xml -t xml
2013-07-01 13:39:30+0300 [scrapy] INFO: Scrapy 0.16.2 started (bot: delta)
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Enabled item pipelines: 
2013-07-01 13:39:30+0300 [delta] INFO: Spider opened
2013-07-01 13:39:30+0300 [delta] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-01 13:39:33+0300 [delta] DEBUG: Crawled (200) <GET http://www.delta.com> (referer: None)
2013-07-01 13:39:33+0300 [delta] INFO: Closing spider (finished)
2013-07-01 13:39:33+0300 [delta] INFO: Dumping Scrapy stats:
 {'downloader/request_bytes': 219,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 27842,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2013, 7, 1, 10, 39, 33, 159235),
 'log_count/DEBUG': 7,
 'log_count/INFO': 4,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2013, 7, 1, 10, 39, 30, 734090)}
2013-07-01 13:39:33+0300 [delta] INFO: Spider closed (finished)

If you compare this with the example from the link, I don't see that I ever managed to make a POST request, even though I am using almost the same code. I even tried a very simple HTML/PHP form from W3Schools that I placed on my own server, but got the same result there. Whatever I did, I never managed to create a POST. I think the problem is simple, but the only Python I know is Scrapy, and all the Scrapy I know is what I found online (it is well documented) and from examples; still, it is not enough for me. So if anyone could at least show me the right way, it would be a very big help.

asked Jul 1, 2013 at 11:38
  • Will the site let you make POSTs? Have a look at the site's robots.txt. Commented Jul 1, 2013 at 11:55
  • Hm... good point, but I used the same code (w3schools.com/php/php_forms.asp) to test on a form that I placed on my own server, and I don't have any robots.txt file there. It gave me the same result. Commented Jul 1, 2013 at 12:25

1 Answer


Here's a working example of using FormRequest.from_response for delta.com:

from scrapy.item import Item, Field
from scrapy.http import FormRequest
from scrapy.spider import BaseSpider

class DeltaItem(Item):
    title = Field()
    link = Field()
    desc = Field()

class DmozSpider(BaseSpider):
    name = "delta"
    allowed_domains = ["delta.com"]
    start_urls = ["http://www.delta.com"]

    def parse(self, response):
        yield FormRequest.from_response(response,
                                        formname='flightSearchForm',
                                        formdata={'departureCity[0]': 'JFK',
                                                  'destinationCity[0]': 'SFO',
                                                  'departureDate[0]': '07.20.2013',
                                                  'departureDate[1]': '07.28.2013'},
                                        callback=self.parse1)

    def parse1(self, response):
        print response.status

You've used the wrong spider methods (start_requests doesn't take a response argument), and allowed_domains was set incorrectly: it should contain domain names only, without the URL scheme.

But, anyway, delta.com relies heavily on dynamic AJAX calls for loading its content, and this is where your problems start. E.g. the response in the parse1 method doesn't contain any search results; instead it contains the HTML for the "AWAY WE GO. ARRIVING AT YOUR FLIGHTS SOON" loading page, from which the results are fetched dynamically.
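To simulate such a call directly in the spider, a rough sketch might look like the following. Note that the endpoint URL and payload keys here are made-up placeholders; the real ones have to be copied from the requests you see in the browser's network tab on delta.com:

```python
import json

# Hypothetical helper: build the parameters for replaying an AJAX search
# call. The endpoint path and the "searchId" payload key are assumptions;
# substitute whatever the site's real AJAX requests actually send.
def build_ajax_request(search_id):
    payload = {"searchId": search_id}
    return {
        "url": "http://www.delta.com/ajax/flightResults",  # placeholder endpoint
        "method": "POST",
        "body": json.dumps(payload),
        "headers": {
            "Content-Type": "application/json",
            "X-Requested-With": "XMLHttpRequest",  # marks the request as AJAX
        },
    }
```

In the spider's callback you would then yield a scrapy Request built from these parameters, e.g. `Request(callback=self.parse_results, **build_ajax_request(sid))`, and `json.loads` the response body in `parse_results`.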

Basically, you should work with your browser's developer tools and try to simulate those AJAX calls inside your spider, or use a tool like selenium, which drives a real browser (and can be combined with scrapy).
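For the browser route, a minimal sketch of the idea: let a selenium-style driver execute the page's JavaScript, then hand the rendered HTML back to Scrapy for parsing. Creating the driver (e.g. `selenium.webdriver.Firefox()`) is left out, and the helper name is my own; only the standard `get()`/`page_source` driver interface is assumed:

```python
# Sketch: render a JS-heavy page in a real browser, return the final HTML.
# Only the get()/page_source interface of a selenium-style driver is assumed.
def render_with_browser(driver, url):
    driver.get(url)            # the browser runs the page's JS, including AJAX calls
    return driver.page_source  # fully rendered HTML, results included

# In a spider you could then wrap the result for Scrapy's selectors, e.g.:
#   from scrapy.http import HtmlResponse
#   html = render_with_browser(driver, response.url)
#   rendered = HtmlResponse(url=response.url, body=html, encoding='utf-8')
```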


Hope that helps.

answered Jul 1, 2013 at 19:25

4 Comments

Thank you alecxe. It worked very well for me with POST, and it created new questions that I am stuck on, but I will try to solve them on my own, at least for a couple of days. In any case, your answer is a great help, and I don't have enough reputation to rate it accordingly. The world is a better place for sure when you are around here... :)
You are welcome, thanks! Feel free to ask more questions about the subject in a separate thread.
@alecxe I really appreciate your information, +1. Using your Delta Air Lines example: is there any way to know (using scrapy) all the possible auto-completions? In this case, the exact names of the airports, for instance? Is there any tutorial or document where I can get this kind of information?
@DanielTheRocketMan if you need a list of all possible airports, I'd make a request to, for example, the Wikipedia list of airports and parse it into a Python data structure (a dictionary should work if you need to make lookups into it).
