I need to make some GET/POST requests to a website that I have credentials to log in to. I plan to do this with Ruby and Net::HTTP. As I'm new to this kind of thing, I'm struggling with the fact that the log-in page requires robot verification (check-box kind), which means I'm not able to automate the log-in phase. Besides that, the session only stays alive for some time; once the server detects that no activity has been made, it requests the log-in page again. The website is built with PHP and JS (most of it is JS), and it requires the user to enter a "restrict-area" browser mode after the log-in phase.
It would be no problem to manually log in and then execute an operation (a few requests) every time I need it. But I don't know how I could pass credential information from the browser, such as the session id, to my script. I need some conceptual ideas about this.
Additional information:
- There is no public API.
- The "restrict-area" browser's mode is a browser without some buttons (forward and backward in history pages) and it don't permit to change the URL - that is all I know.
- I need this for automating some manually tasks that take hours to do.
- The website uses Ajax.
If additional information is needed I can add it, just ask in the comments.
Thanks in advance!
EDIT
My intention isn't to crawl random websites, but to make specific HTTP requests to a specific website where credentials are necessary to do so.
1 Answer
For JS-intensive websites, it might be much more convenient to use a "headless browser" approach, such as the capybara-webkit gem, which basically allows automation on top of WebKit, the popular browser engine used by Safari (and from which the engines in Chrome and Opera were derived). I'm not sure it's good enough to fool the robot verification (leaving the moral aspect aside), but at least it beats Net::HTTP in cases like getting Google search results.
Also, have a look at PhantomJS, which is a JS browser-automation tool (as capybara-webkit is a Ruby one); it gives the additional convenience of working with in-page elements in the same language that controls the browser.
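A minimal sketch of that approach with capybara-webkit; the login URL, field names, and button label below are assumptions, not the real site's:

```ruby
# Gemfile: gem 'capybara'; gem 'capybara-webkit'
require 'capybara'
require 'capybara/webkit'   # registers the :webkit driver

session = Capybara::Session.new(:webkit)

# Hypothetical login form -- adjust URL and locators to the real page.
session.visit('https://example.com/login')
session.fill_in('username', with: 'my_user')
session.fill_in('password', with: 'my_password')
session.click_button('Log in')

# The session keeps its cookies, so pages behind the login can be fetched directly.
session.visit('https://example.com/restricted/report')
puts session.html   # page HTML after the in-page JS has run
```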
Hey, thank you for the insight, +1. I didn't make clear in the question the kind of requests I will be doing, but it's basically requests to specific endpoints and dealing with JSON data. I will probably not need to scrape pages for data. Right now I'm testing Ruby Mechanize. My intention is to copy the cookies from an open session in Firefox and use Mechanize to simulate a parallel browser using the same session (hijack the session), so I won't be dealing with robot verification. – Pedro Gabriel Lima Mar 6, 2018 at 13:36
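For reference, a rough sketch of that cookie-reuse idea with Mechanize; the cookie name PHPSESSID, the domain, and the endpoint are assumptions for illustration:

```ruby
# Gemfile: gem 'mechanize'
require 'mechanize'
require 'json'

agent = Mechanize.new

# Session id copied by hand from Firefox (Storage -> Cookies) after a manual log-in.
# PHPSESSID and the domain are assumptions -- use whatever cookie the site actually sets.
cookie = HTTP::Cookie.new('PHPSESSID', 'paste-session-id-here',
                          domain: 'example.com', path: '/', for_domain: true)
agent.cookie_jar.add(cookie)

# Hypothetical Ajax endpoint returning JSON.
page = agent.get('https://example.com/ajax/orders')
data = JSON.parse(page.body)
puts data.inspect
```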
The site's Ajax calls may also authenticate with Authorization: Bearer tokens, which can be copied from the browser's developer tools and reused in much the same way as the session cookie. Unfortunately, without a public API, whatever you build is prone to break every time the target website does a new release.
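A minimal Net::HTTP sketch of reusing such a token; the endpoint and the token value are placeholders you would copy from the browser's developer tools:

```ruby
require 'net/http'
require 'uri'
require 'json'

# Hypothetical JSON endpoint; the bearer token comes from the Network tab in dev tools.
uri = URI('https://example.com/api/orders')

req = Net::HTTP::Get.new(uri)
req['Authorization'] = 'Bearer paste-token-here'
req['Accept'] = 'application/json'

res = Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
  http.request(req)
end

data = JSON.parse(res.body) if res.is_a?(Net::HTTPSuccess)
puts data.inspect
```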