Strictly speaking, there is no direct correlation between the URL string and the type of content you will get when following the URL - there can be, for instance, redirects, or the URL pointing to, say, an image file may not contain a filename with an extension at all. And keeping a list of disallowed extensions does not scale well.
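
A quick way to see the problem is to try guessing the type from the URL string alone; a minimal sketch using the standard-library mimetypes module (the httpbin URL is the same extension-less image endpoint used in the demo below):

import mimetypes

# Guessing from the string works only when the path carries an extension.
print(mimetypes.guess_type("http://www.abc.com/file.png"))   # ('image/png', None)

# An extension-less URL tells the guesser nothing, even though the server
# actually serves an image there.
print(mimetypes.guess_type("https://httpbin.org/image"))     # (None, None)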

An alternative, slower, but more reliable way would be to actually visit the URLs (we can use "lightweight" HEAD requests for it) and check the Content-Type header. Something like:

import requests

links = ["http://www.abc.com", "http://www.abc.com/file.pdf",
         "http://www.abc.com/file.png"]

with requests.Session() as session:
    links = [link for link in links
             if "text/html" in session.head(link).headers["Content-Type"]]
    print(links)

Note that to improve on speed, we are also using the same Session object, which reuses the underlying TCP connection:

if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase

Demo (using httpbin):

In [1]: import requests

In [2]: links = ["https://httpbin.org/html",
   ...:          "https://httpbin.org/image/png",
   ...:          "https://httpbin.org/image/svg",
   ...:          "https://httpbin.org/image"]

In [3]: with requests.Session() as session:
   ...:     links = [link for link in links
   ...:              if "text/html" in session.head(link).headers["Content-Type"]]
   ...:     print(links)
   ...:
['https://httpbin.org/html']
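
One caveat worth keeping in mind: requests does not follow redirects for HEAD requests by default, some servers omit the Content-Type header on HEAD responses, and a few reject HEAD outright. A slightly more defensive variant might look like this (a sketch; the helper name and the timeout value are arbitrary):

import requests

def looks_like_html(session, url):
    # Follow redirects so we classify the final target rather than a 30x
    # response, and treat a missing Content-Type or a failed request as "not HTML".
    try:
        response = session.head(url, allow_redirects=True, timeout=10)
    except requests.RequestException:
        return False
    return "text/html" in response.headers.get("Content-Type", "")

links = ["http://www.abc.com", "http://www.abc.com/file.pdf",
         "http://www.abc.com/file.png"]

with requests.Session() as session:
    links = [link for link in links if looks_like_html(session, link)]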

You can even take it a step further and solve it with asyncio and aiohttp:

import asyncio
import aiohttp

@asyncio.coroutine
def is_html(session, url):
    # Issue a HEAD request and report whether the response claims to be HTML.
    response = yield from session.head(url, compress=True)
    print(url, "text/html" in response.headers["Content-Type"])

if __name__ == '__main__':
    links = ["https://httpbin.org/html",
             "https://httpbin.org/image/png",
             "https://httpbin.org/image/svg",
             "https://httpbin.org/image"]
    loop = asyncio.get_event_loop()

    conn = aiohttp.TCPConnector(verify_ssl=False)
    with aiohttp.ClientSession(connector=conn, loop=loop) as session:
        # Fire all HEAD requests concurrently and wait for them to finish.
        f = asyncio.wait([is_html(session, link) for link in links])
        loop.run_until_complete(f)

Prints:

https://httpbin.org/image/svg False
https://httpbin.org/image False
https://httpbin.org/image/png False
https://httpbin.org/html True
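
For what it's worth, the @asyncio.coroutine decorator is deprecated and removed in recent Python versions, and current aiohttp releases expect the session to be used with async with, so the same idea in modern syntax might look roughly like this (a sketch, assuming Python 3.7+ and aiohttp 3.x; the custom connector is omitted since httpbin's certificates verify fine):

import asyncio
import aiohttp

async def is_html(session, url):
    # HEAD the URL and check the advertised content type.
    async with session.head(url) as response:
        return url, "text/html" in response.headers.get("Content-Type", "")

async def main(links):
    async with aiohttp.ClientSession() as session:
        # Run all HEAD requests concurrently.
        results = await asyncio.gather(*(is_html(session, link) for link in links))
        for url, ok in results:
            print(url, ok)

if __name__ == "__main__":
    links = ["https://httpbin.org/html",
             "https://httpbin.org/image/png",
             "https://httpbin.org/image/svg",
             "https://httpbin.org/image"]
    asyncio.run(main(links))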
