I wrote some code to filter a list of links (URLs) with multiple conditions given by file extensions. I want to remove every URL that does not point to an HTML file. The code is:
avoid = [".pptx", ".ppt", ".xls", ".xlsx", ".xml", ".xlt", ".pdf",
".jpg", ".png", ".svg", ".doc", ".docx", ".pps"]
links = ["http://www.abc.com", "http://www.abc.com/file.pdf",
"http://www.abc.com/file.png"]
def analyse_resource_extension(url):
match = [ext in url for ext in avoid]
return any(element is True for element in match)
links = list(filter(lambda x: analyse_resource_extension(x) is False, links))
so that links ends up with ["http://www.abc.com"] as its only value. This solution seems kind of wordy to me. Is there any way to perform the same action without using the analyse_resource_extension function?
2 Answers
Strictly speaking, there is no direct correlation between the URL string and the type of content you will get when following the URL: there can be, for instance, redirects, or a URL leading to, say, an image file may not contain a filename with an extension at all. And keeping a list of disallowed extensions does not scale well.
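As a side note, even if you do stick with extensions, testing ext in url against the whole URL string is fragile: a hypothetical URL like http://www.abc.com/report.pdf.html contains ".pdf" yet serves HTML. A sketch of a more precise check that only looks at the suffix of the URL path (has_avoided_extension is an illustrative name, reusing your avoid list):

from urllib.parse import urlparse

avoid = (".pptx", ".ppt", ".xls", ".xlsx", ".xml", ".xlt", ".pdf",
         ".jpg", ".png", ".svg", ".doc", ".docx", ".pps")

def has_avoided_extension(url):
    # Test only the suffix of the URL path, not arbitrary substrings
    # of the whole URL (query strings, directory names, and so on).
    # str.endswith() accepts a tuple of suffixes.
    return urlparse(url).path.lower().endswith(avoid)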
An alternative, slower but more reliable, way would be to actually visit the URLs (we can use "lightweight" HEAD requests for this) and check the Content-Type header. Something like:
import requests

links = ["http://www.abc.com", "http://www.abc.com/file.pdf",
         "http://www.abc.com/file.png"]

with requests.Session() as session:
    links = [link for link in links
             if "text/html" in session.head(link).headers["Content-Type"]]

print(links)
Note that to improve on speed, we are also using a single Session object, which reuses the underlying TCP connection:

"...if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase..."
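One caveat worth knowing: requests does not follow redirects for HEAD requests by default, and a server may omit the Content-Type header entirely (which would raise a KeyError above). A slightly more defensive sketch of the same filter:

with requests.Session() as session:
    links = [link for link in links
             # Follow redirects so we inspect the final response,
             # and fall back to "" if the header is missing.
             if "text/html" in session.head(link, allow_redirects=True)
                                      .headers.get("Content-Type", "")]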
Demo (using httpbin):
In [1]: import requests

In [2]: links = ["https://httpbin.org/html",
   ...:          "https://httpbin.org/image/png",
   ...:          "https://httpbin.org/image/svg",
   ...:          "https://httpbin.org/image"]

In [3]: with requests.Session() as session:
   ...:     links = [link for link in links
   ...:              if "text/html" in session.head(link).headers["Content-Type"]]
   ...:     print(links)
   ...:
['https://httpbin.org/html']
You can even take it a step further and solve it with asyncio and aiohttp:
import asyncio
import aiohttp

@asyncio.coroutine
def is_html(session, url):
    response = yield from session.head(url, compress=True)
    print(url, "text/html" in response.headers["Content-Type"])

if __name__ == '__main__':
    links = ["https://httpbin.org/html",
             "https://httpbin.org/image/png",
             "https://httpbin.org/image/svg",
             "https://httpbin.org/image"]

    loop = asyncio.get_event_loop()

    conn = aiohttp.TCPConnector(verify_ssl=False)
    with aiohttp.ClientSession(connector=conn, loop=loop) as session:
        f = asyncio.wait([is_html(session, link) for link in links])
        loop.run_until_complete(f)
Prints:
https://httpbin.org/image/svg False
https://httpbin.org/image False
https://httpbin.org/image/png False
https://httpbin.org/html True
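Note that the @asyncio.coroutine / yield from style shown above was current in 2017 but has since been removed from Python (in 3.11). On current Python and aiohttp the same idea would look roughly like this (a sketch, not the original answer's code):

import asyncio
import aiohttp

async def is_html(session, url):
    # HEAD request; fall back to "" if the header is missing.
    async with session.head(url) as response:
        return "text/html" in response.headers.get("Content-Type", "")

async def main(links):
    async with aiohttp.ClientSession() as session:
        # Run all HEAD requests concurrently; gather preserves order.
        results = await asyncio.gather(*(is_html(session, link) for link in links))
        return [link for link, ok in zip(links, results) if ok]

if __name__ == '__main__':
    links = ["https://httpbin.org/html",
             "https://httpbin.org/image/png",
             "https://httpbin.org/image/svg",
             "https://httpbin.org/image"]
    print(asyncio.run(main(links)))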
+1 much better answer than mine. Didn't know that you could determine the stuff with "Content-Type" either. – Dair, Apr 3, 2017 at 2:15
This solution seems kind of wordy to me. Is there any way to perform the same action without using the analyse_resource_extension function?
How about you keep analyse_resource_extension and instead use itertools.filterfalse:
from itertools import filterfalse
...
links = list(filterfalse(analyse_resource_extension, links))
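And if you want to drop the helper function entirely, the same check inlines naturally into a list comprehension with any() (a sketch, equivalent to your original filter):

# Keep only links whose URL contains none of the avoided extensions
links = [link for link in links
         if not any(ext in link for ext in avoid)]

Either way, you can drop the "is True" / "is False" comparisons from your original code: any() already returns a bool, so comparing its result against True is redundant.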