I need to query a large number of domains stored in a CSV file and save the associated metadata from the responses. My final goal is to generate a CSV file after applying filters to this metadata.
The original CSV file can be quite large, and I want to avoid loading everything into memory for performance reasons.
I’m considering two approaches:
- Read the CSV, perform the queries, save the metadata to a new CSV, and then load this new file with Polars to apply filters and generate the final output.
- Use Polars from the start to read the CSV, perform the queries using map_batches, apply filters in the same pipeline, and then produce the final file directly.

Here’s a simplified example of the second approach:
```python
import asyncio

import httpx
import polars as pl


# Asynchronous function to perform the domain checks
async def check_multiple_domains(domains: pl.Series) -> list[list[dict]]:
    async with httpx.AsyncClient(follow_redirects=True) as client:
        tasks = [check_single_domain(client, domain) for domain in domains]
        return await asyncio.gather(*tasks)


# Synchronous wrapper that map_batches can call
def execute_domain_checks(domains: pl.Series) -> pl.Series:
    results = asyncio.run(check_multiple_domains(domains))
    return pl.Series(results, strict=False)


# Load CSV lazily and add metadata columns
# (return_dtype is pl.Object here, since Polars has no Dict dtype)
lf = pl.scan_csv("domains.csv").with_columns(
    pl.col("domain").map_batches(execute_domain_checks, return_dtype=pl.Object)
)
```
My questions:
- Is the second approach viable in terms of performance, considering the CSV file might be large?
- What are the advantages and disadvantages of using polars.map_batches for this kind of task compared to a more sequential approach (approach 1)?
- Do you have suggestions on improving memory management or performance in such a scenario?
- How many rows are there? Surely the network speed will be the limiting factor here. – Reinderien, Oct 7, 2024
- Can you provide more information on the endpoints being queried? Are they all to the same service, or different services? – Reinderien, Oct 7, 2024
- Most domains are different. – Lo Bellin, Oct 7, 2024
1 Answer
design
"I hold a hammer, and as I look around I see many nails."
The OP's design seems to be constrained by some "batched" APIs that are already in the codebase. Consider revisiting those design decisions.
You are contemplating performing the \$N\$ queries
using \$M\$ batches, each of size \$B\$ domains,
with \$O(M \times B)\$ time complexity.
There is a (mostly) random distribution over network query times.
At one extreme, with \$M = N\$, we sum all those delays.
At the other, with \$M = 1\$, we blow out our memory budget
but enjoy lots of parallelism so ensemble delay is simply max()
over all the query delays -- that single straggler dictates
the overall elapsed time.
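For a rough sense of scale (numbers assumed purely for illustration): with \$N = 10{,}000\$ domains and a mean query time of one second, the \$M = N\$ extreme sums to roughly ten thousand seconds of elapsed time, while the \$M = 1\$ extreme finishes in whatever the single slowest query takes -- perhaps ten seconds -- at the cost of ten thousand simultaneously outstanding requests.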
Come at this in a more principled way. Pick your memory budget \$B\$, the most queries you can afford to have outstanding at any instant. And instead of consuming that memory resource completely at start of each batch and then repeatedly watching it slowly dwindle to zero as one by one the stragglers of a given batch eventually finish, strive to always have \$B\$ buffered queries pending, right up until the last little bit of a production run. This avoids the delay of waiting on \$M\$ separate stragglers.
This suggests that you want to have a queue, or priority queue, or at least a counter of num_pending_tasks. And each time a query task completes, you launch a new one.
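Here is a minimal sketch of that launch-on-completion pattern using asyncio.wait(); the check argument is assumed to be a coroutine function along the lines of the OP's check_single_domain(), returning one metadata record per domain:

```python
import asyncio
import itertools


async def run_with_bounded_pending(domains, check, budget=100):
    """Keep at most `budget` checks in flight; launch a new one as each completes.

    `check(domain)` is assumed to be a coroutine function returning one
    metadata record -- e.g. something like the OP's check_single_domain().
    """
    domain_iter = iter(domains)
    # Prime the pool with up to `budget` outstanding queries.
    pending = {asyncio.create_task(check(d))
               for d in itertools.islice(domain_iter, budget)}
    while pending:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            yield task.result()  # hand each record downstream as soon as it arrives
        # Top the pool back up to `budget` outstanding queries.
        for d in itertools.islice(domain_iter, budget - len(pending)):
            pending.add(asyncio.create_task(check(d)))
```

A caller consumes it with async for record in run_with_bounded_pending(domains, check_single_domain): ..., so there is never a whole-batch wait on a single straggler.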
When using this approach, the wonderful polars package actually brings very little to the party. It would suffice to use import csv and a lowly {reader, writer} pair (or a {DictReader, DictWriter} pair). You would be streaming from input CSV to output CSV, with a near-constant memory burden of \$B\$.

There might be a /usr/bin/sort post-processing step, if we want the somewhat racy arbitrary output order to match the input order.
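Reusing the run_with_bounded_pending() sketch above, the streaming pipeline might look roughly like this; the results.csv layout, the "domain" column name, and the check_domain() coroutine are illustrative assumptions, not part of the original code:

```python
import asyncio
import csv
import functools

import httpx


async def check_domain(client: httpx.AsyncClient, domain: str) -> dict:
    """Hypothetical stand-in for the real per-domain check."""
    resp = await client.head(f"https://{domain}")
    return {"domain": domain, "status": resp.status_code,
            "server": resp.headers.get("server", "")}


async def stream_checks(in_path: str, out_path: str, budget: int = 100) -> None:
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.DictReader(fin)                 # streams input rows lazily
        writer = csv.DictWriter(fout, fieldnames=["domain", "status", "server"])
        writer.writeheader()
        domains = (row["domain"] for row in reader)  # assumes a "domain" column
        async with httpx.AsyncClient(follow_redirects=True) as client:
            check = functools.partial(check_domain, client)
            async for record in run_with_bounded_pending(domains, check, budget):
                writer.writerow(record)              # completion order, not input order


asyncio.run(stream_checks("domains.csv", "results.csv"))
```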
relational database
If there are so many domains that you need to jump through the hoops of worrying about .map_batches() details, then maybe you're working too hard. Outsource the memory management tasks to an RDBMS, perhaps sqlite.

Streaming (for-loop iterating) over DB result rows is just like streaming over CSV rows, with a small memory footprint even for very large datasets.
Polars and pandas offer convenient functions for serializing a dataframe to a new or existing table, and for dragging those rows back into memory at a later date. And you can CREATE INDEX to produce the sort order you prefer.
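As a minimal sketch of that store / index / stream-back pattern with the standard-library sqlite3 module (the database path, table, column names, and sample rows are all assumptions for illustration):

```python
import sqlite3

con = sqlite3.connect("domains.db")
con.execute(
    "CREATE TABLE IF NOT EXISTS results (domain TEXT, status INTEGER, server TEXT)"
)

# Insert a batch of metadata rows; `with con:` commits (or rolls back) the batch.
with con:
    con.executemany(
        "INSERT INTO results VALUES (?, ?, ?)",
        [("example.com", 200, "nginx"), ("example.org", 301, "apache")],
    )

# An index gives the preferred output order without an external sort step.
con.execute("CREATE INDEX IF NOT EXISTS idx_domain ON results (domain)")

# Iterating the cursor streams rows with a small, roughly constant memory footprint.
for domain, status, server in con.execute(
    "SELECT domain, status, server FROM results WHERE status = 200 ORDER BY domain"
):
    print(domain, status, server)
```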