I need to query a large number of domains stored in a CSV file and save the associated metadata from the responses. My final goal is to generate a CSV file after applying filters to this metadata.
The original CSV file can be quite large, and I want to avoid loading everything into memory for performance reasons.
I’m considering two approaches:
- Read the CSV, perform the queries, save the metadata to a new CSV, and then load this new file with Polars to apply filters and generate the final output.
- Use Polars from the start to read the CSV, perform the queries using map_batches, apply filters in the same pipeline, and then produce the final file directly.

Here’s a simplified example of the second approach:
```python
import asyncio

import httpx
import polars as pl


# Asynchronous function to perform the domain checks
async def check_multiple_domains(domains: pl.Series) -> list[list[dict]]:
    async with httpx.AsyncClient(follow_redirects=True) as client:
        tasks = [check_single_domain(client, domain) for domain in domains]
        return await asyncio.gather(*tasks)


# Synchronous wrapper that map_batches can call
def execute_domain_checks(domains: pl.Series) -> pl.Series:
    results = asyncio.run(check_multiple_domains(domains))
    return pl.Series(results, strict=False)


# Load CSV lazily and add metadata columns
# (return_dtype is pl.Object here, since Polars has no Dict dtype)
lf = pl.scan_csv("domains.csv").with_columns(
    pl.col("domain").map_batches(execute_domain_checks, return_dtype=pl.Object)
)
```
My questions:
- Is the second approach viable in terms of performance, considering the CSV file might be large?
- What are the advantages and disadvantages of using polars.map_batches for this kind of task compared to a more sequential approach (approach 1)?
- Do you have suggestions on improving memory management or performance in such a scenario?
- How many rows are there? Surely the network speed will be the limiting factor here. – Reinderien, Oct 7, 2024
- Can you provide more information on the endpoints being queried? Are they all to the same service, or different services? – Reinderien, Oct 7, 2024
- Most domains are different. – Lo Bellin, Oct 7, 2024
1 Answer
design
"I hold a hammer, and as I look around I see many nails."
The OP's design seems to be constrained by some "batched" APIs that are already in the codebase. Consider revisiting those design decisions.
You are contemplating performing the \$N\$ queries
using \$M\$ batches, each of size \$B\$ domains,
with \$O(M \times B)\$ time complexity.
There is a (mostly) random distribution over network query times.
At one extreme, with \$M = N\$, we sum all those delays.
At the other, with \$M = 1\$, we blow out our memory budget
but enjoy lots of parallelism so ensemble delay is simply max()
over all the query delays -- that single straggler dictates
the overall elapsed time.
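For a rough sense of scale (numbers assumed purely for illustration): with \$N = 10{,}000\$ domains and a mean query time of one second, the \$M = N\$ extreme sums to roughly ten thousand seconds of elapsed time, while the \$M = 1\$ extreme finishes in whatever the single slowest query takes -- perhaps ten seconds -- at the cost of ten thousand simultaneously outstanding requests.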
Come at this in a more principled way. Pick your memory budget \$B\$, the most queries you can afford to have outstanding at any instant. And instead of consuming that memory resource completely at start of each batch and then repeatedly watching it slowly dwindle to zero as one by one the stragglers of a given batch eventually finish, strive to always have \$B\$ buffered queries pending, right up until the last little bit of a production run. This avoids the delay of waiting on \$M\$ separate stragglers.
This suggests that you want to have a queue, or priority queue, or at least a counter of num_pending_tasks. And each time a query task completes, you launch a new one.
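Here is a minimal sketch of that launch-on-completion pattern using asyncio.wait(); the check argument is assumed to be a coroutine function along the lines of the OP's check_single_domain(), returning one metadata record per domain:

```python
import asyncio
import itertools


async def run_with_bounded_pending(domains, check, budget=100):
    """Keep at most `budget` checks in flight; launch a new one as each completes.

    `check(domain)` is assumed to be a coroutine function returning one
    metadata record -- e.g. something like the OP's check_single_domain().
    """
    domain_iter = iter(domains)
    # Prime the pool with up to `budget` outstanding queries.
    pending = {asyncio.create_task(check(d))
               for d in itertools.islice(domain_iter, budget)}
    while pending:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            yield task.result()  # hand each record downstream as soon as it arrives
        # Top the pool back up to `budget` outstanding queries.
        for d in itertools.islice(domain_iter, budget - len(pending)):
            pending.add(asyncio.create_task(check(d)))
```

A caller consumes it with async for record in run_with_bounded_pending(domains, check_single_domain): ..., so there is never a whole-batch wait on a single straggler.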
When using this approach, the wonderful polars package actually brings very little to the party. It would suffice to use import csv and a lowly {reader, writer} pair (or a {DictReader, DictWriter} pair). You would be streaming from input CSV to output CSV, with a near-constant memory burden of \$B\$.

There might be a /usr/bin/sort post-processing step, if we want the somewhat racy arbitrary output order to match the input order.
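Reusing the run_with_bounded_pending() sketch above, the streaming pipeline might look roughly like this; the results.csv layout, the "domain" column name, and the check_domain() coroutine are illustrative assumptions, not part of the original code:

```python
import asyncio
import csv
import functools

import httpx


async def check_domain(client: httpx.AsyncClient, domain: str) -> dict:
    """Hypothetical stand-in for the real per-domain check."""
    resp = await client.head(f"https://{domain}")
    return {"domain": domain, "status": resp.status_code,
            "server": resp.headers.get("server", "")}


async def stream_checks(in_path: str, out_path: str, budget: int = 100) -> None:
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.DictReader(fin)                 # streams input rows lazily
        writer = csv.DictWriter(fout, fieldnames=["domain", "status", "server"])
        writer.writeheader()
        domains = (row["domain"] for row in reader)  # assumes a "domain" column
        async with httpx.AsyncClient(follow_redirects=True) as client:
            check = functools.partial(check_domain, client)
            async for record in run_with_bounded_pending(domains, check, budget):
                writer.writerow(record)              # completion order, not input order


asyncio.run(stream_checks("domains.csv", "results.csv"))
```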
relational database
If there are so many domains that you need to jump through the hoops of worrying about .map_batches() details, then maybe you're working too hard. Outsource the memory management tasks to an RDBMS, perhaps sqlite.

Streaming (for-loop iterating) over DB result rows is just like streaming over CSV rows, with a small memory footprint even for very large datasets.
Polars and pandas offer convenient functions for serializing a dataframe to a new or existing table, and for dragging those rows back into memory at a later date. And you can CREATE INDEX to produce the sort order you prefer.
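As a minimal sketch of that store / index / stream-back pattern with the standard-library sqlite3 module (the database path, table, column names, and sample rows are all assumptions for illustration):

```python
import sqlite3

con = sqlite3.connect("domains.db")
con.execute(
    "CREATE TABLE IF NOT EXISTS results (domain TEXT, status INTEGER, server TEXT)"
)

# Insert a batch of metadata rows; `with con:` commits (or rolls back) the batch.
with con:
    con.executemany(
        "INSERT INTO results VALUES (?, ?, ?)",
        [("example.com", 200, "nginx"), ("example.org", 301, "apache")],
    )

# An index gives the preferred output order without an external sort step.
con.execute("CREATE INDEX IF NOT EXISTS idx_domain ON results (domain)")

# Iterating the cursor streams rows with a small, roughly constant memory footprint.
for domain, status, server in con.execute(
    "SELECT domain, status, server FROM results WHERE status = 200 ORDER BY domain"
):
    print(domain, status, server)
```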