I am executing a simple select query similar to:
DBI::dbGetQuery(db, "SELECT date, x, y FROM table")
where the table contains a large amount of financial data. The db is a MS SQL server that I manage. I noticed slow query times so dug into the issue more carefully. I identified that these queries were generating large network wait times, specifically ASYNC_NETWORK_IO.
The reason that I'm trying to improve the performance is that I often access financial data from a database I do not manage that uses PostgreSQL. I've put together a benchmark piece of code and for the same number of rows, columns, and column types, the same query runs 25-50% faster there.
I'm a new database administrator so I'm trying to understand how to improve performance to approach the performance of this other (more professionally) managed database.
From reading online (e.g., https://www.sqlshack.com/reducing-sql-server-async_network_io-wait-type/), my prior is that this is an application-caused bottleneck, in that I'm not reading rows from the server fast enough. However, given that a similar piece of code executes much faster against another server, I'm still not sure it's entirely the application's fault.
Finally, if it is the application's fault, I'm having trouble finding resources that specifically explain how to "read data more quickly". I have tried chunking my queries, but that hasn't improved performance. For example, resources often say to avoid "row by agonizing row" reads, or to "read immediately and process afterwards". However, in my specific situation, I'm not sure what that means or how to avoid it.
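For context on what "read immediately and process afterwards" typically means with DBI, here is a minimal sketch (assuming `con` is an already-open connection, e.g. via `odbc::dbConnect()`; the table and column names are placeholders from the question): fetch the entire result set in one call so the server's send buffer drains quickly, and only start local processing once the network read is finished.

```r
library(DBI)

# con: an open DBI connection to the SQL Server instance (assumed to exist)
res <- dbSendQuery(con, "SELECT date, x, y FROM table")
dat <- dbFetch(res, n = -1)  # n = -1: pull every remaining row in one fetch
dbClearResult(res)

# Only now, with the full result set local, begin any per-row processing.
# Interleaving processing with fetching is the "row by agonizing row"
# pattern that leaves the server waiting on the client (ASYNC_NETWORK_IO).
```

Note that `dbGetQuery()` already does send + full fetch + clear internally, so if you are calling it as shown above you are not the classic RBAR case; the slow consumer can instead be the driver or the network path.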
Thanks in advance for helping a novice DBA.
1 Answer
To piggyback off of J.D.'s comment, "select less data": if your application is aggregating, consider doing that in SQL instead. SQL Server can do that more efficiently, and there will be less data moving over the network.
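Since the asker already uses dplyr/dbplyr, a hedged sketch of what "aggregate in SQL" looks like in that workflow (table and column names are placeholders; `con` is an assumed open connection): build the summary lazily against the remote table so dbplyr translates it to a server-side `GROUP BY`, and only `collect()` the small aggregated result.

```r
library(dplyr)
library(dbplyr)

# tbl() creates a lazy reference; nothing is pulled over the network yet.
daily <- tbl(con, "table") |>
  group_by(date) |>
  summarise(
    total_x = sum(x, na.rm = TRUE),
    mean_y  = mean(y, na.rm = TRUE)
  )

# show_query(daily)  # inspect the GROUP BY SQL dbplyr generates
daily_local <- collect(daily)  # only the aggregated rows cross the network
```

The key point is that `collect()` is the only step that moves data; everything before it runs on SQL Server.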
Thank you for the suggestion. I'm heavily using dplyr/dbplyr to keep as much of the computation on the server as possible for this exact reason. But generally, for doing complicated econometrics, I would like to have the data local, and then as I get new data or change the cleaning of the data, I would like to update. So I'm trying to optimize as much as possible to reduce read times. – nvt Commented Feb 14, 2024 at 18:51
Compare `ping` times to both servers. And the drivers we want to know about are the ones being used by the client, not the server.