I am executing a simple select query similar to:
DBI::dbGetQuery(db, "SELECT date, x, y FROM table")
where the table contains a large amount of financial data. The db is a MS SQL server that I manage. I noticed slow query times so dug into the issue more carefully. I identified that these queries were generating large network wait times, specifically ASYNC_NETWORK_IO.
The reason that I'm trying to improve the performance is that I often access financial data from a database I do not manage that uses PostgreSQL. I've put together a benchmark piece of code and for the same number of rows, columns, and column types, the same query runs 25-50% faster there.
I'm a new database administrator so I'm trying to understand how to improve performance to approach the performance of this other (more professionally) managed database.
From reading online (e.g., https://www.sqlshack.com/reducing-sql-server-async_network_io-wait-type/), my prior is that this is an application-caused bottleneck, in that I'm not reading rows from the server fast enough. However, given that a similar piece of code executes much faster against another server, I'm still not sure it's entirely the application's fault.
Finally, if it is the application's fault, I'm having trouble finding resources that specifically explain how to "read data more quickly". I have tried chunking my queries, but that hasn't improved performance. For example, resources often say to avoid "row by agonizing row" reads, or to "read immediately and process afterwards". However, in my specific situation, I'm not sure what that means or how to avoid it.
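For context on what "read immediately and process afterwards" typically means with DBI, here is a minimal sketch (assuming `con` is an already-open connection, e.g. via `odbc::dbConnect()`; the table and column names are placeholders from the question): fetch the entire result set in one call so the server's send buffer drains quickly, and only start local processing once the network read is finished.

```r
library(DBI)

# con: an open DBI connection to the SQL Server instance (assumed to exist)
res <- dbSendQuery(con, "SELECT date, x, y FROM table")
dat <- dbFetch(res, n = -1)  # n = -1: pull every remaining row in one fetch
dbClearResult(res)

# Only now, with the full result set local, begin any per-row processing.
# Interleaving processing with fetching is the "row by agonizing row"
# pattern that leaves the server waiting on the client (ASYNC_NETWORK_IO).
```

Note that `dbGetQuery()` already does send + full fetch + clear internally, so if you are calling it as shown above you are not the classic RBAR case; the slow consumer can instead be the driver or the network path.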
Thanks in advance for helping a novice DBA.
1 Answer
To piggyback off of J.D.'s comment, "select less data": if your application is aggregating, consider doing that in SQL instead. SQL Server can do that more efficiently, and there will be less data moving over the network.
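Since the asker already uses dplyr/dbplyr, a hedged sketch of what "aggregate in SQL" looks like in that workflow (table and column names are placeholders; `con` is an assumed open connection): build the summary lazily against the remote table so dbplyr translates it to a server-side `GROUP BY`, and only `collect()` the small aggregated result.

```r
library(dplyr)
library(dbplyr)

# tbl() creates a lazy reference; nothing is pulled over the network yet.
daily <- tbl(con, "table") |>
  group_by(date) |>
  summarise(
    total_x = sum(x, na.rm = TRUE),
    mean_y  = mean(y, na.rm = TRUE)
  )

# show_query(daily)  # inspect the GROUP BY SQL dbplyr generates
daily_local <- collect(daily)  # only the aggregated rows cross the network
```

The key point is that `collect()` is the only step that moves data; everything before it runs on SQL Server.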
Thank you for the suggestion. I'm heavily using dplyr/dbplyr to keep as much of the computation on the server as possible for this exact reason. But generally, for doing complicated econometrics, I would like to have the data local, and then as I get new data or change the cleaning of the data, I would like to update. So I'm trying to optimize as much as possible to reduce read times. – nvt Commented Feb 14, 2024 at 18:51
Compare `ping` times to both servers. And the drivers we want to know about are the ones being used by the client, not the server.