INTRO
I am using an Excel worksheet as my "database." There are no headers, so [F1] is assigned to column one by default. I filled the entire column, all 1,048,576 cells, with RANDBETWEEN(1,20), then hard-set the values by copying and pasting them as values.
It is very slow, much slower than producing the same figures with Excel worksheet functions would be. I realize that I could use 1,048,576 as a constant denominator, but I wanted to practice with the SQL query language and keep the model more dynamic.
EDIT: the results should read "% Of Time", not "Many Times". I am getting the percentage of times each value occurred in my data set; in other words, I am looking at the distribution of RANDBETWEEN(1,20) outputs over X calls. In this instance X is 1,048,576.
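For comparison, a rough sketch of the worksheet-function route mentioned above (COUNTIF is only one plausible choice here, and the column reference is an assumption):

    ' Hypothetical worksheet-function baseline: assumes the data sits in column A
    ' of the active sheet and uses the full column length as a constant denominator.
    Function CountIfPercent(ByVal searchKey As Long) As Double
        CountIfPercent = Application.WorksheetFunction.CountIf( _
            ActiveSheet.Columns(1), searchKey) / 1048576# * 100
    End Function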
My SQL Query
"SELECT Round(SUM(IIF([F1]=" & searchKey & ",1,0))*100.0/SUM(IIF([F1]<> Null,1,0)),10) From [Sheet1$];"
It returns the total number of records in column F1 that are equal to searchKey and divides that number by the total number of records in column F1.
CODE
Option Explicit

Private Declare PtrSafe Function timeGetTime Lib "winmm.dll" () As Long

Sub SqlQueryOnWorkSheet()
    Dim started As Long
    Dim cn As ADODB.Connection
    Dim filePath As String
    Dim counter As Long
    Dim outCome As Double
    Dim ended As Long

    started = timeGetTime
    filePath = "Z:\Test\Test1.xlsx"
    Set cn = EstablishConnection(filePath)
    If cn.State <> 1 Then GoTo CleanFail:

    For counter = 1 To 20
        outCome = FindCount(cn, counter)
        PrintOutcome counter, outCome
    Next counter

    cn.Close
    Set cn = Nothing
    ended = timeGetTime
    Debug.Print "QUERIES RAN IN " & (ended - started) / 1000 & " SECONDS"
    Exit Sub

CleanFail:
    Debug.Print "CONNECTION COULD NOT BE MADE"
End Sub

Function EstablishConnection(ByVal filePath As String) As ADODB.Connection
    Set EstablishConnection = New ADODB.Connection
    EstablishConnection.Open _
        "Provider=Microsoft.ACE.OLEDB.12.0;" & _
        "Data Source='" & filePath & "';" & _
        "Extended Properties=""Excel 12.0 Macro;HDR=No;IMEX=1;"";"
End Function

Function FindCount(ByRef cn As ADODB.Connection, ByVal searchKey As Long) As Double
    Dim strSql As String
    Dim rs As ADODB.Recordset

    On Error GoTo CleanFail:

    Set rs = New ADODB.Recordset
    strSql = "SELECT Round(SUM(IIF([F1]=" & searchKey & ",1,0))*100.0/SUM(IIF([F1]<> Null,1,0)),10) From [Sheet1$];"
    rs.Open strSql, cn
    FindCount = rs.GetString
    rs.Close
    Set rs = Nothing
    Exit Function

CleanFail:
    Debug.Print "QUERY FAILED"
End Function

Sub PrintOutcome(ByVal counter As Long, ByVal outCome As Double)
    Debug.Print "Variable " & counter & " Occured " & outCome & " Many Times"
End Sub
Results
EDIT: THIS SHOULD BE % NOT MANY -- FIXED
Variable 1 Occured 4.9837112427 % Of Time
Variable 2 Occured 5.0171852112 % Of Time
Variable 3 Occured 4.9752235413 % Of Time
Variable 4 Occured 4.9716949463 % Of Time
Variable 5 Occured 5.0051689148 % Of Time
Variable 6 Occured 4.9989700317 % Of Time
Variable 7 Occured 4.9901008606 % Of Time
Variable 8 Occured 5.0283432007 % Of Time
Variable 9 Occured 5.0018310547 % Of Time
Variable 10 Occured 5.0164222717 % Of Time
Variable 11 Occured 4.9933433533 % Of Time
Variable 12 Occured 5.0059318542 % Of Time
Variable 13 Occured 5.0333976746 % Of Time
Variable 14 Occured 4.9952507019 % Of Time
Variable 15 Occured 5.0163269043 % Of Time
Variable 16 Occured 4.9654006958 % Of Time
Variable 17 Occured 4.9822807312 % Of Time
Variable 18 Occured 5.0310134888 % Of Time
Variable 19 Occured 5.0113677979 % Of Time
Variable 20 Occured 4.9770355225 % Of Time
QUERIES RAN IN 38.754 SECONDS
1 Answer
The reason for the poor performance is that you are taking a non-database approach to compiling the data. A query's speed depends on its ability to group, filter, and index records.
SELECT Round(SUM(IIF([F1]=" & searchKey & ",1,0)) *100.0 /SUM(IIF([F1]<>Null,1,0)),10) From [Sheet1$];
The IIF() function is much slower than using a WHERE clause to filter the data. Use WHERE ([F1] Is Not Null) AND ([F1]=" & searchKey & ") so that you are only retrieving the relevant records. With the records properly filtered, you can use Count(*) instead of Sum() with IIF([F1]<>Null,1,0).
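For illustration, a sketch of how FindCount might look with that change (FindCountFiltered is a hypothetical name and has not been run against the workbook above; the scalar subquery supplies the non-null total so the result stays a percentage):

    ' Hypothetical rewrite of FindCount: the WHERE clause filters to the relevant
    ' records, Count(*) replaces the Sum(IIF(...)) pair, and the subquery
    ' provides the denominator for the percentage.
    Function FindCountFiltered(ByRef cn As ADODB.Connection, ByVal searchKey As Long) As Double
        Dim rs As ADODB.Recordset
        Dim strSql As String
        strSql = "SELECT Count(*) * 100.0 / " & _
                 "(SELECT Count(*) FROM [Sheet1$] WHERE [F1] Is Not Null) " & _
                 "FROM [Sheet1$] " & _
                 "WHERE ([F1] Is Not Null) AND ([F1] = " & searchKey & ");"
        Set rs = New ADODB.Recordset
        rs.Open strSql, cn
        FindCountFiltered = rs.Fields(0).Value
        rs.Close
        Set rs = Nothing
    End Function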
It would be better to group the counts and return all the records at once.
Although it doesn't matter for 20 records, writing to the Immediate Window line by line is slow.
Before writing the code below, I copied the data into an Access table and ran the Query Wizard - Count Duplicates.
PrintOccurences
Sub PrintOccurences(ByRef cn As ADODB.Connection)
    Const strSql As String = "SELECT First([Sheet1$].[F1]) AS [F1 Field], Round((Count([Sheet1$].[F1])/ (SELECT Count(*) FROM [Sheet1$] WHERE [Sheet1$].[F1])) * 100,10) AS NumberOfDups" & vbNewLine & _
                             "FROM [Sheet1$]" & vbNewLine & _
                             "GROUP BY [Sheet1$].[F1]" & vbNewLine & _
                             "HAVING (((First([Sheet1$].[F1])) Is Not Null));"
    Dim rs As ADODB.Recordset
    Dim sb As Object

    On Error GoTo CleanFail:

    Set rs = New ADODB.Recordset
    Set sb = CreateObject("System.Text.StringBuilder")

    rs.Open strSql, cn
    Do While Not rs.EOF
        sb.AppendFormat_4 "Variable {0} Occured {1} Many Times" & vbNewLine, Array(rs("F1 Field").Value, rs("NumberOfDups").Value)
        rs.MoveNext
    Loop
    Debug.Print sb.ToString

    rs.Close
    Set rs = Nothing
    Exit Sub

CleanFail:
    Debug.Print "QUERY FAILED"
End Sub
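For completeness, one way PrintOccurences could be driven, reusing EstablishConnection from the question (a hypothetical sketch; the path is the one from the original post):

    ' Hypothetical driver: a single grouped query via PrintOccurences replaces
    ' the For counter = 1 To 20 loop of twenty separate FindCount queries.
    Sub SqlQueryOnWorkSheetGrouped()
        Dim cn As ADODB.Connection
        Set cn = EstablishConnection("Z:\Test\Test1.xlsx")
        If cn.State <> 1 Then
            Debug.Print "CONNECTION COULD NOT BE MADE"
            Exit Sub
        End If
        PrintOccurences cn
        cn.Close
        Set cn = Nothing
    End Sub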
Results
Variable 1 Occurred 5.2052052052 Many Times
Variable 2 Occurred 3.9039039039 Many Times
Variable 3 Occurred 4.8048048048 Many Times
Variable 4 Occurred 5.2052052052 Many Times
Variable 5 Occurred 4.4044044044 Many Times
Variable 6 Occurred 5.3053053053 Many Times
Variable 7 Occurred 5.005005005 Many Times
Variable 8 Occurred 5.2052052052 Many Times
Variable 9 Occurred 5.5055055055 Many Times
Variable 10 Occurred 4.8048048048 Many Times
Variable 11 Occurred 5.4054054054 Many Times
Variable 12 Occurred 4.9049049049 Many Times
Variable 13 Occurred 4.6046046046 Many Times
Variable 14 Occurred 4.1041041041 Many Times
Variable 15 Occurred 6.6066066066 Many Times
Variable 16 Occurred 5.005005005 Many Times
Variable 17 Occurred 5.5055055055 Many Times
Variable 18 Occurred 4.4044044044 Many Times
Variable 19 Occurred 5.005005005 Many Times
Variable 20 Occurred 5.1051051051 Many Times
- learnAsWeGo (Aug 24, 2018 at 13:48): I am finding the %, not how many times, lol. Idk why I left it like that, just tired. I am finding what % of the time each number showed up. I explain it in the paragraph preceding the code, but I guess the "many times" threw you off (and rightfully so).
- learnAsWeGo (Aug 24, 2018 at 13:50): Your initial comments are much appreciated. It seems that I am using this as an expensive array rather than an ADO.
- learnAsWeGo (Aug 24, 2018 at 13:57): My SQL query is returning the total number of records in column F1 that are equal to searchKey and dividing that number by the total number of records in column F1. I tried to use a Where clause but had a hard time with it.
- TinMan (Aug 24, 2018 at 14:50): @learnAsWeGo I modified my code to return the percentage. I will clean up my post later.
- learnAsWeGo (Sep 25, 2018 at 3:15): Basically, if I want this to be fast, am I going to have to do the indexing / sorting / filtering myself and then run the query?
Comments on the question:
- RANDBETWEEN is volatile; it's updating every single one of the 1,048,576 results every time the cursor moves for a read...
- I don't see RANDBETWEEN in the code anywhere, so are you really testing it or are you performing a similar query to compare results? Please adjust your title to reflect what your code does.
- If the cells have RANDBETWEEN in them, you should expect the performance to be poor. RANDBETWEEN is a volatile function. That means every time you calculate the value of a cell with RANDBETWEEN in it, all of them recalculate. That means as the ADO cursor moves from cell to cell, you end up recalculating 1,048,576 cells. If you aggregate them in the query, you get a trillion cell calculations. That's why it's slow.