I am hearing different things from colleagues and from my own research. What are good performance guidelines for SELECT INTO vs INSERT INTO when creating a temp table? I know the difference is minimal for small tables.
E.g., a table with 20 columns and 50 million rows.
I've had DBAs state that INSERT INTO is faster, since the compiler/parser does not need to determine column data types on the fly. Others state that SELECT INTO is faster. We conducted performance testing, and SELECT INTO seems slightly faster.
What are good principles for figuring out which is faster, and why? I would think Microsoft would optimize to make INSERT INTO just as fast, for careful programming.
One article states the following:
SQL Server Performance of SELECT INTO vs INSERT INTO for temporary tables
The INSERT...INTO command will reuse data pages which are created in cache for insert/update/delete operations. It will also truncate the table when it is dropped. The SELECT...INTO command will create new pages for table creation similar to regular tables and will physically remove them when the temporary table is dropped.
The question is: why wouldn't Microsoft optimize INSERT INTO to be as fast as SELECT INTO?
We have over 500 stored procedures to write for a data warehouse, and need good guidelines for temp table usage.
This article does not really focus on performance and reasons:
A person in the article mentioned a good point:
that's mostly because SQL Server knows that there is no contention for the destination table. The performance for insert into #temp with(tablock) select * from .. is roughly the same as the performance for select * into #temp from
You cited two different articles that discuss two different things.
The first article compares insert..select with select into for temporary tables, and the second compares these two in general.
In general, insert..select is slower because it's a fully logged operation. select into is minimally logged in the simple and bulk-logged recovery models.
The last comment you cited is about insert into with(tablock). The with(tablock) hint can make insert into minimally logged under some additional conditions: the target should be a heap and have no indexes.
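To make the two variants concrete, here is a minimal sketch. The table and column names (dbo.SourceRows, dbo.TargetSelectInto, dbo.TargetInsertSelect) are hypothetical and only for illustration. Whether the second variant is actually minimally logged depends on the recovery model and the other conditions described in the guide, so treat it as something to verify on your own system rather than a guarantee.

```sql
-- Hypothetical source table, for illustration only.
CREATE TABLE dbo.SourceRows
(
    Id      int          NOT NULL,
    Payload varchar(100) NOT NULL
);

INSERT INTO dbo.SourceRows (Id, Payload)
VALUES (1, 'a'), (2, 'b'), (3, 'c');

-- Variant 1: SELECT ... INTO creates the target heap and loads it in one statement.
SELECT Id, Payload
INTO dbo.TargetSelectInto
FROM dbo.SourceRows;

-- Variant 2: pre-create an empty heap (no clustered or nonclustered indexes) ...
CREATE TABLE dbo.TargetInsertSelect
(
    Id      int          NOT NULL,
    Payload varchar(100) NOT NULL
);

-- ... and load it with a TABLOCK hint, which is one of the conditions
-- for minimal logging of INSERT ... SELECT into a heap.
INSERT INTO dbo.TargetInsertSelect WITH (TABLOCK) (Id, Payload)
SELECT Id, Payload
FROM dbo.SourceRows;
```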
You can find the complete guide here: The Data Loading Performance Guide
It can be summarized in this table:
Note the updates for SQL Server 2016 and later in "SQL Server 2016, Minimal logging and Impact of the Batchsize in bulk load operations" by Parik Savjani (a Senior Program Manager with the Microsoft SQL Server Tiger team). The updated table is:
(Image: updated chart for SQL Server 2016 onward)
Regarding the first article: it discusses the particular case of temporary tables.
The tempdb database is special because it's always in the simple recovery model, and because logging in tempdb is different. tempdb is recreated on every server restart, which means no crash recovery is performed for tempdb, and so logging in tempdb does not need any "after" image of modified data, only the "before" image to be able to roll back if needed. As a result, insert into..select is also minimally logged in tempdb even without the tablock hint (in the case of a heap, which is what the first article discussed).
Conclusion for "select into vs insert..select" with respect to logging:
In the case of a heap, insert into..select performs similarly to select into for temporary tables, and in general is slower when the tablock hint is not used.
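If you want to check the logging claim on your own server instead of taking it on faith, one possible approach is to compare the log bytes each statement generates in tempdb. The sketch below is an assumption on my part, not something from the cited articles: it reads database_transaction_log_bytes_used from sys.dm_tran_database_transactions inside an explicit transaction, which requires VIEW SERVER STATE permission. All temp table names are made up.

```sql
-- Hypothetical source data: a temp heap with some rows to copy.
CREATE TABLE #Source (Id int NOT NULL, Payload char(200) NOT NULL);

INSERT INTO #Source (Id, Payload)
SELECT TOP (100000)
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
       'x'
FROM sys.all_columns AS a
CROSS JOIN sys.all_columns AS b;

-- Variant 1: SELECT ... INTO, measured inside an explicit transaction.
BEGIN TRANSACTION;

SELECT Id, Payload
INTO #ViaSelectInto
FROM #Source;

SELECT 'select into' AS variant,
       dt.database_transaction_log_bytes_used
FROM sys.dm_tran_session_transactions AS st
JOIN sys.dm_tran_database_transactions AS dt
  ON dt.transaction_id = st.transaction_id
WHERE st.session_id = @@SPID
  AND dt.database_id = DB_ID(N'tempdb');

ROLLBACK TRANSACTION;

-- Variant 2: INSERT ... SELECT into a pre-created heap, no TABLOCK hint.
CREATE TABLE #ViaInsertSelect (Id int NOT NULL, Payload char(200) NOT NULL);

BEGIN TRANSACTION;

INSERT INTO #ViaInsertSelect (Id, Payload)
SELECT Id, Payload
FROM #Source;

SELECT 'insert select' AS variant,
       dt.database_transaction_log_bytes_used
FROM sys.dm_tran_session_transactions AS st
JOIN sys.dm_tran_database_transactions AS dt
  ON dt.transaction_id = st.transaction_id
WHERE st.session_id = @@SPID
  AND dt.database_id = DB_ID(N'tempdb');

ROLLBACK TRANSACTION;

DROP TABLE #Source, #ViaInsertSelect;
```

Comparing the two numbers for the same row count gives a rough idea of whether both statements are logged similarly on your version.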
The second aspect is the possibility of parallel execution.
select into can be executed in parallel starting with SQL Server 2014, and parallel insert...select was first implemented in SQL Server 2016.
I could not reproduce any performance difference between select into and insert into..select for temporary tables on SQL Server 2012, where everything executed serially.
What are good principles for figuring out which is faster, and why? I would think Microsoft would optimize to make INSERT INTO just as fast, for careful programming.
The principles that I try to follow when analyzing something like this question are:
- Avoid making unnecessary assumptions.
- Read the official documentation.
- Test the workload. The amount of testing depends on how fast I need the code to be.
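For the "test the workload" step, a minimal timing harness could look like the sketch below. It is only an illustration: the source query over sys.all_columns is a stand-in for your real 50-million-row source, the temp table names are made up, and a serious test would repeat each variant several times and look at the execution plans as well.

```sql
-- Report CPU and elapsed time per statement in the Messages output.
SET STATISTICS TIME ON;

-- Variant 1: SELECT ... INTO
SELECT a.object_id, a.column_id, a.name
INTO #T1
FROM sys.all_columns AS a
CROSS JOIN (SELECT TOP (50) 1 AS n FROM sys.all_columns) AS m;

-- Variant 2: INSERT ... SELECT into a pre-created heap with matching types
CREATE TABLE #T2
(
    object_id int     NOT NULL,
    column_id int     NOT NULL,
    name      sysname NULL
);

INSERT INTO #T2 (object_id, column_id, name)
SELECT a.object_id, a.column_id, a.name
FROM sys.all_columns AS a
CROSS JOIN (SELECT TOP (50) 1 AS n FROM sys.all_columns) AS m;

SET STATISTICS TIME OFF;

DROP TABLE #T1, #T2;
```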
I'm aware of two pieces of documentation that address your question. The first one is a blog post saying that SELECT INTO for temp tables has different behavior for eager writes as of SQL Server 2014. That is by design. So I don't think that it's correct to say that the difference is minimal for small tables. If anything, the optimization described in the blog post seems designed for smaller tables:
The change in SQL Server 2014 is to relax the need to flush these pages, as quickly, to the TEMPDB data files. When doing a select into ... #tmp ... or create index WITH SORT IN TEMPDB the SQL Server now recognizes this may be a short lived operation. The pages associated with such an operation may be created, loaded, queried and released in a very small window of time.
For example: You could have a stored procedure that runs in 8ms. In that stored procedure you select into ... #tmp ... then use the #tmp and drop it as the stored procedure completes.
Prior to the SQL Server 2014 change the select into may have written all the pages accumulated to disk. The SQL Server 2014, eager write behavior, no longer forces these pages to disk as quickly as previous versions. This behavior allows the pages to be stored in RAM (buffer pool), queried and the table dropped (removed from buffer pool and returned to free list) without ever going to disk as long memory is available. By avoiding the physical I/O when possible the performance of the TEMPDB, bulk operation is significantly increased and it reduces the impact on the I/O path resources as well.
The second piece of documentation explains that parallel insert into temp tables with INSERT INTO ... SELECT is available without a TABLOCK hint in SQL Server 2016 but requires a TABLOCK hint as of SP1 and in later versions:
The issue is first fixed in SQL Server 2016 Service Pack 1 . After you apply SQL Server 2016 SP1, Parallel INSERTs in INSERT..SELECT to local temporary tables is disabled by default which reduces contention on PFS page and improves the overall performance for concurrent workload. If parallel INSERTs to local temporary tables is desired, users should use TABLOCK hint while inserting into local temporary table.
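As a sketch of what that looks like in practice (the #Target table and the source query are hypothetical stand-ins), the hint goes on the temp table being inserted into. Whether the optimizer actually picks a parallel insert still depends on the estimated cost, the MAXDOP settings and the shape of the source query, so check the actual execution plan rather than assuming it.

```sql
-- Check which build you are on; the default behavior changed in SQL Server 2016 SP1.
SELECT SERVERPROPERTY('ProductVersion') AS product_version,
       SERVERPROPERTY('ProductLevel')   AS product_level;

-- Hypothetical target: a local temp heap.
CREATE TABLE #Target
(
    object_id int NOT NULL,
    column_id int NOT NULL
);

-- TABLOCK on the local temp table makes the insert eligible for parallelism
-- on SQL Server 2016 SP1 and later.
INSERT INTO #Target WITH (TABLOCK) (object_id, column_id)
SELECT a.object_id, a.column_id
FROM sys.all_columns AS a
CROSS JOIN (SELECT TOP (100) 1 AS n FROM sys.all_columns) AS m;

DROP TABLE #Target;
```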
Going back to your original statement, you can't logically deduce which of the two will be faster. What is faster depends on how Microsoft designed the software and the characteristics of your workload. Making guesses about the amount of time needed to create the column definitions just isn't helpful. Testing is helpful. If your testing suggests that SELECT INTO is faster, then go with that. For what it's worth, I also work on data warehouse loading with a close eye on performance and I haven't seen the difference between the two approaches be anything that's worth worrying about.
Every value SQL Server deals with has a datatype. The results from a SELECT are all typed. So a SELECT..INTO does not have to deduce datatypes on the fly - they are defined by the SELECT.
In contrast, with INSERT..SELECT the source and destination columns may be of different, but compatible, types. There will then be an implicit type coercion, which will consume CPU cycles. Whether the difference in execution times can be measured I couldn't say.
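A small illustration of the difference, using made-up temp tables: SELECT..INTO copies the column type from the SELECT, while INSERT..SELECT into a table declared with a different but compatible type forces an implicit conversion on every row.

```sql
-- Hypothetical source with an int column.
CREATE TABLE #Src (Amount int NOT NULL);
INSERT INTO #Src (Amount) VALUES (1), (2), (3);

-- SELECT..INTO: the new table inherits the int type from the SELECT list.
SELECT Amount
INTO #A
FROM #Src;

-- INSERT..SELECT: the destination was declared as decimal(18, 2),
-- so each int value is implicitly converted on the way in.
CREATE TABLE #B (Amount decimal(18, 2) NOT NULL);

INSERT INTO #B (Amount)
SELECT Amount
FROM #Src;

-- Compare the resulting column types of the two temp tables.
SELECT CASE c.object_id
           WHEN OBJECT_ID(N'tempdb..#A') THEN '#A'
           ELSE '#B'
       END       AS table_name,
       c.name    AS column_name,
       t.name    AS type_name,
       c.precision,
       c.scale
FROM tempdb.sys.columns AS c
JOIN tempdb.sys.types   AS t
  ON t.user_type_id = c.user_type_id
WHERE c.object_id IN (OBJECT_ID(N'tempdb..#A'), OBJECT_ID(N'tempdb..#B'));

DROP TABLE #Src, #A, #B;
```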
I can answer in terms of the performance advantage for temp tables. When you use insert into #table select * from table1, since you create #table beforehand, you can also create indexes, keys or constraints inline with the #table definition and take advantage of caching that #table. The #table can then be reused from cache if the stored procedure executes again. Caching temp tables helps reduce metadata contention, a.k.a. pagelatch_ex waits on system objects.
But when you do select * into #table from table1 and then need to create indexes or keys on #table, you have to run DDL against the temp table after it is created, which prevents the temp table from being cached. When the stored procedure executes again, there is no table metadata in cache for the SP to use, and the temp table has to be created again. This generates more pagelatch_ex waits, which can cause metadata contention on tempdb if there are too many concurrent operations.
Temp table caching does not store the name of the temp table, only the metadata, and is very helpful when SPs using #temp tables execute many times during the day.
Note: #temp tables can be cached only if they are created inside a stored procedure. So this use case helps only for temp tables in SPs.
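To show the two patterns side by side, here is a rough sketch with hypothetical procedure and table names. The first procedure declares the key and index inline, so no DDL runs against #Work after it is created. The second uses select into followed by CREATE INDEX, which is the pattern described above as not cacheable. The inline INDEX syntax assumes SQL Server 2014 or later, and GO is the batch separator used by SSMS and sqlcmd.

```sql
-- Pattern 1: table, key and index are all defined in one CREATE TABLE statement.
CREATE PROCEDURE dbo.LoadWithPredefinedTemp
AS
BEGIN
    SET NOCOUNT ON;

    CREATE TABLE #Work
    (
        Id   int     NOT NULL PRIMARY KEY,            -- key declared inline, left unnamed
        Name sysname NOT NULL,
        INDEX IX_Work_Name NONCLUSTERED (Name)        -- inline index (SQL Server 2014+)
    );

    INSERT INTO #Work (Id, Name)
    SELECT object_id, name
    FROM sys.objects;
END;
GO

-- Pattern 2: SELECT ... INTO, then DDL against the temp table after creation.
CREATE PROCEDURE dbo.LoadWithSelectInto
AS
BEGIN
    SET NOCOUNT ON;

    SELECT object_id AS Id, name AS Name
    INTO #Work
    FROM sys.objects;

    -- Index added after the table already exists.
    CREATE NONCLUSTERED INDEX IX_Work_Name ON #Work (Name);
END;
GO
```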