I'm looking for any possible improvements that could be made to run the MySQL piece of code faster.
I created a simple test WinForms application that creates two Docker databases:
Once the instances are created, it creates a similar table on both. I then use a stored procedure to insert Categories as fast as possible into both setups.
Here's the definition of a category:
public class Category
{
public int Id { get; set; }
[System.ComponentModel.DataAnnotations.StringLength(75)]
public string CategoryName { get; set; }
[System.ComponentModel.DataAnnotations.StringLength(300)]
public string Description { get; set; }
public DateTime CreationTime { get; set; }
}
The test results are the following:
100k items
MySql Inserted 100000 items in 2955ms
MySql Inserted 100000 items in 2801ms
MySql Inserted 100000 items in 2706ms
MySql Inserted 100000 items in 2512ms
MySql Inserted 100000 items in 2850ms
SqlServer Inserted 100000 items in 1004ms
SqlServer Inserted 100000 items in 902ms
SqlServer Inserted 100000 items in 858ms
SqlServer Inserted 100000 items in 1421ms
SqlServer Inserted 100000 items in 905ms
600k items
MySql Inserted 600000 items in 21849ms
MySql Inserted 600000 items in 17089ms
MySql Inserted 600000 items in 16776ms
SqlServer Inserted 600000 items in 5677ms
SqlServer Inserted 600000 items in 4635ms
SqlServer Inserted 600000 items in 5474ms
Here is the setup for MySql
MySQL stored procedure:
USE `BenchmarkDb`;
DROP procedure IF EXISTS `BenchmarkDb`.`CategoriesInsertWithoutId`;
DELIMITER $$
USE `BenchmarkDb`$$
CREATE DEFINER=`root`@`%` PROCEDURE `CategoriesInsertWithoutId`(IN JsonPayload LONGTEXT)
BEGIN
insert into BenchmarkDb.Categories
(Category,
Description)
SELECT tt.CategoryName,tt.Description
FROM
JSON_TABLE(
JsonPayload
,"$[*]"
COLUMNS(
Id int PATH "$.Id",
CategoryName VARCHAR(75) PATH "$.CategoryName",
Description VARCHAR(300) PATH "$.Description",
CreationTime DateTime PATH "$.CreationTime"
)
) AS tt;
END$$
DELIMITER ;
This uses the latest 8.0 MySQL driver for .NET (from NuGet). The code sends one large JSON string containing all the data; the stored procedure then turns it into a table and inserts from that.
MySQL C# code:
Stopwatch stopwatch = new Stopwatch();
string JsonPayload = JsonConvert.SerializeObject(
TestingDataHelpers.GenerateTestingCategories(100000)
,new IsoDateTimeConverter() { DateTimeFormat= "yyyy-MM-dd HH:mm:ss" });
stopwatch.Start();
var parameters=new List<MySqlParameter>()
{
new MySqlParameter()
{
MySqlDbType=MySqlDbType.LongText,
ParameterName="JsonPayload",
Value=JsonPayload
}
};
DataSet ResultsDataset = new DataSet();
using (var connection = new MySqlConnection("Server=localhost;Uid=root;Pwd=password1234;"))
{
using (var command = connection.CreateCommand())
{
command.CommandText = "BenchmarkDb.CategoriesInsertWithoutId";
command.CommandType = CommandType.StoredProcedure;
if (parameters != null && parameters.Count() > 0)
{
foreach (var parameter in parameters)
{
command.Parameters.Add(parameter);
}
}
using (var dataAdapter = new MySqlDataAdapter(command))
{
dataAdapter.Fill(ResultsDataset);
}
}
}
stopwatch.Stop();
Here is the equivalent SQL Server code, which uses a DataTable sent as a structured table-valued parameter to the stored procedure.
SQL Server stored procedure:
IF EXISTS ( SELECT *
FROM sys.objects
WHERE object_id = OBJECT_ID(N'CategoriesInsertWithoutId')
AND type IN ( N'P', N'PC' ) )
DROP PROCEDURE [dbo].[CategoriesInsertWithoutId]
IF type_id('[dbo].[CategoryType]') IS NOT NULL
DROP TYPE [dbo].[CategoryType];
CREATE TYPE CategoryType AS TABLE
( Id int,
CategoryName nvarchar(75),
Description nvarchar(300),
CreationTime DateTime);
GO
CREATE OR ALTER PROCEDURE [dbo].[CategoriesInsertWithoutId]
@CategoriesToInsert CategoryType READONLY
AS
BEGIN
SET NOCOUNT ON;
insert into [dbo].[Categories] (Category,Description)
select c.CategoryName,c.Description from @CategoriesToInsert c
END
SQL Server C# code:
Stopwatch stopwatch = new Stopwatch();
var categories = SqlManagerHelpers.ToDataTable(TestingDataHelpers.GenerateTestingCategories(100000));
stopwatch.Start();
var parameters=new List<SqlParameter>()
{
new SqlParameter()
{
SqlDbType=SqlDbType.Structured,
ParameterName="@CategoriesToInsert",
Value=categories
}
};
DataSet ResultsDataset = new DataSet();
using (var connection = new SqlConnection("Data Source=.;User Id=sa;password=password1234;"))
{
using (var command = connection.CreateCommand())
{
command.CommandText = "dbo.CategoriesInsertWithoutId";
command.CommandType = CommandType.StoredProcedure;
if (parameters != null && parameters.Count() > 0)
{
foreach (var parameter in parameters)
{
command.Parameters.Add(parameter);
}
}
using (var dataAdapter = new SqlDataAdapter(command))
{
dataAdapter.Fill(ResultsDataset);
}
}
}
stopwatch.Stop();
Here are some extra helper classes that are referenced above.
static class TestingDataHelpers
{
static Random rnd = new Random();
public static List<Category> GenerateTestingCategories(int NumberOfEntriesToMake)
{
List<Category> categories=new List<Category>();
for(int i = 0; i < NumberOfEntriesToMake; i++)
{
categories.Add(new Category()
{
Id = i,
CategoryName=CategoryNames[rnd.Next(CategoryNames.Count)],
Description=CategoryDescriptions[rnd.Next(CategoryDescriptions.Count)],
CreationTime=DateTime.Now
});
}
return categories;
}
#region CategoryNames
private static List<string> CategoryNames = new List<string>()
{
"Redacted data is redacted.. enjoy some redacted data",
"Redacted data is redacted.. enjoy some redacted data",
"Redacted data is redacted.. enjoy some redacted data",
};
#endregion
#region CategoryDescriptions
private static List<string> CategoryDescriptions = new List<string>()
{
"Redacted data is redacted.. enjoy some redacted data",
"Redacted data is redacted.. enjoy some redacted data",
"Redacted data is redacted.. enjoy some redacted data",
};
#endregion
}
static class SqlManagerHelpers
{
public static DataTable ToDataTable<T>(this IList<T> data)
{
var props = typeof(T).GetProperties().Where(pi => pi.GetCustomAttributes(typeof(SkipPropertyAttribute), true).Length == 0).ToList();
DataTable table = new DataTable();
for(int i =0;i<props.Count;i++)
{
var prop = props[i];
table.Columns.Add(prop.Name, Nullable.GetUnderlyingType(prop.PropertyType) ?? prop.PropertyType);
StringLengthAttribute stringLengthAttribute= prop.GetCustomAttributes(typeof(StringLengthAttribute), false).Cast<StringLengthAttribute>().SingleOrDefault();
if (stringLengthAttribute != null)
{
table.Columns[i].MaxLength = stringLengthAttribute.MaximumLength;
}
}
foreach (T item in data)
{
DataRow row = table.NewRow();
foreach (var prop in props)
row[prop.Name] = prop.GetValue(item) ?? DBNull.Value;
table.Rows.Add(row);
}
return table;
}
public class SkipPropertyAttribute : Attribute
{
}
}
Here are the required database schemas
MySql database and table definition
CREATE DATABASE `BenchmarkDb`;
CREATE TABLE `BenchmarkDb`.`Categories` (
`Id` INT NOT NULL AUTO_INCREMENT,
`Category` VARCHAR(75) NULL,
`Description` VARCHAR(300) NULL,
`CreationTime` DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY (`Id`)) ENGINE=InnoDB;
SqlServer database and table definition
CREATE TABLE [BenchmarkDb].[dbo].[Categories] (
[Id] [int] IDENTITY(1,1) NOT NULL,
[Category] [nvarchar](75) NULL,
[Description] [nvarchar](300) NULL,
[CreationTime] DATETIME NOT NULL DEFAULT GETDATE(),
CONSTRAINT [PK_History] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
) ON [PRIMARY]
Comments:
- paparazzo (May 2, 2018 at 14:22): This works? That last MySqlDataAdapter(command) is on a SqlConnection.
- paparazzo (May 2, 2018 at 14:27): It is supposed to be working code only here. ResultDataset is not defined.
- paparazzo (May 2, 2018 at 14:36): See my answer to this question: stackoverflow.com/questions/12467431/…
- paparazzo (May 2, 2018 at 14:40): This seems more complex than it needs to be (docs.microsoft.com/en-us/sql/relational-databases/tables/…) and it still has the last MySqlDataAdapter(command) on a SqlConnection. You need to make this into working code soon or it will get closed.
- A_V (May 2, 2018 at 14:46): The code is okay now. The goal is to be able to handle a large table-valued parameter in the stored procedure. I want to send data in both directions, from MySQL to SQL Server and the opposite.
1 Answer
I don't know anything about MySQL, so I'm ignoring that part of your question.
For SQL Server, if you're trying to make an insert go faster you're going to want to:
- Do it in bulk
- Do it in parallel
- Make it minimally logged
- Do it in batches
There are some things you can do that will handle all of this for you, which I list below; otherwise you'll have to write something yourself.
Bulk operations
Bulk operations ultimately boil down to trying to do as much work as possible in a single operation, in a way that doesn't tank performance (transaction logging is the most common thing this helps with, but there are some more). The documentation mentions a few benefits; the main ones I'm highlighting here are:
- Minimal logging
- Better locking (BU locks)
- Batching
- Optional triggers/constraints
If you were inserting directly from a file, then BULK INSERT is your friend. It will handle pretty much all of the considerations above for you (besides parallelism, which is outside of BULK INSERT's control).
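For illustration only, here is a minimal sketch of firing a BULK INSERT from C#. The file path, terminators, batch size, and the BenchmarkDb catalog in the connection string are all assumptions, and the CSV layout would have to line up with the table's columns (or a format file used to cover the IDENTITY/default columns):

// Hedged sketch: assumes the rows were first exported to a CSV file that the
// SQL Server instance can read. Path, terminators, and batch size are hypothetical.
using (var connection = new SqlConnection("Data Source=.;Initial Catalog=BenchmarkDb;User Id=sa;Password=password1234;"))
using (var command = connection.CreateCommand())
{
    command.CommandText =
        @"BULK INSERT dbo.Categories
          FROM 'C:\temp\categories.csv'   -- hypothetical path on the SQL Server host
          WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK, BATCHSIZE = 100000);";
    connection.Open();
    command.ExecuteNonQuery();
}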
Inserting from C#, however, would be better suited to SqlBulkCopy. This lets you perform bulk insert operations into a table, and it can be configured to ignore constraints, triggers, identity columns, etc.
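A rough sketch of what that could look like here, reusing the DataTable you already build for the TVP; the Initial Catalog, the TableLock option, and the batch size are assumptions rather than tested values:

var table = SqlManagerHelpers.ToDataTable(TestingDataHelpers.GenerateTestingCategories(100000));

using (var connection = new SqlConnection("Data Source=.;Initial Catalog=BenchmarkDb;User Id=sa;Password=password1234;"))
{
    connection.Open();
    // TableLock asks for a table-level lock for the duration of the copy
    // (a BU lock when the destination is a heap).
    using (var bulkCopy = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, null))
    {
        bulkCopy.DestinationTableName = "dbo.Categories";
        bulkCopy.BatchSize = 10000;   // optional; 0 sends everything in one batch
        // Only map the columns we actually want to send; Id stays an IDENTITY
        // value and CreationTime falls back to its DEFAULT, as in the TVP insert.
        bulkCopy.ColumnMappings.Add("CategoryName", "Category");
        bulkCopy.ColumnMappings.Add("Description", "Description");
        bulkCopy.WriteToServer(table);
    }
}

The explicit mappings matter because the DataTable column is called CategoryName while the destination column is Category.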
Parallel Inserts
Parallel inserts are what allow SQL to insert multiple rows into a table at once instead of doing row-by-row operations. This generally requires a few things:
- A heap
- No IDENTITY columns
- The right kind of lock on the table
If you don't have a heap (e.g. there are indices), then the index maintenance prohibits the parallel insert, and it will do the insert serially. For large-scale ETL workloads, this is a good use-case for a "staging" database/table that has none of these things and as such can get the best performing insert. Brent Ozar has a good post that touches on this a bit as well.
IDENTITY columns also prevent parallel inserts, as maintaining the order of the inserts is required for it to work correctly.
If you don't have the right locks on the table (BU locks work, as does the TABLOCK(X) hint) then SQL Server has to consider that another session could be modifying the table as well, which also prevents parallelism.
If you are able to meet all of these requirements, however, then your operations (whether using built-in bulk operations as above, or rolling your own as below) will be able to run faster by taking advantage of the additional cores SQL Server has.
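One client-side way to take advantage of that (a hedged sketch, not from the answer above: the Categories_Staging heap, chunk size, and connection string are all made up for illustration) is to bulk copy several chunks into a heap staging table at the same time; each copy requests a table lock, and BU locks are compatible with one another, so the loads don't block each other:

// Hypothetical staging heap, e.g.:
//   CREATE TABLE dbo.Categories_Staging
//     (Category nvarchar(75) NULL, Description nvarchar(300) NULL);
// No IDENTITY column and no indexes, per the requirements above.
var allRows = TestingDataHelpers.GenerateTestingCategories(600000);
int chunkSize = 100000;   // arbitrary illustration value

var chunks = allRows
    .Select((row, index) => new { row, index })
    .GroupBy(x => x.index / chunkSize, x => x.row)
    .Select(g => g.ToList())
    .ToList();

Parallel.ForEach(chunks, chunk =>
{
    var table = SqlManagerHelpers.ToDataTable(chunk);
    using (var connection = new SqlConnection("Data Source=.;Initial Catalog=BenchmarkDb;User Id=sa;Password=password1234;"))
    {
        connection.Open();
        // TableLock on a heap takes a BU lock, so these concurrent copies can overlap.
        using (var bulkCopy = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, null))
        {
            bulkCopy.DestinationTableName = "dbo.Categories_Staging";
            bulkCopy.ColumnMappings.Add("CategoryName", "Category");
            bulkCopy.ColumnMappings.Add("Description", "Description");
            bulkCopy.WriteToServer(table);
        }
    }
});

A final INSERT ... SELECT (or a table swap) would then move the staged rows into dbo.Categories.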
Minimal logging
Minimal logging is the method by which you prevent the transaction log from overflowing. Some operations can be rolled back more easily than others or require less transaction log space than others. Maintaining the transaction log isn't free/cheap, so reducing how much of it is necessary also helps performance. In general, if you follow the rules you should be able to achieve minimal logging. At a high level, the following are minimally logged:
- An insert into a heap (i.e. a table without a clustered index) that has no non-clustered indexes, using a TABLOCK hint, having a high enough cardinality estimate (> ~1000 rows)
- An insert into a table with a clustered index that has no non-clustered indexes, without TABLOCK, having a high enough cardinality estimate (> ~1000 rows)
- Adding an index to a table, even if that table already has data.
Batches
Lastly, you can break a large chunk of work up into smaller batches. This can be useful if transaction logging is the main concern as each batch becomes its own transaction.
This is tricky to implement generically, and can be unpleasant to do yourself. Another major concern is with correctness of the data; if another user hits the database while you're not done with your batches, then they may get inconsistent results. This is a good use-case for doing the work in another table, then swapping tables out, as well as for snapshot isolation.
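If SqlBulkCopy is already in play, its BatchSize together with UseInternalTransaction gives you batching without writing the loop yourself; a hedged sketch (the batch size, catalog name, and options are illustrative, not recommendations):

var table = SqlManagerHelpers.ToDataTable(TestingDataHelpers.GenerateTestingCategories(600000));

using (var connection = new SqlConnection("Data Source=.;Initial Catalog=BenchmarkDb;User Id=sa;Password=password1234;"))
{
    connection.Open();
    using (var bulkCopy = new SqlBulkCopy(connection,
        SqlBulkCopyOptions.TableLock | SqlBulkCopyOptions.UseInternalTransaction, null))
    {
        bulkCopy.DestinationTableName = "dbo.Categories";
        bulkCopy.BatchSize = 50000;     // each 50k-row batch commits as its own transaction
        bulkCopy.BulkCopyTimeout = 0;   // no timeout for a long-running load
        bulkCopy.ColumnMappings.Add("CategoryName", "Category");
        bulkCopy.ColumnMappings.Add("Description", "Description");
        bulkCopy.WriteToServer(table);
    }
}

Note the trade-off described above: because each batch commits on its own, a failure part-way through leaves the earlier batches in the table, and other readers can see them before the load finishes, which is exactly when the staging-table-and-swap approach earns its keep.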
Comment:
- A_V (Aug 28, 2019 at 0:22): Thanks for the answer! I'll get a read through these links. I've never used SqlBulkCopy, so we'll see how that goes.