Friday, December 5, 2008
shortstat
i ran some simple stats on mysql releases
*** mysql-4.1..mysql-5.0
commits: 7790
diffstat: 4424 files changed, 1271855 insertions(+), 199555 deletions(-)
/sql 255 files changed, 129725 insertions(+), 41964 deletions(-)
/test 1509 files changed, 954053 insertions(+), 15940 deletions(-)
*** mysql-5.0..mysql-5.1
commits: 6411
diffstat: 10244 files changed, 2172077 insertions(+), 1349098 deletions(-)
/sql 243 files changed, 151483 insertions(+), 69799 deletions(-)
/test 3862 files changed, 1258333 insertions(+), 206729 deletions(-)
*** mysql-5.1..mysql-6.0
commits: 3546
diffstat: 3574 files changed, 679669 insertions(+), 82131 deletions(-)
/sql 226 files changed, 63619 insertions(+), 16469 deletions(-)
/test 1772 files changed, 292884 insertions(+), 33553 deletions(-)
/storage 1184 files changed, 281298 insertions(+), 9179 deletions(-)
*** mysql-5.1..mysql-5.1-telco-6.2
commits: 761
diffstat: 841 files changed, 75105 insertions(+), 37063 deletions(-)
/sql 72 files changed, 8673 insertions(+), 5544 deletions(-)
/test 326 files changed, 12269 insertions(+), 18376 deletions(-)
/storage 396 files changed, 52983 insertions(+), 12434 deletions(-)
*** mysql-5.1-telco-6.2..mysql-5.1-telco-6.3
commits: 347
diffstat: 455 files changed, 27372 insertions(+), 10033 deletions(-)
/sql 39 files changed, 3471 insertions(+), 740 deletions(-)
/test 215 files changed, 8735 insertions(+), 2251 deletions(-)
/storage 182 files changed, 14990 insertions(+), 7031 deletions(-)
*** mysql-5.1-telco-6.3..mysql-5.1-telco-6.4
commits: 582
diffstat: 733 files changed, 73161 insertions(+), 30912 deletions(-)
/sql 12 files changed, 472 insertions(+), 250 deletions(-)
/test 48 files changed, 1151 insertions(+), 218 deletions(-)
/storage 622 files changed, 70267 insertions(+), 30010 deletions(-)
--- conclusions
none
--- how to get git copy of mysql repository
git-clone git://ndb.mysql.com/mysql.git
--- script that produces stats
#!/bin/sh
# release ranges to compare
R="mysql-4.1..mysql-5.0 mysql-5.0..mysql-5.1 mysql-5.1..mysql-6.0 mysql-5.1..mysql-5.1-telco-6.2 mysql-5.1-telco-6.2..mysql-5.1-telco-6.3 mysql-5.1-telco-6.3..mysql-5.1-telco-6.4"
for i in $R
do
echo "*** $i"
# count non-merge commits in the range
echo "commits: `git-log --no-merges $i | grep Author | wc -l`"
echo "diffstat: `git-diff --shortstat $i`"
echo " /sql `git-diff --shortstat $i -- sql/`"
echo " /test `git-diff --shortstat $i -- mysql-test/`"
# skip the storage/ stats for ranges involving mysql-5.0
if [ -z "`echo $i | grep mysql-5.0`" ]
then
echo " /storage `git-diff --shortstat $i -- storage/`"
fi
echo
done
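for reference, this is how the script could be run, assuming it is saved as shortstat.sh inside the cloned tree (the filename and the redirect are just for illustration, not part of the original setup):

# clone the tree, enter it and run the script, collecting the stats in a file
git-clone git://ndb.mysql.com/mysql.git
cd mysql
sh shortstat.sh > shortstat.txt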
Tuesday, November 25, 2008
950k reads per second on 1 datanode
i spent last night adding 75% of the next step for our multi-threaded datanode.
and got new numbers...
the config is the same as earlier post, with the exception that
MaxNoOfExecutionThreads=8
flexAsynch -ndbrecord -temp -con 4 -t 16 -p 312 -a 2 -l 3 -r 2
insert average: 461584/s min: 451928/s max: 474254/s stddev: 2%
update average: 533083/s min: 530950/s max: 537351/s stddev: 0%
delete average: 564388/s min: 559265/s max: 567143/s stddev: 0%
read average: 948954/s min: 937288/s max: 959262/s stddev: 0%
also tried using SCI instead of gigabit ethernet
flexAsynch -ndbrecord -temp -con 4 -t 16 -p 256 -a 2 -l 3 -r 2
insert average: 568012/s min: 550389/s max: 578367/s stddev: 2%
update average: 599828/s min: 598480/s max: 602175/s stddev: 0%
delete average: 614036/s min: 612440/s max: 616496/s stddev: 0%
read average: 1012472/s min: 1003429/s max: 1024000/s stddev: 0%
i.e. with SCI the 1M reads/sec limit is reached! (on 1 datanode)
i think this should also be achievable on ethernet by adding some
more optimizations (letting the api-application start transactions directly
on the correct TC-thread)
---
comments:
1) the new "feature" is multi-threading the transaction coordinator
aka MT-TC
2) this part will likely not make the mysql cluster 6.4.0-release
3) our multi-threading architecture seems promising,
in less than a month i managed to double the throughput
(in an admittedly unrealistic benchmark, but still)
4) the 25% missing from the current patch is node-failure handling
and an "rcu-like" lock which will be used for reading/updating distribution
(it's read for each operation, and updated during node-failure, node-recovery and
online table repartitioning)
Wednesday, November 5, 2008
700k reads per second on 1 datanode
added multi connect to flexAsynch, got new numbers
everything else same as previous post
[jonas@n1 run]$ flexAsynch -ndbrecord -temp -con 2 -t 16 -p 512 -l 3 -a 2 -r 2
insert average: 360679/s min: 346150/s max: 370075/s stddev: 2%
update average: 373349/s min: 372465/s max: 374132/s stddev: 0%
delete average: 371014/s min: 357043/s max: 378523/s stddev: 2%
read average: 731042/s min: 702211/s max: 760631/s stddev: 2%
Monday, November 3, 2008
500k reads per second on 1 datanode
just did some benchmarking on multi-threaded ndbd (binary called ndbmtd)
that is in the coming 6.4 release.
quite happy with results
--- results
[jonas@n1 run]$ flexAsynch -ndbrecord -temp -t 8 -p 512 -r 5 -a 2
insert average: 374200/s min: 374200/s max: 374200/s stddev: 0%
update average: 370947/s min: 370947/s max: 370947/s stddev: 0%
delete average: 395061/s min: 395061/s max: 395061/s stddev: 0%
read average: 537178/s min: 531948/s max: 543092/s stddev: 0%
---
this flexAsynch command will run with
- 8 threads
- 512 parallel transactions per thread
- 8 byte records.
note: during the reads, the datanode was *not* maxed out.
---
this was run on two identical computers,
2-socket, 4 cores per socket Intel(R) Xeon(R) CPU X5355 @ 2.66GHz
api-program was running on computer 1 (n1)
datanode was running on computer 2 (n2)
--- configuration
[cluster_config]
DataMemory=2000M
IndexMemory=150M
SendBufferMemory=8M
ReceiveBufferMemory=8M
LongMessageBuffer=64M
NoOfReplicas=1
ndb_mgmd=n1
ndbd=n2
mysqld=n1,n1,n1,n1
Diskless=1
MaxNoOfExecutionThreads=6
MaxNoOfConcurrentTransactions=16384
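for reference, a rough sketch of how such a run could be driven, assuming the config above has been turned into a working cluster configuration file and that flexAsynch (from the ndb test suite) is on the PATH; host names follow the config above, the config file path is made up:

# start the management server on n1 (config file location is hypothetical)
ndb_mgmd -f /path/to/config.ini

# start the multi-threaded data node on n2, pointing it at the management server
ndbmtd -c n1

# run the benchmark from n1, same flags as in the post above
# (-t 8 threads, -p 512 parallel transactions per thread, -a 2 gives the 8 byte records)
export NDB_CONNECTSTRING=n1
flexAsynch -ndbrecord -temp -t 8 -p 512 -r 5 -a 2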
Thursday, October 16, 2008
forks, add-on patch-sets and features
so far little is happening in this area with MySQL Cluster.
would be interesting to get patches to cluster from a(ny) (huge-web) company...
wonder if that will ever happen...
maybe we don't use enough buzz-words
---
it could also be that we add features at a high enough pace ourselves,
preliminary benchmarks of our multi-threaded ndbmtd (4 threads)
show up to 3.7 times better throughput than single-threaded ndbd.
Tuesday, September 2, 2008
end of think-period
today, I think I finally cracked how to create(drop) a nodegroup.
basic concept is to
- temporarily block gcp
- create(drop) the node group
- unblock gcp
(the same concept is btw used for adding a starting node to gcp)
the block should last for microseconds
now it's only a matter of implementing it...
---
very happy that I now know how to proceed,
I've spent quite a lot of time trying to figure out a 100% safe
way of doing it...(wo/ blocking gcp)
but this solution will be efficient and fairly easy to implement.
(if any protocol dealing with (multi)node-failures can be considered easy)
Saturday, August 30, 2008
status of create/drop node(group)
status:
create/drop nodegroup now works with one notable exception:
replication can't be connected while the nodegroup is being added.
i'll try to find time to fix this next week.
howto:
- start a 2 node cluster
- create table T1
- stop ndb_mgmd, add 2 nodes, start ndb_mgmd
- either stop the 2 running nodes and restart all 4
or rolling restart the 2 running nodes, and then start the 2 new nodes
- ndb_mgm> create nodegroup n1,n2
- alter online table T1 add partition partitions 2
Tata! fully online scaling of the cluster
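spelled out as shell commands, the forward howto could look roughly like this (a sketch only: node ids 2,3 for the old data nodes, 4,5 for the new ones, the host name mgmhost, config path and table definition are made up, and how you edit the config depends on your setup):

# with the original 2-node cluster running, create the table
mysql -e "CREATE TABLE T1 (a INT PRIMARY KEY, b INT) ENGINE=NDBCLUSTER" test

# stop ndb_mgmd, add two [ndbd] sections for the new nodes to the config,
# then start ndb_mgmd again
ndb_mgmd -f /path/to/config.ini

# rolling restart of the two old data nodes (waiting for each to rejoin),
# then start the two new ones
ndb_mgm -e "2 RESTART"
ndb_mgm -e "3 RESTART"
ndbd -c mgmhost     # new data node, id 4
ndbd -c mgmhost     # new data node, id 5

# put the new nodes in a nodegroup and repartition the table online
ndb_mgm -e "CREATE NODEGROUP 4,5"
mysql -e "ALTER ONLINE TABLE T1 ADD PARTITION PARTITIONS 2" test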
howto backwards:
- drop table T1
- ndb_mgm> drop nodegroup X
- ndb_mgm> n1,n2 stop -a
- stop ndb_mgmd, remove the 2 nodes, start ndb_mgmd
- either stop the 2 running nodes and restart them
or rolling restart them
(A nodegroup is allowed to be dropped if it does not contain any data)
side effect:
- I added the possibility to specify nodegroups per node in the config-file
(this I intend to use for testing, but someone else might find it interesting too)
future:
- magnus is working on "online configuration change" in the ndb_mgmd
once this is complete/functional, we can add the "add node"-command
so that the entire procedure can be done wo/ node restarts.
---
Friday, July 4, 2008
summer months
june: customer issues and bugs
july: vacation
august: must complete add node (I haven't started, but work has been done by stewart)
----
No need to comment this...
if you want to make me happy, choose one of the earlier posts
Sunday, June 8, 2008
boom-tjackalack! table-reorg is pushed
so...now table-reorg is in 6.4.
pushbuild found a few problems...that are fixed.
what is left:
1) detailed test-prg (which will check consistency after each step, by pausing schema-trans)
2) handling of cluster-crash during reorg
the only way right now is to restore a backup if you get a crash during reorg
3) node failure during reorg might cause SUMA to not scan some fragments
(this bug is an old one, existing in 4.1, that also affects unique index build)
4) reorg-abort (in a certain state) leaves the REORG_MOVED bit on records,
causing subsequent reorgs (to a different partitioning) to create inconsistent data.
Not too bad...
I do however think it's quite testable (although maybe not extremely interesting wo/ add node)
Will start on add-node...and fix problems above in parallel
Thursday, June 5, 2008
almost push-time
I've now:
- fixed error handling (although testing is still not 100%)
- pushed the grand unified table state patch
- pushed a few patches in the series...
No one commented asking for a snapshot,
so i decided to push into 6.4 instead.
Will just spend some more time testing/cleaning up...
response to comment with questions
1) Which operations can I perform during a table reorg?
everything except DDL and node restart
ndb currently only allows one DDL at a time, and the reorg is a DDL
ndb currently prevents node restarts while DDL is ongoing
2) What happens to an ongoing table reorg during
2a) node failure
reorg will be completed or aborted depending on how long it has progressed
(i.e if commit has been started)
2b) cluster failure, and recovery?
reorg will be completed or aborted depending on how long it has progressed
(i.e if commit has been written)
The reorg is committed after rows have been copied, but before rows have been
deleted/cleaned up
3) How do my a) SQL b) NDBAPI applications have to be changed to cope with table reorg?
Not at all, but
- your application can "hint" incorrectly if it does not check table state
and refresh it after reorg has been committed
- your application might encounter temporary errors due to the reorg,
this error is the same one you can get during a node restart, so no special
handling of this is needed.
And hopefully the temporary errors should be rare (testing will show...)
4) How can I trade off the duration of a reorganisation against its resource impact (CPU, Memory, Bandwidth etc.)
Currently you can't; the speed is hard-coded. This may become a future feature
5) What performance impact does re-org have on ongoing DML and query operations?
Don't know yet, not enough testing. The debug-compiled versions that I tested gave maybe a 5-10% impact. (there is also another optimization that I want to do...which will reduce the impact)
6) What impact does re-org have on DDL operations?
None on ongoing DDL, since we only support one at a time.
And the re-org will prevent other DDL from starting while it's running
7) Will there be some easy way to re-org all cluster tables to balance across all available nodes?
write a stored procedure that lists all tables and reorgs them one by one (see the sketch after this Q&A list).
8) How are indexes modified during table re-org?
ordered indexes are reorganised together with base table
unique indexes are currently untouched (this should probably change)
9) Which parts of the re-org are serial, and which are parallel?
Same as all other schema-transactions after wl3600.
I.e. each operation-step is run in parallel on each node,
but only one operation-step is run at a time.
This means that e.g. copy and "cleanup" are run in parallel on all nodes.
10) Can I perform an online upgrade to a version of MySQL Cluster that supports re-org?
yes.
11) Can I restore a backup from an old version of MySQL Cluster and get online re-org features?
yes.
12) What are the down sides of this table re-org implementation?
none :-)
but there are some areas for improvement
13) Can re-org cope with heterogeneous NDBD nodes with different DataMemory capacities?
In the kernel, yes, but there is no SQL interface currently to expose this
14) How can I look at the hash-result-to-fragment-id mapping tables?
Using a hand-written ndbapi program
(maybe will add this to ndb_desc)
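regarding question 7: instead of a stored procedure, a shell loop over information_schema could do the same job; a minimal sketch (schema name and partition count are placeholders, error handling omitted):

#!/bin/sh
DB=test          # schema whose tables should be reorged
MORE_PARTS=2     # partitions to add to each table
for t in `mysql -N -e "SELECT table_name FROM information_schema.tables WHERE table_schema='$DB' AND engine='ndbcluster'"`
do
  echo "reorganising $DB.$t"
  mysql -e "ALTER ONLINE TABLE $t ADD PARTITION PARTITIONS $MORE_PARTS" $DB
done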
---
Phew...
that comment held so many questions...
that maybe i should not be asking for more comments...
Friday, May 23, 2008
alter online table T add partition partitions N
I now enabled SQL-interface to table-reorg.
The syntax (which is the same for other partition mgm) is
ALTER ONLINE TABLE T ADD PARTITION PARTITIONS N;
Also switched so that hashmap partitioning is used for all tables created using SQL.
And mysql-test-run works (including a new ndb_add_partition-test)
(except for some range/list partition testcases)
it's still kind of fragile. Error handling is sparse...
there are 3 known things which are easy to fix
- ndbapi transaction hinting/pruning does not work after/during a reorg
- unique indexes will not work after/during a reorg
- only 1 reorg per table is possible (SUMA caches distribution information incorrectly)
and one quite hard
- cluster crash *during* table-reorg
current plan is
1) fix 3 easy known problems
2) fix error handling
3) write detailed test program (pushed back again!)
Tuesday, May 20, 2008
wl3600++ complete
I've coded and pushed wl3600++ to telco-6.4
- lots of code simplification
- lots of "duplicate" code removal
Also merged it into the table-reorg clone.
And now system restart just started magically working.
So I can again run mysql-test-run
Will now fix any problems found by mysql-test-run.pl
---
Wednesday, May 7, 2008
wl3600++ clarification
Just one thing...
it's obvious that the only long-term correct solution
is to add a schema-log instead of the schema-file
but the rules/framework that i'm developing will be
such that a change like that is (almost) only at the transaction level,
i.e. operations will be almost unchanged...
however,
the schema-log will likely not happen this year, next year, or the year after that...(my guess)
Just wanted to clarify...
(that i'm not an idiot, well at least not a complete one :-)
wl3600++
I have now agreed on a way forward with Pekka on wl3600:
How to handle SchemaFile, batching & completeness needed for table-reorg.
What operations/transactions shall (and shall not) do in different stages.
Having a consistent model feels great!
(compared to the evolutionary mess that is present today)
Now I only need to implement it...
---
I now have at least two distinct readers (comments from 2 persons)...
I'm blogging my way into fame
Wednesday, April 30, 2008
Core functionality(reorg) complete!
transactions + scan + replication works!
Now remaining:
- error handling
- detailed unit testing
- durability of HashMap
- fix schema trans restart on SR
- fix unique index
- ndbapi interface to HashMap
- optimize COPY by using a new operation (ZMOVE)
which creates less load and interacts better with replication
(currently the COPY produces events which are "correct" but not optimal)
- sql-support
And of course add-node :)
---
Personally I think this post deserves at least one comment...
Tuesday, April 29, 2008
Core functionality(reorg) almost done
Core functionality is almost complete,
- all operations in place, running in correct order
- all transactions work correctly (with all synchronization)
Remaining is "only" 2 local functions, related to SUMA switchover.
---
Need to spend serious time on a contest to add to this blog,
to get more comments!
Thursday, April 24, 2008
Eureka - SUMA switch over for table-reorg
First it struck me that
- starting to double-send events can be done wo/ synchronization, because
1) this is the basic technique for node-failure handling
2) it does not matter if the new fragment does not contain the full epoch, as it will contain
the last part
Then (today) it struck me that
- turning off the double-send on the old "home" for a row does not require synchronization either
(except doing it on an epoch boundary), and different fragments can do it at different times, because
- there is already a double-send ongoing, so no events will be lost
This makes the task relatively straightforward, but the following is still needed:
- replication triggers must be turned off at an epoch boundary
- replication triggers must be turned off 3 epochs after "turn off" has been initiated
- all of the above means that it can be handled per node...(or maybe per node group...)
- have to think more about potential per node group synchronization though...
Hope I'm right! Will discuss @office, to see if anyone can find any holes
(including me, as I got the idea today)
---
Still only one comment...I think I need to add a contest or something
Monday, April 21, 2008
What is table-reorg
Table-reorg is the procedure which will be executed on "alter table X add partitions Y".
This you would typically do when you have added Y new nodes to your cluster.
The procedure is online, i.e. transactions can run during the operation
and no extra memory will be needed on the "old" nodes.
The reorg is based on linear hashing (but wo/ the normal skewness in distribution)
E.g. when going from 2 to 4 partitions, 50% of the rows will be moved.
The copying is done in parallel on the "old" nodes,
and consistency is kept using triggers.
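to make the 50% figure concrete, here is a toy calculation using plain modulo as a stand-in for the real hashmap (a simplification, not the actual ndb hash function):

# for 8 sample hash values, compare the partition with 2 partitions (hash % 2)
# and with 4 partitions (hash % 4); a row has to move when the two differ
for h in 0 1 2 3 4 5 6 7
do
  old=`expr $h % 2`
  new=`expr $h % 4`
  echo "hash=$h old-partition=$old new-partition=$new"
done
# hashes 2,3,6,7 end up in a new partition -> 4 of 8 = 50% of the rows move,
# and no row ever moves between the two old partitions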
---
So I have at least 1 reader...
Wednesday, April 16, 2008
Transactions now work!
- Transactions now work correctly, both pk/uk and table/index scan
- I have decided how to do testing (single-step through reorg)
Fixing schema-transaction seems like a must now (for 6.4)
- schema-file flushing
- complete phase
Maybe I can come up with something else to do first...
---
Still no comments on my blog...wonder if I have any readers
Friday, March 28, 2008
Assorted notes
- transaction consistency (big part is testing)
- fix schema trans complete phase (to 6.4 directly)
- durability of objects
- fix schema trans restart on SR
- fix unique index
- sql
- automagic HashMap creation
---
Will likely next build testing framework...for transaction consistency
table-reorg plan/progress II
1) hashmap (done)
2) add partitions (done)
3) reorg-triggers (done)
4) reorg-copy (done)
5) reorg-delete (done)
6) consistent scan (partially done)
---
Held live-demo for office audience
Thursday, March 27, 2008
table-reorg plan/progress
1) hashmap (done)
2) add partitions (done)
3) reorg-triggers (done)
4) reorg-copy (done)
5) reorg-delete
6) consistent scan
done = runnable != complete
---
just "finished" reorg-copy, still needs polishing...