Message 296161 - Python tracker

➜

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

In-reply-to
Author	Dmitry Rubanovich
Recipients	Dmitry Rubanovich, methane, rhettinger, serhiy.storchaka, tim.peters, xiang.zhang
Date	2017年06月16日.06:31:05
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1497594665.93.0.960961209971.issue30671@psf.upfronthosting.co.za>

Content
Tim, I am not testing randomly. The scripts calculate secondary hash for each value in a ring to see how often this results in duplicate (or triplicate, etc.) values. And they do it for each (mod 2i) ring for i between 8 and 20. Then (mostly as an afterthought) the scripts calculate how often each scheme results in a collision if more than 1 secondary hash index is calculated. I used 3 secondary indexes as a demonstration of the "afterthought" part. I do understand that the logic relies on being able to reach each value in 0..-1+2i in the worst case. Isn't that though accomplished by my latest proposal? It was this: "...unroll the 1st secondary hash key to use j = 2j + P + 1. So try to test for it before the loop. But leave 5j + P + 1 in the loop as is." In the code that would mean changing of the loop in, for example, lookdict_index() from for (size_t perturb = hash;;) { perturb >>= PERTURB_SHIFT; i = mask & ((i << 2) + i + perturb + 1); ix = dk_get_index(k, i); if (ix == index) { return i; } if (ix == DKIX_EMPTY) { return DKIX_EMPTY; } } to this: size_t perturb; .... perturb = hash; i = mask & ((i << 1) + perturb + 1); /* <---- key line / ix = dk_get_index(k, i); if (ix == index) { return i; } if (ix == DKIX_EMPTY) { return DKIX_EMPTY; } for (;;) { / nothing changes in this loop / perturb >>= PERTURB_SHIFT; i = mask & ((i << 2) + i + perturb + 1); ix = dk_get_index(k, i); if (ix == index) { return i; } if (ix == DKIX_EMPTY) { return DKIX_EMPTY; } } And, of course, it would mean adding the same precheck in front of all loops which go through this secondary index calculation. This prevents preventable collisions for the hashes, h, such that h mod 2k is equal to (h >> 5) mod 2k, where 2*k is the dict size. This frequency of such occurrence for each dict size is what's printed by print_collision_counts(py3_6_1_lookdict_perturb) in either of the attached scripts. Given that, for instance, it's 91 for dict's of size 256, it seems rather ripe for improvement.

Content

Tim,
I am not testing randomly. The scripts calculate secondary hash for **each** value in a ring to see how often this results in duplicate (or triplicate, etc.) values. And they do it for each (mod 2**i) ring for i between 8 and 20.
Then (mostly as an afterthought) the scripts calculate how often each scheme results in a collision if more than 1 secondary hash index is calculated. I used 3 secondary indexes as a demonstration of the "afterthought" part.
I do understand that the logic relies on being able to reach each value in 0..-1+2**i in the worst case. Isn't that though accomplished by my latest proposal? It was this:
"...unroll the 1st secondary hash key to use j = 2*j + P + 1. So try to test for it before the loop. But leave 5*j + P + 1 in the loop as is."
In the code that would mean changing of the loop in, for example, lookdict_index() from 
for (size_t perturb = hash;;) {
 perturb >>= PERTURB_SHIFT;
 i = mask & ((i << 2) + i + perturb + 1);
 ix = dk_get_index(k, i);
 if (ix == index) {
 return i;
 }
 if (ix == DKIX_EMPTY) {
 return DKIX_EMPTY;
 }
}
to this:
size_t perturb;
....
perturb = hash;
i = mask & ((i << 1) + perturb + 1); /* <---- key line */
ix = dk_get_index(k, i);
if (ix == index) {
 return i;
}
if (ix == DKIX_EMPTY) {
 return DKIX_EMPTY;
}
for (;;) {
 /* nothing changes in this loop */
 perturb >>= PERTURB_SHIFT;
 i = mask & ((i << 2) + i + perturb + 1);
 ix = dk_get_index(k, i);
 if (ix == index) {
 return i;
 }
 if (ix == DKIX_EMPTY) {
 return DKIX_EMPTY;
 }
}
And, of course, it would mean adding the same precheck in front of all loops which go through this secondary index calculation.
This prevents preventable collisions for the hashes, h, such that
h mod 2**k is equal to (h >> 5) mod 2**k,
where 2**k is the dict size. This frequency of such occurrence for each dict size is what's printed by 
print_collision_counts(py3_6_1_lookdict_perturb) 
in either of the attached scripts. Given that, for instance, it's 91 for dict's of size 256, it seems rather ripe for improvement.

History
Date	User	Action	Args
2017年06月16日 06:31:05	Dmitry Rubanovich	set	recipients: + Dmitry Rubanovich, tim.peters, rhettinger, methane, serhiy.storchaka, xiang.zhang
2017年06月16日 06:31:05	Dmitry Rubanovich	set	messageid: <1497594665.93.0.960961209971.issue30671@psf.upfronthosting.co.za>
2017年06月16日 06:31:05	Dmitry Rubanovich	link	issue30671 messages
2017年06月16日 06:31:05	Dmitry Rubanovich	create

homepage