[フレーム]

tuplesort.c

Go to the documentation of this file.

1/*-------------------------------------------------------------------------

2 *

3 * tuplesort.c

4 * Generalized tuple sorting routines.

5 *

6 * This module provides a generalized facility for tuple sorting, which can be

7 * applied to different kinds of sortable objects. Implementation of

8 * the particular sorting variants is given in tuplesortvariants.c.

9 * This module works efficiently for both small and large amounts

10 * of data. Small amounts are sorted in-memory using qsort(). Large

11 * amounts are sorted using temporary files and a standard external sort

12 * algorithm.

13 *

14 * See Knuth, volume 3, for more than you want to know about external

15 * sorting algorithms. The algorithm we use is a balanced k-way merge.

16 * Before PostgreSQL 15, we used the polyphase merge algorithm (Knuth's

17 * Algorithm 5.4.2D), but with modern hardware, a straightforward balanced

18 * merge is better. Knuth is assuming that tape drives are expensive

19 * beasts, and in particular that there will always be many more runs than

20 * tape drives. The polyphase merge algorithm was good at keeping all the

21 * tape drives busy, but in our implementation a "tape drive" doesn't cost

22 * much more than a few Kb of memory buffers, so we can afford to have

23 * lots of them. In particular, if we can have as many tape drives as

24 * sorted runs, we can eliminate any repeated I/O at all.

25 *

26 * Historically, we divided the input into sorted runs using replacement

27 * selection, in the form of a priority tree implemented as a heap

28 * (essentially Knuth's Algorithm 5.2.3H), but now we always use quicksort

29 * for run generation.

30 *

31 * The approximate amount of memory allowed for any one sort operation

32 * is specified in kilobytes by the caller (most pass work_mem). Initially,

33 * we absorb tuples and simply store them in an unsorted array as long as

34 * we haven't exceeded workMem. If we reach the end of the input without

35 * exceeding workMem, we sort the array using qsort() and subsequently return

36 * tuples just by scanning the tuple array sequentially. If we do exceed

37 * workMem, we begin to emit tuples into sorted runs in temporary tapes.

38 * When tuples are dumped in batch after quicksorting, we begin a new run

39 * with a new output tape. If we reach the max number of tapes, we write

40 * subsequent runs on the existing tapes in a round-robin fashion. We will

41 * need multiple merge passes to finish the merge in that case. After the

42 * end of the input is reached, we dump out remaining tuples in memory into

43 * a final run, then merge the runs.

44 *

45 * When merging runs, we use a heap containing just the frontmost tuple from

46 * each source run; we repeatedly output the smallest tuple and replace it

47 * with the next tuple from its source tape (if any). When the heap empties,

48 * the merge is complete. The basic merge algorithm thus needs very little

49 * memory --- only M tuples for an M-way merge, and M is constrained to a

50 * small number. However, we can still make good use of our full workMem

51 * allocation by pre-reading additional blocks from each source tape. Without

52 * prereading, our access pattern to the temporary file would be very erratic;

53 * on average we'd read one block from each of M source tapes during the same

54 * time that we're writing M blocks to the output tape, so there is no

55 * sequentiality of access at all, defeating the read-ahead methods used by

56 * most Unix kernels. Worse, the output tape gets written into a very random

57 * sequence of blocks of the temp file, ensuring that things will be even

58 * worse when it comes time to read that tape. A straightforward merge pass

59 * thus ends up doing a lot of waiting for disk seeks. We can improve matters

60 * by prereading from each source tape sequentially, loading about workMem/M

61 * bytes from each tape in turn, and making the sequential blocks immediately

62 * available for reuse. This approach helps to localize both read and write

63 * accesses. The pre-reading is handled by logtape.c, we just tell it how

64 * much memory to use for the buffers.

65 *

66 * In the current code we determine the number of input tapes M on the basis

67 * of workMem: we want workMem/M to be large enough that we read a fair

68 * amount of data each time we read from a tape, so as to maintain the

69 * locality of access described above. Nonetheless, with large workMem we

70 * can have many tapes. The logical "tapes" are implemented by logtape.c,

71 * which avoids space wastage by recycling disk space as soon as each block

72 * is read from its "tape".

73 *

74 * When the caller requests random access to the sort result, we form

75 * the final sorted run on a logical tape which is then "frozen", so

76 * that we can access it randomly. When the caller does not need random

77 * access, we return from tuplesort_performsort() as soon as we are down

78 * to one run per logical tape. The final merge is then performed

79 * on-the-fly as the caller repeatedly calls tuplesort_getXXX; this

80 * saves one cycle of writing all the data out to disk and reading it in.

81 *

82 * This module supports parallel sorting. Parallel sorts involve coordination

83 * among one or more worker processes, and a leader process, each with its own

84 * tuplesort state. The leader process (or, more accurately, the

85 * Tuplesortstate associated with a leader process) creates a full tapeset

86 * consisting of worker tapes with one run to merge; a run for every

87 * worker process. This is then merged. Worker processes are guaranteed to

88 * produce exactly one output run from their partial input.

89 *

90 *

93 *

94 * IDENTIFICATION

95 * src/backend/utils/sort/tuplesort.c

96 *

97 *-------------------------------------------------------------------------

98 */

99

100#include "postgres.h"

101

102#include <limits.h>

103

104#include "commands/tablespace.h"

105#include "miscadmin.h"

106#include "pg_trace.h"

107#include "storage/shmem.h"

108#include "utils/guc.h"

109#include "utils/memutils.h"

110#include "utils/pg_rusage.h"

111#include "utils/tuplesort.h"

112

113/*

114 * Initial size of memtuples array. We're trying to select this size so that

115 * array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and so that the overhead of

116 * allocation might possibly be lowered. However, we don't consider array sizes

117 * less than 1024.

118 *

119 */

120 #define INITIAL_MEMTUPSIZE Max(1024, \

121 ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)

122

123/* GUC variables */

124 bool trace_sort = false;

125

126#ifdef DEBUG_BOUNDED_SORT

127bool optimize_bounded_sort = true;

128#endif

129

130

131/*

132 * During merge, we use a pre-allocated set of fixed-size slots to hold

133 * tuples. To avoid palloc/pfree overhead.

134 *

135 * Merge doesn't require a lot of memory, so we can afford to waste some,

136 * by using gratuitously-sized slots. If a tuple is larger than 1 kB, the

137 * palloc() overhead is not significant anymore.

138 *

139 * 'nextfree' is valid when this chunk is in the free list. When in use, the

140 * slot holds a tuple.

141 */

142 #define SLAB_SLOT_SIZE 1024

143

144 typedef union SlabSlot

145{

146 union SlabSlot *nextfree;

147 char buffer[SLAB_SLOT_SIZE];

148 } SlabSlot;

149

150/*

151 * Possible states of a Tuplesort object. These denote the states that

152 * persist between calls of Tuplesort routines.

153 */

154 typedef enum

155{

156 TSS_INITIAL, /* Loading tuples; still within memory limit */

157 TSS_BOUNDED, /* Loading tuples into bounded-size heap */

158 TSS_BUILDRUNS, /* Loading tuples; writing to tape */

159 TSS_SORTEDINMEM, /* Sort completed entirely in memory */

160 TSS_SORTEDONTAPE, /* Sort completed, final run is on tape */

161 TSS_FINALMERGE, /* Performing final merge on-the-fly */

162} TupSortStatus;

163

164/*

165 * Parameters for calculation of number of tapes to use --- see inittapes()

166 * and tuplesort_merge_order().

167 *

168 * In this calculation we assume that each tape will cost us about 1 blocks

169 * worth of buffer space. This ignores the overhead of all the other data

170 * structures needed for each tape, but it's probably close enough.

171 *

172 * MERGE_BUFFER_SIZE is how much buffer space we'd like to allocate for each

173 * input tape, for pre-reading (see discussion at top of file). This is *in

174 * addition to* the 1 block already included in TAPE_BUFFER_OVERHEAD.

175 */

176 #define MINORDER 6 /* minimum merge order */

177 #define MAXORDER 500 /* maximum merge order */

178 #define TAPE_BUFFER_OVERHEAD BLCKSZ

179 #define MERGE_BUFFER_SIZE (BLCKSZ * 32)

180

181

182/*

183 * Private state of a Tuplesort operation.

184 */

185 struct Tuplesortstate

186{

187 TuplesortPublic base;

188 TupSortStatus status; /* enumerated value as shown above */

189 bool bounded; /* did caller specify a maximum number of

190 * tuples to return? */

191 bool boundUsed; /* true if we made use of a bounded heap */

192 int bound; /* if bounded, the maximum number of tuples */

193 int64 tupleMem; /* memory consumed by individual tuples.

194 * storing this separately from what we track

195 * in availMem allows us to subtract the

196 * memory consumed by all tuples when dumping

197 * tuples to tape */

198 int64 availMem; /* remaining memory available, in bytes */

199 int64 allowedMem; /* total memory allowed, in bytes */

200 int maxTapes; /* max number of input tapes to merge in each

201 * pass */

202 int64 maxSpace; /* maximum amount of space occupied among sort

203 * of groups, either in-memory or on-disk */

204 bool isMaxSpaceDisk; /* true when maxSpace is value for on-disk

205 * space, false when its value for in-memory

206 * space */

207 TupSortStatus maxSpaceStatus; /* sort status when maxSpace was reached */

208 LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */

209

210 /*

211 * This array holds the tuples now in sort memory. If we are in state

212 * INITIAL, the tuples are in no particular order; if we are in state

213 * SORTEDINMEM, the tuples are in final sorted order; in states BUILDRUNS

214 * and FINALMERGE, the tuples are organized in "heap" order per Algorithm

215 * H. In state SORTEDONTAPE, the array is not used.

216 */

217 SortTuple *memtuples; /* array of SortTuple structs */

218 int memtupcount; /* number of tuples currently present */

219 int memtupsize; /* allocated length of memtuples array */

220 bool growmemtuples; /* memtuples' growth still underway? */

221

222 /*

223 * Memory for tuples is sometimes allocated using a simple slab allocator,

224 * rather than with palloc(). Currently, we switch to slab allocation

225 * when we start merging. Merging only needs to keep a small, fixed

226 * number of tuples in memory at any time, so we can avoid the

227 * palloc/pfree overhead by recycling a fixed number of fixed-size slots

228 * to hold the tuples.

229 *

230 * For the slab, we use one large allocation, divided into SLAB_SLOT_SIZE

231 * slots. The allocation is sized to have one slot per tape, plus one

232 * additional slot. We need that many slots to hold all the tuples kept

233 * in the heap during merge, plus the one we have last returned from the

234 * sort, with tuplesort_gettuple.

235 *

236 * Initially, all the slots are kept in a linked list of free slots. When

237 * a tuple is read from a tape, it is put to the next available slot, if

238 * it fits. If the tuple is larger than SLAB_SLOT_SIZE, it is palloc'd

239 * instead.

240 *

241 * When we're done processing a tuple, we return the slot back to the free

242 * list, or pfree() if it was palloc'd. We know that a tuple was

243 * allocated from the slab, if its pointer value is between

244 * slabMemoryBegin and -End.

245 *

246 * When the slab allocator is used, the USEMEM/LACKMEM mechanism of

247 * tracking memory usage is not used.

248 */

249 bool slabAllocatorUsed;

250

251 char *slabMemoryBegin; /* beginning of slab memory arena */

252 char *slabMemoryEnd; /* end of slab memory arena */

253 SlabSlot *slabFreeHead; /* head of free list */

254

255 /* Memory used for input and output tape buffers. */

256 size_t tape_buffer_mem;

257

258 /*

259 * When we return a tuple to the caller in tuplesort_gettuple_XXX, that

260 * came from a tape (that is, in TSS_SORTEDONTAPE or TSS_FINALMERGE

261 * modes), we remember the tuple in 'lastReturnedTuple', so that we can

262 * recycle the memory on next gettuple call.

263 */

264 void *lastReturnedTuple;

265

266 /*

267 * While building initial runs, this is the current output run number.

268 * Afterwards, it is the number of initial runs we made.

269 */

270 int currentRun;

271

272 /*

273 * Logical tapes, for merging.

274 *

275 * The initial runs are written in the output tapes. In each merge pass,

276 * the output tapes of the previous pass become the input tapes, and new

277 * output tapes are created as needed. When nInputTapes equals

278 * nInputRuns, there is only one merge pass left.

279 */

280 LogicalTape **inputTapes;

281 int nInputTapes;

282 int nInputRuns;

283

284 LogicalTape **outputTapes;

285 int nOutputTapes;

286 int nOutputRuns;

287

288 LogicalTape *destTape; /* current output tape */

289

290 /*

291 * These variables are used after completion of sorting to keep track of

292 * the next tuple to return. (In the tape case, the tape's current read

293 * position is also critical state.)

294 */

295 LogicalTape *result_tape; /* actual tape of finished output */

296 int current; /* array index (only used if SORTEDINMEM) */

297 bool eof_reached; /* reached EOF (needed for cursors) */

298

299 /* markpos_xxx holds marked position for mark and restore */

300 int64 markpos_block; /* tape block# (only used if SORTEDONTAPE) */

301 int markpos_offset; /* saved "current", or offset in tape block */

302 bool markpos_eof; /* saved "eof_reached" */

303

304 /*

305 * These variables are used during parallel sorting.

306 *

307 * worker is our worker identifier. Follows the general convention that

308 * -1 value relates to a leader tuplesort, and values >= 0 worker

309 * tuplesorts. (-1 can also be a serial tuplesort.)

310 *

311 * shared is mutable shared memory state, which is used to coordinate

312 * parallel sorts.

313 *

314 * nParticipants is the number of worker Tuplesortstates known by the

315 * leader to have actually been launched, which implies that they must

316 * finish a run that the leader needs to merge. Typically includes a

317 * worker state held by the leader process itself. Set in the leader

318 * Tuplesortstate only.

319 */

320 int worker;

321 Sharedsort *shared;

322 int nParticipants;

323

324 /*

325 * Additional state for managing "abbreviated key" sortsupport routines

326 * (which currently may be used by all cases except the hash index case).

327 * Tracks the intervals at which the optimization's effectiveness is

328 * tested.

329 */

330 int64 abbrevNext; /* Tuple # at which to next check

331 * applicability */

332

333 /*

334 * Resource snapshot for time of sort start.

335 */

336 PGRUsage ru_start;

337};

338

339/*

340 * Private mutable state of tuplesort-parallel-operation. This is allocated

341 * in shared memory.

342 */

343 struct Sharedsort

344{

345 /* mutex protects all fields prior to tapes */

346 slock_t mutex;

347

348 /*

349 * currentWorker generates ordinal identifier numbers for parallel sort

350 * workers. These start from 0, and are always gapless.

351 *

352 * Workers increment workersFinished to indicate having finished. If this

353 * is equal to state.nParticipants within the leader, leader is ready to

354 * merge worker runs.

355 */

356 int currentWorker;

357 int workersFinished;

358

359 /* Temporary file space */

360 SharedFileSet fileset;

361

362 /* Size of tapes flexible array */

363 int nTapes;

364

365 /*

366 * Tapes array used by workers to report back information needed by the

367 * leader to concatenate all worker tapes into one for merging

368 */

369 TapeShare tapes[FLEXIBLE_ARRAY_MEMBER];

370};

371

372/*

373 * Is the given tuple allocated from the slab memory arena?

374 */

375 #define IS_SLAB_SLOT(state, tuple) \

376 ((char *) (tuple) >= (state)->slabMemoryBegin && \

377 (char *) (tuple) < (state)->slabMemoryEnd)

378

379/*

380 * Return the given tuple to the slab memory free list, or free it

381 * if it was palloc'd.

382 */

383 #define RELEASE_SLAB_SLOT(state, tuple) \

384 do { \

385 SlabSlot *buf = (SlabSlot *) tuple; \

386 \

387 if (IS_SLAB_SLOT((state), buf)) \

388 { \

389 buf->nextfree = (state)->slabFreeHead; \

390 (state)->slabFreeHead = buf; \

391 } else \

392 pfree(buf); \

393 } while(0)

394

395 #define REMOVEABBREV(state,stup,count) ((*(state)->base.removeabbrev) (state, stup, count))

396 #define COMPARETUP(state,a,b) ((*(state)->base.comparetup) (a, b, state))

397 #define WRITETUP(state,tape,stup) ((*(state)->base.writetup) (state, tape, stup))

398 #define READTUP(state,stup,tape,len) ((*(state)->base.readtup) (state, stup, tape, len))

399 #define FREESTATE(state) ((state)->base.freestate ? (*(state)->base.freestate) (state) : (void) 0)

400 #define LACKMEM(state) ((state)->availMem < 0 && !(state)->slabAllocatorUsed)

401 #define USEMEM(state,amt) ((state)->availMem -= (amt))

402 #define FREEMEM(state,amt) ((state)->availMem += (amt))

403 #define SERIAL(state) ((state)->shared == NULL)

404 #define WORKER(state) ((state)->shared && (state)->worker != -1)

405 #define LEADER(state) ((state)->shared && (state)->worker == -1)

406

407/*

408 * NOTES about on-tape representation of tuples:

409 *

410 * We require the first "unsigned int" of a stored tuple to be the total size

411 * on-tape of the tuple, including itself (so it is never zero; an all-zero

412 * unsigned int is used to delimit runs). The remainder of the stored tuple

413 * may or may not match the in-memory representation of the tuple ---

414 * any conversion needed is the job of the writetup and readtup routines.

415 *

416 * If state->sortopt contains TUPLESORT_RANDOMACCESS, then the stored

417 * representation of the tuple must be followed by another "unsigned int" that

418 * is a copy of the length --- so the total tape space used is actually

419 * sizeof(unsigned int) more than the stored length value. This allows

420 * read-backwards. When the random access flag was not specified, the

421 * write/read routines may omit the extra length word.

422 *

423 * writetup is expected to write both length words as well as the tuple

424 * data. When readtup is called, the tape is positioned just after the

425 * front length word; readtup must read the tuple data and advance past

426 * the back length word (if present).

427 *

428 * The write/read routines can make use of the tuple description data

429 * stored in the Tuplesortstate record, if needed. They are also expected

430 * to adjust state->availMem by the amount of memory space (not tape space!)

431 * released or consumed. There is no error return from either writetup

432 * or readtup; they should ereport() on failure.

433 *

434 *

435 * NOTES about memory consumption calculations:

436 *

437 * We count space allocated for tuples against the workMem limit, plus

438 * the space used by the variable-size memtuples array. Fixed-size space

439 * is not counted; it's small enough to not be interesting.

440 *

441 * Note that we count actual space used (as shown by GetMemoryChunkSpace)

442 * rather than the originally-requested size. This is important since

443 * palloc can add substantial overhead. It's not a complete answer since

444 * we won't count any wasted space in palloc allocation blocks, but it's

445 * a lot better than what we were doing before 7.3. As of 9.6, a

446 * separate memory context is used for caller passed tuples. Resetting

447 * it at certain key increments significantly ameliorates fragmentation.

448 * readtup routines use the slab allocator (they cannot use

449 * the reset context because it gets deleted at the point that merging

450 * begins).

451 */

452

453

454static void tuplesort_begin_batch(Tuplesortstate *state);

455static bool consider_abort_common(Tuplesortstate *state);

456static void inittapes(Tuplesortstate *state, bool mergeruns);

457static void inittapestate(Tuplesortstate *state, int maxTapes);

458static void selectnewtape(Tuplesortstate *state);

459static void init_slab_allocator(Tuplesortstate *state, int numSlots);

460static void mergeruns(Tuplesortstate *state);

461static void mergeonerun(Tuplesortstate *state);

462static void beginmerge(Tuplesortstate *state);

463static bool mergereadnext(Tuplesortstate *state, LogicalTape *srcTape, SortTuple *stup);

464static void dumptuples(Tuplesortstate *state, bool alltuples);

465static void make_bounded_heap(Tuplesortstate *state);

466static void sort_bounded_heap(Tuplesortstate *state);

467static void tuplesort_sort_memtuples(Tuplesortstate *state);

468static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple);

469static void tuplesort_heap_replace_top(Tuplesortstate *state, SortTuple *tuple);

470static void tuplesort_heap_delete_top(Tuplesortstate *state);

471static void reversedirection(Tuplesortstate *state);

472static unsigned int getlen(LogicalTape *tape, bool eofOK);

473static void markrunend(LogicalTape *tape);

474static int worker_get_identifier(Tuplesortstate *state);

475static void worker_freeze_result_tape(Tuplesortstate *state);

476static void worker_nomergeruns(Tuplesortstate *state);

477static void leader_takeover_tapes(Tuplesortstate *state);

478static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);

479static void tuplesort_free(Tuplesortstate *state);

480static void tuplesort_updatemax(Tuplesortstate *state);

481

482/*

483 * Specialized comparators that we can inline into specialized sorts. The goal

484 * is to try to sort two tuples without having to follow the pointers to the

485 * comparator or the tuple.

486 *

487 * XXX: For now, there is no specialization for cases where datum1 is

488 * authoritative and we don't even need to fall back to a callback at all (that

489 * would be true for types like int4/int8/timestamp/date, but not true for

490 * abbreviations of text or multi-key sorts. There could be! Is it worth it?

491 */

492

493/* Used if first key's comparator is ssup_datum_unsigned_cmp */

494static pg_attribute_always_inline int

495 qsort_tuple_unsigned_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)

496{

497 int compare;

498

499 compare = ApplyUnsignedSortComparator(a->datum1, a->isnull1,

500 b->datum1, b->isnull1,

501 &state->base.sortKeys[0]);

502 if (compare != 0)

503 return compare;

504

505 /*

506 * No need to waste effort calling the tiebreak function when there are no

507 * other keys to sort on.

508 */

509 if (state->base.onlyKey != NULL)

510 return 0;

511

512 return state->base.comparetup_tiebreak(a, b, state);

513}

514

515/* Used if first key's comparator is ssup_datum_signed_cmp */

516static pg_attribute_always_inline int

517 qsort_tuple_signed_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)

518{

519 int compare;

520

521 compare = ApplySignedSortComparator(a->datum1, a->isnull1,

522 b->datum1, b->isnull1,

523 &state->base.sortKeys[0]);

524

525 if (compare != 0)

526 return compare;

527

528 /*

529 * No need to waste effort calling the tiebreak function when there are no

530 * other keys to sort on.

531 */

532 if (state->base.onlyKey != NULL)

533 return 0;

534

535 return state->base.comparetup_tiebreak(a, b, state);

536}

537

538/* Used if first key's comparator is ssup_datum_int32_cmp */

539static pg_attribute_always_inline int

540 qsort_tuple_int32_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)

541{

542 int compare;

543

544 compare = ApplyInt32SortComparator(a->datum1, a->isnull1,

545 b->datum1, b->isnull1,

546 &state->base.sortKeys[0]);

547

548 if (compare != 0)

549 return compare;

550

551 /*

552 * No need to waste effort calling the tiebreak function when there are no

553 * other keys to sort on.

554 */

555 if (state->base.onlyKey != NULL)

556 return 0;

557

558 return state->base.comparetup_tiebreak(a, b, state);

559}

560

561/*

562 * Special versions of qsort just for SortTuple objects. qsort_tuple() sorts

563 * any variant of SortTuples, using the appropriate comparetup function.

564 * qsort_ssup() is specialized for the case where the comparetup function

565 * reduces to ApplySortComparator(), that is single-key MinimalTuple sorts

566 * and Datum sorts. qsort_tuple_{unsigned,signed,int32} are specialized for

567 * common comparison functions on pass-by-value leading datums.

568 */

569

570#define ST_SORT qsort_tuple_unsigned

571#define ST_ELEMENT_TYPE SortTuple

572#define ST_COMPARE(a, b, state) qsort_tuple_unsigned_compare(a, b, state)

573#define ST_COMPARE_ARG_TYPE Tuplesortstate

574#define ST_CHECK_FOR_INTERRUPTS

575#define ST_SCOPE static

576#define ST_DEFINE

577#include "lib/sort_template.h"

578

579#define ST_SORT qsort_tuple_signed

580#define ST_ELEMENT_TYPE SortTuple

581#define ST_COMPARE(a, b, state) qsort_tuple_signed_compare(a, b, state)

582#define ST_COMPARE_ARG_TYPE Tuplesortstate

583#define ST_CHECK_FOR_INTERRUPTS

584#define ST_SCOPE static

585#define ST_DEFINE

586#include "lib/sort_template.h"

587

588#define ST_SORT qsort_tuple_int32

589#define ST_ELEMENT_TYPE SortTuple

590#define ST_COMPARE(a, b, state) qsort_tuple_int32_compare(a, b, state)

591#define ST_COMPARE_ARG_TYPE Tuplesortstate

592#define ST_CHECK_FOR_INTERRUPTS

593#define ST_SCOPE static

594#define ST_DEFINE

595#include "lib/sort_template.h"

596

597#define ST_SORT qsort_tuple

598#define ST_ELEMENT_TYPE SortTuple

599 #define ST_COMPARE_RUNTIME_POINTER

600#define ST_COMPARE_ARG_TYPE Tuplesortstate

601#define ST_CHECK_FOR_INTERRUPTS

602#define ST_SCOPE static

603 #define ST_DECLARE

604#define ST_DEFINE

605#include "lib/sort_template.h"

606

607 #define ST_SORT qsort_ssup

608 #define ST_ELEMENT_TYPE SortTuple

609 #define ST_COMPARE(a, b, ssup) \

610 ApplySortComparator((a)->datum1, (a)->isnull1, \

611 (b)->datum1, (b)->isnull1, (ssup))

612 #define ST_COMPARE_ARG_TYPE SortSupportData

613 #define ST_CHECK_FOR_INTERRUPTS

614 #define ST_SCOPE static

615 #define ST_DEFINE

616#include "lib/sort_template.h"

617

618/*

619 * tuplesort_begin_xxx

620 *

621 * Initialize for a tuple sort operation.

622 *

623 * After calling tuplesort_begin, the caller should call tuplesort_putXXX

624 * zero or more times, then call tuplesort_performsort when all the tuples

625 * have been supplied. After performsort, retrieve the tuples in sorted

626 * order by calling tuplesort_getXXX until it returns false/NULL. (If random

627 * access was requested, rescan, markpos, and restorepos can also be called.)

628 * Call tuplesort_end to terminate the operation and release memory/disk space.

629 *

630 * Each variant of tuplesort_begin has a workMem parameter specifying the

631 * maximum number of kilobytes of RAM to use before spilling data to disk.

632 * (The normal value of this parameter is work_mem, but some callers use

633 * other values.) Each variant also has a sortopt which is a bitmask of

634 * sort options. See TUPLESORT_* definitions in tuplesort.h

635 */

636

637Tuplesortstate *

638 tuplesort_begin_common(int workMem, SortCoordinate coordinate, int sortopt)

639{

640 Tuplesortstate *state;

641 MemoryContext maincontext;

642 MemoryContext sortcontext;

643 MemoryContext oldcontext;

644

645 /* See leader_takeover_tapes() remarks on random access support */

646 if (coordinate && (sortopt & TUPLESORT_RANDOMACCESS))

647 elog(ERROR, "random access disallowed under parallel sort");

648

649 /*

650 * Memory context surviving tuplesort_reset. This memory context holds

651 * data which is useful to keep while sorting multiple similar batches.

652 */

653 maincontext = AllocSetContextCreate(CurrentMemoryContext,

654 "TupleSort main",

655 ALLOCSET_DEFAULT_SIZES);

656

657 /*

658 * Create a working memory context for one sort operation. The content of

659 * this context is deleted by tuplesort_reset.

660 */

661 sortcontext = AllocSetContextCreate(maincontext,

662 "TupleSort sort",

663 ALLOCSET_DEFAULT_SIZES);

664

665 /*

666 * Additionally a working memory context for tuples is setup in

667 * tuplesort_begin_batch.

668 */

669

670 /*

671 * Make the Tuplesortstate within the per-sortstate context. This way, we

672 * don't need a separate pfree() operation for it at shutdown.

673 */

674 oldcontext = MemoryContextSwitchTo(maincontext);

675

676 state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));

677

678 if (trace_sort)

679 pg_rusage_init(&state->ru_start);

680

681 state->base.sortopt = sortopt;

682 state->base.tuples = true;

683 state->abbrevNext = 10;

684

685 /*

686 * workMem is forced to be at least 64KB, the current minimum valid value

687 * for the work_mem GUC. This is a defense against parallel sort callers

688 * that divide out memory among many workers in a way that leaves each

689 * with very little memory.

690 */

691 state->allowedMem = Max(workMem, 64) * (int64) 1024;

692 state->base.sortcontext = sortcontext;

693 state->base.maincontext = maincontext;

694

695 /*

696 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;

697 * see comments in grow_memtuples().

698 */

699 state->memtupsize = INITIAL_MEMTUPSIZE;

700 state->memtuples = NULL;

701

702 /*

703 * After all of the other non-parallel-related state, we setup all of the

704 * state needed for each batch.

705 */

706 tuplesort_begin_batch(state);

707

708 /*

709 * Initialize parallel-related state based on coordination information

710 * from caller

711 */

712 if (!coordinate)

713 {

714 /* Serial sort */

715 state->shared = NULL;

716 state->worker = -1;

717 state->nParticipants = -1;

718 }

719 else if (coordinate->isWorker)

720 {

721 /* Parallel worker produces exactly one final run from all input */

722 state->shared = coordinate->sharedsort;

723 state->worker = worker_get_identifier(state);

724 state->nParticipants = -1;

725 }

726 else

727 {

728 /* Parallel leader state only used for final merge */

729 state->shared = coordinate->sharedsort;

730 state->worker = -1;

731 state->nParticipants = coordinate->nParticipants;

732 Assert(state->nParticipants >= 1);

733 }

734

735 MemoryContextSwitchTo(oldcontext);

736

737 return state;

738}

739

740/*

741 * tuplesort_begin_batch

742 *

743 * Setup, or reset, all state need for processing a new set of tuples with this

744 * sort state. Called both from tuplesort_begin_common (the first time sorting

745 * with this sort state) and tuplesort_reset (for subsequent usages).

746 */

747static void

748 tuplesort_begin_batch(Tuplesortstate *state)

749{

750 MemoryContext oldcontext;

751

752 oldcontext = MemoryContextSwitchTo(state->base.maincontext);

753

754 /*

755 * Caller tuple (e.g. IndexTuple) memory context.

756 *

757 * A dedicated child context used exclusively for caller passed tuples

758 * eases memory management. Resetting at key points reduces

759 * fragmentation. Note that the memtuples array of SortTuples is allocated

760 * in the parent context, not this context, because there is no need to

761 * free memtuples early. For bounded sorts, tuples may be pfreed in any

762 * order, so we use a regular aset.c context so that it can make use of

763 * free'd memory. When the sort is not bounded, we make use of a bump.c

764 * context as this keeps allocations more compact with less wastage.

765 * Allocations are also slightly more CPU efficient.

766 */

767 if (TupleSortUseBumpTupleCxt(state->base.sortopt))

768 state->base.tuplecontext = BumpContextCreate(state->base.sortcontext,

769 "Caller tuples",

770 ALLOCSET_DEFAULT_SIZES);

771 else

772 state->base.tuplecontext = AllocSetContextCreate(state->base.sortcontext,

773 "Caller tuples",

774 ALLOCSET_DEFAULT_SIZES);

775

776

777 state->status = TSS_INITIAL;

778 state->bounded = false;

779 state->boundUsed = false;

780

781 state->availMem = state->allowedMem;

782

783 state->tapeset = NULL;

784

785 state->memtupcount = 0;

786

787 /*

788 * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;

789 * see comments in grow_memtuples().

790 */

791 state->growmemtuples = true;

792 state->slabAllocatorUsed = false;

793 if (state->memtuples != NULL && state->memtupsize != INITIAL_MEMTUPSIZE)

794 {

795 pfree(state->memtuples);

796 state->memtuples = NULL;

797 state->memtupsize = INITIAL_MEMTUPSIZE;

798 }

799 if (state->memtuples == NULL)

800 {

801 state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));

802 USEMEM(state, GetMemoryChunkSpace(state->memtuples));

803 }

804

805 /* workMem must be large enough for the minimal memtuples array */

806 if (LACKMEM(state))

807 elog(ERROR, "insufficient memory allowed for sort");

808

809 state->currentRun = 0;

810

811 /*

812 * Tape variables (inputTapes, outputTapes, etc.) will be initialized by

813 * inittapes(), if needed.

814 */

815

816 state->result_tape = NULL; /* flag that result tape has not been formed */

817

818 MemoryContextSwitchTo(oldcontext);

819}

820

821/*

822 * tuplesort_set_bound

823 *

824 * Advise tuplesort that at most the first N result tuples are required.

825 *

826 * Must be called before inserting any tuples. (Actually, we could allow it

827 * as long as the sort hasn't spilled to disk, but there seems no need for

828 * delayed calls at the moment.)

829 *

830 * This is a hint only. The tuplesort may still return more tuples than

831 * requested. Parallel leader tuplesorts will always ignore the hint.

832 */

833void

834 tuplesort_set_bound(Tuplesortstate *state, int64 bound)

835{

836 /* Assert we're called before loading any tuples */

837 Assert(state->status == TSS_INITIAL && state->memtupcount == 0);

838 /* Assert we allow bounded sorts */

839 Assert(state->base.sortopt & TUPLESORT_ALLOWBOUNDED);

840 /* Can't set the bound twice, either */

841 Assert(!state->bounded);

842 /* Also, this shouldn't be called in a parallel worker */

843 Assert(!WORKER(state));

844

845 /* Parallel leader allows but ignores hint */

846 if (LEADER(state))

847 return;

848

849#ifdef DEBUG_BOUNDED_SORT

850 /* Honor GUC setting that disables the feature (for easy testing) */

851 if (!optimize_bounded_sort)

852 return;

853#endif

854

855 /* We want to be able to compute bound * 2, so limit the setting */

856 if (bound > (int64) (INT_MAX / 2))

857 return;

858

859 state->bounded = true;

860 state->bound = (int) bound;

861

862 /*

863 * Bounded sorts are not an effective target for abbreviated key

864 * optimization. Disable by setting state to be consistent with no

865 * abbreviation support.

866 */

867 state->base.sortKeys->abbrev_converter = NULL;

868 if (state->base.sortKeys->abbrev_full_comparator)

869 state->base.sortKeys->comparator = state->base.sortKeys->abbrev_full_comparator;

870

871 /* Not strictly necessary, but be tidy */

872 state->base.sortKeys->abbrev_abort = NULL;

873 state->base.sortKeys->abbrev_full_comparator = NULL;

874}

875

876/*

877 * tuplesort_used_bound

878 *

879 * Allow callers to find out if the sort state was able to use a bound.

880 */

881bool

882 tuplesort_used_bound(Tuplesortstate *state)

883{

884 return state->boundUsed;

885}

886

887/*

888 * tuplesort_free

889 *

890 * Internal routine for freeing resources of tuplesort.

891 */

892static void

893 tuplesort_free(Tuplesortstate *state)

894{

895 /* context swap probably not needed, but let's be safe */

896 MemoryContext oldcontext = MemoryContextSwitchTo(state->base.sortcontext);

897 int64 spaceUsed;

898

899 if (state->tapeset)

900 spaceUsed = LogicalTapeSetBlocks(state->tapeset);

901 else

902 spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;

903

904 /*

905 * Delete temporary "tape" files, if any.

906 *

907 * We don't bother to destroy the individual tapes here. They will go away

908 * with the sortcontext. (In TSS_FINALMERGE state, we have closed

909 * finished tapes already.)

910 */

911 if (state->tapeset)

912 LogicalTapeSetClose(state->tapeset);

913

914 if (trace_sort)

915 {

916 if (state->tapeset)

917 elog(LOG, "%s of worker %d ended, %" PRId64 " disk blocks used: %s",

918 SERIAL(state) ? "external sort" : "parallel external sort",

919 state->worker, spaceUsed, pg_rusage_show(&state->ru_start));

920 else

921 elog(LOG, "%s of worker %d ended, %" PRId64 " KB used: %s",

922 SERIAL(state) ? "internal sort" : "unperformed parallel sort",

923 state->worker, spaceUsed, pg_rusage_show(&state->ru_start));

924 }

925

926 TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, spaceUsed);

927

928 FREESTATE(state);

929 MemoryContextSwitchTo(oldcontext);

930

931 /*

932 * Free the per-sort memory context, thereby releasing all working memory.

933 */

934 MemoryContextReset(state->base.sortcontext);

935}

936

937/*

938 * tuplesort_end

939 *

940 * Release resources and clean up.

941 *

942 * NOTE: after calling this, any pointers returned by tuplesort_getXXX are

943 * pointing to garbage. Be careful not to attempt to use or free such

944 * pointers afterwards!

945 */

946void

947 tuplesort_end(Tuplesortstate *state)

948{

949 tuplesort_free(state);

950

951 /*

952 * Free the main memory context, including the Tuplesortstate struct

953 * itself.

954 */

955 MemoryContextDelete(state->base.maincontext);

956}

957

958/*

959 * tuplesort_updatemax

960 *

961 * Update maximum resource usage statistics.

962 */

963static void

964 tuplesort_updatemax(Tuplesortstate *state)

965{

966 int64 spaceUsed;

967 bool isSpaceDisk;

968

969 /*

970 * Note: it might seem we should provide both memory and disk usage for a

971 * disk-based sort. However, the current code doesn't track memory space

972 * accurately once we have begun to return tuples to the caller (since we

973 * don't account for pfree's the caller is expected to do), so we cannot

974 * rely on availMem in a disk sort. This does not seem worth the overhead

975 * to fix. Is it worth creating an API for the memory context code to

976 * tell us how much is actually used in sortcontext?

977 */

978 if (state->tapeset)

979 {

980 isSpaceDisk = true;

981 spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;

982 }

983 else

984 {

985 isSpaceDisk = false;

986 spaceUsed = state->allowedMem - state->availMem;

987 }

988

989 /*

990 * Sort evicts data to the disk when it wasn't able to fit that data into

991 * main memory. This is why we assume space used on the disk to be more

992 * important for tracking resource usage than space used in memory. Note

993 * that the amount of space occupied by some tupleset on the disk might be

994 * less than amount of space occupied by the same tupleset in memory due

995 * to more compact representation.

996 */

997 if ((isSpaceDisk && !state->isMaxSpaceDisk) ||

998 (isSpaceDisk == state->isMaxSpaceDisk && spaceUsed > state->maxSpace))

999 {

1000 state->maxSpace = spaceUsed;

1001 state->isMaxSpaceDisk = isSpaceDisk;

1002 state->maxSpaceStatus = state->status;

1003 }

1004}

1005

1006/*

1007 * tuplesort_reset

1008 *

1009 * Reset the tuplesort. Reset all the data in the tuplesort, but leave the

1010 * meta-information in. After tuplesort_reset, tuplesort is ready to start

1011 * a new sort. This allows avoiding recreation of tuple sort states (and

1012 * save resources) when sorting multiple small batches.

1013 */

1014void

1015 tuplesort_reset(Tuplesortstate *state)

1016{

1017 tuplesort_updatemax(state);

1018 tuplesort_free(state);

1019

1020 /*

1021 * After we've freed up per-batch memory, re-setup all of the state common

1022 * to both the first batch and any subsequent batch.

1023 */

1024 tuplesort_begin_batch(state);

1025

1026 state->lastReturnedTuple = NULL;

1027 state->slabMemoryBegin = NULL;

1028 state->slabMemoryEnd = NULL;

1029 state->slabFreeHead = NULL;

1030}

1031

1032/*

1033 * Grow the memtuples[] array, if possible within our memory constraint. We

1034 * must not exceed INT_MAX tuples in memory or the caller-provided memory

1035 * limit. Return true if we were able to enlarge the array, false if not.

1036 *

1037 * Normally, at each increment we double the size of the array. When doing

1038 * that would exceed a limit, we attempt one last, smaller increase (and then

1039 * clear the growmemtuples flag so we don't try any more). That allows us to

1040 * use memory as fully as permitted; sticking to the pure doubling rule could

1041 * result in almost half going unused. Because availMem moves around with

1042 * tuple addition/removal, we need some rule to prevent making repeated small

1043 * increases in memtupsize, which would just be useless thrashing. The

1044 * growmemtuples flag accomplishes that and also prevents useless

1045 * recalculations in this function.

1046 */

1047static bool

1048 grow_memtuples(Tuplesortstate *state)

1049{

1050 int newmemtupsize;

1051 int memtupsize = state->memtupsize;

1052 int64 memNowUsed = state->allowedMem - state->availMem;

1053

1054 /* Forget it if we've already maxed out memtuples, per comment above */

1055 if (!state->growmemtuples)

1056 return false;

1057

1058 /* Select new value of memtupsize */

1059 if (memNowUsed <= state->availMem)

1060 {

1061 /*

1062 * We've used no more than half of allowedMem; double our usage,

1063 * clamping at INT_MAX tuples.

1064 */

1065 if (memtupsize < INT_MAX / 2)

1066 newmemtupsize = memtupsize * 2;

1067 else

1068 {

1069 newmemtupsize = INT_MAX;

1070 state->growmemtuples = false;

1071 }

1072 }

1073 else

1074 {

1075 /*

1076 * This will be the last increment of memtupsize. Abandon doubling

1077 * strategy and instead increase as much as we safely can.

1078 *

1079 * To stay within allowedMem, we can't increase memtupsize by more

1080 * than availMem / sizeof(SortTuple) elements. In practice, we want

1081 * to increase it by considerably less, because we need to leave some

1082 * space for the tuples to which the new array slots will refer. We

1083 * assume the new tuples will be about the same size as the tuples

1084 * we've already seen, and thus we can extrapolate from the space

1085 * consumption so far to estimate an appropriate new size for the

1086 * memtuples array. The optimal value might be higher or lower than

1087 * this estimate, but it's hard to know that in advance. We again

1088 * clamp at INT_MAX tuples.

1089 *

1090 * This calculation is safe against enlarging the array so much that

1091 * LACKMEM becomes true, because the memory currently used includes

1092 * the present array; thus, there would be enough allowedMem for the

1093 * new array elements even if no other memory were currently used.

1094 *

1095 * We do the arithmetic in float8, because otherwise the product of

1096 * memtupsize and allowedMem could overflow. Any inaccuracy in the

1097 * result should be insignificant; but even if we computed a

1098 * completely insane result, the checks below will prevent anything

1099 * really bad from happening.

1100 */

1101 double grow_ratio;

1102

1103 grow_ratio = (double) state->allowedMem / (double) memNowUsed;

1104 if (memtupsize * grow_ratio < INT_MAX)

1105 newmemtupsize = (int) (memtupsize * grow_ratio);

1106 else

1107 newmemtupsize = INT_MAX;

1108

1109 /* We won't make any further enlargement attempts */

1110 state->growmemtuples = false;

1111 }

1112

1113 /* Must enlarge array by at least one element, else report failure */

1114 if (newmemtupsize <= memtupsize)

1115 goto noalloc;

1116

1117 /*

1118 * On a 32-bit machine, allowedMem could exceed MaxAllocHugeSize. Clamp

1119 * to ensure our request won't be rejected. Note that we can easily

1120 * exhaust address space before facing this outcome. (This is presently

1121 * impossible due to guc.c's MAX_KILOBYTES limitation on work_mem, but

1122 * don't rely on that at this distance.)

1123 */

1124 if ((Size) newmemtupsize >= MaxAllocHugeSize / sizeof(SortTuple))

1125 {

1126 newmemtupsize = (int) (MaxAllocHugeSize / sizeof(SortTuple));

1127 state->growmemtuples = false; /* can't grow any more */

1128 }

1129

1130 /*

1131 * We need to be sure that we do not cause LACKMEM to become true, else

1132 * the space management algorithm will go nuts. The code above should

1133 * never generate a dangerous request, but to be safe, check explicitly

1134 * that the array growth fits within availMem. (We could still cause

1135 * LACKMEM if the memory chunk overhead associated with the memtuples

1136 * array were to increase. That shouldn't happen because we chose the

1137 * initial array size large enough to ensure that palloc will be treating

1138 * both old and new arrays as separate chunks. But we'll check LACKMEM

1139 * explicitly below just in case.)

1140 */

1141 if (state->availMem < (int64) ((newmemtupsize - memtupsize) * sizeof(SortTuple)))

1142 goto noalloc;

1143

1144 /* OK, do it */

1145 FREEMEM(state, GetMemoryChunkSpace(state->memtuples));

1146 state->memtupsize = newmemtupsize;

1147 state->memtuples = (SortTuple *)

1148 repalloc_huge(state->memtuples,

1149 state->memtupsize * sizeof(SortTuple));

1150 USEMEM(state, GetMemoryChunkSpace(state->memtuples));

1151 if (LACKMEM(state))

1152 elog(ERROR, "unexpected out-of-memory situation in tuplesort");

1153 return true;

1154

1155noalloc:

1156 /* If for any reason we didn't realloc, shut off future attempts */

1157 state->growmemtuples = false;

1158 return false;

1159}

1160

1161/*

1162 * Shared code for tuple and datum cases.

1163 */

1164void

1165 tuplesort_puttuple_common(Tuplesortstate *state, SortTuple *tuple,

1166 bool useAbbrev, Size tuplen)

1167{

1168 MemoryContext oldcontext = MemoryContextSwitchTo(state->base.sortcontext);

1169

1170 Assert(!LEADER(state));

1171

1172 /* account for the memory used for this tuple */

1173 USEMEM(state, tuplen);

1174 state->tupleMem += tuplen;

1175

1176 if (!useAbbrev)

1177 {

1178 /*

1179 * Leave ordinary Datum representation, or NULL value. If there is a

1180 * converter it won't expect NULL values, and cost model is not

1181 * required to account for NULL, so in that case we avoid calling

1182 * converter and just set datum1 to zeroed representation (to be

1183 * consistent, and to support cheap inequality tests for NULL

1184 * abbreviated keys).

1185 */

1186 }

1187 else if (!consider_abort_common(state))

1188 {

1189 /* Store abbreviated key representation */

1190 tuple->datum1 = state->base.sortKeys->abbrev_converter(tuple->datum1,

1191 state->base.sortKeys);

1192 }

1193 else

1194 {

1195 /*

1196 * Set state to be consistent with never trying abbreviation.

1197 *

1198 * Alter datum1 representation in already-copied tuples, so as to

1199 * ensure a consistent representation (current tuple was just

1200 * handled). It does not matter if some dumped tuples are already

1201 * sorted on tape, since serialized tuples lack abbreviated keys

1202 * (TSS_BUILDRUNS state prevents control reaching here in any case).

1203 */

1204 REMOVEABBREV(state, state->memtuples, state->memtupcount);

1205 }

1206

1207 switch (state->status)

1208 {

1209 case TSS_INITIAL:

1210

1211 /*

1212 * Save the tuple into the unsorted array. First, grow the array

1213 * as needed. Note that we try to grow the array when there is

1214 * still one free slot remaining --- if we fail, there'll still be

1215 * room to store the incoming tuple, and then we'll switch to

1216 * tape-based operation.

1217 */

1218 if (state->memtupcount >= state->memtupsize - 1)

1219 {

1220 (void) grow_memtuples(state);

1221 Assert(state->memtupcount < state->memtupsize);

1222 }

1223 state->memtuples[state->memtupcount++] = *tuple;

1224

1225 /*

1226 * Check if it's time to switch over to a bounded heapsort. We do

1227 * so if the input tuple count exceeds twice the desired tuple

1228 * count (this is a heuristic for where heapsort becomes cheaper

1229 * than a quicksort), or if we've just filled workMem and have

1230 * enough tuples to meet the bound.

1231 *

1232 * Note that once we enter TSS_BOUNDED state we will always try to

1233 * complete the sort that way. In the worst case, if later input

1234 * tuples are larger than earlier ones, this might cause us to

1235 * exceed workMem significantly.

1236 */

1237 if (state->bounded &&

1238 (state->memtupcount > state->bound * 2 ||

1239 (state->memtupcount > state->bound && LACKMEM(state))))

1240 {

1241 if (trace_sort)

1242 elog(LOG, "switching to bounded heapsort at %d tuples: %s",

1243 state->memtupcount,

1244 pg_rusage_show(&state->ru_start));

1245 make_bounded_heap(state);

1246 MemoryContextSwitchTo(oldcontext);

1247 return;

1248 }

1249

1250 /*

1251 * Done if we still fit in available memory and have array slots.

1252 */

1253 if (state->memtupcount < state->memtupsize && !LACKMEM(state))

1254 {

1255 MemoryContextSwitchTo(oldcontext);

1256 return;

1257 }

1258

1259 /*

1260 * Nope; time to switch to tape-based operation.

1261 */

1262 inittapes(state, true);

1263

1264 /*

1265 * Dump all tuples.

1266 */

1267 dumptuples(state, false);

1268 break;

1269

1270 case TSS_BOUNDED:

1271

1272 /*

1273 * We don't want to grow the array here, so check whether the new

1274 * tuple can be discarded before putting it in. This should be a

1275 * good speed optimization, too, since when there are many more

1276 * input tuples than the bound, most input tuples can be discarded

1277 * with just this one comparison. Note that because we currently

1278 * have the sort direction reversed, we must check for <= not >=.

1279 */

1280 if (COMPARETUP(state, tuple, &state->memtuples[0]) <= 0)

1281 {

1282 /* new tuple <= top of the heap, so we can discard it */

1283 free_sort_tuple(state, tuple);

1284 CHECK_FOR_INTERRUPTS();

1285 }

1286 else

1287 {

1288 /* discard top of heap, replacing it with the new tuple */

1289 free_sort_tuple(state, &state->memtuples[0]);

1290 tuplesort_heap_replace_top(state, tuple);

1291 }

1292 break;

1293

1294 case TSS_BUILDRUNS:

1295

1296 /*

1297 * Save the tuple into the unsorted array (there must be space)

1298 */

1299 state->memtuples[state->memtupcount++] = *tuple;

1300

1301 /*

1302 * If we are over the memory limit, dump all tuples.

1303 */

1304 dumptuples(state, false);

1305 break;

1306

1307 default:

1308 elog(ERROR, "invalid tuplesort state");

1309 break;

1310 }

1311 MemoryContextSwitchTo(oldcontext);

1312}

1313

1314static bool

1315 consider_abort_common(Tuplesortstate *state)

1316{

1317 Assert(state->base.sortKeys[0].abbrev_converter != NULL);

1318 Assert(state->base.sortKeys[0].abbrev_abort != NULL);

1319 Assert(state->base.sortKeys[0].abbrev_full_comparator != NULL);

1320

1321 /*

1322 * Check effectiveness of abbreviation optimization. Consider aborting

1323 * when still within memory limit.

1324 */

1325 if (state->status == TSS_INITIAL &&

1326 state->memtupcount >= state->abbrevNext)

1327 {

1328 state->abbrevNext *= 2;

1329

1330 /*

1331 * Check opclass-supplied abbreviation abort routine. It may indicate

1332 * that abbreviation should not proceed.

1333 */

1334 if (!state->base.sortKeys->abbrev_abort(state->memtupcount,

1335 state->base.sortKeys))

1336 return false;

1337

1338 /*

1339 * Finally, restore authoritative comparator, and indicate that

1340 * abbreviation is not in play by setting abbrev_converter to NULL

1341 */

1342 state->base.sortKeys[0].comparator = state->base.sortKeys[0].abbrev_full_comparator;

1343 state->base.sortKeys[0].abbrev_converter = NULL;

1344 /* Not strictly necessary, but be tidy */

1345 state->base.sortKeys[0].abbrev_abort = NULL;

1346 state->base.sortKeys[0].abbrev_full_comparator = NULL;

1347

1348 /* Give up - expect original pass-by-value representation */

1349 return true;

1350 }

1351

1352 return false;

1353}

1354

1355/*

1356 * All tuples have been provided; finish the sort.

1357 */

1358void

1359 tuplesort_performsort(Tuplesortstate *state)

1360{

1361 MemoryContext oldcontext = MemoryContextSwitchTo(state->base.sortcontext);

1362

1363 if (trace_sort)

1364 elog(LOG, "performsort of worker %d starting: %s",

1365 state->worker, pg_rusage_show(&state->ru_start));

1366

1367 switch (state->status)

1368 {

1369 case TSS_INITIAL:

1370

1371 /*

1372 * We were able to accumulate all the tuples within the allowed

1373 * amount of memory, or leader to take over worker tapes

1374 */

1375 if (SERIAL(state))

1376 {

1377 /* Just qsort 'em and we're done */

1378 tuplesort_sort_memtuples(state);

1379 state->status = TSS_SORTEDINMEM;

1380 }

1381 else if (WORKER(state))

1382 {

1383 /*

1384 * Parallel workers must still dump out tuples to tape. No

1385 * merge is required to produce single output run, though.

1386 */

1387 inittapes(state, false);

1388 dumptuples(state, true);

1389 worker_nomergeruns(state);

1390 state->status = TSS_SORTEDONTAPE;

1391 }

1392 else

1393 {

1394 /*

1395 * Leader will take over worker tapes and merge worker runs.

1396 * Note that mergeruns sets the correct state->status.

1397 */

1398 leader_takeover_tapes(state);

1399 mergeruns(state);

1400 }

1401 state->current = 0;

1402 state->eof_reached = false;

1403 state->markpos_block = 0L;

1404 state->markpos_offset = 0;

1405 state->markpos_eof = false;

1406 break;

1407

1408 case TSS_BOUNDED:

1409

1410 /*

1411 * We were able to accumulate all the tuples required for output

1412 * in memory, using a heap to eliminate excess tuples. Now we

1413 * have to transform the heap to a properly-sorted array. Note

1414 * that sort_bounded_heap sets the correct state->status.

1415 */

1416 sort_bounded_heap(state);

1417 state->current = 0;

1418 state->eof_reached = false;

1419 state->markpos_offset = 0;

1420 state->markpos_eof = false;

1421 break;

1422

1423 case TSS_BUILDRUNS:

1424

1425 /*

1426 * Finish tape-based sort. First, flush all tuples remaining in

1427 * memory out to tape; then merge until we have a single remaining

1428 * run (or, if !randomAccess and !WORKER(), one run per tape).

1429 * Note that mergeruns sets the correct state->status.

1430 */

1431 dumptuples(state, true);

1432 mergeruns(state);

1433 state->eof_reached = false;

1434 state->markpos_block = 0L;

1435 state->markpos_offset = 0;

1436 state->markpos_eof = false;

1437 break;

1438

1439 default:

1440 elog(ERROR, "invalid tuplesort state");

1441 break;

1442 }

1443

1444 if (trace_sort)

1445 {

1446 if (state->status == TSS_FINALMERGE)

1447 elog(LOG, "performsort of worker %d done (except %d-way final merge): %s",

1448 state->worker, state->nInputTapes,

1449 pg_rusage_show(&state->ru_start));

1450 else

1451 elog(LOG, "performsort of worker %d done: %s",

1452 state->worker, pg_rusage_show(&state->ru_start));

1453 }

1454

1455 MemoryContextSwitchTo(oldcontext);

1456}

1457

1458/*

1459 * Internal routine to fetch the next tuple in either forward or back

1460 * direction into *stup. Returns false if no more tuples.

1461 * Returned tuple belongs to tuplesort memory context, and must not be freed

1462 * by caller. Note that fetched tuple is stored in memory that may be

1463 * recycled by any future fetch.

1464 */

1465bool

1466 tuplesort_gettuple_common(Tuplesortstate *state, bool forward,

1467 SortTuple *stup)

1468{

1469 unsigned int tuplen;

1470 size_t nmoved;

1471

1472 Assert(!WORKER(state));

1473

1474 switch (state->status)

1475 {

1476 case TSS_SORTEDINMEM:

1477 Assert(forward || state->base.sortopt & TUPLESORT_RANDOMACCESS);

1478 Assert(!state->slabAllocatorUsed);

1479 if (forward)

1480 {

1481 if (state->current < state->memtupcount)

1482 {

1483 *stup = state->memtuples[state->current++];

1484 return true;

1485 }

1486 state->eof_reached = true;

1487

1488 /*

1489 * Complain if caller tries to retrieve more tuples than

1490 * originally asked for in a bounded sort. This is because

1491 * returning EOF here might be the wrong thing.

1492 */

1493 if (state->bounded && state->current >= state->bound)

1494 elog(ERROR, "retrieved too many tuples in a bounded sort");

1495

1496 return false;

1497 }

1498 else

1499 {

1500 if (state->current <= 0)

1501 return false;

1502

1503 /*

1504 * if all tuples are fetched already then we return last

1505 * tuple, else - tuple before last returned.

1506 */

1507 if (state->eof_reached)

1508 state->eof_reached = false;

1509 else

1510 {

1511 state->current--; /* last returned tuple */

1512 if (state->current <= 0)

1513 return false;

1514 }

1515 *stup = state->memtuples[state->current - 1];

1516 return true;

1517 }

1518 break;

1519

1520 case TSS_SORTEDONTAPE:

1521 Assert(forward || state->base.sortopt & TUPLESORT_RANDOMACCESS);

1522 Assert(state->slabAllocatorUsed);

1523

1524 /*

1525 * The slot that held the tuple that we returned in previous

1526 * gettuple call can now be reused.

1527 */

1528 if (state->lastReturnedTuple)

1529 {

1530 RELEASE_SLAB_SLOT(state, state->lastReturnedTuple);

1531 state->lastReturnedTuple = NULL;

1532 }

1533

1534 if (forward)

1535 {

1536 if (state->eof_reached)

1537 return false;

1538

1539 if ((tuplen = getlen(state->result_tape, true)) != 0)

1540 {

1541 READTUP(state, stup, state->result_tape, tuplen);

1542

1543 /*

1544 * Remember the tuple we return, so that we can recycle

1545 * its memory on next call. (This can be NULL, in the

1546 * !state->tuples case).

1547 */

1548 state->lastReturnedTuple = stup->tuple;

1549

1550 return true;

1551 }

1552 else

1553 {

1554 state->eof_reached = true;

1555 return false;

1556 }

1557 }

1558

1559 /*

1560 * Backward.

1561 *

1562 * if all tuples are fetched already then we return last tuple,

1563 * else - tuple before last returned.

1564 */

1565 if (state->eof_reached)

1566 {

1567 /*

1568 * Seek position is pointing just past the zero tuplen at the

1569 * end of file; back up to fetch last tuple's ending length

1570 * word. If seek fails we must have a completely empty file.

1571 */

1572 nmoved = LogicalTapeBackspace(state->result_tape,

1573 2 * sizeof(unsigned int));

1574 if (nmoved == 0)

1575 return false;

1576 else if (nmoved != 2 * sizeof(unsigned int))

1577 elog(ERROR, "unexpected tape position");

1578 state->eof_reached = false;

1579 }

1580 else

1581 {

1582 /*

1583 * Back up and fetch previously-returned tuple's ending length

1584 * word. If seek fails, assume we are at start of file.

1585 */

1586 nmoved = LogicalTapeBackspace(state->result_tape,

1587 sizeof(unsigned int));

1588 if (nmoved == 0)

1589 return false;

1590 else if (nmoved != sizeof(unsigned int))

1591 elog(ERROR, "unexpected tape position");

1592 tuplen = getlen(state->result_tape, false);

1593

1594 /*

1595 * Back up to get ending length word of tuple before it.

1596 */

1597 nmoved = LogicalTapeBackspace(state->result_tape,

1598 tuplen + 2 * sizeof(unsigned int));

1599 if (nmoved == tuplen + sizeof(unsigned int))

1600 {

1601 /*

1602 * We backed up over the previous tuple, but there was no

1603 * ending length word before it. That means that the prev

1604 * tuple is the first tuple in the file. It is now the

1605 * next to read in forward direction (not obviously right,

1606 * but that is what in-memory case does).

1607 */

1608 return false;

1609 }

1610 else if (nmoved != tuplen + 2 * sizeof(unsigned int))

1611 elog(ERROR, "bogus tuple length in backward scan");

1612 }

1613

1614 tuplen = getlen(state->result_tape, false);

1615

1616 /*

1617 * Now we have the length of the prior tuple, back up and read it.

1618 * Note: READTUP expects we are positioned after the initial

1619 * length word of the tuple, so back up to that point.

1620 */

1621 nmoved = LogicalTapeBackspace(state->result_tape,

1622 tuplen);

1623 if (nmoved != tuplen)

1624 elog(ERROR, "bogus tuple length in backward scan");

1625 READTUP(state, stup, state->result_tape, tuplen);

1626

1627 /*

1628 * Remember the tuple we return, so that we can recycle its memory

1629 * on next call. (This can be NULL, in the Datum case).

1630 */

1631 state->lastReturnedTuple = stup->tuple;

1632

1633 return true;

1634

1635 case TSS_FINALMERGE:

1636 Assert(forward);

1637 /* We are managing memory ourselves, with the slab allocator. */

1638 Assert(state->slabAllocatorUsed);

1639

1640 /*

1641 * The slab slot holding the tuple that we returned in previous

1642 * gettuple call can now be reused.

1643 */

1644 if (state->lastReturnedTuple)

1645 {

1646 RELEASE_SLAB_SLOT(state, state->lastReturnedTuple);

1647 state->lastReturnedTuple = NULL;

1648 }

1649

1650 /*

1651 * This code should match the inner loop of mergeonerun().

1652 */

1653 if (state->memtupcount > 0)

1654 {

1655 int srcTapeIndex = state->memtuples[0].srctape;

1656 LogicalTape *srcTape = state->inputTapes[srcTapeIndex];

1657 SortTuple newtup;

1658

1659 *stup = state->memtuples[0];

1660

1661 /*

1662 * Remember the tuple we return, so that we can recycle its

1663 * memory on next call. (This can be NULL, in the Datum case).

1664 */

1665 state->lastReturnedTuple = stup->tuple;

1666

1667 /*

1668 * Pull next tuple from tape, and replace the returned tuple

1669 * at top of the heap with it.

1670 */

1671 if (!mergereadnext(state, srcTape, &newtup))

1672 {

1673 /*

1674 * If no more data, we've reached end of run on this tape.

1675 * Remove the top node from the heap.

1676 */

1677 tuplesort_heap_delete_top(state);

1678 state->nInputRuns--;

1679

1680 /*

1681 * Close the tape. It'd go away at the end of the sort

1682 * anyway, but better to release the memory early.

1683 */

1684 LogicalTapeClose(srcTape);

1685 return true;

1686 }

1687 newtup.srctape = srcTapeIndex;

1688 tuplesort_heap_replace_top(state, &newtup);

1689 return true;

1690 }

1691 return false;

1692

1693 default:

1694 elog(ERROR, "invalid tuplesort state");

1695 return false; /* keep compiler quiet */

1696 }

1697}

1698

1699

1700/*

1701 * Advance over N tuples in either forward or back direction,

1702 * without returning any data. N==0 is a no-op.

1703 * Returns true if successful, false if ran out of tuples.

1704 */

1705bool

1706 tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples, bool forward)

1707{

1708 MemoryContext oldcontext;

1709

1710 /*

1711 * We don't actually support backwards skip yet, because no callers need

1712 * it. The API is designed to allow for that later, though.

1713 */

1714 Assert(forward);

1715 Assert(ntuples >= 0);

1716 Assert(!WORKER(state));

1717

1718 switch (state->status)

1719 {

1720 case TSS_SORTEDINMEM:

1721 if (state->memtupcount - state->current >= ntuples)

1722 {

1723 state->current += ntuples;

1724 return true;

1725 }

1726 state->current = state->memtupcount;

1727 state->eof_reached = true;

1728

1729 /*

1730 * Complain if caller tries to retrieve more tuples than

1731 * originally asked for in a bounded sort. This is because

1732 * returning EOF here might be the wrong thing.

1733 */

1734 if (state->bounded && state->current >= state->bound)

1735 elog(ERROR, "retrieved too many tuples in a bounded sort");

1736

1737 return false;

1738

1739 case TSS_SORTEDONTAPE:

1740 case TSS_FINALMERGE:

1741

1742 /*

1743 * We could probably optimize these cases better, but for now it's

1744 * not worth the trouble.

1745 */

1746 oldcontext = MemoryContextSwitchTo(state->base.sortcontext);

1747 while (ntuples-- > 0)

1748 {

1749 SortTuple stup;

1750

1751 if (!tuplesort_gettuple_common(state, forward, &stup))

1752 {

1753 MemoryContextSwitchTo(oldcontext);

1754 return false;

1755 }

1756 CHECK_FOR_INTERRUPTS();

1757 }

1758 MemoryContextSwitchTo(oldcontext);

1759 return true;

1760

1761 default:

1762 elog(ERROR, "invalid tuplesort state");

1763 return false; /* keep compiler quiet */

1764 }

1765}

1766

1767/*

1768 * tuplesort_merge_order - report merge order we'll use for given memory

1769 * (note: "merge order" just means the number of input tapes in the merge).

1770 *

1771 * This is exported for use by the planner. allowedMem is in bytes.

1772 */

1773int

1774 tuplesort_merge_order(int64 allowedMem)

1775{

1776 int mOrder;

1777

1778 /*----------

1779 * In the merge phase, we need buffer space for each input and output tape.

1780 * Each pass in the balanced merge algorithm reads from M input tapes, and

1781 * writes to N output tapes. Each tape consumes TAPE_BUFFER_OVERHEAD bytes

1782 * of memory. In addition to that, we want MERGE_BUFFER_SIZE workspace per

1783 * input tape.

1784 *

1785 * totalMem = M * (TAPE_BUFFER_OVERHEAD + MERGE_BUFFER_SIZE) +

1786 * N * TAPE_BUFFER_OVERHEAD

1787 *

1788 * Except for the last and next-to-last merge passes, where there can be

1789 * fewer tapes left to process, M = N. We choose M so that we have the

1790 * desired amount of memory available for the input buffers

1791 * (TAPE_BUFFER_OVERHEAD + MERGE_BUFFER_SIZE), given the total memory

1792 * available for the tape buffers (allowedMem).

1793 *

1794 * Note: you might be thinking we need to account for the memtuples[]

1795 * array in this calculation, but we effectively treat that as part of the

1796 * MERGE_BUFFER_SIZE workspace.

1797 *----------

1798 */

1799 mOrder = allowedMem /

1800 (2 * TAPE_BUFFER_OVERHEAD + MERGE_BUFFER_SIZE);

1801

1802 /*

1803 * Even in minimum memory, use at least a MINORDER merge. On the other

1804 * hand, even when we have lots of memory, do not use more than a MAXORDER

1805 * merge. Tapes are pretty cheap, but they're not entirely free. Each

1806 * additional tape reduces the amount of memory available to build runs,

1807 * which in turn can cause the same sort to need more runs, which makes

1808 * merging slower even if it can still be done in a single pass. Also,

1809 * high order merges are quite slow due to CPU cache effects; it can be

1810 * faster to pay the I/O cost of a multi-pass merge than to perform a

1811 * single merge pass across many hundreds of tapes.

1812 */

1813 mOrder = Max(mOrder, MINORDER);

1814 mOrder = Min(mOrder, MAXORDER);

1815

1816 return mOrder;

1817}

1818

1819/*

1820 * Helper function to calculate how much memory to allocate for the read buffer

1821 * of each input tape in a merge pass.

1822 *

1823 * 'avail_mem' is the amount of memory available for the buffers of all the

1824 * tapes, both input and output.

1825 * 'nInputTapes' and 'nInputRuns' are the number of input tapes and runs.

1826 * 'maxOutputTapes' is the max. number of output tapes we should produce.

1827 */

1828static int64

1829 merge_read_buffer_size(int64 avail_mem, int nInputTapes, int nInputRuns,

1830 int maxOutputTapes)

1831{

1832 int nOutputRuns;

1833 int nOutputTapes;

1834

1835 /*

1836 * How many output tapes will we produce in this pass?

1837 *

1838 * This is nInputRuns / nInputTapes, rounded up.

1839 */

1840 nOutputRuns = (nInputRuns + nInputTapes - 1) / nInputTapes;

1841

1842 nOutputTapes = Min(nOutputRuns, maxOutputTapes);

1843

1844 /*

1845 * Each output tape consumes TAPE_BUFFER_OVERHEAD bytes of memory. All

1846 * remaining memory is divided evenly between the input tapes.

1847 *

1848 * This also follows from the formula in tuplesort_merge_order, but here

1849 * we derive the input buffer size from the amount of memory available,

1850 * and M and N.

1851 */

1852 return Max((avail_mem - TAPE_BUFFER_OVERHEAD * nOutputTapes) / nInputTapes, 0);

1853}

1854

1855/*

1856 * inittapes - initialize for tape sorting.

1857 *

1858 * This is called only if we have found we won't sort in memory.

1859 */

1860static void

1861 inittapes(Tuplesortstate *state, bool mergeruns)

1862{

1863 Assert(!LEADER(state));

1864

1865 if (mergeruns)

1866 {

1867 /* Compute number of input tapes to use when merging */

1868 state->maxTapes = tuplesort_merge_order(state->allowedMem);

1869 }

1870 else

1871 {

1872 /* Workers can sometimes produce single run, output without merge */

1873 Assert(WORKER(state));

1874 state->maxTapes = MINORDER;

1875 }

1876

1877 if (trace_sort)

1878 elog(LOG, "worker %d switching to external sort with %d tapes: %s",

1879 state->worker, state->maxTapes, pg_rusage_show(&state->ru_start));

1880

1881 /* Create the tape set */

1882 inittapestate(state, state->maxTapes);

1883 state->tapeset =

1884 LogicalTapeSetCreate(false,

1885 state->shared ? &state->shared->fileset : NULL,

1886 state->worker);

1887

1888 state->currentRun = 0;

1889

1890 /*

1891 * Initialize logical tape arrays.

1892 */

1893 state->inputTapes = NULL;

1894 state->nInputTapes = 0;

1895 state->nInputRuns = 0;

1896

1897 state->outputTapes = palloc0(state->maxTapes * sizeof(LogicalTape *));

1898 state->nOutputTapes = 0;

1899 state->nOutputRuns = 0;

1900

1901 state->status = TSS_BUILDRUNS;

1902

1903 selectnewtape(state);

1904}

1905

1906/*

1907 * inittapestate - initialize generic tape management state

1908 */

1909static void

1910 inittapestate(Tuplesortstate *state, int maxTapes)

1911{

1912 int64 tapeSpace;

1913

1914 /*

1915 * Decrease availMem to reflect the space needed for tape buffers; but

1916 * don't decrease it to the point that we have no room for tuples. (That

1917 * case is only likely to occur if sorting pass-by-value Datums; in all

1918 * other scenarios the memtuples[] array is unlikely to occupy more than

1919 * half of allowedMem. In the pass-by-value case it's not important to

1920 * account for tuple space, so we don't care if LACKMEM becomes

1921 * inaccurate.)

1922 */

1923 tapeSpace = (int64) maxTapes * TAPE_BUFFER_OVERHEAD;

1924

1925 if (tapeSpace + GetMemoryChunkSpace(state->memtuples) < state->allowedMem)

1926 USEMEM(state, tapeSpace);

1927

1928 /*

1929 * Make sure that the temp file(s) underlying the tape set are created in

1930 * suitable temp tablespaces. For parallel sorts, this should have been

1931 * called already, but it doesn't matter if it is called a second time.

1932 */

1933 PrepareTempTablespaces();

1934}

1935

1936/*

1937 * selectnewtape -- select next tape to output to.

1938 *

1939 * This is called after finishing a run when we know another run

1940 * must be started. This is used both when building the initial

1941 * runs, and during merge passes.

1942 */

1943static void

1944 selectnewtape(Tuplesortstate *state)

1945{

1946 /*

1947 * At the beginning of each merge pass, nOutputTapes and nOutputRuns are

1948 * both zero. On each call, we create a new output tape to hold the next

1949 * run, until maxTapes is reached. After that, we assign new runs to the

1950 * existing tapes in a round robin fashion.

1951 */

1952 if (state->nOutputTapes < state->maxTapes)

1953 {

1954 /* Create a new tape to hold the next run */

1955 Assert(state->outputTapes[state->nOutputRuns] == NULL);

1956 Assert(state->nOutputRuns == state->nOutputTapes);

1957 state->destTape = LogicalTapeCreate(state->tapeset);

1958 state->outputTapes[state->nOutputTapes] = state->destTape;

1959 state->nOutputTapes++;

1960 state->nOutputRuns++;

1961 }

1962 else

1963 {

1964 /*

1965 * We have reached the max number of tapes. Append to an existing

1966 * tape.

1967 */

1968 state->destTape = state->outputTapes[state->nOutputRuns % state->nOutputTapes];

1969 state->nOutputRuns++;

1970 }

1971}

1972

1973/*

1974 * Initialize the slab allocation arena, for the given number of slots.

1975 */

1976static void

1977 init_slab_allocator(Tuplesortstate *state, int numSlots)

1978{

1979 if (numSlots > 0)

1980 {

1981 char *p;

1982 int i;

1983

1984 state->slabMemoryBegin = palloc(numSlots * SLAB_SLOT_SIZE);

1985 state->slabMemoryEnd = state->slabMemoryBegin +

1986 numSlots * SLAB_SLOT_SIZE;

1987 state->slabFreeHead = (SlabSlot *) state->slabMemoryBegin;

1988 USEMEM(state, numSlots * SLAB_SLOT_SIZE);

1989

1990 p = state->slabMemoryBegin;

1991 for (i = 0; i < numSlots - 1; i++)

1992 {

1993 ((SlabSlot *) p)->nextfree = (SlabSlot *) (p + SLAB_SLOT_SIZE);

1994 p += SLAB_SLOT_SIZE;

1995 }

1996 ((SlabSlot *) p)->nextfree = NULL;

1997 }

1998 else

1999 {

2000 state->slabMemoryBegin = state->slabMemoryEnd = NULL;

2001 state->slabFreeHead = NULL;

2002 }

2003 state->slabAllocatorUsed = true;

2004}

2005

2006/*

2007 * mergeruns -- merge all the completed initial runs.

2008 *

2009 * This implements the Balanced k-Way Merge Algorithm. All input data has

2010 * already been written to initial runs on tape (see dumptuples).

2011 */

2012static void

2013 mergeruns(Tuplesortstate *state)

2014{

2015 int tapenum;

2016

2017 Assert(state->status == TSS_BUILDRUNS);

2018 Assert(state->memtupcount == 0);

2019

2020 if (state->base.sortKeys != NULL && state->base.sortKeys->abbrev_converter != NULL)

2021 {

2022 /*

2023 * If there are multiple runs to be merged, when we go to read back

2024 * tuples from disk, abbreviated keys will not have been stored, and

2025 * we don't care to regenerate them. Disable abbreviation from this

2026 * point on.

2027 */

2028 state->base.sortKeys->abbrev_converter = NULL;

2029 state->base.sortKeys->comparator = state->base.sortKeys->abbrev_full_comparator;

2030

2031 /* Not strictly necessary, but be tidy */

2032 state->base.sortKeys->abbrev_abort = NULL;

2033 state->base.sortKeys->abbrev_full_comparator = NULL;

2034 }

2035

2036 /*

2037 * Reset tuple memory. We've freed all the tuples that we previously

2038 * allocated. We will use the slab allocator from now on.

2039 */

2040 MemoryContextResetOnly(state->base.tuplecontext);

2041

2042 /*

2043 * We no longer need a large memtuples array. (We will allocate a smaller

2044 * one for the heap later.)

2045 */

2046 FREEMEM(state, GetMemoryChunkSpace(state->memtuples));

2047 pfree(state->memtuples);

2048 state->memtuples = NULL;

2049

2050 /*

2051 * Initialize the slab allocator. We need one slab slot per input tape,

2052 * for the tuples in the heap, plus one to hold the tuple last returned

2053 * from tuplesort_gettuple. (If we're sorting pass-by-val Datums,

2054 * however, we don't need to do allocate anything.)

2055 *

2056 * In a multi-pass merge, we could shrink this allocation for the last

2057 * merge pass, if it has fewer tapes than previous passes, but we don't

2058 * bother.

2059 *

2060 * From this point on, we no longer use the USEMEM()/LACKMEM() mechanism

2061 * to track memory usage of individual tuples.

2062 */

2063 if (state->base.tuples)

2064 init_slab_allocator(state, state->nOutputTapes + 1);

2065 else

2066 init_slab_allocator(state, 0);

2067

2068 /*

2069 * Allocate a new 'memtuples' array, for the heap. It will hold one tuple

2070 * from each input tape.

2071 *

2072 * We could shrink this, too, between passes in a multi-pass merge, but we

2073 * don't bother. (The initial input tapes are still in outputTapes. The

2074 * number of input tapes will not increase between passes.)

2075 */

2076 state->memtupsize = state->nOutputTapes;

2077 state->memtuples = (SortTuple *) MemoryContextAlloc(state->base.maincontext,

2078 state->nOutputTapes * sizeof(SortTuple));

2079 USEMEM(state, GetMemoryChunkSpace(state->memtuples));

2080

2081 /*

2082 * Use all the remaining memory we have available for tape buffers among

2083 * all the input tapes. At the beginning of each merge pass, we will

2084 * divide this memory between the input and output tapes in the pass.

2085 */

2086 state->tape_buffer_mem = state->availMem;

2087 USEMEM(state, state->tape_buffer_mem);

2088 if (trace_sort)

2089 elog(LOG, "worker %d using %zu KB of memory for tape buffers",

2090 state->worker, state->tape_buffer_mem / 1024);

2091

2092 for (;;)

2093 {

2094 /*

2095 * On the first iteration, or if we have read all the runs from the

2096 * input tapes in a multi-pass merge, it's time to start a new pass.

2097 * Rewind all the output tapes, and make them inputs for the next

2098 * pass.

2099 */

2100 if (state->nInputRuns == 0)

2101 {

2102 int64 input_buffer_size;

2103

2104 /* Close the old, emptied, input tapes */

2105 if (state->nInputTapes > 0)

2106 {

2107 for (tapenum = 0; tapenum < state->nInputTapes; tapenum++)

2108 LogicalTapeClose(state->inputTapes[tapenum]);

2109 pfree(state->inputTapes);

2110 }

2111

2112 /* Previous pass's outputs become next pass's inputs. */

2113 state->inputTapes = state->outputTapes;

2114 state->nInputTapes = state->nOutputTapes;

2115 state->nInputRuns = state->nOutputRuns;

2116

2117 /*

2118 * Reset output tape variables. The actual LogicalTapes will be

2119 * created as needed, here we only allocate the array to hold

2120 * them.

2121 */

2122 state->outputTapes = palloc0(state->nInputTapes * sizeof(LogicalTape *));

2123 state->nOutputTapes = 0;

2124 state->nOutputRuns = 0;

2125

2126 /*

2127 * Redistribute the memory allocated for tape buffers, among the

2128 * new input and output tapes.

2129 */

2130 input_buffer_size = merge_read_buffer_size(state->tape_buffer_mem,

2131 state->nInputTapes,

2132 state->nInputRuns,

2133 state->maxTapes);

2134

2135 if (trace_sort)

2136 elog(LOG, "starting merge pass of %d input runs on %d tapes, " INT64_FORMAT " KB of memory for each input tape: %s",

2137 state->nInputRuns, state->nInputTapes, input_buffer_size / 1024,

2138 pg_rusage_show(&state->ru_start));

2139

2140 /* Prepare the new input tapes for merge pass. */

2141 for (tapenum = 0; tapenum < state->nInputTapes; tapenum++)

2142 LogicalTapeRewindForRead(state->inputTapes[tapenum], input_buffer_size);

2143

2144 /*

2145 * If there's just one run left on each input tape, then only one

2146 * merge pass remains. If we don't have to produce a materialized

2147 * sorted tape, we can stop at this point and do the final merge

2148 * on-the-fly.

2149 */

2150 if ((state->base.sortopt & TUPLESORT_RANDOMACCESS) == 0

2151 && state->nInputRuns <= state->nInputTapes

2152 && !WORKER(state))

2153 {

2154 /* Tell logtape.c we won't be writing anymore */

2155 LogicalTapeSetForgetFreeSpace(state->tapeset);

2156 /* Initialize for the final merge pass */

2157 beginmerge(state);

2158 state->status = TSS_FINALMERGE;

2159 return;

2160 }

2161 }

2162

2163 /* Select an output tape */

2164 selectnewtape(state);

2165

2166 /* Merge one run from each input tape. */

2167 mergeonerun(state);

2168

2169 /*

2170 * If the input tapes are empty, and we output only one output run,

2171 * we're done. The current output tape contains the final result.

2172 */

2173 if (state->nInputRuns == 0 && state->nOutputRuns <= 1)

2174 break;

2175 }

2176

2177 /*

2178 * Done. The result is on a single run on a single tape.

2179 */

2180 state->result_tape = state->outputTapes[0];

2181 if (!WORKER(state))

2182 LogicalTapeFreeze(state->result_tape, NULL);

2183 else

2184 worker_freeze_result_tape(state);

2185 state->status = TSS_SORTEDONTAPE;

2186

2187 /* Close all the now-empty input tapes, to release their read buffers. */

2188 for (tapenum = 0; tapenum < state->nInputTapes; tapenum++)

2189 LogicalTapeClose(state->inputTapes[tapenum]);

2190}

2191

2192/*

2193 * Merge one run from each input tape.

2194 */

2195static void

2196 mergeonerun(Tuplesortstate *state)

2197{

2198 int srcTapeIndex;

2199 LogicalTape *srcTape;

2200

2201 /*

2202 * Start the merge by loading one tuple from each active source tape into

2203 * the heap.

2204 */

2205 beginmerge(state);

2206

2207 Assert(state->slabAllocatorUsed);

2208

2209 /*

2210 * Execute merge by repeatedly extracting lowest tuple in heap, writing it

2211 * out, and replacing it with next tuple from same tape (if there is

2212 * another one).

2213 */

2214 while (state->memtupcount > 0)

2215 {

2216 SortTuple stup;

2217

2218 /* write the tuple to destTape */

2219 srcTapeIndex = state->memtuples[0].srctape;

2220 srcTape = state->inputTapes[srcTapeIndex];

2221 WRITETUP(state, state->destTape, &state->memtuples[0]);

2222

2223 /* recycle the slot of the tuple we just wrote out, for the next read */

2224 if (state->memtuples[0].tuple)

2225 RELEASE_SLAB_SLOT(state, state->memtuples[0].tuple);

2226

2227 /*

2228 * pull next tuple from the tape, and replace the written-out tuple in

2229 * the heap with it.

2230 */

2231 if (mergereadnext(state, srcTape, &stup))

2232 {

2233 stup.srctape = srcTapeIndex;

2234 tuplesort_heap_replace_top(state, &stup);

2235 }

2236 else

2237 {

2238 tuplesort_heap_delete_top(state);

2239 state->nInputRuns--;

2240 }

2241 }

2242

2243 /*

2244 * When the heap empties, we're done. Write an end-of-run marker on the

2245 * output tape.

2246 */

2247 markrunend(state->destTape);

2248}

2249

2250/*

2251 * beginmerge - initialize for a merge pass

2252 *

2253 * Fill the merge heap with the first tuple from each input tape.

2254 */

2255static void

2256 beginmerge(Tuplesortstate *state)

2257{

2258 int activeTapes;

2259 int srcTapeIndex;

2260

2261 /* Heap should be empty here */

2262 Assert(state->memtupcount == 0);

2263

2264 activeTapes = Min(state->nInputTapes, state->nInputRuns);

2265

2266 for (srcTapeIndex = 0; srcTapeIndex < activeTapes; srcTapeIndex++)

2267 {

2268 SortTuple tup;

2269

2270 if (mergereadnext(state, state->inputTapes[srcTapeIndex], &tup))

2271 {

2272 tup.srctape = srcTapeIndex;

2273 tuplesort_heap_insert(state, &tup);

2274 }

2275 }

2276}

2277

2278/*

2279 * mergereadnext - read next tuple from one merge input tape

2280 *

2281 * Returns false on EOF.

2282 */

2283static bool

2284 mergereadnext(Tuplesortstate *state, LogicalTape *srcTape, SortTuple *stup)

2285{

2286 unsigned int tuplen;

2287

2288 /* read next tuple, if any */

2289 if ((tuplen = getlen(srcTape, true)) == 0)

2290 return false;

2291 READTUP(state, stup, srcTape, tuplen);

2292

2293 return true;

2294}

2295

2296/*

2297 * dumptuples - remove tuples from memtuples and write initial run to tape

2298 *

2299 * When alltuples = true, dump everything currently in memory. (This case is

2300 * only used at end of input data.)

2301 */

2302static void

2303 dumptuples(Tuplesortstate *state, bool alltuples)

2304{

2305 int memtupwrite;

2306 int i;

2307

2308 /*

2309 * Nothing to do if we still fit in available memory and have array slots,

2310 * unless this is the final call during initial run generation.

2311 */

2312 if (state->memtupcount < state->memtupsize && !LACKMEM(state) &&

2313 !alltuples)

2314 return;

2315

2316 /*

2317 * Final call might require no sorting, in rare cases where we just so

2318 * happen to have previously LACKMEM()'d at the point where exactly all

2319 * remaining tuples are loaded into memory, just before input was

2320 * exhausted. In general, short final runs are quite possible, but avoid

2321 * creating a completely empty run. In a worker, though, we must produce

2322 * at least one tape, even if it's empty.

2323 */

2324 if (state->memtupcount == 0 && state->currentRun > 0)

2325 return;

2326

2327 Assert(state->status == TSS_BUILDRUNS);

2328

2329 /*

2330 * It seems unlikely that this limit will ever be exceeded, but take no

2331 * chances

2332 */

2333 if (state->currentRun == INT_MAX)

2334 ereport(ERROR,

2335 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),

2336 errmsg("cannot have more than %d runs for an external sort",

2337 INT_MAX)));

2338

2339 if (state->currentRun > 0)

2340 selectnewtape(state);

2341

2342 state->currentRun++;

2343

2344 if (trace_sort)

2345 elog(LOG, "worker %d starting quicksort of run %d: %s",

2346 state->worker, state->currentRun,

2347 pg_rusage_show(&state->ru_start));

2348

2349 /*

2350 * Sort all tuples accumulated within the allowed amount of memory for

2351 * this run using quicksort

2352 */

2353 tuplesort_sort_memtuples(state);

2354

2355 if (trace_sort)

2356 elog(LOG, "worker %d finished quicksort of run %d: %s",

2357 state->worker, state->currentRun,

2358 pg_rusage_show(&state->ru_start));

2359

2360 memtupwrite = state->memtupcount;

2361 for (i = 0; i < memtupwrite; i++)

2362 {

2363 SortTuple *stup = &state->memtuples[i];

2364

2365 WRITETUP(state, state->destTape, stup);

2366 }

2367

2368 state->memtupcount = 0;

2369

2370 /*

2371 * Reset tuple memory. We've freed all of the tuples that we previously

2372 * allocated. It's important to avoid fragmentation when there is a stark

2373 * change in the sizes of incoming tuples. In bounded sorts,

2374 * fragmentation due to AllocSetFree's bucketing by size class might be

2375 * particularly bad if this step wasn't taken.

2376 */

2377 MemoryContextReset(state->base.tuplecontext);

2378

2379 /*

2380 * Now update the memory accounting to subtract the memory used by the

2381 * tuple.

2382 */

2383 FREEMEM(state, state->tupleMem);

2384 state->tupleMem = 0;

2385

2386 markrunend(state->destTape);

2387

2388 if (trace_sort)

2389 elog(LOG, "worker %d finished writing run %d to tape %d: %s",

2390 state->worker, state->currentRun, (state->currentRun - 1) % state->nOutputTapes + 1,

2391 pg_rusage_show(&state->ru_start));

2392}

2393

2394/*

2395 * tuplesort_rescan - rewind and replay the scan

2396 */

2397void

2398 tuplesort_rescan(Tuplesortstate *state)

2399{

2400 MemoryContext oldcontext = MemoryContextSwitchTo(state->base.sortcontext);

2401

2402 Assert(state->base.sortopt & TUPLESORT_RANDOMACCESS);

2403

2404 switch (state->status)

2405 {

2406 case TSS_SORTEDINMEM:

2407 state->current = 0;

2408 state->eof_reached = false;

2409 state->markpos_offset = 0;

2410 state->markpos_eof = false;

2411 break;

2412 case TSS_SORTEDONTAPE:

2413 LogicalTapeRewindForRead(state->result_tape, 0);

2414 state->eof_reached = false;

2415 state->markpos_block = 0L;

2416 state->markpos_offset = 0;

2417 state->markpos_eof = false;

2418 break;

2419 default:

2420 elog(ERROR, "invalid tuplesort state");

2421 break;

2422 }

2423

2424 MemoryContextSwitchTo(oldcontext);

2425}

2426

2427/*

2428 * tuplesort_markpos - saves current position in the merged sort file

2429 */

2430void

2431 tuplesort_markpos(Tuplesortstate *state)

2432{

2433 MemoryContext oldcontext = MemoryContextSwitchTo(state->base.sortcontext);

2434

2435 Assert(state->base.sortopt & TUPLESORT_RANDOMACCESS);

2436

2437 switch (state->status)

2438 {

2439 case TSS_SORTEDINMEM:

2440 state->markpos_offset = state->current;

2441 state->markpos_eof = state->eof_reached;

2442 break;

2443 case TSS_SORTEDONTAPE:

2444 LogicalTapeTell(state->result_tape,

2445 &state->markpos_block,

2446 &state->markpos_offset);

2447 state->markpos_eof = state->eof_reached;

2448 break;

2449 default:

2450 elog(ERROR, "invalid tuplesort state");

2451 break;

2452 }

2453

2454 MemoryContextSwitchTo(oldcontext);

2455}

2456

2457/*

2458 * tuplesort_restorepos - restores current position in merged sort file to

2459 * last saved position

2460 */

2461void

2462 tuplesort_restorepos(Tuplesortstate *state)

2463{

2464 MemoryContext oldcontext = MemoryContextSwitchTo(state->base.sortcontext);

2465

2466 Assert(state->base.sortopt & TUPLESORT_RANDOMACCESS);

2467

2468 switch (state->status)

2469 {

2470 case TSS_SORTEDINMEM:

2471 state->current = state->markpos_offset;

2472 state->eof_reached = state->markpos_eof;

2473 break;

2474 case TSS_SORTEDONTAPE:

2475 LogicalTapeSeek(state->result_tape,

2476 state->markpos_block,

2477 state->markpos_offset);

2478 state->eof_reached = state->markpos_eof;

2479 break;

2480 default:

2481 elog(ERROR, "invalid tuplesort state");

2482 break;

2483 }

2484

2485 MemoryContextSwitchTo(oldcontext);

2486}

2487

2488/*

2489 * tuplesort_get_stats - extract summary statistics

2490 *

2491 * This can be called after tuplesort_performsort() finishes to obtain

2492 * printable summary information about how the sort was performed.

2493 */

2494void

2495 tuplesort_get_stats(Tuplesortstate *state,

2496 TuplesortInstrumentation *stats)

2497{

2498 /*

2499 * Note: it might seem we should provide both memory and disk usage for a

2500 * disk-based sort. However, the current code doesn't track memory space

2501 * accurately once we have begun to return tuples to the caller (since we

2502 * don't account for pfree's the caller is expected to do), so we cannot

2503 * rely on availMem in a disk sort. This does not seem worth the overhead

2504 * to fix. Is it worth creating an API for the memory context code to

2505 * tell us how much is actually used in sortcontext?

2506 */

2507 tuplesort_updatemax(state);

2508

2509 if (state->isMaxSpaceDisk)

2510 stats->spaceType = SORT_SPACE_TYPE_DISK;

2511 else

2512 stats->spaceType = SORT_SPACE_TYPE_MEMORY;

2513 stats->spaceUsed = (state->maxSpace + 1023) / 1024;

2514

2515 switch (state->maxSpaceStatus)

2516 {

2517 case TSS_SORTEDINMEM:

2518 if (state->boundUsed)

2519 stats->sortMethod = SORT_TYPE_TOP_N_HEAPSORT;

2520 else

2521 stats->sortMethod = SORT_TYPE_QUICKSORT;

2522 break;

2523 case TSS_SORTEDONTAPE:

2524 stats->sortMethod = SORT_TYPE_EXTERNAL_SORT;

2525 break;

2526 case TSS_FINALMERGE:

2527 stats->sortMethod = SORT_TYPE_EXTERNAL_MERGE;

2528 break;

2529 default:

2530 stats->sortMethod = SORT_TYPE_STILL_IN_PROGRESS;

2531 break;

2532 }

2533}

2534

2535/*

2536 * Convert TuplesortMethod to a string.

2537 */

2538const char *

2539 tuplesort_method_name(TuplesortMethod m)

2540{

2541 switch (m)

2542 {

2543 case SORT_TYPE_STILL_IN_PROGRESS:

2544 return "still in progress";

2545 case SORT_TYPE_TOP_N_HEAPSORT:

2546 return "top-N heapsort";

2547 case SORT_TYPE_QUICKSORT:

2548 return "quicksort";

2549 case SORT_TYPE_EXTERNAL_SORT:

2550 return "external sort";

2551 case SORT_TYPE_EXTERNAL_MERGE:

2552 return "external merge";

2553 }

2554

2555 return "unknown";

2556}

2557

2558/*

2559 * Convert TuplesortSpaceType to a string.

2560 */

2561const char *

2562 tuplesort_space_type_name(TuplesortSpaceType t)

2563{

2564 Assert(t == SORT_SPACE_TYPE_DISK || t == SORT_SPACE_TYPE_MEMORY);

2565 return t == SORT_SPACE_TYPE_DISK ? "Disk" : "Memory";

2566}

2567

2568

2569/*

2570 * Heap manipulation routines, per Knuth's Algorithm 5.2.3H.

2571 */

2572

2573/*

2574 * Convert the existing unordered array of SortTuples to a bounded heap,

2575 * discarding all but the smallest "state->bound" tuples.

2576 *

2577 * When working with a bounded heap, we want to keep the largest entry

2578 * at the root (array entry zero), instead of the smallest as in the normal

2579 * sort case. This allows us to discard the largest entry cheaply.

2580 * Therefore, we temporarily reverse the sort direction.

2581 */

2582static void

2583 make_bounded_heap(Tuplesortstate *state)

2584{

2585 int tupcount = state->memtupcount;

2586 int i;

2587

2588 Assert(state->status == TSS_INITIAL);

2589 Assert(state->bounded);

2590 Assert(tupcount >= state->bound);

2591 Assert(SERIAL(state));

2592

2593 /* Reverse sort direction so largest entry will be at root */

2594 reversedirection(state);

2595

2596 state->memtupcount = 0; /* make the heap empty */

2597 for (i = 0; i < tupcount; i++)

2598 {

2599 if (state->memtupcount < state->bound)

2600 {

2601 /* Insert next tuple into heap */

2602 /* Must copy source tuple to avoid possible overwrite */

2603 SortTuple stup = state->memtuples[i];

2604

2605 tuplesort_heap_insert(state, &stup);

2606 }

2607 else

2608 {

2609 /*

2610 * The heap is full. Replace the largest entry with the new

2611 * tuple, or just discard it, if it's larger than anything already

2612 * in the heap.

2613 */

2614 if (COMPARETUP(state, &state->memtuples[i], &state->memtuples[0]) <= 0)

2615 {

2616 free_sort_tuple(state, &state->memtuples[i]);

2617 CHECK_FOR_INTERRUPTS();

2618 }

2619 else

2620 tuplesort_heap_replace_top(state, &state->memtuples[i]);

2621 }

2622 }

2623

2624 Assert(state->memtupcount == state->bound);

2625 state->status = TSS_BOUNDED;

2626}

2627

2628/*

2629 * Convert the bounded heap to a properly-sorted array

2630 */

2631static void

2632 sort_bounded_heap(Tuplesortstate *state)

2633{

2634 int tupcount = state->memtupcount;

2635

2636 Assert(state->status == TSS_BOUNDED);

2637 Assert(state->bounded);

2638 Assert(tupcount == state->bound);

2639 Assert(SERIAL(state));

2640

2641 /*

2642 * We can unheapify in place because each delete-top call will remove the

2643 * largest entry, which we can promptly store in the newly freed slot at

2644 * the end. Once we're down to a single-entry heap, we're done.

2645 */

2646 while (state->memtupcount > 1)

2647 {

2648 SortTuple stup = state->memtuples[0];

2649

2650 /* this sifts-up the next-largest entry and decreases memtupcount */

2651 tuplesort_heap_delete_top(state);

2652 state->memtuples[state->memtupcount] = stup;

2653 }

2654 state->memtupcount = tupcount;

2655

2656 /*

2657 * Reverse sort direction back to the original state. This is not

2658 * actually necessary but seems like a good idea for tidiness.

2659 */

2660 reversedirection(state);

2661

2662 state->status = TSS_SORTEDINMEM;

2663 state->boundUsed = true;

2664}

2665

2666/*

2667 * Sort all memtuples using specialized qsort() routines.

2668 *

2669 * Quicksort is used for small in-memory sorts, and external sort runs.

2670 */

2671static void

2672 tuplesort_sort_memtuples(Tuplesortstate *state)

2673{

2674 Assert(!LEADER(state));

2675

2676 if (state->memtupcount > 1)

2677 {

2678 /*

2679 * Do we have the leading column's value or abbreviation in datum1,

2680 * and is there a specialization for its comparator?

2681 */

2682 if (state->base.haveDatum1 && state->base.sortKeys)

2683 {

2684 if (state->base.sortKeys[0].comparator == ssup_datum_unsigned_cmp)

2685 {

2686 qsort_tuple_unsigned(state->memtuples,

2687 state->memtupcount,

2688 state);

2689 return;

2690 }

2691 else if (state->base.sortKeys[0].comparator == ssup_datum_signed_cmp)

2692 {

2693 qsort_tuple_signed(state->memtuples,

2694 state->memtupcount,

2695 state);

2696 return;

2697 }

2698 else if (state->base.sortKeys[0].comparator == ssup_datum_int32_cmp)

2699 {

2700 qsort_tuple_int32(state->memtuples,

2701 state->memtupcount,

2702 state);

2703 return;

2704 }

2705 }

2706

2707 /* Can we use the single-key sort function? */

2708 if (state->base.onlyKey != NULL)

2709 {

2710 qsort_ssup(state->memtuples, state->memtupcount,

2711 state->base.onlyKey);

2712 }

2713 else

2714 {

2715 qsort_tuple(state->memtuples,

2716 state->memtupcount,

2717 state->base.comparetup,

2718 state);

2719 }

2720 }

2721}

2722

2723/*

2724 * Insert a new tuple into an empty or existing heap, maintaining the

2725 * heap invariant. Caller is responsible for ensuring there's room.

2726 *

2727 * Note: For some callers, tuple points to a memtuples[] entry above the

2728 * end of the heap. This is safe as long as it's not immediately adjacent

2729 * to the end of the heap (ie, in the [memtupcount] array entry) --- if it

2730 * is, it might get overwritten before being moved into the heap!

2731 */

2732static void

2733 tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple)

2734{

2735 SortTuple *memtuples;

2736 int j;

2737

2738 memtuples = state->memtuples;

2739 Assert(state->memtupcount < state->memtupsize);

2740

2741 CHECK_FOR_INTERRUPTS();

2742

2743 /*

2744 * Sift-up the new entry, per Knuth 5.2.3 exercise 16. Note that Knuth is

2745 * using 1-based array indexes, not 0-based.

2746 */

2747 j = state->memtupcount++;

2748 while (j > 0)

2749 {

2750 int i = (j - 1) >> 1;

2751

2752 if (COMPARETUP(state, tuple, &memtuples[i]) >= 0)

2753 break;

2754 memtuples[j] = memtuples[i];

2755 j = i;

2756 }

2757 memtuples[j] = *tuple;

2758}

2759

2760/*

2761 * Remove the tuple at state->memtuples[0] from the heap. Decrement

2762 * memtupcount, and sift up to maintain the heap invariant.

2763 *

2764 * The caller has already free'd the tuple the top node points to,

2765 * if necessary.

2766 */

2767static void

2768 tuplesort_heap_delete_top(Tuplesortstate *state)

2769{

2770 SortTuple *memtuples = state->memtuples;

2771 SortTuple *tuple;

2772

2773 if (--state->memtupcount <= 0)

2774 return;

2775

2776 /*

2777 * Remove the last tuple in the heap, and re-insert it, by replacing the

2778 * current top node with it.

2779 */

2780 tuple = &memtuples[state->memtupcount];

2781 tuplesort_heap_replace_top(state, tuple);

2782}

2783

2784/*

2785 * Replace the tuple at state->memtuples[0] with a new tuple. Sift up to

2786 * maintain the heap invariant.

2787 *

2788 * This corresponds to Knuth's "sift-up" algorithm (Algorithm 5.2.3H,

2789 * Heapsort, steps H3-H8).

2790 */

2791static void

2792 tuplesort_heap_replace_top(Tuplesortstate *state, SortTuple *tuple)

2793{

2794 SortTuple *memtuples = state->memtuples;

2795 unsigned int i,

2796 n;

2797

2798 Assert(state->memtupcount >= 1);

2799

2800 CHECK_FOR_INTERRUPTS();

2801

2802 /*

2803 * state->memtupcount is "int", but we use "unsigned int" for i, j, n.

2804 * This prevents overflow in the "2 * i + 1" calculation, since at the top

2805 * of the loop we must have i < n <= INT_MAX <= UINT_MAX/2.

2806 */

2807 n = state->memtupcount;

2808 i = 0; /* i is where the "hole" is */

2809 for (;;)

2810 {

2811 unsigned int j = 2 * i + 1;

2812

2813 if (j >= n)

2814 break;

2815 if (j + 1 < n &&

2816 COMPARETUP(state, &memtuples[j], &memtuples[j + 1]) > 0)

2817 j++;

2818 if (COMPARETUP(state, tuple, &memtuples[j]) <= 0)

2819 break;

2820 memtuples[i] = memtuples[j];

2821 i = j;

2822 }

2823 memtuples[i] = *tuple;

2824}

2825

2826/*

2827 * Function to reverse the sort direction from its current state

2828 *

2829 * It is not safe to call this when performing hash tuplesorts

2830 */

2831static void

2832 reversedirection(Tuplesortstate *state)

2833{

2834 SortSupport sortKey = state->base.sortKeys;

2835 int nkey;

2836

2837 for (nkey = 0; nkey < state->base.nKeys; nkey++, sortKey++)

2838 {

2839 sortKey->ssup_reverse = !sortKey->ssup_reverse;

2840 sortKey->ssup_nulls_first = !sortKey->ssup_nulls_first;

2841 }

2842}

2843

2844

2845/*

2846 * Tape interface routines

2847 */

2848

2849static unsigned int

2850 getlen(LogicalTape *tape, bool eofOK)

2851{

2852 unsigned int len;

2853

2854 if (LogicalTapeRead(tape,

2855 &len, sizeof(len)) != sizeof(len))

2856 elog(ERROR, "unexpected end of tape");

2857 if (len == 0 && !eofOK)

2858 elog(ERROR, "unexpected end of data");

2859 return len;

2860}

2861

2862static void

2863 markrunend(LogicalTape *tape)

2864{

2865 unsigned int len = 0;

2866

2867 LogicalTapeWrite(tape, &len, sizeof(len));

2868}

2869

2870/*

2871 * Get memory for tuple from within READTUP() routine.

2872 *

2873 * We use next free slot from the slab allocator, or palloc() if the tuple

2874 * is too large for that.

2875 */

2876void *

2877 tuplesort_readtup_alloc(Tuplesortstate *state, Size tuplen)

2878{

2879 SlabSlot *buf;

2880

2881 /*

2882 * We pre-allocate enough slots in the slab arena that we should never run

2883 * out.

2884 */

2885 Assert(state->slabFreeHead);

2886

2887 if (tuplen > SLAB_SLOT_SIZE || !state->slabFreeHead)

2888 return MemoryContextAlloc(state->base.sortcontext, tuplen);

2889 else

2890 {

2891 buf = state->slabFreeHead;

2892 /* Reuse this slot */

2893 state->slabFreeHead = buf->nextfree;

2894

2895 return buf;

2896 }

2897}

2898

2899

2900/*

2901 * Parallel sort routines

2902 */

2903

2904/*

2905 * tuplesort_estimate_shared - estimate required shared memory allocation

2906 *

2907 * nWorkers is an estimate of the number of workers (it's the number that

2908 * will be requested).

2909 */

2910Size

2911 tuplesort_estimate_shared(int nWorkers)

2912{

2913 Size tapesSize;

2914

2915 Assert(nWorkers > 0);

2916

2917 /* Make sure that BufFile shared state is MAXALIGN'd */

2918 tapesSize = mul_size(sizeof(TapeShare), nWorkers);

2919 tapesSize = MAXALIGN(add_size(tapesSize, offsetof(Sharedsort, tapes)));

2920

2921 return tapesSize;

2922}

2923

2924/*

2925 * tuplesort_initialize_shared - initialize shared tuplesort state

2926 *

2927 * Must be called from leader process before workers are launched, to

2928 * establish state needed up-front for worker tuplesortstates. nWorkers

2929 * should match the argument passed to tuplesort_estimate_shared().

2930 */

2931void

2932 tuplesort_initialize_shared(Sharedsort *shared, int nWorkers, dsm_segment *seg)

2933{

2934 int i;

2935

2936 Assert(nWorkers > 0);

2937

2938 SpinLockInit(&shared->mutex);

2939 shared->currentWorker = 0;

2940 shared->workersFinished = 0;

2941 SharedFileSetInit(&shared->fileset, seg);

2942 shared->nTapes = nWorkers;

2943 for (i = 0; i < nWorkers; i++)

2944 {

2945 shared->tapes[i].firstblocknumber = 0L;

2946 }

2947}

2948

2949/*

2950 * tuplesort_attach_shared - attach to shared tuplesort state

2951 *

2952 * Must be called by all worker processes.

2953 */

2954void

2955 tuplesort_attach_shared(Sharedsort *shared, dsm_segment *seg)

2956{

2957 /* Attach to SharedFileSet */

2958 SharedFileSetAttach(&shared->fileset, seg);

2959}

2960

2961/*

2962 * worker_get_identifier - Assign and return ordinal identifier for worker

2963 *

2964 * The order in which these are assigned is not well defined, and should not

2965 * matter; worker numbers across parallel sort participants need only be

2966 * distinct and gapless. logtape.c requires this.

2967 *

2968 * Note that the identifiers assigned from here have no relation to

2969 * ParallelWorkerNumber number, to avoid making any assumption about

2970 * caller's requirements. However, we do follow the ParallelWorkerNumber

2971 * convention of representing a non-worker with worker number -1. This

2972 * includes the leader, as well as serial Tuplesort processes.

2973 */

2974static int

2975 worker_get_identifier(Tuplesortstate *state)

2976{

2977 Sharedsort *shared = state->shared;

2978 int worker;

2979

2980 Assert(WORKER(state));

2981

2982 SpinLockAcquire(&shared->mutex);

2983 worker = shared->currentWorker++;

2984 SpinLockRelease(&shared->mutex);

2985

2986 return worker;

2987}

2988

2989/*

2990 * worker_freeze_result_tape - freeze worker's result tape for leader

2991 *

2992 * This is called by workers just after the result tape has been determined,

2993 * instead of calling LogicalTapeFreeze() directly. They do so because

2994 * workers require a few additional steps over similar serial

2995 * TSS_SORTEDONTAPE external sort cases, which also happen here. The extra

2996 * steps are around freeing now unneeded resources, and representing to

2997 * leader that worker's input run is available for its merge.

2998 *

2999 * There should only be one final output run for each worker, which consists

3000 * of all tuples that were originally input into worker.

3001 */

3002static void

3003 worker_freeze_result_tape(Tuplesortstate *state)

3004{

3005 Sharedsort *shared = state->shared;

3006 TapeShare output;

3007

3008 Assert(WORKER(state));

3009 Assert(state->result_tape != NULL);

3010 Assert(state->memtupcount == 0);

3011

3012 /*

3013 * Free most remaining memory, in case caller is sensitive to our holding

3014 * on to it. memtuples may not be a tiny merge heap at this point.

3015 */

3016 pfree(state->memtuples);

3017 /* Be tidy */

3018 state->memtuples = NULL;

3019 state->memtupsize = 0;

3020

3021 /*

3022 * Parallel worker requires result tape metadata, which is to be stored in

3023 * shared memory for leader

3024 */

3025 LogicalTapeFreeze(state->result_tape, &output);

3026

3027 /* Store properties of output tape, and update finished worker count */

3028 SpinLockAcquire(&shared->mutex);

3029 shared->tapes[state->worker] = output;

3030 shared->workersFinished++;

3031 SpinLockRelease(&shared->mutex);

3032}

3033

3034/*

3035 * worker_nomergeruns - dump memtuples in worker, without merging

3036 *

3037 * This called as an alternative to mergeruns() with a worker when no

3038 * merging is required.

3039 */

3040static void

3041 worker_nomergeruns(Tuplesortstate *state)

3042{

3043 Assert(WORKER(state));

3044 Assert(state->result_tape == NULL);

3045 Assert(state->nOutputRuns == 1);

3046

3047 state->result_tape = state->destTape;

3048 worker_freeze_result_tape(state);

3049}

3050

3051/*

3052 * leader_takeover_tapes - create tapeset for leader from worker tapes

3053 *

3054 * So far, leader Tuplesortstate has performed no actual sorting. By now, all

3055 * sorting has occurred in workers, all of which must have already returned

3056 * from tuplesort_performsort().

3057 *

3058 * When this returns, leader process is left in a state that is virtually

3059 * indistinguishable from it having generated runs as a serial external sort

3060 * might have.

3061 */

3062static void

3063 leader_takeover_tapes(Tuplesortstate *state)

3064{

3065 Sharedsort *shared = state->shared;

3066 int nParticipants = state->nParticipants;

3067 int workersFinished;

3068 int j;

3069

3070 Assert(LEADER(state));

3071 Assert(nParticipants >= 1);

3072

3073 SpinLockAcquire(&shared->mutex);

3074 workersFinished = shared->workersFinished;

3075 SpinLockRelease(&shared->mutex);

3076

3077 if (nParticipants != workersFinished)

3078 elog(ERROR, "cannot take over tapes before all workers finish");

3079

3080 /*

3081 * Create the tapeset from worker tapes, including a leader-owned tape at

3082 * the end. Parallel workers are far more expensive than logical tapes,

3083 * so the number of tapes allocated here should never be excessive.

3084 */

3085 inittapestate(state, nParticipants);

3086 state->tapeset = LogicalTapeSetCreate(false, &shared->fileset, -1);

3087

3088 /*

3089 * Set currentRun to reflect the number of runs we will merge (it's not

3090 * used for anything, this is just pro forma)

3091 */

3092 state->currentRun = nParticipants;

3093

3094 /*

3095 * Initialize the state to look the same as after building the initial

3096 * runs.

3097 *

3098 * There will always be exactly 1 run per worker, and exactly one input

3099 * tape per run, because workers always output exactly 1 run, even when

3100 * there were no input tuples for workers to sort.

3101 */

3102 state->inputTapes = NULL;

3103 state->nInputTapes = 0;

3104 state->nInputRuns = 0;

3105

3106 state->outputTapes = palloc0(nParticipants * sizeof(LogicalTape *));

3107 state->nOutputTapes = nParticipants;

3108 state->nOutputRuns = nParticipants;

3109

3110 for (j = 0; j < nParticipants; j++)

3111 {

3112 state->outputTapes[j] = LogicalTapeImport(state->tapeset, j, &shared->tapes[j]);

3113 }

3114

3115 state->status = TSS_BUILDRUNS;

3116}

3117

3118/*

3119 * Convenience routine to free a tuple previously loaded into sort memory

3120 */

3121static void

3122 free_sort_tuple(Tuplesortstate *state, SortTuple *stup)

3123{

3124 if (stup->tuple)

3125 {

3126 FREEMEM(state, GetMemoryChunkSpace(stup->tuple));

3127 pfree(stup->tuple);

3128 stup->tuple = NULL;

3129 }

3130}

3131

3132int

3133 ssup_datum_unsigned_cmp(Datum x, Datum y, SortSupport ssup)

3134{

3135 if (x < y)

3136 return -1;

3137 else if (x > y)

3138 return 1;

3139 else

3140 return 0;

3141}

3142

3143int

3144 ssup_datum_signed_cmp(Datum x, Datum y, SortSupport ssup)

3145{

3146 int64 xx = DatumGetInt64(x);

3147 int64 yy = DatumGetInt64(y);

3148

3149 if (xx < yy)

3150 return -1;

3151 else if (xx > yy)

3152 return 1;

3153 else

3154 return 0;

3155}

3156

3157int

3158 ssup_datum_int32_cmp(Datum x, Datum y, SortSupport ssup)

3159{

3160 int32 xx = DatumGetInt32(x);

3161 int32 yy = DatumGetInt32(y);

3162

3163 if (xx < yy)

3164 return -1;

3165 else if (xx > yy)

3166 return 1;

3167 else

3168 return 0;

3169}

PrepareTempTablespaces

void PrepareTempTablespaces(void)

Definition: tablespace.c:1331

BumpContextCreate

MemoryContext BumpContextCreate(MemoryContext parent, const char *name, Size minContextSize, Size initBlockSize, Size maxBlockSize)

Definition: bump.c:133

Min

#define Min(x, y)

Definition: c.h:1003

MAXALIGN

#define MAXALIGN(LEN)

Definition: c.h:810

Max

#define Max(x, y)

Definition: c.h:997

INT64_FORMAT

#define INT64_FORMAT

Definition: c.h:556

int64

int64_t int64

Definition: c.h:535

FLEXIBLE_ARRAY_MEMBER

#define FLEXIBLE_ARRAY_MEMBER

Definition: c.h:470

pg_attribute_always_inline

#define pg_attribute_always_inline

Definition: c.h:269

int32

int32_t int32

Definition: c.h:534

Size

size_t Size

Definition: c.h:610

errcode

int errcode(int sqlerrcode)

Definition: elog.c:854

errmsg

int errmsg(const char *fmt,...)

Definition: elog.c:1071

LOG

#define LOG

Definition: elog.h:31

ERROR

#define ERROR

Definition: elog.h:39

elog

#define elog(elevel,...)

Definition: elog.h:226

ereport

#define ereport(elevel,...)

Definition: elog.h:150

compare

static int compare(const void *arg1, const void *arg2)

Definition: geqo_pool.c:145

guc.h

Assert

Assert(PointerIsAligned(start, uint64))

output

FILE * output

Definition: pg_test_timing.c:235

y

int y

Definition: isn.c:76

b

int b

Definition: isn.c:74

x

int x

Definition: isn.c:75

a

int a

Definition: isn.c:73

j

int j

Definition: isn.c:78

i

int i

Definition: isn.c:77

LogicalTapeRewindForRead

void LogicalTapeRewindForRead(LogicalTape *lt, size_t buffer_size)

Definition: logtape.c:846

LogicalTapeSetForgetFreeSpace

void LogicalTapeSetForgetFreeSpace(LogicalTapeSet *lts)

Definition: logtape.c:750

LogicalTapeBackspace

size_t LogicalTapeBackspace(LogicalTape *lt, size_t size)

Definition: logtape.c:1062

LogicalTapeRead

size_t LogicalTapeRead(LogicalTape *lt, void *ptr, size_t size)

Definition: logtape.c:928

LogicalTapeSetBlocks

int64 LogicalTapeSetBlocks(LogicalTapeSet *lts)

Definition: logtape.c:1181

LogicalTapeClose

void LogicalTapeClose(LogicalTape *lt)

Definition: logtape.c:733

LogicalTapeSetClose

void LogicalTapeSetClose(LogicalTapeSet *lts)

Definition: logtape.c:667

LogicalTapeSeek

void LogicalTapeSeek(LogicalTape *lt, int64 blocknum, int offset)

Definition: logtape.c:1133

LogicalTapeSetCreate

LogicalTapeSet * LogicalTapeSetCreate(bool preallocate, SharedFileSet *fileset, int worker)

Definition: logtape.c:556

LogicalTapeTell

void LogicalTapeTell(LogicalTape *lt, int64 *blocknum, int *offset)

Definition: logtape.c:1162

LogicalTapeWrite

void LogicalTapeWrite(LogicalTape *lt, const void *ptr, size_t size)

Definition: logtape.c:761

LogicalTapeCreate

LogicalTape * LogicalTapeCreate(LogicalTapeSet *lts)

Definition: logtape.c:680

LogicalTapeFreeze

void LogicalTapeFreeze(LogicalTape *lt, TapeShare *share)

Definition: logtape.c:981

LogicalTapeImport

LogicalTape * LogicalTapeImport(LogicalTapeSet *lts, int worker, TapeShare *shared)

Definition: logtape.c:609

MemoryContextAlloc

void * MemoryContextAlloc(MemoryContext context, Size size)

Definition: mcxt.c:1229

MemoryContextReset

void MemoryContextReset(MemoryContext context)

Definition: mcxt.c:400

pfree

void pfree(void *pointer)

Definition: mcxt.c:1594

GetMemoryChunkSpace

Size GetMemoryChunkSpace(void *pointer)

Definition: mcxt.c:767

palloc0

void * palloc0(Size size)

Definition: mcxt.c:1395

palloc

void * palloc(Size size)

Definition: mcxt.c:1365

CurrentMemoryContext

MemoryContext CurrentMemoryContext

Definition: mcxt.c:160

MemoryContextDelete

void MemoryContextDelete(MemoryContext context)

Definition: mcxt.c:469

repalloc_huge

void * repalloc_huge(void *pointer, Size size)

Definition: mcxt.c:1735

MemoryContextResetOnly

void MemoryContextResetOnly(MemoryContext context)

Definition: mcxt.c:419

memutils.h

AllocSetContextCreate

#define AllocSetContextCreate

Definition: memutils.h:129

MaxAllocHugeSize

#define MaxAllocHugeSize

Definition: memutils.h:45

ALLOCSET_DEFAULT_SIZES

#define ALLOCSET_DEFAULT_SIZES

Definition: memutils.h:160

miscadmin.h

CHECK_FOR_INTERRUPTS

#define CHECK_FOR_INTERRUPTS()

Definition: miscadmin.h:122

MemoryContextSwitchTo

static MemoryContext MemoryContextSwitchTo(MemoryContext context)

Definition: palloc.h:124

len

const void size_t len

Definition: pg_crc32c_sse42.c:28

pg_rusage_show

const char * pg_rusage_show(const PGRUsage *ru0)

Definition: pg_rusage.c:40

pg_rusage_init

void pg_rusage_init(PGRUsage *ru0)

Definition: pg_rusage.c:27

pg_rusage.h

buf

static char * buf

Definition: pg_test_fsync.c:72

pg_trace.h

postgres.h

DatumGetInt64

static int64 DatumGetInt64(Datum X)

Definition: postgres.h:393

Datum

uint64_t Datum

Definition: postgres.h:70

DatumGetInt32

static int32 DatumGetInt32(Datum X)

Definition: postgres.h:212

SharedFileSetAttach

void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg)

Definition: sharedfileset.c:56

SharedFileSetInit

void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)

Definition: sharedfileset.c:38

add_size

Size add_size(Size s1, Size s2)

Definition: shmem.c:493

mul_size

Size mul_size(Size s1, Size s2)

Definition: shmem.c:510

shmem.h

sort_template.h

ApplySignedSortComparator

static int ApplySignedSortComparator(Datum datum1, bool isNull1, Datum datum2, bool isNull2, SortSupport ssup)

Definition: sortsupport.h:266

ApplyUnsignedSortComparator

static int ApplyUnsignedSortComparator(Datum datum1, bool isNull1, Datum datum2, bool isNull2, SortSupport ssup)

Definition: sortsupport.h:233

ApplyInt32SortComparator

static int ApplyInt32SortComparator(Datum datum1, bool isNull1, Datum datum2, bool isNull2, SortSupport ssup)

Definition: sortsupport.h:300

SpinLockInit

#define SpinLockInit(lock)

Definition: spin.h:57

SpinLockRelease

#define SpinLockRelease(lock)

Definition: spin.h:61

SpinLockAcquire

#define SpinLockAcquire(lock)

Definition: spin.h:59

LogicalTapeSet

Definition: logtape.c:188

LogicalTape

Definition: logtape.c:138

MemoryContextData

Definition: memnodes.h:118

PGRUsage

Definition: pg_rusage.h:23

SharedFileSet

Definition: sharedfileset.h:27

Sharedsort

Definition: tuplesort.c:344

Sharedsort::fileset

SharedFileSet fileset

Definition: tuplesort.c:360

Sharedsort::tapes

TapeShare tapes[FLEXIBLE_ARRAY_MEMBER]

Definition: tuplesort.c:369

Sharedsort::workersFinished

int workersFinished

Definition: tuplesort.c:357

Sharedsort::nTapes

int nTapes

Definition: tuplesort.c:363

Sharedsort::mutex

slock_t mutex

Definition: tuplesort.c:346

Sharedsort::currentWorker

int currentWorker

Definition: tuplesort.c:356

SortCoordinateData

Definition: tuplesort.h:47

SortCoordinateData::sharedsort

Sharedsort * sharedsort

Definition: tuplesort.h:59

SortCoordinateData::isWorker

bool isWorker

Definition: tuplesort.h:49

SortCoordinateData::nParticipants

int nParticipants

Definition: tuplesort.h:56

SortSupportData

Definition: sortsupport.h:61

SortSupportData::ssup_reverse

bool ssup_reverse

Definition: sortsupport.h:74

SortSupportData::ssup_nulls_first

bool ssup_nulls_first

Definition: sortsupport.h:75

SortTuple

Definition: tuplesort.h:149

SortTuple::tuple

void * tuple

Definition: tuplesort.h:150

SortTuple::srctape

int srctape

Definition: tuplesort.h:153

SortTuple::datum1

Datum datum1

Definition: tuplesort.h:151

TapeShare

Definition: logtape.h:49

TapeShare::firstblocknumber

int64 firstblocknumber

Definition: logtape.h:54

TuplesortInstrumentation

Definition: tuplesort.h:112

TuplesortInstrumentation::sortMethod

TuplesortMethod sortMethod

Definition: tuplesort.h:113

TuplesortInstrumentation::spaceType

TuplesortSpaceType spaceType

Definition: tuplesort.h:114

TuplesortInstrumentation::spaceUsed

int64 spaceUsed

Definition: tuplesort.h:115

TuplesortPublic

Definition: tuplesort.h:165

Tuplesortstate

Definition: tuplesort.c:186

Tuplesortstate::bounded

bool bounded

Definition: tuplesort.c:189

Tuplesortstate::lastReturnedTuple

void * lastReturnedTuple

Definition: tuplesort.c:264

Tuplesortstate::worker

int worker

Definition: tuplesort.c:320

Tuplesortstate::tapeset

LogicalTapeSet * tapeset

Definition: tuplesort.c:208

Tuplesortstate::isMaxSpaceDisk

bool isMaxSpaceDisk

Definition: tuplesort.c:204

Tuplesortstate::growmemtuples

bool growmemtuples

Definition: tuplesort.c:220

Tuplesortstate::memtuples

SortTuple * memtuples

Definition: tuplesort.c:217

Tuplesortstate::maxSpace

int64 maxSpace

Definition: tuplesort.c:202

Tuplesortstate::nInputRuns

int nInputRuns

Definition: tuplesort.c:282

Tuplesortstate::memtupsize

int memtupsize

Definition: tuplesort.c:219

Tuplesortstate::inputTapes

LogicalTape ** inputTapes

Definition: tuplesort.c:280

Tuplesortstate::slabAllocatorUsed

bool slabAllocatorUsed

Definition: tuplesort.c:249

Tuplesortstate::nOutputRuns

int nOutputRuns

Definition: tuplesort.c:286

Tuplesortstate::base

TuplesortPublic base

Definition: tuplesort.c:187

Tuplesortstate::currentRun

int currentRun

Definition: tuplesort.c:270

Tuplesortstate::slabMemoryEnd

char * slabMemoryEnd

Definition: tuplesort.c:252

Tuplesortstate::tupleMem

int64 tupleMem

Definition: tuplesort.c:193

Tuplesortstate::ru_start

PGRUsage ru_start

Definition: tuplesort.c:336

Tuplesortstate::slabMemoryBegin

char * slabMemoryBegin

Definition: tuplesort.c:251

Tuplesortstate::memtupcount

int memtupcount

Definition: tuplesort.c:218

Tuplesortstate::outputTapes

LogicalTape ** outputTapes

Definition: tuplesort.c:284

Tuplesortstate::nOutputTapes

int nOutputTapes

Definition: tuplesort.c:285

Tuplesortstate::nInputTapes

int nInputTapes

Definition: tuplesort.c:281

Tuplesortstate::maxTapes

int maxTapes

Definition: tuplesort.c:200

Tuplesortstate::eof_reached

bool eof_reached

Definition: tuplesort.c:297

Tuplesortstate::tape_buffer_mem

size_t tape_buffer_mem

Definition: tuplesort.c:256

Tuplesortstate::status

TupSortStatus status

Definition: tuplesort.c:188

Tuplesortstate::availMem

int64 availMem

Definition: tuplesort.c:198

Tuplesortstate::destTape

LogicalTape * destTape

Definition: tuplesort.c:288

Tuplesortstate::bound

int bound

Definition: tuplesort.c:192

Tuplesortstate::maxSpaceStatus

TupSortStatus maxSpaceStatus

Definition: tuplesort.c:207

Tuplesortstate::boundUsed

bool boundUsed

Definition: tuplesort.c:191

Tuplesortstate::current

int current

Definition: tuplesort.c:296

Tuplesortstate::markpos_eof

bool markpos_eof

Definition: tuplesort.c:302

Tuplesortstate::abbrevNext

int64 abbrevNext

Definition: tuplesort.c:330

Tuplesortstate::markpos_block

int64 markpos_block

Definition: tuplesort.c:300

Tuplesortstate::shared

Sharedsort * shared

Definition: tuplesort.c:321

Tuplesortstate::result_tape

LogicalTape * result_tape

Definition: tuplesort.c:295

Tuplesortstate::slabFreeHead

SlabSlot * slabFreeHead

Definition: tuplesort.c:253

Tuplesortstate::markpos_offset

int markpos_offset

Definition: tuplesort.c:301

Tuplesortstate::nParticipants

int nParticipants

Definition: tuplesort.c:322

Tuplesortstate::allowedMem

int64 allowedMem

Definition: tuplesort.c:199

dsm_segment

Definition: dsm.c:67

state

Definition: regguts.h:323

tablespace.h

tuplesort_rescan

void tuplesort_rescan(Tuplesortstate *state)

Definition: tuplesort.c:2398

tuplesort_performsort

void tuplesort_performsort(Tuplesortstate *state)

Definition: tuplesort.c:1359

tuplesort_merge_order

int tuplesort_merge_order(int64 allowedMem)

Definition: tuplesort.c:1774

TAPE_BUFFER_OVERHEAD

#define TAPE_BUFFER_OVERHEAD

Definition: tuplesort.c:178

tuplesort_heap_delete_top

static void tuplesort_heap_delete_top(Tuplesortstate *state)

Definition: tuplesort.c:2768

INITIAL_MEMTUPSIZE

#define INITIAL_MEMTUPSIZE

Definition: tuplesort.c:120

getlen

static unsigned int getlen(LogicalTape *tape, bool eofOK)

Definition: tuplesort.c:2850

tuplesort_initialize_shared

void tuplesort_initialize_shared(Sharedsort *shared, int nWorkers, dsm_segment *seg)

Definition: tuplesort.c:2932

COMPARETUP

#define COMPARETUP(state, a, b)

Definition: tuplesort.c:396

selectnewtape

static void selectnewtape(Tuplesortstate *state)

Definition: tuplesort.c:1944

tuplesort_reset

void tuplesort_reset(Tuplesortstate *state)

Definition: tuplesort.c:1015

SERIAL

#define SERIAL(state)

Definition: tuplesort.c:403

FREESTATE

#define FREESTATE(state)

Definition: tuplesort.c:399

markrunend

static void markrunend(LogicalTape *tape)

Definition: tuplesort.c:2863

tuplesort_skiptuples

bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples, bool forward)

Definition: tuplesort.c:1706

free_sort_tuple

static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup)

Definition: tuplesort.c:3122

REMOVEABBREV

#define REMOVEABBREV(state, stup, count)

Definition: tuplesort.c:395

LACKMEM

#define LACKMEM(state)

Definition: tuplesort.c:400

reversedirection

static void reversedirection(Tuplesortstate *state)

Definition: tuplesort.c:2832

USEMEM

#define USEMEM(state, amt)

Definition: tuplesort.c:401

tuplesort_heap_insert

static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple)

Definition: tuplesort.c:2733

ssup_datum_signed_cmp

int ssup_datum_signed_cmp(Datum x, Datum y, SortSupport ssup)

Definition: tuplesort.c:3144

grow_memtuples

static bool grow_memtuples(Tuplesortstate *state)

Definition: tuplesort.c:1048

ssup_datum_unsigned_cmp

int ssup_datum_unsigned_cmp(Datum x, Datum y, SortSupport ssup)

Definition: tuplesort.c:3133

beginmerge

static void beginmerge(Tuplesortstate *state)

Definition: tuplesort.c:2256

make_bounded_heap

static void make_bounded_heap(Tuplesortstate *state)

Definition: tuplesort.c:2583

tuplesort_used_bound

bool tuplesort_used_bound(Tuplesortstate *state)

Definition: tuplesort.c:882

WRITETUP

#define WRITETUP(state, tape, stup)

Definition: tuplesort.c:397

sort_bounded_heap

static void sort_bounded_heap(Tuplesortstate *state)

Definition: tuplesort.c:2632

TupSortStatus

Definition: tuplesort.c:155

TSS_SORTEDONTAPE

@ TSS_SORTEDONTAPE

Definition: tuplesort.c:160

TSS_SORTEDINMEM

@ TSS_SORTEDINMEM

Definition: tuplesort.c:159

TSS_INITIAL

@ TSS_INITIAL

Definition: tuplesort.c:156

TSS_FINALMERGE

@ TSS_FINALMERGE

Definition: tuplesort.c:161

TSS_BUILDRUNS

@ TSS_BUILDRUNS

Definition: tuplesort.c:158

TSS_BOUNDED

@ TSS_BOUNDED

Definition: tuplesort.c:157

worker_get_identifier

static int worker_get_identifier(Tuplesortstate *state)

Definition: tuplesort.c:2975

mergeonerun

static void mergeonerun(Tuplesortstate *state)

Definition: tuplesort.c:2196

FREEMEM

#define FREEMEM(state, amt)

Definition: tuplesort.c:402

MAXORDER

#define MAXORDER

Definition: tuplesort.c:177

inittapestate

static void inittapestate(Tuplesortstate *state, int maxTapes)

Definition: tuplesort.c:1910

SLAB_SLOT_SIZE

#define SLAB_SLOT_SIZE

Definition: tuplesort.c:142

leader_takeover_tapes

static void leader_takeover_tapes(Tuplesortstate *state)

Definition: tuplesort.c:3063

tuplesort_estimate_shared

Size tuplesort_estimate_shared(int nWorkers)

Definition: tuplesort.c:2911

tuplesort_get_stats

void tuplesort_get_stats(Tuplesortstate *state, TuplesortInstrumentation *stats)

Definition: tuplesort.c:2495

tuplesort_begin_common

Tuplesortstate * tuplesort_begin_common(int workMem, SortCoordinate coordinate, int sortopt)

Definition: tuplesort.c:638

tuplesort_sort_memtuples

static void tuplesort_sort_memtuples(Tuplesortstate *state)

Definition: tuplesort.c:2672

tuplesort_end

void tuplesort_end(Tuplesortstate *state)

Definition: tuplesort.c:947

inittapes

static void inittapes(Tuplesortstate *state, bool mergeruns)

Definition: tuplesort.c:1861

tuplesort_markpos

void tuplesort_markpos(Tuplesortstate *state)

Definition: tuplesort.c:2431

tuplesort_puttuple_common

void tuplesort_puttuple_common(Tuplesortstate *state, SortTuple *tuple, bool useAbbrev, Size tuplen)

Definition: tuplesort.c:1165

tuplesort_space_type_name

const char * tuplesort_space_type_name(TuplesortSpaceType t)

Definition: tuplesort.c:2562

MERGE_BUFFER_SIZE

#define MERGE_BUFFER_SIZE

Definition: tuplesort.c:179

READTUP

#define READTUP(state, stup, tape, len)

Definition: tuplesort.c:398

ssup_datum_int32_cmp

int ssup_datum_int32_cmp(Datum x, Datum y, SortSupport ssup)

Definition: tuplesort.c:3158

LEADER

#define LEADER(state)

Definition: tuplesort.c:405

WORKER

#define WORKER(state)

Definition: tuplesort.c:404

tuplesort_gettuple_common

bool tuplesort_gettuple_common(Tuplesortstate *state, bool forward, SortTuple *stup)

Definition: tuplesort.c:1466

merge_read_buffer_size

static int64 merge_read_buffer_size(int64 avail_mem, int nInputTapes, int nInputRuns, int maxOutputTapes)

Definition: tuplesort.c:1829

mergereadnext

static bool mergereadnext(Tuplesortstate *state, LogicalTape *srcTape, SortTuple *stup)

Definition: tuplesort.c:2284

SlabSlot

union SlabSlot SlabSlot

tuplesort_updatemax

static void tuplesort_updatemax(Tuplesortstate *state)

Definition: tuplesort.c:964

worker_freeze_result_tape

static void worker_freeze_result_tape(Tuplesortstate *state)

Definition: tuplesort.c:3003

trace_sort

bool trace_sort

Definition: tuplesort.c:124

qsort_tuple_signed_compare

static pg_attribute_always_inline int qsort_tuple_signed_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)

Definition: tuplesort.c:517

RELEASE_SLAB_SLOT

#define RELEASE_SLAB_SLOT(state, tuple)

Definition: tuplesort.c:383

tuplesort_attach_shared

void tuplesort_attach_shared(Sharedsort *shared, dsm_segment *seg)

Definition: tuplesort.c:2955

worker_nomergeruns

static void worker_nomergeruns(Tuplesortstate *state)

Definition: tuplesort.c:3041

tuplesort_method_name

const char * tuplesort_method_name(TuplesortMethod m)

Definition: tuplesort.c:2539

qsort_tuple_unsigned_compare

static pg_attribute_always_inline int qsort_tuple_unsigned_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)

Definition: tuplesort.c:495

tuplesort_heap_replace_top

static void tuplesort_heap_replace_top(Tuplesortstate *state, SortTuple *tuple)

Definition: tuplesort.c:2792

tuplesort_restorepos

void tuplesort_restorepos(Tuplesortstate *state)

Definition: tuplesort.c:2462

qsort_tuple_int32_compare

static pg_attribute_always_inline int qsort_tuple_int32_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)

Definition: tuplesort.c:540

mergeruns

static void mergeruns(Tuplesortstate *state)

Definition: tuplesort.c:2013

tuplesort_readtup_alloc

void * tuplesort_readtup_alloc(Tuplesortstate *state, Size tuplen)

Definition: tuplesort.c:2877

MINORDER

#define MINORDER

Definition: tuplesort.c:176

tuplesort_begin_batch

static void tuplesort_begin_batch(Tuplesortstate *state)

Definition: tuplesort.c:748

tuplesort_set_bound

void tuplesort_set_bound(Tuplesortstate *state, int64 bound)

Definition: tuplesort.c:834

init_slab_allocator

static void init_slab_allocator(Tuplesortstate *state, int numSlots)

Definition: tuplesort.c:1977

consider_abort_common

static bool consider_abort_common(Tuplesortstate *state)

Definition: tuplesort.c:1315

tuplesort_free

static void tuplesort_free(Tuplesortstate *state)

Definition: tuplesort.c:893

dumptuples

static void dumptuples(Tuplesortstate *state, bool alltuples)

Definition: tuplesort.c:2303

tuplesort.h

TupleSortUseBumpTupleCxt

#define TupleSortUseBumpTupleCxt(opt)

Definition: tuplesort.h:109

TUPLESORT_RANDOMACCESS

#define TUPLESORT_RANDOMACCESS

Definition: tuplesort.h:97

TUPLESORT_ALLOWBOUNDED

#define TUPLESORT_ALLOWBOUNDED

Definition: tuplesort.h:100

TuplesortSpaceType

Definition: tuplesort.h:88

SORT_SPACE_TYPE_DISK

@ SORT_SPACE_TYPE_DISK

Definition: tuplesort.h:89

SORT_SPACE_TYPE_MEMORY

@ SORT_SPACE_TYPE_MEMORY

Definition: tuplesort.h:90

TuplesortMethod

Definition: tuplesort.h:77

SORT_TYPE_EXTERNAL_SORT

@ SORT_TYPE_EXTERNAL_SORT

Definition: tuplesort.h:81

SORT_TYPE_TOP_N_HEAPSORT

@ SORT_TYPE_TOP_N_HEAPSORT

Definition: tuplesort.h:79

SORT_TYPE_QUICKSORT

@ SORT_TYPE_QUICKSORT

Definition: tuplesort.h:80

SORT_TYPE_STILL_IN_PROGRESS

@ SORT_TYPE_STILL_IN_PROGRESS

Definition: tuplesort.h:78

SORT_TYPE_EXTERNAL_MERGE

@ SORT_TYPE_EXTERNAL_MERGE

Definition: tuplesort.h:82

SlabSlot

Definition: tuplesort.c:145

SlabSlot::buffer

char buffer[SLAB_SLOT_SIZE]

Definition: tuplesort.c:147

SlabSlot::nextfree

union SlabSlot * nextfree

Definition: tuplesort.c:146

PostgreSQL Source Code: src/backend/utils/sort/tuplesort.c Source File