For Reuters-RCV1, we need $M \times (20+4+4) = 400{,}000 \times 28 = 11.2 \mbox{ megabytes (MB)}$ for storing the dictionary in this scheme.
Using fixed-width entries for terms is clearly wasteful. The average length of a term in English is about eight characters, so on average we are wasting twelve characters in the fixed-width scheme. Also, we have no way of storing terms with more than twenty characters like hydrochlorofluorocarbons and supercalifragilisticexpialidocious. We can overcome these shortcomings by storing the dictionary terms as one long string of characters, as shown in Figure 5.4. The pointer to the next term is also used to demarcate the end of the current term. As before, we locate terms in the data structure by way of binary search in the (now smaller) table. This scheme saves us 60% compared to fixed-width storage: 12 bytes on average of the 20 bytes we allocated for terms before. However, we now also need to store term pointers. The term pointers resolve $400{,}000 \times 8 = 3.2 \times 10^6$ positions, so they need to be $\log_2 (3.2 \times 10^6) \approx 22$ bits or 3 bytes long.
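The dictionary-as-a-string layout can be sketched in a few lines of Python. This is a minimal illustration rather than a space-efficient implementation: the class name `StringDictionary`, its methods, and the sample frequencies are hypothetical, and Python lists stand in for the packed 3-byte term pointers and 4-byte frequency fields described above.

```python
class StringDictionary:
    """Illustrative dictionary-as-a-string: one long string plus a pointer table."""

    def __init__(self, term_freqs):
        # term_freqs maps each term to its frequency (hypothetical input format).
        terms = sorted(term_freqs)
        self.string = "".join(terms)   # all terms concatenated, no separators
        self.term_ptrs = []            # start offset of each term in self.string
        self.freqs = []                # frequency stored alongside each pointer
        pos = 0
        for term in terms:
            self.term_ptrs.append(pos)
            self.freqs.append(term_freqs[term])
            pos += len(term)

    def term_at(self, i):
        # The next term's pointer marks the end of the current term;
        # the last term ends where the string ends.
        start = self.term_ptrs[i]
        if i + 1 < len(self.term_ptrs):
            end = self.term_ptrs[i + 1]
        else:
            end = len(self.string)
        return self.string[start:end]

    def lookup(self, term):
        # Binary search over the pointer table; returns the entry index or None.
        lo, hi = 0, len(self.term_ptrs) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            candidate = self.term_at(mid)
            if candidate == term:
                return mid
            if candidate < term:
                lo = mid + 1
            else:
                hi = mid - 1
        return None


# Sample terms with made-up frequencies, for illustration only.
d = StringDictionary(
    {"systile": 1, "syzygetic": 2, "syzygial": 3, "syzygy": 4, "szaibelyite": 5}
)
print(d.string)
print(d.term_ptrs)
print(d.lookup("syzygy"), d.lookup("missing"))
```

Note that no term lengths are stored: `term_at` recovers each term's extent from two adjacent pointers, which is why variable-length terms cost only one pointer per entry.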
In this new scheme, we need $400{,}000 \times (4+4+3+8) = 7.6 \ \mbox{MB}$ for the Reuters-RCV1 dictionary: 4 bytes each for frequency and postings pointer, 3 bytes for the term pointer, and 8 bytes on average for the term. So we have reduced the space requirements by one third from 11.2 to 7.6 MB.
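The arithmetic behind these figures can be checked with a short script. It is a sketch under the assumptions used in the text: 400,000 terms, one byte per character, an average term length of 8 characters, and 4 bytes each for the frequency and the postings pointer. The variable names are illustrative.

```python
M = 400_000              # number of terms in the Reuters-RCV1 dictionary
FIXED_TERM_BYTES = 20    # fixed-width term field
AVG_TERM_BYTES = 8       # average term length, one byte per character (assumed)
FREQ_BYTES = 4           # frequency field
POSTINGS_PTR_BYTES = 4   # pointer to the postings list

# Term pointers must address every position in the concatenated string.
string_positions = M * AVG_TERM_BYTES
term_ptr_bits = (string_positions - 1).bit_length()  # bits to address all positions
term_ptr_bytes = (term_ptr_bits + 7) // 8            # rounded up to whole bytes

fixed_width_total = M * (FIXED_TERM_BYTES + FREQ_BYTES + POSTINGS_PTR_BYTES)
string_total = M * (FREQ_BYTES + POSTINGS_PTR_BYTES + term_ptr_bytes + AVG_TERM_BYTES)
saving = 1 - string_total / fixed_width_total

print(f"term pointer: {term_ptr_bits} bits, {term_ptr_bytes} bytes")
print(f"fixed-width dictionary: {fixed_width_total / 1e6:.1f} MB")
print(f"dictionary as a string: {string_total / 1e6:.1f} MB")
print(f"relative saving: {saving:.1%}")
```

The relative saving comes out slightly below one third, because the 12 bytes saved per term are partly offset by the 3-byte term pointer added to each entry.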