Commit 5606d30

Add guide for rustdoc search implementation (#1846)
1 parent d13e851 commit 5606d30

2 files changed, +245 -0 lines changed

‎src/SUMMARY.md‎

Lines changed: 1 addition & 0 deletions
@@ -74,6 +74,7 @@
 - [Serialization in Rustc](./serialization.md)
 - [Parallel Compilation](./parallel-rustc.md)
 - [Rustdoc internals](./rustdoc-internals.md)
+- [Search](./rustdoc-internals/search.md)
 
 # Source Code Representation

‎src/rustdoc-internals/search.md‎

Lines changed: 244 additions & 0 deletions
@@ -0,0 +1,244 @@
# Rustdoc search

Rustdoc Search is two programs: `search_index.rs`
and `search.js`. The first generates a nasty JSON
file with a full list of items and function signatures
in the crates in the doc bundle, and the second reads
it, turns it into some in-memory structures, and
scans them linearly to search.

<!-- toc -->

## Search index format

`search.js` calls this Raw, because it turns it into
a more normal object tree after loading it.
Naturally, it's also written without newlines or spaces.

```json
[
    [ "crate_name", {
        "doc": "Documentation",
        "n": ["function_name", "Data"],
        "t": "HF",
        "d": ["This function gets the name of an integer with Data", "The data struct"],
        "q": [[0, "crate_name"]],
        "i": [2, 0],
        "p": [[1, "i32"], [1, "str"], [5, "crate_name::Data"]],
        "f": "{{gb}{d}}`",
        "b": [],
        "c": [],
        "a": [["get_name", 0]]
    }]
]
```

[`src/librustdoc/html/static/js/externs.js`]
defines an actual schema in a Closure `@typedef`.

The above index defines a crate called `crate_name`
with a free function called `function_name`
with the type signature `Data, i32 -> str`,
a struct called `Data`,
and an alias, `get_name`, that equivalently refers to `function_name`.

[`src/librustdoc/html/static/js/externs.js`]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/static/js/externs.js#L204-L258

The search index needs to fit the needs of the `rustdoc` compiler,
the `search.js` frontend,
and also be compact and fast to decode.
It makes a lot of compromises:

* The `rustdoc` compiler runs on one crate at a time,
  so each crate has an essentially separate search index.
  It [merges] them by having each crate on one line
  and looking at the first quoted string.
* Names in the search index are given
  in their original case and with underscores.
  When the search index is loaded,
  `search.js` stores the original names for display,
  but also folds them to lowercase and strips underscores for search.
  You'll see them called `normalized`.
* The `f` array stores types as offsets into the `p` array.
  These types might actually be from another crate,
  so `search.js` has to turn the numbers into names and then
  back into numbers to deduplicate them if multiple crates in the
  same index mention the same types.
* It's a JSON file, but not designed to be human-readable.
  Browsers already include an optimized JSON decoder,
  so this saves on `search.js` code and performs better for small crates,
  but instead of using objects like normal JSON formats do,
  it tries to put data of the same type next to each other
  so that the sliding window used by [DEFLATE] can find redundancies.
  Where `search.js` does its own compression,
  it's designed to save memory when the file is finally loaded,
  not just size on disk or network transfer.
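The name folding described in the list above can be sketched in a few lines. This is only an illustration: `normalizeName` and `loadNames` are hypothetical names, not functions taken from `search.js`.

```javascript
// Hypothetical helper: fold a name to lowercase and strip underscores,
// as the text describes for search-time comparison.
function normalizeName(name) {
    return name.toLowerCase().replace(/_/g, "");
}

// Keep the original spelling for display next to the folded form.
function loadNames(names) {
    return names.map((name) => ({
        name,
        normalizedName: normalizeName(name),
    }));
}
```

With the sample index, `loadNames(["function_name", "Data"])` keeps both spellings, so a query such as `functionname` can still find `function_name`.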

[merges]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/render/write_shared.rs#L151-L164
[DEFLATE]: https://en.wikipedia.org/wiki/Deflate

### Parallel arrays and indexed maps

Most data in the index
(other than `doc`, which is a single string for the whole crate,
`p`, which is a separate structure
and `a`, which is also a separate structure)
is a set of parallel arrays defining each searchable item.

For example,
the above search index can be turned into this table:

| n | t | d | q | i | f | b | c |
|---|---|---|---|---|---|---|---|
| `function_name` | `H` | This function gets the name of an integer with Data | `crate_name` | 2 | `{{gb}{d}}` | NULL | NULL |
| `Data` | `F` | The data struct | `crate_name` | 0 | `` ` `` | NULL | NULL |
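As a rough sketch of how the parallel arrays line up, the snippet below zips `n`, `t`, `d`, and `i` from the sample index into one object per item. The helper name `toRows` and the output field names are made up for illustration; the real loader in `search.js` does considerably more.

```javascript
// Sample values copied from the example index above.
const rawCrate = {
    n: ["function_name", "Data"],
    t: "HF",
    d: ["This function gets the name of an integer with Data", "The data struct"],
    i: [2, 0],
};

// Hypothetical helper: position `idx` in every array describes the same item.
function toRows(raw) {
    return raw.n.map((name, idx) => ({
        name,
        type: raw.t[idx], // `t` holds one character per item
        desc: raw.d[idx],
        parent: raw.i[idx], // 0 means "no parent item"
    }));
}
```

Calling `toRows(rawCrate)` yields two rows whose `name`, `type`, `desc`, and `parent` values match the `n`, `t`, `d`, and `i` columns of the table.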

The above index doesn't use `c`, which holds deprecated indices,
or `b`, which maps indices to strings.
If `crate_name::function_name` used both, it would look like this.

```json
"b": [[0, "impl-Foo-for-Bar"]],
"c": [0],
```

This attaches a disambiguator to index 0 and marks it deprecated.

The advantage of this layout is that these APIs often have implicit structure
that DEFLATE can take advantage of,
but that rustdoc can't assume.
For example, names are usually CamelCase or snake_case,
but descriptions aren't.

`q` is a Map from *the first applicable* ID to a parent module path.
This is a weird trick, but it makes more sense in pseudo-code:

```rust
let mut parent_module = "";
for (i, entry) in search_index.iter().enumerate() {
    // `q` only has an entry where the parent module changes
    if let Some(&path) = q.get(&i) {
        parent_module = path;
    }
    // ... do other stuff with `entry` ...
}
```

This is valid because everything has a parent module
(even if it's just the crate itself),
and is easy to assemble because the rustdoc generator sorts by path
before serializing.
Doing this allows rustdoc to not only make the search index smaller,
but reuse the same string representing the parent path across multiple in-memory items.
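The same trick can be written as runnable JavaScript. This is a minimal sketch, assuming `q` arrives as an array of `[firstIndex, path]` pairs as in the sample index; `expandParentPaths` is a hypothetical name.

```javascript
// Hypothetical helper: expand the sparse `q` pairs into one parent
// module path per item index.
function expandParentPaths(itemCount, q) {
    const firstIds = new Map(q);
    const paths = [];
    let parentModule = "";
    for (let i = 0; i < itemCount; i += 1) {
        // Only the first item under a module carries an entry in `q`.
        if (firstIds.has(i)) {
            parentModule = firstIds.get(i);
        }
        // Every later item reuses the same string until the next entry.
        paths.push(parentModule);
    }
    return paths;
}
```

For the sample index, `expandParentPaths(2, [[0, "crate_name"]])` gives both items the path `crate_name`, and both array slots refer to the same string rather than two copies.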

### `i`, `f`, and `p`

`i` and `f` both index into `p`, the array of parent items.

`i` is just a one-indexed number
(not zero-indexed because `0` is used for items that have no parent item).
It's different from `q` because `q` represents the parent *module or crate*,
which everything has,
while `i`/`p` are used for *type and trait-associated items* like methods.

`f`, the function signatures, uses its own encoding.

```ebnf
f = { FItem | FBackref }
FItem = FNumber | ( '{', {FItem}, '}' )
FNumber = { '@' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' }, ( '`' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' )
FBackref = ( '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | ':' | ';' | '<' | '=' | '>' | '?' )
```

An FNumber is a variable-length, self-terminating base16 number
(terminated because the last hexit is lowercase while all others are uppercase).
These are one-indexed references into `p`, because zero is used for nulls,
and negative numbers represent generics.
The sign bit is represented using [zig-zag encoding]
(the internal object representation also uses negative numbers,
even after decoding,
to represent generics).
This alphabet is chosen because the characters can be turned into hexits by
masking off all but the low four bits of the ASCII encoding.
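
For example, the JSON `[[3, 1], [2]]` is encoded as `{{fb}{d}}`.
Because of zigzag encoding, `` ` `` is +0, `a` is -0 (which is not used),
`b` is +1, `c` is -1, `d` is +2, and so on.
Read strictly by this rule, the `g` in the sample index's `{{gb}{d}}`
decodes to -3 (a generic) rather than +3.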
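To make the encoding concrete, here is a small decoder sketch for `FNumber` and `FItem`. It assumes the zigzag rule exactly as listed above (`` ` `` is +0, `b` is +1, `c` is -1), so it uses `{{fb}{d}}` as its example input. The function names are hypothetical, and `FBackref` is deliberately left out because the text does not say what a backref resolves to.

```javascript
// Decode one FNumber starting at `pos`; returns [value, nextPos].
function decodeFNumber(s, pos) {
    let n = 0;
    let c = s.charCodeAt(pos);
    // Uppercase hexits ('@'..'O', codes 64..79) mean more digits follow.
    while (c >= 64 && c <= 79) {
        n = n * 16 + (c & 0xf);
        pos += 1;
        c = s.charCodeAt(pos);
    }
    // The final hexit is lowercase ('`'..'o', codes 96..111).
    if (!(c >= 96 && c <= 111)) {
        throw new Error(`unexpected character at offset ${pos}`);
    }
    n = n * 16 + (c & 0xf);
    // Zigzag: the lowest bit is the sign, the remaining bits the magnitude.
    const magnitude = Math.floor(n / 2);
    const value = n % 2 === 1 ? -magnitude : magnitude;
    return [value, pos + 1];
}

// Decode one FItem: a number, or a brace-delimited list of FItems.
function decodeFItem(s, pos) {
    if (s[pos] !== "{") {
        return decodeFNumber(s, pos);
    }
    const items = [];
    pos += 1; // skip '{'
    while (pos < s.length && s[pos] !== "}") {
        const [item, next] = decodeFItem(s, pos);
        items.push(item);
        pos = next;
    }
    if (s[pos] !== "}") {
        throw new Error("unterminated '{'");
    }
    return [items, pos + 1]; // skip '}'
}
```

`decodeFItem("{{fb}{d}}", 0)[0]` returns `[[3, 1], [2]]`, and `decodeFNumber("c", 0)[0]` returns `-1`. Masking each character with `0xf` is the "low four bits" trick the alphabet was chosen for.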

[empirically]: https://github.com/rust-lang/rust/pull/83003
[zig-zag encoding]: https://en.wikipedia.org/wiki/Variable-length_quantity#Zigzag_encoding

## Searching by name

Searching by name works by looping through the search index
and running these functions on each entry:

* [`editDistance`] is always used to determine a match
  (unless quotes are specified, which would use simple equality instead).
  It computes the number of swaps, inserts, and removes needed to turn
  the query name into the entry name.
  For example, `foo` has zero distance from itself,
  but a distance of 1 from `ofo` (one swap) and `foob` (one insert).
  It is checked against a heuristic threshold, and then,
  if it is within that threshold, the distance is stored for ranking.
* [`String.prototype.indexOf`] is always used to determine a match.
  If it returns anything other than -1, the result is added,
  even if `editDistance` exceeds its threshold,
  and the index is stored for ranking.
* [`checkPath`] is used if, and only if, a parent path is specified
  in the query. For example, `vec` has no parent path, but `vec::vec` does.
  Within checkPath, editDistance and indexOf are used,
  and the path query has its own heuristic threshold, too.
  If it's not within the threshold, the entry is rejected,
  even if the first two pass.
  If it's within the threshold, the path distance is stored
  for ranking.
* [`checkType`] is used only if there's a type filter,
  like the struct in `struct:vec`. If it fails,
  the entry is rejected.

If all four criteria pass
(plus the crate filter, which isn't technically part of the query),
the results are sorted by [`sortResults`].
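The first two checks can be sketched as follows. This is not rustdoc's `editDistance`: it is a textbook optimal-string-alignment distance that reproduces the examples above, and it also counts single-character substitutions, which the text does not mention. `editDistanceSketch`, `matchName`, and the `maxDistance` parameter are hypothetical.

```javascript
// Inserts, removes, substitutions, and swaps of adjacent characters
// each cost 1 (optimal string alignment).
function editDistanceSketch(a, b) {
    const rows = a.length + 1;
    const cols = b.length + 1;
    const d = Array.from({ length: rows }, () => new Array(cols).fill(0));
    for (let i = 0; i < rows; i += 1) d[i][0] = i;
    for (let j = 0; j < cols; j += 1) d[0][j] = j;
    for (let i = 1; i < rows; i += 1) {
        for (let j = 1; j < cols; j += 1) {
            const cost = a[i - 1] === b[j - 1] ? 0 : 1;
            d[i][j] = Math.min(
                d[i - 1][j] + 1, // remove
                d[i][j - 1] + 1, // insert
                d[i - 1][j - 1] + cost // substitute, or keep if equal
            );
            if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
                d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // swap
            }
        }
    }
    return d[a.length][b.length];
}

// An entry is kept if either check passes; both values feed ranking.
function matchName(query, normalizedName, maxDistance) {
    const dist = editDistanceSketch(query, normalizedName);
    const index = normalizedName.indexOf(query);
    if (dist > maxDistance && index === -1) {
        return null;
    }
    return { dist, index };
}
```

Here `editDistanceSketch("foo", "ofo")` and `editDistanceSketch("foo", "foob")` are both 1, and `matchName("vec", "vecdeque", 1)` still returns a result because the substring check succeeds even though the distance is over the threshold.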

[`editDistance`]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/static/js/search.js#L137
[`String.prototype.indexOf`]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/indexOf
[`checkPath`]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/static/js/search.js#L1814
[`checkType`]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/static/js/search.js#L1787
[`sortResults`]: https://github.com/rust-lang/rust/blob/79b710c13968a1a48d94431d024d2b1677940866/src/librustdoc/html/static/js/search.js#L1229

## Searching by type

Searching by type can be divided into two phases,
and the second phase has two sub-phases.

* Turn names in the query into numbers.
* Loop over each entry in the search index:
   * Quick rejection using a bloom filter.
   * Slow rejection using a recursive type unification algorithm.

In the names->numbers phase, if the query has only one name in it,
the editDistance function is used to find a near match if the exact match fails,
but if there are multiple items in the query,
non-matching items are treated as generics instead.
This means `hahsmap` will match hashmap on its own, but `hahsmap, u32`
is going to match the same things `T, u32` matches
(though rustdoc will detect this particular problem and warn about it).

Then, when actually looping over each item,
the bloom filter will probably reject entries that don't have every
type mentioned in the query.
For example, the bloom filter allows a query of `i32 -> u32` to match
a function with the type `i32, u32 -> bool`,
but unification will reject it later.
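The quick-rejection idea can be illustrated with a toy bloom filter. Everything specific here is an assumption made for the sketch: the 31-bit size, the two hash formulas, and the function names are invented, and rustdoc's real filter uses its own parameters.

```javascript
// Hypothetical sizing; small enough to fit in one 32-bit integer.
const BLOOM_BITS = 31;

// Two cheap, made-up hashes pick two bit positions per type ID.
function bloomBitsFor(typeId) {
    const h1 = typeId % BLOOM_BITS;
    const h2 = (typeId * 7 + 3) % BLOOM_BITS;
    return (1 << h1) | (1 << h2);
}

// OR together the bits of every type the function mentions,
// ignoring whether it is a parameter or the return type.
function buildBloom(typeIds) {
    return typeIds.reduce((mask, id) => mask | bloomBitsFor(id), 0);
}

// true means "might match, run unification"; false means "cannot match".
function mightContainAll(functionBloom, queryTypeIds) {
    return queryTypeIds.every((id) => {
        const bits = bloomBitsFor(id);
        return (functionBloom & bits) === bits;
    });
}
```

Suppose `i32`, `u32`, `bool`, and `str` have the IDs 1, 2, 3, and 4. A function with the type `i32, u32 -> bool` gets `buildBloom([1, 2, 3])`, which lets the query `[1, 2]` through (unification rejects it later) and rejects `[4]` outright. With these toy hashes, ID 32 collides with ID 1, so `[32]` is also let through: a false positive that costs one unnecessary unification run but never hides a real match.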

The unification filter ensures that:

* Bag semantics are respected. If your query says `i32, i32`,
  then the function has to mention *two* i32s, not just one.
* Nesting semantics are respected. If your query says `vec<option>`,
  then `vec<option<i32>>` is fine, but `option<vec<i32>>` *is not* a match.
* The division between return type and parameter is respected.
  `i32 -> u32` and `u32 -> i32` are completely different.
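Two of these rules, bag semantics and the parameter/return split, can be shown with a deliberately simplified check. This sketch treats types as flat names and ignores generics and nesting, so it is not the recursive unification algorithm; all names in it are hypothetical.

```javascript
// Count how many times each type name occurs.
function countTypes(types) {
    const counts = new Map();
    for (const t of types) {
        counts.set(t, (counts.get(t) || 0) + 1);
    }
    return counts;
}

// true if `haystack` mentions every type in `needles` at least as often.
function bagContains(haystack, needles) {
    const have = countTypes(haystack);
    for (const [t, wanted] of countTypes(needles)) {
        if ((have.get(t) || 0) < wanted) {
            return false;
        }
    }
    return true;
}

// Parameters and return types are compared separately.
function signatureMatches(fn, query) {
    return bagContains(fn.inputs, query.inputs) && bagContains(fn.output, query.output);
}
```

For instance, `bagContains(["i32"], ["i32", "i32"])` is false because the query asks for two `i32`s, and a function `i32, u32 -> bool` does not match the query `i32 -> u32` because `u32` appears only among its parameters.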

The bloom filter checks none of these things,
and, on top of that, can have false positives.
But it's fast and uses very little memory, so the bloom filter helps.
