Search improvements discussion #1916

ed-kung started this conversation in Ideas
Feb 19, 2025 · 2 comments · 12 replies

Wanted to open up a discussion on search improvements (#1885), since search is a multi-objective optimization problem requiring opinionated decisions on how to weigh each objective. There are four main objectives: exact keyword matches, semantic matches, popularity (zaprank), and recency. (Do we know if people ever sort by comments or sats?)

Currently, searching by "recent" is broken because the "must" field requires a fairly precise keyword match and thus often returns no results. You can test this by changing the following lines in api/resolvers/search.js:

termQueries.push({
  multi_match: {
    query,
    type: 'phrase',
    fields: ['title^100', 'text'],
    boost: 1000
  }
})

For example, if you change type to 'most_fields' and add fuzziness: 'AUTO', searching by recent will no longer return null results while still prioritizing recency.
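
A minimal sketch of that change (the surrounding query structure and field boosts are kept from the current file; most_fields and fuzziness are standard multi_match options, so treat this as a sketch, not a tested patch):

termQueries.push({
  multi_match: {
    query,
    // 'most_fields' scores per-field matches instead of requiring an
    // exact phrase, and fuzziness tolerates small typos, so a sort by
    // "recent" still has candidate documents to order by date
    type: 'most_fields',
    fuzziness: 'AUTO',
    fields: ['title^100', 'text'],
    boost: 1000
  }
})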

But before fixing that, it would be helpful to think through what SN's search is trying to accomplish and how search performance should be measured. There is a lot of info out there, and the specific details of major companies' search engines are closely guarded secrets. Thus, there isn't a one-size-fits-all or industry-standard solution.

I have a couple of thoughts on how to proceed:

Learning-to-Rank

The obvious, but heavy-lift, solution would be to use ML on users' actual searches (what they searched, what they clicked on, and whether they re-did the search) to estimate optimal weights from SN users' behavior. If user data is not available, synthetic data could be generated for learning-to-rank, but the results likely won't reflect the actual user experience as robustly.
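
To make the idea concrete, here is a rough sketch of the training signal learning-to-rank needs; every name and value here is hypothetical, not something SN currently logs:

// Hypothetical judgment list derived from search logs. Labels come
// from behavior: a click is weak evidence of relevance; a click with
// no follow-up re-search is stronger.
const judgments = [
  {
    query: 'lightning channel fees',
    itemId: 12345,
    features: { bm25: 7.2, semantic: 0.81, zaprank: 0.4, ageDays: 3 },
    label: 2 // clicked, no re-search
  },
  {
    query: 'lightning channel fees',
    itemId: 67890,
    features: { bm25: 9.1, semantic: 0.55, zaprank: 0.1, ageDays: 400 },
    label: 0 // shown but skipped
  }
]
// a learning-to-rank model (e.g. LambdaMART) then learns how to weigh
// these features to reproduce the observed preferences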

The downside of this approach is that it's a lot of work. The juice may not be worth the squeeze.

Off-the-shelf solution plus trial and error

The only other approach I can really think of is to take an off-the-shelf approach suggested by others, like applying a Gaussian decay function to created_at for recency searches, combined with a hybrid keyword/neural approach for the query terms. Another suggested approach is reciprocal rank fusion, which is claimed to perform quite well and is simple to implement.
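
For reference, reciprocal rank fusion really is only a few lines; this sketch (a hypothetical helper, with k = 60 as in the original RRF paper) merges a keyword result list and a neural result list by rank rather than by raw score:

// Reciprocal rank fusion: an item's fused score is the sum of
// 1 / (k + rank) over every result list it appears in. Using ranks
// instead of raw scores sidesteps the fact that BM25 scores and
// neural similarity scores live on incomparable scales.
function reciprocalRankFusion (resultLists, k = 60) {
  const scores = new Map()
  for (const results of resultLists) {
    results.forEach((item, idx) => {
      scores.set(item.id, (scores.get(item.id) ?? 0) + 1 / (k + idx + 1))
    })
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }))
}
// e.g. reciprocalRankFusion([keywordResults, neuralResults])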

But even with off-the-shelf methods, some parameters will need to be chosen, like how much to weigh recency vs. relevance. Without an obvious evaluation metric, though, it will be hard to know whether changes to the search function actually improved searches. This may be where trial and error comes in: see whether a newly implemented search function draws positive or negative user feedback, alongside personal experimentation with the results.

Excellent write-up.

off-the-shelf

My general sense is that we are under-utilizing off-the-shelf solutions. My hope is that zaprank provides enough of a quality signal that we can combine with relevance metrics available off-the-shelf to provide a satisfactory search experience.

Our current search is basically two iterations of work done by me while I'm working on a bunch of other stuff: once three years ago to just get anything working, then again 18 months ago to add semantic search. So my sense is based on the fact that we haven't given search optimization a real try.

trial and error

A real try probably looks like building up a set of test searches and expected results, then tweaking the query until the results are satisfactory across all of them. A sketch of what that harness could look like is below.
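
Everything in this sketch is hypothetical: the canned queries, the runSearch signature, and recall@10 as the metric.

// Hypothetical harness: run each canned query through the search
// resolver and check that the expected items land in the top 10.
const testSearches = [
  { query: 'bitcoin halving', expectIds: [101, 202] },
  { query: 'zaprank explained', expectIds: [303] }
]

async function evaluate (runSearch) {
  for (const { query, expectIds } of testSearches) {
    const results = await runSearch(query) // assumed to return [{ id, ... }]
    const top10 = new Set(results.slice(0, 10).map(r => r.id))
    const hits = expectIds.filter(id => top10.has(id)).length
    console.log(`${query}: recall@10 = ${hits}/${expectIds.length}`)
  }
}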

beyond off-the-shelf

I'm also open to more elaborate solutions if it's ultimately low maintenance. I'm just out of my depth with anything bespoke. Learning-to-rank is a neat idea - I don't imagine we have enough clicks/data to train something like that too well though.

Absent trial and error being good enough, or a wizard providing some low maintenance solve, it may be worth outsourcing search to a third party like algolia and washing our hands of it.

My hope is and has been that we can use open source, off-the-shelf stuff but maybe the reality is that we can't.

Those steps would have to be done manually. At least, we don't currently have automations for them.

I implemented neural search before we made sndev, and it was never made part of the dev setup.

I've made some progress and have a working search implementation that solves some of the current issues. Namely, I added a "sort-by-relevance" feature, and "sort-by-recent" is less strict, so it doesn't fail to return results as often. I also tried a Gaussian decay for the date field, which I think should work better than the current squaring of createdAt.
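
For reference, a Gaussian decay on the date field looks roughly like this as an OpenSearch function_score clause; the scale and decay numbers below are placeholders, not what the PR actually uses:

// Sketch of a Gaussian date decay: an item keeps full weight at
// `origin` and drops smoothly to `decay` of its score at `scale`
// away. The concrete numbers are placeholders.
const decayedQuery = {
  function_score: {
    query, // the existing keyword/neural query
    functions: [{
      gauss: {
        createdAt: { origin: 'now', scale: '30d', decay: 0.5 }
      }
    }],
    boost_mode: 'multiply'
  }
}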

Before turning it into a PR, I figured I should seek some guidance on the desired workflow for testing, since I only have access to a sample DB that doesn't contain semantically coherent posts. That makes it tricky to test the influence of neural search relative to keyword search, and to evaluate the relative weight of semantic similarity vs. other sort factors in the resulting ordering.

So in terms of testing, does it make more sense for me to submit the PR for review and let y'all test it on your side? Or would you rather make a test db for search?

lmk what you think.

If you submit the PR for review I can test.

If you make the db for testing search I'll pay an additional 1m small ones.

Up to you.

Ok, submitted the PR. I would need to think about how to make a good test DB. You wouldn't be open to sharing data dumps, would you?

Needing to scrub them of non-public info is the blocker there. Happy to do it because we need to do it anyway, but it might take me a couple weeks.

I occasionally get circuit_breaking_exception errors with search. Usually if I stop the containers and restart them, these errors go away. Thoughts?

worker | indexing item 400697
worker | error running indexItem ResponseError: circuit_breaking_exception: [circuit_breaking_exception] Reason: Memory Circuit Breaker is open, please check your resources!
worker |     at onBody (/app/node_modules/@opensearch-project/opensearch/lib/Transport.js:426:23)
worker |     at IncomingMessage.onEnd (/app/node_modules/@opensearch-project/opensearch/lib/Transport.js:341:11)
worker |     at IncomingMessage.emit (node:events:529:35)
worker |     at IncomingMessage.emit (node:domain:489:12)
worker |     at endReadableNT (node:internal/streams/readable:1400:12)
worker |     at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
worker |   meta: {
worker |     body: { error: [Object], status: 429 },
worker |     statusCode: 429,
worker |     headers: {
worker |       processor_type: 'text_embedding',
worker |       'content-type': 'application/json; charset=UTF-8',
worker |       'content-length': '373'
worker |     },
worker |     meta: {
worker |       context: null,
worker |       request: [Object],
worker |       name: 'opensearch-js',
worker |       connection: [Object],
worker |       attempts: 0,
worker |       aborted: false
worker |     }
worker |   }
worker | }
worker | running indexItem with args { id: 320332 }

ekzyis Apr 1, 2025
Collaborator

Sorry, I don't have much experience with our search, but maybe @huumn has something to say.

huumn Apr 1, 2025
Maintainer

I'd guess OpenSearch is running out of memory.

  1. These settings might be out of date, or we need to give the OpenSearch container more memory to deal with them:

     PUT _cluster/settings
     {
       "persistent": {
         "plugins.ml_commons.only_run_on_ml_node": "false",
         "plugins.ml_commons.model_access_control_enabled": "true",
         "plugins.ml_commons.native_memory_threshold": "99"
       }
     }

  2. Adding mem_reservation to the opensearch service might help, see: https://docs.docker.com/reference/compose-file/services/#mem_reservation
  3. If neither of those works, we might need to give the Java runtime more heap space.
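
For illustration, points 2 and 3 together might look like this in the compose file. This is a hypothetical snippet, not the actual sndev config; mem_reservation and OPENSEARCH_JAVA_OPTS are standard knobs, but the values are guesses.

# hypothetical compose snippet, not the actual sndev file:
# mem_reservation soft-reserves memory for the container, and
# OPENSEARCH_JAVA_OPTS is the standard way to size the JVM heap
services:
  opensearch:
    mem_reservation: 2g
    environment:
      - OPENSEARCH_JAVA_OPTS=-Xms2g -Xmx2g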