Search improvements discussion #1916

ed-kung started this conversation in Ideas
Feb 19, 2025 · 2 comments · 12 replies

Wanted to open up a discussion on search improvements (#1885), since search is a multi-objective optimization problem requiring opinionated decisions on how to weigh each objective. There are four main objectives: exact keyword matches, semantic matches, popularity (zaprank), and recency. (Do we know if people ever sort by comments or sats?)

Currently, searching by "recent" is broken because the "must" field requires a fairly precise keyword match and thus often returns no results. You can test this by changing the following lines in api/resolvers/search.js:

termQueries.push({
  multi_match: {
    query,
    type: 'phrase',
    fields: ['title^100', 'text'],
    boost: 1000
  }
})

For example, if you change type to 'most_fields' and add fuzziness: 'AUTO', searching by recent will no longer return null results while still prioritizing recency.
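
A minimal sketch of that change (the surrounding query structure and field boosts are kept from the current file; most_fields and fuzziness are standard multi_match options, so treat this as a sketch, not a tested patch):

termQueries.push({
  multi_match: {
    query,
    // 'most_fields' scores per-field matches instead of requiring an
    // exact phrase, and fuzziness tolerates small typos, so a sort by
    // "recent" still has candidate documents to order by date
    type: 'most_fields',
    fuzziness: 'AUTO',
    fields: ['title^100', 'text'],
    boost: 1000
  }
})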

But before fixing that, it would be helpful to think through what SN's search is trying to accomplish and how search performance should be measured. There is a lot of info out there, and the specific details of major companies' search engines are closely guarded secrets. Thus, there isn't a one-size-fits-all or industry-standard solution.

I have a couple of thoughts on how to proceed:

Learning-to-Rank

The obvious, but heavy-lift, solution would be to use ML on users' actual searches (what they searched, what they clicked on, and whether they re-did the search) to estimate optimal weights from SN users' behavior. If user data is not available, synthetic data could be generated for learning-to-rank, but the results likely won't reflect the actual user experience as robustly.
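
To make the idea concrete, here is a rough sketch of the training signal learning-to-rank needs; every name and value here is hypothetical, not something SN currently logs:

// Hypothetical judgment list derived from search logs. Labels come
// from behavior: a click is weak evidence of relevance; a click with
// no follow-up re-search is stronger.
const judgments = [
  {
    query: 'lightning channel fees',
    itemId: 12345,
    features: { bm25: 7.2, semantic: 0.81, zaprank: 0.4, ageDays: 3 },
    label: 2 // clicked, no re-search
  },
  {
    query: 'lightning channel fees',
    itemId: 67890,
    features: { bm25: 9.1, semantic: 0.55, zaprank: 0.1, ageDays: 400 },
    label: 0 // shown but skipped
  }
]
// a learning-to-rank model (e.g. LambdaMART) then learns how to weigh
// these features to reproduce the observed preferences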

The downside of this approach is that it's a lot of work. The juice may not be worth the squeeze.

Off-the-shelf solution plus trial and error

The only other approach I can really think of is to take an off-the-shelf approach suggested by others, like applying a Gaussian decay function to created_at for recency searches, combined with a hybrid keyword/neural approach for the query terms. Another suggested approach is reciprocal rank fusion, which is claimed to perform quite well and is simple to implement.
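
For reference, reciprocal rank fusion really is only a few lines; this sketch (a hypothetical helper, with k = 60 as in the original RRF paper) merges a keyword result list and a neural result list by rank rather than by raw score:

// Reciprocal rank fusion: an item's fused score is the sum of
// 1 / (k + rank) over every result list it appears in. Using ranks
// instead of raw scores sidesteps the fact that BM25 scores and
// neural similarity scores live on incomparable scales.
function reciprocalRankFusion (resultLists, k = 60) {
  const scores = new Map()
  for (const results of resultLists) {
    results.forEach((item, idx) => {
      scores.set(item.id, (scores.get(item.id) ?? 0) + 1 / (k + idx + 1))
    })
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }))
}
// e.g. reciprocalRankFusion([keywordResults, neuralResults])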

But even with off-the-shelf methods, some parameters will need to be chosen, like how much to weigh recency vs. relevance. Without an obvious evaluation metric, though, it will be hard to know whether changes to the search function actually improved searches. This may be where trial and error comes in: see whether a newly implemented search function draws positive or negative user feedback, alongside personal experimentation with the results.

Excellent write-up.

off-the-shelf

My general sense is that we are under-utilizing off-the-shelf solutions. My hope is that zaprank provides enough of a quality signal that we can combine with relevance metrics available off-the-shelf to provide a satisfactory search experience.

Our current search is basically two iterations of work done by me while I'm working on a bunch of other stuff: once three years ago to just get anything working, then again 18 months ago to add semantic search. So my sense is based on the fact that we haven't given search optimization a real try.

trial and error

A real try probably looks like building up a set of test searches and expected results, then tweaking the query until the results are satisfactory across all of them. A sketch of what that harness could look like is below.
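
Everything in this sketch is hypothetical: the canned queries, the runSearch signature, and recall@10 as the metric.

// Hypothetical harness: run each canned query through the search
// resolver and check that the expected items land in the top 10.
const testSearches = [
  { query: 'bitcoin halving', expectIds: [101, 202] },
  { query: 'zaprank explained', expectIds: [303] }
]

async function evaluate (runSearch) {
  for (const { query, expectIds } of testSearches) {
    const results = await runSearch(query) // assumed to return [{ id, ... }]
    const top10 = new Set(results.slice(0, 10).map(r => r.id))
    const hits = expectIds.filter(id => top10.has(id)).length
    console.log(`${query}: recall@10 = ${hits}/${expectIds.length}`)
  }
}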

beyond off-the-shelf

I'm also open to more elaborate solutions if it's ultimately low maintenance. I'm just out of my depth with anything bespoke. Learning-to-rank is a neat idea - I don't imagine we have enough clicks/data to train something like that too well though.

Absent trial and error being good enough, or a wizard providing some low maintenance solve, it may be worth outsourcing search to a third party like algolia and washing our hands of it.

My hope is and has been that we can use open source, off-the-shelf stuff but maybe the reality is that we can't.

Those steps would have to be done manually. At least, we don't currently have automations for them.

I implemented neural search before we made sndev, and it was never made part of the dev setup.

I've made some progress and have a working search implementation that solves some of the current issues. Namely, I added a "sort-by-relevance" feature, and "sort-by-recent" is less strict, so it doesn't fail to return results as often. I also tried a Gaussian decay for the date field, which I think should work better than the current squaring of createdAt.
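
For reference, a Gaussian decay on the date field looks roughly like this as an OpenSearch function_score clause; the scale and decay numbers below are placeholders, not what the PR actually uses:

// Sketch of a Gaussian date decay: an item keeps full weight at
// `origin` and drops smoothly to `decay` of its score at `scale`
// away. The concrete numbers are placeholders.
const decayedQuery = {
  function_score: {
    query, // the existing keyword/neural query
    functions: [{
      gauss: {
        createdAt: { origin: 'now', scale: '30d', decay: 0.5 }
      }
    }],
    boost_mode: 'multiply'
  }
}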

Before turning it into a PR, I figured I should seek some guidance on the desired workflow for testing, since I only have access to a sample DB that doesn't contain semantically coherent posts. That makes it tricky to test the influence of neural search relative to keyword search, and to evaluate the relative weight of semantic similarity vs. other sort factors in the resulting ordering.

So in terms of testing, does it make more sense for me to submit the PR for review and let y'all test it on your side? Or would you rather make a test db for search?

lmk what you think.

If you submit the PR for review I can test.

If you make the db for testing search I'll pay an additional 1m small ones.

Up to you.

Ok, submitted the PR. I would need to think about how to make a good test DB. You wouldn't be open to sharing data dumps, would you?

Needing to scrub them of non-public info is the blocker there. Happy to do it because we need to do it anyway, but it might take me a couple weeks.

I occasionally get circuit_breaking_exception errors with search. Usually if I stop the containers and restart them, these errors go away. Thoughts?

worker | indexing item 400697
worker | error running indexItem ResponseError: circuit_breaking_exception: [circuit_breaking_exception] Reason: Memory Circuit Breaker is open, please check your resources!
worker |     at onBody (/app/node_modules/@opensearch-project/opensearch/lib/Transport.js:426:23)
worker |     at IncomingMessage.onEnd (/app/node_modules/@opensearch-project/opensearch/lib/Transport.js:341:11)
worker |     at IncomingMessage.emit (node:events:529:35)
worker |     at IncomingMessage.emit (node:domain:489:12)
worker |     at endReadableNT (node:internal/streams/readable:1400:12)
worker |     at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
worker |   meta: {
worker |     body: { error: [Object], status: 429 },
worker |     statusCode: 429,
worker |     headers: {
worker |       processor_type: 'text_embedding',
worker |       'content-type': 'application/json; charset=UTF-8',
worker |       'content-length': '373'
worker |     },
worker |     meta: {
worker |       context: null,
worker |       request: [Object],
worker |       name: 'opensearch-js',
worker |       connection: [Object],
worker |       attempts: 0,
worker |       aborted: false
worker |     }
worker |   }
worker | }
worker | running indexItem with args { id: 320332 }

ekzyis Apr 1, 2025
Collaborator

Sorry, I don't have much experience with our search, but maybe @huumn has something to say.

huumn Apr 1, 2025
Maintainer

I'd guess OpenSearch is running out of memory.

  1. These settings might be out of date, or we need to give the OpenSearch container more memory to deal with them:

     PUT _cluster/settings
     {
       "persistent": {
         "plugins.ml_commons.only_run_on_ml_node": "false",
         "plugins.ml_commons.model_access_control_enabled": "true",
         "plugins.ml_commons.native_memory_threshold": "99"
       }
     }

  2. Adding mem_reservation to the opensearch service might help, see: https://docs.docker.com/reference/compose-file/services/#mem_reservation
  3. If neither of those works, we might need to give the Java runtime more heap space.
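
For illustration, points 2 and 3 together might look like this in the compose file. This is a hypothetical snippet, not the actual sndev config; mem_reservation and OPENSEARCH_JAVA_OPTS are standard knobs, but the values are guesses.

# hypothetical compose snippet, not the actual sndev file:
# mem_reservation soft-reserves memory for the container, and
# OPENSEARCH_JAVA_OPTS is the standard way to size the JVM heap
services:
  opensearch:
    mem_reservation: 2g
    environment:
      - OPENSEARCH_JAVA_OPTS=-Xms2g -Xmx2g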