feat(search): support code search by zoekt #33850


Open
adlternative wants to merge 1 commit into go-gitea:main from adlternative:adl/dev/search/support-zoekt-code-indexer

adlternative commented Mar 11, 2025

Abstract

Zoekt is an open-source search engine designed specifically for code search. It indexes content as 3-grams (trigrams), which makes substring matching efficient. Using Zoekt in place of Elasticsearch/Bleve gives Gitea precise code search and support for regular-expression searches.
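
To illustrate why 3-gram indexing suits substring search, here is a toy sketch in Go. It is not Zoekt's implementation, and the names trigrams, buildIndex, and search are hypothetical. The idea is that a document can only contain the query if it contains every trigram of the query, so the exact substring check runs on a small candidate set.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// trigrams returns the distinct overlapping 3-byte windows of s.
func trigrams(s string) []string {
	seen := map[string]bool{}
	var out []string
	for i := 0; i+3 <= len(s); i++ {
		g := s[i : i+3]
		if !seen[g] {
			seen[g] = true
			out = append(out, g)
		}
	}
	return out
}

// buildIndex maps each trigram to the set of document names containing it.
func buildIndex(docs map[string]string) map[string]map[string]bool {
	idx := map[string]map[string]bool{}
	for name, content := range docs {
		for _, g := range trigrams(content) {
			if idx[g] == nil {
				idx[g] = map[string]bool{}
			}
			idx[g][name] = true
		}
	}
	return idx
}

// search returns the sorted names of documents that contain query as a substring.
// Queries shorter than 3 bytes have no trigrams, so every document is a candidate.
func search(docs map[string]string, idx map[string]map[string]bool, query string) []string {
	candidates := map[string]bool{}
	grams := trigrams(query)
	if len(grams) == 0 {
		for name := range docs {
			candidates[name] = true
		}
	} else {
		for name := range idx[grams[0]] {
			candidates[name] = true
		}
		// Keep only documents that contain every remaining trigram.
		for _, g := range grams[1:] {
			for name := range candidates {
				if !idx[g][name] {
					delete(candidates, name)
				}
			}
		}
	}
	var hits []string
	for name := range candidates {
		// Trigram presence is necessary but not sufficient, so verify.
		if strings.Contains(docs[name], query) {
			hits = append(hits, name)
		}
	}
	sort.Strings(hits)
	return hits
}

func main() {
	docs := map[string]string{
		"a.go": "fmt.Println(\"hi\")",
		"b.go": "log.Printf(\"%d\", n)",
	}
	idx := buildIndex(docs)
	fmt.Println(search(docs, idx, "Println(")) // prints [a.go]
}
```

Punctuation is indexed like any other byte here, which is why a query such as "Println(" matches exactly.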

Motivation

The existing code search is implemented with Elasticsearch or Bleve. Although both excel at general-purpose search, they have clear disadvantages for code search:

  1. They cannot do exact-match searches, for example when the query contains punctuation.
  2. They cannot easily support regular-expression searches.

Proposal

Goals

Support precise substring searches
Support regex searches

Non-Goals

Support multi-branch searches
Support code symbol syntax searches

Competitive Product Analysis

Platform | Search Engine | Supports Regex Search | Supports Full Repository Search
GitHub | Blackbird (proprietary) | |
GitLab | Elasticsearch / Zoekt | |
grep.app | Closed source | |
Sourcegraph | Zoekt | |
Gitea (us) | Elasticsearch or Bleve | |

Design

Index

Since Zoekt is written in Go, Gitea can integrate it directly through its Go package, using indexBuilder.Add() and indexBuilder.MarkFileAsChangedOrRemoved() to add or remove indexed files. The basic flow for full and incremental repository indexing with Zoekt does not differ significantly from that for Elasticsearch (ES) or Bleve.

Search

We can use shards.NewDirectorySearcher() or shards.NewDirectorySearcherFast() to build a searcher for searching. The search modes will support:

  • exact – Complete match of any content (including punctuation)
  • words – Split by spaces into multiple search conditions and perform an OR query
  • regexp – Regular expression search
  • zoekt – Using the Zoekt search syntax
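
As a rough illustration of how the exact, words, and regexp modes differ, the sketch below reduces each one to a single regular expression using only Go's standard library. This is not the code in this PR: compileQuery and the string mode names are hypothetical, and the zoekt mode is left out because it needs Zoekt's own query parser.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// compileQuery turns a user query into a regular expression for the given mode.
func compileQuery(mode, query string) (*regexp.Regexp, error) {
	switch mode {
	case "exact":
		// Escape everything so punctuation is matched literally.
		return regexp.Compile(regexp.QuoteMeta(query))
	case "words":
		// Split on whitespace and OR the escaped words together.
		words := strings.Fields(query)
		if len(words) == 0 {
			return nil, fmt.Errorf("empty query")
		}
		for i, w := range words {
			words[i] = regexp.QuoteMeta(w)
		}
		return regexp.Compile(strings.Join(words, "|"))
	case "regexp":
		// Use the query as written; invalid patterns return an error.
		return regexp.Compile(query)
	default:
		return nil, fmt.Errorf("unsupported search mode %q", mode)
	}
}

func main() {
	line := `if err != nil { return err }`
	for _, c := range []struct{ mode, query string }{
		{"exact", "err != nil"},
		{"words", "panic return"},
		{"regexp", `err\s*[!=]=\s*nil`},
	} {
		re, err := compileQuery(c.mode, c.query)
		if err != nil {
			fmt.Println(c.mode, "error:", err)
			continue
		}
		fmt.Println(c.mode, re.MatchString(line))
	}
}
```

All three queries match the sample line: the exact query matches literally, the words query matches because "return" is present, and the regexp query matches the comparison.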

Since the search is currently limited to a single repository, we retrieve all matching results first and then handle pagination.
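
A minimal sketch of that "fetch everything, then paginate" approach, assuming 1-based page numbers. The paginate helper is hypothetical and not taken from this PR.

```go
package main

import "fmt"

// paginate returns the 1-based page of items, with at most pageSize entries.
// It returns nil for invalid arguments or a page past the end.
func paginate[T any](items []T, page, pageSize int) []T {
	if page < 1 || pageSize < 1 {
		return nil
	}
	start := (page - 1) * pageSize
	if start >= len(items) {
		return nil
	}
	end := start + pageSize
	if end > len(items) {
		end = len(items)
	}
	return items[start:end]
}

func main() {
	// All matches for one repository are collected first...
	matches := []string{"a.go", "b.go", "c.go", "d.go", "e.go"}
	// ...then the requested page is sliced out, and the total is known for free.
	fmt.Println(paginate(matches, 2, 2), "of", len(matches)) // prints [c.go d.go] of 5
}
```

This keeps the total count trivially available, at the cost of materializing every match for the repository on each request.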

Usage

Enable it in app.ini:

[indexer]
REPO_INDEXER_TYPE = zoekt
REPO_INDEXER_ENABLED = true
REPO_INDEXER_PATH = indexers/repos.zoekt

Resource Usage

Building the index in Zoekt requires 1.2 times the corpus size in RAM, and the index storage size is about three times the corpus size. Maybe we should expose some of Zoekt's internal Prometheus metrics in the future?
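
For capacity planning, the two ratios above can be turned into a quick estimate. The sketch below only applies the 1.2x and 3x figures quoted in this description; the estimate helper is hypothetical and the real numbers will vary with the corpus.

```go
package main

import "fmt"

// estimate returns rough RAM (while building) and disk (index storage) needs
// in bytes, using the 1.2x and 3x ratios quoted above.
func estimate(corpusBytes int64) (ram, disk int64) {
	return corpusBytes * 6 / 5, corpusBytes * 3
}

func main() {
	const gib = 1 << 30
	ram, disk := estimate(10 * gib)
	fmt.Printf("RAM: %.1f GiB, disk: %.1f GiB\n", float64(ram)/gib, float64(disk)/gib)
	// prints RAM: 12.0 GiB, disk: 30.0 GiB
}
```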

Existing Issues

Tries to support #33702

hiifong, lunny, Worty, editfund-founder, anbraten, helmut72, devhaozi, and milahu reacted with thumbs up emoji editfund-founder reacted with heart emoji
GiteaBot added the lgtm/need 2 label Mar 11, 2025
github-actions bot added the modifies/go and modifies/dependencies labels Mar 11, 2025
adlternative changed the title from "WIP feat(search): support code search by zoekt" to "WIP: feat(search): support code search by zoekt" Mar 11, 2025
Contributor

There are already so many search engines built into Gitea. Many of them have various bugs.

So the questions are:

  1. Will more search engines be added to Gitea, leaving it with plenty of built-in search engines?
  2. Will the search engines become unmaintained, with bugs that never get fixed?

Member

hiifong commented Mar 11, 2025

To be honest I prefer this zoekt search engine compared to the existing search engine

Member

lunny commented Mar 11, 2025

Maybe this can replace Bleve, but we need some comparison tests.

adlternative and appleboy reacted with thumbs up emoji

Contributor

wxiaoguang commented Mar 11, 2025

> To be honest I prefer this zoekt search engine compared to the existing search engine

That's understandable. So a few months later, another one feels "yoekt" is better, then introduce "yoekt", then a few months later, someone feels "xoekt" is better, then introduce "xoekt", and then "woekt", "voekt", "uoekt" ... "coekt", "boekt", "aoekt". Then Gitea contains all search engines on the internet.


I do not object to introducing improvements. But it actually needs to:

  1. Clarify the existing problems & fix existing problems.
  2. Remove unnecessary search engines before introducing new ones.

So a clear roadmap about the "search engine plan" is necessary.

hiifong and adlternative reacted with thumbs up emoji

wxiaoguang marked this pull request as draft March 11, 2025 05:07
Author

> There are already so many search engines built into Gitea. Many of them have various bugs.
>
> So the questions are:
>
>   1. Will more search engines be added to Gitea, leaving it with plenty of built-in search engines?

In my opinion, supporting multiple search engines is a good thing, as users may have different needs. Even GitLab now supports both the ES and Zoekt search engines. See https://docs.gitlab.com/user/search

>   2. Will the search engines become unmaintained, with bugs that never get fixed?

I'm not too worried about this; Gitea should have good community maintenance. The code search functionality is not exposed by default, which may be why many bugs haven't been discovered.

Contributor

wxiaoguang commented Mar 11, 2025

> In my opinion, supporting multiple search engines is a good thing, as users may have different needs. Even GitLab now supports both the ES and Zoekt search engines. See https://docs.gitlab.com/user/search
> I'm not too worried about this; Gitea should have good community maintenance. The code search functionality is not exposed by default, which may be why many bugs haven't been discovered.

Well, do you know how many search engines are in Gitea now? And what longstanding bugs do they have? https://github.com/go-gitea/gitea/issues?q=is%3Aissue%20state%3Aopen%20code%20search

And some bugs haven't been fixed in months, for example "Search Functionality Issues with Bleve Engine #31565". I don't see "good community maintenance".

Author

> > To be honest I prefer this zoekt search engine compared to the existing search engine
>
> That's understandable. So a few months later, another one feels "yoekt" is better, then introduce "yoekt", then a few months later, someone feels "xoekt" is better, then introduce "xoekt", and then "woekt", "voekt", "uoekt" ... "coekt", "boekt", "aoekt". Then Gitea contains all search engines on the internet.

You don't need to worry about this: Zoekt is a popular code search engine, currently used by code platforms like Gerrit, Sourcegraph, and GitLab; it was written by a Gerrit author and is maintained by Sourcegraph. Zoekt has advantages that traditional search engines (like ES) do not possess: support for regex matching, substring search, etc. I don't think any new open-source code search engines will be able to replace it in the short term.

> I do not object to introducing improvements. But it actually needs to:
>
>   1. Clarify the existing problems & fix existing problems.
>   2. Remove unnecessary search engines before introducing new ones.
>
> So a clear roadmap about the "search engine plan" is necessary.

You are right. Where should the roadmap be written? I don't have experience with this. I will supplement its documentation when the zoekt functionality is more complete.

Contributor

> I don't think any new open-source code search engines will be able to replace it in the short term.

Yep, if zoekt wins, we need to drop some others.

Author

> > In my opinion, supporting multiple search engines is a good thing, as users may have different needs. Even GitLab now supports both the ES and Zoekt search engines. See https://docs.gitlab.com/user/search
> > I'm not too worried about this; Gitea should have good community maintenance. The code search functionality is not exposed by default, which may be why many bugs haven't been discovered.
>
> Well, do you know how many search engines are in Gitea now? And what longstanding bugs do they have? https://github.com/go-gitea/gitea/issues?q=is%3Aissue%20state%3Aopen%20code%20search
>
> And some bugs haven't been fixed in months, for example "Search Functionality Issues with Bleve Engine #31565". I don't see "good community maintenance".

Sure, it's regrettable that this part of the content is unmaintained. However, for the zoekt code search, I can commit to maintaining it thoroughly.

wxiaoguang, hiifong, lunny, and joelhy reacted with thumbs up emoji wxiaoguang, TheFox0x7, hiifong, lunny, and joelhy reacted with heart emoji

Author

> > I don't think any new open-source code search engines will be able to replace it in the short term.
>
> Yep, if zoekt wins, we need to drop some others.

Yeah, I hope this can be divided into at least two steps:

  1. Support zoekt
  2. Deprecate other search engines

Zoekt may also have some issues, as GitLab has not completely deprecated ES and fully switched to Zoekt...

lunny reacted with thumbs up emoji

adlternative force-pushed the adl/dev/search/support-zoekt-code-indexer branch from 17d7c30 to 212fc79 March 11, 2025 11:16
Contributor

To make the code clear, we need to refactor the related code first: Refactor issue & code search #33860

Each "indexer" should itself declare the "search modes" it supports. And we need to remove the "fuzzy" search for code.
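
One possible shape for that, as a sketch only: every type and method name below is hypothetical and not taken from Gitea or from #33860. The mode list for zoektIndexer follows this PR's description, while the list for legacyIndexer is just a placeholder.

```go
package main

import "fmt"

// SearchMode identifies one way of interpreting a query (hypothetical type).
type SearchMode string

// CodeIndexer is a hypothetical interface: each backend reports its own modes.
type CodeIndexer interface {
	Name() string
	SupportedSearchModes() []SearchMode
}

// legacyIndexer stands in for an existing backend; its mode list is a placeholder.
type legacyIndexer struct{}

func (legacyIndexer) Name() string { return "legacy" }
func (legacyIndexer) SupportedSearchModes() []SearchMode {
	return []SearchMode{"exact", "words"}
}

// zoektIndexer lists the four modes named in this PR's description.
type zoektIndexer struct{}

func (zoektIndexer) Name() string { return "zoekt" }
func (zoektIndexer) SupportedSearchModes() []SearchMode {
	return []SearchMode{"exact", "words", "regexp", "zoekt"}
}

// supports reports whether idx declares the given mode.
func supports(idx CodeIndexer, mode SearchMode) bool {
	for _, m := range idx.SupportedSearchModes() {
		if m == mode {
			return true
		}
	}
	return false
}

func main() {
	for _, idx := range []CodeIndexer{legacyIndexer{}, zoektIndexer{}} {
		fmt.Println(idx.Name(), "supports regexp:", supports(idx, "regexp"))
	}
}
```

With this shape the UI can build its mode selector from whatever the configured indexer reports, instead of hard-coding a list per engine.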

Author

Please note that I have many other commitments over the next two weeks and may only be able to dedicate time to this MR in a couple of weeks

lunny and Waytal reacted with thumbs up emoji techknowlogick reacted with heart emoji

adlternative force-pushed the adl/dev/search/support-zoekt-code-indexer branch 3 times, most recently from 783ee0e to 374ce10 April 5, 2025 10:56
adlternative force-pushed the adl/dev/search/support-zoekt-code-indexer branch 3 times, most recently from 850a16a to 86ef977 April 5, 2025 12:15
adlternative force-pushed the adl/dev/search/support-zoekt-code-indexer branch from 86ef977 to 9906c5f April 5, 2025 12:24
adlternative changed the title from "WIP: feat(search): support code search by zoekt" to "feat(search): support code search by zoekt" Apr 6, 2025
GiteaBot added the lgtm/need 2 label and removed the lgtm/need 1 label Apr 16, 2025
Contributor

wxiaoguang commented Apr 16, 2025

Thank you for inviting me to review. At the moment I don't have a full picture of it (I am not the heavy user of the "code search" feature). So I think the real users (eg: people who up-voted the proposal #33702 or have worked with the code search feature) could help more.

Thank you for inviting me to review, I could help to do code-level review, but I am not sure I could speak for real users for this feature (I seldom use code search). So maybe you could invite other maintainers or contributors to help to review.


Maybe @techknowlogick @lunny @hiifong could do further review.

wxiaoguang removed their request for review April 16, 2025 11:49
IsDelta: true,
RepositoryDescription: zoekt.Repository{
ID: uint32(repo.ID),
Name: repo.FullName(),
Contributor

wxiaoguang Apr 16, 2025

Not sure how the zoekt.Repository works, but what if a repo is renamed? Then the same repo ID with a different name?

Author

adlternative Apr 16, 2025

It appears that before and after renaming, search and indexing can both proceed normally: everything is associated through the repo ID. However, currently when deleting an index, it first looks up the repo name using the repo ID, and then deletes files on disk that have the repo name prefix. Therefore, there might be some residual data files from before the rename.

adl/xx3 -- rename -> adl/xx422

The delete operation will only delete adl%2Fxx422_v16.00000.zoekt...

Perhaps I should check later whether there is a way to delete an index directly by repo ID.
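
To make the leftover-file problem concrete, here is a small sketch. It assumes, based only on the file name shown above, that shard files start with the URL-escaped repo name followed by "_v". The old-name file in the example is hypothetical, and shardPrefix and shardsToDelete are not Zoekt or Gitea functions.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// shardPrefix is an assumed naming scheme inferred from the file name above:
// the URL-escaped repo name followed by "_v".
func shardPrefix(repoName string) string {
	return url.QueryEscape(repoName) + "_v"
}

// shardsToDelete splits files into those a delete keyed on the current repo
// name would remove and those it would leave on disk.
func shardsToDelete(files []string, currentName string) (remove, keep []string) {
	prefix := shardPrefix(currentName)
	for _, f := range files {
		if strings.HasPrefix(f, prefix) {
			remove = append(remove, f)
		} else {
			keep = append(keep, f)
		}
	}
	return remove, keep
}

func main() {
	files := []string{
		"adl%2Fxx3_v16.00000.zoekt",   // hypothetical shard written before the rename
		"adl%2Fxx422_v16.00000.zoekt", // shard written after the rename
	}
	remove, keep := shardsToDelete(files, "adl/xx422")
	fmt.Println("deleted:", remove)
	fmt.Println("left behind:", keep)
}
```

Because the lookup goes from repo ID to the current name only, nothing ever matches the old prefix. A delete keyed on the repo ID stored in the shard metadata would avoid that.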

Member

lunny commented Apr 16, 2025

I think this new indexer engine could be merged as an experimental one and stay that way for some versions to gather feedback before it becomes the default one.

adlternative reacted with thumbs up emoji

GiteaBot added the lgtm/need 1 label and removed the lgtm/need 2 label Apr 17, 2025
Member

hiifong commented Apr 17, 2025

If I remember correctly, @techknowlogick used to contribute code to zoekt, hopefully he'll have time to review this pr

Member

lunny commented Apr 17, 2025

Please update app.example.ini to add the newly introduced indexer.

adlternative and editfund-founder reacted with thumbs up emoji

needGenesis = len(stdout) == 0
}

// TODO: check if zoekt index file meta status is not sync with db index status, if not, get genesis changes
Member

Do we still need this comment?

Author

adlternative Apr 18, 2025

Yes, this check can help ensure the correctness of the index data, but it's not necessary at the moment—it can be added in the future.

adlternative force-pushed the adl/dev/search/support-zoekt-code-indexer branch from 269d7ac to 6c6eee4 April 18, 2025 04:39
github-actions bot added the docs-update-needed label Apr 18, 2025
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
adlternative force-pushed the adl/dev/search/support-zoekt-code-indexer branch from 6c6eee4 to 896a16b April 18, 2025 04:43
Author

@lunny Further review needed. Is there a chance for this PR to be merged into 1.24?

lunny added this to the 1.25.0 milestone Apr 23, 2025
Member

lunny commented Apr 23, 2025

> @lunny Further review needed. Is there a chance for this PR to be merged into 1.24?

We will release v1.24 very soon. I think this can be merged in v1.25.

adlternative and milahu reacted with thumbs up emoji

Contributor

And some design problems should be addressed (like #33850 (comment), and IIRC there might still be a few more; I will comment later when I get time). Although it is "experimental", we still need to get the overall design right.

wxiaoguang removed this from the 1.25.0 milestone Aug 29, 2025

What happened to this? I was looking forward to this one.

Contributor

kvaster commented Sep 27, 2025

For us this is also a highly awaited feature.

Author

@kvaster @seamon67 I believe this MR was forgotten in a small corner, and if needed, I can pick it up again.

lunny, seamon67, theoparis, milahu, and z-xavier reacted with thumbs up emoji


> @kvaster @seamon67 I believe this MR was forgotten in a small corner, and if needed, I can pick it up again.

That would be awesome!


Reviewers

wxiaoguang left review comments

Awaiting requested review from lunny

+1 more reviewer

hiifong approved these changes

Reviewers whose approvals may not affect merge requirements

Assignees

No one assigned

Labels

docs-update-needed, lgtm/need 1, modifies/dependencies, modifies/go, modifies/translation

Projects

None yet

Milestone

No milestone
