Language analysis wishlist #1063
Inspired by a few issues that have been raised, I'm making this discussion so we have a list of things that we wish our language analysis tool (currently tokei) would do. This discussion can be referenced if we decide to switch tools, fork tokei, or write our own from scratch (the latter is in my ever-growing list of projects I want to do but get side-tracked from by other to-dos 🙃).
Want
I'm putting things here that I think we definitely want.
Classification (#26)
Currently, tokei analyzes the file extension and shebang, and it looks like there is some interest for using modelines. However, there seems to be little to no interest in analyzing the actual code contents for classification, as the maintainer doesn't consider this deterministic -- see XAMPPRocky/tokei#708, XAMPPRocky/tokei#305, and XAMPPRocky/tokei#764 for example.
These are reasonable metrics to use, but I don't think they're enough for our usage. I think that language classification should work "out of the box," and manually overriding with modelines or a configuration file should be the exception, not the rule. For this, I think it's necessary to analyze the source code and make a best guess as to which language it is. IIRC, github-linguist uses heuristics (regexes of syntax unique to the language) and Bayesian classification from code samples.
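As a rough sketch of what content-based heuristics could look like (the needle strings and language names below are illustrative, not linguist's actual rules):

```rust
// Minimal sketch of content-based classification, assuming a tiny
// illustrative heuristic table; a real classifier would use per-language
// regexes and fall back to a Bayesian model trained on code samples.
fn classify(contents: &str) -> Option<&'static str> {
    // Each heuristic is a (needle, language) pair; first match wins.
    const HEURISTICS: &[(&str, &str)] = &[
        ("#include <", "C/C++"),
        ("fn main()", "Rust"),
        ("def __init__", "Python"),
    ];
    HEURISTICS
        .iter()
        .find(|&&(needle, _)| contents.contains(needle))
        .map(|&(_, lang)| lang)
}

fn main() {
    assert_eq!(classify("fn main() {}"), Some("Rust"));
    // No heuristic matched; a real tool would consult extension,
    // shebang, or a trained classifier here.
    assert_eq!(classify("hello world"), None);
    println!("classification sketch ok");
}
```

In practice this would only run when extension and shebang are ambiguous, since scanning contents for every file is comparatively expensive.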
Language Categories
E.g. programming, data, etc.
We've implemented this in onefetch, but I believe it would be better if this was implemented in the language analysis tool.
It's also possible that tokei will eventually add language categories: XAMPPRocky/tokei#962 (comment)
Maybe
Here are things that may need further discussion.
Analyze revs (#1033)
Currently we analyze the contents of the filesystem. This can be confusing when there is a large number of untracked files. For example, a user will likely have every single project in some subfolder of $HOME, and if they have a dotfiles repo at $HOME/.git, then onefetch can return wildly inaccurate results by including all of the untracked files in subfolders.
Since we do require the existence of a git repository, I don't think it's unreasonable to analyze a git rev, defaulting to HEAD, instead of the directory contents. This can give better insight into what the project is, instead of what the project could be, if that makes sense.
As an added bonus, if we analyze revs instead of directory contents, we could probably start supporting bare repos.
Don't use LOC for language distribution
With the following project:

```js
// foo.js
const foo = [1, 2, 3, 4];
```

```ts
// foo.ts
const foo = [
  1,
  2,
  3,
  4,
];
```

Onefetch will consider this 86% TypeScript and 14% JavaScript, when, syntactically, it's more like 50-50. Lines of code might not be the best metric, as code style can severely influence the LOC without adding or removing actual code.
github-linguist uses blob size, and returns 56% TypeScript, 43% JavaScript, which is a more accurate distribution in this example.
There are a few things we can do to make things even more accurate. Counting uncommented tokens might be the most accurate, though this might be too computationally intensive.
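For illustration, a blob-size-based distribution could be computed like this (the byte sizes below are made up, not measured from the example above):

```rust
use std::collections::HashMap;

// Sketch: compute a language distribution from blob sizes (bytes)
// instead of lines of code.
fn distribution(files: &[(&str, u64)]) -> HashMap<String, f64> {
    let total: u64 = files.iter().map(|&(_, size)| size).sum();
    let mut by_lang: HashMap<String, u64> = HashMap::new();
    for &(lang, size) in files {
        *by_lang.entry(lang.to_string()).or_insert(0) += size;
    }
    by_lang
        .into_iter()
        .map(|(lang, size)| (lang, size as f64 * 100.0 / total as f64))
        .collect()
}

fn main() {
    // Illustrative sizes: reformatting a file barely changes its byte
    // count, so the split stays stable regardless of code style.
    let dist = distribution(&[("TypeScript", 52), ("JavaScript", 40)]);
    println!("{:?}", dist);
}
```

The nice property here is that reformatting a file (one line vs. many) changes its byte size only marginally, so the distribution is far less sensitive to code style than LOC.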
Detect by filename (excluding extension)
This is something that both tokei and github-linguist currently can't do! Some examples of this would be detecting Dockerfile.node as Dockerfile, or Makefile.amd64 as Makefile. I haven't seen any complaints here yet, but this could be a nice-to-have. The biggest hurdle would be what happens with Dockerfile.js. Is that a Dockerfile, or a JavaScript file? Even with classification, one should probably take priority over the other.
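A minimal sketch of filename-based detection, assuming the filename rule takes priority over the extension rule (the lookup table is illustrative):

```rust
// Sketch of detection by filename stem, with an illustrative table.
// The ambiguity noted above (Dockerfile.js) is resolved here by letting
// the filename rule win over the extension rule.
fn detect_by_filename(name: &str) -> Option<&'static str> {
    // Take the part before the first '.', e.g. "Dockerfile" from
    // "Dockerfile.node".
    let stem = name.split('.').next()?;
    match stem {
        "Dockerfile" => Some("Dockerfile"),
        "Makefile" => Some("Makefile"),
        _ => None,
    }
}

fn main() {
    assert_eq!(detect_by_filename("Dockerfile.node"), Some("Dockerfile"));
    assert_eq!(detect_by_filename("Makefile.amd64"), Some("Makefile"));
    // The ambiguous case: filename rule beats the .js extension.
    assert_eq!(detect_by_filename("Dockerfile.js"), Some("Dockerfile"));
    println!("filename detection sketch ok");
}
```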
-
I came across an interesting project called hyperpolyglot, which aims to replicate the functionality of GitHub Linguist in Rust.
Checking out this project might give you some ideas!
-
Thanks for mentioning this! Thoughts on the wishlist, BTW? Anything to add, remove, or move?
-
Sure, IMO one of the main drawbacks of tokei, apart from its lack of maintenance 😭, is, as you mentioned, that it mostly relies on file extensions for language detection, which sacrifices accuracy in favor of performance.
Ideally, it would be great to have a Rust equivalent of GitHub Linguist since Linguist has emerged as the de facto standard for language detection. Users often refer to Linguist when pointing out discrepancies with onefetch.
As a sidenote, I wonder how (or whether) Linguist handles nested languages in Markdown and Jupyter notebooks 🤔
> We've implemented this in onefetch, but I believe it would be better if this was implemented in the language analysis tool.
The chip color for each language could also be provided.
> Currently we analyze the contents of the filesystem.

Yes, but filesystem + .gitignore ~= tracked files?
> github-linguist uses blob size, and returns 56% TypeScript, 43% JavaScript, which is a more accurate distribution in this example.

Very good point, I wasn't aware of that. We may need both in that case: blob size for the language distribution, and LOC for the info line of the same name.
-
> Yes, but filesystem + .gitignore ~= tracked files?
What I mean is that, if I did this:
```sh
git init
echo 'console.log("Hello, World!");' > foo.js
git add foo.js
git commit -m "Create foo.js"
mv foo.js foo.ts
github-linguist
onefetch
```
Then github-linguist detects JavaScript, and onefetch detects TypeScript, because linguist is analyzing HEAD, not the current state of the files.
-
My bad, I didn't know that Linguist only acknowledged committed changes.
However, I could see an argument that users would want their changes taken into account live, without needing to commit.
Still, as you said, for onefetch, being a Git information tool, it does make sense to stick to HEAD when computing the language distribution.
-
Yeah, I think the biggest argument for analyzing HEAD is the confusion in #26 (comment). But I think the majority of onefetch's users expect the current files to be analyzed. Or at least, I don't remember anyone else raising an issue about it.
My first time executing github-linguist locally, I actually found it surprising that I needed to commit changes for them to be analyzed.
-
Just an FYI that I've started a project to hit the things on this wishlist (productively procrastinating from my other personal projects) 😉
It's basically going to be "linguist but in Rust," but I'm also adding language detection by filename pattern. E.g. `tsconfig(?:\..+)?\.json`.
-
If you want to preview or contribute, let me know and I'll send an invite (I think I can send a few on the free plan).
Nevermind, I made it public.
-
Very happy to see this project coming to life 😊 . Do you have an estimated timeline for when we might be able to replace tokei with gengo? What are the key components still missing for the transition?
It would be great to see some benchmarks comparing gengo with other tools, especially on complex repositories. Also, how does (or will) gengo handle nested languages, such as Markdown or Jupyter notebooks?
Will gengo support the exclusion of specific glob patterns as a parameter, similar to how tokei does?
I'll be keeping an eye on this project and hope to contribute. Please don't hesitate to create issues and tag them with "help wanted"
-
As far as replacing goes, we'll need a lot more language support 😆 Also, the API is pretty different from tokei, so that will take some work on this end.
I don't foresee nested languages being supported since, like linguist, distribution is calculated by blob size, and for performance reasons it won't read the whole file (by default, only the first MB).
> Will gengo support the exclusion of specific glob patterns as a parameter, similar to how tokei does?
Yes, via gitattributes. By default, a file is excluded if it is detected as documentation, generated, or vendored code. But you can customize this behavior. For example:

```gitattributes
# include dist/
dist/* gengo-detectable

# exclude js files
*.js -gengo-detectable
```
-
🤔 Thinking about benchmarks, it's probably going to be pretty unfair to start comparing until there is a roughly equal amount of language entries as linguist and tokei. As it is right now, each language entry would add another iteration for attempting to identify a file. Although, now that I'm thinking about it, I could probably gain a lot of performance by mapping extensions, shebangs, etc. to lists of languages instead of mapping languages to extensions, shebangs, etc. 🤔
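The inverted mapping described there might look something like this (the language/extension pairs are illustrative):

```rust
use std::collections::HashMap;

// Sketch of the inverted lookup: instead of scanning every language
// entry per file, build a map from extension to candidate languages
// once, so each file lookup is a single hash probe followed by
// disambiguation of only the candidates.
fn build_index<'a>(
    languages: &[(&'a str, &'a [&'a str])],
) -> HashMap<&'a str, Vec<&'a str>> {
    let mut index: HashMap<&'a str, Vec<&'a str>> = HashMap::new();
    for &(lang, exts) in languages {
        for &ext in exts {
            index.entry(ext).or_default().push(lang);
        }
    }
    index
}

fn main() {
    let c_exts: &[&str] = &["c", "h"];
    let cpp_exts: &[&str] = &["cpp", "h"];
    let index = build_index(&[("C", c_exts), ("C++", cpp_exts)]);
    // One lookup yields only the candidate languages for "h",
    // rather than iterating over every language entry.
    assert_eq!(index["h"], vec!["C", "C++"]);
    println!("inverted index sketch ok");
}
```

With this shape, adding more language entries grows the index build (done once) rather than the per-file cost.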
-
Speaking about benchmarks, when I ran tokei and gengo on the linux repo they were about equal. Tokei was about 20 seconds, gengo about 22. github-linguist took 5 minutes IIRC...