FSoft-AI4Code/CodeText-parser


Code-Text parser is a custom tree-sitter grammar parser that extracts classes and functions from raw source code. We support 10 common programming languages:

  • Python
  • Java
  • JavaScript
  • PHP
  • Ruby
  • Rust
  • C
  • C++
  • C#
  • Go

Installation

The codetext package requires Python 3.7 or above and tree-sitter. Set up the environment and install the dependencies manually from source:

git clone https://github.com/FSoft-AI4Code/CodeText-parser.git; cd CodeText-parser
pip install -r requirement.txt
pip install -e .

Or install the package from PyPI:

pip install codetext

Getting started

codetext CLI Usage

codetext [options] [PATH or FILE] ...

For example, to extract every Python file in the src/ folder:

codetext src/ --language Python

To store the extracted classes and functions, use the --json flag and give the path to a destination file:

codetext src/ --language Python --output_file ./python_report.json --json

Options

positional arguments:
  paths                 list of the filename/paths.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -l LANGUAGE, --language LANGUAGE
                        Target the programming languages you want to analyze.
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Output file (e.g report.json).
  --json                Generate json output as a transform of the default
                        output
  --verbose             Print progress bar
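
When codetext is driven from a script, the flags listed above can be assembled into a command line. The following is a minimal sketch: build_codetext_command is a hypothetical helper written for this illustration (it is not part of codetext), and it only uses the flags documented in the options above.

```python
from typing import List, Optional, Sequence


def build_codetext_command(
    paths: Sequence[str],
    language: str,
    output_file: Optional[str] = None,
    as_json: bool = False,
    verbose: bool = False,
) -> List[str]:
    """Hypothetical helper: build an argv list from the documented CLI flags."""
    cmd = ["codetext", *paths, "--language", language]
    if output_file is not None:
        cmd += ["--output_file", output_file]
    if as_json:
        cmd.append("--json")
    if verbose:
        cmd.append("--verbose")
    return cmd


cmd = build_codetext_command(
    ["src/"], "Python", output_file="./python_report.json", as_json=True
)
print(" ".join(cmd))
# prints: codetext src/ --language Python --output_file ./python_report.json --json
```

With codetext installed, the list could be passed to subprocess.run(cmd, check=True) from the standard library.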

Example

File circle_linkedlist.py analyzed:
==================================================
Number of class : 1
Number of function : 2
--------------------------------------------------
Class summary:
+-----+---------+-------------+
|   # | Class   | Arguments   |
+=====+=========+=============+
|   0 | Node    |             |
+-----+---------+-------------+
Class analyse: Node
+-----+---------------+-------------+--------+---------------+
|   # | Method name   | Paramters   | Type   | Return type   |
+=====+===============+=============+========+===============+
|   0 | __init__      | self        |        |               |
|     |               | data        |        |               |
+-----+---------------+-------------+--------+---------------+
Function analyse:
+-----+-----------------+-------------+--------+---------------+
|   # | Function name   | Paramters   | Type   | Return type   |
+=====+=================+=============+========+===============+
|   0 | push            | head_ref    |        | Node          |
|     |                 | data        | Any    | Node          |
|   1 | countNodes      | head        | Node   |               |
+-----+-----------------+-------------+--------+---------------+

Using codetext as Python module

Build your language

codetext needs a tree-sitter language file (i.e. a .so file) to work properly. You can compile the language manually (see more) or build it automatically with our predefined function (the <language>.so file will be saved in a folder named /tree-sitter/):

from codetext.utils import build_language
language = 'rust'
build_language(language)
# INFO:utils:Not found tree-sitter-rust, attempt clone from github
# Cloning into 'tree-sitter-rust'...
# remote: Enumerating objects: 2835, done. ...
# INFO:utils:Attempt to build Tree-sitter Language for rust and store in .../tree-sitter/rust.so

Using Language Parser

Each supported programming language corresponds to a custom language_parser (e.g. Python uses PythonParser()). A language_parser takes raw source code as input and uses breadth-first search to traverse all syntax nodes. Classes, methods, and stand-alone functions are then collected:

from codetext.utils import parse_code
raw_code = """
 /**
 * Sum of 2 number
 * @param a int number
 * @param b int number
 */
 double sum2num(int a, int b) {
 return a + b;
 } 
"""
# Auto parse code into tree-sitter.Tree
root = parse_code(raw_code, 'cpp')
root_node = root.root_node

Get all function nodes inside a specific node:

from codetext.utils.parser import CppParser
function_list = CppParser.get_function_list(root_node)
print(function_list)
# [<Node type=function_definition, start_point=(6, 0), end_point=(8, 1)>]
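
The breadth-first collection described above can be illustrated without tree-sitter installed. This is a sketch under stated assumptions, not codetext's actual implementation: SyntaxNode is a hypothetical stand-in that mimics only the type and children attributes of a tree-sitter node, and collect_nodes is a name chosen for this example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SyntaxNode:
    """Hypothetical stand-in for a tree-sitter node (only `type` and `children`)."""
    type: str
    children: List["SyntaxNode"] = field(default_factory=list)


def collect_nodes(root: SyntaxNode, wanted_types: Set[str]) -> List[SyntaxNode]:
    """Walk the tree breadth-first and return nodes whose type is wanted."""
    found = []
    queue = deque([root])
    while queue:
        node = queue.popleft()          # visit the oldest queued node first
        if node.type in wanted_types:
            found.append(node)
        queue.extend(node.children)     # children are visited after this level
    return found


# Toy tree shaped like the C++ snippet above: a comment followed by one function.
tree = SyntaxNode("translation_unit", [
    SyntaxNode("comment"),
    SyntaxNode("function_definition", [
        SyntaxNode("primitive_type"),
        SyntaxNode("function_declarator"),
        SyntaxNode("compound_statement"),
    ]),
])

functions = collect_nodes(tree, {"function_definition"})
print([node.type for node in functions])
# prints: ['function_definition']
```

A queue (rather than recursion) keeps the visit order level by level, so shallower definitions are reported before nested ones.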

Get function metadata (e.g. the function's name, parameters, and optional return type):

function = function_list[0]
metadata = CppParser.get_function_metadata(function, raw_code)
# {'identifier': 'sum2num', 'parameters': {'a': 'int', 'b': 'int'}, 'type': 'double'}
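
As an illustration of consuming this metadata, the sketch below turns the dictionary into a readable signature. It assumes the dictionary has exactly the shape shown in the comment above (identifier, parameters, type); format_signature is a hypothetical helper for this example, not a codetext function.

```python
def format_signature(metadata: dict) -> str:
    """Hypothetical helper: render a metadata dict as 'type name(type arg, ...)'."""
    parts = []
    for name, param_type in metadata.get("parameters", {}).items():
        # Untyped parameters (e.g. in Python sources) are shown by name only.
        parts.append(f"{param_type} {name}" if param_type else name)
    signature = f"{metadata['identifier']}({', '.join(parts)})"
    return_type = metadata.get("type")
    return f"{return_type} {signature}" if return_type else signature


# Same dictionary as in the example output above.
metadata = {'identifier': 'sum2num', 'parameters': {'a': 'int', 'b': 'int'}, 'type': 'double'}
print(format_signature(metadata))
# prints: double sum2num(int a, int b)
```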

Get the docstring (documentation) of a function:

docstring = CppParser.get_docstring(function, raw_code)
# ['Sum of 2 number \n@param a int number \n@param b int number']

We also provide two methods for extracting class objects:

class_list = CppParser.get_class_list(root_node)
# and
metadata = CppParser.get_metadata_list(root_node)

Limitations

codetext depends heavily on tree-sitter syntax:

  • Because we use the tree-sitter grammar to extract the desired nodes (functions, classes, function names (identifiers), class argument lists, etc.), codetext is vulnerable to tree-sitter updates and future syntax changes.

  • While we try our best to capture every possibility, many cases remain uncovered. We welcome community contributions to this project.

About

⚒️ Tree-sitter custom toolkit for extracting function and class from raw source file
