This repository contains java implementation of Bert Tokenizer. The implementation is referred from https://github.com/ankiteciitkgp/bertTokenizer
Support output onnx tensor for onnx model inference
To get tokens from text:
String text = "Text to tokenize";
BertTokenizer bertTokenizer = new BertTokenizer("D:\\model\\vocab.txt");
List<String> tokens = tokenizer.tokenize(text);
To get token ids using Bert Vocab:
List<Integer> token_ids = tokenizer.convert_tokens_to_ids(tokens);
To get :
List<Integer> token_ids = tokenizer.convert_tokens_to_ids(tokens);
To get onnx tensor
var inputMap = bertTokenizer.tokenizeOnnxTensor(Arrays.asList("hello world 你好", "肿瘤治疗未来发展趋势"));
Full example:
public class OnnxTests { public static void main(String[] args) throws IOException, OrtException { BertTokenizer bertTokenizer = new BertTokenizer("D:\\model\\vocab.txt"); var env = OrtEnvironment.getEnvironment(); var session = env.createSession("D:\\model\\output\\onnx\\fp16_model.onnx", new OrtSession.SessionOptions()); var inputMap = bertTokenizer.tokenizeOnnxTensor(Arrays.asList("hello world 你好", "肿瘤治疗未来发展趋势")); try (var results = session.run(inputMap)) { System.out.println(results); var embeddings = (float[][])results.get(0).getValue(); for (var embedding : embeddings) { System.out.println(JSON.toJSONString(embedding)); } } } }