Lazarus NLP

Lazarus NLP is a collective initiative to revive the dying languages of Indonesia through speech and language technology.

Projects

IndoT5: T5 Language Models for the Indonesian Language

IndoT5 is a T5-based language model trained specifically for the Indonesian language. With just 8 hours of training on a limited budget, we developed a sequence-to-sequence, encoder-decoder model that can be fine-tuned for tasks such as summarization, chit-chat, and question answering. Despite these training constraints, our model is competitive when evaluated on the IndoNLG (text generation) benchmark.

Indonesian Sentence Embedding Models

We trained open-source sentence embedding models for Indonesian, enabling applications such as information retrieval (useful for retrieval-augmented generation!), semantic text similarity, and zero-shot text classification. We leverage existing pre-trained Indonesian language models like IndoBERT, state-of-the-art unsupervised techniques, and established sentence embedding benchmarks.
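To illustrate how sentence embeddings support retrieval and semantic similarity, here is a minimal sketch that ranks documents by cosine similarity to a query. The vectors are hypothetical toy values, not outputs of our models, and the function names `cosine_similarity` and `rank_documents` are assumptions made for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical 3-dimensional embeddings; real sentence embeddings
# typically have hundreds of dimensions.
query = [1.0, 0.0, 1.0]
docs = [
    [0.9, 0.1, 0.8],  # points in nearly the same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.5, 0.5, 0.5],  # partially aligned with the query
]
ranking = rank_documents(query, docs)
```

In a retrieval-augmented generation setup, the top-ranked documents would be the ones passed to the generator as context.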

Indonesian Natural Language Inference Models

Open-source, lightweight NLI models that are competitive with larger models on the IndoNLI benchmark while using significantly fewer parameters. We applied knowledge distillation methods to small existing pre-trained language models like IndoBERT Lite. These models offer efficient solutions for tasks that require natural language inference capabilities, such as cross-encoder-based semantic search, while minimizing computational resources.
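As a rough illustration of knowledge distillation in general, the sketch below computes a temperature-softened KL divergence between teacher and student predictions. This is the generic soft-label formulation, assumed here for illustration; it is not necessarily the exact objective used for our models, and the logits are hypothetical values for a three-class NLI task.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return kl * temperature ** 2

# Hypothetical logits over (entailment, neutral, contradiction).
teacher = [4.0, 1.0, -2.0]
student = [2.5, 1.5, -1.0]
loss = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In practice it is often combined with a standard cross-entropy term on the gold labels.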

Many-to-Many Multilingual Translation Models

Adapting mT5 to 45 languages of Indonesia, we developed a robust baseline model for multilingual translation. This facilitates further fine-tuning for niche domains and low-resource languages, contributing to greater linguistic inclusivity. Our models are competitive with existing multilingual translation models on the NusaX benchmark.

Pinned

  1. indonesian-sentence-embeddings (Public)

    Embedding Representation for Indonesian Sentences!

    Jupyter Notebook 20 3

  2. machine-translation (Public)

    Many-to-Many Multilingual Translation Model for Languages of Indonesia

    Python 2

  3. IndoT5 (Public)

    T5 Language Models for the Indonesian Language!

    Python 12

  4. NusaBERT (Public)

    NusaBERT: Teaching IndoBERT to be multilingual and multicultural!

    Python 1 1
