
Google Launched LangExtract, a Python Library for Structured Data Extraction from Unstructured Text

Aug 08, 2025

Google has introduced LangExtract, an open-source Python library designed to help developers extract structured information from unstructured text using large language models such as the Gemini models. The library simplifies the process of converting free-form text, including documents like clinical notes, legal texts, and customer feedback, into structured data. Developers can define extraction tasks through natural language instructions and example data, making it easier to process and organize information from various types of unstructured content.
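As a rough illustration of what defining a task through instructions and example data can look like, the sketch below assembles a few-shot prompt from a plain-language instruction and one worked example. It uses only the Python standard library. ExtractionTask, ExampleExtraction, and build_prompt are hypothetical names chosen for this illustration, not LangExtract's API, and the clinical sentences are invented sample text.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ExampleExtraction:
    """One labeled item in a worked example (hypothetical type)."""
    label: str  # category of the item, e.g. "medication"
    text: str   # exact words copied from the example text


@dataclass
class ExtractionTask:
    """Plain-language instructions plus worked examples (hypothetical type)."""
    instructions: str
    examples: list[tuple[str, list[ExampleExtraction]]] = field(default_factory=list)

    def build_prompt(self, document: str) -> str:
        """Render instructions, examples, and the new document as one prompt."""
        parts = [self.instructions.strip(), ""]
        for example_text, extractions in self.examples:
            parts.append(f"Text: {example_text}")
            for item in extractions:
                parts.append(f"- {item.label}: {item.text}")
            parts.append("")
        parts.append(f"Text: {document}")
        return "\n".join(parts)


task = ExtractionTask(
    instructions="Extract medications and dosages, using the exact wording of the text.",
    examples=[
        (
            "Patient was given 250 mg of amoxicillin.",
            [
                ExampleExtraction("medication", "amoxicillin"),
                ExampleExtraction("dosage", "250 mg"),
            ],
        )
    ],
)

prompt = task.build_prompt("Start ibuprofen 400 mg twice daily.")
print(prompt)
```

The point of the example data is to show the model the expected labels and level of detail, so the task can be changed by editing text rather than retraining anything.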

One of LangExtract’s standout features is its use of controlled generation techniques, which keep the extracted information consistently formatted. The library also maps each extracted entity to its exact location in the original document and highlights the relevant span of text. This traceability makes results more transparent and easier to verify against the source.
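The idea of linking an extracted entity to its exact location can be sketched with character offsets. The code below is a minimal illustration under simple assumptions (first exact string match only); GroundedExtraction and ground are hypothetical names, and this is not how LangExtract itself performs alignment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundedExtraction:
    """An extracted item plus where it sits in the source (hypothetical type)."""
    label: str
    text: str
    start: Optional[int]  # offset of the first character, or None if not found
    end: Optional[int]    # offset one past the last character, or None


def ground(source: str, label: str, text: str) -> GroundedExtraction:
    """Attach character offsets by locating the first exact match in the source."""
    start = source.find(text)
    if start == -1:
        # Not present verbatim: keep the item but mark it as ungrounded,
        # so a reviewer can see it has no supporting span.
        return GroundedExtraction(label, text, None, None)
    return GroundedExtraction(label, text, start, start + len(text))


note = "Patient reports mild headache. Prescribed ibuprofen 400 mg."
hit = ground(note, "medication", "ibuprofen")
miss = ground(note, "medication", "aspirin")

# A grounded item can always be checked against the original text.
print(note[hit.start:hit.end])
print(miss.start)
```

Keeping offsets rather than only the extracted string is what allows a viewer to highlight the span, and it makes unsupported items easy to flag.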

To handle long and complex documents, LangExtract combines text chunking, parallel processing, and multiple extraction passes. These techniques improve recall and accuracy when extracting information from large bodies of text. This makes LangExtract suitable for applications in various domains, from healthcare to legal documents, without the need for extensive fine-tuning of the underlying models.
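How chunking, parallel processing, and multiple passes fit together can be sketched as follows. This is an illustration only, not LangExtract's implementation: chunk_sentences, extract_document, and find_dosages are hypothetical names, and a regular expression stands in for the model call so the example runs offline. Because that stand-in is deterministic, extra passes add nothing here; with a real model, the union over passes is what may improve recall.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Set, Tuple

# (label, start, end), with offsets relative to the full document.
Span = Tuple[str, int, int]


def chunk_sentences(text: str, max_chars: int) -> List[Tuple[int, str]]:
    """Group whole sentences into chunks of about max_chars, with start offsets.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    chunks: List[Tuple[int, str]] = []
    start = end = -1  # -1 means no chunk is currently open
    for match in re.finditer(r"[^.!?]+[.!?]*\s*", text):
        if start >= 0 and match.end() - start > max_chars:
            chunks.append((start, text[start:end]))
            start = -1
        if start < 0:
            start = match.start()
        end = match.end()
    if start >= 0:
        chunks.append((start, text[start:end]))
    return chunks


def extract_document(
    text: str,
    extract_chunk: Callable[[str], List[Span]],
    max_chars: int = 80,
    passes: int = 2,
    workers: int = 4,
) -> List[Span]:
    """Run the extractor over every chunk in parallel and merge all passes."""
    chunks = chunk_sentences(text, max_chars)
    chunk_texts = [chunk for _, chunk in chunks]
    found: Set[Span] = set()  # a set removes duplicates across passes
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _pass in range(passes):
            results = pool.map(extract_chunk, chunk_texts)  # keeps chunk order
            for (offset, _), spans in zip(chunks, results):
                for label, start, end in spans:
                    # Shift chunk-relative offsets back to document offsets.
                    found.add((label, offset + start, offset + end))
    return sorted(found, key=lambda span: (span[1], span[2], span[0]))


def find_dosages(chunk: str) -> List[Span]:
    """Stand-in for a model call: tag every '<number> mg' in the chunk."""
    return [("dosage", m.start(), m.end()) for m in re.finditer(r"\d+ mg", chunk)]


doc = "Give amoxicillin 250 mg at night. " * 3 + "Then ibuprofen 400 mg daily."
spans = extract_document(doc, find_dosages)
print([doc[start:end] for _, start, end in spans])
```

Splitting on sentence boundaries is a deliberate simplification so that no entity is cut in half at a chunk edge; a production system would need a more careful splitting strategy.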

LangExtract can be integrated with various LLMs, including cloud-based models like Gemini and local models via platforms such as Ollama. This flexibility makes it a versatile tool for developers working across different models. It enables users to define extraction tasks for a wide range of applications without requiring deep expertise in machine learning.
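One common way to support both cloud and local models is to hide each behind a small shared interface. The sketch below shows that general pattern, assuming a registry keyed by model ID prefix; LanguageModel, BACKENDS, load_model, and the two fake backends are hypothetical and make no network calls. It is not a description of how LangExtract selects its backends.

```python
from typing import Callable, Dict, Protocol


class LanguageModel(Protocol):
    """Anything that turns a prompt into text can serve as a backend."""

    def infer(self, prompt: str) -> str: ...


class FakeCloudModel:
    """Stand-in for a hosted model such as Gemini."""

    def infer(self, prompt: str) -> str:
        return f"[cloud] {prompt}"


class FakeLocalModel:
    """Stand-in for a model served locally, for example through Ollama."""

    def infer(self, prompt: str) -> str:
        return f"[local] {prompt}"


# Hypothetical registry: model ID prefix -> factory that builds a backend.
BACKENDS: Dict[str, Callable[[], LanguageModel]] = {
    "gemini": FakeCloudModel,
    "ollama": FakeLocalModel,
}


def load_model(model_id: str) -> LanguageModel:
    """Pick a backend from the model ID prefix, or fail loudly."""
    for prefix, factory in BACKENDS.items():
        if model_id.startswith(prefix):
            return factory()
    raise ValueError(f"no backend registered for {model_id!r}")


def run_extraction(model: LanguageModel, text: str) -> str:
    """The extraction code depends only on the interface, not the backend."""
    return model.infer(f"Extract entities from: {text}")


print(run_extraction(load_model("gemini-example"), "sample text"))
print(run_extraction(load_model("ollama-example"), "sample text"))
```

With this shape, moving a task from a cloud model to a local one changes only the model ID passed to load_model.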

The release of LangExtract has sparked enthusiastic responses within the developer community. Akshay Goel, a key contributor, expressed his excitement about the release and his eagerness to see what users build with it, posting:

Excited to release LangExtract alongside the team today and looking forward to seeing what the developer community builds with it!

Developer Kyle Brown described it as a major step forward in AI transparency, converting unstructured text into structured, understandable data. Adding to the momentum, a community member announced a TypeScript port of LangExtract that supports OpenAI models as well as Google’s Gemini, a sign of the community’s active involvement:

For anyone who is interested -- I ported this to typescript and added an ability to use OpenAI not just Gemini.

The library is available under the Apache 2.0 license and can be easily installed via pip. It offers an accessible and powerful tool for developers looking to add information extraction capabilities to their applications.

About the Author

Daniel Dominguez
