Google Releases Open Source Data Extraction Python Library

Written by Kay Ewbank

Monday, 18 August 2025

Google has introduced an open-source Python library that can be used to programmatically extract information while ensuring the outputs are structured and reliably tied back to their source.

LangExtract is a Python library that provides a lightweight interface to LLMs such as Google's Gemini models to extract structured information from unstructured text documents based on user-defined instructions. It processes materials such as clinical notes or reports, identifying and organizing key details while ensuring the extracted data corresponds to the source text.

[Image: langextract]

The aim of the library is to make it easier for programmers to convert free-form text into structured data that can then be used for analysis. Suggested uses include documents such as clinical notes, legal texts, and customer feedback. You can set up extraction tasks using natural language instructions and example data, and the library uses large language models to assist in processing and organizing information.
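As a rough illustration of what "instructions plus example data" can look like in practice, the sketch below assembles a few-shot prompt from a task description and one worked example. It uses only the standard library; the names `Item`, `FewShotExample` and `build_prompt` are hypothetical and are not part of the LangExtract API.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Item:
    """One expected extraction in a worked example (hypothetical type)."""
    extraction_class: str
    extraction_text: str
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class FewShotExample:
    """A sample text paired with the extractions we want from it."""
    text: str
    items: List[Item]


def build_prompt(instructions: str, examples: List[FewShotExample], document: str) -> str:
    """Combine instructions, worked examples and the new document into one prompt."""
    lines = [instructions.strip(), ""]
    for example in examples:
        lines.append("Text: " + example.text)
        lines.append("Extractions: " + json.dumps([asdict(i) for i in example.items]))
        lines.append("")
    # The final block is left open for the model to complete.
    lines.append("Text: " + document)
    lines.append("Extractions:")
    return "\n".join(lines)


examples = [
    FewShotExample(
        text="Patient was given 250 mg of Amoxicillin.",
        items=[Item("medication", "Amoxicillin", {"dosage": "250 mg"})],
    )
]
prompt = build_prompt(
    "Extract medications and their dosages, using exact text from the input.",
    examples,
    "Start 10 mg of prednisone daily.",
)
```

The resulting string would then be sent to whichever model you have chosen; that step is omitted here because it depends on the provider.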

The library supports precise source grounding. In other words, it maps each item of data extracted to its exact location in the source text, enabling visual highlighting for easy traceability and verification.
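To make the idea of source grounding concrete, here is a minimal sketch that maps an extracted string back to character offsets in the source. It is a simple exact-then-case-insensitive search written for illustration, not LangExtract's own alignment logic, and `GroundedExtraction` and `ground` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundedExtraction:
    extraction_class: str        # e.g. "medication"
    extraction_text: str         # the text returned by the model
    start: Optional[int] = None  # inclusive character offset in the source
    end: Optional[int] = None    # exclusive character offset in the source


def ground(source: str, extraction_class: str, extraction_text: str) -> GroundedExtraction:
    """Locate extraction_text in source; offsets stay None when it cannot be found."""
    result = GroundedExtraction(extraction_class, extraction_text)
    if not extraction_text:
        return result
    # Try an exact match first, then fall back to ignoring case.
    index = source.find(extraction_text)
    if index != -1:
        result.start, result.end = index, index + len(extraction_text)
        return result
    match = re.search(re.escape(extraction_text), source, re.IGNORECASE)
    if match:
        result.start, result.end = match.start(), match.end()
    return result


source = "Patient was given 250 mg of Amoxicillin twice daily."
grounded = ground(source, "medication", "amoxicillin")
missing = ground(source, "medication", "ibuprofen")
```

An extraction whose offsets remain `None` was not found in the source, which is a useful signal that the model may have paraphrased or invented the text.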

Structured output comes from enforcing an output schema derived from your initial examples. This uses controlled generation in supported models such as Gemini to guarantee robust, structured results. Users can choose from a range of LLMs, from cloud-based models such as the Google Gemini family to local open-source models via the built-in Ollama interface.
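The sketch below shows one way a schema could be derived from examples and then used to check model output. Note that this is after-the-fact validation, which is a different technique from controlled generation (where the model is constrained while it decodes). The function names and the dictionary layout are assumptions made for illustration.

```python
from typing import Dict, List, Set


def schema_from_examples(examples: List[List[dict]]) -> Dict[str, Set[str]]:
    """Map each extraction class seen in the examples to its allowed attribute keys."""
    schema: Dict[str, Set[str]] = {}
    for extractions in examples:
        for item in extractions:
            keys = schema.setdefault(item["extraction_class"], set())
            keys.update(item.get("attributes") or {})
    return schema


def conforms(item: dict, schema: Dict[str, Set[str]]) -> bool:
    """Check that an item uses a known class and only known attribute keys."""
    cls = item.get("extraction_class")
    if not isinstance(cls, str) or cls not in schema:
        return False
    if not isinstance(item.get("extraction_text"), str):
        return False
    return set(item.get("attributes") or {}) <= schema[cls]


examples = [[
    {"extraction_class": "medication", "extraction_text": "Amoxicillin",
     "attributes": {"dosage": "250 mg"}},
]]
schema = schema_from_examples(examples)
ok = conforms({"extraction_class": "medication", "extraction_text": "prednisone",
               "attributes": {"dosage": "10 mg"}}, schema)
bad = conforms({"extraction_class": "diagnosis", "extraction_text": "flu"}, schema)
```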

The developers say the library can draw on LLM world knowledge: precise prompt wording and few-shot examples influence how far an extraction task relies on what the model already knows. They add that the accuracy of any inferred information, and its adherence to the task specification, depends on the selected LLM, the complexity of the task, the clarity of the prompt instructions, and the nature of the prompt examples.

LangExtract is optimized for long documents, and the developers say it overcomes the "needle-in-a-haystack" challenge of large document extraction by using an optimized strategy of text chunking, parallel processing, and multiple passes for higher recall.
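The general pattern of chunking, parallel processing and multiple passes can be sketched with the standard library alone. This is an illustration of the strategy described, not LangExtract's implementation: `chunk_text`, `extract_long` and the regex-based `find_doses` (which stands in for an LLM call) are all hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, matched text)


def chunk_text(text: str, size: int, overlap: int) -> List[Tuple[int, str]]:
    """Split text into (offset, chunk) pieces of at most `size` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [(i, text[i:i + size]) for i in range(0, len(text), step)]


def extract_long(text: str, extractor: Callable[[str], List[Span]],
                 size: int = 200, overlap: int = 20,
                 passes: int = 2, workers: int = 4) -> List[Span]:
    """Run `extractor` over every chunk in parallel, `passes` times, and merge results."""
    chunks = chunk_text(text, size, overlap)
    found: Set[Span] = set()  # the set removes duplicates across overlaps and passes
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _pass in range(passes):
            results = pool.map(extractor, [chunk for _offset, chunk in chunks])
            for (offset, _chunk), spans in zip(chunks, results):
                for start, end, value in spans:
                    # Shift chunk-relative offsets into whole-document offsets.
                    found.add((offset + start, offset + end, value))
    return sorted(found)


def find_doses(chunk: str) -> List[Span]:
    """Stand-in for an LLM call: find dosage strings such as '250 mg'."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\d+ mg", chunk)]


text = ("Note. " * 50 + "Give 250 mg of amoxicillin. "
        + "Filler text. " * 40 + "Then 10 mg of prednisone.")
spans = extract_long(text, find_doses, size=100, overlap=20)
```

Because the stand-in extractor is deterministic, the second pass adds nothing here; repeated passes only help when the extractor, like an LLM, can return different results on each run. Fixed-size character chunks can also cut a value in half and produce partial matches at chunk edges, so a real pipeline would more likely split on sentence or token boundaries.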

From the extraction results, LangExtract can generate a self-contained, interactive HTML file for visualizing and reviewing thousands of extracted entities in their original context.
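For a sense of how grounded extractions turn into a reviewable page, the following sketch wraps each character span in a `<mark>` element and returns a standalone HTML document. It is far simpler than an interactive viewer and is not the library's own output format; `render_highlights` is a hypothetical helper.

```python
import html
from typing import Iterable, Tuple


def render_highlights(source: str, spans: Iterable[Tuple[int, int, str]]) -> str:
    """Return a standalone HTML page with each (start, end, label) span highlighted."""
    parts = []
    cursor = 0
    for start, end, label in sorted(spans):
        # Skip overlapping or malformed spans to keep this sketch simple.
        if start < cursor or end <= start or end > len(source):
            continue
        parts.append(html.escape(source[cursor:start]))
        parts.append('<mark title="{}">{}</mark>'.format(
            html.escape(label, quote=True), html.escape(source[start:end])))
        cursor = end
    parts.append(html.escape(source[cursor:]))
    return ('<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
            "<title>Extractions</title></head><body><p>"
            + "".join(parts) + "</p></body></html>")


page = render_highlights("Take 250 mg <daily>.", [(5, 11, "dosage")])
```

Writing `page` to a file and opening it in a browser shows the dosage highlighted in place, with the label available as a tooltip.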

Writing about the new release on the Google Developers Blog, Akshay Goel and Atilla Kiraly said the library has flexibility for specialized domains like medicine, finance, engineering or law. They said the ideas behind LangExtract were first applied to medical information extraction and can be effective at processing clinical text. For example, it can identify medications, dosages, and other medication attributes, and then map the relationships between them:

"This capability was a core part of the research that led to this library, which you can read about in our paper on accelerating medical information extraction."

The LangExtract library is available on GitHub now.


More Information

LangExtract On GitHub
