Google has introduced an open-source Python library that can be used to programmatically extract information while ensuring the outputs are structured and reliably tied back to their source.
LangExtract is a Python library that provides a lightweight interface to LLMs such as Google's Gemini models to extract structured information from unstructured text documents based on user-defined instructions. It processes materials such as clinical notes or reports, identifying and organizing key details while ensuring the extracted data corresponds to the source text.
The aim of the library is to make it easier for programmers to convert free-form text into structured data that can then be used for analysis. Suggested uses include documents such as clinical notes, legal texts, and customer feedback. You can set up extraction tasks using natural language instructions and example data, and the library uses large language models to assist in processing and organizing information.
The library supports precise source grounding. In other words, it maps each item of data extracted to its exact location in the source text, enabling visual highlighting for easy traceability and verification.
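The idea of source grounding can be illustrated with a minimal sketch. This is not LangExtract's own implementation: the names `GroundedSpan` and `ground` are hypothetical, and the sketch assumes a simple verbatim string match, where a real library may use a more tolerant alignment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundedSpan:
    """An extracted value plus its character offsets in the source text."""
    label: str
    text: str
    start: Optional[int]  # None when the text cannot be located verbatim
    end: Optional[int]


def ground(source: str, label: str, extracted: str, search_from: int = 0) -> GroundedSpan:
    """Locate `extracted` verbatim in `source` and record where it was found."""
    start = source.find(extracted, search_from)
    if start == -1:
        # Hypothetical policy: keep the item but flag it as ungrounded.
        return GroundedSpan(label, extracted, None, None)
    return GroundedSpan(label, extracted, start, start + len(extracted))


note = "Patient was given 250 mg amoxicillin twice daily."
span = ground(note, "dosage", "250 mg")
print(span.start, span.end, note[span.start:span.end])
```

Because every span carries offsets into the original string, a reviewer can check an extracted value against the exact characters it came from, and anything that cannot be located is flagged rather than silently accepted.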
Structured output comes from enforcing a consistent output schema derived from your initial examples. This uses controlled generation in supported models such as Gemini to guarantee robust, structured results. Users can choose from a range of LLMs, from cloud-based models such as the Google Gemini family to local open-source models via the built-in Ollama interface.
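As a rough illustration of deriving a schema from examples, the sketch below collects the classes and attribute keys seen in a few example records and checks new items against them. It is an assumption-laden stand-in: the record layout and the functions `schema_from_examples` and `conforms` are hypothetical, and this is after-the-fact validation, not the controlled generation that constrains a model while it decodes.

```python
def schema_from_examples(examples):
    """Collect the allowed classes and attribute keys seen in the examples."""
    schema = {}
    for item in examples:
        keys = schema.setdefault(item["class"], set())
        keys.update(item.get("attributes", {}))
    return schema


def conforms(item, schema):
    """True when the item's class and attribute keys all appear in the schema."""
    allowed = schema.get(item.get("class"))
    if allowed is None:
        return False
    return set(item.get("attributes", {})) <= allowed


# Hypothetical few-shot examples for a medication extraction task.
examples = [
    {"class": "medication", "text": "amoxicillin", "attributes": {"dosage": "250 mg"}},
    {"class": "medication", "text": "ibuprofen", "attributes": {"frequency": "daily"}},
]
schema = schema_from_examples(examples)

good = {"class": "medication", "text": "aspirin", "attributes": {"dosage": "81 mg"}}
bad = {"class": "diagnosis", "text": "flu", "attributes": {}}
print(conforms(good, schema), conforms(bad, schema))
```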
The developers say the library can draw on the LLM's world knowledge: precise prompt wording and few-shot examples influence how far an extraction task relies on it. They note that the accuracy of any inferred information, and its adherence to the task specification, depends on the selected LLM, the complexity of the task, the clarity of the prompt instructions, and the nature of the prompt examples.
LangExtract is optimized for long documents. The developers say it overcomes the "needle-in-a-haystack" challenge of large document extraction by combining text chunking, parallel processing, and multiple extraction passes for higher recall.
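The general pattern of chunking, parallel processing and repeated passes can be sketched as follows. This is not LangExtract's code: `chunk_sentences`, `fake_extract` and `extract_long` are hypothetical names, and a regular expression stands in for the LLM call so that the example runs without a model or an API key.

```python
import re
from concurrent.futures import ThreadPoolExecutor

DOSAGE = re.compile(r"\d+ mg")


def chunk_sentences(text, max_chars):
    """Group whole sentences into (offset, chunk) pairs of roughly max_chars."""
    chunks, start, end = [], None, None
    for m in re.finditer(r"[^.!?]+[.!?]?\s*", text):
        if start is not None and m.end() - start > max_chars:
            chunks.append((start, text[start:end]))
            start = None
        if start is None:
            start = m.start()
        end = m.end()
    if start is not None:
        chunks.append((start, text[start:end]))
    return chunks


def fake_extract(chunk):
    """Stand-in for an LLM call: finds dosages such as '250 mg' in one chunk."""
    return [(m.start(), m.end(), m.group()) for m in DOSAGE.finditer(chunk)]


def extract_long(text, max_chars=40, passes=2, workers=4):
    """Run the extractor over every chunk in parallel, several times, and merge."""
    chunks = chunk_sentences(text, max_chars)
    found = set()  # the union across passes removes duplicate hits
    for _ in range(passes):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = pool.map(fake_extract, [chunk for _, chunk in chunks])
            for (offset, _), hits in zip(chunks, results):
                for start, end, value in hits:
                    # Shift chunk-relative offsets back to document offsets.
                    found.add((offset + start, offset + end, value))
    return sorted(found)


text = ("Give 250 mg amoxicillin. Then 400 mg ibuprofen at night. "
        "Finally 81 mg aspirin daily.")
hits = extract_long(text)
print(hits)
```

With the deterministic stand-in, every pass returns the same hits. With a real model, each pass may surface different items, which is why the results are merged as a set.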
From the extraction results, LangExtract can generate a self-contained, interactive HTML file for visualizing and reviewing thousands of extracted entities in their original context.
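To show how grounded offsets make this kind of review page possible, here is a minimal sketch that wraps each extracted span in a `<mark>` element. It is a simplified, static stand-in and not the library's visualizer: `render_highlights` is a hypothetical helper, and spans are assumed to be `(start, end, label)` tuples.

```python
import html


def render_highlights(text, spans):
    """Return an HTML page with each (start, end, label) span wrapped in <mark>."""
    parts, cursor = [], 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Overlapping spans are skipped in this simple sketch.
            continue
        parts.append(html.escape(text[cursor:start]))
        parts.append(
            f'<mark title="{html.escape(label)}">'
            f"{html.escape(text[start:end])}</mark>"
        )
        cursor = end
    parts.append(html.escape(text[cursor:]))
    return "<!DOCTYPE html><html><body><p>" + "".join(parts) + "</p></body></html>"


page = render_highlights("Take 250 mg <daily>", [(5, 11, "dosage")])
print(page)
```

Escaping the source text before inserting it keeps characters such as `<` from being interpreted as markup, so the page shows the document exactly as it was written.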
Writing about the new release on the Google Developers Blog, Akshay Goel and Atilla Kiraly said the library has flexibility for specialized domains like medicine, finance, engineering or law. They said the ideas behind LangExtract were first applied to medical information extraction and can be effective at processing clinical text. For example, it can identify medications, dosages, and other medication attributes, and then map the relationships between them:
"This capability was a core part of the research that led to this library, which you can read about in our paper on accelerating medical information extraction."
The LangExtract library is available on GitHub now.