Google's Lang Extract Explained in 5 Minutes

Lang Extract is a new open-source project from Google that converts unstructured text into structured data. You define a custom schema, so the output contains exactly the information you're looking for, and the library provides a clear visualization of what it extracted. The project is available on GitHub. This article demonstrates how to use it for both entity extraction and relationship mapping, which together enable custom knowledge graphs for applications like Retrieval-Augmented Generation (RAG).

First, we'll cover the project's core functionality, then look at its capabilities through practical examples. Lang Extract is a Python package that uses a large language model (LLM) to extract structured information from unstructured text. You give it a few examples (few-shot prompting) along with a schema, and it extracts the matching information from new text. It supports proprietary models like Gemini as well as various open-source models. This article focuses on examples using Gemini. A long-context LLM is recommended for best results.

Getting Started with Lang Extract

Installation is straightforward via pip, and the usage is simple.

pip install langextract

Once you have the text you want to extract information from, a few best practices are worth following.

Note: For RAG applications, it's crucial to process the entire document at once for entity extraction, rather than on a per-chunk basis. This document-level metadata can then be linked to individual chunks.
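The idea in the note above can be sketched in plain Python. This is a minimal illustration, not part of Lang Extract: the Entity and Chunk classes, the fixed-size chunker, and the sample sentence are all assumptions made for the example. Entities are located once on the full document, then attached to every chunk whose character range they overlap.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    label: str  # e.g. "company"
    text: str   # surface form as it appears in the document
    start: int  # character offset in the full document (inclusive)
    end: int    # character offset in the full document (exclusive)


@dataclass
class Chunk:
    start: int
    end: int
    text: str
    entities: list = field(default_factory=list)


def chunk_document(document: str, size: int) -> list:
    """Split the document into fixed-size character windows."""
    return [
        Chunk(i, min(i + size, len(document)), document[i:i + size])
        for i in range(0, len(document), size)
    ]


def attach_entities(chunks: list, entities: list) -> list:
    """Attach each document-level entity to every chunk it overlaps."""
    for chunk in chunks:
        chunk.entities = [
            e for e in entities if e.start < chunk.end and e.end > chunk.start
        ]
    return chunks


# Hypothetical document and entities, standing in for real extraction output.
document = "Apple unveiled the iPhone in 2007. Tim Cook became CEO in 2011."


def locate(label: str, text: str) -> Entity:
    start = document.index(text)
    return Entity(label, text, start, start + len(text))


entities = [
    locate("company", "Apple"),
    locate("product", "iPhone"),
    locate("person", "Tim Cook"),
]
chunks = attach_entities(chunk_document(document, size=35), entities)

for chunk in chunks:
    print(repr(chunk.text), [e.text for e in chunk.entities])
```

Each chunk now carries the entities found in it, so a retriever can filter or rerank chunks by entity without re-running extraction per chunk.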

The process begins by setting up high-quality examples to guide the extraction. You provide sample text and specify the entities and attributes you wish to extract from it. Attributes are particularly useful for relationship extraction between entities, which we'll explore with a concrete example later in this article. After setting up the examples, you pass your input text to Lang Extract together with a prompt that describes what to extract. The extracted data can then be visualized in a well-formatted HTML document.

Practical Examples and Capabilities

Let's walk through several examples to showcase Lang Extract's capabilities. First, ensure all necessary packages are installed and imported.

Basic Entity Extraction

In a basic example, we might have a text about Apple from which we want to extract specific information. The prompt would be something like: "Extract companies, people, products, dates, locations and prices." The few-shot example could be about Microsoft, providing a high-quality template for the model to follow. In the example, you define entity classes such as company, person, and title, and then provide concrete instances of each from the sample text. You can provide multiple few-shot examples to improve accuracy.
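A few-shot example along these lines might be set up as follows. This is a sketch under assumptions: the Microsoft sentence and the span list are invented for illustration, and the lx.data.ExampleData and lx.data.Extraction names follow the project's README at the time of writing, so check them against the version you install. The import is done lazily inside the function so the plain data can be inspected without the package.

```python
import textwrap

# Prompt describing what to extract (wording adapted from the article).
PROMPT = textwrap.dedent("""\
    Extract companies, people, products, dates, locations and prices.
    Use the exact text from the input for each extraction; do not paraphrase.""")

# Hypothetical sample text for the few-shot example.
EXAMPLE_TEXT = "Microsoft CEO Satya Nadella announced the Surface Pro in Redmond."

# (extraction class, exact span), listed in order of appearance in EXAMPLE_TEXT.
EXAMPLE_SPANS = [
    ("company", "Microsoft"),
    ("title", "CEO"),
    ("person", "Satya Nadella"),
    ("product", "Surface Pro"),
    ("location", "Redmond"),
]


def build_examples():
    """Convert the plain tuples into Lang Extract example objects."""
    import langextract as lx  # lazy import; requires `pip install langextract`

    return [
        lx.data.ExampleData(
            text=EXAMPLE_TEXT,
            extractions=[
                lx.data.Extraction(extraction_class=cls, extraction_text=span)
                for cls, span in EXAMPLE_SPANS
            ],
        )
    ]
```

Keeping every span verbatim from the sample text, and in the order it appears, gives the model an unambiguous template to imitate.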

To perform the extraction, you call the extract function, passing the input text, the prompt, the examples, and the chosen model. This runs the extraction and generates a JSONL file. You can then use a visualization function to load this file and create an HTML report.
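The call sequence described above might look like the sketch below. The function and parameter names (lx.extract, lx.io.save_annotated_documents, lx.visualize, and the model ID) are taken from the project's README at the time of writing and should be treated as assumptions to verify. Running it also requires a Gemini API key configured in your environment. The run_extraction wrapper and the file names are hypothetical.

```python
def run_extraction(
    input_text,
    prompt,
    examples,
    model_id="gemini-2.5-flash",
    jsonl_name="extraction_results.jsonl",
    html_name="visualization.html",
):
    """Run one extraction, save it as JSONL, and write an HTML report."""
    import langextract as lx  # lazy import; requires `pip install langextract`

    # Run the extraction over the whole input text.
    result = lx.extract(
        text_or_documents=input_text,
        prompt_description=prompt,
        examples=examples,
        model_id=model_id,
    )

    # Persist the annotated document as JSONL in the current directory.
    lx.io.save_annotated_documents([result], output_name=jsonl_name, output_dir=".")

    # Render the JSONL file as HTML; the return value may be a plain string
    # or a notebook display object exposing the markup via `.data`.
    html = lx.visualize(jsonl_name)
    with open(html_name, "w", encoding="utf-8") as f:
        f.write(html.data if hasattr(html, "data") else html)

    return result
```

You would call it with your own text, for example run_extraction(apple_text, prompt, examples), where apple_text is a placeholder for your input.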

To run this, execute the Python script, for instance:

python example_basic.py

The output shows the extracted information, often including character boundaries that locate each extraction in the source text. Remember to run structured extraction at the document level, not the chunk level. The generated HTML file provides a clear view of all extracted entities. Even without defined attributes, you can see the identified entities, and a single piece of text can contain multiple entities of various types.
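Character boundaries make it easy to check that each extraction is grounded in the source text. The sketch below parses one JSONL record and verifies that slicing the text at the reported offsets yields the extracted string. The field names (text, extractions, char_interval, start_pos, end_pos) are an assumption about the output layout, and the sample record is invented, so adjust both to match the file you actually get.

```python
import json


def spans_from_jsonl_line(line: str) -> list:
    """Return (class, text, start, end, grounded) for each extraction in a record."""
    doc = json.loads(line)
    text = doc["text"]
    rows = []
    for ext in doc.get("extractions", []):
        interval = ext.get("char_interval") or {}
        start, end = interval.get("start_pos"), interval.get("end_pos")
        # An extraction is grounded if the offsets point at exactly its text.
        grounded = (
            start is not None
            and end is not None
            and text[start:end] == ext["extraction_text"]
        )
        rows.append(
            (ext["extraction_class"], ext["extraction_text"], start, end, grounded)
        )
    return rows


# Hypothetical record in the assumed layout.
sample_text = "Apple released the iPhone 15 for $799."


def make_extraction(cls: str, span: str) -> dict:
    start = sample_text.index(span)
    return {
        "extraction_class": cls,
        "extraction_text": span,
        "char_interval": {"start_pos": start, "end_pos": start + len(span)},
    }


sample_line = json.dumps({
    "text": sample_text,
    "extractions": [
        make_extraction("company", "Apple"),
        make_extraction("product", "iPhone 15"),
        make_extraction("price", "$799"),
    ],
})

rows = spans_from_jsonl_line(sample_line)
for row in rows:
    print(row)
```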

Extracting Entities and Their Attributes

Let's look at a more advanced example that involves extracting both entities and their attributes. Consider a quarterly earnings report. The prompt would aim to extract specific financial details. In this scenario, the few-shot example defines a two-level structure: an extraction class (the entity) and its corresponding attributes. For instance, a company entity might have ticker and exchange as attributes, while financial_metrics could have period, value, and change.
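The two-level structure can be written down as plain data before handing it to the library. In the sketch below, the earnings sentence, the company, and the check_example helper are all hypothetical. Only the class and attribute names (company, ticker, exchange, financial_metrics, period, value, change) come from the description above. The helper checks that every example span appears verbatim in the sample text and in order of appearance.

```python
# Hypothetical earnings sentence used as the few-shot sample text.
SAMPLE_TEXT = (
    "Acme Corp (NASDAQ: ACME) reported Q2 2024 revenue of $1.2 billion, "
    "up 8% year over year."
)

# Level 1: extraction class and exact span. Level 2: attributes of that entity.
EXAMPLE_EXTRACTIONS = [
    {
        "extraction_class": "company",
        "extraction_text": "Acme Corp",
        "attributes": {"ticker": "ACME", "exchange": "NASDAQ"},
    },
    {
        "extraction_class": "financial_metrics",
        "extraction_text": "revenue of $1.2 billion",
        "attributes": {"period": "Q2 2024", "value": "$1.2 billion", "change": "up 8%"},
    },
]


def check_example(text: str, extractions: list) -> list:
    """Return a list of problems: spans that are missing or out of order."""
    problems = []
    cursor = 0
    for ext in extractions:
        span = ext["extraction_text"]
        pos = text.find(span, cursor)
        if pos == -1:
            problems.append(f"{span!r} not found at or after offset {cursor}")
        else:
            cursor = pos + len(span)
    return problems


print(check_example(SAMPLE_TEXT, EXAMPLE_EXTRACTIONS))
```

An empty list means the example is internally consistent and ready to be converted into the library's example objects.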

This same principle can be applied to various text types, such as news articles (extracting funding_events with their attributes) or customer reviews (extracting product feedback). For customer reviews, you would provide sample text and define the structured output you need.

Building Knowledge Graphs with Relationship Extraction

The true power of Lang Extract lies in its ability to not only extract entities but also to identify the relationships between them, which is essential for building knowledge graphs. Let's explore a medical example. Given a text about a patient's medication dosage, we can extract details like dosage, route, frequency, and duration.

However, we can go further and analyze the relationship between a medical condition and the medications used to treat it. Imagine a clinical note summarizing a patient's problems and prescribed medications. The prompt could be: For each medication, identify its name, dosage, the condition it treats, and any related medications or interactions.

In the few-shot example, we would extract patient information (with age and gender as attributes) and their medical conditions. For each medication, we'd extract its details, including a crucial attribute: related_condition. This attribute links the medication directly back to the condition entity. This allows us to map multiple medications to a single condition.
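To make the role of related_condition concrete, the following sketch groups medication extractions by the condition they point to. The records are invented and use a simplified dictionary layout rather than the library's own objects. The patient details, medications, and dosages are purely illustrative.

```python
from collections import defaultdict

# Hypothetical extractions from a clinical note, in a simplified layout.
extractions = [
    {"extraction_class": "patient", "extraction_text": "68-year-old male",
     "attributes": {"age": "68", "gender": "male"}},
    {"extraction_class": "condition", "extraction_text": "hypertension",
     "attributes": {}},
    {"extraction_class": "medication", "extraction_text": "lisinopril",
     "attributes": {"dosage": "10 mg", "related_condition": "hypertension"}},
    {"extraction_class": "medication", "extraction_text": "amlodipine",
     "attributes": {"dosage": "5 mg", "related_condition": "hypertension"}},
    {"extraction_class": "condition", "extraction_text": "type 2 diabetes",
     "attributes": {}},
    {"extraction_class": "medication", "extraction_text": "metformin",
     "attributes": {"dosage": "500 mg", "related_condition": "type 2 diabetes"}},
]


def medications_by_condition(extractions: list) -> dict:
    """Map each condition to the medications whose related_condition names it."""
    grouped = defaultdict(list)
    for ext in extractions:
        if ext["extraction_class"] != "medication":
            continue
        condition = ext["attributes"].get("related_condition")
        if condition:
            grouped[condition].append(ext["extraction_text"])
    return dict(grouped)


grouped = medications_by_condition(extractions)
print(grouped)
```

Because the link lives in an attribute of the medication, several medications can name the same condition without any change to the schema.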

An important parameter here is extraction_passes, which sets how many times the extraction is run over the text. Increasing the number of passes can surface entities that a single pass misses, but it also increases the number of API calls. Finally, all entities and attributes are collected into a JSONL file for visualization.
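Running several passes raises the question of how their results are combined. How Lang Extract does this internally is not covered in this article. The sketch below shows one plausible strategy, assumed purely for illustration: keep every extraction from the first pass, then add extractions from later passes only if they do not overlap anything already kept. The record layout and the sample spans are hypothetical.

```python
def overlaps(a: dict, b: dict) -> bool:
    """True if two half-open character ranges [start, end) intersect."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def merge_passes(passes: list) -> list:
    """Keep-first merge: earlier passes win when spans overlap."""
    merged = []
    for pass_extractions in passes:
        for ext in pass_extractions:
            if not any(overlaps(ext, kept) for kept in merged):
                merged.append(ext)
    return sorted(merged, key=lambda e: e["start"])


# Hypothetical results from two passes over the same text.
pass_1 = [{"text": "metformin", "start": 10, "end": 19}]
pass_2 = [
    {"text": "metformin 500 mg", "start": 10, "end": 26},  # overlaps pass 1
    {"text": "lisinopril", "start": 40, "end": 50},        # new, non-overlapping
]

merged = merge_passes([pass_1, pass_2])
print([e["text"] for e in merged])
```

The second pass contributes only the span the first pass missed, which is the practical benefit of extra passes: better coverage at the price of more model calls.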

The visualization of the clinical note would show each class with its attributes. For example, the patient's information would display age and gender, while the medication details would include dosage and the linked related_condition. This direct link between conditions and medications is incredibly valuable for creating knowledge graphs.

A knowledge graph built from the extracted data might show four different conditions: Hypertension, Hyperlipidemia, Atrial Fibrillation, and Type 2 Diabetes. The graph visually connects each condition to its corresponding medication(s), clearly illustrating the treatment relationships discovered through entity and relationship extraction.
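A graph like this can be produced from a condition-to-medication mapping with a few lines of Python. In the sketch below, the four conditions come from the description above, while the medications are illustrative assumptions and not taken from any real clinical note. The output is Graphviz DOT text, chosen here only because it needs no third-party packages to generate.

```python
# Conditions from the article; medications are hypothetical placeholders.
treatments = {
    "Hypertension": ["Lisinopril"],
    "Hyperlipidemia": ["Atorvastatin"],
    "Atrial Fibrillation": ["Apixaban", "Metoprolol"],
    "Type 2 Diabetes": ["Metformin"],
}


def to_edges(treatments: dict) -> list:
    """Flatten the mapping into (medication, relation, condition) triples."""
    return [
        (medication, "treats", condition)
        for condition, medications in treatments.items()
        for medication in medications
    ]


def to_dot(edges: list) -> str:
    """Render the triples as a Graphviz DOT digraph."""
    lines = ["digraph treatments {"]
    for source, relation, target in edges:
        lines.append(f'  "{source}" -> "{target}" [label="{relation}"];')
    lines.append("}")
    return "\n".join(lines)


edges = to_edges(treatments)
dot = to_dot(edges)
print(dot)
```

The triples can just as easily be loaded into a graph database or a graph library. The DOT text can be rendered with any Graphviz viewer.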

Final Thoughts

The official GitHub repository contains numerous examples, including one based on the paper 'Large Language Models Accelerate Annotation for Medical Information Extraction,' which inspired the medical example in this article. As mentioned, the knowledge graphs produced through this process are highly beneficial for downstream RAG systems.

Disclaimer: Lang Extract is an open-source project from Google engineers and not an officially supported Google product.

While Lang Extract is a powerful tool, it's worth noting that several other open-source projects for structured data extraction are also available. For anyone working on information retrieval systems, exploring structured data extraction for metadata enhancement is highly recommended.