
Structured Outputs with Writer now supported

We're excited to announce that instructor now supports Writer's enterprise-grade LLMs, including their latest Palmyra X 004 model. This integration enables structured outputs and enterprise AI workflows with Writer's powerful language models.

Getting Started

First, make sure that you've signed up for an account on Writer and obtained an API key. Once you've done so, install instructor with Writer support by running pip install "instructor[writer]" in your terminal.

Make sure to set the WRITER_API_KEY environment variable with your Writer API key or pass it as an argument to the Writer constructor.
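As a rough sketch of what a structured extraction call might look like (this assumes the writerai SDK's Writer client, instructor's from_writer constructor, and the palmyra-x-004 model id mentioned above - check the docs for the exact names):

# Sketch only -- assumes instructor exposes from_writer and that the
# writerai SDK provides a Writer client.
import instructor
from writerai import Writer
from pydantic import BaseModel

# Writer() picks up WRITER_API_KEY from the environment by default
client = instructor.from_writer(Writer())


class User(BaseModel):
    name: str
    age: int


user = client.chat.completions.create(
    model="palmyra-x-004",  # model id assumed from the announcement above
    messages=[{"role": "user", "content": "Extract: John is 30 years old"}],
    response_model=User,
)
print(user)
# name='John' age=30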

PDF Processing with Structured Outputs with Gemini

In this post, we'll explore how to use Google's Gemini model with Instructor to analyze the Gemini 1.5 Pro paper and extract a structured summary.

The Problem

Processing PDFs programmatically has always been painful. The typical approaches all have significant drawbacks:

  • PDF parsing libraries require complex rules and break easily
  • OCR solutions are slow and error-prone
  • Specialized PDF APIs are expensive and require additional integration
  • LLM solutions often need complex document chunking and embedding pipelines

What if we could just hand a PDF to an LLM and get structured data back? With Gemini's multimodal capabilities and Instructor's structured output handling, we can do exactly that.

Quick Setup

First, install the required packages:

pip install "instructor[google-generativeai]"

Then, here's all the code you need:

import instructor
import google.generativeai as genai
from google.ai.generativelanguage_v1beta.types.file import File
from pydantic import BaseModel
import time

# Initialize the client
client = instructor.from_gemini(
    client=genai.GenerativeModel(
        model_name="models/gemini-1.5-flash-latest",
    )
)

# Define your output structure
class Summary(BaseModel):
    summary: str

# Upload the PDF
file = genai.upload_file("path/to/your.pdf")

# Wait for file to finish processing
while file.state != File.State.ACTIVE:
    time.sleep(1)
    file = genai.get_file(file.name)
    print(f"File is still uploading, state: {file.state}")

print(f"File is now active, state: {file.state}")
print(file)

resp = client.chat.completions.create(
    messages=[
        {"role": "user", "content": ["Summarize the following file", file]},
    ],
    response_model=Summary,
)

print(resp.summary)
Raw result:
summary="Gemini 1.5 Pro is a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. It achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state-of-the-art in long-document QA, long-video QA and long-context ASR, and matches or surpasses Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Gemini 1.5 Pro is built to handle extremely long contexts; it has the ability to recall and reason over fine-grained information from up to at least 10M tokens. This scale is unprecedented among contemporary large language models (LLMs), and enables the processing of long-form mixed-modality inputs including entire collections of documents, multiple hours of video, and almost five days long of audio. Gemini 1.5 Pro surpasses Gemini 1.0 Pro and performs at a similar level to 1.0 Ultra on a wide array of benchmarks while requiring significantly less compute to train. It can recall information amidst distractor context, and it can learn to translate a new language from a single set of linguistic documentation. With only instructional materials (a 500-page reference grammar, a dictionary, and ≈ 400 extra parallel sentences) all provided in context, Gemini 1.5 Pro is capable of learning to translate from English to Kalamang, a Papuan language with fewer than 200 speakers, and therefore almost no online presence."

Benefits

The combination of Gemini and Instructor offers several key advantages over traditional PDF processing approaches:

Simple Integration - Unlike traditional approaches that require complex document processing pipelines, chunking strategies, and embedding databases, you can directly process PDFs with just a few lines of code. This dramatically reduces development time and maintenance overhead.

Structured Output - Instructor's Pydantic integration ensures you get exactly the data structure you need. The model's outputs are automatically validated and typed, making it easier to build reliable applications. If the extraction fails, Instructor automatically handles the retries for you with support for custom retry logic using tenacity.

Multimodal Support - Gemini's multimodal capabilities mean this same approach works for various file types. You can process images, videos, and audio files all in the same API request. Check out our multimodal processing guide to see how we extract structured data from travel videos.

Conclusion

Working with PDFs doesn't have to be complicated.

By combining Gemini's multimodal capabilities with Instructor's structured output handling, we can transform complex document processing into simple, Pythonic code.

No more wrestling with parsing rules, managing embeddings, or building complex pipelines – just define your data model and let the LLM do the heavy lifting.

If you liked this, give instructor a try and see how much easier working with LLMs becomes with structured outputs. Get started with Instructor today!

Do I Still Need Instructor with Google's New OpenAI Integration?

Google recently launched OpenAI client compatibility for Gemini.

While this is a significant step forward for developers by simplifying Gemini model interactions, you absolutely still need instructor.

If you're unfamiliar with instructor, we provide a simple interface to get structured outputs from LLMs across different providers.

This makes it easy to switch between providers, get reliable outputs from language models and ultimately build production grade LLM applications.

The current state

The new integration makes it easy to use the OpenAI client with Gemini models, which means that function calling with Gemini has become much simpler. We no longer need a Gemini-specific library like vertexai or google.generativeai to define response models.

It looks something like this:

from openai import OpenAI
client = OpenAI(
    base_url="https://generativelanguage.googleapis.com/v1beta/",
    api_key="YOUR_API_KEY"
)

response = client.chat.completions.create(
    model="gemini-1.5-flash",
    messages=[{"role": "user", "content": "Extract name and age from: John is 30"}]
)

While this seems convenient, there are three major limitations that make instructor still essential:

1. Limited Schema Support

The current implementation only supports simple, single-level schemas. This means you can't use complex nested schemas that are common in real-world applications. For example, this won't work:

class User(BaseModel):
    name: str
    age: int

class Users(BaseModel):
    users: list[User]  # Nested schema - will throw an error

2. No Streaming Support for Function Calling

The integration doesn't support streaming for function calling. This is a significant limitation if your application relies on streaming responses, which is increasingly common for:

  • Real-time user interfaces
  • Progressive rendering
  • Long-running extractions

3. No Multimodal Support

Perhaps the biggest limitation is the lack of multimodal support. Gemini's strength lies in its ability to process multiple types of inputs (images, video, audio), but the OpenAI compatibility layer doesn't support this. This means you can't:

  • Perform visual question answering
  • Extract structured data from images
  • Analyze video content
  • Process audio inputs

Why Instructor Remains Essential

Let's see how instructor solves these issues.

1. Easy Schema Management

It's easy to define and experiment with different response models as you build out your application. In our own experiments, we found that changing a single field name from final_choice to answer improved model accuracy from 4.5% to 95%.

The way we structure and name fields in our response models can fundamentally alter how the model interprets and responds to queries. Manually editing schemas constrains your ability to iterate on your response models, introduces room for catastrophic errors and limits what you can squeeze out of your models.

With instructor's from_gemini and from_vertexai integrations, you get the full power of Pydantic with Gemini instead of the limited support in the OpenAI compatibility layer.
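For instance, nested schemas work as expected (a minimal sketch reusing the from_gemini setup shown earlier in this post; the prompt and data are illustrative):

import instructor
import google.generativeai as genai
from pydantic import BaseModel

client = instructor.from_gemini(
    client=genai.GenerativeModel(model_name="models/gemini-1.5-flash-latest"),
)


class User(BaseModel):
    name: str
    age: int


class Users(BaseModel):
    users: list[User]  # Nested schema - works fine with from_gemini


resp = client.chat.completions.create(
    messages=[{"role": "user", "content": "Extract: John is 30, Mary is 25"}],
    response_model=Users,
)
print(resp)
# users=[User(name='John', age=30), User(name='Mary', age=25)]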

2. Streaming Support

instructor provides built in support for streaming, allowing you to stream partial results as they're generated.

A common use case for streaming is extracting multiple items that share the same structure - e.g. multiple users, multiple products, multiple events, and so on.

This is relatively easy to do with instructor:

from instructor import from_openai
from openai import OpenAI
from instructor import Mode
from pydantic import BaseModel
import os

client = from_openai(
    OpenAI(
        api_key=os.getenv("GOOGLE_API_KEY"),
        base_url="https://generativelanguage.googleapis.com/v1beta/",
    ),
    mode=Mode.MD_JSON,
)


class User(BaseModel):
    name: str
    age: int


resp = client.chat.completions.create_iterable(
    model="gemini-1.5-flash",
    messages=[
        {
            "role": "user",
            "content": "Generate 10 random users",
        }
    ],
    response_model=User,
)

for r in resp:
    print(r)
# name='Alice' age=25
# name='Bob' age=32
# name='Charlie' age=19
# name='David' age=48
# name='Emily' age=28
# name='Frank' age=36
# name='Grace' age=22
# name='Henry' age=41
# name='Isabella' age=30
# name='Jack' age=27

If you instead want to stream out a single item as it's being generated, you can use the create_partial method:

from instructor import from_openai
from openai import OpenAI
from instructor import Mode
from pydantic import BaseModel
import os

client = from_openai(
    OpenAI(
        api_key=os.getenv("GOOGLE_API_KEY"),
        base_url="https://generativelanguage.googleapis.com/v1beta/",
    ),
    mode=Mode.MD_JSON,
)


class Story(BaseModel):
    title: str
    summary: str


resp = client.chat.completions.create_partial(
    model="gemini-1.5-flash",
    messages=[
        {
            "role": "user",
            "content": "Generate a random bedtime story + 1 sentence summary",
        }
    ],
    response_model=Story,
)

for r in resp:
    print(r)



# title = None summary = None
# title='The Little Firefly Who Lost His Light' summary=None
# title='The Little Firefly Who Lost His Light' summary='A tiny firefly learns the true meaning of friendship when he loses his glow and a wise old owl helps him find it again.'

3. Multimodal Support

instructor supports multimodal inputs for Gemini models, allowing you to perform tasks like visual question answering, image analysis, and more.

You can see an example of how to use instructor with Gemini to extract travel recommendations from videos in our Structured Outputs with Multimodal Gemini post.

What else does Instructor offer?

Beyond solving the core limitations of Gemini's new OpenAI integration, instructor provides a host of features that make it indispensable for production-grade applications.

1. Provider Agnostic API

Switching between providers shouldn't require rewriting your entire codebase. With instructor, it's as simple as changing just a few lines of code.

from openai import OpenAI
from instructor import from_openai

client = from_openai(
    OpenAI()
)

# rest of code

If we wanted to switch to Anthropic, all it takes is changing a few lines of code:

from anthropic import Anthropic
from instructor import from_anthropic

client = from_anthropic(
    Anthropic()
)

# rest of code

2. Automatic Validation and Retries

Production applications need reliable outputs. Instructor handles this by validating all outputs against your desired response model and automatically retrying outputs that fail validation.

With our tenacity integration, you get full control over retries when needed, letting you easily use mechanisms like exponential backoff and other retry strategies.

import openai
import instructor
from pydantic import BaseModel
from tenacity import Retrying, stop_after_attempt, wait_fixed

client = instructor.from_openai(openai.OpenAI(), mode=instructor.Mode.TOOLS)


class UserDetail(BaseModel):
    name: str
    age: int


response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=UserDetail,
    messages=[
        {"role": "user", "content": "Extract `jason is 12`"},
    ],
    # Stop after the second attempt and wait a fixed 1 second between attempts
    max_retries=Retrying(
        stop=stop_after_attempt(2),
        wait=wait_fixed(1),
    ),
)
print(response.model_dump_json(indent=2))
"""
{
  "name": "jason",
  "age": 12
}
"""

Conclusion

While Google's OpenAI compatibility layer is a welcome addition, there are still a few reasons why you might want to stick with instructor for now.

Within a single package, you get features such as a provider agnostic API, streaming capabilities, multimodal support, automatic re-asking and more.

Give us a try today by installing instructor with pip install instructor and see why Pydantic is all you need for a production-grade LLM application.

Building an LLM-based Reranker for your RAG pipeline

Are you struggling with irrelevant search results in your Retrieval-Augmented Generation (RAG) pipeline?

Imagine having a powerful tool that can intelligently reassess and reorder your search results, significantly improving their relevance to user queries.

In this blog post, we'll show you how to create an LLM-based reranker using Instructor and Pydantic. This approach will:

  • Enhance the accuracy of your search results
  • Leverage the power of large language models (LLMs)
  • Utilize structured outputs for precise information retrieval

By the end of this tutorial, you'll be able to implement an LLM reranker to label synthetic data for fine-tuning a traditional reranker, or to build out an evaluation pipeline for your RAG system. Let's dive in!

Setting Up the Environment

First, let's set up our environment with the necessary imports:

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field, field_validator

client = instructor.from_openai(OpenAI())

We're using the instructor library, which integrates seamlessly with OpenAI's API and Pydantic for structured outputs.

Defining the Reranking Models

We'll use Pydantic to define our Label and RerankedResults models that structure the output of our LLM:

Notice that not only do we reference the chunk_id in the Label class, we also ask the language model to use chain of thought. This is very useful with models like GPT-4o mini or Claude, but not necessarily if we plan to use the o1-mini and o1-preview models.

class Label(BaseModel):
    chunk_id: int = Field(description="The unique identifier of the text chunk")
    chain_of_thought: str = Field(description="The reasoning process used to evaluate the relevance")
    relevancy: int = Field(
        description="Relevancy score from 0 to 10, where 10 is most relevant",
        ge=0,
        le=10,
    )


class RerankedResults(BaseModel):
    labels: list[Label] = Field(description="List of labeled and ranked chunks")

    @field_validator("labels")
    @classmethod
    def model_validate(cls, v: list[Label]) -> list[Label]:
        return sorted(v, key=lambda x: x.relevancy, reverse=True)

These models ensure that our LLM's output is structured and includes a list of labeled chunks with their relevancy scores. The RerankedResults model includes a validator that automatically sorts the labels by relevancy in descending order.
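To see the sorting validator in action, here's a quick hand-constructed check (not part of the original pipeline; the values are made up):

results = RerankedResults(
    labels=[
        Label(chunk_id=0, chain_of_thought="Off-topic for the query", relevancy=2),
        Label(chunk_id=1, chain_of_thought="Directly answers the query", relevancy=9),
    ]
)
print([label.chunk_id for label in results.labels])
# [1, 0] -- highest relevancy first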

Creating the Reranker Function

Next, we'll create a function that uses our LLM to rerank a list of text chunks based on their relevance to a query:

def rerank_results(query: str, chunks: list[dict]) -> RerankedResults:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=RerankedResults,
        messages=[
            {
                "role": "system",
                "content": """
                You are an expert search result ranker. Your task is to evaluate the relevance of each text chunk to the given query and assign a relevancy score.

                For each chunk:
                1. Analyze its content in relation to the query.
                2. Provide a chain of thought explaining your reasoning.
                3. Assign a relevancy score from 0 to 10, where 10 is most relevant.

                Be objective and consistent in your evaluations.
                """,
            },
            {
                "role": "user",
                "content": """
                <query>{{ query }}</query>

                <chunks_to_rank>
                {% for chunk in chunks %}
                <chunk id="{{ chunk.id }}">
                    {{ chunk.text }}
                </chunk>
                {% endfor %}
                </chunks_to_rank>

                Please provide a RerankedResults object with a Label for each chunk.
                """,
            },
        ],
        context={"query": query, "chunks": chunks},
    )

This function takes a query and a list of text chunks as input, sends them to the LLM with a predefined prompt, and returns a structured RerankedResults object. Thanks to instructor, we can use Jinja templating to inject the query and chunks into the prompt by passing in the context parameter.

Testing the Reranker

To test our LLM-based reranker, we can create a sample query and a list of text chunks. Here's an example of how to use the reranker:

def main():
    query = "What are the health benefits of regular exercise?"
    chunks = [
        {
            "id": 0,
            "text": "Regular exercise can improve cardiovascular health and reduce the risk of heart disease.",
        },
        {
            "id": 1,
            "text": "The price of gym memberships varies widely depending on location and facilities.",
        },
        {
            "id": 2,
            "text": "Exercise has been shown to boost mood and reduce symptoms of depression and anxiety.",
        },
        {
            "id": 3,
            "text": "Proper nutrition is essential for maintaining a healthy lifestyle.",
        },
        {
            "id": 4,
            "text": "Strength training can increase muscle mass and improve bone density, especially important as we age.",
        },
    ]

    results = rerank_results(query, chunks)

    print("Reranked results:")
    for label in results.labels:
        print(f"Chunk {label.chunk_id} (Relevancy: {label.relevancy}):")
        print(f"Text: {chunks[label.chunk_id]['text']}")
        print(f"Reasoning: {label.chain_of_thought}")
        print()

if __name__ == "__main__":
    main()

This test demonstrates how the reranker evaluates and sorts the chunks based on their relevance to the query. The full implementation can be found in the examples/reranker/run.py file.

If you want to extend this example, you could use the rerank_results function to label synthetic data for fine-tuning a traditional reranker, or to build out an evaluation pipeline for your RAG system.

Moreover, we could also add a validator to the Label.chunk_id field to ensure that the chunk_id is present in the chunks list. This might be useful if the ids are UUIDs or complex strings and we want to ensure that every chunk_id refers to a valid entry in the chunks list.

Here's an example:

from pydantic import BaseModel, Field, ValidationInfo, field_validator


class Label(BaseModel):
    chunk_id: int = Field(description="The unique identifier of the text chunk")
    ...

    @field_validator("chunk_id")
    @classmethod
    def validate_chunk_id(cls, v: int, info: ValidationInfo) -> int:
        context = info.context
        chunks = context["chunks"]
        if v not in [chunk["id"] for chunk in chunks]:
            raise ValueError(f"Chunk with id {v} not found, must be one of {[chunk['id'] for chunk in chunks]}")
        return v

This will automatically check that the chunk_id is present in the chunks list and raise a ValueError if it is not. Here, info.context is the same context dictionary that we passed into the LLM call inside rerank_results.

Structured Outputs with Multimodal Gemini

In this post, we'll explore how to use Google's Gemini model with Instructor to analyze travel videos and extract structured recommendations. This powerful combination allows us to process multimodal inputs (video) and generate structured outputs using Pydantic models. This post was done in collaboration with Kino.ai, a company that uses instructor to do structured extraction from multimodal inputs to improve search for filmmakers.

Setting Up the Environment

First, let's set up our environment with the necessary libraries:

from pydantic import BaseModel
import instructor
import google.generativeai as genai

Defining Our Data Models

We'll use Pydantic to define our data models for tourist destinations and recommendations:

class TouristDestination(BaseModel):
    name: str
    description: str
    location: str

class Recommendations(BaseModel):
    chain_of_thought: str
    description: str
    destinations: list[TouristDestination]

Initializing the Gemini Client

Next, we'll set up our Gemini client using Instructor:

client = instructor.from_gemini(
    client=genai.GenerativeModel(
        model_name="models/gemini-1.5-flash-latest",
    ),
)

Uploading and Processing the Video

To analyze a video, we first need to upload it:

file = genai.upload_file("./takayama.mp4")

Then, we can process the video and extract recommendations:

resp = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": ["What places do they recommend in this video?", file],
        }
    ],
    response_model=Recommendations,
)

print(resp)
Raw result:
Recommendations(
    chain_of_thought='The video recommends visiting Takayama city, in the Hida Region, Gifu Prefecture. The 
video suggests visiting the Miyagawa Morning Market, to try the Sarubobo good luck charms, and to enjoy the 
cookie cup espresso, made by Koma Coffee. Then, the video suggests visiting a traditional Japanese Cafe, 
called Kissako Katsure, and try their matcha and sweets. Afterwards, the video suggests to visit the Sanmachi 
Historic District, where you can find local crafts and delicious foods. The video recommends trying Hida Wagyu
beef, at the Kin no Kotte Ushi shop, or to have a sit-down meal at the Kitchen Hida. Finally, the video 
recommends visiting Shirakawa-go, a World Heritage Site in Gifu Prefecture.',
    description='This video recommends a number of places to visit in Takayama city, in the Hida Region, Gifu 
Prefecture. It shows some of the local street food and highlights some of the unique shops and restaurants in 
the area.',
    destinations=[
        TouristDestination(
            name='Takayama',
            description='Takayama is a city at the base of the Japan Alps, located in the Hida Region of 
Gifu.',
            location='Hida Region, Gifu Prefecture'
        ),
        TouristDestination(
            name='Miyagawa Morning Market',
            description="The Miyagawa Morning Market, or the Miyagawa Asai-chi in Japanese, is a market that 
has existed officially since the Edo Period, more than 100 years ago. It's open every single day, rain or 
shine, from 7am to noon.",
            location='Hida Takayama'
        ),
        TouristDestination(
            name='Nakaya - Handmade Hida Sarubobo',
            description='The Nakaya shop sells handcrafted Sarubobo good luck charms.',
            location='Hida Takayama'
        ),
        TouristDestination(
            name='Koma Coffee',
            description="Koma Coffee is a shop that has been in business for about 50 or 60 years, and they 
serve coffee in a cookie cup. They've been serving coffee for about 10 years.",
            location='Hida Takayama'
        ),
        TouristDestination(
            name='Kissako Katsure',
            description='Kissako Katsure is a traditional Japanese style cafe, called Kissako, and the name 
means would you like to have some tea. They have a variety of teas and sweets.',
            location='Hida Takayama'
        ),
        TouristDestination(
            name='Sanmachi Historic District',
            description='Sanmachi Dori is a Historic Merchant District in Takayama, all of the buildings here 
have been preserved to look as they did in the Edo Period.',
            location='Hida Takayama'
        ),
        TouristDestination(
            name='Suwa Orchard',
            description='The Suwa Orchard has been in business for more than 50 years.',
            location='Hida Takayama'
        ),
        TouristDestination(
            name='Kitchen HIDA',
            description='Kitchen HIDA is a restaurant with a 50 year history, known for their Hida Beef dishes
and for using a lot of local ingredients.',
            location='Hida Takayama'
        ),
        TouristDestination(
            name='Kin no Kotte Ushi',
            description='Kin no Kotte Ushi is a shop known for selling Beef Sushi, especially Hida Wagyu Beef 
Sushi. Their sushi is medium rare.',
            location='Hida Takayama'
        ),
        TouristDestination(
            name='Shirakawa-go',
            description='Shirakawa-go is a World Heritage Site in Gifu Prefecture.',
            location='Gifu Prefecture'
        )
    ]
)

The Gemini model analyzes the video and provides structured recommendations. Here's a summary of the extracted information:

  1. Takayama City: The main destination, located in the Hida Region of Gifu Prefecture.
  2. Miyagawa Morning Market: A historic market open daily from 7am to noon.
  3. Nakaya Shop: Sells handcrafted Sarubobo good luck charms.
  4. Koma Coffee: A 50-60 year old shop famous for serving coffee in cookie cups.
  5. Kissako Katsure: A traditional Japanese cafe offering various teas and sweets.
  6. Sanmachi Historic District: A preserved merchant district from the Edo Period.
  7. Suwa Orchard: A 50+ year old orchard business.
  8. Kitchen HIDA: A restaurant with a 50-year history, known for Hida Beef dishes.
  9. Kin no Kotte Ushi: A shop specializing in Hida Wagyu Beef Sushi.
  10. Shirakawa-go: A World Heritage Site in Gifu Prefecture.

Limitations, Challenges, and Future Directions

While the current approach demonstrates the power of multimodal AI for video analysis, there are several limitations and challenges to consider:

  1. Lack of Temporal Information: Our current method extracts overall recommendations but doesn't provide timestamps for specific mentions. This limits the ability to link recommendations to exact moments in the video.

  2. Speaker Diarization: The model doesn't distinguish between different speakers in the video. Implementing speaker diarization could provide valuable context about who is making specific recommendations.

  3. Content Density: Longer or more complex videos might overwhelm the model, potentially leading to missed information or less accurate extractions.

Future Explorations

To address these limitations and expand the capabilities of our video analysis system, here are some promising areas to explore:

  1. Timestamp Extraction: Enhance the model to provide timestamps for each recommendation or point of interest mentioned in the video. This could be achieved by:
from typing import Literal

class TimestampedRecommendation(BaseModel):
    timestamp: str
    timestamp_format: Literal["HH:MM", "HH:MM:SS"] # Helps with parsing
    recommendation: str

class EnhancedRecommendations(BaseModel):
    destinations: list[TouristDestination]
    timestamped_mentions: list[TimestampedRecommendation]
  2. Speaker Diarization: Implement speaker recognition to attribute recommendations to specific individuals. This could be particularly useful for videos featuring multiple hosts or interviewees.

  3. Segment-based Analysis: Process longer videos in segments to maintain accuracy and capture all relevant information. This approach could involve:

     • Splitting the video into smaller chunks
     • Analyzing each chunk separately
     • Aggregating and deduplicating results

  4. Multi-language Support: Extend the model's capabilities to accurately analyze videos in various languages and capture culturally specific recommendations.

  5. Visual Element Analysis: Enhance the model to recognize and describe visual elements like landmarks, food dishes, or activities shown in the video, even if not explicitly mentioned in the audio.

  6. Sentiment Analysis: Incorporate sentiment analysis to gauge the speaker's enthusiasm or reservations about specific recommendations.

By addressing these challenges and exploring these new directions, we can create a more comprehensive and nuanced video analysis system, opening up even more possibilities for applications in travel, education, and beyond.

Structured Outputs and Prompt Caching with Anthropic

Anthropic's ecosystem now offers two powerful features for AI developers: structured outputs and prompt caching. These advancements enable more efficient use of large language models (LLMs). This guide demonstrates how to leverage these features with the Instructor library to enhance your AI applications.

Structured Outputs with Anthropic and Instructor

Instructor now offers seamless integration with Anthropic's powerful language models, allowing developers to easily create structured outputs using Pydantic models. This integration simplifies the process of extracting specific information from AI-generated responses.

To get started, you'll need to install Instructor with Anthropic support:

pip install "instructor[anthropic]"

Here's a basic example of how to use Instructor with Anthropic:

from pydantic import BaseModel
from typing import List
import anthropic
import instructor

# Patch the Anthropic client with Instructor
anthropic_client = instructor.from_anthropic(anthropic.Anthropic())

# Define your Pydantic models
class Properties(BaseModel):
    name: str
    value: str

class User(BaseModel):
    name: str
    age: int
    properties: List[Properties]

# Use the patched client to generate structured output
user_response = anthropic_client.chat.completions.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Create a user for a model with a name, age, and properties.",
        }
    ],
    response_model=User,
)

print(user_response.model_dump_json(indent=2))
"""
{
  "name": "John Doe",
  "age": 30,
  "properties": [
    { "name": "favorite_color", "value": "blue" }
  ]
}
"""

This approach allows you to easily extract structured data from Claude's responses, making it simpler to integrate AI-generated content into your applications.

Prompt Caching: Boosting Performance and Reducing Costs

Anthropic has introduced a new prompt caching feature that can significantly improve response times and reduce costs for applications dealing with large context windows. This feature is particularly useful when making multiple calls with similar large contexts over time.

Here's how you can implement prompt caching with Instructor and Anthropic:

import instructor
from anthropic import Anthropic
from pydantic import BaseModel

# Set up the client with prompt caching
client = instructor.from_anthropic(Anthropic())

# Define your Pydantic model
class Character(BaseModel):
    name: str
    description: str

# Load your large context
with open("./book.txt", "r") as f:
    book = f.read()

# Make multiple calls using the cached context
for _ in range(2):
    resp, completion = client.chat.completions.create_with_completion(
        model="claude-3-haiku-20240307",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "<book>" + book + "</book>",
                        "cache_control": {"type": "ephemeral"},
                    },
                    {
                        "type": "text",
                        "text": "Extract a character from the text given above",
                    },
                ],
            },
        ],
        response_model=Character,
        max_tokens=1000,
    )

In this example, the large context (the book content) is cached after the first request and reused in subsequent requests. This can lead to significant time and cost savings, especially when working with extensive context windows.
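To confirm the cache is actually being used, you can inspect the usage block on the raw completion returned by create_with_completion (a sketch; the field names below follow Anthropic's prompt caching documentation and may evolve):

# After each iteration above, `completion` is the raw Anthropic response.
# Anthropic reports cache activity in its usage block.
print(completion.usage)
# First call: cache_creation_input_tokens > 0 (the book is written to the cache)
# Later calls: cache_read_input_tokens > 0 (the book is read back from the cache)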

Conclusion

By combining Anthropic's Claude with Instructor's structured output capabilities and leveraging prompt caching, developers can create more efficient, cost-effective, and powerful AI applications. These features open up new possibilities for building sophisticated AI systems that can handle complex tasks with ease.

As the AI landscape continues to evolve, staying up-to-date with the latest tools and techniques is crucial. We encourage you to explore these features and share your experiences with the community. Happy coding!

Flashcard generator with Instructor + Burr

Flashcards help break down complex topics and learn anything from biology to a new language or lines for a play. This blog will show how to use LLMs to generate flashcards and kickstart your learning!

Instructor lets us get structured outputs from LLMs reliably, and Burr helps create an LLM application that's easy to understand and debug. It comes with Burr UI, a free, open-source, and local-first tool for observability, annotations, and more!
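As a quick taste of the idea (an illustrative sketch using instructor alone; the full post wires this into a Burr application, and the model and prompt here are assumptions):

import instructor
import openai
from pydantic import BaseModel, Field

client = instructor.from_openai(openai.OpenAI())


class Flashcard(BaseModel):
    question: str = Field(description="Question shown on the front of the card")
    answer: str = Field(description="Answer shown on the back of the card")


class Deck(BaseModel):
    flashcards: list[Flashcard]


deck = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Deck,
    messages=[
        {"role": "user", "content": "Create 3 flashcards about photosynthesis"}
    ],
)
for card in deck.flashcards:
    print(f"Q: {card.question}\nA: {card.answer}\n")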

Audio Support in OpenAI's Chat Completions API

OpenAI has recently introduced audio support in their Chat Completions API, opening up exciting new possibilities for developers working with audio and text interactions. This feature is powered by the new gpt-4o-audio-preview model, which brings advanced voice capabilities to the familiar Chat Completions API interface.

Key Features

The new audio support in the Chat Completions API offers several compelling features:

  1. Flexible Input Handling: The API can now process any combination of text and audio inputs, allowing for more versatile applications.

  2. Natural, Steerable Voices: Similar to the Realtime API, developers can use prompting to shape various aspects of the generated audio, including language, pronunciation, and emotional range.

  3. Tool Calling Integration: The audio support seamlessly integrates with existing tool calling functionality, enabling complex workflows that combine audio, text, and external tools.

Practical Example

To demonstrate how to use this new functionality, let's look at a simple example using the instructor library:

from openai import OpenAI
from pydantic import BaseModel
import instructor
from instructor.multimodal import Audio
import base64

client = instructor.from_openai(OpenAI())

class Person(BaseModel):
    name: str
    age: int

resp = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    response_model=Person,
    modalities=["text"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                "Extract the following information from the audio",
                Audio.from_path("./output.wav"),
            ],
        },
    ],
)

print(resp)
# Expected output: Person(name='Jason', age=20)

In this example, we're using the gpt-4o-audio-preview model to extract information from an audio file. The API processes the audio input and returns structured data (a Person object with name and age) based on the content of the audio.

Use Cases

The addition of audio support to the Chat Completions API enables a wide range of applications:

  1. Voice-based Personal Assistants: Create more natural and context-aware voice interfaces for various applications.

  2. Audio Content Analysis: Automatically extract information, sentiments, or key points from audio recordings or podcasts.

  3. Language Learning Tools: Develop interactive language learning applications that can process and respond to spoken language.

  4. Accessibility Features: Improve accessibility in applications by providing audio-based interactions and text-to-speech capabilities.

Considerations

While this new feature is exciting, it's important to note that it's best suited for asynchronous use cases that don't require extremely low latencies. For more dynamic and real-time interactions, OpenAI recommends using their Realtime API.

As with any AI-powered feature, it's crucial to consider ethical implications and potential biases in audio processing and generation. Always test thoroughly and consider the diversity of your user base when implementing these features.

Building a Pairwise LLM Judge with Instructor and Pydantic

In this blog post, we'll explore how to create a pairwise LLM judge using Instructor and Pydantic. This judge will evaluate the relevance between a question and a piece of text, demonstrating a practical application of structured outputs in language model interactions.

Introduction

Evaluating text relevance is a common task in natural language processing and information retrieval. By leveraging large language models (LLMs) and structured outputs, we can create a system that judges the similarity or relevance between a question and a given text.

Setting Up the Environment

First, let's set up our environment with the necessary imports:

import instructor
import openai
from pydantic import BaseModel, Field

client = instructor.from_openai(openai.OpenAI())

Here, we're using the instructor library, which integrates seamlessly with OpenAI's API and Pydantic for structured outputs.

Defining the Judgment Model

We'll use Pydantic to define a Judgment model that structures the output of our LLM:

class Judgment(BaseModel):
    thought: str = Field(
        description="The step-by-step reasoning process used to analyze the question and text"
    )
    justification: str = Field(
        description="Explanation for the similarity judgment, detailing key factors that led to the conclusion"
    )
    similarity: bool = Field(
        description="Boolean judgment indicating whether the question and text are similar or relevant (True) or not (False)"
    )

This model ensures that our LLM's output is structured and includes a thought process, justification, and a boolean similarity judgment.

Creating the Judge Function

Next, we'll create a function that uses our LLM to judge the relevance between a question and a text:

def judge_relevance(question: str, text: str) -> Judgment:
    return client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": """
                    You are tasked with comparing a question and a piece of text to determine if they are relevant to each other or similar in some way. Your goal is to analyze the content, context, and potential connections between the two.

                    To determine if the question and text are relevant or similar, please follow these steps:

                    1. Carefully read and understand both the question and the text.
                    2. Identify the main topic, keywords, and concepts in the question.
                    3. Analyze the text for any mention of these topics, keywords, or concepts.
                    4. Consider any potential indirect connections or implications that might link the question and text.
                    5. Evaluate the overall context and purpose of both the question and the text.

                    As you go through this process, please use a chain of thought approach. Write out your reasoning for each step inside <thought> tags.

                    After your analysis, provide a boolean judgment on whether the question and text are similar or relevant to each other. Use "true" if they are similar or relevant, and "false" if they are not.

                    Before giving your final judgment, provide a justification for your decision. Explain the key factors that led to your conclusion.

                    Please ensure your analysis is thorough, impartial, and based on the content provided.
                """
            },
            {
                "role": "user",
                "content": """
                    Here is the question:

                    <question>
                    {{question}}
                    </question>

                    Here is the text:
                    <text>
                    {{text}}
                    </text>
                """
            }
        ],
        response_model=Judgment,
        context={"question": question, "text": text},
    )

This function takes a question and a text as input, sends them to the LLM with a predefined prompt, and returns a structured Judgment object.

Testing the Judge

To test our pairwise LLM judge, we can create a set of test pairs and evaluate the judge's performance:

if __name__ == "__main__":
    test_pairs = [
        {
            "question": "What are the main causes of climate change?",
            "text": "Global warming is primarily caused by human activities, such as burning fossil fuels, deforestation, and industrial processes. These activities release greenhouse gases into the atmosphere, trapping heat and leading to a rise in global temperatures.",
            "is_similar": True,
        },
        # ... (other test pairs)
    ]

    score = 0
    for pair in test_pairs:
        result = judge_relevance(pair["question"], pair["text"])
        if result.similarity == pair["is_similar"]:
            score += 1

    print(f"Score: {score}/{len(test_pairs)}")
    #> Score: 9/10

This test loop runs the judge on each pair and compares the result to a predetermined similarity value, calculating an overall score.

Conclusion

By combining Instructor, Pydantic, and OpenAI's language models, we've created a powerful tool for judging text relevance. This approach demonstrates the flexibility and power of structured outputs in LLM applications.

The pairwise LLM judge we've built can be used in various scenarios, such as:

  1. Improving search relevance in information retrieval systems
  2. Evaluating the quality of question-answering systems
  3. Assisting in content recommendation algorithms
  4. Automating parts of the content moderation process

As you explore this technique, consider how you might extend or adapt it for your specific use cases. The combination of structured outputs and large language models opens up a world of possibilities for creating intelligent, interpretable AI systems.