How to Get a Structured JSON Response from a Web Page Using AI

In this article, I'll show you how to use AI to pull structured data from a web page and turn it into JSON. We'll use a real-world example: verifying invoices from the Kenya Revenue Authority (KRA). When you buy something in Kenya, you get an Electronic Tax Register (ETR) receipt with details like invoice numbers, dates, and amounts. To check if the seller submitted this to KRA, you can use their invoice checker website. But there's no API - just an HTML page. So, we'll grab that HTML, parse it with Google's Gemini AI, and get a clean JSON response.

Let's walk through it step by step.

What You'll Need

Before we start, set up your tools. I assume you know how to create a Python virtual environment. If not, look it up - it's simple. Then install these packages:

pip install google-genai httpx pydantic pydantic-settings fastapi

google-genai: For using Gemini AI.
httpx: To fetch the web page.
pydantic: To structure and validate the JSON.
pydantic-settings: To manage settings like API keys.
fastapi: For the HTTPException raised in Step 3, so errors map to HTTP responses in a web app.

Next, get a Gemini API key. Go to https://aistudio.google.com/apikey, sign in with a Google account, and create a key. Don't share it or hardcode it in your scripts; keep it in an environment variable or a .env file instead. At the time of writing, Gemini's free tier is decent: 15 requests per minute, 1,500 per day, and 1 million tokens per minute in eligible countries. Check the current limits before you rely on them, since they can change.
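
If you go the .env route, something has to load that file at startup. Here's a minimal standard-library sketch. Note that parse_env_text and load_env_file are hypothetical helpers I made up for illustration, and they assume simple KEY=VALUE lines. The pydantic-settings package installed above can do the same job through its BaseSettings class.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    pairs: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value
        pairs[key.strip()] = value.strip().strip("'\"")
    return pairs


def load_env_file(path: str = ".env") -> None:
    """Copy values from a .env file into os.environ.

    Variables already set in the real environment are left untouched.
    """
    env_path = Path(path)
    if not env_path.exists():
        return
    for key, value in parse_env_text(env_path.read_text(encoding="utf-8")).items():
        os.environ.setdefault(key, value)
```

Call load_env_file() once at startup, before anything reads GEMINI_API_KEY, and add .env to your .gitignore so the key never lands in version control.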

The Goal

When you check an invoice on the KRA site, you enter an invoice number and get an HTML page back. We want to extract these fields:

Control Unit Invoice Number
Trader System Invoice Number
Invoice Date (in yyyy-mm-dd format)
Total Taxable Amount
Total Tax Amount
Total Invoice Amount
Supplier Name

If the data's missing or the invoice isn't valid, we'll flag it. The output will be JSON, ready to use in a web app.

Step 1: Define the Data Structure

We'll use Pydantic to model the response. Here's the code:

from pydantic import BaseModel
from typing import Optional

class InvoiceCheckResponse(BaseModel):
    valid: bool
    control_unit_invoice_number: Optional[str] = None
    trader_system_invoice_number: Optional[str] = None
    invoice_date: Optional[str] = None
    total_taxable_amount: Optional[float] = None
    total_tax_amount: Optional[float] = None
    total_invoice_amount: Optional[float] = None
    supplier_name: Optional[str] = None

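This sets up a structure with all the fields we want. Every field except valid is optional, because an invalid or unknown invoice has no details to return. Later, Pydantic will validate the JSON that Gemini returns against this model and serialize it back to JSON for us.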
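
Because the values come from a language model, it can help to sanity-check them once they're extracted. The sketch below is one possible check, not something the KRA page guarantees. It assumes the total equals the taxable amount plus tax, which may not hold for every invoice, and that the date arrives as yyyy-mm-dd. find_inconsistencies is a hypothetical helper that works on the plain dict form of the response.

```python
from datetime import datetime
from typing import Optional


def find_inconsistencies(data: dict, tolerance: float = 0.01) -> list[str]:
    """Return human-readable problems found in an extracted invoice dict."""
    problems: list[str] = []

    # The prompt asks for yyyy-mm-dd, so the date should parse with that format
    invoice_date: Optional[str] = data.get("invoice_date")
    if invoice_date is not None:
        try:
            datetime.strptime(invoice_date, "%Y-%m-%d")
        except ValueError:
            problems.append(f"invoice_date is not yyyy-mm-dd: {invoice_date!r}")

    # Assumption: total = taxable + tax (within a small rounding tolerance)
    taxable = data.get("total_taxable_amount")
    tax = data.get("total_tax_amount")
    total = data.get("total_invoice_amount")
    if taxable is not None and tax is not None and total is not None:
        if abs((taxable + tax) - total) > tolerance:
            problems.append("taxable amount plus tax does not match the total")

    return problems
```

An empty list means nothing looked off. Anything else is a hint to re-run the check or flag the invoice for a human to review.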

Step 2: Tell the AI What to Do

We need to give Gemini clear instructions. Here's the prompt:

invoice_checker_message_content = (
    "You are a system that parses an HTML response to extract invoice validity information from Kenya Revenue Authority (KRA). "
    "Your goal is to identify and extract the following fields from the HTML response: "
    "1. Control Unit Invoice Number. "
    "2. Trader System Invoice No. "
    "3. Invoice Date. Convert the date to the format yyyy-mm-dd. "
    "4. Total Taxable Amount. "
    "5. Total Tax Amount. "
    "6. Total Invoice Amount. "
    "7. Supplier Name. "
    "If the data is not found, set the valid field to false."
)

This tells the AI exactly what to look for and how to format it. Clear prompts make a big difference with AI.

Step 3: Build the Parsing Function

Now, let's write a function to process the HTML. This uses Gemini to parse the page and return our structured data. It reuses invoice_checker_message_content from Step 2 as the system instruction. Here's the full code:

import httpx
from google import genai
from google.genai import types
from pydantic import BaseModel
from typing import Optional
import os
from fastapi import HTTPException, status

class InvoiceCheckResponse(BaseModel):
    valid: bool
    control_unit_invoice_number: Optional[str] = None
    trader_system_invoice_number: Optional[str] = None
    invoice_date: Optional[str] = None
    total_taxable_amount: Optional[float] = None
    total_tax_amount: Optional[float] = None
    total_invoice_amount: Optional[float] = None
    supplier_name: Optional[str] = None

async def check_invoice(html: str) -> InvoiceCheckResponse:
    # Check for API key
    api_key = os.getenv("GEMINI_API_KEY")
    if not api_key:
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail="Gemini API key not set in environment variables."
        )

    # Set up the Gemini client
    client = genai.Client(api_key=api_key)
    model = "gemini-2.0-flash"

    # Prepare the HTML content
    contents = [
        types.Content(
            role="user",
            parts=[types.Part.from_text(text=html)]
        )
    ]

    # Configure the AI response
    generate_config = types.GenerateContentConfig(
        temperature=0,  # Low temperature keeps extraction output consistent
        top_p=0.95,
        top_k=40,
        max_output_tokens=8192,
        response_mime_type="application/json",
        response_schema=genai.types.Schema(
            type=genai.types.Type.OBJECT,
            properties={
                "valid": genai.types.Schema(type=genai.types.Type.BOOLEAN),
                "control_unit_invoice_number": genai.types.Schema(type=genai.types.Type.STRING),
                "trader_system_invoice_number": genai.types.Schema(type=genai.types.Type.STRING),
                "invoice_date": genai.types.Schema(type=genai.types.Type.STRING),
                "total_taxable_amount": genai.types.Schema(type=genai.types.Type.NUMBER),
                "total_tax_amount": genai.types.Schema(type=genai.types.Type.NUMBER),
                "total_invoice_amount": genai.types.Schema(type=genai.types.Type.NUMBER),
                "supplier_name": genai.types.Schema(type=genai.types.Type.STRING),
            },
            required=["valid"]
        ),
        system_instruction=[types.Part.from_text(text=invoice_checker_message_content)]
    )

    # Call the AI through the async client so the event loop isn't blocked
    try:
        response = await client.aio.models.generate_content(
            model=model,
            contents=contents,
            config=generate_config
        )
        return InvoiceCheckResponse.model_validate_json(response.text)
    except Exception as e:
        # Chain the original exception so the traceback is preserved
        raise RuntimeError(f"Error processing invoice: {e}") from e

This function:

Checks for an API key (stored in an environment variable).
Sets up the Gemini client with the "gemini-2.0-flash" model.
Sends the HTML to the AI with our prompt and schema.
Returns the parsed data as an InvoiceCheckResponse object.

It's async so you can await it from an async web app or script, alongside the page fetch in the next step.
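
Before calling check_invoice, you could optionally shrink the HTML to cut token usage. This is a rough sketch using only the standard library. It assumes the invoice details live in visible text rather than in scripts or styles, and that Gemini can still match labels to values once the markup is gone, so test it on real pages before relying on it. VisibleTextExtractor and html_to_visible_text are names I made up for this example.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that sits outside skipped tags
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self) -> str:
        return "\n".join(self._chunks)


def html_to_visible_text(html: str) -> str:
    """Return the page's visible text, one text node per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

You would then pass html_to_visible_text(html) to check_invoice instead of the raw page. If accuracy drops, go back to sending the full HTML.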

Step 4: Fetch the Web Page and Use the Function

To check an invoice, you need its number from the receipt. Then:

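import asyncio
import httpx

# Replace with your invoice number
invoice_number = "123456789"
url = f"https://itax.kra.go.ke/KRA-Portal/invoiceChk.htm?actionCode=loadPage&invoiceNo={invoice_number}"

async def main() -> None:
    # Fetch the HTML without blocking the event loop
    async with httpx.AsyncClient(timeout=15) as client:
        response = await client.get(url)
        response.raise_for_status()  # Stop early on 4xx/5xx responses
    html = response.text

    # Parse it with AI
    check_invoice_response = await check_invoice(html)

    # Print the JSON (model_dump_json replaces the deprecated .json() in Pydantic v2)
    print(check_invoice_response.model_dump_json(indent=4))

# A top-level await only works in a notebook or async REPL; a script needs asyncio.run
asyncio.run(main())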

The output will look like this:

{
    "valid": true,
    "control_unit_invoice_number": "CU123456",
    "trader_system_invoice_number": "TS789012",
    "invoice_date": "2023-10-15",
    "total_taxable_amount": 1000.0,
    "total_tax_amount": 160.0,
    "total_invoice_amount": 1160.0,
    "supplier_name": "ABC Kenya Suppliers Ltd"
}

If the invoice isn't valid or data's missing, valid will be false, and some fields might be null.
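
In a web app you'll usually branch on that valid flag before showing anything to the user. Here's a small sketch of how that could look. summarize_invoice is a hypothetical helper, and the messages are just placeholders.

```python
import json


def summarize_invoice(raw_json: str) -> str:
    """Turn the JSON output into a one-line, user-facing message."""
    data = json.loads(raw_json)

    # An invalid invoice has nothing else worth reporting
    if not data.get("valid"):
        return "Invoice could not be verified."

    # A valid invoice can still have fields the AI failed to find
    missing = sorted(key for key, value in data.items() if value is None)
    if missing:
        return f"Invoice verified, but missing: {', '.join(missing)}"

    number = data.get("control_unit_invoice_number")
    supplier = data.get("supplier_name")
    return f"Invoice {number} from {supplier} verified."
```

For example, summarize_invoice(check_invoice_response.model_dump_json()) would give you a message you can show next to the raw data.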

How It Works

Here's the flow:

You get an invoice number from an ETR receipt.
You build the KRA URL with that number.
httpx fetches the HTML from the KRA site.
The check_invoice function sends the HTML to Gemini AI.
Gemini extracts the data based on our prompt and schema.
Pydantic validates Gemini's JSON and turns it into a typed InvoiceCheckResponse object.

It's straightforward and skips manual parsing, which can be a pain if the HTML changes.

Tips and Things to Watch Out For

Errors: The code handles basic issues, like a missing API key. But if the HTML layout changes, the AI might miss data. Test it with different invoices.
Limits: Gemini's free tier has caps, which at the time of writing are 15 requests per minute and 1,500 per day. In a real app, handle rate limits gracefully, for example by retrying after a short wait.
Caching: Checking the same invoice repeatedly? Store the results to save API calls.
Security: Keep that API key secret. Use environment variables or a secrets manager.
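
The Limits and Caching tips can be combined in a small wrapper. This is a sketch under a few assumptions: retry_with_backoff and cached_check are hypothetical helpers, the cache is a plain in-memory dict that disappears on restart, and every exception is treated as retryable. In a real app you'd retry only on rate-limit errors such as HTTP 429.

```python
import asyncio
import random
from typing import Any, Awaitable, Callable, Dict


async def retry_with_backoff(
    operation: Callable[[], Awaitable[Any]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
) -> Any:
    """Retry an async operation, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error
            # Exponential backoff plus a little jitter to avoid retry bursts
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay / 10)
            await asyncio.sleep(delay)
    raise ValueError("max_attempts must be at least 1")


async def cached_check(
    cache: Dict[str, Any],
    invoice_number: str,
    fetch: Callable[[str], Awaitable[Any]],
    **retry_options: Any,
) -> Any:
    """Return a cached result, or fetch it (with retries) and remember it."""
    if invoice_number not in cache:
        cache[invoice_number] = await retry_with_backoff(
            lambda: fetch(invoice_number), **retry_options
        )
    return cache[invoice_number]
```

Here fetch would be your own coroutine that takes an invoice number, downloads the KRA page, and calls check_invoice. Repeat lookups for the same number then cost no API calls at all.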

Final Thoughts

This method works well for turning messy web pages into usable JSON. The KRA example shows how AI can save time when there's no API. You could adapt this for other sites too - just tweak the prompt and schema.

Let me know if you try it or run into trouble!