In this article, I'll show you how to use AI to pull structured data from a web page and turn it into JSON. We'll use a real-world example: verifying invoices from the Kenya Revenue Authority (KRA). When you buy something in Kenya, you get an Electronic Tax Register (ETR) receipt with details like invoice numbers, dates, and amounts. To check if the seller submitted this to KRA, you can use their invoice checker website. But there's no API - just an HTML page. So, we'll grab that HTML, parse it with Google's Gemini AI, and get a clean JSON response.
Before we start, set up your tools. I assume you know how to create a Python virtual environment. If not, look it up - it's simple. Then install these packages:
pip install google-genai httpx pydantic pydantic-settings
- google-genai: For using Gemini AI.
- httpx: To fetch the web page.
- pydantic: To structure and validate the JSON.
- pydantic-settings: To manage settings like API keys.

Next, get a Gemini API key. Go to https://aistudio.google.com/apikey, sign in with a Google account, and create a key. Don't share it or hardcode it in your scripts - keep it safe, maybe in a .env file. Gemini's free tier is decent: 15 requests per minute, 1,500 per day, and 1 million tokens per minute in eligible countries.
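When you check an invoice on the KRA site, you enter an invoice number and get an HTML page back. We want to extract these fields:
- Control Unit Invoice Number
- Trader System Invoice No
- Invoice Date
- Total Taxable Amount
- Total Tax Amount
- Total Invoice Amount
- Supplier Name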
If the data's missing or the invoice isn't valid, we'll flag it. The output will be JSON, ready to use in a web app.
We'll use Pydantic to model the response. Here's the code:
```python
from typing import Optional

from pydantic import BaseModel


class InvoiceCheckResponse(BaseModel):
    valid: bool
    control_unit_invoice_number: Optional[str] = None
    trader_system_invoice_number: Optional[str] = None
    invoice_date: Optional[str] = None
    total_taxable_amount: Optional[float] = None
    total_tax_amount: Optional[float] = None
    total_invoice_amount: Optional[float] = None
    supplier_name: Optional[str] = None
```
This sets up a structure with all the fields we want. Fields are optional because they might not always be there. Pydantic will turn this into JSON later.
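To see what the optional fields buy us, here is a small sketch that feeds the model a partial payload and then an invalid one. The JSON strings are made up for illustration; they are not real KRA responses.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class InvoiceCheckResponse(BaseModel):
    valid: bool
    control_unit_invoice_number: Optional[str] = None
    trader_system_invoice_number: Optional[str] = None
    invoice_date: Optional[str] = None
    total_taxable_amount: Optional[float] = None
    total_tax_amount: Optional[float] = None
    total_invoice_amount: Optional[float] = None
    supplier_name: Optional[str] = None


# Made-up payload: a few fields present, the rest omitted.
raw = '{"valid": true, "invoice_date": "2023-10-15", "total_tax_amount": 160}'
invoice = InvoiceCheckResponse.model_validate_json(raw)

print(invoice.total_tax_amount)  # the integer 160 is coerced to a float
print(invoice.supplier_name)     # omitted fields fall back to None
print(invoice.model_dump_json(exclude_none=True))

# "valid" has no default, so a payload without it is rejected.
try:
    InvoiceCheckResponse.model_validate_json('{"supplier_name": "ABC"}')
except ValidationError as exc:
    print(f"Rejected with {exc.error_count()} error(s)")
```

Only valid is required, so a half-empty response still parses, while a response that skips valid fails loudly instead of slipping through.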
We need to give Gemini clear instructions. Here's the prompt:
```python
invoice_checker_message_content = (
    "You are a system that parses an HTML response to extract invoice validity information from Kenya Revenue Authority (KRA). "
    "Your goal is to identify and extract the following fields from the HTML response: "
    "1. Control Unit Invoice Number. "
    "2. Trader System Invoice No. "
    "3. Invoice Date. Convert the date to the format yyyy-mm-dd. "
    "4. Total Taxable Amount. "
    "5. Total Tax Amount. "
    "6. Total Invoice Amount. "
    "7. Supplier Name. "
    "If the data is not found, set the valid field to false."
)
```
This tells the AI exactly what to look for and how to format it. Clear prompts make a big difference with AI.
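A prompt is a request, not a guarantee, so it can help to double-check the date format in code once the model responds. The helper below is a hypothetical addition (the name `normalize_invoice_date` is not part of the code in this post) and uses only the standard library.

```python
from datetime import datetime
from typing import Optional


def normalize_invoice_date(value: Optional[str]) -> Optional[str]:
    """Return the date as zero-padded yyyy-mm-dd, or None if it doesn't parse."""
    if not value:
        return None
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        # Wrong format (e.g. 15/10/2023) or an impossible date (e.g. 2023-02-30).
        return None
    return parsed.date().isoformat()


print(normalize_invoice_date("2023-10-15"))
print(normalize_invoice_date("15/10/2023"))
```

If this returns None for a response marked valid, that is a good signal to log the raw HTML and look at what the model actually saw.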
Now, let's write a function to process the HTML. This uses Gemini to parse the page and return our structured data. Here's the full code:
```python
import os
from typing import Optional

# Note: fastapi is not in the pip install line above; install it separately.
from fastapi import HTTPException, status
from google import genai
from google.genai import types
from pydantic import BaseModel


class InvoiceCheckResponse(BaseModel):
    valid: bool
    control_unit_invoice_number: Optional[str] = None
    trader_system_invoice_number: Optional[str] = None
    invoice_date: Optional[str] = None
    total_taxable_amount: Optional[float] = None
    total_tax_amount: Optional[float] = None
    total_invoice_amount: Optional[float] = None
    supplier_name: Optional[str] = None


async def check_invoice(html: str) -> InvoiceCheckResponse:
    # Check for API key
    api_key = os.getenv("GEMINI_API_KEY")
    if not api_key:
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail="Gemini API key not set in environment variables.",
        )

    # Set up the Gemini client
    client = genai.Client(api_key=api_key)
    model = "gemini-2.0-flash"

    # Prepare the HTML content
    contents = [
        types.Content(role="user", parts=[types.Part.from_text(text=html)])
    ]

    # Configure the AI response
    generate_config = types.GenerateContentConfig(
        temperature=1,
        top_p=0.95,
        top_k=40,
        max_output_tokens=8192,
        response_mime_type="application/json",
        response_schema=types.Schema(
            type=types.Type.OBJECT,
            properties={
                "valid": types.Schema(type=types.Type.BOOLEAN),
                "control_unit_invoice_number": types.Schema(type=types.Type.STRING),
                "trader_system_invoice_number": types.Schema(type=types.Type.STRING),
                "invoice_date": types.Schema(type=types.Type.STRING),
                "total_taxable_amount": types.Schema(type=types.Type.NUMBER),
                "total_tax_amount": types.Schema(type=types.Type.NUMBER),
                "total_invoice_amount": types.Schema(type=types.Type.NUMBER),
                "supplier_name": types.Schema(type=types.Type.STRING),
            },
            required=["valid"],
        ),
        # invoice_checker_message_content is the prompt defined earlier
        system_instruction=[types.Part.from_text(text=invoice_checker_message_content)],
    )

    # Call the AI and get the response. The async client (client.aio) keeps
    # the event loop free while we wait for Gemini.
    try:
        response = await client.aio.models.generate_content(
            model=model, contents=contents, config=generate_config
        )
        return InvoiceCheckResponse.model_validate_json(response.text)
    except Exception as e:
        raise RuntimeError(f"Error processing invoice: {e}") from e
```
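This function:
- Reads the Gemini API key from the environment and fails early if it's missing.
- Sets up the Gemini client and wraps the HTML as the user message.
- Asks for a JSON response that matches our schema, with the prompt as the system instruction.
- Validates the JSON and returns it as an InvoiceCheckResponse.

It's async because we'll fetch the web page asynchronously too.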
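One thing to watch: the field names are written twice, once in the Pydantic model and once in the response_schema. A small check can catch the two drifting apart when you add or rename a field. This is a sketch, and `GEMINI_SCHEMA_FIELDS` and `schema_drift` are hypothetical names introduced here for illustration.

```python
from typing import Optional

from pydantic import BaseModel


class InvoiceCheckResponse(BaseModel):
    valid: bool
    control_unit_invoice_number: Optional[str] = None
    trader_system_invoice_number: Optional[str] = None
    invoice_date: Optional[str] = None
    total_taxable_amount: Optional[float] = None
    total_tax_amount: Optional[float] = None
    total_invoice_amount: Optional[float] = None
    supplier_name: Optional[str] = None


# Property names copied by hand from the response_schema in check_invoice.
GEMINI_SCHEMA_FIELDS = {
    "valid",
    "control_unit_invoice_number",
    "trader_system_invoice_number",
    "invoice_date",
    "total_taxable_amount",
    "total_tax_amount",
    "total_invoice_amount",
    "supplier_name",
}


def schema_drift(model: type[BaseModel], schema_fields: set[str]) -> set[str]:
    """Return field names that appear in only one of the two definitions."""
    return set(model.model_fields) ^ schema_fields


# An empty set means the model and the schema agree on field names.
print(schema_drift(InvoiceCheckResponse, GEMINI_SCHEMA_FIELDS))
```

A check like this fits naturally in a unit test, so a mismatch shows up before you spend any API calls.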
To check an invoice, you need its number from the receipt. Then:
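```python
import asyncio

import httpx

# Replace with your invoice number
invoice_number = "123456789"
url = f"https://itax.kra.go.ke/KRA-Portal/invoiceChk.htm?actionCode=loadPage&invoiceNo={invoice_number}"


async def main() -> None:
    # Fetch the HTML
    async with httpx.AsyncClient(timeout=15) as client:
        response = await client.get(url)
    html = response.text

    # Parse it with AI (check_invoice is the function defined above)
    check_invoice_response = await check_invoice(html)

    # Print the JSON (model_dump_json replaces the deprecated .json() in Pydantic v2)
    print(check_invoice_response.model_dump_json())


# await only works inside a coroutine, so run everything through asyncio
asyncio.run(main())
```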
The output will look like this:
```json
{
  "valid": true,
  "control_unit_invoice_number": "CU123456",
  "trader_system_invoice_number": "TS789012",
  "invoice_date": "2023-10-15",
  "total_taxable_amount": 1000.0,
  "total_tax_amount": 160.0,
  "total_invoice_amount": 1160.0,
  "supplier_name": "ABC Kenya Suppliers Ltd"
}
```
If the invoice isn't valid or data's missing, valid will be false, and some fields might be null.
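Because the numbers come from an AI model, a cheap arithmetic check on the result can be worthwhile. The sketch below assumes that the total invoice amount equals the taxable amount plus the tax, which holds for the sample output above but may not hold for every real invoice (for example, ones with exempt items). The function name and the tolerance are hypothetical choices.

```python
from typing import Optional


def amounts_add_up(
    taxable: Optional[float],
    tax: Optional[float],
    total: Optional[float],
    tolerance: float = 0.01,
) -> bool:
    """True when all three amounts are present and taxable + tax matches total."""
    if taxable is None or tax is None or total is None:
        return False
    # Allow a small tolerance for rounding in the source page.
    return abs((taxable + tax) - total) <= tolerance


# Values from the sample output: 1000.0 + 160.0 == 1160.0
print(amounts_add_up(1000.0, 160.0, 1160.0))
# A missing amount can't be verified, so it is treated as a failed check.
print(amounts_add_up(None, 160.0, 1160.0))
```

A failed check doesn't prove the invoice is wrong. It just tells you this response deserves a second look before your app trusts it.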
To recap: httpx fetches the HTML from the KRA site, Gemini pulls out the fields, and Pydantic validates the result. It's straightforward and skips manual parsing, which can be a pain if the HTML changes.
This method works well for turning messy web pages into usable JSON. The KRA example shows how AI can save time when there's no API. You could adapt this for other sites too - just tweak the prompt and schema.
Let me know if you try it or run into trouble!
Have a project in mind? Send me an email at hello@davidmuraya.com and let's bring your ideas to life. I am always available for exciting discussions.