AI / NLP

Embeddings and semantic search over official bulletins

Enrique Lopez · March 24, 2026

If you search for "grants for SME digitalization" in a traditional text search engine, the exact words need to appear in the document. But an official bulletin doesn't write "grants for SME digitalization." It writes "Resolution of March 15, 2026, announcing subsidies for the digital transformation of the productive fabric of small and medium enterprises." Keyword search fails because bureaucratic language and natural language are different worlds.

This is exactly the problem I solve with semantic search in the Boletin Claro search tools. In this article I explain how it works -- without diving into transformer math, but with enough technical detail that you could build something similar.

What are embeddings (in 30 seconds)

An embedding is a numerical representation of text in a high-dimensional vector space. Two texts with similar meaning will have nearby vectors, even if they don't share a single word. An embedding model is a neural network trained to produce these representations.

In practice, you convert a text into an array of 768 or 1536 floats. Then you compare arrays using cosine similarity. Two texts with cosine close to 1.0 are about the same thing. Close to 0.0, they're unrelated.
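The comparison itself is a few lines of code. A minimal sketch with plain Python lists (toy 2- and 3-dimensional vectors stand in for real 768-dimensional embeddings):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cosine_similarity([1.0, 0.0], [1.0, 0.1])  # close to 1.0: similar direction
cosine_similarity([1.0, 0.0], [0.0, 1.0])  # 0.0: orthogonal, unrelated
```

In production you would use a vectorized implementation (NumPy or the vector store's own index), but the math is exactly this.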

The indexing pipeline

Every day, the reader extracts between 200 and 400 entries from official bulletins. Each entry needs to be indexed to be searchable. The pipeline has three phases:

1. Chunking

Bulletin entries vary wildly in length. A BOE disposition can be 200 words or 20,000. Embedding models have a token limit (typically 512 or 8192 depending on the model). If the text exceeds the limit, you need to split it.

My chunking strategy is paragraph-based with overlap. Each chunk has a maximum of 512 tokens, is cut at paragraph boundaries (never mid-sentence), and carries the last paragraph of the previous chunk forward as overlap, budgeted at roughly 50 tokens. The overlap prevents losing context at the edges.

def count_tokens(text: str) -> int:
    # Rough whitespace approximation; swap in the embedding model's
    # real tokenizer for production counts.
    return len(text.split())

def chunk_text(text: str, max_tokens: int = 512, overlap: int = 50) -> list[str]:
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for para in paragraphs:
        para_tokens = count_tokens(para)
        if current_len + para_tokens > max_tokens and current:
            chunks.append("\n\n".join(current))
            # Carry the last paragraph over as overlap, but only if it
            # fits the overlap budget; otherwise start the chunk clean.
            last = current[-1]
            if count_tokens(last) <= overlap:
                current, current_len = [last], count_tokens(last)
            else:
                current, current_len = [], 0
        current.append(para)
        current_len += para_tokens

    if current:
        chunks.append("\n\n".join(current))
    return chunks

2. Generating embeddings

For vector generation I use Google's embeddings API (Vertex AI with the text-embedding-004 model). The choice wasn't random: I needed a model that works well with Spanish, supports long texts, and has reasonable pricing for batch processing.

Processing is batched at 100 texts per API call. With 300 daily entries and an average of 3 chunks per entry, that's about 900 embeddings per day, resolved in 9 batch requests. The cost is pennies.
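The batching loop itself is trivial. A sketch where `embed_batch` is a stand-in for the actual Vertex AI client call (the name and signature are illustrative, not the real SDK):

```python
from typing import Callable

def embed_all(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,
) -> list[list[float]]:
    """Embed texts in fixed-size batches to minimize API round trips."""
    vectors: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        # One API call per slice of up to batch_size texts
        vectors.extend(embed_batch(texts[i : i + batch_size]))
    return vectors
```

With 900 chunks and a batch size of 100, this issues exactly 9 calls.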

3. Storage

Vectors are stored in Firestore alongside the entry metadata (source, date, section, title). For search, I use Firestore's native vector support with a nearest-neighbor index. This saves me from needing a separate vector database like Pinecone or Weaviate.
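Roughly, each chunk ends up as a Firestore document shaped like this (field names and values are illustrative, not the production schema):

```python
# Hypothetical document shape for one indexed chunk.
chunk_doc = {
    "entry_id": "boe-2026-0315-0042",  # illustrative ID format
    "chunk_index": 0,
    "text": "Resolución de 15 de marzo de 2026 ...",
    "embedding": [0.012, -0.034],      # 768 floats in production
    "source": "BOE",
    "date": "2026-03-15",
    "section": "III",
}
```

Keeping the metadata on the same document is what makes the pre-filtering described later a single query instead of a join.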

Search: query, retrieval, and reranking

When a user searches for "solar energy grants in Andalusia" in the grants search engine, the process is:

  1. The user's query is converted to an embedding with the same model.
  2. The K nearest vectors are retrieved from Firestore (nearest neighbor search).
  3. Metadata filters are applied: source, date, autonomous community.
  4. Results are reranked with a reranking model for higher precision.

Reranking is key. Embedding-based search is good for recall (finding relevant candidates) but doesn't always order by relevance correctly. A cross-encoder reranker takes each (query, document) pair and produces a relevance score that's more accurate than cosine similarity alone. I use Cohere Rerank because it supports Spanish natively.
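The two-stage shape of that pipeline can be sketched with the embedder, vector store, and reranker stubbed out as callables (all names illustrative; in production the scorer would be a cross-encoder such as Cohere Rerank):

```python
from typing import Callable

def search(
    query: str,
    embed: Callable[[str], list[float]],
    knn: Callable[[list[float], int], list[str]],  # vector store lookup
    rerank_score: Callable[[str, str], float],     # cross-encoder stand-in
    k: int = 50,
    top_n: int = 10,
) -> list[str]:
    """Recall-oriented vector retrieval, then precision-oriented reranking."""
    candidates = knn(embed(query), k)
    ranked = sorted(candidates, key=lambda doc: rerank_score(query, doc), reverse=True)
    return ranked[:top_n]
```

The design point: retrieve generously (k = 50) so recall is high, then let the more expensive scorer decide the final order of the few results the user actually sees.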

Why keywords alone don't cut it

A real example illustrates the difference. These are actual searches from the public procurement search engine:

User query

"cleaning contracts for public schools"

Semantic match (no shared keywords)

"Licitación del servicio de mantenimiento higiénico-sanitario en centros docentes de titularidad autonómica"

(Tender for hygienic-sanitary maintenance service in publicly-owned educational centers)

This is cross-language semantic matching: the user queries in English, and the results are Spanish government documents. With keyword search, "cleaning" doesn't appear in the result (the Spanish uses "mantenimiento higiénico-sanitario"). "Schools" doesn't appear ("centros docentes"). "Public" doesn't appear ("titularidad autonómica"). Yet semantically, it's the same search.

User query

"grants for setting up an online store"

Semantic match

"Convocatoria de subvenciones para el fomento del comercio electrónico y la implantación de soluciones de venta digital en el sector minorista"

(Call for grants to promote e-commerce and digital sales solutions in the retail sector)

Again the English query is matched against Spanish documents: "setting up" vs "implantación" (deployment), "online store" vs "comercio electrónico y soluciones de venta digital" (e-commerce and digital sales solutions), "grants" vs "subvenciones" (subsidies). The meaning is identical, the words are completely different.

Practical optimizations

Metadata pre-filtering

There's no point comparing vectors against the entire database if the user has already selected "BDNS" as the source or "Madrid" as the region. Pre-filtering by metadata reduces the search space and improves both speed and relevance.
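The idea in miniature, over an in-memory list rather than Firestore (Firestore applies the metadata filter alongside its vector index server-side; this sketch just shows the order of operations):

```python
import math

def top_k(
    query_vec: list[float],
    entries: list[dict],  # each: {"vector": [...], "source": ..., "region": ...}
    k: int = 5,
    **filters: str,       # e.g. source="BDNS", region="Madrid"
) -> list[dict]:
    """Filter by metadata first, then rank only the survivors by cosine."""
    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

    candidates = [e for e in entries if all(e.get(f) == v for f, v in filters.items())]
    return sorted(candidates, key=lambda e: cosine(query_vec, e["vector"]), reverse=True)[:k]
```

Shrinking the candidate set before any vector math is what buys both the speed and the relevance.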

Caching frequent query embeddings

Many queries are similar: "freelancer grants", "SME funding", "cleaning tenders". I maintain an LRU cache of query embeddings to avoid redundant API calls.
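A minimal version of that cache using functools.lru_cache, with a hypothetical embed_query standing in for the real API call; normalizing the query first lets trivial variants share one cache entry:

```python
from functools import lru_cache

def embed_query(text: str) -> tuple[float, ...]:
    # Stand-in for the real embeddings API call.
    return (float(len(text)),)

@lru_cache(maxsize=1024)
def cached_query_embedding(query: str) -> tuple[float, ...]:
    # Tuples, not lists: lru_cache requires hashable arguments and
    # benefits from immutable return values.
    return embed_query(query)

def get_query_embedding(raw_query: str) -> tuple[float, ...]:
    # Lowercase and collapse whitespace so "SME  Funding" and
    # "sme funding" hit the same cache entry.
    return cached_query_embedding(" ".join(raw_query.lower().split()))
```

An LRU policy fits this workload well: a small set of popular queries dominates, and stale entries simply age out.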

Rate limiting

The public search engines at the BOE search tool are rate limited to 20 requests per minute per IP: enough for human use, but a barrier to abuse.

Results and metrics

I don't have a formal benchmark against a labeled dataset (no relevance dataset exists for Spanish official bulletins). What I do measure is click-through rate on results: 34% of searches result in a click on one of the top 5 results, which is reasonable for such a specialized domain.

Average response time is 280ms, of which about 80ms is query embedding generation, 120ms is vector search in Firestore, and 80ms is reranking. Fast enough to feel instant in the UI.

If you're thinking about implementing semantic search over a specific domain, my recommendation is to start simple: one embedding model, one vector store, and evaluate results manually before adding complexity. Reranking is only worth it if your initial results are already "almost good" but poorly ordered.