Data Engineering

Scraping Spain's official bulletins: BOE, BDNS, and 20 regional gazettes

Enrique Lopez · March 24, 2026

When I decided to build a system that processes Spain's official bulletins automatically, I completely underestimated the diversity of formats. Spain has a national gazette (BOE), a grants database (BDNS), a commercial registry bulletin (BORME), and one official bulletin per autonomous community. That's over 20 sources, each with its own format, its own publication rules, and its own technical quirks.

In this article I explain how I built the extraction service that feeds the data behind the bulletin section of Boletin Claro.

The BOE: well-structured XML with problematic PDFs

The BOE (Boletin Oficial del Estado) is by far the cleanest source. It publishes a daily summary in XML with a predictable structure:

<sumario>
  <diario nbo="58">
    <seccion num="2A" nombre="Autoridades y personal">
      <departamento nombre="Ministerio de Hacienda">
        <epigrafe nombre="Nombramientos">
          <item id="BOE-A-2026-1234">
            <titulo>Resolución de 20 de marzo...</titulo>
            <urlPdf>/boe/dias/2026/03/24/pdfs/BOE-A-2026-1234.pdf</urlPdf>
          </item>
        </epigrafe>
      </departamento>
    </seccion>
  </diario>
</sumario>

The summary is a simple GET to https://boe.es/diario_boe/xml.php?id=BOE-S-20260324. Parsing the XML with lxml is trivial. The problem starts when you need the actual content of each entry, which lives in a PDF.
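Parsing that summary can be sketched with the standard library's ElementTree (the lxml API is nearly identical); the sample XML and field names mirror the structure shown above, and `parse_sumario` is an illustrative name, not the actual codebase:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<sumario>
  <diario nbo="58">
    <seccion num="2A" nombre="Autoridades y personal">
      <departamento nombre="Ministerio de Hacienda">
        <epigrafe nombre="Nombramientos">
          <item id="BOE-A-2026-1234">
            <titulo>Resoluci\u00f3n de 20 de marzo...</titulo>
            <urlPdf>/boe/dias/2026/03/24/pdfs/BOE-A-2026-1234.pdf</urlPdf>
          </item>
        </epigrafe>
      </departamento>
    </seccion>
  </diario>
</sumario>"""

def parse_sumario(xml_text):
    """Walk the summary tree and flatten each <item> into a dict."""
    root = ET.fromstring(xml_text)
    entries = []
    for seccion in root.iter("seccion"):
        for departamento in seccion.iter("departamento"):
            for item in departamento.iter("item"):
                entries.append({
                    "id": item.get("id"),
                    "titulo": item.findtext("titulo"),
                    "url_pdf": item.findtext("urlPdf"),
                    "seccion": seccion.get("nombre"),
                    "departamento": departamento.get("nombre"),
                })
    return entries
```

Flattening to a list of dicts keeps the downstream pipeline independent of the XML nesting.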

BOE PDFs vary enormously in quality. General provisions usually have clean selectable text. But grant announcements often include complex tables, multiple columns, or scanned text. I use pdfplumber as the primary extractor because it handles tables well, and apply post-processing heuristics to reconstruct paragraphs that break across pages.
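The page-break reconstruction can be sketched as a pure-text heuristic (a simplified, illustrative version, applied to the text pdfplumber extracts per page): a line that does not end in sentence-final punctuation is assumed to continue on the next one, and hyphenated breaks are merged.

```python
import re

def reconstruct_paragraphs(pages: list[str]) -> str:
    """Rejoin paragraphs broken across lines and page boundaries.

    Heuristic: lines without sentence-final punctuation continue the
    current paragraph; a trailing hyphen means a word was split.
    """
    # Flatten all pages into a single stream of non-empty lines.
    lines = [ln.strip() for page in pages for ln in page.splitlines() if ln.strip()]
    paragraphs: list[str] = []
    current = ""
    for line in lines:
        if current.endswith("-"):
            current = current[:-1] + line       # re-join a hyphenated word
        elif current:
            current += " " + line
        else:
            current = line
        if re.search(r"[.:;!?]$", line):        # punctuation closes the paragraph
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return "\n\n".join(paragraphs)
```

This deliberately over-merges in edge cases (e.g. abbreviations mid-line); in practice that is less harmful than paragraphs chopped at every page break.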

The BDNS: a REST API with erratic pagination

The BDNS (Base de Datos Nacional de Subvenciones) is Spain's national grants database. Unlike the BOE, the BDNS exposes a public REST API. In theory this should be simpler. In practice, it has its own traps.

The API lets you filter by publication date and returns paginated results. The problem is that pagination isn't consistent: the total count can change between requests if new entries are published mid-fetch. My solution is to paginate until I get an empty page, rather than trusting the totalRegistros field.

async def fetch_bdns_page(session, date, offset):
    # The API is page-based, so translate the offset into a 1-based page number.
    params = {
        "fechaDesde": date.strftime("%d/%m/%Y"),
        "fechaHasta": date.strftime("%d/%m/%Y"),
        "numPagina": offset // PAGE_SIZE + 1,
        "tamPagina": PAGE_SIZE,
    }
    resp = await session.get(BDNS_API_URL, params=params)
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error page
    data = resp.json()
    # An empty list signals the last page; don't trust the reported total.
    return data.get("convocatorias", [])
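The paginate-until-empty strategy can be sketched as a generic driver loop; `fetch_all_convocatorias` and the short-page early exit are illustrative additions, with the fetcher passed in as any coroutine (such as `fetch_bdns_page` bound to a session):

```python
import asyncio

PAGE_SIZE = 100

async def fetch_all_convocatorias(fetch_page, date):
    """Paginate until an empty page instead of trusting totalRegistros.

    `fetch_page(date, offset)` is any coroutine returning a list of entries.
    """
    results = []
    offset = 0
    while True:
        page = await fetch_page(date, offset)
        if not page:
            break
        results.extend(page)
        offset += PAGE_SIZE
        if len(page) < PAGE_SIZE:   # short page: it was the last one
            break
    return results
```

Because the loop only stops on an empty or short page, entries published mid-fetch can shift the total without breaking the traversal; at worst an entry is seen twice, which the idempotency check below absorbs.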

Each BDNS grant comes with structured fields: issuing body, budget, deadline, eligible beneficiaries. This is gold compared to extracting the same information from a PDF. I convert it directly into an internal format with normalized fields. For more detail on its structure, see my comprehensive guide to the BDNS.

Regional bulletins: the wild west

This is where it gets interesting. Each autonomous community publishes its own official bulletin in its own format, and each one has its own quirks to work around.

For each bulletin, I implement a specific HTTP client that inherits from a base class. The interface is simple: given a date, return a list of entries with title, text, section, and metadata. The complexity is encapsulated inside each client.
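That interface can be sketched roughly like this (class and field names are illustrative, not the actual codebase):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from datetime import date

@dataclass
class BulletinEntry:
    title: str
    text: str
    section: str
    metadata: dict = field(default_factory=dict)

class BulletinClient(ABC):
    """Base class each regional client inherits from."""

    source: str = ""

    @abstractmethod
    async def fetch_entries(self, day: date) -> list[BulletinEntry]:
        """Given a date, return that day's entries; format quirks stay inside."""
```

The payoff is that the orchestrator iterates over a list of clients and never needs to know whether a given bulletin is XML, HTML, or PDF-only.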

Idempotency: the key to a robust system

The reader runs every morning at 7:00 via Cloud Scheduler. But sometimes it fails: the BOE publishes late, a regional government's website is down, there's a transient network issue. That's why idempotency is essential.

Each entry is identified by a hash composed of the source, date, and the bulletin's internal identifier. Before inserting into Firestore, I check if it already exists. If it's already been processed, it's skipped without error. This lets me re-run the reader as many times as needed with zero risk of duplicates.

import hashlib

# Deterministic ID: the same source entry always maps to the same document.
entry_id = hashlib.sha256(
    f"{source}:{date}:{internal_id}".encode()
).hexdigest()[:20]

if await firestore_client.document_exists("bulletin_entries", entry_id):
    logger.info(f"Skipping duplicate: {entry_id}")
    return None

Error handling and retries

Government websites aren't exactly famous for their reliability. I implement a retry system with exponential backoff for each source. If a source fails three times in a row, it's marked as failed for that date and an internal alert is raised.
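A minimal sketch of that retry logic, assuming an async fetcher (`fetch_with_retries` and its parameters are illustrative names):

```python
import asyncio
import logging

logger = logging.getLogger("reader")

async def fetch_with_retries(fetch, max_attempts=3, base_delay=1.0):
    """Retry a coroutine factory with exponential backoff (1s, 2s, 4s, ...).

    After max_attempts consecutive failures the exception propagates, so the
    caller can mark the source as failed for that date and raise an alert.
    """
    for attempt in range(max_attempts):
        try:
            return await fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * 2 ** attempt
            logger.warning("attempt %d failed, retrying in %.1fs", attempt + 1, delay)
            await asyncio.sleep(delay)
```

Letting the exception escape on the final attempt keeps failure handling (marking the source, alerting) in one place instead of inside every client.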

I also handle the case of late-publishing bulletins. Some autonomous communities don't publish until 10:00 or 11:00 AM. The scheduler has a primary run at 7:00 and a recovery run at 12:00 that only processes sources that failed in the first round.

Converting to markdown

Once the content is extracted, everything gets converted to structured markdown. Markdown serves as the intermediate format between extraction and AI processing. It's readable, compact, and tokenizes well for LLMs.

The structure is always the same regardless of the source: a header with metadata (source, date, section, department) and the body text. This lets the interpreter work with a uniform format whether the original entry was BOE XML or BOJA HTML.
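That uniform structure can be sketched as a small formatter (field names here are illustrative, following the metadata listed above):

```python
def entry_to_markdown(source, date_str, section, department, title, body):
    """Render an entry as markdown: metadata header, then the body text.

    The layout is identical regardless of the original format.
    """
    header = "\n".join([
        f"# {title}",
        "",
        f"- Fuente: {source}",
        f"- Fecha: {date_str}",
        f"- Seccion: {section}",
        f"- Departamento: {department}",
    ])
    return f"{header}\n\n{body.strip()}\n"
```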

Metrics and monitoring

Each reader run generates metrics that I log in Firestore: entry count per source, error rate, execution time. A normal BOE day has between 50 and 100 entries. The BDNS can have between 100 and 300. Regional bulletins vary wildly: BOCM might have 70 entries on a Monday and 15 on a Friday.

If the entry count drops below a threshold for a source on a business day, an alert fires. This has caught several cases where a source changed its HTML format and the parser silently broke.
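The threshold check itself is simple; a sketch with illustrative names, assuming per-source minimums are configured somewhere:

```python
def sources_below_threshold(counts_by_source, thresholds, is_business_day):
    """Return sources whose entry count fell below their minimum.

    Only business days count: a quiet Sunday is normal, a quiet Monday is not.
    """
    if not is_business_day:
        return []
    return [
        source
        for source, count in counts_by_source.items()
        if count < thresholds.get(source, 0)
    ]
```

Anything this returns becomes an alert, which is cheap insurance against a parser that starts silently returning zero entries after a site redesign.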

This entire extraction system is what feeds the data you can query in Boletin Claro's bulletin search engines. The challenge isn't so much technical as operational: keeping 20+ parsers running against websites that change without notice.