The “Clean Pipe” Architecture: Solving the Public Sector’s Unstructured Data Crisis

Nov 18, 2025

The Public Sector Constraint: It’s Not Just “Big Data,” It’s “Messy Data”

In the commercial sector, modern data stacks often assume a baseline of structured, clean JSON or SQL inputs. In the public sector, specifically in our work with state agencies like the Illinois Department of Central Management Services (CMS), the reality is different. The primary constraint isn’t volume; it is variety and veracity.

When architecting the Intelligent Data Platform (IDP), we faced a “Legacy Lock” consisting of diverse, disconnected silos: SharePoint Online libraries, Dynamics 365, SAP, DB2, and Jira. The standard architectural patterns often fail here because they underestimate the complexity of ingesting unstructured data (documents, resumes, tickets) alongside structured financial records.

This paper details the specific architectural decisions ZCS made to overcome these constraints, prioritizing governance and auditability (IRS 1075, HIPAA) over simple speed.

Decision 1: The Ingestion Engine – Why We Chose “Hybrid” Over “Code-First”

The Constraint: High variance in source systems (SaaS vs. Mainframe) and a requirement for rapid, maintainable pipelines without a massive dedicated Java engineering team.

The Trade-off: Cloud Data Fusion vs. Cloud Dataflow

  • The “Standard” Engineering View: Cloud Dataflow is the default for streaming/batch capability.
  • Our Thought Leadership: For the IDP, we argued that Cloud Data Fusion was the superior choice for the ingestion layer, specifically because of its pre-built connector library for enterprise “legacy” stacks. Writing custom Python Beam pipelines to extract metadata from SharePoint Online or Dynamics 365 OData APIs is high-toil and brittle.

The Execution: We used Data Fusion to handle the “plumbing”: normalizing schemas from SAP/DB2 and leveraging the Microsoft Graph API connectors for SharePoint.

Where Dataflow Won: We reserved Cloud Dataflow strictly for the heavy lifting: complex transformations requiring stateful processing that graphical interfaces couldn’t handle.
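To make the split concrete, here is a minimal Apache Beam (Python) sketch of the kind of stateful transform we route to Dataflow rather than to a visual tool: per-key deduplication of ticket records using Beam’s user state API. The names (dedupe_tickets, ticket_id) and the sample records are illustrative, not taken from the IDP codebase.

```python
# Minimal sketch: a stateful Beam DoFn that drops duplicate records per key.
# Runs on the DirectRunner locally; the same code can target Dataflow.
import apache_beam as beam
from apache_beam.coders import BooleanCoder
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec


class DedupeByKey(beam.DoFn):
    """Emit only the first record seen for each key, tracked in per-key state."""
    SEEN = ReadModifyWriteStateSpec('seen', BooleanCoder())

    def process(self, element, seen=beam.DoFn.StateParam(SEEN)):
        key, record = element
        if not seen.read():
            seen.write(True)
            yield record  # later duplicates for this key are silently dropped


def dedupe_tickets(records):
    with beam.Pipeline() as pipeline:
        (pipeline
         | 'Create' >> beam.Create(records)
         | 'KeyByTicketId' >> beam.Map(lambda r: (r['ticket_id'], r))
         | 'Dedupe' >> beam.ParDo(DedupeByKey())
         | 'Print' >> beam.Map(print))


if __name__ == '__main__':
    dedupe_tickets([
        {'ticket_id': 'JIRA-101', 'status': 'open'},
        {'ticket_id': 'JIRA-101', 'status': 'open'},    # duplicate
        {'ticket_id': 'JIRA-202', 'status': 'closed'},
    ])
```

Stateful DoFns require keyed input and explicit state declarations, which is exactly the kind of logic that is awkward to express in a drag-and-drop pipeline.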

Result: A hybrid pipeline where low-code connectors handle the “variety” constraint, and high-code Dataflow handles the “volume” constraint.

Decision 2: The “Trust” Architecture – Governance as a First-Class Citizen

The Constraint: Public sector data is heavily regulated (PII, PHI, EINs). The typical “Load first, fix later” ELT pattern is dangerous when dealing with sensitive citizen data.

The Solution: Active Governance via Dataplex

Instead of treating governance as an afterthought, we implemented Google Cloud Dataplex as an active gatekeeper in the pipeline.

  • Implementation: We didn’t just “store” data; we implemented a Data Quality Firewall. Using Dataplex, we defined specific validation rules (e.g., “Null Check,” “Completeness,” “Validity”) that run during the pipeline execution.
  • Evidence of Success: Our dashboards actively track metrics like “27 Passed Rules” and “100% Completeness”. Records failing these checks (e.g., missing EINs or malformed timestamps) are automatically sequestered into a “Rejected Records” bucket for review (see the sketch after this list).
  • Security Tags: We automated the application of Policy Tags (e.g., PII: High, dataset_tags: Confidential) to columns containing Vendor Names and EINs. This ensures that even if a pipeline moves data successfully, IAM policies prevent unauthorized access down to the column level.
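To show what the firewall does conceptually, the sketch below is a plain-Python stand-in (not the Dataplex API itself) that applies rules equivalent to the Null Check, Completeness, and Validity checks and sequesters failing records for review. The field names (ein, vendor_name, event_ts) and the EIN format pattern are illustrative assumptions; in the IDP the rules themselves are defined and executed in Dataplex.

```python
# Conceptual stand-in for the "Data Quality Firewall" (not the Dataplex API):
# evaluate named rules per record and quarantine anything that fails.
import re
from datetime import datetime


def _parses_iso(value):
    try:
        datetime.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


RULES = {
    # Null Check / Completeness / Validity equivalents (illustrative fields).
    "ein_not_null": lambda r: r.get("ein") not in (None, ""),
    "ein_valid": lambda r: bool(re.fullmatch(r"\d{2}-\d{7}", r.get("ein") or "")),
    "vendor_present": lambda r: bool(r.get("vendor_name")),
    "timestamp_valid": lambda r: _parses_iso(r.get("event_ts")),
}


def quality_firewall(records):
    """Split records into (passed, rejected) before anything reaches the warehouse."""
    passed, rejected = [], []
    for record in records:
        failures = [name for name, rule in RULES.items() if not rule(record)]
        if failures:
            rejected.append({"record": record, "failed_rules": failures})
        else:
            passed.append(record)
    return passed, rejected


clean, quarantined = quality_firewall([
    {"ein": "36-1234567", "vendor_name": "Acme Corp", "event_ts": "2025-01-15T10:30:00"},
    {"ein": None, "vendor_name": "No-EIN LLC", "event_ts": "not-a-timestamp"},
])
print(f"{len(clean)} passed, {len(quarantined)} routed to the rejected bucket")
```

In this sketch, each quarantined record carries the names of the rules it failed, mirroring the review loop described above.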

Decision 3: Storage Layering – The “Raw-Processed-Curated” Pattern

The Constraint: Budget accountability and Audit Trails. We needed to prove exactly how a dashboard metric was derived from the original legacy source.

The Trade-off: BigQuery vs. Dataproc (Hadoop)

While Dataproc allows for a “lift and shift” of on-prem Hadoop jobs, it retains the operational overhead of cluster management. For a state agency requiring “Reliability as a Service,” this was a non-starter.

Our Architecture: We standardized on BigQuery but enforced a strict separation of logical layers that is often missing in basic deployments:

  1. Raw Data Layer: An immutable, exact copy of the source (SharePoint/DB2) data. No transformations. This is our audit baseline.
  2. Processed Data Layer: The “Clean” layer where Dataplex rules have been applied, and schemas are normalized.
  3. Curated Data Layer: The “Business” layer. This is the only layer exposed to visualization tools like Power BI and Looker.

By decoupling storage from compute and enforcing these layers, we solved the technical expertise constraint—allowing analysts to write simple SQL against the “Curated” layer without needing to understand the complex ETL logic upstream.
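A minimal sketch of how this layering can be enforced with the BigQuery Python client follows; the project, dataset, and table names (idp-demo-project, idp_raw, idp_processed, idp_curated, vendor_payments) are illustrative placeholders rather than the IDP’s actual objects.

```python
# Minimal sketch of the Raw -> Processed -> Curated pattern in BigQuery.
from google.cloud import bigquery

client = bigquery.Client(project="idp-demo-project")  # hypothetical project

# 1. Separate datasets enforce the layer boundaries.
for layer in ("idp_raw", "idp_processed", "idp_curated"):
    client.create_dataset(bigquery.Dataset(f"{client.project}.{layer}"),
                          exists_ok=True)

# 2. Analysts only ever see a curated view; the upstream ETL stays hidden.
view = bigquery.Table(f"{client.project}.idp_curated.vendor_spend_by_agency")
view.view_query = """
    SELECT agency, vendor_name, SUM(amount) AS total_spend
    FROM `idp-demo-project.idp_processed.vendor_payments`
    WHERE ein IS NOT NULL          -- only records that passed quality rules
    GROUP BY agency, vendor_name
"""
client.create_table(view, exists_ok=True)
```

Because analysts only ever query the view in the curated dataset, the cleansing and join logic stays in the warehouse instead of being scattered across individual reports.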

Decision 4: The Visualization Constraint – Meeting Users Where They Live

The Constraint: “Tool Fatigue” and Adoption Inertia. State agencies often have deep investments in the Microsoft ecosystem (Office 365, Power BI) and significant sunk costs in staff training. Forcing a migration to a new BI tool can kill a project’s momentum, regardless of technical superiority.

The Trade-off: Power BI (Existing) vs. Looker (Google Native)

  • Looker offers a superior semantic layer, governing metrics centrally through LookML and preventing the “metric drift” where two analysts report different numbers for the same KPI. It is also fully cloud-native.
  • Power BI allows for rapid user adoption due to familiarity but risks creating “report silos” where logic is buried in individual .pbix files rather than the data warehouse.

Our Architecture: We refused to let the “perfect” be the enemy of the “good.” We architected the IDP to be visualization-agnostic.

  • The Compromise: We enabled the BigQuery Connector for Power BI, allowing agency staff to keep their preferred visualization tool while forcing the data processing back to Google Cloud.
  • The Governance Lock: By exposing only the Curated Layer (views) to Power BI and disabling “Import Mode” where possible (favoring DirectQuery), we ensured that Power BI became a thin presentation layer, while BigQuery remained the single source of truth. This navigated the adoption constraint without sacrificing data integrity.
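As a minimal sketch of that governance lock using the BigQuery Python client (the project, dataset, and service-account names are placeholders), the reporting identity used by Power BI is granted READER on the curated dataset only, so DirectQuery connections can never reach the Raw or Processed layers:

```python
# Minimal sketch: grant the reporting service account READER on the curated
# dataset only; raw and processed datasets remain closed to BI tools.
from google.cloud import bigquery

client = bigquery.Client(project="idp-demo-project")   # hypothetical project
dataset = client.get_dataset("idp-demo-project.idp_curated")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="powerbi-reporting@idp-demo-project.iam.gserviceaccount.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```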

Results – From “Siloed” to “Showroom Ready”

The IDP is not just a data warehouse; it is a foundation for innovation. By solving the “dirty data” problem first with a governed, layered architecture, we enabled downstream innovations like the Generative AI Showroom.

Because we had already solved the “unstructured data” ingestion problem via the IDP, deploying Vertex AI models for Resume Matching and Automated PII Redaction became a rapid extension of the platform rather than a new infrastructure project. The IDP provided the “clean fuel” required for the Generative AI engine.

True thought leadership in the public sector isn’t about using the newest tool; it’s about architecting a system that survives the reality of legacy systems, strict compliance, and budget scrutiny.
