Know What Data You Have. Know What It Means.
When Data Volume Becomes a Discoverability Problem
At a certain scale, the hardest data problem is not storage or compute — it is discovery. A data engineer joins the team and needs to know which BigQuery table contains the authoritative customer data. There are 47 tables with "customer" in the name across 12 datasets. Some are operational copies. Some are archived. Some are staging tables that should not be used for reporting. There is no documentation. The only way to know which table is correct is to ask the data engineer who's been there longest.
A data catalog solves this. It makes every data asset visible, described, classified, and owned — so that any data consumer can find what they need, understand what it means, and know who to contact when they have questions.
What We Implement
Asset Registration and Metadata
Every data asset in scope (BigQuery tables, datasets, Cloud Storage objects, Pub/Sub topics) is registered in Cloud Data Catalog or Dataplex with structured metadata: description, owner, last updated, sensitivity classification, and usage context. Registration can be manual for high-priority assets or automated via catalog export for large environments.
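As a rough illustration of the structured metadata listed above, the sketch below models one catalog record and reports which fields are still empty. The `AssetMetadata` class, its field names, and the sample values are hypothetical; they mirror the list above and are not the Cloud Data Catalog or Dataplex schema.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import Optional


# Hypothetical record layout; field names follow the metadata listed above
# and are NOT the Data Catalog or Dataplex schema.
@dataclass
class AssetMetadata:
    resource: str                  # e.g. "dataset.table"
    description: str
    owner: str
    last_updated: Optional[date]
    sensitivity: str
    usage_context: str


def missing_fields(asset: AssetMetadata) -> list:
    """Return the names of metadata fields that are empty or unset."""
    return [f.name for f in fields(asset) if not getattr(asset, f.name)]


# Sample record with two fields deliberately left blank.
customers = AssetMetadata(
    resource="crm.customers",
    description="Authoritative customer table for reporting",
    owner="",
    last_updated=date(2024, 1, 15),
    sensitivity="confidential",
    usage_context="",
)

print(missing_fields(customers))
```

A check like this could run before an asset is accepted into the catalog, so that records without an owner or usage context are sent back for completion.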
Business Glossary
A business glossary defines the canonical meaning of data terms: what "Customer" means in the context of the data platform (as opposed to the CRM system, where the same term might mean something slightly different). Glossary terms are linked to the data assets that contain the relevant data, so that searching for "Monthly Recurring Revenue" returns the exact table and column that contain it.
Data Classification and Policy Tags
Sensitivity classification is applied to data assets: public, internal, confidential, restricted. Policy tags are defined in a taxonomy and attached to BigQuery columns containing PII, financial data, or other regulated information, so that access is enforced at the column level. Automated PII discovery uses Cloud DLP scanning to identify sensitive data that was not manually classified.
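The gap check described above (sensitive data found by a scan but never manually classified) can be sketched in a few lines. This is a minimal illustration, not a Cloud DLP integration: `scan_hits` stands in for the findings a scan might report, and the column names and the `unclassified_findings` helper are hypothetical.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """The four levels named above, ordered from least to most restrictive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def unclassified_findings(manual: dict, scan_hits: dict) -> dict:
    """Columns a scan flagged as sensitive that have no manual classification."""
    return {col: info_type for col, info_type in scan_hits.items()
            if col not in manual}


# Manually classified columns (hypothetical).
manual = {"customers.email": Sensitivity.CONFIDENTIAL}

# Stand-in for scan output: column -> detected type of sensitive data.
scan_hits = {
    "customers.email": "EMAIL_ADDRESS",
    "customers.notes": "PHONE_NUMBER",
}

print(unclassified_findings(manual, scan_hits))
```

Using an ordered enum means levels can be compared directly, for example `Sensitivity.RESTRICTED > Sensitivity.INTERNAL`, which is convenient when deciding which policy applies.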
Data Lineage
Lineage tracking shows where data came from and where it goes: which source system populated this table, which transformation pipeline processed it, which downstream tables and dashboards depend on it. We use Dataplex automatic lineage for BigQuery and Dataflow workflows, supplemented with manual lineage documentation for custom pipelines.
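To make the "which downstream tables and dashboards depend on it" question concrete, the sketch below treats lineage as a directed graph and walks it breadth-first. The edge map and asset names are invented for illustration; in practice the edges would come from the lineage tooling rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage edges: asset -> assets that read from it.
LINEAGE = {
    "crm.export": ["staging.customers_raw"],
    "staging.customers_raw": ["warehouse.customers"],
    "warehouse.customers": ["warehouse.mrr_by_customer", "dashboard.churn"],
    "warehouse.mrr_by_customer": ["dashboard.revenue"],
}


def downstream(asset: str, edges: dict) -> list:
    """Breadth-first walk returning every asset that depends on `asset`."""
    seen = {asset}          # guards against cycles and repeated visits
    order = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(downstream("warehouse.customers", LINEAGE))
```

The same traversal run over reversed edges would answer the upstream question: which source systems and pipelines populated a given table.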
Data Stewardship Model
A data catalog without human ownership becomes stale within months. We design the data stewardship model that keeps it current: designated data owners per domain, a curation workflow for new asset registration, a review cadence for existing metadata, and a process for handling data quality issues identified through the catalog.
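A review cadence is easier to keep when overdue assets are listed automatically. The sketch below assumes a 90-day cadence and a simple mapping of asset to last review date; both are illustrative choices, not a prescribed policy.

```python
from datetime import date, timedelta

REVIEW_CADENCE = timedelta(days=90)  # assumed quarterly cadence


def overdue_for_review(last_reviewed: dict, today: date) -> list:
    """Assets whose last metadata review is older than the cadence."""
    return sorted(asset for asset, reviewed in last_reviewed.items()
                  if today - reviewed > REVIEW_CADENCE)


# Hypothetical review log: asset -> date its metadata was last reviewed.
review_log = {
    "warehouse.customers": date(2024, 1, 10),
    "warehouse.orders": date(2024, 5, 1),
}

print(overdue_for_review(review_log, today=date(2024, 6, 1)))
```

The resulting list could be routed to the designated data owner for each domain as part of the curation workflow.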
- Cloud Data Catalog and Dataplex configuration and setup
- Data asset registration: BigQuery, Cloud Storage, Pub/Sub
- Business glossary design and term-to-asset linking
- Data sensitivity classification: policy tag taxonomy design
- BigQuery column-level policy tag enforcement
- Cloud DLP PII discovery and classification scanning
- Data lineage tracking: Dataplex automatic and manual lineage
- Data stewardship model design: ownership, curation workflow, review cadence
- Catalog search and discoverability configuration
- Compliance reporting: data inventory for PDPL and regulatory audits
How We Deliver This Service
Catalog Scope Definition
We identify which data assets are in scope for the initial catalog: typically the data warehouse tables used in production dashboards and the pipelines that feed them. A phased scope prevents the catalog from becoming an unmanageable project before it delivers value.
Taxonomy and Glossary Design
The sensitivity classification taxonomy, business glossary terms, and data domain definitions are agreed with data owners and compliance stakeholders before any assets are registered. The taxonomy must reflect how the organization actually thinks about its data.
Asset Registration and Classification
Priority assets registered with metadata and classification applied. Automated registration configured for new assets entering the catalog scope. PII discovery scans run to identify unclassified sensitive columns.
Lineage Configuration
Dataplex automatic lineage activated for BigQuery and Dataflow workflows. Manual lineage documented for custom pipelines and file-based integrations not captured automatically.
Stewardship Handover
Data owners assigned, curation workflow documented, and the team responsible for catalog maintenance trained. The catalog is a living system — we hand it over with the process and tooling needed to keep it current.