Beyond the Surface: Why Data Quality Requires More Than Just Completeness

data-quality
data-engineering
Author: Numbers around us

Published: February 9, 2026


The Restaurant Health Inspection Paradox

Imagine walking into a restaurant with an immaculate dining room. The tables are spotless, the silverware gleams, and every napkin is perfectly folded. As you’re admiring the pristine environment, you happen to glance into the kitchen through a swinging door. To your horror, you see expired ingredients, cross-contamination, and questionable food handling practices. The restaurant technically passes a surface-level inspection — everything looks clean — but the food is fundamentally unsafe to eat.

This scenario perfectly illustrates one of the most insidious problems in data engineering: the difference between structural data quality and semantic data quality. Your data pipeline might be checking all the boxes — no null values, correct data types, proper column names — yet the data itself could be completely wrong for your business needs. Just as a clean dining room doesn’t guarantee safe food, complete data doesn’t guarantee correct data.

In the world of data science and analytics, we’ve become quite good at implementing structural checks. We validate that phone numbers have the right format, that dates fall within reasonable ranges, and that foreign keys exist in their respective tables. But these checks are like inspecting the restaurant’s dining room while ignoring the kitchen. They tell us the data is structurally sound, but not whether it’s semantically meaningful.

This article explores why simple data checks often fail in real-world pipelines and introduces the critical distinction between checking that your data is complete versus checking that it’s correct. Understanding this difference isn’t just academic — it’s the key to building data systems that actually deliver reliable insights rather than confident nonsense.

The Problem: When Complete Data Still Fails

Most data quality frameworks focus on what we might call “structural integrity” — the basic building blocks of data validation. These checks include:

  • Null checks: Are all required fields populated?
  • Type validation: Is the age field actually a number?
  • Format validation: Does the email address contain an @ symbol?
  • Range checks: Is the date of birth in the past?
  • Referential integrity: Does the customer_id exist in the customers table?

These checks are valuable and necessary. They catch many obvious errors and prevent malformed data from entering your systems. However, they share a fundamental limitation: they validate the structure of your data without understanding its meaning.

Consider a simple example. Your pipeline validates customer order data with these structural checks:

library(dplyr)

validate_order <- function(order_df) {
  # Check for null values
  if (any(is.na(order_df$customer_id)) || any(is.na(order_df$order_date)) || 
      any(is.na(order_df$product_id)) || any(is.na(order_df$quantity))) {
    stop("Null values detected")
  }
  
  # Check data types
  if (!is.numeric(order_df$customer_id) || !is.numeric(order_df$product_id) || 
      !is.numeric(order_df$quantity)) {
    stop("Invalid data types")
  }
  
  # Check positive quantities
  if (any(order_df$quantity <= 0)) {
    stop("Invalid quantity values")
  }
  
  # Check date format
  if (any(is.na(as.Date(order_df$order_date, format = "%Y-%m-%d")))) {
    stop("Invalid date format")
  }
  
  return(TRUE)
}

This validation looks comprehensive. It checks for nulls, validates types, ensures quantities are positive, and verifies date formatting. Your pipeline runs, all checks pass, and you confidently load the data into your analytics warehouse.

Then, three months later, your CFO calls. The quarterly revenue report shows that revenue from returning customers has increased by 500%. Celebration ensues. Marketing takes credit. Bonuses are discussed.

Except it’s not true. The problem? Someone accidentally switched the customer_id and product_id columns during a database migration. All your structural checks passed because the data was complete and correctly formatted — but it was fundamentally wrong. You’ve been attributing orders to the wrong customers for three months.

This is the restaurant kitchen problem in action. Everything looked clean on the surface, but the actual content was contaminated. Your structural checks couldn’t catch the error because they don’t understand what the data means.

The Idea: Structural vs. Semantic Data Quality

To understand why simple checks fail, we need to recognize that data quality exists on two distinct levels:

Structural Data Quality

Structural quality answers the question: “Is this data well-formed?” It validates:

  • Syntactic correctness: Does the data follow the expected format?
  • Completeness: Are all required fields present?
  • Type consistency: Are the data types correct?
  • Referential integrity: Do relationships between tables make sense?

These checks are relatively easy to implement because they’re based on rules that don’t require business context. A null check doesn’t need to understand your business model. A type validation doesn’t need to know how you make decisions. These are the equivalents of checking whether ingredients are expired or whether the kitchen has running water.

Semantic Data Quality

Semantic quality answers a more complex question: “Does this data actually mean what it should mean in our business context?” It validates:

  • Business logic consistency: Do the values make sense together?
  • Temporal coherence: Is the sequence of events logical?
  • Statistical normality: Are patterns consistent with expectations?
  • Contextual accuracy: Does the data reflect reality?

These checks are harder to implement because they require deep understanding of your business domain, your data generation processes, and what “normal” looks like in your specific context. You need to know not just that food temperatures are being recorded, but what temperature ranges are safe for specific dishes.

Let’s illustrate the difference with another example. Imagine you’re tracking employee attendance data:

Structural validation would check:

# Structural checks only
check_structural <- function(attendance_df) {
  # Employee ID exists and is numeric
  stopifnot(!any(is.na(attendance_df$employee_id)))
  stopifnot(is.numeric(attendance_df$employee_id))
  
  # Clock-in and clock-out times are valid timestamps
  stopifnot(!any(is.na(attendance_df$clock_in)))
  stopifnot(!any(is.na(attendance_df$clock_out)))
  
  # Location field is not empty
  stopifnot(!any(attendance_df$location == ""))
  
  return(TRUE)
}

Semantic validation would additionally check:

# Semantic checks understand business context
check_semantic <- function(attendance_df) {
  # All structural checks first
  check_structural(attendance_df)
  
  # Clock-out must be after clock-in
  work_duration <- difftime(attendance_df$clock_out, 
                            attendance_df$clock_in, 
                            units = "hours")
  if (any(work_duration <= 0)) {
    stop("Clock-out time before clock-in time detected")
  }
  
  # Work duration should be reasonable (not 20 hours straight)
  if (any(work_duration > 16)) {
    warning("Unusually long shift detected: ", 
            round(as.numeric(max(work_duration)), 1), " hours")
  }
  
  # Employee shouldn't be on overlapping shifts (e.g. clocked in at two
  # locations at once): a new shift can't start before the previous one ends
  overlapping_shifts <- attendance_df %>%
    arrange(employee_id, clock_in) %>%
    group_by(employee_id) %>%
    mutate(time_gap = as.numeric(difftime(lead(clock_in), clock_out, 
                                          units = "hours"))) %>%
    filter(time_gap < 0)
  
  if (nrow(overlapping_shifts) > 0) {
    stop("Employee has overlapping shifts (possibly at multiple locations)")
  }
  
  # Location should match employee's assigned office (requires business context)
  # This would require joining with employee master data
  
  return(TRUE)
}

The semantic checks understand that time flows forward, that humans can’t be in two places at once, and that a 20-hour shift is suspicious. These checks require context and meaning, not just structure.

The Example: When Structure Hides Semantic Failure

Let me share a real-world example (anonymized) that illustrates how dangerous this distinction can be.

A retail company was tracking inventory levels across hundreds of stores. Their data pipeline had comprehensive structural validation:

  • Product IDs were valid (existed in the product master table)
  • Store IDs were valid (existed in the store master table)
  • Quantities were non-negative integers
  • Timestamps were valid dates
  • No null values in critical fields

Everything looked perfect. The pipeline had been running smoothly for months, with zero validation errors. Dashboards showed inventory levels, stockouts were tracked, and reordering happened automatically.

Then came Black Friday. Multiple stores ran out of a popular item, but the inventory system showed plenty of stock. Customers were furious. The marketing team had promised availability. The company lost millions in sales and customer trust.

The post-mortem revealed the problem: a bug in the point-of-sale system at one distribution center was recording returns with the same transaction code as receipts. Instead of reducing inventory when products were returned to vendors, the system was increasing it. The data was structurally perfect — all the fields were filled in correctly, the formats were right, the values were plausible. But semantically, the data was backwards.

Here’s a simplified version of what the data looked like:

# The actual data (simplified)
inventory_transactions <- data.frame(
  store_id = c(101, 101, 102, 102),
  product_id = c("ABC123", "ABC123", "ABC123", "ABC123"),
  transaction_type = c("RECEIPT", "RETURN", "RECEIPT", "RECEIPT"),
  quantity = c(100, 50, 75, 50),
  timestamp = as.POSIXct(c("2025-11-01 08:00", "2025-11-15 14:00", 
                           "2025-11-10 09:00", "2025-11-20 11:00"))
)

# Structural validation - PASSES
print("All fields present: ") 
print(sum(complete.cases(inventory_transactions)) == nrow(inventory_transactions))

print("store_id is numeric and product_id is character: ")
print(is.numeric(inventory_transactions$store_id) && 
        is.character(inventory_transactions$product_id))

print("All quantities are positive: ")
print(all(inventory_transactions$quantity > 0))

# The broken logic (returns should subtract, not add)
calculate_inventory_broken <- function(transactions) {
  transactions %>%
    arrange(timestamp) %>%
    summarise(final_inventory = sum(quantity))  # BUG: treats all as additions
}

cat("\nBroken calculation (what actually happened):\n")
print(calculate_inventory_broken(inventory_transactions))
# Shows: 275 units (100 + 50 + 75 + 50)

# What semantic validation would catch
calculate_inventory_correct <- function(transactions) {
  transactions %>%
    arrange(timestamp) %>%
    mutate(quantity_adjusted = ifelse(transaction_type == "RETURN", 
                                     -quantity, 
                                     quantity)) %>%
    summarise(final_inventory = sum(quantity_adjusted))
}

cat("\nCorrect calculation (what should happen):\n")
print(calculate_inventory_correct(inventory_transactions))
# Shows: 175 units (100 - 50 + 75 + 50)

The difference between 275 units and 175 units meant the difference between “we’re well-stocked” and “we’re about to run out.” Structural validation couldn’t catch this because the data was complete. Only semantic validation — understanding that returns should subtract from inventory — could have caught the error.

The Solution: Building Context into Your Validations

So how do we move beyond structural checks to implement semantic validation? Here are practical approaches:

1. Business Rules Validation

Encode your business logic directly into your validation layer:

validate_business_rules <- function(data) {
  # Rule: Revenue should equal price × quantity
  calculated_revenue <- data$price * data$quantity
  if (!isTRUE(all.equal(data$revenue, calculated_revenue, tolerance = 0.01))) {
    stop("Revenue doesn't match price × quantity")
  }
  
  # Rule: Discount percentage should never exceed 95%
  if (any(data$discount_pct > 0.95)) {
    stop("Unrealistic discount detected")
  }
  
  # Rule: Customer's total purchases shouldn't exceed their credit limit
  customer_totals <- data %>%
    group_by(customer_id) %>%
    summarise(total_purchases = sum(revenue))
  
  # Would need to join with customer credit data here
  
  return(TRUE)
}

2. Statistical Anomaly Detection

Monitor distributions and flag statistically unusual patterns:

validate_statistical_patterns <- function(current_data, historical_data) {
  # Check if current average is within 3 standard deviations of historical
  historical_mean <- mean(historical_data$daily_revenue)
  historical_sd <- sd(historical_data$daily_revenue)
  current_mean <- mean(current_data$daily_revenue)
  
  if (abs(current_mean - historical_mean) > 3 * historical_sd) {
    warning("Current revenue significantly differs from historical patterns")
  }
  
  # Check for impossible ratios
  return_rate <- sum(current_data$transaction_type == "RETURN") / nrow(current_data)
  if (return_rate > 0.5) {
    stop("Return rate exceeds 50% - likely data issue")
  }
  
  return(TRUE)
}

3. Cross-Dataset Consistency Checks

Validate that data makes sense across different sources:

validate_cross_dataset <- function(orders, inventory, customers) {
  # Orders should only contain products that exist in inventory
  invalid_products <- anti_join(orders, inventory, by = "product_id")
  if (nrow(invalid_products) > 0) {
    stop("Orders contain products not in inventory system")
  }
  
  # Customer lifetime value shouldn't exceed sum of their orders
  customer_metrics <- customers %>%
    left_join(orders %>% 
               group_by(customer_id) %>% 
               summarise(total_orders = sum(order_value)),
             by = "customer_id")
  
  # na.rm guards against customers with no orders (NA after the left join)
  if (any(customer_metrics$lifetime_value > customer_metrics$total_orders * 1.1,
          na.rm = TRUE)) {
    warning("Customer lifetime value exceeds known orders")
  }
  
  return(TRUE)
}

4. Temporal Logic Validation

Ensure sequences of events make logical sense:

validate_temporal_logic <- function(events) {
  # Order of events should be logical
  customer_journeys <- events %>%
    arrange(customer_id, timestamp) %>%
    group_by(customer_id) %>%
    mutate(account_exists = cumany(event_type == "account_created"))
  
  # Can't have "purchase" before "account_created" anywhere in the journey
  invalid_sequences <- customer_journeys %>%
    filter(event_type == "purchase" & !account_exists)
  
  if (nrow(invalid_sequences) > 0) {
    stop("Purchase events without prior account creation detected")
  }
  
  return(TRUE)
}

Why Data Quality Must Include Context

The fundamental insight is this: data quality cannot be assessed in a vacuum. Just as you can’t evaluate food safety without understanding food science, you can’t evaluate data quality without understanding your business domain.

Structural validation is necessary but not sufficient. It’s the foundation, not the building. You need structural checks to ensure your data pipeline doesn’t break, but you need semantic checks to ensure your data actually means what you think it means.

This has implications for how we organize our data teams and design our systems:

1. Data engineers need business context: You can’t write semantic validations if you don’t understand how the business uses the data.

2. Data quality is a team sport: Domain experts, analysts, and engineers must collaborate to define what “correct” means.

3. Validation should be layered: Start with structural checks, then add semantic validations, then monitor statistical patterns.

4. Expect validation to evolve: As your business changes, your semantic validations must change too.

5. Document your assumptions: Every semantic validation embodies assumptions about your business. Write them down.
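The layering in point 3 can be sketched as a small driver that runs structural checks first, then semantic, then statistical, stopping at the first failing layer. This is a minimal illustration in base R; the layer contents, check names, and the 1,000-unit threshold are hypothetical placeholders, not part of any framework:

```r
# Run named layers of checks in order; each check returns TRUE on success
# or a character message describing the failure. A failure in an earlier
# layer stops execution before later (more expensive) layers run.
run_layered_validation <- function(data, layers) {
  for (layer_name in names(layers)) {
    for (check_name in names(layers[[layer_name]])) {
      result <- layers[[layer_name]][[check_name]](data)
      if (!isTRUE(result)) {
        stop(sprintf("[%s] %s failed: %s", layer_name, check_name, result))
      }
    }
    message(sprintf("Layer '%s' passed", layer_name))
  }
  invisible(TRUE)
}

# Hypothetical checks, ordered from cheap/structural to contextual
layers <- list(
  structural = list(
    no_nulls = function(d) if (anyNA(d$quantity)) "nulls in quantity" else TRUE
  ),
  semantic = list(
    positive_qty = function(d) if (any(d$quantity <= 0)) "non-positive quantity" else TRUE
  ),
  statistical = list(
    sane_mean = function(d) if (mean(d$quantity) > 1000) "implausible mean quantity" else TRUE
  )
)

orders <- data.frame(quantity = c(10, 25, 3))
run_layered_validation(orders, layers)
```

Keeping each layer as a named list of functions also addresses point 5: the check name and its failure message document the assumption being enforced.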

Think back to the restaurant analogy. A health inspector doesn’t just check if the kitchen is clean — they understand food safety principles, they know what temperatures prevent bacterial growth, they recognize cross-contamination risks. They bring domain expertise to their inspection.

Your data validation should do the same. It should embody not just rules about data structure, but knowledge about your business operations, your customer behavior, your product logic. It should understand not just that fields are filled in, but that the values make sense together in your specific context.

Conclusion: The Kitchen Matters More Than the Dining Room

We began with a restaurant metaphor: a pristine dining room hiding a hazardous kitchen. In data engineering, we’ve become excellent at maintaining the dining room — our data looks clean, our schemas are well-designed, our pipelines run smoothly. But how often are we actually inspecting the kitchen?

Structural data quality tells you your data is well-formed. Semantic data quality tells you your data is meaningful. Both are essential, but only the latter keeps you from confidently making decisions based on fundamentally wrong information.

The next time you implement data validation, ask yourself: am I just checking that fields are filled in, or am I validating that the data actually makes sense for my business? Am I inspecting the dining room, or am I checking the kitchen?

Because in the end, it doesn’t matter how clean your data looks if what you’re serving to stakeholders is fundamentally unsafe for decision-making. The real measure of data quality isn’t whether your pipeline runs without errors — it’s whether the insights derived from your data can be trusted.

So go beyond the surface. Check your kitchen. Your data quality framework should understand not just structure, but meaning. Only then can you truly trust the insights you serve.
