Mastering the Spice Rack – Transforming Categorical Features

Author

Numbers around us

Published

February 18, 2025

Now that we’ve covered the foundations of feature engineering, it’s time to dive deeper into categorical data transformation. Categorical variables are often overlooked or mishandled, yet they play a crucial role in many predictive models.

This article will explore:
  • Why categorical features matter in machine learning.

  • Beyond one-hot encoding: better techniques for high-cardinality data.

  • Feature engineering techniques for categorical variables in R (recipes, vtreat) and Python (pandas, category_encoders).

  • Best practices and common pitfalls when working with categorical data.

📌 Chapter 1: Understanding Categorical Features

1️⃣ What Are Categorical Features?

Categorical features represent distinct groups or labels rather than continuous numerical values. They can be:

Nominal – No inherent order (e.g., colors: “red,” “blue,” “green”).
Ordinal – Has a meaningful order (e.g., education level: “High School” < “Bachelor” < “Master”).

Unlike numerical features, categorical data needs special treatment to be useful in machine learning models.
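
To make the distinction concrete, here is a minimal pandas sketch with a toy data frame (the column names and values are illustrative, not from a real dataset). Declaring an explicit order is what separates an ordinal feature from a nominal one:

import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "green"],                   # nominal: no inherent order
    "education": ["High School", "Master", "Bachelor"]   # ordinal: has a meaningful order
})

# Nominal: an unordered categorical dtype
df["color"] = pd.Categorical(df["color"])

# Ordinal: declare the order explicitly so comparisons make sense
df["education"] = pd.Categorical(
    df["education"],
    categories=["High School", "Bachelor", "Master"],
    ordered=True
)

print(df["education"] >= "Bachelor")  # works only because the order is declared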

2️⃣ Why Are Categorical Features Important?

Many real-world datasets rely on categorical variables to provide key information. Consider these examples:

🔹 Predicting loan default → Customer’s employment type (salaried, self-employed, retired) impacts risk.
🔹 Customer churn analysis → Subscription plan type (Basic, Premium, Enterprise) affects churn probability.
🔹 Medical diagnosis models → Patient’s blood type (A, B, O, AB) could influence health risks.

If categorical variables are not handled properly, models may fail to capture valuable patterns or become computationally inefficient.

Common Encoding Techniques & Their Limitations

Now that we understand the importance of categorical features, we need to convert them into numerical representations so machine learning models can process them effectively. However, not all encoding methods are equal, and the wrong choice can introduce bias, unnecessary complexity, or information loss.

1️⃣ One-Hot Encoding (OHE)

🔹 How It Works
One-hot encoding converts each unique category into a separate binary column (0 or 1).

📌 Example: Converting “City” into One-Hot Encoding

| City | City_NewYork | City_London | City_Paris |
|---|---|---|---|
| New York | 1 | 0 | 0 |
| London | 0 | 1 | 0 |
| Paris | 0 | 0 | 1 |

✅ Advantages:
✔ Works well with low-cardinality categorical features.
✔ Prevents ordinal misinterpretation (e.g., red ≠ blue ≠ green).

❌ Limitations:

  • Inefficient for high-cardinality variables (e.g., thousands of ZIP codes).

  • Increases dataset size significantly (curse of dimensionality).

  • Introduces sparsity, making some algorithms inefficient.

📌 R: One-Hot Encoding with recipes

library(tidymodels)

recipe_categorical <- recipe(~ City, data = df) %>%
  step_dummy(all_nominal_predictors())

prepped_data <- prep(recipe_categorical) %>% bake(new_data = df)

📌 Python: One-Hot Encoding with pandas

import pandas as pd

df = pd.DataFrame({'City': ['New York', 'London', 'Paris']})
df_encoded = pd.get_dummies(df, columns=['City'])
print(df_encoded)
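
If sparsity or memory is a concern, scikit-learn's OneHotEncoder can return a sparse matrix instead of a dense data frame. A small sketch, reusing the df defined in the pandas example above:

from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder returns a sparse matrix by default, which saves memory
# when a feature has many categories.
encoder = OneHotEncoder(handle_unknown='ignore')
X_sparse = encoder.fit_transform(df[['City']])
print(X_sparse.shape, type(X_sparse))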

2️⃣ Label Encoding

🔹 How It Works
Label encoding assigns a numeric value to each category.

📌 Example: Converting “Subscription Plan” into Label Encoding

| Plan | Encoded Value |
|---|---|
| Basic | 0 |
| Premium | 1 |
| Enterprise | 2 |

✅ Advantages:
✔ Memory-efficient (does not expand columns like OHE).
✔ Works well for ordinal categories (e.g., “Low” < “Medium” < “High”).

❌ Limitations:

  • Not suitable for nominal categories (may introduce false relationships).

  • Models may mistakenly assume numeric order matters (e.g., “New York” < “London” < “Paris” makes no sense).

📌 R: Label Encoding with recipes

recipe_categorical <- recipe(~ Plan, data = df) %>%
  step_integer(all_nominal_predictors())

prepped_data <- prep(recipe_categorical) %>% bake(new_data = df)

📌 Python: Label Encoding with scikit-learn

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
df['Plan_encoded'] = encoder.fit_transform(df['Plan'])

Use Label Encoding only for ordinal categories.
Avoid it for nominal categories (use OHE instead).
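
One caveat: LabelEncoder assigns codes in alphabetical order, which may not match the intended ordinal order. When order matters, a safer sketch is to declare it explicitly in pandas (the plan names here are illustrative):

import pandas as pd

df = pd.DataFrame({'Plan': ['Basic', 'Premium', 'Enterprise', 'Basic']})

# Declare the intended order so the integer codes respect it
plan_order = ['Basic', 'Premium', 'Enterprise']
df['Plan_encoded'] = pd.Categorical(df['Plan'], categories=plan_order, ordered=True).codes
print(df)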

3️⃣ Target Encoding (Mean Encoding)

🔹 How It Works
Replaces categories with the mean target value for each category.

📌 Example: Converting “City” into Target Encoding for predicting House Prices

| City | Avg House Price ($) |
|---|---|
| New York | 900,000 |
| London | 750,000 |
| Paris | 850,000 |

✅ Advantages:
✔ Reduces dimensionality (no extra columns like OHE).
✔ Retains some predictive power of categories.

❌ Limitations:

  • Prone to data leakage if applied before splitting training/testing sets.

  • May overfit to small categories if not smoothed.

📌 R: Target Encoding with vtreat

library(vtreat)

treatment <- designTreatmentsN(df, varlist = "City", outcomename = "HousePrice")
df_transformed <- prepare(treatment, df)

📌 Python: Target Encoding with category_encoders

import category_encoders as ce

encoder = ce.TargetEncoder(cols=['City'])
df['City_encoded'] = encoder.fit_transform(df['City'], df['HousePrice'])

Use target encoding for high-cardinality variables in regression problems.
Be cautious with leakage—always apply it within cross-validation.
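
To reduce overfitting on rare categories, target encoding is usually smoothed toward the global mean. A hand-rolled sketch of that idea (the helper name and the smoothing weight m = 10 are arbitrary choices for illustration):

import pandas as pd

def smoothed_target_encode(df, col, target, m=10):
    # Blend each category's mean with the global mean; small categories
    # are pulled more strongly toward the global mean.
    global_mean = df[target].mean()
    stats = df.groupby(col)[target].agg(['mean', 'count'])
    smooth = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
    return df[col].map(smooth)

# Hypothetical usage on the housing example:
# df['City_encoded'] = smoothed_target_encode(df, 'City', 'HousePrice')

The TargetEncoder in category_encoders also exposes a smoothing parameter that serves the same purpose.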

4️⃣ Frequency Encoding

🔹 How It Works
Replaces categories with their frequency of occurrence.

📌 Example: Encoding “Car Brand” by Frequency

| Car Brand | Frequency |
|---|---|
| Toyota | 5000 |
| BMW | 3000 |
| Tesla | 2000 |

✅ Advantages:
✔ Preserves category importance without adding too many columns.
✔ Efficient for high-cardinality variables.

❌ Limitations:

  • May lose interpretability if categories don’t have a meaningful frequency pattern.

  • Not useful if category frequency is unrelated to the target.

📌 R: Frequency Encoding

# Count rows per brand (seq_along keeps the result numeric)
df$CarBrand_encoded <- ave(seq_along(df$CarBrand), df$CarBrand, FUN = length)

📌 Python: Frequency Encoding

df['CarBrand_encoded'] = df.groupby('CarBrand')['CarBrand'].transform('count')

Use frequency encoding when dealing with large categorical variables.
Avoid it if frequency has no correlation with the target variable.

Summary: Choosing the Right Encoding Method

| Encoding Method | Works Best For | When to Use |
|---|---|---|
| One-Hot Encoding | Low-cardinality categories | When categories have ≤ 10 unique values. |
| Label Encoding | Ordinal categories | When order matters (e.g., Low < Medium < High). |
| Target Encoding | High-cardinality categories | When there’s a strong relationship between category and target. |
| Frequency Encoding | Large categorical variables | When the category’s frequency is informative. |

By selecting the right encoding method, we ensure our models make the most of categorical data while avoiding unnecessary complexity.

Advanced Feature Engineering for Categorical Data

Now that we’ve covered basic encoding techniques, let’s explore more advanced strategies to extract additional insights from categorical features.

This chapter will focus on:
  • Combining categorical features to create interaction terms.

  • Encoding strategies for tree-based vs. linear models.

  • Using embeddings for categorical data.

1️⃣ Feature Interactions for Categorical Variables

Feature interactions help models capture relationships between categories that individual features alone may miss. Instead of treating categorical features independently, we can combine them into new engineered features.

📌 Example: Combining “State” and “Product Category”

Imagine we’re predicting sales revenue, and our dataset includes:

  • State: “California”, “Texas”, “New York”

  • Product Category: “Electronics”, “Clothing”, “Furniture”

A standard model would treat these separately, but what if a product sells differently depending on the state?

| State | Product Category | Sales ($) |
|---|---|---|
| California | Electronics | 120,000 |
| Texas | Electronics | 80,000 |
| New York | Electronics | 100,000 |
| California | Clothing | 75,000 |
| Texas | Clothing | 50,000 |

Instead of separate features, we can combine them into a single categorical variable:

| State_Product | Sales ($) |
|---|---|
| California_Electronics | 120,000 |
| Texas_Electronics | 80,000 |
| NewYork_Electronics | 100,000 |
| California_Clothing | 75,000 |

Now, the model can learn state-specific sales patterns instead of treating all locations and products the same.

📌 R: Creating Interaction Features with recipes

library(tidymodels)

df$State_Product <- interaction(df$State, df$ProductCategory, sep = "_")

recipe_categorical <- recipe(~ State_Product, data = df) %>%
  step_dummy(all_nominal_predictors())  # One-hot encode new feature

prepped_data <- prep(recipe_categorical) %>% bake(new_data = df)

📌 Python: Creating Interaction Features with pandas

df['State_Product'] = df['State'] + "_" + df['ProductCategory']
df = pd.get_dummies(df, columns=['State_Product'])  # One-hot encode

Use feature interactions when categorical variables may have meaningful dependencies.
Avoid it if the number of unique combinations is too high (e.g., thousands of possible pairs).

2️⃣ Encoding Strategies for Tree-Based vs. Linear Models

Different machine learning models handle categorical features differently:

🔹 Linear models (logistic regression, linear regression, SVMs)

  • Prefer one-hot encoding or ordinal encoding.

  • Work best when categorical features don’t have too many unique values.

🔹 Tree-based models (random forests, XGBoost, LightGBM, CatBoost)

  • Can handle raw categorical data directly (without one-hot encoding).

  • Perform well with target encoding and frequency encoding.

📌 Example: Predicting Customer Churn Using Linear vs. Tree-Based Models

We have a dataset with:

  • Subscription Type: “Free”, “Basic”, “Premium”, “Enterprise”

  • Churn: (1 = Yes, 0 = No)


| Subscription Type | Churn Rate (%) |
|---|---|
| Free | 80% |
| Basic | 50% |
| Premium | 20% |
| Enterprise | 5% |

For linear models, we use one-hot encoding:

| Free | Basic | Premium | Enterprise |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 0 |

For tree-based models, we can use target encoding:

| Subscription Type | Encoded Value |
|---|---|
| Free | 0.80 |
| Basic | 0.50 |
| Premium | 0.20 |
| Enterprise | 0.05 |

📌 R: Encoding for Linear Models vs. Tree-Based Models

library(tidymodels)

# One-hot encoding for linear models
recipe_linear <- recipe(~ SubscriptionType, data = df) %>%
  step_dummy(all_nominal_predictors())

# Target encoding for tree-based models
library(vtreat)
treatment <- designTreatmentsC(df, varlist = "SubscriptionType", outcomename = "Churn", outcometarget = 1)
df_transformed <- prepare(treatment, df)

📌 Python: Encoding for Linear Models vs. Tree-Based Models

import category_encoders as ce

# One-hot encoding (linear models)
df_linear = pd.get_dummies(df, columns=['SubscriptionType'])

# Target encoding (tree-based models)
encoder = ce.TargetEncoder(cols=['SubscriptionType'])
df['SubscriptionType_encoded'] = encoder.fit_transform(df['SubscriptionType'], df['Churn'])

Use target encoding for high-cardinality categorical variables in tree-based models.
Avoid one-hot encoding when there are too many categories—it increases feature space dramatically.

3️⃣ Using Embeddings for Categorical Data

🔹 How It Works
Categorical embeddings transform categories into dense numerical vectors that capture similarities between values.

Instead of one-hot encoding, we can train an embedding layer to represent relationships between categories automatically.

📌 Example: Encoding “Job Role” in a Hiring Model

  • Instead of "Engineer" → [1,0,0,0], "Manager" → [0,1,0,0]

  • A neural network can learn that "Engineer" is more similar to "Scientist" than to "Cashier", assigning embeddings like:

| Job Role | Embedding 1 | Embedding 2 | Embedding 3 |
|---|---|---|---|
| Engineer | 0.85 | 0.13 | 0.41 |
| Scientist | 0.88 | 0.12 | 0.39 |
| Cashier | 0.10 | 0.95 | 0.76 |

📌 Python: Creating Embeddings for Categorical Variables Using TensorFlow

import tensorflow as tf
from tensorflow.keras.layers import Embedding

# Define an embedding layer for a categorical feature with 10 unique values
embedding_layer = Embedding(input_dim=10, output_dim=3)

# Example: Convert category "Engineer" (ID = 3) into a dense vector
import numpy as np
category_input = np.array([[3]])  # Example category index
embedded_output = embedding_layer(category_input)
print(embedded_output)

Embeddings are useful when working with large categorical datasets (e.g., text, recommendation systems).
Not necessary for small categorical variables (OHE or target encoding is simpler).

📌 Summary: Choosing the Best Advanced Encoding Strategy

| Encoding Method | Works Best For | When to Use |
|---|---|---|
| Feature Interactions | Related categorical variables | When categories influence each other (e.g., State + Product Category). |
| Target Encoding | Tree-based models | When category values correlate with the target variable. |
| Embeddings | Deep learning models | When handling large categorical datasets efficiently. |

Using advanced encoding techniques ensures categorical variables provide real value to models while maintaining efficiency.

Best Practices & Common Pitfalls for Categorical Features

Now that we’ve explored categorical encoding techniques, let’s focus on best practices and mistakes to avoid when working with categorical data in machine learning.

This chapter will cover:
  • Preventing feature leakage when encoding categorical variables.

  • Avoiding overfitting with high-cardinality categorical features.

  • Optimizing categorical encoding based on the type of model.

1️⃣ Avoiding Feature Leakage with Categorical Encoding

Feature leakage occurs when a model has access to information during training that it wouldn’t have in real-world predictions. This is a huge issue in categorical encoding techniques like target encoding.

📌 Example: Target Encoding Gone Wrong

Imagine we are predicting customer churn, and we use target encoding on the “Customer Segment” column:

| Customer Segment | Avg Churn Rate (%) |
|---|---|
| Students | 75% |
| Freelancers | 40% |
| Corporate | 10% |

Problem:

  • If this encoding is applied before splitting into train/test sets, the model already knows the churn rate for each category.

  • The model will memorize the target value instead of learning real relationships.

Solution:

  • Perform target encoding only within cross-validation—never on the full dataset.

📌 R: Preventing Leakage in Target Encoding with vtreat

library(vtreat)

# Design the treatment plan on the training data only
treatment <- designTreatmentsC(df_train, varlist = "CustomerSegment", outcomename = "Churn", outcometarget = 1)

# Apply transformation to training and test sets separately
df_train_transformed <- prepare(treatment, df_train)
df_test_transformed <- prepare(treatment, df_test)

📌 Python: Preventing Leakage in Target Encoding with category_encoders

import category_encoders as ce

encoder = ce.TargetEncoder(cols=['CustomerSegment'])

# Fit on training data, transform separately on train and test
df_train['CustomerSegment_encoded'] = encoder.fit_transform(df_train['CustomerSegment'], df_train['Churn'])
df_test['CustomerSegment_encoded'] = encoder.transform(df_test['CustomerSegment'])

Always encode categorical features using only training data to prevent information from leaking into the test set.
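
Within the training data itself, a common hedge against leakage is out-of-fold target encoding: each row receives an encoding computed only from the other folds. A rough sketch (the helper name is illustrative; column names follow the example above):

import numpy as np
import pandas as pd
import category_encoders as ce
from sklearn.model_selection import KFold

def out_of_fold_target_encode(df, col, target, n_splits=5):
    encoded = pd.Series(np.nan, index=df.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, valid_idx in kf.split(df):
        enc = ce.TargetEncoder(cols=[col])
        # Fit only on the other folds, then transform the held-out fold
        enc.fit(df.iloc[train_idx][[col]], df.iloc[train_idx][target])
        encoded.iloc[valid_idx] = enc.transform(df.iloc[valid_idx][[col]])[col].values
    return encoded

# Hypothetical usage:
# df_train['CustomerSegment_encoded'] = out_of_fold_target_encode(df_train, 'CustomerSegment', 'Churn')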

2️⃣ Avoiding Overfitting with High-Cardinality Features

High-cardinality categorical features (e.g., Customer ID, Product ID, ZIP Code) introduce thousands of unique values. If not handled properly, they can lead to:

  • Curse of dimensionality (too many one-hot encoded columns).

  • Overfitting to specific training data.

📌 Example: The ZIP Code Problem

If we one-hot encode a ZIP code variable, it might create thousands of extra features, leading to:

  • Sparse matrices (lots of zeroes).

  • Overfitting to specific locations.

Solutions for High-Cardinality Variables:

  1. Group rare categories together (step_other() in R; in pandas, replace infrequent levels with an “Other” label).

  2. Use frequency encoding instead of one-hot encoding.

  3. Use embeddings for categorical variables in deep learning models.

📌 R: Reducing High-Cardinality with step_other()

recipe_high_cardinality <- recipe(~ ZIPCode, data = df) %>%
  step_other(ZIPCode, threshold = 0.05)  # Groups rare ZIP codes together

📌 Python: Reducing High-Cardinality with Frequency Encoding

df['ZIPCode_encoded'] = df.groupby('ZIPCode')['ZIPCode'].transform('count')
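
For the grouping approach, a simple pandas sketch of step_other()'s behaviour is to lump every ZIP code below a frequency threshold into a single "Other" bucket (the 5% threshold mirrors the R example above):

# Share of rows per ZIP code
freq = df['ZIPCode'].value_counts(normalize=True)

# Keep common ZIP codes, lump the rare ones into one "Other" level
rare = freq[freq < 0.05].index
df['ZIPCode_grouped'] = df['ZIPCode'].where(~df['ZIPCode'].isin(rare), 'Other')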

By grouping rare categories together or using frequency encoding, we avoid overfitting while keeping useful information.

3️⃣ Optimizing Encoding Based on Model Type

Some models work well with raw categorical data, while others require specific encoding methods.

📌 Best Encoding Methods for Different Models

| Model Type | Best Encoding Method |
|---|---|
| Linear Models (Logistic Regression, SVMs, Linear Regression) | One-Hot Encoding, Ordinal Encoding |
| Tree-Based Models (Random Forest, XGBoost, LightGBM, CatBoost) | Target Encoding, Frequency Encoding, Raw Categories (CatBoost) |
| Deep Learning (Neural Networks) | Embeddings, Frequency Encoding |

Always select the encoding method based on how the model handles categorical variables.

📌 Example: Encoding for XGBoost vs. Logistic Regression

If we use logistic regression, we should use one-hot encoding:

df_linear = pd.get_dummies(df, columns=['Category'])

If we use XGBoost, we should use target encoding:

encoder = ce.TargetEncoder(cols=['Category'])
df['Category_encoded'] = encoder.fit_transform(df['Category'], df['Target'])

Tree-based models like XGBoost handle categorical data differently than linear models—choose encoding methods accordingly.
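
Recent XGBoost versions (1.6+) can also consume pandas category columns directly, avoiding manual encoding altogether. A minimal sketch, assuming the xgboost package is installed and Category/Target are the columns used in the snippet above:

import xgboost as xgb

# Mark the column as a pandas categorical dtype
df['Category'] = df['Category'].astype('category')

# enable_categorical lets the hist tree method split on raw categories
model = xgb.XGBClassifier(tree_method='hist', enable_categorical=True)
model.fit(df[['Category']], df['Target'])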

Best Practices & Common Pitfalls

| ✅ Best Practices | ❌ Pitfalls to Avoid |
|---|---|
| Prevent leakage by encoding categories only on the training set. | Encoding categorical variables before the train-test split leaks target information and inflates performance. |
| Reduce high-cardinality categories using frequency encoding or grouping rare values. | One-hot encoding thousands of categories makes models inefficient. |
| Use encoding methods suited to the model type (OHE for linear, target encoding for trees). | Using the wrong encoding method can confuse models and reduce accuracy. |

By following these best practices, we ensure categorical variables are handled effectively without introducing unnecessary complexity.

Conclusion & Next Steps

Categorical features are often overlooked or mishandled, yet they play a critical role in predictive modeling. In this article, we’ve explored different encoding techniques, advanced feature engineering strategies, and best practices for handling categorical data effectively.

1️⃣ Key Takeaways

  • Not all categorical variables are the same: nominal (unordered) and ordinal (ordered) features require different encoding methods.

  • One-hot encoding works well for low-cardinality features but becomes inefficient with high-cardinality ones.

  • Target encoding and frequency encoding are powerful alternatives but must be used carefully to prevent feature leakage and overfitting.

  • Feature interactions (e.g., combining “State” and “Product Category”) capture relationships that individual features might miss.

  • Choosing the right encoding method depends on the model type:

  • Linear models (Logistic Regression, SVMs, etc.) → Prefer one-hot encoding or ordinal encoding.

  • Tree-based models (XGBoost, LightGBM, etc.) → Handle target encoding, frequency encoding, or raw categorical data well.

  • Deep learning models → Embeddings work best for high-cardinality features.

By selecting the right encoding method and avoiding common pitfalls, we can improve model performance while reducing computational complexity.

2️⃣ What’s Next?

This article is part of the Feature Engineering Series, where we break down how to create better predictive models by transforming raw data.

🚀 Next up: “Cooking Temperatures Matter – Scaling & Transforming Numerical Data”
In the next article, we’ll explore:
📌 When to scale numerical features (and when not to!).
📌 Log, power, and polynomial transformations—how they impact model performance.
📌 The difference between standardization and normalization (and which models need them).
📌 Practical R & Python examples to apply numerical feature transformations effectively.

🔹 Want to stay updated? Keep an eye out for the next post in the series!

💡 What’s your favorite way to handle categorical data? Drop your thoughts below! ⬇⬇⬇