Mastering the Spice Rack – Transforming Categorical Features
Now that we’ve covered the foundations of feature engineering, it’s time to dive deeper into categorical data transformation. Categorical variables are often overlooked or mishandled, yet they play a crucial role in many predictive models.
This article will explore:
✅ Why categorical features matter in machine learning.
✅ Beyond one-hot encoding—better techniques for high-cardinality data.
✅ Feature engineering techniques for categorical variables in R (recipes, vtreat) and Python (pandas, category_encoders).
✅ Best practices and common pitfalls when working with categorical data.
📌 Chapter 1: Understanding Categorical Features
1️⃣ What Are Categorical Features?
Categorical features represent distinct groups or labels rather than continuous numerical values. They can be:
✔ Nominal – No inherent order (e.g., colors: “red,” “blue,” “green”).
✔ Ordinal – Has a meaningful order (e.g., education level: “High School” < “Bachelor” < “Master”).
Unlike numerical features, categorical data needs special treatment to be useful in machine learning models.
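📌 Python: Declaring Nominal vs. Ordinal Columns
To make the distinction concrete, here is a minimal pandas sketch (hypothetical data): the ordinal column gets an explicit order so later encoding steps can respect it.

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['red', 'blue', 'green'],                    # nominal: no inherent order
    'Education': ['High School', 'Master', 'Bachelor'],   # ordinal: has a meaningful order
})

# Declare the ordinal column with its intended order; the nominal column stays unordered
df['Education'] = pd.Categorical(
    df['Education'],
    categories=['High School', 'Bachelor', 'Master'],
    ordered=True,
)
print(df['Education'].cat.codes)  # codes follow the declared order: 0, 2, 1
```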
2️⃣ Why Are Categorical Features Important?
Many real-world datasets rely on categorical variables to provide key information. Consider these examples:
🔹 Predicting loan default → Customer’s employment type (salaried, self-employed, retired) impacts risk.
🔹 Customer churn analysis → Subscription plan type (Basic, Premium, Enterprise) affects churn probability.
🔹 Medical diagnosis models → Patient’s blood type (A, B, O, AB) could influence health risks.
If categorical variables are not handled properly, models may fail to capture valuable patterns or become computationally inefficient.
Common Encoding Techniques & Their Limitations
Now that we understand the importance of categorical features, we need to convert them into numerical representations so machine learning models can process them effectively. However, not all encoding methods are equal, and the wrong choice can introduce bias, unnecessary complexity, or information loss.
1️⃣ One-Hot Encoding (OHE)
🔹 How It Works
One-hot encoding converts each unique category into a separate binary column (0 or 1).
📌 Example: Converting “City” into One-Hot Encoding
City | City_NewYork | City_London | City_Paris |
---|---|---|---|
New York | 1 | 0 | 0 |
London | 0 | 1 | 0 |
Paris | 0 | 0 | 1 |
✅ Advantages:
✔ Works well with low-cardinality categorical features.
✔ Prevents ordinal misinterpretation (e.g., red ≠ blue ≠ green).
❌ Limitations:
Inefficient for high-cardinality variables (e.g., thousands of ZIP codes).
Increases dataset size significantly (curse of dimensionality).
Introduces sparsity, making some algorithms inefficient.
📌 R: One-Hot Encoding with recipes
```r
library(tidymodels)

recipe_categorical <- recipe(~ City, data = df) %>%
  step_dummy(all_nominal_predictors())
prepped_data <- prep(recipe_categorical) %>% bake(new_data = df)
```
📌 Python: One-Hot Encoding with pandas
```python
import pandas as pd

df = pd.DataFrame({'City': ['New York', 'London', 'Paris']})
df_encoded = pd.get_dummies(df, columns=['City'])
print(df_encoded)
```
2️⃣ Label Encoding
🔹 How It Works
Label encoding assigns a numeric value to each category.
📌 Example: Converting “Subscription Plan” into Label Encoding
Plan | Encoded Value |
---|---|
Basic | 0 |
Premium | 1 |
Enterprise | 2 |
✅ Advantages:
✔ Memory-efficient (does not expand columns like OHE).
✔ Works well for ordinal categories (e.g., “Low” < “Medium” < “High”).
❌ Limitations:
Not suitable for nominal categories (may introduce false relationships).
Models may mistakenly assume numeric order matters (e.g., “New York” < “London” < “Paris” makes no sense).
📌 R: Label Encoding with recipes
```r
recipe_categorical <- recipe(~ Plan, data = df) %>%
  step_integer(all_nominal_predictors())
prepped_data <- prep(recipe_categorical) %>% bake(new_data = df)
```
📌 Python: Label Encoding with scikit-learn
```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
df['Plan_encoded'] = encoder.fit_transform(df['Plan'])
```
✅ Use Label Encoding only for ordinal categories.
❌ Avoid it for nominal categories (use OHE instead).
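One caveat: scikit-learn's LabelEncoder assigns codes in alphabetical order, so it will not necessarily respect a ranking like Basic < Premium < Enterprise. A minimal sketch (hypothetical Plan column) that spells out the order explicitly:

```python
import pandas as pd

df = pd.DataFrame({'Plan': ['Premium', 'Basic', 'Enterprise', 'Basic']})

# Encode using the order that actually matters, not alphabetical order
plan_order = ['Basic', 'Premium', 'Enterprise']
df['Plan_encoded'] = pd.Categorical(df['Plan'], categories=plan_order, ordered=True).codes
print(df)  # Basic -> 0, Premium -> 1, Enterprise -> 2
```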
3️⃣ Target Encoding (Mean Encoding)
🔹 How It Works
Replaces categories with the mean target value for each category.
📌 Example: Converting “City” into Target Encoding for predicting House Prices
City | Avg House Price ($) |
---|---|
New York | 900,000 |
London | 750,000 |
Paris | 850,000 |
✅ Advantages:
✔ Reduces dimensionality (no extra columns like OHE).
✔ Retains some predictive power of categories.
❌ Limitations:
Prone to data leakage if applied before splitting training/testing sets.
May overfit to small categories if not smoothed.
📌 R: Target Encoding with vtreat
```r
library(vtreat)

treatment <- designTreatmentsN(df, varlist = "City", outcomename = "HousePrice")
df_transformed <- prepare(treatment, df)
```
📌 Python: Target Encoding with category_encoders
```python
import category_encoders as ce

encoder = ce.TargetEncoder(cols=['City'])
df['City_encoded'] = encoder.fit_transform(df['City'], df['HousePrice'])
```
✅ Use target encoding for high-cardinality variables in regression problems.
❌ Be cautious with leakage—always apply it within cross-validation.
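To soften the small-category problem mentioned above, category_encoders exposes a smoothing parameter that shrinks rare categories toward the overall mean. A minimal sketch with toy data (hypothetical values):

```python
import pandas as pd
import category_encoders as ce

# Toy data: "Paris" appears only once, so its encoding is pulled toward the global mean
df = pd.DataFrame({
    'City': ['New York', 'New York', 'London', 'London', 'Paris'],
    'HousePrice': [900_000, 950_000, 750_000, 760_000, 850_000],
})

encoder = ce.TargetEncoder(cols=['City'], smoothing=10)  # larger values shrink rare categories more
encoded = encoder.fit_transform(df[['City']], df['HousePrice'])
df['City_encoded'] = encoded['City']
print(df)
```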
4️⃣ Frequency Encoding
🔹 How It Works
Replaces categories with their frequency of occurrence.
📌 Example: Encoding “Car Brand” by Frequency
Car Brand | Frequency |
---|---|
Toyota | 5000 |
BMW | 3000 |
Tesla | 2000 |
✅ Advantages:
✔ Preserves category importance without adding too many columns.
✔ Efficient for high-cardinality variables.
❌ Limitations:
May lose interpretability if categories don’t have a meaningful frequency pattern.
Not useful if category frequency is unrelated to the target.
📌 R: Frequency Encoding
```r
# ave() returns a character vector here, so convert the counts back to numeric
df$CarBrand_encoded <- as.numeric(ave(df$CarBrand, df$CarBrand, FUN = length))
```
📌 Python: Frequency Encoding
```python
df['CarBrand_encoded'] = df.groupby('CarBrand')['CarBrand'].transform('count')
```
✅ Use frequency encoding when dealing with large categorical variables.
❌ Avoid it if frequency has no correlation with the target variable.
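If the frequencies are meant to generalize, it is safer to learn them on the training data and map them onto new data. A short sketch assuming hypothetical df_train and df_test frames:

```python
# Learn brand frequencies on the training split only
freq = df_train['CarBrand'].value_counts(normalize=True)

# Map those frequencies onto both splits; brands unseen in training fall back to 0
df_train['CarBrand_encoded'] = df_train['CarBrand'].map(freq)
df_test['CarBrand_encoded'] = df_test['CarBrand'].map(freq).fillna(0)
```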
Summary: Choosing the Right Encoding Method
Encoding Method | Works Best For | When to Use |
---|---|---|
One-Hot Encoding | Low-cardinality categories | When categories have ≤ 10 unique values. |
Label Encoding | Ordinal categories | When order matters (e.g., Low < Medium < High). |
Target Encoding | High-cardinality categories | When there’s a strong relationship between category and target. |
Frequency Encoding | Large categorical variables | When the category’s frequency is informative. |
✅ By selecting the right encoding method, we ensure our models make the most of categorical data while avoiding unnecessary complexity.
Advanced Feature Engineering for Categorical Data
Now that we’ve covered basic encoding techniques, let’s explore more advanced strategies to extract additional insights from categorical features.
This chapter will focus on:
✅ Combining categorical features to create interaction terms.
✅ Encoding strategies for tree-based vs. linear models.
✅ Using embeddings for categorical data.
1️⃣ Feature Interactions for Categorical Variables
Feature interactions help models capture relationships between categories that individual features alone may miss. Instead of treating categorical features independently, we can combine them into new engineered features.
📌 Example: Combining “State” and “Product Category”
Imagine we’re predicting sales revenue, and our dataset includes:
State: “California”, “Texas”, “New York”
Product Category: “Electronics”, “Clothing”, “Furniture”
A standard model would treat these separately, but what if a product sells differently depending on the state?
State | Product Category | Sales ($) |
---|---|---|
California | Electronics | 120,000 |
Texas | Electronics | 80,000 |
New York | Electronics | 100,000 |
California | Clothing | 75,000 |
Texas | Clothing | 50,000 |
Instead of separate features, we can combine them into a single categorical variable:
State_Product | Sales ($) |
---|---|
California_Electronics | 120,000 |
Texas_Electronics | 80,000 |
NewYork_Electronics | 100,000 |
California_Clothing | 75,000 |
Now, the model can learn state-specific sales patterns instead of treating all locations and products the same.
📌 R: Creating Interaction Features with recipes
```r
library(tidymodels)

df$State_Product <- interaction(df$State, df$ProductCategory, sep = "_")

recipe_categorical <- recipe(~ State_Product, data = df) %>%
  step_dummy(all_nominal_predictors())  # One-hot encode the new feature
prepped_data <- prep(recipe_categorical) %>% bake(new_data = df)
```
📌 Python: Creating Interaction Features with pandas
```python
df['State_Product'] = df['State'] + "_" + df['ProductCategory']
df = pd.get_dummies(df, columns=['State_Product'])  # One-hot encode the combined feature
```
✅ Use feature interactions when categorical variables may have meaningful dependencies.
❌ Avoid them if the number of unique combinations is too high (e.g., thousands of possible pairs).
2️⃣ Encoding Strategies for Tree-Based vs. Linear Models
Different machine learning models handle categorical features differently:
🔹 Linear models (logistic regression, linear regression, SVMs)
Prefer one-hot encoding or ordinal encoding.
Work best when categorical features don’t have too many unique values.
🔹 Tree-based models (random forests, XGBoost, LightGBM, CatBoost)
Some implementations (e.g., LightGBM, CatBoost) can handle raw categorical data directly, without one-hot encoding (see the sketch at the end of this section).
Perform well with target encoding and frequency encoding.
📌 Example: Predicting Customer Churn Using Linear vs. Tree-Based Models
We have a dataset with:
Subscription Type: “Free”, “Basic”, “Premium”, “Enterprise”
Churn: (1 = Yes, 0 = No)
Subscription Type | Churn Rate (%) |
---|---|
Free | 80% |
Basic | 50% |
Premium | 20% |
Enterprise | 5% |
For linear models, we use one-hot encoding:
Free | Basic | Premium | Enterprise |
---|---|---|---|
1 | 0 | 0 | 0 |
0 | 1 | 0 | 0 |
0 | 0 | 1 | 0 |
For tree-based models, we can use target encoding:
Subscription Type | Encoded Value |
---|---|
Free | 0.80 |
Basic | 0.50 |
Premium | 0.20 |
Enterprise | 0.05 |
📌 R: Encoding for Linear Models vs. Tree-Based Models
```r
library(tidymodels)

# One-hot encoding for linear models
recipe_linear <- recipe(~ SubscriptionType, data = df) %>%
  step_dummy(all_nominal_predictors())

# Target encoding for tree-based models
library(vtreat)
treatment <- designTreatmentsC(df, varlist = "SubscriptionType", outcomename = "Churn", outcometarget = 1)
df_transformed <- prepare(treatment, df)
```
📌 Python: Encoding for Linear Models vs. Tree-Based Models
```python
import category_encoders as ce

# One-hot encoding (linear models)
df_linear = pd.get_dummies(df, columns=['SubscriptionType'])

# Target encoding (tree-based models)
encoder = ce.TargetEncoder(cols=['SubscriptionType'])
df['SubscriptionType_encoded'] = encoder.fit_transform(df['SubscriptionType'], df['Churn'])
```
✅ Use target encoding for high-cardinality categorical variables in tree-based models.
❌ Avoid one-hot encoding when there are too many categories—it increases feature space dramatically.
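📌 Python: Letting a Tree-Based Model Handle Raw Categories
As a side note to the point about tree-based models above: libraries such as LightGBM and CatBoost can split on categorical columns natively. A minimal LightGBM sketch (hypothetical df with SubscriptionType and Churn columns), assuming the column is cast to pandas' category dtype:

```python
import lightgbm as lgb

# LightGBM treats pandas 'category' columns as categorical features natively
X = df[['SubscriptionType']].astype('category')
y = df['Churn']

model = lgb.LGBMClassifier()
model.fit(X, y)  # no one-hot or target encoding required
```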
3️⃣ Using Embeddings for Categorical Data
🔹 How It Works
Categorical embeddings transform categories into dense numerical vectors that capture similarities between values.
Instead of one-hot encoding, we can train an embedding layer to represent relationships between categories automatically.
📌 Example: Encoding “Job Role” in a Hiring Model
Instead of one-hot vectors like "Engineer" → [1,0,0,0] and "Manager" → [0,1,0,0], a neural network can learn that "Engineer" is more similar to "Scientist" than to "Cashier", assigning embeddings like:
Job Role | Embedding 1 | Embedding 2 | Embedding 3 |
---|---|---|---|
Engineer | 0.85 | 0.13 | 0.41 |
Scientist | 0.88 | 0.12 | 0.39 |
Cashier | 0.10 | 0.95 | 0.76 |
📌 Python: Creating Embeddings for Categorical Variables Using TensorFlow
```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Embedding

# Define an embedding layer for a categorical feature with 10 unique values
embedding_layer = Embedding(input_dim=10, output_dim=3)

# Example: convert category "Engineer" (ID = 3) into a dense vector
category_input = np.array([[3]])  # example category index
embedded_output = embedding_layer(category_input)
print(embedded_output)
```
✅ Embeddings are useful when working with large categorical datasets (e.g., text, recommendation systems).
❌ Not necessary for small categorical variables (OHE or target encoding is simpler).
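In practice, the embedding layer above is trained as part of a larger network rather than used in isolation. A minimal sketch (hypothetical integer-encoded job-role IDs and binary hiring labels):

```python
import numpy as np
from tensorflow.keras import layers, models

n_job_roles = 10  # assumed number of unique job roles

model = models.Sequential([
    layers.Embedding(input_dim=n_job_roles, output_dim=3),  # embeddings are learned during training
    layers.Flatten(),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

# Hypothetical data: one integer-encoded job role per row, plus a binary "hired" label
job_role_ids = np.array([[3], [1], [7], [3]])
hired = np.array([1, 0, 1, 1])
model.fit(job_role_ids, hired, epochs=5, verbose=0)
```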
📌 Summary: Choosing the Best Advanced Encoding Strategy
Encoding Method | Works Best For | When to Use |
---|---|---|
Feature Interactions | Related categorical variables | When categories influence each other (e.g., State + Product Category). |
Target Encoding | Tree-based models | When category values correlate with the target variable. |
Embeddings | Deep learning models | When handling large categorical datasets efficiently. |
✅ Using advanced encoding techniques ensures categorical variables provide real value to models while maintaining efficiency.
Best Practices & Common Pitfalls for Categorical Features
Now that we’ve explored categorical encoding techniques, let’s focus on best practices and mistakes to avoid when working with categorical data in machine learning.
This chapter will cover:
✅ Preventing feature leakage when encoding categorical variables.
✅ Avoiding overfitting with high-cardinality categorical features.
✅ Optimizing categorical encoding based on the type of model.
1️⃣ Avoiding Feature Leakage with Categorical Encoding
Feature leakage occurs when a model has access to information during training that it wouldn’t have in real-world predictions. This is a huge issue in categorical encoding techniques like target encoding.
📌 Example: Target Encoding Gone Wrong
Imagine we are predicting customer churn, and we use target encoding on the “Customer Segment” column:
Customer Segment | Avg Churn Rate (%) |
---|---|
Students | 75% |
Freelancers | 40% |
Corporate | 10% |
❌ Problem:
If this encoding is applied before splitting into train/test sets, the model already knows the churn rate for each category.
The model will memorize the target value instead of learning real relationships.
✅ Solution:
- Perform target encoding only within cross-validation—never on the full dataset.
📌 R: Preventing Leakage in Target Encoding with vtreat
```r
library(vtreat)

# Design the treatment plan on the training data only
treatment <- designTreatmentsC(df_train, varlist = "CustomerSegment", outcomename = "Churn", outcometarget = 1)

# Apply the transformation to training and test sets separately
df_train_transformed <- prepare(treatment, df_train)
df_test_transformed <- prepare(treatment, df_test)
```
📌 Python: Preventing Leakage in Target Encoding with category_encoders
```python
import category_encoders as ce

encoder = ce.TargetEncoder(cols=['CustomerSegment'])

# Fit on the training data, then transform train and test separately
df_train['CustomerSegment_encoded'] = encoder.fit_transform(df_train['CustomerSegment'], df_train['Churn'])
df_test['CustomerSegment_encoded'] = encoder.transform(df_test['CustomerSegment'])
```
✅ Always encode categorical features using only training data to prevent information from leaking into the test set.
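Because category_encoders transformers follow the scikit-learn API, one way to guarantee this is to place the encoder inside a Pipeline, so it is re-fit on the training folds of each cross-validation split. A sketch under the same hypothetical column names:

```python
import category_encoders as ce
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('encode', ce.TargetEncoder(cols=['CustomerSegment'])),  # fit only on each training fold
    ('model', LogisticRegression()),
])

# The encoder never sees the churn values of the validation fold
scores = cross_val_score(pipe, df_train[['CustomerSegment']], df_train['Churn'], cv=5)
print(scores.mean())
```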
2️⃣ Avoiding Overfitting with High-Cardinality Features
High-cardinality categorical features (e.g., Customer ID, Product ID, ZIP Code) introduce thousands of unique values. If not handled properly, they can lead to:
Curse of dimensionality (too many one-hot encoded columns).
Overfitting to specific training data.
📌 Example: The ZIP Code Problem
If we one-hot encode a ZIP code variable, it might create thousands of extra features, leading to:
❌ Sparse matrices (lots of zeroes).
❌ Overfitting to specific locations.
✅ Solutions for High-Cardinality Variables:
Group rare categories together (step_other() in R; in Python, e.g., by lumping infrequent values into an "Other" level, as sketched below).
Use frequency encoding instead of one-hot encoding.
Use embeddings for categorical variables in deep learning models.
📌 R: Reducing High-Cardinality with step_other()
```r
recipe_high_cardinality <- recipe(~ ZIPCode, data = df) %>%
  step_other(ZIPCode, threshold = 0.05)  # Groups rare ZIP codes into an "other" level
```
📌 Python: Reducing High-Cardinality with Frequency Encoding
```python
df['ZIPCode_encoded'] = df.groupby('ZIPCode')['ZIPCode'].transform('count')
```
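📌 Python: Grouping Rare Categories
And a Python counterpart to step_other(): a sketch (hypothetical 5% threshold) that lumps infrequent ZIP codes into a single "Other" level:

```python
# Compute each ZIP code's share of the rows
zip_share = df['ZIPCode'].value_counts(normalize=True)

# Replace ZIP codes covering less than 5% of rows with a catch-all level
rare_zips = zip_share[zip_share < 0.05].index
df['ZIPCode_grouped'] = df['ZIPCode'].where(~df['ZIPCode'].isin(rare_zips), 'Other')
```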
✅ By grouping rare categories together or using frequency encoding, we avoid overfitting while keeping useful information.
3️⃣ Optimizing Encoding Based on Model Type
Some models work well with raw categorical data, while others require specific encoding methods.
📌 Best Encoding Methods for Different Models
Model Type | Best Encoding Method |
---|---|
Linear Models (Logistic Regression, SVMs, Linear Regression) | One-Hot Encoding, Ordinal Encoding |
Tree-Based Models (Random Forest, XGBoost, LightGBM, CatBoost) | Target Encoding, Frequency Encoding, Raw Categories (CatBoost) |
Deep Learning (Neural Networks) | Embeddings, Frequency Encoding |
✅ Always select the encoding method based on how the model handles categorical variables.
📌 Example: Encoding for XGBoost vs. Logistic Regression
If we use logistic regression, we should use one-hot encoding:
```python
df_linear = pd.get_dummies(df, columns=['Category'])
```
If we use XGBoost, we should use target encoding:
```python
encoder = ce.TargetEncoder(cols=['Category'])
df['Category_encoded'] = encoder.fit_transform(df['Category'], df['Target'])
```
✅ Tree-based models like XGBoost handle categorical data differently than linear models—choose encoding methods accordingly.
Best Practices & Common Pitfalls
✅ Best Practices | ❌ Pitfalls to Avoid |
---|---|
Prevent leakage by encoding categories only on the training set. | Encoding categorical variables before the train-test split leaks target information and inflates test performance. |
Reduce high-cardinality categories using frequency encoding or grouping rare values. | One-hot encoding thousands of categories makes models inefficient. |
Use encoding methods suited to the model type (OHE for linear, target encoding for trees). | Using the wrong encoding method can confuse models and reduce accuracy. |
✅ By following these best practices, we ensure categorical variables are handled effectively without introducing unnecessary complexity.
Conclusion & Next Steps
Categorical features are often overlooked or mishandled, yet they play a critical role in predictive modeling. In this article, we’ve explored different encoding techniques, advanced feature engineering strategies, and best practices for handling categorical data effectively.
1️⃣ Key Takeaways
✔ Not all categorical variables are the same: nominal (unordered) and ordinal (ordered) features require different encoding methods.
✔ One-hot encoding works well for low-cardinality features but becomes inefficient as the number of categories grows.
✔ Target encoding and frequency encoding are powerful alternatives but must be used carefully to prevent feature leakage and overfitting.
✔ Feature interactions (e.g., combining “State” and “Product Category”) capture relationships that individual features might miss.
✔ Choosing the right encoding method depends on the model type:
Linear models (Logistic Regression, SVMs, etc.) → Prefer one-hot encoding or ordinal encoding.
Tree-based models (XGBoost, LightGBM, etc.) → Handle target encoding, frequency encoding, or raw categorical data well.
Deep learning models → Embeddings work best for high-cardinality features.
✅ By selecting the right encoding method and avoiding common pitfalls, we can improve model performance while reducing computational complexity.
2️⃣ What’s Next?
This article is part of the Feature Engineering Series, where we break down how to create better predictive models by transforming raw data.
🚀 Next up: “Cooking Temperatures Matter – Scaling & Transforming Numerical Data”
In the next article, we’ll explore:
📌 When to scale numerical features (and when not to!).
📌 Log, power, and polynomial transformations—how they impact model performance.
📌 The difference between standardization and normalization (and which models need them).
📌 Practical R & Python examples to apply numerical feature transformations effectively.
🔹 Want to stay updated? Keep an eye out for the next post in the series!
💡 What’s your favorite way to handle categorical data? Drop your thoughts below! ⬇⬇⬇