Mastering the Spice Rack – Transforming Categorical Features
Now that we’ve covered the foundations of feature engineering, it’s time to dive deeper into categorical data transformation. Categorical variables are often overlooked or mishandled, yet they play a crucial role in many predictive models.
This article will explore:
✅ Why categorical features matter in machine learning.
✅ Beyond one-hot encoding—better techniques for high-cardinality data.
✅ Feature engineering techniques for categorical variables in R (recipes, vtreat) and Python (pandas, category_encoders).
✅ Best practices and common pitfalls when working with categorical data.
📌 Chapter 1: Understanding Categorical Features
1️⃣ What Are Categorical Features?
Categorical features represent distinct groups or labels rather than continuous numerical values. They can be:
✔ Nominal – No inherent order (e.g., colors: “red,” “blue,” “green”).
✔ Ordinal – Has a meaningful order (e.g., education level: “High School” < “Bachelor” < “Master”).
Unlike numerical features, categorical data needs special treatment to be useful in machine learning models.
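📌 Python: Declaring Nominal vs. Ordinal Columns
To make the distinction concrete, here is a minimal pandas sketch (hypothetical data): the ordinal column gets an explicit order so later encoding steps can respect it.

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['red', 'blue', 'green'],                    # nominal: no inherent order
    'Education': ['High School', 'Master', 'Bachelor'],   # ordinal: has a meaningful order
})

# Declare the ordinal column with its intended order; the nominal column stays unordered
df['Education'] = pd.Categorical(
    df['Education'],
    categories=['High School', 'Bachelor', 'Master'],
    ordered=True,
)
print(df['Education'].cat.codes)  # codes follow the declared order: 0, 2, 1
```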
2️⃣ Why Are Categorical Features Important?
Many real-world datasets rely on categorical variables to provide key information. Consider these examples:
🔹 Predicting loan default → Customer’s employment type (salaried, self-employed, retired) impacts risk.
🔹 Customer churn analysis → Subscription plan type (Basic, Premium, Enterprise) affects churn probability.
🔹 Medical diagnosis models → Patient’s blood type (A, B, O, AB) could influence health risks.
If categorical variables are not handled properly, models may fail to capture valuable patterns or become computationally inefficient.
Common Encoding Techniques & Their Limitations
Now that we understand the importance of categorical features, we need to convert them into numerical representations so machine learning models can process them effectively. However, not all encoding methods are equal, and the wrong choice can introduce bias, unnecessary complexity, or information loss.
1️⃣ One-Hot Encoding (OHE)
🔹 How It Works
One-hot encoding converts each unique category into a separate binary column (0 or 1).
📌 Example: Converting “City” into One-Hot Encoding
City | City_NewYork | City_London | City_Paris |
---|---|---|---|
New York | 1 | 0 | 0 |
London | 0 | 1 | 0 |
Paris | 0 | 0 | 1 |
✅ Advantages:
✔ Works well with low-cardinality categorical features.
✔ Prevents ordinal misinterpretation (e.g., red ≠ blue ≠ green).
❌ Limitations:
Inefficient for high-cardinality variables (e.g., thousands of ZIP codes).
Increases dataset size significantly (curse of dimensionality).
Introduces sparsity, making some algorithms inefficient.
📌 R: One-Hot Encoding with recipes
```r
library(tidymodels)

recipe_categorical <- recipe(~ City, data = df) %>%
  step_dummy(all_nominal_predictors())
prepped_data <- prep(recipe_categorical) %>% bake(new_data = df)
```
📌 Python: One-Hot Encoding with pandas
```python
import pandas as pd

df = pd.DataFrame({'City': ['New York', 'London', 'Paris']})
df_encoded = pd.get_dummies(df, columns=['City'])
print(df_encoded)
```
2️⃣ Label Encoding
🔹 How It Works
Label encoding assigns a numeric value to each category.
📌 Example: Converting “Subscription Plan” into Label Encoding
Plan | Encoded Value |
---|---|
Basic | 0 |
Premium | 1 |
Enterprise | 2 |
✅ Advantages:
✔ Memory-efficient (does not expand columns like OHE).
✔ Works well for ordinal categories (e.g., “Low” < “Medium” < “High”).
❌ Limitations:
Not suitable for nominal categories (may introduce false relationships).
Models may mistakenly assume numeric order matters (e.g., “New York” < “London” < “Paris” makes no sense).
📌 R: Label Encoding with recipes
```r
recipe_categorical <- recipe(~ Plan, data = df) %>%
  step_integer(all_nominal_predictors())
prepped_data <- prep(recipe_categorical) %>% bake(new_data = df)
```
📌 Python: Label Encoding with scikit-learn
```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
df['Plan_encoded'] = encoder.fit_transform(df['Plan'])
```
✅ Use Label Encoding only for ordinal categories.
❌ Avoid it for nominal categories (use OHE instead).
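One caveat: scikit-learn's LabelEncoder assigns codes in alphabetical order, so it will not necessarily respect a ranking like Basic < Premium < Enterprise. A minimal sketch (hypothetical Plan column) that spells out the order explicitly:

```python
import pandas as pd

df = pd.DataFrame({'Plan': ['Premium', 'Basic', 'Enterprise', 'Basic']})

# Encode using the order that actually matters, not alphabetical order
plan_order = ['Basic', 'Premium', 'Enterprise']
df['Plan_encoded'] = pd.Categorical(df['Plan'], categories=plan_order, ordered=True).codes
print(df)  # Basic -> 0, Premium -> 1, Enterprise -> 2
```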
3️⃣ Target Encoding (Mean Encoding)
🔹 How It Works
Replaces categories with the mean target value for each category.
📌 Example: Converting “City” into Target Encoding for predicting House Prices
City | Avg House Price ($) |
---|---|
New York | 900,000 |
London | 750,000 |
Paris | 850,000 |
✅ Advantages:
✔ Reduces dimensionality (no extra columns like OHE).
✔ Retains some predictive power of categories.
❌ Limitations:
Prone to data leakage if applied before splitting training/testing sets.
May overfit to small categories if not smoothed.
📌 R: Target Encoding with vtreat
```r
library(vtreat)

treatment <- designTreatmentsN(df, varlist = "City", outcomename = "HousePrice")
df_transformed <- prepare(treatment, df)
```
📌 Python: Target Encoding with category_encoders
```python
import category_encoders as ce

encoder = ce.TargetEncoder(cols=['City'])
df['City_encoded'] = encoder.fit_transform(df['City'], df['HousePrice'])
```
✅ Use target encoding for high-cardinality variables in regression problems.
❌ Be cautious with leakage—always apply it within cross-validation.
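To soften the small-category problem mentioned above, category_encoders exposes a smoothing parameter that shrinks rare categories toward the overall mean. A minimal sketch with toy data (hypothetical values):

```python
import pandas as pd
import category_encoders as ce

# Toy data: "Paris" appears only once, so its encoding is pulled toward the global mean
df = pd.DataFrame({
    'City': ['New York', 'New York', 'London', 'London', 'Paris'],
    'HousePrice': [900_000, 950_000, 750_000, 760_000, 850_000],
})

encoder = ce.TargetEncoder(cols=['City'], smoothing=10)  # larger values shrink rare categories more
encoded = encoder.fit_transform(df[['City']], df['HousePrice'])
df['City_encoded'] = encoded['City']
print(df)
```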
4️⃣ Frequency Encoding
🔹 How It Works
Replaces categories with their frequency of occurrence.
📌 Example: Encoding “Car Brand” by Frequency
Car Brand | Frequency |
---|---|
Toyota | 5000 |
BMW | 3000 |
Tesla | 2000 |
✅ Advantages:
✔ Preserves category importance without adding too many columns.
✔ Efficient for high-cardinality variables.
❌ Limitations:
May lose interpretability if categories don’t have a meaningful frequency pattern.
Not useful if category frequency is unrelated to the target.
📌 R: Frequency Encoding
```r
# ave() returns a character vector here, so convert the counts back to numeric
df$CarBrand_encoded <- as.numeric(ave(df$CarBrand, df$CarBrand, FUN = length))
```
📌 Python: Frequency Encoding
```python
df['CarBrand_encoded'] = df.groupby('CarBrand')['CarBrand'].transform('count')
```
✅ Use frequency encoding when dealing with large categorical variables.
❌ Avoid it if frequency has no correlation with the target variable.
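If the frequencies are meant to generalize, it is safer to learn them on the training data and map them onto new data. A short sketch assuming hypothetical df_train and df_test frames:

```python
# Learn brand frequencies on the training split only
freq = df_train['CarBrand'].value_counts(normalize=True)

# Map those frequencies onto both splits; brands unseen in training fall back to 0
df_train['CarBrand_encoded'] = df_train['CarBrand'].map(freq)
df_test['CarBrand_encoded'] = df_test['CarBrand'].map(freq).fillna(0)
```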
Summary: Choosing the Right Encoding Method
Encoding Method | Works Best For | When to Use |
---|---|---|
One-Hot Encoding | Low-cardinality categories | When categories have ≤ 10 unique values. |
Label Encoding | Ordinal categories | When order matters (e.g., Low < Medium < High). |
Target Encoding | High-cardinality categories | When there’s a strong relationship between category and target. |
Frequency Encoding | Large categorical variables | When the category’s frequency is informative. |
✅ By selecting the right encoding method, we ensure our models make the most of categorical data while avoiding unnecessary complexity.
Advanced Feature Engineering for Categorical Data
Now that we’ve covered basic encoding techniques, let’s explore more advanced strategies to extract additional insights from categorical features.
This chapter will focus on:
✅ Combining categorical features to create interaction terms.
✅ Encoding strategies for tree-based vs. linear models.
✅ Using embeddings for categorical data.
1️⃣ Feature Interactions for Categorical Variables
Feature interactions help models capture relationships between categories that individual features alone may miss. Instead of treating categorical features independently, we can combine them into new engineered features.
📌 Example: Combining “State” and “Product Category”
Imagine we’re predicting sales revenue, and our dataset includes:
State: “California”, “Texas”, “New York”
Product Category: “Electronics”, “Clothing”, “Furniture”
A standard model would treat these separately, but what if a product sells differently depending on the state?
State | Product Category | Sales ($) |
---|---|---|
California | Electronics | 120,000 |
Texas | Electronics | 80,000 |
New York | Electronics | 100,000 |
California | Clothing | 75,000 |
Texas | Clothing | 50,000 |
Instead of separate features, we can combine them into a single categorical variable:
State_Product | Sales ($) |
---|---|
California_Electronics | 120,000 |
Texas_Electronics | 80,000 |
NewYork_Electronics | 100,000 |
California_Clothing | 75,000 |
Now, the model can learn state-specific sales patterns instead of treating all locations and products the same.
📌 R: Creating Interaction Features with recipes
```r
library(tidymodels)

df$State_Product <- interaction(df$State, df$ProductCategory, sep = "_")

recipe_categorical <- recipe(~ State_Product, data = df) %>%
  step_dummy(all_nominal_predictors())  # One-hot encode the new feature
prepped_data <- prep(recipe_categorical) %>% bake(new_data = df)
```
📌 Python: Creating Interaction Features with pandas
```python
df['State_Product'] = df['State'] + "_" + df['ProductCategory']
df = pd.get_dummies(df, columns=['State_Product'])  # One-hot encode the combined feature
```
✅ Use feature interactions when categorical variables may have meaningful dependencies.
❌ Avoid them if the number of unique combinations is too high (e.g., thousands of possible pairs).
2️⃣ Encoding Strategies for Tree-Based vs. Linear Models
Different machine learning models handle categorical features differently:
🔹 Linear models (logistic regression, linear regression, SVMs)
Prefer one-hot encoding or ordinal encoding.
Work best when categorical features don’t have too many unique values.
🔹 Tree-based models (random forests, XGBoost, LightGBM, CatBoost)
Some implementations (e.g., LightGBM, CatBoost) can handle raw categorical data directly, without one-hot encoding (see the sketch at the end of this section).
Perform well with target encoding and frequency encoding.
📌 Example: Predicting Customer Churn Using Linear vs. Tree-Based Models
We have a dataset with:
Subscription Type: “Free”, “Basic”, “Premium”, “Enterprise”
Churn: (1 = Yes, 0 = No)
Subscription Type | Churn Rate (%) |
---|---|
Free | 80% |
Basic | 50% |
Premium | 20% |
Enterprise | 5% |
For linear models, we use one-hot encoding:
Free | Basic | Premium | Enterprise |
---|---|---|---|
1 | 0 | 0 | 0 |
0 | 1 | 0 | 0 |
0 | 0 | 1 | 0 |
For tree-based models, we can use target encoding:
Subscription Type | Encoded Value |
---|---|
Free | 0.80 |
Basic | 0.50 |
Premium | 0.20 |
Enterprise | 0.05 |
📌 R: Encoding for Linear Models vs. Tree-Based Models
```r
library(tidymodels)

# One-hot encoding for linear models
recipe_linear <- recipe(~ SubscriptionType, data = df) %>%
  step_dummy(all_nominal_predictors())

# Target encoding for tree-based models
library(vtreat)
treatment <- designTreatmentsC(df, varlist = "SubscriptionType", outcomename = "Churn", outcometarget = 1)
df_transformed <- prepare(treatment, df)
```
📌 Python: Encoding for Linear Models vs. Tree-Based Models
```python
import category_encoders as ce

# One-hot encoding (linear models)
df_linear = pd.get_dummies(df, columns=['SubscriptionType'])

# Target encoding (tree-based models)
encoder = ce.TargetEncoder(cols=['SubscriptionType'])
df['SubscriptionType_encoded'] = encoder.fit_transform(df['SubscriptionType'], df['Churn'])
```
✅ Use target encoding for high-cardinality categorical variables in tree-based models.
❌ Avoid one-hot encoding when there are too many categories—it increases feature space dramatically.
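📌 Python: Letting a Tree-Based Model Handle Raw Categories
As a side note to the point about tree-based models above: libraries such as LightGBM and CatBoost can split on categorical columns natively. A minimal LightGBM sketch (hypothetical df with SubscriptionType and Churn columns), assuming the column is cast to pandas' category dtype:

```python
import lightgbm as lgb

# LightGBM treats pandas 'category' columns as categorical features natively
X = df[['SubscriptionType']].astype('category')
y = df['Churn']

model = lgb.LGBMClassifier()
model.fit(X, y)  # no one-hot or target encoding required
```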
3️⃣ Using Embeddings for Categorical Data
🔹 How It Works
Categorical embeddings transform categories into dense numerical vectors that capture similarities between values.
Instead of one-hot encoding, we can train an embedding layer to represent relationships between categories automatically.
📌 Example: Encoding “Job Role” in a Hiring Model
Instead of one-hot vectors like "Engineer" → [1,0,0,0] and "Manager" → [0,1,0,0], a neural network can learn that "Engineer" is more similar to "Scientist" than to "Cashier", assigning embeddings like:
Job Role | Embedding 1 | Embedding 2 | Embedding 3 |
---|---|---|---|
Engineer | 0.85 | 0.13 | 0.41 |
Scientist | 0.88 | 0.12 | 0.39 |
Cashier | 0.10 | 0.95 | 0.76 |
📌 Python: Creating Embeddings for Categorical Variables Using TensorFlow
```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Embedding

# Define an embedding layer for a categorical feature with 10 unique values
embedding_layer = Embedding(input_dim=10, output_dim=3)

# Example: convert category "Engineer" (ID = 3) into a dense vector
category_input = np.array([[3]])  # example category index
embedded_output = embedding_layer(category_input)
print(embedded_output)
```
✅ Embeddings are useful when working with large categorical datasets (e.g., text, recommendation systems).
❌ Not necessary for small categorical variables (OHE or target encoding is simpler).
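In practice, the embedding layer above is trained as part of a larger network rather than used in isolation. A minimal sketch (hypothetical integer-encoded job-role IDs and binary hiring labels):

```python
import numpy as np
from tensorflow.keras import layers, models

n_job_roles = 10  # assumed number of unique job roles

model = models.Sequential([
    layers.Embedding(input_dim=n_job_roles, output_dim=3),  # embeddings are learned during training
    layers.Flatten(),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

# Hypothetical data: one integer-encoded job role per row, plus a binary "hired" label
job_role_ids = np.array([[3], [1], [7], [3]])
hired = np.array([1, 0, 1, 1])
model.fit(job_role_ids, hired, epochs=5, verbose=0)
```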
📌 Summary: Choosing the Best Advanced Encoding Strategy
Encoding Method | Works Best For | When to Use |
---|---|---|
Feature Interactions | Related categorical variables | When categories influence each other (e.g., State + Product Category). |
Target Encoding | Tree-based models | When category values correlate with the target variable. |
Embeddings | Deep learning models | When handling large categorical datasets efficiently. |
✅ Using advanced encoding techniques ensures categorical variables provide real value to models while maintaining efficiency.
Best Practices & Common Pitfalls for Categorical Features
Now that we’ve explored categorical encoding techniques, let’s focus on best practices and mistakes to avoid when working with categorical data in machine learning.
This chapter will cover:
✅ Preventing feature leakage when encoding categorical variables.
✅ Avoiding overfitting with high-cardinality categorical features.
✅ Optimizing categorical encoding based on the type of model.
1️⃣ Avoiding Feature Leakage with Categorical Encoding
Feature leakage occurs when a model has access to information during training that it wouldn’t have in real-world predictions. This is a huge issue in categorical encoding techniques like target encoding.
📌 Example: Target Encoding Gone Wrong
Imagine we are predicting customer churn, and we use target encoding on the “Customer Segment” column:
Customer Segment | Avg Churn Rate (%) |
---|---|
Students | 75% |
Freelancers | 40% |
Corporate | 10% |
❌ Problem:
If this encoding is applied before splitting into train/test sets, the model already knows the churn rate for each category.
The model will memorize the target value instead of learning real relationships.
✅ Solution:
- Perform target encoding only within cross-validation—never on the full dataset.
📌 R: Preventing Leakage in Target Encoding with vtreat
```r
library(vtreat)

# Design the treatment plan on the training data only
treatment <- designTreatmentsC(df_train, varlist = "CustomerSegment", outcomename = "Churn", outcometarget = 1)

# Apply the transformation to training and test sets separately
df_train_transformed <- prepare(treatment, df_train)
df_test_transformed <- prepare(treatment, df_test)
```
📌 Python: Preventing Leakage in Target Encoding with category_encoders
```python
import category_encoders as ce

encoder = ce.TargetEncoder(cols=['CustomerSegment'])

# Fit on the training data, then transform train and test separately
df_train['CustomerSegment_encoded'] = encoder.fit_transform(df_train['CustomerSegment'], df_train['Churn'])
df_test['CustomerSegment_encoded'] = encoder.transform(df_test['CustomerSegment'])
```
✅ Always encode categorical features using only training data to prevent information from leaking into the test set.
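Because category_encoders transformers follow the scikit-learn API, one way to guarantee this is to place the encoder inside a Pipeline, so it is re-fit on the training folds of each cross-validation split. A sketch under the same hypothetical column names:

```python
import category_encoders as ce
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('encode', ce.TargetEncoder(cols=['CustomerSegment'])),  # fit only on each training fold
    ('model', LogisticRegression()),
])

# The encoder never sees the churn values of the validation fold
scores = cross_val_score(pipe, df_train[['CustomerSegment']], df_train['Churn'], cv=5)
print(scores.mean())
```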
2️⃣ Avoiding Overfitting with High-Cardinality Features
High-cardinality categorical features (e.g., Customer ID, Product ID, ZIP Code) introduce thousands of unique values. If not handled properly, they can lead to:
Curse of dimensionality (too many one-hot encoded columns).
Overfitting to specific training data.
📌 Example: The ZIP Code Problem
If we one-hot encode a ZIP code variable, it might create thousands of extra features, leading to:
❌ Sparse matrices (lots of zeroes).
❌ Overfitting to specific locations.
✅ Solutions for High-Cardinality Variables:
Group rare categories together (step_other() in R; in Python, e.g., by lumping infrequent values into an "Other" level, as sketched below).
Use frequency encoding instead of one-hot encoding.
Use embeddings for categorical variables in deep learning models.
📌 R: Reducing High-Cardinality with step_other()
```r
recipe_high_cardinality <- recipe(~ ZIPCode, data = df) %>%
  step_other(ZIPCode, threshold = 0.05)  # Groups rare ZIP codes into an "other" level
```
📌 Python: Reducing High-Cardinality with Frequency Encoding
```python
df['ZIPCode_encoded'] = df.groupby('ZIPCode')['ZIPCode'].transform('count')
```
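📌 Python: Grouping Rare Categories
And a Python counterpart to step_other(): a sketch (hypothetical 5% threshold) that lumps infrequent ZIP codes into a single "Other" level:

```python
# Compute each ZIP code's share of the rows
zip_share = df['ZIPCode'].value_counts(normalize=True)

# Replace ZIP codes covering less than 5% of rows with a catch-all level
rare_zips = zip_share[zip_share < 0.05].index
df['ZIPCode_grouped'] = df['ZIPCode'].where(~df['ZIPCode'].isin(rare_zips), 'Other')
```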
✅ By grouping rare categories together or using frequency encoding, we avoid overfitting while keeping useful information.
3️⃣ Optimizing Encoding Based on Model Type
Some models work well with raw categorical data, while others require specific encoding methods.
📌 Best Encoding Methods for Different Models
Model Type | Best Encoding Method |
---|---|
Linear Models (Logistic Regression, SVMs, Linear Regression) | One-Hot Encoding, Ordinal Encoding |
Tree-Based Models (Random Forest, XGBoost, LightGBM, CatBoost) | Target Encoding, Frequency Encoding, Raw Categories (CatBoost) |
Deep Learning (Neural Networks) | Embeddings, Frequency Encoding |
✅ Always select the encoding method based on how the model handles categorical variables.
📌 Example: Encoding for XGBoost vs. Logistic Regression
If we use logistic regression, we should use one-hot encoding:
```python
df_linear = pd.get_dummies(df, columns=['Category'])
```
If we use XGBoost, we should use target encoding:
```python
encoder = ce.TargetEncoder(cols=['Category'])
df['Category_encoded'] = encoder.fit_transform(df['Category'], df['Target'])
```
✅ Tree-based models like XGBoost handle categorical data differently than linear models—choose encoding methods accordingly.
Best Practices & Common Pitfalls
✅ Best Practices | ❌ Pitfalls to Avoid |
---|---|
Prevent leakage by encoding categories only on the training set. | Encoding categorical variables before the train-test split leaks target information and inflates test performance. |
Reduce high-cardinality categories using frequency encoding or grouping rare values. | One-hot encoding thousands of categories makes models inefficient. |
Use encoding methods suited to the model type (OHE for linear, target encoding for trees). | Using the wrong encoding method can confuse models and reduce accuracy. |
✅ By following these best practices, we ensure categorical variables are handled effectively without introducing unnecessary complexity.
Conclusion & Next Steps
Categorical features are often overlooked or mishandled, yet they play a critical role in predictive modeling. In this article, we’ve explored different encoding techniques, advanced feature engineering strategies, and best practices for handling categorical data effectively.
1️⃣ Key Takeaways
✔ Not all categorical variables are the same: nominal (unordered) and ordinal (ordered) features require different encoding methods.
✔ One-hot encoding works well for low-cardinality features but becomes inefficient as the number of categories grows.
✔ Target encoding and frequency encoding are powerful alternatives but must be used carefully to prevent feature leakage and overfitting.
✔ Feature interactions (e.g., combining “State” and “Product Category”) capture relationships that individual features might miss.
✔ Choosing the right encoding method depends on the model type:
Linear models (Logistic Regression, SVMs, etc.) → Prefer one-hot encoding or ordinal encoding.
Tree-based models (XGBoost, LightGBM, etc.) → Handle target encoding, frequency encoding, or raw categorical data well.
Deep learning models → Embeddings work best for high-cardinality features.
✅ By selecting the right encoding method and avoiding common pitfalls, we can improve model performance while reducing computational complexity.
2️⃣ What’s Next?
This article is part of the Feature Engineering Series, where we break down how to create better predictive models by transforming raw data.
🚀 Next up: “Cooking Temperatures Matter – Scaling & Transforming Numerical Data”
In the next article, we’ll explore:
📌 When to scale numerical features (and when not to!).
📌 Log, power, and polynomial transformations—how they impact model performance.
📌 The difference between standardization and normalization (and which models need them).
📌 Practical R & Python examples to apply numerical feature transformations effectively.
🔹 Want to stay updated? Keep an eye out for the next post in the series!
💡 What’s your favorite way to handle categorical data? Drop your thoughts below! ⬇⬇⬇