The Final Cut – Choosing the Right Features for the Job

Author

Numbers around us

Published

April 29, 2025

You’ve created dozens—maybe hundreds—of new features: cleaned, transformed, encoded, engineered.
But now comes a critical decision:

🔪 Which features stay, and which must go?

Choosing the right features is about maximizing model performance while minimizing complexity. Good feature selection can:
✅ Improve model generalization
✅ Reduce overfitting
✅ Speed up training and inference
✅ Enhance model interpretability

This final article in the Feature Engineering Series will walk through:
✅ Why feature selection matters
✅ Simple filtering techniques (correlation, variance)
✅ Model-driven feature importance (tree models, permutation)
✅ Selection methods (filter, wrapper, embedded)
✅ Best practices for pruning without hurting performance

Why Choosing Features Matters

More features ≠ better models.
In fact, too many irrelevant or redundant features can actively hurt your model by:
❌ Increasing noise and variance
❌ Leading to overfitting
❌ Slowing down training and prediction
❌ Making models harder to explain

📌 Feature selection is about keeping what matters—and cutting what doesn’t.
It’s a mix of art and science, where we balance domain knowledge, data-driven decisions, and model feedback.

📌 Simple Thought Exercise

Imagine you’re predicting house prices.
You have features like:

  • square_footage

  • number_of_bathrooms

  • zip_code

  • price_of_dog_food_in_region

🛑 Even if “dog food price” technically varies across regions, it’s noise. It adds confusion, increases data collection cost, and risks spurious correlations. You need only the features that directly contribute to your target.

✅ Good feature selection leads to:

  • Simpler, faster models

  • Better generalization to new data

  • Clearer understanding of what drives predictions

Feature engineering without feature selection is like cooking without tasting.
You need to trim the dish before serving it.

Filter Methods – Quick Wins for Cutting Features

Filter methods are your first line of defense in feature selection.
They evaluate each feature independently of any machine learning model, based purely on statistics.
✅ Fast
✅ Intuitive
✅ Great for initial cleanup before deeper analysis

In this chapter, we’ll cover:
✅ Removing features with low variance
✅ Removing highly correlated features
✅ Univariate statistical tests (optional advanced filtering)

1️⃣ Removing Low-Variance Features

If a feature barely changes across your dataset, it’s unlikely to help your model.
Examples:

  • A product code that’s always the same

  • A “country” field that’s 99% “USA”

🔹 Low-variance features are boring. Models can safely ignore them—or better yet, you should remove them.

📌 R: Using recipes::step_nzv()


library(recipes)

recipe <- recipe(target ~ ., data = df) %>%
  step_nzv(all_predictors())  # Remove near-zero-variance predictors

📌 Python: Using VarianceThreshold

from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)  # Remove features with variance < 0.01
X_reduced = selector.fit_transform(X)

✅ Simple, no model required, and often removes dozens of useless columns in wide datasets.

2️⃣ Removing Highly Correlated Features

Having two features that are nearly identical doesn’t add information—it adds redundancy.
High correlation (e.g., > 0.9) between two features suggests keeping only one.

📌 Why?

  • Reduces multicollinearity (important for linear models)

  • Speeds up training

  • Improves model stability

📌 R: Correlation Matrix with corrr + Manual Cutoff

library(corrr)
library(dplyr)

# df should contain only numeric predictors here
cor_matrix <- correlate(df)

# Find feature pairs with absolute correlation > 0.9
high_corr_pairs <- stretch(cor_matrix) %>%
  filter(abs(r) > 0.9, x != y)

📌 Python: Correlation Matrix + Manual Cutoff

import numpy as np
import pandas as pd

# Assumes df contains only numeric feature columns
corr_matrix = df.corr().abs()

# Keep only the upper triangle so each pair is considered once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find columns to drop
to_drop = [column for column in upper.columns if any(upper[column] > 0.9)]
df_reduced = df.drop(columns=to_drop)

✅ Helps models generalize better, especially for linear regression, logistic regression, and LASSO models.

3️⃣ (Optional) Statistical Tests for Feature Relevance

Sometimes you want a quick univariate check for how strongly each feature relates to the target.

  • For classification → Chi-squared test, ANOVA F-test

  • For regression → Pearson/Spearman correlation

These are rough guides, not gospel.

📌 Python: Select K Best Features

from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=10)
X_new = selector.fit_transform(X, y)
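
The regression case mentioned above works the same way. Here is a minimal sketch using f_regression, assuming X is a pandas DataFrame of numeric features and y is a numeric target.

📌 Python: Select K Best Features for Regression (sketch)

from sklearn.feature_selection import SelectKBest, f_regression

# Score each feature against a numeric target and keep the 10 highest-scoring ones
selector = SelectKBest(f_regression, k=10)
X_new = selector.fit_transform(X, y)
selected_columns = X.columns[selector.get_support()]  # names of the kept features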

Statistical filters can help shortlist candidates before deeper model-driven selection.

📌 Quick Recap: When to Use Filter Methods

  • Too many static features → remove low-variance features

  • Redundant twin features → remove highly correlated features

  • Very wide datasets → univariate feature ranking (optional)

Filter methods are fast, interpretable, and perfect for early-stage pruning before feeding data into heavier model-based feature selectors.

Wrapper Methods – Model-Driven Feature Selection

While filter methods rank features individually, wrapper methods evaluate subsets of features together based on how well they actually perform in a model.
✅ More accurate
✅ Takes feature interactions into account
❌ Slower and more computationally expensive

In this chapter, we’ll cover:
✅ What wrapper methods are
✅ Recursive Feature Elimination (RFE)
✅ Stepwise Selection
✅ When and why to use them

1️⃣ What Are Wrapper Methods?

Instead of looking at each feature in isolation, wrapper methods:
🔹 Train a model on different combinations of features
🔹 Evaluate model performance (accuracy, RMSE, AUC, etc.)
🔹 Keep the combinations that improve the metric

📌 Think of it as: “Train → Evaluate → Keep the best.”
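
To make that loop concrete, here is a minimal brute-force sketch: it tries every subset of a short candidate list, scores each with cross-validation, and keeps the best-scoring combination. It assumes X is a pandas DataFrame, y is a classification target, and candidate_features is a hypothetical list of column names; real wrapper implementations (like RFE below) search far more efficiently.

📌 Python: Brute-Force Wrapper Loop (sketch)

from itertools import combinations

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

candidate_features = ["feat_a", "feat_b", "feat_c"]  # hypothetical column names in X

best_score, best_subset = -float("inf"), None
for k in range(1, len(candidate_features) + 1):
    for subset in combinations(candidate_features, k):
        # Train → Evaluate: cross-validated score for this feature combination only
        score = cross_val_score(LogisticRegression(max_iter=1000), X[list(subset)], y, cv=5).mean()
        # Keep the best: remember the highest-scoring subset
        if score > best_score:
            best_score, best_subset = score, subset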

2️⃣ Recursive Feature Elimination (RFE)

RFE starts with all features, trains a model, removes the least important feature, retrains, and repeats until only the strongest features remain.

✅ Good for:

  • Logistic regression, linear models

  • Small- to mid-sized feature sets

  • Cases where you want compact, high-quality feature subsets

📌 Python: RFE Example with sklearn

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)

📌 R: RFE Example with caret

library(caret)

predictors <- setdiff(names(df), "target")
control <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
rfe_result <- rfe(df[, predictors], df$target, sizes = c(5, 10, 15), rfeControl = control)

✅ RFE optimizes based on model performance—not just correlation or variance.

3️⃣ Stepwise Feature Selection

Stepwise selection builds a model by adding or removing features one at a time, depending on how much they improve model performance.

  • Forward selection starts with no features and adds the most useful one at each step.

  • Backward elimination starts with all features and removes the least useful one at each step.

  • Bidirectional selection combines both approaches.

✅ Good for:

  • Smaller feature spaces

  • Quick-and-dirty baselines for interpretable models

📌 R: Stepwise Selection with MASS::stepAIC

library(MASS)

model_full <- lm(target ~ ., data = df)
stepwise_model <- stepAIC(model_full, direction = "both")

✅ Stepwise works well when interpretability is critical (e.g., regression reports).
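
scikit-learn has no AIC-based stepwise routine, but SequentialFeatureSelector offers a close counterpart: it adds (or removes) one feature at a time based on cross-validated score rather than AIC. A minimal sketch, assuming X is a pandas DataFrame and y a classification target:

📌 Python: Sequential (Stepwise-Style) Selection with sklearn (sketch)

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Forward selection: start from nothing and add the feature that improves CV score most
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=10,
    direction="forward",  # "backward" mimics backward elimination
    cv=5,
)
sfs.fit(X, y)
selected_features = X.columns[sfs.get_support()]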

4️⃣ Pros and Cons of Wrapper Methods

Pros:

  • Produces strong feature sets

  • Considers feature interactions

  • Model-specific (customized)

Cons:

  • Slow (lots of model training)

  • Can overfit if not cross-validated

  • Not always scalable for very wide data

✅ Use wrapper methods when accuracy matters more than speed—especially in smaller datasets or final feature polishing.

📌 Summary: When to Use Wrapper Methods

  • Need a very compact, strong feature set → RFE

  • Quick model explainability (linear models) → stepwise selection

  • Feature interactions matter → RFE or bidirectional stepwise

Wrapper methods are your scalpel—precise, careful, and best used after initial filtering.

Embedded Methods – Feature Selection Built Into Models

Embedded methods are the smartest feature selectors:
they integrate feature selection into the model training process itself.
Instead of a separate preprocessing step or wrapper-style evaluation, the model identifies and down-weights less important features while it learns.

This chapter covers:
✅ What embedded methods are
✅ Feature selection via regularization (LASSO)
✅ Feature importance from tree-based models
✅ When and how to use them efficiently

1️⃣ What Are Embedded Methods?

Unlike filters or wrappers that work outside of the model, embedded methods:
🔹 Train the model
🔹 Penalize or reward features automatically
🔹 Give built-in signals about which features matter

📌 Feature selection happens during model fitting—not as a separate step.

2️⃣ Regularization: LASSO for Shrinking Features

LASSO (Least Absolute Shrinkage and Selection Operator) is a type of linear regression that shrinks less important feature coefficients toward zero.

  • Coefficients that become exactly zero are effectively removed from the model.

✅ Great for:

  • High-dimensional data (many features, few samples)

  • Finding compact linear models

📌 R: LASSO with glmnet

library(glmnet)

X <- model.matrix(target ~ ., df)[, -1]
y <- df$target

lasso_model <- cv.glmnet(X, y, alpha = 1)
coef(lasso_model, s = "lambda.min")

📌 Python: LASSO with sklearn

from sklearn.linear_model import LassoCV

# Assumes X is a pandas DataFrame so column names are available
lasso = LassoCV(cv=5).fit(X, y)
selected_features = X.columns[lasso.coef_ != 0]

LASSO shrinks weak features out of the model, automatically cleaning your feature set.

3️⃣ Tree-Based Models: Natural Feature Importance

Decision trees, random forests, and gradient boosting models naturally rank feature importance based on how much each feature improves splits.

✅ Good for:

  • Nonlinear feature interactions

  • Handling messy, mixed-type data without heavy preprocessing

  • Quick feature importance ranking

📌 R: Feature Importance from ranger

library(ranger)

model <- ranger(target ~ ., data = df, importance = "impurity")
importance(model)

📌 Python: Feature Importance from RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier().fit(X, y)
importances = model.feature_importances_

Models like XGBoost, LightGBM, and CatBoost also offer similar importance scores.
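
Impurity-based importances can overstate high-cardinality features and get split across correlated ones. Permutation importance, mentioned at the start of this article, is a model-agnostic alternative: shuffle one feature at a time and measure how much the validation score drops. A minimal sketch, assuming X is a pandas DataFrame and y the target:

📌 Python: Permutation Importance (sketch)

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Shuffle each feature on the validation set and record the drop in score
result = permutation_importance(model, X_valid, y_valid, n_repeats=10, random_state=42)
importances = result.importances_mean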

4️⃣ Pros and Cons of Embedded Methods

Pros:

  • Efficient: selection happens while training

  • Handles large feature sets well

  • Captures nonlinear interactions (trees)

Cons:

  • Model-specific

  • Might discard features useful for other models

  • Can overfit without tuning (especially on small data)

✅ Embedded methods are fast, smart, and directly tied to model objectives.

📌 Summary: When to Use Embedded Feature Selection

  • High-dimensional linear models → LASSO

  • Nonlinear or tree-based models → Random Forest / XGBoost feature importance

  • Need automatic selection without extra loops → any embedded method

Embedded methods are your autopilot—they work while you train.

Best Practices & Common Mistakes in Feature Selection

Feature selection sounds simple: keep the good, throw away the bad.
But in practice, it’s very easy to introduce biases, overfit, or accidentally destroy predictive power if you’re not careful.

This chapter covers:
✅ The most common mistakes made during feature selection
✅ Best practices to keep models honest
✅ A simple checklist to build your final feature set

1️⃣ Common Mistakes to Avoid

  • 🔁 Selecting features on the full dataset: leaks test set info → fake performance boost

  • 📦 Blindly trusting feature importance: importance ≠ necessity (some features only help in combination)

  • 🛠 Overfitting during wrapper methods: small datasets + too many selection rounds = fragile models

  • 🧽 Over-cleaning (dropping weak but synergistic features): sometimes features interact in ways you can’t see in isolation

📌 Real-World Example of Leakage

If you compute feature importance or correlations, or run stepwise selection, before splitting train/test,
you’re giving your model information it wouldn’t have at prediction time.

Always split your data first → only perform selection based on the training set.
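
One practical way to enforce this is to put the selector inside a pipeline, so it is re-fit on each training fold and the held-out fold never influences the selection. A minimal sketch, assuming X is a pandas DataFrame and y a classification target:

📌 Python: Leakage-Safe Selection Inside Cross-Validation (sketch)

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# SelectKBest is fit only on each training fold; the scoring fold stays unseen
pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("model", RandomForestClassifier(random_state=42)),
])
scores = cross_val_score(pipeline, X, y, cv=5)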

2️⃣ Best Practices for Robust Feature Selection

Always split first — Train/test split (or CV folds) before any feature importance evaluation.
Use multiple methods — Combine filter (variance/correlation), wrapper (RFE), and embedded (tree importance) techniques.
Cross-validate selection — Don’t just trust one random split; validate across folds if possible.
Prioritize simplicity — Favor smaller, interpretable feature sets if predictive power remains stable.
Use domain knowledge wisely — Don’t blindly delete a variable just because it’s weak statistically; it might still matter logically.
Don’t panic about slight drops — Sometimes, dropping noisy features slightly worsens training accuracy but improves generalization.
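
For the “cross-validate selection” point above, scikit-learn’s RFECV is one option: it repeats recursive feature elimination across folds and keeps the feature count that scores best on average. A minimal sketch, assuming X is a pandas DataFrame and y a classification target:

📌 Python: Cross-Validated RFE (sketch)

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# RFE wrapped in cross-validation; the number of retained features is chosen by CV score
rfecv = RFECV(estimator=RandomForestClassifier(random_state=42), step=1, cv=5)
rfecv.fit(X, y)
selected_features = X.columns[rfecv.support_]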

3️⃣ Practical Checklist: How to Choose Features Like a Pro

1️⃣ Train/test split (or CV setup)
2️⃣ Filter out near-zero variance and highly correlated features
3️⃣ Run simple univariate filters (optional)
4️⃣ Apply a wrapper method (RFE / stepwise) if time allows
5️⃣ Validate against model-specific embedded importance
6️⃣ Build multiple models (full vs. selected) and compare metrics (see the sketch after this list)
7️⃣ Choose the smallest feature set with acceptable or better validation performance
8️⃣ Document feature selection steps for reproducibility
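
Step 6️⃣ can be as simple as scoring the full and reduced feature sets under the same cross-validation scheme. A minimal sketch, where selected_columns is a hypothetical list of the features kept by the earlier steps:

📌 Python: Comparing Full vs. Selected Feature Sets (sketch)

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# selected_columns: hypothetical list of features kept by the earlier steps
full_scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
selected_scores = cross_val_score(RandomForestClassifier(random_state=42), X[selected_columns], y, cv=5)

print(f"Full feature set:     {full_scores.mean():.3f}")
print(f"Selected feature set: {selected_scores.mean():.3f}")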

Systematic selection > Gut feeling—but combining both often works best!

📌 Quick Summary: Mastering the Final Cut

  • Split data first (avoid: feature selection on the full dataset)

  • Cross-validate selections (avoid: overfitting on a lucky split)

  • Favor smaller, simpler models (avoid: keeping everything just in case)

  • Combine filter, wrapper, and embedded signals (avoid: trusting only one method)

Feature selection isn’t just cleaning—it’s strategic model crafting.