Beyond the Numbers – Feature Engineering for Text Data

Author: Numbers around us

Published: April 22, 2025

Text data is everywhere—customer reviews, support tickets, product descriptions, survey responses—and it’s often the most underutilized source of insight in a dataset. While numbers are easy to model, text requires more creativity and structure to be useful.

This article explores:
✅ Why text is challenging—and powerful—for machine learning.
✅ How to extract structured features from raw text.
✅ Key techniques like TF-IDF, text length, sentiment, and n-grams.
✅ Practical examples in R (tidytext) and Python (scikit-learn / NLP).

Why Text Needs Feature Engineering

Unlike numerical or categorical features, text data is:

  • Unstructured — There’s no inherent format, just characters.

  • Variable in length — A tweet and a product review are both “text” but vastly different.

  • Rich in signal — Word choice, tone, and structure carry valuable information.

So why don’t we just throw text into a model?

Because most machine learning algorithms expect fixed-length, structured inputs. Text must be transformed into features the model can learn from.

📌 What Can We Learn from Raw Text?

Signal Type | Example Feature
Volume | Number of characters, words, sentences
Structure | Use of punctuation, capital letters
Semantics | TF-IDF scores, word embeddings
Sentiment | Polarity score, positive/negative tone
Topic | Presence of key terms, topic clusters

These features can be numeric, binary, categorical, or even embedded vectors.

📌 Example Use Case: Hotel Review Sentiment

Review Text | Stars | Word Count | Sentiment | TF-IDF
“The room was spotless and the staff was kind.” | 5 | 9 | Positive | …
“Terrible service and dirty bathroom.” | 1 | 5 | Negative | …

✅ We can model review stars, churn likelihood, or customer satisfaction using engineered text features.
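As a quick illustration, here is a minimal Python sketch of how such a feature table could be assembled from raw review text. It assumes a pandas DataFrame with a review_text column (name chosen for this example) and uses a simple keyword rule as a stand-in for a real sentiment model:

import pandas as pd

reviews = pd.DataFrame({
    "review_text": [
        "The room was spotless and the staff was kind.",
        "Terrible service and dirty bathroom.",
    ],
    "stars": [5, 1],
})

# Volume feature: number of whitespace-separated tokens
reviews["word_count"] = reviews["review_text"].str.split().str.len()

# Toy sentiment flag: keyword lookup as a placeholder for a proper sentiment model
negative_words = ["terrible", "dirty", "awful"]
reviews["sentiment"] = reviews["review_text"].str.lower().apply(
    lambda t: "Negative" if any(w in t for w in negative_words) else "Positive"
)

print(reviews)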

Basic Text Feature Extraction (Length, Count, Structure)

Before jumping into complex techniques like embeddings or TF-IDF, let’s not overlook the surprisingly powerful features that come straight from the structure of the text itself. These basic numeric summaries often carry strong signals—especially in short-form content like reviews, tickets, or emails.

This chapter focuses on extracting structural and volume-based features such as:
  • Character and word counts

  • Punctuation and special characters

  • Capitalization and formatting patterns

  • Readability metrics

1️⃣ Text Length-Based Features

The length of text can often correlate with its meaning or intent.

Text | Word Count | Char Count
“OK.” | 1 | 3
“Thanks so much for your help!” | 6 | 29
“Absolutely terrible experience, never again.” | 5 | 44

📌 R: Length Features Using stringr and dplyr

library(dplyr)
library(stringr)

df <- df %>%
  mutate(
    word_count = str_count(text, "\\w+"),
    char_count = str_length(text)
  )

📌 Python: Length Features Using pandas

df['word_count'] = df['text'].str.split().str.len()
df['char_count'] = df['text'].str.len()

✅ These features often act as proxies for verbosity, clarity, or emotional tone.

2️⃣ Punctuation & Special Character Features

Characters like exclamation points, question marks, emojis, or ALL CAPS often express emotion, urgency, or attitude.

Text | Excl. Count | Uppercase Count
“This is AMAZING!!” | 2 | 8
“Is this working?” | 0 | 1

📌 R: Punctuation Features Using stringr

df <- df %>%
  mutate(
    num_exclam = str_count(text, "!"),
    num_question = str_count(text, "\\?"),
    num_upper = str_count(text, "[A-Z]")
  )

📌 Python: Same in pandas

df['num_exclam'] = df['text'].str.count('!')
df['num_question'] = df['text'].str.count('\\?')
df['num_upper'] = df['text'].str.findall(r'[A-Z]').str.len()

3️⃣ Readability & Complexity Features

Basic stats like average word length, long word ratio, or Flesch reading ease can provide insight into how formal, technical, or accessible the text is.

📌 Python Example: Readability with textstat

import textstat

df['readability'] = df['text'].apply(textstat.flesch_reading_ease)

📌 R Example: Readability with quanteda

library(quanteda)
library(quanteda.textstats)  # textstat_readability() lives here in quanteda >= 3.0

df$readability <- textstat_readability(df$text, measure = "Flesch")$Flesch

High or low readability can signal anything from spam to legal language to customer distress.

📌 Summary: Structural Features to Start With

Feature | What It Reflects
Word / Character Count | Length, verbosity
Number of Exclamation Points | Emotion, intensity
Number of Uppercase Letters | Emphasis, shouting
Readability Score | Complexity, accessibility

These simple metrics are easy to compute and can deliver strong baseline performance, especially when used alongside other structured data.

Keyword-Based Features – TF-IDF, N-Grams & Frequency Counts

Once you’ve squeezed all the value out of structural and length-based features, it’s time to go deeper—into the words themselves. Keyword-based features transform raw text into a structured representation of its content, allowing your models to capture meaning, context, and emphasis.

This chapter focuses on:
  • Word and term frequency (TF)

  • TF-IDF (Term Frequency-Inverse Document Frequency)

  • N-grams (phrases of 2+ words)

  • Most frequent word indicators

1️⃣ Term Frequency (TF) – Basic Bag-of-Words

Bag-of-Words (BoW) is the foundation of keyword-based feature engineering. It counts how many times each word appears in a document, ignoring order and grammar.

Text | love | this | product
“I love this product!” | 1 | 1 | 1
“Product was great” | 0 | 0 | 1

✅ Great for: Simple models, short text like reviews or tweets.
❌ Ignores: Context, synonyms, grammar, importance.

📌 R: Bag-of-Words with tidytext

library(tidytext)
library(dplyr)

df_tokens <- df %>%
  unnest_tokens(word, text) %>%
  count(id, word) %>%
  bind_tf_idf(word, id, n)  # Optional step

📌 Python: Bag-of-Words with CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(df['text'])
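To inspect the resulting matrix, one option (a sketch, assuming scikit-learn >= 1.0 for get_feature_names_out() and a corpus small enough to densify) is to convert it to a DataFrame:

import pandas as pd

bow_df = pd.DataFrame(
    X_bow.toarray(),  # dense view; only sensible for small corpora
    columns=vectorizer.get_feature_names_out()
)
print(bow_df.head())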

2️⃣ TF-IDF – Emphasizing Unique Words

TF-IDF downweights common words (“the”, “is”) and upweights words that are rare but meaningful in specific texts (“refund”, “broken”, “delighted”).

\[\text{TF-IDF}(t,d) = \text{TF}(t,d) \times \log\left(\frac{N}{DF(t)}\right)\]

Where:

  • t = term

  • d = document

  • N = total number of docs

  • DF(t) = number of docs containing term t
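For intuition, a small worked example with made-up numbers: if a term appears 3 times in a document and occurs in 10 out of 1,000 documents, then (using the natural log; real implementations vary in log base and smoothing):

\[3 \times \log\left(\frac{1000}{10}\right) = 3 \times \log(100) \approx 3 \times 4.61 \approx 13.8\]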

📌 R: TF-IDF with tidytext (continued)

This was already done in the example above with bind_tf_idf(). You can then call cast_dtm() to turn the result into a document-term matrix.

📌 Python: TF-IDF with TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['text'])

✅ Useful for: Finding most distinguishing words across documents.

3️⃣ N-Grams – Capturing Phrases, Not Just Words

N-grams let you detect phrases like “credit card”, “bad service”, or “very helpful”. They reveal more context than isolated words.

Text | 2-grams
“very bad service” | “very bad”, “bad service”

📌 R: Extracting Bigrams with unnest_tokens()

df_bigrams <- df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

📌 Python: Bigrams with CountVectorizer

vectorizer = CountVectorizer(ngram_range=(2, 2))
X_bigrams = vectorizer.fit_transform(df['text'])

✅ Powerful for: Sentiment, intent detection, topic modeling.

4️⃣ Flagging Important Terms (Keyword Indicators)

Sometimes, all you need is a binary indicator: did the user mention “refund”? “cancel”? “perfect”? This is especially useful when domain-specific keywords drive decisions.

📌 R: Keyword Flags

df <- df %>%
  mutate(refund_mentioned = str_detect(text, regex("refund", ignore_case = TRUE)))

📌 Python: Keyword Flags

df['refund_mentioned'] = df['text'].str.contains('refund', case=False, na=False)

✅ This is a great lightweight feature in domain-specific tasks.
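To flag several domain keywords at once, one approach (a sketch; the keyword list is purely illustrative) is a short Python loop that creates one binary column per term:

keywords = ['refund', 'cancel', 'perfect']  # hypothetical domain-specific terms

for kw in keywords:
    # Case-insensitive literal match; missing text counts as no match
    df[f'{kw}_mentioned'] = df['text'].str.contains(kw, case=False, na=False, regex=False)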

📌 Summary: Keyword-Based Feature Toolkit

Feature | Use Case
Bag-of-Words (TF) | Basic presence of terms
TF-IDF | Detects unique terms in each document
N-Grams | Captures meaningful phrases
Keyword Flags | Quick domain-specific signals

These features bridge the gap between raw text and machine-readable inputs, especially when building traditional ML models (e.g., logistic regression, random forest, XGBoost).

Semantic Features & Word Embeddings

While keyword-based features like TF-IDF and n-grams are great at counting words, they fall short when it comes to understanding meaning. They treat “good” and “great” as unrelated, and “not bad” as equivalent to “bad.”

That’s where word embeddings come in.

This chapter covers:
✅ The concept behind word embeddings
✅ Pretrained embeddings like GloVe, Word2Vec, and FastText
✅ Sentence and document embeddings (e.g., BERT, Doc2Vec)
✅ How to incorporate embeddings into machine learning workflows

1️⃣ What Are Word Embeddings?

Embeddings represent words as dense numeric vectors in a way that captures semantic similarity. Words that occur in similar contexts will have similar vectors.

Word | Vector (simplified)
“cat” | [0.25, 0.81, -0.12, …]
“dog” | [0.27, 0.78, -0.09, …]
“coffee” | [-0.45, 0.14, 0.66, …]

🔹 “cat” and “dog” will have vectors closer to each other than to “coffee.”
🔹 Unlike BoW/TF-IDF, embeddings preserve context and meaning.

2️⃣ Using Pretrained Embeddings

You can use pretrained embeddings (trained on massive corpora like Wikipedia or Common Crawl) instead of training your own.

📌 Python: Using GloVe or Word2Vec with gensim

import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")  # or "word2vec-google-news-300"
vector = model['coffee']  # 50-dimensional vector
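To see the semantic relationships described above, you can compare words directly; a quick check (exact values depend on the embedding you load):

# Cosine similarity between word vectors: related words score higher
print(model.similarity('cat', 'dog'))      # relatively high
print(model.similarity('cat', 'coffee'))   # noticeably lower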

To get a sentence or document embedding, average word vectors:

import numpy as np

def sentence_embedding(text):
    words = text.split()
    vectors = [model[word] for word in words if word in model]
    if not vectors:
        # No in-vocabulary words: fall back to a zero vector
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

df['text_embedding'] = df['text'].apply(sentence_embedding)

📌 R: Using Pretrained Embeddings via text2vec or text packages

library(text)
# textEmbed() returns embeddings from a pretrained language model (transformer-based by default)
embeddings <- textEmbed("coffee is great")

R users can also use pretrained word embeddings from fastText via text2vec or spaCy (Python + reticulate).

3️⃣ Contextual Embeddings (BERT and Friends)

Unlike static embeddings, transformer-based models like BERT generate context-aware embeddings.

  • “bank” in “river bank” ≠ “bank” in “savings bank”

  • BERT knows the difference based on sentence context.

📌 Python: Sentence Embeddings with sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
df['embedding'] = df['text'].apply(lambda x: model.encode(x))

✅ Great for: Semantic search, clustering, document similarity, advanced NLP tasks

❌ Slower and heavier than traditional models—but much more powerful.
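As a quick usage sketch (the sentence texts are illustrative), two encoded sentences can be compared with cosine similarity using the library's util helpers:

from sentence_transformers import util

emb_a = model.encode("The room was spotless.")
emb_b = model.encode("The hotel was very clean.")

# Cosine similarity close to 1 means the sentences are semantically similar
print(util.cos_sim(emb_a, emb_b))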

4️⃣ When to Use Which?

Technique | Use Case | Pros | Cons
TF-IDF | Classical models | Simple, fast | Ignores meaning
GloVe / Word2Vec | General semantic similarity | Lightweight | No context awareness
BERT / Transformers | Deep NLP tasks, similarity, classification | Context-aware | Heavier, slower

✅ Use embeddings when your task involves meaning, not just frequency.

📌 Summary: From Words to Meaningful Vectors

Feature Type | Example Use | Tool
Word Embedding | Semantic similarity | Word2Vec, GloVe, FastText
Sentence Embedding | Text classification, search | BERT, Sentence-BERT
Averaged Word Vectors | Quick sentence-level embedding | Manual or gensim

Embeddings allow you to turn freeform language into model-ready features that capture relationships, tone, and intent.

Pitfalls & Best Practices for Text Features

Working with text features can unlock massive value—but only if done carefully. Many common mistakes lead to data leakage, overfitting, or simply ineffective feature usage.

In this final chapter, we’ll cover:
✅ Common pitfalls in text preprocessing
✅ How to prevent leakage with TF-IDF and embeddings
✅ Best practices for combining text with structured data
✅ Model-friendly tips for production-ready pipelines

1️⃣ Common Pitfalls in Text Feature Engineering

Pitfall | Why It’s a Problem
🔁 Preprocessing before splitting | Causes leakage if you compute TF-IDF or vocabulary across the full dataset
🧼 Over-cleaning | Removing punctuation, stop words, or case can strip away meaning
🔤 Rare word explosion | TF-IDF matrices can explode in size if not pruned
🧩 Treating text as fully independent | Ignoring interactions with structured data misses context

📌 Example: Leakage via TF-IDF on Full Dataset

from sklearn.feature_extraction.text import TfidfVectorizer

# ❌ Wrong: vocabulary and IDF weights are learned from the full dataset (train + test)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text'])  # Leakage!

# ✅ Right: fit only on the training set, then transform both splits
vectorizer = TfidfVectorizer()
vectorizer.fit(df_train['text'])
X_train = vectorizer.transform(df_train['text'])
X_test = vectorizer.transform(df_test['text'])

✅ Always split your data first, then fit any vectorizers, embeddings, or scalers only on the training set.

2️⃣ Best Practices for Combining Text with Structured Data

Text is rarely alone—you often want to blend it with numeric or categorical data. To make this work:

✔ Use column transformers or recipes to handle each data type separately

  • Scale numeric features

  • Encode categoricals (dummy or target encoding)

  • Transform text (TF-IDF, embeddings)

✔ Normalize or rescale after combining if needed

  • Especially important if you’re using distance-based models (SVMs, KNN)

📌 Python: Combining Text with Structured Data Using ColumnTransformer

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['price', 'rating']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['category']),
    ('txt', TfidfVectorizer(), 'review')
])
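A minimal sketch of how this preprocessor could slot into a full pipeline; the column names follow the example above, while the target column and the classifier choice are purely illustrative:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipe = Pipeline(steps=[
    ('prep', preprocessor),
    ('clf', LogisticRegression(max_iter=1000)),
])

# df_train is assumed to hold 'price', 'rating', 'category', 'review', and a 'target' column
pipe.fit(df_train[['price', 'rating', 'category', 'review']], df_train['target'])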

📌 R: Using recipes to Combine Text and Tabular Data

library(tidymodels)
library(textrecipes)  # provides step_tokenize() and step_tf()

recipe_combined <- recipe(target ~ ., data = df) %>%
  step_tokenize(review) %>%
  step_tf(review) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

✅ Combine text with other features in a clean and modular way that supports deployment.

3️⃣ Avoid Overfitting with Sparse Text Features

TF-IDF and n-gram matrices often create very high-dimensional, sparse datasets. This can lead to overfitting, especially on small datasets.

🛠 Mitigation strategies (a code sketch follows this list):

  • Limit vocabulary size (max_features or min_df/max_df)

  • Use dimensionality reduction (e.g., Truncated SVD on TF-IDF)

  • Try embeddings instead, especially for deeper models
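A minimal sketch of the first two strategies, assuming a df_train['text'] column; the parameter values are illustrative, not recommendations:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Cap the vocabulary and drop very rare / very common terms
vectorizer = TfidfVectorizer(max_features=5000, min_df=5, max_df=0.9)
X_tfidf = vectorizer.fit_transform(df_train['text'])

# Compress the sparse matrix into a small number of dense components
svd = TruncatedSVD(n_components=100, random_state=42)
X_reduced = svd.fit_transform(X_tfidf)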

4️⃣ Choose Features Based on the Model Type

Model Type | Preferred Text Features
Linear Models | TF-IDF, n-grams
Tree-based Models | Aggregated embeddings, keyword flags
Neural Networks | Word or sentence embeddings
Ensemble Models | Combine structural, TF-IDF, and text flags

✅ Not all models benefit from raw TF-IDF or embeddings—choose based on interpretability and performance needs.

📌 Summary: Text Feature Engineering Done Right

✅ Best Practices | ❌ Pitfalls to Avoid
Preprocess after splitting | Preprocessing before the split causes leakage
Use multiple feature types | Relying only on TF-IDF or BoW
Combine text with structured data | Ignoring interactions across features
Limit dimensionality of sparse features | Feeding 10k sparse terms into a random forest

Text offers rich signals, but it must be handled carefully, especially when integrating into broader modeling pipelines.

Conclusion & What’s Next

Text data is often seen as messy and complex—but with the right tools and techniques, it becomes one of the most powerful sources of predictive insight in your entire dataset.

Throughout this article, we’ve explored how to turn raw text into structured, meaningful features, ready to drive performance in your models.

✅ Key Takeaways from Text Feature Engineering

✔ Even basic features like word count, punctuation, and readability can be surprisingly effective.
✔ TF-IDF and n-grams help capture content, frequency, and intent across many documents.
✔ Word embeddings (GloVe, Word2Vec) add semantic understanding, and transformer models like BERT allow context-aware representations.
✔ Avoid data leakage—always split your data first, and fit vectorizers/embeddings only on the training set.
✔ Combine text with structured features in clean, modular pipelines using recipes (R) or ColumnTransformer (Python).
✔ Match your feature type to the model you’re using—and be wary of high dimensionality.

Good text features are not only useful in NLP tasks—they can also supercharge churn models, sentiment prediction, fraud detection, and more.

📌 What’s Next in the Series?

Up next is the final article in the Feature Engineering Series:

“The Final Cut – Choosing the Right Features for the Job”

We’ll cover:
🔹 How to evaluate feature usefulness (correlation, importance, permutation)
🔹 Feature selection strategies (filter, wrapper, embedded)
🔹 When fewer features beat more
🔹 Interpreting model feedback to refine your inputs
🔹 Best practices for pruning, testing, and simplifying your feature space

📦 Whether your features come from text, time-series, categories, or numbers—this final episode will help you pick the best ones and ditch the rest.