Timing is Everything – Feature Engineering for Time-Series Data
In the next installment of our Feature Engineering Series, we’re diving into one of the most distinctive and complex data types: time-series data. Unlike standard tabular data, time-series data comes with a natural order, and understanding temporal patterns is crucial for building effective models.
This article will cover:
✅ Why time matters in modeling.
✅ Essential time-based features to extract.
✅ Rolling windows, lag features, and cumulative aggregations.
✅ Handling missing timestamps in time-series data.
✅ Practical examples using R and Python.
📌 Chapter 1: Why Time-Series Data Needs Special Treatment
1️⃣ Time Is More Than Just a Timestamp
When working with time-series data, the order of events matters just as much as the events themselves. Unlike static datasets where rows are independent, time-series data often contains:
Trends — Overall increases or decreases over time (e.g., sales steadily increasing).
Seasonality — Regular patterns that repeat (e.g., weekly or monthly sales peaks).
Shocks & Events — Sudden outliers caused by special occurrences (e.g., product launches or economic crises).
Ignoring these time-based patterns and treating dates as just another categorical variable can lead to:
❌ Loss of valuable temporal relationships
❌ Poor forecasting accuracy
❌ Overfitting to irrelevant patterns
2️⃣ Extracting Meaningful Time-Based Features
Every date or timestamp holds hidden temporal signals we can expose through feature engineering. These include:
✔ Day of week (Mon, Tue, etc.)
✔ Month (Jan, Feb, etc.)
✔ Quarter (Q1, Q2, etc.)
✔ Weekday/Weekend flag
✔ Holiday flag (if you have a holiday calendar; see the sketch after this list)
✔ Cumulative days since the first record
✔ Yearly day index (day-of-year)
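For the holiday flag, pandas ships a US federal holiday calendar that works as a starting point. A minimal sketch, assuming a df with a Date column (swap in whichever calendar fits your market):
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
# US federal calendar used here purely as an example
dates = pd.to_datetime(df['Date'])
cal = USFederalHolidayCalendar()
holidays = cal.holidays(start=dates.min(), end=dates.max())
df['is_holiday'] = dates.isin(holidays).astype(int)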
📌 R: Automatically Extracting Time Features Using timetk
Instead of manually creating every feature, the tk_get_timeseries_signature() function from the timetk package can extract an entire set of useful time-based features in one step.
library(timetk)
library(dplyr)
df <- df %>% mutate(Date = as.Date(Date))
df <- bind_cols(df, tk_get_timeseries_signature(df$Date))
The tk_get_timeseries_signature() function will add columns like:
Date | index.num | year | quarter | month | day | wday.lbl | is_weekend | hour | minute | second |
---|---|---|---|---|---|---|---|---|---|---|
2025-01-01 | 1 | 2025 | 1 | 1 | 1 | Wed | 0 | NA | NA | NA |
2025-01-02 | 2 | 2025 | 1 | 1 | 2 | Thu | 0 | NA | NA | NA |
✅ This single function saves a lot of manual coding while ensuring the features are consistently extracted for any datetime column.
📌 Python: Manually Extracting Date Features Using pandas
While Python doesn’t have a direct equivalent to tk_get_timeseries_signature(), you can achieve the same result with pandas:
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
df['day_of_week'] = df['Date'].dt.day_name()
df['is_weekend'] = df['day_of_week'].isin(['Saturday', 'Sunday']).astype(int)
df['month'] = df['Date'].dt.month
df['quarter'] = df['Date'].dt.quarter
df['day_of_year'] = df['Date'].dt.dayofyear
df['cumulative_days'] = (df['Date'] - df['Date'].min()).dt.days + 1
✅ Key Takeaway
In R, tk_get_timeseries_signature() is a time-saver for fast and comprehensive feature extraction, especially for quick exploratory analysis or building time-aware models.
In Python, we rely on manual feature creation with pandas, which offers flexibility but requires more effort.
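If you build these columns often, it can be worth wrapping the pandas recipe in a small reusable helper. A sketch; the function name add_time_signature is our own, not a library API:
import pandas as pd

def add_time_signature(df, date_col='Date'):
    # Loosely mirrors tk_get_timeseries_signature(): one call, many time features
    d = pd.to_datetime(df[date_col])
    df['year'] = d.dt.year
    df['quarter'] = d.dt.quarter
    df['month'] = d.dt.month
    df['day'] = d.dt.day
    df['day_of_week'] = d.dt.day_name()
    df['is_weekend'] = (d.dt.dayofweek >= 5).astype(int)
    df['day_of_year'] = d.dt.dayofyear
    df['cumulative_days'] = (d - d.min()).dt.days + 1
    return df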
📌 Summary of Key Time-Based Features
Feature Name | Example Value | Why It’s Useful |
---|---|---|
Day of Week | Wednesday | Captures weekly patterns |
Month | January | Captures seasonal patterns |
Quarter | Q1 | Tracks quarterly trends |
Is Weekend | 1 (True) | Captures weekend behavior |
Day of Year | 125 | Tracks year progress |
Cumulative Days | 200 | Captures time elapsed since start |
✅ These time-based features allow machine learning models to recognize trends, cycles, and seasonal effects that are hard to detect from raw timestamps alone.
📌 Chapter 2: Lag Features, Rolling Windows & Aggregations
When working with time-series data, the past matters. Future outcomes are often influenced by recent trends, making lag features and rolling aggregations essential for helping models learn from historical patterns.
This chapter explores:
✅ Lag features — capturing values from prior time periods.
✅ Rolling windows — summarizing recent trends over a time window.
✅ Cumulative features — tracking running totals over time.
✅ Practical examples in R and Python.
1️⃣ Lag Features – Learning from the Past
🔹 What Are Lag Features?
A lag feature is a copy of a variable shifted back by a given number of time steps. It allows a model to see past values directly, which is especially useful in forecasting and anomaly detection.
📌 Example: Lagging Sales by 1 Day
Date | Sales | Lag_1_Sales |
---|---|---|
2025-01-01 | 100 | NA |
2025-01-02 | 150 | 100 |
2025-01-03 | 80 | 150 |
The Lag_1_Sales column lets the model see the previous day’s sales when predicting today’s.
📌 R: Creating Lag Features with slide_dbl() from slider
library(slider)
library(dplyr)
df <- df %>%
  mutate(Lag_1_Sales = slide_dbl(Sales, ~ .x, .before = 1, .after = -1, .complete = TRUE))
📌 Python: Creating Lag Features with shift() from pandas
df['Lag_1_Sales'] = df['Sales'].shift(1)
✅ Lag features are critical for models that need to capture time dependence, such as ARIMA, LSTMs, or any time-series regression model.
❌ They introduce missing values at the start of the series, which you need to handle (e.g., drop, impute).
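In practice, the warm-up rows with no lag value are usually dropped before training. A minimal pandas sketch:
# Drop the rows whose lag is undefined (the first row for a 1-step lag)
df_model = df.dropna(subset=['Lag_1_Sales'])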
2️⃣ Rolling Features – Capturing Trends Over Time
🔹 What Are Rolling Windows?
Rolling features capture aggregated values (e.g., mean, sum, min, max) over a fixed time window, helping the model understand short-term trends.
📌 Example: 3-Day Rolling Average for Sales
Date | Sales | Rolling_3Day_Avg |
---|---|---|
2025-01-01 | 100 | NA |
2025-01-02 | 150 | NA |
2025-01-03 | 80 | 110 |
2025-01-04 | 60 | 96.67 |
The rolling average smooths out short-term fluctuations, helping the model focus on broader trends.
📌 R: Creating Rolling Features with slide_mean() from slider
df <- df %>%
  mutate(Rolling_3Day_Avg = slide_mean(Sales, before = 2, complete = TRUE))
📌 Python: Creating Rolling Features with rolling() from pandas
df['Rolling_3Day_Avg'] = df['Sales'].rolling(window=3).mean()
✅ Rolling features are extremely useful in smoothing out noise in time-series data.
❌ Like lag features, they introduce missing values at the start, especially for larger windows.
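In pandas, one way to soften the warm-up problem is min_periods, which lets the window emit a value as soon as it has at least that many observations. A sketch:
# Produces a value from the first row onward, averaging however many points exist so far
df['Rolling_3Day_Avg'] = df['Sales'].rolling(window=3, min_periods=1).mean()
Whether a partial-window average is acceptable is a modeling choice; for strict comparability you may prefer to keep the NAs and drop them.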
3️⃣ Cumulative Features – Tracking Running Totals
🔹 What Are Cumulative Features?
Cumulative features track the running sum, mean, or other aggregation over the entire time series up to the current point.
📌 Example: Cumulative Sales
Date | Sales | Cumulative_Sales |
---|---|---|
2025-01-01 | 100 | 100 |
2025-01-02 | 150 | 250 |
2025-01-03 | 80 | 330 |
These features help the model understand the long-term trajectory.
📌 R: Creating Cumulative Features with cumsum()
df <- df %>%
  mutate(Cumulative_Sales = cumsum(Sales))
📌 Python: Creating Cumulative Features with cumsum()
df['Cumulative_Sales'] = df['Sales'].cumsum()
✅ Cumulative features are particularly useful in financial and inventory forecasting.
❌ They don’t work well for cyclical data where seasonality matters more than cumulative trends.
4️⃣ Combining Lag, Rolling, and Cumulative Features
The real power comes when you combine these techniques into feature sets that capture both short-term and long-term trends.
📌 Example
Date | Sales | Lag_1 | Rolling_3 | Cumulative |
---|---|---|---|---|
2025-01-01 | 100 | NA | NA | 100 |
2025-01-02 | 150 | 100 | NA | 250 |
2025-01-03 | 80 | 150 | 110 | 330 |
2025-01-04 | 60 | 80 | 96.67 | 390 |
✅ This kind of combined feature set gives your model the best chance at capturing both local and global trends.
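A minimal pandas sketch that builds all three feature families in one pass, assuming daily data with Date and Sales columns:
# Sort first so shifts and windows follow true time order
df = df.sort_values('Date')
df['Lag_1'] = df['Sales'].shift(1)                       # yesterday's value
df['Rolling_3'] = df['Sales'].rolling(window=3).mean()   # short-term trend
df['Cumulative'] = df['Sales'].cumsum()                  # long-term trajectory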
📌 Summary: Key Temporal Features for Time-Series Models
Feature Type | Description | Example |
---|---|---|
Lag Features | Prior values to capture short-term influence | Previous day’s sales |
Rolling Features | Aggregated values over recent time windows | 7-day average sales |
Cumulative Features | Running totals capturing long-term trajectory | Total sales to date |
✅ Mastering these features is crucial for effective forecasting, anomaly detection, and trend modeling.
📌 Chapter 3: Handling Irregular Time-Series & Missing Dates
Real-world time-series data is rarely perfect. Dates may be missing, gaps can appear between observations, and time intervals may be inconsistent. Whether you’re working with daily sales data, sensor logs, or financial time-series, handling these irregularities correctly is critical to building reliable models.
In this chapter, we’ll cover:
✅ Detecting missing dates in time-series datasets.
✅ Filling missing dates with placeholders.
✅ Interpolating missing values to keep trends intact.
✅ Special considerations for seasonal data.
1️⃣ Detecting Missing Dates
The first step is identifying where the gaps are—do dates skip weekends, are some weeks partially missing, or do you only have irregular data like log entries from events?
📌 R: Finding Missing Dates with tidyr and tibble
library(tidyr)
library(tibble)
library(dplyr)
all_dates <- tibble(Date = seq(min(df$Date), max(df$Date), by = "day"))
df_complete <- full_join(all_dates, df, by = "Date")
missing_dates <- filter(df_complete, is.na(Sales))
📌 Python: Finding Missing Dates with pandas
import pandas as pd
all_dates = pd.DataFrame({'Date': pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')})
df_complete = pd.merge(all_dates, df, on='Date', how='left')
missing_dates = df_complete[df_complete['Sales'].isna()]
2️⃣ Filling Missing Dates with Placeholders
Once you’ve identified missing dates, you need to decide how to fill them. In some cases, the absence of data is meaningful (e.g., no sales on a holiday), but in others, you may need to fill gaps to maintain a consistent timeline.
Options for Filling Missing Values:
Approach | Example | When to Use |
---|---|---|
Fill with Zeros | No sales = 0 | Retail data where missing = no activity |
Forward Fill | Carry forward last known value | Sensor data where readings rarely change |
Interpolate | Linearly estimate between known points | Gradual trends like temperatures |
📌 R: Filling Gaps with tidyr::fill()
df_complete <- df_complete %>%
  fill(Sales, .direction = "down")  # Forward fill
📌 Python: Filling Gaps with fillna()
df_complete['Sales'] = df_complete['Sales'].ffill()  # Forward fill
✅ Filling gaps ensures models see a complete time-series, rather than fragmented pieces.
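And for the “missing = no activity” case from the table above, a zero fill is usually the honest choice. A one-line sketch:
# Retail-style data: a missing day means nothing was sold, so fill with 0
df_complete['Sales'] = df_complete['Sales'].fillna(0)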
3️⃣ Interpolating Missing Values
When data should follow a trend (e.g., temperatures, financial prices), interpolation can estimate missing values based on the shape of surrounding data.
📌 R: Interpolation Using zoo
library(zoo)
df_complete$Sales <- na.approx(df_complete$Sales, na.rm = FALSE)  # keep leading/trailing NAs in place
📌 Python: Interpolation with pandas
df_complete['Sales'] = df_complete['Sales'].interpolate()
✅ Interpolation is especially useful for sensors, environmental data, and continuous tracking.
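If your timestamps are unevenly spaced, pandas can weight the interpolation by the actual gap length with method='time'. A sketch, assuming the Date column holds datetimes and can serve as the index:
# method='time' requires a DatetimeIndex; longer gaps get proportionally larger steps
s = df_complete.set_index('Date')['Sales']
df_complete['Sales'] = s.interpolate(method='time').to_numpy()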
4️⃣ Special Considerations for Seasonal Data
If your data follows clear seasonal cycles (daily energy demand, weekly sales peaks), simple interpolation could distort seasonal patterns. In those cases, it’s better to:
✔ Use seasonal averages to fill gaps (e.g., average Tuesday sales).
✔ Avoid interpolation across seasonal breaks (e.g., year-end).
📌 R: Seasonal Gap Filling with Grouped Averages
library(dplyr)
df_complete <- df_complete %>%
  mutate(day_of_week = weekdays(Date)) %>%
  group_by(day_of_week) %>%
  mutate(Sales = ifelse(is.na(Sales), mean(Sales, na.rm = TRUE), Sales)) %>%
  ungroup()
📌 Python: Seasonal Gap Filling with Grouped Averages
df_complete['day_of_week'] = df_complete['Date'].dt.day_name()
df_complete['Sales'] = df_complete.groupby('day_of_week')['Sales'].transform(lambda x: x.fillna(x.mean()))
✅ This preserves seasonal effects while filling gaps, which is crucial for forecasting seasonal data.
5️⃣ Summary: Handling Missing Dates & Values in Time-Series Data
Challenge | Solution | Example |
---|---|---|
Missing dates | Rebuild full timeline | Add missing dates with NA sales |
Missing values | Forward fill | Carry forward last known sales value |
Gaps in continuous data | Interpolate | Linearly estimate missing values |
Missing seasonal data | Fill with seasonal averages | Use typical Tuesday sales for missing Tuesdays |
✅ Handling missing dates correctly keeps your time-series intact, so models can detect real patterns instead of reacting to data gaps.
📌 Chapter 4: Best Practices & Common Pitfalls for Time-Series Feature Engineering
Feature engineering for time-series data comes with unique challenges that don’t exist in typical tabular datasets. The order of data matters, and failing to respect that order can lead to data leakage, unrealistic results, and models that fail in production.
In this chapter, we’ll cover:
✅ The importance of respecting time order when engineering features.
✅ Common pitfalls with lag features, rolling windows, and cumulative features.
✅ The right way to handle time-aware cross-validation.
✅ Practical guidelines for building reliable time-based models.
1️⃣ Always Respect Time Order – No Peeking into the Future
The golden rule for time-series modeling:
You can only use information available up to the prediction point.
❌ Common Mistake: Calculating global statistics using the full dataset
If you compute, for example, the average daily sales using the entire dataset, your training data accidentally sees the future—this is data leakage.
✅ Instead:
Compute historical rolling averages that only use data before the prediction point.
Use expanding windows (cumulative metrics up to that date) instead of windows that extend past the prediction point into the future.
📌 Example in Python – Expanding Mean (No Leakage)
df['Cumulative_Avg_Sales'] = df['Sales'].expanding().mean()
✅ This approach only uses past data at each point.
📌 Example in R – Expanding Mean with slider
df <- df %>%
  mutate(Cumulative_Avg_Sales = slide_dbl(Sales, mean, .before = Inf, .complete = TRUE))
2️⃣ Lag Features – Handle Missing Data & Gaps Properly
Lag features are powerful, but they break if your time-series has irregular gaps. If a date is missing, the “Lag 1” feature could actually reference data from days (or even weeks) earlier.
✅ Instead:
Fill in missing dates first, even if you just insert placeholders.
Create lag features after ensuring all dates are present.
Common Pitfall: Using Raw shift() or slide() on Irregular Data
Date | Sales | Lag_1 (wrong) |
---|---|---|
2025-01-01 | 100 | NA |
2025-01-04 | 80 | 100 |
✅ Solution: Use a complete date index and then calculate lags.
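A minimal pandas sketch of that fix, assuming a datetime Date column and daily data:
# Rebuild a complete daily index first; missing days become NaN rows
df = df.set_index('Date').asfreq('D')
df['Lag_1'] = df['Sales'].shift(1)   # Lag 1 now always means "one day earlier"
df = df.reset_index()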
3️⃣ Rolling Aggregates – Don’t Use Future Data
Rolling averages and cumulative metrics are valuable, but they must only use data up to the point being predicted.
❌ Pitfall: Centered Rolling Windows (Leakage)
Date | Sales | Centered 3-Day Avg (wrong) |
---|---|---|
2025-01-02 | 150 | Avg(2025-01-01, 02, 03) |
The centered average includes future data, which wouldn’t be available at prediction time.
✅ Only use trailing windows.
📌 Example in Python – Proper Trailing Rolling Average
df['Rolling_3Day_Avg'] = df['Sales'].rolling(window=3).mean()
📌 Example in R – Trailing Rolling Average with slider
df <- df %>%
  mutate(Rolling_3Day_Avg = slide_dbl(Sales, mean, .before = 2, .complete = TRUE))
4️⃣ Time-Based Cross-Validation – Avoid Shuffling
In regular machine learning, we often shuffle data into random training and testing sets.
❌ Pitfall: Shuffling Time-Series Data
Shuffling destroys the time order, making it impossible to test if the model can predict future data based on past data.
✅ Instead:
Use time-based cross-validation (each fold trains on a past window and tests on the next future window).
Ensure the test set always comes after the training set.
📌 Example in R – Time-Based Resampling with rsample
library(rsample)
splits <- rolling_origin(df, initial = 365, assess = 30, cumulative = TRUE)
📌 Example in Python – Time-Based Split with sklearn
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
✅ This setup ensures every test set simulates real-world future prediction scenarios.
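A sketch of how the folds are consumed, assuming a feature matrix X and target y as NumPy arrays:
for train_idx, test_idx in tscv.split(X):
    # Each fold trains strictly on the past and evaluates on the window that follows
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]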
5️⃣ Summary: Best Practices for Time-Series Feature Engineering
Best Practice | Explanation |
---|---|
Respect time order | Use only data available up to the prediction point |
Fill gaps before creating lags | Avoid misaligned lag features |
Use trailing windows only | Centered windows cause future leakage |
Time-aware cross-validation | Ensure train data always precedes test data |
Avoid global transformations | Any computation across full data (means, medians) leaks future info |
✅ Following these best practices ensures your time-based features are valid, reliable, and production-ready.
📌 Conclusion & Next Steps
Time-series feature engineering is a critical skill for building robust forecasting and predictive models. Unlike traditional tabular data, time-series data has an inherent order, and failing to respect it can lead to data leakage, overfitting, or misleading predictions.
1️⃣ Key Takeaways from This Guide
✔ Time is more than just a timestamp—Extracting meaningful features (day of the week, seasonality, cumulative trends) can significantly improve model performance.
✔ Lag features help capture past trends—but they must be carefully aligned, especially when handling missing timestamps.
✔ Rolling and cumulative aggregations reveal patterns—but should only use past data to avoid leakage.
✔ Handling missing timestamps is crucial—Always fill gaps properly before engineering features.
✔ Cross-validation for time-series must be time-aware—Random shuffling is not an option!
✅ Mastering these techniques allows you to extract real value from time-series data and build stronger predictive models.
2️⃣ What’s Next?
This article is part of the Feature Engineering Series, where we explore how to transform raw data into meaningful insights.
🚀 Next up: “Beyond the Numbers – Feature Engineering for Text Data”
In the next article, we’ll cover:
📌 Text vectorization techniques (TF-IDF, word embeddings).
📌 Extracting meaningful text-based features (sentiment, keyword frequency).
📌 NLP-powered feature engineering for machine learning.
🔹 Want to stay updated? Keep an eye out for the next post in the series!
💡 What’s your favorite time-series feature engineering trick? Drop your thoughts below! ⬇⬇⬇