Omid - Challenge 174

Published

January 16, 2025

Challenge Description

The task is to filter out rows where there is a greater value within two days before or after the current row. For example, if a value at a given row is less than any value in the range of two rows before and after, it is excluded.

🔗 Link to Excel file: 👉https://lnkd.in/gV9WvDx62

Solutions

library(tidyverse)
library(readxl)

path = "files/CH-174 Filtering.xlsx"
input = read_excel(path, range = "C2:D25")
test  = read_excel(path, range = "F2:G7")

result = input %>%
  filter(!pmax(lag(Value, 1, default = 0) > Value,
               lag(Value, 2, default = 0) > Value,
               lead(Value, 1, default = 0) > Value,
               lead(Value, 2, default = 0) > Value))

all.equal(result, test, check.attributes = FALSE)
  • Logic:

    • lag(Value, n): Accesses the n-th previous row value for comparison.

    • lead(Value, n): Accesses the n-th next row value for comparison.

    • pmax(...): Evaluates whether any of the lagged or lead values are greater than the current value.

    • !pmax(...): Negates the result to keep rows where no greater value exists in the two-day window.

  • Strengths:

    • Conciseness: The use of lag, lead, and pmax makes the logic clear and compact.

    • Clarity: The logic directly mirrors the task requirements.

  • Areas for Improvement:

    • Edge Cases: Ensure default = 0 is appropriate for the dataset. For example, if negative values are present, 0 may not work as a default.
  • Gem:

    • The combination of lag, lead, and pmax elegantly captures the two-day filtering logic in a straightforward manner.
import pandas as pd

path = "CH-174 Filtering.xlsx"
input = pd.read_excel(path, usecols="C:D", skiprows=1, nrows=24, names=['Index', 'Value'])
test = pd.read_excel(path, usecols="F:G", skiprows=1, nrows=5, names=['Index', 'Value'])

def filter_values(df):
    df['All_Lagged_Lead_Lower'] = (df['Value'].shift(1, fill_value=0) < df['Value']) & \
                                  (df['Value'].shift(2, fill_value=0) < df['Value']) & \
                                  (df['Value'].shift(-1, fill_value=0) < df['Value']) & \
                                  (df['Value'].shift(-2, fill_value=0) < df['Value'])
    return df[df['All_Lagged_Lead_Lower']][['Index', 'Value']]

result = filter_values(input).reset_index(drop=True)
print(result.equals(test))  # True
  • Logic:

    • shift(n): Retrieves the value n rows before (n > 0) or after (n < 0) the current row.

    • Logical AND (&): Ensures the current value is greater than all lagged and lead values within the two-day window.

    • fill_value=0: Handles edge cases where lagged or lead values do not exist.

  • Strengths:

    • Explicit Logic: The filtering logic is broken into a clear, step-by-step process.

    • Reusability: Encapsulating the logic in a function (filter_values) makes it easy to apply to other datasets.

  • Areas for Improvement:

    • Efficiency: While readable, the row-wise filtering (shift and comparison) could be computationally expensive for large datasets.
  • Gem:

    • The use of shift with fill_value=0 handles edge cases gracefully and ensures no missing data in comparisons.

Difficulty Level

This task is moderate to challenging:

  • Requires a strong understanding of row-wise operations and lag/lead handling.

  • Balancing edge case handling (e.g., first and last rows) with efficiency can be non-trivial.