Omid - Challenge 175

Published

January 17, 2025

Challenge Description

🔰 In the ID column, remove all instances of “X” if it appears consecutively more than once.

🔗 Link to Excel file: 👉https://lnkd.in/gNWC_HzJ

Solutions

library(tidyverse)
library(readxl)

path = "files/CH-175 Remove consecutive X.xlsx"
input = read_excel(path, range = "C2:D10")
test  = read_excel(path, range = "G2:G10") %>%
  replace(is.na(.), "")

result = input %>%
  mutate(ID = str_remove_all(ID, "[xX]{2,}")) %>%
  select(ID)

all.equal(result, test, check.attributes = FALSE)
  • Logic:

    • str_remove_all(ID, "[xX]{2,}"): Matches two or more consecutive x or X and removes them from the string.

    • replace(is.na(.), ""): Handles NA values by replacing them with an empty string.

  • Strengths:

    • Conciseness: The mutate and str_remove_all functions make the transformation clear and efficient.

    • Robustness: Handles missing values (NA) gracefully.

  • Areas for Improvement:

    • None; the solution is well-suited for the task.
  • Gem:

    • The use of str_remove_all simplifies the regex operation, keeping the code compact and readable.
import pandas as pd
import re

path = "CH-175 Remove consecutive X.xlsx"

input = pd.read_excel(path, usecols="C:D", skiprows=1, nrows=9)
test = pd.read_excel(path, usecols="G", skiprows=1, nrows=9).fillna("").rename(columns=lambda x: x.split('.')[0])

input['ID'] = input['ID'].apply(lambda x: re.sub(r'[xX]{2,}', '', x))

print(input[['ID']].equals(test)) # True
  • Logic:

    • re.sub(r'[xX]{2,}', '', x): This regex matches two or more consecutive x or X characters and removes them.

    • apply: Applies the regex substitution to each value in the ID column.

  • Strengths:

    • Efficiency: Regex handles all consecutive cases in a single pass.

    • Clarity: The regex pattern is straightforward and self-explanatory.

  • Areas for Improvement:

    • Edge Cases: If the ID column contains NaN, ensure these are handled gracefully.
  • Gem:

    • The use of [xX]{2,} in regex is concise and effectively handles both lowercase and uppercase X.

Difficulty Level

This task is moderate to challenging:

  • Requires a strong understanding of row-wise operations and lag/lead handling.

  • Balancing edge case handling (e.g., first and last rows) with efficiency can be non-trivial.