library(tidyverse)
library(readxl)
= "files/CH-175 Remove consecutive X.xlsx"
path = read_excel(path, range = "C2:D10")
input = read_excel(path, range = "G2:G10") %>%
test replace(is.na(.), "")
= input %>%
result mutate(ID = str_remove_all(ID, "[xX]{2,}")) %>%
select(ID)
all.equal(result, test, check.attributes = FALSE)
Omid - Challenge 175
Challenge Description
🔰 In the ID column, remove all instances of “X” if it appears consecutively more than once.
🔗 Link to Excel file: 👉https://lnkd.in/gNWC_HzJ
Solutions
Logic:
str_remove_all(ID, "[xX]{2,}")
: Matches two or more consecutivex
orX
and removes them from the string.replace(is.na(.), "")
: HandlesNA
values by replacing them with an empty string.
Strengths:
Conciseness: The
mutate
andstr_remove_all
functions make the transformation clear and efficient.Robustness: Handles missing values (
NA
) gracefully.
Areas for Improvement:
- None; the solution is well-suited for the task.
Gem:
- The use of
str_remove_all
simplifies the regex operation, keeping the code compact and readable.
- The use of
import pandas as pd
import re
= "CH-175 Remove consecutive X.xlsx"
path
input = pd.read_excel(path, usecols="C:D", skiprows=1, nrows=9)
= pd.read_excel(path, usecols="G", skiprows=1, nrows=9).fillna("").rename(columns=lambda x: x.split('.')[0])
test
input['ID'] = input['ID'].apply(lambda x: re.sub(r'[xX]{2,}', '', x))
print(input[['ID']].equals(test)) # True
Logic:
re.sub(r'[xX]{2,}', '', x)
: This regex matches two or more consecutivex
orX
characters and removes them.apply
: Applies the regex substitution to each value in theID
column.
Strengths:
Efficiency: Regex handles all consecutive cases in a single pass.
Clarity: The regex pattern is straightforward and self-explanatory.
Areas for Improvement:
- Edge Cases: If the
ID
column containsNaN
, ensure these are handled gracefully.
- Edge Cases: If the
Gem:
- The use of
[xX]{2,}
in regex is concise and effectively handles both lowercase and uppercaseX
.
- The use of
Difficulty Level
This task is moderate to challenging:
Requires a strong understanding of row-wise operations and lag/lead handling.
Balancing edge case handling (e.g., first and last rows) with efficiency can be non-trivial.