Omid - Challenge 13

data-challenges
advanced-exercises
🔰 Determine how frequently Words from words list are comes together in article titles.
Published

March 24, 2026

Illustration for Omid - Challenge 13

Challenge Description

🔰 Determine how frequently Words from words list are comes together in article titles.

Solutions

library(tidyverse)
library(readxl)

input1 = read_excel("files/CH-013.xlsx", range = "B2:B13")
input2 = read_excel("files/CH-013.xlsx", range = "D3:F6", col_names = FALSE) %>%
  unite("new", 1:3, sep = "|", remove = T, na.rm = T) 

test = read_excel("files/CH-013.xlsx", range = "H2:L6") %>% 
  select(2:5) %>%
  as.matrix(.) %>%
  replace(is.na(.), 0)

colnames(test) = c("pv", "wind", "bat", "ev")

result = input1 %>%
   mutate(pv = str_detect(`Article Titles`, input2$new[[1]]),
                     wind = str_detect(`Article Titles`, input2$new[[2]]),
                     bat = str_detect(`Article Titles`, input2$new[[3]]),
                     ev = str_detect(`Article Titles`, input2$new[[4]])) %>%
   mutate(across(pv:ev, ~ifelse(. == TRUE, 1, 0))) %>%
   mutate(pv_ev = ifelse(pv == 1 & ev == 1, 1, 0),
          wind_bat = ifelse(wind == 1 & bat == 1, 1, 0),
          pv_wind = ifelse(pv == 1 & wind == 1, 1, 0),
          bat_ev = ifelse(bat == 1 & ev == 1, 1, 0),
          pv_bat = ifelse(bat == 1 & pv == 1, 1, 0),
          wind_ev = ifelse(wind == 1 & ev == 1, 1, 0)) %>%
  select(-c(pv, wind, bat, ev)) %>%
  pivot_longer(cols = -`Article Titles`, names_to = "Technology", values_to = "Value") %>%
  mutate(Technology2 = str_replace(Technology, "([a-z]+)_([a-z]+)", "\\2_\\1")) %>%
  unite(Tech, Technology, Technology2, sep = "|", remove = T) %>%
  separate_rows(Tech, sep = "\\|") %>%
  separate(Tech, into = c("Technology", "Technology2"), sep = "_", remove = T) %>%
  select(-`Article Titles`) %>%
  pivot_wider(names_from = Technology2, values_from = Value, 
              values_fn = list(Value = sum), values_fill = list(Value = 0))  %>%
  select(` ` = Technology, pv, wind, bat, ev) %>%
  arrange(factor(` `, levels = c("pv", "wind", "bat", "ev"))) %>%
  select(-` `) %>%
  as.matrix(.)
  

identical(result, test)
# [1] TRUE
  • Logic:

    • Reads the workbook ranges needed for the challenge

    • Reshapes the data into the grain required by the task

    • Builds the intermediate columns that drive the final result

    • Parses the text patterns directly instead of relying on manual cleanup

  • Strengths:

    • The R solution stays close to the workbook rule and keeps the transformation compact.
  • Areas for Improvement:

    • The code assumes the sheet structure and source ranges remain stable.
  • Gem:

    • The strongest part of the solution is choosing the right intermediate representation before shaping the final output.
import pandas as pd

input1 = pd.read_excel("CH-013.xlsx", usecols="B", skiprows=1, nrows=12)
patterns = pd.read_excel("CH-013.xlsx", usecols="D:F", skiprows=2, nrows=4, header=None)
test = pd.read_excel("CH-013.xlsx", usecols="I:L", skiprows=1, nrows=5).iloc[:, 1:].fillna(0).astype(int)
test.columns = ["pv", "wind", "bat", "ev"]

regexes = patterns.apply(lambda row: "|".join(row.dropna().astype(str)), axis=1).tolist()
labels = ["pv", "wind", "bat", "ev"]

flags = pd.DataFrame({
    label: input1["Article Titles"].str.contains(regex, regex=True).astype(int)
    for label, regex in zip(labels, regexes)
})
cooc = flags.T.dot(flags)
cooc.index = labels
cooc.columns = labels

print(cooc.equals(test))
  • Logic:

    • Reads the workbook ranges needed for the challenge

    • Parses the text patterns directly instead of relying on manual cleanup

    • Applies the rule iteratively until the output stabilizes

  • Strengths:

    • The Python version follows the same rule in a direct dataframe-oriented implementation.
  • Areas for Improvement:

    • The code assumes the workbook layout remains stable, so any sheet redesign would require small adjustments.
  • Gem:

    • The implementation stays close to the original workbook rule instead of adding unnecessary abstraction.

Difficulty Level

This task is moderate:

  • The core logic is clear, but the correct transformation pattern is not obvious from the raw input.

  • The challenge combines multiple reshaping, grouping, or parsing steps.