Omid - Challenge 264

data-challenges
advanced-exercises
🔰 : Extract the Longest Repeated Pattern in Text Identify and extract the longest substring that repeats at least twice within the text.
Published

March 24, 2026

Illustration for Omid - Challenge 264

Challenge Description

🔰 : Extract the Longest Repeated Pattern in Text Identify and extract the longest substring that repeats at least twice within the text.

Solutions

library(tidyverse)
library(readxl)

path = "files/200-299/264/CH-264 Extract from Text.xlsx"
input = read_excel(path, range = "B2:B7")
test  = read_excel(path, range = "D2:E7")

extract_substrings <- function(x) {
  n      <- str_length(x)
  starts <- rep(seq_len(n), times = rev(seq_len(n)))
  lengths <- sequence(rev(seq_len(n)))
  
  purrr::map2_chr(starts, lengths, 
                  ~ str_sub(x, .x, .x + .y - 1))
}

result = input %>%
  mutate(substrings = map(Text, extract_substrings)) %>%
  unnest(substrings) %>%
  mutate(substring_length = str_length(substrings)) %>%
  mutate(n = n(), .by = c(Text, substrings)) %>%
  filter(n > 1) %>%
  distinct() %>%
  slice_max(substring_length, n = 1, by = Text) %>%
  select(Text, Pattern = substrings) 

# answer slightly different from the original, but it is correct
  • Logic:

    • Reads the workbook ranges needed for the challenge

    • Builds the intermediate columns that drive the final result

    • Parses the text patterns directly instead of relying on manual cleanup

  • Strengths:

    • The R solution stays close to the workbook rule and keeps the transformation compact.
  • Areas for Improvement:

    • The code assumes the sheet structure and source ranges remain stable.
  • Gem:

    • The strongest part of the solution is choosing the right intermediate representation before shaping the final output.
import pandas as pd

path = "200-299/264/CH-264 Extract from Text.xlsx"
input = pd.read_excel(path, usecols="B", skiprows=1, nrows=5)

def extract_substrings(x):
    n = len(x)
    return [x[i:j] for i in range(n) for j in range(i+1, n+1)]

df = input.assign(substrings=input['Text'].apply(extract_substrings)).explode('substrings')
df['len'] = df['substrings'].str.len()
df['n'] = df.groupby(['Text', 'substrings'])['substrings'].transform('count')
filtered = df[df['n'] > 1]
max_len = filtered.groupby('Text')['len'].transform('max')
result = filtered[filtered['len'] == max_len][['Text', 'substrings']].drop_duplicates().rename(columns={'substrings': 'Pattern'})

print(result)
  • Logic:

    • Reads the workbook ranges needed for the challenge

    • Aggregates or ranks values at the relevant grouping level

    • Builds the intermediate columns that drive the final result

    • Applies the rule iteratively until the output stabilizes

  • Strengths:

    • The Python version follows the same rule in a direct dataframe-oriented implementation.
  • Areas for Improvement:

    • The code assumes the workbook layout remains stable, so any sheet redesign would require small adjustments.
  • Gem:

    • The implementation stays close to the original workbook rule instead of adding unnecessary abstraction.

Difficulty Level

This task is moderate:

  • The core logic is clear, but the correct transformation pattern is not obvious from the raw input.

  • The challenge combines multiple reshaping, grouping, or parsing steps.