Advent of Code is an annual advent calendar of programming puzzles themed around helping Santa’s elves prepare for Christmas. The whimsical setting masks the fact that many puzzles call for serious algorithmic problem-solving, especially towards the end of the calendar. In a previous article, we discussed the importance of algorithmic thinking for data scientists even as AI-assisted coding becomes the norm. With Advent of Code 2025 having wrapped up last month, this article takes a closer look at a selection of problems from the event that are especially relevant for data scientists. We will sketch out some interesting solution approaches in Python, highlighting algorithms and libraries that can be leveraged in a wide array of real-world data science use cases.
Navigating Tachyon Manifolds with Sets and Dynamic Programming
The first problem we will look at is Day 7: Laboratories. We are given a tachyon manifold in a file called input_d7.txt, as shown below:
.......S.......
...............
.......^.......
...............
......^.^......
...............
.....^.^.^.....
...............
....^.....^....
...............
...^.^...^.^...
...............
..^...^.....^..
...............
.^...^.^.....^.
...............
A tachyon beam (“|”) starts at the top of the manifold and travels downward. If the beam hits a splitter (“^”), it splits into two beams, one on either side of the splitter. Part One of the puzzle asks us to determine the number of times a beam will split given a set of initial conditions (starting point of the beam and the manifold layout). Note that simply counting the number of splitters and multiplying by two will not give the correct answer, since overlapping beams are only counted once, and some splitters are never reached by any of the beams. We can leverage set algebra to account for these constraints as shown in the implementation below:
import functools

def find_all_indexes(s, ch):
    """Return a set of all positions where character ch appears in s."""
    return {i for i, c in enumerate(s) if c == ch}

with open("input_d7.txt") as f:
    first_row = f.readline()   # row containing initial beams ('S')
    f.readline()               # skip separator line
    rows = f.readlines()       # remaining manifold rows

beam_ids = find_all_indexes(first_row, "S")  # active beam column positions
split_counter = 0                            # total number of splits

for row_index, line in enumerate(rows):
    # Only even-indexed rows contain splitters
    if row_index % 2 != 0:
        continue
    # Find splitter positions in this row
    splitter_ids = find_all_indexes(line, "^")
    # Beams that hit a splitter (intersection)
    hits = beam_ids.intersection(splitter_ids)
    split_counter += len(hits)
    # New beams created by splits (left and right)
    if hits:
        new_beams = functools.reduce(lambda acc, h: acc.union({h - 1, h + 1}), hits, set())
    else:
        new_beams = set()
    # Update active beams (add new beams, remove beams that hit splitters)
    beam_ids = beam_ids.union(new_beams).difference(splitter_ids)

print(split_counter)
We use the intersection operation to identify the splitters that are directly hit by active beams coming from above. New beams are created to the left and right of every splitter that is hit, but overlapping beams are only counted once thanks to the union operator. The set of beams emerging from each layer of splitters in the tachyon manifold is computed by folding pairwise set unions together with functools.reduce, a higher-order function commonly seen in functional programming that helps keep the code concise. The difference operator ensures that the original beams incident on the splitters are not counted among the set of outgoing active beams.
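If the reduce construct feels dense, the same set of new beams can be written as a plain set comprehension; this equivalent one-liner (not part of the original solution) assumes the hits set computed above:
# Equivalent to the functools.reduce expression: collect the columns
# immediately to the left and right of every splitter that was hit
new_beams = {col for h in hits for col in (h - 1, h + 1)}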
In a classical system, if a tachyon particle is sent through the manifold and encounters a splitter, the particle can only continue along one unique path to the left or right of the splitter. Part Two of the puzzle introduces a quantum version of this setup, in which a particle simultaneously goes down both the left and right paths, effectively spawning two parallel timelines. Our task is to determine the total number of timelines that exist after a particle has traversed all viable paths in such a quantum tachyon manifold. This problem can be solved efficiently using dynamic programming as shown below:
from functools import lru_cache

def count_timelines_with_dfs_and_memo(path):
    """Count distinct quantum timelines using DFS + memoization (top-down DP)"""
    with open(path) as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    height = len(lines)
    width = len(lines[0])
    # Find starting column
    start_col = next(i for i, ch in enumerate(lines[0]) if ch == "S")

    @lru_cache(maxsize=None)
    def dfs_with_memo(row, col):
        """Return number of timelines from (row, col) to bottom using DFS + memoization"""
        # Out of bounds horizontally: no valid timeline
        if col < 0 or col >= width:
            return 0
        # Past the bottom row: one complete timeline
        if row == height:
            return 1
        if lines[row][col] == "^":
            # Split left and right
            return dfs_with_memo(row + 1, col - 1) + dfs_with_memo(row + 1, col + 1)
        else:
            # Continue straight down
            return dfs_with_memo(row + 1, col)

    return dfs_with_memo(1, start_col)

print(count_timelines_with_dfs_and_memo("input_d7.txt"))
Recursive depth-first search with memoization is used to set up a top-down form of dynamic programming, where each subproblem is solved once and reused multiple times. Two base cases are defined: a valid timeline is not created if a particle goes out of bounds horizontally, and a complete timeline is counted once the particle reaches the bottom of the manifold. The recursive step accounts for two cases: whenever the particle reaches a splitter, it branches into two timelines, otherwise it continues straight down in the existing timeline. Memoization (using the @lru_cache decorator) prevents recalculation of known values when multiple paths converge at the same location in the manifold.
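The same count can also be obtained without recursion by sweeping the manifold from top to bottom and propagating the number of timelines arriving at each column, i.e., a bottom-up form of dynamic programming. The following sketch is an alternative to the memoized DFS; it assumes the same input_d7.txt layout (an "S" in the first row and rows of equal width):
def count_timelines_bottom_up(path):
    """Count timelines by propagating per-column counts row by row (bottom-up DP)."""
    with open(path) as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    width = len(lines[0])
    # counts[col] = number of timelines arriving at the current row in column col
    counts = [0] * width
    counts[lines[0].index("S")] = 1
    for row in lines[1:]:
        new_counts = [0] * width
        for col, c in enumerate(counts):
            if c == 0:
                continue
            if row[col] == "^":
                # Each arriving timeline splits into a left and a right timeline
                if col - 1 >= 0:
                    new_counts[col - 1] += c
                if col + 1 < width:
                    new_counts[col + 1] += c
            else:
                # Timelines continue straight down
                new_counts[col] += c
        counts = new_counts
    # After the last row, counts holds the completed timelines per column
    return sum(counts)

print(count_timelines_bottom_up("input_d7.txt"))  # should agree with the memoized DFS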
In practice, data scientists can use the tools and techniques described above in a variety of situations. The concept of beam splitting is similar in some ways to the proliferation of data packets in a complex communications network. Simulating the cascading process is a bit like modeling supply chain disruptions, epidemics, and information diffusion. At a more abstract level, the puzzle can be framed as a constrained graph traversal or path counting problem. Set algebra and dynamic programming are versatile concepts that data scientists can use to solve such seemingly difficult algorithmic problems.
Building Circuits with Nearest Neighbor Search
The next problem we will look at is Day 8: Playground. We are provided with a list of triples that represent the 3D location coordinates of electrical junction boxes in a file called input_d8.txt, as shown below:
162,817,810
59,618,56
901,360,560
…
In Part One, we are asked to successively identify and connect pairs of junction boxes that are closest together in terms of straight-line (or Euclidean) distance. Connected boxes form a circuit through which electricity can flow. The task is ultimately to report the result of multiplying together the sizes of the three largest circuits after connecting the 1000 pairs of junction boxes that are closest together. One neat solution involves using a min-heap to store pairs of junction box coordinates. Following is an implementation based on an instructive video by James Peralta:
from collections import defaultdict
import heapq
from math import dist as euclidean_dist

# Load points
with open("input_d8.txt") as f:
    points = [tuple(map(int, line.split(","))) for line in f.read().split()]

k = 1000

# Build min-heap of all pairwise distances
dist_heap = [
    (euclidean_dist(points[i], points[j]), points[i], points[j])
    for i in range(len(points))
    for j in range(i + 1, len(points))
]
heapq.heapify(dist_heap)

# Take k shortest edges and build adjacency list
neighbors = defaultdict(list)
for _ in range(k):
    _, a, b = heapq.heappop(dist_heap)
    neighbors[a].append(b)
    neighbors[b].append(a)

# Use DFS to compute component size
def dfs(start, seen):
    stack = [start]
    seen.add(start)
    size = 0
    while stack:
        node = stack.pop()
        size += 1
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return size

# Compute sizes of all connected components
seen = set()
sizes = [dfs(p, seen) for p in points if p not in seen]

# Derive final answer
sizes.sort(reverse=True)
a, b, c = sizes[:3]
print("Solution:", a * b * c)
A min-heap is a binary tree in which parent nodes have values less than or equal to the values of their child nodes; this ensures that the smallest value is stored at the top of the tree and can be accessed efficiently. In the above solution, this helpful property of min-heaps is used to quickly identify the nearest neighbors among the given junction boxes. The 1,000 nearest pairs thus identified form the edges of a graph connecting the junction boxes in 3D space. Depth-first search is used to traverse the graph starting from a given junction box and count the number of boxes that are in the same connected graph component (i.e., circuit).
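Since only the k shortest distances are ever popped from the heap, heapq.nsmallest offers a slightly more compact way to obtain the same edges; the sketch below is a minor variation on the approach above and assumes the same input file and value of k:
import heapq
from math import dist as euclidean_dist

with open("input_d8.txt") as f:
    points = [tuple(map(int, line.split(","))) for line in f.read().split()]
k = 1000

# nsmallest keeps a bounded internal heap while scanning all pairs,
# yielding the k shortest edges without heapifying the full distance list
shortest_edges = heapq.nsmallest(
    k,
    (
        (euclidean_dist(points[i], points[j]), points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ),
)
print(shortest_edges[0])  # the single closest pair and its distance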
In Part Two, resource scarcity is introduced (not enough extension cables). We must now continue connecting the closest unconnected pairs of junction boxes together until they are all part of one large circuit. The required answer is the result of multiplying together the x-coordinates of the last two junction boxes that get connected. To solve this problem, we can use a union-find data structure and Kruskal’s algorithm for building minimum spanning trees as follows:
import heapq
from math import dist as euclidean_dist

# Load points
with open("input_d8.txt") as f:
    points = [tuple(map(int, line.split(","))) for line in f.read().split()]

# Build min-heap of all pairwise distances
dist_heap = [
    (euclidean_dist(a, b), a, b)
    for i, a in enumerate(points)
    for b in points[i + 1:]
]
heapq.heapify(dist_heap)

# Define functions to implement Union-Find
parent = {p: p for p in points}

def find(x):
    if parent[x] != x:
        parent[x] = find(parent[x])
    return parent[x]

def union(a, b):
    ra, rb = find(a), find(b)
    if ra == rb:
        return False
    parent[rb] = ra
    return True

# Use Kruskal's algorithm to connect points until all are in one component
edges_used = 0
last_pair = None
while dist_heap:
    _, a, b = heapq.heappop(dist_heap)
    if union(a, b):
        edges_used += 1
        last_pair = (a, b)
        if edges_used == len(points) - 1:
            break

# Derive final answer
x_product = last_pair[0][0] * last_pair[1][0]
print(x_product)
The location data is stored in a min-heap and connected graph components are built. We repeatedly take the shortest remaining edge between two points and only keep that edge if it connects two previously unconnected components; this is the basic idea behind Kruskal’s algorithm. But to do this efficiently, we need a way of quickly determining whether two points are already connected. If yes, then union(a, b) == False, and we skip the edge to avoid creating a cycle. Otherwise, we merge their graph components. Union-find is a data structure that can perform this check in nearly constant time. To use a corporate analogy, it is a bit like asking “Who is your boss?” repeatedly until you reach the CEO and then rewriting the value of everyone’s boss to be the name of the CEO (i.e., the root). Next time, when someone asks, “Who is your boss?”, you can quickly respond with the CEO’s name. If the roots of two nodes are the same, the respective components are merged by attaching one root to the other.
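To make the analogy concrete, here is a toy run of the same find and union functions on three hypothetical junction boxes; the coordinates are made up purely for illustration:
# Toy illustration of union-find behaviour on made-up points
parent = {p: p for p in [(0, 0, 0), (1, 1, 1), (2, 2, 2)]}

def find(x):
    if parent[x] != x:
        parent[x] = find(parent[x])  # path compression: point directly at the root
    return parent[x]

def union(a, b):
    ra, rb = find(a), find(b)
    if ra == rb:
        return False  # already in the same component
    parent[rb] = ra
    return True

print(union((0, 0, 0), (1, 1, 1)))  # True  -> two components merged
print(union((1, 1, 1), (2, 2, 2)))  # True  -> all three points now share one root
print(union((0, 0, 0), (2, 2, 2)))  # False -> edge would create a cycle, so it is skipped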
The circuit-building problem relates to clustering and community detection, which are important concepts to know for real-life data science use cases. For example, building graph components by identifying nearest neighbors can be part of a practical algorithm for grouping customers by similarity of preferences, detecting communities in social networks, and clustering geographical locations. Kruskal’s algorithm can be used to design and optimize networks by minimizing routing costs. Abstract concepts such as Euclidean distances, min-heaps, and union-find help us measure, prioritize, and organize data at scale.
Configuring Factory Machines with Linear Programming
Next, we will walk through the problem posed in Day 10: Playground. We are given a manual for configuring factory machines in a file called input_d10.txt as shown below:
[.##.] (2) (0,3) (2) (2,3) (0,2) (0,1) {3,5,4,7}
[..##.] (0,2,3) (2,3) (0,4) (0,1,2) (1,2,3,4) {7,5,12,8,2}
[.###.#] (0,1,2,3) (0,3,4) (0,1,2,4,5) (1,2) {10,11,9,5,10,5}
Each line describes one machine. The number of characters in the square brackets reflects the number of indicator lights and their desired states (“.” means off and “#” on). All lights will initially be off. Button wiring schematics are shown in parentheses; e.g., pressing the button with schematic “(2, 3)” will flip the current states of the indicator lights at positions 2 and 3 from “.” to “#” or vice versa. The objective of Part One is to determine the minimum button presses needed to correctly configure the indicator lights on all given machines. An elegant solution using mixed‑integer linear programming (MILP) is shown below:
import re
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Parse a single machine description line
def parse_machine(line: str):
    # Extract light pattern
    match = re.search(r"\[([.#]+)\]", line)
    if not match:
        raise ValueError(f"Invalid line: {line}")
    pattern = match.group(1)
    m = len(pattern)
    # Target vector: '#' -> 1, '.' -> 0
    target = np.fromiter((ch == "#" for ch in pattern), dtype=int)
    # Extract button wiring
    buttons = [
        [int(x) for x in grp.split(",")] if grp.strip() else []
        for grp in re.findall(r"\(([^)]*)\)", line)
    ]
    # Build toggle matrix A
    n = len(buttons)
    A = np.zeros((m, n), dtype=int)
    for j, btn in enumerate(buttons):
        for idx in btn:
            if not (0 <= idx < m):
                raise ValueError(f"Button index {idx} out of range in line: {line}")
            A[idx, j] = 1
    return A, target

total_presses = 0
with open("input_d10.txt") as f:
    for line in f:
        if not line.strip():
            continue
        A, target = parse_machine(line)
        m, n = A.shape
        # Variables: n binary presses x, followed by m non-negative integer slacks k
        c = np.concatenate([np.ones(n), np.zeros(m)])     # minimize total presses
        A_eq = np.hstack([A, -2 * np.eye(m, dtype=int)])  # A x - 2 k = t
        constraints = LinearConstraint(A_eq, target, target)
        integrality = np.ones(n + m)                      # all variables are integers
        bounds = Bounds(np.zeros(n + m), np.concatenate([np.ones(n), np.full(m, np.inf)]))
        res = milp(c=c, constraints=constraints, integrality=integrality, bounds=bounds)
        if res.success:
            total_presses += int(round(res.x[:n].sum()))
print(total_presses)
First, each machine is encoded as a matrix A in which the rows are the lights and the columns are the buttons. A[i, j] = 1 if button j toggles light i. Regular expressions are used for pattern matching on the input data. Next, we set up the optimization problem with a binary button‑press vector x, integer slack variables k, and a target light pattern t. For each machine, our aim is to choose button presses x such that x_j = 1 if the j-th button is pressed and 0 otherwise. The condition “after pressing buttons x, the lights equal target t” reflects the congruence Ax ≡ t (mod 2), but since the MILP solver cannot deal with mod 2 directly, we express the condition as Ax – 2k = t, for some vector k consisting only of non-negative integers; this reformulation works because subtracting an even number does not change parity. The integrality specification says that the first n variables (the button presses) are binary and the remaining m variables (slack) are non-negative integers. We then run the MILP solver with the objective of minimizing the number of button presses needed to reach the target state. If the solver succeeds, res.x[:n] contains the optimal button‑press choices and the code adds the number of pressed buttons to a running total.
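To see the parity reformulation in action, consider a hypothetical machine with two lights and three buttons; the matrix, press vector, and slack values below are made up for illustration, and a quick numpy check confirms that Ax – 2k reproduces the target while (Ax) mod 2 captures the on/off states:
import numpy as np

# Hypothetical machine: 2 lights, 3 buttons
A = np.array([[1, 1, 0],   # light 0 is toggled by buttons 0 and 1
              [0, 1, 1]])  # light 1 is toggled by buttons 1 and 2
t = np.array([1, 0])       # target: light 0 on, light 1 off
x = np.array([0, 1, 1])    # candidate presses: buttons 1 and 2
k = np.array([0, 1])       # slack: light 1 is toggled twice, i.e. 2 * 1

print((A @ x) % 2)         # [1 0] -> matches t (mod-2 view)
print(A @ x - 2 * k)       # [1 0] -> matches t (MILP-friendly linear view)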
In Part Two, the task is to reach a target state described by the so-called “joltage” requirements, which are shown in curly braces for each machine. The joltage counters of a machine are initially set to 0, and buttons can be pressed any number of times to update the joltage levels. For example, the first machine starts with joltage values “{0, 0, 0, 0}”. Pressing button “(3)” once, “(1, 3)” three times, “(2,3)” three times, “(0,2)” once, and (0,1) twice produces the target state “{3, 5, 4, 7}”. This also happens to be the fewest button presses needed to reach the target state. Our task is to compute the minimum number of button presses needed to attain the target joltage states for all machines. Again, this can be solved using MILP as follows:
import re
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def parse_machine(line: str):
    # Extract joltage requirements
    match = re.search(r"\{([^}]*)\}", line)
    if not match:
        raise ValueError(f"No joltage requirements in line: {line}")
    target = np.fromiter((int(x) for x in match.group(1).split(",")), dtype=int)
    m = len(target)
    # Extract button wiring
    buttons = [
        [int(x) for x in grp.split(",")] if grp.strip() else []
        for grp in re.findall(r"\(([^)]*)\)", line)
    ]
    # Build A (m × n)
    n = len(buttons)
    A = np.zeros((m, n), dtype=int)
    for j, btn in enumerate(buttons):
        for idx in btn:
            if not (0 <= idx < m):
                raise ValueError(f"Counter index {idx} out of range in line: {line}")
            A[idx, j] = 1
    return A, target

total_presses = 0
with open("input_d10.txt") as f:
    for line in f:
        if not line.strip():
            continue
        A, target = parse_machine(line)
        n = A.shape[1]
        # Minimize total presses subject to A x = target, with x >= 0 and integer
        res = milp(c=np.ones(n), constraints=LinearConstraint(A, target, target),
                   integrality=np.ones(n), bounds=Bounds(lb=np.zeros(n)))
        if res.success:
            total_presses += int(round(res.x.sum()))
print(total_presses)
Whereas Part One was a parity problem, Part Two is a counting problem. The core constraint of Part Two can be captured by the linear equation Ax = t, and no slack variables are needed. In a way, Part Two is reminiscent of the integer knapsack problem, where a knapsack must be filled with the right combination of differently weighted/sized objects.
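As a sanity check on the arithmetic, the worked example above can be verified with a few lines of numpy; the button wiring below follows the presses quoted in the text, i.e., buttons (3), (1,3), (2,3), (0,2), and (0,1), rather than the machine line reproduced earlier:
import numpy as np

# Arithmetic check of the worked example: buttons (3), (1,3), (2,3), (0,2), (0,1)
# pressed 1, 3, 3, 1, 2 times respectively
A = np.array([
    [0, 0, 0, 1, 1],  # counter 0 is wired to buttons (0,2) and (0,1)
    [0, 1, 0, 0, 1],  # counter 1 is wired to buttons (1,3) and (0,1)
    [0, 0, 1, 1, 0],  # counter 2 is wired to buttons (2,3) and (0,2)
    [1, 1, 1, 0, 0],  # counter 3 is wired to buttons (3), (1,3) and (2,3)
])
x = np.array([1, 3, 3, 1, 2])
print(A @ x)     # [3 5 4 7] -> matches the target joltage state
print(x.sum())   # 10 button presses in total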
Optimization problems such as these are often a feature of data science use cases in domains like logistics, supply chain management, and financial portfolio management. The underlying aim is to minimize or maximize some objective function subject to various constraints. Data scientists would also do well to master the use of modular arithmetic; see this article for a conceptual overview of modular arithmetic and an exploration of its practical use cases in data science. Finally, there is an interesting conceptual link between MILP and the notion of feature selection with regularization in machine learning. Feature selection is about choosing the least number of features to train a model without adversely affecting predictive performance. Using MILP is like performing an explicit combinatorial search over feature subsets with pruning and optimization. L1 regularization amounts to a continuous relaxation of MILP; the L1 penalty nudges the coefficients of unimportant features towards zero. L2 regularization relaxes the MILP constraints even further by shrinking the coefficients of unimportant features without setting them to exactly zero.
Reactor Troubleshooting with Network Analysis
The last problem we will look at is Day 11: Reactor. We are provided with a dictionary representation of a network of nodes and edges in a file called input_d11.txt as shown below:
you: hhh ccc
hhh: ccc fff iii
…
iii: out
The keys and values are source and destination nodes (or devices as per the problem storyline), respectively. In the above example, node “you” is connected to nodes “hhh” and “ccc”. The task in Part One is to count the number of different paths through the network that go from node “you” to “out”. This can be done using depth-first search as follows:
from collections import defaultdict

def parse_input(filename):
    """
    Parse the input file into a directed graph.
    Each line has the format: source: dest1 dest2 ...
    """
    graph = defaultdict(list)
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            src, dests = line.split(":")
            src = src.strip()
            for d in dests.strip().split():
                graph[src].append(d.strip())
    return graph

def dfs_paths(graph, start, goal):
    """
    Generate all paths from start to goal using DFS.
    """
    stack = [(start, [start])]
    while stack:
        (node, path) = stack.pop()
        for next_node in graph.get(node, []):
            if next_node in path:
                # Avoid cycles
                continue
            if next_node == goal:
                yield path + [next_node]
            else:
                stack.append((next_node, path + [next_node]))

def solve_d11_part1(filename):
    graph = parse_input(filename)
    all_paths = list(dfs_paths(graph, "you", "out"))
    print(len(all_paths))

solve_d11_part1("input_d11.txt")
We use an explicit stack to implement the search. Each stack entry holds information about the current node and the path so far. For each neighbor, we skip it if it is already in the path, yield the completed path if the neighbor is the “out” node, or push the neighbor and the updated path onto the stack to continue our exploration of the remaining network. The search process thus enumerates all valid paths from “you” to “out” and the final code output is the count of distinct valid paths.
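A toy network makes the behaviour of the generator easy to see; the node names below are made up for illustration, and the example reuses the dfs_paths function defined above:
# Hypothetical miniature network for illustration
toy_graph = {
    "you": ["aaa", "bbb"],
    "aaa": ["bbb", "out"],
    "bbb": ["out"],
}
for path in dfs_paths(toy_graph, "you", "out"):
    print(" -> ".join(path))
# you -> bbb -> out
# you -> aaa -> out
# you -> aaa -> bbb -> out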
In Part Two, we are asked to count the number of paths that go from “svr” to “out” via nodes “dac” and “fft”. The constraint of intermediate nodes effectively restricts the number of valid paths in the network. Following is a sample solution:
from collections import defaultdict
from functools import lru_cache

def parse_input(filename):
    graph = defaultdict(list)
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            src, dests = line.split(":")
            src = src.strip()
            dests = [d.strip() for d in dests.strip().split()]
            graph[src].extend(dests)
            for d in dests:
                if d not in graph:
                    graph[d] = []
    return graph

def count_paths_with_constraints(graph, start, goal, must_visit):
    must_visit = frozenset(must_visit)

    @lru_cache(maxsize=None)
    def dfs(node, seen_required):
        seen_required = frozenset(seen_required)
        if node == goal:
            return 1 if seen_required == must_visit else 0
        total = 0
        for nxt in graph[node]:
            # Avoid cycles by not revisiting nodes already in seen_required+path
            # Instead of tracking full path, we assume DAG or small cycles
            new_seen = seen_required | (frozenset([nxt]) & must_visit)
            total += dfs(nxt, new_seen)
        return total

    return dfs(start, frozenset([start]) & must_visit)

def solve_d11_part2(filename):
    graph = parse_input(filename)
    must_visit = {"dac", "fft"}
    total_valid_paths = count_paths_with_constraints(graph, "svr", "out", must_visit)
    print(total_valid_paths)

solve_d11_part2("input_d11.txt")
The code builds on the logic of Part One, so that we now additionally keep track of visits to the intermediate nodes “dac” and “fft” within the depth-first search routine. As in the quantum tachyon manifold puzzle, we leverage memoization to preempt redundant computations.
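As a quick check, the same function can be applied to a made-up miniature network with a single required intermediate node; of the three paths from "you" to "out", only the two passing through "bbb" are counted:
# Hypothetical miniature network, reusing count_paths_with_constraints from above
toy_graph = {
    "you": ["aaa", "bbb"],
    "aaa": ["bbb", "out"],
    "bbb": ["out"],
}
print(count_paths_with_constraints(toy_graph, "you", "out", {"bbb"}))  # 2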
Problems involving network analysis are a staple of data science. Path enumeration is directly relevant to use cases concerning telecommunications, internet routing, and power grid optimization. Complex ETL pipelines are often represented as networks (e.g., directed acyclic graphs), and path counting algorithms can be used to identify critical dependencies or bottlenecks in the workflow. In the context of recommender engines powered by knowledge graphs, analyzing paths flowing through the graph can help with the interpretation of recommender responses. Such recommenders can use paths between entities to justify recommendations, making the system transparent by showing how a suggested item is connected to a user’s known preferences – after all, we can explicitly trace the reasoning.
The Wrap
In this article we have seen how the playful scenarios that form the narratives of Advent of Code puzzles can surface genuinely powerful ideas, ranging from graph search and optimization to linear programming, combinatorics, and constraint solving. By dissecting these problems and experimenting with different solution strategies, data scientists can sharpen their algorithmic instincts and build a versatile toolkit that transfers directly to practical work spanning feature engineering, model interpretability, optimization pipelines, and more. As AI-assisted coding continues to evolve, the ability to frame, solve, and critically reason about such problems will likely remain a key differentiator for data scientists. Advent of Code offers a fun, low‑stakes way to keep those skills sharp – readers are encouraged to attempt the other puzzles in the 2025 edition and experience the joy of cracking tough problems using algorithmic thinking.