Find Duplicated File in System

- February 19, 2018

This problem requires more attention to the data structures and parsing than to the algorithm itself. Here it is: https://leetcode.com/problems/find-duplicate-file-in-system/description/

Given a list of directory info including directory path, and all the files with contents in this directory, you need to find out all the groups of duplicate files in the file system in terms of their paths.

A group of duplicate files consists of at least two files that have exactly the same content.

A single directory info string in the input list has the following format:

"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"

It means there are n files (f1.txt, f2.txt ... fn.txt with content f1_content, f2_content ... fn_content, respectively) in directory root/d1/d2/.../dm. Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.
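
As a rough illustration of the format above, one directory info string can be split into a directory path and a list of (file name, content) pairs. The helper name `parse_directory_info` is hypothetical, and the sketch assumes that file names and contents contain no spaces and that the content itself contains no parentheses.

```python
def parse_directory_info(info):
    """Split one directory info string into (directory, [(file_name, content), ...])."""
    # Assumption: a single space separates the directory from each file entry.
    directory, *entries = info.split(" ")
    files = []
    for entry in entries:
        # Each entry looks like "name.txt(content)": split at the first "(".
        file_name, _, rest = entry.partition("(")
        # Drop the trailing ")" that closes the content.
        files.append((file_name, rest[:-1]))
    return directory, files


print(parse_directory_info("root/a 1.txt(abcd) 2.txt(efgh)"))
# → ('root/a', [('1.txt', 'abcd'), ('2.txt', 'efgh')])
```

With m = 0 the directory part is simply `root`, so `parse_directory_info("root 4.txt(efgh)")` yields `('root', [('4.txt', 'efgh')])`.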

The output is a list of groups of duplicate file paths. Each group contains the paths of all the files that have the same content. A file path is a string that has the following format:

"directory_path/file_name.txt"

Example 1:

Input:
["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
Output: 
[["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]

The algorithm that comes to mind for an O(n) solution is a linear pass through the paths that indexes a hash table by file content, with the set of file paths holding that content as the value. A second pass over the hash table then builds the output, adding every content that has more than one file associated with it (so it is really two passes, which is still O(n)).
Parsing, casts and the choice of data structures are the key to solving this problem. It requires more focus on these details than deep thinking about the algorithm itself.
I was a little surprised, though, to see that my solution was that fast; I was expecting someone to come up with a single-pass solution instead of a two-pass one. Thanks, Marcelo.
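
The two passes described above can be sketched on Example 1 as follows. This is a minimal Python sketch rather than the C# solution in this post; `group_by_content` is a hypothetical name, and it assumes the space-free format from the problem statement.

```python
def group_by_content(paths):
    """Return the groups of file paths that share identical content."""
    index = {}  # content -> list of full file paths
    # Pass 1: index every file by its content.
    for info in paths:
        directory, *entries = info.split(" ")
        for entry in entries:
            file_name, _, rest = entry.partition("(")
            content = rest[:-1]  # strip the closing ")"
            index.setdefault(content, []).append(directory + "/" + file_name)
    # Pass 2: keep only the contents shared by two or more files.
    return [group for group in index.values() if len(group) > 1]


example = [
    "root/a 1.txt(abcd) 2.txt(efgh)",
    "root/c 3.txt(abcd)",
    "root/c/d 4.txt(efgh)",
    "root 4.txt(efgh)",
]
for group in group_by_content(example):
    print(group)
```

The groups may come out in a different order than in the expected output of Example 1; only the membership of each group matters here.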

using System.Collections;
using System.Collections.Generic;

public class Solution
{
    public IList<IList<string>> FindDuplicate(string[] paths)
    {
        // Maps file content -> Hashtable whose keys are the full paths with that content
        Hashtable files = new Hashtable();

        for (int i = 0; i < paths.Length; i++)
        {
            // First token is the directory, the rest are "name(content)" entries
            string[] parts = paths[i].Split(' ');
            string path = parts[0];

            for (int j = 1; j < parts.Length; j++)
            {
                string file = parts[j];
                int begin = file.IndexOf('(');
                int end = file.IndexOf(')');
                string content = file.Substring(begin + 1, end - begin - 1);

                // Full path of the file: directory + "/" + file name
                string key = path + '/' + file.Substring(0, begin);
                if (!files.ContainsKey(content))
                {
                    Hashtable htFile = new Hashtable();
                    htFile.Add(key, true);
                    files.Add(content, htFile);
                }
                else
                {
                    // Hashtable is a reference type, so there is no need to store it back
                    Hashtable htFile = (Hashtable)files[content];
                    if (!htFile.ContainsKey(key))
                    {
                        htFile.Add(key, true);
                    }
                }
            }
        }

        List<IList<string>> retVal = new List<IList<string>>();

        // Only contents shared by more than one file form a duplicate group
        foreach (string fileContent in files.Keys)
        {
            Hashtable htInnerFiles = (Hashtable)files[fileContent];
            if (htInnerFiles.Count > 1)
            {
                List<string> list = new List<string>();
                foreach (string ss in htInnerFiles.Keys)
                {
                    list.Add(ss);
                }
                retVal.Add(list);
            }
        }

        return retVal;
    }
}

Comments

Taras February 26, 2018 at 11:33 PM
I'm not sure why this problem was marked as having "Medium" difficulty level, but for easy problems like this I like to play with the language either to make it very concise like in

import collections

class Solution:
    def findDuplicate(self, paths):
        """
        :type paths: List[str]
        :rtype: List[List[str]]
        """
        index = collections.defaultdict(list)
        for path in paths:
            directory_name, *files = path.split(" ")
            for file in files:
                file_name, _, content = file.rpartition("(")
                content = content[:-1]
                index[content].append(directory_name + "/" + file_name)
        return [file_paths for _, file_paths in index.items() if len(file_paths) > 1]

or well structured:

import collections

Directory = collections.namedtuple("Directory", ["name", "files"])
File = collections.namedtuple("File", ["name", "content"])

def parse_file(file):
    """
    :type file: str
    :rtype: File
    """
    name, _, content = file.rpartition("(")
    return File(name=name, content=content[:-1])

def parse_directory(path):
    """
    :type path: str
    :rtype: Directory
    """
    name, *files = path.split(" ")
    return Directory(name=name, files=map(parse_file, files))

class Solution:
    def findDuplicate(self, paths):
        """
        :type paths: List[str]
        :rtype: List[List[str]]
        """
        index = collections.defaultdict(list)
        for directory in map(parse_directory, paths):
            for file in directory.files:
                index[file.content].append(directory.name + "/" + file.name)
        return [file_paths for _, file_paths in index.items() if len(file_paths) > 1]

Thanks for sharing, Marcelo!
Marcelo De Barros March 3, 2018 at 10:11 PM
Neat!!! :)


Another Casual Coder