Searching repositories for files with forbidden strings

Question 1

I wrote a small tool for searching git repositories for forbidden strings that I don't want to commit publicly. It's still something basic but it does the job.

The project contains three files.

config.py - this is the file where I store the secrects like which directories to scan and which strings to look for in a long regex (here redacted). This is commited to a VSTS repository because they are private.

import re
paths = [
 "c:\\..",
 "c:\\.."
 ]
re_forbidden = re.compile(r"(class)")

main.py - this is the core file. It uses os.walk to list all files and directories. Not all directories and files are processed. Paths that don't get commited like .git or bin and files like dll are skipped. Paths and the re_forbitten regex are imported from the above config.py. When found something suspicious then it just outputs the path of the affected file.

import os
import time
import itertools
import shutil
import re
import importlib.util
config_spec = importlib.util.spec_from_file_location("config", "c:\\home\\projects\\classified\\python\\sanipyzer\\config.py")
config = importlib.util.module_from_spec(config_spec)
config_spec.loader.exec_module(config)
from pprint import pprint
ignore_dirs = [".git", "bin", "obj", ".vs"]
ignore_files = ["dll", "exe", "pdb", "map"]
def format_filemtime(path):
 filemtime = os.path.getmtime(path)
 return time.strftime('%Y-%m-%d', time.gmtime(filemtime))
def ignore_dir(dirpath):
 for dir in ignore_dirs:
 pattern = r"\\" + re.escape(dir) + r"(\\|$)"
 if re.search(pattern, dirpath):
 return True
 return False
def ignore_file(file_name):
 for ext in ignore_files:
 pattern = r"\." + ext + "$"
 if re.search(pattern, file_name):
 return True
 return False
def sanitize(path): 
 start = time.perf_counter()
 for (dirpath, dirnames, filenames) in os.walk(path):
 if ignore_dir(dirpath):
 continue
 searchable_filenames = [filename for filename in filenames if not ignore_file(filename)]
 for filename in searchable_filenames:
 full_name = os.path.join(dirpath, filename)
 # without 'ignore' it throws for some files the UnicodeDecodeError 'utf-8' codec can't decode byte XXX in position XXX: invalid start byte
 with open(full_name, 'r', encoding="utf8", errors="ignore") as searchable:
 text = searchable.read()
 if config.re_forbidden.search(text):
 pprint(full_name)
 end = time.perf_counter()
 elapsed = round(end - start,2)
 print(f"elapsed: {elapsed} sec")
# --- --- ---
def main(): 
 for path in config.paths:
 sanitize(path)
if __name__ == '__main__':
 main()

test.py - this file is a very simple unittest. It tests the two ignore methods.

import unittest
from main import ignore_dir
from main import ignore_file
class MainTest(unittest.TestCase):
 def test_ignore_dir(self):
 self.assertTrue(ignore_dir("c:\\temp\\.git"))
 self.assertTrue(ignore_dir("c:\\temp\\.git\\abc"))
 def test_ignore_file(self):
 self.assertTrue(ignore_file("c:\\temp\\project\\program.dll"))
 self.assertFalse(ignore_file("c:\\temp\\project\\program.cs"))
 # tests the test
 #self.assertTrue(ignore_file("c:\\temp\\project\\program.cs"))
if __name__ == "__main__":
 # use False to avoid the traceback "Exception has occurred: SystemExit"
 unittest.main(exit=False)

What do you think of it? Is this an acceptable solution or are there still many beginner mistakes? I wasn't sure about the two ignore lists whether they are const and should be UPPER_CASE or are they just simple variables?

Question 2

Have you considered turning this into a git pre-commit hook? Then the rules would be automatically enforced (unless you commit with the --no-verify option). Also, you can use diff --cached to only consider the content that changed.

Question 3

@200_success yeah, thanks! I've discovered it just after I was done with the refactoring after receiving the review. I mentioned it in the last paragraph in my self-answer ;-)

Question 4

In general, it is a good code. However, I have a few remarks. Note: I am assuming that the code works perfectly as it should and will comment only on the style. Also, I assume that you are an experienced programmer and hence will not be explaining the basics, although if something is not clear please let me know.

Configuration

You might want to keep the real paths separate to a configuration script. I'd advice you to create a *.cfg file that would store all the configuration data. I can imagine this being used on different machines with different paths each. I don't think you would like to allow users to modify the code, it is better to give them config file.

In case you would like to keep the configuration as it is, I'd encourage to move the lists to config.py and name (as you mentioned) in UPPER_CASE. Additionally, you might also want to encapsulate them in a functions like this:

def get_ignored_dirs():
 return ignored_dirs

This allows to keep the API of config.py constant regardless on what you decide to do with inners of the configuration.

Lastly, you might consider moving your functions like ignore_dir to separate file (e.g. files_utils.py) and treat is as a proxy to your configuration.

Method naming

In general I tend do explicitly say that some function returns boolean value by sticking is at the beginning. For example, it is clear (from the code) that your function ignore_dir checks if the directory should be ignored. This is however, not clear from the name, one might think that this function ignores some directory. I'd suggest to change the name to is_ignored_dir or something similar. The same goes for ignore_file.

Separation of concerns and responsibilities

It seems for me that your sanitize function is not very SRP-like. Inside, you do many things including deciding whether to ignore a directory (or file), read the file and count time. I'd suggest dividing those responsibilities to more than one function. You might notice that if you try to do a unit test for this function, in reality it will be a e2e test.

Below, an example of what you can fix it.

def measured(func):
 def wrapper(path):
 start = time.perf_counter()
 func(path)
 end = time.perf_counter()
 elapsed = round(end - start,2)
 print(f"elapsed: {elapsed} sec")
 return wrapper
def searchable_files_from_list(filenames):
 for filename in filenames:
 if not is_ignore_file(filename):
 yield filename
def searchable_files_containing_forbidden_text(filenames):
 for filename in searchable_files_from_list(filenames):
 full_name = os.path.join(dirpath, filename)
 if contains_forbidden_text(full_name):
 yield full_name
def contains_forbidden_text(filename):
 with open(full_name, 'r', encoding="utf8", errors="ignore") as searchable:
 text = searchable.read()
 if config.re_forbidden.search(text):
 return True
 return False
def files_to_sanitize(path):
 for (dirpath,dirname,filenames) in os.walk(path):
 if is_ignored_dir(dirpath):
 continue
 for file in searchable_files_containing_forbidden_text(filenames):
 yield file
@measured
def print_files_to_sanitize(path):
 for file in files_to_sanitize(path):
 pprint(file)

Explanation:

Now, all of your methods can be separately unit tested and validated. Let's take look at each function at a time, starting with the last one.

print_files_to_sanitize This is what is left from your sanitize function. After the refactoring it become apparent that the function does not sanitize anything, it just prints a file name. This fact wasn't so clear before because of many things happening.

I've added a decorator to the function - this is a very common practice for measuring time of the method (you might want to generalize this by using *args and **kw).

files_to_sanitize is a generator that should give you all of the files that should be sanitized. It iterates through your catalogs ignoring some directories and getting the files from a second generator.

searchable_files_containing_forbidden_text Given a list of files it iterates over those which have a good file name (given by searchable_files_from_list function) and if such file contains a forbidden text it bubbles it up.

Question 5

Yes, I tested it with all my repositories and also with one that intentionally contained some illegal strings and after some initial problems it does its job well now. You're right, I'm not a complete beginner, just in python but I'll figure the new stuff out (and there are few things that I haven't seen yet) - but there are some links ;-). This is great review!

Question 6

I've refactored this and splitted the single method into multiple smaller ones. While reading about the *.cfg there was a link to json so I used this for the config. Now I have two of them: a local one, with only a single path:

{
 "private_config_path":"c:\\home\\projects\\classified\\python\\sanipyzer\\config.json"
}

and a secret one with all the rules:

{
 "scan_dirs": [
 "c:\\..", 
 ],
 "ignore_dirs": [
 ".git",
 "bin",
 "obj",
 ".vs"
 ],
 "ignore_files": [
 "dll",
 "exe",
 "pdb",
 "map"
 ],
 "forbidden_pattern": "..."
}

I've read about the *args and **kwargs and moved the stopwatch to a decorator (this is really cool and I already miss it in C#). This is now a part of my reusable.py.

import time
def log_elapsed(func):
 def measure(*args, **kw):
 start = time.perf_counter()
 func(*args, **kw)
 end = time.perf_counter()
 elapsed = round(end - start, 2)
 print(f"'{func.__name__}' elapsed: {elapsed} sec")
 return measure

The main.py has grown a little bit with its multiple functions but there are no global variables anymore and all values are passed via parameters. I'm not sure this is the best way and looks somehow verbose but I'll keep reading and the next script will be better. At the time of writing it I still don't know anything about classes in python.

import os
import time
import itertools
import shutil
import re
import json
import importlib.util
from pprint import pprint
from reusable import log_elapsed
def format_filemtime(path):
 filemtime = os.path.getmtime(path)
 return time.strftime('%Y-%m-%d', time.gmtime(filemtime))
def is_ignored_dir(dirpath, ignore_dirs):
 for dir in ignore_dirs:
 pattern = r"\\" + re.escape(dir) + r"(\\|$)"
 if re.search(pattern, dirpath):
 return True
 return False
def is_ignored_file(file_name, ignore_files):
 for ext in ignore_files:
 pattern = r"\." + ext + "$"
 if re.search(pattern, file_name):
 return True
 return False
def contains_forbidden_text(filename, re_forbidden):
 with open(filename, encoding="utf8", errors="ignore") as searchable:
 text = searchable.read()
 return True if re_forbidden.search(text) else False
def files_containing_forbidden_text(dirpath, filenames, re_forbidden):
 for filename in filenames:
 full_name = os.path.join(dirpath, filename)
 if contains_forbidden_text(full_name, re_forbidden):
 yield filename
def searchable_files(filenames, ignore_files):
 for filename in filenames:
 if not is_ignored_file(filename, ignore_files):
 yield filename
def files_to_sanitize(path, ignore_dirs, ignore_files, re_forbidden): 
 for (dirpath, dirname, filenames) in os.walk(path):
 if is_ignored_dir(dirpath, ignore_dirs):
 continue
 filenames = searchable_files(filenames, ignore_files)
 filenames = files_containing_forbidden_text(dirpath, filenames, re_forbidden)
 for filename in filenames:
 full_name = os.path.join(dirpath, filename)
 yield full_name
def load_config():
 with open("config.json", "r") as f:
 return json.load(f)
def load_rules(path):
 with open(path, "r") as f:
 return json.load(f)
# --- --- ---
@log_elapsed
def main(): 
 config = load_config()
 rules = load_rules(config["private_config_path"])
 re_forbidden = re.compile(rules["forbidden_pattern"])
 for path in rules["scan_dirs"]:
 for filename in files_to_sanitize(path, rules["ignore_dirs"], rules["ignore_files"], re_forbidden):
 pprint(filename)
if __name__ == '__main__':
 main()

In the next step I will use this script (with small changes) as a Git Hook for preventing commits when there are files that need to be sanitized first. (I hadn't known about this trick when I started with it thus the list of directories to scan)

MaLiN2223 MaLiN2223 1,0445 silver badges11 bronze badges · Accepted Answer · 2018-10-28 15:38:25Z

In general, it is a good code. However, I have a few remarks. Note: I am assuming that the code works perfectly as it should and will comment only on the style. Also, I assume that you are an experienced programmer and hence will not be explaining the basics, although if something is not clear please let me know.

Configuration

You might want to keep the real paths separate to a configuration script. I'd advice you to create a *.cfg file that would store all the configuration data. I can imagine this being used on different machines with different paths each. I don't think you would like to allow users to modify the code, it is better to give them config file.

In case you would like to keep the configuration as it is, I'd encourage to move the lists to config.py and name (as you mentioned) in UPPER_CASE. Additionally, you might also want to encapsulate them in a functions like this:

def get_ignored_dirs():
 return ignored_dirs

This allows to keep the API of config.py constant regardless on what you decide to do with inners of the configuration.

Lastly, you might consider moving your functions like ignore_dir to separate file (e.g. files_utils.py) and treat is as a proxy to your configuration.

Method naming

In general I tend do explicitly say that some function returns boolean value by sticking is at the beginning. For example, it is clear (from the code) that your function ignore_dir checks if the directory should be ignored. This is however, not clear from the name, one might think that this function ignores some directory. I'd suggest to change the name to is_ignored_dir or something similar. The same goes for ignore_file.

Separation of concerns and responsibilities

It seems for me that your sanitize function is not very SRP-like. Inside, you do many things including deciding whether to ignore a directory (or file), read the file and count time. I'd suggest dividing those responsibilities to more than one function. You might notice that if you try to do a unit test for this function, in reality it will be a e2e test.

Below, an example of what you can fix it.

def measured(func):
 def wrapper(path):
 start = time.perf_counter()
 func(path)
 end = time.perf_counter()
 elapsed = round(end - start,2)
 print(f"elapsed: {elapsed} sec")
 return wrapper
def searchable_files_from_list(filenames):
 for filename in filenames:
 if not is_ignore_file(filename):
 yield filename
def searchable_files_containing_forbidden_text(filenames):
 for filename in searchable_files_from_list(filenames):
 full_name = os.path.join(dirpath, filename)
 if contains_forbidden_text(full_name):
 yield full_name
def contains_forbidden_text(filename):
 with open(full_name, 'r', encoding="utf8", errors="ignore") as searchable:
 text = searchable.read()
 if config.re_forbidden.search(text):
 return True
 return False
def files_to_sanitize(path):
 for (dirpath,dirname,filenames) in os.walk(path):
 if is_ignored_dir(dirpath):
 continue
 for file in searchable_files_containing_forbidden_text(filenames):
 yield file
@measured
def print_files_to_sanitize(path):
 for file in files_to_sanitize(path):
 pprint(file)

Explanation:

Now, all of your methods can be separately unit tested and validated. Let's take look at each function at a time, starting with the last one.

print_files_to_sanitize This is what is left from your sanitize function. After the refactoring it become apparent that the function does not sanitize anything, it just prints a file name. This fact wasn't so clear before because of many things happening.

I've added a decorator to the function - this is a very common practice for measuring time of the method (you might want to generalize this by using *args and **kw).

files_to_sanitize is a generator that should give you all of the files that should be sanitized. It iterates through your catalogs ignoring some directories and getting the files from a second generator.

searchable_files_containing_forbidden_text Given a list of files it iterates over those which have a good file name (given by searchable_files_from_list function) and if such file contains a forbidden text it bubbles it up.

Yes, I tested it with all my repositories and also with one that intentionally contained some illegal strings and after some initial problems it does its job well now. You're right, I'm not a complete beginner, just in python but I'll figure the new stuff out (and there are few things that I haven't seen yet) - but there are some links ;-). This is great review!

Stack Exchange Network

Searching repositories for files with forbidden strings

2 Answers 2

Configuration

Method naming

Separation of concerns and responsibilities

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Searching repositories for files with forbidden strings

2 Answers 2

Configuration

Method naming

Separation of concerns and responsibilities

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions