Finding incorrect yml headers

Question 1

Introduction

I am working on a fairly large project with a few hundred files. The project contains a series of lesson.yml files, and I want to check that each one is formatted correctly. And yes, every file I want to check is called exactly that.

Just to clarify: my code works and does exactly what I want. However, I expect that either a better method exists or the code can be cleaned up significantly.

The files look exactly like this

level: 1-4
 topic: [tags1]
 subject: [tags2]
 grade: [tags3]

or this

indexed: false
 topic: [tags1]
 subject: [tags2]
 grade: [tags3]

If the file starts with indexed: false it should be skipped.

The level: value has to be from 1 to 4. Every file must contain the titles topic, subject and grade, each exactly once. The tags can only be the words listed below.

 topic_tags: app|electronics|step_based|block_based|text_based|minecraft|web|game|robot|animation|sound|cryptography
 subject_tags: mathematics|science|programming|technology|music|norwegian|english|arts_and_crafts|social_science
 grade_tags: preschool|primary|secondary|junior|senior
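For illustration only, the same rules can be written down as plain Python sets, which makes the membership checks explicit without any regex. This is a minimal sketch, not the code under review: the names ALLOWED, invalid_tags and valid_level are hypothetical, and the values are assumed to have already been read from the file as Python lists and integers.

```python
# Allowed values per title, copied from the lists above.
ALLOWED = {
    "topic": {"app", "electronics", "step_based", "block_based", "text_based",
              "minecraft", "web", "game", "robot", "animation", "sound",
              "cryptography"},
    "subject": {"mathematics", "science", "programming", "technology", "music",
                "norwegian", "english", "arts_and_crafts", "social_science"},
    "grade": {"preschool", "primary", "secondary", "junior", "senior"},
}


def invalid_tags(title, values):
    """Return the values that are not allowed for the given title, in order."""
    return [value for value in values if value not in ALLOWED[title]]


def valid_level(level):
    """A level is valid when it is an integer from 1 to 4."""
    return isinstance(level, int) and 1 <= level <= 4


# Hypothetical values, mirroring the test cases further down.
print(invalid_tags("subject", ["mathematics", "programming", "yodeling"]))
print(valid_level(9))
```

Set membership compares whole words, so a tag such as art can never be confused with arts_and_crafts.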

Test cases

level: 9
tags:
 topic: [block_based, game]
 subject: [programming]
 grade: [primary, secondary, junior]

This should output the filepath, then level: 9 with the 9 in red, as only levels 1-4 are supported.

level: 3
tags:
 topic: [text_based]
 subject: [mathematics, programming, yodeling]
 grade: [junior, senior]

This should output the filepath, then subject: [mathematics, programming, yodeling] with the word yodeling marked in red, as it is not a valid subject (even if most of us think it should be).

level: 1

This should output filepath: missing: topic, subject, grade where topic, subject, and grade are marked in red.

level: 9
tags:
 topic: [block_based, game]
 subject: [programming]
 grade: [primary, secondary, junior]
 grade: [primary, junior]

This one should output the filepath, then extra: grade, as there is more than one grade.

Results

Running the code on my database returns something like this

[Image: screenshot of the script's colored terminal output]

Code

import glob
from termcolor import colored
from collections import defaultdict
import re

tags_ = dict(
    level="[1-4]",
    topic=
    "app|electronics|step_based|block_based|text_based|minecraft|web|game|robot|animation|sound|cryptography",
    subject=
    "mathematics|science|programming|technology|music|norwegian|english|arts_and_crafts|social_science",
    grade="preschool|primary|secondary|junior|senior",
)


# If a file starts with "indexed: false" skip it
def is_indexed(filename):
    with open(filename, 'r') as f:
        first_line = f.readline().replace(" ", "").lower().strip()
        return first_line != "indexed:false"


# Colors the words from bad_words red in a line
def color_incorrect(bad_words, line):
    line = re.sub('(' + '|'.join(bad_words) + ')', '{}', line)
    return line.format(*[colored(w, 'red') for w in bad_words])


def find_incorrect_titles(title_count, titles):
    missing = []
    extra = []
    for title in titles:
        if title_count[title] > 1:
            extra.append(colored(title, 'red'))
        elif title_count[title] < 1:
            missing.append(colored(title, 'red'))
    miss_str = 'missing: ' + ', '.join(missing) if missing else ''
    extra_str = 'extra: ' + ', '.join(extra) if extra else ''
    if miss_str:
        return miss_str + ' | ' + extra_str if extra_str else miss_str
    else:
        return extra_str


def find_incorrect_tags(filename):
    title_count = defaultdict(int)  # Counts number of titles, topics, etc
    incorrect_tags = []
    with open(filename, 'r') as f:
        for line in f:
            line = line.strip()
            for title, tag in tags_.items():
                if not line.startswith(title):
                    continue
                title_count[title] += 1
                n = True
                # Finds every non-legal tag as defined at the start of the file
                regex = r'\b(?!{0}|{1}\b)\w+'.format(title, tag)
                m = re.findall(regex, line)  # Places the words in a list
                if m:  # If we got any hits, this means the words are wrong
                    line = color_incorrect(m, line)  # color the words
                # This block finds titles without any legal words (empty).
                else:
                    if title != "level":
                        regex_legal = r'{0}: *\[( *({1}),? *)+\]'.format(
                            title, tag)
                    else:
                        regex_legal = r'{0}: *( *({1}),? *)+'.format(
                            title, tag)
                    n = re.search(regex_legal, line)
                    # If no legal words has been found, color the line red
                    if not n:
                        line = colored(line, 'red')
                if m or not n:  # Add line to list of incorrect tags
                    incorrect_tags.append(
                        (' ' * 4 if title != "level" else " ") + line)
                break
    # We find if any title, topic, subject does not appear exactly once
    return (incorrect_tags, title_count)


def print_incorrect_titles_and_tags(filename):
    incorrect_tags, title_count = find_incorrect_tags(filename)
    incorrect_titles = find_incorrect_titles(title_count, tags_.keys())
    # If any errors are found we print them
    if incorrect_titles or incorrect_tags:
        print(colored(filename, 'yellow') + ": " + incorrect_titles)
        print('\n'.join(incorrect_tags)) if incorrect_tags else ''


if __name__ == "__main__":
    path = '../oppgaver/src'
    files = glob.glob(path + '/**/lesson.yml', recursive=True)
    for f in files:
        if is_indexed(f):
            print_incorrect_titles_and_tags(f)

Question 2

This is an odd statement:

print('\n'.join(incorrect_tags)) if incorrect_tags else ''

It evaluates to the return value of print() (which is always None) if incorrect_tags is truthy; otherwise it evaluates to ''. Either way the result is discarded.

If the print() is executed, it concatenates possibly many strings with a newline separator just to print them, and the final newline comes from print() itself. This is somewhat confusing. The following is less tricky and much clearer:

for incorrect_tag in incorrect_tags:
    print(incorrect_tag)
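To make the point concrete, here is a small sketch with made-up sample data (the two strings in incorrect_tags are hypothetical). It shows that the conditional expression only ever yields None or '', and that the loop needs no guard for the empty case.

```python
incorrect_tags = ["    subject: [yodeling]", " level: 9"]

# print() always returns None, so the conditional expression evaluates to
# None when the list is non-empty and to '' when it is empty. The value is
# never used; only the side effect of printing matters.
result = print('\n'.join(incorrect_tags)) if incorrect_tags else ''

# With an empty list nothing is printed and the expression yields ''.
empty_result = print('unreachable') if [] else ''

# The plain loop states the intent directly; for an empty list it simply
# does nothing.
for incorrect_tag in incorrect_tags:
    print(incorrect_tag)
```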

Promiscuous regex:

def color_incorrect(bad_words, line):
    line = re.sub('(' + '|'.join(bad_words) + ')', '{}', line)
    return line.format(*[colored(w, 'red') for w in bad_words])

If the line subject: [arts_and_crafts, mathematics, programming, art] is encountered, art becomes a bad word. Because the pattern is not anchored, it also matches the art inside arts_and_crafts, and line becomes:

'subject: [{}s_and_crafts, mathematics, programming, {}]'

The subsequent line.format(...) then receives one argument for two {} placeholders and raises an exception (the exact wording of the message varies between Python versions):

IndexError: tuple index out of range

Prevent this using \b word boundary assertions:

line = re.sub(r'\b(' + '|'.join(bad_words) + r')\b', '{}', line)
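As a further option, the placeholder bookkeeping can be avoided altogether by passing a callable as the replacement to re.sub, so each match is colored where it is found. This is only a sketch under stated assumptions: the colored function below is a stand-in for termcolor.colored so the example runs without the package, and it handles only 'red'.

```python
import re


def colored(text, color):
    """Stand-in for termcolor.colored (assumption: only 'red' is needed)."""
    return "\033[31m" + text + "\033[0m" if color == "red" else text


def color_incorrect(bad_words, line):
    """Color every whole-word occurrence of a bad word red."""
    if not bad_words:
        return line
    # re.escape guards against regex metacharacters in the words themselves.
    pattern = r'\b(?:' + '|'.join(map(re.escape, bad_words)) + r')\b'
    # A callable replacement colors each match directly, so the number of
    # matches never has to agree with a number of format arguments.
    return re.sub(pattern, lambda match: colored(match.group(0), 'red'), line)


line = "subject: [arts_and_crafts, mathematics, programming, art]"
print(color_incorrect(["art"], line))
```

Since str.format is no longer involved, a line that happens to contain literal braces is also handled without special treatment.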

AJNeufeld · Answer 1 · 2018-07-20 20:43:59Z

