2
\$\begingroup\$

I'm writing a code that, starting from an XML file:

  • stores the index of child elements of a tag and the child elements as key, values in a dictionary (function get_xml_by_tag_names);
  • deletes keys whose values contain a certain string (the specific text size) and puts these keys and the corresponding values into a second dictionary (def search_delete_append);
  • joins, for each dictionary, the dict values and extracts their text(def main);
  • replaces certain values with "" (def main);
  • counts the occurrences of specific regex I specify (def find_regex).

It works, but the "main" function needs to be more cleaned up. My concerns specifically regard the part in which I have to list the regex I'm interested in- they're multiple and it can become messy. Another problem is that the cleaning of the XML can be done in another separate function, but so I haven't managed to do it.

Here is the code:

import re
from xml.dom import minidom
from xml.etree import ElementTree as ET
from bs4 import BeautifulSoup
def get_xml_by_tag_names(xml_path, tag_name_1, tag_name_2):
 data = {}
 xml_tree = minidom.parse(xml_path)
 item_group_nodes = xml_tree.getElementsByTagName(tag_name_1)
 for idx, item_group_node in enumerate(item_group_nodes):
 cl_compile_nodes = item_group_node.getElementsByTagName(tag_name_2)
 for _ in cl_compile_nodes:
 data[idx]=[item_group_node.toxml()]
 return data
def find_regex(regex, text):
 l = []
 matches_prima = re.findall(regex, text)
 print("The number of", {regex}," matches is ", len(matches_prima))
def search_delete_append(dizionario, dizionariofasi):
 deletekeys = []
 insertvalues = []
 for k in dizionario:
 for v in dizionario[k]:
 if "7.489" in v:
 deletekeys.append(k)
 dizionariofasi[k] = v
 for item in deletekeys:
 del dizionario[item]
def main():
 dict_fasi = {}
 data = get_xml_by_tag_names('output2.xml', 'new_line', 'text')
 search_delete_append(data, dict_fasi)
 testo = []
 for value in data.values():
 myxml = ' '.join(value)
 tree = ET.fromstring(myxml)
 tmpstring = ' '.join(text.text for text in tree.findall('text'))
 for to_remove in (" < ", " >", ".", ",", ";", "-", "!", ":", "’", "?", "<>"):
 tmpstring = tmpstring.replace(to_remove, "")
 testo.append(tmpstring)
 testo = ''.join(testo)
 #print(testo)
 find_prima = re.compile(r"\]\s*prima(?!\S)")
 #print(find_regex(find_prima, testo))
 #################
 testo_fasi = []
 values = [x for x in dict_fasi.values()]
 myxml_fasi = ' '.join(values)
 find_CM = re.compile(r"10\.238")
 print(find_regex(find_CM, myxml_fasi)) #quanti CM ci sono?
 #print(myxml_fasi)
 for x in dict_fasi.values():
 xxx= ''.join(x)
 tree2 = ET.fromstring(xxx)
 tmpstring2 = ' '.join(text.text for text in tree2.findall('text'))
 testo_fasi.append(tmpstring2)
 testo_fasi = ''.join(testo_fasi)
 print(testo_fasi)
 find_regex(find_prima, testo_fasi)
if __name__ == "__main__":
 main()
asked Apr 24, 2020 at 14:55
\$\endgroup\$
1
  • \$\begingroup\$ Per codereview.meta.stackexchange.com/questions/3795/… it is preferable if your code is shown in English rather than Italian. I think your code is still on-topic; it would just clarify your code for your reviewers. \$\endgroup\$ Commented Apr 24, 2020 at 23:21

1 Answer 1

1
\$\begingroup\$

Mixed languages

Regardless of one's opinion on whether English is the lingua franca of programming, mixing languages is a bad idea. This would be better in all-Italian (where possible; libraries are still in English) or all-English than a mixture.

Type hints

Use them; for instance

def get_xml_by_tag_names(xml_path: str, tag_name_1: str, tag_name_2: str) -> dict:

Unused variable

Here:

l = []

Items iteration

for k in dizionario:
 for v in dizionario[k]:

should be

for k, values in dizionario.items():
 for v in values:

Magic numbers

Move this:

"7.489"

to a named constant variable somewhere.

Multiple replacement

This:

 for to_remove in (" < ", " >", ".", ",", ";", "-", "!", ":", "’", "?", "<>"):

would be easier as a single regular expression and a single call to sub.

Intermediate list

values = [x for x in dict_fasi.values()]
myxml_fasi = ' '.join(values)

can just be

myxml_fasi = ' '.join(dict_fasi.values())
answered Apr 24, 2020 at 23:41
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.