\$\begingroup\$

I wrote a script with utilities for calculating the entropy of iterables and included a Tk GUI that shows a quick overview of a text's properties in real time (on GitHub).

I tried to follow PEP 8 as closely as possible, but I'm not sure about other things, specifically:

  1. I think my docstrings are sometimes overly redundant, see the GUI for example.
  2. In gui.py, I'm not sure if I should move the calculate method out of the GUI class.
  3. Is the overall design good? I know it's a rather small project, but I want to do this correctly.

If you have any other concerns besides these questions, I'm open to criticism!

The code is split into two modules:

calc.py - Includes the calculation functions

"""Utilities for entropy-related calculations."""

from math import ceil as _ceil, log2 as _log2


def prob_to_info(probability):
    """Converts probability in the range from 0 to 1 into information measured
    in bits, therefore using the dual logarithm. Returns None if the probability
    is equal to zero."""
    if probability == 0:
        return None
    elif probability == 1:
        return 0
    else:
        return -_log2(probability)


def info_to_prob(information):
    """Converts information measured in bits to probability."""
    return 2**-information


def entropy(iterable):
    """Calculates the Shannon entropy of the given iterable."""
    return sum(prob[1]*prob_to_info(prob[1]) for prob in char_mapping(iterable))


def optimal_bits(iterable):
    """Calculates the optimal usage of bits for decoding the iterable."""
    return _ceil(entropy(iterable)) * len(iterable)


def metric_entropy(iterable):
    """Calculates the metric entropy of the iterable."""
    return entropy(iterable) / len(iterable)


def char_mapping(iterable):
    """Creates a dictionary of the unique characters and their probability
    in the given iterable."""
    char_map = dict.fromkeys(set(iterable))
    for char in set(iterable):
        probability = iterable.count(char) / len(iterable)
        char_map[char] = probability
    return sorted(char_map.items(), key=lambda x: x[1], reverse=True)

gui.py

import tkinter as tk

import calc


class GUI:
    """A simple Tk-based interface for real-time entropy-related analytics
    on given texts."""

    def __init__(self, root):
        """Initializes the GUI where 'root' is a tkinter.Tk instance."""
        self.parent = root
        self.parent.state("zoomed")
        self.frame = tk.Frame(self.parent)
        self.frame.grid(row=0, column=0, sticky="nwes")
        self.input_head = tk.Label(self.frame, text="Input:")
        self.input_head.grid(row=0, column=0, sticky="nwes")
        self.ignore_case_value = tk.IntVar()
        self.ignore_case_value.trace("w", self.case_switch)
        self.ignore_case = tk.Checkbutton(
            self.frame,
            variable=self.ignore_case_value,
            text="Ignore case"
        )
        self.ignore_case.grid(row=0, column=1, sticky="nwes")
        self.input_main = tk.Text(self.frame)
        self.input_main.grid(row=1, column=0, sticky="nwes", columnspan=2)
        self.input_main.bind("<KeyRelease>", self.update)
        self.output_head = tk.Label(self.frame, text="Output:")
        self.output_head.grid(row=0, column=2, sticky="nwes")
        self.output_main = tk.Text(self.frame, state=tk.DISABLED)
        self.output_main.grid(row=1, column=2, sticky="nwes")
        self.parent.rowconfigure(0, weight=1)
        self.parent.columnconfigure(0, weight=1)
        self.frame.rowconfigure(1, weight=1)
        self.frame.columnconfigure(0, weight=1)
        self.frame.columnconfigure(1, weight=1)
        self.frame.columnconfigure(2, weight=1)

    def case_switch(self, *_):
        """Toggles case sensitivity."""
        self.input_main.edit_modified(True)
        self.update()

    def update(self, *_):
        """Updates the contents of the analysis text box."""
        if not self.input_main.edit_modified():
            return
        analyze_text = self.calculate()
        self.output_main["state"] = tk.NORMAL
        self.output_main.delete("1.0", tk.END)
        self.output_main.insert("1.0", analyze_text)
        self.output_main["state"] = tk.DISABLED
        self.input_main.edit_modified(False)

    def calculate(self, *_):
        """Creates the analysis text."""
        text = self.input_main.get("1.0", "end-1c")
        if self.ignore_case_value.get():
            text = text.lower()
        char_map = calc.char_mapping(text)
        entropy = calc.entropy(char_map)
        metric_entropy = calc.metric_entropy(text)
        optimal = calc.optimal_bits(text)
        info = "\n".join(
            [
                "Length: " + str(len(text)),
                "Unique chars: " + str(len(char_map)),
                "Entropy: " + str(entropy),
                "Metric entropy: " + str(metric_entropy),
                "Optimal bit usage: " + str(optimal)
            ]
        )
        table_head = " Char\t| Probability\t\t| Bits\t\t| Occurrences"
        table_body = "\n".join(
            [
                " " + repr(char)[1:-1] +
                "\t" + str(round(prob, 7)) +
                "\t\t" + str(round(calc.prob_to_info(prob), 7)) +
                "\t\t" + str(text.count(char))
                for char, prob in char_map
            ]
        )
        table = "\n".join([table_head, table_body])
        return "\n\n".join([info, table])


def main():
    root = tk.Tk()
    _ = GUI(root)
    root.mainloop()


if __name__ == "__main__":
    main()
asked Apr 4, 2015 at 19:39
\$\endgroup\$
  • Why do you alias ceil and log2? Commented Apr 5, 2015 at 6:43

2 Answers

\$\begingroup\$

You ask about docstrings, so you should be aware that there is a PEP for those, too (PEP 257). In particular, note that:

Multi-line docstrings consist of a summary line just like a one-line docstring, followed by a blank line, followed by a more elaborate description.

The style guide specifies that docstring lines should be a maximum of 72 characters; a few of yours exceed this. There are various formats that you can adopt to include information in the docstrings in a structured way for use by documentation generators and other tools; I like the Google style.

For example,

"""Converts probability in the range from 0 to 1 into information measured
in bits, therefore using the dual logarithm. Returns None if the probability
is equal to zero."""

could be more like:

"""Converts probability into information, measured in bits.

Notes:
    Uses the dual logarithm.

Args:
    probability (float): In the range from 0 to 1.

Returns:
    float [or None if the probability is equal to zero].
"""

I assume that you've aliased log2 and ceil to _log2 and _ceil respectively to avoid them being imported into gui. Instead, you can use __all__ to specify which names are exported when another module does from calc import * (see the tutorial):

__all__ = [
    'entropy',
    'info_to_prob',
    'metric_entropy',
    'optimal_bits',
    'prob_to_info',
]
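
To illustrate what __all__ does and does not control, here is a minimal sketch. The module name calc_demo and its contents are made up for the demonstration; the module is built in memory only so that the snippet runs on its own. Note that __all__ only affects star-imports: with a plain import, every name in the module stays reachable.

```python
import sys
import types

# Hypothetical stand-in for calc.py, built in memory for the demo.
SOURCE = '''
from math import log2

__all__ = ['prob_to_info']

def prob_to_info(probability):
    return -log2(probability)
'''

demo = types.ModuleType("calc_demo")
exec(SOURCE, demo.__dict__)
sys.modules["calc_demo"] = demo

# A star-import only copies the names listed in __all__ ...
namespace = {}
exec("from calc_demo import *", namespace)
star_names = sorted(name for name in namespace if not name.startswith("__"))
print(star_names)  # ['prob_to_info']

# ... but the module object itself still exposes everything.
print(hasattr(demo, "log2"))  # True
```

Since gui.py uses import calc rather than a star-import, neither the underscore aliases nor __all__ changes what it can see; both are mainly a signal to readers about the intended public interface.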

It seems a bit odd to have the class that occupies pretty much the whole of gui.py be explicitly ignored after instantiation! Rather than having:

root = tk.Tk()
_ = GUI(root)
root.mainloop()

you could make the GUI class inherit from tk.Tk:

class GUI(tk.Tk):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.state("zoomed")
        self.frame = tk.Frame(self)
        ...

and run it directly:

root = GUI()
root.mainloop()

This is trivial enough to include under if __name__ == '__main__': directly, rather than via main. There's also no need for the , *_ in GUI.calculate.


Rather than the string concatenation with +, I would use str.format, for example:

        table_head = " Char | Probability | Bits | Occurrences "
        table_body = "\n".join(
            [
                " {:<4} | {:>11.7f} | {:>11.7f} | {:>11}".format(
                    char,
                    prob,
                    calc.prob_to_info(prob),
                    text.count(char)
                )
                for char, prob in char_map
            ]
        )

Given what this method does, I don't think that calculate is an appropriate name for it. You could split the calculations and the formatting into two methods, with more appropriate names.
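
One possible way to make that split is sketched here, outside the GUI class so that it runs on its own. The names analyze and format_report are hypothetical, and the entropy is computed inline with collections.Counter rather than through calc, purely to keep the example self-contained.

```python
from collections import Counter
from math import log2


def analyze(text):
    """Return the raw statistics for text as a dict (no formatting)."""
    counts = Counter(text)
    total = len(text)
    # Shannon entropy in bits per character; 0.0 for empty input.
    entropy = sum(
        -(n / total) * log2(n / total) for n in counts.values()
    ) if total else 0.0
    return {"length": total, "unique": len(counts), "entropy": entropy}


def format_report(stats):
    """Turn the statistics from analyze() into display text."""
    return "\n".join([
        "Length: {}".format(stats["length"]),
        "Unique chars: {}".format(stats["unique"]),
        "Entropy: {:.7f}".format(stats["entropy"]),
    ])


print(format_report(analyze("hello")))
```

With a split like this, the GUI method only has to fetch the text and call format_report(analyze(text)), and the calculation part can be tested without creating any Tk widgets.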


As currently implemented, the code breaks (due to ZeroDivisionError in metric_entropy) if you toggle Ignore Case before entering any text, or if you delete all of the input text. You should handle this error, and display something sensible in these cases.
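
A minimal sketch of one way to guard against that, assuming that reporting 0.0 for empty input is acceptable (returning None or showing a placeholder message would work just as well). safe_metric_entropy is a hypothetical wrapper, and entropy here is a compact stand-in for the function in calc.py.

```python
from math import log2


def entropy(iterable):
    """Compact stand-in for calc.entropy: Shannon entropy in bits."""
    total = len(iterable)
    return sum(
        -(iterable.count(item) / total) * log2(iterable.count(item) / total)
        for item in set(iterable)
    )


def safe_metric_entropy(iterable):
    """Entropy divided by length, or 0.0 when the input is empty."""
    try:
        return entropy(iterable) / len(iterable)
    except ZeroDivisionError:
        # Empty input: there is nothing to measure.
        return 0.0
```

An explicit check such as if not iterable: return 0.0 at the top of the function would be an equally reasonable choice; the try/except form simply mirrors the error named above.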

answered Apr 5, 2015 at 8:45
\$\endgroup\$
\$\begingroup\$

As you never use the first element of the tuples you get from char_mapping, and the order does not matter, I wrote a simpler function:

def ratios(iterable):
    """
    Returns a list of ratios indicating how often the chars
    appear in the iterable.

    >>> list(sorted(ratios("hello")))
    [0.2, 0.2, 0.2, 0.4]
    """
    return [iterable.count(i) / len(iterable) for i in set(iterable)]

that you can use as:

def entropy(iterable):
    """Calculates the Shannon entropy of the given iterable.

    >>> entropy(range(10))
    3.321928094887362
    >>> entropy([1,2,3])
    1.584962500721156
    """
    return sum(prob*prob_to_info(prob) for prob in ratios(iterable))

obtaining the same results as before.
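
As a possible refinement that goes beyond the function above: ratios calls iterable.count once per distinct item, so the input is rescanned each time. A sketch using collections.Counter from the standard library counts everything in a single pass; the name ratios_counter is made up for this example.

```python
from collections import Counter


def ratios_counter(iterable):
    """Return the relative frequency of each distinct item.

    Works on any iterable, including generators, since it does not
    rely on .count() or len().
    """
    counts = Counter(iterable)
    total = sum(counts.values())
    return [count / total for count in counts.values()]


print(sorted(ratios_counter("hello")))  # [0.2, 0.2, 0.2, 0.4]
```

It can be dropped into entropy in place of ratios, since only the list of probabilities is used there.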

answered Apr 4, 2015 at 21:13
\$\endgroup\$
  • Though it's actually used to create the char table in the GUI, I think you're right as it isn't used elsewhere. I think I'll create the complete table with chars and stuff in the GUI then. Commented Apr 4, 2015 at 21:50
