big is a Python package of small functions and classes that aren't big enough to get a package of their own. It's zillions of useful little bits of Python code I always want to have handy.
For years, I've copied-and-pasted all my little helper functions between projects--we've all done it. But now I've finally taken the time to consolidate all those useful little functions into one big package--no more copy-and-paste, I just install one package and I'm ready to go. And, since it's a public package, you can use 'em too!
Not only that, but I've taken my time and re-thought and retooled a lot of this code. All the difficult-to-use, overspecialized, cheap hacks I've lived with for years have been upgraded with elegant, intuitive APIs and dazzling functionality. big is chock full of the sort of little functions and classes we've all hacked together a million times--only with all the API gotchas fixed, and thoroughly tested with 100% coverage. It's the code you would have written... if only you had the time. It's a real pleasure to use!
big requires Python 3.6 or newer. Its only dependency
is python-dateutil, and that's optional.
Think big!
It's true that much of the code in big is short, and one might reasonably have the reaction "that's so short, it's easier to write it from scratch every time I need it than remember where it is and how to call it". I still see value in these short functions in big because:
- everything in big is tested,
- every interface in big has been thoughtfully considered and designed.
For example, consider
StateManager.
If you remove the comments and documentation, it's actually pretty
short--easily less than a hundred lines. I myself have written
state machines from scratch using a similar approach many times.
They're easy to write. So why bother using StateManager?
Why not roll your own each time?
Because StateManager not only supports all the features you need--consider
accessor
and
dispatch--its
API is carefully designed to help prevent bugs and logical errors.
In considering the predecessor of StateManager for inclusion in big,
I realized that if an "observer" initiated a state transition, it would produce
a blurry mess of observer callbacks and entered and exited states,
executed in a confusing order. So StateManager in big simply
prevents you from executing state transitions in observers.
To use big, just install the big package (and its dependencies) from PyPI using your favorite Python package manager.
Once big is installed, you can simply import it. However, the top-level big package doesn't contain anything but a version number. Internally big is broken up into submodules, aggregated together loosely by problem domain, and you can selectively import just the functions you want. For example, if you only want to use the text functions, just import the text submodule:
import big.text
If you'd prefer to import everything all at once, simply import the big.all module. This one module imports all the other modules, and imports all their symbols too. So, one convenient way to work with big is this:
import big.all as big
That will make every symbol defined in big accessible from the big
object. For example, if you want to use
multisplit,
you can access it with just big.multisplit.
You can also use big.all with import *:
from big.all import *
but that's up to you.
big is licensed using the MIT license. You're free to use it and even ship it in your own programs, as long as you leave my copyright notice on the source code.
Although big is crammed full of fabulous code, a few of its subsystems rise above the rest. If you're curious what big might do for you, here are the five things in big I'm proudest of:
- The `multi-` family of string functions
- `big.state`
- Bound inner classes
- Enhanced `TopologicalSorter`
- `lines` and lines modifier functions
And here are five little functions/classes I use all the time:
- `accessor(attribute='state', state_manager='state_manager')`
- `datetime_ensure_timezone(d, timezone)`
- `datetime_set_timezone(d, timezone)`
- `Delimiter(open, close, *, backslash=False, nested=True)`
- `dispatch(state_manager='state_manager', *, prefix='', suffix='')`
- `encode_strings(o, *, encoding='ascii')`
- `Event(scheduler, event, time, priority, sequence)`
- `fgrep(path, text, *, encoding=None, enumerate=False, case_insensitive=False)`
- `gently_title(s, *, apostrophes=None, double_quotes=None)`
- `get_float(o, default=_sentinel)`
- `get_int_or_float(o, default=_sentinel)`
- `grep(path, pattern, *, encoding=None, enumerate=False, flags=0)`
- `int_to_words(i, *, flowery=True, ordinal=False)`
- `lines(s, separators=None, *, line_number=1, column_number=1, tab_width=8, **kwargs)`
- `lines_convert_tabs_to_spaces(li)`
- `lines_filter_comment_lines(li, comment_separators)`
- `lines_containing(li, s, *, invert=False)`
- `lines_grep(li, pattern, *, invert=False, flags=0)`
- `lines_sort(li, *, reverse=False)`
- `multipartition(s, separators, count=1, *, reverse=False, separate=True)`
- `multisplit(s, separators, *, keep=False, maxsplit=-1, reverse=False, separate=False, strip=False)`
- `multistrip(s, separators, left=True, right=True)`
- `normalize_whitespace(s, separators=None, replacement=None)`
- `parse_delimiters(s, delimiters=None)`
- `parse_timestamp_3339Z(s, *, timezone=None)`
- `PushbackIterator(iterable=None)`
- `PushbackIterator.next(default=None)`
- `re_partition(text, pattern, count=1, *, flags=0, reverse=False)`
- `re_rpartition(text, pattern, count=1, *, flags=0)`
- `reversed_re_finditer(pattern, string, flags=0)`
- `Scheduler(regulator=default_regulator)`
- `Scheduler.schedule(o, time, *, absolute=False, priority=DEFAULT_PRIORITY)`
- `split_quoted_strings(s, quotes=('"', "'"), *, triple_quotes=True, backslash='\\')`
- `split_text_with_code(s, *, tab_width=8, allow_code=True, code_indent=4, convert_tabs_to_spaces=True)`
- `StateManager(state, *, on_enter='on_enter', on_exit='on_exit', state_class=None)`
- `timestamp_3339Z(t=None, want_microseconds=None)`
- `timestamp_human(t=None, want_microseconds=None)`
- `TopologicalSorter.remove(node)`
- `TopologicalSorter.View.close()`
- `TopologicalSorter.View.done(*nodes)`
- `TopologicalSorter.View.print(print=print)`
- `TopologicalSorter.View.ready()`
- `TopologicalSorter.View.reset()`
- `translate_filename_to_exfat(s)`
- `unicode_linebreaks_without_crlf`
- The `multi-` family of string functions
- Whitespace and line-breaking characters in Python and big
This submodule doesn't define any of its own symbols. Instead, it
imports every other submodule in big, and uses import * to
import every symbol from every other submodule, too. Every
public symbol in big is available in big.all.
Class decorators that implement bound inner classes. See the Bound inner classes deep-dive for more information.
-
Class decorator for an inner class. When accessing the inner class through an instance of the outer class, "binds" the inner class to the instance. This changes the signature of the inner class's `__init__` from

    def __init__(self, *args, **kwargs):

to

    def __init__(self, outer, *args, **kwargs):

where `outer` is the instance of the outer class.

Compare this to functions:
- If you put a function inside a class, and access it through an instance I of that class, the function becomes a method. When you call the method, I is automatically passed in as the first argument.
- If you put a class inside a class, and access it through an instance I of that class, the class becomes a bound inner class. When you call the bound inner class, I is automatically passed in as the second argument to `__init__`, after `self`.
Note that this has an implication for all subclasses. If class B is decorated with `BoundInnerClass`, and class S is a subclass of B, such that `issubclass(S, B)` returns `True`, class S must be decorated with either `BoundInnerClass` or `UnboundInnerClass`.
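The comparison with methods can be made concrete with a small standard-library sketch. This is not how big implements `BoundInnerClass`; the `bind_inner` descriptor below is a hypothetical, simplified stand-in that only illustrates the binding idea, and it ignores the subclassing concerns described above.

```python
import functools


class bind_inner:
    """Hypothetical, simplified stand-in for a bound-inner-class decorator."""

    def __init__(self, cls):
        self.cls = cls

    def __get__(self, instance, owner):
        if instance is None:
            # Accessed through the outer class: nothing to bind.
            return self.cls
        # Accessed through an instance: pre-fill `outer`, much as a
        # function accessed through an instance pre-fills `self`.
        return functools.partial(self.cls, instance)


class Outer:
    def method(self):
        return self

    @bind_inner
    class Inner:
        def __init__(self, outer, value):
            self.outer = outer
            self.value = value


o = Outer()
assert o.method() is o   # the function became a bound method
inner = o.Inner(42)      # `o` is passed in automatically as `outer`
assert inner.outer is o
assert inner.value == 42
```

Because `functools.partial` returns a callable rather than a class, this sketch cannot be subclassed or used with `isinstance`; it is only meant to show where the extra `outer` argument comes from.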
-
Class decorator for an inner class that prevents binding the inner class to an instance of the outer class.
If class B is decorated with `BoundInnerClass`, and class S is a subclass of B, such that `issubclass(S, B)` returns `True`, class S must be decorated with either `BoundInnerClass` or `UnboundInnerClass`.
Functions for working with builtins. (Named builtin to avoid a
name collision with the builtins module.)
In general, the idea with these functions is a principle I first read about in either Code Complete or Writing Solid Code:
Don't associate with losers.
The intent here is, try to design APIs where it's impossible to call them the wrong way. Restrict the inputs to your functions to values you can always handle, and you won't ever have to return an error.
The functions in this sub-module are designed to always work. None of them should ever raise an exception--no matter what nonsense you pass in. (But don't take that as a challenge!)
-
Returns `float(o)`, unless that conversion fails, in which case returns the default value. If you don't pass in an explicit default value, the default value is `o`.
-
Returns `int(o)`, unless that conversion fails, in which case returns the default value. If you don't pass in an explicit default value, the default value is `o`.
-
Converts `o` into a number, preferring an int to a float.

If `o` is already an int or float, returns `o` unchanged. Otherwise, tries `int(o)`. If that conversion succeeds, returns the result. Otherwise, tries `float(o)`. If that conversion succeeds, returns the result. Otherwise returns the default value. If you don't pass in an explicit default value, the default value is `o`.
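To make the conversion order concrete, here is a minimal sketch of the rules just described. It is a hypothetical re-creation under the name `sketch_get_int_or_float`, not big's actual code; the private `_sentinel` object is likewise only an assumption about how "no default passed" might be detected.

```python
_sentinel = object()  # marks "no explicit default was passed"


def sketch_get_int_or_float(o, default=_sentinel):
    """Hypothetical re-creation of the conversion order described above."""
    if isinstance(o, (int, float)):
        return o                  # already a number: return unchanged
    if default is _sentinel:
        default = o               # fall back to the original object
    for convert in (int, float):  # prefer int, then float
        try:
            return convert(o)
        except Exception:
            pass                  # never raise; try the next option
    return default


print(sketch_get_int_or_float("12"))      # 12
print(sketch_get_int_or_float("3.5"))     # 3.5
print(sketch_get_int_or_float("abc"))     # abc
print(sketch_get_int_or_float("abc", 0))  # 0
```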
-
A decorator for class methods. When you have a method in a base class that's "pure virtual"--that must not be called, but must be overridden in child classes--decorate it with `@pure_virtual()`. Calling that method will throw a `NotImplementedError`.

Note that the body of any function decorated with `@pure_virtual()` is ignored. By convention the body of these methods should contain only a single ellipsis, literally like this:

    class BaseClass:

        @big.pure_virtual()
        def on_reset(self):
            ...
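For readers curious how such a decorator can work, the following is a rough sketch of the idea. `sketch_pure_virtual` is a hypothetical name, and the error message is an assumption; big's own implementation may differ.

```python
import functools


def sketch_pure_virtual():
    """Hypothetical decorator factory: the wrapped body is never run."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            raise NotImplementedError(
                f"pure virtual method {fn.__name__}() called")
        return wrapper
    return decorator


class BaseClass:
    @sketch_pure_virtual()
    def on_reset(self):
        ...


class Child(BaseClass):
    def on_reset(self):
        return "reset"


assert Child().on_reset() == "reset"  # overriding works as usual
try:
    BaseClass().on_reset()
except NotImplementedError as e:
    print(e)  # pure virtual method on_reset() called
```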
-
Returns `True` if `o` can be converted into a `float`, and `False` if it can't.
-
Returns `True` if `o` can be converted into an `int`, and `False` if it can't.
Functions for working with files, directories, and I/O.
-
Find the lines of a file that match some text, like the UNIX `fgrep` utility program.

`path` should be an object representing a path to an existing file, one of:
- a string,
- a bytes object, or
- a `pathlib.Path` object.

`text` should be either string or bytes.

`encoding` is used as the file encoding when opening the file.
- If `text` is a str, the file is opened in text mode.
- If `text` is a bytes object, the file is opened in binary mode. `encoding` must be `None` when the file is opened in binary mode.

If `case_insensitive` is true, perform the search in a case-insensitive manner.

Returns a list of lines in the file containing `text`. The lines are either strings or bytes objects, depending on the type of `text`. The lines have their newlines stripped but preserve all other whitespace.

If `enumerate` is true, returns a list of tuples of `(line_number, line)`. The first line of the file is line number 1.

For simplicity of implementation, the entire file is read in to memory at one time. If `case_insensitive` is true, `fgrep` also makes a lowercased copy.
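The behavior described above can be approximated with the standard library alone. The sketch below uses the hypothetical name `sketch_fgrep`; it follows the documented parameters but is not big's implementation, and details such as the exact exception raised for a bad `encoding` are assumptions.

```python
import builtins
import os
import tempfile


def sketch_fgrep(path, text, *, encoding=None, enumerate=False,
                 case_insensitive=False):
    """Hypothetical re-creation of the documented behavior, not big's code."""
    if isinstance(text, bytes):
        if encoding is not None:
            raise ValueError("encoding must be None in binary mode")
        with open(path, "rb") as f:   # bytes text: binary mode
            data = f.read()
    else:
        with open(path, "rt", encoding=encoding) as f:  # str text: text mode
            data = f.read()

    needle = text.lower() if case_insensitive else text
    matches = []
    # The `enumerate` parameter shadows the builtin, hence builtins.enumerate.
    for number, line in builtins.enumerate(data.splitlines(), 1):
        haystack = line.lower() if case_insensitive else line
        if needle in haystack:
            matches.append((number, line) if enumerate else line)
    return matches


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "sample.txt")
    with open(path, "wt", encoding="utf-8") as f:
        f.write("Hello world\nsecond line\nHELLO again\n")
    found = sketch_fgrep(path, "hello", encoding="utf-8",
                         enumerate=True, case_insensitive=True)
    print(found)  # [(1, 'Hello world'), (3, 'HELLO again')]
```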
-
Returns the modification time of `path`, in seconds since the epoch. Note that seconds is a float, indicating the sub-second with some precision.
-
Returns the modification time of `path`, in nanoseconds since the epoch.
-
Returns the size of the file at `path`, as an integer representing the number of bytes.
-
Look for matches to a regular expression pattern in the lines of a file, similarly to the UNIX `grep` utility program.

`path` should be an object representing a path to an existing file, one of:
- a string,
- a bytes object, or
- a `pathlib.Path` object.

`pattern` should be an object containing a regular expression, one of:
- a string,
- a bytes object, or
- an `re.Pattern`, initialized with either `str` or `bytes`.

`encoding` is used as the file encoding when opening the file.

If `pattern` uses a `str`, the file is opened in text mode. If `pattern` uses a bytes object, the file is opened in binary mode. `encoding` must be `None` when the file is opened in binary mode.

`flags` is passed in as the `flags` argument to `re.compile` if `pattern` is a string or bytes. (It's ignored if `pattern` is an `re.Pattern` object.)

Returns a list of lines in the file matching the pattern. The lines are either strings or bytes objects, depending on the type of `pattern`. The lines have their newlines stripped but preserve all other whitespace.

If `enumerate` is true, returns a list of tuples of `(line_number, line)`. The first line of the file is line number 1.

For simplicity of implementation, the entire file is read in to memory at one time.

Tip: to perform a case-insensitive pattern match, pass the `re.IGNORECASE` flag in `flags` for this function (if `pattern` is a string or bytes) or when creating your regular expression object (if `pattern` is an `re.Pattern` object).

(In older versions of Python, `re.Pattern` was a private type called `re._pattern_type`.)
-
A context manager that temporarily changes the directory. Example:

    with big.pushd('x'):
        pass

This would change into the `'x'` subdirectory before executing the nested block, then change back to the original directory after the nested block.

You can change directories in the nested block; this won't affect `pushd` restoring the original current working directory upon exiting the nested block.

You can safely nest `with pushd` blocks.
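A context manager with these properties is short to write with the standard library. The sketch below, under the hypothetical name `sketch_pushd`, shows one way to get the "always restore the original directory" behavior; it is an illustration, not big's source.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def sketch_pushd(directory):
    """Hypothetical pushd: remember the cwd, change, always change back."""
    original = os.getcwd()
    os.chdir(directory)
    try:
        yield
    finally:
        # Restores the saved directory even if the block changed
        # directories again or raised an exception.
        os.chdir(original)


with tempfile.TemporaryDirectory() as directory:
    before = os.getcwd()
    with sketch_pushd(directory):
        inside = os.getcwd()
        with sketch_pushd(os.path.dirname(directory)):
            pass  # nesting is safe: each level restores its own directory
        assert os.getcwd() == inside
    after = os.getcwd()

assert after == before
```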
-
Ensures that a directory exists at `path`. If this function returns and doesn't raise, it guarantees that a directory exists at `path`.

If a directory already exists at `path`, `safe_mkdir` does nothing.

If a file exists at `path`, `safe_mkdir` unlinks `path` then creates the directory.

If the parent directory doesn't exist, `safe_mkdir` creates that directory, then creates `path`.

This function can still fail:
- `path` could be on a read-only filesystem.
- You might lack the permissions to create `path`.
- You could ask to create the directory `x/y` and `x` is a file (not a directory).
-
Unlinks `path`, if `path` exists and is a file.
-
Ensures that `path` exists, and its modification time is the current time.

If `path` does not exist, creates an empty file.

If `path` exists, updates its modification time to the current time.
-
Ensures that all characters in `s` are legal for a FAT filesystem.

Returns a copy of `s` where every character not allowed in a FAT filesystem filename has been replaced with a character (or characters) that are permitted.
-
Ensures that all characters in `s` are legal for a UNIX filesystem.

Returns a copy of `s` where every character not allowed in a UNIX filesystem filename has been replaced with a character (or characters) that are permitted.
A drop-in replacement for Python's
graphlib.TopologicalSorter
with an enhanced API. This version of TopologicalSorter allows modifying the
graph at any time, and supports multiple simultaneous views, allowing
iteration over the graph more than once.
See the Enhanced TopologicalSorter deep-dive for more information.
-
Exception thrown by `TopologicalSorter` when it detects a cycle.
-
An object representing a directed graph of nodes. See Python's `graphlib.TopologicalSorter` for concepts and the basic API.
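As a quick refresher on that basic API, here is a short example that uses only the standard library's `graphlib.TopologicalSorter` (available in Python 3.9 and newer), not big's class. The node names are made up for illustration, and none of the enhancements described in this section (views, `remove`, and so on) appear here.

```python
from graphlib import TopologicalSorter

# Each key depends on the nodes in its value (its predecessors).
graph = {
    "eat": {"cook"},
    "cook": {"shop"},
    "shop": set(),
}

ts = TopologicalSorter(graph)
ts.prepare()  # stdlib requires this before get_ready()

order = []
while ts.is_active():
    ready = ts.get_ready()  # nodes whose predecessors are all done
    order.extend(ready)
    ts.done(*ready)         # unblocks the nodes that depend on them

print(order)  # ['shop', 'cook', 'eat']
```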
New methods on TopologicalSorter:
-
Returns a shallow copy of the graph. The copy also duplicates the state of `get_ready` and `done`.
-
Checks the graph for cycles. If no cycles exist, returns None. If at least one cycle exists, returns a tuple containing nodes that constitute a cycle.
-
Prints the internal state of the graph. Used for debugging.

`print` is the function used for printing; it should behave identically to the builtin `print` function.
-
Removes `node` from the graph.

If any node `P` depends on a node `N`, and `N` is removed, this dependency is also removed, but `P` is not removed from the graph.

Note that, while `remove()` works, it's slow. (It's O(N).) `TopologicalSorter` is optimized for fast adds and fast views.
-
Resets `get_ready` and `done` to their initial state.
-
Returns a new `View` object on this graph.
-
A view on a `TopologicalSorter` graph object. Allows iterating over the nodes of the graph in dependency order.
Methods on a View object:
-
Returns `True` if more work can be done in the view--if there are nodes waiting to be yielded by `get_ready`, or waiting to be returned by `done`.

Aliased to `TopologicalSorter.is_active` for compatibility with `graphlib`.
-
Closes the view. A closed view can no longer be used.
-
Returns a shallow copy of the view, duplicating its current state.
-
Marks nodes returned by `ready` as "done", possibly allowing additional nodes to be available from `ready`.
-
Prints the internal state of the view, and its graph. Used for debugging.

`print` is the function used for printing; it should behave identically to the builtin `print` function.
-
Returns a tuple of "ready" nodes--nodes with no predecessors, or nodes whose predecessors have all been marked "done".
Aliased to `TopologicalSorter.get_ready` for compatibility with `graphlib`.
-
Resets the view to its initial state, forgetting all "ready" and "done" state.
Functions for working with heap objects. Well, just one heap object really.
-
An object-oriented wrapper around the `heapq` library, designed to be easy to use--and easy to remember how to use. The `heapq` library implements a binary heap, a data structure used for sorting; you add objects to the heap, and you can then remove objects in sorted order. Heaps are useful because they are efficient both in space and in time; they're also inflexible, in that iterating over the sorted items is destructive.

The `Heap` API in big mimics the `list` and `collections.deque` objects; this way, all you need to remember is "it works kinda like a `list` object". You `append` new items to the heap, then `popleft` them off in sorted order.

By default `Heap` creates an empty heap. If you pass in an iterable `i` to the constructor, this is equivalent to calling `extend(i)` on the freshly-constructed `Heap`.

In addition to the below methods, `Heap` objects support iteration, `len`, the `in` operator, and use as a boolean expression. You can also index or slice into a `Heap` object, which behaves as if the heap is a list of objects in sorted order. Getting the first item (`Heap[0]`, aka peek) is cheap; the other operations can get very expensive.
Methods on a Heap object:
-
Adds object `o` to the heap.
-
Removes all objects from the heap, resetting it to empty.
-
Returns a shallow copy of the heap. Only duplicates the heap data structures itself; does not duplicate the objects in the heap.
-
Adds all the objects from the iterable `i` to the heap.
-
If object `o` is in the heap, removes it. If `o` is not in the heap, raises `ValueError`.
-
If the heap is not empty, returns the first item in the heap in sorted order. If the heap is empty, raises `IndexError`.
-
Equivalent to calling `Heap.append(o)` immediately followed by `Heap.popleft()`. If `o` is smaller than any other object in the heap at the time it's added, this will return `o`.
-
Equivalent to calling `Heap.popleft()` immediately followed by `Heap.append(o)`. This method will never return `o`, unless `o` was already in the heap before the method was called.
-
Not a method, a property. Returns a copy of the contents of the heap, in sorted order.
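For comparison, the operations described above have close counterparts in the raw `heapq` module. The example below uses only `heapq`; the mapping in the comments is my reading of the descriptions above, and whether big calls these exact functions internally is an assumption.

```python
import heapq

heap = []
for item in (5, 1, 3):
    heapq.heappush(heap, item)   # roughly Heap.append(o)

assert heap[0] == 1              # peeking at the smallest item is cheap
assert heapq.heappop(heap) == 1  # roughly Heap.popleft()

# "append, then popleft": 2 is smaller than everything in the heap,
# so it comes straight back.
assert heapq.heappushpop(heap, 2) == 2

# "popleft, then append": returns the old smallest item (3),
# never the newly added 0.
assert heapq.heapreplace(heap, 0) == 3

print(sorted(heap))  # [0, 5]
```

The appeal of a wrapper is visible even here: with raw `heapq` you have to remember which of `heappushpop` and `heapreplace` pushes first.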
Functions and classes for working with iteration. Only one entry so far.
-
Wraps any iterator, allowing you to push items back on the iterator. This allows you to "peek" at the next item (or items); you can get the next item, examine it, and then push it back. If any objects have been pushed onto the iterator, they are yielded first, before attempting to yield from the wrapped iterator.
The constructor accepts one argument, an `iterable`, with a default of `None`. If `iterable` is `None`, the `PushbackIterator` is created in an exhausted state.

When the wrapped `iterable` is exhausted (or if you passed in `None` to the constructor) you can still call the `push` method to add items, at which point the `PushbackIterator` can be iterated over again.

In addition to the following methods, `PushbackIterator` supports the iterator protocol and testing for truth. A `PushbackIterator` is true if iterating over it will yield at least one value.
-
Equivalent to `next(PushbackIterator)`, but won't raise `StopIteration`. If the iterator is exhausted, returns the `default` argument.
-
Pushes a value into the iterator's internal stack. When a `PushbackIterator` is iterated over, and there are any pushed values, the top value on the stack will be popped and yielded. `PushbackIterator` only yields from the iterator it wraps when this internal stack is empty.
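The stack-first behavior is easy to picture with a small sketch. `SketchPushbackIterator` below is a hypothetical, pared-down version written for illustration; it omits truth testing and is not big's implementation.

```python
class SketchPushbackIterator:
    """Hypothetical minimal version of the behavior described above."""

    def __init__(self, iterable=None):
        # None means "start out exhausted".
        self._iterator = iter(iterable) if iterable is not None else None
        self._stack = []

    def __iter__(self):
        return self

    def __next__(self):
        if self._stack:
            return self._stack.pop()   # pushed values are yielded first
        if self._iterator is None:
            raise StopIteration
        return next(self._iterator)    # then the wrapped iterator

    def push(self, value):
        self._stack.append(value)

    def next(self, default=None):
        try:
            return next(self)
        except StopIteration:
            return default


it = SketchPushbackIterator("abc")
first = next(it)   # peek at the first item...
it.push(first)     # ...and put it back
assert list(it) == ["a", "b", "c"]

assert it.next("done") == "done"  # exhausted: returns the default
it.push("z")                      # pushing revives the iterator
assert next(it) == "z"
```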
A simple and lightweight logging class, useful for performance analysis.
Not intended as a full-fledged logging facility like Python's
logging module.
-
The default clock function used by the `Log` class. This function returns elapsed time in nanoseconds, expressed as an integer.

In Python 3.7+, this is `time.monotonic_ns`; in Python 3.6 this is a custom function that calls `time.perf_counter`, then converts that time to an integer number of nanoseconds.
-
A simple and lightweight logging class, useful for performance analysis. Not intended as a full-fledged logging facility like Python's `logging` module.

Allows nesting, which is literally just a presentation thing.

The `clock` named parameter specifies the function the `Log` object should call to get the time. This function should return an `int`, representing elapsed time in nanoseconds.

To use: first, create your `Log` object.

    log = Log()

Then log events by calling your `Log` object, passing in a string describing the event.

    log('text')

Enter a nested subsystem containing events with `log.enter`:

    log.enter('subsystem')

Then later exit that subsystem with `log.exit`:

    log.exit()

And finally print the log:

    log.print()

You can also iterate over the log events using `iter(log)`. This yields 4-tuples:

    (start_time, elapsed_time, event, depth)

`start_time` and `elapsed_time` are times, expressed as an integer number of nanoseconds. The first event is at `start_time` 0, and all subsequent start times are relative to that time.

`event` is the event string you passed in to `log()` (or `"<subsystem> start"` or `"<subsystem> end"`).

`depth` is an integer indicating how many subsystems the event is nested in; larger numbers indicate deeper nesting.
-
Notifies the log that you've entered a subsystem. The `subsystem` parameter should be a string describing the subsystem.

This is really just a presentation thing; all subsequent logged entries will be indented until you make the corresponding `log.exit()` call.

You may nest subsystems as deeply as you like.
-
Exits a logged subsystem. See `Log.enter`.
Log.print(*, print=None, title="[event log]", headings=True, indent=2, seconds_width=2, fractional_width=9)
-
Prints the log.
Keyword-only parameters:
`print` specifies the print function to use, default is `builtins.print`.

`title` specifies the title to print at the beginning. Default is `"[event log]"`. To suppress, pass in `None`.

`headings` is a boolean; if `True` (the default), prints column headings for the log.

`indent` is the number of spaces to indent in front of log entries, and also how many spaces to indent each time we enter a subsystem.

`seconds_width` is how wide to make the seconds column, default 2.

`fractional_width` is how wide to make the fractional column, default 9.
-
Resets the log to its initial state.
A replacement for Python's sched.scheduler object,
adding full threading support and a modern Python interface.
Python's sched.scheduler object was a clever idea for the
time. It abstracted away the concept of time from its interface,
allowing it to be adapted to new schemes of measuring time--including
mock time used for testing. Very nice!
But unfortunately, sched.scheduler was designed in 1991--long
before multithreading was common, years before threading support
was added to Python. Sadly its API isn't flexible enough to
correctly handle some scenarios:
- If one thread has called `sched.scheduler.run`, and the next scheduled event will occur at time T, and a second thread schedules a new event which occurs at a time < T, `sched.scheduler.run` won't return any events to the first thread until time T.
- If one thread has called `sched.scheduler.run`, and the next scheduled event will occur at time T, and a second thread cancels all events, `sched.scheduler.run` won't exit until time T.
Also, sched.scheduler is thirty years behind the times in
Python API design--its design predates many common modern
Python conventions. Its events are callbacks, which it
calls directly. Scheduler fixes this: its events are
objects, and you iterate over the Scheduler object to receive
events as they become due.
Scheduler also benefits from thirty years of improvements
to sched.scheduler. In particular, big reimplements the
relevant parts of the sched.scheduler test suite, to ensure that
Scheduler never repeats the historical problems discovered
over the lifetime of sched.scheduler.
-
An object representing a scheduled event in a `Scheduler`. You shouldn't need to create them manually; `Event` objects are created automatically when you add events to a `Scheduler`.

Supports one method:
-
Cancels this event. If this event has already been canceled, raises `ValueError`.
-
An abstract base class for `Scheduler` regulators.

A "regulator" handles all the details about time for a `Scheduler`. `Scheduler` objects don't actually understand time; it's all abstracted away by the `Regulator`.

You can implement your own `Regulator` and use it with `Scheduler`. Your `Regulator` subclass needs to implement a minimum of three methods: `now`, `sleep`, and `wake`. It must also provide an attribute called `lock`. The lock must implement the context manager protocol, and should ensure thread safety for the `Regulator`. (`Scheduler` will only request the `Regulator`'s lock if it's not already holding it. Put another way, the `Regulator`'s lock doesn't need to be a "reentrant" or "recursive" lock.)

Normally a `Regulator` represents time using a floating-point number, representing a fractional number of seconds since some epoch. But this isn't strictly necessary. Any Python object that fulfills these requirements will work:
- The time class must implement `__le__`, `__eq__`, `__add__`, and `__sub__`, and these operations must be consistent in the same way they are for number objects.
- If `a` and `b` are instances of the time class, and `a.__le__(b)` is true, then `a` must either be an earlier time, or a smaller interval of time.
- The time class must also implement rich comparison with numbers (integers and floats), and `0` must represent both the earliest time and a zero-length interval of time.
-
Returns the current time in local units. Must be monotonically increasing; for any two calls to now during the course of the program, the later call must never have a lower value than the earlier call.
A `Scheduler` will only call this method while holding this regulator's lock.
-
Sleeps for some amount of time, in local units. Must support an interval of `0`, which should represent not sleeping. (Though it's preferable that an interval of `0` yields the rest of the current thread's remaining time slice back to the operating system.)

If `wake` is called on this `Regulator` object while a different thread has called this function to sleep, `sleep` must abandon the rest of the sleep interval and return immediately.

A `Scheduler` will only call this method while not holding this regulator's lock.
-
Aborts all current calls to `sleep` on this `Regulator`, across all threads.

A `Scheduler` will only call this method while holding this regulator's lock.
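To illustrate the shape of this interface, here is a sketch of a mock-time regulator of the sort you might use in tests. `SketchMockRegulator` is a hypothetical class written against the requirements listed above (`now`, `sleep`, `wake`, and a `lock` attribute); it deliberately does not import big, so a real version would presumably subclass big's `Regulator` instead.

```python
import threading


class SketchMockRegulator:
    """Hypothetical mock-time regulator; time only moves when we sleep."""

    def __init__(self):
        # Required attribute: a context manager guarding the regulator.
        # A plain (non-reentrant) lock is enough per the text above.
        self.lock = threading.Lock()
        self._time = 0.0

    def now(self):
        # Monotonic by construction: _time never decreases.
        return self._time

    def sleep(self, interval):
        # Mock time: "sleeping" advances the clock instantly.
        # Called without the lock held, so acquiring it here is safe.
        with self.lock:
            self._time += interval

    def wake(self):
        # Nothing ever blocks in this sketch, so there is nothing to wake.
        pass


regulator = SketchMockRegulator()
assert regulator.now() == 0
regulator.sleep(5)
regulator.sleep(0)  # an interval of 0 must be supported
assert regulator.now() == 5
```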
-
Implements a scheduler. The only argument is the "regulator" object to use; the regulator abstracts away all time-related details for the scheduler. By default `Scheduler` uses an instance of `SingleThreadedRegulator`, which is not thread-safe.

(If you need the scheduler to be thread-safe, pass in an instance of a thread-safe `Regulator` class like `ThreadSafeRegulator`.)

In addition to the below methods, `Scheduler` objects support being evaluated in a boolean context (they are true if they contain any events), and they support being iterated over. Iterating over a `Scheduler` object blocks until the next event comes due, at which point the `Scheduler` yields that event. An empty `Scheduler` that is iterated over raises `StopIteration`. You can reuse `Scheduler` objects, iterating over them until empty, then adding more objects and iterating over them again.
-
Schedules an object `o` to be yielded as an event by this `Scheduler` object at some time in the future.

By default the `time` value is a relative time value, and is added to the current time; using a `time` value of 0 should schedule this event to be yielded immediately.

If `absolute` is true, `time` is regarded as an absolute time value.

If multiple events are scheduled for the same time, they will be yielded by order of `priority`. Lower values of `priority` represent higher priorities. The default value is `Scheduler.DEFAULT_PRIORITY`, which is 100. If two events are scheduled for the same time, and have the same priority, `Scheduler` will yield the events in the order they were added.

Returns an `Event` object, which can be used to cancel the event.
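The ordering rule (time first, then priority, then insertion order) can be sketched with a plain heap of tuples. This illustrates the rule only; `sketch_schedule` and the tuple layout are hypothetical, and it says nothing about how big stores its events.

```python
import heapq
import itertools

queue = []
sequence = itertools.count()  # tie-breaker: order of insertion


def sketch_schedule(o, time, priority=100):
    """Hypothetical: order by time, then priority, then insertion order."""
    heapq.heappush(queue, (time, priority, next(sequence), o))


sketch_schedule("late", 10)
sketch_schedule("normal", 5)
sketch_schedule("urgent", 5, priority=1)  # lower value = higher priority
sketch_schedule("normal too", 5)

order = [heapq.heappop(queue)[-1] for _ in range(len(queue))]
print(order)  # ['urgent', 'normal', 'normal too', 'late']
```

Because the sequence number is unique, tuple comparison never reaches the scheduled object itself, so the objects don't need to be comparable.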
-
Cancels a scheduled event. `event` must be an object returned by this `Scheduler` object. If `event` is not currently scheduled in this `Scheduler` object, raises `ValueError`.
-
A list of the currently scheduled `Event` objects, in the order they will be yielded.
-
Returns an iterator for the events in the `Scheduler` that only yields the events that are currently due. Never blocks; if the next event is not due yet, raises `StopIteration`.
-
An implementation of `Regulator` designed for use in single-threaded programs. It doesn't support multiple threads, and in particular is not thread-safe. But it's much higher performance than thread-safe `Regulator` implementations.

This `Regulator` isn't guaranteed to be safe for use while in a signal-handler callback.
-
A thread-safe implementation of `Regulator` designed for use in multithreaded programs.

This `Regulator` isn't guaranteed to be safe for use while in a signal-handler callback.
Library code for working with simple state machines.
There are lots of popular Python libraries for implementing
state machines. But they all seem to be designed for
large-scale state machines. These libraries are
sophisticated and data-driven, with expansive APIs.
And, as a rule, they require the state to be
a passive object (e.g. an Enum), and require you to explicitly
describe every possible state transition.
This approach is great for massive, super-complex state machines--you need a sophisticated library to manage such complex state machines. This approach also permits these libraries to offer clever features like automatically generating diagrams of your state machine.
However, most of the time, this level of sophistication is
unnecessary. There are lots of use cases for small scale,
simple state machines, where this complex data-driven approach
only gets in the way. I prefer writing my state machines
with active objects--where states are implemented as classes,
events are implemented as method calls on those classes,
and you transition to a new state by simply overwriting a
state attribute with a new instance of one of these classes.
big.state is designed to make it easy to write this style of
state machine. It has
a deliberately minimal, simple interface--the constructor for
the main StateManager class only has four parameters,
and it only exposes three attributes. The module also has
two decorators to make your life easier. And that's it!
With that small API surface area, you can effortlessly write
large scale state machines.
(But you can also write tiny data-driven state machines too.
Although big.state makes state machines with active states
easy to write, it's agnostic about how you actually implement
your state machine. big.state makes it easy to write any
kind of state machine you like!)
big.state provides features like:
- method calls that get called when entering and exiting a state,
- observer objects that get called each time you transition to a new state, and
- safety mechanisms to catch bugs and prevent design mistakes.
The main class in big.state is StateManager. This class
maintains the current "state" of your state machine, and
manages transitions to new states. It takes one required
parameter, which is the initial state.
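To show how these features fit together, here is a deliberately tiny sketch of a state manager with enter/exit hooks, observers, and a guard against observers changing the state. `SketchStateManager`, its `observers` list, the `RuntimeError`, and the exact call order are all assumptions made for illustration; big's `StateManager` has more features and its own checks.

```python
class SketchStateManager:
    """Hypothetical miniature of the ideas above, not big's StateManager."""

    def __init__(self, state):
        self.observers = []
        self._in_observers = False
        self._state = None
        self.state = state  # goes through the setter below

    @property
    def state(self):
        return self._state

    @state.setter
    def state(self, new_state):
        if self._in_observers:
            # Safety mechanism: no transitions while observers are running.
            raise RuntimeError("observers may not change the state")
        old_state = self._state
        if old_state is not None and hasattr(old_state, "on_exit"):
            old_state.on_exit()
        self._state = new_state
        if hasattr(new_state, "on_enter"):
            new_state.on_enter()
        self._in_observers = True
        try:
            for observer in self.observers:
                observer(self)
        finally:
            self._in_observers = False


events = []


class Off:
    def on_enter(self):
        events.append("enter Off")

    def on_exit(self):
        events.append("exit Off")


class On:
    def on_enter(self):
        events.append("enter On")


manager = SketchStateManager(Off())
manager.observers.append(
    lambda m: events.append("observed " + type(m.state).__name__))
manager.state = On()
print(events)  # ['enter Off', 'exit Off', 'enter On', 'observed On']
```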
Here are my recommended best practices for working with
StateManager for medium-sized and larger state machines:
- Your state machine should be implemented as a class.
    - You should store `StateManager` as an attribute of that class, preferably called `state_manager`. (Your state machine should have a "has-a" relationship with `StateManager`, not an "is-a" relationship where it inherits from `StateManager`.)
    - Decorating your state machine class with the `accessor` decorator will save you a lot of boilerplate. If your state machine is stored in `o`, decorating with `accessor` lets you access the current state using `o.state` instead of `o.state_manager.state`.
- Your states should be implemented as classes.
    - You should have a base class for your state classes, containing whatever functionality they have in common.
    - You're encouraged to define these state classes inside your state machine class, and use `BoundInnerClass` so they automatically get references to the state machine they're a part of.
- Events should be method calls made on your state machine object.
    - As a rule, events should be dispatched from the state machine to a method call on the current state with the same name.
    - If all the code to handle a particular event lives in the states, use the `dispatch` decorator to handle dispatching the call. This will write a new method for you, that calls the equivalent method on the current state, passing in all the arguments it received.
    - Your state base class should have a method for every event, decorated with `pure_virtual`.
Here's a simple example demonstrating all this functionality.
It's a state machine with two states, On and Off, and
one event method toggle. Calling toggle transitions
the state machine from the Off state to the On state,
and vice-versa.

    from big.all import accessor, BoundInnerClass, dispatch, pure_virtual, StateManager

    @accessor()
    class StateMachine:
        def __init__(self):
            self.state_manager = StateManager(self.Off())

        @dispatch()
        def toggle(self):
            ...

        @BoundInnerClass
        class State:
            def __init__(self, state_machine):
                self.state_machine = state_machine

            def __repr__(self):
                return f"<{type(self).__name__}>"

            @pure_virtual()
            def toggle(self):
                ...

        @BoundInnerClass
        class Off(State.cls):
            def on_enter(self):
                print("off!")

            def toggle(self):
                sm = self.state_machine
                sm.state = sm.On() # sm.state is the accessor

        @BoundInnerClass
        class On(State.cls):
            def on_enter(self):
                print("on!")

            def toggle(self):
                sm = self.state_machine
                sm.state = sm.Off()

    sm = StateMachine()
    print(sm.state)
    for _ in range(3):
        sm.toggle()
        print(sm.state)

For another, more complete example of working with StateManager,
see the test_vending_machine test code in tests/test_state.py
in the big source tree.
-
Class decorator. Adds a convenient state accessor attribute to your class.
When you have a state machine class containing a `StateManager` object, it can be wordy and inconvenient to access the state through the state machine attribute:

    class StateMachine:
        def __init__(self):
            self.state_manager = StateManager(self.InitialState)
        ...

    sm = StateMachine()
    # vvvvvvvvvvvvvvvvvvvv that's a lot!
    sm.state_manager.state = NextState()

The `accessor` class decorator creates a property for you, a short-cut that directly accesses the `state` attribute of your state manager. Just decorate with `@accessor()`:

    @accessor()
    class StateMachine:
        def __init__(self):
            self.state_manager = StateManager(self.InitialState)
        ...

    sm = StateMachine()
    # vvvvvv that's a lot shorter!
    sm.state = NextState()

The `state` attribute evaluates to the same value:

    sm.state == sm.state_manager.state

And setting it sets the state on your `StateManager` instance. These two statements now do the same thing:

    sm.state_manager.state = new_state
    sm.state = new_state

By default, this decorator assumes your `StateManager` instance is in the `state_manager` attribute, and you want to name the new accessor attribute `state`. You can override these defaults; the decorator's first parameter, `attribute`, should be the string used for the new accessor attribute, and the second parameter, `state_manager`, should be the name of the attribute where your `StateManager` instance is stored.

For example, if your state manager is stored in an attribute called `sm`, and you want the short-cut to be called `st`, you'd decorate your class with

    @accessor(attribute='st', state_manager='sm')

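If you're curious how a class decorator like this can work, here's a rough sketch built on `property`. It's a guess at the general technique, not big's actual code; `accessor_sketch`, `FakeManager`, and `Machine` are hypothetical names made up for the example.

```python
def accessor_sketch(attribute="state", state_manager="state_manager"):
    """Hypothetical stand-in for big's accessor; not the library's code."""
    def decorator(cls):
        def getter(self):
            # Read the state through the state manager attribute.
            return getattr(self, state_manager).state

        def setter(self, value):
            # Assigning to the short-cut assigns to the manager's state.
            getattr(self, state_manager).state = value

        setattr(cls, attribute, property(getter, setter))
        return cls
    return decorator


class FakeManager:
    """Stand-in for StateManager: it just holds a 'state' attribute."""
    def __init__(self, state):
        self.state = state


@accessor_sketch(attribute="st", state_manager="sm")
class Machine:
    def __init__(self):
        self.sm = FakeManager("idle")


m = Machine()
m.st = "running"
print(m.sm.state)
```

Because the property lives on the class and looks up the manager by name at call time, it keeps working even if the manager object itself is replaced later.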
-
Decorator for state machine event methods, dispatching the event from the state machine object to its current state.
`dispatch` helps with the following scenario:

- You have your own state machine class which contains a `StateManager` object.
- You want your state machine class to have methods representing events.
- Rather than handle those events in your state machine object itself, you want to dispatch them to the current state.

Simply create a method in your state machine class with the correct name and parameters but a no-op body, and decorate it with `@dispatch()`. The `dispatch` decorator will rewrite your method so it calls the equivalent method on the current state, passing through all the arguments.

For example, instead of writing this:

    class StateMachine:
        def __init__(self):
            self.state_manager = StateManager(self.InitialState)

        def on_sunrise(self, time, *, verbose=False):
            return self.state_manager.state.on_sunrise(time, verbose=verbose)

you can literally write this, which does the same thing:

    class StateMachine:
        def __init__(self):
            self.state_manager = StateManager(self.InitialState)

        @dispatch()
        def on_sunrise(self, time, *, verbose=False):
            ...

Here, the `on_sunrise` function you wrote is actually thrown away. (That's why the body is simply one `...` statement.) Your function is replaced with a function that gets the `state_manager` attribute from `self`, then gets the `state` attribute from that `StateManager` instance, then calls a method with the same name as the decorated function, passing in its arguments using `*args` and `**kwargs`.

Note that, as a stylistic convention, you're encouraged to literally use a single ellipsis as the body of these functions, like in the example above. This is a visual cue to readers that the body of the function doesn't matter.

The `state_manager` argument to the decorator should be the name of the attribute where the `StateManager` instance is stored in `self`. The default is `'state_manager'`, but you can specify a different string if you've stored your `StateManager` in another attribute. For example, if your state manager is in the attribute `smedley`, you'd decorate with:

    @dispatch('smedley')

The `prefix` and `suffix` arguments are strings added to the beginning and end of the method call we call on the current state. For example, if you want the method you call to have an active verb form (e.g. `reset`), but you want it to directly call an event handler that starts with `on_` by convention (e.g. `on_reset`), you could do this:

    @dispatch(prefix='on_')
    def reset(self):
        ...

This is equivalent to:

    def reset(self):
        return self.state_manager.state.on_reset()

If you have more than one event method, instead of decorating every event method with the same copy-and-pasted `dispatch` call, it's better to call `dispatch` once, cache the function it returns, and decorate with that. Like so:

    my_dispatch = dispatch('smedley', prefix='on_')

    @my_dispatch
    def reset(self):
        ...

    @my_dispatch
    def sunrise(self):
        ...

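Here's a rough sketch of how a decorator with this behavior could be written. It's an illustration of the technique, not big's implementation; `dispatch_sketch`, `Manager`, `Idle`, and `Machine` are hypothetical names, and using `functools.wraps` is my own choice.

```python
import functools


def dispatch_sketch(state_manager="state_manager", *, prefix="", suffix=""):
    """Hypothetical stand-in for big's dispatch; not the library's code."""
    def decorator(fn):
        # Only the decorated function's name is used; its body never runs.
        name = prefix + fn.__name__ + suffix

        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            state = getattr(self, state_manager).state
            return getattr(state, name)(*args, **kwargs)
        return wrapper
    return decorator


class Manager:
    """Stand-in for StateManager: it just holds a 'state' attribute."""
    def __init__(self, state):
        self.state = state


class Idle:
    def on_reset(self, reason="manual"):
        return f"reset ({reason})"


# Call the decorator factory once and reuse the result.
my_dispatch = dispatch_sketch("smedley", prefix="on_")


class Machine:
    def __init__(self):
        self.smedley = Manager(Idle())

    @my_dispatch
    def reset(self, reason="manual"):
        ...


print(Machine().reset(reason="timer"))
```

The method name is computed once at decoration time, while the current state is looked up on every call, so the event always lands on whatever state the machine is in at that moment.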
-
Base class for state machine state implementation classes. Use of this base class is optional; states can be instances of any type except
types.NoneType.
-
Simple, Pythonic state machine manager.
Has three public attributes:

- `state`: The current state. You transition from one state to another by assigning to this attribute.
- `next`: The state the `StateManager` is transitioning to, if it's currently in the process of transitioning to a new state. If the `StateManager` isn't currently transitioning to a new state, its `next` attribute is `None`. While the manager is currently transitioning to a new state, it's illegal to start a second transition. (In other words: you can't assign to `state` while `next` is not `None`.)
- `observers`: A list of callables that get called during every state transition. It's initially empty; you should add and remove observers to the list as needed.
  - The callables will be called with one positional argument, the state manager object.
  - Since observers are called during the state transition, they aren't permitted to initiate state transitions.
  - You're permitted to modify the list of observers at any time. If you modify the list of observers during an observer call, `StateManager` will finish the current observer callbacks using a copy of the old list.

The constructor takes the following parameters:

- `state`: The initial state. It can be any valid state object; by default, any Python value can be a state except `None`. (But also see the `state_class` parameter below.)
- `on_enter`: Represents a method call on states, made when entering that state. The value itself is a string used to look up an attribute on state objects; by default `on_enter` is the string `'on_enter'`, but it can be any legal Python identifier string, or any false value.

  If `on_enter` is a valid identifier string, and this `StateManager` object transitions to a state object O, and O has an attribute with this name, `StateManager` will call that attribute (with no arguments) immediately after transitioning to that state. Passing in a false value for `on_enter` disables this behavior.

  `on_enter` is called immediately after the transition is complete, which means you're expressly permitted to make a state transition inside an `on_enter` call.

  If defined, `on_enter` will be called on the initial state object, from inside the `StateManager` constructor.
- `on_exit`: Similar to `on_enter`, except the attribute is called when transitioning away from a state object. Its default value is `'on_exit'`.

  `on_exit` is called during the state transition, which means you're expressly forbidden from making a state transition inside an `on_exit` call.
- `state_class`: Used to enforce that this `StateManager` only ever transitions to valid state objects. It should be either `None` or a class. If it's a class, the `StateManager` object will require every value assigned to its `state` attribute to be an instance of that class. If it's `None`, states can be any object (except `None`).

To transition to a new state, simply assign to the `state` attribute.

- If `state_class` is `None`, you may use any value as a state except `None`.
- It's illegal to assign to `state` while currently transitioning to a new state. (Or, in other words, at any time `self.next` is not `None`.)
- If the current state object has an `on_exit` method, it will be called (with zero arguments) during the transition to the next state. This means it's illegal to initiate a state transition inside an `on_exit` call.
- If you assign an object to `state` that has an `on_enter` attribute, that method will be called (with zero arguments) immediately after we have transitioned to that state. This means it's permitted to initiate a state transition inside an `on_enter` call.
- It's illegal to attempt to transition to the current state. If `state_manager.state` is already `foo`, `state_manager.state = foo` will raise an exception.

If you have a `StateManager` instance called `state_manager`, and you transition it to `new_state`:

    state_manager.state = new_state

`StateManager` will execute the following sequence of events:

- Set `state_manager.next` to `new_state`.
  - As of this moment, `state_manager` is "transitioning" to the new state.
- If `state_manager.state` has an `on_exit` attribute, call `state_manager.state.on_exit()`.
- For every object `o` in the `state_manager.observers` list, call `o(state_manager)`.
- Set `state_manager.next` to `None`.
- Set `state_manager.state` to `new_state`.
  - As of this moment, the transition is complete, and `state_manager` is now "in" the new state.
- If `state_manager.state` has an `on_enter` attribute, call `state_manager.state.on_enter()`.

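The sequence above can be sketched as a tiny class. This is an illustrative re-creation, not big's `StateManager`: `MiniStateManager` and `Named` are hypothetical names, it raises a plain `RuntimeError` where the real class raises `TransitionError`, and it skips details such as calling `on_enter` on the initial state.

```python
class MiniStateManager:
    """Illustrative re-creation of the transition sequence; not big's code."""

    def __init__(self, state):
        self._state = state
        self.next = None
        self.observers = []

    @property
    def state(self):
        return self._state

    @state.setter
    def state(self, new_state):
        if self.next is not None:
            raise RuntimeError("already transitioning to another state")
        if new_state is self._state:
            raise RuntimeError("already in that state")
        self.next = new_state                  # now "transitioning"
        on_exit = getattr(self._state, "on_exit", None)
        if on_exit is not None:
            on_exit()                          # leave the old state
        for observer in list(self.observers):  # iterate over a copy
            observer(self)
        self.next = None
        self._state = new_state                # now "in" the new state
        on_enter = getattr(new_state, "on_enter", None)
        if on_enter is not None:
            on_enter()                         # enter the new state


log = []


class Named:
    def __init__(self, name):
        self.name = name

    def on_enter(self):
        log.append(f"enter {self.name}")

    def on_exit(self):
        log.append(f"exit {self.name}")


a, b = Named("a"), Named("b")
manager = MiniStateManager(a)
manager.observers.append(lambda m: log.append(f"observe next={m.next.name}"))
manager.state = b
print(log)
```

Note how `next` is cleared before `on_enter` runs; that ordering is what makes a transition from inside `on_enter` legal while one from inside `on_exit` or an observer is rejected.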
-
-
Exception raised when attempting to execute an illegal state transition.
There are only two types of illegal state transitions:

- An attempted state transition while we're in the process of transitioning to another state. In other words, if `state_manager` is your `StateManager` object, you can't set `state_manager.state` when `state_manager.next` is not `None`.
- An attempt to transition to the current state. This is illegal:

      # initial_state is a placeholder for any state object other than foo
      state_manager = StateManager(initial_state)
      state_manager.state = foo
      state_manager.state = foo # <-- this statement raises TransitionError

Note that transitioning to a different but identical object is expressly permitted.
-
Functions for working with text strings. There are
several families of functions inside the text module;
for a higher-level view of those families, read the
following deep-dives:
- The `multi-` family of string functions
- `lines` and lines modifier functions
- Word wrapping and formatting
All the functions in big.text will work with either
str or bytes objects, except the three
Word wrapping and formatting
functions. When working with bytes,
by default the functions will only work with ASCII
characters.
The big text functions all support both str and bytes.
The functions all automatically detect whether you passed in
str or bytes using an
intentionally simple and predictable process, as follows:
At the start of each function, it'll test its first "string"
argument to see if it's a bytes object.
is_bytes = isinstance(<argument>, bytes)
If isinstance returns True, the function assumes all arguments are
bytes objects. Otherwise the function assumes all arguments
are str objects.
As a rule, no further testing, casting, or catching exceptions is done.
Functions that take multiple string-like parameters require all such arguments to be the same type. These functions will check that all such arguments are of the same type.
Subclasses of str and bytes will also work; anywhere you
should pass in a str, you can also pass in a subclass of
str, and likewise for bytes.
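As a small illustration of that detection rule, here's a sketch of a helper that follows the same pattern. `split_first_word` is a made-up function for this example, not something big provides.

```python
def split_first_word(s, separator=None):
    """Hypothetical helper showing the str/bytes detection rule; not part of big."""
    # Test the first "string" argument once; everything else follows from it.
    is_bytes = isinstance(s, bytes)
    if separator is None:
        separator = b" " if is_bytes else " "
    elif isinstance(separator, bytes) != is_bytes:
        # All string-like arguments must be the same type.
        raise TypeError("s and separator must be the same type")
    head, _, tail = s.partition(separator)
    return head, tail


print(split_first_word("hello world"))
print(split_first_word(b"hello world"))
```

Because the check uses `isinstance`, subclasses of `str` and `bytes` take the same path as their base types.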
-
A tuple of `str` objects, representing every line-breaking whitespace character defined by ASCII.

Useful as a `separator` argument for big functions that accept one, e.g. the big "multi-" family of functions, or the `lines` and lines modifier functions.

Also contains `'\r\n'`. If you don't want to include this string, use `ascii_linebreaks_without_crlf` instead. See the deep-dive section on The Unix, Mac, and DOS line-break conventions for more.

For more information, please see the Whitespace and line-breaking characters in Python and big deep-dive.
-
Equivalent to `ascii_linebreaks` without `'\r\n'`.
-
A tuple of `str` objects, representing every whitespace character defined by ASCII.

Useful as a `separator` argument for big functions that accept one, e.g. the big "multi-" family of functions, or the `lines` and lines modifier functions.

Also contains `'\r\n'`. If you don't want to include this string, use `ascii_whitespace_without_crlf` instead. See the deep-dive section on The Unix, Mac, and DOS line-break conventions for more.

For more information, please see the Whitespace and line-breaking characters in Python and big deep-dive.
-
Equivalent to `ascii_whitespace` without `'\r\n'`.
-
A tuple of `bytes` objects, representing every line-breaking whitespace character recognized by the Python `bytes` object.

Useful as a `separator` argument for big functions that accept one, e.g. the big "multi-" family of functions, or the `lines` and lines modifier functions.

Also contains `b'\r\n'`. If you don't want to include this string, use `bytes_linebreaks_without_crlf` instead. See the deep-dive section on The Unix, Mac, and DOS line-break conventions for more.

For more information, please see the Whitespace and line-breaking characters in Python and big deep-dive.
-
Equivalent to `bytes_linebreaks` without `b'\r\n'`.
-
A tuple of `bytes` objects, representing every whitespace character recognized by the Python `bytes` object.

Useful as a `separator` argument for big functions that accept one, e.g. the big "multi-" family of functions, or the `lines` and lines modifier functions.

Also contains `b'\r\n'`. If you don't want to include this string, use `bytes_whitespace_without_crlf` instead. See the deep-dive section on The Unix, Mac, and DOS line-break conventions for more.

For more information, please see the Whitespace and line-breaking characters in Python and big deep-dive.
-
Equivalent to `bytes_whitespace` without `b'\r\n'`.
-
Class representing a delimiter for `parse_delimiters`.

- `open` is the opening delimiter character, can be `str` or `bytes`, must be length 1.
- `close` is the closing delimiter character, must be the same type as `open`, and length 1.
- `backslash` is a boolean: when inside this delimiter, can you escape delimiters with a backslash? (You usually can inside single or double quotes.)
- `nested` is a boolean: must other delimiters nest in this delimiter? (Delimiters don't usually need to be nested inside single and double quotes.)
-
Accepts a container object `o` containing `str` objects; returns an equivalent object with the strings encoded to `bytes`. `o` must be either `dict`, `list`, or `tuple`, or a subclass of one of those.

Encodes every `str` inside using the encoding specified in the `encoding` parameter, default is `'ascii'`. Handles nested containers.

If `o` is a `str`, raises `TypeError`. If `o` is a container and contains an object that isn't a `str`, `dict`, `list`, or `tuple`, raises `TypeError`.
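The recursion described above might look something like the following sketch. It isn't big's code: `encode_strings_sketch` is a hypothetical name, encoding dictionary keys as well as values is my assumption, and rebuilding containers with `type(value)(...)` won't suit every subclass (named tuples, for instance).

```python
def encode_strings_sketch(o, *, encoding="ascii"):
    """Hypothetical re-creation of the described behavior; not big's code."""
    if isinstance(o, str):
        raise TypeError("o must be a dict, list, or tuple, not a str")

    def encode(value):
        if isinstance(value, str):
            return value.encode(encoding)
        if isinstance(value, dict):
            # Assumption: keys are encoded along with values.
            return type(value)((encode(k), encode(v)) for k, v in value.items())
        if isinstance(value, (list, tuple)):
            return type(value)(encode(item) for item in value)
        raise TypeError(f"can't encode object {value!r}")

    return encode(o)


print(encode_strings_sketch(["a", ("b", {"c": "d"})]))
```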
-
Uppercases the first character of every word in `s`, leaving the other letters alone. `s` should be `str` or `bytes`.

(For the purposes of this algorithm, words are any contiguous run of non-whitespace characters.)

This function will also capitalize the letter after an apostrophe if the apostrophe:

- is immediately after whitespace, or
- is immediately after a left parenthesis character (`'('`), or
- is the first letter of the string, or
- is immediately after a letter O or D, when that O or D
  - is after whitespace, or
  - is the first letter of the string.

In this last case, the O or D will also be capitalized.

Finally, this function will capitalize the letter after a quote mark if the quote mark:

- is after whitespace, or
- is the first letter of a string.

(A run of consecutive apostrophes and/or quote marks is considered one quote mark for the purposes of capitalization.)
All these rules mean `gently_title` correctly handles internally quoted strings:

    He Said 'No I Did Not'

and contractions that start with an apostrophe:

    'Twas The Night Before Christmas

as well as certain Irish, French, and Italian names:

    Peter O'Toole
    Lord D'Arcy

If specified, `apostrophes` should be a `str` or `bytes` object containing characters that should be considered apostrophes. If `apostrophes` is false, and `s` is `bytes`, `apostrophes` is set to a bytes object containing the only ASCII apostrophe character:

    '

If `apostrophes` is false and `s` is `str`, `apostrophes` is set to a string containing these Unicode apostrophe code points:

    '‘’‚‛

Note that neither of these strings contains the "back-tick" character:

    `

This is a diacritical used for modifying letters, and isn't used as an apostrophe.

If specified, `double_quotes` should be a `str` or `bytes` object containing characters that should be considered double-quote characters. If `double_quotes` is false, and `s` is `bytes`, `double_quotes` is set to a bytes object containing the only ASCII double-quote character:

    "

If `double_quotes` is false and `s` is `str`, `double_quotes` is set to a string containing these Unicode double-quote code points:

    "“”„‟«»‹›

-
Converts an integer into the equivalent English string.

    int_to_words(2) -> "two"
    int_to_words(35) -> "thirty-five"

If the keyword-only parameter `flowery` is true (the default), you also get commas and the word `and` where you'd expect them. (When `flowery` is true, `int_to_words(i)` produces identical output to `inflect.engine().number_to_words(i)`, except for negative numbers: `inflect` starts negative numbers with "minus", big starts them with "negative".)

If the keyword-only parameter `ordinal` is true, the string produced describes that ordinal number (instead of that cardinal number). Ordinal numbers describe position, e.g. where a competitor placed in a competition. In other words, `int_to_words(1)` returns the string `'one'`, but `int_to_words(1, ordinal=True)` returns the string `'first'`.

Numbers >= `10**66` (one thousand vigintillion) are only converted using `str(i)`. Sorry!
-
A tuple of `str` objects, representing every line-breaking whitespace character recognized by the Python `str` object. Identical to `str_linebreaks`.

Useful as a `separator` argument for big functions that accept one, e.g. the big "multi-" family of functions, or the `lines` and lines modifier functions.

Also contains `'\r\n'`. See the deep-dive section on The Unix, Mac, and DOS line-break conventions for more.

For more information, please see the Whitespace and line-breaking characters in Python and big deep-dive.
-
Equivalent to `linebreaks` without `'\r\n'`.
-
A "lines iterator" object. Splits s into lines, and iterates yielding those lines.
`s` can be `str`, `bytes`, or any iterable of `str` or `bytes`.

If `s` is neither `str` nor `bytes`, `s` must be an iterable; `lines` yields successive elements of `s` as lines. All objects yielded by this iterable should be homogeneous, either `str` or `bytes`.

If `s` is `str` or `bytes`, and `separators` is `None`, `lines` will split `s` at line boundaries and yield those lines, including empty lines. If `separators` is not `None`, it must be an iterable of strings of the same type as `s`; `lines` will split `s` using `multisplit`.

When iterated over, yields 2-tuples:

    (info, line)

`info` is a `LineInfo` object, which contains three fields by default:

- `line` - the original line, never modified
- `line_number` - the line number of this line, starting at the `line_number` passed in and adding 1 for each successive line
- `column_number` - the column this line starts on, starting at the `column_number` passed in, and adjusted when characters are removed from the beginning of `line`

The `tab_width` keyword-only parameter is an integer, representing how many spaces wide a tab character should be. It isn't used by `lines` itself; instead, it's stored internally, and may be used by lines modifier functions (e.g. `lines_convert_tabs_to_spaces`, `lines_strip_indent`). Similarly, all keyword arguments passed in via `kwargs` are stored internally and can be accessed by user-defined lines modifier functions.

For more information, see the deep-dive on `lines` and lines modifier functions.
-
The first object in each 2-tuple yielded by a `lines` iterator, containing metadata about the line. You can add your own fields by passing them in via `**kwargs`; you can also add new attributes or modify existing attributes as needed from inside a "lines modifier" function.

For more information, see the deep-dive on `lines` and lines modifier functions.
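To show the general shape of a lines iterator feeding a lines modifier, here's a miniature sketch. It doesn't use big at all: `lines_sketch` and `strip_leading_sketch` are hypothetical names, the info object is a plain `SimpleNamespace` rather than a real `LineInfo`, and splitting only on `"\n"` is a simplification.

```python
from types import SimpleNamespace


def lines_sketch(s, *, line_number=1, column_number=1):
    """Hypothetical miniature of a lines iterator; not big's implementation."""
    for offset, line in enumerate(s.split("\n")):
        # Each line travels with a mutable info object carrying its metadata.
        info = SimpleNamespace(
            line=line,
            line_number=line_number + offset,
            column_number=column_number,
        )
        yield info, line


def strip_leading_sketch(li):
    """Hypothetical lines modifier: removes leading whitespace and records it."""
    for info, line in li:
        stripped = line.lstrip()
        info.leading = line[:len(line) - len(stripped)]
        # Keep the column number pointing at the first surviving character.
        info.column_number += len(info.leading)
        yield info, stripped


for info, line in strip_leading_sketch(lines_sketch("alpha\n  beta")):
    print(info.line_number, info.column_number, repr(line), repr(info.line))
```

Since a modifier both consumes and yields `(info, line)` pairs, modifiers can be stacked in any order, and `info.line` always preserves the original text.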
-
A lines modifier function. Converts tabs to spaces for the lines of a "lines iterator", using the `tab_width` passed in to `lines`.

For more information, see the deep-dive on `lines` and lines modifier functions.
-
A lines modifier function. Filters out comment lines from the lines of a "lines iterator". Comment lines are lines whose first non-whitespace characters appear in the iterable of `comment_separators` strings passed in.

What's the difference between `lines_strip_comments` and `lines_filter_comment_lines`?

- `lines_filter_comment_lines` only recognizes lines that start with a comment separator (ignoring leading whitespace). Also, it filters out those lines completely, rather than modifying the line.
- `lines_strip_comments` handles comment characters anywhere in the line, although it can ignore comments inside quoted strings. It truncates the line but still always yields the line.

For more information, see the deep-dive on `lines` and lines modifier functions.
-
A lines modifier function. Only yields lines that contain `s`. (Filters out lines that don't contain `s`.)

If `invert` is true, returns the opposite--filters out lines that contain `s`.

For more information, see the deep-dive on `lines` and lines modifier functions.
-
A lines modifier function. Only yields lines that match the regular expression `pattern`. (Filters out lines that don't match `pattern`.) `pattern` can be `str`, `bytes`, or an `re.Pattern` object. If `pattern` is not an `re.Pattern` object, it's compiled with `re.compile(pattern, flags=flags)`.

If `invert` is true, returns the opposite--filters out lines that match `pattern`.

For more information, see the deep-dive on `lines` and lines modifier functions.

(In older versions of Python, `re.Pattern` was a private type called `re._pattern_type`.)
-
A lines modifier function. Strips trailing whitespace from the lines of a "lines iterator".
For more information, see the deep-dive on `lines` and lines modifier functions.
-
A lines modifier function. Sorts all input lines before yielding them.
Lines are sorted lexicographically, from lowest to highest. If `reverse` is true, lines are sorted from highest to lowest.

For more information, see the deep-dive on `lines` and lines modifier functions.
-
A lines modifier function. Strips leading and trailing whitespace from the lines of a "lines iterator".
If `lines_strip` removes leading whitespace from a line, it updates `LineInfo.column_number` with the new starting column number, and also adds a field to the `LineInfo` object:

- `leading` - the leading whitespace string that was removed

For more information, see the deep-dive on `lines` and lines modifier functions.
lines_strip_comments(li, comment_separators, *, quotes=('"', "'"), backslash='\\', rstrip=True, triple_quotes=True)
-
A lines modifier function. Strips comments from the lines of a "lines iterator". Comments are substrings that indicate the rest of the line should be ignored; `lines_strip_comments` truncates the line at the beginning of the leftmost comment separator.

If `rstrip` is true (the default), `lines_strip_comments` calls the `rstrip()` method on `line` after it truncates the line.

If `quotes` is true, it must be an iterable of quote characters. (Each quote character must be a single character.) `lines_strip_comments` will parse the line and ignore comment characters inside quoted strings. If `quotes` is false, quote characters are ignored and `lines_strip_comments` will truncate anywhere in the line.

`backslash` and `triple_quotes` are passed in to `split_quoted_string`, which is used internally to detect the quoted strings in the line.

Sets a new field on the associated `LineInfo` object for every line:

- `comment` - the comment stripped from the line, if any. If no comment was found, `comment` will be an empty string.

What's the difference between `lines_strip_comments` and `lines_filter_comment_lines`?

- `lines_filter_comment_lines` only recognizes lines that start with a comment separator (ignoring leading whitespace). Also, it filters out those lines completely, rather than modifying the line.
- `lines_strip_comments` handles comment characters anywhere in the line, although it can ignore comments inside quoted strings. It truncates the line but still always yields the line.

For more information, see the deep-dive on `lines` and lines modifier functions.
-
A lines modifier function. Automatically measures and strips indents.
Sets two new fields on the associated `LineInfo` object for every line:

- `indent` - an integer indicating how many indents it's observed
- `leading` - the leading whitespace string that was removed

Also updates `LineInfo.column_number` as needed.

Uses an intentionally simple algorithm. Only understands tab and space characters as indent characters. Internally detabs to spaces first for consistency, using the `tab_width` passed in to `lines`.

You can only dedent out to a previous indent. Raises `IndentationError` if there's an illegal dedent.

For more information, see the deep-dive on `lines` and lines modifier functions.
merge_columns(*columns, column_separator=" ", overflow_response=OverflowResponse.RAISE, overflow_before=0, overflow_after=0)
-
Merge an arbitrary number of separate text strings into columns. Returns a single formatted string.
`columns` should be an iterable of "column tuples". Each column tuple should contain three items:

    (text, min_width, max_width)

`text` should be a single string, either `str` or `bytes`, with newline characters separating lines. `min_width` and `max_width` are the minimum and maximum permissible widths for that column, not including the column separator (if any).

Note that this function does not text-wrap the text of the columns. The text in the columns should already be broken into lines and separated by newline characters. (Lines that are longer than that column tuple's `max_width` are handled with the `overflow_response`, below.)

`column_separator` is printed between every column.

`overflow_response` tells merge_columns how to handle a column with one or more lines that are wider than that column's `max_width`. The supported values are:

- `OverflowResponse.RAISE`: Raise an OverflowError. The default.
- `OverflowResponse.INTRUDE_ALL`: Intrude into all subsequent columns on all lines where the overflowed column is wider than its `max_width`.
- `OverflowResponse.DELAY_ALL`: Delay all columns after the overflowed column, not beginning any until after the last overflowed line in the overflowed column.

When `overflow_response` is `INTRUDE_ALL` or `DELAY_ALL`, and either `overflow_before` or `overflow_after` is nonzero, these specify the number of extra lines before or after the overflowed lines in a column.

For more information, see the deep-dive on Word wrapping and formatting.
-
Like
str.partition, but supports partitioning based on multiple separator strings, and can partition more than once.scan be eitherstrorbytes.separatorsshould be an iterable of objects of the same type ass.By default, if any of the strings in
separatorsare found ins, returns a tuple of three strings: the portion ofsleading up to the earliest separator, the separator, and the portion ofsafter that separator. Example:multipartition('aXbYz', ('X', 'Y')) => ('a', 'X', 'bYz')If none of the separators are found in the string, returns a tuple containing
sunchanged followed by two empty strings.multipartitionis greedy: if two or more separators appear at the leftmost location ins,multipartitionpartitions using the longest matching separator. For example:multipartition('wxabcyz', ('a', 'abc')) => `('wx', 'abc', 'yz')`Passing in an explicit
countlets you control how many timesmultipartitionpartitions the string.multipartitionwill always return a tuple containing(2*count)+1elements. Passing in acountof 0 will always return a tuple containings.If
separateis true, multiple adjacent separator strings behave like one separator. Example:big.text.multipartition('aXYbYXc', ('X', 'Y',), count=2, separate=False) => ('a', 'XY', 'b', 'YX', 'c') big.text.multipartition('aXYbYXc', ('X', 'Y',), count=2, separate=True ) => ('a', 'X', '', 'Y', 'bYXc')If
reverseis true, multipartition behaves likestr.rpartition. It partitions starting on the right, scanning backwards through s looking for separators.For more information, see the deep-dive on The
multi-family of string functions.
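The greedy, leftmost-match rule can be approximated with a regular expression. The sketch below is only an illustration of that rule, not how big implements it: `multipartition_sketch` is a hypothetical name, it handles `str` only, partitions exactly once, and assumes every separator is non-empty.

```python
import re


def multipartition_sketch(s, separators):
    """Single greedy partition on several separators; not big's code."""
    # Try longer separators first, so the alternation prefers them
    # when several separators match at the same position.
    ordered = sorted(separators, key=len, reverse=True)
    pattern = "|".join(re.escape(sep) for sep in ordered)
    match = re.search(pattern, s)
    if match is None:
        return (s, "", "")
    return (s[:match.start()], match.group(0), s[match.end():])


print(multipartition_sketch("aXbYz", ("X", "Y")))
print(multipartition_sketch("wxabcyz", ("a", "abc")))
```

`re.search` finds the leftmost position where any alternative matches, and at that position it tries the alternatives in order, which is why sorting by length yields the longest separator.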
multisplit(s, separators=None, *, keep=False, maxsplit=-1, reverse=False, separate=False, strip=False)
-
Splits strings like
str.split, but with multiple separators and options.scan bestrorbytes.separatorsshould either beNone(the default), or an iterable ofstrorbytes, matchings.If
separatorsisNoneandsisstr,multisplitwill usebig.whitespaceasseparators. IfseparatorsisNoneandsisbytes,multisplitwill usebig.ascii_whitespaceasseparators.Returns an iterator yielding the strings split from
s. Ifkeepis true (orALTERNATING), andstripis false, joining these strings together will recreates.multisplitis greedy: if two or more separators start at the same location ins,multisplitsplits using the longest matching separator. For example:big.multisplit('wxabcyz', ('a', 'abc'))
yields
'wx'then'yz'.keepindicates whether or not multisplit should preserve the separator strings in the strings it yields. It supports four values:-
false (the default)
-
Discard the separators.
-
true (apart from
ALTERNATINGandAS_PAIRS) -
Append the separators to the end of the split strings. You can recreate the original string by using
"".jointo join the strings yielded bymultisplit. -
ALTERNATING -
Yield alternating strings in the output: strings consisting of separators, alternating with strings consisting of non-separators. If "separate" is true, separator strings will contain exactly one separator, and non-separator strings may be empty; if "separate" is false, separator strings will contain one or more separators, and non-separator strings will never be empty, unless "s" was empty. You can recreate the original string by using
"".jointo join the strings yielded bymultisplit. -
AS_PAIRS -
Yield 2-tuples containing a non-separator string and its subsequent separator string. Either string may be empty; the separator string in the last 2-tuple will always be empty, and if "s" ends with a separator string, both strings in the final 2-tuple will be empty.
`separate` indicates whether multisplit should consider adjacent separator strings in `s` as one separator or as multiple separators each separated by a zero-length string. It supports two values:

- false (the default): Group separators together. Multiple adjacent separators behave as if they're one big separator.
- true: Don't group separators together. Each separator should split the string individually, even if there are no characters between two separators. (`multisplit` will behave as if there's a zero-character-wide string between adjacent separators.)

stripindicates whether multisplit should strip separators from the beginning and/or end ofs. It supports five values:-
false (the default)
- Don't strip separators from the beginning or end of "s".
-
true (apart from LEFT, RIGHT, and PROGRESSIVE)
- Strip separators from the beginning and end of "s" (similarly to `str.strip`).
-
LEFT - Strip separators only from the beginning of "s" (similarly to `str.lstrip`).
-
RIGHT - Strip separators only from the end of "s" (similarly to `str.rstrip`).
-
PROGRESSIVE - Strip from the beginning and end of "s", unless "maxsplit" is nonzero and the entire string is not split. If splitting stops due to "maxsplit" before the entire string is split, and "reverse" is false, don't strip the end of the string. If splitting stops due to "maxsplit" before the entire string is split, and "reverse" is true, don't strip the beginning of the string. (This is how `str.strip` and `str.rstrip` behave when you pass in `sep=None`.)
maxsplitshould be either an integer orNone. Ifmaxsplitis an integer greater than -1, multisplit will splittextno more thanmaxsplittimes.reversechanges wheremultisplitstarts splitting the string, and what direction it moves through the string when parsing.-
false (the default)
- Start splitting from the beginning of the string and parse moving right (towards the end).
-
true
- Start splitting from the end of the string and parse moving left (towards the beginning).
Splitting starting from the end of the string and parsing moving left has two effects. First, if
maxsplitis a number greater than 0, the splits will start at the end of the string rather than the beginning. Second, if there are overlapping instances of separators in the string,multisplitwill prefer the rightmost separator rather than the leftmost. Consider this example, wherereverseis false:multisplit("A x x Z", (" x ",), keep=big.ALTERNATING) => "A", " x ", "x Z"If you pass in a true value for
reverse,multisplitwill prefer the rightmost overlapping separator:multisplit("A x x Z", (" x ",), keep=big.ALTERNATING, reverse=True) => "A x", " x ", "Z"For more information, see the deep-dive on The
multi-family of string functions. -
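The greedy, longest-match rule described above for `multisplit` can be approximated with the standard library. The sketch below is not big's implementation; `split_longest_first` is a hypothetical helper name, and it only covers the default case where separators are discarded. It relies on the fact that `re` tries alternatives in order, so listing longer separators first gives the longest match at each position.

```python
import re

def split_longest_first(s, separators):
    """Split s on any separator, preferring the longest match at each position."""
    # Regex alternation picks the first alternative that matches,
    # so sort the separators longest-first before joining them.
    ordered = sorted(separators, key=len, reverse=True)
    pattern = "|".join(re.escape(sep) for sep in ordered)
    return re.split(pattern, s)

greedy = split_longest_first('wxabcyz', ('a', 'abc'))
print(greedy)   # ['wx', 'yz']

# For contrast: with the shorter separator listed first, 'a' wins and 'bc' is left behind.
naive = re.split('a|abc', 'wxabcyz')
print(naive)    # ['wx', 'bcyz']
```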
-
Like `str.strip`, but supports stripping multiple substrings from `s`.

Strips from the string `s` all leading and trailing instances of strings found in `separators`.

`s` should be `str` or `bytes`.

`separators` should be an iterable of either `str` or `bytes` objects matching the type of `s`.

If `left` is a true value, strips all leading separators from `s`.

If `right` is a true value, strips all trailing separators from `s`.

Processing always stops at the first character that doesn't match one of the separators.

Returns a copy of `s` with the leading and/or trailing separators stripped. (If `left` and `right` are both false, returns `s` unchanged.)

For more information, see the deep-dive on The `multi-` family of string functions.
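To make the stripping rule above concrete, here is a minimal standard-library sketch of the same idea. It is an illustration only, not big's code: `strip_any` is a hypothetical name, and trying longer separators first is an assumption of this sketch rather than documented behavior of `multistrip`.

```python
def strip_any(s, separators, left=True, right=True):
    """Strip leading and/or trailing runs of any of the given separators."""
    # Ignore empty separators; try longer ones first (an assumption of this sketch).
    seps = tuple(sorted((sep for sep in separators if sep), key=len, reverse=True))
    if not seps:
        return s
    if left:
        # str.startswith accepts a tuple of candidate prefixes.
        while s.startswith(seps):
            sep = next(sep for sep in seps if s.startswith(sep))
            s = s[len(sep):]
    if right:
        while s.endswith(seps):
            sep = next(sep for sep in seps if s.endswith(sep))
            s = s[:-len(sep)]
    return s

print(strip_any('XYappleYX', ('X', 'Y')))               # apple
print(strip_any('XYappleYX', ('X', 'Y'), right=False))  # appleYX
```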
-
Returns `s`, but with every run of consecutive separator characters turned into a replacement string. By default turns all runs of consecutive whitespace characters into a single space character.

`s` may be `str` or `bytes`.

`separators` should be an iterable of either `str` or `bytes` objects, matching `s`.

`replacement` should be either a `str` or `bytes` object, also matching `s`, or `None` (the default). If `replacement` is `None`, `normalize_whitespace` will use a replacement string consisting of a single space character.

Leading or trailing runs of separator characters will be replaced with the replacement string, e.g.:

    normalize_whitespace("   a    b   c") == " a b c"
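For the default whitespace case, the behavior described above resembles a single `re.sub` call. This is a rough standard-library sketch, not big's implementation; `collapse_whitespace` is a hypothetical name, and the set of characters matched by `\s` may not be identical to `big.whitespace`.

```python
import re

def collapse_whitespace(s, replacement=" "):
    """Replace every run of whitespace in s with the replacement string."""
    return re.sub(r"\s+", replacement, s)

print(repr(collapse_whitespace("   a    b   c")))   # ' a b c'
```

Note that the leading run is replaced rather than stripped, which is why the result starts with a space.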
-
Parses a string containing nesting delimiters. Raises an exception if mismatched delimiters are detected.

`s` may be `str` or `bytes`.

`delimiters` may be either `None` or an iterable containing either `Delimiter` objects or objects matching `s` (`str` or `bytes`). Entries in the `delimiters` iterable which are `str` or `bytes` should be exactly two characters long; these will be used as the `open` and `close` arguments for a new `Delimiter` object.

If `delimiters` is `None`, `parse_delimiters` uses a default value matching these pairs of delimiters:

    () [] {} "" ''

The quote mark delimiters enable backslash quoting and disable nesting.

Yields 3-tuples containing strings:

    (text, open, close)

where `text` is the text before the next opening or closing delimiter, `open` is the trailing opening delimiter, and `close` is the trailing closing delimiter. At least one of these three strings will always be non-empty. If `open` is non-empty, `close` will be empty, and vice-versa. If `s` does not end with a closing delimiter, in the final tuple yielded, both `open` and `close` will be empty strings.

(Concatenating every string yielded by `parse_delimiters` together produces a new string identical to `s`.)

You can only specify a particular character as an opening delimiter once, though you may reuse a particular character as a closing delimiter multiple times.
-
Like `str.partition`, but `pattern` is matched as a regular expression.

`text` can be a string or a bytes object.

`pattern` can be a string, bytes, or `re.Pattern` object.

`text` and `pattern` (or `pattern.pattern`) must be the same type.

If `pattern` is found in `text`, returns a tuple

    (before, match, after)

where `before` is the text before the matched text, `match` is the `re.Match` object resulting from the match, and `after` is the text after the matched text.

If `pattern` appears in `text` multiple times, `re_partition` will match against the first (leftmost) appearance.

If `pattern` is not found in `text`, returns a tuple

    (text, None, '')

where the empty string is `str` or `bytes` as appropriate.

Passing in an explicit `count` lets you control how many times `re_partition` partitions the string. `re_partition` will always return a tuple containing `(2*count)+1` elements, and odd-numbered elements will be either `re.Match` objects or `None`. Passing in a `count` of 0 will always return a tuple containing `text`.

If `pattern` is a string or bytes object, `flags` is passed in as the `flags` argument to `re.compile`.

If `reverse` is true, partitions starting at the right, like `re_rpartition`.

(In older versions of Python, `re.Pattern` was a private type called `re._pattern_type`.)
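The single-partition case described above can be sketched with the standard library alone. This is a minimal illustration, not big's implementation: `re_partition_sketch` is a hypothetical name, and it ignores the `count` and `reverse` parameters entirely.

```python
import re

def re_partition_sketch(text, pattern, flags=0):
    """Partition text around the leftmost match of pattern."""
    # re.compile returns an already-compiled pattern unchanged when flags is 0.
    match = re.compile(pattern, flags).search(text)
    if match is None:
        # text[:0] is an empty str or bytes, matching the type of text.
        return (text, None, text[:0])
    return (text[:match.start()], match, text[match.end():])

before, match, after = re_partition_sketch('key = value', r'\s*=\s*')
print(before, repr(match.group(0)), after)   # key ' = ' value
```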
-
Like `str.rpartition`, but `pattern` is matched as a regular expression.

`text` can be a `str` or `bytes` object.

`pattern` can be a `str`, `bytes`, or `re.Pattern` object.

`text` and `pattern` (or `pattern.pattern`) must be the same type.

If `pattern` is found in `text`, returns a tuple

    (before, match, after)

where `before` is the text before the matched text, `match` is the `re.Match` object resulting from the match, and `after` is the text after the matched text.

If `pattern` appears in `text` multiple times, `re_rpartition` will match against the last (rightmost) appearance.

If `pattern` is not found in `text`, returns a tuple

    ('', None, text)

where the empty string is `str` or `bytes` as appropriate.

Passing in an explicit `count` lets you control how many times `re_rpartition` partitions the string. `re_rpartition` will always return a tuple containing `(2*count)+1` elements, and odd-numbered elements will be either `re.Match` objects or `None`. Passing in a `count` of 0 will always return a tuple containing `text`.

If `pattern` is a string or bytes object, `flags` is passed in as the `flags` argument to `re.compile`.

(In older versions of Python, `re.Pattern` was a private type called `re._pattern_type`.)
-
An iterator. Behaves almost identically to the Python standard library function `re.finditer`, yielding non-overlapping matches of `pattern` in `string`. The difference is, `reversed_re_finditer` searches `string` from right to left.

`pattern` can be `str`, `bytes`, or a precompiled `re.Pattern` object. If it's `str` or `bytes`, it'll be compiled with `re.compile` using the `flags` you passed in.

`string` should be the same type as `pattern` (or `pattern.pattern`).
-
Splits `s` into quoted and unquoted segments.

`s` can be either `str` or `bytes`.

`quotes` is an iterable of quote separators, either `str` or `bytes` matching `s`. Note that `split_quoted_strings` only supports quote characters, as in, each quote separator must be exactly one character long.

Returns an iterator yielding 2-tuples:

    (is_quoted, segment)

where `segment` is a substring of `s`, and `is_quoted` is true if the segment is quoted. Joining all the segments together recreates `s`.

If `triple_quotes` is true, supports "triple-quoted" strings like Python.

If `backslash` is a character, this character will quote characters inside a quoted string, like the backslash character inside strings in Python.
split_text_with_code(s, *, tab_width=8, allow_code=True, code_indent=4, convert_tabs_to_spaces=True)
-
Splits `s` into individual words, suitable for feeding into `wrap_words`.

`s` may be either `str` or `bytes`.

Paragraphs indented by less than `code_indent` will be broken up into individual words.

If `allow_code` is true, paragraphs indented by at least `code_indent` spaces will preserve their whitespace: internal whitespace is preserved, and the newline is preserved. (This will preserve the formatting of code examples when these words are rejoined into lines by `wrap_words`.)

For more information, see the deep-dive on Word wrapping and formatting.
-
A tuple of `str` objects, representing every line-breaking whitespace character recognized by the Python `str` object. Identical to `linebreaks`.

Useful as a `separators` argument for big functions that accept one, e.g. the big "multi-" family of functions, or the `lines` and lines modifier functions.

Also contains `'\r\n'`. See the deep-dive section on The Unix, Mac, and DOS line-break conventions for more.

For more information, please see the Whitespace and line-breaking characters in Python and big deep-dive.
-
Equivalent to `str_linebreaks` without `'\r\n'`.
-
A tuple of `str` objects, representing every whitespace character recognized by the Python `str` object. Identical to `whitespace`.

Useful as a `separators` argument for big functions that accept one, e.g. the big "multi-" family of functions, or the `lines` and lines modifier functions.

Also contains `'\r\n'`. See the deep-dive section on The Unix, Mac, and DOS line-break conventions for more.

For more information, please see the Whitespace and line-breaking characters in Python and big deep-dive.
-
Equivalent to `str_whitespace` without `'\r\n'`.
-
A tuple of `str` objects, representing every line-breaking whitespace character defined by Unicode.

Useful as a `separators` argument for big functions that accept one, e.g. the big "multi-" family of functions, or the `lines` and lines modifier functions.

Also contains `'\r\n'`. See the deep-dive section on The Unix, Mac, and DOS line-break conventions for more.

For more information, please see the Whitespace and line-breaking characters in Python and big deep-dive.
-
Equivalent to `unicode_linebreaks` without `'\r\n'`.
-
A tuple of `str` objects, representing every whitespace character defined by Unicode.

Useful as a `separators` argument for big functions that accept one, e.g. the big "multi-" family of functions, or the `lines` and lines modifier functions.

Also contains `'\r\n'`. See the deep-dive section on The Unix, Mac, and DOS line-break conventions for more.

For more information, please see the Whitespace and line-breaking characters in Python and big deep-dive.
-
Equivalent to `unicode_whitespace` without `'\r\n'`.
-
A tuple of `str` objects, representing every whitespace character recognized by the Python `str` object. Identical to `str_whitespace`.

Useful as a `separators` argument for big functions that accept one, e.g. the big "multi-" family of functions, or the `lines` and lines modifier functions.

Also contains `'\r\n'`. See the deep-dive section on The Unix, Mac, and DOS line-break conventions for more.

For more information, please see the Whitespace and line-breaking characters in Python and big deep-dive.
-
Equivalent to `whitespace` without `'\r\n'`.
-
Combines `words` into lines and returns the result as a string. Similar to `textwrap.wrap`.

`words` should be an iterator yielding `str` or `bytes` strings, and these strings should already be split at word boundaries. Here's an example of a valid argument for `words`:

    "this is an example of text split at word boundaries".split()

A single `'\n'` indicates a line break. If you want a paragraph break, embed two `'\n'` characters in a row.

`margin` specifies the maximum length of each line. The length of every line will be less than or equal to `margin`, unless the length of an individual element inside `words` is greater than `margin`.

If `two_spaces` is true, elements from `words` that end in sentence-ending punctuation (`'.'`, `'?'`, and `'!'`) will be followed by two spaces, not one.

Elements in `words` are not modified; any leading or trailing whitespace will be preserved. You can use this to preserve whitespace where necessary, like in code examples.

For more information, see the deep-dive on Word wrapping and formatting.
Functions for working with time. Currently deals specifically with timestamps. The time functions in big are designed to make it easy to use best practices.
-
Ensures that a `datetime.date` object has a timezone set.

If `d` has a timezone set, returns `d`. Otherwise, returns a new `datetime.date` object equivalent to `d` with its `tzinfo` set to `timezone`.
-
Returns a new `datetime.date` object identical to `d` but with its `tzinfo` set to `timezone`.
-
Ensures that a `datetime.datetime` object has a timezone set.

If `d` has a timezone set, returns `d`. Otherwise, creates a new `datetime.datetime` object equivalent to `d` with its `tzinfo` set to `timezone`.
-
Returns a new `datetime.datetime` object identical to `d` but with its `tzinfo` set to `timezone`.
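The "ensure" and "set" behaviors described above for `datetime.datetime` objects map closely onto `datetime.replace`. The following is a minimal sketch of that idea using only the standard library; the function names are hypothetical and this is not big's code.

```python
import datetime

def set_timezone_sketch(d, timezone):
    """Return a copy of d with tzinfo replaced (no clock conversion happens)."""
    return d.replace(tzinfo=timezone)

def ensure_timezone_sketch(d, timezone):
    """Return d unchanged if it already has a tzinfo, else attach timezone."""
    if d.tzinfo is not None:
        return d
    return set_timezone_sketch(d, timezone)

naive = datetime.datetime(2021, 5, 25, 6, 46, 35)
aware = ensure_timezone_sketch(naive, datetime.timezone.utc)
print(aware.isoformat())   # 2021-05-25T06:46:35+00:00
```

Attaching a `tzinfo` this way labels the existing wall-clock time; it does not shift the time the way `astimezone` would.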
-
Parses a timestamp string returned by `timestamp_3339Z`. Returns a `datetime.datetime` object.

`timezone` is an optional default timezone, and should be a `datetime.tzinfo` object (or `None`). If provided, and the time represented in the string doesn't specify a timezone, the `tzinfo` attribute of the returned object will be explicitly set to `timezone`.

`parse_timestamp_3339Z` depends on the `python-dateutil` package. If `python-dateutil` is unavailable, `parse_timestamp_3339Z` will also be unavailable.
-
Return a timestamp string in RFC 3339 format, in the UTC time zone. This format is intended for computer-parsable timestamps; for human-readable timestamps, use `timestamp_human()`.

Example timestamp: `'2021-05-25T06:46:35.425327Z'`

`t` may be one of several types:

- If `t` is `None`, `timestamp_3339Z` uses the current time in UTC.
- If `t` is an int or a float, it's interpreted as seconds since the epoch in the UTC time zone.
- If `t` is a `time.struct_time` object or `datetime.datetime` object, and it's not in UTC, it's converted to UTC. (Technically, `time.struct_time` objects are converted to GMT, using `time.gmtime`. Sorry, pedants!)

If `want_microseconds` is true, the timestamp ends with microseconds, represented as a period and six digits between the seconds and the `'Z'`. If `want_microseconds` is false, the timestamp will not include this text. If `want_microseconds` is `None` (the default), the timestamp ends with microseconds if the type of `t` can represent fractional seconds: a float, a `datetime` object, or the value `None`.
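As an illustration of the format described above, here is a small standard-library sketch that produces the same shape of string from a timezone-aware `datetime`. It is not big's implementation: `utc_timestamp_sketch` is a hypothetical name, and it handles only aware `datetime.datetime` inputs (a naive input would be treated as local time by `astimezone`).

```python
import datetime

def utc_timestamp_sketch(t, want_microseconds=True):
    """Format an aware datetime as an RFC 3339 style UTC timestamp ending in 'Z'."""
    t = t.astimezone(datetime.timezone.utc)
    if want_microseconds:
        return t.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    return t.strftime("%Y-%m-%dT%H:%M:%SZ")

# 23:46 at UTC-7 is 06:46 the next day in UTC.
pdt = datetime.timezone(datetime.timedelta(hours=-7))
t = datetime.datetime(2021, 5, 24, 23, 46, 35, 425327, tzinfo=pdt)
print(utc_timestamp_sketch(t))                           # 2021-05-25T06:46:35.425327Z
print(utc_timestamp_sketch(t, want_microseconds=False))  # 2021-05-25T06:46:35Z
```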
-
Return a timestamp string formatted in a pleasing way using the currently-set local timezone. This format is intended for human readability; for computer-parsable time, use `timestamp_3339Z()`.

Example timestamp: `"2021/05/24 23:42:49.099437"`

`t` can be one of several types:

- If `t` is `None`, `timestamp_human` uses the current local time.
- If `t` is an int or float, it's interpreted as seconds since the epoch.
- If `t` is a `time.struct_time` or `datetime.datetime` object, it's converted to the local timezone.

If `want_microseconds` is true, the timestamp will end with the microseconds, represented as ".######". If `want_microseconds` is false, the timestamp will not include the microseconds.

If `want_microseconds` is `None` (the default), the timestamp ends with microseconds if the type of `t` can represent fractional seconds: a float, a `datetime` object, or the value `None`.
-
This family of string functions was inspired by Python's `str.split`, `str.rsplit`, and `str.splitlines` methods. These string splitting methods are well-designed and often do what you want. But they're surprisingly narrow and opinionated. What if your use case doesn't map neatly to one of these functions? `str.split` supports two very specific modes of operation--unless you want to split your string in exactly one of those two modes, you probably can't use `str.split` to solve your problem.

So what can you use? There's `re.split`, but that can be hard to use.<sup>1</sup> Now there's a new answer: `multisplit`.

`multisplit`'s goal is to be the be-all end-all string splitting function. It's designed to replace every mode of operation provided by `str.split`, `str.rsplit`, and `str.splitlines`, and it can even replace `str.partition` and `str.rpartition`. (big uses `multisplit` to implement `multipartition`.) `multisplit` can do it all!

The downside of `multisplit`'s awesome flexibility is that, since it is so sophisticated and tunable, it can be hard to use. It takes five keyword-only parameters after all. However, they're designed to be reasonably memorable, and their default values are designed to be easy to remember. The best way to cope with `multisplit`'s complexity is to use it as a building block for your own text splitting functions. For example, big uses `multisplit` to implement `multipartition`, `normalize_whitespace`, `lines`, and several other functions.

To use `multisplit`, pass in the string you want to split, the separators you want to split on, and tweak its behavior with its five keyword arguments. It returns an iterator that yields string segments from the original string in your preferred format.

The cornerstone of `multisplit` is the `separators` argument. This is an iterable of strings, of the same type (`str` or `bytes`) as the string you want to split (`s`). `multisplit` will split the string at each non-overlapping instance of any string specified in `separators`.

`multisplit` also lets you fine-tune how it splits, through five keyword-only parameters:

- `keep` lets you include the separator strings in the output, in a number of different formats.
- `separate` lets you specify whether adjacent separator strings should be grouped together (like `str.split` operating on whitespace) or regarded as separate (like `str.split` when you pass in an explicit `sep` separator).
- `strip` lets you strip separator strings from the beginning, end, or both ends of the string you're splitting. It also supports a special progressive mode that duplicates the behavior of `str.split` when you use `None` as the separator.
- `maxsplit` lets you specify the maximum number of times to split the string, exactly like the `maxsplit` argument to `str.split`.
- `reverse` lets you apply `maxsplit` to the end of the string and split backwards, exactly like `str.rsplit`.

To make it slightly easier to remember, all these keyword-only parameters default to a false value. (Well, technically, `maxsplit` defaults to the special value `-1`, for compatibility with `str.split`. But this is its special "don't do anything" magic value. All the other keyword-only parameters default to `False`.)

`multisplit` also inspired `multistrip` and `multipartition`, which also take this same `separators` argument. There are also other big functions that take a `separators` argument; for consistency's sake, the parameter name always has the word `separators` in it. (For example, `comment_separators` for `lines_filter_comment_lines`.)

To give you a sense of how the five keyword-only parameters change the behavior of `multisplit`, here's a breakdown of each of these parameters with examples.

`maxsplit` specifies the maximum number of times the string should be split. It behaves the same as the `maxsplit` parameter to `str.split`.

The default value of `-1` means "split as many times as you can". In our example here, the string can be split a maximum of two times. Therefore, specifying a `maxsplit` of `-1` is equivalent to specifying a `maxsplit` of `2` or greater:

    >>> list(big.multisplit('appleXbananaYcookie', ('X', 'Y'))) # "maxsplit" defaults to -1
    ['apple', 'banana', 'cookie']
    >>> list(big.multisplit('appleXbananaYcookie', ('X', 'Y'), maxsplit=0))
    ['appleXbananaYcookie']
    >>> list(big.multisplit('appleXbananaYcookie', ('X', 'Y'), maxsplit=1))
    ['apple', 'bananaYcookie']
    >>> list(big.multisplit('appleXbananaYcookie', ('X', 'Y'), maxsplit=2))
    ['apple', 'banana', 'cookie']
    >>> list(big.multisplit('appleXbananaYcookie', ('X', 'Y'), maxsplit=3))
    ['apple', 'banana', 'cookie']
`maxsplit` has interactions with `reverse` and `strip`. For more information, see the documentation regarding those parameters, below.

`keep` indicates whether or not `multisplit` should preserve the separator strings in the strings it yields. It supports four values: false, true, and the special values `ALTERNATING` and `AS_PAIRS`.

    >>> list(big.multisplit('appleXbananaYcookie', ('X', 'Y'))) # "keep" defaults to False
    ['apple', 'banana', 'cookie']
    >>> list(big.multisplit('appleXbananaYcookie', ('X', 'Y'), keep=False))
    ['apple', 'banana', 'cookie']
When `keep` is true, `multisplit` keeps the separators, appending them to the end of the separated string:

    >>> list(big.multisplit('appleXbananaYcookie', ('X', 'Y'), keep=True))
    ['appleX', 'bananaY', 'cookie']

When `keep` is `ALTERNATING`, `multisplit` keeps the separators as separate strings. The first string yielded is always a non-separator string, and from then on it always alternates between a separator string and a non-separator string. Put another way, if you store the output of `multisplit` in a list, entries with an even-numbered index (0, 2, 4, ...) are always non-separator strings, and entries with an odd-numbered index (1, 3, 5, ...) are always separator strings.

    >>> list(big.multisplit('appleXbananaYcookie', ('X', 'Y'), keep=big.ALTERNATING))
    ['apple', 'X', 'banana', 'Y', 'cookie']

Note that `ALTERNATING` always emits an odd number of strings; the first and last strings are always non-separator strings. Like `str.split`, if the string you're splitting starts or ends with a separator string, `multisplit` will emit an empty string there:

    >>> list(big.multisplit('1a1z1', ('1',), keep=big.ALTERNATING))
    ['', '1', 'a', '1', 'z', '1', '']
Finally, when `keep` is `AS_PAIRS`, `multisplit` keeps the separators as separate strings. But instead of yielding strings, it yields 2-tuples of strings. Every 2-tuple contains a non-separator string followed by a separator string.

If the original string starts with a separator, the first 2-tuple will contain an empty non-separator string and the separator:

    >>> list(big.multisplit('YappleXbananaYcookie', ('X', 'Y'), keep=big.AS_PAIRS))
    [('', 'Y'), ('apple', 'X'), ('banana', 'Y'), ('cookie', '')]

The last 2-tuple will always contain an empty separator string:

    >>> list(big.multisplit('appleXbananaYcookie', ('X', 'Y'), keep=big.AS_PAIRS))
    [('apple', 'X'), ('banana', 'Y'), ('cookie', '')]
    >>> list(big.multisplit('appleXbananaYcookieXXX', ('X', 'Y'), keep=big.AS_PAIRS, strip=True))
    [('apple', 'X'), ('banana', 'Y'), ('cookie', '')]

(This rule means that `AS_PAIRS` always emits an even number of strings. Contrast that with `ALTERNATING`, which always emits an odd number of strings, and the last string it emits is always a non-separator string. Put another way: if you ignore the tuples, the list of strings emitted by `AS_PAIRS` is the same as those emitted by `ALTERNATING`, except `AS_PAIRS` appends an empty string.)

Because of this rule, if the original string ends with a separator, and `multisplit` doesn't `strip` the right side, the final tuple emitted by `AS_PAIRS` will be a 2-tuple containing two empty strings:

    >>> list(big.multisplit('appleXbananaYcookieX', ('X', 'Y'), keep=big.AS_PAIRS))
    [('apple', 'X'), ('banana', 'Y'), ('cookie', 'X'), ('', '')]
This looks strange and unnecessary. But it is what you want. This odd-looking behavior is discussed at length in the section below, titled Why do you sometimes get empty strings when you split?
The behavior of `keep` can be affected by the value of `separate`. For more information, see the next section, on `separate`.

`separate` indicates whether `multisplit` should consider adjacent separator strings in `s` as one separator or as multiple separators each separated by a zero-length string. It can be either false or true.

    >>> list(big.multisplit('appleXYbananaYXYcookie', ('X', 'Y'))) # separate defaults to False
    ['apple', 'banana', 'cookie']
    >>> list(big.multisplit('appleXYbananaYXYcookie', ('X', 'Y'), separate=False))
    ['apple', 'banana', 'cookie']
    >>> list(big.multisplit('appleXYbananaYXYcookie', ('X', 'Y'), separate=True))
    ['apple', '', 'banana', '', '', 'cookie']

If `separate` and `keep` are both true values, and your string has multiple adjacent separators, `multisplit` will view `s` as having zero-length non-separator strings between the adjacent separators:

    >>> list(big.multisplit('appleXYbananaYXYcookie', ('X', 'Y'), separate=True, keep=True))
    ['appleX', 'Y', 'bananaY', 'X', 'Y', 'cookie']
    >>> list(big.multisplit('appleXYbananaYXYcookie', ('X', 'Y'), separate=True, keep=big.AS_PAIRS))
    [('apple', 'X'), ('', 'Y'), ('banana', 'Y'), ('', 'X'), ('', 'Y'), ('cookie', '')]
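If it helps to relate `separate` to something familiar, the grouped and ungrouped outputs shown above (with separators discarded) match what the standard library's `re.split` produces with and without a `+` quantifier. This is only an analogy for single-character separators, not a description of how `multisplit` is implemented.

```python
import re

s = 'appleXYbananaYXYcookie'

# Like separate=False: a run of adjacent separators counts as one split point.
grouped = re.split('[XY]+', s)
print(grouped)     # ['apple', 'banana', 'cookie']

# Like separate=True: every separator splits, leaving empty strings between neighbors.
ungrouped = re.split('[XY]', s)
print(ungrouped)   # ['apple', '', 'banana', '', '', 'cookie']
```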
`strip` indicates whether `multisplit` should strip separators from the beginning and/or end of `s`. It supports five values: false, true, `big.LEFT`, `big.RIGHT`, and `big.PROGRESSIVE`.

By default, `strip` is false, which means it doesn't strip any leading or trailing separators:

    >>> list(big.multisplit('XYappleXbananaYcookieYXY', ('X', 'Y'))) # strip defaults to False
    ['', 'apple', 'banana', 'cookie', '']

Setting `strip` to true strips both leading and trailing separators:

    >>> list(big.multisplit('XYappleXbananaYcookieYXY', ('X', 'Y'), strip=True))
    ['apple', 'banana', 'cookie']

`big.LEFT` and `big.RIGHT` tell `multisplit` to only strip on that side of the string:

    >>> list(big.multisplit('XYappleXbananaYcookieYXY', ('X', 'Y'), strip=big.LEFT))
    ['apple', 'banana', 'cookie', '']
    >>> list(big.multisplit('XYappleXbananaYcookieYXY', ('X', 'Y'), strip=big.RIGHT))
    ['', 'apple', 'banana', 'cookie']

`big.PROGRESSIVE` duplicates a specific behavior of `str.split` when using `maxsplit`. It always strips on the left, but it only strips on the right if the string is completely split. If `maxsplit` is reached before the entire string is split, and `strip` is `big.PROGRESSIVE`, `multisplit` won't strip the right side of the string. Note in this example how the trailing separator `Y` isn't stripped from the input string when `maxsplit` is less than `3`.

    >>> list(big.multisplit('XappleXbananaYcookieY', ('X', 'Y'), strip=big.PROGRESSIVE))
    ['apple', 'banana', 'cookie']
    >>> list(big.multisplit('XappleXbananaYcookieY', ('X', 'Y'), maxsplit=0, strip=big.PROGRESSIVE))
    ['appleXbananaYcookieY']
    >>> list(big.multisplit('XappleXbananaYcookieY', ('X', 'Y'), maxsplit=1, strip=big.PROGRESSIVE))
    ['apple', 'bananaYcookieY']
    >>> list(big.multisplit('XappleXbananaYcookieY', ('X', 'Y'), maxsplit=2, strip=big.PROGRESSIVE))
    ['apple', 'banana', 'cookieY']
    >>> list(big.multisplit('XappleXbananaYcookieY', ('X', 'Y'), maxsplit=3, strip=big.PROGRESSIVE))
    ['apple', 'banana', 'cookie']
    >>> list(big.multisplit('XappleXbananaYcookieY', ('X', 'Y'), maxsplit=4, strip=big.PROGRESSIVE))
    ['apple', 'banana', 'cookie']
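The `str.split` behavior that `big.PROGRESSIVE` mirrors is easy to observe directly in plain Python. The snippet below uses only built-in string methods (no big calls) and shows the trailing whitespace surviving until the string is completely split.

```python
s = ' apple banana cookie '

# Leading whitespace is always dropped; trailing whitespace survives
# until maxsplit is large enough to split the whole string.
print(s.split(None, 0))    # ['apple banana cookie ']
print(s.split(None, 1))    # ['apple', 'banana cookie ']
print(s.split(None, 2))    # ['apple', 'banana', 'cookie ']
print(s.split(None, 3))    # ['apple', 'banana', 'cookie']

# rsplit is the mirror image: leading whitespace is what survives.
print(s.rsplit(None, 1))   # [' apple banana', 'cookie']
```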
`reverse` specifies where `multisplit` starts parsing the string--from the beginning, or the end--and in what direction it moves when parsing the string--towards the end, or towards the beginning. It only supports two values: when it's false, `multisplit` starts at the beginning of the string, and parses moving to the right (towards the end of the string). But when `reverse` is true, `multisplit` starts at the end of the string, and parses moving to the left (towards the beginning of the string).

This has two noticeable effects on `multisplit`'s output. First, this changes which splits are kept when `maxsplit` is less than the total number of splits in the string. When `reverse` is true, the splits are counted starting on the right and moving towards the left:

    >>> list(big.multisplit('appleXbananaYcookie', ('X', 'Y'), reverse=True)) # maxsplit defaults to -1
    ['apple', 'banana', 'cookie']
    >>> list(big.multisplit('appleXbananaYcookie', ('X', 'Y'), maxsplit=0, reverse=True))
    ['appleXbananaYcookie']
    >>> list(big.multisplit('appleXbananaYcookie', ('X', 'Y'), maxsplit=1, reverse=True))
    ['appleXbanana', 'cookie']
    >>> list(big.multisplit('appleXbananaYcookie', ('X', 'Y'), maxsplit=2, reverse=True))
    ['apple', 'banana', 'cookie']
    >>> list(big.multisplit('appleXbananaYcookie', ('X', 'Y'), maxsplit=3, reverse=True))
    ['apple', 'banana', 'cookie']

The second effect is far more subtle. It's only relevant when splitting strings containing multiple overlapping separators. When `reverse` is false, and there are two (or more) overlapping separators, the string is split by the leftmost overlapping separator. When `reverse` is true, and there are two (or more) overlapping separators, the string is split by the rightmost overlapping separator.

Consider these two calls to `multisplit`. The only difference between them is the value of `reverse`. They produce different results, even though neither one uses `maxsplit`.

    >>> list(big.multisplit('appleXAYbananaXAYcookie', ('XA', 'AY'))) # reverse defaults to False
    ['apple', 'Ybanana', 'Ycookie']
    >>> list(big.multisplit('appleXAYbananaXAYcookie', ('XA', 'AY'), reverse=True))
    ['appleX', 'bananaX', 'cookie']
Here are some examples of how you could use `multisplit` to replace some common Python string splitting methods. These exactly duplicate the behavior of the originals.

    import big

    def _multisplit_to_split(s, sep, maxsplit, reverse):
        separate = sep is not None
        if separate:
            # multisplit expects an iterable of separators,
            # so wrap the single explicit separator in a tuple
            sep = (sep,)
            strip = False
        else:
            sep = big.ascii_whitespace if isinstance(s, bytes) else big.whitespace
            strip = big.PROGRESSIVE
        result = list(big.multisplit(s, sep,
            maxsplit=maxsplit, reverse=reverse,
            separate=separate, strip=strip))
        if not separate:
            # ''.split() == ' '.split() == []
            if result and (not result[-1]):
                result.pop()
        return result

    def str_split(s, sep=None, maxsplit=-1):
        return _multisplit_to_split(s, sep, maxsplit, False)

    def str_rsplit(s, sep=None, maxsplit=-1):
        return _multisplit_to_split(s, sep, maxsplit, True)

    def str_splitlines(s, keepends=False):
        linebreaks = big.ascii_linebreaks if isinstance(s, bytes) else big.linebreaks
        l = list(big.multisplit(s, linebreaks,
            keep=keepends, separate=True, strip=False))
        if l and not l[-1]:
            # yes, ''.splitlines() returns an empty list
            l.pop()
        return l

    def _partition_to_multisplit(s, sep, reverse):
        if not sep:
            raise ValueError("empty separator")
        l = tuple(big.multisplit(s, (sep,),
            keep=big.ALTERNATING, maxsplit=1, reverse=reverse, separate=True))
        if len(l) == 1:
            # separator not found: pad with two empty strings of the right type
            empty = b'' if isinstance(s, bytes) else ''
            if reverse:
                l = (empty, empty) + l
            else:
                l = l + (empty, empty)
        return l

    def str_partition(s, sep):
        return _partition_to_multisplit(s, sep, False)

    def str_rpartition(s, sep):
        return _partition_to_multisplit(s, sep, True)
You wouldn't want to use these, of course--Python's built-in functions are so much faster!
Sometimes when you split using `multisplit`, you'll get empty strings in the return value. This might be unexpected, violating the Principle Of Least Astonishment. But there are excellent reasons for this behavior.

Let's start by observing what `str.split` does. `str.split` really has two major modes of operation: when you don't pass in a separator (or pass in `None` for the separator), and when you pass in an explicit separator string. In this latter mode, the documentation says it regards every instance of a separator string as an individual separator splitting the string. What does that mean? Watch what happens when you have two adjacent separators in the string you're splitting:

    >>> '1,2,,3'.split(',')
    ['1', '2', '', '3']

What's that empty string doing between `'2'` and `'3'`? Here's how you should think about it: when you pass in an explicit separator, `str.split` splits at every occurrence of that separator in the string. It always splits the string in two wherever there's a separator. And when there are two adjacent separators, conceptually, they have a zero-length string in between them:

    >>> '1,2,,3'[4:4]
    ''

The empty string in the output of `str.split` represents the fact that there were two adjacent separators. If `str.split` didn't add that empty string, the output would look like this:

    ['1', '2', '3']

But then it'd be indistinguishable from splitting the same string without two separators in a row:

    >>> '1,2,3'.split(',')
    ['1', '2', '3']

This difference is crucial when you want to reconstruct the original string from the split list. `str.split` with a separator should always be reversible using `str.join`, and with that empty string there it works correctly:

    >>> ','.join(['1', '2', '3'])
    '1,2,3'
    >>> ','.join(['1', '2', '', '3'])
    '1,2,,3'
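The reversibility property described above can be checked mechanically. Here is a small sketch in plain Python; `roundtrips` is a hypothetical helper name introduced only for this illustration.

```python
def roundtrips(s, sep):
    """True if joining the pieces of s.split(sep) with sep rebuilds s exactly."""
    return sep.join(s.split(sep)) == s

samples = ['1,2,3', '1,2,,3', ',1,2,3,', ',,', '']
results = [roundtrips(s, ',') for s in samples]
print(results)   # [True, True, True, True, True]

# Dropping the empty strings loses information: the round trip no longer works.
pieces = [p for p in '1,2,,3'.split(',') if p]
print(','.join(pieces))   # 1,2,3
```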
Now take a look at what happens when the string you're splitting starts or ends with a separator:

    >>> ',1,2,3,'.split(',')
    ['', '1', '2', '3', '']

This might seem weird. But, just like with two adjacent separators, this behavior is important for consistency. Conceptually there's a zero-length string between the beginning of the string and the first comma. And str.join needs those empty strings in order to correctly recreate the original string.

    >>> ','.join(['', '1', '2', '3', ''])
    ',1,2,3,'
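If you'd like to convince yourself of the round-trip property, here's a small sketch that uses only Python's built-in str methods. The sample strings are arbitrary choices for illustration; any string and any non-empty separator should behave the same way.

```python
# Splitting on an explicit separator, then joining with that same
# separator, should reproduce the original string exactly.
samples = ['1,2,3', '1,2,,3', ',1,2,3,', ',,', '']

round_trips = {}
for s in samples:
    pieces = s.split(',')
    round_trips[s] = ','.join(pieces)
    print(repr(s), '->', pieces)

# Every sample survives the split/join round trip unchanged.
assert all(rebuilt == original for original, rebuilt in round_trips.items())
```

Without the empty strings in the split output, the samples with adjacent, leading, or trailing commas would all fail this check.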
Naturally, multisplit lets you duplicate this behavior. When you want multisplit to behave just like str.split does with an explicit separator string, just pass in keep=False, separate=True, and strip=False. That is, if a and b are strings,

    big.multisplit(a, (b,), keep=False, separate=True, strip=False)

always produces the same output as

    a.split(b)

For example, here's multisplit splitting the strings we've been playing with, using these parameters:

    >>> list(big.multisplit('1,2,,3', (',',), keep=False, separate=True, strip=False))
    ['1', '2', '', '3']
    >>> list(big.multisplit(',1,2,3,', (',',), keep=False, separate=True, strip=False))
    ['', '1', '2', '3', '']
This "emit an empty string" behavior also has ramifications when keep isn't false. The behavior of keep=True is easy to predict; multisplit just appends the separators to the previous string segment:

    >>> list(big.multisplit('1,2,,3', (',',), keep=True, separate=True, strip=False))
    ['1,', '2,', ',', '3']
    >>> list(big.multisplit(',1,2,3,', (',',), keep=True, separate=True, strip=False))
    [',', '1,', '2,', '3,', '']

The principle here is that, when you use keep=True, you should be able to reconstitute the original string with ''.join:

    >>> ''.join(['1,', '2,', ',', '3'])
    '1,2,,3'
    >>> ''.join([',', '1,', '2,', '3,', ''])
    ',1,2,3,'
keep=big.ALTERNATING is much the same, except we insert the separators as their own segments, rather than appending each one to the previous segment:

    >>> list(big.multisplit('1,2,,3', (',',), keep=big.ALTERNATING, separate=True, strip=False))
    ['1', ',', '2', ',', '', ',', '3']
    >>> list(big.multisplit(',1,2,3,', (',',), keep=big.ALTERNATING, separate=True, strip=False))
    ['', ',', '1', ',', '2', ',', '3', ',', '']

Remember, ALTERNATING output always begins and ends with a non-separator string. If the string you're splitting begins or ends with a separator, the output from multisplit specifying keep=ALTERNATING will correspondingly begin or end with an empty string.

And, as with keep=True, you can also recreate the original string by passing these arrays in to ''.join:

    >>> ''.join(['1', ',', '2', ',', '', ',', '3'])
    '1,2,,3'
    >>> ''.join(['', ',', '1', ',', '2', ',', '3', ',', ''])
    ',1,2,3,'
Finally there's keep=big.AS_PAIRS. The behavior here seemed so strange, initially I thought it was wrong. But I've given it a lot of thought, and I've convinced myself that this is correct:

    >>> list(big.multisplit('1,2,,3', (',',), keep=big.AS_PAIRS, separate=True, strip=False))
    [('1', ','), ('2', ','), ('', ','), ('3', '')]
    >>> list(big.multisplit(',1,2,3,', (',',), keep=big.AS_PAIRS, separate=True, strip=False))
    [('', ','), ('1', ','), ('2', ','), ('3', ','), ('', '')]

That tuple at the end, just containing two empty strings:

    ('', '')

It's so strange. How can that be right?

It's the same as str.split. multisplit must split the string into two pieces every time it finds the separator in the original string. So it must emit the empty non-separator string. And since that zero-length string isn't (and cannot be!) followed by a separator, when using keep=AS_PAIRS the final separator string is also empty.

Think of it this way: with the tuple of empty strings there, you can easily convert one keep format into any other. (Provided that you know what the separators were--either the source keep format was not false, or you only used one separator string when calling multisplit.) Without that tuple of empty strings at the end, you'd also have to have an if statement to add or remove empty stuff from the end.

I'll demonstrate this with a simple example. Here's the output of multisplit splitting the string '1a1z1' by the separator '1', in each of the four keep formats:

    >>> list(big.multisplit('1a1z1', '1', keep=False))
    ['', 'a', 'z', '']
    >>> list(big.multisplit('1a1z1', '1', keep=True))
    ['1', 'a1', 'z1', '']
    >>> list(big.multisplit('1a1z1', '1', keep=big.ALTERNATING))
    ['', '1', 'a', '1', 'z', '1', '']
    >>> list(big.multisplit('1a1z1', '1', keep=big.AS_PAIRS))
    [('', '1'), ('a', '1'), ('z', '1'), ('', '')]
Because the AS_PAIRS output ends with that tuple of empty strings, we can mechanically convert it into any of the other formats, like so:

    >>> result = list(big.multisplit('1a1z1', '1', keep=big.AS_PAIRS))
    >>> result
    [('', '1'), ('a', '1'), ('z', '1'), ('', '')]
    >>> [s[0] for s in result] # convert to keep=False
    ['', 'a', 'z', '']
    >>> [s[0]+s[1] for s in result] # convert to keep=True
    ['1', 'a1', 'z1', '']
    >>> [s for t in result for s in t][:-1] # convert to keep=big.ALTERNATING
    ['', '1', 'a', '1', 'z', '1', '']
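The conversion works in the other direction too. Here's a minimal sketch that doesn't need big at all: it assumes that re.split with a capturing group produces a list shaped like the ALTERNATING format for this input, and the two helper functions are hypothetical names invented for this illustration.

```python
import re

def alternating_to_pairs(alternating):
    """Convert an ALTERNATING-shaped list into AS_PAIRS-shaped tuples."""
    segments = alternating[0::2]
    # The last segment has no separator after it, so pad with ''.
    separators = alternating[1::2] + ['']
    return list(zip(segments, separators))

def pairs_to_alternating(pairs):
    """Flatten AS_PAIRS-shaped tuples, dropping the final empty separator."""
    return [s for pair in pairs for s in pair][:-1]

# A capturing group makes re.split keep the separators in its output.
alternating = re.split('(1)', '1a1z1')
pairs = alternating_to_pairs(alternating)

print(alternating)  # ['', '1', 'a', '1', 'z', '1', '']
print(pairs)        # [('', '1'), ('a', '1'), ('z', '1'), ('', '')]
```

Neither helper needs a special case for the end of the list, which is the point of the trailing tuple of empty strings.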
If the AS_PAIRS output didn't end with that tuple of empty strings, you'd need to add an if statement to restore the trailing empty strings as needed.

str.split returns an empty list when you split an empty string by whitespace:

    >>> ''.split()
    []

But not when you split by an explicit separator:

    >>> ''.split('x')
    ['']

multisplit is consistent here. If you split an empty string, it always returns an empty string, as long as the separators are valid:

    >>> list(big.multisplit(''))
    ['']
    >>> list(big.multisplit('', ('a', 'b', 'c')))
    ['']
Similarly, when splitting a string that only contains whitespace, str.split also returns an empty list:

    >>> ' '.split()
    []

This is really the same as "splitting an empty string", because when str.split splits on whitespace, the first thing it does is strip leading whitespace.

If you multisplit a string that only contains whitespace, and you split on whitespace characters, it returns two empty strings:

    >>> list(big.multisplit(' '))
    ['', '']

This is because the string conceptually starts with a zero-length string, then has a run of whitespace characters, then ends with another zero-length string. So those two empty strings are the leading and trailing zero-length strings, separated by whitespace. If you tell multisplit to also strip the string, you'll get back a single empty string:

    >>> list(big.multisplit(' ', strip=True))
    ['']

And multisplit behaves consistently even when you use different separators:

    >>> list(big.multisplit('ababa', 'ab'))
    ['', '']
    >>> list(big.multisplit('ababa', 'ab', strip=True))
    ['']
And I should know--multisplit is implemented using re.split!
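You can see the same empty-string behavior directly in re.split. This is only a sketch: the regular expressions here are simple stand-ins I picked for illustration, not the patterns big actually builds internally.

```python
import re

# re.split reports the zero-length strings on either side of a match.
empty_input     = re.split(r'\s+', '')         # nothing to split
only_whitespace = re.split(r'\s+', '   ')      # one run of whitespace
only_separators = re.split(r'[ab]+', 'ababa')  # one run of 'a'/'b'

# str.split with no separator strips first, so it reports nothing at all.
str_empty      = ''.split()
str_whitespace = '   '.split()

print(empty_input, only_whitespace, only_separators)
print(str_empty, str_whitespace)
```

The first line printed shows one empty string for the empty input and two for the separator-only inputs; the second line shows two empty lists.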
-
Several functions in big take a separators argument, an iterable of separator strings. Examples of these functions include lines and multisplit. Although you can use any iterable of strings you like, most often you'll be separating on some form of whitespace. But what, exactly, is whitespace? There's more to this topic than you might suspect.

The good news is, you can almost certainly ignore all the complexity. These days the only whitespace characters you're likely to encounter are spaces, tabs, newlines, and maybe carriage returns. Python and big handle all those easily.

With respect to big and these separators arguments, big provides four values designed for use as separators. All four of these are tuples containing whitespace characters:

- When working with str objects, you'll want to use either big.whitespace or big.linebreaks. big.whitespace contains all the whitespace characters, big.linebreaks contains just the line-breaking whitespace characters.
- big also has equivalents for working with bytes objects: big.bytes_whitespace and big.bytes_linebreaks, respectively.
Apart from exceptionally rare occasions, these are all you'll ever need. And if that's all you need, you can stop reading this section now.
But what about those exceptionally rare occasions? You'll be pleased to know big handles them too. The rest of this section is a deep dive into these rare occasions.
Here's the list of all characters recognized by Python str objects as whitespace characters:

    # char    decimal   hex      name
    ##########################################
    '\t'    , #     9 - 0x0009 - tab
    '\n'    , #    10 - 0x000a - newline
    '\v'    , #    11 - 0x000b - vertical tab
    '\f'    , #    12 - 0x000c - form feed
    '\r'    , #    13 - 0x000d - carriage return
    '\x1c'  , #    28 - 0x001c - file separator
    '\x1d'  , #    29 - 0x001d - group separator
    '\x1e'  , #    30 - 0x001e - record separator
    '\x1f'  , #    31 - 0x001f - unit separator
    ' '     , #    32 - 0x0020 - space
    '\x85'  , #   133 - 0x0085 - next line
    '\xa0'  , #   160 - 0x00a0 - non-breaking space
    '\u1680', #  5760 - 0x1680 - ogham space mark
    '\u2000', #  8192 - 0x2000 - en quad
    '\u2001', #  8193 - 0x2001 - em quad
    '\u2002', #  8194 - 0x2002 - en space
    '\u2003', #  8195 - 0x2003 - em space
    '\u2004', #  8196 - 0x2004 - three-per-em space
    '\u2005', #  8197 - 0x2005 - four-per-em space
    '\u2006', #  8198 - 0x2006 - six-per-em space
    '\u2007', #  8199 - 0x2007 - figure space
    '\u2008', #  8200 - 0x2008 - punctuation space
    '\u2009', #  8201 - 0x2009 - thin space
    '\u200a', #  8202 - 0x200a - hair space
    '\u2028', #  8232 - 0x2028 - line separator
    '\u2029', #  8233 - 0x2029 - paragraph separator
    '\u202f', #  8239 - 0x202f - narrow no-break space
    '\u205f', #  8287 - 0x205f - medium mathematical space
    '\u3000', # 12288 - 0x3000 - ideographic space

This list was derived by iterating over every character defined in Unicode, and testing to see if the split() method on a Python str object splits at that character.

The first surprise: this isn't the same as the list of all characters defined by Unicode as whitespace. It's almost the same list, except Python adds four extra characters: '\x1c', '\x1d', '\x1e', and '\x1f', which respectively are called "file separator", "group separator", "record separator", and "unit separator". I'll refer to these as "the four ASCII separator characters".

These characters were defined as part of the original ASCII standard, way back in 1963. As their names suggest, they were intended to be used as separator characters for data, the same way Ctrl-Z was used to indicate end-of-file in the CP/M and earliest FAT filesystems. But the four ASCII separator characters were rarely used even back in the day. Today they're practically unheard of.
As a rule, printing these characters to the screen generally doesn't do anything--they don't move the cursor, and the screen doesn't change. So their behavior is a bit mysterious. A lot of people (including early Python programmers it seems!) thought that meant they're whitespace. This seems like an odd conclusion to me. After all, all the other whitespace characters move the cursor, either right or down or both; these don't move the cursor at all.
The Unicode standard is unambiguous: these characters are not whitespace. And yet Python's "Unicode object" behaves as if they are. So I'd say this is a bug; Python's Unicode object should implement what the Unicode standard says. Like many bugs, this one has lingered for a long time. The behavior is present in Python 2, there's a ten-year-old issue on the Python issue tracker about this, and it's not making progress.
The second surprise has to do with bytes objects. Of course, bytes objects represent binary data, and don't necessarily represent characters. Even if they do, they don't have any encoding associated with them. However, for convenience--and backwards-compatibility with Python 2--Python's bytes objects support several method calls that treat the data as if it were "ASCII-compatible".

The surprise: These methods on Python bytes objects recognize a different set of whitespace characters. Here's the list of all bytes recognized by Python bytes objects as whitespace:

    # char    decimal  hex    name
    #######################################
    '\t'    , #  9 - 0x09 - tab
    '\n'    , # 10 - 0x0a - newline
    '\v'    , # 11 - 0x0b - vertical tab
    '\f'    , # 12 - 0x0c - form feed
    '\r'    , # 13 - 0x0d - carriage return
    ' '     , # 32 - 0x20 - space

This list was derived by iterating over every possible byte value, and testing to see if the split() method on a Python bytes object splits at that byte.

The good news is, this list is the same as ASCII's list, and it agrees with Unicode. In fact this list is quite familiar to C programmers; it's the same whitespace characters recognized by the standard C function isspace() (in ctype.h). Python has used this function to decide which characters are and aren't whitespace in 8-bit strings since its very beginning.

Notice that this list doesn't contain the four ASCII separator characters. That these two types in Python don't agree only enhances the mystery.
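If you're curious, the brute-force derivation described above is easy to reproduce. Here's a minimal sketch using only the standard library; the function names are my own invention for this example, and the str scan visits every code point, so it may take a moment to run.

```python
import sys

def str_split_whitespace():
    """Every character that str.split() treats as whitespace."""
    found = []
    for code_point in range(sys.maxunicode + 1):
        c = chr(code_point)
        # If c is whitespace, split() breaks 'a' + c + 'b' in two.
        if ('a' + c + 'b').split() == ['a', 'b']:
            found.append(c)
    return found

def bytes_split_whitespace():
    """Every byte value that bytes.split() treats as whitespace."""
    found = []
    for value in range(256):
        b = bytes([value])
        if (b'a' + b + b'b').split() == [b'a', b'b']:
            found.append(b)
    return found

str_ws = str_split_whitespace()
bytes_ws = bytes_split_whitespace()

# ASCII-range characters that str treats as whitespace but bytes doesn't.
only_in_str = [c for c in str_ws
               if ord(c) < 128 and c.encode('ascii') not in bytes_ws]
print(only_in_str)  # ['\x1c', '\x1d', '\x1e', '\x1f']
```

The difference between the two results, restricted to the ASCII range, is exactly the four ASCII separator characters.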
The situation is slightly worse with line-breaking characters. Line-breaking characters are a subset of whitespace characters; they're whitespace characters that always move the cursor down. And, as with whitespace generally, Python str objects don't agree with Unicode about what is and is not a line-breaking character, and Python bytes objects don't agree with either of those.

Here's the list of all Unicode characters recognized by Python str objects as line-breaking characters:

    # char    decimal   hex      name
    ##########################################
    '\n'    , #   10 0x000a - newline
    '\v'    , #   11 0x000b - vertical tab
    '\f'    , #   12 0x000c - form feed
    '\r'    , #   13 0x000d - carriage return
    '\x1c'  , #   28 0x001c - file separator
    '\x1d'  , #   29 0x001d - group separator
    '\x1e'  , #   30 0x001e - record separator
    '\x85'  , #  133 0x0085 - next line
    '\u2028', # 8232 0x2028 - line separator
    '\u2029', # 8233 0x2029 - paragraph separator

This list was derived by iterating over every character defined in Unicode, and testing to see if the splitlines() method on a Python str object splits at that character.

Again, this is different from the list of characters defined as line-breaking whitespace in Unicode. And again it's because Python defines some of the four ASCII separator characters as line-breaking characters. In this case it's only the first three; Python doesn't consider the fourth, "unit separator", as a line-breaking character. (I don't know why Python draws this distinction... but then again, I don't know why it considers the first three to be line-breaking. It's all a mystery to me.)
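The same brute-force scan works for line breaks. This sketch swaps split() for splitlines(); as before, it uses only the standard library, and the function name is just something I made up for the example.

```python
import sys

def str_linebreak_characters():
    """Every character at which str.splitlines() breaks a line."""
    found = []
    for code_point in range(sys.maxunicode + 1):
        c = chr(code_point)
        # A line-breaking character splits 'a' + c + 'b' into two lines.
        if ('a' + c + 'b').splitlines() == ['a', 'b']:
            found.append(c)
    return found

breaks = str_linebreak_characters()
print(len(breaks))       # 10
print('\x1e' in breaks)  # True:  record separator breaks lines
print('\x1f' in breaks)  # False: unit separator does not
```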
Here's the list of all characters recognized by Python bytes objects as line-breaking characters:

    # char    decimal   hex      name
    #######################################
    '\n'    , #   10 0x000a - newline
    '\r'    , #   13 0x000d - carriage return

This list was derived by iterating over every possible byte, and testing to see if the splitlines() method on a Python bytes object splits at that byte.

It's here we find our final unpleasant surprise: the methods on Python bytes objects don't consider '\v' (vertical tab) and '\f' (form feed) to be line-break characters. I assert this is also a bug. These are well understood to be line-breaking characters; "vertical tab" is like a "tab", except it moves the cursor down instead of to the right. And "form feed" moves the cursor to the top left of the next "page", which requires advancing at least one line.

To be crystal clear: the odds that any of this will cause a problem for you are extremely low. In order for it to make a difference:
- you'd have to encounter text using one of these six characters where Python disagrees with Unicode and ASCII, and
- you'd have to process the input based on some definition of whitespace, and
- it would have to produce different results than you might have otherwise expected, and
- this difference in results would have to be important.
It seems extremely unlikely that all of these will be true for you.
In case this does affect you, big has a complete set of predefined whitespace tuples that will handle any of these situations. big defines a total of ten tuples, sorted into five categories.
In every category there are two values: one that contains whitespace, the other contains linebreaks. The whitespace tuple contains all the possible values of whitespace--characters that move the cursor either horizontally, or vertically, or both, but don't print anything visible to the screen. The linebreaks tuple contains the subset of whitespace characters that move the cursor vertically.

The most important two values start with str_: str_whitespace and str_linebreaks. These contain all the whitespace characters recognized by the Python str object.

Next are two values that start with unicode_: unicode_whitespace and unicode_linebreaks. These contain all the whitespace characters defined in the Unicode standard.

Third, two values that start with ascii_: ascii_whitespace and ascii_linebreaks. These contain all the whitespace characters defined in ASCII.

Fourth, two values that start with bytes_: bytes_whitespace and bytes_linebreaks. These contain all the whitespace characters recognized by the Python bytes object.

Finally we have the two tuples that lack a prefix: whitespace and linebreaks. These are the tuples you should use most of the time, and several big functions use them as default values. These are identical to str_whitespace and str_linebreaks respectively.

(big actually defines an additional ten tuples, as discussed in the very next section.)
Historically, different platforms used different ASCII characters--or sequences of ASCII characters--to represent "go to the next line" in text files. Here are the most popular conventions:
    \n    - UNIX, Amiga
    \r    - Mac OS (before OS X), many 8-bit computers
    \r\n  - Windows, DOS

(There are a couple more conventions, and a lot more history, in the Wikipedia article on newlines.)
Handling these differing conventions was a stumbling block for a long time--both for computer programs, and in the daily lives of computer users. Python went through several iterations on how to handle this, eventually settling on the "universal newlines" support added in Python 2.3. These days the world seems to be converging on the UNIX standard '\n'; Windows supports it, and it's the default on every other modern platform. So in practice you probably don't have end-of-line conversion problems, either.

But just in case, big has one more trick. All of the tuples defined in the previous section--from whitespace to ascii_linebreaks--also contain this string:

    '\r\n'

(The two bytes_ tuples contain the bytes equivalent, b'\r\n'.)

This addition means that, when you use one of these tuples with one of the big functions that take separators, it'll split on \r\n as if it were one character. This means that big itself should automatically handle the DOS and Windows end-of-line character sequence, in case one happens to creep into your data.

If you don't want this behavior, just add the suffix _without_crlf to the end of any of the ten tuples, e.g. whitespace_without_crlf, bytes_linebreaks_without_crlf.

What if you need to split text by whitespace, or by lines, but that text is in bytes format with an unusual encoding? big makes that easy too. If one of the builtin tuples won't work for you, you can make your own tuple from scratch, or modify an existing tuple to meet your needs.

For example, let's say you need to split a document by whitespace, and the document is encoded in code page 850, aka "DOS Latin 1". (Note that Python's 'latin-1' codec is a different encoding, ISO 8859-1; the codec for code page 850 is 'cp850'.) Normally the easiest thing would be to decode it to a str object using the 'cp850' text codec, then operate on it normally. But you might have reasons why you don't want to decode it--maybe the document is damaged and doesn't decode properly, and it's easier to work with the encoded bytes than to fix it. If you want to process the text with a big function that accepts a separators argument, you could make your own custom tuple of code page 850 whitespace characters. Code page 850 has the same whitespace characters as ASCII, but adds one more, value 255 (a non-breaking space), which is not line-breaking. So it's easy to make the appropriate tuples yourself:

    cp850_whitespace = big.bytes_whitespace + (b'\xff',)
    cp850_linebreaks = big.bytes_linebreaks

What if you want to process a bytes object containing UTF-8? That's easy too. Just convert one of the existing tuples containing str objects using big.encode_strings. For example, to split a UTF-8 encoded bytes object b using the Unicode line-breaking characters, you could call:

    multisplit(b, encode_strings(unicode_linebreaks, encoding='utf-8'))

Note that this technique probably won't work correctly for most other multibyte encodings, for example UTF-16. For these encodings, you should decode to str before processing.

Why? It's because multisplit could find matches in multibyte sequences straddling characters. Consider this example:

    >>> haystack = '\u0101\u0102'
    >>> needle = '\u0201'
    >>> needle in haystack
    False
    >>>
    >>> encoded_haystack = haystack.encode('utf-16-le')
    >>> encoded_needle = needle.encode('utf-16-le')
    >>> encoded_needle in encoded_haystack
    True
The character '\u0201' doesn't appear in the original string, but the encoded version appears in the encoded string, as the second byte of the first character and the first byte of the second character:

    >>> encoded_haystack
    b'\x01\x01\x02\x01'
    >>> encoded_needle
    b'\x01\x02'
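For contrast, here's a small sketch that runs the same check under both encodings. UTF-8 is designed so that the encoding of one character can never appear straddling two others, which is why the UTF-8 case comes out clean. The helper function is a hypothetical name for this example only.

```python
haystack = '\u0101\u0102'
needle = '\u0201'

def has_false_match(encoding):
    """True if the encoded needle shows up inside the encoded haystack
    even though the needle isn't in the original text."""
    return (needle not in haystack
            and needle.encode(encoding) in haystack.encode(encoding))

utf16_false_match = has_false_match('utf-16-le')
utf8_false_match = has_false_match('utf-8')

print(utf16_false_match)  # True
print(utf8_false_match)   # False
```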
-
lines is a function that makes it easy to write well-behaved, feature-rich text parsers. lines itself iterates over a string, returning an iterator that yields individual lines split from that string. The iterator yields a 2-tuple:

    (LineInfo, line)

The LineInfo object provides the line number and starting column number for each line. This makes it easy for your parser to provide line and column information for error messages.

This iterator is designed to be modified by "lines modifier" functions. These are functions that consume a lines iterator and re-yield the values, possibly modifying or discarding them along the way. For example, passing a lines iterator into lines_filter_empty_lines results in an iterator that skips over the empty lines. All the lines modifier functions that ship with big start with the string lines_.

Most lines modifier function names belong to a category, encoded as the second word in the function name (immediately after lines_). Some examples:

- lines_filter_ functions conditionally remove lines from the output. For example, lines_filter_empty_lines will only yield a line if it isn't empty.
- lines_strip_ functions may remove one or more substrings from the line. For example, lines_strip_indent(li) strips the leading whitespace from a line before yielding it. (Whenever a lines modifier removes leading text from a line, it will add a leading field to the accompanying LineInfo object containing the removed substring, and will also update the column_number of the line to reflect the new starting column.)
- lines_convert_ functions may change one or more substrings in the line. For example, lines_convert_tabs_to_spaces changes tab characters to space characters in any lines it processes.

(big isn't strict about these category names though. For example, lines_containing(li, s, *, invert=False) and lines_grep(li, pattern, *, invert=False, flags=0) are obviously "filter" modifiers, but their names don't start with lines_filter_.)

All lines modifier functions are composable with each other; you can "stack" them together simply by passing the output of one into the input of another. For example,
    with open("textfile.txt", "rt") as f:
        for info, line in big.lines_filter_empty_lines(
            big.lines_rstrip(big.lines(f.read()))):
            ...
will iterate over the lines of textfile.txt, skipping over all empty lines and lines that consist only of whitespace.

When you stack line modifiers in this way, note that the outer modifiers happen later. In the above example, each line is first "r-stripped", and then discarded if it's empty. If you stacked the line modifiers in the opposite order:
    with open("textfile.txt", "rt") as f:
        for info, line in big.lines_rstrip(
            big.lines_filter_empty_lines(big.lines(f.read()))):
            ...
then it'd filter out empty lines first, and then "r-strip" the lines. So lines in the input that contained only whitespace would still get yielded as empty lines, which is probably not what you want. Ordering is important!
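To make the ordering effect concrete without needing big installed, here's a toy sketch of the same pattern. Everything in it is a simplified stand-in I wrote for illustration: toy_lines, toy_rstrip, and toy_filter_empty are hypothetical names, and they yield a bare line number where big's functions would yield an info object.

```python
def toy_lines(text):
    """Stand-in for a lines iterator: yields (line_number, line)."""
    for line_number, line in enumerate(text.split('\n'), 1):
        yield (line_number, line)

def toy_rstrip(li):
    """Modifier: strip trailing whitespace from every line."""
    for info, line in li:
        yield (info, line.rstrip())

def toy_filter_empty(li):
    """Modifier: drop lines that are empty."""
    for info, line in li:
        if line:
            yield (info, line)

text = "alpha\n   \n\nbeta"

# Inner modifier runs first: strip, then filter.
strip_then_filter = list(toy_filter_empty(toy_rstrip(toy_lines(text))))
# Opposite order: filter, then strip.
filter_then_strip = list(toy_rstrip(toy_filter_empty(toy_lines(text))))

print(strip_then_filter)  # [(1, 'alpha'), (4, 'beta')]
print(filter_then_strip)  # [(1, 'alpha'), (2, ''), (4, 'beta')]
```

In the second ordering, the whitespace-only line 2 slips past the filter and only becomes empty afterwards.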
Of course, you can write your own lines modifier functions. Simply accept a lines iterator as an argument, iterate over it, and yield each line info and line, modifying them (or not yielding them!) as you see fit. You could even write your own lines iterator, a replacement for lines, if you need functionality lines doesn't provide.

Note that if you write your own lines modifier function, and it removes text from the beginning of the line, you must update column_number in the LineInfo object manually--it doesn't happen automatically.
-
big contains three functions used to reflow and format text in a pleasing manner. In the order you should use them, they are split_text_with_code, wrap_words(), and optionally merge_columns. This trio of functions gives you the following word-wrap superpowers:

- Paragraphs of text representing embedded "code" don't get word-wrapped. Instead, their formatting is preserved.
- Multiple texts can be merged together into multiple columns.
The big word wrapping functions also distinguish between "text" and "code". The main distinction is, "text" lines can get word-wrapped, but "code" lines shouldn't. big considers any line starting with enough whitespace to be a "code" line; by default, this is four spaces. Any non-blank line that starts with four spaces is a "code" line, and any non-blank line that starts with fewer than four spaces is a "text" line.
In "text" mode:
- words are separated by whitespace,
- initial whitespace on the line is discarded,
- the amount of whitespace between words is irrelevant,
- individual newline characters are ignored, and
- two or more newline characters in a row are converted into exactly two newlines (aka a "paragraph break").
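Here's a rough sketch of those "text" mode rules in plain Python. It's a toy written for this explanation, not big's implementation: toy_split_text is a hypothetical name, it knows nothing about "code" lines, and it assumes a blank line is what separates paragraphs.

```python
import re

def toy_split_text(s):
    """Split text-mode input into words and paragraph-break markers."""
    result = []
    # A blank line (a run of two or more newlines) ends a paragraph.
    for paragraph in re.split(r'\n{2,}', s.strip('\n')):
        # Words are separated by any amount of whitespace, and a
        # single newline inside a paragraph is just more whitespace.
        words = paragraph.split()
        if not words:
            continue
        if result:
            result.append('\n\n')  # paragraph break marker
        result.extend(words)
    return result

sample = "hello   there!\nthis is text.\n\n\nthis is a second paragraph!"
words = toy_split_text(sample)
print(words)
```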
In "code" mode:
- all whitespace is preserved, except for trailing whitespace on a line, and
- all newline characters are preserved.
Also, whenever split_text_with_code switches between "text" and "code" mode, it emits a paragraph break.

A split text array is an intermediary data structure used by big.text functions to represent text. It's literally just an array of strings, where the strings represent individual word-wrappable substrings. split_text_with_code returns a split text array, and wrap_words() consumes a split text array.

You'll see four kinds of strings in a split text array:

- Individual words, ready to be word-wrapped.
- Entire lines of "code", preserving their formatting.
- Line breaks, represented by a single newline: '\n'.
- Paragraph breaks, represented by two newlines: '\n\n'.
This might be clearer with an example or two. The following text:

    hello there!
    this is text.


    this is a second paragraph!

would be represented in a Python string as:

    "hello there!\nthis is text.\n\n\nthis is a second paragraph!"

Note the three newlines between the second and third lines.

If you then passed this string in to split_text_with_code, it'd return this split text array:

    [ 'hello', 'there!', 'this', 'is', 'text.', '\n\n', 'this', 'is', 'a', 'second', 'paragraph!']

split_text_with_code merged the first two lines together into a single paragraph, and collapsed the three newlines separating the two paragraphs into a "paragraph break" marker (two newlines in one string).

Now let's add an example of text with some "code". This text:
    What are the first four squared numbers?

        for i in range(1, 5):


        print(i**2)

    Python is just that easy!

would be represented in a Python string as (broken up into multiple strings for clarity):

    "What are the first four squared numbers?\n\n" +
    "    for i in range(1, 5):\n\n\n" +
    "    print(i**2)\n\nPython is just that easy!"

split_text_with_code considers the two lines with initial whitespace as "code" lines, and so the text is split into the following split text array:

    ['What', 'are', 'the', 'first', 'four', 'squared', 'numbers?', '\n\n',
     '    for i in range(1, 5):', '\n', '\n', '\n', '    print(i**2)', '\n\n',
     'Python', 'is', 'just', 'that', 'easy!']
Here we have a "text" paragraph, followed by a "code" paragraph, followed by a second "text" paragraph. The "code" paragraph preserves the internal newlines, though they are represented as individual "line break" markers (strings containing a single newline). Every paragraph is separated by a "paragraph marker".
Here's a simple algorithm for joining a split text array back into a single string:
    prev = None
    a = []
    for word in split_text_array:
        # add a space only between two adjacent non-whitespace strings
        if prev and not (prev.isspace() or word.isspace()):
            a.append(' ')
        a.append(word)
        prev = word
    text = "".join(a)
Of course, this algorithm is too simple to do word wrapping. Nor does it handle adding two spaces after sentence-ending punctuation. In practice, you shouldn't do this by hand; you should use wrap_words.

merge_columns merges multiple strings into columns on the same line.

For example, it could merge these three Python strings:
    [
    "Here's the first\ncolumn of text.",
    "More text over here!\nIt's the second\ncolumn! How\nexciting!",
    "And here's a\nthird column.",
    ]
into the following text:
    Here's the first   More text over here!   And here's a
    column of text.    It's the second        third column.
                       column! How
                       exciting!

(Note that merge_columns doesn't do its own word-wrapping; instead, it's designed to consume the output of wrap_words.)

Each column is passed in to merge_columns as a "column tuple":

    (s, min_width, max_width)
s is the string, min_width is the minimum width of the column, and max_width is the maximum width of the column.

As you saw above, s can contain newline characters, and merge_columns obeys those when formatting each column.

For each column, merge_columns measures the longest line of each column. The width of the column is determined as follows:

- If the longest line is less than min_width characters long, the column will be min_width characters wide.
- If the longest line is greater than or equal to min_width characters long, and less than or equal to max_width characters long, the column will be as wide as the longest line.
- If the longest line is greater than max_width characters long, the column will be max_width characters wide, and lines that are longer than max_width characters will "overflow".
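Put another way, the column width is the length of the longest line, clamped to the range from min_width to max_width. Here's a minimal sketch of that rule; toy_column_width is a hypothetical helper written for this explanation, not a function big provides.

```python
def toy_column_width(s, min_width, max_width):
    """Return (width, overflowed) for one column tuple."""
    longest = max((len(line) for line in s.split('\n')), default=0)
    # Clamp the longest line's length into [min_width, max_width].
    width = max(min_width, min(longest, max_width))
    # Lines longer than max_width are the ones that overflow.
    overflowed = longest > max_width
    return width, overflowed

column = "Here's the first\ncolumn of text."   # longest line is 16 long

print(toy_column_width(column, 20, 30))  # (20, False)  padded to min_width
print(toy_column_width(column, 10, 30))  # (16, False)  as wide as the text
print(toy_column_width(column, 10, 12))  # (12, True)   capped, overflows
```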
What is "overflow"? It's a condition merge_columns may encounter when the text in a column is wider than that column's max_width. merge_columns needs to consider both "overflow lines", lines that are longer than max_width, and "overflow columns", columns that contain one or more overflow lines.

What does merge_columns do when it encounters overflow? merge_columns supports three "strategies" to deal with this condition, and you can specify which one you want using its overflow_strategy parameter. The three strategies are:

- OverflowStrategy.RAISE: Raise an OverflowError exception. The default.
- OverflowStrategy.INTRUDE_ALL: Intrude into all subsequent columns on all lines where the overflowed column is wider than its max_width. The subsequent columns "make space" for the overflow text by not adding text on those overflowed lines; this is called "pausing" their output.
- OverflowStrategy.DELAY_ALL: Delay all columns after the overflowed column, not beginning any until after the last overflowed line in the overflowed column. This is like the INTRUDE_ALL strategy, except that the columns "make space" by pausing their output until the last overflowed line.
When overflow_strategy is INTRUDE_ALL or DELAY_ALL, and either overflow_before or overflow_after is nonzero, these specify the number of extra lines before or after the overflowed lines in a column where the subsequent columns "pause".
-
big's TopologicalSorter is a drop-in replacement for graphlib.TopologicalSorter in the Python standard library (new in 3.9). However, the version in big has been greatly upgraded:

- prepare is now optional, though it still performs a cycle check.
- You can add nodes and edges to a graph at any time, even while iterating over the graph. Adding nodes and edges always succeeds.
- You can remove nodes from graph g with the new method g.remove(node). Again, you can do this at any time, even while iterating over the graph. Removing a node from the graph always succeeds, assuming the node is in the graph.
- The functionality for iterating over a graph now lives in its own object called a view. View objects implement the get_ready, done, and __bool__ methods. There's a default view built in to the graph object; the get_ready, done, and __bool__ methods on a graph just call into the graph's default view. You can create a new view at any time by calling the new view method.
Note that if you're using a view to iterate over the graph, and you modify the graph, and the view now represents a state that isn't coherent with the graph, attempting to use that view raises a
RuntimeError. More on what I mean by "coherence" in a minute.This implementation also fixes some minor warts with the existing API:
- In Python's implementation,
static_orderandget_ready/doneare mutually exclusive. If you ever callget_readyon a graph, you can never callstatic_order, and vice-versa. The implementaiton in big doesn't have this restriction, because its implementation ofstatic_ordercreates and uses a new view object every time it's called.. - In Python's implementation, you can only iterate over the graph once, or call
static_orderonce. The implementation in big solves this in several ways: it allows you to create as many views as you want, and you can call the newresetmethod on a view to reset it to its initial state.
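The two standard-library restrictions described above are easy to reproduce with graphlib itself (Python 3.9 or newer). The following is a small sketch that uses only the standard library, not big; the helper name build_diamond is made up for this example. It shows that once get_ready has handed out a node, static_order refuses to run, and that static_order can't be run twice on the same graph.

```python
import graphlib

def build_diamond():
    """Build a "diamond" graph: D depends on B and C, which both depend on A."""
    ts = graphlib.TopologicalSorter()
    ts.add('B', 'A')
    ts.add('C', 'A')
    ts.add('D', 'B', 'C')
    return ts

# Restriction 1: get_ready/done and static_order are mutually exclusive.
ts = build_diamond()
ts.prepare()
first_ready = ts.get_ready()        # only 'A' has no dependencies
try:
    list(ts.static_order())
    mixed_ok = True
except ValueError:
    mixed_ok = False

# Restriction 2: static_order only works once per graph.
ts = build_diamond()
first_order = list(ts.static_order())
try:
    list(ts.static_order())
    second_ok = True
except ValueError:
    second_ok = False

print(first_ready, mixed_ok, second_ok)
```

In both cases the standard library raises ValueError, which is the behavior big's view objects are designed to avoid.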
So what does it mean for a view to no longer be coherent with the graph? Consider the following code:

```python
g = big.TopologicalSorter()
g.add('B', 'A')
g.add('C', 'A')
g.add('D', 'B', 'C')
g.add('B', 'A')
v = g.view()
v.get_ready() # returns ('A',)
g.add('A', 'Q')
```

First this code creates a graph g with a classic "diamond" dependency pattern. Then it creates a new view v, and gets the currently "ready" nodes, which consists just of the node 'A'. Finally it adds a new dependency: 'A' depends on 'Q'.

At this moment, view v is no longer coherent. 'A' has been marked as "ready", but 'Q' has not. And yet 'A' depends on 'Q'. All those statements can't be true at the same time! So view v is no longer coherent, and any attempt to interact with v raises an exception.

To state it more precisely: if view v is a view on graph g, and you call g.add('Z', 'Y'), and neither of these statements is true in view v:
- 'Y' has been marked as done.
- 'Z' has not yet been yielded by get_ready.

then v is no longer "coherent".

(If 'Y' has been marked as done, then it's okay to make 'Z' dependent on 'Y' regardless of what state 'Z' is in. Likewise, if 'Z' hasn't been yielded by get_ready yet, then it's okay to make 'Z' dependent on 'Y' regardless of what state 'Y' is in.)

Note that you can restore a view to coherence. In this case, removing either Y or Z from g would resolve the incoherence between v and g, and v would start working again.

Also note that you can have multiple views, in various states of iteration, and by modifying the graph you may cause some to become incoherent but not others. Views are completely independent from each other.
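The coherence rule can be restated as a tiny predicate. This is only an illustrative sketch of the rule as worded above, not big's actual implementation; the function name add_keeps_view_coherent and the yielded and done sets are hypothetical stand-ins for whatever bookkeeping a view does internally.

```python
def add_keeps_view_coherent(yielded, done, node, dependency):
    """
    Return True if a view would stay coherent after g.add(node, dependency).

    yielded: nodes this view has already returned from get_ready
    done:    nodes this view has been told are done
    (Both sets are hypothetical stand-ins for the view's state.)
    """
    # The new edge is harmless if the dependency is already done,
    # or if the dependent node hasn't been handed out yet.
    return (dependency in done) or (node not in yielded)

# The situation from the example: 'A' was yielded, nothing is done,
# and then 'A' is made to depend on 'Q'.
print(add_keeps_view_coherent({'A'}, set(), 'A', 'Q'))
```

For the example above this prints False, matching the claim that the view is no longer coherent once 'A' depends on 'Q'.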
-
One minor complaint I have about Python regards inner classes. An "inner class" is a class defined inside another class. And, well, inner classes seem kind of half-baked. Unlike functions, inner classes don't get bound to the object.
Consider this Python code:
```python
class Outer(object):
    def method(self):
        pass

    class Inner(object):
        def __init__(self):
            pass

o = Outer()
o.method()
i = o.Inner()
```

When o.method is called, Python automatically passes in the o object as the first parameter (generally called self). In object-oriented lingo, o is bound to method, and indeed Python calls this object a bound method:

```
>>> o.method
<bound method Outer.method of <__main__.Outer object at 0x########>>
```

But that doesn't happen when o.Inner is called. (It does pass in a self, but in this case it's the newly-created Inner object.) There's just no built-in way for the o.Inner object being constructed to automatically get a reference to o. If you need one, you must explicitly pass one in, like so:

```python
class Outer(object):
    def method(self):
        pass

    class Inner(object):
        def __init__(self, outer):
            self.outer = outer

o = Outer()
o.method()
i = o.Inner(o)
```

This seems redundant. You don't have to pass in o explicitly to method calls, why should you have to pass it in explicitly to inner classes?

Well--now you don't have to! You just decorate the inner class with @big.BoundInnerClass, and BoundInnerClass takes care of the rest!

Let's modify the above example to use our BoundInnerClass decorator:

```python
from big import BoundInnerClass

class Outer(object):
    def method(self):
        pass

    @BoundInnerClass
    class Inner(object):
        def __init__(self, outer):
            self.outer = outer

o = Outer()
o.method()
i = o.Inner()
```

Notice that Inner.__init__ now requires an outer parameter, even though you didn't pass in any arguments to o.Inner. When it's called, o is magically passed in to outer! Thanks, BoundInnerClass! You've saved the day!

Decorating an inner class like this always adds a second positional parameter, after self. And, like self, you don't have to use the name outer, you can use any name you like. (Although it's probably a good idea, for consistency's sake.)

Bound inner classes get slightly complicated when mixed with inheritance. It's not all that difficult, you merely need to obey the following rules:
- A bound inner class can inherit normally from any unbound class.
- To subclass from a bound inner class while still inside the outer class scope, or when referencing the inner class from the outer class (as opposed to an instance of the outer class), you must actually subclass or reference classname.cls. This is because inside the outer class, the "class" you see is actually an instance of a BoundInnerClass object.
- All classes that inherit from a bound inner class must always call the superclass's __init__. You don't need to pass in the outer parameter; it'll be automatically passed in to the superclass's __init__ as before.
- An inner class that inherits from a bound inner class, and which also wants to be bound to the outer object, should be decorated with BoundInnerClass.
- An inner class that inherits from a bound inner class, but doesn't want to be bound to the outer object, should be decorated with UnboundInnerClass.

Restating the last two rules: every class that descends from any BoundInnerClass should be decorated with either BoundInnerClass or UnboundInnerClass. Which one you use depends on what behavior you want--whether or not you want your inner subclass to automatically get the outer instance passed in to its __init__.

Here's a simple example using inheritance with bound inner classes:

```python
from big import BoundInnerClass, UnboundInnerClass

class Outer(object):

    @BoundInnerClass
    class Inner(object):
        def __init__(self, outer):
            self.outer = outer

    @UnboundInnerClass
    class ChildOfInner(Inner.cls):
        def __init__(self):
            super().__init__()

o = Outer()
i = o.ChildOfInner()
```
We followed the rules:
- Inner inherits from object; since object isn't a bound inner class, there are no special rules about inheritance Inner needs to obey.
- ChildOfInner inherits from Inner.cls, not Inner.
- Since ChildOfInner inherits from a BoundInnerClass, it must be decorated with either BoundInnerClass or UnboundInnerClass. It doesn't want the outer object passed in, so it's decorated with UnboundInnerClass.
- ChildOfInner.__init__ calls super().__init__.

Note that, because ChildOfInner is decorated with UnboundInnerClass, it doesn't take an outer parameter. Nor does it pass in an outer argument when it calls super().__init__. But when the constructor for Inner is called, the correct outer parameter is passed in--like magic! Thanks again, BoundInnerClass!

If you wanted ChildOfInner to also get the outer argument passed in to its __init__, just decorate it with BoundInnerClass instead of UnboundInnerClass, like so:

```python
from big import BoundInnerClass

class Outer(object):

    @BoundInnerClass
    class Inner(object):
        def __init__(self, outer):
            self.outer = outer

    @BoundInnerClass
    class ChildOfInner(Inner.cls):
        def __init__(self, outer):
            super().__init__()
            assert self.outer == outer

o = Outer()
i = o.ChildOfInner()
```

Again, ChildOfInner.__init__ doesn't need to explicitly pass in outer when calling super().__init__.

You can see more complex examples of using inheritance with BoundInnerClass (and UnboundInnerClass) in the big test suite.

- If you refer to a bound inner class directly from the outer class, rather than using the outer instance, you get the original class. This ensures that references to Outer.Inner are consistent; this class is also a base class of all the bound inner classes. Additionally, if you attempt to construct an instance of an unbound Outer.Inner class without referencing it via an instance, you must pass in the outer parameter by hand--just like you'd have to pass in the self parameter by hand when calling a method on the class itself rather than on an instance of the class.
- If you refer to a bound inner class from an outer instance, you get a subclass of the original class.
- Bound classes are cached in the outer object, which both provides a small speedup and ensures that isinstance relationships are consistent.
- You must not rename inner classes decorated with either BoundInnerClass or UnboundInnerClass! The implementation of BoundInnerClass looks up the bound inner class in the outer object by name in several places. Adding aliases to bound inner classes is harmless, but the original attribute name must always work.
- Bound inner classes from different objects are different classes. This is symmetric with bound methods; if you have two objects a and b that are instances of the same class, a.BoundInnerClass != b.BoundInnerClass, just as a.method != b.method.
- The binding only goes one level deep; if you had an inner class C inside another inner class B inside a class A, the constructor for C would be called with the B object, not the A object.
- Similarly, if you have a bound inner class B inside a class A, and another bound inner class D inside a class C, and D inherits from B, the constructor for D will be called with the C object but not the A object. When D calls super().__init__ it'll have to fill in the outer parameter by hand.
- There's a race condition in the implementation: if you access a bound inner class through an outer instance from two separate threads, and the bound inner class was not previously cached, the two threads may get different (but equivalent) bound inner class objects, and only one of those instances will get cached on the outer object. This could lead to confusion and possibly cause bugs. For example, you could have two objects that would be considered equal if they were instances of the same bound inner class, but would not be considered equal if instantiated by different instances of that same bound inner class. There's an easy workaround for this problem: access the bound inner class from the __init__ of the outer class, which should allow the code to cache the bound inner class instance before a second thread could ever get a reference to the outer object.
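To make the notes above more concrete, here's a heavily simplified sketch of how a descriptor could bind an inner class to its outer instance. This is not big's implementation: the name BoundInnerSketch is hypothetical, and the sketch ignores inheritance between bound classes, UnboundInnerClass, and the threading issue. It only illustrates the behaviors listed in the notes: access through the class returns the original class, access through an instance returns a cached subclass, and different instances get different bound classes.

```python
class BoundInnerSketch:
    """Hypothetical, simplified stand-in for a bound-inner-class decorator."""

    def __init__(self, cls):
        self.cls = cls

    def __set_name__(self, owner, name):
        # Remember the attribute name so the bound class can be cached under it.
        self.name = name

    def __get__(self, outer, outer_type=None):
        if outer is None:
            # Accessed via the outer class: return the original class.
            return self.cls

        inner = self.cls

        class Bound(inner):
            def __init__(self, *args, **kwargs):
                # Inject the outer instance right after self.
                super().__init__(outer, *args, **kwargs)

        Bound.__name__ = inner.__name__
        Bound.__qualname__ = inner.__qualname__

        # Cache on the instance. This descriptor defines no __set__,
        # so the instance attribute is found first on later lookups.
        outer.__dict__[self.name] = Bound
        return Bound


class Outer:
    @BoundInnerSketch
    class Inner:
        def __init__(self, outer):
            self.outer = outer

a = Outer()
b = Outer()
i = a.Inner()            # no explicit outer argument
j = Outer.Inner(b)       # via the class, outer is passed by hand
```

Because the bound class is stored under the inner class's attribute name, this sketch also hints at why renaming a decorated inner class would be a problem.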
-
-
released 2023-09-19
- Breaking change: renamed almost all the old whitespace and newlines tuples. Worse yet, one symbol has the same name but a different value: ascii_whitespace! I've also changed the suffix _without_dos to the more accurate and intuitive _without_crlf, and similarly changed newlines to linebreaks. Sorry for all the confusion. This resulted from a lot of research into whitespace and newline characters, in Python, Unicode, and ASCII; please see the new deep-dive Whitespace and line-breaking characters in Python and big to see what all the fuss is about. Here's a summary of all the changes to the whitespace tuples:

```
RENAMED TUPLES (old name -> new name)
    ascii_newlines               -> bytes_linebreaks
    ascii_whitespace             -> bytes_whitespace
    newlines                     -> linebreaks
    ascii_newlines_without_dos   -> bytes_linebreaks_without_crlf
    ascii_whitespace_without_dos -> bytes_whitespace_without_crlf
    newlines_without_dos         -> linebreaks_without_crlf
    whitespace_without_dos       -> whitespace_without_crlf

REMOVED TUPLES
    utf8_newlines
    utf8_whitespace
    utf8_newlines_without_dos
    utf8_whitespace_without_dos

UNCHANGED TUPLES (same name, same meaning)
    whitespace

NEW TUPLES
    ascii_linebreaks
    ascii_whitespace
    str_linebreaks
    str_whitespace
    unicode_linebreaks
    unicode_whitespace
    ascii_linebreaks_without_crlf
    ascii_whitespace_without_crlf
    str_linebreaks_without_crlf
    str_whitespace_without_crlf
    unicode_linebreaks_without_crlf
    unicode_whitespace_without_crlf
```

- New function in the big.text module: encode_strings, which takes a container object containing str objects and returns an equivalent object containing encoded versions of those strings as bytes.
- Changed split_text_with_code implementation to use StateManager. (No API or semantic changes, just a change to the internal implementation.)
- When you call multisplit with a type mismatch between 's' and 'separators', the exception it raises now includes the values of 's' and 'separators'.
- Added more tests for big.state to exercise all the string arguments of accessor and dispatch.
- The exhaustive multisplit tester now lets you specify test cases as cohesive strings, rather than forcing you to split the string manually.
- The exhaustive multisplit tester is better at internally verifying that it's doing the right thing. (There are some internal sanity checks, and those are more accurate now.)
- Whoops! The name of the main class in big.state is StateManager. I accidentally wrote StateMachine instead in the docs... several times.
- Originally the multisplit parameter 'separators' was required. I changed it to optional a while ago, with a default of None. (If you pass in None it uses big.str_whitespace or big.bytes_whitespace, depending on the type of s.) But the documentation didn't reflect this change until... now.
- Improved the prose in The multi- family of string functions deep-dive. Hopefully now it does a better job of selling multisplit to the reader.
- The usual smattering of small doc fixes and improvements.

My thanks again to Eric V. Smith for his willingness to consider and discuss these issues. Eric is now officially a contributor to big, increasing the project's bus factor to two. Thanks, Eric!
-
released 2023-09-04
- Added the new big.state module, with its exciting StateManager class!
- int_to_words now supports the new ordinal keyword-only parameter, to produce ordinal strings instead of cardinal strings. (The number 1 as a cardinal string is 'one', but as an ordinal string is 'first'.)
- Added the pure_virtual decorator to big.builtin.
- The documentation is now much prettier! I finally discovered a syntax I can use to achieve a proper indent in Markdown, supported by both GitHub and PyPI. You simply nest the text you want indented inside an HTML description list as the description text, and skip the description item (<dl><dd>). Note that you need a blank line after the <dl><dd> line, or else Markdown will ignore the markup in the following paragraph. Thanks to Hugo van Kemenade for his help confirming this! Oh, and, Hugo also fixed the image markup so the big banner displays properly on PyPI. Thanks, Hugo!
-
released 2023-07-22

Extremely minor release. No new features or bug fixes.
- Fixed coverage, now back to the usual 100%. (This just required changing the tests, which didn't find any new bugs.)
- Made the tests for Log deterministic. They now use a fake clock that always returns the same values.
- Added GitHub Actions integration. Tests and coverage are run in the cloud after every checkin. Thanks to Dan Pope for gently walking me through this!
- Fixed metadata in the pyproject.toml file.
- Added badges for testing, coverage, and supported Python versions.
-
released 2023-06-28
-
released 2023-06-15
- Bugfix! If an outer class Outer had an inner class Inner decorated with @BoundInnerClass, and o is an instance of Outer, and o evaluated to false in a boolean context, o.Inner would be the unbound version of Inner. Now it's the bound version, as is proper.
- Modified tests/test_boundinnerclasses.py:
    - Added regression test for the above bugfix (of course!).
    - It now takes advantage of that newfangled "zero-argument super".
    - Added testing of an unbound subclass of an unbound subclass.
-
-
released 2023-06-11
- Added int_to_words.
- All tests now insert the local big directory onto sys.path, so you can run the tests on your local copy without having to install. Especially convenient for testing with old versions of Python!

Note: tomorrow, big will be one year old!
-
released 2023-05-19
- Convert all iterator functions to use my new approach: instead of checking arguments inside the iterator, the function you call checks arguments, then has a nested iterator function which it runs and returns the result. This means bad inputs raise their exceptions at the call site where the iterator is constructed, rather than when the first value is yielded by the iterator!
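Here's a minimal sketch of the difference between the two approaches. The function names lazy_countdown and eager_countdown are invented for this illustration and aren't part of big; the point is only to show where the exception surfaces.

```python
def lazy_countdown(start):
    # Old approach: this is a generator function, so the check below
    # doesn't run until the first value is requested.
    if not isinstance(start, int):
        raise TypeError("start must be an int")
    while start > 0:
        yield start
        start -= 1

def eager_countdown(start):
    # New approach: an ordinary function checks its arguments
    # immediately, then returns the result of a nested generator.
    if not isinstance(start, int):
        raise TypeError("start must be an int")

    def iterator():
        n = start
        while n > 0:
            yield n
            n -= 1
    return iterator()

it = lazy_countdown("3")        # no exception yet
try:
    next(it)
    lazy_raised_on_next = False
except TypeError:
    lazy_raised_on_next = True  # only fails once we ask for a value

try:
    eager_countdown("3")
    eager_raised_on_call = False
except TypeError:
    eager_raised_on_call = True # fails right at the call site
```

With the eager version, the traceback points at the line that passed in the bad argument, which is usually where the bug actually is.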
-
released 2023-05-19
- Added parse_delimiters and Delimiter.
-
released 2023-05-18
- Major retooling of str and bytes support in big.text.
    - Functions in big.text now uniformly accept str or bytes or a subclass of either. See the Support for bytes and str section for how it works.
    - Functions in big.text are now more consistent about raising TypeError vs ValueError. If you mix bytes and str objects together in one call, you'll get a TypeError, but if you pass in an empty iterable (of a correct type) where a non-empty iterable is required you'll get a ValueError. big.text generally tries to give the TypeError higher priority; if you pass in a value that fails both the type check and the value check, the big.text function will raise TypeError first.
- Major rewrite of re_rpartition. I realized it had the same "reverse mode" problem that I fixed in multisplit back in version 0.6.10: the regular expression should really search the string in "reverse mode", from right to left. The difference is whether the regular expression potentially matches against overlapping strings. When in forwards mode, the regular expression should prefer the leftmost overlapping match, but in reverse mode it should prefer the rightmost overlapping match. Most of the time this produces the same list of matches as you'd find searching the string forwards--but sometimes the matches come out very different. This was way harder to fix with re_rpartition than with multisplit, because Python's re module only supports searching forwards. I have to emulate reverse-mode searching by manually checking for overlapping matches and figuring out which one(s) to keep--a lot of work! Fortunately it's only a minor speed hit if you don't have overlapping matches. (And if you do have overlapping matches, you're probably just happy re_rpartition now produces correct results--though I did my best to make it performant anyway.) In the future, big will probably add support for the PyPI package regex, which reimplements Python's re module but adds many features... including reverse mode!
- New function: reversed_re_finditer. Behaves almost identically to the Python standard library function re.finditer, yielding non-overlapping matches of pattern in string. The difference is, reversed_re_finditer searches string from right to left. (Written as part of the re_rpartition rewrite mentioned above.)
- Added apostrophes, double_quotes, ascii_apostrophes, ascii_double_quotes, utf8_apostrophes, and utf8_double_quotes to the big.text module. Previously the first four of these were hard-coded strings inside gently_title. (And the last two didn't exist!)
- Code cleanup in split_text_with_code, removed redundant code. I think it has about the same number of if statements; if anything it might be slightly faster.
- Retooled re_partition and re_rpartition slightly, should now be very-slightly faster. (Well, re_rpartition will be slower if your pattern finds overlapping matches. But at least now it's correct!)
- Lots and lots of doc improvements, as usual.
-
released 2023-03-13
- Tweaked the implementation of multisplit. Internally, it does the string splitting using re.split, which returns a list. It used to iterate over the list and yield each element. But that meant keeping the entire list around in memory until multisplit exited. Now, multisplit reverses the list, pops off the final element, and yields that. This means multisplit drops all references to the split strings as it iterates over the string, which may help in low-memory situations.
- Minor doc fixes.
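The reverse-and-pop idea described above can be sketched in a few lines. This is a minimal illustration of the technique, not multisplit's actual code; the function name split_and_release is hypothetical.

```python
import re

def split_and_release(pattern, s):
    """Yield the pieces of re.split(pattern, s), dropping each reference as we go."""
    pieces = re.split(pattern, s)
    # Reverse once up front so that popping from the end (which is cheap)
    # returns the pieces in their original order.
    pieces.reverse()
    while pieces:
        # After pop(), the list no longer holds a reference to this piece.
        yield pieces.pop()

result = list(split_and_release(r',', 'a,b,c'))
```

Popping from the end of a list is a constant-time operation, so this releases each string as it's yielded without the cost of repeatedly removing the first element.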
-
released 2023-03-11
- Breaking changes to the Scheduler:
    - It's no longer thread-safe by default, which means it's much faster for non-threaded workloads.
    - The lock has been moved out of the Scheduler object and into the Regulator. Among other things, this means that the Scheduler constructor no longer takes a lock argument.
    - Regulator is now an abstract base class. big.scheduler also provides two concrete implementations: SingleThreadedRegulator and ThreadSafeRegulator.
    - Regulator and Event are now defined in the big.scheduler namespace. They were previously defined inside the Scheduler class.
    - The arguments to the Event constructor were rearranged. (You shouldn't care, as you shouldn't be manually constructing Event objects anyway.)
    - The Scheduler now guarantees that it will only call now and wake on a Regulator object while holding that Regulator's lock.
- Minor doc fixes.
-
released 2023-03-09
- Retooled multisplit and multistrip argument verification code. Both functions now consistently check all their inputs, and use consistent error messages when raising an exception.
-
released 2023-03-09
- Fixed a minor crashing bug in multisplit: if you passed in a list of separators (or separators was of any non-hashable type), and reverse was true, multisplit would crash. It used separators as a key into a dict, which meant separators had to be hashable.
- multisplit now verifies that the s passed in is either str or bytes.
- Updated all copyright date notices to 2023.
- Lots of doc fixes.
-
released 2023-02-26
- Fixed Python 3.6 support! Some equals-signs-in-f-strings and some other anachronisms had crept in. 0.6.16 has been tested on all versions from 3.6 to 3.11 (as well as having 100% coverage).
- Made the python-dateutil package an optional dependency. Only one function needs it, parse_timestamp_3339Z().
- Minor cleanup in PushbackIterator(). It also uses slots now, which should make it a bit faster.
-
released 2023-01-07
- Added the new functions datetime_ensure_timezone(d, timezone) and datetime_set_timezone(d, timezone). These allow you to ensure or explicitly set a timezone on a datetime.datetime object.
- Added the timezone argument to parse_timestamp_3339Z().
- gently_title() now capitalizes the first letter after a left parenthesis.
- Changed the secret multirpartition function slightly. Its reverse parameter now means to un-reverse its reversing behavior. Stated another way, multipartition(reverse=X) and multirpartition(reverse=not X) now do the same thing.
-
released 2022-12-11
- Improved the text of the RuntimeError raised by TopologicalSorter.View when the view is incoherent. Now it tells you exactly what nodes are conflicting.
- Expanded the deep dive on multisplit.
-
released 2022-12-11
- Changed translate_filename_to_exfat(s) behavior: when modifying a string with a colon (':') not followed by a space, it used to convert it to a dash ('-'). Now it converts the colon to a period ('.'), which looks a little more natural. A colon followed by a space is still converted to a dash followed by a space.
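As a rough illustration of just the colon rule described above (and nothing else the real function does), here's a small sketch. The helper name translate_colons is hypothetical and this is not the code of translate_filename_to_exfat; the 'xXx' and 'Re:code' inputs are the examples used elsewhere in this changelog.

```python
def translate_colons(s):
    """Hypothetical helper illustrating only the colon rule."""
    # A colon followed by a space becomes ' - ' ...
    s = s.replace(': ', ' - ')
    # ... and any remaining colon becomes a period.
    s = s.replace(':', '.')
    return s

title = translate_colons('xXx: The Return Of Xander Cage')
word = translate_colons('Re:code')
```

Handling the colon-plus-space case first matters here: doing the replacements in the other order would turn every colon into a period before the first rule could apply.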
-
tagged 2022-12-04
- Bugfix: When calling TopologicalSorter.print(), it sorts the list of nodes, for consistency's sake and for ease of reading. But if the node objects don't support < or > comparison, that throws an exception. TopologicalSorter.print() now catches that exception and simply skips sorting. (It's only a presentation thing anyway.)
- Added a secret (otherwise undocumented!) function: multirpartition, which is like multipartition but with reverse=True.
- Added the list of conflicted nodes to the "node is incoherent" exception text.

Note: although version 0.6.12 was tagged, it was never packaged for release.
-
tagged 2022-11-13
- Changed the import strategy. The top-level big module used to import all its child modules, and import * all the symbols from all those modules. But a friend (hi Mark Shannon!) talked me out of this. It's convenient, but if a user doesn't care about a particular module, why make them import it? So now the top-level big module contains nothing but a version number, and you can either import just the submodules you need, or you can import big.all to get all the symbols (like big itself used to do).

Note: although version 0.6.11 was tagged, it was never packaged for release.
-
released 2022-10-26
- All code changes had to do with multisplit:
    - Fixed a subtle bug. When splitting with a separator that can overlap itself, like ' x ', multisplit will prefer the leftmost instance. But when reverse=True, it must prefer the rightmost instance. Thanks to Eric V. Smith for suggesting the clever "reverse everything, call re.split, and un-reverse everything" approach. That let me fix this bug while still implementing on top of re.split!
    - Implemented PROGRESSIVE mode for the strip keyword. This behaves like str.strip: when splitting, strip on the left, then start splitting. If we don't exhaust maxsplit, strip on the right; if we do exhaust maxsplit, don't strip on the right. (Similarly for str.rstrip when reverse=True.)
    - Changed the default for strip to False. It used to be NOT_SEPARATE. But this was too surprising--I'd forget that it was the default, and turning on keep wouldn't return everything I thought I should get, and I'd head off to debug multisplit, when in fact it was behaving as specified. The Principle Of Least Surprise tells me that strip defaulting to False is less surprising. Also, maintaining the invariant that all the keyword-only parameters to multisplit default to False is a helpful mnemonic device in several ways.
    - Removed NOT_SEPARATE (and the not-yet-implemented STR_STRIP) modes for strip. They're easy to implement yourself, and this removes some surface area from the already-too-big multisplit API.
- Modernized pyproject.toml metadata to make flit happier. This was necessary to ensure that pip install big also installs its dependencies.
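The "reverse everything, call re.split, and un-reverse everything" trick mentioned above can be sketched for a single literal separator. This is only an illustration of the idea, not multisplit's implementation, and the function names split_forwards and split_reversed are made up for the example.

```python
import re

def split_forwards(s, separator):
    # re.split scans left to right, so overlapping separators
    # resolve in favor of the leftmost match.
    return re.split(re.escape(separator), s)

def split_reversed(s, separator):
    # Reverse the string and the separator, split, then un-reverse
    # each piece and the order of the pieces. Overlapping separators
    # now resolve in favor of the rightmost match.
    pieces = re.split(re.escape(separator[::-1]), s[::-1])
    return [piece[::-1] for piece in reversed(pieces)]

# In 'a x x b' the separator ' x ' matches at two overlapping positions.
forwards = split_forwards('a x x b', ' x ')    # ['a', 'x b']
backwards = split_reversed('a x x b', ' x ')   # ['a x', 'b']
```

When the separator can't overlap itself, both functions return the same list; the difference only shows up in cases like this one.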
-
released 2022-10-16
- Renamed two of the three freshly-added lines modifier functions: lines_filter_contains is now lines_containing, and lines_filter_grep is now lines_grep.
-
released 2022-10-16
- Added three new lines modifier functions to the text module: lines_filter_contains, lines_filter_grep, and lines_sort.
- gently_title now accepts str or bytes. Also added the apostrophes and double_quotes arguments.
-
released 2022-10-14
- Fixed a bug in multisplit. I thought when using keep=AS_PAIRS that it shouldn't ever emit a 2-tuple containing just empty strings--but on further reflection I've realized that that's correct. This behavior is now tested and documented, along with the reasoning behind it.
- Added the reverse flag to re_partition.
- whitespace_without_dos and newlines_without_dos still had the DOS end-of-line sequence in them! Oops!
    - Added a unit test to check that. The unit test also ensures that whitespace, newlines, and all the variants (utf8_, ascii_, and _with_dos) exactly match the set of characters Python considers whitespace and newline characters.
- Lots more documentation and formatting fixes.
-
released 2022-10-13
- Added the new itertools module, which so far only contains PushbackIterator.
- Added lines_strip_comments and split_quoted_strings to the text module.
-
released 2022-10-13
- I realized that whitespace should contain the DOS end-of-line sequence ('\r\n'), as it should be considered a single separator when splitting etc. I added that, along with whitespace_no_dos, and naturally utf8_whitespace_no_dos and ascii_whitespace_no_dos too.
- Minor doc fixes.
-
released 2022-10-13

A big upgrade!
- Completely retooled and upgraded multisplit, and added multistrip and multipartition, collectively called The multi- family of string functions. (Thanks to Eric Smith for suggesting multipartition! Well, sort of.)
    - [multisplit](#multisplits-separatorsnone--keepfalse-maxsplit-1-reversefalse-separatefalse-stripfalse) now supports five (!) keyword-only parameters, allowing the caller to tune its behavior to an amazing degree.
    - Also, the original implementation of [multisplit](#multisplits-separatorsnone--keepfalse-maxsplit-1-reversefalse-separatefalse-stripfalse) got its semantics a bit wrong; it was inconsistent and maybe a little buggy.
    - multistrip is like str.strip but accepts an iterable of separator strings. It can strip from the left, right, both, or neither (in which case it does nothing).
    - multipartition is like str.partition, but accepts an iterable of separator strings. It can also partition more than once, and supports reverse=True which causes it to partition from the right (like str.rpartition).
    - Also added useful predefined lists of separators for use with all the multi functions: whitespace and newlines, with ascii_ and utf8_ versions of each, and without_dos variants of all three newlines variants.
- Added the Scheduler and Heap classes. Scheduler is a replacement for Python's sched.scheduler class, with a modernized interface and a major upgrade in functionality. Heap is an object-oriented interface to Python's heapq module, used by Scheduler. These are in their own modules, big.heap and big.scheduler.
- Added lines and all the lines_ modifiers. These are great for writing little text parsers. For more information, please see the deep-dive on lines and lines modifier functions.
- Removed stripped_lines and rstripped_lines from the text module, as they're superseded by the far superior lines family.
- Enhanced normalize_whitespace. Added the separators and replacement parameters, and added support for bytes objects.
- Added the count parameter to re_partition and re_rpartition.
-
released 2022-09-12
- Added stripped_lines and rstripped_lines to the text module.
- Added support for len to the TopologicalSorter object.
-
released 2022-09-04
- Added gently_title and normalize_whitespace to the text module.
- Changed translate_filename_to_exfat to handle translating ':' in a special way. If the colon is followed by a space, then the colon is turned into ' -'. This yields a more natural translation when colons are used in text, e.g. 'xXx: The Return Of Xander Cage' is translated to 'xXx - The Return Of Xander Cage'. If the colon is not followed by a space, turns the colon into '-'. This is good for tiresome modern gobbledygook like 'Re:code', which will now be translated to 'Re-code'.
-
released 2022-06-12
- Initial release.
-