fix the msg parser and update the Travis CI build
update dependencies and make pocketsphinx optional
documentation build fixes
psv/tsv parsers, user-provided filename extensions, audio parsing with pocketsphinx, and several other bug fixes
Python 3 compatibility, improved docx extraction, improved image extraction, and more.
pdf layout preservation, extensionless file support, and several fixes
Added .rtf and .msg support
Includes support for tiff files and a new --option/-O command-line option for passing arbitrary keyword arguments to parsers, such as the language for tesseract OCR
support for a variety of formats, including audio (.wav, .mp3, .ogg), csv, scanned pdfs, and htm, plus various bug fixes and internal improvements.
The major version bump comes from standardizing the byte-string output of textract. This release also adds support for spreadsheets (.xls, .xlsx) and e-publications (.epub).