I use the following:

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

to get rid of the HTML tags found in a text. However, for one of my files, when I do:

fdir = open('0001005214-12-000007.txt')
text = fdir.read()
strip_tags(text)

I get the following error:

 Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "G:/Dropbox/Textual/codes/Python/Parsing/Word_Count.py", line 26, in strip_tags
 s.feed(html)
 File "C:\Users\Martineau\Anaconda\lib\HTMLParser.py", line 117, in feed
 self.goahead(0)
 File "C:\Users\Martineau\Anaconda\lib\HTMLParser.py", line 169, in goahead
 k = self.parse_html_declaration(i)
 File "C:\Users\Martineau\Anaconda\lib\HTMLParser.py", line 245, in parse_html_declaration
 return self.parse_marked_section(i)
 File "C:\Users\Martineau\Anaconda\lib\markupbase.py", line 160, in parse_marked_section
 self.error('unknown status keyword %r in marked section' % rawdata[i+3:j])
 File "C:\Users\Martineau\Anaconda\lib\HTMLParser.py", line 124, in error
 raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: unknown status keyword 't\n' in marked section, at line 35210, column 58

What does this error mean? How can I bypass this error?

The actual file that I want to parse is this one

asked Nov 17, 2014 at 0:57
  • I'd assume it hit some invalid markup. You could either catch the error with try/except or run the file through BeautifulSoup beforehand. Commented Nov 17, 2014 at 1:42

1 Answer

The problem is very simple, but messy. You are not parsing HTML. You are parsing HTML wrapped in what appears to be the SEC's homegrown SGML vocabulary. Confused? Not surprised. Here's what visiting your data link, saving the file, and opening it up looks like:

 <SEC-DOCUMENT>0001005214-12-000007.txt : 20120430
 <SEC-HEADER>0001005214-12-000007.hdr.sgml : 20120430
 <ACCEPTANCE-DATETIME>20120430163103
 ACCESSION NUMBER: 0001005214-12-000007
 CONFORMED SUBMISSION TYPE: 10-K
 PUBLIC DOCUMENT COUNT: 12
 CONFORMED PERIOD OF REPORT: 20120131
 FILED AS OF DATE: 20120430
 DATE AS OF CHANGE: 20120430
 FILER:
 COMPANY DATA: 
 COMPANY CONFORMED NAME: AMERICAN WAGERING INC
 CENTRAL INDEX KEY: 0001005214
 STANDARD INDUSTRIAL CLASSIFICATION: SERVICES-MISCELLANEOUS AMUSEMENT & RECREATION [7990]
 IRS NUMBER: 880344658
 STATE OF INCORPORATION: NV
 FISCAL YEAR END: 0105
 FILING VALUES:
 FORM TYPE: 10-K
 SEC ACT: 1934 Act
 SEC FILE NUMBER: 000-20685
 FILM NUMBER: 12795496
 BUSINESS ADDRESS: 
 STREET 1: 675 GRIER DR
 CITY: LAS VEGAS
 STATE: NV
 ZIP: 89119
 BUSINESS PHONE: 7027350101
 MAIL ADDRESS: 
 STREET 1: 675 GRIER DR
 CITY: LAS VEGAS
 STATE: NV
 ZIP: 89119
 </SEC-HEADER>
 <DOCUMENT>
 <TYPE>10-K
 <SEQUENCE>1
 <FILENAME>formtenk-01312012.htm
 <DESCRIPTION>FORM 10 K 1.31.2012
 <TEXT>
 <html>
 <head>
 <title>formtenk-01312012.htm</title>
 <!--Licensed to: American Wagering, Inc.-->
 <!--Document Created using EDGARizer 2020 5.4.1.0-->
 <!--Copyright 1995 - 2009 Thomson Reuters. All rights reserved.-->
 </head>
 <body bgcolor="#ffffff" style="DISPLAY: inline; FONT-FAMILY: Palatino Linotype; FONT-SIZE: 9pt">
 <div>

Then skipping oodles of HTML lines, we pick it back up at:

 </div>
 </body>
</html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>ZIP
<SEQUENCE>33
<FILENAME>0001005214-12-000007-xbrl.zip
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
begin 644 0001005214-12-000007-xbrl.zip
M4$L#!!0````(`/"#GD":H45DWI(``/X8"``1`!P`8F5T;2TR,#$R,#$S,2YX
M;6Q55`D``Z/VGD^C]IY/=7@+``$$)0X```0Y`0``[#UI;QLYEM7\V/_`T223
M!)!DE20?<HZ!XZ1[W)T+<;I[@<5B0%51$MMU+<FRK/WU^]XCZY!<\I&V$RDN
MH`]9Q>/=%TM\+_YY87ドルL7"@MD_AER^OV6DS$?A+(>/JRE>D.U[Z4K7^^^L__
M>/&W3N=G$0O%C0C8>,&^S))()S'[+#(#"[`CWQ<A3.G@X(NQ"AFL'>M#_"A?

So now we're out of HTML and into a uuencoded ZIP archive that holds the XBRL files. Skipping a gazillion of those lines, we reach the end of the file:

 MN?<,9P8'``"4-```$0`8```````!````I($][P``8F5T;2TR,#$R,#$S,2YX
 M<V155`4``Z/VGD]U>`L``00E#@``!#D!``!02P4&``````8`!@`:`@``CO8`
 #````
 `
 end
 </TEXT>
 </DOCUMENT>
 <DOCUMENT>
 <TYPE>XML
 <SEQUENCE>34
 <FILENAME>FilingSummary.xml
 <DESCRIPTION>IDEA: XBRL DOCUMENT
 <TEXT>
 <XBRL>
 <?xml version="1.0" encoding="utf-8"?>
 <FilingSummary xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <Version>2.4.0.6</Version>
 <ProcessingTime />
 <ReportFormat>Html</ReportFormat>
 <ContextCount>27</ContextCount>
 <ElementCount>111</ElementCount>
 <EntityCount>1</EntityCount>
 <FootnotesReported>false</FootnotesReported>
 <SegmentCount>5</SegmentCount>
 <ScenarioCount>0</ScenarioCount>
 <TuplesReported>false</TuplesReported>
 <UnitCount>4</UnitCount>
 <MyReports>
 <Report>
 <IsDefault>false</IsDefault>
 <HasEmbeddedReports>false</HasEmbeddedReports>
 <HtmlFileName>R1.htm</HtmlFileName>
 <LongName>000100 - Document - Document and Entity Information</LongName>
 <ReportType>Sheet</ReportType>
 <Role>http://americanwagering.com/role/DocumentAndEntityInformation</Role>
 <ShortName>Document and Entity Information</ShortName>
 </Report>
 <Report>
 <IsDefault>true</IsDefault>
 <HasEmbeddedReports>false</HasEmbeddedReports>
 <HtmlFileName>R2.htm</HtmlFileName>
 <LongName>010000 - Statement - CONSOLIDATED BALANCE SHEETS</LongName>
 <ReportType>Sheet</ReportType>
 <Role>http://americanwagering.com/role/ConsolidatedBalanceSheets</Role>
 <ShortName>CONSOLIDATED BALANCE SHEETS</ShortName>
 </Report>
 <Report>
 <IsDefault>false</IsDefault>
 <HasEmbeddedReports>false</HasEmbeddedReports>
 <HtmlFileName>R3.htm</HtmlFileName>
 <LongName>010100 - Statement - CONSOLIDATED BALANCE SHEETS (Parenthetical)</LongName>
 <ReportType>Sheet</ReportType>
 <Role>http://americanwagering.com/role/ConsolidatedBalanceSheetsParenthetical</Role>
 <ShortName>CONSOLIDATED BALANCE SHEETS (Parenthetical)</ShortName>
 </Report>
 <Report>
 <IsDefault>false</IsDefault>
 <HasEmbeddedReports>false</HasEmbeddedReports>
 <HtmlFileName>R4.htm</HtmlFileName>
 <LongName>020000 - Statement - CONSOLIDATED STATEMENTS OF OPERATIONS</LongName>
 <ReportType>Sheet</ReportType>
 <Role>http://americanwagering.com/role/ConsolidatedStatementsOfOperations</Role>
 <ShortName>CONSOLIDATED STATEMENTS OF OPERATIONS</ShortName>
 </Report>
 <Report>
 <IsDefault>false</IsDefault>
 <HasEmbeddedReports>false</HasEmbeddedReports>
 <HtmlFileName>R5.htm</HtmlFileName>
 <LongName>030000 - Statement - CONSOLIDATED STATEMENTS OF STOCKHOLDERS' EQUITY (DEFICIENCY)</LongName>
 <ReportType>Sheet</ReportType>
 <Role>http://americanwagering.com/role/ConsolidatedStatementsOfStockholdersEquityDeficiency</Role>
 <ShortName>CONSOLIDATED STATEMENTS OF STOCKHOLDERS' EQUITY (DEFICIENCY)</ShortName>
 </Report>
 <Report>
 <IsDefault>false</IsDefault>
 <HasEmbeddedReports>false</HasEmbeddedReports>
 <HtmlFileName>R6.htm</HtmlFileName>
 <LongName>040000 - Statement - CONSOLIDATED STATEMENTS OF CASH FLOWS</LongName>
 <ReportType>Sheet</ReportType>
 <Role>http://americanwagering.com/role/ConsolidatedStatementsOfCashFlows</Role>
 <ShortName>CONSOLIDATED STATEMENTS OF CASH FLOWS</ShortName>
 </Report>
 <Report>
 <IsDefault>false</IsDefault>
 <HasEmbeddedReports>false</HasEmbeddedReports>
 <HtmlFileName>R7.htm</HtmlFileName>
 <LongName>060100 - Disclosure - Organization, Risks and Uncertainties, and Summary of Significant Accounting Policies</LongName>
 <ReportType>Sheet</ReportType>
 <Role>http://americanwagering.com/role/OrganizationRisksAndUncertaintiesAndSummaryOfSignificantAccountingPolicies</Role>
 <ShortName>Organization, Risks and Uncertainties, and Summary of Significant Accounting Policies</ShortName>
 </Report>
 <Report>
 <IsDefault>false</IsDefault>
 <HasEmbeddedReports>false</HasEmbeddedReports>
 <HtmlFileName>R8.htm</HtmlFileName>
 <LongName>060200 - Disclosure - Property and Equipment</LongName>
 <ReportType>Sheet</ReportType>
 <Role>http://americanwagering.com/role/PropertyAndEquipment</Role>
 <ShortName>Property and Equipment</ShortName>
 </Report>
 <Report>
 <IsDefault>false</IsDefault>
 <HasEmbeddedReports>false</HasEmbeddedReports>
 <HtmlFileName>R9.htm</HtmlFileName>
 <LongName>060300 - Disclosure - Debt</LongName>
 <ReportType>Sheet</ReportType>
 <Role>http://americanwagering.com/role/Debt</Role>
 <ShortName>Debt</ShortName>
 </Report>
 <Report>
 <IsDefault>false</IsDefault>
 <HasEmbeddedReports>false</HasEmbeddedReports>
 <HtmlFileName>R10.htm</HtmlFileName>
 <LongName>060400 - Disclosure - Series A Preferred Stock</LongName>
 <ReportType>Sheet</ReportType>
 <Role>http://americanwagering.com/role/SeriesPreferredStock</Role>
 <ShortName>Series A Preferred Stock</ShortName>
 </Report>
 <Report>
 <IsDefault>false</IsDefault>
 <HasEmbeddedReports>false</HasEmbeddedReports>
 <HtmlFileName>R11.htm</HtmlFileName>
 <LongName>060500 - Disclosure - Stock Options and Other Equity and Related Party Transactions</LongName>
 <ReportType>Sheet</ReportType>
 <Role>http://americanwagering.com/role/StockOptionsAndOtherEquityAndRelatedPartyTransactions</Role>
 <ShortName>Stock Options and Other Equity and Related Party Transactions</ShortName>
 </Report>
 <Report>
 <IsDefault>false</IsDefault>
 <HasEmbeddedReports>false</HasEmbeddedReports>
 <HtmlFileName>R12.htm</HtmlFileName>
 <LongName>060600 - Disclosure - Commitments and Contingencies</LongName>
 <ReportType>Sheet</ReportType>
 <Role>http://americanwagering.com/role/CommitmentsAndContingencies</Role>
 <ShortName>Commitments and Contingencies</ShortName>
 </Report>
 <Report>
 <IsDefault>false</IsDefault>
 <HasEmbeddedReports>false</HasEmbeddedReports>
 <HtmlFileName>R13.htm</HtmlFileName>
 <LongName>060700 - Disclosure - Related Party Transactions</LongName>
 <ReportType>Sheet</ReportType>
 <Role>http://americanwagering.com/role/RelatedPartyTransactions</Role>
 <ShortName>Related Party Transactions</ShortName>
 </Report>
 <Report>
 <IsDefault>false</IsDefault>
 <HasEmbeddedReports>false</HasEmbeddedReports>
 <HtmlFileName>R14.htm</HtmlFileName>
 <LongName>060800 - Disclosure - Income Taxes</LongName>
 <ReportType>Sheet</ReportType>
 <Role>http://americanwagering.com/role/IncomeTaxes</Role>
 <ShortName>Income Taxes</ShortName>
 </Report>
 <Report>
 <IsDefault>false</IsDefault>
 <HasEmbeddedReports>false</HasEmbeddedReports>
 <HtmlFileName>R15.htm</HtmlFileName>
 <LongName>060900 - Disclosure - Business Segments</LongName>
 <ReportType>Sheet</ReportType>
 <Role>http://americanwagering.com/role/BusinessSegments</Role>
 <ShortName>Business Segments</ShortName>
 </Report>
 <Report>
 <IsDefault>false</IsDefault>
 <HasEmbeddedReports>false</HasEmbeddedReports>
 <HtmlFileName>R16.htm</HtmlFileName>
 <LongName>061000 - Disclosure - Additional Supplementary Cash Flow Information</LongName>
 <ReportType>Sheet</ReportType>
 <Role>http://americanwagering.com/role/AdditionalSupplementaryCashFlowInformation</Role>
 <ShortName>Additional Supplementary Cash Flow Information</ShortName>
 </Report>
 <Report>
 <IsDefault>false</IsDefault>
 <HasEmbeddedReports>false</HasEmbeddedReports>
 <HtmlFileName>R17.htm</HtmlFileName>
 <LongName>061100 - Disclosure - Financial Instruments</LongName>
 <ReportType>Sheet</ReportType>
 <Role>http://americanwagering.com/role/FinancialInstruments</Role>
 <ShortName>Financial Instruments</ShortName>
 </Report>
 <Report>
 <IsDefault>false</IsDefault>
 <HasEmbeddedReports>false</HasEmbeddedReports>
 <LongName>All Reports</LongName>
 <ReportType>Book</ReportType>
 <ShortName>All Reports</ShortName>
 </Report>
 </MyReports>
 <Logs>
 <Log type="Info">Process Flow-Through: 010000 - Statement - CONSOLIDATED BALANCE SHEETS</Log>
 <Log type="Info"> Process Flow-Through: Removing column 'Jan. 31, 2010'</Log>
 <Log type="Info">Process Flow-Through: 010100 - Statement - CONSOLIDATED BALANCE SHEETS (Parenthetical)</Log>
 <Log type="Info">Process Flow-Through: 020000 - Statement - CONSOLIDATED STATEMENTS OF OPERATIONS</Log>
 <Log type="Info">Process Flow-Through: 040000 - Statement - CONSOLIDATED STATEMENTS OF CASH FLOWS</Log>
 </Logs>
 <InputFiles>
 <File>betm-20120131.xml</File>
 <File>betm-20120131.xsd</File>
 <File>betm-20120131_cal.xml</File>
 <File>betm-20120131_def.xml</File>
 <File>betm-20120131_lab.xml</File>
 <File>betm-20120131_pre.xml</File>
 </InputFiles>
 <SupplementalFiles />
 <BaseTaxonomies />
 <HasPresentationLinkbase>true</HasPresentationLinkbase>
 <HasCalculationLinkbase>true</HasCalculationLinkbase>
 </FilingSummary>
 </XBRL>
 </TEXT>
 </DOCUMENT>
 </SEC-DOCUMENT>

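So all in all, you have a multipart document encoded in a text format with a header, a text section, an HTML section, a uuencoded XBRL archive, and a filing summary report. That also explains the error: somewhere around line 35210, past the HTML, the parser ran into the characters <![, took them for the start of an SGML marked section (such as <![CDATA[), and gave up because what followed was not a keyword it recognizes. If you want to use the simple HTMLParser to read the file, you're going to have to extract the HTML section first.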
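
To check that layout on your own files, here is a minimal sketch (Python 3) that lists the parts of a submission along with each part's TYPE and FILENAME. The name `split_documents` and the regex approach are my own illustration, not an official EDGAR parser, and the code assumes every part follows the `<DOCUMENT>` ... `<TEXT>` ... `</TEXT>` ... `</DOCUMENT>` layout shown in the excerpt above.

```python
import re

# Assumed layout (taken from the excerpt above): each part is wrapped in
# <DOCUMENT>...</DOCUMENT>, has one-line <TYPE>/<FILENAME> tags, and keeps
# its payload between <TEXT> and </TEXT>.
DOC_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>\n?(.*?)</TEXT>", re.DOTALL)


def _tag_value(block, tag):
    """Return the rest of the line after <TAG>, or None if the tag is absent."""
    match = re.search(r"<%s>([^\n]*)" % tag, block)
    return match.group(1).strip() if match else None


def split_documents(raw):
    """Hypothetical helper: list the parts of an SEC multi-part submission."""
    parts = []
    for block in DOC_RE.findall(raw):
        body = TEXT_RE.search(block)
        parts.append({
            "type": _tag_value(block, "TYPE"),
            "filename": _tag_value(block, "FILENAME"),
            "text": body.group(1) if body else "",
        })
    return parts
```

Printing the type and filename of each entry returned by `split_documents(open(path).read())` shows where the HTML part sits among the others; pulling that part out is the remaining step.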

So, how to do that? Try a preprocess step like this:

import os

def html_part(filepath):
    """
    Generator returning only the HTML lines from an
    SEC Edgar SGML multi-part file.
    """
    start, stop = '<html>\n', '</html>\n'
    filepath = os.path.expanduser(filepath)
    with open(filepath) as f:
        # find start indicator, yield it
        for line in f:
            if line == start:
                yield line
                break
        # yield lines until stop indicator found, yield and stop
        for line in f:
            yield line
            if line == stop:
                # return ends the generator; raising StopIteration here
                # becomes a RuntimeError on Python 3.7+ (PEP 479)
                return

origpath = '0001005214-12-000007.txt'
htmlpath = origpath.replace('.txt', '.html')
with open(htmlpath, "w") as out:
    out.write(''.join(html_part(origpath)))

Once you've stripped out just the HTML lines, you can use your original code to parse the file in htmlpath, which is truly the HTML part.
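
The code in the question targets Python 2, where the module is called `HTMLParser`. For anyone on Python 3, here is a sketch of the same tag stripper using `html.parser`. It assumes Python 3.4 or later, and passing `convert_charrefs=True` is my own choice so that entities such as `&amp;` arrive already decoded in `handle_data`.

```python
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Collect only the text found between tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities like &amp; before
        # handle_data sees them.
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, data):
        self.fed.append(data)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    stripper = MLStripper()
    stripper.feed(html)
    stripper.close()  # flush any text the parser is still buffering
    return stripper.get_data()
```

Reading the file at htmlpath and passing its contents to `strip_tags` should then give the plain text of the filing.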

answered Nov 17, 2014 at 2:29

2 Comments

Wow! Fantastic answer. I now understand the issue. Thanks for the help!
raise StopIteration inside a generator now raises a RuntimeError on newer Python versions (source). Changing it to return fixed the problem for me.
