I'm playing around with the chapter06 notebook and found that
- sms_spam_collection/SMSSpamCollection.tsv has 5574 lines,
- but the DataFrame loaded from it has only 5572 rows.
I wonder what causes the mismatch?
Hi @sammyne
Good catch. The quotation mark on line 5082 of SMSSpamCollection.tsv is causing the problem. As a result, pandas read_csv parses the following lines, 5083 and 5084, together. I looked at the data and there are several instances of the double quotation mark ("), but line 5082 is the only offending one.
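One way to locate a line like this is to look for parsed rows whose text field contains an embedded newline, since that should only happen when a quoted field swallows a line break. Here is a minimal sketch on a small synthetic sample; the sample data and the find_merged_rows helper are made up for illustration and are not taken from the dataset:

```python
import io

import pandas as pd


def find_merged_rows(tsv_source):
    """Return parsed rows whose Text field spans more than one physical line."""
    df = pd.read_csv(tsv_source, sep="\t", header=None, names=["Label", "Text"])
    # An embedded newline means a quoted field absorbed a line break.
    return df[df["Text"].str.contains("\n", regex=False, na=False)]


# Synthetic stand-in for the real file: four physical lines, with a quote
# that opens on the second line and only closes at the end of the fourth.
sample = (
    "ham\tSee you at five\n"
    "ham\t\"Unmatched quote starts here\n"
    "spam\tWin a prize now\n"
    "ham\tClosing quote here\"\n"
)

merged = find_merged_rows(io.StringIO(sample))
print(merged)
```

On the real file, you would pass the path to SMSSpamCollection.tsv instead of the io.StringIO object.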
There are a few solutions:
- Remove the quotation mark in line 5082 only.
- Use the quoting parameter of pandas read_csv with csv.QUOTE_NONE (which has the integer value 3), as described in the pandas documentation:
df = pd.read_csv(data_file_path, sep='\t', header=None, names=["Label", "Text"], quoting=3)
- Normalize the text file by removing quotation marks.
- Read the file manually using Python's open() and handle quotation marks as you choose.
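To illustrate how these options differ, here is a small sketch that parses the same synthetic sample three ways: with default quoting, with quoting=csv.QUOTE_NONE, and by splitting lines manually. The sample text and variable names are assumptions for illustration; with the real file you would pass data_file_path (or open() it) instead of io.StringIO.

```python
import csv
import io

import pandas as pd

# Synthetic stand-in for the real file: four physical lines, with a quote
# that opens on the second line and only closes at the end of the fourth.
sample = (
    "ham\tSee you at five\n"
    "ham\t\"Unmatched quote starts here\n"
    "spam\tWin a prize now\n"
    "ham\tClosing quote here\"\n"
)

# Default quoting: the quoted field swallows line breaks and merges lines.
df_default = pd.read_csv(
    io.StringIO(sample), sep="\t", header=None, names=["Label", "Text"]
)

# csv.QUOTE_NONE (integer value 3): quote characters are treated as plain text.
df_no_quoting = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["Label", "Text"],
    quoting=csv.QUOTE_NONE,
)

# Manual parsing: split each physical line on the first tab only.
# For a real file, iterate over open(path, encoding="utf-8") instead.
rows = [line.rstrip("\n").split("\t", 1) for line in io.StringIO(sample)]
df_manual = pd.DataFrame(rows, columns=["Label", "Text"])

print(len(df_default), len(df_no_quoting), len(df_manual))
```

With quoting disabled or with manual splitting, each physical line becomes its own row, and the quotation marks stay in the text as ordinary characters.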
Good catch, actually. I hadn't seen that issue before.