I am reading a file from the web row by row, and each row comes back as a list containing a single string. That string holds three columns separated by this pattern: +++$+++.

This is my code:

import codecs
import csv
from contextlib import closing
import requests

with closing(requests.get(url, stream=True)) as r:
    reader = csv.reader(codecs.iterdecode(r.iter_lines(), 'latin-1'))
    for i, row in enumerate(reader):
        if i < 5:
            t = row[0].split('(\s\+{3}\$\+{3}\s)+')
            print(t)

I tried to split each row with the call above in Python 3.6 and can't get it to work. Any suggestions are much appreciated.

The rows:

['m0 +++$+++ 10 things i hate about you +++$+++ http://www.dailyscript.com/scripts/10Things.html']
['m1 +++$+++ 1492: conquest of paradise +++$+++ http://www.hundland.org/scripts/1492-ConquestOfParadise.txt']
['m2 +++$+++ 15 minutes +++$+++ http://www.dailyscript.com/scripts/15minutes.html']
['m3 +++$+++ 2001: a space odyssey +++$+++ http://www.scifiscripts.com/scripts/2001.txt']
['m4 +++$+++ 48 hrs. +++$+++ http://www.awesomefilm.com/script/48hours.txt']

This is my regular expression:

row[0].split('(\s\+{3}\$\+{3}\s)+')

Each row has only one element, row[0].

When I print the result, the row has not been split.

asked Jul 15, 2018 at 23:12
  • .split() on a string isn't a regex match at all - it's literally looking for the string (\s\+{3}\$\+{3}\s)+! You want re.split(r'(\s\+{3}\$\+{3}\s)+', row[0]) instead. Commented Jul 15, 2018 at 23:27
  • Or use row[0].split(" +++$+++ "), since nothing you're doing here appears to benefit from the power of regular expressions. Commented Jul 15, 2018 at 23:29
  • Also remove the parentheses from the re.split pattern so that the +++$+++ separators are not returned. Commented Jul 15, 2018 at 23:32
  • thanks, @jasonharper for the clarification. I learned this one now. Commented Jul 16, 2018 at 3:40
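
The difference the comments describe can be checked on one of the sample rows. This is only a sketch: the variable names (line, literal, with_group, fields) are made up for illustration, and the non-capturing pattern is one way of "removing the parentheses".

```python
import re

line = "m0 +++$+++ 10 things i hate about you +++$+++ http://www.dailyscript.com/scripts/10Things.html"

# str.split treats its argument as a literal string, so the regex text
# is searched for verbatim and never found: the row comes back unsplit.
literal = line.split(r'(\s\+{3}\$\+{3}\s)+')

# re.split interprets the pattern, but a capturing group makes it
# return the matched separators between the fields as well.
with_group = re.split(r'(\s\+{3}\$\+{3}\s)+', line)

# Without the capturing group only the three fields are returned.
fields = re.split(r'\s\+{3}\$\+{3}\s', line)

print(literal)
print(with_group)
print(fields)
```

The first print shows a one-element list, the second shows five elements (fields interleaved with separators), and the third shows just the three fields.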

2 Answers


Doing

row[0].split(' +++$+++ ')

should give you exactly what you wanted without regex.
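
As a rough sketch of how this could fit into the loop from the question: csv.reader splits on commas by default, so a title containing a comma would be spread over several list elements and row[0] would hold only part of the line. One option, assuming the file is not really comma-separated, is to skip csv.reader and split each decoded line directly. The names SEPARATOR, raw_lines and records are illustrative, raw_lines stands in for r.iter_lines(), and the third sample line is invented to show the comma case.

```python
import codecs

SEPARATOR = " +++$+++ "

# Stand-in for r.iter_lines(): byte strings, one per line of the file.
raw_lines = [
    b"m0 +++$+++ 10 things i hate about you +++$+++ http://www.dailyscript.com/scripts/10Things.html",
    b"m2 +++$+++ 15 minutes +++$+++ http://www.dailyscript.com/scripts/15minutes.html",
    # Hypothetical line (not from the real file) with a comma in the title.
    b"m9 +++$+++ hypothetical title, with a comma +++$+++ http://example.com/script.txt",
]

records = []
for line in codecs.iterdecode(raw_lines, "latin-1"):
    # A plain string split is enough because the separator is fixed.
    movie_id, title, url = line.split(SEPARATOR)
    records.append((movie_id, title, url))

for record in records:
    print(record)
```

Unpacking into three names raises a ValueError on any line that does not have exactly three fields, which may or may not be what you want for the full file.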

answered Jul 16, 2018 at 10:31

If you don't want to use split() and would rather get each row back as a tuple, re.findall with one capture group per column may help.

Input:

import re

# Named "text" rather than "input" to avoid shadowing the built-in input().
text = '''['m0 +++$+++ 10 things i hate about you +++$+++ http://www.dailyscript.com/scripts/10Things.html']
['m1 +++$+++ 1492: conquest of paradise +++$+++ http://www.hundland.org/scripts/1492-ConquestOfParadise.txt']
['m2 +++$+++ 15 minutes +++$+++ http://www.dailyscript.com/scripts/15minutes.html']
['m3 +++$+++ 2001: a space odyssey +++$+++ http://www.scifiscripts.com/scripts/2001.txt']
['m4 +++$+++ 48 hrs. +++$+++ http://www.awesomefilm.com/script/48hours.txt']'''
# Raw string so the backslashes reach the regex engine unchanged; one lazy group per column.
output = re.findall(r"\['([\S\s]+?)\s+\+{3}\$\+{3}\s+([\S\s]+?)\s\+{3}\$\+{3}\s+([\S\s]+?)'\]", text)
print(output)

Output:

[('m0', '10 things i hate about you', 'http://www.dailyscript.com/scripts/10Things.html'), ('m1', '1492: conquest of paradise', 'http://www.hundland.org/scripts/1492-ConquestOfParadise.txt'), ('m2', '15 minutes', 'http://www.dailyscript.com/scripts/15minutes.html'), ('m3', '2001: a space odyssey', 'http://www.scifiscripts.com/scripts/2001.txt'), ('m4', '48 hrs.', 'http://www.awesomefilm.com/script/48hours.txt')] 

I'm also experimenting with a regex that uses alternation, but for the life of me I can't get it to work yet. I'll post it later if I do; hopefully the above helps in the meantime.

answered Jul 15, 2018 at 23:40
  • Thanks, @Inquisitor01. I got a good one from jasonharper. Appreciate it. Commented Jul 16, 2018 at 3:39
