Extract substrings separately from a string using python regex

Question 1

I am trying to write a regular expression that returns the part of a string that follows a given substring. For example, I want everything, spaces included, that comes after "15/08/2017".

a='''S
LINC SHORT LEGAL TITLE NUMBER
0037 471 661 1720278;16;21 172 211 342
LEGAL DESCRIPTION
PLAN 1720278 
BLOCK 16 
LOT 21 
EXCEPTING THEREOUT ALL MINES AND MINERALS 
ESTATE: FEE SIMPLE 
ATS REFERENCE: 4;24;54;2;SW
MUNICIPALITY: CITY OF EDMONTON
REFERENCE NUMBER: 172 023 641 +71
---------------------------------------------------------------------------- 
----
 REGISTERED OWNER(S)
REGISTRATION DATE(DMY) DOCUMENT TYPE VALUE CONSIDERATION
--------------------------------------------------------------------------- 
-- 
---
172 211 342 15/08/2017 AFFIDAVIT OF CASH & MTGE'''

Is there a way to get 'AFFIDAVIT OF' and 'CASH & MTGE' as separate strings?

Here is the expression I have pieced together so far:

doc = (a.split('15/08/2017', 1)[1]).strip()
'AFFIDAVIT OF CASH & MTGE'

Question 2

I have edited the question to include the actual input string.

Question 3

Okay, is there any way to do this using regex?

Question 4

Why do you want to do this with regex? Are you willing to accept any other solution?

Question 5

Yes, if there is a better way than regex.

Question 6

This is not a regex-based solution, but it does the trick.

a='''S
LINC SHORT LEGAL TITLE NUMBER
0037 471 661 1720278;16;21 172 211 342
LEGAL DESCRIPTION
PLAN 1720278 
BLOCK 16 
LOT 21 
EXCEPTING THEREOUT ALL MINES AND MINERALS 
ESTATE: FEE SIMPLE 
ATS REFERENCE: 4;24;54;2;SW
MUNICIPALITY: CITY OF EDMONTON
REFERENCE NUMBER: 172 023 641 +71
---------------------------------------------------------------------------- 
----
 REGISTERED OWNER(S)
REGISTRATION DATE(DMY) DOCUMENT TYPE VALUE CONSIDERATION
--------------------------------------------------------------------------- 
-- 
---
172 211 342    15/08/2017    AFFIDAVIT OF    CASH & MTGE'''
# The columns in the real data are assumed to be separated by two or more spaces.
doc = a.split('15/08/2017', 1)[1].strip()
# Split on two spaces rather than one so that "AFFIDAVIT OF" stays together.
print(doc.split("  ")[0].strip())   # outputs AFFIDAVIT OF
print(doc.split("  ")[-1].strip())  # outputs CASH & MTGE

Hope it helps.

Question 7

A code snippet based on the re module:

import re
foo = '''S
LINC SHORT LEGAL TITLE NUMBER
0037 471 661 1720278;16;21 172 211 342
LEGAL DESCRIPTION
PLAN 1720278
BLOCK 16
LOT 21
EXCEPTING THEREOUT ALL MINES AND MINERALS
ESTATE: FEE SIMPLE
ATS REFERENCE: 4;24;54;2;SW
MUNICIPALITY: CITY OF EDMONTON
REFERENCE NUMBER: 172 023 641 +71
----------------------------------------------------------------------------
----
 REGISTERED OWNER(S)
REGISTRATION DATE(DMY) DOCUMENT TYPE VALUE CONSIDERATION
---------------------------------------------------------------------------
--
---
172 211 342 15/08/2017 AFFIDAVIT OF CASH & MTGE'''
pattern = r'.*\d{2}/\d{2}/\d{4}\s+(\w+\s+\w+)\s+(\w+\s+.*\s+\w+)'
result = re.findall(pattern, foo, re.MULTILINE)
print("1st match:", result[0][0])
print("2nd match:", result[0][1])

Output

1st match: AFFIDAVIT OF
2nd match: CASH & MTGE
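
As a variation on the pattern above, named groups can make the two captures easier to read. This is only a sketch: the sample row, the group names doc_type and consideration, and the assumption that columns are separated by two or more spaces are all illustrative and not taken from the real document.

```python
import re

# Hypothetical sample row; ASSUMES columns are separated by 2+ spaces.
row = "172 211 342   15/08/2017  AFFIDAVIT OF      CASH & MTGE"

# A date, then the document type (lazy), a gap of 2+ spaces,
# then everything up to the end of the line.
pattern = re.compile(
    r"\d{2}/\d{2}/\d{4}\s+(?P<doc_type>.+?)\s{2,}(?P<consideration>.+)$",
    re.MULTILINE,
)

m = pattern.search(row)
doc_type = m.group("doc_type") if m else None
consideration = m.group("consideration") if m else None
print(doc_type)
print(consideration)
```

If a row also carries a VALUE column, the second group would capture that column together with the consideration, so the pattern would need a further group for it.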

Question 8

We can try using re.findall with the following pattern:

PHASED OF (((?!\bCONDOMINIUM PLAN).)*)(?=CONDOMINIUM PLAN)

In DOTALL mode, the above pattern matches everything between PHASED OF and CONDOMINIUM PLAN, excluding both markers. (re.MULTILINE is passed as well, but it has no effect here because the pattern contains no ^ or $ anchors.)

import re
input = "182 246 612 01/10/2018 PHASED OF CASH & MTGE\n CONDOMINIUM PLAN"
result = re.findall(r'PHASED OF (((?!\bCONDOMINIUM PLAN).)*)(?=CONDOMINIUM PLAN)', input, re.DOTALL|re.MULTILINE)
output = result[0][0].strip()
print(output)
CASH & MTGE

Note that I also strip off whitespace from the match. We might be able to modify the regex pattern to do this, but in a general solution, maybe you want to keep some of the whitespace, in certain cases.
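
For comparison, a lazy quantifier should give the same result on this input with a shorter pattern. The sketch below runs both forms side by side; the variable names are illustrative, and the lazy version is an alternative technique, not the pattern from the answer.

```python
import re

text = "182 246 612 01/10/2018 PHASED OF CASH & MTGE\n CONDOMINIUM PLAN"

# Tempered pattern: consume one character at a time as long as
# "CONDOMINIUM PLAN" does not start at the current position.
tempered = re.search(
    r"PHASED OF (((?!\bCONDOMINIUM PLAN).)*)(?=CONDOMINIUM PLAN)",
    text,
    re.DOTALL,
)

# Lazy alternative: take as little as possible up to the closing marker,
# leaving the surrounding whitespace outside the group.
lazy = re.search(r"PHASED OF\s+(.*?)\s*CONDOMINIUM PLAN", text, re.DOTALL)

between_tempered = tempered.group(1).strip()
between_lazy = lazy.group(1)
print(between_tempered)
print(between_lazy)
```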

Question 9

The thing is that the string below DOCUMENT TYPE may span several lines, but it does not have to. If it does span several lines, all of them should be taken into account.

Question 10

My answer covers a multiline situation. If you see a flaw in my answer, then state exactly what it is.

Question 11

I can't work out what this does: result = re.findall(r'PHASED OF (((?!\bCONDOMINIUM PLAN).)*)(?=CONDOMINIUM PLAN)', input, re.DOTALL|re.MULTILINE). Can't we give 'PHASED OF CONDOMINIUM PLAN' as a single term?

Question 12

No, we can't, hence I initially commented under your question that there is no answer. You need to match across lines.

Question 13

Okay, fine. What modification is needed if there is no multiline entry after the date?

Question 14

Why regular expressions?

It looks like you know the exact delimiting string, so just split on it with str.split() and take the first part:

In [1]: a='172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE'
In [2]: a.split("15/08/2017", 1)[0]
Out[2]: '172 211 342 '
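
A closely related option, shown here as a minimal sketch, is str.partition(), which returns the text before the delimiter, the delimiter itself, and the text after it in one call. The variable names are illustrative.

```python
a = "172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE"

# partition() always returns a 3-tuple: (before, separator, after).
before, sep, after = a.partition("15/08/2017")

print(repr(before))
print(after.strip())
```

If the delimiter is not present, sep and after are empty strings, which makes the "not found" case easy to check without catching an IndexError.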

Question 15

It won't work for the input string as I have now edited it.

Question 16

@Farook in this state it won't, right. You could though adjust the solution and split it on a newline first, but in that case, regex would be able to do it in one go.

Question 17

I would avoid matching with a regex here, because the only meaningful separation between the logical terms appears to be two or more spaces. Individual terms, including the one you want to match, may also contain spaces. So I recommend doing a regex split on the input using \s{2,} as the pattern. This yields a list containing all the terms. Then we can walk down the list once, and when we find the forward-looking term, return the previous term in the list.

import re
a = "172 211 342  15/08/2017  TRANSFER OF LAND  $610,000  CASH & MTGE"
parts = re.compile(r"\s{2,}").split(a)
print(parts)
for i in range(1, len(parts)):
    if parts[i] == "15/08/2017":
        print(parts[i-1])
['172 211 342', '15/08/2017', 'TRANSFER OF LAND', '$610,000', 'CASH & MTGE']
172 211 342
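
The same split-then-walk idea can be wrapped in a small helper that returns the term at a given offset from a known term. This is a sketch only: the function name neighbour and the sample row are hypothetical, and the row assumes that columns are separated by two or more spaces.

```python
import re

def neighbour(line, marker, offset):
    """Return the term `offset` positions away from `marker`, or None."""
    parts = re.split(r"\s{2,}", line.strip())
    if marker not in parts:
        return None
    i = parts.index(marker) + offset
    return parts[i] if 0 <= i < len(parts) else None

# Hypothetical row; ASSUMES 2+ spaces between columns.
row = "172 211 342  15/08/2017  TRANSFER OF LAND  $610,000  CASH & MTGE"

print(neighbour(row, "15/08/2017", -1))  # term before the date
print(neighbour(row, "15/08/2017", 1))   # term after the date
```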

Question 18

Use a positive lookbehind assertion:

import re
m = re.search(r'(?<=15/08/2017).*', a)
m.group(0)
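
To get the two terms separately, the text found by the lookbehind can then be split on runs of two or more spaces. A minimal sketch follows; the sample row and its spacing are assumptions made for illustration.

```python
import re

# Hypothetical row; ASSUMES 2+ spaces between columns.
a = "172 211 342  15/08/2017  AFFIDAVIT OF      CASH & MTGE"

# Everything after the date, up to the end of the line.
m = re.search(r"(?<=15/08/2017).*", a)
rest = m.group(0).strip() if m else ""

# Split what follows the date into its columns.
fields = re.split(r"\s{2,}", rest)
print(fields)
```

Python's re module only accepts fixed-width lookbehinds. A literal date satisfies that, and so would a pattern such as \d{2}/\d{2}/\d{4}.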

Question 19

You have to return the right group:

re.match("(.*?)15/08/2017",a).group(1)

Question 20

You need to use group(1):

import re
re.match("(.*?)15/08/2017",a).group(1)

Output

'172 211 342 '

Question 21

Building on your expression, this is what I believe you need:

import re
a='172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE'
re.match(r"(.*?)(\w+/)",a).group(1)

Output:

'172 211 342 '

Question 22

You can do this by using group(1)

re.match("(.*?)15/08/2017",a).group(1)

UPDATE

For the updated string, use re.search instead of re.match, because re.match only matches at the very start of the string:

re.search(r"(.*?)15/08/2017",a).group(1)
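
The difference between the two calls can be seen on a short two-line input. This sketch uses a shortened, hypothetical version of the real text.

```python
import re

# Shortened, hypothetical two-line input: the date is on the second line.
text = "REGISTRATION DATE(DMY) DOCUMENT TYPE\n172 211 342 15/08/2017 AFFIDAVIT OF"

# re.match is anchored at position 0, and "." does not cross the newline,
# so the date on the second line is out of reach.
anchored = re.match(r"(.*?)15/08/2017", text)

# re.search retries at every position, so it succeeds from the second line.
anywhere = re.search(r"(.*?)15/08/2017", text)

print(anchored)
print(repr(anywhere.group(1)))
```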

Question 23

This will give incorrect results if there is more than one term before 15/08/2017.

Question 24

I have edited my input string. This did not work for the edited string.

Question 25

This will fail completely if the desired term is anything other than the first term.

Question 26

Your problem is that your string is formatted the way it is. The line you are looking for is

182 246 612 01/10/2018 PHASED OF CASH & MTGE

And then you are looking for whatever comes after 'PHASED OF' and some spaces.

You want to search for

(?<=PHASED OF)\s*(?P<your_text>.*?)\n

in your string. This returns a match object that holds the value you are looking for in the named group your_text.

m = re.search(r'(?<=PHASED OF)\s*(?P<your_text>.*?)\n', a)
your_desired_text = m.group('your_text')

Also, there are many good online regex testers for experimenting with your patterns. Once the regex works there, copy and paste it into Python.

I use this one: https://regex101.com/
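
Since the document type may or may not wrap onto a second line, one possible approach is to split the row into columns and append a continuation line only when one is present. The sketch below is not taken from the thread: the function name split_row and the sample rows are hypothetical, and it assumes that columns are separated by two or more spaces and that a continuation line starts with whitespace.

```python
import re

def split_row(text, date):
    """Return (document_type, consideration) for the row containing `date`."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if date not in line:
            continue
        # ASSUMPTION: columns are separated by 2+ spaces.
        cols = re.split(r"\s{2,}", line.split(date, 1)[1].strip())
        doc_type, consideration = cols[0], cols[-1]
        # ASSUMPTION: a wrapped document type continues on the next line,
        # and that line starts with whitespace.
        if i + 1 < len(lines):
            nxt = lines[i + 1]
            if nxt[:1].isspace() and nxt.strip():
                doc_type += " " + nxt.strip()
        return doc_type, consideration
    return None

# Hypothetical rows written with the assumed spacing.
wrapped = "182 246 612  01/10/2018  PHASED OF  CASH & MTGE\n             CONDOMINIUM PLAN"
single = "172 211 342  15/08/2017  AFFIDAVIT OF  CASH & MTGE"

print(split_row(wrapped, "01/10/2018"))
print(split_row(single, "15/08/2017"))
```

If the real layout marks continuation lines differently, only the check on the next line would need to change.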

Question 27

I am not searching for whatever comes after 'PHASED OF' and some spaces. Instead, I am searching for the string that follows the entire entry below DOCUMENT TYPE, i.e. 'PHASED OF CONDOMINIUM PLAN'.

Question 28

"I need to get the string after the term 'PHASED OF CONDOMINIUM PLAN', which should return 'CASH & MTGE'. I have tried using the below expression". Where did I go wrong?

Accepted answer: posted by CodeIt, 2018-12-26 04:00:45Z (the split-based solution under Question 6).
