Regex substr - Extracting number after specific text sequence

Question 1

I couldn't find a solution for this specific pattern. Context: the Italian cadastre, where the plot number is embedded in an HTML table stored in the "description" field of a KML file. I want to extract the plot number.

Part of the code is:

...<tr>
<td>SVILUPPO</td>
<td>25</td>
</tr>
<tr bgcolor="#D4E4F3">
<td>NUMERO</td>
<td>156</td>
</tr>
<tr>
<td>LIVELLO</td>...

In this example, I want to get 156. There are other categories (commune, cadastral sheet, etc.) with other numbers, so I need to identify the right number by anchoring on "NUMERO".

Tried:

regexp_substr( "description" , 'NUMERO</td> \n\n <td>(\\d+)<' )

or

regexp_substr( "description" , 'NUMERO \\D+ (\\d+) <' )

where NUMERO marks the start of the sequence, \D+ matches any run of non-digit characters, (\d+) captures the number, and < closes the sequence.

Both expressions are accepted as valid, but they return NULL, and I can't see why. Any help is much welcome.

Answer

Assuming the code you are trying to match is exactly as displayed in the question, you have errors in both of your regexes.

In the case of the first one, there were a few changes:

you need to double escape (\\) all of the backslashes, not just the ones for \\d
remove the spaces
I also needed to add \\r in front of \\n, but \r\n is a Windows-specific newline, and may or may not be relevant for you depending on operating system

regexp_substr("description", 'NUMERO</td>\\r\\n\\r\\n<td>(\\d+)<')
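
The effect of the stray spaces and of the line endings can be reproduced outside the expression engine. The following is a minimal Python sketch, not the expression dialect used above: the sample string is hypothetical and assumes CRLF line endings with a blank line between cells, and Python raw strings take a single backslash where the expression string needs two.

```python
import re

# Hypothetical sample of the "description" field. Assumption: CRLF line
# endings and a blank line between cells, as the working pattern implies.
description = (
    "<tr>\r\n\r\n<td>SVILUPPO</td>\r\n\r\n<td>25</td>\r\n\r\n</tr>\r\n\r\n"
    '<tr bgcolor="#D4E4F3">\r\n\r\n<td>NUMERO</td>\r\n\r\n<td>156</td>\r\n\r\n</tr>'
)

# Pattern shaped like the first attempt: it demands literal spaces around
# the newlines, and the data contains none, so nothing matches.
with_spaces = re.search(r"NUMERO</td> \n\n <td>(\d+)<", description)

# Corrected pattern: no spaces, explicit CR+LF pairs.
crlf = re.search(r"NUMERO</td>\r\n\r\n<td>(\d+)<", description)

print(with_spaces)    # None
print(crlf.group(1))  # 156
```

Every literal space in a pattern must match a space in the data, which is why the spaced version finds nothing.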

In the case of your second attempt, it worked when I removed the spaces - and is probably more robust:

regexp_substr("description", 'NUMERO\\D+(\\d+)<')

The following similar expression also worked for me:

regexp_substr("description" , 'NUMERO[^\\d]+(\\d+)<')
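
As a rough check that the two forms behave the same and pick the value from the NUMERO row rather than an earlier one, here is a small Python sketch. The fragment is adapted from the question; the LF-only line endings and the helper name `first_group` are assumptions for illustration.

```python
import re

# Fragment adapted from the question (assumption: LF line endings).
fragment = (
    "<tr>\n<td>SVILUPPO</td>\n<td>25</td>\n</tr>\n"
    '<tr bgcolor="#D4E4F3">\n<td>NUMERO</td>\n<td>156</td>\n</tr>\n'
    "<tr>\n<td>LIVELLO</td>"
)

def first_group(pattern, text):
    """Return the first captured group of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

# \D and [^\d] are two spellings of "any character that is not a digit".
with_shorthand = first_group(r"NUMERO\D+(\d+)<", fragment)
with_negated_class = first_group(r"NUMERO[^\d]+(\d+)<", fragment)

print(with_shorthand, with_negated_class)  # 156 156
```

Because the pattern starts at the literal NUMERO, the 25 in the SVILUPPO row is never considered.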

The newlines can be handled more easily with the whitespace character class, \\s, since that class matches both carriage returns and linefeeds. The pattern 'NUMERO</td>\\r\\n\\r\\n<td>(\\d+)<' can then be simplified to 'NUMERO</td>\\s*<td>(\\d+)<'.
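
To illustrate why the \s* form is less sensitive to line endings, here is a short Python sketch. The three input variants and the helper name `plot_number` are hypothetical, and Python's `re` module stands in for the expression engine.

```python
import re

# Same idea as the simplified pattern: only whitespace may sit between cells.
PATTERN = re.compile(r"NUMERO</td>\s*<td>(\d+)<")

def plot_number(description):
    """Return the digits in the cell right after NUMERO, or None if absent."""
    match = PATTERN.search(description)
    return match.group(1) if match else None

# Hypothetical variants of the same fragment with different line endings.
variants = {
    "LF": "<td>NUMERO</td>\n<td>156</td>",
    "CRLF with blank line": "<td>NUMERO</td>\r\n\r\n<td>156</td>",
    "no whitespace": "<td>NUMERO</td><td>156</td>",
}

results = {name: plot_number(text) for name, text in variants.items()}
print(results)
```

One difference worth keeping in mind: \s* only bridges whitespace, so if the cell after NUMERO held no digits this pattern would find nothing, whereas the \D+ form would skip ahead to the next number in the table.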

Tom Brennan · Accepted Answer · 2023-10-16 12:34:12Z
