I couldn't find a solution for matching a specific sequence. Context: the Italian cadastre, where the plot number is embedded in a table encoded in KML, in the "description" field. I want to extract the plot number.
Part of the code is:
...<tr>
<td>SVILUPPO</td>
<td>25</td>
</tr>
<tr bgcolor="#D4E4F3">
<td>NUMERO</td>
<td>156</td>
</tr>
<tr>
<td>LIVELLO</td>...
In this example, I want to get: 156. There are other categories (commune, cadastral sheet, etc.) with other numbers, so I need to identify the right number by anchoring on "NUMERO".
Tried:
regexp_substr( "description" , 'NUMERO</td> \n\n <td>(\\d+)<' )
or
regexp_substr( "description" , 'NUMERO \\D+ (\\d+) <' )
With:
NUMERO as the start of the sequence, \D+ as any run of non-digit characters, (\d+) to capture the number to extract, and < to close the sequence.
Both expressions are accepted as valid, but they return NULL and I can't see why. Any help is very welcome.
1 Answer
Assuming the code you are trying to match is exactly as displayed in the question, you have errors in both of your regexes.
In the case of the first one, there were a few changes:
- you need to double escape (\\) all of the backslashes, not just the ones for \\d
- remove the spaces
- I also needed to add \\r in front of \\n, but \r\n is a Windows-specific newline, and may or may not be relevant for you depending on operating system
regexp_substr("description", 'NUMERO</td>\\r\\n\\r\\n<td>(\\d+)<')
In the case of your second attempt, it worked when I removed the spaces - and is probably more robust:
regexp_substr("description", 'NUMERO\\D+(\\d+)<')
The following similar expression also worked for me:
regexp_substr("description" , 'NUMERO[^\\d]+(\\d+)<')
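To see why the spaces matter, here is a minimal sketch using Python's `re` module rather than the QGIS expression engine, so behaviour may differ in details. The `description` string is a hypothetical stand-in for the KML field with Windows line endings, and `first_group` is a helper made up for this illustration. Python raw strings take single backslashes, whereas the QGIS string literal needs them doubled.

```python
import re

# Hypothetical stand-in for the KML "description" value; the real field may
# use different line endings or indentation.
description = (
    "<tr>\r\n<td>SVILUPPO</td>\r\n<td>25</td>\r\n</tr>\r\n"
    '<tr bgcolor="#D4E4F3">\r\n<td>NUMERO</td>\r\n<td>156</td>\r\n</tr>\r\n'
    "<tr>\r\n<td>LIVELLO</td>"
)


def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# A literal space in the pattern must match a literal space in the text.
# "NUMERO" is followed directly by "</td>", so this pattern never matches.
with_spaces = first_group(r"NUMERO \D+ (\d+) <", description)

# Without the spaces, \D+ absorbs the tags and line breaks up to the digits.
without_spaces = first_group(r"NUMERO\D+(\d+)<", description)
negated_class = first_group(r"NUMERO[^\d]+(\d+)<", description)

print(with_spaces, without_spaces, negated_class)
```

In this sketch the spaced pattern yields no match, which corresponds to the NULL seen in the question, while both space-free patterns capture the number that follows NUMERO.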
The newlines can be more easily addressed using the whitespace character class, \\s, since that character class matches carriage returns and linefeeds. The 'NUMERO</td>\\r\\n\\r\\n<td>(\\d+)<' can be simplified to 'NUMERO</td>\\s*<td>(\\d+)<'.
– bixb0012, Nov 24, 2023 at 17:28
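The whitespace-class suggestion can be checked with a short sketch, again in Python's `re` module as an assumed stand-in for the QGIS engine. The `samples` strings are hypothetical fragments that differ only in how the two cells are separated.

```python
import re

# \s* tolerates any amount of whitespace (including none) between the cells.
pattern = re.compile(r"NUMERO</td>\s*<td>(\d+)<")

# Hypothetical fragments with different line-ending conventions.
samples = {
    "LF": "<td>NUMERO</td>\n<td>156</td>",
    "CRLF": "<td>NUMERO</td>\r\n<td>156</td>",
    "blank line, CRLF": "<td>NUMERO</td>\r\n\r\n<td>156</td>",
    "no break": "<td>NUMERO</td><td>156</td>",
}

# Every sample matches, so indexing group(1) directly is safe here.
results = {name: pattern.search(text).group(1) for name, text in samples.items()}

for name, value in results.items():
    print(f"{name}: {value}")
```

A single pattern covers all four layouts here, which is why this form is less dependent on the operating system that produced the file.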