Matching a section of a large text file using Python

Question 1

I have been manually searching large files produced by running a program. I have been successful at pulling out some blocks of information, but I am stuck trying to extract the last three blocks. The structure of the blocks is shown below.

I have tried several regular expressions with no success, such as:

v2 = re.findall(r'(?s)\(VFSCAN\) AT TIME =(.*?)100 BUSES WITH LOW VOLTAGE DEVIATION BELOW.*?\s*$',wholefile)

wholefile is the entire file that I have read in. The file has several of each of the sections below, and I would like to extract them all so that I can locate the last occurrence of an entry such as (18436 [LENZIE 618.0] -0.245). I will then parse the line with the time to determine when this occurred. I have to do the same for "voltage deviation", "voltage", and "frequency". If I find out how to match one variable-length, multi-line section, it should be the same for the others. My problem is knowing when to end the search. I am relying on the fact that the search should end at the last blank line (hence the \s*$). I am using findall to extract all such sections, for voltage deviation for example.

I also have an issue with the VERBOSE definition of the pattern in Python. It does not seem to work (see below). Am I doing something wrong?

(VFSCAN) AT TIME = 1.1800 UP TO 100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:
X ----- BUS ------ X VDEV X ----- BUS ------ X VDEV
18436 [LENZIE 618.0] -0.245 18433 [LENZIE 318.0] -0.245 
18431 [LENZIE 118.0] -0.214 18432 [LENZIE 218.0] -0.214 
18435 [LENZIE 518.0] -0.214 18434 [LENZIE 418.0] -0.214 
(VFSCAN) AT TIME = 2.6267 UP TO 100 BUSES WITH LOW VOLTAGE BELOW 0.700:
X ----- BUS ------ X VOLT X ----- BUS ------ X VOLT
65191 [BONANZA 24.0] 0.439 65194 [CHAPITA 138] 0.581 
65192 [BONANZA 138] 0.585 65371 [COVE TP 138] 0.694 
66278 [RANGELY 138] 0.698 
(VFSCAN) AT TIME = 6.0632 UP TO 100 BUSES WITH LOW FREQUENCY BELOW 59.600:
X ----- BUS ------ X FREQ X ----- BUS ------ X FREQ
27117 [WTGCP .600] 59.443 27123 [WTGGE2 .570] 59.490 
27119 [WTGGE .570] 59.492 26040 [INTERM2G26.0] 59.492 
26039 [INTERM1G26.0] 59.492 
pattern = r"""
(?s) # Tell Regex to span multiple lines
\(VFSCAN\).*100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200: # Literal string to search for
(\s*$).*? # This searches for an empty line
X ----- BUS ------ X VOLT X ----- BUS ------ X VOLT # Literal string to search
(\d{5}.*).*? # Multiple lines starting with numbers
\s*$ # This search ends with an empty line
"""
regex = re.compile(pattern, re.VERBOSE)
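
One likely reason the verbose pattern above fails: under re.VERBOSE, unescaped whitespace inside the pattern is ignored and an unescaped # starts a comment, so the spaces in the literal header text are silently dropped. Separately, $ only matches at the end of the whole string unless re.MULTILINE is set. The sketch below illustrates the whitespace point only; the names sample, broken and fixed are made up for the example.

```python
import re

sample = (
    "(VFSCAN) AT TIME = 1.1800 UP TO 100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:\n"
    "X ----- BUS ------ X VDEV X ----- BUS ------ X VDEV\n"
    "18436 [LENZIE 618.0] -0.245 18433 [LENZIE 318.0] -0.245\n"
)

# In VERBOSE mode the spaces between the words are ignored, so this
# pattern effectively asks for "100BUSESWITHLOW..." and cannot match.
broken = re.compile(r"""
    \(VFSCAN\) .* 100 BUSES WITH LOW VOLTAGE DEVIATION BELOW
""", re.VERBOSE)

# Writing every required space as \s+ (or "\ " or "[ ]") keeps it significant.
fixed = re.compile(r"""
    \(VFSCAN\) .*? 100 \s+ BUSES \s+ WITH \s+ LOW \s+
    VOLTAGE \s+ DEVIATION \s+ BELOW
""", re.VERBOSE)

broken_match = broken.search(sample)
fixed_match = fixed.search(sample)
print(broken_match)             # None
print(fixed_match is not None)  # True
```

The same rule applies to the column-header line (X ----- BUS ------ X ...): each space there would also need escaping in a verbose pattern.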

Next day: after trying for a couple more hours, I came up with the following. The first one matched everything (not what I need), and the second one, which I was sure would work, did not match my test file.

First:

(?s)^\(VFSCAN\).*100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:.*(\s*$)?

Second:

(?m)(?s)^\(VFSCAN\).*100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:^\s*$^X ----- BUS ------ X VDEV.*?
(.*?)
^\s*$

With these regexes I am trying to match the following section of the file completely.

(VFSCAN) AT TIME = 1.1800 UP TO 100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:
X ----- BUS ------ X VDEV X ----- BUS ------ X VDEV
18436 [LENZIE 618.0] -0.245 18433 [LENZIE 318.0] -0.245 
18431 [LENZIE 118.0] -0.214 18432 [LENZIE 218.0] -0.214 
18435 [LENZIE 518.0] -0.214 18434 [LENZIE 418.0] -0.214

I need some help to fix the pattern so that I can select the above.

I have problems with the following text. I just want to extract the time and related items in all the square brackets "[]".

test3 = r'''(VFSCAN) AT TIME = 1.1800 UP TO 100 BUSES WITH LOW VOLTAGE DEVIATION BELOW - 0.200:
X ----- BUS ------ X VDEV X ----- BUS ------ X VDEV
18436 [LENZIE 618.0] -0.245 18433 [LENZIE 318.0] -0.245 
18431 [LENZIE 118.0] -0.214 18435 [LENZIE 518.0] -0.214 
18434 [LENZIE 418.0] -0.214 18432 [LENZIE 218.0] -0.214 
(VFSCAN) AT TIME = 1.5167 UP TO 100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:
X ----- BUS ------ X VDEV X ----- BUS ------ X VDEV
69036 [DNLP2G21.575] -0.414 69038 [DNLP2G22.575] -0.414 
69040 [DNLP2G23.575] -0.414 69032 [DNLP1_G1.575] -0.402 
65460 [DIFICULT 230] -0.384 69027 [7MIHL G1.575] -0.355 
69076 [HORIZ_G .575] -0.303 67237 [MEDBOWCO 115] -0.301 
67940 [STNDPSVC 230] -0.300 65976 [MINERS 34.5] -0.294 
65585 [FT CRK1 34.5] -0.261 65584 [FT CRK2 34.5] -0.261 
69073 [HIPLN_G .575] -0.214 
(VFSCAN) AT TIME = 1.1800 UP TO 100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:
X ----- BUS ------ X VDEV X ----- BUS ------ X VDEV
65191 [BONANZA 24.0] -0.572 65192 [BONANZA 138] -0.434 
65194 [CHAPITA 138] -0.433 66278 [RANGELY 138] -0.320 
65371 [COVE TP 138] -0.302 79265 [CALAMRDG 138] -0.286 
79400 [DES.MINE 138] -0.285 65086 [ASHLEY 69.0] -0.284 
79067 [VERNAL 138] -0.277 67257 [MOONLAK269.0] -0.268 
67256 [MOONLAK169.0] -0.266 79264 [W.RV.CTY 138] -0.206 
'''

When I use findall with the pattern I get.

[('1.1800', 'DEVIATION', 'LENZIE 218.0'), ('1.5167', 'DEVIATION', 'HIPLN_G .575'), ('1.1800', 'DEVIATION', 'W.RV.CTY 138')]

I should be getting over 30 matched tuples in my list.
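
The three tuples are consistent with how Python's re module treats a capturing group inside a repeated group: only the last repetition is kept, so each section contributes a single tuple holding its final entry. One way around this, sketched below under the assumption that every section starts with a (VFSCAN) header line, is to match in two steps: first isolate each section, then run findall over the entries inside it. The names SECTION, ENTRY and extract are hypothetical, and the sample is a shortened copy of the test text.

```python
import re

sample = """(VFSCAN) AT TIME = 1.1800 UP TO 100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:
X ----- BUS ------ X VDEV X ----- BUS ------ X VDEV
18436 [LENZIE 618.0] -0.245 18433 [LENZIE 318.0] -0.245
18431 [LENZIE 118.0] -0.214 18435 [LENZIE 518.0] -0.214
(VFSCAN) AT TIME = 1.5167 UP TO 100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:
X ----- BUS ------ X VDEV X ----- BUS ------ X VDEV
69036 [DNLP2G21.575] -0.414 69038 [DNLP2G22.575] -0.414
69073 [HIPLN_G .575] -0.214
"""

# Step 1: one match per section. The body runs lazily up to the next
# "(VFSCAN)" header or the end of the text.
SECTION = re.compile(
    r"\(VFSCAN\) AT TIME =\s*([\d.]+)[^\n]*?LOW ([A-Z ]+?) BELOW[^\n]*\n"
    r"(.*?)(?=\(VFSCAN\)|\Z)",
    re.DOTALL,
)
# Step 2: every "number [name] value" entry inside one section body.
ENTRY = re.compile(r"(\d+)\s+\[([^\]]+)\]\s+(-?\d+\.\d+)")

def extract(text):
    """Return (time, kind, bus_number, bus_name, value) for every entry."""
    rows = []
    for time, kind, body in SECTION.findall(text):
        for number, name, value in ENTRY.findall(body):
            rows.append((time, kind, number, name, value))
    return rows

rows = extract(sample)
print(len(rows))  # 7
print(rows[-1])   # ('1.5167', 'VOLTAGE DEVIATION', '69073', 'HIPLN_G .575', '-0.214')
```

Because the entry pattern no longer sits inside a repeated group, every bracketed item produces its own tuple.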

Question 2

Regex to extract fields

\(VFSCAN\)[^=]*=\s* # first line of a section: (VFSCAN) AT TIME = 1.1800 UP TO 100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200
(\d*(?:\.\d+)?) # group 1 - first number of first line: 1.1800
\D+
(\d+) # group 2 - second number of first line: 100
[^\d-]+
(-?\d*(?:\.\d+)?) # group 3 - last number of first line: -0.200
\D+ # skip second line
(?: # a data line: 18436 [LENZIE 618.0] -0.245 18433 [LENZIE 318.0] -0.245
 (?: # a data entry: 18436 [LENZIE 618.0] -0.245
 (\d+) # group 4 - first number in an entry: 18436
 \s+\[
 (.*?) # group 5 - words in brackets: LENZIE
 (-?\d*(?:\.\d+)?) # group 6 - number in brackets: 618.0
 \]\s*
 (\S*) # group 7 - last number (VDEV): -0.245
 \s*
 )+
 (?=[\r\n\s]+|$)
)+

The text BUSES WITH LOW VOLTAGE DEVIATION BELOW is matched by the part between group 2 and group 3 ([^\d-]+). So you can do one of the following:

Option 1

You can also capture this part and check later whether it is your desired section. Simply adding parentheses around it makes it the third capture group:

[^\d-]+ => ([^\d-]+).

Option 2

Or you can change the same part of the regex to match only the desired section. In this case, the regex matches only the specified kind of section instead of every section:

[^\d-]+ => \s+BUSES\s+WITH\s+LOW\s+VOLTAGE\s+DEVIATION\s+BELOW\s+

If you want to match both the lines:

BUSES WITH LOW VOLTAGE DEVIATION BELOW
BUSES WITH LOW FREQUENCY BELOW

then you can write the changed part with alternation (|) syntax ((?:...) means do not capture this group):

[^\d-]+ => \s+BUSES\s+WITH\s+LOW\s+(?:VOLTAGE\s+DEVIATION|FREQUENCY)\s+BELOW\s+
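
A small sketch of how the alternation could be assembled in Python for whichever section kinds are wanted. The helper header_regex is hypothetical, and it matches the header line only (the time and the threshold), not the data lines that follow.

```python
import re

def header_regex(*kinds):
    """Build a header pattern for the given section kinds.

    Each kind is a phrase such as "VOLTAGE DEVIATION" or "FREQUENCY";
    its words are joined with \\s+ so runs of spaces in the file still match.
    """
    alternatives = "|".join(
        r"\s+".join(re.escape(word) for word in kind.split())
        for kind in kinds
    )
    return re.compile(
        r"\(VFSCAN\)\s+AT\s+TIME\s*=\s*(\d+(?:\.\d+)?)"  # group 1: time
        r"\s+UP\s+TO\s+\d+\s+BUSES\s+WITH\s+LOW\s+"
        r"(?:" + alternatives + r")\s+BELOW\s+"
        r"(-?\d+(?:\.\d+)?)"                              # group 2: threshold
    )

headers = [
    "(VFSCAN) AT TIME = 1.1800 UP TO 100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:",
    "(VFSCAN) AT TIME = 2.6267 UP TO 100 BUSES WITH LOW VOLTAGE BELOW 0.700:",
    "(VFSCAN) AT TIME = 6.0632 UP TO 100 BUSES WITH LOW FREQUENCY BELOW 59.600:",
]

pattern = header_regex("VOLTAGE DEVIATION", "FREQUENCY")
matched = [m.group(1, 2) for m in map(pattern.search, headers) if m]
print(matched)  # [('1.1800', '-0.200'), ('6.0632', '59.600')]
```

The plain LOW VOLTAGE header is skipped because VOLTAGE must be followed by DEVIATION in the alternation.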

Performance Improvement

Capturing Groups

Unnecessary capturing groups can be removed, as in (xyz) => xyz, or made non-capturing in this way: (xyz) => (?:xyz)

Unnecessary Optionalities

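Where at least one character is always expected, changing .* to .+ may give some performance increase, since the engine no longer has to consider the empty match.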

Improved Regex

The regex below is an improved version of the regex above:

\(VFSCAN\)[^=]*=\s* # first line of a section: (VFSCAN) AT TIME = 1.1800 UP TO 100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200
(\d*(?:\.\d+)?) # group 1 - first number of first line: 1.1800
\D+
\d+ # second number of first line: 100
[^\d-]+
-?\d*(?:\.\d+)? # last number of first line: -0.200
\D+ # skip second line
(?: # a data line: 18436 [LENZIE 618.0] -0.245 18433 [LENZIE 318.0] -0.245
 (?: # a data entry: 18436 [LENZIE 618.0] -0.245
 \d+ # first number in an entry: 18436
 \s+\[
 (.+?) # group 2 - words in brackets: LENZIE
 -?\d*(?:\.\d+)? # number in brackets: 618.0
 \]\s+
 \S+ # last number (VDEV): -0.245
 \s*
 )+
 (?=[\r\n\s]+|$)
)+

Question 3

This is wonderful! If I wanted to match each section individually, what would I change in the above regex? For instance, if I only wanted to match the voltage deviation or the frequency, these only change on the first line. One would be "BUSES WITH LOW VOLTAGE DEVIATION BELOW" and the other would be "BUSES WITH LOW FREQUENCY BELOW". Where would I insert this to make the match unique for each of these groups?

Question 4

Oops, I hit enter instead of space. I wanted to express my gratitude for your response. This will definitely save me a lot of time. I am also learning a great deal about regular expressions and how powerful they are. This will take me some time to understand; however, I can use it while I learn. Thanks a million!!

Question 5

@user1642486 You're welcome. I improved my answer to cover the point you commented. Please see if it solves the problem.

Question 6

@user1642486 In StackOverflow, we upvote an answer if we find it helpful; we accept an answer if it solves our problem :)

Question 7

I thought I already sent this but my computer is acting up. I ran the regex with the changes you suggested and it took a very long time to run (> one hour). Since I only need the time from the first line I am wondering if removing all the groups except the time and the item name (included within the square braces) would reduce the execution time. I used the findall to pull all the matches, would it be better to use finditer instead since I will be checking each match for the item in [] and check the time if found? Thanks again! This has been more than an answer and more of a tutorial for me.
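
On the run time: nested repeated groups like the ones in the regex above can backtrack heavily on large inputs, so one alternative, offered here as a sketch rather than a drop-in replacement, is to drop the single large regex and scan the file line by line with two small patterns. It assumes each section ends at a blank line or at the next (VFSCAN) header. The function last_time_for_bus is hypothetical, and the third section of the sample (time 3.0000) is invented purely to exercise the "last occurrence" logic.

```python
import re

HEADER = re.compile(r"\(VFSCAN\) AT TIME =\s*([\d.]+).*?LOW ([A-Z ]+?) BELOW")
ENTRY = re.compile(r"(\d+)\s+\[([^\]]+)\]\s+(-?\d+\.\d+)")

def last_time_for_bus(lines, kind, bus_name):
    """Return the time of the last section of `kind` that lists `bus_name`."""
    current_time = None  # time of the section being read, if it is of `kind`
    last_time = None
    for line in lines:
        if not line.strip():          # a blank line closes the current section
            current_time = None
            continue
        header = HEADER.search(line)
        if header:
            time, section_kind = header.groups()
            current_time = time if section_kind == kind else None
            continue
        if current_time is None:      # outside a section of interest
            continue
        for _number, name, _value in ENTRY.findall(line):
            if name == bus_name:
                last_time = current_time
    return last_time

sample = """(VFSCAN) AT TIME = 1.1800 UP TO 100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:
X ----- BUS ------ X VDEV X ----- BUS ------ X VDEV
18436 [LENZIE 618.0] -0.245 18433 [LENZIE 318.0] -0.245

(VFSCAN) AT TIME = 2.6267 UP TO 100 BUSES WITH LOW VOLTAGE BELOW 0.700:
X ----- BUS ------ X VOLT X ----- BUS ------ X VOLT
65191 [BONANZA 24.0] 0.439 65194 [CHAPITA 138] 0.581

(VFSCAN) AT TIME = 3.0000 UP TO 100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:
X ----- BUS ------ X VDEV X ----- BUS ------ X VDEV
18436 [LENZIE 618.0] -0.230
"""

result = last_time_for_bus(sample.splitlines(), "VOLTAGE DEVIATION", "LENZIE 618.0")
print(result)  # 3.0000
```

Since the function only iterates over lines, an open file object can be passed in place of sample.splitlines(), which avoids reading the whole file into memory.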

Question 8

You are either trying to match VOLT with VDEV

(VFSCAN) AT TIME = 1.1800 UP TO 100 BUSES WITH LOW VOLTAGE DEVIATION BELOW -0.200:
X ----- BUS ------ X VDEV X ----- BUS ------ X VDEV

or you are trying to match -0.200 with 0.700

(VFSCAN) AT TIME = 2.6267 UP TO 100 BUSES WITH LOW VOLTAGE BELOW 0.700:
X ----- BUS ------ X VOLT X ----- BUS ------ X VOLT

mmdemirbas · Accepted Answer · 2012-09-03 20:30:11Z

