I have a file with around 440K lines of data. I need to read this data and find the actual "table" in the text file. Part of the text file looks like this:
[BEGIN] 2022年4月8日 14:00:05
<Z0301IPBBPE03>screen-length 0 temporary
Info: The configuration takes effect on the current user terminal interface only.
<Z0301IPBBPE03>display bgp vpnv4 vpn-instance Charging_VRF routing-table
BGP Local router ID is 10.12.24.19
Status codes: * - valid, > - best, d - damped, x - best external, a - add path,
h - history, i - internal, s - suppressed, S - Stale
Origin : i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V - valid, I - invalid, N - not-found
VPN-Instance Charging_VRF, Router ID 10.12.24.19:
Total Number of Routes: 2479
Network NextHop MED LocPrf PrefVal Path/Ogn
*>i 10.0.19.0/24 10.12.8.21 0 100 300 ?
* i 10.12.8.22 0 100 0 ?
*>i 10.0.143.0/24 10.12.8.21 0 100 300 ?
* i 10.12.8.22 0 100 0 ?
*>i 10.0.144.128/25 10.12.8.21 0 100 300 ?
* i 10.12.8.22 0 100 0 ?
*>i 10.0.148.80/32 10.12.8.21 0 100 300 ?
* i 10.12.8.22 0 100 0 ?
*>i 10.0.148.81/32 10.12.8.21 0 100 300 ?
* i 10.12.8.22 0 100 0 ?
*>i 10.0.201.16/28 10.12.8.21 0 100 300 ?
* i 10.12.8.22 0 100 0 ?
*>i 10.0.201.64/29 10.12.8.21 0 100 300 ?
* i 10.12.8.22 0 100 0 ?
*>i 10.0.201.94/32 10.12.8.21 0 100 300 ?
* i 10.12.8.22 0 100 0 ?
...
<Z0301IPBBPE03>display bgp vpnv4 vpn-instance Gb_VRF routing-table
BGP Local router ID is 10.12.24.19
Status codes: * - valid, > - best, d - damped, x - best external, a - add path,
h - history, i - internal, s - suppressed, S - Stale
Origin : i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V - valid, I - invalid, N - not-found
VPN-Instance Gb_VRF, Router ID 10.12.24.19:
Total Number of Routes: 1911
Network NextHop MED LocPrf PrefVal Path/Ogn
*>i 10.1.133.192/30 10.12.8.63 0 100 300 ?
* i 10.12.8.63 0 100 0 ?
*>i 10.1.133.216/30 10.12.8.64 0 100 300 ?
* i 10.12.8.64 0 100 0 ?
*>i 10.1.160.248/29 10.12.40.7 0 100 300 ?
* i 10.12.40.7 0 100 0 ?
*>i 10.1.161.0/29 10.12.40.8 0 100 300 ?
* i 10.12.40.8 0 100 0 ?
*>i 10.1.161.248/32 10.12.40.7 2 100 300 ?
* i 10.12.40.7 2 100 0 ?
*>i 10.1.161.249/32 10.12.40.7 2 100 300 ?
* i 10.12.40.7 2 100 0 ?
*>i 10.1.164.248/29 10.12.40.7 0 100 300 ?
* i 10.12.40.7 0 100 0 ?
*>i 10.1.165.0/29 10.12.40.8 0 100 300 ?
* i 10.12.40.8 0 100 0 ?
*>i 10.1.165.248/32 10.12.40.7 2 100 300 ?
* i 10.12.40.7 2 100 0 ?
The text file goes on for a long time and contains plenty of garbage lines that I do not want, so I look for the keywords (display bgp vpnv4 vpn-instance) and start reading once I find them. The code below then converts the table into my dataframe.
My problem is that reading these 440K lines of data and converting them into a dataframe takes almost half an hour. I am here to ask whether there is a better way to improve the efficiency.
import pandas as pd
import ipaddress
from chardet import detect

def validate_ipaddress(ip_address):
    try:
        ip = ipaddress.IPv4Network(ip_address)
        return True
    except ValueError:
        return False

def get_encoding_type(file):
    with open(file, 'rb') as f:
        data = f.read()
    return detect(data)['encoding']

bgp_df = pd.DataFrame()
vrf_list = ['Charging_VRF', 'Gb_VRF', 'Gn_VRF']

def generate_bgp_network_list(block, vrf):
    ip_address_list = block.split('\n')
    ip_addresses = [[address for address in ip_address.strip().split(' ') if address] for ip_address in ip_address_list if ip_address] # generate list of lines
    ip_addresses = [address for address in ip_addresses if len(address) > 0] # remove empty list
    ip_addresses = [(ipaddress.IPv4Network(ip_address[1], False), ip_address[-1]) for ip_address in ip_addresses if validate_ipaddress(ip_address[1])]
    bgp_data = [{'ip_network': address, 'vrf': vrf, 'as_number': as_number} for address, as_number in ip_addresses]
    bgp_df = bgp_df.append(bgp_data, index=False)

def read_bgp_file(file):
    if file == '':
        return
    file = open(file, encoding=get_encoding_type(file))
    lines = file.readlines()
    start = False
    block = ''
    lines = iter(lines)
    for line in lines:
        if '<' in line and len(block) > 0:
            generate_bgp_network_list(block, vrf)
            start = False
            block = ''
        if f'display bgp vpnv4 vpn-instance' in line:
            vrf = line.strip().split(' ')[-2]
            if vrf in vrf_list:
                start = True
        if start:
            block += line
1 Answer
For mid- to large-scale file processing, it is probably best to operate on string slices directly instead of on individual lines.
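As a rough illustration of the slicing idea, the sketch below pulls out the text between a start marker and the next end marker using only `str.find` and slices, without ever splitting into lines. The helper name `iter_sections` and the tiny sample text are made up for this example; they are not part of the original code.

```python
from typing import Iterator


def iter_sections(text: str, start_marker: str, end_marker: str) -> Iterator[str]:
    """Yield the text between each start_marker and the next end_marker."""
    pos = 0
    while True:
        start = text.find(start_marker, pos)
        if start == -1:
            return  # no more sections
        body_start = start + len(start_marker)
        end = text.find(end_marker, body_start)
        if end == -1:
            end = len(text)  # last section runs to the end of the text
        yield text[body_start:end]
        pos = end


# Hypothetical miniature of a session log: two commands, each followed by rows.
sample = "<r1>cmd A\nrow 1\nrow 2\n<r1>cmd B\nrow 3\n"
sections = list(iter_sections(sample, "cmd ", "\n<"))
print(sections)
```

Handling the "end marker not found" case explicitly matters here: `str.find` returns -1, and feeding that back in as a start offset would restart the search near the end of the string.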
bgp_df should not be a global, and should not be mutated by generate_bgp_network_list. But also: does your code actually work? You need to declare that variable as a global for your assignment to have any effect. Also, index is not a valid kwarg for append; perhaps you're looking for ignore_index.
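One way to avoid the global, sketched under the assumption that each block has already been reduced to `(network, origin)` pairs: have the per-block function return its own frame, and concatenate once at the end. The names `parse_block` and `combine` are hypothetical. Note also that `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so `pd.concat` is the call to reach for on current versions.

```python
import pandas as pd


def parse_block(rows: list[tuple[str, str]], vrf: str) -> pd.DataFrame:
    """Build one frame per block; nothing outside the function is mutated."""
    return pd.DataFrame(
        [{"ip_network": net, "vrf": vrf, "as_number": origin} for net, origin in rows]
    )


def combine(blocks: dict[str, list[tuple[str, str]]]) -> pd.DataFrame:
    frames = [parse_block(rows, vrf) for vrf, rows in blocks.items()]
    # One concat at the end instead of growing a frame inside the loop.
    return pd.concat(frames, ignore_index=True)


# Hypothetical pre-parsed input, using a few rows from the question's sample.
blocks = {
    "Charging_VRF": [("10.0.19.0/24", "?"), ("10.0.143.0/24", "?")],
    "Gb_VRF": [("10.1.133.192/30", "?")],
}
bgp_df = combine(blocks)
print(bgp_df)
```

Growing a frame row-batch by row-batch copies the accumulated data each time, so a single concatenation is usually the cheaper pattern; as always, profile before relying on that.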
Much of your code misses the point of Pandas: it makes many annoying things easy, and you should always Google whether it's able to Do Your Thing before you do it yourself. read_fwf works perfectly with your data in inference mode and removes much of your manual parsing. You may or may not find a performance improvement when passing explicit colspecs.
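For reference, passing explicit `colspecs` to `read_fwf` could look roughly like the sketch below. The column offsets here are invented: the sample block is generated with f-string padding so that the offsets are known, whereas the real offsets would have to be read off the actual device output.

```python
from io import StringIO

import pandas as pd

# Hypothetical fixed-width layout: 5 chars of status flags, 19 for the
# network, 16 for the next hop, then the origin code.
rows = [
    ("*>i", "10.0.19.0/24", "10.12.8.21", "?"),
    ("* i", "", "10.12.8.22", "?"),
    ("*>i", "10.0.143.0/24", "10.12.8.21", "?"),
]
header = f"{'':<5}{'Network':<19}{'NextHop':<16}Path/Ogn"
lines = [header] + [f"{flag:<5}{net:<19}{hop:<16}{ogn}" for flag, net, hop, ogn in rows]
block = "\n".join(lines)

# Half-open (start, end) character ranges matching the padding above.
colspecs = [(0, 5), (5, 24), (24, 40), (40, 48)]
names = ["status", "ip_network", "next_hop", "as_number"]

# skiprows=1 drops the header line because explicit names are supplied.
df = pd.read_fwf(StringIO(block), colspecs=colspecs, names=names, skiprows=1)

# Continuation rows have an empty network field, which is read as NaN.
df = df[df["ip_network"].notna()]
print(df)
```

Explicit column ranges skip the inference pass over the first rows, which may or may not be measurable on 440K lines.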
vrf_list should be a set {} and not a list [].
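A small sketch of the set-based check, assuming the VRF name is the first word after the command prefix shown in the question. The function name `wanted_vrf` is hypothetical.

```python
from typing import Optional

WANTED_VRFS = {"Charging_VRF", "Gb_VRF", "Gn_VRF"}
PREFIX = "display bgp vpnv4 vpn-instance "


def wanted_vrf(line: str) -> Optional[str]:
    """Return the VRF name if the command line targets a wanted VRF."""
    _, found, rest = line.partition(PREFIX)
    if not found:
        return None  # not a routing-table command at all
    vrf = rest.split(" ", 1)[0]
    # Set membership is a hash lookup rather than a scan of a list.
    return vrf if vrf in WANTED_VRFS else None


print(wanted_vrf("<Z0301IPBBPE03>display bgp vpnv4 vpn-instance Gb_VRF routing-table"))
```

With only three names the speed difference is negligible; the set mostly documents that order and duplicates are irrelevant.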
Suggested
This does not cover address validation. As with everything performance: don't take my word for it; test and profile.
from io import StringIO
from typing import Iterator

import pandas as pd

vrf_list = {'Charging_VRF', 'Gb_VRF', 'Gn_VRF'}


def generate_bgp_network_list(block: str, vrf: str) -> pd.DataFrame:
    with StringIO(block) as f:
        df = pd.read_fwf(f)
    df['vrf'] = vrf
    df = df.drop(columns=['Unnamed: 0', 'NextHop', 'MED', 'LocPrf', 'PrefVal'])
    df = df[df.Network.notna()]
    return df.rename(columns={'Network': 'ip_network', 'Path/Ogn': 'as_number'})


def read_blocks(content: str) -> Iterator[pd.DataFrame]:
    routes_end = 0
    vpn_prefix = 'display bgp vpnv4 vpn-instance '
    routes_prefix = 'Total Number of Routes'

    while True:
        vpn_start = content.find(vpn_prefix, routes_end)
        if vpn_start == -1:
            break

        vrf_start = vpn_start + len(vpn_prefix)
        vrf_end = content.find(' ', vrf_start)
        vrf = content[vrf_start: vrf_end]

        routes_start = 1 + content.find(
            '\n',
            content.find(routes_prefix, vrf_end)
        )
        routes_end = content.find('\n<', routes_start)
        routes = content[routes_start: routes_end]

        yield generate_bgp_network_list(routes, vrf)


def read_bgp_file(content: str) -> pd.DataFrame:
    return pd.concat(
        tuple(read_blocks(content)),
        ignore_index=True,
    )


def main() -> None:
    content = '''[BEGIN] 2022年4月8日 14:00:05
<Z0301IPBBPE03>screen-length 0 temporary
Info: The configuration takes effect on the current user terminal interface only.
<Z0301IPBBPE03>display bgp vpnv4 vpn-instance Charging_VRF routing-table
BGP Local router ID is 10.12.24.19
Status codes: * - valid, > - best, d - damped, x - best external, a - add path,
h - history, i - internal, s - suppressed, S - Stale
Origin : i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V - valid, I - invalid, N - not-found
VPN-Instance Charging_VRF, Router ID 10.12.24.19:
Total Number of Routes: 2479
Network NextHop MED LocPrf PrefVal Path/Ogn
*>i 10.0.19.0/24 10.12.8.21 0 100 300 ?
* i 10.12.8.22 0 100 0 ?
*>i 10.0.143.0/24 10.12.8.21 0 100 300 ?
* i 10.12.8.22 0 100 0 ?
*>i 10.0.144.128/25 10.12.8.21 0 100 300 ?
* i 10.12.8.22 0 100 0 ?
*>i 10.0.148.80/32 10.12.8.21 0 100 300 ?
* i 10.12.8.22 0 100 0 ?
*>i 10.0.148.81/32 10.12.8.21 0 100 300 ?
* i 10.12.8.22 0 100 0 ?
*>i 10.0.201.16/28 10.12.8.21 0 100 300 ?
* i 10.12.8.22 0 100 0 ?
*>i 10.0.201.64/29 10.12.8.21 0 100 300 ?
* i 10.12.8.22 0 100 0 ?
*>i 10.0.201.94/32 10.12.8.21 0 100 300 ?
* i 10.12.8.22 0 100 0 ?
<Z0301IPBBPE03>display bgp vpnv4 vpn-instance Gb_VRF routing-table
BGP Local router ID is 10.12.24.19
Status codes: * - valid, > - best, d - damped, x - best external, a - add path,
h - history, i - internal, s - suppressed, S - Stale
Origin : i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V - valid, I - invalid, N - not-found
VPN-Instance Gb_VRF, Router ID 10.12.24.19:
Total Number of Routes: 1911
Network NextHop MED LocPrf PrefVal Path/Ogn
*>i 10.1.133.192/30 10.12.8.63 0 100 300 ?
* i 10.12.8.63 0 100 0 ?
*>i 10.1.133.216/30 10.12.8.64 0 100 300 ?
* i 10.12.8.64 0 100 0 ?
*>i 10.1.160.248/29 10.12.40.7 0 100 300 ?
* i 10.12.40.7 0 100 0 ?
*>i 10.1.161.0/29 10.12.40.8 0 100 300 ?
* i 10.12.40.8 0 100 0 ?
*>i 10.1.161.248/32 10.12.40.7 2 100 300 ?
* i 10.12.40.7 2 100 0 ?
*>i 10.1.161.249/32 10.12.40.7 2 100 300 ?
* i 10.12.40.7 2 100 0 ?
*>i 10.1.164.248/29 10.12.40.7 0 100 300 ?
* i 10.12.40.7 0 100 0 ?
*>i 10.1.165.0/29 10.12.40.8 0 100 300 ?
* i 10.12.40.8 0 100 0 ?
*>i 10.1.165.248/32 10.12.40.7 2 100 300 ?
* i 10.12.40.7 2 100 0 ?
<'''
    bgp_df = read_bgp_file(content)
    print(bgp_df)
    '''
ip_network vrf as_number
0 10.0.19.0/24 Charging_VRF ?
1 10.0.143.0/24 Charging_VRF ?
2 10.0.144.128/25 Charging_VRF ?
3 10.0.148.80/32 Charging_VRF ?
4 10.0.148.81/32 Charging_VRF ?
5 10.0.201.16/28 Charging_VRF ?
6 10.0.201.64/29 Charging_VRF ?
7 10.0.201.94/32 Charging_VRF ?
0 10.1.133.192/30 Gb_VRF ?
1 10.1.133.216/30 Gb_VRF ?
2 10.1.160.248/29 Gb_VRF ?
3 10.1.161.0/29 Gb_VRF ?
4 10.1.161.248/32 Gb_VRF ?
5 10.1.161.249/32 Gb_VRF ?
6 10.1.164.248/29 Gb_VRF ?
7 10.1.165.0/29 Gb_VRF ?
8 10.1.165.248/32 Gb_VRF ?
'''
if __name__ == '__main__':
    main()
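The suggested code leaves address validation out. If it is needed, one possible approach, sketched here with the standard-library `ipaddress` module, is to parse each candidate once and keep the result, rather than validating first and parsing a second time. The helper name `parse_network` and the candidate list are made up for illustration.

```python
import ipaddress
from typing import Optional


def parse_network(text: str) -> Optional[ipaddress.IPv4Network]:
    """Return an IPv4Network, or None if text is not a valid network."""
    try:
        # strict=False accepts host bits, e.g. 10.0.19.5/24 -> 10.0.19.0/24
        return ipaddress.IPv4Network(text, strict=False)
    except ValueError:
        return None


# Hypothetical mix of valid prefixes and junk fields.
candidates = ["10.0.19.0/24", "10.0.148.80/32", "Network", "10.0.999.0/24"]
networks = [net for net in map(parse_network, candidates) if net is not None]
print(networks)
```

Both `AddressValueError` and `NetmaskValueError` derive from `ValueError`, so a single `except ValueError` covers malformed addresses and malformed masks.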
The snippet posted here is only for testing purposes, just to see what the best way is to speed up the whole thing. Anyway, thanks for the suggestion. I actually have a solution where I can skip the generate_bgp_network_list function, because that one is the culprit that slows down the whole process. – ReverseEngineer, Apr 13, 2022 at 4:35
pandas, ipaddress, etc), all variables / functions (data, validate_ipaddress(), get_encoding_type()) as well as how you are using your code (how you call your main function). Otherwise, this question will be closed because the code can't be run as it is.