\$\begingroup\$

I have a text file with around 440K lines of data. I need to read it and extract the actual "tables" from it. Part of the file looks like this:

[BEGIN] 2022年4月8日 14:00:05
<Z0301IPBBPE03>screen-length 0 temporary 
Info: The configuration takes effect on the current user terminal interface only.
<Z0301IPBBPE03>display bgp vpnv4 vpn-instance Charging_VRF routing-table
 
 BGP Local router ID is 10.12.24.19
 Status codes: * - valid, > - best, d - damped, x - best external, a - add path,
 h - history, i - internal, s - suppressed, S - Stale
 Origin : i - IGP, e - EGP, ? - incomplete
 RPKI validation codes: V - valid, I - invalid, N - not-found
 
 VPN-Instance Charging_VRF, Router ID 10.12.24.19:
 Total Number of Routes: 2479
 Network NextHop MED LocPrf PrefVal Path/Ogn
 *>i 10.0.19.0/24 10.12.8.21 0 100 300 ?
 * i 10.12.8.22 0 100 0 ?
 *>i 10.0.143.0/24 10.12.8.21 0 100 300 ?
 * i 10.12.8.22 0 100 0 ?
 *>i 10.0.144.128/25 10.12.8.21 0 100 300 ?
 * i 10.12.8.22 0 100 0 ?
 *>i 10.0.148.80/32 10.12.8.21 0 100 300 ?
 * i 10.12.8.22 0 100 0 ?
 *>i 10.0.148.81/32 10.12.8.21 0 100 300 ?
 * i 10.12.8.22 0 100 0 ?
 *>i 10.0.201.16/28 10.12.8.21 0 100 300 ?
 * i 10.12.8.22 0 100 0 ?
 *>i 10.0.201.64/29 10.12.8.21 0 100 300 ?
 * i 10.12.8.22 0 100 0 ?
 *>i 10.0.201.94/32 10.12.8.21 0 100 300 ?
 * i 10.12.8.22 0 100 0 ?
...
<Z0301IPBBPE03>display bgp vpnv4 vpn-instance Gb_VRF routing-table
 
 BGP Local router ID is 10.12.24.19
 Status codes: * - valid, > - best, d - damped, x - best external, a - add path,
 h - history, i - internal, s - suppressed, S - Stale
 Origin : i - IGP, e - EGP, ? - incomplete
 RPKI validation codes: V - valid, I - invalid, N - not-found
 
 VPN-Instance Gb_VRF, Router ID 10.12.24.19:
 Total Number of Routes: 1911
 Network NextHop MED LocPrf PrefVal Path/Ogn
 *>i 10.1.133.192/30 10.12.8.63 0 100 300 ?
 * i 10.12.8.63 0 100 0 ?
 *>i 10.1.133.216/30 10.12.8.64 0 100 300 ?
 * i 10.12.8.64 0 100 0 ?
 *>i 10.1.160.248/29 10.12.40.7 0 100 300 ?
 * i 10.12.40.7 0 100 0 ?
 *>i 10.1.161.0/29 10.12.40.8 0 100 300 ?
 * i 10.12.40.8 0 100 0 ?
 *>i 10.1.161.248/32 10.12.40.7 2 100 300 ?
 * i 10.12.40.7 2 100 0 ?
 *>i 10.1.161.249/32 10.12.40.7 2 100 300 ?
 * i 10.12.40.7 2 100 0 ?
 *>i 10.1.164.248/29 10.12.40.7 0 100 300 ?
 * i 10.12.40.7 0 100 0 ?
 *>i 10.1.165.0/29 10.12.40.8 0 100 300 ?
 * i 10.12.40.8 0 100 0 ?
 *>i 10.1.165.248/32 10.12.40.7 2 100 300 ?
 * i 10.12.40.7 2 100 0 ?

The file goes on for a long way and contains plenty of garbage lines that I do not want, so I search for the keywords (display bgp vpnv4 vpn-instance) and start reading once I find them. The code below then converts each table into my dataframe.

My problem is that reading these 440K lines and converting them into a dataframe takes almost half an hour. I am looking for a more efficient way to do this.

import pandas as pd
import ipaddress
from chardet import detect


def validate_ipaddress(ip_address):
    try:
        ip = ipaddress.IPv4Network(ip_address)
        return True
    except ValueError:
        return False


def get_encoding_type(file):
    with open(file, 'rb') as f:
        data = f.read()
    return detect(data)['encoding']


bgp_df = pd.DataFrame()
vrf_list = ['Charging_VRF', 'Gb_VRF', 'Gn_VRF']


def generate_bgp_network_list(block, vrf):
    ip_address_list = block.split('\n')
    ip_addresses = [[address for address in ip_address.strip().split(' ') if address] for ip_address in ip_address_list if ip_address]  # generate list of lines
    ip_addresses = [address for address in ip_addresses if len(address) > 0]  # remove empty list
    ip_addresses = [(ipaddress.IPv4Network(ip_address[1], False), ip_address[-1]) for ip_address in ip_addresses if validate_ipaddress(ip_address[1])]
    bgp_data = [{'ip_network': address, 'vrf': vrf, 'as_number': as_number} for address, as_number in ip_addresses]
    bgp_df = bgp_df.append(bgp_data, index=False)


def read_bgp_file(file):
    if file == '':
        return
    file = open(file, encoding=get_encoding_type(file))
    lines = file.readlines()
    start = False
    block = ''
    lines = iter(lines)
    for line in lines:
        if '<' in line and len(block) > 0:
            generate_bgp_network_list(block, vrf)
            start = False
            block = ''
        if f'display bgp vpnv4 vpn-instance' in line:
            vrf = line.strip().split(' ')[-2]
            if vrf in vrf_list:
                start = True
        if start:
            block += line
200_success
asked Apr 12, 2022 at 10:12
\$\endgroup\$
  • Hello and welcome on CR! Please post the full/working code such as: all the imports (pandas, ipaddress, etc), all variables / functions (data, validate_ipaddress(), get_encoding_type()) as well as how you are using your code (how you call your main function). Otherwise, this question will be closed because the code can't be run as it is. Commented Apr 12, 2022 at 10:23
  • What is the size of the file in bytes? Does it fit into your RAM? Commented Apr 12, 2022 at 10:26
  • @Reinderien I would expect ~500K lines to fit into any decent RAM ^^ Commented Apr 12, 2022 at 10:28
  • Crucially your excerpt does not show the end of the table, only an ellipsis. Can you include this please? Commented Apr 12, 2022 at 19:43
  • It would help if you show the expected result, given the data above. Commented Apr 13, 2022 at 14:42

1 Answer

\$\begingroup\$

For mid- to large-scale file processing, it is probably best to operate on string slices directly instead of on individual lines.

bgp_df should not be a global, and should not be mutated by generate_bgp_network_list. But also: does your code actually work? Without a global declaration, the assignment inside the function makes bgp_df a local name, so the call fails with UnboundLocalError instead of updating the module-level frame. Also, index is not a valid kwarg for append; perhaps you're looking for ignore_index.
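To illustrate the scoping point without pandas, here is a minimal sketch in which a plain list stands in for bgp_df. The names records, broken_append and parse_block are made up for the example. It shows the failure mode, and the alternative of returning rows from the parsing function so the caller decides how to combine them.

```python
records = []  # module-level accumulator, standing in for bgp_df


def broken_append(row):
    # Assigning to `records` makes the name local to this function, so the
    # right-hand side reads a local variable that has no value yet.
    records = records + [row]


def parse_block(block, vrf):
    """Return rows for the best routes instead of mutating shared state."""
    return [
        {'ip_network': line.split()[1], 'vrf': vrf}
        for line in block.splitlines()
        if line.lstrip().startswith('*>')
    ]


try:
    broken_append({'ip_network': '10.0.19.0/24'})
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__

sample = (
    ' *>i 10.0.19.0/24 10.12.8.21 0 100 300 ?\n'
    ' * i 10.12.8.22 0 100 0 ?'
)
rows = []
rows += parse_block(sample, 'Charging_VRF')
```

With pandas, the collected rows can be turned into a frame in one go, or per-block frames can be combined with a single pd.concat call. DataFrame.append was removed in pandas 2.0, so building the frame once is the safer pattern in any case.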

Much of your code misses the point of Pandas: it makes many annoying things easy, and you should always Google whether it's able to Do Your Thing before you do it yourself. read_fwf works perfectly with your data in inference mode and removes much of your manual parsing. You may or may not find a performance improvement when passing explicit colspecs.
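If you want to experiment with explicit colspecs, something along these lines could be a starting point. The character offsets are invented for a synthetic sample, because the excerpt in the question does not show the original column alignment; measure the real offsets from your own file. The column names flags, next_hop and med are likewise only illustrative.

```python
from io import StringIO

import pandas as pd

# Synthetic rows laid out at made-up offsets; real device output will differ.
rows = [
    ('*>i', '10.0.19.0/24', '10.12.8.21', '0'),
    ('* i', '', '10.12.8.22', '0'),
    ('*>i', '10.0.143.0/24', '10.12.8.21', '0'),
]
text = '\n'.join(
    f' {flags:<5}{network:<19}{next_hop:<15}{med}'
    for flags, network, next_hop, med in rows
)

# Half-open [start, end) character ranges matching the layout above.
colspecs = [(0, 6), (6, 25), (25, 40), (40, 44)]
df = pd.read_fwf(
    StringIO(text),
    colspecs=colspecs,
    header=None,
    names=['flags', 'ip_network', 'next_hop', 'med'],
)

# Rows without a network are alternate next hops for the previous prefix.
best = df[df['ip_network'].notna()]
```

Whether this beats inference mode on 440K lines is something to measure rather than assume.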

vrf_list should be a set {} and not a list [], since it is only used for membership tests.
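As a rough sketch of the membership test, assuming the VRF name always follows the vpn-instance keyword as it does in the excerpt (WANTED_VRFS and wanted_vrf are hypothetical names):

```python
WANTED_VRFS = frozenset({'Charging_VRF', 'Gb_VRF', 'Gn_VRF'})


def wanted_vrf(command_line):
    """Return the VRF name if the command targets a wanted VRF, else None."""
    parts = command_line.split()
    try:
        # The VRF name is the token right after the 'vpn-instance' keyword.
        vrf = parts[parts.index('vpn-instance') + 1]
    except (ValueError, IndexError):
        return None
    return vrf if vrf in WANTED_VRFS else None


hit = wanted_vrf(
    '<Z0301IPBBPE03>display bgp vpnv4 vpn-instance Gb_VRF routing-table'
)
miss = wanted_vrf('<Z0301IPBBPE03>screen-length 0 temporary')
```

With only three names the speed difference is negligible; the set mostly documents that order and duplicates do not matter.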

Suggested

This does not cover address validation. As with everything performance: don't take my word for it; test and profile.

from io import StringIO
from typing import Iterator

import pandas as pd

vrf_list = {'Charging_VRF', 'Gb_VRF', 'Gn_VRF'}


def generate_bgp_network_list(block: str, vrf: str) -> pd.DataFrame:
    # Let pandas infer the fixed-width columns from the table text.
    with StringIO(block) as f:
        df = pd.read_fwf(f)
    df['vrf'] = vrf
    df = df.drop(columns=['Unnamed: 0', 'NextHop', 'MED', 'LocPrf', 'PrefVal'])
    # Rows without a Network value are alternate next hops; drop them.
    df = df[df.Network.notna()]
    return df.rename(columns={'Network': 'ip_network', 'Path/Ogn': 'as_number'})


def read_blocks(content: str) -> Iterator[pd.DataFrame]:
    routes_end = 0
    vpn_prefix = 'display bgp vpnv4 vpn-instance '
    routes_prefix = 'Total Number of Routes'
    while True:
        # Find the next command line, searching from the end of the last table.
        vpn_start = content.find(vpn_prefix, routes_end)
        if vpn_start == -1:
            break
        vrf_start = vpn_start + len(vpn_prefix)
        vrf_end = content.find(' ', vrf_start)
        vrf = content[vrf_start: vrf_end]
        # The table starts on the line after 'Total Number of Routes'.
        routes_start = 1 + content.find(
            '\n',
            content.find(routes_prefix, vrf_end)
        )
        # The table ends at the next prompt line, which starts with '<'.
        routes_end = content.find('\n<', routes_start)
        routes = content[routes_start: routes_end]
        yield generate_bgp_network_list(routes, vrf)


def read_bgp_file(content: str) -> pd.DataFrame:
    return pd.concat(
        tuple(read_blocks(content)),
        ignore_index=True,
    )


def main() -> None:
    content = '''[BEGIN] 2022年4月8日 14:00:05
<Z0301IPBBPE03>screen-length 0 temporary 
Info: The configuration takes effect on the current user terminal interface only.
<Z0301IPBBPE03>display bgp vpnv4 vpn-instance Charging_VRF routing-table
 
 BGP Local router ID is 10.12.24.19
 Status codes: * - valid, > - best, d - damped, x - best external, a - add path,
 h - history, i - internal, s - suppressed, S - Stale
 Origin : i - IGP, e - EGP, ? - incomplete
 RPKI validation codes: V - valid, I - invalid, N - not-found
 
 VPN-Instance Charging_VRF, Router ID 10.12.24.19:
 Total Number of Routes: 2479
 Network NextHop MED LocPrf PrefVal Path/Ogn
 *>i 10.0.19.0/24 10.12.8.21 0 100 300 ?
 * i 10.12.8.22 0 100 0 ?
 *>i 10.0.143.0/24 10.12.8.21 0 100 300 ?
 * i 10.12.8.22 0 100 0 ?
 *>i 10.0.144.128/25 10.12.8.21 0 100 300 ?
 * i 10.12.8.22 0 100 0 ?
 *>i 10.0.148.80/32 10.12.8.21 0 100 300 ?
 * i 10.12.8.22 0 100 0 ?
 *>i 10.0.148.81/32 10.12.8.21 0 100 300 ?
 * i 10.12.8.22 0 100 0 ?
 *>i 10.0.201.16/28 10.12.8.21 0 100 300 ?
 * i 10.12.8.22 0 100 0 ?
 *>i 10.0.201.64/29 10.12.8.21 0 100 300 ?
 * i 10.12.8.22 0 100 0 ?
 *>i 10.0.201.94/32 10.12.8.21 0 100 300 ?
 * i 10.12.8.22 0 100 0 ?
<Z0301IPBBPE03>display bgp vpnv4 vpn-instance Gb_VRF routing-table
 
 BGP Local router ID is 10.12.24.19
 Status codes: * - valid, > - best, d - damped, x - best external, a - add path,
 h - history, i - internal, s - suppressed, S - Stale
 Origin : i - IGP, e - EGP, ? - incomplete
 RPKI validation codes: V - valid, I - invalid, N - not-found
 
 VPN-Instance Gb_VRF, Router ID 10.12.24.19:
 Total Number of Routes: 1911
 Network NextHop MED LocPrf PrefVal Path/Ogn
 *>i 10.1.133.192/30 10.12.8.63 0 100 300 ?
 * i 10.12.8.63 0 100 0 ?
 *>i 10.1.133.216/30 10.12.8.64 0 100 300 ?
 * i 10.12.8.64 0 100 0 ?
 *>i 10.1.160.248/29 10.12.40.7 0 100 300 ?
 * i 10.12.40.7 0 100 0 ?
 *>i 10.1.161.0/29 10.12.40.8 0 100 300 ?
 * i 10.12.40.8 0 100 0 ?
 *>i 10.1.161.248/32 10.12.40.7 2 100 300 ?
 * i 10.12.40.7 2 100 0 ?
 *>i 10.1.161.249/32 10.12.40.7 2 100 300 ?
 * i 10.12.40.7 2 100 0 ?
 *>i 10.1.164.248/29 10.12.40.7 0 100 300 ?
 * i 10.12.40.7 0 100 0 ?
 *>i 10.1.165.0/29 10.12.40.8 0 100 300 ?
 * i 10.12.40.8 0 100 0 ?
 *>i 10.1.165.248/32 10.12.40.7 2 100 300 ?
 * i 10.12.40.7 2 100 0 ?
 
 <'''
    bgp_df = read_bgp_file(content)
    print(bgp_df)
    '''
 ip_network vrf as_number
 0 10.0.19.0/24 Charging_VRF ?
 1 10.0.143.0/24 Charging_VRF ?
 2 10.0.144.128/25 Charging_VRF ?
 3 10.0.148.80/32 Charging_VRF ?
 4 10.0.148.81/32 Charging_VRF ?
 5 10.0.201.16/28 Charging_VRF ?
 6 10.0.201.64/29 Charging_VRF ?
 7 10.0.201.94/32 Charging_VRF ?
 0 10.1.133.192/30 Gb_VRF ?
 1 10.1.133.216/30 Gb_VRF ?
 2 10.1.160.248/29 Gb_VRF ?
 3 10.1.161.0/29 Gb_VRF ?
 4 10.1.161.248/32 Gb_VRF ?
 5 10.1.161.249/32 Gb_VRF ?
 6 10.1.164.248/29 Gb_VRF ?
 7 10.1.165.0/29 Gb_VRF ?
 8 10.1.165.248/32 Gb_VRF ?
    '''


if __name__ == '__main__':
    main()
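On the address validation that the suggested code leaves out: one possible approach, sketched here with only the standard library, is to parse each candidate once and keep the result, rather than constructing IPv4Network in a validator and then again for the real value. parse_network is a hypothetical helper, and strict=False is an assumption carried over from the second IPv4Network call in the question.

```python
import ipaddress


def parse_network(text):
    """Return an IPv4Network, or None if `text` is not a valid network."""
    try:
        # strict=False accepts networks with host bits set, e.g. 10.0.19.5/24.
        return ipaddress.IPv4Network(text, strict=False)
    except ValueError:
        return None


candidates = ['10.0.19.0/24', 'i', 'Network', '10.0.19.5/24']
networks = [
    net for net in map(parse_network, candidates) if net is not None
]
```

Note that the validator in the question uses the default strict=True while the later construction passes False, so the two calls do not accept the same inputs. A single helper avoids that mismatch.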
answered Apr 13, 2022 at 1:49
\$\endgroup\$
  • The snippet posted here is only for testing purposes, to see what the best way to speed up the whole thing is. Anyway, thanks for the suggestion. I actually found a solution where I can skip the "generate_bgp_network_list" function, because that one is the culprit that slows down the whole process. Commented Apr 13, 2022 at 4:35
