I am trying to create a robust, generic way to parse bar-delimited .usr files. At the moment I can read the file in, split it on |, and then index the fields with integers.
However, this always feels very rigid in its design and I want to try to avoid it.
What I would like is a way to map any bar-delimited file to JSON, or at least to a Python dict. I think I'm looking for some form of factory method.
Say the file looks like this:
Header|Header1|Header2|Header3
A|Entry1|Entry2|Entry3
B|Entry1|Entry2|Entry3
Footer|Footer1|Footer2|Footer3
It would be relatively straightforward. However, that approach becomes undesirable when you get files like this:
Header|Header1|Header2|Header3
A|Entry1|Entry2|Entry3
B|Entry1|Entry2|Entry3
A|Entry1|Entry2|Entry3
B|Entry1|Entry2|Entry3
Footer|Footer1|Footer2|Footer3
This represents a Header, a Tail (which are always the same in every file) and 2 entries (2 sets of Group1 and Group2).
So I also need to retain the fact that files have groups, and each set of groups has to be 'scooped' up together. For example, File X may have two groups (A and B); if File X had one entry it would look like this:
Header|Header1|Header2|Header3
A|Entry1|Entry2|Entry3
B|Entry1|Entry2|Entry3
Footer|Footer1|Footer2|Footer3
Two entries would look like this:
Header|Header1|Header2|Header3
A|Entry1|Entry2|Entry3
B|Entry1|Entry2|Entry3
A|Entry1|Entry2|Entry3
B|Entry1|Entry2|Entry3
Footer|Footer1|Footer2|Footer3
All the key names for File X are known, so I can use a lookup structure.
At the moment I have a Pandas implementation that looks like this:
import pandas as pd

df = pd.read_csv('file1.usr', sep='|')

header_names = ["HeaderKey", "HeaderKey1", "HeaderKey2", "HeaderKey3"]
footer_names = ["FooterKey", "FooterKey1", "FooterKey2", "FooterKey3"]
groups = {'A': ['AValueKey', 'A2ValueKey', 'A3ValueKey'],
          'B': ['BValueKey', 'B2ValueKey', 'B3ValueKey']}
first_group_name = 'A'

# Drop the footer row, then label each group set by cumulatively
# counting occurrences of the first group name ('A').
df1 = df.iloc[:-1]
s = df1.iloc[:, 0].eq(first_group_name).cumsum()

for i, x in df1.groupby(s):
    group = {}
    for k, v in x.set_index(x.columns[0]).T.to_dict('list').items():
        group[k] = dict(zip(groups[k], v))

header = dict(zip(header_names, df.columns))
footer = dict(zip(footer_names, df.iloc[-1]))
file = {'header': header, 'groups': group, 'footer': footer}
print(file)
{
    'groups': {
        'A': {
            'AValueKey': 'Entry1', 'A2ValueKey': 'Entry2', 'A3ValueKey': 'Entry3'
        },
        'B': {
            'BValueKey': 'Entry1', 'B2ValueKey': 'Entry2', 'B3ValueKey': 'Entry3'
        }
    },
    'header': {
        'HeaderKey': 'Header',
        'HeaderKey1': 'Header1',
        'HeaderKey2': 'Header2',
        'HeaderKey3': 'Header3',
    },
    'footer': {
        'FooterKey': 'Footer',
        'FooterKey1': 'Footer1',
        'FooterKey2': 'Footer2',
        'FooterKey3': 'Footer3',
    }
}
So it relies on having the structure:
header_names = ["HeaderKey", "HeaderKey1", "HeaderKey2", "HeaderKey3"]
trailer_names = ["FooterKey", "FooterKey1", "FooterKey2", "FooterKey3"]
groups = {'A': ['AValueKey', 'A2ValueKey', 'A3ValueKey'],
'B': ['BValueKey', 'B2ValueKey', 'B3ValueKey']}
first_group_name = 'A'
Are there any other ways that would be more efficient?
EDIT based on @Reinderien's answer
Updated data format:
Header|Header1|Header2|Header3
A|Entry1|Entry2|Entry3
B|Entry1|Entry2|Entry3
Footer|Footer1|Footer2|Footer3
Firstly, thanks for going out on a limb even though evidently I haven't provided a clear scope.
To address your points:
- Suggestions on global code, capitalized constants, tuples over lists, and tail/trailer are all noted, thanks :)
- Indication of scale: each file is under 5 KB, with a volume of between 10,000 and 100,000 files per day, i.e. this script would need to parse and load up to 100,000 5 KB files daily.
- Case of repeated groups: the file would look like this:
Header|Header1|Header2|Header3
A|Entry1|Entry2|Entry3
B|Entry1|Entry2|Entry3
A|Entry2|Entry3|Entry4
B|Entry2|Entry3|Entry4
Footer|Footer1|Footer2|Footer3
I take full responsibility for not being clearer in my question, but this is undesirable behavior. In the case of repeated groups, we would need to retain all the data but split it into two separate payloads. The header and footer will be the same for both; however, the groups part of each payload would contain the corresponding data. The first entry in each group line is always the same, but the data following it can differ. Roughly, the two payloads would have the shape sketched below. I hope that clears things up; please let me know.
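To illustrate (a sketch only, reusing the key names from my lookup structure; the entry values just mirror the repeated-groups sample above):

payload_1 = {
    'header': {'HeaderKey': 'Header', 'HeaderKey1': 'Header1',
               'HeaderKey2': 'Header2', 'HeaderKey3': 'Header3'},
    'groups': {
        'A': {'AValueKey': 'Entry1', 'A2ValueKey': 'Entry2', 'A3ValueKey': 'Entry3'},
        'B': {'BValueKey': 'Entry1', 'B2ValueKey': 'Entry2', 'B3ValueKey': 'Entry3'},
    },
    'footer': {'FooterKey': 'Footer', 'FooterKey1': 'Footer1',
               'FooterKey2': 'Footer2', 'FooterKey3': 'Footer3'},
}

# Second payload: same header and footer, but the group values taken
# from the second A/B pair in the file.
payload_2 = {
    'header': payload_1['header'],
    'groups': {
        'A': {'AValueKey': 'Entry2', 'A2ValueKey': 'Entry3', 'A3ValueKey': 'Entry4'},
        'B': {'BValueKey': 'Entry2', 'B2ValueKey': 'Entry3', 'B3ValueKey': 'Entry4'},
    },
    'footer': payload_1['footer'],
}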
1 Answer
Some suggestions for you:
- Avoid global code
- Make constants capitalized
- Use tuples instead of lists for immutable constants
- The standard terminology for the opposite of "header" is "footer", not "trailer"
- Given your description of scale, this is a very parallelizable problem and could easily be framed as a standard Python multiprocessing program (see the sketch after the example output below)
- The parsing of the serialized file format is shown in a separate generator function from the loading of the data into the dictionary format you've shown
- I have assumed that you wish to keep printing the dictionary to stdout, in which case pprint is more appropriate. If you want to serialize this to JSON, that is trivial using the json module.
- I have assumed that in the case of repeated groups, they are aggregated into a list per group key with no regard for uniqueness
- In the other answer, the suggestion to pass the result of zip directly to the dict constructor is a good one. Basically, zip takes two iterables and iterates over both of them at the same time; dict uses one as the keys and the other as the values, and this assumes that the order of the key iterable matches the order of the value iterable. A small illustration follows this list.
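For illustration only (not part of the suggested code below; the key and value tuples are just the header names and sample values from the question):

>>> keys = ('HeaderKey1', 'HeaderKey2', 'HeaderKey3')
>>> values = ('Header1', 'Header2', 'Header3')
>>> dict(zip(keys, values))
{'HeaderKey1': 'Header1', 'HeaderKey2': 'Header2', 'HeaderKey3': 'Header3'}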
The suggested code:
from collections import defaultdict
from pprint import pprint
from typing import Iterable, List, Sequence

HEADER_NAMES = ('HeaderKey1', 'HeaderKey2', 'HeaderKey3')
FOOTER_NAMES = ('FootKey1', 'FootKey2', 'FootKey3')
GROUPS = {'A': ('A1ValueKey', 'A2ValueKey', 'A3ValueKey'),
          'B': ('B1ValueKey', 'B2ValueKey', 'B3ValueKey')}


def parse(fn: str) -> Iterable[List[str]]:
    # Yield each line of the file, split on the bar delimiter
    with open(fn) as f:
        yield from (
            line.rstrip().split('|')
            for line in f
        )


def load(lines: Iterable[Sequence[str]]) -> dict:
    lines = iter(lines)
    heads = next(lines)        # first row is the header
    prev_line = next(lines)
    groups = defaultdict(list)
    for line in lines:
        # prev_line is always a group row here; the footer (last row)
        # never reaches the loop body
        group, *entries = prev_line
        groups[group].append(dict(zip(GROUPS[group], entries)))
        prev_line = line
    return {
        'header': dict(zip(HEADER_NAMES, heads)),
        'footer': dict(zip(FOOTER_NAMES, prev_line)),
        'groups': groups,
    }


if __name__ == '__main__':
    d = load(parse('file1.usr'))
    pprint(d)
This produces:
{'footer': {'FootKey1': 'Footer1',
            'FootKey2': 'Footer2',
            'FootKey3': 'Footer3'},
 'groups': defaultdict(<class 'list'>,
                       {'A': [{'A1ValueKey': 'Entry1',
                               'A2ValueKey': 'Entry2',
                               'A3ValueKey': 'Entry3'}],
                        'B': [{'B1ValueKey': 'Entry1',
                               'B2ValueKey': 'Entry2',
                               'B3ValueKey': 'Entry3'},
                              {'B1ValueKey': 'Entry4',
                               'B2ValueKey': 'Entry5',
                               'B3ValueKey': 'Entry6'}]}),
 'header': {'HeaderKey1': 'Header1',
            'HeaderKey2': 'Header2',
            'HeaderKey3': 'Header3'}}
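As a rough sketch of the multiprocessing point above (not part of the suggested code; the process-pool size, glob-based file discovery, and writing the JSON next to each input file are assumptions, not anything from your spec), the per-file work can be farmed out to a pool, reusing parse and load from above:

import json
from multiprocessing import Pool
from pathlib import Path


def convert(path: Path) -> Path:
    # Parse one .usr file and write the resulting dict next to it as JSON.
    out_path = path.with_suffix('.json')
    with out_path.open('w') as f:
        # defaultdict is a dict subclass, so json serializes it directly
        json.dump(load(parse(str(path))), f)
    return out_path


def convert_all(directory: str) -> None:
    usr_files = list(Path(directory).glob('*.usr'))
    with Pool() as pool:  # one worker process per CPU by default
        pool.map(convert, usr_files)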
- Thanks for this, I like the separation and use of the generator. I have edited my question to try and address your points. – Bob, Apr 30, 2020 at 8:49
- @Bob Edited for performance commentary and aggregation – Reinderien, Apr 30, 2020 at 13:53
- Thanks :D - If we wanted to output 2 separate payloads for each 'set' of groups A and B, and persist the Header and Footer rather than a defaultdict, how would this be achieved? – Bob, Apr 30, 2020 at 14:23
- We're in somewhat dangerous territory: the "evolving question/answer" is not really supported in Code Review. Let us please discuss this in chat: chat.stackexchange.com/rooms/107405 – Reinderien, Apr 30, 2020 at 14:32
- @Bob A very different, vectorized approach proposed in codereview.stackexchange.com/questions/241533 – Reinderien, May 1, 2020 at 2:00