I am trying to create a robust, generic way to parse bar-delimited .usr files. At the moment I can read the file in, split it on |, and then index the fields with integers.
However, this always feels very rigid in its design and I want to try to avoid it.
What I would like is a way to map any bar-delimited file to JSON, or at least to a Python dict. I think I'm looking for some form of factory method.
Say the file looks like this:
Header|Header1|Header2|Header3
A|Entry1|Entry2|Entry3
B|Entry1|Entry2|Entry3
Footer|Footer1|Footer2|Footer3
It would be relatively straightforward. However, that approach becomes undesirable when you get files like this:
Header|Header1|Header2|Header3
A|Entry1|Entry2|Entry3
B|Entry1|Entry2|Entry3
A|Entry1|Entry2|Entry3
B|Entry1|Entry2|Entry3
Footer|Footer1|Footer2|Footer3
This represents a Header, a Tail (which are always the same in every file) and 2 entries (2 sets of Group1 and Group2).
So I also need to retain the fact that files have groups, and each set of groups has to be 'scooped' up together. For example, File X may have two groups (A and B); if File X had one entry it would look like this:
Header|Header1|Header2|Header3
A|Entry1|Entry2|Entry3
B|Entry1|Entry2|Entry3
Footer|Footer1|Footer2|Footer3
Two entries would look like this:
Header|Header1|Header2|Header3
A|Entry1|Entry2|Entry3
B|Entry1|Entry2|Entry3
A|Entry1|Entry2|Entry3
B|Entry1|Entry2|Entry3
Footer|Footer1|Footer2|Footer3
All the key names for File X are known, so I can use a lookup structure.
At the moment I have a Pandas implementation that looks like this:
import pandas as pd

df = pd.read_csv('file1.usr', sep='|')

header_names = ["HeaderKey", "HeaderKey1", "HeaderKey2", "HeaderKey3"]
footer_names = ["FooterKey", "FooterKey1", "FooterKey2", "FooterKey3"]
groups = {'A': ['AValueKey', 'A2ValueKey', 'A3ValueKey'],
          'B': ['BValueKey', 'B2ValueKey', 'B3ValueKey']}
first_group_name = 'A'

# Drop the footer row, then label each group set by cumulatively
# counting occurrences of the first group name ('A').
df1 = df.iloc[:-1]
s = df1.iloc[:, 0].eq(first_group_name).cumsum()

for i, x in df1.groupby(s):
    group = {}
    for k, v in x.set_index(x.columns[0]).T.to_dict('list').items():
        group[k] = dict(zip(groups[k], v))

header = dict(zip(header_names, df.columns))
footer = dict(zip(footer_names, df.iloc[-1]))
file = {'header': header, 'groups': group, 'footer': footer}
print(file)
{
    'groups': {
        'A': {
            'AValueKey': 'Entry1', 'A2ValueKey': 'Entry2', 'A3ValueKey': 'Entry3'
        },
        'B': {
            'BValueKey': 'Entry1', 'B2ValueKey': 'Entry2', 'B3ValueKey': 'Entry3'
        }
    },
    'header': {
        'HeaderKey': 'Header',
        'HeaderKey1': 'Header1',
        'HeaderKey2': 'Header2',
        'HeaderKey3': 'Header3',
    },
    'footer': {
        'FooterKey': 'Footer',
        'FooterKey1': 'Footer1',
        'FooterKey2': 'Footer2',
        'FooterKey3': 'Footer3',
    }
}
So it relies on having the structure:
header_names = ["HeaderKey", "HeaderKey1", "HeaderKey2", "HeaderKey3"]
trailer_names = ["FooterKey", "FooterKey1", "FooterKey2", "FooterKey3"]
groups = {'A': ['AValueKey', 'A2ValueKey', 'A3ValueKey'],
'B': ['BValueKey', 'B2ValueKey', 'B3ValueKey']}
first_group_name = 'A'
Are there any other ways that would be more efficient?
EDIT based on @Reinderien's answer
Updated data format:
Header|Header1|Header2|Header3
A|Entry1|Entry2|Entry3
B|Entry1|Entry2|Entry3
Footer|Footer1|Footer2|Footer3
Firstly, thanks for going out on a limb even though evidently I haven't provided a clear scope.
To address your points:
- Suggestions on global code, capitalized constants, tuples over lists, and tail/trailer are all noted, thanks :)
- Indication of scale: each file is under 5 KB, with a volume of between 10,000 and 100,000 files per day, i.e. this script would need to parse and load up to 100,000 5 KB files daily.
- Case of repeated groups: the file would look like this:
Header|Header1|Header2|Header3
A|Entry1|Entry2|Entry3
B|Entry1|Entry2|Entry3
A|Entry2|Entry3|Entry4
B|Entry2|Entry3|Entry4
Footer|Footer1|Footer2|Footer3
I take full responsibility for not being clearer in my question, but this is undesirable behavior. In the case of repeated groups, we would need to retain all the data but split it into two separate payloads. The header and footer will be the same for both; however, the groups part of each payload would contain the corresponding data. The first entry in each group line is always the same, but the data following it can differ. Roughly, the two payloads would have the shape sketched below. I hope that clears things up; please let me know.
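To illustrate (a sketch only, reusing the key names from my lookup structure; the entry values just mirror the repeated-groups sample above):

payload_1 = {
    'header': {'HeaderKey': 'Header', 'HeaderKey1': 'Header1',
               'HeaderKey2': 'Header2', 'HeaderKey3': 'Header3'},
    'groups': {
        'A': {'AValueKey': 'Entry1', 'A2ValueKey': 'Entry2', 'A3ValueKey': 'Entry3'},
        'B': {'BValueKey': 'Entry1', 'B2ValueKey': 'Entry2', 'B3ValueKey': 'Entry3'},
    },
    'footer': {'FooterKey': 'Footer', 'FooterKey1': 'Footer1',
               'FooterKey2': 'Footer2', 'FooterKey3': 'Footer3'},
}

# Second payload: same header and footer, but the group values taken
# from the second A/B pair in the file.
payload_2 = {
    'header': payload_1['header'],
    'groups': {
        'A': {'AValueKey': 'Entry2', 'A2ValueKey': 'Entry3', 'A3ValueKey': 'Entry4'},
        'B': {'BValueKey': 'Entry2', 'B2ValueKey': 'Entry3', 'B3ValueKey': 'Entry4'},
    },
    'footer': payload_1['footer'],
}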
1 Answer
Some suggestions for you:
- Avoid global code
- Make constants capitalized
- Use tuples instead of lists for immutable constants
- The standard terminology for the opposite of "header" is "footer", not "trailer"
- Given your description of scale, this is a very parallelizable problem and could easily be framed as a standard Python multiprocessing program (see the sketch after the example output below)
- The parsing of the serialized file format is shown in a separate generator function from the loading of the data into the dictionary format you've shown
- I have assumed that you wish to keep printing the dictionary to stdout, in which case pprint is more appropriate. If you want to serialize this to JSON, that is trivial using the json module.
- I have assumed that in the case of repeated groups, they are aggregated into a list per group key with no regard for uniqueness
- In the other answer, the suggestion to pass the result of zip directly to the dict constructor is a good one. Basically, zip takes two iterables and iterates over both of them at the same time; dict uses one as the keys and the other as the values, and this assumes that the order of the key iterable matches the order of the value iterable. A small illustration follows this list.
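For illustration only (not part of the suggested code below; the key and value tuples are just the header names and sample values from the question):

>>> keys = ('HeaderKey1', 'HeaderKey2', 'HeaderKey3')
>>> values = ('Header1', 'Header2', 'Header3')
>>> dict(zip(keys, values))
{'HeaderKey1': 'Header1', 'HeaderKey2': 'Header2', 'HeaderKey3': 'Header3'}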
The suggested code:
from collections import defaultdict
from pprint import pprint
from typing import Iterable, List, Sequence

HEADER_NAMES = ('HeaderKey1', 'HeaderKey2', 'HeaderKey3')
FOOTER_NAMES = ('FootKey1', 'FootKey2', 'FootKey3')
GROUPS = {'A': ('A1ValueKey', 'A2ValueKey', 'A3ValueKey'),
          'B': ('B1ValueKey', 'B2ValueKey', 'B3ValueKey')}


def parse(fn: str) -> Iterable[List[str]]:
    # Yield each line of the file, split on the bar delimiter
    with open(fn) as f:
        yield from (
            line.rstrip().split('|')
            for line in f
        )


def load(lines: Iterable[Sequence[str]]) -> dict:
    lines = iter(lines)
    heads = next(lines)        # first row is the header
    prev_line = next(lines)
    groups = defaultdict(list)
    for line in lines:
        # prev_line is always a group row here; the footer (last row)
        # never reaches the loop body
        group, *entries = prev_line
        groups[group].append(dict(zip(GROUPS[group], entries)))
        prev_line = line
    return {
        'header': dict(zip(HEADER_NAMES, heads)),
        'footer': dict(zip(FOOTER_NAMES, prev_line)),
        'groups': groups,
    }


if __name__ == '__main__':
    d = load(parse('file1.usr'))
    pprint(d)
This produces:
{'footer': {'FootKey1': 'Footer1',
            'FootKey2': 'Footer2',
            'FootKey3': 'Footer3'},
 'groups': defaultdict(<class 'list'>,
                       {'A': [{'A1ValueKey': 'Entry1',
                               'A2ValueKey': 'Entry2',
                               'A3ValueKey': 'Entry3'}],
                        'B': [{'B1ValueKey': 'Entry1',
                               'B2ValueKey': 'Entry2',
                               'B3ValueKey': 'Entry3'},
                              {'B1ValueKey': 'Entry4',
                               'B2ValueKey': 'Entry5',
                               'B3ValueKey': 'Entry6'}]}),
 'header': {'HeaderKey1': 'Header1',
            'HeaderKey2': 'Header2',
            'HeaderKey3': 'Header3'}}
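As a rough sketch of the multiprocessing point above (not part of the suggested code; the process-pool size, glob-based file discovery, and writing the JSON next to each input file are assumptions, not anything from your spec), the per-file work can be farmed out to a pool, reusing parse and load from above:

import json
from multiprocessing import Pool
from pathlib import Path


def convert(path: Path) -> Path:
    # Parse one .usr file and write the resulting dict next to it as JSON.
    out_path = path.with_suffix('.json')
    with out_path.open('w') as f:
        # defaultdict is a dict subclass, so json serializes it directly
        json.dump(load(parse(str(path))), f)
    return out_path


def convert_all(directory: str) -> None:
    usr_files = list(Path(directory).glob('*.usr'))
    with Pool() as pool:  # one worker process per CPU by default
        pool.map(convert, usr_files)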
- Thanks for this, I like the separation and use of the generator. I have edited my question to try and address your points. – Bob, Apr 30, 2020 at 8:49
- @Bob Edited for performance commentary and aggregation – Reinderien, Apr 30, 2020 at 13:53
- Thanks :D - If we wanted to output 2 separate payloads for each 'set' of groups A and B, and persist the Header and Footer rather than a defaultdict, how would this be achieved? – Bob, Apr 30, 2020 at 14:23
- We're in somewhat dangerous territory: the "evolving question/answer" is not really supported in Code Review. Let us please discuss this in chat: chat.stackexchange.com/rooms/107405 – Reinderien, Apr 30, 2020 at 14:32
- @Bob A very different, vectorized approach proposed in codereview.stackexchange.com/questions/241533 – Reinderien, May 1, 2020 at 2:00