I have code that works perfectly, but it uses too much memory.
Essentially, this code takes one input file (let's call it the index; it is 2-column, tab-separated) and searches a second input file (let's call it the data; it is 4-column, tab-separated) for a corresponding term in the 1st column, which it then replaces with the information from the index file.
An example of the index is:
amphibian anm|art|art|art|art
anaconda anm
aardvark anm
An example of the data is:
amphibian-n is green 10
anaconda-n is green 2
anaconda-n eats mice 1
aardvark-n eats plants 1
Thus, when replacing the value in Col 1 of data with the corresponding information from Index, the results are as follows:
anm-n is green
art-n is green
anm-n eats mice
anm-n eats plants
I divided the code into steps because the idea is to calculate the average of the values (Col 4 in the data) for a replaced item over Cols 2 and 3 of the data file. This code takes the total number of slot-fillers in the data file and sums their values, which is then used in Step 3.
The desired results are the following:
anm second hello 1.0
anm eats plants 1.0
anm first heador 0.333333333333
art first heador 0.666666666667
I open the same input file many times (i.e., three times, in Steps 1, 2 and 3) because I need to create several dictionaries in a certain order. However, the bottleneck is definitely between Steps 2 and 3. If I remove the function in Step 2, I can process the entire file (13 GB of RAM in approx. 30 minutes). However, the necessary addition of Step 2 consumes all memory before Step 3 even begins.
Is there a way to optimize how many times I open the same input file?
#!/usr/bin/python
# -*- coding: utf-8 -*-
from __future__ import division
from collections import defaultdict
import datetime

print "starting:",
print datetime.datetime.now()

mapping = dict()
with open('input-map', "rb") as oSenseFile:
    for line in oSenseFile:
        uLine = unicode(line, "utf8")
        concept, conceptClass = uLine.split()
        if len(concept) > 2:
            mapping[concept + '-n'] = conceptClass

print "- step 1:",
print datetime.datetime.now()

lemmas = set()
with open('input-data', "rb") as oIndexFile:
    for line in oIndexFile:
        uLine = unicode(line, "latin1")
        lemma = uLine.split()[0]
        if mapping.has_key(lemma):
            lemmas.add(lemma)

print "- step 2:",
print datetime.datetime.now()

featFreqs = defaultdict(lambda: defaultdict(float))
with open('input-data', "rb") as oIndexFile:
    for line in oIndexFile:
        uLine = unicode(line, "latin1")
        lemmaTAR, slot, filler, freq = uLine.split()
        featFreqs[slot][filler] += int(freq)

print "- step 3:",
print datetime.datetime.now()

classFreqs = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
with open('input-data', "rb") as oIndexFile:
    for line in oIndexFile:
        uLine = unicode(line, "latin1")
        lemmaTAR, slot, filler, freq = uLine.split()
        if lemmaTAR in lemmas:
            senses = mapping[lemmaTAR].split(u'|')
            for sense in senses:
                classFreqs[sense][slot][filler] += (int(freq) / len(senses)) / featFreqs[slot][filler]
        else:
            pass

print "- step 4:",
print datetime.datetime.now()

with open('output', 'wb') as oOutFile:
    for sense in sorted(classFreqs):
        for slot in classFreqs[sense]:
            for fill in classFreqs[sense][slot]:
                outstring = '\t'.join([sense, slot, fill,\
                    str(classFreqs[sense][slot][fill])])
                oOutFile.write(outstring.encode("utf8") + '\n')
Any suggestions on how to optimize this code to process large text files (e.g., >4GB)?
1 Answer
Don't use Python 2 any more; the rest of this answer will assume Python 3 without diving too much into the syntax. Most of the Unicode stuff needs to go away; see codecs for standard encoding names.
The desired results are the following

Are they really? hello doesn't appear in your sample input at all.
I open the same input file many times (i.e. 3 times) in Steps 1, 2 and 3 because I need to create several dictionaries that need to be created in a certain order
Don't do that. Just open it once and seek to the beginning as necessary.
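For instance, using the step functions from the full listing further down, the reuse looks like this:

with open('input-data', encoding='latin_1') as o_index_file:
    lemmas = step_1(mapping=mapping, o_index_file=o_index_file)   # first pass
    o_index_file.seek(0)                                          # rewind; no need to reopen
    feat_freqs = step_2(o_index_file=o_index_file)                # second pass over the same handle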
Your steps should be converted into functions.
Rather than printing datetime.now(), make a logger with an asctime field.
In Python 3 you should not be opening those files as rb; instead pass the appropriate encoding and open them in text mode.
Write a main function responsible for opening and closing files, and pass those files into subroutines.
featFreqs = defaultdict(lambda: defaultdict(float)) is not a good idea, because you only ever add integers; use int as the factory instead.
The indentation in step 4 is wild. That needs to be fixed up, and you need to keep references to intermediate indexed dictionary levels.
Yes, there are ways (that I don't demonstrate) where the file processing is partitioned to reduce memory burden. The tricky part becomes indexing into parts of a map that are not currently in memory. One approach is to produce a database (SQLite, possibly) that is well-indexed; it will have reasonable caching characteristics and can be gigantic without ruining your RAM during queries.
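As a rough sketch of that database idea (the table name, schema and function name below are purely illustrative), the step-2 counts could be accumulated on disk instead of in a dict:

import sqlite3
import typing

def step_2_sqlite(o_index_file: typing.TextIO, db_path: str = 'feat_freqs.sqlite') -> sqlite3.Connection:
    # Accumulate slot/filler totals in SQLite instead of an in-memory defaultdict.
    con = sqlite3.connect(db_path)
    con.execute(
        'CREATE TABLE IF NOT EXISTS feat_freqs ('
        'slot TEXT, filler TEXT, freq INTEGER, PRIMARY KEY (slot, filler))'
    )
    with con:  # single transaction for the whole pass
        for line in o_index_file:
            _lemma, slot, filler, freq = line.split()
            con.execute(
                'INSERT INTO feat_freqs VALUES (?, ?, ?) '
                'ON CONFLICT (slot, filler) DO UPDATE SET freq = freq + excluded.freq',
                (slot, filler, int(freq)),
            )
    return con

Step 3 would then run SELECT freq FROM feat_freqs WHERE slot = ? AND filler = ? instead of indexing feat_freqs[slot][filler]; the composite primary key keeps those lookups indexed, and nothing from step 2 has to stay resident in RAM.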
All together,
#!/usr/bin/env python3
import logging
import typing
from collections import defaultdict

type FreqDict = defaultdict[str, defaultdict[str, int]]
type ClassDict = defaultdict[str, defaultdict[str, defaultdict[str, float]]]


def setup_logger() -> logging.Logger:
    logging.basicConfig(
        level=logging.INFO, format='%(asctime)s %(message)s',
    )
    return logging.getLogger('indexer')


def start(o_sense_file: typing.TextIO) -> dict[str, str]:
    mapping: dict[str, str] = {}
    for line in o_sense_file:
        concept, concept_class = line.split()
        if len(concept) > 2:
            mapping[concept + '-n'] = concept_class
    return mapping


def step_1(mapping: dict[str, str], o_index_file: typing.TextIO) -> set[str]:
    lemmas = set()
    for line in o_index_file:
        lemma = line.split()[0]
        if lemma in mapping:
            lemmas.add(lemma)
    return lemmas


def step_2(o_index_file: typing.TextIO) -> FreqDict:
    feat_freqs = defaultdict(lambda: defaultdict(int))
    for line in o_index_file:
        lemmaTAR, slot, filler, freq = line.split()
        feat_freqs[slot][filler] += int(freq)
    return feat_freqs


def step_3(
    o_index_file: typing.TextIO, mapping: dict[str, str],
    lemmas: set[str], feat_freqs: FreqDict,
) -> ClassDict:
    class_freqs = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    for line in o_index_file:
        lemmaTAR, slot, filler, freq = line.split()
        if lemmaTAR in lemmas:
            senses = mapping[lemmaTAR].split('|')
            for sense in senses:
                class_freqs[sense][slot][filler] += int(freq) / len(senses) / feat_freqs[slot][filler]
    return class_freqs


def step_4(o_out_file: typing.TextIO, class_freqs: ClassDict) -> None:
    for sense in sorted(class_freqs.keys()):
        by_sense = class_freqs[sense]
        for slot, freqs in by_sense.items():
            for fill, freq in freqs.items():
                o_out_file.write(f'{sense}\t{slot}\t{fill}\t{freq}\n')


def main():
    logger.info('Starting')

    with open('input-map', encoding='utf_8') as o_sense_file:
        mapping = start(o_sense_file)

    with open('input-data', encoding='latin_1') as o_index_file:
        logger.info('Step 1')
        lemmas = step_1(mapping=mapping, o_index_file=o_index_file)

        logger.info('Step 2')
        o_index_file.seek(0)
        feat_freqs = step_2(o_index_file=o_index_file)

        logger.info('Step 3')
        o_index_file.seek(0)
        class_freqs = step_3(
            mapping=mapping, o_index_file=o_index_file, lemmas=lemmas, feat_freqs=feat_freqs,
        )

    logger.info('Step 4')
    with open('output', mode='w', encoding='utf_8') as o_out_file:
        step_4(o_out_file=o_out_file, class_freqs=class_freqs)


if __name__ == '__main__':
    logger = setup_logger()
    main()
Console output:
2025-01-11 00:06:06,813 Starting
2025-01-11 00:06:06,816 Step 1
2025-01-11 00:06:06,816 Step 2
2025-01-11 00:06:06,816 Step 3
2025-01-11 00:06:06,816 Step 4
Output file:
anm is green 0.3333333333333333
anm eats mice 1.0
anm eats plants 1.0
art is green 0.6666666666666666
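For reference, those numbers follow directly from the sample input: the total for is green is 10 + 2 = 12; amphibian-n has five senses (anm once, art four times), so it contributes (10/5)/12 per sense, and anaconda-n (single sense anm) contributes 2/12. That gives anm (2 + 2)/12 ≈ 0.333 and art (4 × 2)/12 ≈ 0.667 for is green, while eats mice and eats plants each work out to 1.0.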
Comments:

Why does anaconda become art in the example? The index maps it to anm.

anaconda does not become art; you are referring to amphibian, which is mapped to art. The example demonstrates that for each possible mapping, the info in Cols 2 and 3 is repeated.