I have code that works perfectly, but it uses too much memory.
Essentially, this code takes one input file (let's call it the index; it is 2-column, tab-separated) and searches a second input file (let's call it the data; it is 4-column, tab-separated) for a corresponding term in the 1st column, which it then replaces with the information from the index file.
An example of the index is:
amphibian anm|art|art|art|art
anaconda anm
aardvark anm
An example of the data is:
amphibian-n is green 10
anaconda-n is green 2
anaconda-n eats mice 1
aardvark-n eats plants 1
Thus, when replacing the value in Col 1 of data with the corresponding information from Index, the results are as follows:
anm-n is green
art-n is green
anm-n eats mice
anm-n eats plants
I divided the code into steps because the idea is to calculate the average of the values (Col 4 in the data) for a replaced item over Cols 2 and 3 of the data file. This code takes the total number of slot-fillers in the data file and sums their values, which is then used in Step 3.
The desired results are the following:
anm second hello 1.0
anm eats plants 1.0
anm first heador 0.333333333333
art first heador 0.666666666667
I open the same input file many times (i.e., three times, in Steps 1, 2 and 3) because I need to create several dictionaries in a certain order. However, the bottleneck is definitely between Steps 2 and 3. If I remove the function in Step 2, I can process the entire file (13 GB of RAM in approx. 30 minutes). However, the necessary addition of Step 2 consumes all memory before Step 3 even begins.
Is there a way to optimize how many times I open the same input file?
#!/usr/bin/python
# -*- coding: utf-8 -*-
from __future__ import division
from collections import defaultdict
import datetime

print "starting:",
print datetime.datetime.now()

mapping = dict()
with open('input-map', "rb") as oSenseFile:
    for line in oSenseFile:
        uLine = unicode(line, "utf8")
        concept, conceptClass = uLine.split()
        if len(concept) > 2:
            mapping[concept + '-n'] = conceptClass

print "- step 1:",
print datetime.datetime.now()

lemmas = set()
with open('input-data', "rb") as oIndexFile:
    for line in oIndexFile:
        uLine = unicode(line, "latin1")
        lemma = uLine.split()[0]
        if mapping.has_key(lemma):
            lemmas.add(lemma)

print "- step 2:",
print datetime.datetime.now()

featFreqs = defaultdict(lambda: defaultdict(float))
with open('input-data', "rb") as oIndexFile:
    for line in oIndexFile:
        uLine = unicode(line, "latin1")
        lemmaTAR, slot, filler, freq = uLine.split()
        featFreqs[slot][filler] += int(freq)

print "- step 3:",
print datetime.datetime.now()

classFreqs = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
with open('input-data', "rb") as oIndexFile:
    for line in oIndexFile:
        uLine = unicode(line, "latin1")
        lemmaTAR, slot, filler, freq = uLine.split()
        if lemmaTAR in lemmas:
            senses = mapping[lemmaTAR].split(u'|')
            for sense in senses:
                classFreqs[sense][slot][filler] += (int(freq) / len(senses)) / featFreqs[slot][filler]
        else:
            pass

print "- step 4:",
print datetime.datetime.now()

with open('output', 'wb') as oOutFile:
    for sense in sorted(classFreqs):
        for slot in classFreqs[sense]:
            for fill in classFreqs[sense][slot]:
                outstring = '\t'.join([sense, slot, fill,\
                    str(classFreqs[sense][slot][fill])])
                oOutFile.write(outstring.encode("utf8") + '\n')
Any suggestions on how to optimize this code to process large text files (e.g., >4GB)?
1 Answer
Don't use Python 2 any more; the rest of this answer will assume Python 3 without diving too much into the syntax. Most of the Unicode stuff needs to go away; see codecs for standard encoding names.
The desired results are the following

Are they really? hello doesn't appear in your sample input at all.
I open the same input file many times (i.e. 3 times) in Steps 1, 2 and 3 because I need to create several dictionaries that need to be created in a certain order
Don't do that. Just open it once and seek to the beginning as necessary.
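For instance, using the step functions from the full listing further down, the reuse looks like this:

with open('input-data', encoding='latin_1') as o_index_file:
    lemmas = step_1(mapping=mapping, o_index_file=o_index_file)   # first pass
    o_index_file.seek(0)                                          # rewind; no need to reopen
    feat_freqs = step_2(o_index_file=o_index_file)                # second pass over the same handle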
Your steps should be converted into functions.
Rather than printing datetime.now(), make a logger with an asctime field.
In Python 3 you should not be opening those files as rb; instead pass the appropriate encoding and open them in text mode.
Write a main function responsible for opening and closing files, and pass those files into subroutines.
featFreqs = defaultdict(lambda: defaultdict(float)) is not a good idea, because you only ever add integers; use int as the factory instead.
The indentation in step 4 is wild. That needs to be fixed up, and you need to keep references to intermediate indexed dictionary levels.
Yes, there are ways (that I don't demonstrate) where the file processing is partitioned to reduce memory burden. The tricky part becomes indexing into parts of a map that are not currently in memory. One approach is to produce a database (SQLite, possibly) that is well-indexed; it will have reasonable caching characteristics and can be gigantic without ruining your RAM during queries.
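As a rough sketch of that database idea (the table name, schema and function name below are purely illustrative), the step-2 counts could be accumulated on disk instead of in a dict:

import sqlite3
import typing

def step_2_sqlite(o_index_file: typing.TextIO, db_path: str = 'feat_freqs.sqlite') -> sqlite3.Connection:
    # Accumulate slot/filler totals in SQLite instead of an in-memory defaultdict.
    con = sqlite3.connect(db_path)
    con.execute(
        'CREATE TABLE IF NOT EXISTS feat_freqs ('
        'slot TEXT, filler TEXT, freq INTEGER, PRIMARY KEY (slot, filler))'
    )
    with con:  # single transaction for the whole pass
        for line in o_index_file:
            _lemma, slot, filler, freq = line.split()
            con.execute(
                'INSERT INTO feat_freqs VALUES (?, ?, ?) '
                'ON CONFLICT (slot, filler) DO UPDATE SET freq = freq + excluded.freq',
                (slot, filler, int(freq)),
            )
    return con

Step 3 would then run SELECT freq FROM feat_freqs WHERE slot = ? AND filler = ? instead of indexing feat_freqs[slot][filler]; the composite primary key keeps those lookups indexed, and nothing from step 2 has to stay resident in RAM.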
All together,
#!/usr/bin/env python3
import logging
import typing
from collections import defaultdict

type FreqDict = defaultdict[str, defaultdict[str, int]]
type ClassDict = defaultdict[str, defaultdict[str, defaultdict[str, float]]]


def setup_logger() -> logging.Logger:
    logging.basicConfig(
        level=logging.INFO, format='%(asctime)s %(message)s',
    )
    return logging.getLogger('indexer')


def start(o_sense_file: typing.TextIO) -> dict[str, str]:
    mapping: dict[str, str] = {}
    for line in o_sense_file:
        concept, concept_class = line.split()
        if len(concept) > 2:
            mapping[concept + '-n'] = concept_class
    return mapping


def step_1(mapping: dict[str, str], o_index_file: typing.TextIO) -> set[str]:
    lemmas = set()
    for line in o_index_file:
        lemma = line.split()[0]
        if lemma in mapping:
            lemmas.add(lemma)
    return lemmas


def step_2(o_index_file: typing.TextIO) -> FreqDict:
    feat_freqs = defaultdict(lambda: defaultdict(int))
    for line in o_index_file:
        lemmaTAR, slot, filler, freq = line.split()
        feat_freqs[slot][filler] += int(freq)
    return feat_freqs


def step_3(
    o_index_file: typing.TextIO, mapping: dict[str, str],
    lemmas: set[str], feat_freqs: FreqDict,
) -> ClassDict:
    class_freqs = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    for line in o_index_file:
        lemmaTAR, slot, filler, freq = line.split()
        if lemmaTAR in lemmas:
            senses = mapping[lemmaTAR].split('|')
            for sense in senses:
                class_freqs[sense][slot][filler] += int(freq) / len(senses) / feat_freqs[slot][filler]
    return class_freqs


def step_4(o_out_file: typing.TextIO, class_freqs: ClassDict) -> None:
    for sense in sorted(class_freqs.keys()):
        by_sense = class_freqs[sense]
        for slot, freqs in by_sense.items():
            for fill, freq in freqs.items():
                o_out_file.write(f'{sense}\t{slot}\t{fill}\t{freq}\n')


def main():
    logger.info('Starting')

    with open('input-map', encoding='utf_8') as o_sense_file:
        mapping = start(o_sense_file)

    with open('input-data', encoding='latin_1') as o_index_file:
        logger.info('Step 1')
        lemmas = step_1(mapping=mapping, o_index_file=o_index_file)

        logger.info('Step 2')
        o_index_file.seek(0)
        feat_freqs = step_2(o_index_file=o_index_file)

        logger.info('Step 3')
        o_index_file.seek(0)
        class_freqs = step_3(
            mapping=mapping, o_index_file=o_index_file, lemmas=lemmas, feat_freqs=feat_freqs,
        )

    logger.info('Step 4')
    with open('output', mode='w', encoding='utf_8') as o_out_file:
        step_4(o_out_file=o_out_file, class_freqs=class_freqs)


if __name__ == '__main__':
    logger = setup_logger()
    main()
Console output:
2025-01-11 00:06:06,813 Starting
2025-01-11 00:06:06,816 Step 1
2025-01-11 00:06:06,816 Step 2
2025-01-11 00:06:06,816 Step 3
2025-01-11 00:06:06,816 Step 4
Output file:
anm is green 0.3333333333333333
anm eats mice 1.0
anm eats plants 1.0
art is green 0.6666666666666666
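For reference, those numbers follow directly from the sample input: the total for is green is 10 + 2 = 12; amphibian-n has five senses (anm once, art four times), so it contributes (10/5)/12 per sense, and anaconda-n (single sense anm) contributes 2/12. That gives anm (2 + 2)/12 ≈ 0.333 and art (4 × 2)/12 ≈ 0.667 for is green, while eats mice and eats plants each work out to 1.0.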
Comments:

Why does anaconda become art in the example? The index maps it to anm.

anaconda does not become art; you are referring to amphibian, which is mapped to art. The example demonstrates that for each possible mapping, the info in Cols 2 and 3 is repeated.