I need to parse logs and have the following code. I can see two problems: map().filter()
may incur a performance penalty, and there is a copy-pasted block.
parser.py:
    class Info(object):
        a = ""
        j = ""
        z = ""
        infoline = ""

        def __init__(self, a, j, z, infoline):
            self.a = a
            self.j = j
            self.z = z
            self.infoline = infoline

    # Checks whether the line parameter was produced by provider 'prov'.
    # yes - returns a certain substring of line
    # no - returns None
    def get_infoline(line, prov):
        ...

    def process(line, prov):
        retA = None
        retJ = None
        retZ = None
        infoline = get_infoline(line, prov)
        if infoline:
            # filling some of retA, retJ, retZ
            ...
            return Info(retA, retJ, retZ, infoline)
job.py:
    from operator import add

    from pyspark import SparkContext
    import parser
    ...
    prov = ...
    log = sc.textFile(pathTofile)
    parsed = log.map(lambda ln: parser.process(ln, prov)).filter(lambda i: i)

    # Each summary counts one field, swaps to (count, name), sorts descending,
    # and writes "name<TAB>count" lines.
    summaryA = parsed.map(lambda info: (info.a, 1)).reduceByKey(add) \
        .map(lambda kv: (kv[1], kv[0])).sortByKey(False) \
        .map(lambda kv: "%s\t%i" % (kv[1], kv[0])) \
        .saveAsTextFile('/output/path/a.tsv')
    summaryJ = parsed.map(lambda info: (info.j, 1)).reduceByKey(add) \
        .map(lambda kv: (kv[1], kv[0])).sortByKey(False) \
        .map(lambda kv: "%s\t%i" % (kv[1], kv[0])) \
        .saveAsTextFile('/output/path/j.tsv')
    summaryZ = parsed.map(lambda info: (info.z, 1)).reduceByKey(add) \
        .map(lambda kv: (kv[1], kv[0])).sortByKey(False) \
        .map(lambda kv: "%s\t%i" % (kv[1], kv[0])) \
        .saveAsTextFile('/output/path/z.tsv')
1 Answer
With your first file, parser.py:

Because the Info class's __init__ method requires the three arguments a, j, and z, which it then assigns to the instance, you can remove the class attributes of the same name, as they become superfluous.

Within the process function, you set several variables before doing anything: retA, retJ, and retZ, which you set to None. If you aren't modifying these objects in place, you can omit these completely and add an else statement.

    if infoline:
        # filling some of retA, retJ, retZ
        return Info(retA, retJ, retZ)
    else:
        return Info(None, None, None)

In the aforementioned if statement, you can simply use

    if get_infoline(line, prov):

rather than setting the infoline variable. Then, if you need the result, you can define it within the statement.
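The first point can be checked with a small sketch. This is not the asker's real parser: the class name InfoWithClassAttrs and the sample argument values are made up for illustration. It only shows that, once __init__ assigns every field, an instance carries the same state whether or not the class-level defaults exist.

```python
class InfoWithClassAttrs(object):
    # Class-level defaults, as in the question's parser.py.
    a = ""
    j = ""
    z = ""
    infoline = ""

    def __init__(self, a, j, z, infoline):
        # These assignments create instance attributes that shadow the
        # class attributes above, so the defaults are never read.
        self.a = a
        self.j = j
        self.z = z
        self.infoline = infoline


class Info(object):
    # Same constructor, without the class-level defaults.
    def __init__(self, a, j, z, infoline):
        self.a = a
        self.j = j
        self.z = z
        self.infoline = infoline


# Hypothetical sample values, only to compare the two variants.
with_defaults = InfoWithClassAttrs("a1", None, "z1", "raw text")
without_defaults = Info("a1", None, "z1", "raw text")

# vars() returns the instance dictionary; both variants hold identical state.
same_state = vars(with_defaults) == vars(without_defaults)
```

Here same_state is True, which is why dropping the class attributes should not change how the rest of the code sees an Info object.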
(+1) 1. Thank you. 2. Actually, I need None for the next filter call. 3. Thank you, I missed infoline as an Info member. I fixed it. – Loom, commented Mar 31, 2016 at 9:29
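The second point of this comment can be sketched without Spark. The snippet below makes two assumptions so that it runs locally: Python's built-in map and filter stand in for the RDD methods, and get_infoline is a hypothetical stand-in that treats a "prov:" prefix as the provider marker (the real matching logic is elided in the question). It shows process returning None for non-matching lines, which the truthiness filter then drops.

```python
class Info(object):
    def __init__(self, a, j, z, infoline):
        self.a = a
        self.j = j
        self.z = z
        self.infoline = infoline


def get_infoline(line, prov):
    # Hypothetical stand-in: the question does not show the real logic.
    prefix = prov + ":"
    if line.startswith(prefix):
        return line[len(prefix):].strip()
    return None


def process(line, prov):
    infoline = get_infoline(line, prov)
    if not infoline:
        # Explicit None, so a later filter on truthiness drops this line.
        return None
    # Placeholder for "filling some of retA, retJ, retZ".
    return Info(None, None, None, infoline)


# Made-up sample log lines.
lines = ["acme: job started", "other: ignored", "acme: job finished"]

# Same shape as log.map(...).filter(lambda i: i) in job.py.
parsed = list(filter(lambda i: i, map(lambda ln: process(ln, "acme"), lines)))
infolines = [info.infoline for info in parsed]
```

If process returned Info(None, None, None) for non-matching lines instead, every element would be truthy and the filter step would keep all of them.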
... map method do? What type of object does it return? reduceByKey? filter?