\$\begingroup\$

I need to parse logs and have the following code. I can see two problems: the map().filter() chain may incur a performance penalty, and the summary block is copy-pasted three times.

parser.py:

class Info(object):
    a = ""
    j = ""
    z = ""
    infoline = ""

    def __init__(self, a, j, z, infoline):
        self.a = a
        self.j = j
        self.z = z
        self.infoline = infoline

# Checks whether the line was produced by provider 'prov':
#   yes - returns a certain substring of the line
#   no  - returns None
def get_infoline(line, prov):
    ...

def process(line, prov):
    retA = None
    retJ = None
    retZ = None
    infoline = get_infoline(line, prov)
    if infoline:
        # filling some of retA, retJ, retZ
        ...
        return Info(retA, retJ, retZ, infoline)

job.py:

from pyspark import SparkContext
import parser
...
prov = ...
log = sc.textFile(pathTofile)
parsed = log.map(lambda ln: parser.process(ln, prov)).filter(lambda i: i)
summaryA = parsed.map(lambda info: (info.a, 1)).reduceByKey(add) \
    .map(lambda (a,b): (b,a)).sortByKey(False) \
    .map(lambda (count, name): ("%s\t%i" % (name, count))) \
    .saveAsTextFile('/output/path/a.tsv')
summaryJ = parsed.map(lambda info: (info.j, 1)).reduceByKey(add) \
    .map(lambda (a,b): (b,a)).sortByKey(False) \
    .map(lambda (count, name): ("%s\t%i" % (name, count))) \
    .saveAsTextFile('/output/path/j.tsv')
summaryZ = parsed.map(lambda info: (info.z, 1)).reduceByKey(add) \
    .map(lambda (a,b): (b,a)).sortByKey(False) \
    .map(lambda (count, name): ("%s\t%i" % (name, count))) \
    .saveAsTextFile('/output/path/z.tsv')
asked Mar 29, 2016 at 8:43
\$\endgroup\$
  • \$\begingroup\$ What does the map method do? What type of object does it return? reduceByKey? filter? \$\endgroup\$ Commented Mar 31, 2016 at 3:39
  • \$\begingroup\$ These functions return an RDD. \$\endgroup\$ Commented Mar 31, 2016 at 8:18

1 Answer

\$\begingroup\$

With your first file, parser.py:

  1. Because the Info class's __init__ method takes a, j, and z as arguments and assigns them to the instance, the class attributes of the same names are superfluous and can be removed.

  2. Within the process function, you set retA, retJ, and retZ to None before doing anything else. If you aren't modifying these objects in place, you can omit the initial assignments completely and add an else branch:

    if infoline:
        # filling some of retA, retJ, retZ
        return Info(retA, retJ, retZ)
    else:
        return Info(None, None, None)
  3. In the aforementioned if statement, you can simply use

    if get_infoline(line, prov):

    rather than setting the infoline variable first. Then, if you need the result, you can assign it inside the if block.
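As a rough illustration, here is a minimal sketch of how points 1 and 2 might look together in parser.py. It makes several assumptions: the body of get_infoline and the field-filling logic are elided in the question, so both are replaced by hypothetical stand-ins (a `"<prov>: <payload>"` line format and whitespace-separated fields). It also keeps the infoline variable, since the current Info stores it, and returns None instead of `Info(None, None, None)` for non-matching lines, because an Info instance is always truthy and would pass through `filter(lambda i: i)`.

```python
class Info(object):
    """Parsed fields of one log line; only instance attributes are needed."""

    def __init__(self, a, j, z, infoline):
        self.a = a
        self.j = j
        self.z = z
        self.infoline = infoline


def get_infoline(line, prov):
    # Hypothetical stand-in for the elided function: assumes lines look
    # like "<prov>: <payload>" and returns the payload, or None otherwise.
    prefix = prov + ": "
    if line.startswith(prefix):
        return line[len(prefix):]
    return None


def process(line, prov):
    infoline = get_infoline(line, prov)
    if not infoline:
        # None is falsy, so a later filter(lambda i: i) drops this line.
        return None
    # Hypothetical field extraction: assumes up to three whitespace-separated
    # fields; missing ones stay None, as with the original defaults.
    fields = infoline.split()[:3]
    fields += [None] * (3 - len(fields))
    ret_a, ret_j, ret_z = fields
    return Info(ret_a, ret_j, ret_z, infoline)
```

With this shape, the retA/retJ/retZ pre-initialisation disappears, and the existing `.filter(lambda i: i)` step in job.py keeps working unchanged.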

answered Mar 31, 2016 at 3:38
\$\endgroup\$
  • \$\begingroup\$ (+1). 1. Thank you. 2. Actually, I need None for the next filter call. 3. Thank you, I had missed infoline as an Info member. I fixed it. \$\endgroup\$ Commented Mar 31, 2016 at 9:29
