Python + Spark to parse and save logs

Question 1

I need to parse logs and have the following code. I can see two problems: the map().filter() chain may incur a performance penalty, and the summary block is copy-pasted three times.

parser.py:

class Info(object):
    a = ""
    j = ""
    z = ""
    infoline = ""
    def __init__(self, a, j, z, infoline):
        self.a = a
        self.j = j
        self.z = z
        self.infoline = infoline
# Checks whether the line parameter was produced by provider 'prov':
# yes - returns a certain substring of the line
# no - returns None
def get_infoline(line, prov):
    ...
def process(line, prov):
    retA = None
    retJ = None
    retZ = None
    infoline = get_infoline(line, prov)
    if infoline:
        # filling some of retA, retJ, retZ
        ...
        return Info(retA, retJ, retZ, infoline)

job.py:

from operator import add
from pyspark import SparkContext
import parser
...
prov = ...
log = sc.textFile(pathTofile)
parsed = log.map(lambda ln: parser.process(ln, prov)).filter(lambda i: i)
# Tuple-unpacking lambdas such as `lambda (a, b): ...` are Python 2 only;
# indexing the pair works on both Python 2 and Python 3.
summaryA = parsed.map(lambda info: (info.a, 1)).reduceByKey(add) \
 .map(lambda kv: (kv[1], kv[0])).sortByKey(False) \
 .map(lambda cn: "%s\t%i" % (cn[1], cn[0])) \
 .saveAsTextFile('/output/path/a.tsv')
summaryJ = parsed.map(lambda info: (info.j, 1)).reduceByKey(add) \
 .map(lambda kv: (kv[1], kv[0])).sortByKey(False) \
 .map(lambda cn: "%s\t%i" % (cn[1], cn[0])) \
 .saveAsTextFile('/output/path/j.tsv')
summaryZ = parsed.map(lambda info: (info.z, 1)).reduceByKey(add) \
 .map(lambda kv: (kv[1], kv[0])).sortByKey(False) \
 .map(lambda cn: "%s\t%i" % (cn[1], cn[0])) \
 .saveAsTextFile('/output/path/z.tsv')

Question 2

What does the map method do? What type of object does it return? reduceByKey? filter?

Question 3

These functions each return an RDD.
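Because map, reduceByKey, and the sort step each return a new RDD, the chain that job.py repeats three times could be factored into a single function that takes the parsed RDD as an argument. The following is only a sketch under stated assumptions: `save_summary` and its `field` parameter are names invented here, and it uses `RDD.sortBy` on the count instead of the original swap-then-`sortByKey` step.

```python
from operator import add


def save_summary(parsed, field, path):
    """Count how often each value of one Info attribute occurs and save
    the counts as tab-separated lines, most frequent first.

    `parsed` is assumed to be an RDD of Info objects, `field` the name of
    the attribute to summarise ("a", "j" or "z"), `path` the output path.
    """
    (parsed
     .map(lambda info: (getattr(info, field), 1))     # (value, 1) pairs
     .reduceByKey(add)                                # (value, count)
     .sortBy(lambda pair: pair[1], ascending=False)   # highest count first
     .map(lambda pair: "%s\t%i" % pair)               # "value<TAB>count"
     .saveAsTextFile(path))


# Possible use in job.py (not executed here, since it needs a SparkContext):
#
#     for field in ("a", "j", "z"):
#         save_summary(parsed, field, "/output/path/%s.tsv" % field)
```

Since `parsed` would feed three separate jobs, calling `parsed.cache()` beforehand may avoid re-parsing the log for each summary; whether that pays off depends on the data size and available memory.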

Question 4

With your first file, parser.py:

1. Because the Info class's __init__ method already assigns every argument (a, j, z, and infoline) to the instance, you can remove the class attributes of the same name; they are superfluous.
2. Within the process function, you set retA, retJ, and retZ to None before doing anything else. If you aren't modifying these objects in place, you can omit the initializations entirely and add an else branch:
```
if infoline:
    # filling some of retA, retJ, retZ
    return Info(retA, retJ, retZ, infoline)
else:
    return Info(None, None, None, None)
```
3. In the aforementioned if statement, you can test the call directly rather than first assigning the infoline variable:
```
if get_infoline(line, prov):
```
Then, if you need the result, you can assign it inside the if block.
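One caveat about the else branch: an Info instance defines no truthiness hook, so it is always truthy, and an Info holding only None values would pass straight through the filter(lambda i: i) call in job.py. A minimal sketch of an early-return variant that still yields None for non-matching lines is shown below. The bodies of `get_infoline` and the "key=value" field extraction are placeholders invented for illustration, since the question elides the real logic.

```python
class Info(object):
    def __init__(self, a, j, z, infoline):
        self.a = a
        self.j = j
        self.z = z
        self.infoline = infoline


def get_infoline(line, prov):
    # Hypothetical stand-in: treat "<prov>: " as the provider prefix.
    prefix = prov + ": "
    return line[len(prefix):] if line.startswith(prefix) else None


def process(line, prov):
    infoline = get_infoline(line, prov)
    if not infoline:
        return None  # falsy, so a later filter(lambda i: i) drops it
    # Placeholder extraction: read "key=value" tokens from the infoline.
    fields = dict(tok.split("=", 1) for tok in infoline.split() if "=" in tok)
    # Fields that are absent simply stay None via dict.get().
    return Info(fields.get("a"), fields.get("j"), fields.get("z"), infoline)
```

Returning early removes the need to pre-initialise retA, retJ, and retZ, while keeping the None result that the filter step relies on.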

Question 5

(+1) 1. Thank you. 2. Actually, I need None for the next filter call. 3. Thank you, I missed infoline as an Info member; I fixed it.

Zach Gates · Answer 1 · 2016-03-31 03:38:53Z

