A dataset consists of N items. Each item is a pair of a word and a boolean denoting whether the given word is a spam word or not.
No word in the training set may be marked both as spam and as not-spam. For example, the items {"fck", 1} and {"fck", 0} can't both be present in the training set, because the first item says the word "fck" is spam, whereas the second item says it is not, which is a contradiction.
Your task is to select the maximum number of items in the training set.
Note that the same {word, bool} pair can appear multiple times in the input. The training set can also contain the same pair multiple times.
Question link is here.
What I have done so far:
- Grouped the given pairs by word and taken the count of spam and not-spam entries in each group.
- Finally, taken the max count of each group and summed these values.
Code:
from itertools import groupby
from collections import Counter
t = int(input())
for _ in range(t):
    n = int(input())
    data_set = sorted([tuple(map(str, input().split())) for __ in range(n)])
    print(sum(max(Counter(element[1]).values()) for element in groupby(data_set, key=lambda x: x[0])))
My code is working fine. Is my solution efficient? If not, please explain a more efficient algorithm. What is the time complexity of my algorithm? Is it O(N)?
Example Input:
3
3
abc 0
abc 1
efg 1
7
fck 1
fck 0
fck 1
body 0
body 0
body 0
ram 0
5
vv 1
vv 0
vv 0
vv 1
vv 1
Example Output:
2
6
3
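(For the second test case, for example: fck appears with flag 1 twice and with flag 0 once, so the two 1-entries are kept and the contradicting 0-entry is dropped; all three body 0-entries and the single ram 0-entry are consistent, giving 2 + 3 + 1 = 6.)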
Constraints:
- 1 ≤ T ≤ 10
- 1 ≤ N ≤ 25,000
- 1 ≤ |w_i| ≤ 5 for each valid i
- 0 ≤ s_i ≤ 1 for each valid i
- w_1, w_2, ..., w_N contain only lowercase English letters
Edit-1:
To take input from a file for testing:
with open('file.txt', 'r') as f:
    for _ in range(int(f.readline())):
        n = int(f.readline())
        data_set = [tuple(map(str, f.readline().replace('\n', '').split())) for __ in range(n)]
        print(data_set)
        # your logic here
# Output:
# [('abc', '0'), ('abc', '1'), ('efg', '1')]
# [('fck', '1'), ('fck', '0'), ('fck', '1'), ('body', '0'), ('body', '0'), ('body', '0'), ('ram', '0')]
# [('vv', '1'), ('vv', '0'), ('vv', '0'), ('vv', '1'), ('vv', '1')]
- "Did you get the dataset where the answer is wrong and the expected outcome?" – Maarten Fabré, Nov 28, 2019 at 8:59
- "@MaartenFabré No, they didn't provide that dataset. Is there any other way to debug?" – shaik moeed, Nov 28, 2019 at 9:00
- "@shaikmoeed, Ok, thanks for the edit, I'll look at it" – RomanPerekhrest, Nov 28, 2019 at 12:38
2 Answers
In search of the most performant and optimized solution for your task, let's consider 3 cases.
We'll test them on an input file containing the "Example Input" from the question.
Case #1 is your initial approach with the suggested sorted call, wrapped in a max_trainset_sorted_counter function:
from collections import Counter
from itertools import groupby

def max_trainset_sorted_counter():
    with open('lines.txt', 'r') as f:
        for line in f:
            n = int(line.strip())
            data_set = sorted(tuple(map(str, f.readline().split())) for __ in range(n))
            print(sum(max(Counter(element[1]).values()) for element in groupby(data_set, key=lambda x: x[0])))
When profiled (cProfile), it produces 179 function calls (165 primitive calls).
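For reference, such a profile can be reproduced along these lines (assuming max_trainset_sorted_counter is defined at the top level of the script; exact call counts vary with the Python version and input file):
import cProfile

cProfile.run('max_trainset_sorted_counter()')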
Case #2 is the approach from the previous answer, based on defaultdict and Counter, wrapped in a max_trainset_ddict_counter function:
from collections import Counter, defaultdict

def max_trainset_ddict_counter():
    with open('lines.txt', 'r') as f:
        for line in f:
            n = int(line.strip())
            results = defaultdict(Counter)
            data_set = (f.readline().replace('\n', '').split() for _ in range(n))
            for word, result_bool in data_set:
                results[word][result_bool] += 1
            print(sum(max(result_bools.values()) for word, result_bools in results.items()))
When profiled, it produces 140 function calls.
Case #3 is actually my solution, based on the itertools.groupby and itertools.islice features and a simple 2-step arithmetic trick to sum up the maximals (items with the greater number of occurrences of the 1 or 0 flag) of each group within a trainset:
- sum up all values of the group: sum_ones = sum(int(i[1]) for i in group), which actually gives the number of 1 values/flags
- then the maximal is found with a simple max(sum_ones, g_size - sum_ones), where g_size is assigned len(group) (the number of entries within a group). This will cover any combination of 0 and 1 in solitary or mixed sequences.
It's wrapped in a max_trainset_ext_sums function:
from itertools import groupby, islice

def max_trainset_ext_sums():
    with open('lines.txt', 'r') as f:
        for line in f:
            n = int(line.strip())
            sum_groups = 0
            for k, g in groupby(sorted(tuple(row.split()) for row in islice(f, n)), key=lambda x: x[0]):
                group = tuple(g)
                g_size = len(group)
                if g_size == 1:
                    sum_groups += g_size
                else:
                    sum_ones = sum(int(i[1]) for i in group)
                    sum_groups += max(sum_ones, g_size - sum_ones)
            print(sum_groups)
When profiled, it produces 99 function calls.
All 3 cases produce the same expected output:
2
6
3
But let's get to the time performance test (note that on a very small input sample the time difference may not be significant):
>>> from timeit import timeit
>>> timeit('max_trainset_sorted_counter()', setup='from __main__ import max_trainset_sorted_counter', number=1000)
0.1788159019779414
>>> timeit('max_trainset_ddict_counter()', setup='from __main__ import max_trainset_ddict_counter', number=1000)
0.17249802296282724
>>> timeit('max_trainset_ext_sums()', setup='from __main__ import max_trainset_ext_sums', number=1000)
0.14651802799198776
At first sight, I see 2 issues:
groupby
From the documentation:
"The operation of groupby() is similar to the uniq filter in Unix. It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function)."
(emphasis mine)
Compare your result for
fck 1
body 0
body 0
fck 0
body 0
ram 0
fck 1
to the result for
fck 1
fck 0
body 0
body 0
body 0
ram 0
fck 1
debugging
The data structure that is important for the correctness of your answer is:
[Counter(element[1]) for element in groupby(data_set, key=lambda x:x[0])]
Here you assemble the values, which you later count.
In the first case:
[Counter({('fck', '1'): 1, ('fck', '0'): 1}),
Counter({('body', '0'): 3}),
Counter({('ram', '0'): 1}),
Counter({('fck', '1'): 1})]
In the second case:
[Counter({('fck', '1'): 1}),
Counter({('body', '0'): 2}),
Counter({('fck', '0'): 1}),
Counter({('body', '0'): 1}),
Counter({('ram', '0'): 1}),
Counter({('fck', '1'): 1})]
The entries for fck are not grouped anymore.
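To see this break behaviour in isolation, here is a minimal standalone sketch (not from the original post):
from itertools import groupby

data = [('fck', '1'), ('body', '0'), ('fck', '0')]
# without sorting first, groupby starts a new group at every key change,
# so 'fck' ends up in two separate groups:
print([key for key, _ in groupby(data, key=lambda x: x[0])])  # ['fck', 'body', 'fck']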
The easiest way to solve this is by changing data_set to
data_set = sorted(tuple(map(str, input().split())) for __ in range(n))
Or you can use an intermediate 2-level dict, but that is more complicated. I don't know which one will be the fastest; a sketch of the 2-level idea follows.
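For illustration, a minimal sketch of that 2-level dict idea (the names counts and inner, and the sample data, are mine):
data_set = [('fck', '1'), ('fck', '0'), ('fck', '1'), ('body', '0')]
counts = {}  # two levels: word -> {flag: number of occurrences}
for word, flag in data_set:
    inner = counts.setdefault(word, {})
    inner[flag] = inner.get(flag, 0) + 1
# per word, keep whichever flag occurs more often
print(sum(max(flags.values()) for flags in counts.values()))  # 2 + 1 = 3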
alternative without sorted
You can use defaultdict and Counter to set up an alternative data structure:
from collections import defaultdict, Counter
input_str = """
fck 1
body 0
body 0
fck 0
body 0
ram 0
fck 1
"""
results = defaultdict(Counter)
data_set = (row.split() for row in input_str.split("\n") if row)
for word, result_bool in data_set:
    results[word][result_bool] += 1
After this, results is:
defaultdict(collections.Counter, {'fck': Counter({'1': 2, '0': 1}), 'body': Counter({'0': 3}), 'ram': Counter({'0': 1})})
and the maximum training set size is:
sum(max(result_bools.values()) for word, result_bools in results.items())
This does not need the sorting.
- "@shaikmoeed Considering there are 2 for loops in your code, probably not." – Mast, Nov 28, 2019 at 12:06
- "@Mast Thanks for letting me know. Can we solve this problem in O(N)? Or what is the optimal time complexity we can achieve for this problem?" – shaik moeed, Nov 28, 2019 at 12:14
- "The outer loop loops over the test cases, so I wouldn't include that in the time complexity analysis. The sorted is O(n log n) (depending on the sortedness of the input); the alternative solution loops once over all the lines, once over all the words, so it looks O(n) to me." – Maarten Fabré, Nov 28, 2019 at 17:23