A dataset consists of N items. Each item is a pair of a word and a boolean denoting whether the given word is a spam word or not.
No word in the training set may be marked both as spam and as not-spam. For example, the items {"fck", 1} and {"fck", 0} can't both be present in the training set, because the first item says the word "fck" is spam, whereas the second item says it is not, which is a contradiction.
Your task is to select the maximum number of items in the training set.
Note that the same {word, bool} pair can appear multiple times in the input. The training set can also contain the same pair multiple times.
Question link is here.
What I have done so far:
- Grouped the given pairs by word and taken the count of spam and not-spam entries in each group.
- Finally, taken the max count of each group and summed these values.
Code:
from itertools import groupby
from collections import Counter
t = int(input())
for _ in range(t):
    n = int(input())
    data_set = sorted([tuple(map(str, input().split())) for __ in range(n)])
    print(sum(max(Counter(element[1]).values()) for element in groupby(data_set, key=lambda x: x[0])))
My code is working fine. Is my solution efficient? If not, please explain a more efficient algorithm. What is the time complexity of my algorithm? Is it O(N)?
Example Input:
3
3
abc 0
abc 1
efg 1
7
fck 1
fck 0
fck 1
body 0
body 0
body 0
ram 0
5
vv 1
vv 0
vv 0
vv 1
vv 1
Example Output:
2
6
3
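(For the second test case, for example: fck appears with flag 1 twice and with flag 0 once, so the two 1-entries are kept and the contradicting 0-entry is dropped; all three body 0-entries and the single ram 0-entry are consistent, giving 2 + 3 + 1 = 6.)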
Constraints:
- 1 ≤ T ≤ 10
- 1 ≤ N ≤ 25,000
- 1 ≤ |w_i| ≤ 5 for each valid i
- 0 ≤ s_i ≤ 1 for each valid i
- w_1, w_2, ..., w_N contain only lowercase English letters
Edit-1:
To take input from a file for testing:
with open('file.txt', 'r') as f:
    for _ in range(int(f.readline())):
        n = int(f.readline())
        data_set = [tuple(map(str, f.readline().replace('\n', '').split())) for __ in range(n)]
        print(data_set)
        # your logic here
# Output:
# [('abc', '0'), ('abc', '1'), ('efg', '1')]
# [('fck', '1'), ('fck', '0'), ('fck', '1'), ('body', '0'), ('body', '0'), ('body', '0'), ('ram', '0')]
# [('vv', '1'), ('vv', '0'), ('vv', '0'), ('vv', '1'), ('vv', '1')]
- "Did you get the dataset where the answer is wrong and the expected outcome?" – Maarten Fabré, Nov 28, 2019 at 8:59
- "@MaartenFabré No, they didn't provide that dataset. Is there any other way to debug?" – shaik moeed, Nov 28, 2019 at 9:00
- "@shaikmoeed, Ok, thanks for the edit, I'll look at it" – RomanPerekhrest, Nov 28, 2019 at 12:38
2 Answers
In search of the most performant and optimized solution for your task, let's consider 3 cases.
We'll test them on an input file containing the "Example Input" from the question.
Case #1 is your initial approach with the suggested sorted call, wrapped in a max_trainset_sorted_counter function:
from collections import Counter
from itertools import groupby

def max_trainset_sorted_counter():
    with open('lines.txt', 'r') as f:
        for line in f:
            n = int(line.strip())
            data_set = sorted(tuple(map(str, f.readline().split())) for __ in range(n))
            print(sum(max(Counter(element[1]).values()) for element in groupby(data_set, key=lambda x: x[0])))
When profiled (cProfile), it produces 179 function calls (165 primitive calls).
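For reference, such a profile can be reproduced along these lines (assuming max_trainset_sorted_counter is defined at the top level of the script; exact call counts vary with the Python version and input file):
import cProfile

cProfile.run('max_trainset_sorted_counter()')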
Case #2 is the approach from the previous answer, based on defaultdict and Counter, wrapped in a max_trainset_ddict_counter function:
from collections import Counter, defaultdict

def max_trainset_ddict_counter():
    with open('lines.txt', 'r') as f:
        for line in f:
            n = int(line.strip())
            results = defaultdict(Counter)
            data_set = (f.readline().replace('\n', '').split() for _ in range(n))
            for word, result_bool in data_set:
                results[word][result_bool] += 1
            print(sum(max(result_bools.values()) for word, result_bools in results.items()))
When profiled, it produces 140 function calls.
Case #3 is actually my solution, based on the itertools.groupby and itertools.islice features and a simple 2-step arithmetic trick to sum up the maximals (items with the greater number of occurrences of the 1 or 0 flag) of each group within a trainset:
- sum up all values of the group: sum_ones = sum(int(i[1]) for i in group), which actually gives the number of 1 values/flags
- then the maximal is found with a simple max(sum_ones, g_size - sum_ones), where g_size is assigned len(group) (the number of entries within a group). This will cover any combination of 0 and 1 in solitary or mixed sequences.
It's wrapped in a max_trainset_ext_sums function:
from itertools import groupby, islice

def max_trainset_ext_sums():
    with open('lines.txt', 'r') as f:
        for line in f:
            n = int(line.strip())
            sum_groups = 0
            for k, g in groupby(sorted(tuple(row.split()) for row in islice(f, n)), key=lambda x: x[0]):
                group = tuple(g)
                g_size = len(group)
                if g_size == 1:
                    sum_groups += g_size
                else:
                    sum_ones = sum(int(i[1]) for i in group)
                    sum_groups += max(sum_ones, g_size - sum_ones)
            print(sum_groups)
When profiled, it produces 99 function calls.
All 3 cases produce the same expected output:
2
6
3
But let's get to the time performance test (note that on a very small input sample the time difference may not be significant):
>>> from timeit import timeit
>>> timeit('max_trainset_sorted_counter()', setup='from __main__ import max_trainset_sorted_counter', number=1000)
0.1788159019779414
>>> timeit('max_trainset_ddict_counter()', setup='from __main__ import max_trainset_ddict_counter', number=1000)
0.17249802296282724
>>> timeit('max_trainset_ext_sums()', setup='from __main__ import max_trainset_ext_sums', number=1000)
0.14651802799198776
At first sight, I see 2 issues:
groupby
From the documentation:
"The operation of groupby() is similar to the uniq filter in Unix. It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function)."
(emphasis mine)
Compare your result for
fck 1
body 0
body 0
fck 0
body 0
ram 0
fck 1
to the result for
fck 1
fck 0
body 0
body 0
body 0
ram 0
fck 1
debugging
The data structure that is important for the correctness of your answer is:
[Counter(element[1]) for element in groupby(data_set, key=lambda x:x[0])]
Here you assemble the values, which you later count.
In the first case:
[Counter({('fck', '1'): 1, ('fck', '0'): 1}),
Counter({('body', '0'): 3}),
Counter({('ram', '0'): 1}),
Counter({('fck', '1'): 1})]
In the second case:
[Counter({('fck', '1'): 1}),
Counter({('body', '0'): 2}),
Counter({('fck', '0'): 1}),
Counter({('body', '0'): 1}),
Counter({('ram', '0'): 1}),
Counter({('fck', '1'): 1})]
The entries for fck are not grouped anymore.
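To see this break behaviour in isolation, here is a minimal standalone sketch (not from the original post):
from itertools import groupby

data = [('fck', '1'), ('body', '0'), ('fck', '0')]
# without sorting first, groupby starts a new group at every key change,
# so 'fck' ends up in two separate groups:
print([key for key, _ in groupby(data, key=lambda x: x[0])])  # ['fck', 'body', 'fck']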
The easiest way to solve this is by changing data_set to
data_set = sorted(tuple(map(str, input().split())) for __ in range(n))
Or you can use an intermediate 2-level dict, but that is more complicated. I don't know which one will be the fastest; a sketch of the 2-level idea follows.
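For illustration, a minimal sketch of that 2-level dict idea (the names counts and inner, and the sample data, are mine):
data_set = [('fck', '1'), ('fck', '0'), ('fck', '1'), ('body', '0')]
counts = {}  # two levels: word -> {flag: number of occurrences}
for word, flag in data_set:
    inner = counts.setdefault(word, {})
    inner[flag] = inner.get(flag, 0) + 1
# per word, keep whichever flag occurs more often
print(sum(max(flags.values()) for flags in counts.values()))  # 2 + 1 = 3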
alternative without sorted
You can use defaultdict and Counter to set up an alternative data structure:
from collections import defaultdict, Counter
input_str = """
fck 1
body 0
body 0
fck 0
body 0
ram 0
fck 1
"""
results = defaultdict(Counter)
data_set = (row.split() for row in input_str.split("\n") if row)
for word, result_bool in data_set:
    results[word][result_bool] += 1
After this, results is:
defaultdict(collections.Counter, {'fck': Counter({'1': 2, '0': 1}), 'body': Counter({'0': 3}), 'ram': Counter({'0': 1})})
and the maximum training set size is:
sum(max(result_bools.values()) for word, result_bools in results.items())
This does not need the sorting.
- "@shaikmoeed Considering there are 2 for loops in your code, probably not." – Mast, Nov 28, 2019 at 12:06
- "@Mast Thanks for letting me know. Can we solve this problem in O(N)? Or what is the optimal time complexity we can achieve for this problem?" – shaik moeed, Nov 28, 2019 at 12:14
- "The outer loop loops over the test cases, so I wouldn't include that in the time complexity analysis. The sorted is O(n log n) (depending on the sortedness of the input); the alternative solution loops once over all the lines, once over all the words, so it looks O(n) to me." – Maarten Fabré, Nov 28, 2019 at 17:23