
This is a problem from a past bioinformatics contest (requires an account). My solution works, but it is too slow for some test cases.

Input format

The first line of the input contains one integer \$T\$ \$(1 \leq T \leq 3)\$, the number of test cases. Each test case is specified by four lines.

The first line of each test case contains three integers \$M\$, \$K\$, \$N\$.

The second line contains \$M\$ numbers \$m_i\$ − masses of metabolites \$(0 < m_i\le 1000)\$.

The third line contains \$K\$ numbers \$a_i\$ − masses of adducts \$(-1000 \le a_i \le 1000)\$.

The fourth line contains \$N\$ numbers \$s_i\$ − masses of signals \$(0 < s_i\le 1000)\$.

All the masses are indicated with exactly six decimal places.

Output format

For each signal \$s_i\$ of each test case, print numbers \$j\$ and \$k\$ such that \$s_i = m_j+a_k+\Delta\$, \$m_j+a_k > 0\$, and the absolute value of \$\Delta\$ is as small as possible. If there are multiple pairs \$j\$, \$k\$ with the same absolute value of \$\Delta\$ for some signal, you may print any of them.

Sample input

3
2 2 5
1.000002 0.000002
0.500000 -0.500000
0.500001 0.500002 0.500003 1.000000 0.000001
2 2 5
1.000002 0.000001
0.500000 -0.500000
0.500001 0.500002 0.500003 1.000000 0.000001
5 4 7
0.000001 0.000002 0.000003 0.000004 0.000005
0.000002 0.000010 0.000001 -0.000001
0.000001 0.000002 0.000100 0.000005 0.000020 0.000010 0.000003

Sample output

1 2
1 2
1 2
1 2
1 2
2 1
1 2
1 2
1 2
2 1
2 4
1 3
5 2
3 1
5 2
1 2
1 1

Test cases

1.txt: \$M,K,N≤10\$
3.zip: \$M,K≤1000;N≤10^5\$
4.zip: \$M≤10^6;K,N≤1000\$
5.zip: \$M,K,N≤10^4\$
answers.zip: to test the solution

Code

from bisect import bisect_left
from time import perf_counter as pc

# Find in arr the closest number to n
def take_closest(arr, n):
    pos = bisect_left(arr, n)
    if pos == 0:
        return arr[0]
    if pos == len(arr):
        return arr[-1]
    before = arr[pos - 1]
    after = arr[pos]
    if after - n < n - before:
        return after
    else:
        return before

def solve(masses, adducts, signals):
    totals = {}
    for i, m in enumerate(masses):
        for j, a in enumerate(adducts):
            ma = m + a
            if ma > 0:
                totals[ma] = (i + 1, j + 1)
    skeys = sorted(totals.keys())
    for s in signals:
        closest = take_closest(skeys, s)
        yield totals[closest]

if __name__ == "__main__":
    test_num = 3
    of = open(f"out{test_num}.txt", "w")
    with open(f"{test_num}.txt", "r") as f:
        t0 = pc()
        t = int(f.readline())
        for _ in range(t):
            M, K, N = map(int, f.readline().strip().split())
            masses = list(map(float, f.readline().strip().split()))
            adducts = list(map(float, f.readline().strip().split()))
            signals = list(map(float, f.readline().strip().split()))
            for j, k in solve(masses, adducts, signals):
                of.write(f'{j} {k}\n')
        t1 = pc()
    print(f"Runtime: {round(t1-t0,3)} s")
    of.close()

Algorithm:

  1. Store all sums \$m_i + a_j\$ in a dictionary with indices \$i,j\$ as values.
  2. Sort the sums (the dictionary keys).
  3. For each signal, find the closest number among the sorted keys of the dictionary using binary search.

Issues:

The solution works but is too slow for test case 4, while test case 5 takes around 10 minutes on my machine.

Any feedback is appreciated.

Toby Speight
asked Jun 30, 2021 at 8:04
  • You can avoid creating a dict + sorted list: 1. Sort adducts. 2. For each signal[i] - mass[j], look for (bisect) an adduct. Commented Jun 30, 2021 at 9:50
  • @PavloSlavynskyy Thanks, I'll try your idea, but at the moment I'm not sure that will be enough. masses would need to be scanned for each signal, and M is rather large in test case 4. Feel free to post an answer; it will help me understand your idea better. Commented Jun 30, 2021 at 10:32
  • Sorry, this is code review, not write-my-code. But I've tested the idea: it's under 2 minutes for test case 5 on my i5-7500. Commented Jun 30, 2021 at 10:45
  • @PavloSlavynskyy I wasn't asking for code, but for a more formal answer. Comments should be for clarifications, as far as I know. Commented Jun 30, 2021 at 11:07
  • The only technological constraint I can see on the problem page is "using programming". Does it stipulate Python? Commented Jun 30, 2021 at 17:22

2 Answers


The complexity of your algorithm is:

Creating the dict of sums: \$O(MK)\$ (nested loops over the \$M\$ masses and \$K\$ adducts; inserting into a dict is amortized constant time).

Sorting the keys: \$O(MK\log(MK))\$ (sorting a list of size \$MK\$).

Looking up: \$O(N\log(MK))\$ (we do \$N\$ binary searches over an \$MK\$-sized list).

The total is \$O((MK+N)\log(MK))\$; for the case \$N=M=K\$ this is \$O(N^2\log(N^2)) = O(N^2\log N)\$, not great, not terrible.

\$\Delta\$ is an expression in three variables; to search for a minimum with logarithmic complexity, we still need to fix the other two variables, which gives \$O(N^2\log N)\$. Asymptotically that is about the same, but it avoids materializing all \$MK\$ sums, so I think it's worth a try. The question is which variables to loop over and which one to bisect. The task asks for an answer for every signal, so we have to loop over the signals. The idea is something like this:

adducts_dict = {adducts[k]: k for k in range(len(adducts))}
adducts = sorted(adducts)
for s in signals:
    for j, m in enumerate(masses):
        # bisect: find the adduct at minimal distance from s - m; save that adduct and j
    yield (closest_j, adducts_dict[closest_adduct])

The complexity here will be \$O(K + K\log K + N(M\log K + \log K)) = O((MN+K)\log K)\$. Slightly better.

One thing more: adducts and masses appear symmetrically in the expression for \$\Delta\$ and can be swapped; this gives \$O((KN+M)\log M)\$. It is best to keep the greater of \$(M, K)\$ in the added term, not the multiplied one, so for test case 4 you should sort and bisect-search the masses, not the adducts (just swap the arrays and the resulting pairs).
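A runnable sketch of that idea (my own illustrative code, untested against the contest data; the tuple-based tie-breaking and the `adduct_index` helper are assumptions, since any pair with minimal \$|\Delta|\$ is acceptable):

```python
from bisect import bisect_left

def solve(masses, adducts, signals):
    # Map each adduct mass back to its original 0-based index, then sort.
    adduct_index = {a: k for k, a in enumerate(adducts)}
    sorted_adducts = sorted(adducts)
    for s in signals:
        best = None  # (|delta|, j, k) with 1-based j, k
        for j, m in enumerate(masses):
            target = s - m
            pos = bisect_left(sorted_adducts, target)
            # The closest adduct is at pos or pos - 1; check both,
            # skipping candidates that violate m + a > 0.
            for p in (pos - 1, pos):
                if 0 <= p < len(sorted_adducts):
                    a = sorted_adducts[p]
                    if m + a > 0:
                        cand = (abs(target - a), j + 1, adduct_index[a] + 1)
                        if best is None or cand < best:
                            best = cand
        yield best[1], best[2]
```

For test case 4 (large \$M\$), swap the roles of the masses and adducts arrays as described above, and swap the indices back in each output pair.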

answered Jun 30, 2021 at 12:40

A minor addition: the "sample" test case is ambiguous. Consider:

$$s = 0.500001$$ $$s - 1.000002 - (-0.5) = s - 0.000002 - 0.5 = -0.000001$$

With synthesized tests you don't want to allow nondeterministic behaviour. It's better to choose inputs that unambiguously point to a single correct answer.
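The tie can be verified numerically; a quick sketch using the values from the first sample case:

```python
s = 0.500001
d1 = s - 1.000002 - (-0.5)  # pick m_1 = 1.000002, a_2 = -0.5
d2 = s - 0.000002 - 0.5     # pick m_2 = 0.000002, a_1 = 0.5
print(abs(d1), abs(d2))     # both magnitudes are (approximately) 1e-06
```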

LP: Don't do this

It's possible (though not advisable) to reframe your implementation as a mixed-integer linear programming problem where:

  • the structural variables are binary selection coefficients into the metabolite and adduct vectors
  • there are three auxiliary variables: to minimize the objective, and one for each of metabolite and adduct to enforce exactly one choice
  • since an abs needs to be applied, it requires two passes per value of s

This works(ish) but is very slow.
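In symbols, one pass of the scheme is (with binary selectors \$x_j, y_k\$ standing in for the structural variables; my notation, not GLPK's):

$$\max_{x,y}\ \sum_j x_j m_j + \sum_k y_k a_k \quad \text{s.t.}\quad \sum_j x_j m_j + \sum_k y_k a_k \le s,\quad \sum_j x_j = \sum_k y_k = 1,\quad x_j, y_k \in \{0,1\}$$

plus a mirrored minimization pass constrained by \$\ge s\$; whichever objective lands closer to \$s\$ yields the reported \$(j, k)\$.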

#include <assert.h>
#include <limits.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <glpk.h>
#define VERBOSE 1
const double epsilon = 1e-10;
 
static void fatal(const char *msg) {
 fprintf(stderr, "%s\n", msg);
 exit(1);
}
static void pfatal(const char *msg) {
 perror(msg);
 exit(1);
}
static void usage(const char *cmd) {
 fprintf(stderr, "Usage: %s problem-number [...]\n", cmd);
 exit(1);
}
static void open_files(int i_problem, FILE **file_in, FILE **file_ans) {
 char filename_in[NAME_MAX], filename_ans[NAME_MAX];
 
 snprintf(filename_in, NAME_MAX, "%d.txt", i_problem);
 *file_in = fopen(filename_in, "r");
 if (!*file_in) 
 pfatal("Failed to open input file");
 
 snprintf(filename_ans, NAME_MAX, "ans%d.txt", i_problem);
 *file_ans = fopen(filename_ans, "r");
 if (!*file_ans) 
 pfatal("Failed to open output file");
}
static void read_line(FILE *file, char *line, int n) {
 if (!fgets(line, n, file))
 pfatal("Input I/O");
 
 if (line[strlen(line) - 1] != '\n')
 fatal("Input line too long");
}
static void read_ints(FILE *file, int *array, int n) {
 const int field_chars = 12, buf_size = n*field_chars;
 char *line = malloc(buf_size);
 if (!line)
 pfatal("No memory for line");
 read_line(file, line, buf_size);
 
 const char *field = line;
 for (int i = 0; i < n; i++) {
 int consumed;
 if (sscanf(field, "%d%n", array + i, &consumed) != 1)
 pfatal("Bad input");
 field += consumed;
 }
 
 free(line);
}
static double *read_doubles(FILE *file, int n) {
 const int field_chars = 12, buf_size = n*field_chars;
 char *line = malloc(buf_size);
 if (!line)
 pfatal("No memory for line");
 read_line(file, line, buf_size);
 
 double *array = malloc(n * sizeof(double));
 if (!array)
 pfatal("No memory for input");
 
 const char *field = line;
 for (int i = 0; i < n; i++) {
 int consumed;
 if (sscanf(field, "%lf%n", array + i, &consumed) != 1)
 pfatal("Bad input");
 field += consumed;
 }
 
 free(line);
 return array;
}
static void read_case(
 FILE *file_in, 
 int *M, int *K, int *N,
 double **m, double **a, double **s
) {
 char line[256];
 read_line(file_in, line, sizeof(line));
 if (sscanf(line, "%d %d %d\n", M, K, N) != 3)
 fatal("Incorrect test case header format");
 
 printf("M=%d K=%d N=%d ", *M, *K, *N);
 #if VERBOSE
 putchar('\n');
 #endif
 fflush(stdout);
 
 if (*M < 1) fatal("Out-of-range M");
 if (*K < 1) fatal("Out-of-range K");
 if (*N < 1) fatal("Out-of-range N");
 
 *m = read_doubles(file_in, *M), // metabolites
 *a = read_doubles(file_in, *K), // adducts
 *s = read_doubles(file_in, *N); // signals
}
/*
For each given s, choose one m and one a to minimize |s - m - a|.
Show the indices in m and a.
In GLPK terms,
 x[:M+K]: structural "col" variables, the actual selection coefficients
 x'[:3]: auxiliary "row" variables, to constrain the solution
 z: objective, should approach s
 c: objective coefficients, equal to m and a concatenated
 A: constraint coefficients, three constraint rows, one col for each m,a
 l, u: lower and upper bounds
 
 c
z = [m m a a a][x]
 [x]
 [x]
 [x]
 [x]
 A
[x'] [m m a a a][x]
[x'] = [1 1 0 0 0][x]
[x'] [0 0 1 1 1][x]
 [x]
 [x]
 
Min|maximize z = cx subject to x' = Ax, l <= x <= u, l' <= x' <= u'
Synthesizing "minimize abs(s - m - a)" translates to:
 - Maximize m+a subject to m+a <= s
 - Minimize m+a subject to m+a >= s
 - Take whichever solution is closer to s
*/
static glp_prob *make_prob(
 int i_problem, int i_test, 
 int M, int K, const double *m, const double *a
) {
 glp_term_out(GLP_OFF);
 glp_prob *lp = glp_create_prob();
 char name[64];
 snprintf(name, sizeof(name), "stepik-bioinfo-2021-%d.%d", i_problem, i_test);
 glp_set_prob_name(lp, name);
 glp_set_obj_name(lp, "m+a");
 
 // auxiliary "row" variables:
 // 0: tracking the objective function, to enforce minimum or maximum
 // 1: metabolite selection sum equal to 1
 // 2: adduct selection sum equal to 1
 glp_add_rows(lp, 3);
 glp_set_row_name(lp, 1, "objective_limit");
 glp_set_row_name(lp, 2, "fixed_sum_metabolite");
 glp_set_row_name(lp, 3, "fixed_sum_adduct");
 // set_row_bnds(lp, 1) deferred to the min/max step
 glp_set_row_bnds(lp, 2, GLP_FX, 1, 1);
 glp_set_row_bnds(lp, 3, GLP_FX, 1, 1);
 
 // structural "column" variables, M+K selection vector of metabolites and 
 // adducts in [0, 1]
 glp_add_cols(lp, M + K);
 
 // The glpk array convention is dumb and 1-indexed, meaning every input
 // array needs a dummy prefix
 const int row_ind[4] = {INT_MIN, 1, 2, 3};
 
 char col_name[16];
 
 // Metabolites
 for (int i = 0; i < M; i++) {
 snprintf(col_name, sizeof(col_name), "m_%d", i+1);
 glp_set_col_name(lp, i+1, col_name);
 glp_set_col_kind(lp, i+1, GLP_BV);
 // implied: glp_set_col_bnds(lp, i+1, GLP_DB, 0, 1);
 glp_set_obj_coef(lp, i+1, m[i]);
 double constraints[4] = {NAN, m[i], 1, 0};
 glp_set_mat_col(lp, i+1, 3, row_ind, constraints);
 }
 
 // Adducts
 for (int i = 0; i < K; i++) {
 snprintf(col_name, sizeof(col_name), "a_%d", i+1);
 glp_set_col_name(lp, i+M+1, col_name);
 glp_set_col_kind(lp, i+M+1, GLP_BV);
 // implied: glp_set_col_bnds(lp, i+M+1, GLP_DB, 0, 1);
 glp_set_obj_coef(lp, i+M+1, a[i]);
 double constraints[4] = {NAN, a[i], 0, 1};
 glp_set_mat_col(lp, i+M+1, 3, row_ind, constraints);
 }
 
 return lp;
}
static int find_selected(glp_prob *lp, int n, int offset) {
 for (int i = 0; i < n; i++) {
 if (glp_mip_col_val(lp, i + offset + 1) > 0.5)
 return i;
 }
 fatal("Selected index not found");
} 
static double optimize(
 glp_prob *lp, int direction, int i_s, double s,
 const double *m, const double *a,
 int M, int K, int *j_max, int *k_max
) {
 const char *dir_str = direction == GLP_MIN ? "min" : "max";
 #if VERBOSE
 printf(" [%d] %s ", i_s, dir_str);
 #endif
 
 // Reset between optimization runs
 // glp_std_basis(lp);
 
 glp_set_obj_dir(lp, direction);
 int bound = direction == GLP_MIN ? GLP_LO : GLP_UP;
 glp_set_row_bnds(lp, 1, bound, s, s);
 
 int err = glp_simplex(lp, NULL);
 if (err) glp_error("GLPK simplex failure %d\n", err);
 int stat = glp_get_status(lp);
 if (stat == GLP_OPT) {
 err = glp_intopt(lp, NULL);
 if (err) glp_error("GLPK MIP failure %d\n", err);
 stat = glp_mip_status(lp);
 }
 
 if (stat != GLP_OPT) {
 #if VERBOSE
 printf("%lf: infeasible\n", s);
 #endif
 return INFINITY;
 }
 
 double obj = glp_mip_obj_val(lp);
 #if VERBOSE
 if (direction == GLP_MIN) printf("%.2le <- %.2le ", s, obj);
 else printf("%.2le -> %.2le ", obj, s);
 #endif
 
 *j_max = find_selected(lp, M, 0);
 *k_max = find_selected(lp, K, M);
 
 double error = fabs(obj - s);
 #if VERBOSE
 printf(
 "j=%d k=%2d err=%.1le act_err=%+.1le\n",
 *j_max+1, *k_max+1, error,
 s - m[*j_max] - a[*k_max]
 );
 #endif
 return error;
}
static void test_case(int i_problem, int i_test, FILE *file_in, FILE *file_ans) {
 printf("problem %d.%d ", i_problem, i_test);
 int M, K, N;
 double *m, *a, *s;
 read_case(file_in, &M, &K, &N, &m, &a, &s);
 
 glp_prob *lp = make_prob(i_problem, i_test, M, K, m, a);
 
 int matches = 0;
 
 int expected[2];
 
 for (int i_s = 0; i_s < N; i_s++) {
 int j = -1, k = -1;
 // Minimize m+a subject to m+a >= s
 double error = optimize(lp, GLP_MIN, i_s, s[i_s], m, a, M, K, &j, &k);
 
 if (error > epsilon) {
 int j1 = -1, k1 = -1;
 // Maximize m+a subject to m+a <= s
 double error1 = optimize(lp, GLP_MAX, i_s, s[i_s], m, a, M, K, &j1, &k1);
 
 if (error > error1) {
 error = error1;
 j = j1; k = k1;
 }
 }
 
 if (j < 0 || k < 0) fatal("No solution");
 
 read_ints(file_ans, expected, 2);
 #if VERBOSE
 printf(" Act %2d %2d exp %2d %2d\n", j+1, k+1, expected[0], expected[1]);
 #endif
 if (j+1 == expected[0] && k+1 == expected[1])
 matches++;
 }
 
 glp_delete_prob(lp);
 free(m); free(a); free(s);
 
 printf(" matched %d/%d\n", matches, N); 
}
int main(int argc, const char **argv) {
 if (argc < 2) usage(*argv);
 
 printf("Using glpk %s\n", glp_version());
 
 for (int a = 1; a < argc; a++) {
 FILE *file_in, *file_ans;
 int i_problem;
 if (sscanf(argv[a], "%d", &i_problem) != 1)
 usage(*argv);
 
 open_files(i_problem, &file_in, &file_ans);
 
 int T;
 if (fscanf(file_in, "%d\n", &T) != 1) fatal("Bad test count");
 if (T < 1 || T > 3) fatal("Out-of-range test count");
 
 for (int i_test = 0; i_test < T; i_test++) {
 test_case(i_problem, i_test, file_in, file_ans);
 }
 }
 
 return 0;
}

Numpy vectorization

It's possible to use something vaguely close to your original implementation, but with all numpy and no Python loops. This works-ish for problems 1.1, 2.2 and almost everything in 3.2, but:

  • there are a few stray mismatches in 3.2;
  • I wasn't careful enough with memory, so problem 4.1 dies from OOM - this could be fixed by switching to a KN lookup instead of an MK lookup; and
  • problems 2.1 and 3.1 are totally wrong for some reason.

Still, it serves as a proof-of-concept demonstrating how you would vectorize your algorithm.

from sys import argv

import numpy as np


def solve_case(m: np.ndarray, a: np.ndarray, s: np.ndarray) -> np.ndarray:
    mrep = np.tile(m, len(a))
    jrep = np.tile(np.arange(len(m), dtype=np.int32), len(a))
    arep = np.repeat(a, len(m))
    krep = np.repeat(np.arange(len(a), dtype=np.int32), len(m))
    jk = np.vstack((jrep, krep))
    masum = mrep + arep
    order = masum.argsort()
    jk[:] = jk[:, order]
    masum[:] = masum[order]
    i = np.searchsorted(masum, s)
    lower = np.abs(s - masum[i - 1])
    upper = np.abs(s - masum[i])
    adj = lower < upper
    res = jk[:, i - adj]
    return res.T + 1


def solve(i_problem: int) -> None:
    with open(f'{i_problem}.txt') as file_in, \
            open(f'ans{i_problem}.txt') as file_ans:
        T = int(next(file_in))
        for i_case in range(1, T + 1):
            M, K, N = (int(x) for x in next(file_in).split())
            print(f'problem {i_problem}.{i_case}: M={M} K={K} N={N}', end=' ')
            m, a, s = (
                np.genfromtxt(file_in, dtype=np.float64, max_rows=1)
                for _ in range(3)
            )
            assert m.shape == (M,)
            assert a.shape == (K,)
            assert s.shape == (N,)
            actual = solve_case(m, a, s)
            expected = np.genfromtxt(file_ans, dtype=np.int32, max_rows=N)
            matched = np.sum(actual == expected) / actual.size
            print(f'{matched:.2%} matched')


def main() -> None:
    for arg in argv[1:]:
        solve(int(arg))


if __name__ == '__main__':
    main()
answered Jul 3, 2021 at 1:45
  • The test case in the question is actually the "sample test" provided by the contest. I agree that it is not the best test case. Commented Jul 3, 2021 at 5:41
