The problem of inferring an unknown string from noisy copies appears in disparate fields, such as genomic sequencing and the learning of workflows from demonstrations. It can be described abstractly as follows. Let T = t_1...t_m be an unknown string consisting of characters t_i from some alphabet. We want to determine T, but we can produce only noisy copies of T. Consider two error models:
- Suppose that every t_i is changed (i.e., replaced with some other character) with some known small probability r, and these errors are mutually independent, for all positions in T and all copies of T.
- Suppose that every t_i is changed or deleted with probability r; moreover, between every t_i and t_{i+1}, another character is inserted with probability r. Again, all these errors are mutually independent, for all positions in T and all copies of T. Note that the random copies of T produced in this way can have varying lengths, and the characters from T can end up at slightly different positions in them, which makes the problem trickier.
Develop several algorithms for this string reconstruction problem, and derive bounds on the running time and, most importantly, on the probability of failure (i.e., of not inferring T correctly). Note that the input data are randomized here, but your algorithms may be either randomized or deterministic. Always state clearly what type of results you show: worst-case or expected values, Monte Carlo or Las Vegas, etc.
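To experiment with the two error models, a small simulator can be useful. The following Python sketch is one possible reading of the statement, not a definitive specification. The function names `substitution_copy` and `edit_copy` are hypothetical. It is assumed that a wrong character is drawn uniformly from the other alphabet symbols, that an inserted character is drawn uniformly from the whole alphabet, and that in the second model an error at t_i is a substitution or a deletion with equal chance. The problem statement fixes none of these choices.

```python
import random


def _other(ch, alphabet, rng):
    # Assumption: a wrong character is uniform over the remaining symbols.
    return rng.choice([c for c in alphabet if c != ch])


def substitution_copy(t, r, alphabet, rng):
    """First model: each t_i is replaced independently with probability r."""
    return "".join(
        _other(ch, alphabet, rng) if rng.random() < r else ch for ch in t
    )


def edit_copy(t, r, alphabet, rng):
    """Second model: each t_i is changed or deleted with probability r,
    and a character is inserted between t_i and t_{i+1} with probability r."""
    out = []
    for i, ch in enumerate(t):
        if rng.random() < r:
            # Assumption: substitution and deletion are equally likely.
            if rng.random() < 0.5:
                out.append(_other(ch, alphabet, rng))
            # otherwise t_i is deleted and nothing is appended
        else:
            out.append(ch)
        # Insertions happen only in the gaps between consecutive characters.
        if i + 1 < len(t) and rng.random() < r:
            out.append(rng.choice(alphabet))  # assumption: uniform insertion
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(1)
    T = "ACGTTGCAAC"
    for _ in range(3):
        print(substitution_copy(T, 0.1, "ACGT", rng), edit_copy(T, 0.1, "ACGT", rng))
```

Copies from `substitution_copy` always have length m, whereas copies from `edit_copy` can be shorter or longer than T, which is the source of the alignment difficulty mentioned above.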
Suggestions:
- One can think of different approaches, e.g., invoke a dynamic programming algorithm for string editing (a.k.a. alignment), or consider short substrings that appear frequently in the data (which suggests that they may occur in T), or use small random samples of the noisy copies of T, or find other ideas.
- Focus on small or large numbers n of copies, or both cases.
- An important special case appears when T has no duplicates, i.e., every character appears at most once. This can make some matters easier.
- How do the performance and practicality of your algorithms depend on r?
- In the above formulation we have only assumed some probability r for the occurrence of an error at every position in T. You may adopt further probabilistic assumptions on the result of the error (e.g., every wrong character is produced with some probability) or consider a worst-case model where an “adversary” decides on the result of every error (but the occurrence of every error is still random).
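As one possible starting point for the substitution-only model, a column-wise majority vote is sketched below in Python. It is offered as an illustration under stated assumptions, not as the intended solution, and the names `majority_vote` and `failure_bound` are hypothetical. The algorithm is deterministic and runs in O(nm) time on n copies of length m; the failure probability is taken over the random input. Even if an adversary chooses every wrong character, position i can be decided wrongly only if at least n/2 copies are erroneous there. For r < 1/2, Hoeffding's inequality bounds this by exp(-2n(1/2 - r)^2), and a union bound over the m positions gives a failure probability of at most m * exp(-2n(1/2 - r)^2).

```python
import math
from collections import Counter


def majority_vote(copies):
    """Return the column-wise most frequent character.

    Assumes the substitution-only model, so all copies have the same length.
    """
    if not copies:
        raise ValueError("need at least one copy")
    m = len(copies[0])
    if any(len(c) != m for c in copies):
        raise ValueError("all copies must have equal length")
    # zip(*copies) yields, for each position, the characters seen there.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*copies))


def failure_bound(m, n, r):
    """Upper bound m * exp(-2 n (1/2 - r)^2) on the failure probability.

    Valid for 0 <= r < 1/2, also against adversarial wrong characters.
    """
    if not 0 <= r < 0.5:
        raise ValueError("bound requires 0 <= r < 1/2")
    return min(1.0, m * math.exp(-2 * n * (0.5 - r) ** 2))


if __name__ == "__main__":
    print(majority_vote(["abcd", "abxd", "zbcd"]))
    print(failure_bound(m=1000, n=200, r=0.1))
```

Solving the bound for n suggests that on the order of log(m / delta) / (1/2 - r)^2 copies suffice for a failure probability of at most delta. This approach does not carry over directly to the model with insertions and deletions, because the columns are then no longer aligned.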