This post follows on from my previous one about the Myhill–Nerode theorem. You should read that one first.

When trying to construct a deterministic finite automaton for a language in practice, one thing we need to know is how it’s represented. If all we have is a yes/no predicate, the problem is in general intractable (it could be any set of strings at all!). If all we have is a set of examples and non-examples and we want to construct a minimal deterministic finite automaton (DFA), the problem is NP hard.

This post is about what can do if we augment the yes/no predicate only slightly: We have what is called a *minimally adequate teacher*. It can answer two types of questions about our language \(L\):

- Is this string in the language?
- Given this DFA, does it correctly recognise the language and if not what is an example of a string where it gives the wrong answer?

In this scenario there is an polynomial time algorithm in the size of the minimal DFA (and logarithmic in the size of the longest counterexample the teacher returns) which will terminate in a correct DFA for the language.

This algorithm was originally presented by Dana Angluin in her paper “Learning Regular Sets from Queries and Counterexamples” under the name L* search. The version I’m going to present here is in fact a slightly later version of the algorithm presented by Ronald L. Rivest and Robert E. Schapire in “Inference of Finite Automata Using Homing Sequences” which improves on the algorithmic complexity and makes the underlying nature of the algorithm slightly clearer. Regardless, I’m still going to call it L* search.

The idea of L* search (although not how it was originally presented) is that we are going to try to construct the Myhill–Nerode automaton , \(Z\), directly. We will do this as follows:

- We will maintain a sequence \(w_1, \ldots, w_m\) of witnesses and a sequence \(t_1, \ldots t_n\) of tests. \(w_i\) is a sequence of strings such that each string is not word equivalent to any other, i.e. is a member of a distinct equivalence class in \(Z\), and \(t_i\) is a sequence of tests that witness that: i.e. For any \(i, j \leq m\) with \(i \ne j\) there is some \(k\) such that \(w_i t_k \in L \ne w_j t_k \in L\).
- We start with \(w_1 = t_1 = \epsilon\).
- At each stage we either terminate or we will extend \(t_i\) by one test and \(p_i\) by one or more strings.
- This must terminate, because we’re creating a sequence of non word equivalent strings so we can’t have more elements in it than there are states in \(Z\) (because the states are equivalence classes of strings). It’s not currently obvious that we can always make progress prior to that point, but that will emerge in the construction of the algorithm.

The core idea is to use the \(t_i\) to ‘finger print’ every string. The \(t_i\) give us a mapping \(\mathrm{row}: S \to \{0, 1\}^n\) where \(\mathrm{row}(s)_i\) is \(1\) if \(st_i \in L\), else \(0\). We can calculate this fingerprint because we are allowed to ask the teacher whether a string is in the language. We know for a fact that if \(\mathrm{row}(s) \ne \mathrm{row}(t)\) then \(s\) and \(t\) aren’t word equivalent, because one of the \(t_i\) must witness that.

Given our current sequence of tests \(t_i\) and our witnesses \(w_i\), we use the row function to construct the following automaton whose states are the \(w_i\). In the course of constructing it we may have to extend \(w_i\):

For each \(w_i\) and each \(a \in A\), we consider \(\mathrm{row}(w_i a)\). Define \(\delta(w_i, j) = w_j\), where \(w_j\) is chosen such that \(\mathrm{row}(w_i a) = \mathrm{row}(w_j) \). If no such \(w_j\) exists, then extend the sequence by an element by adding \(w_i a\) to the end so that it does.

We continue this process until the transition function has been defined for all elements of the sequence (including the ones newly added).

We then define the initial state of this finite state machine to be \(w_1 = \epsilon\), and a state \(w_i\) to be accepting if and only if \(w_i \in L\).

The resulting state machine is more or less the same as the the Myhill–Nerode automaton. The only differences are:

- We have explicitly chosen an instance of each equivalence class (the \(w_i\))
- Instead of considering whether states are distinguishable by
*any*test, we only ask if they are distinguishable by our current experiments \(e_i\). This can cause us to conflate more states than we should, and fixing that will be the core point of what we do next.

So we now have a DFA to present to the teacher. At this point we perform a counter-example query. Either our teacher says that the DFA is correct, in which case we’re done, or it presents us with a counter-example in the form of a string that the DFA predicts the wrong answer for. Call our current DFA \(Q\), and that counter-example \(c\).

What we want to do now is to take that mispredicted string and extract from it an experiment that shows us where we went wrong: At some point we must have made a transition \(\delta(w_i, a) = w_j\) when in actual fact \(w_ia\) and \(w_j\) are not word equivalent (if we never made such a transition then the path we walked is the same as the path we would have walked in \(Z\), which *does* correctly recognise the language). So what we want to do is find an experiment \(t\) such that \(w_i a t \in L \ne w_j t \in L\), and add \(t\) as our next experiment.

If we do that, then when we build the next version of the DFA with the new set of experiments the number of states *must* increase: We already know that \(w_i a\) is not word-equivalent to \(w_k\) for any \(k \ne j\). We have just shown that it is *also* not word equivalent to \(w_j\). Thus, in order to find a state to transition to when we reach \(i, a\), we must either have already added a new state or we add one there and then.

The idea to find such a transition, and an experiment that would distinguish it, is this: We use the suffixes of \(c\) as our experiment candidates.

We’ll look at a sequence of strings that *should* be equivalent if our DFA were correct, and look for the exact moment that something goes wrong to find our experiment.

Consider splitting \(c\) up into a sequence of strings \(u_i v_i\), with \(0 \leq i \leq |c|\) and \(|u_i| = i\). i.e. \(u_i\) is the first \(i\) characters of \(c\) and \(v_i\) is the remaining set of characters.

(Notation: \(|s|\) is the length of the string \(s\)).

Now, we can replace \(u_i\) with \(\delta_Q(q_0, u_i)\) – the string that represents the state that you get to after reading the first \(i\) characters – and not have this change whether the result is in the language recognized by \(Q\). However, the same is *not* true of the language \(L\): If we consider \(i = |c|\) then this string is \(\delta(q_0, c)\), and by the fact that \(c\) is a counter-example we know that \(\delta_Q(\epsilon, c) \in L \ne c \in L\).

Call a value of \(i\) *bad* if it has the property that \(\delta_Q(\epsilon, u_i) v_i \in L \ne c \in L\). i.e. if replacing the state we expect to be in after \(i\) characters with the representative for that state changes whether the string is recognised as being in the desired language. Necessarily, \(|c|\) is bad and \(0\) is good (because \(u_i = \epsilon\) so the replacement string is just \(c\) again).

What we want to do is find an index \(i\) such that \(i\) is good and \(i + 1\) is bad. We’ll see how to do that in a moment, but first lets see why it would help:

What we want to show is that if we have such an \(i\) then the transition from \(\delta(q_0, u_i)\) to \delta(q_0, u_{i + 1})\) via \(c_{i + 1}\) is the bad transition we were looking for, and that \(v_{i+1}\) is a test that witnesses this.

To see this, note that because \(i + 1\) is bad we must have \(\delta(q_0, u_{i + 1}) v_{i + 1} \ne c \in L\) and because \(i\) is good we must have \(\delta(q_0, u_{i}) v_{i} = c \in L\), so \(\delta(q_0, u_{i}) v_{i} \ne \delta(q_0, u_{i + 1}) v_{i + 1} \). But \(v_i = c_{i + 1} v_{i + 1}\), so this means we have \(\delta(q_0, u_{i}) c_{i + 1} v_{i + 1} \ne \delta(q_0, u_{i + 1}) v_{i + 1} \).

i.e. \(v_{i + 1}\) is an experiment distinguishing \(\delta(q_0, u_{i}) c_{i + 1}\) from \(\delta(q_0, u_{i + 1})\). But by construction, \(\delta(q_0, u_{i + 1}) = \delta(\delta(q_0, u_i), c)\), which makes \(v_{i + 1}\) our desired experiment to add to the list.

Now, to complete the process: How do we find such an \(i\)?

The most direct approach would be to just do a linear scan, but we can do better: You can use binary search, by using the observation that the general binary search algorithm works just fine without anything to do with sorting at all if you regard it as looking for a point at which a function changes in a particular way.

In python pseudocode, we can find such an \(i\) as follows:

# Invariants: lo is good, hi is bad lo = 0 hi = len(c) while lo + 1 < hi: mid = (lo + hi) // 2 if bad(mid): hi = mid else: lo = mid |

This then takes \(\log_2(|c|)\) queries and when it terminates we can take \(i = \mathrm{lo}\).

Which, was the final step in our puzzle for the construction: As long as our automaton does not recognise \(L\), we can find a new experiment, add it to the list, and thus force our automaton to keep growing. When it can grow no longer it must be equivalent to \(Z\), because it is an automaton of the same size which recognises \(L\).

We’re almost done, but lets do some analysis on how many queries we made here (in this analysis I will assume that all answers to membership queries are cached so we only make them once).

Suppose our final set of tests is \(|T|\). Note that \(|T| \leq |Z|\) because every time we added to \(T\) we increased the number of states by at least one, and the algorithm terminated when the number of states was equal to \(|Z|\).

We have performed precisely \(|T|\) counter-example queries because every time we performed a counter-example query we added exactly \(1\) element to \(T\).

Now lets see how many membership queries we’ve performed.

We performed membership queries in two places: First, when building the transition table. Secondly, when trying to construct the tests.

For the first: For every witness, every \(a \in A\) and every \(t \in T\) we have calculated whether it is a member of \(L\) in order to compute the rows and the transitions. So that comes to \(|Z||A||T| \leq |A| |Z|^2\) membership queries.

In calculating the test, we had to perform one membership query at every point in the binary search in order to determine if mid were bad. Therefore we performed \(O(\log(|c|)\) membership queries. Letting \(m\) be the maximum length of any counter-example provided in the course of our run, we thus performed a total of \(O(|T| \log(m)) = O(|Z| \log(m))\) membership queries.

So, in our final analysis, running L* search with a minimally adequate teacher for our language requires:

- \(O(|A||Z|^2 + |Z| \log(m))\) membership queries
- \(O(|Z|)\) counterexample queries

In fact these estimates are a bit too conservative. As we’ll see in a later post, it is sometimes going to be the case that \(|T| \ll |Z|\).

Why do we care?

Personally, I mostly care because I think L* is a super-neat algorithm, and the concept of a minimally adequate teacher as a learning environment is surprisingly general.

You can also put this into a stochastic setting to relax the requirement of the existence of a minimally adequate teacher (which may be hard to obtain) , in which case you instead get to build DFAs which give the correct answer with at least some chosen probability when the string is drawn from a fixed choice of distribution.

However, the evidence seems to be that most people in the field don’t care about this algorithm in the field of language learning. There hasn’t been a great deal of research on it in recent years that I’ve been able to find, and most of the research on it from back in the day was less concerned with languages than automata: You frequently got this framed in the context of a robot learning to navigate a world.

I recently read Grammatical Inference: Learning Automata and Grammars, which if you found this post interesting I can *highly* recommend, as there won’t be many more like it in this series. The next post will use some of the ideas from it, but will be moving away from the area of inference.