## Friday, March 6, 2009

### On the power of membership queries in agnostic learning

In class, the BCGKT theorem was introduced as an illustration of approximate sampling. It's also a complexity-theoretic result for query learning, wherein learning algorithms can draw examples of their choosing, not just those from a fixed dataset or a rigid oracle. Much as interactive proofs exploit the ability to ask (randomized) questions, query learning algorithms exploit the ability to perform experiments, ask users for feedback, and examine surroundings. These varying capabilities correspond to different models of query learning. In the BCGKT theorem, the NP oracle answers to equivalence querying: it either accepts the candidate hypothesis or returns a counterexample. Other interesting models include membership querying, which allows examples to be constructed from scratch and submitted to an oracle who labels it positive or negative, and active learning, in which artificial construction is disallowed but many unlabeled examples are available. Many positive results, such as exponential improvements in sample complexity, make query learning a topic of major interest in learning theory.

Query learners tend to work well because pivotal examples can falsify large swaths of the hypotheses under consideration. It's not surprising that they encounter difficulty when the true function is never a candidate and may not even exist. This impairs the ability to rule out hypotheses on the basis of a single inconsistency and often forces the learner to satisfy a conservative elimination criterion. This so-called agnostic setting is rife with hardness results, including a contribution by one of our wise masters.

The marginal distribution of the examples is often critical. Negative results for distribution-independent learning are often paired with positive results for distribution-specific learning; see, for example, the reply from our other wise master to his counterpart. This phenomenon is exemplified by Vitaly Feldman's recent work, which appeared in short at COLT '08 and in full a few days ago in JMLR.

The bad news comes first. When the examples come from an unknown distribution, membership queries don't offer any additional power. Much like boosting relies on a learner's ability to handle any marginal distribution on the examples, this proof relies on the learner's ability to find the best hypothesis regardless of how the joint example-label distribution changes. This robustness admits simulation of a membership query oracle: draw a sample, evaluate queries against the sample when possible, and return dummy answers for queries outside the sample. If error rates on the sample-conjured oracle are representative of those on the original distribution - something that can be ensured with Vapnik-Chervonenkis bounds - then the existence of a membership query learner implies the existence of a standard oracle learner. Feldman actually proves a stronger result involving finer details of agnostic learning, but those aren't of particular interest to this complexity class.

The juicier result, and the main topic of this post, is the separation proof: when the examples are uniformly distributed, some concept classes are learnable through membership queries but not through a standard oracle. Random samples may not reveal knowledge crucial to learning; since the distribution is fixed, this weakness can be systematically exploited. In the agnostic PAC model - noise free in the sense that the conditional expectation $\phi$ of the labels is a consistent boolean function of the examples - the exploit is very simple: carve out a small portion of the domain and therein hide a suitably short encoding of the target function. The encoding can be easily recovered by membership queries, but it is unlikely to be found by random examples due to the relatively large size of the entire domain.

This trick doesn't work in the agnostic setting, where such hiding places cannot be safe or small. Functions can be arbitrarily noisy on any part of the domain, so it's impossible to find a safe area to keep a brittle encoding. This issue is mitigated by the Hadamard code, which encodes an $n$-bit vector as all the $2^n$ values taken by the parity function $\chi$ for that vector. This approach seems suspicious, since all the values clearly can't be recovered, and learning parity functions in the agnostic setting is a major open problem. These worries are resolved by the Goldreich-Levin theorem, which was covered recently in Luca's class. For the purpose of this proof, the Goldreich-Levin theorem allows efficient recovery of the largest Fourier coefficients of $\phi$. We could stop here if we were happy to assume the intractibility of learning parities from a noisy standard oracle.

But we're not, so we're left to worry about how the Hadamard bloat might expose our secret to detection. Even an efficient coding scheme, like that described by Guruswami and Sudan, doesn't save us from a hefty lower bound on the size of the secret region. Assuming the concept class cannot be learned without the secret region, the region must occupy at least $(1-2\epsilon)2^n$ proportion of the domain.

The trick of sampling a fixed part of the domain is preserved by a technique from a 2007 paper by Elbaz et al. Instead of storing the Hadamard code in plain sight, it is obscured by a pseudorandom function. Recall that $\pi_y$ is a pseudorandom boolean function if its output cannot be distinguished from a completely random function with more than negligible probability in polynomial time. Such functions exist if one-way functions exist, which is the cryptographic assumption underlying the proof. Here they are used to build concepts of the following form (meaning will soon be given to the undefined variables; treat them as opaque for now):

$g_{\bar{d}}\left(k,z,\bar{x}\right)=\pi_{d_k}(z)\oplus \chi_{\bar{d}(k)}\left(\bar{x}\right)$

View the arguments as a single binary string of length $m$ and suppose $k$ is fixed. The value of $g$ on a random example $(z,\bar{x})$ is pseudorandom but, if $z$ is fixed, then $\{g(z,x)\}_x$ recovers the Hadamard encoding of $\bar{d}(k)$. Of course, this is possible only if $d_k$ is encoded and recovered; this will require another secret, and so forth. Thus a sequence of pseudorandom functions and secrets are used. There are $p-1$ such secrets $\bar{d}=\left(d_1,d_2,\ldots ,d_{p-1}\right)$, each indexing a family of pseudorandom boolean functions $\mathcal{F}_n=\left\{\pi_y\right\}_{y\in \{0,1\}^n}$. To combat adversarial noise, the sequence is highly redundant: each secret doesn't unlock one previous secret, but all previous secrets.

All the secrets except one can be recovered by membership queries, but doing the same from random examples would be akin to distinguishing a random function from a pseudorandom function. The proof details are a little hairy. I encourage you to read the original paper, but please don't be led astray by the figure; it's missing an arrow and it's labeled incorrectly. Vitaly says he'll post a corrected version on his site, most likely here.

In short, membership queries offer no advantage in the fully agnostic setting. Cryptographic evidence suggests that membership queries are more powerful than a standard oracle when the examples are uniformly distributed. I suggest Dana Angluin's survey to interested readers.