## Tuesday, March 31, 2009

Today we covered...

Reed-Muller code review. Local decoding with "random lines". Code concatenation. Beginning of Sudan-Trevisan-Vadhan hardness amplification.

## Monday, March 30, 2009

### Midterm correction

On problem #4 on the midterm, assume $w \geq n$.

Thanks to Or for pointing out the necessity of this.


## Thursday, March 26, 2009

### Lecture 19

Today we covered...

Completion of proof of Impagliazzo's Hard-Core Set Theorem, Yao's XOR Lemma. Intro to error-correcting codes, list-decodable codes, locally decodable codes. Reed-Solomon codes, Hadamard codes, Reed-Muller codes. Connection between local list-decoding and hardness amplification.


## Tuesday, March 24, 2009

### Lecture 18

Today we covered: hardness vs. randomness corollaries of Nisan-Wigderson; the Impagliazzo-Wigderson Theorem ($f \in \mathsf{E}$ requiring circuits of size $2^{\epsilon n}$ implies $\mathsf{P} = \mathsf{BPP}$); hardness amplification, from $1 - 1/\mathrm{poly}(n)$ hardness to $1/2 + 1/\mathrm{poly}(n)$ hardness, via the XOR Lemma; Impagliazzo's proof via the Hard-Core Set Theorem (with Nisan's proof via the Minimax Theorem).

### Midterm

The take-home midterm is posted on the course home page. Please grab a copy now.

As a reminder, it is due next Tuesday, no collaboration is allowed, and no references to "outside" sources are allowed: just your course notes, the blog, the course homeworks, and the course homework solutions.


## Friday, March 20, 2009

### Lecture 17

In this lecture we covered the Nisan-Wigderson Generator, which connects hardness to randomness. In the course of proving it we also saw designs and the Yao Next-Bit Predictor Theorem, which uses the "hybrid method".

## Tuesday, March 17, 2009

### Lecture 16

In this lecture we covered Nisan's pseudorandom generator for log space.

One thing I had to rush over at the end was the actual definition of the generator, along with its seed length!

Here is an official inductive definition of the generator: Let $h_1, \ldots, h_{\lg n} : \{0,1\}^k \to \{0,1\}^k$ be functions. We define $G_t : \{0,1\}^k \to \{0,1\}^{2^t k}$ inductively by $G_0(x) = x$ and $G_{t}(x) = (G_{t-1}(x), G_{t-1}(h_t(x)))$.

Finally, Nisan's generator $G$ maps $\langle x, h_1, \ldots, h_{\lg n} \rangle$ to $G_{\lg n}(x)$. Here the $h_i$'s are the "descriptions" of the hash functions.

Recall that we are using the simple "$a \circ x + b$" pairwise independent hash family; the "description" of such a function is just the $2k$-bit string $\langle a, b\rangle$. Conveniently, choosing this description uniformly at random coincides with choosing $h \sim \mathcal{H}_k$ uniformly at random.

So finally, the seed length is $\ell = k + (\lg n)(2k)$. Since $k = O(S)$, this is indeed $O(S \log n)$ seed-length, as claimed. The number of output bits is in fact $nk = O(nS)$ (which is, as we noted, slightly more than we need).
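To make the recursion concrete, here is a toy Python sketch of the generator for $k = 8$ (all names are mine; I use $GF(2^8)$ arithmetic for the "$a \circ x + b$" family, and real parameters would of course be far larger):

```python
def gf256_mul(a, b):
    """Carry-less multiplication in GF(2^8), reducing by the
    irreducible polynomial x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def make_hash(a, b):
    """The pairwise-independent hash h(x) = a*x + b over GF(2^8);
    its description is the 2k-bit string <a, b>."""
    return lambda x: gf256_mul(a, x) ^ b

def G(t, x, hs):
    """G_0(x) = x; G_t(x) = (G_{t-1}(x), G_{t-1}(h_t(x))).
    Returns a list of 2^t blocks of k = 8 bits each."""
    if t == 0:
        return [x]
    return G(t - 1, x, hs) + G(t - 1, hs[t - 1](x), hs)

# Seed: x plus lg(n) hash descriptions; here n = 8, so lg(n) = 3.
hs = [make_hash(3, 5), make_hash(7, 11), make_hash(13, 17)]
output = G(3, 17, hs)   # n = 2^3 blocks, i.e. nk = 64 output bits
```

Note how the seed length $k + (\lg n)(2k)$ shows up directly: one $k$-bit block for $x$, plus $2k$ bits per hash description.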

## Monday, March 16, 2009

### Fixed-polynomial circuit lower bounds for MA/1

In this post, we look at a recent circuit lower bound result which shows that MA/1 is not contained in the class SIZE($n^k$). Some history and motivation first: a well-studied fundamental problem in lower bounds is showing super-polynomial circuit lower bounds for NP. This is natural to look at, since such a result would immediately separate P and NP, because P $\subseteq$ P/poly, i.e., every language in P is decided by a poly-sized circuit family.

However, though people have been looking at this problem for quite some time, even super-linear lower bounds on circuit size have been elusive. While the goal is super-polynomial lower bounds, the current best lower bound is just $4.5n - o(n)$.

Given this state, a natural thing to do is to show results for larger classes, or in other words, attack NP from above. In this line of research, Kannan showed a promising result of this nature (we have seen this in one of the homeworks): $\Sigma_2 \cap \Pi_2$ is not contained in SIZE($n^k$). Getting closer to NP, Köbler and Watanabe have shown, using the halving algorithm of BCGKT (seen in class), that $ZPP^{NP}$ is not contained in SIZE($n^k$), for any fixed $k$. This is an improvement over Kannan's result, as $ZPP^{NP}$ is "closer" to NP than $\Sigma_2 \cap \Pi_2$ is (i.e., $NP \subseteq ZPP^{NP} \subseteq \Sigma_2 \cap \Pi_2$).

In a recent result, Rahul Santhanam gets even closer to NP by showing an almost-polynomial circuit lower bound for MA. He shows that for any fixed $k>0$, there are languages in the class MA/1 (MA with $1$ bit of advice) which are not decided by non-uniform circuit families of size $n^k$. Why is this result important? Intuitively, MA is the probabilistic version of NP, and such lower bounds gain much value because it is widely believed that adding randomness to NP does not give much more power than the basic NP class (that is, MA $=$ NP is believed to hold). This therefore nearly implies polynomial circuit lower bounds for a class sitting just above NP in the hierarchy. We say nearly because the class for which he shows the lower bound needs $1$ additional bit of advice (MA/1).

Now let's look at the high-level picture of proving such lower bounds. Imagine we want to show a lower bound for a class C, where NP $\subseteq$ C. If NP is not contained in P/poly, then we are done, because C is only larger than NP. On the other hand, if NP $\subseteq$ P/poly, then we *use* this fact to show that the polynomial hierarchy collapses to C. But there are complete problems in PH which are not decided by SIZE($n^k$) --- recall this from HW1; because of the collapse, these problems are in C, and this completes the proof. If we recall, we did exactly this to show such lower bounds for $\Sigma_2 \cap \Pi_2$: we showed that if NP $\subseteq$ P/poly, then PH collapses to $\Sigma_2 \cap \Pi_2$. Since PH has problems which are not decided by SIZE($n^k$), we get that $\Sigma_2 \cap \Pi_2$ is not contained in SIZE($n^k$).

In fact, the argument need not always split on whether or not NP is in P/poly. Sometimes it helps to consider a class which *contains* C, and then argue depending on its containment in P/poly. For instance, we have the result that if PSPACE $\subseteq$ P/poly, then PSPACE $=$ MA (this problem figures in HW4). Intuitively, the proof goes as follows: if PSPACE $\subseteq$ P/poly, then the prover of the IP protocol for QBF (QBF is PSPACE-complete, and PSPACE $=$ IP) can be represented by a poly-sized circuit. Merlin can therefore simply give Arthur the circuit for the correct input size, and Arthur can use the circuit to simulate any query to the prover. Completeness is direct, and intuitively, soundness holds because no adaptive prover can cheat Arthur, so no fixed prover can! Therefore, QBF $\in$ MA.

Using the ideas defined above, we are now ready to show the main theorem that MA/1 is not in SIZE($n^k$).

We require the following lemma; its proof, which uses Shamir's proof that IP $=$ PSPACE, is rather technical, and we omit it.

There exists a PSPACE-complete problem $L$ and a probabilistic polynomial-time Turing machine $M$ such that for any input $x$,

1. $M$ asks only queries of length $|x|$.

2. If $M$ is given the language $L$ as oracle and if $x \in L$, then $M$ accepts $x$ with probability $1$.

3. If $x \notin L$, then no matter what oracle $M$ gets, it accepts with probability only at most $1/2$.

Such a complete problem is known to exist, and we shall make use of it. Just as an aside, this definition of $L$ seems similar to the notion of checkers in HW4.

If this language $L \in$ P/poly, then PSPACE $\subseteq$ P/poly, meaning PSPACE $=$ MA, and this would give us the desired result, since PSPACE is not contained in SIZE($n^k$). Therefore, we assume that $L \notin$ P/poly and create the following *padded* language $L'$. For what follows, let $s(n)$ denote the size of the smallest circuit which decides $L$ on inputs of size $n$. Consider the following language $L'$:

$L' = \{ x1^{y} \mid x \in L,\ |x| = n,\ y \geq n,\ y \text{ is a power of } 2,\ s(n) \in ((y + n)^{k+1}, (2y+n)^{k+1}] \}$.

We show that $L' \in$ MA/1 but $L' \notin$ SIZE($n^k$). We first show the latter. In some sense, the magic is already done: the padding has, in effect, diagonalized the input.

Because $L \notin$ P/poly, $s(n)$ exceeds every fixed polynomial for infinitely many $n$; in particular, there exists some input size $n_0$ such that $s(n_0) > (2n_0)^{k+1}$. Define $y_0$ to be the power of $2$ such that $(n_0 + y_0)^{k+1} < s(n_0) \leq (n_0 + 2y_0)^{k+1}$. Notice that this $y_0$ satisfies exactly the condition on $y$ in the definition of $L'$.

Now, suppose $L'$ has a circuit family of size $m^k$. We create a "small" circuit deciding $L$ on inputs of size $n_0$: it simply hardwires $y_0$ additional $1$'s onto the input $x$ and then mimics the circuit for $L'$ on inputs of size $n_0 + y_0$. The size of this circuit is at most $(n_0+y_0)^k < s(n_0)$, which contradicts the definition of $s(n_0)$. This shows that $L' \notin$ SIZE($n^k$).

All that remains is to show that $L' \in$ MA/1. The key is knowing whether the input size $m$ of a given $x1^y$ is *valid*, in that it satisfies the circuit-size condition. It would be difficult for the MA protocol to figure this out, since the relevant circuit size is large (and hard to compute), and here is where we use the $1$ bit of advice. Given an input size of $m$, the advice is $1$ if and only if $m$ can be decomposed into $n$ and $y$, with $y$ a power of $2$ and $y \geq n$, such that the smallest circuit size for deciding $L$ on inputs of size $n$ lies in $((y+n)^{k+1}, (2y + n)^{k+1}]$. Notice that if such a valid $(n,y)$ decomposition exists, it is unique: if we try to increase $y$ to $2y$ and decrease $n$ accordingly, the constraint $y \geq n$ forces the new $n$ to be non-positive, making the resulting decomposition infeasible.
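The uniqueness of the decomposition is easy to check concretely. Here is a minimal sketch (helper name is mine; the real protocol's advice bit additionally encodes the circuit-size condition, which this toy check ignores):

```python
def decompose(m):
    """Return the unique (n, y) with m = n + y, y a power of two
    and y >= n >= 1, or None if no such split exists."""
    y = 1
    while y < m:
        n = m - y
        if y >= n >= 1:
            # Doubling y would force the new n = m - 2y <= 0,
            # so this split is the only feasible one.
            return (n, y)
        y *= 2
    return None
```

For example, $m = 12$ splits only as $n = 4$, $y = 8$.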

Thus, given $x1^y$ for which we need to decide membership in $L'$, the MA protocol does the following: Merlin sends Arthur a circuit of size $s \in ((n+y)^{k+1}, (n+2y)^{k+1}]$ which solves $L$ on inputs of length $n$. Arthur checks the advice bit and rejects if it is not $1$. If it is, then $m$ is a valid length; Arthur identifies $n$ (the input size of the received circuit) and makes the other necessary checks: $y \geq n$, $y$ is a power of $2$, and $(y+n)^{k+1} < s \leq (2y+n)^{k+1}$. He then simulates $M$ (recall $M$, the probabilistic Turing machine which decides $L$ given an oracle), answering its oracle queries with the circuit, and accepts or rejects as $M$ does.

It is easy to see that if $x1^y \in L'$, then Merlin can send the correct circuit, and Arthur will always accept. On the other hand, if $x1^y \notin L'$, then Merlin "commits" to an oracle by sending a circuit, and by the properties of $L$ and $M$, Arthur accepts with probability at most $1/2$. This completes the proof.

## Friday, March 6, 2009

### On the power of membership queries in agnostic learning

In class, the BCGKT theorem was introduced as an illustration of approximate sampling. It's also a complexity-theoretic result for query learning, wherein learning algorithms can draw examples of their choosing, not just those from a fixed dataset or a rigid oracle. Much as interactive proofs exploit the ability to ask (randomized) questions, query learning algorithms exploit the ability to perform experiments, ask users for feedback, and examine surroundings. These varying capabilities correspond to different models of query learning. In the BCGKT theorem, the NP oracle answers equivalence queries: it either accepts the candidate hypothesis or returns a counterexample. Other interesting models include membership queries, which allow examples to be constructed from scratch and submitted to an oracle that labels them positive or negative, and active learning, in which artificial construction is disallowed but many unlabeled examples are available. Many positive results, such as exponential improvements in sample complexity, make query learning a topic of major interest in learning theory.

Query learners tend to work well because pivotal examples can falsify large swaths of the hypotheses under consideration. It's not surprising that they encounter difficulty when the true function is never a candidate and may not even exist. This impairs the ability to rule out hypotheses on the basis of a single inconsistency and often forces the learner to satisfy a conservative elimination criterion. This so-called agnostic setting is rife with hardness results, including a contribution by one of our wise masters.

The marginal distribution of the examples is often critical. Negative results for distribution-independent learning are often paired with positive results for distribution-specific learning; see, for example, the reply from our other wise master to his counterpart. This phenomenon is exemplified by Vitaly Feldman's recent work, which appeared in short at COLT '08 and in full a few days ago in JMLR.

The bad news comes first. When the examples come from an unknown distribution, membership queries don't offer any additional power. Much like boosting relies on a learner's ability to handle any marginal distribution on the examples, this proof relies on the learner's ability to find the best hypothesis regardless of how the joint example-label distribution changes. This robustness admits simulation of a membership query oracle: draw a sample, evaluate queries against the sample when possible, and return dummy answers for queries outside the sample. If error rates on the sample-conjured oracle are representative of those on the original distribution - something that can be ensured with Vapnik-Chervonenkis bounds - then the existence of a membership query learner implies the existence of a standard oracle learner. Feldman actually proves a stronger result involving finer details of agnostic learning, but those aren't of particular interest to this complexity class.
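The simulation idea can be sketched in a few lines of Python (a cartoon of the argument, not Feldman's actual construction; names are mine):

```python
def simulate_membership_oracle(labeled_sample, dummy=0):
    """Answer membership queries from a fixed labeled sample:
    points that appeared in the sample get their sampled label,
    everything else gets a dummy answer."""
    table = dict(labeled_sample)
    return lambda x: table.get(x, dummy)

# A learner that expects a membership-query oracle can now be run
# against random examples alone:
sample = [((0, 1, 1), 1), ((1, 0, 1), 0), ((1, 1, 0), 1)]
mq = simulate_membership_oracle(sample)
```

Here `mq((0, 1, 1))` returns the sampled label, while a query outside the sample gets the dummy answer; VC-type bounds are what justify trusting error estimates made against such a stand-in oracle.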

The juicier result, and the main topic of this post, is the separation proof: when the examples are uniformly distributed, some concept classes are learnable through membership queries but not through a standard oracle. Random samples may not reveal knowledge crucial to learning; since the distribution is fixed, this weakness can be systematically exploited. In the agnostic PAC model - noise free in the sense that the conditional expectation $\phi$ of the labels is a consistent boolean function of the examples - the exploit is very simple: carve out a small portion of the domain and therein hide a suitably short encoding of the target function. The encoding can be easily recovered by membership queries, but it is unlikely to be found by random examples due to the relatively large size of the entire domain.

This trick doesn't work in the agnostic setting, where such hiding places cannot be safe or small. Functions can be arbitrarily noisy on any part of the domain, so it's impossible to find a safe area to keep a brittle encoding. This issue is mitigated by the Hadamard code, which encodes an $n$-bit vector as all the $2^n$ values taken by the parity function $\chi$ for that vector. This approach seems suspicious, since all the values clearly can't be recovered, and learning parity functions in the agnostic setting is a major open problem. These worries are resolved by the Goldreich-Levin theorem, which was covered recently in Luca's class. For the purpose of this proof, the Goldreich-Levin theorem allows efficient recovery of the largest Fourier coefficients of $\phi$. We could stop here if we were happy to assume the intractability of learning parities from a noisy standard oracle.
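Concretely, the Hadamard encoding of a vector is just the truth table of its parity function (a small sketch; helper names are mine):

```python
def bits(x, n):
    """The n-bit representation of the integer x, least significant bit first."""
    return [(x >> i) & 1 for i in range(n)]

def chi(a, x):
    """The parity function chi_a(x) = <a, x> mod 2."""
    return sum(ai & xi for ai, xi in zip(a, x)) % 2

def hadamard_encode(a):
    """Hadamard codeword of a: chi_a evaluated at every point of {0,1}^n,
    so an n-bit secret blows up to 2^n bits."""
    n = len(a)
    return [chi(a, bits(x, n)) for x in range(2 ** n)]
```

Any two distinct codewords disagree on exactly half the positions (distance $2^{n-1}$), which is roughly why a large Fourier coefficient of $\phi$ remains recoverable, Goldreich-Levin-style, even under substantial noise.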

But we're not, so we're left to worry about how the Hadamard bloat might expose our secret to detection. Even an efficient coding scheme, like that described by Guruswami and Sudan, doesn't save us from a hefty lower bound on the size of the secret region: assuming the concept class cannot be learned without it, the region must occupy at least a $(1-2\epsilon)$ fraction of the domain, i.e., $(1-2\epsilon)2^n$ points.

The trick of sampling a fixed part of the domain is preserved by a technique from a 2007 paper by Elbaz et al. Instead of storing the Hadamard code in plain sight, it is obscured by a pseudorandom function. Recall that $\pi_y$ is a pseudorandom boolean function if its output cannot be distinguished from a completely random function with more than negligible probability in polynomial time. Such functions exist if one-way functions exist, which is the cryptographic assumption underlying the proof. Here they are used to build concepts of the following form (meaning will soon be given to the undefined variables; treat them as opaque for now):

$g_{\bar{d}}\left(k,z,\bar{x}\right)=\pi_{d_k}(z)\oplus \chi_{\bar{d}(k)}\left(\bar{x}\right)$

View the arguments as a single binary string of length $m$ and suppose $k$ is fixed. The value of $g$ on a random example $(z,\bar{x})$ is pseudorandom, but if $z$ is fixed, then $\{g(z,x)\}_x$ recovers the Hadamard encoding of $\bar{d}(k)$. Of course, this is possible only if $d_k$ is encoded and recovered; this will require another secret, and so forth. Thus a sequence of pseudorandom functions and secrets is used. There are $p-1$ such secrets $\bar{d}=\left(d_1,d_2,\ldots ,d_{p-1}\right)$, each indexing a family of pseudorandom boolean functions $\mathcal{F}_n=\left\{\pi_y\right\}_{y\in \{0,1\}^n}$. To combat adversarial noise, the sequence is highly redundant: each secret doesn't unlock just one previous secret, but all previous secrets.

All the secrets except one can be recovered by membership queries, but doing the same from random examples would be akin to distinguishing a random function from a pseudorandom function. The proof details are a little hairy. I encourage you to read the original paper, but please don't be led astray by the figure; it's missing an arrow and it's labeled incorrectly. Vitaly says he'll post a corrected version on his site, most likely here.

In short, membership queries offer no advantage in the fully agnostic setting. Cryptographic evidence suggests that membership queries are more powerful than a standard oracle when the examples are uniformly distributed. I suggest Dana Angluin's survey to interested readers.

Query learners tend to work well because pivotal examples can falsify large swaths of the hypotheses under consideration. It's not surprising that they encounter difficulty when the true function is never a candidate and may not even exist. This impairs the ability to rule out hypotheses on the basis of a single inconsistency and often forces the learner to satisfy a conservative elimination criterion. This so-called agnostic setting is rife with hardness results, including a contribution by one of our wise masters.

The marginal distribution of the examples is often critical. Negative results for distribution-independent learning are often paired with positive results for distribution-specific learning; see, for example, the reply from our other wise master to his counterpart. This phenomenon is exemplified by Vitaly Feldman's recent work, which appeared in short at COLT '08 and in full a few days ago in JMLR.

The bad news comes first. When the examples come from an unknown distribution, membership queries don't offer any additional power. Much like boosting relies on a learner's ability to handle any marginal distribution on the examples, this proof relies on the learner's ability to find the best hypothesis regardless of how the joint example-label distribution changes. This robustness admits simulation of a membership query oracle: draw a sample, evaluate queries against the sample when possible, and return dummy answers for queries outside the sample. If error rates on the sample-conjured oracle are representative of those on the original distribution - something that can be ensured with Vapnik-Chervonenkis bounds - then the existence of a membership query learner implies the existence of a standard oracle learner. Feldman actually proves a stronger result involving finer details of agnostic learning, but those aren't of particular interest to this complexity class.

The juicier result, and the main topic of this post, is the separation proof: when the examples are uniformly distributed, some concept classes are learnable through membership queries but not through a standard oracle. Random samples may not reveal knowledge crucial to learning; since the distribution is fixed, this weakness can be systematically exploited. In the agnostic PAC model - noise free in the sense that the conditional expectation $\phi$ of the labels is a consistent boolean function of the examples - the exploit is very simple: carve out a small portion of the domain and therein hide a suitably short encoding of the target function. The encoding can be easily recovered by membership queries, but it is unlikely to be found by random examples due to the relatively large size of the entire domain.

This trick doesn't work in the agnostic setting, where such hiding places cannot be safe or small. Functions can be arbitrarily noisy on any part of the domain, so it's impossible to find a safe area to keep a brittle encoding. This issue is mitigated by the Hadamard code, which encodes an $n$-bit vector as all the $2^n$ values taken by the parity function $\chi$ for that vector. This approach seems suspicious, since all the values clearly can't be recovered, and learning parity functions in the agnostic setting is a major open problem. These worries are resolved by the Goldreich-Levin theorem, which was covered recently in Luca's class. For the purpose of this proof, the Goldreich-Levin theorem allows efficient recovery of the largest Fourier coefficients of $\phi$. We could stop here if we were happy to assume the intractibility of learning parities from a noisy standard oracle.

But we're not, so we're left to worry about how the Hadamard bloat might expose our secret to detection. Even an efficient coding scheme, like that described by Guruswami and Sudan, doesn't save us from a hefty lower bound on the size of the secret region. Assuming the concept class cannot be learned without the secret region, the region must occupy at least $(1-2\epsilon)2^n$ proportion of the domain.

The trick of sampling a fixed part of the domain is preserved by a technique from a 2007 paper by Elbaz et al. Instead of storing the Hadamard code in plain sight, it is obscured by a pseudorandom function. Recall that $\pi_y$ is a pseudorandom boolean function if its output cannot be distinguished from a completely random function with more than negligible probability in polynomial time. Such functions exist if one-way functions exist, which is the cryptographic assumption underlying the proof. Here they are used to build concepts of the following form (meaning will soon be given to the undefined variables; treat them as opaque for now):

$g_{\bar{d}}\left(k,z,\bar{x}\right)=\pi_{d_k}(z)\oplus \chi_{\bar{d}(k)}\left(\bar{x}\right)$

View the arguments as a single binary string of length $m$, and suppose $k$ is fixed. The value of $g$ on a random example $(z,\bar{x})$ is pseudorandom but, if $z$ is fixed, then $\{g(z,\bar{x})\}_{\bar{x}}$ is the Hadamard encoding of $\bar{d}(k)$, masked by the constant bit $\pi_{d_k}(z)$. Of course, this helps only if $d_k$ can itself be encoded and recovered; this will require another secret, and so forth. Thus a sequence of pseudorandom functions and secrets is used. There are $p-1$ such secrets $\bar{d}=\left(d_1,d_2,\ldots ,d_{p-1}\right)$, each indexing a family of pseudorandom boolean functions $\mathcal{F}_n=\left\{\pi_y\right\}_{y\in \{0,1\}^n}$. To combat adversarial noise, the sequence is highly redundant: each secret unlocks not just the previous secret but all previous secrets.
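A toy sketch of this masking (my own illustration; `prf` is a hash-based stand-in for a real pseudorandom function, which would instead be built from a one-way function): fixing $z$ and differencing out the constant mask $\pi_{d_k}(z)$ turns membership queries into direct reads of the Hadamard codeword of $\bar{d}(k)$.

```python
import hashlib

def prf(key, z):
    """Toy stand-in for a pseudorandom boolean function pi_key(z)."""
    h = hashlib.sha256(bytes(key) + b"|" + bytes(z)).digest()
    return h[0] & 1

def g(d, k, z, x):
    # Concept from the post: pi_{d_k}(z) xor chi_{d(k)}(x).
    secret = d[k]
    return prf(secret, z) ^ (sum(s * xi for s, xi in zip(secret, x)) % 2)

# Membership-query recovery of d(k): fix z, difference out the PRF mask.
n = 4
d = {1: (1, 0, 1, 1)}
z = (0, 1, 0, 1)
mask = g(d, 1, z, (0,) * n)  # g at x = 0 reveals the mask pi_{d_1}(z)
recovered = tuple(
    g(d, 1, z, tuple(1 if j == i else 0 for j in range(n))) ^ mask
    for i in range(n))
assert recovered == d[1]
```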

All the secrets except one can be recovered by membership queries, but doing the same from random examples would be akin to distinguishing a random function from a pseudorandom function. The proof details are a little hairy. I encourage you to read the original paper, but please don't be led astray by the figure; it's missing an arrow and it's labeled incorrectly. Vitaly says he'll post a corrected version on his site, most likely here.

In short, membership queries offer no advantage in the fully agnostic setting. Cryptographic evidence suggests that membership queries are more powerful than a standard oracle when the examples are uniformly distributed. I suggest Dana Angluin's survey to interested readers.

## Thursday, March 5, 2009

### Factoring

Sam asked me about the complexity of factoring, and the fastest known algorithms for it. Here is a very nice survey by Pomerance about the best known algorithms. Some light reading for you over Spring Break!

### Lecture 15

In this lecture we finished the proof of the Switching Lemma and showed how it implied Håstad's Theorem: Computing Parity by depth-$k$ circuits (unbounded fan-in) requires size $2^{\Omega(n^{1/(k-1)})}$.

We then stated and sketched the proof of the Linial-Mansour-Nisan (LMN) Theorem: If $f$ is computed by depth-$k$, size-$S$ circuits, then there is a multilinear real polynomial $p$ of degree at most $t = O(\log(S/\epsilon))^{k}$ which approximates $f$ in the following sense:

$\mathbf{E}_{x}[(f(x_1, \ldots, x_n) - p(x_1, \ldots, x_n))^2] \leq \epsilon$,

where $x = (x_1, \ldots, x_n)$ is a uniformly random string.
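For intuition, this truncation error can be computed exactly for tiny functions via Parseval: if $p$ is $f$'s Fourier expansion truncated to degree $t$, then $\mathbf{E}[(f-p)^2]$ equals the Fourier weight of $f$ above degree $t$. A brute-force Python sketch (my own toy example on 4-bit OR, not part of the lecture):

```python
from itertools import product, combinations
from math import prod

def fourier_coeffs(f, n):
    """Exact Fourier coefficients of f : {-1,1}^n -> R,
    hat{f}(S) = E_x[f(x) * chi_S(x)] by brute-force averaging."""
    pts = list(product((-1, 1), repeat=n))
    subsets = [s for r in range(n + 1) for s in combinations(range(n), r)]
    return {S: sum(f(x) * prod(x[i] for i in S) for x in pts) / len(pts)
            for S in subsets}

def truncation_error(f, n, t):
    """E_x[(f - p)^2] for p = f's Fourier expansion truncated to degree <= t;
    by Parseval this is the Fourier weight of f above degree t."""
    return sum(c * c for S, c in fourier_coeffs(f, n).items() if len(S) > t)

# OR on 4 bits, with inputs and output in {-1, 1} and -1 read as True.
OR = lambda x: -1 if -1 in x else 1
errs = [truncation_error(OR, 4, t) for t in range(5)]
print(errs)  # error shrinks as the allowed degree t grows
```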

By the way, Håstad subsequently slightly sharpened this; he improved $t$ to

$O(\log(S/\epsilon))^{k-1} \cdot \min\{\log S, \log (1/\epsilon)\}$.

## Tuesday, March 3, 2009

### Lecture 14

In this class we reviewed DNF/CNF size and width decision tree size and depth. We then began the proof of the Switching Lemma, which we will finish next time. If you want to read up on it, we are following Paul Beame's exposition.

Also, Homework #4 is out. It is due in 3 weeks, after which we will have the midterm. This midterm will occupy either 5 or 7 days.
