4 comments

  • HarHarVeryFunny 11 minutes ago

    I have a very hand-wavy explanation for how (but not fully why) overparameterized nets tend to generalize rather than overfit.

    First a couple of facts:

    1) An ANN works by learning decision boundaries that separate and group training samples and their associated labels.

    2) If you train an overparameterized net on random data then it will memorize it, but if you train it on structured data lying on some lower-dimensional manifold then, rather than memorizing it, it will generalize. So the behavior depends on the nature of the data it is trained on.

    Now the hand-wavy bit:

    As training progresses the weights move the decision surfaces around until each training sample maps to a region of output space corresponding to the correct label, with these regions of output/latent space being separated by the learnt decision surfaces.

    Initially during training (up to the double descent phase in cases where that happens) these regions of "gerrymandered" output space may each correspond to only one or a few training samples, so there may be multiple disconnected regions each mapping to the label "cat", and another group of disconnected regions each mapping to the label "dog". This is the overfitting phase.

    Now, if the data permits, with the data manifold being consistently labelled (nothing that looks like a cat being labelled a dog), there will often be potential to merge some of these disconnected regions of output space that map to the same label. So, for example, we might go from four small regions of "cat" space to two larger merged regions of "cat" space. This is the mechanism of generalization, with the extra space contained by the merged regions corresponding to interpolation - no training samples "forced" those larger merged regions, but none prevented them either (no "dog" that looks like a cat).

    The question then remains why the dynamics of training may cause the decision surfaces to initially be highly "gerrymandered" (because it's easier?), but on continued training to merge (because without any dogs among the cats there is no reason not to and, once merged, no label error forces them to unmerge - a ratcheting-up process from smaller to larger regions with increasing generalization?).

    • cherryteastain 3 hours ago

      A related viewpoint is that overparametrization is good because the model is stranded when the Hessian has all positive/zero eigenvalues [1]. If we treat the sign of each Hessian eigenvalue as an independent Bernoulli trial, the chance of all eigenvalues being positive/zero decreases exponentially as the parameter count increases.

      [1] https://arxiv.org/abs/1406.2572
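
      To put rough numbers on the Bernoulli argument, here is a minimal sketch. It assumes each eigenvalue is positive/zero independently with the same probability p. That independence is a simplification made for illustration, not a claim taken from [1], and the function name is made up here.

```python
# Toy model (an assumption for illustration): at a critical point, each
# Hessian eigenvalue is non-negative independently with probability p.
def prob_all_nonnegative(p: float, n_params: int) -> float:
    """Probability that all n_params eigenvalues are non-negative."""
    return p ** n_params


if __name__ == "__main__":
    # The probability of being "stranded" shrinks geometrically with
    # the number of parameters.
    for n in (1, 10, 100, 1000):
        print(n, prob_all_nonnegative(0.5, n))
```

      Under these assumptions the probability is p^n, so even for p close to 1 it becomes negligible once n is large.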

      • david-gpu 3 hours ago

        You don't need billions of parameters for that, precisely because the risk of being stuck at a local minimum decreases exponentially with the number of parameters. Right?

      • vatsachak 1 hour ago

        Isn't this trivial?

        What's more interesting is why double descent happens.

        • Scene_Cast2 5 hours ago

          IIRC the original author of the Lottery Ticket Hypothesis now disavows that idea.

          One intuitive way of looking at it is like so - let's say that you have a Gaussian-looking plot. You want to fit a Gaussian. You have a stupid simple model where you can slide your Gaussian left and right.

          If your initial starting point happens to be roughly within range, great, your optimizer will take care of it for you and slide it into the correct place. If you're too far, too bad, no meaningful gradient.

          Instead, neural nets give you the option to spawn a Gaussian anywhere you please. In this case, no sliding is necessary, but it comes at a heavy parametrization cost.
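
          A minimal sketch of the sliding-Gaussian picture, under assumptions of my own choosing: a unit-width Gaussian, a squared-error loss on a fixed grid, plain gradient descent, and made-up values for the grid, learning rate and starting points. It only illustrates the "no meaningful gradient" point, not how a neural net is actually trained.

```python
import math

TARGET_MU = 0.0                            # where the data's bump sits
XS = [i * 0.1 for i in range(-300, 301)]   # evaluation grid on [-30, 30]


def gaussian(x: float, mu: float) -> float:
    """Unit-width, unnormalized Gaussian bump centered at mu."""
    return math.exp(-0.5 * (x - mu) ** 2)


def grad(mu: float) -> float:
    """Gradient w.r.t. mu of the mean squared error between model and target."""
    total = 0.0
    for x in XS:
        g = gaussian(x, mu)
        # d/dmu gaussian(x, mu) = gaussian(x, mu) * (x - mu)
        total += 2.0 * (g - gaussian(x, TARGET_MU)) * g * (x - mu)
    return total / len(XS)


def descend(mu: float, lr: float = 20.0, steps: int = 200) -> float:
    """Plain gradient descent on the single parameter mu."""
    for _ in range(steps):
        mu -= lr * grad(mu)
    return mu


if __name__ == "__main__":
    print("start near:", descend(1.0))    # slides toward TARGET_MU
    print("start far: ", descend(10.0))   # barely moves
```

          Starting at 1.0 the two bumps overlap, so the gradient pulls the model into place. Starting at 10.0 the overlap is essentially zero and the optimizer stays put.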

          • getnormality 4 hours ago

            A while ago a lot of the discussion about overparameterization was about explaining "double descent", the observation that test error doesn't descend monotonically and actually hits a local maximum around the point where the model has just enough parameters to interpolate the data. My favorite article about double descent looks at this in terms of splines [1]. If I can try to summarize that article: when you are designing a parametrized model to fit to data, you have a choice. You can either:

            1. Avoid overparameterization by design. Manually create or choose a space of functions that has limited degrees of freedom by construction.

            2. Accept overparameterization and regularize.

            The latter tends to be more robust, because of the bitter lesson. It's not practical to manually design an ideal, on-demand, just-right limited-parameter model for every dataset we are presented with. The best way to approach that ideal, it turns out, is really to just let the computer figure it out via regularized optimization over an overparameterized space.
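
            As a rough illustration of option 2, here is a sketch that fits 10 polynomial coefficients to 6 points and lets an L2 (ridge) penalty pick among the many possible fits. The data points, the polynomial features, the choice of ridge as the regularizer, and the helper names `solve` and `ridge_fit` are all assumptions made for this example, not something from the linked article.

```python
def features(x: float, degree: int) -> list[float]:
    """Polynomial features 1, x, x^2, ..., x^degree."""
    return [x ** k for k in range(degree + 1)]


def solve(a: list[list[float]], b: list[float]) -> list[float]:
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w


def ridge_fit(xs, ys, degree: int, lam: float) -> list[float]:
    """Minimize sum of squared errors + lam * ||w||^2 via the normal equations."""
    rows = [features(x, degree) for x in xs]
    d = degree + 1
    gram = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(d)]
    return solve(gram, rhs)


def norm(w: list[float]) -> float:
    return sum(v * v for v in w) ** 0.5


def train_mse(w: list[float], xs, ys) -> float:
    preds = [sum(wk * fk for wk, fk in zip(w, features(x, len(w) - 1))) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


# Made-up noisy samples: 6 points, 10 parameters (overparameterized).
xs = [-1.0, -0.6, -0.2, 0.2, 0.6, 1.0]
ys = [0.9, -0.3, 0.1, 0.2, -0.4, 1.1]

w_weak = ridge_fit(xs, ys, degree=9, lam=1e-6)    # close to interpolating
w_strong = ridge_fit(xs, ys, degree=9, lam=1e-1)  # smoother, smaller weights

if __name__ == "__main__":
    print("weak:  ", norm(w_weak), train_mse(w_weak, xs, ys))
    print("strong:", norm(w_strong), train_mse(w_strong, xs, ys))
```

            The model space stays the same in both fits. Only the penalty strength changes how much of that freedom is actually used, which is the "let the computer figure it out" part.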

            Statisticians started moving in favor of overparameterization long before deep learning got off the ground. This trend dates back at least to the machine learning bible, Elements of Statistical Learning (2001).

            [1] https://mlu-explain.github.io/double-descent/

            • schmuhblaster 2 hours ago

              > This trend dates back at least to the machine learning bible, Elements of Statistical Learning (2001).

              Could you elaborate on this?

              • porridgeraisin 3 hours ago

                Hi, I work on RL, or as it is known today, "classical" RL. I'm interested in knowing the latest work that explains double descent and in general optimisation behaviour of overparameterized neural networks. Do you have a survey paper or blog post or anything else to recommend?

              • WithinReason 4 hours ago

                How is this view inconsistent with the lottery ticket hypothesis?