Paper authors (and this post's author, apparently) like to throw in lots of scary-looking maths to signal that they are smart and that what they are doing has merit. The Reinforcement Learning field is particularly notorious for doing this, but it's all over ML. Often it is not on purpose; everyone is taught this is the proper "formal" way to express these things, and that any other representation is not precise or appropriate in a scientific context.
In practice, when it comes down to code, even without higher-level libraries, it is surprisingly simple, concise and intuitive.
Most of the math elements used have quite straightforward properties and utility, but of course if you combine them all into big expressions with lots of single-character variables, it becomes really hard for anyone to understand. You kind of need to learn to squint your eyes and recognise the basic building blocks that the maths represent, but that wouldn't be necessary if it weren't obfuscated like this.
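To make that concrete: here is a minimal numpy sketch of the two losses behind the GAN minimax objective. The discriminator outputs below are made-up values for illustration; real code would batch over data and noise.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # D maximizes E[log D(x)] + E[log(1 - D(G(z)))];
    # minimizing the negative is the same update.
    return -(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

def generator_loss(d_fake):
    # Non-saturating generator loss from the GAN paper:
    # maximize E[log D(G(z))] rather than minimize E[log(1 - D(G(z)))].
    return -np.mean(np.log(d_fake))

# Hypothetical discriminator outputs in (0, 1) for real and generated batches.
d_real = np.array([0.9, 0.8, 0.95])
d_fake = np.array([0.1, 0.2, 0.05])
d_loss = discriminator_loss(d_real, d_fake)
g_loss = generator_loss(d_fake)
```

Each big expectation in the paper collapses to an `np.mean` over a batch, which is most of the point being made above.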
I’m going to push back on this a bit. I think a simpler explanation (or at least one that doesn’t involve projecting one’s own insecurities onto the authors) is that the people who write these papers are generally comfortable enough with mathematics that they don’t believe anything has been obfuscated. ML is a mathematical science and many people in ML were trained as physicists or mathematicians (I’m one of them). People write things this way because it makes symbolic manipulations easier and you can keep the full expression in your head; what you’re proposing would actually make it significantly harder to verify results in papers.
Maybe! I’ve found that people usually don’t do extra work if they don’t need to. The heavy notation in differential geometry, for example, can be awfully helpful when you’re actually trying to do Lagrangian mechanics on a Riemannian manifold. And superfluous bits of a definition might be kept around because going from the minimal definition to the one that is actually useful in practice can sometimes be non-trivial, so you’ll just keep the “superfluous” definition in your head.
To add to this, I'd even argue that the most "scary looking" parts of the GAN paper are where Goodfellow is just showing intermediate steps, like in (4) and (5). I guess one can argue that this is superfluous but that feels pretentious. I'd argue that the math here is helping communicate.
I think people forget why math is used. I'm always a little surprised that programmers don't see this because the languages are being used for the same reasons. Precision. They're terrible languages to communicate something like this conversation but then again English is a terrible way to communicate highly abstract concepts.
On the other hand, I've definitely seen people use math to make their works seem more important (definitely in some ML) I think I more frequently see it just being copy pasted (like every diffusion paper ever). I think that is probably superfluous, though it's definitely debatable and I'm absolutely certain these use cases aren't for flexing lol.
Agreed. Also, fwiw, the mathematics involved in the paper are pretty simple as far as mathematical sophistication goes. Spend two to three months on one "higher level" maths course of your choosing and you'll be able to fully understand every equation in this paper relatively easily. Even a basic course in information theory coupled with some discrete maths should give you essentially all you need to comprehend the math in this post. The concepts being presented here are not mysterious and much of this math is banal. Mathematical notation can seem foreboding, but once you grasp it, you'll see, like Von Neumann said, that life is complicated but math is simple.
Haha, I recognise this. I invented a fast search algorithm and worked with some academics to publish a paper on it last year.
They threw all the complex math into the paper. I could not initially understand it at all, despite having invented the damn algorithm!
Having said that, picking it apart and taking a little time with it, it actually wasn't that hard - but it sure looked scary and incomprehensible at first!
I think you misunderstand what the math is for. The math is not for training the model but for understanding why the model can be formulated that way and why this training will work. It is the exact opposite of obfuscation.
Think of it this way
You don't need math to train a good model but you need math to know why your model is wrong.
It isn't about lording over others, it is that in research you care why things work just as much as that they work. The reason for this is very simple: it's fucking hard to improve things when you don't understand them. If you just have a black box then the only strategy you have available is brute force. But if you analyze things and build knowledge, then you don't have to brute force.
Also, the idea of using a paper to signal intelligence is kinda silly. Papers aren't being written for the general public, papers are the communication between scientists. Who are they impressing? Each other? The others who are going to call them out if they write bullshit or make arguments convoluted? I don't buy that. But maybe because I'm a researcher. But I also don't think I need to use math to look smart, my PhD and publication record do a good enough job of that on their own. I don't even need it to flex to other researchers. The math in my papers is because it is just easier to communicate. I'm sure there's concepts that you find easier to understand by reading code than by using English. Same thing. Math and programming are great languages when you need high precision and when being pedantic is essential. Math is used because it is the best way to communicate, not as a flex. We flex on each other by showing how our ideas are the best. You can't do that if the other person doesn't understand you.
@staticelf and anyone else that feels that way:
That feeling is normal in the beginning. Basically your first year of a PhD is spent going "what the fuck does any of this mean?!?!" It's rough. But also normal. You're working at the bounds of human knowledge and papers are written in the context of other papers. It's hard to jump in because it is like jumping into the middle of a decades (or longer) conversation. If you didn't feel lost then the conversation probably wasn't that complicated and we'd probably have solved those problems much earlier. So you sit down and read a lot of papers to get context to that conversation.
My point is, don't put yourself down. The hill you need to climb looks steeper than it is. Unfortunately it is also hard to track your progress so you tend to feel like it's continually out of reach until it suddenly isn't. (It's also hard because everyone feels like an imposter and many are afraid to admit not knowing. But the whole job is about not knowing lol) Probably the most important skill in a PhD is persistence. I doubt you're too stupid. I'm sure you can look back and see that you've done things you or other people are really impressed with. Things that looked like giant mountains to climb but looking back don't seem so giant anymore. We'd get nowhere in life if we didn't try to do things we initially thought were too hard. Truth is you never know till you try. I'm not going to say it's easy (it isn't), but that it isn't insurmountable. You can't compare yourself against others who have years of training. Instead look at them and see that that's where this training can take you. But you can't get there if you don't try.
I come from a country which had a strong Soviet influence, and in school basically we were taught that behind every hard formula lies an intuitive explanation. As otherwise, there’s no way to come up with the formula in the first place.
This statement is not strictly true; there are counterexamples I encountered in my university studies. But I would say that intuition will get you very far. Einstein was able to come up with the special theory of relativity just by manipulating mental models, after all. Only when he tried to generalize it did he hit the limit of the claim I learned in school.
That being said, after abandoning intuition, pure mathematical reasoning drives you to the desired place, and from there you can usually reason about the theorem intuitively again.
Math in this paper is not that hard to learn, you just need someone to present you the key idea.
> behind every hard formula lies an intuitive explanation
Probably a good thing to teach people when starting out. Especially since I think one thing people struggle with is converting the symbols into the abstract ideas they communicate.
You're definitely right that a lot of math (if not most) is really unintuitive. But I still like the sentiment behind that idea. Maybe it changes with translation, but I feel like equations make a lot more sense when I break them down and think about what the symbols are doing and how they relate to one another, compared to just looking at them as symbols to manipulate. Like seeing the form F=ma as more than mass and acceleration, and noticing how in structure it is similar to F=-kx. Getting there isn't easy, but once you do it is much more intuitive than it was before.
Haha, I was just going to say the same. I was hoping, I guess naively, that this would explain the math. Not just show me math. While I love a good figure, I like pseudocode just as much :)
Whenever someone says this I like to point out that GANs are very often used to train the VAE and VQ-VAE models that LDM models use. Diffusion is slowly encroaching on their territory with 1-step models, and there are now alternative methods to generate rich latent spaces and decoders too, so this is changing. But I'd say that up until last year, most image generators still used an adversarial objective for the encoder-decoder training. This year, I'm not sure.
Exactly. For real-time applications (VTO, simulators, ...), i.e. 60+ FPS, diffusion can't be used efficiently. The gap is still there afaik. One lead has been to distill DPMs into GANs; I'm not sure this works for GANs that are small enough for real time.
I mean it is really hard to push diffusion models down in size so that just makes the speed part hard. I'm not sure diffusion can ever truly win in the speed race, at least without additional context like breadth of generation. But isn't that the thing? The best model is only the best in a given context?
I think the weirdest thing in ML has always been acting like there's an objectively better model and no context is needed.
Whilst it's maybe not worth studying them in detail I'd say being aware of their existence and roughly how they work is still useful. Seeing the many varied ways people have done things with neural networks can be useful inspiration for your own ideas and perhaps the ideas and techniques behind GANs will find a new life or a new purpose.
Yes you can just concentrate on the latest models but if you want a better grounding in the field some understanding of the past is important. In particular reusing ideas from the past in a new way and/or with better software/hardware/datasets is a common source of new developments!
GAN is not an architecture, it's a training method. As the models themselves change underneath, GANs remain relevant. (Just as you still see autoencoder being used as a term in new published works, which is an even older idea.)
Though if you can rephrase the problem as diffusion, that seems to be preferred these days. (Less prone to mode collapse.)
GANs are famously used for generative use cases, but they are also widely used for creating useful latent spaces with limited data, and they show up in few-shot-learning papers. (I'm actually not that up to speed on the state of the art in few-shot, so maybe they have something clever that replaces it.)
They're used as a small regularization term in image/audio decoders. But GANs have a different learning dynamic (Z6 rather than Z1 or Z2) which makes them pretty unstable to train unless you're using something like Bayesian neural networks, so they fell out of favor for the entire image generation process.
Adversarial loss is still used in most image generators: diffusion/autoregressive models work on a latent space (they don't have to, but it would be incredibly inefficient not to) created by an autoencoder, and these autoencoders are trained with several losses, usually L1/L2, LPIPS, and adversarial.
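As a rough sketch of what that multi-loss training can look like: the weights `w_rec` and `w_adv` below are illustrative guesses, not values from any particular model, and the LPIPS term is omitted since it requires a pretrained network.

```python
import numpy as np

def l1_loss(x, x_hat):
    # Pixel-space reconstruction term.
    return np.mean(np.abs(x - x_hat))

def autoencoder_loss(x, x_hat, d_fake, w_rec=1.0, w_adv=0.1):
    # Combine a reconstruction term with a generator-side adversarial
    # term, where d_fake is the discriminator's score of x_hat in (0, 1).
    rec = l1_loss(x, x_hat)
    adv = -np.mean(np.log(d_fake + 1e-8))  # small epsilon for stability
    return w_rec * rec + w_adv * adv
```

The adversarial term only nudges the reconstruction toward "looks real to the discriminator"; the reconstruction terms still do most of the work, which is why it can act like a regularizer here.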
If you ever wondered about the generalization to multiple classes, there is a reason that the gans look totally different:
https://proceedings.mlr.press/v137/kavalerov20a/kavalerov20a...
It turns out 2 classes is special. Better to add the classes as side information rather than try to make it part of the main objective.
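A minimal sketch of the "side information" idea, conditional-GAN style (the function name is hypothetical): rather than baking the classes into the objective, concatenate a label encoding onto the model's input.

```python
import numpy as np

def conditional_input(z, label, num_classes):
    # Append a one-hot class label to the latent vector so the
    # generator (and discriminator) see the class as side information.
    one_hot = np.zeros(num_classes)
    one_hot[label] = 1.0
    return np.concatenate([z, one_hot])
```

The adversarial objective then stays the same two-class real/fake game, with the class handled purely as an extra input.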
Reading an article like this makes me realize I am too stupid to ever build a foundation model from scratch.
Maybe.
But my experience as a mathematician tells me another part of that story.
Certain fields are much more used to consuming (and producing) visual noise in their notation!
Some fields even have superfluous parts in their definitions and keep them around out of tradition.
It's just as with code: Not everyone values writing readable code highly. Some are fine with 200 line function bodies.
And refactoring mathematics is even harder: There's no single codebase and the old papers don't disappear.
> like Von Neumann said, that life is complicated but math is simple
Maybe for Von Neumann math was simple...
The big zig-zaggy "E" is a for loop. That's all you really have to know.
It takes a while to get into; just like with everything, determination is key.
Also, there are libraries that abstract away most if not all of these things, so you don't have to know everything.
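To make the "for loop" reading concrete: the expectation symbol in these papers is, in code, just an average of a function over samples. A tiny sketch (the function names are made up):

```python
import random

def expectation(sample, f, n=100_000):
    # Approximates E_{x ~ p}[f(x)] with a plain for loop:
    # draw n samples from p and average f over them.
    total = 0.0
    for _ in range(n):
        total += f(sample())
    return total / n

# Example: for x uniform on [0, 1], E[x^2] = 1/3.
random.seed(0)
estimate = expectation(random.random, lambda x: x * x)
```

In practice frameworks replace the loop with a batched mean, but the semantics are exactly this.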
That's the thing, it's too hard to learn, so I'd rather do something else with the limited time I have left.
I wasn't taught this, but came to this conclusion after much struggle, and I think this mentality has served me very well.
I hope anyone who is unsure will read your comment and at least try to follow it for a while.
"You Are NOT Dumb, You Just Lack the Prerequisites"
https://lelouch.dev/blog/you-are-probably-not-dumb/
Aren't GANs like ancient?
Last time I used a GAN was in 2015, still interesting to see a post about GANs now and then.
they're also used a lot for training current TTS and audio codec models to output speech that sounds realistic.
The article is from 2020, so it would be closer to relevancy back then.
Turing machines are ancient as well.
Yeah, title needs (2020) added.
GANs were fun though. :)