Human Judgment as a Specification

(blog.brownplt.org)

39 points | by surprisetalk 3 days ago

4 comments

ekidd 3 hours ago

> Telling people “you must read all the code generated by an LLM” is definitely meaningful—but it is not at all moderate (so most people won’t do it).
I am honestly heartbroken to live in a world where reading the code is seen as an unreasonable ask by either students or by professional working programmers.

- watwut 39 minutes ago
  
  What is heartbreaking about it? Code reviews were always the most sucky part of the job.
  They are also among the more recent inventions; they are not "the traditional" programming at all. It is not as if code review was the thing that attracted people to the profession or something that would be a more rewarding part of it.
- ramses0 2 hours ago
  
  Don't tell me you're reading all assembly generated by your local golang or javac compiler? And that you've read every line of code down the dependency tree for your node_modules?
  I'm just upset that we are throwing away the original prompts for generated code in such a cavalier fashion.
  
  - ekidd 2 hours ago
    
    The difference is that a compiler is a rigorous, (nearly) deterministic, heavily tested artifact built by expert humans. I have only encountered genuine code generation bugs in compilers twice in my career. And yes, those bugs I did trace to the assembly.
    An LLM prompt, even a huge one, is an incredibly vague document that leaves out most of the edge cases. And even Fable 5 happily ignores clear instructions in its prompt.
    Now, to be fair, I absolutely expect the buggy slop to win, and to drive out the people that either write their own code or at least read the output. This will, in turn, make customers less willing to spend money on software after they get burnt a few times by buggy garbage. I think this is pretty much inevitable once Fable returns. It's just too damn good at long time horizon tasks, generating far more mostly sorta working code than any human could reasonably read.
    
    - warkdarrior 2 hours ago
      
      > The difference is that a compiler is a rigorous, (nearly) deterministic, heavily tested artifact built by expert humans.
      How do you know your compiler is rigorous and deterministic? Did you review all of its code?
      
      - lkey 36 minutes ago
        
        Compilers have specifications, test suites, and teams of human beings (over decades) to ensure that what the compiler produces is nearly deterministic relative to code input. This is testable without even opening the black box.
        LLMs are intentionally not deterministic, nor is their output subject to any known specification. Output is a point in a high dimensional manifold, determined by the input vector, but this manifold is unknowable in a real and intractable sense.
        These are not equivalent constructions and it demeans you to conflate them.
        
        - whattheheckheck 29 minutes ago
          
          So write the thing that proves the black box of magic output satisfies the solution?
      - totallykvothe 47 minutes ago
        
        Bad faith argument. You know this comparison is ridiculous
  - wild_egg 2 hours ago
    
    `npm install` is the OG vibecoding
  - acedTrex 2 hours ago
    
    I despise this retort that I see constantly; in no way, shape, or form is it remotely an accurate analogy. They are two completely different things and it's dishonest to equate the two.
    
    - ramses0 1 hour ago
      
      "A compiler is free to optimize...", on sufficiently basic prompting "make me a user address collection form that writes to a database table called 'registered_users'..."
      ...I agree it's not deterministic (neither are all your variations of C compilers, neither is Firefox v Safari v Chrome), but it probably Does Something(tm), and I might not want to peel back the covers and see how it used React, or Vue, VanillaJS, QT, or GTK.
      It's upsetting that we are _committing the generated code_ rather than being able to use better and better optimizing compilers against the original prompt of: "make me a user registration form with database connection"
      ...I'm very with you on "it's not an accurate analogy", but I'm pointing out that there have been sea-changes already w.r.t. strict adherence to the generated code, or inclusion of left-pad v react libraries.
      ...and there have been corresponding productivity gains (debatable? ;-) when we've worked at these higher levels of abstraction.
      I'm personally still in the "blacksmith" stage of working with AI output (put it back in the fire and beat on it a bunch more times), and shudder in horror at the thought of maintaining (or paying to maintain) megabytes of hours of token generation that looks like source code.
      I'm hopeful that we'll eventually strip out some of the mud between the CPU and putting pixels on the screen (with the help of LLMs?), and that we'll still be able to understand and reason about the real "DAG" of what our programs are trying to do (e.g. declarative GUIs, kind of like we have declarative SQL), but there will always be a muddy middle part where the computer/compiler/LLM is doing something in between that _is_ sufficiently reliable for us to ignore those bits most of the time.
remywang 3 hours ago

> Telling people “you must read all the code generated by an LLM” is definitely meaningful—but it is not at all moderate (so most people won’t do it)
But they should! The code is the best source of truth on what the software is doing after all.
Instead of giving up on that, we should make it easier to read generated code, e.g. by generating less code in a higher level language.
On the flip side, forcing myself to read all the code also resulted in a smaller, higher quality code base.
otekengineering 3 hours ago

this is the type of thing you need to build a foundation sturdy enough to let you operate higher up the stack and ratchet to design-by-metaphor and then design-by-philosophy. those design skills are taught in humanities departments, not engineering departments, so this is a weird feeling place for those of us that wandered over from a technical field.
jMyles 5 hours ago

> This is also why PICK can usefully fail. Sometimes none of the model’s candidates is right, and PICK ends with zero survivors. Under the spec-elucidation reading, that outcome means: the commitments you made through classification could not be satisfied by anything the model produced. Better to know than to ship the regex anyway.
Zooming out (but only a little) from the impetus to formalize a commitment to a particular class of result candidate (what the author here is calling "spec elucidation"), we can also imagine this same evolution of concerns being applied in order to turn what we currently term "AI safety" into something more like "AI ethics".
For example, if we can elucidate the specifications for things like peace and justice to ensure that the class of results is formally verified as non-participation in war (or perhaps, further in the future, non-participation in state activities whatsoever), we may be able to throw cold water on all the vitriolic arguments about model capabilities and which need to be banned or delayed lest we accelerate the apocalypse (or whatever is actually on the mind of the ban-this-model constituency).
I like how the author ends tersely with:
> If you have a formal language with the closure properties above — we suspect you would be surprised how many do — we would very much like to hear from you.
That's certainly not me, but I bet it's true that it's somebody.
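
The "zero survivors" outcome in the first quote above can be illustrated with a minimal sketch. This is not the article's PICK implementation: it assumes, purely for illustration, that candidates are regexes and that the user's classifications are (string, should-match) pairs. The function name `surviving_candidates`, the candidate patterns, and the labels are all hypothetical.

```python
import re


def surviving_candidates(candidates, labeled_examples):
    """Keep only the candidate regexes that agree with every user classification.

    Assumption: a candidate "survives" if, for each labeled example, whether it
    fully matches the string equals what the user said should happen.
    """
    survivors = []
    for pattern in candidates:
        compiled = re.compile(pattern)
        agrees = all(
            (compiled.fullmatch(text) is not None) == should_match
            for text, should_match in labeled_examples
        )
        if agrees:
            survivors.append(pattern)
    return survivors


# Hypothetical candidates a model might propose for
# "a regex matching countries of North America".
candidates = [
    r"Canada|United States|Mexico",
    r"Canada|United States|Mexico|Cuba|Jamaica",
]

# Hypothetical user classifications: (example string, should it match?).
labels = [("Canada", True), ("Cuba", True), ("Brazil", False)]
print(surviving_candidates(candidates, labels))

# One more commitment that no candidate satisfies leaves zero survivors.
labels_more = labels + [("Panama", True)]
print(surviving_candidates(candidates, labels_more))
```

With the first set of labels only the second candidate survives; adding the Panama label eliminates it too, and the empty result is the signal that none of the proposed candidates meets the commitments made so far.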

- NitpickLawyer 4 hours ago
  
  > ensure that the class of results is formally verified as non-participation in war
  There are very few things that cannot be stated as dual use, with one totally benign and one totally screwed up. It's like wanting a hammer to distinguish if it's striking a nail for a roof vs. a nail for an illegal animal pen. That's the wrong application of constraints. The hammer shouldn't care.
  
  - jMyles 3 hours ago
    
    The author addresses this point as well:
    > This is also why we do not believe PICK becomes less useful as models improve. Better models do not make user intent more articulate — asked for “a regex matching countries of North America”, a more capable model still cannot tell you whether you want the Caribbean included, or where you want to stop heading south. Better models produce better candidates, faster — which shifts user effort precisely toward the work PICK is built to support.
    
    - NitpickLawyer 2 hours ago
      
      That's not what I'm saying, though. I quoted the "non-participation in war" bit. I don't see how any system can ascertain if a prompt asking for an algorithm is dual use or not.