Solid Go codebase, especially the lock-free CAS rate limiting and explicit crypto.ZeroKey memory wiping. Curious how you handle token billing if an upstream SSE stream drops before the final usage chunk arrives?
Sorry for the late reply, sleep and work got in the way :)
Good question. In the current code, streaming usage is best-effort: for OpenAI-style SSE we rely on the provider’s usage chunk, and if the stream drops before that arrives we can undercount, including falling back to zero for that request.
Anthropic is slightly better in our adapter because we accumulate input_tokens from message_start and output_tokens from message_delta, but premature termination can still leave usage incomplete.
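To make the Anthropic accumulation concrete, here is a minimal sketch of the idea. The event and field names (`message_start`, `message_delta`, `input_tokens`, `output_tokens`, `stop_reason`) follow the Anthropic Messages streaming API, but the `Usage` struct and `Accumulate` method are illustrative assumptions, not VoidLLM's actual types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Usage accumulates token counts from Anthropic-style SSE events.
// Complete stays false if the stream drops before the final
// message_delta carrying a stop_reason, which is how we can tell
// the totals may be an undercount.
type Usage struct {
	InputTokens  int
	OutputTokens int
	Complete     bool
}

// event is a minimal shape covering only the payload fields we read.
type event struct {
	Type    string `json:"type"`
	Message struct {
		Usage struct {
			InputTokens int `json:"input_tokens"`
		} `json:"usage"`
	} `json:"message"`
	Usage struct {
		OutputTokens int `json:"output_tokens"`
	} `json:"usage"`
	Delta struct {
		StopReason string `json:"stop_reason"`
	} `json:"delta"`
}

// Accumulate folds one raw SSE data payload into the running usage.
func (u *Usage) Accumulate(data []byte) error {
	var ev event
	if err := json.Unmarshal(data, &ev); err != nil {
		return err
	}
	switch ev.Type {
	case "message_start":
		u.InputTokens = ev.Message.Usage.InputTokens
	case "message_delta":
		// output_tokens is cumulative, so overwrite rather than add.
		u.OutputTokens = ev.Usage.OutputTokens
		if ev.Delta.StopReason != "" {
			u.Complete = true
		}
	}
	return nil
}

func main() {
	var u Usage
	events := [][]byte{
		[]byte(`{"type":"message_start","message":{"usage":{"input_tokens":25}}}`),
		[]byte(`{"type":"message_delta","usage":{"output_tokens":12},"delta":{"stop_reason":""}}`),
		// Stream drops here: no final message_delta with a stop_reason.
	}
	for _, e := range events {
		_ = u.Accumulate(e)
	}
	fmt.Println(u.InputTokens, u.OutputTokens, u.Complete) // 25 12 false
}
```

The point of the `Complete` flag is that a dropped stream is detectable (partial counts, no stop reason) rather than silently indistinguishable from a finished one.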
That trade-off is deliberate for now: VoidLLM is optimized for minimal hot-path overhead, so we don’t run a tokenizer inline on every streaming response. The usage data is operational metadata, not a billing source of truth.
The likely next step is to improve this off the hot path: better mid-stream interruption signaling, and optional local token estimation / reconciliation as a fallback path rather than making every stream pay for it.
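As a sketch of what that fallback estimation could look like, here is a deliberately crude character-count heuristic. The 4-characters-per-token ratio is a common rule of thumb for English text, not an exact tokenizer, and `EstimateTokens` is a hypothetical helper, not part of VoidLLM today:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// EstimateTokens is a rough fallback for streams that die before the
// provider's usage chunk arrives. It trades accuracy for zero hot-path
// cost: billing can reconcile an approximate count instead of zero.
func EstimateTokens(text string) int {
	n := utf8.RuneCountInString(text)
	if n == 0 {
		return 0
	}
	est := n / 4 // ~4 chars per token is a common English-text heuristic
	if est == 0 {
		est = 1 // any non-empty text costs at least one token
	}
	return est
}

func main() {
	fmt.Println(EstimateTokens("hello world, this is a partial stream")) // 9
}
```

Because this runs only on the reconciliation path for interrupted streams, the intact streams still pay nothing extra.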
I think this is a very important concept. Well done.
I'm surprised to see that no start-up does this already.
It's going to be a real product category in the future IMO
I think so too, although I’m not sure which part you mean exactly :)
My own bet is that the broader concept is “LLM access infrastructure” - privacy, policy, key management, and usage visibility in one layer instead of being scattered across apps and vendors.