NewsNook

nesting Hacker News in a more meaningful way

BenchPress: Predict any LLM's score on any benchmark

How to Passive-Aggressively Shame People Who Use LLMs Selfishly

Measuring Search Ranking Quality with LLM Judged NDCG

ClaudeMeter – macOS menu bar app to track Claude usage and limits

LLM-CTF benchmark – 2,639 real data points from NeurIPS and original runs

California AB 2047 makes 3D printers off-limits to students, educators, business

Serving Large Language Models with a Minimalist Python CLI

Inference Compute Shapes Frontier LLM Evaluation

Confidence estimation is a better metric than agreement for LLM judges

LLMs Are Digitizing Judgment

Show HN: RLM-based local debugger for AI agent traces

Show HN: Hallu – a web framework where an LLM hallucinates your app

Charon: A blind, end-to-end-encrypted marketplace for LLM inference

Wayfinder – routing LLM prompts without another LLM

Show HN: Compilr.dev, multi LLM AI workspace

Why developers use LLMs to write blog posts

Show HN: peerd – AI agent harness that runs entirely in your browser