TL;DR
FUTURE TOKENS is a blog + repo to discover and document LLM affordances: reusable commands that reliably turn the context we have into the context we need.
LLMs collapse idea → spec → prototype, right in the chat window: name a command and I often get a runnable v0.
So a key bottleneck is naming and governing commands. I’ll name them, publish their artifacts, and show how they work together.
LLM Affordances
Using chat, we can define a Command in natural language whose output is a specific Artifact, exactly like defining a function and its return type [1].
Command: A named verb specified in a Markdown file in the repo (e.g., dimensionalize.md, rhyme.md, metaphorize.md). Each command spec defines inputs→outputs (schema), what stays the same (guarantees), parameters, examples, stop rules, and more.
Artifact: The text returned in chat when the command runs. It must follow the command’s schema.
Together, the Command and Artifact pair is an affordance: an action the chat interface makes available with a predictable result shape.
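To make the function analogy concrete, here is a minimal Python sketch. It is purely illustrative: CommandSpec, Artifact, run, and the caller-supplied ask_llm are stand-ins, not the repo’s actual format or API.

```python
from dataclasses import dataclass, field

@dataclass
class CommandSpec:
    """Stand-in for a command's *.md spec: inputs/outputs, guarantees, parameters, stop rules."""
    name: str                                            # e.g. "dimensionalize"
    inputs: list[str]                                     # context the command consumes
    output_schema: str                                    # shape the Artifact must follow
    guarantees: list[str] = field(default_factory=list)   # what stays the same
    parameters: dict = field(default_factory=dict)        # knobs, e.g. number of axes
    stop_rules: list[str] = field(default_factory=list)   # when not to run / when to stop

@dataclass
class Artifact:
    command: str   # which command produced this text
    text: str      # chat output, expected to follow the command's output_schema

def run(spec: CommandSpec, context: str, ask_llm) -> Artifact:
    """Call a Command like a function: context in, schema-shaped Artifact out."""
    prompt = (
        f"Command: {spec.name}\n"
        f"Output schema: {spec.output_schema}\n"
        f"Parameters: {spec.parameters}\n"
        f"Context:\n{context}"
    )
    return Artifact(command=spec.name, text=ask_llm(prompt))
```

Unlike a scripted function, the output is non-deterministic [1]; the schema is a contract to check against, not a hard guarantee.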
Commands
[Context We Have] —(Command)—> [Context We Need]
The Command is the arrow. We usually discover it after seeing different problems solved the same way.
If a similar [Have]→[Need] pattern repeats across domains, we name a Command and standardize its schema in the repo.
Example: Dimensionalize
Consider the following situations:
[Context We Have] → [Context We Need]
[Preferences, watch history, subscriptions] → What should I watch now?
[Role constraints, resumes, company culture doc] → Who should I fast‑track?
[Backlog, budget, dependencies] → What are the next three tasks?
[Feature ideas, user complaints, SLAs] → What do we fix first?
Insight: each wants a recommendation from mostly qualitative inputs. The common move is: “expose 3–7 orthogonal axes, score options, pick.”
So we name the command: Dimensionalize. Can LLMs do it? Early exploration: here.
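As a toy illustration of that move, the arithmetic behind “expose axes, score, pick” is just a weighted sum. The axes, weights, and scores in this Python sketch are invented for the watch-list case; in practice the LLM proposes them from the qualitative context and returns them inside the Artifact.

```python
# Illustrative only: hand-picked axes/weights/scores for "What should I watch now?"
# The Dimensionalize command's real job is the part this sketch skips: surfacing
# 3-7 orthogonal axes and defensible scores from messy qualitative context.

axes = {                 # axis name -> weight (weights sum to 1)
    "mood_fit": 0.4,
    "time_available": 0.3,
    "novelty": 0.3,
}

options = {              # option -> score (0-1) on each axis
    "slow documentary": {"mood_fit": 0.9, "time_available": 0.4, "novelty": 0.7},
    "sitcom rerun":     {"mood_fit": 0.6, "time_available": 0.9, "novelty": 0.2},
    "new thriller":     {"mood_fit": 0.7, "time_available": 0.5, "novelty": 0.9},
}

def pick(options, axes):
    """Weighted sum across axes; return options sorted best-first."""
    scored = {
        name: sum(axes[axis] * score for axis, score in scores.items())
        for name, scores in options.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

print(pick(options, axes))  # a real Artifact would also report the axes and scores
```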
FUTURE TOKENS Is This Search
Spot repeating [Have]→[Need] patterns in real work or life.
Name the candidate Command in plain English. (If I can’t name it, I can’t call it.)
Spec the Artifact.
Try the Command in real situations, iterating on the Artifact spec if needed.
Measure (lightly, for now) [2].
Publish the Command (*.md in the GitHub repo) if it works. Retire it if drift or misuse shows up.
That’s it: workbench, not metaphysics. Discovery first; polish later.
What FUTURE TOKENS Isn’t
Not metaphysics. LLMs make mistakes and change all the time. I’m searching for what works, not capital-T Truth or a global optimum.
Not jailbreaking. We catalog allowed moves and their artifacts. Refusals are fences, not puzzles to exploit.
Not WYSIWYG. Most mobile UI exposes every affordance as a button; LLMs don’t, and won’t. Nobody knows the full functionality [3].
Not a terminal. There’s no manual that lists the latent behaviors; we have to discover, name, and test them.
What FUTURE TOKENS Is Like
Each frame is a way to look at the same search for Commands, and a different invitation for what to try next.
Electrical Outlet → Appliances
One simple interface powers an economy’s worth of appliances. Studying the plug geometry won’t predict refrigerators or computers. All the chat appliances are waiting to be invented.
Magic Spells
Naming ≈ execution. The chat window is spell‑casting. We’re growing a grimoire: when to cast, what to expect, when to stop.
Kitchen Techniques > Recipes
We collect techniques (pre‑heat, deglaze, salt‑to‑taste), not dish instructions (shakshuka). Recipes are one-offs. Techniques compose across contexts; so do Rhyme, Metaphorize, Dimensionalize.
Printing Press
Move from handwriting to typesetting for thought. Create the plates (Commands), standardize the layout (Artifacts), scale output without losing editorial judgment.
Why this search?
Because of what we can potentially find:
Callable commands that generalize problem solving across more than one domain.
Control surfaces (axes + knobs) that steer outcomes.
Compositions of commands that compound.
A better collaboration contract: LLMs handle breadth, consistency, and speed; people handle norms, stakes, and taste.
The socket is on. What should we power next?
[1] Though results are non-deterministic, in contrast to most scripted functions.
[2] Why measurement is hard:
The work is judgment‑heavy.
Feedback is slow and noisy.
It’s hard to identify the alternative path or result.
Some ideas:
Consistency testing (running the same command on the same context produces similar results; see the sketch below).
Degradation testing (running the command with parts of the spec removed causes worse results).
User validation.
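A minimal sketch of the consistency idea, assuming a caller-supplied run_command and a deliberately crude surface-level similarity measure; real artifacts would likely need a schema-aware or semantic comparison.

```python
from difflib import SequenceMatcher
from itertools import combinations

def consistency(run_command, context: str, trials: int = 5) -> float:
    """Run the same command on the same context several times and return
    the mean pairwise text similarity (0-1) of the resulting artifacts."""
    artifacts = [run_command(context) for _ in range(trials)]
    pairs = list(combinations(artifacts, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
```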
[3] We have evals for a number of tasks, and they suggest LLMs are quite capable. One might conclude that LLMs should be assumed incapable of any task without a benchmark, but no one would argue that humans are only capable of the tasks that have public benchmarks.