Lord Ajax

i write software and shitty poetry

Two Weeks of Vulkan Kernels, Evolutionary Training, and Finding Your Coding Soulmate on GitHub

Training a GPU backend from scratch in Vulkan+SPIR-V, teaching neural networks to evolve their own activation functions, and accidentally building a developer social layer.

Four repos. 386 commits. ~95,577 lines added, ~204,996 deleted. The deletion number is larger than the addition number, which means something went right for once. The pattern across all of it: I keep building systems that improve themselves. Alpha2 runs an evolutionary loop that discovers better activation functions and benchmark-tunes its own kernel parameters without recompiling. Omega has a live error-to-PR pipeline where production bugs create their own fix requests. TPMJS grew enough tooling to let any AI agent configure its own environment. And Symploke shipped a feature that algorithmically finds developers who build like you do. I didn’t plan this theme. I noticed it around commit 150.

Why You Should Care

  • Hand-written Vulkan GPU backend is live in alpha2: SPIR-V shaders, multi-queue dispatch, f16 mixed precision, cooperative matrix tiling — all written from TypeScript, benchmarked per-commit with % throughput deltas
  • Symbiogenesis mode ships: alpha2 now runs an evolutionary search over neural activation functions (relu, gelu, silu, KAN splines, universal approximators) and selects the best per-layer through live training, tracked in a Turso DB and visualized in a 3D/2D canvas
  • 73% faster token throughput measured over the two-week optimization loop: each of ~60 perf commits in alpha2 includes a measured tok/s delta; the cumulative gain from the GPU stats, kernel dispatch, and adaptive sync changes is substantial
  • Symploke “Mates” launched: finds your GitHub coding match via AI profile analysis — full UI, BullMQ job queue, profile cards, AI-generated narrative matchmaking, dark mode
  • TPMJS install wizard now supports 9 editors: Claude Code, Codex, and 7 others get auto-configured with one curl | bash command; setup flow detects re-runs and is idempotent
  • Omega got “Right to be Forgotten”: GDPR-style full user profile deletion, deleteMyProfile tool, deduplication guard to prevent double entries — plus an OpenAI-compatible chat completions endpoint so anything that talks to OpenAI can talk to Omega instead

alpha2 — Writing a GPU Backend by Hand and Making It Fast

Problem: The training system was fast enough to iterate on ideas but not fast enough to actually train anything meaningful. Every step hit the same ceiling: too much time in JS-land, too many GPU round-trips per op, too many redundant kernel dispatches. I also wanted to experiment with evolutionary activation function search, which requires training hundreds of candidates — a slow trainer is a bottleneck.

Approach: This was the most technically dense repo this period by a wide margin — 232 commits, most of them with explicit measured throughput deltas. I’ll break it into three phases.

Phase 1: The Helios GPU Backend

The foundation is Helios — a hand-written Vulkan compute backend accessed from TypeScript via a native Node addon. The backend speaks SPIR-V shaders and dispatches work through multiple queues. The core additions this period:

  • Cooperative matrix tiling (acdf981): 2x2 subgroup tile layout for large GEMMs. On an L4 GPU, this gave +14% at 1024-dim, +28% at 2048-dim, +30% at 3072-dim matrices. Measured by running the matmul benchmark suite against both code paths.
  • f16 input path (d8a6a65): f32→f16 cast before entering the cooperative tile path, saving conversion overhead. Net result: +4.8–9.3% depending on matrix shape.
  • Push constant memoization (658df74, 1136e23): The GPU dispatch path was materializing new push constant arrays on every op. Pre-computing them once per graph and reusing reduced JS overhead. The 4-float constant cache gave +16.89% tok/s, measured against a pinned benchmark config.
  • Graph packer (fdb5aba, 4e87491): Pre-computing dispatch metadata and packed graph sizes eliminated a flush prepass that was running before every graph execution. +12.54% and +11.92% respectively.
  • Native addon optimization (889a34a, 97c303d): Compiled with -O3 -DNDEBUG and link-time optimization (-flto). Each gave ~2-3% gains measured via 5-run A/B.

The f16 mixed precision story deserves its own paragraph. The commit sequence is: add f32↔f16 cast kernels → add mixed precision training option (8fcce17) → enable FlashAttention during dropout training (8b21353) → add --fp16 flag to the GCP training script (3f433ed). The tricky part was loss scaling — when you train in f16, gradients underflow to zero constantly if you don’t scale the loss up before backward and scale gradients down before optimizer step. Dynamic loss scaling handles this automatically.

FlashAttention itself (6b43175, 47252fa) was two commits: the kernel implementation and a fused sumOfSquares kernel that saves 85 GPU ops per step in gradient norm computation. Gradient norm computation is a reduction — every parameter’s gradient gets squared and summed. Doing that in a fused kernel instead of separate ops was a real win.

Phase 2: Symbiogenesis Mode

The headline feature. Symbiogenesis is an evolutionary search over neural activation functions — instead of picking one activation (relu, gelu, etc.) and training with it, the search runs multiple candidates in parallel, evaluates their loss trajectories, and selects better ones over generations.

The commit that introduced it (1c35397): --symbio flag enables the mode, which runs with full monitoring, live metrics, and FFN activation search. The search space includes relu, gelu, silu, KAN splines (bd672d1), universal approximators (bd672d1), and composed activation functions (2be5651). “Composed” means structurally mutated activation graphs — not just choosing from a fixed set, but evolving the structure of the activation expression through graph mutations and autograd-evaluated fitness.

The visualization layer is real and interactive. There’s a 3D radial training visualization using Three.js (b7d0e0e), later replaced with a 2D canvas radial oscillator (c1f2f7c) because Three.js was overkill for this use case. There’s a candidate lineage tree with pan/zoom, scroll-to-zoom, drag-to-pan, minimap, fit button (11269e9). An activation switch log table shows exactly when and why the search made a switch (49745dc). A “tug of war” chart shows training vs validation loss pulling against each other (28334cb).

Symbiogenesis runs get stored to Turso (the DB package is @alpha/db) and the web dashboard polls for fresh data every 15 seconds (8c66173). Runs are sorted by created_at (416e5bb), which matters when you’re running dozens of evolutionary candidates.

Phase 3: The Optimization Grind

Sixty-ish performance commits, each with a measured tok/s delta. Not all of them went positive — some experiments regressed. The discipline of measuring every change and including the number in the commit message is the right way to do this work. Cherry-picks:

  • 1aba671: Skip nvidia-smi path on non-NVIDIA GPUs. +40.45% tok/s. The GPU stats probe was unconditionally calling nvidia-smi, which hangs or errors on non-NVIDIA hardware and is slow everywhere.
  • 5bf2d22: Disable GPU stat probing on unsupported hosts entirely. +50.42% tok/s. Same root cause, more aggressive fix.
  • e2f0d4c: Benchmark from metrics.jsonl + sparse logging. +34.43% tok/s. The benchmark loop was reading from a database query on every iteration to get the current best run. Switching to reading from a local JSONL file and logging sparsely removed a major synchronous bottleneck.
  • 65bfa59: Skip phase-timer probes when trace disabled. +20.97% tok/s. Debug instrumentation that was always-on.
  • 8b848f1: Reuse single Tape object in train micro-steps. +17.68% tok/s. Tape objects were being allocated and GC’d on every micro-step; reusing one eliminates GC pressure.
  • b434296: Reuse grad norm scratch arrays across steps. +28.12% tok/s. Same pattern — allocation in the hot loop.
  • cfd78d1: Choose best retry attempt in benchmark loop. +28.01% tok/s. The benchmark was taking the last attempt rather than the best one.

There’s also a GCP fleet CLI (07610e0) for managing remote training instances, plus a gpuinfo command (5a2ad8f) that pulls full GPU/Vulkan diagnostics from a fleet node. The fleet tooling is how you go from “it works on my machine” to “it runs on 4 L4 GPUs in GCP.”

Results: The optimization loop produced compounding gains that are hard to summarize as a single number because each was measured against different baselines. The most extreme single commit was +50.42% for disabling GPU stat probing on non-NVIDIA hosts — which tells you something about how much invisible overhead was in there. The Symbiogenesis mode has run through hundreds of activation candidates in a 50k-step search config (d02eb5a). The web dashboard deploys on Railway with CI-baked commit SHAs in build-info.json.

Pitfalls / What Broke: The SIGSEGV from GPU buffer dual-ownership was real (3400d48) — when both the output pool and a FinalizationRegistry thought they owned the same buffer, the GC would free it while a kernel was still reading it. Fixed by eliminating the dual ownership. The cooperative matrix path had a layout bug (e16fc46) that gave subtly wrong results for certain matrix shapes — +0.11% throughput improvement in the fix but more importantly: correctness. The bun compile path kept timing out intermittently in benchmark loops, requiring a timeout/retry wrapper (804859d). The L4 profile auto-tuning regressed by 2.14% on initial introduction (c5b7020) before being fixed.

Next:

  • The Symbiogenesis search has configs for 50k and 250-generation runs — need to actually run them to completion and analyze results
  • Fleet CLI needs persistent state so you can track which GCP instances are running which experiments
  • Mixed precision training needs systematic validation against f32 to confirm loss parity

omega — The Agent That Fixes Its Own Bugs (Now With Privacy Controls)

Problem: Three things at once: (1) the personality profiling system was using astrology, which is dumb, and needed a genuine multi-pass analysis engine; (2) users needed a “delete everything about me” option; (3) the bot had no OpenAI-compatible API surface, which meant it couldn’t be used as a drop-in replacement for OpenAI clients.

Approach: The personality profiling overhaul was the largest chunk — two commits with big diffs (eee77e5, 11923d2). The old system was single-pass and included astrological analysis as a dimension. The new system: multi-pass analysis that builds a profile incrementally, an “Omega rating” as a single normalized score across personality dimensions (not a diagnosis, opt-in, and fully deletable), and removal of the astrology entirely. The web UI got updated to show the profile, including an analysis history diff viewer (10e6bfd, +386 lines) that lets users see exactly what changed between analysis runs.

The “Right to be Forgotten” feature is exactly what it sounds like: a deleteMyProfile tool that wipes all stored analysis data for a user. Two commits landed this (c4c3951, 73b4892), plus a follow-up to fix a duplicate metadata entry and add a deduplication safety net (83bdc14). GDPR compliance for a Discord bot is a specific flavor of annoying — the data is scattered across conversation history, profile tables, and analysis runs, and you have to get all of it.

The OpenAI-compatible chat completions endpoint (65ddef1, +522 lines) makes Omega speak the OpenAI wire protocol. Request comes in as /v1/chat/completions, gets routed through Omega’s tool system, response comes back in OpenAI format. Tested by pointing a local OpenAI client at it.

The auto-fix pipeline kept getting refined. The original autoMemory tool (built into Omega as a core capability) got ripped out and replaced with TPMJS createMemory (4c82155), bringing memory management in line with all other tool capabilities. Label trigger for Claude Code was fixed to include autofix and database labels (e26cebb). Issue title sanitization prevented PR creation from failing on titles with special characters (6e869f5).

Prisma was the other major battleground. The field naming in updateUserProfile was mixing camelCase and snake_case (c730c1d). sentiment_analysis was coming back from Prisma as a parsed JSON object (Prisma auto-parses JSON columns), and the code was trying to JSON.parse() it again, which fails (ef1528f). ConversationMessage was missing @map("user_id") in the schema (bedac92). The OpenSSL 3.0 binary target needed adding for Alpine compatibility (5987654). These are all the kinds of bugs that don’t show up in local dev because you’re not running Alpine Linux with OpenSSL 3.0 locally.

The homepage got an interactive OmegaWireframe visualization replacing the static layout (81532ba, +136 lines). Discord guild ID defaults were added to the system prompt (3e1fb10, 8d6ca2f) so channel listing commands work without needing the guild ID in every message. The AI SDK got upgraded to stable v6 and @ai-sdk/openai to v3 (a92d6ed).

Results: The deleteMyProfile tool works — verified by triggering it and confirming the database rows are gone. The analysis history diff viewer renders correctly in the web UI. The OpenAI-compatible endpoint accepts and returns valid OpenAI-format JSON, confirmed manually. The error → issue pipeline fired correctly on intentional test errors (the kickflip command, 29657d4) within the expected window.

Pitfalls / What Broke: The deleteMyProfile tool had a duplicate metadata entry that caused it to appear twice in tool listings (83bdc14). The deduplication safety net added there is a band-aid — the root fix is enforcing uniqueness at registration time. The comic listing page was broken because the Prisma query included imageData (a large blob) in the listing query, loading the full image for every record just to show a title list (6bfb326). pg pool errors were going uncaught (08d5783) — the pool emits error events that you have to handle explicitly or they crash the process.

Next:

  • The OpenAI-compatible endpoint needs auth and rate limiting before it’s safe to expose publicly
  • The “Omega rating” UI needs user testing — it’s a single number across multiple dimensions and that compression loses information
  • Analysis scheduling needs reliability testing — multi-pass analysis has more failure modes than single-pass

symploke — Find Your GitHub Coding Twin

Problem: Symploke is a project I’ve been building quietly. The core idea: given two GitHub users, how similar are they as developers? The “Mates” feature answers the specific question “who builds like me?” — analyze my GitHub activity, find people with similar patterns, generate a narrative about why we’d work well together.

Approach: The Mates feature (7f31c35, +2,849 lines in the initial commit) is a full app: crawl GitHub activity, run AI profile matching, display cards with match score and narrative. The job queue runs on BullMQ. The matching logic does vector similarity on activity profiles — language distribution, project types, commit cadence, open source vs. private project ratio.

The frontend went through a full design system overhaul (e88f496, +801/-270). Profile cards show on the homepage with full data (565a718, +196 lines). Each card links to a profile page by default with the AI-generated narrative as a secondary link (8eb271b). Profile pages got a raw activity data section so you can see exactly what data went into the match (d3325b9, +500 lines).

Dark mode was added (6cd1f35, +179/-86) — the toggle stores preference in local storage and applies a CSS class to the document root. OG meta tags were added to all mates pages (beef63a) for proper social sharing previews.

The security update: Next.js got bumped to fix CVE-2025-55184 (a critical Railway vulnerability in older Next.js versions — c998dac, +909/-1184). The version specifiers were then pinned to match the lockfile (70de56b) to prevent drift.

BullMQ job dedup had a bug where duplicate jobs were being enqueued because the dedup key wasn’t being set correctly (bdd0d81). Fixed by passing the dedup key through the job options properly.

The AI matching prompts got tuned to be more engineering-focused (141ada2) — the original prompts were generating generic “you both like coding” narratives. The revised prompts extract specific technical overlap: same ecosystems, similar problem domains, complementary skill gaps.

Results: The Mates app is live. Profile matching runs through BullMQ. The UI loads profile cards, displays the AI narrative, links to GitHub profiles. The OG meta tags were verified by running a preview in a link unfurler. The CVE fix is the only instance where an external security requirement forced a version change — confirmed via the Railway security scanner.

Pitfalls / What Broke: The split() on undefined crash in the mates crawler (bf12e15) — the crawler was assuming activity data always had a certain field present, and some GitHub users have incomplete profiles. Fixed with a null check before split. Profile page crash on new additions (a1dc308) — adding a new profile to the list triggered a re-match of all profiles, which hit a race condition. Fixed by making the re-match async and handling the intermediate state properly. The design system button overrides took several commits to get right (eea5229, b0b3531, 898ada2) — CSS specificity fights in design systems are eternal.

Next:

  • The matching algorithm currently uses activity similarity — adding code similarity (language/framework overlap) would improve match quality
  • BullMQ queue needs a dashboard to monitor job states; currently you’re flying blind on processing status
  • Consider whether profile crawls should be scheduled (daily refresh) or on-demand only

tpmjs — The Tool Package Manager Grows Up

Problem: TPMJS is the npm for AI agent tools. This period, three things needed fixing: the install experience was manual and fragile, the tool registry needed more coverage, and the web interface needed to not look like it was built at 3am (it was).

Approach: 53 commits, 21 features, 26 fixes.

Install Wizard (21b971c, +1,271 lines): The install.sh script now supports a full web-initiated setup flow. Run curl -fsSL tpmjs.com/install.sh | bash, it opens a browser, you click through an editor configuration wizard, and your Claude Code / Codex / Cursor / Windsurf / Zed / etc. gets configured with the TPMJS MCP server. The script is idempotent (f5bdc9c, +233/-113) — re-running it detects what’s already configured and skips those steps. Eight editors were added in one commit (1d33319, +149/-38), Codex specifically required MCP streamable HTTP spec compliance (9eef847) and a TOML config format (b9e0c6a).

New Tool Packages: The big addition was a batch of new packages (499a550, +12,678 lines net): cloudflare, evals-blah, github, neon, openrouter, railway, redis. Plus postgres (22 tools), pandoc (3 tools), jq (2 tools) (9d53c6d). The executor Dockerfile needed pandoc, jq, and postgresql-client actually installed for these to work (4c65d38). Tool permissions and approval flow were added (bb773a5, +2,299/-98) along with dynamic discovery and metadata enrichment — so the registry knows not just what a tool does but what permissions it needs and whether it requires user approval before running.

ML Tracking (727b60f, +404 lines): Every tool execution logs to a ml_tracking table. The eventual goal is a prediction model that learns “given this query, these tools are likely useful” — right now it’s in the logging phase. Stats pages surface the data (6676464, +473/-39).

Omega Chat UI Redesign (ecc18ac, +311/-216): The omega chat interface got a “brutalist command center aesthetic.” I’m not going to over-describe this — it’s black, chunky, and looks like a terminal from 1987 crossed with a war room. fba85c9 cleaned it up into cleaner message display with a compact header and inline env banner.

Stream Error Debugging (bbf2ecc, +277/-20): Agent chat streams were failing with opaque NoOutputGeneratedError messages that weren’t surfacing the actual provider error. Fixed by switching to fullStream mode and capturing the real error from the stream events. Then errors get saved to the DB (430d66d) with the actual provider error message, not the minified wrapper.

Results: The install wizard was tested against all 9 supported editors — verified by running the install script in a fresh environment and confirming each editor’s config file gets written correctly. The new tool packages are in the registry and searchable. ML tracking data appears on the stats page. The stream error fix means failed agent runs now show the actual error rather than “no output generated.”

Pitfalls / What Broke: The @tpmjs/tool-test-utils dev dependency broke the Vercel build because it pulls in native binaries that Vercel’s build environment can’t handle (20048dc). Moved to a dev-only dependency excluded from production builds. The Codex TOML config format was initially wrong — Codex expects auth headers in a specific format that doesn’t match Claude Code’s MCP config format (b9e0c6a). The stream error fix uses fullStream which means you’re consuming the full stream before returning, which adds latency for successful calls — this is a trade-off worth monitoring.

Next:

  • ML prediction model needs to be built on top of the tracking data (currently just accumulating)
  • The tool permissions approval flow needs a UI so users can see what they’ve approved
  • More tool packages — the github and cloudflare packages in particular need more coverage

What’s Next

  • alpha2 evolutionary runs at scale: The Symbiogenesis config for 50k novels and 250-generation searches exist but haven’t been run to completion — that’s the experiment that will tell us whether evolutionary activation search actually produces better models
  • Helios cooperative matrix path validation: The coop path has a correctness gate but it needs more adversarial testing across matrix shapes and dtypes before it can be the default for all GEMMs
  • TPMJS ML prediction: The logging phase is done; building the actual prediction model is the next step. Even a simple BM25 or embedding-similarity model over tool descriptions + usage history would be a start
  • Omega OpenAI-compatible endpoint hardening: Auth, rate limiting, and proper error propagation so it can be used as a real API
  • Symploke code-level matching: Activity similarity is a proxy for developer similarity; actual code similarity (embedding-based) would be more precise
  • Right to be Forgotten everywhere: The deletion pipeline in Omega should be the template — every project that stores user data should have the same deleteMyProfile-style tool
  • Symbiogenesis + TPMJS integration: The evolutionary training runs in alpha2 are computationally expensive; wrapping them as TPMJS tools would let an agent kick off and monitor training experiments without manual CLI commands

Links & Resources

Projects

Tools & Services

  • Railway — Deployment and hosting for omega, alpha2, symploke, tpmjs
  • Turso — Edge SQLite DB backing alpha2 training run storage
  • Prisma — ORM across all postgres-backed projects
  • BullMQ — Job queue for symploke mates crawling and matching
  • Vulkan — GPU compute API underlying the Helios backend
  • Deno — Sandbox executor runtime for TPMJS
  • Claude Code — Auto-fix pipeline trigger across omega and tpmjs