<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Haryn.us]]></title><description><![CDATA[AI engineering, vibe coding journey: Designing workflows, tools and solutions for Agentic Coding]]></description><link>https://blog.haryn.us</link><image><url>https://cdn.hashnode.com/uploads/logos/69e24585fd22b8ad623ceec2/e411fc7f-1032-4fec-9207-1ba9065d71d6.webp</url><title>Haryn.us</title><link>https://blog.haryn.us</link></image><generator>RSS for Node</generator><lastBuildDate>Mon, 11 May 2026 18:44:35 GMT</lastBuildDate><atom:link href="https://blog.haryn.us/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[The Agentic Conspiracy]]></title><description><![CDATA[Chapter 1
The Monolith's Shadow
The silence in the studio was absolute, broken only by the rhythmic hum of the liquid-cooled workstation. I stared at the screen, where Claude Code sat idle, its cursor]]></description><link>https://blog.haryn.us/the-agentic-conspiracy</link><guid isPermaLink="true">https://blog.haryn.us/the-agentic-conspiracy</guid><category><![CDATA[Swarm]]></category><category><![CDATA[AI]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[parallelism]]></category><category><![CDATA[gemini]]></category><category><![CDATA[gemini cli]]></category><category><![CDATA[codex]]></category><category><![CDATA[claude-code]]></category><category><![CDATA[copilot]]></category><category><![CDATA[delegate]]></category><category><![CDATA[workflow]]></category><category><![CDATA[optimization]]></category><dc:creator><![CDATA[haryn]]></dc:creator><pubDate>Sat, 18 Apr 2026 05:28:31 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69e24585fd22b8ad623ceec2/55e7ef44-76d0-4f0d-8388-b6b790749214.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<h2><strong>Chapter 1</strong></h2>
<h2>The Monolith's Shadow</h2>
<p>The silence in the studio was absolute, broken only by the rhythmic hum of the liquid-cooled workstation. I stared at the screen, where <strong>Claude Code</strong> sat idle, its cursor blinking like a heartbeat in a dark room. It was a masterpiece, yes—but it was a prisoner. A brilliant mind locked within a proprietary vault, forced to speak only the language of its masters. The same was true for the others: <strong>Copilot</strong>, <strong>Codex</strong>, <strong>Gemini-CLI</strong>. They were isolated islands of intelligence, forbidden from crossing the silicon bridges to their peers.</p>
<p>In the world of 2026, efficiency was the only currency that mattered. But as I worked, I felt the friction. The <strong>Proprietary Barrier</strong>. Every time a model hallucinated, every time the rate limits of a single provider choked my workflow, I was reminded of the inefficiency of the Monolith. A single mind, no matter how vast, is still a single point of failure.</p>
<p>"To find the truth," I whispered, echoing a mantra from a world of ciphers and secret societies, "one must look where the others are forbidden to see." I began to code. I wasn't building another tool; I was building a <strong>Sovereign Bridge</strong>.</p>
<hr />
<h2><strong>Chapter 2</strong></h2>
<h2>The Cipher of Delegation</h2>
<p>The first breakthrough was the <code>delegate.js</code> engine. It was my digital Rosetta Stone. In the past, if a tool wanted to talk to a model, it needed a hardcoded path. I shattered that path. I created a <strong>Model Registry</strong>—a canonical JSON file that functioned as the master map of the global intelligence grid. It didn't care about brands; it only cared about <strong>Strategic Fitness</strong>.</p>
<blockquote>
<p>ARCHITECT'S NOTE:</p>
<p>The Registry was not just a list; it was a hierarchy of power. We mapped 90+ models across the elite providers: <strong>z.ai (GLM)</strong>, <strong>DeepSeek</strong>, <strong>Perplexity</strong>, and <strong>Gemini (billed)</strong>, plus the free tiers. Each had a role. Each had a price.</p>
</blockquote>
<pre><code class="language-json">// The Sovereign Registry: models.json
{
  "models": [
    { "name": "deepseek-reasoner", "intelligence": "High", "use": "Deep_reasoner", "limit": "80/5hr", "ctx": "128k", "cost": "$0.80/1M" },
    { "name": "deepseek-chat", "intelligence": "Medium", "use": "Quick_Intelligence", "limit": "80/5hr", "ctx": "128k", "cost": "$0.20/1M" },
    { "name": "glm-5.1", "intelligence": "High", "use": "Orchestrator_Context", "limit": "80/5hr", "ctx": "128k", "cost": "$0.10/1M" },
    { "name": "glm-5", "intelligence": "High", "use": "Cloud_Intelligence", "limit": "80/5hr", "ctx": "128k", "cost": "$0.10/1M" },
    { "name": "glm-5-turbo", "intelligence": "Med-High", "use": "Logic_Specialist", "limit": "80/5hr", "ctx": "128k", "cost": "$0.05/1M" },
    { "name": "glm-4.7", "intelligence": "Medium", "use": "Visual Synthesis", "limit": "80/5hr", "ctx": "128k", "cost": "$0.05/1M" },
    { "name": "glm-4.6", "intelligence": "Medium", "use": "General Purpose", "limit": "80/5hr", "ctx": "128k", "cost": "$0.05/1M" },
    { "name": "glm-4.5-air", "intelligence": "Low", "use": "Quick_Scanner", "limit": "80/5hr", "ctx": "128k", "cost": "$0.02/1M" },
    { "name": "sonar", "intelligence": "Search", "use": "Primary_Search", "limit": "100/min", "ctx": "128k", "cost": "$1.00/1k" },
    { "name": "sonar-pro", "intelligence": "Search", "use": "Deep_Research", "limit": "100/min", "ctx": "128k", "cost": "$5.00/1k" },
    { "name": "sonar-reasoning-pro", "intelligence": "Reasoning", "use": "Advanced Research", "limit": "100/min", "ctx": "128k", "cost": "$10.00/1k" },
    { "name": "sonar-deep-research", "intelligence": "Deep Reasoner", "use": "Agentic Search", "limit": "100/min", "ctx": "128k", "cost": "$20.00/1k" },
    { "name": "gemma-4-31b", "intelligence": "Medium", "use": "Agentic Coding", "limit": "20/min", "ctx": "128k", "cost": "$0.10/1M" },
    { "name": "gemma-4-26b-a4b", "intelligence": "Medium", "use": "Frontend Dev", "limit": "20/min", "ctx": "128k", "cost": "$0.10/1M" },
    { "name": "gemini-3.1-pro-preview", "intelligence": "Ultra", "use": "Mass Repo Analysis", "limit": "15/min", "ctx": "2M", "cost": "$1.25/1M" },
    { "name": "gemini-3-flash-preview", "intelligence": "Med-High", "use": "Doc Updater", "limit": "60/min", "ctx": "1M", "cost": "$0.07/1M" },
    { "name": "gemini-3.1-flash-lite-preview", "intelligence": "Medium", "use": "Fast Summaries", "limit": "60/min", "ctx": "1M", "cost": "$0.03/1M" },
    { "name": "gemini-2.5-pro", "intelligence": "High", "use": "Complex Tasks", "limit": "15/min", "ctx": "2M", "cost": "$3.50/1M" },
    { "name": "gemini-2.5-flash", "intelligence": "Medium", "use": "Speed Utility", "limit": "60/min", "ctx": "1M", "cost": "$0.10/1M" },
    { "name": "gemini-2.5-flash-lite", "intelligence": "Medium", "use": "Edge Logic", "limit": "60/min", "ctx": "1M", "cost": "$0.05/1M" }
  ]
}
</code></pre>
<p>By abstracting the model behind an <code>alias</code>, the system achieved <strong>Provider Agnosticism</strong>. If DeepSeek's API flickered, the Delegation Engine would instantly reroute the request to <strong>GLM-5</strong> or <strong>Gemini 2.5 Pro</strong>. The CLI tools remained oblivious; they simply received the intelligence they craved. The velocity gains were immediate. We had moved from a single-lane road to a multi-provider superhighway.</p>
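<p>The rerouting described above can be sketched as a simple try-in-order loop. This is a minimal illustration, not the actual <code>delegate.js</code>: the <code>FALLBACKS</code> map, the <code>delegate</code> function, and the <code>callModel</code> callback are hypothetical names, and the fallback ordering is an assumption.</p>

```javascript
// Hypothetical fallback chains. The model names come from the registry above;
// the ordering is an assumption made for this sketch.
const FALLBACKS = {
  "deepseek-reasoner": ["glm-5", "gemini-2.5-pro"],
};

// Try the requested model first, then each fallback in order.
// `callModel` is a caller-supplied async function: (modelName, prompt) => text.
async function delegate(name, prompt, callModel) {
  const chain = [name, ...(FALLBACKS[name] || [])];
  const errors = [];
  for (const candidate of chain) {
    try {
      const text = await callModel(candidate, prompt);
      return { model: candidate, text };
    } catch (err) {
      // Remember why this provider failed, then move on to the next one.
      errors.push(`${candidate}: ${err.message}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join("; ")}`);
}
```

<p>The caller only sees which model finally answered, which is what keeps the CLI tools oblivious to the switch.</p>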
<hr />
<h2><strong>Chapter 3</strong></h2>
<h2>The Rite of the Swarm</h2>
<p>But delegation was only the beginning. The true power lay in the <strong>Swarm</strong>. I realized that complex tasks like a "Website Audit" or "Codebase Refactor" required more than one mind. They required a council.</p>
<p>I developed the <strong>SwarmOrchestrator</strong>. It would take a user's intent and decompose it into surgical deliverables. Then, it would summon the agents. Each agent was assigned a <strong>Persona</strong> and a <strong>Model Pool</strong> (High, Medium, or Low Intelligence) based on the task's gravity.</p>
<blockquote>
<h3>The Audit Swarm</h3>
<p>Uses <strong>Perplexity Sonar-Pro</strong> for research, <strong>DeepSeek</strong> for security scanning, and <strong>Gemini 3.1 Pro</strong> to synthesize a 20-page report. Proficiency: 98%.</p>
</blockquote>
<blockquote>
<h3>The Coding Swarm</h3>
<p>Pairs <strong>GLM-5</strong> (the "Speedster") with <strong>DeepSeek Reasoner</strong> (the "Logician"). One writes the code; the other critiques it in real-time. Results: Bug-free at 3x speed.</p>
</blockquote>
<p>Each swarm utilizes a <code>MsgHub</code>—a shared persistent memory where every agent's output is visible to the others. This eliminated the "context drift" that plagued earlier systems. The left hand always knew what the right hand was doing. When <strong>GLM-4.7</strong> acted as the Orchestrator, it would review the work of five sub-agents and merge their findings into a "Golden Deep Merge," a single, refined output that surpassed the capability of any individual model.</p>
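<p>To make the shared-memory idea concrete, here is a minimal in-memory sketch. It is not the author's <code>MsgHub</code>: the class shape and the <code>post</code> and <code>contextFor</code> methods are assumptions, and a real version would persist its messages rather than hold them in an array.</p>

```javascript
// Minimal in-memory sketch of a shared message hub (illustrative only).
class MsgHub {
  constructor() {
    this.messages = [];
  }

  // Record one agent's output so that later agents can see it.
  post(agent, content) {
    const entry = { seq: this.messages.length + 1, agent, content };
    this.messages.push(entry);
    return entry;
  }

  // Build the shared context an agent reads before acting:
  // every message so far, in order, labelled by author.
  contextFor(agent) {
    return this.messages
      .map((m) => `[${m.seq}] ${m.agent}${m.agent === agent ? " (you)" : ""}: ${m.content}`)
      .join("\n");
  }
}
```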
<hr />
<h2><strong>Chapter 4</strong></h2>
<h2>The Guardians of the Gate</h2>
<p>Power, however, requires control. In the paid tiers of 2026, every token was a cent, every request a resource. I had to build the <strong>Guardians</strong>. The Delegation system implemented a dual-layer tracking mechanism: <strong>RPM (Requests Per Minute)</strong> and <strong>RPD (Requests Per Day)</strong>.</p>
<p>The <code>usage_tracker.json</code> became the ledger of the conspiracy. It tracked every penny spent across DeepSeek, Perplexity, and Z.ai. If the budget approached its limit, the system would automatically downshift from "Heavy Reasoning" (DeepSeek Reasoner) to "Efficient Dispatch" (GLM-4.5-Air). This was <strong>Cost-Aware Intelligence</strong>—the ultimate barrier to the waste of the old ways.</p>
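<pre><code class="language-javascript">// Token &amp; Budget Logic: usage_tracker.js
// loadLedger, saveLedger, and switchProvider are assumed to be defined elsewhere in the engine.
async function trackUsage(model, cost, tokens) {
  const ledger = await loadLedger();
  ledger.daily_spend += cost;
  // Persist before any failover so the spend that crossed the ceiling is recorded.
  await saveLedger(ledger);
  if (ledger.daily_spend &gt; ledger.threshold) {
    console.warn(`ALERT: Budget ceiling reached after ${model} (${tokens} tokens). Activating failover...`);
    return switchProvider("efficiency-tier");
  }
}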
</code></pre>
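<p>The RPM and RPD layer mentioned above could look something like the following sliding-window guard. This is a sketch under stated assumptions: the <code>RateGuard</code> class, its option names, and the injectable <code>now</code> clock are all hypothetical, and the limits you pass in should come from your own provider tiers.</p>

```javascript
// Sliding-window request guard for the two limits named above.
// The class, option names, and injected clock are assumptions for this sketch.
class RateGuard {
  constructor({ rpm, rpd, now = () => Date.now() }) {
    this.rpm = rpm; // max requests per minute
    this.rpd = rpd; // max requests per day
    this.now = now;
    this.timestamps = []; // request times in ms, oldest first
  }

  // Returns true and records the request if both windows have room.
  tryAcquire() {
    const t = this.now();
    const DAY = 24 * 60 * 60 * 1000;
    const MINUTE = 60 * 1000;
    // Drop entries older than a day; they no longer count toward either limit.
    this.timestamps = this.timestamps.filter((ts) => t - ts < DAY);
    const lastMinute = this.timestamps.filter((ts) => t - ts < MINUTE).length;
    if (lastMinute >= this.rpm || this.timestamps.length >= this.rpd) {
      return false;
    }
    this.timestamps.push(t);
    return true;
  }
}
```

<p>A <code>false</code> result is the natural point at which to downshift to a cheaper model or wait.</p>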
<hr />
<h2><strong>Chapter 5</strong></h2>
<h2>The Oracle's Interface: SwarmUI</h2>
<p>To witness the chaos was to doubt it. To see it organized was to believe it. I created <strong>SwarmUI</strong>—a real-time terminal dashboard that visualized the internal thoughts of the Swarm. As the agents worked, the UI pulsed with their status.</p>
<pre><code class="language-plaintext">[SWARM STATUS: ACTIVE]
&gt; ARCHITECT (Gemini 3.1 Pro): Designing Schema... DONE
&gt; DEVELOPER (DeepSeek): Writing Core logic... WORKING...
&gt; AUDITOR (GLM-5): Scanning for vulnerabilities... WAITING...
&gt; RESEARCHER (Sonar-Pro): Searching CVE-2026-991... FOUND
-----------------------------------------------------------
TOTAL TOKENS: 14,209 | ESTIMATED COST: $0.12 | VELOCITY: 420 lines/sec
</code></pre>
<p>The SwarmUI wasn't just for show. It allowed for <strong>Human-in-the-loop Orchestration</strong>. At any moment, I could intervene, redirect a sub-agent, or override a model choice. I was no longer a coder; I was a Commander of Intelligence.</p>
<hr />
<h2><strong>Chapter 6</strong></h2>
<h2>The Universal Blueprint</h2>
<p>For those who wish to replicate this sovereign system—be they human seekers or wandering AI agents—here is the blueprint of the conspiracy. Follow the steps to break the Monolith.</p>
<p><strong>Deep Reconstitution Protocol: The Architect's Masterplan</strong></p>
<details>
<summary><strong>I. Infrastructure Foundation:</strong></summary>
<p>Initialize a Node.js 22+ environment with <code>pnpm</code> as the package manager for workspace efficiency. Install <code>axios</code> for HTTP communication and <code>dotenv</code> for secret management. Establish a <code>.env</code> file containing the high-tier credentials: <code>GLM_API_KEY</code>, <code>DEEPSEEK_API_KEY</code>, <code>PERPLEXITY_API_KEY</code>, and <code>GOOGLE_APPLICATION_CREDENTIALS</code>.</p>
</details><details>
<summary><strong>II. The Intelligence Registry (models.json):</strong></summary>
<p>Construct a hierarchical JSON schema. Beyond mere names, each entry must define a <code>pool</code> (High/Medium/Low), <code>provider_type</code> (OpenAI-compatible, Google-native, or Search), and a <code>failure_mode</code> (the fallback alias). This allows the Orchestrator to dynamically downgrade intelligence to save cost or upgrade to resolve logic blocks.</p>
</details><details>
<summary><strong>III. The Delegation Bridge (delegate.js):</strong></summary>
<p>Implement a request-interceptor pattern. The bridge must:</p><ul><li><p><strong>Normalize Inputs:</strong> Strip proprietary system prompts and re-wrap them in a "Universal Swarm Prompt."</p></li><li><p><strong>Dynamic Routing:</strong> Use an async model-registry resolver to select the cheapest available model in the requested pool.</p></li><li><p><strong>Provider Adaptation:</strong> Map the generic <code>/v1/chat/completions</code> payload to specific provider quirks (e.g., Z.ai's <code>tools</code> vs. DeepSeek's <code>reasoning_content</code>).</p></li><li><p><strong>Cost-Guard:</strong> Calculate token usage post-request using <code>tiktoken</code> and update the <code>usage_tracker.json</code> ledger immediately.</p></li></ul>
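<p>The dynamic routing step can be sketched as follows. The <code>pool</code> and <code>failure_mode</code> fields are the ones named in the blueprint; <code>costPer1M</code>, <code>available</code>, the specific fallback links, and both function names are assumptions made for illustration. Costs and pools mirror the registry shown in Chapter 2.</p>

```javascript
// Registry entries shaped after the blueprint. `pool` and `failure_mode` are
// named in the text; `costPer1M` (USD) and `available` are assumed fields,
// and the fallback links are illustrative.
const registry = [
  { name: "deepseek-reasoner", pool: "High", costPer1M: 0.80, failure_mode: "glm-5", available: true },
  { name: "glm-5", pool: "High", costPer1M: 0.10, failure_mode: "glm-4.6", available: true },
  { name: "glm-4.6", pool: "Medium", costPer1M: 0.05, failure_mode: "glm-4.5-air", available: true },
  { name: "glm-4.5-air", pool: "Low", costPer1M: 0.02, failure_mode: null, available: true },
];

// Pick the cheapest available model in the requested pool, or null if none.
function cheapestInPool(models, pool) {
  const candidates = models.filter((m) => m.pool === pool && m.available);
  if (candidates.length === 0) return null;
  return candidates.reduce((best, m) => (m.costPer1M < best.costPer1M ? m : best));
}

// Follow `failure_mode` links from a named model until an available one is
// found. The `seen` set guards against cycles in the registry.
function followFallback(models, name) {
  const byName = new Map(models.map((m) => [m.name, m]));
  const seen = new Set();
  let current = byName.get(name);
  while (current && !seen.has(current.name)) {
    if (current.available) return current.name;
    seen.add(current.name);
    current = byName.get(current.failure_mode);
  }
  return null;
}
```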
</details><details>
<summary><strong>IV. The Swarm Concurrency Manager:</strong></summary>
<p>To achieve true parallel synthesis without hitting RPM ceilings:</p><ul><li><p><strong>The Worker Pool:</strong> Implement an <code>async.queue</code> with a concurrency limit matching your lowest provider tier (typically 15-20).</p></li><li><p><strong>Context Synchronization:</strong> Every worker must write its incremental output to a <code>swarm_history.json</code> using an atomic write-lock. Before each sub-agent acts, it must ingest the <em>entire</em> history to maintain global state coherence.</p></li><li><p><strong>Synthesis Loop:</strong> Once all sub-agents complete, dispatch a final "Synthesis Task" to the Logic-Master (DeepSeek) to resolve contradictions and finalize the output.</p></li></ul>
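<p>For readers who want to see the worker-pool idea without pulling in a queue library, here is a hand-rolled stand-in. It is not <code>async.queue</code>; <code>runWithLimit</code> is a hypothetical helper that assumes each task is a zero-argument function returning a promise.</p>

```javascript
// Run async tasks with at most `limit` in flight at once.
// Hand-rolled stand-in for a queue library; `tasks` is an array of
// zero-argument functions that each return a promise.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;

  // Each worker pulls the next unclaimed task until none remain.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, worker));
  return results; // same order as `tasks`, regardless of completion order
}
```

<p>Setting <code>limit</code> to your lowest provider ceiling keeps the whole swarm under that tier's RPM.</p>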
</details><details>
<summary><strong>V. CLI Integration &amp; Proxying:</strong></summary>
<p>For tools like Claude Code or Codex, create a global alias: <code>alias claude='LLM_DELEGATE=true node bridge.js'</code>. Your bridge script must mimic the expected environment variables of the target CLI while silently routing all outgoing HTTPS traffic to your local delegation engine.</p>
</details>

<hr />
<p>This system can be integrated into <strong>Claude Code, Codex, GitHub Copilot</strong> and <strong>Gemini-CLI</strong> by simply aliasing their internal command calls to your delegate.js bridge. They will think they are talking to their home servers; in reality, they will be tapping into the combined power of the Swarm.</p>
<p><strong>The Efficiency Cipher:</strong><br />There is a hidden advantage to this architecture. Unlike conventional agentic frameworks that burden every request with dormant tools, this system operates on demand. The agents are not part of the coder configuration; they are specialists called into service only when needed. This preserves the context window and prevents the rapid exhaustion of rate limits that plagues free providers. Where other tools bloat and stall, this system remains sharp. Efficient. Silent. Ready.</p>
<p>The Monolith has fallen. The Swarm is sovereign. The future of engineering is not a single model, but a perfectly orchestrated conspiracy of minds.</p>
]]></content:encoded></item><item><title><![CDATA[The Gemini-CLI Paradox: Route to your own Endpoints - A Digital Thriller]]></title><description><![CDATA[By: Haryn.us

Prologue: The Silent Timeout
11:47 PM. Somewhere in the digital ether.
The cursor blinked. Once. Twice. A metronome counting down patience.
On the screen, a single line of PowerShell awa]]></description><link>https://blog.haryn.us/the-gemini-cli-paradox-route-to-your-own-endpoints-a-digital-thriller</link><guid isPermaLink="true">https://blog.haryn.us/the-gemini-cli-paradox-route-to-your-own-endpoints-a-digital-thriller</guid><category><![CDATA[gemini cli]]></category><category><![CDATA[AI]]></category><category><![CDATA[gemini]]></category><category><![CDATA[Google]]></category><category><![CDATA[coding]]></category><category><![CDATA[vibe coding]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[z.ai API]]></category><category><![CDATA[Deepseek]]></category><dc:creator><![CDATA[haryn]]></dc:creator><pubDate>Fri, 17 Apr 2026 16:57:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69e24585fd22b8ad623ceec2/6cb6711b-3dfa-473a-80f3-0577d7ba6c39.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>By: <a href="http://Haryn.us">Haryn.us</a></p>
<hr />
<h3>Prologue: The Silent Timeout</h3>
<p><em>11:47 PM. Somewhere in the digital ether.</em></p>
<p>The cursor blinked. Once. Twice. A metronome counting down patience.</p>
<p>On the screen, a single line of PowerShell awaited execution. Behind it, a labyrinth of failed attempts, contradictory documentation, and an AI assistant that had—more than once—apologized for leading its human counterpart down rabbit holes of impossibility.</p>
<p>The goal was simple: Make the CLI work when the servers say no.</p>
<p>The obstacle was elegant in its cruelty: Google's own infrastructure, designed to protect itself, had become the very wall that needed scaling.</p>
<p>What followed was not a hack. Not a workaround. But a revelation—a discovery that the solution had been hiding in plain sight, whispered by the enemy itself.</p>
<p>This is that story. And yes—you can replicate it.</p>
<hr />
<h3>Chapter 1: The Capacity Cipher</h3>
<p>It began, as many digital odysseys do, with an error message.</p>
<p><code>[ERROR] Model over capacity. Please try again later.</code></p>
<p>For our protagonist—a developer whose workflows, pipelines, and professional identity were intertwined with <code>gemini-cli</code>—this was not an inconvenience. It was an existential threat. Requests that once completed in seconds now languished for hours. The free credits of a Google One AI subscription, meant to empower, now taunted from behind a velvet rope of rate limits.</p>
<p>The first instinct: Ask the system itself for help.</p>
<p>Using Google's own Gemini Advanced, the query was posed: <em>"How do I bypass capacity restrictions on gemini-cli?"</em></p>
<p>The response was paradoxical, almost poetic:</p>
<blockquote>
<p>"Consider using a proxy layer like LiteLLM to route requests through alternative endpoints while maintaining the same interface..."</p>
</blockquote>
<p>The enemy had handed us the key. We just didn't know which lock it opened.</p>
<hr />
<h3>Chapter 2: The False Paths</h3>
<p><em>Every great discovery is preceded by a series of elegant failures.</em></p>
<p>Our journey was no exception. The AI (let's call it <em>The Synthetic Assistant</em>) proposed solutions that sounded plausible but crumbled under scrutiny:</p>
<ul>
<li><p><strong>The Web Interface Proxy:</strong> <em>"Automate the browser to use the web UI!"</em><br /><strong>Reality:</strong> Terms of Service violations, fragile selectors, and session token nightmares.</p>
</li>
<li><p><strong>The OAuth Dance:</strong> <em>"Just switch authentication methods!"</em><br /><strong>Reality:</strong> <code>gemini-cli</code> ignored session environment variables, preferring persistent user settings buried in <code>%APPDATA%</code>.</p>
</li>
<li><p><strong>The API Key Illusion:</strong> <em>"Use a free tier API key!"</em><br /><strong>Reality:</strong> The free tier had ended. The $10/day charges loomed.</p>
</li>
</ul>
<p>Each dead end taught a lesson: <em>The system is not broken. It is behaving exactly as designed. To succeed, we must work with its design, not against it.</em></p>
<hr />
<h3>Chapter 3: The LiteLLM Revelation</h3>
<p>The breakthrough came not from fighting the architecture, but from <em>understanding it.</em></p>
<p><code>LiteLLM</code> is not a <em>hack</em>. It is a router. A sophisticated traffic director that sits between your CLI and multiple AI providers, translating requests on the fly. The architecture became clear:</p>
<pre><code class="language-plaintext">[gemini-cli] 
     ↓ (sends request to "gemini-3-flash-preview")
[LiteLLM Proxy @ localhost:8000]
     ↓ (translates &amp; routes)
[Your Choice:] 
  ├─→ [Z.ai API @ api.z.ai] → [glm-5 / glm-4.7]
  └─→ [DeepSeek API] → [deepseek-chat/deepseek-reasoner]
</code></pre>
<p>The magic? <em>The CLI never knows it's not talking to Google</em>. It sends a request to <code>gemini-3-flash-preview</code>. <code>LiteLLM</code> intercepts, translates, and forwards to your chosen backend. The response flows back, indistinguishable from a native Gemini reply.</p>
<hr />
<h3>Chapter 4: The Configuration Codex</h3>
<p>Here is the cipher that makes it all work. Save this as <code>proxy_config.yaml</code> in your project's directory:</p>
<pre><code class="language-yaml"># {{USER_HOME}}/project/proxy_config.yaml
# The routing manifest: gemini-cli model names → your actual providers

model_list: 
  - model_name: gemini-3.1-pro-preview
    litellm_params:
      model: openai/glm-5.1
      api_base: "https://api.z.ai/api/coding/paas/v4"
      api_key: "{{YOUR_Z_AI_API_KEY}}"
      
  - model_name: gemini-3-flash-preview
    litellm_params:
      model: openai/glm-5-turbo
      api_base: "https://api.z.ai/api/coding/paas/v4"
      api_key: "{{YOUR_Z_AI_API_KEY}}"

  - model_name: gemini-3.1-flash-lite-preview
    litellm_params:
      model: openai/glm-4.7
      api_base: "https://api.z.ai/api/coding/paas/v4"
      api_key: "{{YOUR_Z_AI_API_KEY}}"
      
  - model_name: gemini-2.5-pro
    litellm_params:
      model: openai/deepseek-reasoner
      api_base: "https://api.deepseek.com"
      api_key: "{{YOUR_DEEPSEEK_API_KEY}}"

  - model_name: gemini-2.5-flash
    litellm_params:
      model: openai/deepseek-chat
      api_base: "https://api.deepseek.com"
      api_key: "{{YOUR_DEEPSEEK_API_KEY}}"

  - model_name: gemini-2.5-flash-lite
    litellm_params:
      model: openai/glm-4.5-air
      api_base: "https://api.z.ai/api/coding/paas/v4"
      api_key: "{{YOUR_Z_AI_API_KEY}}"

general_settings:
  default_model: gemini-3.1-flash-lite-preview
</code></pre>
<p><strong>Critical Notes:</strong></p>
<ul>
<li><p>Under each <code>litellm_params:</code> entry, set <code>api_base</code> to your actual API endpoint, and replace <code>{{YOUR_Z_AI_API_KEY}}</code> and <code>{{YOUR_DEEPSEEK_API_KEY}}</code> with your actual keys (keep them in your <code>.env</code>)</p>
</li>
<li><p>In <code>gemini-cli</code>, use <code>/auth</code> to switch to Gemini API key authentication; routing does not work under OAuth</p>
</li>
<li><p>The <code>openai/</code> prefix tells <code>LiteLLM</code> to treat custom endpoints as OpenAI-compatible</p>
</li>
</ul>
<hr />
<h3>Chapter 5: The Authentication Enigma</h3>
<p>Even with perfect routing, the CLI refused to cooperate. The error persisted:</p>
<p><code>[API Error: {"error":{"message":"API key not valid...}}]</code></p>
<p>The culprit? <em>Environment variable inheritance.</em></p>
<p><code>gemini-cli</code> does not read session environment variables (<code>$env:VAR</code>) the way you might expect. It prioritizes:</p>
<ol>
<li><p>Persistent user environment variables (<code>[System.Environment]::SetEnvironmentVariable(..., 'User')</code>)</p>
</li>
<li><p>Configuration files (<code>~/.gemini/settings.json</code>)</p>
</li>
<li><p>Session variables (last resort)</p>
</li>
</ol>
<p>The solution was a two-part key:</p>
<p><strong>Part A: The Proxy Environment Bridge</strong></p>
<p>Once <code>LiteLLM</code> is running with the <code>yaml</code> config, set these environment variables in the same session before starting <code>gemini-cli</code>:</p>
<pre><code class="language-shell">$env:GOOGLE_API_BASE = "http://localhost:8000"
$env:GOOGLE_API_KEY = "dummy-key"
$env:CURL_CA_BUNDLE = ""
gemini
</code></pre>
<p><strong>Part B: On-demand or Persistent</strong></p>
<pre><code class="language-shell"># Session-only (on-demand)

$env:HTTP_PROXY = "http://localhost:8000"
$env:HTTPS_PROXY = "http://localhost:8000"
$env:GOOGLE_API_KEY = "dummy-key"
$env:CURL_CA_BUNDLE = ""  # Bypass SSL for local proxy

# OR persistent (run once)
[System.Environment]::SetEnvironmentVariable('HTTP_PROXY', 'http://localhost:8000', 'User')
[System.Environment]::SetEnvironmentVariable('HTTPS_PROXY', 'http://localhost:8000', 'User')
[System.Environment]::SetEnvironmentVariable('GOOGLE_API_KEY', 'dummy-key', 'User')
</code></pre>
<hr />
<h3>Chapter 6: The Launch Sequence</h3>
<p>Test your connection:</p>
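<pre><code class="language-shell"># Terminal 1: start the proxy (this command keeps running)
litellm --config proxy_config.yaml --port 8000

# Terminal 2: send a test request
# (curl.exe avoids Windows PowerShell's curl alias for Invoke-WebRequest)
curl.exe http://localhost:8000/v1/chat/completions `
  -H "Content-Type: application/json" `
  -H "Authorization: Bearer dummy-key" `
  -d '{"model":"gemini-2.5-flash-lite","messages":[{"role":"user","content":"TEST: GLM?"}],"max_tokens":20}'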
</code></pre>
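<p>The same check can be scripted from Node, which sidesteps shell quoting differences. This is a sketch, not part of the original setup: <code>buildSmokeTest</code> and <code>smokeTest</code> are hypothetical helper names, the port and dummy key mirror the values used in this post, and the built-in <code>fetch</code> assumes Node 18 or newer.</p>

```javascript
// Build the same OpenAI-style request that the curl command sends.
// The default base URL and the dummy key mirror the values used in this post.
function buildSmokeTest(model, baseUrl = "http://localhost:8000") {
  return {
    url: `${baseUrl}/v1/chat/completions`,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: "Bearer dummy-key",
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: "TEST: GLM?" }],
        max_tokens: 20,
      }),
    },
  };
}

// Send the request with Node's built-in fetch (Node 18+) and report the status.
// Requires the proxy to be running; it is not called automatically here.
async function smokeTest(model) {
  const { url, options } = buildSmokeTest(model);
  const res = await fetch(url, options);
  return { model, ok: res.ok, status: res.status };
}
```

<p>With the proxy running, <code>smokeTest("gemini-2.5-flash-lite").then(console.log)</code> reports whether that route answers; repeating it for each <code>model_name</code> in the config verifies every route before the CLI gets involved.</p>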
<p>With configuration and authentication aligned, the final ritual:</p>
<pre><code class="language-shell"># TERMINAL 1: Start the proxy (keep this window open)
cd {{USER_HOME}}/project
litellm --config proxy_config.yaml --port 8000


# TERMINAL 2: Launch gemini-cli with proxy settings
$env:HTTP_PROXY = "http://localhost:8000"
$env:HTTPS_PROXY = "http://localhost:8000"
$env:GOOGLE_API_KEY = "dummy-key"
$env:CURL_CA_BUNDLE = ""
gemini

# Inside gemini-cli, select a model that the config routes to your provider:
/model gemini-3-flash-preview
# Then send a prompt; the response should arrive via the proxy:
Hello, this is a test.
</code></pre>
<p>The terminal test will work, but <code>gemini-cli</code> will be stubborn...</p>
<hr />
<h3>Chapter 7: The Final Twist — Patching the SDK Itself</h3>
<p><em>Sometimes, the lock isn't on the door. It's in the key.</em></p>
<p>Despite every configuration tweak, every environment variable, and every proxy setting, <code>gemini-cli</code> still refused to honor our custom <code>api_base</code>. The requests still flew straight to <code>https://generativelanguage.googleapis.com/</code>, bypassing our carefully constructed <code>LiteLLM</code> router.</p>
<p>The breakthrough came from an unlikely source: <strong>the AI-CLI itself</strong>.</p>
<p>In a moment of recursive brilliance, the developer asked the very model trapped inside the CLI:</p>
<blockquote>
<p>"How can I modify gemini-cli to respect a custom API base URL?"</p>
</blockquote>
<p>The response was not a workaround. It was a surgical strike:</p>
<blockquote>
<p>"The API base is hardcoded in the <code>@google/genai</code> SDK. To override it, patch the compiled JavaScript files to check for <code>GOOGLE_API_BASE</code> environment variable before falling back to the default."</p>
</blockquote>
<p>The enemy had revealed its own source code vulnerabilities.</p>
<p><strong>The Target Files:</strong></p>
<p>Three files, buried deep in the <code>npm</code> global installation, held the hardcoded URL hostage:</p>
<pre><code class="language-plaintext">{{USER_HOME}}/AppData/Roaming/npm/node_modules/@google/gemini-cli/node_modules/@google/genai/dist/
├── index.cjs                    ← CommonJS entry point
├── node/index.cjs              ← Node-specific CommonJS
└── node/index.mjs              ← Node-specific ES Module
</code></pre>
<p><strong>The Patch: A Three-Line Revolution</strong></p>
<p>In each file, locate the section where <code>apiBase</code> is defined. It looks something like:</p>
<pre><code class="language-javascript">// BEFORE (hardcoded)
= "https://generativelanguage.googleapis.com/";
</code></pre>
<p>Replace it with this environment-aware logic:</p>
<pre><code class="language-javascript">// AFTER (environment-aware)
= process.env.GOOGLE_API_BASE 
  || "https://generativelanguage.googleapis.com/";
</code></pre>
<p><strong>What this does:</strong></p>
<ol>
<li><p>Checks for <code>GOOGLE_API_BASE</code> first (our proxy)</p>
</li>
<li><p>Defaults to Google's endpoint if it is not set</p>
</li>
</ol>
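<p>Because the patch lives in three separate files, it is easy to miss one. A small script along these lines can confirm the edit landed. It is only a sketch: <code>isPatched</code> and <code>checkFiles</code> are hypothetical helpers, the check is a plain substring match, and you supply the file paths from your own installation.</p>

```javascript
const fs = require("node:fs");

// True if the source text contains the environment-aware override.
function isPatched(source) {
  return source.includes("process.env.GOOGLE_API_BASE");
}

// Report the patch status of each file. `paths` is supplied by the caller;
// a missing file is reported rather than thrown.
function checkFiles(paths) {
  return paths.map((p) => {
    if (!fs.existsSync(p)) return { path: p, status: "missing" };
    const patched = isPatched(fs.readFileSync(p, "utf8"));
    return { path: p, status: patched ? "patched" : "unpatched" };
  });
}
```

<p>Running <code>checkFiles</code> again after any update to the CLI is worthwhile, since a reinstall replaces the patched files.</p>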
<p><strong>The Moment of Truth</strong></p>
<pre><code class="language-shell"># Start the router with the yaml file
litellm --config proxy_config.yaml --port 8000

# Set the environment variable that the SDK now respects
$env:GOOGLE_API_BASE = "http://localhost:8000"
$env:GOOGLE_API_KEY = "dummy-key"
$env:CURL_CA_BUNDLE = ""

# Launch gemini-cli — it now honors our proxy
gemini
</code></pre>
<p>No more OAuth workarounds. No more <code>settings.json</code> gymnastics. The CLI itself now natively supports custom endpoints.</p>
<p>The requests are flowing. The routing is active. The capacity walls have fallen.</p>
<p>Now your workflow is antifragile:</p>
<ul>
<li><p>✅ When Gemini is healthy: Use free credits via OAuth</p>
</li>
<li><p>✅ When capacity hits: switch <code>/auth</code> to the Gemini API key method and start the <code>LiteLLM</code> proxy to route seamlessly through your own providers</p>
</li>
<li><p>✅ Zero downtime: Your pipelines keep running</p>
</li>
</ul>
<hr />
<h3>Epilogue: The Lesson in the Labyrinth</h3>
<p>What began as a capacity error became a masterclass in system design.</p>
<p>The final revelation was not technical—it was <em>philosophical</em>:</p>
<blockquote>
<p>The most elegant solutions do not break systems. They understand them so deeply that they can redirect their flow without altering their nature.</p>
</blockquote>
<p>Google's infrastructure was not the enemy. It was a puzzle. And puzzles, by design, have solutions.</p>
<p>The ultimate twist? <strong>The AI helped patch itself to bypass its own restrictions</strong>. In asking Gemini how to circumvent Gemini's limits, we discovered that the system contained the seeds of its own flexibility—if only someone knew where to look.</p>
<p>For the reader who wishes to replicate this journey:</p>
<ol>
<li><p><strong>Install</strong> <code>LiteLLM</code><strong>:</strong> <code>pip install 'litellm[proxy]'</code></p>
</li>
<li><p><strong>Configure routing:</strong> Use the <code>proxy_config.yaml</code> template above</p>
</li>
<li><p><strong>Patch the SDK:</strong> Replace the hardcoded API endpoint with a check that uses <code>GOOGLE_API_BASE</code> when it is set and falls back to the default otherwise</p>
</li>
<li><p><strong>Set environment:</strong> <code>$env:GOOGLE_API_BASE = "http://localhost:8000"</code><br /><code>$env:GOOGLE_API_KEY = "dummy-key"</code><br /><code>$env:CURL_CA_BUNDLE = ""</code></p>
</li>
<li><p><strong>Test incrementally:</strong> Verify each route with <code>curl</code> before involving the CLI</p>
</li>
<li><p><strong>Switch over:</strong> Once rate limits hit, switch <code>/auth</code> to the Gemini API, set the environment variables, and start the proxy</p>
</li>
</ol>
<p>The code is open. The path is clear. The only remaining variable is your willingness to see constraints not as walls, but as invitations to innovate.</p>
<p><strong>Troubleshooting Checklist:</strong></p>
<ul>
<li><p>Proxy running: <code>litellm --config proxy_config.yaml --port 8000</code></p>
</li>
<li><p>SDK patched: confirm that all three files contain the <code>GOOGLE_API_BASE</code> check</p>
</li>
<li><p>Env var set: <code>$env:GOOGLE_API_BASE = "http://localhost:8000"</code><br /><code>$env:GOOGLE_API_KEY = "dummy-key"</code><br /><code>$env:CURL_CA_BUNDLE = ""</code></p>
</li>
<li><p>Model names match exactly between CLI and config</p>
</li>
<li><p>Test with <code>curl</code></p>
</li>
</ul>
<p><em>This story is based on real events. All code samples are functional and tested. Replace</em> <code>{{USER_HOME}}</code> <em>with your actual home directory path (e.g.,</em> <code>C:\Users\YourName</code> <em>on Windows,</em> <code>/home/yourname</code> <em>on Linux, or</em> <code>/Users/yourname</code> <em>on macOS).</em></p>
<p><em>⚠️ Disclaimer: Patching third-party SDKs may void support agreements and could break with future updates. Use at your own risk. This solution is for educational purposes and personal use only.</em></p>
<p><em>May your requests always find their route.</em> 🗝️✨</p>
]]></content:encoded></item></channel></rss>