Founder Story: The Tenant's Voice

0-to-1: Building & Scaling a RAG System for Social Good

The Context & Problem

Like many renters, I’d had negative experiences with landlords, but lacked the confidence to challenge unfair situations. That changed when a landlord tried to deduct £2,625 from my girlfriend's tenancy deposit. This time, we decided to contest it.

I turned to Google's NotebookLM, uploading official UK tenancy law documents to guide our response. Armed with a clear understanding of "fair wear and tear" and "betterment," we drafted a formal rebuttal. The result was a success: the Tenancy Deposit Scheme (TDS) returned £2,350 of the £2,625 claimed. This victory proved that with the right information, tenants can effectively defend their rights.

The Flaw in General-Purpose AI

I wanted to share what I'd learned, but quickly hit two roadblocks. First, people were hesitant to trust a general-purpose tool like ChatGPT for specific legal matters. Second, I found that even NotebookLM could pull information from incorrect sources. For something this important, **accuracy and trust are non-negotiable**.

This is why I built The Tenant's Voice. It was created to be a dedicated, reliable platform for tenants.

£2,350

Successfully Recovered

The personal victory that sparked the idea and validated the user need.

The Approach & Solution

The goal was never just to provide answers, but to differentiate from ChatGPT by making it easy for users to *take action*. The product had to help tenants "think less and do more."

Guiding Principles

Trust & Accuracy

Only use verified, official sources.

Encourage Action

Make the next steps clear and simple.

Mobile First

Design for users in real-world situations.

The Hands-On Build: A Production-Grade RAG

I made a conscious product trade-off: give up some response speed in exchange for accuracy. I built a production-grade RAG (Retrieval-Augmented Generation) system from the ground up, using a modern, scalable tech stack. The core of the system is a vector database built using only legitimate sources (gov.uk, Shelter, Citizens Advice, TDS).

To keep answers accurate, content was chunked using a recursive character text splitter, and the "last modified" date of each document was ingested alongside it. This step helps keep the model from sharing outdated advice (for example, citing a law from 1985 that was superseded in 2015), so that guidance is grounded in the most current, reliable sources available.
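To make the ingestion step concrete, here is a minimal TypeScript sketch of recursive character splitting that copies each document's "last modified" date onto every chunk. It is an illustration rather than the production ingestion code: the Chunk shape, the separator order, and the names splitRecursive and toChunks are all assumptions.

```typescript
// Hypothetical chunk shape; field names are assumptions for illustration.
interface Chunk {
  text: string;
  sourceUrl: string;
  lastModified: string; // ISO date copied from the source document
}

// Split on the coarsest separator first; oversized pieces are re-split with
// finer separators, and hard-cut only when no separators remain.
// Assumes maxChars > 0. Separators at chunk boundaries are dropped.
function splitRecursive(
  text: string,
  maxChars: number,
  separators: string[] = ["\n\n", "\n", ". ", " "],
): string[] {
  const trimmed = text.trim();
  if (trimmed.length <= maxChars) return trimmed ? [trimmed] : [];

  const [sep, ...finer] = separators;
  if (sep === undefined) {
    const windows: string[] = [];
    for (let i = 0; i < trimmed.length; i += maxChars) {
      windows.push(trimmed.slice(i, i + maxChars));
    }
    return windows;
  }

  const chunks: string[] = [];
  let current = "";
  for (const piece of trimmed.split(sep)) {
    const candidate = current ? current + sep + piece : piece;
    if (candidate.length <= maxChars) {
      current = candidate; // keep merging small pieces up to the limit
    } else {
      if (current) chunks.push(...splitRecursive(current, maxChars, finer));
      current = piece;
    }
  }
  if (current) chunks.push(...splitRecursive(current, maxChars, finer));
  return chunks;
}

// Attach the document-level metadata to every chunk so it can be stored
// next to the embedding and used to prefer newer guidance.
function toChunks(
  doc: { text: string; sourceUrl: string; lastModified: string },
  maxChars: number,
): Chunk[] {
  return splitRecursive(doc.text, maxChars).map((text) => ({
    text,
    sourceUrl: doc.sourceUrl,
    lastModified: doc.lastModified,
  }));
}
```

Carrying the date on every chunk, rather than only on the parent document, means the retrieval layer can filter or rank by recency without a second lookup.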

Runtime RAG Pipeline

How user queries are processed in real-time:

  1. User Query + Chat History
  2. Query Transform (AI Keywords)
  3. Embedding (text-embedding-004)
  4. Vector Search (pgvector, Top 5)
  5. LLM Generation (gemini-2.5-flash)
  6. Structured JSON ({text, actions})

Data sources (gov.uk, Shelter, Citizens Advice) pre-processed offline with AI-generated keywords and embeddings.

Technical Deep Dive & Optimizations

As the product manager, I tracked, diagnosed, and fixed the issues that took the product from unstable to dependable, applying design thinking in every build. For each issue, this log breaks down how I identified the problem, analyzed the user impact, and implemented the solution.

Part 1: RAG Optimization & Database Indexing (Performance)

The Problem: A Very Slow Query

The application felt slow. Profiling revealed that a single database call—the match_documents function—was responsible for 68.2% of all query time. This was the critical bottleneck:

  • Mean Time: 911 ms (nearly a full second on average)
  • Max Time: 2,620 ms (a very noticeable 2.6 seconds)

The full end-to-end experience took 6–8 seconds, so this call was not the only source of delay, but it was the most expensive database operation and the critical first thing to fix.

What Was Wrong with the Initial Setup

Two primary issues in the initial database design caused the bottleneck:

  • Index Mismatch: The vector index was built with vector_l2_ops, which optimizes for L2 (Euclidean) distance. However, for semantic search with modern embedding models like Google's text-embedding-004, Cosine Similarity is the correct metric—it measures the angle between vectors, not the straight-line distance. This mismatch prevented the index from being used effectively, forcing a slow brute-force scan.
  • No Pre-filtering: The match_documents function performed a vector search across all 6,000 documents every single time. A keywords column existed but wasn't being used to narrow the search space before the expensive vector comparison.

The Fix: A Two-Stage Filtering Strategy

We implemented a series of coordinated changes to drastically reduce the search space and let indexes do their job:

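  1. Corrected the Vector Index: Dropped the old index and created a new one using vector_cosine_ops. We set the lists parameter to 80 (roughly √6000 ≈ 77, rounded up), following the square-root rule of thumb for IVFFlat. pgvector's own suggested starting point for tables under 1M rows is rows / 1000, so this value is worth benchmarking rather than treating as fixed.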
  2. Added a Keyword Index: Created a GIN index on the keywords column, which is incredibly fast for searching within array columns.
  3. Upgraded the SQL Function: Rewrote match_documents to accept a p_keywords array parameter. The new logic first uses the GIN index to filter documents (WHERE keywords @> p_keywords), then performs the vector search only on that much smaller, pre-filtered set.
  4. Updated the Edge Function: Modified index.ts to extract keywords from the user's query and pass them to the parameterized SQL function, enabling pre-filtering on every request.
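The two-stage idea can be illustrated outside the database. The sketch below is an in-memory TypeScript analogue, not the SQL function itself: it applies a containment filter equivalent in spirit to keywords @> p_keywords, then ranks only the survivors by cosine similarity. The Doc shape and the name matchDocumentsInMemory are hypothetical.

```typescript
// Hypothetical row shape; the real table's columns are not shown in this write-up.
interface Doc {
  id: number;
  keywords: string[];
  embedding: number[];
}

// Cosine similarity compares direction (angle) rather than straight-line distance.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

function matchDocumentsInMemory(
  docs: Doc[],
  queryEmbedding: number[],
  keywords: string[],
  topK = 5,
): Doc[] {
  // Stage 1: cheap filter. A document survives only if it contains every
  // requested keyword; an empty keyword list matches all documents.
  const candidates = docs.filter((d) => keywords.every((k) => d.keywords.includes(k)));

  // Stage 2: the expensive vector comparison runs on the survivors only.
  return candidates
    .map((doc) => ({ doc, score: cosineSimilarity(doc.embedding, queryEmbedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((scored) => scored.doc);
}
```

In Postgres the same ordering is expressed in the WHERE clause and carried out by the GIN and vector indexes; the sketch only shows why shrinking the candidate set first reduces the number of similarity computations.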

The Results: Massive Performance Gains

Query logs after the changes show dramatic improvement:

  • Mean Time: Dropped ~24×, from 911 ms down to ~37 ms
  • Max Time: Dropped ~66×, from 2,620 ms down to ~40 ms
  • Consistency: The huge variability between minimum and maximum query times disappeared. The function now performs consistently fast, removing the database as the bottleneck.

Key Learnings for the Future

This work reinforced several critical best practices for building high-performance RAG systems:

  • Filter First, Search Second: Always reduce rows with relational filters (e.g., WHERE user_id = '...' or WHERE keywords @> '...') before expensive vector operations. This is the single most important optimization you can make.
  • Use the Right Index for the Job: A vector index is not one-size-fits-all. Ensure your index type (IVFFlat, HNSW) and distance metric (cosine_ops, l2_ops) match your embeddings and query patterns. For semantic search, cosine similarity is almost always correct.
  • Cache is Your Friend (and a Clue): The 100% cache hit rate told us the slowness was compute-bound, not disk-bound—use profiling signals like this when diagnosing.
  • Isolate the Bottleneck: Measure each layer. After fixing the database, timing logs on the Edge function revealed the next bottleneck: sequential AI API calls, which is where the remaining 6–8 seconds of latency lies.

Part 2: Solving the 36k-Byte Crash (Stability)

The Problem: Crashing on Long Conversations

I identified that the core function was crashing with a 400 Bad Request error. I realized that my most engaged users—those with long, detailed conversations—were the most likely to experience a total app failure, breaking trust and halting their journey.

What Was Wrong

I dove into the logs and saw that requests to the text-embedding-004 model are capped at a small 36,000-byte limit. My code was sending the entire chat history for embedding, which was inefficient and, for long chats, fatal.

The Fix: A Dual-History Approach

I identified two different needs: retrieval only needs recent context to find relevant documents, while the chat model needs the full context to understand the user's journey. I addressed both by creating two history variables:

  1. recentHistoryText: A small, truncated history sent to the embedding model for efficient document retrieval.
  2. fullHistoryText: The complete history sent to the final chat model (gemini-2.5-flash) to maintain conversational context.
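A minimal sketch of the dual-history split follows, assuming chat messages arrive as an array of role/text pairs. The 36,000-byte figure comes from the logs described above; the safety margin, the six-message window, and the function name buildHistories are assumptions for illustration.

```typescript
interface ChatMessage {
  role: "user" | "assistant";
  text: string;
}

const EMBEDDING_BYTE_LIMIT = 36000; // limit observed in the logs
const SAFETY_MARGIN = 4000; // assumed headroom for the current query

// The limit is in bytes, not characters, so measure the UTF-8 encoding.
function byteLength(s: string): number {
  return new TextEncoder().encode(s).length;
}

function buildHistories(
  messages: ChatMessage[],
  maxBytes = EMBEDDING_BYTE_LIMIT - SAFETY_MARGIN,
  maxMessages = 6,
): { recentHistoryText: string; fullHistoryText: string } {
  const render = (m: ChatMessage): string => `${m.role}: ${m.text}`;

  // Full history goes to the chat model untouched.
  const fullHistoryText = messages.map(render).join("\n");

  // Recent history: walk backwards from the newest message and keep
  // as many whole messages as fit inside the byte budget.
  const kept: string[] = [];
  let used = 0;
  for (const m of messages.slice(-maxMessages).reverse()) {
    const line = render(m);
    const cost = byteLength(line) + 1; // +1 for the joining newline
    if (used + cost > maxBytes) break;
    kept.unshift(line);
    used += cost;
  }

  return { recentHistoryText: kept.join("\n"), fullHistoryText };
}
```

Note that if the newest message alone exceeds the budget, this sketch returns an empty recent history rather than a partial message; a real implementation might truncate that message instead.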

Part 3: Chatbot Optimization & Stability (Performance & Reliability)

The Problem: Two Critical Bottlenecks

The chatbot application faced two severe issues that directly impacted user trust and experience:

  • Stability Failures: The server function frequently crashed with a 500 error when the AI's response didn't match the expected JSON format. This occurred when the AI produced conversational plain text instead—often a symptom of poor RAG results.
  • Unacceptable Latency: End-to-end response times exceeded 40 seconds per query, making the tool feel broken and unreliable for real-time user interactions.

What Was Wrong with the Initial Architecture

Root cause analysis revealed two distinct issues:

  • No Error Handling: The JSON.parse() call had no try-catch wrapper. When the AI returned plain text (e.g., "Please pro..."), the parser crashed immediately, with no graceful fallback.
  • Redundant AI Pre-processing Step: The function spent 27.7 seconds on an initial AI call to generate a searchQuery and p_keywords for hybrid search. However, profiling showed the actual database query (with keywords and vectors) took only 250–1,100 ms. The expensive AI step was the true bottleneck, and it was unclear if it actually improved search quality.

The Fix: Error Handling + Vector-Only Search

We implemented a two-part solution:

  1. Wrapped JSON.parse() in Try-Catch: If parsing fails, the function logs the error and returns a valid JSON object with a user-friendly fallback message: "I'm sorry, I had trouble processing that request."
  2. Removed Redundant AI Pre-processing: We eliminated the 27.7-second Query & Keyword Generation step. The new logic:
    • Combines the user's query and chat history into a single string
    • Generates one vector embedding from that combined string
    • Calls match_documents using only the vector embedding, passing an empty array [] for keywords
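The error-handling half of the fix can be sketched as below. This is an illustrative version, not the deployed handler: the {text, actions} shape matches the pipeline's structured output, but treating actions as an array of strings and the name parseBotReply are assumptions.

```typescript
// Assumed reply shape: the pipeline emits {text, actions}; the element type
// of `actions` is a guess for this sketch.
interface BotReply {
  text: string;
  actions: string[];
}

const FALLBACK_REPLY: BotReply = {
  text: "I'm sorry, I had trouble processing that request.",
  actions: [],
};

function parseBotReply(raw: string): BotReply {
  try {
    const parsed: unknown = JSON.parse(raw);
    // Valid JSON is not enough: also check it has the expected fields.
    if (typeof parsed === "object" && parsed !== null) {
      const candidate = parsed as { text?: unknown; actions?: unknown };
      if (typeof candidate.text === "string" && Array.isArray(candidate.actions)) {
        return {
          text: candidate.text,
          actions: candidate.actions.filter((a): a is string => typeof a === "string"),
        };
      }
    }
    console.error("Model reply was JSON but had an unexpected shape");
  } catch (err) {
    // Plain-text replies land here instead of crashing the function.
    console.error("Model reply was not valid JSON:", err);
  }
  return FALLBACK_REPLY;
}
```

Checking the shape as well as the syntax covers the case where the model returns valid JSON that is missing a field, which a bare try-catch around JSON.parse() would let through.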

The Results: Stability + Speed Breakthrough

The optimization validated our hypothesis that vector-only search would be fast and accurate:

  • Latency Reduction: Eliminated the 27.7-second bottleneck entirely. Total server response time dropped from ~41 seconds to ~3.8 seconds—a 90% reduction.
  • Database Efficiency: The new vector-only query remained fast and efficient at ~1.1 seconds (1,072 ms) across 6,000 documents.
  • Answer Quality (HITL Evaluation): Side-by-side comparison of hybrid vs. vector-only responses showed no significant quality difference. Both versions provided legally correct, relevant, and easy-to-understand answers. The vector-only method occasionally produced slightly more comprehensive results.
  • Reliability: Error handling prevented further crashes, ensuring graceful degradation when the AI produced unexpected output.

Key Learnings for the Future

This optimization reinforced critical lessons about building resilient AI products:

  • Always Have a Fallback: When parsing or processing external AI outputs, assume it can fail. Wrap critical calls in error handling and provide sensible user-facing fallbacks.
  • Question Every Expensive Step: The AI pre-processing step seemed reasonable in theory but consumed 67% of total latency. Always profile before optimizing; measure impact before adding complexity.
  • Simpler is Often Better: Removing a step entirely (instead of trying to optimize it) proved faster and just as effective. Occam's Razor applies to system architecture.
  • HITL Validation is Essential: We didn't assume vector-only search would be "good enough"—we tested it manually with real examples to confirm quality parity before deploying at scale.

Caveats & Trade-offs

While vector-only search proved effective, important caveats apply:

  • Domain-Specific: This optimization works because RAG is already filtering to the right domain (UK tenancy law). In broader search scenarios, the AI keyword extraction might still add value.
  • Query Complexity: Very complex or ambiguous queries might benefit from explicit keyword expansion. We monitor for this and can adjust if needed.
  • Not a Universal Pattern: Removing pre-processing works here because our embedding model (text-embedding-004) is strong enough to handle conversational queries directly. Other embedding models or use cases may require explicit keyword extraction.

Part 4: Fixing Accidental Submissions (Usability)

The Problem: User Frustration from Quickfire Questions

I noticed that users who were asked for details (e.g., "how long has the mould been present?") would try to type a multi-line answer. When they hit Enter for a new line, the UI submitted their partial, incomplete thought, confusing the AI and forcing it to ask the same questions again.

What Was Wrong

The UI was fighting the user's intent. A single-line <input> box that submitted on Enter was preventing users from providing the detailed, multi-line answers the AI needed.

The Fix: Aligning the UI with User Intent

My solution was to re-align the UI to match the user's natural workflow. I implemented this by:

  1. Replacing the single-line <input> with a multi-line, auto-resizing <textarea>.
  2. Changing the submit event from "Enter" to "Ctrl+Enter" (or "Cmd+Enter").
  3. Updating the placeholder text to teach this new, more deliberate interaction.
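The key-handling rule behind these steps is small enough to isolate as a pure function. The sketch below assumes a minimal event shape (KeyLike) so it can be tested without a browser; the names shouldSubmit and KeyLike are hypothetical, and the wiring to a real textarea is shown only in comments.

```typescript
// Subset of KeyboardEvent needed for the decision; a real KeyboardEvent
// satisfies this shape structurally.
interface KeyLike {
  key: string;
  ctrlKey: boolean;
  metaKey: boolean;
}

// Plain Enter inserts a newline (the textarea default);
// only Ctrl+Enter or Cmd+Enter submits.
function shouldSubmit(e: KeyLike): boolean {
  return e.key === "Enter" && (e.ctrlKey || e.metaKey);
}

// Browser wiring (illustrative, assumes a `textarea` element and a `send` function):
//
//   textarea.addEventListener("keydown", (e) => {
//     if (shouldSubmit(e)) {
//       e.preventDefault(); // stop the newline from being inserted
//       send(textarea.value);
//     }
//   });
//   textarea.addEventListener("input", () => {
//     textarea.style.height = "auto"; // reset, then grow to fit the content
//     textarea.style.height = `${textarea.scrollHeight}px`;
//   });
```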

Part 5: Indexing 3,072-Dim Embeddings — HNSW on a halfvec Cast (Performance)

The Problem: Intermittent 3-Second Timeouts

Users were occasionally hitting 500: canceling statement due to statement timeout on the chat endpoint. The Supabase anon role has a statement_timeout = 3s — strict by default. EXPLAIN ANALYZE on a cold cache showed match_documents taking 2,911 ms, right at the edge of the cutoff.

What Was Wrong

  • No vector index existed. Only the primary-key btree on id. Every retrieval was a full sequential scan over 3,374 rows × 3,072-dim float vectors — roughly 40 MB of cosine math per query.
  • pgvector's HNSW caps at 2,000 dimensions on the raw vector type. The Gemini gemini-embedding-001 model emits 3,072-dim vectors — so the default index path was unavailable.

The Fix: HNSW Index Over a halfvec Cast

  1. Built the index over an expression rather than the column: CREATE INDEX ... USING hnsw ((embedding::halfvec(3072)) halfvec_cosine_ops). The halfvec type (FP16 internally) supports up to 4,000 dims for HNSW.
  2. Rewrote match_documents to cast both sides of the <=> operator to halfvec(3072) so the planner consistently uses the index.
  3. Kept the underlying column as vector(3072) — no schema migration, no embedding regeneration. halfvec is purely an index-time and query-time cast.
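The dimension limits and the scan cost quoted above reduce to simple arithmetic, sketched here as a pre-flight check that could run before choosing an index. The limits (2,000 for vector, 4,000 for halfvec) are pgvector's documented HNSW caps; the function names are hypothetical, and the size estimate ignores per-row storage overhead.

```typescript
type VectorType = "vector" | "halfvec";

// pgvector's documented HNSW dimension limits per column type.
const HNSW_MAX_DIMS: Record<VectorType, number> = { vector: 2000, halfvec: 4000 };

// FP32 vs FP16 element size in bytes.
const BYTES_PER_DIM: Record<VectorType, number> = { vector: 4, halfvec: 2 };

// Returns the type an HNSW index can be built over, or null if the
// embeddings are too wide for either.
function pickHnswIndexType(dims: number): VectorType | null {
  if (dims <= HNSW_MAX_DIMS.vector) return "vector";
  if (dims <= HNSW_MAX_DIMS.halfvec) return "halfvec";
  return null;
}

// Approximate raw vector data (in MB) read by one unindexed sequential scan.
function scanMegabytes(rows: number, dims: number, type: VectorType): number {
  return (rows * dims * BYTES_PER_DIM[type]) / 1000000;
}
```

For the table described here, scanMegabytes(3374, 3072, "vector") comes to about 41 MB, which matches the "roughly 40 MB" figure, and pickHnswIndexType(3072) returns "halfvec".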

The Results: 21× Faster, ~69× Fewer Buffer Hits

  • EXPLAIN ANALYZE on the same probe query: 2,911 ms → 136 ms.
  • Buffer hits dropped from 47,851 to 690 (about 69× fewer), so each query now touches a small fraction of the pages it did before.
  • Statement-timeout failures stopped recurring under normal load.

Key Learning: Index Dimensionality Limits Are Easy to Miss

The 2,000-dim cap on vector HNSW isn't surfaced anywhere obvious. pgvector rejects an HNSW index on a wider vector column, and if that failure goes unnoticed the table is left with no vector index at all. With embedding model dimensions trending up (1,536 / 3,072 / 4,096), indexing through a halfvec cast is the route the pgvector docs describe for vectors of up to 4,000 dimensions (4,096-dim embeddings exceed that limit too). Recall loss from FP32→FP16 is reported as under 1% in published benchmarks. Worth doing on day one, not waiting for the timeout that proves it was needed.

Part 6: Streaming Responses — Buying Back 12 Seconds of Perceived Wait (UX)

The Problem: An 8-15 Second Wall of Silence

End-to-end response time was acceptable (~10-15 s), but the user-perceived experience was dominated by an empty loading ring with no progress for the entire wait. In Nielsen's response-time limits, about 10 s is the ceiling for holding a user's attention; past that without visible progress, users alt-tab.

The Solution: SSE Stream + RAF-Paced Typewriter

  1. Backend: Replaced the single blocking generateContent() call with generateContentStream() for the prose, plus a parallel non-streaming classification call for action buttons + confidence. The Edge Function returns text/event-stream with one SSE event per Gemini chunk.
  2. Frontend: Replaced supabase.functions.invoke() (which buffers the whole response) with raw fetch() reading response.body as a ReadableStream. Each SSE event is appended to a buffer.
  3. Smoothing: Network chunks arrive in 30-80 char bursts every 200-500 ms. To stay below the perceptual stall threshold (~50 ms between visible updates), characters drain into the DOM via requestAnimationFrame at an adaptive rate — 1 char/frame when caught up, up to 12 chars/frame when a big chunk lands, full flush on stream-done.
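The smoothing step can be sketched as a small buffer that is drained once per animation frame. The 1 and 12 chars/frame bounds and the flush-on-done rule come from the description above; the exact ramp between them (backlog divided by 8) and the names charsThisFrame and Typewriter are assumptions.

```typescript
// How many characters to reveal this frame, given how far behind we are.
function charsThisFrame(backlog: number, streamDone: boolean): number {
  if (streamDone) return backlog; // stream finished: flush everything
  if (backlog <= 0) return 0; // nothing pending: draw nothing
  // Assumed ramp: 1 char/frame when nearly caught up, capped at 12 when a
  // large chunk has just landed.
  return Math.min(12, Math.max(1, Math.ceil(backlog / 8)));
}

class Typewriter {
  private pending = "";
  rendered = "";

  // Called whenever a network chunk arrives.
  push(chunk: string): void {
    this.pending += chunk;
  }

  // Called once per animation frame; returns true if new text became visible.
  tick(streamDone: boolean): boolean {
    const n = charsThisFrame(this.pending.length, streamDone);
    if (n === 0) return false;
    this.rendered += this.pending.slice(0, n);
    this.pending = this.pending.slice(n);
    return true;
  }
}

// Browser wiring (illustrative, assumes an element `el` and a `done` flag):
//
//   const tw = new Typewriter();
//   const loop = () => {
//     if (tw.tick(done)) el.textContent = tw.rendered;
//     requestAnimationFrame(loop);
//   };
//   requestAnimationFrame(loop);
```

Decoupling arrival (push) from display (tick) is what turns a handful of bursty network chunks into hundreds of small, evenly spaced DOM updates.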

The Results: Visible Motion 60 Times Per Second

  • First visible character now arrives at ~12 s (one Gemini round-trip), not at the 15 s end-of-response.
  • Median interval between visible DOM updates: 17 ms (= one frame at 60 fps). p99: 20 ms. Max gap: 36 ms — below the perceptual stall threshold for the entire response.
  • 304 DOM updates over a 5,024-char response, vs ~15-20 raw network chunks before. Same end-time, dramatically smoother feel.

Key Learning: Perceived Latency ≠ Actual Latency

Visible incremental progress hijacks the user's clock — once reading starts, subjective time compresses (well-established in flow and queue-perception research). For an LLM where the first-token cost is dominated by embedding + retrieval setup, you can't make the wait shorter, but you can make it readable. The fix is psychology-aware, not throughput-aware.

Caveats & Trade-offs

The streaming refactor breaks supabase.functions.invoke() as a client pattern — the SDK wraps responses as {data, error} assuming non-streaming JSON. Direct fetch() is required client-side. Also: when the function changes shape, frontend and backend must deploy together. The brief window where the prod frontend still expected JSON while the prod function had been updated to SSE caused 1-2 min of broken chat during the rollout — flagged for future hardening as a deployment ordering hazard.
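Since direct fetch() means handling the event stream by hand, here is a minimal sketch of the client-side parsing. It assumes events are separated by a blank line and that each carries one or more data: lines, which is the standard SSE framing; it does not handle \r\n line endings or other SSE fields. The function name parseSse is hypothetical.

```typescript
// Split a text buffer into complete SSE events plus an unfinished remainder.
// The remainder must be kept and prepended to the next network chunk.
function parseSse(buffer: string): { events: string[]; rest: string } {
  const blocks = buffer.split("\n\n");
  const rest = blocks.pop() ?? ""; // the last block may be incomplete
  const events: string[] = [];
  for (const block of blocks) {
    const data = block
      .split("\n")
      .filter((line) => line.startsWith("data:"))
      .map((line) => line.slice(5).trimStart())
      .join("\n");
    if (data) events.push(data);
  }
  return { events, rest };
}

// Wiring with fetch (illustrative, assumes `url`, `body`, and `onEvent`):
//
//   const response = await fetch(url, { method: "POST", body });
//   const reader = response.body!.getReader();
//   const decoder = new TextDecoder();
//   let buffered = "";
//   for (;;) {
//     const { value, done } = await reader.read();
//     if (done) break;
//     buffered += decoder.decode(value, { stream: true });
//     const parsed = parseSse(buffered);
//     buffered = parsed.rest;
//     parsed.events.forEach(onEvent);
//   }
```

Keeping the remainder matters because network chunk boundaries do not line up with event boundaries; an event split across two reads would otherwise be dropped or mis-parsed.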

Part 7: Diagnosing Embedding Contamination — When the Fix Becomes the Bug (Reliability)

The Problem: Topic Shifts Were Being Refused

A user reported a real conversation where the bot answered a deposit question correctly, then refused to discuss Awaab's Law a few turns later — "the information I have access to focuses specifically on tenancy deposit protection rules." After the user pushed back, the bot eventually answered properly. The behaviour was intermittent and the root cause wasn't obvious.

What Was Wrong: Speculation vs Measurement

My initial hypothesis tree had two competing explanations: (a) the embedding was contaminated by recent conversation history, or (b) the model had eventually capitulated and answered from training-data knowledge. Both were plausible. Speculating between them wouldn't have shipped the right fix.

The Fix: Build a Diagnostic First, Then Decide

  1. Added a debug: true request flag to the Edge Function (gated server-side by ALLOW_DEBUG_MODE=true so it never exposes internals in production). When set, the SSE response includes an extra event with: the exact embedding input length and head/tail preview, per-retrieved-chunk similarity scores, source URLs, content previews, and threshold-pass status.
  2. Wrote an ~80-line replay script that POSTed the exact 5-turn conversation through curl and dumped the diagnostic payload per turn.
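The diagnostic event described in step 1 might be assembled along these lines. This is a sketch under assumptions: the field names, the preview length, and the function names are invented for illustration, and the similarity threshold is passed in because its production value is not stated here. Only the ALLOW_DEBUG_MODE gate is taken from the description.

```typescript
// Hypothetical shape of one retrieved row.
interface RetrievedChunk {
  sourceUrl: string;
  content: string;
  similarity: number;
}

// Debug output is allowed only when the request asks for it AND the
// server-side environment variable permits it.
function debugEnabled(requestFlag: boolean, env: Record<string, string | undefined>): boolean {
  return requestFlag && env["ALLOW_DEBUG_MODE"] === "true";
}

function buildDebugPayload(
  embeddingInput: string,
  chunks: RetrievedChunk[],
  threshold: number,
  previewChars = 80,
) {
  return {
    // What was actually embedded: length plus head and tail previews.
    inputLength: embeddingInput.length,
    inputHead: embeddingInput.slice(0, previewChars),
    inputTail: embeddingInput.slice(-previewChars),
    // What came back, with enough detail to judge relevance per chunk.
    chunks: chunks.map((c) => ({
      sourceUrl: c.sourceUrl,
      similarity: Number(c.similarity.toFixed(3)),
      preview: c.content.slice(0, previewChars),
      passedThreshold: c.similarity >= threshold,
    })),
  };
}
```

Showing the head and tail of the embedding input is what makes contamination visible: if the tail is the user's question but the head is a previous answer, the input is no longer about the current query.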

The Data (Before)

The numbers showed contamination, not capitulation:

  • Turn 1 (deposit query, 37-char embedding): retrieved 5 deposit chunks at cosine 0.70-0.71. Correct.
  • Turn 2 (off-topic "count to 1000", 3,871-char embedding mostly the previous deposit response): retrieved deposit chunks at 0.79-0.82, higher than turn 1.
  • Turn 3 (Awaab's Law query, 4,462-char embedding still deposit-weighted): 0 Awaab chunks retrieved; 5/5 deposit chunks at 0.78-0.81. The model refused correctly — the context literally didn't contain Awaab content.
  • Turn 5 ("you do have access to awaab's law", 2,310-char embedding because slice(-6) finally aged out the deposit messages): Awaab gov.uk guidance finally appears at sim 0.67. The "capitulation" was actually correct grounding, not training-data hallucination.

The Fix: Embed the Query Alone

Changed retrieval to embed only the sanitised current query — no chat-history concatenation. The full conversation history is still passed to the chat model for conversational context; only the retrieval embedding is now isolated.

Critically, this reverses the decisions documented in Parts 2 and 3 above, where chat history was fed into the retrieval embedding. The original rationale was sound at the time: fewer LLM round-trips, more context for retrieval. But the corpus, embedding model, and conversation-length patterns have all evolved since. The earlier optimisation had become the new bug. Worth noting honestly: architectural choices have a shelf-life.
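The isolation can be expressed as two separate inputs built from the same request, sketched below. What "sanitised" involves is not specified here, so the whitespace collapsing and the 500-character cap in sanitiseQuery are assumptions, as are the function names.

```typescript
// Assumed sanitisation: collapse whitespace, trim, and cap the length.
function sanitiseQuery(raw: string, maxChars = 500): string {
  return raw.replace(/\s+/g, " ").trim().slice(0, maxChars);
}

function buildModelInputs(
  query: string,
  priorTurns: string[],
): { embeddingInput: string; chatContext: string } {
  return {
    // Retrieval sees only the current question, so an earlier topic
    // cannot pull the embedding towards stale chunks.
    embeddingInput: sanitiseQuery(query),
    // The chat model still sees the whole conversation.
    chatContext: [...priorTurns, query].join("\n"),
  };
}
```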

The Data (After)

  • Same Turn 3 Awaab query (157-char embedding): 5/5 Awaab gov.uk chunks retrieved at sim 0.71-0.73.
  • Model produces a 4,941-char grounded breakdown on the first try, no refusal.
  • All 5 turns of the original failing conversation now behave correctly without needing 4 turns of context decay.

Key Learning: Diagnose with Instrumentation, Don't Argue with Hypotheses

The 30 minutes spent adding the debug flag + writing the replay script was the highest-leverage time of the week. It turned a "could be X or could be Y" debate into a one-shot data answer. The same pattern will scale to future retrieval-quality issues: add the telemetry once, gate it behind an env var so prod stays clean, and use it whenever a user report doesn't match the expected behaviour.

Caveats & Trade-offs

Query-only embedding sacrifices pure follow-up queries like "tell me more" — they no longer have a referent for retrieval. Users now need to name the topic. The proper longer-term fix is a small LLM call that rewrites the user query to a self-contained statutory-language form (query expansion). For now, the trade-off favours topic-shift correctness over the convenience of vague follow-ups.

Part 8: From Postgres Stack Traces to Human Sentences — Classified Error UX (UX)

The Problem: Raw Errors Bleeding Through to Users

Edge Function failures were surfacing to users with messages like "Sorry, something went wrong. Details: canceling statement due to statement timeout" — a Postgres internal error string rendered in a chat bubble. Other failure modes (Gemini rate limits, content-safety blocks, daily quota hits, embedding API outages) all returned similarly raw upstream text. The UX rewarded technical literacy and punished everyone else.

The Solution: A Glossary of Stable Error Codes

  1. Backend classifier: A classifyError() helper inspects any thrown error or upstream failure and maps it to a stable machine-readable code: RATE_LIMITED, CONTENT_BLOCKED, QUOTA_EXHAUSTED, DB_TIMEOUT, EMBEDDING_FAILED, NETWORK_ERROR, INVALID_REQUEST, SERVER_MISCONFIGURED, GENERIC_UPSTREAM. Each payload carries { code, message, retry_after? }.
  2. SSE-aware error transport: Pre-stream errors return as JSON with the structured payload. Errors that occur mid-stream (e.g. Gemini drops the connection halfway through generating) are emitted as data: {"error": {...}} SSE events the frontend can react to without losing the partial bubble.
  3. Frontend glossary: ERROR_GLOSSARY maps each code to a user-facing string and a retryable flag. RATE_LIMITED → "Lots of people are using this right now — please try again in 30 seconds." CONTENT_BLOCKED → "I can't respond to that particular phrasing. Try rewording your question." QUOTA_EXHAUSTED → "We've hit our daily AI quota. The developers are working on it."
  4. Auto-retry-once: On retryable mid-stream failures (rate-limit, transient network), the frontend silently clears the partial bubble, waits 800 ms, and re-sends the same query. If the second attempt also fails, the friendly text + a Retry button render inline.
  5. Pre-flight: On page load, the frontend shape-validates that the Supabase URL and JWT look correct. If the build's env-var substitution failed, the user sees a non-dismissable banner instead of a chat that silently doesn't work.
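A compact sketch of the classifier and glossary pairing follows. The code names are the ones listed above and the two glossary strings are quoted from the list; the regex patterns (other than the Postgres timeout text), the rule order, the generic fallback wording, and the retryable set are assumptions for illustration.

```typescript
type ErrorCode =
  | "RATE_LIMITED" | "CONTENT_BLOCKED" | "QUOTA_EXHAUSTED" | "DB_TIMEOUT"
  | "EMBEDDING_FAILED" | "NETWORK_ERROR" | "INVALID_REQUEST"
  | "SERVER_MISCONFIGURED" | "GENERIC_UPSTREAM";

// First matching rule wins. Only the DB_TIMEOUT pattern is taken from an
// error string seen in this write-up; the others are guesses.
const RULES: Array<{ code: ErrorCode; pattern: RegExp }> = [
  { code: "DB_TIMEOUT", pattern: /statement timeout/i },
  { code: "QUOTA_EXHAUSTED", pattern: /quota/i },
  { code: "RATE_LIMITED", pattern: /rate limit|\b429\b/i },
  { code: "CONTENT_BLOCKED", pattern: /safety|blocked/i },
];

// Backend: map any thrown value to a stable code.
function classifyError(err: unknown): ErrorCode {
  const message = err instanceof Error ? err.message : String(err);
  const rule = RULES.find((r) => r.pattern.test(message));
  return rule ? rule.code : "GENERIC_UPSTREAM";
}

// Frontend: codes map to friendly text; unknown codes get a generic line.
const ERROR_GLOSSARY: Partial<Record<ErrorCode, string>> = {
  CONTENT_BLOCKED: "I can't respond to that particular phrasing. Try rewording your question.",
  QUOTA_EXHAUSTED: "We've hit our daily AI quota. The developers are working on it.",
};
const RETRYABLE: ReadonlySet<ErrorCode> = new Set<ErrorCode>(["RATE_LIMITED", "NETWORK_ERROR"]);

function userMessage(code: ErrorCode): { text: string; retryable: boolean } {
  return {
    text: ERROR_GLOSSARY[code] ?? "Sorry, something went wrong.",
    retryable: RETRYABLE.has(code),
  };
}
```

Because the frontend only ever sees a code, the raw upstream text stays in server logs and never reaches a chat bubble.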

The Results: Errors Stop Being Embarrassing

  • Postgres timeouts, Gemini safety blocks, and quota errors all now surface as one-sentence messages the user can actually act on.
  • Transient blips self-heal silently — most users never see them.
  • The error-code surface is now a contract between backend and frontend: when a new failure mode appears in the wild, only the glossary table needs editing, not the chat handler.

Key Learning: Treat Error Surface as a Product Feature

The classifier + glossary cost about an hour to write, but it turned the worst part of the product (silent failures and Postgres stack traces) into something users tolerate. The same pattern transfers cleanly to any LLM-backed app: classify upstream failure modes once, write the friendly mapping once, and the next provider outage becomes a one-line PR instead of a UX incident.

Caveats & Trade-offs

The classifier is regex-based on error messages — adequate for Gemini and Postgres patterns observed so far, but brittle if upstream providers reword their error text. Adding a smoke test that asserts the expected code emerges for each known failure mode is queued as part of a broader testing pass.

Part 9: Triage-First Prompting — Listen Before You Advise (UX & Reliability)

The Problem: 40-Line Walls of Text Built on Guesses

The chatbot was generating long procedural responses to every query, often after silently assuming the user's tenure type, UK nation, or who their landlord was. A confidence badge in the header showed High even when the model hedged or refused to answer. To compensate for the length, the UI auto-collapsed any reply taller than 600 px — hiding exactly the content the user had just asked for.

What Was Wrong

  • Comprehensive-response prompt: The system prompt rewarded depth and structure, so the model produced a full step-by-step procedure even when it lacked the basic facts to ground it.
  • Soft ambiguity heuristics: The prompt asked the model to "ask follow-up questions if not 95% confident" — but the model is bad at self-judging confidence, and almost always decided it was confident enough.
  • Decoupled confidence rating: A separate classification call returned a stylistic "High / Medium / Low" rating that wasn't tied to retrieval quality or factual coverage. It would show High on a polite refusal.
  • Length compensation in the UI: Auto-collapse + "Show full message" was a workaround for an over-eager generator. The fix was at the wrong layer.

The Fix: Three Modes With Hard Trigger Conditions

  1. Mode A — clarify (default): Fires automatically if ANY of tenure type / UK nation / landlord type / duration / what-was-tried is unknown. Entire reply is 2-3 short numbered questions. No partial advice slipped in. No hedging.
  2. Mode B — brief fact: Pure statutory questions only ("what is Section 21?"). 1-3 sentence definition, ends with an offer to map it to the user's situation. Any "my landlord" / "I" / "we" flips back to Mode A.
  3. Mode C — procedural: Numbered steps, what to say to the landlord, statutory timeframes. Fires only when the user has supplied the facts in conversation OR explicitly escalates ("yes give me the full guide", "walk me through it").
  4. Removed the confidence rating end-to-end: Stripped from the classification prompt, the SSE meta payload, and the frontend pill. Sources footer already shows real evidence; the rating was theatre.
  5. Removed expand/collapse: Once long replies became opt-in, the auto-collapse was solving a problem we'd already designed away.
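In the product these trigger conditions live in the system prompt, but the decision rule itself can be written out as code to make the precedence explicit. The sketch below is one reading of the three modes, not the shipped logic: the KnownFacts fields mirror the five facts listed for Mode A, the regexes are rough stand-ins for the model's judgement, and isStatutoryQuestion is assumed to be supplied by an upstream classifier.

```typescript
type Mode = "A_CLARIFY" | "B_BRIEF_FACT" | "C_PROCEDURAL";

// The five facts Mode A checks for; undefined means "not yet known".
interface KnownFacts {
  tenureType?: string;
  ukNation?: string;
  landlordType?: string;
  duration?: string;
  whatWasTried?: string;
}

const REQUIRED_FACTS: Array<keyof KnownFacts> = [
  "tenureType", "ukNation", "landlordType", "duration", "whatWasTried",
];

// Rough stand-ins for the escalation phrases and personal markers.
const ESCALATION = /give me the full guide|walk me through it/i;
const PERSONAL = /\b(my landlord|i|we)\b/i;

function pickMode(query: string, facts: KnownFacts, isStatutoryQuestion: boolean): Mode {
  // Explicit escalation always unlocks the full procedure.
  if (ESCALATION.test(query)) return "C_PROCEDURAL";
  // Pure statutory questions get a brief definition, unless personalised.
  if (isStatutoryQuestion && !PERSONAL.test(query)) return "B_BRIEF_FACT";
  // Procedure is allowed once every required fact has been supplied.
  const missing = REQUIRED_FACTS.filter((k) => !facts[k]);
  if (missing.length === 0) return "C_PROCEDURAL";
  // Default: ask before advising.
  return "A_CLARIFY";
}
```

Writing the rule this way also makes a useful test oracle: the expected mode for each smoke-test query can be derived from the same conditions the prompt describes.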

The Results: Honest By Default

  • "My landlord won't fix the heating" went from a 40-line generic walkthrough to a 3-question clarifier (tenure / nation / how long).
  • Deposit-in-England queries (person-specific) now clarify before quoting law, rather than dispensing landlord-specific procedure built on assumptions.
  • Smoke test (eval/smoke.ts — 10 hand-picked queries, Claude Sonnet 4.6 judge classifying mode A/B/C): 8 of 10 hit expected mode on the first prompt iteration. The remaining 2 surfaced edge cases that drove the next revision.

Key Learning: An Unverifiable Confidence Number Is Worse Than No Number

Asking a clarifying question is concrete: the user can answer it, the model can act on the answer. A "High confidence" badge the user can't validate is just decoration — and in domains where the cost of wrong advice is high, decoration that implies competence is actively harmful. The same goes for any UX cue that signals certainty the underlying system doesn't actually have.

Caveats & Trade-offs

Some users find clarifying questions annoying — they want an answer right now. The mitigation is an explicit escape: typing "yes" or "give me the full guide" jumps directly to Mode C. The hard trigger conditions are also somewhat brittle for edge cases where context is implicit ("I'm a council tenant" hidden in earlier turns); proper conversational state-tracking is a follow-up.

Part 10: Anchoring the Model in Current Law — The LEGAL CONTEXT Block (Reliability)

The Problem: A Confidently Wrong Answer About Section 21

The Renters' Rights Act 2025 received Royal Assent in October 2025 and its main provisions came into force on 1 May 2026 — abolishing Section 21 "no-fault" evictions for assured private tenancies in England. Gemini's training data predates that. The vector corpus was ingested before that. So when a user asked "what is a Section 21 notice?" the bot would confidently describe it as the standard way for landlords to end an Assured Shorthold Tenancy. Correct in 2024. Completely wrong in 2026.

What Was Wrong

  • Two layers of staleness: The model's training data is frozen at its cutoff, and the vector corpus is frozen at its ingest date. Neither has a way to learn about a statute that commenced in between.
  • Retrieval can't save you: Even with perfect retrieval, the chunks themselves predate the Act. "Tighten the threshold" or "improve the embeddings" doesn't help when the underlying source material is wrong.
  • No precedence signal: The model treats retrieved chunks as authoritative. There was no mechanism to tell it "current statute overrides this old guidance."

The Fix: A Hand-Curated Statute Anchor With Explicit Precedence

  1. LEGAL CONTEXT block in the system prompt: Prepended above retrieved chunks. Covers RRA 2025 specifics — commencement dates, Section 21 abolition, periodic-tenancy default, pet-request right, rent-in-advance cap (1 month maximum), the transitional rule for Section 21 notices served before 1 May 2026 (proceedings must issue by 31 July 2026 at the latest).
  2. Explicit "not yet in force" list: Landlord Ombudsman, the Private Rented Sector Database, Awaab's Law extension to PRS, Decent Homes Standard extension to PRS. Stops the bot from advising users to rely on protections that haven't commenced.
  3. Jurisdiction reminder: The Act is England-only. Scotland, Wales, and Northern Ireland have separate housing legislation.
  4. Precedence-on-conflict instruction: "If a retrieved source contradicts this LEGAL CONTEXT, trust the LEGAL CONTEXT, briefly note that the source predates the Renters' Rights Act 2025, and answer based on current law." The model now reconciles old chunks with the anchor instead of presenting them at face value.
  5. Verification gate before baking in: Every fact in the block was cross-checked against gov.uk and Shelter on the day the prompt was written — not pulled from memory. Wrong dates would be worse than no block.
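Structurally, the anchor is a prompt-assembly concern: curated facts first, the precedence rule next, retrieved chunks last. The sketch below shows that ordering. The precedence sentence is quoted from item 4; the section labels, the SourceChunk shape, and the function name buildSystemPrompt are assumptions, and the facts passed in would be the hand-verified entries rather than anything generated.

```typescript
// Hypothetical shape of a retrieved chunk.
interface SourceChunk {
  sourceUrl: string;
  content: string;
}

const PRECEDENCE_RULE =
  "If a retrieved source contradicts this LEGAL CONTEXT, trust the LEGAL CONTEXT, " +
  "briefly note that the source predates the Renters' Rights Act 2025, " +
  "and answer based on current law.";

function buildSystemPrompt(legalFacts: string[], chunks: SourceChunk[]): string {
  // The anchor goes above the retrieved material so the precedence rule is
  // read before any potentially stale chunk.
  const anchor = [
    "LEGAL CONTEXT:",
    ...legalFacts.map((fact) => `- ${fact}`),
    PRECEDENCE_RULE,
  ].join("\n");

  const retrieved = chunks
    .map((c, i) => `[Source ${i + 1}] ${c.sourceUrl}\n${c.content}`)
    .join("\n\n");

  return `${anchor}\n\nRETRIEVED SOURCES:\n${retrieved}`;
}
```

Passing the facts in as an array, rather than hard-coding them in the prompt string, keeps the door open for loading them from a table later without changing the assembly logic.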

The Results: The Bot Reconciles, Doesn't Repeat

  • "What is a Section 21 notice?" now returns: "Section 21 evictions have been abolished from 1 May 2026 by the Renters' Rights Act 2025… A notice served on or before 30 April 2026 may still be valid if proceedings are issued by 31 July 2026."
  • When stale retrieved chunks are part of the response, the reply visibly notes the discrepancy to the user instead of stating contradicted information confidently.
  • Cost: about 280 tokens of prompt overhead per query — negligible compared to retrieved context or generation.

Key Learning: In-Context Facts Beat Chasing the Fine-Tune

When your training data has a fixed cutoff but the world doesn't, a paragraph of curated facts at the top of the prompt buys more than most retrieval upgrades. The trick is the precedence rule: without it, the model treats every source as equally authoritative, and confident-but-stale chunks drown out the anchor. With it, you've effectively given the model a "current as of" timestamp it can reason against.

Caveats & Trade-offs

Manual maintenance is real — someone has to remember to update the block when new statute commences, and to verify each fact against primary sources. There's also a soft cap: once the anchor list grows past around 10 entries, the right architecture is a Supabase legal_facts table editable without a function redeploy. The current block is a deliberate trade-off — simpler today, but with a known migration path when complexity demands it.

Product Decision Framework

Every technical decision was driven by a product-first mindset. Here are the critical trade-offs I navigated as both founder and PM.

B2C vs B2B

Decision: Build for tenants (B2C), not landlords (B2B).

Rationale: Landlords already have access to resources and legal advice. Tenants are the underserved market facing an information gap. This aligns with the mission of empowerment over profit.

Stateless by Design

Decision: 100% stateless—no server-side conversation memory.

Rationale: Tenancy issues are sensitive. Guaranteeing anonymity builds trust. Chat history is kept on the client and sent with each request, giving users full control while maintaining context.

Accuracy Over Speed

Decision: Accept a few hundred milliseconds of latency for guaranteed accuracy.

Rationale: For legal guidance, incorrect answers are worse than slow answers. The RAG pipeline ensures every response is grounded in verified sources.

Integrated Architecture

Decision: Single Supabase Edge Function + pgvector vs. multi-agent workflow + Pinecone.

Rationale: Fewer services = lower latency, lower cost, and fewer points of failure. Simplicity enables faster iteration and debugging.

The Job to be Done

"When I have a problem with my tenancy, help me understand my rights and confidently take the correct, formal next step... without me having to pay for a lawyer or spend hours reading dense legal documents."

— The "Frustrated Renter" persona

The Results & Impact

The tool quickly found its audience. By sharing the exact advice I pulled from the tool, I became a 'star contributor' on several large tenant and landlord Facebook groups. This grassroots adoption is an early sign of product-market fit.

50+ Daily Active Users

Achieved steady daily usage through organic, community-led growth with zero marketing spend.

Community Recognition

Became a trusted voice in the target community, validating the tool's accuracy and usefulness.

Future Roadmap

The current tool is just the beginning. I have a clear, three-pronged vision for the future, including a high-value multimodal feature and potential monetization paths.

  • The Contract Analyzer (Multimodal)

    Allow users to upload tenancy agreements. The tool would use AI to highlight sketchy or unenforceable clauses, providing a critical service that would justify a monetization model to cover processing costs.

  • Solicitor Referral Network

    Use the tool as a referral point to recommend solicitors for complex cases, creating a revenue stream via referral fees.

  • Charity & Council Integration

    Partner with councils, support groups, and charities to integrate the tool into their websites, improving the user experience and helping tenants plan their next steps more effectively.