JigsawFlux Blog

Picking an Open-Source Agent Framework: LangGraph, CrewAI, and AutoGen

2026-06-28T00:00:00.000Z

The first decision in any agentic project isn't which model to use. It's which framework will orchestrate it. Get that wrong and you inherit a stack you can't run locally, can't afford to scale, and can't escape when the vendor changes the API.

This is a JigsawFlux project. JigsawFlux builds open-source tools for health tech, humanitarian response, and crisis management — in places where "cloud-native" is not an option and IT budgets are measured in grants, not headcount. That context imposes hard constraints on every architecture decision: portability, cost, and freedom from vendor lock-in.

The frameworks here — LangGraph, CrewAI, and AutoGen — were chosen because they meet those constraints. They are open source, actively maintained, and run entirely on hardware you own. Alternatives like Microsoft Semantic Kernel or Amazon Bedrock Agents are capable, but they introduce hard dependencies on specific cloud ecosystems. That trade-off doesn't fit the JigsawFlux model.

Each framework is also the natural implementation home for a different family of agentic patterns — which is the other reason for this grouping, and why this is Part 1 of a two-part series. Part 2 implements those patterns directly.

The shared task

Every framework runs the same pipeline: a Researcher agent uses a DuckDuckGo web search tool to gather facts on a given topic, then a Writer agent consumes those notes and produces a structured Markdown report. The topic in all runs was solid-state batteries.

python run.py --framework all --topic "solid-state batteries"

This symmetry is deliberate. Same task, same model (claude-sonnet-4-6), same tool — only the orchestration framework changes. That isolation means the telemetry at the end reflects framework behaviour and overhead, not model variation.

The full source is at github.com/JigsawFlux/comparing-agent-frameworks.

LangGraph: the stateful graph

LangGraph models execution as a directed graph where nodes are functions and edges are routing decisions. State flows through every node as a typed dictionary — you define what it contains, and reducers control how it gets updated. ^[1]

The researcher loop is cyclic by design. It calls web_search, gets results back via the tool node, and loops until it decides it has enough information — at which point the conditional edge routes to the writer.

The state schema makes the data contract explicit:

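from typing import Annotated, Sequence, TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages

class AgentState(TypedDict):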
    messages: Annotated[Sequence[BaseMessage], add_messages]
    topic: str
    research_notes: str
    final_report: str
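To make the role of a reducer concrete, here is a framework-free sketch of the idea. It is not LangGraph's implementation: `append_items`, `DemoState`, and `apply_update` are hypothetical names, and the reducer here only appends, whereas the real `add_messages` does more.

```python
from typing import Annotated, TypedDict, get_type_hints


def append_items(existing: list, new: list) -> list:
    """Reducer: combine the old and new values instead of overwriting."""
    return existing + new


class DemoState(TypedDict):
    messages: Annotated[list, append_items]  # field with a reducer
    topic: str                               # plain field: last write wins


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's partial update into the state, honouring reducers."""
    hints = get_type_hints(DemoState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        reducers = getattr(hints[key], "__metadata__", ())
        if reducers and key in merged:
            merged[key] = reducers[0](merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"messages": ["plan search"], "topic": "solid-state batteries"}
state = apply_update(state, {"messages": ["call web_search"]})
state = apply_update(state, {"topic": "sodium-ion batteries"})
print(state)
```

The annotated field accumulates across updates while the plain field is replaced, which is the distinction the state schema relies on.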

The routing logic is a single function:

def should_continue(state: AgentState):
    last_message = state["messages"][-1]
    if last_message.tool_calls:
        return "tools"   # loop back through tool node
    return "writer"      # research complete

And the graph wiring reduces to four calls:

workflow.add_edge(START, "researcher")
workflow.add_conditional_edges(
    "researcher",
    should_continue,
    {"tools": "tools", "writer": "writer"}
)
workflow.add_edge("tools", "researcher")   # the cyclic loop
workflow.add_edge("writer", END)
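The cyclic path through this wiring can be traced without the framework. The sketch below is a simulation under stated assumptions: messages are plain dicts, the model is a stub that asks for one search and then stops, and `run` is a hypothetical stand-in for the compiled graph.

```python
def should_continue(state: dict) -> str:
    """Same routing rule as above, applied to plain dict messages."""
    last_message = state["messages"][-1]
    return "tools" if last_message.get("tool_calls") else "writer"


def researcher(state: dict) -> dict:
    # Stub model: request one search, then stop once a result is present.
    has_result = any(m["role"] == "tool" for m in state["messages"])
    calls = [] if has_result else [{"name": "web_search"}]
    state["messages"].append({"role": "ai", "tool_calls": calls})
    return state


def tools(state: dict) -> dict:
    state["messages"].append({"role": "tool", "content": "stub search result"})
    return state


def run(state: dict, max_steps: int = 10) -> list:
    """Walk researcher -> (tools -> researcher)* -> writer and record the path."""
    path, node = [], "researcher"
    for _ in range(max_steps):
        path.append(node)
        if node == "writer":
            break
        if node == "researcher":
            state = researcher(state)
            node = should_continue(state)
        else:  # "tools" always routes back to the researcher
            state = tools(state)
            node = "researcher"
    return path


path = run({"messages": [{"role": "human", "content": "solid-state batteries"}]})
print(path)  # ['researcher', 'tools', 'researcher', 'writer']
```

The `max_steps` cap is an addition for the sketch: a cyclic graph needs some bound so a model that never stops calling tools cannot loop forever.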

What you get from this: fine-grained control over every routing decision, complete visibility into state at every step, and trivial output extraction (state["final_report"]). What it costs: you are building a graph. The mental model is powerful but requires internalising nodes, edges, reducers, and the distinction between cyclic and acyclic topologies before you can be productive.

Agentic patterns this enables: ReAct ^[5] (the researcher loop is a ReAct loop), Plan-and-Execute ^[6], ReWOO ^[7], Reflexion ^[8], DAG pipelines, human-in-the-loop via interrupt().

CrewAI: declarative roles

CrewAI inverts the mental model. Instead of defining a graph, you define agents with a role, goal, and backstory, and tasks with a description and expected output. Hand them to a Crew and it handles the orchestration. ^[2]

researcher = Agent(
    role="Senior Technology Researcher",
    goal=f"Conduct deep research on '{topic}' and compile key insights",
    backstory=(
        "You are a highly analytical research specialist. To stay within strict "
        "API rate limits, use the search_tool exactly ONCE with a broad query. "
        "Produce precise, structured notes."
    ),
    tools=[search_tool],
    llm=llm,
)

writer = Agent(
    role="Expert Technical Writer",
    goal=f"Synthesize research notes into a professional technical report on '{topic}'",
    backstory=(
        "You are a veteran technical publisher who specialises in explaining complex "
        "advancements in clean, structured Markdown."
    ),
    llm=llm,
)

Tasks are similarly declarative — each specifies what it needs and what it should produce:

research_task = Task(
    description=f"Research '{topic}' using web_search. Identify: core concept, benefits, key players, barriers.",
    expected_output="Detailed, structured notes listing research facts.",
    agent=researcher,
)

write_task = Task(
    description=f"Write a professional report on '{topic}' from the research notes.",
    expected_output="A structured technical report in Markdown format.",
    agent=writer,
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task],
    process=Process.sequential,
)
result = crew.kickoff()

Context passing between tasks is implicit — CrewAI injects the previous task's output as input to the next one. You never touch state directly.
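A rough way to picture that implicit hand-off is a loop that threads each task's output into the next prompt. This is an illustration only, not CrewAI's code: `run_sequential`, the prompt layout, and both stub tasks are hypothetical.

```python
from typing import Callable, List


def run_sequential(tasks: List[Callable[[str], str]], topic: str) -> str:
    """Run tasks in order, feeding each task the previous task's output."""
    context = ""
    for task in tasks:
        prompt = f"Topic: {topic}\n\nContext from previous task:\n{context}"
        context = task(prompt)
    return context


def research_stub(prompt: str) -> str:
    return "NOTES: stub finding A; stub finding B"


def write_stub(prompt: str) -> str:
    # The writer never asks for the notes; they arrive inside the prompt.
    notes = prompt.split("Context from previous task:\n", 1)[1]
    return f"# Report\n\n{notes}"


report = run_sequential([research_stub, write_stub], "solid-state batteries")
print(report)
```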

This produced the richest research notes of the three runs: 9,351 characters versus LangGraph's 5,573. The declarative backstory and role fields give the model stronger framing to work from, and the verbose CrewAI reasoning traces generate more intermediate content. The trade-off is that sequential execution is the easy path; anything more complex — conditional routing, cyclic reasoning — requires switching to Process.hierarchical and adding a manager agent.

Agentic patterns this enables: Hierarchical agent (add a manager with Process.hierarchical), role-based pipelines, multi-agent delegation.

AutoGen: conversation-based

AutoGen treats agent coordination as a conversation. Each agent is a participant; they exchange messages, call tools, and signal completion via a termination string. There is no graph, no task object — just agents talking. ^[3][9]

The pipeline runs in two separate phases. Phase 1 is the research conversation:

researcher = autogen.AssistantAgent(
    name="Researcher",
    system_message=(
        "You are a Senior Researcher. Use the web_search tool ONCE to gather facts. "
        "Compile structured research notes. When complete, end with TERMINATE."
    ),
    llm_config=llm_config,
)

user_proxy = autogen.UserProxyAgent(
    name="UserProxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=5,
    is_termination_msg=lambda x: "TERMINATE" in (x.get("content") or ""),  # content can be None on tool-call messages
    code_execution_config=False,
)

register_function(web_search, caller=researcher, executor=user_proxy, ...)
user_proxy.initiate_chat(researcher, message=f"Research '{topic}'...")

The UserProxyAgent is not a human — it is the tool executor. When the researcher emits a tool call, the proxy executes it and sends the result back as a message. The loop ends when the researcher appends TERMINATE.

Phase 2 repeats the pattern with a Writer agent and a fresh proxy, passing the extracted research notes as the opening message:

user_proxy2.initiate_chat(
    writer,
    message=f"Write a professional report on '{topic}' based on:\n\n{research_notes}"
)

AutoGen's conversation-native model is best suited to topologies where agents negotiate, debate, or vote — patterns where the back-and-forth is the mechanism, not just the means to an end.

Agentic patterns this enables: Peer-to-peer network, Consensus/Joint, Human-in-the-loop (set human_input_mode="ALWAYS" on the UserProxyAgent so a person supplies the replies).

Telemetry: what the runs produced

All three ran against the same topic on the same hardware with claude-sonnet-4-6. ^[12]

Framework	Status	Time (s)	Notes (chars)	Report (chars)
LangGraph	Success	129.64	5,573	18,178
CrewAI	Success	170.92	9,351	18,643
AutoGen	Success	76.19	0	322

Three things stand out.

LangGraph and CrewAI both produced full reports. The 41-second gap between them reflects CrewAI's more verbose internal reasoning — it generates more intermediate text per agent turn, which explains the longer notes. Both took the same 15-second rate-limit sleep between agent phases.

AutoGen was fastest and produced almost nothing. 76 seconds, 0 notes, a 322-character "report" that turned out to be the writer's input prompt echoed back. This is not an AutoGen model failure — the model ran successfully. It is a message extraction problem in the runner. AutoGen stores conversation history per agent pair (user_proxy.chat_messages[researcher]), and the termination-message stripping removed content that overlapped with the notes before extraction. The model did its job; the output pipeline had a subtle bug.

This is worth pausing on because it reveals a real architectural difference. LangGraph's typed state makes output extraction trivial — the final report is simply state["final_report"]. CrewAI exposes it via task.output.raw. With AutoGen, you parse message history, and subtle bugs in that parsing can silently produce nothing. The conversational flexibility that makes AutoGen powerful for multi-agent debate also makes structured output extraction more fragile.
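One defensive way to do that parsing is sketched below. It is not the repository's runner: `extract_final_text` is a hypothetical helper, and the message shape (dicts with `name` and `content` keys) is an assumption about the history format.

```python
from typing import Dict, List, Optional


def extract_final_text(history: List[Dict[str, Optional[str]]],
                       agent_name: str,
                       marker: str = "TERMINATE") -> str:
    """Return the last non-empty message from agent_name, minus the marker."""
    for message in reversed(history):
        if message.get("name") != agent_name:
            continue
        # Tool-call messages can carry content=None, so normalise first.
        text = (message.get("content") or "").replace(marker, "").strip()
        if text:
            return text
    raise ValueError(f"no usable message from {agent_name!r} in history")


history = [
    {"name": "UserProxy", "content": "Research 'solid-state batteries'..."},
    {"name": "Researcher", "content": None},            # tool call, no text
    {"name": "UserProxy", "content": "stub search result"},
    {"name": "Researcher", "content": "Notes: stub finding.\n\nTERMINATE"},
]
notes = extract_final_text(history, "Researcher")
print(notes)  # Notes: stub finding.
```

Raising when nothing usable is found is the design choice that matters here: an empty result becomes a visible failure instead of a zero-character notes field in the telemetry.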

A note on n8n

LangGraph is sometimes compared to n8n — a visual, platform-managed workflow tool that also supports AI agents. Both can orchestrate multi-step agentic workflows, but they operate at different layers. ^[4][11]

	LangGraph	n8n
Interface	Code-first (Python / TypeScript)	Visual canvas (drag-and-drop nodes)
State management	Typed central state with reducers	Step-by-step JSON payload passing
Cyclic loops	Native graph edges	Implicit via agent thought loops
Triggers	Manual / custom API server	Native webhooks, crons, event listeners
Human-in-the-loop	`interrupt()` + checkpointer	Built-in Wait and Form nodes
Integrations	Write your own tool wrappers	400+ out of the box (Slack, Jira, Notion, Postgres)
On-premises	✅ Any Python environment	✅ Self-hosted Docker

Use LangGraph when the logic is complex, cyclic, and needs to live inside your existing Python application. Use n8n when you need rapid API integration, non-developer maintainers, or built-in webhook triggers without writing ingestion routes.

For JigsawFlux use cases — clinics, crisis response, volunteer-run operations — either works on-premises. The deciding factor is usually who will maintain it: developers reach for LangGraph, operational staff reach for n8n.

Framework decision guide

Framework	Paradigm	Best for
LangGraph	Stateful graph	ReAct loops, DAG pipelines, human-in-the-loop, fine-grained state control
CrewAI	Declarative roles	Role-based teams, hierarchical topologies, rapid prototyping
AutoGen	Conversation-based	Peer-to-peer networks, consensus patterns, multi-agent debate

All three are open source, free to use, and run on hardware you already own. None require a cloud subscription to develop or deploy.

What's next

This comparison is the foundation for Part 2, which implements and benchmarks agentic patterns directly — using whichever framework is the natural fit for each:

Single-agent patterns

ReAct ^[5] — reason-act loops using LangGraph's cyclic edges
Plan-and-Execute ^[6] — separate planning phase from execution
ReWOO ^[7] — plan all tool calls upfront, then execute without intermediate observation
Reflexion ^[8] — self-critique and iterative self-improvement

Multi-agent topologies

Hierarchical — orchestrator delegates to specialised sub-agents (CrewAI)
DAG — directed pipeline with no feedback loops (LangGraph)
Peer-to-peer network — lateral agent communication without a central manager (AutoGen)
Consensus/Joint — multiple agents debate and converge on a shared answer (AutoGen)

The project source is on GitHub: github.com/JigsawFlux/comparing-agent-frameworks. ^[10]

References

Frameworks

[1] LangGraph — langchain-ai/langgraph, GitHub.

[2] CrewAI — crewAIInc/crewAI, GitHub.

[3] AutoGen — microsoft/autogen, GitHub.

[4] n8n — n8n.io, workflow automation platform.

Research papers

[5] Yao, S. et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.

[6] Wang, L. et al. (2023). Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models. arXiv:2305.04091.

[7] Xu, B. et al. (2023). ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models. arXiv:2305.18323.

[8] Shinn, N. et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366.

[9] Wu, Q. et al. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework. arXiv:2308.08155.

Project sources

[10] JigsawFlux. comparing-agent-frameworks — README.md, framework comparison matrix and architecture overview.

[11] JigsawFlux. comparing-agent-frameworks — langgraph_vs_n8n.md, architectural comparison of LangGraph and n8n.

[12] JigsawFlux. comparing-agent-frameworks — telemetry from python run.py --framework all --topic "solid-state batteries" using claude-sonnet-4-6 (2026-06-28). github.com/JigsawFlux/comparing-agent-frameworks.

This is a JigsawFlux project. JigsawFlux builds open-source tools for health tech, humanitarian response, and crisis management — tools designed to work on constrained budgets, unreliable infrastructure, and donated hardware. If you are working on something in this space, or want to contribute, the JigsawFlux GitHub organisation is where the work happens.

Agent Clinic: Human-in-the-Loop Medical Consultations with LangGraph and AWS Bedrock

2026-06-21T00:00:00.000Z

The name cuts two ways. It's a clinic — for patients. And the clinic runs on agents.

The problem this POC targets is specific: small and charity hospitals where doctor time is genuinely scarce and IT budgets are measured in hundreds of dollars, not thousands. A consultation isn't just a diagnosis — it's intake, medical history retrieval, triage sorting, prescription recording, pharmacy stock checking. The typical workflow hands all of that to a doctor anyway, because there's no other option. The result: a clinician spending 40% of their time on work that doesn't require clinical judgment.

The premise here is simple. AI handles everything that doesn't require a clinician. The doctor steps in exactly once — to read the AI-produced intake summary and give a diagnosis. That's it. The prescription agent takes over from there.

This is a JigsawFlux project. JigsawFlux builds open-source tools for health tech and other things that matter: tools that have to work in the real world, not the well-funded one. That means two hard constraints shaped every architecture decision here: cost and deployability. Viable on a shoestring budget. Runnable in places where "cloud-native" isn't an option, such as a clinic with a single server, unreliable internet, and an IT team of one.

Built on AWS Bedrock (Claude Haiku 4.5), LangGraph for orchestration, LangChain @tool wrappers for data access, and Streamlit for the UI. Total cost: < $0.01 per consultation. Deployable on a £25/month VPS or a clinic's own hardware, with the option to go fully on-premises as models improve.

Why Not Lambda + Bedrock?

The obvious AWS pattern for an LLM-powered app is Lambda + Bedrock: a Lambda function receives a request, calls the model, returns a response. It works well for single-turn, stateless interactions.

A medical consultation is not a single-turn interaction. It looks like this:

Patient submits symptoms        [seconds]
↓ intake agent runs             [~5 seconds]
↓ graph pauses — doctor queued  [minutes to hours]
↓ doctor submits diagnosis
↓ prescription agent runs       [~3 seconds]
↓ patient receives prescription

That middle gap — between intake completing and the doctor responding — can be minutes or hours. Lambda invocations are stateless. Every invocation discards all context. Modelling a pause-and-resume workflow with Lambda means rebuilding state management, custom polling, webhook handlers, and retry logic from scratch. The honest Lambda-native path for a stateful workflow like this is Lambda + Step Functions + API Gateway + DynamoDB — a real AWS stack with real AWS complexity and real AWS consulting costs.

But there's a second problem with Lambda for this use case: it is irrevocably cloud-only. A charity clinic in a low-connectivity region cannot run Lambda on their own hardware. If the internet goes down, consultations stop. Patient data has to leave the building to be processed. There is no on-premises option, no hybrid option, no path to data sovereignty.

LangGraph running on a single Python process is none of that. It runs on a clinic's existing server, a cheap VPS, a donated laptop. The only network call during a consultation is the Bedrock API invocation — a small JSON payload, only made when the model is actually thinking. Patient records stay local. Connectivity interruptions only affect the model call, not the workflow state. And in a future phase, Bedrock can be swapped for a locally-hosted model (Ollama running llama3.2 or similar), taking the cloud dependency to zero.

Concern	Lambda + Bedrock	LangGraph + Bedrock
State between steps	Application must manage	Built-in typed state (`TypedDict`)
Cyclic workflows	Manual loop logic	First-class graph edges
Human handoff	Custom webhook + polling	`interrupt()` primitive
Pause & resume	Rebuild from scratch	Checkpoint + resume built-in
Multi-agent routing	Custom routing code	Conditional edges
Deployment options	AWS cloud only	Any Python environment
On-premises / hybrid	❌ Not possible	✅ Native — runs on local hardware
Data leaves building	❌ Always (Lambda processes it)	Only model invocation payload
Infrastructure overhead	Lambda + Step Functions + APIGW + DDB	Single Python process

The Architecture

Three layers, each with a clear responsibility:

┌─────────────────────────────────────────────────┐
│  LangGraph                                      │  Orchestration
│  StateGraph · nodes · edges · interrupt()       │  (workflow logic, state, routing)
├─────────────────────────────────────────────────┤
│  LangChain AWS  (langchain-aws)                 │  Integration
│  ChatBedrock · @tool wrappers                   │  (model wrappers, tool schemas)
├─────────────────────────────────────────────────┤
│  AWS Bedrock  (Claude Haiku 4.5)                │  Intelligence
│  Foundation model · pay-per-token               │  (reasoning, generation)
└─────────────────────────────────────────────────┘

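The LangGraph StateGraph encodes the consultation as nodes (agents) and edges (routing logic). The intake agent routes either to the emergency protocol, which ends the run, or to doctor review. Doctor review routes either to the clarification agent, which loops back to doctor review, or to the prescription agent, which ends the run.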

Every node reads from and writes to a single typed state object that flows through the entire graph:

# graph.py
class ConsultationState(TypedDict):
    session_id: str
    patient_id: str
    symptoms: str
    patient_history: dict

    intake_summary: str
    is_emergency: bool
    triage_score: int    # 1=Routine, 2=Minor, 3=Moderate, 4=Severe, 5=Urgent non-emergency
    triage_reason: str   # one-sentence AI justification for the score

    doctor_clarification_req: str
    patient_clarification_ans: str

    doctor_notes: str
    prescription: str

    status: str          # intake | emergency | awaiting_doctor | clarifying | prescribing | complete
    messages: list       # rendered in patient chat UI

This typed contract is what makes the graph inspectable and debuggable. At any pause point you can call graph.get_state(config) and see exactly what every field holds.

Wiring the graph up is straightforward — build_graph() is the only assembly point:

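# graph.py
import os

from langchain_aws import ChatBedrock
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import END, StateGraph

MODEL_ID = os.getenv("BEDROCK_MODEL_ID", "us.anthropic.claude-haiku-4-5-20251001-v1:0")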

def _get_model():
    return ChatBedrock(
        model_id=MODEL_ID,
        model_kwargs={"temperature": 0.3},
        region_name=os.getenv("AWS_DEFAULT_REGION", "us-east-1"),
    )

def build_graph():
    workflow = StateGraph(ConsultationState)

    workflow.add_node("intake_agent",        intake_agent_node)
    workflow.add_node("emergency_protocol",  emergency_protocol_node)
    workflow.add_node("doctor_review",       doctor_review_node)
    workflow.add_node("clarification_agent", clarification_agent_node)
    workflow.add_node("prescription_agent",  prescription_agent_node)

    workflow.set_entry_point("intake_agent")

    workflow.add_conditional_edges(
        "intake_agent",
        should_escalate,
        {"emergency_protocol": "emergency_protocol", "doctor_review": "doctor_review"},
    )
    workflow.add_edge("emergency_protocol", END)

    workflow.add_conditional_edges(
        "doctor_review",
        should_clarify,
        {"clarification_agent": "clarification_agent", "prescription_agent": "prescription_agent"},
    )
    workflow.add_edge("clarification_agent", "doctor_review")
    workflow.add_edge("prescription_agent", END)

    checkpointer = MemorySaver()
    return workflow.compile(checkpointer=checkpointer)

The conditional edge functions are one-liners that read directly from state:

def should_escalate(state) -> Literal["emergency_protocol", "doctor_review"]:
    return "emergency_protocol" if state.get("is_emergency") else "doctor_review"

def should_clarify(state) -> Literal["clarification_agent", "prescription_agent"]:
    return "clarification_agent" if state.get("doctor_clarification_req") else "prescription_agent"
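Because both routers are pure functions of state, they can be checked with plain dicts and no graph at all. The table-driven check below is a sketch; the sample states and the clarifying question are made up for illustration.

```python
from typing import Literal


def should_escalate(state: dict) -> Literal["emergency_protocol", "doctor_review"]:
    return "emergency_protocol" if state.get("is_emergency") else "doctor_review"


def should_clarify(state: dict) -> Literal["clarification_agent", "prescription_agent"]:
    return "clarification_agent" if state.get("doctor_clarification_req") else "prescription_agent"


# Each case is (router, partial state, expected next node).
cases = [
    (should_escalate, {"is_emergency": True}, "emergency_protocol"),
    (should_escalate, {"is_emergency": False}, "doctor_review"),
    (should_escalate, {}, "doctor_review"),  # missing flag is treated as non-emergency
    (should_clarify, {"doctor_clarification_req": "Any rash?"}, "clarification_agent"),
    (should_clarify, {"doctor_clarification_req": ""}, "prescription_agent"),
]
results = [router(state) == expected for router, state, expected in cases]
print(all(results))  # True
```

The third case is worth noticing: with `state.get(...)`, a missing `is_emergency` flag routes to the doctor queue rather than raising, so the intake node has to set that field reliably.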

Human-in-the-Loop: How `interrupt()` Actually Works

The doctor_review node is not a model call. It's a pause point — here's the real code:

# graph.py
def doctor_review_node(state: ConsultationState) -> dict:
    context = {
        "session_id": state["session_id"],
        "patient_id": state["patient_id"],
        "intake_summary": state["intake_summary"],
        "patient_history": state["patient_history"],
    }
    if state.get("patient_clarification_ans"):
        context["clarification"] = {
            "question": state["doctor_clarification_req"],
            "answer": state["patient_clarification_ans"],
        }

    # Graph pauses here until Streamlit calls graph.invoke(Command(resume=doctor_input), config)
    doctor_input = interrupt(context)

    if doctor_input.get("action") == "clarify":
        clarification_req = doctor_input["question"]
        _update_consultation(
            state["session_id"],
            doctor_clarification_req=clarification_req,
            status="clarifying",
        )
        return {
            "doctor_clarification_req": clarification_req,
            "patient_clarification_ans": "",
            "status": "clarifying",
        }

    doctor_notes = doctor_input.get("notes", "")
    _update_consultation(
        state["session_id"],
        doctor_notes=doctor_notes,
        status="prescribing",
    )
    return {"doctor_notes": doctor_notes, "status": "prescribing"}

When interrupt(context) fires, LangGraph serialises the entire ConsultationState to the checkpoint store (MemorySaver in this POC; DynamoDB in production) and stops. The doctor's Streamlit tab polls the database for pending consultations and shows the intake summary.

When the doctor acts, Streamlit resumes the graph with a single call:

# app.py — doctor submits diagnosis
graph.invoke(
    Command(resume={"action": "diagnose", "notes": final_notes}),
    config,
)

# app.py — doctor asks patient a clarifying question
graph.invoke(
    Command(resume={"action": "clarify", "question": question.strip()}),
    config,
)

# app.py — patient answers the doctor's question
graph.invoke(Command(resume=answer.strip()), config)

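Each Command(resume=...) call resumes at the paused node rather than at the start of the graph: LangGraph re-runs that node from its first line, and interrupt() returns the resume value instead of pausing again. Nothing upstream is repeated, so there is no rebuilding state, no re-running the intake agent, and no webhooks. Because the node itself is re-entered, code placed before interrupt() should be free of side effects, as the context-building in doctor_review_node is.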

The Streamlit UI shares the in-process graph across all browser tabs via @st.cache_resource. This is what makes the patient → doctor state handoff work without any network calls:

# app.py
@st.cache_resource
def get_graph():
    return build_graph()  # single graph instance, shared across all sessions

MemorySaver lives inside that cached object. Patient tab writes state; doctor tab reads it — same process, same memory, zero coordination overhead.

The clarification loop uses the same interrupt() mechanism a second time. clarification_agent_node pauses the graph waiting for the patient's answer; when they reply, the graph resumes and routes back to doctor_review via a fixed edge. A cyclic human workflow that would need custom state machines, polling infrastructure, and webhook handlers in a Lambda-based system becomes three graph edges and two interrupt() calls.
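The pause-and-resume mechanics can be sketched in plain Python. This is a toy model of the idea, not LangGraph's implementation: the `Interrupt` exception, the `checkpoints` dict, and `run` are hypothetical stand-ins for `interrupt()`, the checkpoint store, and `graph.invoke`.

```python
from typing import Optional


class Interrupt(Exception):
    """Raised by a node to pause the run and hand a payload to a human."""
    def __init__(self, payload):
        super().__init__("paused")
        self.payload = payload


checkpoints = {}  # thread_id -> saved state (stand-in for a checkpoint store)


def doctor_review(state: dict, resume: Optional[dict] = None) -> dict:
    context = {"intake_summary": state["intake_summary"]}
    if resume is None:
        raise Interrupt(context)          # first pass: pause for the doctor
    return {"doctor_notes": resume["notes"], "status": "prescribing"}


def run(thread_id: str, state: Optional[dict] = None,
        resume: Optional[dict] = None) -> dict:
    """Start a run, or resume a paused one from its checkpoint."""
    if state is None:
        state = checkpoints[thread_id]    # resuming: reload the saved state
    try:
        update = doctor_review(state, resume)
    except Interrupt as pause:
        checkpoints[thread_id] = state    # persist everything, then stop
        return {"paused": True, "payload": pause.payload}
    checkpoints[thread_id] = {**state, **update}
    return checkpoints[thread_id]


first = run("session-1", {"intake_summary": "stub intake summary"})
final = run("session-1", resume={"notes": "stub diagnosis notes"})
print(first["paused"], final["status"])  # True prescribing
```

Between the two calls nothing is running; the only thing that survives is the checkpointed state keyed by the thread ID, which is why the gap can last minutes or hours.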

Demo Walkthrough

Here's a complete consultation using Patient P001 (Alice Thompson, headache and fever).

1. Login — role and PIN selection

The login screen is a placeholder for Amazon Cognito. The hardcoded PIN (1234 for patients, doctor for doctors) illustrates exactly where real authentication would plug in — the rest of the graph doesn't change.

2. Patient submits symptoms — AI intake runs

The patient types their symptoms and clicks Start Consultation. The intake agent fetches the patient's record and medical history from SQLite, searches the knowledge base, assigns a triage score (1–5), and hands off to the doctor queue. The entire intake takes a few seconds.

The graph is now paused at doctor_review. Nothing in the system runs until the doctor acts.

3. Doctor desktop — triage queue and response panel

In a separate browser tab the doctor logs in. The queue shows all pending consultations sorted by triage severity, with Alice's case among them.

The doctor sees the triage badge (🟡 Level 3 — Moderate), the AI-generated intake summary with clinical context drawn from Alice's history and allergies, and a response panel with two actions: Submit Diagnosis or Ask Patient.
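The queue ordering can be expressed as a single sort key. The sketch below is an assumption about how such a queue might be ordered, not the app's code: `PendingConsultation` and `sort_queue` are hypothetical, and breaking ties by waiting time is an added choice.

```python
from dataclasses import dataclass


@dataclass
class PendingConsultation:
    session_id: str
    triage_score: int   # 1 = Routine ... 5 = Urgent non-emergency
    created_at: float   # submission time, seconds since epoch


def sort_queue(pending: list) -> list:
    """Highest triage score first; within a score, longest-waiting first."""
    return sorted(pending, key=lambda c: (-c.triage_score, c.created_at))


queue = sort_queue([
    PendingConsultation("s1", 2, 100.0),
    PendingConsultation("s2", 3, 300.0),
    PendingConsultation("s3", 3, 200.0),
])
print([c.session_id for c in queue])  # ['s3', 's2', 's1']
```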

4. Patient receives the prescription

After the doctor submits, the prescription agent checks pharmacy inventory (Amoxicillin is seeded as out of stock, so the agent flags this and suggests an alternative) and then records the prescription. The patient's view then updates to show it.

Try these scenarios yourself:

Scenario	How to trigger
Emergency path	Report chest pain radiating to the left arm — red banner, no queue entry
Clarification loop	Doctor uses "Ask Patient" before diagnosing — patient answers, routes back
Out-of-stock pharmacy	Prescription agent flags Amoxicillin and suggests an alternative
Multilingual intake	Submit symptoms in French or Spanish — intake responds in kind; doctor always gets English

The Tool Layer: `@tool` Now, MCP Later

The intake and prescription agents access data through LangChain @tool-decorated functions:

@tool
def get_patient_record(patient_id: str) -> dict: ...

@tool
def get_medical_history(patient_id: str) -> list[dict]: ...

@tool
def search_knowledge_base(query: str) -> list[dict]: ...

@tool
def record_prescription(session_id: str, prescription: str) -> str: ...

The intake tools are bound to the Haiku model in intake_agent_node, and the agent runs a standard tool-calling loop: invoke, check for tool calls, execute them, feed results back, and repeat until the model produces a final JSON response:

# graph.py — intake_agent_node (core loop)
def intake_agent_node(state: ConsultationState) -> dict:
    llm = _get_model()
    intake_tools = [get_patient_record, get_medical_history, search_knowledge_base]
    llm_with_tools = llm.bind_tools(intake_tools)
    tool_map = {t.name: t for t in intake_tools}

    messages = [
        SystemMessage(content=INTAKE_SYSTEM),
        HumanMessage(content=f"Patient ID: {state['patient_id']}\nSymptoms: {state['symptoms']}"),
    ]

    for _ in range(6):  # cap tool-calling iterations
        response = llm_with_tools.invoke(messages)
        messages.append(response)

        if not getattr(response, "tool_calls", None):
            break  # model produced final answer — no more tool calls

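        for tc in response.tool_calls:
            fn = tool_map.get(tc["name"])
            # The model API expects one tool result per tool call, so an
            # unknown tool name gets an error result instead of being skipped.
            result = fn.invoke(tc["args"]) if fn else {"error": f"unknown tool: {tc['name']}"}
            messages.append(ToolMessage(content=json.dumps(result, default=str), tool_call_id=tc["id"]))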

    # parse the structured JSON response and return updated state fields ...

The model is instructed via INTAKE_SYSTEM to respond only with a JSON object containing intake_summary, is_emergency, triage_score, triage_reason, and a patient_message in the patient's detected language. The prescription agent follows the same loop pattern, using check_pharmacy_inventory and record_prescription as its tools.
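The parsing step elided in the loop above could look something like the sketch below. It is not the project's code: `parse_intake_response` is a hypothetical helper, and rejecting malformed output with an exception (rather than guessing defaults) is an assumption about the desired behaviour.

```python
import json

REQUIRED_FIELDS = ("intake_summary", "is_emergency", "triage_score",
                   "triage_reason", "patient_message")


def parse_intake_response(raw: str) -> dict:
    """Parse the model's final message into validated fields."""
    # Tolerate prose around the object by slicing to the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model response")
    data = json.loads(raw[start:end + 1])

    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not isinstance(data["is_emergency"], bool):
        raise ValueError("is_emergency must be a JSON boolean")

    score = int(data["triage_score"])
    if not 1 <= score <= 5:
        raise ValueError(f"triage_score out of range: {score}")

    return {
        "intake_summary": str(data["intake_summary"]),
        "is_emergency": data["is_emergency"],
        "triage_score": score,
        "triage_reason": str(data["triage_reason"]),
        "patient_message": str(data["patient_message"]),
    }


raw = ('Here is the result: {"intake_summary": "stub summary", '
       '"is_emergency": false, "triage_score": 3, '
       '"triage_reason": "stub reason", "patient_message": "stub message"}')
fields = parse_intake_response(raw)
print(fields["triage_score"], fields["is_emergency"])  # 3 False
```

Checking that `is_emergency` is a real boolean matters because that field drives the emergency routing: the string "false" would otherwise be truthy.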

For a POC this is the right choice: zero infrastructure, SQLite queries are synchronous, and the entire stack is Python.

In production, three problems make this untenable:

Security — the agent has unrestricted database access. There is no way to enforce that the intake agent can only read records for the current patient.
Reuse — a second agent type (specialist referral, pharmacy checker) would need to duplicate or tightly share these functions.
Auditability — no independent record of which agent accessed which data, a compliance requirement in clinical settings.

Model Context Protocol (MCP) solves all three. Each tool becomes a standalone server process with its own IAM role, access policy, and CloudWatch audit log:

	`@tool` wrappers (POC)	MCP Servers (Production)
Setup time	Minutes	Days
Security boundary	None (same process)	IAM role per server
Audit trail	LangGraph traces only	CloudWatch per call
Reusability	Duplicated per agent	Shared across all agents
Cost	$0	~$2–5/month (Lambda)
Graph changes required	None	None

The last row is the key. The LangGraph node functions in graph.py don't change — they still call get_patient_record() and search_knowledge_base(). Only the implementations in tools.py swap from direct SQLite calls to MCP client calls. The tool signatures are the interface contract; the interface is already final.

AWS Verified Permissions (Cedar policies) in Phase 2 adds the access enforcement: "intake_agent may call get_patient_record only for the patient_id present in the current session."
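The logic of that rule, together with the audit requirement, can be sketched in a few lines. This is neither Cedar nor the Verified Permissions API, and an in-process wrapper is not a security boundary; `scoped_to_session`, `audit_log`, and the stub tool are hypothetical and only illustrate the decision being enforced.

```python
from typing import Callable, Dict, List

audit_log: List[Dict[str, str]] = []   # stand-in for an external audit trail


def scoped_to_session(tool_fn: Callable[[str], dict], agent: str,
                      session_patient_id: str) -> Callable[[str], dict]:
    """Wrap a patient-scoped tool so it only serves the session's patient."""
    def guarded(patient_id: str) -> dict:
        allowed = patient_id == session_patient_id
        audit_log.append({"agent": agent, "tool": tool_fn.__name__,
                          "patient_id": patient_id,
                          "decision": "allow" if allowed else "deny"})
        if not allowed:
            raise PermissionError(f"{agent} may not read {patient_id}")
        return tool_fn(patient_id)
    return guarded


def get_patient_record(patient_id: str) -> dict:
    return {"patient_id": patient_id, "name": "stub record"}


lookup = scoped_to_session(get_patient_record, "intake_agent", "P001")
print(lookup("P001")["patient_id"])  # P001
try:
    lookup("P002")
except PermissionError as err:
    print("denied:", err)
print([entry["decision"] for entry in audit_log])  # ['allow', 'deny']
```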

Cost

For a small or charity hospital, cost isn't an optimisation — it's a constraint. A £2,000/month telemedicine SaaS subscription is simply not available. £20/month might be.

Item	POC (local)	Production (50 consults/day)	Lambda-equivalent stack
Claude Haiku 4.5	~$0.001/consult	~$1.50/month	~$1.50/month (same)
Compute	$0 (developer laptop)	~$10–15/month (t3.micro or VPS)	~$5–10/month (Lambda)
State / DB	$0 (SQLite)	~$1–2/month (DynamoDB on-demand)	~$5–10/month (DDB + Step Functions)
API / orchestration	$0	$0 (LangGraph in-process)	~$10–15/month (API Gateway + Step Functions)
Total	$0	< $20/month	~$25–40/month

LangGraph on a cheap VM is modestly cheaper than the Lambda-native equivalent — but the bigger difference is operational complexity and the option to run on hardware you already own. A clinic that already has a server pays only for Bedrock tokens.
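The totals in the table can be reproduced with a few lines of arithmetic. The sketch below copies the per-item ranges from the table and assumes a 30-day month; the variable and function names are illustrative.

```python
CONSULTS_PER_DAY = 50
DAYS_PER_MONTH = 30              # assumption for the monthly estimate
MODEL_COST_PER_CONSULT = 0.001   # USD, from the table above

model_monthly = CONSULTS_PER_DAY * DAYS_PER_MONTH * MODEL_COST_PER_CONSULT

# (low, high) monthly estimates in USD, copied from the table above.
langgraph_stack = {"compute": (10, 15), "state_db": (1, 2), "orchestration": (0, 0)}
lambda_stack = {"compute": (5, 10), "state_db": (5, 10), "orchestration": (10, 15)}


def monthly_range(stack: dict, model_cost: float) -> tuple:
    """Sum the low and high ends of each line item, plus the model cost."""
    low = model_cost + sum(item[0] for item in stack.values())
    high = model_cost + sum(item[1] for item in stack.values())
    return round(low, 2), round(high, 2)


print(round(model_monthly, 2))                        # 1.5
print(monthly_range(langgraph_stack, model_monthly))  # (12.5, 18.5)
print(monthly_range(lambda_stack, model_monthly))     # (21.5, 36.5)
```

The LangGraph range sits under the $20/month figure, and the Lambda range lands close to the table's rounded ~$25–40.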

Hybrid deployment options:

Deployment	Compute cost	Cloud dependency	Data stays local?
Developer laptop (POC)	$0	Bedrock API only	✅ Yes
£25/month VPS	~£25/month	Bedrock API only	✅ Yes
Clinic's own server	$0 (sunk cost)	Bedrock API only	✅ Yes
Fully on-premises (future)	$0	None (local model)	✅ Yes
Lambda + Step Functions	Per-invocation	AWS cloud required	❌ No

The "Bedrock API only" entries are worth emphasising. During a consultation, the only data that leaves the clinic's network is the prompt sent to the model: the patient's symptoms and the anonymised intake context. The patient database, medical history, and prescription records never leave the machine. That matters for GDPR compliance and for clinics operating in jurisdictions with strict patient data rules.

Two model decisions keep the token cost low:

Haiku 4.5, not Sonnet 4. Haiku is ~8× cheaper per token. In this system, the doctor is always the final clinical authority — the triage score is a queue-sorting mechanism, not a clinical decision. The AI's job is to produce a useful intake summary, not to diagnose. Haiku 4.5 does that reliably, including structured JSON output (triage_score, intake_summary) and native multilingual support for non-English-speaking patients at no extra cost.

Cross-region inference profile. One gotcha worth flagging: Claude Haiku 4.5 requires a cross-region inference profile, not a direct model ID. Invoking the bare model ID returns a ValidationException. The default in graph.py is already the correct form:

# graph.py
import os

MODEL_ID = os.getenv("BEDROCK_MODEL_ID", "us.anthropic.claude-haiku-4-5-20251001-v1:0")
#                                         ^^^
#                                         cross-region prefix, required for on-demand throughput

If you override BEDROCK_MODEL_ID with a bare model ID (e.g. anthropic.claude-haiku-4-5-20251001-v1:0), switch it back to the us.-prefixed version.
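One way to guard against that mistake is to normalise the value at startup. The helper below is a hypothetical addition, not something in graph.py; it only assumes that bare IDs start with "anthropic." and that the desired profile prefix is "us".

```python
def ensure_inference_profile(model_id: str, region_prefix: str = "us") -> str:
    """Return model_id with a cross-region prefix, adding one if it is missing.

    A bare ID such as "anthropic.claude-..." becomes "us.anthropic.claude-...".
    IDs that already carry a prefix are returned unchanged.
    """
    if model_id.startswith("anthropic."):
        return f"{region_prefix}.{model_id}"
    return model_id
```

Wrapping the os.getenv() result in this function would turn a ValidationException at call time into a silent correction at load time. Whether that is preferable to failing loudly is a judgment call.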

POC → Production Roadmap

The architectural bet at the centre of this design: graph.py — the nodes, edges, conditional routing, and interrupt() calls — never changes across any production phase. Infrastructure evolves; the workflow doesn't.

Component	POC	Phase 1	Phase 2	Phase 3	Phase 4	Phase 5
LLM	Haiku (Bedrock)	← same	← same	← same	← same	← same
Graph topology	5 nodes, 2 interrupts	← same	← same	← same	← same	← same
Checkpointing	MemorySaver	DynamoDB	← same	← same	← same	+ PITR
Patient DB	SQLite	RDS PostgreSQL	← same	+ encryption	← same	+ Multi-AZ
Tool access	`@tool` → SQLite	`@tool` → RDS	MCP servers	+ Cedar policies	← same	+ audit logs
Auth	Hardcoded PIN	← same	← same	Amazon Cognito	← same	← same
Frontend	Streamlit (local)	Streamlit (App Runner)	← same	React / Next.js	+ WebSocket	← same
Infra cost/month	$0	~$25	~$30	~$50	~$55	~$80–120

Phase 2 is the architectural pivot. Everything before it is scaffolding. Everything after it scales. The move from @tool wrappers to MCP servers is the only change that touches security, reusability, and auditability simultaneously — and it does so without touching the graph.

Beyond the core phases, the design also supports additive extensions for low-resource environments: a WhatsApp/SMS patient interface via Twilio (the graph's interrupt() is channel-agnostic — state simply sits in DynamoDB until the next SMS arrives), voice note intake via AWS Transcribe or self-hosted Whisper, and native multilingual support (already live in the POC — Haiku 4.5 detects the patient's language and responds in kind while always producing the doctor summary in English).

What's Next

The full source — graph.py, tools.py, app.py, seed_db.py, and all architecture documentation — is on GitHub: github.com/JigsawFlux/agentic-clinic.

To run it locally:

git clone https://github.com/JigsawFlux/agentic-clinic
cd agentic-clinic
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python seed_db.py
streamlit run app.py

You'll need AWS credentials and Bedrock model access for us.anthropic.claude-haiku-4-5-20251001-v1:0 in us-east-1. The README has the full setup steps, including the Bedrock console access request and a troubleshooting section for the two most common errors, ValidationException and ResourceNotFoundException.

The four scenarios in the demo section are good starting points — emergency path, clarification loop, out-of-stock pharmacy substitution, and multilingual intake. Each exercises a different branch of the graph.

This is a JigsawFlux project. JigsawFlux builds open-source tools for health tech, humanitarian response, and crisis management — tools designed to work on constrained budgets, unreliable infrastructure, and donated hardware. If you're working on something in this space, or you want to contribute to this project, the JigsawFlux GitHub organisation is where the work happens.

Stop Burning AI Credits: A Framework for Right-Sizing Model Usage

2026-06-11T00:00:00.000Z

Four months into 2026, Uber's AI budget for the year was already gone — thousands of engineers with un-gated access to Claude Code, bills reportedly running $500–$2,000 per person per month, and leadership asking very loud questions about what exactly all those tokens were buying. Around the same time, an unnamed "mystery company" reportedly burned $500 million on Claude credits in a single month — not from a runaway model or a billing bug, but because nobody had thought to put a usage cap on employee licences.

Neither story is about bad engineering. Both are about one broken default: when developers get unrestricted access to frontier models, they use frontier models for everything.

I've been working through a framework to fix this — not by restricting access, but by routing the right task to the right model. The goal is frictionless development that doesn't quietly drain your budget. Here's how it works.

The shift that changed everything

Until early 2026, enterprise AI tooling largely ran on flat-rate subscriptions. You paid a seat fee, your team used whatever they needed, and costs were predictable. That model is gone. GitHub Copilot retired flat-rate allowances in favour of token-metered "AI Credits" on June 1, 2026. Every major provider has followed the same trajectory.

The problem isn't the pricing model. It's that usage habits haven't caught up. Complex agentic tasks and heavy reasoning models consume tokens far faster than simple completions. A developer running Claude Opus to autocomplete a for-loop is swinging a sledgehammer at a drawing pin. The pin goes in either way; the difference is what the swing cost.

That's the sledgehammer problem, and it's the single largest source of AI budget waste: identical output, ten times the price.

Tier your workload

The fix is simpler than it sounds: match the model to the complexity of the task. Not every request needs deep reasoning. Not every request needs a lightweight model either. Three tiers cover almost everything.

Task Complexity	Typical Use Cases	Recommended Tier	Cost Profile
Tier 1 — Simple & Deterministic	Inline completion, boilerplate, unit tests, regex, Dockerfiles	Efficient models (Haiku, GPT-4o-mini, Gemini Flash)	📉 Lowest
Tier 2 — Moderate & Generative	Component logic, API endpoints, Mermaid diagram generation	Balanced models (Sonnet, Gemini Pro)	⚖️ Medium
Tier 3 — Complex Reasoning	Architecture design, C++ memory debugging, large repo refactoring	Frontier models (Opus, GPT-5, Gemini Ultra)	📈 Highest

The goal isn't to ban frontier models — it's to use them where they actually earn their keep. A well-structured RAG pipeline or a cross-cutting refactor across a million-line codebase? That's a Tier 3 problem. A standard API endpoint in Go? It isn't.
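In code, the tiering table reduces to a lookup with a sensible default. The task-type labels below are hypothetical names I made up for this sketch; a real router would classify requests by whatever signal it has available.

```python
from enum import Enum


class Tier(Enum):
    EFFICIENT = 1   # Tier 1: simple and deterministic
    BALANCED = 2    # Tier 2: moderate and generative
    FRONTIER = 3    # Tier 3: complex reasoning


# Hypothetical task labels mapped to the tiers from the table above.
TASK_TIERS = {
    "inline_completion": Tier.EFFICIENT,
    "unit_test": Tier.EFFICIENT,
    "dockerfile": Tier.EFFICIENT,
    "api_endpoint": Tier.BALANCED,
    "diagram_generation": Tier.BALANCED,
    "architecture_design": Tier.FRONTIER,
    "large_refactor": Tier.FRONTIER,
}


def pick_tier(task_type: str, default: Tier = Tier.BALANCED) -> Tier:
    """Return the tier for a task type, falling back to the middle tier."""
    return TASK_TIERS.get(task_type, default)
```

Defaulting unknown work to the middle tier is a design choice: it avoids both silent frontier spend and under-powered answers for tasks nobody has classified yet.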

Two traffic lanes: IDE vs programmatic

Tiering the workload is half the answer. The other half is routing enforcement — making sure models are actually selected based on task complexity rather than developer habit. The architecture I recommend splits AI traffic into two distinct lanes, each governed differently.

Lane A: IDE traffic via GitHub Copilot

Everyday inline coding and IDE chat stays within the GitHub Enterprise account. Governance happens through native Copilot policies: seat-based licensing, Targeted Model Rules, and usage caps. The key move is enforcing "Auto" mode as the default for standard users — it selects an appropriate model rather than defaulting to the most expensive one. Senior architects and principal engineers can be granted Tier 3 access for the work that justifies it.

Lane B: Programmatic traffic via Azure APIM

Custom scripts, CI/CD pipelines, and internal tools route through a single Azure API Management gateway. Rather than each team managing its own provider keys and burning Tier 3 credits by default, everything flows through a central control point that provides:

Identity-based access control via Entra ID
Semantic caching: identical or near-identical prompts return cached responses at zero token cost
Intelligent routing across Azure AI Foundry and AWS Bedrock
Dollar-based budgets enforced per team or pipeline

The caching point deserves emphasis. An automated pipeline making 50 near-identical classification requests doesn't need to pay for 50 model calls. It pays for one, caches the response, and the remaining 49 return instantly — the most straightforward cost reduction in the entire framework.
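The mechanics are easy to sketch. The version below is deliberately simplified: it matches prompts exactly after normalising case and whitespace, whereas a real semantic cache compares embeddings against a similarity threshold. PromptCache and call_model are hypothetical names for this example.

```python
import hashlib
from typing import Callable, Dict


class PromptCache:
    """Exact-match cache on a normalised prompt (a stand-in for semantic caching)."""

    def __init__(self, call_model: Callable[[str], str]) -> None:
        self._call_model = call_model      # the expensive, token-metered call
        self._store: Dict[str, str] = {}
        self.model_calls = 0               # how many requests actually hit the model

    @staticmethod
    def _key(prompt: str) -> str:
        # Collapse whitespace and lowercase so trivial variations share a key.
        normalised = " ".join(prompt.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def complete(self, prompt: str) -> str:
        key = self._key(prompt)
        if key not in self._store:
            self.model_calls += 1
            self._store[key] = self._call_model(prompt)
        return self._store[key]
```

With fifty requests that differ only in spacing or capitalisation, model_calls ends at one and the other forty-nine are served from the dictionary.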

Architecture

What this looks like day-to-day

Three scenarios that show the framework working in practice:

The developer writing .NET boilerplate in VS Code opens a file and starts typing. Copilot's Auto mode kicks in with a Tier 1 model — fast, cheap, accurate for the task. The developer never thinks about model selection, and the team's AI Credits aren't quietly being drained by Opus completions for standard controller logic.

The architect testing a RAG pipeline writes a Python script to benchmark embedding strategies. Instead of managing three different provider API keys, they authenticate once via Entra ID and send all calls through the APIM gateway. The gateway checks their team budget, routes to the appropriate tier, and logs everything. If their budget is close to the limit, a threshold alert fires — not a surprise invoice.

The data pipeline categorising code risk across 50 repositories runs nightly. The first repository's analysis is processed and cached. Repositories two through fifty hit the semantic cache and return at zero token cost. A task that would otherwise consume 50× the tokens costs the same as one.
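The budget check in the second scenario can be sketched in a few lines. TeamBudget and the 80% alert threshold are assumptions for illustration; the real gateway enforces this through policy configuration rather than application code.

```python
from dataclasses import dataclass


@dataclass
class TeamBudget:
    """Track spend against a dollar limit and flag when a threshold is crossed."""
    limit_usd: float
    alert_fraction: float = 0.8   # assumed threshold: alert at 80% of the limit
    spent_usd: float = 0.0

    def record(self, cost_usd: float) -> str:
        """Register a request's cost and return 'ok', 'alert', or 'blocked'."""
        if self.spent_usd + cost_usd > self.limit_usd:
            # Over budget: refuse the request and leave the spend unchanged.
            return "blocked"
        self.spent_usd += cost_usd
        if self.spent_usd >= self.alert_fraction * self.limit_usd:
            return "alert"
        return "ok"
```

The alert state is what replaces the surprise invoice: the team hears about the spend while there is still budget left to act on.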

Rolling this out

Three phases, and the order is deliberate:

Gateway first. Stand up Azure APIM and migrate all programmatic API calls to the unified endpoint. Establish identity-based routing and per-team budgets before anything else. You can't govern what you can't see.
IDE governance second. Once programmatic traffic is visible and controlled, implement GitHub Enterprise Targeted Model Rules. Restrict Tier 3 access to roles that genuinely need it; set Auto as the default for everyone else.
Team education last. Infrastructure changes enforce behaviour. Education reinforces it. Once the system is in place, rolling out guidelines on prompt scoping, context compression, and model selection across VS Code, Cursor, and Antigravity lands on a foundation that already supports the habits you're trying to build.

The temptation is to start with education — it feels fast and low-risk. But guidelines without enforcement fade. Get the infrastructure right first.

The non-profit dimension

Everything above assumes an enterprise with a budget to optimise. But the same billing shock hits harder when you're running on grants, donations, or a volunteer-funded open-source project. Non-profits face a specific version of this problem: the same productivity pressure, the same tooling expectations from technical staff, and far less capacity to absorb a surprise $2,000-per-engineer monthly bill.

The tiering framework still applies — but for cost-constrained organisations, there are additional levers worth considering beyond just routing between model tiers.

Local inference removes marginal token cost entirely. Tools like Ollama let you run open models (Llama, Mistral, Phi, Gemma) on local hardware or a small cloud VM — I've written about exactly this setup before, running Ollama on a Kubernetes home lab. The per-query cost drops to near zero; you trade API fees for infrastructure and a ceiling on model capability. For Tier 1 tasks — boilerplate, unit tests, simple completions — a locally hosted 7B or 13B model is often indistinguishable in output quality from a cloud API call, at a fraction of the ongoing cost.

Local hosting earns its keep in two situations beyond pure cost. The first is edge deployment: when inference needs to run on devices in the field — clinics with unreliable connectivity, crisis-response hardware, remote sensors — a cloud API isn't slow, it's unavailable. The second is subtler: for high-frequency, low-complexity workloads, the network round trip itself becomes a cost. Every cloud call pays latency and egress on top of the token price. A local model answering in tens of milliseconds, with no per-call fee, beats a cloud frontier model answering the same mundane question in two seconds — both on responsiveness and on the invoice.

Corporate cloud hosting sits between local inference and direct API access. Running models through AWS Bedrock or Azure AI Foundry — rather than calling Anthropic or Google directly — typically costs 20–40% less per token under enterprise agreements. More importantly for non-profits, it keeps data within a known compliance boundary, avoids egress fees from mixed-cloud setups, and makes budget governance easier through existing cloud billing tools. If your organisation is already on Azure or AWS, routing AI workloads through Foundry or Bedrock is often the fastest path to meaningful cost reduction without changing tooling.

The decision tree for cost-constrained teams:

Can a local model (Ollama + Llama/Mistral) handle this task acceptably? Use it.
Is your org already on Azure or AWS? Route through Foundry or Bedrock first.
Does the task genuinely need a frontier capability? Then pay for the direct API — but only then.

The same tiering logic, extended one level further down.
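Written as a function, the three questions are evaluated in order and the first match wins. The route names and the final fallback are my own labels; the list above does not say what to do when none of the three conditions holds, so the sketch returns a marker rather than guessing.

```python
def choose_route(local_model_ok: bool, on_azure_or_aws: bool, needs_frontier: bool) -> str:
    """Apply the decision tree in order and return the first matching route."""
    if local_model_ok:
        return "local"          # Ollama with an open model
    if on_azure_or_aws:
        return "cloud-hosted"   # Foundry or Bedrock
    if needs_frontier:
        return "direct-api"     # pay for the frontier model, but only here
    # Assumption: no rule matched, so the task needs to be re-scoped by a human.
    return "unresolved"
```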

References

Visual Studio Magazine, "Copilot Billing Shock Hits Developers" (June 4, 2026)
InfoWorld, "GitHub shifts Copilot to usage-based billing" (April 28, 2026)
Forbes, "Uber Burns Its 2026 AI Budget in Four Months on Claude Code" (May 17, 2026)
AI Magazine, "Why Uber Has Already Burned Through Its AI Budget"
Inc. Magazine, "1 Company Spent Half a Billion Dollars on Claude in a Single Month" (June 5, 2026)
Tom's Hardware, "Mystery company accidentally blew $500 million on Claude AI in a single month" (May 29, 2026)

Appendix: How this post was written

This post was drafted and iterated entirely using Claude Code — which makes it a small live example of the tiering approach it describes. Three sessions, three different kinds of work, with measurably different token costs for each.

Session 1 — structural work (Sonnet + Haiku subagent)

The first session covered the heavy lifting: converting a raw strategy document into a structured blog post, rewriting from third-person enterprise-doc tone to first-person narrative, and setting up the publishing infrastructure (frontmatter, truncate markers, draft log).

Step	Model	Use case	Cost
Iter 0	claude-sonnet-4-6	Frontmatter, CLAUDE.md setup, draft log scaffolding	—
Iter 1	claude-sonnet-4-6	Full structural rewrite — enterprise doc → first-person blog	—
Subagent	claude-haiku-4-5	Read-only codebase lookup (existing blog tone analysis)	$0.02
Session 1 total		1.5k input · 18k output · 1.6m cache read	$1.30

Session 2 — prose and new sections (Sonnet)

The second session expanded the post: narrative restructuring, the non-profit section, and the first version of this appendix. Same model as Session 1, but cheaper — targeted edits on a stable structure produce far fewer output tokens than a full rewrite.

Step	Model	Use case	Cost
Iter 2–3	claude-sonnet-4-6	Prose polish, non-profit section, appendix with real cost data	—
Session 2 total		0.7k input · 11.1k output · 1.0m cache read	~$0.56

Session 3 — narrative pass (Fable 5)

The final session switched to claude-fable-5, Anthropic's narrative-optimised model, for a light editorial pass: metaphor consistency, sentence rhythm, and voice. The smallest change set of the three sessions — and, unexpectedly, the most expensive.

Step	Model	Use case	Cost
Iter 4	claude-fable-5	Editorial polish — metaphor consistency, rhythm, voice	$2.25
Session 3 total		645 input · 5.1k output	$2.25

Look at the numbers side by side. Sonnet generated 31.4k output tokens across two sessions of structural and prose work for $2.01. Fable generated 5.1k tokens — a sixth of the volume — for $2.25. The lightest session cost the most, because premium per-token pricing dominated token count entirely.

That accidental result is a cleaner demonstration of this post's thesis than anything I planned: model choice drives cost more than workload size does. Whether the premium model was worth it for an editorial pass is exactly the kind of question the tiering framework exists to force. Total across all three sessions: $4.28.
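The gap is easier to see as a unit price. The sketch below divides each model's total cost by its output tokens, using the figures quoted above. That is a simplification: it attributes the whole bill to output tokens and ignores input and cache-read charges, so treat the result as an order-of-magnitude comparison.

```python
def cost_per_1k_output(total_cost_usd: float, output_tokens: int) -> float:
    """Effective dollars per 1,000 output tokens (ignores input and cache costs)."""
    return total_cost_usd / output_tokens * 1000


sonnet = cost_per_1k_output(2.01, 31_400)   # Sessions 1 and 2 combined
fable = cost_per_1k_output(2.25, 5_100)     # Session 3
ratio = fable / sonnet

print(f"Sonnet: ${sonnet:.3f} per 1k output tokens")
print(f"Fable:  ${fable:.3f} per 1k output tokens")
print(f"Fable was roughly {ratio:.1f}x more expensive per output token")
```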

Full token breakdown: _DRAFT_LOG.md in this post's directory.

Running a Local LLM on Kubernetes — A Home Lab Setup

2026-06-01T00:00:00.000Z

In Part 1 I ran Ollama directly on a Linux machine and wired it up through an MCP layer to a small web app. It worked. But bare-metal has friction — if the process crashes, it stays down. Adding Open-WebUI means managing another process. Resource limits are manual. There's no clean internal networking between services.

This post moves the whole thing into Kubernetes. The goal isn't enterprise-grade infrastructure — it's a home lab setup that's reliable, easy to extend, and honest about its limitations.

Manifests are in the ollama-mcp-starter repo under backend/k8s-deployment/.

The Hardware

The server is an Intel NUC Hades Canyon (NUC8i7HVK) — a small-form-factor machine with a skull logo on the lid and a surprisingly capable spec for its size:

Component	Detail
CPU	Intel Core i7-8809G, 4C/8T, 3.1 GHz base / 4.2 GHz turbo
RAM	32 GB DDR4
Storage	NVMe SSD (M.2 PCIe)
GPU	AMD Radeon RX Vega M GH — 4 GB HBM2
Network	2× Intel Gigabit LAN
Power	230W external adapter

It draws modest power for a home server, runs quietly, and fits on a shelf. These are the things that matter when it's on 24/7.

The GPU caveat

The Vega M GH is a capable GPU for graphics workloads, but the GPU acceleration path used in this setup relies on CUDA, an NVIDIA-only technology. Ollama does support AMD GPUs via ROCm, but that requires manual configuration and is not handled by the NVIDIA GPU Operator that MicroK8s installs.

In practice: Ollama on this box runs CPU-only. The i7-8809G handles inference well enough for personal and home lab use. Expect roughly 10–15 tokens/second with llama3.1:8b. More on this in Lessons Learned.

Why Move to Kubernetes?

Running Ollama on bare metal works, but once you add a second service (Open-WebUI), you're managing two processes manually. A third service and it becomes unwieldy. Kubernetes solves this cleanly:

Auto-restart — pods restart automatically on crash; no babysitting processes
Resource limits — cap CPU and memory per service so one greedy process can't starve the others
Persistent storage — PersistentVolumeClaims keep model data across pod restarts
Internal DNS — services talk to each other by name (ollama-service.ollama.svc.cluster.local) rather than fragile IP addresses
Ingress — one load balancer IP, hostname-based routing to multiple services

For a single-node home lab, MicroK8s is the right choice. It installs as a snap package, ships with the add-ons you need, and doesn't require a multi-node cluster to be useful.

Architecture Overview

Before diving into setup, here's the logical view of the full system:

Two access paths to Ollama:

MCP path (from Part 1) — Browser → Static Frontend → Node.js Backend → MCP Server → ollama.local
Direct path — Browser → ai.local → Open-WebUI → Ollama (internal ClusterIP)

Both paths go through MetalLB and Nginx on the NUC. The MCP app and frontend still run on the dev machine; only the inference layer lives in Kubernetes.

MicroK8s Setup

Install

sudo snap install microk8s --classic
sudo usermod -aG microk8s $USER
newgrp microk8s

Verify the node is ready:

microk8s kubectl get nodes

Enable Add-ons

microk8s enable dns
microk8s enable storage
microk8s enable ingress
microk8s enable metallb

When enabling MetalLB, you'll be prompted for an IP range. Use a small slice of your local subnet that won't conflict with DHCP — for example:

192.168.1.50-192.168.1.60

MetalLB will assign IPs from this pool to LoadBalancer services. The ingress controller picks up 192.168.1.54 in this setup.
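Before committing to a range, it is worth checking that the addresses you care about fall inside it and that nothing else on the network does. A small sketch with Python's standard ipaddress module, using the pool from this setup (the helper names are mine):

```python
import ipaddress
from typing import Tuple


def parse_pool(spec: str) -> Tuple[ipaddress.IPv4Address, ipaddress.IPv4Address]:
    """Split a 'start-end' range into two IPv4Address objects."""
    start_s, end_s = spec.split("-")
    return ipaddress.IPv4Address(start_s.strip()), ipaddress.IPv4Address(end_s.strip())


def in_pool(ip: str, spec: str) -> bool:
    """True if ip lies within the inclusive range described by spec."""
    start, end = parse_pool(spec)
    return start <= ipaddress.IPv4Address(ip) <= end


def pool_size(spec: str) -> int:
    """Number of addresses MetalLB can hand out from this range."""
    start, end = parse_pool(spec)
    return int(end) - int(start) + 1


POOL = "192.168.1.50-192.168.1.60"
print(in_pool("192.168.1.54", POOL))   # the ingress controller's address
print(in_pool("192.168.1.80", POOL))   # the old bare-metal Ollama host
print(pool_size(POOL))
```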

kubectl alias

MicroK8s ships its own kubectl. To use the standard kubectl command (useful for remote access from another machine):

microk8s config > ~/.kube/config

Then install kubectl on your client machine, copy the exported config to it, and point kubectl at that file.

GPU Operator

microk8s enable gpu

This installs the NVIDIA GPU Operator. It works well if your node has an NVIDIA card — it handles driver installation, container runtime configuration, and makes GPUs schedulable as resources.

On this NUC, the GPU is AMD (Vega M GH), so the operator deploys but has nothing to drive. Ollama falls back to CPU. The gpu-operator-resources namespace will exist but be idle. See Lessons Learned.

Deploying Ollama

Why a StatefulSet?

Ollama stores pulled models in /root/.ollama. A Deployment gives pods random names and doesn't guarantee stable storage attachment. A StatefulSet gives a stable pod name (ollama-0), ordered startup and shutdown, and a consistent binding between the pod and its PersistentVolumeClaim.

The Stack Manifest

ollama-stack.yaml creates four resources in sequence:

The ollama namespace
A 50 Gi PVC (models are large — llama3.1:8b is ~5 GB; headroom matters)
The Ollama StatefulSet
A ClusterIP service on port 11434

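apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
  namespace: ollama
spec:
  serviceName: ollama-service        # the Service that fronts this StatefulSet
  replicas: 1
  selector:
    matchLabels:
      app: ollama                    # assumed label; must match the template labels
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434   # Ollama API port
          resources:
            limits:
              cpu: "4"
              memory: "16Gi"
            requests:
              cpu: "2"
              memory: "4Gi"
          volumeMounts:
            - name: ollama-volume
              mountPath: /root/.ollama
      volumes:
        - name: ollama-volume
          persistentVolumeClaim:
            claimName: ollama-pvc    # assumed name; use the 50 Gi PVC from the stack file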

The memory limit of 16 Gi gives llama3.1:8b enough headroom to load the full model into RAM without competing with Open-WebUI for the remaining 16 Gi.

Model Preload with initContainer

The first time Ollama starts, no models are pulled. If your app tries to chat before a model exists, it fails with a confusing error. ollama-automated.yaml solves this with an initContainer that pulls the model before the main container starts:

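initContainers:
  - name: pull-model
    image: ollama/ollama:latest
    command: ["/bin/sh", "-c"]
    args:
      - |
        # Start a temporary server in the background and remember its PID
        ollama serve &
        SERVER_PID=$!
        # Poll until the API answers instead of relying on a fixed sleep
        until ollama list >/dev/null 2>&1; do sleep 1; done
        ollama pull llama3.1:8b
        # Stop the temporary server so the init container can exit
        kill "$SERVER_PID"
    volumeMounts:
      - name: ollama-volume
        mountPath: /root/.ollama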

The init container starts ollama serve in the background, waits for it to be ready, pulls the model, then stops. The main container finds the model already on disk and starts serving immediately.

Apply it:

kubectl apply -f backend/k8s-deployment/ollama-stack.yaml
# or, to pre-pull the model on first boot:
kubectl apply -f backend/k8s-deployment/ollama-automated.yaml

Deploying Open-WebUI

Open-WebUI is a polished chat interface that connects directly to Ollama. open-webui-stack.yaml deploys it in the same ollama namespace:

env:
  - name: OLLAMA_BASE_URL
    value: http://ollama-service.ollama.svc.cluster.local:11434

The key line is OLLAMA_BASE_URL. It uses Kubernetes internal DNS (<service>.<namespace>.svc.cluster.local) to reach Ollama, with no hard-coded IPs and no reliance on external networking. If the Ollama pod restarts and gets a new IP, the DNS name still resolves correctly.

A 10 Gi PVC stores chat history and Open-WebUI configuration.

kubectl apply -f backend/k8s-deployment/open-webui-stack.yaml

Once running, Open-WebUI is available at http://ai.local (after ingress is configured below).

MetalLB + Nginx Ingress

Why MetalLB?

In a cloud cluster, type: LoadBalancer services get a public IP automatically from the cloud provider. On bare metal, nothing assigns that IP, so services stay in <pending>. MetalLB fills that gap by assigning IPs from a configured pool to LoadBalancer services on your local network.

ingress-lb.yaml creates a LoadBalancer service for the Nginx ingress controller:

kubectl apply -f backend/k8s-deployment/ingress-lb.yaml

MetalLB assigns 192.168.1.54 from the configured pool. Verify:

kubectl get svc -n ingress
# ingress-loadbalancer   LoadBalancer   ...   192.168.1.54   80:...,443:...

Ingress Rules

Three ingress objects route hostnames to services:

File	Hostname	Target Service
`ollama-ingress.yaml`	`ollama.local`	`ollama-service:11434`
`open-webui-ingress.yaml`	`ai.local`	`open-webui-service:80`
`dashboard-ingress.yaml`	`dashboard.local`	Kubernetes dashboard

kubectl apply -f backend/k8s-deployment/ollama-ingress.yaml
kubectl apply -f backend/k8s-deployment/open-webui-ingress.yaml
kubectl apply -f backend/k8s-deployment/dashboard-ingress.yaml

Client Hosts File

Any machine that wants to reach these hostnames needs a single /etc/hosts entry pointing the names at the MetalLB IP:

192.168.1.54  dashboard.local ollama.local ai.local

On Linux and macOS the file is /etc/hosts. On Windows it is C:\Windows\System32\drivers\etc\hosts.

After this, http://ai.local opens Open-WebUI in the browser, and http://ollama.local is the Ollama API endpoint.

Connecting the MCP App from Part 1

The MCP app from Part 1 pointed at the bare-metal Ollama IP. Switching to the k8s deployment is a one-line .env change in both backend/.env and mcp-server/.env:

# Before (bare metal)
OLLAMA_HOST=http://192.168.1.80:11434

# After (k8s ingress)
OLLAMA_HOST=http://ollama.local
OLLAMA_MODEL=llama3.1:8b
OLLAMA_TIMEOUT=180000

The app now survives an Ollama pod restart without any manual intervention — the pod comes back up, the DNS name resolves, and the next request succeeds. The 3-minute timeout (180000 ms) accounts for the slower CPU-only inference on llama3.1:8b.
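A quick back-of-the-envelope check shows what that timeout buys. The sketch assumes a steady generation rate and ignores model load and prompt processing time, so real headroom is somewhat smaller.

```python
def max_tokens_within(timeout_ms: int, tokens_per_second: float) -> int:
    """Upper bound on tokens generated before the timeout, at a steady rate."""
    return int(timeout_ms / 1000 * tokens_per_second)


TIMEOUT_MS = 180_000   # OLLAMA_TIMEOUT from the .env above

slow = max_tokens_within(TIMEOUT_MS, 10)   # pessimistic CPU-only rate
fast = max_tokens_within(TIMEOUT_MS, 15)   # optimistic CPU-only rate
print(f"Response budget: {slow} to {fast} tokens")
```

At 10 to 15 tokens/second that is enough for a long answer, which is why three minutes is a comfortable ceiling rather than a tight one.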

Verifying the Deployment

Check everything is running:

kubectl get pods,svc,ingress -n ollama

Expected output:

NAME                             READY   STATUS    RESTARTS
pod/ollama-0                     1/1     Running   1
pod/open-webui-648d966b5-jsvfl   1/1     Running   2

NAME                         TYPE        CLUSTER-IP       PORT(S)
service/ollama-service       ClusterIP   10.152.183.132   11434/TCP
service/open-webui-service   ClusterIP   10.152.183.48    80/TCP

NAME                                    HOSTS        ADDRESS
ingress/ollama-ingress                  ollama.local 127.0.0.1
ingress/open-webui-ingress             ai.local     127.0.0.1

Test Ollama directly:

curl http://ollama.local/api/tags
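The same check is easy to script if you want it in a health probe or a test. This sketch uses only the standard library and reads the model names from the response's models list; the list_models name and the injectable opener parameter are my own additions so the function can be exercised without a live server.

```python
import json
from typing import Callable, List
from urllib.request import urlopen


def list_models(base_url: str = "http://ollama.local",
                opener: Callable = urlopen) -> List[str]:
    """Return the names of the models Ollama reports at /api/tags."""
    with opener(f"{base_url}/api/tags", timeout=10) as resp:
        payload = json.load(resp)
    return [model["name"] for model in payload.get("models", [])]


if __name__ == "__main__":
    # Requires the ingress and hosts entry described above.
    print(list_models())
```

If llama3.1:8b is missing from the output, the initContainer preload did not complete.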

Demo

With everything running, Open-WebUI is available at http://ai.local — a full chat interface served entirely from the home lab, with no cloud API involved.

Screenshot: llama3.1:8b handling a practical question with a detailed, well-structured response. The model name and URL in the browser confirm it's running locally through the k8s ingress.

Lessons Learned

The GPU assumption cost time. I installed the NVIDIA GPU operator early and assumed Ollama would detect the AMD Vega M and use it. It didn't: CUDA and ROCm are separate stacks, and the Ollama container never saw the GPU. Lesson: check the GPU vendor before planning acceleration. For AMD, ROCm support in Ollama requires a custom image and a different operator setup, which is a future post if I get there.

CPU inference is slower than expected on larger models. llama3.2:3b from Part 1 was snappy on CPU. Upgrading to llama3.1:8b dropped throughput noticeably — around 10–15 tokens/second. The 3B model was fine for quick tests; the 8B model is better quality but requires patience. The 3-minute timeout in the MCP app exists because of this.

The initContainer preload is worth the complexity. On first deploy without it, the Ollama pod starts, the Open-WebUI pod starts, a user immediately tries to chat, and gets a confusing "model not found" error. The initContainer adds maybe 5 minutes to first boot (model download) and eliminates that failure mode entirely.

StatefulSet vs Deployment matters less than the PVC. I spent time deliberating over this. The real thing that matters is the PVC binding — models must persist across restarts. StatefulSet makes the PVC binding explicit; a Deployment with a manual PVC binding would also work. StatefulSet is cleaner.

Ingress address shows 127.0.0.1 — that's normal. MicroK8s reports the ingress ADDRESS as 127.0.0.1 rather than the MetalLB IP. This confused me initially. The actual external IP lives on the LoadBalancer service in the ingress namespace, not on the ingress objects themselves.

What's Next

The infrastructure is stable. The next steps are on the application side:

Streaming responses — the MCP app waits for the full model completion before showing anything; streaming would feel much more interactive
Containerise the app — the Node.js backend and MCP server still run locally on the dev machine; packaging them as containers and deploying to the same cluster would make the whole system self-contained
AMD GPU support — ROCm-based Ollama on the Vega M is worth exploring; a 4 GB HBM2 GPU could significantly improve throughput even at reduced model precision

The full source, including all manifests, is on GitHub. Feedback and contributions welcome.

Claude™, Copilot™ & Gemini™ for Architects — A 2026 Field Guide

2026-05-20T00:00:00.000Z

AI Tools for Architects: Beyond Code Generation

Practical use cases for Solution, Infrastructure & Enterprise Architects Featuring Claude™, GitHub Copilot™, and Gemini™ + Antigravity™

The rapid explosion of LLMs and AI tools to assist software engineers and architects is mind-boggling. Selecting the right tools for each use case can be a daunting task. I have been experimenting with three superpowers: Claude Code™, GitHub Copilot™ + VS Code, and Antigravity™ + Gemini™. Variations such as using Cursor with Claude™ can easily be derived from this.

This article also shares a GitHub repo that demonstrates some of the architecture use cases.

#	Section	Topic
1	Three Superpowers	AI Tools for Architects — Beyond Code Generation
2	What Changed	The 2025–2026 Leap
3	Three Superpowers	Claude · Copilot · Gemini + Antigravity
4	Use Case Matrix	Architect use cases rated across all three tools
5	Case Study: Claude	Reverse-engineer architecture → ADR
6	Case Study: Copilot	IaC scaffold + security scan + coding agent
7	Case Study: Gemini	Upload diagram → component spec + GCP mapping
8	Elevated Use Cases	ADRs, Security, TOGAF, POC/MVP
9	Workflow	Discover → Design → Build → Validate
10	Case Study Output	ADRs, Architecture & Trade-off docs on GitHub
11	Start Today	This Week / This Month / This Quarter actions

1. Three Superpowers

From Generators to Agents

Architecture Document — CTA Public Transport Optimisation System

2026-05-19T00:00:00.000Z

Version: 1.0
Date: 2026-03-12
Status: Baselined
Standard: 4+1 Architectural View Model (Kruchten, 1995)
Notation: ArchiMate 3.1 concepts rendered as Mermaid diagrams

1. Document Purpose and Scope

1.1 Purpose

This document describes the software architecture of the Chicago Transit Authority (CTA) Public Transport Optimisation System. It is structured according to the 4+1 Architectural View Model (Kruchten, IEEE Software 1995), which organises the architecture into five complementary views, each addressing the concerns of a different stakeholder group:

View	Primary Audience	Central Concern
Use Case (+1)	All stakeholders	Scenarios that drive architectural decisions
Logical	Architects, developers	Functional decomposition and key abstractions
Process	Architects, integrators	Concurrency, data flows, runtime behaviour
Development	Developers, build engineers	Module structure, package organisation
Physical	Operations, DevOps	Deployment topology, infrastructure mapping

Diagrams use Mermaid syntax and follow ArchiMate 3.1 layering conventions:

Technology Layer — infrastructure elements (brokers, databases, containers)
Application Layer — software components and their interfaces
Business Layer — business processes and actors that the system serves

1.2 System Overview

The system is a real-time streaming pipeline that ingests simulated operational data from the CTA elevated rail network ("L"), processes it through multiple transformation stages, and presents a live transit status dashboard. It demonstrates a full Event-Driven Architecture (EDA) on the Confluent Kafka platform.

1.3 Scope

Three train lines: Blue, Red, Green (each with 10 trains, bidirectional)
Station arrival events, turnstile ridership counts, and weather telemetry
Static station reference data from PostgreSQL
A browser-accessible real-time status dashboard

2. Architectural Drivers

2.1 Quality Attribute Requirements

ID	Quality Attribute	Scenario	Architectural Response
QA-01	Throughput	3 lines × stations × 10 trains produce arrival events every 5 s	10-partition Kafka topic; AvroProducer batching
QA-02	Decoupling	New consumers must not require producer changes	All communication via Kafka topics (no direct calls)
QA-03	Schema Evolution	Fields may be added to events over time	Avro + Schema Registry with compatibility enforcement
QA-04	Replayability	Dashboard must recover state on restart	Consumers start from `offset_earliest`; Faust rebuilds table from log
QA-05	Responsiveness	Dashboard must serve HTTP requests without stalling Kafka polling	Tornado async IO loop; consumers as coroutines
QA-06	Extensibility	Station reference data changes without code deployment	Kafka Connect JDBC connector; consumers subscribe to topic

2.2 Constraints

Python-only application code (no JVM services authored in-house)
Single-host Docker Compose deployment (development / demonstration environment)
Confluent Platform 5.2.2 (fixed version)

3. Use Case View (+1)

The Use Case View captures the key scenarios that motivated and validate the architectural decisions. In the 4+1 model this view acts as the glue — each scenario exercises a slice through every other view.

3.1 Actor Diagram

3.2 Key Scenarios

UC-01 — View Live Transit Status

Trigger: Transit operator opens http://localhost:8888
Flow: Tornado serves status.html populated from in-memory Lines and Weather state that is continuously updated by four Kafka consumers running as async coroutines.
Architectural relevance: Drives the Tornado async server choice (ADR-006) and the requirement for in-process Kafka consumer coroutines.

UC-02 — Publish Train Arrival Event

Trigger: Simulation time step advances; a train moves to the next station.
Flow: Station.run() → AvroProducer.produce() → Schema Registry validates Avro → Kafka topic org.chicago.cta.station.arrivals.t001 → KafkaConsumer in server → Lines.process_message() → UI state updated.
Architectural relevance: Establishes the end-to-end Kafka + Avro pipeline (ADR-001, ADR-002).

UC-04 — Publish Weather Reading

Trigger: Simulation hour boundary.
Flow: Weather.run() → HTTP POST to Kafka REST Proxy → Kafka topic org.chicago.cta.weather.v1 → KafkaConsumer in server → Weather.process_message().
Architectural relevance: Demonstrates the REST Proxy integration path (ADR-005).

UC-06 — Aggregate Rider Counts

Trigger: Continuous turnstile events on com.cta.stations.turnstile.entry.
Flow: KSQL turnstile table materialises from topic → KSQL TURNSTILE_SUMMARY GROUP BY aggregation → new Kafka topic → KafkaConsumer (is_avro=False) in server → UI ridership count.
Architectural relevance: Drives the KSQL aggregation decision (ADR-004).

UC-07 — Transform Station Schema

Trigger: Kafka Connect pushes a raw station row to com.cta.stations.data.rawt001.stations.
Flow: Faust transform_stations agent reads record → resolves red/blue/green booleans to line string → writes TransformedStation to org.chicago.cta.stations.table.v1t001 and updates Faust in-memory table.
Architectural relevance: Drives the Faust stream processor choice (ADR-004).
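The boolean-to-line resolution in UC-07 can be sketched in plain Python, without Faust. This is a minimal illustration only: the `RawStation` field names, the `resolve_line` helper, and the first-match ordering are assumptions, not taken from `consumers/faust_stream.py`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RawStation:
    # Hypothetical subset of the raw row; field names are assumptions.
    station_id: int
    station_name: str
    order: int
    red: bool
    blue: bool
    green: bool


@dataclass
class TransformedStation:
    station_id: int
    station_name: str
    order: int
    line: Optional[str]


def resolve_line(raw: RawStation) -> Optional[str]:
    """Return the first colour flag that is set, or None if none is set."""
    for colour in ("red", "blue", "green"):
        if getattr(raw, colour):
            return colour
    return None


def transform(raw: RawStation) -> TransformedStation:
    """Collapse the three boolean columns into a single line string."""
    return TransformedStation(
        raw.station_id, raw.station_name, raw.order, resolve_line(raw)
    )
```

In the real system this logic would sit inside the Faust agent, which also writes the result to the output topic and the in-memory table.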

4. Logical View

The Logical View describes the system's functional decomposition into key abstractions, their responsibilities, and their relationships. This view follows ArchiMate's Application Layer notation.

4.1 Component Overview

4.2 Key Abstractions

Producer Hierarchy

Consumer / Model Hierarchy

4.3 Kafka Topic Catalogue

Topic	Producer	Consumer(s)	Format	Partitions
`org.chicago.cta.station.arrivals.t001`	Station (AvroProducer)	Tornado server	Avro	10
`com.cta.stations.turnstile.entry`	Turnstile (AvroProducer)	KSQL	Avro	10
`org.chicago.cta.weather.v1`	Weather (REST Proxy)	Tornado server	Avro	10
`com.cta.stations.data.rawt001.stations`	Kafka Connect JDBC	Faust	JSON (Connect)	1
`org.chicago.cta.stations.table.v1t001`	Faust	Tornado server	JSON	1
`TURNSTILE_SUMMARY`	KSQL	Tornado server	JSON	—

5. Process View

The Process View describes the system's dynamic behaviour — how processes start, how data flows between them at runtime, and how concurrency is managed.

5.1 System Startup Sequence

The diagram below shows the mandatory startup order. Components further right depend on components to their left being fully initialised.

5.2 End-to-End Data Flow — Train Arrival

5.3 End-to-End Data Flow — Turnstile Aggregation

5.4 Concurrency Model

The entire consumer application runs in a single OS thread using cooperative multitasking. Kafka polling is non-blocking (0.1 s timeout). The HTTP handler is synchronous but executes between coroutine yield points, keeping UI latency low.
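The single-thread cooperative model can be illustrated with the standard library alone. The sketch below uses `asyncio` rather than Tornado's IOLoop, and the `consume` and `handle_request` names are hypothetical stand-ins for the consumer coroutines and the HTTP handler; it shows the scheduling idea, not the server's actual code.

```python
import asyncio

# Shared in-process state, updated by consumers and read by the handler.
state = {"arrivals": 0, "weather": 0}


async def consume(topic: str, messages: int) -> None:
    """Stand-in for a Kafka consumer loop: handle one message, then yield."""
    for _ in range(messages):
        state[topic] += 1        # synchronous update; no lock is needed
        await asyncio.sleep(0)   # yield point: other coroutines may run here


def handle_request() -> dict:
    """Synchronous handler: runs to completion between yield points."""
    return dict(state)


async def main() -> dict:
    await asyncio.gather(consume("arrivals", 3), consume("weather", 2))
    return handle_request()


snapshot = asyncio.run(main())
print(snapshot)  # {'arrivals': 3, 'weather': 2}
```

Because only one coroutine runs at a time and control changes hands only at `await`, the handler never observes a half-applied update.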

6. Development View

The Development View describes the organisation of the software in the development environment — module structure, package dependencies, and build artefacts.

6.1 Module Structure

6.2 Package Dependencies

6.3 Entry Points and Startup Commands

Process	Entry Point	Command
Data producer + simulation	`producers/simulation.py`	`python simulation.py`
Station stream transformer	`consumers/faust_stream.py`	`faust -A faust_stream worker -l info`
Turnstile KSQL setup	`consumers/ksql.py`	`python ksql.py`
Dashboard web server	`consumers/server.py`	`python server.py`

Note: The Faust transformer, the KSQL setup script, and the dashboard server (rows 2, 3, and 4 above) have an implicit startup ordering dependency. The Kafka Connect JDBC connector (configured by the simulation) must produce station data before the Faust app can transform it; the KSQL tables must exist before the dashboard starts. There is no orchestration script enforcing this order.
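One lightweight way to enforce such an order is a readiness probe that blocks until each dependency accepts TCP connections. The sketch below is an assumption about how that could look, not code from the repository; `wait_for_port` and `wait_for_all` are hypothetical helpers. An open port only shows that a service is listening, so topic or table existence would still need a separate check.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0,
                  interval_s: float = 0.5) -> bool:
    """Return True once a TCP connection succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Service not accepting connections yet; retry after a pause.
            time.sleep(interval_s)
    return False


def wait_for_all(services: dict, timeout_s: float = 30.0) -> list:
    """Probe services in order; return names that never became reachable."""
    return [
        name
        for name, (host, port) in services.items()
        if not wait_for_port(host, port, timeout_s)
    ]
```

A startup script could call `wait_for_all({"kafka": ("localhost", 9092), "schema-registry": ("localhost", 8081), "ksql": ("localhost", 8088)})` and abort if the returned list is non-empty.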

7. Physical View

The Physical View maps software components onto physical (or virtualised) infrastructure. This view follows ArchiMate's Technology Layer.

7.1 Container Deployment Diagram

7.2 Network Port Map

Port	Service	Protocol	Consumer(s)
2181	Zookeeper	TCP	Kafka broker (internal)
9092	Kafka broker	PLAINTEXT	Python producers, Python consumers, Faust
8081	Schema Registry	HTTP	AvroProducer, AvroConsumer, Kafka Connect
8082	Kafka REST Proxy	HTTP	Weather producer
8083	Kafka Connect REST API	HTTP	`connector.py` setup
8084	Connect UI	HTTP	Operator browser
8085	Topics UI	HTTP	Operator browser
8086	Schema Registry UI	HTTP	Operator browser
8088	KSQL Server	HTTP	`ksql.py` setup
5432	PostgreSQL	TCP	Kafka Connect JDBC
8888	Tornado Dashboard	HTTP	Transit Operator browser

7.3 Data Persistence Boundary

All in-process state is rebuilt from Kafka on restart. Durable state exists only in PostgreSQL (station reference data) and the Kafka topic logs.

8. Architectural Decisions Summary

Cross-reference to the detailed ADR documents in docs/adr/.

ID	Decision	Rationale	ADR
AD-01	Apache Kafka as the central event bus	Decoupling, replayability, fan-out	ADR-001
AD-02	Avro + Schema Registry for all first-party topics	Schema evolution, contract enforcement	ADR-002
AD-03	Kafka Connect JDBC Source for PostgreSQL	Zero custom ingestion code; handles offset/retry	ADR-003
AD-04	Faust for station transformation	Python-native; record-level transform	ADR-004
AD-05	KSQL for turnstile aggregation	Declarative SQL GROUP BY; no Python state management	ADR-004
AD-06	Kafka REST Proxy for weather	Demonstrates HTTP-based produce path	ADR-005
AD-07	Tornado async web server	Single-thread concurrency for Kafka + HTTP	ADR-006

9. Risks and Technical Debt

9.1 Risks

ID	Risk	Severity	Affected View	Mitigation
R-01	Single Kafka broker — SPOF	High	Physical	Add 2 additional brokers; set `replication_factor=3`
R-02	Replication factor 1 on all topics	High	Physical	Increase to 3 in production
R-03	Hard-coded `localhost` addresses in both `constants.py` files	Medium	Development	Externalise via environment variables or a config file
R-04	Hard-coded DB credentials in `connector.py`	High	Physical	Use Kafka Connect secrets management or environment injection
R-05	Manual startup ordering with no orchestration	Medium	Process	Add a readiness-check script or use `depends_on` with health checks
R-06	`AvroProducer` is a deprecated Confluent API	Medium	Development	Migrate to `SerializingProducer` + `AvroSerializer`
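The mitigation for R-03 and R-04 can be sketched as a small settings loader. The variable names `BOOTSTRAP_SERVERS`, `SCHEMA_REGISTRY_URL`, `DB_USER`, and `DB_PASS`, and the `Settings` and `load_settings` names, are illustrative assumptions; the defaults mirror the local Docker addresses listed in the port map.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    bootstrap_servers: str
    schema_registry_url: str
    db_user: str
    db_password: str


def load_settings(env=os.environ) -> Settings:
    """Read connection settings from the environment.

    Addresses fall back to the local Docker defaults. Credentials have no
    default, so a missing value raises KeyError instead of living in source.
    """
    return Settings(
        bootstrap_servers=env.get("BOOTSTRAP_SERVERS", "localhost:9092"),
        schema_registry_url=env.get("SCHEMA_REGISTRY_URL", "http://localhost:8081"),
        db_user=env["DB_USER"],
        db_password=env["DB_PASS"],
    )
```

Passing the mapping as a parameter keeps the loader easy to test without touching the real process environment.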

9.2 Technical Debt

ID	Description	Location	Effort
TD-01	`TURNSTILE_SUMMARY` uses JSON while all other topics use Avro — inconsistency in serialisation convention	`consumers/ksql.py`, `consumers/server.py:87`	Low
TD-02	Faust Table uses `store="memory://"` — state lost on restart, rebuild time increases with topic size	`consumers/faust_stream.py:38`	Medium
TD-03	Both `producers/constants.py` and `consumers/constants.py` duplicate identical constant values	Both files	Low
TD-04	No unit or integration tests present in the repository	Entire codebase	High
TD-05	Weather schema JSON loaded on every `Weather.__init__` call via file I/O (class variables mitigate partially)	`producers/models/weather.py:49-55`	Low
TD-06	`connector.py` exits the process on connector creation failure, preventing graceful recovery	`producers/connector.py:51-53`	Low

Document generated by reverse-engineering the source code on 2026-03-12. All diagrams use Mermaid and render natively on GitHub.

Architectural Trade-off Analysis — CTA Public Transport Optimisation System

2026-05-19T00:00:00.000Z

Version: 1.0
Date: 2026-03-12
Authors: Architecture Review (reverse-engineered from codebase)
References: architecture.md · ADR-001 through ADR-006

1. Evaluation Framework

Every trade-off is scored against the six quality attributes (QAs) derived from the system's architectural drivers (see architecture.md §2).

1.1 Quality Attribute Weights

ID	Quality Attribute	Weight	Justification
QA-01	Throughput	20 %	System processes events from 3 lines × stations × 10 trains @ 5 s intervals
QA-02	Decoupling	20 %	Producers and consumers must evolve independently
QA-03	Schema Evolution	15 %	Fields may be added; consumers must not break
QA-04	Replayability	15 %	Dashboard must rebuild state on restart
QA-05	Responsiveness	15 %	Dashboard HTTP latency must not stall Kafka polling
QA-06	Extensibility	15 %	New data sources/consumers without code changes

1.2 Scoring Scale

Score	Meaning
5	Fully meets the quality attribute
4	Meets with minor gaps
3	Partial / neutral
2	Partially undermines the quality attribute
1	Significantly undermines the quality attribute

1.3 Risk Scale

Level	Symbol	Meaning
Critical	🔴	Likely to cause production incidents
High	🟠	Significant impact under normal load
Medium	🟡	Impact under edge cases or growth
Low	🟢	Manageable with standard practices

2. Decision Trade-off Analysis

2.1 Event Bus — Kafka vs Alternatives

Decision: ADR-001 — Apache Kafka as the single event bus

Weighted Scoring Matrix

Quality Attribute	Weight	Kafka	RabbitMQ	Redis Streams	REST Polling
Throughput (QA-01)	20 %	5	4	4	2
Decoupling (QA-02)	20 %	5	4	3	1
Schema Evolution (QA-03)	15 %	5	3	2	2
Replayability (QA-04)	15 %	5	2	3	1
Responsiveness (QA-05)	15 %	4	4	5	2
Extensibility (QA-06)	15 %	5	4	3	1
Weighted Total		4.85	3.55	3.35	1.50
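Each weighted total is the sum of score × weight over the six quality attributes. The sketch below reproduces the Kafka and RabbitMQ columns from the table above; the `weighted_total` helper and the dictionary layout are illustrative, and weights are kept as integer percentages to avoid floating-point drift.

```python
# Weights from section 1.1, as integer percentages.
WEIGHTS_PCT = {"QA-01": 20, "QA-02": 20, "QA-03": 15,
               "QA-04": 15, "QA-05": 15, "QA-06": 15}

# Scores copied from the Kafka and RabbitMQ columns of the matrix.
SCORES = {
    "Kafka":    {"QA-01": 5, "QA-02": 5, "QA-03": 5,
                 "QA-04": 5, "QA-05": 4, "QA-06": 5},
    "RabbitMQ": {"QA-01": 4, "QA-02": 4, "QA-03": 3,
                 "QA-04": 2, "QA-05": 4, "QA-06": 4},
}


def weighted_total(scores: dict, weights_pct: dict = WEIGHTS_PCT) -> float:
    """Return the weighted score on the 1-5 scale, rounded to 2 decimals."""
    if sum(weights_pct.values()) != 100:
        raise ValueError("weights must sum to 100 %")
    return round(sum(weights_pct[qa] * scores[qa] for qa in weights_pct) / 100, 2)


print(weighted_total(SCORES["Kafka"]))     # 4.85
print(weighted_total(SCORES["RabbitMQ"]))  # 3.55
```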

Positioning

Trade-off Narrative

Why Kafka wins here: Kafka's append-only, partitioned log is the single feature that unlocks replayability and fan-out simultaneously, properties that a classic queue-based broker such as RabbitMQ does not provide out of the box. The ability to start a new consumer at offset_earliest and rebuild the full station/weather state is architecturally critical for the dashboard's cold-start scenario.

What is sacrificed:

Operational simplicity. Kafka requires Zookeeper (in CP 5.x), Schema Registry, and REST Proxy as satellites. A RabbitMQ cluster is simpler to operate.
Latency at p99. Kafka batches records before acknowledgment; for the weather producer that posts once per simulated hour, this is irrelevant, but it rules Kafka out for sub-millisecond latency use cases.

Key risk introduced: 🔴 Single-broker deployment with replication_factor=1. In production, broker failure loses all un-replicated messages.

2.2 Serialisation — Avro vs Alternatives

Decision: ADR-002 — Apache Avro + Confluent Schema Registry

Weighted Scoring Matrix

Quality Attribute	Weight	Avro + Registry	Plain JSON	Protobuf	MessagePack
Throughput (QA-01)	20 %	5	3	5	4
Decoupling (QA-02)	20 %	5	2	5	2
Schema Evolution (QA-03)	15 %	5	1	5	2
Replayability (QA-04)	15 %	5	3	4	3
Responsiveness (QA-05)	15 %	4	5	4	4
Extensibility (QA-06)	15 %	5	2	4	2
Weighted Total		4.85	2.65	4.55	2.85

Trade-off Narrative

Why Avro wins: The Schema Registry's compatibility check acts as the publish-time equivalent of a compile-time check: a schema change that violates the configured compatibility mode is rejected before a single consumer can be broken. The Confluent wire format carries only a 5-byte header (a magic byte plus the 4-byte schema ID) instead of the full schema, making messages far more compact than equivalent JSON.
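To make the compactness argument concrete, the sketch below splits a Confluent-framed message into its header and payload using only the standard library. The framing (one magic byte of value 0, then a 4-byte big-endian schema ID) is the documented Confluent wire format; the `split_confluent_frame` helper and the schema ID 42 are illustrative.

```python
import struct

MAGIC_BYTE = 0


def split_confluent_frame(message: bytes) -> tuple:
    """Split a Confluent-framed message into (schema_id, avro_payload).

    Layout: 1 magic byte (0), 4-byte big-endian schema ID, then the Avro body.
    """
    if len(message) < 5:
        raise ValueError("message shorter than the 5-byte header")
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[5:]


# 42 is an illustrative schema ID; b"\x02" stands in for an Avro-encoded body.
framed = struct.pack(">bI", MAGIC_BYTE, 42) + b"\x02"
print(split_confluent_frame(framed))  # (42, b'\x02')
```

A consumer uses the schema ID to fetch the writer schema from the registry, so the schema itself never travels with the message.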

Protobuf is the credible alternative: Protobuf achieves nearly identical scores. The differentiator is ecosystem fit: confluent-kafka-python's AvroProducer/AvroConsumer were the idiomatic Python Confluent API at CP 5.2.2, whereas Protobuf support required more boilerplate. Today (Confluent Platform 7+), Protobuf is first-class; migrating would be viable.

Key inconsistency introduced: 🟠 TURNSTILE_SUMMARY uses JSON while all other topics use Avro. This forces consumers to branch on is_avro and removes schema-enforcement for rider counts — the metric that most directly feeds the UI.

2.3 DB Ingestion — Kafka Connect vs Custom Producer

Decision: ADR-003 — Kafka Connect JDBC Source Connector

Weighted Scoring Matrix

Quality Attribute	Weight	Kafka Connect JDBC	Custom Python Producer	Direct DB Read in Consumer	Debezium CDC
Throughput (QA-01)	20 %	4	4	3	5
Decoupling (QA-02)	20 %	5	3	1	5
Schema Evolution (QA-03)	15 %	3	2	1	5
Replayability (QA-04)	15 %	5	4	1	5
Responsiveness (QA-05)	15 %	4	4	2	4
Extensibility (QA-06)	15 %	5	3	1	5
Weighted Total		4.35	3.35	1.55	4.85

Trade-off Narrative

Why Kafka Connect wins over a custom producer: Zero-code ingestion eliminates an entire class of bugs: offset tracking, error handling, and retry logic are handled by a battle-tested framework. The connector is idempotent — safe to re-run on simulation restart.

Why Debezium CDC scores higher but was rejected: Debezium captures every INSERT/UPDATE/DELETE via PostgreSQL Write-Ahead Log, which is more correct (would capture station updates, not just inserts). However, enabling WAL replication requires DBA-level PostgreSQL configuration (wal_level=logical), which is overkill when the stations table is quasi-static reference data loaded once from CSV.

Hidden cost of the chosen approach: 🟡 mode=incrementing only detects new rows by monotonically increasing stop_id. A station name correction or line reassignment will silently remain stale in the Kafka topic and in the dashboard until the connector is manually reset and replayed.
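The stale-update behaviour can be reproduced with a simplified model of incrementing-mode polling. This is not the connector's actual query logic; `poll_incrementing` is a hypothetical stand-in, and the station names are illustrative.

```python
def poll_incrementing(table: list, last_seen_id: int) -> tuple:
    """Mimic one poll: fetch only rows whose id exceeds the stored offset."""
    new_rows = [row for row in table if row["stop_id"] > last_seen_id]
    if new_rows:
        last_seen_id = max(row["stop_id"] for row in new_rows)
    return new_rows, last_seen_id


table = [
    {"stop_id": 1, "station_name": "Clark/Lake"},
    {"stop_id": 2, "station_name": "Stae/Lake"},   # typo in the source data
]

# First poll publishes both rows and advances the offset to 2.
published, offset = poll_incrementing(table, last_seen_id=0)

# The typo is corrected in the database, but stop_id does not change.
table[1]["station_name"] = "State/Lake"
republished, offset = poll_incrementing(table, offset)

print(len(published), republished)  # 2 []
```

The second poll returns nothing because no row has a `stop_id` above the stored offset, so the correction never reaches the topic.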

When the decision should be revisited

If station data becomes writable (e.g. an admin UI for updating station names), migrate to Debezium CDC to capture UPDATE and DELETE events.

2.4 Stream Processing — Faust + KSQL vs Alternatives

Decision: ADR-004 — Faust for record transformation + KSQL for aggregation

Option Space

Weighted Scoring Matrix

Quality Attribute	Weight	Faust + KSQL	Faust Only	KSQL Only	Kafka Streams	Spark SS
Throughput (QA-01)	20 %	4	4	5	5	5
Decoupling (QA-02)	20 %	5	5	5	5	5
Schema Evolution (QA-03)	15 %	4	4	4	5	4
Replayability (QA-04)	15 %	3	3	4	5	5
Responsiveness (QA-05)	15 %	4	4	4	4	3
Extensibility (QA-06)	15 %	5	4	4	5	4
Weighted Total		4.20	4.05	4.40	4.85	4.40

Trade-off Narrative

The dual-engine pattern is a deliberate pedagogical trade-off: The use of two tools increases operational complexity (two processes, two different programming models) but each tool is used where it excels:

Concern	Faust	KSQL
Programming model	Async Python coroutines	Declarative SQL
Best for	Arbitrary code logic, Python type safety	GROUP BY, windowed aggregations
State store	In-memory (dev) / RocksDB (prod)	Kafka-backed materialised table
Restart behaviour	Replays topic from earliest	Persistent table survives restart

Key risk: 🟡 Faust's store="memory://" means station state is rebuilt from the full topic on every restart. As the station topic grows this adds startup latency. Replace with store="rocksdb://" for a persistent local state store.

Operational debt: 🟠 No orchestration enforces the startup order:

Kafka Connect must publish station data
Faust must transform it
KSQL must create TURNSTILE_SUMMARY
Only then can the dashboard start

A failure anywhere in this chain requires manual intervention.

2.5 Weather Producer — REST Proxy vs Native Client

Decision: ADR-005 — Kafka REST Proxy for the Weather producer

Weighted Scoring Matrix

Quality Attribute	Weight	REST Proxy	Native AvroProducer
Throughput (QA-01)	20 %	3	5
Decoupling (QA-02)	20 %	5	5
Schema Evolution (QA-03)	15 %	4	5
Replayability (QA-04)	15 %	5	5
Responsiveness (QA-05)	15 %	3	5
Extensibility (QA-06)	15 %	5	5
Weighted Total		4.15	5.00

Trade-off Narrative

This decision scores lowest of all six because there is no functional reason to diverge from the native client — only demonstration value.

Dimension	REST Proxy	Native AvroProducer
Extra network hop	Yes (+ ~1–5 ms per request)	No
Schema sent in every request	Yes (wasteful, ~2 KB)	No (schema ID only after first register)
Error handling	Silent drop on HTTP failure	Delivery callback with retry
Maintenance burden	Two integration patterns to understand	One
Polyglot value	Useful if producer is non-Python	Not applicable here

Verdict: 🟡 The REST Proxy choice adds cognitive overhead for no functional gain in a Python-only system. If the goal is demonstration, the inconsistency should be documented clearly (it now is in ADR-005). For a production system, weather should use AvroProducer like every other producer, and the REST Proxy demo should be a separate isolated example.

2.6 Dashboard Server — Tornado vs Alternatives

Decision: ADR-006 — Tornado async web server

Weighted Scoring Matrix

Quality Attribute	Weight	Tornado	Flask (sync)	aiohttp	FastAPI	Separate Consumer + Redis
Throughput (QA-01)	20 %	4	2	5	5	5
Decoupling (QA-02)	20 %	4	4	4	4	5
Schema Evolution (QA-03)	15 %	4	4	4	4	4
Replayability (QA-04)	15 %	5	3	5	5	4
Responsiveness (QA-05)	15 %	5	2	5	5	5
Extensibility (QA-06)	15 %	4	3	4	5	4
Weighted Total		4.30	3.00	4.50	4.65	4.55

Positioning

Trade-off Narrative

Why Tornado is a reasonable but not optimal choice: Tornado's IOLoop integrates naturally with confluent_kafka's callback-based API and was the idiomatic async web server in the Python ecosystem before asyncio matured. The chosen design — spawn_callback for consumers, synchronous GET handler — achieves the goal with minimal code.

aiohttp / FastAPI score higher today because:

Both are built natively on asyncio (no legacy compatibility shim)
FastAPI provides automatic OpenAPI documentation
The aiokafka library provides a fully async consumer compatible with both

Flask's fatal flaw in this context: A synchronous web server cannot co-locate Kafka consumer polling in the same process without threads. Using threads reintroduces shared-state locking complexity that the async model eliminates.

Key risk: 🟡 All four consumers share a single Kafka group.id. Starting a second dashboard instance would split partition ownership, causing each instance to see only a subset of events — producing an incoherent UI state.

3. Cross-Cutting Trade-off Analysis

3.1 Serialisation Consistency

The system uses two serialisation formats across its six topics:

Dimension	Avro path (5 topics)	JSON path (TURNSTILE_SUMMARY)
Schema enforcement	Registry rejects breaking changes	None
Consumer code	`AvroConsumer` (auto-deserialise)	Manual JSON decode
Wire size	Compact (schema ID only)	Verbose
Debuggability	Schema Registry UI	Raw JSON readable in Topics UI
Risk of silent breakage	Low	High

Recommendation: Register an Avro schema for TURNSTILE_SUMMARY and change VALUE_FORMAT='AVRO' in the KSQL CREATE TABLE statement. This removes the is_avro=False branch from the consumer and makes the serialisation model uniform.

3.2 State Management Strategy

The system employs three distinct state-management patterns with different durability guarantees:

State Store	Pattern	Cold-Start Cost	Data Loss Risk	Recovery
PostgreSQL	Source of truth	None	Low (volume)	Re-seed from CSV
Kafka logs	Event log	None	🔴 `replication_factor=1`	None if broker lost
Faust table (`memory://`)	Materialised view	Replay full topic	None (replays)	Automatic
Tornado in-process	Derived state	Replay all 4 topics	None (replays)	Automatic

Structural tension: The design makes all in-process state reconstruct-able from Kafka, which is elegant and correct. However, it assumes the Kafka logs are themselves durable — an assumption violated by replication_factor=1.

3.3 Concurrency Model

Three different concurrency approaches coexist across the system:

Component	Concurrency Model	Thread-safe?	Scale-out strategy
`simulation.py`	Sequential (single Python process, no async)	N/A	N/A
`faust_stream.py`	Asyncio event loop (Faust worker)	Yes	Multiple Faust worker instances
`ksql.py`	Single HTTP request, then exits	N/A	N/A
`server.py`	Tornado IOLoop + `spawn_callback` coroutines	Single-thread cooperative	🔴 Blocked by shared `group.id`

Scale-out constraint for the dashboard:

Fix: Each dashboard instance should use a unique group.id (e.g. append a UUID suffix) so every instance receives the full partition set and sees all events.
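A minimal sketch of that fix, assuming the consumers are configured from a plain dictionary: the keys `bootstrap.servers`, `group.id`, and `auto.offset.reset` are standard Kafka client settings, while `dashboard_consumer_config` and the `cta-dashboard` prefix are hypothetical names.

```python
import uuid


def dashboard_consumer_config(bootstrap_servers: str,
                              base_group: str = "cta-dashboard") -> dict:
    """Build a consumer config whose group.id is unique to this process."""
    return {
        "bootstrap.servers": bootstrap_servers,
        # A random suffix puts every instance in its own consumer group,
        # so each one is assigned all partitions of the topics it reads.
        "group.id": f"{base_group}-{uuid.uuid4().hex[:8]}",
        # A fresh group has no committed offsets, so start from the beginning.
        "auto.offset.reset": "earliest",
    }
```

The trade-off is that every restart creates a new group and replays from the earliest offset, which matches the rebuild-from-log design described above but leaves unused groups behind on the broker.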

3.4 Operational Complexity

The system requires 7 infrastructure containers + 4 Python processes to be started in a specific order:

Total components: 11

Layer	Components	Startup dependencies	Failure impact
Infrastructure	7 Docker containers	Ordered by `depends_on`	Total system down
Producers	1 Python process	Kafka + Schema Registry + Connect up	No events produced
Stream processors	2 Python processes	Kafka + producer running	Dashboard sees no data
Dashboard	1 Python process	Stream processors running	No UI

Operational risk: 🟠 There is no automated readiness check or restart policy for the Python processes. A crash at any layer requires manual diagnosis and ordered restart.

Mitigation options:

Option	Effort	Benefit
Add `healthcheck` to `docker-compose.yaml` for each service	Low	Detect infrastructure failures automatically
Wrap Python processes in a `Makefile` with retry logic	Low	Reduce manual restart toil
Add a startup probe script (`wait-for-it.sh` pattern)	Medium	Enforce ordering without manual timing
Convert Python processes to Docker services	Medium	Unified `docker-compose up` startup
Migrate to Kubernetes with init containers + readiness probes	High	Production-grade orchestration

4. Architecture Fitness Function

A fitness function scores how well the as-built architecture meets each quality attribute.

Scale: ████████░░ = partially met   ██████████ = fully met   ████░░░░░░ = significantly unmet

QA	Attribute	Current Score	Evidence	Gap
QA-01	Throughput	🟢 4.0/5	Kafka partitioning, AvroProducer batching handle demo load	Single broker caps real scale
QA-02	Decoupling	🟢 4.5/5	All flows via Kafka; zero direct service calls	REST Proxy + native producer inconsistency
QA-03	Schema Evolution	🟡 3.5/5	Avro + Schema Registry on 5/6 topics	TURNSTILE_SUMMARY bypasses registry
QA-04	Replayability	🟡 3.5/5	`offset_earliest` on all consumers; Faust rebuilds	`replication_factor=1` — log loss is unrecoverable
QA-05	Responsiveness	🟢 4.0/5	Tornado async model; non-blocking poll	Hard `exit(1)` if topics missing blocks startup
QA-06	Extensibility	🟢 4.5/5	New consumers subscribe without producer changes	Startup ordering is implicit, not automated

Overall fitness: 4.0 / 5.0 (80 %) — suitable for demonstration, not production

5. Strategic Recommendations

Prioritised by risk reduction value vs implementation effort:

Priority 1 — Critical (address before any production use)

#	Recommendation	ADR	Risk Addressed	Effort
P1-1	Set `replication_factor=3` on all topics; add 2 Kafka brokers	ADR-001	🔴 Data loss on broker failure	Medium
P1-2	Externalise all credentials (`DB_USER`, `DB_PASS`, `BOOTSTRAP_SERVERS`) to environment variables	ADR-003	🔴 Credential exposure	Low
P1-3	Assign unique `group.id` per dashboard instance	ADR-006	🔴 Incoherent state on scale-out	Low

Priority 2 — High (address in first production sprint)

#	Recommendation	ADR	Risk Addressed	Effort
P2-1	Register Avro schema for `TURNSTILE_SUMMARY`; change `VALUE_FORMAT='AVRO'`	ADR-002	🟠 Silent schema breaks on rider count	Low
P2-2	Replace `AvroProducer` with `SerializingProducer` + `AvroSerializer`	ADR-002	🟠 Deprecated API removal	Medium
P2-3	Add automated startup ordering (health checks + wait scripts)	ADR-004	🟠 Manual restart toil on failure	Medium
P2-4	Replace `store="memory://"` with `store="rocksdb://"` in Faust	ADR-004	🟠 Startup latency grows with topic size	Low

Priority 3 — Medium (address in backlog)

#	Recommendation	ADR	Risk Addressed	Effort
P3-1	Unify all producers to use `AvroProducer`; remove REST Proxy dependency	ADR-005	🟡 Cognitive overhead for maintainers	Low
P3-2	Migrate `faust_stream.py` + `server.py` to FastAPI + aiokafka	ADR-006	🟡 Tornado is legacy; FastAPI is modern async standard	High
P3-3	Merge the two `constants.py` files into a shared config module	—	🟡 Duplication / drift risk	Low
P3-4	Add unit tests for all producer models and consumer models	—	🟡 No regression safety net	High

6. Trade-off Summary Heatmap

The heatmap below shows the contribution of each architectural decision (rows) to each quality attribute (columns). Green = positive contribution; Red = negative contribution.

Decision	Throughput	Decoupling	Schema Evol.	Replayability	Responsiveness	Extensibility	Net
ADR-001 Kafka	🟢🟢	🟢🟢	🟢🟢	🟢🟢	🟡	🟢🟢	+10
ADR-002 Avro	🟢	🟢🟢	🟢🟢	🟢🟢	🟡	🟢🟢	+9
ADR-003 Connect	🟡	🟢🟢	🟡	🟢🟢	🟡	🟢🟢	+6
ADR-004 Faust+KSQL	🟢	🟢🟢	🟡	🔴	🟡	🟢🟢	+4
ADR-005 REST Proxy	🔴	🟢	🟡	🟢	🔴	🟢	+1
ADR-006 Tornado	🟡	🟡	🟡	🟢🟢	🟢🟢	🟡	+4
ADR-002 gap JSON KSQL	🟡	🟡	🔴🔴	🟡	🟡	🟡	-2
ADR-001 gap RF=1	🟡	🟡	🟡	🔴🔴	🟡	🟡	-2

Key:

🟢🟢 Strong positive (+2)
🟢 Positive (+1)
🟡 Neutral (0)
🔴 Negative (−1)
🔴🔴 Strong negative (−2)
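The Net column is the sum of the cell values defined in the key. A short sketch of that calculation, using the ADR-005 and ADR-002 rows as input; the `net_contribution` helper is illustrative.

```python
# Cell values from the key above.
CELL_VALUES = {"🟢🟢": 2, "🟢": 1, "🟡": 0, "🔴": -1, "🔴🔴": -2}


def net_contribution(cells: list) -> int:
    """Sum the per-attribute contributions of one heatmap row."""
    return sum(CELL_VALUES[cell] for cell in cells)


rest_proxy = ["🔴", "🟢", "🟡", "🟢", "🔴", "🟢"]      # ADR-005 row
avro = ["🟢", "🟢🟢", "🟢🟢", "🟢🟢", "🟡", "🟢🟢"]    # ADR-002 row

print(net_contribution(rest_proxy), net_contribution(avro))  # 1 9
```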

Overall Assessment

The core architectural spine — Kafka + Avro + Schema Registry + Kafka Connect — scores highly and is well-suited to the problem. The gaps are concentrated in two areas:

Operational resilience (single broker, no startup orchestration)
Serialisation consistency (KSQL JSON bypass undermines the otherwise strong Avro contract)

Addressing the P1 recommendations above would raise the overall fitness score from 4.0 / 5.0 to roughly 4.6 / 5.0, making the architecture production-worthy.

Trade-off analysis generated by reverse-engineering source code as of 2026-03-12. Weighted scores are analytical judgements based on code evidence, not empirical benchmarks.

ADR-001: Apache Kafka as the Central Event Bus

2026-05-19T00:00:00.000Z

Date: 2026-03-12
Status: Accepted
Deciders: Engineering Team

Context

The CTA (Chicago Transit Authority) public transport optimisation system must ingest and distribute high-frequency, heterogeneous events from multiple sources:

Train arrivals at every station on three colour lines (Blue, Red, Green), each carrying 10 trains
Turnstile entry counts produced per time-step at every station
Hourly weather readings
Static station reference data held in a relational database

A naive polling or REST-request-per-event approach would not scale to the volume, would couple producers tightly to consumers, and would make it difficult to replay events for new consumers.

Decision

Apache Kafka (Confluent Platform 5.2.2) is used as the single, central event streaming backbone. All data flows in and out of Kafka topics; no service communicates directly with another.

Evidence from code:

Source	Topic	Producer mechanism
Train simulation	`org.chicago.cta.station.arrivals.t001`	`confluent_kafka` `AvroProducer`
Turnstile simulation	`com.cta.stations.turnstile.entry`	`confluent_kafka` `AvroProducer`
Weather simulation	`org.chicago.cta.weather.v1`	Kafka REST Proxy (HTTP POST)
PostgreSQL stations table	`com.cta.stations.data.rawt001.stations`	Kafka Connect JDBC Source
Faust stream processor	`org.chicago.cta.stations.table.v1t001`	Faust internal producer
KSQL aggregation	`TURNSTILE_SUMMARY`	KSQL internal producer

Topics are created with LZ4 compression, a short delete-retention window (2 s) suitable for real-time transit dashboards, and 10 partitions by default for arrival topics (producers/models/producer.py:18-32).

Alternatives Considered

Alternative	Reason Rejected
RabbitMQ / AMQP	No log-replay; difficult to add consumers without re-engineering
REST polling from dashboard	Tight coupling, synchronous latency, no fan-out
Redis Streams	Weaker ecosystem for schema enforcement and SQL-style aggregations

Consequences

Positive

Producers and consumers are fully decoupled; new consumers (e.g. analytics) can subscribe independently without touching producers.
Log retention enables late-joining consumers to replay from the earliest offset (offset_earliest=True in consumers/server.py:73-91).
Kafka's partition model provides horizontal scale-out for high-throughput arrival events.

Negative / Risks

Single-broker setup (kafka0 in docker-compose.yaml) is a single point of failure for local development; production would require at minimum 3 brokers.
Replication factor is set to 1 throughout — data loss on broker failure.
Hard-coded localhost:9092 in producer bootstrap config (producers/models/producer.py:66) couples the code to the local Docker environment.

ADR-002: Avro Schemas + Confluent Schema Registry for Message Contracts

2026-05-19T00:00:00.000Z

Date: 2026-03-12
Status: Accepted
Deciders: Engineering Team

Context

Multiple independent processes — producers written in Python and stream processors written with Faust and KSQL — exchange messages over Kafka. Without a shared, versioned contract, a schema change in a producer silently breaks downstream consumers. The system needs:

A machine-readable schema for every message type.
A registry that enforces backward/forward compatibility on publish.
Consumers that can deserialise messages without embedding the schema in every message.

Decision

Apache Avro is the default serialisation format for first-party Kafka topics, with Confluent Schema Registry (port 8081) acting as the central schema store. Python producers that use the shared producer base class publish Avro-encoded messages via AvroProducer, and consumers use Avro deserialisation for those Avro-backed topics via AvroConsumer from confluent-kafka-python.

There are explicit exceptions to that default path. Weather data is produced via the REST Proxy (producers/models/weather.py) rather than through AvroProducer. The dashboard also consumes some JSON topics with is_avro=False in consumers/server.py, including the stations table and the TURNSTILE_SUMMARY topic.

Schema files are stored as JSON alongside the producer models:

producers/models/schemas/
  arrival_key.json
  arrival_value.json
  turnstile_key.json
  turnstile_value.json
  weather_key.json
  weather_value.json

Representative schema (arrival_value.json):

{
  "namespace": "com.udacity",
  "type": "record",
  "name": "arrival.value",
  "fields": [
    {"name": "station_id",       "type": "int"},
    {"name": "train_id",         "type": "string"},
    {"name": "direction",        "type": "string"},
    {"name": "line",             "type": ["null","string"]},
    {"name": "train_status",     "type": ["null","string"]},
    {"name": "prev_station_id",  "type": ["null","int"]},
    {"name": "prev_direction",   "type": ["null","string"]}
  ]
}

The producer base class wires schemas at construction time (producers/models/producer.py:75-77):

self.avroProducer = AvroProducer(
    {"bootstrap.servers": "...", "schema.registry.url": "http://localhost:8081"},
    default_key_schema=self.key_schema, default_value_schema=self.value_schema)

As noted above, the TURNSTILE_SUMMARY topic produced by KSQL is one such exception: it uses JSON encoding (VALUE_FORMAT='JSON') and is consumed without Avro deserialisation (consumers/server.py:87-91).

Alternatives Considered

Alternative	Reason Rejected
JSON (plain)	No schema enforcement; brittle under field renames
Protobuf	Supported by Confluent but less native to the Python confluent-kafka library at the time
MessagePack	No registry ecosystem; debugging harder

Consequences

Positive

Schema Registry enforces compatibility before messages are published.
Schema IDs are embedded in the Avro wire format — consumers can always retrieve the exact schema used to write a message.
Faust's faust.Record dataclasses mirror the Avro schema structure, making the contract explicit in both the registry and the Python type system (consumers/faust_stream.py:14-33).
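
The embedded schema ID can be seen by looking at the framing Confluent serialisers put in front of the Avro payload: one magic byte (0) followed by a 4-byte big-endian schema ID. The sketch below splits such a frame using the standard library only. The function name and the sample bytes are hypothetical, and the Avro payload itself is left undecoded.

```python
import struct
from typing import Tuple

MAGIC_BYTE = 0  # marker used by the Confluent wire format


def split_confluent_frame(message: bytes) -> Tuple[int, bytes]:
    """Split a Confluent-framed message into (schema_id, avro_payload)."""
    if len(message) < 5:
        raise ValueError("message too short for the 5-byte header")
    # ">bI" = big-endian, 1 signed byte (magic) + 4-byte unsigned int (schema ID).
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[5:]


# Hypothetical frame: schema ID 42 followed by opaque Avro bytes.
frame = struct.pack(">bI", MAGIC_BYTE, 42) + b"\x02\x06abc"
schema_id, payload = split_confluent_frame(frame)
print(schema_id, payload)  # 42 b'\x02\x06abc'
```

A consumer uses the extracted ID to fetch the writer schema from the registry before decoding the remaining bytes.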

Negative / Risks

AvroProducer is marked as a legacy API in newer Confluent SDK versions; migration to SerializingProducer with AvroSerializer will be needed.
The KSQL TURNSTILE_SUMMARY topic diverges from the Avro convention (uses JSON), creating an inconsistency that consumers must handle explicitly (is_avro=False).
Schema files live inside producers/ only; the consumer side has no local copy, creating a coupling between producer deployment and consumer startup.

ADR-003: Kafka Connect JDBC Source for PostgreSQL Station Data

2026-05-19T00:00:00.000Z

Date: 2026-03-12
Status: Accepted
Deciders: Engineering Team

Context

Station reference data (stop IDs, names, line membership, ordering) is stored in a PostgreSQL table (stations) seeded from a CSV file at container start-up (load_stations.sql). The consumer side needs this data in Kafka so that stream processors (Faust) can enrich and transform it alongside real-time event streams.

Two ingestion options were on the table:

Write a bespoke Python producer that reads from the database and publishes to Kafka.
Use a managed connector that understands JDBC semantics.

Decision

Confluent Kafka Connect with the JdbcSourceConnector is used to stream the stations table from PostgreSQL into Kafka automatically.

The connector is configured programmatically at simulation start-up via the Kafka Connect REST API (producers/connector.py:16-57):

Config key	Value	Rationale
`connector.class`	`io.confluent.connect.jdbc.JdbcSourceConnector`	Standard JDBC source
`mode`	`incrementing`	Detects new rows via monotonically increasing `stop_id`
`incrementing.column.name`	`stop_id`	Primary key / surrogate key for new-row detection
`table.whitelist`	`stations`	Scope connector to a single table
`topic.prefix`	`com.cta.stations.data.rawt001.`	Output topic = prefix + table name
`poll.interval.ms`	`3600000` (1 h)	Station data is quasi-static; hourly polling is sufficient
`batch.max.rows`	`500`	Limits per-poll memory footprint

The connector is idempotent — if it already exists the setup function returns early (producers/connector.py:19-22).
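
The check-then-register flow can be sketched as follows. This is not the code in producers/connector.py: the Connect address (port 8083), the connector name, and the function names are assumptions, and the JDBC connection settings are left out on purpose. The `opener` parameter exists only so the logic can be exercised without a running Connect worker.

```python
import json
import urllib.error
import urllib.request

CONNECT_URL = "http://localhost:8083/connectors"  # assumed Kafka Connect address
CONNECTOR_NAME = "stations"                        # hypothetical connector name


def build_config() -> dict:
    """Connector settings taken from the table above."""
    return {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "mode": "incrementing",
        "incrementing.column.name": "stop_id",
        "table.whitelist": "stations",
        "topic.prefix": "com.cta.stations.data.rawt001.",
        "poll.interval.ms": "3600000",
        "batch.max.rows": "500",
        # connection.url / connection.user / connection.password are omitted
        # here; they should come from the environment, not from source code.
    }


def connector_exists(name, opener=urllib.request.urlopen) -> bool:
    """GET /connectors/<name> answers 200 if registered and 404 if not."""
    try:
        with opener(f"{CONNECT_URL}/{name}") as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise


def ensure_connector(name, opener=urllib.request.urlopen) -> bool:
    """Register the connector unless it already exists. True means created."""
    if connector_exists(name, opener):
        return False  # already registered: nothing to do
    body = json.dumps({"name": name, "config": build_config()}).encode()
    req = urllib.request.Request(
        CONNECT_URL, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )
    with opener(req) as resp:
        return resp.status in (200, 201)
```

Calling `ensure_connector(CONNECTOR_NAME)` twice in a row should create the connector on the first call and return early on the second.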

Faust then reads from the output topic com.cta.stations.data.rawt001.stations (consumers/faust_stream.py:40).

Alternatives Considered

Alternative	Reason Rejected
Custom Python producer reading from PostgreSQL	More code to maintain; no built-in retry or offset tracking
Debezium CDC connector	Overkill for quasi-static reference data; requires PostgreSQL WAL configuration
Reading CSV directly in producer	Bypasses the Kafka pipeline; consumers cannot subscribe independently

Consequences

Positive

Zero custom ingestion code; the connector handles polling, batching, and offset management.
Decouples the database schema from producer code — schema changes propagate via the connector.
New consumers of station data subscribe to the Kafka topic without touching the database.

Negative / Risks

incrementing mode only detects inserts, not updates or deletes; stale station data will not be corrected unless the connector is reset.
Hard-coded credentials (cta_admin / chicago) in connector.py:43-44 must be externalised for any non-local environment.
The connector is registered once at simulation start; a crash before registration completes leaves no station data in Kafka.
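
The first risk is easier to see with a simplified model of incrementing mode, in which each poll returns only the rows whose stop_id is greater than the last stored offset. The sketch below is a plain-Python simulation, not connector code; the rows and station names are made up.

```python
def poll_incrementing(rows, last_seen_id):
    """Return (rows with stop_id above the stored offset, new offset)."""
    new_rows = [r for r in rows if r["stop_id"] > last_seen_id]
    new_offset = max([last_seen_id] + [r["stop_id"] for r in new_rows])
    return new_rows, new_offset


# Hypothetical station rows.
table = [{"stop_id": 1, "station_name": "Alpha"},
         {"stop_id": 2, "station_name": "Beta"}]

first, offset = poll_incrementing(table, last_seen_id=0)
print([r["stop_id"] for r in first], offset)   # [1, 2] 2

# An UPDATE to an existing row and an INSERT of a new one.
table[0]["station_name"] = "Alpha (renamed)"
table.append({"stop_id": 3, "station_name": "Gamma"})

second, offset = poll_incrementing(table, offset)
print([r["stop_id"] for r in second], offset)  # [3] 3
```

The renamed row never reappears in a poll because its stop_id is below the stored offset, which is why corrections to existing stations do not reach Kafka.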

ADR-004: Dual Stream-Processing Engines — Faust (Python) + KSQL

2026-05-19T00:00:00.000Z

Date: 2026-03-12
Status: Accepted
Deciders: Engineering Team

Context

Two distinct stream-processing requirements exist:

Station enrichment — raw station rows arriving from the JDBC connector carry boolean red/blue/green columns. A downstream topic is needed that replaces these booleans with a single line string field and retains only the fields required by the UI model.
Turnstile aggregation — individual turnstile-entry events must be aggregated into a count per station so the dashboard can display a single rider-count per station rather than a stream of raw entry records.

These two problems have different shapes: the first is a stateless record-by-record transformation; the second is a stateful GROUP BY aggregation.
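
The difference between the two shapes can be sketched in plain Python, independent of Faust or KSQL. The `RawStation` class, the helper names, and the sample IDs are hypothetical stand-ins; the point is only that the first function needs nothing but its input record, while the second must remember earlier events.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class RawStation:  # hypothetical stand-in for a row from the JDBC connector
    station_id: int
    station_name: str
    order: int
    red: bool
    blue: bool
    green: bool


def line_of(station: RawStation) -> Optional[str]:
    """Stateless: the result depends only on the record passed in."""
    if station.red:
        return "red"
    if station.blue:
        return "blue"
    if station.green:
        return "green"
    return None


class TurnstileCounts:
    """Stateful: each event updates a running count keyed by station_id."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def on_entry(self, station_id: int) -> int:
        self.counts[station_id] += 1
        return self.counts[station_id]


print(line_of(RawStation(40010, "Example", 1, False, True, False)))  # blue

agg = TurnstileCounts()
for sid in (40010, 40020, 40010):
    agg.on_entry(sid)
print(dict(agg.counts))  # {40010: 2, 40020: 1}
```

Restarting the first function loses nothing, whereas restarting the second loses the counts unless they are rebuilt from the event log.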

Decision

Two separate stream-processing tools are used, each chosen for its natural fit with one problem:

Faust — station transformation (`consumers/faust_stream.py`)

Faust is a Python-native stream-processing library. It is used to:

Subscribe to com.cta.stations.data.rawt001.stations
Produce TransformedStation records to org.chicago.cta.stations.table.v1t001
Maintain an in-memory Faust Table as a materialised view keyed by station_id

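@app.agent(in_topic)
async def transform_stations(in_stations):
    async for sn in in_stations:
        # Start with a placeholder line, then map the boolean columns to a name.
        t = TransformedStation(sn.station_id, sn.station_name, sn.order, "na")
        if sn.red:
            t.line = "red"
        elif sn.blue:
            t.line = "blue"
        elif sn.green:
            t.line = "green"
        else:
            continue  # no line flag set: skip this row
        # Upsert into the Faust Table keyed by station_id.
        table[sn.station_id] = t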

KSQL — turnstile aggregation (`consumers/ksql.py`)

KSQL (now ksqlDB) is used to express a SQL aggregation over the turnstile topic:

CREATE TABLE turnstile ( ... )
WITH (KAFKA_TOPIC='com.cta.stations.turnstile.entry', VALUE_FORMAT='AVRO', KEY='station_id');

CREATE TABLE TURNSTILE_SUMMARY WITH (VALUE_FORMAT='JSON') AS
    SELECT station_id, count(station_id) as COUNT
    FROM turnstile GROUP BY station_id;

The KSQL statement is submitted via the KSQL REST API at consumer start-up and is idempotent — it is skipped if TURNSTILE_SUMMARY already exists.

Alternatives Considered

Alternative	Reason Rejected
Kafka Streams (Java)	Project is Python-only; JVM dependency is undesirable
Single Faust app for both transformations	Aggregation with Faust Tables is more complex than KSQL GROUP BY; KSQL is more expressive for SQL aggregations
Single KSQL for both transformations	KSQL cannot run arbitrary Python logic; Faust keeps the transformation in the same language as the rest of the application

Consequences

Positive

Each tool is used for its core strength: Faust for Python-idiomatic record transformation, KSQL for declarative aggregation.
The Faust app and KSQL statements are independently deployable and restartable.

Negative / Risks

Two different stream-processing runtimes increase operational surface area (two separate processes to start, monitor, and upgrade).
The Faust Table uses store="memory://" — state is lost on restart; the table is rebuilt from Kafka on each startup, which adds startup latency.
KSQL's output (TURNSTILE_SUMMARY) uses JSON while most other topics use Avro, creating a serialisation inconsistency (see ADR-002).
The consumers/server.py startup guard (topic_check) stops the dashboard from starting if either Faust or KSQL has not yet produced its output topic, creating an implicit startup ordering dependency.

ADR-005: Kafka REST Proxy for the Weather Producer

2026-05-19T00:00:00.000Z

Date: 2026-03-12
Status: Accepted
Deciders: Engineering Team

Context

The weather simulation model (producers/models/weather.py) needs to publish Avro-encoded records to Kafka. All other producers in the system use the confluent-kafka Python library's AvroProducer directly against the broker.

During development, a second integration path was explored: the Confluent Kafka REST Proxy (port 8082), which accepts HTTP POST requests with embedded schemas and records.

Decision

The Weather producer uses the Kafka REST Proxy instead of a native Kafka producer client.

Implementation (producers/models/weather.py:71-86):

resp = requests.post(
    f"{constants.Constants.rest_proxy_url}topics/{constants.Constants.weather_topic_name}",
    headers={"Content-Type": "application/vnd.kafka.avro.v2+json"},
    data=json.dumps({
        "key_schema":   json.dumps(Weather.key_schema),
        "value_schema": json.dumps(Weather.value_schema),
        "records": [{"value": {...}, "key": {"timestamp": self.time_millis()}}]
    }),
)

Schema JSON is inlined in every POST payload rather than being pre-registered with the Schema Registry. The REST Proxy handles registration transparently.
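
The double encoding is the part that tends to surprise: the request body is JSON, and inside it the schemas are themselves JSON strings. The sketch below builds such a body with the standard library. The schemas, field names, and values are hypothetical placeholders (the real ones live in weather_key.json and weather_value.json), and no HTTP request is sent.

```python
import json

# Hypothetical schemas, used only to illustrate the payload shape.
KEY_SCHEMA = {"type": "record", "name": "weather.key",
              "fields": [{"name": "timestamp", "type": "long"}]}
VALUE_SCHEMA = {"type": "record", "name": "weather.value",
                "fields": [{"name": "temperature", "type": "float"},
                           {"name": "status", "type": "string"}]}


def build_avro_payload(key: dict, value: dict) -> str:
    """Build the JSON body for an Avro POST to the REST Proxy topic endpoint."""
    return json.dumps({
        # The schemas travel as JSON *strings*, hence the nested json.dumps.
        "key_schema": json.dumps(KEY_SCHEMA),
        "value_schema": json.dumps(VALUE_SCHEMA),
        "records": [{"key": key, "value": value}],
    })


body = build_avro_payload({"timestamp": 1700000000000},
                          {"temperature": 21.5, "status": "sunny"})
decoded = json.loads(body)
print(type(decoded["key_schema"]).__name__)            # str
print(json.loads(decoded["key_schema"])["fields"][0])  # {'name': 'timestamp', 'type': 'long'}
```

Because both schema strings are repeated in every body, the payload size is dominated by schema text when the records are small.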

Alternatives Considered

Alternative	Reason Rejected
Native `AvroProducer` (used by other producers)	Both approaches produce identical results; REST Proxy was chosen to demonstrate the capability
`requests` posting to the broker directly	Kafka wire protocol is binary and not HTTP-accessible without a proxy

Consequences

Positive

Demonstrates an HTTP-based integration path useful for polyglot producers (languages without a native Kafka client library).
No Kafka client dependency required in the producing service.

Negative / Risks

An additional network hop (producer → REST Proxy → broker) adds latency compared to the native client path.
Inlining the full schema JSON in every request is wasteful; the Schema Registry already holds the schema after the first publish.
Error handling on HTTP failures is minimal: when raise_for_status() fails, the error is logged but the weather event itself is dropped.
Inconsistency: weather uses REST Proxy while all other producers use the native client, increasing cognitive overhead for maintainers.

ADR-006: Tornado Async Web Server for the Real-Time Transit Dashboard

2026-05-19T00:00:00.000Z

Date: 2026-03-12
Status: Accepted
Deciders: Engineering Team

Context

The consumer layer must simultaneously:

Poll four Kafka topics continuously and update in-memory state (weather, line status, arrivals, turnstile counts).
Serve HTTP GET requests that render the current state as an HTML page.

A blocking web server would stall Kafka consumption while handling HTTP requests. A blocking Kafka consumer would stall HTTP responses while waiting for new messages.

Decision

The Tornado asynchronous web framework is used as the server runtime (consumers/server.py). Kafka consumers are scheduled as Tornado IO loop callbacks:

for consumer in consumers:
    tornado.ioloop.IOLoop.current().spawn_callback(consumer.consume)
tornado.ioloop.IOLoop.current().start()

Each KafkaConsumer.consume() is an async coroutine that yields control between polls via await gen.sleep(self.sleep_secs) (consumers/consumer.py:70-76). The HTTP handler renders state synchronously on GET without blocking Kafka consumption.
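
The cooperative scheduling idea can be shown with the standard library's asyncio, used here as an analogue rather than Tornado itself. The coroutine names, the shared `state` dict, and the timings are all hypothetical; no Kafka or HTTP traffic is involved.

```python
import asyncio

state = {"polls": 0, "renders": 0}


async def consume(sleep_secs: float, rounds: int) -> None:
    """Stand-in for a consumer loop: do a unit of work, then yield."""
    for _ in range(rounds):
        state["polls"] += 1               # pretend a batch of messages was handled
        await asyncio.sleep(sleep_secs)   # hand control back to the event loop


async def serve(rounds: int) -> None:
    """Stand-in for HTTP handling interleaved with the consumer."""
    for _ in range(rounds):
        state["renders"] += 1
        await asyncio.sleep(0)


async def main() -> None:
    # Both coroutines share one thread and one event loop.
    await asyncio.gather(consume(0.01, 3), serve(3))


asyncio.run(main())
print(state)  # {'polls': 3, 'renders': 3}
```

Neither coroutine blocks the other because each gives up control at its await points, which is the property the dashboard relies on.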

Four consumers are registered on startup:

Consumer	Topic	Avro?
Weather	`org.chicago.cta.weather.v1`	Yes
Stations table	`^org.chicago.cta.stations.table.*`	No (regex, JSON)
Train arrivals	`^org.chicago.cta.station.arrivals.*`	Yes (regex)
Turnstile summary	`TURNSTILE_SUMMARY`	No (JSON)

All consumers share a single group.id (com.chicago.transport.consumer.group.1).

The server listens on port 8888 and serves a single route (/) rendered from consumers/templates/status.html.

Alternatives Considered

Alternative	Reason Rejected
Flask / Django (synchronous)	Cannot multiplex Kafka polling with HTTP serving without threads
asyncio + aiohttp	Viable alternative; Tornado chosen for built-in IOLoop integration matching `confluent_kafka` callback style
Separate Kafka consumer process + shared state store (Redis)	Over-engineered for a dashboard with a single user

Consequences

Positive

Single process handles both Kafka consumption and HTTP serving without threading.
Tornado's spawn_callback allows an arbitrary number of consumers to coexist on one event loop.
Simple HTML template rendering — no JavaScript framework needed for the status page.

Negative / Risks

State is stored in Python objects (Weather, Lines) in process memory; any restart loses accumulated state until topics are re-consumed from the earliest offset.
All four consumers share one group.id, meaning if a second dashboard instance were started it would steal partitions from the first.
The dashboard refuses to start if the KSQL or Faust topics are not yet ready (hard exit(1) at consumers/server.py:49-57), which forces a manual start order.
No authentication or HTTPS on port 8888.

Architecture Decision Records

2026-05-19T00:00:00.000Z

This directory contains Architecture Decision Records (ADRs) for the CTA Public Transport Optimisation system. ADRs are generated by reverse-engineering the codebase as of 2026-03-12.

Index

ADR	Title	Status
ADR-001	Apache Kafka as the Central Event Bus	Accepted
ADR-002	Avro Schemas + Confluent Schema Registry	Accepted
ADR-003	Kafka Connect JDBC Source for PostgreSQL	Accepted
ADR-004	Dual Stream Processing — Faust + KSQL	Accepted
ADR-005	Kafka REST Proxy for Weather Producer	Accepted
ADR-006	Tornado Async Web Server for Dashboard	Accepted

System Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                        PRODUCER LAYER                               │
│                                                                     │
│  simulation.py                                                      │
│    ├─ Line (Blue/Red/Green)                                         │
│    │    ├─ Station ──────────────────► org.chicago.cta.station.     │
│    │    │   (AvroProducer)              arrivals.t001               │
│    │    └─ Turnstile ─────────────────► com.cta.stations.           │
│    │        (AvroProducer)               turnstile.entry            │
│    └─ Weather ──────────────────────►  org.chicago.cta.weather.v1  │
│         (REST Proxy HTTP POST)                                      │
│                                                                     │
│  connector.py ──[JDBC Source]──► com.cta.stations.data.rawt001.    │
│  (Kafka Connect)                  stations  (from PostgreSQL)       │
└────────────────────────────┬────────────────────────────────────────┘
                             │  Apache Kafka  (+ Schema Registry)
┌────────────────────────────▼────────────────────────────────────────┐
│                    STREAM PROCESSING LAYER                          │
│                                                                     │
│  faust_stream.py                                                    │
│    com.cta.stations.data.rawt001.stations                           │
│      ──[transform]──► org.chicago.cta.stations.table.v1t001        │
│                                                                     │
│  ksql.py                                                            │
│    com.cta.stations.turnstile.entry                                 │
│      ──[GROUP BY station_id]──► TURNSTILE_SUMMARY (JSON)            │
└────────────────────────────┬────────────────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────────────┐
│                      CONSUMER / DASHBOARD LAYER                     │
│                                                                     │
│  server.py  (Tornado, port 8888)                                    │
│    KafkaConsumer × 4  ──► in-memory Weather + Lines state           │
│    GET /  ──► status.html                                           │
└─────────────────────────────────────────────────────────────────────┘

Key Technology Choices

Concern	Technology	ADR
Event streaming	Apache Kafka (Confluent 5.2.2)	ADR-001
Schema enforcement	Apache Avro + Confluent Schema Registry	ADR-002
DB-to-Kafka ingestion	Kafka Connect JDBC Source	ADR-003
Station transformation	Faust (Python stream processor)	ADR-004
Turnstile aggregation	KSQL	ADR-004
HTTP-based produce path	Kafka REST Proxy	ADR-005
Real-time web dashboard	Tornado async web server	ADR-006
Infrastructure	Docker Compose (single-broker dev cluster)	ADR-001

Welcome to the JigsawFlux Blog

2026-05-15T00:00:00.000Z

Welcome to the JigsawFlux Blog — a space for updates, ideas, and stories from the JigsawFlux open-source community.

What is JigsawFlux?

JigsawFlux is an open-source organisation focused on building tools that make a real difference in health tech, crisis management, and humanitarian response. Every project we build is open, non-profit in spirit, and built to solve real problems that affect real people.

What this blog is for

This blog will be updated weekly or monthly with:

Project updates — what we're building, what's shipping, what's next
Technical deep-dives — architecture decisions, lessons learned, interesting engineering problems
Community stories — contributors, use cases, and the humans behind the code
Ideas and proposals — things we're thinking about, inviting input before we build

Get involved

If you want to contribute — code, documentation, design, testing, or ideas — the best place to start is jigsawflux.org/contribute. Every bit helps.

You can also follow along via the RSS feed or the GitHub organisation.

Here's to building tools that matter.

Running a Local LLM with Ollama and MCP — An Architecture Spike

2026-05-15T00:00:00.000Z

AI inference doesn't have to mean a cloud API call. This post walks through a spike I built to run a locally hosted language model through a clean, layered architecture using Ollama and the Model Context Protocol (MCP).

The full source is on GitHub. Contributions and feedback welcome.

Why Local LLMs?

Most AI integrations today rely on external APIs — you send a prompt to a cloud provider, they run the model, you get a response. That works fine for many use cases, but it comes with real tradeoffs:

Privacy: your prompts leave your network
Connectivity: requires reliable internet
Cost: token-based pricing adds up quickly at scale
Latency: round-trip to a remote data centre

For health tech, crisis response, and humanitarian tools — areas where JigsawFlux focuses — these tradeoffs can matter a great deal. Patient data is sensitive. Field teams in disaster zones may have limited connectivity. Running AI locally sidesteps all of these concerns.

Local LLMs have become genuinely useful. Models like llama3.2:3b run comfortably on consumer hardware and handle a wide range of practical tasks: summarisation, triage, Q&A, and structured extraction.

What is Ollama?

Ollama is an open-source tool that makes running LLMs locally straightforward. Think of it as a package manager for AI models — pull a model, run it, and interact with it over a local HTTP API.

ollama pull llama3.2:3b
ollama run llama3.2:3b

Once running, Ollama exposes a REST API on localhost:11434 (or a networked host). It handles model loading, memory management, and inference. You interact with it exactly as you would a cloud API, just without the round-trip to a remote data centre.

For this spike, Ollama runs on a separate machine on my local network at 192.168.1.80 — a common setup where a more capable machine hosts the model and lighter clients query it.

Setting Up Ollama on Linux

Installation

The quickest way is the official install script — one line and it handles everything:

curl -fsSL https://ollama.com/install.sh | sh

If you prefer to install manually (air-gapped environments, or you want control over the binary location):

# Download the Linux binary
curl -L https://ollama.com/download/ollama-linux-amd64 -o ollama

# Make it executable
chmod +x ollama

# Move to a directory in your PATH
sudo mv ollama /usr/local/bin/

Starting the Server

By default, ollama serve binds to 127.0.0.1 — only accessible from the same machine. For this spike, Ollama runs on a dedicated machine on the local network and the application connects to it from a different host. To allow that, bind to all interfaces:

OLLAMA_HOST=0.0.0.0 ollama serve

To make this permanent (e.g. as a systemd service override):

sudo systemctl edit ollama.service

Add the following and save:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"

Then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

The server now listens on port 11434 on all network interfaces. Other machines on the same network can reach it at http://<server-ip>:11434 (http://192.168.1.80:11434 in this setup).

Pulling a Model

Before the server can handle requests, pull the model you want to use:

ollama pull llama3.2:3b

Testing the API

Ollama exposes a simple HTTP API. Once the server is running, you can test it directly with curl — no SDK needed.

Generate completion (single prompt, no conversation context):

curl $OLLAMA_HOST/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Explain kubernetes in one paragraph",
  "stream": false
}'

Chat completion (conversation format with message roles):

curl $OLLAMA_HOST/api/chat -d '{
  "model": "llama3.2:3b",
  "messages": [
    {"role": "user", "content": "What is Docker?"}
  ],
  "stream": false
}'

The /api/generate endpoint is stateless — one prompt in, one response out. The /api/chat endpoint accepts a messages array so you can pass conversation history and get contextually aware replies. This spike uses /api/chat for both the chat and summarise tools.

Set OLLAMA_HOST in your environment or .env file to point at the machine running Ollama:

OLLAMA_HOST=http://192.168.1.80:11434
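
The same chat call can be made from Python without an SDK. The following is a minimal sketch using only the standard library; the function names and the default timeout are assumptions, and it reads OLLAMA_HOST from the environment as described above. Building the request is kept separate from sending it so the payload can be inspected without a running server.

```python
import json
import os
import urllib.request

OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "http://localhost:11434")


def build_chat_request(prompt: str, model: str = "llama3.2:3b",
                       host: str = OLLAMA_HOST) -> urllib.request.Request:
    """Build a non-streaming POST to /api/chat with a single user message."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{host.rstrip('/')}/api/chat", data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )


def chat(prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        # With "stream": false, /api/chat returns one JSON object
        # whose "message" field holds the assistant reply.
        return json.loads(resp.read())["message"]["content"]
```

With the server reachable, `print(chat("What is Docker?"))` should print the model's reply. To carry conversation history, append earlier messages to the `messages` list before sending.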

What is MCP?

The Model Context Protocol (MCP) is an open standard from Anthropic for structuring communication between AI hosts and the tools or services they call. It defines a consistent way to:

Expose tools — name them, describe their inputs/outputs
Call tools — invoke them with structured arguments
Handle responses — receive results in a predictable format

The key idea is separation of concerns. Your application logic doesn't need to know the details of how a model is invoked — it just calls a tool. The MCP server handles the translation to whatever inference backend is running.

This makes the architecture composable. Swap Ollama for another model runner, add a new tool, or connect a different MCP client — each layer stays isolated.

Architecture of the Spike

The system has four main layers:

Layer 1: Browser (Frontend)

A vanilla HTML/CSS/JS frontend with two tabs — Chat and Summarise. No framework, no build step. The UI handles user input, loading states, and error display. It talks only to the Express backend and never directly to Ollama.

Layer 2: Express Backend (API Layer)

A Node.js/Express server that exposes three endpoints:

Endpoint	Description
`GET /health`	Returns status of backend, MCP, and Ollama
`POST /chat`	Accepts a prompt, returns a model response
`POST /summarise`	Accepts text and optional style, returns a summary

The backend holds a persistent MCP client session — one long-lived connection to the MCP server rather than spinning up a new process per request. It also assigns a correlation ID to each request for tracing.

Layer 3: MCP Server (Tool Layer)

A stdio-based MCP server written in TypeScript. It registers three tools:

health_check — pings Ollama and returns status
chat — sends a prompt to the model
summarise — sends a text block for summarisation

The MCP server contains an Ollama adapter — a single module that centralises all Ollama API communication. Connection errors, timeouts, and model-not-found responses are normalised here into consistent, user-actionable messages.
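
The spike's adapter is written in TypeScript, so the following Python sketch only illustrates the idea of normalising failures in one place. The function name and the message wording are hypothetical, and treating HTTP 404 as "model not found" is an assumption about how the Ollama API reports a missing model.

```python
import socket
import urllib.error


def normalise_error(exc: Exception, host: str, model: str) -> str:
    """Translate low-level failures into a message a user can act on."""
    timeout_types = (socket.timeout, TimeoutError)
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        if exc.code == 404:
            return f"Model '{model}' was not found. Try: ollama pull {model}"
        return f"Ollama returned HTTP {exc.code}."
    if isinstance(exc, timeout_types):
        return f"Ollama at {host} did not respond in time."
    if isinstance(exc, urllib.error.URLError):
        # urllib wraps some timeouts inside URLError.reason.
        if isinstance(exc.reason, timeout_types):
            return f"Ollama at {host} did not respond in time."
        return f"Cannot reach Ollama at {host}. Is the server running?"
    return "Unexpected error while contacting Ollama."


print(normalise_error(urllib.error.URLError("connection refused"),
                      "http://192.168.1.80:11434", "llama3.2:3b"))
# Cannot reach Ollama at http://192.168.1.80:11434. Is the server running?
```

Keeping this mapping in a single function means the tools above it never have to inspect exception types themselves.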

Layer 4: Ollama (Inference Layer)

The model runner. This spike uses llama3.2:3b — a capable 3-billion parameter model that runs fast on modest hardware. Ollama receives requests at /api/chat and returns model completions.

Request Flow

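Here is what happens when you submit a prompt in the chat UI: the browser sends the prompt to the Express backend's POST /chat endpoint; the backend calls the chat tool over its persistent MCP session; the MCP server's Ollama adapter forwards the request to Ollama's /api/chat endpoint; and the completion returns along the same path to the UI.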

The response includes the model name and duration — useful for understanding latency and knowing which model answered.

Key Design Decisions

Isolated Ollama adapter. All HTTP calls to Ollama live in one file (ollama-adapter.ts). Changing the inference backend, adding retry logic, or switching models only requires touching this one module.

Stdio MCP transport. The MCP server runs as a child process communicating over stdio rather than a network socket. This keeps deployment simple — no extra port to manage — and makes it easy to integrate with other MCP-compatible clients like VS Code Copilot.

Single persistent MCP session. The backend creates one MCP client session at startup and reuses it across requests. This avoids the overhead of spawning a new process per request and keeps session state coherent.

Vanilla frontend. No React, no Vite, no build pipeline. The frontend is plain HTML/CSS/JS served directly. This keeps the system portable — it can run anywhere a static file server exists.

Security baseline. The browser never talks to Ollama directly. All inference is proxied through the backend. Request body size limits are applied. Sensitive text is redacted from error logs.

Deployment View

Running It Locally

The project structure looks like this:

ollama-claude/
├── mcp-server/           # MCP server + Ollama adapter (TypeScript)
├── backend/
│   ├── ...               # Express API + MCP client session (TypeScript)
│   └── k8s-deployment/   # Kubernetes manifests for MicroK8s
├── frontend/             # Static UI (HTML/CSS/JS)
├── start.sh              # Unix startup script
└── start.bat             # Windows startup script

To get started:

# Copy environment files and configure your Ollama host
cp .env.example .env
# Edit OLLAMA_HOST to point at your Ollama instance

# Start all three services
./start.sh          # macOS/Linux
start.bat           # Windows

Then open http://localhost:3000 in your browser.

Deploying on Kubernetes (MicroK8s)

The backend/k8s-deployment/ folder contains production-ready manifests for running the Ollama stack on MicroK8s with MetalLB and Nginx ingress. The key files are:

Manifest	Purpose
`ollama-stack.yaml`	`ollama` namespace, 50 Gi PVC, Ollama StatefulSet, ClusterIP service
`ollama-automated.yaml`	Variant that pre-pulls `llama3.1:8b` via an `initContainer` on first boot
`open-webui-stack.yaml`	Open WebUI deployment with 10 Gi PVC, pointed at the internal Ollama service
`ollama-ingress.yaml`	Exposes Ollama at `ollama.local`
`open-webui-ingress.yaml`	Exposes Open WebUI at `ai.local`
`dashboard-ingress.yaml`	Exposes the Kubernetes dashboard at `dashboard.local` (SSL passthrough)
`ingress-lb.yaml`	LoadBalancer service for the Nginx ingress controller (ports 80/443)

Apply them in order:

kubectl apply -f backend/k8s-deployment/ollama-stack.yaml
kubectl apply -f backend/k8s-deployment/open-webui-stack.yaml
kubectl apply -f backend/k8s-deployment/ollama-ingress.yaml
kubectl apply -f backend/k8s-deployment/open-webui-ingress.yaml
kubectl apply -f backend/k8s-deployment/dashboard-ingress.yaml
kubectl apply -f backend/k8s-deployment/ingress-lb.yaml

Prerequisites: MicroK8s with dns, storage, ingress, and metallb add-ons enabled. Add ollama.local, ai.local, and dashboard.local to your /etc/hosts pointing at the MetalLB-assigned IP.

GPU support is optional — uncomment the nvidia.com/gpu: 1 resource limit in ollama-stack.yaml after running microk8s enable gpu.

Demo

In this demo, I asked the same offbeat question to both my locally hosted Ollama instance and Microsoft Copilot.

The response below is from my locally hosted Ollama instance:

The chat tab lets you send prompts directly to the model. The health indicator in the header shows live status of the MCP and Ollama connections. If either goes down, the indicator updates and requests fail with a clear error message rather than a silent timeout.

The response below is from Microsoft Copilot:

What's Next

This is a spike — a working proof of concept, not production software. The obvious next steps are:

Streaming responses — currently the UI waits for the full completion; streaming would feel much more interactive
Conversation history — right now each prompt is stateless; persisting context would allow follow-up questions
Structured logging — correlation IDs are assigned but not yet threaded through all log lines
Docker packaging — containerising all three services would make deployment reproducible anywhere

For JigsawFlux, the interesting application of this architecture is in tools for field teams and health workers — where data privacy matters, connectivity is unreliable, and a locally-running model on a shared device could provide genuine decision support.

The full source is on GitHub. Contributions and feedback welcome.
