Architectural Trade-off Analysis — CTA Public Transport Optimisation System

Version: 1.0
Date: 2026-03-12
Authors: Architecture Review (reverse-engineered from codebase)
References: architecture.md · ADR-001 through ADR-006


Table of Contents

  1. Evaluation Framework
  2. Decision Trade-off Analysis
  3. Cross-Cutting Trade-off Analysis
  4. Architecture Fitness Function
  5. Strategic Recommendations
  6. Trade-off Summary Heatmap

1. Evaluation Framework

Every trade-off is scored against the six quality attributes (QAs) derived from the system's architectural drivers (see architecture.md §2).

1.1 Quality Attribute Weights

| ID | Quality Attribute | Weight | Justification |
| --- | --- | --- | --- |
| QA-01 | Throughput | 20 % | System processes events from 3 lines × stations × 10 trains @ 5 s intervals |
| QA-02 | Decoupling | 20 % | Producers and consumers must evolve independently |
| QA-03 | Schema Evolution | 15 % | Fields may be added; consumers must not break |
| QA-04 | Replayability | 15 % | Dashboard must rebuild state on restart |
| QA-05 | Responsiveness | 15 % | Dashboard HTTP latency must not stall Kafka polling |
| QA-06 | Extensibility | 15 % | New data sources/consumers without code changes |

1.2 Scoring Scale

| Score | Meaning |
| --- | --- |
| 5 | Fully meets the quality attribute |
| 4 | Meets with minor gaps |
| 3 | Partial / neutral |
| 2 | Partially undermines the quality attribute |
| 1 | Significantly undermines the quality attribute |

1.3 Risk Scale

| Level | Symbol | Meaning |
| --- | --- | --- |
| Critical | 🔴 | Likely to cause production incidents |
| High | 🟠 | Significant impact under normal load |
| Medium | 🟡 | Impact under edge cases or growth |
| Low | 🟢 | Manageable with standard practices |

2. Decision Trade-off Analysis

2.1 Event Bus — Kafka vs Alternatives

Decision: ADR-001 — Apache Kafka as the single event bus

Weighted Scoring Matrix

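| Quality Attribute | Weight | Kafka | RabbitMQ | Redis Streams | REST Polling |
| --- | --- | --- | --- | --- | --- |
| Throughput (QA-01) | 20 % | 5 | 4 | 4 | 2 |
| Decoupling (QA-02) | 20 % | 5 | 4 | 3 | 1 |
| Schema Evolution (QA-03) | 15 % | 5 | 3 | 2 | 2 |
| Replayability (QA-04) | 15 % | 5 | 2 | 3 | 1 |
| Responsiveness (QA-05) | 15 % | 4 | 4 | 5 | 2 |
| Extensibility (QA-06) | 15 % | 5 | 4 | 3 | 1 |
| Weighted Total | | 4.85 | 3.55 | 3.35 | 1.50 |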
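
Each weighted total is the sum of score × weight over the six quality attributes. A minimal Python sketch of that arithmetic, using the Kafka and RabbitMQ columns above; the function and variable names are illustrative and do not come from the codebase:

```python
# Weights from section 1.1, kept in percent so the arithmetic stays in integers.
WEIGHTS = {
    "QA-01": 20, "QA-02": 20, "QA-03": 15,
    "QA-04": 15, "QA-05": 15, "QA-06": 15,
}


def weighted_total(scores: dict[str, int]) -> float:
    """Return the weighted score on the 1-5 scale, rounded to two decimals."""
    if scores.keys() != WEIGHTS.keys():
        raise ValueError("expected exactly one score per quality attribute")
    return round(sum(WEIGHTS[qa] * score for qa, score in scores.items()) / 100, 2)


kafka = {"QA-01": 5, "QA-02": 5, "QA-03": 5, "QA-04": 5, "QA-05": 4, "QA-06": 5}
rabbitmq = {"QA-01": 4, "QA-02": 4, "QA-03": 3, "QA-04": 2, "QA-05": 4, "QA-06": 4}

print(weighted_total(kafka))     # 4.85
print(weighted_total(rabbitmq))  # 3.55
```

The same function can be used to re-check any column in the matrices that follow.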

Trade-off Narrative

Why Kafka wins here: Kafka's append-only, partitioned log is the single feature that unlocks replayability and fan-out simultaneously — properties that no queue-based broker (RabbitMQ) provides out of the box. The ability to start a new consumer at offset_earliest and rebuild the full station/weather state is architecturally critical for the dashboard's cold-start scenario.
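
The replay and fan-out property can be illustrated without a broker. The sketch below is an in-memory stand-in for a single log partition, not Kafka client code; `TopicLog`, `rebuild_station_state`, and the event fields are hypothetical names chosen for the example:

```python
class TopicLog:
    """In-memory stand-in for one partition of an append-only log."""

    def __init__(self) -> None:
        self._records: list[dict] = []

    def append(self, record: dict) -> int:
        self._records.append(record)
        return len(self._records) - 1  # offset of the record just written

    def read_from(self, offset: int = 0) -> list[dict]:
        # Reading never removes records, so any reader may start at offset 0.
        return self._records[offset:]


def rebuild_station_state(log: TopicLog) -> dict[int, dict]:
    """Fold the whole log into the latest known event per station."""
    state: dict[int, dict] = {}
    for event in log.read_from(0):
        state[event["station_id"]] = event
    return state


log = TopicLog()
log.append({"station_id": 1, "status": "arriving"})
log.append({"station_id": 2, "status": "idle"})
log.append({"station_id": 1, "status": "departed"})

# Two independent readers (fan-out) each replay from the start (replayability).
dashboard_a = rebuild_station_state(log)
dashboard_b = rebuild_station_state(log)
```

A classic work queue removes a message once it has been acknowledged, so a reader that starts later has nothing to replay; the log keeps every record until retention expires.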

What is sacrificed:

  • Operational simplicity. This deployment runs ZooKeeper (required by Kafka in CP 5.x), Schema Registry, and REST Proxy alongside the broker. A RabbitMQ cluster is simpler to operate.
  • Latency at p99. Kafka producers batch records before sending, which adds latency to individual records. For the weather producer, which posts once per simulated hour, this is irrelevant, but it rules Kafka out for sub-millisecond latency use cases.

Key risk introduced: 🔴 Single-broker deployment with replication_factor=1. No partition has a second copy, so in production a broker outage makes every topic unavailable, and loss of the broker's storage loses every message.


2.2 Serialisation — Avro vs Alternatives

Decision: ADR-002 — Apache Avro + Confluent Schema Registry

Weighted Scoring Matrix

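| Quality Attribute | Weight | Avro + Registry | Plain JSON | Protobuf | MessagePack |
| --- | --- | --- | --- | --- | --- |
| Throughput (QA-01) | 20 % | 5 | 3 | 5 | 4 |
| Decoupling (QA-02) | 20 % | 5 | 2 | 5 | 2 |
| Schema Evolution (QA-03) | 15 % | 5 | 1 | 5 | 2 |
| Replayability (QA-04) | 15 % | 5 | 3 | 4 | 3 |
| Responsiveness (QA-05) | 15 % | 4 | 5 | 4 | 4 |
| Extensibility (QA-06) | 15 % | 5 | 2 | 4 | 2 |
| Weighted Total | | 4.85 | 2.65 | 4.55 | 2.85 |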

Trade-off Narrative

Why Avro wins: The Schema Registry's compatibility check acts as the publish-time equivalent of a compile-time check: an incompatible schema change, such as a field rename that violates the subject's configured compatibility mode, is rejected at registration, before a single consumer can be broken. Each message carries a 4-byte schema ID instead of the full schema, which makes it far more compact than the equivalent JSON.
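
For reference, the Confluent wire format prefixes the Avro-encoded body with a 5-byte header: one magic byte (0) followed by the 4-byte big-endian schema ID. The sketch below builds and parses that header with the standard library only; the helper names are illustrative and the payload bytes are a placeholder, not a real Avro encoding:

```python
import struct

MAGIC_BYTE = 0       # version marker of the Confluent wire format
HEADER = ">bI"       # 1 signed byte + 4-byte unsigned int, big-endian, no padding


def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix an already Avro-encoded payload with the registry header."""
    return struct.pack(HEADER, MAGIC_BYTE, schema_id) + avro_payload


def parse_header(message: bytes) -> tuple[int, bytes]:
    """Split a framed message into (schema_id, avro_payload)."""
    if len(message) < struct.calcsize(HEADER):
        raise ValueError("message is shorter than the 5-byte header")
    magic, schema_id = struct.unpack(HEADER, message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[5:]


framed = frame(42, b"\x02\x06abc")  # placeholder payload bytes
schema_id, payload = parse_header(framed)
```

A consumer uses the schema ID to fetch the writer schema from the registry once and can cache it afterwards, which is why the per-message overhead stays at five bytes.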

Protobuf is the credible alternative: Protobuf achieves nearly identical scores. The differentiator is ecosystem fit: confluent-kafka-python's AvroProducer/AvroConsumer were the idiomatic Python Confluent API at CP 5.2.2, whereas Protobuf support required more boilerplate. Today (Confluent Platform 7+), Protobuf is first-class; migrating would be viable.

Key inconsistency introduced: 🟠 TURNSTILE_SUMMARY uses JSON while all other topics use Avro. This forces consumers to branch on is_avro and removes schema-enforcement for rider counts — the metric that most directly feeds the UI.
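
The branching this forces on consumers might look like the sketch below. It is illustrative only: `decode_value`, the `avro_decode` callable, and the `STATION_ID`/`COUNT` field names are assumptions for the example, not taken from the codebase. The point is that the JSON path has to validate by hand what the registry enforces for Avro topics:

```python
import json
from typing import Any, Callable

# Assumed field names for the summary record; adjust to the real KSQL output.
REQUIRED_SUMMARY_FIELDS = {"STATION_ID", "COUNT"}


def decode_value(
    raw: bytes,
    is_avro: bool,
    avro_decode: Callable[[bytes], dict[str, Any]],
) -> dict[str, Any]:
    """Decode one message value, branching on the topic's serialisation format."""
    if is_avro:
        # Avro path: the schema was checked by the registry when it was registered.
        return avro_decode(raw)
    # JSON path: nothing upstream guarantees the shape, so check it here.
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("summary record is not a JSON object")
    missing = REQUIRED_SUMMARY_FIELDS - record.keys()
    if missing:
        raise ValueError(f"summary record is missing fields: {sorted(missing)}")
    return record
```

Without a check like this, a renamed column in the KSQL table would reach the UI as a missing rider count rather than as an error.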


2.3 DB Ingestion — Kafka Connect vs Custom Producer

Decision: ADR-003 — Kafka Connect JDBC Source Connector

Weighted Scoring Matrix

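| Quality Attribute | Weight | Kafka Connect JDBC | Custom Python Producer | Direct DB Read in Consumer | Debezium CDC |
| --- | --- | --- | --- | --- | --- |
| Throughput (QA-01) | 20 % | 4 | 4 | 3 | 5 |
| Decoupling (QA-02) | 20 % | 5 | 3 | 1 | 5 |
| Schema Evolution (QA-03) | 15 % | 3 | 2 | 1 | 5 |
| Replayability (QA-04) | 15 % | 5 | 4 | 1 | 5 |
| Responsiveness (QA-05) | 15 % | 4 | 4 | 2 | 4 |
| Extensibility (QA-06) | 15 % | 5 | 3 | 1 | 5 |
| Weighted Total | | 4.35 | 3.35 | 1.55 | 4.85 |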

Trade-off Narrative

Why Kafka Connect wins over a custom producer: Zero-code ingestion eliminates an entire class of bugs: offset tracking, error handling, and retry logic are handled by a battle-tested framework. The connector is idempotent — safe to re-run on simulation restart.

Why Debezium CDC scores higher but was rejected: Debezium captures every INSERT/UPDATE/DELETE via PostgreSQL Write-Ahead Log, which is more correct (would capture station updates, not just inserts). However, enabling WAL replication requires DBA-level PostgreSQL configuration (wal_level=logical), which is overkill when the stations table is quasi-static reference data loaded once from CSV.

Hidden cost of the chosen approach: 🟡 mode=incrementing only detects new rows by monotonically increasing stop_id. A station name correction or line reassignment will silently remain stale in the Kafka topic and in the dashboard until the connector is manually reset and replayed.
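
The blind spot is easy to reproduce in a few lines. The sketch below imitates incrementing-mode polling over an in-memory table; it is a simplified model rather than the connector's implementation, and the station names are made up for the example:

```python
def poll_incrementing(rows: list[dict], last_seen_id: int) -> tuple[list[dict], int]:
    """Return copies of rows whose stop_id exceeds the stored offset, plus the new offset."""
    new_rows = [dict(row) for row in rows if row["stop_id"] > last_seen_id]
    new_offset = max((row["stop_id"] for row in new_rows), default=last_seen_id)
    return new_rows, new_offset


table = [
    {"stop_id": 1, "station_name": "Station A"},
    {"stop_id": 2, "station_name": "Staton B"},  # typo in the source data
]

first_batch, offset = poll_incrementing(table, last_seen_id=0)

# Correct the typo in place: stop_id is unchanged, so the poll cannot see it.
table[1]["station_name"] = "Station B"
second_batch, offset = poll_incrementing(table, offset)

# A genuinely new row is picked up as expected.
table.append({"stop_id": 3, "station_name": "Station C"})
third_batch, offset = poll_incrementing(table, offset)
```

After the correction, `second_batch` is empty and the published copy of stop 2 still carries the typo; only the insert of stop 3 produces a new record.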

When the decision should be revisited

If station data becomes writable (e.g. an admin UI for updating station names), migrate to Debezium CDC to capture UPDATE and DELETE events.


2.4 Stream Processing — Faust + KSQL vs Alternatives

Decision: ADR-004 — Faust for record transformation + KSQL for aggregation

Weighted Scoring Matrix

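| Quality Attribute | Weight | Faust + KSQL | Faust Only | KSQL Only | Kafka Streams | Spark SS |
| --- | --- | --- | --- | --- | --- | --- |
| Throughput (QA-01) | 20 % | 4 | 4 | 5 | 5 | 5 |
| Decoupling (QA-02) | 20 % | 5 | 5 | 5 | 5 | 5 |
| Schema Evolution (QA-03) | 15 % | 4 | 4 | 4 | 5 | 4 |
| Replayability (QA-04) | 15 % | 3 | 3 | 4 | 5 | 5 |
| Responsiveness (QA-05) | 15 % | 4 | 4 | 4 | 4 | 3 |
| Extensibility (QA-06) | 15 % | 5 | 4 | 4 | 5 | 4 |
| Weighted Total | | 4.20 | 4.05 | 4.40 | 4.85 | 4.40 |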

Trade-off Narrative

The dual-engine pattern is a deliberate pedagogical trade-off: The use of two tools increases operational complexity (two processes, two different programming models) but each tool is used where it excels:

| Concern | Faust | KSQL |
| --- | --- | --- |
| Programming model | Async Python coroutines | Declarative SQL |
| Best for | Arbitrary code logic, Python type safety | GROUP BY, windowed aggregations |
| State store | In-memory (dev) / RocksDB (prod) | Kafka-backed materialised table |
| Restart behaviour | Replays topic from earliest | Persistent table survives restart |

Key risk: 🟡 Faust's store="memory://" means station state is rebuilt from the full topic on every restart. As the station topic grows this adds startup latency. Replace with store="rocksdb://" for a persistent local state store.

Operational debt: 🟠 No orchestration enforces the startup order:

  1. Kafka Connect must publish station data
  2. Faust must transform it
  3. KSQL must create TURNSTILE_SUMMARY
  4. Only then can the dashboard start

A failure anywhere in this chain requires manual intervention.
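
One way to make that ordering explicit is a small gate that waits for each stage before allowing the next. The sketch below is a hypothetical helper, not part of the codebase; the readiness checks are placeholders that a real script would replace with queries against Kafka Connect, the topic list, and KSQL:

```python
import time
from typing import Callable

Check = Callable[[], bool]


def wait_until(check: Check, timeout_s: float, interval_s: float = 1.0) -> bool:
    """Poll `check` until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while not check():
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
    return True


def start_in_order(
    steps: list[tuple[str, Check]],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
) -> list[str]:
    """Wait for each step in sequence; stop at the first one that never becomes ready."""
    completed: list[str] = []
    for name, is_ready in steps:
        if not wait_until(is_ready, timeout_s, interval_s):
            raise RuntimeError(f"startup halted: '{name}' not ready after {timeout_s} s")
        completed.append(name)
    return completed


# Placeholder checks that always succeed, in the order listed above.
steps = [
    ("Kafka Connect published station data", lambda: True),
    ("Faust transformed station records", lambda: True),
    ("KSQL created TURNSTILE_SUMMARY", lambda: True),
]
completed = start_in_order(steps, timeout_s=1.0, interval_s=0.1)
```

The dashboard would be launched only after `start_in_order` returns; a `RuntimeError` names the stage that blocked startup instead of leaving the operator to diagnose it.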


2.5 Weather Producer — REST Proxy vs Native Client

Decision: ADR-005 — Kafka REST Proxy for the Weather producer

Weighted Scoring Matrix

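| Quality Attribute | Weight | REST Proxy | Native AvroProducer |
| --- | --- | --- | --- |
| Throughput (QA-01) | 20 % | 3 | 5 |
| Decoupling (QA-02) | 20 % | 5 | 5 |
| Schema Evolution (QA-03) | 15 % | 4 | 5 |
| Replayability (QA-04) | 15 % | 5 | 5 |
| Responsiveness (QA-05) | 15 % | 3 | 5 |
| Extensibility (QA-06) | 15 % | 5 | 5 |
| Weighted Total | | 4.15 | 5.00 |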

Trade-off Narrative

This decision scores lowest of all six because there is no functional reason to diverge from the native client — only demonstration value.

| Dimension | REST Proxy | Native AvroProducer |
| --- | --- | --- |
| Extra network hop | Yes (+ ~1–5 ms per request) | No |
| Schema sent in every request | Yes (wasteful, ~2 KB) | No (schema ID only after first register) |
| Error handling | Silent drop on HTTP failure | Delivery callback with retry |
| Maintenance burden | Two integration patterns to understand | One |
| Polyglot value | Useful if producer is non-Python | Not applicable here |

Verdict: 🟡 The REST Proxy choice adds cognitive overhead for no functional gain in a Python-only system. If the goal is demonstration, the inconsistency should be documented clearly (it now is in ADR-005). For a production system, weather should use AvroProducer like every other producer, and the REST Proxy demo should be a separate isolated example.
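
To make the "schema sent in every request" row concrete, the sketch below assembles a REST Proxy v2 produce request for an Avro value and adds an explicit status check in place of a silent drop. It only builds the request and does not send it; the topic name, schema, base URL, and helper names are assumptions for the example:

```python
import json

AVRO_CONTENT_TYPE = "application/vnd.kafka.avro.v2+json"


def build_produce_request(
    topic: str,
    value_schema: dict,
    value: dict,
    base_url: str = "http://localhost:8082",  # assumed REST Proxy address
) -> tuple[str, dict[str, str], str]:
    """Return (url, headers, body) for POST /topics/{topic}."""
    body = json.dumps({
        # The full schema travels as a JSON string in every single request.
        "value_schema": json.dumps(value_schema),
        "records": [{"value": value}],
    })
    return f"{base_url}/topics/{topic}", {"Content-Type": AVRO_CONTENT_TYPE}, body


def check_response(status_code: int, text: str = "") -> None:
    """Fail loudly on a non-2xx response instead of dropping the event."""
    if not 200 <= status_code < 300:
        raise RuntimeError(f"REST Proxy rejected the record: HTTP {status_code} {text}")


# Hypothetical weather schema and topic, for illustration only.
weather_schema = {
    "type": "record",
    "name": "weather_value",
    "fields": [
        {"name": "temperature", "type": "float"},
        {"name": "status", "type": "string"},
    ],
}
url, headers, body = build_produce_request(
    "weather", weather_schema, {"temperature": 70.0, "status": "sunny"}
)
```

With the native client, the schema is registered once and only its ID accompanies later messages; here the serialised schema is part of `body` on every call.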


2.6 Dashboard Server — Tornado vs Alternatives

Decision: ADR-006 — Tornado async web server

Weighted Scoring Matrix

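| Quality Attribute | Weight | Tornado | Flask (sync) | aiohttp | FastAPI | Separate Consumer + Redis |
| --- | --- | --- | --- | --- | --- | --- |
| Throughput (QA-01) | 20 % | 4 | 2 | 5 | 5 | 5 |
| Decoupling (QA-02) | 20 % | 4 | 4 | 4 | 4 | 5 |
| Schema Evolution (QA-03) | 15 % | 4 | 4 | 4 | 4 | 4 |
| Replayability (QA-04) | 15 % | 5 | 3 | 5 | 5 | 4 |
| Responsiveness (QA-05) | 15 % | 5 | 2 | 5 | 5 | 5 |
| Extensibility (QA-06) | 15 % | 4 | 3 | 4 | 5 | 4 |
| Weighted Total | | 4.30 | 3.00 | 4.50 | 4.65 | 4.55 |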

Trade-off Narrative

Why Tornado is a reasonable but not optimal choice: Tornado's IOLoop integrates naturally with confluent_kafka's callback-based API and was the idiomatic async web server in the Python ecosystem before asyncio matured. The chosen design — spawn_callback for consumers, synchronous GET handler — achieves the goal with minimal code.
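
The shape of that design can be sketched with plain asyncio, used here instead of Tornado only to keep the example dependency-free; `asyncio.create_task` stands in for `spawn_callback`, and the `poll` callable is a stand-in for a non-blocking consumer poll. All names and the event fields are illustrative:

```python
import asyncio
from typing import Callable, Optional


async def consume(
    poll: Callable[[], Optional[dict]],
    state: dict,
    stop: asyncio.Event,
    idle_sleep_s: float = 0.01,
) -> None:
    """Poll loop that yields to the event loop between messages."""
    while not stop.is_set():
        message = poll()  # must return immediately, like a zero-timeout poll
        if message is None:
            await asyncio.sleep(idle_sleep_s)  # idle: let request handlers run
            continue
        state[message["station_id"]] = message
        await asyncio.sleep(0)  # busy: still give other coroutines a turn


def handle_get(state: dict) -> dict:
    """Synchronous handler: a single thread means no lock is needed to read state."""
    return dict(state)


async def main() -> dict:
    pending = [
        {"station_id": 1, "status": "arriving"},
        {"station_id": 1, "status": "departed"},
    ]
    state: dict = {}
    stop = asyncio.Event()
    task = asyncio.create_task(
        consume(lambda: pending.pop(0) if pending else None, state, stop)
    )
    await asyncio.sleep(0.05)      # simulated time before a request arrives
    snapshot = handle_get(state)
    stop.set()
    await task
    return snapshot


snapshot = asyncio.run(main())
```

Because the consumer and the handler share one thread and only switch at `await` points, the handler always sees a fully applied update.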

aiohttp / FastAPI score higher today because:

  • Both are built natively on asyncio (no legacy compatibility shim)
  • FastAPI provides automatic OpenAPI documentation
  • The aiokafka library provides a fully async consumer compatible with both

Flask's fatal flaw in this context: A synchronous web server cannot co-locate Kafka consumer polling in the same process without threads. Using threads reintroduces shared-state locking complexity that the async model eliminates.

Key risk: 🟡 All four consumers share a single Kafka group.id. Starting a second dashboard instance would split partition ownership, causing each instance to see only a subset of events — producing an incoherent UI state.


3. Cross-Cutting Trade-off Analysis

3.1 Serialisation Consistency

The system uses two serialisation formats across its six topics:

| Dimension | Avro path (5 topics) | JSON path (TURNSTILE_SUMMARY) |
| --- | --- | --- |
| Schema enforcement | Registry rejects breaking changes | None |
| Consumer code | AvroConsumer (auto-deserialise) | Manual JSON decode |
| Wire size | Compact (schema ID only) | Verbose |
| Debuggability | Schema Registry UI | Raw JSON readable in Topics UI |
| Risk of silent breakage | Low | High |

Recommendation: Register an Avro schema for TURNSTILE_SUMMARY and change VALUE_FORMAT='AVRO' in the KSQL CREATE TABLE statement. This removes the is_avro=False branch from the consumer and makes the serialisation model uniform.


3.2 State Management Strategy

The system employs three distinct state-management patterns with different durability guarantees:

| State Store | Pattern | Cold-Start Cost | Data Loss Risk | Recovery |
| --- | --- | --- | --- | --- |
| PostgreSQL | Source of truth | None | Low (volume) | Re-seed from CSV |
| Kafka logs | Event log | None | 🔴 replication_factor=1 | None if broker lost |
| Faust table (memory://) | Materialised view | Replay full topic | None (replays) | Automatic |
| Tornado in-process | Derived state | Replay all 4 topics | None (replays) | Automatic |

Structural tension: The design makes all in-process state reconstruct-able from Kafka, which is elegant and correct. However, it assumes the Kafka logs are themselves durable — an assumption violated by replication_factor=1.


3.3 Concurrency Model

Three different concurrency approaches coexist across the system:

| Component | Concurrency Model | Thread-safe? | Scale-out strategy |
| --- | --- | --- | --- |
| simulation.py | Sequential (single Python process, no async) | N/A | N/A |
| faust_stream.py | Asyncio event loop (Faust worker) | Yes | Multiple Faust worker instances |
| ksql.py | Single HTTP request, then exits | N/A | N/A |
| server.py | Tornado IOLoop + spawn_callback coroutines | Single-thread cooperative | 🔴 Blocked by shared group.id |

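Scale-out constraint for the dashboard: because all four consumers share one group.id, Kafka assigns each partition to only one dashboard instance, so every instance sees just a subset of events.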

Fix: Each dashboard instance should use a unique group.id (e.g. append a UUID suffix) so every instance receives the full partition set and sees all events.
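
A minimal sketch of that fix, assuming the consumers are configured through a librdkafka-style dictionary; the helper name, base group name, and broker address are placeholders:

```python
import uuid


def dashboard_consumer_config(bootstrap_servers: str, base_group_id: str) -> dict[str, str]:
    """Build a consumer config whose group.id is unique to this process."""
    return {
        "bootstrap.servers": bootstrap_servers,
        # A fresh suffix per instance: no two dashboards share a consumer group.
        "group.id": f"{base_group_id}-{uuid.uuid4()}",
        # A new group has no committed offsets, so start from the beginning.
        "auto.offset.reset": "earliest",
    }


instance_a = dashboard_consumer_config("localhost:9092", "dashboard")
instance_b = dashboard_consumer_config("localhost:9092", "dashboard")
```

The trade-off is that every restart also creates a new group and replays the topics from the earliest offset, which matches the dashboard's existing cold-start behaviour but leaves unused groups behind on the broker.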


3.4 Operational Complexity

The system requires 7 infrastructure containers and 4 Python processes (11 components in total) to be started in a specific order:

| Layer | Components | Startup dependencies | Failure impact |
| --- | --- | --- | --- |
| Infrastructure | 7 Docker containers | Ordered by depends_on | Total system down |
| Producers | 1 Python process | Kafka + Schema Registry + Connect up | No events produced |
| Stream processors | 2 Python processes | Kafka + producer running | Dashboard sees no data |
| Dashboard | 1 Python process | Stream processors running | No UI |

Operational risk: 🟠 There is no automated readiness check or restart policy for the Python processes. A crash at any layer requires manual diagnosis and ordered restart.

Mitigation options:

| Option | Effort | Benefit |
| --- | --- | --- |
| Add healthcheck to docker-compose.yaml for each service | Low | Detect infrastructure failures automatically |
| Wrap Python processes in a Makefile with retry logic | Low | Reduce manual restart toil |
| Add a startup probe script (wait-for-it.sh pattern) | Medium | Enforce ordering without manual timing |
| Convert Python processes to Docker services | Medium | Unified docker-compose up startup |
| Migrate to Kubernetes with init containers + readiness probes | High | Production-grade orchestration |
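
The startup probe option can be approximated in Python with a TCP connect loop, in the spirit of wait-for-it.sh. This is a sketch under the assumption that an open port is a good enough readiness signal; the function name and defaults are illustrative:

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # Connection succeeded: something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


# Example call (not executed here); 9092 is Kafka's conventional listener port:
# wait_for_port("localhost", 9092)
```

An open port only shows that the process is listening, not that it is ready to serve requests, so an application-level check is the stronger signal where one is available.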

4. Architecture Fitness Function

A fitness function scores how well the as-built architecture meets each quality attribute.

| QA | Attribute | Current Score | Evidence | Gap |
| --- | --- | --- | --- | --- |
| QA-01 | Throughput | 🟢 4.0/5 | Kafka partitioning, AvroProducer batching handle demo load | Single broker caps real scale |
| QA-02 | Decoupling | 🟢 4.5/5 | All flows via Kafka; zero direct service calls | REST Proxy + native producer inconsistency |
| QA-03 | Schema Evolution | 🟡 3.5/5 | Avro + Schema Registry on 5/6 topics | TURNSTILE_SUMMARY bypasses registry |
| QA-04 | Replayability | 🟡 3.5/5 | offset_earliest on all consumers; Faust rebuilds | replication_factor=1: log loss is unrecoverable |
| QA-05 | Responsiveness | 🟢 4.0/5 | Tornado async model; non-blocking poll | Hard exit(1) if topics missing blocks startup |
| QA-06 | Extensibility | 🟢 4.5/5 | New consumers subscribe without producer changes | Startup ordering is implicit, not automated |

Overall fitness: 4.0 / 5.0 (80 %) — suitable for demonstration, not production


5. Strategic Recommendations

Prioritised by risk reduction value vs implementation effort:

Priority 1 — Critical (address before any production use)

| # | Recommendation | ADR | Risk Addressed | Effort |
| --- | --- | --- | --- | --- |
| P1-1 | Set replication_factor=3 on all topics; add 2 Kafka brokers | ADR-001 | 🔴 Data loss on broker failure | Medium |
| P1-2 | Externalise all credentials (DB_USER, DB_PASS, BOOTSTRAP_SERVERS) to environment variables | ADR-003 | 🔴 Credential exposure | Low |
| P1-3 | Assign unique group.id per dashboard instance | ADR-006 | 🔴 Incoherent state on scale-out | Low |
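
P1-2 could be sketched as a small loader that reads the three settings from the environment and fails fast when one is missing. The function name and the `env` parameter are illustrative; only the variable names come from the recommendation:

```python
import os
from typing import Mapping, Optional

REQUIRED_SETTINGS = ("DB_USER", "DB_PASS", "BOOTSTRAP_SERVERS")


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Read required settings from the environment; raise if any is absent or empty."""
    source = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not source.get(name)]
    if missing:
        raise RuntimeError("missing required environment variables: " + ", ".join(missing))
    return {name: source[name] for name in REQUIRED_SETTINGS}
```

Passing a mapping explicitly, for example `load_settings({"DB_USER": "cta", "DB_PASS": "secret", "BOOTSTRAP_SERVERS": "localhost:9092"})`, keeps the loader testable without touching the real environment.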

Priority 2 — High (address in first production sprint)

| # | Recommendation | ADR | Risk Addressed | Effort |
| --- | --- | --- | --- | --- |
| P2-1 | Register Avro schema for TURNSTILE_SUMMARY; change VALUE_FORMAT='AVRO' | ADR-002 | 🟠 Silent schema breaks on rider count | Low |
| P2-2 | Replace AvroProducer with SerializingProducer + AvroSerializer | ADR-002 | 🟠 Deprecated API removal | Medium |
| P2-3 | Add automated startup ordering (health checks + wait scripts) | ADR-004 | 🟠 Manual restart toil on failure | Medium |
| P2-4 | Replace store="memory://" with store="rocksdb://" in Faust | ADR-004 | 🟠 Startup latency grows with topic size | Low |

Priority 3 — Medium (address in backlog)

| # | Recommendation | ADR | Risk Addressed | Effort |
| --- | --- | --- | --- | --- |
| P3-1 | Unify all producers to use AvroProducer; remove REST Proxy dependency | ADR-005 | 🟡 Cognitive overhead for maintainers | Low |
| P3-2 | Migrate faust_stream.py + server.py to FastAPI + aiokafka | ADR-006 | 🟡 Tornado is legacy; FastAPI is modern async standard | High |
| P3-3 | Merge the two constants.py files into a shared config module | | 🟡 Duplication / drift risk | Low |
| P3-4 | Add unit tests for all producer models and consumer models | | 🟡 No regression safety net | High |

6. Trade-off Summary Heatmap

The heatmap below shows the contribution of each architectural decision (rows) to each quality attribute (columns). Green = positive contribution; Red = negative contribution.

DecisionThroughputDecouplingSchema Evol.ReplayabilityResponsivenessExtensibilityNet
ADR-001 Kafka🟢🟢🟢🟢🟢🟢🟢🟢🟡🟢🟢+11
ADR-002 Avro🟢🟢🟢🟢🟢🟢🟢🟡🟢🟢+9
ADR-003 Connect🟡🟢🟢🟡🟢🟢🟡🟢🟢+7
ADR-004 Faust+KSQL🟢🟢🟢🟡🔴🟡🟢🟢+5
ADR-005 REST Proxy🔴🟢🟡🟢🔴🟢+1
ADR-006 Tornado🟡🟡🟡🟢🟢🟢🟢🟡+5
ADR-002 gap JSON KSQL🟡🟡🔴🔴🟡🟡🟡-3
ADR-001 gap RF=1🟡🟡🟡🔴🔴🟡🟡-3

Key:

  • 🟢🟢 Strong positive (+2)
  • 🟢 Positive (+1)
  • 🟡 Neutral (0)
  • 🔴 Negative (−1)
  • 🔴🔴 Strong negative (−2)

Overall Assessment

The core architectural spine — Kafka + Avro + Schema Registry + Kafka Connect — scores highly and is well-suited to the problem. The gaps are concentrated in two areas:

  1. Operational resilience (single broker, no startup orchestration)
  2. Serialisation consistency (KSQL JSON bypass undermines the otherwise strong Avro contract)

Addressing the P1 recommendations above would raise the overall fitness score from 4.0 / 5.0 to roughly 4.6 / 5.0, making the architecture production-worthy.


Trade-off analysis generated by reverse-engineering source code as of 2026-03-12. Weighted scores are analytical judgements based on code evidence, not empirical benchmarks.