QA Knowledge Hub

Difficult Questions

Questions that require precise scoping and technically defensible answers.

Why don’t you use SQLite?

Likely question

Why don’t you do this with SQLite?

Short answer

Because the research question is not whether SQLite can do this, but whether object storage itself can act as a queryable read layer without a database server.

Longer answer

SQLite would be a strong default if the goal were a general embedded SQL solution. But then the research setup would be different. The interesting part here is specifically whether immutable object storage can be used efficiently as a queryable artifact.

Why don’t you use D1?

Likely question

Why don’t you use Cloudflare D1?

Short answer

D1 solves a different problem.

Longer answer

D1 is a managed database service, so using it would answer a different question. In this research, the interesting combination is an immutable object, a Worker, byte-range reads, and a query planner, with no database server and no traditional relational layer.

Safe formulation

D1 is the right answer when you want a database. The point of this research is to determine when object storage itself can be enough as a read layer.

Why is this not SQL?

Likely question

Why don’t you just call this a database?

Short answer

Because this is not a general SQL engine.

Longer answer

The queries are known in advance, the planner is deterministic, and the model does not support arbitrary SQL, joins, or OLTP-style mutation load. A more accurate description is a read-optimized publishing and briefing architecture.
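
To make "known queries plus a deterministic planner" concrete, here is a minimal sketch. It assumes a hypothetical layout (a fixed-size header followed by fixed-width records) and two hypothetical query shapes; ObjectLayout, KnownQuery, planQuery, and readRange are illustrative names, not the project's actual format or API.

```typescript
// Assumed layout (hypothetical): a fixed-size header followed by
// fixed-width records, all packed into one immutable object.
interface ObjectLayout {
  headerBytes: number;
  recordBytes: number;
  recordCount: number;
}

// The only query shapes the planner accepts. Anything else is a type error.
type KnownQuery =
  | { kind: "byIndex"; index: number }
  | { kind: "latest"; count: number };

interface ByteRange {
  offset: number;
  length: number;
}

// Deterministic planning: the same layout and query always give the same range.
function planQuery(layout: ObjectLayout, query: KnownQuery): ByteRange {
  switch (query.kind) {
    case "byIndex": {
      const valid =
        Number.isInteger(query.index) &&
        query.index >= 0 &&
        query.index < layout.recordCount;
      if (!valid) {
        throw new RangeError(`record index out of bounds: ${query.index}`);
      }
      return {
        offset: layout.headerBytes + query.index * layout.recordBytes,
        length: layout.recordBytes,
      };
    }
    case "latest": {
      // Clamp the requested count to what the object actually holds.
      const count = Math.min(Math.max(Math.floor(query.count), 0), layout.recordCount);
      const first = layout.recordCount - count;
      return {
        offset: layout.headerBytes + first * layout.recordBytes,
        length: count * layout.recordBytes,
      };
    }
  }
}

// Stand-in for a ranged GET against object storage: only the planned bytes are touched.
function readRange(object: Uint8Array, range: ByteRange): Uint8Array {
  return object.subarray(range.offset, range.offset + range.length);
}
```

Because the planner only maps a closed set of shapes to byte ranges, there is no SQL parser, no join logic, and no write path to describe, which is why "database" overstates what this is.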

What problem does this solve?

Likely question

What problem does this actually solve?

Short answer

It solves two different problems: the editorial problem and the data delivery problem.

Longer answer

The editorial problem is that there is too much news flow. You need a pipeline that fetches multiple sources, removes duplicates, clusters the same story together, separates source facts from editorial copy, and shows confidence.
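
As a rough illustration of the deduplication and clustering steps, the sketch below removes duplicates by normalized URL and groups articles whose normalized titles match exactly. This is a deliberately crude stand-in: the RawArticle fields, the normalization rules, and exact-title matching are assumptions made for the example, not the pipeline's actual method.

```typescript
// Hypothetical article shape for illustration only.
interface RawArticle {
  url: string;
  title: string;
  source: string;
}

interface StoryCluster {
  key: string;
  articles: RawArticle[];
}

// Lowercase and collapse punctuation so near-identical headlines compare equal.
function normalizeTitle(title: string): string {
  return title.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Drop query strings, fragments, and a trailing slash before comparing URLs.
function normalizeUrl(url: string): string {
  return url.replace(/[?#].*$/, "").replace(/\/$/, "");
}

// Keep the first article seen for each normalized URL.
function dedupeByUrl(articles: RawArticle[]): RawArticle[] {
  const seen = new Set<string>();
  const unique: RawArticle[] = [];
  for (const article of articles) {
    const key = normalizeUrl(article.url);
    if (seen.has(key)) continue;
    seen.add(key);
    unique.push(article);
  }
  return unique;
}

// Group articles that share a normalized title into one story cluster.
function clusterByTitle(articles: RawArticle[]): StoryCluster[] {
  const clusters = new Map<string, StoryCluster>();
  for (const article of articles) {
    const key = normalizeTitle(article.title);
    let cluster = clusters.get(key);
    if (cluster === undefined) {
      cluster = { key, articles: [] };
      clusters.set(key, cluster);
    }
    cluster.articles.push(article);
  }
  return Array.from(clusters.values());
}
```

A real pipeline would need fuzzier matching than exact normalized titles, but the shape is the same: raw articles go in, and one cluster per story with its contributing sources comes out.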

The data delivery problem is that once content is built ahead of time, it may not make sense to serve it through a full database server. Then the question becomes whether mostly-read briefing data can be delivered directly from object storage without a database server.

How does AI relate to this?

Likely question

Is this just an AI news site?

Short answer

No. AI is one part of the pipeline, but it is not in the public request path.

Longer answer

In the current model, Claude produces the shortlist and makes the final selection, OpenAI produces the publishable editorial fields, and the grounding and publish-gate steps try to block weak signals. The public /api/data and /api/news/history paths do not call AI models on any request.

How does the AI know what happened?

Likely question

Does the AI roam the web freely?

Short answer

No. The AI works on a bounded research corpus.

Longer answer

First, a corpus is built:

  • raw articles from RSS sources
  • story clusters
  • source metadata
  • audit snapshots

Then the editorial stage works on that material. The AI does not reconstruct events from nothing; it works from collected, auditable corpus material.

Why is AI not in the request path?

Likely question

Why not just generate the answer live?

Short answer

Because the public interface needs stable latency, reproducible content, and an audit path.

Longer answer

If AI were in the request path, latency would become less predictable, and reproducibility, versioning, and auditability would all become weaker. In this model, AI does corpus interpretation, shortlist generation, editorial field generation, and grounding assistance before publication.

What is the publish gate?

Likely question

Does AI publish everything automatically?

Short answer

No. The publish gate separates AI output from a publishable signal.

Longer answer

The publish gate is meant to block weak or incomplete signals, lower conviction in low-certainty cases, and keep the public publish path controlled.

Safe formulation

The publish gate does not turn the system into a truth machine, but it makes it more transparent and more controlled than a plain AI-summary pipeline.
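
The three behaviors described above (block, lower conviction, keep the path controlled) can be sketched as a single pure function. The EditorialSignal fields, the source-count rule, and the 0.5 cap are hypothetical values chosen for illustration; the actual gate criteria are not specified here.

```typescript
// Hypothetical shape of AI output that is a candidate for publication.
interface EditorialSignal {
  headline: string;
  summary: string;
  sourceCount: number; // how many independent sources ground the story
  conviction: number;  // 0..1, as produced by the editorial stage
}

type GateDecision =
  | { publish: true; signal: EditorialSignal }
  | { publish: false; reason: string };

// Assumed rule for the sketch: single-source stories are capped at this conviction.
const SINGLE_SOURCE_CONVICTION_CAP = 0.5;

function publishGate(signal: EditorialSignal): GateDecision {
  // Block incomplete signals outright.
  if (signal.headline.trim() === "" || signal.summary.trim() === "") {
    return { publish: false, reason: "incomplete editorial fields" };
  }
  // Block signals with no grounding source at all.
  if (signal.sourceCount < 1) {
    return { publish: false, reason: "no grounding source" };
  }
  // Low-certainty case: publish, but with lowered conviction.
  const cap = signal.sourceCount < 2 ? SINGLE_SOURCE_CONVICTION_CAP : 1;
  return {
    publish: true,
    signal: { ...signal, conviction: Math.min(signal.conviction, cap) },
  };
}
```

The point of the shape is that every rejection carries a reason and every accepted signal is a new value, so the gate's decisions can be logged and audited separately from the AI output.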

Why immutable?

Likely question

Why did you make this an immutable model?

Short answer

Because it fits object storage and makes publication and audit paths easier.

Longer answer

In an immutable model, you can version, roll back, publish, and audit without in-place writes into the active base object. That makes object storage a natural publication surface.

What is the pointer?

Likely question

Why do you need a pointer?

Short answer

The pointer determines which manifest is active right now.

Longer answer

The new base and manifest are built first, and the pointer is switched only at the end. That allows publication to happen atomically while preserving the old chain for rollback.
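
The build-first, switch-last order can be sketched with an in-memory map standing in for the bucket. The key names, the ActivePointer shape, and the single-step rollback are assumptions for the example, and the sketch relies on a single-key write being atomic in the underlying store.

```typescript
// In-memory stand-in for a bucket: key -> object body.
type Bucket = Map<string, string>;

// Hypothetical pointer shape: the active manifest plus one step of history.
interface ActivePointer {
  manifestKey: string;
  previousManifestKey: string | null;
}

function readPointer(bucket: Bucket, pointerKey: string): ActivePointer | null {
  const body = bucket.get(pointerKey);
  return body === undefined ? null : (JSON.parse(body) as ActivePointer);
}

// Step 1: write the new manifest under a fresh key (nothing is overwritten).
// Step 2: switch the pointer last, as a single small write.
function publishManifest(
  bucket: Bucket,
  pointerKey: string,
  manifestKey: string,
  manifestBody: string,
): void {
  if (bucket.has(manifestKey)) {
    throw new Error(`manifest keys are write-once: ${manifestKey}`);
  }
  bucket.set(manifestKey, manifestBody);
  const current = readPointer(bucket, pointerKey);
  const next: ActivePointer = {
    manifestKey,
    previousManifestKey: current === null ? null : current.manifestKey,
  };
  bucket.set(pointerKey, JSON.stringify(next));
}

// Readers always go through the pointer, so they see either the old or the new manifest.
function activeManifest(bucket: Bucket, pointerKey: string): string | null {
  const pointer = readPointer(bucket, pointerKey);
  if (pointer === null) return null;
  const body = bucket.get(pointer.manifestKey);
  return body === undefined ? null : body;
}

// Rollback only moves the pointer back; the newer manifest stays in the bucket.
function rollbackPointer(bucket: Bucket, pointerKey: string): boolean {
  const pointer = readPointer(bucket, pointerKey);
  if (pointer === null || pointer.previousManifestKey === null) return false;
  const restored: ActivePointer = {
    manifestKey: pointer.previousManifestKey,
    previousManifestKey: null,
  };
  bucket.set(pointerKey, JSON.stringify(restored));
  return true;
}
```

Until the final pointer write, a half-built publication is invisible to readers, and after it, the previous manifest is still there to switch back to.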

What is the delta engine?

Likely question

How do changes happen if the base is immutable?

Short answer

The change does not overwrite the base; it becomes an append-only delta object.

Longer answer

The typical lifecycle is:

  1. ingest
  2. write delta
  3. update manifest
  4. publish pointer
  5. query live state
  6. compaction
  7. rollback / retention when needed
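
Steps 2, 5, and 6 of the lifecycle can be sketched as pure functions over a base and an ordered list of deltas. The Row and DeltaOp shapes are hypothetical and say nothing about the real JDBIN/JDBON encoding; the sketch only shows that the base is never modified and that compaction produces a new base equivalent to base plus deltas.

```typescript
// Hypothetical row and delta shapes for illustration.
interface Row {
  id: string;
  value: string;
}

type DeltaOp =
  | { op: "upsert"; row: Row }
  | { op: "delete"; id: string };

// Live state = base rows with every delta applied in manifest order.
// Neither the base nor the deltas are mutated.
function resolveLiveState(
  base: readonly Row[],
  deltas: readonly DeltaOp[][],
): Map<string, Row> {
  const live = new Map<string, Row>();
  for (const row of base) live.set(row.id, row);
  for (const delta of deltas) {
    for (const change of delta) {
      if (change.op === "upsert") {
        live.set(change.row.id, change.row);
      } else {
        live.delete(change.id);
      }
    }
  }
  return live;
}

// Compaction folds the whole chain into a new base, sorted by id.
// A manifest pointing at the result would list no deltas.
function compactChain(base: readonly Row[], deltas: readonly DeltaOp[][]): Row[] {
  return Array.from(resolveLiveState(base, deltas).values()).sort((a, b) =>
    a.id < b.id ? -1 : a.id > b.id ? 1 : 0,
  );
}
```

Querying the compacted base with an empty delta list gives the same live state as querying the old base with its full chain, which is what makes compaction safe to publish through the pointer.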

How is publication safety handled?

Likely question

What if two publishers write at the same time?

Short answer

Publish is protected by a CAS / ETag model.

Longer answer

Pointer publish does not rely on “last write wins.” The controls used are:

  • pointerKey
  • expectedGeneration
  • ETag guard
  • conditional write
  • health check
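
A minimal sketch of how pointerKey, expectedGeneration, the ETag guard, and a conditional write can fit together is shown below. ConditionalStore is an in-memory stand-in written for this example, and PointerRecord and publishPointer are hypothetical names; a real object store would supply the ETag and the conditional write itself, and the health check is not modeled.

```typescript
// Hypothetical pointer contents.
interface PointerRecord {
  generation: number;
  manifestKey: string;
}

interface StoredPointer {
  body: string;
  etag: string;
}

// In-memory stand-in for a store that supports a conditional write on ETag.
class ConditionalStore {
  private objects = new Map<string, StoredPointer>();
  private etagCounter = 0;

  get(key: string): StoredPointer | undefined {
    return this.objects.get(key);
  }

  // Writes only if the current ETag equals expectedEtag
  // (null means "the key must not exist yet"). Returns null on conflict.
  putIfMatch(key: string, body: string, expectedEtag: string | null): StoredPointer | null {
    const current = this.objects.get(key);
    const currentEtag = current === undefined ? null : current.etag;
    if (currentEtag !== expectedEtag) return null;
    this.etagCounter += 1;
    const stored: StoredPointer = { body, etag: `etag-${this.etagCounter}` };
    this.objects.set(key, stored);
    return stored;
  }
}

type PublishResult =
  | { ok: true; generation: number }
  | { ok: false; reason: "generation_mismatch" | "etag_conflict" };

function publishPointer(
  store: ConditionalStore,
  pointerKey: string,
  expectedGeneration: number,
  manifestKey: string,
): PublishResult {
  const current = store.get(pointerKey);
  const currentGeneration =
    current === undefined ? 0 : (JSON.parse(current.body) as PointerRecord).generation;
  // Guard 1: the publisher must have built on the generation that is still active.
  if (currentGeneration !== expectedGeneration) {
    return { ok: false, reason: "generation_mismatch" };
  }
  const next: PointerRecord = { generation: currentGeneration + 1, manifestKey };
  // Guard 2: the write succeeds only if nobody replaced the pointer since it was read.
  const written = store.putIfMatch(
    pointerKey,
    JSON.stringify(next),
    current === undefined ? null : current.etag,
  );
  return written === null
    ? { ok: false, reason: "etag_conflict" }
    : { ok: true, generation: next.generation };
}
```

With two publishers that both expect the same generation, exactly one succeeds; the other receives an explicit failure and has to re-read the pointer and rebuild, instead of silently overwriting the winner.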

Why snapshots if JDBIN already works?

Likely question

If JDBIN works, why does the UI use snapshots?

Short answer

Because snapshots and the live JDBIN path solve different problems.

Longer answer

JDBIN is the canonical storage / read layer. A snapshot is the public delivery artifact. The snapshot path keeps the public UI fast and stable. The live JDBIN path makes the canonical storage layer auditable, versioned, and measurable.

Safe formulation

The default frontend request path is not a live JDBIN query, but snapshot-first JSON delivery. JDBIN/JDBON functions as a canonical store + publish engine.

Has this actually been run in production?

Likely question

Is this just a local demo?

Short answer

No. The implementation has actually been run on Cloudflare.

Longer answer

At minimum, the following has been verified:

  • Worker + R2 + pointer + manifest + base/delta chain works
  • the active live chain has been published into a compacted base
  • the benchmark endpoint works against a real R2 object
  • snapshot paths are used in the public UI

What is realistically finished right now?

Likely question

What is actually complete here?

Short answer

Several core parts are already working, even if they are not final.

Longer answer

Working now:

  • Worker + R2 live
  • research corpus pipeline
  • AI editorial pipeline
  • JDBIN/JDBON write path
  • public snapshot path
  • archive snapshot path
  • consensus v1.1 product layer

What do the benchmarks actually prove?

Likely question

What do these benchmarks prove?

Short answer

They prove that a large immutable object can be kept in R2 and queried without loading the entire object.

Longer answer

The benchmarks show that, for known query shapes, the Worker can resolve a query so that its cost tracks the query shape rather than the total object size.

What should not be claimed

  • a general SQL replacement
  • a finished query engine for every problem
  • proven 100M / 5GB production readiness

Why is rangeReads such an important metric?

Likely question

Isn’t bytesRead enough?

Short answer

No, not by itself.

Longer answer

R2 latency accumulates not only from the number of bytes read, but also from the number of requests. That is why rangeReads is often just as important as bytesRead, and sometimes more important.
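
The trade-off can be illustrated with a simple additive cost model and a range-coalescing step. The model (a fixed cost per request plus a per-byte cost) and every number passed to it are assumptions for illustration, not measured R2 figures; Span, coalesce, and estimateLatencyMs are hypothetical names.

```typescript
interface Span {
  offset: number;
  length: number;
}

// Merge spans whose gap is at most maxGapBytes. This trades a few extra
// bytes read for fewer range requests. Input spans are not mutated.
function coalesce(spans: readonly Span[], maxGapBytes: number): Span[] {
  const sorted = spans.slice().sort((a, b) => a.offset - b.offset);
  const merged: Span[] = [];
  for (const span of sorted) {
    const last = merged[merged.length - 1];
    if (last !== undefined && span.offset - (last.offset + last.length) <= maxGapBytes) {
      const end = Math.max(last.offset + last.length, span.offset + span.length);
      last.length = end - last.offset;
    } else {
      merged.push({ offset: span.offset, length: span.length });
    }
  }
  return merged;
}

function totalBytes(spans: readonly Span[]): number {
  return spans.reduce((sum, span) => sum + span.length, 0);
}

// Assumed model: latency = rangeReads * perRequestMs + bytesRead / bytesPerMs.
function estimateLatencyMs(
  spans: readonly Span[],
  perRequestMs: number,
  bytesPerMs: number,
): number {
  return spans.length * perRequestMs + totalBytes(spans) / bytesPerMs;
}
```

Under this model, merging two nearby spans raises bytesRead slightly but removes a whole request, so the estimate drops whenever the per-request cost outweighs the cost of the extra bytes. A benchmark that reports only bytesRead would rank the merged plan as worse, which is the reason to report rangeReads alongside it.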

What is not finished yet?

Likely question

What are the biggest gaps?

Short answer

The limitations should be stated directly.

Longer answer

The main gaps are:

  • no general SQL engine
  • no broad query VM
  • no broad compression benchmarking
  • no formalized consensus model
  • the live UI hot path is not yet light enough
  • editorial quality is not yet at the final target level
  • no proven 100M / 5GB production path yet
  • no unified week-long p50/p95 telemetry yet

What should not be claimed about the project?

  • do not claim that it replaces the general SQL ecosystem
  • do not claim that all query types behave with equal efficiency
  • do not claim that the architecture removes data-modeling tradeoffs
  • do not claim that this is a new SQLite or a general database
  • do not claim that the editorial product is already finished
  • do not claim that consensus is already a formalized research model
