## 1. Project Overview

**Project Name:** Proof DB

**Type:** Historical Evidence Retrieval System (RAG-oriented backend)

This project is a **backend-centric system** designed to manage, index, and retrieve historical evidence (documents, archives, OCR text) with strong emphasis on:

* Evidence traceability
* Chunk-level retrieval
* Hybrid search (full-text + vector)
* Citation reconstruction

Unlike generic RAG systems, this project treats **"evidence" as first-class structured objects**, not just text.

---

## 2. Core Concept

The system is divided into three conceptual layers:

* **Proof DB** → Data layer (PostgreSQL + OpenSearch + Vector)
* **Archive Cask** → Frontend interface (not part of this task)
* **Few-shot Engine** → OCR (external, not part of this task)

Current scope: **Proof DB only**

---

## 3. System Architecture (Backend Focus)

The backend follows a **modular service architecture** (not microservices yet, but clearly separated layers):

### Components:

1. **Ingestion Layer**

   * Accepts raw Markdown archive documents
   * Pre-processes Markdown page markers such as `<!-- DOCMASTER:PAGE 0001 -->`
   * Splits documents into page-bounded vector chunks
   * Keeps list-style archive records and their `COMMENT` blocks together where possible
   * Extracts metadata, including page numbers
   * Enqueues missing archive metadata such as title, year, author, tags, and summary for async LLM enrichment

2. **Storage Layer**

   * PostgreSQL → metadata, relations
   * OpenSearch → full-text index
   * Vector DB → embeddings (can be OpenSearch kNN or Qdrant)

3. **Retrieval Layer**

   * Full-text search (BM25)
   * Vector search (embedding similarity)
   * Hybrid search (fusion)

4. **Evidence Layer**

   * Maps chunk → page → article
   * Provides page-level citation traceability

This is a typical layered backend architecture design, in which the server, database, and API work together ([DEV Community][1]).
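
The ingestion steps above (page-marker parsing, then page-bounded chunking) can be sketched as follows. This is a minimal illustration in Python, not the project's PHP implementation: the function names `split_pages` and `chunk_page`, the `max_chars` limit, and the paragraph-merging rule are all assumptions made for the example. Only the marker format and the single-page fallback come from this document.

```python
import re

# Marker format taken from the example in this document; the digit width is not assumed.
PAGE_MARKER = re.compile(r"<!--\s*DOCMASTER:PAGE\s+(\d+)\s*-->")


def split_pages(markdown: str) -> list[tuple[int, str]]:
    """Split Markdown into (page_number, text) pairs using DOCMASTER page markers.

    Falls back to a single page numbered 1 when no marker is present.
    Text before the first marker is ignored in this sketch.
    """
    matches = list(PAGE_MARKER.finditer(markdown))
    if not matches:
        return [(1, markdown.strip())]
    pages = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        pages.append((int(match.group(1)), markdown[match.end():end].strip()))
    return pages


def chunk_page(page_number: int, text: str, max_chars: int = 1200) -> list[dict]:
    """Greedily merge the paragraphs of ONE page into chunks (hypothetical rule).

    Because the input is a single page, a chunk can never cross a page boundary.
    """
    chunks, buf = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if buf and len(buf) + len(para) + 2 > max_chars:
            chunks.append(buf)
            buf = para
        else:
            buf = f"{buf}\n\n{para}" if buf else para
    if buf:
        chunks.append(buf)
    return [
        {"page_start": page_number, "page_end": page_number, "text": chunk}
        for chunk in chunks
    ]
```

Chunking per page, rather than over the whole document, is what keeps `page_start` equal to `page_end` and makes page-level citation straightforward.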

---

## 4. Tech Stack

### Backend Framework

* PHP 8+
* Webman (HTTP API)
* Workerman (async workers / background jobs)

### Database

* PostgreSQL (relational metadata)

### Search Engine

* OpenSearch

  * Full-text search (BM25)
  * Optional vector search (kNN)

### Vector Layer

* Option A: OpenSearch kNN
* Option B: Qdrant (preferred if scaling)

### Data Flow Tools

* Custom chunking logic (PHP)
* Embedding via external API / local model
* Metadata enrichment via Redis queue + OpenAI-compatible chat completion API

---

## 5. Data Model (CRITICAL)

### Core Entities

```text
Archive
├── archive_uid (ULID)
├── title
├── summary
├── source
└── metadata

Page
├── page_number
├── block_count
├── chunk_count
└── content_length

PageBlock (internal import structure)
├── block_uid
├── archive_uid
├── page_number
└── content

Chunk
├── chunk_uid (archive_uid + sequence + short uid)
├── page_start
├── page_end
├── text
└── embedding_ref
```

### Key Principle

* **`archive_uid` is the core archive-level ID and uses ULID**
* **`chunk_uid` is the core chunk-level ID, formatted as `{archive_uid}_{chunk_index}_{short_uid}`**
* PostgreSQL, OpenSearch, and the Vector DB are all keyed on `archive_uid` and `chunk_uid`
* **`page_number` is the key field for locating evidence**
* A chunk is the unit of vectorization and retrieval recall, not the unit of precise citation
* Evidence only needs to be located to the page, so a chunk may merge across paragraphs but must not cross a page boundary

---

## 6. Search Design

### Full-text (OpenSearch)

* Indexed at chunk level
* Supports:

  * keyword match
  * phrase match

### Vector Search

* embedding similarity

### Hybrid Search

* BM25 + vector fusion
* rerank stage
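
The StepMap notes later in this document state that hybrid search uses Reciprocal Rank Fusion (RRF) over full-text and vector candidates. The fusion step can be sketched in Python as below. The function name `rrf_fuse` and the constant `k = 60` (a commonly used default for RRF) are assumptions for illustration; the project's actual constant and tie-breaking are not specified here, and the rerank stage is not shown.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk_uids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every chunk it contains,
    where rank starts at 1. Chunks missing from a list contribute nothing.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_uid in enumerate(ranking, start=1):
            scores[chunk_uid] += 1.0 / (k + rank)
    # Highest score first; ties are broken by uid so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]
```

RRF works on ranks only, so BM25 scores and vector similarities never need to be normalized onto a common scale.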

---

## 7. API Design (First Phase)

### Ingestion

```http
POST /api/articles/import
```

---

### Retrieval

```http
POST /api/search/fulltext
POST /api/search/vector
POST /api/search/hybrid
```

---

### Evidence

```http
GET /api/chunks/{chunk_uid}
GET /api/evidence/{chunk_uid}
```
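
As a rough client-side illustration of the retrieval endpoints, the Python sketch below only builds the request object and does not send it. Several things are assumptions rather than documented facts: that the search endpoints accept a JSON body, that the body uses the `query` field and the optional `ai` flag mentioned in the StepMap notes, and that the service is reachable locally on port 8787. The helper name `build_search_request` is hypothetical.

```python
import json
import urllib.request

# Assumption: a local Webman instance on the port named in the StepMap notes.
BASE_URL = "http://127.0.0.1:8787"


def build_search_request(mode: str, query: str, ai: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST request for one of the three search routes."""
    if mode not in {"fulltext", "vector", "hybrid"}:
        raise ValueError(f"unknown search mode: {mode!r}")
    payload: dict = {"query": query}
    if ai:
        # Assumption: `ai` is passed as a JSON boolean in the request body.
        payload["ai"] = True
    return urllib.request.Request(
        url=f"{BASE_URL}/api/search/{mode}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it; the response shape is deliberately not assumed here, so consult the files under `apidoc/` for the actual contract.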

---

## 8. Design Philosophy (IMPORTANT)

* Evidence > Text
* Chunk > Document
* Traceability > Raw Retrieval
* Hybrid Search by default

---

## 9. Non-goals (IMPORTANT)

* No frontend (Archive Cask handled later)
* No OCR (Few-shot Engine external)
* No heavy microservices (keep simple modular architecture first)

* Proof DB ≠ storage
* Proof DB = retrieval + meaning + traceability


[1]: https://dev.to/tomjohnson3/understanding-backend-architecture-ljb "Understanding Backend Architecture"
[2]: https://exodata.io/what-is-a-tech-stack-how-to-architect-a-modern-scalable-technology-stack/ "How to Build a Tech Stack That Scales [2026] | Exodata"
[3]: https://medium.com/%40hanxuyang0826/roadmap-to-backend-programming-master-architectural-patterns-c763c9194414 "Roadmap to Backend Programming Master: Architectural ..."

---

## 10. StepMapToDo

> Code review date: 2026-05-03
> Scope reviewed: `app/`, `config/`, `scripts/`, `apidoc/`, project root runtime/deploy files. `vendor/` is treated as third-party dependency code and not counted as project implementation.
> Database decision: PostgreSQL is the project database contract. The default Webman Dockerfile is out of scope for this StepMap.

### Done

- [x] Webman backend skeleton is present and listens on `0.0.0.0:8787`.
- [x] Import API route is registered: `POST /api/articles/import`.
- [x] Import controller supports multipart Markdown upload, raw Markdown body, and JSON body.
- [x] Archive import service can normalize payloads, infer fallback title/source, validate inputs, parse DOCMASTER Markdown page markers, and fall back to single-page Markdown when markers are absent.
- [x] Page-level parsing and chunking are implemented with page-bounded chunks. Current chunking does not intentionally cross page boundaries.
- [x] Noise filtering exists for common archive/OCR boilerplate such as page numbers, classification headers, and declassification footer lines.
- [x] List-style policy records and following `COMMENT` blocks are kept together where the local pattern matches.
- [x] `archive_uid` uses ULID and `chunk_uid` follows `{archive_uid}_{chunk_index}_{short_uid}`.
- [x] Runtime import snapshot writing is implemented under `runtime/proofdb/imports/{import_uid}.json`.
- [x] Relational persistence is implemented through `ArchiveRepository::saveImport()`, including `archives` and `chunks` writes.
- [x] Minimal admin entry frontend exists: landing page with Archive Cask redirect and admin login, plus a session-backed admin dashboard shell.
- [x] Admin dashboard now includes archives-table management, OpenSearch status, admin-user management, APIDOC viewing, and a whitelist-based maintenance-script terminal.
- [x] PostgreSQL is the selected relational database, matching current `pgsql`, JSONB, `BIGSERIAL`, and `TIMESTAMPTZ` implementation.
- [x] PostgreSQL setup script exists for creating `archives` and `chunks` tables plus indexes.
- [x] Admin user bootstrap script exists for creating `admin_users` and seeding/updating an admin account.
- [x] Async AI metadata queue exists on Redis with pending, delayed, failed, retry, and error keys.
- [x] `ai_metadata` Workerman process is registered and can consume Redis jobs.
- [x] OpenAI-compatible chat client exists for metadata enrichment.
- [x] Metadata enrichment service can request/fill `title`, `year`, `author`, `tags`, and `summary` when LLM config is available.
- [x] LLM retry helper exists for retryable HTTP/provider errors.
- [x] Import API documentation exists in `apidoc/importapi.md`.
- [x] BigModel/Zhipu `embedding-3` client is implemented and verified with a live 256-dimension smoke test.
- [x] Generic async task queue/process foundation exists: one DB dispatcher process plus one Redis worker process.
- [x] OpenSearch client factory is implemented and supports passwordless local OpenSearch when security is disabled.
- [x] OpenSearch `proofdb_chunks` hybrid index mapping exists with BM25 text fields and a 2048-dimension `knn_vector` embedding field.
- [x] OpenSearch search-index task handler is implemented and writes embedded chunks through bulk upsert.
- [x] Archive-level `summary` metadata is written into OpenSearch chunk documents and participates in BM25 search alongside `text`, `title`, and other metadata fields.
- [x] End-to-end embedding-to-OpenSearch smoke test passed for 14 chunks: all are `embedding_status=embedded`, `search_index_status=indexed`, and OpenSearch documents contain 2048-dimension vectors.
- [x] Full-text search service, route, controller, and external API documentation are implemented for `POST /api/search/fulltext`.
- [x] Full-text OpenSearch smoke test passed with `query="policy documents"`, returning 12 total hits from indexed chunks.
- [x] Vector search service, route, controller, and external API documentation are implemented for `POST /api/search/vector`.
- [x] Vector OpenSearch smoke test passed with English and Chinese queries. Chinese query `伊拉克入侵科威特与沙漠风暴` correctly recalled the Iraq/Kuwait/Desert Storm chunk as top hit.
- [x] Hybrid search service, route, controller, and external API documentation are implemented for `POST /api/search/hybrid` using Reciprocal Rank Fusion over full-text and vector candidates.
- [x] Hybrid smoke tests passed: English query combines fulltext/vector ranks, and Chinese query falls back to vector recall with the Iraq/Kuwait/Desert Storm chunk as top hit.
- [x] Hybrid search supports `ai=true`: the original query is used for vector search, while the full-text query is rewritten into BM25 keywords through the existing OpenAI-compatible LLM chat path. Keyword generation has a shorter timeout and falls back to the original query on failure.
- [x] Chunk detail API and evidence API are implemented with external documentation: `GET /api/chunks/{chunk_uid}` and `GET /api/evidence/{chunk_uid}`.
- [x] Archive detail API is implemented with external documentation: `GET /api/archives/{archive_uid}`.
- [x] Archive chunk-list and archive evidence-list APIs are implemented with external documentation: `GET /api/archives/{archive_uid}/chunks` and `GET /api/archives/{archive_uid}/evidence`.
- [x] Evidence smoke test passed for `01KQHVREB6XPYF604RVZAP9NNY_1_39003`, returning page label, citation string, and chunk quote.
- [x] `archives` rows should no longer persist redundant `content` or `raw` bodies. Archive body reconstruction should come from chunks or the original Markdown source outside PostgreSQL.
- [x] AI metadata, embedding, and OpenSearch indexing paths now have resumable recovery logic for stale queued/processing work instead of relying only on in-memory progress.
- [x] OpenSearch repair/reindex maintenance script exists: `php scripts/reindex_opensearch.php`, with optional `--archive_uid=...` targeting.

### Partially Done

- [ ] Archive/Page/Chunk model is partly persisted: `archives` and `chunks` tables exist, but pages/page blocks are only summarized in import output and snapshots, not stored as first-class relational tables.
- [x] `embedding_status`, `embedding_ref`, `embedding_model`, `embedding_error`, and `embedding_updated_at` fields exist; embedding generation into PostgreSQL JSONB, OpenSearch vector indexing, and vector retrieval API are all implemented.
- [ ] `search_index_status`, `search_index_error`, and `search_index_updated_at` fields exist and are used by the generic task dispatcher/worker.
- [ ] Import response exposes page summaries and chunk IDs. Archive-level and chunk-level read APIs now exist, but there is still no first-class page record API because pages are not stored as relational rows yet.
- [ ] AI metadata enrichment updates the archive row, but import-time response only reports the queue state; clients need a follow-up API or polling path to observe completed enrichment.
- [ ] Database and Redis credentials are hard-coded in config files; move them to environment variables before production use.

### Async Task Contract

The search/vector pipeline should use two generic background processes instead of one process per task family:

```text
ProofDbTaskDispatcher
  -> periodically scans PostgreSQL for unfinished work
  -> marks eligible rows as queued
  -> pushes normalized task payloads into Redis

ProofDbTaskWorker
  -> consumes Redis task payloads
  -> dispatches by task_type to handlers
  -> updates PostgreSQL status after success/failure
```

Task payload shape:

```json
{
  "task_type": "search_index",
  "target_type": "archive",
  "target_uid": "01...",
  "attempt": 1
}
```

Initial task types:

- `search_index`: enqueue records where `search_index_status != indexed`; handler writes chunks to OpenSearch.
- `embedding`: enqueue records where `embedding_status in pending, queued, failed_retryable`; handler calls BigModel/Zhipu `embedding-3` and writes embedding references.

Redis tasks may be duplicated or lost; PostgreSQL status is the recovery source of truth. Task handlers must be idempotent around `archive_uid` / `chunk_uid`.

### Not Done

- [ ] Page-level citation reconstruction is not implemented beyond storing `page_start` and `page_end` on chunks.
- [ ] Re-embed maintenance command is not present.
- [ ] Request validation is handwritten in the service; no dedicated validator classes or reusable validation layer are present.
- [ ] Automated tests for Markdown parsing, chunking, import persistence, queue behavior, and metadata enrichment are not present.
- [ ] Public API authentication and rate limiting are not present. Minimal admin login/session controls are now present for the maintenance frontend.
- [ ] Observability for import/search/enrichment jobs is still minimal; the admin panel now exposes coarse status endpoints, but there are no historical metrics, tracing, or alerting pipelines yet.
- [x] Default landing page is replaced with a Proof DB-specific admin entry surface instead of the Webman starter content.

### Future Optimizations

- [ ] Extend full-text search from single `query` string over `multi_match` fields to multi-query bool search, for example `queries: ["Iraq Kuwait", "Desert Storm", "policy documents"]` mapped to OpenSearch `bool.should`.
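
One possible shape for that multi-query request body is sketched below in Python. The builder name `build_multi_query`, the `size` default, and the choice of `minimum_should_match: 1` are assumptions for illustration; the field list passed in the usage line reuses field names mentioned in the Done list (`text`, `title`, `summary`) and may not match the real index mapping exactly.

```python
def build_multi_query(queries: list[str], fields: list[str], size: int = 10) -> dict:
    """Map a list of query strings to an OpenSearch `bool.should` of `multi_match` clauses."""
    cleaned = [q.strip() for q in queries if q and q.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty query is required")
    return {
        "size": size,
        "query": {
            "bool": {
                # One multi_match clause per query string.
                "should": [
                    {"multi_match": {"query": q, "fields": fields}} for q in cleaned
                ],
                # A document must match at least one clause; matching more scores higher.
                "minimum_should_match": 1,
            }
        },
    }


body = build_multi_query(
    ["Iraq Kuwait", "Desert Storm", "policy documents"],
    fields=["text", "title", "summary"],
)
```

Because `should` clauses add their scores together, chunks that match several of the queries rank above chunks that match only one.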

### Next Build Order

1. Normalize remaining API documentation wording from MySQL to PostgreSQL.
2. Add read APIs for archives/chunks/evidence so imported data can be verified without reading snapshots or the database directly.
3. Add focused tests for DOCMASTER page parsing, noise filtering, comment coalescing, chunk UID stability, and repository persistence.
4. Add async task foundation: task statuses, Redis task payload format, generic DB dispatcher process, and generic Redis worker process. (Done for embedding and OpenSearch indexing)
5. Improve page-level citation reconstruction beyond chunk page range metadata.