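# Search API
## Overview
Proof DB's search endpoints are built on the OpenSearch `proofdb_chunks` index. The current version implements full-text search, along with the vector and hybrid search endpoints documented below. The retrieval unit is the chunk: each result includes archive metadata, the page range, and the chunk text, which supports later evidence reconstruction.

Each chunk document in OpenSearch contains both:
- Full-text fields such as `text`, used for BM25 retrieval.
- `embedding`, a 2048-dimensional vector field, used for vector and hybrid retrieval.
## Full-text search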
```http
POST /api/search/fulltext
```
### Request format
`Content-Type: application/json`

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `query` | string | Yes | Search keywords or phrase |
| `limit` | int | No | Number of results to return; default `10`, maximum `50` |
| `filters` | object | No | Filter conditions |
| `filters.archive_uid` | string | No | Search within a single archive only |
| `filters.chunk_uid` | string | No | Search within a single chunk only |
| `filters.source` | string | No | Exact match on source |
| `filters.author` | string | No | Exact match on author |
| `filters.series` | string | No | Exact match on series |
| `filters.year` | int | No | Exact match on year |
| `filters.tags` | string\|array | No | Match one or more tags |
### Request examples
```bash
curl -X POST http://127.0.0.1:8787/api/search/fulltext \
  -H 'Content-Type: application/json' \
  --data '{
    "query": "policy documents",
    "limit": 5
  }'
```
With filters:
```bash
curl -X POST http://127.0.0.1:8787/api/search/fulltext \
  -H 'Content-Type: application/json' \
  --data '{
    "query": "Iraq Kuwait",
    "limit": 10,
    "filters": {
      "year": 1992,
      "tags": ["NSD 76"]
    }
  }'
```
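The request body can also be assembled programmatically. The following is a minimal sketch, not part of Proof DB itself: `build_fulltext_request` is a hypothetical helper name, and the lower bound of `1` for `limit` is an assumption (the table above only documents the default and the maximum).

```python
import json
from typing import Any, Optional

MAX_LIMIT = 50  # maximum documented in the request table


def build_fulltext_request(query: str, limit: int = 10,
                           filters: Optional[dict[str, Any]] = None) -> bytes:
    """Return a UTF-8 JSON body for POST /api/search/fulltext (hypothetical helper)."""
    if not query or not query.strip():
        # Mirrors the documented 422 validation message.
        raise ValueError("query is required.")
    if not 1 <= limit <= MAX_LIMIT:
        # Assumption: limit must be at least 1; only the maximum is documented.
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    payload: dict[str, Any] = {"query": query, "limit": limit}
    if filters:
        payload["filters"] = filters
    # ensure_ascii=False keeps non-ASCII queries and tags readable in the body.
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


body = build_fulltext_request("Iraq Kuwait", limit=10,
                              filters={"year": 1992, "tags": ["NSD 76"]})
print(body.decode("utf-8"))
```

Validating on the client side avoids a round trip for requests the server would reject anyway; the server remains the authority on what is accepted.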
### Success response
Status code:
```http
200 OK
```
Response example:
```json
{
  "code": 0,
  "message": "Full-text search completed.",
  "data": {
    "mode": "fulltext",
    "query": "policy documents",
    "limit": 5,
    "filters": [],
    "total": 1,
    "hits": [
      {
        "score": 12.34,
        "chunk_uid": "01KQHVREB6XPYF604RVZAP9NNY_1_39003",
        "archive_uid": "01KQHVREB6XPYF604RVZAP9NNY",
        "chunk_index": 1,
        "page_start": 1,
        "page_end": 1,
        "title": "NSD 76 Disposition of NSC Policy Documents",
        "source": "archive://nsc/nsd-76",
        "author": "Brent Scowcroft",
        "year": 1992,
        "series": null,
        "tags": ["NSD 76", "政策文件"],
        "text": "chunk text...",
        "embedding_model": "embedding-3",
        "embedding_dimensions": 2048
      }
    ]
  }
}
```
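Because each hit carries the archive identifier and page range, a client can turn hits into page-level citations. The sketch below assumes the response shape shown above; `citations` and the citation string format are illustrative choices, not something the API defines.

```python
import json

# Trimmed copy of the documented response example.
SAMPLE = """
{
  "code": 0,
  "message": "Full-text search completed.",
  "data": {
    "mode": "fulltext",
    "total": 1,
    "hits": [
      {
        "score": 12.34,
        "chunk_uid": "01KQHVREB6XPYF604RVZAP9NNY_1_39003",
        "archive_uid": "01KQHVREB6XPYF604RVZAP9NNY",
        "page_start": 1,
        "page_end": 1,
        "title": "NSD 76 Disposition of NSC Policy Documents"
      }
    ]
  }
}
"""


def citations(response: dict) -> list[str]:
    """Format each hit as 'title (archive_uid), page range' (hypothetical format)."""
    if response.get("code") != 0:
        raise ValueError(response.get("message", "search failed"))
    lines = []
    for hit in response["data"]["hits"]:
        start, end = hit["page_start"], hit["page_end"]
        pages = f"p. {start}" if start == end else f"pp. {start}-{end}"
        lines.append(f'{hit["title"]} ({hit["archive_uid"]}), {pages}')
    return lines


for line in citations(json.loads(SAMPLE)):
    print(line)
```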
### Error responses
#### Malformed JSON
Status code:
```http
400 Bad Request
```
```json
{
  "code": 400,
  "message": "Invalid JSON body.",
  "errors": {
    "body": "Syntax error"
  }
}
```
#### Missing query
Status code:
```http
422 Unprocessable Entity
```
```json
{
  "code": 422,
  "message": "Search request validation failed.",
  "errors": {
    "query": "query is required."
  }
}
```
#### Search failure
Status code:
```http
500 Internal Server Error
```
```json
{
  "code": 500,
  "message": "Full-text search failed.",
  "errors": {
    "search": "error message"
  }
}
```
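All three error bodies share the same envelope (`code`, `message`, `errors`), so a client can handle them in one place. This is a sketch under that assumption; `SearchApiError`, `unwrap`, and the idea of treating only `5xx` responses as retryable are illustrative and not specified by the API.

```python
class SearchApiError(Exception):
    """Raised when the API returns a non-success envelope (hypothetical type)."""

    def __init__(self, status: int, message: str, errors: dict):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message
        self.errors = errors

    @property
    def retryable(self) -> bool:
        # Assumption: 400/422 are caller mistakes; only server-side
        # failures (5xx) are worth retrying.
        return self.status >= 500


def unwrap(status: int, body: dict) -> dict:
    """Return body['data'] on success, otherwise raise SearchApiError."""
    if status == 200 and body.get("code") == 0:
        return body["data"]
    raise SearchApiError(status, body.get("message", ""), body.get("errors", {}))


try:
    unwrap(422, {
        "code": 422,
        "message": "Search request validation failed.",
        "errors": {"query": "query is required."},
    })
except SearchApiError as exc:
    print(exc, exc.errors, exc.retryable)
```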
## Vector search
```http
POST /api/search/vector
```
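### Request format
`Content-Type: application/json`

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `query` | string | Yes | Search text. The system first calls the Zhipu `embedding-3` model to convert it into a 2048-dimensional vector |
| `limit` | int | No | Number of results to return; default `10`, maximum `50` |
| `k` | int | No | Number of OpenSearch kNN candidates; defaults to `limit`, maximum `50` |
| `filters` | object | No | Filter conditions, same as for full-text search |
### Request examples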
```bash
curl -X POST http://127.0.0.1:8787/api/search/vector \
  -H 'Content-Type: application/json' \
  --data '{
    "query": "Iraq invasion and Desert Storm",
    "limit": 5,
    "k": 10
  }'
```
Chinese queries can also be submitted to vector search:
```bash
curl -X POST http://127.0.0.1:8787/api/search/vector \
  -H 'Content-Type: application/json' \
  --data '{
    "query": "伊拉克入侵科威特与沙漠风暴",
    "limit": 5
  }'
```
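The second example omits `k`, which then defaults to `limit`. A small sketch of how a client might resolve the two parameters before sending the request follows. `vector_params` is a hypothetical helper; rejecting out-of-range values (rather than clamping them) and the lower bound of `1` are assumptions, since the documentation only states the defaults and the maximum.

```python
from typing import Optional

MAX_VALUE = 50  # documented maximum for both limit and k


def vector_params(limit: int = 10, k: Optional[int] = None) -> dict[str, int]:
    """Resolve limit and k for POST /api/search/vector (hypothetical helper)."""
    if not 1 <= limit <= MAX_VALUE:
        raise ValueError(f"limit must be between 1 and {MAX_VALUE}")
    if k is None:
        # Documented default: k equals limit when omitted.
        k = limit
    if not 1 <= k <= MAX_VALUE:
        raise ValueError(f"k must be between 1 and {MAX_VALUE}")
    return {"limit": limit, "k": k}


print(vector_params(limit=5))
print(vector_params(limit=5, k=10))
```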
### Success response
Status code:
```http
200 OK
```
Response example:
```json
{
  "code": 0,
  "message": "Vector search completed.",
  "data": {
    "mode": "vector",
    "query": "Iraq invasion and Desert Storm",
    "limit": 5,
    "k": 10,
    "filters": [],
    "embedding_model": "embedding-3",
    "embedding_dimensions": 2048,
    "total": 5,
    "hits": [
      {
        "score": 0.91,
        "chunk_uid": "01KQHVREB6XPYF604RVZAP9NNY_14_97554",
        "archive_uid": "01KQHVREB6XPYF604RVZAP9NNY",
        "chunk_index": 14,
        "page_start": 8,
        "page_end": 8,
        "title": "NSD 76 Disposition of NSC Policy Documents",
        "source": "archive://nsc/nsd-76",
        "author": "Brent Scowcroft",
        "year": 1992,
        "series": null,
        "tags": ["NSD 76", "政策文件"],
        "text": "chunk text...",
        "embedding_model": "embedding-3",
        "embedding_dimensions": 2048
      }
    ]
  }
}
```
### Error responses
The error response format is the same as for full-text search. Common errors include:
- Malformed JSON: `400 Bad Request`
- Missing `query`: `422 Unprocessable Entity`
- Embedding API or OpenSearch query failure: `500 Internal Server Error`
## Hybrid search
```http
POST /api/search/hybrid
```
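### Description
Hybrid search runs the following steps on the same `query`:
1. BM25 full-text search.
2. 2048-dimensional vector kNN search.
3. Merging and ranking of the two result lists with Reciprocal Rank Fusion (RRF).

The first version does not apply an additional reranker. RRF does not compare BM25 scores with vector scores directly; it fuses results by their rank in each list, which makes it a stable hybrid baseline.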
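
The fusion step can be sketched as follows. This uses the standard RRF formulation, in which each list contributes `1 / (rrf_k + rank)` for a document at 1-based position `rank`; that matches the `rrf` value `0.0163934426` (that is, `1 / 61`) shown for rank 1 in this section's response example. The function name `rrf_fuse` and the tie-breaking by `chunk_uid` are assumptions, not the server's confirmed behavior.

```python
from collections import defaultdict


def rrf_fuse(rankings: dict[str, list[str]], rrf_k: int = 60,
             limit: int = 10) -> list[tuple[str, float]]:
    """Fuse ranked chunk_uid lists with Reciprocal Rank Fusion.

    rankings maps a retriever name (e.g. "fulltext", "vector") to its
    chunk_uids ordered from best to worst.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, chunk_uid in enumerate(ranked_ids, start=1):
            # Only the rank matters; raw BM25 or vector scores are ignored.
            scores[chunk_uid] += 1.0 / (rrf_k + rank)
    # Sort by fused score, highest first; break ties by chunk_uid (assumption).
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:limit]


fused = rrf_fuse({
    "fulltext": ["a", "b"],
    "vector": ["a", "c", "b"],
})
for chunk_uid, score in fused:
    print(chunk_uid, round(score, 10))
```

Here `a` ranks first in both lists and scores `2 / 61`; `b` appears in both lists at lower ranks and still outranks `c`, which appears in only one.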
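### Request format
`Content-Type: application/json`

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `query` | string | Yes | Search text |
| `limit` | int | No | Number of results finally returned; default `10`, maximum `50` |
| `candidate_limit` | int | No | Number of candidates recalled from each retriever; default `max(limit * 3, 20)`, maximum `50` |
| `rrf_k` | int | No | RRF smoothing parameter; default `60` |
| `filters` | object | No | Filter conditions, same as for full-text search |
| `ai` | bool | No | Default `false`. When `true`, the system first calls the existing LLM chat channel to rewrite the original query into BM25 keywords; full-text search uses the AI keywords, while vector search still uses the original query |

If AI keyword generation fails or times out, the system falls back to the original `query` for full-text search and returns the error in `keywords.error` of the response; vector search is unaffected.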
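
The `candidate_limit` default and the AI fallback rule can be illustrated with a short sketch. Everything here is hypothetical: `default_candidate_limit` assumes values above `50` are clamped rather than rejected, `rewrite` stands in for the LLM keyword call, and treating an empty keyword list as a failure is an assumption. The returned dictionary borrows its field names from the `keywords` object in the response.

```python
from typing import Callable, Optional


def default_candidate_limit(limit: int, candidate_limit: Optional[int] = None) -> int:
    """Apply the documented default max(limit * 3, 20), capped at 50."""
    if candidate_limit is None:
        candidate_limit = max(limit * 3, 20)
    return min(candidate_limit, 50)  # assumption: clamp instead of reject


def resolve_fulltext_query(query: str, ai: bool,
                           rewrite: Callable[[str], list[str]]) -> tuple[str, dict]:
    """Return the query to use for BM25 plus a keywords-style status dict."""
    info = {"enabled": ai, "attempted": False, "error": None,
            "keywords": [], "query": None}
    if not ai:
        return query, info
    info["attempted"] = True
    try:
        keywords = rewrite(query)
        if not keywords:
            raise ValueError("no keywords returned")
    except Exception as exc:
        # Fall back to the original query; vector search never depended on this.
        info["error"] = str(exc)
        return query, info
    info["keywords"] = keywords
    info["query"] = " ".join(keywords)
    return info["query"], info


def failing_rewrite(_: str) -> list[str]:
    raise TimeoutError("timed out")


print(default_candidate_limit(5))
print(resolve_fulltext_query("Iraq invasion and Desert Storm", True, failing_rewrite))
```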
### Request examples
```bash
curl -X POST http://127.0.0.1:8787/api/search/hybrid \
  -H 'Content-Type: application/json' \
  --data '{
    "query": "Iraq invasion and Desert Storm",
    "limit": 5,
    "candidate_limit": 20
  }'
```
Chinese query:
```bash
curl -X POST http://127.0.0.1:8787/api/search/hybrid \
  -H 'Content-Type: application/json' \
  --data '{
    "query": "伊拉克入侵科威特与沙漠风暴",
    "limit": 5,
    "ai": true
  }'
```
### Success response
Status code:
```http
200 OK
```
Response example:
```json
{
  "code": 0,
  "message": "Hybrid search completed.",
  "data": {
    "mode": "hybrid",
    "query": "Iraq invasion and Desert Storm",
    "limit": 5,
    "candidate_limit": 20,
    "rrf_k": 60,
    "filters": [],
    "ai": true,
    "fulltext_query": "Iraq Kuwait invasion Desert Storm",
    "vector_query": "Iraq invasion and Desert Storm",
    "keywords": {
      "enabled": true,
      "attempted": true,
      "error": null,
      "keywords": ["Iraq", "Kuwait", "invasion", "Desert Storm"],
      "query": "Iraq Kuwait invasion Desert Storm",
      "model": "glm-4.7-flash"
    },
    "total": 10,
    "sources": {
      "fulltext_total": 1,
      "vector_total": 20,
      "fulltext_hits": 1,
      "vector_hits": 20
    },
    "hits": [
      {
        "score": 4.13,
        "hybrid_score": 0.0328,
        "rank_sources": {
          "fulltext": {
            "rank": 1,
            "score": 4.13,
            "rrf": 0.0163934426
          },
          "vector": {
            "rank": 1,
            "score": 0.79,
            "rrf": 0.0163934426
          }
        },
        "chunk_uid": "01KQHVREB6XPYF604RVZAP9NNY_14_97554",
        "archive_uid": "01KQHVREB6XPYF604RVZAP9NNY",
        "page_start": 8,
        "page_end": 8,
        "text": "chunk text..."
      }
    ]
  }
}
```
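The `rank_sources` object makes it possible to see which retriever contributed to a hit. The sketch below reads the fields shown in the example above; `source_ranks` and `rrf_total` are hypothetical helper names, and summing the per-source `rrf` values follows the standard RRF definition rather than any confirmed detail of how the server computes `hybrid_score`.

```python
def source_ranks(hit: dict) -> dict[str, int]:
    """Map each contributing retriever to the hit's rank in that retriever's list."""
    return {name: info["rank"] for name, info in hit.get("rank_sources", {}).items()}


def rrf_total(hit: dict) -> float:
    """Sum the per-source RRF contributions of a hit."""
    return sum(info["rrf"] for info in hit.get("rank_sources", {}).values())


hit = {
    "chunk_uid": "01KQHVREB6XPYF604RVZAP9NNY_14_97554",
    "rank_sources": {
        "fulltext": {"rank": 1, "score": 4.13, "rrf": 0.0163934426},
        "vector": {"rank": 1, "score": 0.79, "rrf": 0.0163934426},
    },
}

print(source_ranks(hit))
print(round(rrf_total(hit), 4))
```

A hit found by only one retriever has a single key in `rank_sources`, which is a quick way to spot results that depend entirely on keyword or entirely on semantic matching.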
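### Error responses
The error response format is the same as for full-text search. Common errors include:
- Malformed JSON: `400 Bad Request`
- Missing `query`: `422 Unprocessable Entity`
- Embedding API, full-text search, or vector search failure: `500 Internal Server Error`
## Upcoming endpoints
The following capabilities are not yet implemented: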
```http
GET /api/chunks/{chunk_uid}
GET /api/evidence/{chunk_uid}
```