MariaDB 中文社区

用 MariaDB VECTOR 做 RAG(替代 pgvector)

从 chunk → embed → 写库 → 检索 → rerank → 生成,Python 与 Node 双版本完整范例

MariaDB 11.8+ 内置了 VECTOR 数据类型和 HNSW 近似最近邻索引。如果你已经在用 MariaDB / MySQL,完全不需要为 RAG 再装一个 pgvector / Pinecone / Qdrant。

这一篇是端到端的可运行示例

前置

  • MariaDB 11.7 或更高(11.4 LTS 不支持,要等 12.x LTS 或临时升 11.8)
  • OpenAI / Voyage / Cohere embedding API key

验证版本:SELECT VERSION(); 应该返回 11.7.x 或更高。

建表

CREATE DATABASE rag DEFAULT CHARSET utf8mb4;
USE rag;

CREATE TABLE documents (
  id           BIGINT AUTO_INCREMENT PRIMARY KEY,
  source_url   VARCHAR(512),
  title        VARCHAR(255),
  created_at   DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE chunks (
  id            BIGINT AUTO_INCREMENT PRIMARY KEY,
  document_id   BIGINT NOT NULL,
  chunk_index   INT NOT NULL,
  content       TEXT NOT NULL,
  content_hash  CHAR(64) NOT NULL,                  -- SHA-256,去重
  embed         VECTOR(1536) NOT NULL,              -- OpenAI text-embedding-3-small
  token_count   INT,
  UNIQUE KEY uq_doc_idx (document_id, chunk_index),
  UNIQUE KEY uq_hash (content_hash),
  VECTOR INDEX (embed) M=16 DISTANCE=cosine,
  CONSTRAINT fk_doc FOREIGN KEY (document_id) REFERENCES documents(id) ON DELETE CASCADE
);

M=16 DISTANCE=cosine 是 HNSW 的常用配置:M 越大检索质量越高、写入越慢、内存越大。

Step 1:Chunking

// chunk.ts
export function chunk(text: string, maxChars = 800, overlap = 150): string[] {
  const out: string[] = [];
  let i = 0;
  while (i < text.length) {
    let end = Math.min(i + maxChars, text.length);
    // 尝试在句号 / 换行处断
    const lastBreak = text.lastIndexOf('\n', end);
    if (lastBreak > i + maxChars / 2) end = lastBreak;
    out.push(text.slice(i, end).trim());
    i = end - overlap;
  }
  return out.filter(Boolean);
}

经验:800 字符 / 150 重叠对中英文混合都比较稳。文档结构清晰的(带标题)可以按标题切。

Step 2:嵌入 + 入库

import { createPool } from 'mysql2/promise';
import { createHash } from 'crypto';
import OpenAI from 'openai';

const db = createPool({ uri: process.env.MARIADB_URL });
const oai = new OpenAI();

async function embed(texts: string[]): Promise<number[][]> {
  const r = await oai.embeddings.create({ model: 'text-embedding-3-small', input: texts });
  return r.data.map((d) => d.embedding);
}

export async function ingest(docId: number, fullText: string) {
  const chunks = chunk(fullText);
  // 批量嵌入(OpenAI 限 2048 input / call)
  const batchSize = 64;
  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    const embeds = await embed(batch);
    const values = batch.map((c, j) => [
      docId,
      i + j,
      c,
      createHash('sha256').update(c).digest('hex'),
      JSON.stringify(embeds[j]),
      c.length, // 简化 token 估算
    ]);
    await db.query(
      `INSERT IGNORE INTO chunks
       (document_id, chunk_index, content, content_hash, embed, token_count)
       VALUES ?`,
      [values.map((v) => [v[0], v[1], v[2], v[3], db.escape(v[4]), v[5]])],
    );
  }
}
import mariadb, hashlib, json
from openai import OpenAI

db = mariadb.connect(host=..., user=..., password=..., database='rag')
oai = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    r = oai.embeddings.create(model='text-embedding-3-small', input=texts)
    return [d.embedding for d in r.data]

def ingest(doc_id: int, full_text: str):
    chunks = chunk(full_text)
    cur = db.cursor()
    for i in range(0, len(chunks), 64):
        batch = chunks[i:i+64]
        embeds = embed(batch)
        for j, c in enumerate(batch):
            h = hashlib.sha256(c.encode()).hexdigest()
            cur.execute(
              """INSERT IGNORE INTO chunks
                 (document_id, chunk_index, content, content_hash, embed, token_count)
                 VALUES (?, ?, ?, ?, VEC_FromText(?), ?)""",
              (doc_id, i + j, c, h, json.dumps(embeds[j]), len(c)),
            )
    db.commit()

Step 3:检索

export async function retrieve(question: string, k = 8) {
  const [qEmb] = await embed([question]);
  const [rows] = await db.query<any[]>(
    `SELECT id, document_id, content,
            VEC_DISTANCE_COSINE(embed, VEC_FromText(?)) AS dist
     FROM chunks
     ORDER BY dist LIMIT ?`,
    [JSON.stringify(qEmb), k],
  );
  return rows;
}

返回的 dist 越小越相似。Cosine 距离在 [0, 2],通常 < 0.4 算高相关。

Step 4:Rerank(强烈推荐)

向量检索的 top-k 里"语义相近但答非所问"的比例很高。用一个 reranker(Cohere / Voyage / Jina)筛一遍:

import { CohereClient } from 'cohere-ai';
const co = new CohereClient({ token: process.env.COHERE_API_KEY });

async function rerank(question: string, docs: string[]) {
  const r = await co.rerank({
    model: 'rerank-multilingual-v3.0',
    query: question,
    documents: docs,
    topN: 4,
  });
  return r.results.map((x) => ({ index: x.index, score: x.relevanceScore }));
}

Top-8 → rerank → top-4 是常用配比。

Step 5:生成

const top = await retrieve(question, 8);
const reranked = await rerank(question, top.map((t) => t.content));
const context = reranked.map((r) => top[r.index].content).join('\n---\n');

const answer = await llm.complete(`
Use only the CONTEXT to answer. If the context doesn't contain the answer, say "I don't know".

CONTEXT:
${context}

QUESTION: ${question}
ANSWER:
`);

性能提示

操作量级提示
VEC_DISTANCE_COSINE + LIMIT k千万行HNSW 索引必须建;不带索引会全表扫描,秒级变分钟级
入库万行/分钟用批量 INSERT,避免逐条
嵌入 API$/M tokentext-embedding-3-small ≈ $0.02 / M token,性价比之王
索引重建视行数写入密集场景考虑 ANALYZE TABLE chunks 定期跑

与 pgvector / Pinecone 对比

维度MariaDB VECTORpgvectorPinecone
部署已有 MariaDB 零额外组件已有 PG 零额外组件第三方 SaaS
算法HNSWHNSW / IVFFlatHNSW + 商业优化
维度上限163831600020000
事务✅ 与业务表同库可一并提交
多租户 / RLS用账号 + viewRLS 强metadata filter
价格自建免费自建免费$70+/月起

简单结论:业务库是 MariaDB / MySQL 的,没理由不用 MariaDB VECTOR;想要省运维去 Pinecone。

真实坑

  1. VECTOR 在 MariaDB 11.4 LTS 还没有,等 12.x LTS 或先用 11.8
  2. HNSW 索引建好之前,所有查询都是全表扫描 —— 上线前一定 EXPLAIN 验证
  3. VEC_FromText 接受 JSON 数组字符串,单引号包起来;用驱动占位符更安全
  4. 检索距离单位要和索引一致:建索引时 DISTANCE=cosine,查询用 VEC_DISTANCE_COSINE不能混用
  5. embedding 模型一致性:库里所有 chunk 必须用同一个模型 embed,换模型就要全量重 embed

参考

本页目录