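Glossary

Reference · Quick Lookup

An alphabetical index of concepts. If you are not sure which directory to look in, start here; for specific questions, go to the FAQ; for role-based learning paths, go to Getting Started by Role.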

A – C

A/B Testing: see MLOps Lifecycle
ACID on Object Store: see Lake Table
Ad-hoc Exploration: see Ad-hoc Exploration
Agentic RAG: see RAG §4 Advanced Paradigms
Agents on Lakehouse: see Agents
AI Governance: see ops/compliance §4 AI Compliance
ANN (Approximate Nearest Neighbor): see HNSW / IVF-PQ / DiskANN
Anti-Patterns: see 20 Lakehouse Anti-Patterns
Arrow / FlightSQL / ADBC: see Arrow Ecosystem
ASR (Automatic Speech Recognition): see Audio Pipeline
AWQ / GPTQ (LLM quantization): see LLM Inference
AQE (Adaptive Query Execution): Spark's adaptive query execution; see Apache Spark
Backfill: see Migration Playbook · Bulk Loading
Benchmark: see Benchmark Reference · Order-of-Magnitude Numbers Summary
Bin-packing (a compaction algorithm): merges small files, bucketed by size, up to a target size; see Compaction
Binary Embedding: see Quantization
Bloom Filter: a probabilistic membership test, used as a pre-filter in predicate pushdown; see Predicate Pushdown · Puffin
Branching & Tagging: native Iceberg branches and tags; see Branching & Tagging
Bulk Loading: see Bulk Loading
Business Scenarios: see Business Scenarios Overview
Capacity Planning: see Capacity Planning
Case Studies: see Case Studies
Catalog: see the Catalog chapter
CDC (Change Data Capture): see Streaming Upsert / CDC
CBO (Cost-Based Optimizer): a cost-based query optimizer; see Apache Spark · Trino
CDP (Customer Data Platform): see CDP · User Segmentation
Classical ML: see Classical ML Prediction
ColBERT / Late Interaction: see Multimodal Retrieval Patterns, Pattern E · RAG §4 ColBERT v2
Compliance: see Compliance
Contextual Retrieval: see RAG §4 Advanced Paradigms
CRAG (Corrective RAG): see RAG §4 Advanced Paradigms
CLIP / SigLIP: multimodal embedding models; see Multimodal Embedding
ClickHouse: see ClickHouse
Columnar Storage: see Columnar vs Row-Oriented
Compaction: see Compaction
Compute Pushdown: see Compute Pushdown
Compute-Storage Separation: see Compute-Storage Separation
Consistency (consistency models): see Consistency Models
CoW / MoR: see Hudi / Delete Files
Cost Optimization: see Cost Optimization
Cross-modal Query: see Cross-modal Query

D – F

Data Contract: see Data Quality for ML
Data Governance: see Data Governance
Data Lineage: see Data Governance · Unity Catalog
Dedupe (deduplication): see Streaming Upsert / CDC
Data Systems Evolution: see A History of Three Generations of Data Systems
Debezium: see Kafka to Lake
Delete Files: see Delete Files
Delta Lake: see Delta Lake
DiskANN: see DiskANN / Paper Notes
Document Pipeline: see Document Pipeline
Doris: see Apache Doris
DuckDB: see DuckDB
Embedding: see Embedding
Embedding Pipeline: see Embedding Pipeline
Event Time / Watermark: see Event Time and Watermark
Feature Store: see Feature Store · Comparison
Feature Serving: see Feature Serving
Feast: see Feature Store Comparison
Fine-tuning Data: see Fine-tuning Data Preparation
Flink: see Apache Flink
Flash Attention: see LLM Inference
Fraud Detection: see Fraud Detection · Risk Control

G – I

GDPR / HIPAA / PDPA / PIPL (China's Personal Information Protection Law): see Compliance
GPU Scheduling: see GPU Scheduling
Gravitino: see Apache Gravitino
Guardrails (LLM guardrails): see Guardrails
Hive Metastore: see Hive Metastore
HNSW: see HNSW
Hybrid Search: see Hybrid Search
HyDE: see RAG §4 Advanced Paradigms
Iceberg: see Apache Iceberg
Iceberg REST Catalog: see Iceberg REST Catalog
Iceberg v3: the next evolution of the Iceberg specification; see Iceberg v3
Image Pipeline: see Image Pipeline
Incident Management: see Incident Management
IVF-PQ: see IVF-PQ

J – M

Kafka → Lake: see Kafka to Lake
Lakehouse: an architecture that merges the data lake and the data warehouse; see the Lakehouse chapter · History of Evolution
Lake Table: see Lake Table
Lake + Vector: see Lake + Vector Converged Architecture
Lance Format: see Lance Format
LanceDB: see LanceDB
LLM Inference Optimization: see LLM Inference
LLM Serving: see Model Serving
Manifest: see Manifest
Materialized View: see Materialized View
Matryoshka Embedding: see Quantization · Embedding
MCP (Model Context Protocol): see MCP
Migration Playbook: see Migration Playbook
Milvus: see Milvus
MLOps Lifecycle: see MLOps Lifecycle
ML Evaluation: see ML Evaluation (traditional ML; LLM/RAG evaluation is covered in rag-evaluation)
Experiment Tracking: see Experiment Tracking
Data Contract / Data Quality for ML: see Data Quality for ML
Label Quality: see Data Quality for ML §4 + Model Monitoring §Label Quality
Calibration: see ML Evaluation §Calibration
Fairness (Demographic Parity / Equalized Odds / Disparate Impact): see ML Evaluation §Fairness · Model Monitoring §Fairness
A/B Testing (significance · power · sample size): see ML Evaluation §A/B
Model Card: a model compliance artifact (required for high-risk systems under the EU AI Act) · see Model Registry §Model Card
Model Monitoring: see Model Monitoring
Model Registry: see Model Registry
Model Serving: see Model Serving
Modern Data Stack: see Modern Data Stack
MoE (Mixture of Experts): see LLM Inference
MRR / nDCG: see Retrieval Evaluation
Multimodal Data Modeling: see Multimodal Data Modeling
Multimodal Embedding: see Multimodal Embedding
MVCC: see MVCC

N – R

Nessie: see Nessie
OpenLineage: an open standard for data lineage; see Data Governance
Object Storage: see Object Storage
Observability: see Observability
Offline Training Pipeline: see Offline Training Data Pipeline
OLAP Accelerator (acceleration replica): see OLAP Acceleration Replica Comparison
OLAP Modeling: see OLAP Modeling
Orchestration: see Orchestration Systems Overview · Comparison
OLTP vs OLAP: see OLTP vs OLAP
ORC: see ORC
Paimon: see Apache Paimon
PagedAttention: see LLM Inference
Parquet: see Parquet
Partition Evolution: see Partition Evolution
Performance Tuning: see Performance Tuning
pgvector: see pgvector
PII (Personally Identifiable Information): see Guardrails · Compliance
PIT Join (Point-in-Time): primary definition in Feature Store (for the engineering implementation, see Offline Training Data Pipeline)
Train-Serve Skew: primary definition in Feature Store (for the anti-pattern roundup, see 20 Anti-Patterns)
Polaris: see Apache Polaris
Predicate Pushdown: see Predicate Pushdown
Prompt Injection: see Guardrails §2 Input Guardrails
Prompt Management: see Prompt Management
PSI (Population Stability Index): see MLOps Lifecycle
Puffin: see Puffin
Qdrant: see Qdrant
Query Acceleration: see Query Acceleration
RAG: see RAG
RAG Evaluation: see RAG Evaluation
RAGAS: see RAG Evaluation
RBAC (Role-Based Access Control): see Security and Permissions
Recall@K: see Retrieval Evaluation
Recommender System: see Recommender System
Red Teaming: see Guardrails §7 Red Teaming
Rerank: see Rerank · Model Comparison
RisingWave: see Stream Processing Engine Comparison
RRF (Reciprocal Rank Fusion): see Hybrid Search

S – Z

SCD (Slowly Changing Dimension): slowly changing dimension modeling, Types 1/2/3; see OLAP Modeling
Schema Drift: see Schema Evolution · Data Quality for ML
Schema Evolution: see Schema Evolution
Security: see Security and Permissions
Self-RAG: see RAG §4 Advanced Paradigms
Semantic Cache: see Semantic Cache
Semantic Layer: see Semantic Layer
SGLang: see LLM Inference
SLA / SLO / SLI: see SLA · SLO · DRE
Snapshot: see Snapshot
Snapshot Isolation: the transaction isolation level provided by MVCC; see MVCC · Consistency Models
Spark: see Apache Spark
Speculative Decoding: see LLM Inference
SPLADE: see Sparse Retrieval · Hybrid Search
SOC 2: see Compliance
Star Schema / Snowflake Schema: classic dimensional modeling patterns; see OLAP Modeling
StarRocks: see StarRocks
Streaming Engines (stream processing comparison): see Stream Processing Engine Comparison
Streaming Upsert / CDC: see Streaming Upsert / CDC
TCO (Total Cost of Ownership): see TCO Model
Text-to-SQL: see Text-to-SQL Platform
Time Travel: see Time Travel
Training Orchestration: see Training Orchestration
Trino: see Trino
Troubleshooting: see Troubleshooting Handbook
Unity Catalog: see Unity Catalog
Vectorized Execution: see Vectorized Execution
Vector Database: see Vector Database
Vector Trends: new retrieval directions are spread across Embedding · Quantization · Sparse Retrieval
Video Pipeline: see Video Pipeline
vLLM: see LLM Inference
Weaviate: see Weaviate
Whisper: see Audio Pipeline
Z-order / Liquid Clustering: see Query Acceleration

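Maintenance Conventions

This page is maintained by hand. When you add a new concept page, add a line here as well. Once the number of entries exceeds 200, switch to generating the page with a script.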
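
As a rough illustration of what script-based generation might look like, here is a minimal Python sketch. It is not an existing tool for this site: the function names `bucket_for` and `render_glossary`, the `(term, target)` input shape, and the `term: see target` output format are all assumptions made for the example. The letter buckets mirror the headings used on this page.

```python
# Hypothetical letter buckets, mirroring the headings on this page.
BUCKETS = [("A", "C"), ("D", "F"), ("G", "I"), ("J", "M"), ("N", "R"), ("S", "Z")]


def bucket_for(term):
    """Return the heading a term belongs under, based on its first letter."""
    first = term.lstrip()[0].upper()
    for lo, hi in BUCKETS:
        if lo <= first <= hi:
            return f"{lo} – {hi}"
    return "Other"  # digits, symbols, or non-Latin initials


def render_glossary(entries):
    """Render (term, target) pairs as a sorted, bucketed plain-text page body."""
    groups = {f"{lo} – {hi}": [] for lo, hi in BUCKETS}
    groups["Other"] = []
    # Case-insensitive sort so that e.g. "pgvector" and "vLLM" land in letter order.
    for term, target in sorted(entries, key=lambda e: e[0].casefold()):
        groups[bucket_for(term)].append(f"{term}: see {target}")

    lines = []
    for heading, items in groups.items():
        if not items:
            continue  # skip empty buckets
        lines.extend([heading, "", *items, ""])
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    sample = [
        ("Trino", "Trino"),
        ("HNSW", "HNSW"),
        ("Compaction", "Compaction"),
        ("ANN", "HNSW / IVF-PQ / DiskANN"),
    ]
    print(render_glossary(sample))
```

A real generator would also need to read the entries from somewhere (page front matter or a data file, for instance). That part is left out here because the source does not say where entries would live.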