[Teacher Sima's Database Lecture Series, Part 3] Learning Databases from Papers 3: A Reading List of Database Papers

锋哥聊DORIS数仓 2025-01-13

Learning Databases from Papers 1: HTAP
Learning Databases from Papers 2: MVCC

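A while ago I published the [Learning Databases from Papers] series, and the feedback was quite positive. Along the way, some readers asked whether there was a more comprehensive set of database papers. That led to this index of outstanding papers in the database field, which aims to make it easier to read the landmark works of the people who lead and drive the database industry.

This article indexes 156 outstanding database papers from around the world. By technical area, the list covers database fundamentals, system design, SQL engines, and storage engines. By author, it includes papers by Turing Award winners such as Edgar Codd, Jim Gray, and Michael Stonebraker; by Google, which ushered in the big data era; by Oracle, the dominant commercial database vendor; and by Amazon, the leader in cloud computing. It also includes papers by leading database figures in China and abroad, such as TiDB CTO Huang Dongxu; Cao Wei, former Alibaba Cloud researcher and head of PolarDB (now founder and CEO of 杭州云猿生数据有限公司); and Professor Li Guoliang of Tsinghua University.
My own ability is limited and the classic papers in the database field are too numerous to count, so this list is certainly incomplete. Please forgive any omissions, and feel free to contact me to discuss.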

01 Fundamentals

1.1 Basics

A relational model of data for large shared data banks (1970) - Codd, Edgar F.
https://dl.acm.org/doi/pdf/10.1145/362384.362685
SEQUEL: A structured English query language (1974) - Chamberlin, Donald D., and Raymond F. Boyce.
https://dl.acm.org/doi/pdf/10.1145/800296.811515
INGRES: a relational data base system (1975) - Held, G. D., M. R. Stonebraker, and Eugene Wong.
https://dl.acm.org/doi/pdf/10.1145/1499949.1500029
Extending the database relational model to capture more meaning (1979) - Codd, Edgar F.
https://dl.acm.org/doi/pdf/10.1145/320107.320109
Raft: In Search of an Understandable Consensus Algorithm (2014) - Ongaro, Diego, and John Ousterhout.
https://dl.acm.org/doi/10.5555/2643634.2643666
A critique of the SQL database language (1984) - Date, C. J.
https://dl.acm.org/doi/pdf/10.1145/984549.984551
What Goes Around Comes Around... And Around... (2024) - Stonebraker, Michael, and Andrew Pavlo. (A retrospective on the past 20 years of database development, arguing that progress moves in cycles.)
https://dl.acm.org/doi/10.1145/3685980.3685984
Cloud Programming Simplified: A Berkeley View on Serverless Computing (2019) - Jonas, Eric, et al.
https://arxiv.org/abs/1902.03383

1.2 Consistency

Consistency Tradeoffs in Modern Distributed Database System Design (2012) - Abadi, Daniel.
https://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf
PolarFS: an ultra-low latency and failure resilient distributed file system for shared storage cloud database (2018) - Cao, Wei, et al.
https://dl.acm.org/doi/pdf/10.14778/3229863.3229872
Anna: A kvs for any scale (2018) - Wu, Chenggang, et al.
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-122.pdf
Strong and efficient consistency with consistency-aware durability (2021) - Ganesan, Aishwarya, et al.
https://dl.acm.org/doi/pdf/10.1145/3423138
Logical physical clocks and consistent snapshots in globally distributed databases (2014) - Kulkarni S S, Demirbas M, Madappa D, et al.
https://cse.buffalo.edu/tech-reports/2014-04.pdf
Ark: A Real-World Consensus Implementation (2014) - Kasheff, Zardosht, and Leif Walsh.
https://arxiv.org/pdf/1407.4765

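1.3 Consensus

In databases, consensus is the process by which the nodes of a distributed system agree on a single state or result. It is normally achieved with a distributed protocol or algorithm that keeps all nodes in agreement about the data, even when nodes fail or the network partitions. Consensus therefore matters a great deal in distributed databases: it underpins the consistency, correctness, and safety of the data.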
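To make the idea of agreement under failures concrete, here is a minimal sketch of the majority-quorum rule that protocols such as Paxos and Raft build on. It is not an implementation of either protocol; the node names and the helper functions `majority`, `is_committed`, and `quorums_always_intersect` are hypothetical and exist only for illustration.

```python
from itertools import combinations


def majority(n):
    """Smallest number of nodes that forms a majority of an n-node cluster."""
    return n // 2 + 1


def is_committed(acks, cluster):
    """Treat a value as agreed once a majority of the cluster acknowledged it."""
    return len(acks & cluster) >= majority(len(cluster))


def quorums_always_intersect(cluster):
    """Check by enumeration that any two majority quorums share at least one node."""
    size = majority(len(cluster))
    quorums = [set(c) for c in combinations(sorted(cluster), size)]
    return all(a & b for a in quorums for b in quorums)


# Hypothetical five-node cluster.
cluster = {"n1", "n2", "n3", "n4", "n5"}
print(is_committed({"n1", "n2", "n3"}, cluster))  # 3 of 5 nodes acknowledged
print(is_committed({"n1", "n2"}, cluster))        # 2 of 5, e.g. a minority partition
print(quorums_always_intersect(cluster))
```

Because any two majorities overlap in at least one node, a minority side of a network partition can never commit a conflicting value. That overlap is the property the papers below rely on, each in its own way.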

The Part-Time Parliament (1998) - Lamport, Leslie.
https://dl.acm.org/doi/pdf/10.1145/3335772.3335939
Consensus: Bridging theory and practice (2014) - Ongaro, Diego.
https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf
In search of an understandable consensus algorithm (extended version) (2014) - Ongaro, Diego, and John Ousterhout.
https://raft.github.io/raft.pdf
Distributed consensus revised (2019) - Howard, Heidi.
https://www.repository.cam.ac.uk/bitstream/handle/1810/291682/thesis.pdf?sequence=1
A Generalised Solution to Distributed Consensus (2019) - Howard, Heidi, and Richard Mortier.
https://arxiv.org/pdf/1902.06776
Paxos vs Raft: Have we reached consensus on distributed consensus? (2020) - Howard, Heidi, and Richard Mortier.
https://dl.acm.org/doi/pdf/10.1145/3380787.3393681

02 Database System Design

2.1 Relational Databases

System R: Relational Approach to Database Management (1976) - Astrahan, Morton M., et al.
https://dl.acm.org/doi/pdf/10.1145/320455.320457
The design and implementation of INGRES (1976) - Stonebraker, Michael, et al.
https://dl.acm.org/doi/10.1145/320473.320476
The design of Postgres (1986) - Stonebraker, Michael, and Lawrence A. Rowe.
https://dl.acm.org/doi/pdf/10.1145/16856.16888
Online, Asynchronous Schema Change in F1 (2013) - Rae, Ian, et al.
https://dl.acm.org/doi/pdf/10.14778/2536222.2536230
Amazon aurora: Design considerations for high throughput cloud-native relational databases (2017) - Verbitski, Alexandre, et al.
https://dl.acm.org/doi/pdf/10.1145/3035918.3056101
Looking Back at Postgres (2019) - Hellerstein, Joseph M.
https://arxiv.org/pdf/1901.01973
CockroachDB: The Resilient Geo-Distributed SQL Database (2020) - Taft, Rebecca, et al.
https://dl.acm.org/doi/pdf/10.1145/3318464.3386134
Query Processing in Main Memory Database Management Systems (1986) - Lehman, Tobin J., and Michael J. Carey.
https://dl.acm.org/doi/pdf/10.1145/16894.16878
Megastore: Providing Scalable, Highly Available Storage for Interactive Services (2011) - Baker J, Bond C, Corbett J C, et al.
http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/Megastore.pdf
Spanner: Google's globally distributed database (2013) - Corbett, James C., et al.
https://dl.acm.org/doi/pdf/10.1145/2491245
F1 Lightning: HTAP as a Service (2020) - Yang, Jiacheng, et al.
https://dl.acm.org/doi/pdf/10.14778/3415478.3415553
TiDB: a Raft-based HTAP database (2020) - Huang, Dongxu, et al.
https://dl.acm.org/doi/pdf/10.14778/3415478.3415535
PolarDB Serverless: A Cloud Native Database for Disaggregated Data Centers (2021) - Cao, Wei, et al.
https://dl.acm.org/doi/pdf/10.1145/3448016.3457560
HTAP Databases: What is New and What is Next (2022) - Li, Guoliang, et al.
https://dl.acm.org/doi/abs/10.1145/3514221.3522565
Greenplum: A Hybrid Database for Transactional and Analytical Workloads (2021) - Lyu, Zhenghua, et al.
https://dl.acm.org/doi/10.1145/3448016.3457562

2.2 Non-Relational Databases

Bigtable: A Distributed Storage System for Structured Data (2006) - Chang, Fay, et al.
https://dl.acm.org/doi/pdf/10.1145/1365815.1365816
Dynamo: Amazon’s Highly Available Key-value Store (2007) - DeCandia, Giuseppe, et al.
https://dl.acm.org/doi/pdf/10.1145/1323293.1294281
PNUTS: Yahoo!’s Hosted Data Serving Platform (2008) - Cooper, Brian F., et al.
https://dl.acm.org/doi/pdf/10.14778/1454159.1454167
Cassandra - A Decentralized Structured Storage System (2010) - Lakshman, Avinash, and Prashant Malik.
https://dl.acm.org/doi/pdf/10.1145/1773912.1773922
Windows azure storage: a highly available cloud storage service with strong consistency (2011) - Calder, Brad, et al.
https://dl.acm.org/doi/pdf/10.1145/2043556.2043571
Azure data lake store: a hyperscale distributed file service for big data analytics (2017) - Ramakrishnan, Raghu, et al.
https://dl.acm.org/doi/pdf/10.1145/3035918.3056100
Spark SQL: Relational Data Processing in Spark (2015) - Armbrust, Michael, et al.
https://dl.acm.org/doi/10.1145/2723372.2742797
HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads (2009) - Abouzeid, Azza, et al.
https://15799.courses.cs.cmu.edu/fall2013/static/papers/vldb09-861.pdf

03 SQL Engine

3.1 Optimizer

Access Path Selection in a Relational Database Management System (1979) - Selinger, P. Griffiths, et al.
https://dl.acm.org/doi/pdf/10.1145/582095.582099
Query Optimization by Simulated Annealing (1987) - Ioannidis, Yannis E., and Eugene Wong.
https://dl.acm.org/doi/pdf/10.1145/38713.38722
The Cascades Framework for Query Optimization (1995) - Graefe, Goetz.
https://liuyehcf.github.io/resources/paper/The-Cascades-Framework-For-Query-Optimization.pdf
An Overview of Query Optimization in Relational Systems (1998) - Chaudhuri, Surajit.
https://dl.acm.org/doi/pdf/10.1145/275487.275492
Robust Query Processing through Progressive Optimization (2004) - Markl, Volker, et al.
https://dl.acm.org/doi/pdf/10.1145/1007568.1007642
Orca: A Modular Query Optimizer Architecture for Big Data (2014) - Soliman, Mohamed A., et al.
https://dl.acm.org/doi/pdf/10.1145/2588555.2595637
Parallelizing Query Optimization on Shared-Nothing Architectures (2015) - Trummer, Immanuel, and Christoph Koch.
https://arxiv.org/pdf/1511.01768
The MemSQL Query Optimizer: A modern optimizer for real-time analytics in a distributed database (2016) - Chen, Jack, et al.
https://dl.acm.org/doi/pdf/10.14778/3007263.3007277
Extensible/Rule Based Query Rewrite Optimization in Starburst (1992) - Pirahesh, Hamid, Joseph M. Hellerstein, and Waqar Hasan.
https://dl.acm.org/doi/pdf/10.1145/141484.130294
The Volcano Optimizer Generator: Extensibility and Efficient Search (1993) - Graefe, Goetz, and William J. McKenna.
https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/Papers/Volcano-graefe.pdf
Processing Queries with Quantifiers: A Horticultural Approach (1983) - Dayal, Umeshwar.
https://dl.acm.org/doi/pdf/10.1145/588058.588075
Translating SQL into relational algebra: Optimization, semantics, and equivalence of SQL queries (1985) - Ceri, Stefano, and Georg Gottlob.
https://www.academia.edu/download/50687636/tse.1985.23222320161202-29901-8u86ef.pdf
Parameterized Queries and Nesting Equivalences (2000) - Galindo-Legaria, C. A.
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2000-31.pdf
Cost-based query transformation in Oracle (2006) - Ahmed, Rafi, et al.
https://www.researchgate.net/profile/Rafi-Ahmed-2/publication/221311318_Cost-Based_Query_Transformation_in_Oracle/links/572bbc5e08aef7c7e2c6b829/Cost-Based-Query-Transformation-in-Oracle.pdf
Grammar-like Functional Rules for Representing Query Optimization Alternatives (1988) - Lohman, Guy M.
https://dl.acm.org/doi/pdf/10.1145/971701.50204
Query Optimization by Predicate Move-Around (1994) - Levy, Alon Y., Inderpal Singh Mumick, and Yehoshua Sagiv.
https://www.researchgate.net/profile/Inderpal-Mumick/publication/2754592_Query_Optimization_by_Predicate_Move-Around/links/0f317534d437e49755000000/Query-Optimization-by-Predicate-Move-Around.pdf

3.2 Nested Queries

On optimizing an SQL-like nested query (1982) - Kim, Won.
https://dl.acm.org/doi/pdf/10.1145/319732.319745
Optimization of nested queries in a distributed relational database (1984) - Lohman, Guy M., et al.
https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=12fd1fe22687f5944613832de4e64ef902043aec
Optimization of nested SQL queries revisited (1987) - Ganski, Richard A., and Harry KT Wong.
https://dl.acm.org/doi/pdf/10.1145/38714.38723
A Unified Approach to Processing Queries That Contain Nested Subqueries, Aggregates, and Quantifiers (1987) - Dayal, Umeshwar.
https://vldb.org/conf/1987/P197.PDF
Optimization of correlated SQL queries in a relational database management system (1998) - Jou, Michelle M., Ting Yu Leung, and Mir Hamid Pirahesh.
https://patentimages.storage.googleapis.com/3b/24/39/a947424a6eb0ea/US5822750.pdf
Orthogonal Optimization of Subqueries and Aggregation (2001) - Galindo-Legaria, César, and Milind Joshi.
https://dl.acm.org/doi/pdf/10.1145/376284.375748
WinMagic: Subquery Elimination Using Window Aggregation (2003) - Zuzarte, Calisto, et al.
https://dl.acm.org/doi/pdf/10.1145/872757.872840
SQL-like and Quel-like correlation queries with aggregates revisited (1984) - Kiessling, Werner.
http://www2.eecs.berkeley.edu/Pubs/TechRpts/1984/ERL-m-84-75.pdf
Translating SQL into relational algebra: Optimization, semantics, and equivalence of SQL queries (1985) - Ceri, Stefano, and Georg Gottlob.
https://www.academia.edu/download/50687636/tse.1985.23222320161202-29901-8u86ef.pdf
Execution strategies for SQL subqueries (2007) - Elhemali, Mostafa, et al.
https://dl.acm.org/doi/pdf/10.1145/1247480.1247598
Enhanced subquery optimizations in Oracle (2009) - Bellamkonda, Srikanth, et al.
https://dl.acm.org/doi/pdf/10.14778/1687553.1687563
Unnesting Arbitrary Queries (2015) - Neumann, Thomas, and Alfons Kemper.
https://dl.gi.de/bitstream/handle/20.500.12116/2418/383.pdf?sequence=1

3.3 Joins and Ordering

Access paths in the "Abe" statistical query facility (1982) - Klug, Anthony.
https://dl.acm.org/doi/pdf/10.1145/582353.582382
Extending the Algebraic Framework of Query Processing to Handle Outerjoins (1984) - Rosenthal, A., and D. Reiner.
https://www.vldb.org/conf/1984/P334.PDF
On the Correct and Complete Enumeration of the Core Search Space (2013) - Moerkotte, Guido, Pit Fender, and Marius Eich.
https://dl.acm.org/doi/pdf/10.1145/2463676.2465314
How Good Are Query Optimizers, Really? (2015) - Leis, Viktor, et al.
https://dl.acm.org/doi/pdf/10.14778/2850583.2850594
The Complete Story of Joins (2017) - Neumann, Thomas, Viktor Leis, and Alfons Kemper.
https://dl.gi.de/bitstreams/535a5d94-043d-4b1a-9062-fbaf8ed35468/download
Dynamic programming strikes back (2008) - Moerkotte, Guido, and Thomas Neumann.
https://dl.acm.org/doi/pdf/10.1145/1376616.1376672
Improving Join Reorderability with Compensation Operators (2018) - Wang, TaiNing, and Chee-Yong Chan.
https://dl.acm.org/doi/pdf/10.1145/3183713.3183731
Adaptive Optimization of Very Large Join Queries (2018) - Neumann, Thomas, and Bernhard Radke.
https://dl.acm.org/doi/pdf/10.1145/3183713.3183733

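3.4 Cost Model

A database cost model is a way to evaluate how well different execution paths for a query will perform. It estimates the resources (CPU, memory, I/O, and so on) that each operation, such as a scan or a join, will need, and the optimizer compares these estimated costs to pick the cheapest execution plan.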
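As a rough illustration of how an optimizer compares costs, the sketch below prices two access paths for a single-table filter and keeps the cheaper one. The cost weights, the unclustered-index assumption, and the names `Plan`, `seq_scan`, `index_scan`, and `choose_plan` are all hypothetical; real systems use much richer models.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    name: str
    pages_read: float        # estimated page I/Os
    tuples_processed: float  # estimated rows the CPU has to touch


# Hypothetical weights: one page read costs as much as 100 tuple operations.
IO_COST_PER_PAGE = 1.0
CPU_COST_PER_TUPLE = 0.01


def cost(plan):
    """Combine estimated I/O and CPU work into a single comparable number."""
    return (plan.pages_read * IO_COST_PER_PAGE
            + plan.tuples_processed * CPU_COST_PER_TUPLE)


def seq_scan(table_pages, table_rows):
    """Read every page and evaluate the predicate on every row."""
    return Plan("seq_scan", table_pages, table_rows)


def index_scan(table_rows, selectivity, index_height=3):
    """Assume an unclustered index: descend the tree, then one page fetch per match."""
    matching = table_rows * selectivity
    return Plan("index_scan", index_height + matching, matching)


def choose_plan(table_pages, table_rows, selectivity):
    """Pick the candidate with the lowest estimated cost."""
    candidates = [seq_scan(table_pages, table_rows),
                  index_scan(table_rows, selectivity)]
    return min(candidates, key=cost)


# A 10,000-page table with 1,000,000 rows.
print(choose_plan(10_000, 1_000_000, selectivity=0.001).name)  # selective predicate
print(choose_plan(10_000, 1_000_000, selectivity=0.2).name)    # unselective predicate
```

With these made-up weights the selective predicate favors the index and the unselective one favors the full scan. The point is only that the choice depends on the estimated cost, which in turn depends on the selectivity estimate.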

Modelling Costs for a MM-DBMS (1996) - Listgarten, Sherry, and Marie-Anne Neimat.
https://www.semanticscholar.org/paper/Modelling-Costs-for-a-MM-DBMS-Listgarten-Neimat/42b88445cfb28fbe4b6539c97674a8fa9815e635
Approximation Schemes for Many-Objective Query Optimization (2014) - Trummer, Immanuel, and Christoph Koch.
https://dl.acm.org/doi/pdf/10.1145/2588555.2610527
Multi-Objective Parametric Query Optimization (2015) - Trummer, Immanuel, and Christoph Koch.
https://dl.acm.org/doi/pdf/10.1145/3068612
SEEKing the truth about ad hoc join costs (1997) - Haas, Laura M., et al.
https://minds.wisconsin.edu/bitstream/handle/1793/59726/TR1148.pdf?sequence=11

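3.5 Statistics

In a database, statistics are summary information about the distribution and characteristics of the data, such as the number of rows in a table, the number of distinct values in a column, and how the values are distributed. The optimizer uses these statistics to choose the best execution plan and so improve query performance.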
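One common form of distribution statistic is a histogram. The sketch below builds a simple equi-width histogram over a numeric column and uses it to estimate the selectivity of a `col < x` predicate, assuming values are spread uniformly inside each bucket and that the column holds at least two different values. The function names are hypothetical, and production systems often prefer equi-depth or more refined histograms, as several of the papers below discuss.

```python
def build_histogram(values, num_buckets):
    """Equi-width histogram: split [min, max] into equal ranges and count rows in each."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets  # assumes hi > lo
    counts = [0] * num_buckets
    for v in values:
        i = min(int((v - lo) / width), num_buckets - 1)  # clamp max value into last bucket
        counts[i] += 1
    return {"lo": lo, "width": width, "counts": counts, "rows": len(values)}


def estimate_less_than(hist, x):
    """Estimated fraction of rows with value < x, assuming uniformity within a bucket."""
    est_rows = 0.0
    for i, count in enumerate(hist["counts"]):
        b_lo = hist["lo"] + i * hist["width"]
        b_hi = b_lo + hist["width"]
        if x >= b_hi:
            est_rows += count  # bucket lies entirely below x
        elif x > b_lo:
            est_rows += count * (x - b_lo) / hist["width"]  # partial bucket
    return est_rows / hist["rows"]


# Hypothetical column holding the integers 0..999.
column = list(range(1000))
hist = build_histogram(column, num_buckets=10)
selectivity = estimate_less_than(hist, 250)
print(f"estimated selectivity of col < 250: {selectivity:.3f}")
print(f"estimated matching rows: {selectivity * hist['rows']:.0f}")
```

The optimizer never needs the raw column here, only ten counters and two numbers, which is what makes statistics cheap enough to consult for every query.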

Accurate Estimation of the Number of Tuples Satisfying a Condition (1984) - Piatetsky-Shapiro, Gregory, and Charles Connell.
https://dl.acm.org/doi/pdf/10.1145/971697.602294
Optimal Histograms for Limiting Worst-Case Error Propagation in the Size of Join Results (1993) - Ioannidis, Yannis E., and Stavros Christodoulakis.
https://dl.acm.org/doi/pdf/10.1145/169725.169708
Universality of Serial Histograms (1993) - Ioannidis, Yannis E.
https://vldb.org/conf/1993/P256.PDF
Balancing Histogram Optimality and Practicality for Query Result Size Estimation (1995) - Ioannidis, Yannis E., and Viswanath Poosala.
https://dl.acm.org/doi/pdf/10.1145/568271.223841
Improved Histograms for Selectivity Estimation of Range Predicates (1996) - Poosala, Viswanath, et al.
https://dl.acm.org/doi/pdf/10.1145/235968.233342
The History of Histograms (2003) - Ioannidis, Yannis.
http://www.vldb.org/conf/2003/papers/S02P01.pdf
Automated Statistics Collection in DB2 UDB (2004) - Aboulnaga, Ashraf, et al.
http://www.vldb.org/conf/2004/IND5P3.PDF
Adaptive Query Processing in the Looking Glass (2005) - Babu, Shivnath, and Pedro Bizarro.
https://eden.dei.uc.pt/~bizarro/papers/cidr2005_aqp.pdf
Optimizer plan change management: improved stability and performance in Oracle 11g (2008) - Ziauddin, Mohamed, et al.
https://dl.acm.org/doi/pdf/10.14778/1454159.1454175
Histograms Reloaded: The Merits of Bucket Diversity (2010) - Kanne, Carl-Christian, and Guido Moerkotte.
https://dl.acm.org/doi/pdf/10.1145/1807167.1807239
Adaptive Statistics in Oracle 12c (2017) - Chakkappen, Sunil, et al.
https://dl.acm.org/doi/pdf/10.14778/3137765.3137785
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches (2011) - Cormode, Graham, et al.
https://www.nowpublishers.com/article/DownloadSummary/DBS-004

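3.6 Probabilistic Counting

In a database, probabilistic counting is a way to estimate the number of distinct values in a large data set. It uses probability and hashing to cut storage requirements and computational cost, so the number of unique values can be estimated quickly without an exact deduplication of all the data. The estimate carries some error, but the method handles large data sets with a comparatively small memory footprint, which is why it is widely used in databases.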
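As one simple instance of the hashing idea, the sketch below uses linear counting: each value is hashed to one position in a fixed-size bitmap, and the distinct count is estimated from the fraction of bits still unset as n ≈ -m · ln(zero_bits / m), where m is the bitmap size. This particular estimator, the SHA-1-based hash, the bitmap size, and the function name `linear_count` are choices made for illustration; the papers below cover other estimators such as sampling-based methods and sketches.

```python
import hashlib
import math


def linear_count(values, num_bits=1 << 16):
    """Estimate the number of distinct values with a fixed-size bitmap (linear counting)."""
    bitmap = bytearray(num_bits // 8)
    for v in values:
        digest = hashlib.sha1(str(v).encode()).digest()
        pos = int.from_bytes(digest[:8], "big") % num_bits
        bitmap[pos >> 3] |= 1 << (pos & 7)  # duplicates always set the same bit
    set_bits = sum(bin(byte).count("1") for byte in bitmap)
    zero_bits = num_bits - set_bits
    if zero_bits == 0:
        raise ValueError("bitmap saturated; use a larger num_bits")
    return -num_bits * math.log(zero_bits / num_bits)


# Hypothetical column: 100,000 rows containing 20,000 distinct values.
rows = [i % 20_000 for i in range(100_000)]
estimate = linear_count(rows)
print(f"exact distinct: {len(set(rows))}, estimated: {estimate:.0f}")
```

The bitmap here takes 8 KB regardless of how many rows are processed, whereas an exact count has to remember every distinct value it has seen. The trade-off is a small estimation error that grows as the bitmap fills up.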

Towards Estimation Error Guarantees for Distinct Values (2000) - Charikar, Moses, et al.
https://dl.acm.org/doi/pdf/10.1145/335168.335230
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports (2001) - Gibbons, Phillip B.
http://www.vldb.org/conf/2001/P541.pdf
LEO – DB2’s LEarning Optimizer (2001) - Stillger, Michael, et al.
http://www.vldb.org/conf/2001/P019.pdf
An Improved Data Stream Summary: The Count-Min Sketch and its Applications (2005) - Cormode, Graham, and Shan Muthukrishnan. Journal of Algorithms.
http://twiki.di.uniroma1.it/pub/Ing_algo/WebHome/p14_Cormode_JAl_05.pdf
New Estimation Algorithms for Streaming Data: Count-min Can Do More (2007) - Deng, Fan, and Davood Rafiei.
https://www.academia.edu/download/31052190/cmm.pdf
Pessimistic Cardinality Estimation: Tighter Upper Bounds for Intermediate Join Cardinalities (2019) - Cai, Walter, Magdalena Balazinska, and Dan Suciu.
https://dl.acm.org/doi/pdf/10.1145/3299869.3319894
Deep Unsupervised Cardinality Estimation (2019) - Yang, Zongheng, et al.
https://arxiv.org/pdf/1905.04278
NeuroCard: One Cardinality Estimator for All Tables (2020) - Yang, Zongheng, et al.
https://arxiv.org/pdf/2006.08109

3.7 Execution Engine

Query Evaluation Techniques for Large Databases (1993) - Graefe, Goetz.
https://dl.acm.org/doi/pdf/10.1145/152610.152611
Volcano - An Extensible and Parallel Query Evaluation System (1994) - Graefe, Goetz.
https://15721.courses.cs.cmu.edu/spring2016/papers/graefe-ieee1994.pdf
Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited (2013) - Balkesen, Cagri, et al.
https://dl.acm.org/doi/pdf/10.14778/2732219.2732227
Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age (2014) - Leis, Viktor, et al.
https://dl.acm.org/doi/pdf/10.1145/2588555.2610507
MonetDB/X100: Hyper-Pipelining Query Execution (2005) - Boncz, Peter A., Marcin Zukowski, and Niels Nes.
https://www.researchgate.net/profile/Niels-Nes/publication/45338800_MonetDBX100_Hyper-Pipelining_Query_Execution/links/0deec520cd1e8a3607000000/MonetDB-X100-Hyper-Pipelining-Query-Execution.pdf
Relaxed Operator Fusion for In-Memory Databases: Making Compilation, Vectorization, and Prefetching Work Together At Last (2017) - Menon, Prashanth, Todd C. Mowry, and Andrew Pavlo.
https://dl.acm.org/doi/pdf/10.14778/3151113.3151114
Looking Ahead Makes Query Plans Robust (2017) - Zhu, Jianqiao, et al.
https://dl.acm.org/doi/pdf/10.14778/3090163.3090167
Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask (2018) - Kersten, Timo, et al.
https://dl.acm.org/doi/pdf/10.14778/3275366.3284966
SuRF: Practical Range Query Filtering with Fast Succinct Tries (2018) - Zhang, Huanchen, et al.
https://dl.acm.org/doi/pdf/10.1145/3183713.3196931
Adaptive Execution of Compiled Queries (2018) - Kohn, André, Viktor Leis, and Thomas Neumann.
https://15721.courses.cs.cmu.edu/spring2019/papers/19-compilation/kohn-icde2018.pdf

3.8 MPP Optimizer

DB2 Parallel Edition (1995) - Baru, Chaitanya K., et al.
https://grape.ics.uci.edu/wiki/asterix/raw-attachment/wiki/cs295-2009-fall/ParallelDB2.pdf
Parallel SQL execution in Oracle 10g (2004) - Cruanes, Thierry, Benoit Dageville, and Bhaskar Ghosh.
https://dl.acm.org/doi/pdf/10.1145/1007568.1007666
Query Optimization in Microsoft SQL Server PDW (2012) - Shankar, Srinath, et al.
https://dl.acm.org/doi/pdf/10.1145/2213836.2213953
Adaptive and big data scale parallel execution in Oracle (2013) - Bellamkonda, Srikanth, et al.
https://dl.acm.org/doi/pdf/10.14778/2536222.2536235
Optimizing Queries over Partitioned Tables in MPP Systems (2014) - Antova, Lyublena, et al.
https://dl.acm.org/doi/pdf/10.1145/2588555.2595640

04 Storage Engine

4.1 Storage Structures

The Ubiquitous B-Tree (1979) - Comer, Douglas.
https://dl.acm.org/doi/pdf/10.1145/356770.356776
The 5 Minute Rule for Trading Memory for Disc Accesses and the 5 Byte Rule for Trading Memory for CPU Time (1987) - Gray, Jim, and Franco Putzolu.
https://dl.acm.org/doi/pdf/10.1145/38713.38755
The Log-Structured Merge-Tree (LSM-Tree) (1996) - O’Neil, Patrick, et al.
https://www.inf.ufpr.br/eduardo/ensino/ci763/papers/lsmtree.pdf
The five-minute rule ten years later, and other computer storage rules of thumb (1997) - Gray, Jim, and Goetz Graefe.
https://dl.acm.org/doi/pdf/10.1145/271074.271094
The Five Minute Rule 20 Years Later and How Flash Memory Changes the Rules (2008) - Graefe, Goetz.
https://dl.acm.org/doi/pdf/10.1145/1363189.1363198
A Comparison of Fractal Trees to Log-Structured Merge (LSM) Trees (2014) - Kuszmaul, Bradley C.
http://www.pandademo.com/wp-content/uploads/2017/12/A-Comparison-of-Fractal-Trees-to-Log-Structured-Merge-LSM-Trees.pdf
Design Tradeoffs of Data Access Methods (2016) - Athanassoulis, Manos, and Stratos Idreos.
https://dl.acm.org/doi/pdf/10.1145/2882903.2912569
Designing Access Methods: The RUM Conjecture (2016) - Athanassoulis, Manos, et al.
https://stratos.seas.harvard.edu/sites/scholar.harvard.edu/files/stratos/files/rum.pdf
The five minute rule thirty years later and its impact on the storage hierarchy (2017) - Appuswamy, Raja, et al.
https://infoscience.epfl.ch/record/230398/files/adms-talk.pdf
WiscKey: Separating Keys from Values in SSD-conscious Storage (2017) - Lu, Lanyue, et al.
https://dl.acm.org/doi/pdf/10.1145/3033273
Managing Non-Volatile Memory in Database Systems (2018) - van Renen, Alexander, et al.
https://dl.acm.org/doi/pdf/10.1145/3183713.3196897
LeanStore: In-Memory Data Management Beyond Main Memory (2018) - Leis, Viktor, et al.
https://15721.courses.cs.cmu.edu/spring2020/papers/23-largethanmemory/leis-icde2018.pdf
The Case for Learned Index Structures (2018) - Kraska, Tim, et al.
https://dl.acm.org/doi/pdf/10.1145/3183713.3196909
LSM-based Storage Techniques: A Survey (2019) - Luo, Chen, and Michael J. Carey.
https://arxiv.org/pdf/1812.07527
Learning Multi-dimensional Indexes (2019) - Nathan, Vikram, et al.
https://dl.acm.org/doi/pdf/10.1145/3318464.3380579
Umbra: A Disk-Based System with In-Memory Performance (2020) - Neumann, Thomas, and Michael J. Freitag.
https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf
XIndex: A Scalable Learned Index for Multicore Data Storage (2020) - Tang, Chuzhe, et al.
https://dl.acm.org/doi/pdf/10.1145/3332466.3374547
The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds (2020) - Ferragina, Paolo, and Giorgio Vinciguerra.
https://dl.acm.org/doi/pdf/10.14778/3389133.3389135
From WiscKey to Bourbon: A Learned Index for Log-Structured Merge Trees (2020) - Dai, Yifan, et al.
https://www.usenix.org/system/files/osdi20-dai_0.pdf
CaaS-LSM: Compaction-as-a-Service for LSM-based Key-Value Stores in Storage Disaggregated Infrastructure (2024) - Yu, Qiaolin, et al.
https://qiaolin-yu.github.io/pubs/V2mod124-yu.pdf
The Google file system (2003) - Ghemawat, Sanjay, et al.
https://dl.acm.org/doi/10.1145/945445.945450
MapReduce: simplified data processing on large clusters (2008) - Dean, Jeffrey, and Sanjay Ghemawat.
https://dl.acm.org/doi/10.1145/1327452.1327492

4.2 Transactions

The Notions of Consistency and Predicate Locks in a Database System (1976) - Eswaran, Kapali P., et al.
https://dl.acm.org/doi/pdf/10.1145/360363.360369
Concurrency Control in Distributed Database Systems (1981) - Bernstein, Philip A., and Nathan Goodman.
https://dl.acm.org/doi/pdf/10.1145/356842.356846
On Optimistic Methods for Concurrency Control (1981) - Kung, Hsiang-Tsung, and John T. Robinson.
https://dl.acm.org/doi/pdf/10.1145/319566.319567
Principles of transaction-oriented database recovery (1983) - Haerder, Theo, and Andreas Reuter.
https://dl.acm.org/doi/10.1145/289.291
Multiversion Concurrency Control - Theory and Algorithms (1983) - Bernstein, Philip A., and Nathan Goodman.
https://dl.acm.org/doi/pdf/10.1145/319996.319998
ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging (1992) - Mohan C, Haderle D, Lindsay B, et al.
https://dl.acm.org/doi/pdf/10.1145/128765.128770
A Critique of ANSI SQL Isolation Levels (1995) - Berenson, Hal, et al.
https://dl.acm.org/doi/pdf/10.1145/568271.223785
Generalized Isolation Level Definitions (2000) - Adya, Atul, Barbara Liskov, and Patrick O'Neil.
https://pmg.csail.mit.edu/papers/icde00.pdf
Serializable Snapshot Isolation in PostgreSQL (2012) - Ports, Dan RK, and Kevin Grittner.
https://arxiv.org/pdf/1208.4179.pdf
Calvin: Fast Distributed Transactions for Partitioned Database Systems (2012) - Thomson, Alexander, et al.
https://dl.acm.org/doi/pdf/10.1145/2213836.2213838
MaaT: effective and scalable coordination of distributed transactions in the cloud (2014) - Mahmoud, Hatem A., et al.
https://dl.acm.org/doi/pdf/10.14778/2732269.2732270
Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores (2014) - Yu, Xiangyao, et al.
https://dspace.mit.edu/bitstream/handle/1721.1/100022/Devadas_Staring%20into.pdf?sequence=1&isAllowed=y
An Evaluation of the Advantages and Disadvantages of Deterministic Database Systems (2014) - Ren, Kun, Alexander Thomson, and Daniel J. Abadi.
https://dl.acm.org/doi/pdf/10.14778/2732951.2732955
Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems (2015) - Neumann, Thomas, Tobias Mühlbauer, and Alfons Kemper.
https://dl.acm.org/doi/pdf/10.1145/2723372.2749436
An Empirical Evaluation of In-Memory Multi-Version Concurrency Control (2017) - Wu, Yingjun, et al.
https://dl.acm.org/doi/pdf/10.14778/3067421.3067427
An Evaluation of Distributed Concurrency Control (2017) - Harding, Rachael, et al.
https://dl.acm.org/doi/pdf/10.14778/3055540.3055548
Scalable Garbage Collection for In-Memory MVCC Systems (2019) - Böttcher, Jan, et al.
https://dl.acm.org/doi/pdf/10.14778/3364324.3364328

05 Miscellaneous

5.1 Workloads

TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark (2013) - Boncz, Peter, Thomas Neumann, and Orri Erling.
https://dl.acm.org/doi/10.1007/978-3-319-04936-6_5
Quantifying TPC-H Choke Points and Their Optimizations (2020) - Dreseler, Markus, et al.
https://dl.acm.org/doi/pdf/10.14778/3389133.3389138
OceanBase: A 707 Million tpmC Distributed Relational Database (2022) - Yang, Zhenkun, Chuanhui Yang, et al.
https://dl.acm.org/doi/abs/10.14778/3554821.3554830

5.2 Networking

The End of Slow Networks: It's Time for a Redesign (2015) - Binnig, Carsten, et al.
https://arxiv.org/pdf/1504.01048
Accelerating Relational Databases by Leveraging Remote Memory and RDMA (2016) - Li, Feng, et al.
https://dl.acm.org/doi/pdf/10.1145/2882903.2882949
Don't Hold My Data Hostage: A Case for Client Protocol Redesign (2017) - Raasveldt, Mark, and Hannes Mühleisen.
https://dl.acm.org/doi/pdf/10.14778/3115404.3115408

5.3 Performance Diagnosis

Automatic SQL Tuning in Oracle 10g (2004) - Dageville B, Das D, Dias K, et al.
http://www.vldb.org/conf/2004/IND4P2.PDF
Automatic Performance Diagnosis and Tuning in Oracle (2005) - Dias K, Ramacher M, Shaft U, et al.
https://www.cidrdb.org/cidr2005/papers/P07.pdf

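About the Compiler

司马辽太杰 currently works at a state-owned enterprise, mainly responsible for database continuity assurance, performance optimization, and architecture selection and design. More than 10 years of experience in database architecture and administration, with a focus on database operations, architecture, and industry trends; specializes in performance tuning, architecture design, and troubleshooting for common relational, NoSQL, and MPP databases. A native of rural Tonglu, Hangzhou, with a spare-time love of history and football and the occasional leisure read. You are welcome to follow the personal WeChat official account "程序猿读历史", and you can also add me as a contact through the account if needed.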