[Teacher Sima's Database Lecture Series, Part 3] Learning Databases Through Papers 3: A Reading List of Database Industry Papers

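A while ago I published the [Learning Databases Through Papers] series and received fairly good feedback. Some readers asked whether there was a more comprehensive list of database papers. Hence this index of outstanding database papers, which aims to make it easier to read the major works of the leaders and pioneers of the database industry.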

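This article covers 156 outstanding papers from the global database field. By technology, the list spans database fundamentals, system design, SQL engines, and storage engines. By author, it includes papers by Turing Award winners such as Edgar F. Codd, Jim Gray, and Michael Stonebraker; papers from Google, which ushered in the big data era, from Oracle, the dominant commercial database vendor, and from Amazon, the leader in cloud computing; and papers by leading figures at home and abroad, such as TiDB CTO Huang Dongxu (黄东旭), Cao Wei (曹伟), former Alibaba Cloud researcher and head of PolarDB (now founder and CEO of 杭州云猿生数据有限公司; see the link for more information), and Professor Li Guoliang (李国良) of Tsinghua University.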

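Because my own ability is limited and the classic papers in the database field are too numerous to count, this compilation is certainly incomplete. Please forgive any omissions, and feel free to contact me to discuss.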

01 Fundamentals

1.1 Basics

  • A relational model of data for large shared data banks (1970) - Codd, Edgar F.

    https://dl.acm.org/doi/pdf/10.1145/362384.362685

  • SEQUEL: A structured English query language (1974) - Chamberlin, Donald D., and Raymond F. Boyce.

    https://dl.acm.org/doi/pdf/10.1145/800296.811515

  • INGRES: a relational data base system (1975) - Held, G. D., M. R. Stonebraker, and Eugene Wong.

    https://dl.acm.org/doi/pdf/10.1145/1499949.1500029

  • Extending the database relational model to capture more meaning (1979) - Codd, Edgar F.

    https://dl.acm.org/doi/pdf/10.1145/320107.320109

  • Raft: In Search of an Understandable Consensus Algorithm (2014) - Ongaro, Diego, et al.

    https://dl.acm.org/doi/10.5555/2643634.2643666

  • A critique of the SQL database language (1984) - Date, C. J.

    https://dl.acm.org/doi/pdf/10.1145/984549.984551

  • What Goes Around Comes Around... And Around... (2024) - Stonebraker, Michael, and Andrew Pavlo. (A 20-year retrospective on database development; progress runs in cycles)

    https://dl.acm.org/doi/10.1145/3685980.3685984

  • Cloud Programming Simplified: A Berkeley View on Serverless Computing (2019) - Jonas, Eric, et al.

    https://arxiv.org/abs/1902.03383

1.2 Consistency

  • Consistency Tradeoffs in Modern Distributed Database System Design (2012) - Abadi, Daniel.

    https://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf

  • PolarFS: an ultra-low latency and failure resilient distributed file system for shared storage cloud database (2018) - Cao, Wei, et al.

    https://dl.acm.org/doi/pdf/10.14778/3229863.3229872

  • Anna: A kvs for any scale (2018) - Wu, Chenggang, et al.

    https://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-122.pdf

  • Strong and efficient consistency with consistency-aware durability (2021) - Ganesan, Aishwarya, et al.

    https://dl.acm.org/doi/pdf/10.1145/3423138

  • Logical physical clocks and consistent snapshots in globally distributed databases (2014) - Kulkarni S S, Demirbas M, Madappa D, et al.

    https://cse.buffalo.edu/tech-reports/2014-04.pdf

  • Ark: A Real-World Consensus Implementation (2014) - Kasheff, Zardosht, and Leif Walsh.

    https://arxiv.org/pdf/1407.4765

1.3 Consensus

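In databases, consensus is the process by which the nodes of a distributed system agree on a state or result. It typically relies on a distributed protocol or algorithm that keeps all nodes in agreement about the data, especially in the face of failures or network partitions. Consensus is therefore especially important in distributed databases, where it underpins the consistency, correctness, and safety of the data.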
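As a rough illustration of how agreement can survive failures or partitions, many consensus protocols (Paxos and Raft among them) rely on majority quorums: a decision needs acknowledgements from more than half of the nodes, and any two majorities share at least one node. The sketch below shows only that quorum arithmetic, not a full protocol; the function names and node names are hypothetical.

```python
from itertools import combinations


def majority(n):
    """Smallest number of nodes that forms a strict majority of n nodes."""
    return n // 2 + 1


def is_decided(acks, cluster):
    """Treat a proposal as decided once a majority of the cluster has acknowledged it."""
    return len(set(acks) & set(cluster)) >= majority(len(cluster))


def majorities_always_overlap(cluster):
    """Brute-force check that any two majority quorums share at least one node."""
    size = majority(len(cluster))
    quorums = [set(q) for q in combinations(sorted(cluster), size)]
    return all(a & b for a in quorums for b in quorums)


cluster = {"n1", "n2", "n3", "n4", "n5"}  # hypothetical node names

print(majority(len(cluster)))                   # quorum size for five nodes
print(is_decided({"n1", "n2", "n3"}, cluster))  # majority side of a partition
print(is_decided({"n1", "n2"}, cluster))        # minority side cannot decide
print(majorities_always_overlap(cluster))
```

Because two majorities always intersect, a minority cut off by a network partition cannot decide a conflicting value on its own, which is the property the papers below build on.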

  • The Part-Time Parliament (1998) - Lamport, Leslie.

    https://dl.acm.org/doi/pdf/10.1145/3335772.3335939

  • Consensus: Bridging theory and practice (2014) - Ongaro, Diego.

    https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf

  • In search of an understandable consensus algorithm (extended version) (2014) - Ongaro, Diego, and John Ousterhout.

    https://raft.github.io/raft.pdf

  • Distributed consensus revised (2019) - Howard, Heidi.

    https://www.repository.cam.ac.uk/bitstream/handle/1810/291682/thesis.pdf?sequence=1

  • A Generalised Solution to Distributed Consensus (2019) - Howard, Heidi, and Richard Mortier.

    https://arxiv.org/pdf/1902.06776

  • Paxos vs Raft: Have we reached consensus on distributed consensus? (2020) - Howard, Heidi, and Richard Mortier.

    https://dl.acm.org/doi/pdf/10.1145/3380787.3393681


02 Database System Design

2.1 Relational Databases

  • System R: Relational Approach to Database Management (1976) - Astrahan, Morton M., et al.

    https://dl.acm.org/doi/pdf/10.1145/320455.320457

  • The design and implementation of INGRES (1976) - Stonebraker, Michael, et al.

    https://dl.acm.org/doi/10.1145/320473.320476

  • The design of Postgres (1986) - Stonebraker, Michael, and Lawrence A. Rowe.

    https://dl.acm.org/doi/pdf/10.1145/16856.16888

  • Online, Asynchronous Schema Change in F1 (2013) - Rae, Ian, et al.

    https://dl.acm.org/doi/pdf/10.14778/2536222.2536230

  • Amazon Aurora: Design considerations for high throughput cloud-native relational databases (2017) - Verbitski, Alexandre, et al.

    https://dl.acm.org/doi/pdf/10.1145/3035918.3056101

  • Looking Back at Postgres (2019) - Hellerstein, Joseph M.

    https://arxiv.org/pdf/1901.01973

  • CockroachDB: The Resilient Geo-Distributed SQL Database (2020) - Taft, Rebecca, et al.

    https://dl.acm.org/doi/pdf/10.1145/3318464.3386134

  • Query Processing in Main Memory Database Management Systems (1986) - Lehman, Tobin J., and Michael J. Carey.

    https://dl.acm.org/doi/pdf/10.1145/16894.16878

  • Megastore: Providing Scalable, Highly Available Storage for Interactive Services (2011) - Baker, J., C. Bond, J. C. Corbett, et al.

    http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/Megastore.pdf

  • Spanner: Google's globally distributed database (2013) - Corbett, James C., et al.

    https://dl.acm.org/doi/pdf/10.1145/2491245

  • F1 Lightning: HTAP as a Service (2020) - Yang, Jiacheng, et al.

    https://dl.acm.org/doi/pdf/10.14778/3415478.3415553

  • TiDB: a Raft-based HTAP database (2020) - Huang, Dongxu, et al.

    https://dl.acm.org/doi/pdf/10.14778/3415478.3415535

  • PolarDB Serverless: A Cloud Native Database for Disaggregated Data Centers (2021) - Cao, Wei, et al.

    https://dl.acm.org/doi/pdf/10.1145/3448016.3457560

  • HTAP Databases: What is New and What is Next (2022) - Li, Guoliang, et al.

    https://dl.acm.org/doi/abs/10.1145/3514221.3522565

  • Greenplum: A Hybrid Database for Transactional and Analytical Workloads (2021) - Lyu, Zhenghua, et al.

    https://dl.acm.org/doi/10.1145/3448016.3457562


2.2 Non-Relational Databases

  • Bigtable: A Distributed Storage System for Structured Data (2006) - Chang, Fay, et al.

    https://dl.acm.org/doi/pdf/10.1145/1365815.1365816

  • Dynamo: Amazon’s Highly Available Key-value Store (2007) - DeCandia, Giuseppe, et al.

    https://dl.acm.org/doi/pdf/10.1145/1323293.1294281

  • PNUTS: Yahoo!’s Hosted Data Serving Platform (2008) - Cooper, Brian F., et al.

    https://dl.acm.org/doi/pdf/10.14778/1454159.1454167

  • Cassandra - A Decentralized Structured Storage System (2010) - Lakshman, Avinash, and Prashant Malik.

    https://dl.acm.org/doi/pdf/10.1145/1773912.1773922

  • Windows Azure Storage: a highly available cloud storage service with strong consistency (2011) - Calder, Brad, et al.

    https://dl.acm.org/doi/pdf/10.1145/2043556.2043571

  • Azure Data Lake Store: a hyperscale distributed file service for big data analytics (2017) - Ramakrishnan, Raghu, et al.

    https://dl.acm.org/doi/pdf/10.1145/3035918.3056100

  • Spark SQL: Relational Data Processing in Spark (2015) - Armbrust, Michael, et al.

    https://dl.acm.org/doi/10.1145/2723372.2742797

  • HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads (2009) - Abouzeid, Azza, et al.

    https://15799.courses.cs.cmu.edu/fall2013/static/papers/vldb09-861.pdf


03 SQL Engine

3.1 Optimizer

  • Access Path Selection in a Relational Database Management System (1979) - Selinger, P. Griffiths, et al.

    https://dl.acm.org/doi/pdf/10.1145/582095.582099

  • Query Optimization by Simulated Annealing (1987) - Ioannidis, Yannis E., and Eugene Wong.

    https://dl.acm.org/doi/pdf/10.1145/38713.38722

  • The Cascades Framework for Query Optimization (1995) - Graefe, Goetz.

    https://liuyehcf.github.io/resources/paper/The-Cascades-Framework-For-Query-Optimization.pdf

  • An Overview of Query Optimization in Relational Systems (1998) - Chaudhuri, Surajit.

    https://dl.acm.org/doi/pdf/10.1145/275487.275492

  • Robust Query Processing through Progressive Optimization (2004) - Markl, Volker, et al.

    https://dl.acm.org/doi/pdf/10.1145/1007568.1007642

  • Orca: A Modular Query Optimizer Architecture for Big Data (2014) - Soliman, Mohamed A., et al.

    https://dl.acm.org/doi/pdf/10.1145/2588555.2595637

  • Parallelizing Query Optimization on Shared-Nothing Architectures (2015) - Trummer, Immanuel, and Christoph Koch.

    https://arxiv.org/pdf/1511.01768

  • The MemSQL Query Optimizer: A modern optimizer for real-time analytics in a distributed database (2016) - Chen, Jack, et al.

    https://dl.acm.org/doi/pdf/10.14778/3007263.3007277

  • Extensible/Rule Based Query Rewrite Optimization in Starburst (1992) - Pirahesh, Hamid, Joseph M. Hellerstein, and Waqar Hasan.

    https://dl.acm.org/doi/pdf/10.1145/141484.130294

  • The Volcano Optimizer Generator: Extensibility and Efficient Search (1993) - Graefe, Goetz, and William J. McKenna.

    https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/Papers/Volcano-graefe.pdf

  • Processing queries with quantifiers: a horticultural approach (1983) - Dayal, Umeshwar.

    https://dl.acm.org/doi/pdf/10.1145/588058.588075

  • Translating SQL into relational algebra: Optimization, semantics, and equivalence of SQL queries (1985) - Ceri, Stefano, and Georg Gottlob.

    https://www.academia.edu/download/50687636/tse.1985.23222320161202-29901-8u86ef.pdf

  • Parameterized Queries and Nesting Equivalences (2000) - Galindo-Legaria, C. A.

    https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2000-31.pdf

  • Cost-based query transformation in Oracle (2006) - Ahmed, Rafi, et al.

    https://www.researchgate.net/profile/Rafi-Ahmed-2/publication/221311318_Cost-Based_Query_Transformation_in_Oracle/links/572bbc5e08aef7c7e2c6b829/Cost-Based-Query-Transformation-in-Oracle.pdf

  • Grammar-like Functional Rules for Representing Query Optimization Alternatives (1988) - Lohman, Guy M.

    https://dl.acm.org/doi/pdf/10.1145/971701.50204

  • Query Optimization by Predicate Move-Around (1994) - Levy, Alon Y., Inderpal Singh Mumick, and Yehoshua Sagiv.

    https://www.researchgate.net/profile/Inderpal-Mumick/publication/2754592_Query_Optimization_by_Predicate_Move-Around/links/0f317534d437e49755000000/Query-Optimization-by-Predicate-Move-Around.pdf


3.2 Nested Queries

  • On optimizing an SQL-like nested query (1982) - Kim, Won.

    https://dl.acm.org/doi/pdf/10.1145/319732.319745

  • Optimization of nested queries in a distributed relational database (1984) - Lohman, Guy M., et al.

    https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=12fd1fe22687f5944613832de4e64ef902043aec

  • Optimization of nested SQL queries revisited (1987) - Ganski, Richard A., and Harry K. T. Wong.

    https://dl.acm.org/doi/pdf/10.1145/38714.38723

  • A Unified Approach to Processing Queries That Contain Nested Subqueries, Aggregates, and Quantifiers (1987) - Dayal, Umeshwar.

    https://vldb.org/conf/1987/P197.PDF

  • Optimization of correlated SQL queries in a relational database management system (1998) - Jou, Michelle M., Ting Yu Leung, and Mir Hamid Pirahesh.

    https://patentimages.storage.googleapis.com/3b/24/39/a947424a6eb0ea/US5822750.pdf

  • Orthogonal Optimization of Subqueries and Aggregation (2001) - Galindo-Legaria, César, and Milind Joshi.

    https://dl.acm.org/doi/pdf/10.1145/376284.375748

  • WinMagic: Subquery Elimination Using Window Aggregation (2003) - Zuzarte, Calisto, et al.

    https://dl.acm.org/doi/pdf/10.1145/872757.872840

  • SQL-like and Quel-like correlation queries with aggregates revisited (1984) - Kiessling, Werner.

    http://www2.eecs.berkeley.edu/Pubs/TechRpts/1984/ERL-m-84-75.pdf

  • Translating SQL into relational algebra: Optimization, semantics, and equivalence of SQL queries (1985) - Ceri, Stefano, and Georg Gottlob.

    https://www.academia.edu/download/50687636/tse.1985.23222320161202-29901-8u86ef.pdf

  • Execution strategies for SQL subqueries (2007) - Elhemali, Mostafa, et al.

    https://dl.acm.org/doi/pdf/10.1145/1247480.1247598

  • Enhanced subquery optimizations in Oracle (2009) - Bellamkonda, Srikanth, et al.

    https://dl.acm.org/doi/pdf/10.14778/1687553.1687563

  • Unnesting Arbitrary Queries (2015) - Neumann, Thomas, and Alfons Kemper.

    https://dl.gi.de/bitstream/handle/20.500.12116/2418/383.pdf?sequence=1


3.3 Joins and Ordering

  • Access paths in the "Abe" statistical query facility (1982) - Klug, Anthony.

    https://dl.acm.org/doi/pdf/10.1145/582353.582382

  • Extending the Algebraic Framework of Query Processing to Handle Outerjoins (1984) - Rosenthal, A., and D. Reiner.

    https://www.vldb.org/conf/1984/P334.PDF

  • On the Correct and Complete Enumeration of the Core Search Space (2013) - Moerkotte, Guido, Pit Fender, and Marius Eich.

    https://dl.acm.org/doi/pdf/10.1145/2463676.2465314

  • How Good Are Query Optimizers, Really? (2015) - Leis, Viktor, et al.

    https://dl.acm.org/doi/pdf/10.14778/2850583.2850594

  • The Complete Story of Joins (2017) - Neumann, Thomas, Viktor Leis, and Alfons Kemper.

    https://dl.gi.de/bitstreams/535a5d94-043d-4b1a-9062-fbaf8ed35468/download

  • Dynamic programming strikes back (2008) - Moerkotte, Guido, and Thomas Neumann.

    https://dl.acm.org/doi/pdf/10.1145/1376616.1376672

  • Improving Join Reorderability with Compensation Operators (2018) - Wang, TaiNing, and Chee-Yong Chan.

    https://dl.acm.org/doi/pdf/10.1145/3183713.3183731

  • Adaptive Optimization of Very Large Join Queries (2018) - Neumann, Thomas, and Bernhard Radke.

    https://dl.acm.org/doi/pdf/10.1145/3183713.3183733


3.4 Cost Model

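A database cost model is a way to evaluate the performance of different query execution paths. It estimates the resources (CPU, memory, I/O, and so on) that each operation (scan, join, and so on) would consume, which helps the optimizer choose the most efficient execution plan. By comparing these "costs", the optimizer decides which path is the best one to execute.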
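To make the comparison of "costs" concrete, here is a minimal sketch of a toy cost model that weighs a sequential scan against an index scan for a single-table filter. The cost constants, the `TableStats` class, and the deliberately pessimistic index formula (one random page read per matching row) are all illustrative assumptions, not the formulas of any particular database.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    rows: int   # number of rows in the table
    pages: int  # number of disk pages the table occupies


# Hypothetical cost constants, in arbitrary units.
SEQ_PAGE_COST = 1.0     # reading one page sequentially
RANDOM_PAGE_COST = 4.0  # reading one page at a random position
CPU_TUPLE_COST = 0.01   # processing one row


def seq_scan_cost(table):
    """Read every page sequentially and inspect every row."""
    return table.pages * SEQ_PAGE_COST + table.rows * CPU_TUPLE_COST


def index_scan_cost(table, selectivity):
    """Assume one random page read per matching row (pessimistic)."""
    matching_rows = table.rows * selectivity
    return matching_rows * RANDOM_PAGE_COST + matching_rows * CPU_TUPLE_COST


def choose_plan(table, selectivity):
    """Return the name of the cheaper plan together with all estimated costs."""
    costs = {
        "seq_scan": seq_scan_cost(table),
        "index_scan": index_scan_cost(table, selectivity),
    }
    return min(costs, key=costs.get), costs


orders = TableStats(rows=1_000_000, pages=10_000)  # hypothetical table
print(choose_plan(orders, selectivity=0.001))  # highly selective predicate
print(choose_plan(orders, selectivity=0.5))    # predicate matching half the rows
```

Under these assumptions the index wins only when the predicate is selective; as the matching fraction grows, random I/O dominates and the sequential scan becomes cheaper.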

  • Modelling Costs for a MM-DBMS (1996) - Listgarten, Sherry, and Marie-Anne Neimat.

    https://www.semanticscholar.org/paper/Modelling-Costs-for-a-MM-DBMS-Listgarten-Neimat/42b88445cfb28fbe4b6539c97674a8fa9815e635

  • Approximation Schemes for Many-Objective Query Optimization (2014) - Trummer, Immanuel, and Christoph Koch.

    https://dl.acm.org/doi/pdf/10.1145/2588555.2610527

  • Multi-Objective Parametric Query Optimization (2015) - Trummer, Immanuel, and Christoph Koch.

    https://dl.acm.org/doi/pdf/10.1145/3068612

  • SEEKing the truth about ad hoc join costs (1997) - Haas, Laura M., et al.

    https://minds.wisconsin.edu/bitstream/handle/1793/59726/TR1148.pdf?sequence=11


3.5 Statistics

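In databases, statistics are summary information about the distribution and characteristics of the data, such as the number of rows in a table, the number of distinct values in a column, and how the values are distributed. The optimizer uses these statistics to choose the best query execution plan and thereby improve query performance.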
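As a small illustration of how such statistics feed the optimizer, the sketch below collects a row count, a distinct-value count, and an equi-width histogram for one numeric column, then uses them to estimate how many rows an equality or a range predicate would return. The function names and the uniformity assumptions (values evenly spread across distinct values and within each bucket) are simplifications chosen for this example.

```python
def collect_stats(values, buckets=4):
    """Gather row count, distinct count, and an equi-width histogram."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets or 1  # avoid zero width for a constant column
    counts = [0] * buckets
    for v in values:
        # The maximum value is clamped into the last bucket.
        counts[min(int((v - lo) / width), buckets - 1)] += 1
    return {"rows": len(values), "ndv": len(set(values)),
            "lo": lo, "hi": hi, "width": width, "counts": counts}


def estimate_equals(stats):
    """Rows matching `col = ?`, assuming rows are spread evenly over distinct values."""
    return stats["rows"] / stats["ndv"]


def estimate_less_than(stats, x):
    """Rows matching `col < x`, assuming values are uniform inside each bucket."""
    if x <= stats["lo"]:
        return 0.0
    if x > stats["hi"]:
        return float(stats["rows"])
    estimate = 0.0
    for i, count in enumerate(stats["counts"]):
        bucket_lo = stats["lo"] + i * stats["width"]
        bucket_hi = bucket_lo + stats["width"]
        if x >= bucket_hi:
            estimate += count  # the whole bucket qualifies
        elif x > bucket_lo:
            estimate += count * (x - bucket_lo) / stats["width"]  # partial bucket
    return estimate


column = list(range(100))  # hypothetical column holding 0..99
stats = collect_stats(column)
print(stats["counts"])
print(estimate_equals(stats))
print(estimate_less_than(stats, 50))
```

The estimates are only as good as the uniformity assumptions, which is why several of the papers below study better histogram constructions.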

  • Accurate Estimation of the Number of Tuples Satisfying a Condition (1984) - Piatetsky-Shapiro, Gregory, and Charles Connell.

    https://dl.acm.org/doi/pdf/10.1145/971697.602294

  • Optimal Histograms for Limiting Worst-Case Error Propagation in the Size of Join Results (1993) - Ioannidis, Yannis E., and Stavros Christodoulakis.

    https://dl.acm.org/doi/pdf/10.1145/169725.169708

  • Universality of Serial Histograms (1993) - Ioannidis, Yannis E.

    https://vldb.org/conf/1993/P256.PDF

  • Balancing Histogram Optimality and Practicality for Query Result Size Estimation (1995) - Ioannidis, Yannis E., and Viswanath Poosala.

    https://dl.acm.org/doi/pdf/10.1145/568271.223841

  • Improved Histograms for Selectivity Estimation of Range Predicates (1996) - Poosala, Viswanath, et al.

    https://dl.acm.org/doi/pdf/10.1145/235968.233342

  • The History of Histograms (2003) - Ioannidis, Yannis.

    http://www.vldb.org/conf/2003/papers/S02P01.pdf

  • Automated Statistics Collection in DB2 UDB (2004) - Aboulnaga, Ashraf, et al.

    http://www.vldb.org/conf/2004/IND5P3.PDF

  • Adaptive Query Processing in the Looking Glass (2005) - Babu, Shivnath, and Pedro Bizarro.

    https://eden.dei.uc.pt/~bizarro/papers/cidr2005_aqp.pdf

  • Optimizer plan change management: improved stability and performance in Oracle 11g (2008) - Ziauddin, Mohamed, et al.

    https://dl.acm.org/doi/pdf/10.14778/1454159.1454175

  • Histograms Reloaded: The Merits of Bucket Diversity (2010) - Kanne, Carl-Christian, and Guido Moerkotte.

    https://dl.acm.org/doi/pdf/10.1145/1807167.1807239

  • Adaptive Statistics in Oracle 12c (2017) - Chakkappen, Sunil, et al.

    https://dl.acm.org/doi/pdf/10.14778/3137765.3137785

  • Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches (2011) - Cormode, Graham, et al.

    https://www.nowpublishers.com/article/DownloadSummary/DBS-004


3.6 Probabilistic Counting

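In databases, probabilistic counting is a family of methods for estimating the number of distinct values in large data sets. It uses probability and hashing to reduce storage requirements and computational complexity, so the number of distinct values can be estimated quickly, in a single pass or from a sample, without keeping every distinct value in memory. Although this introduces some error, it handles large data sets effectively with a relatively small memory footprint, and it is therefore widely used in databases.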
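One of the simplest members of this family is linear counting: hash each value to a slot in a fixed-size bitmap, then infer the number of distinct values from the fraction of slots that are still empty. The sketch below is a minimal illustration under that approach; the class name, the bitmap size of 4096, and the choice of `hashlib.blake2b` as the hash function are assumptions made for this example, and the papers below describe more refined estimators.

```python
import hashlib
import math


class LinearCounter:
    """Distinct-value estimator that uses a fixed-size bitmap instead of a set."""

    def __init__(self, slots=4096):
        self.slots = slots
        self.bitmap = bytearray(slots)  # one byte per slot, for clarity

    def add(self, item):
        """Hash the item to a slot and mark it; duplicates hit the same slot."""
        digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
        self.bitmap[int.from_bytes(digest, "big") % self.slots] = 1

    def estimate(self):
        """Estimate distinct count as -m * ln(fraction of empty slots)."""
        empty = self.slots - sum(self.bitmap)
        if empty == 0:
            return float("inf")  # bitmap saturated; a larger one is needed
        return -self.slots * math.log(empty / self.slots)


counter = LinearCounter()
for _ in range(3):            # feed every value three times
    for user_id in range(1000):
        counter.add(user_id)

print(round(counter.estimate()))  # close to 1000, despite 3000 inserts
```

The memory used is fixed by the bitmap size regardless of how many rows are fed in, at the price of an approximate answer.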

  • Towards Estimation Error Guarantees for Distinct Values (2000) - Charikar, Moses, et al.

    https://dl.acm.org/doi/pdf/10.1145/335168.335230

  • Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports (2001) - Gibbons, Phillip B.

    http://www.vldb.org/conf/2001/P541.pdf

  • LEO – DB2’s LEarning Optimizer (2001) - Stillger, Michael, et al.

    http://www.vldb.org/conf/2001/P019.pdf

  • An Improved Data Stream Summary: The Count-Min Sketch and its Applications (2005, Journal of Algorithms) - Cormode, Graham, and Shan Muthukrishnan.

    http://twiki.di.uniroma1.it/pub/Ing_algo/WebHome/p14_Cormode_JAl_05.pdf

  • New Estimation Algorithms for Streaming Data: Count-min Can Do More (2007) - Deng, Fan, and Davood Rafiei.

    https://www.academia.edu/download/31052190/cmm.pdf

  • Pessimistic Cardinality Estimation: Tighter Upper Bounds for Intermediate Join Cardinalities (2019) - Cai, Walter, Magdalena Balazinska, and Dan Suciu.

    https://dl.acm.org/doi/pdf/10.1145/3299869.3319894

  • Deep Unsupervised Cardinality Estimation (2019) - Yang, Zongheng, et al.

    https://arxiv.org/pdf/1905.04278

  • NeuroCard: One Cardinality Estimator for All Tables (2020) - Yang, Zongheng, et al.

    https://arxiv.org/pdf/2006.08109


3.7 Execution Engine

  • Query Evaluation Techniques for Large Databases (1993) - Graefe, Goetz.

    https://dl.acm.org/doi/pdf/10.1145/152610.152611

  • Volcano - An Extensible and Parallel Query Evaluation System (1994) - Graefe, Goetz.

    https://15721.courses.cs.cmu.edu/spring2016/papers/graefe-ieee1994.pdf

  • Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited (2013) - Balkesen, Cagri, et al.

    https://dl.acm.org/doi/pdf/10.14778/2732219.2732227

  • Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age (2014) - Leis, Viktor, et al.

    https://dl.acm.org/doi/pdf/10.1145/2588555.2610507

  • MonetDB/X100: Hyper-Pipelining Query Execution (2005) - Boncz, Peter A., Marcin Zukowski, and Niels Nes.

    https://www.researchgate.net/profile/Niels-Nes/publication/45338800_MonetDBX100_Hyper-Pipelining_Query_Execution/links/0deec520cd1e8a3607000000/MonetDB-X100-Hyper-Pipelining-Query-Execution.pdf

  • Relaxed Operator Fusion for In-Memory Databases: Making Compilation, Vectorization, and Prefetching Work Together At Last (2017) - Menon, Prashanth, Todd C. Mowry, and Andrew Pavlo.

    https://dl.acm.org/doi/pdf/10.14778/3151113.3151114

  • Looking Ahead Makes Query Plans Robust (2017) - Zhu, Jianqiao, et al.

    https://dl.acm.org/doi/pdf/10.14778/3090163.3090167

  • Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask (2018) - Kersten, Timo, et al.

    https://dl.acm.org/doi/pdf/10.14778/3275366.3284966

  • SuRF: Practical Range Query Filtering with Fast Succinct Tries (2018) - Zhang, Huanchen, et al.

    https://dl.acm.org/doi/pdf/10.1145/3183713.3196931

  • Adaptive Execution of Compiled Queries (2018) - Kohn, André, Viktor Leis, and Thomas Neumann.

    https://15721.courses.cs.cmu.edu/spring2019/papers/19-compilation/kohn-icde2018.pdf


3.8 MPP Optimizers

  • DB2 Parallel Edition (1995) - Baru, Chaitanya K., et al.

    https://grape.ics.uci.edu/wiki/asterix/raw-attachment/wiki/cs295-2009-fall/ParallelDB2.pdf

  • Parallel SQL execution in Oracle 10g (2004) - Cruanes, Thierry, Benoit Dageville, and Bhaskar Ghosh.

    https://dl.acm.org/doi/pdf/10.1145/1007568.1007666

  • Query Optimization in Microsoft SQL Server PDW (2012) - Shankar, Srinath, et al.

    https://dl.acm.org/doi/pdf/10.1145/2213836.2213953

  • Adaptive and big data scale parallel execution in Oracle (2013) - Bellamkonda, Srikanth, et al.

    https://dl.acm.org/doi/pdf/10.14778/2536222.2536235

  • Optimizing Queries over Partitioned Tables in MPP Systems (2014) - Antova, Lyublena, et al.

    https://dl.acm.org/doi/pdf/10.1145/2588555.2595640


04 Storage Engine

4.1 Storage Structures

  • The Ubiquitous B-Tree (1979) - Comer, Douglas.

    https://dl.acm.org/doi/pdf/10.1145/356770.356776

  • The 5 Minute Rule for Trading Memory for Disc Accesses and the 5 Byte Rule for Trading Memory for CPU Time (1987) - Gray, Jim, and Franco Putzolu.

    https://dl.acm.org/doi/pdf/10.1145/38713.38755

  • The Log-Structured Merge-Tree (LSM-Tree) (1996) - O’Neil, Patrick, et al.

    https://www.inf.ufpr.br/eduardo/ensino/ci763/papers/lsmtree.pdf

  • The five-minute rule ten years later, and other computer storage rules of thumb (1997) - Gray, Jim, and Goetz Graefe.

    https://dl.acm.org/doi/pdf/10.1145/271074.271094

  • The Five Minute Rule 20 Years Later and How Flash Memory Changes the Rules (2008) - Graefe, Goetz.

    https://dl.acm.org/doi/pdf/10.1145/1363189.1363198

  • A Comparison of Fractal Trees to Log-Structured Merge (LSM) Trees (2014) - Kuszmaul, Bradley C.

    http://www.pandademo.com/wp-content/uploads/2017/12/A-Comparison-of-Fractal-Trees-to-Log-Structured-Merge-LSM-Trees.pdf

  • Design Tradeoffs of Data Access Methods (2016) - Athanassoulis, Manos, and Stratos Idreos.

    https://dl.acm.org/doi/pdf/10.1145/2882903.2912569

  • Designing Access Methods: The RUM Conjecture (2016) - Athanassoulis, Manos, et al.

    https://stratos.seas.harvard.edu/sites/scholar.harvard.edu/files/stratos/files/rum.pdf

  • The five minute rule thirty years later and its impact on the storage hierarchy (2017) - Appuswamy, Raja, et al.

    https://infoscience.epfl.ch/record/230398/files/adms-talk.pdf

  • WiscKey: Separating Keys from Values in SSD-conscious Storage (2017) - Lu, Lanyue, et al.

    https://dl.acm.org/doi/pdf/10.1145/3033273

  • Managing Non-Volatile Memory in Database Systems (2018) - van Renen, Alexander, et al.

    https://dl.acm.org/doi/pdf/10.1145/3183713.3196897

  • LeanStore: In-Memory Data Management Beyond Main Memory (2018) - Leis, Viktor, et al.

    https://15721.courses.cs.cmu.edu/spring2020/papers/23-largethanmemory/leis-icde2018.pdf

  • The Case for Learned Index Structures (2018) - Kraska, Tim, et al.

    https://dl.acm.org/doi/pdf/10.1145/3183713.3196909

  • LSM-based Storage Techniques: A Survey (2019) - Luo, Chen, and Michael J. Carey.

    https://arxiv.org/pdf/1812.07527

  • Learning Multi-dimensional Indexes (2019) - Nathan, Vikram, et al.

    https://dl.acm.org/doi/pdf/10.1145/3318464.3380579

  • Umbra: A Disk-Based System with In-Memory Performance (2020) - Neumann, Thomas, and Michael J. Freitag.

    https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf

  • XIndex: A Scalable Learned Index for Multicore Data Storage (2020) - Tang, Chuzhe, et al.

    https://dl.acm.org/doi/pdf/10.1145/3332466.3374547

  • The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds (2020) - Ferragina, Paolo, and Giorgio Vinciguerra.

    https://dl.acm.org/doi/pdf/10.14778/3389133.3389135

  • From WiscKey to Bourbon: A Learned Index for Log-Structured Merge Trees (2020) - Dai, Yifan, et al.

    https://www.usenix.org/system/files/osdi20-dai_0.pdf

  • CaaS-LSM: Compaction-as-a-Service for LSM-based Key-Value Stores in Storage Disaggregated Infrastructure (2024) - Yu, Qiaolin, et al.

    https://qiaolin-yu.github.io/pubs/V2mod124-yu.pdf

  • The Google file system (2003) - Ghemawat, Sanjay, et al.

    https://dl.acm.org/doi/10.1145/945445.945450

  • MapReduce: simplified data processing on large clusters (2008) - Dean, Jeffrey, et al.

    https://dl.acm.org/doi/10.1145/1327452.1327492


4.2 Transactions

  • The Notions of Consistency and Predicate Locks in a Database System (1976) - Eswaran, Kapali P., et al.

    https://dl.acm.org/doi/pdf/10.1145/360363.360369

  • Concurrency Control in Distributed Database Systems (1981) - Bernstein, Philip A., and Nathan Goodman.

    https://dl.acm.org/doi/pdf/10.1145/356842.356846

  • On Optimistic Methods for Concurrency Control (1981) - Kung, Hsiang-Tsung, and John T. Robinson.

    https://dl.acm.org/doi/pdf/10.1145/319566.319567

  • Principles of transaction-oriented database recovery (1983) - Haerder, Theo, and Andreas Reuter.

    https://dl.acm.org/doi/10.1145/289.291

  • Multiversion Concurrency Control - Theory and Algorithms (1983) - Bernstein, Philip A., and Nathan Goodman.

    https://dl.acm.org/doi/pdf/10.1145/319996.319998

  • ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging (1992) - Mohan, C., D. Haderle, B. Lindsay, et al.

    https://dl.acm.org/doi/pdf/10.1145/128765.128770

  • A Critique of ANSI SQL Isolation Levels (1995) - Berenson, Hal, et al.

    https://dl.acm.org/doi/pdf/10.1145/568271.223785

  • Generalized Isolation Level Definitions (2000) - Adya, Atul, Barbara Liskov, and Patrick O'Neil.

    https://pmg.csail.mit.edu/papers/icde00.pdf

  • Serializable Snapshot Isolation in PostgreSQL (2012) - Ports, Dan R. K., and Kevin Grittner.

    https://arxiv.org/pdf/1208.4179.pdf

  • Calvin: Fast Distributed Transactions for Partitioned Database Systems (2012) - Thomson, Alexander, et al.

    https://dl.acm.org/doi/pdf/10.1145/2213836.2213838

  • MaaT: effective and scalable coordination of distributed transactions in the cloud (2014) - Mahmoud, Hatem A., et al.

    https://dl.acm.org/doi/pdf/10.14778/2732269.2732270

  • Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores (2014) - Yu, Xiangyao, et al.

    https://dspace.mit.edu/bitstream/handle/1721.1/100022/Devadas_Staring%20into.pdf?sequence=1&isAllowed=y

  • An Evaluation of the Advantages and Disadvantages of Deterministic Database Systems (2014) - Ren, Kun, Alexander Thomson, and Daniel J. Abadi.

    https://dl.acm.org/doi/pdf/10.14778/2732951.2732955

  • Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems (2015) - Neumann, Thomas, Tobias Mühlbauer, and Alfons Kemper.

    https://dl.acm.org/doi/pdf/10.1145/2723372.2749436

  • An Empirical Evaluation of In-Memory Multi-Version Concurrency Control (2017) - Wu, Yingjun, et al.

    https://dl.acm.org/doi/pdf/10.14778/3067421.3067427

  • An Evaluation of Distributed Concurrency Control (2017) - Harding, Rachael, et al.

    https://dl.acm.org/doi/pdf/10.14778/3055540.3055548

  • Scalable Garbage Collection for In-Memory MVCC Systems (2019) - Böttcher, Jan, et al.

    https://dl.acm.org/doi/pdf/10.14778/3364324.3364328


05 Other

5.1 Workloads

  • TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark (2013) - Boncz, Peter, Thomas Neumann, and Orri Erling.

    https://dl.acm.org/doi/10.1007/978-3-319-04936-6_5

  • Quantifying TPCH Choke Points and Their Optimizations (2020) - Dreseler, Markus, et al.

    https://dl.acm.org/doi/pdf/10.14778/3389133.3389138

  • OceanBase: A 707 Million tpmC Distributed Relational Database (2022) - Yang, Zhenkun, et al.

    https://dl.acm.org/doi/abs/10.14778/3554821.3554830


5.2 Networking

  • The End of Slow Networks: It's Time for a Redesign (2015) - Binnig, Carsten, et al.

    https://arxiv.org/pdf/1504.01048

  • Accelerating Relational Databases by Leveraging Remote Memory and RDMA (2016) - Li, Feng, et al.

    https://dl.acm.org/doi/pdf/10.1145/2882903.2882949

  • Don't Hold My Data Hostage: A Case for Client Protocol Redesign (2017) - Raasveldt, Mark, and Hannes Mühleisen.

    https://dl.acm.org/doi/pdf/10.14778/3115404.3115408

5.3 Performance Diagnosis

  • Automatic SQL Tuning in Oracle 10g (2004) - Dageville, B., D. Das, K. Dias, et al.

    http://www.vldb.org/conf/2004/IND4P2.PDF

  • Automatic Performance Diagnosis and Tuning in Oracle (2005) - Dias, K., M. Ramacher, U. Shaft, et al.

    https://www.cidrdb.org/cidr2005/papers/P07.pdf

About the Compiler

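Sima Liaotaijie (司马辽太杰) currently works at a state-owned enterprise, where he is mainly responsible for database continuity assurance, performance optimization, and architecture selection and design. He has more than 10 years of experience in database architecture and administration, focuses on database operations, architecture, and industry trends, and specializes in performance tuning, architecture design, and troubleshooting for common relational, NoSQL, and MPP databases. A native of rural Tonglu, Hangzhou, he enjoys history and football in his spare time and occasionally reads for leisure. You are welcome to follow my WeChat public account “程序猿读历史”, where you can also add me as a contact if needed.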

This article is reposted from 锋哥聊DORIS数仓.