
原文 https://shekhargulati.com/2022/07/08/my-notes-on-gitlabs-postgres-schema-design/ 这篇文章写于 2022 年,前一年 GitLab 刚好完成 IPO。目前 GitLab 市值超过 100 亿美金,它的所有收入都来源于同名产品 GitLab,而这篇文章就是全面分析 GitLab 这个产品的数据库 schema。
我花了一些时间研究 GitLab 的 Postgres schema。GitLab 是 Github 的一个替代品。你可以自部署 GitLab,因为它是一个开源的 DevOps 平台。
我之所以要了解 Gitlab 这样的大项目的 schema,是为了与我正在设计的 schema 进行比较,并从他们的 schema 定义中学到一些最佳实践。我确实从中受益良多。
为表使用正确的主键类型
create table exp_bs(id serial primary key, n bigint not null)
insert into exp_bs(n) select g.n from generate_series(1,10000000) as g(n)
alter table exp_bs alter column id TYPE bigint;ALTER TABLETime: 10845.062 ms (00:10.845)
您还必须更改序列以改变其类型。这种操作很快。
alter sequence exp_bs_id_seq as bigint;ALTER SEQUENCETime: 4.505 ms
CREATE SEQUENCE audit_events_id_seqSTART WITH 1INCREMENT BY 1NO MINVALUENO MAXVALUECACHE 1;

CREATE TABLE issues (id integer NOT NULL,title character varying,project_id integer,iid integer,……)
https://gitlab.com/shekhargulati123/sg/-/issues/1https://gitlab.com/shekhargulati123/sg2/-/issues/1
CREATE UNIQUE INDEX index_issues_on_project_id_and_iid ON public.issues USING btree (project_id, iid);


CREATE TABLE audit_events (id bigint NOT NULL,author_id integer NOT NULL,entity_id integer NOT NULL,entity_type character varying NOT NULL,details text,ip_address inet,author_name text,entity_path text,target_details text,created_at timestamp without time zone NOT NULL,target_type text,target_id bigint,CONSTRAINT check_492aaa021d CHECK ((char_length(entity_path) <= 5500)),CONSTRAINT check_83ff8406e2 CHECK ((char_length(author_name) <= 255)),CONSTRAINT check_97a8c868e7 CHECK ((char_length(target_type) <= 255)),CONSTRAINT check_d493ec90b5 CHECK ((char_length(target_details) <= 5500)))PARTITION BY RANGE (created_at);
create table cv_exp (id bigint primary key, s varchar(200) default gen_random_uuid() not null);create index sidx on cv_exp (s);
insert into cv_exp(id) select g.n from generate_series(1,10000000) as g(n);
alter table cv_exp alter column s type varchar(300);ALTER TABLETime: 37.460 ms
alter table cv_exp alter column s type varchar(100);ALTER TABLETime: 35886.638 ms (00:35.887)
create table text_exp (id bigint primary key, s text default gen_random_uuid() not null,CONSTRAINT check_15e644d856 CHECK ((char_length(s) <= 200)));
insert into text_exp(id) select g.n from generate_series(1,10000000) as g(n);
alter table text_exp drop constraint check_15e644d856;
alter table text_exp add constraint check_15e644d856 CHECK ((char_length(s) <= 100));ALTER TABLETime: 1870.250 ms (00:01.870)
CREATE TABLE project_custom_attributes (id integer NOT NULL,created_at timestamp with time zone NOT NULL,updated_at timestamp with time zone NOT NULL,project_id integer NOT NULL,key character varying NOT NULL,value character varying NOT NULL);

merge_request_assignees
merge_request_blocks
merge_request_cleanup_schedules
merge_request_context_commit_diff_files
merge_request_context_commits
等等...
功能开关,如 create_issue、send_email、packages_enabled、merge_requests_rebase_enabled 等。
实体状态,如 deployed、onboarding_complete、archived、hidden等。
限定词,以 is_xxx 或 has_xxx 开头,如 is_active、is_sample、has_confluence 等。我认为这些可以通过上述两种方式来表达。
- 索引的命名遵循 index_#{table_name}_on_#{column_1}_and_#{column_2}_#{condition} ,例如 index_services_on_type_and_id_and_template_when_active、index_projects_on_id_service_desk_enabled。
带时区和不带时区的时间戳CREATE TABLE issues (id integer NOT NULL,title character varying,created_at timestamp without time zone,updated_at timestamp without time zone,closed_at timestamp with time zone,closed_by_id integer,);
CREATE TABLE merge_request_metrics (id integer NOT NULL,latest_build_started_at timestamp without time zone,latest_build_finished_at timestamp without time zone,merged_at timestamp without time zone,created_at timestamp without time zone NOT NULL,updated_at timestamp without time zone NOT NULL,latest_closed_at timestamp with time zone,first_comment_at timestamp with time zone,first_commit_at timestamp with time zone,last_commit_at timestamp with time zone,first_approved_at timestamp with time zone,first_reassigned_at timestamp with time zone);
外键约束在线 DDL schema 迁移操作中,特别是在 MySQL 中,外键约束的兼容性不好。
一旦将数据分片至多个数据库服务器,维护外键约束变得困难。
这些表在本质上是不可变的。一旦条目被写入,你就不想再更改它们了
这些表的行数可达数百万(或更多),因此即使是很小的性能损失也会造成很大影响
ALTER TABLE ONLY todosADD CONSTRAINT fk_rails_a27c483435 FOREIGN KEY (group_id) REFERENCES namespaces(id) ON DELETE CASCADE;ALTER TABLE ONLY projectsADD CONSTRAINT fk_projects_namespace_id FOREIGN KEY (namespace_id) REFERENCES namespaces(id) ON DELETE RESTRICT;ALTER TABLE ONLY authentication_eventsADD CONSTRAINT fk_rails_b204656a54 FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE SET NULL;
ERROR: update or delete on table "a" violates foreign key constraint "fk_a_id" on table "b"DETAIL: Key (id)=(1) is still referenced from table "b".

按范围分区(PARTITION BY RANGE):这种分区方法根据选定的范围来分区表格数据。这种策略通常适用于需要对时间序列数据进行分区的场景。例如,audit_events(审计事件)和 web_hook_logs(Web 钩子日志)表就采用了这种分区策略。 按列表分区(PARTITION BY LIST):这种分区方法根据某一列的离散值来对表格数据进行分区。loose_foreign_keys_deleted_records(松散外键删除记录)表就是采用这种分区策略的例子。 按哈希分区(PARTITION BY HASH):这种分区方法通过为每个分区指定一个模数和余数来进行。每个分区将包含那些其分区键的哈希值除以指定模数后得到指定余数的行。product_analytics_events_experimental(产品分析事件实验)表采用了这种分区策略。

CREATE INDEX index_issues_on_title_trigram ON issues USING gin (title gin_trgm_ops);CREATE INDEX index_issues_on_description_trigram ON issues USING gin (description gin_trgm_ops);
create table words(id serial primary key, word text not null);
\copy words(word) from '/Users/xxx/Aword.csv' CSV;
select count(*) from words;count-------11616(1 row)
create index id1 on words using btree (word);
EXPLAIN select * from words where word like '%bul%';QUERY PLAN-----------------------------------------------------------Seq Scan on words (cost=0.00..211.20 rows=1 width=14)Filter: (word ~~ '%bul%'::text)(2 rows)
drop index id1;
CREATE EXTENSION pg_trgm;
create index index_words_on_word_trigram ON words USING gin (word gin_trgm_ops);
EXPLAIN select count(*) from words where word like '%bul%';QUERY PLAN----------------------------------------------------------------------------------------------------Aggregate (cost=16.02..16.03 rows=1 width=8)-> Bitmap Heap Scan on words (cost=12.01..16.02 rows=1 width=0)Recheck Cond: (word ~~ '%bul%'::text)-> Bitmap Index Scan on index_words_on_word_trigram (cost=0.00..12.01 rows=1 width=0)Index Cond: (word ~~ '%bul%'::text)(5 rows)
实时索引。创建索引无延迟
访问完整数据
降低架构的复杂性

转储稍后处理的请求数据
支持附加字段
一对多关系,其中多方都没有自己的标识
Key Value
更简单的 EAV (Entity-Attribute-Value) 设计
CREATE TABLE error_tracking_error_events (id bigint NOT NULL,payload jsonb DEFAULT '{}'::jsonb NOT NULL,...);
另一个例子是下图中的 operations_strategies 表。您不知道可能会收到多少个参数,因此需要像 jsonb 这样灵活的数据类型。
CREATE TABLE operations_strategies (id bigint NOT NULL,feature_flag_id bigint NOT NULL,name character varying(255) NOT NULL,parameters jsonb DEFAULT '{}'::jsonb NOT NULL);
支持附加字段用例如下所示。
CREATE TABLE packages_debian_file_metadata ( created_at timestamp with time zone NOT NULL, updated_at timestamp with time zone NOT NULL, package_file_id bigint NOT NULL, file_type smallint NOT NULL, component text, architecture text,fields jsonb,);
他们还使用 jsonb 来存储已经是 JSON 格式的数据。例如,在表 vulnerability_finding_evidences 中,报告数据已经是 JSON 格式,因此他们将其保存为 jsonb 数据类型。
CREATE TABLE vulnerability_finding_evidences (id bigint NOT NULL,created_at timestamp with time zone NOT NULL,updated_at timestamp with time zone NOT NULL,vulnerability_occurrence_id bigint NOT NULL,data jsonb DEFAULT '{}'::jsonb NOT NULL);

其他花絮
CREATE TABLE issues (id integer NOT NULL,title character varying,author_id integer,project_id integer,created_at timestamp without time zone,updated_at timestamp without time zone,// removed remaining columns and constraints);
CREATE TABLE audit_events (id bigint NOT NULL,author_id integer NOT NULL,entity_id integer NOT NULL,created_at timestamp without time zone NOT NULL,removed remaining columns and constraints);
CREATE TABLE merge_requests_compliance_violations (id bigint NOT NULL,violating_user_id bigint NOT NULL,merge_request_id bigint NOT NULL,reason smallint NOT NULL,severity_level smallint DEFAULT 0 NOT NULL);
CREATE TABLE ci_builds (status character varying,finished_at timestamp without time zone,trace text,lock_version integer DEFAULT 0,removed columnsCONSTRAINT check_1e2fbd1b39 CHECK ((lock_version IS NOT NULL)));
CREATE TABLE audit_events (id bigint NOT NULL,ip_address inet,……);
create table e (id serial primary key, ip_addr inet not null);
insert into e(ip_addr) values ('192.168.1.255');
insert into e(ip_addr) values ('192.168.1.5/24');
select id, abbrev(ip_addr) from e;id | abbrev----+----------------1 | 192.168.1.2552 | 8.8.8.83 | 192.168.1.5/24(3 rows)
insert into e(ip_addr) values ('192.168.1');ERROR: invalid input syntax for type inet: "192.168.1"LINE 1: insert into e(ip_addr) values ('192.168.1');
select * from e where ip_addr << inet '192.168.1.1/24';id | ip_addr----+---------------1 | 192.168.1.255(1 row)
select * from e where ip_addr <<= inet '192.168.1.1/24';id | ip_addr----+----------------1 | 192.168.1.2553 | 192.168.1.5/24(2 rows)
CREATE TABLE alert_management_alert_user_mentions ( id bigint NOT NULL, alert_management_alert_id bigint NOT NULL, note_id bigint, mentioned_users_ids integer[], mentioned_projects_ids integer[], mentioned_groups_ids integer[]);
CREATE TABLE dast_site_profiles (id bigint NOT NULL,excluded_urls text[] DEFAULT '{}'::text[] NOT NULL,);CREATE TABLE alert_management_alerts (id bigint NOT NULL,hosts text[] DEFAULT '{}'::text[] NOT NULL,);CREATE TABLE ci_pending_builds (id bigint NOT NULL,tag_ids integer[] DEFAULT '{}'::integer[],);
Gitlab schema structure.sql – https://gitlab.com/gitlab-org/gitlab/-/blob/master/db/structure.sql
Issue 29465: Use structure.sql instead of schema.rb – https://gitlab.com/gitlab-org/gitlab/-/issues/29465
Choosing Primary Key Type in Postgres – https://shekhargulati.com/2022/06/23/choosing-a-primary-key-type-in-postgres/
Github’s Path to 128M public repositories – https://towardsdatascience.com/githubs-path-to-128m-public-repositories-f6f656ab56b1#:~:text=There%20are%20over%20128%20million%20public%20repositories%20on%20GitHub.
Postgres Character Types Documentation – https://www.postgresql.org/docs/current/datatype-character.html
Difference between text and varchar (character varying) – https://stackoverflow.com/questions/4848964/difference-between-text-and-varchar-character-varying
CHAR(x) vs. VARCHAR(x) vs. VARCHAR vs. TEXT – https://www.depesz.com/2010/03/02/charx-vs-varcharx-vs-varchar-vs-text/

头疼管理 MySQL 数据库 Schema?开源工具大盘点!







