PostgreSQL 单库对象过多，触发Linux系统限制(inode满或inode index满)

digoal 2018-04-10

626

作者

digoal

日期

2018-04-10

背景

PostgreSQL 里面创建的表，序列，索引，物化视图等带有存储的对象，每个对象的数据文件都是独立的，较依赖文件系统的管理能力。并不像Oracle那样把对象放到表空间中管理，表空间又由若干的数据文件组成。(ASM的话则接管更多的操作。)

所以，当创建了很多个有实际存储的对象时，文件数就会很多：

通常一个表包含数据文件（若干，单个默认1G），fsm文件一个, vm文件一个，如果是unlogged table则还有_init文件一个。

文件数过多，可能触发文件系统的一些限制。

ext4_dx_add_entry: Directory index full!

原因1，inode index满

某个系统，在创建新的对象时，报这样的错误。

ERROR: could not create file "base/16392/166637613": No space left on device

digoal=# do language plpgsql $$ declare begin for i in 1..1000 loop execute 'create table test'||i||'(id int primary key, c1 int unique, c2 int unique)'; end loop; end; $$; ERROR: 53100: could not create file "base/16392/166691646": No space left on device CONTEXT: SQL statement "create table test43(id int primary key, c1 int unique, c2 int unique)" PL/pgSQL function inline_code_block line 5 at EXECUTE statement LOCATION: mdcreate, md.c:304

报错的PG源码文件如下

```
/
* mdcreate() -- Create a new relation on magnetic disk.

* If isRedo is true, it's okay for the relation to exist already.
/
void
mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
{
MdfdVec mdfd;
char *path;
File fd;

    if (isRedo && reln->md_num_open_segs[forkNum] > 0)  
            return;                      /* created and opened already... */

    Assert(reln->md_num_open_segs[forkNum] == 0);

    path = relpath(reln->smgr_rnode, forkNum);

    fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY, 0600);

    if (fd < 0)  
    {  
            int                     save_errno = errno;

            /*  
             * During bootstrap, there are cases where a system relation will be  
             * accessed (by internal backend processes) before the bootstrap  
             * script nominally creates it.  Therefore, allow the file to exist  
             * already, even if isRedo is not set.  (See also mdopen)  
             */  
            if (isRedo || IsBootstrapProcessingMode())  
                    fd = PathNameOpenFile(path, O_RDWR | PG_BINARY, 0600);  
            if (fd < 0)  
            {  
                    /* be sure to report the error reported by create, not open */  
                    errno = save_errno;  
                    ereport(ERROR,  
                                    (errcode_for_file_access(),  
                                     errmsg("could not create file \"%s\": %m", path)));  
            }  
    }

    pfree(path);

    _fdvec_resize(reln, forkNum, 1);  
    mdfd = &reln->md_seg_fds[forkNum][0];  
    mdfd->mdfd_vfd = fd;  
    mdfd->mdfd_segno = 0;

}
```

实际上就是创建文件出错，但并不是真的没有空间了"No space left on device"。

在系统中可以看到dmesg的消息如下

[14372230.975489] EXT4-fs warning (device dm-0): ext4_dx_add_entry: Directory index full! [14372230.984268] EXT4-fs warning (device dm-0): ext4_dx_add_entry: Directory index full! [14372230.992913] EXT4-fs warning (device dm-0): ext4_dx_add_entry: Directory index full!

这个错误与某个目录下的文件数有关。文件数与INODE有关，同时EXT4中为了提高检索文件的性能，有index的优化开关。

```
man mkfs.ext4

dir_index
Use hashed b-trees to speed up lookups in large directories.
```

看起来像是这个超了。

和数据库进程的ulimit并无关系：

```

cd /proc/17128

cat limits

Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size unlimited unlimited bytes
Max core file size unlimited unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 655360 655360 processes
Max open files 655360 655360 files
Max locked memory unlimited unlimited bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 4133740 4133740 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
```

分析原因

这个数据库居然创建了2万多个schema。

```
digoal=# select count(*) from pg_namespace;
count

27151
(1 row)
```

每个schema的内容都差不多，有600多个对象。

```
digoal=# select count(*) from pg_class where relnamespace=(select oid from pg_namespace where nspname='repo_20171106_1034_2877737');
count

616
(1 row)
```

这个数据库中一共有2068万个对象。

```
digoal=# select count(*) from pg_class;
count

20680394
(1 row)
```

PG数据文件的存放规则

目录结构

表空间/数据库/对象表空间/数据库/对象.1 ... 表空间/数据库/对象.n 表空间/数据库/对象_fsm 表空间/数据库/对象_vm 表空间/数据库/对象_init

如果这些对象都在一个表空间里面，那么这个表空间下对应数据库OID的目录中，将有至少2000多万个文件。

```
$ cd $PGDATA/base/16392

$ ll|wc -l

21637521
```

果然，有2163万个文件，看样子和它有莫大关系。

开启ext4的dir_index，对单个目录中文件数会有限制。

解决方法

1、删除不必要的schema

2、创建多个表空间，因为每个表空间是一个单独的目录。

3、将对象分配到不同的表空间中，这样的话就可以避免开启ext4的dir_index后，当有很多个对象时，单个目录中文件数超出限制的问题。

4、如果是因为过多的schema模板导致的，可以考虑改变架构，使用database代替schema，当然了database过多也有它的缺陷，比如连接池无法复用跨DB的连接。

5、如果数据库因为这个问题起不来了，要紧急解决，先把库起来，怎么处理呢？可以临时关闭index功能。

``` 关闭index功能后，线性查找，性能会下降。有INDEX时，会使用hash查找。 Disable dir_index option of the filesystem. Note: There is no need to unmount the file system. CAUTION: Disabling this option may impact performance on lookups in large directories.

tune2fs -O ^dir_index /dev/device

注意，将来要开启index功能需要先umount. To recreate the dir_index:

umount /dev/device

tune2fs -O dir_index /dev/device

e2fsck -D /dev/device

```

原因2，inode满

1、使用df可以看到文件系统对应的inode使用量，如果达到100%，则会出现类似的错误(no space left on device ...)

df -i Filesystem Inodes IUsed IFree IUse% Mounted on /dev/sda3 905760 126543 779217 14% / tmpfs 364177 1 364176 1% /dev/shm /dev/sda1 51200 38 51162 1% /boot

2、解决办法，格式化的时候，指定需要多少inode.

``` man mkfs.ext4

   -i bytes-per-inode
          Specify the bytes/inode ratio.  mke2fs creates an inode for every bytes-per-inode bytes of space on the disk.  The larger the bytes-per-inode ratio, the fewer inodes will be created.  This value generally shouldn't be
          smaller than the blocksize of the filesystem, since in that case more inodes would be made than can ever be used.  Be warned that it is not possible to change this ratio on a filesystem after  it  is  created,  so  be
          careful deciding the correct value for this parameter.  Note that resizing a filesystem changes the numer of inodes to maintain this ratio.

   -I inode-size
          Specify  the size of each inode in bytes.  The inode-size value must be a power of 2 larger or equal to 128.  The larger the inode-size the more space the inode table will consume, and this reduces the usable space in
          the filesystem and can also negatively impact performance.  It is not possible to change this value after the filesystem is created.

          In kernels after 2.6.10 and some earlier vendor kernels it is possible to utilize inodes larger than 128 bytes to store extended attributes for improved performance.  Extended attributes stored in large inodes are not
          visible with older kernels, and such filesystems will not be mountable with 2.4 kernels at all.

          The  default  inode  size  is  controlled by the mke2fs.conf(5) file.  In the mke2fs.conf file shipped with e2fsprogs, the default inode size is 256 bytes for most file systems, except for small file systems where the
          inode size will be 128 bytes.

```

通常指定-i，比如16K，每16K分配一个INODE。 1TB的文件系统大概有6250万个inode。

例子，指定每8K分配一个INODE（如果文件的平均大小为8K的话）。

``` [root@pg11-test ~]# lvs LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert lv01 vgdata01 -wi-ao---- 4.00t
lv02 vgdata01 -wi-ao---- 4.00t
lv03 vgdata01 -wi-ao---- 4.00t
lv04 vgdata01 -wi-ao---- <1.97t

[root@pg11-test ~]# mkfs.ext4 /dev/mapper/vgdata01-lv04 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=8,stripe_width=128 -i 8192 -b 4096 -T largefile -L lv04

[root@pg11-test ~]# df -i Filesystem Inodes IUsed IFree IUse% Mounted on /dev/vda1 13107200 107325 12999875 1% / devtmpfs 66021999 713 66021286 1% /dev tmpfs 66024376 2151 66022225 1% /dev/shm tmpfs 66024376 661 66023715 1% /run tmpfs 66024376 16 66024360 1% /sys/fs/cgroup /dev/mapper/vgdata01-lv03 4194304 62361 4131943 2% /data03 /dev/mapper/vgdata01-lv02 4194304 52620 4141684 2% /data02 /dev/mapper/vgdata01-lv01 4194304 61526 4132778 2% /data01 tmpfs 66024376 1 66024375 1% /run/user/0 /dev/mapper/vgdata01-lv04 264110080 11 264110069 1% /data04

postgres=# select pg_size_pretty(264110069*8192::numeric); pg_size_pretty

2015 GB (1 row) ```

小结

由于目前PG的不同对象，存放在不同的数据文件中，当有多个对象时，会创建多个数据文件。

目前PG对象的数据文件存在目录结构如下：

表空间/数据库/对象表空间/数据库/对象.1 ... 表空间/数据库/对象.n 表空间/数据库/对象_fsm 表空间/数据库/对象_vm 表空间/数据库/对象_init

如果将单个库的所有对象存放在一个表空间中，这个表空间/数据库目录下会有很多个文件，那么可能遇到开启ext4的dir_index后，当有很多个对象时，单个目录中文件数超出限制的问题。

比如本例，一个表空间/数据库目录下，有2000多万个文件。导致了ext4_dx_add_entry: Directory index full!的问题。

为了避免这个问题，建议在单个库的单个表空间中，不要超过500万个对象。如果有更多的对象要在一个库中创建，那么可以创建多个表空间。

当然，我们这里还没有提到文件系统的其他限制，比如一个文件系统INODE的限制。与位数，以及文件系统有关。

参考

https://bean-li.github.io/EXT4_DIR_INDEX/

《PostgreSQL DaaS设计注意 - schema与database的抉择》

《PostgreSQL 备库apply延迟(delay)原理分析与诊断》

《PostgreSQL 流复制延迟的测试方法》

《PostgreSQL standby 在万兆网环境中缘何延迟? 如何解决?》

https://serverfault.com/questions/104986/what-is-the-maximum-number-of-files-a-file-system-can-contain

https://ext4.wiki.kernel.org/index.php/Ext4_Howto#Bigger_File_System_and_File_Sizes

https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout

https://stackoverflow.com/questions/466521/how-many-files-can-i-put-in-a-directory

https://serverfault.com/questions/482998/how-can-i-fix-it-ext4-fs-warning-device-sda3-ext4-dx-add-entry-directory-in

https://access.redhat.com/solutions/29894

文章转载自digoal，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。