暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

磐维数据库2.0.3单机部署实战,数据库无法启动解决方案

498

单节点部署磐维数据库2.0.3

在学习的过程中,我尝试在自己的虚拟机上单节点部署一套磐维2.0.3,以验证自己的学习成果,在部署的过程当中遇到了一些小问题,分享出来帮助遇到同样问题的小伙伴解惑。

一、环境准备

操作系统版本

CentOS Linux 7 (Core)

操作系统内核

Linux 3.10.0-1160.el7.x86_64

架构

x86-64

主机信息

主机名ip数据库端口
panwei1

192.168.241.132

17700

安装用户

用户属组
ommdbgrp

安装路径

/database/panweidb

二、系统环境优化

具体请见:https://www.modb.pro/db/1825811174795079680

三、单节点安装磐维2.0.3

采用OM的方式

编写xml文件

vi /database/panweidb/soft/panweidb1m.xml
<?xml version="1.0" encoding="utf-8"?>
<ROOT>
  <CLUSTER>
    <PARAM name="clusterName" value="panweidb" />
    <PARAM name="nodeNames" value="panwei1"/>
    <PARAM name="gaussdbAppPath" value="/database/panweidb/app" />
    <PARAM name="gaussdbLogPath" value="/database/panweidb/log" />
    <PARAM name="tmpMppdbPath" value="/database/panweidb/tmp"/>
    <PARAM name="gaussdbToolPath" value="/database/panweidb/tool" />
    <PARAM name="corePath" value="/database/panweidb/corefile"/>
    <PARAM name="backIp1s" value="192.168.241.132"/>
  </CLUSTER>
  
  <DEVICELIST>
    <DEVICE sn="panwei1">
      <PARAM name="name" value="panwei1"/>
      <PARAM name="azName" value="AZ1"/>
      <PARAM name="azPriority" value="1"/>
      <PARAM name="backIp1" value="192.168.241.132"/>
      <PARAM name="sshIp1" value="192.168.241.132"/>
      <PARAM name="dataNum" value="1"/>
      <PARAM name="dataPortBase" value="17700"/>
      <PARAM name="dataNode1" value="/database/panweidb/data"/>
    </DEVICE>
  </DEVICELIST>
</ROOT>

使用root用户预安装

[root@panwei1 ~]# cd /database/panweidb/soft/script/
[root@panwei1 script]# ./gs_preinstall -U omm -G dbgrp -X /database/panweidb/soft/panweidb1m.xml 
Parsing the configuration file.
Successfully parsed the configuration file.
Installing the tools on the local node.
Successfully installed the tools on the local node.
Setting host ip env
Successfully set host ip env.
Are you sure you want to create the user[omm] (yes/no)? yes
Preparing SSH service.
Successfully prepared SSH service.
Checking OS software.
Successfully check os software.
Checking OS version.
Successfully checked OS version.
Creating cluster's path.
Successfully created cluster's path.
Set and check OS parameter.
Setting OS parameters.
Successfully set OS parameters.
Warning: Installation environment contains some warning messages.
Please get more details by "/database/panweidb/soft/script/gs_checkos -i A -h panwei1 --detail".
Set and check OS parameter completed.
Preparing CRON service.
Successfully prepared CRON service.
Setting user environmental variables.
Successfully set user environmental variables.
Setting the dynamic link library.
Successfully set the dynamic link library.
Setting Core file
Successfully set core path.
Setting pssh path
Successfully set pssh path.
Setting Cgroup.
Successfully set Cgroup.
Set ARM Optimization.
No need to set ARM Optimization.
Fixing server package owner.
Setting finish flag.
Successfully set finish flag.
Preinstallation succeeded.

至此发现预安装时安装环境包含一些警告信息,我们来查看一下

[root@panwei1 script]# /database/panweidb/soft/script/gs_checkos -i A -h panwei1 --detail
Checking items:
    A1. [ OS version status ]                                   : Normal     
        [panwei1]
        centos_7.9.2009_64bit
           
    A2. [ Kernel version status ]                               : Normal     
        The names about all kernel versions are same. The value is "3.10.0-1160.el7.x86_64".
    A3. [ Unicode status ]                                      : Normal     
        The values of all unicode are same. The value is "LANG=zh_CN.UTF-8".
    A4. [ Time zone status ]                                    : Normal     
        The informations about all timezones are same. The value is "+0800".
    A5. [ Swap memory status ]                                  : Normal     
        The value about swap memory is correct.            
    A6. [ System control parameters status ]                    : Warning    
        [panwei1]
        Warning reason: variable 'net.ipv4.tcp_retries1' RealValue '3' ExpectedValue '5'.
        Warning reason: variable 'net.ipv4.tcp_syn_retries' RealValue '6' ExpectedValue '5'.
        Check_SysCtl_Parameter warning.

    A7. [ File system configuration status ]                    : Normal     
        Both soft nofile and hard nofile are correct.      
    A8. [ Disk configuration status ]                           : Normal     
        The value about XFS mount parameters is correct.   
    A9. [ Pre-read block size status ]                          : Normal     
        The value about Logical block size is correct.     
    A10.[ IO scheduler status ]                                 : Normal     
        The value of IO scheduler is correct.              
    A11.[ Network card configuration status ]                   : Warning    
        [panwei1]
BondMode Null
        Warning reason: network 'ens33' 'mtu' RealValue '1500' ExpectedValue '8192'

    A12.[ Time consistency status ]                             : Warning    
        [panwei1]
        The NTPD not detected on machine and local time is "2024-08-20 17:27:02".

    A13.[ Firewall service status ]                             : Normal     
        The firewall service is stopped.                   
    A14.[ THP service status ]                                  : Normal     
        The THP service is stopped.                        
Total numbers:14. Abnormal numbers:0. Warning numbers:3.

我们看到告警信息为:两个网络参数net.ipv4.tcp_retries1 和 net.ipv4.tcp_syn_retries 的期望值都为5;网卡ens33的MTU值为1500,期望值为8192;在本机上没有检测到NTPD进程。

预安装报警告不影响我们安装部署,我们进行下一步

使用omm用户安装

[omm@panwei1 ~]$ cd /database/panweidb/soft/script/
[omm@panwei1 script]$ gs_install -X /database/panweidb/soft/panweidb1m.xml \
> --gsinit-parameter="--encoding=UTF8" \
> --gsinit-parameter="--lc-collate=C" \
> --gsinit-parameter="--lc-ctype=C" \
> --gsinit-parameter="--dbcompatibility=B"
Parsing the configuration file.
Successfully checked gs_uninstall on every node.
Check preinstall on every node.
Successfully checked preinstall on every node.
Creating the backup directory.
Last time end with Start cluster.
Continue this step.
Successfully created the backup directory.
begin deploy..
Using omm:dbgrp to install database.
Using installation program path : /database/panweidb/app_e857234
$GAUSSHOME points to /database/panweidb/app_e857234, no need to create symbolic link.
Traceback (most recent call last):
  File "/database/panweidb/tool/script/local/Install.py", line 816, in <module>
    functionDict[g_opts.action]()
  File "/database/panweidb/tool/script/local/Install.py", line 747, in startCluster
    dn.start(self.time_out)
  File "/database/panweidb/tool/script/local/../gspylib/component/Kernel/Kernel.py", line 106, in start
    "failure details." + "\n" + output)
Exception: [GAUSS-51607] : Failed to start instance. Error: Please check the gs_ctl log for failure details.

这里我们安装之后发现部署完成后数据库无法启动,我们查看数据库状态

[omm@panwei1 panweidb]$ gs_om -t status --detail
[   Cluster State   ]

cluster_state   : Unavailable
redistributing  : No
current_az      : AZ_ALL

[  Datanode State   ]

    node   node_ip         port      instance                        state
------------------------------------------------------------------------------------------
1  panwei1 192.168.241.132 17700      6001 /database/panweidb/data   P Primary Unknown

我们查看一下数据库日志

[omm@panwei1 ~]$ cd /database/panweidb/log/omm/pg_log/dn_6001/[omm@panwei1 dn_6001]$ ll
total 0

我们看到数据库日志没有生成,这时我们尝试手动拉起数据库

[omm@panwei1 dn_6001]$ gs_ctl start -D /database/panweidb/data
[2024-08-20 23:53:53.263][3175][][gs_ctl]: gs_ctl started,datadir is /database/panweidb/data 
[2024-08-20 23:53:53.348][3175][][gs_ctl]: waiting for server to start...
.0 LOG:  [Alarm Module]can not read GAUSS_WARNING_TYPE env.
	
0 LOG:  [Alarm Module]Host Name: panwei1 
	
0 LOG:  [Alarm Module]Host IP: panwei1. Copy hostname directly in case of taking 10s to use 'gethostbyname' when /etc/hosts does not contain <HOST IP>
	
0 LOG:  [Alarm Module]Cluster Name: panweidb 
	
0 LOG:  [Alarm Module]Invalid data in AlarmItem file! Read alarm English name failed! line: 58
	
0 WARNING:  failed to open feature control file, please check whether it exists: FileName=gaussdb.version, Errno=2, Errmessage=No such file or directory.
0 WARNING:  failed to parse feature control file: gaussdb.version.
0 WARNING:  Failed to load the product control file, so gaussdb cannot distinguish product version.
The core dump path is an invalid directory
 0 [BACKEND] LOG:  the config file /database/panweidb/data/postgresql.conf verify start.
2024-08-20 23:53:53.562 [unknown] [unknown] localhost 140396905542464 0[0:0#0] 0 [BACKEND] LOG:  the config file /database/panweidb/data/postgresql.conf verify success.
2024-08-20 23:53:53.563 [unknown] [unknown] localhost 140396905542464 0[0:0#0] 0 [BACKEND] WARNING:  Error happen when load license, error code  2, error message  cannot write data to dir /etc/panweidb/license .
	
2024-08-20 23:53:53.563 [unknown] [unknown] localhost 140396905542464 0[0:0#0] 0 [BACKEND] LOG:  base_page_saved_interval is 400, ori is 400.
2024-08-20 23:53:53.563 [unknown] [unknown] localhost 140396905542464 0[0:0#0] 0 [REDO] LOG:  Recovery parallelism, cpu count = 4, max = 4, actual = 4
2024-08-20 23:53:53.563 [unknown] [unknown] localhost 140396905542464 0[0:0#0] 0 [REDO] LOG:  ConfigRecoveryParallelism, true_max_recovery_parallelism:4, max_recovery_parallelism:4
2024-08-20 23:53:53.566 [unknown] [unknown] localhost 140396905542464 0[0:0#0] 0 [BACKEND] LOG:  [Alarm Module]can not read GAUSS_WARNING_TYPE env.
	
2024-08-20 23:53:53.566 [unknown] [unknown] localhost 140396905542464 0[0:0#0] 0 [BACKEND] LOG:  [Alarm Module]Host Name: panwei1 
	
2024-08-20 23:53:53.566 [unknown] [unknown] localhost 140396905542464 0[0:0#0] 0 [BACKEND] LOG:  [Alarm Module]Host IP: panwei1. Copy hostname directly in case of taking 10s to use 'gethostbyname' when /etc/hosts does not contain <HOST IP>
	
2024-08-20 23:53:53.566 [unknown] [unknown] localhost 140396905542464 0[0:0#0] 0 [BACKEND] LOG:  [Alarm Module]Cluster Name: panweidb 
	
2024-08-20 23:53:53.566 [unknown] [unknown] localhost 140396905542464 0[0:0#0] 0 [BACKEND] LOG:  [Alarm Module]Invalid data in AlarmItem file! Read alarm English name failed! line: 58
	
2024-08-20 23:53:53.572 [unknown] [unknown] localhost 140396905542464 0[0:0#0] 0 [BACKEND] LOG:  loaded library "security_plugin"
2024-08-20 23:53:53.572 [unknown] [unknown] localhost 140396905542464 0[0:0#0] 0 [BACKEND] WARNING:  could not create any HA TCP/IP sockets
2024-08-20 23:53:53.572 [unknown] [unknown] localhost 140396905542464 0[0:0#0] 0 [BACKEND] WARNING:  could not create any HA TCP/IP sockets
2024-08-20 23:53:53.576 [unknown] [unknown] localhost 140396905542464 0[0:0#0] 0 [BACKEND] LOG:  InitNuma numaNodeNum: 1 numa_distribute_mode: none inheritThreadPool: 0.
2024-08-20 23:53:53.576 [unknown] [unknown] localhost 140396905542464 0[0:0#0] 0 [BACKEND] LOG:  reserved memory for backend threads is: 340 MB
2024-08-20 23:53:53.576 [unknown] [unknown] localhost 140396905542464 0[0:0#0] 0 [BACKEND] LOG:  reserved memory for WAL buffers is: 320 MB
2024-08-20 23:53:53.576 [unknown] [unknown] localhost 140396905542464 0[0:0#0] 0 [BACKEND] LOG:  Set max backend reserve memory is: 660 MB, max dynamic memory is: 4192897 MB
2024-08-20 23:53:53.576 [unknown] [unknown] localhost 140396905542464 0[0:0#0] 0 [BACKEND] FATAL:  the values of memory out of limit, the database failed to be started, max_process_memory (3000MB) must greater than 2GB + cstore_buffers(16MB) + (udf_memory_limit(200MB) - UDF_DEFAULT_MEMORY(200MB)) + shared_buffers(1024MB) + preserved memory(3366MB) = 6454MB, reduce the value of shared_buffers, max_pred_locks_per_transaction, max_connection, wal_buffers..etc will help reduce the size of preserved memory
2024-08-20 23:53:53.622 [unknown] [unknown] localhost 140396905542464 0[0:0#0] 0 [BACKEND] LOG:  FiniNuma allocIndex: 0.
[2024-08-20 23:53:54.351][3175][][gs_ctl]: waitpid 3178 failed, exitstatus is 256, ret is 2

[2024-08-20 23:53:54.351][3175][][gs_ctl]: stopped waiting
[2024-08-20 23:53:54.351][3175][][gs_ctl]: could not start server
Examine the log output.

我们看到报错是因为 max_process_memory 内存参数小于其他内存参数值的总和导致的数据库实例无法正常启动,这里我们需要根据我们机器的内存资源来综合考虑,如果我们的机器内存大于7GB,则我们可以选择修改max_process_memory参数的值来解决这个问题,如果我们的机器内存小于7GB,则此时我们无法通过只修改max_process_memory的值来解决,此时我们需要综合考量我们的内存资源,适度增加max_process_memory的值,并同时适度减少shared_buffers, max_pred_locks_per_transaction, max_connection, wal_buffers参数的值来解决这个问题。

我的虚拟机内存给到8GB,所以我们尝试使用gs_guc工具修改max_process_memory参数的值来解决这个问题。

[omm@panwei1 dn_6001]$ gs_guc set -N all -I all -c "max_process_memory=7000MB"
The pw_guc run with the following arguments: [gs_guc -N all -I all -c max_process_memory=7000MB set ].
Begin to perform the total nodes: 1.
Popen count is 1, Popen success count is 1, Popen failure count is 0.
Begin to perform gs_guc for datanodes.
Command count is 1, Command success count is 1, Command failure count is 0.

Total instances: 1. Failed instances: 0.
ALL: Success to perform gs_guc!

[omm@panwei1 dn_6001]$ gs_ctl start -D /database/panweidb/data [2024-08-20 23:56:34.629][3214][][gs_ctl]: gs_ctl started,datadir is /database/panweidb/data [2024-08-20 23:56:34.724][3214][][gs_ctl]: waiting for server to start... .0 LOG: [Alarm Module]can not read GAUSS_WARNING_TYPE env. 0 LOG: [Alarm Module]Host Name: panwei1 0 LOG: [Alarm Module]Host IP: panwei1. Copy hostname directly in case of taking 10s to use 'gethostbyname' when /etc/hosts does not contain <HOST IP> 0 LOG: [Alarm Module]Cluster Name: panweidb 0 LOG: [Alarm Module]Invalid data in AlarmItem file! Read alarm English name failed! line: 58 0 WARNING: failed to open feature control file, please check whether it exists: FileName=gaussdb.version, Errno=2, Errmessage=No such file or directory. 0 WARNING: failed to parse feature control file: gaussdb.version. 0 WARNING: Failed to load the product control file, so gaussdb cannot distinguish product version. The core dump path is an invalid directory 0 [BACKEND] LOG: the config file /database/panweidb/data/postgresql.conf verify start. 2024-08-20 23:56:34.900 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [BACKEND] LOG: the config file /database/panweidb/data/postgresql.conf verify success. 2024-08-20 23:56:34.902 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [BACKEND] WARNING: Error happen when load license, error code 2, error message cannot write data to dir /etc/panweidb/license . 2024-08-20 23:56:34.902 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [BACKEND] LOG: base_page_saved_interval is 400, ori is 400. 2024-08-20 23:56:34.902 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [REDO] LOG: Recovery parallelism, cpu count = 4, max = 4, actual = 4 2024-08-20 23:56:34.902 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [REDO] LOG: ConfigRecoveryParallelism, true_max_recovery_parallelism:4, max_recovery_parallelism:4 2024-08-20 23:56:34.905 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [BACKEND] LOG: [Alarm Module]can not read GAUSS_WARNING_TYPE env. 2024-08-20 23:56:34.905 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [BACKEND] LOG: [Alarm Module]Host Name: panwei1 2024-08-20 23:56:34.905 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [BACKEND] LOG: [Alarm Module]Host IP: panwei1. Copy hostname directly in case of taking 10s to use 'gethostbyname' when /etc/hosts does not contain <HOST IP> 2024-08-20 23:56:34.905 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [BACKEND] LOG: [Alarm Module]Cluster Name: panweidb 2024-08-20 23:56:34.905 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [BACKEND] LOG: [Alarm Module]Invalid data in AlarmItem file! Read alarm English name failed! line: 58 2024-08-20 23:56:34.908 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [BACKEND] LOG: loaded library "security_plugin" 2024-08-20 23:56:34.908 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [BACKEND] WARNING: could not create any HA TCP/IP sockets 2024-08-20 23:56:34.908 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [BACKEND] WARNING: could not create any HA TCP/IP sockets 2024-08-20 23:56:34.913 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [BACKEND] LOG: InitNuma numaNodeNum: 1 numa_distribute_mode: none inheritThreadPool: 0. 2024-08-20 23:56:34.913 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [BACKEND] LOG: reserved memory for backend threads is: 340 MB 2024-08-20 23:56:34.913 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [BACKEND] LOG: reserved memory for WAL buffers is: 320 MB 2024-08-20 23:56:34.913 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [BACKEND] LOG: Set max backend reserve memory is: 660 MB, max dynamic memory is: 2593 MB 2024-08-20 23:56:34.913 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [BACKEND] LOG: shared memory 3730 Mbytes, memory context 3253 Mbytes, max process memory 7000 Mbytes 2024-08-20 23:56:34.913 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [MEM_CTL] LOG: PGSharedMemoryCreate(): ShmemBase=0x7f8bee0c2000, totalsize=3911737704, freeoffset=56 2024-08-20 23:56:35.543 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [CACHE] LOG: set data cache size(12582912) 2024-08-20 23:56:36.136 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [SEGMENT_PAGE] LOG: Segment-page constants: DF_MAP_SIZE: 8156, DF_MAP_BIT_CNT: 65248, DF_MAP_GROUP_EXTENTS: 4175872, IPBLOCK_SIZE: 8168, EXTENTS_PER_IPBLOCK: 1021, IPBLOCK_GROUP_SIZE: 4090, BMT_HEADER_LEVEL0_TOTAL_PAGES: 8323072, BktMapEntryNumberPerBlock: 2038, BktMapBlockNumber: 25, BktBitMaxMapCnt: 512 2024-08-20 23:56:36.430 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [BACKEND] LOG: panweidb: fsync file "/database/panweidb/data/gaussdb.state.temp" success 2024-08-20 23:56:36.431 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [BACKEND] LOG: create gaussdb state file success: db state(STARTING_STATE), server mode(Normal), connection index(1) 2024-08-20 23:56:36.531 [unknown] [unknown] localhost 140243488527168 0[0:0#0] 0 [BACKEND] LOG: max_safe_fds = 99976, usable_fds = 100000, already_open = 14 The core dump path is an invalid directory . [2024-08-20 23:56:37.777][3214][][gs_ctl]: done [2024-08-20 23:56:37.777][3214][][gs_ctl]: server started (/database/panweidb/data) [omm@panwei1 dn_6001]$ gs_om -t status --detail [ Cluster State ] cluster_state : Normal redistributing : No current_az : AZ_ALL [ Datanode State ] node node_ip port instance state ------------------------------------------------------------------------------------------ 1 panwei1 192.168.241.132 17700 6001 /database/panweidb/data P Primary Normal

此时我们看到数据库已经成功拉起,问题解决。

在数据库出现问题无法启动时,解决思路首先需要查看数据库的状态,然后需要查看日志看启动的过程中的报错信息,分析报错信息来排查问题,在遇到内存参数问题时,不要想当然的修改内存参数,一定要结合实际的内存大小来综合考量内存参数的修改范围。


「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论