Handling Procedure
gaussdb=# select * from pv_total_memory_detail;
    nodename    |       memorytype        | memorymbytes
----------------+-------------------------+--------------
 coordinator1   | max_process_memory      |        81920
 coordinator1   | process_used_memory     |        14567
 coordinator1   | max_dynamic_memory      |        34012
 coordinator1   | dynamic_used_memory     |         1851
 coordinator1   | dynamic_peak_memory     |         3639
 coordinator1   | dynamic_used_shrctx     |          394
 coordinator1   | dynamic_peak_shrctx     |          399
 coordinator1   | max_backend_memory      |          648
 coordinator1   | backend_used_memory     |            1
 coordinator1   | max_shared_memory       |        46747
 coordinator1   | shared_used_memory      |        11618
 coordinator1   | max_cstore_memory       |          512
 coordinator1   | cstore_used_memory      |            0
 coordinator1   | max_sctpcomm_memory     |            0
 coordinator1   | sctpcomm_used_memory    |            0
 coordinator1   | sctpcomm_peak_memory    |            0
 coordinator1   | other_used_memory       |         1013
 coordinator1   | gpu_max_dynamic_memory  |            0
 coordinator1   | gpu_dynamic_used_memory |            0
 coordinator1   | gpu_dynamic_peak_memory |            0
 coordinator1   | pooler_conn_memory      |            0
 coordinator1   | pooler_freeconn_memory  |            0
 coordinator1   | storage_compress_memory |            0
 coordinator1   | udf_reserved_memory     |            0
(24 rows)
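View the sizes of the global memory contexts of the database process, grouped by memory context and sorted in descending order. The top 10 are sufficient.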
gaussdb=# select contextname, sum(totalsize)/1024/1024 sum, sum(freesize)/1024/1024, count(*) count from pg_shared_memory_detail group by contextname order by sum desc limit 10;
            contextname            |         sum          |       ?column?        | count
-----------------------------------+----------------------+-----------------------+-------
 IncreCheckPointContext            | 250.8796234130859375 | .00273132324218750000 |     1
 AshContext                        |  64.0950317382812500 | .00772094726562500000 |     1
 DefaultTopMemoryContext           |  60.5699005126953125 |    1.0594177246093750 |     1
 StorageTopMemoryContext           |  16.7601776123046875 | .05357360839843750000 |     1
 GlobalAuditMemory                 |  16.0081176757812500 | .00769042968750000000 |     1
 CBBTopMemoryContext               |  14.9503479003906250 | .04009246826171875000 |     1
 Undo                              |   8.6680450439453125 | .21752929687500000000 |     1
 DoubleWriteContext                |   6.5549163818359375 | .02331542968750000000 |     1
 ThreadPoolContext                 |   5.4042663574218750 | .00525665283203125000 |     1
 GlobalSysDBCacheEntryMemCxt_16384 |   4.2232666015625000 | .89799499511718750000 |    16
(10 rows)
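View the sizes of the memory contexts of all threads in the database process, grouped by memory context and sorted in descending order. The top 10 are sufficient.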
gaussdb=# select contextname, sum(totalsize)/1024/1024 sum, sum(freesize)/1024/1024, count(*) count from pv_thread_memory_context group by contextname order by sum desc limit 10;
           contextname           |         sum          |       ?column?        | count
---------------------------------+----------------------+-----------------------+-------
 LocalSysCacheShareMemoryContext | 612.5096435546875000 |   57.4630737304687500 |   543
 StorageTopMemoryContext         | 311.8157348632812500 |    3.2519149780273438 |   543
 DefaultTopMemoryContext         | 168.5756530761718750 |   10.7153015136718750 |   543
 LocalSysCacheMyDBMemoryContext  | 167.4375000000000000 |   65.7499847412109375 |   543
 ThreadTopMemoryContext          | 161.4440002441406250 |    4.0309295654296875 |   543
 CBBTopMemoryContext             | 109.1161880493164063 |    6.7845993041992188 |   543
 LocalSysCacheTopMemoryContext   |  93.4109802246093750 |   13.2236938476562500 |   543
 Timezones                       |  43.2421417236328125 |    1.4333953857421875 |   543
 gs_signal                       |  32.2394561767578125 |    4.9155120849609375 |     1
 Type information cache          |  22.9119262695312500 | .86848449707031250000 |   329
(10 rows)
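View the sizes of the memory contexts of all sessions in the database process, grouped by memory context and sorted in descending order. The top 10 are sufficient.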
gaussdb=# select contextname, sum(totalsize)/1024/1024 sum, sum(freesize)/1024/1024, count(*) count from pv_session_memory_context group by contextname order by sum desc limit 10;
       contextname       |         sum          |       ?column?        | count
-------------------------+----------------------+-----------------------+-------
 CachedPlan              | 223.4433593750000000 |   64.6083068847656250 | 12394
 CachedPlanQuery         | 134.7382812500000000 |   42.3366699218750000 | 12596
 SessionTopMemoryContext | 132.3496398925781250 |   25.9272155761718750 |   302
 CachedPlanSource        |  98.6943359375000000 |   28.3841018676757813 | 12897
 CBBTopMemoryContext     |  60.6870880126953125 |    3.0470962524414063 |   302
 GenericRoot             |  35.1962890625000000 |   14.1624069213867188 |   471
 Timezones               |  24.0499572753906250 | .79721069335937500000 |   302
 SPI Plan                |  21.0664062500000000 |    6.8149719238281250 |  2396
 AdaptiveCachedPlan      |  17.5449218750000000 |    4.7733078002929688 |   546
 Prepared Queries        |  16.4062500000000000 |    7.5508117675781250 |   300
(10 rows)
gaussdb=# select * from gs_get_history_memory_detail(NULL) order by memory_info desc limit 10;
          memory_info
-------------------------------
 mem_log-2023-03-10_205125.log
 mem_log-2023-03-10_205115.log
 mem_log-2023-03-10_205104.log
 mem_log-2023-03-10_205054.log
 mem_log-2023-03-10_205043.log
 mem_log-2023-03-10_205032.log
 mem_log-2023-03-10_205022.log
 mem_log-2023-03-10_205012.log
 mem_log-2023-03-10_205002.log
 mem_log-2023-03-10_204951.log
(10 rows)
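Select one of the log files and run the following query to view its contents. The log records the global memory overview and the details of the top 20 memory contexts at the global level, thread level, and session level, as shown below.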
gaussdb=# select * from gs_get_history_memory_detail('mem_log-2023-03-10_205125.log');
                                     memory_info
--------------------------------------------------------------------------------------
 {"Global Memory Statistics": {
     "Max_dynamic_memory": 34012,
     "Dynamic_used_memory": 3645,
     "Dynamic_peak_memory": 3664,
     "Dynamic_used_shrctx": 401,
     "Dynamic_peak_shrctx": 401,
     "Max_backend_memory": 648,
     "Backend_used_memory": 1,
     "other_used_memory": 0
  },
  "Memory Context Info": {
     "Memory Context Detail": {
         "Context Type": "Shared Memory Context",
         "Memory Context": {
             "context": "IncreCheckPointContext",
             "freeSize": 2864,
             "totalSize": 263066352
         },
         ...
     },
     "Memory Context Detail": {
         "Context Type": "Session Memory Context",
         "Memory Context": {
             "context": "CachedPlan",
             "freeSize": 68041368,
             "totalSize": 235937792
         },
         ...
     },
     "Memory Context Detail": {
         "Context Type": "Thread Memory Context",
         "Memory Context": {
             "context": "LocalSysCacheShareMemoryContext",
             "freeSize": 60431360,
             "totalSize": 644141760
         },
         ...
     }
  }
 }
(322 rows)
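When the log contents are exported for offline analysis, the repeated "Memory Context Detail" keys need care, because an ordinary JSON parser keeps only the last value for a repeated key. The following is a minimal sketch, not a supported tool. It assumes that the returned rows, joined together, form one JSON document shaped like the output above, and that each elided "..." stands for further "Memory Context" objects; the function name and the sample string are hypothetical.

```python
import json

# Hypothetical sample: only the entries visible in the output above,
# with the elided "..." parts left out so that the text is valid JSON.
SAMPLE_LOG = """
{"Global Memory Statistics": {"Max_dynamic_memory": 34012, "Dynamic_used_memory": 3645},
 "Memory Context Info": {
   "Memory Context Detail": {"Context Type": "Shared Memory Context",
     "Memory Context": {"context": "IncreCheckPointContext", "freeSize": 2864, "totalSize": 263066352}},
   "Memory Context Detail": {"Context Type": "Session Memory Context",
     "Memory Context": {"context": "CachedPlan", "freeSize": 68041368, "totalSize": 235937792}},
   "Memory Context Detail": {"Context Type": "Thread Memory Context",
     "Memory Context": {"context": "LocalSysCacheShareMemoryContext", "freeSize": 60431360, "totalSize": 644141760}}
 }}
"""


def largest_context_per_type(log_text):
    """Return {context type: largest memory context entry} for one log."""
    # object_pairs_hook=list turns every JSON object into a list of
    # (key, value) pairs, so repeated keys are preserved.
    root = json.loads(log_text, object_pairs_hook=list)
    largest = {}
    for key, value in root:
        if key != "Memory Context Info":
            continue
        for detail_key, detail in value:
            if detail_key != "Memory Context Detail":
                continue
            context_type = None
            contexts = []
            for field, field_value in detail:
                if field == "Context Type":
                    context_type = field_value
                elif field == "Memory Context":
                    contexts.append(dict(field_value))
            if context_type is not None and contexts:
                largest[context_type] = max(contexts, key=lambda c: c["totalSize"])
    return largest


for context_type, entry in largest_context_per_type(SAMPLE_LOG).items():
    print(context_type, entry["context"], entry["totalSize"])
```

Keeping the pairs as lists rather than dictionaries is the design choice that matters here; everything else is a plain walk over the structure.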
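Based on the memory usage overview obtained in "Obtaining Memory Statistics", analyze the situation as follows:
If dynamic_used_memory is large and dynamic_used_shrctx is small, threads and sessions account for most of the memory usage.
If dynamic_used_memory is large and dynamic_used_shrctx is close to dynamic_used_memory, global memory contexts account for most of the dynamic memory usage.
If only shared_used_memory is large, shared memory accounts for most of the usage. This case can be ignored.
If other_used_memory is large, the usual cause is that frequent memory allocation and release during service execution leaves too many memory fragments in the cache.
For these cases, use the corresponding one of the following four locating methods.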
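The decision rules above can be sketched as a small helper that takes the memorytype values of pv_total_memory_detail as a dictionary. This is only an illustration: the text does not define "large", "small", or "close", so the three ratio parameters and the baselines they are measured against are assumptions, and the function name and result labels are hypothetical.

```python
def classify_memory_profile(stats, high_ratio=0.8, close_ratio=0.8, other_ratio=0.3):
    """Map memory statistics (MB, keyed by memorytype) to the cases above.

    All three ratios are hypothetical cut-offs chosen for illustration.
    """
    findings = []
    dyn_used = stats["dynamic_used_memory"]
    dyn_max = stats["max_dynamic_memory"]

    # Assumption: "dynamic_used_memory is large" is judged against max_dynamic_memory.
    dynamic_high = dyn_max > 0 and dyn_used > 0 and dyn_used / dyn_max >= high_ratio
    if dynamic_high:
        if stats["dynamic_used_shrctx"] / dyn_used >= close_ratio:
            findings.append("global_contexts")
        else:
            findings.append("thread_and_session_contexts")

    # Assumption: "other_used_memory is large" is judged against process_used_memory.
    proc_used = stats["process_used_memory"]
    if proc_used > 0 and stats["other_used_memory"] / proc_used >= other_ratio:
        findings.append("fragment_cache")

    # "Only shared_used_memory is large": reported only if nothing else was flagged.
    shared_max = stats["max_shared_memory"]
    shared_high = shared_max > 0 and stats["shared_used_memory"] / shared_max >= high_ratio
    if shared_high and not findings:
        findings.append("shared_memory_only")

    return findings


# Values copied from the two pv_total_memory_detail outputs in this section.
first_snapshot = {
    "process_used_memory": 14567, "max_dynamic_memory": 34012,
    "dynamic_used_memory": 1851, "dynamic_used_shrctx": 394,
    "max_shared_memory": 46747, "shared_used_memory": 11618,
    "other_used_memory": 1013,
}
second_snapshot = dict(first_snapshot, process_used_memory=24567, other_used_memory=11013)

print(classify_memory_profile(first_snapshot))
print(classify_memory_profile(second_snapshot))
```

With these assumed cut-offs, only the second snapshot is flagged, and only for cached fragments. The cut-offs would need to be tuned for a real deployment.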
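a. High usage by global memory contexts
Live environment available
Run the following query to identify which memory context uses the most memory.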
gaussdb=# select contextname, sum(totalsize)/1024/1024 sum, sum(freesize)/1024/1024, count(*) count from pg_shared_memory_detail group by contextname order by sum desc limit 10;
            contextname            |         sum          |       ?column?        | count
-----------------------------------+----------------------+-----------------------+-------
 IncreCheckPointContext            | 250.8796234130859375 | .00273132324218750000 |     1
 AshContext                        |  64.0950317382812500 | .00772094726562500000 |     1
 DefaultTopMemoryContext           |  60.5699005126953125 |    1.0594177246093750 |     1
 StorageTopMemoryContext           |  16.7601776123046875 | .04942321777343750000 |     1
 GlobalAuditMemory                 |  16.0081176757812500 | .00769042968750000000 |     1
 CBBTopMemoryContext               |  14.9503479003906250 | .04009246826171875000 |     1
 Undo                              |   8.6680450439453125 | .20516967773437500000 |     1
 DoubleWriteContext                |   6.5549163818359375 | .02331542968750000000 |     1
 ThreadPoolContext                 |   5.3873443603515625 | .00525665283203125000 |     1
 GlobalSysDBCacheEntryMemCxt_16384 |   4.3115692138671875 |    1.0470581054687500 |    16
(10 rows)
After identifying the memory context (IncreCheckPointContext in this example), call gs_get_shared_memctx_detail to locate the code position where memory is accumulating.
gaussdb=# select * from gs_get_shared_memctx_detail('IncreCheckPointContext');
          file           | line |   size
-------------------------+------+-----------
 ipci.cpp                |  476 |        64
 pagewriter.cpp          |  298 |      1024
 ipci.cpp                |  498 |      4096
 pagewriter.cpp          |  322 |  19632000
 pagewriter.cpp          |  317 |  33669120
 storage_buffer_init.cpp |   90 | 209756160
(6 rows)
The query result shows that line 90 of storage_buffer_init.cpp allocated a large amount of memory, so memory may be accumulating there without being released.
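When the detail output is long, picking out the largest allocation site by eye is error-prone. The following sketch parses saved psql text output with file, line, and size columns and ranks the sites by size. The helper names are hypothetical, and the sample is the output shown above.

```python
SAMPLE_DETAIL = """
          file           | line |   size
-------------------------+------+-----------
 ipci.cpp                |  476 |        64
 pagewriter.cpp          |  298 |      1024
 ipci.cpp                |  498 |      4096
 pagewriter.cpp          |  322 |  19632000
 pagewriter.cpp          |  317 |  33669120
 storage_buffer_init.cpp |   90 | 209756160
(6 rows)
"""


def parse_memctx_detail(psql_output):
    """Extract (file, line, size) tuples from psql-formatted text."""
    sites = []
    for raw in psql_output.splitlines():
        parts = [part.strip() for part in raw.split("|")]
        if len(parts) != 3:
            continue  # separator line, row-count footer, or blank line
        file_name, line_no, size = parts
        if not (line_no.isdigit() and size.isdigit()):
            continue  # header row
        sites.append((file_name, int(line_no), int(size)))
    return sites


def top_allocation_sites(psql_output, n=3):
    """Return the n largest allocation sites, biggest first."""
    sites = parse_memctx_detail(psql_output)
    return sorted(sites, key=lambda site: site[2], reverse=True)[:n]


for file_name, line_no, size in top_allocation_sites(SAMPLE_DETAIL):
    print(f"{file_name}:{line_no} {size / 1024 / 1024:.1f} MB")
```

The same parser applies to the output of gs_get_thread_memctx_detail and gs_get_session_memctx_detail, since they return the same three columns in the examples in this section.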
Live environment unavailable
Live environment available
gaussdb=# select contextname, sum(totalsize)/1024/1024 sum, sum(freesize)/1024/1024, count(*) count from pv_thread_memory_context group by contextname order by sum desc limit 10;
           contextname           |         sum          |       ?column?        | count
---------------------------------+----------------------+-----------------------+-------
 LocalSysCacheShareMemoryContext | 641.0926513671875000 |   60.0820159912109375 |   543
 StorageTopMemoryContext         | 311.8157348632812500 |    3.1896591186523438 |   543
 LocalSysCacheMyDBMemoryContext  | 175.0625000000000000 |   65.0446166992187500 |   543
 DefaultTopMemoryContext         | 168.5756530761718750 |   10.7153015136718750 |   543
 ThreadTopMemoryContext          | 161.9752502441406250 |    4.1196441650390625 |   543
 CBBTopMemoryContext             | 109.1161880493164063 |    6.7845993041992188 |   543
 LocalSysCacheTopMemoryContext   |  93.4109802246093750 |   13.2236938476562500 |   543
 Timezones                       |  43.2421417236328125 |    1.4333953857421875 |   543
 gs_signal                       |  32.2394561767578125 |    4.9155120849609375 |     1
 Type information cache          |  23.8869018554687500 | .90544128417968750000 |   343
(10 rows)
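After identifying the memory context (StorageTopMemoryContext in this example), call gs_get_thread_memctx_detail to locate the code position where memory is accumulating. The first argument is the thread ID, which can be obtained from the view gs_thread_memory_context.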
gaussdb=# select * from gs_get_thread_memctx_detail(140639273547520,'StorageTopMemoryContext');
     file     | line |  size
--------------+------+--------
 syncrep.cpp  | 1608 |     32
 elog.cpp     | 2008 |     16
 fd.cpp       | 2734 |    128
 syncrep.cpp  | 1568 |     32
 deadlock.cpp |  175 |    512
 deadlock.cpp |  169 | 342656
 deadlock.cpp |  157 |  85664
 deadlock.cpp |  146 |  21416
 deadlock.cpp |  144 |  32112
 deadlock.cpp |  136 |  10712
 deadlock.cpp |  135 |  10712
 deadlock.cpp |  128 |  85664
 deadlock.cpp |  126 |  21416
(13 rows)
Live environment unavailable
Live environment available
gaussdb=# select contextname, sum(totalsize)/1024/1024 sum, sum(freesize)/1024/1024, count(*) count from pv_session_memory_context group by contextname order by sum desc limit 10;
        contextname         |         sum          |       ?column?        | count
----------------------------+----------------------+-----------------------+-------
 CachedPlan                 | 226.1093750000000000 |   67.1747817993164063 | 12450
 CachedPlanQuery            | 134.8027343750000000 |   41.8541030883789063 | 12612
 SessionTopMemoryContext    | 132.1605682373046875 |   26.1002349853515625 |   301
 CachedPlanSource           |  98.7617187500000000 |   28.4135513305664063 | 12912
 CBBTopMemoryContext        |  60.4861373901367188 |    3.0370101928710938 |   301
 Timezones                  |  23.9703216552734375 | .79457092285156250000 |   301
 SPI Plan                   |  21.1307907104492188 |    6.8435440063476563 |  2412
 GenericRoot                |  19.9628906250000000 |    7.7032165527343750 |   374
 Prepared Queries           |  16.4062500000000000 |    7.5508117675781250 |   300
 unnamed prepared statement |  14.3437500000000000 |    6.6462554931640625 |   300
(10 rows)
After identifying the memory context (CachedPlanQuery in this example), call gs_get_session_memctx_detail to locate the code position where memory is accumulating.
gaussdb=# select * from gs_get_session_memctx_detail('CachedPlanQuery');
     file      | line |  size
---------------+------+---------
 copyfuncs.cpp | 2607 | 5031680
 copyfuncs.cpp | 7013 | 4176736
 copyfuncs.cpp | 7016 | 2088368
 copyfuncs.cpp | 5062 | 6918144
 copyfuncs.cpp | 3461 |  403552
 copyfuncs.cpp | 3397 | 2727104
 copyfuncs.cpp | 3401 |  487368
 datum.cpp     |  150 |    2048
 copyfuncs.cpp | 2572 | 1113728
 copyfuncs.cpp | 6204 |      32
 copyfuncs.cpp | 6206 |      32
 copyfuncs.cpp | 7021 | 4267200
 copyfuncs.cpp | 7037 | 2832000
 copyfuncs.cpp | 7048 | 2066400
 bitmapset.cpp |   94 |  134400
 copyfuncs.cpp | 3430 |   96000
 copyfuncs.cpp | 2847 | 2150400
 copyfuncs.cpp | 2551 | 5126400
 copyfuncs.cpp | 3984 |  105600
 list.cpp      |  105 |  254400
 list.cpp      |  108 |  796800
 copyfuncs.cpp | 3835 | 7065600
 copyfuncs.cpp | 2451 | 1056000
 copyfuncs.cpp | 2453 |  244800
 copyfuncs.cpp | 3840 |  230400
 copyfuncs.cpp | 2895 | 1113600
 copyfuncs.cpp | 3442 |   38400
 copyfuncs.cpp | 2645 |  115200
 list.cpp      |  166 |   19200
 namespace.cpp | 3853 |  144000
 list.cpp      | 1460 |  288000
 copyfuncs.cpp | 2910 |   38400
 copyfuncs.cpp | 2762 | 1075200
 copyfuncs.cpp | 3953 |   67200
 copyfuncs.cpp | 3000 |   96000
 copyfuncs.cpp | 5876 |   28800
 copyfuncs.cpp | 2619 |    2400
(37 rows)
The query result shows that line 3835 of copyfuncs.cpp allocated a large amount of memory, so memory may be accumulating there without being released.
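When many lines of one source file each hold a moderate amount of memory, as in the session output above, summing the sizes per file can show whether a single file dominates. The sketch below does this for a handful of rows copied from that output. The function name is hypothetical, and the sample is deliberately a subset rather than the full 37 rows.

```python
from collections import defaultdict

# A subset of the (file, line, size) rows from the output above.
SAMPLE_ROWS = [
    ("copyfuncs.cpp", 3835, 7065600),
    ("copyfuncs.cpp", 5062, 6918144),
    ("list.cpp", 108, 796800),
    ("list.cpp", 1460, 288000),
    ("datum.cpp", 150, 2048),
]


def size_by_file(rows):
    """Sum allocation sizes per source file, largest total first."""
    totals = defaultdict(int)
    for file_name, _line_no, size in rows:
        totals[file_name] += size
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for file_name, total in size_by_file(SAMPLE_ROWS):
    print(file_name, total)
```

The per-file total complements the per-line ranking: the line ranking points at one allocation site, while the file total indicates which module to examine first.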
Live environment unavailable
Too many memory fragments causing an oversized memory cache
gaussdb=# select * from pv_total_memory_detail;
    nodename    |       memorytype        | memorymbytes
----------------+-------------------------+--------------
 coordinator1   | max_process_memory      |        81920
 coordinator1   | process_used_memory     |        24567
 coordinator1   | max_dynamic_memory      |        34012
 coordinator1   | dynamic_used_memory     |         1851
 coordinator1   | dynamic_peak_memory     |         3639
 coordinator1   | dynamic_used_shrctx     |          394
 coordinator1   | dynamic_peak_shrctx     |          399
 coordinator1   | max_backend_memory      |          648
 coordinator1   | backend_used_memory     |            1
 coordinator1   | max_shared_memory       |        46747
 coordinator1   | shared_used_memory      |        11618
 coordinator1   | max_cstore_memory       |          512
 coordinator1   | cstore_used_memory      |            0
 coordinator1   | max_sctpcomm_memory     |            0
 coordinator1   | sctpcomm_used_memory    |            0
 coordinator1   | sctpcomm_peak_memory    |            0
 coordinator1   | other_used_memory       |        11013
 coordinator1   | gpu_max_dynamic_memory  |            0
 coordinator1   | gpu_dynamic_used_memory |            0
 coordinator1   | gpu_dynamic_peak_memory |            0
 coordinator1   | pooler_conn_memory      |            0
 coordinator1   | pooler_freeconn_memory  |            0
 coordinator1   | storage_compress_memory |            0
 coordinator1   | udf_reserved_memory     |            0
(24 rows)
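Memory not released in time for other reasons
Note: A large other_used_memory value is not always caused by memory fragmentation. It can also have the following causes:
1) Service code allocates memory directly through the malloc interface instead of through a memory context, and that memory has accumulated.
2) Third-party open-source software does not release memory in time.
In either case, contact Huawei engineers for assistance.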
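3. Solutions
Memory exhausted due to memory accumulation
Solution: If accumulated memory is not released for a long time, perform a primary/standby switchover to reduce memory usage.
Memory exhausted due to service workload
Solution: Modify the client jobs to reduce concurrency, or modify the SQL statements so that they do not consume large amounts of memory during execution. Contact Huawei engineers for a detailed solution.
Memory exhausted due to an oversized other memory cache
Solution 1: If the service scenario causes the other memory cache to grow too large, adjust the parameters related to execution plans or adjust the services on the client side. The exact changes depend on the service scenario. Contact Huawei engineers for a detailed solution.
Solution 2: If accumulated memory is not released for a long time and memory usage cannot be reduced by adjusting services, perform a primary/standby switchover to reduce memory usage.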




