Huawei GaussDB A: How to Use Multi-Dimensional Feature Retrieval

墨天轮 2019-10-12
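How to Use Multi-Dimensional Feature Retrieval

For multi-dimensional feature retrieval to respond within seconds, the hardware and software configuration must meet the requirements, and the business table and its association with the internal retrieval tables must be set up appropriately.

Procedure

  • Create the business table, which stores user data together with its feature vectors.

    Choose the business table name and structure to suit the actual business scenario.

    To make feature retrieval queries efficient, the business table must be created as a column-store table. A cstore B-tree index must also be created on the feature image ID so that, during short-feature queries, the internal tables can look rows up quickly by ID through the B-tree. For the table creation syntax, see CREATE TABLE.

    An example follows: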

    CREATE TABLE face_data (
    id bigint DEFAULT (0)::bigint NOT NULL,
    accessories integer NOT NULL,
    race integer NOT NULL,
    age integer NOT NULL,
    face_feature bytea,
    from_image_id bigint,
    from_person_id bigint,
    from_video_id bigint,
    gender integer NOT NULL,
    image_data character varying(255) DEFAULT NULL::character varying,
    indexed integer DEFAULT 0 NOT NULL,
    source_id bigint NOT NULL,
    source_type integer NOT NULL,
    time timestamp(6) without time zone NOT NULL,
    version integer NOT NULL,
    json character varying(255) DEFAULT NULL::character varying,
    sequence bigint NOT NULL,
    quality integer DEFAULT 0 NOT NULL
    )
    WITH (orientation=column)
    DISTRIBUTE BY HASH (id);
    create index face_data_idx_id on face_data using btree(id);
    

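  • Call build_vector_config_env to associate the business table with the internal feature retrieval tables.

    The call generates three internal tables (an unencoded table, an encoded table, and a model table) plus one association table. The three internal tables are named *_uncode, *_code, and *_model, where * stands for the business table name. The association table is named gpu_vector_info.

    An example follows: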

    select * from build_vector_config_env('face_data', 'id' ,'face_feature', 1024, 256, 'algorithm_cpu_pq_model');
    build_vector_config_env 
    -------------------------
    
    (1 row)
    
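    • "face_data" is the business table name.
    • "id" is the primary key column of the business table.
    • "face_feature" is the long-feature column of the business table.
    • "algorithm_cpu_pq_model" is the algorithm used when creating the association. Run select * from pgxc_get_searchlet_algorithm_info(); to list the available algorithms. The "algorithm_lopq_model" algorithm triggers training only once the training data reaches 512000 rows.

    Running the example above generates four tables: face_data_uncode, face_data_code, face_data_model, and gpu_vector_info.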
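
The naming rule above can be sketched as a small helper, for example to check after the call that the expected tables exist. This is only an illustration of the rule stated in the text; the function name internal_table_names is hypothetical and not part of GaussDB.

```python
def internal_table_names(business_table: str) -> dict:
    """Return the table names expected after build_vector_config_env runs."""
    return {
        "uncode": f"{business_table}_uncode",  # unencoded table
        "code": f"{business_table}_code",      # encoded table
        "model": f"{business_table}_model",    # model table
        "association": "gpu_vector_info",      # association table (fixed name)
    }


names = internal_table_names("face_data")
print(names["uncode"], names["code"], names["model"], names["association"])
```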

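  • Import data into the business table or the uncode table using GDS or COPY. For the method, see Importing Data.

    Figure 1 Relationships among the image retrieval tables
    • Users can import data into the business table and then use custom SQL, written to suit the business need, to load that data into the uncode table. The uncode table has two more columns than the business table: "model_id integer" is the model number, and "short_vector bytea" indicates whether it is a short feature.
    • Users can also import data directly into the uncode table.

    Once the data volume in the uncode table reaches the threshold, the AI feature training platform provided by GaussDB 200 reads the long-feature data from the uncode table in real time, trains short features and a model, and saves them to the code table. The timing of feature training can therefore be controlled by controlling when data is written to the uncode table, for example by writing at night when business load is low.

    Note that the imported data volume must meet both of the following requirements before the AI feature training platform will train short features:
    • It is greater than or equal to the value of the "trainThreshold" parameter in the configuration file "${BIGDATA_HOME}/FusionInsight_MPPDB_6.5.1/install/FusionInsight-MPPDB-6.5.1/simSearch/TrainServer/config/searchletConfig.yaml".
    • It is greater than or equal to the minimum data volume required by the algorithm chosen when creating the association. If the association was created with the "algorithm_lopq_model" algorithm, training is triggered only once the data reaches 512000 rows. The other algorithms have no data volume requirement.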
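
The two conditions can be expressed as a short check, which may help when planning how much data to load before expecting training to start. This is a sketch of the rule as described in the text, not the training platform's actual logic; the function name and the threshold value 100000 used in the example are hypothetical, and the real threshold must be read from searchletConfig.yaml.

```python
LOPQ_MIN_ROWS = 512000  # minimum row count stated for algorithm_lopq_model


def can_trigger_training(row_count: int, train_threshold: int, algorithm: str) -> bool:
    """Return True when both documented conditions for training are met."""
    # Only algorithm_lopq_model has a minimum data volume; others have none.
    algorithm_min = LOPQ_MIN_ROWS if algorithm == "algorithm_lopq_model" else 0
    return row_count >= train_threshold and row_count >= algorithm_min


# 100000 is a made-up trainThreshold value used only for illustration.
print(can_trigger_training(300000, 100000, "algorithm_cpu_pq_model"))
print(can_trigger_training(300000, 100000, "algorithm_lopq_model"))
```
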
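    Note:

    Run select * from pgxc_get_searchlet_vector_status() where table_name ='test_table1'; to check whether the data has been loaded into searchlet. In the last column, 2 means normal, 1 means the data has been uploaded to the database but not loaded into the searchlet module, and 0 means no data has been uploaded.

    If the training platform fails to train the data, alarm ALM-37027 is raised on the FusionInsight Manager page.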
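
For scripts that poll this status, the three codes can be mapped to readable messages. The sketch below only restates the meanings given above; the names SEARCHLET_STATUS and describe_status are hypothetical, and any code other than 0, 1, or 2 is reported as unknown because the text does not define it.

```python
SEARCHLET_STATUS = {
    2: "normal: data is loaded into searchlet",
    1: "data uploaded to the database but not loaded into searchlet",
    0: "no data uploaded",
}


def describe_status(code: int) -> str:
    """Translate the last column of pgxc_get_searchlet_vector_status()."""
    return SEARCHLET_STATUS.get(code, f"unknown status code {code}")


print(describe_status(1))
```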

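  • Use the short-feature and long-feature query interfaces to run feature retrieval.

    An example of a short-feature query without a filter condition: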

    select id,distance from short_feature_search('face_data', '\x000064420000d24200005d4300000f430000aa4200008d4300004c42000000410000f4420000404300000c420000a2420000004000003743000019430000924300004c4300000c430000e84200005a430080884300004c430000f24200808e4300001041000024430000cc4200004a430000b4420080944300004c4300009343000070420000714300007f4300006d430000f042000010410000d8420000464300000000000078420000474300808d43000089430000b842000044420000de4200008c430000a2420000be42000055430000e642000080420000ca420000804300000b4300808e4300004a4300009242000065430000864300005c4300003a430080824300009841000040400000744300006b4300007243000018420000344200007e430000104300002e4300000b430000a842008084430000904200005843008081430000834300006c430000a841000040400000ae42000088430000744300007f43000078430080924300005d4300000d4300001042000000400000ee4200008b4300007743000073430000614300007b4300002843000038430000c84100006f4300005843008082430000594300000b430000b6420000754300002c42000077430000c040008093430000a0420000da4200000c430000aa4200001f430000a04100000143000070420000204300002a430000044300001743000037430000b6420000803f00808f430000a44200005c4300002b43000095430000384200001f430000e4420000d64200002c430000b84100005041000017430000e04100002e4300000543000071430000da4200000443000042430000e4420000884100009e42000018420000f442000067430000f8420000494300007e43000078420000804300007a430000734300000000000009430000584300007c4300008943000087430000164300003041000086430000bc420000454300000c4300009343000091430000f04100008843000011430000124300002d4300808a4300003143000004430080934300000c4300008f430000b041000090420000764300008f43000022430000c64200003c430000b0410000a0410000904100001f43000043430000d64200003e43000064430080844300006c420000ee42000018420000e040000050420000504100006443000038430000cc420000384300007b4300002b430000c24200002c420000d4420000a0400000844300007041000075430000084200001943000056430000934300009c4200005543000082430000644200808643000088430000704300005b430000e040000030420000154300008e43000086430000a24200000e4
3000010420000404200005943000078420000c242000055430000304100001343000014430000c64200004543000088420000444300003442',1,1) as (id bigint,accessories integer,race integer,age integer,face_feature bytea,from_image_id bigint,from_person_id bigint,from_video_id bigint,gender integer,image_data character varying(255),indexed integer,source_id bigint,source_type integer,time timestamp(6) without time zone,version integer,json character varying(255),sequence bigint,quality integer,target_id bigint,distance float4) order by distance asc limit 10;

    An example of a short-feature query with a filter condition:

    select id,distance from short_feature_search('face_data', '\x000064420000d24200005d4300000f430000aa4200008d4300004c42000000410000f4420000404300000c420000a2420000004000003743000019430000924300004c4300000c430000e84200005a430080884300004c430000f24200808e4300001041000024430000cc4200004a430000b4420080944300004c4300009343000070420000714300007f4300006d430000f042000010410000d8420000464300000000000078420000474300808d43000089430000b842000044420000de4200008c430000a2420000be42000055430000e642000080420000ca420000804300000b4300808e4300004a4300009242000065430000864300005c4300003a430080824300009841000040400000744300006b4300007243000018420000344200007e430000104300002e4300000b430000a842008084430000904200005843008081430000834300006c430000a841000040400000ae42000088430000744300007f43000078430080924300005d4300000d4300001042000000400000ee4200008b4300007743000073430000614300007b4300002843000038430000c84100006f4300005843008082430000594300000b430000b6420000754300002c42000077430000c040008093430000a0420000da4200000c430000aa4200001f430000a04100000143000070420000204300002a430000044300001743000037430000b6420000803f00808f430000a44200005c4300002b43000095430000384200001f430000e4420000d64200002c430000b84100005041000017430000e04100002e4300000543000071430000da4200000443000042430000e4420000884100009e42000018420000f442000067430000f8420000494300007e43000078420000804300007a430000734300000000000009430000584300007c4300008943000087430000164300003041000086430000bc420000454300000c4300009343000091430000f04100008843000011430000124300002d4300808a4300003143000004430080934300000c4300008f430000b041000090420000764300008f43000022430000c64200003c430000b0410000a0410000904100001f43000043430000d64200003e43000064430080844300006c420000ee42000018420000e040000050420000504100006443000038430000cc420000384300007b4300002b430000c24200002c420000d4420000a0400000844300007041000075430000084200001943000056430000934300009c4200005543000082430000644200808643000088430000704300005b430000e040000030420000154300008e43000086430000a24200000e4
3000010420000404200005943000078420000c242000055430000304100001343000014430000c64200004543000088420000444300003442',1,1, 'time<''2018-01-05 00:00:00''') as (id bigint,accessories integer,race integer,age integer,face_feature bytea,from_image_id bigint,from_person_id bigint,from_video_id bigint,gender integer,image_data character varying(255),indexed integer,source_id bigint,source_type integer,time timestamp(6) without time zone,version integer,json character varying(255),sequence bigint,quality integer,target_id bigint,distance float4) order by distance asc limit 10;

    An example of a long-feature query:

    select id, time, long_feature_compare('face_data','\x966b86bd7890273d3f660a3dc8dc033e69b8e03d01282a3e0366113eb3b6f03de374273df9b40dbee2f9733c959ad8bdf9c5babdd22ca1bd5b5cbabdb4e79abd35fefd3a604975bd8aabb0bc0ba66fbd7f2d113e0ce0393e0e00a8bc00c24a3dff8d853cd3b0953d12ff1e3e06a219be9ca7753d9162b2bd5dd3cebd1a84b0bd6d739d3d286953bdbe1363bd06692d3ef29f903c6722113d7e621fbb179d6dbd8bed333eabc04ebd2f06ddbd928a11bd744a7f3d61b7babc4ab857bc7acf6bbb0cb1053d4b9e123e3e79313eff40a8bd636647bd41946bbba3c404bd128c5fbd8a8d6a3d692d173df502db3d0f3008be8cdb883ddb0f323de5946cbd2a7fa3bddd8568bd12e6e5bbd07c383b7f48203c681d0b3a87d2d1bccd7c333beea089bcab03103ef20df5bd9f0b373c6f46d4bd01a8b43d6118453b3e3e143c641ef43c375309be9461193dd580473dacb73dbeef4fafbddc43e5bae79f1abd5a1182bdfe8ab23d5d2209bb13af76bd2eb0a2bde2c25dbd38d4b0bdef53caba60674ebd7850a5bd2e5bd9bb1320a93d216af03cf51ac7bdc1c94bbe86020ebed484333ea6e71f3ec946773d1d81fa3c526b70bed205d03cf138cbba30ffb43de239783c7e01773daf68bfbd9709683de39adabd267e2e3ca35d11bd0555a4bdd4d4b33d75a228bdac78e8bd8d1f87bd836046bd57efc43d5eafcdbd114146bec735543d000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000', face_feature) as distance from face_data where time between to_date('2018-01-02 00:00:00', 'yyyy-mm-dd hh24:mi:ss') and to_date('2018-01-03 00:00:00', 'yyyy-mm-dd hh24:mi:ss') and distance < 0.5 order by distance asc;
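
The feature arguments in the queries above are bytea hex literals. Judging only from the sample data, they appear to be sequences of little-endian 32-bit floats; the text does not state the encoding, so treat that as an assumption and confirm it against how your feature extractor serializes vectors. Under that assumption, a client could build and inspect such literals as follows (both helper names are hypothetical):

```python
import struct


def to_bytea_hex(vector):
    """Pack floats as little-endian float32 and return a backslash-x hex string."""
    return "\\x" + struct.pack(f"<{len(vector)}f", *vector).hex()


def from_bytea_hex(literal):
    """Decode a backslash-x hex string back into a list of float32 values."""
    hex_part = literal[2:] if literal.startswith("\\x") else literal
    raw = bytes.fromhex(hex_part)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


# First 16 bytes of the short-feature literal used in the queries above.
sample = "\\x000064420000d24200005d4300000f43"
print(from_bytea_hex(sample))  # [57.0, 105.0, 221.0, 143.0]
```

The resulting string is what goes between the single quotes of the feature argument; passing it as a bound parameter rather than concatenating it into the SQL text is usually the safer choice.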

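Troubleshooting

  • Check the alarms.
  • Check the logs.

    Check the trainserver_log_path setting in the configuration file "${BIGDATA_HOME}/FusionInsight_MPPDB_6.5.1/install/FusionInsight-MPPDB-6.5.1/simSearch/TrainServer/config/trainServer.ini" to determine the log path.

    Default log path: /var/log/Bigdata/mpp/simSearch

    Go to the log path and inspect the log "train-server.log" to determine the cause of the error.

  • After the cluster is initialized successfully, the dynamic libraries required for feature retrieval fail to load

    Once the cluster is running normally, connect to a CN node with gsql and run the following query to check whether the feature retrieval dynamic libraries loaded correctly. If lib_load_status is 't' for every DN, the libraries loaded correctly; otherwise, loading failed. lib_load_fail_reason identifies the reason for the failure. For specific error messages, contact Huawei engineers for further analysis.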

    cpu1=#select * from pgxc_get_simsearch_lib_load_status();
      node_name   | lib_load_status | lib_load_fail_reason
    --------------+-----------------+----------------------
     dn_6051_6052 | t               |
     dn_6053_6054 | t               |
     dn_6055_6056 | t               |
     dn_6057_6058 | t               |
     dn_6039_6040 | t               |
     dn_6059_6060 | t               |
     dn_6037_6038 | t               |
     dn_6041_6042 | t               |
     dn_6043_6044 | t               |
     dn_6045_6046 | t               |
     dn_6047_6048 | t               |
     dn_6061_6062 | t               |
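
On a cluster with many DNs, it can be easier to filter this output than to read it row by row. The following sketch parses captured gsql text output and lists the nodes whose lib_load_status is not 't'. The function name is hypothetical, and the failing row in the sample is invented purely to exercise the parser.

```python
def failed_nodes(psql_output: str):
    """Return (node_name, reason) pairs whose lib_load_status is not 't'."""
    failures = []
    for line in psql_output.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Skip blank lines, the header, and the separator (which uses '+').
        if len(parts) != 3 or parts[0] in ("", "node_name"):
            continue
        node, status, reason = parts
        if status != "t":
            failures.append((node, reason))
    return failures


sample_output = """
  node_name   | lib_load_status | lib_load_fail_reason
--------------+-----------------+----------------------
 dn_6051_6052 | t               |
 dn_6053_6054 | f               | hypothetical reason text
"""
print(failed_nodes(sample_output))
```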

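  • A scheduled job is automatically disabled after 16 failed executions, and data is not automatically loaded into the GPU after a cluster restart or primary/standby switchover

    After the user configures the business table and calls build_vector_config_env to create the feature retrieval configuration environment, if the cluster has been restarted or a primary/standby switchover has occurred, use the following method to check whether the initial scheduled import job for feature retrieval is working. If it is not, manual intervention is required.

    • Switch to the schema that contains the business table, then run select job_id from gpu_vector_info where service_table_name='t1'; to get the job_id. Here t1 is the user's feature retrieval business table, and job_id is the ID of the scheduled job that imports feature data into the GPU.
    • Run select job_status from pg_job where job_id=1; to query the status of the scheduled job, replacing 1 with the job_id obtained in the previous step.
    • If the status is 'd', the scheduled job that automatically imports feature vectors into the GPU is in an abnormal state. Look for error messages in the CN log files and analyze the cause; Huawei engineers can be contacted to help with this analysis.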
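
The first two steps can be combined into a single lookup. The sketch below builds one SQL statement that joins gpu_vector_info to pg_job using only the columns named above. It assumes the two job_id columns have compatible types and that the statement is run in the schema containing gpu_vector_info; the helper name job_status_query is hypothetical.

```python
def job_status_query(business_table: str) -> str:
    """Build a query returning the import job's id and status for a table."""
    table = business_table.replace("'", "''")  # escape quotes for the SQL literal
    return (
        "select j.job_id, j.job_status "
        "from gpu_vector_info g join pg_job j on j.job_id = g.job_id "
        f"where g.service_table_name = '{table}';"
    )


print(job_status_query("face_data"))
```

A job_status of 'd' in the result corresponds to the abnormal state described in the last step.
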
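  • Long-feature or short-feature queries report errors

    If a long-feature or short-feature query reports an error, the error message contains the returned error code. For internal system errors, users can contact Huawei engineers for analysis.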

