暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

PostgreSQL 化学分析 - RDKit Cartridge 1 - 环境部署

digoal 2019-11-25
1958

作者

digoal

日期

2019-11-25

标签

PostgreSQL , RDKit Cartridge , 化学分析


背景

化学信息学(Cheminformatics)是一个信息技术领域,它使用计算机和计算程序推进大量化学数据的收集,保存,分析,和管理。

https://rdkit.org/docs/Overview.html

rdkit 是化学信息学使用的开源工具集,包括:

  • Business-friendly BSD license
  • Core data structures and algorithms in C++
  • Python 3.x wrappers generated using Boost.Python
  • Java and C# wrappers generated with SWIG
  • 2D and 3D molecular operations
  • Descriptor generation for machine learning
  • Molecular database cartridge for PostgreSQL
  • Cheminformatics nodes for KNIME (distributed from the KNIME community site: https://www.knime.com/rdkit)

PostgreSQL在这个工具集中,承担分子数据库的角色,重要功能包括:

https://www.rdkit.org/docs/Cartridge.html#reference-guide

rdkit 数据库插件reference

新增了哪些东西?

新类型

mol : an rdkit molecule. Can be created from a SMILES via direct type conversion, for example: ‘c1ccccc1’::mol creates a molecule from the SMILES ‘c1ccccc1’ qmol : an rdkit molecule containing query features (i.e. constructed from SMARTS). Can be created from a SMARTS via direct type conversion, for example: ‘c1cccc[c,n]1’::qmol creates a query molecule from the SMARTS ‘c1cccc[c,n]1’ sfp : a sparse count vector fingerprint (SparseIntVect in C++ and Python) bfp : a bit vector fingerprint (ExplicitBitVect in C++ and Python)

包括这些类型支持的索引。

相似查询

```
% : operator used for similarity searches using Tanimoto similarity. Returns whether or not the Tanimoto similarity between two fingerprints (either two sfp or two bfp values) exceeds rdkit.tanimoto_threshold.

: operator used for similarity searches using Dice similarity. Returns whether or not the Dice similarity between two fingerprints (either two sfp or two bfp values) exceeds rdkit.dice_threshold.

<%> : used for Tanimoto KNN searches (to return ordered lists of neighbors).
<#> : used for Dice KNN searches (to return ordered lists of neighbors).
```

Substructure and exact structure search

@> : substructure search operator. Returns whether or not the mol or qmol on the right is a substructure of the mol on the left. <@ : substructure search operator. Returns whether or not the mol or qmol on the left is a substructure of the mol on the right. @= : returns whether or not two molecules are the same.

Molecule comparison

```
< : returns whether or not the left mol is less than the right mol

: returns whether or not the left mol is greater than the right mol
= : returns whether or not the left mol is equal to the right mol
<= : returns whether or not the left mol is less than or equal to the right mol
= : returns whether or not the left mol is greater than or equal to the right mol
```

指纹相关操作函数

Generating fingerprints

Working with fingerprints

Fingerprint I/O

分子相关操作函数

Molecule I/O and Validation

Substructure operations

Descriptors

Connectivity Descriptors

MCS

部署rdkit数据库环境 - 方法 1 使用anaconda部署集成好的环境

1、新增用户

```

useradd rdkit

su - rdkit

```

https://www.anaconda.com/distribution/#linux

2、下载并安装anaconda

```
wget https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh

bash ./Anaconda3-2019.10-Linux-x86_64.sh
```

3、设置环境变量

```
. /home/rdkit/.bashrc

(base) [rdkit@pg11-test ~]$ which conda
~/anaconda3/bin/conda
```

4、使用anaconda安装rdkit和继承了rdkit插件的postgresql

```
(base) [rdkit@pg11-test ~]$ conda install -c rdkit rdkit-postgresql

Collecting package metadata (current_repodata.json): done
Solving environment: done

Package Plan ##

environment location: /home/rdkit/anaconda3

added / updated specs:
- rdkit-postgresql

The following packages will be downloaded:

package                    |            build  
---------------------------|-----------------  
libboost-1.67.0            |       h46d08c1_4        13.0 MB  
libpq-11.2                 |       h20c2e04_0         2.0 MB  
postgresql-11.2            |       h20c2e04_0         3.7 MB  
rdkit-postgresql-2019.09.1.0|       h5fa6086_0         2.9 MB  rdkit  
------------------------------------------------------------  
                                       Total:        21.5 MB

The following NEW packages will be INSTALLED:

libboost pkgs/main/linux-64::libboost-1.67.0-h46d08c1_4
libpq pkgs/main/linux-64::libpq-11.2-h20c2e04_0
postgresql pkgs/main/linux-64::postgresql-11.2-h20c2e04_0
rdkit-postgresql rdkit/linux-64::rdkit-postgresql-2019.09.1.0-h5fa6086_0

Proceed ([y]/n)? y
```

5、安装好后,查询安装好的postgresql,安装的插件很少,后续要添加插件怎么办?后面会有例子

```
(base) [rdkit@pg11-test ~]$ which psql
~/anaconda3/bin/psql
(base) [rdkit@pg11-test ~]$ which initdb
~/anaconda3/bin/initdb
(base) [rdkit@pg11-test ~]$ which pgbench
~/anaconda3/bin/pgbench

cd /home/rdkit/anaconda3/share/extension

(base) [rdkit@pg11-test extension]$ ll
total 60
-rw-rw-r-- 2 rdkit rdkit 332 Mar 14 2019 plpgsql--1.0.sql
-rw-rw-r-- 2 rdkit rdkit 179 Mar 14 2019 plpgsql.control
-rw-rw-r-- 2 rdkit rdkit 381 Mar 14 2019 plpgsql--unpackaged--1.0.sql
-rw-rw-r-- 2 rdkit rdkit 42915 Oct 25 14:01 rdkit--3.8.sql
-rw-rw-r-- 2 rdkit rdkit 135 Oct 25 14:01 rdkit.control
```

6、anaconda编译的postgresql的编译参数如下

(base) [rdkit@pg11-test ~]$ pg_config BINDIR = /home/rdkit/anaconda3/bin DOCDIR = /home/rdkit/anaconda3/share/doc HTMLDIR = /home/rdkit/anaconda3/share/doc INCLUDEDIR = /home/rdkit/anaconda3/include PKGINCLUDEDIR = /home/rdkit/anaconda3/include INCLUDEDIR-SERVER = /home/rdkit/anaconda3/include/server LIBDIR = /home/rdkit/anaconda3/lib PKGLIBDIR = /home/rdkit/anaconda3/lib LOCALEDIR = /home/rdkit/anaconda3/share/locale MANDIR = /home/rdkit/anaconda3/share/man SHAREDIR = /home/rdkit/anaconda3/share SYSCONFDIR = /home/rdkit/anaconda3/etc PGXS = /home/rdkit/anaconda3/lib/pgxs/src/makefiles/pgxs.mk CONFIGURE = '--prefix=/home/rdkit/anaconda3' '--with-readline' '--with-libraries=/home/rdkit/anaconda3/lib' '--with-includes=/home/rdkit/anaconda3/include' '--with-openssl' '--with-gssapi' 'CC=/tmp/build/80754af9/postgresql-split_1552510884761/_build_env/bin/x86_64-conda_cos6-linux-gnu-cc' 'CFLAGS=-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -I/home/rdkit/anaconda3/include -fdebug-prefix-map=/tmp/build/80754af9/postgresql-split_1552510884761/work=/usr/local/src/conda/postgresql-split-11.2 -fdebug-prefix-map=/home/rdkit/anaconda3=/usr/local/src/conda-prefix' 'LDFLAGS=-Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,-rpath,/home/rdkit/anaconda3/lib -Wl,-rpath-link,/home/rdkit/anaconda3/lib -L/home/rdkit/anaconda3/lib' 'CPPFLAGS=-DNDEBUG -D_FORTIFY_SOURCE=2 -O2' 'CPP=/tmp/build/80754af9/postgresql-split_1552510884761/_build_env/bin/x86_64-conda_cos6-linux-gnu-cpp' 'PKG_CONFIG_PATH=/home/rdkit/anaconda3/lib/pkgconfig' CC = /tmp/build/80754af9/postgresql-split_1552510884761/_build_env/bin/x86_64-conda_cos6-linux-gnu-cc CPPFLAGS = -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -D_GNU_SOURCE -I/home/rdkit/anaconda3/include CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -I/home/rdkit/anaconda3/include -fdebug-prefix-map=/tmp/build/80754af9/postgresql-split_1552510884761/work=/usr/local/src/conda/postgresql-split-11.2 -fdebug-prefix-map=/home/rdkit/anaconda3=/usr/local/src/conda-prefix CFLAGS_SL = -fPIC LDFLAGS = -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,-rpath,/home/rdkit/anaconda3/lib -Wl,-rpath-link,/home/rdkit/anaconda3/lib -L/home/rdkit/anaconda3/lib -L/home/rdkit/anaconda3/lib -Wl,--as-needed -Wl,-rpath,'/home/rdkit/anaconda3/lib',--disable-new-dtags LDFLAGS_EX = LDFLAGS_SL = LIBS = -lpgcommon -lpgport -lpthread -lssl -lcrypto -lgssapi_krb5 -lz -lreadline -lrt -lcrypt -ldl -lm VERSION = PostgreSQL 11.2

7、准备数据库的数据目录

su - root (base) [root@pg11-test ~]# mkdir /data02/rdkit (base) [root@pg11-test ~]# chown rdkit:rdkit /data02/rdkit

8、设置数据库的环境变量

```
su - rdkit

vi .bash_profile

export PGHOME=/home/rdkit/anaconda3
export PGDATA=/data02/rdkit/pgdata
export LD_LIBRARY_PATH=$PGHOME/lib:$LD_LIBRARY_PATH
export PATH=$PGHOME/bin:$PATH
export PGDATABASE=postgres
export PGHOST=$PGDATA
export PGPORT=8800
export PGUSER=postgres
```

9、初始化数据库

initdb -D $PGDATA -E UTF8 --lc-collate=C --lc-ctype=en_US.utf8 -U postgres

10、配置数据库参数(512G内存机器配置)

```
vi $PGDATA/postgresql.conf

listen_addresses = '0.0.0.0'
port = 8800
max_connections = 1000
superuser_reserved_connections = 13
unix_socket_directories = '.'
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 10
shared_buffers = 32GB
maintenance_work_mem = 2GB
autovacuum_work_mem = -1
dynamic_shared_memory_type = posix
vacuum_cost_delay = 0
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 10.0
effective_io_concurrency = 0
max_worker_processes = 128
max_parallel_maintenance_workers = 4
max_parallel_workers_per_gather = 0
max_parallel_workers = 24
wal_level = replica
fsync = on
synchronous_commit = off
full_page_writes = on
wal_compression = on
wal_buffers = 16MB
wal_writer_delay = 10ms
checkpoint_timeout = 15min
max_wal_size = 96GB
min_wal_size = 16GB
checkpoint_completion_target = 0.3
archive_mode = on
archive_command = '/bin/date'
max_wal_senders = 32
max_replication_slots = 16
max_standby_archive_delay = 300s
max_standby_streaming_delay = 300s
wal_receiver_status_interval = 1s
random_page_cost = 1.1
effective_cache_size = 512GB
log_destination = 'csvlog'
logging_collector = on
log_filename = 'postgresql-%d.log'
log_truncate_on_rotation = on
log_checkpoints = on
log_error_verbosity = verbose
log_statement = 'ddl'
log_timezone = 'PRC'
autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_max_workers = 3
autovacuum_freeze_max_age = 1200000000
autovacuum_multixact_freeze_max_age = 1400000000
autovacuum_vacuum_cost_delay = 0ms
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'

vi $PGDATA/pg_hba.conf
host all all 0.0.0.0/0 md5
host replication all 0.0.0.0/0 md5
```

11、启动数据库,并安装插件。

```
pg_ctl start

psql
psql (11.2)
Type "help" for help.

postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+----------+----------+---------+------------+-----------------------
postgres | postgres | UTF8 | C | en_US.utf8 |
template0 | postgres | UTF8 | C | en_US.utf8 | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | C | en_US.utf8 | =c/postgres +
| | | | | postgres=CTc/postgres
(3 rows)

postgres=# create extension rdkit ;
CREATE EXTENSION
```

打包的插件太少了,加新的插件

例如,安装PG自带插件

https://www.postgresql.org/ftp/source/v11.6/

```
(base) [rdkit@pg11-test ~]$ wget https://ftp.postgresql.org/pub/source/v11.6/postgresql-11.6.tar.bz2

tar -jxvf postgresql-11.6.tar.bz2

cd
./configure --prefix=/home/rdkit/anaconda3
cd contrib
```

修改Makefile

```
vi Makefile

删除

vacuumlo
oid2name
pg_standby
```

安装

USE_PGXS=1 make USE_PGXS=1 make install

将安装后的lib和sql文件拷贝到正确目录

```
cd /home/rdkit/anaconda3

mv lib/postgresql/ lib/
mv share/postgresql/extension/
share/extension/
```

部署rdkit数据库环境 - 方法 2 源码安装

https://github.com/rdkit/rdkit/releases

```
wget https://github.com/rdkit/rdkit/archive/Release_2019_09_1.tar.gz
tar -zxvf Release_2019_09_1.tar.gz
cd rdkit-Release_2019_09_1

mkdir build
cd build

必须使用python3

export PATH
export RDBASE
export LD_LIBRARY_PATH

cmake .. -DPYTHON_EXECUTABLE=/usr/bin/python3
make
make install
```

cd rdkit-Release_2019_09_1/Code/PgSQL/rdkit USE_PGXS=1 make USE_PGXS=1 make install

使用手册

https://www.rdkit.org/docs/Cartridge.html#reference-guide

下载测试数据
http://downloads.emolecules.com/free/

ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest

参照文档建表,倒入数据,创建索引,进行查询,分析。

参考

https://www.anaconda.com/distribution/#linux

https://rdkit.readthedocs.io/en/latest/Install.html

https://www.rdkit.org/docs/Cartridge.html

https://github.com/rdkit/rdkit

Molecular database cartridge for PostgreSQL supporting substructure and similarity searches as well as many descriptor calculators

https://github.com/rdkit/rdkit/tree/master/Code/PgSQL/rdkit

https://github.com/rdkit/rdkit/blob/master/Docs/Book/Install.md

https://github.com/rdkit/rdkit/blob/master/Docs/Book/Cartridge.md

https://cactus.nci.nih.gov/presentations/meeting-08-2011/Fri_Aft_Greg_Landrum_RDKit-PostgreSQL.pdf

PostgreSQL 许愿链接

您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.

9.9元购买3个月阿里云RDS PostgreSQL实例

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

digoal's wechat

文章转载自digoal,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论