作者
digoal
日期
2019-11-25
标签
PostgreSQL , RDKit Cartridge , 化学分析
背景
化学信息学(Cheminformatics)是一个信息技术领域,它使用计算机和计算程序推进大量化学数据的收集,保存,分析,和管理。
https://rdkit.org/docs/Overview.html
rdkit 是化学信息学使用的开源工具集,包括:
- Business-friendly BSD license
- Core data structures and algorithms in C++
- Python 3.x wrappers generated using Boost.Python
- Java and C# wrappers generated with SWIG
- 2D and 3D molecular operations
- Descriptor generation for machine learning
- Molecular database cartridge for PostgreSQL
- Cheminformatics nodes for KNIME (distributed from the KNIME community site: https://www.knime.com/rdkit)
PostgreSQL在这个工具集中,承担分子数据库的角色,重要功能包括:
https://www.rdkit.org/docs/Cartridge.html#reference-guide
rdkit 数据库插件reference
新增了哪些东西?
新类型
mol : an rdkit molecule. Can be created from a SMILES via direct type conversion, for example: ‘c1ccccc1’::mol creates a molecule from the SMILES ‘c1ccccc1’
qmol : an rdkit molecule containing query features (i.e. constructed from SMARTS). Can be created from a SMARTS via direct type conversion, for example: ‘c1cccc[c,n]1’::qmol creates a query molecule from the SMARTS ‘c1cccc[c,n]1’
sfp : a sparse count vector fingerprint (SparseIntVect in C++ and Python)
bfp : a bit vector fingerprint (ExplicitBitVect in C++ and Python)
包括这些类型支持的索引。
相似查询
```
% : operator used for similarity searches using Tanimoto similarity. Returns whether or not the Tanimoto similarity between two fingerprints (either two sfp or two bfp values) exceeds rdkit.tanimoto_threshold.
: operator used for similarity searches using Dice similarity. Returns whether or not the Dice similarity between two fingerprints (either two sfp or two bfp values) exceeds rdkit.dice_threshold.
<%> : used for Tanimoto KNN searches (to return ordered lists of neighbors).
<#> : used for Dice KNN searches (to return ordered lists of neighbors).
```
Substructure and exact structure search
@> : substructure search operator. Returns whether or not the mol or qmol on the right is a substructure of the mol on the left.
<@ : substructure search operator. Returns whether or not the mol or qmol on the left is a substructure of the mol on the right.
@= : returns whether or not two molecules are the same.
Molecule comparison
```
< : returns whether or not the left mol is less than the right mol
: returns whether or not the left mol is greater than the right mol
= : returns whether or not the left mol is equal to the right mol
<= : returns whether or not the left mol is less than or equal to the right mol
= : returns whether or not the left mol is greater than or equal to the right mol
```
指纹相关操作函数
Generating fingerprints
Working with fingerprints
Fingerprint I/O
分子相关操作函数
Molecule I/O and Validation
Substructure operations
Descriptors
Connectivity Descriptors
MCS
部署rdkit数据库环境 - 方法 1 使用anaconda部署集成好的环境
1、新增用户
```
useradd rdkit
su - rdkit
```
https://www.anaconda.com/distribution/#linux
2、下载并安装anaconda
```
wget https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh
bash ./Anaconda3-2019.10-Linux-x86_64.sh
```
3、设置环境变量
```
. /home/rdkit/.bashrc
(base) [rdkit@pg11-test ~]$ which conda
~/anaconda3/bin/conda
```
4、使用anaconda安装rdkit和继承了rdkit插件的postgresql
```
(base) [rdkit@pg11-test ~]$ conda install -c rdkit rdkit-postgresql
Collecting package metadata (current_repodata.json): done
Solving environment: done
Package Plan ##
environment location: /home/rdkit/anaconda3
added / updated specs:
- rdkit-postgresql
The following packages will be downloaded:
package | build
---------------------------|-----------------
libboost-1.67.0 | h46d08c1_4 13.0 MB
libpq-11.2 | h20c2e04_0 2.0 MB
postgresql-11.2 | h20c2e04_0 3.7 MB
rdkit-postgresql-2019.09.1.0| h5fa6086_0 2.9 MB rdkit
------------------------------------------------------------
Total: 21.5 MB
The following NEW packages will be INSTALLED:
libboost pkgs/main/linux-64::libboost-1.67.0-h46d08c1_4
libpq pkgs/main/linux-64::libpq-11.2-h20c2e04_0
postgresql pkgs/main/linux-64::postgresql-11.2-h20c2e04_0
rdkit-postgresql rdkit/linux-64::rdkit-postgresql-2019.09.1.0-h5fa6086_0
Proceed ([y]/n)? y
```
5、安装好后,查询安装好的postgresql,安装的插件很少,后续要添加插件怎么办?后面会有例子
```
(base) [rdkit@pg11-test ~]$ which psql
~/anaconda3/bin/psql
(base) [rdkit@pg11-test ~]$ which initdb
~/anaconda3/bin/initdb
(base) [rdkit@pg11-test ~]$ which pgbench
~/anaconda3/bin/pgbench
cd /home/rdkit/anaconda3/share/extension
(base) [rdkit@pg11-test extension]$ ll
total 60
-rw-rw-r-- 2 rdkit rdkit 332 Mar 14 2019 plpgsql--1.0.sql
-rw-rw-r-- 2 rdkit rdkit 179 Mar 14 2019 plpgsql.control
-rw-rw-r-- 2 rdkit rdkit 381 Mar 14 2019 plpgsql--unpackaged--1.0.sql
-rw-rw-r-- 2 rdkit rdkit 42915 Oct 25 14:01 rdkit--3.8.sql
-rw-rw-r-- 2 rdkit rdkit 135 Oct 25 14:01 rdkit.control
```
6、anaconda编译的postgresql的编译参数如下
(base) [rdkit@pg11-test ~]$ pg_config
BINDIR = /home/rdkit/anaconda3/bin
DOCDIR = /home/rdkit/anaconda3/share/doc
HTMLDIR = /home/rdkit/anaconda3/share/doc
INCLUDEDIR = /home/rdkit/anaconda3/include
PKGINCLUDEDIR = /home/rdkit/anaconda3/include
INCLUDEDIR-SERVER = /home/rdkit/anaconda3/include/server
LIBDIR = /home/rdkit/anaconda3/lib
PKGLIBDIR = /home/rdkit/anaconda3/lib
LOCALEDIR = /home/rdkit/anaconda3/share/locale
MANDIR = /home/rdkit/anaconda3/share/man
SHAREDIR = /home/rdkit/anaconda3/share
SYSCONFDIR = /home/rdkit/anaconda3/etc
PGXS = /home/rdkit/anaconda3/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--prefix=/home/rdkit/anaconda3' '--with-readline' '--with-libraries=/home/rdkit/anaconda3/lib' '--with-includes=/home/rdkit/anaconda3/include' '--with-openssl' '--with-gssapi' 'CC=/tmp/build/80754af9/postgresql-split_1552510884761/_build_env/bin/x86_64-conda_cos6-linux-gnu-cc' 'CFLAGS=-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -I/home/rdkit/anaconda3/include -fdebug-prefix-map=/tmp/build/80754af9/postgresql-split_1552510884761/work=/usr/local/src/conda/postgresql-split-11.2 -fdebug-prefix-map=/home/rdkit/anaconda3=/usr/local/src/conda-prefix' 'LDFLAGS=-Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,-rpath,/home/rdkit/anaconda3/lib -Wl,-rpath-link,/home/rdkit/anaconda3/lib -L/home/rdkit/anaconda3/lib' 'CPPFLAGS=-DNDEBUG -D_FORTIFY_SOURCE=2 -O2' 'CPP=/tmp/build/80754af9/postgresql-split_1552510884761/_build_env/bin/x86_64-conda_cos6-linux-gnu-cpp' 'PKG_CONFIG_PATH=/home/rdkit/anaconda3/lib/pkgconfig'
CC = /tmp/build/80754af9/postgresql-split_1552510884761/_build_env/bin/x86_64-conda_cos6-linux-gnu-cc
CPPFLAGS = -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -D_GNU_SOURCE -I/home/rdkit/anaconda3/include
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -I/home/rdkit/anaconda3/include -fdebug-prefix-map=/tmp/build/80754af9/postgresql-split_1552510884761/work=/usr/local/src/conda/postgresql-split-11.2 -fdebug-prefix-map=/home/rdkit/anaconda3=/usr/local/src/conda-prefix
CFLAGS_SL = -fPIC
LDFLAGS = -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,-rpath,/home/rdkit/anaconda3/lib -Wl,-rpath-link,/home/rdkit/anaconda3/lib -L/home/rdkit/anaconda3/lib -L/home/rdkit/anaconda3/lib -Wl,--as-needed -Wl,-rpath,'/home/rdkit/anaconda3/lib',--disable-new-dtags
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgcommon -lpgport -lpthread -lssl -lcrypto -lgssapi_krb5 -lz -lreadline -lrt -lcrypt -ldl -lm
VERSION = PostgreSQL 11.2
7、准备数据库的数据目录
su - root
(base) [root@pg11-test ~]# mkdir /data02/rdkit
(base) [root@pg11-test ~]# chown rdkit:rdkit /data02/rdkit
8、设置数据库的环境变量
```
su - rdkit
vi .bash_profile
export PGHOME=/home/rdkit/anaconda3
export PGDATA=/data02/rdkit/pgdata
export LD_LIBRARY_PATH=$PGHOME/lib:$LD_LIBRARY_PATH
export PATH=$PGHOME/bin:$PATH
export PGDATABASE=postgres
export PGHOST=$PGDATA
export PGPORT=8800
export PGUSER=postgres
```
9、初始化数据库
initdb -D $PGDATA -E UTF8 --lc-collate=C --lc-ctype=en_US.utf8 -U postgres
10、配置数据库参数(512G内存机器配置)
```
vi $PGDATA/postgresql.conf
listen_addresses = '0.0.0.0'
port = 8800
max_connections = 1000
superuser_reserved_connections = 13
unix_socket_directories = '.'
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 10
shared_buffers = 32GB
maintenance_work_mem = 2GB
autovacuum_work_mem = -1
dynamic_shared_memory_type = posix
vacuum_cost_delay = 0
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 10.0
effective_io_concurrency = 0
max_worker_processes = 128
max_parallel_maintenance_workers = 4
max_parallel_workers_per_gather = 0
max_parallel_workers = 24
wal_level = replica
fsync = on
synchronous_commit = off
full_page_writes = on
wal_compression = on
wal_buffers = 16MB
wal_writer_delay = 10ms
checkpoint_timeout = 15min
max_wal_size = 96GB
min_wal_size = 16GB
checkpoint_completion_target = 0.3
archive_mode = on
archive_command = '/bin/date'
max_wal_senders = 32
max_replication_slots = 16
max_standby_archive_delay = 300s
max_standby_streaming_delay = 300s
wal_receiver_status_interval = 1s
random_page_cost = 1.1
effective_cache_size = 512GB
log_destination = 'csvlog'
logging_collector = on
log_filename = 'postgresql-%d.log'
log_truncate_on_rotation = on
log_checkpoints = on
log_error_verbosity = verbose
log_statement = 'ddl'
log_timezone = 'PRC'
autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_max_workers = 3
autovacuum_freeze_max_age = 1200000000
autovacuum_multixact_freeze_max_age = 1400000000
autovacuum_vacuum_cost_delay = 0ms
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'
vi $PGDATA/pg_hba.conf
host all all 0.0.0.0/0 md5
host replication all 0.0.0.0/0 md5
```
11、启动数据库,并安装插件。
```
pg_ctl start
psql
psql (11.2)
Type "help" for help.
postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+----------+----------+---------+------------+-----------------------
postgres | postgres | UTF8 | C | en_US.utf8 |
template0 | postgres | UTF8 | C | en_US.utf8 | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | C | en_US.utf8 | =c/postgres +
| | | | | postgres=CTc/postgres
(3 rows)
postgres=# create extension rdkit ;
CREATE EXTENSION
```
打包的插件太少了,加新的插件
例如,安装PG自带插件
https://www.postgresql.org/ftp/source/v11.6/
```
(base) [rdkit@pg11-test ~]$ wget https://ftp.postgresql.org/pub/source/v11.6/postgresql-11.6.tar.bz2
tar -jxvf postgresql-11.6.tar.bz2
cd
./configure --prefix=/home/rdkit/anaconda3
cd contrib
```
修改Makefile
```
vi Makefile
删除
vacuumlo
oid2name
pg_standby
```
安装
USE_PGXS=1 make
USE_PGXS=1 make install
将安装后的lib和sql文件拷贝到正确目录
```
cd /home/rdkit/anaconda3
mv lib/postgresql/ lib/
mv share/postgresql/extension/ share/extension/
```
部署rdkit数据库环境 - 方法 2 源码安装
https://github.com/rdkit/rdkit/releases
```
wget https://github.com/rdkit/rdkit/archive/Release_2019_09_1.tar.gz
tar -zxvf Release_2019_09_1.tar.gz
cd rdkit-Release_2019_09_1
mkdir build
cd build
必须使用python3
export PATH
export RDBASE
export LD_LIBRARY_PATH
cmake .. -DPYTHON_EXECUTABLE=/usr/bin/python3
make
make install
```
cd rdkit-Release_2019_09_1/Code/PgSQL/rdkit
USE_PGXS=1 make
USE_PGXS=1 make install
使用手册
https://www.rdkit.org/docs/Cartridge.html#reference-guide
下载测试数据
http://downloads.emolecules.com/free/
ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest
参照文档建表,倒入数据,创建索引,进行查询,分析。
参考
https://www.anaconda.com/distribution/#linux
https://rdkit.readthedocs.io/en/latest/Install.html
https://www.rdkit.org/docs/Cartridge.html
https://github.com/rdkit/rdkit
Molecular database cartridge for PostgreSQL supporting substructure and similarity searches as well as many descriptor calculators
https://github.com/rdkit/rdkit/tree/master/Code/PgSQL/rdkit
https://github.com/rdkit/rdkit/blob/master/Docs/Book/Install.md
https://github.com/rdkit/rdkit/blob/master/Docs/Book/Cartridge.md
https://cactus.nci.nih.gov/presentations/meeting-08-2011/Fri_Aft_Greg_Landrum_RDKit-PostgreSQL.pdf
PostgreSQL 许愿链接
您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.
9.9元购买3个月阿里云RDS PostgreSQL实例
PostgreSQL 解决方案集合
德哥 / digoal's github - 公益是一辈子的事.





