变异注释-软件SnpEff

罗大黑学生信 2021-07-13

3222

SnpEff 简介

SnpEff 下载安装

SnpEff 下载数据库

SnpEff 注释

SnpEff 简介

SnpEff是一款注释变异位点（SNP+InDel+MNP）和预测变异对基因的影响（例如氨基酸变化）的工具。

SnpEff 下载安装

SnpEff 是免安装的软件，下载解压即可使用；

# Download latest version
> wget http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip
# Unzip file
> unzip snpEff_latest_core.zip

SnpEff 下载数据库

一般我们可以直接用snpeff 的命令进行下载；步骤包括：

1）查询可用的数据库；

2）下载数据库。

# 查询可用数据库
> java -jar snpEff.jar databases > snpEff.databases.list.txt


#下载数据库, 以human为例
> java -jar snpEff.jar download hg19 


# 下载成功之后，会在软件安装目录生成 data文件夹，并且生成一个以数据库名字命名的文件夹
> ll hg19/
total 447784
interactions.bin
nextProt.bin
pwms.bin
sequence.10.bin
sequence.11.bin
sequence.12.bin......

SnpEff 注释

Basic example;

输入文件：input : vcf;

#  input：vcf 8列, 如果构建虚拟的vcf文件;REF ALT不能一样
$ head examples/test.chr22.vcf
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO
22    17071756    .    T    C    .    .    .
22    17072035    .    C    T    .    .    .
22    17072258    .    C    A    .    .    .
22    17072674    .    G    A    .    .    .
22    17072747    .    T    C    .    .    .
22    17072781    .    C    T    .    .    .

databases 注释执行命令如下:

# 注释命令 及 output 在输入文件的INFO列新增了一个字段信息，字段的名字叫做ANN；关于ANN中各个部分的详细信息可以参考VCF头部的注释部分。
$ java -Xmx4g -jar snpEff.jar GRCh37.75 examples/test.chr22.vcf > test.chr22.ann.vcf


## 如果指定 配置文件，用 '-c' command; 如果条件允许, 可以用 '-t' command 多线程
$ java -Xmx4g path/to/snpEff/snpEff.jar -c -t 4 path/to/snpEff/snpEff.config GRCh37.75 path/to/snps.vcf
# java -Xmx4g -jar  path/to/snpEff/snpEff.jar   hg19 test.chr22.vcf  > test.anno.hg19_multianno.txt


 # Here is how the output looks like
$ head examples/test.chr22.ann.vcf
##SnpEffVersion="4.1 (build 2015-01-07), by Pablo Cingolani"
##SnpEffCmd="SnpEff  GRCh37.75 examples/test.chr22.vcf "
##INFO=
##INFO=
##INFO=
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
22      17071756        .       T       C       .       .       ANN=C|3_prime_UTR_variant|MODIFIER|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.*11A>G|||||11|,C|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*397A>G|||||4223|
22      17072035        .       C       T       .       .       ANN=T|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.1406G>A|p.Gly469Glu|1666/2034|1406/1674|469/557||,T|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*397G>A|||||3944|
22      17072258        .       C       A       .       .       ANN=A|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.1183G>T|p.Gly395Cys|1443/2034|1183/1674|395/557||,A|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*397G>T|||||3721|
22      17072674        .       G       A       .       .       ANN=A|missense_variant|MODERATE|CCT8L2|ENSG00000198445|transcript|ENST00000359963|protein_coding|1/1|c.767C>T|p.Pro256Leu|1027/2034|767/1674|256/557||,A|downstream_gene_variant|MODIFIER|FABP5P11|ENSG00000240122|transcript|ENST00000430910|processed_pseudogene||n.*397C>T|||||3305|

对注释的结果处理

处理前：
##SnpEffVersion="4.3t (build 2017-11-24 10:18), by Pablo Cingolani"
##SnpEffCmd="SnpEff  hg19 /data/IndividualArchive/luojy/pipeline/Splitfusion/ngs/example/P4T1-L34/P4T1-L34_for_anno.txt "
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO' ">
##INFO=<ID=LOF,Number=.,Type=String,Description="Predicted loss of function effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
##INFO=<ID=NMD,Number=.,Type=String,Description="Predicted nonsense mediated decay effects for this variant. Format: 'Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected'">
1  4165471  .  T  A  .  .  ANN=A|intergenic_region|MODIFIER|LINC01346-LOC284661|LINC01346-LOC284661|intergenic_region|LINC01346-LOC284661|||n.4165471T>A||||||
1  9658640  .  T  A  .  .  ANN=A|missense_variant|MODERATE|TMEM201|TMEM201|transcript|NM_001130924.2|protein_coding|4/11|c.563T>A|p.Ser188Asn|620/3852|563/2001|188/666||WARNING_REF_DOES_NOT_MATCH_GENOME,A|missense_variant|MODERATE|TMEM201|TMEM201|transcript|NM_001010866.3|protein_coding|4/6|c.563T>A|p.Ser188Asn|620/3907|563/1179|188/392||WARNING_REF_DOES_NOT_MATCH_GENOME
1  25792274  .  T  A  .  .  ANN=A|intron_variant|MODIFIER|TMEM57|TMEM57|transcript|NM_018202.5|protein_coding|6/10|c.1154+6891T>A||||||,A|intron_variant|MODIFIER|TMEM57|TMEM57|transcript|NM_001282564.1|protein_coding|4/8|c.473+11401T>A||||||
1  38010402  .  T  A  .  .  ANN=A|intron_variant|MODIFIER|SNIP1|SNIP1|transcript|NM_024700.3|protein_coding|2/3|c.328-4046A>T||||||
1  48305059  .  T  A  .  .  ANN=A|intron_variant|MODIFIER|TRABD2B|TRABD2B|transcript|NM_001194986.1|protein_coding|2/6|c.667-37768A>T||||||


处理后：position  gene  strand  func  exonfunc  refNM  exon  cdna
1_4165471  LINC01346-LOC284661  NA  NA  intergenic_region  LINC01346-LOC284661  intron  NA
1_9658640  TMEM201  NA  protein_coding  missense_variant  NM_001130924.2  exon4  563
1_25792274  TMEM57  NA  protein_coding  intron_variant  NM_018202.5  intron6  NA
1_38010402  SNIP1  NA  protein_coding  intron_variant  NM_024700.3  intron2  NA
1_48305059  TRABD2B  NA  protein_coding  intron_variant  NM_001194986.1  intron2  NA
1_52940791  ZCCHC11  NA  protein_coding  stop_gained  NM_001009881.2  exon13  2440
1_69212036  DEPDC1-AS1-LRRC7  NA  NA  intergenic_region  DEPDC1-AS1-LRRC7  intron  NA
1_115256558  NRAS  NA  protein_coding  synonymous_variant  NM_002524.4  exon3  153
1_115258735  NRAS  NA  protein_coding  missense_variant  NM_002524.4  exon2  47

部分代码：

def anno_process(input_file, output_file):
    list_anno = []
    with open(input_file, 'r') as r_handle:
        reader = csv.DictReader(filter(lambda row: row[0] != '#', r_handle), delimiter='\t',
                                fieldnames=['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'])
        for row in reader:
            spams = re.split(r'\|', row['INFO'])
            dict_for_anno = {
                'position': row['CHROM'] + '_' + row['POS'],
                'gene': ('NA' if spams[3] == '' else spams[3]),
                'strand': 'NA',
                'func': ('NA' if spams[7] == '' else spams[7]),
                'exonfunc': ('NA' if spams[1] == '' else spams[1]),
                'refNM': ('NA' if spams[6] == '' else spams[6]),
                'exon': ('intron' if spams[10] == '' else 'exon') + (
                    '' if spams[8] == '' else re.split(r'/', spams[8])[0]),
                'cdna': ('NA' if spams[12] == '' else re.split(r'/', spams[12])[0])  ## 注意有争议的部分
            }
            list_anno.append(dict_for_anno)


    with open(output_file, 'w') as w_handle:
        writer = csv.DictWriter(w_handle, quoting=csv.QUOTE_MINIMAL, delimiter='\t',
                                fieldnames=['position', 'gene', 'strand', 'func', 'exonfunc', 'refNM', 'exon', 'cdna'])
        writer.writeheader()
        for row in natsort.natsorted(list_anno, key=itemgetter('position')):
            writer.writerow(row)

参考

snpEff使用说明https://www.jianshu.com/p/e03095642ef0

软件

文章转载自罗大黑学生信，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。

变异注释-软件SnpEff

评论