Oracle 改进有意生成笛卡尔产品的SQL

askTom 2016-04-26

152

问题描述

一位同事要求我帮助实现一个SQL解决方案，以便从一组已知值生成一组别名。

1.假设一个包含一组短语的表。
2.假设第二个表包含单词及其缩写列表。
3.对于第一个表中的每一行，生成一行，该行的主词替换为其缩写。

例如，假设短语“圣詹姆斯堡”和缩写：

福特堡
福特堡.
圣圣
圣圣.

生成以下行：

圣詹姆斯堡
圣詹姆斯堡
圣詹姆斯堡
圣詹姆斯堡
圣詹姆斯堡
圣詹姆斯堡
英尺。圣詹姆斯
英尺。圣詹姆斯
英尺。圣詹姆斯

为了模拟这个，我创建了两个表：

创建表ncy_Pluases
(
短语varchar2(100)不为空
，主键(
词组
)
) ；

insert into ncy_phrases values ( '圣詹姆斯堡' ) ；
插入到ncy_短语值中（'圣海伦斯堡山'）；

创建表ncy_abbbviations
(
base_word varchar2(100)不为空
，缩写varchar2(100)不为空
，主键(
基字
，缩写
)
) ；

插入到ncy_abbbviations值（'Fort'，'Ft.'）中；
insert into ncy_abbreviations values ( 'Fort' , 'Ft') ；
插入到ncy_abbbviations值中（'装载'，'mt.'）；
insert into ncy_abbreviations values ( 'Mount', 'Mt') ；
插入到ncy_abbbrevations值（'Saint'，'St.'）中；
insert into ncy_abbreviations values ( 'Saint', 'St') ；

我得出的查询是：

单词为(
------------------------
&'97 ；将短语拆分为N个空格分隔的单词
------------------------
选择短语
，短语_no
, substr(
txt
, instr （ txt，''， 1 ，级别） + 1
, instr （ txt，''， 1 ，级别+ 1 ） -instr （ txt，''， 1 ，级别） - 1
)作为标记
，级别为标记_no
来自(
------------------------
&'97 ；从表中选择每个短语，并追加/前置空格
------------------------
选择fragz.短语作为短语
，行号为短语_no
，'''|| fragase.plase ||'''作为txt
来自ncy_短语fragas
其中1 = 1
)
按级别<=长度连接(txt) -长度（替换( txt，''， null ） ) + 1
)
，作为(
------------------------
&'97 ；将“word”中的每个单词连接到缩写表，以得到零
--或更多单词的缩写。
------------------------
选择基本短语
, base.短语_no
, base.token
, base.token_no
, cocolesce ( abbr.abbbreviation,base.token )作为缩写
从字库
左外连接ncy_缩写abbr
上(
abbr.base_word = base.token
)
其中1 = 1
和base.token不为null
联合
------------------------
&'97 ；联合生成每个单词作为其自己的缩写
------------------------
选择基本短语
, base.短语_no
, base.token
, base.token_no
, base.token as abbreviation
从字库
其中1 = 1
按1排序--base.plaase_no
, 4 -- base.token_no
)
--------------------------------------------------------------------------
&'97 ；根据生成扩展字符串所需的次数，将"tke"子查询连接到自身
&'97 ；从生成的字符串中修剪前导/尾随空白
--------------------------------------------------------------------------
选择w1.短语
，修饰( w1.缩写
||''|| w2.缩写
||''|| w3.缩写
||''|| w4.缩写
||''|| w5.缩写
)作为匹配值
来自吸液W1
左外连接螺纹w2
上(
w2.短语_no = w1.短语_no
和w2.token_no = 2
)
左外连接点w3
上(
w3.短语_no = w1.短语_no
和w3.token_no = 3
)
左外连接套管w4
上(
w4.短语_no = w1.短语_no
和w4.token_no = 4
)
左外连接套管w5
上(
w5.短语_no = w1.短语_no
和w5.token_no = 5
)
其中1 = 1
和w1.token_no = 1
按1排序
, 2
；

在这种设置下，如何改进生成结果集的SQL ?我特别不喜欢
当短语中的字数增加时，需要向“toke”子查询添加连接。有什么想法吗？

我的数据库是Oracle Database 11g Enterprise Edition 11.2.0.4.0 - 64位。

先谢了！

专家解答

首先，为了让事情变得简单，我打算把每个基词都加起来作为自己的缩写：

insert into ncy_abbreviations (base_word,abbreviation) values ('Fort','Fort');
insert into ncy_abbreviations (base_word,abbreviation) values ('Mount','Mount');
insert into ncy_abbreviations (base_word,abbreviation) values ('Saint','Saint');

你坐得舒服吗？那我们开始吧...

首先，让我们把短语和缩写连在一起。这将提供所有需要替换的词。

使用instr执行此操作，以检查词组中是否有基字：

select p.*,  n.*
from   NCY_PHRASES p
join   NCY_ABBREVIATIONS n
on     instr(p.phrase, n.base_word) > 0;

使用这个方法，我们希望找到要替换的短语中每个单词的位置。我通过计算基字后面的空格数来完成这个操作，其中有以下几个空格：

regexp_count(
  substr(
    p.phrase, 
    1, 
    case 
      when instr(p.phrase, ' ', instr(p.phrase, n.base_word)) = 0 then
        length(phrase) 
      else
        instr(p.phrase, ' ', instr(p.phrase, n.base_word))
    end
    ), 
  ' '
)

上述逻辑为：

-查找短语中缩写后面第一个空格的位置。如果在末尾，则返回字符串长度。
-从开始到现在，将短语子串起来
-数一数这个有多少个空格

这给出了词组中基词的编号。

除此之外，我们还想知道：

-词组中基词第一次出现的位置
-总共有多少个基词

我们可以使用上述结果的最小/最大分析来计算这些：

with abbrs as (
  select p.*,  n.*, 
         regexp_count(
           substr(
             p.phrase, 
             1, 
             case 
               when instr(p.phrase, ' ', instr(p.phrase, n.base_word)) = 0 then
                 length(phrase) 
               else
                 instr(p.phrase, ' ', instr(p.phrase, n.base_word))
             end
           ), 
           ' '
         ) word_pos
  from   NCY_PHRASES p
  join   NCY_ABBREVIATIONS n
  on     instr(p.phrase, n.base_word) > 0
), vals as (
  select phrase, base_word, abbreviation, word_pos,
         min(word_pos) over (partition by phrase) first_replace,
         max(word_pos) over (partition by phrase) tot_replaces
  from abbrs
)
  select * from vals;

使用这个方法，我们可以沿着字符串递归地走。

-从第一个基字的位置开始
-将基词替换为缩写词
-然后移到下一个基字并重复，直到替换完所有字词

然后，您只需要返回最后一次替换的结果。这里的单词位置=最大单词位置：

with abbrs as (
  select p.*,  n.*, 
         regexp_count(
           substr(
             p.phrase, 
             1, 
             case 
               when instr(p.phrase, ' ', instr(p.phrase, n.base_word)) = 0 then
                 length(phrase) 
               else
                 instr(p.phrase, ' ', instr(p.phrase, n.base_word))
             end
           ), 
           ' '
         ) word_pos
  from   NCY_PHRASES p
  join   NCY_ABBREVIATIONS n
  on     instr(p.phrase, n.base_word) > 0
), vals as (
  select phrase, base_word, abbreviation, word_pos,
         min(word_pos) over (partition by phrase) first_replace,
         max(word_pos) over (partition by phrase) tot_replaces
  from abbrs
), phrases (phrase, base_word, abbreviation, word_pos, str, tot_replaces) as (
  select phrase, base_word, abbreviation, word_pos, 
         replace(phrase, base_word, abbreviation) str,
         tot_replaces
  from   vals
  where  word_pos = first_replace
  union all
  select p.phrase, a.base_word, a.abbreviation, a.word_pos, 
         replace(str, a.base_word, a.abbreviation) str,
         a.tot_replaces
  from   vals a
  join   phrases p
  on     a.phrase = p.phrase
  and    a.word_pos = p.word_pos + 1
)
  select phrase, str
  from   phrases
  where  word_pos = tot_replaces
  order  by phrase, str;

PHRASE                         STR                          
------------------------------ ------------------------------
Fort Mount Saint Helens        Fort Mount Saint Helens       
Fort Mount Saint Helens        Fort Mount St Helens          
Fort Mount Saint Helens        Fort Mount St. Helens         
Fort Mount Saint Helens        Fort Mt Saint Helens          
Fort Mount Saint Helens        Fort Mt St Helens             
Fort Mount Saint Helens        Fort Mt St. Helens            
Fort Mount Saint Helens        Fort Mt. Saint Helens         
Fort Mount Saint Helens        Fort Mt. St Helens            
Fort Mount Saint Helens        Fort Mt. St. Helens           
Fort Mount Saint Helens        Ft Mount Saint Helens         
Fort Mount Saint Helens        Ft Mount St Helens            
Fort Mount Saint Helens        Ft Mount St. Helens           
Fort Mount Saint Helens        Ft Mt Saint Helens            
Fort Mount Saint Helens        Ft Mt St Helens               
Fort Mount Saint Helens        Ft Mt St. Helens              
Fort Mount Saint Helens        Ft Mt. Saint Helens           
Fort Mount Saint Helens        Ft Mt. St Helens              
Fort Mount Saint Helens        Ft Mt. St. Helens             
Fort Mount Saint Helens        Ft. Mount Saint Helens        
Fort Mount Saint Helens        Ft. Mount St Helens           
Fort Mount Saint Helens        Ft. Mount St. Helens          
Fort Mount Saint Helens        Ft. Mt Saint Helens           
Fort Mount Saint Helens        Ft. Mt St Helens              
Fort Mount Saint Helens        Ft. Mt St. Helens             
Fort Mount Saint Helens        Ft. Mt. Saint Helens          
Fort Mount Saint Helens        Ft. Mt. St Helens             
Fort Mount Saint Helens        Ft. Mt. St. Helens            
Fort Saint James               Fort Saint James              
Fort Saint James               Fort St James                 
Fort Saint James               Fort St. James                
Fort Saint James               Ft Saint James                
Fort Saint James               Ft St James                   
Fort Saint James               Ft St. James                  
Fort Saint James               Ft. Saint James               
Fort Saint James               Ft. St James                  
Fort Saint James               Ft. St. James

嗒哒！

请注意，这不包括没有要缩写的单词的短语。因此，您可能需要将此结果与以下结果结合起来：

select * from ncy_phrases p
where  not exists (
   select * from NCY_ABBREVIATIONS a
   where  instr(p.phrase, a.base_word) > 0
)

Oracle asktom

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

Oracle 改进有意生成笛卡尔产品的SQL

问题描述

专家解答

评论