
Splitting rows into bins - ntile vs width_bucket

Original · fizz · 2022-11-19
  • Statement1

    NTILE sorts the data according to its ORDER BY clause, then splits it into N groups. Each group contains the same number of rows. If ( row count / N ) is not an integer, some groups contain one extra row; see the sketch after this example.

    Split into quartiles with NTILE

    with rws as (  
      select dbms_random.value ( 0, 100 ) x  
      from   dual connect by level <= 1000  
    ), grps as (  
      select x,   
             ntile ( 4 ) over ( order by x ) grp  
      from   rws  
    )  
      select grp, count (*) row#,   
             round ( min ( x ), 1 ) lower_bound,   
             round ( max ( x ), 1 ) upper_bound  
      from   grps  
      group  by grp order by grp
    GRP   ROW#   LOWER_BOUND   UPPER_BOUND
      1    250            .3          24.2
      2    250          24.3          50.1
      3    250          50.1          75.8
      4    250          75.9          99.9

    4 rows selected.
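    Oracle distributes the remainder one row at a time starting with group 1, so the lower-numbered groups get the extra rows. A minimal sketch (not part of the original article): splitting ten rows into four groups gives counts of 3, 3, 2, 2.

    with rws as (
      select level x from dual connect by level <= 10
    ), grps as (
      select x, ntile ( 4 ) over ( order by x ) grp
      from   rws
    )
      select grp, count (*) row#
      from   grps
      group  by grp order by grp
      -- groups 1 and 2 get 3 rows; groups 3 and 4 get 2 rows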
  • Statement2

    As with the previous example, each group has 250 rows. The upper and lower bounds for each group are significantly different, though.

    Split into quartiles with NTILE - normal distribution

    with rws as (  
      select ( 10 * dbms_random.normal ) + 50 x  
      from   dual connect by level <= 1000  
    ), grps as (  
      select x,   
             ntile ( 4 ) over ( order by x ) grp  
      from   rws  
    )  
      select grp, count(*),   
             round ( min ( x ), 1 ),   
             round ( max ( x ), 1 )  
      from   grps  
      group  by grp order  by grp
    GRP   COUNT(*)   ROUND(MIN(X),1)   ROUND(MAX(X),1)
      1        250              23.2              43.4
      2        250              43.5              50.2
      3        250              50.2              56.7
      4        250              56.8                85

    4 rows selected.
  • Statement3

    WIDTH_BUCKET creates buckets of equal width. In this case each covers 25 values, which fixes the range of values in each bucket. The numbers are randomly generated using a uniform distribution, so there are roughly 250 rows in each group. (WIDTH_BUCKET also defines overflow buckets for out-of-range values; see the sketch after this example.)

    Split into four bins with WIDTH_BUCKET

    with rws as (  
      select dbms_random.value ( 0, 100 ) x  
      from   dual connect by level <= 1000  
    ), grps as (  
      select x,   
             -- intervals [0, 25), [25, 50), [50, 75), [75, 100)  
             width_bucket ( x, 0, 100, 4 ) bucket  
      from   rws  
    )  
      select bucket, count(*),   
             round ( min ( x ), 1 ),   
             round ( max ( x ), 1 )  
      from   grps  
      group  by bucket order by bucket
    BUCKET   COUNT(*)   ROUND(MIN(X),1)   ROUND(MAX(X),1)
         1        249                 0              24.9
         2        251                25                50
         3        234              50.5              74.8
         4        266                75               100

    4 rows selected.
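    WIDTH_BUCKET also handles out-of-range input: values below the lower bound return bucket 0, and values at or above the upper bound return bucket N + 1 (the upper bound is exclusive). A minimal sketch:

    select width_bucket ( -1, 0, 100, 4 ) below_range,     -- 0: under the range
           width_bucket ( 100, 0, 100, 4 ) at_upper_bound  -- 5: upper bound is exclusive
    from   dual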
  • Statement4

    As with the previous example, the upper and lower bounds of each bin are fixed. But this time the data are generated using a normal distribution, so almost all the rows fall in the middle two buckets.

    Split into four bins with WIDTH_BUCKET - normal distribution

    with rws as (  
      select ( 10 * dbms_random.normal ) + 50 x  
      from   dual connect by level <= 1000  
    ), grps as (  
      select x,   
             width_bucket ( x, 0, 100, 4 ) bucket  
      from   rws  
    )  
      select bucket, count(*),   
             round ( min ( x ), 1 ),   
             round ( max ( x ), 1 )  
      from   grps  
      group  by bucket order by bucket
    BUCKET   COUNT(*)   ROUND(MIN(X),1)   ROUND(MAX(X),1)
         1          3               9.8              21.5
         2        490              25.6                50
         3        502                50              74.4
         4          5              75.3                86

    4 rows selected.
  • Statement5

    The previous WIDTH_BUCKET examples fixed the range to [0, 100). This sets the upper and lower bounds dynamically using the MAX and MIN analytic functions. Adding 1 to the maximum ensures the largest value falls in bucket 4 rather than the overflow bucket 5, because WIDTH_BUCKET excludes the upper bound (see the sketch after this example).

    Split into four bins with WIDTH_BUCKET with dynamic bounds

    with rws as (  
      select ( 10 * dbms_random.normal ) + 50 x  
      from   dual connect by level <= 1000  
    ), grps as (  
      select x, width_bucket (   
               x,   
               min ( x ) over (),   
               max ( x ) over () + 1,   
               4   
             ) bucket  
      from   rws  
    )  
      select bucket, count(*),   
             round ( min ( x ), 1 ),   
             round ( max ( x ), 1 ),  
             round ( max ( x ) - min ( x ), 1 ) bucket_range  
      from   grps  
      group  by bucket order by bucket
    BUCKET   COUNT(*)   ROUND(MIN(X),1)   ROUND(MAX(X),1)   BUCKET_RANGE
         1         79                21              36.5           15.6
         2        504              36.8              52.2           15.4
         3        397              52.2              67.7           15.5
         4         20              68.2              82.4           14.3

    4 rows selected.
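    A minimal sketch of why the + 1 is needed: WIDTH_BUCKET excludes the upper bound, so without the adjustment the row holding the maximum value lands in the overflow bucket.

    select width_bucket ( 10, 0, 10, 4 ) unadjusted,    -- 5: the max overflows
           width_bucket ( 10, 0, 10 + 1, 4 ) adjusted   -- 4: the max stays in the last bucket
    from   dual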
  • Statement6

    WIDTH_BUCKET only supports numeric and datetime values. So if you want to split strings into buckets, you first need to map the characters to numbers. This uses DENSE_RANK to map each starting letter to a value from 1 to 26.

    Split into four bins with WIDTH_BUCKET - character data

    with rws as (  
      select initcap ( dbms_random.string ( 'u', 10 ) ) x  
      from   dual connect by level <= 1000   
    ), ranks as (  
      select x, substr ( x, 1, 1 ) letter,  
             dense_rank () over ( order by substr ( x, 1, 1 ) ) dr  
      from   rws  
    ), grps as (   
      select x, letter, width_bucket (   
               dr, 1, 27, 4   
             ) grp  
      from   ranks  
    )  
      select min ( letter ), max ( letter ),  
             count ( distinct letter ) value#,  
             count (*) row#,  
             min ( x ), max ( x )   
      from   grps  
      group  by grp order by grp
    MIN(LETTER)   MAX(LETTER)   VALUE#   ROW#   MIN(X)       MAX(X)
    A             G                  7    264   Abbkaisqob   Gzthagisaj
    H             M                  6    249   Haotkacnux   Mzqrivbrqz
    N             T                  7    276   Nauugxwzqt   Tzjwgcpztq
    U             Z                  6    211   Udnwtrnsgn   Zxxyobitlq

    4 rows selected.
  • Statement7

    This uses DENSE_RANK to assign each JOB_ID a number in the range [1, number of distinct jobs]. This enables you to place the rows for each job in alphabetical buckets.

    Split into four bins with WIDTH_BUCKET - complete strings

    with rws as (  
      select dense_rank () over ( order by job_id ) dr, e.*  
      from   hr.employees e  
    ), grps as (  
      select r.*,   
             width_bucket ( dr, 1, max ( dr ) over () + 1, 4 ) bucket  
      from   rws r  
    )  
      select bucket,  
             count ( distinct job_id ) job_count,  
             count(*) row_count,   
             listagg ( distinct job_id, ',' )    
               within group ( order by job_id ) jobs  
      from   grps  
      group  by bucket
    BUCKET   JOB_COUNT   ROW_COUNT   JOBS
         1           5           6   AC_ACCOUNT,AC_MGR,AD_ASST,AD_PRES,AD_VP
         2           5          13   FI_ACCOUNT,FI_MGR,HR_REP,IT_PROG,MK_MAN
         3           5          13   MK_REP,PR_REP,PU_CLERK,PU_MAN,SA_MAN
         4           4          75   SA_REP,SH_CLERK,ST_CLERK,ST_MAN

    4 rows selected.
  • Statement8

    Using DENSE_RANK to map letters to numbers can give unexpected results if some letters are missing from the source. In this case gaps mean W is mapped to 20 instead of its actual position in the English alphabet, 23. So names starting with W go in the third bin instead of the last, and the query returns only three rows instead of the four buckets requested. The diagnostic sketch after this example shows the gaps.

    Split into four bins with WIDTH_BUCKET - missing letters

    with rws as (  
      select first_name x  
      from   hr.employees  
    ), ranks as (  
      select x, substr ( x, 1, 1 ) letter,  
             dense_rank () over ( order by substr ( x, 1, 1 ) ) dr  
      from   rws  
    ), grps as (   
      select x, letter, width_bucket (   
               dr, 1, 27, 4   
             ) grp  
      from   ranks  
    )  
      select min ( letter ), max ( letter ),  
             count ( distinct letter ) value#,  
             count (*) row#,  
             min ( x ), max ( x )   
      from   grps  
      group  by grp order by grp
    MIN(LETTER)   MAX(LETTER)   VALUE#   ROW#   MIN(X)   MAX(X)
    A             H                  7     34   Adam     Hermann
    I             N                  6     41   Irene    Neena
    O             W                  7     32   Oliver   Winston

    3 rows selected.
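    A quick diagnostic sketch (not in the original) to expose the gaps: list each distinct starting letter with its DENSE_RANK and its true alphabet position. Rows where the two numbers differ mark letters that shifted down because earlier letters are missing.

    with ranks as (
      select distinct
             substr ( first_name, 1, 1 ) letter,
             dense_rank () over ( order by substr ( first_name, 1, 1 ) ) dr
      from   hr.employees
    )
      select letter, dr,
             ascii ( letter ) - 64 alphabet_position  -- e.g. W -> dr 20, position 23
      from   ranks
      order  by letter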
  • Statement9

    You can solve the problem of missing letters by mapping them to ASCII character codes instead; a quick check of the arithmetic follows the example.

    Split letters into four groups with WIDTH_BUCKET

    with rws as ( 
      select first_name x 
      from   hr.employees 
    ), ranks as ( 
      select x, substr ( x, 1, 1 ) letter, 
             ascii ( substr ( x, 1, 1 ) ) - 64 dr 
      from   rws 
    ), grps as (  
      select x, letter, width_bucket (  
               dr, 1, 27, 4  
             ) grp 
      from   ranks 
    ) 
      select min ( letter ), max ( letter ), 
             count ( distinct letter ) value#, 
             count (*) row#, 
             min ( x ), max ( x )  
      from   grps 
      group  by grp order by grp
    MIN(LETTER)   MAX(LETTER)   VALUE#   ROW#   MIN(X)     MAX(X)
    A             G                  6     31   Adam       Guy
    H             M                  6     40   Harrison   Mozhe
    N             T                  6     31   Nancy      Trenna
    V             W                  2      5   Valli      Winston

    4 rows selected.
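    A quick check of the arithmetic: the ASCII codes for A-Z run from 65 to 90, so subtracting 64 gives every letter its fixed alphabet position, whether or not all letters appear in the data.

    select ascii ( 'A' ) - 64 a_pos,  -- 1
           ascii ( 'W' ) - 64 w_pos,  -- 23: W keeps its true position
           ascii ( 'Z' ) - 64 z_pos   -- 26
    from   dual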
  • Statement10

    NTILE and WIDTH_BUCKET work with a fixed number of bins. If you instead need an unknown number of groups of a fixed size, you can use this method. It generates a unique number for each row, from 1 to the number of rows, then groups them into batches of five by taking the ceiling of the row number divided by five. A worked example of the formula follows the output.

    Split into groups with five rows each

    with rws as (  
      select dbms_random.value ( 0, 100 ) x, level id, 5 n  
      from   dual  
      connect by level <= dbms_random.value ( 100, 1000 )  
    ), grps as (  
      select x, n,  
             row_number () over ( order by id ) rn  
      from   rws  
    )  
      select ceil ( rn / n ), count(*),  
             round ( min ( x ) ), round ( max ( x ) )  
      from   grps  
      group  by ceil ( rn / n ) order by ceil ( rn / n )
    CEIL(RN/N)   COUNT(*)   ROUND(MIN(X))   ROUND(MAX(X))
             1          5              13              75
             2          5              29              88
             3          5               5              85
             4          5              25              84
             5          5              20              93
             6          5               7              87
             7          5               4              91
             8          5               0              58
             9          5              21              93
            10          5              34             100
            11          5               4              96
            12          5               1              92
            13          5              16              62
            14          5               1              69
            15          5               9              86
            16          5              18              81
            17          5              32              86
            18          5              30              83
            19          5              47              90
            20          5               9              82
            21          4               5              92

    21 rows selected.
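    A worked example of the ceiling formula: rows 1-5 give CEIL ( rn / 5 ) = 1, rows 6-10 give 2, and so on, so every group covers exactly five consecutive row numbers (except possibly the last).

    select level rn, ceil ( level / 5 ) grp
    from   dual connect by level <= 12
    -- rn 1-5 -> grp 1; rn 6-10 -> grp 2; rn 11-12 -> grp 3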
  • Statement11

    The previous example puts five rows in each group. This instead fixes the group size to five values, so the intervals are [0, 5), [5, 10), [10, 15), etc. The principle is the same: normalize the values by dividing them by the group size. This takes the floor instead of the ceiling so the first group starts at zero instead of one. Multiplying back up by the interval size gives the lower bound for each group; adding the interval size to that gives the upper bound. A worked example follows the output.

    Split rows into fixed intervals

    with rws as (  
      select dbms_random.value ( 0, 100 ) x,  
             5 interval_size  
      from   dual connect by level <= 1000  
    ), grps as (  
      select floor ( x / interval_size ) * interval_size x,  
             interval_size  
      from   rws  
    )  
      select x, x + interval_size,  
             count(*)  
      from   grps  
      group  by x, interval_size  
      order  by x
    X    X+INTERVAL_SIZE   COUNT(*)
    0                  5         60
    5                 10         56
    10                15         49
    15                20         42
    20                25         58
    25                30         47
    30                35         64
    35                40         48
    40                45         44
    45                50         50
    50                55         42
    55                60         58
    60                65         49
    65                70         59
    70                75         48
    75                80         50
    80                85         49
    85                90         34
    90                95         48
    95               100         45

    20 rows selected.
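    A worked example of the normalization: for x = 12.3, FLOOR ( 12.3 / 5 ) * 5 = 10, so the value falls in the interval [10, 15).

    select floor ( 12.3 / 5 ) * 5 lower_bound,      -- 10
           floor ( 12.3 / 5 ) * 5 + 5 upper_bound   -- 15
    from   dual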
  • Statement12

    alter session set nls_date_format = 'DD-MON HH24:MI'

    Statement processed.

  • Statement13

    alter session set nls_timestamp_format = 'DD-MON HH24:MI'

    Statement processed.

  • Statement14

    This uses the same concept as the previous two examples to split the rows into groups of five minutes. It does this by finding the number of minutes past midnight for each row, then applies the normalizing formula to turn these into five-minute groups. A worked example follows the output.

    Split into five minute intervals

    with rws as (  
      select timestamp'2021-09-01 00:00:00' +   
               numtodsinterval ( dbms_random.value ( 0, 120 ), 'minute' ) x,  
             5 interval_mins  
      from   dual connect by level <= 1000  
    ), grps as (  
      select x, interval_mins, trunc ( x ) dt,  
             floor (   
                -- convert to minutes past midnight  
                 ( extract ( hour from x ) * 60 ) + extract ( minute from x )   
                  / interval_mins   
             ) * interval_mins mins  
      from   rws  
    )  
      select dt + mins/1440 st, dt + ( mins + interval_mins ) / 1440 en,   
             count (*)  
      from   grps  
      group  by dt + mins/1440, dt + ( mins + interval_mins ) / 1440 order by dt + mins/1440
    ST             EN             COUNT(*)
    01-SEP 00:00   01-SEP 00:05         71
    01-SEP 00:05   01-SEP 00:10         67
    01-SEP 00:10   01-SEP 00:15         55
    01-SEP 00:15   01-SEP 00:20         48
    01-SEP 00:20   01-SEP 00:25         65
    01-SEP 00:25   01-SEP 00:30         58
    01-SEP 00:30   01-SEP 00:35         79
    01-SEP 00:35   01-SEP 00:40         51
    01-SEP 00:40   01-SEP 00:45         64
    01-SEP 00:45   01-SEP 00:50         67
    01-SEP 00:50   01-SEP 00:55         63
    01-SEP 00:55   01-SEP 01:00         60
    01-SEP 01:00   01-SEP 01:05         24
    01-SEP 01:05   01-SEP 01:10         20
    01-SEP 01:10   01-SEP 01:15         28
    01-SEP 01:15   01-SEP 01:20         17
    01-SEP 01:20   01-SEP 01:25         30
    01-SEP 01:25   01-SEP 01:30         14
    01-SEP 01:30   01-SEP 01:35         22
    01-SEP 01:35   01-SEP 01:40         15
    01-SEP 01:40   01-SEP 01:45         29
    01-SEP 01:45   01-SEP 01:50         21
    01-SEP 01:50   01-SEP 01:55         17
    01-SEP 01:55   01-SEP 02:00         15

    24 rows selected.
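    A worked example of the conversion: a row at 01:23 is ( 1 * 60 ) + 23 = 83 minutes past midnight, and FLOOR ( 83 / 5 ) * 5 = 80, placing it in the interval [01:20, 01:25).

    select floor ( ( ( 1 * 60 ) + 23 ) / 5 ) * 5 mins  -- 80 => interval [01:20, 01:25)
    from   dual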
  • Statement15

    This is the same as the previous example, but some time intervals contain no rows, so they are missing from the result set.

    Split into five minute intervals - sparse data

    with rws as (  
      select timestamp'2021-09-01 00:00:00' +   
               numtodsinterval ( dbms_random.value ( 0, 120 ), 'minute' ) x,  
             5 interval_mins  
      from   dual connect by level <= 10  
    ), grps as (  
      select x, interval_mins, trunc ( x ) dt,  
             floor (   
                -- convert to minutes past midnight  
                 ( extract ( hour from x ) * 60 ) + extract ( minute from x )   
                  / interval_mins   
             ) * interval_mins mins  
      from   rws  
    )  
      select dt + mins/1440 st, dt + ( mins + interval_mins ) / 1440 en,   
             count (*)  
      from   grps  
      group  by dt + mins/1440, dt + ( mins + interval_mins ) / 1440 order by dt + mins/1440
    ST             EN             COUNT(*)
    01-SEP 00:10   01-SEP 00:15          2
    01-SEP 00:15   01-SEP 00:20          1
    01-SEP 00:35   01-SEP 00:40          2
    01-SEP 00:40   01-SEP 00:45          2
    01-SEP 00:45   01-SEP 00:50          1
    01-SEP 00:50   01-SEP 00:55          1
    01-SEP 00:55   01-SEP 01:00          1

    7 rows selected.
  • Statement16

    As with the previous example, this has sparse data. To ensure the output includes a row for every interval in the period, it first generates all the intervals, then outer joins the sparse data to them.

    Split into five minute intervals - sparse data

    with intervals as (  
      select 5 interval_size,   
             timestamp'2021-09-01 00:00:00' lower_bound,   
             120 total_minutes   
      from   dual  
    ), grps as (  
      select lower_bound + numtodsinterval ( ( level - 1 ) * interval_size, 'minute' ) mn_x,  
             lower_bound + numtodsinterval ( level * interval_size, 'minute' ) mx_x,  
             interval_size  
      from   intervals  
      connect by level <= (   
        -- unknown data size; create a range for all possible values  
        select max ( ceil ( total_minutes / interval_size ) )   
        from   intervals  
      )  
    ), rws as (  
      select lower_bound +   
               numtodsinterval ( dbms_random.value ( 0, 120 ), 'minute' ) x  
      from   intervals connect by level <= 25  
    )  
      select mn_x, mx_x,   
             count ( x ) rws_per_group   
      from   grps  
      left join rws  
      on     x >= mn_x  
      and    x < mx_x  
      group  by mn_x, mx_x order by mn_x
    MN_X           MX_X           RWS_PER_GROUP
    01-SEP 00:00   01-SEP 00:05               1
    01-SEP 00:05   01-SEP 00:10               1
    01-SEP 00:10   01-SEP 00:15               2
    01-SEP 00:15   01-SEP 00:20               1
    01-SEP 00:20   01-SEP 00:25               0
    01-SEP 00:25   01-SEP 00:30               0
    01-SEP 00:30   01-SEP 00:35               0
    01-SEP 00:35   01-SEP 00:40               3
    01-SEP 00:40   01-SEP 00:45               1
    01-SEP 00:45   01-SEP 00:50               1
    01-SEP 00:50   01-SEP 00:55               3
    01-SEP 00:55   01-SEP 01:00               0
    01-SEP 01:00   01-SEP 01:05               0
    01-SEP 01:05   01-SEP 01:10               0
    01-SEP 01:10   01-SEP 01:15               0
    01-SEP 01:15   01-SEP 01:20               1
    01-SEP 01:20   01-SEP 01:25               1
    01-SEP 01:25   01-SEP 01:30               3
    01-SEP 01:30   01-SEP 01:35               0
    01-SEP 01:35   01-SEP 01:40               0
    01-SEP 01:40   01-SEP 01:45               1
    01-SEP 01:45   01-SEP 01:50               0
    01-SEP 01:50   01-SEP 01:55               1
    01-SEP 01:55   01-SEP 02:00               2

    24 rows selected.
  • Statement17

    You can use the previous method for sparse numeric data as well as datetime data. This generates the intervals [0, 5), [5, 10), etc., then outer joins the sparse data to them.

    Split into ranges of five - sparse data

    with intervals as (  
      select 5 interval_size, 0 lower_bound, 100 upper_bound  
      from   dual  
    ), grps as (  
      select ( level - 1 ) * interval_size mn_x,  
             level * interval_size mx_x,  
             interval_size  
      from   intervals  
      connect by level <= (   
        -- unknown data size; create a range for all possible values  
        select max ( ceil ( upper_bound / interval_size ) ) from intervals  
      )  
    ), rws as (  
      select dbms_random.value ( lower_bound, upper_bound ) x  
      from   intervals connect by level <= 25  
    )  
      select mn_x, mx_x,   
             count ( x ) rws_per_group   
      from   grps  
      left join rws  
      on     x >= mn_x  
      and    x < mx_x  
      group  by mn_x, mx_x order by mn_x
    MN_X   MX_X   RWS_PER_GROUP
       0      5               2
       5     10               2
      10     15               0
      15     20               0
      20     25               0
      25     30               1
      30     35               1
      35     40               5
      40     45               0
      45     50               1
      50     55               2
      55     60               1
      60     65               1
      65     70               0
      70     75               0
      75     80               3
      80     85               2
      85     90               4
      90     95               1
      95    100               1

    20 rows selected.