MATCH_RECOGNIZE - Empty Matches and Unmatched Rows

原创 fizz 2022-11-17

Module1
Overview of SQL Pattern Matching
Finding patterns with SQL
Recognizing patterns in a sequence of rows has been a capability that was widely desired, but not really possible with SQL until now. There were many workarounds, but these were difficult to write, hard to understand, and inefficient to execute.
The aim of this tutorial is to explain the difference between the various row output options within MATC_RECOGNIZE, specifically the EMPTY MATCHES and UNMATCHED ROWS keywords within the ALL ROWS PER MATCH syntax.
Module2
Managing output from MATCH_RECOGNIZE
When determining the type of output you want MATCH_RECOGNIZE to return most developers will opt for one of the following:
ONE ROW PER MATCH - each match produces one summary row. This is the default. ALL ROWS PER MATCH - a match spanning multiple rows will produce one output row for each row in the match.
The default behaviour for MATCH_RECOGNIZE is to return one summary row for each match. In the majority of use cases this is probably the ideal solution. However, there are also many use cases that require more detailed information to be returned. If you are degugging your MATCH_RECOGNIZE statement then a little more information can help show how the pattern is being matched to your data set.

In some cases you may find it useful, or even necessary, to use the extended syntax of the ALL ROWS PER MATCH keywords.

There are three sub options:
- ALL ROWS PER MATCH SHOW EMPTY MATCHES <- note that this is the default
- ALL ROWS PER MATCH OMIT EMPTY MATCHES
- ALL ROWS PER MATCH WITH UNMATCHED ROWS
Let’s look at these sub-options in more detail but first a quick point of reference: all the examples shown below use the default AFTER MATCH SKIP PAST LAST ROW syntax. This will be covered in a separate tutorial.
Module3
Reset my tutorial environment
Just in case...
If at anytime you need to restart this tutorial then simply run the following statement to reset your environment
```
DROP TABLE ticker PURGE;
```

Module4

Create my dataset

Creating our ticker data set

First step is to setup our data table and then populate it with data

CREATE TABLE ticker (SYMBOL VARCHAR2(10), tstamp DATE, price NUMBER);

BEGIN
INSERT INTO ticker VALUES('ACME', '01-Apr-11', 12);
INSERT INTO ticker VALUES('ACME', '02-Apr-11', 17);
INSERT INTO ticker VALUES('ACME', '03-Apr-11', 19);
INSERT INTO ticker VALUES('ACME', '04-Apr-11', 21);
INSERT INTO ticker VALUES('ACME', '05-Apr-11', 25);
INSERT INTO ticker VALUES('ACME', '06-Apr-11', 12);
INSERT INTO ticker VALUES('ACME', '07-Apr-11', 15);
INSERT INTO ticker VALUES('ACME', '08-Apr-11', 20);
INSERT INTO ticker VALUES('ACME', '09-Apr-11', 24);
INSERT INTO ticker VALUES('ACME', '10-Apr-11', 25);
INSERT INTO ticker VALUES('ACME', '11-Apr-11', 19);
INSERT INTO ticker VALUES('ACME', '12-Apr-11', 15);
INSERT INTO ticker VALUES('ACME', '13-Apr-11', 25);
INSERT INTO ticker VALUES('ACME', '14-Apr-11', 25);
INSERT INTO ticker VALUES('ACME', '15-Apr-11', 14);
INSERT INTO ticker VALUES('ACME', '16-Apr-11', 12);
INSERT INTO ticker VALUES('ACME', '17-Apr-11', 14);
INSERT INTO ticker VALUES('ACME', '18-Apr-11', 24);
INSERT INTO ticker VALUES('ACME', '19-Apr-11', 23);
INSERT INTO ticker VALUES('ACME', '20-Apr-11', 22);

INSERT INTO ticker VALUES('GLOBEX', '01-Apr-11', 11);
INSERT INTO ticker VALUES('GLOBEX', '02-Apr-11', 12);
INSERT INTO ticker VALUES('GLOBEX', '03-Apr-11', 13);
INSERT INTO ticker VALUES('GLOBEX', '04-Apr-11', 12);
INSERT INTO ticker VALUES('GLOBEX', '05-Apr-11', 11);
INSERT INTO ticker VALUES('GLOBEX', '06-Apr-11', 10);
INSERT INTO ticker VALUES('GLOBEX', '07-Apr-11', 9);
INSERT INTO ticker VALUES('GLOBEX', '08-Apr-11', 8);
INSERT INTO ticker VALUES('GLOBEX', '09-Apr-11', 9);
INSERT INTO ticker VALUES('GLOBEX', '10-Apr-11', 9);
INSERT INTO ticker VALUES('GLOBEX', '11-Apr-11', 9);
INSERT INTO ticker VALUES('GLOBEX', '12-Apr-11', 9);
INSERT INTO ticker VALUES('GLOBEX', '13-Apr-11', 10);
INSERT INTO ticker VALUES('GLOBEX', '14-Apr-11', 11);
INSERT INTO ticker VALUES('GLOBEX', '15-Apr-11', 12);
INSERT INTO ticker VALUES('GLOBEX', '16-Apr-11', 11);
INSERT INTO ticker VALUES('GLOBEX', '17-Apr-11', 8);
INSERT INTO ticker VALUES('GLOBEX', '18-Apr-11', 7);
INSERT INTO ticker VALUES('GLOBEX', '19-Apr-11', 5);
INSERT INTO ticker VALUES('GLOBEX', '20-Apr-11', 3);

INSERT INTO ticker VALUES('OSCORP', '01-Apr-11', 22);
INSERT INTO ticker VALUES('OSCORP', '02-Apr-11', 22);
INSERT INTO ticker VALUES('OSCORP', '03-Apr-11', 19);
INSERT INTO ticker VALUES('OSCORP', '04-Apr-11', 18);
INSERT INTO ticker VALUES('OSCORP', '05-Apr-11', 17);
INSERT INTO ticker VALUES('OSCORP', '06-Apr-11', 20);
INSERT INTO ticker VALUES('OSCORP', '07-Apr-11', 17);
INSERT INTO ticker VALUES('OSCORP', '08-Apr-11', 20);
INSERT INTO ticker VALUES('OSCORP', '09-Apr-11', 16);
INSERT INTO ticker VALUES('OSCORP', '10-Apr-11', 15);
INSERT INTO ticker VALUES('OSCORP', '11-Apr-11', 15);
INSERT INTO ticker VALUES('OSCORP', '12-Apr-11', 12);
INSERT INTO ticker VALUES('OSCORP', '13-Apr-11', 11);
INSERT INTO ticker VALUES('OSCORP', '14-Apr-11', 15);
INSERT INTO ticker VALUES('OSCORP', '15-Apr-11', 12);
INSERT INTO ticker VALUES('OSCORP', '16-Apr-11', 16);
INSERT INTO ticker VALUES('OSCORP', '17-Apr-11', 14);
INSERT INTO ticker VALUES('OSCORP', '18-Apr-11', 12);
INSERT INTO ticker VALUES('OSCORP', '19-Apr-11', 11);
INSERT INTO ticker VALUES('OSCORP', '20-Apr-11', 9);

commit;

END;

Module5
Viewing my data
Quick review of our ticker data
Now let's check to see how many rows are in our dataset
```
SELECT count(*) FROM ticker;
SELECT symbol, min(tstamp), max(tstamp), count(*) FROM ticker GROUP BY symbol;
```
You should have 60 rows of data spread across three symbols (ACME, GLOBEX, OSCORP) with 20 rows of data for each ticker symbol. Our ticker data for each symbol starts on April 1 and ends on April 20.
You can view the full data set using the following code:
```
SELECT * FROM ticker ORDER BY symbol, tstamp;
```
Module6
Empty matches

What is the difference between an “empty match” and an “unmatched row"?

This is largely determined by the type of quantifier used as part of the pattern definition. By changing the quantifier it is possible to produce the similar result using both sets of keywords.

To help explore the subtleties of these keywords here is a simple pattern that looks for price decreases - note that we are using the * (star) quantifier to indicate that we are looking for zero or more matches of the DOWN pattern. Therefore, if we run the following code we will in fact return a result set containing all 20 rows for the symbol "GLOBEX":
```
SELECT 
  symbol, 
  tstamp,
  price,
  start_tstamp,
  end_tstamp,
  match_num,
  classifier
FROM ticker 
MATCH_RECOGNIZE ( 
  PARTITION BY symbol ORDER BY tstamp 
  MEASURES FIRST(down.tstamp) AS start_tstamp,
           LAST(down.tstamp) AS end_tstamp,
           match_number() AS match_num,
           classifier() AS classifier
  ALL ROWS PER MATCH SHOW EMPTY MATCHES
  PATTERN (DOWN*)
  DEFINE 
    DOWN AS (price <= PREV(price))
)
WHERE symbol = 'GLOBEX';
```
You can see that the result set contains all 20 rows that make up the data for the symbol “GLOBEX". Rows 1- 3, 9, and 13-15 are identified as unmatched rows - the classifier returns null. These rows appear because we have defined the search requirements for pattern DOWN as being zero or more occurrences.

So what is an empty match?
Based on this we can state that an empty match is a row that does not map explicitly to a pattern variable (in this case DOWN). However, it is worth noting that an empty match does in fact have a starting row and it is assigned a sequential match number, based on the ordinal position of its starting row. The above situation is largely the result of the specific quantifier that we are using: * (asterisk).
Module7
MATCH_NUMBER and EMPTY MATCHES
What's the effect on MATCH_MUMBER?
The MATCH_NUMBER() function counts the empty matches and assigns a number to them - as you can see in the output from the previous code. Therefore, if we omit the empty matches from the results the MATCH_NUMBER() column no longer contains a contiguous set of numbers:
```
SELECT
  symbol,
  tstamp,
  price,
  start_tstamp,
  end_tstamp,
  match_num,
  classifier
FROM ticker 
MATCH_RECOGNIZE ( 
  PARTITION BY symbol ORDER BY tstamp
  MEASURES FIRST(down.tstamp) AS start_tstamp,
           LAST(down.tstamp) AS end_tstamp,
           match_number() AS match_num,
           classifier() AS classifier
  ALL ROWS PER MATCH OMIT EMPTY MATCHES
  PATTERN (DOWN*) 
  DEFINE <br>
    DOWN AS (price <= PREV(price))
)
WHERE symbol = 'GLOBEX';
```
As you can see from the previous output the MATCH_NUMBER() column starts with match number 4 followed by match 6 followed by match 10. Therefore, you need to be very careful if you decide to test for a specific match number within the MATCH_RECOGNIZE section and/or the result set because you might get caught out if you are expecting a contiguous series of numbers.
Module8
Quick Recap
Summary of EMPTY MATCHES
Some patterns permit empty matches such as those using the asterisk quantifier, as shown above. Three mains points to remember when your pattern permits this type of matching:
1. The value of MATCH_NUMBER() is the sequential match number of the empty match.
2. Any COUNT is 0.
3. Any other aggregate, row pattern navigation operation, or ordinary row pattern column reference is null.
The default is always to return empty matches, therefore, it is always a good idea to determine from the start if your pattern is capable of returning an empty match and how you want to manage those rows: include them (SHOW EMPTY MATCHES) or exclude them (OMIT EMPTY MATCHES).

Be careful if you are using MATCH_NUMBER() within the DEFINE section as part of a formula because empty matches increment the MATCH_NUMBER() counter.
Module9
Unmatched Rows
Reporting unmatched rows?
Always useful to view the complete result set - at least when you are running your code against test data sets. Getting all the input rows into your output is relatively easy because you just need to include the phrase ALL ROWS PER MATCH WITH UNMATCHED ROWS. Other than for testing purposes I can’t think of a good use case for using this in production so make sure you check your code before submit your production-ready code to your DBA.
```
SELECT
  symbol,
  tstamp,
  price,
  start_tstamp,
  end_tstamp,
  match_num,
  classifier
FROM ticker 
MATCH_RECOGNIZE ( 
  PARTITION BY symbol ORDER BY tstamp
  MEASURES FIRST(down.tstamp) AS start_tstamp,
           LAST(down.tstamp) AS end_tstamp,
           match_number() AS match_num,
           classifier() AS classifier
  ALL ROWS PER MATCH WITH UNMATCHED ROWS
  PATTERN (DOWN*) 
  DEFINE 
    DOWN AS (price <= PREV(price))
)
WHERE symbol = 'GLOBEX';
```
What about skipping?
Note that if ALL ROWS PER MATCH WITH UNMATCHED ROWS is used with the default skipping behaviour (AFTER MATCH SKIP PAST LAST ROW), then there is exactly one row in the output for every row in the input.
This statement will lead us nicely into yet another tutorial on MATCH_RECOGNIZE which explores the concept of SKIPPING. Taking a quick peak into this other tutorial. . .obviously there are many different types of skipping behaviours that are permitted when using WITH UNMATCHED ROWS. It does, in fact, become possible for a row to be mapped by more than one match and appear in the row pattern output table multiple times. Unmatched rows will appear in the output only once.
Module10
Empty Matches and Unmatched Rows
Can a query contain all three types of match?
Now the big question: Can I have a query where it is possible to have both UNMATCHED ROWS and EMPTY MATCHES? Short answer: Yes.
When the PATTERN clause allows empty matches, nothing in the DEFINE clause can stop the empty matches from happening. However, there are special PATTERN symbols that are called anchors. Anchors work in terms of positions rather than rows. They match a position either at the start or end of a partition, or it used together then across the whole partition.
- ^ matches the position before the first row in the partition
- $ matches the position after the last row in the partition
Finding Empty Matches
Therefore, using these symbols it is possible to create a PATTERN where the keywords SHOW EMPTY MATCHES, OMIT EMPTY MATCHES, and WITH UNMATCHED ROWS all produce different results from the same result set. For example, let’s start with the following:
```
SELECT symbol, tstamp, price, mnm, nmr, cls
FROM ticker MATCH_RECOGNIZE(
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES match_number() AS mnm, 
          count(*) AS nmr, 
          classifier() AS cls
 ALL ROWS PER MATCH SHOW EMPTY MATCHES
 PATTERN ((^A*)|A+)
 DEFINE A AS price > 11)
WHERE symbol = 'GLOBEX'
ORDER BY 1, 2;
```
this shows row 1 as an empty match for the pattern A* because we are matching from the start of the partition. This sets the MATCH_NUMBER() counter to 1. After the empty match the state moves to the pattern A+ for the remainder of the rows. The first match for this pattern starts at row 2 and completes at row 4. The final match in our data set is found at the row containing 15-APR-11.
Ignoring Empty Matches
Therefore, if we omit the empty match at row 1 we only get 4 rows returned as shown here:
```
SELECT symbol, tstamp, price, mnm, nmr, cls
FROM ticker MATCH_RECOGNIZE(
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES match_number() AS mnm, 
          count(*) AS nmr, 
          classifier() AS cls
 ALL ROWS PER MATCH OMIT EMPTY MATCHES
 PATTERN ((^A*)|A+)
 DEFINE A AS price > 11)
WHERE symbol = 'GLOBEX'
ORDER BY 1, 2;
```
returns only 4 rows.
Unmatched rows
Now if we use the last iteration of this example the MATCH_RECOGNIZE statement returns all the rows from the input data. The actual “unmatched rows” are identified as having a NULL match number and NULL classifier. The “empty matches” are identified as having a NULL classifier and in this example the COUNT(*) function returns zero.
```
SELECT symbol, tstamp, price, mnm, nmr, cls
FROM ticker MATCH_RECOGNIZE(
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES match_number() AS mnm, 
          count(*) AS nmr, 
          classifier() AS cls
 ALL ROWS PER MATCH WITH UNMATCHED ROWS
 PATTERN ((^A*)|A+)
 DEFINE A AS price > 11)
WHERE symbol = 'GLOBEX'
ORDER BY 1, 2;
```
returns all 20 rows from our data set
Module11
Summary
I hope this helps to explain how the various output keywords that are part of the ALL ROWS PER MATCH syntax can affect the results you get back. You should now understand why your results contains match_number values that are not contiguous and why classifier can return a NULL value along with specific aggregate functions. I expect the hardest concept to understand is the idea of empty matches.
As I stated earlier it is always a good idea to determine from the start if your pattern is capable of returning an empty match: are you using an asterisk * within the PATTERN clause? Then you can determine how you want to manage those rows: include the empty matches (SHOW EMPTY MATCHES) or exclude them (OMIT EMPTY MATCHES). Be careful if you are using MATCH_NUMBER() within the DEFINE section as part of a formula because empty matches increment the MATCH_NUMBER() counter.
What should be immediately obvious is that in all the examples I have used the default skip behaviour: AFTER MATCH SKIP PAST LAST ROW. There is a separate tutorial that explores the various skip keywords and how they can impact the results returned by your MATCH_RECOGNIZE statement.

matches

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

MATCH_RECOGNIZE - Empty Matches and Unmatched Rows

Overview of SQL Pattern Matching

Finding patterns with SQL

Managing output from MATCH_RECOGNIZE

Reset my tutorial environment

Just in case...

Create my dataset

Creating our ticker data set

Viewing my data

Quick review of our ticker data

Empty matches

What is the difference between an “empty match” and an “unmatched row"?

So what is an empty match?

MATCH_NUMBER and EMPTY MATCHES

What's the effect on MATCH_MUMBER?

Quick Recap

Summary of EMPTY MATCHES

Unmatched Rows

Reporting unmatched rows?

What about skipping?

Empty Matches and Unmatched Rows

Can a query contain all three types of match?

Finding Empty Matches

Ignoring Empty Matches

Unmatched rows

Summary

评论