Why doesn't a non-greedy quantifier sometimes work in Oracle regex? Why doesn't a non-greedy quantifier sometimes work in Oracle regex? oracle oracle

Why doesn't a non-greedy quantifier sometimes work in Oracle regex?


It's a BUG!

You are right that in Perl, 'A=1,B=2,C=3,' =~ /.*B=.*?,/; print $& prints A=1,B=2,

What you have stumbled upon is a bug that still exists in Oracle Database 11g R2. If the exact same regular expression atom (including the quantifier but excluding the greediness modifier) appears twice in a regular expression, both occurrences will have the greediness indicated by the first appearance regardless of the greediness specified by the second one. That this is a bug is clearly demonstrated by these results (here, "the exact same regular expression atom" is [^B]*):

SQL> SELECT regexp_substr('A=1,B=2,C=3,', '[^B]*B=[^Bx]*?,') as good FROM dual;GOOD--------A=1,B=2,SQL> SELECT regexp_substr('A=1,B=2,C=3,', '[^B]*B=[^B]*?,') as bad FROM dual;BAD-----------A=1,B=2,C=3,

The only difference between the two regular expressions is that the "good" one excludes 'x' as a possible match in the second matching list. Since 'x' does not appear in the target string, excluding it should make no difference, but as you can see, removing the 'x' makes a big difference. That has to be a bug.

Here are some more examples from Oracle 11.2: (SQL Fiddle with even more examples)

SELECT regexp_substr('A=1,B=2,C=3,', '.*B=.*?,')  FROM dual; =>  A=1,B=2,C=3,SELECT regexp_substr('A=1,B=2,C=3,', '.*B=.*,')   FROM dual; =>  A=1,B=2,C=3,SELECT regexp_substr('A=1,B=2,C=3,', '.*?B=.*?,') FROM dual; =>  A=1,B=2,SELECT regexp_substr('A=1,B=2,C=3,', '.*?B=.*,')  FROM dual; =>  A=1,B=2,-- Changing second operator from * to +SELECT regexp_substr('A=1,B=2,C=3,', '.*B=.+?,')  FROM dual; =>  A=1,B=2,SELECT regexp_substr('A=1,B=2,C=3,', '.*B=.+,')   FROM dual; =>  A=1,B=2,C=3,SELECT regexp_substr('A=1,B=2,C=3,', '.+B=.+,')   FROM dual; =>  A=1,B=2,C=3,SELECT regexp_substr('A=1,B=2,C=3,', '.+?B=.+,')  FROM dual; =>  A=1,B=2,

The pattern is consistent: the greediness of the first occurrence is used for the second occurrence whether it should be or not.


Looking at the feedback, I hesitate to jump in, but here I go ;-)

According to the Oracle docs, the *? and +? match a "preceding subexpression". For *? specifically:

Matches zero or more occurrences of the preceding subexpression (nongreedyFootref 1). Matches the empty string whenever possible.

To create a subexpression group, use parenthesis ():

Treats the expression within the parentheses as a unit. The expression can be a string or a complex expression containing operators.

You can refer to a subexpression in a back reference.

This will allow you to use greedy and non-greedy (many alternating times actually) in the same regexp, with expected results. For your example:

select regexp_substr('A=1,B=2,C=3,', '(.)*B=(.)*?,') from dual;

To make the point a bit more clear (i hope), this example uses greedy and non-greedy in the same regexp_substr, with different (correct) results depending on where the ? is placed (it does NOT just use the rule for the first subexpression it sees). Also note that the subexpression (\w) will match alphanumerics and underscore only, not @.

-- non-greedy followed by greedy select regexp_substr('1_@_2_a_3_@_4_a', '(\w)*?@(\w)*') from dual;

result: 1_@_2_a_3_

-- greedy followed by non-greedyselect regexp_substr('1_@_2_a_3_@_4_a', '(\w)*@(\w)*?') from dual;

result: 1_@


You've got a really great bounty, so I'm going to try to nail it comprehensively.

You make assumptions in your regular expression handling that are incorrect.

  1. Oracle is NOT compatible with Perl regular expressions, it iscompatible with POSIX. It describes its support for Perl as"Perl-Influenced"
  2. There is an intrinsic syntax conflict around the use of the Perl "*?" in Oracle, if youread that reference the way I do, and Oracle legitimately chooses the POSIX usage
  3. Your description of how perl handles "*?" is not quite right.

Here is a mashup of the options we've discussed. The key to this issue is around case 30

    CASE    SRC                             TEXT               RE                FROM_WHOM                                          RESULT            ------- ------------------------------- ------------------ ----------------- -------------------------------------------------- --------------          1 Egor's original source string   A=1,B=2,C=3,       .*B=.*?,          Egor's original pattern "doesn't work"             A=1,B=2,C=3,            2 Egor's original source string   A=1,B=2,C=3,       .*B=.?,           Egor's "works correctly"                           A=1,B=2,                3 Egor's original source string   A=1,B=2,C=3,       .*B=.+?,          Old Pro comment 1 form 2                           A=1,B=2,                4 Egor's original source string   A=1,B=2,C=3,       .+B=.*?,          Old Pro comment 1 form 1                           A=1,B=2,                5 Egor's original source string   A=1,B=2,C=3,       .*B=.{0,}?,       Old Pro comment 2                                  A=1,B=2,                6 Egor's original source string   A=1,B=2,C=3,       [^B]*B=[^Bx]*?,   Old Pro answer form 1 "good"                       A=1,B=2,                7 Egor's original source string   A=1,B=2,C=3,       [^B]*B=[^B]*?,    Old Pro answer form 2 "bad"                        A=1,B=2,C=3,            8 Egor's original source string   A=1,B=2,C=3,       (.)*B=(.)*?,      TBone answer form 1                                A=1,B=2,                9 TBone answer example 2          1_@_2_a_3_@_4_a    (\w)*?@(\w)*      TBone answer example 2 form 1                      1_@_2_a_3_             10 TBone answer example 2          1_@_2_a_3_@_4_a    (\w)*@(\w)*?      TBone answer example 2 form 2                      1_@                    30 Egor's original source string   A=1,B=2,C=3,       .*B=(.)*?,        Schemaczar Variant to force Perl operation         A=1,B=2,               31 Egor's original source string   A=1,B=2,C=3,       .*B=(.*)?,        Schemaczar Variant of Egor to force POSIX          A=1,B=2,C=3,           32 Egor's original source string   A=1,B=2,C=3,       .*B=.*{0,1}       Schemaczar Applying Egor's  'non-greedy'           A=1,B=2,C=3,           33 Egor's original source string   A=1,B=2,C=3,       .*B=(.)*{0,1}     Schemaczar Another variant of Egor's "non-greedy"  A=1,B=2,C=3,  

I am pretty sure that CASE 30 is what you thought you were writing - that is, you thought the "*?" had a stronger association than the "*" by itself. True for Perl, I guess, but for Oracle (and presumably canonical POSIX) RE's, the "*?" has a lower precedence and associativity than "*". So Oracle reads it as "(.*)?" (case 31) whereas Perl reads it as "(.)*?", that is, case 30.

Note cases 32 and 33 indicate that "*{0,1}" does not work like "*?".

Note that Oracle REGEXP does not work like LIKE, that is, it does not require the match pattern to cover the entire test string. Using the "^" begin and "$" end markers might help you with this as well.

My script:

SET SERVEROUTPUT ON<<DISCREET_DROP>> begin  DBMS_OUTPUT.ENABLE;  for dropit in (select 'DROP TABLE ' || TABLE_NAME || ' CASCADE CONSTRAINTS' AS SYNT  FROM TABS WHERE TABLE_NAME IN ('TEST_PATS', 'TEST_STRINGS')  )  LOOP    DBMS_OUTPUT.PUT_LINE('Dropping via ' || dropit.synt);    execute immediate dropit.synt;  END LOOP;END DISCREET_DROP;/----------------------------------------------------------  DDL for Table TEST_PATS--------------------------------------------------------  CREATE TABLE TEST_PATS    (    RE VARCHAR2(2000),   FROM_WHOM VARCHAR2(50),   PAT_GROUP VARCHAR2(50),   PAT_ORDER NUMBER(9,0)   ) ;/----------------------------------------------------------  DDL for Table TEST_STRINGS--------------------------------------------------------  CREATE TABLE TEST_STRINGS    (    TEXT VARCHAR2(2000),   SRC VARCHAR2(200),   TEXT_GROUP VARCHAR2(50),   TEXT_ORDER NUMBER(9,0)   ) ;/----------------------------------------------------------  DDL for View REGEXP_TESTER_V--------------------------------------------------------  CREATE OR REPLACE FORCE VIEW REGEXP_TESTER_V (CASE_NUMBER, SRC, TEXT, RE, FROM_WHOM, RESULT) AS   select pat_order as case_number,  src, text, re, from_whom,   regexp_substr (text, re) as resultfrom test_pats full outer join test_strings on (text_group = pat_group)order by pat_order, text_order;/REM INSERTING into TEST_PATSSET DEFINE OFF;Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=.*?,','Egor''s original pattern "doesn''t work"','Egor',1);Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=.?,','Egor''s "works correctly"','Egor',2);Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=(.)*?,','Schemaczar Variant to force Perl operation','Egor',30);Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=(.*)?,','Schemaczar Variant of Egor to force POSIX','Egor',31);Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=.*{0,1}','Schemaczar Applying Egor''s  ''non-greedy''','Egor',32);Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=(.)*{0,1}','Schemaczar Another variant of Egor''s "non-greedy"','Egor',33);Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('[^B]*B=[^Bx]*?,','Old Pro answer form 1 "good"','Egor',6);Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('[^B]*B=[^B]*?,','Old Pro answer form 2 "bad"','Egor',7);Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=.+?,','Old Pro comment 1 form 2','Egor',3);Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=.{0,}?,','Old Pro comment 2','Egor',5);Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.+B=.*?,','Old Pro comment 1 form 1','Egor',4);Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('(.)*B=(.)*?,','TBone answer form 1','Egor',8);Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('(\w)*?@(\w)*','TBone answer example 2 form 1','TBone',9);Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('(\w)*@(\w)*?','TBone answer example 2 form 2','TBone',10);REM INSERTING into TEST_STRINGSSET DEFINE OFF;Insert into TEST_STRINGS (TEXT,SRC,TEXT_GROUP,TEXT_ORDER) values ('A=1,B=2,C=3,','Egor''s original source string','Egor',1);Insert into TEST_STRINGS (TEXT,SRC,TEXT_GROUP,TEXT_ORDER) values ('1_@_2_a_3_@_4_a','TBone answer example 2','TBone',2);COLUMN SRC FORMAT A50 WORD_WRAPCOLUMN TEXT  FORMAT A50 WORD_WRAPCOLUMN RE FORMAT A50 WORD_WRAPCOLUMN FROM_WHOM FORMAT A50 WORD_WRAPCOLUMN RESULT  FORMAT A50 WORD_WRAPSELECT * FROM REGEXP_TESTER_V;