Regular Expression Matching First Non-Repeated Character Regular Expression Matching First Non-Repeated Character python python

Regular Expression Matching First Non-Repeated Character


Well let's take your tooth example - here is what the regex-engine does (a lot simplified for better understanding)

Start with t then look ahead in the string - and fail the lookahead, as there is another t.

tooth^  °

Next take o, look ahead in the string - and fail, as there is another o.

tooth ^°

Next take the second o, look ahead in the string - no other o present - match it, return it, work done.

tooth  ^

So your regex doesn't match the first unrepeated character, but the first one, that has no further repetitions towards the end of the string.


Sebastian's answer already explains pretty well why your current attempt doesn't work.

.NET

Since you're revo is interested in a .NET flavor workaround, the solution becomes trivial:

(?<letter>.)(?!.*?\k<letter>)(?<!\k<letter>.+?)

Demo link

This works because .NET supports variable-length lookbehinds. You can also get that result with Python (see below).

So for each letter (?<letter>.) we check:

  • if it's repeated further in the input (?!.*?\k<letter>)
  • if it was already encountered before (?<!\k<letter>.+?)
    (we have to skip the letter we're testing when going backwards, hence the +).

Python

The Python regex module also supports variable-length lookbehinds, so the regex above will work with a small syntactical change: you need to replace \k with \g (which is quite unfortunate as with this module \g is a group backreference, whereas with PCRE it's a recursion).

The regex is:

(?<letter>.)(?!.*?\g<letter>)(?<!\g<letter>.+?)

And here's an example:

$ pythonPython 2.7.10 (default, Jun  1 2015, 18:05:38)[GCC 4.9.2] on cygwinType "help", "copyright", "credits" or "license" for more information.>>> import regex>>> regex.search(r'(?<letter>.)(?!.*?\g<letter>)(?<!\g<letter>.+?)', 'tooth')<regex.Match object; span=(4, 5), match='h'>

PCRE

Ok, now things start to get dirty: since PCRE doesn't support variable-length lookbehinds, we need to somehow remember whether a given letter was already encountered in the input or not.

Unfortunately, the regex engine doesn't provide random access memory support. The best we can get in terms of generic memory is a stack - but that's not sufficient for this purpose, as a stack only lets us access its topmost element.

If we accept to restrain ourselves to a given alphabet, we can abuse capturing groups for the purpose of storing flags. Let's see this on a limited alphabet of the three letters abc:

# Anchor the pattern\A# For each letter, test to see if it's duplicated in the input string(?(?=[^a]*+a[^a]*a)(?<da>))(?(?=[^b]*+b[^b]*b)(?<db>))(?(?=[^c]*+c[^c]*c)(?<dc>))# Skip any duplicated letter and throw it away[a-c]*?\K# Check if the next letter is a duplicate(?:  (?(da)(*FAIL)|a)| (?(db)(*FAIL)|b)| (?(dc)(*FAIL)|c))

Here's how that works:

  • First, the \A anchor ensures we'll process the input string only once
  • Then, for each letter X of our alphabet, we'll set up a is duplicate flag dX:
    • The conditional pattern (?(cond)then|else) is used there:
      • The condition is (?=[^X]*+X[^X]*X) which is true if the input string contains the letter X twice.
      • If the condition is true, the then clause is (?<dX>), which is an empty capturing group that will match the empty string.
      • If the condition is false, the dX group won't be matched
    • Next, we lazily skip valid letters from our alphabet: [a-c]*?
    • And we throw them out in the final match with \K
    • Now, we're trying to match one letter whose dX flag is not set. For this purpose, we'll do a conditional branch: (?(dX)(*FAIL)|X)
      • If dX was matched (meaning that X is a duplicated character), we (*FAIL), forcing the engine to backtrack and try a different letter.
      • If dX was not matched, we try to match X. At this point, if this succeeds, we know that X is the first non-duplicated letter.

That last part of the pattern could also be replaced with:

(?:  a (*THEN) (?(da)(*FAIL))| b (*THEN) (?(db)(*FAIL))| c (*THEN) (?(dc)(*FAIL)))

Which is somewhat more optimized. It matches the current letter first and only then checks if it's a duplicate.

The full pattern for the lowercase letters a-z looks like this:

# Anchor the pattern\A# For each letter, test to see if it's duplicated in the input string(?(?=[^a]*+a[^a]*a)(?<da>))(?(?=[^b]*+b[^b]*b)(?<db>))(?(?=[^c]*+c[^c]*c)(?<dc>))(?(?=[^d]*+d[^d]*d)(?<dd>))(?(?=[^e]*+e[^e]*e)(?<de>))(?(?=[^f]*+f[^f]*f)(?<df>))(?(?=[^g]*+g[^g]*g)(?<dg>))(?(?=[^h]*+h[^h]*h)(?<dh>))(?(?=[^i]*+i[^i]*i)(?<di>))(?(?=[^j]*+j[^j]*j)(?<dj>))(?(?=[^k]*+k[^k]*k)(?<dk>))(?(?=[^l]*+l[^l]*l)(?<dl>))(?(?=[^m]*+m[^m]*m)(?<dm>))(?(?=[^n]*+n[^n]*n)(?<dn>))(?(?=[^o]*+o[^o]*o)(?<do>))(?(?=[^p]*+p[^p]*p)(?<dp>))(?(?=[^q]*+q[^q]*q)(?<dq>))(?(?=[^r]*+r[^r]*r)(?<dr>))(?(?=[^s]*+s[^s]*s)(?<ds>))(?(?=[^t]*+t[^t]*t)(?<dt>))(?(?=[^u]*+u[^u]*u)(?<du>))(?(?=[^v]*+v[^v]*v)(?<dv>))(?(?=[^w]*+w[^w]*w)(?<dw>))(?(?=[^x]*+x[^x]*x)(?<dx>))(?(?=[^y]*+y[^y]*y)(?<dy>))(?(?=[^z]*+z[^z]*z)(?<dz>))# Skip any duplicated letter and throw it away[a-z]*?\K# Check if the next letter is a duplicate(?:  a (*THEN) (?(da)(*FAIL))| b (*THEN) (?(db)(*FAIL))| c (*THEN) (?(dc)(*FAIL))| d (*THEN) (?(dd)(*FAIL))| e (*THEN) (?(de)(*FAIL))| f (*THEN) (?(df)(*FAIL))| g (*THEN) (?(dg)(*FAIL))| h (*THEN) (?(dh)(*FAIL))| i (*THEN) (?(di)(*FAIL))| j (*THEN) (?(dj)(*FAIL))| k (*THEN) (?(dk)(*FAIL))| l (*THEN) (?(dl)(*FAIL))| m (*THEN) (?(dm)(*FAIL))| n (*THEN) (?(dn)(*FAIL))| o (*THEN) (?(do)(*FAIL))| p (*THEN) (?(dp)(*FAIL))| q (*THEN) (?(dq)(*FAIL))| r (*THEN) (?(dr)(*FAIL))| s (*THEN) (?(ds)(*FAIL))| t (*THEN) (?(dt)(*FAIL))| u (*THEN) (?(du)(*FAIL))| v (*THEN) (?(dv)(*FAIL))| w (*THEN) (?(dw)(*FAIL))| x (*THEN) (?(dx)(*FAIL))| y (*THEN) (?(dy)(*FAIL))| z (*THEN) (?(dz)(*FAIL)))

And here's the demo on regex101, complete with unit tests.

You can expand on this pattern if you need a larger alphabet, but obviously this is not a general-purpose solution. It's primarily of educational interest and should not be used for any serious application.


For other flavors, you may try to tweak the pattern to replace PCRE features with simpler equivalents:

  • \A becomes ^
  • X (*THEN) (?(dX)(*FAIL)) can be replaced with (?(dX)(?!)|X)
  • You may throw away the \K and replace the last noncapturnig group (?:...) with a named group like (?<letter>...) and treat its content as the result.

The only required but somewhat unusual construct is the conditional group (?(cond)then|else).


Regular expressions are not optimal for the task even if you use alternative implementations of re that do not limit lookbehind by fixed length strings (such as Matthew Barnett's regex).

The easiest way is to count occurrences of letters and print the first one with frequency equal to 1:

import sysfrom collections import Counter, OrderedDict# Counter that remembers that remembers the order entries were addedclass OrderedCounter(Counter, OrderedDict):    pass# Calling next() once only gives the first entryfirst=nextwith open(sys.argv[1], 'r') as test_cases:    for test in test_cases:        lettfreq = OrderedCounter(test)        print(first((l for l in lettfreq if lettfreq[l] == 1)))