Regular expression for syntax highlighting attributes in HTML tag


I may have found a possible solution.

It is not perfect because, as @skamazin said in the comments, if you are trying to capture an arbitrary number of attributes you have to repeat the pattern that matches an attribute once for each attribute you want to allow, so the regex caps the number of attributes it can handle.

The regex is pretty scary, but it may work for your goal. It may be possible to simplify it a bit, or you may have to adjust some things.

For only one attribute it looks like this:

(<)([a-zA-Z0-9:.]+)(?:(?: ((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))

DEMO

For more attributes, append this fragment once for each additional attribute you want to allow:

(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)?(?: |>))?

So, for example, if you want to allow a maximum of 3 attributes, your regex will look like this:

(<)([a-zA-Z0-9:.]+)(?:(?: ((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)?(?: |>))?(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)?(?: |>))?

DEMO
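
Since the long pattern is just the one-attribute regex plus repeated copies of the optional fragment, it can be assembled programmatically instead of by hand. The following is only a sketch under some assumptions: it uses JavaScript (whose modern engines support the lookbehind used here), and the helper names buildTagRegex and attributeNames are made up for illustration.

```javascript
// Assemble the capped-attribute regex from the pieces shown above.
// buildTagRegex and attributeNames are hypothetical helper names.
function buildTagRegex(maxAttrs) {
  const open = '(<)([a-zA-Z0-9:.]+)';
  // First attribute: required, and it must carry a value.
  const first = '(?:(?: ((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)(?: |>))';
  // Each further attribute: optional as a whole, value optional too.
  const extra = '(?:(?:((?<= )[^ ]+?(?==| |>)))(?:=[^ >]+)?(?: |>))?';
  return new RegExp(open + first + extra.repeat(Math.max(0, maxAttrs - 1)));
}

// Return the captured attribute names (groups 3 and up), skipping unused slots.
function attributeNames(tag, maxAttrs) {
  const match = buildTagRegex(maxAttrs).exec(tag);
  return match === null ? [] : match.slice(3).filter((name) => name !== undefined);
}

console.log(attributeNames('<div class="a" id="b" hidden>', 3)); // [ 'class', 'id', 'hidden' ]
console.log(attributeNames('<x a=1 b=2 c=3>', 2));               // [ 'a', 'b' ]
```

The second call shows the cap in action: with room for only two attributes, the third one is simply not captured.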

Tell me if it suits you and if you need further details.


I'm unfamiliar with sublimetext or react-jsx but this to me sounds like a case of "Regex is your tool, not your solution."

A solution that uses regex as a tool for this would be something like this JsFiddle (note that the regex there is slightly obfuscated because of HTML entities like &gt; for > etc.).

Code that does the actual replacing:

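blabla.replace(/(<!--(?:[^-]|-(?!->))*-->)|(<(?:(?!>).)+>)|(\{[^\}]+\})/g, function(m, c, t, a) {
    // c: HTML comment, t: tag, a: curly-brace block; only one of them is defined per match
    if (c != undefined)
        return '<span class="comment">' + c + '</span>';
    if (t != undefined)
        // second pass: wrap each attribute name (and its "=") inside the tag
        return '<span class="tag">' + t.replace(/ [a-z_-]+=?/ig, '<span class="attr">$&</span>') + '</span>';
    if (a != undefined)
        // second pass: wrap single-quoted strings inside the braces
        return a.replace(/'[^']+'/g, '<span class="quoted">$&</span>');
});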

So here I'm first capturing the separate types of groups (comments, tags and curly-brace blocks), following this general pattern adapted for this use-case of HTML with curly-brace blocks. Those captures are fed to a function that determines which type of capture we're dealing with and further replaces subgroups within that capture with its own .replace() calls.
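
To make the remark about HTML entities concrete, here is a rough sketch of how the same two-stage replace could look when the source is escaped first, so the regex has to look for &lt; and &gt; instead of < and >. This is an assumption about what the fiddle does, not a copy of it; escapeHtml and highlight are hypothetical names.

```javascript
// Escape the characters that a browser would otherwise parse as markup.
function escapeHtml(text) {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Stage 1 splits the escaped text into comments, tags and curly-brace blocks;
// stage 2 runs a smaller .replace() inside each tag or brace block.
function highlight(source) {
  const tokens = /(&lt;!--(?:[^-]|-(?!-&gt;))*--&gt;)|(&lt;(?:(?!&gt;).)+&gt;)|(\{[^\}]+\})/g;
  return escapeHtml(source).replace(tokens, (match, comment, tag, braces) => {
    if (comment !== undefined) {
      return '<span class="comment">' + comment + '</span>';
    }
    if (tag !== undefined) {
      const inner = tag.replace(/ [a-z_-]+=?/ig, '<span class="attr">$&</span>');
      return '<span class="tag">' + inner + '</span>';
    }
    // Only the brace alternative is left at this point.
    return braces.replace(/'[^']+'/g, '<span class="quoted">$&</span>');
  });
}

console.log(highlight('<a href="x">'));
// <span class="tag">&lt;a<span class="attr"> href=</span>"x"&gt;</span>
```

Note that the attribute pattern is deliberately loose: any space followed by letters inside a tag is treated as an attribute name, including words inside a quoted value.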

There's really no other reliable way to do this. I can't tell you how this translates to your environment but maybe this is of help.


Regex alone doesn't seem to be good enough, but since you're working with Sublime's scripting here, there's a way to simplify both the code and the process. Keep in mind, I'm a vim user and not familiar with Sublime's internals. Also, I usually work with JavaScript regexes, not PCREs (which seems to be the format used by Sublime, or the closest to it).

The idea is as follows:

  • use a regex to get the tag, attributes (in a string) and contents of the tag
  • use capture groups to do further processing and matching if necessary

In this case, I made this regex:

<([a-z]+)\ ?([a-z]+=\".*?\"\ ?)?>([.\n\sa-z]*)(<\/\1>)?

It starts by finding an opening tag and creates a capture group for the tag name. If it finds a space it proceeds and matches the bulk of the attributes. Inside the \"...\" pattern I could have used \"[^\"]*?\" to match only non-quote characters, but I purposefully match any character with .*? so that the match can run past the intermediate quotes, up to the quote right before the >. This captures all the attributes as one string, which we can process later. It then matches any text in between the tags and finally matches the closing tag.

It creates 4 capture groups:

  1. tag name
  2. attribute string
  3. tag contents
  4. closing tag

As you can see in this demo, if there is no closing tag we get no capture group for it, and the same goes for attributes, but we always get a capture group for the contents of the tag. This can be a problem in general (since we can't assume that a captured feature will be in the same group), but it isn't here: in the conflict case where we get no attributes and no content, the 2nd capture group is empty, so we can just assume it means no attributes, and the lack of a 3rd group speaks for itself. If there's nothing to parse, nothing can be parsed wrongly.
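
As a rough illustration of this first pass, here is the same regex run in JavaScript. This is a sketch, not Sublime code, and splitTag is a made-up name. One difference from the demo: a JavaScript match keeps the group numbers fixed and reports a group that did not take part as undefined, so the attributes are always in group 2 and the closing tag in group 4.

```javascript
// First pass: split a tag into name, raw attribute string, contents and closing tag.
const TAG = /<([a-z]+)\ ?([a-z]+=\".*?\"\ ?)?>([.\n\sa-z]*)(<\/\1>)?/;

function splitTag(source) {
  const match = TAG.exec(source);
  if (match === null) return null;
  return {
    name: match[1],
    attributes: match[2], // undefined when the tag has no attributes
    contents: match[3],   // always a string, possibly empty
    closing: match[4],    // undefined when there is no closing tag
  };
}

console.log(splitTag('<div class="a" id="b">hello world</div>'));
// name 'div', attributes 'class="a" id="b"', contents 'hello world', closing '</div>'
console.log(splitTag('<br>'));
// name 'br', attributes undefined, contents '', closing undefined
```

Keep in mind that inside [.\n\sa-z] the dot is a literal dot, so as written the contents group only covers lowercase letters, whitespace and dots; the class would need widening for real text.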

Now to parse the attributes, we can simply do it with:

([a-z]+=\"[^\"]*?\")

Demo here. This gives us the attributes exactly. If Sublime's scripting lets you get this far, it will certainly allow further processing if necessary. You can of course always use something like this:

(([a-z]+)=\"([^\"]*?)\")

which will provide capture groups for the attribute as a whole and its name and value separately.
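
The second pass can be sketched the same way: run the attribute regex globally over the attribute string captured in the first pass and collect the three groups. Again this is JavaScript for illustration only, and parseAttributes is a hypothetical name.

```javascript
// Second pass: break the raw attribute string into whole / name / value triples.
const ATTRIBUTE = /(([a-z]+)=\"([^\"]*?)\")/g;

function parseAttributes(attributeString) {
  const result = [];
  // matchAll needs the g flag and yields one match object per attribute.
  for (const match of attributeString.matchAll(ATTRIBUTE)) {
    result.push({ whole: match[1], name: match[2], value: match[3] });
  }
  return result;
}

console.log(parseAttributes('class="a" id="b"'));
// [ { whole: 'class="a"', name: 'class', value: 'a' },
//   { whole: 'id="b"', name: 'id', value: 'b' } ]
```

Each entry carries enough information to highlight the name and the value separately.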

Using this approach, you should be able to parse the tags well enough for highlighting in 2-3 passes and send off the contents for highlighting to whatever highlighter you want (or just highlight it as plaintext in whatever fancy way you want).