Splitting a comma-separated string but ignoring commas in quotes


Try:

    public class Main {
        public static void main(String[] args) {
            String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
            String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
            for (String t : tokens) {
                System.out.println("> " + t);
            }
        }
    }

Output:

    > foo
    > bar
    > c;qual="baz,blurb"
    > d;junk="quux,syzygy"

In other words: split on the comma only if that comma has zero, or an even number of quotes ahead of it.

Or, a bit friendlier for the eyes:

    public class Main {
        public static void main(String[] args) {
            String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";

            String otherThanQuote = " [^\"] ";
            String quotedString = String.format(" \" %s* \" ", otherThanQuote);
            String regex = String.format("(?x) "+ // enable comments, ignore white spaces
                    ",                         "+ // match a comma
                    "(?=                       "+ // start positive look ahead
                    "  (?:                     "+ //   start non-capturing group 1
                    "    %s*                   "+ //     match 'otherThanQuote' zero or more times
                    "    %s                    "+ //     match 'quotedString'
                    "  )*                      "+ //   end group 1 and repeat it zero or more times
                    "  %s*                     "+ //   match 'otherThanQuote'
                    "  $                       "+ // match the end of the string
                    ")                         ", // stop positive look ahead
                    otherThanQuote, quotedString, otherThanQuote);

            String[] tokens = line.split(regex, -1);
            for (String t : tokens) {
                System.out.println("> " + t);
            }
        }
    }

which produces the same as the first example.
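To make the "even number of quotes ahead" rule concrete, here is a rough sketch that applies it without a regex and compares the result with the one-liner above. The class name `EvenQuotesSplit` and the helper `quotesFrom` are made up for illustration, and the quadratic scan is only meant to mirror what the look-ahead asserts, not to be efficient.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EvenQuotesSplit {
    // Hypothetical helper: counts the double quotes in s from index 'from' to the end.
    static int quotesFrom(String s, int from) {
        int count = 0;
        for (int i = from; i < s.length(); i++) {
            if (s.charAt(i) == '"') count++;
        }
        return count;
    }

    // Splits at every comma that has zero or an even number of quotes after it.
    static List<String> split(String line) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == ',' && quotesFrom(line, i + 1) % 2 == 0) {
                tokens.add(line.substring(start, i));
                start = i + 1;
            }
        }
        tokens.add(line.substring(start)); // whatever follows the last split point
        return tokens;
    }

    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        String[] viaRegex = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
        // Compare the hand-rolled rule with the regex-based split.
        System.out.println(split(line).equals(Arrays.asList(viaRegex)));
    }
}
```

Both versions assume the quotes in the input are balanced; an unmatched quote flips the parity for every comma before it.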

EDIT

As mentioned by @MikeFHay in the comments:

I prefer using Guava's Splitter, as it has saner defaults (see discussion above about empty matches being trimmed by String#split()), so I did:

    Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))
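The trimming mentioned in that comment can be seen with plain `String#split` alone: without a limit argument, trailing empty strings are dropped, while a negative limit keeps them. A small sketch follows; the class name `SplitLimitDemo` and the sample line are made up for illustration.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Same regex as above: a comma with zero or an even number of quotes ahead of it.
    static final String REGEX = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    public static void main(String[] args) {
        String line = "a,\"b,c\",,"; // ends with two empty fields

        String[] trimmed = line.split(REGEX);   // trailing empty strings are removed
        String[] kept = line.split(REGEX, -1);  // negative limit keeps them

        System.out.println(trimmed.length + " " + Arrays.toString(trimmed)); // 2 tokens
        System.out.println(kept.length + " " + Arrays.toString(kept));       // 4 tokens
    }
}
```

That is why the examples above pass `-1` as the second argument.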


While I do like regular expressions in general, for this kind of state-dependent tokenization I believe a simple parser (which in this case is much simpler than that word might make it sound) is probably a cleaner solution, in particular with regard to maintainability, e.g.:

    // requires java.util.ArrayList and java.util.List
    String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
    List<String> result = new ArrayList<String>();
    int start = 0; // index where the current token begins
    boolean inQuotes = false;
    for (int current = 0; current < input.length(); current++) {
        if (input.charAt(current) == '\"') {
            inQuotes = !inQuotes; // toggle state
        } else if (input.charAt(current) == ',' && !inQuotes) {
            result.add(input.substring(start, current));
            start = current + 1;
        }
    }
    result.add(input.substring(start)); // add the last token

If you don't care about preserving the commas inside the quotes, you could simplify this approach (no handling of the start index, no last character special case) by replacing the commas inside quotes with something else and then splitting at the remaining commas:

    // requires java.util.Arrays and java.util.List
    String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
    StringBuilder builder = new StringBuilder(input);
    boolean inQuotes = false;
    for (int currentIndex = 0; currentIndex < builder.length(); currentIndex++) {
        char currentChar = builder.charAt(currentIndex);
        if (currentChar == '\"') inQuotes = !inQuotes; // toggle state
        if (currentChar == ',' && inQuotes) {
            builder.setCharAt(currentIndex, ';'); // or '♡', and replace later
        }
    }
    // Note: split(",") drops trailing empty tokens; split(",", -1) would keep them.
    List<String> result = Arrays.asList(builder.toString().split(","));
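The "replace later" variant could look roughly like the sketch below. The class name `PlaceholderSplit` and the choice of `'\u0000'` as the placeholder are my own assumptions; the approach only works if the placeholder never occurs in the real input.

```java
import java.util.ArrayList;
import java.util.List;

class PlaceholderSplit {
    // Assumption: this character never appears in the input data.
    static final char PLACEHOLDER = '\u0000';

    static List<String> split(String input) {
        StringBuilder builder = new StringBuilder(input);
        boolean inQuotes = false;
        for (int i = 0; i < builder.length(); i++) {
            char c = builder.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes; // toggle state
            } else if (c == ',' && inQuotes) {
                builder.setCharAt(i, PLACEHOLDER); // hide the quoted comma from split
            }
        }
        List<String> result = new ArrayList<>();
        // -1 keeps trailing empty tokens, matching the regex examples.
        for (String token : builder.toString().split(",", -1)) {
            result.add(token.replace(PLACEHOLDER, ',')); // restore the quoted commas
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        for (String token : split(input)) {
            System.out.println("> " + token);
        }
    }
}
```

For the sample input this yields the same four tokens as the regex version, with the commas inside the quotes intact.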