When Not to Use Regexps

Here are three situations I've seen come up where it makes sense to avoid regexps.

  1. A mail parsing example

    I've got a program that parses MIME messages. MIME is an email format commonly used for HTML mail, attachments, etc. Such messages have a boundary string, defined in the headers, that separates the parts. In the body, the boundary from the headers is supposed to appear with two hyphens added to the start (and the final boundary gets two added to the end as well).

    My original code decided to be flexible in what it accepted: it used one regexp to split the message into lines, then another regexp that searched for boundaries, treating the initial hyphens as optional. I found that when dealing with very large attachments it became unacceptably slow. In trying to find a fix I discovered that by using index() with a fixed search string, plus substr(), instead of the regexps, I got a substantial speed increase, at the cost of not being generous about broken boundaries.

    On a mail message with a 10MB attachment (14MB once MIME encoded), my original code takes about five and a half hours to extract the attachment. The new method processes the same message on the same hardware in about 5 seconds.

    This message parser, called mimedump, is available in the scripts section of CPAN. The new parsing method is the default; the old one is available as a command line option (in case someone needs to deal with mail with broken boundaries).

    Here are some code excerpts to illustrate the changes.

       # Original method
       while($$messref =~ s/\A(.*\n)//) {
         $line = $1;
         if($line =~ /^-*\Q$boundary\E-*\s*$/) {
           if (length($part)) {
             process(\$part);
             $part = '';
           }
         } else {
           $part .= $line;
         }
       } # while messref
       # New method
       while(($pos = index($$messref, "\n--$boundary")) >= 0) {
         $part = substr($$messref, 0, $pos+1, '');
         $$messref =~ s/^-*\Q$boundary\E-*\s*$//m;
         if (length($part)) {
           process(\$part);
           $part = '';
         }
       } # while pos = index

    This is also a good example of the maxim, "fewer perl instructions run is the faster way." The new method is only three lines shorter, but its loop runs four times on my example message, while the original loop runs thousands of times. Worse, each s/\A(.*\n)// in the original copies the entire multi-megabyte remainder of the message just to remove one line, so its cost grows roughly quadratically with message size.
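
To see the index()/substr() method in runnable form, here is a self-contained sketch. The message, the "XYZ" boundary, and the part-collecting code are all made up for illustration; mimedump itself does considerably more:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A tiny, made-up MIME-ish body; "XYZ" stands in for a real boundary.
my $boundary = 'XYZ';
my $message  = "preamble\n--XYZ\npart one\n--XYZ\npart two\nline two\n--XYZ--\n";

# index()/substr() method, collecting parts instead of calling process().
my @parts;
my $messref = \$message;
my $pos;
while (($pos = index($$messref, "\n--$boundary")) >= 0) {
    my $part = substr($$messref, 0, $pos + 1, '');   # remove text up to the boundary
    $$messref =~ s/^-*\Q$boundary\E-*\s*$//m;        # strip the boundary line itself
    push @parts, $part if length $part;
}

printf "%d parts found\n", scalar @parts;
```

Note the /m on the substitution: since it runs against the whole remaining message rather than a single extracted line, ^ and $ need to match at internal line boundaries.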

  2. Delimited text

    There are several forms of delimited text; comma-separated values is one of the more common. It seems easy enough to use split with a fixed string to parse this, until the various edge cases come up: fields with embedded commas, empty fields, escaped quotes, embedded newlines.

    Example: This is the first field,"This, now,
    is the second field",,"Field ""4"""

    People then often turn to complicated regular expressions, and too often they break on one or more of the edge cases while reinventing the wheel. The Text::ParseWords module that comes with Perl has a quotewords() function that can do the right thing. Here are two examples, from Mastering Regular Expressions (Friedl, first edition).

     use Text::ParseWords;

     my $line = qq{This is the first field,"This, now,\nis the second field",,"Field ""4"""};
     # _MRE_, first ed, page 205
     @fields = (); # initialize @fields to be empty
     while( $line =~ m/"([^"\\]*(\\.[^"\\]*)*)",|([^,]+),?|,/g) {
       push(@fields, defined($1) ? $1 : $3); # add the just-matched field
     }
     push(@fields, undef) if $line =~ m/,$/; # account for an empty last field
     # _MRE_, first ed, page 207
     @columns = quotewords(',', 0, $line);
     # Let's test them
     printf("Original: %s\n\n", $line);
     my $i;
     for ($i = 0; $i < @fields; $i ++) {
       printf("fields[%d]: %s\n", $i+1, $fields[$i]);
     }
     print "\n";
     for ($i = 0; $i < @columns; $i ++) {
       printf("columns[%d]: %s\n", $i+1, $columns[$i]);
     }
     print "\n";
    Original: This is the first field,"This, now,
     is the second field",,"Field ""4"""
    fields[1]: This is the first field
    fields[2]: This, now,
     is the second field
    fields[3]:
    fields[4]: "Field ""4"""
    columns[1]: This is the first field
    columns[2]: This, now,
     is the second field
    columns[3]:
    columns[4]: Field "4"

    Oops, Friedl's method doesn't deal with doubling quotes to escape them. It is also easy to see that the regexp way is going to be a lot harder to maintain.
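
As a quick check of the module approach on a fresh input (the data here is made up), quotewords() handles an embedded comma without any regexp on our part:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Made-up delimited data with an embedded comma in the second field.
my $row  = q{alpha,"beta, with comma",gamma};
my @cols = quotewords(',', 0, $row);   # 0 means strip the quotes

printf "%d columns\n", scalar @cols;
print "$_\n" for @cols;
```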

  3. Full Fledged Parsing

    words : sentences :: regexps : grammars

    Sometimes you just need a lot more than a regular expression can provide. The famous example is that regular expressions cannot match an arbitrary depth of balanced parentheses. (Strictly speaking, that is. What perl calls "regular expressions" are more powerful, but the hoops one needs to jump through to do this make it a parlor trick, not something sane to use in real code.) A grammar can do that easily. Often the task is divided into two parts: a "lexer" that finds words, and a "grammar" that puts them together. Lexers are sometimes written with regexps; traditionally this is very common on Unix. The perl parser itself is built out of a hand coded lexer, toke.c, and a yacc built parser, perly.y, presumably because a regexp lexer isn't powerful enough to deal with the extreme requirements of tokenizing perl.
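
For the curious, here is a sketch of the parlor trick: perl's (??{ }) construct defers part of a pattern until match time, which lets a qr// refer to itself. (The pattern below only recognizes balanced parentheses; it is an illustration, not something I'd put in real code.)

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A self-referential pattern: (??{ $balanced }) is evaluated at match
# time, by which point $balanced holds the compiled pattern itself.
my $balanced;
$balanced = qr/\( (?: [^()]+ | (??{ $balanced }) )* \)/x;

for my $try ('(a(b)(c(d)))', '(a(b') {
    printf "%-14s %s\n", $try,
        $try =~ /\A$balanced\z/ ? 'balanced' : 'NOT balanced';
}
```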

    yacc is an acronym for "yet another compiler compiler". Normally it is used with the regexp tool lex to build the "lexer" part. The job of a lexer is to identify "tokens": typically words, but also numbers, operators, etc. The grammar then explains how those tokens can be put together in a syntactically correct manner.

    The most used perl grammar handler is Parse::RecDescent, which generates "top-down recursive-descent text parsers" from a grammar.

    I'm not going to get into the full usage of any of the grammar tools, but I'll present a sample grammar to show the usefulness of those tools. This grammar defines the syntax for a function call. The call can take an argument list of nothing, or of one or more integer or floating point numbers, quoted strings, variables, or other function calls, which themselves can have argument lists.

    A comment on the syntax of the grammar: identifiers appear on the left hand side of a colon when being defined, and appear unquoted on the right hand side when used. Regular expressions are /slash quoted/ and fixed strings are 'single quoted'. Concatenation is the default, alternation is indicated by a vertical bar, grouping is done with parentheses, and zero or more repetitions are indicated by a star; that much is very like regular expressions. Whitespace is for readability, except that new rules must start on new lines. "Terminal" statements don't use any of the grammar identifiers on the right hand side; "non-terminal" ones do.

        # Non-terminal statements
        functioncall : identifier '(' arglist ')'
        arglist : empty
                | ( arg ',' )* arg
        arg : number
            | string
            | variable
            | functioncall
        variable : identifier
        # Terminal statements
        identifier : /[a-z][a-z0-9]*/
        number : /[0-9]+([.][0-9]*)?/
        string : /"(\\[\\"]|[^"\\])*"/
        empty : ''

    Note how the statements mix quoted strings, regexps, and regexp metacharacters used in a new way. Note also that the balanced parentheses requirement here (function calls as arguments to function calls, to any depth) means regexps alone cannot do this.
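
To make that concrete without pulling in Parse::RecDescent, here is a hand-written recursive-descent recognizer for the grammar above (a sketch of my own: the sub names are invented, it only accepts or rejects rather than building a parse tree, and like the grammar it allows no whitespace between tokens). Each rule becomes a subroutine, and the mutual recursion between functioncall() and arg() is exactly what handles arbitrarily nested calls:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# \G with the /gc modifiers tracks the current parse position in $input;
# a failed /gc match leaves that position alone so we can try alternatives.
my $input;

sub functioncall {
    return 0 unless $input =~ /\G[a-z][a-z0-9]*/gc;   # identifier
    return 0 unless $input =~ /\G\(/gc;
    return 0 unless arglist();
    return $input =~ /\G\)/gc ? 1 : 0;
}

sub arglist {
    return 1 unless arg();            # empty argument list is allowed
    while ($input =~ /\G,/gc) {       # ( arg ',' )* arg
        return 0 unless arg();
    }
    return 1;
}

sub arg {
    my $save = pos($input) // 0;      # remember where to backtrack to
    return 1 if $input =~ /\G[0-9]+([.][0-9]*)?/gc;    # number
    return 1 if $input =~ /\G"(\\[\\"]|[^"\\])*"/gc;   # string
    pos($input) = $save;
    return 1 if functioncall();       # nested call: the recursive part
    pos($input) = $save;
    return 1 if $input =~ /\G[a-z][a-z0-9]*/gc;        # variable
    pos($input) = $save;
    return 0;
}

sub parses {                          # true if the whole string is one functioncall
    ($input) = @_;
    pos($input) = 0;
    return functioncall() && pos($input) == length($input);
}

print parses($_) ? "ok: $_\n" : "no: $_\n"
    for ('f(g(h(x)),"quoted arg",42)', 'f(oops');
```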


by Benjamin Elijah Griffin / Eli the Bearded, Feb 2004