Regular Expression Best Practices Tony Stubblebine [email protected] www.stubbleblog.com @tonystubblebine

Regex Best Practices

DESCRIPTION

There are two reasons regular expressions are so hard to read and are so error prone. One, the syntax is terse. Two, programmers ignore all normal programming practices. This talk reintroduces white space, structure, and basic verification/testing and then calls them "Best Practices."

Page 1: Regex Best Practices

Regular Expression Best Practices

Tony Stubblebine

[email protected]

www.stubbleblog.com

@tonystubblebine

Page 2: Regex Best Practices

Tabbed indentation is a sin but this isn't?

$string =~ s<(?:http://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.

)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)

){3}))(?::(?:\d+))?)(?:/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F

\d]{2}))|[;:@&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{

2}))|[;:@&=])*))*)(?:\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{

2}))|[;:@&=])*))?)?)|(?:ftp://(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?

:%[a-fA-F\d]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-

fA-F\d]{2}))|[;?&=])*))?@)?(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-

)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?

:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?))(?:/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!

*'(),]|(?:%[a-fA-F\d]{2}))|[?:@&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'()

,]|(?:%[a-fA-F\d]{2}))|[?:@&=])*))*)(?:;type=[AIDaid])?)?)|(?:news:(?:

...................

Abigail, comp.lang.perl.misc, http://aspn.activestate.com/ASPN/Cookbook/Rx/Recipe/59864

Page 3: Regex Best Practices

Best Practices for Any Programming

There are programming fundamentals that are routinely ignored by regular expression writers.

• Put a line break after statements and space between expressions.

• Throw in a comment or two.

• Use subroutines and modules to show structure and avoid duplication.

• Test.

Page 4: Regex Best Practices

Good Code

# Given a URL/URI, fetches it.

# Returns an HTTP::Response object.

sub get {

my $self = shift; my $uri = shift;

$uri = $self->base

? URI->new_abs( $uri, $self->base )

: URI->new( $uri );

return $self->SUPER::get( $uri->as_string, @_ );

}

Page 5: Regex Best Practices

What if we didn't include documentation or whitespace?

sub get{my$self=shift;my$uri=shift;$uri=$self->base?URI->new_abs($uri,$self->base):URI->new($uri);return$self->SUPER::get($uri->as_string,@_);}

Page 6: Regex Best Practices

What if we were also as terse as possible?

So:

• No documentation

• No whitespace

• One-character variable and method names

Page 7: Regex Best Practices

We'd have a regular expression.

sub g{my($s,$u)=@_;$u=$s->b?U->n($u,$s->b):U->q($u);return$s->SUPER::g($u->a,@_);}

Page 8: Regex Best Practices

What do we want from best practices?

Practices that maximize desired goals in certain applications.

Goals of regex best practices:

• Maintainability

• Correctness

• Development speed

Page 9: Regex Best Practices

#1: Use Extended Whitespace

Add indentation, newlines, and comments to regular expressions

Usage /x: m/regex/x

# Look for green or red foxes

$text =~ /(green | red)

\s

fox (es)?

# Allow more than one

/x;

Page 10: Regex Best Practices

Extended Whitespace Gotchas

• Must explicitly ask to match a space with \s or \<SPACE>

• Must escape pound signs, \#
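
Both gotchas can be seen in one small pattern. This is a minimal sketch; the sample string and the name $issue_re are illustrative and not from the talk.

```perl
use strict;
use warnings;

# Under /x a bare space in the pattern is ignored and a bare "#" starts
# a comment, so both have to be written explicitly.
my $issue_re = qr/
    issue \  \# (\d+)   # the word, one literal space, a literal "#", digits
/x;

my ($number) = "see issue #42 for details" =~ $issue_re;
print "issue number: $number\n";
```

Writing the space as \s instead of backslash-space would also accept tabs and newlines, which may or may not be what you want.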

Page 11: Regex Best Practices

Before

What does this match?

$text =~ m/^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$/;

Page 12: Regex Best Practices

After

$text =~ m/

# Match IP addresses like 169.146.10.45

^ # Start of string

([01]?\d\d?|2[0-4]\d|25[0-5])

# Number, 0-255

\.([01]?\d\d?|2[0-4]\d|25[0-5])

# 0-255

\.([01]?\d\d?|2[0-4]\d|25[0-5])

# 0-255

\.([01]?\d\d?|2[0-4]\d|25[0-5])

# 0-255

$/x;

Page 13: Regex Best Practices

#2: Test

You don't know your data. And you have a typo in your regex. Guaranteed surprises on both fronts.

Page 14: Regex Best Practices

Fun Gotcha

What file does this code open?

$file =

"/etc/passwd\0/var/www/index.html";

if ( $file =~ m/^ .* \.html/x ) {

open (FILE, "$file");

}
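
The check passes because .* happily matches the NUL byte and nothing ties ".html" to the real end of the string; older Perls then handed the name to the operating system, which stops reading at the NUL. A hedged sketch of a stricter check follows. The helper name looks_like_html_path is hypothetical, and rejecting every name that contains a NUL is an assumption about what you want.

```perl
use strict;
use warnings;

# Hypothetical helper: accept a path only if it contains no NUL byte
# and really ends in ".html" (\z is the true end of the string).
sub looks_like_html_path {
    my ($path) = @_;
    return 0 if $path =~ /\0/;
    return $path =~ m/ \.html \z /x ? 1 : 0;
}

for my $path ("/var/www/index.html", "/etc/passwd\0/var/www/index.html") {
    print looks_like_html_path($path) ? "accepted\n" : "rejected\n";
}
```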

Page 15: Regex Best Practices

Typical Gotcha

This matches foo.gif

But also... foojpg and jpg.doc

# match image files

m/ \. gif | jpg | jpeg | png $/x

Page 16: Regex Best Practices

Test framework

Write your regular expressions in a place where you can test them.

Build up a list of positive and negative matches. Include the list in your documentation, e.g.:

# matches 800-555-1212 but not

# 800.555.1212 or 800-BETS-OFF
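
As a sketch of how that comment and its examples can live next to the pattern: the phone regex below is an assumption (the talk only gives the example numbers), and $phone_re is an illustrative name.

```perl
use strict;
use warnings;

# Matches 800-555-1212 but not 800.555.1212 or 800-BETS-OFF.
my $phone_re = qr/
    ^ \d{3}     # area code
    - \d{3}     # exchange
    - \d{4}     # line number
    \z          # nothing may follow
/x;

for my $candidate ("800-555-1212", "800.555.1212", "800-BETS-OFF") {
    my $verdict = $candidate =~ $phone_re ? "matches" : "does not match";
    print "$candidate $verdict\n";
}
```

The same three strings that document the pattern double as its first test cases.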

Page 17: Regex Best Practices

Hacker's Test Framework

Your “framework” could be this simple:

foreach my $test (@tests) {

# looks like an image file?

if (

$test =~ m/ \. gif | jpg | jpeg | png $/x ) {

print "Matched on $test\n";

} else {

print "Failed match on $test\n";

}

}

Page 18: Regex Best Practices

Real Tests Are Better

use Test::More tests => 7;   # provides ok(); 3 match cases + 4 fail cases

my @match = ("foo.gif", "foo.bar.jpg", "bar_foo.gif.jpg.png");

my @fail = ("gif.foo", "foo.gif.", "foopng", "foo.jpeg.bar");

sub match {

return $_[0] =~ m/ \. gif | jpg | jpeg | png $/x;

}

foreach my $test (@match) {

ok( match($test), "$test matches");

}

foreach my $test (@fail) {

ok( !match($test), "$test fails to match");

}

Page 19: Regex Best Practices

#3: Use Structure

... as a slow-witted human being I have a very small head and I had better learn to live with it and to respect my limitations and give them full credit, rather than to try to ignore them, for the latter vain effort will be punished by failure.

~ Edsger Dijkstra

Page 20: Regex Best Practices

Breaking up an email regex

We can write an email regex that looks like this:

m/$user\@$domain/

Build your regexes from smaller regexes like this:

$user = qr/\w+/;   # qr//, not "\w+": in a double-quoted string \w collapses to a plain "w"

$domain = qr/\w+\.(\w+\.)*\w\w\w?/i;
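
Put together, the pieces might look like the sketch below. The \A and \z anchors, the non-capturing group, and the sample addresses are additions for illustration; the pattern is deliberately simple and is not a full email validator.

```perl
use strict;
use warnings;

# Small, named pieces compiled with qr// ...
my $user   = qr/\w+/;
my $domain = qr/\w+\.(?:\w+\.)*\w\w\w?/i;

# ... combined into the full pattern. The anchors are an addition for
# this sketch; the slide's pattern is unanchored.
my $email = qr/\A $user \@ $domain \z/x;

for my $candidate ('tony@example.com', 'tony@mail.example.co', 'not-an-email') {
    print "$candidate: ", ($candidate =~ $email ? "ok" : "rejected"), "\n";
}
```

Each qr// piece keeps its own flags when interpolated, so the /x on the outer pattern does not change how $user or $domain behave.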

Page 21: Regex Best Practices

Use Post Processing

It's easier to say a number is <= 255 in code than it is as a regular expression.

# IP Address check

$ip =~ m/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/;

foreach my $num ($1, $2, $3, $4) {

$failure++ unless $num < 256;

}
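
Wrapped into a function, the same idea might look like this. It is a sketch: the name is_ipv4 and the sample inputs are hypothetical, and the \A and \z anchors stand in for the slide's ^ and $.

```perl
use strict;
use warnings;

# Hypothetical helper: the regex checks only the shape (four dotted
# groups of 1-3 digits); plain code checks the 0-255 range.
sub is_ipv4 {
    my ($ip) = @_;
    my @octets = $ip =~ m/\A (\d{1,3}) \. (\d{1,3}) \. (\d{1,3}) \. (\d{1,3}) \z/x
        or return 0;
    for my $num (@octets) {
        return 0 unless $num < 256;
    }
    return 1;
}

print "$_: ", (is_ipv4($_) ? "valid" : "invalid"), "\n"
    for "169.146.10.45", "256.1.1.1", "1.2.3";
```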

Page 22: Regex Best Practices

#4: Good Habits

Regexes are hard to debug, so avoid errors in the first place. Error-avoidance habits:

• Group alternations with parentheses

• Use lazy quantifiers

• Don't use regular expressions

Page 23: Regex Best Practices

Group Alternations

Group your alternations. In this regex, the dot and end of string ($) are not part of your alternation.

m/ \. (gif | jpg | jpeg | png) $/x
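
A quick check of the grouped pattern against the match and fail lists from the testing slide; this is a sketch, and $image_re is an illustrative name.

```perl
use strict;
use warnings;

# The grouped pattern from the slide: the dot and the end-of-string
# anchor now apply to every extension.
my $image_re = qr/ \. (gif | jpg | jpeg | png) $/x;

my @match = ("foo.gif", "foo.bar.jpg", "bar_foo.gif.jpg.png");
my @fail  = ("gif.foo", "foo.gif.", "foopng", "foo.jpeg.bar");

print "$_ matches\n"        for grep { $_ =~ $image_re } @match;
print "$_ fails to match\n" for grep { $_ !~ $image_re } @fail;
```

With the ungrouped pattern, "foo.gif." and "foopng" would both have matched.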

Page 24: Regex Best Practices

Use Lazy Quantifiers

Use lazy quantifiers. It's easier to say when to stop.

<td>.*?</td>

Page 25: Regex Best Practices

Lazy Quantifiers...

Compare that to

#Matches too much

$text = "<td>foo</td><td>bar</td>";

$text =~ m!<td>.*</td>!;

#Matches too little

$text = "<td>foo <b>bar</b> </td>";

$text =~ m/<td>[^<]*/;
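
To see the difference concretely, here is a small sketch that adds capture groups to the greedy and lazy patterns so the matched text can be printed.

```perl
use strict;
use warnings;

my $text = "<td>foo</td><td>bar</td>";

my ($greedy) = $text =~ m!<td>(.*)</td>!;    # runs on to the last </td>
my ($lazy)   = $text =~ m!<td>(.*?)</td>!;   # stops at the first </td>

print "greedy: $greedy\n";   # greedy: foo</td><td>bar
print "lazy:   $lazy\n";     # lazy:   foo
```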

Page 26: Regex Best Practices

Don't use regular expressions

Regular expressions don't deal well with nesting

$text = "<td> foo <table><tr><td>bar</td>...";

$text =~ m!<td> .*? </td>!;

Use something better: an HTML or XML parsing library.
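
A small sketch of what goes wrong. The input is illustrative (the slide's string is elided), and /x is added so the spaces in the pattern are not taken literally.

```perl
use strict;
use warnings;

# Illustrative input: an outer cell that contains a nested table.
my $text = "<td> foo <table><tr><td>bar</td></tr></table> </td>";

# The lazy match stops at the first </td> it meets, which belongs to
# the inner cell, so the outer cell is cut short.
my ($cell) = $text =~ m!<td> (.*?) </td>!x;
print "captured: [$cell]\n";   # captured: [ foo <table><tr><td>bar]
```

A greedy .* does no better once a second top-level cell follows, which is why a real parser is the safer tool here.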

Page 28: Regex Best Practices

#5: Optimize Last

It's more common for regular expressions to be broken than to be slow.

Optimize last. Start with the quantifiers

Page 29: Regex Best Practices

Optimizing Quantifiers

# This is slow because the match backtracks from the end

# of the file

$text = "M1 text i'm looking for M2 thousand more characters to come...";

$text =~ m/M1 (.*) M2/s;

# This is slow because the match looks for </body> at

# (nearly) every position.

$html =~ m!<body> (.*?) </body>!xs;
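
For the first case, one possible adjustment is to switch to a lazy quantifier when the closing marker is expected near the start. The sketch below only shows that both forms capture the same text on this input; the padding length is arbitrary and no timing is claimed.

```perl
use strict;
use warnings;

# The slide's sample text, padded with filler to stand in for the
# "thousand more characters".
my $text = "M1 text i'm looking for M2 " . ("x" x 10_000);

# Greedy: .* first runs to the end of the string, then backs up to M2.
my ($greedy) = $text =~ m/M1 (.*) M2/s;

# Lazy: .*? grows one character at a time and stops at the first M2.
my ($lazy) = $text =~ m/M1 (.*?) M2/s;

print "same capture: $lazy\n" if $greedy eq $lazy;
```

The two forms are only interchangeable when the marker occurs once; with several M2 markers the greedy form captures up to the last one and the lazy form up to the first.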

Page 30: Regex Best Practices

Buy The Book!

Available from Amazon for $9.95

http://bit.ly/regexpr

Thank you for reading!

I'm [email protected]