Transcript
C++0x - Regular Expressions (hjemmesider.diku.dk/~jyrki/Course/Generic...)

Regular Expressions C++0x Sources

C++0x Regular Expressions

Simon Andreas Frimann Lund

Datalogisk Institut, Københavns Universitet

May 16, 2008

Regular Expressions

Regular Expression, regex or regexp for short. "A set of characters, metacharacters, and operators that define a string or group of strings in a search pattern."

• "regex" (simple regex matching the text "regex")

• "[-+]?([0-9]*.[0-9]+|[0-9]+)" (simple regular expression matching... what?)

The set of metacharacters, operators and other features is usually called a regex flavor.
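
To make the second pattern concrete, here is a minimal sketch that tries it with std::regex from the <regex> header (assuming a C++11 compiler). The function name is made up for illustration. It also assumes that the dot in the pattern was meant to be escaped as \. so that it stands for a literal decimal point; as printed on the slide, the unescaped dot would match any character.

```cpp
#include <regex>
#include <string>

// Returns true when the whole of `text` is a decimal number such as
// "42", "-3.14" or "+.5".
// Assumption: the dot in the slide's pattern is escaped (\.), so it
// matches a literal decimal point rather than any character.
bool looks_like_number(const std::string& text) {
    static const std::regex number(R"([-+]?([0-9]*\.[0-9]+|[0-9]+))");
    return std::regex_match(text, number);  // the pattern must cover all of text
}
```

Under that assumption the pattern accepts optionally signed integers and decimals, and rejects inputs such as "3." because at least one digit must follow the decimal point.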

Flavors

There exist 15+ popular regex flavors in various languages and tools, of which only two are standardized:

• The POSIX standard Basic Regex / Extended Regex (BRE / ERE).

• GNU BRE / ERE, GNU extensions of the standard used in GNU tools such as grep.

• The languages D, Haskell, .NET, Java, ECMA (JavaScript), Python, Ruby all have their own flavors.

• The languages Perl and Tcl have their own flavors as built-in language constructs.

• Libraries such as PCRE (used in PHP), Boost.Regex, Boost.Xpressive, Qt/QRegExp each have their own flavor.

• And the list goes on...

How are these tasty flavors implemented?

Implementations

Basically all the different flavors are implemented with an NFA (non-deterministic finite automaton) or a DFA (deterministic finite automaton). The table gives the machine size for an expression of M characters, and the pattern recognition complexity for an N character sequence with S states.

Algorithm                            Machine size      Complexity
DFA                                  O(2^M)            O(N)
bit-parallel non-backtracking NFA    O(M) ∨ O(2^M)     O((1 + S/B) N)
non-backtracking NFA                 O(M) ∨ O(2^M)     O(SN)
backtracking NFA                     O(M)              O(2^N)

Currently many different implementations for C++ exist, some procedural, others object oriented. They support various flavors, but most are simply object-oriented wrappers for C libraries.
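
The O(N) entry for the DFA row can be illustrated with a small hand-built automaton. This is only a sketch: the pattern [-+]?[0-9]+ was picked for illustration, the state and function names are invented here, and a real regex engine would generate such a machine from the expression instead of having it written by hand.

```cpp
#include <string>

// Hand-built DFA for the pattern [-+]?[0-9]+ (an optionally signed integer).
// State and function names are hypothetical, chosen for this sketch.
enum class State { Start, Sign, Digits, Reject };

// Transition function: the next state depends only on the current state
// and the current input character.
State step(State s, char c) {
    const bool digit = (c >= '0' && c <= '9');
    switch (s) {
        case State::Start:
            if (c == '-' || c == '+') return State::Sign;
            return digit ? State::Digits : State::Reject;
        case State::Sign:
        case State::Digits:
            return digit ? State::Digits : State::Reject;
        case State::Reject:
            break;
    }
    return State::Reject;
}

// Each character is inspected exactly once and there is no backtracking,
// so an N character input is processed in O(N) steps.
bool dfa_match_integer(const std::string& input) {
    State s = State::Start;
    for (char c : input) {
        s = step(s, c);
        if (s == State::Reject) return false;
    }
    return s == State::Digits;  // Digits is the only accepting state
}
```

The price for the linear scan is the machine itself: for a general expression the number of DFA states can grow exponentially in M, which is the O(2^M) machine size in the table.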

Flavor

The regex support as of TR1 is an extension of std based on Boost.Regex, with the following proposed changes/consequences:

• Default ECMAScript syntax.

• Optional support for POSIX BRE/ERE/awk/grep/egrep/sed syntax.

• Localization features of POSIX are required since ECMAScript is not capable of localization.

• Performance is low, due to rich expression features.

• There are NO performance guarantees given.

• Boost has a way to monitor the runtime complexity of expressions and stop them.

• Customizing the expression syntax with trait classes. Nice!
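
The choice between the default ECMAScript syntax and a POSIX syntax can be sketched as follows. This assumes the interface as it was later standardized in C++11 (std::regex with syntax flags such as std::regex::basic and std::regex::extended); the helper name is made up for illustration.

```cpp
#include <regex>
#include <string>

// Matches all of `text` against `pattern`, interpreting the pattern in the
// given syntax flavor. The helper name is hypothetical.
bool matches_with(const std::string& text, const std::string& pattern,
                  std::regex::flag_type flavor) {
    const std::regex re(pattern, flavor);  // the flavor is fixed at construction
    return std::regex_match(text, re);
}
```

With the pattern "a+", the ECMAScript and POSIX extended flavors treat + as "one or more", so matches_with("aaa", "a+", std::regex::ECMAScript) succeeds. In POSIX basic syntax + is an ordinary character, so with std::regex::basic the same pattern matches the literal text "a+" and not "aaa".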

Implementation

A full implementation in C++, not a wrapper!

Available in the header file <regex>

Representation:

• basic_regex, holder of expressions, looks like a basic_string.

• match_results, holds the results of a match as a sequence of sub-matches

Methods:

• bool regex_match(basic_string, basic_regex)

• bool regex_search(basic_string, match_results, basic_regex)

• basic_string regex_replace(basic_string, basic_regex, basic_string)
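
As a rough illustration of how two of these functions differ, the sketch below uses std::regex and std::smatch, the typedefs of basic_regex and match_results for std::string, assuming the C++11 form of the library. The function names are invented for this example.

```cpp
#include <regex>
#include <string>

// Finds the first run of digits anywhere in `text` and stores it in `digits`.
// Returns false when `text` contains no digit. Hypothetical helper.
bool first_number(const std::string& text, std::string& digits) {
    static const std::regex number("[0-9]+");
    std::smatch found;  // match_results specialised for std::string
    if (!std::regex_search(text, found, number)) return false;
    digits = found[0].str();  // sub-match 0 is the whole matched text
    return true;
}

// regex_match succeeds only when the pattern covers the entire input,
// whereas regex_search accepts a match anywhere inside it.
bool is_all_digits(const std::string& text) {
    static const std::regex number("[0-9]+");
    return std::regex_match(text, number);
}
```

For the input "port 8080 open", first_number finds "8080", while is_all_digits rejects the same input because the letters and spaces are not covered by the pattern.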

C++0x Example

#include <stdlib.h>
#include <regex>
#include <string>
#include <iostream>

using namespace std;

regex expression("([0-9]+)(\\-| |$)(.*)");

// process_ftp: on success returns the ftp response code, and fills
// msg with the ftp response message.
int process_ftp(const char* response, std::string* msg)
{
    cmatch what;
    if (regex_match(response, what, expression))
    {
        // what[0] contains the whole string
        // what[1] contains the response code
        // what[2] contains the separator character
        // what[3] contains the text message.
        if (msg)
            msg->assign(what[3].first, what[3].second);
        return std::atoi(what[1].first);
    }
    // failure did not match
    if (msg)
        msg->erase();
    return -1;
}

How is C++0x different from C++?

C++ Example

#include <stdlib.h>
#include <boost/regex.hpp>
#include <string>
#include <iostream>

using namespace boost;

regex expression("([0-9]+)(\\-| |$)(.*)");

// process_ftp: on success returns the ftp response code, and fills
// msg with the ftp response message.
int process_ftp(const char* response, std::string* msg)
{
    cmatch what;
    if (regex_match(response, what, expression))
    {
        // what[0] contains the whole string
        // what[1] contains the response code
        // what[2] contains the separator character
        // what[3] contains the text message.
        if (msg)
            msg->assign(what[3].first, what[3].second);
        return std::atoi(what[1].first);
    }
    // failure did not match
    if (msg)
        msg->erase();
    return -1;
}

It's regex_replace("std", sourceCode, "boost") different...
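
Taken literally, the remark could be sketched as below, here going in the Boost to std direction. This is an illustration only: the helper name is invented, it rewrites nothing but "boost::" qualifiers, and in real code the header <boost/regex.hpp> and any using-directive would have to be changed as well. Note that std::regex_replace takes its arguments in the order input, expression, replacement.

```cpp
#include <regex>
#include <string>

// Rewrites "boost::" qualifiers to "std::" in a piece of source text.
// Hypothetical helper; it does not touch #include lines or using-directives.
std::string boost_to_std(const std::string& source_code) {
    // \b requires a word boundary, so "myboost::" is left alone
    static const std::regex boost_prefix(R"(\bboost::)");
    return std::regex_replace(source_code, boost_prefix, "std::");
}
```

Every occurrence is replaced, so a line such as "boost::regex e; boost::cmatch what;" has both qualifiers rewritten in one call.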

Sources

The C++ Standards Committee, n1429

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n1429.pdf

Wikipedia, C++0x

http://en.wikipedia.org/wiki/C++0x

Regular Expressions

http://www.regular-expressions.info

