  • Regular Expressions C++0x Sources

    C++0x Regular Expressions

    Simon Andreas Frimann Lund

    Datalogisk Institut Københavns Universitet

    May 16, 2008

    Regular Expressions

    Regular expression, regex or regexp for short: "A set of characters, metacharacters, and operators that define a string or group of strings in a search pattern."

    "regex" (a simple regex matching the text "regex")

    "[-+]?([0-9]*\.[0-9]+|[0-9]+)" (a simple regular expression matching... what?)

    The set of metacharacters, operators and other features is usually called a regex flavor.

    Flavors

    There exist 15+ popular regex flavors in various languages and tools, of which only two are standardized:

    The POSIX Standard Basic Regex / Extended Regex.

    GNU BRE / ERE, GNU extensions of the standard used in GNU tools such as grep.

    The languages D, Haskell, .NET, Java, ECMA (JavaScript), Python, Ruby all have their own flavors.

    The languages Perl and Tcl have their own flavors as built-in language constructs.

    Libraries such as PCRE (used in PHP), Boost.Regex, Boost.Xpressive and Qt/QRegExp each have their own flavor.

    And the list goes on...

    How are these tasty flavors implemented?

    Implementations

    Basically, all the different flavors are implemented with an NFA (non-deterministic finite automaton) or a DFA (deterministic finite automaton). The table lists the machine size for an expression of M characters and the matching complexity for an input sequence of N characters on a machine with S states.

    Algorithm                            Machine size      Complexity
    DFA                                  O(2^M)            O(N)
    Bit-parallel non-backtracking NFA    O(M) ∨ O(2^M)     O((1 + S/B)N)
    Non-backtracking NFA                 O(M) ∨ O(2^M)     O(SN)
    Backtracking NFA                     O(M)              O(2^N)

    Currently, many different implementations exist for C++, some procedural and others object-oriented. They support various flavors, but most are simply object-oriented wrappers for C libraries.

    Flavor

    Regex support as of TR1 is an extension of the standard library (std) based on Boost.Regex, with the following proposed changes and consequences:

    Default ECMAScript syntax.

    Optional support for POSIX BRE/ERE/awk/grep/egrep/sed syntax.

    The localization features of POSIX are required, since ECMAScript is not capable of localization.

    Performance is low, due to the rich expression features.

    NO performance guarantees are given.

    Boost has a way to monitor the runtime complexity of expressions and stop them.

    The expression syntax can be customized with traits classes. Nice!
