Page 1: Regular expressions are regular Marek Pawelec wasaty@wasaty.pl

Regular expressions are regular

Marek Pawelec

wasaty@wasaty.pl

Outline

1. Regex vocabulary

2. Segmentation rules

3. Regex tagger

4. Regex text filter

5. Auto-translatables

(?<!(,|\.|\d|\d\s|\d'|\d’))([-|\u2212]?[\d]{2,3})(?:\.|,|\s|'|’)(\d\d\d)(?:\.|,)([\d]{1,2}|[\d]{4,})(?!(,\d|\.\d|\d|\s\d|'\d|’\d))

Wildcards...

Wildcards used in regular search:
• * – any text string
• ? – any single character

...but somewhat different.

Regular expressions

• . – any character (letter, digit, symbol...)
• [ ] – a set of characters, optionally given as a range
  [123] – digit 1 or 2 or 3
  [1-3] – any digit from 1 to 3
  [A-Za-z] – any unaccented Latin letter
  [^A] – any character except „A”
• | – or
  1|2|3 – 1 or 2 or 3

Ranges

• Both [ ] and | mean „or”. What is the difference?

• [USDEUR] matches a single character: U or S or D or E or U or R

• USD|EUR matches the whole string USD or EUR
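
The difference is easy to see by running both patterns over the same text. A minimal sketch using Python's `re` module (memoQ itself uses the .NET regex engine, but this behaviour is the same in both); the sample text is an assumption made up for the demonstration:

```python
import re

text = "USD EUR"

# A set matches exactly one character out of the listed ones.
single_chars = re.findall(r"[USDEUR]", text)
print(single_chars)   # ['U', 'S', 'D', 'E', 'U', 'R']

# Alternation matches one of the complete strings.
whole_codes = re.findall(r"USD|EUR", text)
print(whole_codes)    # ['USD', 'EUR']
```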

Special symbols

• \ – modifier (”escape” character)
  . matches any character, but \. means a literal dot
  \\ matches a backslash
• \d – digit [0-9]
• \s – white space
• \w – any ”word” character [A-Za-z0-9_]
• \u#### – Unicode character, e.g. \u2212: − (minus sign)

Quantifiers

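• ? – 0 or 1
  \d? means zero or one digit
• * – 0 or more
  \d* means zero or more digits
• + – 1 or more
  \d+ means at least one digit

The quantifiers above are greedy: they take as much text as possible.

• *? – zero or more, but as few as possible
• +? – one or more, but as few as possible

The quantifiers with an added ? are lazy: they take as little text as possible.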
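
Greedy and lazy matching differ only when more than one match length is possible. A small sketch in Python's `re` module; the sample string is a hypothetical one chosen to show the effect:

```python
import re

text = "<b>bold</b>"

# Greedy: .+ runs to the last possible ">".
greedy = re.search(r"<.+>", text).group()
print(greedy)   # <b>bold</b>

# Lazy: .+? stops at the first possible ">".
lazy = re.search(r"<.+?>", text).group()
print(lazy)     # <b>
```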

Quantifiers cont.

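• {num} – exact count or range
  \d{4} = exactly 4 digits
  \d{2,4} = 2, 3 or 4 digits
  \d{1,4} = from 1 to 4 digits (the shorthand \d{,4} is not portable: some engines read it as 0 to 4, others as literal text)
  \d{4,} = 4 or more digits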
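
Counted quantifiers can be checked quickly against a row of digit groups. The sketch below uses Python's `re` module with a made-up sample string; note that Python reads an omitted lower bound as zero, which other engines may not do:

```python
import re

sample = "1 22 333 4444 55555"

two_to_four = re.findall(r"\d{2,4}", sample)
print(two_to_four)    # ['22', '333', '4444', '5555']

four_or_more = re.findall(r"\d{4,}", sample)
print(four_or_more)   # ['4444', '55555']

# In Python, {,4} means "0 to 4", so even an empty string matches.
empty_ok = re.fullmatch(r"\d{,4}", "") is not None
print(empty_ok)       # True
```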

Groups

• ( ) – creates a group ($num recalls it)

• (?: ) – passive group (not numbered)
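
A short illustration of numbered and passive groups, sketched in Python's `re` module. One syntax difference to keep in mind: .NET (and therefore memoQ) recalls groups in the replacement as $1, $2, while Python writes \1, \2. The sample strings are assumptions:

```python
import re

# Two numbered groups, recalled in swapped order in the replacement.
swapped = re.sub(r"(\d{4})-(\d{2})", r"\2/\1", "2012-05")
print(swapped)   # 05/2012

# The passive group (?: ) is not numbered, so the digits are group 1.
m = re.search(r"(?:USD|EUR)\s(\d+)", "EUR 100")
amount = m.group(1)
print(amount)    # 100
```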

Assertions

• (?= ) – look ahead assertion

memo(?=Q) will match „memo” in memoQ, but not in memory

• (?! ) – negative look ahead assertion

memo(?!Q) will match „memo” in memory, but not in memoQ

• (?<! ) – negative look back assertion

(?<!s)and will match „and” in band, but not in sand
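
The three examples above can be verified directly. A minimal sketch in Python's `re` module; the helper name `matched_in` is hypothetical:

```python
import re

def matched_in(pattern, words):
    """Return the words in which the pattern finds a match."""
    return [w for w in words if re.search(pattern, w)]

print(matched_in(r"memo(?=Q)", ["memoQ", "memory"]))   # ['memoQ']
print(matched_in(r"memo(?!Q)", ["memoQ", "memory"]))   # ['memory']
print(matched_in(r"(?<!s)and", ["band", "sand"]))      # ['band']
```

In every case the assertion only checks the neighbouring text: the match itself is still just „memo” or „and”.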

#lists#

A list contains variables:

#currency#

(EUR|USD|GBP|HUF)

#cap#

(A|B|C|D) = [ABCD]
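
Outside memoQ, a #list# can be imitated by substituting an alternation for the list name before the pattern is compiled. The sketch below is only an approximation in Python: the `LISTS` dictionary, the `expand` helper and the use of a passive group are assumptions, not a description of how memoQ expands lists internally.

```python
import re

# List contents taken from the slide.
LISTS = {
    "currency": ["EUR", "USD", "GBP", "HUF"],
    "cap": ["A", "B", "C", "D"],
}

def expand(rule):
    """Replace each #name# with an alternation of the list items."""
    def repl(m):
        items = "|".join(re.escape(x) for x in LISTS[m.group(1)])
        return "(?:" + items + ")"
    return re.sub(r"#(\w+)#", repl, rule)

pattern = expand(r"\d+\s#currency#")
print(pattern)   # \d+\s(?:EUR|USD|GBP|HUF)
print(re.findall(pattern, "100 EUR, 20 PLN, 5 HUF"))   # ['100 EUR', '5 HUF']
```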

Regular expressions in memoQ

• Segmentation rules

• Regexp tagger

• Regexp text filter

• Auto-translatables

Segmentation rules

• #end##!#[\s]+#cap#
• #end##!#[\s]+[\d]
• #end##!#[\s]+#lpar#[\s]*#cap#
• #end##!#[\s]+#lpar#[\s]*[\d]
• #end#[\s]*#rpar##!#[\s]+#cap#
• #end#[\s]*#rpar##!#[\s]+[\d]

#end##!#[\s]+#cap# = [:\!\?\.]#!#\s+[A-Z]
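
In a memoQ rule, #!# marks the position of the segment break. Plain regex has no such marker, but the same rule can be approximated with a lookbehind for the end punctuation and a lookahead for the capital letter. This is a sketch in Python under that assumption (the `segment` helper and the sample sentences are hypothetical, and the whitespace at the break is simply dropped):

```python
import re

# End punctuation before the break, whitespace and a capital after it.
BREAK = re.compile(r"(?<=[:!?.])\s+(?=[A-Z])")

def segment(text):
    """Split text where the expanded rule would place #!#."""
    return BREAK.split(text)

print(segment("It works. Does it? Yes: Always."))
# ['It works.', 'Does it?', 'Yes:', 'Always.']

# Hypothetical sample: an abbreviation triggers an unwanted break.
print(segment("Mr. Smith left."))
# ['Mr.', 'Smith left.']
```

The second call shows why a bare break rule is not enough: abbreviations followed by a capital letter need exceptions.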

• #end##!#[\s]+#cap#

Unless:

• #abbr_long##!#[\s]+#cap#
• [\s]#abbr_short##!#[\s]+#cap#
• \s#cap#\.#!#[\s]+#cap#

Regex tagger

<c:0xFF00FFFF>

\<C:.*\>

0990-4905 / N537-0392

\d{4}-\d{4}

[A-Z]\d{3}-\d{4}
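
Reading the two patterns without the stray spaces, they can be tried on the sample line. A minimal sketch in Python; joining them with | into one pattern is an assumption made to keep the example short:

```python
import re

# Either four digits, or a capital letter and three digits, before the hyphen.
ID = re.compile(r"\d{4}-\d{4}|[A-Z]\d{3}-\d{4}")

found = ID.findall("0990-4905 / N537-0392")
print(found)   # ['0990-4905', 'N537-0392']
```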

ERR_GRP_NO_SAMPLE

[A-Z]+(_[A-Z]+)+
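
Assuming the pattern is meant as one capitalised word followed by one or more underscore-plus-word parts, it can be tested like this in Python. The group is written as a passive group here only so that `findall` returns whole matches; the surrounding sentence is made up:

```python
import re

# A capitalised word, then at least one "_WORD" part.
CONSTANT = re.compile(r"[A-Z]+(?:_[A-Z]+)+")

found = CONSTANT.findall("Error ERR_GRP_NO_SAMPLE occurred")
print(found)   # ['ERR_GRP_NO_SAMPLE']
```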

Tip: Regex tagger without regex

Regexp text filter

*Popup "Putty" "c:\util\putty.exe"

\s*\*(.*)

*Popup .icon="$IconDir$\Fav_Star.ico" "Quick" "!DynamicFolder:$QuickLaunch$*.lnk"

"\w+(\s+\w+)*"

\w = [A-Za-z0-9_]
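
If the second pattern is read as a quoted run of words, it picks out the menu labels and skips the quoted paths and commands. A sketch in Python under that reading; the extra capturing group around the label and the name `LABEL` are additions for the demonstration:

```python
import re

# A double-quoted string made only of word characters and spaces.
LABEL = re.compile(r'"(\w+(?:\s+\w+)*)"')

lines = [
    r'*Popup "Putty" "c:\util\putty.exe"',
    r'*Popup .icon="$IconDir$\Fav_Star.ico" "Quick" "!DynamicFolder:$QuickLaunch$*.lnk"',
]

labels = [LABEL.findall(line) for line in lines]
print(labels)   # [['Putty'], ['Quick']]
```

The paths and the !DynamicFolder command contain characters outside \w (colon, backslash, dollar sign), so they are not matched.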

Auto-translatables

Rule for EN/DE/FR → HU number format conversion

(?<!(,|\.|\d|\d\s|\d'|\d’))([-|\u2212]?[\d]{2,3})(?:\.|,|\s|'|’)(\d\d\d)(?:\.|,)([\d]{1,2}|[\d]{4,})(?!(,\d|\.\d|\d|\s\d|'\d|’\d))

$2 $3,$4
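
The rule can be tried outside memoQ with some adaptation. The sketch below is a Python approximation, not the memoQ rule itself: Python's `re` accepts only fixed-width lookbehinds, so the lookbehind is split into two checks; the capturing parentheses inside the assertions are dropped, so the groups are numbered 1 to 3 and the replacement becomes \1 \2,\3. The names `NUMBER` and `to_hu` are hypothetical.

```python
import re

NUMBER = re.compile(
    r"(?<![,.\d])(?<!\d[\s'\u2019])"   # not the tail of a longer number
    r"([-|\u2212]?\d{2,3})"            # optional sign, 2-3 digits (| kept as in the rule)
    r"[.,\s'\u2019]"                   # thousands separator
    r"(\d{3})"                         # three digits
    r"[.,]"                            # decimal separator
    r"(\d{1,2}|\d{4,})"                # decimals: 1-2 digits, or 4 and more
    r"(?![,.\s'\u2019]?\d)"            # no further digit group follows
)

def to_hu(text):
    """Rewrite a matching number as '12 345,67'."""
    return NUMBER.sub(r"\1 \2,\3", text)

for sample in ["12,345.67", "12.345,67", "12\u2019345.67",
               "12 345,67,0", ".12,345,67", "0 12.345,67"]:
    print(sample, "->", to_hu(sample))
```

The first three samples come out as 12 345,67; the last three are returned unchanged because a lookbehind or lookahead blocks the match.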

Matched and converted:

12,345,67 → 12 345,67
12,345.67 → 12 345,67
12.345,67 → 12 345,67
12.345.67 → 12 345,67
12 345,67 → 12 345,67
12 345.67 → 12 345,67
12’345,67 → 12 345,67
12’345.67 → 12 345,67

Not matched (blocked by the lookbehind or lookahead):

.12,345,67
,12,345.67
0 12.345,67
0’12.345.67
12 345,67,0
12 345.67.0
12’345,67 0
12’345.67’0

Red elements are not necessary:

(?<!(,|\.|\d|\d\s|\d'|\d’))([-|\u2212]?[\d]{2,3})(?:\.|,|\s|'|’)(\d\d\d)(?:\.|,)([\d]{1,2}|[\d]{4,})(?!(,\d|\.\d|\d|\s\d|'\d|’\d))

$1 $2,$3

The same rule for EN → HU only

(?<!\d,|\d\.|\d)([-–]?\d{2,3}),(\d{3})\.(\d+)(?!,\d|\.\d|\d)

12,345.67 → 12 345,67
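
A Python approximation of this shorter rule, again with the variable-width lookbehind split into two fixed-width checks because Python's `re` requires that. The dash inside the brackets is copied as printed on the slide; `en_to_hu` and the second sample number are hypothetical:

```python
import re

EN_NUMBER = re.compile(
    r"(?<!\d)(?<!\d[,.])"       # not the tail of a longer number
    r"([-–]?\d{2,3}),(\d{3})"   # thousands with an English comma
    r"\.(\d+)"                  # English decimal point and decimals
    r"(?![,.]?\d)"              # no further digit group follows
)

def en_to_hu(text):
    """Rewrite 12,345.67 as 12 345,67."""
    return EN_NUMBER.sub(r"\1 \2,\3", text)

print(en_to_hu("12,345.67"))      # 12 345,67
print(en_to_hu("1,012,345.67"))   # 1,012,345.67 (left alone: part of a longer number)
```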

Day of the week, Month Day number (st, nd, rd, th) Year

→ day of the week day number. month year

(#day#),?\s(#month#)\s(\d{1,2})(?:st|nd|rd|th)?\s(\d{4})

$1 $3. $2 $4

(#day#),?\s(#month#)\s(\d{1,2})(?:st|nd|rd|th)?\s(\d{4})

#day#: Friday → piątek ($1)

#month#: May → maja ($2)

11th → 11 ($3)

2012 → 2012 ($4)

$1 $3. $2 $4
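
The date rule combines translation lists with reordered groups. A rough Python sketch of the same idea follows; the dictionaries hold only the two entries shown on the slide, and the names `DAYS`, `MONTHS`, `alternation` and `translate_date` are assumptions. In memoQ the lists do the translation themselves; here a replacement function looks the words up.

```python
import re

# Translation lists: only the entries shown on the slide.
DAYS = {"Friday": "piątek"}
MONTHS = {"May": "maja"}

def alternation(words):
    """Build word1|word2|... from the list keys."""
    return "|".join(re.escape(w) for w in words)

DATE = re.compile(
    rf"({alternation(DAYS)}),?\s({alternation(MONTHS)})"
    rf"\s(\d{{1,2}})(?:st|nd|rd|th)?\s(\d{{4}})"
)

def translate_date(text):
    """Apply the '$1 $3. $2 $4' order with translated day and month."""
    def repl(m):
        day, month = DAYS[m.group(1)], MONTHS[m.group(2)]
        return f"{day} {m.group(3)}. {month} {m.group(4)}"
    return DATE.sub(repl, text)

print(translate_date("Friday, May 11th 2012"))   # piątek 11. maja 2012
```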
