
Stakeholders in memoQ Server Projects

A Quick Overview

Regular Expression

[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}

Matching Text

202ca4c2-749d-4f54-ae02-fdf19939ef10
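As a rough illustration of the pattern above, here is a minimal Python sketch. Python's `re` module is only a stand-in for the .NET engine that memoQ uses, and the helper name `is_guid` is made up for this example.

```python
import re

# The expression from the slide: five groups of lower-case hex digits
# (8, 4, 4, 4 and 12 characters) separated by hyphens.
GUID = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")

def is_guid(text: str) -> bool:
    """Hypothetical helper: True if the whole text matches the pattern."""
    return GUID.fullmatch(text) is not None

print(is_guid("202ca4c2-749d-4f54-ae02-fdf19939ef10"))
print(is_guid("202CA4C2-749D-4F54-AE02-FDF19939EF10"))  # upper case is outside [0-9a-f]
```

The second call fails because the character sets only list lower-case letters.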

The Scary Bit

What Are Regular Expressions?

• They are not a programming language

• Symbols that describe a text pattern

• Used to match, search and manipulate text

• A more powerful “Search and replace”

• Called “regex” for short

• There are several regex engines or “flavours”

• memoQ uses Microsoft .NET

How Long Does It Take to Learn a New Language?

*http://www.effectivelanguagelearning.com/language-guide/language-difficulty

How Long Does It Take to Learn Regex?

You can start creating your own basic expressions within a few minutes.

SIGH OF RELIEF

What Are They Used For?

• Search and match:

– Email addresses

– URLs

– Tags and placeholders

– Phone number formats

– Alternate spellings

– Consistency checks (e.g. lower case v. upper case)

– Trailing spaces

– Punctuation sequences (for segmentation)

– Other repetitive/sequential text

Where in memoQ?

Two Types of Regex Text

Literal characters

bomb

bomb

bomber

A-bomb

The bomb went off.

Bombs off.

b o m b
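The literal example can be tried out with a short Python sketch. It assumes case-sensitive matching and uses Python's `re` module in place of memoQ's .NET engine; the variable names are illustrative only.

```python
import re

samples = [
    "bomb",
    "bomber",
    "A-bomb",
    "The bomb went off.",
    "Bombs off.",   # capital B, so the lower-case literal does not match
    "b o m b",      # the spaces break up the literal sequence
]

pattern = re.compile(r"bomb")  # four literal characters, no metacharacters

# Keep only the samples that contain the literal text somewhere.
matched = [s for s in samples if pattern.search(s)]
print(matched)
```

Only the first four samples contain the exact character sequence, so the last two are left out.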

Metacharacters

\

.

*

?

+

[]

-

|

()

{}

$

^

Metacharacters

. Any character

* Preceding item zero or more times

? Preceding item zero or one time

+ Preceding item one or more times

[ Begin character set

] End character set

- Separator in ranges

| Either or

{} Bean counting (how many times the preceding item repeats)

^ Start of segment // Negate a character set

$ End of segment

( Begin group

) End group
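A few of the metacharacters listed above in action, as a hedged Python sketch. The `hits` helper and the sample strings are invented for illustration, and Python's `re` module is used as an approximation of the .NET flavour.

```python
import re

def hits(pattern: str, text: str) -> list:
    """Hypothetical helper: return every piece of text the pattern matches."""
    return [m.group(0) for m in re.finditer(pattern, text)]

any_char   = hits(r"b.t", "bat bit bt")                # . needs exactly one character
zero_more  = hits(r"lo*l", "ll lol loool")             # * allows zero o's
one_more   = hits(r"lo+l", "ll lol loool")             # + needs at least one o
optional   = hits(r"colou?r", "color colour colouur")  # ? allows zero or one u
either     = hits(r"cat|dog", "cat, dog, bird")        # | means either/or
counted    = hits(r"\d{4}", "In 2024, 15 items")       # {4} means exactly four
at_start   = hits(r"^The", "The end. The end.")        # ^ anchors to the start
at_end     = hits(r"end\.$", "The end. The end.")      # $ anchors to the end

print(any_char, zero_more, one_more, optional, either, counted, at_start, at_end)
```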

Character Sets

Will match any one of the characters in the set, but only once, unless a quantifier such as *, + or bean counting {} says otherwise.

[a-z] Lower case

[A-Z] Upper case

[a-zA-Z] Any case

[0-9] Digits

[0-9A-Za-z] Digits + letters

\p{Ll} Lower + special letters

\p{Lu} Upper + special letters

\p{L} Any case + special letters

Can be negated using ^

[^0-9] Any character except a digit

Can be combined

[0-9a-e ,]
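The following Python sketch shows how character sets behave, under the assumption that Python's `re` module is close enough to .NET for these basics. One known difference: Python's built-in `re` does not support `\p{L}`, so the last line uses `[^\W\d_]` as a rough substitute for "any letter, including special letters".

```python
import re

lower      = re.findall(r"[a-z]+", "Lime Leaves")         # lower-case runs only
any_case   = re.findall(r"[a-zA-Z]+", "Lime Leaves")      # both ranges in one set
non_digits = re.findall(r"[^0-9]+", "Galangal 5%")        # negated set
combined   = re.findall(r"[0-9a-e ,]+", "12, abc xyz")    # digits, a-e, space, comma

# Rough Python stand-in for \p{L}+ (not an exact equivalent).
letters    = re.findall(r"[^\W\d_]+", "Garlic pur\u00e9e")

print(lower, any_case, non_digits, combined, letters)
```

Note how `[a-z]+` drops the capital letters, and how the combined set stops at "x" because it is not listed in the set.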

Shorthand Character Sets

\d Digit

\w Word character (letter, digit OR underscore)

\s Whitespace

\b Word boundary (beginning OR end of word)

\t Tab

\r Carriage return

\n New line

\D Not a digit

\W Not a word character

\S Not a whitespace

\tag memoQ tag
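A short Python sketch of the shorthand sets, with made-up sample strings. It uses Python's `re` module as an approximation of .NET; `\tag` is specific to memoQ and is not available here.

```python
import re

numbers    = re.findall(r"\d+", "Order 66, table 9")      # runs of digits
words      = re.findall(r"\w+", "rice_wine vinegar!")     # \w also accepts underscores
non_digits = re.findall(r"\D+", "tel 01908")              # everything that is not a digit
whole_word = re.findall(r"\bcat\b", "cat concat cat.")    # \b keeps "concat" out

# Trailing spaces (one of the uses listed earlier): strip whitespace before the end.
trimmed    = re.sub(r"\s+$", "", "Trailing spaces \t ")

print(numbers, words, non_digits, whole_word, repr(trimmed))
```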

“Escaping” Metacharacters

If you need to match a special character in the text, you will have to “escape” it, or mark it for its literal meaning.

This is achieved by putting a backslash in front of it.

\(

\)

\{

\}

\$

\^

\!

\\

\.

\?

\*

\+

\[

\]

\-

\|
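To see why escaping matters, here is a minimal Python sketch with invented sample text. `re.escape` is Python's own helper for escaping a whole string; it is shown as a convenience and is not a memoQ feature.

```python
import re

# Unescaped, the dot means "any character", so it matches more than intended.
loose  = re.findall(r"2.5", "2.5 245 2x5")
# Escaped, the dot only matches a literal full stop.
strict = re.findall(r"2\.5", "2.5 245 2x5")

# Escaped brackets match literal brackets instead of opening a group.
percentages = re.findall(r"\(\d+%\)", "Galangal (5%), Lemon Grass (15%)")

# Python can escape a whole literal string for you.
literal = "(2.5%)"
escaped_ok = re.fullmatch(re.escape(literal), literal) is not None

print(loose, strict, percentages, escaped_ok)
```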

Find and Replace

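Replace expressions allow you to choose which parts of the text to replace and which parts to keep as they are. This is achieved via groups (): each group captures part of the match, and the replacement refers to the captured text as $1, $2 and so on.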

Search: (\d{1,})\s{1,}[mM][gG]

Replace: $1 mg

Finds: 225 mG

Replaces with: 225 mg
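The same search and replace can be sketched in Python. One difference to be aware of: Python's `re.sub` writes the group reference as `\1`, whereas .NET (and therefore memoQ) writes it as `$1`. The sample sentence is invented.

```python
import re

search  = r"(\d{1,})\s{1,}[mM][gG]"   # group 1 captures the number
replace = r"\1 mg"                    # Python syntax for what memoQ writes as "$1 mg"

text   = "Take 225 mG daily, then 50   MG."
result = re.sub(search, replace, text)

print(result)
```

The number is kept because it sits inside the group, while the whitespace and the unit are rewritten in a consistent form.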

Greedy v. Lazy

Dangers of Greediness

By default, quantifiers are greedy: they match as much text as they can. It is a good habit to limit your expressions as much as possible to avoid matching more text than you intend to. Add the non-greedy (lazy) marker ? after * or + to match as little text as possible. Example:

pur.*\b will match everything from “purées” to “description” in “All purées contain at least 10% of the main ingredient, unless otherwise specified in the purée description.”

pur.*?\b will match only “purées” in the same sentence.
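The difference between the two expressions can be checked with a small Python sketch (Python's `re` standing in for .NET; the variable names are illustrative).

```python
import re

sentence = ("All pur\u00e9es contain at least 10% of the main ingredient, "
            "unless otherwise specified in the pur\u00e9e description.")

# Greedy: .* runs to the end of the line, then backs up to the last word boundary.
greedy = re.search(r"pur.*\b", sentence).group(0)

# Lazy: .*? stops at the first word boundary it can reach.
lazy = re.search(r"pur.*?\b", sentence).group(0)

print(len(greedy), repr(lazy))
```

The greedy version swallows almost the whole sentence, while the lazy version stops at the end of the first word.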

Auto-Translation: Practical Cases

• Email addresses

\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*

• URLs

(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?

• Phone numbers

\d{5}\s\d{6} matches 01908 443300

\d{5}-\d{6} matches 01908-443300

\+\d{2}\s\(0\)\s\d{4}\s\d{6} matches +44 (0) 1908 443300

• Duplicate word pairs*

(\b\w+ \w+\b) \b\1\b

*Published by Max B. on the Yahoo mQ group
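Some of the patterns above can be tried outside memoQ with a short Python sketch. This only checks what the expressions match; it does not reproduce memoQ's auto-translation rules themselves. The email address and the sentence with the repeated word pair are invented examples.

```python
import re

EMAIL = r"\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"
PHONE_SPACE = r"\d{5}\s\d{6}"
PHONE_HYPHEN = r"\d{5}-\d{6}"
PHONE_INTL = r"\+\d{2}\s\(0\)\s\d{4}\s\d{6}"
DUPLICATE_PAIR = r"(\b\w+ \w+\b) \b\1\b"

# search() finds the address inside a longer sentence.
email = re.search(EMAIL, "Write to jane.doe@example.com today.").group(0)

# fullmatch() checks that each phone format is matched from start to end.
phones_ok = [
    re.fullmatch(PHONE_SPACE, "01908 443300") is not None,
    re.fullmatch(PHONE_HYPHEN, "01908-443300") is not None,
    re.fullmatch(PHONE_INTL, "+44 (0) 1908 443300") is not None,
]

# \1 refers back to the word pair captured by the first group.
duplicate = re.search(DUPLICATE_PAIR, "we need to need to check this").group(0)

print(email, phones_ok, duplicate)
```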

Segmentation: Practical Case

SOURCE: “Manufactured in China (PRC) for the UK market. Ingredients: Lemon Grass Purée (15%), Red Chilli Purée (11%), Onion, Water, Coconut Milk, Red Pepper, Galangal (5%), Sugar (Sulphites), Lime Juice From Concentrate (Sulphites), Salt, Rapeseed Oil, Garlic Purée, Rice Wine Vinegar (Sulphites), Lime Leaves (2.5%), Yeast Extract, Chilli Flakes, Cornflour, Tamarind Paste, Coriander, Cayenne Pepper, Paprika Extract.”

SOLUTION: Split the segment before an opening bracket if the closing bracket is followed by a comma, a space and an upper case letter

[\s]+#!#\([\s]*[\p{L}0-9]*\.?\d*\s*%?\),\s+\p{Lu}
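The rule above can be approximated in Python to see where the breaks would fall. This is a sketch under several assumptions: #!# is taken to mark the break position, the part after #!# is emulated with a lookahead, and because Python's `re` has no `\p{L}` or `\p{Lu}`, the ASCII ranges `[A-Za-z0-9]` and `[A-Z]` are swapped in. The function name `split_segments` and the shortened sample are made up for this example.

```python
import re

BEFORE_BREAK = r"\s+"                                       # text before #!#
AFTER_BREAK = r"\(\s*[A-Za-z0-9]*\.?\d*\s*%?\),\s+[A-Z]"    # text after #!# (ASCII approximation)

# The lookahead checks what follows the break without consuming it.
RULE = re.compile(f"{BEFORE_BREAK}(?={AFTER_BREAK})")

def split_segments(text: str) -> list:
    """Hypothetical helper: cut the text at every position the rule allows."""
    segments, start = [], 0
    for match in RULE.finditer(text):
        segments.append(text[start:match.end()])
        start = match.end()
    segments.append(text[start:])
    return segments

sample = ("Lemon Grass Pur\u00e9e (15%), Red Chilli Pur\u00e9e (11%), "
          "Onion, Lime Leaves (2.5%), Yeast Extract")

for segment in split_segments(sample):
    print(segment.strip())
```

A bracket that is not followed by a comma, such as "(PRC) for the UK market", does not trigger a break.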

Regex Tagger: Practical Case

SOURCE: “Dear [%$FIRSTNAME%] [%$LASTNAME%], Your online order placed on [%$WEBSITE%] on [%$DATE%] and processed as the authorized vendor of [%$RANGE%] products, has been successfully completed (order number: [%$REFNO%]). Please note that [%if $ORDER != ""%][%$ORDER%][%else%] [%$COMPANY%] will appear on your bank statement, instead of [%$RANGE%].”

SOLUTION: Create a cascading filter (Plain text + Regex tagger) and add the following to the tagger:

\[%.*?%\]

OR, if you want to be more strict:

\[%[a-z]+%\]

\[%\$[A-Z]+%\]

\[%if .*?\!\=.*?%\]
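To check what the tagger patterns would pick up, here is a Python sketch on a shortened version of the source text. It only lists the matches; it does not reproduce memoQ's Regex tagger. The greedy variant is included purely as a contrast, and the variable names are illustrative.

```python
import re

sample = ('Dear [%$FIRSTNAME%] [%$LASTNAME%], '
          '[%if $ORDER != ""%][%$ORDER%][%else%] [%$COMPANY%] will appear.')

# Lazy: each placeholder becomes its own match.
lazy_tags = re.findall(r"\[%.*?%\]", sample)

# Greedy (for contrast): one match running from the first [% to the last %].
greedy_tags = re.findall(r"\[%.*%\]", sample)

# One of the stricter patterns: only $VARIABLE placeholders in upper case.
variables = re.findall(r"\[%\$[A-Z]+%\]", sample)

print(lazy_tags)
print(len(greedy_tags))
print(variables)
```

The lazy pattern tags every placeholder separately, whereas the greedy one would turn the whole stretch, including translatable text, into a single tag.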

Resources

• Regex 101

https://regex101.com/

• Regex Pal

http://www.regexpal.com/

• Using regular expressions in memoQ (Basic level), by Miklós Urbán

https://www.memoq.com/recorded-webinars

• “Do the magic: Regular Expressions in FrameMaker”, by Marek Pawelec

https://blogs.adobe.com/techcomm/2016/03/framemaker-regular-expressions.html

• memoQ Yahoo Group

https://groups.yahoo.com/neo/groups/

• Regex Hero

http://regexhero.net/reference/

• Regex Cheat Sheet

https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/

Queries and Feedback

Please send any comments, questions or feedback to:

angela.madrid@k-international.com
