Stakeholders in memoQ Server Projects
A Quick Overview
Regular Expression
[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
Matching Text
202ca4c2-749d-4f54-ae02-fdf19939ef10
The Scary Bit
What Are Regular Expressions?
• They are not a programming language
• Symbols that describe a text pattern
• Used to match, search and manipulate text
• A more powerful “Search and replace”
• Called “regex” for short
• There are several regex engines or “flavours”
• memoQ uses Microsoft .NET
How Long Does It Take to Learn a New Language?
*http://www.effectivelanguagelearning.com/language-guide/language-difficulty
How Long Does It Take to Learn Regex?
You can start creating your own basic expressions within a few minutes.
SIGH OF RELIEF
What Are They Used For?
• Search and match:
– Email addresses
– URLs
– Tags and placeholders
– Phone number formats
– Alternate spellings
– Consistency checks (e.g. lower case v. upper case)
– Trailing spaces
– Punctuation sequences (for segmentation)
– Other repetitive/sequential text
Where in memoQ?
Two Types of Regex Text
Literal characters
bomb
bomb
bomber
A-bomb
The bomb went off.
Bombs off.
b o m b
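A minimal sketch of literal matching, assuming Python's re module as a stand-in for the .NET engine that memoQ uses (for plain literals the two should behave the same). The sample strings are the ones listed above; the variable names are illustrative only.

```python
import re

samples = ["bomb", "bomber", "A-bomb", "The bomb went off.", "Bombs off.", "b o m b"]

# A literal pattern matches exactly the characters typed, in that order.
pattern = re.compile(r"bomb")

# Keep only the samples that contain the literal text somewhere.
matching = [s for s in samples if pattern.search(s)]
print(matching)
```

"Bombs off." is not matched because matching is case-sensitive by default, and "b o m b" is not matched because the spaces are characters too.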
Metacharacters
\
.
*
?
+
[]
-
|
()
{}
$
^
Metacharacters
. Any character
* Preceding item zero or more times
? Preceding item zero or one time
+ Preceding item one or more times
[ Begin character set
] End character set
- Separator in ranges
| Either or
{} Bean counting (how many times the preceding item repeats)
^ Start of segment // Negate a character set
$ End of segment
( Begin group
) End group
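The metacharacters above can be tried out with a short sketch. It assumes Python's re module rather than memoQ's .NET engine; the helper name matches and the sample texts are made up for illustration.

```python
import re

def matches(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)

optional = matches(r"colou?r", "color colour")     # ? : zero or one "u"
star = matches(r"10*", "1 10 100")                 # * : zero or more "0"
plus = matches(r"10+", "1 10 100")                 # + : one or more "0"
counted = matches(r"[0-9]{3}", "12 345 6789")      # {3} : exactly three digits
either = matches(r"cat|dog", "cat, dog, bird")     # | : either or
start = matches(r"^The", "The end. The start.")    # ^ : start of the text only

for result in (optional, star, plus, counted, either, start):
    print(result)
```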
Character Sets
A character set matches any one of the characters in the set, exactly once, unless a quantifier (*, +, ? or bean counting {}) says otherwise.
[a-z] Lower case
[A-Z] Upper case
[a-zA-Z] Any case
[0-9] Digits
[0-9A-Za-z] Digits + letters
\p{Ll} Lower + special letters
\p{Lu} Upper + special letters
\p{L} Any case + special letters
Can be negated using ^
[^0-9] Any character except a digit
Can be combined
[0-9a-e ,]
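A short sketch of character sets in action, assuming Python's re module. Note that the standard re module does not support the \p{...} classes available in .NET, so they are left out here; the sample strings are invented.

```python
import re

lower = re.findall(r"[a-z]", "aB3")           # lower-case letters only
any_case = re.findall(r"[a-zA-Z]", "aB3")     # letters of either case
not_digit = re.findall(r"[^0-9]", "aB3")      # anything except a digit
# Combined set: digits, the letters a to e, a space and a comma.
combined = re.findall(r"[0-9a-e ,]+", "12, abc xyz")

print(lower, any_case, not_digit, combined)
```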
Shorthand Character Sets
\d Digit
\w Digit, letter or underscore
\s Whitespace
\b Boundary (beginning OR end of word)
\t Tab
\r Carriage return
\n New line
\D Not a digit
\W Not a digit, letter or underscore
\S Not a whitespace
\tag memoQ tag
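The shorthand sets can be sketched as follows, assuming Python's re module. The memoQ-specific \tag shorthand has no equivalent there and is not shown; the sample text is made up.

```python
import re

text = "Tel: 01908 443300\tUK"

digits = re.findall(r"\d+", text)        # runs of digits
words = re.findall(r"\w+", text)         # runs of letters, digits or underscores
pieces = re.split(r"\s+", text)          # split on spaces and tabs
non_digits = re.findall(r"\D+", "a1b22c")  # runs of anything but digits
# \b keeps "cat" from matching inside "concat" or "cats".
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")

print(digits, words, pieces, non_digits, whole_words)
```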
“Escaping” Metacharacters
If you need to match a special character in the text, you will have to “escape” it, or mark it for its literal meaning.
This is achieved by putting a backslash in front of it.
\(
\)
\{
\}
\$
\^
\!
\\
\.
\?
\*
\+
\[
\]
\-
\|
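A minimal sketch of why escaping matters, assuming Python's re module. The sample text is invented; re.escape is a Python helper that adds the backslashes for you and is not a memoQ feature.

```python
import re

text = "Price (approx.): $5.00"

# Unescaped, "." matches any character, so "5.00" also matches "5x00".
unescaped_hit = bool(re.search(r"5.00", "5x00"))
# Escaped, "\." matches only a literal full stop.
escaped_hit = bool(re.search(r"5\.00", "5x00"))

# Brackets, full stops and dollar signs all need a backslash to be literal.
bracketed = re.search(r"\(approx\.\)", text).group()
price = re.search(r"\$5\.00", text).group()

# re.escape builds the escaped form of a literal string.
escaped_literal = re.escape("(0)")

print(unescaped_hit, escaped_hit, bracketed, price, escaped_literal)
```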
Find and Replace
Replace expressions allow you to choose which parts of the text to replace and which parts to keep as they are. This is achieved via groups ().
Search: (\d{1,})\s{1,}[mM][gG]
Replace: $1 mg
Finds: 225 mG
Replaces with: 225 mg
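The same find and replace can be sketched in Python, with one assumption to note: Python's re module refers to the first group as \1 in the replacement, where the .NET engine used by memoQ writes $1. The second sample string is made up.

```python
import re

# Group 1 captures the number; the unit is matched in any letter case.
search = re.compile(r"(\d{1,})\s{1,}[mM][gG]")

# \1 re-inserts the captured number, followed by a normalised " mg".
single = search.sub(r"\1 mg", "225 mG")
several = search.sub(r"\1 mg", "Take 225 mG or 50  Mg daily")

print(single)
print(several)
```

Only the text inside the brackets is kept; the spacing and the unit are rewritten.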
Greedy v. Lazy
Dangers of Greediness
By default, regular expressions are greedy, so it is a good habit to limit your expressions as much as possible to avoid matching more text than you intend to. Use the non-greedy marker ? after * and +. Example, in the text “All purées contains at least 10% of the main ingredient, unless otherwise specified in the purée description.”:
pur.*\b will match “purées contains at least 10% of the main ingredient, unless otherwise specified in the purée description” (everything from the first “pur” to the last word boundary)
pur.*?\b will match only “purées” and “purée”
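The difference can be checked with a short sketch, assuming Python's re module, which treats greedy and lazy quantifiers the same way as .NET for this pattern. The sentence is the one from the example.

```python
import re

sentence = ("All purées contains at least 10% of the main ingredient, "
            "unless otherwise specified in the purée description.")

# Greedy: .* runs to the end of the line, then backs up to the last word boundary.
greedy = re.search(r"pur.*\b", sentence).group()

# Lazy: .*? stops at the first word boundary after each "pur".
lazy = re.findall(r"pur.*?\b", sentence)

print(greedy)
print(lazy)
```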
Auto-Translation: Practical Cases
• Email addresses
\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*
• URLs
(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?
• Phone numbers
\d{5}\s\d{6} 01908 443300
\d{5}-\d{6} 01908-443300
\+\d{2}\s\(0\)\s\d{4}\s\d{6} +44 (0) 1908 443300
• Duplicate word pairs*
(\b\w+ \w+\b) \b\1\b
*Published by Max B. on the Yahoo mQ group
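A minimal sketch for trying some of these patterns outside memoQ, assuming Python's re module. The phone numbers are the ones shown above; the email address and the sentence with a repeated word pair are invented for illustration.

```python
import re

phone_patterns = {
    r"\d{5}\s\d{6}": "01908 443300",
    r"\d{5}-\d{6}": "01908-443300",
    r"\+\d{2}\s\(0\)\s\d{4}\s\d{6}": "+44 (0) 1908 443300",
}
# fullmatch requires the whole string to fit the pattern.
phones_ok = all(re.fullmatch(p, number) for p, number in phone_patterns.items())

email = re.compile(r"\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*")
email_ok = bool(email.fullmatch("jane.doe@example.co.uk"))  # hypothetical address

# \1 repeats whatever group 1 captured, so this finds a doubled word pair.
pairs = re.compile(r"(\b\w+ \w+\b) \b\1\b")
duplicate = pairs.search("Please note that this is this is a test.")

print(phones_ok, email_ok, duplicate.group(1))
```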
Segmentation: Practical Case
SOURCE: “Manufactured in China (PRC) for the UK market. Ingredients: Lemon Grass Purée (15%), Red Chilli Purée (11%), Onion, Water, Coconut Milk, Red Pepper, Galangal (5%), Sugar (Sulphites), Lime Juice From Concentrate (Sulphites), Salt, Rapeseed Oil, Garlic Purée, Rice Wine Vinegar (Sulphites), Lime Leaves (2.5%), Yeast Extract, Chilli Flakes, Cornflour, Tamarind Paste, Coriander, Cayenne Pepper, Paprika Extract.”
SOLUTION: Split the segment before an opening bracket if the closing bracket is followed by a comma, a space and an upper case letter
[\s]+#!#\([\s]*[\p{L}0-9]*\.?\d*\s*%?\),\s+\p{Lu}
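The rule can be approximated outside memoQ with the sketch below. It rests on several assumptions: #!# is taken to mark the break position, so it is replaced by a lookahead; Python's standard re module has no \p{L} or \p{Lu}, so [^\W_] (letters and digits) and [A-Z] are swapped in, which loses accented upper case letters; and the function name split_before_brackets is hypothetical.

```python
import re

# Whitespace is the break point when it is followed by "(...)", a comma,
# whitespace and an upper case letter. The lookahead checks without consuming.
BREAK = re.compile(r"\s+(?=\(\s*[^\W_]*\.?\d*\s*%?\),\s+[A-Z])")

def split_before_brackets(text):
    """Split text at each break point, dropping the whitespace at the break."""
    return BREAK.split(text)

sample = ("Manufactured in China (PRC) for the UK market. Ingredients: "
          "Galangal (5%), Sugar (Sulphites), Lime Leaves (2.5%), Yeast Extract.")
parts = split_before_brackets(sample)

for part in parts:
    print(part)
```

"(PRC)" does not trigger a split because it is not followed by a comma.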
Regex Tagger: Practical Case
SOURCE: “Dear [%$FIRSTNAME%] [%$LASTNAME%], Your online order placed on [%$WEBSITE%] on [%$DATE%] and processed as the authorized vendor of [%$RANGE%] products, has been successfully completed (order number: [%$REFNO%]). Please note that [%if $ORDER != ""%][%$ORDER%][%else%] [%$COMPANY%] will appear on your bank statement, instead of [%$RANGE%].”
SOLUTION: Create a cascading filter (Plain text + Regex tagger) and add the below to tagger.
\[%.*?%\]
OR, if you want to be more strict:
\[%[a-z]+%\]
\[%\$[A-Z]+%\]
\[%if .*?\!\=.*?%\]
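To preview which placeholders such rules would pick up, here is a minimal sketch assuming Python's re module. The sample is a shortened version of the source text, the strict rules are joined with | into one pattern, and the conditional rule is written with lazy quantifiers so that it stops at the first closing %] instead of running on to the last one in the line.

```python
import re

sample = ('Dear [%$FIRSTNAME%] [%$LASTNAME%], please note that '
          '[%if $ORDER != ""%][%$ORDER%][%else%] [%$COMPANY%] will appear.')

# Generic rule: anything between [% and the nearest %].
generic = re.compile(r"\[%.*?%\]")

# Stricter rules: keywords, $VARIABLES, and "if ... != ..." conditions.
strict = re.compile(r"\[%[a-z]+%\]|\[%\$[A-Z]+%\]|\[%if .*?\!\=.*?%\]")

generic_tags = generic.findall(sample)
strict_tags = strict.findall(sample)

print(generic_tags)
print(strict_tags == generic_tags)
```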
Resources
• Regex 101
https://regex101.com/
• Regex Pal
http://www.regexpal.com/
• Using regular expressions in memoQ (Basic level), by Miklós Urbán
https://www.memoq.com/recorded-webinars
• “Do the magic: Regular Expressions in FrameMaker”, by Marek Pawelec
https://blogs.adobe.com/techcomm/2016/03/framemaker-regular-expressions.html
• memoQ Yahoo Group
https://groups.yahoo.com/neo/groups/
• Regex Hero
http://regexhero.net/reference/
• Regex Cheat Sheet
https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
Queries and Feedback
Please send any comments, questions or feedback to: