
Chapter 23 Text Processing


Page 1: Chapter 23 Text Processing

Chapter 23
Text Processing

Bjarne Stroustrup

www.stroustrup.com/Programming

Page 2: Chapter 23 Text Processing

Overview

• Application domains
• Strings
• I/O
• Maps
• Regular expressions

Stroustrup/PPP - Nov'13

Page 3: Chapter 23 Text Processing

Now you know the basics

• Really! Congratulations!
• Don't get stuck with a sterile focus on programming language features
• What matters are programs, applications, what good can you do with programming
  • Text processing
  • Numeric processing
  • Embedded systems programming
  • Banking
  • Medical applications
  • Scientific visualization
  • Animation
  • Route planning
  • Physical design


Page 4: Chapter 23 Text Processing

Text processing

• "all we know can be represented as text"
  • And often is
• Books, articles
• Transaction logs (email, phone, bank, sales, …)
• Web pages (even the layout instructions)
• Tables of figures (numbers)
• Graphics (vectors)
• Mail
• Programs
• Measurements
• Historical data
• Medical records
• …

Amendment I
Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the government for a redress of grievances.

Page 5: Chapter 23 Text Processing

String overview

• Strings
  • std::string
    • <string>
    • s.size()
    • s1==s2
  • C-style string (zero-terminated array of char)
    • <cstring> or <string.h>
    • strlen(s)
    • strcmp(s1,s2)==0
  • std::basic_string<Ch>, e.g. Unicode strings
    • using string = std::basic_string<char>;
  • Proprietary string classes


Page 6: Chapter 23 Text Processing

C++11 String Conversion

• In <string>, for numerical values
• For example:

    string s1 = to_string(12.333);        // "12.333000" in C++11 (doubles are formatted like printf's %f)
    string s2 = to_string(1+5*6-99/7);    // "17"


Page 7: Chapter 23 Text Processing

String conversion

• We can write a simple to_string() for any type that has a "put to" operator<<

    template<class T> string to_string(const T& t)
    {
        ostringstream os;
        os << t;
        return os.str();
    }

• For example:

    string s3 = to_string(Date(2013, Date::nov, 14));


Page 8: Chapter 23 Text Processing

C++11 String Conversion

• Part of <string>, for numerical destinations
• For example:

    string s1 = "-17";
    int x1 = stoi(s1);      // stoi means string to int

    string s2 = "4.3";
    double d = stod(s2);    // stod means string to double


Page 9: Chapter 23 Text Processing

String conversion

• We can write a simple from_string() for any type that has a "get from" operator>>

    template<class T> T from_string(const string& s)
    {
        istringstream is(s);
        T t;
        if (!(is >> t)) throw bad_from_string();
        return t;
    }

• For example:

    double d = from_string<double>("12.333");
    Matrix<int,2> m = from_string< Matrix<int,2> >("{ {1,2}, {3,4} }");


Page 10: Chapter 23 Text Processing

General stream conversion

    template<typename Target, typename Source>
    Target to(Source arg)
    {
        std::stringstream ss;
        Target result;

        if (!(ss << arg)                   // write arg into stream
            || !(ss >> result)             // read result from stream
            || !(ss >> std::ws).eof())     // stuff left in stream?
            throw bad_lexical_cast();

        return result;
    }

    string s = to<string>(to<double>(" 12.7 "));    // ok

    // works for any type that can be streamed into and/or out of a string:
    XX xx = to<XX>(to<YY>(XX(whatever)));           // !!!

Page 11: Chapter 23 Text Processing

I/O overview

[Figure: stream class hierarchy. istream and ostream are at the top; ifstream and istringstream derive from istream; ofstream and ostringstream derive from ostream; iostream derives from both; fstream and stringstream derive from iostream.]

Stream I/O

    in >> x          Read from in into x according to x's format
    out << x         Write x to out according to x's format
    in.get(c)        Read a character from in into c
    getline(in,s)    Read a line from in into the string s

Page 12: Chapter 23 Text Processing

Map overview

• Associative containers
  • <map>, <set>, <unordered_map>, <unordered_set>
  • map
  • multimap
  • set
  • multiset
  • unordered_map
  • unordered_multimap
  • unordered_set
  • unordered_multiset
• The backbone of text manipulation
  • Find a word
  • See if you have already seen a word
  • Find information that corresponds to a word
• See example in Chapter 23


Page 13: Chapter 23 Text Processing

Map overview

[Figure: a Mail_file holds a vector<Message>; a multimap<string,Message*> maps names ("John Doe", "John Doe", "John Q. Public") to pointers to messages in that vector.]

Page 14: Chapter 23 Text Processing

A problem: Read a ZIP code

• U.S. state abbreviation and ZIP code
  • two letters followed by five digits

    string s;
    while (cin>>s) {
        if (s.size()==7
            && isalpha(s[0]) && isalpha(s[1])
            && isdigit(s[2]) && isdigit(s[3]) && isdigit(s[4])
            && isdigit(s[5]) && isdigit(s[6]))
            cout << "found " << s << '\n';
    }

• Brittle, messy, unique code


Page 15: Chapter 23 Text Processing

A problem: Read a ZIP code

• Problems with simple solution
  • It's verbose (4 lines, 8 function calls)
  • We miss (intentionally?) every ZIP code number not separated from its context by whitespace
    • "TX77845", TX77845-1234, and ATM77845
  • We miss (intentionally?) every ZIP code number with a space between the letters and the digits
    • TX 77845
  • We accept (intentionally?) every ZIP code number with the letters in lower case
    • tx77845
  • If we decided to look for a postal code in a different format we would have to completely rewrite the code
    • CB3 0DS, DK-8000 Arhus

Page 16: Chapter 23 Text Processing

TX77845-1234

• 1st try: wwddddd
• 2nd (remember -1234): wwddddd-dddd
• What's "special"?
• 3rd: \w\w\d\d\d\d\d-\d\d\d\d
• 4th (make counts explicit): \w2\d5-\d4
• 5th (and "special"): \w{2}\d{5}-\d{4}
• But -1234 was optional?
• 6th: \w{2}\d{5}(-\d{4})?
• We wanted an optional space after TX
• 7th (invisible space): \w{2} ?\d{5}(-\d{4})?
• 8th (make space visible): \w{2}\s?\d{5}(-\d{4})?
• 9th (lots of space – or none): \w{2}\s*\d{5}(-\d{4})?


Page 17: Chapter 23 Text Processing

    #include <iostream>
    #include <string>
    #include <fstream>
    #include <regex>
    using namespace std;

    int main()
    {
        ifstream in("file.txt");    // input file
        if (!in) cerr << "no file\n";

        regex pat ("\\w{2}\\s*\\d{5}(-\\d{4})?");    // ZIP code pattern
        // cout << "pattern: " << pat << '\n';       // printing of patterns is not C++11

        // …
    }


Page 18: Chapter 23 Text Processing

  

    int lineno = 0;
    string line;    // input buffer
    while (getline(in,line)) {
        ++lineno;
        smatch matches;    // matched strings go here
        if (regex_search(line, matches, pat)) {
            cout << lineno << ": " << matches[0] << '\n';    // whole match
            if (1<matches.size() && matches[1].matched)
                cout << "\t: " << matches[1] << '\n';        // sub-match
        }
    }


Page 19: Chapter 23 Text Processing

Results

Input:

    address TX77845
    ffff tx 77843 asasasaa
    ggg TX3456-23456
    howdy
    zzz TX23456-3456sss ggg TX33456-1234
    cvzcv TX77845-1234 sdsas
    xxxTx77845xxx
    TX12345-123456

Output:

    pattern: "\w{2}\s*\d{5}(-\d{4})?"
    1: TX77845
    2: tx 77843
    5: TX23456-3456
        : -3456
    6: TX77845-1234
        : -1234
    7: Tx77845
    8: TX12345-1234
        : -1234

Page 20: Chapter 23 Text Processing

Regular expression syntax

• Regular expressions have a thorough theoretical foundation based on state machines
  • You can mess with the syntax, but not much with the semantics
• The syntax is terse, cryptic, boring, useful
  • Go learn it
• Examples

    Xa{2,3}                    // Xaa Xaaa
    Xb{2}                      // Xbb
    Xc{2,}                     // Xcc Xccc Xcccc Xccccc …
    \w{2}-\d{4,5}              // \w is a word character (letter, digit, or underscore); \d is a digit
    (\d*:)?(\d+)               // 124:1232321 :123 123
    Subject: (FW:|Re:)?(.*)    // . (dot) matches any character
    [a-zA-Z][a-zA-Z_0-9]*      // identifier
    [^aeiouy]                  // not an English vowel


Page 21: Chapter 23 Text Processing

Searching vs. matching

• Searching for a string that matches a regular expression in an (arbitrarily long) stream of data
  • regex_search() looks for its pattern as a substring in the stream
• Matching a regular expression against a string (of known size)
  • regex_match() looks for a complete match of its pattern and the string


Page 22: Chapter 23 Text Processing

Table grabbed from the web

    KLASSE          ANTAL DRENGE    ANTAL PIGER    ELEVER IALT
    0A              12              11             23
    1A              7               8              15
    1B              4               11             15
    2A              10              13             23
    3A              10              12             22
    4A              7               7              14
    4B              10              5              15
    5A              19              8              27
    6A              10              9              19
    6B              9               10             19
    7A              7               19             26
    7G              3               5              8
    7I              7               3              10
    8A              10              16             26
    9A              12              15             27
    0MO             3               2              5
    0P1             1               1              2
    0P2             0               5              5
    10B             4               4              8
    10CE            0               1              1
    1MO             8               5              13
    2CE             8               5              13
    3DCE            3               3              6
    4MO             4               1              5
    6CE             3               4              7
    8CE             4               4              8
    9CE             4               9              13
    REST            5               6              11
    Alle klasser    184             202            386

(The Danish headers mean: class, number of boys, number of girls, pupils in total; "Alle klasser" means "all classes".)

• Numeric fields
• Text fields
• Invisible field separators
• Semantic dependencies
  • i.e. the numbers actually mean something
  • first numeric column + second numeric column == third numeric column
  • Last line is column sums

Page 23: Chapter 23 Text Processing

Describe rows

(In the patterns below, the character after each "(" is a literal tab.)

• Header line
  • Regular expression: ^[\w ]+(	[\w ]+)*$
  • As string literal: "^[\\w ]+(	[\\w ]+)*$"
• Other lines
  • Regular expression: ^([\w ]+)(	\d+)(	\d+)(	\d+)$
  • As string literal: "^([\\w ]+)(	\\d+)(	\\d+)(	\\d+)$"
• Aren't those invisible tab characters annoying?
  • Define a tab character class
• Aren't those invisible space characters annoying?
  • Use \s


Page 24: Chapter 23 Text Processing

Simple layout check

    int main()
    {
        ifstream in("table.txt");    // input file
        if (!in) error("no input file\n");

        string line;    // input buffer
        int lineno = 0;

        // note: the character after each '(' in the patterns is a literal tab
        regex header( "^[\\w ]+(	[\\w ]+)*$");                  // header line
        regex row( "^([\\w ]+)(	\\d+)(	\\d+)(	\\d+)$");       // data line

        // … check layout …
    }


Page 25: Chapter 23 Text Processing

Simple layout check

    int main()
    {
        // … open files, define patterns …

        if (getline(in,line)) {    // check header line
            smatch matches;
            if (!regex_match(line, matches, header)) error("no header");
        }

        while (getline(in,line)) {    // check data line
            ++lineno;
            smatch matches;
            if (!regex_match(line, matches, row))
                error("bad line", to_string(lineno));
        }
    }

Page 26: Chapter 23 Text Processing

Validate table

    int boys = 0;     // column totals
    int girls = 0;

    while (getline(in,line)) {    // extract and check data
        smatch matches;
        if (!regex_match(line, matches, row)) error("bad line");

        int curr_boy = from_string<int>(matches[2]);    // check row
        int curr_girl = from_string<int>(matches[3]);
        int curr_total = from_string<int>(matches[4]);
        if (curr_boy+curr_girl != curr_total) error("bad row sum");

        if (matches[1]=="Alle klasser") {    // last line; check columns:
            if (curr_boy != boys) error("boys don't add up");
            if (curr_girl != girls) error("girls don't add up");
            return 0;
        }

        boys += curr_boy;
        girls += curr_girl;
    }

Page 27: Chapter 23 Text Processing

Application domains

• Text processing is just one domain among many
  • Or even several domains (depending on how you count)
  • Browsers, Word, Acrobat, Visual Studio, …
• Image processing
• Sound processing
• Data bases
  • Medical
  • Scientific
  • Commercial
  • …
• Numerics
• Financial
• Real-time control
• …
