Text

Copyright © Software Carpentry 2010

This work is licensed under the Creative Commons Attribution License

See http://software-carpentry.org/license.html for more information.

Python

How to represent characters?

American English in the 1960s:

26 characters × {upper, lower}

+ 10 digits

+ punctuation

+ special characters for controlling teletypes

(new line, carriage return, form feed, bell, …)

= 7 bits per character (ASCII standard)
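
The tally above can be checked with a short sketch. It assumes Python 3 and the standard string module; the variable names are only illustrative, and the count of 33 control characters (codes 0 to 31 plus 127) is written in by hand rather than computed.

```python
import string

letters = len(string.ascii_letters)      # 26 characters x {upper, lower} = 52
digits = len(string.digits)              # 10
punctuation = len(string.punctuation)    # 32 marks such as ! " # $ %
space = 1                                # the space character, code 32
controls = 33                            # codes 0-31 and 127 (new line, bell, ...)

total = letters + digits + punctuation + space + controls
bits_needed = (total - 1).bit_length()   # bits needed to number 0 .. total-1

print(total, bits_needed)
# Every ASCII character is just a small integer:
print(ord('A'), ord('a'), ord('\n'), ord('\a'))
```

The total comes to 128 codes, which is exactly what 7 bits can number.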

How to represent text?

1. Fixed-width records

A crash reduces
your expensive computer
to a simple stone.

Easy to get to line N

But may waste space

What if lines are longer than the record length?

A crash reduces········
your expensive computer
to a simple stone.·····
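
The fixed-width layout above can be sketched in a few lines of Python. This is only an illustration: the record length of 23 (the longest line in the example), the middle-dot padding character, and the helper name get_record are assumptions, not part of the original slides.

```python
RECORD_LENGTH = 23      # assumed: width of the longest line in the example
PAD = '\u00b7'          # assumed padding character (a middle dot)

lines = ['A crash reduces',
         'your expensive computer',
         'to a simple stone.']

# Pad every line to the same width and store them back to back.
storage = ''.join(line.ljust(RECORD_LENGTH, PAD) for line in lines)


def get_record(storage, n):
    """Return line n (counting from 0) by computing its offset directly."""
    start = n * RECORD_LENGTH
    return storage[start:start + RECORD_LENGTH].rstrip(PAD)


wasted = len(storage) - sum(len(line) for line in lines)
print(get_record(storage, 2))
print(len(storage), wasted)
```

Getting to line N is a single multiplication, but 13 of the 69 stored characters are padding, and a line longer than 23 characters would not fit at all.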

2. Stream with embedded end-of-line markers

A crash reduces
your expensive computer
to a simple stone.

A crash reduces↵your expensive computer↵to a simple stone.↵

More flexible

Wastes less space

Skipping ahead is harder

What to use for end of line?
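
Why skipping ahead is harder in a stream can be sketched as follows. The sketch assumes '\n' as the end-of-line marker, which is only one of the possible conventions, and the helper name get_line is hypothetical.

```python
# The same three lines stored as one stream with a marker after each line.
stream = 'A crash reduces\nyour expensive computer\nto a simple stone.\n'


def get_line(stream, n):
    """Return line n (counting from 0) by scanning past n end-of-line markers."""
    start = 0
    for _ in range(n):
        # No way to jump: each earlier marker has to be searched for.
        start = stream.index('\n', start) + 1
    end = stream.index('\n', start)
    return stream[start:end]


print(get_line(stream, 2))
print(len(stream))
```

The stream needs only 59 characters (56 of text plus 3 markers), but reaching line N means reading everything before it. As written, the helper raises ValueError if the stream has fewer than N + 1 lines.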

Unix: newline ('\n')

Windows: carriage return + newline ('\r\n')

Oh dear…

In text mode on Windows, Python converts '\r\n' to '\n' when reading and back when writing

To prevent this (e.g., when reading image files)

open the file in binary mode

reader = open('mydata.dat', 'rb')
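
The difference between the two modes can be seen with a small experiment. This sketch assumes Python 3, where text mode translates '\r\n' to '\n' on reading on every platform by default (not only on Windows); the temporary directory and the file contents are made up for the illustration.

```python
import os
import tempfile

# Write two lines with Windows-style endings, byte for byte.
path = os.path.join(tempfile.mkdtemp(), 'mydata.dat')
with open(path, 'wb') as writer:
    writer.write(b'first\r\nsecond\r\n')

# Text mode: '\r\n' is translated to '\n' while reading.
with open(path, 'r', encoding='ascii') as reader:
    as_text = reader.read()

# Binary mode: the bytes come back exactly as stored.
with open(path, 'rb') as reader:
    as_bytes = reader.read()

print(repr(as_text))
print(repr(as_bytes))
```

For image files and other non-text data, that translation would silently corrupt any byte pair that happens to look like a line ending, which is why binary mode matters.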

Back to characters…

How to represent ĕ, β, Я, …?

7 bits = 0…127

8 bits (a byte) = 0…255

Different companies/countries defined different meanings for 128…255

Did not play nicely together

And East Asian "characters" won't fit in 8 bits
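
To see why the 8-bit schemes did not play nicely together, one byte can be decoded under several of them. This is a sketch assuming Python 3; the three encodings are just examples picked from the codecs that ship with Python.

```python
import unicodedata

byte = b'\xe9'   # a single byte with the value 233

western = byte.decode('latin-1')       # Western European
cyrillic = byte.decode('cp1251')       # Windows Cyrillic
greek = byte.decode('iso-8859-7')      # Greek

# Print the character names rather than the characters themselves,
# so the output is plain ASCII on any terminal.
for ch in (western, cyrillic, greek):
    print('U+%04X' % ord(ch), unicodedata.name(ch))
```

The same byte is an accented Latin letter, a Cyrillic letter, or a Greek letter depending on which table the reader assumes, so a file is unreadable unless its encoding is known.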

1990s: Unicode standard

Defines mapping from characters to integers

Does not specify how to store those integers

32 bits per character will do it...

...but wastes a lot of space in common cases

Use in memory (for speed)

Use something else on disk and over the wire
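
The split between the mapping and the storage can be sketched in Python 3. ord() gives the integer Unicode assigns to a character, and the choice of encoding decides how many bytes that integer takes; the sample strings are only illustrative.

```python
# A, e-breve, beta, Cyrillic Ya (written as escapes to keep the source ASCII)
text = 'A\u0115\u03b2\u042f'

# The mapping: each character is an integer (a "code point").
code_points = [ord(ch) for ch in text]
print(code_points)

# The storage: the same integers, written out two different ways.
fixed = text.encode('utf-32-le')    # always 32 bits (4 bytes) per character
compact = text.encode('utf-8')      # 1 byte for 'A', 2 for each of the others
print(len(fixed), len(compact))

# The common case the slides mention: plain English text.
english = 'to a simple stone.'
print(len(english.encode('utf-32-le')), len(english.encode('utf-8')))
```

For the English line, the 32-bit form takes 72 bytes where 18 would do.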

(Almost) everyone uses a variable-length encoding called UTF-8 instead

First 128 characters (old ASCII) stored in 1 byte each

Next 1920 stored in 2 bytes, etc.

0xxxxxxx                               7 bits
110yyyyy 10xxxxxx                     11 bits
1110zzzz 10yyyyyy 10xxxxxx            16 bits
11110www 10zzzzzz 10yyyyyy 10xxxxxx   21 bits

The good news is, you don't need to know
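
For the curious, the bit patterns above can be turned into a small encoder. This is a simplified sketch in Python 3, not how Python itself does it: the function name utf8_encode is hypothetical, and it does not reject invalid inputs such as surrogates or values above 0x10FFFF. In practice str.encode('utf-8') does the job.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point using the four bit patterns."""
    if code_point < 0x80:            # 0xxxxxxx (7 bits)
        return bytes([code_point])
    if code_point < 0x800:           # 110yyyyy 10xxxxxx (11 bits)
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:         # 1110zzzz 10yyyyyy 10xxxxxx (16 bits)
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110www 10zzzzzz 10yyyyyy 10xxxxxx (21 bits)
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# One sample from each length class, compared with the built-in encoder.
for ch in ('A', '\u03b2', '\u4e2d', '\U0001F600'):
    assert utf8_encode(ord(ch)) == ch.encode('utf-8')
    print('U+%04X' % ord(ch), utf8_encode(ord(ch)).hex())
```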

Python 2.* provides two kinds of string

Classic: one byte per character

Unicode: "big enough" per character

Write u'the string' for Unicode

Must specify encoding when converting from Unicode to bytes

Use UTF-8
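
A minimal sketch of that conversion follows. The u'...' prefix is what Python 2 requires; it is also accepted by Python 3.3 and later, where every ordinary string is already Unicode, so the same lines should run on both. The sample text is just the three characters from the earlier slide.

```python
# e-breve, beta, Cyrillic Ya, separated by spaces
text = u'\u0115 \u03b2 \u042f'

# Unicode to bytes: the encoding has to be named explicitly.
data = text.encode('utf-8')

# Bytes back to Unicode: name the same encoding again.
round_trip = data.decode('utf-8')

print(len(text), len(data))
print(repr(data))
```

Five characters become eight bytes, because each of the three non-ASCII characters takes two bytes in UTF-8.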

October 2010

created by

Greg Wilson
