
Lecture 17: Error correction

DANIEL WELLER

THURSDAY, MARCH 21, 2019

Agenda

Error detection and correction

Parity bits and checksum codes

Hamming distance

Back to matrices

At the enterprise scale of large datacenters processing enormous amounts of information, even rare bit errors become common. Being able to detect and correct bit errors is crucial to modern computation and communication.


Error detection and correction

We have already noted that real-world measurements of signals are inherently noisy.

◦ If we threshold a digital signal to recover binary digits, the noise can cause random errors in our signal.

◦ The more substantial the noise, the more likely these errors are to occur.

To deal with this, we have multiple tools at our disposal. These tools can be used to detect, and sometimes correct, bit errors.

◦ Error detection: we can determine correctly (with high probability) whether or not a signal contains an error (or multiple errors).

◦ Error correction: we can determine correctly (with high probability) the intended or corrected signal bits, in the presence of one or more errors.

Since noise is random, we can only make probabilistic guarantees about these methods, rather than strict, deterministic ones. More about that later…


Error detection and correction

Error detection and correction is not a new idea; it was in use long before there were iPhones or computers.

Consider written language, for example. Certain words are spelled a specific way, by convention. In a way, misspellings are like bit errors: a single letter might be changed. However, given the context, we can usually determine the correct or intended words.

◦ In fact, we are pretty good at reading sentences even when a bunch of letters are missing:

◦ Example: “I w-nt t- li-e f-re-er; so fa-, s- g-od!” (I want to live forever; so far, so good!)

◦ Example: “Th-s -s wh- pe-p-e can rea- my ha-dwr-ti-g!” (This is why people can read my handwriting!)

◦ Consider: why does this work? What are we assuming about language and spelling?


Error detection and correction

There are many other real examples of error detection and correction.

◦ In DNA/RNA, numerous different codons (triples of nucleotides) encode the same amino acid. For instance, GCT, GCC, GCA, and GCG all lead to producing alanine. In other words, the third nucleotide in these codons is redundant.

◦ Credit cards use a “check digit” to confirm that the number as entered is a valid number. This matters because numbers are often entered manually (subject to human error) or read via a not-very-robust magnetic stripe (machine error).

◦ This check digit idea uses a formula that adds together the various digits of the credit card number in different ways and checks that the result takes a certain value. The Luhn algorithm is a common public-domain method for checking the validity of such numbers (a sketch follows this list).

◦ Book ISBNs also have check digits to ensure accuracy.
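To make the check-digit idea concrete, here is a minimal sketch of the Luhn check in Python (the function name luhn_valid is ours, for illustration):

def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn check."""
    total = 0
    # Walk the digits right to left, doubling every second digit.
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:       # same as summing the two digits of d
                d -= 9
        total += d
    return total % 10 == 0  # valid numbers sum to a multiple of 10

print(luhn_valid("79927398713"))  # True (a standard Luhn test number)
print(luhn_valid("79927398710"))  # False (last digit corrupted)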


Types of bit errors

When considering bit errors in digital signals, we generally consider two types of errors:

◦ Substitution error: bits may be flipped (0 becomes a 1, or 1 becomes a 0)

◦ Example: 1001011 -> 1001001

◦ We may have one or more of these errors in a bit string. We generally assume the errors occur independently of each other.

◦ Such errors are common in both communication and data storage systems and are usually caused by random fluctuations such as noise.

◦ Erasure error: bits may be erased (or lost)

◦ Example: 1001011 -> 101011

◦ Sometimes we know a bit is missing (e.g., we expected 7 bits but received only 6), but not always

◦ Erasures are more common in data storage but can also appear in communication systems if our channel gets obstructed. Erasure errors are common in old hard drives due to damaged sectors.

◦ We will primarily work with the first case, since it is usually unknown when such an error occurs.


Types of bit errors

We impose probabilistic models on these types of errors.

◦ For a substitution error, let p be the probability of a single bit flip. We usually assume bit flip probabilities are symmetric, so the probability of flipping a zero to a one is the same as flipping a one to a zero.

◦ We assume bit flips are independently and identically distributed. What probability distribution can be used to predict how many bit flips occur in a message?

◦ For an erasure error, we let pe be the probability of erasure. Again, we assume symmetry, so the erasure probability is the same regardless of the original bit being zero or one.


Parity bits

The idea of a check digit applies here as well, leading to what we call parity bits.

Parity bits have been around for 60+ years and help detect, and sometimes correct, errors via a checksum formula (like the formulas used for credit card numbers or ISBNs).

For instance, the simplest parity check is to see whether we have an even or odd number of ones or zeros.

◦ Example: Suppose we want to transmit 010, and we want to ensure even parity (so an even number of ones). Then we transmit 0101, where the last bit is a parity bit to ensure an even number of ones. Then, if the receiver gets 0101, the parity check will confirm it is correct, whereas if the receiver gets a bit flipped (e.g., 0111), the parity check will detect an error.

◦ In this case, the desired checksum is x₁ ⊕ x₂ ⊕ x₃ ⊕ x₄ = 0, the sum modulo 2: 0 ⊕ 1 ⊕ 0 ⊕ 1 = 0 (no error detected), while 0 ⊕ 1 ⊕ 1 ⊕ 1 = 1 (error detected). A code sketch follows below.

◦ What are we assuming here about the probability of bit flip errors?
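A minimal sketch of this even-parity scheme in Python (function names are ours, for illustration):

def add_even_parity(bits: str) -> str:
    """Append a parity bit so the total number of ones is even."""
    return bits + str(bits.count("1") % 2)

def parity_ok(bits: str) -> bool:
    """Check that the received word has an even number of ones."""
    return bits.count("1") % 2 == 0

word = add_even_parity("010")  # "0101"
print(parity_ok(word))    # True: no error detected
print(parity_ok("0111"))  # False: a single bit flip is detected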


Parity bits and redundancy

With parity bits, we are adding redundant information.

◦ Redundancy allows us to recover a message from a corrupted or imperfect copy, or to detect that the copy is erroneous.

◦ This is because the parity bit values depend on the other signal bit values, and introduce an interdependency among signal bits that did not exist before.

◦ This interdependency restricts the set of valid bit strings.

◦ While the information being transmitted remains the same, the number of bits per message increases.

◦ Unfortunately, this means we have a tradeoff: we can transmit data fast, or we can transmit data correctly.

◦ How best to add this redundancy? How much redundancy is enough? How can we measure the improvement (probabilistically)?


Reliability and redundancy

From a 2013 study of a Facebook datacenter:

◦ Hundreds of thousands of terabytes of data

◦ Growing at a few thousand terabytes per week

◦ Composed of thousands of machines storing 24-36 TB each

◦ Due to hardware failures, an average of 50 machines are unavailable at any given time

Amazon, Facebook, Google, Microsoft, etc. continue to expand their datacenters, and these numbers are likely now off by an order of magnitude.

How do we deal with this situation? How would you like to be one of the users on those 50 machines?


Reliability and redundancy

Facebook engineers have choices:

1. Build more reliable machines. Keep in mind the increased reliability may be very expensive.

2. Add redundancy across machines, so that if one machine fails, another can step in.

Ways to add redundancy:

◦ Copies (today)

◦ Error correcting codes (more next time)


Reliability and redundancy

Solution: Making copies. How many copies do I need to know which bit is wrong?

One copy (double redundancy) – can see differences, but not sure which is correct

Two copies (triple redundancy) – can see differences and determine which is correct via majority vote.

This is for substitution errors. If we are concerned with an erasure error (and we know when a hard drive fails, for instance), then we can use a single backup copy.


Double redundancy code

Let’s consider first the simple case of detecting an error via double redundancy.

We aim to transmit a single bit x, but we transmit two copies to check for errors: xx

There are four possible received messages: 00, 11, and erroneous 01, 10.

We can decode the two correct ones (00 -> 0, 11 -> 1), and we can ask the transmitter to retransmit if we detect an error.



Double redundancy code

Let p(error) = e, and suppose bit errors are symmetric and independent of each other.

Let’s analyze the probabilities for the double redundancy code. Note: binomial distribution.

◦ P(no errors) = (1-e)² = 1-2e+e². No errors, and we don’t detect any errors, so we’re good.

◦ P(1 error) = 2(1-e)e = 2e-2e². One error; we detect it but cannot correct it, so we’re sort of good.

◦ P(2 errors) = e². Two errors, and we don’t detect them, so we’re not good.

P(correct) = P(no errors) = (1-e)².

P(correct or detect) = P(no errors) + P(1 error) = 1-e².

P(not correct) = P(2 errors) = e².


How does this compare to no redundancy?

Double redundancy code

Example: Suppose we transmit a bit string containing ten bits, using double redundancy, for a total of 20 bits. Suppose the error probability is 0.1.

What is the probability of at least one undetected error without redundancy?

What is the probability of at least one undetected error with double redundancy?
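A sketch of the computation in Python (assuming e = 0.1; without redundancy every bit flip is an undetected error, while with double redundancy an error goes undetected only when both copies of a bit flip, probability e²):

e = 0.1

# No redundancy: any flipped bit is an undetected error.
print(1 - (1 - e) ** 10)       # ~0.651

# Double redundancy: undetected only if both copies of a bit flip.
print(1 - (1 - e ** 2) ** 10)  # ~0.096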


Triple redundancy code

Now we send two copies in addition to the original message. The third copy allows us to correct the error using majority rule.

000, 010, 100, 001 -> declare 0

111, 110, 101, 011 -> declare 1

There are two cases of no error -> correctly decoded

There are six cases of one error -> correctly detected, corrected

What happens if multiple bits are flipped?


[Figure: the eight possible received words grouped by majority vote – 000, 100, 010, 001 in the “Declare 0” region; 111, 011, 101, 110 in the “Declare 1” region.]

Triple redundancy code

P(no errors) = (1-e)³ = 1-3e+3e²-e³. Correct decoding.

P(1 error) = 3(1-e)²e = 3e-6e²+3e³. Error correctly detected and corrected.

P(2 errors) = 3(1-e)e² = 3e²-3e³. Errors not corrected properly (majority rule flips the bit).

P(3 errors) = e³. Errors not detected (bit remains flipped).

So the probability that triple redundancy works? (1-e)³ + 3(1-e)²e = (1-e)²(1+2e) = 1-3e²+2e³.

How does this compare to the double/single redundancy codes?

We can use the same scheme to detect two-bit errors if we don’t bother correcting. How?


Triple redundancy code

Example: Let’s repeat the previous example: we are now transmitting a 10-bit message using 30 bits (triple redundancy). How likely are we to receive the exact code? How likely are we to receive it or decode it correctly with majority voting? (A code sketch follows.)
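A sketch of the computation, reusing the per-bit probabilities from the previous slide (again assuming e = 0.1):

e = 0.1

# Probability that all 30 transmitted bits arrive correctly.
print((1 - e) ** 30)  # ~0.042

# Each bit decodes correctly if at most one of its three copies flips.
p_bit = (1 - e) ** 3 + 3 * (1 - e) ** 2 * e  # = 1 - 3e^2 + 2e^3
print(p_bit ** 10)    # ~0.753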

Observe: as the message size grows or the error probability increases, more redundancy may be needed to mitigate more errors.


Hamming distance

Observe: as we add more redundant or parity bits, we are making the valid code words more different from each other:

◦ 0 and 1 (no redundancy): differ by just one bit – susceptible to single-bit errors

◦ 00 and 11 (double redundancy): differ by two bits – can detect single-bit errors

◦ 000 and 111 (triple redundancy): differ by three bits – can detect single- and two-bit errors

Since we assume errors occur independently, the more redundancy we have, the less likely we are to miss error detection.

How do we measure the difference between code words? Hamming distance: the number of bits that differ between a pair of code words, or how many bits we need to flip to make one word the same as another. Example: d(000, 111) = 3 bits. (A code sketch follows.)
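A minimal Python sketch of this definition (the function name is ours):

def hamming_distance(w1: str, w2: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    assert len(w1) == len(w2)
    return sum(b1 != b2 for b1, b2 in zip(w1, w2))

print(hamming_distance("000", "111"))          # 3
print(hamming_distance("1001011", "1001001"))  # 1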


Hamming distance and error correction

Idea: to detect up to K errors, we need a Hamming distance of K+1 between code words.

◦ If we had d(w1,w2) = K, then we could introduce K errors to turn one word (w1) into another (w2).

◦ If we had d(w1,w2) = K+1, then up to K errors would always produce an invalid code word, which we can detect.

If we want to correct J errors, we need 2J+1 Hamming distance, so we are always guaranteed to be closest to the correct code word.


Fundamental Tenet VI:

A Hamming distance (between code words) of K+1 is needed to detect K errors per word, and a Hamming distance of 2J+1 is needed to correct J errors per word.

Hamming distance

Example: Show that the four-copies code (five copies of each bit in total) can detect four errors or correct two errors.

Detect 4 errors: the code words 00000 and 11111 have Hamming distance 5 = K+1, so K = 4 errors can be detected.

Correct 2 errors: 5 = 2J+1 gives J = 2 errors corrected. (An exhaustive check follows.)
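One way to verify both claims exhaustively in Python (a sketch, not the intended pen-and-paper argument):

from itertools import combinations

code_words = ["00000", "11111"]

def flip(word, positions):
    """Flip the bits of word at the given positions."""
    bits = list(word)
    for i in positions:
        bits[i] = "1" if bits[i] == "0" else "0"
    return "".join(bits)

def distance(w1, w2):
    return sum(b1 != b2 for b1, b2 in zip(w1, w2))

# Detect 4 errors: no pattern of 1-4 flips turns a code word into a valid word.
print(all(flip(w, pos) not in code_words
          for w in code_words
          for k in range(1, 5)
          for pos in combinations(range(5), k)))  # True

# Correct 2 errors: after 1-2 flips, the original is still the closest code word.
print(all(min(code_words, key=lambda c: distance(flip(w, pos), c)) == w
          for w in code_words
          for k in range(1, 3)
          for pos in combinations(range(5), k)))  # True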


Error correction and data rate

More redundancy (more copies) lowers the data rate, reducing transmission efficiency

◦ Can be worth it if error probability is high enough, or cost of errors is high enough

More sophisticated codes can achieve better efficiency with the same error correction performance (next time: block codes / Hamming codes)

Linear algebra is the key to formulating these block codes and showing how many errors they can detect or correct. To do that, we need to extend what we’ve learned about vectors and matrices to those involving binary variables.


Back to matrices

Linear algebra on binary vectors:

◦ Each vector contains just zeros and/or ones. However, the notion of a vector space is more limited: we can only scale by zero or one.

◦ We can, however, add or subtract vectors (addition and subtraction are the same operation). To add a pair of vectors, we add their elements modulo 2. So 0 ⊕ 0 = 0, 0 ⊕ 1 = 1, 1 ⊕ 0 = 1, and 1 ⊕ 1 = 0 (since 2 mod 2 = 0).

◦ Multiplying bits together is like a logical “and” operation (more about logical operations later in the course). So (0)(0) = 0, (0)(1) = 0, (1)(0) = 0, and (1)(1) = 1.

◦ With addition and multiplication, we can compute an inner product between two vectors: xᵀy = x₁y₁ ⊕ x₂y₂ ⊕ x₃y₃ ⊕ … ⊕ x_N y_N. As before, two vectors are orthogonal if their inner product is zero.

◦ Thus, we can also define a matrix-vector product (or a matrix-matrix product) via inner products between the rows of the left matrix/vector and the columns of the right matrix/vector. (A code sketch follows.)
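A sketch of these operations in Python with NumPy (computing the ordinary integer product and then reducing mod 2; the example vectors and matrix are ours):

import numpy as np

x = np.array([1, 0, 1, 1])
y = np.array([1, 1, 1, 0])

# Binary inner product: multiply elementwise, add, reduce mod 2.
print(int(x @ y) % 2)  # (1 + 0 + 1 + 0) mod 2 = 0, so x and y are orthogonal

# Matrix-vector product on binary vectors: ordinary product, then mod 2.
M = np.array([[1, 1, 0, 0],
              [0, 1, 1, 1]])
print((M @ x) % 2)     # [1 0]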


Matrix algebra

Given a binary matrix:

A =
1 0 0 1
0 0 1 1
1 1 0 1

Rows or columns are linearly dependent if one can be formed from a linear combination of the others. For instance, the first and third columns of A add up to the fourth:

(1, 0, 1)ᵀ ⊕ (0, 1, 0)ᵀ = (1, 1, 1)ᵀ

So the rank is defined the same way as before, as the largest number of linearly independent rows/columns.

Null space of matrix

Similarly, a binary matrix can have a null space: the set of vectors x such that Ax = 0. In the trivial case, the null space contains just the zero vector (0,0,…,0).

Example: with

A =
1 0 0 1
0 0 1 1
1 1 0 1

and x = (1, 0, 1, 1)ᵀ, we have Ax = 0 (mod 2), so x is in the null space of A.

The inner products between the rows of A and the vector x are like checksums: if x is supposed to be in the null space of A, each row of A describes a combination of bits of x that, when added together, should result in even parity (= 0). (A quick check appears below.)

◦ More about this next time!
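A quick check of this example with NumPy (reducing mod 2 as before):

import numpy as np

# The A and x from the example above.
A = np.array([[1, 0, 0, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 1]])
x = np.array([1, 0, 1, 1])

# Each entry of Ax (mod 2) is one parity check on the bits of x.
print((A @ x) % 2)              # [0 0 0]: every checksum passes
print(not np.any((A @ x) % 2))  # True -> x is in the null space of A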


Announcements

Next Thursday, March 28: Midterm #2 (more in a bit)

Next lecture: Hamming codes

Also, review for Midterm #2


Midterm #2

When: Thursday, March 28, during class (9:30 – 10:45 AM)

Where: In class (Olsson 120)

What: All the material up through and including last lecture (on linear algebra). This exam is comprehensive, so you might reuse your note sheet from the last test to help study.

◦ We’ll review in class on Tuesday.

Policies:

◦ Bring two sheets (single-sided, 8½ x 11”) of notes; no photocopies allowed on the note sheets

◦ No books or other course materials are allowed

◦ Calculators are welcome but unnecessary (this is not a test on how to use a calculator)

◦ Make-up: please notify Prof. Weller ahead of time (if possible); being busy is not an excuse

