24
Pairwise Alignment, Part II Reconstructing the Path and Calculating the LCS from the Values and Directions Tables

Pairwise Alignment, Part II

  • Upload
    jaimin

  • View
    51

  • Download
    0

Embed Size (px)

DESCRIPTION

Pairwise Alignment, Part II. Reconstructing the Path and Calculating the LCS from the Values and Directions Tables . SETUP: Part II involves using the Visual Basic Editor ( VBE ) to write VBA Function & Subroutines , and creating Buttons . - PowerPoint PPT Presentation

Citation preview

Page 1: Pairwise  Alignment, Part II

Pairwise Alignment, Part IIReconstructing the Path and Calculating the LCS

from the Values and Directions Tables

Page 2: Pairwise  Alignment, Part II

SETUP: Part II involves using the Visual Basic Editor (VBE) to write VBA Function & Subroutines, and creating Buttons.

1. Add the Developer tab to the ribbon. Click the Office button , and then click the Excel Options button at the bottom. In the Popular category, select the Show Developer tab in the Ribbon check box from the options that appear and then click OK.

2. Click the Developer tab in the ribbon whenever you want to access the Visual Basic button in the Code group.

Note: You can switch back and forth between workbook environment and the Visual Basic Editor (VBE) environment with the Alt+F11 combination or with the application tabs at the bottom of the Windows screen.

Page 3: Pairwise  Alignment, Part II

Excel 2007 comes with a new file format: Macro-Enabled Workbook that uses the file extension .xlsm. You can no longer run macros and VBA scripts in regular Excel files, i.e. those that have the extension .xlsx . Therefore you will need to click the Office button , select Save As Excel Macro-Enabled Workbook to save your spreadsheet so that it will run the scripts that you will now be adding.

Page 4: Pairwise  Alignment, Part II

Open the Visual Basic Editor (VBE) and select View → Project Explorer (Ctrl+R).

You will now see a listing of the 2 worksheets – Values and Directions – in your workbook.

Double-click on the Values item to view the Code Section.

Visual Basic code uses the worksheet in which it resides as its default location when determining cell references. You will be placing the VB code into the Values sheet.

Page 5: Pairwise  Alignment, Part II

Option Explicit

Function maxAddress(The_Range) Dim maxNum As Long Dim Cell As range maxNum = Application.Max(The_Range) ' Returns the LAST cell in the range that = maxNum For Each Cell In The_Range If Cell = maxNum Then maxAddress = Cell.Address End If Next CellEnd Function

Sub clearLCS() range("C10:M20").Interior.ColorIndex = xlColorIndexNone Worksheets("Directions").range("C10:M20").Interior.ColorIndex = xlColorIndexNone range("C4:C6").Value = ""End Sub

in the VBA Editor, copy and paste the maxAddress function and 2 subroutines clearLCS and showLCS on this slide and the next into the Code Section of the Values Worksheet.

(We will explain what the VB script code does after you get it working)

Page 6: Pairwise  Alignment, Part II

Sub showLCS() Dim direction, maxCellAddress, VALUES_TABLE_DATA_RANGE As String Dim r, c, UP_ARROW, LEFT_ARROW, BACKSLASH, CLR_GRAY, ROW_LTR, COL_LTR As Long Dim bMATCH, bLEFT, bUP As Boolean ' CONSTANT declarations UP_ARROW = ChrW(8593) 'Hex 2191 = Dec 8593 Up Arrow LEFT_ARROW = ChrW(8592) 'Hex 2190 = Dec 8592 Left Arrow BACKSLASH = ChrW(92) 'Hex 5C = Dec 92 CLR_GRAY = RGB(192, 192, 192) ROW_LTR = 9 COL_LTR = 2 VALUES_TABLE_DATA_RANGE = "D11:M20" maxCellAddress = maxAddress(range(VALUES_TABLE_DATA_RANGE)) r = range(maxCellAddress).Row c = range(maxCellAddress).Column Do Cells(r, c).Interior.Color = CLR_GRAY Worksheets("Directions").Cells(r, c).Interior.Color = CLR_GRAY direction = Worksheets("Directions").Cells(r, c).Value bMATCH = (direction = BACKSLASH) bLEFT = (direction = LEFT_ARROW) Or (direction = "0" And c > COL_LTR + 1) bUP = (direction = UP_ARROW) Or (direction = "0" And r > ROW_LTR + 1) If bMATCH Then r = r - 1 c = c - 1 ElseIf bLEFT Then c = c - 1 ElseIf bUP Then r = r - 1 End If Loop While c > COL_LTR + 1 Or r > ROW_LTR + 1End Sub

Page 7: Pairwise  Alignment, Part II

You will now insert 2 buttons in the Values worksheet that will float in the space above the cells A4:A6.

When the Assign Macro Dialog appears, select Sheet1.showLCS.

In the Developer tab, select Insert, then choose the Button (Form Control).

Name the button showLCS.Create a 2nd button, assign the macro Sheet1.clearLCS and name the button clearLCS.

Draw the outlines of your buttons with the mouse.

Page 8: Pairwise  Alignment, Part II

When you click on the showLCS button, if Seq 1 is ATCTGAT and Seq 2 is TGCATA, then the data table cells should be highlighted in grey as in the picture below.

The clearLCS button should remove the background color.

Page 9: Pairwise  Alignment, Part II

We reconstruct the LCS path by locating the cell containing the largest value in the Values table, then following the arrows back to a cell containing the value zero. If there are several cells with the largest value, all are possible LCS’s. However, in this exercise, we will pick the one that is most distant.

Reconstruction of the Path

In order to reach the beginning letter of the 2nd sequence (the A in the horizontal sequence), we need to continue traveling either left or up.

If the beginning letter(s) of the 2 sequences did not match, then at this point, we have only reached the beginning letter of one of the sequences (the T in the vertical sequence).

Page 10: Pairwise  Alignment, Part II

What does the VBA code DO? What does it MEAN?Option Explicit

Function maxAddress(The_Range) Dim maxNum As Long Dim Cell As range maxNum = Application.Max(The_Range) ' Returns the LAST cell in the range that = maxNum For Each Cell In The_Range If Cell = maxNum Then maxAddress = Cell.Address End If Next CellEnd Function

Option Explicit: In most programming languages, variables must be DECLARED before they are used. Certain scripting languages, like VBA and Javascript, allow programmers to use variables without declaring them. This can lead to huge problems when trying to track down bugs, since a simple misspelling of a variable name creates a new variable, different from the one intended. The Option Explicit declaration is a safeguard measure that will show the programmer a compile error if VBA encounters a variable that has not previously been declared.

Page 11: Pairwise  Alignment, Part II

Functions and the function maxAddressOption Explicit

Function maxAddress(The_Range) Dim maxNum As Long Dim Cell As range maxNum = Application.Max(The_Range) ' Returns the LAST cell in the range that = maxNum For Each Cell In The_Range If Cell = maxNum Then maxAddress = Cell.Address End If Next CellEnd Function

Function: In VBA, a FUNCTION returns a value, while a SUBROUTINE does not. One implements the return value of a function by assigning the function NAME a value somewhere in the function’s body. No explicit return statement is used. The maxAddress function 1st uses the common Excel worksheet function max() to find the maximum value in a range and assign it to maxNum.It then uses a for-each loop (analogous to a Java for-each loop) to (a) iterate through all of the cells in the range and (b) return the address (cell reference) of the LAST cell it finds whose value equals maxNum.NOTE: If one wanted to return the 1st cell that the for-each loop found equal to maxNum, one would place an EXIT FUNCTION or EXIT FOR statement at the end of the IF statement.

Page 12: Pairwise  Alignment, Part II

Sub clearLCS() range("C10:M20").Interior.ColorIndex = xlColorIndexNone Worksheets("Directions").range("C10:M20").Interior.ColorIndex = xlColorIndexNone range("C4:C6").Value = ""End Sub

Subroutines and the subroutine clearLCS

Subroutine: In VBA, a SUBROUTINE does not return a value, but technically speaking is just like a FUNCTION. Subroutines, however, are generally written to perform actions. In contrast, functions are meant to calculate and return values, but not perform actions.

Line 3 of the subroutine body clears any text in the range C4:C6, although we haven’t yet placed anything in these cells. Later, we will place strings in these cells that will show the alignment and matching bases for the 2 sequences.

The clearLCS() subroutine clears the background color of all cells in the indicated range. It does so not only in the default worksheet (i.e. the VALUES worksheet, line 1 of the subroutine body), but in the DIRECTIONS worksheets as well (line 2).

Page 13: Pairwise  Alignment, Part II

Sub showLCS() Dim direction, maxCellAddress, VALUES_TABLE_DATA_RANGE As String Dim r, c, UP_ARROW, LEFT_ARROW, BACKSLASH, CLR_GRAY, ROW_LTR, COL_LTR As Long Dim bMATCH, bLEFT, bUP As Boolean ' CONSTANT declarations UP_ARROW = ChrW(8593) 'Hex 2191 = Dec 8593 Up Arrow LEFT_ARROW = ChrW(8592) 'Hex 2190 = Dec 8592 Left Arrow BACKSLASH = ChrW(92) 'Hex 5C = Dec 92 CLR_GRAY = RGB(192, 192, 192) ROW_LTR = 9 COL_LTR = 2 VALUES_TABLE_DATA_RANGE = "D11:M20" maxCellAddress = maxAddress(range(VALUES_TABLE_DATA_RANGE)) r = range(maxCellAddress).Row c = range(maxCellAddress).Column

The subroutine showLCS

The 1st section in showLCS() contains the Dim statements, which is where the local variables are declared.

Notice that some of these variables are capitalized to show that they are CONSTANTS, i.e. the values they are first assigned will not change during the execution of the program. The next command uses the maxAddress function to fetch the cell reference of the cell in the Values table that contains the largest value and is closest to the end of the table.

Finally, we extract the row and column numbers from the maximum value cell address.

The code above shows the initial statements for the showLCS() subroutine, that is, the statements that precede the do while loop.

Page 14: Pairwise  Alignment, Part II

The subroutine showLCS (continued) r = range(maxCellAddress).Row c = range(maxCellAddress).Column Do Cells(r, c).Interior.Color = CLR_GRAY Worksheets("Directions").Cells(r, c).Interior.Color = CLR_GRAY direction = Worksheets("Directions").Cells(r, c).Value bMATCH = (direction = BACKSLASH) bLEFT = (direction = LEFT_ARROW) Or (direction = "0" And c > COL_LTR + 1) bUP = (direction = UP_ARROW) Or (direction = "0" And r > ROW_LTR + 1) If bMATCH Then r = r - 1 c = c - 1 ElseIf bLEFT Then c = c - 1 ElseIf bUP Then r = r - 1 End If Loop While c > COL_LTR + 1 Or r > ROW_LTR + 1End Sub

The 1st section in the Do While Loop shades the background color of the last cell In the Values table whose value equals the table’s maximum (i.e. the cell we start with to retrace the path), as well as the corresponding cell in the Directions table. It also retrieves the direction from the latter cell.

We’ll now examine the code within showLCS’s Do While Loop.

The direction variable is used to set the values for the 3 boolean variables: bMATCH, bLEFT and bUP, depending upon whether the direction indicates diagonal, left or up.

The 2nd part of bLEFT’s and bUP’s boolean Or expressions is not used until the loop has reached the beginning of one of the 2 sequences.

Page 15: Pairwise  Alignment, Part II

The subroutine showLCS (continued) r = range(maxCellAddress).Row c = range(maxCellAddress).Column Do Cells(r, c).Interior.Color = CLR_GRAY Worksheets("Directions").Cells(r, c).Interior.Color = CLR_GRAY direction = Worksheets("Directions").Cells(r, c).Value bMATCH = (direction = BACKSLASH) bLEFT = (direction = LEFT_ARROW) Or (direction = "0" And c > COL_LTR + 1) bUP = (direction = UP_ARROW) Or (direction = "0" And r > ROW_LTR + 1) If bMATCH Then r = r - 1 c = c - 1 ElseIf bLEFT Then c = c - 1 ElseIf bUP Then r = r - 1 End If Loop While c > COL_LTR + 1 Or r > ROW_LTR + 1End Sub

If the direction value is diagonal, then bMATCH is true. In this case, we decrement both the row (r) and the column (c) variables so that in the next iteration, the cell we will examine will be the cell located diagonally to the current cell’s top left.If the direction value is left, then bLEFT is true. In this case, we decrement just the column (c) variable so that in the next iteration, we will examine the cell to the left of the current cell. Finally, if the direction value is up, then bUP is true. In this case, we decrement just the row (r) variable so that in the next iteration, we will examine the cell above the current cell.

Page 16: Pairwise  Alignment, Part II

The subroutine showLCS (continued) r = range(maxCellAddress).Row c = range(maxCellAddress).Column Do Cells(r, c).Interior.Color = CLR_GRAY Worksheets("Directions").Cells(r, c).Interior.Color = CLR_GRAY direction = Worksheets("Directions").Cells(r, c).Value bMATCH = (direction = BACKSLASH) bLEFT = (direction = LEFT_ARROW) Or (direction = "0" And c > COL_LTR + 1) bUP = (direction = UP_ARROW) Or (direction = "0" And r > ROW_LTR + 1) If bMATCH Then r = r - 1 c = c - 1 ElseIf bLEFT Then c = c - 1 ElseIf bUP Then r = r - 1 End If Loop While c > COL_LTR + 1 Or r > ROW_LTR + 1End Sub

The subroutine keeps looping as long as c > COL_LTR + 1 Or r > ROW_LTR + 1COL_LTR and ROW_LTR are the column and row containing the letters of the 2 sequences

Hence COL_LTR + 1 and ROW_LTR + 1 are the column and row that contain the zeroes.

ROW_LTR = 9COL_LTR = 2ROW_LTR + 1 = 10COL_LTR + 1 = 3

Therefore, the loop will keep iterating until r and c correspond to cell C10 (or cells above it or to its left).

Page 17: Pairwise  Alignment, Part II

When the loop reaches one of the cell’s containing a zero, the direction variable will contain a “0”, not one of the 3 direction values. At this point, we have reached the beginning of one of the sequences (in the table above, the blue sequence). Therefore to continue moving left (to get to the beginning of the red sequence), we need to add an expression that recognizes this case.

r = range(maxCellAddress).Row c = range(maxCellAddress).Column Do Cells(r, c).Interior.Color = CLR_GRAY Worksheets("Directions").Cells(r, c).Interior.Color = CLR_GRAY direction = Worksheets("Directions").Cells(r, c).Value bMATCH = (direction = BACKSLASH) bLEFT = (direction = LEFT_ARROW) Or (direction = "0" And c > COL_LTR + 1) bUP = (direction = UP_ARROW) Or (direction = "0" And r > ROW_LTR + 1) If bMATCH Then r = r - 1 c = c - 1 ElseIf bLEFT Then c = c - 1 ElseIf bUP Then r = r - 1 End If Loop While c > COL_LTR + 1 Or r > ROW_LTR + 1End Sub

r = range(maxCellAddress).Row c = range(maxCellAddress).Column Do Cells(r, c).Interior.Color = CLR_GRAY Worksheets("Directions").Cells(r, c).Interior.Color = CLR_GRAY direction = Worksheets("Directions").Cells(r, c).Value bMATCH = (direction = BACKSLASH) bLEFT = (direction = LEFT_ARROW) Or (direction = "0" And c > COL_LTR + 1) bUP = (direction = UP_ARROW) Or (direction = "0" And r > ROW_LTR + 1) If bMATCH Then r = r - 1 c = c - 1 ElseIf bLEFT Then c = c - 1 ElseIf bUP Then r = r - 1 End If Loop While c > COL_LTR + 1 Or r > ROW_LTR + 1End Sub

The subroutine showLCS (continued)

If the boolean expressions for bLEFT and bUP consisted only of the 1st condition, then r and c would stop being decremented when either the row or column (or both) reached the cells containing zeroes.

If we want to move left, that means that c > COL_LTR + 1 (since at this point r = ROW_LTR + 1).Similarly, had we wanted to move up, that would mean that r > ROW_LTR + 1.

r = 3 ( ROW_LTR + 1 )c = 5 ( > COL _LTR + 1)

Page 18: Pairwise  Alignment, Part II

Reconstructing the LCS Letter Sequence

The bases that the 2 sequences have in common are in the cells containing backslashes.

In order to compare the 2 sequences, we need to line up the strings next to each other and show both the bases that they have in common, and where they differ. To do this, we will declare 3 string variables: strSeq1, strSeq2 and strLCS.

Open VBA. Add the Dim statement below to showLCS() to declare these 3 String variables. Place in the section with the other variable declarations.

Dim strSeq1, strSeq2, strLCS As String

Page 19: Pairwise  Alignment, Part II

Reconstructing the LCS Letter Sequence (continued)

Initialize the strings so they are empty (the colon character allows you to put multiple statements on a single line). Place this line just before the Do-While Loop.

r = range(maxCellAddress).Row c = range(maxCellAddress).Column strSeq1 = "" : strSeq2 = "" : strLCS = "" ' Initialize to empty strings Do Cells(r, c).Interior.Color = CLR_GRAY Worksheets("Directions").Cells(r, c).Interior.Color = CLR_GRAY direction = Worksheets("Directions").Cells(r, c).Value bMATCH = (direction = BACKSLASH) bLEFT = (direction = LEFT_ARROW) Or (direction = "0" And c > COL_LTR + 1) bUP = (direction = UP_ARROW) Or (direction = "0" And r > ROW_LTR + 1) If bMATCH Then r = r - 1 c = c - 1 ElseIf bLEFT Then c = c - 1 ElseIf bUP Then r = r - 1 End If Loop While c > COL_LTR + 1 Or r > ROW_LTR + 1 range("C4").Value = strSeq1 range("C5").Value = strLCS range("C6").Value = strSeq2

After the Do-While Loop, add the statements for pasting these 3 strings into the Values worksheet in the 3 cells C4:C6. (Recall that these were the cells on line 3 of the clearLCS() subroutine body.)

Page 20: Pairwise  Alignment, Part II

Reconstructing the LCS Letter Sequence (continued)

Add the 3 lines above to the body of the If bMATCH statement. These statements will PREPEND the letter from the sequence into the string. We need to PREPEND each letter encountered because we are building the LCS Letter Sequence in reverse, from tail to head.

If bMATCH Then strSeq1 = Cells(ROW_LTR, c).Value + strSeq1 strSeq2 = Cells(r, COL_LTR).Value + strSeq2 strLCS = Cells(r, COL_LTR).Value + strLCS r = r - 1 c = c - 1 ElseIf bLEFT Then c = c - 1 ElseIf bUP Then r = r - 1 End If

Note that to capture the letter from the 1st sequence – as we move backwards from column to column, we are always referencing the letter from the ROW_LTR row (row 9).

To capture the letter from the 2nd sequence – as we move backwards from row to row, we are always referencing the letter from the COL_LTR column (column 2).

For the string that will show the common letters shared by the 2 aligned sequences, it doesn’t matter whether we use cell(ROW_LTR, C) or cell(r, COL_LTR), because the letters in both of these cells match. Note the importance of placing these 3 new statements BEFORE the lines that decrement r and c; otherwise, we would be extracting letters from the wrong cells!

Page 21: Pairwise  Alignment, Part II

Reconstructing the LCS Letter Sequence (continued)

Click on the showLCS button.You should see the LCS (Longest Common Subsequence) appear in each of the cells C4:C6.

Format C4 red and C6 blue to visually connect each to their original sequence. Although the text that all 3 contain is identical right now, we will be making changes that will make each distinct.

Also format these 3 cells so that they use the Courier New font (size 14 recommended). Courier New is a FIXED-WIDTH font, meaning that all of its characters are the same width (unlike most other fonts). Using this font will prove extremely useful in viewing similarities and differences between the 2 sequences because corresponding letters in the 2 strings will automatically line up vertically.

Page 22: Pairwise  Alignment, Part II

Reconstructing the LCS Letter Sequence (continued)

Add each of the 3 line code sections above to the If bLEFT and If bUP statements, exactly as you just did with the bMATCH statement.

If bMATCH Then strSeq1 = Cells(ROW_LTR, c).Value + strSeq1 strSeq2 = Cells(r, COL_LTR).Value + strSeq2 strLCS = Cells(r, COL_LTR).Value + strLCS r = r - 1 c = c - 1 ElseIf bLEFT Then strSeq1 = Cells(ROW_LTR, c).Value + strSeq1 strSeq2 = "-" + strSeq2 strLCS = "-" + strLCS c = c - 1 ElseIf bUP Then strSeq1 = "+" + strSeq1 strSeq2 = Cells(r, COL_LTR).Value + strSeq2 strLCS = "+" + strLCS r = r - 1 End If

When we travel LEFT, changing columns, we want to copy the letters from Sequence 1 in the ROW_LTR (9th) row into strSeq1.

Recall that when there is a match, we captured the shared letters in all 3 strings.

At the same time, we need to place a character in both strSeq2 and strLCS to preserve the alignment so that corresponding characters in all 3 strings will coincide. Therefore we add the "–" character to indicate a deletion.

Page 23: Pairwise  Alignment, Part II

Reconstructing the LCS Letter Sequence (continued) If bMATCH Then strSeq1 = Cells(ROW_LTR, c).Value + strSeq1 strSeq2 = Cells(r, COL_LTR).Value + strSeq2 strLCS = Cells(r, COL_LTR).Value + strLCS r = r - 1 c = c - 1 ElseIf bLEFT Then strSeq1 = Cells(ROW_LTR, c).Value + strSeq1 strSeq2 = "-" + strSeq2 strLCS = "-" + strLCS c = c - 1 ElseIf bUP Then strSeq1 = "+" + strSeq1 strSeq2 = Cells(r, COL_LTR).Value + strSeq2 strLCS = "+" + strLCS r = r - 1 End If

When we travel UP, changing rows, we want to copy the letters from Sequence 2 in the COL_LTR (2nd) column into strSeq2.At the same time, we need to place a character in both strSeq1 and strLCS to preserve the alignment so that corresponding characters in all 3 strings will coincide. Therefore we add the "+" character to indicate an insertion.

The indel (insertion/deletion) convention that we’ve implemented corresponds to a method for CONVERTING Seq 1 to Seq 2. Had we wanted an implementation that would have converted Seq 2 to Seq 1, we would have reversed our usage of the + and - characters.

Page 24: Pairwise  Alignment, Part II

Reconstructing the LCS Letter Sequence (continued)

When completed, clicking on the showLCS button when Sequence 1 = ATCTGAT andSequence 2 = TGCATA will show a pairwise alignment as in the image to the left.

Notice the usage of the plus and minus characters. To CONVERT Seq 1 to Seq 2, we need to delete the 1st three characters of Seq 1 (ATC), hence 3 minus signs appear at the beginning of strLCS (and at the beginning of strSeq2 to align it with strSeq1).

When we reach the 6th and 9th positions, we need to insert two characters (C and A), hence our use of the 2 plus signs.

strSeq1strLCSstrSeq2