Longest common subsequence problem
Dynamic Programming is a powerful technique that can be used to solve many problems in time T(n2) or T(n3) for which a naive approach would take exponential time. “DP” is a general approach to solving problems, much like “divide-and-conquer” is a general method, except that unlike divide-and-conquer, the subproblems will typically overlap and are used to solve the next one.
The cost of an algorithm in a program is a critical aspect because it can significantly impact on performances. In order to keep the cost low “DP” follow a couple of standard ways to progress
• memorization
• converting from top-down to bottom-up
Another very common problem when dealing with dynamic programming is the longest common subsequence problem (for brevity since now “LCS”). The “LCS” problem is as follows
We are given two strings: string S of length n, and string T of length m. The goal is to produce their longest common subsequence: the longest sequence of characters that appear left-to-right (but not necessarily in a contiguous block) in both strings.
The problem can be solved comparing the two strings; in the case S and T have a different length the desired subsequence has to ignore the ending elements of one of the two strings, in the case S and T have the same length all the characters have to be considered.
As subproblems the algorithm will look at the “LCS” of a prefix of S and a prefix of T, running over all pairs of prefixes. For simplicity, let’s worry first about finding the length of the LCS and then we can modify the algorithm to produce the actual sequence itself.
The longest common subsequence(LCS) problem is one of the classical and well- studied problems in computer science. The computation of the “LCS” is a frequent task in DNA sequence analysis, and has applications to genetics and molecular biology. Another practical application of the “LCS” algorithm is in order to compute the file differences.
Theoretically speaking the algorithm just fill out a matrix row by row, doing constant amount of work per entry, so this takes T(mn) time overall. The work actually performed is a comparison between the chars of the two strings at specific indexes.
The final answer (i.e. the length of the LCS of S and T) is in the lower-right corner.
To find the common sequence, the algorithm just walk backwards through matrix starting the lower-right corner.
If either the cell directly above or directly to the right contains a value equal to the value in the current cell, then move to that cell (if both to, then chose either one). If both such cells have values strictly less than the value in the current cell, then move diagonally up-left, and output the associated character.
This will output the characters in the LCS in reverse order.
Consider a practical example and search the “LCS” of the following strings
• ABAZDC
• BACBAD
The algorithm will fill out a matrix like the following one and running backward the matrix the output will be DABA that is the “LCS” between the two strings.
B | A | C | B | A | D | |
A | 0 | 1 | 1 | 1 | 1 | 1 |
B | 1 | 1 | 1 | 2 | 2 | 2 |
A | 1 | 2 | 2 | 2 | 3 | 3 |
Z | 1 | 2 | 2 | 2 | 3 | 3 |
D | 1 | 2 | 2 | 2 | 3 | 4 |
C | 1 | 2 | 3 | 3 | 3 | 4 |
From a practical point of view the following diagram shows up the classes involved into an ActionScript “DP” solution to the problem.
Let’s start to examine what model used by these classes.
The class LCSTable is the one to use in order to represent the matrix outlined at the beginning of the post, for this reason the constructor accept two arguments that represent the rows and columns of the matrix (i.e. the length of the two strings that will be examined) and use them to create an Array of Vectors
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
public function LCSTable(rows:int, columns:int){ _table = []; for(var i:int = 0; i <= rows; i++){ table[i] = new Vector.(columns + 1, true); for(var j:int = 0; j <= columns; j++){ table[i][j] = new LCSCell(0, LCSCell.UNDEFINED); } } } |
The property _table is exposed through a public getter and return the array of Vectors that is the way all the cells are represented into the model.
The LCSCell class is pretty simple, it stores the total number of matching strings and the direction the algorithm has to follow to walk backwards through matrix starting, the direction is stored into four static constants
- public static const UP:String = “^”;
- public static const LEFT:String = “<”;
- public static const DIAGONAL:String = “\”;
- public static const UNDEFINED:String = “+”;
when the LCSTable create the new cells set the direction to UNDEFINED and the total to 0.
The core of this solution for the longest common subsequence problem is the LCSParser class, let’s start to explore it.
The constructor of the class needs two arguments (i.e. the strings to compare) and store them into two private members of the class
1 2 3 4 5 6 7 8 9 |
public function LCSParser(first:String, second:String){ firstString = first; secondString = second; init(); populate(); } |
The protected method init() store the number of columns in other two private members and initialize the table setting the default values of the cells (i.e. direction UNDEFINED and total 0)
1 2 3 4 5 6 7 8 9 10 |
protected function init():void{ _lcs = ""; columns = firstString.length; rows = secondString.length; table = new LCSTable(rows, columns).table; } |
The populate() method creates two local variables to keep track of the total numbers of matching strings for each column and start the exploration of the LCSTable with tow nested for loops
1 2 3 4 5 6 7 8 9 10 11 12 |
var totalTemp1:int; var totalTemp2:int; for(var i:int = 1; i <= rows; i++){ for(var j:int = 1; j <= columns; j++){ // Additional code here } } |
In the body of the loop the method checks if there are matching characters or not in the two stored strings. If there is a match the current cell is updated considering if it belongs to the first row or column of the table (in this situation the total is 0 and the direction is intentionally left as UNDEFINED) or not.
When updating the cell the local variable totalTemp1 is used in order to increase the total property of the LCSCell instance properly.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 |
if(firstString.charAt(j - 1) == secondString.charAt(i - 1)){ var currentCell:LCSCell = getCell(i, j); if (i == 0 || j == 0) { currentCell.total = 0; // if not, set it to 1 plus the value of its upper left element } else { totalTemp1 = getCell(i - 1, j - 1).total; currentCell.total = totalTemp1 + 1; currentCell.direction = LCSCell.DIAGONAL; } }else{ totalTemp1 = getCell(i - 1, j).total; totalTemp2 = getCell(i, j - 1).total; updateDirection(getCell(i, j), totalTemp1, totalTemp2); } |
If there are not matching chars the method updateDirection() is called.
This method consider the position in the table (i.e. actual row and column) and update the direction and total accordingly to the position
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
private function updateDirection(cell:LCSCell, s1:int, s2:int):void{ if(s1 >= s2){ cell.total = s1; cell.direction = LCSCell.UP; }else{ cell.total = s2; cell.direction = LCSCell.LEFT; } } |
Once the table is populated the public method lcs() can be used to recover the longest common subsequence.
The method calls another one to walk backwards through matrix and return a string
1 2 3 4 5 6 |
public function get lcs():String{ parseLCS(secondString, rows, columns); return _lcs; } |
The parseLCS() method uses recursion and populate the _lcs property accordingly to the direction of each cell
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
private function parseLCS(data:String, i:int, j:int):void{ if ((i == 0) || (j == 0)){ return; } var currentCell:LCSCell = getCell(i, j); if(currentCell){ if(currentCell.direction == LCSCell.DIAGONAL){ parseLCS(data, i - 1, j- 1); var temp:String = data.charAt(i - 1); _lcs += temp; }else if(currentCell.direction == LCSCell.UP){ parseLCS(data, i - 1, j); }else{ parseLCS(data, i, j - 1); } } } |
A fully working demo is available here with the source code included (right click -> view source), please feel free to use it as you want but please provide me as much as possible your feedback.
Just in case you would like to go deeper on this topic following you can find some interesting linka
- http://en.wikipedia.org/wiki/Dynamic_programming
- http://wordaligned.org/articles/longest-common-subsequence
- http://igm.univ-mlv.fr/~lecroq/seqcomp/node4.html
Enjoy with dynamic programming!
Nice work!
LCS is used, for example, to recognize mouse gestures like they did here http://www.bytearray.org/?p=91