A detailed discussion of the longest common subsequence in JavaScript
小云云
Release: 2018-01-23 10:44:02
Original
1718 people have browsed it
The longest common subsequence is obtained by taking as many characters as possible from the given two sequences X and Y, and arranging them in the order in which they are arranged in the original sequence. The algorithm for LCS problems has a wide range of uses. For example, in the management of different versions of software, the LCS algorithm is used to find the similarities and differences between the old and new versions; in software testing, the LCS algorithm is used to compare recorded and played back sequences. In the field of genetic engineering, the LCS algorithm is used The algorithm checks the similarities and differences between the patient's DNA strand and the bond's DNA strand; in the anti-plagiarism system, the LCS algorithm is used to check the plagiarism rate of the paper. The LCS algorithm can also be used for program code similarity measurement, human running sequence retrieval, video segment matching, etc., so research on the LCS algorithm has high application value.
Basic concepts
Subsequence (subsequence): A subsequence of a specific sequence is the result obtained by removing zero or more elements from the given sequence ( does not change the relative order of elements). For example, the subsequences of the sequence are: , ,
Common subsequence: Given sequences X and Y, sequence Z is a subsequence of X and a subsequence of Y, then Z is the common subsequence of X and Y sequence. For example, X=[A,B,C,B,D,A,B], Y=[B,D,C,A,B,A[, then the sequence Z=[B,C,A] is X and Y The common subsequence of , its length is 3. But Z is not the longest common subsequence of X and Y, and the sequences [B, C, B, A] and [B, D, A, B] are also the longest common subsequences of X and Y, with a length of 4 , and X and Y do not have a common subsequence with a length greater than or equal to 5. For the common subsequences of the sequence [A, B, C] and the sequence [E, F, G], there is only the empty sequence [].
Longest common subsequence: Given sequences X and Y, select the one or several with the longest length from all their common subsequences.
Substring: A new series formed by deleting zero or several characters from the front or last or both at the same time. The difference is that subsequences can have characters cut out from the middle. How many subsequences are there in the string cnblogs? Obviously there are 27 of them. For example, cb, cgs, etc. are all subsequences of them.
Let me explain with a picture:
We can see that subsequences are not necessarily continuous, continuous ones are substrings.
Problem Analysis
We still start the analysis from a matrix and derive the state transition equation ourselves.
First of all, we convert the problem into a concept that is familiar enough to the front end. Instead of calling it serially, it can be thought of as an array or a string. To keep things simple, let's just assume that two strings are being compared.
We focus on the concept of "subsequence", which can delete multiple or zero ones, or all of them. At this time our first subsequence is empty string (if our sequence is not a string, we can still do it)! This is something you really need to pay attention to! Many people just can't understand the chart in "Introduction to Algorithms", and there are also many bloggers who don't pretend to understand. We always compare from left to right, and of course the first string, because it is the height of the matrix, is placed vertically.
x
""
B
D
C
A
B
A
#""
##A
B
C
D
A
B
False orderX = "ABCDAB", Y="BDCABA", each takes out the shortest sequence, that is, compares the empty string with the empty string. The solution of the LCS equation is a number, so this table can only be filled with numbers. The length of the common area of two empty strings is 0.
x
""
B
D
C
A
B
A
#""
0
#A
B
C
D
A
B
Then we don’t move X and continue to let the empty string come out of the array, and Y lets “B” come out of the array. Obviously, the length of their common areas is 0. Y is replaced with other characters, D, C, or, they Continuous combinations of DC and DDC, the situation has not changed, it is still 0. Therefore, the first row is 0. Then we do not move Y, and Y only produces empty strings, then it is the same as the above analysis, all are 0, the first The columns are all 0.
x
""
B
D
C
A
B
A
""
0
0
0
0
0
0
0
#A
0
B
0
C
0
D
0
A
0
B
0
The LCS problem is a little different from the knapsack problem. The knapsack problem can also be set to -1 row, and the longest common subsequence is due to the occurrence of empty subsequences. , fix the left side and the top side from the beginning.
Then we enlarge the problem a little further. This time both sides produce a character. Obviously, only when both are the same, can there be a common subsequence that is not an empty string, and the length can also be understood as 1.
A is "X", Y is any subsequence of "BDCA"
##x
""
B
D
C
A
B
A
#""0
0
0
0
0
0
0
##A
0
0
0
0
1
B
0
C
0
D
0
A
0
B
0
##Continue to fill in the blanks to the right, how to fill in the blanks? Obviously, LCS cannot be greater than the length of X. How can the subsequence of Y starting from the A string be equal to 1 compared with the A sequence of B.
x""
B
D
C
A
B
A
#""
0
0
0
0
0
0
0
#A
0
0
0
0
1
1
1
#B
0
C
0
D
0
##A
0
B
0
If It’s been explained. Then let's look at B first, ${X_1} == ${Y_0}, we get a new public substring, and we should add 1. why? Because our matrix is a state table, describing the state migration process from left to right and top to bottom, and these states are
accumulated
based on the existing states. What we need to confirm now is the relationship between the value of the grid we want to fill in and the values of the grids around it that have already been filled in. At present, there is too little information and it is just an isolated point, so just fill in 1.
x
""
B
D
C
A
B
A
#""
0
0
0
0
0
0
0
#A
0
0
0
0
1
1
1
#B
0
1
##C
0
D
0
A
0
B
0
Then we let Y have an extra D as a helper, {"",A,B,AB} vs {"",B,D,BD}. Obviously, continue to fill in 1. Fill in until the second one of Y Before B, it is all 1. Because when it comes to BDCAB, they have another common subsequence, AB.
x
""
B
D
C
A
B
A
#""
0
0
0
0
0
0
0
#A
0
0
0
0
1
1
1
#B
0
1
1
1
1
2
C
0
D
0
A
0
B
0
At this step, we can summarize some rules, and then verify our results through calculations Ideas, add new rules or constraints to improve.
Y Send all the characters, X is still 2 characters, after careful observation, still fill in 2.
Look at the five lines, send more X If the subsequence set of C and ABC is larger than the subsequence set of AB, then it and the B subsequence set of Y are larger. Even if they are not large, they cannot be smaller than the original ones. Obviously the newly added C cannot become a combat power and is not a common character between the two, so the value should be equal to the subsequence set of AB. ##×
""
B
D
C
A
B
A
##""
0
0
0
0
0
0
0
#A
0
0
0
0
1
1
1
#B
0
1
1
1
1
2
2
#C
0
1
D
0
A
0
B
0
And we can be sure that if the characters to be compared between the two strings are different, then the grid to be filled in is related to the left or top of it, and the larger one will be used.
If the compared characters are the same, don’t worry, it happens that the C of X needs to be compared with the C of Y, that is, the subsequence set of ABC {"",A,B,C,AB,BC, ABC} is compared with the subsequence set {"",B,D,C,BD,DC,BDC} of BDC, and the common substrings obtained are "",B,D. At this time, the conclusion is still the same as before. When the characters are equal, its corresponding grid value is equal to the value of the left, right, and upper left corners, and the left, upper, and upper left sides are always equal. These mysteries require more rigorous mathematical knowledge to demonstrate.
假设有两个数组,A和B。A[i]为A的第i个元素,A(i)为由A的第一个元素到第i个元素所组成的前缀。m(i, j)为A(i)和B(j)的最长公共子序列长度。
由于算法本身的递推性质,其实只要证明,对于某个i和j:
m(i, j) = m(i-1, j-1) + 1 (当A[i] = B[j]时)
m(i, j) = max( m(i-1, j), m(i, j-1) ) (当A[i] != B[j]时)
第一个式子很好证明,即当A[i] = B[j]时。可以用反证,假设m(i, j) > m(i-1, j-1) + 1 (m(i, j)不可能小于m(i-1, j-1) + 1,原因很明显),那么可以推出m(i-1, j-1)不是最长的这一矛盾结果。
第二个有些trick。当A[i] != B[j]时,还是反证,假设m(i, j) > max( m(i-1, j), m(i, j-1) )。
由反证假设,可得m(i, j) > m(i-1, j)。这个可以推出A[i]一定在m(i, j)对应的LCS序列中(反证可得)。而由于A[i] != B[j],故B[j]一定不在m(i, j)对应的LCS序列中。所以可推出m(i, j) = m(i, j-1)。这就推出了与反正假设矛盾的结果。
得证。</p>
<p> </p>
<p>We now use the following equation to continue filling in the table. </p>
<p><img src="https://img.php.cn/upload/article/000/054/025/4261b59f87cbe8cff3afd0492985a179-3.png" alt="A detailed discussion of the longest common subsequence in JavaScript" ></p>
<h2>Program implementation</h2>
<pre class="brush:php;toolbar:false">//by 司徒正美
function LCS(str1, str2){
var rows = str1.split("")
rows.unshift("")
var cols = str2.split("")
cols.unshift("")
var m = rows.length
var n = cols.length
var dp = []
for(var i = 0; i < m; i++){
dp[i] = []
for(var j = 0; j < n; j++){
if(i === 0 || j === 0){
dp[i][j] = 0
continue
}
if(rows[i] === cols[j]){
dp[i][j] = dp[i-1][j-1] + 1 //对角+1
}else{
dp[i][j] = Math.max( dp[i-1][j], dp[i][j-1]) //对左边,上边取最大
}
}
console.log(dp[i].join(""))//调试
}
return dp[i-1][j-1]
}
Copy after login
LCS can be further simplified, just by moving the position, eliminating the need to generate a new array
//by司徒正美
function LCS(str1, str2){
var m = str1.length
var n = str2.length
var dp = [new Array(n+1).fill(0)] //第一行全是0
for(var i = 1; i <= m; i++){ //一共有m+1行
dp[i] = [0] //第一列全是0
for(var j = 1; j <= n; j++){//一共有n+1列
if(str1[i-1] === str2[j-1]){
//注意这里,str1的第一个字符是在第二列中,因此要减1,str2同理
dp[i][j] = dp[i-1][j-1] + 1 //对角+1
} else {
dp[i][j] = Math.max( dp[i-1][j], dp[i][j-1])
}
}
}
return dp[m][n];
}
Copy after login
Printing an LCS
We will give the printing function and first look at how to print one. We start looking from the lower right corner and end at the top line. Therefore the target string is constructed in reverse order. In order to avoid using troublesome intermediate quantities such as stringBuffer, we can implement it recursively. Each time the program is executed, only one string is returned, otherwise an empty string is returned, using printLCS(x,y,...) + str[ i] are added to get the string we require.
We write another method to verify whether the string we get is a real LCS string. As a person who is already working, I cannot write code like a student in school and put it online without doing unit testing for others to step on.
//by 司徒正美,打印一个LCS
function printLCS(dp, str1, str2, i, j){
if (i == 0 || j == 0){
return "";
}
if( str1[i-1] == str2[j-1] ){
return printLCS(dp, str1, str2, i-1, j-1) + str1[i-1];
}else{
if (dp[i][j-1] > dp[i-1][j]){
return printLCS(dp, str1, str2, i, j-1);
}else{
return printLCS(dp, str1, str2, i-1, j);
}
}
}
//by司徒正美, 将目标字符串转换成正则,验证是否为之前两个字符串的LCS
function validateLCS(el, str1, str2){
var re = new RegExp( el.split("").join(".*") )
console.log(el, re.test(str1),re.test(str2))
return re.test(str1) && re.test(str2)
}
Copy after login
Use:
function LCS(str1, str2){
var m = str1.length
var n = str2.length
//....略,自行补充
var s = printLCS(dp, str1, str2, m, n)
validateLCS(s, str1, str2)
return dp[m][n]
}
var c1 = LCS( "ABCBDAB","BDCABA");
console.log(c1) //4 BCBA、BCAB、BDAB
var c2 = LCS("13456778" , "357486782" );
console.log(c2) //5 34678
var c3 = LCS("ACCGGTCGAGTGCGCGGAAGCCGGCCGAA" ,"GTCGTTCGGAATGCCGTTGCTCTGTAAA" );
console.log(c3) //20 GTCGTCGGAAGCCGGCCGAA
Copy after login
Print all LCS
The idea is similar to the above , let us note that there is a Math.max value in the LCS method, which actually integrates three situations, so three strings can be forked. Our method will return an es6 collection object for automatic removal. Then each time the new set is used to merge the strings of the old set.
//by 司徒正美 打印所有LCS
function printAllLCS(dp, str1, str2, i, j){
if (i == 0 || j == 0){
return new Set([""])
}else if(str1[i-1] == str2[j-1]){
var newSet = new Set()
printAllLCS(dp, str1, str2, i-1, j-1).forEach(function(el){
newSet.add(el + str1[i-1])
})
return newSet
}else{
var set = new Set()
if (dp[i][j-1] >= dp[i-1][j]){
printAllLCS(dp, str1, str2, i, j-1).forEach(function(el){
set.add(el)
})
}
if (dp[i-1][j] >= dp[i][j-1]){//必须用>=,不能简单一个else搞定
printAllLCS(dp, str1, str2, i-1, j).forEach(function(el){
set.add(el)
})
}
return set
}
}
Copy after login
Use:
function LCS(str1, str2){
var m = str1.length
var n = str2.length
//....略,自行补充
var s = printAllLCS(dp, str1, str2, m, n)
console.log(s)
s.forEach(function(el){
validateLCS(el,str1, str2)
console.log("输出LCS",el)
})
return dp[m][n]
}
var c1 = LCS( "ABCBDAB","BDCABA");
console.log(c1) //4 BCBA、BCAB、BDAB
var c2 = LCS("13456778" , "357486782" );
console.log(c2) //5 34678
var c3 = LCS("ACCGGTCGAGTGCGCGGAAGCCGGCCGAA" ,"GTCGTTCGGAATGCCGTTGCTCTGTAAA" );
console.log(c3) //20 GTCGTCGGAAGCCGGCCGAA
Copy after login
Space optimization
Use rolling array:
function LCS(str1, str2){
var m = str1.length
var n = str2.length
var dp = [new Array(n+1).fill(0)],now = 1,row //第一行全是0
for(var i = 1; i <= m; i++){ //一共有2行
row = dp[now] = [0] //第一列全是0
for(var j = 1; j <= n; j++){//一共有n+1列
if(str1[i-1] === str2[j-1]){
//注意这里,str1的第一个字符是在第二列中,因此要减1,str2同理
dp[now][j] = dp[i-now][j-1] + 1 //对角+1
} else {
dp[now][j] = Math.max( dp[i-now][j], dp[now][j-1])
}
}
now = 1- now; //1-1=>0;1-0=>1; 1-1=>0 ...
}
return row ? row[n]: 0
}
Copy after login
Dangerous recursive solution
A subsequence of str1 corresponds to a subsequence of the subscript sequence {1, 2, …, m}, therefore, str1 has ${2^m}$ different subsequences (the same is true for str2, such as ${2^n}$), so the complexity reaches an astonishing exponential time (${2^m * 2^n}$).
//警告,字符串的长度一大就会爆栈
function LCS(str1, str2, a, b) {
if(a === void 0){
a = str1.length - 1
}
if(b === void 0){
b = str2.length - 1
}
if(a == -1 || b == -1){
return 0
}
if(str1[a] == str2[b]) {
return LCS(str1, str2, a-1, b-1)+1;
}
if(str1[a] != str2[b]) {
var x = LCS(str1, str2, a, b-1)
var y = LCS(str1, str2, a-1, b)
return x >= y ? x : y
}
}
The above is the detailed content of A detailed discussion of the longest common subsequence in JavaScript. For more information, please follow other related articles on the PHP Chinese website!
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn