Your data is actually quite special, so it can be streamlined here. All strings are exactly 20 characters long and consist of only the four characters ATCG, so each string can be converted into an integer for comparison. The binary representation is as follows:
A ==> 00
T ==> 01
C ==> 10
G ==> 11
Because the length of a string is fixed and each character can be represented by 2 bits, each string can be expressed as a 40-bit integer. You can represent that as 32 + 8 bits, or simply use a 64-bit integer. I suggest using C.
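A minimal sketch of such an encoder, assuming the input is validated elsewhere (the function name encode is my own choice):

#include <stdint.h>

/* Pack a 20-character ATCG string into the low 40 bits of a uint64_t,
   2 bits per character: A=00, T=01, C=10, G=11. */
uint64_t encode(const char *s)
{
    uint64_t v = 0;
    for (int i = 0; i < 20; ++i) {
        uint64_t code;
        switch (s[i]) {
        case 'A': code = 0; break;
        case 'T': code = 1; break;
        case 'C': code = 2; break;
        default:  code = 3; break;  /* 'G' */
        }
        v = (v << 2) | code;
    }
    return v;
}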
Now let's talk about comparison. Because we need to find, for each candidate, all records in bg_db that differ from it in at most 4 positions, we only need to XOR (^) the two integers: a pair can qualify only if the result contains at most 8 one-bits, and those one-bits fall into at most 4 of the 2-bit groups (a single mismatched position can produce two one-bits, since 00^11 = 11 and 10^01 = 11). Divide the 40 bits of the result into 20 groups; then at most 4 groups may take one of the three non-zero values b01, b10, b11, and the rest must all be b00. The comparison algorithm is then easy to write: for each byte (four groups), look up how many of its groups are non-zero, and sum the per-byte counts to get the overall result. Since a byte has only 256 possible values (only 3^4 = 81 of which have all four groups non-zero), the count for every byte value can be precomputed and stored in a table. Here is a function that returns how many non-zero groups a comparison result contains.
/* The code below generated the values in `table`:

int i;
for (i = 0; i < 256; ++i) {
    int t = 0;
    t += (i & 0x01 || i & 0x02) ? 1 : 0;
    t += (i & 0x04 || i & 0x08) ? 1 : 0;
    t += (i & 0x10 || i & 0x20) ? 1 : 0;
    t += (i & 0x40 || i & 0x80) ? 1 : 0;
    printf("%d,", t);
    if (i % 10 == 9) { putchar('\n'); }
}
*/
int table[] = {
0,1,1,1,1,2,2,2,1,2,
2,2,1,2,2,2,1,2,2,2,
2,3,3,3,2,3,3,3,2,3,
3,3,1,2,2,2,2,3,3,3,
2,3,3,3,2,3,3,3,1,2,
2,2,2,3,3,3,2,3,3,3,
2,3,3,3,1,2,2,2,2,3,
3,3,2,3,3,3,2,3,3,3,
2,3,3,3,3,4,4,4,3,4,
4,4,3,4,4,4,2,3,3,3,
3,4,4,4,3,4,4,4,3,4,
4,4,2,3,3,3,3,4,4,4,
3,4,4,4,3,4,4,4,1,2,
2,2,2,3,3,3,2,3,3,3,
2,3,3,3,2,3,3,3,3,4,
4,4,3,4,4,4,3,4,4,4,
2,3,3,3,3,4,4,4,3,4,
4,4,3,4,4,4,2,3,3,3,
3,4,4,4,3,4,4,4,3,4,
4,4,1,2,2,2,2,3,3,3,
2,3,3,3,2,3,3,3,2,3,
3,3,3,4,4,4,3,4,4,4,
3,4,4,4,2,3,3,3,3,4,
4,4,3,4,4,4,3,4,4,4,
2,3,3,3,3,4,4,4,3,4,
4,4,3,4,4,4
};
#include <stdint.h>

/* Returns how many of the 20 two-bit groups in the XOR result are non-zero.
   Assumes a little-endian machine and that the comparison result occupies
   the low 40 bits. */
int getCount(uint64_t cmpresult)
{
    uint8_t* pb = (uint8_t*)&cmpresult;
    return table[pb[0]] + table[pb[1]] + table[pb[2]] + table[pb[3]] + table[pb[4]];
}
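Checking one pair then reduces to an XOR plus five table lookups; for example, with the hypothetical encode() sketched above:

if (getCount(encode("GGGAGCAGGCAAGGACTCTG") ^ encode("CTGCTGACGGGTGACACCCA")) <= 4) {
    /* at most 4 of the 20 positions differ: keep this pair */
}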
First of all, your time estimate is completely wrong. For large-scale data processing like this, you need to run tens of thousands of items for ten-odd seconds or more, and only then multiply up to calculate the total time. If you time only a single item, almost everything you measure is the overhead of the initialization process, not the IO and CPU costs that actually matter.
The four possibilities of ATCG fit in 2 bits, so using a whole character for one gene position is too wasteful: one character is 8 bits and can hold 4 gene positions. Even without any algorithm, just writing your 20 gene positions in binary form saves about 5x.
In addition, looping 20 times costs 20*n CPU instructions, with n estimated at 3 or more, while for the binary form the comparison is a single XOR instruction.
I don't know the algorithms well, but speaking from experience, complex algorithms take longer and are often not as fast as this simple, crude approach. You can also consider multi-threading and clusters to process the data. By the way, Hamming distance seems to be what this computes.
So, using no algorithm at all, I wrote the brute-force solution in C and tested it on my machine (CPU: Core 2 Duo E7500, RAM: 4G, OS: Fedora 19).
Scaled to the asker's 24-core CPU, performance improves about 20x; with 48 machines working together on top of that, and 50 million (5000万) operations taking 15 seconds, the total time is 10000000 * 1000000000 / 500 / 1000000 * 15 / 20 / 48 / 3600 / 24 = 3.616898 days.
Switching to multi-threading is faster still. On my machine the improvement is not very big, but newer CPUs have hyper-threading, so the speedup should be better. A simple test with 2 threads on 500 candidates brings the time down to 9040257 microseconds; with the thread count increased to 4...
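As a rough sketch of that multi-threaded split (assuming the encoded arrays and the getCount() from the first answer; NTHREADS and all other names here are mine):

#include <stddef.h>
#include <stdint.h>
#include <pthread.h>

#define NTHREADS 4

extern uint64_t candidates[];      /* encoded candidate records */
extern uint64_t bg[];              /* encoded bg_db records     */
extern size_t n_candidates, n_bg;
int getCount(uint64_t cmpresult);  /* table lookup from above   */

typedef struct { size_t begin, end; long hits; } Job;

/* Each thread scans its own slice of candidates against all of bg_db. */
static void *worker(void *arg)
{
    Job *job = arg;
    for (size_t i = job->begin; i < job->end; ++i)
        for (size_t j = 0; j < n_bg; ++j)
            if (getCount(candidates[i] ^ bg[j]) <= 4)
                ++job->hits;       /* or record the pair (i, j) */
    return NULL;
}

long count_all_pairs(void)
{
    pthread_t tid[NTHREADS];
    Job jobs[NTHREADS];
    size_t chunk = (n_candidates + NTHREADS - 1) / NTHREADS;
    for (int t = 0; t < NTHREADS; ++t) {
        size_t b = t * chunk;
        size_t e = b + chunk < n_candidates ? b + chunk : n_candidates;
        jobs[t] = (Job){ b, e, 0 };
        pthread_create(&tid[t], NULL, worker, &jobs[t]);
    }
    long total = 0;
    for (int t = 0; t < NTHREADS; ++t) {
        pthread_join(tid[t], NULL);
        total += jobs[t].hits;
    }
    return total;
}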
Sorry, I only read the question carefully today after seeing someone else reply, and realized I had treated it as a plain matching problem, which is why I suggested an Aho-Corasick automaton.
But the point of the question is to find the differences between sequences, that is, the edit distance between the two. See wiki: Edit distance, wiki: Levenshtein distance.
Back when I did online-judge problems, I used DP (dynamic programming) to find the minimum number of edits that converts one string into another. For example, converting str2 into str1: an insertion counts once, a deletion counts once, a modification counts once.
For the asker's ATCG gene sequences, do you only need to count modifications? But then how should a pair like ATCGATCG vs TCGATCGA be scored?
If you only count modifications, just compare str1[i] and str2[i] directly:
for (i = 0; i < len; ++i)
    if (str1[i] != str2[i]) ++count;
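For reference, a minimal sketch of that DP, i.e. the standard Levenshtein recurrence (the function name and the 20-character bound are my own):

#include <string.h>

/* Classic O(m*n) Levenshtein DP; assumes both strings are at most 20 chars.
   d[i][j] = minimum edits to turn the first j chars of str2
             into the first i chars of str1. */
int edit_distance(const char *str1, const char *str2)
{
    int m = (int)strlen(str1), n = (int)strlen(str2);
    int d[21][21];
    for (int i = 0; i <= m; ++i) d[i][0] = i;  /* i insertions */
    for (int j = 0; j <= n; ++j) d[0][j] = j;  /* j deletions  */
    for (int i = 1; i <= m; ++i)
        for (int j = 1; j <= n; ++j) {
            int sub = d[i-1][j-1] + (str1[i-1] != str2[j-1]); /* modify */
            int del = d[i][j-1] + 1;                          /* delete */
            int ins = d[i-1][j] + 1;                          /* insert */
            int best = sub < del ? sub : del;
            d[i][j] = best < ins ? best : ins;
        }
    return d[m][n];
}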
Inspired by @rockford: we can preprocess the raw data.
A string in candidates: GGGAGCAGGCAAGGACTCTG, extra data after processing: A5 T2 C4 G9
A string in bg_db: CTGCTGACGGGTGACACCCA, extra data after processing: A4 T3 C7 G6
A5 -> A4 is recorded as -1; T2 -> T3 as +1; C4 -> C7 as +3; G9 -> G6 as -3.
Obviously an A can only become T, C, or G through a modification. So summing all the + deltas (or, equivalently, all the - deltas, since both totals are 20) tells us at least how many positions differ; in the example above the + deltas sum to 4, so the two strings differ in at least 4 positions. Any pair whose bound exceeds 4 never needs the full comparison.
So: first compare the precomputed extra data, then run the per-pair comparison algorithm only on the pairs that survive. (Working overtime this Saturday; I'll write this up after I get off work.)
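A sketch of this prefilter on the plain 20-character strings (all names here are my own):

typedef struct { int a, t, c, g; } BaseCount;

/* Precompute per-string base counts once. */
BaseCount count_bases(const char *s)
{
    BaseCount bc = {0, 0, 0, 0};
    for (int i = 0; i < 20; ++i) {
        switch (s[i]) {
        case 'A': ++bc.a; break;
        case 'T': ++bc.t; break;
        case 'C': ++bc.c; break;
        default:  ++bc.g; break;
        }
    }
    return bc;
}

/* Sum of positive count deltas is a lower bound on differing positions.
   Returns 1 if the pair may still be within 4 differences. */
int may_match(BaseCount x, BaseCount y)
{
    int bound = 0;
    if (y.a > x.a) bound += y.a - x.a;
    if (y.t > x.t) bound += y.t - x.t;
    if (y.c > x.c) bound += y.c - x.c;
    if (y.g > x.g) bound += y.g - x.g;
    return bound <= 4;
}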
Your individual tasks are already determined; what you need is to hand them out to workers rather than run them synchronously in a single process. In effect, if you have [a, b] and [c, d] to compare, your tasks are
[a, c]
[a, d]
[b, c]
[b, d]
If you run them synchronously in serial, you need 4x the single-task time; with 4 CPUs or 4 machines in parallel, the total is roughly the single-task time.
So genome-scale calculations are basically done as multi-core parallel tasks on large machines, and the guiding principles are essentially those of Google's MapReduce paper.
I'm not good at algorithms, but for a volume of data like yours one computer is definitely not enough to do the comparison. For a CPU-intensive task like this I agree with what others have said about computing on a cluster or with multiple processes, that is, with the map-reduce model. Map is the mapping step: pair each of your candidates with bg_db one by one to form (candidates, bg_db) tuples, put them in a queue, and hand them to different servers, each of which works through its share with multiple processes. The amount of data is simply too large to do it any other way; find a way to partition your tasks and compute in parallel.
You can try storing all the strings in a dictionary tree (trie) and then answering queries by traversing it. While traversing, maintain a set of current nodes, where each entry records the number of mismatches between the traversed path and the corresponding prefix. At each step, try to descend from every node in the current set and form the next set. The advantage is that the current position of many strings is compared at once, which can save some time. Since each position has few choices and the mismatch budget is small, the current node set should not blow up. (This is a guess... I haven't verified it too carefully...)
class Node:
    def __init__(self, key=None):
        self.key = key      # character on the edge into this node
        self.child = []     # list of child Nodes

def match(candidate, root, max_mismatch=4):
    # Each entry is (node, remaining mismatch budget); two buffers are
    # swapped as we advance one character at a time.
    nset = [[], []]
    currentId = 0
    nset[currentId].append((root, max_mismatch))
    for ch in candidate:
        nextId = 1 - currentId
        for node, mismatch in nset[currentId]:
            for nx in node.child:
                if nx.key == ch:
                    nset[nextId].append((nx, mismatch))
                elif mismatch:
                    nset[nextId].append((nx, mismatch - 1))
        nset[currentId].clear()
        currentId = nextId
    return nset[currentId]
The above code is only a rough idea; written in C++ it would be much faster. The entire process can also be distributed across a cluster.
By the looks of it the asker does not have multiple machines to compute with, so here is a simple idea: assign each letter a value (A:0, T:1, C:2, G:3) and compute the sum for each string. First compare the two sums: their difference can be at most 12, which is four A's becoming four G's, so only process a pair if the sums differ by no more than 12. It's just a rough idea, and the specific weights can be tuned separately, but in theory it can be much faster. There is also an algorithm that computes the string edit distance (the minimum number of insertions, deletions, and modifications to turn one string into another); I can't remember it at the moment, so you can look it up yourself.
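A sketch of that sum filter (assuming 20-character strings; the helper name is mine):

/* Sum of per-position values with A:0, T:1, C:2, G:3. Four modifications
   can shift this sum by at most 4 * 3 = 12, so larger gaps can be skipped. */
int letter_sum(const char *s)
{
    int sum = 0;
    for (int i = 0; i < 20; ++i) {
        switch (s[i]) {
        case 'T': sum += 1; break;
        case 'C': sum += 2; break;
        case 'G': sum += 3; break;
        default: break;     /* 'A' adds 0 */
        }
    }
    return sum;
}
/* process a pair only if abs(letter_sum(s1) - letter_sum(s2)) <= 12 */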
I use BLAST (blastn-short).
:)