I encountered a problem at work a few days ago:
There are 600,000 short message logs, each of which 50 words, 50,000 keywords, 2-8 words in length, most of them in Chinese. It is required to extract all the keywords contained in these 600,000 records and count the number of hits for each keyword.
This article gives a complete introduction to my implementation method to see how I can optimize a task that takes ten hours to run to within ten minutes. Although the implementation language is PHP, more ideas introduced in this article should be able to give you some help.
When I first received the task, my little mind immediately changed, Log + Keywords + Statistics, I did not think of writing the code myself, but first thought of the commonly used log statistics command under Linux grep
.
grep
I won’t mention the usage of the command anymore. Use grep 'keyword' | wc -l
which can be easily It is convenient to count the number of information items hit by keywords, and PHP's exec()
function allows us to directly call Linux shell commands, although there will be security risks when executing dangerous commands. .
Pseudo code:
foreach ($word_list as $keyword) { $count = intval(exec("grep '{$keyword}' file.log | wc -l")); record($keyword, $count); }
It was run on an old machine. The efficiency of the old machine was really poor, and it took 6 hours to run. It is estimated that the latest machine will take 2-3 hours. All subsequent optimizations will use new machines, and the requirements have changed. The main text has just begun.
Original, original in ideas and methods.
After handing in the job, the product came up with new ideas the next day, saying that it wanted to connect a certain data source in the future, the news Delivered as a data stream, not a file. It also requires the real-time nature of message statistics. My idea of writing the data to a file and then counting it was overturned. For the scalability of the solution, the current statistical object is no longer a whole, but needs to be considered. A single message was matched.
At this time, I was a little confused and had to resort to the most traditional tool - regularity. The implementation of regular expressions is not difficult, and each language has encapsulated regular matching functions. The focus is on the construction of patterns.
Of course, the pattern construction here is not difficult, /keywordOptimize the matching of huge amounts of keywords|keword2|.../
, use |
Just connect the keywords.
Here are two pitfalls encountered in use:
The length of the regular pattern is too long, causing matching failure: PHP's The regex has a backtracking limit to prevent consuming all the available stack of the process, eventually causing PHP to crash. Patterns that are too long will cause PHP to detect too many tracebacks and interrupt matching. After testing, the maximum pattern length is about 32,000 bytes in the default setting. The pcre.backtrack_limit
parameter in php.ini is the maximum number of backtracking times. The default value is Optimize the matching of huge amounts of keywords000000. Modify it or php.ini
or in Use ini_set('pcre.backtrack_limit', n);
at the beginning of the script. Setting it to a larger number can increase the maximum pattern length for a single match. Of course, you can also count keywords in batches (I used this =_=).
The pattern contains special characters, causing a large number of warnings: During the matching process, it was found that PHP reported a large number of warnings: unknown modifier <strong>Garbled characters</strong>
, careful inspection found that there are /
characters in the keywords, you can use the preg_quote()
function to filter the keywords.
Pseudo code:
$end = 0; $step = Optimize the matching of huge amounts of keywords500; $pattern = array(); // 先将pattern 拆成多个小块 while ($end < count($word_list)) { $tmp_arr = array_slice($word_list, $end, $step); $end += $step; $item = implode('|', $tmp_arr); $pattern[] = preg_quote($item); } $content = file_get_contents($log_file); $lines = explode("\n", $content); foreach ($lines as $line) { // 使用各小块pattern分别匹配 for ($i = 0; $i < count($pattern); $i++) { preg_match_all("/{$pattern[$i]}/", $line, $match); } $match = array_unique(array_filter($match)); dealResult($match); }
In order to complete the task, the process ran all night. When I found out the next day that I had run for nearly ten hours, I was heartbroken. . . It was too slow and completely failed to meet the usage requirements. At this time, I had already begun to consider changing the method.
When the product changed its keyword strategy, replaced some keywords, asked to run it again, and said that it would continue to optimize keywords, I completely rejected the existing plan. You must not use keywords to match information. Using all the keywords to match one by one is really unbearable.
Evolution, Evolution of Needs and Realization
I finally began to realize that to get information Go to keywords to compare. If I use keywords as keys to create a hash table, use the words in the information to search in the hash table, and if found, it will be considered a match. Wouldn't this achieve O(Optimize the matching of huge amounts of keywords) efficiency?
But a short message, how do I split it into just the right words to match? Word segmentation? Word segmentation also takes time, and my keywords are all meaningless words. Building a vocabulary and using word segmentation tools are big problems. Finally, I thought of splitting words
.
为什么叫拆词呢,我考虑以蛮力将一句话拆分为<strong>所有可能</strong>的
词。如(我是好人)
就可以拆成(我是、是好、好人、我是好、是好人、我是好人)
等词,我的关键词长度为 2-8,所以可拆词个数会随着句子长度迅速增加。不过,可以用标点符号、空格、语气词(如的、是
等)作为分隔将句子拆成小短语再进行拆词,会大大减少拆出的词量。
其实分词并没有完整实现就被后一个方法替代了,只是一个极具实现可能的构想,写这篇文章时用伪代码实现了一下,供大家参考,即使不用在匹配关键词,用在其他地方也是有可能的。
$str_list = getStrList($msg); foreach ($str_list as $str) { $keywords = getKeywords($str); foreach ($keywords as $keyword) { // 直接通过PHP数组的哈希实现来进行快速查找 if (isset($word_list[$keyword])) { record($keyword); } } } /** * 从消息中拆出短句子 */ function getStrList($msg) { $str_list = array(); $seperators = array(',', '。', '的', ...); $words = preg_split('/(?<!^)(?!$)/u', $msg); $str = array(); foreach ($words as $word) { if (in_array($word, $seperators)) { $str_list[] = $str; $str = array(); } else { $str[] = $word; } } return array_filter($str_list); } /** * 从短句中取出各个词 */ function getKeywords($str) { if (count($str) < 2) { return array(); } $keywords = array(); for ($i = 0; $i < count($str); $i++) { for ($j = 2; $j < 9; $j++) { $keywords[] = array_slice($str, $i, $j); // todo 限制一下不要超过数组最大长度 } } return $keywords; }
我们知道一个 utf-8
的中文字符要占用三个字节,为了拆分出包含中英文的每一个字符,使用简单的 split()
函数是做不到的。
这里使用了 preg_split('/(?<!^)(?!$)/u', $msg)
是通过正则匹配到两个字符之间的''
来将两个字符拆散,而两个括号里的 (?<!^)(?!$)
是分别用来限定捕获组不是第一个,也不是最后一个(不使用这两个捕获组限定符也是可以的,直接使用//
作为模式会导致拆分结果在前后各多出一个空字符串项)。 捕获组的概念和用法可见我之前的博客 PHP正则中的捕获组与非捕获组
由于没有真正实现,也不知道效率如何。估算每个短句长度约为 Optimize the matching of huge amounts of keywords0 字左右时,每条短消息约50字左右,会拆出 200 个词。虽然它会拆出很多无意义的词,但我相信效率绝不会低,由于其 hash 的高效率,甚至我觉得会可能比终极方法效率要高。
最终没有使用此方案是因为它对句子要求较高,拆词时的分隔符也不好确定,最重要的是它不够优雅。。。这个方法我不太想去实现,统计标识和语气词等活显得略为笨重,而且感觉拆出很多无意义的词感觉效率浪费得厉害。
觉醒,意识和思路的觉醒
于是我又来找谷哥帮忙了,搜索大量数据匹配,有人提出了 使用 trie 树的方式,没想到刚学习的 trie 树的就派上了用场。我上上篇文章刚介绍了 trie 树,在空间索引 – 四叉树 里字典树
这一小节,大家可以查看一下。
当然也为懒人复制了一遍我当时的解释(看过的可以跳过这一小节了)。
字典树,又称前缀树或 trie 树,是一种有序树,用于保存关联数组,其中的键通常是字符串。与二叉查找树不同,键不是直接保存在节点中,而是由节点在树中的位置决定。一个节点的所有子孙都有相同的前缀,也就是这个节点对应的字符串,而根节点对应空字符串。
我们可以类比字典的特性:我们在字典里通过拼音查找晃(huang
)这个字的时候,我们会发现它的附近都是读音为huang
的,可能是声调有区别,再往前翻,我们会看到读音前缀为huan
的字,再往前,是读音前缀为hua
的字… 取它们的读音前缀分别为 h qu hua huan huang
。我们在查找时,根据 abc...xyz
的顺序找到h
前缀的部分,再根据 ha he hu
找到 hu
前缀的部分…最后找到 huang
,我们会发现,越往后其读音前缀越长,查找也越精确,这种类似于字典的树结构就是字典树,也是前缀树。
那么 trie 树怎么实现关键字的匹配呢? 这里以一幅图来讲解 trie 树匹配的过程。
其中要点:
将关键词用上面介绍的preg_split()
函数拆分为单个字符。如科学家
就拆分为科、学、家
三个字符。
在最后一个字符后添加一个特殊字符`
,此字符作为一个关键词的结尾(图中的粉红三角),以此字符来标识查到了一个关键词(不然,我们不知道匹配到科、学
两个字符时算不算匹配成功)。
检查根部是否有第一个字符(科)节点,如果有了此节点,到步骤4
。 如果还没有,在根部添加值为科
的节点。
依次检查并添加学、家
两个节点。
在结尾添加 `
节点,并继续下一个关键词的插入。
然后我们以 <strong>这位科学家很了不起</strong>!
为例来发起匹配。
首先我们将句子拆分为单个字符 这、位
、...
;
从根查询第一个字符这
,并没有以这个字符开头的关键词,将字符“指针”向后移,直到找到根下有的字符节点科
;
接着在节点科
下寻找值为 学
节点,找到时,结果子树的深度已经到了2,关键词的最短长度是2,此时需要在学
结点下查找是否有`
,找到意味着匹配成功,返回关键词,并将字符“指针”后移,如果找不到则继续在此结点下寻找下一个字符。
如此遍历,直到最后,返回所有匹配结果。
完整代码我已经放到了GitHub上:Trie-GitHub-zhenbianshu,这里放上核心。
首先是数据结构树结点的设计,当然它也是重中之重:
$node = array( 'depth' => $depth, // 深度,用以判断已命中的字数 'next' => array( $val => $node, // 这里借用php数组的哈希底层实现,加速子结点的查找 ... ), );
然后是树构建时子结点的插入:
// 这里要往节点内插入子节点,所以将它以引用方式传入 private function insert(&$node, $words) { if (empty($words)) { return; } $word = array_shift($words); // 如果子结点已存在,向子结点内继续插入 if (isset($node['next'][$word])) { $this->insert($node['next'][$word], $words); } else { // 子结点不存在时,构造子结点插入结果 $tmp_node = array( 'depth' => $node['depth'] + Optimize the matching of huge amounts of keywords, 'next' => array(), ); $node['next'][$word] = $tmp_node; $this->insert($node['next'][$word], $words); } }
最后是查询时的操作:
// 这里也可以使用一个全局变量来存储已匹配到的字符,以替换$matched private function query($node, $words, &$matched) { $word = array_shift($words); if (isset($node['next'][$word])) { // 如果存在对应子结点,将它放到结果集里 array_push($matched, $word); // 深度到达最短关键词时,即可判断是否到词尾了 if ($node['next'] > Optimize the matching of huge amounts of keywords && isset($node['next'][$word]['next']['`'])) { return true; } return $this->query($node['next'][$word], $words, $matched); } else { $matched = array(); return false; } }
结果当然是喜人的,如此匹配,处理一千条数据只需要3秒左右。找了 Java 的同事试了下,Java 处理一千条数据只需要Optimize the matching of huge amounts of keywords秒。
这里来分析一下为什么这种方法这么快:
正则匹配:要用所有的关键词去信息里匹配匹配次数是 key_len * msg_len
,当然正则会进行优化,但基础这样,再优化效率可想而知。
而 trie 树效率最差的时候是 msg_len * 9(最长关键词长度 + Optimize the matching of huge amounts of keywords个特殊字符)
次 hash 查找,即最长关键词类似 AAA
,信息内容为 AAA...
时,而这种情况的概率可想而知。
至此方法的优化到此结束,从每秒钟匹配 Optimize the matching of huge amounts of keywords0 个,到 300 个,30 倍的性能提升还是巨大的。
终级,却不一定是终极
匹配方法的优化结束了,开头说的优化到十分钟以内的目标还没有实现,这时候就要考虑一些其他方法了。
我们一提到高效,必然想到的是 并发
,那么接下来的优化就要从并发说起。PHP 是单线程的(虽然也有不好用的多线程扩展),这没啥好的解决办法,并发方向只好从多进程进行了。
那么一个日志文件,用多个进程怎么读呢?这里当然也提供几个方案:
Add a log line counter in the process. Each process supports passing in parameter n. The process only processes the log with line number % n = n
. I have become very proficient in using the reverse distribution of this hack, haha. This method requires the process to pass parameters, and each process needs to allocate memory to read the entire log, and it is not elegant enough.
Use the split -l n file.log output_pre
command of Linux to split the file into files of n lines each, and then use Multiple processes to read multiple files. The disadvantage of this method is that it is inflexible. If you want to change the number of processes, you need to re-split the file.
Use Redis's list queue to temporarily store logs and enable multiple process consumption queues. This method requires additional writing of data into Redis, which is an extra step, but it is flexible in expansion and the code is simple and elegant.
In the end, the third method was used.
Although this method will have bottlenecks, it should eventually fall on Redis's network IO. I didn't bother to open n processes to challenge the performance of the company's Redis. I completed the statistics after running Optimize the matching of huge amounts of keywords0 processes in three or four minutes. Even if you add in the time it takes to write to Redis, it should be within Optimize the matching of huge amounts of keywords0 minutes.
At the beginning, the product had already positioned the matching speed at an hourly level. When I took out the new log matching results in Optimize the matching of huge amounts of keywords0 minutes, I felt a little happy when I saw the surprised expression of the product, haha~
Other paths can also help you go further
There are many ways to solve problems. I think that in solving various Before asking questions, you need to understand many kinds of knowledge, even if you only know its function. Just like a tool rack, you have to put as many tools as possible before you can choose the most suitable one when you encounter a problem. Then of course you need to become proficient in using these tools, so that you can use them to solve some weird problems.
If you want to do your job well, you must first sharpen your tools. If you want to solve performance problems, mastering system-level methods is not enough. Sometimes, changing a data structure or algorithm may have better results. I feel that I am still a little weak in this aspect, so I will gradually strengthen it, and everyone will encourage me.
The above is the detailed content of Optimize the matching of huge amounts of keywords. For more information, please follow other related articles on the PHP Chinese website!