How to implement full-text search with PHP?
Several solutions come to mind right away, such as scanning files or using SQL LIKE statements, but these methods are quite inefficient.
Here we introduce a more efficient way to implement full-text retrieval in PHP: MySQL's FULLTEXT index. MySQL's FULLTEXT index does not handle Chinese well, however, so this article also shows how to implement Chinese full-text search with PHP + MySQL.
First, you need a PHP extension for Chinese word segmentation, SCWS. For installation and usage of this module, see www.ftphp.com/scws.
Next, some background on MySQL's FULLTEXT index type:
MySQL has supported full-text indexing and search since version 3.23.23. A full-text index in MySQL is an index of type FULLTEXT.
FULLTEXT indexes are used on MyISAM tables and can be created on CHAR, VARCHAR, or TEXT columns, either in CREATE TABLE or afterwards with ALTER TABLE or CREATE INDEX. For large data sets, it is much faster to load the data into a table that has no FULLTEXT index and then create the index with ALTER TABLE (or CREATE INDEX). Loading data into a table that already has a FULLTEXT index is very slow.
MySQL full-text searches are performed with the MATCH() function.
The following is a simple example:
1. Create a new data table:
CREATE TABLE fulltext_sample(copy TEXT, FULLTEXT(copy)) ENGINE=MyISAM;
Here the copy column carries the FULLTEXT index. If the index was not defined when the table was created, it can be added afterwards with ALTER TABLE, for example:
ALTER TABLE fulltext_sample ADD FULLTEXT(copy);
2. Insert data:
INSERT INTO fulltext_sample VALUES ('It appears good from here'), ('The here and the past'), ('Why are we hear'), ('An all-out alert'), ('All you need is love'), ('A good alert');
3. Data retrieval:
SELECT * FROM fulltext_sample WHERE MATCH(copy) AGAINST('love');
That is MySQL's built-in full-text search. Note: searches on a full-text index are not case-sensitive.
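From PHP, the same query can be issued through a prepared statement. The following is a minimal sketch, assuming a PDO connection to the database that holds fulltext_sample. The function names fulltext_query and fulltext_search, and the DSN and credentials in the usage comment, are made up for illustration and are not part of the original example.

```php
<?php
// Hypothetical helper: returns the parameterized SQL and its parameters
// for a full-text search on the sample table.
function fulltext_query(string $keywords): array
{
    $sql = 'SELECT copy FROM fulltext_sample WHERE MATCH(copy) AGAINST(?)';
    return [$sql, [$keywords]];
}

// Hypothetical helper: runs the query on an open PDO connection and
// returns the matching values of the copy column.
function fulltext_search(PDO $pdo, string $keywords): array
{
    [$sql, $params] = fulltext_query($keywords);
    $stmt = $pdo->prepare($sql);
    $stmt->execute($params);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

// Usage (DSN and credentials are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'user', 'secret');
// print_r(fulltext_search($pdo, 'love'));
```

Binding the keywords as a parameter keeps user input out of the SQL string, which matters once the search term comes from a form.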
Let’s look at how to implement Chinese full-text search.
A FULLTEXT index works on words, and words must be separated by spaces. In Chinese sentences, however, words are not separated by spaces, so the text has to be segmented into words first. That is why the Chinese word segmentation extension mentioned above is needed.
Even after segmentation, however, MySQL still cannot perform full-text retrieval of Chinese through MATCH, so the words must be converted first. A relatively simple and practical way (there are certainly better ones) is the following function, which converts each Chinese word to its urlencode form with the % signs removed.
function q_encode($str)
{
    $data = array_filter(explode(" ", $str));   // split on spaces, drop empty items
    $data = array_flip(array_flip($data));      // remove duplicates
    $data_code = '';                            // initialize before appending
    foreach ($data as $ss) {
        if (strlen($ss) > 1)
            $data_code .= str_replace("%", "", urlencode($ss)) . " ";
    }
    return trim($data_code);
}
Save the converted content in the column covered by the FULLTEXT index. When querying, convert the search keywords in the same way.
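To make the conversion concrete, here is a small worked example. It is only a sketch: encode_terms is a hypothetical stand-alone helper that mirrors the idea of q_encode, and the input string is assumed to be segmenter output with words already separated by spaces. The file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper mirroring the q_encode() idea: dedupe the
// space-separated terms, urlencode each one, and strip the "%" signs.
function encode_terms(string $segmented): string
{
    $terms = array_unique(array_filter(explode(' ', $segmented), 'strlen'));
    $encoded = [];
    foreach ($terms as $term) {
        if (strlen($term) > 1) {   // skip single-byte tokens
            $encoded[] = str_replace('%', '', urlencode($term));
        }
    }
    return implode(' ', $encoded);
}

// Assumed segmenter output for a short text; the duplicate word is dropped.
echo encode_terms('中文 分词 中文'), "\n";
// prints: E4B8ADE69687 E58886E8AF8D
```

Each Chinese word becomes one unbroken run of hexadecimal characters, so MySQL sees it as an ordinary space-separated word.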
How to implement UTF-8 full-text search with PHP + MySQL
This article explains how to run fast full-text searches over large amounts of data. MySQL provides a full-text index feature: set a FULLTEXT index on a column, then search it with the MATCH ... AGAINST clause of a SELECT statement.
TouchUs - The Global Yellow Pages & Business Directory (www.touchus.org), a pure English site we developed, uses this MySQL feature to achieve an average full-text retrieval time of under 0.5 seconds over more than 100,000 records. However, when developing the Chinese site of TouchUs, City Yellow Pages (www.city39.cn), we ran into a new problem. In English text, words are separated by spaces, which FULLTEXT fully supports, but Chinese and other East Asian scripts are not so simple. Because there is no visible separator between words in Chinese, MySQL cannot support full-text search over Chinese characters.
How can MySQL be made to support Chinese full-text search as well? An idea came up by accident: after segmenting the Chinese text into words, encode each word as English (ASCII) characters, establishing a fixed mapping between the Chinese words and the encoded strings, and then run the full-text search on the encoded form. Would that make a full-text index over Chinese text possible? After testing, the answer is yes. The following is the process as implemented on the City Yellow Pages site:
1. Create a separate index table. For example, for the members table we create a members_index table:
User information table (members)    User information full-text index table (members_index)
user_id                             user_id
user_name                           index_intro
user_introduction
Add a FULLTEXT index on the index_intro column of the members_index table.
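A possible definition of the index table is sketched below. The column types and the index name ft_index_intro are assumptions chosen for illustration; only the table name, the two column names, and the FULLTEXT index on index_intro come from the description above.

```php
<?php
// Assumed schema: column types and the index name are illustrative only.
const MEMBERS_INDEX_DDL = <<<'SQL'
CREATE TABLE members_index (
    user_id     INT UNSIGNED NOT NULL PRIMARY KEY,
    index_intro TEXT NOT NULL,
    FULLTEXT KEY ft_index_intro (index_intro)
) ENGINE=MyISAM
SQL;

// Usage, given an open PDO connection $pdo:
// $pdo->exec(MEMBERS_INDEX_DDL);
```

Keeping the encoded text in its own table leaves the original members rows untouched, so they can be displayed without any decoding.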
2. Perform Chinese word segmentation on the contents of the user_introduction column of the user information table (members).
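For the Chinese word segmentation process, see the Simple Chinese Word Segmentation system (SCWS) at http://www.ftphp.com/scws/. On the City Yellow Pages site, we use the SCWS PHP extension to perform segmentation. The extension is very easy to install: it only needs to be compiled and configured. In our PHP code, we wrote the following function, which segments the text and joins the resulting words with spaces.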
// Chinese word segmentation function
function str_fc($str)
{
    $so = scws_new();
    $so->set_charset('utf8');
    // set_dict and set_rule are not called here; the system automatically tries
    // the dictionary and rule files at the paths specified in the ini file
    $so->send_text($str);
    $mystr = '';
    while ($tmp = $so->get_result()) {
        foreach ($tmp as $ss) {
            $s = trim($ss['word']);
            if ($s !== '') $mystr .= $s . " ";
            //echo urlencode(trim($ss['word'])) . " ";
        }
    }
    return $mystr;
}
The function returns the segmented words joined by spaces.
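3. Encode the segmentation result. Many encodings would work, such as base64, urlencode, or converting Chinese characters to pinyin; for GB2312 text, even the row-cell code (区位码) could be used. Considering storage space and convenience, we chose PHP's urlencode. Note that before encoding we can remove duplicate words to save storage space, and after encoding we must strip the % signs from the result: urlencode follows RFC 1738 and produces many % characters, and % is a wildcard character in MySQL. The PHP code used for the encoding step is shown below.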
$data = str_fc($data);                        // Chinese word segmentation
$data = array_filter(explode(" ", $data));    // remove empty array items
$data = array_flip(array_flip($data));        // remove duplicates
// urlencode each segmented word
$data_code = '';
foreach ($data as $ss) {
    if (strlen($ss) > 1)
        $data_code .= str_replace("%", "", urlencode($ss)) . " ";
}
Here $data_code is the encoded result. Store it in the user information full-text index table (members_index), keyed by user_id.
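4. When handling a search, first apply the same segmentation and encoding to the keywords the user enters, then run a fast full-text search with a MySQL SELECT ... MATCH ... AGAINST statement. The user_id values in the result can be used to fetch the original data from the user information table (members) for display, so there is no need to decode and reassemble anything.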
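The four steps can be tied together roughly as follows. This is a sketch under stated assumptions, not the site's actual code: encode_for_index, build_index_row, and build_search are hypothetical names, the segmenter is passed in as a callable (the role played by the str_fc function, which needs the SCWS extension), and the use of REPLACE INTO and a JOIN is one possible choice.

```php
<?php
// Hypothetical helper, same idea as step 3: dedupe, urlencode, strip "%".
function encode_for_index(string $segmented): string
{
    $codes = [];
    foreach (array_unique(explode(' ', $segmented)) as $word) {
        if (strlen($word) > 1) {
            $codes[] = str_replace('%', '', urlencode($word));
        }
    }
    return implode(' ', $codes);
}

// Indexing: SQL and parameters that store the encoded introduction
// for one user in members_index.
function build_index_row(int $userId, string $intro, callable $segment): array
{
    $sql = 'REPLACE INTO members_index (user_id, index_intro) VALUES (?, ?)';
    return [$sql, [$userId, encode_for_index($segment($intro))]];
}

// Searching: the keywords go through the same segmentation and encoding,
// and the original rows are read from members via user_id.
function build_search(string $keywords, callable $segment): array
{
    $sql = 'SELECT m.* FROM members_index AS i'
         . ' JOIN members AS m ON m.user_id = i.user_id'
         . ' WHERE MATCH(i.index_intro) AGAINST(?)';
    return [$sql, [encode_for_index($segment($keywords))]];
}

// Usage, given an open PDO connection $pdo and the SCWS-based str_fc():
// [$sql, $params] = build_search($_GET['q'], 'str_fc');
// $stmt = $pdo->prepare($sql);
// $stmt->execute($params);
// $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
```

Passing the segmenter as a callable keeps the sketch runnable without SCWS installed; any function that returns space-separated words can stand in for it during testing.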
That is the method for UTF-8 Chinese full-text search with MySQL.