For Chinese search engines, Chinese word segmentation is one of the most basic parts of the entire system, because the current Chinese search algorithm based on single characters is not very good. Of course, this article is not to do research on Chinese search engines, but to share what if Use PHP to build an on-site search engine. This article is an article in this system
The PHP class for Chinese word segmentation is below. Use the proc_open() function to execute the word segmentation program and interact with it through pipes. Enter the text to be segmented and read the segmentation results.
class NLP{
private static $cmd_path;// Not '/' ends with
static function set_cmd_path($path){
self::$cmd_path = $path;
}private function cmd($str){
fwrite($pipes[0], $str); 🎜> fclose($pipes[0]);
$descriptorspec = array(
); self::$cmd_path . "/ictclas";
$process = proc_open($cmd, $descriptorspec, $pipes);
if (is_resource($process)) {
$str = iconv('utf-8', 'gbk', $str);fclose($pipes[1]);
$return_value = proc_close($process);
}/*
$cmd = $cmd = " n", $output);
$cmd = "printf '$input' | " . self::$cmd_path . "/ictclas";*/
$output = trim($output);$output = iconv('gbk', 'utf-8', $output);
$ps tutorial = preg_split('/s+/', $output);
return $output;
$output = self::cmd($input);
if($output){foreach($ps as $p){
$item = array(
list($seg, $tag) = explode('/', $p);'seg' => $seg,
return $tokens;
'tag' => $tag,}
}
NLP::set_cmd_path(dirname(__FILE__));
?>
It is very simple to use (make sure the ICTCLAS compiled executable file and dictionary are in the current directory):require_once('NLP.php');
var_dump(NLP::tokenize('Hello, world!'));
?>
Webmaster experience,
If you want to achieve word segmentation for search engines, you need a powerful vocabulary library and a more intelligent Chinese Pinyin, writing, habits and other functional libraries.