Simple code sharing for Chinese word segmentation in PHP


Released: 2016-07-21 15:26:05


This article is not a study of Chinese search engines; it shares how to build an on-site search engine with PHP. It is one article in that series.
The word segmentation tool I use is the open-source version of ICTCLAS from the Institute of Computing Technology, Chinese Academy of Sciences. There is also the open-source Bamboo, which I will investigate later.
ICTCLAS is a good place to start: its algorithm is widely known, it is described in published academic papers, it compiles easily, and it has few library dependencies. However, code is currently provided only for C/C++, Java and C#, with no PHP version. One option would be to study the C/C++ source and the papers and then write a PHP port. Instead, I use inter-process communication to call the compiled C/C++ executable from PHP.
After downloading and unpacking the source code, run make ictclas on a machine with a C++ compiler and development libraries. The Makefile contains an error: the command that runs the test is missing the './' prefix, so, unlike on Windows, the test cannot be executed. This does not affect the compiled result.
The PHP class for Chinese word segmentation is below. It uses proc_open() to run the segmentation program and talks to it through pipes: it writes the text to be segmented to the program's standard input and reads the segmentation result from its standard output.

The code is as follows:

 
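<?php
class NLP {
    // Directory containing the ictclas executable; must not end with '/'.
    private static $cmd_path;

    public static function set_cmd_path($path) {
        self::$cmd_path = $path;
    }

    /**
     * Run ictclas on a UTF-8 string and return its raw output as UTF-8.
     */
    private static function cmd($str) {
        $descriptorspec = array(
            0 => array("pipe", "r"), // child's stdin: we write to it
            1 => array("pipe", "w"), // child's stdout: we read from it
        );
        $cmd = self::$cmd_path . "/ictclas";
        $output = '';
        $process = proc_open($cmd, $descriptorspec, $pipes);
        if (is_resource($process)) {
            // ictclas is fed GBK text, so convert from UTF-8 first.
            $str = iconv('utf-8', 'gbk', $str);
            fwrite($pipes[0], $str);
            // Close stdin before reading so the child sees end of input.
            fclose($pipes[0]);
            $output = stream_get_contents($pipes[1]);
            fclose($pipes[1]);
            proc_close($process);
        }
        /* Alternative using the shell instead of pipes:
        $cmd = "printf '%s' " . escapeshellarg($str) . " | " . self::$cmd_path . "/ictclas";
        exec($cmd, $lines, $ret);
        $output = join("\n", $lines);
        */
        $output = trim($output);
        $output = iconv('gbk', 'utf-8', $output);
        return $output;
    }

    /**
     * Perform word segmentation and return a word list.
     */
    public static function tokenize($str) {
        $tokens = array();
        $output = self::cmd($str);
        if ($output) {
            // Output is whitespace-separated items of the form word/tag.
            $ps = preg_split('/\s+/', $output);
            foreach ($ps as $p) {
                // Pad so an item without '/' still yields an (empty) tag.
                list($seg, $tag) = array_pad(explode('/', $p, 2), 2, '');
                $item = array(
                    'seg' => $seg,
                    'tag' => $tag,
                );
                $tokens[] = $item;
            }
        }
        return $tokens;
    }
}
NLP::set_cmd_path(dirname(__FILE__));
?>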
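
The pipe interaction in cmd() can be tried in isolation, without ICTCLAS installed. The following is a minimal sketch, not part of the original class: run_filter() is a hypothetical helper name, and the child process is a stand-in PHP one-liner that upper-cases its input, started through PHP_BINARY (assumed to point at the command-line PHP executable). The point it illustrates is the order of operations: write the input, close the child's stdin so it sees end of input, then read its stdout.

```php
<?php
// Hypothetical helper: send $input to a child process and return its stdout.
function run_filter($cmd, $input) {
    $descriptorspec = array(
        0 => array("pipe", "r"), // child reads its stdin from this pipe
        1 => array("pipe", "w"), // child writes its stdout to this pipe
    );
    $process = proc_open($cmd, $descriptorspec, $pipes);
    if (!is_resource($process)) {
        return false;
    }
    fwrite($pipes[0], $input);
    fclose($pipes[0]);                        // signal end of input to the child
    $output = stream_get_contents($pipes[1]); // read until the child closes stdout
    fclose($pipes[1]);
    proc_close($process);
    return $output;
}

// Stand-in for ictclas (assumption): a PHP one-liner that upper-cases stdin.
$cmd = escapeshellarg(PHP_BINARY) . ' -r '
     . escapeshellarg("echo strtoupper(file_get_contents('php://stdin'));");
echo run_filter($cmd, "hello pipes"), "\n"; // prints "HELLO PIPES"
```

Writing everything before reading is fine for short inputs like a search query. For very large inputs, a child that emits output while still reading could fill the pipe buffer and block, so this pattern is best kept to small texts.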
 

Usage is simple (make sure the compiled ICTCLAS executable and its dictionary are in the current directory):

The code is as follows:

 
<?php
require_once('NLP.php');
var_dump(NLP::tokenize('Hello, World!')); 
?>
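
To see the shape of the array that tokenize() returns without running ICTCLAS, the parsing step can be applied to a sample string. This is a sketch under assumptions: parse_tagged() is a hypothetical helper, and the sample line and its tags are made up for illustration, following the whitespace-separated word/tag form that the class's parsing code implies. Unlike the class, it splits on the last '/' so that an item whose text is itself a slash still parses.

```php
<?php
// Hypothetical helper, similar to the parsing step of NLP::tokenize().
function parse_tagged($output) {
    $tokens = array();
    $parts = preg_split('/\s+/', trim($output), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($parts as $p) {
        // Split on the last '/' so an item such as "//w" keeps '/' as its text.
        $pos = strrpos($p, '/');
        if ($pos === false) {
            $tokens[] = array('seg' => $p, 'tag' => '');
        } else {
            $tokens[] = array(
                'seg' => substr($p, 0, $pos),
                'tag' => substr($p, $pos + 1),
            );
        }
    }
    return $tokens;
}

// Made-up sample in word/tag form; the tags are placeholders.
$sample = "中文/n 分词/v 工具/n";
print_r(parse_tagged($sample));
```

Each element is an array with a 'seg' key (the word) and a 'tag' key (its label), which is the structure an on-site search index would consume.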