Home > Article > Database > Method to implement content recommendation based on Tags (code)

Method to implement content recommendation based on Tags (code)

不言Original: 2018-09-14 14:11:422390browse

The content this article brings to you is about the method (code) of implementing content recommendation based on Tags. It has certain reference value. Friends in need can refer to it. I hope it will be helpful to you.

It turns out that for the sake of simplicity and convenience, the relevant content recommendations for the article pages on my small website are to randomly extract data from the database to fill a list, so there is no correlation at all, and there is no way to guide users to access the recommendations. content.

Algorithm Selection

How can we recommend similar content? Since the small website is still running on a virtual host (yes, it doesn’t even have a fully controllable server), So there are not many ways to think of, and the condition is that you can only use PHP MySql. So the way I thought of was to use Tags to match similar articles for recommendation. If the TAGS of two articles are similar

For example: the TAGS of article A is: [A,B,C,D,E]
The TAGS of article B is: [A,D,E,F,G]
The TAGS of article C is: [C,H,I,J,K]

We can easily find through our eyes that article B and article A are more similar because they have three same keywords: : [A, D, E], how to use a computer to determine their similarity? Here we use the most basic application of jaccard similarity to calculate their similarity

Jaccard Similarity

Given two sets A and B, the Jaccard coefficient is defined as the ratio of the size of the intersection of A and B to the size of the union of A and B. It is defined as follows:

Method to implement content recommendation based on Tags (code)

The intersection of article A and article B is [A,D,E], the size is 3, and the union is [A,B,C,D, E, F, G], the size is 7, 3/7=0.4285...
The intersection of article A and article C is [C], the size is 1, and the union is [A, B, C, D, E,H,I,J,K], the size is 9, 1/9=0.11111...

In this way, it can be concluded that articles A and B are more similar than articles A and C. With this Algorithms, computers can determine the similarity of two articles.

Specific recommendation ideas

Given an article, obtain the keyword TAGS of the article, and then use the above algorithm to compare the similarity of all articles in the database to obtain the most similar N articles Articles are recommended.

Implementation process

Acquisition of the first TAGS

The TAGS of the article is extracted from the high-frequency words in the article through the TF-IDF algorithm, and N words are selected as TAGS. Chinese articles also involve a problem of Chinese word segmentation. Because it is a virtual host, I used Python (why use Python, jieba word segmentation, so delicious) to write a program locally to complete the word segmentation of all articles. , word frequency statistics, generate TAGS, and write back to the server's database. Since this article is about writing recommended algorithms, the parts of word segmentation and establishing TAGS will not be discussed in detail, and different systems have different methods of establishing TAGS.

Storage of the second TAGS

Create two tables to store TAGS
tags, used to store the names of all tags

+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| tag   | text       | YES  |     | NULL    |       |
| count | bigint(20) | YES  |     | NULL    |       |
| tagid | int(11)    | NO   | PRI | 0       |       |
+-------+------------+------+-----+---------+-------+

tag_map Create tags and articles reflection relationship.

+-----------+------------+------+-----+---------+-------+
| Field     | Type       | Null | Key | Default | Extra |
+-----------+------------+------+-----+---------+-------+
| id        | bigint(20) | NO   | PRI | 0       |       |
| articleid | bigint(20) | YES  |     | NULL    |       |
| tagid     | int(11)    | YES  |     | NULL    |       |
+-----------+------------+------+-----+---------+-------+

The data stored in tag_map is similar to the following:

+----+-----------+-------+
| id | articleid | tagid |
+----+-----------+-------+
|  1 |       776 |   589 |
|  2 |       776 |   471 |
|  3 |       776 |  1455 |
|  4 |       776 |  1287 |
|  5 |       776 |    52 |
|  6 |       777 |  1386 |
|  7 |       777 |   588 |
|  8 |       777 |   109 |
|  9 |       777 |   603 |
| 10 |       777 |  1299 |
+----+-----------+-------+

In fact, when making similar recommendations, you only need to use the tag_map table, because tagid and tag name are in one-to-one correspondence.

Specific coding

1. Get the TAGID corresponding to all articles

mysql> select articleid, GROUP_CONCAT(tagid) as tags from tag_map GROUP BY articleid;
+-----------+--------------------------+
| articleid | tags                     |
+-----------+--------------------------+
|        12 | 1178,1067,49,693,1227    |
|        13 | 196,2004,2071,927,131    |
|        14 | 1945,713,1711,2024,49    |
|        15 | 35,119,9,1,1180          |
|        16 | 1182,1924,2200,181,1938  |
|        17 | 46,492,414,424,620       |
|        18 | 415,499,153,567,674      |
|        19 | 1602,805,691,1613,194    |
|        20 | 2070,1994,886,575,1149   |
|        21 | 1953,1961,1534,2038,1393 |
+-----------+--------------------------+

Through the above SQL, you can query all articles at one time, and all corresponding tags
In PHP , we can turn tags into an array.

public function getAllGroupByArticleId(){
        //缓存查询数据，因为这个是全表数据，而且不更新文章不会变化，便是每次推荐都要从数据库里获取一次数据，对性能肯定会有影响，所以做个缓存。
        if($cache = CacheHelper::getCache()){
            return $cache;
        }
        $query_result = $this->query('select articleid, GROUP_CONCAT(tagid) as tags from tag_map GROUP BY articleid');

        $result = [];
        foreach($query_result as $key => $value){
            //用articleid 做key ,值是该id下的所有tagID数组。
            $result[$value['articleid']] = explode(",",$value['tags']);
        }

        CacheHelper::setCache($result, 86400);

        return $result;

    }

With this return result, it is easier to handle. The next step is to apply the jaccard similarity algorithm. Let’s look at the code for details.

/**
     * [更据指定文章返回相似的文章推荐]
     * @param  $articleid 指定的文章ID
     * @param  $top       要返回的推荐条数
     * @return Array      推荐条目数组
     */
function getArticleRecommend($articleid, $top = 5){
        if($cache = CacheHelper::getCache()){
            return $cache;
        }
        try{
            $articleid = intval($articleid);
            $m = new TagMapModel();
            $all_tags = $m->getAllGroupByArticleId();//调用上面的函数返回所有文章的tags
            $finded = $all_tags[$articleid];//因为上面是包含所有文章了，所以肯定包含了当前文章。

            unset($all_tags[$articleid]);//把当前文章从数组中删除，不然自己和自己肯定是相似度最高了。

            $jaccard_arr = []; //用于存相似度
            foreach ($all_tags as $key => $value) {
                $intersect =array_intersect($finded, $value); //计算交集
                $union = array_unique(array_merge($finded, $value)); //计算并集

                $jaccard_arr[$key] = (float)(count($intersect) / count($union));
            }

            arsort($jaccard_arr); //按相似度排序，最相似的排最前面

            $jaccard_keys = array_keys($jaccard_arr);//由于数组的key就是文章id,所以这里把key取出来就可以了
            array_splice($jaccard_keys, $top);//获取前N条推荐

            //到这里我们就已经得到了，最相似N篇文章的ID了，接下去的工作就是通过这几个ID,从数据库里把相关信息，查询出来就可以了
    
            $articleModels = new \Api\Model\ArticleModel();
            $recommendArticles = $articleModels->getRecommendByTag($jaccard_keys);
            CacheHelper::setCache($recommendArticles, 604800); //缓存7天
            return $recommendArticles;
        } catch (\Exception $e) {
            throw new \Exception("获取推荐文章错误");
        }
    }

Related recommendations:

How to simply implement the "Related Article Recommendation" function in PHP

The above is the detailed content of Method to implement content recommendation based on Tags (code). For more information, please follow other related articles on the PHP Chinese website!

Python php sql mysql 算法数据库 tf-idf

Statement：

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Previous article：How to make JDK import certificateNext article：How to make JDK import certificate

See more

Method to implement content recommendation based on Tags (code)

Algorithm Selection

Specific recommendation ideas

Implementation process

Acquisition of the first TAGS

Storage of the second TAGS

Specific coding

1. Get the TAGID corresponding to all articles

Related articles