This is meant to be a fairly general and practical set of rules. Let's get started.
If you are a veteran user of the train collector, treat this as a reference, because what I am going to explain goes against traditional thinking. If you are a novice, read it carefully: it will speed up your learning and save you a lot of time later. The following are the basic collection steps, which you can adapt as needed:
I. Create a site
1. Open the train collector and create a new site, as shown in the picture below:
For easier management, you can give your site any name you find easy to remember, but I recommend using the name of the target source as the site name, as shown below:
Most sites use only one set of templates, or several similar sets. "Similar" here means that the marks in the templates are very close to one another. So what are template marks? They are the start and end marks that delimit a particular part of the content. For example, many well-run websites (usually large ones with a lot of content, such as sina and 163) place a fixed marker in the source code to indicate where the content begins. They do this for two reasons. First, with so much content, the marks help departments cooperate and make project handover easier. Second, they are needed for content control. As XHTML becomes more popular, more and more pages are laid out with layers, which makes it easier and easier for us to find marks to collect by (you will come to understand this later). I explain all this because what comes next is the content rules for the whole site.
2. Explanation of title tags. The corresponding page is here: http://ent.163.com/06/1029/11/2UJNHOS3000322EL.html
First switch from "Site Basic Information" to "Whole Site Content Rules", then copy the URL of the page to be collected into "Typical Page" and click "Test" to read its source code. Let's start with the title tag. The title collected by the default tag carries an extra "_NetEase Entertainment" suffix. Double-click the title tag, or select it and click Modify, and add "_NetEase Entertainment" to the excluded content box. The title tag is now complete (see the picture).
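The effect of the excluded content box on the title can be sketched in a few lines of Python. This is only an illustration of the idea, not the train collector's own code; the function name `clean_title` and the sample headline are assumptions made up for the example.

```python
from typing import List


def clean_title(raw_title: str, excluded: List[str]) -> str:
    """Remove every excluded fragment from a collected title (hypothetical helper)."""
    for fragment in excluded:
        raw_title = raw_title.replace(fragment, "")
    return raw_title.strip()


# The sample headline is invented; only the suffix comes from the tutorial.
title = clean_title("Sample headline_NetEase Entertainment", ["_NetEase Entertainment"])
print(title)  # Sample headline
```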
3. Explanation of content tags. The most important part of making any tag for a collection rule (task) is finding the start and end marks. Most collectors currently require each mark to be unique in the source code, that is, the start or end mark may appear only once in the whole HTML source. The train collector does not require this. It only looks for the first match from top to bottom, so the HTML may contain any number of identical start (or end) marks, as long as the one at the content we want to collect is the first one from the top. Open any content page, taking http://ent.163.com/06/1029/11/2UJNHOS3000322EL.html as an example. Its content starts from "Enter Forum", so double-click the code test box to find the required code, as shown in the picture:
We could use this as the content start mark, but it is not perfect yet. Open several content pages yourself, right-click each page and choose "View source code", then compare the code and extract the part they all share. I use that shared part as the content start mark.
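The "first mark from top to bottom" behaviour can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the collector's actual implementation: the function name `extract_between` and the `<!--s-->` / `<!--e-->` marks are hypothetical, and the sketch assumes the end mark is searched for only after the start mark.

```python
from typing import Optional


def extract_between(html: str, start_mark: str, end_mark: str) -> Optional[str]:
    """Return the text between the FIRST start mark and the next end mark."""
    start = html.find(start_mark)      # first occurrence, top to bottom
    if start == -1:
        return None
    start += len(start_mark)
    end = html.find(end_mark, start)   # assumption: end mark is searched after the start
    if end == -1:
        return None
    return html[start:end]


# The marks appear twice; only the first pair is used.
page = "<div>menu</div><!--s-->first body<!--e--><!--s-->second body<!--e-->"
print(extract_between(page, "<!--s-->", "<!--e-->"))  # first body
```

Because only the first match counts, a mark does not have to be unique on the page; it only has to appear first at the place where the wanted content begins.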
Next, look at the content end mark, as shown in the following two pictures:
The following is the content collected according to the rules we set up:
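Generally, the content collected between the start mark and the end mark still contains material that must be excluded, such as advertisements or links. Here, what we need to exclude is "Related topics>>> The 6th Golden Eagle TV Art Festival" (相关专题>>> 第六届金鹰电视艺术节). To exclude it, find the corresponding code, copy it in full into the content exclusion window, and replace the variable parts with "(*)". Because this is a whole-site rule, you must check several categories. This 163 entertainment channel, for example, also includes "Stars | Pictures | Movies | TV | Music | Forum | Special Topics | Celebrity Interviews"; here I only use "Stars, Pictures, Movies" as examples. The other categories are checked only to make the rule general. If you want just one category, say "Pictures", you can make the rule for that category directly.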
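The "(*)" exclusion can be approximated in Python by turning the pattern into a regular expression. This is a rough sketch assuming "(*)" stands for any run of characters; the helper names `wildcard_to_regex` and `exclude`, and the sample HTML, are made up for illustration and are not the collector's real internals.

```python
import re
from typing import List


def wildcard_to_regex(pattern: str) -> "re.Pattern":
    """Turn a pattern containing "(*)" wildcards into a compiled regex."""
    # Escape the literal pieces, then let each "(*)" match as little as possible.
    parts = [re.escape(piece) for piece in pattern.split("(*)")]
    return re.compile(".*?".join(parts), re.DOTALL)


def exclude(content: str, patterns: List[str]) -> str:
    """Remove every fragment that matches one of the exclusion patterns."""
    for pattern in patterns:
        content = wildcard_to_regex(pattern).sub("", content)
    return content


# Hypothetical page fragment: the link target varies from page to page.
body = '<p>Story text.</p><a href="/special/123.html">Related topic</a><p>More text.</p>'
print(exclude(body, ['<a href="(*)">Related topic</a>']))
# <p>Story text.</p><p>More text.</p>
```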
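The page http://ent.163.com/06/1018/15/2TNNT7EU00031H2L.html happens to be paginated, so I will also cover the previous/next page settings. On this site, "Previous page" and "Next page" are links made from images, so you only need to copy the image file name (right-click the image, view its properties, and copy the image name) into the corresponding code box. See the picture for details.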
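Locating a page link by its image name can be sketched like this in Python. It is an assumption-laden illustration: the function name `find_page_link`, the image names `prev.gif` / `next.gif`, and the sample HTML are all hypothetical, and the real site's markup may differ.

```python
import re
from typing import Optional


def find_page_link(html: str, image_name: str) -> Optional[str]:
    """Return the href of the first link that wraps an image with the given file name."""
    pattern = re.compile(
        r'<a\b[^>]*\bhref=["\']([^"\']+)["\'][^>]*>\s*'
        r'<img\b[^>]*\bsrc=["\'][^"\']*' + re.escape(image_name) + r'["\']',
        re.IGNORECASE,
    )
    match = pattern.search(html)
    return match.group(1) if match else None


# Hypothetical pager markup with image-based links.
pager = (
    '<a href="page_1.html"><img src="/img/prev.gif"></a> '
    '<a href="page_3.html"><img src="/img/next.gif"></a>'
)
print(find_page_link(pager, "next.gif"))  # page_3.html
```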
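A reminder: to exclude any content, find the corresponding code, copy it in full into the code exclusion window, and replace its variable parts with "(*)". Since this site has no advertisements, the whole-site rules are now complete; click Save and move on to creating a single task. That is all for the whole-site rules: only these two tags are covered, and you can add others as needed by following the steps above. Remember, however things change, the principle stays the same. For other questions, please go to the train collector forum to discuss: http://bbs.locoy.com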
II. Creating single-task rules:
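1. Creating content rules. Many people may still not see what makes the train collector good. The feature described here is definitely unique to it (at least so far; whether someone else will ship the same feature later is anyone's guess!).
The train collector lets you go straight to content collection without first making URL rules. This way you can decide from the difficulty of the site whether to collect the chosen target source at all, rather than finding out only after URL collection that you cannot collect this site, or that it is not worth your time (with all the earlier effort wasted!).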
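One of the biggest features of train collector v3.0 is that tasks can inherit the site's rules. As long as the rules you made earlier are general, none of the following tasks needs its own content collection rules. Since the content collection rules we made earlier are general, I will not explain them again here; simply inherit the site's rules (see the picture).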
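The idea of a task inheriting its site's rules can be modelled roughly as below. This is a conceptual sketch only; the `Site` and `Task` classes, their fields, and the placeholder rule values are hypothetical and do not reflect how the train collector stores its rules.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Site:
    name: str
    content_rules: Dict[str, str] = field(default_factory=dict)


@dataclass
class Task:
    name: str
    site: Site
    own_rules: Optional[Dict[str, str]] = None  # None means "inherit from the site"

    def effective_rules(self) -> Dict[str, str]:
        """Use the task's own rules if it has any, otherwise the site's rules."""
        return self.own_rules if self.own_rules is not None else self.site.content_rules


# Placeholder rule values, invented for the example.
site = Site("163 entertainment", {"title_exclude": "_NetEase Entertainment"})
task = Task("entertainment news", site)
print(task.effective_rules())  # {'title_exclude': '_NetEase Entertainment'}
```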
2. Creating URL collection rules
Steps: "New" → "New Task". The remaining operations are shown in the picture below:
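Making rules requires an eye for regularities; once you have that, collection is no problem. The example we will collect is at http://ent.163.com/special/00031HI0/entnews.html
Here we only collect pages 1-3 as a sample. On every page, the list of URLs begins after "过往娱乐热点" (past entertainment hot topics) and ends at "第1 2……页" (the page-number bar), so copy the corresponding code from the HTML source into the specific-area collection range. In addition, the URLs must contain "/06/". That completes URL collection (simple, isn't it? Try it yourself). See the picture below.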
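The two constraints used here, a collection area bounded by a start and end mark plus a required substring in the URL, can be sketched in Python as follows. This is only an approximation of the idea: the function name `collect_urls`, the `LIST-START` / `LIST-END` marks, and the sample links are hypothetical stand-ins for the real page's code.

```python
import re
from typing import List
from urllib.parse import urljoin


def collect_urls(html: str, base_url: str, region_start: str,
                 region_end: str, must_contain: str) -> List[str]:
    """Collect links found between two marks whose URL contains a required substring."""
    start = html.find(region_start)
    if start == -1:
        return []
    start += len(region_start)
    end = html.find(region_end, start)
    if end == -1:
        return []
    region = html[start:end]

    urls: List[str] = []
    for href in re.findall(r'href=["\'](.*?)["\']', region):
        full = urljoin(base_url, href)      # resolve relative links against the list page
        if must_contain in full and full not in urls:
            urls.append(full)
    return urls


# Hypothetical list page: one link outside the area, one link without "/06/".
listing = (
    '<a href="/06/0101/00/HEADER.html">nav</a>'
    '<h2>LIST-START</h2>'
    '<a href="/06/1029/11/AAA.html">one</a>'
    '<a href="/special/topic.html">topic</a>'
    '<a href="/06/1018/15/BBB.html">two</a>'
    '<div>LIST-END</div>'
)
found = collect_urls(listing, "http://ent.163.com/special/00031HI0/entnews.html",
                     "LIST-START", "LIST-END", "/06/")
print(found)
```

In this sample the navigation link is skipped because it lies outside the marked area, and the topic link is skipped because it does not contain "/06/".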
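3. Publishing method. There are 5 publishing methods; here we take the most commonly used one, "online publishing", as the example.
Select web online publishing to a website and click "Define global publishing method", then follow the steps the system prompts: select the publishing module → fill in the root address of the website/CMS → log in with the train collector's built-in browser → close the built-in browser after logging in → refresh the list → test the module until the test succeeds → save the configuration → save the task → publish. The highlighted parts in the picture below are the steps to perform, from left to right and top to bottom:
Below are two screenshots from the test I just ran, collecting into my local forum: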