Introduction to HTML extractor (woody)
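woody is an HTML parser/extractor for Java. Its usage closely resembles webmagic: it is a complete rewrite of webmagic's extraction templates, split out into a standalone project so that it can be reused more easily.
Some new features:
Multiple result data types (String, char, byte, short, int, long, double, float, String[], Set, List, Date)
Support for user-defined script processing functions (currently configured as JavaScript functions)
Support for swapping out the CSS and XPath engines
Support for filters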
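To make the first feature concrete, here is a minimal sketch of how extracted text could be converted to a declared field type. It is not woody's actual implementation: the class `TypeConversionSketch` and the helper `convert` are hypothetical names chosen for illustration, and only a few scalar types are covered.

```java
// Hypothetical illustration only; woody's real conversion code is not shown in this article.
class TypeConversionSketch {

    /** Converts raw extracted text to the requested target type (assumed behavior). */
    static Object convert(String raw, Class<?> target) {
        String text = raw.trim();
        if (target == String.class) {
            return text;
        }
        if (target == int.class || target == Integer.class) {
            return Integer.parseInt(text);
        }
        if (target == long.class || target == Long.class) {
            return Long.parseLong(text);
        }
        if (target == double.class || target == Double.class) {
            return Double.parseDouble(text);
        }
        if (target == float.class || target == Float.class) {
            return Float.parseFloat(text);
        }
        throw new IllegalArgumentException("Unsupported target type: " + target.getName());
    }

    public static void main(String[] args) {
        System.out.println(convert(" 42 ", int.class));   // prints 42
        System.out.println(convert("3.5", double.class)); // prints 3.5
    }
}
```

An extractor built this way would read the target type from the annotated field via reflection and pass it to the conversion helper.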
A complete example:
import java.util.Date;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
// woody imports (AnnotationExtractor, Model, @Inject, @ExtractBy, @ComboExtract, ...) are
// omitted because the article does not give their package names.

public class OsChinaBlog {

    public static void main(String[] args) throws Exception {
        // Fetch the page with jsoup, then hand the raw HTML to woody.
        Document doc = Jsoup.connect("http://www.oschina.net/news/43879/webmagic-0-3-0").timeout(60000)
                .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:23.0) Gecko/20100101 Firefox/23.0").get();
        String html = doc.html();
        OsChinaBlogModel model = AnnotationExtractor.me().process(html, OsChinaBlogModel.class);
        System.out.println(model.toJson());
    }

    public static class OsChinaBlogModel extends Model {

        public OsChinaBlogModel() {
            // no-arg constructor, needed for reflective instantiation
        }

        @Inject
        @ComboExtract(value = { @ExtractBy(value = "h1.OSCTitle", type = ExprType.CSS),
                @ExtractBy(value = "//title/text()", type = ExprType.XPATH) }, op = OP.OR)
        public String title;

        @Inject
        @ExtractBy(value = "p.PubDate a[href~=http://my\\.oschina\\.net/]", type = ExprType.CSS)
        public String author;

        @Inject
        @ExtractBy(value = "发布于.\\s*(\\d+年\\d+月\\d+日)", type = ExprType.REGEX)
        public Date publishDate;

        @Inject
        @ComboExtract(value = {
                @ExtractBy(value = "p.PubDate", type = ExprType.CSS, setting = @Setting(outerHtml = true)),
                @ExtractBy(value = "(\\d+)评", type = ExprType.REGEX) }, op = OP.AND)
        public int commentNum;

        @Inject
        @ExtractBy(value = "span#p_favor_count", type = ExprType.CSS, setting = @Setting(function = @Function(value = "replace", args = {
                "+", "" })))
        public int collectNum;

        @Inject
        @ComboExtract(value = {
                @ExtractBy(value = "p[id=userComments]", type = ExprType.CSS, setting = @Setting(outerHtml = true)),
                @ExtractBy(value = "p.TextContent", type = ExprType.CSS) }, op = OP.AND, multi = true)
        public List commentContents;

        @Inject
        @ExtractBy(value = "p[id=toolbar_wrapper]", setting = @Setting(fliters = { "b", "span" }), type = ExprType.CSS, impl = Document.class)
        public String weibo;
    }
}
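The two `@ComboExtract` usages in the example suggest how the operators combine extractors: with `OP.OR` the title falls back from the CSS expression to the XPath one, and with `OP.AND` the comment count is obtained by running a regex over the HTML selected by the first expression. The sketch below mimics that behavior with plain `java.util.regex`. It is an assumption inferred from the example, not woody's confirmed semantics, and `ComboSketch`, `regex`, `or` and `and` are hypothetical names.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of OR/AND combination; not woody's actual code.
class ComboSketch {

    /**
     * Builds an extractor returning capture group 1 of the first match, or null.
     * The pattern must contain at least one capture group.
     */
    static Function<String, String> regex(String pattern) {
        Pattern compiled = Pattern.compile(pattern);
        return input -> {
            Matcher m = compiled.matcher(input);
            return m.find() ? m.group(1) : null;
        };
    }

    /** OR: every extractor sees the same input; the first non-null result wins. */
    @SafeVarargs
    static String or(String input, Function<String, String>... extractors) {
        for (Function<String, String> extractor : extractors) {
            String result = extractor.apply(input);
            if (result != null) {
                return result;
            }
        }
        return null;
    }

    /** AND: a pipeline; each extractor consumes the previous extractor's output. */
    @SafeVarargs
    static String and(String input, Function<String, String>... extractors) {
        String current = input;
        for (Function<String, String> extractor : extractors) {
            if (current == null) {
                return null; // an earlier stage found nothing
            }
            current = extractor.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        String html = "<title>Hello</title><p class=\"PubDate\">Posted 2013-09-01, 12 comments</p>";

        // No <h1> on the page, so the second extractor supplies the title.
        System.out.println(or(html, regex("<h1>(.*?)</h1>"), regex("<title>(.*?)</title>"))); // prints Hello

        // First narrow to the paragraph, then pull the number out of it.
        System.out.println(and(html, regex("<p class=\"PubDate\">(.*?)</p>"), regex("(\\d+) comments"))); // prints 12
    }
}
```

Treating each extractor as a function from text to text keeps the two operators small: OR is a search over alternatives, AND is function composition with an early exit on null.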


