Java JSoup|极客教程

JSoup 教程是 JSoup HTML 解析器的入门指南。在本教程中，我们将解析 HTML 字符串，本地 HTML 文件和网页中的 HTML 数据。我们将清理数据并执行 Google 搜索。

JSoup

JSoup 是用于提取和处理 HTML 数据的 Java 库。它实现了 HTML5 规范，并将 HTML 解析为与现代浏览器相同的 DOM。该项目的网站是 jsoup.org 。

JSoup 功能

使用 JSoup，我们能够：

从 URL，文件或字符串中抓取并解析 HTML
使用 DOM 遍历或 CSS 选择器查找和提取数据
处理 HTML 元素，属性和文本
根据安全的白名单清除用户提交的内容，以防止 XSS 攻击
输出整洁的 HTML

JSoup Maven 依赖项

在本教程的示例中，我们使用以下 Maven 依赖关系。

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>

JSoup 类

JSoup类通过其静态方法为 jsoup 功能提供了核心公共访问点。例如，clean()方法清理 HTML 代码，connect()方法创建与 URL 的连接，或者parse()方法解析 HTML 内容。

JSoup 解析 HTML 字符串

在第一个示例中，我们将解析一个 HTML 字符串。

com/zetcode/JSoupFromStringEx.java

package com.zetcode;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JSoupFromStringEx {

    public static void main(String[] args) {

        String htmlString = "<html><head><title>My title</title></head>"
                + "<body>Body content</body></html>";

        Document doc = Jsoup.parse(htmlString);
        String title = doc.title();
        String body = doc.body().text();

        System.out.printf("Title: %s%n", title);
        System.out.printf("Body: %s", body);
    }
}

该示例分析 HTML 字符串并输出其标题和正文内容。

String htmlString = "<html><head><title>My title</title></head>"
        + "<body>Body content</body></html>";

此字符串包含简单的 HTML 数据。

Document doc = Jsoup.parse(htmlString);

使用 Jsoup 的parse()方法，我们解析 HTML 字符串。该方法返回一个 HTML 文档。

String title = doc.title();

文档的title()方法获取文档的 title 元素的字符串内容。

String body = doc.body().text();

文档的body()方法返回 body 元素；其text()方法获取元素的文本。

JSoup 解析本地 HTML 文件

在第二个示例中，我们将解析本地 HTML 文件。我们使用重载的Jsoup.parse()方法，该方法将File对象作为其第一个参数。

index.html

<!DOCTYPE html>
<html>
    <head>
        <title>My title</title>
        <meta charset="UTF-8">
    </head>
    <body>
        <div id="mydiv">Contents of a div element</div>
    </body>
</html>

对于示例，我们使用上面的 HTML 文件。

com/zetcode/JSoupFromFileEx.java

package com.zetcode;

import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JSoupFromFileEx {

    public static void main(String[] args) throws IOException {

        String fileName = "src/main/resources/index.html";

        Document doc = Jsoup.parse(new File(fileName), "utf-8"); 
        Element divTag = doc.getElementById("mydiv"); 

        System.out.println(divTag.text());
    }
}

该示例将分析src/main/resources/目录中的index.html文件。

Document doc = Jsoup.parse(new File(fileName), "utf-8");

我们使用Jsoup.parse()方法解析 HTML 文件。

Element divTag = doc.getElementById("mydiv");

使用文档的getElementById()方法，我们通过元素的 ID 获得元素。

System.out.println(divTag.text());

使用元素的text()方法检索标签的文本。

JSoup 读取网站的标题

在以下示例中，我们将抓取并解析网页，然后检索 title 元素的内容。

com/zetcode/JSoupTitleEx.java

package com.zetcode;

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JSoupTitleEx {

    public static void main(String[] args) throws IOException {

        String url = "http://webcode.me";

        Document doc = Jsoup.connect(url).get();
        String title = doc.title();
        System.out.println(title);
    }
}

在代码示例中，我们读取了指定网页的标题。

Document doc = Jsoup.connect(url).get();

Jsoup 的connect()方法创建到给定 URL 的连接。 get()方法执行 GET 请求并解析结果；它返回一个 HTML 文档。

String title = doc.title();

使用文档的title()方法，我们获得 HTML 文档的标题。

JSoup 读取 HTML 源码

下一个示例检索网页的 HTML 源。

com/zetcode/JSoupHTMLSourceEx.java

package com.zetcode;

import java.io.IOException;
import org.jsoup.Jsoup;

public class JSoupHTMLSourceEx {

    public static void main(String[] args) throws IOException {

        String webPage = "http://webcode.me";

        String html = Jsoup.connect(webPage).get().html();

        System.out.println(html);
    }
}

该示例打印网页的 HTML。

String html = Jsoup.connect(webPage).get().html();

html()方法返回元素的 HTML；在我们的案例中，整个文档的 HTML 源代码。

JSoup 获取信息

HTML 文档的元信息提供有关网页的结构化元数据，例如其描述和关键字。

com/zetcode/JSoupMetaInfoEx.java

package com.zetcode;

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JSoupMetaInfoEx {

    public static void main(String[] args) throws IOException {

        String url = "http://www.jsoup.org";

        Document document = Jsoup.connect(url).get();

        String description = document.select("meta[name=description]").first().attr("content");
        System.out.println("Description : " + description);

        String keywords = document.select("meta[name=keywords]").first().attr("content");
        System.out.println("Keywords : " + keywords);
    }
}

该代码示例检索有关指定网页的元信息。

String keywords = document.select("meta[name=keywords]").first().attr("content");

文档的select()方法查找与给定查询匹配的元素。 first()方法返回第一个匹配的元素。使用attr()方法，我们获得content属性的值。

JSoup 解析链接

下一个示例分析 HTML 页面中的链接。

com/zetcode/JSoupLinksEx.java

package com.zetcode;

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JSoupLinksEx {

    public static void main(String[] args) throws IOException {

        String url = "http://jsoup.org";

        Document document = Jsoup.connect(url).get();
        Elements links = document.select("a[href]");

        for (Element link : links) {

            System.out.println("link : " + link.attr("href"));
            System.out.println("text : " + link.text());
        }
    }
}

在该示例中，我们连接到网页并解析其所有链接元素。

Elements links = document.select("a[href]");

要获取链接列表，我们使用文档的select()方法。

JSoup 清理 HTML 数据

Jsoup 提供了用于清理 HTML 数据的方法。

com/zetcode/JSoupSanitizeEx.java

package com.zetcode;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Whitelist;

public class JSoupSanitizeEx {

    public static void main(String[] args) {

        String htmlString = "<html><head><title>My title</title></head>"
                + "<body><center>Body content</center></body></html>";

        boolean valid = Jsoup.isValid(htmlString, Whitelist.basic());

        if (valid) {

            System.out.println("The document is valid");
        } else {

            System.out.println("The document is not valid.");
            System.out.println("Cleaned document");

            Document dirtyDoc = Jsoup.parse(htmlString);
            Document cleanDoc = new Cleaner(Whitelist.basic()).clean(dirtyDoc);

            System.out.println(cleanDoc.html());
        }
    }
}

在示例中，我们清理并清理 HTML 数据。

String htmlString = "<html><head><title>My title</title></head>"
        + "<body><center>Body content</center></body></html>";

HTML 字符串包含不推荐使用的 center 元素。

boolean valid = Jsoup.isValid(htmlString, Whitelist.basic());

isValid()方法确定字符串是否为有效的 HTML。白名单是可以通过清除程序的 HTML（元素和属性）列表。 Whitelist.basic()定义了一组基本的干净 HTML 标记。

Document dirtyDoc = Jsoup.parse(htmlString);
Document cleanDoc = new Cleaner(Whitelist.basic()).clean(dirtyDoc);

借助Cleaner，我们清理了脏的 HTML 文档。

The document is not valid.
Cleaned document
<html>
 <head></head>
 <body>
  Body content
 </body>
</html>

这是程序的输出。我们可以看到中心元素已被删除。

JSoup 执行 Google 搜索

以下示例使用 Jsoup 执行 Google 搜索。

com/zetcode/JsoupGoogleSearchEx.java

package com.zetcode;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupGoogleSearchEx {

    private static Matcher matcher;
    private static final String DOMAIN_NAME_PATTERN
            = "([a-zA-Z0-9]([a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+[a-zA-Z]{2,15}";
    private static Pattern patrn = Pattern.compile(DOMAIN_NAME_PATTERN);

    public static String getDomainName(String url) {

        String domainName = "";
        matcher = patrn.matcher(url);

        if (matcher.find()) {
            domainName = matcher.group(0).toLowerCase().trim();
        }

        return domainName;
    }

    public static void main(String[] args) throws IOException {

        String query = "Milky Way";

        String url = "https://www.google.com/search?q=" + query + "&num=10";

        Document doc = Jsoup
                .connect(url)
                .userAgent("Jsoup client")
                .timeout(5000).get();

        Elements links = doc.select("a[href]");

        Set<String> result = new HashSet<>();

        for (Element link : links) {

            String attr1 = link.attr("href");
            String attr2 = link.attr("class");

            if (!attr2.startsWith("_Zkb") && attr1.startsWith("/url?q=")) {

                result.add(getDomainName(attr1));
            }
        }

        for (String el : result) {
            System.out.println(el);
        }
    }
}

该示例为“银河”一词创建搜索请求。它会打印十个与该术语匹配的域名。

private static final String DOMAIN_NAME_PATTERN
        = "([a-zA-Z0-9]([a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+[a-zA-Z]{2,15}";
private static Pattern patrn = Pattern.compile(DOMAIN_NAME_PATTERN);

Google 搜索返回长链接，我们要从中获取域名。为此，我们使用正则表达式模式。

public static String getDomainName(String url) {

    String domainName = "";
    matcher = patrn.matcher(url);

    if (matcher.find()) {
        domainName = matcher.group(0).toLowerCase().trim();
    }

    return domainName;
}

getDomainName()使用正则表达式匹配器从搜索链接返回域名。

String query = "Milky Way";

这是我们的搜索词。

String url = "https://www.google.com/search?q=" + query + "&num=10";

这是执行 Google 搜索的网址。

Document doc = Jsoup
        .connect(url)
        .userAgent("Jsoup client")
        .timeout(5000).get();

我们连接到 URL，设置 5 s 超时，然后发送 GET 请求。返回 HTML 文档。

Elements links = doc.select("a[href]");

从文档中，我们选择链接。

Set<String> result = new HashSet<>();

for (Element link : links) {

    String attr1 = link.attr("href");
    String attr2 = link.attr("class");

    if (!attr2.startsWith("_Zkb") && attr1.startsWith("/url?q=")) {

        result.add(getDomainName(attr1));
    }
}

我们寻找不具有 class =“ _ Zkb”属性并且具有 href =“ / url？q =”属性的链接。请注意，这些是硬编码的值，将来可能会更改。

for (String el : result) {
    System.out.println(el);
}

最后，我们将域名打印到终端。

en.wikipedia.org
www.space.com
www.nasa.gov
sk.wikipedia.org
www.bbc.co.uk
imagine.gsfc.nasa.gov
www.forbes.com
www.milkywayproject.org
www.youtube.com
www.universetoday.com

这些是“银河”一词的 Google 顶级搜索结果。

本教程专门针对 Jsoup HTML 解析器。