Java 读取网页|极客教程

在 Java 中读取网页是一个教程，它提供了几种读取 Java 中网页的方法。它包含六个示例，这些示例从一个很小的网页上下载 HTTP 源。

Java 读取网页工具

Java 具有内置的工具和第三方库，用于读取/下载网页。在示例中，我们使用 URL，JSoup，HtmlCleaner，Apache HttpClient，Jetty HttpClient 和 HtmlUnit。

在以下示例中，我们从 something.com 微型网页下载 HTML 源。

读取带有 URL 的网页

URL表示一个统一资源定位符，它是指向万维网上资源的指针。

ReadWebPageEx.java

package com.zetcode;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

public class ReadWebPageEx {

    public static void main(String[] args) throws MalformedURLException, IOException {

        BufferedReader br = null;

        try {

            URL url = new URL("http://www.something.com");
            br = new BufferedReader(new InputStreamReader(url.openStream()));

            String line;

            StringBuilder sb = new StringBuilder();

            while ((line = br.readLine()) != null) {

                sb.append(line);
                sb.append(System.lineSeparator());
            }

            System.out.println(sb);
        } finally {

            if (br != null) {
                br.close();
            }
        }
    }
}

该代码示例读取网页的内容。

br = new BufferedReader(new InputStreamReader(url.openStream()));

openStream()方法打开到指定 URL 的连接，并返回InputStream以从该连接读取。 InputStreamReader是从字节流到字符流的桥梁。它读取字节，并使用指定的字符集将其解码为字符。另外，BufferedReader用于获得更好的性能。

StringBuilder sb = new StringBuilder();

while ((line = br.readLine()) != null) {

    sb.append(line);
    sb.append(System.lineSeparator());
}

使用readLine()方法逐行读取 HTML 数据。源被附加到StringBuilder上。

System.out.println(sb);

最后，StringBuilder的内容被打印到终端。

使用 JSoup 读取网页

JSoup 是流行的 Java HTML 解析器。

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.9.2</version>
</dependency>

我们已经使用了这种 Maven 依赖关系。

ReadWebPageEx2.java

package com.zetcode;

import java.io.IOException;
import org.jsoup.Jsoup;

public class ReadWebPageEx2 {

    public static void main(String[] args) throws IOException {

        String webPage = "http://www.something.com";

        String html = Jsoup.connect(webPage).get().html();

        System.out.println(html);
    }
}

该代码示例使用 JSoup 下载并打印一个很小的网页。

String html = Jsoup.connect(webPage).get().html();

connect()方法连接到指定的网页。 get()方法发出 GET 请求。最后，html()方法检索 HTML 源。

使用`HtmlCleaner`读取网页

HtmlCleaner 是用 Java 编写的开源 HTML 解析器。

<dependency>
    <groupId>net.sourceforge.htmlcleaner</groupId>
    <artifactId>htmlcleaner</artifactId>
    <version>2.16</version>
</dependency>

对于此示例，我们使用htmlcleaner Maven 依赖项。

ReadWebPageEx3.java

package com.zetcode;

import java.io.IOException;
import java.net.URL;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.SimpleHtmlSerializer;
import org.htmlcleaner.TagNode;

public class ReadWebPageEx3 {

    public static void main(String[] args) throws IOException {

        URL url = new URL("http://www.something.com");

        CleanerProperties props = new CleanerProperties();
        props.setOmitXmlDeclaration(true);

        HtmlCleaner cleaner = new HtmlCleaner(props);
        TagNode node = cleaner.clean(url);

        SimpleHtmlSerializer htmlSerializer = new SimpleHtmlSerializer(props);
        htmlSerializer.writeToStream(node, System.out);        
    }
}

该示例使用HtmlCleaner下载网页。

CleanerProperties props = new CleanerProperties();
props.setOmitXmlDeclaration(true);

在属性中，我们设置为省略 XML 声明。

SimpleHtmlSerializer htmlSerializer = new SimpleHtmlSerializer(props);
htmlSerializer.writeToStream(node, System.out);

SimpleHtmlSerializer将创建结果 HTML，而不会进行任何缩进和/或压缩。

使用 Apache `HttpClient`读取网页

Apache HttpClient 是 HTTP / 1.1 兼容的 HTTP 代理实现。它可以使用请求和响应过程来抓取网页。 HTTP 客户端实现 HTTP 和 HTTPS 协议的客户端。

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>

我们将这种 Maven 依赖关系用于 Apache HTTP 客户端。

ReadWebPageEx4.java

package com.zetcode;

import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.util.EntityUtils;

public class ReadWebPageEx4 {

    public static void main(String[] args) throws IOException {

        HttpGet request = null;

        try {

            String url = "http://www.something.com";
            HttpClient client = HttpClientBuilder.create().build();
            request = new HttpGet(url);

            request.addHeader("User-Agent", "Apache HTTPClient");
            HttpResponse response = client.execute(request);

            HttpEntity entity = response.getEntity();
            String content = EntityUtils.toString(entity);
            System.out.println(content);

        } finally {

            if (request != null) {

                request.releaseConnection();
            }
        }
    }
}

在代码示例中，我们将 GET HTTP 请求发送到指定的网页并接收 HTTP 响应。从响应中，我们读取了 HTML 源代码。

HttpClient client = HttpClientBuilder.create().build();

生成了HttpClient。

request = new HttpGet(url);

HttpGet是 HTTP GET 方法的类。

request.addHeader("User-Agent", "Apache HTTPClient");
HttpResponse response = client.execute(request);

执行 GET 方法并接收到HttpResponse。

HttpEntity entity = response.getEntity();
String content = EntityUtils.toString(entity);
System.out.println(content);

从响应中，我们获得网页的内容。

使用 Jetty `HttpClient`读取网页

Jetty 项目也有一个 HTTP 客户端。

<dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-client</artifactId>
    <version>9.4.0.M1</version>
</dependency>

这是 Jetty HTTP 客户端的 Maven 依赖项。

ReadWebPageEx5.java

package com.zetcode;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;

public class ReadWebPageEx5 {

    public static void main(String[] args) throws Exception {

        HttpClient client = null;

        try {

            client = new HttpClient();
            client.start();

            String url = "http://www.something.com";

            ContentResponse res = client.GET(url);

            System.out.println(res.getContentAsString());

        } finally {

            if (client != null) {

                client.stop();
            }
        }
    }
}

在示例中，我们使用 Jetty HTTP 客户端获取网页的 HTML 源。

client = new HttpClient();
client.start();

创建并启动HttpClient。

ContentResponse res = client.GET(url);

GET 请求发布到指定的 URL。

System.out.println(res.getContentAsString());

使用getContentAsString()方法从响应中检索内容。

使用`HtmlUnit`读取网页

HtmlUnit 是用于测试基于 Web 的应用的 Java 单元测试框架。

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.23</version>
</dependency>

我们使用这种 Maven 依赖关系。

ReadWebPageEx6.java

package com.zetcode;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.io.IOException;

public class ReadWebPageEx6 {

    public static void main(String[] args) throws IOException {

        try (WebClient webClient = new WebClient()) {

            String url = "http://www.something.com";

            HtmlPage page = webClient.getPage(url);
            WebResponse response = page.getWebResponse();
            String content = response.getContentAsString();

            System.out.println(content);
        }
    }
}

该示例下载一个网页并使用 HtmlUnit 库进行打印。

在本文中，我们使用各种工具（包括 URL，JSoup，HtmlCleaner，Apache HttpClient，Jetty HttpClient 和 HtmlUnit）在 Java 中抓取了一个网页。