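A First Look at Driving Chrome Headless from Java
Background
For JavaScript-heavy sites, where much of the content is rendered in the browser, Chrome Headless is a practical way to crawl the pages. Most examples online are written in Python; this post records a Java version.
First try
Prepare Chrome
- Install Chrome: omitted
- Download chromedriver (pick the release that matches your installed Chrome version): https://sites.google.com/a/chromium.org/chromedriver/downloads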
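Before wiring the driver into a crawler, it can help to confirm that the downloaded binary is actually present and executable, since a wrong path is a common cause of startup failures. The following is a minimal sketch in plain Java, not part of Selenium: the class name DriverPathCheck, the method requireExecutable, and the default location "chromedriver" are hypothetical names chosen for illustration. Only the webdriver.chrome.driver system property comes from the setup described in this post.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: validates the chromedriver location before it is used.
class DriverPathCheck {

    /** Returns the absolute path if the file exists and is executable; otherwise throws. */
    static String requireExecutable(String location) {
        Path path = Paths.get(location).toAbsolutePath();
        if (!Files.isRegularFile(path)) {
            throw new IllegalStateException("chromedriver not found at " + path);
        }
        if (!Files.isExecutable(path)) {
            throw new IllegalStateException("chromedriver is not executable: " + path);
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // Assumed default location; pass the real path as the first argument.
        String location = args.length > 0 ? args[0] : "chromedriver";
        try {
            System.setProperty("webdriver.chrome.driver", requireExecutable(location));
            System.out.println("Using driver: " + System.getProperty("webdriver.chrome.driver"));
        } catch (IllegalStateException e) {
            System.out.println("Driver check failed: " + e.getMessage());
        }
    }
}
```

On macOS or Linux, a freshly downloaded chromedriver may need chmod +x before the executable check passes.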
Edit pom.xml
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.8.1</version>
</dependency>
Crawl example
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class MainServer {
    public static void main(String[] args) throws InterruptedException {
        // Path to the chromedriver binary
        System.setProperty("webdriver.chrome.driver", "/Users/xuqian/share/chromedriver");
        // Enable headless mode
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless", "--disable-gpu", "--window-size=1920,1200", "--ignore-certificate-errors");
        // Start the driver
        WebDriver driver = new ChromeDriver(options);
        try {
            // Crawl the page
            driver.get("http://www.baidu.com/");
            Thread.sleep(5000); // Crude fixed wait so the page's JavaScript can finish rendering
            String pageSource = driver.getPageSource();
            System.out.println(pageSource);
        } finally {
            // Shut down the browser and the chromedriver process so neither is left running
            driver.quit();
        }
    }
}
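The fixed Thread.sleep(5000) in the example either wastes time when the page is ready early or falls short when it is slow. A common alternative is to poll a condition until it holds or a timeout expires. Selenium ships its own WebDriverWait class for this purpose. The sketch below only illustrates the underlying idea in plain Java, with no Selenium dependency. The class name PollingWait, the method waitUntil, and the stand-in condition are hypothetical; with a real driver the condition would inspect the page instead.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper illustrating a polling wait; not part of Selenium.
class PollingWait {

    /**
     * Polls the condition every intervalMillis until it returns true or
     * timeoutMillis has elapsed. Returns true if the condition was met in
     * time, false on timeout or if the thread was interrupted.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true roughly 300 ms after start.
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 300, 5000, 100);
        System.out.println("ready = " + ready);
    }
}
```

In the crawl example, the call to waitUntil would take the place of the sleep, with a condition such as "the page source contains the element I need", so the crawler proceeds as soon as the content appears.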