Since the Sakila data turned out to be less than useful, I might as well crawl some data myself.
- The earlier hands-on Sakila database project (实战-sakila数据库项目) walked through building an application on the Sakila sample data, but it turned out to contain only text data, and the film data was unimpressive. So here we use a crawler to grab some data from the all-knowing internet.
Target selection
What to scrape
Technology selection
Code implementation
The snippet below also serves as a simple introduction to unit testing; for jsoup usage, see the official documentation.
```java
private static final String UA_PHONE = "Mozilla/5.0 (Linux; Android 4.3; Nexus 10 Build/JSS15Q) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Safari/537.36";
private static final String UA_PC = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36";
private static final String QQ_PAGE_MOVIE = "http://v.qq.com/x/list/movie?itype=-1&offset=1";
// TIME_OUT was referenced but not defined in the original snippet; 10 s is a reasonable default.
private static final int TIME_OUT = 10 * 1000;

@Test
public void test() throws IOException {
    Document document = Jsoup.connect(QQ_PAGE_MOVIE)
            .userAgent(UA_PC)
            .timeout(TIME_OUT)
            .ignoreContentType(true)
            .get();
    // Each movie entry is an <a class="figure"> inside an <li class="list_item">.
    Elements elements = document.select("li.list_item a.figure");
    for (Element element : elements) {
        String url = element.attr("href");
        String title = element.select("img").attr("alt");
        // The poster URL lives in the lazy-load attribute, not in src.
        String image = element.select("img").attr("r-lazyload");
        System.out.println("href " + url + " title " + title + " image " + image);
    }
}
```
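Because the selector logic is independent of the network fetch, it can be unit-tested offline by parsing an HTML string directly with `Jsoup.parse`. Below is a minimal sketch of that idea; the inline markup mimics the page structure the selector above targets, but the class name, URLs, and attribute values are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupOfflineDemo {
    public static void main(String[] args) {
        // Hypothetical HTML fragment shaped like the QQ video list page.
        String html =
            "<ul>" +
            "  <li class=\"list_item\">" +
            "    <a class=\"figure\" href=\"https://example.com/movie/1\">" +
            "      <img alt=\"Example Movie\" r-lazyload=\"//img.example.com/1.jpg\">" +
            "    </a>" +
            "  </li>" +
            "</ul>";

        // Parse the string directly -- no network round trip needed,
        // which makes the selector easy to verify in a unit test.
        Document document = Jsoup.parse(html);
        for (Element element : document.select("li.list_item a.figure")) {
            System.out.println("href " + element.attr("href")
                    + " title " + element.select("img").attr("alt")
                    + " image " + element.select("img").attr("r-lazyload"));
        }
    }
}
```

Keeping parsing separate from fetching this way also makes the test deterministic: it will not break when the site is slow or unreachable, only when the markup assumptions change.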
Output:
```
href https://… title … image …
(one line per movie; the URLs, titles, and image paths were truncated to "https:" in the original capture)
```
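The list URL above takes an `offset` query parameter, so crawling beyond the first page is a matter of generating the URL per page. A minimal sketch of such a helper follows; it is not part of the original post, and the per-page stride of 30 is an assumption that should be checked against the actual site.

```java
public class PageUrls {
    // Hypothetical helper: builds the QQ video list URL for a given page index.
    // The stride of 30 items per page is an assumption, not confirmed by the site.
    static String moviePageUrl(int page) {
        return "http://v.qq.com/x/list/movie?itype=-1&offset=" + (page * 30);
    }

    public static void main(String[] args) {
        // Print the URLs for the first three pages.
        for (int page = 0; page < 3; page++) {
            System.out.println(moviePageUrl(page));
        }
    }
}
```

When crawling multiple pages, it is also polite (and safer against bans) to sleep between requests rather than fetching all pages in a tight loop.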