Palakorn Nakphong Founder: Nextzy Technologies Co.,ltd. [“Java Programmer”, Fullstack Web Developer, Ruby On Rails Developer]; fb.com/codingz @Codingz th.linkedin.com/in/palakorn

Jsoup: Java HTML Parser

Jsoup is an open-source Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using DOM traversal or CSS selectors.

Complex DOM Element

Old Web Scraping

How do you get the data out of a tag?

Regular expressions are painful

String expr = "<td><span\\s+class=\"flagicon\"[^>]*>"
    + ".*?</span><a href=\""
    + "([^\"]+)"       // first piece of data goes up to quote
    + "\"[^>]*>"       // end quote, then skip to end of tag
    + "([^<]+)"        // name is data up to next tag
    + "</a>.*?</td>";  // end a tag, then skip to the td close tag
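
To make the pain concrete, here is a minimal sketch that runs the expression above with `java.util.regex`. The class name `RegexScrapeSketch`, the helper `extract`, and the sample table cell are made up for illustration; they are not from the original slides.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexScrapeSketch {
    // The expression from the slide, unchanged.
    static final String EXPR = "<td><span\\s+class=\"flagicon\"[^>]*>"
            + ".*?</span><a href=\""
            + "([^\"]+)"
            + "\"[^>]*>"
            + "([^<]+)"
            + "</a>.*?</td>";

    // Returns {href, name} for the first matching cell, or null if nothing matches.
    static String[] extract(String html) {
        Matcher m = Pattern.compile(EXPR).matcher(html);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        // Hypothetical table cell, invented for this sketch.
        String row = "<td><span class=\"flagicon\"><img src=\"th.png\"></span>"
                + "<a href=\"/wiki/Thailand\" title=\"Thailand\">Thailand</a></td>";
        String[] hit = extract(row);
        System.out.println(hit[0] + " -> " + hit[1]);

        // One extra attribute in front of class= is enough to break the match.
        String changed = row.replace("<span class=", "<span id=\"x\" class=");
        System.out.println(extract(changed) == null);
    }
}
```

The second call shows why this approach is fragile: the markup is still valid HTML, but the pattern no longer matches it.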

New Web Scraping

Using Jsoup

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.1</version>
</dependency>

What is the Jsoup library?

• Jsoup can scrape and parse HTML from a URL, file, or string

• Jsoup can find and extract data, using DOM traversal or CSS selectors

• Jsoup allows you to manipulate the HTML elements, attributes, and text

• Jsoup can clean user-submitted content against a safe whitelist, to prevent XSS attacks

• Jsoup can also output tidy HTML

Example DOM Element

Document doc = Jsoup.connect("http://www.nextzy.com/").get();
String title = doc.title();

<html>
  <head>
    <title>My title</title>
  </head>
  <body>
    <h1>My header</h1>
    <a href="test.html">My link</a>
  </body>
</html>

File input = new File("/file/nextzy.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://nextzy.com/");

Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}

Get Element By …
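
The same traversal can be tried without a file or network access by parsing a string. This is a rough sketch, assuming jsoup is on the classpath; the class name `DomTraversalSketch`, the helper `linksInside`, and the sample markup are invented for illustration. It also guards against `getElementById` returning null when the id is missing.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DomTraversalSketch {
    // Collects "text -> href" for every <a> inside the element with the given id.
    static List<String> linksInside(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element container = doc.getElementById(id);
        List<String> out = new ArrayList<>();
        if (container == null) {
            return out; // no element with that id, so there is nothing to traverse
        }
        for (Element link : container.getElementsByTag("a")) {
            out.add(link.text() + " -> " + link.attr("href"));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical page: one link inside #content, one outside it.
        String html = "<div id=\"content\"><a href=\"test.html\">My link</a></div>"
                + "<a href=\"other.html\">Outside</a>";
        System.out.println(linksInside(html, "content"));
    }
}
```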

Elements links = doc.select("a[href]");
Elements pngs = doc.select("img[src$=.png]");
Element masthead = doc.select("div.masthead").first();
Elements resultLinks = doc.select("h3.active > a");

Like CSS Selector …
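
To see what each of these selectors picks up, the sketch below runs them against a small in-memory page. It assumes jsoup is on the classpath; the class name `SelectorSketch`, its helpers, and the sample HTML are hypothetical and only serve the illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class SelectorSketch {
    // Hypothetical page with a masthead, two headings with links, and two images.
    static final String HTML = "<div class=\"masthead\">Top</div>"
            + "<h3 class=\"active\"><a href=\"/one\">One</a></h3>"
            + "<h3><a href=\"/two\">Two</a></h3>"
            + "<a name=\"anchor\">No href</a>"
            + "<img src=\"logo.png\"><img src=\"photo.jpg\">";

    // Number of elements matched by a CSS selector.
    static int count(String html, String css) {
        return Jsoup.parse(html).select(css).size();
    }

    // Text of the first match, or null when nothing matches.
    static String firstText(String html, String css) {
        Element first = Jsoup.parse(html).select(css).first();
        return first == null ? null : first.text();
    }

    public static void main(String[] args) {
        System.out.println(count(HTML, "a[href]"));            // links that carry an href
        System.out.println(count(HTML, "img[src$=.png]"));     // images whose src ends in .png
        System.out.println(firstText(HTML, "div.masthead"));   // first div with class masthead
        System.out.println(firstText(HTML, "h3.active > a"));  // direct <a> child of h3.active
    }
}
```

Note that `first()` returns null on an empty result, so chained calls such as `doc.select(...).first().text()` are worth guarding when the page structure is not under your control.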

Document doc = Jsoup.connect("http://jsoup.org").get();

Element link = doc.select("a").first();
String relHref = link.attr("href");     // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"

Work with URL …
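
The `abs:` prefix resolves an attribute against the document's base URI. When parsing a string instead of fetching a page, that base URI has to be passed in explicitly. A small sketch, assuming jsoup is on the classpath; the class name `AbsUrlSketch`, the helper `hrefs`, and the sample link are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsUrlSketch {
    // Returns {relative href, absolute href} for the first link, or null if there is none.
    static String[] hrefs(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri); // baseUri is what abs: resolves against
        Element link = doc.select("a").first();
        if (link == null) {
            return null;
        }
        return new String[] { link.attr("href"), link.attr("abs:href") };
    }

    public static void main(String[] args) {
        // Hypothetical relative link, resolved against the jsoup site from the slide.
        String[] r = hrefs("<a href=\"/download\">Download</a>", "http://jsoup.org/");
        System.out.println(r[0]);
        System.out.println(r[1]);
    }
}
```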

Come be a pirate with us... https://www.blognone.com/node/64996

Thank you