Example of Web Scraping in Java using JSoup
In this blog post I'm going to describe how we can use the JSoup library to scrape content from a website. Websites use a standard markup language called HTML to display documents in a web browser. An HTML document has an XML-like structure composed of elements and attributes.
<rootElement> //element with tag rootElement
<aTag width="10" height="20" color="RED"> //sub element aTag with attributes width, height etc
<content>Hello</content> //another nested sub element
</aTag>
<summary> This is summary.</summary> //another element under root element
</rootElement>
Although an HTML document starts with <HTML> and its content is kept under the <BODY> element, the actual semantics of HTML are largely irrelevant to web scraping. HTML is not strictly XML, but a parser turns it into the same kind of tree of elements and attributes. Web scraping libraries deal with parsing the markup and reading the data out of the resulting document tree.
Let's build a quotes scraping app!
In this example we are going to extract quotes from goodreads.com (https://www.goodreads.com/quotes).
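As a small illustration of the point about parsing, here is a minimal sketch that uses the JSoup library (added to the project in Step 1 below). The class name LenientParse and the sample markup are made up for this example. The input leaves both paragraphs unclosed, which is common in real HTML but would not be well-formed XML, and JSoup still produces a tree with two paragraph elements.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LenientParse {
    // Two paragraphs without closing tags (illustrative sample, not from a real site)
    static final String HTML = "<p>First<p>Second";

    public static void main(String[] args) {
        // Jsoup.parse uses its HTML parser, which closes the open <p> elements for us
        Document doc = Jsoup.parse(HTML);
        for (Element p : doc.select("p")) {
            System.out.println(p.text());
        }
    }
}
```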
Step 1: Set up a skeleton Java project with the JSoup dependency
We are going to use Maven to add the JSoup dependency and build the project.
Step 1.a Generate Maven Project using maven archetype
mvn archetype:generate -DgroupId=gt -DartifactId=web-scrapper-java -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
This generates the following files. Note that I deleted AppTest.java under src/test/java/gt/ because we won't be writing unit tests for this app.
├── pom.xml
├── src
│   └── main
│       └── java
│           └── gt
│               ├── App.java
Step 1.b Add JSoup dependency
I searched for the jsoup dependency at https://mvnrepository.com/artifact/org.jsoup/jsoup, copied the following definition for the current version of jsoup, and pasted it inside the <dependencies> section of pom.xml:
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.13.1</version> <!-- use the latest available version -->
</dependency>
I also deleted the junit dependency from pom.xml since we won't be writing unit tests.
Step 2: Basic Scraping Examples
Let's play with the JSoup API first. In the example below we parse XML-like content from a string and extract several pieces of it using CSS queries (the cssQuery argument of select). Please refer to https://www.w3schools.com/cssref/css_selectors.asp for more examples of CSS selectors.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import static java.lang.System.out;

public class Test {
    public static void main(String[] args) {
        String html = "<rootElement> " +
                " <aTag width='10' height='20' color='RED' class='C1'> " +
                " <content>Hello</content> " +
                " </aTag>" +
                " <aTag width='10' height='20' color='GREEN' class='C1'> " +
                " <content class = 'small-font'>Hello Again small font</content> " +
                " </aTag>" +
                " <summary>" +
                " <content class = 'small-font'> This is summary in small font </content>" +
                " </summary> " +
                "</rootElement>";
        Document doc = Jsoup.parse(html);

        // print the text of every content element
        /*
        it prints:
        Hello
        Hello Again small font
        This is summary in small font
        */
        Elements els = doc.select("content");
        for (Element e : els) {
            out.println(e.text());
        }

        // text inside content elements that are direct children of aTag
        /*
        it prints:
        Hello
        Hello Again small font
        */
        for (Element e : doc.select("aTag > content")) {
            out.println(e.text());
        }

        // get all elements that have a color attribute and display the value of that attribute
        /*
        it prints:
        RED
        GREEN
        */
        for (Element e : doc.getElementsByAttribute("color")) {
            out.println(e.attributes().get("color"));
        }

        // get all elements with class C1 and display the value of their color attribute
        /*
        it prints:
        RED
        GREEN
        */
        for (Element e : doc.select(".C1")) {
            out.println(e.attributes().get("color"));
        }

        // read the text inside every element with class small-font
        /*
        it prints:
        Hello Again small font
        This is summary in small font
        */
        for (Element e : doc.select(".small-font")) {
            out.println(e.text());
        }
    }
}
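The loops above can often be shortened with a few convenience methods. The following is a minimal sketch, not part of the original example: the class name SelectorShortcuts and its helper methods are made up for illustration, and the markup is a trimmed copy of the string used above. It relies on Elements.eachAttr, Element.selectFirst and Element.attr from the JSoup API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.List;

public class SelectorShortcuts {
    // Trimmed copy of the sample markup from the example above
    static final String HTML =
            "<rootElement>" +
            " <aTag color='RED' class='C1'><content>Hello</content></aTag>" +
            " <aTag color='GREEN' class='C1'><content class='small-font'>Hello Again small font</content></aTag>" +
            "</rootElement>";

    // Collects the color attribute of every element with class C1 in one call
    static List<String> colors(Document doc) {
        return doc.select(".C1").eachAttr("color");
    }

    // Returns the text of the first element with class small-font, or "" if there is none
    static String firstSmallFont(Document doc) {
        Element first = doc.selectFirst(".small-font");
        return first == null ? "" : first.text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(colors(doc));
        System.out.println(firstSmallFont(doc));
        // attr(name) is a shorter form of attributes().get(name)
        System.out.println(doc.selectFirst(".C1").attr("color"));
    }
}
```

selectFirst returns null when nothing matches, so the null check in firstSmallFont is worth keeping when the page structure is not under your control.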
Step 3: Scraping goodreads.com
Step 3.a Examine the HTML content
The first step is to examine the structure of the document to see where our data is located. Here we want to read the quote, the author and the tags.
After inspecting the structure of the HTML with the browser's inspect tool, we can notice that:
- The <div class='quote'> is repeated for each quote.
- The quote text is inside the element with class 'quoteText'.
- The author name is inside the element with class 'authorOrTitle', nested under the 'quoteText' element.
- The tags are inside the element with class 'quoteFooter'.
Here's the HTML content we are interested in. We want to extract the quote text, the author name and the tags.
<div class="quoteText">
“I'm
selfish, impatient and a little insecure. I make mistakes, I am out of
control and at times hard to handle. But if you can't handle me at my
worst, then you sure as hell don't deserve me at my best.”
<br> ―
<span class="authorOrTitle">
Marilyn Monroe
</span>
</div>
<div class="quoteFooter">
<div class="greyText smallText left">
tags:
<a href="/quotes/tag/attributed-no-source">attributed-no-source</a>,
<a href="/quotes/tag/best">best</a>,
<a href="/quotes/tag/life">life</a>,
<a href="/quotes/tag/love">love</a>,
<a href="/quotes/tag/mistakes">mistakes</a>,
<a href="/quotes/tag/out-of-control">out-of-control</a>,
<a href="/quotes/tag/truth">truth</a>,
<a href="/quotes/tag/worst">worst</a>
</div>
<div class="right">
<a class="smallText" title="View this quote"
href="/quotes/8630-i-m-selfish-impatient-and-a-little-insecure-i-make-mistakes">151963
likes</a>
</div>
</div>
Step 3.b Read quotes from goodreads.com
In the example above we parsed a static String. To read a live web page, we can use Jsoup.connect(url).get(), which fetches the page and returns a Document object:
Document doc = Jsoup.connect("https://www.goodreads.com/quotes?page=1").get();
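Real requests can fail or hang, so it may help to configure the connection before calling get(). The following is a minimal sketch under assumptions: the class name FetchPage, the helper prepare, the user agent string and the 10 second timeout are all illustrative choices, not values from this post. userAgent, timeout and get are methods of JSoup's Connection interface.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class FetchPage {
    // Prepares a request without sending it; user agent and timeout are illustrative values
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; quotes-demo)")
                .timeout(10_000); // milliseconds
    }

    public static void main(String[] args) {
        try {
            Document doc = prepare("https://www.goodreads.com/quotes?page=1").get();
            System.out.println(doc.title());
        } catch (IOException e) {
            // get() throws IOException on network errors, timeouts and HTTP error statuses
            System.err.println("Could not fetch the page: " + e.getMessage());
        }
    }
}
```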
The full code to read the quote text, author and tags:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.List;
import java.util.stream.Collectors;

public class GoodReadsScrapper {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://www.goodreads.com/quotes?page=1").get();
        Elements quoteElements = doc.select(".quoteText");
        for (Element e : quoteElements) {
            // Read the quote text from the body of the quoteText element.
            // e.text() returns all the visible text inside this element, which also includes the author;
            // ownText() skips the text of child elements such as the author span.
            String qStr = e.ownText();
            // Strip the curly quotation marks and the "―" separator that follows the <br>
            String quoteText = qStr.replace("“", "").replace("”", "").replace("―", "").trim();
            // The author is in the span with class authorOrTitle within the current element
            String author = e.select(".authorOrTitle").text();
            // Tags: take the following siblings of the quoteText div, keep the one with class 'quoteFooter' and read its a tags
            Elements tagElements = e.nextElementSiblings().select(".quoteFooter").select(".greyText").select("a");
            List<String> tags = tagElements.stream().map(Element::text).collect(Collectors.toList());
            System.out.println(quoteText + " By:" + author + " , Tags:" + tags);
        }
    }
}
Step 4: Thinking Bigger
What if we want to read quotes from multiple websites?
What if we want to store the quotes in a database?
What if we want to run the scraping job periodically?
For these 'what-ifs', I updated the above code to include the following:
├── pom.xml
├── src
│   └── main
│       └── java
│           └── gt
│               ├── GoodReadsScrapper.java //implementation for GoodReads
│               ├── Quote.java //wrapper class to hold quote data
│               ├── QuoteScrapper.java //base interface
│               ├── ScrapperService.java //a job
│               ├── Source.java //enum to hold sources
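The contents of these files are not shown here, so the following is only a hypothetical sketch of how such pieces might fit together, not the actual implementation. The type names Quote, Source, QuoteScrapper and ScrapperService come from the listing above; every field, method, the stub scrapper, the in-memory list standing in for a database and the one hour interval are assumptions made for illustration. Everything is nested in one class, ScrapperSketch, so the example fits in a single file and runs without network access.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ScrapperSketch {

    // Hypothetical enum of sources; only GoodReads is covered in this post
    enum Source { GOODREADS }

    // Hypothetical wrapper class holding the data of one quote
    static final class Quote {
        final Source source;
        final String text;
        final String author;
        final List<String> tags;

        Quote(Source source, String text, String author, List<String> tags) {
            this.source = source;
            this.text = text;
            this.author = author;
            this.tags = tags;
        }
    }

    // Hypothetical base interface: one implementation per website
    interface QuoteScrapper {
        Source source();
        List<Quote> scrape() throws Exception;
    }

    // Hypothetical job: runs every scrapper and keeps the results
    static final class ScrapperService {
        private final List<QuoteScrapper> scrappers;
        private final List<Quote> store = new ArrayList<>(); // stand-in for a database

        ScrapperService(List<QuoteScrapper> scrappers) {
            this.scrappers = scrappers;
        }

        // Scrapes all sources once and returns a copy of everything stored so far
        synchronized List<Quote> runOnce() {
            for (QuoteScrapper s : scrappers) {
                try {
                    store.addAll(s.scrape());
                } catch (Exception ex) {
                    // one failing source should not stop the others
                    System.err.println("Scraping " + s.source() + " failed: " + ex.getMessage());
                }
            }
            return new ArrayList<>(store);
        }
    }

    // Stub standing in for GoodReadsScrapper so that the sketch needs no network
    static QuoteScrapper stubScrapper() {
        return new QuoteScrapper() {
            public Source source() { return Source.GOODREADS; }
            public List<Quote> scrape() {
                return Arrays.asList(new Quote(Source.GOODREADS, "Sample quote",
                        "Sample Author", Arrays.asList("sample")));
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        ScrapperService service = new ScrapperService(Arrays.asList(stubScrapper()));
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Run immediately, then once per hour; the interval is an arbitrary choice
        scheduler.scheduleAtFixedRate(service::runOnce, 0, 1, TimeUnit.HOURS);
        Thread.sleep(500); // give the first run time to finish in this demo
        scheduler.shutdown();
    }
}
```

Keeping the job behind an interface means a new website only needs a new QuoteScrapper implementation and a new Source constant, while the scheduling and storage code stays unchanged.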