GT's Blog

Web Scraping in Java using JSoup

Example of Web Scraping in Java using JSoup

In this post I describe how to use the JSoup library to scrape content from a website. Websites use a standard markup language, HTML, to display documents in a web browser. An HTML document has an XML-like structure composed of nested elements and attributes, for example:

<rootElement> //element with tag rootElement

<aTag width="10" height="20" color="RED"> //sub element aTag with attributes width, height etc

<content>Hello</content> //another nested sub element

</aTag>

<summary> This is summary.</summary> //another element under root element

</rootElement>

Although an HTML document starts with <html> and keeps its content under the <body> element, the exact semantics of the HTML tags matter little for web scraping; what matters is the tree of elements and attributes. HTML is not strictly XML, but scraping libraries such as JSoup parse it into a similar document tree and let you read the data out of that tree.

Let's build a quotes scraping app!

In this example we are going to extract quotes from goodreads.com (https://www.goodreads.com/quotes).

Step 1: Set up a skeleton Java project with the JSoup dependency

We are going to use Maven to add the JSoup dependency and build the project.

Step 1.a Generate a Maven project using the Maven archetype

mvn archetype:generate -DgroupId=gt -DartifactId=web-scrapper-java -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

This generated the following files. Note that I deleted AppTest.java under src/test/java/gt/ because we won't be writing unit tests for this app.

├── pom.xml
├── src
│   └── main
│       └── java
│           └── gt
│               ├── App.java

Step 1.b Add JSoup dependency

I searched for the jsoup dependency at https://mvnrepository.com/artifact/org.jsoup/jsoup, copied the following definition for the current version of jsoup, and pasted it inside the <dependencies> section of pom.xml:


    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.13.1</version> <!-- use the new version -->
    </dependency>

I also deleted junit dependency from pom.xml since we won't be writing unit tests.

Step 2: Basic Scraping Examples

Let's play with the JSoup API first. The examples below parse XML-like content from a string and extract several pieces of it using CSS queries. Please refer to https://www.w3schools.com/cssref/css_selectors.asp for more examples of CSS selectors.


import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import static java.lang.System.out;

public class Test {


    public static void main(String[] args) {

        String html = "<rootElement> " +
                "   <aTag width='10' height='20' color='RED' class='C1'>  " +
                "        <content>Hello</content> " +
                "    </aTag>" +
                "   <aTag width='10' height='20' color='GREEN' class='C1'>  " +
                "        <content class = 'small-font'>Hello Again small font</content> " +
                "    </aTag>" +
                "    <summary>" +
                "       <content class = 'small-font'> This is summary in small font </content>" +
                "    </summary> " +
                "</rootElement>";

        Document doc = Jsoup.parse(html);

        //print all content element
        /*
        it prints:
            Hello
            Hello Again small font
            This is summary in small font
         */
        Elements els = doc.select("content");
        for (Element e : els) {
            out.println(e.text());
        }

        //text inside content element under aTag
        /*
        it prints:
            Hello
            Hello Again small font
         */
        for (Element e : doc.select("aTag > content")) {
            out.println(e.text());
        }

        //get all elements that have a color attribute and display the value of the attribute
        /*
        it prints:
            RED
            GREEN
         */
        for (Element e : doc.getElementsByAttribute("color")) {
            out.println(e.attributes().get("color"));
        }

        //get all elements that have the class C1 and display the value of their color attribute
        /*
        it prints:
            RED
            GREEN
         */
        for (Element e : doc.select(".C1")) {
            out.println(e.attributes().get("color"));
        }

        //read the text of all elements that have the class small-font
        /*
        it prints:
            Hello Again small font
            This is summary in small font
         */
        for (Element e : doc.select(".small-font")) {
            out.println(e.text());
        }

    }
}

Step 3: Scraping goodreads.com

Step 3.a Examine the HTML content

The first step is to examine the structure of the document to see where our data is located. Here we want to read the quote, the author and the tags.

After inspecting the structure of the HTML with the browser's inspect tool, we can see that:

Each quote is wrapped in a repeated <div class='quote'>.
The quote text is inside the element with class 'quoteText'.
The author name is inside the element with class 'authorOrTitle', nested under 'quoteText'.
The tags are inside the element with class 'quoteFooter'.

Here's the HTML content we are interested in. We want to extract the quote text, the author name and the tags.
<div class="quoteText">
      “I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best.”
<br> ―
<span class="authorOrTitle">
    Marilyn Monroe
</span>
</div>
<div class="quoteFooter">
   <div class="greyText smallText left">
     tags:
       <a href="/quotes/tag/attributed-no-source">attributed-no-source</a>,
       <a href="/quotes/tag/best">best</a>,
       <a href="/quotes/tag/life">life</a>,
       <a href="/quotes/tag/love">love</a>,
       <a href="/quotes/tag/mistakes">mistakes</a>,
       <a href="/quotes/tag/out-of-control">out-of-control</a>,
       <a href="/quotes/tag/truth">truth</a>,
       <a href="/quotes/tag/worst">worst</a>
   </div>
   <div class="right">
     <a class="smallText" title="View this quote" href="/quotes/8630-i-m-selfish-impatient-and-a-little-insecure-i-make-mistakes">151963 likes</a>
   </div>
</div>

Step 3.b Read quotes from goodreads.com

In the above example we parsed a static String. To read a live web page, we can use Jsoup.connect(THE URL).get(), which fetches the page and returns a Document object:

Document doc = Jsoup.connect("https://www.goodreads.com/quotes?page=1").get();

The full code to read the quote text, author and tags:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.List;
import java.util.stream.Collectors;

public class GoodReadsScrapper {

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://www.goodreads.com/quotes?page=1").get();

        Elements quoteElements = doc.select(".quoteText");

        for (Element e : quoteElements) {

            //read quote text and the author from the body of quoteText css
            //e.text() returns all the visible text inside this element which also includes the author... use ownText to not look at child elements
            String qStr = e.ownText();
            String quoteText = qStr.replaceAll("“", "").replaceAll("”", "");

            //author is inside span inside authorOrTitle class within the current element
            String author = e.select(".authorOrTitle").text();

            //Tags: read sibling element of div with class 'quoteText', choose the one with class 'quoteFooter' and read the  a tags
            Elements tagElements = e.nextElementSiblings().select(".quoteFooter").select(".greyText").select("a");
            List<String> tags = tagElements.stream().map(Element::text).collect(Collectors.toList());

            System.out.println(quoteText + " By:" + author + " , Tags:" + tags);
        }
    }

}
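One detail worth checking: with markup like the sample in Step 3.a, the own text of the quoteText element probably also contains the horizontal-bar character (U+2015) that precedes the author name, so removing only the curly quotes may leave it at the end of the quote. Below is a minimal sketch of a slightly more thorough cleanup. QuoteTextCleaner and clean are hypothetical names that are not part of JSoup or of the code above, and the sketch uses only the JDK.

```java
class QuoteTextCleaner {

    /** Strips surrounding quote marks and a trailing attribution dash from scraped text. */
    static String clean(String raw) {
        if (raw == null) {
            return "";
        }
        String s = raw.trim();
        // Drop trailing whitespace, em dashes (U+2014) and horizontal bars (U+2015).
        s = s.replaceAll("[\\s\u2014\u2015]+$", "");
        // Drop curly (U+201C, U+201D) or straight double quotes at both ends.
        s = s.replaceAll("^[\u201C\"]+", "").replaceAll("[\u201D\"]+$", "");
        return s.trim();
    }

    public static void main(String[] args) {
        // Input mimics the own text of a quoteText element: quoted text, then the dash.
        System.out.println(clean("\u201CBe yourself.\u201D \u2015")); // prints: Be yourself.
    }
}
```

In the scraper above, this would replace the two replaceAll calls, for example: String quoteText = QuoteTextCleaner.clean(e.ownText());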

Step 4: Thinking Bigger:

What if we want to read quotes from multiple web sites?

What if we want to store the quotes to DB?

What if we want to run the scraping job periodically?

For these 'what-ifs', I updated the above code to include the following:

├── pom.xml
├── src
│   └── main
│       └── java
│           └── gt
│               ├── GoodReadsScrapper.java //implementation for GoodReads
│               ├── Quote.java //wrapper class to hold quote data
│               ├── QuoteScrapper.java //base interface
│               ├── ScrapperService.java //a job
│               ├── Source.java //enum to hold sources

The source is available at https://github.com/gtiwari333/java-web-scrapping-jsoup

A bigger web application that uses Spring Boot and Angular is available here: https://github.com/gtiwari333/spring-boot-keycloak-angular-quote-app
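To give a feel for the layout listed in Step 4, here is a rough sketch of how a value class, a source enum, a base interface and a job could fit together. This is not the code from the repository: every name below (ScrapedQuote, QuoteSource, SiteScraper, StubGoodReadsScraper, ScrapeJob) is hypothetical, and the stub returns fixed data instead of calling JSoup so that the sketch runs with the JDK alone.

```java
import java.util.ArrayList;
import java.util.List;

class ScrapeJobDemo {
    public static void main(String[] args) {
        List<ScrapedQuote> quotes = new ScrapeJob(List.of(new StubGoodReadsScraper())).runOnce();
        for (ScrapedQuote q : quotes) {
            System.out.println(q.text + " By: " + q.author + ", Tags: " + q.tags);
        }
    }
}

/** Hypothetical value class for one scraped quote. */
final class ScrapedQuote {
    final String text;
    final String author;
    final List<String> tags;

    ScrapedQuote(String text, String author, List<String> tags) {
        this.text = text;
        this.author = author;
        this.tags = List.copyOf(tags); // defensive, immutable copy
    }
}

/** Hypothetical enum of supported sites. */
enum QuoteSource { GOODREADS }

/** Hypothetical base interface: one implementation per website. */
interface SiteScraper {
    QuoteSource source();

    List<ScrapedQuote> scrape() throws Exception;
}

/** Stand-in implementation that returns fixed data instead of calling JSoup. */
class StubGoodReadsScraper implements SiteScraper {
    @Override
    public QuoteSource source() {
        return QuoteSource.GOODREADS;
    }

    @Override
    public List<ScrapedQuote> scrape() {
        return List.of(new ScrapedQuote("Be yourself.", "Unknown", List.of("life")));
    }
}

/** Hypothetical job: runs every registered scraper and collects the results. */
class ScrapeJob {
    private final List<SiteScraper> scrapers;

    ScrapeJob(List<SiteScraper> scrapers) {
        this.scrapers = scrapers;
    }

    List<ScrapedQuote> runOnce() {
        List<ScrapedQuote> all = new ArrayList<>();
        for (SiteScraper scraper : scrapers) {
            try {
                all.addAll(scraper.scrape());
            } catch (Exception e) {
                // A failure on one site should not stop the others.
                System.err.println("Scrape failed for " + scraper.source() + ": " + e.getMessage());
            }
        }
        return all;
    }
}
```

For the periodic part, a job like this could be handed to a java.util.concurrent.ScheduledExecutorService through scheduleAtFixedRate, and a database writer could consume the list that runOnce returns.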

Gradle - exclude a module in a multi-module nested project

Suppose you have a multi-module Gradle project with a nested structure, e.g. module A depends on B, B depends on C, and so on, and you want to exclude module C from A.

Here's how you can exclude the transitive module prj-C, which comes in through prj-B, at Project A:

build.gradle -- at Project A

dependencies {
    implementation(project(':prj-B')) { // note the parentheses
        exclude group: 'com.gt', module: 'prj-C'
    }

    // other dependencies
}

Bonus:

If you want to exclude a transitive dependency (an external library rather than a project module), you can do the following:

dependencies {
    implementation('com.gt:libraryA') { // the parentheses are needed here too
        exclude group: 'com.gt', module: 'libraryB'
    }

    // other dependencies
}

Java Sort Map by Value

The following snippet works for any key type and for any value type that implements Comparable:

static <K, V extends Comparable<? super V>> Map<K, V> sortByValue(Map<K, V> map) {
  return map.entrySet()
    .stream()
    .sorted(comparingByValue())
    .collect(toMap(Map.Entry::getKey, Map.Entry::getValue, (e1, e2) -> e1, LinkedHashMap::new));
}

Reverse Order

static <K, V extends Comparable<? super V>> Map<K, V> sortByValue(Map<K, V> map) {
  return map.entrySet()
    .stream()
    .sorted(comparingByValue(reverseOrder()))
    .collect(toMap(Map.Entry::getKey, Map.Entry::getValue, (e1, e2) -> e1, LinkedHashMap::new));
}

Complete Code with test!

import java.util.*;

import static java.lang.System.out;
import static java.util.Collections.reverseOrder;
import static java.util.Map.Entry.comparingByValue;
import static java.util.stream.Collectors.toMap;

public class MapUtil {

    public static <K, V extends Comparable<? super V>> Map<K, V> sortByValue(Map<K, V> map) {
        return map.entrySet()
            .stream()
            .sorted(comparingByValue())
            .collect(toMap(Map.Entry::getKey, Map.Entry::getValue, (e1, e2) -> e1, LinkedHashMap::new));
    }

    public static <K, V extends Comparable<? super V>> Map<K, V> sortByValueReverseOrder(Map<K, V> map) {
        return map.entrySet()
            .stream()
            .sorted(comparingByValue(reverseOrder()))
            .collect(toMap(Map.Entry::getKey, Map.Entry::getValue, (e1, e2) -> e1, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        Map<Integer, String> unsortedMap = Map.of(
            1, "G",
            2, "J",
            3, "A",
            4, "Y",
            5, "N",
            6, "O");

        out.println(sortByValue(unsortedMap));
        out.println(sortByValueReverseOrder(unsortedMap));
    }

}
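As a further illustration with different types, here is a small sketch that sorts word counts (String keys, Integer values) and picks the direction with a flag. WordCountSortDemo and the descending parameter are my own additions for this example. It reverses the comparator with Comparator.reversed() instead of reverseOrder(), and it uses a LinkedHashMap as input because Map.of does not guarantee an iteration order, which would make the order of equal values unpredictable.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class WordCountSortDemo {

    // Same stream-and-collect technique as above, with the direction chosen by a flag.
    static <K, V extends Comparable<? super V>> Map<K, V> sortByValue(Map<K, V> map, boolean descending) {
        Comparator<Map.Entry<K, V>> order = Map.Entry.comparingByValue();
        if (descending) {
            order = order.reversed();
        }
        return map.entrySet()
            .stream()
            .sorted(order)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, (e1, e2) -> e1, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("java", 3);
        counts.put("gradle", 1);
        counts.put("maven", 2);
        counts.put("jsoup", 1);

        System.out.println(sortByValue(counts, false)); // {gradle=1, jsoup=1, maven=2, java=3}
        System.out.println(sortByValue(counts, true));  // {java=3, maven=2, gradle=1, jsoup=1}
    }
}
```

Because the sort is stable on an ordered stream, entries with equal values ("gradle" and "jsoup" here) keep the order they had in the input map.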

JavaCV Configuration in Windows

I published an article on how to configure JavaCV on a Windows machine about 5 years ago. Since then, a lot has changed in JavaCV:

The repository host, Google Code, shut down its services.
The JavaCV team moved to GitHub with a different package name. They have replaced the "com.googlecode.javacv.cpp." package with "org.bytedeco.opencv." or "org.bytedeco.javacv".
They (and probably OpenCV too) moved some classes around. For example, the static method cvSaveImage is now org.bytedeco.opencv.helper.opencv_imgcodecs.cvSaveImage. It used to be com.googlecode.javacv.cpp.opencv_highgui.cvSaveImage.
Finally, the good news is that the installation/setup steps are easier than before. All the native library files (dll, so) are now wrapped into platform-specific jar files, so we don't need to install and configure the OpenCV binaries separately.

Setup Steps:

1) Install the JDK on your system.

You can choose between 3 options:

OpenJDK http://openjdk.java.net/install/ or
Oracle JDK (formerly Sun JDK) http://www.oracle.com/technetwork/java/javase/downloads/ or
IBM JDK http://www.ibm.com/developerworks/java/jdk/

2) Download the JavaCV binaries.

2.a) Manually:

Download from Github release: https://github.com/bytedeco/javacv/releases

2.b) Automatically - Using Maven (Recommended)

<dependency>
    <groupId>org.bytedeco</groupId>
    <artifactId>javacv-platform</artifactId>
    <version>1.5.3</version>
</dependency>

3) Project Setup:

3.a) Basic project using Eclipse, NetBeans, IntelliJ or another IDE
Extract the JavaCV binaries and add all the jars to your classpath.

3.b) Maven Project Setup:
If you want to use Maven, add the dependency from 2.b) to your pom.xml file. A sample project is already available on GitHub. Download it and import it into your IDE. It contains sample code to capture images from a webcam.

GitHub Sample Project URL: https://github.com/gtiwari333/JavaCV-Test-Capture-WebCam

Sample Code to Capture Images from WebCam:

Happy Coding ...

Solve - Hyper-V not compatible with VMware Player

I had enabled Hyper-V when installing Docker on my system, but it seems it needs to be disabled for VMware to run smoothly.

This is the error that I got from VMware:

Hyper-V not compatible - VMWare Error

Here's how I solved it:

1) Run the command prompt as Administrator
2) Run the following command to disable the hypervisor launch, then restart the machine for the change to take effect
C:\>bcdedit /set hypervisorlaunchtype off

If you want to enable it again to run Docker, use the following command and restart again.
C:\>bcdedit /set hypervisorlaunchtype auto

MySQL - get the full name from first, middle and last name

Use CONCAT_WS to build the full name from the first, middle and last names.

The query also handles the case where mid_name is NULL, empty, or contains only spaces!

SELECT id, email,
    CASE WHEN mid_name IS NULL OR TRIM(mid_name) ='' THEN
        CONCAT_WS( " ", first_name, last_name )
    ELSE
        CONCAT_WS( " ", first_name, mid_name, last_name )
    END AS full_name
FROM USER;
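If the name is assembled in application code rather than in SQL, the same rule can be sketched in Java. This is only an illustration under my own assumptions: FullNameDemo and fullName are hypothetical names, and unlike the query above the sketch also skips a NULL or blank first or last name and trims each part.

```java
import java.util.Objects;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class FullNameDemo {

    /** Joins the non-blank name parts with a single space, skipping null and blank parts. */
    static String fullName(String firstName, String midName, String lastName) {
        return Stream.of(firstName, midName, lastName)
            .filter(Objects::nonNull)
            .map(String::trim)
            .filter(part -> !part.isEmpty())
            .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(fullName("Ada", null, "Lovelace"));   // Ada Lovelace
        System.out.println(fullName("Ada", "   ", "Lovelace"));  // Ada Lovelace
        System.out.println(fullName("Ada", "King", "Lovelace")); // Ada King Lovelace
    }
}
```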

JPA EntityManager get Session Object in Hibernate

How to get Session object from JPA EntityManager

With Hibernate as the JPA 2.0 implementation, you can do the following:

Steps:

//1. Inject/Autowire the EntityManager
@Inject
private EntityManager entityManager;    //javax.persistence.EntityManager

//2. Get the Session object
Session session = entityManager.unwrap(Session.class);  //org.hibernate.Session

Hibernate Create Update Delete child objects - Best way

AngularJS Download File From Server - Best way with Java Spring Backend

Here's how you can download a file from server using AngularJS.

In this example, the client makes an API call to the Java server at /api/download/{id}, and the server responds with the Base64-encoded content of the file with the given id.

Below is a snippet from working code; it should be fairly self-explanatory.
This approach lets you download any type of file.

AngularJS controller method:

function downloadReportFile(fileId) {
    Download.downloadQueuedReport({id: fileId}, function (response) {

        var anchor = angular.element('<a/>');
        anchor.attr({
            href: 'data:application/octet-stream;base64,' + response.data,
            target: '_self',
            download: response.headers.filename
        });

        angular.element(document.body).append(anchor);
        anchor[0].click();

    });
}

AngularJS service to do the API call:

'downloadQueuedReport': {
    method: 'GET',
    url: 'api/download/:id',
    params: {id: '@id'},
    transformResponse: function (data, headers) {
        var response = {};
        response.data = data;
        // take note of headers() call
        response.headers = headers();
        return response;
    }
},

Spring Powered Backend REST API

@RequestMapping(value = "api/download/{id}",
    method = RequestMethod.GET,
    produces = MediaType.APPLICATION_OCTET_STREAM_VALUE)

 public ResponseEntity<byte[]> downloadReportFile(@PathVariable Long id) {
    log.debug("REST request to download report file");

    File file = getReportFile(id); // a method that returns file for given ID
    if (!file.exists()) { // handle file not found
        return ResponseEntity.status(HttpStatus.NOT_FOUND).body(null);
    }

    try {
        FileSystemResource fileResource = new FileSystemResource(file);

        byte[] base64Bytes = Base64.encodeBase64(IOUtils.toByteArray(fileResource.getInputStream()));

        HttpHeaders headers = new HttpHeaders();
        headers.add("filename", fileResource.getFilename());

        return ResponseEntity.ok().headers(headers).body(base64Bytes);
    } catch (IOException e) {
        log.error("Failed to download file ", e);
        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(null);
    }

}
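The endpoint above relies on Base64.encodeBase64 and IOUtils from external libraries. As a minimal sketch of the encoding step on its own, the JDK's java.util.Base64 can do the same job. Base64DownloadDemo, toBase64 and toDataUri are hypothetical names for this example; the data URI prefix matches the one used in the AngularJS controller.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64DownloadDemo {

    /** Encodes raw file bytes as Base64 using the JDK encoder. */
    static byte[] toBase64(byte[] fileBytes) {
        return Base64.getEncoder().encode(fileBytes);
    }

    /** Builds the data URI that the client assigns to the anchor's href. */
    static String toDataUri(byte[] base64Bytes) {
        return "data:application/octet-stream;base64," + new String(base64Bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] fileBytes = "hello".getBytes(StandardCharsets.UTF_8); // stand-in for the file content
        byte[] encoded = toBase64(fileBytes);
        System.out.println(toDataUri(encoded)); // data:application/octet-stream;base64,aGVsbG8=
    }
}
```

Keep in mind that Base64 makes the payload roughly a third larger than the raw file, so this approach suits small to medium files best.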

How BrowserSync actually works?

BrowserSync starts a small Node.js server that injects a script (shown below) into the web page it is monitoring. The script uses WebSockets to communicate between the server and the client and to watch for changes to the code or for browser actions. As soon as BrowserSync detects an action (either in one of the browsers or in the code on the server), it performs a page reload.

<body>
<script id="__bs_script__">
//<![CDATA[

document.write("<script async src='/browser-sync/browser-sync-client.2.11.2.js'> <\/script>".replace("HOST", location.hostname));

//]]>
</script>
...
...

If you’re already using a local web server or need to connect to a live website, you can start BrowserSync as a proxy server. See how to do this.

Articles related to BrowserSync /Grunt configuration:

BrowserSync local server proxy configuration

Integrate BrowserSync with an existing local server:

In this example, I will show how to configure the BrowserSync Grunt task with your existing web app that is running on a local server.

If you want to know the details of how to configure the BrowserSync and Watch tasks on Grunt, please visit my previous post: BrowserSync Grunt configuration - Multi browser Live Reload

Pros and Cons of BrowserSync with LiveReload

BrowserSync vs LiveReload productivity boosters comparison

The configuration is simple: you just need to tell BrowserSync the URL of your local server.

options: {
         proxy: "local.server-URL"
       }

The final Gruntfile.js file (the full configuration is already described in my earlier blog post, BrowserSync Grunt configuration - Multi browser Live Reload):

   
module.exports = function(grunt) {
  // Task configuration will go here
  grunt.initConfig({
   watch: {
  
      },
   browserSync: {
       bsFiles: {
         src: [
           "css/*.css", "js/*.js", "./*.html" // files to watch
         ]
       },
       options: {
         proxy: "local.server-URL" // NEEDS TO BE CONFIGURED
       }
   }
  });
  
  
  // Load tasks dependencies
  grunt.loadNpmTasks('grunt-contrib-watch');
  grunt.loadNpmTasks('grunt-browser-sync');
  
  // Setup default task
  // both browserSync and watch will run when running >grunt command
  grunt.registerTask('default', ['browserSync', 'watch']);

};
