java read huge xml file and convert to csv

SAX parser uses event handler org.xml.sax.helpers.DefaultHandler to efficiently parse and handle the intermediate results of an XML file.  

It provides the following callback methods, grouped by event, where we can write custom logic to take a specific action on each event:
  • startDocument() and endDocument() – called at the start and end of an XML document.
  • startElement() and endElement() – called at the start and end of a document element.
  • characters() – called with the text content between the start and end tags of an XML document element.
We will use this handler to read a HUGE XML file (6.58GB here; because SAX streams the input, larger files should work too) efficiently and write its contents to a CSV file.
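To make the event order concrete, here is a minimal sketch that records each callback while parsing a tiny in-memory document. The class name SaxEventTrace and the sample XML are assumptions made for illustration; they are not part of the project.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records every SAX callback so the order of events is visible.
class SaxEventTrace extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() { events.add("startDocument"); }

    @Override
    public void endDocument() { events.add("endDocument"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        events.add("start:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver text in several chunks; whitespace-only chunks are skipped here.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            events.add("text:" + text);
        }
    }

    // Parses the given XML string and returns the recorded events.
    static List<String> trace(String xml) {
        SaxEventTrace handler = new SaxEventTrace();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        trace("<books><book id=\"1\"><title>SAX</title></book></books>").forEach(System.out::println);
    }
}
```

Because the handler only sees one event at a time, memory use stays roughly constant no matter how large the input file is.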

I am going to reuse the code from my older blog post xml-parsing-using-saxparser and update it for this purpose. The final code is available in the GitHub project java-read-big-xml-to-csv.


Java HUGE XML to CSV - project structure

How to Import/Run:

It's a simple Maven project (with no dependencies). You can import it into your IDE or use the command line to compile and run it.
To compile and create a runnable jar file from the command line, go to the root of the project and run mvnw clean package.
Then you can run the executable as follows:
java -jar target\xmltocsv-FINAL.jar  C:\folder\input.xml  C:\folder\output.csv

The code:

SaxParseEventHandler 
The SaxParseEventHandler class takes the RecordWriter as a constructor parameter:
public SaxParseEventHandler(RecordWriter<Book> writer) {


We create a new Book record on the startElement event:
public void startElement(String s, String s1, String elementName, Attributes attributes) {
    /* handle start of a new Book tag and attributes of an element */
    if (elementName.equalsIgnoreCase("book")) { // start
        bookTmp = new Book();
        // ... rest of the handling is not shown in this excerpt
    }
}


and we write the parsed book data to the file on the endElement() event:
public void endElement(String s, String s1, String element) {
    if (element.equals("book")) { // end
        writer.write(bookTmp, counter);
        // ... rest of the handling is not shown in this excerpt
    }
}





RecordWriter:
It's a simple wrapper around FileWriter that writes content to a file. We currently write T.toString() to the file and flush after every 10,000 records.
public void write(T t, int n) throws IOException {
    fw.write(t.toString());
    if (n % 10000 == 0) {
        fw.flush();
    }
}
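Since RecordWriter writes whatever toString() returns, the record class is responsible for producing a valid CSV line. Below is a minimal sketch of a helper that a toString() implementation could delegate to. CsvLine is a hypothetical class, not part of the project, and it follows the common convention of quoting fields that contain a comma, a double quote or a line break.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original project.
class CsvLine {

    // Wraps the field in double quotes when it contains a comma, quote, CR or LF,
    // and doubles any embedded double quotes.
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        String escaped = field.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Joins the escaped fields with commas and terminates the line with '\n'.
    static String of(String... fields) {
        return Arrays.stream(fields).map(CsvLine::escape).collect(Collectors.joining(",")) + "\n";
    }

    public static void main(String[] args) {
        System.out.print(of("1", "Java, in Depth", "He said \"hi\""));
    }
}
```

A Book.toString() could then return something like CsvLine.of(id, title, author), assuming those fields exist on Book.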

Main:
It's the main 'launcher' class:
SAXParserFactory factory = SAXParserFactory.newInstance();
try (RecordWriter<Book> w = new RecordWriter<>(outputCSV)) {
    SAXParser parser = factory.newSAXParser();
    parser.parse(inputXml, new SaxParseEventHandler(w));
}
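The pieces above can be put together in a self-contained sketch that works on an in-memory string instead of a file. The element names title and author are assumptions about the XML layout, and the values are assumed to contain no commas; the real project uses its own Book class and RecordWriter.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical end-to-end sketch: one CSV line per <book> element.
class BooksToCsv extends DefaultHandler {
    private final StringBuilder csv = new StringBuilder();
    private final StringBuilder text = new StringBuilder();
    private String title = "";
    private String author = "";

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (qName.equals("book")) { // start of a new record
            title = "";
            author = "";
        }
        text.setLength(0); // reset the text buffer for the element that just opened
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text can arrive in several chunks, so it is accumulated until the end tag.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        switch (qName) {
            case "title":
                title = text.toString().trim();
                break;
            case "author":
                author = text.toString().trim();
                break;
            case "book": // end of the record: emit one CSV line
                csv.append(title).append(',').append(author).append('\n');
                break;
            default:
                break;
        }
    }

    static String convert(String xml) {
        BooksToCsv handler = new BooksToCsv();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return handler.csv.toString();
    }

    public static void main(String[] args) {
        System.out.print(convert("<books><book><title>A</title><author>B</author></book></books>"));
    }
}
```

For a multi-gigabyte input, the StringBuilder holding the CSV would be replaced by a writer that flushes to disk, as RecordWriter does.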






Results at 16GB RAM, Core i5, 6MB L3 cache, SSD | Windows Machine
Max RAM usage: 190MB
Time Taken:
For the file big2.xml with size 118MB
- JDK8 - 8-9 sec
- JDK 11 - 6-7 sec
- JDK 14 - 5 sec 

big3.xml with size 6.58GB takes about 2 minutes


Next Steps: create a binary using GraalVM. I will keep posting !!

Java Compress/Decompress String/Data

Java provides the Deflater class for general purpose compression using the ZLIB compression library. It also provides the DeflaterOutputStream which uses the Deflater class to filter a stream of data by compressing (deflating) it and then writing the compressed data to another output stream. There are equivalent Inflater and InflaterOutputStream classes to handle the decompression.

Compression


Here is an example of how to use the DeflaterOutputStream to compress a byte array.
static byte[] compressBArray(byte[] bArray) throws IOException {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(os)) {
            dos.write(bArray);
        }
        return os.toByteArray();
}

Let's test:

byte[] input = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"        .getBytes();
byte[] op = CompressionUtil.compressBArray(input);
System.out.println("original data length " + input.length +
        ",  compressed data length " + op.length);

This prints 'original data length 71,  compressed data length 12'.

Decompression

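Here is an example of how to use the InflaterOutputStream to decompress a byte array.

public static byte[] decompress(byte[] compressedTxt) throws IOException {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (OutputStream ios = new InflaterOutputStream(os)) {
            ios.write(compressedTxt);
        }
        return os.toByteArray();
}
Passing the compressed bytes from the previous test through this method and converting the result to a String gives back the original 'input' string.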


Let's convert the byte[] to Base64 to make it portable

In the above examples we get the compressed data as a byte array (byte[]), which is raw binary data.

But we might want to store or transmit the compressed data in a text file, JSON, or a database, right? To do that, we can convert it to a Base64 string and back using the following:

byte[] compressed = {}; // the compressed byte array
String b64Compressed = Base64.getEncoder().encodeToString(compressed);
// on the receiving side: decode the Base64 text back to the compressed bytes
byte[] decoded = Base64.getDecoder().decode(b64Compressed);
// then decompress, and convert to the original string if the input was a string
new String(decompress(decoded), StandardCharsets.UTF_8);

Here's the complete code and the test cases

package compress;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterOutputStream;

public class CompressionUtil {

    public static String compressAndReturnB64(String text) throws IOException {
        return new String(Base64.getEncoder().encode(compress(text)));
    }

    public static String decompressB64(String b64Compressed) throws IOException {
        byte[] decompressedBArray = decompress(Base64.getDecoder().decode(b64Compressed));
        return new String(decompressedBArray, StandardCharsets.UTF_8);
    }

    public static byte[] compress(String text) throws IOException {
        // use an explicit charset so it matches the UTF-8 decoding in decompressB64
        return compress(text.getBytes(StandardCharsets.UTF_8));
    }

    public static byte[] compress(byte[] bArray) throws IOException {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(os)) {
            dos.write(bArray);
        }
        return os.toByteArray();
    }

    public static byte[] decompress(byte[] compressedTxt) throws IOException {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (OutputStream ios = new InflaterOutputStream(os)) {
            ios.write(compressedTxt);
        }
        return os.toByteArray();
    }

}

Test case:

package compress;

import org.junit.jupiter.api.Test;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class CompressionTest {

    String testStr = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";

    @Test
    void compressByte() throws IOException {
        byte[] input = testStr.getBytes();
        byte[] op = CompressionUtil.compress(input);
        System.out.println("original data length " + input.length + ",  compressed data length " + op.length);
        byte[] org = CompressionUtil.decompress(op);
        System.out.println(org.length);
        System.out.println(new String(org, StandardCharsets.UTF_8));
    }

    @Test
    void compress() throws IOException {

        String op = CompressionUtil.compressAndReturnB64(testStr);
        System.out.println("Compressed data b64: " + op);
        String org = CompressionUtil.decompressB64(op);
        System.out.println("Original text: " + org);
    }

}


Note: Since the compress and decompress methods operate on byte[], we can compress/decompress any data that can be converted to bytes.
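As an illustration of that note, here is a minimal sketch that compresses a long[] by first packing it into bytes with ByteBuffer. The class LongArrayCompression is hypothetical, and it wraps IOException in UncheckedIOException only to keep the example short.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterOutputStream;

// Hypothetical example: the same Deflater/Inflater streams applied to non-text data.
class LongArrayCompression {

    static byte[] compress(long[] values) {
        // Pack each long into 8 bytes (big-endian, the ByteBuffer default).
        ByteBuffer buf = ByteBuffer.allocate(values.length * Long.BYTES);
        for (long v : values) {
            buf.putLong(v);
        }
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(os)) {
            dos.write(buf.array());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return os.toByteArray();
    }

    static long[] decompress(byte[] compressed) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (InflaterOutputStream ios = new InflaterOutputStream(os)) {
            ios.write(compressed);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Unpack the bytes back into longs.
        ByteBuffer buf = ByteBuffer.wrap(os.toByteArray());
        long[] values = new long[buf.remaining() / Long.BYTES];
        for (int i = 0; i < values.length; i++) {
            values[i] = buf.getLong();
        }
        return values;
    }

    public static void main(String[] args) {
        long[] data = new long[1000]; // all zeros, so it compresses very well
        byte[] packed = compress(data);
        System.out.println("raw bytes " + data.length * Long.BYTES + ", compressed bytes " + packed.length);
    }
}
```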

Game of Thrones Style farewell email to coworkers

I left my job for another opportunity after working there for 8.5 years. Since I had nothing left to do on my last day, I decided to get a little creative and looked up online how to write a farewell email in Game of Thrones style. With a little help from the internet, I came up with the following email.


My lords and ladies,

It has been a great honor serving this most noble software development house. I fought for this house with all my heart for 8.5 years and we won some glorious battles together. Songs will be sung for the next thousand years about our great victories over White Walkers(Bugs and Issues).

It’s a bittersweet ending for me to part ways and go North of the Wall now(NEW_COMPANY is north from OLD_COMPANY).

Whatever I do in the future, I now take this pledge in the sights of Old Gods and the New, that I will forever cherish the time I spent here and I hope you Lords and Ladies do the same.

My Watch has ended but don’t forget to send me a raven from time to time on gtiwari333@gmail.com/ 203-XXX-XXXX. You can also find me wandering around Citadel at https://www.linkedin.com/in/gtiwari333 

Winter is coming 😝 in 5 months. Just ended though (ITS MINNESOTA).

--

Ganesh Tiwari | A Crow Member

Project to test your programming skills


A Guessing Game - to become Full Scope/Stack Developer

If you are wondering what would be a perfect project to practice your programming skills, you are in the right place!
It's a simple number guessing game. We start with a console app and migrate to a web app with lots of features.

Steps

1. Console App:

  • Read a number N from the console within a range (MIN, MAX). Your code should then generate a random number CN within the same range (MIN, MAX) and check whether the computer-generated number CN matches the user-entered number N.
    If it matches, the user wins. If it doesn't match, the computer wins.
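One possible starting point for this step is sketched below. It assumes the range is inclusive on both ends; the class and method names are made up for illustration.

```java
import java.util.Random;
import java.util.Scanner;

// Hypothetical starting point for the console version of the game.
class GuessingGame {

    // Returns a random number in the inclusive range [min, max].
    static int computerNumber(int min, int max, Random rng) {
        return min + rng.nextInt(max - min + 1);
    }

    // The user wins only when the guess matches the computer's number.
    static String winner(int userGuess, int computerNumber) {
        return userGuess == computerNumber ? "USER" : "COMPUTER";
    }

    public static void main(String[] args) {
        int min = 1;
        int max = 10; // example range
        Scanner in = new Scanner(System.in);
        System.out.printf("Guess a number between %d and %d: ", min, max);
        int n = in.nextInt();
        int cn = computerNumber(min, max, new Random());
        System.out.println("Computer picked " + cn + ". Winner: " + winner(n, cn));
    }
}
```

Keeping the game logic in small static methods, separate from the console I/O, makes it easier to reuse in the later GUI and web steps.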

2. (Optional) Desktop app:

  • Create an interface to enter the MIN/ MAX number and the user guess.
    MIN :        [ Textbox  ]
    MAX:         [ Textbox  ]
    User Guess   [ Textbox  ]
    Also provide a button with label "Play" that generates the random number and displays a message in Label if the user won.
    [ Play  ]
  • Save the win/loss counts and winning/losing number in a text file 'stat.csv'. Display the average win/loss on UI when the user closes the app.
    eg:
    DATE_TIME, WINNER, WINNING_NUM, LOSING_NUM
    2020-05-10 20:01:50, USER, 5, 10
    2020-05-10 20:02:50, USER, 3, 4
    2020-05-11 20:05:50, COMPUTER, 7, 9
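A minimal sketch of the statistics part follows, using the column order from the sample above. It interprets "average win/loss" as the fraction of games won by the user, which is an assumption; GameStats and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for the stat.csv format shown above.
class GameStats {

    // Formats one record as DATE_TIME, WINNER, WINNING_NUM, LOSING_NUM.
    static String toCsvRow(String dateTime, String winner, int winningNum, int losingNum) {
        return dateTime + ", " + winner + ", " + winningNum + ", " + losingNum;
    }

    // Fraction of games won by USER; the header row and blank lines are skipped.
    static double userWinRate(List<String> lines) {
        int games = 0;
        int userWins = 0;
        for (String line : lines) {
            if (line.trim().isEmpty() || line.startsWith("DATE_TIME")) {
                continue;
            }
            String[] cols = line.split(",");
            games++;
            if (cols[1].trim().equals("USER")) {
                userWins++;
            }
        }
        return games == 0 ? 0.0 : (double) userWins / games;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "DATE_TIME, WINNER, WINNING_NUM, LOSING_NUM",
                toCsvRow("2020-05-10 20:01:50", "USER", 5, 10),
                toCsvRow("2020-05-11 20:05:50", "COMPUTER", 7, 9));
        System.out.println("User win rate: " + userWinRate(lines));
    }
}
```

In the real app the lines would come from the file, for example via Files.readAllLines, and new rows would be appended after each game.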

3. Two-player:

  • Update the GUI or Console application to allow two users to play with the computer. Both of the users can enter their guess and click Play. The user that made the correct guess will win
    GUI mockup:

    User A's Guess   [  Text Box  ]
    
    User B's Guess   [  Text Box  ]

                     [    PLAY    ]

4. (Optional) Multi-Player Game - over socket connection:

  • Update the GUI application to allow several users to play simultaneously. All the users will have a copy of the application and can join the Game by running the application on their computer. The first user to start the game can act as a Server.

5. Web Application:

  • Create a web application to play the same game in the browser. Reuse the previous code on the backend
  • Support single-player mode (play with the computer).
  • Add a sign-up page to register users. Update the logic to allow only registered, logged-in users to play. Use reCAPTCHA to prevent bots from making requests.
  • Don't let users play for more than 1 hour. Once they hit the limit, lock them out for 2 hours.
  • Multi-player: list online users and provide the ability to request/accept to play with the user. Use WebSocket to listen for updates in real time.
  • Store the win/loss statistics in the DB.
  • Generate a CSV report with stats about the winners, numbers, etc. that you can download from the web interface.

 6. Fun Stuff

  • Schedule the 'winner stat' report to run every day and deliver it to your email address.
  • Set up a background job that sends an account deactivation email if the user has not logged in during the last 20 days
  • Set up a background job to deactivate the user if they have not logged in during the last 30 days
  • Set up a public web API to expose information about the winners
  • Use caching to read user profile from the cache instead of reading from DB on every request

7. Operation:

  • Set up a Dockerfile to run your app in Docker
  • Set up static code analysis with a local SonarQube instance. You can use Docker to run SonarQube. Take care of SonarQube warnings.
  • Deploy your app in a cloud environment (eg Heroku, AWS, Azure)

 Note:

  • Focus on readability, reusability throughout the development.
  • Try to make your app modular
  • Use the build system
  • Use git

 

Want to update this?

Please submit a PR at  https://github.com/GT-Corp/myths-and-facts-about-programming/blob/master/full-scope-developer.md

Myths and Facts About Programming

Myths and Facts About Programming - Stuff that I wish I knew in my early career


What's this?

A collection of common myths and facts (opinionated) about computer programming that I wish I knew in my early career.

Programming requires math

  • Neutral.
  • Only a small percentage of programmers deal with math problems in their careers.
  • Analytical skills help to break down the problem. Think of programming as understanding the problem, breaking it down into smaller steps, and solving it. Similar to math, right?
  • However, people who are bad at math can still be good programmers. It also depends on the type of role and the type of problem they are trying to solve.

A programming job is similar to a typist's. It's all about typing code.

  • False
  • Programming(at entry level) is about:
    • reading documentation and requirements
    • documenting stuff
    • thinking how to write code
    • writing code
    • testing
    • debugging bugs
    • deploying
    • discussing with team members/management
  • The amount of time you spend typing code depends on your role and job description. There will be days when you won't type any code.
  • The majority of programming jobs involve maintaining an existing system written over the years by several people. You will be required to add features, customize, fix bugs, etc.

You won't require a college degree to be a programmer

Everyone can learn and be a programmer within months

Programming is really hard

  • Neutral.
  • It depends on the individual, their learning/intellectual capability, and the type of programming role they learn/get into.
  • There will be certain things you can learn easily. But a college degree will help to broaden your perspective and learn things quickly.

Programming is monotonous. It's like working on the assembly line at a factory

  • False
  • On certain days, or after working in the same role for a long time, you may feel that your job is monotonous.
  • But it's not like working on an assembly line. It requires lots of thinking and analysis.

Programming is not for girls

  • False

You need to keep reading new stuff throughout your career

  • Neutral
  • You don't "need to". But learning new stuff helps advance your career.
  • Also, it depends on the type of tools and technologies you are using. Some tools/technologies (eg: JS frameworks) get deprecated every few years.
  • Learning a new paradigm, best practices, new architecture concepts is always useful.

Machine Learning and AI seems easy to learn.

I don't have any knowledge of statistics/probability/modeling. However, the ML/AI tutorial I found online is just 10 lines of code and it seems easy.

  • False
  • It may seem easy to use ML/AI tools created by somebody else or to follow a cookbook. But you will need to understand many concepts to use those tools when solving real problems. Don't be fooled by simple tutorials. Start with the basics and then dig into the tools.

Using long variable names makes the program slow. So I should program like this:

int a = read()
int b = 1000
if(a > 18 && b > 50)
    println("Entry allowed")


  • False
  • With compiled languages, no. With interpreted languages, possibly but the difference would be negligible.
  • Always focus on readability. Compare the above code with the following:
int age = readMemberAge()
int balance = 1000
if(age > 18 && balance > 50)
    println("Entry allowed")

I have to learn as many programming languages eg C, Python, Java, Ruby, Kotlin, Scala, Groovy, C#, Go to be a good programmer.

  • False
  • Think of a programming language as a natural language, eg: Nepali, French, English, Japanese, and Chinese, and of the art of writing a novel or poem as the actual programming. If you have mastered five languages but cannot write a (good) poem in any of them, you are still not an artist.
  • Think of programming as art. Try to be an artist in at least one language. Think of a hobby project and develop with paying attention to code quality, performance, UI, features, etc.
  • Focus on learning programming rather than learning a language.
    • Programming is a skill that you can gain with just one language. If you know how to do X in Y language then you can do it in Z language too with little effort.

HackerRank, LeetCode will guarantee me a job

  • False
  • There's no doubt that the questions on those sites help you think critically and solve a problem.
  • It's a widely used screening method to filter out candidates these days.
  • Pet project(s) and your college projects will also help you land the first job.

Google, Amazon and Facebook are using X tool. It must be good so I should learn.

  • False
  • A lot of tools developed by tech giants get deprecated after a couple of years.
  • Look for a tool/language/framework that has been used by a lot of companies for a long time.

X was developed by Google, Amazon, and Facebook so it must be good. I should learn and use it.

  • False
  • There's no guarantee that those tools MUST be good. Don't fall for advertisements
  • Review 100 job descriptions on LinkedIn/Indeed etc. and find out for yourself what's popular in the market

I must learn Angular, React, Vue and XYZ web framework to master my web development skills

  • False
  • It's better to start the web development without the frameworks so that you understand how those frameworks are solving the problems of not using those frameworks
  • You don't need to learn all of these, one would be enough. If you started learning web development without using frameworks, switching between frameworks would be easier.

I know X1 framework/library/tool. But the job vacancy mentions X2 (the alternative to X1). I should not apply for this job.

  • False
  • If you know X1 framework/library/tool, test yourself to see how long it takes you to learn X2.
  • As long as you know the abstract concepts and have worked on at least one pet/professional project on your own, there's a high chance that you can learn another framework/library/tool quickly. They are all trying to solve a similar problem in slightly different ways.
  • Also look for 'preferred' vs 'required' skills on job vacancies.

Everyone on social media hates language/framework X. X must be bad.

  • False
  • Don't fall for people's 'opinions'. People treat languages/frameworks/tools like a religion. They hate each other.
  • The best way to find out what to learn is to look at job vacancies. At least a hundred of them.

Language X does that in one line. So, it is the best language.

  • Neutral
 DB.allRecords().read().toCsv("file.csv");
  • It's nice that they provided that functionality in one line out of the box. But there is a great deal of code hidden behind the scene.
  • All languages support creating library modules to extend the feature. Some languages are by nature too abstract/low level and it requires developers to write libraries around it to make things simpler.
  • So, that doesn't mean language X is best.

Want to add more Q/A or correct something?

Please submit a Pull Request at https://github.com/GT-Corp/myths-and-facts-about-programming/blob/master/README.md


AWS Java SDK - automatically detect the region

When the app is deployed in multiple AWS regions, it's useful to detect the region automatically instead of specifying it ourselves through a property or environment variable.


We can detect the region by using the AWS SDK:

    Regions.getCurrentRegion(); //returns a Region object (null when not running on EC2)

Or by using:

    EC2MetadataUtils.getEC2InstanceRegion(); //returns region String

Or:

    System.getenv("AWS_REGION")
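The AWS_REGION variable is not guaranteed to be set in every environment, so it can help to wrap the lookup with a fallback. The sketch below uses only the JDK; RegionResolver and the fallback value are assumptions for illustration, and in a real app the SDK calls above could serve as additional fallbacks.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: reads AWS_REGION from an environment map, with a fallback.
class RegionResolver {

    static String resolve(Map<String, String> env, String fallback) {
        return Optional.ofNullable(env.get("AWS_REGION"))
                .map(String::trim)
                .filter(value -> !value.isEmpty()) // treat blank values as "not set"
                .orElse(fallback);
    }

    public static void main(String[] args) {
        // "us-east-1" is only an example fallback.
        System.out.println(resolve(System.getenv(), "us-east-1"));
    }
}
```

Passing the environment in as a Map keeps the method easy to test without touching real environment variables.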

AWS DynamoDB - dynamic table prefix using DynamoDBMapper


We can use DynamoDBMapperConfig.TableNameOverride to configure the DynamoDBMapper and provide a custom/dynamic table name prefix using TableNameOverride.withTableNamePrefix(String).


Plain Java Example:

import com.amazonaws.services.dynamodbv2.*;
import com.amazonaws.services.dynamodbv2.datamodeling.*;

import java.util.UUID;

//code:

String prefix = "SOME_DYNAMIC_PREFIX"; //can be pulled from a dynamic logic eg: profile, env variable etc
var mapperConfig = new DynamoDBMapperConfig.Builder()
.withTableNameOverride(DynamoDBMapperConfig.TableNameOverride.withTableNamePrefix(prefix + "-"))
.build();

var dynamoDB = AmazonDynamoDBClientBuilder.standard().build();
var dbMapper = new DynamoDBMapper(dynamoDB, mapperConfig);


// use it
dbMapper.load(MyTable.class, UUID.randomUUID());
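To see what the override does to the table name, here is a plain-Java sketch of the naming rule: the prefix is simply prepended to the tableName declared on the mapped class. TablePrefix and the APP_ENV variable are hypothetical; only the "prefix + '-'" convention comes from the example above.

```java
import java.util.Map;

// Hypothetical illustration of how a table name prefix composes with the declared table name.
class TablePrefix {

    // Picks the prefix from an environment map; APP_ENV is an assumed variable name.
    static String fromEnv(Map<String, String> env, String defaultPrefix) {
        String value = env.get("APP_ENV");
        return (value == null || value.trim().isEmpty()) ? defaultPrefix : value.trim();
    }

    // Mirrors withTableNamePrefix(prefix + "-"): the result is prefix, a dash, then the table name.
    static String physicalName(String prefix, String tableName) {
        return prefix + "-" + tableName;
    }

    public static void main(String[] args) {
        String prefix = fromEnv(System.getenv(), "local");
        System.out.println(physicalName(prefix, "person"));
    }
}
```

With a prefix of "dev" and a class mapped to the table "person", the mapper would therefore read from and write to "dev-person".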

Spring DynamoDB dynamic table prefix example



import com.amazonaws.services.dynamodbv2.*;
import com.amazonaws.services.dynamodbv2.datamodeling.*;
import org.springframework.context.annotation.*;
import java.util.UUID;

@Configuration
class AwsConfig {
    @Bean
    AmazonDynamoDB dynamoDB() {
        return AmazonDynamoDBClientBuilder.standard().build();
    }

    @Bean
    DynamoDBMapperConfig dynamoDBMapperConfig() {
        String prefix = "SOME_DYNAMIC_PREFIX"; //can be pulled from a dynamic logic eg: profile, env variable etc
        return new DynamoDBMapperConfig.Builder()
                .withTableNameOverride(DynamoDBMapperConfig.TableNameOverride.withTableNamePrefix(prefix + "-"))
                .build();
    }

    @Bean
    DynamoDBMapper dynamoDBMapper(AmazonDynamoDB dynamoDB, DynamoDBMapperConfig dynamoDBMapperConfig) {
        return new DynamoDBMapper(dynamoDB, dynamoDBMapperConfig);
    }
}


import com.amazonaws.services.dynamodbv2.datamodeling.*;
import java.util.UUID;

@DynamoDBTable(tableName = "person")
public class MyTable {
    @DynamoDBHashKey
    @DynamoDBAutoGeneratedKey
    UUID id;

    String name;
    //getter setter/other fields
}

Spring Boot - How to skip caching of Thymeleaf templates, JS, CSS etc. to avoid restarting the server every time

The default template resolver registered by Spring Boot autoconfiguration for Thymeleaf is classpath based, meaning that it loads the templates and other static resources from the compiled resources, i.e. /target/classes/**.



To load the changes to the resources (HTML, js, CSS, etc), we can
  • Restart the application every time- which is of course not a good idea!
  • Recompile the resources using CTRL+F9 on IntelliJ or (CTRL+SHIFT+F9 if you are using eclipse keymap) or simply Right Click and Click Compile
  • Or a better solution as described below !!

Thymeleaf includes a file-system based resolver, which loads the templates directly from the file system rather than through the classpath (compiled resources).

See the snippet from DefaultTemplateResolverConfiguration#defaultTemplateResolver

@Bean
public SpringResourceTemplateResolver defaultTemplateResolver() {
 SpringResourceTemplateResolver resolver = new SpringResourceTemplateResolver();
 resolver.setApplicationContext(this.applicationContext);
 resolver.setPrefix(this.properties.getPrefix());

Where the property prefix defaults to "classpath:/templates/". See the snippet from ThymeleafProperties#DEFAULT_PREFIX:
public static final String DEFAULT_PREFIX = "classpath:/templates/";


The Solution:

Spring Boot allows us to override the property 'spring.thymeleaf.prefix' to point to the source folder 'src/main/resources/templates/' instead of the default "classpath:/templates/" as follows.

In application.yml|properties file:
spring:
    thymeleaf:
        prefix: file:src/main/resources/templates/  #directly serve from src folder instead of target

This tells the runtime not to look into the target/ folder, so you don't need to restart the server every time you update an HTML template in src/main/resources/templates.

What about the JavaScript/CSS files?

You can go further and update 'spring.resources.static-locations' to point to your static resource folder (where you keep JS/CSS, images, etc.):
spring:
    resources:
        static-locations: file:src/main/resources/static/ #directly serve from src folder instead of target
        cache:
            period: 0

The full code:

It's a good practice to use the above configuration during development only. To keep the default configuration for the production system, you can use profiles and define separate behaviour for each environment.

Here's the full code snippets based on what we just described!

Project Structure:

Pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <artifactId>my-sample-app</artifactId>
    <packaging>jar</packaging>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.1.3.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>

    <properties>
        <java.version>11</java.version>
    </properties>

    <dependencies>
        <!-- the basic dependencies as described on the blog -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-thymeleaf</artifactId>
        </dependency>
    </dependencies>

    <build>
        <finalName>${build.profile}-${project.version}-app</finalName>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>

    <profiles>

        <!-- Two profiles -->

        <profile>
            <id>dev</id>
            <activation>
                <activeByDefault>true</activeByDefault>
            </activation>
            <properties>
                <spring.profiles.active>dev</spring.profiles.active>
                <build.profile>dev</build.profile>
            </properties>
        </profile>

        <profile>
            <id>prod</id>
            <properties>
                <spring.profiles.active>prod</spring.profiles.active>
                <build.profile>prod</build.profile>
            </properties>
        </profile>

    </profiles>

</project>

The property files (yml)

application-dev.yml
spring:
    profiles:
        active: dev
    thymeleaf:
        cache: false
        prefix: file:src/main/resources/templates/  #directly serve from src folder instead of target
    resources:
        static-locations: file:src/main/resources/static/ #directly serve from src folder instead of target
        cache:
            period: 0
 
application-prod.yml (doesn't override anything)
spring:
    profiles:
        active: prod



Hope this helps!