
How I wrote a Python wrapper for the Java implementation of VnCoreNLP


While working on my thesis, I had to experiment with different tokenization tools. One of them is VnCoreNLP, which is reported to have very high accuracy. The catch is that it's written in Java while my pipeline is written in Python, so I decided to write a Python wrapper for it. Let's go.

FYI: In case you haven't met the term, calling code written in another programming language is a matter of interoperability. For example, Java can call native functions written in C/C++ through JNI.

The problem

As I said above, it's a matter of how to call the Java implementation from Python. After some consideration, I chose py4j because it is simple to use.

Preparation

Okay, as a rule of thumb, I always create a new virtual environment for any Python project.

Let's say we are in a new folder named tokenizer. To create a virtual environment, run this in the terminal from inside tokenizer.

python3 -m venv .env

Remember to activate it using

source .env/bin/activate

As always, installing a Python package is a piece of cake.

pip install py4j

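And since I'm writing a wrapper for VnCoreNLP, I need VnCoreNLP itself. Clone the VnCoreNLP repo, then place VnCoreNLP-1.1.jar and the models directory of VnCoreNLP in the same working folder.

I keep the library files in a folder named __init__. (Note that a regular Python package is normally a folder that contains an __init__.py file; here __init__ is simply the folder name I used.)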

The working directory structure should look like this now.

tokenizer/
├── .env                        # Virtual environment folder
├── __init__                    # Python library files
├── models                      # VnCoreNLP models
└── VnCoreNLP-1.1.jar           # VnCoreNLP java lib
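
Before going further, it can help to confirm that the jar and the models folder really sit where the layout above says. The following is a minimal sketch using only the Python standard library; the function name missing_resources is made up for illustration and is not part of the original project.

```python
from pathlib import Path

# File and folder names taken from the directory layout above.
JAR_NAME = "VnCoreNLP-1.1.jar"
MODELS_DIR = "models"


def missing_resources(workdir):
    """Return the names of the VnCoreNLP resources not found in workdir."""
    root = Path(workdir)
    missing = []
    if not (root / JAR_NAME).is_file():
        missing.append(JAR_NAME)
    if not (root / MODELS_DIR).is_dir():
        missing.append(MODELS_DIR)
    return missing
```

Calling missing_resources(".") from inside tokenizer should return an empty list once both items are in place.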

Let’s do it

Create a new Python file in the __init__ folder; I named mine vncorenlp.py.

In vncorenlp.py, import JavaGateway.

from py4j.java_gateway import JavaGateway

JavaGateway is py4j's main entry point (it is not part of the Python standard library), and you need it whenever you want to talk to a JVM from Python.

Now, let's construct a class for the wrapper. I won't go into every detail; the full source code is linked at the end of this post.

The most important part of this wrapper is getting a JVM running with the Java classes loaded. py4j makes this easy, although I did not find it spelled out anywhere, so I had to figure out the call myself.

gateway = JavaGateway.launch_gateway(classpath=self.path, die_on_exit=True)

Pass the path of the jar file that contains the Java code to launch_gateway through the classpath argument. Here the path comes in through the class constructor, which is why it is read from self. Setting die_on_exit=True makes the Java process exit when the Python process does.

VnCoreNLP expects an array of strings naming the annotators to run. The thing to note here is how to create a Java string array with py4j.
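
As a rough illustration of the launch step, here is a small sketch that checks the jar path before starting the JVM. The function name launch_vncorenlp_gateway is hypothetical, and checking the path up front is my own addition rather than something the original wrapper is known to do; the launch_gateway call itself is the one shown above.

```python
import os


def launch_vncorenlp_gateway(jar_path):
    """Start a JVM with the given jar on its classpath and return the gateway."""
    jar_path = os.path.abspath(jar_path)
    if not os.path.isfile(jar_path):
        # Fail early with a clear message instead of a Java-side class error.
        raise FileNotFoundError("jar not found: " + jar_path)

    # Imported here so the path check above runs even where py4j is missing.
    from py4j.java_gateway import JavaGateway

    # die_on_exit=True asks py4j to stop the Java process when Python exits.
    return JavaGateway.launch_gateway(classpath=jar_path, die_on_exit=True)
```

With the layout from earlier, gateway = launch_vncorenlp_gateway("VnCoreNLP-1.1.jar") would be the call to make from inside tokenizer; it needs a working Java installation on the machine.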

annotators = gateway.new_array(gateway.jvm.java.lang.String, 1)

The first argument of new_array is the element type of the array; for String, you get it from the java.lang package through gateway.jvm. The remaining arguments give the length of each dimension. Here I create a one-element array.

Assigning an element to the array is as easy as it sounds.

annotators[0] = 'wseg'
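
If you want more than one annotator, the two steps above generalize to a small helper that copies a Python list into a Java String array. This is only a sketch; to_java_string_array is a name I made up, and it relies on nothing beyond new_array and index assignment as used above.

```python
def to_java_string_array(gateway, values):
    """Copy a Python sequence of str into a new Java String[] through py4j."""
    string_class = gateway.jvm.java.lang.String
    # One dimension, sized to hold every value.
    array = gateway.new_array(string_class, len(values))
    for i, value in enumerate(values):
        array[i] = value
    return array
```

For example, to_java_string_array(gateway, ["wseg", "pos"]) would build a two-element array; besides wseg, VnCoreNLP also has the pos, ner and parse annotators.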

Instantiating a Java class requires its fully qualified name, as below.

pipeline = gateway.jvm.vn.pipeline.VnCoreNLP(annotators)
annotation = gateway.jvm.vn.pipeline.Annotation(sentence)

Once you have the Java object, calling one of its methods looks just like a Python method call.

pipeline.annotate(annotation)
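
Putting the pieces together, a wrapper class could look roughly like the sketch below. It is not the class from the full project: the name VnCoreNLPTokenizer is hypothetical, and unlike the version described in this post, which receives the jar path in its constructor, this one takes an already launched gateway so that it is easy to try with a stand-in object. It returns the result of toString(), which every Java object provides; the exact text format is not assumed here.

```python
class VnCoreNLPTokenizer:
    """Hypothetical wrapper sketch around the VnCoreNLP Java pipeline."""

    def __init__(self, gateway, annotators=("wseg",)):
        # gateway: a py4j JavaGateway whose JVM has VnCoreNLP on its classpath.
        self._gateway = gateway
        java_annotators = gateway.new_array(
            gateway.jvm.java.lang.String, len(annotators)
        )
        for i, name in enumerate(annotators):
            java_annotators[i] = name
        # Build the Java pipeline once and reuse it for every sentence.
        self._pipeline = gateway.jvm.vn.pipeline.VnCoreNLP(java_annotators)

    def annotate(self, sentence):
        """Run the pipeline on one string and return Java's text rendering."""
        annotation = self._gateway.jvm.vn.pipeline.Annotation(sentence)
        # annotate() fills in the Annotation object in place.
        self._pipeline.annotate(annotation)
        return annotation.toString()
```

Creating the pipeline in the constructor means the models are loaded once, and each later annotate call only pays for the sentence itself.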

And that's all you need to know to write the wrapper.

The full project can be found here
