Big Data Discovery – Custom Java Transformations Part 1

In our first post introducing Oracle Big Data Discovery, we highlighted the data transform capabilities of BDD.  The transform editor provides a variety of built-in functions for transforming datasets.  While these built-in functions are straightforward to use and don’t require any additional configuration, they are also limited to a predefined set of transformations.  Fortunately, for those looking for additional functionality during transform, it is possible to introduce custom transformations that leverage external Java libraries by implementing a custom Groovy script.  The rest of this post will walk through the implementation of a basic example, and a subsequent post will go in depth with a few real-world use cases.

Create a Groovy script

The core component needed to implement a custom transform with external libraries is a Groovy script that defines the pluginExec() method.  Groovy is a programming language developed for the Java platform; details and documentation on the language can be found in the official Groovy documentation.  For this basic example, we’ll begin by creating a file called CustomTransform.groovy and define a method, pluginExec(), which should take an Object array, args, as an argument:

def pluginExec(Object[] args) {
    String input = args[0] //args[0] is the input field from the BDD Transform Editor 

    //Implement code to transform input in some way
    //The return of this method will be inserted into the transform field

    input.toUpperCase() //This example would return an upper cased version of input
}

pluginExec() will be applied to each record in the BDD dataset, and args[0] corresponds to the field to be transformed.  In the example script above, args[0] is assigned to the variable input and the toUpperCase() method is called on it.  This means that if this custom transformation is applied to a field called name, the value of name for each record will be returned in uppercase (for example, “johnathon” => “JOHNATHON”).
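One caveat worth noting (this guard is our addition, not part of the original example): if a record contains a null value for the input field, calling toUpperCase() directly will throw a NullPointerException during processing.  Groovy’s safe-navigation operator offers a simple way to let null values pass through untouched:

def pluginExec(Object[] args) {
    String input = args[0] //The field value for the current record

    //The ?. operator skips the method call when input is null,
    //returning null instead of throwing a NullPointerException
    input?.toUpperCase()
}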

Import Custom Java Library

Now that we’ve covered the basics of how the custom Groovy script works, we can augment the script with external Java libraries.  These libraries can be imported and used just as they would be in standard Java:

import com.oracle.endeca.transform.HelloWorld
    
def pluginExec(Object[] args) {
    String input = args[0] //input is assigned here for illustration only; it is not used in this example and is not required
    
    HelloWorld hw = new HelloWorld() //Create a new instance of the HelloWorld class defined in the imported library
    hw.testMe() //Call the testMe() method, which returns a string "Hello World"
}

In the example above, the HelloWorld class is imported.  A new instance of HelloWorld is assigned to the variable hw, and the testMe() method is called.  testMe() simply returns the string “Hello World”.  Therefore, the expected output of this custom script is that the string “Hello World” will be inserted for each record in the transformed BDD dataset.  Now that the script has been created, it needs to be packaged up and added to the Spark class path so that it’s accessible during data processing.
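For reference, here is a minimal sketch of what the Java side of such a library might look like.  The package and class names are taken from the import statement above; the implementation itself is an assumption for illustration:

package com.oracle.endeca.transform;

public class HelloWorld {

    //Returns a fixed greeting; the Groovy script inserts this
    //value into the transformed field for every record
    public String testMe() {
        return "Hello World";
    }
}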

Package the Groovy script into a jar

In order to utilize CustomTransform.groovy, it needs to be packaged into a .jar file.  It is important that the Groovy script be located at the root of the jar, so make sure that the file is not nested within any directories.  See below for an example of the file structure:

CustomTransform.jar
  |---CustomTransform.groovy
  |---AdditionalFile_1
  |---AdditionalFile_2
  ...
  ...
  etc.

Note that additional files can be included in the jar as well, and these files can be referenced in CustomTransform.groovy if desired.  There are multiple ways to package up the file(s), but the simplest is to use the command line.  Navigate to the directory that contains CustomTransform.groovy and use the following command to package it up:

# jar cf <new_jar_name> <input_file_for_jar>
> jar cf CustomTransform.jar CustomTransform.groovy
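
To verify that the script sits at the root of the jar, list the archive’s contents:

> jar tf CustomTransform.jar
META-INF/
META-INF/MANIFEST.MF
CustomTransform.groovy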

Set up a custom lib location in Hadoop

CustomTransform.jar and any additional Java libraries that are imported by the Groovy script need to be added to all Spark nodes in your Hadoop cluster.  For simplicity, it is helpful to establish a standard location for all custom libraries that you want Spark to be able to access:

$ mkdir /opt/bdd/edp/lib/custom_lib

The /opt/bdd/edp/lib directory is the default location for the BDD data processing libraries used by Spark.  In this case, we’ve created a subdirectory, custom_lib, that will hold any additional libraries we want Spark to be able to use.

Once the directory has been created, use scp, WinSCP, MobaXterm, or some other utility to upload CustomTransform.jar and any additional libraries used by the Groovy script into the custom_lib directory.  The directory needs to be created on all Spark nodes, and the libraries need to be uploaded to all nodes as well.
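
If you have shell access to each node, a short loop along these lines can handle both the directory creation and the upload (the hostnames below are placeholders for your actual Spark nodes):

$ for node in spark01 spark02 spark03; do
      ssh $node "mkdir -p /opt/bdd/edp/lib/custom_lib"
      scp CustomTransform.jar $node:/opt/bdd/edp/lib/custom_lib/
  done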

Update sparkContext.properties on the BDD Server

The last step that needs to be completed before running the custom transformation is updating the sparkContext.properties file.  This step only needs to be completed the first time you create a custom transformation; as long as the location of the custom_lib directory remains constant, subsequent scripts will not require any configuration changes.

Navigate to /localdisk/Oracle/Middleware/BDD<version>/dataprocessing/edp_cli/config on the BDD server and open the sparkContext.properties file for editing:

$ cd /localdisk/Oracle/Middleware/BDD1.0/dataprocessing/edp_cli/config
$ vim sparkContext.properties

The file should look something like this:

#########################################################
# Spark additional runtime properties, see
# https://spark.apache.org/docs/1.0.0/configuration.html
# for examples
#########################################################


Add an entry to the file to define the spark.executor.extraClassPath property.  The value for this property should be <path_to_custom_lib_directory>/*.  This will add everything in the custom_lib directory to the Spark class path.  The updated file should look like this:

#########################################################
# Spark additional runtime properties, see
# https://spark.apache.org/docs/1.0.0/configuration.html
# for examples
#########################################################

spark.executor.extraClassPath=/opt/bdd/edp/lib/custom_lib/*

It is important to note that if there is already an entry in sparkContext.properties for the spark.executor.extraClassPath property, any libraries referenced by the existing entry should be moved to the custom_lib directory so they are still included in the Spark class path.
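
Alternatively, since Spark accepts multiple colon-separated entries in this property on Linux, an existing entry can be preserved alongside the new directory.  A hypothetical example:

spark.executor.extraClassPath=/opt/bdd/edp/lib/custom_lib/*:/opt/existing/lib/myLib.jar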

Run the custom transform

Now that the script has been created and added to the Spark class path, everything is in place to run the custom transform in BDD.  To try it out, open the Transform tab in BDD and click on the Show Transformation Editor button.  In this example, we are going to create a new field called custom with the type String:

[Screenshot: Create new attribute]

Now in the editor window, we need to reference the custom script:

[Screenshot: Transform Editor]
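
In this example, the expression entered in the editor is along these lines (key is the input field from our sample dataset):

runExternalPlugin('CustomTransform.groovy', key)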

The runExternalPlugin() method is used to reference the custom script.  The first argument is the name of the Groovy script.  Note that the value above is 'CustomTransform.groovy' and not 'CustomTransform.jar'.  The second argument is the field to be passed as input to the script (this is what gets assigned to args[0] in pluginExec()).  In the case of the “Hello World” example, the input isn’t used, so it doesn’t matter which field is passed here.  However, with the first example script that returned an uppercase version of the input field, the expression above would return an uppercase version of the key field.

One of the nice features of the built-in transform functions is that they make it possible to preview the transform changes before committing.  With these custom scripts, however, it isn’t possible to see the results of the transform before running the commit.  Clicking preview will just return blank results for all fields, as seen in the example below:

[Screenshot: Example of custom transform preview]

The last thing to do is click ‘Add to Script’ and then ‘Commit to Project’ to kick off the transformation process.  Below are the results of the transform.  As expected, a new custom field has been added to the data set and the value “Hello World” has been inserted for every record.

[Screenshot: Transform results]

This tutorial just hints at the possibilities of utilizing custom transformations with Groovy and external Java libraries in BDD.  Stay tuned for the second post on this subject, when we will go into detail with some real world use cases.

If you would like to learn more about Oracle Big Data Discovery and how it can help your organization, please contact us at info [at] ranzal.com or share your questions and comments with us below.
