OEID 3.0 First Look — Text Enrichment & Whitespace

I recently spent some cycles building my first POC for a potential customer with OEID v3.0.  After running some of the unstructured data through the text enrichment component, I noticed something odd:

[Image: whitespace_prob]

The charts I configured to group by the salient terms produced by text enrichment were displaying a “null” bucket.  This bucket was collecting every record that had not been tagged with a term.  After a bit of investigation, it seems this is expected behavior in v3.0: the Endeca Server now treats empty, yet non-null, attributes as valid and stores them on the Endeca record.  Such attributes are common after using some of the OOTB text enrichment capabilities in 3.0 (tagging, extraction, regex), so a best-practice treatment for this side effect is warranted.
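To make the distinction concrete, here is a minimal sketch in plain Java collections. It is not the Endeca Server API; the class name EmptyBucketDemo and the sample terms are hypothetical. It assumes that a null value means "attribute not assigned" and that an empty string is kept as a real value, which is enough to show how an unlabeled bucket appears in a group-by.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: plain Java collections, not the Endeca Server API.
class EmptyBucketDemo {

    // Groups record ids by the value of a single "term" attribute.
    // Assumption: null means the attribute is not assigned, so the record joins no bucket.
    static Map<String, List<Integer>> groupByTerm(List<String> termPerRecord) {
        Map<String, List<Integer>> buckets = new LinkedHashMap<String, List<Integer>>();
        for (int id = 0; id < termPerRecord.size(); id++) {
            String term = termPerRecord.get(id);
            if (term == null) {
                continue; // unassigned attribute: nothing to group on
            }
            List<Integer> ids = buckets.get(term);
            if (ids == null) {
                ids = new ArrayList<Integer>();
                buckets.put(term, ids);
            }
            ids.add(id); // an empty string is a real key and gets its own bucket
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Records 1 and 3 were not tagged; enrichment left an empty string behind.
        List<String> terms = Arrays.asList("pricing", "", "support", "");
        System.out.println(groupByTerm(terms));
    }
}
```

In this sketch the empty-string key collects records 1 and 3. Swapping those two values for null makes the extra bucket disappear, which is the effect the workaround in this post aims for.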

The good news is that the workaround was very straightforward.

1) Add a “Reformatter” component to the .grf before the bulk loader with the same input and output metadata edge definition.  From the reformatter “Source” tab, select “Java Transform Wizard” and give your new transformation class a name like “removeWhitespaces”.  This will create a .java source file and a compiled .class file in your Integrator project’s ./trans directory (where Integrator expects your java source code to reside).

[Image: removeWhitespace]

2) Provide the following Java logic in your new “removeWhitespaces” transformation class:

import org.jetel.component.DataRecordTransform;
import org.jetel.data.DataRecord;
import org.jetel.exception.TransformException;
import org.jetel.metadata.DataFieldType;

public class removeWhitespaces extends DataRecordTransform {

    @Override
    public int transform(DataRecord[] inputs, DataRecord[] outputs) throws TransformException {
        for (int i = 0; i < inputs.length; i++) {
            DataRecord rec = inputs[i];
            for (int j = 0; j < rec.getNumFields(); j++) {
                Object value = rec.getField(j).getValue();
                boolean isString = rec.getField(j).getMetadata().getDataType().equals(DataFieldType.STRING);
                // Turn empty (zero-length) string values into null so no empty attribute is loaded.
                if (isString && value != null && value.toString().length() == 0) {
                    value = null;
                }
                // Copy the (possibly nulled) value to the same field on the output record.
                outputs[i].getField(j).setValue(value);
            }
        }
        return 0;
    }
}

3) Make sure the name of this new class is specified in the “Transform class” input.  Rerun the .grf that loads your data and... profit!
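If you want to sanity-check the empty-to-null rule from step 2 without an Integrator project, the sketch below applies it to plain strings. The class EmptyToNull and both method names are hypothetical and not part of CloverETL or OEID. The second method, normalizeBlank, is an assumed stricter variant that also nulls whitespace-only values; the transform in step 2 only handles zero-length strings.

```java
// Hypothetical helper, not part of CloverETL or OEID: mirrors the empty-to-null
// rule on plain strings so it can be exercised without an Integrator project.
class EmptyToNull {

    // Same rule as the transform in step 2: null or zero-length values become null.
    static String normalize(CharSequence value) {
        if (value == null || value.length() == 0) {
            return null;
        }
        return value.toString();
    }

    // Assumed stricter variant (not in the original transform):
    // whitespace-only values are treated as empty too.
    static String normalizeBlank(CharSequence value) {
        if (value == null || value.toString().trim().isEmpty()) {
            return null;
        }
        return value.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize(""));         // zero-length value is nulled
        System.out.println(normalize("  "));       // whitespace-only value is kept as-is
        System.out.println(normalizeBlank("  "));  // stricter variant nulls it
        System.out.println(normalize("Endeca"));   // ordinary values pass through
    }
}
```

Whether whitespace-only values should be dropped as well depends on your data. If your enrichment output can contain them, the stricter check may be worth porting into the transform.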

[Image: whitespace_fix]

We look forward to sharing more emerging OEID v3.0 best practices here, and to hearing about your approaches as well.

 

 
