Return to Action

Late last year, we introduced some compelling capabilities that allowed users to collaborate with each other inside Oracle Endeca Information Discovery (OEID) 3.0. Our collaborative discovery extension allowed users with the right permissions to delete records, edit record attributes and add values to existing attributes, all from within OEID Studio.  It’s an incredibly powerful way to assist with data cleanup, data flagging or grouping of related records, and it applies to almost every data discovery scenario. We’re pleased to announce that we’ve re-launched this functionality and it is now available to licensed users of OEID 3.1.  The same capabilities remain, but we’ve given the extension a bit of a facelift as part of the upgrade, as you can see below.

Deleting Misleading or Invalid Records


Occasionally, incorrect or misleading data will find its way into a given application.  If it has no business being there or has an unwanted, adverse effect, let’s get rid of it. Hey, this tweet has nothing to do with the Olympics! Bye, bye, spammer!

Augmenting Existing Records With More Attribution

In addition, users may find something interesting on a record (or set of records) and want to take action to augment the record’s attribution.  Below, two terrorist incidents (from one of our internal applications) have been identified as possibly having links to ISIS based on location. The data can be augmented by selecting the field to hold this additional information (for simplicity’s sake, we added it to “Themes”) and then adding the additional value (or values):

Replacing Existing Attributes on Records

In the same vein as the first use case, users may find a record or set of records where they want to set a brand new value for an attribute.  It could be changing a Status from Open to Closed or from Valid to Invalid, or perhaps correcting an error introduced during ingest, such as a poorly calculated sentiment score. After each change, we update the index upon selection of “Apply Changes”.  If you look below, you can see the result reflected in the application immediately.

Now, there’s one final piece that hasn’t been mentioned that completely closes the loop.  Since an Oracle Endeca discovery application is typically not the “system of record”, there’s the possibility that an update sourced from an upstream system could override changes made by users. We’ve accounted for that as well by persisting all changes to a dedicated database store that can be integrated into the update process.  For example, if I’ve deleted a record from my application, I can use the “log of deletes” in the database as a data source in any ETL processes that happen subsequently.  Simply filter the incoming data stream using the data stored in the database and you’re good to go.  Attribute replacements and additions work the same way and are tracked and logged appropriately.
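
As a rough illustration of that delete-log filtering, here is a minimal Python sketch.  The OEID_DELETE_LOG table, its RECORD_KEY column and the sample records are assumptions invented for the example; in practice, the lookup and filter would live inside your Integrator (or other ETL) process against the extension’s actual store.

import sqlite3

# Hypothetical schema: the "log of deletes" is assumed to be a table named
# OEID_DELETE_LOG with a RECORD_KEY column; adjust names to your actual store.
conn = sqlite3.connect(':memory:')                      # stand-in for the real database
conn.execute('CREATE TABLE OEID_DELETE_LOG (RECORD_KEY TEXT)')
conn.execute("INSERT INTO OEID_DELETE_LOG VALUES ('tweet-002')")

deleted_keys = set(row[0] for row in conn.execute('SELECT RECORD_KEY FROM OEID_DELETE_LOG'))

incoming = [                                            # records arriving from the upstream system
    {'record_key': 'tweet-001', 'text': 'Go Team USA!'},
    {'record_key': 'tweet-002', 'text': 'Buy followers now!!!'},   # deleted by a Studio user
]

# Drop anything a user already deleted so the upstream feed cannot resurrect it.
clean = [r for r in incoming if r['record_key'] not in deleted_keys]
print('%d of %d incoming records survive the delete filter' % (len(clean), len(incoming)))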

If you’re interested in pricing or capabilities, or just want to give feedback, drop us a line at product [at] ranzal.com.  It was delivered to a customer in Spain last month, and we’re looking forward to seeing more and more people in the community get their hands on it.

Why Jython for FDMEE

Originally published here May 19, 2014, on ODTUG.com

Contributed by:
Tony Scalese, Integration Practice Director
Hyperion Certified
Oracle ACE
Edgewater Ranzal
ascalese@ranzal.com

In the 11.1.2.3 release of the EPM stack, Oracle introduced Financial Data Quality Management, Enterprise Edition, or FDMEE. FDMEE was not entirely a new product but rather a rebranding of ERP Integrator (ERPi), which was released in 11.1.1.3. FDMEE actually represents the convergence of two products – FDM (now known as FDM Classic) and ERPi – and represents the best of both products. Besides the obvious changes – a new user interface (UI) that is integrated into Workspace, leveraging Oracle Data Integrator (ODI) as its data engine and direct integration to many Oracle and SAP ERP source systems – FDMEE introduced one rather significant change: Jython as its scripting language.

For organizations that have a significant investment in FDM Classic, a new scripting language likely represents one of the most daunting changes in the new data management tool known as FDMEE. Before we continue, let’s briefly talk about scripting in FDMEE. Customers face a choice with FDMEE scripting – VBScript or Jython. I have spoken with a number of customers who have asked, “Can’t I just stick with VBScript, since it’s very similar to FDM Classic scripting, which was basically VBScript?” The technical answer is, in most cases, yes, you likely could. The more thought-out answer is, “Have you considered what you are giving up by sticking with VBScript?” Well, that really isn’t an answer, is it?

Let’s take a moment to understand why I ask that question and consider, at a high level, the differences between these two languages. From Wikipedia:

VBScript (Visual Basic Scripting Edition) is an Active Scripting language developed by Microsoft that is modeled on Visual Basic. It is designed as a “lightweight” language with a fast interpreter for use in a wide variety of Microsoft environments.

Jython, successor of JPython, is an implementation of the Python programming language written in Java.

Take a moment to consider the Enterprise Performance Management (EPM) stack at Oracle. Have you noticed any trends over the past two to three years? In 11.1.2.2, Oracle rewrote the UI for HFM using the Oracle ADF framework. In 11.1.2.3, Oracle removed all but the most basic functionality from the HFM Win32 client. In 11.1.2.4, HFM is planned to be platform agnostic, meaning it can run on virtually any operating system, including Linux and UNIX. Have you heard about this nifty new appliance called Exalytics? My point in this trip down memory lane is that Oracle is clearly moving away from any reliance on Microsoft technology in its product stack. Any time I have said this, the question inevitably gets asked: “Do you think we’ll still be able to run EPM on Windows servers?” My answer is a resounding YES. Oracle may not be the biggest fan of integrating Microsoft technology into its software solutions, but it is smart enough to understand that failing to support Windows as a server platform would lock it out of too large a share of the market. So breathe easily; I don’t see Oracle producing EPM software that can’t be deployed on Windows servers.

The EPM stack is moving toward becoming fully platform agnostic. Exalytics is, for those of you who are not familiar, an in-memory Linux or Solaris machine that delivers extreme performance for Business Intelligence and Enterprise Performance Management applications. At a very high level, these machines have an extraordinary amount of memory (RAM) that allows the entire database to be brought into memory. The result is incredible performance gains particularly for large applications.

There are at least two constants with technology. First, data continues to grow, and the demand for more data to support business decisions is no exception. The other constant is that hardware continually improves while its cost continues to drop. I can’t envision Exalytics being an exception to this. Today’s Exalytics machines often cost six figures, and that may not be an expense your organization can justify. However, in two to five years, your organization may require an Exalytics machine, and by then it may well be an expense you can justify.

Given this bit of background, let’s talk about why I firmly believe Jython is the better choice for your FDMEE application. As the EPM stack moves toward being platform agnostic, I believe that support for Microsoft technologies such as VBScript will slowly diminish. The application programming interface (API) will continue to be enriched for the Jython language, while the API for VBScript will be less robust. Please keep in mind that this is just my prediction at this point. But Oracle is no different from any other organization that undertakes technology projects. It works the same three levers that every project does – scope, time, and budget. As a customer of EPM, you have noticed the speed at which new releases have been deployed. To continue to support two sets of APIs within these accelerated development timelines will require one of the remaining levers to be “pulled.”

That leaves us the scope and budget levers. To maintain complete parity (scope) within a fixed timeline, the remaining lever is budget. Budget for any technology project is heavily correlated with people: add more people to the project, and the cost goes up. As I said before, Oracle is no different from any other organization. Project costs must be justified. So the development head of FDMEE would need to justify to senior management the need to add resources to support two sets of APIs – one of which exists specifically for Microsoft technologies. One can imagine how that conversation might go.

So we’re left with the scope lever. There are two APIs – one to support Jython (Python running on Java) and a second to support VBScript (a Microsoft technology). Let’s not forget that Oracle owns Java. Which do you think wins? I hope that I have built a case to support my previous conjecture about the expected richness of the Jython API vs. the VBScript API.

Let’s say you believe my above prediction is wrong. That’s OK. Let’s focus on one key difference between these technologies – error handling. Throughout my years of developing scripts for FDM Classic, the recurring question I heard from customers was: when a process fails, can the system alert me? The technical answer is, most likely. The practical answer is no. While in a VBScript routine I can leverage On Error Resume Next in conjunction with If Err.Number = 0, I would need to do this after every line of code, and that is simply not realistic. The best solution I have found is writing scripting operations to a log file that can then be reviewed to identify the point at which a script fails. While this approach has helped, it’s not nearly as elegant as the true error handling available in Jython.
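
For what it’s worth, that log-file workaround looks roughly like the sketch below (written in Python/Jython for consistency with the rest of this article; the path and step names are purely illustrative and not part of any FDMEE API).

import datetime

LOG_PATH = 'custom_script.log'   # illustrative path; point this at a server-side log directory in practice

def log_step(message):
    # Append a timestamped breadcrumb so a failed run can be traced to its
    # last completed step by reading the tail of this file.
    f = open(LOG_PATH, 'a')
    f.write('%s  %s\n' % (datetime.datetime.now(), message))
    f.close()

log_step('Import started')
# ... scripted operation #1 ...
log_step('Mapping rules applied')
# ... scripted operation #2 ...
# If the script dies between two entries, the log shows where it stopped.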

Jython provides error handling through the use of the except keyword. If you have ever written (not recorded) Excel VBA macros, you may be familiar with this functionality. In VBA, you would code On Error GoTo ErrHandler and then have code within an ErrHandler section of the script that performs some operation in the event of an error. Within Jython, there is a similar, albeit more robust, concept with the try/except keywords. For example:

def divide_numbers(x, y):
    try:
        return x / y
    except ZeroDivisionError:
        # Handle the anticipated failure instead of letting the script abort
        return 'You cannot divide by zero, try again'

In the above example, the except clause is used to handle division by zero. With Jython, you can have multiple except clauses to handle different anticipated failures in a process, as well as a catch-all except clause (and a finally clause that always runs) for anything unexpected. A key piece of functionality with exception handling is the ability to capture the line in the script that caused the failure. This is a key improvement over VBScript.
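
Below is a minimal sketch of that layered approach.  The function, the rate dictionary and the entity names are invented for illustration and are not part of the FDMEE API; the standard traceback module is what captures the exact line that failed.

import traceback

def load_rate(rates, entity):
    try:
        return float(rates[entity])
    except KeyError:
        return 'No rate found for ' + entity              # anticipated failure #1
    except ValueError:
        return 'Rate for ' + entity + ' is not numeric'    # anticipated failure #2
    except Exception:
        print(traceback.format_exc())                      # catch-all: log the traceback, including the failing line
        raise
    finally:
        print('load_rate finished for ' + entity)          # always runs, error or not

print(load_rate({'US': '1.0', 'EU': 'abc'}, 'EU'))
# load_rate finished for EU
# Rate for EU is not numeric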

We could continue to delve further into the technical details of why Jython is a more robust language, but when I think about this conversation in the context of “Why do I want to use Jython instead of VBScript for my application?”, I think the above arguments are compelling on their own. If you are interested in learning more about Jython scripting for FDMEE, please attend my session at Kscope14:

“Jython Scripting in FDMEE: It’s Not That Scary” on Tuesday, June 24, at 11:15 AM.
http://kscope14.com/component/seminar/seminarslist#Jython Scripting in FDMEE: It’s Not That Scary

Tag 100 Times Faster — Introducing Branchbird’s Fast Text Tagger

Text Tagging is the process of using a list of keywords to search and annotate unstructured data. This capability is frequently required by Ranzal customers, most notably in the healthcare industry.
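
To make the idea concrete, whitelist-style tagging can be as simple as the Python sketch below.  The keyword list and the record are invented for the example; note that each keyword forces its own scan of the text, which is exactly what becomes painful as the dictionary grows.

KEYWORDS = ['hypertension', 'diabetes', 'chest pain']   # illustrative whitelist

def tag_record(text, keywords=KEYWORDS):
    # Annotate a free-text field with every whitelist keyword it contains.
    lowered = text.lower()
    return [kw for kw in keywords if kw in lowered]

record = {'id': 'emr-001', 'note': 'Pt reports chest pain, history of hypertension.'}
record['tags'] = tag_record(record['note'])
print(record['tags'])   # ['hypertension', 'chest pain']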

Oracle’s Endeca Data Integrator provides three different ways to text tag your data “out of the box”:

  • The first is the “Text Tagger – Whitelist” component which is fed a list of keywords and searches your text for exact matches.
  • The second is the “Text Tagger – Regex” component which works similarly but allows for the use of regular expressions to expand the fuzzy matching capabilities when searching the text.
  • The third is Endeca’s “Text Enrichment” component (OEM’ed from Lexalytics), supplying a model (keyword list) that takes advantage of the component’s model-based entity extraction.

Ranzal began working on a custom text tagging component due to challenges with the aforementioned components at scale. All of the above text taggers are built to handle tagging with relatively small inputs — both the size of the supplied dictionary and the number (and size) of documents.

                         | 1,000 EMRs     | 10,000 EMRs    | 100,000 EMRs   | 1,000,000 EMRs
Fast Text Tagger (FTT)   | 250 docs/sec   | 1,428 docs/sec | 4,347 docs/sec | 6,172 docs/sec
Text Enrichment (TE)     | 6.5 docs/sec   | 5 docs/sec     | N/A            | N/A
TE (4 threads)           | 17.5 docs/sec  | 15 docs/sec    | 15 docs/sec    | N/A

In one of our most common use cases, customers analyzing electronic medical records with Endeca need to enrich large amounts of free text (typically physician notes) using a medical ontology such as SNOMED-CT or MeSH.  Each of these ontologies has a large number of medical “concepts” and their associated synonyms. For example, the US version of SNOMED-CT contains nearly 150,000 concepts. Unfortunately, the “out of the box” text tagger components do not perform well beyond a couple hundred keywords.  To realize slightly better throughput during tagging, Endeca developers have traditionally leveraged the third component listed above, the Lexalytics-based “Text Enrichment” component, which offers better performance than the other two options.

However, after extensive use of the “Text Enrichment” component, it became clear that not only was the performance still not acceptable at high scale, but the recall of the component was also inadequate, especially with Electronic Medical Records (EMRs). The Text Enrichment component is NLP-based and relies on accurately parsing sentence structure and word boundaries to tokenize the document before entity extraction begins. EMRs typically have very challenging sentence structure, due both to the ad hoc writing style of clinicians at point of entry and to the observational metrics embedded in the record. Because of this, running Text Enrichment over even small documents at high scale can be prohibitive for simple text tagging. A recent customer of ours, using very high-end enterprise equipment, was experiencing 24-hour processing times when using Text Enrichment to tag approximately six million EMRs with SNOMED-CT concepts.

To address both the performance and recall issues, Ranzal set out to build a simple text tagger component for Integrator that would be easy to set up and use.  The Ranzal “Fast Text Tagger” was built using a high-performance dictionary matching algorithm that compiles the list of terms (and phrases) into a finite-state pattern matching machine, which is then used to process the documents.  One of the largest benefits of this family of search algorithms is that the document text only needs to be scanned once to find all possible matches within the supplied dictionary.
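
For readers curious about the underlying idea, the sketch below is a minimal Python illustration of this style of finite-state dictionary matching (an Aho-Corasick-style automaton).  It is not Ranzal’s shipped component, and the keyword list is invented for the example; it simply shows how the dictionary is compiled once and each document is then scanned a single time.

from collections import deque

class FastTaggerSketch(object):
    # Toy Aho-Corasick-style matcher: compile the keyword list into a finite-state
    # machine once, then tag each document in a single pass.  Word-boundary
    # restrictions and locale handling are omitted for brevity.

    def __init__(self, keywords):
        # goto[s] maps a character to the next state; out[s] lists keywords ending at s.
        self.goto = [{}]
        self.out = [[]]
        self.fail = [0]
        for kw in keywords:
            self._add(kw.lower())
        self._build_failure_links()

    def _add(self, kw):
        state = 0
        for ch in kw:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.out.append([])
                self.fail.append(0)
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append(kw)

    def _build_failure_links(self):
        # Breadth-first pass wiring each state to its longest proper-suffix state.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt].extend(self.out[self.fail[nxt]])

    def tag(self, text):
        # Return every dictionary term found in text, scanning the text exactly once.
        state, found = 0, []
        for ch in text.lower():
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            found.extend(self.out[state])
        return found

tagger = FastTaggerSketch(['chest pain', 'hypertension', 'diabetes'])
print(tagger.tag('Pt presents with chest pain; long history of hypertension.'))
# ['chest pain', 'hypertension']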

The Ranzal Fast Text Tagger is intended to replace the stock “Text Tagger – Whitelist” component and the use of the “Text Enrichment” component for whitelisting. Our text tagger is intended for straight text matching with optional restrictions to allow for matching on word boundaries. If your use cases require more fuzzy-style text matching, then you should continue to use the “Text Tagger – Regexp” at low scale and “Text Enrichment” at higher scales.

Performance Observations

Expanding on the metrics shown above (and duplicated here), you can see the remarkable performance of the Ranzal Fast Text Tagger compared to “Text Enrichment”, even when Text Enrichment is configured to consume 4 threads.  Furthermore, the throughput of the BB FTT tends to increase with the number of documents, before starting to level off near 1 million documents, whereas Text Enrichment’s throughput stays relatively constant.

                  | 1,000 EMRs     | 10,000 EMRs    | 100,000 EMRs   | 1,000,000 EMRs
BB FTT (1 thread) | 250 docs/sec   | 1,428 docs/sec | 4,347 docs/sec | 6,172 docs/sec
TE (1 thread)     | 6.5 docs/sec   | 5 docs/sec     | N/A            | N/A
TE (4 threads)    | 17.5 docs/sec  | 15 docs/sec    | 15 docs/sec    | N/A

As a final performance note: for the previously mentioned customer whose graph run took 24 hours just for text tagging, the same process was completed on this test harness with the same data in just shy of 20 minutes (consistent with the rates above: six million EMRs at roughly 5,000-6,000 docs/sec works out to well under half an hour).  It took longer to read the data from disk than it took to stream it all through the Fast Text Tagger.  This implies that, in typical use cases, the Fast Text Tagger will not be the limiting component in your graph. For those of you curious about the benchmarking methods used, please continue below.

Test Runs

We built a graph that could execute the different test configurations sequentially and then compile the results. Shown below are four separate test runs and a screen capture of Integrator at test completion. Below each screen cap is a list of metrics:

  • Match AVG: The average number of concepts extracted per document across the corpus
  • Total Match: The total number of concepts extracted over the corpus
  • Misses: The number of non-empty EMRs where no concept was found
  • Exec Time: The total execution time of the test configuration

Note that Text Enrichment’s poor recall drags down its Match AVG. If you remove the (significant number of) misses, TE’s average match count is nearly as high as our Fast Text Tagger’s.

Test 1: 1,000 EMRs

[Integrator screen capture: 1,000 EMR test run]

Test ID           | Match AVG | Total Match | Misses | Exec Time
BB FTT (1 thread) | 9         | 9,126       | 14     | 4 secs
TE (1 thread)     | 4         | 4,876       | 260    | 153 secs
TE (4 threads)    | 4         | 4,876       | 260    | 57 secs

Test 2: 10,000 EMRs

[Integrator screen capture: 10,000 EMR test run]

Test ID           | Match AVG | Total Match | Misses | Exec Time
BB FTT (1 thread) | 12        | 127,617     | 14     | 7 secs
TE (1 thread)     | 5         | 55,567      | 3,739  | 2,010 secs
TE (4 threads)    | 5         | 55,567      | 3,739  | 675 secs

Test 3: 100,000 EMRs

[Integrator screen capture: 100,000 EMR test run]

Test ID           | Match AVG | Total Match | Misses | Exec Time
BB FTT (1 thread) | 13        | 1,380,258   | 17     | 23 secs
TE (4 threads)    | 5         | 546,598     | 38,466 | 6,555 secs

Test 4: 1,000,000 EMRs

[Integrator screen capture: 1,000,000 EMR test run]

Test ID           | Match AVG | Total Match | Misses | Exec Time
BB FTT (1 thread) | 14        | 14,834,247  | 17     | 162 secs

Benchmarking Notes

Tests were conducted on OEID 3.1 using the US SNOMED-CT concept dictionary (148,000 concepts) against authentic Electronic Medical Records. Physical hardware: a PC with a 4-core i7 (hyperthreading enabled), 32 GB of RAM and SSD drives.

The “Text Tagger – Whitelist” was discarded as unusable for this test setup. “Text Enrichment” with 1 thread was discarded after the 10,000 document run and TE with 4 threads was discarded after the 100,000 document run.

Advanced Visualizations on Oracle Endeca 3.1

Ranzal is pleased to announce that our Advanced Visualization Framework is now generally available.  Spend more time discovering, less time coding.

The calendar has turned, the air is frozen and, with a new year, comes the annual deluge of “predictions and trends” articles for 2014.  Spoiler Alert: Hadoop isn’t going away and data, and what you do with it, is everything right now.

Maybe you’ve seen some of these articles, but one in particular, Forbes’ “Top Four Big Data Trends”, caught our eye; more specifically, one section of it did.  It’s simple, it’s casual (possibly too much so), but it really resonates:

“Visualization allows people who are not analytics gurus to get insights out of complex stuff in a way they can absorb.”

At Ranzal, we believe the goal of Business Intelligence and Data Discovery is to democratize the discovery process and allow anyone who can read a chart to understand their data.  It’s not about the fancy query or the massive map/reduce; it’s what you do with it.


Leveraging Your Organization’s OBI Investment for Data Discovery

Coupling disparate data sets into meaningful “mashups” is a powerful way to test new hypotheses and ask new questions of your organization’s data.  However, more often than not, the most valuable data in your organization has already been transformed and warehoused by IT in order to support the analytics needed to run the business.  Tools that neglect these IT-managed silos don’t allow your organization to tell the most accurate story possible when pursuing its discovery initiatives.  Data discovery should not focus only on the new varieties of data that exist outside your data warehouse: the value of social media data and machine-generated data cannot be fully realized until it can be paired with the transactional data your organization already stockpiles.

Judging by the heavy investment in a new “self-service” theme in the recently released version 3.1 of Endeca Information Discovery, this truth has not been lost on Oracle.

Companies that are eager to get into the data discovery game, yet are afraid to walk away from the time and effort they’ve poured into their OBI solution, can breathe a little easier.  Oracle has made the proper strides in the Endeca product to incorporate OBI into the discovery experience.

And unlike other discovery products on the market today, access to these IT-managed repositories (like OBI) is centrally managed.  By controlling access to the data and keeping all data “on the platform”, this centralized management allows IT to avoid the common “spreadmart” problem that plagues other discovery products.

Rather than explain how OBI has been introduced into the discovery experience, I figured I would show you.  Check out this short four-minute demonstration, which illustrates how your organization can build its own data “mashups” leveraging the valuable data tied up in OBI.

 

 

Chances are that a handful of these tested hypotheses will unlock new ways to measure your business.  These new data mashups will warrant permanent applications that are made available to larger audiences within your organization.  The need for more permanent applications will require IT to “operationalize” your discovery application — introducing data updates, security, and properly sized hardware to support the application.

For these IT-provisioned applications, Oracle has also provided some tooling in Endeca to make the job more straightforward.  Specifically, when it comes to OBI, the product now boasts a wizard that will produce an Integrator project with all of the plumbing necessary to pull data tied up in OBI into a discovery application in minutes.  Check out this video to see how:

 

 

It is product investments like these that will allow organizations to realize the transformative effects data discovery can have on their business without having to ignore the substantial BI investments already in place.

As always, please direct any questions or comments to [at] ranzal.com.

The Feature List: Oracle Endeca Information Discovery 3.1

As promised last week, we’ve been compiling a list of all the new features that were added as part of the Oracle Endeca Information Discovery (OEID) 3.1 release earlier this month.

If we’ve missed anything, please shoot us an email and we’ll update the post.

OEID Integrator v3.1


The gang at Javlin has implemented some major updates in the past six months, especially around big data.  The OEID Integrator releases, obviously, lag a bit behind their corresponding CloverETL releases, but there’s still a lot to get excited about from both a CloverETL and a “pure OEID” standpoint:

  • Base CloverETL version upgraded from 3.3 to 3.4 – full details here
  • HadoopReader / HadoopWriter components
  • Job Control component for executing MapReduce jobs
  • Integrated Hive JDBC connection
  • Language Detection component!

The big takeaway here is the work that the Javlin team has done in terms of integrating their product more closely with the “Big Data” ecosystem.  Endeca has always been a great complementary fit with sophisticated data architectures based on Hadoop and these enhancements will only make it easier.

Keeping with our obsession with giving some time to the small wins that add big gains, I really like the quick win with the Language Detection component.  This is something that had been around “forever” in the old Endeca world of Forge and Dgidx but was rarely used or understood.  It is nice to see the return of this functionality, as it will play a huge role in multi-lingual, multi-national organizations, especially those with a lot of unstructured data.  Think about a European bank with a large presence in multiple countries trying to hear the “Voice of the Customer”.  Being able to navigate, filter and summarize based on a customer’s native language becomes so much easier.

OEID Web Acquisition Toolkit (aka KAPOW!)