Oracle Business Intelligence – Synchronizing Hierarchical Structures to Enable Federation

More and more Oracle customers are finding value in federating their EPM cubes with existing relational data stores such as data marts and data warehouses (for brevity, data warehouse will refer to all relational data stores). This post explains the concept of federation, explores the consequences of allowing hierarchical structures to get out of synchronization, and shares options to enable this synchronization.

In OBI, federation is the integration of distinct data sources to allow end users to perform analytical tasks without having to consider where the data is coming from. There are two types of federation to consider when using EPM and data warehouse sources:  vertical and horizontal.  Vertical federation allows users to drill down a hierarchy and switch data sources when moving from an aggregate data source to a more detailed one.  Most often, this occurs in the Time dimension, where the EPM cube stores data for year, quarter, and month, and the relational data sources have details on daily transactions.  Horizontal federation allows users to combine different measures from the distinct data sources naturally in an OBI analysis, rather than extracting the data and building a unified report in another tool.

Federation makes it imperative that the common hierarchical structures are kept in sync. To demonstrate the issues that can occur during vertical federation when the data sources are not synchronized, take the following hierarchies in an EPM application and a data warehouse:

Figure 1: Unsynchronized Hierarchies

Notice that Colorado falls under the Western region in the EPM application, but under the Southwestern region in the data warehouse. Also notice that the data warehouse contains an additional level (or granularity) in the form of cities for each region.  Assume that both data sources contain revenue data.  An OBI analysis such as this would route the query to the EPM cube and return these results:

Figure 2: EPM Analysis – Vertical Federation

However, if the user were to expand the state of Washington to see the results for each city, OBI would route the query to the data warehouse. When the results return, the user would be confronted with different revenue figures for the Southwest and West regions:

Figure 3: Data Warehouse – Vertical Federation

When the hierarchical structures are not aligned between the two data sources, irreconcilable differences can occur when switching between the sources. Many times, end users are not aware that they are switching between EPM and a data warehouse, and will simply experience a confusing reorganization in their analysis.

To demonstrate issues that occur in horizontal federation, assume the same hierarchies as in Figure 1 above, but the EPM application contains data on budget revenue while the data warehouse contains details on actual revenue. An analysis such as this could be created to query each source simultaneously and combine the budget and actual data along the common dimension:

Figure 4: Horizontal Federation

However, drilling into the West and Southwest regions will result in Colorado becoming an erroneously “shared” member:

Figure 5: Colorado as a “Shared” Member

In actuality, the mocked-up analysis above would more than likely result in an error, since OBI would not be able to match the hierarchical structures during query generation.

There are a number of options to enable the synchronization of hierarchical structures across EPM applications and data warehouses. Many organizations manually maintain their hierarchical structures in spreadsheets and text files, often located on an individual’s desktop.  It is possible to continue this manual maintenance; however, these dispersed files should be centralized, a governance process defined, and the EPM metadata management and data warehouse ETL processes redesigned to pick up the centralized files.  This method is still subject to errors and is inherently difficult to properly govern and audit.  For organizations that are already using Enterprise Performance Management Architect (EPMA), a scripting process can be implemented that extracts the hierarchical structures to flat files.  A follow-on ETL process to move these hierarchies into the data warehouse will also have to be implemented.

The best-practice solution is to use Hyperion Data Relationship Management (DRM) to manage these hierarchical structures. DRM boasts robust metadata management capabilities coupled with a system-agnostic approach to exporting this metadata.  DRM’s most valuable export method allows pushing directly to a relational database.  If a data warehouse is built in tandem with an EPM application, DRM can push directly to a dimension table that can then be accessed by OBI.  If there is a data warehouse already in place, existing ETL processes may have to be modified, or a dimension table devoted to the hierarchy created.  Ranzal has a DRM accelerator package to enable the synchronization of hierarchical structures between EPM and data warehouses, designed to work with our existing EPM application DRM implementation accelerators.  Using these accelerators, Ranzal can perform an implementation in as little as six weeks that provides metadata management for the EPM application, establishes a process for maintaining hierarchical structure synchronization between EPM and the data warehouse, and federates the data sources in OBI.

While the federation of EPM and data warehouse sources has been the primary focus here, it is worth noting that two EPM cubes or two data warehouses could also be federated in OBI. For many of the reasons discussed previously, data synchronization processes will have to be in place to enable this federation, and the solutions described above for maintaining metadata synchronization can likely be adapted to these scenarios as well.

The federation of EPM and data warehouse sources allows an enterprise to create a more tightly integrated analytical solution. This tight integration allows users to traverse the organization’s data, gain insight, and answer business-essential questions at the speed of thought.  As demonstrated, mismanaging hierarchical structures can result in an analytical solution that produces unexpected results and harms user confidence.  Enterprise solutions need enterprise approaches to governance; therefore, it is imperative to understand and address shortcomings in hierarchical structure management.  Ranzal has deep knowledge of EPM, DRM, and OBIEE, and of how these systems can be implemented to work tightly together to address an organization’s analytical and reporting needs.

Visualizing Big Data

This post explores using Tableau Server to visualize data in a Hadoop cluster.

More and more businesses are finding value or savings in replumbing their databases to run on commodity hardware with open-source software like the Cloudera Distribution of Hadoop (CDH). With tools like Impala, real-time queries of massive datasets become possible. To get the most insight, compelling interactive data exploration and visualization is necessary. We wanted to explore how Tableau works in this regard, and found that visualizing data from CDH through Impala with Tableau Server proved remarkably easy. This combination provides data exploration at the speed of thought with beautiful, intuitive visualizations, resulting in a quick front end for big data.

To demonstrate this, I took the classic AdventureWorks Bike Store data that Microsoft uses to demo a data warehouse in SQL Server (also used in our Endeca Information Discovery demo) and loaded it into our CDH cluster using Impala. I downloaded Tableau Desktop, as well as the Cloudera ODBC Driver for Impala, and spun up a VM running Windows Server to host Tableau Server. After configuring the Impala driver to point at our cluster, I launched Tableau and added a new data source. Tableau makes it pretty simple to choose from a menu of data sources, from a simple CSV to a massive CDH cluster. After selecting the Cloudera Hadoop option, I entered our cluster’s DNS name along with the port and credentials for Impala. I selected my new Bike Store database and table, and was ready to whip up some visualizations.

Tableau provides tools for ETL, including a pretty nice GUI for simple joins, but since I was trying to denormalize a star schema, I did the transformations using impala-shell, where I had more control and the operations were more visible. Cluster-side ETL also helps these visualizations run at the speed of thought, even when working with big data at scale.
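
For reference, the kind of denormalizing pass described above can be done with a single CREATE TABLE AS SELECT in impala-shell. The sketch below is only an illustration of the approach; the host, database, table, and column names are assumptions based on the standard AdventureWorksDW sample rather than the exact schema loaded into our cluster.

# Hypothetical host and schema names; the joins follow the AdventureWorksDW star layout
impala-shell -i impala-host:21000 -q "
CREATE TABLE bike_store.sales_flat STORED AS PARQUET AS
SELECT f.salesordernumber,
       f.orderquantity,
       f.salesamount,
       d.fulldatealternatekey   AS order_date,
       p.englishproductname     AS product,
       t.salesterritoryregion   AS region
FROM   bike_store.factinternetsales  f
JOIN   bike_store.dimdate            d ON f.orderdatekey      = d.datekey
JOIN   bike_store.dimproduct         p ON f.productkey        = p.productkey
JOIN   bike_store.dimsalesterritory  t ON f.salesterritorykey = t.salesterritorykey;"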

Tableau makes it really easy to create standard charts, and even more complicated visualizations like maps can be rendered automagically thanks to geographic attribute detection. I found the date detection to be less dependable. You can add filters to slice and dice on different attributes, create dashboards to combine several worksheets, or create “Stories” to walk a viewer through a series of visualizations.

Overall, I found some aspects of Tableau a little bit like using Apple products: you get great design and intuitive functionality at the cost of robust configurability. It’s a fantastic tool to get pretty visualizations up and running quickly and intuitively, to add data sources on the fly with little complexity, and to quickly share the results. But in spite of those strengths, Tableau isn’t a replacement for enterprise-scale BI offerings.

Interact with the visualization we built and embedded here:

Bike Store Dashboard

Big Data Discovery – Custom Java Transformations Part 2

In a previous post, we walked through how to implement a custom Java transformation in Oracle Big Data Discovery.  While that post was more technical in nature, this follow-up post will highlight a detailed use case for the transformations and illustrate how they can be used to augment an existing dataset.

Example: 2015 Chicago Mayoral Election Runoff

For this example we will be using data from the 2015 Mayoral Election Runoff in Chicago.  Incumbent Rahm Emanuel defeated challenger “Chuy” Garcia with 55.7% of the popular vote.  Results data from the election were compiled and matched up with Chicago communities, which were then subdivided by zip code.  A small sample of the data can be seen below:

Sample election data

In its original state, the data already offers some insight into the results of the election, but only at a high level.  By utilizing the custom transformations, it is possible to bring in additional data and find answers to more detailed questions.  For example, what impact did the wealth of Chicago’s communities have on their selection of a candidate?

One indicator of wealth within a community is the median sale price of homes in that area.  Naturally, the higher the price of homes, the wealthier the community tends to be.  Zillow provides an API that allows users to query for a variety of real estate and demographic information, including median sale price of homes.  Through the custom transformations, we can augment the existing election results with the data available through the API.

The structure of the custom transformation is exactly the same as the ‘Hello World’ example from our previous post.  The transformation is initiated in the BDD Custom Transform Editor with the command runExternalPlugin('ZillowAPI.groovy',zip). In this case, the custom groovy script is called ZillowAPI.groovy and the field being passed to the script is the zip code, zip.

The script then uses the zip to construct a string and convert it to the URL required to make the API call:

def zip = args[0]
String url = "http://www.zillow.com/webservice/GetDemographics.htm?zws-id=<ZILLOW_API_KEY>&zip=" + zip;
URL api_url = new URL(url);
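
From there, the script just needs to fetch that URL and pull the median sale price out of the XML returned by the GetDemographics service. A minimal sketch of the full transform is shown below; the element path used to locate the median sale price in the response is an assumption for illustration and should be checked against the actual API output.

def pluginExec(Object[] args) {
    def zip = args[0]  // zip code field passed in from the Transform Editor
    String url = "http://www.zillow.com/webservice/GetDemographics.htm?zws-id=<ZILLOW_API_KEY>&zip=" + zip

    try {
        // Call the API and parse the XML payload with Groovy's built-in XmlSlurper
        def response = new XmlSlurper().parseText(new URL(url).text)

        // Find the <attribute> node whose <name> is "Median Sale Price" (assumed response structure)
        def attr = response.depthFirst().find { it.name() == 'attribute' && it.name.text() == 'Median Sale Price' }

        // The returned string becomes the median_sale_price field for this record
        attr ? attr.values.zip.value.text() : null
    } catch (Exception e) {
        null  // on any lookup failure, leave the field blank for this record
    }
}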

Once the transform script completes, the median_sale_price field is now accessible in BDD:

Updated data in BDD

Now that the additional data is available, we can use it to build some visualizations to help answer the question posed earlier.

Median Sale Price by Chicago Community – Created using the Ranzal Data Visualization Portlet*

Percentage for Chuy by Community – Created using the Ranzal Data Visualization Portlet*

The two choropleths above show the median sale price by community and the percentage of votes for “Chuy” by community.  Communities in the northeastern sections of the city tend to have the highest median sale prices, while communities in the western and southern sections tend to have lower prices.  For median sale price to be a strong indicator of how the communities voted, the map displaying votes for “Chuy” should show a similar pattern, with the communities grouped into northeast and southwest.  However, the pattern is noticeably different, with votes for “Chuy” distributed across all sections of the map.

Bar-Line chart of Median Sale Price and Percent for Chuy

Looking at the median sale price in conjunction with the percentage of votes for “Chuy” provides an even clearer picture.  The bars in the chart above represent the median sale price of homes, and are sorted in descending order from left to right.  The line graph represents the percentage of votes for “Chuy” in each community.  If there was a connection between median sale price and the percentage of votes for “Chuy”, we’d expect to see the line graph either increase or decrease as sale price decreases.  However, the percentage of votes varies widely from community to community, and doesn’t seem to follow an obvious pattern in relation to median sale price.  This corresponds with the observations from the two choropleths above.

While these findings don’t provide a definitive answer to the initial question as to whether community wealth was a factor in the election results, they do suggest that median sale price is not a good indicator of how Chicago communities voted in the election.  More importantly, this example illustrates how easy it is to utilize custom Java transformations in BDD to answer detailed questions and get more out of your original dataset.

If you would like to learn more about Oracle Big Data Discovery and how it can help your organization, please contact us at info [at] ranzal.com or share your questions and comments with us below.


* – The Ranzal Data Visualization Portlet is a custom portlet developed by Ranzal and is not available out of the box in BDD.  If you would like more information on the portlet and its capabilities, please contact us and stay tuned for a future blog post that will cover the portlet in more detail.

Undocumented Data Export Feature in Oracle Hyperion PBCS (Planning and Budgeting Cloud Service)

In response to companies looking for more decentralized services with less IT overhead, Oracle has launched the Planning and Budgeting Cloud Service (PBCS). PBCS is a hosted version of the Oracle Hyperion Planning and Data Management/Integration (FDMEE) tools, with a particular focus on a completely online interface.

From a functional perspective, this is an ideal situation: to have near-full capabilities of an on-premise solution without the infrastructure maintenance concerns. Practically, though, there are some holes to fill as Oracle perfects and grows the solution.

One of the main areas of concern has been the integration of data into and out of PBCS. Data Management (a version of FDMEE) is the recommended tool for loading flat file data into the system, and properly formatted files can also be loaded directly to Essbase. Getting files out of the system, on the other hand, has not been so straightforward. Without access to the Essbase server, exporting files proves impractical. Companies often need data exports from Essbase for backups, integrations with other systems, or review, yet PBCS does not seem to have a native method for extracting Level Zero (Lv0) data on a regular basis so that it can easily be copied out of the system and used elsewhere.

Despite this, the DATAEXPORT command still exists in the PBCS world. How, then, could it be used to get a needed file?

It actually begins, as with a normal on-premise application, by creating a Business Rule to perform the data export. This can be done manually, but it is recommended to use the System Template to make sure everything is set up correctly.

When setting up the location to export the file to, the path should be specified as:

 “/u03/lcm/[File_Name.txt]”

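For reference, the business rule generated from the System Template boils down to a standard DATAEXPORT calculation script. A minimal sketch is shown below; the export options are typical choices and the members in the FIX statement are purely illustrative, so both would need to be adjusted to the actual application.

SET DATAEXPORTOPTIONS
{
DataExportLevel "LEVEL0";       /* export only level-zero data */
DataExportColFormat ON;         /* column format simplifies downstream loads */
DataExportOverwriteFile ON;     /* replace the previous export on each run */
};

FIX ("Actual", "Final", &CurrentYear)   /* illustrative members only */
DATAEXPORT "File" "," "/u03/lcm/Lv0_Export.txt";
ENDFIX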

When this is done, a user can then navigate over to the Inbox/Outbox Explorer and see the file there.

And that is really all there is to it! With a business rule in place, the entire process can be automated using EPMAutomate (EPMAutomate and recommendations for an automation engine/methodology will be discussed in a later post) and a batch scripting client, as sketched below, to do a process that:

  • Deletes the old file
  • Runs the business rule to do the data export
  • Copies the file off of PBCS to a local location
  • Pushes the file to any other needed location
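
A hedged sketch of such a batch script using EPMAutomate follows; the user, password, URL, identity domain, rule name, file name, and target path are all placeholders, and the exact commands should be verified against the EPMAutomate version in use.

rem Placeholders only: adjust the credentials, URL, identity domain, rule name, and file names
call epmautomate login serviceadmin MyPassword https://planning-mydomain.pbcs.us2.oraclecloud.com mydomain
rem Delete the old export, run the business rule, then pull the new file down locally
call epmautomate deletefile Lv0_Export.txt
call epmautomate runbusinessrule ExportLv0Data
call epmautomate downloadfile Lv0_Export.txt
call epmautomate logout
rem Push the file to any other location that needs it
copy Lv0_Export.txt \\fileserver\integrations\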

The one important thing to note is that as of PBCS 11.1.2.3.606 (April 2015 patch), all files in the Inbox/Outbox Explorer — along with any files in Application Management (LCM) — that are older than two months will be automatically deleted. As such, if these files are being kept for archive purposes, they must be backed up offline in order to be preserved.

Big Data Discovery – Custom Java Transformations Part 1

In our first post introducing Oracle Big Data Discovery, we highlighted the data transform capabilities of BDD.  The transform editor provides a variety of built-in functions for transforming datasets.  While these built-in functions are straightforward to use and don’t require any additional configuration, they are also limited to a predefined set of transformations.  Fortunately, for those looking for additional functionality during transform, it is possible to introduce custom transformations that leverage external Java libraries by implementing a custom Groovy script.  The rest of this post will walk through the implementation of a basic example, and a subsequent post will go in depth with a few real world use cases.

Create a Groovy script

The core component needed to implement a custom transform with external libraries is a Groovy script that defines the pluginExec() method.  Groovy is a programming language developed for the Java platform.  Details and documentation on the language can be found here.  For this basic example, we’ll begin by creating a file called CustomTransform.groovy and defining a method, pluginExec(), which should take an object array, args, as an argument:

def pluginExec(Object[] args) {
    String input = args[0] //args[0] is the input field from the BDD Transform Editor 

    //Implement code to transform input in some way
    //The return of this method will be inserted into the transform field

    input.toUpperCase() //This example would return an upper cased version of input
}

pluginExec() will be applied to each record in the BDD dataset, and args[0] corresponds to the field to be transformed.  In the example script above, args[0] is assigned to the variable input and the toUpperCase() method is called on it.  This means that if this custom transformation is applied to a field called name, the value of name for each record will be returned upper cased (For example, “johnathon” => “JOHNATHON”).
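
One caveat worth noting: if a record has no value for the field being passed in, args[0] may come through as null, and calling a method on it directly would fail for that record. Groovy’s safe navigation operator is an easy guard (a minimal sketch, assuming empty values arrive as null):

def pluginExec(Object[] args) {
    String input = args[0]
    input?.toUpperCase()  // returns null instead of throwing an exception when the field is empty
}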

Import Custom Java Library

Now that we’ve covered the basics of how the custom Groovy script works, we can augment the script with external Java libraries.  These libraries can be imported and implemented just as they would be in standard Java:

import com.oracle.endeca.transform.HelloWorld
    
def pluginExec(Object[] args) {
    String input = args[0] //Note that though the input variable is defined in this example, it is not used.  Defining input is not required.
    
    HelloWorld hw = new HelloWorld() //Create a new instance of the HelloWorld class defined in the imported library
    hw.testMe() //Call the testMe() method, which returns a string "Hello World"
}

In the example above, the HelloWorld class is imported.  A new instance of HelloWorld is assigned to the variable hw, and the testMe() method is called.  testMe() is designed to simply return the string “Hello World”.  Therefore, the expected output of this custom script is that the string “Hello World” will be inserted for each record in the transformed BDD dataset.  Now that the script has been created, it needs to be packaged up and added to the Spark class path so that it’s accessible during data processing.

Package the Groovy script into a jar

In order to utilize CustomTransform.groovy, it needs to be packaged into a .jar file.  It is important that the Groovy script be located at the root of the jar, so make sure that the file is not nested within any directories.  See below for an example of the file structure:

CustomTransform.jar
  |---CustomTransform.groovy
  |---AdditionalFile_1
  |---AdditionalFile_2
  ...
  ...
  .etc

Note that additional files can be included in the jar as well.  These additional files can be referenced in CustomTransform.groovy if desired.  There are multiple ways to package up the file(s), but the simplest is to use the command line.  Navigate to the directory that contains CustomTransform.groovy and use the following command to package it up:

# jar cf <new_jar_name> <input_file_for_jar>
> jar cf CustomTransform.jar CustomTransform.groovy
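
To double-check that the script really did end up at the root of the jar, listing the archive contents is a quick sanity check:

# jar tf <jar_name> lists the contents of the archive
> jar tf CustomTransform.jar
META-INF/
META-INF/MANIFEST.MF
CustomTransform.groovy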

Set up a custom lib location in Hadoop

CustomTransform.jar and any additional Java libraries that are imported by the Groovy script need to be added to all Spark nodes in your Hadoop cluster.  For simplicity, it is helpful to establish a standard location for all custom libraries that you want Spark to be able to access:

$ mkdir /opt/bdd/edp/lib/custom_lib

The /opt/bdd/edp/lib directory is the default location for the BDD data processing libraries used by Spark.  In this case, we’ve created a subdirectory, custom_lib, that will hold any additional libraries we want Spark to be able to use.

Once the directory has been created, use scp, WinSCP, MobaXterm, or some other utility to upload CustomTransform.jar and any additional libraries used by the Groovy script into the custom_lib directory.  The directory needs to be created on all Spark nodes, and the libraries need to be uploaded to all nodes as well.
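
For example, a quick loop over the Spark nodes keeps the copies consistent (the hostnames below are placeholders for the nodes in your cluster):

$ for node in sparknode01 sparknode02 sparknode03; do
>   ssh $node "mkdir -p /opt/bdd/edp/lib/custom_lib"
>   scp CustomTransform.jar $node:/opt/bdd/edp/lib/custom_lib/
> done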

Update sparkContext.properties on the BDD Server

The last step that needs to be completed before running the custom transformation is updating the sparkContext.properties file.  This step only needs to be completed the first time you create a custom transformation as long as the location of the custom_lib directory remains constant for each subsequent script.

Navigate to /localdisk/Oracle/Middleware/BDD<version>/dataprocessing/edp_cli/config on the BDD server and open the sparkContext.properties file for editing:

$ cd /localdisk/Oracle/Middleware/BDD1.0/dataprocessing/edp_cli/config
$ vim sparkContext.properties

The file should look something like this:

#########################################################
# Spark additional runtime properties, see
# https://spark.apache.org/docs/1.0.0/configuration.html
# for examples
#########################################################


Add an entry to the file to define the spark.executor.extraClassPath property.  The value for this property should be <path_to_custom_lib_directory>/*.  This will add everything in the custom_lib directory to the Spark class path.  The updated file should look like this:

#########################################################
# Spark additional runtime properties, see
# https://spark.apache.org/docs/1.0.0/configuration.html
# for examples
#########################################################

spark.executor.extraClassPath=/opt/bdd/edp/lib/custom_lib/*

It is important to note that if there is already an entry in sparkContext.properties for the spark.executor.extraClassPath property, any libraries referenced by that entry should be moved to the custom_lib directory so they are still included in the Spark class path.

Run the custom transform

Now that the script has been created and added to the Spark class path, everything is in place to run the custom transform in BDD.  To try it out, open the Transform tab in BDD and click on the Show Transformation Editor button.  In this example, we are going to create a new field called custom with the type String:

Create new attribute

Now in the editor window, we need to reference the custom script:

Transform Editor

The runExternalPlugin() method is used to reference the custom script.  The first argument is the name of the Groovy script.  Note that the value above is 'CustomTransform.groovy' and not 'CustomTransform.jar'.  The second argument is the field to be passed as an input to the script (this is what gets assigned to args[0] in pluginExec()).  In the case of the “Hello World” example, the input isn’t used, so it doesn’t matter what field is passed here.  However, in the first example that returned an upper cased version of the input field, the script above would return an upper cased version of the key field.

One of the nice features of the built-in transform functions is that they make it possible to preview the transform changes before committing.  With these custom scripts, however, it isn’t possible to see the results of the transform before running the commit.  Clicking preview will just return blank results for all fields, as seen in the example below:

Example of custom transform preview

The last thing to do is click ‘Add to Script’ and then ‘Commit to Project’ to kick off the transformation process.  Below are the results of the transform.  As expected, a new custom field has been added to the data set and the value “Hello World” has been inserted for every record.

Transform results

This tutorial just hints at the possibilities of utilizing custom transformations with Groovy and external Java libraries in BDD.  Stay tuned for the second post on this subject, when we will go into detail with some real world use cases.

If you would like to learn more about Oracle Big Data Discovery and how it can help your organization, please contact us at info [at] ranzal.com or share your questions and comments with us below.

Bringing Data Discovery to Hadoop – Part 3

In our last post, we talked about some of the tools in the Hadoop ecosystem that Oracle Big Data Discovery takes advantage of to do its work — namely Hive and Spark. In this post, we’re going to delve a little deeper into how BDD integrates with data that is already sitting in Hive, how it can write transformed data back to HDFS, and how it can help give users new insights on that data.

Data Processing

BDD ships with a data processing tool that makes imports from Hive easy. Simply point it at the database and table(s) you would like to pull into BDD, and the application does the rest. Behind the scenes, the data processing utility launches Spark workers that read the targeted table’s data from HDFS into new Avro files. BDD then indexes the data in these files for easy discovery.

Spark at work.

Another feature of BDD’s data processing is that it can be set to auto-detect new tables that are created in a Hive database, keeping it in sync with Hive. The BDD Hive Table Detector automatically launches a workflow to import a table whenever one is created. Currently, BDD doesn’t yet support updates to existing tables, but we hope to see that feature in a future release.

One thing to note: depending on the size of the table, BDD may import only a sampling of its data for discovery purposes. By default, the application’s record threshold is set to one million. This is in order to keep any analysis of a particular collection as interactive as possible while maintaining a relatively dependable and accurate view. For most intents and purposes, this default setting should be enough. However, the threshold can be increased if necessary. Ultimately, the amount of data sampling to use has to be a balance between an individual’s needs and the computing resources available to them.

Exporting Back to Hive

A unique component of BDD is its ability to throw data back to Hadoop once you have it in a state that you are satisfied with or would like to share with other users. We have some campaign funding data to work with as a test case:

The Chicago mayor’s race has been getting some attention due to an unexpected underdog forcing incumbent Rahm Emanuel to a runoff. In the dashboard we put together, the challenger, Chuy Garcia, is wildly out-funded compared to Emanuel.

Creating this application involved pulling campaign spending data for Illinois from electionmoney.org, importing it into BDD, and then joining a couple tables together and cleaning it all up using the transform tools contained within the application.

Now let’s say we wanted to export the results of this work — these joined, transformed data sets — for other users to query for themselves in Hive. We can do that with a simple, built-in export feature that can write our denormalized data set back to HDFS.

With a few quick clicks, BDD can create Avro-formatted files, write them to our Hadoop cluster, and then create the corresponding Hive table automatically.

This particular feature adds a lot of flexibility and opportunity for collaboration in teams where members span a wide range of skills. You can imagine users on the business side and technical side of a company throwing data sets back and forth between each other, sharing insights in a natural way that might have been much more difficult to accomplish in other environments.

That concludes our three-part look at Oracle Big Data Discovery. As we’ve said before, there is a lot to be excited about and we believe the application offers a viable data discovery solution to organizations running data in Hadoop, as well as those who are interested in creating first-time clusters.

For more information or guidance on how BDD could help your organization, contact us at info [at] ranzal.com.

Bringing Data Discovery To Hadoop – Part 2

The most exciting thing about Oracle Big Data Discovery is its integration with all the latest tools in the Hadoop ecosystem. This includes Spark, which is rapidly supplanting MapReduce as the processing paradigm of choice on distributed architectures. BDD also makes clever use of the tried and tested Hive as a metadata layer, meaning it has a stable foundation on which to build its complex data processing operations.

In our first post of this series, we showcased some of BDD’s most handy features, from its streamlined UI to its very flexible data transformation abilities. In this post, we’ll delve a little deeper into BDD’s underlying mechanics and explain why we think the application might be a great solution for Hadoop users.

Hive

Much of the backbone for BDD’s data processing operations lies in Hive, which effectively acts as a robust metastore for BDD. While operations on the data itself are not performed using Hive functions (which currently run on MapReduce), Hive is a great way to store and retrieve information about the data: where it lives, what it looks like, and how it’s formatted.

For organizations that are already running data in Hive, the integration with BDD couldn’t be simpler. The application ships with a data processing tool that can automatically import databases and tables from Hive, all while keeping data types intact. The tool can also sync up with a Hive database so that when new tables are created a user can automatically work with that data in BDD. If a table is dropped, BDD deletes that particular data set from its index. Currently, the 1.0 version doesn’t support updates to existing Hive tables, but we hope to see that feature in an upcoming release.

BDD can also upload data to HDFS and create a new table with that data in Hive to work with. It does this whenever a user uploads a file through the UI. For example, here’s what we saw in Hive with the consumer complaints data set from the last post after BDD imported it:

Example of an auto-generated Hive table by BDD

This easy integration with Hive makes BDD a good option both for experienced Hadoop users who are already using Hive and for less technical users.

Spark

While Hive provides a solid foundation for BDD’s operations, Spark is the workhorse. All data processing operations are run through Spark, which allows BDD to analyze and transform data in-memory. This approach effectively sidesteps the launching of slower MapReduce jobs through Hive and gives the processing engine direct access to the data.

When a user commits a series of transforms to a data set via the BDD UI, those transforms are interpreted into a Groovy script that is then passed to Spark through an Oozie job. In that generated script, for example, date strings are converted to datetime objects behind the scenes.

After Spark has done its handiwork, the data is then written out to HDFS as a new set of files, serialized and compressed in Avro. The original collection stays intact in another location in case we want to go back to it in the future.

From this point, the data is then loaded into the Dgraph.

Dgraph

The Dgraph is basically an in-memory index, and is what enables the real-time, dynamic exploration of data in BDD. This concept might be familiar to those who have used Oracle Endeca Information Discovery, where the Dgraph also played a key role, and this lineage means BDD inherits some very nice features: quick response, keyword search, impromptu querying, and the ability to unify metrics, structured data, and unstructured data in a single interface. The biggest difference now is that users can apply these real-time search and analytic capabilities to data sitting on Hadoop.

We think the marriage of this kind of discovery application with Hadoop makes a lot of sense. For starters, Hadoop has enabled organizations to store vast amounts of data cheaply without necessarily knowing everything about its structure and contents. BDD, meanwhile, offers a solution to indexing exactly this kind of data — data that is irregular, inconsistent and varied.

There’s also the issue of access. Currently, most data tools in the Hadoop ecosystem require a moderate level of technical knowledge, meaning wide swaths of an organization might have little to no view of all that data on HDFS. BDD offers a system to connect more people to that data, in a way that’s straightforward and intuitive.

If you would like to learn more about Oracle Big Data Discovery and how it might help your organization, please contact us at info [at] ranzal.com.