Using the Oracle Database for an Analytic Warehouse

Dan and Tim Vlamis, Vlamis Software Solutions, Inc.

November 10, 2016

Introduction:

Data warehouses have evolved since their heyday in the 1980s. Many people have given up on the dream of capturing and codifying into one place all the internal data for an organization and consider the task to be too difficult. In the quest to capture everything, they have focused almost exclusively on how to get data into a data warehouse through ETL processes. Instead, we believe the value lies not in storing data, but rather in organizing it, using it to calculate new measures, and facilitating its presentation to users. By focusing on results and uses, you can achieve what organizations truly care about: fast, deep, meaningful analyses.

Comparison between Data Warehouse and Analytic Warehouse:

A traditional data warehouse focuses on storing as much business data as possible. We propose identifying the data deemed most useful for affecting business decisions and focusing on analyzing that data. We call a purpose-built warehouse of data for analysis an analytic warehouse. Most processing in a traditional data warehouse is done externally through ETL (extract, transform, load). Most processing in an analytic warehouse is done internally by processes that:

  • define relationships and structures to organize data elements
  • create new measures, fields, and structures through analytical processing

Additional differences between traditional data warehouses and analytic warehouses sometimes include the following:

  •  In traditional data warehouses, one of the most important metadata facts about a particular data element is its source or provenance. A data element’s importance is directly related to how recently it was stored and, sometimes, to how much it has been replicated. In analytic warehouses, the most important information about a particular data element is its use, either in a subsequent analytic process or as a raw data element called up for presentation. The more times an element is used in a calculation or is called for presentation to a user, the more important it is.
  •  In the case when traditional data warehouses are significantly larger than their sources, it’s almost always because they are replicating data and storing it multiple times. Analytic warehouses can also be larger than their sources, but when they are, it’s almost always because they are creating new data elements and structures through descriptive, diagnostic, predictive, and prescriptive analytic workflows and machine learning techniques.
  • Traditional data warehouses are defined by the inclusion of data. Analytic warehouses often make calls on external data stores for analytic workflows. There is often no need to store the data in the warehouse; just use it to produce new data elements and structures. This emphasis on data federation and analytic processing is a key difference between traditional data warehouses and analytic warehouses.
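
As a sketch of this federated style (the table and database link names here are hypothetical), an analytic warehouse might keep only its derived forecasts locally and pull supporting detail from a source system over a database link at query time, rather than replicating it:

-- Local derived data joined to detail fetched on demand from the source
-- ERP system over a database link, instead of copying it into the warehouse
SELECT f.customer_id,
       f.predicted_sales,
       o.open_order_amt
FROM   customer_forecast f
JOIN   (SELECT customer_id, SUM(order_amt) AS open_order_amt
        FROM   orders@erp_link
        WHERE  status = 'OPEN'
        GROUP  BY customer_id) o
  ON   o.customer_id = f.customer_id;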

In short, traditional data warehouses are about getting all available data in for storage; analytic warehouses are about getting valuable data out for presentation.

Description of Analytic Warehouse:

Analytic warehouses are characterized by the following:

  • Organization around logical structures designed for analysis
  • A distinction between the processing/query engine (which can reach out to many sources) and the storage layer (which contains the most valuable and frequently accessed data)
  • Lots of derived measures, comparative values, and the generation of new data elements and structures
  • Emphasis on relationships, hierarchies, and structures (both discovered and assigned) within and between data elements
  • Emphasis on the fast processing and delivery of queries
  • Ability to federate data and execute queries and analytic processes in external data storage systems
  • Ability to perform complex statistical, graphical, and higher mathematical processes in parallel

Analytic warehouses often contain dimensional structures that organize data into hierarchical taxonomies and include aggregations at different levels. Facilitating the fast presentation of data at different levels of aggregation is a primary purpose of analytic warehouses. While counting and aggregation are important internal processes, newer machine learning, statistical analytics, and data mining functions are now critical to delivering value to business users.
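
To make the multi-level aggregation point concrete, here is a minimal sketch (the fact and dimension table names are hypothetical) of how a single SQL pass can produce subtotals at the month, quarter, and year levels of a time hierarchy:

-- ROLLUP returns one result set containing month, quarter, year, and
-- grand-total aggregations, matching the levels of a simple time hierarchy
SELECT t.year, t.quarter, t.month,
       SUM(f.sales_amt) AS sales
FROM   sales_fact f
JOIN   time_dim   t ON t.time_id = f.time_id
GROUP  BY ROLLUP (t.year, t.quarter, t.month);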

By focusing on a subset of an organization’s data, analytic warehouses can be surprisingly easier to build than full data warehouses. Compared with a full data warehouse, 80% of the value may be obtained with only 20% of the data. If data elements are not often used for analysis, it’s better to leave them in their source systems and use them through an external process on the rare occasions they are needed.

Use of Analytic Warehouse:

Whereas a traditional data warehouse supports historic reporting, an analytic warehouse focuses on predictive and prescriptive analytics. Analytic warehouses may be used for historic reporting as well, but they are designed to predict the future. Often the emphasis is on additional calculated measures, forecasts, or predictions. Depending on the technology used, this may mean that measures in an analytic warehouse do not have the same precision as data in a data warehouse. For example, a data warehouse may report that a customer bought $23,753.03 worth of gadgets in the past year, but in an analytic warehouse, we may compute that next year’s gadget sales will be between $24,000 and $28,000, with a 95% probability associated with that assertion. Since this is not truly known, we may end up calculating this value on the fly instead of storing an actual value. Even though the value is an estimate, the fact that it is directionally higher than last year may be very valuable.
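
As an illustration only (the GADGET_SALES table and its columns are made up, and a proper 95% prediction interval would require more statistical machinery, such as the Advanced Analytics option discussed below), a simple on-the-fly projection could be computed from a linear trend using Oracle's built-in regression aggregates:

-- Fit a least-squares line through yearly sales per customer and project
-- it one year forward; computed at query time rather than stored
SELECT customer_id,
       REGR_SLOPE(sales_amt, sales_year) * 2017
         + REGR_INTERCEPT(sales_amt, sales_year) AS projected_2017_sales
FROM   gadget_sales
GROUP  BY customer_id;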

An important use of an analytic warehouse is comparison. Historic reporting often emphasizes precision and absolute values in isolation. Analytic warehouses emphasize comparisons within a context. Sometimes the context is historical (e.g., sales are trending up) and sometimes the context is in the future (e.g., sales for California are likely to exceed Texas in two years).

Oracle Database and Options for the Analytic Warehouse

An Oracle database is an excellent foundation for an analytic warehouse. In addition to the traditional relational data store, modern Oracle databases can store spatial data, dimensions, and cubes, and can calculate analytics directly in the database, eliminating the need to move data between servers. Specifically, the Advanced Analytics option of Oracle Database Enterprise Edition has a host of embedded analytic algorithms such as clustering, regression, attribute importance, and more. Moreover, Oracle R Enterprise brings the power of the R language to the Oracle Database. With the Database In-Memory option, tables can be placed in memory, and special memory-optimized algorithms can aggregate data as it is being read for extremely fast aggregation on the fly. Analytic views (new in release 12.2) enable OLAP-style calculations on fact tables without having to store data in a multidimensional format. Oracle Database can process XML and JSON data and, through Big Data SQL, can interact with external data stores such as Hadoop and NoSQL and integrate Apache Spark processes. This also allows Oracle Database’s advanced security features (e.g., VPD and redaction) to be used with these data sources. All of these capabilities make the Oracle Database the optimal vehicle for realizing the analytic warehouse.
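
For instance, a minimal sketch of in-database clustering with the Advanced Analytics option might look like the following (the CUSTOMER_FACTS table, its CUSTOMER_ID column, and the model and settings names are hypothetical):

-- Settings table requesting k-means with 5 clusters
CREATE TABLE kmeans_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000));

INSERT INTO kmeans_settings VALUES ('ALGO_NAME', 'ALGO_KMEANS');
INSERT INTO kmeans_settings VALUES ('CLUS_NUM_CLUSTERS', '5');

-- Build the model inside the database; no data leaves the server
BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'CUST_SEGMENTS',
    mining_function     => DBMS_DATA_MINING.CLUSTERING,
    data_table_name     => 'CUSTOMER_FACTS',
    case_id_column_name => 'CUSTOMER_ID',
    settings_table_name => 'KMEANS_SETTINGS');
END;
/

-- Score rows with the model directly in SQL
SELECT customer_id, CLUSTER_ID(cust_segments USING *) AS segment
FROM   customer_facts;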

Analytic Warehouses in the Oracle Cloud

The movement towards storing and processing data in the cloud makes analytic warehouses even more compelling.  The High Performance Package of the Oracle Database Cloud Service contains all of the options necessary for an analytic warehouse.  The Extreme Performance Package adds the Database In Memory option for additional performance.  An Oracle Database in the cloud can reach out to other cloud data sources, as well as external databases, without having to replicate data.  In addition, cloud databases enable solutions that can scale to any size without having to pre-determine storage and compute requirements for an analytic warehouse.

Formatting SQL with Notepad++

Tidbits

If you have ever had to format SQL in various editors of various flavors, you most likely have experienced severe frustration at the lack of ability to apply your personal formatting preferences to SQL.

I experienced just that issue not long ago trying to format TSQL to line break every 3rd comma. Thankfully, with the help of every programmer's best friend (Notepad++) and most programmers' source of utter befuddlement (regular expressions), I was able to solve the issue easily.

Imagine you have a SQL statement that looks like this -
select [Field1],[Field2],[Field3],[Field4],[Field5],[Field6]
    from [Some_Table]

but it's more like 65 fields, with a preceding insert statement and a trailing group by clause, and the editor you are using doesn't support formatting in any useful way. You could manually format it, do it all over again when you get another statement, and then jam a pen in your eyeball, or you could use the dark and mysterious art of regular expressions.

So to insert a line break every third comma, you just need to put your SQL into Notepad++, use the TextFX Unwrap text option to get it all on a single line, sacrifice a chicken while dancing backwards around 13 tuples chanting "Chamberlin Boyce Codd", and enter the following in the Find / Replace dialog in Notepad++ (with Search Mode set to Regular expression) -

Find what: ((.*?,.*?)(.*?,.*?)[,])
Replace with: \1\r\n

The ((.*?,.*?)(.*?,.*?)[,]) means find and select everything between the starting point and the 3rd comma, and the \1\r\n means replace it with itself and a newline.
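
Run against the example statement above (after unwrapping it to a single line), the result should come out looking something like this -

select [Field1],[Field2],[Field3],
[Field4],[Field5],[Field6] from [Some_Table]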

Why does it work? Magic of course. Hopefully this helps some poor soul.

If you want to learn more I borrowed my logic from -

Notepad++: A guide to using regular expressions and extended search mode

Replace every nth instance of a character with a new line in Notepad++

DVD 12.2.2 and OBIEE 12.2.1.2 New Features Webinar

Join Dan Vlamis, Oracle ACE Director, as he breaks down the exciting new features in Data Visualization Desktop 2.0 (internally known as 12.2.2) and the 12.2.1.2 release of Oracle Business Intelligence Enterprise Edition, available from Oracle starting October 2016.

Expect a fast-paced presentation with live demos of DVD and OBIEE on a new SampleApp image, covering as many of the new features as time allows, including:

DVD 2.0
• Improved data prep with Data Flows
• New more intuitive pages for creating data sources
• 25+ new data sources
• New smart insights feature to help you better understand data
• 8 new visualizations
• New BI Ask feature
• Multiple canvases with an easier-to-use interface
• Printing, email, and export to PDF and PowerPoint options
• New SDK for developing 1st Class Custom Visualizations

OBIEE 12.2.1.2
• Export/Import Visual Analyzer projects
• Single sign-on to Visual Analyzer

Just like our previous demonstrations of new BI releases (see www.vlamis.com/obiwebinars for a list), we will give a live demonstration of as many of these features as we have time for.

The DVD 12.2.2 and OBIEE 12.2.1.2 New Features Webinar is scheduled for Tuesday, November 22, 2016 at 11:00 AM Central. Registration for this webinar is now open.

Twitter Live Feed with Oracle Database as a Service and Business Intelligence Cloud Service

In this post I am going to show how you can very quickly get a Twitter live feed streaming into your Oracle Database as a Service (DBAAS) instance. Although this example will work on any machine with an Oracle client and for any Oracle Database instance, putting it on DBAAS makes it extremely easy to utilize our streaming Twitter data inside other Oracle Cloud applications such as Business Intelligence Cloud Service (BICS).

Acknowledgments:

  1. Thank you to the geniuses at Tweepy for making it so I don’t have to deal with OAuth or really anything related to the Twitter API. I don’t know who you are, but I love you! ♥
  2. Twitter so I never have to wonder what Kim or Kanye are thinking.
  3. My homey Przemyslaw Piotrowski (I don’t know him either, but his examples sustain my laziness in coding) for writing a super helpful series called The Mastering Oracle+Python Series, which I fully intend to finish someday.
  4. Whoever this dude is (https://pythonprogramming.net/twitter-api-streaming-tweets-python-tutorial/), whose example made this easy.

The great thing about DBAAS is that I actually have access to the underlying server, so I can take advantage of the operating system and languages like Python, as well as the database. Some might be offended by such interloping, but I’m not one of them, and as I said above, there isn’t any requirement to do this on the database server; I’m just taking advantage of a machine that has the cycles to spare.

I will assume that you already have a DBAAS instance up and running, are familiar with how to administer it, and are also familiar with using tools like MobaXterm to connect to it.

At the time of this example, our Database as a Service is running Oracle Linux 6.7 and version 12.1.0.2 of the database with all the bells and whistles. I needed to install five things to make this work:

1. pyliblzma, to make it so I can get the latest EPEL repository without having to deal with the (expletive) “Error: xz compression not available” message

wget https://kojipkgs.fedoraproject.org//packages/pyliblzma/0.5.3/3.el6/x86_64/pyliblzma-0.5.3-3.el6.x86_64.rpm

yum install pyliblzma-0.5.3-3.el6.x86_64.rpm

2. The latest EPEL repository for this version of Linux

wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm

yum install epel-release-6-8.noarch.rpm

3. Then I can install python-pip with ease

yum install python-pip

4. And then we can fetch Tweepy with equal ease

pip install tweepy

5. And last but not least, Oracle’s Python library for interacting with the database

pip install cx_Oracle

Now that we have all the chunks, make sure you can import the stuff you need with Python without getting any errors.

[oracle@testdrive-01 ~]$ python
Python 2.6.6 (r266:84292, Jul 23 2015, 05:13:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tweepy
>>> import cx_Oracle
>>>

If you have issues with cx_Oracle, you most likely just need to make sure the Oracle environment variables are set via .bashrc or however you like to set them.

I created a table in my database to store the JSON document that is the API response:

CREATE TABLE "TWITTER_USER"."EAT_MY_TWEET"
( "ID" NUMBER,
"USERNAME" VARCHAR2(500 BYTE),
"TIMEWHYUPUNISHME" TIMESTAMP (6) DEFAULT systimestamp,
"TWEET_JSON" CLOB,
CONSTRAINT "ENSURE_JSON" CHECK (TWEET_JSON is JSON) ENABLE
)

In order to use the Twitter API, you will have to register your app with them at https://apps.twitter.com, at which point they will give you the keys to make OAuth work.

Then all we need to do is write (by which I mean copy, paste, and modify) a little Python code and we are ready to rock:

#import libraries
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import cx_Oracle
import datetime
import json
import sys


#connection string for database
conn_str='WHODAT/Ap@ssw0rd@localhost:1521/MyPluggableDatabseServiceName'

#get me a connection
conn =cx_Oracle.connect(conn_str)

#turn on autocommit
conn.autocommit=1

#object for executing sql
c=conn.cursor()

#clob variable
bvar=c.var(cx_Oracle.CLOB)

#twitter api application keys
#consumer key, consumer secret, access token, access secret.
ckey='Dont'
csecret='Tell'
atoken='Anybody'
asecret='The Password'

#listen to the stream
class listener(StreamListener):

    #get some
    def on_data(self, data):
        try:

            #barf response into json object
            all_data = json.loads(data)

            #parse out tweet text, screen name, and tweet id
            tweet = all_data["text"]
            if (all_data["user"]["screen_name"]) is not None:
                username = all_data["user"]["screen_name"]
            else:
                username = 'No User'
            tid = all_data["id"]

            #set clob variable to json doc
            bvar.setvalue(0,data)
            try:
                #create sql string with bind for clob var
                sql_str="INSERT INTO EAT_MY_TWEET (ID,USERNAME,TWEET_JSON) Values("+str(tid)+",q'["+username.encode('utf-8').strip()+"]',:EATIT)"

                #insert into database
                c.execute(sql_str,[bvar])

            except Exception:
                sys.exc_clear()

            #watch tweets go by in console
            print((username,tweet))

            #in case you want to print response
            #print(data)
            return(True)
        except Exception:
            sys.exc_clear()

    #report any error status returned by the twitter api
    def on_error(self, status):
        print(status)

#log in to twitter api 
auth = OAuthHandler(ckey, csecret) 
auth.set_access_token(atoken, asecret)
 
#fire it up 
twitterStream = Stream(auth, listener()) 

#what to search for (not case sensitive)
#comma separated for words
#use # for hashtags
#phrases are words separated by spaces (like this comment)
twitterStream.filter(track=["ProveFilterWorks,Oracle,Vlamis,OBIEE,BICS,DVCS,Data Visualization Desktop"])

I named my file stream.py, so to execute it I just fire it up with nohup -

nohup python -u /home/oracle/stream.py>/home/oracle/stream.out 2>/home/oracle/stream.err &

If I tail my stream.out file with tail -f stream.out, I can watch the tweets go by as they are inserted into the database.

Now that I am inserting the JSON document that the Twitter API sends back to me into the database, I can use Oracle Database 12c support for JSON to expose the document in my table as a view using the following SQL:

CREATE OR REPLACE FORCE EDITIONABLE VIEW "TWITTER_USER"."LIVE_TWITTER_FEED" ("CST_DATE", "UTC_DATE", "UTC_HOUR", "UTC_MINUTE", "ID", "CREATED_ON", "SCREEN_NAME", "LOCATION", "FOLLOWERS_CNT", "FRIENDS_CNT", "LISTED_CNT", "FAVOURITES_CNT", "STATUSES_CNT", "RETWEET_CNT", "FAVOURITE_CNT", "URL", "PROFILE_IMAGE_URL", "BANNER_IMAGE_URL", "HASHTAGS", "TWEET", "EMT_TIMEWHYUPUNISHME") AS 
  SELECT cast(TO_TIMESTAMP_TZ(REPLACE(upper(b.created_on),'+0000','-00:00'),'DY MON DD HH24:MI:SS TZH:TZM YYYY')  at Time zone 'CST' as date) CST_DATE,
cast(TO_TIMESTAMP_TZ(REPLACE(upper(b.created_on),'+0000','-00:00'),'DY MON DD HH24:MI:SS TZH:TZM YYYY') as date) UTC_DATE,
cast(to_char(cast(TO_TIMESTAMP_TZ(REPLACE(upper(b.created_on),'+0000','-00:00'),'DY MON DD HH24:MI:SS TZH:TZM YYYY') as date),'HH24') as number) UTC_HOUR,
cast(to_char(cast(TO_TIMESTAMP_TZ(REPLACE(upper(b.created_on),'+0000','-00:00'),'DY MON DD HH24:MI:SS TZH:TZM YYYY') as date),'MI') as number) UTC_MINUTE,
a."ID",b."CREATED_ON",a."USERNAME",b."LOCATION",b."FOLLOWERS_CNT",b."FRIENDS_CNT",b."LISTED_CNT",b."FAVOURITES_CNT",b."STATUSES_CNT",
b."RETWEET_CNT",b."FAVOURITE_CNT",b."URL",b."PROFILE_IMAGE_URL",b."BANNER_IMAGE_URL",b."HASHTAGS",b."TWEET",
a.TIMEWHYUPUNISHME
FROM EAT_MY_TWEET a,
json_table(tweet_json,
'$' columns(
    id varchar(50) path '$.id',
    created_on varchar2(100) path '$.created_at',
    screen_name varchar2(200) path '$."user".screen_name',
    location varchar2(250) path '$."user"."location"',
    followers_cnt number path '$."user".followers_count',
    friends_cnt  number path '$."user".friends_count',
    listed_cnt  number path '$."user".listed_count',
    favourites_cnt  number path '$."user".favourites_count',
    statuses_cnt  number path '$."user".statuses_count',
    retweet_cnt number path '$.retweet_count',
    favourite_cnt number path '$.favorite_count',
     url varchar2(250) path '$."user"."url"',
    profile_image_url  varchar2(500) path '$."user".profile_image_url',
    banner_image_url varchar2(500) path '$."user".profile_banner_url',
     hashtags varchar2(500) format json with wrapper path '$.entities.hashtags[*].text',
      tweet varchar2(250) path '$.text'
    -- nested path '$.entities.hashtags[*]' columns (ind_hashtag varchar2(30) path '$.text'    )
   
)
) b

Then I can analyze the real-time Twitter data I’m interested in using a simple SQL statement:

SELECT * FROM LIVE_TWITTER_FEED ;
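
Or, since the real value is in the aggregates, something like tweet counts per hour against the view defined above:

-- Count tweets per UTC hour using the columns exposed by the view
SELECT utc_date, utc_hour, COUNT(*) AS tweet_cnt
FROM   live_twitter_feed
GROUP  BY utc_date, utc_hour
ORDER  BY utc_date, utc_hour;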

And while being able to hit real-time Twitter data with SQL is cool, the real goal is being able to surface that data inside of our BI tools. Here we combine Answers and Visual Analyzer on a dashboard to show the latest tweets and aggregate information over time, utilizing Oracle Business Intelligence Cloud Service (BICS).