Machine Learning and You: K-Means Clustering

We are seeing tremendous interest from people wanting to learn more about machine learning, AI, data mining, and predictive analytics. Our presentations on these topics are often among the best attended at conferences. I believe there are two fundamental drivers: the first is a sincere desire to learn about new things and new capabilities. The second is a very real fear of not knowing about machine learning while others (particularly higher-level executives) expect that you do. We see this especially with people involved in business intelligence and analytics on the business side, and also with IT managers and executives who are expected to explain their strategy or road map for machine learning and advanced analytics. However, the observation that machine learning is a lot like teenage sex is fairly apt: everyone talks about it, but few actually do it. And many of those who do do it don't really know what they are doing.

Here are some of the things we try to emphasize to get people over their fear.

  • These are just calculations like thousands of other calculations in your system.
  • You typically don’t get only one answer, but a series of answers that can be considered together to provide context on how to use them.
  • You don’t have to know the specifics of the calculation to be able to understand and use the results.
  • You likely understand the concepts, but the words that are used are unfamiliar and sound hard. Once you learn some new vocabulary, it seems like a normal part of your system.
  • It’s helpful to be able to explain the basics to others. No one likes getting “challenge questions” from their superiors and being unable to answer them. This fluency frees analysts to use the powerful and accessible machine learning tools in Oracle Analytics Cloud.

For example, let’s look at the K-Means clustering algorithm. Clustering helps reveal similarities and differences in sets of data elements such as customers, orders, locations, and products. Whenever you have a lot of something and want to know “what are the natural patterns of similarity and difference, and how can we organize this very large set into smaller sets of like elements that are distinct from one another?”, clustering is the tool to reach for.

A very common business use case is to cluster the set of customers that a business has into segments that are differentiated from one another. We do this all the time with business intelligence queries. We separate customers and aggregate the results by all kinds of attributes. We might look at customers by location and separate them by state or zip code and then look at the aggregated results to see where customers are buying the most and the least. We might look at customers by what products or services they purchased or when they made their purchases. We might use descriptive attributes such as number of employees, or how long they’ve been a customer, or whether they purchase online or in-store. What is common for all of these analyses is that the analyst is determining how to organize the groups of customers. We might have dozens or even hundreds of possible attributes which we could use.

Clustering offers us a methodology for letting the data tell us where the commonalities exist across many dimensions and many attributes. There is no right and wrong in clustering, just a mathematical calculation and a cluster assignment. But there is also a lot of other information we can get about both the clusters and the members of each cluster. How many members are in each cluster? What is the average or center of each cluster? How close or far apart are those cluster centers from each other? How close is each member of a cluster to its center? Which of the attributes contributes the most in determining cluster assignments?
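
To make those diagnostic questions concrete, here is a minimal sketch of how you might compute them. This uses scikit-learn and NumPy with invented toy data rather than Oracle Analytics Cloud (which exposes the same ideas through its own interface), so treat the specific numbers and column meanings as assumptions for illustration only.

```python
# Sketch of the cluster diagnostics described above, on made-up "customer" data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: two pre-scaled attributes (say, tenure and order count),
# drawn as two well-separated groups of 50 customers each.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# How many members are in each cluster?
members = np.bincount(km.labels_)
# What is the center of each cluster?
centers = km.cluster_centers_
# How far apart are the cluster centers from each other?
center_gap = np.linalg.norm(centers[0] - centers[1])
# How close is each member to its own cluster's center?
dist_to_center = np.linalg.norm(X - centers[km.labels_], axis=1)

print(members, round(float(center_gap), 2), round(float(dist_to_center.mean()), 2))
```

Each of these quantities answers one of the questions above: cluster sizes, cluster centers, the separation between centers, and how tightly each member sits around its center.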

The K-Means clustering algorithm is simply a method for organizing elements by which ones are close together: it is a procedure for minimizing the total distance between a given number of centroids (cluster centers) and the data elements. To be fair, there are several different versions of K-Means, but fundamentally it is a distance-minimization algorithm. So all we’re doing is grouping customers into a given number of segments, using math and the data itself to make the assignments rather than a human-stated rule.
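
To show just how simple the core idea is, here is a bare-bones sketch of the classic version of K-Means (Lloyd's algorithm) in plain NumPy, on invented data. This is an illustration of the distance-minimization loop, not the tuned implementation a production tool would use.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: alternately assign each point to its
    nearest centroid, then move each centroid to the mean of its members."""
    rng = np.random.default_rng(seed)
    # Start the centroids at k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two obvious groups of toy points; the loop should pull a centroid to each.
X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (30, 2)),
               np.random.default_rng(2).normal(4, 0.5, (30, 2))])
labels, centroids = kmeans(X, k=2)
```

Both steps only ever reduce (or leave unchanged) the total distance between points and their assigned centroids, which is why the whole thing amounts to distance minimization.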

If you’re wondering how else we could determine which element goes to which cluster besides distance, there are two other basic methods: drawing straight lines through the data space to separate elements into areas of similarity, and calculating the density of areas to determine regions of similarity. Sounds easier than Orthogonal Partitioning and Expectation Maximization, right? But more about those algorithms later.

Dan and Tim Vlamis are Presenting at GLOC 2018!

This year the Great Lakes Oracle Conference will take place on May 16th and 17th at the Cleveland Public Auditorium in Cleveland, Ohio. This two-day conference is packed full of informative technical sessions suitable for all experience levels.

Dan Vlamis and Tim Vlamis have been accepted to present this year and will be involved in the following sessions:

May 16, 2018

Registration for the Great Lakes Oracle Conference is available by visiting the conference website.

Let us know if you are planning to attend GLOC 18; we would love to meet with you during the conference!

For more information on our presentations visit our website.

Join us at Collaborate 2018!

We are very excited to be attending and speaking at Collaborate 2018! Collaborate takes place from April 22 to 26, 2018 at the Mandalay Bay Resort & Casino in Las Vegas, NV.

This year our presentations include:

We would love to set aside some time to meet with you personally during the conference. Send us an email and let us know what days you are available, and we'll reach out to schedule a meeting!

More information can be found at:

Analytics and Data Summit 2018

BIWA Summit returns in March of 2018 with a new name! This year the BIWA SIG has re-branded its flagship event as the "Analytics and Data Summit" to better encompass the technologies and strategies that will be covered. The Analytics and Data Summit is the premier conference focused exclusively on analytics, business intelligence, data warehousing, and machine learning/predictive analytics. Attendees include a mixture of Oracle customers, including IT staff, executives, functional business users, consultants, and Oracle Product Management. If you are looking to network with passionate experts, this is your conference!

Analytics and Data Summit will take place from March 20 - 22 at the Oracle Conference Center at Oracle Headquarters in Redwood City, California. The full schedule for this event is available now.

Vlamis Software Solutions will be involved in the following presentations:

Tuesday, March 20

Wednesday, March 21

Thursday, March 22

More information is available on our Presentations Page.

Registration is open at http://analyticsanddatasummit.org/register/.

Creating a Property Graph on Oracle Database

Oracle Database 12c now includes property graphs, a great feature to analyze relationships in your data, such as finding the most influential people in a social network or discovering patterns of fraud in financial transactions. 

Graph analysis can do things that the relational model can't do, or can't do without very complicated queries and expensive table joins. For someone who has never touched a property graph before, it can be a little intimidating to get started. It doesn’t help that most examples seem to start big and assume that you already know what you are doing. So, for people who are interested in exploring the new property graph functionality of Oracle Database but don’t have a clue how to get started, I thought it would be helpful to put together an example that assumes you don’t know anything.
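
Before diving into the database itself, here is a hypothetical plain-Python sketch (deliberately not the Oracle property graph API, and using an invented toy network) of the kind of question graph analysis makes easy: who is the most connected person in a small social network?

```python
# Toy social network as a list of (person, follows) edges; all names invented.
from collections import defaultdict

edges = [("ann", "bob"), ("cal", "bob"), ("dee", "bob"),
         ("bob", "ann"), ("cal", "dee")]

# Count each person's connections (degree), incoming and outgoing.
degree = defaultdict(int)
for src, dst in edges:
    degree[src] += 1   # outgoing connection
    degree[dst] += 1   # incoming connection

most_influential = max(degree, key=degree.get)
print(most_influential)  # bob, with 4 connections
```

Expressing the same "most connected" question in SQL over an edge table means self-joins and unions; in a graph model it is a one-line traversal, which is exactly the gap the property graph feature fills.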