Wednesday, 20 September 2017

Machine Learning in Oracle Database – Classification of Conference Abstracts based on Text Analysis

Machine Learning is hot. The ability to have an automated system predict, classify, recommend and even decide based on models derived from past experience is quite attractive. And with the number of obvious applications of machine learning – Netflix and Amazon recommendations, intelligent chat bots, license plate recognition in parking garages, spam filters in email servers – the interest further grows. Who does not want to apply machine learning?

This article shows that the Oracle Database (platform) – with the Advanced Analytics option – is perfectly capable of doing ‘machine learning’. And has been able to do such learning for many years. From the comfort of their SQL & PL/SQL zone, database developers can play data scientists. The challenge is as follows:

For the nlOUG Tech Experience 2017 conference, we have a set of about 90 abstracts in our table (title and description). 80 of these abstracts have been classified into the conference tracks, such as DBA, Development, BI & Warehousing, Web & Mobile, Integration & Process. For about 10 abstracts, this classification has not yet been done – they do not currently have an assigned track. We want to employ machine learning to determine the track for these unassigned abstracts.

The steps we will go through to solve this challenge:

◉ Create a database table with the conference abstracts – at least columns title, abstract and track
◉ Create an Oracle Text policy object
◉ Specify the model configuration settings
◉ Create the model using the model settings and text transformation instructions to DBMS_DATA_MINING.CREATE_MODEL.
◉ Test the model/Try out the model – in our case against the currently unassigned conference abstracts

The volume of code required for this is very small (less than 30 lines of PL/SQL). The time it takes to go through this is very limited as well. Let’s see how this works. Note: the code is in a GitHub repository: https://github.com/lucasjellema/theArtOfMachineLearning.

Note: from the Oracle Database documentation on text mining:

Text mining is the process of applying data mining techniques to text terms, also called text features or tokens. Text terms are words or groups of words that have been extracted from text documents and assigned numeric weights. Text terms are the fundamental unit of text that can be manipulated and analyzed.

Oracle Text is a Database technology that provides term extraction, word and theme searching, and other utilities for querying text. When columns of text are present in the training data, Oracle Data Mining uses Oracle Text utilities and term weighting strategies to transform the text for mining. Oracle Data Mining passes configuration information supplied by you to Oracle Text and uses the results in the model creation process.

Create a database table with the conference abstracts


I received the data in an Excel spreadsheet. I used SQL Developer to import the file and create a table from it. I then exported the table to a SQL file with DDL and DML statements.

Oracle Database, Oracle Java Certification, Oracle Java Tutorial and Material, Oracle Java Guides, Oracle Java Learning, Oracle Java Text Analysis

Create an Oracle Text policy object


An Oracle Text policy specifies how text content must be interpreted. You can provide a text policy to govern a model, an attribute, or both the model and individual attributes.

DECLARE
  l_policy     VARCHAR2(30):='conf_abstrct_mine_policy';
  l_preference VARCHAR2(30):='conference_abstract_lexer';
BEGIN
  ctx_ddl.create_preference(l_preference, 'BASIC_LEXER');
  ctx_ddl.create_policy(l_policy, lexer => l_preference);
END;

Note: the database user you use for this requires two system privileges from the DBA: grant execute on ctx_ddl and grant create mining model

Specify the text mining model configuration settings


When the Data Mining  model is created with a PL/SQL command, we need to specify the name of a table that holds key-value pairs (columns setting_name and setting value) with the settings that should be applied.

Create this settings table.

CREATE TABLE text_mining_settings
  (
    setting_name  VARCHAR2(30),
    setting_value VARCHAR2(4000)
  );

Choose the algorithm to use for classification – in this case Naïve Bayes. Indicate the Oracle Text policy to use – in this case conf_abstrct_mine_policy- through INSERT statements.

DECLARE
  l_policy     VARCHAR2(30):='conf_abstrct_mine_policy';
BEGIN
  -- Populate settings table
  INSERT
  INTO text_mining_settings VALUES
    (
      dbms_data_mining.algo_name,
      dbms_data_mining.algo_naive_bayes
    );
  INSERT
  INTO text_mining_settings VALUES
    (
      dbms_data_mining.prep_auto,
      dbms_data_mining.prep_auto_on
    );
  INSERT
  INTO text_mining_settings VALUES
    (
      dbms_data_mining.odms_text_policy_name,
      l_policy
    ); 
  COMMIT;
END;


Pass the model settings and text transformation instructions to DBMS_DATA_MINING.CREATE_MODEL

I do not like the elaborate, unintuitive syntax required for creating model. I do not like the official Oracle Documentation around this. It is not as naturally flowing as it should be, the pieces do not fit together nicely. It feels a little like the SQL Model clause – something that never felt quite right to me.

Well, this is how it is. To specify which columns must be treated as text (configure text attribute) and, optionally, provide text transformation instructions for individual attributes, we need to use a dbms_data_mining_transform.TRANSFORM_LIST object to hold all columns and/or SQL expressions that contribute to the identification of each record. The attribute specification is a field (attribute_spec) in a transformation record (transform_rec). Transformation records are components of transformation lists (xform_list) that can be passed to CREATE_MODEL. You can view attribute specifications in the data dictionary view ALL_MINING_MODEL_ATTRIBUTES.

Here is how we specify the text attribute abstract:

dbms_data_mining_transform.SET_TRANSFORM( xformlist, ‘abstract’, NULL, ‘abstract’, NULL, ‘TEXT(TOKEN_TYPE:NORMAL)’);

where xformlist is a local PL/SQL variable of type dbms_data_mining_transform.TRANSFORM_LIST.

In the call to create_model, we specify the name of the new model, the table (of view) against which the model is to be built, the target column name for which the model should predict the values, the name of the database table with the key value pairs holding the settings for the model and the list of text attributes:

DECLARE
  xformlist dbms_data_mining_transform.TRANSFORM_LIST;
BEGIN
  -- add column abstract as a column to parse and use for text mining
  dbms_data_mining_transform.SET_TRANSFORM( xformlist, 'abstract', NULL, 'abstract', NULL, 'TEXT(TOKEN_TYPE:NORMAL)');
  dbms_data_mining_transform.SET_TRANSFORM( xformlist, 'title', NULL, 'title', NULL, 'TEXT(TOKEN_TYPE:NORMAL)');
  DBMS_DATA_MINING.CREATE_MODEL( model_name => 'ABSTRACT_CLASSIFICATION'
  , mining_function => dbms_data_mining.classification
  , data_table_name => 'OGH_TECHEXP17'
  , case_id_column_name => 'title'
  , target_column_name => 'track'
  , settings_table_name => 'text_mining_settings'
  , xform_list => xformlist);
END;

Oracle Data Miner needs to have one attribute that identifies each records; the name of the column to use for this is passed as the case id.

Test the model/Try out the model – in our case against the currently unassigned conference abstracts


Now that the model has been created, we can make use of it for predicting the value of the target column for selected records.

First, let’s have the model classify the abstracts without track:

SELECT title
,      PREDICTION(ABSTRACT_CLASSIFICATION USING *) AS predicted_target
,      abstract
FROM   OGH_TECHEXP17
where  track is null

Oracle Database, Oracle Java Certification, Oracle Java Tutorial and Material, Oracle Java Guides, Oracle Java Learning, Oracle Java Text Analysis

We can use the model also to classify data on the fly, like this (using two abstracts from a different conference that are not stored in the database at all):

with sessions_to_judge as
( select 'The Modern JavaScript Server Stack' title
  , 'The usage of JavaScript on the server is rising, and Node.js has become popular with development shops, from startups to big corporations. With its asynchronous nature, JavaScript provides the ability to scale dramatically as well as the ability to drive server-side applications. There are a number of tools that help with all aspects of browser development: testing, packaging, and deployment. In this session learn about these tools and discover how you can incorporate them into your environment.' abstract
  from dual
  UNION ALL
  select 'Winning Hearts and Minds with User Experience' title
  , 'Not too long ago, applications could focus on feature functionality alone and be successful. Today, they must also be beautiful, responsive, and intuitive. In other words, applications must be designed for user experience (UX) because when they are, users are far more productive, more forgiving, and generally happier. Who doesnt want that? In this session learn about the psychology behind what makes a great UX, discuss the key principles of good design, and learn how to apply them to your own projects. Examples are from Oracle Application Express, but these principles are valid for any technology or platform. Together, we can make user experience a priority, and by doing so, win the hearts and minds of our users. We will use Oracle JET as well as ADF and some mobile devices and Java' abstract
  from dual
)
SELECT title
,      PREDICTION(ABSTRACT_CLASSIFICATION USING *) AS predicted_target
,      abstract
FROM   sessions_to_judge

Oracle Database, Oracle Java Certification, Oracle Java Tutorial and Material, Oracle Java Guides, Oracle Java Learning, Oracle Java Text Analysis

Both abstracts are assigned tracks within the boundaries of the model. If these abstracts were submitted to the Tech Experience 2017 conference, they would have been classified like this. It would be interesting to see which changes to make to for example the second abstract on user experience in order to have it assigned to the more fitting Web & Mobile track.

One final test: find all abstracts for which the model predicts a different track than the track that was actually assigned:

select *
from ( SELECT title
       ,      track 
       ,      PREDICTION(ABSTRACT_CLASSIFICATION USING *) AS predicted_track
       FROM   OGH_TECHEXP17
       where  track is not null
     )
where track != predicted_track

Oracle Database, Oracle Java Certification, Oracle Java Tutorial and Material, Oracle Java Guides, Oracle Java Learning, Oracle Java Text Analysis

Seems not unreasonable to have a second look at this track assignment.