Conditional Random Field

hanaml.CRF is an R wrapper for the SAP HANA PAL conditional random field (CRF) algorithm.

hanaml.CRF(
  data = NULL,
  used.cols = NULL,
  label = NULL,
  enet.lambda = NULL,
  tol = NULL,
  max.iter = NULL,
  lbfgs.m = NULL,
  thread.ratio = NULL,
  use.class.feature = NULL,
  use.word = NULL,
  use.ngrams = NULL,
  no.mid.ngrams = NULL,
  max.ngram.length = NULL,
  use.prev.word = NULL,
  use.next.word = NULL,
  use.disjunctive = NULL,
  disjunction.width = NULL,
  use.sequences = NULL,
  use.prev.sequences = NULL,
  use.type.seqs = NULL,
  use.type.seqs2 = NULL,
  use.type.ysequences = NULL,
  use.word.shape = NULL
)

Arguments

data

DataFrame
DataFrame containing the data.

used.cols

list of characters, optional
This parameter specifies the three columns used for training a conditional random field model: one column for the document ID, one for the word position, and one for the word itself.
If not NULL, this parameter can be specified in one of two ways:

(1) used.cols = list(document.id = "xxx", word.pos = "yyy", word = "zzz")
(2) used.cols = list("xxx", "yyy", "zzz")

In case (2), "xxx", "yyy" and "zzz" must be the names of the columns holding the document ID, word position and word, in that order.
Defaults to the first three non-label columns of data if not provided.
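
As a minimal sketch of the two forms described above, the following builds both lists in plain R. The column names DOC_ID, WORD_POSITION, WORD and LABEL are taken from the Examples section; the commented-out call assumes a HANA DataFrame named df with those columns and is not executed here.

```r
# Form (1): named list, so the order of the entries does not matter.
cols.named <- list(document.id = "DOC_ID",
                   word.pos    = "WORD_POSITION",
                   word        = "WORD")

# Form (2): unnamed list, interpreted positionally as
# document ID, word position, word.
cols.positional <- list("DOC_ID", "WORD_POSITION", "WORD")

# Either list would then be passed as used.cols (requires a HANA connection;
# df is an assumed DataFrame with the columns above):
# crf <- hanaml.CRF(data = df, used.cols = cols.named, label = "LABEL")
```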

label

character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.

enet.lambda

numeric, optional
Elastic-net penalization weight. The value should be greater than 0.
Defaults to 1.0.

tol

numeric, optional
Convergence tolerance in optimization (i.e. the L-BFGS algorithm).
Defaults to 1e-4.

max.iter

integer, optional
Maximum number of iterations in optimization (i.e. the L-BFGS algorithm).
Defaults to 1000.

lbfgs.m

integer, optional
Number of past updates to keep in memory for the L-BFGS algorithm.
Defaults to 25.

thread.ratio

double, optional
Controls the proportion of available threads that can be used by this function.
The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads.
Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored.
Defaults to 0.

use.class.feature

logical, optional
Whether to include a feature for the class, which is equivalent to having a bias vector in the model.
Defaults to TRUE.

use.word

logical, optional
Whether to use a feature for the current word.
Defaults to TRUE.

use.ngrams

logical, optional
Whether or not to make features from letter n-grams (i.e. substrings of the word).
Defaults to TRUE.

no.mid.ngrams

logical, optional
TRUE means not to include character n-gram features for n-grams that contain neither the beginning nor the end of the word.
Defaults to TRUE.

max.ngram.length

integer, optional
Threshold for the size of n-grams to be used in the model. Must be positive.
Defaults to 6.
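
To illustrate how use.ngrams, no.mid.ngrams and max.ngram.length interact, here is a rough sketch in plain R. The helper char.ngrams is hypothetical and not part of hana.ml.r; it only mirrors the descriptions above, and the feature extraction actually performed inside PAL may differ (for example in how word boundaries are marked).

```r
# Hypothetical helper (not part of hana.ml.r): enumerate the character
# n-grams of a word, up to max.ngram.length characters long.
char.ngrams <- function(word, max.ngram.length = 6, no.mid.ngrams = TRUE) {
  chars <- strsplit(word, "")[[1]]
  n <- length(chars)
  out <- character(0)
  for (len in seq_len(min(max.ngram.length, n))) {
    for (start in seq_len(n - len + 1)) {
      end <- start + len - 1
      # A "mid" n-gram touches neither the first nor the last character.
      is.mid <- start > 1 && end < n
      if (no.mid.ngrams && is.mid) next
      out <- c(out, paste(chars[start:end], collapse = ""))
    }
  }
  unique(out)
}

# With no.mid.ngrams = TRUE only prefixes and suffixes survive.
char.ngrams("oxygen", max.ngram.length = 3)
# With no.mid.ngrams = FALSE every substring up to the length limit is kept.
char.ngrams("abc", max.ngram.length = 2, no.mid.ngrams = FALSE)
```

Under these assumptions, a smaller max.ngram.length or no.mid.ngrams = TRUE reduces the number of candidate features per word.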

use.prev.word

logical, optional
Whether to make a feature from both the current word and the previous word.
Defaults to TRUE.

use.next.word

logical, optional
Whether to make a feature from both the current word and the next word.
Defaults to TRUE.

use.disjunctive

logical, optional
Whether to include features giving disjunctions of words anywhere within disjunction.width words to the left or right of the current word.
Defaults to TRUE.

disjunction.width

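integer, optional
Number of words to the left and to the right of the current word that are considered by the disjunctive features; see use.disjunctive.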
Defaults to 4.
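
The window implied by use.disjunctive and disjunction.width can be sketched in plain R as follows. The helper disjunctive.words is hypothetical and not part of hana.ml.r or PAL; it only shows which neighbouring words would fall inside the left and right windows, under the assumption that position within the window is ignored.

```r
# Hypothetical helper (not part of hana.ml.r): collect the words that fall
# within disjunction.width positions to the left and right of position pos.
disjunctive.words <- function(words, pos, disjunction.width = 4) {
  idx <- seq_along(words)
  list(
    left  = unique(words[idx < pos & idx >= pos - disjunction.width]),
    right = unique(words[idx > pos & idx <= pos + disjunction.width])
  )
}

# Words taken from the last rows of the example data in this page.
sentence <- c("on", "2", "liters", "of", "oxygen")
narrow <- disjunctive.words(sentence, pos = 3, disjunction.width = 1)
wide   <- disjunctive.words(sentence, pos = 3)  # default width of 4
```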

use.sequences

logical, optional
Whether or not to use class combination features.
Defaults to TRUE.

use.prev.sequences

logical, optional
Whether or not to use any class combination features using the previous class.
Defaults to TRUE.

use.type.seqs

logical, optional
Whether to use basic 0th order word shape features or not.
Defaults to TRUE.

use.type.seqs2

logical, optional
Whether to use additional 1st and 2nd order word shape features.
Defaults to TRUE.

use.type.ysequences

logical, optional
Whether or not to use some first order word shape patterns.
Defaults to TRUE.

use.word.shape

logical, optional
Whether or not to use word shape (e.g. capitalized or numeric). Only supports chris2UseLC currently.
Defaults to FALSE.

Value

An R6 object of class "CRF" with the following attributes and methods:

Attributes

model: DataFrame CRF model.
statistics: DataFrame Summary of the CRF model training process.
optim.param: DataFrame Optimal parameter of the CRF model. Reserved for future use and currently empty.

Methods

CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)

Usage:


   > crf <- hanaml.CRF(data=df)
   > crf$CreateModelState()

Arguments:

model: DataFrame
DataFrame containing the model for parsing.
Defaults to self$model.
algorithm: character
Specifies the PAL algorithm associated with model.
Defaults to self$pal.algorithm.
func: character
Specifies the functionality for Unified Classification/Regression.
Valid only for object instance of R6Class "UnifiedClassification" or "UnifiedRegression".
Defaults to self$func.
state.description: character
A summary string for the generated model state.
Defaults to "ModelState".
force: logical
Specifies whether or not to replace the existing state for model.
Defaults to FALSE.

After calling this method, an attribute state that contains the parsed info for model shall be assigned to the corresponding R6 object.

DeleteModelState(state=NULL)

Usage:
Assuming we have trained a hanaml model and created its model state, like the following:


   > crf <- hanaml.CRF(data=df)
   > crf$CreateModelState()

After using the model state for real-time scoring, we can delete the state by calling:


   > crf$DeleteModelState()

Arguments:

state: DataFrame
DataFrame containing the state info.
Defaults to self$state.

After calling this method, the specified model state shall be cleaned up and associated memory be released.

Details

Conditional random fields (CRFs) are a probabilistic framework for labeling and segmenting structured data, such as sequences. Training a CRF can be cast as a maximum likelihood problem. In PAL, the L-BFGS algorithm is used to maximize the (penalized) likelihood function.
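
For orientation, the textbook linear-chain form of this model can be written as below. This is a sketch under stated assumptions rather than the exact PAL formulation: the feature functions f_k, the weights theta_k and the penalty term R(theta) are generic symbols introduced here, and lambda is assumed to play the role of enet.lambda. Minimizing the second expression is equivalent to maximizing the penalized likelihood.

```latex
% Conditional probability of a label sequence y given an observed sequence x
p_\theta(y \mid x) = \frac{1}{Z_\theta(x)}
  \exp\Big( \sum_{t=1}^{T} \sum_{k} \theta_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z_\theta(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \theta_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)

% Penalized negative log-likelihood over N training sequences
\min_\theta \; -\sum_{i=1}^{N} \log p_\theta\big(y^{(i)} \mid x^{(i)}\big) + \lambda \, R(\theta)
```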

Examples

Input DataFrame data:


> data$Collect()
   DOC_ID WORD_POSITION        WORD            LABEL
1       1             1      RECORD                O
2       1             2     #497321                O
3       1             3    78554939                O
4       1             4           |                O
.......
88      3            29          on OxygenSaturation
89      3            30           2 OxygenSaturation
90      3            31      liters OxygenSaturation
91      3            32          of OxygenSaturation
92      3            33      oxygen OxygenSaturation

Call the function:


> crf <- hanaml.CRF(data = data, thread.ratio = 1.0,
                    enet.lambda = 0.1, max.iter = 1000, tol = 1e-4,
                    use.word.shape = FALSE, lbfgs.m = 25)

Output:


> crf$statistics$Collect()
         STAT_NAME          STAT_VALUE
1              obj 0.44251900977373015
2             iter                  22
3  solution status           Converged
4      numSentence                   2
5          numWord                  92
......
25         iter 19        obj=0.442519
26         iter 20        obj=0.442519
27         iter 21        obj=0.442519
28         iter 22        obj=0.442519
