Conditional Random Field

hanaml.CRF is an R wrapper for the SAP HANA PAL conditional random field (CRF) algorithm.

hanaml.CRF(
  data = NULL,
  used.cols = NULL,
  label = NULL,
  enet.lambda = NULL,
  tol = NULL,
  max.iter = NULL,
  lbfgs.m = NULL,
  thread.ratio = NULL,
  use.class.feature = NULL,
  use.word = NULL,
  use.ngrams = NULL,
  no.mid.ngrams = NULL,
  max.ngram.length = NULL,
  use.prev.word = NULL,
  use.next.word = NULL,
  use.disjunctive = NULL,
  disjunction.width = NULL,
  use.sequences = NULL,
  use.prev.sequences = NULL,
  use.type.seqs = NULL,
  use.type.seqs2 = NULL,
  use.type.ysequences = NULL,
  use.word.shape = NULL
)

Arguments

data	`DataFrame` DataFrame containing the data.
used.cols	`list of character, optional` This parameter specifies the three columns used for training a conditional random field model: one column for the document ID, one for the word position, and one for the word. If not NULL, it can be specified in one of two ways: (1) as a named list, used.cols = list(document.id = "xxx", word.pos = "yyy", word = "zzz"); (2) as an unnamed list, used.cols = list("xxx", "yyy", "zzz"). In case (2), "xxx", "yyy" and "zzz" must be the names of the document ID, word position and word columns, in that order. Defaults to the first three non-label columns of data if not provided.
label	`character, optional` Name of the column which specifies the dependent variable. Defaults to the last column of data if not provided.
enet.lambda	`numeric, optional` Elastic-net penalization weight. The value should be greater than 0. Defaults to 1.0.
tol	`numeric, optional` Convergence tolerance of the optimization (i.e. the L-BFGS algorithm). Defaults to 1e-4.
max.iter	`integer, optional` Maximum number of iterations of the optimization (i.e. the L-BFGS algorithm). Defaults to 1000.
lbfgs.m	`integer, optional` Number of previous updates (memories) to keep for the L-BFGS algorithm. Defaults to 25.
thread.ratio	`double, optional` Controls the proportion of available threads that can be used by this function. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range are ignored. Defaults to 0.
use.class.feature	`logical, optional` Whether to include a feature for the class, which is equivalent to having a bias vector in the model. Defaults to TRUE.
use.word	`logical, optional` Whether to use the feature for current word or not. Defaults to TRUE.
use.ngrams	`logical, optional` Whether or not to make features from letter n-grams (i.e. substrings of the word). Defaults to TRUE.
no.mid.ngrams	`logical, optional` If TRUE, character n-grams that contain neither the beginning nor the end of the word are not included as features. Defaults to TRUE.
max.ngram.length	`integer, optional` Threshold for the size of n-grams to be used in the model. Must be positive. Defaults to 6.
use.prev.word	`logical, optional` Whether to make a feature from both the current word and the previous word. Defaults to TRUE.
use.next.word	`logical, optional` Whether to make a feature from both the current word and its next word. Defaults to TRUE.
use.disjunctive	`logical, optional` Whether to include features giving disjunctions of words anywhere within `disjunction.width` words to the left or right of the current word. Defaults to TRUE.
disjunction.width	`integer, optional` Number of words to the left and right of the current word covered by the disjunctive features; see `use.disjunctive`. Defaults to 4.
use.sequences	`logical, optional` Whether or not to use class combination features. Defaults to TRUE.
use.prev.sequences	`logical, optional` Whether or not to use any class combination features using the previous class. Defaults to TRUE.
use.type.seqs	`logical, optional` Whether to use basic 0th order word shape features or not. Defaults to TRUE.
use.type.seqs2	`logical, optional` Whether to use additional 1st and 2nd order word shape features. Defaults to TRUE.
use.type.ysequences	`logical, optional` Whether or not to use some first order word shape patterns. Defaults to TRUE.
use.word.shape	`logical, optional` Whether or not to use word shape features (e.g. capitalized or numeric). Currently only the chris2UseLC word shape is supported. Defaults to FALSE.
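
The interaction between use.ngrams, max.ngram.length and no.mid.ngrams can be easier to see on a single word. The following is a minimal sketch in plain R, not PAL's actual feature extraction: the helper char.ngrams is hypothetical and only enumerates the substrings that the argument descriptions above suggest would be kept.

```r
# Hypothetical helper, for illustration only: list the character n-grams of a
# word up to max.ngram.length, optionally dropping "middle" n-grams that touch
# neither the first nor the last character of the word.
char.ngrams <- function(word, max.ngram.length = 6, no.mid.ngrams = TRUE) {
  chars <- strsplit(word, "")[[1]]
  n.chars <- length(chars)
  out <- character(0)
  for (start in seq_len(n.chars)) {
    last <- min(n.chars, start + max.ngram.length - 1)
    for (end in start:last) {
      touches.edge <- start == 1 || end == n.chars
      if (no.mid.ngrams && !touches.edge) next
      out <- c(out, paste(chars[start:end], collapse = ""))
    }
  }
  unique(out)
}

# Only prefixes and suffixes of "oxygen" survive when no.mid.ngrams = TRUE.
edge.only <- char.ngrams("oxygen", max.ngram.length = 3)
# All substrings of length 1 to 3 are kept when no.mid.ngrams = FALSE.
all.grams <- char.ngrams("oxygen", max.ngram.length = 3, no.mid.ngrams = FALSE)
print(edge.only)
print(length(all.grams))
```

With no.mid.ngrams = TRUE the feature set grows roughly linearly with max.ngram.length rather than with the number of all substrings, which is presumably why it is the default.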

Value

A "CRF" object with the following attributes:

model: DataFrame CRF model.
statistics: DataFrame Summary of the CRF model training process.
optim.param: DataFrame Optimal parameter of the CRF model. Reserved for future use and currently empty.

Details

Conditional random fields (CRFs) are a probabilistic framework for labeling and segmenting structured data, such as sequences. Training a CRF can be cast in the general framework of maximum likelihood estimation. In PAL, the L-BFGS algorithm is adopted for maximizing the (penalized) likelihood function.
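
For orientation, the textbook linear-chain CRF can be written as below. This is the standard formulation, not a statement of PAL's exact objective: the symbols are introduced here for illustration, the weight \lambda is assumed to play the role of enet.lambda, and the precise form of the penalty R(\theta) is not given in this page.

```latex
\begin{align*}
p(y \mid x; \theta) &= \frac{1}{Z(x; \theta)}
  \exp\Big( \sum_{t=1}^{T} \sum_{k} \theta_k \, f_k(y_{t-1}, y_t, x, t) \Big), \\
Z(x; \theta) &= \sum_{y'}
  \exp\Big( \sum_{t=1}^{T} \sum_{k} \theta_k \, f_k(y'_{t-1}, y'_t, x, t) \Big), \\
\hat{\theta} &= \arg\max_{\theta} \;
  \sum_{i=1}^{N} \log p\big(y^{(i)} \mid x^{(i)}; \theta\big) - \lambda \, R(\theta).
\end{align*}
```

Here x is a word sequence of length T, y is its label sequence, the f_k are feature functions (for example the word, n-gram and class-combination features selected by the arguments above), and the sum over i runs over the N training sequences.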

Examples

Input DataFrame data:

> data$Collect()
   DOC_ID WORD_POSITION        WORD            LABEL
1       1             1      RECORD                O
2       1             2     #497321                O
3       1             3    78554939                O
4       1             4           |                O
.......
88      3            29          on OxygenSaturation
89      3            30           2 OxygenSaturation
90      3            31      liters OxygenSaturation
91      3            32          of OxygenSaturation
92      3            33      oxygen OxygenSaturation

Call the function:

> crf <- hanaml.CRF(data = data, thread.ratio = 1.0,
                    enet.lambda = 0.1, max.iter = 1000, tol = 1e-4,
                    use.word.shape = FALSE, lbfgs.m = 25)

Output:

> crf$statistics$Collect()
         STAT_NAME          STAT_VALUE
1              obj 0.44251900977373015
2             iter                  22
3  solution status           Converged
4      numSentence                   2
5          numWord                  92
......
25         iter 19        obj=0.442519
26         iter 20        obj=0.442519
27         iter 21        obj=0.442519
28         iter 22        obj=0.442519
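
The call above relies on the defaults for used.cols and label. A sketch of the same training call with the columns named explicitly might look like the following; the column names come from the example data, while the crf.args list and the exists() guard are illustrative choices made so that the snippet also runs without a HANA connection.

```r
# Column names are taken from the example data shown above.
crf.args <- list(
  used.cols = list(document.id = "DOC_ID",
                   word.pos = "WORD_POSITION",
                   word = "WORD"),
  label = "LABEL",
  enet.lambda = 0.1,
  max.iter = 1000,
  tol = 1e-4
)

# Unnamed form of the same column selection, in the documented order:
# document ID, word position, word.
used.cols.positional <- list("DOC_ID", "WORD_POSITION", "WORD")
stopifnot(identical(unname(crf.args$used.cols), used.cols.positional))

# Guarded so the sketch runs even when the package is not loaded; when it is,
# `data` is assumed to be the HANA DataFrame from the example.
if (exists("hanaml.CRF")) {
  crf <- do.call(hanaml.CRF, c(list(data = data), crf.args))
}
```

Naming the columns explicitly avoids depending on column order in the input DataFrame.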
