hanaml.CRF.Rd
hanaml.CRF is an R wrapper for the SAP HANA PAL conditional random field (CRF) algorithm.
hanaml.CRF(
data = NULL,
used.cols = NULL,
label = NULL,
enet.lambda = NULL,
tol = NULL,
max.iter = NULL,
lbfgs.m = NULL,
thread.ratio = NULL,
use.class.feature = NULL,
use.word = NULL,
use.ngrams = NULL,
no.mid.ngrams = NULL,
max.ngram.length = NULL,
use.prev.word = NULL,
use.next.word = NULL,
use.disjunctive = NULL,
disjunction.width = NULL,
use.sequences = NULL,
use.prev.sequences = NULL,
use.type.seqs = NULL,
use.type.seqs2 = NULL,
use.type.ysequences = NULL,
use.word.shape = NULL
)
DataFrame
DataFrame containing the data.
list of characters, optional
This parameter specifies the three columns used for training a conditional random field model:
one column for the document ID, one for the word position, and one for the word itself.
If not NULL, this parameter can be specified in either of two ways:
(1) used.cols = list(document.id = "xxx", word.pos = "yyy", word = "zzz")
(2) used.cols = list("xxx", "yyy", "zzz")
In case (2), "xxx", "yyy" and "zzz" must be the names of the columns holding the document ID,
word position and word, in that order.
Defaults to the first three non-label columns of data if not provided.
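As a minimal sketch of the two forms described above, the following base R snippet builds both lists and checks that they name the same columns in the same order. The column names are borrowed from the example table further down this page; the variable names named.cols and positional.cols are hypothetical.

```r
# Form (1): named list, each role stated explicitly.
named.cols <- list(document.id = "DOC_ID",
                   word.pos    = "WORD_POSITION",
                   word        = "WORD")

# Form (2): unnamed list, roles implied by position
# (document ID, word position, word).
positional.cols <- list("DOC_ID", "WORD_POSITION", "WORD")

# Dropping the names from form (1) yields form (2).
same.columns <- identical(unname(named.cols), positional.cols)
print(same.columns)

# With a hanaml DataFrame `df` (assumed to exist), either list
# could then be passed as the used.cols argument:
# crf <- hanaml.CRF(data = df, used.cols = named.cols, label = "LABEL")
```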
character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.
numeric, optional
Elastic-net penalization weight. The value should be greater than 0.
Defaults to 1.0.
numeric, optional
Convergence tolerance for the optimization (i.e. the L-BFGS algorithm).
Defaults to 1e-4.
integer, optional
Maximum number of iterations for the optimization (i.e. the L-BFGS algorithm).
Defaults to 1000.
integer, optional
Number of previous updates to keep in memory for the L-BFGS algorithm.
Defaults to 25.
double, optional
Controls the proportion of available threads that can be used by this
function.
The value range is from 0 to 1, where 0 indicates a single thread,
and 1 indicates all available threads.
Values between 0 and 1 will use up to that percentage of available threads.
Values outside this range are ignored.
Defaults to 0.
logical, optional
Whether to include a feature for the class, which is equivalent to having a bias
vector in the model.
Defaults to TRUE.
logical, optional
Whether to use the feature for current word or not.
Defaults to TRUE.
logical, optional
Whether or not to make features from letter n-grams (i.e. substrings of the word).
Defaults to TRUE.
logical, optional
If TRUE, character n-gram features are not included for n-grams that contain neither
the beginning nor the end of the word.
Defaults to TRUE.
integer, optional
Threshold for the size of n-grams to be used in the model.
Must be positive.
Defaults to 6.
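To make the three n-gram settings above concrete, here is a minimal base R sketch of letter n-gram extraction. It is an illustration of the descriptions only, not PAL's feature extractor: the helper name letter.ngrams is hypothetical, and details such as whether PAL marks word boundaries or de-duplicates n-grams are assumptions.

```r
# Hypothetical helper: enumerate letter n-grams of a single word.
# max.ngram.length caps the substring length; when no.mid.ngrams is TRUE,
# substrings touching neither the first nor the last character are skipped.
letter.ngrams <- function(word, max.ngram.length = 6, no.mid.ngrams = TRUE) {
  n <- nchar(word)
  grams <- character(0)
  for (start in seq_len(n)) {
    for (end in start:min(n, start + max.ngram.length - 1)) {
      touches.edge <- start == 1 || end == n
      if (no.mid.ngrams && !touches.edge) next
      grams <- c(grams, substr(word, start, end))
    }
  }
  unique(grams)
}

# Prefixes and suffixes only, up to 3 letters long.
edge.grams <- letter.ngrams("oxygen", max.ngram.length = 3, no.mid.ngrams = TRUE)
print(edge.grams)

# All substrings up to 3 letters long, including mid-word ones such as "xyg".
all.grams <- letter.ngrams("oxygen", max.ngram.length = 3, no.mid.ngrams = FALSE)
print(length(all.grams))
```

In this sketch, setting no.mid.ngrams to TRUE shrinks the feature set from every short substring to just the word's prefixes and suffixes, which is the trade-off the parameter description points at.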
logical, optional
Whether to make a feature from both the current word and the previous word.
Defaults to TRUE.
logical, optional
Whether to make a feature from both the current word and its next word.
Defaults to TRUE.
logical, optional
Whether to include features giving disjunctions of words anywhere within
disjunction.width words to the left or right of the current word.
Defaults to TRUE.
integer, optional
The width (in words) used for disjunctions of words; see use.disjunctive.
Defaults to 4.
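The interaction between use.disjunctive and disjunction.width can be pictured with a small base R sketch. It collects the words found within a given width to the left and to the right of a position, ignoring their exact offset. The helper name disjunctive.features and the "left:"/"right:" labels are hypothetical, and this is not PAL's internal feature encoding.

```r
# Hypothetical helper: words within `disjunction.width` positions to the left
# and right of `pos`, tagged by side but not by exact distance.
disjunctive.features <- function(words, pos, disjunction.width = 4) {
  n <- length(words)
  left <- if (pos > 1) {
    words[max(1, pos - disjunction.width):(pos - 1)]
  } else {
    character(0)
  }
  right <- if (pos < n) {
    words[(pos + 1):min(n, pos + disjunction.width)]
  } else {
    character(0)
  }
  # paste0() would recycle an empty vector into "", so guard against it.
  tag <- function(prefix, x) {
    if (length(x) > 0) paste0(prefix, unique(x)) else character(0)
  }
  c(tag("left:", left), tag("right:", right))
}

words <- c("on", "2", "liters", "of", "oxygen")

# Window of one word on each side of "liters".
narrow <- disjunctive.features(words, pos = 3, disjunction.width = 1)
print(narrow)

# First word: nothing to the left, up to four words to the right.
first <- disjunctive.features(words, pos = 1, disjunction.width = 4)
print(first)
```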
logical, optional
Whether or not to use class combination features.
Defaults to TRUE.
logical, optional
Whether or not to use any class combination features using the previous class.
Defaults to TRUE.
logical, optional
Whether to use basic 0th order word shape features or not.
Defaults to TRUE.
logical, optional
Whether to use additional 1st and 2nd order word shape features.
Defaults to TRUE.
logical, optional
Whether or not to use some first order word shape patterns.
Defaults to TRUE.
logical, optional
Whether or not to use word shape (e.g. capitalized or numeric).
Only supports chris2UseLC currently.
Defaults to FALSE.
An R6 object of class "CRF" with the following attributes and methods:
Attributes
model: DataFrame
CRF model.
statistics: DataFrame
Summary of the CRF model training process.
optim.param: DataFrame
Optimal parameter of the CRF model.
Reserved for future use and currently empty.
Methods
CreateModelState(model=NULL, algorithm=NULL, func=NULL, state.description="ModelState", force=FALSE)
Usage:
> crf <- hanaml.CRF(data=df)
> crf$CreateModelState()
Arguments:
model: DataFrame
DataFrame containing the model for parsing.
Defaults to self$model.
algorithm: character
Specifies the PAL algorithm associated with model.
Defaults to self$pal.algorithm.
func: character
Specifies the functionality for Unified Classification/Regression.
Valid only for instances of R6Class "UnifiedClassification" or "UnifiedRegression".
Defaults to self$func.
state.description: character
A summary string for the generated model state.
Defaults to "ModelState".
force: logical
Specifies whether or not to replace the existing state for model.
Defaults to FALSE.
After calling this method, an attribute state that contains the parsed info for model
is assigned to the corresponding R6 object.
DeleteModelState(state=NULL)
Usage:
Assuming we have trained a hanaml model and created its model state, like the following:
> crf <- hanaml.CRF(data=df)
> crf$CreateModelState()
After using the model state for real-time scoring, we can delete the state by calling:
> crf$DeleteModelState()
Arguments:
state: DataFrame
DataFrame containing the state info.
Defaults to self$state.
After calling this method, the specified model state is cleaned up and the associated memory is released.
Conditional random fields (CRFs) are a probabilistic framework for labeling and segmenting structured data, such as sequences. Training a CRF can be cast in the general framework of maximum likelihood estimation. In PAL, the L-BFGS algorithm is adopted for maximizing the (penalized) likelihood function.
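For orientation, the penalized maximum-likelihood objective can be written in the textbook linear-chain form below. This is a general sketch, not PAL's documented formulation: the feature functions $f_k$, the weights $\theta_k$, and the penalty $R(\theta)$ are notation introduced here, and the exact form of the elastic-net penalty weighted by enet.lambda (written $\lambda$) is not specified on this page.

```latex
% Conditional probability of a label sequence y given a word sequence x
p_\theta(y \mid x) = \frac{1}{Z_\theta(x)}
  \exp\Big( \sum_{t=1}^{T} \sum_{k} \theta_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z_\theta(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \theta_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)

% Penalized log-likelihood over N training sequences, maximized by L-BFGS
\hat{\theta} = \arg\max_{\theta} \;
  \sum_{i=1}^{N} \log p_\theta\big(y^{(i)} \mid x^{(i)}\big) \; - \; \lambda \, R(\theta)
```

The logical arguments of hanaml.CRF (use.word, use.ngrams, use.sequences and so on) can be read as switches that decide which families of feature functions $f_k$ enter the sums.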
Input DataFrame df:
> df$Collect()
   DOC_ID WORD_POSITION     WORD            LABEL
1       1             1   RECORD                O
2       1             2  #497321                O
3       1             3 78554939                O
4       1             4        |                O
.......
88      3            29       on OxygenSaturation
89      3            30        2 OxygenSaturation
90      3            31   liters OxygenSaturation
91      3            32       of OxygenSaturation
92      3            33   oxygen OxygenSaturation
Call the function:
> crf <- hanaml.CRF(data = df, thread.ratio = 1.0,
enet.lambda = 0.1, max.iter = 1000, tol = 1e-4,
use.word.shape = FALSE, lbfgs.m = 25)
Output:
> crf$statistics$Collect()
         STAT_NAME          STAT_VALUE
1              obj 0.44251900977373015
2             iter                  22
3  solution status           Converged
4      numSentence                   2
5          numWord                  92
......
25         iter 19        obj=0.442519
26         iter 20        obj=0.442519
27         iter 21        obj=0.442519
28         iter 22        obj=0.442519