hanaml.CRF is an R wrapper
for the SAP HANA PAL conditional random field (CRF) algorithm.
hanaml.CRF(
data = NULL,
used.cols = NULL,
label = NULL,
enet.lambda = NULL,
tol = NULL,
max.iter = NULL,
lbfgs.m = NULL,
thread.ratio = NULL,
use.class.feature = NULL,
use.word = NULL,
use.ngrams = NULL,
no.mid.ngrams = NULL,
max.ngram.length = NULL,
use.prev.word = NULL,
use.next.word = NULL,
use.disjunctive = NULL,
disjunction.width = NULL,
use.sequences = NULL,
use.prev.sequences = NULL,
use.type.seqs = NULL,
use.type.seqs2 = NULL,
use.type.ysequences = NULL,
use.word.shape = NULL
)
Arguments
| data |
DataFrame
DataFrame containing the data.
|
| used.cols |
list of character, optional
This parameter specifies the three columns used for training a conditional random field model:
one column for the document ID, one for the word position, and one for the word itself.
If not NULL, this parameter can be specified in either of two ways:
(1) used.cols = list(document.id = "xxx",
                     word.pos = "yyy",
                     word = "zzz")
(2) used.cols = list("xxx", "yyy", "zzz")
In case (2), "xxx", "yyy" and "zzz" must be the names of the columns holding the document ID,
word position and word, respectively.
Defaults to the first three non-label columns of data if not provided.
|
| label |
character, optional
Name of the column which specifies the dependent variable.
Defaults to the last column of data if not provided.
|
| enet.lambda |
numeric, optional
Elastic-net penalization weight. The value should be greater than 0.
Defaults to 1.0.
|
| tol |
numeric, optional
Convergence tolerance for the optimization (i.e., the L-BFGS algorithm).
Defaults to 1e-4.
|
| max.iter |
integer, optional
Maximum number of iterations for the optimization (i.e., the L-BFGS algorithm).
Defaults to 1000.
|
| lbfgs.m |
integer, optional
Number of previous corrections to keep in memory for the L-BFGS algorithm.
Defaults to 25.
|
| thread.ratio |
double, optional
Controls the proportion of available threads that can be used by this
function.
The value range is from 0 to 1, where 0 indicates a single thread,
and 1 indicates all available threads.
Values between 0 and 1 will use up to
that percentage of available threads. Values outside this
range are ignored.
Defaults to 0.
|
| use.class.feature |
logical, optional
Whether to include a feature for the class, which is equivalent to having a bias
vector in the model.
Defaults to TRUE.
|
| use.word |
logical, optional
Whether to use the feature for current word or not.
Defaults to TRUE.
|
| use.ngrams |
logical, optional
Whether to create features from letter n-grams (i.e., substrings of the word).
Defaults to TRUE.
|
| no.mid.ngrams |
logical, optional
If TRUE, character n-gram features are not included for n-grams that contain neither
the beginning nor the end of the word.
Defaults to TRUE.
|
| max.ngram.length |
integer, optional
Threshold for the size of n-grams to be used in the model.
Must be positive.
Defaults to 6.
|
| use.prev.word |
logical, optional
Whether to make a feature from both the current word and the previous word.
Defaults to TRUE.
|
| use.next.word |
logical, optional
Whether to make a feature from both the current word and its next word.
Defaults to TRUE.
|
| use.disjunctive |
logical, optional
Whether to include features giving disjunctions of words anywhere
within disjunction.width words to the left or right.
Defaults to TRUE.
|
| disjunction.width |
integer, optional
Number of words to the left and right considered by the disjunction features; see use.disjunctive.
Defaults to 4.
|
| use.sequences |
logical, optional
Whether or not to use class combination features.
Defaults to TRUE.
|
| use.prev.sequences |
logical, optional
Whether or not to use any class combination features using the previous class.
Defaults to TRUE.
|
| use.type.seqs |
logical, optional
Whether to use basic 0th order word shape features or not.
Defaults to TRUE.
|
| use.type.seqs2 |
logical, optional
Whether to use additional 1st and 2nd order word shape features.
Defaults to TRUE.
|
| use.type.ysequences |
logical, optional
Whether or not to use some first order word shape patterns.
Defaults to TRUE.
|
| use.word.shape |
logical, optional
Whether or not to use word shape (e.g., capitalized or numeric).
Only supports chris2UseLC currently.
Defaults to FALSE.
|
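The two accepted forms of used.cols can be sketched as follows. This is a minimal illustration, not part of the package: the column names are taken from the Examples section, the variable names cols.named and cols.positional are hypothetical, and the commented-out call assumes a DataFrame named data obtained from a live HANA connection.

```r
# Column names follow the Examples section; adjust them to your own table.

# Form (1): named list, each role is named explicitly.
cols.named <- list(document.id = "DOC_ID",
                   word.pos    = "WORD_POSITION",
                   word        = "WORD")

# Form (2): unnamed list, read positionally as
# document ID, word position, word.
cols.positional <- list("DOC_ID", "WORD_POSITION", "WORD")

# Either list would then be passed unchanged
# (requires a live HANA connection, so it is left commented out):
# crf <- hanaml.CRF(data = data, used.cols = cols.named, label = "LABEL")
```

Both lists name the same three columns; form (2) simply relies on their order.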
Value
A "CRF" object with the following attributes:
model: DataFrame. The CRF model.
statistics: DataFrame. Summary of the CRF model training process.
optim.param: DataFrame. Optimal parameters of the CRF model.
Reserved for future use and currently empty.
Details
Conditional random fields (CRFs) are a probabilistic framework for
labeling and segmenting structured data, such as sequences.
They can be put into the general framework of maximum likelihood.
In PAL, the L-BFGS algorithm is adopted for maximizing the (penalized)
likelihood function.
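As a sketch of what "penalized likelihood" means here, the standard linear-chain CRF formulation is shown below. The symbols (feature functions f_k, weights \theta_k, penalty R) are notation introduced only for illustration. \lambda is assumed to play the role of enet.lambda, and the exact penalty that PAL applies is not specified on this page.

```latex
p(y \mid x; \theta)
  = \frac{1}{Z(x; \theta)}
    \exp\Big( \sum_{t=1}^{T} \sum_{k} \theta_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x; \theta)
  = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \theta_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)

\hat{\theta}
  = \arg\max_{\theta} \;
    \sum_{i=1}^{N} \log p\big(y^{(i)} \mid x^{(i)}; \theta\big) \;-\; \lambda \, R(\theta)
```

Here x is a sequence of words, y the corresponding label sequence, and the sum over i runs over the N training sequences.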
Examples
Input DataFrame data:
> data$Collect()
   DOC_ID WORD_POSITION     WORD            LABEL
1       1             1   RECORD                O
2       1             2  #497321                O
3       1             3 78554939                O
4       1             4        |                O
.......
88      3            29       on OxygenSaturation
89      3            30        2 OxygenSaturation
90      3            31   liters OxygenSaturation
91      3            32       of OxygenSaturation
92      3            33   oxygen OxygenSaturation
Call the function:
> crf <- hanaml.CRF(data = data, thread.ratio = 1.0,
                    enet.lambda = 0.1, max.iter = 1000, tol = 1e-4,
                    use.word.shape = FALSE, lbfgs.m = 25)
Output:
> crf$statistics$Collect()
         STAT_NAME          STAT_VALUE
1              obj 0.44251900977373015
2             iter                  22
3  solution status           Converged
4      numSentence                   2
5          numWord                  92
......
25         iter 19        obj=0.442519
26         iter 20        obj=0.442519
27         iter 21        obj=0.442519
28         iter 22        obj=0.442519
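The collected statistics can be checked programmatically, for example to confirm convergence before using the model. The sketch below assumes that Collect() returns an ordinary R data.frame with character columns STAT_NAME and STAT_VALUE, as the output above suggests. So that it runs without a HANA connection, it rebuilds the first five rows shown above as a local stand-in; get.stat is a hypothetical helper, not part of the package.

```r
# Stand-in for crf$statistics$Collect(); values copied from the output above.
stats <- data.frame(
  STAT_NAME  = c("obj", "iter", "solution status", "numSentence", "numWord"),
  STAT_VALUE = c("0.44251900977373015", "22", "Converged", "2", "92"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up one statistic by its exact name.
get.stat <- function(stats, name) {
  stats$STAT_VALUE[stats$STAT_NAME == name][1]
}

converged  <- identical(get.stat(stats, "solution status"), "Converged")
iterations <- as.integer(get.stat(stats, "iter"))
objective  <- as.numeric(get.stat(stats, "obj"))

if (!converged) {
  warning("CRF training did not converge; consider raising max.iter or loosening tol.")
}
```

Matching on the exact name "iter" picks the summary row only, not the per-iteration rows such as "iter 19".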
See also