CRF

class hana_ml.algorithms.pal.crf.CRF(lamb=None, epsilon=None, max_iter=None, lbfgs_m=None, use_class_feature=None, use_word=None, use_ngrams=None, mid_ngrams=False, max_ngram_length=None, use_prev=None, use_next=None, disjunction_width=None, use_disjunctive=None, use_seqs=None, use_prev_seqs=None, use_type_seqs=None, use_type_seqs2=None, use_type_yseqs=None, word_shape=None, thread_ratio=None)

Conditional random fields (CRFs) are a probabilistic framework for labeling and segmenting structured data, such as sequences. The underlying idea is to define a conditional probability distribution over label sequences given an observation sequence, rather than a joint distribution over both label and observation sequences.
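
To make the conditional (rather than joint) formulation concrete, here is a minimal brute-force sketch of a linear-chain CRF distribution over label sequences. The label set, feature keys and weights are made up for illustration; this is not how PAL trains or stores its model.

```python
from itertools import product
from math import exp

LABELS = ["O", "B"]  # hypothetical label set


def score(words, labels, weights):
    """Unnormalized log-score: sum of emission and transition feature weights."""
    total = 0.0
    prev = "<START>"
    for word, label in zip(words, labels):
        total += weights.get(("emit", word, label), 0.0)
        total += weights.get(("trans", prev, label), 0.0)
        prev = label
    return total


def conditional_distribution(words, weights):
    """p(y | x) for every label sequence y; normalization runs over y only."""
    scores = {
        y: exp(score(words, y, weights))
        for y in product(LABELS, repeat=len(words))
    }
    z = sum(scores.values())  # partition function Z(x), depends on x only
    return {y: s / z for y, s in scores.items()}


# Made-up weights, for illustration only.
weights = {("emit", "oxygen", "B"): 2.0, ("trans", "B", "O"): 0.5}
dist = conditional_distribution(["oxygen", "level"], weights)
best = max(dist, key=dist.get)
```

The observation sequence is only ever conditioned on: no probability is assigned to the words themselves, which is the difference from a joint model.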

Parameters:

epsilon: float, optional

Convergence tolerance of the optimization algorithm.

Defaults to 1e-4.

lamb: float, optional

Regularization weight. Should be greater than 0.

Defaults to 1.0.

max_iter: int, optional

Maximum number of iterations in optimization.

Defaults to 1000.

lbfgs_m: int, optional

Number of memories to be stored in the L-BFGS optimization algorithm.

Defaults to 25.

use_class_feature: bool, optional

Whether to include a feature for the class/label. This is equivalent to having a bias vector in the model.

Defaults to True.

use_word: bool, optional

Whether to include a feature for the current word.

Defaults to True.

use_ngrams: bool, optional

Whether to create features from letter n-grams, i.e. substrings of the word.

Defaults to True.

mid_ngrams: bool, optional

Whether to include character n-gram features for n-grams that contain neither the beginning nor the end of the word.

Defaults to False.

max_ngram_length: int, optional

Upper limit for the size of n-grams to be included. Effective only when this parameter is positive.

Defaults to 6.

use_prev: bool, optional

Whether or not to include a feature for the previous word and the current word. Together with other options, this also enables other features based on the previous word.

Defaults to True.

use_next: bool, optional

Whether or not to include a feature for the next word and the current word.

Defaults to True.

disjunction_width: int, optional

Defines the width for disjunctions of words; see use_disjunctive.

Defaults to 4.

use_disjunctive: bool, optional

Whether or not to include features giving disjunctions of words anywhere in the left or right disjunction_width words.

Defaults to True.

use_seqs: bool, optional

Whether or not to use any class combination features.

Defaults to True.

use_prev_seqs: bool, optional

Whether or not to use any class combination features using the previous class.

Defaults to True.

use_type_seqs: bool, optional

Whether or not to use basic zeroth-order word shape features.

Defaults to True.

use_type_seqs2: bool, optional

Whether or not to add additional first- and second-order word shape features.

Defaults to True.

use_type_yseqs: bool, optional

Whether or not to use some first-order word shape patterns.

Defaults to True.

word_shape: int, optional

Word shape feature, e.g. whether a word is capitalized or numeric. Only chris2UseLC is currently supported. Word shape is not used if this is 0.

Defaults to 0.

thread_ratio: float, optional

Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, in which case the function heuristically determines the number of threads to use.

Defaults to 1.0.
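
The feature switches above are easier to picture with a small example. The following is a rough sketch, in plain Python, of what use_word, use_prev, use_next, use_disjunctive, disjunction_width, use_ngrams, mid_ngrams and max_ngram_length plausibly control. The function names and feature tuples are hypothetical, and the actual feature templates generated inside PAL may differ.

```python
def char_ngrams(word, max_ngram_length=6, mid_ngrams=False):
    """Substrings of `word`; without mid_ngrams only prefixes and suffixes are kept."""
    grams = set()
    n = len(word)
    for start in range(n):
        for end in range(start + 1, n + 1):
            # The length limit applies only when it is positive.
            if max_ngram_length > 0 and end - start > max_ngram_length:
                continue
            touches_edge = start == 0 or end == n
            if touches_edge or mid_ngrams:
                grams.add(word[start:end])
    return grams


def token_features(words, i, use_word=True, use_prev=True, use_next=True,
                   use_disjunctive=True, disjunction_width=4):
    """Illustrative word-level features for the token at index i."""
    feats = []
    if use_word:
        feats.append(("word", words[i]))
    if use_prev and i > 0:
        feats.append(("prev+word", words[i - 1], words[i]))
    if use_next and i + 1 < len(words):
        feats.append(("next+word", words[i + 1], words[i]))
    if use_disjunctive:
        # Any word within disjunction_width positions to the left or right.
        for w in words[max(0, i - disjunction_width):i]:
            feats.append(("left-disj", w))
        for w in words[i + 1:i + 1 + disjunction_width]:
            feats.append(("right-disj", w))
    return feats


edge_only = char_ngrams("abcd", max_ngram_length=2)
with_mid = char_ngrams("abcd", max_ngram_length=2, mid_ngrams=True)
feats = token_features(["blood", "pressure", "86"], 1, disjunction_width=1)
```

With mid_ngrams left at False, only n-grams anchored at the start or end of the word survive, which keeps the feature space considerably smaller.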

Examples

Input data for training:

>>> df.head(10).collect()
   DOC_ID  WORD_POSITION      WORD LABEL
0       1              1    RECORD     O
1       1              2   #497321     O
...
9       1             10   7368393     O
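
The training table is in long format, with one row per token. As a minimal sketch, rows of this shape could be produced from tokenized documents as follows. The helper name and the input structure are assumptions, and loading the rows into a HANA table is not shown.

```python
def to_token_rows(docs):
    """Flatten {doc_id: [(word, label), ...]} into
    (DOC_ID, WORD_POSITION, WORD, LABEL) rows, with positions starting at 1."""
    rows = []
    for doc_id, tokens in docs.items():
        for position, (word, label) in enumerate(tokens, start=1):
            rows.append((doc_id, position, word, label))
    return rows


# First two tokens of document 1 from the table above.
docs = {1: [("RECORD", "O"), ("#497321", "O")]}
rows = to_token_rows(docs)
```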

Set up an instance of CRF model:

>>> crf = CRF(lamb=0.1,
...           max_iter=1000,
...           epsilon=1e-4,
...           lbfgs_m=25,
...           word_shape=0,
...           thread_ratio=1.0)

Perform fit():

>>> crf.fit(data=df, doc_id="DOC_ID",
...         word_pos="WORD_POSITION",
...         word="WORD", label="LABEL")

Check the trained CRF model and related statistics:

>>> crf.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          0  {"classIndex":[["O","OxygenSaturation"]],"defa...
>>> crf.stats_.head(10).collect()
         STAT_NAME           STAT_VALUE
0              obj  0.44251900977373015
1             iter                   22
...
9           iter 4           obj=2.4382

Input data for predicting labels using the trained CRF model:

>>> df_pred.head(10).collect()
   DOC_ID  WORD_POSITION         WORD
0       2              1      GENERAL
1       2              2     PHYSICAL
...
9       2             10        86g52

Perform predict():

>>> res = crf.predict(data=df_pred, doc_id='DOC_ID', word_pos='WORD_POSITION',
...                   word='WORD', thread_ratio=1.0)
>>> res.head(10).collect()

Attributes:

model_: DataFrame

Model content.

stats_: DataFrame

Statistics.

Methods

`create_model_state`([model, function, ...])	Create PAL model state.
`delete_model_state`([state])	Delete PAL model state.
`fit`(data[, doc_id, word_pos, word, label])	Fit the model to the given dataset.
`get_model_metrics`()	Get the model metrics.
`get_score_metrics`()	Get the score metrics.
`predict`(data[, doc_id, word_pos, word, ...])	Predicts text labels using a trained CRF model.
`set_model_state`(state)	Set the model state by state information.

fit(data, doc_id=None, word_pos=None, word=None, label=None)

Fit the model to the given dataset.

Parameters:

data: DataFrame

Input data. It should contain at least 4 columns, corresponding to document ID, word position, word and label, respectively.

doc_id: str, optional

Name of the column for document ID.

Defaults to the first column of the input data.

word_pos: str, optional

Name of the column for word position.

Defaults to the 1st non-doc_id column of the input data.

word: str, optional

Name of the column for word.

Defaults to the 1st non-doc_id, non-word_pos column of the input data.

label: str, optional

Name of the label column.

Defaults to the last non-doc_id, non-word_pos, non-word column of the input data.
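
The column defaults described above can be summarized in a few lines of plain Python. This is only a sketch of the documented rules, with a hypothetical helper name; it is not the code hana_ml runs internally.

```python
def resolve_fit_columns(columns, doc_id=None, word_pos=None, word=None, label=None):
    """Apply the documented defaults to a list of column names."""
    remaining = list(columns)
    # doc_id: first column of the input data.
    doc_id = doc_id if doc_id is not None else remaining[0]
    remaining.remove(doc_id)
    # word_pos: first non-doc_id column.
    word_pos = word_pos if word_pos is not None else remaining[0]
    remaining.remove(word_pos)
    # word: first non-doc_id, non-word_pos column.
    word = word if word is not None else remaining[0]
    remaining.remove(word)
    # label: last of the columns that are left.
    label = label if label is not None else remaining[-1]
    return doc_id, word_pos, word, label


default_cols = resolve_fit_columns(["DOC_ID", "WORD_POSITION", "WORD", "LABEL"])
extra_cols = resolve_fit_columns(["DOC_ID", "WORD_POSITION", "WORD", "EXTRA", "LABEL"])
```

Passing the column names explicitly, as in the fit() example earlier, avoids any dependence on column order.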

predict(data, doc_id=None, word_pos=None, word=None, thread_ratio=None)

Predicts text labels using a trained CRF model.

Parameters:

data: DataFrame

Input data to predict the labels. It should contain at least 3 columns, corresponding to document ID, word position and word, respectively.

doc_id: str, optional

Name of the column for document ID.

Defaults to the 1st column of the input data.

word_pos: str, optional

Name of the column for word position.

Defaults to the 1st non-doc_id column of the input data.

word: str, optional

Name of the column for word.

Defaults to the 1st non-doc_id, non-word_pos column of the input data.

thread_ratio: float, optional

Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, in which case the function heuristically determines the number of threads to use.

Defaults to 1.0.

Returns:

DataFrame

Prediction result for the input data, structured as follows:

1st column: document ID,

2nd column: word position,

3rd column: label.
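
Because the result holds only the document ID, word position and label, the words themselves have to be matched back by (document ID, word position). A minimal sketch on plain Python tuples, assuming the rows have already been fetched to the client; the helper name and the label values are hypothetical.

```python
def attach_words(input_rows, predicted_rows):
    """Combine (doc_id, word_pos, word) rows with (doc_id, word_pos, label) rows."""
    labels = {(doc, pos): label for doc, pos, label in predicted_rows}
    # Tokens without a prediction get None as their label.
    return [(doc, pos, word, labels.get((doc, pos))) for doc, pos, word in input_rows]


input_rows = [(2, 1, "GENERAL"), (2, 2, "PHYSICAL")]
predicted_rows = [(2, 2, "O"), (2, 1, "O")]  # made-up labels, order deliberately shuffled
labeled = attach_words(input_rows, predicted_rows)
```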

create_model_state(model=None, function=None, pal_funcname='PAL_CRF', state_description=None, force=False)

Create PAL model state.

Parameters:

model: DataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

function: str, optional

Specify the function in the unified API.

A placeholder parameter, not effective for CRF.

pal_funcname: int or str, optional

PAL function name. Must be a valid PAL procedure that supports model state.

Defaults to 'PAL_CRF'.

state_description: str, optional

Description of the state as model container.

Defaults to None.

force: bool, optional

If True, the existing state will be deleted.

Defaults to False.

set_model_state(state)

Set the model state by state information.

Parameters:

state: DataFrame or dict

If state is a DataFrame, it has the following structure:

NAME: VARCHAR(100), must include STATE_ID, HINT, HOST and PORT.

VALUE: VARCHAR(1000), the values corresponding to NAME.

If state is a dict, the keys must include STATE_ID, HINT, HOST and PORT.
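
When the state is passed as a dict, a quick client-side check for the four required keys can catch mistakes before the call. The helper below is a hypothetical convenience, not part of hana_ml, and the example values are placeholders.

```python
REQUIRED_STATE_KEYS = ("STATE_ID", "HINT", "HOST", "PORT")


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return [key for key in REQUIRED_STATE_KEYS if key not in state]


# Placeholder values; PORT is left out on purpose.
state = {"STATE_ID": "abc", "HINT": "h", "HOST": "localhost"}
missing = missing_state_keys(state)
```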

delete_model_state(state=None)

Delete PAL model state.

Parameters:

state: DataFrame, optional

Specifies the state.

Defaults to self.state.
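
Since a created state should eventually be deleted, one possible pattern is to pair create_model_state() and delete_model_state() in a context manager. This is a sketch under that assumption: the model_state helper is hypothetical, and the stub class only stands in for a fitted CRF so the example runs without a HANA connection.

```python
from contextlib import contextmanager


@contextmanager
def model_state(estimator, **kwargs):
    """Create a PAL model state on entry and delete it on exit, even on error."""
    estimator.create_model_state(**kwargs)
    try:
        yield estimator
    finally:
        estimator.delete_model_state()


class _StubEstimator:
    """Stand-in for a fitted CRF; records calls instead of talking to HANA."""

    def __init__(self):
        self.calls = []

    def create_model_state(self, **kwargs):
        self.calls.append(("create", kwargs))

    def delete_model_state(self, state=None):
        self.calls.append(("delete", state))


stub = _StubEstimator()
with model_state(stub, force=True) as est:
    est.calls.append(("predict", None))  # a real predict() call would go here
```

With a real estimator, the body of the with block would contain the predict() calls that benefit from the state.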

get_model_metrics()

Get the model metrics.

Returns:

DataFrame: The model metrics.

get_score_metrics()

Get the score metrics.

Returns:

DataFrame: The score metrics.

Inherited Methods from PALBase

Besides the methods mentioned above, the CRF class also inherits methods from the PALBase class. Please refer to PAL Base for more details.