CRF

class hana_ml.algorithms.pal.crf.CRF(lamb=None, epsilon=None, max_iter=None, lbfgs_m=None, use_class_feature=None, use_word=None, use_ngrams=None, mid_ngrams=False, max_ngram_length=None, use_prev=None, use_next=None, disjunction_width=None, use_disjunctive=None, use_seqs=None, use_prev_seqs=None, use_type_seqs=None, use_type_seqs2=None, use_type_yseqs=None, word_shape=None, thread_ratio=None)

Conditional random field(CRF) for labeling and segmenting sequence data(e.g. text).

Parameters:

epsilonfloat, optional

Convergence tolerance of the optimization algorithm.

Defaults to 1e-4.

lambfloat, optional

Regularization weight, should be greater than 0.

Defaults t0 1.0.

max_iterint, optional

Maximum number of iterations in optimization.

Defaults to 1000.

lbfgs_mint, optional

Number of memories to be stored in L_BFGS optimization algorithm.

Defaults to 25.

use_class_featurebool, optional

To include a feature for class/label. This is the same as having a bias vector in a model.

Defaults to True.

use_wordbool, optional

If True, gives you feature for current word.

Defaults to True.

use_ngramsbool, optional

Whether to make feature from letter n-grams, i.e. substrings of the word.

Defaults to True.

mid_ngramsbool, optional

Whether to include character n-gram features for n-grams that contain neither the beginning or the end of the word.

Defaults to False.

max_ngram_lengthint, optional

Upper limit for the size of n-grams to be included. Effective only this parameter is positive.

use_prevbool, optional

Whether or not to include a feature for previous word and current word, and together with other options enables other previous features.

Defaults to True.

use_nextbool, optional

Whether or not to include a feature for next word and current word.

Defaults to True.

disjunction_widthint, optional

Defines the width for disjunctions of words, see use_disjunctive.

Defaults to 4.

use_disjunctivebool, optional

Whether or not to include in features giving disjunctions of words anywhere in left or right disjunction_width words.

Defaults to True.

use_seqsbool, optional

Whether or not to use any class combination features.

Defaults to True.

use_prev_seqsbool, optional

Whether or not to use any class combination features using the previous class.

Defaults to True.

use_type_seqsbool, optional

Whether or not to use basic zeroth order word shape features.

Defaults to True.

use_type_seqs2bool, optional

Whether or not to add additional first and second order word shape features.

Defaults to True.

use_type_yseqsbool, optional

Whether or not to use some first order word shape patterns.

Defaults to True.

word_shapeint, optional

Word shape, e.g. whether capitalized or numeric. Only supports chris2UseLC currently. Do not use word shape if this is 0.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by the fit(i.e. training) function.

The range of this parameter is from 0 to 1.

0 means only using single thread, 1 means using at most all available threads currently.

Values outside this range are ignored, and the fit function heuristically determines the number of threads to use.

Defaults to 1.0.

Examples

Input data for training:

>>> df.head(10).collect()
   DOC_ID  WORD_POSITION      WORD LABEL
     1              1    RECORD     O
     1              2   #497321     O
     1              3  78554939     O
     1              4         |     O
     1              5       LRH     O
     1              6         |     O
     1              7  62413233     O
     1              8         |     O
     1              9         |     O
     1             10   7368393     O

Set up an instance of CRF model, and fit it on the training data:

>>> crf = CRF(lamb=0.1,
...           max_iter=1000,
...           epsilon=1e-4,
...           lbfgs_m=25,
...           word_shape=0,
...           thread_ratio=1.0)
>>> crf.fit(data=df, doc_id="DOC_ID", word_pos="WORD_POSITION",
...         word="WORD", label="LABEL")

Check the trained CRF model and related statistics:

>>> crf.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
        0  {"classIndex":[["O","OxygenSaturation"]],"defa...
>>> crf.stats_.head(10).collect()
         STAT_NAME           STAT_VALUE
            obj  0.44251900977373015
           iter                   22
solution status            Converged
    numSentence                    2
        numWord                   92
    numFeatures                  963
         iter 1          obj=26.6557
         iter 2          obj=14.8484
         iter 3          obj=5.36967
         iter 4           obj=2.4382

Input data for predicting labels using trained CRF model

>>> df_pred.head(10).collect()
   DOC_ID  WORD_POSITION         WORD
     2              1      GENERAL
     2              2     PHYSICAL
     2              3  EXAMINATION
     2              4            :
     2              5        VITAL
     2              6        SIGNS
     2              7            :
     2              8        Blood
     2              9     pressure
     2             10        86g52

Do the prediction:

>>> res = crf.predict(data=df_pred, doc_id='DOC_ID', word_pos='WORD_POSITION',
...                   word='WORD', thread_ratio=1.0)

Check the prediction result:

>>> df_pred.head(10).collect()
   DOC_ID  WORD_POSITION         WORD
     2              1      GENERAL
     2              2     PHYSICAL
     2              3  EXAMINATION
     2              4            :
     2              5        VITAL
     2              6        SIGNS
     2              7            :
     2              8        Blood
     2              9     pressure
     2             10        86g52

Attributes:

model_DataFrame

CRF model content.

stats_DataFrame

Statistic info for CRF model fitting, structured as follows:

1st column: name of the statistics, type NVARCHAR(100).

2nd column: the corresponding statistics value, type NVARCHAR(1000).

optimal_param_DataFrame

Placeholder for storing optimal parameter of the model. None empty only when parameter selection is triggered (in the future).

Methods

`create_model_state`([model, function, ...])	Create PAL model state.
`delete_model_state`([state])	Delete PAL model state.
`fit`(data[, doc_id, word_pos, word, label])	Function for training the CRF model on English text.
`predict`(data[, doc_id, word_pos, word, ...])	The function that predicts text labels based trained CRF model.
`set_model_state`(state)	Set the model state by state information.

fit(data, doc_id=None, word_pos=None, word=None, label=None)

Function for training the CRF model on English text.

Parameters:

dataDataFrame

Input data for training/fitting the CRF model.

It should contain at least 4 columns, corresponding to document ID, word position, word and label, respectively.

doc_idstr, optional

Name of the column for document ID.

Defaults to the first column of the input data.

word_posstr, optional

Name of the column for word position.

Defaults to the 1st non-doc_id column of the input data.

wordstr, optional

Name of the column for word.

Defaults to 1st non-doc_id, non-word_pos column of the input data.

labelstr, optional

Name of the label column.

Defaults to the last non-doc_id, non-word_pos, non-word column of the input data.

predict(data, doc_id=None, word_pos=None, word=None, thread_ratio=None)

The function that predicts text labels based trained CRF model.

Parameters:

dataDataFrame

Input data to predict the labels. It should contain at least 3 columns, corresponding to document ID, word position and word, respectively.

doc_idstr, optional

Name of the column for document ID.

Defaults to the 1st column of the input data.

word_posstr, optional

Name of the column for word position.

Defaults to the 1st non-doc_id column of the input data.

wordstr, optional

Name of the column for word.

Defaults to the 1st non-doc_id, non-word_pos column of the input data.

thread_ratiofloat, optional

Specifies the ratio of total number of threads that can be used by predict function.

The range of this parameter is from 0 to 1.

0 means only using a single thread, and 1 means using at most all available threads currently.

Values outside this range are ignored, and predict function heuristically determines the number of threads to use.

Defaults to 1.0.

Returns:

DataFrame

Prediction result for the input data, structured as follows:

1st column: document ID,

2nd column: word position,

3rd column: label.

create_model_state(model=None, function=None, pal_funcname='PAL_CRF', state_description=None, force=False)

Create PAL model state.

Parameters:

modelDataFrame, optional

Specify the model for AFL state.

Defaults to self.model_.

functionstr, optional

Specify the function in the unified API.

A placeholder parameter, not effective for CRF.

pal_funcnameint or str, optional

PAL function name. Must be a valid PAL procedure that supports model state.

Defaults to 'PAL_CRF'.

state_descriptionstr, optional

Description of the state as model container.

Defaults to None.

forcebool, optional

If True it will delete the existing state.

Defaults to False.

set_model_state(state)

Set the model state by state information.

Parameters:

state: DataFrame or dict

If state is DataFrame, it has the following structure:

NAME: VARCHAR(100), it mush have STATE_ID, HINT, HOST and PORT.

VALUE: VARCHAR(1000), the values according to NAME.

If state is dict, the key must have STATE_ID, HINT, HOST and PORT.

delete_model_state(state=None)

Delete PAL model state.

Parameters:

stateDataFrame, optional

Specified the state.

Defaults to self.state.

property fit_hdbprocedure: Returns the generated hdbprocedure for fit.

property predict_hdbprocedure: Returns the generated hdbprocedure for predict.

Inherited Methods from PALBase

Besides those methods mentioned above, the CRF class also inherits methods from PALBase class, please refer to PAL Base for more details.