CRF
- class hana_ml.algorithms.pal.crf.CRF(lamb=None, epsilon=None, max_iter=None, lbfgs_m=None, use_class_feature=None, use_word=None, use_ngrams=None, mid_ngrams=False, max_ngram_length=None, use_prev=None, use_next=None, disjunction_width=None, use_disjunctive=None, use_seqs=None, use_prev_seqs=None, use_type_seqs=None, use_type_seqs2=None, use_type_yseqs=None, word_shape=None, thread_ratio=None)
Conditional random fields (CRFs) are a probabilistic framework for labeling and segmenting structured data, such as sequences. The underlying idea is to define a conditional probability distribution over label sequences given an observation sequence, rather than a joint distribution over both label and observation sequences.
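The conditional distribution described above can be illustrated with a toy linear-chain model. This is a minimal sketch, not how PAL implements CRF: the label set, the feature names and the weights are all invented for illustration, and the normalizer is computed by brute-force enumeration, which is only feasible for very short sequences.

```python
import itertools
import math

LABELS = ["O", "ENT"]  # hypothetical label set, for illustration only


def score(words, labels, weights):
    """Sum the weights of all features that fire on (words, labels)."""
    total = 0.0
    prev = "<START>"
    for word, label in zip(words, labels):
        total += weights.get(("word", word, label), 0.0)   # current-word feature
        total += weights.get(("trans", prev, label), 0.0)  # previous-label feature
        prev = label
    return total


def conditional_prob(words, labels, weights):
    """p(labels | words) = exp(score) / Z(words).

    Z(words) sums exp(score) over every possible label sequence, so the
    distribution is normalized per observation sequence.
    """
    z = sum(
        math.exp(score(words, cand, weights))
        for cand in itertools.product(LABELS, repeat=len(words))
    )
    return math.exp(score(words, labels, weights)) / z


# Hypothetical weights; a trained model would learn these from data.
weights = {
    ("word", "blood", "O"): 1.0,
    ("word", "pressure", "ENT"): 2.0,
    ("trans", "O", "O"): 0.5,
}
words = ["blood", "pressure"]
probs = {
    cand: conditional_prob(words, cand, weights)
    for cand in itertools.product(LABELS, repeat=len(words))
}
best = max(probs, key=probs.get)
```

Because the normalizer depends only on the observed words, the model never has to assign a probability to the words themselves, which is the difference from a joint model.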
- Parameters
- epsilonfloat, optional
Convergence tolerance of the optimization algorithm.
Defaults to 1e-4.
- lambfloat, optional
Regularization weight, should be greater than 0.
Defaults to 1.0.
- max_iterint, optional
Maximum number of iterations in optimization.
Defaults to 1000.
- lbfgs_mint, optional
Number of memories to be stored in the L-BFGS optimization algorithm.
Defaults to 25.
- use_class_featurebool, optional
Whether or not to include a feature for the class/label. This is the same as having a bias vector in a model.
Defaults to True.
- use_wordbool, optional
If True, includes a feature for the current word.
Defaults to True.
- use_ngramsbool, optional
Whether or not to create features from letter n-grams, i.e. substrings of the word.
Defaults to True.
- mid_ngramsbool, optional
Whether or not to include character n-gram features for n-grams that contain neither the beginning nor the end of the word.
Defaults to False.
- max_ngram_lengthint, optional
Upper limit for the size of n-grams to be included. Effective only when this parameter is positive.
Defaults to 6.
- use_prevbool, optional
Whether or not to include a feature for the previous word and current word. Together with other options, this also enables other previous-word features.
Defaults to True.
- use_nextbool, optional
Whether or not to include a feature for the next word and current word.
Defaults to True.
- disjunction_widthint, optional
Defines the width for disjunctions of words; see use_disjunctive.
Defaults to 4.
- use_disjunctivebool, optional
Whether or not to include features giving disjunctions of words anywhere in the left or right disjunction_width words.
Defaults to True.
- use_seqsbool, optional
Whether or not to use any class combination features.
Defaults to True.
- use_prev_seqsbool, optional
Whether or not to use any class combination features using the previous class.
Defaults to True.
- use_type_seqsbool, optional
Whether or not to use basic zeroth order word shape features.
Defaults to True.
- use_type_seqs2bool, optional
Whether or not to add additional first and second order word shape features.
Defaults to True.
- use_type_yseqsbool, optional
Whether or not to use some first order word shape patterns.
Defaults to True.
- word_shapeint, optional
Word shape, e.g. whether capitalized or numeric. Only chris2UseLC is currently supported. Word shape is not used if this is 0.
Defaults to 0.
- thread_ratiofloat, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.
Defaults to 1.0.
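The n-gram and disjunction options above can be pictured with a small pure-Python sketch. It is an illustration under stated assumptions, not PAL's feature extractor: the function names and the feature encoding are hypothetical, and it assumes that with mid_ngrams=False only n-grams touching the start or end of the word are kept.

```python
def char_ngrams(word, mid_ngrams=False, max_ngram_length=6):
    """Return the substrings of `word` that would act as n-gram features.

    Assumption: an n-gram is kept if it touches the beginning or the end
    of the word, or if mid_ngrams is True. The length limit applies only
    when max_ngram_length is positive.
    """
    n = len(word)
    feats = set()
    for start in range(n):
        for end in range(start + 1, n + 1):
            if max_ngram_length > 0 and end - start > max_ngram_length:
                continue
            touches_edge = start == 0 or end == n
            if touches_edge or mid_ngrams:
                feats.add(word[start:end])
    return feats


def disjunctive_features(words, pos, disjunction_width=4):
    """Return (side, word) pairs for words within the window around `pos`.

    Each word found anywhere in the left or right `disjunction_width`
    positions yields one feature, regardless of its exact offset.
    """
    left = words[max(0, pos - disjunction_width):pos]
    right = words[pos + 1:pos + 1 + disjunction_width]
    return {("left", w) for w in left} | {("right", w) for w in right}


edge_only = char_ngrams("abcd")
with_mid = char_ngrams("abcd", mid_ngrams=True)
short_only = char_ngrams("abcd", max_ngram_length=2)
window = disjunctive_features(["a", "b", "c", "d", "e"], pos=2, disjunction_width=1)
```

For the word "abcd", enabling mid_ngrams adds the interior substrings "b", "c" and "bc" on top of the prefixes and suffixes.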
- Attributes
- model_DataFrame
Model content.
- stats_DataFrame
Statistics.
Methods
create_model_state([model, function, ...]) Create PAL model state.
delete_model_state([state]) Delete PAL model state.
fit(data[, doc_id, word_pos, word, label]) Fit the model to the given dataset.
predict(data[, doc_id, word_pos, word, ...]) Predicts text labels using a trained CRF model.
set_model_state(state) Set the model state by state information.
Examples
Input data for training:
>>> df.head(10).collect()
   DOC_ID  WORD_POSITION     WORD LABEL
0       1              1   RECORD     O
1       1              2  #497321     O
...
9       1             10  7368393     O
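The training table has one row per word, keyed by document ID and word position. As a rough sketch of how tokenized documents could be flattened into that layout before being loaded into a table, consider the following; the helper name to_crf_rows is hypothetical, and 1-based numbering is assumed because the example output above starts at 1.

```python
def to_crf_rows(documents):
    """Flatten tokenized documents into (DOC_ID, WORD_POSITION, WORD, LABEL) rows.

    `documents` is a list of documents, each a list of (word, label) pairs.
    Document IDs and word positions are numbered from 1 (an assumption).
    """
    rows = []
    for doc_id, tokens in enumerate(documents, start=1):
        for word_pos, (word, label) in enumerate(tokens, start=1):
            rows.append((doc_id, word_pos, word, label))
    return rows


# The first two tokens from the example above, both labeled "O".
docs = [[("RECORD", "O"), ("#497321", "O")]]
rows = to_crf_rows(docs)
```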
Set up an instance of CRF model:
>>> crf = CRF(lamb=0.1,
...           max_iter=1000,
...           epsilon=1e-4,
...           lbfgs_m=25,
...           word_shape=0,
...           thread_ratio=1.0)
Perform fit():
>>> crf.fit(data=df, doc_id="DOC_ID", word_pos="WORD_POSITION",
...         word="WORD", label="LABEL")
Check the trained CRF model and related statistics:
>>> crf.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          0  {"classIndex":[["O","OxygenSaturation"]],"defa...
>>> crf.stats_.head(10).collect()
  STAT_NAME           STAT_VALUE
0       obj  0.44251900977373015
1      iter                   22
...
9    iter 4           obj=2.4382
Input data for predicting labels using the trained CRF model:
>>> df_pred.head(10).collect()
   DOC_ID  WORD_POSITION      WORD
0       2              1   GENERAL
1       2              2  PHYSICAL
...
9       2             10     86g52
Perform predict():
>>> res = crf.predict(data=df_pred, doc_id='DOC_ID', word_pos='WORD_POSITION', word='WORD', thread_ratio=1.0)
>>> df_pred.head(10).collect()
   DOC_ID  WORD_POSITION      WORD
0       2              1   GENERAL
1       2              2  PHYSICAL
...
8       2              9  pressure
9       2             10     86g52
- fit(data, doc_id=None, word_pos=None, word=None, label=None)
Fit the model to the given dataset.
- Parameters
- dataDataFrame
Input data. It should contain at least 4 columns, corresponding to document ID, word position, word and label, respectively.
- doc_idstr, optional
Name of the column for document ID.
Defaults to the first column of the input data.
- word_posstr, optional
Name of the column for word position.
Defaults to the 1st non-doc_id column of the input data.
- wordstr, optional
Name of the column for word.
Defaults to the 1st non-doc_id, non-word_pos column of the input data.
- labelstr, optional
Name of the label column.
Defaults to the last non-doc_id, non-word_pos, non-word column of the input data.
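The column defaults above resolve in order: each default is taken from the columns that remain after the earlier roles are assigned. A minimal sketch of that resolution logic follows; the helper resolve_fit_columns is hypothetical and is not part of hana_ml, it only mirrors the rules as documented.

```python
def resolve_fit_columns(columns, doc_id=None, word_pos=None, word=None, label=None):
    """Apply the documented defaults for fit() to a list of column names."""
    remaining = list(columns)

    # doc_id: first column of the input data.
    doc_id = doc_id if doc_id is not None else remaining[0]
    remaining.remove(doc_id)

    # word_pos: first column that is not doc_id.
    word_pos = word_pos if word_pos is not None else remaining[0]
    remaining.remove(word_pos)

    # word: first column that is neither doc_id nor word_pos.
    word = word if word is not None else remaining[0]
    remaining.remove(word)

    # label: last of whatever columns are left.
    label = label if label is not None else remaining[-1]
    return doc_id, word_pos, word, label


default_roles = resolve_fit_columns(["DOC_ID", "WORD_POSITION", "WORD", "LABEL"])
extra_column = resolve_fit_columns(["DOC_ID", "WORD_POSITION", "WORD", "EXTRA", "LABEL"])
explicit_doc = resolve_fit_columns(["POS", "D", "W", "L"], doc_id="D")
```

Passing the column names explicitly, as in the fit() example earlier, avoids depending on column order.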
- predict(data, doc_id=None, word_pos=None, word=None, thread_ratio=None)
Predicts text labels using a trained CRF model.
- Parameters
- dataDataFrame
Input data to predict the labels. It should contain at least 3 columns, corresponding to document ID, word position and word, respectively.
- doc_idstr, optional
Name of the column for document ID.
Defaults to the 1st column of the input data.
- word_posstr, optional
Name of the column for word position.
Defaults to the 1st non-doc_id column of the input data.
- wordstr, optional
Name of the column for word.
Defaults to the 1st non-doc_id, non-word_pos column of the input data.
- thread_ratiofloat, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.
Defaults to 1.0.
- Returns
- DataFrame
Prediction result for the input data, structured as follows:
1st column: document ID,
2nd column: word position,
3rd column: label.
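Since the result holds only document ID, word position and label, the words have to be looked up from the input by the (document ID, word position) key. The sketch below shows that lookup on plain tuples, assuming both tables have already been collected to the client; the helper attach_words is hypothetical, and the labels in the sample are invented for illustration rather than real model output.

```python
def attach_words(input_rows, predicted_rows):
    """Join predicted labels back to their words.

    input_rows:     (doc_id, word_pos, word) tuples, as passed to predict().
    predicted_rows: (doc_id, word_pos, label) tuples, as returned by predict().
    """
    words = {(doc_id, pos): word for doc_id, pos, word in input_rows}
    return [
        (doc_id, pos, words[(doc_id, pos)], label)
        for doc_id, pos, label in predicted_rows
    ]


input_rows = [(2, 1, "GENERAL"), (2, 2, "PHYSICAL")]
# Invented labels, not actual predictions from the model above.
predicted_rows = [(2, 1, "O"), (2, 2, "O")]
labeled = attach_words(input_rows, predicted_rows)
```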
- create_model_state(model=None, function=None, pal_funcname='PAL_CRF', state_description=None, force=False)
Create PAL model state.
- Parameters
- modelDataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- functionstr, optional
Specify the function in the unified API.
A placeholder parameter, not effective for CRF.
- pal_funcnameint or str, optional
PAL function name. Must be a valid PAL procedure that supports model state.
Defaults to 'PAL_CRF'.
- state_descriptionstr, optional
Description of the state as model container.
Defaults to None.
- forcebool, optional
If True, deletes the existing state.
Defaults to False.
- set_model_state(state)
Set the model state by state information.
- Parameters
- state: DataFrame or dict
If state is DataFrame, it has the following structure:
NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values according to NAME.
If state is dict, the keys must include STATE_ID, HINT, HOST and PORT.
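When the state is supplied as a dict, it can be worth checking the required keys before calling set_model_state. The following is a small sketch of such a check; the helper missing_state_keys is hypothetical, and the sample values are placeholders rather than a real state.

```python
REQUIRED_STATE_KEYS = {"STATE_ID", "HINT", "HOST", "PORT"}


def missing_state_keys(state):
    """Return the required keys absent from a state dict, sorted by name."""
    return sorted(REQUIRED_STATE_KEYS - set(state))


# Placeholder values for illustration only.
complete = {"STATE_ID": "example-id", "HINT": "example-hint",
            "HOST": "example-host", "PORT": "30003"}
partial = {"STATE_ID": "example-id", "HINT": "example-hint"}

missing_from_complete = missing_state_keys(complete)
missing_from_partial = missing_state_keys(partial)
```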
- delete_model_state(state=None)
Delete PAL model state.
- Parameters
- stateDataFrame, optional
Specifies the state.
Defaults to self.state.