CRF
- class hana_ml.algorithms.pal.crf.CRF(lamb=None, epsilon=None, max_iter=None, lbfgs_m=None, use_class_feature=None, use_word=None, use_ngrams=None, mid_ngrams=False, max_ngram_length=None, use_prev=None, use_next=None, disjunction_width=None, use_disjunctive=None, use_seqs=None, use_prev_seqs=None, use_type_seqs=None, use_type_seqs2=None, use_type_yseqs=None, word_shape=None, thread_ratio=None)
Conditional random fields (CRFs) are a probabilistic framework for labeling and segmenting structured data, such as sequences. The underlying idea is to define a conditional probability distribution over label sequences given an observation sequence, rather than a joint distribution over both label and observation sequences.
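As a rough illustration of the conditional distribution described above, the following sketch scores every possible label sequence for a two-word input and normalizes the scores into probabilities. The label set, the weights, and the function names are invented for this example and are not part of hana_ml or PAL; a real model learns its weights during training and does not enumerate sequences by brute force.

```python
import itertools
import math

LABELS = ["O", "B"]  # hypothetical label set, for illustration only


def score(words, labels, emit, trans):
    """Unnormalized log-score: sum of emission and transition weights."""
    total = 0.0
    for i, (w, y) in enumerate(zip(words, labels)):
        total += emit.get((w, y), 0.0)
        if i > 0:
            total += trans.get((labels[i - 1], y), 0.0)
    return total


def conditional_distribution(words, emit, trans):
    """Return p(y | x) for every label sequence y by brute-force enumeration."""
    candidates = list(itertools.product(LABELS, repeat=len(words)))
    scores = [score(words, y, emit, trans) for y in candidates]
    # The normalizer sums over label sequences only; the words stay fixed.
    log_z = math.log(sum(math.exp(s) for s in scores))
    return {y: math.exp(s - log_z) for y, s in zip(candidates, scores)}


# Made-up weights; a trained model would supply these.
emit = {("pressure", "B"): 2.0, ("blood", "O"): 1.0}
trans = {("O", "B"): 0.5}

dist = conditional_distribution(["blood", "pressure"], emit, trans)
best = max(dist, key=dist.get)
```

Because the normalizer only sums over label sequences for the given words, the model never has to describe how likely the words themselves are, which is the practical difference from a joint model.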
- Parameters:
- epsilonfloat, optional
Convergence tolerance of the optimization algorithm.
Defaults to 1e-4.
- lambfloat, optional
Regularization weight, should be greater than 0.
Defaults to 1.0.
- max_iterint, optional
Maximum number of iterations in optimization.
Defaults to 1000.
- lbfgs_mint, optional
Number of memories to be stored in the L-BFGS optimization algorithm.
Defaults to 25.
- use_class_featurebool, optional
Whether to include a feature for class/label. This is equivalent to having a bias vector in the model.
Defaults to True.
- use_wordbool, optional
If True, includes a feature for the current word.
Defaults to True.
- use_ngramsbool, optional
Whether to create features from letter n-grams, i.e. substrings of the word.
Defaults to True.
- mid_ngramsbool, optional
Whether to include character n-gram features for n-grams that contain neither the beginning nor the end of the word.
Defaults to False.
- max_ngram_lengthint, optional
Upper limit for the size of n-grams to be included. Effective only when this parameter is positive.
Defaults to 6.
- use_prevbool, optional
Whether or not to include a feature for previous word and current word, and together with other options enables other previous features.
Defaults to True.
- use_nextbool, optional
Whether or not to include a feature for next word and current word.
Defaults to True.
- disjunction_widthint, optional
Defines the width for disjunctions of words, see use_disjunctive.
Defaults to 4.
- use_disjunctivebool, optional
Whether or not to include features giving disjunctions of words anywhere in the left or right disjunction_width words.
Defaults to True.
- use_seqsbool, optional
Whether or not to use any class combination features.
Defaults to True.
- use_prev_seqsbool, optional
Whether or not to use any class combination features using the previous class.
Defaults to True.
- use_type_seqsbool, optional
Whether or not to use basic zeroth order word shape features.
Defaults to True.
- use_type_seqs2bool, optional
Whether or not to add additional first and second order word shape features.
Defaults to True.
- use_type_yseqsbool, optional
Whether or not to use some first order word shape patterns.
Defaults to True.
- word_shapeint, optional
Word shape, e.g. whether capitalized or numeric. Only chris2UseLC is currently supported. Word shape is not used if this is 0.
Defaults to 0.
- thread_ratiofloat, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.
Defaults to 1.0.
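To make the n-gram and disjunction options above more concrete, here is a minimal pure-Python sketch of what such features could look like. It only illustrates the parameter descriptions and is not the PAL implementation; the function names `letter_ngrams` and `disjunctive_features`, the feature encoding, and the choice to count the whole word as an n-gram are assumptions made for this example.

```python
def letter_ngrams(word, max_ngram_length=6, mid_ngrams=False):
    """Return character n-grams of `word` as a set.

    N-grams touching the beginning or end of the word are always kept;
    n-grams touching neither edge are kept only when `mid_ngrams` is True.
    The length limit applies only when `max_ngram_length` is positive.
    """
    feats = set()
    n_chars = len(word)
    for start in range(n_chars):
        for end in range(start + 1, n_chars + 1):
            if max_ngram_length > 0 and end - start > max_ngram_length:
                break
            touches_start = start == 0
            touches_end = end == n_chars
            if touches_start or touches_end or mid_ngrams:
                feats.add(word[start:end])
    return feats


def disjunctive_features(words, position, disjunction_width=4):
    """Return words found within `disjunction_width` tokens on either side.

    Each feature records only the side ("L" or "R"), not the exact offset,
    which is what makes it a disjunction over nearby positions.
    """
    left = words[max(0, position - disjunction_width):position]
    right = words[position + 1:position + 1 + disjunction_width]
    return {("L", w) for w in left} | {("R", w) for w in right}


edge_only = letter_ngrams("abc")
with_middle = letter_ngrams("abc", mid_ngrams=True)
neighbours = disjunctive_features(["a", "b", "c", "d", "e"], 2, disjunction_width=1)
```

In this sketch, `with_middle` contains the extra n-gram "b", which touches neither edge of "abc", while `edge_only` does not.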
Examples
Input data for training:
>>> df.head(10).collect()
   DOC_ID  WORD_POSITION     WORD LABEL
0       1              1   RECORD     O
1       1              2  #497321     O
...
9       1             10  7368393     O
Set up an instance of CRF model:
>>> crf = CRF(lamb=0.1,
...           max_iter=1000,
...           epsilon=1e-4,
...           lbfgs_m=25,
...           word_shape=0,
...           thread_ratio=1.0)
Perform fit():
>>> crf.fit(data=df, doc_id="DOC_ID", word_pos="WORD_POSITION",
...         word="WORD", label="LABEL")
Check the trained CRF model and related statistics:
>>> crf.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          0  {"classIndex":[["O","OxygenSaturation"]],"defa...
>>> crf.stats_.head(10).collect()
  STAT_NAME           STAT_VALUE
0       obj  0.44251900977373015
1      iter                   22
...
9    iter 4           obj=2.4382
Input data for predicting labels using the trained CRF model:
>>> df_pred.head(10).collect()
   DOC_ID  WORD_POSITION      WORD
0       2              1   GENERAL
1       2              2  PHYSICAL
...
9       2             10     86g52
Perform predict():
>>> res = crf.predict(data=df_pred, doc_id='DOC_ID', word_pos='WORD_POSITION',
...                   word='WORD', thread_ratio=1.0)
>>> df_pred.head(10).collect()
   DOC_ID  WORD_POSITION      WORD
0       2              1   GENERAL
1       2              2  PHYSICAL
...
8       2              9  pressure
9       2             10     86g52
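The training and prediction tables in the examples above use one row per token, keyed by document ID and word position. As a hedged sketch of how tokenized documents might be flattened into that layout before being loaded into a table, consider the helper below. The name `to_long_format` and the input structure are assumptions for illustration only; uploading the rows to SAP HANA is outside the scope of the sketch.

```python
def to_long_format(documents, with_labels=True):
    """Flatten {doc_id: [(word, label), ...]} into one row per token.

    Word positions start at 1, matching the sample tables above. With
    `with_labels=False` the label is dropped, giving the 3-column
    layout used for prediction.
    """
    rows = []
    for doc_id, tokens in documents.items():
        for position, (word, label) in enumerate(tokens, start=1):
            row = (doc_id, position, word)
            rows.append(row + (label,) if with_labels else row)
    return rows


# The first two tokens of the sample training data.
docs = {1: [("RECORD", "O"), ("#497321", "O")]}

train_rows = to_long_format(docs)                      # DOC_ID, WORD_POSITION, WORD, LABEL
pred_rows = to_long_format(docs, with_labels=False)    # DOC_ID, WORD_POSITION, WORD
```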
- Attributes:
- model_DataFrame
Model content.
- stats_DataFrame
Statistics.
Methods
create_model_state([model, function, ...])    Create PAL model state.
delete_model_state([state])                   Delete PAL model state.
fit(data[, doc_id, word_pos, word, label])    Fit the model to the given dataset.
get_model_metrics()                           Get the model metrics.
get_score_metrics()                           Get the score metrics.
predict(data[, doc_id, word_pos, word, ...])  Predicts text labels using a trained CRF model.
set_model_state(state)                        Set the model state by state information.
- fit(data, doc_id=None, word_pos=None, word=None, label=None)
Fit the model to the given dataset.
- Parameters:
- dataDataFrame
Input data. It should contain at least 4 columns, corresponding to document ID, word position, word and label, respectively.
- doc_idstr, optional
Name of the column for document ID.
Defaults to the first column of the input data.
- word_posstr, optional
Name of the column for word position.
Defaults to the 1st non-doc_id column of the input data.
- wordstr, optional
Name of the column for word.
Defaults to the 1st non-doc_id, non-word_pos column of the input data.
- labelstr, optional
Name of the label column.
Defaults to the last non-doc_id, non-word_pos, non-word column of the input data.
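The default column resolution described above can be sketched in plain Python as follows. This is only an illustration of the documented defaults, not the library's internal code; the function name `resolve_fit_columns` is hypothetical.

```python
def resolve_fit_columns(columns, doc_id=None, word_pos=None, word=None, label=None):
    """Resolve unspecified column roles following the documented defaults.

    Each role that is not given takes the first remaining column, except
    `label`, which takes the last remaining column.
    """
    remaining = list(columns)

    if doc_id is None:
        doc_id = remaining[0]
    remaining.remove(doc_id)

    if word_pos is None:
        word_pos = remaining[0]
    remaining.remove(word_pos)

    if word is None:
        word = remaining[0]
    remaining.remove(word)

    if label is None:
        label = remaining[-1]

    return doc_id, word_pos, word, label


# With the sample table layout, no column names need to be passed.
defaults = resolve_fit_columns(["DOC_ID", "WORD_POSITION", "WORD", "LABEL"])

# With a different column order, only the out-of-place role is named.
custom = resolve_fit_columns(["POS", "DOC", "TOKEN", "EXTRA", "TAG"], doc_id="DOC")
```

Passing the column names explicitly, as in the fit() example earlier on this page, avoids depending on column order.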
- predict(data, doc_id=None, word_pos=None, word=None, thread_ratio=None)
Predicts text labels using a trained CRF model.
- Parameters:
- dataDataFrame
Input data to predict the labels. It should contain at least 3 columns, corresponding to document ID, word position and word, respectively.
- doc_idstr, optional
Name of the column for document ID.
Defaults to the 1st column of the input data.
- word_posstr, optional
Name of the column for word position.
Defaults to the 1st non-doc_id column of the input data.
- wordstr, optional
Name of the column for word.
Defaults to the 1st non-doc_id, non-word_pos column of the input data.
- thread_ratiofloat, optional
Adjusts the percentage of available threads to use, from 0 to 1. A value of 0 indicates the use of a single thread, while 1 implies the use of all currently available threads. Values outside this range are ignored, and the function heuristically determines the number of threads to use.
Defaults to 1.0.
- Returns:
- DataFrame
Prediction result for the input data, structured as follows:
1st column: document ID,
2nd column: word position,
3rd column: label.
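Since the result described above contains the document ID, word position and label but not the word itself, it is often convenient to join it back to the input. The sketch below does this on plain tuples, as one might after collecting both tables to the client. The helper name `attach_words` and the predicted labels are hypothetical and do not come from a real model run.

```python
def attach_words(input_rows, predicted_rows):
    """Join (doc_id, word_pos, word) rows with (doc_id, word_pos, label) rows.

    Rows without a matching prediction get a label of None.
    """
    labels = {(doc_id, pos): label for doc_id, pos, label in predicted_rows}
    return [
        (doc_id, pos, word, labels.get((doc_id, pos)))
        for doc_id, pos, word in input_rows
    ]


# Input rows follow the sample prediction table; labels are made up.
input_rows = [(2, 1, "GENERAL"), (2, 2, "PHYSICAL")]
predicted_rows = [(2, 1, "O"), (2, 2, "O")]

labelled = attach_words(input_rows, predicted_rows)
```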
- create_model_state(model=None, function=None, pal_funcname='PAL_CRF', state_description=None, force=False)
Create PAL model state.
- Parameters:
- modelDataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- functionstr, optional
Specify the function in the unified API.
A placeholder parameter, not effective for CRF.
- pal_funcnameint or str, optional
PAL function name. Must be a valid PAL procedure that supports model state.
Defaults to 'PAL_CRF'.
- state_descriptionstr, optional
Description of the state as model container.
Defaults to None.
- forcebool, optional
If True, deletes the existing state.
Defaults to False.
- set_model_state(state)
Set the model state by state information.
- Parameters:
- state: DataFrame or dict
If state is a DataFrame, it has the following structure:
NAME: VARCHAR(100), it must include STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values corresponding to NAME.
If state is a dict, its keys must include STATE_ID, HINT, HOST and PORT.
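For the dict form described above, a small client-side check can catch a missing key before the state is passed on. The following is a minimal sketch: the helper `missing_state_keys` is hypothetical and not part of hana_ml, and the values shown are placeholders rather than real state information.

```python
REQUIRED_STATE_KEYS = ("STATE_ID", "HINT", "HOST", "PORT")


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return [key for key in REQUIRED_STATE_KEYS if key not in state]


# Placeholder values only; real values come from an existing model state.
state = {
    "STATE_ID": "<state id>",
    "HINT": "<hint>",
    "HOST": "<host>",
    "PORT": "<port>",
}

problems = missing_state_keys(state)
incomplete = missing_state_keys({"STATE_ID": "<state id>"})
```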
- delete_model_state(state=None)
Delete PAL model state.
- Parameters:
- stateDataFrame, optional
Specifies the state.
Defaults to self.state.
- get_model_metrics()
Get the model metrics.
- Returns:
- DataFrame
The model metrics.
- get_score_metrics()
Get the score metrics.
- Returns:
- DataFrame
The score metrics.
Inherited Methods from PALBase
Besides the methods mentioned above, the CRF class also inherits methods from the PALBase class. Please refer to PAL Base for more details.