CRF
- class hana_ml.algorithms.pal.crf.CRF(lamb=None, epsilon=None, max_iter=None, lbfgs_m=None, use_class_feature=None, use_word=None, use_ngrams=None, mid_ngrams=False, max_ngram_length=None, use_prev=None, use_next=None, disjunction_width=None, use_disjunctive=None, use_seqs=None, use_prev_seqs=None, use_type_seqs=None, use_type_seqs2=None, use_type_yseqs=None, word_shape=None, thread_ratio=None)
Conditional random field (CRF) for labeling and segmenting sequence data (e.g. text).
- Parameters
- epsilon: float, optional
Convergence tolerance of the optimization algorithm.
Defaults to 1e-4.
- lamb: float, optional
Regularization weight, should be greater than 0.
Defaults to 1.0.
- max_iter: int, optional
Maximum number of iterations in optimization.
Defaults to 1000.
- lbfgs_m: int, optional
Number of memories to be stored in the L-BFGS optimization algorithm.
Defaults to 25.
- use_class_feature: bool, optional
Whether to include a feature for class/label. This is the same as having a bias vector in a model.
Defaults to True.
- use_word: bool, optional
If True, includes a feature for the current word.
Defaults to True.
- use_ngrams: bool, optional
Whether to make features from letter n-grams, i.e. substrings of the word.
Defaults to True.
- mid_ngrams: bool, optional
Whether to include character n-gram features for n-grams that contain neither the beginning nor the end of the word.
Defaults to False.
- max_ngram_length: int, optional
Upper limit for the size of n-grams to be included. Effective only when this parameter is positive.
- use_prev: bool, optional
Whether or not to include a feature for previous word and current word, and together with other options enables other previous features.
Defaults to True.
- use_next: bool, optional
Whether or not to include a feature for next word and current word.
Defaults to True.
- disjunction_width: int, optional
Defines the width for disjunctions of words, see use_disjunctive.
Defaults to 4.
- use_disjunctive: bool, optional
Whether or not to include features giving disjunctions of words anywhere in the left or right disjunction_width words.
Defaults to True.
- use_seqs: bool, optional
Whether or not to use any class combination features.
Defaults to True.
- use_prev_seqs: bool, optional
Whether or not to use any class combination features using the previous class.
Defaults to True.
- use_type_seqs: bool, optional
Whether or not to use basic zeroth order word shape features.
Defaults to True.
- use_type_seqs2: bool, optional
Whether or not to add additional first and second order word shape features.
Defaults to True.
- use_type_yseqs: bool, optional
Whether or not to use some first order word shape patterns.
Defaults to True.
- word_shape: int, optional
Word shape, e.g. whether capitalized or numeric. Only supports chris2UseLC currently. Do not use word shape if this is 0.
- thread_ratio: float, optional
Specifies the ratio of total number of threads that can be used by the fit (i.e. training) function.
The range of this parameter is from 0 to 1.
0 means using only a single thread, and 1 means using at most all currently available threads.
Values outside this range are ignored, and the fit function heuristically determines the number of threads to use.
Defaults to 1.0.
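To make the interaction of use_ngrams, mid_ngrams and max_ngram_length more concrete, the following pure-Python sketch enumerates the substrings of a word. The helper name char_ngrams is hypothetical and is not part of hana_ml; the feature extraction performed inside PAL may differ in detail, so treat this only as an illustration of the parameter descriptions above.

```python
def char_ngrams(word, max_ngram_length=0, mid_ngrams=False):
    """Enumerate character n-grams of `word` (illustrative helper, not hana_ml API).

    n-grams that touch the beginning or the end of the word are always kept;
    interior n-grams are kept only when `mid_ngrams` is True.
    `max_ngram_length` caps the n-gram size only when it is positive.
    """
    n = len(word)
    limit = max_ngram_length if max_ngram_length > 0 else n
    grams = []
    for start in range(n):
        for end in range(start + 1, min(n, start + limit) + 1):
            touches_edge = start == 0 or end == n
            if touches_edge or mid_ngrams:
                grams.append(word[start:end])
    return grams


# Only n-grams anchored at the word boundary, at most 2 characters long.
print(char_ngrams("Blood", max_ngram_length=2))
# Interior n-grams are added as well.
print(char_ngrams("Blood", max_ngram_length=2, mid_ngrams=True))
```

With mid_ngrams left at False the number of n-gram features per word grows linearly with the cap, whereas enabling it lets interior substrings in as well.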
Examples
Input data for training:
>>> df.head(10).collect()
   DOC_ID  WORD_POSITION      WORD LABEL
0       1              1    RECORD     O
1       1              2   #497321     O
2       1              3  78554939     O
3       1              4         |     O
4       1              5       LRH     O
5       1              6         |     O
6       1              7  62413233     O
7       1              8         |     O
8       1              9         |     O
9       1             10   7368393     O
Set up an instance of CRF model, and fit it on the training data:
>>> crf = CRF(lamb=0.1,
...           max_iter=1000,
...           epsilon=1e-4,
...           lbfgs_m=25,
...           word_shape=0,
...           thread_ratio=1.0)
>>> crf.fit(data=df, doc_id="DOC_ID", word_pos="WORD_POSITION",
...         word="WORD", label="LABEL")
Check the trained CRF model and related statistics:
>>> crf.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          0  {"classIndex":[["O","OxygenSaturation"]],"defa...
>>> crf.stats_.head(10).collect()
         STAT_NAME           STAT_VALUE
0              obj  0.44251900977373015
1             iter                   22
2  solution status            Converged
3      numSentence                    2
4          numWord                   92
5      numFeatures                  963
6           iter 1          obj=26.6557
7           iter 2          obj=14.8484
8           iter 3          obj=5.36967
9           iter 4           obj=2.4382
Input data for predicting labels using the trained CRF model:
>>> df_pred.head(10).collect()
   DOC_ID  WORD_POSITION         WORD
0       2              1      GENERAL
1       2              2     PHYSICAL
2       2              3  EXAMINATION
3       2              4            :
4       2              5        VITAL
5       2              6        SIGNS
6       2              7            :
7       2              8        Blood
8       2              9     pressure
9       2             10        86g52
Do the prediction:
>>> res = crf.predict(data=df_pred, doc_id='DOC_ID', word_pos='WORD_POSITION',
...                   word='WORD', thread_ratio=1.0)
Check the prediction result:
>>> res.head(10).collect()
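The training table in the example uses one row per word, keyed by document ID and a 1-based word position. As a minimal sketch of how tokenized documents could be flattened into that layout before being loaded into SAP HANA, consider the following pure-Python helper. The name to_crf_rows and the input structure are assumptions made for illustration and are not part of hana_ml.

```python
def to_crf_rows(documents):
    """Flatten tokenized documents into (DOC_ID, WORD_POSITION, WORD, LABEL) rows.

    `documents` maps a document ID to a list of (word, label) pairs.
    Word positions start at 1, matching the example training table.
    """
    rows = []
    for doc_id, tokens in documents.items():
        for position, (word, label) in enumerate(tokens, start=1):
            rows.append((doc_id, position, word, label))
    return rows


# First three words of document 1 from the training example.
docs = {1: [("RECORD", "O"), ("#497321", "O"), ("78554939", "O")]}
for row in to_crf_rows(docs):
    print(row)
```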
- Attributes
- model_: DataFrame
CRF model content.
- stats_: DataFrame
Statistics of the CRF model fitting, structured as follows:
1st column: name of the statistics, type NVARCHAR(100).
2nd column: the corresponding statistics value, type NVARCHAR(1000).
- optimal_param_: DataFrame
Placeholder for storing the optimal parameters of the model. Non-empty only when parameter selection is triggered (in the future).
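Because both columns of stats_ are strings, summary values and the per-iteration objective trace arrive mixed together. The sketch below separates them, using the rows shown in the example output of crf.stats_. The helper name split_stats is hypothetical, and the row patterns ("iter N" paired with "obj=...") are inferred from that single example rather than from a documented contract.

```python
import re


def split_stats(rows):
    """Split (STAT_NAME, STAT_VALUE) rows into summary values and an objective trace."""
    summary, trace = {}, []
    for name, value in rows:
        match = re.fullmatch(r"iter (\d+)", name)
        if match and value.startswith("obj="):
            # Per-iteration row such as ("iter 3", "obj=5.36967").
            trace.append((int(match.group(1)), float(value[len("obj="):])))
        else:
            # Summary row such as ("solution status", "Converged").
            summary[name] = value
    return summary, trace


# Rows copied from the stats_ example output.
rows = [
    ("obj", "0.44251900977373015"), ("iter", "22"),
    ("solution status", "Converged"), ("numSentence", "2"),
    ("numWord", "92"), ("numFeatures", "963"),
    ("iter 1", "obj=26.6557"), ("iter 2", "obj=14.8484"),
    ("iter 3", "obj=5.36967"), ("iter 4", "obj=2.4382"),
]
summary, trace = split_stats(rows)
print(summary["solution status"], trace)
```

In practice the rows would come from the collected stats_ DataFrame rather than from a literal list.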
Methods
create_model_state([model, function, ...])    Create PAL model state.
delete_model_state([state])    Delete PAL model state.
fit(data[, doc_id, word_pos, word, label])    Function for training the CRF model on English text.
predict(data[, doc_id, word_pos, word, ...])    The function that predicts text labels based on the trained CRF model.
set_model_state(state)    Set the model state by state information.
- fit(data, doc_id=None, word_pos=None, word=None, label=None)
Function for training the CRF model on English text.
- Parameters
- data: DataFrame
Input data for training/fitting the CRF model.
It should contain at least 4 columns, corresponding to document ID, word position, word and label, respectively.
- doc_id: str, optional
Name of the column for document ID.
Defaults to the 1st column of the input data.
- word_pos: str, optional
Name of the column for word position.
Defaults to the 1st non-doc_id column of the input data.
- word: str, optional
Name of the column for word.
Defaults to the 1st non-doc_id, non-word_pos column of the input data.
- label: str, optional
Name of the label column.
Defaults to the last non-doc_id, non-word_pos, non-word column of the input data.
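The default rules for doc_id, word_pos, word and label can be read as a sequence of "take from what is left" steps. The following sketch mirrors the documented defaults on a plain list of column names. The function resolve_fit_columns is hypothetical and is not how hana_ml necessarily implements the lookup.

```python
def resolve_fit_columns(columns, doc_id=None, word_pos=None, word=None, label=None):
    """Apply the documented defaults of fit() to a list of column names."""
    remaining = list(columns)
    if doc_id is None:
        doc_id = remaining[0]        # first column of the input data
    remaining.remove(doc_id)
    if word_pos is None:
        word_pos = remaining[0]      # first non-doc_id column
    remaining.remove(word_pos)
    if word is None:
        word = remaining[0]          # first non-doc_id, non-word_pos column
    remaining.remove(word)
    if label is None:
        label = remaining[-1]        # last column still unassigned
    return doc_id, word_pos, word, label


print(resolve_fit_columns(["DOC_ID", "WORD_POSITION", "WORD", "LABEL"]))
# An explicitly named column is skipped when resolving the others.
print(resolve_fit_columns(["POS", "D", "W", "L"], doc_id="D"))
```

Passing the column names explicitly, as in the fit example, avoids depending on column order.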
- predict(data, doc_id=None, word_pos=None, word=None, thread_ratio=None)
The function that predicts text labels based on the trained CRF model.
- Parameters
- data: DataFrame
Input data to predict the labels. It should contain at least 3 columns, corresponding to document ID, word position and word, respectively.
- doc_id: str, optional
Name of the column for document ID.
Defaults to the 1st column of the input data.
- word_pos: str, optional
Name of the column for word position.
Defaults to the 1st non-doc_id column of the input data.
- word: str, optional
Name of the column for word.
Defaults to the 1st non-doc_id, non-word_pos column of the input data.
- thread_ratio: float, optional
Specifies the ratio of total number of threads that can be used by the predict function.
The range of this parameter is from 0 to 1.
0 means using only a single thread, and 1 means using at most all currently available threads.
Values outside this range are ignored, and the predict function heuristically determines the number of threads to use.
Defaults to 1.0.
- Returns
- DataFrame
Prediction result for the input data, structured as follows:
1st column: document ID,
2nd column: word position,
3rd column: label.
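Since the prediction result carries only document ID, word position and label, the words themselves have to be re-attached by joining on the first two columns. A minimal pure-Python sketch of that join follows. The helper attach_labels is hypothetical, and the labels in the sample data are made up for illustration; they are not actual model output.

```python
def attach_labels(words, predictions):
    """Join (doc_id, word_pos, word) rows with (doc_id, word_pos, label) rows."""
    label_by_key = {(doc, pos): label for doc, pos, label in predictions}
    # Words without a matching prediction get None as their label.
    return [(doc, pos, word, label_by_key.get((doc, pos))) for doc, pos, word in words]


# Words taken from the prediction input example.
words = [(2, 8, "Blood"), (2, 9, "pressure"), (2, 10, "86g52")]
# Hypothetical labels, NOT real output of the trained model.
predictions = [(2, 8, "O"), (2, 9, "O"), (2, 10, "O")]
for row in attach_labels(words, predictions):
    print(row)
```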
- create_model_state(model=None, function=None, pal_funcname='PAL_CRF', state_description=None, force=False)
Create PAL model state.
- Parameters
- model: DataFrame, optional
Specify the model for AFL state.
Defaults to self.model_.
- function: str, optional
Specify the function in the unified API.
A placeholder parameter, not effective for CRF.
- pal_funcname: int or str, optional
PAL function name. Must be a valid PAL procedure that supports model state.
Defaults to 'PAL_CRF'.
- state_description: str, optional
Description of the state as model container.
Defaults to None.
- force: bool, optional
If True, it will delete the existing state.
Defaults to False.
- set_model_state(state)
Set the model state by state information.
- Parameters
- state: DataFrame or dict
If state is DataFrame, it has the following structure:
NAME: VARCHAR(100), it must have STATE_ID, HINT, HOST and PORT.
VALUE: VARCHAR(1000), the values according to NAME.
If state is dict, the keys must include STATE_ID, HINT, HOST and PORT.
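When the state is passed as a dict, a quick check for the four required keys can catch mistakes before the call reaches the database. The sketch below is a hypothetical pre-check, not something set_model_state is documented to perform, and the values shown are placeholders.

```python
REQUIRED_STATE_KEYS = ("STATE_ID", "HINT", "HOST", "PORT")


def missing_state_keys(state):
    """Return the required keys that are absent from a state dict."""
    return [key for key in REQUIRED_STATE_KEYS if key not in state]


# Placeholder values for illustration only.
state = {"STATE_ID": "<state-id>", "HINT": "<hint>", "HOST": "<host>"}
print(missing_state_keys(state))
```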
- delete_model_state(state=None)
Delete PAL model state.
- Parameters
- state: DataFrame, optional
Specifies the state.
Defaults to self.state.
- property fit_hdbprocedure
Returns the generated hdbprocedure for fit.
- property predict_hdbprocedure
Returns the generated hdbprocedure for predict.