In [5]: df.describe()
Out[5]:
          Unnamed: 0         FTE         Total
count    1560.000000  449.000000  1.542000e+03
mean   227767.180128    0.493532  1.446867e+04
std    130207.535688    0.452844  7.916752e+04
min       198.000000   -0.002369 -1.044084e+06
25%    113690.750000         NaN           NaN
50%    226445.500000         NaN           NaN
75%    340883.500000         NaN           NaN
max    450277.000000    1.047222  1.367500e+06
FTE: full-time equivalent
In [4]: df.FTE.head()
Out[4]:
198     NaN
209     NaN
750     1.0
931     NaN
1524    NaN
Name: FTE, dtype: float64
FTE in this data stands for Full Time Equivalent. In our dataset, when a budget line is associated with an employee, this value reflects the percentage of full-time hours that the employee works.
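Because FTE is a fraction of full-time work, its values should lie in [0, 1]; the describe() output above (min -0.002369, max 1.047222) already hints at small out-of-range anomalies. A quick sanity check, using hypothetical values for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical FTE values mirroring the anomalies visible in df.describe()
fte = pd.Series([-0.002369, 0.25, 0.5, 1.0, 1.047222, np.nan])

# FTE is a fraction of full-time work, so flag values outside [0, 1];
# NaN compares False on both conditions and is left alone
out_of_range = fte[(fte < 0) | (fte > 1)]
print(out_of_range)
```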
import numpy as np

def compute_log_loss(predicted, actual, eps=1e-14):
    """Computes the logarithmic loss between predicted and actual
    when these are 1D arrays.

    :param predicted: The predicted probabilities as floats between 0-1
    :param actual: The actual binary labels. Either 0 or 1.
    :param eps (optional): log(0) is inf, so we need to offset our
                           predicted values slightly by eps from 0 or 1.
    """
    predicted = np.clip(predicted, eps, 1 - eps)
    loss = -1 * np.mean(actual * np.log(predicted)
                        + (1 - actual) * np.log(1 - predicted))
    return loss
We can use this function to compute the log loss for some provided values; the results are shown below:
Log loss, correct and confident: 0.05129329438755058
Log loss, correct and not confident: 0.4307829160924542
Log loss, wrong and not confident: 1.049822124498678
Log loss, wrong and confident: 2.9957322735539904
Log loss, actual labels: 9.99200722162646e-15
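The exact input arrays behind these numbers are not shown, but the pattern can be reproduced with single-probability inputs (the specific values 0.95, 0.65, and 0.05 below are assumptions chosen to match the printed losses, with a true label of 1):

```python
import numpy as np

def compute_log_loss(predicted, actual, eps=1e-14):
    # Same function as above: clip to avoid log(0), then average the loss
    predicted = np.clip(predicted, eps, 1 - eps)
    return -1 * np.mean(actual * np.log(predicted)
                        + (1 - actual) * np.log(1 - predicted))

actual = np.array([1.0])
print(compute_log_loss(np.array([0.95]), actual))  # correct and confident -> ~0.0513
print(compute_log_loss(np.array([0.65]), actual))  # correct and not confident -> ~0.4308
print(compute_log_loss(np.array([0.05]), actual))  # wrong and confident -> ~2.9957
print(compute_log_loss(actual, actual))            # actual labels -> ~1e-14 (from eps clipping)
```

Note how confidence amplifies the penalty in both directions: a confident wrong answer costs far more than a hesitant one.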
Once you have created a CountVectorizer with your regular expression as the token pattern, you can pass the text in to compute the bag of words; as with other scikit-learn objects, this is done with the fit method.
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Fill missing values in df.Position_Extra
df.Position_Extra.fillna('', inplace=True)

# Instantiate the CountVectorizer: vec_alphanumeric
vec_alphanumeric = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)

# Fit to the data
vec_alphanumeric.fit(df.Position_Extra)

# Print the number of tokens and first 15 tokens
msg = "There are {} tokens in Position_Extra if we split on non-alpha numeric"
print(msg.format(len(vec_alphanumeric.get_feature_names())))
print(vec_alphanumeric.get_feature_names()[:15])
Improving your model
Pipelines, feature & text preprocessing
We have already encountered pipelines; a Pipeline makes the steps of processing data much easier to manage.
# Import Pipeline
from sklearn.pipeline import Pipeline

# Import other necessary modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Split and select numeric data only, no nans
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric']],
                                                    pd.get_dummies(sample_df['label']),
                                                    random_state=22)

# Instantiate Pipeline object: pl
pl = Pipeline([
    ('clf', OneVsRestClassifier(LogisticRegression()))
])

# Fit the pipeline to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - numeric, no nans: ", accuracy)
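The snippet above depends on the course's sample_df, which is not shown here. A self-contained version with a synthetic stand-in (the column contents below are assumptions, chosen so the numeric feature weakly separates two labels):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical stand-in for the course's sample_df
rng = np.random.RandomState(0)
label = rng.choice(['a', 'b'], size=200)
numeric = np.where(label == 'a', 0.0, 1.0) + rng.normal(scale=0.5, size=200)
sample_df = pd.DataFrame({'numeric': numeric, 'label': label})

# Same steps as the course code: split, wrap the classifier in a Pipeline,
# fit, and score on the held-out data
X_train, X_test, y_train, y_test = train_test_split(
    sample_df[['numeric']], pd.get_dummies(sample_df['label']), random_state=22)

pl = Pipeline([('clf', OneVsRestClassifier(LogisticRegression()))])
pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)
print("Accuracy on sample data - numeric, no nans:", accuracy)
```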
# Import FeatureUnion
from sklearn.pipeline import FeatureUnion

# Split using ALL data in sample_df
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric', 'with_missing', 'text']],
                                                    pd.get_dummies(sample_df['label']),
                                                    random_state=22)
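The FeatureUnion import points at where this is headed: numeric and text columns need different preprocessing, and FeatureUnion concatenates their separately transformed features before the classifier. A sketch of that pattern with a synthetic sample_df (the column contents and the FunctionTransformer selectors are assumptions; the real data comes from the course):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical sample_df with numeric, missing-valued, and text columns
rng = np.random.RandomState(0)
sample_df = pd.DataFrame({
    'numeric': rng.normal(size=100),
    'with_missing': np.where(rng.rand(100) < 0.2, np.nan, rng.normal(size=100)),
    'text': rng.choice(['foo bar', 'bar baz', 'foo baz'], size=100),
    'label': rng.choice(['a', 'b'], size=100),
})

# Selectors: pull out the text Series and the numeric frame, respectively
get_text_data = FunctionTransformer(lambda x: x['text'], validate=False)
get_numeric_data = FunctionTransformer(
    lambda x: x[['numeric', 'with_missing']], validate=False)

X_train, X_test, y_train, y_test = train_test_split(
    sample_df[['numeric', 'with_missing', 'text']],
    pd.get_dummies(sample_df['label']), random_state=22)

# FeatureUnion runs both branches and concatenates their features
pl = Pipeline([
    ('union', FeatureUnion([
        ('numeric_features', Pipeline([
            ('selector', get_numeric_data),
            ('imputer', SimpleImputer()),   # fill NaNs with column means
        ])),
        ('text_features', Pipeline([
            ('selector', get_text_data),
            ('vectorizer', CountVectorizer()),
        ])),
    ])),
    ('clf', OneVsRestClassifier(LogisticRegression())),
])

pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)
print("Accuracy on sample data - all data:", accuracy)
```

The key design point is that imputation only touches the numeric branch and vectorization only touches the text branch; neither step ever sees a column it cannot handle.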