Feature Engineering 備忘小筆記

對 AI 有興趣的我, 其實比較喜歡研究 model 的原理, 對於資料的處理沒興趣. 不過偶爾要用到某個語法, 卻又記不起來的話, 還是會有點傷腦筋! 所以筆記一下加深印象. 以後也好查詢.

  1. dataframe 轉 dataset
def df_to_dataset(dataframe):
    dataframe = dataframe.copy()
    
    labels = dataframe.pop('your target')
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
                            
    return ds

2. Categorical 轉 Numeral (one-hot)

from tensorflow import feature_column as fc

# invest_df is predefined dataframe

A = ['stock','bond','ETF']
B = []

for C in A:
    D = invest_df[C].unique()
    E = fc.categorical_column_with_vocabulary_list(C, D)
    F = fc.indicator_column(E)
    B.append(F)

3. Bucketized 轉 numerical

G = fc.numeric_column("net_asset")
​
# Bucketized cols
H = fc.bucketized_column(G, boundaries=[10, 20, 30, 40, 50, 60, 80, 100]) # in million USD
B.append(H)

4. Feature Cross (Bucketized + Categorical)

I = invest_df['FATFIRE_proximity'].unique()
J = fc.categorical_column_with_vocabulary_list('FATFIRE_proximity',I)
​
crossed_feature = fc.crossed_column([H, I],hash_bucket_size=1000)
crossed_feature = fc.indicator_column(crossed_feature)
B.append(crossed_feature)

5. 實際運用

# input_dim = 上述的 feature 個數, 此時 = 8
# 假設下一層是 12 nodes.
# 8 x 12 是 fully connected.

feature_layer = tf.keras.layers.DenseFeatures(B, dtype='float64')
​
model = tf.keras.Sequential([
  feature_layer,
  layers.Dense(12, input_dim=8, activation='relu'),
  layers.Dense(8, activation='relu'),
  layers.Dense(1, activation='linear',  name='your target')
])

6. 產生新的 Bucketetized feature 做成 Feature Cross


    M = np.linspace(0, 1, nbuckets).tolist()
    N = np.linspace(0, 1, nbuckets).tolist()

    OP = fc.bucketized_column(B['OPEN_PRICE'], M)
    CP = fc.bucketized_column(B['CLOSE_PRICE'], M)
    OV = fc.bucketized_column(B['OPEN_VOL'], N)
    CV = fc.bucketized_column(B['CLOSE_VOL'], N)

    OO = fc.crossed_column([OP, OV], nbuckets * nbuckets)
    CC = fc.crossed_column([CP, CV], nbuckets * nbuckets)

    new_bucket_cross_feature = fc.crossed_column([OO, CC], nbuckets ** 4)

7. 畫出酷炫流程圖並存檔

tf.keras.utils.plot_model(model, 'model.png', show_shapes=False, rankdir='LR') # or 'TB'