Deep Learning Supplement II

On deep learning for text and sequences

  1. The two deep learning algorithms for sequence processing are recurrent neural networks (RNNs) and one-dimensional convolutional neural networks (1D convnets)

Application scenarios

  1. Document classification and timeseries classification, such as identifying the topic of an article or the author of a book
  2. Timeseries comparison, such as estimating how closely related two documents or two stock tickers are
  3. Sequence-to-sequence learning, such as translating English into French
  4. Sentiment analysis, such as classifying the sentiment of tweets or movie reviews as positive or negative
  5. Timeseries forecasting, such as predicting future weather at a location given recent weather data

Core terminology

  1. Text vectorization is the process of converting text into tensors of numeric data
  2. The units into which text is broken are called tokens, and breaking text into tokens is called tokenization. All text-vectorization processes apply some tokenization scheme and then associate numeric vectors with the resulting tokens.
  3. Two common ways to associate vectors with tokens: one-hot encoding of tokens, and token embedding, which is typically applied to words and then called word embedding (a hand-rolled one-hot sketch follows this list)
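To make one-hot encoding concrete before the Keras version below, here is a minimal hand-rolled sketch of the word-level variant (plain NumPy; the toy samples and max_length are illustrative):

import numpy as np

samples = ['the cat sat on the mat.', 'the dog ate my homework']

token_index = {} # build an index of all tokens in the data
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1 # index 0 is reserved, no token gets it

max_length = 10 # only consider the first 10 words of each sample
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        results[i, j, token_index.get(word)] = 1. # one 1 per word position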

Note: for lightweight, shallow text-processing models, logistic regression and random forests are solid choices, and n-grams are a powerful, indispensable feature-engineering tool in that setting
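As a hedged sketch of that shallow pipeline (assuming scikit-learn is available; the two-review corpus and labels are toy stand-ins):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ['the cat sat on the mat', 'the dog ate my homework'] # toy corpus
y = [0, 1] # toy labels

vectorizer = CountVectorizer(ngram_range=(1, 2)) # bag of 1-grams and 2-grams as features
X = vectorizer.fit_transform(texts)

clf = LogisticRegression().fit(X, y)
print(clf.predict(vectorizer.transform(['the dog sat on the mat'])))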

Word-level one-hot encoding with Keras

from keras.preprocessing.text import Tokenizer
samples = ['the cat sat on the mat.','the dog ate my homework']
tokenizer = Tokenizer(num_words=1000) # create a tokenizer configured to only keep the 1000 most common words
tokenizer.fit_on_texts(samples) # build the word index
sequences = tokenizer.texts_to_sequences(samples) # turn the strings into lists of integer indices
print(sequences)


one_hot_results = tokenizer.texts_to_matrix(samples,mode='binary') # get the one-hot binary representation
print(one_hot_results)


word_index = tokenizer.word_index # the word -> integer index mapping the tokenizer computed
print(word_index)


Note: the hashing variant of one-hot encoding (hashing words into vectors of fixed size instead of keeping an explicit word index) can suffer hash collisions: two different words may end up with the same hash value
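The hashing variant that note refers to works roughly like this (a sketch; dimensionality trades memory for collision risk):

import numpy as np

samples = ['the cat sat on the mat.', 'the dog ate my homework']
dimensionality = 1000 # hash words into vectors of this size; too small -> more collisions
max_length = 10

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = abs(hash(word)) % dimensionality # no explicit word index is stored
        results[i, j, index] = 1.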

Word embeddings with Keras

  1. Word embeddings can be obtained in two ways. The first is to learn them jointly with the main task: start from random word vectors and gradually refine them during training.
  2. The second is to load into the model embeddings precomputed on a different machine-learning task than the one you are solving; these are called pretrained word embeddings (see the sketch after this list)
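Either way, the workhorse is Keras's Embedding layer, best understood as a dictionary that maps integer word indices to dense vectors; a minimal instantiation:

from keras.layers import Embedding

# at most 1000 distinct tokens, each mapped to a 64-dimensional vector;
# input: 2D integer tensor (samples, sequence_length)
# output: 3D float tensor (samples, sequence_length, 64)
embedding_layer = Embedding(1000, 64)

The full end-to-end example, learning the embedding jointly with an IMDB classifier: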
from keras.datasets import imdb
from tensorflow.keras import preprocessing
max_features = 10000
maxlen = 20 # keep the top 10,000 words and cut each review after 20 words
(x_train,y_train),(x_test,y_test) = imdb.load_data(num_words=max_features)
x_train = preprocessing.sequence.pad_sequences(x_train,maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test,maxlen=maxlen)

from keras.models import Sequential
from keras.layers import Flatten,Dense,Embedding
model = Sequential()
model.add(Embedding(10000,8,input_length=maxlen)) # embed the 10,000 possible tokens into 8-dimensional vectors
model.add(Flatten())
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
model.summary()
history = model.fit(x_train,y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)


Note: adding a recurrent layer or a 1D convolutional layer on top of the embedded sequences lets the model learn features that treat each sequence as a whole
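For instance, swapping the Flatten layer above for an LSTM lets the model read each review as an ordered sequence rather than a bag of embedded words (a sketch; the 16 units are an arbitrary choice, and the full recurrent example comes later in these notes):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(10000, 8, input_length=maxlen))
model.add(LSTM(16)) # consumes the embedded sequence as a whole instead of flattening it
model.add(Dense(1, activation='sigmoid'))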

Using pretrained word embeddings

Loading the raw IMDB text

import os
imdb_dir = '/Users/Yun/Desktop/wangweijie/aclImdb'
train_dir = os.path.join(imdb_dir,'train')
labels = []
texts = []
for label_type in ['neg','pos']:
    dir_name = os.path.join(train_dir,label_type)
    for fname in os.listdir(dir_name):
        if fname[-4:] == '.txt':
            f = open(os.path.join(dir_name,fname))
            texts.append(f.read())
            f.close()
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)


Tokenizing the raw data

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

maxlen = 100 # cut each review after 100 words
training_samples = 200 # train on only 200 samples
validation_samples = 10000 # validate on 10,000 samples
max_words = 10000 # consider only the top 10,000 words

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts) # build the word index
sequences = tokenizer.texts_to_sequences(texts) # turn the strings into lists of integer indices
print(len(tokenizer.word_index))

data = pad_sequences(sequences,maxlen=maxlen)
labels = np.asarray(labels)
print(data.shape)
print(labels.shape)


# shuffle the data, since the samples were read in order (all negative first, then all positive)
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples:training_samples+validation_samples]
y_val = labels[training_samples:training_samples+validation_samples]

Processing the pretrained word embeddings

# parse the GloVe file and build the embedding matrix
glove_dir = '/Users/Yun/Desktop/wangweijie/glove.6B/'
embeddings_index = {} # dict mapping words to their embedding vectors
f = open(os.path.join(glove_dir,'glove.6B.100d.txt')) # 100-dimensional vectors
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:],dtype='float32')
    embeddings_index[word] = coefs
f.close()
print(len(embeddings_index))


embedding_dim = 100
embedding_matrix = np.zeros((max_words,embedding_dim))
for word, i in tokenizer.word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector # words not found in GloVe stay all-zeros

Building the model

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32,activation='relu'))
model.add(Dense(1,activation='sigmoid'))
model.summary()
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False # freeze the Embedding layer: when part of a model is pretrained, that part should not be updated during training
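A quick sanity check on the freeze (the 10,000 x 100 = 1,000,000 embedding weights should now be reported as non-trainable):

model.summary() # "Non-trainable params" should include the 1,000,000 frozen embedding weights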

Training and evaluation

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train,y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val,y_val))
model.save_weights('/Users/Yun/Desktop/wangweijie/pre.h5')


import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1,len(acc)+1)
plt.figure(1,figsize=(10,5))
ax1 = plt.subplot(1,2,1)
ax2 = plt.subplot(1,2,2)
plt.sca(ax1)
plt.plot(epochs,acc,'bo',label='Training acc')
plt.plot(epochs,val_acc,'b',label='Validation acc')
plt.legend()

plt.sca(ax2)
plt.plot(epochs,loss,'bo',label='Training loss')
plt.plot(epochs,val_loss,'b',label='Validation loss')
plt.legend()
plt.show()


For comparison: the same model without pretrained embeddings

model = Sequential()
model.add(Embedding(max_words,embedding_dim,input_length=maxlen)) # embedding learned from scratch this time
model.add(Flatten())
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
model.summary()
history = model.fit(x_train,y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val,y_val)) # overfits very quickly with only 200 training samples


Generalizing to the test set

imdb_dir = '/Users/Yun/Desktop/wangweijie/aclImdb'
test_dir = os.path.join(imdb_dir,'test')
labels = []
texts = []
for label_type in ['neg','pos']:
    dir_name = os.path.join(test_dir,label_type)
    for fname in sorted(os.listdir(dir_name)):
        if fname[-4:] == '.txt':
            f = open(os.path.join(dir_name,fname))
            texts.append(f.read())
            f.close()
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)
sequences = tokenizer.texts_to_sequences(texts)
x_test = pad_sequences(sequences, maxlen=maxlen)
y_test = np.asarray(labels)

model.load_weights('/Users/Yun/Desktop/wangweijie/pre.h5',by_name=True)
model.evaluate(x_test,y_test)


Recurrent neural networks

SimpleRNN struggles with long sequences of text; in practice, LSTM or GRU layers are used instead

LSTM literally means long short-term memory: compared with SimpleRNN it adds an extra track across time (the carry) that transports information over many timesteps
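A NumPy sketch of a single LSTM timestep makes the carry track concrete (standard formulation; the weight layout and names are illustrative, not Keras internals):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_step(x_t, h_t, c_t, W, U, b):
    # W, U, b each hold four weight sets, one per gate (layout is illustrative)
    i = sigmoid(x_t @ W[0] + h_t @ U[0] + b[0]) # input gate: what to write onto the carry
    f = sigmoid(x_t @ W[1] + h_t @ U[1] + b[1]) # forget gate: what to erase from the carry
    k = np.tanh(x_t @ W[2] + h_t @ U[2] + b[2]) # candidate values
    o = sigmoid(x_t @ W[3] + h_t @ U[3] + b[3]) # output gate
    c_next = f * c_t + i * k # the extra time track: the carry flows on, lightly edited
    h_next = o * np.tanh(c_next) # the output consults the carry
    return h_next, c_next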

from keras.datasets import imdb
from tensorflow.keras import preprocessing
from keras.models import Sequential
from keras.layers import Embedding, Dense, LSTM
import matplotlib.pyplot as plt

max_features = 10000
maxlen = 500 # keep the top 10,000 words and cut each review after 500 words
batch_size = 32
(x_train,y_train),(x_test,y_test) = imdb.load_data(num_words=max_features)
input_train = preprocessing.sequence.pad_sequences(x_train,maxlen=maxlen)
input_test = preprocessing.sequence.pad_sequences(x_test,maxlen=maxlen)

model = Sequential()
model.add(Embedding(max_features,32))
model.add(LSTM(32))
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(input_train,y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1,len(acc)+1)
plt.figure(1,figsize=(10,5))
ax1 = plt.subplot(1,2,1)
ax2 = plt.subplot(1,2,2)
plt.sca(ax1)
plt.plot(epochs,acc,'bo',label='Training acc')
plt.plot(epochs,val_acc,'b',label='Validation acc')
plt.legend()

plt.sca(ax2)
plt.plot(epochs,loss,'bo',label='Training loss')
plt.plot(epochs,val_loss,'b',label='Validation loss')
plt.legend()
plt.show()


Advanced usage of recurrent networks

  1. Recurrent dropout
  2. Stacking recurrent layers
  3. Bidirectional recurrent layers

Inspecting the data

import pandas as pd
df_weathers = pd.read_csv('/Users/Yun/Desktop/wangweijie/jena_climate_2009_2016.csv')
print(df_weathers.head())


import os
f = open('/Users/Yun/Desktop/wangweijie/jena_climate_2009_2016.csv')
data = f.read()
f.close()

lines = data.split('\n')
header = lines[0].split(',')
lines = lines[1:]
print(header)
print(len(lines))


import numpy as np
float_data = np.zeros((len(lines),len(header)-1))
for i,line in enumerate(lines):
    values = [float(x) for x in line.split(',')[1:]]
    float_data[i,:] = values
print(float_data)


plt.plot(range(1440),float_data[:,1][:1440]) # temperature over the first 1440 timesteps (10 days, one point every 10 minutes)


Normalizing the data

# normalize with statistics from the first 200,000 timesteps only (the training portion)
mean = float_data[:200000].mean(axis=0)
float_data -= mean
std = float_data[:200000].std(axis=0)
float_data /= std
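A quick sanity check (sketch): the training portion should now have roughly zero mean and unit standard deviation in every column:

print(float_data[:200000].mean(axis=0)) # ~0 everywhere
print(float_data[:200000].std(axis=0)) # ~1 everywhere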

Generating batches of data with a generator

'''
data: the original array of floating-point data
lookback: how many timesteps back the input data should go
delay: how many timesteps in the future the target should be
min_index and max_index: indices in data that delimit which timesteps to draw from
shuffle: whether to shuffle the samples or draw them in chronological order
batch_size: the number of samples per batch
step: the period, in timesteps, at which the data is sampled
'''
def generator(data, lookback, delay, min_index, max_index,
              shuffle=False, batch_size=128, step=6):
    if max_index is None:
        max_index = len(data) - delay - 1
    i = min_index + lookback
    while 1:
        if shuffle:
            rows = np.random.randint(
                min_index + lookback, max_index, size=batch_size)
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)

        samples = np.zeros((len(rows),
                            lookback // step,
                            data.shape[-1]))
        targets = np.zeros((len(rows),))
        for j, row in enumerate(rows):
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            targets[j] = data[rows[j] + delay][1]
        yield samples, targets
lookback = 1440 # look back 10 days (one timestep per 10 minutes)
step = 6 # sample one data point per hour
delay = 144 # target is the temperature 24 hours in the future
batch_size = 128

train_gen = generator(float_data,
                      lookback=lookback,
                      delay=delay,
                      min_index=0,
                      max_index=200000,
                      shuffle=True,
                      step=step,
                      batch_size=batch_size)
val_gen = generator(float_data,
                    lookback=lookback,
                    delay=delay,
                    min_index=200001,
                    max_index=300000,
                    step=step,
                    batch_size=batch_size)
test_gen = generator(float_data,
                     lookback=lookback,
                     delay=delay,
                     min_index=300001,
                     max_index=None,
                     step=step,
                     batch_size=batch_size)

# how many steps to draw from val_gen / test_gen to see the whole validation / test set
val_steps = (300000 - 200001 - lookback) // batch_size
test_steps = (len(float_data) - 300001 - lookback) // batch_size

# each samples batch has shape (batch, lookback // step, features); targets has shape (batch,)
# in plain words: given the previous 10 days of data sampled hourly, predict the temperature 24 hours ahead
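To confirm what the generator yields, pull one batch and inspect the shapes (a sketch using the variables above):

samples, targets = next(train_gen)
print(samples.shape) # (128, 240, 14): batch_size, lookback // step timesteps, 14 features
print(targets.shape) # (128,): one future temperature per sample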

A common-sense baseline (no machine learning)

def evaluate_naive_method():
    batch_maes = []
    for step in range(val_steps):
        samples, targets = next(val_gen)
        preds = samples[:, -1, 1] # predict that the temperature in 24 h equals the last observed temperature
        mae = np.mean(np.abs(preds - targets))
        batch_maes.append(mae)
    print(np.mean(batch_maes))

evaluate_naive_method() # multiplying this value by the temperature std gives the error in degrees (the common-sense baseline)
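To express the baseline error in degrees Celsius, multiply the normalized MAE by the temperature column's original standard deviation (a sketch; naive_mae stands for whatever value evaluate_naive_method() printed):

naive_mae = 0.29 # placeholder: substitute the MAE printed above
celsius_mae = naive_mae * std[1] # std[1] is the temperature column's pre-normalization std
print(celsius_mae)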


A basic machine-learning approach

from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
model.add(layers.Flatten(input_shape=(lookback // step, float_data.shape[-1])))
model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=20,
                              validation_data=val_gen,
                              validation_steps=val_steps)
import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(loss))

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()


A recurrent network + dropout

from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
model.add(layers.GRU(32, input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=20,
                              validation_data=val_gen,
                              validation_steps=val_steps)
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
model.add(layers.GRU(32,
                     dropout=0.2, # dropout on the layer's inputs
                     recurrent_dropout=0.2, # dropout on the recurrent connections
                     input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=40,
                              validation_data=val_gen,
                              validation_steps=val_steps)

Stacking recurrent layers

To increase network capacity, you typically increase the number of units per layer or stack more layers

from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
model.add(layers.GRU(32,
                     dropout=0.1,
                     recurrent_dropout=0.5,
                     return_sequences=True, # return the full sequence of outputs so the next GRU gets 3D input
                     input_shape=(None, float_data.shape[-1])))
model.add(layers.GRU(64, activation='relu',
                     dropout=0.1,
                     recurrent_dropout=0.5))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=40,
                              validation_data=val_gen,
                              validation_steps=val_steps)

Using bidirectional RNNs

def reverse_order_generator(data, lookback, delay, min_index, max_index,
                            shuffle=False, batch_size=128, step=6):
    if max_index is None:
        max_index = len(data) - delay - 1
    i = min_index + lookback
    while 1:
        if shuffle:
            rows = np.random.randint(
                min_index + lookback, max_index, size=batch_size)
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)

        samples = np.zeros((len(rows),
                            lookback // step,
                            data.shape[-1]))
        targets = np.zeros((len(rows),))
        for j, row in enumerate(rows):
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            targets[j] = data[rows[j] + delay][1]
        yield samples[:, ::-1, :], targets # same as generator(), but with the time axis reversed

train_gen_reverse = reverse_order_generator(
    float_data,
    lookback=lookback,
    delay=delay,
    min_index=0,
    max_index=200000,
    shuffle=True,
    step=step,
    batch_size=batch_size)
val_gen_reverse = reverse_order_generator(
    float_data,
    lookback=lookback,
    delay=delay,
    min_index=200001,
    max_index=300000,
    step=step,
    batch_size=batch_size)
model = Sequential()
model.add(layers.GRU(32, input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen_reverse,
                              steps_per_epoch=500,
                              epochs=20,
                              validation_data=val_gen_reverse,
                              validation_steps=val_steps)
# equivalent to training on the data in reverse chronological order

Bidirectional RNN details

model = Sequential()
model.add(layers.Embedding(max_features, 32))
model.add(layers.Bidirectional(layers.LSTM(32))) # wrap the LSTM in a Bidirectional layer
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.2)
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
model.add(layers.Bidirectional(
    layers.GRU(32), input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=40,
                              validation_data=val_gen,
                              validation_steps=val_steps)

Sequence processing with 1D convolutions

from keras.datasets import imdb
from keras.preprocessing import sequence

max_features = 10000 # number of words to consider as features
max_len = 500 # cut texts after this number of words (among top max_features most common words)

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=max_len)
x_test = sequence.pad_sequences(x_test, maxlen=max_len)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

A simple 1D convnet on IMDB

from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
model.add(layers.Embedding(max_features, 128, input_length=max_len))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.MaxPooling1D(5))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(1))

model.summary()

model.compile(optimizer=RMSprop(lr=1e-4),
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)

Combining a 1D convnet and an RNN

model = Sequential()
model.add(layers.Conv1D(32, 5, activation='relu',
                        input_shape=(None, float_data.shape[-1])))
model.add(layers.MaxPooling1D(3))
model.add(layers.Conv1D(32, 5, activation='relu'))
model.add(layers.GRU(32, dropout=0.1, recurrent_dropout=0.5))
model.add(layers.Dense(1))

model.summary()

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=20,
                              validation_data=val_gen,
                              validation_steps=val_steps)

Summary

  1. RNNs can be used for timeseries regression, classification, and similar tasks
  2. 1D convnets can be used for sequence tasks such as machine translation
  3. If the overall order of the sequence matters, use a recurrent network
  4. If the overall order does not matter, a 1D convnet works well; for example, a keyword found at the start of a sentence is just as meaningful as one found at the end