TextCNN Explained

Paper Background

The paper "Convolutional Neural Networks for Sentence Classification" was presented by Yoon Kim at EMNLP 2014.

It applies convolutional neural networks (CNNs) to text classification, using kernels of several different sizes to extract key information from a sentence (similar to n-grams with multiple window sizes) and thus capture local correlations more effectively.

Network Structure

(Figure: TextCNN network architecture)

A detailed, step-by-step schematic of TextCNN is shown below:

(Figure: detailed TextCNN pipeline)

Notes:

  • Embedding: the first layer is the 7 x 5 sentence matrix on the far left of the figure; each row is a word vector with dimension 5. This is analogous to the raw pixels of an image.
  • Convolution: the matrix then goes through 1-D convolution layers with kernel_sizes=(2,3,4), each kernel_size having two output channels.
  • MaxPooling: the third layer is a 1-max pooling layer, so sentences of different lengths all become fixed-length representations after pooling.
  • FullConnection and Softmax: finally, a fully connected softmax layer outputs the probability of each class (a shape-level sketch follows this list).
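
To make the shapes above concrete, here is a minimal NumPy sketch of the forward pass in the figure: a 7 x 5 sentence matrix, kernel sizes (2, 3, 4) with two feature maps each, 1-max pooling, and a softmax over two classes. The random weights are illustrative placeholders, not trained parameters.

import numpy as np

seq_len, embed_dim, num_classes = 7, 5, 2
kernel_sizes, num_filters = (2, 3, 4), 2

x = np.random.randn(seq_len, embed_dim)             # sentence matrix, one row per word

pooled = []
for k in kernel_sizes:
    W = np.random.randn(num_filters, k, embed_dim)  # each filter spans k words x the full embedding
    # "valid" 1-D convolution along the word axis -> (seq_len - k + 1, num_filters)
    fmap = np.array([[(x[i:i + k] * W[f]).sum() for f in range(num_filters)]
                     for i in range(seq_len - k + 1)])
    pooled.append(fmap.max(axis=0))                  # 1-max pooling per feature map
feature = np.concatenate(pooled)                     # length = 2 filters x 3 kernel sizes = 6

W_out = np.random.randn(feature.shape[0], num_classes)
logits = feature @ W_out
probs = np.exp(logits) / np.exp(logits).sum()        # softmax over the two classes
print(feature.shape, probs)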

Embedding Layer

TextCNN uses pre-trained word vectors for its embedding layer. Since every word in the dataset can be represented as a vector, we obtain an embedding matrix M in which each row is a word vector. M can be static, i.e. kept fixed during training, or non-static, i.e. updated by back-propagation (a small sketch of the two settings follows).
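
As a small illustration (not the paper's original code), the static and non-static settings differ only in whether the embedding variable is trainable. Below is a TensorFlow 1.x sketch in the style of the model code later in this post; `pretrained` is a stand-in for a pre-trained word2vec matrix.

import numpy as np
import tensorflow as tf

vocab_size, embed_size, seq_len = 10000, 300, 50
pretrained = np.random.randn(vocab_size, embed_size).astype(np.float32)  # stand-in for word2vec vectors

# static: the embedding matrix is frozen, back-propagation never updates it
embedding_static = tf.get_variable("embedding_static", initializer=pretrained, trainable=False)

# non-static: same initialization, but fine-tuned during training
embedding_nonstatic = tf.get_variable("embedding_nonstatic", initializer=pretrained, trainable=True)

input_x = tf.placeholder(tf.int32, [None, seq_len])              # a batch of padded word-id sequences
embedded = tf.nn.embedding_lookup(embedding_nonstatic, input_x)  # [batch, seq_len, embed_size]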

The paper reports several model variants, which differ essentially only in how the embedding layer is configured.

  • CNN-rand
    The baseline model: all words in the embedding layer are randomly initialized and then trained together with the rest of the network.
  • CNN-static
    The embedding layer is initialized with pre-trained word2vec vectors (words missing from the pre-trained word2vec are randomly initialized) and then frozen while the rest of the network is fine-tuned.
  • CNN-non-static
    Same as CNN-static, except that the embedding layer is trained along with the rest of the network.
  • CNN-multichannel
    The embedding layer has two channels, one static and one non-static; during fine-tuning only one channel is updated. Both channels are initialized with pre-trained word2vec vectors.

Convolution Layer

Channels

  • In images, the (R, G, B) components can serve as different channels.
  • For text, the input channels are usually different embeddings (e.g. word2vec vs. GloVe); in practice, a static embedding and a fine-tuned embedding are also used as two separate channels.
  • One channel can also be the word sequence while another is the corresponding part-of-speech sequence; the channels are then combined by summation or concatenation (a sketch of a two-channel input follows this list).
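
The two-channel (CNN-multichannel) idea can be sketched as follows, assuming the same pre-trained matrix is copied into a frozen and a fine-tuned variable and the two lookups are stacked along a channel axis; the variable names are illustrative, not taken from any reference implementation.

import numpy as np
import tensorflow as tf

vocab_size, embed_size, seq_len = 10000, 300, 50
pretrained = np.random.randn(vocab_size, embed_size).astype(np.float32)

emb_static = tf.get_variable("emb_static", initializer=pretrained, trainable=False)  # frozen channel
emb_tuned = tf.get_variable("emb_tuned", initializer=pretrained, trainable=True)     # fine-tuned channel

input_x = tf.placeholder(tf.int32, [None, seq_len])
ch1 = tf.nn.embedding_lookup(emb_static, input_x)   # [batch, seq_len, embed_size]
ch2 = tf.nn.embedding_lookup(emb_tuned, input_x)    # [batch, seq_len, embed_size]
x = tf.stack([ch1, ch2], axis=-1)                   # [batch, seq_len, embed_size, 2]: two input channels for conv2d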

1-D Convolution (conv-1d)

We can view the embedding matrix M as an image and use a convolutional network to extract features. Since adjacent words in a sentence are usually highly correlated, one-dimensional convolution is used: unlike image convolution, text convolution slides in only one direction (vertically, along the word sequence). The kernel width is fixed to the word-vector dimension d, while the kernel height is a tunable hyperparameter.

  • Images are two-dimensional data.
  • Text is one-dimensional data, so TextCNN uses one-dimensional convolution (1-D at the word level; although a sentence becomes a 2-D matrix after word embedding, 2-D convolution at the embedding level is meaningless). The price of 1-D convolution is that filters with different kernel_size values are needed to obtain receptive fields of different widths (see the sketch after this list).
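
The sketch below illustrates the point (it mirrors, but is not copied from, the model code later in this post): because the filter width equals the embedding dimension, tf.nn.conv2d only slides along the word axis, i.e. the convolution is effectively one-dimensional.

import tensorflow as tf

batch, seq_len, embed_size = 32, 50, 300
k, num_filters = 3, 128                              # kernel height k is the hyperparameter

x = tf.placeholder(tf.float32, [batch, seq_len, embed_size])
x4d = tf.expand_dims(x, -1)                          # [batch, seq_len, embed_size, 1] (NHWC layout)

filt = tf.get_variable("filter_k3", [k, embed_size, 1, num_filters])  # width = embed_size = d
conv = tf.nn.conv2d(x4d, filt, strides=[1, 1, 1, 1], padding="VALID")
# conv shape: [batch, seq_len - k + 1, 1, num_filters] -- one value per window position and filter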

Pooling Layer

Feature maps produced by kernels of different sizes have different lengths, so a pooling function is applied to each feature map to bring them to the same dimension.

Max Pooling

(Figure: 1-max pooling)

The most common choice is 1-max pooling, which extracts the maximum value of each feature map; taking the maximum captures that map's most important feature. Each kernel therefore contributes a single value, and applying 1-max pooling to all kernels and concatenating the results gives the final feature vector.

K-Max Pooling

(Figure: k-max pooling)

Take the values whose scores are in the top K among all feature activations, and keep them in their original order (order-preserving concatenation), thereby retaining more feature information for later stages.

See the paper "A Convolutional Neural Network for Modelling Sentences".
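
A small NumPy sketch of k-max pooling on a single feature map (the scores and k below are made up): keep the k largest activations, but in their original left-to-right order.

import numpy as np

def kmax_pooling(feature_map, k):
    # indices of the k largest values, re-sorted by position so the original order is kept
    top_idx = np.argsort(feature_map)[-k:]
    return feature_map[np.sort(top_idx)]

scores = np.array([0.2, 1.5, 0.1, 0.9, 2.3, 0.4])
print(kmax_pooling(scores, k=3))   # [1.5 0.9 2.3]: the top-3 values, original order preserved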

Chunk-Max Pooling

(Figure: chunk-max pooling)

Split the feature vector produced by a filter's convolution layer into several segments (chunks) and take one maximum value within each segment. For example, if a filter's feature vector is cut into 3 chunks, one maximum is taken per chunk, giving 3 feature values. Because the chunks are formed before taking the max, coarse-grained positional information is preserved; and if a strong feature appears several times, its multiplicity can also be captured. The chunks can be formed in different ways: the number of segments can be fixed in advance (static chunking), or the chunk boundaries can be determined dynamically from the input (dynamic Chunk-Max pooling). The paper "Local Translation Prediction with Global Sentence Representation" shows experimentally that static Chunk-Max pooling outperforms max-pooling-over-time in a machine-translation application.
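
A minimal NumPy sketch of static Chunk-Max pooling, assuming the number of chunks is fixed in advance (the example feature map is made up):

import numpy as np

def chunk_max_pooling(feature_map, num_chunks):
    chunks = np.array_split(feature_map, num_chunks)   # static, evenly sized segments
    return np.array([c.max() for c in chunks])

scores = np.array([0.2, 1.5, 0.1, 0.9, 2.3, 0.4])
print(chunk_max_pooling(scores, num_chunks=3))         # [1.5 0.9 2.3]: one maximum per chunk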

Dynamic Pooling

If a trigger word is encountered during convolution, it can be marked (shown in a different color in the figure), and max pooling then splits the feature map into chunks at those marks. The paper "Event Extraction via Dynamic Multi-Pooling Convolutional Neural Networks" proposes such a variant of chunk pooling, i.e. dynamic Chunk-Max pooling, and shows experimentally that it improves performance.
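
A simplified NumPy sketch of the idea (one reading of dynamic multi-pooling, not the paper's full model): the feature map is split at the trigger word's position and each segment is max-pooled separately; the trigger index here is illustrative.

import numpy as np

def dynamic_multi_pooling(feature_map, trigger_pos):
    # split at the trigger position, then take one max per segment
    left, right = feature_map[:trigger_pos + 1], feature_map[trigger_pos + 1:]
    return np.array([left.max(), right.max()])

scores = np.array([0.2, 1.5, 0.1, 0.9, 2.3, 0.4])
print(dynamic_multi_pooling(scores, trigger_pos=2))    # max over scores[:3] and over scores[3:]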

Code Implementation

Code Structure

  • data_tools.py: data preprocessing
  • modules.py: model / network definition
  • run.py: entry point and training script

Data Preprocessing

The dataset is the Fudan University Chinese text classification corpus, which contains 9,833 documents in 19 categories. The training and test sets are split roughly 1:1. For a detailed walkthrough of how the Chinese corpus is preprocessed, see my other blog post on that topic.

Chinese dataset

import os
import re
import jieba
from collections import Counter
from sklearn.utils import shuffle

def is_chinese(char):
    """
    Check whether a character is a Chinese character.
    :param char: str
    :return: bool, True if it is, False otherwise
    """
    punctuation = [u'\u300a', u'\u2014', u'\u201c']  # ['《', '—', '“']
    if u'\u4e00' <= char <= u'\u9fff' \
            or u'\uff10' <= char <= u'\uff19' \
            or char in punctuation:
        return True
    else:
        return False

def get_listdir(path):
    """
    List the names of all files or directories under a path.
    :param path: str
    :return: list
    """
    return os.listdir(path)

def get_listpath(path):
    """
    List the full paths of all files or directories under a path.
    :param path: str
    :return: list
    """
    dirname = os.listdir(path)
    path_ = [os.path.join(path, x) for x in dirname]
    return path_

def parser_file2para(file):
    """
    Parse the file into a list of paragraphs.
    :param file: str, file name
    :return: list, paragraphs
    """
    start = (' ', '  ', '  ', '   ', '  ')
    text = ""
    s = False
    paras = []
    with open(file) as f:
        # process the file line by line
        for line in f:
            if line.startswith(start) and len(line) > 4 and is_chinese(line[4]):
                if text != "":
                    text = re.sub('(.*?)', '', text)
                    if len(text) <= 10:  # drop the text if it is shorter than 10 characters
                        pass
                    else:
                        paras.append(text.replace(' ', ''))
                    text = ""
                text += line.lstrip().rstrip('\n')
                s = True
                continue
            if line.startswith(start) and len(line) > 4 and not is_chinese(line[4]):
                if text != "":
                    text = re.sub('(.*?)', '', text)
                    if len(text) <= 10:
                        pass
                    else:
                        paras.append(text.replace(' ', ''))
                    text = ""
                s = False
                continue
            if s and line[0] != ' ':
                text += line.rstrip('\n')
                continue
        # handle the last paragraph
        if text != "":
            text = re.sub('(.*?)', '', text)
            if len(text) <= 10:
                pass
            else:
                paras.append(text.replace(' ', ''))
    return paras

def parser_file2sentence(file, low, high):
    """
    Parse the file into a list of sentences.
    :param file: str, file name
    :param low: minimum sentence length
    :param high: maximum sentence length
    :return: list, sentences
    """
    sentences = []
    paras = parser_file2para(file)
    for para in paras:
        sentence = re.split('(。”|!”|?”|。)|。|!|?)', para)
        sentence = [x.strip() for x in sentence]
        if len(sentence) == 1 and low <= len(*sentence) <= high:
            sentences.extend(sentence)
            continue
        sentence = ["".join(i) for i in zip(sentence[0::2], sentence[1::2])]
        sentence = [x for x in sentence if low <= len(x) <= high and x[0] != '(']
        sentences.extend(sentence)
    return sentences

def text_save(dirpath, filename, data):  # filename is the target file, data is the list of lines to write
    """
    Write the items of the list `data` into the file `filename`, one item per line.
    :param dirpath: str, directory path
    :param filename: str, file name
    :param data: list, data to write
    :return: None
    """
    if not os.path.exists(dirpath):
        os.makedirs(dirpath)
        print(dirpath, "directory created")
    filepath = os.path.join(dirpath, filename)
    file = open(filepath, 'a')
    for line in data:
        line = line + '\n'  # append a newline to each line
        file.write(line)
    file.close()
    print("file saved successfully")

def count_word_sentence(filepath):
    """
    Count the number of sentences and words in a file that has one sentence per line.
    :param filepath: str, file path
    :return: int, word count, sentence count, average / max / min sentence length
    """
    num_word, num_sent = 0, 0
    max_sent = -float('inf')
    min_sent = float('inf')
    with open(filepath, 'r') as f:
        for line in f:
            l = len(line)
            max_sent = max(max_sent, l)
            min_sent = min(min_sent, l)
            num_word += l
            num_sent += 1
    if num_sent == 0:
        ave_sent = 0
    else:
        ave_sent = num_word // num_sent
    return num_word, num_sent, ave_sent, max_sent, min_sent

def cut_words(rawpath, datapath):
    """
    Segment the files under the source path with jieba and save them to new files,
    with words separated by spaces.
    :param rawpath: str, raw data path
    :param datapath: str, processed data path
    :return: list, word list
    """
    words = []
    if not os.path.exists(datapath):
        for dirpath, dirnames, filenames in os.walk(rawpath):
            if not filenames:
                continue
            filepaths = [os.path.join(dirpath, filename) for filename in filenames]
            for filepath, filename in zip(filepaths, filenames):
                text = []
                with open(filepath, 'r') as f:
                    for line in f:
                        sentence = jieba.lcut(line)
                        sentence = sentence[:-1]
                        words.extend(sentence)
                        sentence = ' '.join(sentence)
                        text.append(sentence)
                data_path = os.path.join(datapath, os.path.basename(dirpath))
                text_save(data_path, filename, text)
        return words
    else:
        for dirpath, dirnames, filenames in os.walk(datapath):
            if not filenames:
                continue
            filepaths = [os.path.join(dirpath, filename) for filename in filenames]
            for filepath, filename in zip(filepaths, filenames):
                with open(filepath, 'r') as f:
                    for line in f:
                        sentence = line.split(' ')
                        sentence = sentence[:-1]
                        words.extend(sentence)
        return words

def read_dataset(rawpath, datapath, vocab_size):
    """
    Read the dataset and build the training, validation and test sets.
    :param rawpath: str, raw data path
    :param datapath: str, processed data path
    :param vocab_size: int, vocabulary size
    :return: word2index, mapping from words to ids
             label2index, mapping from labels to ids
             x_train_, training inputs
             y_train, training labels
             x_dev_, validation inputs
             y_dev, validation labels
             x_test_, test inputs
             y_test, test labels
    """
    if not os.path.exists(datapath):
        cut_words(rawpath, datapath)
    x_train, y_train = [], []
    x_test, y_test = [], []
    words = []
    for dirpath, dirnames, filenames in os.walk(datapath):
        if not filenames:
            continue
        filepaths = [os.path.join(dirpath, filename) for filename in filenames]
        for filepath, filename in zip(filepaths, filenames):
            tag = filename.split('-')[0]
            with open(filepath, 'r') as f:
                for line in f:
                    sentence = line.split(' ')
                    sentence = sentence[:-1]
                    if dirpath == os.path.join(datapath, 'train'):
                        x_train.append(sentence)
                        y_train.append(tag)
                    elif dirpath == os.path.join(datapath, 'test'):
                        x_test.append(sentence)
                        y_test.append(tag)
                    else:
                        pass
                    words.extend(sentence)
    # build the training, validation and test sets
    x_train, y_train = shuffle(x_train, y_train)
    x_test, y_test = shuffle(x_test, y_test)
    cut = len(y_test) // 2
    x_dev, y_dev = x_test[:cut], y_test[:cut]
    x_test, y_test = x_test[cut:], y_test[cut:]
    # build the vocabulary mapping
    vocab = Counter(words).most_common(vocab_size)
    word2index = {k_v[0]: i + 2 for i, k_v in enumerate(vocab)}
    word2index.update({"PAD": 0, "UNK": 1})
    # build the label mapping
    label = sorted(list(set(y_train + y_dev + y_test)))
    label2index = {tag: i for i, tag in enumerate(label)}
    # encode the training, validation and test sets
    max_sent_len = max([len(sent) for sent in x_train + x_dev + x_test])
    x_train_ = [[word2index.get(w, 1) for w in sent] + [0] * (max_sent_len - len(sent)) for sent in x_train]
    x_dev_ = [[word2index.get(w, 1) for w in sent] + [0] * (max_sent_len - len(sent)) for sent in x_dev]
    x_test_ = [[word2index.get(w, 1) for w in sent] + [0] * (max_sent_len - len(sent)) for sent in x_test]
    y_train = [label2index[tag] for tag in y_train]
    y_dev = [label2index[tag] for tag in y_dev]
    y_test = [label2index[tag] for tag in y_test]

    return word2index, label2index, x_train_, y_train, x_dev_, y_dev, x_test_, y_test

For reference, the preprocessing code for the companion English datasets is attached below.
English datasets

from sklearn.utils import shuffle

from collections import Counter
import pickle

def read_TREC():
    data = {}

    def read(mode):
        x, y = [], []

        with open("data/TREC/TREC_" + mode + ".txt", "r", encoding="utf-8") as f:
            for line in f:
                if line[-1] == "\n":
                    line = line[:-1]
                y.append(line.split()[0].split(":")[0])
                x.append(line.split()[1:])

        x, y = shuffle(x, y)

        if mode == "train":
            dev_idx = len(x) // 10
            data["dev_x"], data["dev_y"] = x[:dev_idx], y[:dev_idx]
            data["train_x"], data["train_y"] = x[dev_idx:], y[dev_idx:]
        else:
            data["test_x"], data["test_y"] = x, y

    read("train")
    read("test")

    return data


def read_MR():
    data = {}
    x, y = [], []

    with open("data/MR/rt-polarity.pos", "r", encoding="utf-8") as f:
        for line in f:
            if line[-1] == "\n":
                line = line[:-1]
            x.append(line.split())
            y.append(1)

    with open("data/MR/rt-polarity.neg", "r", encoding="utf-8") as f:
        for line in f:
            if line[-1] == "\n":
                line = line[:-1]
            x.append(line.split())
            y.append(0)

    x, y = shuffle(x, y)
    dev_idx = len(x) // 10 * 8
    test_idx = len(x) // 10 * 9

    data["train_x"], data["train_y"] = x[:dev_idx], y[:dev_idx]
    data["dev_x"], data["dev_y"] = x[dev_idx:test_idx], y[dev_idx:test_idx]
    data["test_x"], data["test_y"] = x[test_idx:], y[test_idx:]

    data["vocab"] = sorted(list(set([w for sent in data["train_x"] + data["dev_x"] + data["test_x"] for w in sent])))
    # data["vocab"] = Counter([w for sent in data["train_x"] + data["dev_x"] + data["test_x"] for w in sent]).most_common(5000)
    # data["vocab"] = [x for x, y in data["vocab"]]
    data["classes"] = sorted(list(set(data["train_y"])))
    data["word_to_idx"] = {w: i + 2 for i, w in enumerate(data["vocab"])}
    data["word_to_idx"].update({"PAD": 0, "UNK": 1})
    data["label_to_idx"] = {l: i for i, l in enumerate(data["classes"])}
    # data["idx_to_word"] = {i: w for i, w in enumerate(data["vocab"])}

    max_sent_len = max([len(sent) for sent in data["train_x"] + data["dev_x"] + data["test_x"]])
    data["train_x_"] = [[data["word_to_idx"][w] for w in sent] + [0] * (max_sent_len - len(sent)) for sent in data["train_x"]]
    data["dev_x_"] = [[data["word_to_idx"][w] for w in sent] + [0] * (max_sent_len - len(sent)) for sent in data["dev_x"]]
    data["test_x_"] = [[data["word_to_idx"][w] for w in sent] + [0] * (max_sent_len - len(sent)) for sent in data["test_x"]]

    return data

Model Construction

from __future__ import print_function
import tensorflow as tf

class TextCNN:
    def __init__(self, config):
        """init all hyperparameters here"""
        # set hyperparameters
        self.num_epochs = config['n_epochs']
        self.num_classes = config['num_classes']
        self.batch_size = config['batch_size']
        self.sequence_length = config['sequence_length']
        self.vocab_size = config['vocab_size']
        self.embed_size = config['embed_size']
        self.learning_rate = tf.Variable(config['learning_rate'], trainable=False, name="learning_rate")
        self.learning_rate_decay_half_op = tf.assign(self.learning_rate, self.learning_rate * config['decay_rate_big'])
        self.filter_sizes = config['filter_sizes']  # a list of int, e.g. [3, 4, 5]
        self.num_filters = config['num_filters']
        self.initializer = config['initializer']
        self.num_filters_total = self.num_filters * len(self.filter_sizes)  # how many filters in total
        self.use_mulitple_layer_cnn = config['use_mulitple_layer_cnn']
        self.multi_label_flag = config['multi_label_flag']
        self.clip_gradients = config['clip_gradients']
        self.use_embedding = config['use_embedding']
        self.ckpt_dir = config["ckpt_dir"]

        # add placeholders (X, label)
        with tf.name_scope('placeholder'):
            self.is_training_flag = tf.placeholder(tf.bool, name="is_training_flag")
            self.input_x = tf.placeholder(tf.int32, shape=[None, self.sequence_length], name="input_x")  # X
            self.input_y = tf.placeholder(tf.int32, shape=[None, ], name="input_y")  # y: [None,]
            self.input_y_multilabel = tf.placeholder(tf.float32, shape=[None, self.num_classes], name="input_y_multilabel")  # y: [None, num_classes], for multi-label classification only
            self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
            self.iter = tf.placeholder(tf.int32)  # training iteration

        self.global_step = tf.Variable(0, trainable=False, name="Global_Step")
        self.epoch_step = tf.Variable(0, trainable=False, name="Epoch_Step")
        self.epoch_increment = tf.assign(self.epoch_step, tf.add(self.epoch_step, tf.constant(1)))
        self.decay_steps, self.decay_rate = config['decay_steps'], config['decay_rate']

        self.instantiate_weights()
        # build the model
        word_embedded = self.word2vec()  # [None, sentence_length, embed_size, 1]
        if self.use_mulitple_layer_cnn:  # this may take 50G memory.
            print("use multiple layer CNN")
            h = self.cnn_multiple_layers(word_embedded)
        else:  # this takes little memory, less than 2G.
            print("use single layer CNN")
            h = self.cnn_single_layer(word_embedded)
        self.logits = self.classifer(h)  # [None, self.num_classes]
        self.possibility = tf.nn.sigmoid(self.logits)

        # compute the loss
        if self.multi_label_flag:
            print("going to use multi label loss.")
            self.loss_val = self.loss_multilabel()
        else:
            print("going to use single label loss.")
            self.loss_val = self.loss()
        self.train_op = self.train()

        # model evaluation
        if not self.multi_label_flag:
            self.predictions = tf.argmax(self.logits, 1, name="predictions")  # shape: [None,]
            print("self.predictions:", self.predictions)
            correct_prediction = tf.equal(tf.cast(self.predictions, tf.int32), self.input_y)  # tf.argmax(self.logits, 1) --> [batch_size]
            self.accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name="Accuracy")  # shape=()

    def instantiate_weights(self):
        """define all weights here"""
        with tf.name_scope("embedding"):  # embedding matrix
            self.Embedding = tf.get_variable("Embedding", shape=[self.vocab_size, self.embed_size], initializer=self.initializer)  # [vocab_size, embed_size]
            self.W_projection = tf.get_variable("W_projection", shape=[self.num_filters_total, self.num_classes], initializer=self.initializer)  # [num_filters_total, num_classes]
            self.b_projection = tf.get_variable("b_projection", shape=[self.num_classes])  # [num_classes]

    def word2vec(self):
        with tf.name_scope("embedding"):
            embedded_words = tf.nn.embedding_lookup(self.Embedding, self.input_x)
            sentence_embeddings_expanded = tf.expand_dims(embedded_words, -1)
        return sentence_embeddings_expanded

    def cnn_single_layer(self, sentence_embeddings_expanded):
        pooled_outputs = []
        for i, filter_size in enumerate(self.filter_sizes):
            # with tf.name_scope("convolution-pooling-%s" % filter_size):
            with tf.variable_scope("convolution-pooling-%s" % filter_size):
                # ====> a. create filter
                filter = tf.get_variable("filter-%s" % filter_size, [filter_size, self.embed_size, 1, self.num_filters], initializer=self.initializer)
                # ====> b. conv operation: conv2d ===> computes a 2-D convolution given 4-D `input` and `filter` tensors.
                conv = tf.nn.conv2d(sentence_embeddings_expanded, filter, strides=[1, 1, 1, 1], padding="VALID", name="conv")  # shape: [batch_size, sequence_length - filter_size + 1, 1, num_filters]
                conv_bn = tf.contrib.layers.batch_norm(conv, is_training=self.is_training_flag, scope='cnn_bn_')

                # ====> c. apply nonlinearity
                b = tf.get_variable("b-%s" % filter_size, [self.num_filters])
                h = tf.nn.relu(tf.nn.bias_add(conv_bn, b), "relu")  # shape: [batch_size, sequence_length - filter_size + 1, 1, num_filters]. tf.nn.bias_add: adds `bias` to `value`
                # ====> d. max-pooling. value: a 4-D `Tensor` with shape `[batch, height, width, channels]`
                pooled = tf.nn.max_pool(h, ksize=[1, self.sequence_length - filter_size + 1, 1, 1], strides=[1, 1, 1, 1], padding='VALID', name="pool")  # shape: [batch_size, 1, 1, num_filters]
                pooled_outputs.append(pooled)
        # 3.=====> combine all pooled features and flatten them
        self.h_pool = tf.concat(pooled_outputs, 3)  # shape: [batch_size, 1, 1, num_filters_total]. tf.concat => concatenates tensors along one dimension, where num_filters_total = num_filters_1 + num_filters_2 + num_filters_3
        self.h_pool_flat = tf.reshape(self.h_pool, [-1, self.num_filters_total])  # shape: [None, num_filters_total]

        # 4.=====> add dropout: use tf.nn.dropout
        with tf.name_scope("dropout"):
            self.h_drop = tf.nn.dropout(self.h_pool_flat, keep_prob=self.dropout_keep_prob)  # [None, num_filters_total]
        h_ = tf.layers.dense(self.h_drop, self.num_filters_total, activation=tf.nn.tanh, use_bias=True)
        return h_

    def cnn_multiple_layers(self, sentence_embeddings_expanded):
        # 2.=====> loop over each filter size. for each filter, do: convolution-pooling (a. create filters, b. conv, c. apply nonlinearity, d. max-pooling)
        # you can use: tf.nn.conv2d; tf.nn.relu; tf.nn.max_pool; the feature is a 4-D tensor
        pooled_outputs = []
        print("sentence_embeddings_expanded:", sentence_embeddings_expanded)
        for i, filter_size in enumerate(self.filter_sizes):
            with tf.variable_scope('cnn_multiple_layers' + "convolution-pooling-%s" % filter_size):
                # 1) CNN->BN->relu
                filter = tf.get_variable("filter-%s" % filter_size, [filter_size, self.embed_size, 1, self.num_filters], initializer=self.initializer)
                conv1 = tf.nn.conv2d(sentence_embeddings_expanded, filter, strides=[1, 1, 1, 1], padding="SAME", name="conv")
                conv1_bn = tf.contrib.layers.batch_norm(conv1, is_training=self.is_training_flag, scope='cnn1')
                print(i, "conv1:", conv1_bn)
                b = tf.get_variable("b-%s" % filter_size, [self.num_filters])
                h = tf.nn.relu(tf.nn.bias_add(conv1_bn, b), "relu")  # shape: [batch_size, sequence_length, 1, num_filters]. tf.nn.bias_add: adds `bias` to `value`

                # 2) CNN->BN->relu
                h = tf.reshape(h, [-1, self.sequence_length, self.num_filters, 1])  # shape: [batch_size, sequence_length, num_filters, 1]
                # Layer2: CONV-RELU
                filter2 = tf.get_variable("filter2-%s" % filter_size, [filter_size, self.num_filters, 1, self.num_filters], initializer=self.initializer)
                conv2 = tf.nn.conv2d(h, filter2, strides=[1, 1, 1, 1], padding="SAME", name="conv2")
                conv2_bn = tf.contrib.layers.batch_norm(conv2, is_training=self.is_training_flag, scope='cnn2')
                print(i, "conv2:", conv2)
                b2 = tf.get_variable("b2-%s" % filter_size, [self.num_filters])
                h2 = tf.nn.relu(tf.nn.bias_add(conv2_bn, b2), "relu2")

                # 3. Max-pooling
                pooling_max = tf.squeeze(tf.nn.max_pool(h2, ksize=[1, self.sequence_length, 1, 1], strides=[1, 1, 1, 1], padding='VALID', name="pool"))
                # pooling_avg = tf.squeeze(tf.reduce_mean(h, axis=1))  # [batch_size, num_filters]
                print(i, "pooling:", pooling_max)
                # pooling = tf.concat([pooling_max, pooling_avg], axis=1)  # [batch_size, num_filters*2]
                pooled_outputs.append(pooling_max)
        # concat
        h_pool = tf.concat(pooled_outputs, axis=1)  # [batch_size, num_filters * len(self.filter_sizes)]
        print("h.concat:", h_pool)

        with tf.name_scope("dropout"):
            h_ = tf.nn.dropout(h_pool, keep_prob=self.dropout_keep_prob)
        return h_

    def classifer(self, h):
        with tf.name_scope("output"):
            logits = tf.matmul(h, self.W_projection) + self.b_projection
        return logits

    def loss_multilabel(self, l2_lambda=0.0001):  # this loss function is for multi-label classification
        with tf.name_scope("loss"):
            losses = tf.nn.sigmoid_cross_entropy_with_logits(labels=self.input_y_multilabel, logits=self.logits)
            losses = tf.reduce_sum(losses, axis=1)  # shape=(?,): loss for each example in the batch
            loss = tf.reduce_mean(losses)  # shape=(): average loss over the batch
            l2_losses = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables() if 'bias' not in v.name]) * l2_lambda
            loss += l2_losses
        return loss

    def loss(self, l2_lambda=0.0001):
        with tf.name_scope("loss"):
            losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=self.input_y, logits=self.logits)
            loss = tf.reduce_mean(losses)  # shape=()
            l2_losses = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables() if 'bias' not in v.name]) * l2_lambda
            loss += l2_losses
        return loss

    def train(self):
        """based on the loss, use Adam (with gradient clipping) to update the parameters"""
        learning_rate = tf.train.exponential_decay(self.learning_rate, self.global_step, self.decay_steps, self.decay_rate, staircase=True)
        self.learning_rate_ = learning_rate
        optimizer = tf.train.AdamOptimizer(learning_rate)
        gradients, variables = zip(*optimizer.compute_gradients(self.loss_val))
        gradients, _ = tf.clip_by_global_norm(gradients, self.clip_gradients)
        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        with tf.control_dependencies(update_ops):
            train_op = optimizer.apply_gradients(zip(gradients, variables))
        return train_op

Main Entry Point

import tensorflow as tf
import numpy as np
from modules import TextCNN
import os
from data_tools import read_dataset
from numba import jit

# configuration
configuration = {
    'vocab_size': 5000,  # vocabulary size
    'embed_size': 300,  # dimension of word embeddings

    'sequence_length': 200,  # max sentence length
    'num_classes': 2,  # number of labels

    'filter_sizes': [3, 4, 5],  # sizes of convolution kernels
    'num_filters': 128,  # number of convolution kernels per size

    'learning_rate': 1e-3,  # learning rate
    'decay_rate': 1.0,  # learning rate decay
    'decay_rate_big': 0.50,
    'decay_steps': 1000,
    'clip_gradients': 5.0,
    'dropout_rate': 0.5,  # dropout

    'n_epochs': 10,  # epochs
    'batch_size': 64,  # batch size
    'initializer': tf.random_normal_initializer(stddev=0.1),
    'use_mulitple_layer_cnn': False,
    'multi_label_flag': False,
    'use_embedding': False,

    "ckpt_dir": "text_cnn_checkpoint/"
}

# 1. load data (X: list of int, y: int). 2. create session. 3. feed data. 4. training (5. validation), (6. prediction)
def main(_):
    word2index, label2index, trainX, trainY, vaildX, vaildY, testX, testY = read_dataset('Chinese_Data_2', 'process_2', 30000)
    configuration['vocab_size'] = len(word2index)
    print("cnn_model.vocab_size:", configuration['vocab_size'])
    configuration['num_classes'] = len(label2index)
    print("num_classes:", configuration['num_classes'])
    num_examples = len(trainX)
    print("num_examples of training:", num_examples, ";sentence_len:", len(trainX[0]))

    configuration['sequence_length'] = len(trainX[0])
    # 2. create session.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    with tf.Session(config=config) as sess:
        # instantiate the model
        textCNN = TextCNN(configuration)
        # initialize the saver
        saver = tf.train.Saver()
        if os.path.exists(textCNN.ckpt_dir + "checkpoint"):
            print("Restoring Variables from Checkpoint.")
            saver.restore(sess, tf.train.latest_checkpoint(textCNN.ckpt_dir))
            # for i in range(3):  # decay learning rate if necessary.
            #     print(i, "Going to decay learning rate by half.")
            #     sess.run(textCNN.learning_rate_decay_half_op)
        else:
            print('Initializing Variables')
            sess.run(tf.global_variables_initializer())
            if textCNN.use_embedding:  # load pre-trained word embeddings
                index2word = {v: k for k, v in word2index.items()}
                assign_pretrained_word_embedding(sess, index2word, textCNN, "GoogleNews-vectors-negative300.bin")
        curr_epoch = sess.run(textCNN.epoch_step)
        # 3. feed data & training
        number_of_training_data = len(trainX)
        batch_size = textCNN.batch_size
        iteration = 0
        for epoch in range(curr_epoch, textCNN.num_epochs):
            loss, counter = 0.0, 0
            for start, end in zip(range(0, number_of_training_data, batch_size),
                                  range(batch_size, number_of_training_data, batch_size)):
                iteration = iteration + 1
                if epoch == 0 and counter == 0:
                    print("trainX[start:end]:", trainX[start: end])
                feed_dict = {textCNN.input_x: trainX[start:end], textCNN.dropout_keep_prob: 0.8, textCNN.is_training_flag: True}
                if not textCNN.multi_label_flag:
                    feed_dict[textCNN.input_y] = trainY[start: end]
                else:
                    feed_dict[textCNN.input_y_multilabel] = trainY[start: end]
                curr_loss, lr, _ = sess.run([textCNN.loss_val, textCNN.learning_rate, textCNN.train_op], feed_dict)
                loss, counter = loss + curr_loss, counter + 1
                if counter % 50 == 0:
                    print("Epoch %d\tBatch %d\tTrain Loss:%.3f\tLearning rate:%.5f" % (epoch, counter, loss / float(counter), lr))

            # epoch increment
            print("going to increment epoch counter....")
            sess.run(textCNN.epoch_increment)

            # 4. validation
            if epoch % 1 == 0:
                eval_loss, f1_score, f1_micro, f1_macro = do_eval(sess, textCNN, vaildX, vaildY)
                print("Epoch %d Validation Loss:%.3f\tF1 Score:%.3f\tF1_micro:%.3f\tF1_macro:%.3f" % (epoch, eval_loss, f1_score, f1_micro, f1_macro))
                # save model to checkpoint
                save_path = textCNN.ckpt_dir + "model.ckpt"
                print("Going to save model..")
                saver.save(sess, save_path, global_step=epoch)

        # 5. finally, evaluate on the test set and report the test metrics
        test_loss, f1_score, f1_micro, f1_macro = do_eval(sess, textCNN, testX, testY)
        print("Test Loss:%.3f\tF1 Score:%.3f\tF1_micro:%.3f\tF1_macro:%.3f" % (test_loss, f1_score, f1_micro, f1_macro))
    pass


# evaluate on the validation set, reporting the loss and F1 scores
def do_eval(sess, textCNN, evalX, evalY):
    evalX = evalX
    evalY = evalY
    number_examples = len(evalX)
    eval_loss, eval_counter, eval_f1_score, eval_p, eval_r = 0.0, 0, 0.0, 0.0, 0.0
    batch_size = 1
    predict = []

    for start, end in zip(range(0, number_examples, batch_size), range(batch_size, number_examples + batch_size, batch_size)):
        ''' evaluation in one batch '''
        if textCNN.multi_label_flag:
            feed_dict = {textCNN.input_x: evalX[start:end],
                         textCNN.input_y_multilabel: evalY[start:end],
                         textCNN.dropout_keep_prob: 1.0,
                         textCNN.is_training_flag: False}
        else:
            feed_dict = {textCNN.input_x: evalX[start:end],
                         textCNN.input_y: evalY[start:end],
                         textCNN.dropout_keep_prob: 1.0,
                         textCNN.is_training_flag: False}
        current_eval_loss, logits = sess.run(
            [textCNN.loss_val, textCNN.logits], feed_dict)
        if not textCNN.multi_label_flag:
            predict = [*predict, np.argmax(np.array(logits[0]))]
        eval_loss += current_eval_loss
        eval_counter += 1

    # if textCNN.multi_label_flag:
    #     evalY = [np.argmax(ii) for ii in evalY]
    # else: pass

    # if not textCNN.multi_label_flag:
    #     predict = [int(ii > 0.5) for ii in predict]
    _, _, f1_macro, f1_micro, _ = fastF1(evalY, predict, textCNN.num_classes)
    f1_score = (f1_micro + f1_macro) / 2.0
    return eval_loss / float(eval_counter), f1_score, f1_micro, f1_macro

# @jit
def fastF1(result: list, predict: list, num_classes: int):
    ''' f1 score '''
    true_total, r_total, p_total, p, r = 0, 0, 0, 0, 0
    total_list = []
    for trueValue in range(num_classes):
        trueNum, recallNum, precisionNum = 0, 0, 0
        for index, values in enumerate(result):
            if values == trueValue:
                recallNum += 1
                if values == predict[index]:
                    trueNum += 1
            if predict[index] == trueValue:
                precisionNum += 1
        R = trueNum / recallNum if recallNum else 0
        P = trueNum / precisionNum if precisionNum else 0
        true_total += trueNum
        r_total += recallNum
        p_total += precisionNum
        p += P
        r += R
        f1 = (2 * P * R) / (P + R) if (P + R) else 0
        total_list.append([P, R, f1])
    p, r = np.array([p, r]) / num_classes
    micro_r, micro_p = true_total / np.array([r_total, p_total])
    macro_f1 = (2 * p * r) / (p + r) if (p + r) else 0
    micro_f1 = (2 * micro_p * micro_r) / (micro_p + micro_r) if (micro_p + micro_r) else 0
    accuracy = true_total / len(result)
    print('P: {:.2f}%, R: {:.2f}%, Micro_f1: {:.2f}%, Macro_f1: {:.2f}%, Accuracy: {:.2f}%'.format(
        p * 100, r * 100, micro_f1 * 100, macro_f1 * 100, accuracy * 100))
    return p, r, macro_f1, micro_f1, total_list

def assign_pretrained_word_embedding(sess, index2word, textCNN, word2vec_model_path):
    from gensim.models.keyedvectors import KeyedVectors
    print("using pre-trained word embedding. started. word2vec_model_path:", word2vec_model_path)
    word2vec_model = KeyedVectors.load_word2vec_format(word2vec_model_path, binary=True)
    word_embedding_2dlist = [[]] * textCNN.vocab_size  # create an empty word_embedding list.
    word_embedding_2dlist[0] = np.zeros(textCNN.embed_size)  # assign zeros for the first word: 'PAD'
    bound = np.sqrt(6.0) / np.sqrt(textCNN.vocab_size)  # bound for random values.
    word_embedding_2dlist[1] = np.random.uniform(-bound, bound, textCNN.embed_size)  # assign a random vector for the word 'UNK'
    count_exist = 0
    count_not_exist = 0
    for i in range(2, textCNN.vocab_size):  # loop over each word; the first two indices are PAD and UNK
        word = index2word[i]  # get the word
        # embedding = None
        try:
            embedding = word2vec_model.word_vec(word)  # try to get its vector: an array.
        except Exception:
            embedding = None
        if embedding is not None:  # the word has a pre-trained embedding
            word_embedding_2dlist[i] = embedding
            count_exist = count_exist + 1  # assign the array to this word.
        else:  # no embedding for this word
            word_embedding_2dlist[i] = np.random.uniform(-bound, bound, textCNN.embed_size)
            count_not_exist = count_not_exist + 1  # initialize a random vector for the word.
    word_embedding_final = np.array(word_embedding_2dlist)  # convert to a 2-D array.
    word_embedding = tf.constant(word_embedding_final, dtype=tf.float32)  # convert to a tensor
    t_assign_embedding = tf.assign(textCNN.Embedding, word_embedding)  # assign this value to the model's embedding variable.
    sess.run(t_assign_embedding)
    print("words with a pre-trained embedding:", count_exist, "; words without one:", count_not_exist)
    print("using pre-trained word embedding. ended...")


if __name__ == "__main__":
    tf.app.run()

References

textCNN GitHub project
