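TextCNN Explained

Paper Background

The paper "Convolutional Neural Networks for Sentence Classification" was presented by Yoon Kim at EMNLP 2014.

It applies a convolutional neural network (CNN) to text classification, using several kernels of different sizes to extract key information from a sentence (similar to n-grams with multiple window sizes) and thereby capture local correlations more effectively.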
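
To make the n-gram analogy concrete, the sketch below lists the word windows that kernels of size 2, 3 and 4 would each look at. The helper name `ngram_windows` and the example sentence are made up for illustration; they are not part of the paper or of the code later in this post.

```python
def ngram_windows(tokens, n):
    """Return every window of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["I", "like", "this", "movie", "very", "much", "!"]  # made-up sentence

for n in (2, 3, 4):
    windows = ngram_windows(tokens, n)
    # A kernel of size n produces one feature per window.
    print(n, len(windows), windows[0])
```

A sentence of 7 tokens gives 6, 5 and 4 windows respectively, which is why feature maps from different kernel sizes have different lengths.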

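Network Architecture

01_net

The detailed TextCNN pipeline is illustrated below:

02_detail

Notes:

Embedding: the first layer is the 7 × 5 sentence matrix on the far left of the figure. Each row is a word vector of dimension 5, analogous to the raw pixels of an image.
Convolution: next comes a one-dimensional convolution layer with kernel_sizes=(2,3,4), where each kernel size has two output channels.
Max Pooling: the third layer is a 1-max pooling layer, so sentences of different lengths all end up with a fixed-length representation after pooling.
Fully Connected and Softmax: finally, a fully connected softmax layer outputs the probability of each class.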
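
The four stages can be traced in a few lines of NumPy. This is only a shape walk-through with random weights, using the 7 × 5 input, the kernel sizes (2, 3, 4) and the two channels per size from the figure; `num_classes = 2` is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, embed_dim = 7, 5          # the 7 x 5 sentence matrix from the figure
kernel_sizes = (2, 3, 4)
num_channels = 2                   # output channels per kernel size
num_classes = 2                    # assumed for illustration

sentence = rng.normal(size=(seq_len, embed_dim))

pooled = []
for k in kernel_sizes:
    # One filter bank per kernel size: (channels, k, embed_dim).
    filters = rng.normal(size=(num_channels, k, embed_dim))
    # Slide each filter along the word axis only -> (seq_len - k + 1, channels).
    feature_map = np.array([
        [np.sum(sentence[i:i + k] * f) for f in filters]
        for i in range(seq_len - k + 1)
    ])
    feature_map = np.maximum(feature_map, 0.0)   # ReLU
    pooled.append(feature_map.max(axis=0))       # 1-max pooling -> (channels,)

features = np.concatenate(pooled)                # 3 kernel sizes * 2 channels = 6 values

# Fully connected layer followed by a softmax.
W = rng.normal(size=(features.size, num_classes))
b = np.zeros(num_classes)
logits = features @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(features.shape, probs.shape)
```

Whatever the sentence length, the pooled feature vector has 6 entries, one per filter.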

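Embedding Layer

TextCNN can use pretrained word vectors as its embedding layer. Since every word in the dataset can be represented as a vector, we obtain an embedding matrix M in which each row is a word vector. M can be static, meaning it stays fixed during training, or non-static, meaning it is updated by backpropagation.

The paper presents several model variants, which differ mainly in how the embedding layer is handled.

CNN-rand
The baseline model: all words in the embedding layer are initialized randomly and then trained together with the rest of the model.
CNN-static
The embedding layer is initialized with pretrained word2vec vectors; words missing from the pretrained vocabulary are initialized randomly. The embedding layer is then kept fixed and only the other parameters of the network are trained.
CNN-non-static
Same as CNN-static, except that the embedding layer is trained together with the rest of the network.
CNN-multichannel
The embedding layer has two channels, one static and one non-static, both initialized with pretrained word2vec vectors. During fine-tuning, only the non-static channel has its parameters updated.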
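
The difference between a static and a non-static embedding comes down to whether a gradient step is allowed to modify the table. The NumPy sketch below illustrates this; the functions `lookup` and `sgd_step`, the random table standing in for pretrained vectors, and the all-ones gradient are hypothetical and only serve the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 4
M = rng.normal(size=(vocab_size, embed_dim))   # stands in for pretrained vectors

def lookup(embedding, token_ids):
    """Return one embedding row per token id."""
    return embedding[token_ids]

def sgd_step(embedding, token_ids, grad, lr, trainable):
    """Update the looked-up rows only when the table is trainable."""
    if not trainable:
        return embedding                       # static: the table is left untouched
    updated = embedding.copy()
    # ufunc.at accumulates correctly when the same id appears more than once.
    np.subtract.at(updated, token_ids, lr * grad)
    return updated

token_ids = np.array([1, 3, 3])
grad = np.ones((len(token_ids), embed_dim))    # hypothetical upstream gradient

static_M = sgd_step(M, token_ids, grad, lr=0.1, trainable=False)
non_static_M = sgd_step(M, token_ids, grad, lr=0.1, trainable=True)

print(lookup(M, token_ids).shape)              # (3, 4)
```

In a multichannel setup both tables start from the same values, and only the non-static copy drifts away from them during training.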

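Convolution Layer

Channels

In images, (R, G, B) can serve as the different channels;
For text, the input channels are usually different embeddings (for example word2vec or GloVe). In practice, static word vectors and fine-tuned word vectors are also used as separate channels.
One channel can also be the word sequence while another is the corresponding part-of-speech sequence. The channels can then be combined by summation or concatenation.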
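
As a small illustration of these options, the sketch below stacks two embeddings of the same sentence as channels and then combines them by summation and by concatenation. The two random matrices are placeholders for, say, a static and a fine-tuned embedding; the sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embed_dim = 7, 5
channel_a = rng.normal(size=(seq_len, embed_dim))   # e.g. static vectors
channel_b = rng.normal(size=(seq_len, embed_dim))   # e.g. fine-tuned vectors

# Stack as channels, like the colour planes of an image: (7, 5, 2).
stacked = np.stack([channel_a, channel_b], axis=-1)

# Combine by summation: the shape stays (7, 5).
summed = stacked.sum(axis=-1)

# Combine by concatenation along the embedding axis: (7, 10).
concatenated = np.concatenate([channel_a, channel_b], axis=-1)

print(stacked.shape, summed.shape, concatenated.shape)
```

Summation requires both channels to have the same embedding dimension, while concatenation does not.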

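One-Dimensional Convolution (conv-1d)

We can treat the sentence matrix produced by the embedding layer as an image and use a convolutional neural network to extract features from it. Because adjacent words in a sentence are usually strongly related, a one-dimensional convolution is appropriate: unlike image convolution, text convolution slides along only one direction of the text sequence (the vertical one). The kernel width is fixed to the word-vector dimension d, while the kernel height is a tunable hyperparameter.

Images are two-dimensional data;
Text is one-dimensional data, so TextCNN uses one-dimensional convolution. The convolution is one-dimensional at the word level: although text becomes two-dimensional once it is expressed as word vectors, convolving along the embedding dimension is not meaningful. A consequence of one-dimensional convolution is that filters with different kernel_size values are needed to obtain receptive fields of different widths.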
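
A minimal NumPy sketch of this operation follows. For a sentence of n words and a kernel of height h, a convolution without padding yields n - h + 1 values. The function name `conv1d_over_words` and the toy numbers are chosen for illustration only.

```python
import numpy as np

def conv1d_over_words(sentence, kernel):
    """Valid convolution along the word axis; the kernel spans the full embedding width."""
    n, d = sentence.shape
    h, kernel_width = kernel.shape
    assert kernel_width == d, "kernel width must equal the embedding dimension"
    return np.array([np.sum(sentence[i:i + h] * kernel) for i in range(n - h + 1)])

sentence = np.arange(12, dtype=float).reshape(4, 3)   # 4 words, d = 3
kernel = np.ones((2, 3))                               # height h = 2

feature_map = conv1d_over_words(sentence, kernel)
print(feature_map)   # [15. 33. 51.]
```

The kernel never moves sideways because it already covers all d embedding dimensions, so the only free direction is the word axis.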

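Pooling Layer

Kernels of different sizes produce feature maps of different sizes, so a pooling function is applied to each feature map to bring them to the same dimension.

Max Pooling

03_pooling_max
The most common choice is 1-max pooling, which takes the maximum value of each feature map and thereby captures its most salient feature. Each kernel thus contributes a single value; applying 1-max pooling to all kernels and concatenating the results gives the final feature vector.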
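
For instance, feature maps of lengths 6, 5 and 4 (as produced by kernels of height 2, 3 and 4 on a 7-word sentence) collapse to one value each. The numbers below are made up for the illustration.

```python
import numpy as np

# Feature maps of different lengths, one per kernel (values are made up).
feature_maps = [
    np.array([0.1, 0.9, 0.3, 0.0, 0.2, 0.4]),   # kernel height 2
    np.array([0.5, 0.2, 0.7, 0.1, 0.0]),        # kernel height 3
    np.array([0.3, 0.3, 0.8, 0.6]),             # kernel height 4
]

# 1-max pooling keeps a single value per feature map.
sentence_vector = np.array([fm.max() for fm in feature_maps])
print(sentence_vector)   # [0.9 0.7 0.8]
```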

K-Max Pooling

04_pooling_kmax

Take the Top-K highest-scoring feature values and concatenate them in their original order, which keeps more feature information for later stages.

See the paper "A Convolutional Neural Network for Modelling Sentences".
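
One possible way to write order-preserving k-max pooling in NumPy is sketched below. It is an illustration of the idea, not the implementation from the cited paper, and the function name `k_max_pooling` is an assumption.

```python
import numpy as np

def k_max_pooling(feature_map, k):
    """Keep the k largest values in their original left-to-right order."""
    feature_map = np.asarray(feature_map)
    k = min(k, feature_map.size)
    if k <= 0:
        return feature_map[:0]
    top_idx = np.argsort(feature_map, kind="stable")[-k:]   # indices of the k largest values
    return feature_map[np.sort(top_idx)]                    # restore the original order

print(k_max_pooling([0.1, 0.9, 0.3, 0.7, 0.2], k=3))   # [0.9 0.3 0.7]
```

Sorting the selected indices, rather than the values, is what keeps the relative positions of the features.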

Chunk-MaxPooling

05_pooling_dynamic

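Split the feature vector that a filter produces in the convolution layer into several segments and take the maximum feature value within each segment. For example, if a filter's feature vector is cut into 3 chunks, one maximum is taken from each chunk, giving 3 feature values. Because the chunks are formed first and the max is taken per chunk, coarse-grained position information is preserved; if a strong feature occurs several times, its intensity can be captured as well. There are different ways to choose the chunks: the number of segments can be fixed in advance, which is a static chunking scheme, or the chunk boundaries can be set dynamically depending on the input, which can be called dynamic Chunk-Max pooling. The paper "Local Translation Prediction with Global Sentence Representation" reports experiments in which static Chunk-Max pooling improves machine translation performance over max pooling over time.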
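
The static variant can be sketched with `np.array_split`, which cuts a feature map into a fixed number of nearly equal chunks. The function name `chunk_max_pooling` is an assumption, and the sketch presumes the feature map has at least as many values as there are chunks.

```python
import numpy as np

def chunk_max_pooling(feature_map, num_chunks):
    """Split into num_chunks contiguous segments and keep the max of each segment."""
    # Assumes len(feature_map) >= num_chunks, so that no chunk is empty.
    chunks = np.array_split(np.asarray(feature_map), num_chunks)
    return np.array([chunk.max() for chunk in chunks])

print(chunk_max_pooling([0.1, 0.9, 0.3, 0.7, 0.2, 0.4], num_chunks=3))   # [0.9 0.7 0.4]
```

With `num_chunks=1` this reduces to ordinary 1-max pooling.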

Dynamic Pooling

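If a trigger word is encountered during convolution, it can be marked, for example with a different color, and max pooling then splits the chunks according to these marks. The paper "Event Extraction via Dynamic Multi-Pooling Convolutional Neural Networks" proposes such a variant of chunk pooling, following the dynamic Chunk-Max pooling idea, and reports improved performance in its experiments.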
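
A rough sketch of the dynamic idea: the chunk boundaries come from the input (here a hypothetical trigger position) instead of being fixed in advance. This only illustrates the splitting step and is not the exact procedure of the cited paper; the function name and the convention of returning 0.0 for an empty chunk are assumptions.

```python
import numpy as np

def dynamic_chunk_max_pooling(feature_map, boundaries):
    """Split a feature map at input-dependent indices and take the max of each piece."""
    pieces = np.split(np.asarray(feature_map, dtype=float), sorted(boundaries))
    # An empty piece (boundary at the very start or end) contributes 0.0,
    # so the output always has len(boundaries) + 1 values.
    return [float(p.max()) if p.size else 0.0 for p in pieces]

feature_map = [0.1, 0.9, 0.3, 0.7, 0.2, 0.4]
trigger_pos = 3   # hypothetical position of the trigger word
print(dynamic_chunk_max_pooling(feature_map, [trigger_pos]))   # [0.9, 0.7]
```

Different sentences can place the boundary at different positions while the pooled output keeps the same length.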

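Code Implementation

Code Structure

data_tools.py: data preprocessing
modules.py: network structure of the model
run.py: runner script and main entry point

Data Preprocessing

The dataset is the Fudan University Chinese text classification corpus, which contains 9,833 documents in 19 categories. The training and test corpora are split at a ratio of roughly 1:1. For a detailed walkthrough of the code, see my other blog post on preprocessing Chinese corpora.

Chinese Dataset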

import os
import re
import jieba
from collections import Counter
from sklearn.utils import shuffle

def is_chinese(char):
    """
    Check whether a character counts as Chinese text.
    Besides CJK ideographs, full-width digits and a few opening punctuation marks are accepted.
    :param char: str
    :return: bool, True if the character is accepted, otherwise False
    """
    punctuation = [u'\u300a', u'\u2014', u'\u201c']  # ['《', '—', '“']
    return (u'\u4e00' <= char <= u'\u9fff'
            or u'\uff10' <= char <= u'\uff19'
            or char in punctuation)

def get_listdir(path):
    """
    List the names of all files and folders under a path.
    :param path: str
    :return: list
    """
    return os.listdir(path)

def get_listpath(path):
    """
    List the full paths of all files and folders under a path.
    :param path: str
    :return: list
    """
    dirname = os.listdir(path)
    path_ = [os.path.join(path, x) for x in dirname]
    return path_

def parser_file2para(file):
    """
    Parse the file into a list of paragraphs.
    :param file: str, file name
    :return: list, paragraphs
    """
    start = ('    ', '　　', '　  ', ' 　 ', '  　')
    text = ""
    s = False
    paras = []
    with open(file) as f:
        # Process the file line by line
        for line in f:
            if line.startswith(start) and len(line) > 4 and is_chinese(line[4]):
                # An indented line that starts with Chinese text opens a new paragraph
                if text != "":
                    text = re.sub('（.*?）', '', text)
                    if len(text) > 10:  # paragraphs of 10 characters or fewer are discarded
                        paras.append(text.replace(' ', ''))
                    text = ""
                text += line.lstrip().rstrip('\n')
                s = True
                continue
            if line.startswith(start) and len(line) > 4 and not is_chinese(line[4]):
                # An indented line that does not start with Chinese text closes the current paragraph
                if text != "":
                    text = re.sub('（.*?）', '', text)
                    if len(text) > 10:
                        paras.append(text.replace(' ', ''))
                    text = ""
                s = False
                continue
            if s and line[0] != ' ':
                # Continuation line of the current paragraph
                text += line.rstrip('\n')
                continue
        # Handle the last paragraph
        if text != "":
            text = re.sub('（.*?）', '', text)
            if len(text) > 10:
                paras.append(text.replace(' ', ''))
    return paras

def parser_file2sentence(file, low, high):
    """
    Parse the file into a list of sentences.
    :param file: str, file name
    :param low: lower bound on sentence length
    :param high: upper bound on sentence length
    :return: list, sentences
    """
    sentences = []
    paras = parser_file2para(file)
    for para in paras:
        sentence = re.split('(。”|！”|？”|。）|。|！|？)', para)
        sentence = [x.strip() for x in sentence]
        if len(sentence) == 1 and low <= len(sentence[0]) <= high:
            sentences.extend(sentence)
            continue
        sentence = ["".join(i) for i in zip(sentence[0::2], sentence[1::2])]
        sentence = [x for x in sentence if low <= len(x) <= high and x[0] != '（']
        sentences.extend(sentence)
    return sentences

def text_save(dirpath, filename, data):
    """
    Append the items of data to the file filename, one item per line.
    :param dirpath: str, directory path
    :param filename: str, file name
    :param data: list, data to write
    :return: None
    """
    if not os.path.exists(dirpath):
        os.makedirs(dirpath)
        print(dirpath, "directory created")
    filepath = os.path.join(dirpath, filename)
    # Append mode: calling the function again adds to an existing file
    with open(filepath, 'a') as file:
        for line in data:
            file.write(line + '\n')  # terminate every item with a newline
    print("File saved")

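def count_word_sentence(filepath):
    """
    Count the sentences and words in a file that holds one sentence per line.
    :param filepath: str, file path
    :return: int, word count, sentence count, average sentence length, maximum sentence length, minimum sentence length
    """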
    num_word, num_sent = 0, 0
    max_sent = -float('inf')
    min_sent = float('inf')
    with open(filepath, 'r') as f:
        for line in f:
            l = len(line)
            max_sent = max(max_sent, l)
            min_sent = min(min_sent, l)
            num_word += l
            num_sent += 1
    if num_sent == 0:
        ave_sent = 0
    else:
        ave_sent = num_word // num_sent
    return num_word, num_sent, ave_sent, max_sent, min_sent

def cut_words(rawpath, datapath):
    """
    Segment the files under rawpath into words and save them under datapath, with words separated by spaces.
    If datapath already exists, the segmented files are read back instead.
    :param rawpath: str, path of the raw data
    :param datapath: str, path of the processed data
    :return: list, all words
    """
    words = []
    if not os.path.exists(datapath):
        for dirpath, dirnames, filenames in os.walk(rawpath):
            if not filenames:  # `filenames is []` is always False, so test for emptiness instead
                continue
            filepaths = [os.path.join(dirpath, filename) for filename in filenames]
            for filepath, filename in zip(filepaths, filenames):
                text = []
                with open(filepath, 'r') as f:
                    for line in f:
                        # Strip the line ending first so that no newline token is produced
                        sentence = jieba.lcut(line.rstrip('\r\n'))
                        words.extend(sentence)
                        sentence = ' '.join(sentence)
                        text.append(sentence)
                data_path = os.path.join(datapath, os.path.basename(dirpath))
                text_save(data_path, filename, text)
        return words
    else:
        for dirpath, dirnames, filenames in os.walk(datapath):
            if not filenames:
                continue
            filepaths = [os.path.join(dirpath, filename) for filename in filenames]
            for filepath, filename in zip(filepaths, filenames):
                with open(filepath, 'r') as f:
                    for line in f:
                        # Whitespace split keeps the last word of the line and drops the newline
                        sentence = line.split()
                        words.extend(sentence)
        return words

def read_dataset(rawpath, datapath, vocab_size):
    """
    Read the dataset and build the training, validation and test sets.
    :param rawpath: str, path of the raw data
    :param datapath: str, path of the processed data
    :param vocab_size: int, vocabulary size
    :return: word2index,  mapping from word to index
             label2index, mapping from label to index
             x_train_,    training inputs
             y_train,     training labels
             x_dev_,      validation inputs
             y_dev,       validation labels
             x_test_,     test inputs
             y_test,      test labels
    """
    if not os.path.exists(datapath):
        cut_words(rawpath, datapath)
    x_train, y_train = [], []
    x_test, y_test = [], []
    words = []
    for dirpath, dirnames, filenames in os.walk(datapath):
        if not filenames:  # `filenames is []` is always False, so test for emptiness instead
            continue
        filepaths = [os.path.join(dirpath, filename) for filename in filenames]
        for filepath, filename in zip(filepaths, filenames):
            tag = filename.split('-')[0]  # the label is the file-name prefix before the first '-'
            with open(filepath, 'r') as f:
                for line in f:
                    # Whitespace split keeps the last word of the line and drops the newline
                    sentence = line.split()
                    if dirpath == os.path.join(datapath, 'train'):
                        x_train.append(sentence)
                        y_train.append(tag)
                    elif dirpath == os.path.join(datapath, 'test'):
                        x_test.append(sentence)
                        y_test.append(tag)
                    else:
                        pass
                    words.extend(sentence)
    # Build the training, validation and test sets (half of the test data becomes the validation set)
    x_train, y_train = shuffle(x_train, y_train)
    x_test, y_test = shuffle(x_test, y_test)
    cut = len(y_test) // 2
    x_dev, y_dev = x_test[:cut], y_test[:cut]
    x_test, y_test = x_test[cut:], y_test[cut:]
    # Build the vocabulary mapping; indices 0 and 1 are reserved for PAD and UNK
    vocab = Counter(words).most_common(vocab_size)
    word2index = {k_v[0]: i+2 for i, k_v in enumerate(vocab)}
    word2index.update({"PAD": 0, "UNK": 1})
    # Build the label mapping
    label = sorted(list(set(y_train + y_dev + y_test)))
    label2index = {tag: i for i, tag in enumerate(label)}
    # Encode the training, validation and test sets, padding every sentence to the maximum length
    max_sent_len = max([len(sent) for sent in x_train + x_dev + x_test])
    x_train_ = [[word2index.get(w, 1) for w in sent] + [0] * (max_sent_len - len(sent)) for sent in x_train]
    x_dev_ = [[word2index.get(w, 1) for w in sent] + [0] * (max_sent_len - len(sent)) for sent in x_dev]
    x_test_ = [[word2index.get(w, 1) for w in sent] + [0] * (max_sent_len - len(sent)) for sent in x_test]
    y_train = [label2index[tag] for tag in y_train]
    y_dev = [label2index[tag] for tag in y_dev]
    y_test = [label2index[tag] for tag in y_test]

    return word2index, label2index, x_train_, y_train, x_dev_, y_dev, x_test_, y_test
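
The vocabulary and padding logic used in read_dataset can be tried in isolation. The sketch below follows the same conventions (PAD is 0, UNK is 1, real words start at 2); the helper names `build_vocab` and `encode`, the toy sentences, and the truncation to `max_len` are assumptions added for the example.

```python
from collections import Counter

def build_vocab(sentences, vocab_size):
    """Map the vocab_size most frequent words to indices starting at 2."""
    counts = Counter(w for sent in sentences for w in sent)
    word2index = {"PAD": 0, "UNK": 1}
    for i, (word, _) in enumerate(counts.most_common(vocab_size)):
        word2index[word] = i + 2
    return word2index

def encode(sentences, word2index, max_len):
    """Replace words by indices, map unknown words to UNK and pad with PAD."""
    encoded = []
    for sent in sentences:
        # Truncation to max_len is an addition of this sketch.
        ids = [word2index.get(w, word2index["UNK"]) for w in sent[:max_len]]
        encoded.append(ids + [word2index["PAD"]] * (max_len - len(ids)))
    return encoded

sentences = [["a", "b", "a"], ["c"]]             # toy tokenized sentences
word2index = build_vocab(sentences, vocab_size=2)
print(encode(sentences, word2index, max_len=3))  # [[2, 3, 2], [1, 0, 0]]
```

Because the vocabulary is capped at two words, "c" falls outside it and is encoded as UNK.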

The preprocessing code for the English datasets that accompany this code is included below.
English Datasets

from sklearn.utils import shuffle

from collections import Counter
import pickle

def read_TREC():
    data = {}

    def read(mode):
        x, y = [], []

        with open("data/TREC/TREC_" + mode + ".txt", "r", encoding="utf-8") as f:
            for line in f:
                if line[-1] == "\n":
                    line = line[:-1]
                y.append(line.split()[0].split(":")[0])
                x.append(line.split()[1:])

        x, y = shuffle(x, y)

        if mode == "train":
            dev_idx = len(x) // 10
            data["dev_x"], data["dev_y"] = x[:dev_idx], y[:dev_idx]
            data["train_x"], data["train_y"] = x[dev_idx:], y[dev_idx:]
        else:
            data["test_x"], data["test_y"] = x, y

    read("train")
    read("test")

    return data


def read_MR():
    data = {}
    x, y = [], []

    with open("data/MR/rt-polarity.pos", "r", encoding="utf-8") as f:
        for line in f:
            if line[-1] == "\n":
                line = line[:-1]
            x.append(line.split())
            y.append(1)

    with open("data/MR/rt-polarity.neg", "r", encoding="utf-8") as f:
        for line in f:
            if line[-1] == "\n":
                line = line[:-1]
            x.append(line.split())
            y.append(0)

    x, y = shuffle(x, y)
    dev_idx = len(x) // 10 * 8
    test_idx = len(x) // 10 * 9

    data["train_x"], data["train_y"] = x[:dev_idx], y[:dev_idx]
    data["dev_x"], data["dev_y"] = x[dev_idx:test_idx], y[dev_idx:test_idx]
    data["test_x"], data["test_y"] = x[test_idx:], y[test_idx:]

    data["vocab"] = sorted(list(set([w for sent in data["train_x"] + data["dev_x"] + data["test_x"] for w in sent])))
    # data["vocab"] = Counter([w for sent in data["train_x"] + data["dev_x"] + data["test_x"] for w in sent]).most_common(5000)
    # data["vocab"] = [x for x, y in data["vocab"]]
    data["classes"] = sorted(list(set(data["train_y"])))
    data["word_to_idx"] = {w: i+2 for i, w in enumerate(data["vocab"])}
    data["word_to_idx"].update({"PAD": 0, "UNK": 1})
    data["label_to_idx"] = {l: i for i, l in enumerate(data["classes"])}  # label indices start at 0; only the word indices reserve 0 and 1
    # data["idx_to_word"] = {i: w for i, w in enumerate(data["vocab"])}


    max_sent_len = max([len(sent) for sent in data["train_x"] + data["dev_x"] + data["test_x"]])
    data["train_x_"] = [[data["word_to_idx"][w] for w in sent] + [0] * (max_sent_len - len(sent)) for sent in data["train_x"]]
    data["dev_x_"] = [[data["word_to_idx"][w] for w in sent] + [0] * (max_sent_len - len(sent)) for sent in data["dev_x"]]
    data["test_x_"] = [[data["word_to_idx"][w] for w in sent] + [0] * (max_sent_len - len(sent)) for sent in data["test_x"]]

    return data

Building the Model

from __future__ import print_function
import tensorflow as tf

class TextCNN:
    def __init__(self, config):
        """init all hyperparameter here"""
        # set hyperparameters
        self.num_epochs = config['n_epochs']
        self.num_classes = config['num_classes']
        self.batch_size = config['batch_size']
        self.sequence_length = config['sequence_length']
        self.vocab_size = config['vocab_size']
        self.embed_size = config['embed_size']
        self.learning_rate = tf.Variable(config['learning_rate'], trainable=False, name="learning_rate") #ADD learning_rate
        self.learning_rate_decay_half_op = tf.assign(self.learning_rate, self.learning_rate * config['decay_rate_big'])
        self.filter_sizes = config['filter_sizes'] # it is a list of int. e.g. [3,4,5]
        self.num_filters = config['num_filters']
        self.initializer = config['initializer']
        self.num_filters_total = self.num_filters * len(self.filter_sizes) #how many filters totally.
        self.use_mulitple_layer_cnn = config['use_mulitple_layer_cnn']
        self.multi_label_flag = config['multi_label_flag']
        self.clip_gradients = config['clip_gradients']
        self.use_embedding = config['use_embedding']
        self.ckpt_dir = config["ckpt_dir"]

        # add placeholder (X,label)
        with tf.name_scope('placeholder'):
            self.is_training_flag = tf.placeholder(tf.bool, name="is_training_flag")
            self.input_x = tf.placeholder(tf.int32, shape=[None, self.sequence_length], name="input_x")  # X
            self.input_y = tf.placeholder(tf.int32, shape=[None, ], name="input_y")  # y: [None], one class index per example
            self.input_y_multilabel = tf.placeholder(tf.float32, shape=[None, self.num_classes], name="input_y_multilabel")  # y:[None,num_classes]. this is for multi-label classification only.
            self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
            self.iter = tf.placeholder(tf.int32) # training iteration

        self.global_step = tf.Variable(0, trainable=False, name="Global_Step")
        self.epoch_step = tf.Variable(0, trainable=False, name="Epoch_Step")
        self.epoch_increment = tf.assign(self.epoch_step, tf.add(self.epoch_step, tf.constant(1)))
        self.decay_steps, self.decay_rate = config['decay_steps'], config['decay_rate']

        self.instantiate_weights()
        # Build the model
        word_embedded = self.word2vec()  # [None, sentence_length, embed_size, 1]
        if self.use_mulitple_layer_cnn: # this may take 50G memory.
            print("use multiple layer CNN")
            h = self.cnn_multiple_layers(word_embedded)
        else: # this take small memory, less than 2G memory.
            print("use single layer CNN")
            h = self.cnn_single_layer(word_embedded)
        self.logits = self.classifer(h)  # [None, self.num_classes]
        self.possibility = tf.nn.sigmoid(self.logits)

        # Compute the loss
        if self.multi_label_flag:
            print("going to use multi label loss.")
            self.loss_val = self.loss_multilabel()
        else:
            print("going to use single label loss.")
            self.loss_val = self.loss()
        self.train_op = self.train()

        # Model evaluation
        if not self.multi_label_flag:
            self.predictions = tf.argmax(self.logits, 1, name="predictions")  # shape:[None,]
            print("self.predictions:", self.predictions)
            correct_prediction = tf.equal(tf.cast(self.predictions, tf.int32), self.input_y) #tf.argmax(self.logits, 1)-->[batch_size]
            self.accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name="Accuracy") # shape=()

    def instantiate_weights(self):
        """define all weights here"""
        with tf.name_scope("embedding"): # embedding matrix
            self.Embedding = tf.get_variable("Embedding", shape=[self.vocab_size, self.embed_size], initializer=self.initializer) #[vocab_size,embed_size] tf.random_uniform([self.vocab_size, self.embed_size],-1.0,1.0)
            self.W_projection = tf.get_variable("W_projection", shape=[self.num_filters_total, self.num_classes], initializer=self.initializer) #[embed_size,label_size]
            self.b_projection = tf.get_variable("b_projection", shape=[self.num_classes])       #[label_size] #ADD 2017.06.09

    def word2vec(self):
        with tf.name_scope("embedding"):
            embedded_words = tf.nn.embedding_lookup(self.Embedding, self.input_x)
            sentence_embeddings_expanded = tf.expand_dims(embedded_words, -1)
        return sentence_embeddings_expanded

    def cnn_single_layer(self, sentence_embeddings_expanded):
        pooled_outputs = []
        for i, filter_size in enumerate(self.filter_sizes):
            # with tf.name_scope("convolution-pooling-%s" %filter_size):
            with tf.variable_scope("convolution-pooling-%s" % filter_size):
                # ====>a.create filter
                filter = tf.get_variable("filter-%s" % filter_size, [filter_size, self.embed_size, 1, self.num_filters], initializer=self.initializer)
                # ====>b.conv operation: conv2d===>computes a 2-D convolution given 4-D `input` and `filter` tensors.
                conv = tf.nn.conv2d(sentence_embeddings_expanded, filter, strides=[1, 1, 1, 1], padding="VALID", name="conv")  # shape:[batch_size,sequence_length - filter_size + 1,1, num_filters]
                conv_bn = tf.contrib.layers.batch_norm(conv, is_training=self.is_training_flag, scope='cnn_bn_')

                # ====>c. apply nolinearity
                b = tf.get_variable("b-%s" % filter_size, [self.num_filters])  # ADD 2017-06-09
                h = tf.nn.relu(tf.nn.bias_add(conv_bn, b), "relu")  # shape:[batch_size, sequence_length-filter_size+1, 1, num_filters]. tf.nn.bias_add:adds `bias` to `value`
                # ====>d. max-pooling.  value: A 4-D `Tensor` with shape `[batch, height, width, channels]
                pooled = tf.nn.max_pool(h, ksize=[1, self.sequence_length - filter_size + 1, 1, 1], strides=[1, 1, 1, 1], padding='VALID', name="pool")  # shape:[batch_size, 1, 1, num_filters]
                pooled_outputs.append(pooled)
        # 3.=====>combine all pooled features, and flatten the feature.output' shape is a [1,None]
        self.h_pool = tf.concat(pooled_outputs, 3)  # shape:[batch_size, 1, 1, num_filters_total]. tf.concat=>concatenates tensors along one dimension.where num_filters_total=num_filters_1+num_filters_2+num_filters_3
        self.h_pool_flat = tf.reshape(self.h_pool, [-1, self.num_filters_total])  # shape should be:[None,num_filters_total]. here this operation has some result as tf.sequeeze().e.g. x's shape:[3,3];tf.reshape(-1,x) & (3, 3)---->(1,9)

        # 4.=====>add dropout: use tf.nn.dropout
        with tf.name_scope("dropout"):
            self.h_drop = tf.nn.dropout(self.h_pool_flat, keep_prob=self.dropout_keep_prob)  # [None, num_filters_total]
        h_ = tf.layers.dense(self.h_drop, self.num_filters_total, activation=tf.nn.tanh, use_bias=True)
        return h_

    def cnn_multiple_layers(self, sentence_embeddings_expanded):
        # 2.=====>loop each filter size. for each filter, do:convolution-pooling layer(a.create filters,b.conv,c.apply nolinearity,d.max-pooling)--->
        # you can use:tf.nn.conv2d;tf.nn.relu;tf.nn.max_pool; feature shape is 4-d. feature is a new variable
        pooled_outputs = []
        print("sentence_embeddings_expanded:", sentence_embeddings_expanded)  # use the argument; no attribute of this name is set on self
        for i, filter_size in enumerate(self.filter_sizes):
            with tf.variable_scope('cnn_multiple_layers' + "convolution-pooling-%s" % filter_size):
                # 1) CNN->BN->relu
                filter = tf.get_variable("filter-%s" % filter_size,[filter_size, self.embed_size, 1, self.num_filters],initializer=self.initializer)
                conv1 = tf.nn.conv2d(sentence_embeddings_expanded, filter, strides=[1, 1, 1, 1], padding="SAME", name="conv")  # shape:[batch_size,sequence_length - filter_size + 1, 1, num_filters]
                conv1_bn = tf.contrib.layers.batch_norm(conv1, is_training=self.is_training_flag, scope='cnn1')
                print(i, "conv1:", conv1_bn)
                b = tf.get_variable("b-%s" % filter_size, [self.num_filters])  # ADD 2017-06-09
                h = tf.nn.relu(tf.nn.bias_add(conv1_bn, b), "relu")  # shape:[batch_size,sequence_length,1,num_filters]. tf.nn.bias_add:adds `bias` to `value`

                # 2) CNN->BN->relu
                h = tf.reshape(h, [-1, self.sequence_length, self.num_filters, 1])  # shape:[batch_size,sequence_length,num_filters,1]
                # Layer2:CONV-RELU
                filter2 = tf.get_variable("filter2-%s" % filter_size, [filter_size, self.num_filters, 1, self.num_filters], initializer=self.initializer)
                conv2 = tf.nn.conv2d(h, filter2, strides=[1, 1, 1, 1], padding="SAME", name="conv2")  # shape:[batch_size,sequence_length-filter_size*2+2,1,num_filters]
                conv2_bn = tf.contrib.layers.batch_norm(conv2, is_training=self.is_training_flag, scope='cnn2')
                print(i, "conv2:", conv2)
                b2 = tf.get_variable("b2-%s" % filter_size, [self.num_filters])  # ADD 2017-06-09
                h2 = tf.nn.relu(tf.nn.bias_add(conv2_bn, b2), "relu2")  # shape:[batch_size,sequence_length,1,num_filters]. tf.nn.bias_add:adds `bias` to `value`

                # 3. Max-pooling
                pooling_max = tf.squeeze(tf.nn.max_pool(h2, ksize=[1, self.sequence_length, 1, 1], strides=[1, 1, 1, 1], padding='VALID', name="pool"))
                # pooling_avg=tf.squeeze(tf.reduce_mean(h,axis=1)) #[batch_size, num_filters]
                print(i, "pooling:", pooling_max)
                # pooling=tf.concat([pooling_max,pooling_avg],axis=1) #[batch_size,num_filters*2]
                pooled_outputs.append(pooling_max)  # h:[batch_size,sequence_length,1,num_filters]
        # concat
        h_pool = tf.concat(pooled_outputs, axis=1)  # [batch_size,num_filters*len(self.filter_sizes)]
        print("h.concat:", h_pool)

        with tf.name_scope("dropout"):
            h_ = tf.nn.dropout(h_pool, keep_prob=self.dropout_keep_prob)  # [batch_size,sequence_length - filter_size + 1,num_filters]
        return h_  # [batch_size,sequence_length - filter_size + 1,num_filters]

    def classifer(self, h):
        with tf.name_scope("output"):
            logits = tf.matmul(h, self.W_projection) + self.b_projection
            return logits

    def loss_multilabel(self, l2_lambda=0.0001): #0.0001#this loss function is for multi-label classification
        with tf.name_scope("loss"):
            losses = tf.nn.sigmoid_cross_entropy_with_logits(labels=self.input_y_multilabel, logits=self.logits)
            losses = tf.reduce_sum(losses, axis=1)  # shape=(?,). loss for all data in the batch
            loss = tf.reduce_mean(losses)           # shape=().   average loss in the batch
            l2_losses = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables() if 'bias' not in v.name]) * l2_lambda
            loss += l2_losses
        return loss

    def loss(self, l2_lambda=0.0001):# 0.001
        with tf.name_scope("loss"):
            losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=self.input_y, logits=self.logits)
            loss = tf.reduce_mean(losses) # print("2.loss.loss:", loss) #shape=()
            l2_losses = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables() if 'bias' not in v.name]) * l2_lambda
            loss += l2_losses
        return loss

    def train(self):
        """Based on the loss, update the parameters with Adam, using a decayed learning rate and gradient clipping."""
        learning_rate = tf.train.exponential_decay(self.learning_rate, self.global_step, self.decay_steps, self.decay_rate, staircase=True)
        self.learning_rate_ = learning_rate
        optimizer = tf.train.AdamOptimizer(learning_rate)
        gradients, variables = zip(*optimizer.compute_gradients(self.loss_val))
        gradients, _ = tf.clip_by_global_norm(gradients, self.clip_gradients)
        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)  # batch-norm moving averages are updated together with the train step
        with tf.control_dependencies(update_ops):
            # Passing global_step lets the optimizer increment it, which exponential_decay relies on
            train_op = optimizer.apply_gradients(zip(gradients, variables), global_step=self.global_step)
        return train_op
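
The two numerical ingredients of train(), the staircase learning-rate schedule and clipping by global norm, can be written out in plain Python to see what they compute. This is a sketch of the formulas only, with hypothetical function names and gradients given as plain lists; it does not replace the TensorFlow ops.

```python
import math

def staircase_decay(base_lr, global_step, decay_steps, decay_rate):
    """Learning rate after global_step steps: multiplied by decay_rate once every decay_steps steps."""
    return base_lr * decay_rate ** (global_step // decay_steps)

def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients together so that their joint L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], global_norm

print(staircase_decay(1e-3, 2500, 1000, 0.5))   # two decays applied: 0.00025

clipped, norm = clip_by_global_norm([[3.0, 4.0], [0.0, 0.0]], clip_norm=2.5)
print(norm, clipped)   # 5.0 [[1.5, 2.0], [0.0, 0.0]]
```

With a decay_rate of 1.0 the schedule leaves the learning rate unchanged, and gradients whose global norm is already below clip_norm pass through unscaled.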

Main Program Entry Point

import tensorflow as tf
import numpy as np
from modules import TextCNN
import os
from data_tools import read_dataset
from numba import jit

#configuration
configuration = {
    'vocab_size': 5000,  # number of vocabulary
    'embed_size': 300,  # dimension of word embedding

    'sequence_length': 200,  # max length of sentence
    'num_classes': 2,  # number of labels

    'filter_sizes': [3, 4, 5],  # size of convolution kernel
    'num_filters': 128,  # number of convolution kernel

    'learning_rate': 1e-3,  # learning rate
    'decay_rate': 1.0,  # learning rate decay
    'decay_rate_big': 0.50,
    'decay_steps': 1000,
    'clip_gradients': 5.0,
    'dropout_rate': 0.5,  # dropout

    'n_epochs': 10,  # epochs
    'batch_size': 64,  # batch_size
    'initializer': tf.random_normal_initializer(stddev=0.1),
    'use_mulitple_layer_cnn': False,
    'multi_label_flag': False,
    'use_embedding': False,

    "ckpt_dir": "text_cnn_checkpoint/"

}

#1.load data(X:list of int,y:int). 2.create session. 3.feed data. 4.training (5.validation) ,(6.prediction)
def main(_):
    
    word2index, label2index, trainX, trainY, vaildX, vaildY, testX, testY = read_dataset('Chinese_Data_2', 'process_2', 30000)
    configuration['vocab_size'] = len(word2index)
    print("cnn_model.vocab_size:", configuration['vocab_size'])
    configuration['num_classes'] = len(label2index)
    print("num_classes:", configuration['num_classes'])
    num_examples = len(trainX)
    print("num_examples of training:", num_examples, ";sentence_len:", len(trainX[0]))

    configuration['sequence_length'] = len(trainX[0])
    #2.create session.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    with tf.Session(config=config) as sess:
        #Instantiate Model
        textCNN = TextCNN(configuration)
        #Initialize Save
        saver = tf.train.Saver()
        if os.path.exists(textCNN.ckpt_dir+"checkpoint"):
            print("Restoring Variables from Checkpoint.")
            saver.restore(sess, tf.train.latest_checkpoint(textCNN.ckpt_dir))
            #for i in range(3): #decay learning rate if necessary.
            #    print(i,"Going to decay learning rate by half.")
            #    sess.run(textCNN.learning_rate_decay_half_op)
        else:
            print('Initializing Variables')
            sess.run(tf.global_variables_initializer())
            if textCNN.use_embedding: #load pre-trained word embedding
                index2word = {v: k for k, v in word2index.items()}
                assign_pretrained_word_embedding(sess, index2word, textCNN, "GoogleNews-vectors-negative300.bin")
        curr_epoch = sess.run(textCNN.epoch_step)
        #3.feed data & training
        number_of_training_data = len(trainX)
        batch_size = textCNN.batch_size
        iteration = 0
        for epoch in range(curr_epoch, textCNN.num_epochs):
            loss, counter = 0.0, 0
            for start, end in zip(range(0, number_of_training_data, batch_size),\
                                  range(batch_size, number_of_training_data, batch_size)):
                iteration = iteration+1
                if epoch == 0 and counter == 0:
                    print("trainX[start:end]:", trainX[start: end])
                feed_dict = {textCNN.input_x: trainX[start:end], textCNN.dropout_keep_prob: 0.8, textCNN.is_training_flag: True}
                if not textCNN.multi_label_flag:
                    feed_dict[textCNN.input_y] = trainY[start: end]
                else:
                    feed_dict[textCNN.input_y_multilabel] = trainY[start: end]
                curr_loss, lr, _ = sess.run([textCNN.loss_val, textCNN.learning_rate, textCNN.train_op], feed_dict)
                loss, counter = loss + curr_loss, counter + 1
                if counter % 50 == 0:
                    print("Epoch %d\tBatch %d\tTrain Loss:%.3f\tLearning rate:%.5f" %(epoch, counter, loss/float(counter), lr))

            #epoch increment
            print("going to increment epoch counter....")
            sess.run(textCNN.epoch_increment)

            # 4.validation
            if epoch % 1 == 0:
                eval_loss, f1_score, f1_micro, f1_macro = do_eval(sess, textCNN, vaildX, vaildY)
                print("Epoch %d Validation Loss:%.3f\tF1 Score:%.3f\tF1_micro:%.3f\tF1_macro:%.3f" % (epoch, eval_loss, f1_score, f1_micro, f1_macro))
                #save model to checkpoint
                save_path = textCNN.ckpt_dir+"model.ckpt"
                print("Going to save model..")
                saver.save(sess, save_path, global_step=epoch)

        # 5. Finally, evaluate on the test set and report the test metrics
        test_loss, f1_score, f1_micro, f1_macro = do_eval(sess, textCNN, testX, testY)
        print("Test Loss:%.3f\tF1 Score:%.3f\tF1_micro:%.3f\tF1_macro:%.3f" % (test_loss,f1_score,f1_micro,f1_macro))
    pass


# Evaluate on a dataset and report the loss and F1 scores
def do_eval(sess, textCNN, evalX, evalY):
    number_examples = len(evalX)
    eval_loss, eval_counter = 0.0, 0
    batch_size = 1  # one example per step, so logits[0] below is the only row
    predict = []

    for start, end in zip(range(0, number_examples, batch_size), range(batch_size, number_examples + batch_size, batch_size)):
        ''' evaluation in one batch '''
        if textCNN.multi_label_flag:
            feed_dict = {textCNN.input_x: evalX[start:end],
                         textCNN.input_y_multilabel: evalY[start:end],
                         textCNN.dropout_keep_prob: 1.0,
                         textCNN.is_training_flag: False}
        else:
            feed_dict = {textCNN.input_x: evalX[start:end],
                         textCNN.input_y: evalY[start:end],
                         textCNN.dropout_keep_prob: 1.0,
                         textCNN.is_training_flag: False}
        current_eval_loss, logits = sess.run(
            [textCNN.loss_val, textCNN.logits], feed_dict)
        if not textCNN.multi_label_flag:
            # batch_size is 1, so logits[0] holds the class scores of the single example
            predict.append(int(np.argmax(logits[0])))
        eval_loss += current_eval_loss
        eval_counter += 1

    # if textCNN.multi_label_flag:
    #     evalY = [np.argmax(ii) for ii in evalY]
    # else: pass

    # if not textCNN.multi_label_flag:
    #     predict = [int(ii > 0.5) for ii in predict]
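    # NOTE: predictions are only collected in the single-label branch above, so this
    # call expects textCNN.multi_label_flag to be False and evalY to hold integer class ids.
    _, _, f1_macro, f1_micro, _ = fastF1(evalY, predict, textCNN.num_classes)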
    f1_score = (f1_micro + f1_macro) / 2.0
    return eval_loss / float(eval_counter), f1_score, f1_micro, f1_macro

# @jit
def fastF1(result: list, predict: list, num_classes: int):
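    ''' Precision, recall and micro/macro F1 for single-label predictions.

    result holds the true class ids and predict the predicted class ids, in the same order.
    '''
    true_total, r_total, p_total, p, r = 0, 0, 0, 0, 0
    total_list = []  # per-class [precision, recall, f1]
    for trueValue in range(num_classes):
        # trueNum: true positives; recallNum: examples whose true label is this class;
        # precisionNum: examples predicted as this class
        trueNum, recallNum, precisionNum = 0, 0, 0
        for index, values in enumerate(result):
            if values == trueValue:
                recallNum += 1
                if values == predict[index]:
                    trueNum += 1
            if predict[index] == trueValue:
                precisionNum += 1
        R = trueNum / recallNum if recallNum else 0
        P = trueNum / precisionNum if precisionNum else 0
        true_total += trueNum
        r_total += recallNum
        p_total += precisionNum
        p += P
        r += R
        f1 = (2 * P * R) / (P + R) if (P + R) else 0
        total_list.append([P, R, f1])
    # macro precision/recall: unweighted mean over classes
    p, r = np.array([p, r]) / num_classes
    # micro precision/recall: computed from counts pooled over all classes
    micro_r, micro_p = true_total / np.array([r_total, p_total])
    # macro F1 here is the harmonic mean of macro precision and macro recall,
    # not the mean of the per-class F1 values stored in total_list
    macro_f1 = (2 * p * r) / (p + r) if (p + r) else 0
    micro_f1 = (2 * micro_p * micro_r) / (micro_p + micro_r) if (micro_p + micro_r) else 0
    accuracy = true_total / len(result)
    print('P: {:.2f}%, R: {:.2f}%, Micro_f1: {:.2f}%, Macro_f1: {:.2f}%, Accuracy: {:.2f}%'.format(
        p * 100, r * 100, micro_f1 * 100, macro_f1 * 100, accuracy * 100))
    return p, r, macro_f1, micro_f1, total_list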

def assign_pretrained_word_embedding(sess, index2word, textCNN, word2vec_model_path):
    from gensim.models.keyedvectors import KeyedVectors
    print("using pre-trained word embedding. started. word2vec_model_path:", word2vec_model_path)
    # load from the path passed in (e.g. "GoogleNews-vectors-negative300.bin") rather than a hard-coded file name
    word2vec_model = KeyedVectors.load_word2vec_format(word2vec_model_path, binary=True)
    word_embedding_2dlist = [[]] * textCNN.vocab_size  # placeholder list; every slot is overwritten below
    word_embedding_2dlist[0] = np.zeros(textCNN.embed_size)  # all-zero vector for the first word: 'PAD'
    bound = np.sqrt(6.0) / np.sqrt(textCNN.vocab_size)  # bound for the random initialization
    word_embedding_2dlist[1] = np.random.uniform(-bound, bound, textCNN.embed_size)  # random vector for the word 'UNK'
    count_exist = 0
    count_not_exist = 0
    for i in range(2, textCNN.vocab_size):  # loop each word. notice that the first two words are pad and unknown token
        word = index2word[i]  # get a word
        # embedding = None
        try:
            embedding = word2vec_model.word_vec(word)  # look up the pre-trained vector (a numpy array)
        except KeyError:  # the word is not in the pre-trained vocabulary
            embedding = None
        if embedding is not None:  # a pre-trained embedding exists for this word
            word_embedding_2dlist[i] = embedding  # assign the array to this word
            count_exist = count_exist + 1
        else:  # no pre-trained embedding for this word
            word_embedding_2dlist[i] = np.random.uniform(-bound, bound, textCNN.embed_size)  # random initialization
            count_not_exist = count_not_exist + 1
    word_embedding_final = np.array(word_embedding_2dlist)  # convert to a 2-D array
    word_embedding = tf.constant(word_embedding_final, dtype=tf.float32)  # convert to tensor
    t_assign_embedding = tf.assign(textCNN.Embedding, word_embedding)  # assign this value to our embedding variables of our model.
    sess.run(t_assign_embedding)
    print("words with a pre-trained embedding:", count_exist, "; words without one:", count_not_exist)
    print("using pre-trained word embedding. ended...")


if __name__ == "__main__":
    tf.app.run()
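
The fastF1 function above derives macro F1 as the harmonic mean of the class-averaged precision and recall. A common alternative averages the per-class F1 values instead, and the two can differ. The sketch below is not part of the original project; it computes both variants, plus micro F1, from a confusion matrix. The name f1_summary and the toy labels are assumptions made purely for illustration.

```python
import numpy as np


def f1_summary(y_true, y_pred, num_classes):
    """Return (micro_f1, macro_f1_of_averages, mean_per_class_f1) for class-id labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Confusion matrix: rows are true classes, columns are predicted classes.
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)

    tp = np.diag(cm).astype(float)   # true positives per class
    support = cm.sum(axis=1)         # examples whose true label is the class
    predicted = cm.sum(axis=0)       # examples predicted as the class

    # Per-class precision and recall; a class with a zero denominator scores 0.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)

    # Variant used by fastF1: harmonic mean of macro precision and macro recall.
    p_mean, r_mean = precision.mean(), recall.mean()
    macro_of_averages = 2 * p_mean * r_mean / (p_mean + r_mean) if (p_mean + r_mean) else 0.0

    # Alternative: unweighted mean of the per-class F1 values.
    denom = precision + recall
    per_class_f1 = np.divide(2 * precision * recall, denom,
                             out=np.zeros_like(tp), where=denom > 0)
    mean_per_class_f1 = per_class_f1.mean()

    # Micro F1 pools the counts over all classes before dividing.
    micro_p = tp.sum() / predicted.sum()
    micro_r = tp.sum() / support.sum()
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r) if (micro_p + micro_r) else 0.0

    return micro_f1, macro_of_averages, mean_per_class_f1


# Toy labels (made up for this example).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
micro, macro_avg, macro_mean = f1_summary(y_true, y_pred, num_classes=3)
print("micro F1: %.4f, macro F1 (of averages): %.4f, macro F1 (mean per class): %.4f"
      % (micro, macro_avg, macro_mean))
```

When every example has exactly one true label and one predicted label, micro precision, micro recall and micro F1 all equal plain accuracy, so the macro figure is the one that reflects per-class imbalance.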

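The embedding initialization in assign_pretrained_word_embedding can also be tried out without TensorFlow or gensim. The sketch below follows the same idea (zeros for 'PAD', random vectors for 'UNK' and for words missing from the pre-trained vocabulary) using only NumPy. The function name build_embedding_matrix, the plain dict standing in for the word2vec model, and the toy vocabulary are all assumptions for illustration, not code from the original project.

```python
import numpy as np


def build_embedding_matrix(index2word, pretrained, embed_size, seed=0):
    """Build a (vocab_size, embed_size) matrix; index 0 is 'PAD' and index 1 is 'UNK'.

    pretrained is a plain dict mapping word -> vector (a stand-in for a word2vec model).
    Returns the matrix and the number of words found in pretrained.
    """
    rng = np.random.default_rng(seed)
    vocab_size = len(index2word)
    bound = np.sqrt(6.0) / np.sqrt(vocab_size)  # same bound as in the listing above

    # Start with random vectors everywhere, then overwrite the special and known rows.
    matrix = rng.uniform(-bound, bound, size=(vocab_size, embed_size)).astype(np.float32)
    matrix[0] = 0.0  # 'PAD' gets an all-zero vector

    hits = 0
    for i in range(2, vocab_size):  # rows 0 and 1 are reserved for PAD and UNK
        vec = pretrained.get(index2word[i])
        if vec is not None:
            matrix[i] = vec
            hits += 1
    return matrix, hits


# Toy vocabulary and "pre-trained" vectors (made up for this example).
index2word = {0: "PAD", 1: "UNK", 2: "good", 3: "zzz"}
pretrained = {"good": np.ones(4, dtype=np.float32)}
matrix, hits = build_embedding_matrix(index2word, pretrained, embed_size=4)
print(matrix.shape, "words with a pre-trained vector:", hits)
```

In the TensorFlow 1.x listing above, a matrix like this is what gets copied into the model's Embedding variable through tf.assign.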
References

The textCNN project on GitHub