Intro - BERT and Fine-Tuning

If there has been one NLP buzzword over the past few years, it is BERT. You can see it in most blogs, products, and even research in the field of Natural Language Processing (NLP). BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained NLP model released by Google in 2018. The idea behind BERT follows the general philosophy of NLP work in recent decades - generating embeddings that capture rich relations between words in human languages and using those embeddings for a variety of downstream tasks. Please see my previous blog post for a more detailed walk-through of what word embeddings are and how they can be used.
Coincidentally, BERT is also the name of a character from Sesame Street, as are the names of many other NLP models.
BERT addresses a series of problems that previous NLP models have. Models like Word2Vec can generate fairly accurate word embeddings given enough training data - for example, producing vectors such that V(king) - V(man) + V(woman) ≈ V(queen) - but they do not address the problem of the same word having multiple meanings. Because Word2Vec only looks at a fixed window during training, the only thing that determines the embedding of a given word is the words that appear around it in the training data. Therefore, sentences like "I withdrew some money from the bank" and "I was sleeping on the bank of the river" produce essentially the same embedding for the word "bank", even though it means something different in each sentence. BERT addresses this problem with the Transformer architecture and its self-attention layers (to be precise, BERT is the encoder of the Transformer architecture). The architecture is a bit complex for beginners in NLP, and I would recommend reading this blog post for a detailed introduction. Essentially, BERT lets every word compute "attention" with every other word in the sentence, including itself. This produces dynamic relations between the words that capture their meaning in the given sentence.
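To make this concrete, here is a minimal sketch using the Hugging Face transformers library and the bert-base-uncased checkpoint (a different, more convenient interface than the Google code walked through later in this post). It pulls out the contextual vector for "bank" from both sentences; unlike a Word2Vec embedding, the two vectors differ because each reflects its own context:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

sentences = [
    "I withdrew some money from the bank",
    "I was sleeping on the bank of the river",
]

bank_id = tokenizer.convert_tokens_to_ids("bank")
bank_vectors = []
for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # find the position of "bank" and grab its contextual embedding
    position = inputs["input_ids"][0].tolist().index(bank_id)
    bank_vectors.append(outputs.last_hidden_state[0, position])

# the two "bank" vectors are similar but not identical
print(torch.nn.functional.cosine_similarity(bank_vectors[0], bank_vectors[1], dim=0))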

The original BERT release comes in two sizes - BERT BASE and BERT LARGE. As their names indicate, the former is the base model, comparable in size to the OpenAI Transformer (12 layers, 768 hidden units, 12 attention heads, roughly 110M parameters), and the latter is a much larger model (24 layers, 1024 hidden units, 16 attention heads, roughly 340M parameters) that achieved the state-of-the-art results reported in the original paper. Both models are impressively accurate on their own, but BERT has another design that makes it even more powerful for downstream tasks - fine-tuning.

Fine-tuning is the process where developers train extra layers on top of the pre-trained BERT model to perform specific tasks. These tasks include text classification, sentiment analysis, named-entity recognition, question answering, and more. Google provides code for several fine-tuning tasks in its original open-sourced BERT repository, and today I am going to look into the source code of arguably the most common fine-tuning task for BERT - classification.
From http://jalammar.github.io/illustrated-bert/
As the image above illustrates, for classification we add another layer (e.g. a softmax classifier) on top of the pre-trained BERT model. When the whole model is trained together, only minimal changes are made to the original BERT weights while still achieving state-of-the-art text classification performance.
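For intuition, here is a condensed sketch of that extra layer in the same TF 1.x style as the code below. It roughly mirrors what create_model in run_classifier.py does, but the function name and the dropout/initializer values here are simplified illustrations, not the exact code from the repository:

import tensorflow as tf  # TF 1.x, matching the BERT repository

def add_classification_layer(pooled_output, num_labels, is_training):
  """Puts a single softmax layer on top of BERT's pooled [CLS] output."""
  hidden_size = pooled_output.shape[-1].value

  output_weights = tf.get_variable(
      "output_weights", [num_labels, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))
  output_bias = tf.get_variable(
      "output_bias", [num_labels], initializer=tf.zeros_initializer())

  if is_training:
    # a little dropout on the pooled output during fine-tuning
    pooled_output = tf.nn.dropout(pooled_output, keep_prob=0.9)

  logits = tf.matmul(pooled_output, output_weights, transpose_b=True)
  logits = tf.nn.bias_add(logits, output_bias)
  probabilities = tf.nn.softmax(logits, axis=-1)  # one probability per class
  return logits, probabilities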

Run classifier source code walkthrough

The source code I will be referencing below comes from the run_classifier.py and modeling.py files in the original BERT repository open-sourced by Google. I ran the program in debug mode to trace through every step of the process.

Reading and pre-processing data

Below is the main function of the run_classifier.py program.

if __name__ == "__main__":
  flags.mark_flag_as_required("data_dir")
  flags.mark_flag_as_required("task_name")
  flags.mark_flag_as_required("bert_hub_module_handle")
  flags.mark_flag_as_required("output_dir")
  tf.app.run()

Let’s start by setting a breakpoint at tf.app.run() and running the program. After some initialization for TensorFlow, you should land in the block of code below.

if FLAGS.do_train:
  train_examples = processor.get_train_examples(FLAGS.data_dir)  # read the data
  num_train_steps = int(
      len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
  num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)  # keep the learning rate small at first

This code block is where BERT reads in the data and determines the number of training steps. The num_warmup_steps variable may seem a bit strange here. As its name suggests, it keeps the learning rate small at first and then uses the normal learning rate once the warmup phase is over.
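To see what warmup means in practice, here is a minimal sketch in plain Python (my own simplification, not the actual schedule in BERT's optimization.py): the learning rate ramps up linearly during the warmup steps and then decays over the remaining steps.

def learning_rate_at(step, base_lr, num_train_steps, num_warmup_steps):
    """Toy linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        return base_lr * (step + 1) / num_warmup_steps        # small at first
    remaining = num_train_steps - step
    return base_lr * remaining / (num_train_steps - num_warmup_steps)

# e.g. 1000 train steps with a 10% warmup proportion
for step in (0, 50, 99, 500, 999):
    print(step, learning_rate_at(step, 2e-5, 1000, 100))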

After setting up the initial parameters for training, the next step is preparing the data in a format that can be fed into BERT. This is achieved with a call to the following function:

file_based_convert_examples_to_features(
    train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)

Now we step into the function to see exactly how that works.

def file_based_convert_examples_to_features(
    examples, label_list, max_seq_length, tokenizer, output_file):
  """Convert a set of `InputExample`s to a TFRecord file."""

  writer = tf.python_io.TFRecordWriter(output_file)  # write to a TFRecord file so the data can be read efficiently

  for (ex_index, example) in enumerate(examples):  # iterate through the examples
    if ex_index % 10000 == 0:
      tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))

    feature = convert_single_example(ex_index, example, label_list,
                                     max_seq_length, tokenizer)

This function accomplishes several things: it writes the converted examples to a TFRecord file so the data can be read efficiently during training, it logs progress every 10,000 examples, and, for each example, it calls the convert_single_example function (a nice bit of decomposition).

Now we will step into the convert_single_example function.

def convert_single_example(ex_index, example, label_list, max_seq_length,
                           tokenizer):
  """Converts a single `InputExample` into a single `InputFeatures`."""

  if isinstance(example, PaddingInputExample):
    return InputFeatures(
        input_ids=[0] * max_seq_length,
        input_mask=[0] * max_seq_length,
        segment_ids=[0] * max_seq_length,
        label_id=0,
        is_real_example=False)

  label_map = {}
  for (i, label) in enumerate(label_list):  # map each label to an integer id
    label_map[label] = i

  tokens_a = tokenizer.tokenize(example.text_a)  # tokenize the first sentence with wordpiece
  tokens_b = None
  if example.text_b:
    tokens_b = tokenizer.tokenize(example.text_b)  # tokenize the second sentence with wordpiece (if it exists)

  if tokens_b:
    # Modifies `tokens_a` and `tokens_b` in place so that the total
    # length is less than the specified length.
    # Account for [CLS], [SEP], [SEP] with "- 3"
    _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)  # truncate the sequence pair if it is too long
  else:
    # Account for [CLS] and [SEP] with "- 2"
    if len(tokens_a) > max_seq_length - 2:
      tokens_a = tokens_a[0:(max_seq_length - 2)]

As the docstring says, this function converts a single InputExample into a single InputFeatures. Don’t worry if this doesn’t make sense to you yet - I will include a concrete example later. But first, there are several variables you need to know:

  • input_ids - indices of the input sequence tokens in the vocabulary (the vocab.txt file in the BERT folder)
  • input_mask - indicates which positions hold real tokens (mask 1) and which are padding (mask 0); padded positions are ignored in self-attention later
  • segment_ids - indicates which sentence each token belongs to in training: 0 means the token is in the first sentence, 1 the second

These three are essentially the most important pieces of information BERT needs about our data before it can be fed into the network. As you can see, after they are initialized BERT tokenizes text_a and text_b (if the latter exists) using its own tokenizer function. It is not very different from normal tokenizers except that it uses the wordpiece technique, which tokenizes “Jacksonville” into “jack”, “##son”, “##ville”, thereby producing smaller and more reusable tokens to feed into the network. If you wish to see how wordpiece works, you can step into the tokenizer function (spoiler alert: it uses a greedy longest-match algorithm, sketched below).
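Here is a toy sketch of that greedy longest-match idea (the real WordpieceTokenizer in tokenization.py also handles unknown characters, casing, and a maximum input length; this only shows the core loop, with a made-up three-entry vocabulary):

def toy_wordpiece(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur_piece = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece       # continuation pieces are prefixed with ##
            if piece in vocab:
                cur_piece = piece          # take the longest piece found in the vocab
                break
            end -= 1
        if cur_piece is None:
            return ["[UNK]"]               # give up on the whole word
        pieces.append(cur_piece)
        start = end
    return pieces

vocab = {"jack", "##son", "##ville"}
print(toy_wordpiece("jacksonville", vocab))  # ['jack', '##son', '##ville']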

Taken together, what this function essentially does is create a label_map, tokenize the words using wordpiece, handle the case where a second sentence is present, and add the [CLS] and [SEP] tokens where they belong.

  • [CLS] stands for classification - it is placed as the first token of the sequence, and its final hidden state is what the classification layer uses
  • [SEP] stands for separator - it is placed after each sentence to mark where that sentence ends

The following code block is fairly intuitive and easy to understand - it accomplishes these tasks:

  • Add [CLS] to the beginning
  • Add the segment_ids, which are all 0 for the first sentence
  • Add [SEP] to indicate the end of the sentence
  • Do the same for the second sentence, this time with segment_ids of 1
  • Convert the tokens to their IDs by looking them up in vocab.txt

This pre-processes the data into features that can be fed into BERT's API.
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
  tokens.append(token)
  segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)

if tokens_b:
  for token in tokens_b:
    tokens.append(token)
    segment_ids.append(1)
  tokens.append("[SEP]")
  segment_ids.append(1)

input_ids = tokenizer.convert_tokens_to_ids(tokens)  # convert the tokens to vocabulary indices for easy lookup

# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask = [1] * len(input_ids)  # real tokens get mask 1; the padding added below gets mask 0

# Zero-pad up to the sequence length.
while len(input_ids) < max_seq_length:  # make sure every input has the same length (128 in my case)
  input_ids.append(0)
  input_mask.append(0)
  segment_ids.append(0)

assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length

label_id = label_map[example.label]
if ex_index < 5:  # print a few examples
  tf.logging.info("*** Example ***")
  tf.logging.info("guid: %s" % (example.guid))
  tf.logging.info("tokens: %s" % " ".join(
      [tokenization.printable_text(x) for x in tokens]))
  tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
  tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
  tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
  tf.logging.info("label: %s (id = %d)" % (example.label, label_id))

feature = InputFeatures(
    input_ids=input_ids,
    input_mask=input_mask,
    segment_ids=segment_ids,
    label_id=label_id,
    is_real_example=True)
return feature
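To make the output concrete, here is a hand-worked toy illustration of the features for the pair "i like cats" / "i like dogs" with max_seq_length = 10. The IDs are made up for readability; real IDs come from BERT's vocab.txt:

tokens      = ["[CLS]", "i", "like", "cats", "[SEP]", "i", "like", "dogs", "[SEP]"]
input_ids   = [101, 11, 12, 13, 102, 11, 12, 14, 102, 0]  # padded with 0 to length 10
input_mask  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]              # 0 marks the padding position
segment_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]              # 0 = first sentence, 1 = second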

Then we go back to the file_based_convert_examples_to_features function, which wraps each returned feature in a tf.train.Example and writes it to the TFRecord file:

def create_int_feature(values):
  f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
  return f

features = collections.OrderedDict()
features["input_ids"] = create_int_feature(feature.input_ids)  # wrap each int list as a tf.train.Feature
features["input_mask"] = create_int_feature(feature.input_mask)
features["segment_ids"] = create_int_feature(feature.segment_ids)
features["label_ids"] = create_int_feature([feature.label_id])
features["is_real_example"] = create_int_feature(
    [int(feature.is_real_example)])

tf_example = tf.train.Example(features=tf.train.Features(feature=features))
writer.write(tf_example.SerializeToString())
writer.close()
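For completeness, here is a condensed sketch of how these records are read back at training time. The real logic lives in file_based_input_fn_builder in run_classifier.py; the names max_seq_length and train_file below stand in for the values that function receives, and the batching is simplified:

import tensorflow as tf  # TF 1.x

name_to_features = {
    "input_ids": tf.FixedLenFeature([max_seq_length], tf.int64),
    "input_mask": tf.FixedLenFeature([max_seq_length], tf.int64),
    "segment_ids": tf.FixedLenFeature([max_seq_length], tf.int64),
    "label_ids": tf.FixedLenFeature([], tf.int64),
}

def parse_record(record):
    # decode one serialized tf.train.Example back into fixed-length int tensors
    return tf.parse_single_example(record, name_to_features)

dataset = tf.data.TFRecordDataset(train_file).map(parse_record).batch(32)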

Now we move to the only part of run_classifier.py you need to modify for your own classification task: the DataProcessor class, which reads your data and turns it into InputExamples before they are fed into BERT's pipeline. Below is the template:

class DataProcessor(object):
  """Base class for data converters for sequence classification data sets."""

  def get_train_examples(self, data_dir):
    """Gets a collection of `InputExample`s for the train set."""
    raise NotImplementedError()

  def get_dev_examples(self, data_dir):
    """Gets a collection of `InputExample`s for the dev set."""
    raise NotImplementedError()

  def get_test_examples(self, data_dir):
    """Gets a collection of `InputExample`s for prediction."""
    raise NotImplementedError()

  def get_labels(self):
    """Gets the list of labels for this data set."""
    raise NotImplementedError()

  @classmethod
  def _read_tsv(cls, input_file, quotechar=None):
    """Reads a tab separated value file."""
    with tf.gfile.Open(input_file, "r") as f:
      reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
      lines = []
      for line in reader:
        lines.append(line)
      return lines
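One small addition goes along with this class: the main function of run_classifier.py looks up the processor class by task name in a dictionary, so a new task also needs an entry there. Roughly (with a "sim" entry added for the demo below):

# In main() of run_classifier.py, the --task_name flag is mapped to a
# processor class; register any new processor here.
processors = {
    "cola": ColaProcessor,
    "mnli": MnliProcessor,
    "mrpc": MrpcProcessor,
    "xnli": XnliProcessor,
    "sim": SelfProcessor,  # the custom processor used in the demo below
}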

Once you finish writing this class, you have a fine-tuning program ready to run for BERT! In the next part, I will show you a short demo that uses run_classifier.py to compare the similarity between two sentences.

Demo: similarity between two sentences

For this demo, I want to determine whether two sentences are semantically the same despite having different wordings. The data I obtained comes from Ant Group, a fintech company in China (its planned 2020 IPO was set to be one of the biggest in history). Each line in the data file consists of an index, two sentences, and either 1 or 0 indicating whether they mean the same thing. It looks like this:


The data are randomly split into three sets - training data, testing data, and evaluation data.
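To make the file format concrete before looking at the processor, here is a made-up English line in the same shape (the real data is in Chinese) and how splitting on tabs pulls out its fields:

# index, first sentence, second sentence, and a 0/1 same-meaning label, separated by tabs
line = "42\tHow do I change my phone number\tHow can I update the phone number on my account\t1"
index, text_a, text_b, label = line.strip().split("\t")
print(text_a)  # How do I change my phone number
print(label)   # 1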

The first and only step I took was to rewrite the DataProcessor class, which you can see below.

# My own processor, designed based on the DataProcessor template
class SelfProcessor(DataProcessor):
  """Processor for the sentence-similarity data set."""

  def get_train_examples(self, data_dir):
    file_path = os.path.join(data_dir, '/Users/jiahuichen/PycharmProjects/BERT开源项目及数据/GLUE/glue_data/SIM/train.csv')
    with open(file_path, 'r', encoding="utf-8") as f:
      reader = f.readlines()
    examples = []
    for index, line in enumerate(reader):
      guid = 'train-%d' % index
      split_line = line.strip().split("\t")
      print(split_line)
      text_a = tokenization.convert_to_unicode(split_line[1])
      text_b = tokenization.convert_to_unicode(split_line[2])
      label = split_line[3]
      examples.append(InputExample(guid=guid, text_a=text_a,
                                   text_b=text_b, label=label))
    return examples

  def get_dev_examples(self, data_dir):
    file_path = os.path.join(data_dir, '/Users/jiahuichen/PycharmProjects/BERT开源项目及数据/GLUE/glue_data/SIM/val.csv')
    with open(file_path, 'r', encoding="utf-8") as f:
      reader = f.readlines()
    examples = []
    for index, line in enumerate(reader):
      guid = 'dev-%d' % index
      split_line = line.strip().split("\t")
      text_a = tokenization.convert_to_unicode(split_line[1])
      text_b = tokenization.convert_to_unicode(split_line[2])
      label = split_line[3]
      examples.append(InputExample(guid=guid, text_a=text_a,
                                   text_b=text_b, label=label))
    return examples

  def get_test_examples(self, data_dir):
    """See base class."""
    file_path = os.path.join(data_dir, '/Users/jiahuichen/PycharmProjects/BERT开源项目及数据/GLUE/glue_data/SIM/test.csv')
    with open(file_path, 'r', encoding="utf-8") as f:
      reader = f.readlines()
    examples = []
    for index, line in enumerate(reader):
      guid = 'test-%d' % index
      split_line = line.strip().split("\t")
      text_a = tokenization.convert_to_unicode(split_line[1])
      text_b = tokenization.convert_to_unicode(split_line[2])
      label = split_line[3]
      examples.append(InputExample(guid=guid, text_a=text_a,
                                   text_b=text_b, label=label))
    return examples

  def get_labels(self):
    """See base class."""
    return ["0", "1"]

  def _create_examples(self, lines, set_type):
    """Creates examples for the training and dev sets."""
    examples = []
    for (i, line) in enumerate(lines):
      # Only the test set has a header
      if set_type == "test" and i == 0:
        continue
      guid = "%s-%s" % (set_type, i)
      if set_type == "test":
        text_a = tokenization.convert_to_unicode(line[2])
        label = "0"
      else:
        text_a = tokenization.convert_to_unicode(line[2])
        label = tokenization.convert_to_unicode(line[4])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples

The three functions for getting the training, testing, and evaluation data sets are essentially the same, so I will only go over get_train_examples. It first defines a file path (where the data lives on your computer) and opens the file at that path. It then splits each line on tabs into the three elements we want: the first sentence, the second sentence, and the label. Finally, it stores them in an InputExample, which is then ready to be fed into BERT. The result of the pre-processing looks like this:


The last thing to do before running the program is setting the parameters. You can do this in most Python IDEs. I use PyCharm, whose “Edit Configurations” option lets you save custom run parameters. The following are the parameters I set:

--task_name=sim # which task you want to perform
--do_train=true # whether to run training
--do_eval=true # whether to run evaluation
--data_dir=../GLUE/glue_data/SIM # the directory for your data
--vocab_file=../GLUE/BERT_BASE_DIR/chinese_L-12_H-768_A-12/vocab.txt # BERT's vocab file
--bert_config_file=../GLUE/BERT_BASE_DIR/chinese_L-12_H-768_A-12/bert_config.json # BERT's config file
--init_checkpoint=../GLUE/BERT_BASE_DIR/chinese_L-12_H-768_A-12/bert_model.ckpt # the pre-trained TensorFlow checkpoint
--max_seq_length=128 # the max sequence length per line; longer lines are truncated, shorter ones padded
--train_batch_size=32 # the batch size; I would recommend a smaller value if you run out of memory
--learning_rate=2e-5 # the learning rate for fine-tuning
--num_train_epochs=2.0 # the number of training epochs
--output_dir=../GLUE/sim_output # where to store the output model

With that, you are ready to hit the run button! The program will take a while depending on how large your data set is and how much computing power your machine has. On my MacBook Pro it ran pretty fast, and it gave me the following evaluation results:


As you can see, the final accuracy is 0.75, which definitely leaves room for improvement. At the same time, it is important to note that my training set only has 144 entries - it would be nearly impossible for earlier NLP models to reach the same performance with so little data. This speaks to the power of BERT: with a comprehensively pre-trained model, far less task-specific data is needed in the fine-tuning stage to achieve state-of-the-art performance.

I hope this walkthrough and demo are helpful for you :) If you have any questions or suggestions, you can reach me at jchen23@stanford.edu.