Convert word2vec bin file to text
Question:
From the word2vec site I can download GoogleNews-vectors-negative300.bin.gz. The .bin file (about 3.4GB) is a binary format not useful to me. Tomas Mikolov assures us that “It should be fairly straightforward to convert the binary format to text format (though that will take more disk space). Check the code in the distance tool, it’s rather trivial to read the binary file.” Unfortunately, I don’t know enough C to understand http://word2vec.googlecode.com/svn/trunk/distance.c.
Supposedly gensim can do this also, but all the tutorials I’ve found seem to be about converting from text, not the other way.
Can someone suggest modifications to the C code or instructions for gensim to emit text?
Answers:
On the word2vec-toolkit mailing list, Thomas Mensink provided an answer in the form of a small C program that converts a .bin file to text. It is a modification of the distance.c file. I replaced the original distance.c with Thomas’s code below, rebuilt word2vec (make clean; make), and renamed the compiled distance to readbin. Then ./readbin vector.bin prints a text version of vector.bin to standard output, which you can redirect to a file.
// Copyright 2013 Google Inc. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <malloc.h>

const long long max_size = 2000; // max length of strings
const long long N = 40; // number of closest words that will be shown
const long long max_w = 50; // max length of vocabulary entries

int main(int argc, char **argv) {
  FILE *f;
  char file_name[max_size];
  float len;
  long long words, size, a, b;
  char ch;
  float *M;
  char *vocab;
  if (argc < 2) {
    printf("Usage: ./distance <FILE>\nwhere FILE contains word projections in the BINARY FORMAT\n");
    return 0;
  }
  strcpy(file_name, argv[1]);
  f = fopen(file_name, "rb");
  if (f == NULL) {
    printf("Input file not found\n");
    return -1;
  }
  // Header: vocabulary size and vector dimensionality, stored as text
  fscanf(f, "%lld", &words);
  fscanf(f, "%lld", &size);
  vocab = (char *)malloc((long long)words * max_w * sizeof(char));
  M = (float *)malloc((long long)words * (long long)size * sizeof(float));
  if (M == NULL) {
    printf("Cannot allocate memory: %lld MB %lld %lld\n", (long long)words * size * sizeof(float) / 1048576, words, size);
    return -1;
  }
  for (b = 0; b < words; b++) {
    // Each entry: the word, one separator character, then size raw floats
    fscanf(f, "%s%c", &vocab[b * max_w], &ch);
    for (a = 0; a < size; a++) fread(&M[a + b * size], sizeof(float), 1, f);
    // Inherited from distance.c: scale each vector to unit length,
    // so the text output holds normalized values
    len = 0;
    for (a = 0; a < size; a++) len += M[a + b * size] * M[a + b * size];
    len = sqrt(len);
    for (a = 0; a < size; a++) M[a + b * size] /= len;
  }
  fclose(f);
  //Code added by Thomas Mensink
  //output the vectors of the binary format in text
  printf("%lld %lld #File: %s\n",words,size,file_name);
  for (a = 0; a < words; a++){
    printf("%s ",&vocab[a * max_w]);
    for (b = 0; b< size; b++){ printf("%f ",M[a*size + b]); }
    printf("\b\b\n");
  }
  return 0;
}
I removed the “\b\b” from the final printf.
By the way, the resulting text file still contained the text word and some unnecessary whitespace which I did not want for some numerical calculations. I removed the initial text column and the trailing blank from each line with bash commands.
cut --complement -d ' ' -f 1 GoogleNews-vectors-negative300.txt > GoogleNews-vectors-negative300_tuples-only.txt
sed 's/ $//' GoogleNews-vectors-negative300_tuples-only.txt
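For anyone without GNU cut (the --complement flag is a GNU extension), the same clean-up can be sketched in Python. This is only an illustration: the function names strip_word_column and strip_file are made up here, and it assumes the text file is space-separated as produced by the C program above. Like the cut command, it also drops the first field of the header line.

```python
def strip_word_column(line: str) -> str:
    """Drop the first space-separated field and any trailing whitespace."""
    # partition() splits on the first space only, so the numeric columns stay intact.
    _word, _sep, rest = line.rstrip("\n").partition(" ")
    return rest.rstrip()


def strip_file(src_path: str, dst_path: str) -> None:
    """Write a copy of src_path with the word column removed from every line."""
    # errors="replace" is an assumption: some vocabulary entries may not be valid UTF-8.
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(strip_word_column(line) + "\n")
```

Calling strip_file('GoogleNews-vectors-negative300.txt', 'GoogleNews-vectors-negative300_tuples-only.txt') would process the file line by line, so memory use stays small even for the full model.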
I am using gensim to work with GoogleNews-vectors-negative300.bin, and I pass the binary=True flag when loading the model.
from gensim import word2vec
model = word2vec.Word2Vec.load_word2vec_format('Path/to/GoogleNews-vectors-negative300.bin', binary=True)
Seems to be working fine.
I had a similar issue: I wanted to export bin/non-bin (gensim) models as CSV.
Here is the code that does that in Python; it assumes you have gensim installed:
The format is IEEE 754 single-precision binary floating point (binary32):
http://en.wikipedia.org/wiki/Single-precision_floating-point_format
The values are stored little-endian.
Let's work through an example:
- The first line is a text header: “3000000 300\n” (vocabSize and vecSize; read bytes until byte == ‘\n’).
- Each following entry holds the vocabulary word first, then 300*4 bytes of float values (4 bytes per dimension). Read bytes until byte == 32 (space): 60 47 115 62 32 => </s>[space]
- Each subsequent group of 4 bytes represents one float. The next 4 bytes, shown as signed values (-108 is 148 unsigned): 0 0 -108 58 => 0.001129150390625.
You can check the Wikipedia link to see how; let me work this one through as an example:
(little-endian -> reverse the byte order) 00111010 10010100 00000000 00000000
- the first bit is the sign bit: 0 => sign = 1 (a 1 bit would mean sign = -1)
- the next 8 bits: 01110101 = 117 => exp = 2^(117-127)
- the next 23 bits: pre = 1 + 0*2^(-1) + 0*2^(-2) + 1*2^(-3) + 0*2^(-4) + 1*2^(-5) (the leading 1 is implicit)
value = sign * exp * pre = 1 * 2^(-10) * 1.15625 = 0.001129150390625
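The byte layout described above can be sketched in pure Python with the standard struct module. This is a minimal illustration rather than a reference reader: the function name read_word2vec_bin and the tiny in-memory sample are assumptions made for the example, and skipping a newline before each word is based on how the C reader's fscanf("%s") ignores leading whitespace.

```python
import io
import struct


def read_word2vec_bin(stream):
    """Yield (word, vector) pairs from a word2vec binary stream."""
    # Header: ASCII "<vocab_size> <vector_size>\n".
    header = stream.readline().decode("utf-8")
    vocab_size, vector_size = (int(x) for x in header.split())
    fmt = "<%df" % vector_size        # little-endian float32 values
    n_bytes = struct.calcsize(fmt)    # 4 bytes per dimension
    for _ in range(vocab_size):
        word = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of file while reading a word")
            if ch == b" ":            # byte 32 terminates the word
                break
            if ch != b"\n":           # skip a newline left after the previous vector
                word.extend(ch)
        data = stream.read(n_bytes)
        if len(data) != n_bytes:
            raise EOFError("unexpected end of file while reading a vector")
        yield word.decode("utf-8", errors="replace"), struct.unpack(fmt, data)


# The four bytes from the worked example (-108 as a signed byte is 148 unsigned).
value = struct.unpack("<f", bytes([0, 0, 148, 58]))[0]

# A tiny in-memory "file": one word with two dimensions.
sample = b"1 2\n" + b"</s> " + struct.pack("<2f", value, -1.0) + b"\n"
pairs = list(read_word2vec_bin(io.BytesIO(sample)))
```

For a real file you would pass open(path, 'rb') instead of the BytesIO object. Because the function is a generator, it can write each vector out as text without holding the whole model in memory.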
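You can load the binary file with gensim's word2vec module, and then save the text version like this:
from gensim.models import word2vec
model = word2vec.Word2Vec.load_word2vec_format('Path/to/GoogleNews-vectors-negative300.bin', binary=True)
# model.save() writes gensim's own pickled format, not text, so use save_word2vec_format
model.save_word2vec_format("file.txt", binary=False)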
Here is the code I use:
import codecs
from gensim.models import Word2Vec

def main():
    path_to_model = 'GoogleNews-vectors-negative300.bin'
    output_file = 'GoogleNews-vectors-negative300_test.txt'
    export_to_file(path_to_model, output_file)

def export_to_file(path_to_model, output_file):
    output = codecs.open(output_file, 'w' , 'utf-8')
    model = Word2Vec.load_word2vec_format(path_to_model, binary=True)
    print('done loading Word2Vec')
    vocab = model.vocab
    for mid in vocab:
        #print(model[mid])
        #print(mid)
        vector = list()
        for dimension in model[mid]:
            vector.append(str(dimension))
        #line = { "mid": mid, "vector": vector }
        vector_str = ",".join(vector)
        line = mid + "\t" + vector_str
        #line = json.dumps(line)
        output.write(line + "\n")
    output.close()

if __name__ == "__main__":
    main()
    #cProfile.run('main()') # if you want to do some profiling
I use this code to load the binary model and then save it to a text file:
from gensim.models.keyedvectors import KeyedVectors
model = KeyedVectors.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)
Note:
The code above is for newer versions of gensim. For earlier versions, I used this code:
from gensim.models import word2vec
model = word2vec.Word2Vec.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)
convertvec is a small tool to convert vectors between different formats for the word2vec library.
Convert vectors from binary to plain text:
./convertvec bin2txt input.bin output.txt
Convert vectors from plain text to binary:
./convertvec txt2bin input.txt output.bin
Just a quick update, as there is now an easier way.
If you are using word2vec from https://github.com/dav/word2vec, there is an additional option called -binary, which accepts 1 to generate a binary file or 0 to generate a text file. This example comes from demo-word.sh in the repo:
time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 0 -iter 15
If you get the error:
ImportError: No module named models.word2vec
then it is because there was an API update. This will work:
from gensim.models.keyedvectors import KeyedVectors
model = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('./GoogleNews-vectors-negative300.txt', binary=False)
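To spot-check a converted file, the text format can be parsed without gensim. The sketch below assumes the layout written with binary=False: a header line with the vocabulary size and vector size, then one line per word with space-separated values. The function name read_word2vec_text and the two sample words are invented for illustration, and because it builds a plain dict it is only suitable for small files or samples, not the full GoogleNews model.

```python
import io


def read_word2vec_text(stream):
    """Parse word2vec text format into (vocab_size, vector_size, vectors)."""
    # Header: "<vocab_size> <vector_size>"; anything after the two numbers is ignored.
    vocab_size, vector_size = (int(x) for x in stream.readline().split()[:2])
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue                      # skip blank or malformed lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != vector_size:
            raise ValueError("expected %d values for %r, got %d"
                             % (vector_size, word, len(values)))
        vectors[word] = values
    return vocab_size, vector_size, vectors


# A tiny in-memory sample with made-up words and values.
sample = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
vocab_size, vector_size, vectors = read_word2vec_text(sample)
```

Comparing vocab_size against len(vectors) is a quick way to notice a truncated conversion.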