I am working on code that uses gensim and am having a tough time troubleshooting a ValueError. I was finally able to unzip the GoogleNews-vectors-negative300.bin.gz file so that I could use it in my model. I also tried gzip, but that was unsuccessful. The error occurs on the last line of the code. What can be done to fix it? Are there any workarounds? Finally, is there a website I could reference?
Thank you respectfully for your assistance!
import gensim
from keras import backend
from keras.layers import Dense, Input, Lambda, LSTM, TimeDistributed
from keras.layers.merge import concatenate
from keras.layers.embeddings import Embedding
from keras.models import Model

pretrained_embeddings_path = "GoogleNews-vectors-negative300.bin"
word2vec = gensim.models.KeyedVectors.load_word2vec_format(pretrained_embeddings_path, binary=True)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-23bd96c1d6ab> in <module>()
      1 pretrained_embeddings_path = "GoogleNews-vectors-negative300.bin"
----> 2 word2vec = gensim.models.KeyedVectors.load_word2vec_format(pretrained_embeddings_path, binary=True)

C:\Users\green\Anaconda3\envs\py35\lib\site-packages\gensim\models\keyedvectors.py in load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
    244                         word.append(ch)
    245                 word = utils.to_unicode(b''.join(word), encoding=encoding, errors=unicode_errors)
--> 246                 weights = fromstring(fin.read(binary_len), dtype=REAL)
    247                 add_word(word, weights)
    248             else:

ValueError: string size must be a multiple of element size
You have to write the complete path to the file.
Use this path:
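As a minimal sketch of what passing a complete path can look like, the snippet below turns a possibly relative file name into an absolute one and fails early if no file exists there. The helper name `resolve_model_path` is hypothetical (it is not part of gensim), and the commented-out call at the end assumes gensim is installed and the file has already been decompressed.

```python
from pathlib import Path


def resolve_model_path(path):
    """Return the absolute path to `path`, or raise if no file exists there.

    Hypothetical helper for illustration; not part of gensim.
    """
    full_path = Path(path).expanduser().resolve()
    if not full_path.is_file():
        raise FileNotFoundError("No file at {}".format(full_path))
    return full_path


# Usage (assumes gensim is installed and the .bin file is present):
# import gensim
# model_path = resolve_model_path("GoogleNews-vectors-negative300.bin")
# word2vec = gensim.models.KeyedVectors.load_word2vec_format(str(model_path), binary=True)
```

Failing with a `FileNotFoundError` before calling gensim makes it easier to tell a wrong path apart from a damaged file.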
The commands below no longer work.
brew install wget
wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
This downloads the GZIP compressed file that you can uncompress using:
gzip -d GoogleNews-vectors-negative300.bin.gz
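If you are unsure whether decompression actually happened (for example, the file was only renamed), one quick check is to look at the first two bytes of the file, since gzip streams start with the bytes 0x1f 0x8b. A minimal sketch, where `looks_gzipped` is a hypothetical helper name:

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


# Usage:
# if looks_gzipped("GoogleNews-vectors-negative300.bin"):
#     print("File is still compressed; decompress it before loading.")
```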
You can then use the command below to load the word vectors.
from gensim import models

w = models.KeyedVectors.load_word2vec_format(
    '../GoogleNews-vectors-negative300.bin', binary=True)
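The `ValueError: string size must be a multiple of element size` in the question is raised while a vector's raw bytes are being read, so one plausible cause (an assumption, not something confirmed in this thread) is a file that is truncated or was not fully decompressed. A binary word2vec file begins with a text header line holding the vocabulary size and the vector dimension, which gives a lower bound on the file size. The sketch below compares that bound with the size on disk; `min_expected_size` is a hypothetical helper, and it assumes 4-byte float32 vectors.

```python
import os


def min_expected_size(path):
    """Return (lower bound on file size in bytes, actual size in bytes).

    Assumes the word2vec binary layout: a text header line
    "<vocab_size> <vector_size>\n" followed by, for each word, the word's
    bytes, a space, and vector_size float32 values (4 bytes each).
    """
    with open(path, "rb") as f:
        header = f.readline()
    vocab_size, vector_size = (int(x) for x in header.split())
    # Counts the header and vector payload only; the words add more bytes.
    lower_bound = len(header) + vocab_size * vector_size * 4
    return lower_bound, os.path.getsize(path)


# Usage:
# lower_bound, actual = min_expected_size("GoogleNews-vectors-negative300.bin")
# if actual < lower_bound:
#     print("File looks truncated; download or decompress it again.")
```

For the Google News vectors (3,000,000 entries of 300 dimensions) the vector payload alone comes to 3,600,000,000 bytes, so a much smaller file is worth downloading again. If the header line cannot be parsed as two integers, the file is probably not a decompressed word2vec binary at all.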
Try this:
import gensim.downloader as api

wv = api.load('word2vec-google-news-300')
vec_king = wv['king']
Here is what worked for me. I loaded a part of the model and not the entire model as it’s huge.
!pip install wget

import gzip
import shutil

import wget
import gensim
from gensim.models import Word2Vec, KeyedVectors
from sklearn.decomposition import PCA

# Download the compressed vectors.
url = 'https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz'
filename = wget.download(url)

# Decompress to a .bin file (this step needs the gzip import above).
with gzip.open(filename, 'rb') as f_in, open('GoogleNews-vectors-negative300.bin', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)

# Load only the first 100,000 vectors to keep memory use down.
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, limit=100000)
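One consequence of `limit=100000` is that only the first 100,000 vectors in the file are loaded, so indexing a word outside that slice raises a `KeyError`. A minimal sketch of a guarded lookup follows; `get_vector_or_none` is a hypothetical helper, and it relies only on the `in` and `[]` operations that `KeyedVectors` supports, so a plain dict stands in for the model here.

```python
def get_vector_or_none(vectors, word):
    """Return the vector for `word`, or None if the word was not loaded.

    `vectors` can be a gensim KeyedVectors object or any mapping that
    supports `in` and `[]`.
    """
    if word in vectors:
        return vectors[word]
    return None


# Stand-in for a loaded model (real vectors would hold 300 floats each).
toy_vectors = {"king": [0.1, 0.2, 0.3]}
print(get_vector_or_none(toy_vectors, "king"))
print(get_vector_or_none(toy_vectors, "zyzzyva"))
```

With a real model you would pass the object returned by `load_word2vec_format` in place of `toy_vectors`.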
You can use this URL that points to Google Drive’s download of the bin.gz file:
Alternative mirrors (including the S3 mentioned here) seem to be broken.
Also available from figshare:
wget https://figshare.com/ndownloader/files/10798046 -O GoogleNews-vectors-negative300.bin