How to use a TensorFlow Hub module with the TensorFlow Dataset API

I want to use the TensorFlow Dataset API to build my input pipeline and embed my text with a TensorFlow Hub module inside it. Specifically, I want to use the dataset.map function to convert my text data into embeddings. My TensorFlow version is 1.14.

Since I am using the ELMo v2 module, which converts an array of sentences into their word embeddings, I used the following code:

import tensorflow as tf
import tensorflow_hub as hub
...
sentences_array = load_sentences()
# sentences_array = ["I love Python", "python is a good PL"]
def parse(sentences):
    elmo = hub.Module("./ELMO")
    embeddings = elmo([sentences], signature="default", as_dict=True)["word_emb"]
    return embeddings
dataset = tf.data.TextLineDataset(sentences_array)
dataset = dataset.apply(tf.data.experimental.map_and_batch(map_func=parse, batch_size=batch_size))



I want the embeddings of the text array to have shape [batch_size, max_words_in_batch, embedding_size], but I got this error message:

"NotImplementedError: Using TF-Hub module within a TensorFlow defined function is currently not supported."


How can I get the expected results?

Answers

Unfortunately, this is not supported in TensorFlow 1.x.

It is, however, supported in TensorFlow 2.0. If you can upgrade to TensorFlow 2 and choose from the text embedding modules available for TF 2 (current list here), then you can use the module in your dataset pipeline. Something like this:

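import tensorflow as tf
import tensorflow_hub as hub

# Load the TF 2 text embedding module once, outside the map function.
embedder = hub.load("https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1")

def parse(sentences):
    # Wrap the scalar line in a list so the module receives a batch of one.
    embeddings = embedder([sentences])
    return embeddings

# Each line of text.txt becomes one dataset element.
dataset = tf.data.TextLineDataset("text.txt")
dataset = dataset.map(parse)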

If you are tied to 1.x or tied to ELMo (which I don't think is available in the new format yet), then the only option I can see for embedding in the preprocessing stage is a two-step one: first run your dataset through a simple embedding model and save the results, then use the saved vectors for the downstream task separately. (I appreciate this is less than ideal.)
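
A rough sketch of that two-step approach for TF 1.x might look like the following. It is only an illustration under some assumptions: the helper names make_elmo_embed_fn and precompute_embeddings, the out_dir argument, and the batch_XXXXX.npy file naming are all made up for this example, and the ELMo module is assumed to live at a local path such as the "./ELMO" used in the question. The key point is that hub.Module is built in an ordinary graph and run with a session, rather than inside a dataset.map function.

```python
import os

import numpy as np


def make_elmo_embed_fn(module_path):
    """Hypothetical helper: embed lists of sentences with a TF 1.x hub.Module.

    TensorFlow is imported lazily so the saving logic below has no TF dependency.
    """
    import tensorflow as tf  # assumption: TensorFlow 1.x (e.g. 1.14)
    import tensorflow_hub as hub

    graph = tf.Graph()
    with graph.as_default():
        # Build the module in a plain graph, not inside a dataset.map function.
        sentences_ph = tf.placeholder(tf.string, shape=[None])
        elmo = hub.Module(module_path)
        word_emb = elmo(sentences_ph, signature="default", as_dict=True)["word_emb"]
        init_op = tf.group(tf.global_variables_initializer(), tf.tables_initializer())

    session = tf.Session(graph=graph)
    session.run(init_op)

    def embed(batch):
        # Result has shape [len(batch), max_words_in_batch, embedding_size].
        return session.run(word_emb, feed_dict={sentences_ph: batch})

    return embed


def precompute_embeddings(sentences, embed_fn, batch_size, out_dir):
    """Hypothetical helper: embed sentences batch by batch, one .npy file per batch."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        embeddings = np.asarray(embed_fn(batch))
        path = os.path.join(out_dir, "batch_%05d.npy" % (start // batch_size))
        np.save(path, embeddings)
        paths.append(path)
    return paths
```

With TF 1.x installed, something like embed = make_elmo_embed_fn("./ELMO") followed by precompute_embeddings(sentences_array, embed, batch_size, "embeddings") would write the files, and the downstream model could then read each batch back with np.load. Saving one file per batch sidesteps the fact that max_words_in_batch differs from batch to batch; padding everything to a single array would be an alternative.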

Posted on by Stewart_R