Calculating the percentage of dataset words covered by a TensorFlow Hub model

I want to calculate the percentage of dataset words that are present in the vocabulary of a TensorFlow Hub model (such as ELMo or Universal Sentence Encoder). For local models like GloVe, I use a naive method: read the local model file, convert its words to a set, and then calculate the percentage like this:

f = open('../glove.6B.100d.txt', encoding="utf8")
# Read all the words into a list
...
# Words that occur both in the dataset and in the GloVe vocabulary
intersect_words = set(dataset_words).intersection(glove_words)
percentile = len(intersect_words) / len(dataset_words) * 100

Is there a way to do the same for TensorFlow Hub models?

Answers

For some models (such as USE and ELMo), the vocabulary is serialized inside the SavedModel protocol buffer, so you have to locate it within the SavedModel and extract it manually (I've used logic to extract the vocab from USE from here):

import tensorflow_hub as hub
from tensorflow.python.saved_model.loader_impl import parse_saved_model

# This caches the model at `model_path`.
hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
model_path = '/tmp/tfhub_modules/063d866c06683311b44b4992fd46003be952409c/'
saved_model = parse_saved_model(model_path)

# The location of the tensor holding the vocab is model-specific.
graph = saved_model.meta_graphs[0].graph_def
function_ = graph.library.function
embedding_node = function_[5].node_def[1]  # Node name is "Embedding_words".
words_tensor = embedding_node.attr.get("value").tensor
word_list = [s.decode('utf-8') for s in words_tensor.string_val]
word_list[100:105]  # ['best', ',▁but', 'no', 'any', 'more']
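
The hard-coded indices in `function_[5].node_def[1]` depend on how this particular model version was exported. A slightly more defensive sketch is to search the function library for the node by name. The helper name `find_vocab` is hypothetical, the node name "Embedding_words" is taken from the comment above and may differ for other models, and the code assumes the vocabulary sits in the node's "value" attribute as a string tensor:

```python
def find_vocab(graph_def, node_name="Embedding_words"):
    """Return the decoded strings stored in the first node called `node_name`.

    Hypothetical helper: assumes the node keeps its vocabulary in a "value"
    attribute that holds a string tensor, as in the USE example above.
    """
    for function in graph_def.library.function:
        for node in function.node_def:
            if node.name == node_name and "value" in node.attr:
                tensor = node.attr["value"].tensor
                return [s.decode("utf-8") for s in tensor.string_val]
    raise LookupError(f"no node named {node_name!r} in the function library")
```

With the `graph` object from above, `word_list = find_vocab(graph)` would replace the two index lookups. If the lookup fails, printing the node names in `graph.library.function` is a reasonable way to find where a given model stores its vocabulary.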

For other models, such as google/Wiki-words-500/2, we are luckier, since the vocabulary has been exported to the assets/ directory:

hub.load("https://tfhub.dev/google/Wiki-words-500/2")
!head /tmp/tfhub_modules/bf115a5fe517f019bebae05b433eaeee6415f5bf/assets/tokens.txt -n 40000 | tail
# Antisense
# Antiseptic
# Antiseptics
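
Once the vocabulary is available, either as a Python list or as a tokens.txt file, the coverage from the question can be computed the same way as for GloVe. This is a minimal sketch: `load_vocab` and `coverage_percent` are hypothetical helper names, the file is assumed to contain one token per line, and matching is exact and case-sensitive. Unlike the snippet in the question, it divides by the number of distinct dataset words, so repeated words are counted once.

```python
def load_vocab(path):
    """Read one token per line into a set (assumed file layout)."""
    with open(path, encoding="utf8") as f:
        return {line.rstrip("\n") for line in f if line.strip()}


def coverage_percent(dataset_words, vocab):
    """Percentage of distinct dataset words that appear in `vocab`."""
    unique_words = set(dataset_words)
    if not unique_words:
        return 0.0
    return len(unique_words & set(vocab)) / len(unique_words) * 100
```

For example, `coverage_percent(dataset_words, load_vocab(tokens_path))` would work for Wiki-words-500, where `tokens_path` points at the assets/tokens.txt file in the cache directory. For USE, entries such as ',▁but' are not plain words, so exact matching may undercount there.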
Posted by WGierke