Builds a graph based on dense embeddings and persists it in TSV format.
```python
nsl.tools.build_graph_from_config(
    embedding_files, output_graph_path, graph_builder_config
)
```
This function reads input instances from one or more TFRecord files, each
containing `tf.train.Example` protos. Each input example is expected to
contain at least the following 2 features:

- `id`: A singleton `bytes_list` feature that identifies each example.
- `embedding`: A `float_list` feature that contains the (dense) embedding of
  each example.

`id` and `embedding` are not necessarily the literal feature names; if your
features have different names, you can specify them using the
`graph_builder_config` fields named `id_feature_name` and
`embedding_feature_name`, respectively.
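As an illustrative sketch of the expected input (assuming TensorFlow is installed; the helper name `make_example` is ours, not part of the library):

```python
import tensorflow as tf

def make_example(example_id, embedding):
    """Builds a tf.train.Example with a singleton bytes_list 'id' feature
    and a float_list 'embedding' feature, as the graph builder expects."""
    return tf.train.Example(features=tf.train.Features(feature={
        'id': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[example_id.encode('utf-8')])),
        'embedding': tf.train.Feature(
            float_list=tf.train.FloatList(value=embedding)),
    }))

example = make_example('doc-0', [0.1, 0.2, 0.3])

# Serialized examples like this would be written to a TFRecord file, e.g.:
# with tf.io.TFRecordWriter('embeddings.tfr') as writer:
#     writer.write(example.SerializeToString())
```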
This function then uses the node embeddings to compute the edges of a graph.
Graph construction is configured by the `graph_builder_config` instance. The
general algorithm is to consider pairs of input examples (each with an ID and
an associated dense embedding, as described above), and to generate an edge in
the graph between those two examples if the cosine similarity between the two
embeddings is at least `graph_builder_config.similarity_threshold`.
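This brute-force comparison can be sketched as follows (pure Python; the helper names are ours, not part of the library):

```python
import itertools
import math

def cosine_similarity(u, v):
    """Cosine similarity between two dense embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def build_edges(instances, similarity_threshold):
    """Compares every pair of (id, embedding) instances; keeps an edge
    whenever the cosine similarity meets the threshold."""
    edges = []
    for (id_u, emb_u), (id_v, emb_v) in itertools.combinations(instances, 2):
        w = cosine_similarity(emb_u, emb_v)
        if w >= similarity_threshold:
            edges.append((id_u, id_v, w))
    return edges

instances = [('a', [1.0, 0.0]), ('b', [0.9, 0.1]), ('c', [0.0, 1.0])]
edges = build_edges(instances, similarity_threshold=0.9)
# Only 'a' and 'b' point in nearly the same direction, so only that
# pair produces an edge.
```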
Of course, the number of pairs of points is quadratic in the number of inputs, so comparing all pairs of points will take too long for large inputs. To address that problem, this implementation can be configured to use locality sensitive hashing (LSH) for better performance. The idea behind LSH is to randomly hash each input into one or more LSH "buckets" such that points in the same bucket are likely to be considered similar. In this way, we need to compare just the pairs of points within each bucket for similarity, which can lead to dramatically faster running times.
The `lsh_splits` configuration attribute is used to control the maximum number
of LSH buckets. In particular, if `lsh_splits` has the value `n`, then there
can be at most `2^n` LSH buckets. Using a larger value for `lsh_splits` will
(generally) result in a larger number of buckets, and therefore a smaller
number of instances in each bucket that need to be compared to each other.
As a result, increasing `lsh_splits` can lead to dramatically faster running
times.
The disadvantage of using too many LSH buckets, however, is that we won't
create a graph edge between two instances that are highly similar if they
happen to be randomly hashed into two different LSH buckets. To address
that problem, the `lsh_rounds` parameter can be used to perform multiple
rounds of the LSH bucketing process. Even if two similar instances get
hashed to different LSH buckets during the first round, they may get hashed
into the same LSH bucket on a subsequent round. An edge is created in the
output graph if two instances are hashed into the same bucket and deemed to
be similar enough on any of the LSH rounds (i.e., the resulting graph is the
union of the graph edges generated on each LSH round).
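The bucketing scheme can be sketched as a toy random-hyperplane LSH (pure Python; the library's actual hashing details may differ):

```python
import random

def lsh_bucket(embedding, hyperplanes):
    """Hashes an embedding to one of 2^n buckets using n random hyperplanes:
    bit i is 1 iff the embedding lies on the positive side of hyperplane i."""
    bucket = 0
    for h in hyperplanes:
        dot = sum(a * b for a, b in zip(embedding, h))
        bucket = (bucket << 1) | (1 if dot >= 0 else 0)
    return bucket

def candidate_pairs(instances, lsh_splits, lsh_rounds, rng):
    """Returns the union, over all rounds, of the pairs of instance IDs
    that land in the same bucket during that round."""
    dim = len(instances[0][1])
    pairs = set()
    for _ in range(lsh_rounds):
        hyperplanes = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(lsh_splits)]
        buckets = {}
        for example_id, emb in instances:
            buckets.setdefault(lsh_bucket(emb, hyperplanes), []).append(example_id)
        for members in buckets.values():
            for i in range(len(members)):
                for j in range(i + 1, len(members)):
                    pairs.add(tuple(sorted((members[i], members[j]))))
    return pairs

rng = random.Random(0)
# Degenerate but illustrative: 'a' and 'b' point in the same direction,
# so they always collide; 'c' points the opposite way, so it never does.
instances = [('a', [1.0, 0.0]), ('b', [2.0, 0.0]), ('c', [-1.0, 0.0])]
pairs = candidate_pairs(instances, lsh_splits=2, lsh_rounds=4, rng=rng)
```

Only the pairs returned here would then be checked against the similarity threshold, instead of all pairs.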
To illustrate these concepts and how various `lsh_splits` and `lsh_rounds`
values correlate with graph building running times, we performed multiple runs
of the graph builder on a dataset containing 50,000 instances, each with a
100-dimensional embedding. When `lsh_splits = 0`, the program has to compare
each instance against every other instance, for a total of roughly 2.5B
comparisons, which takes nearly half an hour to complete and generates a total
of 35,313 graph edges (when `similarity_threshold = 0.9`). As `lsh_splits` is
increased, we lose recall (i.e., fewer than 35,313 edges are generated), but
the recall can then be improved by increasing `lsh_rounds`. This table shows
the minimum `lsh_rounds` value required to achieve a recall of >= 99.7%
(except for the `lsh_splits = 1` case), as well as the elapsed running time:
| `lsh_splits` | `lsh_rounds` | Recall | Running time |
|---|---|---|---|
| 0 | N/A | 100.0% | 27m 46s |
| 1 | 2 | 99.4% | 24m 33s |
| 2 | 3 | 99.8% | 15m 35s |
| 3 | 4 | 99.7% | 9m 37.9s |
| 4 | 6 | 99.9% | 7m 07.5s |
| 5 | 8 | 99.9% | 4m 59.2s |
| 6 | 9 | 99.7% | 3m 01.2s |
| 7 | 11 | 99.8% | 2m 02.3s |
| 8 | 13 | 99.8% | 1m 20.8s |
| 9 | 16 | 99.7% | 58.5s |
| 10 | 18 | 99.7% | 43.6s |
As the table illustrates, by increasing both `lsh_splits` and `lsh_rounds`,
we can dramatically decrease the running time of the graph builder without
sacrificing edge recall. We have found that a good rule of thumb is to set
`lsh_splits >= ceiling(log_2(num_instances / 1000))`, so the expected LSH
bucket size will be at most 1000. However, if your instances are clustered or
you want an even faster run, you may want to use a larger `lsh_splits` value.
Note, however, that when the similarity threshold is lower, recall rates
degrade more quickly as `lsh_splits` grows, so be careful not to set that
parameter too high for smaller `similarity_threshold` values.
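The rule of thumb above can be computed directly (pure Python; `recommended_lsh_splits` is our name for the calculation, not a library function):

```python
import math

def recommended_lsh_splits(num_instances, target_bucket_size=1000):
    """lsh_splits >= ceil(log2(num_instances / target_bucket_size)), so that
    the 2^lsh_splits buckets hold at most ~target_bucket_size instances each
    in expectation."""
    if num_instances <= target_bucket_size:
        return 0
    return math.ceil(math.log2(num_instances / target_bucket_size))

splits = recommended_lsh_splits(50_000)  # 6 for the 50,000-instance example
```

Note that for the 50,000-instance experiment above, this rule suggests `lsh_splits = 6`, which the table shows already brings the running time down to about 3 minutes.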
The generated graph edges are written to the TSV file named by
`output_graph_path`. Each output edge is represented by a TSV line with the
following form:

`source_id<TAB>target_id<TAB>edge_weight`

All edges in the output will be symmetric (i.e., if edge `A--w-->B` exists in
the output, then so will edge `B--w-->A`).
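A small sketch of reading such an output file and checking the symmetry property (pure Python; the file contents here are illustrative, not real output):

```python
import csv
import io

# An illustrative output file: every edge appears in both directions.
tsv_output = "a\tb\t0.95\nb\ta\t0.95\nb\tc\t0.91\nc\tb\t0.91\n"

def read_edges(tsv_file):
    """Parses TSV edges of the form source_id<TAB>target_id<TAB>edge_weight."""
    edges = {}
    for source_id, target_id, edge_weight in csv.reader(tsv_file, delimiter='\t'):
        edges[(source_id, target_id)] = float(edge_weight)
    return edges

edges = read_edges(io.StringIO(tsv_output))
# Symmetric means the reversed edge exists with the same weight.
is_symmetric = all(edges.get((v, u)) == w for (u, v), w in edges.items())
```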
| Args | |
|---|---|
| `embedding_files` | A list of names of TFRecord files containing `tf.train.Example` objects, which in turn contain dense embeddings. |
| `output_graph_path` | Name of the file to which the output graph in TSV format should be written. |
| `graph_builder_config` | A `nsl.configs.GraphBuilderConfig` specifying the graph building parameters. |
| Raises | |
|---|---|
| `ValueError` | If `lsh_splits < 0`, or if `lsh_splits > 0` and `lsh_rounds < 1`. |