Large-Scale Embedding Solution: OneEmbedding¶
Embedding is an important component of recommender systems, and it has also spread to many fields beyond them. Each framework provides basic operators for Embedding, for example, flow.nn.Embedding in OneFlow:
    import numpy as np
    import oneflow as flow

    indices = flow.tensor([[1, 2, 4, 5], [4, 3, 2, 9]], dtype=flow.int)
    embedding = flow.nn.Embedding(10, 3)
    y = embedding(indices)
OneEmbedding is the large-scale Embedding solution that OneFlow provides for large-scale deep recommender systems. Compared to ordinary operators, OneEmbedding has the following advantages:
- With flexible hierarchical storage, OneEmbedding can place the Embedding table in GPU memory, CPU memory, or on SSD, and lets high-speed devices act as caches for low-speed devices to achieve both speed and capacity.
- OneEmbedding supports dynamic expansion.
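The idea of a fast device acting as a cache for a slower one can be illustrated with a toy two-tier store. This is only a conceptual sketch in plain Python, not how OneEmbedding is implemented: the class name CachedStore, the LRU eviction policy, and the row-count budget are all assumptions made for illustration.

```python
from collections import OrderedDict


class CachedStore:
    """Toy two-tier store: a small LRU cache in front of a large backing store.

    Hypothetical illustration only; not OneEmbedding's actual implementation.
    """

    def __init__(self, cache_rows):
        self.cache_rows = cache_rows   # how many rows the fast tier may hold
        self.cache = OrderedDict()     # stands in for GPU memory
        self.backing = {}              # stands in for CPU memory or SSD
        self.hits = 0
        self.misses = 0

    def put(self, key, row):
        self.backing.pop(key, None)    # a row lives in exactly one tier
        self.cache[key] = row
        self.cache.move_to_end(key)
        if len(self.cache) > self.cache_rows:
            # Evict the least recently used row and write it back.
            old_key, old_row = self.cache.popitem(last=False)
            self.backing[old_key] = old_row

    def get(self, key):
        if key in self.cache:
            self.hits += 1
            self.cache.move_to_end(key)
            return self.cache[key]
        self.misses += 1
        row = self.backing.pop(key)    # raises KeyError if never stored
        self.put(key, row)             # promote the row into the fast tier
        return row


store = CachedStore(cache_rows=2)
for k in (1, 2, 3):
    store.put(k, [float(k)] * 3)       # inserting 3 evicts row 1 to the slow tier
store.get(3)                           # served from the cache
store.get(1)                           # fetched from the slow tier, evicting row 2
```

Rows that are looked up often stay in the small fast tier, while the full table only has to fit in the large slow tier.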
Get Started with OneEmbedding Quickly¶
The following steps show how to get started with OneEmbedding quickly:
- Configure the Embedding table with make_table_options
- Configure the storage attribute of the Embedding table
- Instantiate Embedding
- Construct Graph for training
Configure Embedding Table with make_table_options¶
After importing the relevant packages, you can configure the Embedding table with make_table_options. OneEmbedding supports creating multiple Embedding tables at the same time. The following code configures three Embedding tables.
    import oneflow as flow
    import oneflow.nn as nn
    import numpy as np

    tables = [
        flow.one_embedding.make_table_options(
            flow.one_embedding.make_uniform_initializer(low=-0.1, high=0.1)
        ),
        flow.one_embedding.make_table_options(
            flow.one_embedding.make_uniform_initializer(low=-0.05, high=0.05)
        ),
        flow.one_embedding.make_table_options(
            flow.one_embedding.make_uniform_initializer(low=-0.15, high=0.15)
        ),
    ]
When configuring the Embedding table, you need to specify the initialization method. The Embedding tables above are all initialized with the uniform method. The result of configuring the Embedding tables is stored in the tables variable.
See the make_table_options and make_uniform_initializer API documentation for more details.
Configure the Storage Attribute of the Embedding Table¶
Then run the following code to configure the storage attribute of the Embedding table:
    store_options = flow.one_embedding.make_cached_ssd_store_options(
        cache_budget_mb=8142,
        persistent_path="/your_path_to_ssd",
        capacity=40000000,
        size_factor=1,
        physical_block_size=512,
    )
With make_cached_ssd_store_options, as used here, the Embedding table is stored on SSD and GPU memory is used as a cache. For the meaning of each parameter, please refer to the make_cached_ssd_store_options API documentation.
In addition, you can use pure GPU memory as storage, or store the Embedding table in CPU memory and use GPU memory as a cache. For more details, please refer to make_device_mem_store_options and make_cached_host_mem_store_options.
Instantiate Embedding¶
After the above configuration is completed, you can use MultiTableEmbedding to get the instantiated Embedding layer.
    embedding_size = 128
    embedding = flow.one_embedding.MultiTableEmbedding(
        name="my_embedding",
        embedding_dim=embedding_size,
        dtype=flow.float,
        key_type=flow.int64,
        tables=tables,
        store_options=store_options,
    )
    embedding.to("cuda")
- tables is the Embedding table attribute previously configured with make_table_options,
- store_options is the previously configured storage attribute,
- embedding_dim is the feature dimension,
- dtype is the data type of the feature vector,
- key_type is the data type of the feature ID.
If two OneEmbeddings are created at the same time, each must be given a different name and a different persistent path during instantiation. For more details, please refer to one_embedding.MultiTableEmbedding.
Construct Graph for Training¶
Currently, OneEmbedding is only supported in Graph mode.
In the following example, we construct a simple Graph class that includes the Embedding lookup (embedding_lookup) and an MLP (mlp).
    num_tables = 3
    mlp = flow.nn.FusedMLP(
        in_features=embedding_size * num_tables,
        hidden_features=[512, 256, 128],
        out_features=1,
        skip_final_activation=True,
    )
    mlp.to("cuda")

    class TrainGraph(flow.nn.Graph):
        def __init__(self):
            super().__init__()
            self.embedding_lookup = embedding
            self.mlp = mlp
            self.add_optimizer(
                flow.optim.SGD(self.embedding_lookup.parameters(), lr=0.1, momentum=0.0)
            )
            self.add_optimizer(
                flow.optim.SGD(self.mlp.parameters(), lr=0.1, momentum=0.0)
            )

        def build(self, ids):
            embedding = self.embedding_lookup(ids)
            loss = self.mlp(flow.reshape(embedding, (-1, num_tables * embedding_size)))
            loss = loss.sum()
            loss.backward()
            return loss
Then you can instantiate the Graph and start training.
    ids = np.random.randint(0, 1000, (100, num_tables), dtype=np.int64)
    ids_tensor = flow.tensor(ids, requires_grad=False).to("cuda")
    graph = TrainGraph()
    loss = graph(ids_tensor)
    print(loss)
For detailed information on using Graph, please refer to the static graph module nn.Graph documentation.
The Features of OneEmbedding¶
Feature ID and Dynamic Insertion¶
OneEmbedding supports dynamic insertion of new feature IDs. As long as the storage medium has sufficient capacity, there is no upper limit on the number of feature IDs. This is why, when you use make_table_options, you only need to specify the initialization method and not the total number of feature IDs (the number of rows in the Embedding table).
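To make the idea concrete, here is a minimal plain-Python sketch of a table that grows on demand: a row is created with a uniform initializer the first time its feature ID is looked up. The class DynamicTable and its methods are hypothetical and only mimic the behavior described above; they are not part of the OneEmbedding API.

```python
import random


class DynamicTable:
    """Toy embedding table whose rows are created on first lookup.

    Hypothetical illustration; no row count is declared up front.
    """

    def __init__(self, embedding_dim, low, high, seed=0):
        self.embedding_dim = embedding_dim
        self.low = low
        self.high = high
        self.rows = {}                      # feature ID -> embedding vector
        self.rng = random.Random(seed)

    def lookup(self, feature_id):
        if feature_id not in self.rows:
            # Unseen ID: insert a new row drawn from uniform(low, high).
            self.rows[feature_id] = [
                self.rng.uniform(self.low, self.high)
                for _ in range(self.embedding_dim)
            ]
        return self.rows[feature_id]


table = DynamicTable(embedding_dim=4, low=-0.1, high=0.1)
first = table.lookup(123456789012)    # inserted on first use
second = table.lookup(123456789012)   # same row is returned afterwards
```

Only the initializer and the vector dimension are fixed in advance; the set of feature IDs is discovered while the data is processed.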
Feature ID and Multi-Table Query¶
Feature IDs cannot be repeated
OneEmbedding users who build datasets need to pay special attention: when MultiTableEmbedding is used to create multiple tables at the same time, the tables differ only in their initialization parameters and share all other parameters. In this case, a feature ID must not appear in more than one table.
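When preparing a dataset, a quick check for IDs that appear in more than one table can catch this problem early. The sketch below is plain Python and assumes the batch is laid out with one column per table; the helper name find_shared_ids is hypothetical and not part of OneEmbedding.

```python
from collections import defaultdict


def find_shared_ids(ids):
    """Return {feature_id: [table indices]} for IDs used by more than one table.

    `ids` is a list of rows; column i holds the feature ID for table i.
    """
    tables_by_id = defaultdict(set)
    for row in ids:
        for table_idx, feature_id in enumerate(row):
            tables_by_id[feature_id].add(table_idx)
    return {
        fid: sorted(tables)
        for fid, tables in tables_by_id.items()
        if len(tables) > 1
    }


clean = [[488, 333, 220], [18, 568, 508]]
clashing = [[488, 333, 220], [18, 488, 508]]   # 488 appears in tables 0 and 1

clean_report = find_shared_ids(clean)          # {}
clash_report = find_shared_ids(clashing)       # {488: [0, 1]}
```

An empty report means every feature ID belongs to exactly one table in the checked data.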
If you use MultiTableEmbedding to configure only one table, the query method is no different from a normal Embedding query. You can call the layer directly and pass the feature IDs, such as embedding_lookup(ids).
If you use MultiTableEmbedding to configure more than one table, you need to specify which table each feature ID is queried in, using one of the following two methods:
Method 1: Pass an ids of shape (batch_size, number of Embedding tables). Each column of ids then corresponds to one Embedding table, in order.
    ids = np.array([[488, 333, 220], [18, 568, 508]], dtype=np.int64)
    # Query `[488, 18]` in the zeroth table, `[333, 568]` in the first table,
    # and `[220, 508]` in the second table for their feature vectors.
    embedding_lookup(ids)
Method 2: Along with the ids parameter, pass a table_ids parameter that has exactly the same shape as ids and specifies, for each feature ID, the ordinal number of the table to query.
    ids = np.array([488, 333, 220, 18, 568, 508], dtype=np.int64)
    # table_ids has the exact same shape as `ids`
    table_ids = np.array([0, 1, 2, 0, 1, 2])
    # Query `488, 18` in the zeroth table, `333, 568` in the first table,
    # and `220, 508` in the second table for their feature vectors.
    embedding_lookup(ids, table_ids)
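The two methods describe the same query in different layouts. As a small plain-Python sketch (the helper to_flat_query is hypothetical, not part of OneEmbedding), the Method 1 input can be converted into the Method 2 pair of ids and table_ids like this:

```python
def to_flat_query(ids_2d):
    """Convert a (batch_size, num_tables) ID layout into flat ids plus table_ids."""
    flat_ids = []
    table_ids = []
    for row in ids_2d:
        for table_idx, feature_id in enumerate(row):
            flat_ids.append(feature_id)
            table_ids.append(table_idx)   # the column index is the table number
    return flat_ids, table_ids


flat_ids, table_ids = to_flat_query([[488, 333, 220], [18, 568, 508]])
# flat_ids  -> [488, 333, 220, 18, 568, 508]
# table_ids -> [0, 1, 2, 0, 1, 2]
```

Method 2 is the more general form, since it does not require every sample to carry exactly one ID per table.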
How to Choose the Proper Storage Configuration¶
OneEmbedding provides three storage configurations: pure GPU storage, CPU memory storage with GPU memory as cache, and SSD storage with GPU memory as cache.
Pure GPU storage
When the Embedding table is smaller than the GPU memory, placing the whole table in GPU memory is the fastest option. In this case, it is recommended to select the pure GPU storage configuration.
Use CPU memory to store and GPU memory as cache
When the size of Embedding table is larger than the GPU memory, but smaller than the CPU memory, it is recommended to store the Embedding table in the CPU memory and use the GPU memory as cache.
Use SSD to store and GPU memory as cache
When the Embedding table is larger than both the GPU memory and the system memory, and you have a high-speed SSD, you can store the Embedding table on the SSD and use the GPU memory as a cache. In this case, the stored vocabulary is read and written frequently during training, so the random read and write speed of files under the path set by persistent_path has a great impact on overall performance. It is strongly recommended to use a high-performance SSD. A normal disk will have a large negative impact on overall performance.
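The selection rule above can be written down as a small decision helper. This is a rough plain-Python sketch under stated assumptions: the table size is estimated as number of IDs times embedding dimension times 4 bytes (float32 values only, ignoring optimizer state and storage overhead), and the function names and the example memory sizes are hypothetical.

```python
def table_size_mb(num_ids, embedding_dim, bytes_per_value=4):
    """Rough size of the embedding values alone, in MB (assumes float32)."""
    return num_ids * embedding_dim * bytes_per_value / (1024 * 1024)


def choose_store(table_mb, gpu_mb, host_mb):
    """Pick a storage configuration following the guidance in this section."""
    if table_mb < gpu_mb:
        return "pure GPU"
    if table_mb < host_mb:
        return "CPU memory + GPU cache"
    return "SSD + GPU cache"


# 40,000,000 IDs with 128-dimensional float32 vectors.
size_mb = table_size_mb(40_000_000, 128)

# Hypothetical machine: 16 GB of GPU memory and 64 GB of CPU memory.
choice = choose_store(size_mb, gpu_mb=16 * 1024, host_mb=64 * 1024)
```

In practice the real footprint is larger than this estimate, so it is safer to leave headroom when the result is close to a boundary.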
Similar to other modules of OneFlow, OneEmbedding supports distributed expansion natively. Users can refer to the README of the DLRM example (see Extended Reading: DLRM) to start DLRM distributed training. You can also refer to Global Tensor for the necessary prerequisites.
When using the OneEmbedding module for distributed expansion, please note the following:
- Currently, OneEmbedding only supports placement on all devices, and the parallelism must be the same as the world size. For example, when training with 4 cards in parallel, the parallelism of the Embedding table must be 4. It is not supported when the network is trained with 4 cards but the Embedding table parallelism is 2.
- The persistent_path parameter in the store_options configuration specifies the storage path. In parallel scenarios, it can be either a string representing a path or a list. If it is a string, it is the root directory for every rank in distributed parallelism: OneFlow creates a storage path for each rank under this root, named according to the rank's number. If it is a list, each item in the list configures the path of one rank individually.
- In parallel scenarios, the capacity parameter in the store_options configuration represents the total capacity of the Embedding table, not the capacity of each rank. cache_budget_mb represents the GPU memory budget of each GPU device.
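Because capacity is a total while cache_budget_mb is per device, the aggregate cache grows with the number of GPUs. The following plain-Python sketch estimates what fraction of the table the combined caches could hold. It is only back-of-the-envelope arithmetic: it assumes float32 values, ignores optimizer state and cache overhead, and the function name cache_coverage is hypothetical.

```python
def cache_coverage(capacity, embedding_dim, cache_budget_mb, world_size,
                   bytes_per_value=4):
    """Estimate the fraction of the whole table that fits in all GPU caches.

    capacity        -- total number of rows across all ranks
    cache_budget_mb -- cache size of a single GPU, in MB
    world_size      -- number of GPUs taking part in training
    """
    table_mb = capacity * embedding_dim * bytes_per_value / (1024 * 1024)
    total_cache_mb = cache_budget_mb * world_size
    return min(1.0, total_cache_mb / table_mb)


# Same numbers as the store_options example, on 1 GPU and on 4 GPUs.
one_gpu = cache_coverage(40_000_000, 128, cache_budget_mb=8142, world_size=1)
four_gpus = cache_coverage(40_000_000, 128, cache_budget_mb=8142, world_size=4)
```

With these numbers a single GPU cache covers well under half of the table, while four GPUs together could hold all of it, which is why the per-device budget should be chosen with the world size in mind.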
Extended Reading: DLRM¶
This article shows how to get started with OneEmbedding quickly.
Practical examples of OneEmbedding in DLRM tasks are available in the OneFlow model repository. Please refer to https://github.com/Oneflow-Inc/models/tree/main/RecommenderSystems/dlrm