mirror of
https://github.com/twitter/the-algorithm-ml.git
synced 2024-12-23 14:51:49 +01:00
TwHIN in TorchRec
This project contains code for pretraining dense vector embedding features for Twitter entities. Within Twitter, these embeddings are used for candidate retrieval and as model features in a variety of recommender system models.
We obtain entity embeddings based on a variety of graph data within Twitter such as:
- "User follows User"
- "User favorites Tweet"
- "User clicks Advertisement"
While we cannot release the graph data used to train TwHIN embeddings due to privacy restrictions, heavily subsampled, anonymized, open-sourced graph data can be used instead:
- https://huggingface.co/datasets/Twitter/TwitterFollowGraph
- https://huggingface.co/datasets/Twitter/TwitterFaveGraph
The code expects parquet files with three columns:
- lhs
- rel
- rhs
These refer to the vocab index of the left-hand-side node, the relation type, and the right-hand-side node of each edge in the graph, respectively.
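As an illustration of that three-column layout, here is a minimal sketch that turns string-labeled edges into integer `lhs`, `rel`, `rhs` columns and writes them to a parquet file. The helper names `encode_edges` and `write_parquet`, the example labels, and the choice of one shared node vocabulary with indices assigned in order of first appearance are assumptions made for illustration; they are not taken from this project, so check the project's config and data code for the vocabulary layout it actually expects.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (left-hand-side node, relation, right-hand-side node)


def encode_edges(
    triples: Iterable[Triple],
) -> Tuple[Dict[str, List[int]], Dict[str, int], Dict[str, int]]:
    """Map string triples to integer lhs/rel/rhs columns.

    Assumption: all nodes share one vocabulary, relations have their own,
    and indices are assigned in order of first appearance.
    """
    node_vocab: Dict[str, int] = {}
    rel_vocab: Dict[str, int] = {}
    columns: Dict[str, List[int]] = {"lhs": [], "rel": [], "rhs": []}
    for lhs, rel, rhs in triples:
        # setdefault returns the existing index or assigns the next free one.
        columns["lhs"].append(node_vocab.setdefault(lhs, len(node_vocab)))
        columns["rel"].append(rel_vocab.setdefault(rel, len(rel_vocab)))
        columns["rhs"].append(node_vocab.setdefault(rhs, len(node_vocab)))
    return columns, node_vocab, rel_vocab


def write_parquet(columns: Dict[str, List[int]], path: str) -> None:
    """Write the columns to a parquet file (hypothetical helper).

    Requires pandas plus a parquet engine such as pyarrow.
    """
    import pandas as pd

    pd.DataFrame(columns).to_parquet(path, index=False)


# Toy edges mirroring the relation types listed above.
example_triples = [
    ("user_a", "follows", "user_b"),
    ("user_a", "favorites", "tweet_1"),
    ("user_b", "follows", "user_a"),
]
columns, node_vocab, rel_vocab = encode_edges(example_triples)
```

Calling `write_parquet(columns, "edges.parquet")` would then produce a file with one row per edge; the output path is a placeholder.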
The location of the data must be specified in the configuration yaml files in projects/twhin/configs.
Workflow
- Build local development images with ./scripts/build_images.sh
- Run with ./scripts/docker_run.sh
- Iterate in image with ./scripts/idocker.sh
- Run tests with ./scripts/docker_test.sh