TwHIN in torchrec
=================

This project contains code for pretraining dense vector embedding features for Twitter entities. Within Twitter, these embeddings are used for candidate retrieval and as model features in a variety of recommender system models.

We obtain entity embeddings based on a variety of graph data within Twitter, such as:
- "User follows User"
- "User favorites Tweet"
- "User clicks Advertisement"

While we cannot release the graph data used to train TwHIN embeddings due to privacy restrictions, the following heavily subsampled, anonymized, open-sourced graph data can be used instead:
- https://huggingface.co/datasets/Twitter/TwitterFollowGraph
- https://huggingface.co/datasets/Twitter/TwitterFaveGraph

The code expects Parquet files with three columns, `lhs`, `rel`, and `rhs`, which refer to the vocab index of the left-hand-side node, relation type, and right-hand-side node of each edge in the graph, respectively.
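
The three-column edge layout described above can be sketched as follows. This is a minimal illustration, not the project's own tooling: the `RELATIONS` mapping, both helper functions, the toy edges, and the output file name are all hypothetical, and it assumes that `rel` is stored as an integer id and that default 64-bit integer columns are acceptable to the loader. Writing the file requires `pyarrow`, which is imported lazily so the column-building step runs without it.

```python
from typing import Dict, List, Tuple

# Hypothetical relation-type ids; the real ids depend on your own training config.
RELATIONS: Dict[str, int] = {"follows": 0, "favorites": 1}


def build_edge_columns(edges: List[Tuple[int, str, int]]) -> Dict[str, List[int]]:
    """Turn (lhs index, relation name, rhs index) triples into lhs/rel/rhs columns."""
    columns: Dict[str, List[int]] = {"lhs": [], "rel": [], "rhs": []}
    for lhs, relation, rhs in edges:
        columns["lhs"].append(lhs)                  # vocab index of the left-hand-side node
        columns["rel"].append(RELATIONS[relation])  # assumed integer relation id
        columns["rhs"].append(rhs)                  # vocab index of the right-hand-side node
    return columns


def write_edges_parquet(columns: Dict[str, List[int]], path: str) -> None:
    """Write the three columns to a Parquet file (needs pyarrow installed)."""
    # Imported here so build_edge_columns has no third-party dependency.
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.table(columns), path)


# Toy graph: user 0 follows user 1, user 0 favorites tweet 7, user 2 follows user 0.
toy_edges = [(0, "follows", 1), (0, "favorites", 7), (2, "follows", 0)]
columns = build_edge_columns(toy_edges)

# To produce a file, call (with pyarrow available):
# write_edges_parquet(columns, "toy_edges.parquet")
```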

The location of the data must be specified in the configuration YAML files in `projects/twhin/configs`.


Workflow
========
- Build local development images with `./scripts/build_images.sh`
- Run with `./scripts/docker_run.sh`
- Iterate in the image with `./scripts/idocker.sh`
- Run tests with `./scripts/docker_test.sh`