TwHIN in torchrec
This project contains code for pretraining dense vector embedding features for Twitter entities. Within Twitter, these embeddings are used for candidate retrieval and as model features in a variety of recommender system models.
We obtain entity embeddings from a variety of graph data within Twitter, such as:
- "User follows User"
- "User favorites Tweet"
- "User clicks Advertisement"
While we cannot release the graph data used to train TwHIN embeddings due to privacy restrictions, heavily subsampled, anonymized, open-sourced graph data can be used instead:
- https://huggingface.co/datasets/Twitter/TwitterFollowGraph
- https://huggingface.co/datasets/Twitter/TwitterFaveGraph
The code expects parquet files with three columns, `lhs`, `rel`, and `rhs`, which hold the vocab index of the left-hand-side node, the relation type, and the vocab index of the right-hand-side node of each edge in the graph, respectively.
The location of the data must be specified in the configuration yaml files in `projects/twhin/configs`.
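
As a rough illustration of the `lhs`, `rel`, `rhs` layout described above, the following sketch encodes a few toy edges into integer indices. The entity names, the helper functions, and the choice of a single shared node vocabulary are assumptions made for this example and are not taken from the project code. The optional `write_parquet` helper assumes pandas and a parquet engine such as pyarrow are installed.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical toy edges: (left-hand-side entity, relation name, right-hand-side entity).
RAW_EDGES: List[Tuple[str, str, str]] = [
    ("user_a", "follows", "user_b"),
    ("user_b", "follows", "user_a"),
    ("user_a", "favorites", "tweet_1"),
]


def build_vocab(items: Iterable[str]) -> Dict[str, int]:
    """Assign each distinct item an integer index in first-seen order."""
    vocab: Dict[str, int] = {}
    for item in items:
        if item not in vocab:
            vocab[item] = len(vocab)
    return vocab


def encode_edges(raw_edges: List[Tuple[str, str, str]]):
    """Turn named edges into rows with integer lhs, rel, rhs columns.

    Assumption: all nodes share one vocabulary. The real pipeline may
    index nodes differently.
    """
    node_vocab = build_vocab(n for lhs, _, rhs in raw_edges for n in (lhs, rhs))
    rel_vocab = build_vocab(rel for _, rel, _ in raw_edges)
    rows = [
        {"lhs": node_vocab[lhs], "rel": rel_vocab[rel], "rhs": node_vocab[rhs]}
        for lhs, rel, rhs in raw_edges
    ]
    return rows, node_vocab, rel_vocab


def write_parquet(rows: List[Dict[str, int]], path: str) -> None:
    """Write the rows to a parquet file (needs pandas and a parquet engine)."""
    import pandas as pd

    pd.DataFrame(rows, columns=["lhs", "rel", "rhs"]).to_parquet(path, index=False)


rows, node_vocab, rel_vocab = encode_edges(RAW_EDGES)
```

Calling `write_parquet(rows, "edges.parquet")` would then produce a file with the three expected columns; the path shown here is only a placeholder.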
Workflow
========
- Build local development images with `./scripts/build_images.sh`
- Run with `./scripts/docker_run.sh`
- Iterate inside the image with `./scripts/idocker.sh`
- Run tests with `./scripts/docker_test.sh`