Update README.md

Anirudh NJ 2023-04-03 11:09:15 +02:00 committed by GitHub
parent 78c3235eee
commit 31a6d5125b
GPG Key ID: 4AEE18F83AFDEB23


@@ -1,19 +1,24 @@
Twhin in torchrec
# Twhin in torchrec
This project contains code for pretraining dense vector embedding features for Twitter entities. Within Twitter, these embeddings are used for candidate retrieval and as model features in a variety of recommender system models.
This project contains code for pretraining dense vector embedding features for Twitter entities.
Within Twitter, these embeddings are used for candidate retrieval and as model features in a variety of recommender system models.
We obtain entity embeddings based on a variety of graph data within Twitter such as:
"User follows User"
"User favorites Tweet"
"User clicks Advertisement"
* "User follows User"
* "User favorites Tweet"
* "User clicks Advertisement"
While we cannot release the graph data used to train TwHIN embeddings due to privacy restrictions, heavily subsampled, anonymized open-sourced graph data can be used:
https://huggingface.co/datasets/Twitter/TwitterFollowGraph
https://huggingface.co/datasets/Twitter/TwitterFaveGraph
* https://huggingface.co/datasets/Twitter/TwitterFollowGraph
* https://huggingface.co/datasets/Twitter/TwitterFaveGraph
The code expects parquet files with three columns: lhs, rel, rhs that refer to the vocab index of the left-hand-side node, relation type, and right-hand-side node of each edge in a graph respectively.
The code expects parquet files with three columns:
* lhs
* rel
* rhs
that refer to the vocab index of the left-hand-side node, relation type, and right-hand-side node of each edge in a graph respectively.
The location of the data must be specified in the configuration yaml files in projects/twhin/configs.
The location of the data must be specified in the configuration yaml files in `projects/twhin/configs`.
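As a rough illustration of the three-column edge format described above, the sketch below turns a few named edges into `lhs`, `rel`, `rhs` index columns. The vocabularies, the example edges, and the helper names `build_edge_table` and `write_parquet` are hypothetical and not part of this project; the sketch also assumes a single node vocabulary shared by all entity types, which the README does not state.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical vocabularies: each node and relation name maps to an integer index.
# Assumption: one node vocabulary is shared by all entity types.
NODE_VOCAB: Dict[str, int] = {"user_a": 0, "user_b": 1, "tweet_x": 2}
REL_VOCAB: Dict[str, int] = {"follows": 0, "favorites": 1}

# Column names taken from the README: left-hand-side node, relation type, right-hand-side node.
COLUMNS = ("lhs", "rel", "rhs")


def build_edge_table(
    edges: Iterable[Tuple[str, str, str]],
    node_vocab: Dict[str, int],
    rel_vocab: Dict[str, int],
) -> Dict[str, List[int]]:
    """Convert (lhs_name, rel_name, rhs_name) edges into index columns."""
    table: Dict[str, List[int]] = {name: [] for name in COLUMNS}
    for lhs_name, rel_name, rhs_name in edges:
        table["lhs"].append(node_vocab[lhs_name])
        table["rel"].append(rel_vocab[rel_name])
        table["rhs"].append(node_vocab[rhs_name])
    return table


def write_parquet(table: Dict[str, List[int]], path: str) -> None:
    """Write the columns to a parquet file (needs pandas plus a parquet engine such as pyarrow)."""
    import pandas as pd  # imported here so the rest of the sketch runs without pandas

    pd.DataFrame(table, columns=list(COLUMNS)).to_parquet(path, index=False)


# Two example edges: "User follows User" and "User favorites Tweet".
EDGES = [
    ("user_a", "follows", "user_b"),
    ("user_a", "favorites", "tweet_x"),
]
EDGE_TABLE = build_edge_table(EDGES, NODE_VOCAB, REL_VOCAB)
```

Calling `write_parquet(EDGE_TABLE, "edges.parquet")` would then produce a file in this layout; the output path is a placeholder, and the real data location is whatever the configuration yaml files in `projects/twhin/configs` point to.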
Workflow