mirror of
https://github.com/twitter/the-algorithm.git
synced 2024-06-01 08:48:46 +02:00
fix grammar
parent ec83d01dca
commit 976d4cc3a4
@@ -1,7 +1,7 @@
 # Earlybird Light Ranker
 
 *Note: the light ranker is an old part of the stack which we are currently in the process of replacing.
-The current model was last trained several years ago, and uses some very strange features.
+The current model was last trained several years ago and uses some very strange features.
 We are working on training a new model, and eventually rebuilding this part of the stack entirely.*
 
 The Earlybird light ranker is a logistic regression model which predicts the likelihood that the user will engage with a
@@ -10,15 +10,15 @@ It is intended to be a simplified version of the heavy ranker which can run on a
 
 There are currently 2 main light ranker models in use: one for ranking in network tweets (`recap_earlybird`), and
 another for
-out of network (UTEG) tweets (`rectweet_earlybird`). Both models are trained using the `train.py` script which is
+out-of-network (UTEG) tweets (`rectweet_earlybird`). Both models are trained using the `train.py` script which is
 included in this directory. They differ mainly in the set of features
 used by the model.
-The in network model uses
+The in-network model uses
 the `src/python/twitter/deepbird/projects/timelines/configs/recap/feature_config.py` file to define the
 feature configuration, while the
-out of network model uses `src/python/twitter/deepbird/projects/timelines/configs/rectweet_earlybird/feature_config.py`.
+out-of-network model uses `src/python/twitter/deepbird/projects/timelines/configs/rectweet_earlybird/feature_config.py`.
 
-The `train.py` script is essentially a series of hooks provided to for Twitter's `twml` framework to execute,
+The `train.py` script is essentially a series of hooks provided for Twitter's `twml` framework to execute,
 which is included under `twml/`.
 
 ### Features
@@ -29,25 +29,25 @@ The light ranker features pipeline is as follows:
 Some of these components are explained below:
 
 - Index Ingester: an indexing pipeline that handles the tweets as they are generated. This is the main input of
-Earlybird, it produces Tweet Data (the basic information about the tweet, the text, the urls, media entities, facets,
-etc) and Static Features (the features you can compute directly from a tweet right now, like whether it has URL, has
-Cards, has quotes, etc); All information computed here are stored in index and flushed as each realtime index segments
-become full. They are loaded back later from disk when Earlybird restarts. Note that the features may be computed in a
+Earlybird produces Tweet Data (the basic information about the tweet, the text, the URLs, media entities, facets,
+etc) and Static Features (the features you can compute directly from a tweet right now, like whether it has a URL, has
+Cards, has quotes, etc); All information computed here is stored in an index and flushed as each real-time index segments
+become full. They are loaded back later from the disk when Earlybird restarts. Note that the features may be computed in a
 non-trivial way (like deciding the value of hasUrl), they could be computed and combined from some more "raw"
 information in the tweet and from other services.
 Signal Ingester: the ingester for Realtime Features, per-tweet features that can change after the tweet has been
 indexed, mostly social engagements like retweetCount, favCount, replyCount, etc, along with some (future) spam signals
-that's computed with later activities. These were collected and computed in a Heron topology by processing multiple
+that are computed with later activities. These were collected and computed in a Heron topology by processing multiple
 event streams and can be extended to support more features.
 - User Table Features is another set of features per user. They are from User Table Updater, a different input that
-processes a stream written by our user service. It's used to store sparse realtime user
+processes a stream written by our user service. It's used to store sparse real-time user
 information. These per-user features are propagated to the tweet being scored by
 looking up the author of the tweet.
-- Search Context Features are basically the information of current searcher, like their UI language, their own
+- Search Context Features are the information of current searcher, like their UI language, their own
 produced/consumed language, and the current time (implied). They are combined with Tweet Data to compute some of the
 features used in scoring.
 
-The scoring function in Earlybird uses both static and realtime features. Examples of static features used are:
+The scoring function in Earlybird uses both static and real-time features. Examples of static features used are:
 
 - Whether the tweet is a retweet
 - Whether the tweet contains a link