From 3338ab076e0f9aedfa0c9b7b1a1a8ec273b47dd7 Mon Sep 17 00:00:00 2001
From: Faraz Razi <72218210+FarazRazi@users.noreply.github.com>
Date: Sat, 1 Apr 2023 05:11:34 +0500
Subject: [PATCH 1/3] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 056cc0770..afb437eab 100644
--- a/README.md
+++ b/README.md
@@ -34,6 +34,6 @@ We include Bazel BUILD files for most components, but not a top level BUILD or W

 ## Contributing

-We invite the community to submit GitHub issues and pull requests for suggestions on improving the recommendation algorithm. We are working on tools to manage these suggestions and sync changes to our internal repository. Any security concerns or issues should be routed to our official [bug bounty program](https://hackerone.com/twitter) through HackerOne. We hope to benefit from the collective intelligence and expertise of the global community in helping us identify issues and suggest improvements, ultimately leading to a better Twitter.
+We invite the community to submit GitHub issues and pull requests to suggest improvements to the recommendation algorithm. We are working on tools to manage these suggestions and sync changes to our internal repository. Any security concerns or issues should be routed to our official [bug bounty program](https://hackerone.com/twitter) through HackerOne. We hope to benefit from the collective intelligence and expertise of the global community in helping us identify issues and suggest improvements, ultimately leading to a better Twitter experience.

-Read our blog on the open source initiative [here](https://blog.twitter.com/en_us/topics/company/2023/a-new-era-of-transparency-for-twitter).
+Please read our blog on the open-source initiative [here](https://blog.twitter.com/en_us/topics/company/2023/a-new-era-of-transparency-for-twitter).

From 8de33f89e9d6b58906c664821f4305d8ef80216e Mon Sep 17 00:00:00 2001
From: Faraz Razi <72218210+FarazRazi@users.noreply.github.com>
Date: Sat, 1 Apr 2023 05:29:20 +0500
Subject: [PATCH 2/3] Update README.md

---
 .../scripts/models/earlybird/README.md | 35 +++++++------------
 1 file changed, 12 insertions(+), 23 deletions(-)

diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/README.md b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/README.md
index 3eb9e6c74..c10bf0f4e 100644
--- a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/README.md
+++ b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/README.md
@@ -28,36 +28,25 @@ The light ranker features pipeline is as follows:

 Some of these components are explained below:

-- Index Ingester: an indexing pipeline that handles the tweets as they are generated. This is the main input of
-  Earlybird, it produces Tweet Data (the basic information about the tweet, the text, the urls, media entities, facets,
-  etc) and Static Features (the features you can compute directly from a tweet right now, like whether it has URL, has
-  Cards, has quotes, etc); All information computed here are stored in index and flushed as each realtime index segments
-  become full. They are loaded back later from disk when Earlybird restarts. Note that the features may be computed in a
-  non-trivial way (like deciding the value of hasUrl), they could be computed and combined from some more "raw"
-  information in the tweet and from other services.
-  Signal Ingester: the ingester for Realtime Features, per-tweet features that can change after the tweet has been
-  indexed, mostly social engagements like retweetCount, favCount, replyCount, etc, along with some (future) spam signals
-  that's computed with later activities. These were collected and computed in a Heron topology by processing multiple
-  event streams and can be extended to support more features.
-- User Table Features is another set of features per user. They are from User Table Updater, a different input that
-  processes a stream written by our user service. It's used to store sparse realtime user
-  information. These per-user features are propagated to the tweet being scored by
-  looking up the author of the tweet.
-- Search Context Features are basically the information of current searcher, like their UI language, their own
-  produced/consumed language, and the current time (implied). They are combined with Tweet Data to compute some of the
-  features used in scoring.
+Index Ingester: This is an indexing pipeline that handles the tweets as they are generated. It is the main input of Earlybird, producing Tweet Data (the basic information about the tweet, such as the text, URLs, media entities, facets, etc.) and Static Features (features that can be computed directly from a tweet, such as whether it has a URL, cards, quotes, etc.). All information computed here is stored in the index and flushed as each real-time index segment becomes full. They are loaded back later from disk when Earlybird restarts. Note that the features may be computed in a non-trivial way (like deciding the value of hasUrl), as they could be computed and combined from some more "raw" information in the tweet and from other services.

-The scoring function in Earlybird uses both static and realtime features. Examples of static features used are:
+Signal Ingester: This is the ingester for Realtime Features, which are per-tweet features that can change after the tweet has been indexed. They mostly include social engagements like retweetCount, favCount, replyCount, etc., along with some (future) spam signals that are computed with later activities. These features are collected and computed in a Heron topology by processing multiple event streams and can be extended to support more features.
+
+User Table Features: This is another set of features per user, which are from User Table Updater, a different input that processes a stream written by our user service. It is used to store sparse real-time user information. These per-user features are propagated to the tweet being scored by looking up the author of the tweet.
+
+Search Context Features: These are basically the information of the current searcher, such as their UI language, their own produced/consumed language, and the current time (implied). They are combined with Tweet Data to compute some of the features used in scoring.
+
+The scoring function in Earlybird uses both static and real-time features. Examples of static features used are:-

 - Whether the tweet is a retweet
 - Whether the tweet contains a link
 - Whether this tweet has any trend words at ingestion time
 - Whether the tweet is a reply
-- A score for the static quality of the text, computed in TweetTextScorer.java in the Ingester. Based on the factors
-  such as offensiveness, content entropy, "shout" score, length, and readability.
-- tweepcred, see top-level README.md
+- A score for the static quality of the text, computed in TweetTextScorer.java in the Ingester. Based on factors such as offensiveness, content entropy, "shout" score, length, and readability.
+- Tweepcred (see top-level README.md)

-Examples of realtime features used are:
+Examples of real-time features used are:

 - Number of tweet likes/replies/retweets
 - pToxicity and pBlock scores provided by health models
+

From 631f5ee21c78220d0a76a6f0d9c4ddaa61f6eeda Mon Sep 17 00:00:00 2001
From: Faraz Razi <72218210+FarazRazi@users.noreply.github.com>
Date: Sat, 1 Apr 2023 05:31:11 +0500
Subject: [PATCH 3/3] Update README.md

---
 .../projects/timelines/scripts/models/earlybird/README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/README.md b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/README.md
index c10bf0f4e..a09257fc8 100644
--- a/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/README.md
+++ b/src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/README.md
@@ -28,13 +28,13 @@ The light ranker features pipeline is as follows:

 Some of these components are explained below:

-Index Ingester: This is an indexing pipeline that handles the tweets as they are generated. It is the main input of Earlybird, producing Tweet Data (the basic information about the tweet, such as the text, URLs, media entities, facets, etc.) and Static Features (features that can be computed directly from a tweet, such as whether it has a URL, cards, quotes, etc.). All information computed here is stored in the index and flushed as each real-time index segment becomes full. They are loaded back later from disk when Earlybird restarts. Note that the features may be computed in a non-trivial way (like deciding the value of hasUrl), as they could be computed and combined from some more "raw" information in the tweet and from other services.
+- Index Ingester: This is an indexing pipeline that handles the tweets as they are generated. It is the main input of Earlybird, producing Tweet Data (the basic information about the tweet, such as the text, URLs, media entities, facets, etc.) and Static Features (features that can be computed directly from a tweet, such as whether it has a URL, cards, quotes, etc.). All information computed here is stored in the index and flushed as each real-time index segment becomes full. They are loaded back later from disk when Earlybird restarts. Note that the features may be computed in a non-trivial way (like deciding the value of hasUrl), as they could be computed and combined from some more "raw" information in the tweet and from other services.

-Signal Ingester: This is the ingester for Realtime Features, which are per-tweet features that can change after the tweet has been indexed. They mostly include social engagements like retweetCount, favCount, replyCount, etc., along with some (future) spam signals that are computed with later activities. These features are collected and computed in a Heron topology by processing multiple event streams and can be extended to support more features.
+- Signal Ingester: This is the ingester for Realtime Features, which are per-tweet features that can change after the tweet has been indexed. They mostly include social engagements like retweetCount, favCount, replyCount, etc., along with some (future) spam signals that are computed with later activities. These features are collected and computed in a Heron topology by processing multiple event streams and can be extended to support more features.

-User Table Features: This is another set of features per user, which are from User Table Updater, a different input that processes a stream written by our user service. It is used to store sparse real-time user information. These per-user features are propagated to the tweet being scored by looking up the author of the tweet.
+- User Table Features: This is another set of features per user, which are from User Table Updater, a different input that processes a stream written by our user service. It is used to store sparse real-time user information. These per-user features are propagated to the tweet being scored by looking up the author of the tweet.

-Search Context Features: These are basically the information of the current searcher, such as their UI language, their own produced/consumed language, and the current time (implied). They are combined with Tweet Data to compute some of the features used in scoring.
+- Search Context Features: These are basically the information of the current searcher, such as their UI language, their own produced/consumed language, and the current time (implied). They are combined with Tweet Data to compute some of the features used in scoring.

 The scoring function in Earlybird uses both static and real-time features. Examples of static features used are:-