diff --git a/README.md b/README.md index 056cc0770..af87e0b51 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ -# Twitter Recommendation Algorithm +# Twitter's Recommendation Algorithm -The Twitter Recommendation Algorithm is a set of services and jobs that are responsible for constructing and serving the +Twitter's Recommendation Algorithm is a set of services and jobs that are responsible for constructing and serving the Home Timeline. For an introduction to how the algorithm works, please refer to our [engineering blog](https://blog.twitter.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm). The diagram below illustrates how major services and jobs interconnect. @@ -13,24 +13,24 @@ These are the main components of the Recommendation Algorithm included in this r | Feature | [SimClusters](src/scala/com/twitter/simclusters_v2/README.md) | Community detection and sparse embeddings into those communities. | | | [TwHIN](https://github.com/twitter/the-algorithm-ml/blob/main/projects/twhin/README.md) | Dense knowledge graph embeddings for Users and Tweets. | | | [trust-and-safety-models](trust_and_safety_models/README.md) | Models for detecting NSFW or abusive content. | -| | [real-graph](src/scala/com/twitter/interaction_graph/README.md) | Model to predict likelihood of a Twitter User interacting with another User. | +| | [real-graph](src/scala/com/twitter/interaction_graph/README.md) | Model to predict the likelihood of a Twitter User interacting with another User. | | | [tweepcred](src/scala/com/twitter/graph/batch/job/tweepcred/README) | Page-Rank algorithm for calculating Twitter User reputation. | | | [recos-injector](recos-injector/README.md) | Streaming event processor for building input streams for [GraphJet](https://github.com/twitter/GraphJet) based services. | | | [graph-feature-service](graph-feature-service/README.md) | Serves graph features for a directed pair of Users (e.g. how many of User A's following liked Tweets from User B). | | Candidate Source | [search-index](src/java/com/twitter/search/README.md) | Find and rank In-Network Tweets. ~50% of Tweets come from this candidate source. | | | [cr-mixer](cr-mixer/README.md) | Coordination layer for fetching Out-of-Network tweet candidates from underlying compute services. | -| | [user-tweet-entity-graph](src/scala/com/twitter/recos/user_tweet_entity_graph/README.md) (UTEG)| Maintains an in memory User to Tweet interaction graph, and finds candidates based on traversals of this graph. This is built on the [GraphJet](https://github.com/twitter/GraphJet) framework. Several other GraphJet based features and candidate sources are located [here](src/scala/com/twitter/recos) | +| | [user-tweet-entity-graph](src/scala/com/twitter/recos/user_tweet_entity_graph/README.md) (UTEG)| Maintains an in memory User to Tweet interaction graph, and finds candidates based on traversals of this graph. This is built on the [GraphJet](https://github.com/twitter/GraphJet) framework. Several other GraphJet based features and candidate sources are located [here](src/scala/com/twitter/recos). | | | [follow-recommendation-service](follow-recommendations-service/README.md) (FRS)| Provides Users with recommendations for accounts to follow, and Tweets from those accounts. | -| Ranking | [light-ranker](src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/README.md) | Light ranker model used by search index (Earlybird) to rank Tweets. 
| +| Ranking | [light-ranker](src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/README.md) | Light Ranker model used by search index (Earlybird) to rank Tweets. | | | [heavy-ranker](https://github.com/twitter/the-algorithm-ml/blob/main/projects/home/recap/README.md) | Neural network for ranking candidate tweets. One of the main signals used to select timeline Tweets post candidate sourcing. | -| Tweet mixing & filtering | [home-mixer](home-mixer/README.md) | Main service used to construct and serve the Home Timeline. Built on [product-mixer](product-mixer/README.md) | +| Tweet mixing & filtering | [home-mixer](home-mixer/README.md) | Main service used to construct and serve the Home Timeline. Built on [product-mixer](product-mixer/README.md). | | | [visibility-filters](visibilitylib/README.md) | Responsible for filtering Twitter content to support legal compliance, improve product quality, increase user trust, protect revenue through the use of hard-filtering, visible product treatments, and coarse-grained downranking. | | | [timelineranker](timelineranker/README.md) | Legacy service which provides relevance-scored tweets from the Earlybird Search Index and UTEG service. | -| Software framework | [navi](navi/navi/README.md) | High performance, machine learning model serving written in Rust. | +| Software framework | [navi](navi/README.md) | High performance, machine learning model serving written in Rust. | | | [product-mixer](product-mixer/README.md) | Software framework for building feeds of content. | | | [twml](twml/README.md) | Legacy machine learning framework built on TensorFlow v1. | -We include Bazel BUILD files for most components, but not a top level BUILD or WORKSPACE file. +We include Bazel BUILD files for most components, but not a top-level BUILD or WORKSPACE file. ## Contributing diff --git a/ann/src/main/python/dataflow/faiss_index_bq_dataset.py b/ann/src/main/python/dataflow/faiss_index_bq_dataset.py new file mode 100644 index 000000000..dd45070db --- /dev/null +++ b/ann/src/main/python/dataflow/faiss_index_bq_dataset.py @@ -0,0 +1,232 @@ +import argparse +import logging +import os +import pkgutil +import sys +from urllib.parse import urlsplit + +import apache_beam as beam +from apache_beam.options.pipeline_options import PipelineOptions +import faiss + + +def parse_d6w_config(argv=None): + """Parse d6w config. + :param argv: d6w config + :return: dictionary containing d6w config + """ + + parser = argparse.ArgumentParser( + description="See https://docbird.twitter.biz/d6w/model.html for any parameters inherited from d6w job config" + ) + parser.add_argument("--job_name", dest="job_name", required=True, help="d6w attribute") + parser.add_argument("--project", dest="project", required=True, help="d6w attribute") + parser.add_argument( + "--staging_location", dest="staging_location", required=True, help="d6w attribute" + ) + parser.add_argument("--temp_location", dest="temp_location", required=True, help="d6w attribute") + parser.add_argument( + "--output_location", + dest="output_location", + required=True, + help="GCS bucket and path where resulting artifacts are uploaded", + ) + parser.add_argument( + "--service_account_email", dest="service_account_email", required=True, help="d6w attribute" + ) + parser.add_argument( + "--factory_string", + dest="factory_string", + required=False, + help="FAISS factory string describing index to build. 
See https://github.com/facebookresearch/faiss/wiki/The-index-factory", + ) + parser.add_argument( + "--metric", + dest="metric", + required=True, + help="Metric used to compute distance between embeddings. Valid values are 'l2', 'ip', 'l1', 'linf'", + ) + parser.add_argument( + "--use_gpu", + dest="gpu", + required=True, + help="--use_gpu=yes if you want to use GPU during index building", + ) + + known_args, unknown_args = parser.parse_known_args(argv) + d6w_config = vars(known_args) + d6w_config["gpu"] = d6w_config["gpu"].lower() == "yes" + d6w_config["metric"] = parse_metric(d6w_config) + + """ + WARNING: Currently, d6w (a Twitter tool used to deploy Dataflow jobs to GCP) and + PipelineOptions.for_dataflow_runner (a helper method in twitter.ml.common.apache_beam) do not + play nicely together. The helper method will overwrite some of the config specified in the d6w + file using the defaults in https://sourcegraph.twitter.biz/git.twitter.biz/source/-/blob/src/python/twitter/ml/common/apache_beam/__init__.py?L24.' + However, the d6w output message will still report that the config specified in the d6w file was used. + """ + logging.warning( + f"The following d6w config parameters will be overwritten by the defaults in " + f"https://sourcegraph.twitter.biz/git.twitter.biz/source/-/blob/src/python/twitter/ml/common/apache_beam/__init__.py?L24\n" + f"{str(unknown_args)}" + ) + return d6w_config + + +def get_bq_query(): + """ + Query is expected to return rows with unique entityId + """ + return pkgutil.get_data(__name__, "bq.sql").decode("utf-8") + + +def parse_metric(config): + metric_str = config["metric"].lower() + if metric_str == "l2": + return faiss.METRIC_L2 + elif metric_str == "ip": + return faiss.METRIC_INNER_PRODUCT + elif metric_str == "l1": + return faiss.METRIC_L1 + elif metric_str == "linf": + return faiss.METRIC_Linf + else: + raise Exception(f"Unknown metric: {metric_str}") + + +def run_pipeline(argv=[]): + config = parse_d6w_config(argv) + argv_with_extras = argv + if config["gpu"]: + argv_with_extras.extend(["--experiments", "use_runner_v2"]) + argv_with_extras.extend( + ["--experiments", "worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver"] + ) + argv_with_extras.extend( + [ + "--worker_harness_container_image", + "gcr.io/twttr-recos-ml-prod/dataflow-gpu/beam2_39_0_py3_7", + ] + ) + + options = PipelineOptions(argv_with_extras) + output_bucket_name = urlsplit(config["output_location"]).netloc + + with beam.Pipeline(options=options) as p: + input_data = p | "Read from BigQuery" >> beam.io.ReadFromBigQuery( + method=beam.io.ReadFromBigQuery.Method.DIRECT_READ, + query=get_bq_query(), + use_standard_sql=True, + ) + + index_built = input_data | "Build and upload index" >> beam.CombineGlobally( + MergeAndBuildIndex( + output_bucket_name, + config["output_location"], + config["factory_string"], + config["metric"], + config["gpu"], + ) + ) + + # Make linter happy + index_built + + +class MergeAndBuildIndex(beam.CombineFn): + def __init__(self, bucket_name, gcs_output_path, factory_string, metric, gpu): + self.bucket_name = bucket_name + self.gcs_output_path = gcs_output_path + self.factory_string = factory_string + self.metric = metric + self.gpu = gpu + + def create_accumulator(self): + return [] + + def add_input(self, accumulator, element): + accumulator.append(element) + return accumulator + + def merge_accumulators(self, accumulators): + merged = [] + for accum in accumulators: + merged.extend(accum) + return merged + + def extract_output(self, rows): + # 
Reimports are needed on workers + import glob + import subprocess + + import faiss + from google.cloud import storage + import numpy as np + + client = storage.Client() + bucket = client.get_bucket(self.bucket_name) + + logging.info("Building FAISS index") + logging.info(f"There are {len(rows)} rows") + + ids = np.array([x["entityId"] for x in rows]).astype("long") + embeds = np.array([x["embedding"] for x in rows]).astype("float32") + dimensions = len(embeds[0]) + N = ids.shape[0] + logging.info(f"There are {dimensions} dimensions") + + if self.factory_string is None: + M = 48 + + divideable_dimensions = (dimensions // M) * M + if divideable_dimensions != dimensions: + opq_prefix = f"OPQ{M}_{divideable_dimensions}" + else: + opq_prefix = f"OPQ{M}" + + clusters = N // 20 + self.factory_string = f"{opq_prefix},IVF{clusters},PQ{M}" + + logging.info(f"Factory string is {self.factory_string}, metric={self.metric}") + + if self.gpu: + logging.info("Using GPU") + + res = faiss.StandardGpuResources() + cpu_index = faiss.index_factory(dimensions, self.factory_string, self.metric) + cpu_index = faiss.IndexIDMap(cpu_index) + gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index) + gpu_index.train(embeds) + gpu_index.add_with_ids(embeds, ids) + cpu_index = faiss.index_gpu_to_cpu(gpu_index) + else: + logging.info("Using CPU") + + cpu_index = faiss.index_factory(dimensions, self.factory_string, self.metric) + cpu_index = faiss.IndexIDMap(cpu_index) + cpu_index.train(embeds) + cpu_index.add_with_ids(embeds, ids) + + logging.info("Built faiss index") + + local_path = "/indices" + logging.info(f"Writing indices to local {local_path}") + subprocess.run(f"mkdir -p {local_path}".strip().split()) + local_index_path = os.path.join(local_path, "result.index") + + faiss.write_index(cpu_index, local_index_path) + logging.info(f"Done writing indices to local {local_path}") + + logging.info(f"Uploading to GCS with path {self.gcs_output_path}") + assert os.path.isdir(local_path) + for local_file in glob.glob(local_path + "/*"): + remote_path = os.path.join( + self.gcs_output_path.split("/")[-1], local_file[1 + len(local_path) :] + ) + blob = bucket.blob(remote_path) + blob.upload_from_filename(local_file) + + +if __name__ == "__main__": + logging.getLogger().setLevel(logging.INFO) + run_pipeline(sys.argv) diff --git a/cr-mixer/README.md b/cr-mixer/README.md new file mode 100644 index 000000000..0037f7e69 --- /dev/null +++ b/cr-mixer/README.md @@ -0,0 +1,7 @@ +# CR-Mixer + +CR-Mixer is a candidate generation service proposed as part of the Personalization Strategy vision for Twitter. Its aim is to speed up the iteration and development of candidate generation and light ranking. The service acts as a lightweight coordinating layer that delegates candidate generation tasks to underlying compute services. It focuses on Twitter's candidate generation use cases and offers a centralized platform for fetching, mixing, and managing candidate sources and light rankers. The overarching goal is to increase the speed and ease of testing and developing candidate generation pipelines, ultimately delivering more value to Twitter users. + +CR-Mixer acts as a configurator and delegator, providing abstractions for the challenging parts of candidate generation and handling performance issues. It will offer a 1-stop-shop for fetching and mixing candidate sources, a managed and shared performant platform, a light ranking layer, a common filtering layer, a version control system, a co-owned feature switch set, and peripheral tooling. 
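+
+As a sketch of this coordination pattern, the following illustrative Python pseudocode fans out to underlying candidate sources, merges and dedupes the results, applies a common filtering layer, and light-ranks the merged pool. All names here are hypothetical; the real CR-Mixer is a Scala service with its own interfaces.
+
+```python
+from concurrent.futures import ThreadPoolExecutor
+
+def mix_candidates(request, sources, filters, light_ranker, max_candidates=400):
+    """Fan out to candidate sources, then merge, filter, and light-rank."""
+    with ThreadPoolExecutor() as pool:
+        # Delegate candidate generation to the underlying compute services.
+        per_source = list(pool.map(lambda s: s.fetch(request), sources))
+
+    # Merge and dedupe candidates from all sources by tweet id.
+    merged = {}
+    for candidates in per_source:
+        for candidate in candidates:
+            merged.setdefault(candidate.tweet_id, candidate)
+    candidates = list(merged.values())
+
+    # Common filtering layer (e.g. deduping and pre-ranking filters).
+    for candidate_filter in filters:
+        candidates = [c for c in candidates if candidate_filter.keep(request, c)]
+
+    # Light ranking, then truncate before heavier downstream ranking.
+    candidates.sort(key=light_ranker.score, reverse=True)
+    return candidates[:max_candidates]
+```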
+ +CR-Mixer's pipeline consists of 4 steps: source signal extraction, candidate generation, filtering, and ranking. It also provides peripheral tooling like scribing, debugging, and monitoring. The service fetches source signals externally from stores like UserProfileService and RealGraph, calls external candidate generation services, and caches results. Filters are applied for deduping and pre-ranking, and a light ranking step follows. diff --git a/cr-mixer/server/src/main/scala/com/twitter/cr_mixer/similarity_engine/EarlybirdTensorflowBasedSimilarityEngine.scala b/cr-mixer/server/src/main/scala/com/twitter/cr_mixer/similarity_engine/EarlybirdTensorflowBasedSimilarityEngine.scala new file mode 100644 index 000000000..8df6ec711 --- /dev/null +++ b/cr-mixer/server/src/main/scala/com/twitter/cr_mixer/similarity_engine/EarlybirdTensorflowBasedSimilarityEngine.scala @@ -0,0 +1,138 @@ +package com.twitter.cr_mixer.similarity_engine + +import com.twitter.finagle.stats.StatsReceiver +import com.twitter.search.earlybird.thriftscala.EarlybirdRequest +import com.twitter.search.earlybird.thriftscala.EarlybirdService +import com.twitter.search.earlybird.thriftscala.ThriftSearchQuery +import com.twitter.util.Time +import com.twitter.search.common.query.thriftjava.thriftscala.CollectorParams +import com.twitter.search.common.ranking.thriftscala.ThriftRankingParams +import com.twitter.search.common.ranking.thriftscala.ThriftScoringFunctionType +import com.twitter.search.earlybird.thriftscala.ThriftSearchRelevanceOptions +import javax.inject.Inject +import javax.inject.Singleton +import EarlybirdSimilarityEngineBase._ +import com.twitter.cr_mixer.config.TimeoutConfig +import com.twitter.cr_mixer.similarity_engine.EarlybirdTensorflowBasedSimilarityEngine.EarlybirdTensorflowBasedSearchQuery +import com.twitter.cr_mixer.util.EarlybirdSearchUtil.EarlybirdClientId +import com.twitter.cr_mixer.util.EarlybirdSearchUtil.FacetsToFetch +import com.twitter.cr_mixer.util.EarlybirdSearchUtil.GetCollectorTerminationParams +import com.twitter.cr_mixer.util.EarlybirdSearchUtil.GetEarlybirdQuery +import com.twitter.cr_mixer.util.EarlybirdSearchUtil.MetadataOptions +import com.twitter.cr_mixer.util.EarlybirdSearchUtil.GetNamedDisjunctions +import com.twitter.search.earlybird.thriftscala.ThriftSearchRankingMode +import com.twitter.simclusters_v2.common.TweetId +import com.twitter.simclusters_v2.common.UserId +import com.twitter.util.Duration + +@Singleton +case class EarlybirdTensorflowBasedSimilarityEngine @Inject() ( + earlybirdSearchClient: EarlybirdService.MethodPerEndpoint, + timeoutConfig: TimeoutConfig, + stats: StatsReceiver) + extends EarlybirdSimilarityEngineBase[EarlybirdTensorflowBasedSearchQuery] { + import EarlybirdTensorflowBasedSimilarityEngine._ + override val statsReceiver: StatsReceiver = stats.scope(this.getClass.getSimpleName) + override def getEarlybirdRequest( + query: EarlybirdTensorflowBasedSearchQuery + ): Option[EarlybirdRequest] = { + if (query.seedUserIds.nonEmpty) + Some( + EarlybirdRequest( + searchQuery = getThriftSearchQuery(query, timeoutConfig.earlybirdServerTimeout), + clientHost = None, + clientRequestID = None, + clientId = Some(EarlybirdClientId), + clientRequestTimeMs = Some(Time.now.inMilliseconds), + cachingParams = None, + timeoutMs = timeoutConfig.earlybirdServerTimeout.inMilliseconds.intValue(), + facetRequest = None, + termStatisticsRequest = None, + debugMode = 0, + debugOptions = None, + searchSegmentId = None, + returnStatusType = None, + successfulResponseThreshold = None, + 
querySource = None, + getOlderResults = Some(false), + followedUserIds = Some(query.seedUserIds), + adjustedProtectedRequestParams = None, + adjustedFullArchiveRequestParams = None, + getProtectedTweetsOnly = Some(false), + retokenizeSerializedQuery = None, + skipVeryRecentTweets = true, + experimentClusterToUse = None + )) + else None + } +} + +object EarlybirdTensorflowBasedSimilarityEngine { + case class EarlybirdTensorflowBasedSearchQuery( + searcherUserId: Option[UserId], + seedUserIds: Seq[UserId], + maxNumTweets: Int, + beforeTweetIdExclusive: Option[TweetId], + afterTweetIdExclusive: Option[TweetId], + filterOutRetweetsAndReplies: Boolean, + useTensorflowRanking: Boolean, + excludedTweetIds: Set[TweetId], + maxNumHitsPerShard: Int) + extends EarlybirdSearchQuery + + private def getThriftSearchQuery( + query: EarlybirdTensorflowBasedSearchQuery, + processingTimeout: Duration + ): ThriftSearchQuery = + ThriftSearchQuery( + serializedQuery = GetEarlybirdQuery( + query.beforeTweetIdExclusive, + query.afterTweetIdExclusive, + query.excludedTweetIds, + query.filterOutRetweetsAndReplies).map(_.serialize), + fromUserIDFilter64 = Some(query.seedUserIds), + numResults = query.maxNumTweets, + // Whether to collect conversation IDs. Remove it for now. + // collectConversationId = Gate.True(), // true for Home + rankingMode = ThriftSearchRankingMode.Relevance, + relevanceOptions = Some(getRelevanceOptions), + collectorParams = Some( + CollectorParams( + // numResultsToReturn defines how many results each EB shard will return to search root + numResultsToReturn = 1000, + // terminationParams.maxHitsToProcess is used for early terminating per shard results fetching. + terminationParams = + GetCollectorTerminationParams(query.maxNumHitsPerShard, processingTimeout) + )), + facetFieldNames = Some(FacetsToFetch), + resultMetadataOptions = Some(MetadataOptions), + searcherId = query.searcherUserId, + searchStatusIds = None, + namedDisjunctionMap = GetNamedDisjunctions(query.excludedTweetIds) + ) + + // The specific values of recap relevance/reranking options correspond to + // experiment: enable_recap_reranking_2988,timeline_internal_disable_recap_filter + // bucket : enable_rerank,disable_filter + private def getRelevanceOptions: ThriftSearchRelevanceOptions = { + ThriftSearchRelevanceOptions( + proximityScoring = true, + maxConsecutiveSameUser = Some(2), + rankingParams = Some(getTensorflowBasedRankingParams), + maxHitsToProcess = Some(500), + maxUserBlendCount = Some(3), + proximityPhraseWeight = 9.0, + returnAllResults = Some(true) + ) + } + + private def getTensorflowBasedRankingParams: ThriftRankingParams = { + ThriftRankingParams( + `type` = Some(ThriftScoringFunctionType.TensorflowBased), + selectedTensorflowModel = Some("timelines_rectweet_replica"), + minScore = -1.0e100, + applyBoosts = false, + authorSpecificScoreAdjustments = None + ) + } +} diff --git a/home-mixer/server/src/main/scala/com/twitter/home_mixer/functional_component/decorator/HomeTweetTypePredicates.scala b/home-mixer/server/src/main/scala/com/twitter/home_mixer/functional_component/decorator/HomeTweetTypePredicates.scala new file mode 100644 index 000000000..0b06448d7 --- /dev/null +++ b/home-mixer/server/src/main/scala/com/twitter/home_mixer/functional_component/decorator/HomeTweetTypePredicates.scala @@ -0,0 +1,227 @@ +package com.twitter.home_mixer.functional_component.decorator + +import com.twitter.conversions.DurationOps._ +import com.twitter.home_mixer.model.HomeFeatures._ +import 
com.twitter.product_mixer.core.feature.featuremap.FeatureMap +import com.twitter.timelinemixer.injection.model.candidate.SemanticCoreFeatures +import com.twitter.tweetypie.{thriftscala => tpt} + +object HomeTweetTypePredicates { + + /** + * IMPORTANT: Please avoid logging tweet types that are tied to sensitive + * internal author information / labels (e.g. blink labels, abuse labels, or geo-location). + */ + private[this] val CandidatePredicates: Seq[(String, FeatureMap => Boolean)] = Seq( + ("with_candidate", _ => true), + ("retweet", _.getOrElse(IsRetweetFeature, false)), + ("reply", _.getOrElse(InReplyToTweetIdFeature, None).nonEmpty), + ("image", _.getOrElse(EarlybirdFeature, None).exists(_.hasImage)), + ("video", _.getOrElse(EarlybirdFeature, None).exists(_.hasVideo)), + ("link", _.getOrElse(EarlybirdFeature, None).exists(_.hasVisibleLink)), + ("quote", _.getOrElse(EarlybirdFeature, None).exists(_.hasQuote.contains(true))), + ("like_social_context", _.getOrElse(NonSelfFavoritedByUserIdsFeature, Seq.empty).nonEmpty), + ("protected", _.getOrElse(EarlybirdFeature, None).exists(_.isProtected)), + ( + "has_exclusive_conversation_author_id", + _.getOrElse(ExclusiveConversationAuthorIdFeature, None).nonEmpty), + ("is_eligible_for_connect_boost", _.getOrElse(AuthorIsEligibleForConnectBoostFeature, false)), + ("hashtag", _.getOrElse(EarlybirdFeature, None).exists(_.numHashtags > 0)), + ("has_scheduled_space", _.getOrElse(AudioSpaceMetaDataFeature, None).exists(_.isScheduled)), + ("has_recorded_space", _.getOrElse(AudioSpaceMetaDataFeature, None).exists(_.isRecorded)), + ("is_read_from_cache", _.getOrElse(IsReadFromCacheFeature, false)), + ( + "is_self_thread_tweet", + _.getOrElse(ConversationFeature, None).exists(_.isSelfThreadTweet.contains(true))), + ("get_initial", _.getOrElse(GetInitialFeature, false)), + ("get_newer", _.getOrElse(GetNewerFeature, false)), + ("get_middle", _.getOrElse(GetMiddleFeature, false)), + ("get_older", _.getOrElse(GetOlderFeature, false)), + ("pull_to_refresh", _.getOrElse(PullToRefreshFeature, false)), + ("polling", _.getOrElse(PollingFeature, false)), + ("tls_size_20_plus", _ => false), + ("near_empty", _ => false), + ("ranked_request", _ => false), + ("mutual_follow", _.getOrElse(EarlybirdFeature, None).exists(_.fromMutualFollow)), + ("has_ticketed_space", _.getOrElse(AudioSpaceMetaDataFeature, None).exists(_.hasTickets)), + ("in_utis_top5", _.getOrElse(PositionFeature, None).exists(_ < 5)), + ("is_utis_pos0", _.getOrElse(PositionFeature, None).exists(_ == 0)), + ("is_utis_pos1", _.getOrElse(PositionFeature, None).exists(_ == 1)), + ("is_utis_pos2", _.getOrElse(PositionFeature, None).exists(_ == 2)), + ("is_utis_pos3", _.getOrElse(PositionFeature, None).exists(_ == 3)), + ("is_utis_pos4", _.getOrElse(PositionFeature, None).exists(_ == 4)), + ( + "is_signup_request", + candidate => candidate.getOrElse(AccountAgeFeature, None).exists(_.untilNow < 30.minutes)), + ("empty_request", _ => false), + ("served_size_less_than_5", _.getOrElse(ServedSizeFeature, None).exists(_ < 5)), + ("served_size_less_than_10", _.getOrElse(ServedSizeFeature, None).exists(_ < 10)), + ("served_size_less_than_20", _.getOrElse(ServedSizeFeature, None).exists(_ < 20)), + ("served_size_less_than_50", _.getOrElse(ServedSizeFeature, None).exists(_ < 50)), + ( + "served_size_between_50_and_100", + _.getOrElse(ServedSizeFeature, None).exists(size => size >= 50 && size < 100)), + ("authored_by_contextual_user", _.getOrElse(AuthoredByContextualUserFeature, false)), + ("has_ancestors", 
_.getOrElse(AncestorsFeature, Seq.empty).nonEmpty), + ("full_scoring_succeeded", _.getOrElse(FullScoringSucceededFeature, false)), + ( + "account_age_less_than_30_minutes", + _.getOrElse(AccountAgeFeature, None).exists(_.untilNow < 30.minutes)), + ( + "account_age_less_than_1_day", + _.getOrElse(AccountAgeFeature, None).exists(_.untilNow < 1.day)), + ( + "account_age_less_than_7_days", + _.getOrElse(AccountAgeFeature, None).exists(_.untilNow < 7.days)), + ( + "directed_at_user_is_in_first_degree", + _.getOrElse(EarlybirdFeature, None).exists(_.directedAtUserIdIsInFirstDegree.contains(true))), + ("root_user_is_in_first_degree", _ => false), + ( + "has_semantic_core_annotation", + _.getOrElse(EarlybirdFeature, None).exists(_.semanticCoreAnnotations.nonEmpty)), + ("is_request_context_foreground", _.getOrElse(IsForegroundRequestFeature, false)), + ( + "part_of_utt", + _.getOrElse(EarlybirdFeature, None) + .exists(_.semanticCoreAnnotations.exists(_.exists(annotation => + annotation.domainId == SemanticCoreFeatures.UnifiedTwitterTaxonomy)))), + ("is_random_tweet", _.getOrElse(IsRandomTweetFeature, false)), + ("has_random_tweet_in_response", _.getOrElse(HasRandomTweetFeature, false)), + ("is_random_tweet_above_in_utis", _.getOrElse(IsRandomTweetAboveFeature, false)), + ("is_request_context_launch", _.getOrElse(IsLaunchRequestFeature, false)), + ("viewer_is_employee", _ => false), + ("viewer_is_timelines_employee", _ => false), + ("viewer_follows_any_topics", _.getOrElse(UserFollowedTopicsCountFeature, None).exists(_ > 0)), + ( + "has_ancestor_authored_by_viewer", + candidate => + candidate + .getOrElse(AncestorsFeature, Seq.empty).exists(ancestor => + candidate.getOrElse(ViewerIdFeature, 0L) == ancestor.userId)), + ("ancestor", _.getOrElse(IsAncestorCandidateFeature, false)), + ( + "root_ancestor", + candidate => + candidate.getOrElse(IsAncestorCandidateFeature, false) && candidate + .getOrElse(InReplyToTweetIdFeature, None).isEmpty), + ( + "deep_reply", + candidate => + candidate.getOrElse(InReplyToTweetIdFeature, None).nonEmpty && candidate + .getOrElse(AncestorsFeature, Seq.empty).size > 2), + ( + "has_simcluster_embeddings", + _.getOrElse( + SimclustersTweetTopKClustersWithScoresFeature, + Map.empty[String, Double]).nonEmpty), + ( + "tweet_age_less_than_15_seconds", + _.getOrElse(OriginalTweetCreationTimeFromSnowflakeFeature, None) + .exists(_.untilNow <= 15.seconds)), + ("is_followed_topic_tweet", _ => false), + ("is_recommended_topic_tweet", _ => false), + ("is_topic_tweet", _ => false), + ("preferred_language_matches_tweet_language", _ => false), + ( + "device_language_matches_tweet_language", + candidate => + candidate.getOrElse(TweetLanguageFeature, None) == + candidate.getOrElse(DeviceLanguageFeature, None)), + ("question", _.getOrElse(EarlybirdFeature, None).exists(_.hasQuestion.contains(true))), + ("in_network", _.getOrElse(FromInNetworkSourceFeature, true)), + ("viewer_follows_original_author", _ => false), + ("has_account_follow_prompt", _ => false), + ("has_relevance_prompt", _ => false), + ("has_topic_annotation_haug_prompt", _ => false), + ("has_topic_annotation_random_precision_prompt", _ => false), + ("has_topic_annotation_prompt", _ => false), + ( + "has_political_annotation", + _.getOrElse(EarlybirdFeature, None).exists( + _.semanticCoreAnnotations.exists( + _.exists(annotation => + SemanticCoreFeatures.PoliticalDomains.contains(annotation.domainId) || + (annotation.domainId == SemanticCoreFeatures.UnifiedTwitterTaxonomy && + annotation.entityId == 
SemanticCoreFeatures.UttPoliticsEntityId))))), + ( + "is_dont_at_me_by_invitation", + _.getOrElse(EarlybirdFeature, None).exists( + _.conversationControl.exists(_.isInstanceOf[tpt.ConversationControl.ByInvitation]))), + ( + "is_dont_at_me_community", + _.getOrElse(EarlybirdFeature, None) + .exists(_.conversationControl.exists(_.isInstanceOf[tpt.ConversationControl.Community]))), + ("has_zero_score", _.getOrElse(ScoreFeature, None).exists(_ == 0.0)), + ("is_viewer_not_invited_to_reply", _ => false), + ("is_viewer_invited_to_reply", _ => false), + ("has_gte_10_favs", _.getOrElse(EarlybirdFeature, None).exists(_.favCountV2.exists(_ >= 10))), + ("has_gte_100_favs", _.getOrElse(EarlybirdFeature, None).exists(_.favCountV2.exists(_ >= 100))), + ("has_gte_1k_favs", _.getOrElse(EarlybirdFeature, None).exists(_.favCountV2.exists(_ >= 1000))), + ( + "has_gte_10k_favs", + _.getOrElse(EarlybirdFeature, None).exists(_.favCountV2.exists(_ >= 10000))), + ( + "has_gte_100k_favs", + _.getOrElse(EarlybirdFeature, None).exists(_.favCountV2.exists(_ >= 100000))), + ("above_neighbor_is_topic_tweet", _ => false), + ("is_topic_tweet_with_neighbor_below", _ => false), + ("has_audio_space", _.getOrElse(AudioSpaceMetaDataFeature, None).exists(_.hasSpace)), + ("has_live_audio_space", _.getOrElse(AudioSpaceMetaDataFeature, None).exists(_.isLive)), + ( + "has_gte_10_retweets", + _.getOrElse(EarlybirdFeature, None).exists(_.retweetCountV2.exists(_ >= 10))), + ( + "has_gte_100_retweets", + _.getOrElse(EarlybirdFeature, None).exists(_.retweetCountV2.exists(_ >= 100))), + ( + "has_gte_1k_retweets", + _.getOrElse(EarlybirdFeature, None).exists(_.retweetCountV2.exists(_ >= 1000))), + ( + "has_us_political_annotation", + _.getOrElse(EarlybirdFeature, None) + .exists(_.semanticCoreAnnotations.exists(_.exists(annotation => + annotation.domainId == SemanticCoreFeatures.UnifiedTwitterTaxonomy && + annotation.entityId == SemanticCoreFeatures.usPoliticalTweetEntityId && + annotation.groupId == SemanticCoreFeatures.UsPoliticalTweetAnnotationGroupIds.BalancedV0)))), + ( + "has_toxicity_score_above_threshold", + _.getOrElse(EarlybirdFeature, None).exists(_.toxicityScore.exists(_ > 0.91))), + ( + "text_only", + candidate => + candidate.getOrElse(HasDisplayedTextFeature, false) && + !(candidate.getOrElse(EarlybirdFeature, None).exists(_.hasImage) || + candidate.getOrElse(EarlybirdFeature, None).exists(_.hasVideo) || + candidate.getOrElse(EarlybirdFeature, None).exists(_.hasCard))), + ( + "image_only", + candidate => + candidate.getOrElse(EarlybirdFeature, None).exists(_.hasImage) && + !candidate.getOrElse(HasDisplayedTextFeature, false)), + ("has_1_image", _.getOrElse(NumImagesFeature, None).exists(_ == 1)), + ("has_2_images", _.getOrElse(NumImagesFeature, None).exists(_ == 2)), + ("has_3_images", _.getOrElse(NumImagesFeature, None).exists(_ == 3)), + ("has_4_images", _.getOrElse(NumImagesFeature, None).exists(_ == 4)), + ("has_card", _.getOrElse(EarlybirdFeature, None).exists(_.hasCard)), + ("3_or_more_consecutive_not_in_network", _ => false), + ("2_or_more_consecutive_not_in_network", _ => false), + ("5_out_of_7_not_in_network", _ => false), + ("7_out_of_7_not_in_network", _ => false), + ("5_out_of_5_not_in_network", _ => false), + ("user_follow_count_gte_50", _.getOrElse(UserFollowingCountFeature, None).exists(_ > 50)), + ("has_liked_by_social_context", _ => false), + ("has_followed_by_social_context", _ => false), + ("has_topic_social_context", _ => false), + ("timeline_entry_has_banner", _ => false), + 
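+    // Tweet types describing conversation-module serving.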
("served_in_conversation_module", _.getOrElse(ServedInConversationModuleFeature, false)), + ( + "conversation_module_has_2_displayed_tweets", + _.getOrElse(ConversationModule2DisplayedTweetsFeature, false)), + ("conversation_module_has_gap", _.getOrElse(ConversationModuleHasGapFeature, false)), + ("served_in_recap_tweet_candidate_module_injection", _ => false), + ("served_in_threaded_conversation_module", _ => false) + ) + + val PredicateMap = CandidatePredicates.toMap +} diff --git a/home-mixer/server/src/main/scala/com/twitter/home_mixer/util/earlybird/RelevanceSearchUtil.scala b/home-mixer/server/src/main/scala/com/twitter/home_mixer/util/earlybird/RelevanceSearchUtil.scala new file mode 100644 index 000000000..0de4546a6 --- /dev/null +++ b/home-mixer/server/src/main/scala/com/twitter/home_mixer/util/earlybird/RelevanceSearchUtil.scala @@ -0,0 +1,49 @@ +package com.twitter.home_mixer.util.earlybird + +import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant +import com.twitter.search.common.ranking.{thriftscala => scr} +import com.twitter.search.earlybird.{thriftscala => eb} + +object RelevanceSearchUtil { + + val Mentions: String = EarlybirdFieldConstant.MENTIONS_FACET + val Hashtags: String = EarlybirdFieldConstant.HASHTAGS_FACET + val FacetsToFetch: Seq[String] = Seq(Mentions, Hashtags) + + private val RankingParams: scr.ThriftRankingParams = { + scr.ThriftRankingParams( + `type` = Some(scr.ThriftScoringFunctionType.TensorflowBased), + selectedTensorflowModel = Some("timelines_rectweet_replica"), + minScore = -1.0e100, + selectedModels = Some(Map("home_mixer_unified_engagement_prod" -> 1.0)), + applyBoosts = false, + ) + } + + val MetadataOptions: eb.ThriftSearchResultMetadataOptions = { + eb.ThriftSearchResultMetadataOptions( + getTweetUrls = true, + getResultLocation = false, + getLuceneScore = false, + getInReplyToStatusId = true, + getReferencedTweetAuthorId = true, + getMediaBits = true, + getAllFeatures = true, + returnSearchResultFeatures = true, + // Set getExclusiveConversationAuthorId in order to retrieve Exclusive / SuperFollow tweets. + getExclusiveConversationAuthorId = true + ) + } + + val RelevanceOptions: eb.ThriftSearchRelevanceOptions = { + eb.ThriftSearchRelevanceOptions( + proximityScoring = true, + maxConsecutiveSameUser = Some(2), + rankingParams = Some(RankingParams), + maxHitsToProcess = Some(500), + maxUserBlendCount = Some(3), + proximityPhraseWeight = 9.0, + returnAllResults = Some(true) + ) + } +} diff --git a/navi/README.md b/navi/README.md new file mode 100644 index 000000000..9a4326d96 --- /dev/null +++ b/navi/README.md @@ -0,0 +1,36 @@ +# Navi: High-Performance Machine Learning Serving Server in Rust + +Navi is a high-performance, versatile machine learning serving server implemented in Rust and tailored for production usage. It's designed to efficiently serve within the Twitter tech stack, offering top-notch performance while focusing on core features. + +## Key Features + +- **Minimalist Design Optimized for Production Use Cases**: Navi delivers ultra-high performance, stability, and availability, engineered to handle real-world application demands with a streamlined codebase. +- **gRPC API Compatibility with TensorFlow Serving**: Seamless integration with existing TensorFlow Serving clients via its gRPC API, enabling easy integration, smooth deployment, and scaling in production environments. 
+
+- **Plugin Architecture for Different Runtimes**: Navi's pluggable architecture supports various machine learning runtimes, providing adaptability and extensibility for diverse use cases. Out-of-the-box support is available for TensorFlow and Onnx Runtime, with PyTorch in an experimental state.
+
+## Current State
+
+While Navi's features may not be as comprehensive as its open-source counterparts, its performance-first mindset makes it highly efficient.
+- Navi for TensorFlow is currently the most feature-complete, supporting multiple input tensors of different types (float, int, string, etc.).
+- Navi for Onnx primarily supports one input tensor of type string, used in Twitter's home recommendation with a proprietary BatchPredictRequest format.
+- Navi for PyTorch is compilable and runnable but not yet production-ready in terms of performance and stability.
+
+## Directory Structure
+
+- `navi`: The main code repository for Navi
+- `dr_transform`: Twitter-specific converter that converts BatchPredictionRequest Thrift to ndarray
+- `segdense`: Twitter-specific config that specifies how to retrieve feature values from BatchPredictionRequest
+- `thrift_bpr_adapter`: generated Thrift code for BatchPredictionRequest
+
+## Content
+We have included all *.rs source code files that make up the main Navi binaries for you to examine. However, we have not included the test and benchmark code, or various configuration files, due to data security concerns.
+
+## Run
+In navi/navi, you can run the following commands:
+- `scripts/run_tf2.sh` for [TensorFlow](https://www.tensorflow.org/)
+- `scripts/run_onnx.sh` for [Onnx](https://onnx.ai/)
+
+Note that you need to create a `models` directory and create some version subdirectories in it, preferably named using epoch time, e.g., `1679693908377`.
+
+## Build
+You can adapt the above scripts to build using Cargo.
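+
+As a concrete illustration of the versioned `models` directory mentioned under Run, a sketch like the following creates one epoch-time version for a model (hypothetical layout and names; point the scripts at whatever directory your model config expects):
+
+```python
+import os
+import time
+
+# One numeric subdirectory per model version; epoch-millis names sort newest-last.
+version = str(int(time.time() * 1000))  # e.g. "1679693908377"
+path = os.path.join("models", "my_model", version)
+os.makedirs(path, exist_ok=True)
+print("created model version dir:", path)
+```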
diff --git a/navi/dr_transform/src/all_config.rs b/navi/dr_transform/src/all_config.rs new file mode 100644 index 000000000..29451bfd4 --- /dev/null +++ b/navi/dr_transform/src/all_config.rs @@ -0,0 +1,48 @@
+use serde::{Deserialize, Serialize};
+
+use serde_json::Error;
+
+#[derive(Default, Debug, Clone, PartialEq, Serialize, Deserialize)]
+#[serde(rename_all = "camelCase")]
+pub struct AllConfig {
+    #[serde(rename = "train_data")]
+    pub train_data: TrainData,
+}
+
+#[derive(Default, Debug, Clone, PartialEq, Serialize, Deserialize)]
+#[serde(rename_all = "camelCase")]
+pub struct TrainData {
+    #[serde(rename = "seg_dense_schema")]
+    pub seg_dense_schema: SegDenseSchema,
+}
+
+#[derive(Default, Debug, Clone, PartialEq, Serialize, Deserialize)]
+#[serde(rename_all = "camelCase")]
+pub struct SegDenseSchema {
+    #[serde(rename = "renamed_features")]
+    pub renamed_features: RenamedFeatures,
+}
+
+#[derive(Default, Debug, Clone, PartialEq, Serialize, Deserialize)]
+#[serde(rename_all = "camelCase")]
+pub struct RenamedFeatures {
+    pub continuous: String,
+    pub binary: String,
+    pub discrete: String,
+    #[serde(rename = "author_embedding")]
+    pub author_embedding: String,
+    #[serde(rename = "user_embedding")]
+    pub user_embedding: String,
+    #[serde(rename = "user_eng_embedding")]
+    pub user_eng_embedding: String,
+    #[serde(rename = "meta__author_id")]
+    pub meta_author_id: String,
+    #[serde(rename = "meta__user_id")]
+    pub meta_user_id: String,
+    #[serde(rename = "meta__tweet_id")]
+    pub meta_tweet_id: String,
+}
+
+pub fn parse(json_str: &str) -> Result<AllConfig, Error> {
+    serde_json::from_str(json_str)
+}
diff --git a/navi/dr_transform/src/converter.rs b/navi/dr_transform/src/converter.rs new file mode 100644 index 000000000..578d766fd --- /dev/null +++ b/navi/dr_transform/src/converter.rs @@ -0,0 +1,586 @@
+use std::collections::BTreeSet;
+use std::fmt::{self, Debug, Display};
+use std::fs;
+
+use bpr_thrift::data::DataRecord;
+use bpr_thrift::prediction_service::BatchPredictionRequest;
+use bpr_thrift::tensor::GeneralTensor;
+use log::debug;
+use ndarray::Array2;
+use once_cell::sync::OnceCell;
+use ort::tensor::InputTensor;
+use prometheus::{HistogramOpts, HistogramVec};
+use segdense::mapper::{FeatureMapper, MapReader};
+use segdense::segdense_transform_spec_home_recap_2022::{DensificationTransformSpec, Root};
+use segdense::util;
+use thrift::protocol::{TBinaryInputProtocol, TSerializable};
+use thrift::transport::TBufferChannel;
+
+use crate::{all_config, all_config::AllConfig};
+
+pub fn log_feature_match(
+    dr: &DataRecord,
+    seg_dense_config: &DensificationTransformSpec,
+    dr_type: String,
+) {
+    // Note: the following algorithm matches features from the config using linear search.
+    // Also, the record source is MinDataRecord. This includes only binary and continuous features for now.
+
+    for (feature_id, feature_value) in dr.continuous_features.as_ref().unwrap() {
+        debug!(
+            "{dr_type} - Continuous Datarecord => Feature ID: {feature_id}, Feature value: {feature_value}"
+        );
+        for input_feature in &seg_dense_config.cont.input_features {
+            if input_feature.feature_id == *feature_id {
+                debug!("Matching input feature: {input_feature:?}")
+            }
+        }
+    }
+
+    for feature_id in dr.binary_features.as_ref().unwrap() {
+        debug!("{dr_type} - Binary Datarecord => Feature ID: {feature_id}");
+        for input_feature in &seg_dense_config.binary.input_features {
+            if input_feature.feature_id == *feature_id {
+                debug!("Found input feature: {input_feature:?}")
+            }
+        }
+    }
+}
+
+pub fn log_feature_matches(drs: &Vec<DataRecord>, seg_dense_config: &DensificationTransformSpec) {
+    for dr in drs {
+        log_feature_match(dr, seg_dense_config, String::from("individual"));
+    }
+}
+
+pub trait Converter: Send + Sync + Debug + 'static + Display {
+    fn convert(&self, input: Vec<Vec<u8>>) -> (Vec<InputTensor>, Vec<usize>);
+}
+
+#[derive(Debug)]
+#[allow(dead_code)]
+pub struct BatchPredictionRequestToTorchTensorConverter {
+    all_config: AllConfig,
+    seg_dense_config: Root,
+    all_config_path: String,
+    seg_dense_config_path: String,
+    feature_mapper: FeatureMapper,
+    user_embedding_feature_id: i64,
+    user_eng_embedding_feature_id: i64,
+    author_embedding_feature_id: i64,
+    discrete_features_to_report: BTreeSet<i64>,
+    continuous_features_to_report: BTreeSet<i64>,
+    discrete_feature_metrics: &'static HistogramVec,
+    continuous_feature_metrics: &'static HistogramVec,
+}
+
+impl Display for BatchPredictionRequestToTorchTensorConverter {
+    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+        write!(
+            f,
+            "all_config_path: {}, seg_dense_config_path: {}",
+            self.all_config_path, self.seg_dense_config_path
+        )
+    }
+}
+
+impl BatchPredictionRequestToTorchTensorConverter {
+    pub fn new(
+        model_dir: &str,
+        model_version: &str,
+        reporting_feature_ids: Vec<(i64, &str)>,
+        register_metric_fn: Option<impl Fn(&HistogramVec)>,
+    ) -> BatchPredictionRequestToTorchTensorConverter {
+        let all_config_path = format!("{model_dir}/{model_version}/all_config.json");
+        let seg_dense_config_path =
+            format!("{model_dir}/{model_version}/segdense_transform_spec_home_recap_2022.json");
+        let seg_dense_config = util::load_config(&seg_dense_config_path);
+        let all_config = all_config::parse(
+            &fs::read_to_string(&all_config_path)
+                .unwrap_or_else(|error| panic!("error loading all_config.json - {error}")),
+        )
+        .unwrap();
+
+        let feature_mapper = util::load_from_parsed_config_ref(&seg_dense_config);
+
+        let user_embedding_feature_id = Self::get_feature_id(
+            &all_config.train_data.seg_dense_schema.renamed_features.user_embedding,
+            &seg_dense_config,
+        );
+        let user_eng_embedding_feature_id = Self::get_feature_id(
+            &all_config.train_data.seg_dense_schema.renamed_features.user_eng_embedding,
+            &seg_dense_config,
+        );
+        let author_embedding_feature_id = Self::get_feature_id(
+            &all_config.train_data.seg_dense_schema.renamed_features.author_embedding,
+            &seg_dense_config,
+        );
+        static METRICS: OnceCell<(HistogramVec, HistogramVec)> = OnceCell::new();
+        let (discrete_feature_metrics, continuous_feature_metrics) = METRICS.get_or_init(|| {
+            let discrete = HistogramVec::new(
+                HistogramOpts::new(":navi:feature_id:discrete", "Discrete Feature ID values")
+                    .buckets(Vec::from([
+                        0.0f64, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0,
+                        120.0, 130.0, 140.0, 150.0, 160.0, 170.0, 180.0, 190.0, 200.0, 250.0,
+                        300.0, 500.0, 1000.0, 10000.0, 100000.0,
+                    ])),
+                &["feature_id"],
+            )
+            .expect("metric cannot be created");
+            let continuous = HistogramVec::new(
+                HistogramOpts::new(
+                    ":navi:feature_id:continuous",
+                    "continuous Feature ID values",
+                )
+                .buckets(Vec::from([
+                    0.0f64, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0,
+                    120.0, 130.0, 140.0, 150.0, 160.0, 170.0, 180.0, 190.0, 200.0, 250.0, 300.0,
+                    500.0, 1000.0, 10000.0, 100000.0,
+                ])),
+                &["feature_id"],
+            )
+            .expect("metric cannot be created");
+            if let Some(r) = register_metric_fn {
+                r(&discrete);
+                r(&continuous);
+            }
+            (discrete, continuous)
+        });
+
+        let mut discrete_features_to_report = BTreeSet::new();
+        let mut continuous_features_to_report = BTreeSet::new();
+
+        for (feature_id, feature_type) in reporting_feature_ids.iter() {
+            match *feature_type {
+                "discrete" => discrete_features_to_report.insert(*feature_id),
+                "continuous" => continuous_features_to_report.insert(*feature_id),
+                _ => panic!("Invalid feature type {feature_type} for reporting metrics!"),
+            };
+        }
+
+        BatchPredictionRequestToTorchTensorConverter {
+            all_config,
+            seg_dense_config,
+            all_config_path,
+            seg_dense_config_path,
+            feature_mapper,
+            user_embedding_feature_id,
+            user_eng_embedding_feature_id,
+            author_embedding_feature_id,
+            discrete_features_to_report,
+            continuous_features_to_report,
+            discrete_feature_metrics,
+            continuous_feature_metrics,
+        }
+    }
+
+    fn get_feature_id(feature_name: &str, seg_dense_config: &Root) -> i64 {
+        // Given a feature name, look up the complex feature type id.
+        for feature in &seg_dense_config.complex_feature_type_transform_spec {
+            if feature.full_feature_name == feature_name {
+                return feature.feature_id;
+            }
+        }
+        -1
+    }
+
+    fn parse_batch_prediction_request(bytes: Vec<u8>) -> BatchPredictionRequest {
+        // Parse a batch prediction request into a struct from its byte array representation.
+        let mut bc = TBufferChannel::with_capacity(bytes.len(), 0);
+        bc.set_readable_bytes(&bytes);
+        let mut protocol = TBinaryInputProtocol::new(bc, true);
+        BatchPredictionRequest::read_from_in_protocol(&mut protocol).unwrap()
+    }
+
+    fn get_embedding_tensors(
+        &self,
+        bprs: &[BatchPredictionRequest],
+        feature_id: i64,
+        batch_size: &[usize],
+    ) -> Array2<f32> {
+        // Given an embedding feature id, extract the float tensor arrays into one tensor.
+        let cols: usize = 200;
+        let rows: usize = batch_size[batch_size.len() - 1];
+        let total_size = rows * cols;
+
+        let mut working_set = vec![0 as f32; total_size];
+        let mut bpr_start = 0;
+        for (bpr, &bpr_end) in bprs.iter().zip(batch_size) {
+            if bpr.common_features.is_some()
+                && bpr.common_features.as_ref().unwrap().tensors.is_some()
+                && bpr
+                    .common_features
+                    .as_ref()
+                    .unwrap()
+                    .tensors
+                    .as_ref()
+                    .unwrap()
+                    .contains_key(&feature_id)
+            {
+                let source_tensor = bpr
+                    .common_features
+                    .as_ref()
+                    .unwrap()
+                    .tensors
+                    .as_ref()
+                    .unwrap()
+                    .get(&feature_id)
+                    .unwrap();
+                let tensor = match source_tensor {
+                    GeneralTensor::FloatTensor(float_tensor) => float_tensor
+                        .floats
+                        .iter()
+                        .map(|x| x.into_inner() as f32)
+                        .collect::<Vec<_>>(),
+                    _ => vec![0 as f32; cols],
+                };
+
+                // Since the tensor is found in the common features, add it to all rows of this batch.
+                for row in bpr_start..bpr_end {
+                    for col in 0..cols {
+                        working_set[row * cols + col] = tensor[col];
+                    }
+                }
+            }
+            // Find the feature in the individual feature list and add it to the corresponding row.
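+            // (Per-record tensors are written after the common-feature pass above,
+            // so an individual DataRecord's tensor overrides the common value for its row.)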
+            for (index, datarecord) in bpr.individual_features_list.iter().enumerate() {
+                if datarecord.tensors.is_some()
+                    && datarecord.tensors.as_ref().unwrap().contains_key(&feature_id)
+                {
+                    let source_tensor = datarecord
+                        .tensors
+                        .as_ref()
+                        .unwrap()
+                        .get(&feature_id)
+                        .unwrap();
+                    let tensor = match source_tensor {
+                        GeneralTensor::FloatTensor(float_tensor) => float_tensor
+                            .floats
+                            .iter()
+                            .map(|x| x.into_inner() as f32)
+                            .collect::<Vec<_>>(),
+                        _ => vec![0 as f32; cols],
+                    };
+                    for col in 0..cols {
+                        working_set[(bpr_start + index) * cols + col] = tensor[col];
+                    }
+                }
+            }
+            bpr_start = bpr_end;
+        }
+        Array2::<f32>::from_shape_vec([rows, cols], working_set).unwrap()
+    }
+
+    // TODO: Refactor, create a generic version with different type and field accessors.
+    // For example, parameterize and then instantiate the following:
+    // (FLOAT --> FLOAT, DataRecord.continuous_feature)
+    // (BOOL --> INT64, DataRecord.binary_feature)
+    // (INT64 --> INT64, DataRecord.discrete_feature)
+    fn get_continuous(&self, bprs: &[BatchPredictionRequest], batch_ends: &[usize]) -> InputTensor {
+        // These need to be part of model schema
+        let rows = batch_ends[batch_ends.len() - 1];
+        let cols = 5293;
+        let full_size = rows * cols;
+        let default_val = f32::NAN;
+
+        let mut tensor = vec![default_val; full_size];
+
+        let mut bpr_start = 0;
+        for (bpr, &bpr_end) in bprs.iter().zip(batch_ends) {
+            // Common features
+            if bpr.common_features.is_some()
+                && bpr.common_features.as_ref().unwrap().continuous_features.is_some()
+            {
+                let common_features = bpr
+                    .common_features
+                    .as_ref()
+                    .unwrap()
+                    .continuous_features
+                    .as_ref()
+                    .unwrap();
+
+                for feature in common_features {
+                    if let Some(f_info) = self.feature_mapper.get(feature.0) {
+                        let idx = f_info.index_within_tensor as usize;
+                        if idx < cols {
+                            // Set value in each row
+                            for r in bpr_start..bpr_end {
+                                let flat_index = r * cols + idx;
+                                tensor[flat_index] = feature.1.into_inner() as f32;
+                            }
+                        }
+                    }
+                    if self.continuous_features_to_report.contains(feature.0) {
+                        self.continuous_feature_metrics
+                            .with_label_values(&[feature.0.to_string().as_str()])
+                            .observe(feature.1.into_inner())
+                    } else if self.discrete_features_to_report.contains(feature.0) {
+                        self.discrete_feature_metrics
+                            .with_label_values(&[feature.0.to_string().as_str()])
+                            .observe(feature.1.into_inner())
+                    }
+                }
+            }
+
+            // Process the batch of datarecords
+            for r in bpr_start..bpr_end {
+                let dr: &DataRecord = &bpr.individual_features_list[r - bpr_start];
+                if dr.continuous_features.is_some() {
+                    for feature in dr.continuous_features.as_ref().unwrap() {
+                        if let Some(f_info) = self.feature_mapper.get(feature.0) {
+                            let idx = f_info.index_within_tensor as usize;
+                            let flat_index = r * cols + idx;
+                            if flat_index < tensor.len() && idx < cols {
+                                tensor[flat_index] = feature.1.into_inner() as f32;
+                            }
+                        }
+                        if self.continuous_features_to_report.contains(feature.0) {
+                            self.continuous_feature_metrics
+                                .with_label_values(&[feature.0.to_string().as_str()])
+                                .observe(feature.1.into_inner())
+                        } else if self.discrete_features_to_report.contains(feature.0) {
+                            self.discrete_feature_metrics
+                                .with_label_values(&[feature.0.to_string().as_str()])
+                                .observe(feature.1.into_inner())
+                        }
+                    }
+                }
+            }
+            bpr_start = bpr_end;
+        }
+
+        InputTensor::FloatTensor(
+            Array2::<f32>::from_shape_vec([rows, cols], tensor)
+                .unwrap()
+                .into_dyn(),
+        )
+    }
+
+    fn get_binary(&self, bprs: &[BatchPredictionRequest], batch_ends: &[usize]) -> InputTensor {
+        // These need to be part of model schema
+        let rows = batch_ends[batch_ends.len() - 1];
+        let cols = 149;
+        let full_size = rows * cols;
+        let default_val = 0;
+
+        let mut v = vec![default_val; full_size];
+
+        let mut bpr_start = 0;
+        for (bpr, &bpr_end) in bprs.iter().zip(batch_ends) {
+            // Common features
+            if bpr.common_features.is_some()
+                && bpr.common_features.as_ref().unwrap().binary_features.is_some()
+            {
+                let common_features = bpr
+                    .common_features
+                    .as_ref()
+                    .unwrap()
+                    .binary_features
+                    .as_ref()
+                    .unwrap();
+
+                for feature in common_features {
+                    if let Some(f_info) = self.feature_mapper.get(feature) {
+                        let idx = f_info.index_within_tensor as usize;
+                        if idx < cols {
+                            // Set value in each row
+                            for r in bpr_start..bpr_end {
+                                let flat_index = r * cols + idx;
+                                v[flat_index] = 1;
+                            }
+                        }
+                    }
+                }
+            }
+
+            // Process the batch of datarecords
+            for r in bpr_start..bpr_end {
+                let dr: &DataRecord = &bpr.individual_features_list[r - bpr_start];
+                if dr.binary_features.is_some() {
+                    for feature in dr.binary_features.as_ref().unwrap() {
+                        if let Some(f_info) = self.feature_mapper.get(feature) {
+                            let idx = f_info.index_within_tensor as usize;
+                            let flat_index = r * cols + idx;
+                            v[flat_index] = 1;
+                        }
+                    }
+                }
+            }
+            bpr_start = bpr_end;
+        }
+        InputTensor::Int64Tensor(
+            Array2::<i64>::from_shape_vec([rows, cols], v)
+                .unwrap()
+                .into_dyn(),
+        )
+    }
+
+    #[allow(dead_code)]
+    fn get_discrete(&self, bprs: &[BatchPredictionRequest], batch_ends: &[usize]) -> InputTensor {
+        // These need to be part of model schema
+        let rows = batch_ends[batch_ends.len() - 1];
+        let cols = 320;
+        let full_size = rows * cols;
+        let default_val = 0;
+
+        let mut v = vec![default_val; full_size];
+
+        let mut bpr_start = 0;
+        for (bpr, &bpr_end) in bprs.iter().zip(batch_ends) {
+            // Common features
+            if bpr.common_features.is_some()
+                && bpr.common_features.as_ref().unwrap().discrete_features.is_some()
+            {
+                let common_features = bpr
+                    .common_features
+                    .as_ref()
+                    .unwrap()
+                    .discrete_features
+                    .as_ref()
+                    .unwrap();
+
+                for feature in common_features {
+                    if let Some(f_info) = self.feature_mapper.get(feature.0) {
+                        let idx = f_info.index_within_tensor as usize;
+                        if idx < cols {
+                            // Set value in each row
+                            for r in bpr_start..bpr_end {
+                                let flat_index = r * cols + idx;
+                                v[flat_index] = *feature.1;
+                            }
+                        }
+                    }
+                    if self.discrete_features_to_report.contains(feature.0) {
+                        self.discrete_feature_metrics
+                            .with_label_values(&[feature.0.to_string().as_str()])
+                            .observe(*feature.1 as f64)
+                    }
+                }
+            }
+
+            // Process the batch of datarecords
+            for r in bpr_start..bpr_end {
+                let dr: &DataRecord = &bpr.individual_features_list[r - bpr_start];
+                if dr.discrete_features.is_some() {
+                    for feature in dr.discrete_features.as_ref().unwrap() {
+                        if let Some(f_info) = self.feature_mapper.get(feature.0) {
+                            let idx = f_info.index_within_tensor as usize;
+                            let flat_index = r * cols + idx;
+                            if flat_index < v.len() && idx < cols {
+                                v[flat_index] = *feature.1;
+                            }
+                        }
+                        if self.discrete_features_to_report.contains(feature.0) {
+                            self.discrete_feature_metrics
+                                .with_label_values(&[feature.0.to_string().as_str()])
+                                .observe(*feature.1 as f64)
+                        }
+                    }
+                }
+            }
+            bpr_start = bpr_end;
+        }
+        InputTensor::Int64Tensor(
+            Array2::<i64>::from_shape_vec([rows, cols], v)
+                .unwrap()
+                .into_dyn(),
+        )
+    }
+
+    fn get_user_embedding(
+        &self,
+        bprs: &[BatchPredictionRequest],
+        batch_ends: &[usize],
+    ) -> InputTensor {
+        InputTensor::FloatTensor(
+            self.get_embedding_tensors(bprs, self.user_embedding_feature_id, batch_ends)
+                .into_dyn(),
+        )
+    }
+
+    fn get_eng_embedding(
+        &self,
+        bpr: &[BatchPredictionRequest],
+        batch_ends: &[usize],
+    ) -> InputTensor {
+        InputTensor::FloatTensor(
+            self.get_embedding_tensors(bpr, self.user_eng_embedding_feature_id, batch_ends)
+                .into_dyn(),
+        )
+    }
+
+    fn get_author_embedding(
+        &self,
+        bpr: &[BatchPredictionRequest],
+        batch_ends: &[usize],
+    ) -> InputTensor {
+        InputTensor::FloatTensor(
+            self.get_embedding_tensors(bpr, self.author_embedding_feature_id, batch_ends)
+                .into_dyn(),
+        )
+    }
+}
+
+impl Converter for BatchPredictionRequestToTorchTensorConverter {
+    fn convert(&self, batched_bytes: Vec<Vec<u8>>) -> (Vec<InputTensor>, Vec<usize>) {
+        let bprs = batched_bytes
+            .into_iter()
+            .map(|bytes| {
+                BatchPredictionRequestToTorchTensorConverter::parse_batch_prediction_request(bytes)
+            })
+            .collect::<Vec<_>>();
+        let batch_ends = bprs
+            .iter()
+            .map(|bpr| bpr.individual_features_list.len())
+            .scan(0usize, |acc, e| {
+                // running total
+                *acc += e;
+                Some(*acc)
+            })
+            .collect::<Vec<_>>();
+
+        let t1 = self.get_continuous(&bprs, &batch_ends);
+        let t2 = self.get_binary(&bprs, &batch_ends);
+        //let _t3 = self.get_discrete(&bprs, &batch_ends);
+        let t4 = self.get_user_embedding(&bprs, &batch_ends);
+        let t5 = self.get_eng_embedding(&bprs, &batch_ends);
+        let t6 = self.get_author_embedding(&bprs, &batch_ends);
+
+        (vec![t1, t2, t4, t5, t6], batch_ends)
+    }
+}
diff --git a/navi/dr_transform/src/util.rs b/navi/dr_transform/src/util.rs new file mode 100644 index 000000000..83b99805a --- /dev/null +++ b/navi/dr_transform/src/util.rs @@ -0,0 +1,32 @@
+use npyz::WriterBuilder;
+use npyz::{AutoSerialize, WriteOptions};
+use std::io::BufWriter;
+use std::{
+    fs::File,
+    io::{self, BufRead},
+};
+
+pub fn load_batch_prediction_request_base64(file_name: &str) -> Vec<Vec<u8>> {
+    let file = File::open(file_name).expect("could not read file");
+    let mut result = vec![];
+    for (mut line_count, line) in io::BufReader::new(file).lines().enumerate() {
+        line_count += 1; // report 1-based line numbers
+        match base64::decode(line.unwrap().trim()) {
+            Ok(payload) => result.push(payload),
+            Err(err) => println!("error decoding line {file_name}:{line_count} - {err}"),
+        }
+    }
+    println!("result len: {}", result.len());
+    result
+}
+
+pub fn save_to_npy<T: AutoSerialize + Clone>(data: &[T], save_to: String) {
+    let mut writer = WriteOptions::new()
+        .default_dtype()
+        .shape(&[data.len() as u64, 1])
+        .writer(BufWriter::new(File::create(save_to).unwrap()))
+        .begin_nd()
+        .unwrap();
+    writer.extend(data.to_owned()).unwrap();
+    writer.finish().unwrap();
+}
diff --git a/recos-injector/README.md b/recos-injector/README.md new file mode 100644 index 000000000..a391578c2 --- /dev/null +++ b/recos-injector/README.md @@ -0,0 +1,40 @@
+# Recos-Injector
+
+Recos-Injector is a streaming event processor used to build input streams for GraphJet-based services. It is a general-purpose tool that consumes arbitrary incoming event streams (e.g., Fav, RT, Follow, client_events, etc.), applies filtering, and combines and publishes cleaned-up events to the corresponding GraphJet services. Each GraphJet-based service subscribes to a dedicated Kafka topic, and Recos-Injector enables GraphJet-based services to consume any event they want.
+
+## How to run Recos-Injector server tests
+
+You can run tests by using the following commands from your project's root directory:
+
+    $ bazel build recos-injector/...
+    $ bazel test recos-injector/...
+
+## How to run recos-injector-server in development on a local machine
+
+The simplest way to stand up a service is to run it locally.
+To run recos-injector-server in development mode, compile the project and then
+execute it with `bazel run`:
+
+    $ bazel build recos-injector/server:bin
+    $ bazel run recos-injector/server:bin
+
+A tunnel can be set up so that downstream queries work properly.
+Upon successful server startup, try to `curl` its admin endpoint in another
+terminal:
+
+    $ curl -s localhost:9990/admin/ping
+    pong
+
+Run `curl -s localhost:9990/admin` to see a list of all available admin endpoints.
+
+## Querying Recos-Injector server from a Scala console
+
+Recos-Injector does not have a Thrift endpoint. Instead, it reads Event Bus and Kafka queues and writes to the Recos-Injector Kafka.
+
+## Generating a package for deployment
+
+To package your service into a zip file for deployment, run:
+
+    $ bazel bundle recos-injector/server:bin --bundle-jvm-archive=zip
+
+If the command is successful, a file named `dist/recos-injector-server.zip` will be created.
diff --git a/simclusters-ann/README.md b/simclusters-ann/README.md
new file mode 100644
index 000000000..69ff6cffa
--- /dev/null
+++ b/simclusters-ann/README.md
@@ -0,0 +1,99 @@
+# SimClusters ANN
+
+SimClusters ANN is a service that returns tweet candidate recommendations given a SimClusters embedding. The service implements tweet recommendations based on the Approximate Cosine Similarity algorithm.
+
+The cosine similarity between two Tweets' SimClusters embeddings represents the relevance level of the two tweets in SimClusters space. The traditional algorithm for calculating cosine similarity is expensive and hard to support with the existing infrastructure. Therefore, the Approximate Cosine Similarity algorithm was introduced to save response time by reducing I/O operations.
+
+## Background
+The SimClusters V2 runtime infrastructure introduces SimClusters and its online and offline approaches. A Heron job builds the mapping between SimClusters and Tweets, saving the top 400 Tweets for each SimCluster and the top 100 SimClusters for each Tweet. Favorite score and follow score are the two types of tweet score; in this document, a Tweet's top 100 SimClusters by favorite score serve as its Tweet SimClusters embedding.
+
+The cosine similarity between two Tweets' SimClusters embeddings represents the relevance level of the two tweets in SimClusters space. The score varies from 0 to 1, and a high cosine similarity score (>= 0.7 in Prod) means that the users who like the two tweets share the same SimClusters.
+
+SimClusters from the Linear Algebra Perspective discussed the difference between the dot product and cosine similarity in SimClusters space. We believe the cosine similarity approach is better because it avoids the bias of tweet popularity.
+
+However, calculating the cosine similarity between two Tweets is quite expensive during Tweet candidate generation. In TWISTLY, we scan at most 15,000 (6 source tweets * 25 clusters * 100 tweets per cluster) tweet candidates for every Home Timeline request. The traditional algorithm needs to make API calls to fetch 15,000 tweet SimClusters embeddings. Considering that we need to process over 6,000 RPS, this is hard to support with the existing infrastructure.
+
+## SimClusters Approximate Cosine Similarity Core Algorithm
+
+1. Provide a source SimClusters embedding *SV*, *SV = [(SC1, Score), (SC2, Score), (SC3, Score) …]*
+
+2. Fetch the top *M* tweets for each of the top *N* SimClusters in *SV*. In Prod, *M = 400*, *N = 50*. Tweets may appear in multiple SimClusters.
+
+| | | | |
+|---|---|---|---|
+| SC1 | T1: Score | T2: Score | ... |
+| SC2 | T3: Score | T4: Score | ... |
+
+3. Based on the previous table, generate an *(M x N) x N* matrix *R*. *R* represents the approximate SimClusters embeddings of the *M x N* tweets: each embedding only contains the top *N* SimClusters from *SV*, only the top *M* tweets from each SimCluster have a score, and all other entries are 0.
+
+| | SC1 | SC2 | ... |
+|---|---|---|---|
+| T1 | Score | 0 | ... |
+| T2 | Score | 0 | ... |
+| T3 | 0 | Score | ... |
+
+4. Compute the dot product between the source vector and the approximate vector of each tweet (calculate *R • SV^T*). Take the top *X* tweets. In Prod, *X = 200*.
+
+5. Fetch the full SimClusters embeddings of the *X* tweets, calculate the exact cosine similarity between each of them and *SV*, and return the top *Y* tweets above a certain threshold *Z*.
+
+Approximate Cosine Similarity is an approximate algorithm: instead of fetching *M x N* tweet embeddings, it only fetches *X* of them (about 6% in prod). Based on the metrics gathered during TWISTLY development, most of the response time is consumed by I/O operations, so the Approximate Cosine Similarity algorithm saves a large amount of response time.
+
+The approximate algorithm is based on the assumption that the higher the dot product between the source tweet's SimClusters embedding and a candidate tweet's truncated SimClusters embedding, the more likely the two tweets are relevant to each other. The additional cosine similarity filter guarantees that the results are not affected by popularity bias.
+
+Adjusting *M*, *N*, *X*, *Y*, and *Z* balances precision and recall for different products. The implementation of approximate cosine similarity is used by TWISTLY, interest-based tweet recommendation, Similar Tweet in RUX, and author-based recommendation. This algorithm is also suitable for future user or entity recommendations based on SimClusters embeddings.
+
+## Build and Test
+
+Compile the service:
+
+    $ ./bazel build simclusters-ann/server:bin
+
+Run unit tests:
+
+    $ ./bazel test simclusters-ann/server:bin
+
+## Deploy
+
+### Prerequisite for devel deployments
+First of all, you need to generate Service-to-Service certificates for use while developing locally. This only needs to be done ONCE.
+
+To add cert files to Aurora (if you want to deploy to DEVEL):
+```
+$ developer-cert-util --env devel --job simclusters-ann
+```
+
+### Deploying to devel/staging from a local build
+For reference:
+
+    $ ./simclusters-ann/bin/deploy.sh --help
+
+Use the script to build the service from your local branch, upload it to Packer, and deploy it to devel Aurora:
+
+    $ ./simclusters-ann/bin/deploy.sh atla $USER devel simclusters-ann
+
+You can also deploy to staging with this script. E.g., to deploy to instance 1:
+
+    $ ./simclusters-ann/bin/deploy.sh atla simclusters-ann staging simclusters-ann
+
+### Deploying to production
+
+Production deploys should be managed by Workflows.
+_Do not_ deploy to production unless it is an emergency and you have approval from oncall.
+
+##### It is not recommended to deploy from the command line into production environments, unless 1) you are testing a small change in a Canary shard [0,9], or 2) it is an absolute emergency. Be sure to make oncalls aware of the changes you're deploying.
+ + $ ./simclusters-ann/bin/deploy.sh atla simclusters-ann prod simclusters-ann +In the case of multiple instances, + + $ ./simclusters-ann/bin/deploy.sh atla simclusters-ann prod simclusters-ann - + +## Checking Deployed Version and Rolling Back + +Wherever possible, roll back using Workflows by finding an earlier good version and clicking the "rollback" button in the UI. This is the safest and least error-prone method. diff --git a/src/java/com/twitter/search/common/converter/earlybird/BasicIndexingConverter.java b/src/java/com/twitter/search/common/converter/earlybird/BasicIndexingConverter.java new file mode 100644 index 000000000..afde8a84e --- /dev/null +++ b/src/java/com/twitter/search/common/converter/earlybird/BasicIndexingConverter.java @@ -0,0 +1,647 @@ +package com.twitter.search.common.converter.earlybird; + +import java.io.IOException; +import java.util.Date; +import java.util.List; +import java.util.Optional; +import javax.annotation.concurrent.NotThreadSafe; + +import com.google.common.base.Preconditions; + +import org.apache.commons.collections.CollectionUtils; +import org.joda.time.DateTime; +import org.joda.time.DateTimeZone; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.twitter.common_internal.text.version.PenguinVersion; +import com.twitter.search.common.converter.earlybird.EncodedFeatureBuilder.TweetFeatureWithEncodeFeatures; +import com.twitter.search.common.indexing.thriftjava.Place; +import com.twitter.search.common.indexing.thriftjava.PotentialLocation; +import com.twitter.search.common.indexing.thriftjava.ProfileGeoEnrichment; +import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents; +import com.twitter.search.common.indexing.thriftjava.VersionedTweetFeatures; +import com.twitter.search.common.metrics.SearchCounter; +import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser; +import com.twitter.search.common.relevance.entities.GeoObject; +import com.twitter.search.common.relevance.entities.TwitterMessage; +import com.twitter.search.common.relevance.entities.TwitterQuotedMessage; +import com.twitter.search.common.schema.base.ImmutableSchemaInterface; +import com.twitter.search.common.schema.base.Schema; +import com.twitter.search.common.schema.earlybird.EarlybirdCluster; +import com.twitter.search.common.schema.earlybird.EarlybirdEncodedFeatures; +import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants; +import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant; +import com.twitter.search.common.schema.earlybird.EarlybirdThriftDocumentBuilder; +import com.twitter.search.common.schema.thriftjava.ThriftDocument; +import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent; +import com.twitter.search.common.schema.thriftjava.ThriftIndexingEventType; +import com.twitter.search.common.util.spatial.GeoUtil; +import com.twitter.search.common.util.text.NormalizerHelper; +import com.twitter.tweetypie.thriftjava.ComposerSource; + +/** + * Converts a TwitterMessage into a ThriftVersionedEvents. This is only responsible for data that + * is available immediately when a Tweet is created. Some data, like URL data, isn't available + * immediately, and so it is processed later, in the DelayedIndexingConverter and sent as an + * update. In order to achieve this we create the document in 2 passes: + * + * 1. BasicIndexingConverter builds thriftVersionedEvents with the fields that do not require + * external services. + * + * 2. 
DelayedIndexingConverter builds all the document fields depending on external services, once + * those services have processed the relevant Tweet and we have retrieved that data. + */ +@NotThreadSafe +public class BasicIndexingConverter { + private static final Logger LOG = LoggerFactory.getLogger(BasicIndexingConverter.class); + + private static final SearchCounter NUM_NULLCAST_FEATURE_FLAG_SET_TWEETS = + SearchCounter.export("num_nullcast_feature_flag_set_tweets"); + private static final SearchCounter NUM_NULLCAST_TWEETS = + SearchCounter.export("num_nullcast_tweets"); + private static final SearchCounter NUM_NON_NULLCAST_TWEETS = + SearchCounter.export("num_non_nullcast_tweets"); + private static final SearchCounter ADJUSTED_BAD_CREATED_AT_COUNTER = + SearchCounter.export("adjusted_incorrect_created_at_timestamp"); + private static final SearchCounter INCONSISTENT_TWEET_ID_AND_CREATED_AT_MS = + SearchCounter.export("inconsistent_tweet_id_and_created_at_ms"); + private static final SearchCounter NUM_SELF_THREAD_TWEETS = + SearchCounter.export("num_self_thread_tweets"); + private static final SearchCounter NUM_EXCLUSIVE_TWEETS = + SearchCounter.export("num_exclusive_tweets"); + + // If a tweet carries a timestamp smaller than this timestamp, we consider the timestamp invalid, + // because twitter does not even exist back then before: Sun, 01 Jan 2006 00:00:00 GMT + private static final long VALID_CREATION_TIME_THRESHOLD_MILLIS = + new DateTime(2006, 1, 1, 0, 0, 0, DateTimeZone.UTC).getMillis(); + + private final EncodedFeatureBuilder featureBuilder; + private final Schema schema; + private final EarlybirdCluster cluster; + + public BasicIndexingConverter(Schema schema, EarlybirdCluster cluster) { + this.featureBuilder = new EncodedFeatureBuilder(); + this.schema = schema; + this.cluster = cluster; + } + + /** + * This function converts TwitterMessage to ThriftVersionedEvents, which is a generic data + * structure that can be consumed by Earlybird directly. 
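+   * One ThriftIndexingEvent is created per requested Penguin version, keyed in the returned
+   * ThriftVersionedEvents by the version's byte value, with event type INSERT and the Tweet ID
+   * as the sort ID.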
+ */ + public ThriftVersionedEvents convertMessageToThrift( + TwitterMessage message, + boolean strict, + List penguinVersions) throws IOException { + Preconditions.checkNotNull(message); + Preconditions.checkNotNull(penguinVersions); + + ThriftVersionedEvents versionedEvents = new ThriftVersionedEvents() + .setId(message.getId()); + + ImmutableSchemaInterface schemaSnapshot = schema.getSchemaSnapshot(); + + for (PenguinVersion penguinVersion : penguinVersions) { + ThriftDocument document = + buildDocumentForPenguinVersion(schemaSnapshot, message, strict, penguinVersion); + + ThriftIndexingEvent thriftIndexingEvent = new ThriftIndexingEvent() + .setDocument(document) + .setEventType(ThriftIndexingEventType.INSERT) + .setSortId(message.getId()); + message.getFromUserTwitterId().map(thriftIndexingEvent::setUid); + versionedEvents.putToVersionedEvents(penguinVersion.getByteValue(), thriftIndexingEvent); + } + + return versionedEvents; + } + + private ThriftDocument buildDocumentForPenguinVersion( + ImmutableSchemaInterface schemaSnapshot, + TwitterMessage message, + boolean strict, + PenguinVersion penguinVersion) throws IOException { + TweetFeatureWithEncodeFeatures tweetFeature = + featureBuilder.createTweetFeaturesFromTwitterMessage( + message, penguinVersion, schemaSnapshot); + + EarlybirdThriftDocumentBuilder builder = + buildBasicFields(message, schemaSnapshot, cluster, tweetFeature); + + buildUserFields(builder, message, tweetFeature.versionedFeatures, penguinVersion); + buildGeoFields(builder, message, tweetFeature.versionedFeatures); + buildRetweetAndReplyFields(builder, message, strict); + buildQuotesFields(builder, message); + buildVersionedFeatureFields(builder, tweetFeature.versionedFeatures); + buildAnnotationFields(builder, message); + buildNormalizedMinEngagementFields(builder, tweetFeature.encodedFeatures, cluster); + buildDirectedAtFields(builder, message); + + builder.withSpaceIdFields(message.getSpaceIds()); + + return builder.build(); + } + + /** + * Build the basic fields for a tweet. + */ + public static EarlybirdThriftDocumentBuilder buildBasicFields( + TwitterMessage message, + ImmutableSchemaInterface schemaSnapshot, + EarlybirdCluster cluster, + TweetFeatureWithEncodeFeatures tweetFeature) { + EarlybirdEncodedFeatures extendedEncodedFeatures = tweetFeature.extendedEncodedFeatures; + if (extendedEncodedFeatures == null && EarlybirdCluster.isTwitterMemoryFormatCluster(cluster)) { + extendedEncodedFeatures = EarlybirdEncodedFeatures.newEncodedTweetFeatures( + schemaSnapshot, EarlybirdFieldConstant.EXTENDED_ENCODED_TWEET_FEATURES_FIELD); + } + EarlybirdThriftDocumentBuilder builder = new EarlybirdThriftDocumentBuilder( + tweetFeature.encodedFeatures, + extendedEncodedFeatures, + new EarlybirdFieldConstants(), + schemaSnapshot); + + builder.withID(message.getId()); + + final Date createdAt = message.getDate(); + long createdAtMs = createdAt == null ? 0L : createdAt.getTime(); + + createdAtMs = fixCreatedAtTimeStampIfNecessary(message.getId(), createdAtMs); + + if (createdAtMs > 0L) { + builder.withCreatedAt((int) (createdAtMs / 1000)); + } + + builder.withTweetSignature(tweetFeature.versionedFeatures.getTweetSignature()); + + if (message.getConversationId() > 0) { + long conversationId = message.getConversationId(); + builder.withLongField( + EarlybirdFieldConstant.CONVERSATION_ID_CSF.getFieldName(), conversationId); + // We only index conversation ID when it is different from the tweet ID. 
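+      // For a conversation's root Tweet the conversation ID equals the Tweet ID, so indexing
+      // it again in the searchable field would be redundant.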
+ if (message.getId() != conversationId) { + builder.withLongField( + EarlybirdFieldConstant.CONVERSATION_ID_FIELD.getFieldName(), conversationId); + } + } + + if (message.getComposerSource().isPresent()) { + ComposerSource composerSource = message.getComposerSource().get(); + builder.withIntField( + EarlybirdFieldConstant.COMPOSER_SOURCE.getFieldName(), composerSource.getValue()); + if (composerSource == ComposerSource.CAMERA) { + builder.withCameraComposerSourceFlag(); + } + } + + EarlybirdEncodedFeatures encodedFeatures = tweetFeature.encodedFeatures; + if (encodedFeatures.isFlagSet(EarlybirdFieldConstant.FROM_VERIFIED_ACCOUNT_FLAG)) { + builder.addFilterInternalFieldTerm(EarlybirdFieldConstant.VERIFIED_FILTER_TERM); + } + if (encodedFeatures.isFlagSet(EarlybirdFieldConstant.FROM_BLUE_VERIFIED_ACCOUNT_FLAG)) { + builder.addFilterInternalFieldTerm(EarlybirdFieldConstant.BLUE_VERIFIED_FILTER_TERM); + } + + if (encodedFeatures.isFlagSet(EarlybirdFieldConstant.IS_OFFENSIVE_FLAG)) { + builder.withOffensiveFlag(); + } + + if (message.getNullcast()) { + NUM_NULLCAST_TWEETS.increment(); + builder.addFilterInternalFieldTerm(EarlybirdFieldConstant.NULLCAST_FILTER_TERM); + } else { + NUM_NON_NULLCAST_TWEETS.increment(); + } + if (encodedFeatures.isFlagSet(EarlybirdFieldConstant.IS_NULLCAST_FLAG)) { + NUM_NULLCAST_FEATURE_FLAG_SET_TWEETS.increment(); + } + if (message.isSelfThread()) { + builder.addFilterInternalFieldTerm( + EarlybirdFieldConstant.SELF_THREAD_FILTER_TERM); + NUM_SELF_THREAD_TWEETS.increment(); + } + + if (message.isExclusive()) { + builder.addFilterInternalFieldTerm(EarlybirdFieldConstant.EXCLUSIVE_FILTER_TERM); + builder.withLongField( + EarlybirdFieldConstant.EXCLUSIVE_CONVERSATION_AUTHOR_ID_CSF.getFieldName(), + message.getExclusiveConversationAuthorId()); + NUM_EXCLUSIVE_TWEETS.increment(); + } + + builder.withLanguageCodes(message.getLanguage(), message.getBCP47LanguageTag()); + + return builder; + } + + /** + * Build the user fields. + */ + public static void buildUserFields( + EarlybirdThriftDocumentBuilder builder, + TwitterMessage message, + VersionedTweetFeatures versionedTweetFeatures, + PenguinVersion penguinVersion) { + // 1. Set all the from user fields. + if (message.getFromUserTwitterId().isPresent()) { + builder.withLongField(EarlybirdFieldConstant.FROM_USER_ID_FIELD.getFieldName(), + message.getFromUserTwitterId().get()) + // CSF + .withLongField(EarlybirdFieldConstant.FROM_USER_ID_CSF.getFieldName(), + message.getFromUserTwitterId().get()); + } else { + LOG.warn("fromUserTwitterId is not set in TwitterMessage! 
Status id: " + message.getId()); + } + + if (message.getFromUserScreenName().isPresent()) { + String fromUser = message.getFromUserScreenName().get(); + String normalizedFromUser = + NormalizerHelper.normalizeWithUnknownLocale(fromUser, penguinVersion); + + builder + .withWhiteSpaceTokenizedScreenNameField( + EarlybirdFieldConstant.TOKENIZED_FROM_USER_FIELD.getFieldName(), + normalizedFromUser) + .withStringField(EarlybirdFieldConstant.FROM_USER_FIELD.getFieldName(), + normalizedFromUser); + + if (message.getTokenizedFromUserScreenName().isPresent()) { + builder.withCamelCaseTokenizedScreenNameField( + EarlybirdFieldConstant.CAMELCASE_USER_HANDLE_FIELD.getFieldName(), + fromUser, + normalizedFromUser, + message.getTokenizedFromUserScreenName().get()); + } + } + + Optional toUserScreenName = message.getToUserLowercasedScreenName(); + if (toUserScreenName.isPresent() && !toUserScreenName.get().isEmpty()) { + builder.withStringField( + EarlybirdFieldConstant.TO_USER_FIELD.getFieldName(), + NormalizerHelper.normalizeWithUnknownLocale(toUserScreenName.get(), penguinVersion)); + } + + if (versionedTweetFeatures.isSetUserDisplayNameTokenStreamText()) { + builder.withTokenStreamField(EarlybirdFieldConstant.TOKENIZED_USER_NAME_FIELD.getFieldName(), + versionedTweetFeatures.getUserDisplayNameTokenStreamText(), + versionedTweetFeatures.getUserDisplayNameTokenStream()); + } + } + + /** + * Build the geo fields. + */ + public static void buildGeoFields( + EarlybirdThriftDocumentBuilder builder, + TwitterMessage message, + VersionedTweetFeatures versionedTweetFeatures) { + double lat = GeoUtil.ILLEGAL_LATLON; + double lon = GeoUtil.ILLEGAL_LATLON; + if (message.getGeoLocation() != null) { + GeoObject location = message.getGeoLocation(); + builder.withGeoField(EarlybirdFieldConstant.GEO_HASH_FIELD.getFieldName(), + location.getLatitude(), location.getLongitude(), location.getAccuracy()); + + if (location.getSource() != null) { + builder.withStringField(EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), + EarlybirdFieldConstants.formatGeoType(location.getSource())); + } + + if (GeoUtil.validateGeoCoordinates(location.getLatitude(), location.getLongitude())) { + lat = location.getLatitude(); + lon = location.getLongitude(); + } + } + + // See SEARCH-14317 for investigation on how much space geo filed is used in archive cluster. + // In lucene archives, this CSF is needed regardless of whether geoLocation is set. 
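+    // lat/lon still hold GeoUtil.ILLEGAL_LATLON at this point if no valid coordinates were
+    // extracted above.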
+ builder.withLatLonCSF(lat, lon); + + if (versionedTweetFeatures.isSetTokenizedPlace()) { + Place place = versionedTweetFeatures.getTokenizedPlace(); + Preconditions.checkArgument(place.isSetId(), "Place ID not set for tweet " + + message.getId()); + Preconditions.checkArgument(place.isSetFullName(), + "Place full name not set for tweet " + message.getId()); + builder.addFilterInternalFieldTerm(EarlybirdFieldConstant.PLACE_ID_FIELD.getFieldName()); + builder + .withStringField(EarlybirdFieldConstant.PLACE_ID_FIELD.getFieldName(), place.getId()) + .withStringField(EarlybirdFieldConstant.PLACE_FULL_NAME_FIELD.getFieldName(), + place.getFullName()); + if (place.isSetCountryCode()) { + builder.withStringField(EarlybirdFieldConstant.PLACE_COUNTRY_CODE_FIELD.getFieldName(), + place.getCountryCode()); + } + } + + if (versionedTweetFeatures.isSetTokenizedProfileGeoEnrichment()) { + ProfileGeoEnrichment profileGeoEnrichment = + versionedTweetFeatures.getTokenizedProfileGeoEnrichment(); + Preconditions.checkArgument( + profileGeoEnrichment.isSetPotentialLocations(), + "ProfileGeoEnrichment.potentialLocations not set for tweet " + + message.getId()); + List potentialLocations = profileGeoEnrichment.getPotentialLocations(); + Preconditions.checkArgument( + !potentialLocations.isEmpty(), + "Found tweet with an empty ProfileGeoEnrichment.potentialLocations: " + + message.getId()); + builder.addFilterInternalFieldTerm(EarlybirdFieldConstant.PROFILE_GEO_FILTER_TERM); + for (PotentialLocation potentialLocation : potentialLocations) { + if (potentialLocation.isSetCountryCode()) { + builder.withStringField( + EarlybirdFieldConstant.PROFILE_GEO_COUNTRY_CODE_FIELD.getFieldName(), + potentialLocation.getCountryCode()); + } + if (potentialLocation.isSetRegion()) { + builder.withStringField(EarlybirdFieldConstant.PROFILE_GEO_REGION_FIELD.getFieldName(), + potentialLocation.getRegion()); + } + if (potentialLocation.isSetLocality()) { + builder.withStringField(EarlybirdFieldConstant.PROFILE_GEO_LOCALITY_FIELD.getFieldName(), + potentialLocation.getLocality()); + } + } + } + + builder.withPlacesField(message.getPlaces()); + } + + /** + * Build the retweet and reply fields. + */ + public static void buildRetweetAndReplyFields( + EarlybirdThriftDocumentBuilder builder, + TwitterMessage message, + boolean strict) { + long retweetUserIdVal = -1; + long sharedStatusIdVal = -1; + if (message.getRetweetMessage() != null) { + if (message.getRetweetMessage().getSharedId() != null) { + sharedStatusIdVal = message.getRetweetMessage().getSharedId(); + } + if (message.getRetweetMessage().hasSharedUserTwitterId()) { + retweetUserIdVal = message.getRetweetMessage().getSharedUserTwitterId(); + } + } + + long inReplyToStatusIdVal = -1; + long inReplyToUserIdVal = -1; + if (message.isReply()) { + if (message.getInReplyToStatusId().isPresent()) { + inReplyToStatusIdVal = message.getInReplyToStatusId().get(); + } + if (message.getToUserTwitterId().isPresent()) { + inReplyToUserIdVal = message.getToUserTwitterId().get(); + } + } + + buildRetweetAndReplyFields( + retweetUserIdVal, + sharedStatusIdVal, + inReplyToStatusIdVal, + inReplyToUserIdVal, + strict, + builder); + } + + /** + * Build the quotes fields. 
+ */ + public static void buildQuotesFields( + EarlybirdThriftDocumentBuilder builder, + TwitterMessage message) { + if (message.getQuotedMessage() != null) { + TwitterQuotedMessage quoted = message.getQuotedMessage(); + if (quoted != null && quoted.getQuotedStatusId() > 0 && quoted.getQuotedUserId() > 0) { + builder.withQuote(quoted.getQuotedStatusId(), quoted.getQuotedUserId()); + } + } + } + + /** + * Build directed at field. + */ + public static void buildDirectedAtFields( + EarlybirdThriftDocumentBuilder builder, + TwitterMessage message) { + if (message.getDirectedAtUserId().isPresent() && message.getDirectedAtUserId().get() > 0) { + builder.withDirectedAtUser(message.getDirectedAtUserId().get()); + builder.addFilterInternalFieldTerm(EarlybirdFieldConstant.DIRECTED_AT_FILTER_TERM); + } + } + + /** + * Build the versioned features for a tweet. + */ + public static void buildVersionedFeatureFields( + EarlybirdThriftDocumentBuilder builder, + VersionedTweetFeatures versionedTweetFeatures) { + builder + .withHashtagsField(versionedTweetFeatures.getHashtags()) + .withMentionsField(versionedTweetFeatures.getMentions()) + .withStocksFields(versionedTweetFeatures.getStocks()) + .withResolvedLinksText(versionedTweetFeatures.getNormalizedResolvedUrlText()) + .withTokenStreamField(EarlybirdFieldConstant.TEXT_FIELD.getFieldName(), + versionedTweetFeatures.getTweetTokenStreamText(), + versionedTweetFeatures.isSetTweetTokenStream() + ? versionedTweetFeatures.getTweetTokenStream() : null) + .withStringField(EarlybirdFieldConstant.SOURCE_FIELD.getFieldName(), + versionedTweetFeatures.getSource()) + .withStringField(EarlybirdFieldConstant.NORMALIZED_SOURCE_FIELD.getFieldName(), + versionedTweetFeatures.getNormalizedSource()); + + // Internal fields for smileys and question marks + if (versionedTweetFeatures.hasPositiveSmiley) { + builder.withStringField( + EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), + EarlybirdFieldConstant.HAS_POSITIVE_SMILEY); + } + if (versionedTweetFeatures.hasNegativeSmiley) { + builder.withStringField( + EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(), + EarlybirdFieldConstant.HAS_NEGATIVE_SMILEY); + } + if (versionedTweetFeatures.hasQuestionMark) { + builder.withStringField(EarlybirdFieldConstant.TEXT_FIELD.getFieldName(), + EarlybirdThriftDocumentBuilder.QUESTION_MARK); + } + } + + /** + * Build the escherbird annotations for a tweet. + */ + public static void buildAnnotationFields( + EarlybirdThriftDocumentBuilder builder, + TwitterMessage message) { + List escherbirdAnnotations = + message.getEscherbirdAnnotations(); + if (CollectionUtils.isEmpty(escherbirdAnnotations)) { + return; + } + + builder.addFacetSkipList(EarlybirdFieldConstant.ENTITY_ID_FIELD.getFieldName()); + + for (TwitterMessage.EscherbirdAnnotation annotation : escherbirdAnnotations) { + String groupDomainEntity = String.format("%d.%d.%d", + annotation.groupId, annotation.domainId, annotation.entityId); + String domainEntity = String.format("%d.%d", annotation.domainId, annotation.entityId); + String entity = String.format("%d", annotation.entityId); + + builder.withStringField(EarlybirdFieldConstant.ENTITY_ID_FIELD.getFieldName(), + groupDomainEntity); + builder.withStringField(EarlybirdFieldConstant.ENTITY_ID_FIELD.getFieldName(), + domainEntity); + builder.withStringField(EarlybirdFieldConstant.ENTITY_ID_FIELD.getFieldName(), + entity); + } + } + + /** + * Build the correct ThriftIndexingEvent's fields based on retweet and reply status. 
+ */ + public static void buildRetweetAndReplyFields( + long retweetUserIdVal, + long sharedStatusIdVal, + long inReplyToStatusIdVal, + long inReplyToUserIdVal, + boolean strict, + EarlybirdThriftDocumentBuilder builder) { + Optional retweetUserId = Optional.of(retweetUserIdVal).filter(x -> x > 0); + Optional sharedStatusId = Optional.of(sharedStatusIdVal).filter(x -> x > 0); + Optional inReplyToUserId = Optional.of(inReplyToUserIdVal).filter(x -> x > 0); + Optional inReplyToStatusId = Optional.of(inReplyToStatusIdVal).filter(x -> x > 0); + + // We have six combinations here. A Tweet can be + // 1) a reply to another tweet (then it has both in-reply-to-user-id and + // in-reply-to-status-id set), + // 2) directed-at a user (then it only has in-reply-to-user-id set), + // 3) not a reply at all. + // Additionally, it may or may not be a Retweet (if it is, then it has retweet-user-id and + // retweet-status-id set). + // + // We want to set some fields unconditionally, and some fields (reference-author-id and + // shared-status-id) depending on the reply/retweet combination. + // + // 1. Normal tweet (not a reply, not a retweet). None of the fields should be set. + // + // 2. Reply to a tweet (both in-reply-to-user-id and in-reply-to-status-id set). + // IN_REPLY_TO_USER_ID_FIELD should be set to in-reply-to-user-id + // SHARED_STATUS_ID_CSF should be set to in-reply-to-status-id + // IS_REPLY_FLAG should be set + // + // 3. Directed-at a user (only in-reply-to-user-id is set). + // IN_REPLY_TO_USER_ID_FIELD should be set to in-reply-to-user-id + // IS_REPLY_FLAG should be set + // + // 4. Retweet of a normal tweet (retweet-user-id and retweet-status-id are set). + // RETWEET_SOURCE_USER_ID_FIELD should be set to retweet-user-id + // SHARED_STATUS_ID_CSF should be set to retweet-status-id + // IS_RETWEET_FLAG should be set + // + // 5. Retweet of a reply (both in-reply-to-user-id and in-reply-to-status-id set, + // retweet-user-id and retweet-status-id are set). + // RETWEET_SOURCE_USER_ID_FIELD should be set to retweet-user-id + // SHARED_STATUS_ID_CSF should be set to retweet-status-id (retweet beats reply!) + // IS_RETWEET_FLAG should be set + // IN_REPLY_TO_USER_ID_FIELD should be set to in-reply-to-user-id + // IS_REPLY_FLAG should NOT be set + // + // 6. Retweet of a directed-at tweet (only in-reply-to-user-id is set, + // retweet-user-id and retweet-status-id are set). + // RETWEET_SOURCE_USER_ID_FIELD should be set to retweet-user-id + // SHARED_STATUS_ID_CSF should be set to retweet-status-id + // IS_RETWEET_FLAG should be set + // IN_REPLY_TO_USER_ID_FIELD should be set to in-reply-to-user-id + // IS_REPLY_FLAG should NOT be set + // + // In other words: + // SHARED_STATUS_ID_CSF logic: if this is a retweet SHARED_STATUS_ID_CSF should be set to + // retweet-status-id, otherwise if it's a reply to a tweet, it should be set to + // in-reply-to-status-id. + + Preconditions.checkState(retweetUserId.isPresent() == sharedStatusId.isPresent()); + + if (retweetUserId.isPresent()) { + builder.withNativeRetweet(retweetUserId.get(), sharedStatusId.get()); + + if (inReplyToUserId.isPresent()) { + // Set IN_REPLY_TO_USER_ID_FIELD even if this is a retweet of a reply. + builder.withInReplyToUserID(inReplyToUserId.get()); + } + } else { + // If this is a retweet of a reply, we don't want to mark it as a reply, or override fields + // set by the retweet logic. + // If we are in this branch, this is not a retweet. 
Potentially, we set the reply flag, + // and override shared-status-id and reference-author-id. + + if (inReplyToStatusId.isPresent()) { + if (strict) { + // Enforcing that if this is a reply to a tweet, then it also has a replied-to user. + Preconditions.checkState(inReplyToUserId.isPresent()); + } + builder.withReplyFlag(); + builder.withLongField( + EarlybirdFieldConstant.SHARED_STATUS_ID_CSF.getFieldName(), + inReplyToStatusId.get()); + builder.withLongField( + EarlybirdFieldConstant.IN_REPLY_TO_TWEET_ID_FIELD.getFieldName(), + inReplyToStatusId.get()); + } + if (inReplyToUserId.isPresent()) { + builder.withReplyFlag(); + builder.withInReplyToUserID(inReplyToUserId.get()); + } + } + } + + /** + * Build the engagement fields. + */ + public static void buildNormalizedMinEngagementFields( + EarlybirdThriftDocumentBuilder builder, + EarlybirdEncodedFeatures encodedFeatures, + EarlybirdCluster cluster) throws IOException { + if (EarlybirdCluster.isArchive(cluster)) { + int favoriteCount = encodedFeatures.getFeatureValue(EarlybirdFieldConstant.FAVORITE_COUNT); + int retweetCount = encodedFeatures.getFeatureValue(EarlybirdFieldConstant.RETWEET_COUNT); + int replyCount = encodedFeatures.getFeatureValue(EarlybirdFieldConstant.REPLY_COUNT); + builder + .withNormalizedMinEngagementField( + EarlybirdFieldConstant.NORMALIZED_FAVORITE_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD + .getFieldName(), + favoriteCount); + builder + .withNormalizedMinEngagementField( + EarlybirdFieldConstant.NORMALIZED_RETWEET_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD + .getFieldName(), + retweetCount); + builder + .withNormalizedMinEngagementField( + EarlybirdFieldConstant.NORMALIZED_REPLY_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD + .getFieldName(), + replyCount); + } + } + + /** + * As seen in SEARCH-5617, we sometimes have incorrect createdAt. This method tries to fix them + * by extracting creation time from snowflake when possible. + */ + public static long fixCreatedAtTimeStampIfNecessary(long id, long createdAtMs) { + if (createdAtMs < VALID_CREATION_TIME_THRESHOLD_MILLIS + && id > SnowflakeIdParser.SNOWFLAKE_ID_LOWER_BOUND) { + // This tweet has a snowflake ID, and we can extract timestamp from the ID. 
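+      // Snowflake IDs embed their creation time in the high-order bits, which makes the ID a
+      // more reliable timestamp source than the supplied createdAtMs.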
+ ADJUSTED_BAD_CREATED_AT_COUNTER.increment(); + return SnowflakeIdParser.getTimestampFromTweetId(id); + } else if (!SnowflakeIdParser.isTweetIDAndCreatedAtConsistent(id, createdAtMs)) { + LOG.error( + "Found inconsistent tweet ID and created at timestamp: [statusID={}], [createdAtMs={}]", + id, createdAtMs); + INCONSISTENT_TWEET_ID_AND_CREATED_AT_MS.increment(); + } + + return createdAtMs; + } +} diff --git a/src/java/com/twitter/search/earlybird/ml/ScoringModelsManager.java b/src/java/com/twitter/search/earlybird/ml/ScoringModelsManager.java new file mode 100644 index 000000000..0e12f18c7 --- /dev/null +++ b/src/java/com/twitter/search/earlybird/ml/ScoringModelsManager.java @@ -0,0 +1,155 @@ +package com.twitter.search.earlybird.ml; + +import java.io.IOException; + +import com.google.common.annotations.VisibleForTesting; +import com.google.common.base.Optional; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.twitter.search.common.file.AbstractFile; +import com.twitter.search.common.file.FileUtils; +import com.twitter.search.common.metrics.SearchStatsReceiver; +import com.twitter.search.common.schema.DynamicSchema; +import com.twitter.search.common.util.ml.prediction_engine.CompositeFeatureContext; +import com.twitter.search.common.util.ml.prediction_engine.LightweightLinearModel; +import com.twitter.search.common.util.ml.prediction_engine.ModelLoader; + +import static com.twitter.search.modeling.tweet_ranking.TweetScoringFeatures.CONTEXT; +import static com.twitter.search.modeling.tweet_ranking.TweetScoringFeatures.FeatureContextVersion.CURRENT_VERSION; + +/** + * Loads the scoring models for tweets and provides access to them. + * + * This class relies on a list of ModelLoader objects to retrieve the objects from them. It will + * return the first model found according to the order in the list. + * + * For production, we load models from 2 sources: classpath and HDFS. If a model is available + * from HDFS, we return it, otherwise we use the model from the classpath. + * + * The models used for default requests (i.e. not experiments) MUST be present in the + * classpath, this allows us to avoid errors if they can't be loaded from HDFS. + * Models for experiments can live only in HDFS, so we don't need to redeploy Earlybird if we + * want to test them. + */ +public class ScoringModelsManager { + + private static final Logger LOG = LoggerFactory.getLogger(ScoringModelsManager.class); + + /** + * Used when + * 1. Testing + * 2. The scoring models are disabled in the config + * 3. Exceptions thrown during loading the scoring models + */ + public static final ScoringModelsManager NO_OP_MANAGER = new ScoringModelsManager() { + @Override + public boolean isEnabled() { + return false; + } + }; + + private final ModelLoader[] loaders; + private final DynamicSchema dynamicSchema; + + public ScoringModelsManager(ModelLoader... loaders) { + this.loaders = loaders; + this.dynamicSchema = null; + } + + public ScoringModelsManager(DynamicSchema dynamicSchema, ModelLoader... loaders) { + this.loaders = loaders; + this.dynamicSchema = dynamicSchema; + } + + /** + * Indicates that the scoring models were enabled in the config and were loaded successfully + */ + public boolean isEnabled() { + return true; + } + + public void reload() { + for (ModelLoader loader : loaders) { + loader.run(); + } + } + + /** + * Loads and returns the model with the given name, if one exists. 
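+   * Loaders are consulted in constructor order, so when a model exists in multiple sources the
+   * first loader's copy wins (HDFS before classpath in production).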
+ */ + public Optional getModel(String modelName) { + for (ModelLoader loader : loaders) { + Optional model = loader.getModel(modelName); + if (model.isPresent()) { + return model; + } + } + return Optional.absent(); + } + + /** + * Creates an instance that loads models first from HDFS and the classpath resources. + * + * If the models are not found in HDFS, it uses the models from the classpath as fallback. + */ + public static ScoringModelsManager create( + SearchStatsReceiver serverStats, + String hdfsNameNode, + String hdfsBasedPath, + DynamicSchema dynamicSchema) throws IOException { + // Create a composite feature context so we can load both legacy and schema-based models + CompositeFeatureContext featureContext = new CompositeFeatureContext( + CONTEXT, dynamicSchema::getSearchFeatureSchema); + ModelLoader hdfsLoader = createHdfsLoader( + serverStats, hdfsNameNode, hdfsBasedPath, featureContext); + ModelLoader classpathLoader = createClasspathLoader( + serverStats, featureContext); + + // Explicitly load the models from the classpath + classpathLoader.run(); + + ScoringModelsManager manager = new ScoringModelsManager(hdfsLoader, classpathLoader); + LOG.info("Initialized ScoringModelsManager for loading models from HDFS and the classpath"); + return manager; + } + + protected static ModelLoader createHdfsLoader( + SearchStatsReceiver serverStats, + String hdfsNameNode, + String hdfsBasedPath, + CompositeFeatureContext featureContext) { + String hdfsVersionedPath = hdfsBasedPath + "/" + CURRENT_VERSION.getVersionDirectory(); + LOG.info("Starting to load scoring models from HDFS: {}:{}", + hdfsNameNode, hdfsVersionedPath); + return ModelLoader.forHdfsDirectory( + hdfsNameNode, + hdfsVersionedPath, + featureContext, + "scoring_models_hdfs_", + serverStats); + } + + /** + * Creates a loader that loads models from a default location in the classpath. 
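+   * The default location is the /com/twitter/search/earlybird/ml/default_models classpath
+   * resource directory, narrowed to the current feature context version's subdirectory.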
+ */ + @VisibleForTesting + public static ModelLoader createClasspathLoader( + SearchStatsReceiver serverStats, CompositeFeatureContext featureContext) + throws IOException { + AbstractFile defaultModelsBaseDir = FileUtils.getTmpDirHandle( + ScoringModelsManager.class, + "/com/twitter/search/earlybird/ml/default_models"); + AbstractFile defaultModelsDir = defaultModelsBaseDir.getChild( + CURRENT_VERSION.getVersionDirectory()); + + LOG.info("Starting to load scoring models from the classpath: {}", + defaultModelsDir.getPath()); + return ModelLoader.forDirectory( + defaultModelsDir, + featureContext, + "scoring_models_classpath_", + serverStats); + } +} diff --git a/src/python/twitter/deepbird/projects/timelines/configs/recap_earlybird/feature_config.py b/src/python/twitter/deepbird/projects/timelines/configs/recap_earlybird/feature_config.py new file mode 100644 index 000000000..167756c01 --- /dev/null +++ b/src/python/twitter/deepbird/projects/timelines/configs/recap_earlybird/feature_config.py @@ -0,0 +1,83 @@ +# checkstyle: noqa +from twml.feature_config import FeatureConfigBuilder + + +def get_feature_config(data_spec_path, label): + return ( + FeatureConfigBuilder(data_spec_path=data_spec_path, debug=True) + .batch_add_features( + [ + ("ebd.author_specific_score", "A"), + ("ebd.has_diff_lang", "A"), + ("ebd.has_english_tweet_diff_ui_lang", "A"), + ("ebd.has_english_ui_diff_tweet_lang", "A"), + ("ebd.is_self_tweet", "A"), + ("ebd.tweet_age_in_secs", "A"), + ("encoded_tweet_features.favorite_count", "A"), + ("encoded_tweet_features.from_verified_account_flag", "A"), + ("encoded_tweet_features.has_card_flag", "A"), + # ("encoded_tweet_features.has_consumer_video_flag", "A"), + ("encoded_tweet_features.has_image_url_flag", "A"), + ("encoded_tweet_features.has_link_flag", "A"), + ("encoded_tweet_features.has_multiple_hashtags_or_trends_flag", "A"), + # ("encoded_tweet_features.has_multiple_media_flag", "A"), + ("encoded_tweet_features.has_native_image_flag", "A"), + ("encoded_tweet_features.has_news_url_flag", "A"), + ("encoded_tweet_features.has_periscope_flag", "A"), + ("encoded_tweet_features.has_pro_video_flag", "A"), + ("encoded_tweet_features.has_quote_flag", "A"), + ("encoded_tweet_features.has_trend_flag", "A"), + ("encoded_tweet_features.has_video_url_flag", "A"), + ("encoded_tweet_features.has_vine_flag", "A"), + ("encoded_tweet_features.has_visible_link_flag", "A"), + ("encoded_tweet_features.is_offensive_flag", "A"), + ("encoded_tweet_features.is_reply_flag", "A"), + ("encoded_tweet_features.is_retweet_flag", "A"), + ("encoded_tweet_features.is_sensitive_content", "A"), + # ("encoded_tweet_features.is_user_new_flag", "A"), + ("encoded_tweet_features.language", "A"), + ("encoded_tweet_features.link_language", "A"), + ("encoded_tweet_features.num_hashtags", "A"), + ("encoded_tweet_features.num_mentions", "A"), + # ("encoded_tweet_features.profile_is_egg_flag", "A"), + ("encoded_tweet_features.reply_count", "A"), + ("encoded_tweet_features.retweet_count", "A"), + ("encoded_tweet_features.text_score", "A"), + ("encoded_tweet_features.user_reputation", "A"), + ("extended_encoded_tweet_features.embeds_impression_count", "A"), + ("extended_encoded_tweet_features.embeds_impression_count_v2", "A"), + ("extended_encoded_tweet_features.embeds_url_count", "A"), + ("extended_encoded_tweet_features.embeds_url_count_v2", "A"), + ("extended_encoded_tweet_features.favorite_count_v2", "A"), + ("extended_encoded_tweet_features.label_abusive_hi_rcl_flag", "A"), + 
("extended_encoded_tweet_features.label_dup_content_flag", "A"), + ("extended_encoded_tweet_features.label_nsfw_hi_prc_flag", "A"), + ("extended_encoded_tweet_features.label_nsfw_hi_rcl_flag", "A"), + ("extended_encoded_tweet_features.label_spam_flag", "A"), + ("extended_encoded_tweet_features.label_spam_hi_rcl_flag", "A"), + ("extended_encoded_tweet_features.quote_count", "A"), + ("extended_encoded_tweet_features.reply_count_v2", "A"), + ("extended_encoded_tweet_features.retweet_count_v2", "A"), + ("extended_encoded_tweet_features.weighted_favorite_count", "A"), + ("extended_encoded_tweet_features.weighted_quote_count", "A"), + ("extended_encoded_tweet_features.weighted_reply_count", "A"), + ("extended_encoded_tweet_features.weighted_retweet_count", "A"), + ] + ) + .add_labels( + [ + label, # Tensor index: 0 + "recap.engagement.is_clicked", # Tensor index: 1 + "recap.engagement.is_favorited", # Tensor index: 2 + "recap.engagement.is_open_linked", # Tensor index: 3 + "recap.engagement.is_photo_expanded", # Tensor index: 4 + "recap.engagement.is_profile_clicked", # Tensor index: 5 + "recap.engagement.is_replied", # Tensor index: 6 + "recap.engagement.is_retweeted", # Tensor index: 7 + "recap.engagement.is_video_playback_50", # Tensor index: 8 + "timelines.earlybird_score", # Tensor index: 9 + ] + ) + .define_weight("meta.record_weight/type=earlybird") + .build() + ) diff --git a/src/scala/com/twitter/graph/batch/job/tweepcred/README b/src/scala/com/twitter/graph/batch/job/tweepcred/README new file mode 100644 index 000000000..55ef3b093 --- /dev/null +++ b/src/scala/com/twitter/graph/batch/job/tweepcred/README @@ -0,0 +1,75 @@ +Tweepcred + +Tweepcred is a social network analysis tool that calculates the influence of Twitter users based on their interactions with other users. The tool uses the PageRank algorithm to rank users based on their influence. + +PageRank Algorithm +PageRank is a graph algorithm that was originally developed by Google to determine the importance of web pages in search results. The algorithm works by assigning a numerical score to each page based on the number and quality of other pages that link to it. The more links a page has from other high-quality pages, the higher its PageRank score. + +In the Tweepcred project, the PageRank algorithm is used to determine the influence of Twitter users based on their interactions with other users. The graph is constructed by treating Twitter users as nodes, and their interactions (mentions, retweets, etc.) as edges. The PageRank score of a user represents their influence in the network. + +Tweepcred PageRank Implementation +The implementation of the PageRank algorithm in Tweepcred is based on the Hadoop MapReduce framework. The algorithm is split into two stages: preparation and iteration. + +The preparation stage involves constructing the graph of Twitter users and their interactions, and initializing each user's PageRank score to a default value. This stage is implemented in the PreparePageRankData class. + +The iteration stage involves repeatedly calculating and updating the PageRank scores of each user until convergence is reached. This stage is implemented in the UpdatePageRank class, which is run multiple times until the algorithm converges. + +The Tweepcred PageRank implementation also includes a number of optimizations to improve performance and reduce memory usage. These optimizations include block compression, lazy loading, and in-memory caching. 
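+
+For intuition, the sketch below shows the per-iteration update as plain single-machine
+Scala over in-memory maps. It is only an illustration: names such as pageRankStep and
+jumpProb are made up here (jumpProb mirrors the jumpprob option described below), and the
+real jobs run the same update as a distributed Scalding/Hadoop computation.
+
+  // One unweighted PageRank iteration: every node keeps jumpProb / n of "random jump"
+  // mass and receives (1 - jumpProb)-damped mass from its in-neighbors.
+  def pageRankStep(
+    ranks: Map[Long, Double],
+    outEdges: Map[Long, Seq[Long]],
+    jumpProb: Double
+  ): Map[Long, Double] = {
+    val n = ranks.size
+    // Mass each node sends to each of its out-neighbors in this iteration.
+    val contributions = ranks.toSeq.flatMap { case (node, rank) =>
+      val dests = outEdges.getOrElse(node, Seq.empty)
+      dests.map(d => (d, rank / dests.size))
+    }
+    val received = contributions
+      .groupBy(_._1)
+      .map { case (node, cs) => node -> cs.map(_._2).sum }
+    ranks.map { case (node, _) =>
+      node -> (jumpProb / n + (1.0 - jumpProb) * received.getOrElse(node, 0.0))
+    }
+  }
+
+The jobs below additionally support edge weights (the --weighted flag) and test
+convergence on the total difference between the input and output PageRank masses.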
+ + +========================================== TweepcredBatchJob.scala ========================================== + + +This is a Scala class that represents a batch job for computing the "tweepcred" (Twitter credibility) score for Twitter users using weighted or unweighted PageRank algorithm. The class extends the AnalyticsIterativeBatchJob class, which is part of the Scalding framework used for data processing on Hadoop. + +The class defines various properties and methods that are used to configure and run the batch job. The args parameter represents the command-line arguments that are passed to the batch job, such as the --weighted flag that determines whether to use the weighted PageRank algorithm or not. + +The run method overrides the run method of the base class and prints the batch statistics after the job has finished. The children method defines a list of child jobs that need to be executed as part of the batch job. The messageHeader method returns a string that represents the header of the batch job message. + +========================================== ExtractTweepcred.scala ========================================== + +This class is a Scalding job that calculates "tweepcred" from a given pagerank file. Tweepcred is a measure of reputation for Twitter users that takes into account the number of followers they have and the number of people they follow. If the optional argument post_adjust is set to true (default value), then the pagerank values are adjusted based on the user's follower-to-following ratio. + +The class takes several command-line arguments specifying input and output files and options, and it uses the Scalding library to perform distributed data processing on the input files. It reads in the pagerank file and a user mass file, both in TSV format, and combines them to produce a new pagerank file with the adjusted values. The adjusted pagerank is then used to calculate tweepcred values, which are written to output files. + +The code makes use of the MostRecentCombinedUserSnapshotSource class from the com.twitter.pluck.source.combined_user_source package to obtain user information from the user mass file. It also uses the Reputation class to perform the tweepcred calculations and adjustments. + + +========================================== UserMass.scala ========================================== + +The UserMass class is a helper class used to calculate the "mass" of a user on Twitter, as defined by a certain algorithm. The mass score represents the user's reputation and is used in various applications, such as in determining which users should be recommended to follow or which users should have their content highlighted. + +The getUserMass method of the UserMass class takes in a CombinedUser object, which contains information about a Twitter user, and returns an optional UserMassInfo object, which contains the user's ID and calculated mass score. + +The algorithm used to calculate the mass score takes into account various factors such as the user's account age, number of followers and followings, device usage, and safety status (restricted, suspended, verified). The calculation involves adding and multiplying weight factors and adjusting the mass score based on a threshold for the number of friends and followers. + + +========================================== PreparePageRankData.scala ========================================== + +The PreparePageRankData class prepares the graph data for the page rank calculation. 
It generates the initial pagerank and then starts the WeightedPageRank job. It has the following functionalities: + +It reads the user mass TSV file generated by the twadoop user_mass job. +It reads the graph data, which is either a TSV file or a combination of flock edges and real graph inputs for weights. +It generates the initial pagerank as the starting point for the pagerank computation. +It writes the number of nodes to a TSV file and dumps the nodes to another TSV file. +It has several options like weighted, flock_edges_only, and input_pagerank to fine-tune the pagerank calculation. +It also has options for the WeightedPageRank and ExtractTweepcred jobs, like output_pagerank, output_tweepcred, maxiterations, jumpprob, threshold, and post_adjust. +The PreparePageRankData class has several helper functions like getFlockEdges, getRealGraphEdges, getFlockRealGraphEdges, and getCsvEdges that read the graph data from different sources like DAL, InteractionGraph, or CSV files. It also has the generateInitialPagerank function that generates the initial pagerank from the graph data. + +========================================== WeightedPageRank.scala ========================================== + +WeightedPageRank is a class that performs the weighted PageRank algorithm on a given graph. + +The algorithm starts from a given PageRank value and performs one iteration, then tests for convergence. If convergence has not been reached, the algorithm clones itself and starts the next PageRank job with the updated PageRank as input. If convergence has been reached, the algorithm starts the ExtractTweepcred job instead. + +The class takes in several options, including the working directory, total number of nodes, nodes file, PageRank file, total difference, whether to perform weighted PageRank, the current iteration, maximum iterations to run, probability of a random jump, and whether to do post adjust. + +The algorithm reads a nodes file that includes the source node ID, destination node IDs, weights, and mass prior. The algorithm also reads an input PageRank file that includes the source node ID and mass input. The algorithm then performs one iteration of the PageRank algorithm and writes the output PageRank to a file. + +The algorithm tests for convergence by calculating the total difference between the input and output PageRank masses. If convergence has not been reached, the algorithm clones itself and starts the next PageRank job. If convergence has been reached, the algorithm starts the ExtractTweepcred job. + +========================================== Reputation.scala ========================================== + +This is a helper class called Reputation that contains methods for calculating a user's reputation score. The first method called scaledReputation takes a Double parameter raw which represents the user's page rank, and returns a Byte value that represents the user's reputation on a scale of 0 to 100. This method uses a formula that involves converting the logarithm of the page rank to a number between 0 and 100. + +The second method called adjustReputationsPostCalculation takes three parameters: mass (a Double value representing the user's page rank), numFollowers (an Int value representing the number of followers a user has), and numFollowings (an Int value representing the number of users a user is following). This method reduces the page rank of users who have a low number of followers but a high number of followings. 
It calculates a division factor based on the ratio of followings to followers, and reduces the user's page rank by dividing it by this factor. The method returns the adjusted page rank. diff --git a/src/scala/com/twitter/recos/user_tweet_entity_graph/README.md b/src/scala/com/twitter/recos/user_tweet_entity_graph/README.md new file mode 100644 index 000000000..39af44deb --- /dev/null +++ b/src/scala/com/twitter/recos/user_tweet_entity_graph/README.md @@ -0,0 +1,17 @@ +# UserTweetEntityGraph (UTEG) + +## What is it +User Tweet Entity Graph (UTEG) is a Finalge thrift service built on the GraphJet framework. It maintains a graph of user-tweet relationships and serves user recommendations based on traversals in this graph. + +## How is it used on Twitter +UTEG generates the "XXX Liked" out-of-network tweets seen on Twitter's Home Timeline. +The core idea behind UTEG is collaborative filtering. UTEG takes a user's weighted follow graph (i.e a list of weighted userIds) as input, +performs efficient traversal & aggregation, and returns the top-weighted tweets engaged based on # of users that engaged the tweet, as well as +the engaged users' weights. + +UTEG is a stateful service and relies on a Kafka stream to ingest & persist states. It maintains in-memory user engagements over the past +24-48 hours. Older events are dropped and GC'ed. + +For full details on storage & processing, please check out our open-sourced project GraphJet, a general-purpose high-performance in-memory storage engine. +- https://github.com/twitter/GraphJet +- http://www.vldb.org/pvldb/vol9/p1281-sharma.pdf diff --git a/src/scala/com/twitter/simclusters_v2/common/SimClustersEmbedding.scala b/src/scala/com/twitter/simclusters_v2/common/SimClustersEmbedding.scala new file mode 100644 index 000000000..b8f0179cb --- /dev/null +++ b/src/scala/com/twitter/simclusters_v2/common/SimClustersEmbedding.scala @@ -0,0 +1,581 @@ +package com.twitter.simclusters_v2.common + +import com.twitter.simclusters_v2.thriftscala.SimClusterWithScore +import com.twitter.simclusters_v2.thriftscala.{SimClustersEmbedding => ThriftSimClustersEmbedding} +import scala.collection.mutable +import scala.language.implicitConversions +import scala.util.hashing.MurmurHash3.arrayHash +import scala.util.hashing.MurmurHash3.productHash +import scala.math._ + +/** + * A representation of a SimClusters Embedding, designed for low memory footprint and performance. + * For services that cache millions of embeddings, we found this to significantly reduce allocations, + * memory footprint and overall performance. + * + * Embedding data is stored in pre-sorted arrays rather than structures which use a lot of pointers + * (e.g. Map). A minimal set of lazily-constructed intermediate data is kept. + * + * Be wary of adding further `val` or `lazy val`s to this class; materializing and storing more data + * on these objects could significantly affect in-memory cache performance. + * + * Also, if you are using this code in a place where you care about memory footprint, be careful + * not to materialize any of the lazy vals unless you need them. 
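+ * (Once forced, each lazy val below stays cached for the lifetime of the object, which is
+ * exactly what makes unnecessary materialization costly for long-lived cached embeddings.)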
+ */ +sealed trait SimClustersEmbedding extends Equals { + import SimClustersEmbedding._ + + /** + * Any compliant implementation of the SimClustersEmbedding trait must ensure that: + * - the cluster and score arrays are ordered as described below + * - the cluster and score arrays are treated as immutable (.hashCode is memoized) + * - the size of all cluster and score arrays is the same + * - all cluster scores are > 0 + * - cluster ids are unique + */ + // In descending score order - this is useful for truncation, where we care most about the highest scoring elements + private[simclusters_v2] val clusterIds: Array[ClusterId] + private[simclusters_v2] val scores: Array[Double] + // In ascending cluster order. This is useful for operations where we try to find the same cluster in another embedding, e.g. dot product + private[simclusters_v2] val sortedClusterIds: Array[ClusterId] + private[simclusters_v2] val sortedScores: Array[Double] + + /** + * Build and return a Set of all clusters in this embedding + */ + lazy val clusterIdSet: Set[ClusterId] = sortedClusterIds.toSet + + /** + * Build and return Seq representation of this embedding + */ + lazy val embedding: Seq[(ClusterId, Double)] = + sortedClusterIds.zip(sortedScores).sortBy(-_._2).toSeq + + /** + * Build and return a Map representation of this embedding + */ + lazy val map: Map[ClusterId, Double] = sortedClusterIds.zip(sortedScores).toMap + + lazy val l1norm: Double = CosineSimilarityUtil.l1NormArray(sortedScores) + + lazy val l2norm: Double = CosineSimilarityUtil.normArray(sortedScores) + + lazy val logNorm: Double = CosineSimilarityUtil.logNormArray(sortedScores) + + lazy val expScaledNorm: Double = + CosineSimilarityUtil.expScaledNormArray(sortedScores, DefaultExponent) + + /** + * The L2 Normalized Embedding. Optimize for Cosine Similarity Calculation. + */ + lazy val normalizedSortedScores: Array[Double] = + CosineSimilarityUtil.applyNormArray(sortedScores, l2norm) + + lazy val logNormalizedSortedScores: Array[Double] = + CosineSimilarityUtil.applyNormArray(sortedScores, logNorm) + + lazy val expScaledNormalizedSortedScores: Array[Double] = + CosineSimilarityUtil.applyNormArray(sortedScores, expScaledNorm) + + /** + * The Standard Deviation of an Embedding. + */ + lazy val std: Double = { + if (scores.isEmpty) { + 0.0 + } else { + val sum = scores.sum + val mean = sum / scores.length + var variance: Double = 0.0 + for (i <- scores.indices) { + val v = scores(i) - mean + variance += (v * v) + } + math.sqrt(variance / scores.length) + } + } + + /** + * Return the score of a given clusterId. + */ + def get(clusterId: ClusterId): Option[Double] = { + var i = 0 + while (i < sortedClusterIds.length) { + val thisId = sortedClusterIds(i) + if (clusterId == thisId) return Some(sortedScores(i)) + if (thisId > clusterId) return None + i += 1 + } + None + } + + /** + * Return the score of a given clusterId. If not exist, return default. 
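+   * Like get, this is a linear scan over sortedClusterIds that can exit early because the
+   * array is sorted in ascending cluster id order.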
+   */
+  def getOrElse(clusterId: ClusterId, default: Double = 0.0): Double = {
+    require(default >= 0.0)
+    var i = 0
+    while (i < sortedClusterIds.length) {
+      val thisId = sortedClusterIds(i)
+      if (clusterId == thisId) return sortedScores(i)
+      if (thisId > clusterId) return default
+      i += 1
+    }
+    default
+  }
+
+  /**
+   * Return the cluster ids
+   */
+  def getClusterIds(): Array[ClusterId] = clusterIds
+
+  /**
+   * Return the cluster ids with the highest scores
+   */
+  def topClusterIds(size: Int): Seq[ClusterId] = clusterIds.take(size)
+
+  /**
+   * Return true if this embedding contains a given clusterId
+   */
+  def contains(clusterId: ClusterId): Boolean = clusterIdSet.contains(clusterId)
+
+  def sum(another: SimClustersEmbedding): SimClustersEmbedding = {
+    if (another.isEmpty) this
+    else if (this.isEmpty) another
+    else {
+      var i1 = 0
+      var i2 = 0
+      val l = scala.collection.mutable.ArrayBuffer.empty[(Int, Double)]
+      while (i1 < sortedClusterIds.length && i2 < another.sortedClusterIds.length) {
+        if (sortedClusterIds(i1) == another.sortedClusterIds(i2)) {
+          l += Tuple2(sortedClusterIds(i1), sortedScores(i1) + another.sortedScores(i2))
+          i1 += 1
+          i2 += 1
+        } else if (sortedClusterIds(i1) > another.sortedClusterIds(i2)) {
+          l += Tuple2(another.sortedClusterIds(i2), another.sortedScores(i2))
+          // another's cluster is lower. Increment it to see if the next one matches this one's
+          i2 += 1
+        } else {
+          l += Tuple2(sortedClusterIds(i1), sortedScores(i1))
+          // this cluster is lower. Increment it to see if the next one matches another's
+          i1 += 1
+        }
+      }
+      if (i1 == sortedClusterIds.length && i2 != another.sortedClusterIds.length)
+        // this was shorter. Append the remaining elements from another
+        l ++= another.sortedClusterIds.drop(i2).zip(another.sortedScores.drop(i2))
+      else if (i1 != sortedClusterIds.length && i2 == another.sortedClusterIds.length)
+        // another was shorter. Append the remaining elements from this
+        l ++= sortedClusterIds.drop(i1).zip(sortedScores.drop(i1))
+      SimClustersEmbedding(l)
+    }
+  }
+
+  def scalarMultiply(multiplier: Double): SimClustersEmbedding = {
+    require(multiplier > 0.0, "SimClustersEmbedding.scalarMultiply requires multiplier > 0.0")
+    DefaultSimClustersEmbedding(
+      clusterIds,
+      scores.map(_ * multiplier),
+      sortedClusterIds,
+      sortedScores.map(_ * multiplier)
+    )
+  }
+
+  def scalarDivide(divisor: Double): SimClustersEmbedding = {
+    require(divisor > 0.0, "SimClustersEmbedding.scalarDivide requires divisor > 0.0")
+    DefaultSimClustersEmbedding(
+      clusterIds,
+      scores.map(_ / divisor),
+      sortedClusterIds,
+      sortedScores.map(_ / divisor)
+    )
+  }
+
+  def dotProduct(another: SimClustersEmbedding): Double = {
+    CosineSimilarityUtil.dotProductForSortedClusterAndScores(
+      sortedClusterIds,
+      sortedScores,
+      another.sortedClusterIds,
+      another.sortedScores)
+  }
+
+  def cosineSimilarity(another: SimClustersEmbedding): Double = {
+    CosineSimilarityUtil.dotProductForSortedClusterAndScores(
+      sortedClusterIds,
+      normalizedSortedScores,
+      another.sortedClusterIds,
+      another.normalizedSortedScores)
+  }
+
+  def logNormCosineSimilarity(another: SimClustersEmbedding): Double = {
+    CosineSimilarityUtil.dotProductForSortedClusterAndScores(
+      sortedClusterIds,
+      logNormalizedSortedScores,
+      another.sortedClusterIds,
+      another.logNormalizedSortedScores)
+  }
+
+  def expScaledCosineSimilarity(another: SimClustersEmbedding): Double = {
+    CosineSimilarityUtil.dotProductForSortedClusterAndScores(
+      sortedClusterIds,
+      expScaledNormalizedSortedScores,
+      another.sortedClusterIds,
+      another.expScaledNormalizedSortedScores)
+  }
+
+  /**
+   * Return true if this is an empty embedding
+   */
+  def isEmpty: Boolean = sortedClusterIds.isEmpty
+
+  /**
+   * Return the Jaccard Similarity Score between two embeddings.
+   * Note: this implementation should be optimized if we start to use it in production
+   */
+  def jaccardSimilarity(another: SimClustersEmbedding): Double = {
+    if (this.isEmpty || another.isEmpty) {
+      0.0
+    } else {
+      val intersect = clusterIdSet.intersect(another.clusterIdSet).size
+      val union = clusterIdSet.union(another.clusterIdSet).size
+      intersect.toDouble / union
+    }
+  }
+
+  /**
+   * Return the Fuzzy Jaccard Similarity Score between two embeddings.
+   * Treat each SimClusters embedding as a fuzzy set and calculate the fuzzy set similarity
+   * metric of the two embeddings.
+   *
+   * Paper 2.2.1: https://openreview.net/pdf?id=SkxXg2C5FX
+   */
+  def fuzzyJaccardSimilarity(another: SimClustersEmbedding): Double = {
+    if (this.isEmpty || another.isEmpty) {
+      0.0
+    } else {
+      val v1C = sortedClusterIds
+      val v1S = sortedScores
+      val v2C = another.sortedClusterIds
+      val v2S = another.sortedScores
+
+      require(v1C.length == v1S.length)
+      require(v2C.length == v2S.length)
+
+      var i1 = 0
+      var i2 = 0
+      var numerator = 0.0
+      var denominator = 0.0
+
+      while (i1 < v1C.length && i2 < v2C.length) {
+        if (v1C(i1) == v2C(i2)) {
+          numerator += min(v1S(i1), v2S(i2))
+          denominator += max(v1S(i1), v2S(i2))
+          i1 += 1
+          i2 += 1
+        } else if (v1C(i1) > v2C(i2)) {
+          denominator += v2S(i2)
+          i2 += 1
+        } else {
+          denominator += v1S(i1)
+          i1 += 1
+        }
+      }
+
+      while (i1 < v1C.length) {
+        denominator += v1S(i1)
+        i1 += 1
+      }
+      while (i2 < v2C.length) {
+        denominator += v2S(i2)
+        i2 += 1
+      }
+
+      numerator / denominator
+    }
+  }
+
+  /**
+   * Return the Euclidean Distance Score between two embeddings.
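+   * (Concretely, this computes the square root of the sum, over the union of cluster ids,
+   * of squared score differences, where a cluster absent from an embedding contributes 0.)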
+ * Note: this implementation should be optimized if we start to use it in production + */ + def euclideanDistance(another: SimClustersEmbedding): Double = { + val unionClusters = clusterIdSet.union(another.clusterIdSet) + val variance = unionClusters.foldLeft(0.0) { + case (sum, clusterId) => + val distance = math.abs(this.getOrElse(clusterId) - another.getOrElse(clusterId)) + sum + distance * distance + } + math.sqrt(variance) + } + + /** + * Return the Manhattan Distance Score between two embeddings. + * Note: this implementation should be optimized if we start to use it in production + */ + def manhattanDistance(another: SimClustersEmbedding): Double = { + val unionClusters = clusterIdSet.union(another.clusterIdSet) + unionClusters.foldLeft(0.0) { + case (sum, clusterId) => + sum + math.abs(this.getOrElse(clusterId) - another.getOrElse(clusterId)) + } + } + + /** + * Return the number of overlapping clusters between two embeddings. + */ + def overlappingClusters(another: SimClustersEmbedding): Int = { + var i1 = 0 + var i2 = 0 + var count = 0 + + while (i1 < sortedClusterIds.length && i2 < another.sortedClusterIds.length) { + if (sortedClusterIds(i1) == another.sortedClusterIds(i2)) { + count += 1 + i1 += 1 + i2 += 1 + } else if (sortedClusterIds(i1) > another.sortedClusterIds(i2)) { + // v2 cluster is lower. Increment it to see if the next one matches v1's + i2 += 1 + } else { + // v1 cluster is lower. Increment it to see if the next one matches v2's + i1 += 1 + } + } + count + } + + /** + * Return the largest product cluster scores + */ + def maxElementwiseProduct(another: SimClustersEmbedding): Double = { + var i1 = 0 + var i2 = 0 + var maxProduct: Double = 0.0 + + while (i1 < sortedClusterIds.length && i2 < another.sortedClusterIds.length) { + if (sortedClusterIds(i1) == another.sortedClusterIds(i2)) { + val product = sortedScores(i1) * another.sortedScores(i2) + if (product > maxProduct) maxProduct = product + i1 += 1 + i2 += 1 + } else if (sortedClusterIds(i1) > another.sortedClusterIds(i2)) { + // v2 cluster is lower. Increment it to see if the next one matches v1's + i2 += 1 + } else { + // v1 cluster is lower. Increment it to see if the next one matches v2's + i1 += 1 + } + } + maxProduct + } + + /** + * Return a new SimClustersEmbedding with Max Embedding Size. + * + * Prefer to truncate on embedding construction where possible. Doing so is cheaper. + */ + def truncate(size: Int): SimClustersEmbedding = { + if (clusterIds.length <= size) { + this + } else { + val truncatedClusterIds = clusterIds.take(size) + val truncatedScores = scores.take(size) + val (sortedClusterIds, sortedScores) = + truncatedClusterIds.zip(truncatedScores).sortBy(_._1).unzip + + DefaultSimClustersEmbedding( + truncatedClusterIds, + truncatedScores, + sortedClusterIds, + sortedScores) + } + } + + def toNormalized: SimClustersEmbedding = { + // Additional safety check. Only EmptyEmbedding's l2norm is 0.0. + if (l2norm == 0.0) { + EmptyEmbedding + } else { + this.scalarDivide(l2norm) + } + } + + implicit def toThrift: ThriftSimClustersEmbedding = { + ThriftSimClustersEmbedding( + embedding.map { + case (clusterId, score) => + SimClusterWithScore(clusterId, score) + } + ) + } + + def canEqual(a: Any): Boolean = a.isInstanceOf[SimClustersEmbedding] + + /* We define equality as having the same clusters and scores. 
+ * This implementation is arguably incorrect in this case: + * (1 -> 1.0, 2 -> 0.0) == (1 -> 1.0) // equals returns false + * However, compliant implementations of SimClustersEmbedding should not include zero-weight + * clusters, so this implementation should work correctly. + */ + override def equals(that: Any): Boolean = + that match { + case that: SimClustersEmbedding => + that.canEqual(this) && + this.sortedClusterIds.sameElements(that.sortedClusterIds) && + this.sortedScores.sameElements(that.sortedScores) + case _ => false + } + + /** + * hashcode implementation based on the contents of the embedding. As a lazy val, this relies on + * the embedding contents being immutable. + */ + override lazy val hashCode: Int = { + /* Arrays uses object id as hashCode, so different arrays with the same contents hash + * differently. To provide a stable hash code, we take the same approach as how a + * `case class(clusters: Seq[Int], scores: Seq[Double])` would be hashed. See + * ScalaRunTime._hashCode and MurmurHash3.productHash + * https://github.com/scala/scala/blob/2.12.x/src/library/scala/runtime/ScalaRunTime.scala#L167 + * https://github.com/scala/scala/blob/2.12.x/src/library/scala/util/hashing/MurmurHash3.scala#L64 + * + * Note that the hashcode is arguably incorrect in this case: + * (1 -> 1.0, 2 -> 0.0).hashcode == (1 -> 1.0).hashcode // returns false + * However, compliant implementations of SimClustersEmbedding should not include zero-weight + * clusters, so this implementation should work correctly. + */ + productHash((arrayHash(sortedClusterIds), arrayHash(sortedScores))) + } +} + +object SimClustersEmbedding { + val EmptyEmbedding: SimClustersEmbedding = + DefaultSimClustersEmbedding(Array.empty, Array.empty, Array.empty, Array.empty) + + val DefaultExponent: Double = 0.3 + + // Descending by score then ascending by ClusterId + implicit val order: Ordering[(ClusterId, Double)] = + (a: (ClusterId, Double), b: (ClusterId, Double)) => { + b._2 compare a._2 match { + case 0 => a._1 compare b._1 + case c => c + } + } + + /** + * Constructors + * + * These constructors: + * - do not make assumptions about the ordering of the cluster/scores. 
+   *  - do assume that cluster ids are unique
+   *  - ignore (drop) any cluster whose score is <= 0
+   */
+  def apply(embedding: (ClusterId, Double)*): SimClustersEmbedding =
+    buildDefaultSimClustersEmbedding(embedding)
+
+  def apply(embedding: Iterable[(ClusterId, Double)]): SimClustersEmbedding =
+    buildDefaultSimClustersEmbedding(embedding)
+
+  def apply(embedding: Iterable[(ClusterId, Double)], size: Int): SimClustersEmbedding =
+    buildDefaultSimClustersEmbedding(embedding, truncate = Some(size))
+
+  implicit def apply(thriftEmbedding: ThriftSimClustersEmbedding): SimClustersEmbedding =
+    buildDefaultSimClustersEmbedding(thriftEmbedding.embedding.map(_.toTuple))
+
+  def apply(thriftEmbedding: ThriftSimClustersEmbedding, truncate: Int): SimClustersEmbedding =
+    buildDefaultSimClustersEmbedding(
+      thriftEmbedding.embedding.map(_.toTuple),
+      truncate = Some(truncate))
+
+  private def buildDefaultSimClustersEmbedding(
+    embedding: Iterable[(ClusterId, Double)],
+    truncate: Option[Int] = None
+  ): SimClustersEmbedding = {
+    val truncatedIdAndScores = {
+      val idsAndScores = embedding.filter(_._2 > 0.0).toArray.sorted(order)
+      truncate match {
+        case Some(t) => idsAndScores.take(t)
+        case _ => idsAndScores
+      }
+    }
+
+    if (truncatedIdAndScores.isEmpty) {
+      EmptyEmbedding
+    } else {
+      val (clusterIds, scores) = truncatedIdAndScores.unzip
+      val (sortedClusterIds, sortedScores) = truncatedIdAndScores.sortBy(_._1).unzip
+      DefaultSimClustersEmbedding(clusterIds, scores, sortedClusterIds, sortedScores)
+    }
+  }
+
+  /** ***** Aggregation Methods ******/
+  /**
+   * A high-performance way to sum a list of SimClustersEmbeddings.
+   * Suggested for use in online services to avoid unnecessary GC.
+   * For offline or streaming use, please check [[SimClustersEmbeddingMonoid]].
+   */
+  def sum(simClustersEmbeddings: Iterable[SimClustersEmbedding]): SimClustersEmbedding = {
+    if (simClustersEmbeddings.isEmpty) {
+      EmptyEmbedding
+    } else {
+      val sum = simClustersEmbeddings.foldLeft(mutable.Map[ClusterId, Double]()) {
+        (sum, embedding) =>
+          for (i <- embedding.sortedClusterIds.indices) {
+            val clusterId = embedding.sortedClusterIds(i)
+            sum.put(clusterId, embedding.sortedScores(i) + sum.getOrElse(clusterId, 0.0))
+          }
+          sum
+      }
+      SimClustersEmbedding(sum)
+    }
+  }
+
+  /**
+   * Support a fixed-size SimClustersEmbedding sum
+   */
+  def sum(
+    simClustersEmbeddings: Iterable[SimClustersEmbedding],
+    maxSize: Int
+  ): SimClustersEmbedding = {
+    sum(simClustersEmbeddings).truncate(maxSize)
+  }
+
+  /**
+   * A high-performance way to average a list of SimClustersEmbeddings.
+   * Suggested for use in online services to avoid unnecessary GC.
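+   * (A hypothetical worked example: mean(Seq(SimClustersEmbedding(1 -> 0.4),
+   * SimClustersEmbedding(1 -> 0.8))) returns SimClustersEmbedding(1 -> 0.6), since the
+   * cluster-wise sum of 1.2 is divided by the collection size of 2.)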
+   */
+  def mean(simClustersEmbeddings: Iterable[SimClustersEmbedding]): SimClustersEmbedding = {
+    if (simClustersEmbeddings.isEmpty) {
+      EmptyEmbedding
+    } else {
+      sum(simClustersEmbeddings).scalarDivide(simClustersEmbeddings.size)
+    }
+  }
+
+  /**
+   * Support a fixed-size SimClustersEmbedding mean
+   */
+  def mean(
+    simClustersEmbeddings: Iterable[SimClustersEmbedding],
+    maxSize: Int
+  ): SimClustersEmbedding = {
+    mean(simClustersEmbeddings).truncate(maxSize)
+  }
+}
+
+case class DefaultSimClustersEmbedding(
+  override val clusterIds: Array[ClusterId],
+  override val scores: Array[Double],
+  override val sortedClusterIds: Array[ClusterId],
+  override val sortedScores: Array[Double])
+    extends SimClustersEmbedding {
+
+  override def toString: String =
+    s"DefaultSimClustersEmbedding(${clusterIds.zip(scores).mkString(",")})"
+}
+
+object DefaultSimClustersEmbedding {
+  // To support existing code which builds embeddings from a Seq
+  def apply(embedding: Seq[(ClusterId, Double)]): SimClustersEmbedding = SimClustersEmbedding(
+    embedding)
+}
diff --git a/src/thrift/com/twitter/search/common/ranking/ranking.thrift b/src/thrift/com/twitter/search/common/ranking/ranking.thrift
new file mode 100644
index 000000000..bd1cff929
--- /dev/null
+++ b/src/thrift/com/twitter/search/common/ranking/ranking.thrift
@@ -0,0 +1,366 @@
+namespace java com.twitter.search.common.ranking.thriftjava
+#@namespace scala com.twitter.search.common.ranking.thriftscala
+#@namespace strato com.twitter.search.common.ranking
+namespace py gen.twitter.search.common.ranking.ranking
+
+struct ThriftLinearFeatureRankingParams {
+  // values below this will set the score to the minimal one
+  1: optional double min = -1e+100
+  // values above this will set the score to the maximal one
+  2: optional double max = 1e+100
+  3: optional double weight = 0
+}(persisted='true')
+
+struct ThriftAgeDecayRankingParams {
+  // the rate at which the score of older tweets decreases
+  1: optional double slope = 0.003
+  // the age, in minutes, where the age score of a tweet is half that of the latest tweet
+  2: optional double halflife = 360.0
+  // the minimal age decay score a tweet will have
+  3: optional double base = 0.6
+}(persisted='true')
+
+enum ThriftScoringFunctionType {
+  LINEAR = 1,
+  MODEL_BASED = 4,
+  TENSORFLOW_BASED = 5,
+
+  // deprecated
+  TOPTWEETS = 2,
+  EXPERIMENTAL = 3,
+}
+
+// The struct to define a class that is to be dynamically loaded in earlybird for
+// experimentation.
+struct ThriftExperimentClass {
+  // the fully qualified class name.
+  1: required string name
+  // data source location (class/jar file) for this dynamic class on HDFS
+  2: optional string location
+  // parameters in key-value pairs for this experimental class
+  3: optional map<string, string> params
+}(persisted='true')
+
+// Deprecated!!
+struct ThriftQueryEngagementParams {
+  // Rate Boosts: given a rate (usually a small fraction), the score will be multiplied by
+  // (1 + rate) ^ boost
+  // 0 means no boost; negative numbers dampen the score
+  1: optional double retweetRateBoost = 0
+  2: optional double replyRateBoost = 0
+  3: optional double faveRateBoost = 0
+}(persisted='true')
+
+struct ThriftHostQualityParams {
+  // Multiplier applied to host score, for tweets that have links.
+  // A multiplier of 0 means that this boost is not applied
+  1: optional double multiplier = 0.0
+
+  // Do not apply the multiplier to hosts with score above this level.
+  // If 0, the multiplier will be applied to any host.
+ 2: optional double maxScoreToModify = 0.0 + + // Do not apply the multiplier to hosts with score below this level. + // If 0, the multiplier will be applied to any host. + 3: optional double minScoreToModify = 0.0 + + // If true, score modification will be applied to hosts that have unknown scores. + // The host-score used will be lower than the score of any known host. + 4: optional bool applyToUnknownHosts = 0 +}(persisted='true') + +struct ThriftCardRankingParams { + 1: optional double hasCardBoost = 1.0 + 2: optional double domainMatchBoost = 1.0 + 3: optional double authorMatchBoost = 1.0 + 4: optional double titleMatchBoost = 1.0 + 5: optional double descriptionMatchBoost = 1.0 +}(persisted='true') + +# The ids are assigned in 'blocks'. For adding a new field, find an unused id in the appropriate +# block. Be sure to mention explicitly which ids have been removed so that they are not used again. +struct ThriftRankingParams { + 1: optional ThriftScoringFunctionType type + + // Dynamically loaded scorer and collector for quick experimentation. + 40: optional ThriftExperimentClass expScorer + 41: optional ThriftExperimentClass expCollector + + // we must set it to a value that fits into a float: otherwise + // some earlybird classes that convert it to float will interpret + // it as Float.NEGATIVE_INFINITY, and some comparisons will fail + 2: optional double minScore = -1e+30 + + 10: optional ThriftLinearFeatureRankingParams parusScoreParams + 11: optional ThriftLinearFeatureRankingParams retweetCountParams + 12: optional ThriftLinearFeatureRankingParams replyCountParams + 15: optional ThriftLinearFeatureRankingParams reputationParams + 16: optional ThriftLinearFeatureRankingParams luceneScoreParams + 18: optional ThriftLinearFeatureRankingParams textScoreParams + 19: optional ThriftLinearFeatureRankingParams urlParams + 20: optional ThriftLinearFeatureRankingParams isReplyParams + 21: optional ThriftLinearFeatureRankingParams directFollowRetweetCountParams + 22: optional ThriftLinearFeatureRankingParams trustedCircleRetweetCountParams + 23: optional ThriftLinearFeatureRankingParams favCountParams + 24: optional ThriftLinearFeatureRankingParams multipleReplyCountParams + 27: optional ThriftLinearFeatureRankingParams embedsImpressionCountParams + 28: optional ThriftLinearFeatureRankingParams embedsUrlCountParams + 29: optional ThriftLinearFeatureRankingParams videoViewCountParams + 66: optional ThriftLinearFeatureRankingParams quotedCountParams + + // A map from MutableFeatureType to linear ranking params + 25: optional map offlineExperimentalFeatureRankingParams + + // if min/max for score or ThriftLinearFeatureRankingParams should always be + // applied or only to non-follows, non-self, non-verified + 26: optional bool applyFiltersAlways = 0 + + // Whether to apply promotion/demotion at all for FeatureBasedScoringFunction + 70: optional bool applyBoosts = 1 + + // UI language is english, tweet language is not + 30: optional double langEnglishUIBoost = 0.3 + // tweet language is english, UI language is not + 31: optional double langEnglishTweetBoost = 0.7 + // user language differs from tweet language, and neither is english + 32: optional double langDefaultBoost = 0.1 + // user that produced tweet is marked as spammer by metastore + 33: optional double spamUserBoost = 1.0 + // user that produced tweet is marked as nsfw by metastore + 34: optional double nsfwUserBoost = 1.0 + // user that produced tweet is marked as bot (self similarity) by metastore + 35: optional double botUserBoost 
= 1.0 + + // An alternative way of using lucene score in the ranking function. + 38: optional bool useLuceneScoreAsBoost = 0 + 39: optional double maxLuceneScoreBoost = 1.2 + + // Use user's consumed and produced languages for scoring + 42: optional bool useUserLanguageInfo = 0 + + // Boost (demotion) if the tweet language is not one of user's understandable languages, + // nor interface language. + 43: optional double unknownLanguageBoost = 0.01 + + // Use topic ids for scoring. + // Deprecated in SEARCH-8616. + 44: optional bool deprecated_useTopicIDsBoost = 0 + // Parameters for topic id scoring. See TopicIDsBoostScorer (and its test) for details. + 46: optional double deprecated_maxTopicIDsBoost = 3.0 + 47: optional double deprecated_topicIDsBoostExponent = 2.0; + 48: optional double deprecated_topicIDsBoostSlope = 2.0; + + // Hit Attribute Demotion + 60: optional bool enableHitDemotion = 0 + 61: optional double noTextHitDemotion = 1.0 + 62: optional double urlOnlyHitDemotion = 1.0 + 63: optional double nameOnlyHitDemotion = 1.0 + 64: optional double separateTextAndNameHitDemotion = 1.0 + 65: optional double separateTextAndUrlHitDemotion = 1.0 + + // multiplicative score boost for results deemed offensive + 100: optional double offensiveBoost = 1 + // multiplicative score boost for results in the searcher's social circle + 101: optional double inTrustedCircleBoost = 1 + // multiplicative score dampen for results with more than one hash tag + 102: optional double multipleHashtagsOrTrendsBoost = 1 + // multiplicative score boost for results in the searcher's direct follows + 103: optional double inDirectFollowBoost = 1 + // multiplicative score boost for results that has trends + 104: optional double tweetHasTrendBoost = 1 + // is tweet from verified account? + 106: optional double tweetFromVerifiedAccountBoost = 1 + // is tweet authored by the searcher? (boost is in addition to social boost) + 107: optional double selfTweetBoost = 1 + // multiplicative score boost for a tweet that has image url. + 108: optional double tweetHasImageUrlBoost = 1 + // multiplicative score boost for a tweet that has video url. + 109: optional double tweetHasVideoUrlBoost = 1 + // multiplicative score boost for a tweet that has news url. + 110: optional double tweetHasNewsUrlBoost = 1 + // is tweet from a blue-verified account? + 111: optional double tweetFromBlueVerifiedAccountBoost = 1 (personalDataType = 'UserVerifiedFlag') + + // subtractive penalty applied after boosts for out-of-network replies. + 120: optional double outOfNetworkReplyPenalty = 10.0 + + 150: optional ThriftQueryEngagementParams deprecatedQueryEngagementParams + + 160: optional ThriftHostQualityParams deprecatedHostQualityParams + + // age decay params for regular tweets + 203: optional ThriftAgeDecayRankingParams ageDecayParams + + // for card ranking: map between card name ordinal (defined in com.twitter.search.common.constants.CardConstants) + // to ranking params + 400: optional map cardRankingParams + + // A map from tweet IDs to the score adjustment for that tweet. These are score + // adjustments that include one or more features that can depend on the query + // string. These features aren't indexed by Earlybird, and so their total contribution + // to the scoring function is passed in directly as part of the request. If present, + // the score adjustment for a tweet is directly added to the linear component of the + // scoring function. 
Since this signal can be made up of multiple features, any
+  // reweighting or combination of these features is assumed to be done by the caller
+  // (hence there is no need for a weight parameter -- the weights of the features
+  // included in this signal have already been incorporated by the caller).
+  151: optional map<i64, double> querySpecificScoreAdjustments
+
+  // A map from user ID to the score adjustment for tweets from that author.
+  // This field provides a way for adjusting the tweets of a specific set of users with a score
+  // that is not present in the Earlybird features but has to be passed from the clients, such as
+  // real graph weights or a combination of multiple features.
+  // This field should be used mainly for experimentation since it increases the size of the thrift
+  // requests.
+  154: optional map<i64, double> authorSpecificScoreAdjustments
+
+  // -------- Parameters for ThriftScoringFunctionType.MODEL_BASED --------
+  // Selected models along with their weights for the linear combination
+  152: optional map<string, double> selectedModels
+  153: optional bool useLogitScore = false
+
+  // -------- Parameters for ThriftScoringFunctionType.TENSORFLOW_BASED --------
+  // Selected tensorflow model
+  303: optional string selectedTensorflowModel
+
+  // -------- Deprecated Fields --------
+  // ID 303 has been used in the past. Resume additional deprecated fields from 304
+  105: optional double deprecatedTweetHasTrendInTrendingQueryBoost = 1
+  200: optional double deprecatedAgeDecaySlope = 0.003
+  201: optional double deprecatedAgeDecayHalflife = 360.0
+  202: optional double deprecatedAgeDecayBase = 0.6
+  204: optional ThriftAgeDecayRankingParams deprecatedAgeDecayForTrendsParams
+  301: optional double deprecatedNameQueryConfidence = 0.0
+  302: optional double deprecatedHashtagQueryConfidence = 0.0
+  // Whether to use old-style engagement features (normalized by LogNormalizer)
+  // or new ones (normalized by SingleBytePositiveFloatNormalizer)
+  50: optional bool useGranularEngagementFeatures = 0 // DEPRECATED!
+}(persisted='true')
+
+// This sorting mode is used by earlybird to retrieve the top-n facets that
+// are returned to blender
+enum ThriftFacetEarlybirdSortingMode {
+  SORT_BY_SIMPLE_COUNT = 0,
+  SORT_BY_WEIGHTED_COUNT = 1,
+}
+
+// This is the final sort order used by blender after all results from
+// the earlybirds are merged
+enum ThriftFacetFinalSortOrder {
+  // using the created_at date of the first tweet that contained the facet
+  SCORE = 0,
+  SIMPLE_COUNT = 1,
+  WEIGHTED_COUNT = 2,
+  CREATED_AT = 3
+}
+
+struct ThriftFacetRankingOptions {
+  // next available field ID = 38
+
+  // ======================================================================
+  // EARLYBIRD SETTINGS
+  //
+  // These parameters primarily affect how earlybird creates the top-k
+  // candidate list to be re-ranked by blender
+  // ======================================================================
+  // Dynamically loaded scorer and collector for quick experimentation.
+  26: optional ThriftExperimentClass expScorer
+  27: optional ThriftExperimentClass expCollector
+
+  // It should be less than or equal to reputationParams.min, and all
+  // tweepcreds between the two get a score of 1.0.
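+  // (A hypothetical illustration: with minTweepcredFilterThreshold = 20 and
+  // reputationParams.min = 50, all tweepcreds in [20, 50] would get a score of 1.0.)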
+ 21: optional i32 minTweepcredFilterThreshold + + // the maximum score a single tweet can contribute to the weightedCount + 22: optional i32 maxScorePerTweet + + 15: optional ThriftFacetEarlybirdSortingMode sortingMode + // The number of top candidates earlybird returns to blender + 16: optional i32 numCandidatesFromEarlybird = 100 + + // when to early terminate for facet search, overrides the setting in ThriftSearchQuery + 34: optional i32 maxHitsToProcess = 1000 + + // for anti-gaming we want to limit the maximum amount of hits the same user can + // contribute. Set to -1 to disable the anti-gaming filter. Overrides the setting in + // ThriftSearchQuery + 35: optional i32 maxHitsPerUser = 3 + + // if the tweepcred of the user is bigger than this value it will not be excluded + // by the anti-gaming filter. Overrides the setting in ThriftSearchQuery + 36: optional i32 maxTweepcredForAntiGaming = 65 + + // these settings affect how earlybird computes the weightedCount + 2: optional ThriftLinearFeatureRankingParams parusScoreParams + 3: optional ThriftLinearFeatureRankingParams reputationParams + 17: optional ThriftLinearFeatureRankingParams favoritesParams + 33: optional ThriftLinearFeatureRankingParams repliesParams + 37: optional map rankingExpScoreParams + + // penalty counter settings + 6: optional i32 offensiveTweetPenalty // set to -1 to disable the offensive filter + 7: optional i32 antigamingPenalty // set to -1 to disable antigaming filtering + // weight of penalty counts from all tweets containing a facet, not just the tweets + // matching the query + 9: optional double queryIndependentPenaltyWeight // set to 0 to not use query independent penalty weights + // penalty for keyword stuffing + 60: optional i32 multipleHashtagsOrTrendsPenalty + + // Language related boosts, similar to those in relevance ranking options. By default they are + // all 1.0 (no-boost). + // When the user language is english, facet language is not + 11: optional double langEnglishUIBoost = 1.0 + // When the facet language is english, user language is not + 12: optional double langEnglishFacetBoost = 1.0 + // When the user language differs from facet/tweet language, and neither is english + 13: optional double langDefaultBoost = 1.0 + + // ====================================================================== + // BLENDER SETTINGS + // + // Settings for the facet relevance scoring happening in blender + // ====================================================================== + + // This block of parameters are only used in the FacetsFutureManager. + // limits to discard facets + // if a facet has a higher penalty count, it will not be returned + 5: optional i32 maxPenaltyCount + // if a facet has a lower simple count, it will not be returned + 28: optional i32 minSimpleCount + // if a facet has a lower weighted count, it will not be returned + 8: optional i32 minCount + // the maximum allowed value for offensiveCount/facetCount a facet can have in order to be returned + 10: optional double maxPenaltyCountRatio + // if set to true, then facets with offensive display tweets are excluded from the resultset + 29: optional bool excludePossiblySensitiveFacets + // if set to true, then only facets that have a display tweet in their ThriftFacetCountMetadata object + // will be returned to the caller + 30: optional bool onlyReturnFacetsWithDisplayTweet + + // parameters for scoring force-inserted media items + // Please check FacetReRanker.java computeScoreForInserted() for their usage. 
+ 38: optional double forceInsertedBackgroundExp = 0.3 + 39: optional double forceInsertedMinBackgroundCount = 2 + 40: optional double forceInsertedMultiplier = 0.01 + + // ----------------------------------------------------- + // weights for the facet ranking formula + 18: optional double simpleCountWeight_DEPRECATED + 19: optional double weightedCountWeight_DEPRECATED + 20: optional double backgroundModelBoost_DEPRECATED + + // ----------------------------------------------------- + // Following parameters are used in the FacetsReRanker + // age decay params + 14: optional ThriftAgeDecayRankingParams ageDecayParams + + // used in the facets reranker + 23: optional double maxNormBoost = 5.0 + 24: optional double globalCountExponent = 3.0 + 25: optional double simpleCountExponent = 3.0 + + 31: optional ThriftFacetFinalSortOrder finalSortOrder + + // Run facets search as if they happen at this specific time (ms since epoch). + 32: optional i64 fakeCurrentTimeMs // not really used anywhere, remove? +}(persisted='true') diff --git a/src/thrift/com/twitter/search/earlybird/thrift/earlybird.thrift b/src/thrift/com/twitter/search/earlybird/thrift/earlybird.thrift new file mode 100644 index 000000000..0d4547264 --- /dev/null +++ b/src/thrift/com/twitter/search/earlybird/thrift/earlybird.thrift @@ -0,0 +1,1416 @@ +namespace java com.twitter.search.earlybird.thrift +#@namespace scala com.twitter.search.earlybird.thriftscala +#@namespace strato com.twitter.search.earlybird +namespace py gen.twitter.search.earlybird + +include "com/twitter/ads/adserver/adserver_common.thrift" +include "com/twitter/search/common/caching/caching.thrift" +include "com/twitter/search/common/constants/query.thrift" +include "com/twitter/search/common/constants/search_language.thrift" +include "com/twitter/search/common/conversation/conversation.thrift" +include "com/twitter/search/common/features/features.thrift" +include "com/twitter/search/common/indexing/status.thrift" +include "com/twitter/search/common/query/search.thrift" +include "com/twitter/search/common/ranking/ranking.thrift" +include "com/twitter/search/common/results/expansions.thrift" +include "com/twitter/search/common/results/highlight.thrift" +include "com/twitter/search/common/results/hit_attribution.thrift" +include "com/twitter/search/common/results/hits.thrift" +include "com/twitter/search/common/results/social.thrift" +include "com/twitter/service/spiderduck/gen/metadata_store.thrift" +include "com/twitter/tweetypie/deprecated.thrift" +include "com/twitter/tweetypie/tweet.thrift" +include "com/twitter/escherbird/tweet_annotation.thrift" + +enum ThriftSearchRankingMode { + // good old realtime search mode + RECENCY = 0, + // new super fancy relevance ranking + RELEVANCE = 1, + DEPRECATED_DISCOVERY = 2, + // top tweets ranking mode + TOPTWEETS = 3, + // results from accounts followed by the searcher + FOLLOWS = 4, + + PLACE_HOLDER5 = 5, + PLACE_HOLDER6 = 6, +} + +enum ThriftSearchResultType { + // it's a time-ordered result. + RECENCY = 0, + // it's a highly relevant tweet (aka top tweet). + RELEVANCE = 1, + // top tweet result type + POPULAR = 2, + // promoted tweets (ads) + PROMOTED = 3, + // relevance-ordered (as opposed to time-ordered) tweets generated from a variety of candidates + RELEVANCE_ORDERED = 4, + + PLACE_HOLDER5 = 5, + PLACE_HOLDER6 = 6, +} + +enum ThriftSocialFilterType { + // filter only users that the searcher is directly following. + FOLLOWS = 0, + // filter only users that are in searcher's social circle of trust. 
+ TRUSTED = 1, + // filter both follows and trusted. + ALL = 2, + + PLACE_HOLDER3 = 3, + PLACE_HOLDER4 = 4, + +} + +enum ThriftTweetSource { + ///// enums set by Earlybird + REALTIME_CLUSTER = 1, + FULL_ARCHIVE_CLUSTER = 2, + REALTIME_PROTECTED_CLUSTER = 4, + + ///// enums set inside Blender + ADSERVER = 0, + // from top news search, only used in universal search + TOP_NEWS = 3, + // special tweets included just for EventParrot. + FORCE_INCLUDED = 5, + // from Content Recommender + // from topic to Tweet path + CONTENT_RECS_TOPIC_TO_TWEET = 6, + // used for hydrating QIG Tweets (go/qig) + QIG = 8, + // used for TOPTWEETS ranking mode + TOP_TWEET = 9, + // used for experimental candidate sources + EXPERIMENTAL = 7, + // from Scanr service + SCANR = 10, + + PLACE_HOLDER11 = 11, + PLACE_HOLDER12 = 12 +} + +enum NamedEntitySource { + TEXT = 0, + URL = 1, + + PLACE_HOLDER2 = 2, + PLACE_HOLDER3 = 3, + PLACE_HOLDER4 = 4, +} + +enum ExperimentCluster { + EXP0 = 0, // Send requests to the earlybird-realtime-exp0 cluster + PLACE_HOLDER1 = 1, + PLACE_HOLDER2 = 2, +} + +enum AudioSpaceState { + RUNNING = 0, + ENDED = 1, + + PLACE_HOLDER2 = 2, + PLACE_HOLDER3 = 3, + PLACE_HOLDER4 = 4, + PLACE_HOLDER5 = 5, +} + +// Contains all scoring and relevance-filtering related controls and options for Earlybird. +struct ThriftSearchRelevanceOptions { + // Next available field ID: 31 and note that 45 and 50 have been used already + + 2: optional bool filterDups = 0 // filter out duplicate search results + 26: optional bool keepDupWithHigherScore = 1 // keep the duplicate tweet with the higher score + + 3: optional bool proximityScoring = 0 // whether to do proximity scoring or not + 4: optional i32 maxConsecutiveSameUser // filter consecutive results from the same user + 5: optional ranking.ThriftRankingParams rankingParams // composed by blender + // deprecated in favor of the maxHitsToProcess in CollectorParams + 6: optional i32 maxHitsToProcess // when to early-terminate for relevance + 7: optional string experimentName // what relevance experiment is running + 8: optional string experimentBucket // what bucket the user is in; DDG defaults to hard-coded 'control' + 9: optional bool interpretSinceId = 1 // whether to interpret since_id operator + + 24: optional i32 maxHitsPerUser // Overrides ThriftSearchQuery.maxHitsPerUser + + // only used by discovery for capping direct follow tweets + 10: optional i32 maxConsecutiveDirectFollows + + // Note - the orderByRelevance flag is critical to understanding how merging + // and trimming works in relevance mode in the search root. + // + // When orderByRelevance is true, results are trimmed in score-order. This means the + // client will get the top results from (maxHitsToProcess * numHashPartitions) hits, + // ordered by score. + // + // When orderByRelevance is false, results are trimmed in id-order. This means the + // client will get the top results from an approximation of maxHitsToProcess hits + // (across the entire corpus). These results ordered by ID. + 14: optional bool orderByRelevance = 0 + + // Max blending count for results returned due to from:user rewrites + 16: optional i32 maxUserBlendCount + + // The weight for proximity phrases generated while translating the serialized query to the + // lucene query. + 19: optional double proximityPhraseWeight = 1.0 + 20: optional i32 proximityPhraseSlop = 255 + + // Override the weights of searchable fields. 
+  // Negative weight means the field is not enabled for search by default,
+  // but if it is (e.g., by annotation), the absolute value of the weight shall be
+  // used (if the annotation does not specify a weight).
+  21: optional map<string, double> fieldWeightMapOverride
+
+  // whether to disable the coordination in the rewritten disjunction query, term query and phrase query
+  // the details can be found in LuceneVisitor
+  22: optional bool deprecated_disableCoord = 0
+
+  // Root only. Returns all results seen by root to the client without trimming
+  // if set to true.
+  23: optional bool returnAllResults
+
+  // DEPRECATED: All v2 counters will be used explicitly in the scoring function and
+  // returned in their own field (in either metadata or feature map in response).
+  25: optional bool useEngagementCountersV2 = 0
+
+  // -------- PERSONALIZATION-RELATED RELEVANCE OPTIONS --------
+  // Take special care with these options when reasoning about caching.
+
+  // Deprecated in SEARCH-8616.
+  45: optional map<i64, double> deprecated_topicIDWeights
+
+  // Collect hit attribution on queries and likedByUserIDFilter64-enhanced queries to
+  // get likedByUserIds list in metadata field.
+  // NOTE: this flag has no effect on fromUserIDFilter64.
+  50: optional bool collectFieldHitAttributions = 0
+
+  // Whether to collect all hits regardless of their score with RelevanceAllCollector.
+  27: optional bool useRelevanceAllCollector = 0
+
+  // Override features of specific tweets before the tweets are scored.
+  28: optional map<i64, features.ThriftSearchResultFeatures> perTweetFeaturesOverride
+
+  // Override features of all tweets from specific users before the tweets are scored.
+  29: optional map<i64, features.ThriftSearchResultFeatures> perUserFeaturesOverride
+
+  // Override features of all tweets before the tweets are scored.
+  30: optional features.ThriftSearchResultFeatures globalFeaturesOverride
+}(persisted='true')
+
+// Facet types that may have different ranking parameters.
+enum ThriftFacetType {
+  DEFAULT = 0,
+  MENTIONS_FACET = 1,
+  HASHTAGS_FACET = 2,
+  // Deprecated in SEARCH-13708
+  DEPRECATED_NAMED_ENTITIES_FACET = 3,
+  STOCKS_FACET = 4,
+  VIDEOS_FACET = 5,
+  IMAGES_FACET = 6,
+  NEWS_FACET = 7,
+  LANGUAGES_FACET = 8,
+  SOURCES_FACET = 9,
+  TWIMG_FACET = 10,
+  FROM_USER_ID_FACET = 11,
+  DEPRECATED_TOPIC_IDS_FACET = 12,
+  RETWEETS_FACET = 13,
+  LINKS_FACET = 14,
+
+  PLACE_HOLDER15 = 15,
+  PLACE_HOLDER16 = 16,
+}
+
+struct ThriftSearchDebugOptions {
+  // Make earlybird only score and return tweets (specified by tweet id) here, regardless
+  // of whether they have a hit for the current query or not.
+  1: optional set<i64> statusIds;
+
+  // Assorted structures to pass in debug options.
+  2: optional map<string, string> stringMap;
+  3: optional map<string, double> valueMap;
+  4: optional list<double> valueList;
+}(persisted='true')
+
+// These options control what metadata will be returned by earlybird for each search result
+// in the ThriftSearchResultMetadata struct. These options are currently mostly supported by
+// AbstractRelevanceCollector and partially in SearchResultsCollector. Most are true by default to
+// preserve backwards compatibility, but can be disabled as necessary to optimize searches returning
+// many results (such as discover).
+struct ThriftSearchResultMetadataOptions {
+  // If true, fills in the tweetUrls field in ThriftSearchResultMetadata.
+  // Populated by AbstractRelevanceCollector.
+  1: optional bool getTweetUrls = 1
+
+  // If true, fills in the resultLocation field in ThriftSearchResultMetadata.
+  // Populated by AbstractRelevanceCollector.
+  2: optional bool getResultLocation = 1
+
+  // Deprecated in SEARCH-8616.
+  3: optional bool deprecated_getTopicIDs = 1
+
+  // If true, fills in the luceneScore field in ThriftSearchResultMetadata.
+  // Populated by LinearScoringFunction.
+  4: optional bool getLuceneScore = 0
+
+  // Deprecated but used to be for Offline feature values for static index
+  5: optional bool deprecated_getExpFeatureValues = 0
+
+  // If true, will omit all features derivable from packedFeatures, and set packedFeatures
+  // instead.
+  6: optional bool deprecated_usePackedFeatures = 0
+
+  // If true, fills sharedStatusId. For replies this is the in-reply-to status id and for
+  // retweets this is the retweet source status id.
+  // Also fills in the isRetweet and isReply flags.
+  7: optional bool getInReplyToStatusId = 0
+
+  // If true, fills referencedTweetAuthorId. Also fills in the isRetweet and isReply flags.
+  8: optional bool getReferencedTweetAuthorId = 0
+
+  // If true, fills media bits (video/vine/periscope/etc.)
+  9: optional bool getMediaBits = 0
+
+  // If true, will return all defined features in the packed features. This flag does not cover
+  // the above defined features.
+  10: optional bool getAllFeatures = 0
+
+  // If true, will return all features in ThriftSearchResultFeatures format.
+  11: optional bool returnSearchResultFeatures = 0
+
+  // If the client caches some feature schemas, the client can indicate its cached schemas through
+  // this field based on (version, checksum).
+  12: optional list featureSchemasAvailableInClient
+
+  // Specific feature IDs to return for recency requests. Populated in SearchResultFeatures.
+  // Values must be IDs of CSF fields from EarlybirdFieldConstants.
+  13: optional list<i32> requestedFeatureIDs
+
+  // If true, fills in the namedEntities field in ThriftSearchResultExtraMetadata
+  14: optional bool getNamedEntities = 0
+
+  // If true, fills in the entityAnnotations field in ThriftSearchResultExtraMetadata
+  15: optional bool getEntityAnnotations = 0
+
+  // If true, fills in the fromUserId field in the ThriftSearchResultExtraMetadata
+  16: optional bool getFromUserId = 0
+
+  // If true, fills in the spaces field in the ThriftSearchResultExtraMetadata
+  17: optional bool getSpaces = 0
+
+  18: optional bool getExclusiveConversationAuthorId = 0
+}(persisted='true')
+
+
+// ThriftSearchQuery describes an earlybird search request, which typically consists
+// of these parts:
+//   - a query to retrieve hits
+//   - relevance options to score hits
+//   - a collector to collect hits and process into search results
+// Note that this struct is used in both ThriftBlenderRequest and EarlybirdRequest.
+// Most fields are not set when this struct is embedded in ThriftBlenderRequest, and
+// are filled in by the blender before sending to earlybird.
+struct ThriftSearchQuery {
+  // Next available field ID: 42
+
+  // -------- SECTION ZERO: THINGS USED ONLY BY THE BLENDER --------
+  // See SEARCHQUAL-2398
+  // These fields are used by the blender and clients of the blender, but not by earlybird.
+
+  // blender use only
+  // The raw un-parsed user search query.
+  6: optional string rawQuery(personalDataType = 'SearchQuery')
+
+  // blender use only
+  // Language of the rawQuery.
+  18: optional string queryLang(personalDataType = 'InferredLanguage')
+
+  // blender use only
+  // What page of results to return, indexed from 1.
+  7: optional i32 page = 1
+
+  // blender use only
+  // Number of results to skip (for pagination). Indexed from 0.
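+  // (Presumably offset = (page - 1) * numResults, e.g. page = 2 with numResults = 20
+  // corresponds to skipping the first 20 results.)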
+  2: optional i32 deprecated_resultOffset = 0
+
+
+  // -------- SECTION ONE: RETRIEVAL OPTIONS --------
+  // These options control the query that will be used to retrieve documents / hits.
+
+  // The parsed query tree, serialized to a string. Restricts the search results to
+  // tweets matching this query.
+  1: optional string serializedQuery(personalDataType = 'SearchQuery')
+
+  // Restricts the search results to tweets having this minimum tweep cred, out of 100.
+  5: optional i32 minTweepCredFilter = -1
+
+  // Restricts the search results to tweets from these users.
+  34: optional list<i64> fromUserIDFilter64(personalDataType = 'PrivateAccountsFollowing, PublicAccountsFollowing')
+  // Restricts the search results to tweets liked by these users.
+  40: optional list<i64> likedByUserIDFilter64(personalDataType = 'PrivateAccountsFollowing, PublicAccountsFollowing')
+
+  // If searchStatusIds are present, earlybird will ignore the serializedQuery completely
+  // and simply score each of searchStatusIds, also bypassing features like duplicate
+  // filtering and early termination.
+  // IMPORTANT: this means that it is possible to get scores equal to ScoringFunction.SKIP_HIT,
+  // for results skipped by the scoring function.
+  31: optional set<i64> searchStatusIds
+
+  35: optional set<i64> deprecated_eventClusterIdsFilter
+
+  41: optional map<string, list<i64>> namedDisjunctionMap
+
+  // -------- SECTION TWO: HIT COLLECTOR OPTIONS --------
+  // These options control what hits will be collected by the hit collector.
+  // Whether we want to collect and return per-field hit attributions is set in RelevanceOptions.
+  // See SEARCH-2784
+  // Number of results to return (after offset/page correction).
+  // This is ignored when searchStatusIds is set.
+  3: required i32 numResults
+
+  // Maximum number of hits to process by the collector.
+  // deprecated in favor of the maxHitsToProcess in CollectorParams
+  4: optional i32 maxHitsToProcess = 1000
+
+  // Collect hit counts for these time periods (in milliseconds).
+  30: optional list<i64> hitCountBuckets
+
+  // If set, earlybird will also return the facet labels of the specified facet fields
+  // in result tweets.
+  33: optional list<string> facetFieldNames
+
+  // Options controlling which search result metadata is returned.
+  36: optional ThriftSearchResultMetadataOptions resultMetadataOptions
+
+  // Collection related Params
+  38: optional search.CollectorParams collectorParams
+
+  // Whether to collect conversation IDs
+  39: optional bool collectConversationId = 0
+
+  // -------- SECTION THREE: RELEVANCE OPTIONS --------
+  // These options control relevance scoring and anti-gaming.
+
+  // Ranking mode (RECENCY means time-ordered ranking with no relevance).
+  8: optional ThriftSearchRankingMode rankingMode = ThriftSearchRankingMode.RECENCY
+
+  // Relevance scoring options.
+  9: optional ThriftSearchRelevanceOptions relevanceOptions
+
+  // Limits the number of hits that can be contributed by the same user, for anti-gaming.
+  // Set to -1 to disable the anti-gaming filter. This is ignored when searchStatusIds
+  // is set.
+  11: optional i32 maxHitsPerUser = 3
+
+  // Disables anti-gaming filter checks for any tweets that exceed this tweepcred.
+  12: optional i32 maxTweepcredForAntiGaming = 65
+
+  // -------- PERSONALIZATION-RELATED RELEVANCE OPTIONS --------
+  // Take special care with these options when reasoning about caching. All of these
+  // options, if set, will bypass the cache with the exception of uiLang which is the
+  // only form of personalization allowed for caching.
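+  // (e.g. a request that sets only uiLang remains cacheable, while setting searcherId,
+  // trustedFilter, or directFollowFilter below forces a cache bypass.)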
+ + // User ID of searcher. This is used for relevance, and will be used for retrieval + // by the protected tweets index. If set, query will not be cached. + 20: optional i64 searcherId(personalDataType = 'UserId') + + // Bloom filter containing trusted user IDs. If set, query will not be cached. + 10: optional binary trustedFilter(personalDataType = 'UserId') + + // Bloom filter containing direct follow user IDs. If set, query will not be cached. + 16: optional binary directFollowFilter(personalDataType = 'UserId, PrivateAccountsFollowing, PublicAccountsFollowing') + + // UI language from the searcher's profile settings. + 14: optional string uiLang(personalDataType = 'GeneralSettings') + + // Confidence of the understandability of different languages for this user. + // uiLang field above is treated as a userlang with a confidence of 1.0. + 28: optional map userLangs(personalDataTypeKey = 'InferredLanguage') + + // An alternative to fromUserIDFilter64 that relies on the relevance bloom filters + // for user filtering. Not currently used in production. Only supported for realtime + // searches. + // If set, earlybird expects both trustedFilter and directFollowFilter to also be set. + 17: optional ThriftSocialFilterType socialFilterType + + // -------- SECTION FOUR: DEBUG OPTIONS, FORGOTTEN FEATURES -------- + + // Earlybird search debug options. + 19: optional ThriftSearchDebugOptions debugOptions + + // Overrides the query time for debugging. + 29: optional i64 timestampMsecs = 0 + + // Support for this feature has been removed and this field is left for backwards compatibility + // (and to detect improper usage by clients when it is set). + 25: optional list deprecated_iterativeQueries + + // Specifies a lucene query that will only be used if serializedQuery is not set, + // for debugging. Not currently used in production. + 27: optional string luceneQuery(personalDataType = 'SearchQuery') + + // This field is deprecated and is not used by earlybirds when processing the query. + 21: optional i32 deprecated_minDocsToProcess = 0 +}(persisted='true', hasPersonalData = 'true') + + +struct ThriftFacetLabel { + 1: required string fieldName + 2: required string label + // the number of times this facet has shown up in tweets with offensive words. + 3: optional i32 offensiveCount = 0 + + // only filled for TWIMG facets + 4: optional string nativePhotoUrl +}(persisted='true') + +struct ThriftSearchResultGeoLocation { + 1: optional double latitude(personalDataType = 'GpsCoordinates') + 2: optional double longitude(personalDataType = 'GpsCoordinates') + 3: optional double distanceKm +}(persisted='true', hasPersonalData = 'true') + +// Contains an expanded url and media type from the URL facet fields in earlybird. +// Note: thrift copied from status.thrift with unused fields renamed. +struct ThriftSearchResultUrl { + // Next available field ID: 6. Fields 2-4 removed. + + // Note: this is actually the expanded url. Rename after deprecated fields are removed. + 1: required string originalUrl + + // Media type of the url. 
+ 5: optional metadata_store.MediaTypes mediaType +}(persisted='true') + +struct ThriftSearchResultNamedEntity { + 1: required string canonicalName + 2: required string entityType + 3: required NamedEntitySource source +}(persisted='true') + +struct ThriftSearchResultAudioSpace { + 1: required string id + 2: required AudioSpaceState state +}(persisted='true') + +// Even more metadata +struct ThriftSearchResultExtraMetadata { + // Next available field ID: 49 + + 1: optional double userLangScore + 2: optional bool hasDifferentLang + 3: optional bool hasEnglishTweetAndDifferentUILang + 4: optional bool hasEnglishUIAndDifferentTweetLang + 5: optional i32 quotedCount + 6: optional double querySpecificScore + 7: optional bool hasQuote + 29: optional i64 quotedTweetId + 30: optional i64 quotedUserId + 31: optional search_language.ThriftLanguage cardLang + 8: optional i64 conversationId + 9: optional bool isSensitiveContent + 10: optional bool hasMultipleMediaFlag + 11: optional bool profileIsEggFlag + 12: optional bool isUserNewFlag + 26: optional double authorSpecificScore + 28: optional bool isComposerSourceCamera + + // temporary V2 engagement counters, original ones in ThriftSearchResultMetadata has log() + // applied on them and then converted to int in Thrift, which is effectively a premature + // discretization. It doesn't affect the scoring inside Earlybird but for scoring and ML training + // outside earlybird, they were bad. These newly added ones stores a proper value of these + // counts. This also provides an easier transition to v2 counter when Earlybird is eventually + // ready to consume them from DL + // See SEARCHQUAL-9536, SEARCH-11181 + 18: optional i32 retweetCountV2 + 19: optional i32 favCountV2 + 20: optional i32 replyCountV2 + // Tweepcred weighted version of various engagement counts + 22: optional i32 weightedRetweetCount + 23: optional i32 weightedReplyCount + 24: optional i32 weightedFavCount + 25: optional i32 weightedQuoteCount + + // 2 bits - 0, 1, 2, 3+ + 13: optional i32 numMentions + 14: optional i32 numHashtags + + // 1 byte - 256 possible languages + 15: optional i32 linkLanguage + // 6 bits - 64 possible values + 16: optional i32 prevUserTweetEngagement + + 17: optional features.ThriftSearchResultFeatures features + + // If the ThriftSearchQuery.likedByUserIdFilter64 and ThriftSearchRelevanceOptions.collectFieldHitAttributions + // fields are set, then this field will contain the list of all users in the query that liked this tweet. + // Otherwise, this field is not set. + 27: optional list likedByUserIds + + + // Deprecated. See SEARCHQUAL-10321 + 21: optional double dopamineNonPersonalizedScore + + 32: optional list namedEntities + 33: optional list entityAnnotations + + // Health model scores from HML + 34: optional double toxicityScore // (go/toxicity) + 35: optional double pBlockScore // (go/pblock) + 36: optional double experimentalHealthModelScore1 + 37: optional double experimentalHealthModelScore2 + 38: optional double experimentalHealthModelScore3 + 39: optional double experimentalHealthModelScore4 + + 40: optional i64 directedAtUserId + + // Health model scores from HML (cont.) 
+ 41: optional double pSpammyTweetScore // (go/pspammytweet) + 42: optional double pReportedTweetScore // (go/preportedtweet) + 43: optional double spammyTweetContentScore // (go/spammy-tweet-content) + // it is populated by looking up user table and it is only available in archive earlybirds response + 44: optional bool isUserProtected + 45: optional list spaces + + 46: optional i64 exclusiveConversationAuthorId + 47: optional string cardUri + 48: optional bool fromBlueVerifiedAccount(personalDataType = 'UserVerifiedFlag') +}(persisted='true') + +// Some basic metadata about a search result. Useful for re-sorting, filtering, etc. +// +// NOTE: DO NOT ADD NEW FIELD!! +// Stop adding new fields to this struct, all new fields should go to +// ThriftSearchResultExtraMetadata (VM-1897), or there will be performance issues in production. +struct ThriftSearchResultMetadata { + // Next available field ID: 86 + + // -------- BASIC SCORING METADATA -------- + + // When resultType is RECENCY most scoring metadata will not be available. + 1: required ThriftSearchResultType resultType + + // Relevance score computed for this result. + 3: optional double score + + // True if the result was skipped by the scoring function. Only set when the collect-all + // results collector was used - in other cases skipped results are not returned. + // The score will be ScoringFunction.SKIP_HIT when skipped is true. + 43: optional bool skipped + + // optionally a Lucene-style explanation for this result + 5: optional string explanation + + + // -------- NETWORK-BASED SCORING METADATA -------- + + // Found the tweet in the trusted circle. + 6: optional bool isTrusted + + // Found the tweet in the direct follows. + 8: optional bool isFollow + + // True if the fromUserId of this tweet was whitelisted by the dup / antigaming filter. + // This typically indicates the result was from a tweet that matched a fromUserId query. + 9: optional bool dontFilterUser + + + // -------- COMMON DOCUMENT METADATA -------- + + // User ID of the author. When isRetweet is true, this is the user ID of the retweeter + // and NOT that of the original tweet. + 7: optional i64 fromUserId = 0 + + // When isRetweet (or packed features equivalent) is true, this is the status id of the + // original tweet. When isReply and getReplySource are true, this is the status id of the + // original tweet. In all other circumstances this is 0. + 40: optional i64 sharedStatusId = 0 + + // When hasCard (or packed features equivalent) is true, this is one of SearchCardType. + 49: optional i8 cardType = 0 + + // -------- EXTENDED DOCUMENT METADATA -------- + // This is additional metadata from facet fields and column stride fields. + // Return of these fields is controlled by ThriftSearchResultMetadataOptions to + // allow for fine-grained control over when these fields are returned, as an + // optimization for searches returning a large quantity of results. + + // Lucene component of the relevance score. Only returned when + // ThriftSearchResultMetadataOptions.getLuceneScore is true. + 31: optional double luceneScore = 0.0 + + // Urls found in the tweet. Only returned when + // ThriftSearchResultMetadataOptions.getTweetUrls is true. + 18: optional list tweetUrls + + // Deprecated in SEARCH-8616. + 36: optional list deprecated_topicIDs + + // Facets available in this tweet, this will only be filled if + // ThriftSearchQuery.facetFieldNames is set in the request. 
+ 22: optional list facetLabels
+
+ // The location of the result, and the distance to it from the center of the query
+ // location. Only returned when ThriftSearchResultMetadataOptions.getResultLocation is true.
+ 35: optional ThriftSearchResultGeoLocation resultLocation
+
+ // Per field hit attribution.
+ 55: optional hit_attribution.FieldHitAttribution fieldHitAttribution
+
+ // whether this has a geolocation_type:geotag hit
+ 57: optional bool geotagHit = 0
+
+ // the user id of the author of the source/referenced tweet (the tweet one replied
+ // to, retweeted and possibly quoted, etc.) (SEARCH-8561)
+ // Only returned when ThriftSearchResultMetadataOptions.getReferencedTweetAuthorId is true.
+ 60: optional i64 referencedTweetAuthorId = 0
+
+ // Whether this tweet has certain types of media.
+ // Only returned when ThriftSearchResultMetadataOptions.getMediaBits is true.
+ // "Native video" is either consumer, pro, vine, or periscope.
+ // "Native image" is an image hosted on pic.twitter.com.
+ 62: optional bool hasConsumerVideo
+ 63: optional bool hasProVideo
+ 64: optional bool hasVine
+ 65: optional bool hasPeriscope
+ 66: optional bool hasNativeVideo
+ 67: optional bool hasNativeImage
+
+ // Packed features for this result. This field is never populated.
+ 50: optional status.PackedFeatures deprecated_packedFeatures
+
+ // The features stored in earlybird
+
+ // From integer 0 from EarlybirdFeatureConfiguration:
+ 16: optional bool isRetweet
+ 71: optional bool isSelfTweet
+ 10: optional bool isOffensive
+ 11: optional bool hasLink
+ 12: optional bool hasTrend
+ 13: optional bool isReply
+ 14: optional bool hasMultipleHashtagsOrTrends
+ 23: optional bool fromVerifiedAccount
+ // Static text quality score. This is actually an int between 0 and 100.
+ 30: optional double textScore
+ 51: optional search_language.ThriftLanguage language
+
+ // From integer 1 from EarlybirdFeatureConfiguration:
+ 52: optional bool hasImage
+ 53: optional bool hasVideo
+ 28: optional bool hasNews
+ 48: optional bool hasCard
+ 61: optional bool hasVisibleLink
+ // Tweep cred aka user rep. This is actually an int between 0 and 100.
+ 32: optional double userRep
+ 24: optional bool isUserSpam
+ 25: optional bool isUserNSFW
+ 26: optional bool isUserBot
+ 54: optional bool isUserAntiSocial
+
+ // From integer 2 from EarlybirdFeatureConfiguration:
+
+ // Retweet, fav, reply, embeds counts, and video view counts are APPROXIMATE ONLY.
+ // Note that retweetCount, favCount and replyCount are not the original unnormalized values:
+ // a log2() function was applied to them for historical reasons, which loses some granularity.
+ // For more accurate counts, use {retweet, fav, reply}CountV2 in extraMetadata.
+ 2: optional i32 retweetCount
+ 33: optional i32 favCount
+ 34: optional i32 replyCount
+ 58: optional i32 embedsImpressionCount
+ 59: optional i32 embedsUrlCount
+ 68: optional i32 videoViewCount
+
+ // Parus score. This is actually an int between 0 and 100.
+ 29: optional double parusScore
+
+ // Extra feature data. All new feature fields you want to return from Earlybird should go into
+ // this struct; the outer one keeps reaching the limit on the number of fields the JVM can
+ // comfortably support!!
+ 86: optional ThriftSearchResultExtraMetadata extraMetadata
+
+ // Integer 3 is omitted, see expFeatureValues above for more details.
+
+ // From integer 4 from EarlybirdFeatureConfiguration:
+ // Signature, for duplicate detection and removal.
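+ // (Presumably a hash of the tweet content: results that share a signature can be treated as
+ // near-duplicates and collapsed downstream.)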
+ 4: optional i32 signature
+
+ // -------- THINGS USED ONLY BY THE BLENDER --------
+
+ // Social proof of the tweet, for network discovery.
+ // Do not use these fields outside of network discovery.
+ 41: optional list<i64> retweetedUserIDs64
+ 42: optional list<i64> replyUserIDs64
+
+ // Social connection between the search user and this result.
+ 19: optional social.ThriftSocialContext socialContext
+
+ // used by RelevanceTimelineSearchWorkflow, whether a tweet should be highlighted or not
+ 46: optional bool highlightResult
+
+ // used by RelevanceTimelineSearchWorkflow, the highlight context of the highlighted tweet
+ 47: optional highlight.ThriftHighlightContext highlightContext
+
+ // the penguin version used to tokenize the tweets by the serving earlybird index as defined
+ // in com.twitter.common.text.version.PenguinVersion
+ 56: optional i8 penguinVersion
+
+ 69: optional bool isNullcast
+
+ // This is the normalized ratio (0.00 to 1.00) of the nth token (starting before 140) divided by
+ // numTokens, then quantized into 16 positions (4 bits) but returned on a scale of 0 to 100%, as
+ // we unnormalize it for you.
+ 70: optional double tokenAt140DividedByNumTokensBucket
+
+}(persisted='true')
+
+// Query level result stats.
+// Next id: 20
+struct ThriftSearchResultsRelevanceStats {
+ 1: optional i32 numScored = 0
+ // Count of skipped documents: they were also scored, but their scores got ignored (skipped). Note that this is
+ // different from numResultsSkipped in the ThriftSearchResults.
+ 2: optional i32 numSkipped = 0
+ 3: optional i32 numSkippedForAntiGaming = 0
+ 4: optional i32 numSkippedForLowReputation = 0
+ 5: optional i32 numSkippedForLowTextScore = 0
+ 6: optional i32 numSkippedForSocialFilter = 0
+ 7: optional i32 numSkippedForLowFinalScore = 0
+ 8: optional i32 oldestScoredTweetAgeInSeconds = 0
+
+ // More counters for various features.
+ 9: optional i32 numFromDirectFollows = 0
+ 10: optional i32 numFromTrustedCircle = 0
+ 11: optional i32 numReplies = 0
+ 12: optional i32 numRepliesTrusted = 0
+ 13: optional i32 numRepliesOutOfNetwork = 0
+ 14: optional i32 numSelfTweets = 0
+ 15: optional i32 numWithMedia = 0
+ 16: optional i32 numWithNews = 0
+ 17: optional i32 numSpamUser = 0
+ 18: optional i32 numOffensive = 0
+ 19: optional i32 numBot = 0
+}(persisted='true')
+
+// Per result debug info.
+struct ThriftSearchResultDebugInfo {
+ 1: optional string hostname
+ 2: optional string clusterName
+ 3: optional i32 partitionId
+ 4: optional string tiername
+}(persisted='true')
+
+struct ThriftSearchResult {
+ // Next available field ID: 22
+
+ // Result status id.
+ 1: required i64 id
+
+ // TweetyPie status of the search result
+ 7: optional deprecated.Status tweetypieStatus
+ 19: optional tweet.Tweet tweetypieTweet // v2 struct
+
+ // If the search result is a retweet, this field contains the source TweetyPie status.
+ 10: optional deprecated.Status sourceTweetypieStatus
+ 20: optional tweet.Tweet sourceTweetypieTweet // v2 struct
+
+ // If the search result is a quote tweet, this field contains the quoted TweetyPie status.
+ 17: optional deprecated.Status quotedTweetypieStatus
+ 21: optional tweet.Tweet quotedTweetypieTweet // v2 struct
+
+ // Additional metadata about a search result.
+ 5: optional ThriftSearchResultMetadata metadata
+
+ // Hit highlights for various parts of this tweet
+ // for tweet text
+ 6: optional list hitHighlights
+ // for the title and description in the card expando.
+ 12: optional list cardTitleHitHighlights
+ 13: optional list cardDescriptionHitHighlights
+
+ // Expansion types. If expandResult == false, the expansions set should be ignored.
+ 8: optional bool expandResult = 0
+ 9: optional set expansions
+
+ // Only set if this is a promoted tweet
+ 11: optional adserver_common.AdImpression adImpression
+
+ // where this tweet is from
+ // Since ThriftSearchResult is used not only as an Earlybird response, but also as an internal
+ // data transfer object of Blender, the value of this field is mutable in Blender and does not
+ // necessarily reflect the Earlybird response.
+ 14: optional ThriftTweetSource tweetSource
+
+ // the features of a tweet used for relevance timeline
+ // this field is populated by blender in RelevanceTimelineSearchWorkflow
+ 15: optional features.ThriftTweetFeatures tweetFeatures
+
+ // the conversation context of a tweet
+ 16: optional conversation.ThriftConversationContext conversationContext
+
+ // per-result debugging info that's persisted across merges.
+ 18: optional ThriftSearchResultDebugInfo debugInfo
+}(persisted='true')
+
+enum ThriftFacetRankingMode {
+ COUNT = 0,
+ FILTER_WITH_TERM_STATISTICS = 1,
+}
+
+struct ThriftFacetFieldRequest {
+ // next available field ID: 4
+ 1: required string fieldName
+ 2: optional i32 numResults = 5
+
+ // use facetRankingOptions in ThriftFacetRequest instead
+ 3: optional ThriftFacetRankingMode rankingMode = ThriftFacetRankingMode.COUNT
+}(persisted='true')
+
+struct ThriftFacetRequest {
+ // Next available field ID: 7
+ 1: optional list<ThriftFacetFieldRequest> facetFields
+ 5: optional ranking.ThriftFacetRankingOptions facetRankingOptions
+ 6: optional bool usingQueryCache = 0
+}(persisted='true')
+
+struct ThriftTermRequest {
+ 1: optional string fieldName = "text"
+ 2: required string term
+}(persisted='true')
+
+enum ThriftHistogramGranularityType {
+ MINUTES = 0,
+ HOURS = 1,
+ DAYS = 2,
+ CUSTOM = 3,
+
+ PLACE_HOLDER4 = 4,
+ PLACE_HOLDER5 = 5,
+}
+
+struct ThriftHistogramSettings {
+ 1: required ThriftHistogramGranularityType granularity
+ 2: optional i32 numBins = 60
+ 3: optional i32 samplingRate = 1
+ 4: optional i32 binSizeInSeconds // the bin size, only used if granularity is set to CUSTOM.
+}(persisted='true')
+
+// next id is 4
+struct ThriftTermStatisticsRequest {
+ 1: optional list<ThriftTermRequest> termRequests
+ 2: optional ThriftHistogramSettings histogramSettings
+ // If this is set to true, even if there are no termRequests above, so long as the histogramSettings
+ // is set, Earlybird will return a null->ThriftTermResults entry in the termResults map, containing
+ // the global tweet count histogram for the current query, which is the number of tweets matching this
+ // query in different minutes/hours/days.
+ 3: optional bool includeGlobalCounts = 0
+ // When this is set, the background facets call does another search in order to find the best
+ // representative tweet for a given term request; the representative tweet is stored in the
+ // metadata of the termstats result
+ 4: optional bool scoreTweetsForRepresentatives = 0
+}(persisted='true')
+
+// Next id is 12
+struct ThriftFacetCountMetadata {
+ // this is the id of the first tweet in the index that contained this facet
+ 1: optional i64 statusId = -1
+
+ // whether the tweet with the above statusId is NSFW, from an antisocial user,
+ // marked as sensitive content, etc.
+ 10: optional bool statusPossiblySensitive
+
+ // the id of the user who sent the tweet above - only returned if
+ // statusId is returned too
+ // NOTE: for native photos we may not be able to determine the user,
+ // even though the statusId can be returned. This is because the statusId
+ // can be determined from the url, but the user can't and the tweet may
+ // not be in the index anymore. In this case statusId would be set but
+ // twitterUserId would not.
+ 2: optional i64 twitterUserId = -1
+
+ // the language of the tweet above.
+ 8: optional search_language.ThriftLanguage statusLanguage
+
+ // optionally whitelist the fromUserId from dup/twitterUserId filtering
+ 3: optional bool dontFilterUser = 0
+
+ // if this facet is a native photo we return for convenience the
+ // twimg url
+ 4: optional string nativePhotoUrl
+
+ // optionally returns some debug information about this facet
+ 5: optional string explanation
+
+ // the created_at value for the tweet from statusId - only returned
+ // if statusId is returned too
+ 6: optional i64 created_at
+
+ // the maximum tweepcred of the hits that contained this facet
+ 7: optional i32 maxTweepCred
+
+ // Whether this facet result is force inserted, instead of organically returned from search.
+ // This field is only used in Blender to mark the force-inserted facet results
+ // (from recent tweets, etc).
+ 11: optional bool forceInserted = 0
+}(persisted='true')
+
+struct ThriftTermResults {
+ 1: required i32 totalCount
+ 2: optional list<i32> histogramBins
+ 3: optional ThriftFacetCountMetadata metadata
+}(persisted='true')
+
+struct ThriftTermStatisticsResults {
+ 1: required map<ThriftTermRequest, ThriftTermResults> termResults
+ 2: optional ThriftHistogramSettings histogramSettings
+ // If histogramSettings is set, this will have a list of ThriftHistogramSettings.numBins binIds
+ // that the corresponding histogramBins in ThriftTermResults will have counts for.
+ // The binIds will correspond to the times of the hits matching the driving search query for this
+ // term statistics request.
+ // If there were no hits matching the search query, numBins binIds will be returned, but the
+ // values of the binIds will not meaningfully correspond to anything related to the query, and
+ // should not be used. Such cases can be identified by ThriftSearchResults.numHitsProcessed being
+ // set to 0 in the response, and the response not being early terminated.
+ 3: optional list<i32> binIds
+ // If set, this id indicates the id of the minimum (oldest) bin that has been completely searched,
+ // even if the query was early terminated. If not set, no bin was searched fully, or no histogram
+ // was requested.
+ // Note that if e.g. a query only matches a bin partially (due to e.g. a since operator) the bin
+ // is still considered fully searched if the query did not early terminate.
+ 4: optional i32 minCompleteBinId
+}(persisted='true')
+
+struct ThriftFacetCount {
+ // the text of the facet
+ 1: required string facetLabel
+
+ // deprecated; currently matches weightedCount for backwards-compatibility reasons
+ 2: optional i32 facetCount
+
+ // the simple count of tweets that contained this facet, without any
+ // weighting applied
+ 7: optional i32 simpleCount
+
+ // a weighted version of the count, using signals like tweepcred, parus, etc.
+ 8: optional i32 weightedCount
+
+ // the number of times this facet occurred in tweets matching the background query
+ // using the term statistics API - only set if FILTER_WITH_TERM_STATISTICS was used
+ 3: optional i32 backgroundCount
+
+ // the relevance score that was computed for this facet if FILTER_WITH_TERM_STATISTICS
+ // was used
+ 4: optional double score
+
+ // a counter for how often this facet was penalized
+ 5: optional i32 penaltyCount
+
+ 6: optional ThriftFacetCountMetadata metadata
+}(persisted='true')
+
+// List of facet labels and counts for a given facet field, the
+// total count for this field, and a quality score for this field
+struct ThriftFacetFieldResults {
+ 1: required list<ThriftFacetCount> topFacets
+ 2: required i32 totalCount
+ 3: optional double scoreQuality
+ 4: optional i32 totalScore
+ 5: optional i32 totalPenalty
+
+ // The ratio of the tweet language in the tweets with this facet field, a map from the language
+ // name to a number between (0.0, 1.0]. Only languages with a ratio higher than 0.1 will be included.
+ 6: optional map<string, double> languageHistogram
+}
+
+struct ThriftFacetResults {
+ 1: required map<string, ThriftFacetFieldResults> facetFields
+ 2: optional i32 backgroundNumHits
+ // optionally returns a list of user ids that should not get filtered
+ // out by things like antigaming filters, because these users were explicitly
+ // queried for
+ // Note that ThriftFacetCountMetadata already returns dontFilterUser
+ // for facet requests, in which case this list is not needed. However, it
+ // is needed for subsequent term statistics queries, where user id lookups
+ // are performed, but a different background query is used.
+ 3: optional set<i64> userIDWhitelist
+}
+
+struct ThriftSearchResults {
+ // Next available field ID: 23
+ 1: required list<ThriftSearchResult> results = []
+
+ // (SEARCH-11950): resultOffset is now deprecated, so numResultsSkipped is no longer used either.
+ 9: optional i32 deprecated_numResultsSkipped
+
+ // Number of docs that matched the query and were processed.
+ 7: optional i32 numHitsProcessed
+
+ // Range of status IDs searched, from max ID to min ID (both inclusive).
+ // These may be unset in case the search query contained ID or time
+ // operators that were completely out of range for the given index.
+ 10: optional i64 maxSearchedStatusID
+ 11: optional i64 minSearchedStatusID
+
+ // Time range that was searched (both inclusive).
+ 19: optional i32 maxSearchedTimeSinceEpoch
+ 20: optional i32 minSearchedTimeSinceEpoch
+
+ 12: optional ThriftSearchResultsRelevanceStats relevanceStats
+
+ // Overall quality of this search result set
+ 13: optional double score = -1.0
+ 18: optional double nsfwRatio = 0.0
+
+ // The count of hit documents in each language.
+ 14: optional map<search_language.ThriftLanguage, i32> languageHistogram
+
+ // Hit counts per time period:
+ // The key is a time cutoff in milliseconds (e.g. 60000 msecs ago).
+ // The value is the number of hits that are more recent than the cutoff.
+ 15: optional map<i64, i32> hitCounts
+
+ // the total cost for this query
+ 16: optional double queryCost
+
+ // Set to non-0 if this query was terminated early (either due to a timeout, or exceeded query cost).
+ // When getting this response from a single earlybird, this will be set to 1 if the query
+ // terminated early.
+ // When getting this response from a search root, this should be set to the number of individual
+ // earlybird requests that were terminated early.
+ 17: optional i32 numPartitionsEarlyTerminated
+
+ // If ThriftSearchResults returns features in features.ThriftSearchResultFeature format, this
+ // field defines the schema of the features.
+ // If the earlybird schema is already in the client cached schemas indicated in the request, then
+ // searchFeatureSchema would only have (version, checksum) information.
+ //
+ // Notice that earlybird root only sends one schema back to the superroot even though earlybird
+ // root might receive multiple versions of the schema.
+ //
+ // Earlybird roots' schema merge/choose logic when returning results to superroot:
+ // . pick the most common versioned schema and return the schema to the superroot
+ // . if the superroot already caches the schema, only send the version information back
+ //
+ // Superroots' schema merge/choose logic when returning results to clients:
+ // . pick the schema based on the order of: realtime > protected > archive
+ // . because of the above ordering, it is possible that an archive earlybird schema with a new flush
+ // version (with new bit features) might lose to an older realtime earlybird schema; this is
+ // considered to be rare and acceptable because one realtime earlybird deploy would fix it
+ 21: optional features.ThriftSearchFeatureSchema featureSchema
+
+ // How long it took to score the results in earlybird (in nanoseconds). The number of results
+ // that were scored should be set in numHitsProcessed.
+ // Expected to only be set for requests that actually do scoring (i.e. Relevance and TopTweets).
+ 22: optional i64 scoringTimeNanos
+
+ 8: optional i32 deprecated_numDocsProcessed
+}
+
+// Note: Earlybird no longer respects this field, as it does not contain statuses.
+// Blender should respect it.
+enum EarlybirdReturnStatusType {
+ NO_STATUS = 0,
+ // deprecated
+ DEPRECATED_BASIC_STATUS = 1,
+ // deprecated
+ DEPRECATED_SEARCH_STATUS = 2,
+ TWEETYPIE_STATUS = 3,
+
+ PLACE_HOLDER4 = 4,
+ PLACE_HOLDER5 = 5,
+}
+
+struct AdjustedRequestParams {
+ // Next available field ID: 4
+
+ // Adjusted value for EarlybirdRequest.searchQuery.numResults.
+ 1: optional i32 numResults
+
+ // Adjusted value for EarlybirdRequest.searchQuery.maxHitsToProcess and
+ // EarlybirdRequest.searchQuery.relevanceOptions.maxHitsToProcess.
+ 2: optional i32 maxHitsToProcess
+
+ // Adjusted value for EarlybirdRequest.searchQuery.relevanceOptions.returnAllResults
+ 3: optional bool returnAllResults
+}
+
+struct EarlybirdRequest {
+ // Next available field ID: 36
+
+ // -------- COMMON REQUEST OPTIONS --------
+ // These fields contain options respected by all kinds of earlybird requests.
+
+ // Search query containing general earlybird retrieval and hit collection options.
+ // Also contains the options specific to search requests.
+ 1: required ThriftSearchQuery searchQuery
+
+ // Common RPC information - client hostname and request ID.
+ 12: optional string clientHost
+ 13: optional string clientRequestID
+
+ // A string identifying the client that initiated the request.
+ // Ex: macaw-search.prod, webforall.prod, webforall.staging.
+ // The intention is to track the load we get from each client, and eventually enforce
+ // per-client QPS quotas, but this field could also be used to allow access to certain features
+ // only to certain clients, etc.
+ 21: optional string clientId
+
+ // The time (in millis since epoch) when the earlybird client issued this request.
+ // Can be used to estimate request timeout time, capturing in-transit time for the request.
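+ // For example (illustrative): with a 400ms client timeout, a request that arrives 300ms after
+ // clientRequestTimeMs was recorded has only ~100ms of processing budget left.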
+ 23: optional i64 clientRequestTimeMs
+
+ // Caching parameters used by earlybird roots.
+ 24: optional caching.CachingParams cachingParams
+
+ // Deprecated. See SEARCH-2784
+ // Earlybird requests will be early terminated in a best-effort way to prevent them from
+ // exceeding the given timeout. If timeout is <= 0 this early termination criterion is
+ // disabled.
+ 17: optional i32 timeoutMs = -1
+
+ // Deprecated. See SEARCH-2784
+ // Earlybird requests will be early terminated in a best-effort way to prevent them from
+ // exceeding the given query cost. If maxQueryCost <= 0 this early termination criterion
+ // is disabled.
+ 20: optional double maxQueryCost = -1
+
+
+ // -------- REQUEST-TYPE SPECIFIC OPTIONS --------
+ // These fields contain options for one specific kind of request. If one of these options
+ // is set the request will be considered to be the appropriate type of request.
+
+ // Options for facet counting requests.
+ 11: optional ThriftFacetRequest facetRequest
+
+ // Options for term statistics requests.
+ 14: optional ThriftTermStatisticsRequest termStatisticsRequest
+
+
+ // -------- DEBUG OPTIONS --------
+ // Used for debugging only.
+
+ // Debug mode, 0 for no debug information.
+ 15: optional i8 debugMode = 0
+
+ // Can be used to pass extra debug arguments to earlybird.
+ 34: optional EarlybirdDebugOptions debugOptions
+
+ // Searches a specific segment by time slice id if set and segment id is > 0.
+ 22: optional i64 searchSegmentId
+
+ // -------- THINGS USED ONLY BY THE BLENDER --------
+ // These fields are used by the blender and clients of the blender, but not by earlybird.
+
+ // Specifies what kind of status object to return, if any.
+ 7: optional EarlybirdReturnStatusType returnStatusType
+
+
+ // -------- THINGS USED BY THE ROOTS --------
+ // These fields are not in use by earlybirds themselves, but are in use by earlybird roots
+ // (and their clients).
+ // These fields live here since we currently reuse the same thrift request and response structs
+ // for both earlybirds and earlybird roots, and could potentially be moved out if we were to
+ // introduce separate request / response structs specifically for the roots.
+
+ // We have a threshold for how many hash partition requests need to succeed at the root level
+ // in order for the earlybird root request to be considered successful.
+ // Each type of earlybird query (e.g. relevance, or term statistics) has a predefined default
+ // threshold value (e.g. 90% of hash partitions need to succeed for a recency query).
+ // The client can optionally set the threshold value to be something other than the default,
+ // by setting this field to a value in the range of 0 (exclusive) to 1 (inclusive).
+ // If this value is set outside of the (0, 1] range, a CLIENT_ERROR EarlybirdResponseCode will
+ // be returned.
+ 25: optional double successfulResponseThreshold
+
+ // Where does the query come from?
+ 26: optional query.ThriftQuerySource querySource
+
+ // Whether to get archive results. This flag is advisory. A request may still be restricted from
+ // getting results from the archive based on the requesting client, query source, requested
+ // time/id range, etc.
+ 27: optional bool getOlderResults
+
+ // The list of users followed by the current user.
+ // Used to restrict the values in the fromUserIDFilter64 field when sending a request
+ // to the protected cluster.
+ 28: optional list<i64> followedUserIds
+
+ // The adjusted parameters for the protected request.
+ 29: optional AdjustedRequestParams adjustedProtectedRequestParams
+
+ // The adjusted parameters for the full archive request.
+ 30: optional AdjustedRequestParams adjustedFullArchiveRequestParams
+
+ // Return only the protected tweets. This flag is used by the SuperRoot to return relevance
+ // results that contain only protected tweets.
+ 31: optional bool getProtectedTweetsOnly
+
+ // Tokenize serialized queries with the appropriate Penguin version(s).
+ // Only has an effect on superroot.
+ 32: optional bool retokenizeSerializedQuery
+
+ // Flag to ignore tweets that are very recent and could be incompletely indexed.
+ // If false, will allow queries to see results that may violate implicit streaming
+ // guarantees and will search Tweets that have been partially indexed.
+ // See go/indexing-latency for more details. When enabled, prevents seeing tweets
+ // that are less than 15 seconds old (or a similarly configured threshold).
+ // Defaults to true; may be explicitly set to false.
+ 33: optional bool skipVeryRecentTweets = 1
+
+ // Setting an experimental cluster will reroute traffic at the realtime root layer to an experimental
+ // Earlybird cluster. This will have no impact if set on requests to anywhere other than realtime root.
+ 35: optional ExperimentCluster experimentClusterToUse
+
+ // Caps the number of results returned by roots after merging results from different earlybird partitions/clusters.
+ // If not set, ThriftSearchQuery.numResults or CollectorParams.numResultsToReturn will be used to cap results.
+ // This parameter will be ignored if ThriftRelevanceOptions.returnAllResults is set to true.
+ 36: optional i32 numResultsToReturnAtRoot
+}
+
+enum EarlybirdResponseCode {
+ SUCCESS = 0,
+ PARTITION_NOT_FOUND = 1,
+ PARTITION_DISABLED = 2,
+ TRANSIENT_ERROR = 3,
+ PERSISTENT_ERROR = 4,
+ CLIENT_ERROR = 5,
+ PARTITION_SKIPPED = 6,
+ // Request was queued up on the server for so long that it timed out, and was not
+ // executed at all.
+ SERVER_TIMEOUT_ERROR = 7,
+ TIER_SKIPPED = 8,
+ // Not enough partitions returned a successful response. The merged response will have partition
+ // counts and early termination info set, but will not have search results.
+ TOO_MANY_PARTITIONS_FAILED_ERROR = 9,
+ // Client went over its quota, and the request was throttled.
+ QUOTA_EXCEEDED_ERROR = 10,
+ // Client's request is blocked based on Search Infra's policy. Search Infra can block clients'
+ // requests based on the query source of the request.
+ REQUEST_BLOCKED_ERROR = 11,
+
+ CLIENT_CANCEL_ERROR = 12,
+
+ CLIENT_BLOCKED_BY_TIER_ERROR = 13,
+
+ PLACE_HOLDER_2015_09_21 = 14,
+}
+
+// A recorded request and response.
+struct EarlybirdRequestResponse {
+ // Where we sent this request to.
+ 1: optional string sentTo;
+ 2: optional EarlybirdRequest request;
+ // This can't be an EarlybirdResponse, because the thrift compiler for Python
+ // doesn't allow cyclic references and we have some Python utilities that will fail.
+ 3: optional string response;
+}
+
+struct EarlybirdDebugInfo {
+ 1: optional string host
+ 2: optional string parsedQuery
+ 3: optional string luceneQuery
+ // Requests sent to dependent services. For example, superroot sends to realtime root,
+ // archive root, etc.
+ 4: optional list<EarlybirdRequestResponse> sentRequests;
+ // segment level debug info (eg. hitsPerSegment, max/minSearchedTime etc.)
+ 5: optional list<string> collectorDebugInfo
+ 6: optional list<string> termStatisticsDebugInfo
+}
+
+struct EarlybirdDebugOptions {
+ 1: optional bool includeCollectorDebugInfo
+}
+
+struct TierResponse {
+ 1: optional EarlybirdResponseCode tierResponseCode
+ 2: optional i32 numPartitions
+ 3: optional i32 numSuccessfulPartitions
+}
+
+struct EarlybirdServerStats {
+ // The hostname of the Earlybird that processed this request.
+ 1: optional string hostname
+
+ // The partition to which this earlybird belongs.
+ 2: optional i32 partition
+
+ // Current Earlybird QPS.
+ // Earlybirds should set this field at the end of a request (not at the start). This would give
+ // roots a more up-to-date view of the load on the earlybirds.
+ 3: optional i64 currentQps
+
+ // The time the request waited in the queue before Earlybird started processing it.
+ // This does not include the time spent in the finagle queue: it's the time between the moment
+ // earlybird received the request, and the moment it started processing the request.
+ 4: optional i64 queueTimeMillis
+
+ // The average time requests waited in the queue before Earlybird started processing them.
+ // This does not include the time that requests spent in the finagle queue: it's the average time
+ // between the moment earlybird received its requests, and the moment it started processing them.
+ 5: optional i64 averageQueueTimeMillis
+
+ // Current average per-request latency as perceived by Earlybird.
+ 6: optional i64 averageLatencyMicros
+
+ // The tier to which this earlybird belongs.
+ 7: optional string tierName
+}
+
+struct EarlybirdResponse {
+ // Next available field ID: 17
+ 1: optional ThriftSearchResults searchResults
+ 5: optional ThriftFacetResults facetResults
+ 6: optional ThriftTermStatisticsResults termStatisticsResults
+ 2: required EarlybirdResponseCode responseCode
+ 3: required i64 responseTime
+ 7: optional i64 responseTimeMicros
+ // fields below will only be returned if debug > 1 in the request.
+ 4: optional string debugString
+ 8: optional EarlybirdDebugInfo debugInfo
+
+ // Only exists for merged earlybird response.
+ 10: optional i32 numPartitions
+ 11: optional i32 numSuccessfulPartitions
+ // Only exists for merged earlybird response from multiple tiers.
+ 13: optional list<TierResponse> perTierResponse
+
+ // Total number of segments that were searched. Partially searched segments are fully counted.
+ // e.g. if we searched 1 segment fully, and early terminated half way through the second
+ // segment, this field should be set to 2.
+ 15: optional i32 numSearchedSegments
+
+ // Whether the request early terminated and, if so, the termination reason.
+ 12: optional search.EarlyTerminationInfo earlyTerminationInfo
+
+ // Whether this response is from cache.
+ 14: optional bool cacheHit
+
+ // Stats used by roots to determine if we should go into degraded mode.
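+ // (For example, sustained growth in queueTimeMillis / averageQueueTimeMillis across partitions
+ // would suggest overloaded earlybirds; illustrative.)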
+ 16: optional EarlybirdServerStats earlybirdServerStats
+}
+
+enum EarlybirdStatusCode {
+ STARTING = 0,
+ CURRENT = 1,
+ STOPPING = 2,
+ UNHEALTHY = 3,
+ BLACKLISTED = 4,
+
+ PLACE_HOLDER5 = 5,
+ PLACE_HOLDER6 = 6,
+}
+
+struct EarlybirdStatusResponse {
+ 1: required EarlybirdStatusCode code
+ 2: required i64 aliveSince
+ 3: optional string message
+}
+
+service EarlybirdService {
+ string getName(),
+ EarlybirdStatusResponse getStatus(),
+ EarlybirdResponse search( 1: EarlybirdRequest request )
+}
diff --git a/src/thrift/com/twitter/simclusters_v2/abuse.thrift b/src/thrift/com/twitter/simclusters_v2/abuse.thrift
new file mode 100644
index 000000000..60043244b
--- /dev/null
+++ b/src/thrift/com/twitter/simclusters_v2/abuse.thrift
@@ -0,0 +1,53 @@
+namespace java com.twitter.simclusters_v2.thriftjava
+namespace py gen.twitter.simclusters_v2
+#@namespace scala com.twitter.simclusters_v2.thriftscala
+#@namespace strato com.twitter.simclusters_v2
+
+include "embedding.thrift"
+include "simclusters_presto.thrift"
+
+/**
+ * Struct that associates a user with simcluster scores for different
+ * interaction types. This is meant to be used as a feature to predict abuse.
+ *
+ * This thrift struct is meant for exploration purposes. It does not have any
+ * assumptions about what type of interactions we use or what types of scores
+ * we are keeping track of.
+ **/
+struct AdhocSingleSideClusterScores {
+ 1: required i64 userId(personalDataType = 'UserId')
+ // We can make the interaction types have arbitrary names. In the production
+ // version of this dataset, we should have a different field per interaction
+ // type, so that it is clearer what the API includes.
+ 2: required map<string, embedding.SimClustersEmbedding> interactionScores
+}(persisted="true", hasPersonalData = 'true')
+
+/**
+* This is a prod version of the single side features. It is meant to be used as a value in a key
+* value store. The pair of healthy and unhealthy scores will be different depending on the use case.
+* We will use different stores for different use cases. For instance, the first instance that
+* we implement will use search abuse reports and impressions. We can build stores for new values
+* in the future.
+*
+* The consumer creates the interactions which the author receives. For instance, the consumer
+* creates an abuse report for an author. The consumer scores are related to the interaction creation
+* behavior of the consumer. The author scores are related to whether the author receives these
+* interactions.
+*
+**/
+struct SingleSideUserScores {
+ 1: required i64 userId(personalDataType = 'UserId')
+ 2: required double consumerUnhealthyScore(personalDataType = 'EngagementScore')
+ 3: required double consumerHealthyScore(personalDataType = 'EngagementScore')
+ 4: required double authorUnhealthyScore(personalDataType = 'EngagementScore')
+ 5: required double authorHealthyScore(personalDataType = 'EngagementScore')
+}(persisted="true", hasPersonalData = 'true')
+
+/**
+* Struct that associates cluster-cluster interaction scores for different
+* interaction types.
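+* For example (illustrative): a row keyed by clusterId 17 could carry scores for how often
+* members of cluster 17 file abuse reports against, or serve impressions to, other clusters.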
+**/
+struct AdhocCrossSimClusterInteractionScores {
+ 1: required i64 clusterId
+ 2: required list clusterScores
+}(persisted="true")
diff --git a/src/thrift/com/twitter/simclusters_v2/embedding.thrift b/src/thrift/com/twitter/simclusters_v2/embedding.thrift
new file mode 100644
index 000000000..110da0c65
--- /dev/null
+++ b/src/thrift/com/twitter/simclusters_v2/embedding.thrift
@@ -0,0 +1,137 @@
+namespace java com.twitter.simclusters_v2.thriftjava
+namespace py gen.twitter.simclusters_v2.embedding
+#@namespace scala com.twitter.simclusters_v2.thriftscala
+#@namespace strato com.twitter.simclusters_v2
+
+include "com/twitter/simclusters_v2/identifier.thrift"
+include "com/twitter/simclusters_v2/online_store.thrift"
+
+struct SimClusterWithScore {
+ 1: required i32 clusterId(personalDataType = 'InferredInterests')
+ 2: required double score(personalDataType = 'EngagementScore')
+}(persisted = 'true', hasPersonalData = 'true')
+
+struct TopSimClustersWithScore {
+ 1: required list<SimClusterWithScore> topClusters
+ 2: required online_store.ModelVersion modelVersion
+}(persisted = 'true', hasPersonalData = 'true')
+
+struct InternalIdWithScore {
+ 1: required identifier.InternalId internalId
+ 2: required double score(personalDataType = 'EngagementScore')
+}(persisted = 'true', hasPersonalData = 'true')
+
+struct InternalIdEmbedding {
+ 1: required list<InternalIdWithScore> embedding
+}(persisted = 'true', hasPersonalData = 'true')
+
+struct SemanticCoreEntityWithScore {
+ 1: required i64 entityId(personalDataType = 'SemanticcoreClassification')
+ 2: required double score(personalDataType = 'EngagementScore')
+}(persisted = 'true', hasPersonalData = 'true')
+
+struct TopSemanticCoreEntitiesWithScore {
+ 1: required list<SemanticCoreEntityWithScore> topEntities
+}(persisted = 'true', hasPersonalData = 'true')
+
+struct PersistedFullClusterId {
+ 1: required online_store.ModelVersion modelVersion
+ 2: required i32 clusterId(personalDataType = 'InferredInterests')
+}(persisted = 'true', hasPersonalData = 'true')
+
+struct DayPartitionedClusterId {
+ 1: required i32 clusterId(personalDataType = 'InferredInterests')
+ 2: required string dayPartition // format: yyyy-MM-dd
+}
+
+struct TopProducerWithScore {
+ 1: required i64 userId(personalDataType = 'UserId')
+ 2: required double score(personalDataType = 'EngagementScore')
+}(persisted = 'true', hasPersonalData = 'true')
+
+struct TopProducersWithScore {
+ 1: required list<TopProducerWithScore> topProducers
+}(persisted = 'true', hasPersonalData = 'true')
+
+struct TweetWithScore {
+ 1: required i64 tweetId(personalDataType = 'TweetId')
+ 2: required double score(personalDataType = 'EngagementScore')
+}(persisted = 'true', hasPersonalData = 'true')
+
+struct TweetsWithScore {
+ 1: required list<TweetWithScore> tweets
+}(persisted = 'true', hasPersonalData = 'true')
+
+struct TweetTopKTweetsWithScore {
+ 1: required i64 tweetId(personalDataType = 'TweetId')
+ 2: required TweetsWithScore topkTweetsWithScore
+}(persisted = 'true', hasPersonalData = 'true')
+
+/**
+ * The generic SimClustersEmbedding for online long-term storage and real-time calculation.
+ * Use SimClustersEmbeddingId as the only identifier.
+ * Warning: Doesn't include model version and embedding type in the value struct.
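+ * For example (illustrative): a tweet embedding here is only a sparse list of
+ * (clusterId, score) pairs such as [(23, 0.7), (108, 0.2)]; the embedding type and model
+ * version must come from the associated SimClustersEmbeddingId.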
+ **/
+struct SimClustersEmbedding {
+ 1: required list<SimClusterWithScore> embedding
+}(persisted = 'true', hasPersonalData = 'true')
+
+struct SimClustersEmbeddingWithScore {
+ 1: required SimClustersEmbedding embedding
+ 2: required double score
+}(persisted = 'true', hasPersonalData = 'false')
+
+/**
+ * This is the recommended structure for aggregating embeddings with time decay - the metadata
+ * stores the information needed for decayed aggregation.
+ **/
+struct SimClustersEmbeddingWithMetadata {
+ 1: required SimClustersEmbedding embedding
+ 2: required SimClustersEmbeddingMetadata metadata
+}(hasPersonalData = 'true')
+
+struct SimClustersEmbeddingIdWithScore {
+ 1: required identifier.SimClustersEmbeddingId id
+ 2: required double score
+}(persisted = 'true', hasPersonalData = 'false')
+
+struct SimClustersMultiEmbeddingByValues {
+ 1: required list<SimClustersEmbeddingWithScore> embeddings
+}(persisted = 'true', hasPersonalData = 'false')
+
+struct SimClustersMultiEmbeddingByIds {
+ 1: required list<SimClustersEmbeddingIdWithScore> ids
+}(persisted = 'true', hasPersonalData = 'false')
+
+/**
+ * Generic SimClusters Multiple Embeddings. The identifier.SimClustersMultiEmbeddingId is the key of
+ * the multiple embedding.
+ **/
+union SimClustersMultiEmbedding {
+ 1: SimClustersMultiEmbeddingByValues values
+ 2: SimClustersMultiEmbeddingByIds ids
+}(persisted = 'true', hasPersonalData = 'false')
+
+/**
+ * The metadata of a SimClustersEmbedding. The updatedCount represents the version of the Embedding.
+ * For tweet embeddings, the updatedCount is the same as (or close to) the favorite count.
+ **/
+struct SimClustersEmbeddingMetadata {
+ 1: optional i64 updatedAtMs
+ 2: optional i64 updatedCount
+}(persisted = 'true', hasPersonalData = 'true')
+
+/**
+ * The data structure for PersistentSimClustersEmbedding Store
+ **/
+struct PersistentSimClustersEmbedding {
+ 1: required SimClustersEmbedding embedding
+ 2: required SimClustersEmbeddingMetadata metadata
+}(persisted = 'true', hasPersonalData = 'true')
+
+/**
+ * The data structure for the Multi Model PersistentSimClustersEmbedding Store
+ **/
+struct MultiModelPersistentSimClustersEmbedding {
+ 1: required map<online_store.ModelVersion, PersistentSimClustersEmbedding> multiModelPersistentSimClustersEmbedding
+}(persisted = 'true', hasPersonalData = 'true')
diff --git a/src/thrift/com/twitter/simclusters_v2/evaluation.thrift b/src/thrift/com/twitter/simclusters_v2/evaluation.thrift
new file mode 100644
index 000000000..85414baf9
--- /dev/null
+++ b/src/thrift/com/twitter/simclusters_v2/evaluation.thrift
@@ -0,0 +1,65 @@
+namespace java com.twitter.simclusters_v2.thriftjava
+namespace py gen.twitter.simclusters_v2.evaluation
+#@namespace scala com.twitter.simclusters_v2.thriftscala
+#@namespace strato com.twitter.simclusters_v2
+
+/**
+ * Surface area at which the reference tweet was displayed to the user
+ **/
+enum DisplayLocation {
+ TimelinesRecap = 1,
+ TimelinesRectweet = 2
+}(hasPersonalData = 'false')
+
+struct TweetLabels {
+ 1: required bool isClicked = false(personalDataType = 'EngagementsPrivate')
+ 2: required bool isLiked = false(personalDataType = 'EngagementsPublic')
+ 3: required bool isRetweeted = false(personalDataType = 'EngagementsPublic')
+ 4: required bool isQuoted = false(personalDataType = 'EngagementsPublic')
+ 5: required bool isReplied = false(personalDataType = 'EngagementsPublic')
+}(persisted = 'true', hasPersonalData = 'true')
+
+/**
+ * Data container of a reference tweet with scribed user engagement labels
+ */
+struct ReferenceTweet {
+ 1: required i64 tweetId(personalDataType = 'TweetId')
+ 2: required i64 authorId(personalDataType = 'UserId')
+ 3: required i64 timestamp(personalDataType = 'PublicTimestamp')
+ 4: required DisplayLocation displayLocation
+ 5: required TweetLabels labels
+}(persisted="true", hasPersonalData = 'true')
+
+/**
+ * Data container of a candidate tweet generated by the candidate algorithm
+ */
+struct CandidateTweet {
+ 1: required i64 tweetId(personalDataType = 'TweetId')
+ 2: optional double score(personalDataType = 'EngagementScore')
+ // The timestamp here is a synthetically generated timestamp,
+ // for evaluation purposes. Hence it is left unannotated.
+ 3: optional i64 timestamp
+}(hasPersonalData = 'true')
+
+/**
+ * An encapsulated collection of candidate tweets
+ **/
+struct CandidateTweets {
+ 1: required i64 targetUserId(personalDataType = 'UserId')
+ 2: required list<CandidateTweet> recommendedTweets
+}(hasPersonalData = 'true')
+
+/**
+ * An encapsulated collection of reference tweets
+ **/
+struct ReferenceTweets {
+ 1: required i64 targetUserId(personalDataType = 'UserId')
+ 2: required list<ReferenceTweet> impressedTweets
+}(persisted="true", hasPersonalData = 'true')
+
+/**
+ * A list of candidate tweets
+ **/
+struct CandidateTweetsList {
+ 1: required list<CandidateTweet> recommendedTweets
+}(hasPersonalData = 'true')
\ No newline at end of file
diff --git a/src/thrift/com/twitter/simclusters_v2/identifier.thrift b/src/thrift/com/twitter/simclusters_v2/identifier.thrift
new file mode 100644
index 000000000..b4285e699
--- /dev/null
+++ b/src/thrift/com/twitter/simclusters_v2/identifier.thrift
@@ -0,0 +1,205 @@
+namespace java com.twitter.simclusters_v2.thriftjava
+namespace py gen.twitter.simclusters_v2.identifier
+#@namespace scala com.twitter.simclusters_v2.thriftscala
+#@namespace strato com.twitter.simclusters_v2
+
+include "com/twitter/simclusters_v2/online_store.thrift"
+
+/**
+ * The uniform type for SimClusters Embeddings.
+ * Each embedding has the same uniform underlying storage.
+ * Warning: Every EmbeddingType should map to one and only one InternalId.
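+ * For example (illustrative of the one-to-one mapping): tweet embedding types such as
+ * LogFavBasedTweet pair with a tweetId InternalId, while producer embedding types such as
+ * FavBasedProducer pair with a userId.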
+ **/
+enum EmbeddingType {
+ // Reserve 001 - 99 for Tweet embeddings
+ FavBasedTweet = 1, // Deprecated
+ FollowBasedTweet = 2, // Deprecated
+ LogFavBasedTweet = 3, // Production Version
+ FavBasedTwistlyTweet = 10, // Deprecated
+ LogFavBasedTwistlyTweet = 11, // Deprecated
+ LogFavLongestL2EmbeddingTweet = 12, // Production Version
+
+ // Tweet embeddings generated from non-fav events
+ // Naming convention: {Event}{Score}BasedTweet
+ // {Event}: The interaction event we use to build the tweet embeddings
+ // {Score}: The score from user InterestedIn embeddings
+ VideoPlayBack50LogFavBasedTweet = 21,
+ RetweetLogFavBasedTweet = 22,
+ ReplyLogFavBasedTweet = 23,
+ PushOpenLogFavBasedTweet = 24,
+
+ // [Experimental] Offline generated FavThroughRate-based Tweet Embedding
+ Pop1000RankDecay11Tweet = 30,
+ Pop10000RankDecay11Tweet = 31,
+ OonPop1000RankDecayTweet = 32,
+
+ // [Experimental] Offline generated production-like LogFavScore-based Tweet Embedding
+ OfflineGeneratedLogFavBasedTweet = 40,
+
+ // Reserve 51-59 for Ads Embedding
+ LogFavBasedAdsTweet = 51, // Experimental embedding for ads tweet candidate
+ LogFavClickBasedAdsTweet = 52, // Experimental embedding for ads tweet candidate
+
+ // Reserve 60-69 for Evergreen content
+ LogFavBasedEvergreenTweet = 60,
+ LogFavBasedRealTimeTweet = 65,
+
+ // Reserve 101 to 149 for Semantic Core Entity embeddings
+ FavBasedSematicCoreEntity = 101, // Deprecated
+ FollowBasedSematicCoreEntity = 102, // Deprecated
+ FavBasedHashtagEntity = 103, // Deprecated
+ FollowBasedHashtagEntity = 104, // Deprecated
+ ProducerFavBasedSemanticCoreEntity = 105, // Deprecated
+ ProducerFollowBasedSemanticCoreEntity = 106, // Deprecated
+ FavBasedLocaleSemanticCoreEntity = 107, // Deprecated
+ FollowBasedLocaleSemanticCoreEntity = 108, // Deprecated
+ LogFavBasedLocaleSemanticCoreEntity = 109, // Deprecated
+ LanguageFilteredProducerFavBasedSemanticCoreEntity = 110, // Deprecated
+ LanguageFilteredFavBasedLocaleSemanticCoreEntity = 111, // Deprecated
+ FavTfgTopic = 112, // TFG topic embedding built from fav-based user interestedIn
+ LogFavTfgTopic = 113, // TFG topic embedding built from logfav-based user interestedIn
+ FavInferredLanguageTfgTopic = 114, // TFG topic embedding built using inferred consumed languages
+ FavBasedKgoApeTopic = 115, // topic embedding using fav-based aggregatable producer embedding of KGO seed accounts.
+ LogFavBasedKgoApeTopic = 116, // topic embedding using log fav-based aggregatable producer embedding of KGO seed accounts.
+ FavBasedOnboardingApeTopic = 117, // topic embedding using fav-based aggregatable producer embedding of onboarding seed accounts.
+ LogFavBasedOnboardingApeTopic = 118, // topic embedding using log fav-based aggregatable producer embedding of onboarding seed accounts.
+ LogFavApeBasedMuseTopic = 119, // Deprecated
+ LogFavApeBasedMuseTopicExperiment = 120 // Deprecated
+
+ // Reserved 201 - 299 for Producer embeddings (KnownFor)
+ FavBasedProducer = 201
+ FollowBasedProducer = 202
+ AggregatableFavBasedProducer = 203 // fav-based aggregatable producer embedding.
+ AggregatableLogFavBasedProducer = 204 // logfav-based aggregatable producer embedding.
+ RelaxedAggregatableLogFavBasedProducer = 205 // logfav-based aggregatable producer embedding.
+ AggregatableFollowBasedProducer = 206 // follow-based aggregatable producer embedding.
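+
+ // Presumably the clusters a producer is known for, i.e. the base SimClusters assignment
+ // from which the producer embeddings above are derived.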
+ KnownFor = 300
+
+ // Reserved 301 - 399 for User InterestedIn embeddings
+ FavBasedUserInterestedIn = 301
+ FollowBasedUserInterestedIn = 302
+ LogFavBasedUserInterestedIn = 303
+ RecentFollowBasedUserInterestedIn = 304 // interested-in embedding based on aggregating producer embeddings of recent follows
+ FilteredUserInterestedIn = 305 // interested-in embedding used by twistly read path
+ LogFavBasedUserInterestedInFromAPE = 306
+ FollowBasedUserInterestedInFromAPE = 307
+ TwiceUserInterestedIn = 308 // interested-in multi-embedding based on clustering producer embeddings of neighbors
+ UnfilteredUserInterestedIn = 309
+ UserNextInterestedIn = 310 // next interested-in embedding generated from BeT
+
+ // Denser User InterestedIn, generated by Producer embeddings.
+ FavBasedUserInterestedInFromPE = 311
+ FollowBasedUserInterestedInFromPE = 312
+ LogFavBasedUserInterestedInFromPE = 313
+ FilteredUserInterestedInFromPE = 314 // interested-in embedding used by twistly read path
+
+ // [Experimental] Denser User InterestedIn, generated by aggregating IIAPE embedding from AddressBook
+ LogFavBasedUserInterestedMaxpoolingAddressBookFromIIAPE = 320
+ LogFavBasedUserInterestedAverageAddressBookFromIIAPE = 321
+ LogFavBasedUserInterestedBooktypeMaxpoolingAddressBookFromIIAPE = 322
+ LogFavBasedUserInterestedLargestDimMaxpoolingAddressBookFromIIAPE = 323
+ LogFavBasedUserInterestedLouvainMaxpoolingAddressBookFromIIAPE = 324
+ LogFavBasedUserInterestedConnectedMaxpoolingAddressBookFromIIAPE = 325
+
+ // Reserved 401 - 500 for Space embedding
+ FavBasedApeSpace = 401 // DEPRECATED
+ LogFavBasedListenerSpace = 402 // DEPRECATED
+ LogFavBasedAPESpeakerSpace = 403 // DEPRECATED
+ LogFavBasedUserInterestedInListenerSpace = 404 // DEPRECATED
+
+ // Experimental, internal-only IDs
+ ExperimentalThirtyDayRecentFollowBasedUserInterestedIn = 10000 // Like RecentFollowBasedUserInterestedIn, except limited to last 30 days
+ ExperimentalLogFavLongestL2EmbeddingTweet = 10001 // DEPRECATED
+}(persisted = 'true', hasPersonalData = 'false')
+
+/**
+ * The uniform type for SimClusters MultiEmbeddings.
+ * Warning: Every MultiEmbeddingType should map to one and only one InternalId.
+ **/
+enum MultiEmbeddingType {
+ // Reserved 0-99 for Tweet based MultiEmbedding
+
+ // Reserved 100 - 199 for Topic based MultiEmbedding
+ LogFavApeBasedMuseTopic = 100 // Deprecated
+ LogFavApeBasedMuseTopicExperiment = 101 // Deprecated
+
+ // Reserved 301 - 399 for User InterestedIn embeddings
+ TwiceUserInterestedIn = 301 // interested-in multi-embedding based on clustering producer embeddings of neighbors
+}(persisted = 'true', hasPersonalData = 'true')
+
+// Deprecated. Please use TopicId for future cases.
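+// (LocaleEntityId carries an entity id plus a language; TopicId below additionally supports an
+// optional country code.)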
+struct LocaleEntityId {
+ 1: i64 entityId
+ 2: string language
+}(persisted = 'true', hasPersonalData = 'false')
+
+enum EngagementType {
+ Favorite = 1,
+ Retweet = 2,
+}
+
+struct UserEngagedTweetId {
+ 1: i64 tweetId(personalDataType = 'TweetId')
+ 2: i64 userId(personalDataType = 'UserId')
+ 3: EngagementType engagementType(personalDataType = 'EventType')
+}(persisted = 'true', hasPersonalData = 'true')
+
+struct TopicId {
+ 1: i64 entityId (personalDataType = 'SemanticcoreClassification')
+ // 2-letter ISO 639-1 language code
+ 2: optional string language
+ // 2-letter ISO 3166-1 alpha-2 country code
+ 3: optional string country
+}(persisted = 'true', hasPersonalData = 'false')
+
+struct TopicSubId {
+ 1: i64 entityId (personalDataType = 'SemanticcoreClassification')
+ // 2-letter ISO 639-1 language code
+ 2: optional string language
+ // 2-letter ISO 3166-1 alpha-2 country code
+ 3: optional string country
+ 4: i32 subId
+}(persisted = 'true', hasPersonalData = 'true')
+
+// Will be used for testing purposes in DDG 15536, 15534
+struct UserWithLanguageId {
+ 1: required i64 userId(personalDataType = 'UserId')
+ 2: optional string langCode(personalDataType = 'InferredLanguage')
+}(persisted = 'true', hasPersonalData = 'true')
+
+/**
+ * The internal identifier type.
+ * Need to add ordering in [[com.twitter.simclusters_v2.common.SimClustersEmbeddingId]]
+ * when adding a new type.
+ **/
+union InternalId {
+ 1: i64 tweetId(personalDataType = 'TweetId')
+ 2: i64 userId(personalDataType = 'UserId')
+ 3: i64 entityId(personalDataType = 'SemanticcoreClassification')
+ 4: string hashtag(personalDataType = 'PublicTweetEntitiesAndMetadata')
+ 5: i32 clusterId
+ 6: LocaleEntityId localeEntityId(personalDataType = 'SemanticcoreClassification')
+ 7: UserEngagedTweetId userEngagedTweetId
+ 8: TopicId topicId
+ 9: TopicSubId topicSubId
+ 10: string spaceId
+ 11: UserWithLanguageId userWithLanguageId
+}(persisted = 'true', hasPersonalData = 'true')
+
+/**
+ * A uniform identifier type for all kinds of SimClusters based embeddings.
+ **/
+struct SimClustersEmbeddingId {
+ 1: required EmbeddingType embeddingType
+ 2: required online_store.ModelVersion modelVersion
+ 3: required InternalId internalId
+}(persisted = 'true', hasPersonalData = 'true')
+
+/**
+ * A uniform identifier type for multiple SimClusters embeddings
+ **/
+struct SimClustersMultiEmbeddingId {
+ 1: required MultiEmbeddingType embeddingType
+ 2: required online_store.ModelVersion modelVersion
+ 3: required InternalId internalId
+}(persisted = 'true', hasPersonalData = 'true')
diff --git a/timelineranker/README.md b/timelineranker/README.md
new file mode 100644
index 000000000..72b9226db
--- /dev/null
+++ b/timelineranker/README.md
@@ -0,0 +1,13 @@
+# TimelineRanker
+
+**TimelineRanker** (TLR) is a legacy service that provides relevance-scored tweets from the Earlybird Search Index and User Tweet Entity Graph (UTEG) service. Despite its name, it no longer performs heavy ranking or model-based ranking itself; it only uses relevance scores from the Search Index for ranked tweet endpoints.
+
+The following is a list of major services that Timeline Ranker interacts with:
+
+- **Earlybird-root-superroot (a.k.a. Search):** Timeline Ranker calls the Search Index's super root to fetch a list of Tweets.
+- **User Tweet Entity Graph (UTEG):** Timeline Ranker calls UTEG to fetch a list of tweets liked by the users you follow.
+- **Socialgraph:** Timeline Ranker calls Social Graph Service to obtain the follow graph and user states such as blocked, muted, retweets muted, etc.
+- **TweetyPie:** Timeline Ranker hydrates tweets by calling TweetyPie to post-filter tweets based on certain hydrated fields.
+- **Manhattan:** Timeline Ranker hydrates some tweet features (e.g., user languages) from Manhattan.
+
+**Home Mixer** calls Timeline Ranker to fetch tweets from the Earlybird Search Index and User Tweet Entity Graph (UTEG) service to power both the For You and Following Home Timelines. Timeline Ranker performs light ranking based on Earlybird tweet candidate scores and truncates to the number of candidates requested by Home Mixer based on these scores.
diff --git a/trust_and_safety_models/README.md b/trust_and_safety_models/README.md
new file mode 100644
index 000000000..c16de2d3d
--- /dev/null
+++ b/trust_and_safety_models/README.md
@@ -0,0 +1,10 @@
+Trust and Safety Models
+=======================
+
+We decided to open source the training code of the following models:
+- pNSFWMedia: Model to detect tweets with NSFW images. This includes adult and porn content.
+- pNSFWText: Model to detect tweets with NSFW text, adult/sexual topics.
+- pToxicity: Model to detect toxic tweets. Toxicity includes marginal content like insults and certain types of harassment. Toxic content does not violate Twitter's terms of service.
+- pAbuse: Model to detect abusive content. This includes violations of Twitter's terms of service, including hate speech, targeted harassment and abusive behavior.
+
+We have several more models and rules that we are not going to open source at this time because of the adversarial nature of this area. The team is considering open sourcing more models going forward and will keep the community posted accordingly.
diff --git a/twml/README.md b/twml/README.md
index df7a10328..b2b315b45 100644
--- a/twml/README.md
+++ b/twml/README.md
@@ -1,7 +1,7 @@
 # TWML
 
 ---
-Note: `twml` is no longer under development. Much of the code here is not out of date and unused.
+Note: `twml` is no longer under development. Much of the code here is out of date and unused.
 It is included here for completeness, because `twml` is still used to train the light ranker models
 (see `src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/README.md`)
 ---
@@ -10,4 +10,4 @@ TWML is one of Twitter's machine learning frameworks, which uses Tensorflow unde
 deprecated, it is still currently used to train the Earlybird light ranking models (
 see `src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/train.py`).
 
-The most relevant part of this is the `DataRecordTrainer` class, which is where the core training logic resides.
\ No newline at end of file
+The most relevant part of this is the `DataRecordTrainer` class, which is where the core training logic resides.
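+
+As a rough sketch of how `DataRecordTrainer` is typically wired up (the constructor parameters and
+method names below are assumptions for illustration, not the verified `twml` API; see the
+`train.py` referenced above for the real entry point):
+
+```python
+# Hypothetical sketch, not the exact twml API.
+from twml.trainers import DataRecordTrainer
+
+trainer = DataRecordTrainer(
+    name="earlybird_light_ranker",  # assumed model name
+    params=params,                  # parsed hyperparameters / CLI args
+    build_graph_fn=build_graph,     # builds the TensorFlow v1 train/eval graph
+    feature_config=feature_config,  # declares which DataRecord features to decode
+)
+trainer.train(input_fn=train_input_fn)  # input_fn yields batches of DataRecords
+```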
diff --git a/visibilitylib/src/main/resources/config/com/twitter/visibility/decider.yml b/visibilitylib/src/main/resources/config/com/twitter/visibility/decider.yml
index c2c8f8a9a..54b5edcba 100644
--- a/visibilitylib/src/main/resources/config/com/twitter/visibility/decider.yml
+++ b/visibilitylib/src/main/resources/config/com/twitter/visibility/decider.yml
@@ -494,6 +494,9 @@ visibility_library_enable_trends_representative_tweet_safety_level:
   default_availability: 10000
 
 visibility_library_enable_trusted_friends_user_list_safety_level:
   default_availability: 10000
 
+visibility_library_enable_twitter_delegate_user_list_safety_level:
+  default_availability: 10000
+
 visibility_library_enable_tweet_detail_safety_level:
   default_availability: 10000
 
@@ -758,7 +761,7 @@ visibility_library_enable_short_circuiting_from_blender_visibility_library:
   default_availability: 0
 
 visibility_library_enable_short_circuiting_from_search_visibility_library:
   default_availability: 0
 
-visibility_library_enable_nsfw_text_topics_drop_rule:
+visibility_library_enable_nsfw_text_high_precision_drop_rule:
   default_availability: 10000
 
 visibility_library_enable_spammy_tweet_rule_verdict_logging:
diff --git a/visibilitylib/src/main/scala/com/twitter/visibility/configapi/configs/DeciderKey.scala b/visibilitylib/src/main/scala/com/twitter/visibility/configapi/configs/DeciderKey.scala
index 9fefb4154..58331779c 100644
--- a/visibilitylib/src/main/scala/com/twitter/visibility/configapi/configs/DeciderKey.scala
+++ b/visibilitylib/src/main/scala/com/twitter/visibility/configapi/configs/DeciderKey.scala
@@ -535,6 +535,9 @@ private[visibility] object DeciderKey extends DeciderKeyEnum {
   val EnableTrustedFriendsUserListSafetyLevel: Value = Value(
     "visibility_library_enable_trusted_friends_user_list_safety_level"
   )
+  val EnableTwitterDelegateUserListSafetyLevel: Value = Value(
+    "visibility_library_enable_twitter_delegate_user_list_safety_level"
+  )
   val EnableTweetDetailSafetyLevel: Value = Value(
     "visibility_library_enable_tweet_detail_safety_level"
   )
@@ -869,8 +872,8 @@ private[visibility] object DeciderKey extends DeciderKeyEnum {
     "visibility_library_enable_short_circuiting_from_search_visibility_library"
   )
 
-  val EnableNsfwTextTopicsDropRule: Value = Value(
-    "visibility_library_enable_nsfw_text_topics_drop_rule"
+  val EnableNsfwTextHighPrecisionDropRule: Value = Value(
+    "visibility_library_enable_nsfw_text_high_precision_drop_rule"
   )
 
   val EnableSpammyTweetRuleVerdictLogging: Value = Value(
diff --git a/visibilitylib/src/main/scala/com/twitter/visibility/configapi/configs/VisibilityDeciders.scala b/visibilitylib/src/main/scala/com/twitter/visibility/configapi/configs/VisibilityDeciders.scala
index cc78fdb7e..e359d443d 100644
--- a/visibilitylib/src/main/scala/com/twitter/visibility/configapi/configs/VisibilityDeciders.scala
+++ b/visibilitylib/src/main/scala/com/twitter/visibility/configapi/configs/VisibilityDeciders.scala
@@ -198,6 +198,7 @@ private[visibility] object VisibilityDeciders {
     TopicRecommendations -> DeciderKey.EnableTopicRecommendationsSafetyLevel,
     TrendsRepresentativeTweet -> DeciderKey.EnableTrendsRepresentativeTweetSafetyLevel,
     TrustedFriendsUserList -> DeciderKey.EnableTrustedFriendsUserListSafetyLevel,
+    TwitterDelegateUserList -> DeciderKey.EnableTwitterDelegateUserListSafetyLevel,
     TweetDetail -> DeciderKey.EnableTweetDetailSafetyLevel,
     TweetDetailNonToo -> DeciderKey.EnableTweetDetailNonTooSafetyLevel,
     TweetEngagers -> DeciderKey.EnableTweetEngagersSafetyLevel,
@@ -287,7 +288,7 @@ private[visibility] object VisibilityDeciders {
RuleParams.EnableDropAllTrustedFriendsTweetsRuleParam -> DeciderKey.EnableDropAllTrustedFriendsTweetsRule, RuleParams.EnableDropTrustedFriendsTweetContentRuleParam -> DeciderKey.EnableDropTrustedFriendsTweetContentRule, RuleParams.EnableDropAllCollabInvitationTweetsRuleParam -> DeciderKey.EnableDropCollabInvitationTweetsRule, - RuleParams.EnableNsfwTextTopicsDropRuleParam -> DeciderKey.EnableNsfwTextTopicsDropRule, + RuleParams.EnableNsfwTextHighPrecisionDropRuleParam -> DeciderKey.EnableNsfwTextHighPrecisionDropRule, RuleParams.EnableLikelyIvsUserLabelDropRule -> DeciderKey.EnableLikelyIvsUserLabelDropRule, RuleParams.EnableCardUriRootDomainCardDenylistRule -> DeciderKey.EnableCardUriRootDomainDenylistRule, RuleParams.EnableCommunityNonMemberPollCardRule -> DeciderKey.EnableCommunityNonMemberPollCardRule, diff --git a/visibilitylib/src/main/scala/com/twitter/visibility/configapi/params/RuleParams.scala b/visibilitylib/src/main/scala/com/twitter/visibility/configapi/params/RuleParams.scala index 44c7797b9..a4e28e690 100644 --- a/visibilitylib/src/main/scala/com/twitter/visibility/configapi/params/RuleParams.scala +++ b/visibilitylib/src/main/scala/com/twitter/visibility/configapi/params/RuleParams.scala @@ -85,7 +85,7 @@ private[visibility] object RuleParams { object EnableDropAllCollabInvitationTweetsRuleParam extends RuleParam(false) - object EnableNsfwTextTopicsDropRuleParam extends RuleParam(false) + object EnableNsfwTextHighPrecisionDropRuleParam extends RuleParam(false) object EnableLikelyIvsUserLabelDropRule extends RuleParam(false) diff --git a/visibilitylib/src/main/scala/com/twitter/visibility/configapi/params/SafetyLevelParams.scala b/visibilitylib/src/main/scala/com/twitter/visibility/configapi/params/SafetyLevelParams.scala index a8c7d9f51..ae54ffd34 100644 --- a/visibilitylib/src/main/scala/com/twitter/visibility/configapi/params/SafetyLevelParams.scala +++ b/visibilitylib/src/main/scala/com/twitter/visibility/configapi/params/SafetyLevelParams.scala @@ -186,6 +186,7 @@ private[visibility] object SafetyLevelParams { object EnableTopicRecommendationsSafetyLevelParam extends SafetyLevelParam(false) object EnableTrendsRepresentativeTweetSafetyLevelParam extends SafetyLevelParam(false) object EnableTrustedFriendsUserListSafetyLevelParam extends SafetyLevelParam(false) + object EnableTwitterDelegateUserListSafetyLevelParam extends SafetyLevelParam(false) object EnableTweetDetailSafetyLevelParam extends SafetyLevelParam(false) object EnableTweetDetailNonTooSafetyLevelParam extends SafetyLevelParam(false) object EnableTweetDetailWithInjectionsHydrationSafetyLevelParam extends SafetyLevelParam(false) diff --git a/visibilitylib/src/main/scala/com/twitter/visibility/engine/VisibilityRuleEngine.scala b/visibilitylib/src/main/scala/com/twitter/visibility/engine/VisibilityRuleEngine.scala index 6043f3649..d1c33017b 100644 --- a/visibilitylib/src/main/scala/com/twitter/visibility/engine/VisibilityRuleEngine.scala +++ b/visibilitylib/src/main/scala/com/twitter/visibility/engine/VisibilityRuleEngine.scala @@ -143,7 +143,7 @@ class VisibilityRuleEngine private[VisibilityRuleEngine] ( builder.withRuleResult(rule, RuleResult(builder.verdict, ShortCircuited)) } else { - if (rule.fallbackActionBuilder.nonEmpty) { + if (failedFeatureDependencies.nonEmpty && rule.fallbackActionBuilder.nonEmpty) { metricsRecorder.recordRuleFallbackAction(rule.name) } diff --git a/visibilitylib/src/main/scala/com/twitter/visibility/models/SafetyLevel.scala 
b/visibilitylib/src/main/scala/com/twitter/visibility/models/SafetyLevel.scala index 9042b9328..805b17497 100644 --- a/visibilitylib/src/main/scala/com/twitter/visibility/models/SafetyLevel.scala +++ b/visibilitylib/src/main/scala/com/twitter/visibility/models/SafetyLevel.scala @@ -194,6 +194,7 @@ object SafetyLevel { ThriftSafetyLevel.TopicsLandingPageTopicRecommendations -> TopicsLandingPageTopicRecommendations, ThriftSafetyLevel.TrendsRepresentativeTweet -> TrendsRepresentativeTweet, ThriftSafetyLevel.TrustedFriendsUserList -> TrustedFriendsUserList, + ThriftSafetyLevel.TwitterDelegateUserList -> TwitterDelegateUserList, ThriftSafetyLevel.GryphonDecksAndColumns -> GryphonDecksAndColumns, ThriftSafetyLevel.TweetDetail -> TweetDetail, ThriftSafetyLevel.TweetDetailNonToo -> TweetDetailNonToo, @@ -772,6 +773,9 @@ object SafetyLevel { case object TrustedFriendsUserList extends SafetyLevel { override val enabledParam: SafetyLevelParam = EnableTrustedFriendsUserListSafetyLevelParam } + case object TwitterDelegateUserList extends SafetyLevel { + override val enabledParam: SafetyLevelParam = EnableTwitterDelegateUserListSafetyLevelParam + } case object TweetDetail extends SafetyLevel { override val enabledParam: SafetyLevelParam = EnableTweetDetailSafetyLevelParam } diff --git a/visibilitylib/src/main/scala/com/twitter/visibility/models/SafetyLevelGroup.scala b/visibilitylib/src/main/scala/com/twitter/visibility/models/SafetyLevelGroup.scala index e60daefd1..a9ebfa85c 100644 --- a/visibilitylib/src/main/scala/com/twitter/visibility/models/SafetyLevelGroup.scala +++ b/visibilitylib/src/main/scala/com/twitter/visibility/models/SafetyLevelGroup.scala @@ -379,13 +379,6 @@ object SafetyLevelGroup { ) } - case object ProfileMixer extends SafetyLevelGroup { - override val levels: Set[SafetyLevel] = Set( - ProfileMixerMedia, - ProfileMixerFavorites, - ) - } - case object Reactions extends SafetyLevelGroup { override val levels: Set[SafetyLevel] = Set( SignalsReactions, @@ -516,6 +509,10 @@ object SafetyLevelGroup { SafetyLevel.TimelineProfile, TimelineProfileAll, TimelineProfileSpaces, + TimelineMedia, + ProfileMixerMedia, + TimelineFavorites, + ProfileMixerFavorites ) } diff --git a/visibilitylib/src/main/scala/com/twitter/visibility/models/SpaceSafetyLabelType.scala b/visibilitylib/src/main/scala/com/twitter/visibility/models/SpaceSafetyLabelType.scala index 432650dfd..bab719e21 100644 --- a/visibilitylib/src/main/scala/com/twitter/visibility/models/SpaceSafetyLabelType.scala +++ b/visibilitylib/src/main/scala/com/twitter/visibility/models/SpaceSafetyLabelType.scala @@ -36,8 +36,8 @@ object SpaceSafetyLabelType extends SafetyLabelType { s.SpaceSafetyLabelType.HatefulHighRecall -> HatefulHighRecall, s.SpaceSafetyLabelType.ViolenceHighRecall -> ViolenceHighRecall, s.SpaceSafetyLabelType.HighToxicityModelScore -> HighToxicityModelScore, - s.SpaceSafetyLabelType.UkraineCrisisTopic -> UkraineCrisisTopic, - s.SpaceSafetyLabelType.DoNotPublicPublish -> DoNotPublicPublish, + s.SpaceSafetyLabelType.DeprecatedSpaceSafetyLabel14 -> Deprecated, + s.SpaceSafetyLabelType.DeprecatedSpaceSafetyLabel15 -> Deprecated, s.SpaceSafetyLabelType.Reserved16 -> Deprecated, s.SpaceSafetyLabelType.Reserved17 -> Deprecated, s.SpaceSafetyLabelType.Reserved18 -> Deprecated, @@ -69,10 +69,6 @@ object SpaceSafetyLabelType extends SafetyLabelType { case object ViolenceHighRecall extends SpaceSafetyLabelType case object HighToxicityModelScore extends SpaceSafetyLabelType - case object UkraineCrisisTopic extends SpaceSafetyLabelType - 
- case object DoNotPublicPublish extends SpaceSafetyLabelType - case object Deprecated extends SpaceSafetyLabelType case object Unknown extends SpaceSafetyLabelType diff --git a/visibilitylib/src/main/scala/com/twitter/visibility/rules/FreedomOfSpeechNotReach.scala b/visibilitylib/src/main/scala/com/twitter/visibility/rules/FreedomOfSpeechNotReach.scala index ba2861e60..03e094025 100644 --- a/visibilitylib/src/main/scala/com/twitter/visibility/rules/FreedomOfSpeechNotReach.scala +++ b/visibilitylib/src/main/scala/com/twitter/visibility/rules/FreedomOfSpeechNotReach.scala @@ -3,6 +3,7 @@ package com.twitter.visibility.rules import com.twitter.spam.rtf.thriftscala.SafetyResultReason import com.twitter.util.Memoize import com.twitter.visibility.common.actions.AppealableReason +import com.twitter.visibility.common.actions.AvoidReason.MightNotBeSuitableForAds import com.twitter.visibility.common.actions.LimitedEngagementReason import com.twitter.visibility.common.actions.SoftInterventionDisplayType import com.twitter.visibility.common.actions.SoftInterventionReason @@ -440,36 +441,6 @@ object FreedomOfSpeechNotReachActions { } } - case class ConversationSectionAbusiveQualityAction( - violationLevel: ViolationLevel = DefaultViolationLevel) - extends FreedomOfSpeechNotReachActionBuilder[ConversationSectionAbusiveQuality.type] { - - override def actionType: Class[_] = ConversationSectionAbusiveQuality.getClass - - override val actionSeverity = 5 - private def toRuleResult: Reason => RuleResult = Memoize { r => - RuleResult(ConversationSectionAbusiveQuality, Evaluated) - } - - def build(evaluationContext: EvaluationContext, featureMap: Map[Feature[_], _]): RuleResult = { - val appealableReason = - FreedomOfSpeechNotReach.extractTweetSafetyLabel(featureMap).map(_.labelType) match { - case Some(label) => - FreedomOfSpeechNotReach.eligibleTweetSafetyLabelTypesToAppealableReason( - label, - violationLevel) - case _ => - AppealableReason.Unspecified(violationLevel.level) - } - - toRuleResult(Reason.fromAppealableReason(appealableReason)) - } - - override def withViolationLevel(violationLevel: ViolationLevel) = { - copy(violationLevel = violationLevel) - } - } - case class SoftInterventionAvoidAction(violationLevel: ViolationLevel = DefaultViolationLevel) extends FreedomOfSpeechNotReachActionBuilder[TweetInterstitial] { @@ -662,6 +633,9 @@ object FreedomOfSpeechNotReachRules { override def enabled: Seq[RuleParam[Boolean]] = Seq(EnableFosnrRuleParam, FosnrRulesEnabledParam) + + override val fallbackActionBuilder: Option[ActionBuilder[_ <: Action]] = Some( + new ConstantActionBuilder(Avoid(Some(MightNotBeSuitableForAds)))) } case class ViewerIsNonFollowerNonAuthorAndTweetHasViolationOfLevel( @@ -678,6 +652,9 @@ object FreedomOfSpeechNotReachRules { override def enabled: Seq[RuleParam[Boolean]] = Seq(EnableFosnrRuleParam, FosnrRulesEnabledParam) + + override val fallbackActionBuilder: Option[ActionBuilder[_ <: Action]] = Some( + new ConstantActionBuilder(Avoid(Some(MightNotBeSuitableForAds)))) } case class ViewerIsNonAuthorAndTweetHasViolationOfLevel( @@ -692,6 +669,9 @@ object FreedomOfSpeechNotReachRules { override def enabled: Seq[RuleParam[Boolean]] = Seq(EnableFosnrRuleParam, FosnrRulesEnabledParam) + + override val fallbackActionBuilder: Option[ActionBuilder[_ <: Action]] = Some( + new ConstantActionBuilder(Avoid(Some(MightNotBeSuitableForAds)))) } case object TweetHasViolationOfAnyLevelFallbackDropRule diff --git a/visibilitylib/src/main/scala/com/twitter/visibility/rules/RuleBase.scala 
b/visibilitylib/src/main/scala/com/twitter/visibility/rules/RuleBase.scala index 66cbae0d1..e4b99a259 100644 --- a/visibilitylib/src/main/scala/com/twitter/visibility/rules/RuleBase.scala +++ b/visibilitylib/src/main/scala/com/twitter/visibility/rules/RuleBase.scala @@ -188,6 +188,7 @@ object RuleBase { TopicRecommendations -> TopicRecommendationsPolicy, TrendsRepresentativeTweet -> TrendsRepresentativeTweetPolicy, TrustedFriendsUserList -> TrustedFriendsUserListPolicy, + TwitterDelegateUserList -> TwitterDelegateUserListPolicy, TweetDetail -> TweetDetailPolicy, TweetDetailNonToo -> TweetDetailNonTooPolicy, TweetDetailWithInjectionsHydration -> TweetDetailWithInjectionsHydrationPolicy, diff --git a/visibilitylib/src/main/scala/com/twitter/visibility/rules/TweetLabelRules.scala b/visibilitylib/src/main/scala/com/twitter/visibility/rules/TweetLabelRules.scala index 11f2ef7f5..bcee096f5 100644 --- a/visibilitylib/src/main/scala/com/twitter/visibility/rules/TweetLabelRules.scala +++ b/visibilitylib/src/main/scala/com/twitter/visibility/rules/TweetLabelRules.scala @@ -144,6 +144,9 @@ object NsfwCardImageAvoidAllUsersTweetLabelRule action = Avoid(Some(AvoidReason.ContainsNsfwMedia)), ) { override def enabled: Seq[RuleParam[Boolean]] = Seq(EnableAvoidNsfwRulesParam) + + override val fallbackActionBuilder: Option[ActionBuilder[_ <: Action]] = Some( + new ConstantActionBuilder(Avoid(Some(MightNotBeSuitableForAds)))) } object NsfwCardImageAvoidAdPlacementAllUsersTweetLabelRule @@ -247,6 +250,9 @@ object GoreAndViolenceHighPrecisionAvoidAllUsersTweetLabelRule TweetSafetyLabelType.GoreAndViolenceHighPrecision ) { override def enabled: Seq[RuleParam[Boolean]] = Seq(EnableAvoidNsfwRulesParam) + + override val fallbackActionBuilder: Option[ActionBuilder[_ <: Action]] = Some( + new ConstantActionBuilder(Avoid(Some(MightNotBeSuitableForAds)))) } object GoreAndViolenceHighPrecisionAllUsersTweetLabelRule @@ -266,6 +272,9 @@ object NsfwReportedHeuristicsAvoidAllUsersTweetLabelRule TweetSafetyLabelType.NsfwReportedHeuristics ) { override def enabled: Seq[RuleParam[Boolean]] = Seq(EnableAvoidNsfwRulesParam) + + override val fallbackActionBuilder: Option[ActionBuilder[_ <: Action]] = Some( + new ConstantActionBuilder(Avoid(Some(MightNotBeSuitableForAds)))) } object NsfwReportedHeuristicsAvoidAdPlacementAllUsersTweetLabelRule @@ -274,6 +283,9 @@ object NsfwReportedHeuristicsAvoidAdPlacementAllUsersTweetLabelRule TweetSafetyLabelType.NsfwReportedHeuristics ) { override def enabled: Seq[RuleParam[Boolean]] = Seq(EnableAvoidNsfwRulesParam) + + override val fallbackActionBuilder: Option[ActionBuilder[_ <: Action]] = Some( + new ConstantActionBuilder(Avoid(Some(MightNotBeSuitableForAds)))) } object NsfwReportedHeuristicsAllUsersTweetLabelRule @@ -294,6 +306,9 @@ object GoreAndViolenceReportedHeuristicsAvoidAllUsersTweetLabelRule TweetSafetyLabelType.GoreAndViolenceReportedHeuristics ) { override def enabled: Seq[RuleParam[Boolean]] = Seq(EnableAvoidNsfwRulesParam) + + override val fallbackActionBuilder: Option[ActionBuilder[_ <: Action]] = Some( + new ConstantActionBuilder(Avoid(Some(MightNotBeSuitableForAds)))) } object GoreAndViolenceReportedHeuristicsAvoidAdPlacementAllUsersTweetLabelRule @@ -302,6 +317,9 @@ object GoreAndViolenceReportedHeuristicsAvoidAdPlacementAllUsersTweetLabelRule TweetSafetyLabelType.GoreAndViolenceReportedHeuristics ) { override def enabled: Seq[RuleParam[Boolean]] = Seq(EnableAvoidNsfwRulesParam) + + override val fallbackActionBuilder: Option[ActionBuilder[_ <: Action]] = Some( + new 
ConstantActionBuilder(Avoid(Some(MightNotBeSuitableForAds)))) } object GoreAndViolenceHighPrecisionAllUsersTweetLabelDropRule @@ -791,7 +809,7 @@ object SkipTweetDetailLimitedEngagementTweetLabelRule object DynamicProductAdDropTweetLabelRule extends TweetHasLabelRule(Drop(Unspecified), TweetSafetyLabelType.DynamicProductAd) -object NsfwTextTweetLabelTopicsDropRule +object NsfwTextHighPrecisionTweetLabelDropRule extends RuleWithConstantAction( Drop(Reason.Nsfw), And( @@ -803,7 +821,7 @@ object NsfwTextTweetLabelTopicsDropRule ) ) with DoesLogVerdict { - override def enabled: Seq[RuleParam[Boolean]] = Seq(EnableNsfwTextTopicsDropRuleParam) + override def enabled: Seq[RuleParam[Boolean]] = Seq(EnableNsfwTextHighPrecisionDropRuleParam) override def actionSourceBuilder: Option[RuleActionSourceBuilder] = Some( TweetSafetyLabelSourceBuilder(TweetSafetyLabelType.NsfwTextHighPrecision)) } @@ -832,7 +850,10 @@ object DoNotAmplifyTweetLabelAvoidRule extends TweetHasLabelRule( Avoid(), TweetSafetyLabelType.DoNotAmplify - ) + ) { + override val fallbackActionBuilder: Option[ActionBuilder[_ <: Action]] = Some( + new ConstantActionBuilder(Avoid(Some(MightNotBeSuitableForAds)))) +} object NsfaHighPrecisionTweetLabelAvoidRule extends TweetHasLabelRule( diff --git a/visibilitylib/src/main/scala/com/twitter/visibility/rules/VisibilityPolicy.scala b/visibilitylib/src/main/scala/com/twitter/visibility/rules/VisibilityPolicy.scala index 1ff0eaada..e1dcbf88a 100644 --- a/visibilitylib/src/main/scala/com/twitter/visibility/rules/VisibilityPolicy.scala +++ b/visibilitylib/src/main/scala/com/twitter/visibility/rules/VisibilityPolicy.scala @@ -776,7 +776,10 @@ case object MagicRecsPolicy tweetRules = MagicRecsPolicyOverrides.union( RecommendationsPolicy.tweetRules.filterNot(_ == SafetyCrisisLevel3DropRule), NotificationsIbisPolicy.tweetRules, - Seq(NsfaHighRecallTweetLabelRule, NsfwHighRecallTweetLabelRule), + Seq( + NsfaHighRecallTweetLabelRule, + NsfwHighRecallTweetLabelRule, + NsfwTextHighPrecisionTweetLabelDropRule), Seq( AuthorBlocksViewerDropRule, ViewerBlocksAuthorRule, @@ -1171,7 +1174,7 @@ case object ReturningUserExperiencePolicy NsfwHighRecallTweetLabelRule, NsfwVideoTweetLabelDropRule, NsfwTextTweetLabelDropRule, - NsfwTextTweetLabelTopicsDropRule, + NsfwTextHighPrecisionTweetLabelDropRule, SpamHighRecallTweetLabelDropRule, DuplicateContentTweetLabelDropRule, GoreAndViolenceTweetLabelRule, @@ -1785,6 +1788,14 @@ case object TimelineListsPolicy NsfwReportedHeuristicsAllUsersTweetLabelRule, GoreAndViolenceReportedHeuristicsAllUsersTweetLabelRule, NsfwCardImageAllUsersTweetLabelRule, + NsfwHighPrecisionTweetLabelAvoidRule, + NsfwHighRecallTweetLabelAvoidRule, + GoreAndViolenceHighPrecisionAvoidAllUsersTweetLabelRule, + NsfwReportedHeuristicsAvoidAllUsersTweetLabelRule, + GoreAndViolenceReportedHeuristicsAvoidAllUsersTweetLabelRule, + NsfwCardImageAvoidAllUsersTweetLabelRule, + DoNotAmplifyTweetLabelAvoidRule, + NsfaHighPrecisionTweetLabelAvoidRule, ) ++ LimitedEngagementBaseRules.tweetRules ) @@ -2132,7 +2143,13 @@ case object TimelineHomePolicy userRules = Seq( ViewerMutesAuthorRule, ViewerBlocksAuthorRule, - DeciderableAuthorBlocksViewerDropRule + DeciderableAuthorBlocksViewerDropRule, + ProtectedAuthorDropRule, + SuspendedAuthorRule, + DeactivatedAuthorRule, + ErasedAuthorRule, + OffboardedAuthorRule, + DropTakendownUserRule ), policyRuleParams = SensitiveMediaSettingsTimelineHomeBaseRules.policyRuleParams ) @@ -2171,7 +2188,13 @@ case object BaseTimelineHomePolicy userRules = Seq( ViewerMutesAuthorRule, 
ViewerBlocksAuthorRule, - DeciderableAuthorBlocksViewerDropRule + DeciderableAuthorBlocksViewerDropRule, + ProtectedAuthorDropRule, + SuspendedAuthorRule, + DeactivatedAuthorRule, + ErasedAuthorRule, + OffboardedAuthorRule, + DropTakendownUserRule ) ) @@ -2255,7 +2278,13 @@ case object TimelineHomeLatestPolicy userRules = Seq( ViewerMutesAuthorRule, ViewerBlocksAuthorRule, - DeciderableAuthorBlocksViewerDropRule + DeciderableAuthorBlocksViewerDropRule, + ProtectedAuthorDropRule, + SuspendedAuthorRule, + DeactivatedAuthorRule, + ErasedAuthorRule, + OffboardedAuthorRule, + DropTakendownUserRule ), policyRuleParams = SensitiveMediaSettingsTimelineHomeBaseRules.policyRuleParams ) @@ -3283,7 +3312,7 @@ case object TopicRecommendationsPolicy tweetRules = Seq( NsfwHighRecallTweetLabelRule, - NsfwTextTweetLabelTopicsDropRule + NsfwTextHighPrecisionTweetLabelDropRule ) ++ RecommendationsPolicy.tweetRules, userRules = RecommendationsPolicy.userRules @@ -3536,6 +3565,17 @@ case object TrustedFriendsUserListPolicy ) ) +case object TwitterDelegateUserListPolicy + extends VisibilityPolicy( + userRules = Seq( + ViewerBlocksAuthorRule, + ViewerIsAuthorDropRule, + DeactivatedAuthorRule, + AuthorBlocksViewerDropRule + ), + tweetRules = Seq(DropAllRule) + ) + case object QuickPromoteTweetEligibilityPolicy extends VisibilityPolicy( tweetRules = TweetDetailPolicy.tweetRules, diff --git a/visibilitylib/src/main/scala/com/twitter/visibility/rules/generators/TweetRuleGenerator.scala b/visibilitylib/src/main/scala/com/twitter/visibility/rules/generators/TweetRuleGenerator.scala index 6bdb965a1..90db70006 100644 --- a/visibilitylib/src/main/scala/com/twitter/visibility/rules/generators/TweetRuleGenerator.scala +++ b/visibilitylib/src/main/scala/com/twitter/visibility/rules/generators/TweetRuleGenerator.scala @@ -100,30 +100,6 @@ object TweetRuleGenerator { FreedomOfSpeechNotReachActions.SoftInterventionAvoidLimitedEngagementsAction( limitedActionStrings = Some(level3LimitedActions)) ) - .addSafetyLevelRule( - SafetyLevel.TimelineMedia, - FreedomOfSpeechNotReachActions - .SoftInterventionAvoidLimitedEngagementsAction(limitedActionStrings = - Some(level3LimitedActions)) - ) - .addSafetyLevelRule( - SafetyLevel.ProfileMixerMedia, - FreedomOfSpeechNotReachActions - .SoftInterventionAvoidLimitedEngagementsAction(limitedActionStrings = - Some(level3LimitedActions)) - ) - .addSafetyLevelRule( - SafetyLevel.TimelineFavorites, - FreedomOfSpeechNotReachActions - .SoftInterventionAvoidLimitedEngagementsAction(limitedActionStrings = - Some(level3LimitedActions)) - ) - .addSafetyLevelRule( - SafetyLevel.ProfileMixerFavorites, - FreedomOfSpeechNotReachActions - .SoftInterventionAvoidLimitedEngagementsAction(limitedActionStrings = - Some(level3LimitedActions)) - ) .build, UserType.Author -> TweetVisibilityPolicy .builder() @@ -159,30 +135,6 @@ object TweetRuleGenerator { .InterstitialLimitedEngagementsAvoidAction(limitedActionStrings = Some(level3LimitedActions)) ) - .addSafetyLevelRule( - SafetyLevel.TimelineMedia, - FreedomOfSpeechNotReachActions - .InterstitialLimitedEngagementsAvoidAction(limitedActionStrings = - Some(level3LimitedActions)) - ) - .addSafetyLevelRule( - SafetyLevel.ProfileMixerMedia, - FreedomOfSpeechNotReachActions - .InterstitialLimitedEngagementsAvoidAction(limitedActionStrings = - Some(level3LimitedActions)) - ) - .addSafetyLevelRule( - SafetyLevel.TimelineFavorites, - FreedomOfSpeechNotReachActions - .InterstitialLimitedEngagementsAvoidAction(limitedActionStrings = - Some(level3LimitedActions)) - ) - 
.addSafetyLevelRule( - SafetyLevel.ProfileMixerFavorites, - FreedomOfSpeechNotReachActions - .InterstitialLimitedEngagementsAvoidAction(limitedActionStrings = - Some(level3LimitedActions)) - ) .build, ), )
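A pattern worth calling out in the visibilitylib changes above: many Avoid-style rules gain a `fallbackActionBuilder` returning `Avoid(Some(MightNotBeSuitableForAds))`, and the `VisibilityRuleEngine` change makes that fallback fire only when a rule's feature dependencies actually failed to hydrate (previously the fallback metric was recorded whenever a fallback existed). A minimal sketch of that control flow, using simplified stand-in types rather than the real visibilitylib classes:

```scala
// Sketch only: simplified stand-ins, not the real visibilitylib types.
object FallbackActionSketch {
  sealed trait Action
  case object Allow extends Action
  final case class Avoid(reason: Option[String] = None) extends Action

  final case class Rule(
    name: String,
    evaluate: Map[String, Any] => Action, // needs hydrated features to run
    fallbackAction: Option[Action] = None)

  // Mirrors the engine change: fall back only when hydration actually failed.
  def verdict(rule: Rule, features: Map[String, Any], failedDeps: Set[String]): Action =
    if (failedDeps.nonEmpty && rule.fallbackAction.nonEmpty) {
      println(s"rule_fallback_action: ${rule.name}") // stand-in for metricsRecorder
      rule.fallbackAction.get
    } else rule.evaluate(features)

  def main(args: Array[String]): Unit = {
    val rule = Rule(
      name = "NsfwCardImageAvoidAllUsersTweetLabelRule",
      evaluate = f => if (f.get("nsfw_card_image").contains(true)) Avoid() else Allow,
      fallbackAction = Some(Avoid(Some("MightNotBeSuitableForAds"))))

    // Features hydrated: the rule itself decides.
    println(verdict(rule, Map("nsfw_card_image" -> true), failedDeps = Set.empty))
    // Hydration failed: err on the advertiser-safe side.
    println(verdict(rule, Map.empty, failedDeps = Set("nsfw_card_image")))
  }
}
```

The design choice this mirrors: for ad-adjacency rules, it is safer to Avoid a tweet when the signals needed to evaluate it are unavailable than to treat it as clean.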
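The `TwitterDelegateUserList` additions above also illustrate how a new safety level is threaded through seven files: a decider key in `decider.yml` and `DeciderKey.scala`, the key-to-level mapping in `VisibilityDeciders.scala`, an enabling param in `SafetyLevelParams.scala`, the thrift mapping in `SafetyLevel.scala`, and a policy registered in `RuleBase.scala` and defined in `VisibilityPolicy.scala`. Collapsed into one file with assumed, simplified types (none of this is the real configapi), the wiring looks roughly like:

```scala
// Sketch only: assumed, simplified types, not the real configapi classes.
object SafetyLevelWiringSketch {
  // decider.yml: visibility_library_enable_twitter_delegate_user_list_safety_level: 10000
  val deciderAvailability: Map[String, Int] =
    Map("visibility_library_enable_twitter_delegate_user_list_safety_level" -> 10000)

  final case class SafetyLevelParam(deciderKey: String) {
    // 10000 denotes full availability in the decider.yml convention above.
    def enabled: Boolean = deciderAvailability.getOrElse(deciderKey, 0) > 0
  }

  final case class VisibilityPolicy(userRules: Seq[String], tweetRules: Seq[String])

  val TwitterDelegateUserListParam: SafetyLevelParam =
    SafetyLevelParam("visibility_library_enable_twitter_delegate_user_list_safety_level")

  // Same rule names the diff registers for TwitterDelegateUserListPolicy.
  val TwitterDelegateUserListPolicy: VisibilityPolicy = VisibilityPolicy(
    userRules = Seq(
      "ViewerBlocksAuthorRule",
      "ViewerIsAuthorDropRule",
      "DeactivatedAuthorRule",
      "AuthorBlocksViewerDropRule"),
    tweetRules = Seq("DropAllRule"))

  def main(args: Array[String]): Unit =
    if (TwitterDelegateUserListParam.enabled)
      println(TwitterDelegateUserListPolicy.userRules.mkString(", "))
}
```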