Phone-number reputation models: what signals, what weights
Phone-number reputation is the attempt to assign a risk score or a trust score to an MSISDN based on its observable behaviour and network attributes. The use cases are everywhere: onboarding fraud, transaction risk, call-centre prioritisation, abuse prevention on free-signup flows. The implementation varies enormously. A reputation model trained on aggregator-visible data looks different from one trained on operator-side data, and both differ from one that buys feeds from multiple vendors and stacks them. Few of these differences are made explicit in vendor documentation.
This post describes what signals feed a reputation model, where the weights typically land in peer-reviewed fraud-detection work, and how the licensed-operator position changes what signals are available and at what latency.
Six signal families
Phone-number reputation signals cluster into six families.
Carrier attributes: the terminating operator, line type (mobile, landline, VoIP, fixed-wireless), geographic allocation of the number, and any carrier-level reputational flags. These are basic metadata that resolve via HLR lookup. They shift rarely and are cheap to query.
Mobility signals: IMSI changes (SIM swaps), MNP events (ports), call-forwarding activations, and recent location updates. These shift frequently and carry high information density for fraud detection. The published literature on fintech fraud models shows that signals in this family, combined with transaction context, can materially reduce expected losses. The online-payment fraud study by Vanini and colleagues reports a 15 percent baseline loss reduction from ML-based detection, rising to 52 percent with economic optimisation, at a 0.4 percent false-positive rate.
Traffic patterns: the volume, direction, and time-of-day distribution of traffic on the number. A number that receives OTPs but never sends messages, pairs with many distinct applications in a short window, or shows traffic spikes at atypical hours exhibits the patterns used to detect artificially inflated traffic (AIT). These signals are observable from both the operator side and the aggregator side, at different granularity.
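The traffic-pattern signals above can be reduced to a few per-number features. The sketch below is illustrative only: the SmsEvent record shape, the one-hour window, and the max_apps threshold are assumptions made for the example, not values taken from any published AIT detector.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SmsEvent:
    """One message observed for a number (hypothetical record shape)."""
    timestamp: datetime
    direction: str    # "mt" = delivered to the number, "mo" = sent by it
    application: str  # sending application or brand; "" for person-to-person


def traffic_flags(events, window=timedelta(hours=1), max_apps=5):
    """Return coarse AIT-style flags for one number's message history."""
    inbound = sorted((e for e in events if e.direction == "mt"),
                     key=lambda e: e.timestamp)
    outbound = [e for e in events if e.direction == "mo"]

    # Largest number of distinct applications seen inside any sliding window.
    peak_apps = 0
    start = 0
    for end, event in enumerate(inbound):
        while event.timestamp - inbound[start].timestamp > window:
            start += 1
        apps = {e.application for e in inbound[start:end + 1] if e.application}
        peak_apps = max(peak_apps, len(apps))

    return {
        "receive_only": bool(inbound) and not outbound,
        "peak_distinct_apps": peak_apps,
        "many_apps_in_window": peak_apps > max_apps,
    }
```

An operator could compute these over the full traffic on its network, while an aggregator would only see the subset of events it carried, which is one way the granularity difference shows up in practice.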
Account-history features: the tenure of the number's binding to a subscriber account, the age of the IMSI, and recent new-account signups associated with the number across cooperating platforms. Tenure features are the most predictive in the fraud-detection literature, especially in combination with other signals.
Device features: the IMEI currently registered, the history of IMEI changes, and the presence of the number on EIR blacklists. In published schemes, device-level signals combined with SIM-level signals have a measurable effect on detection.
Vendor-aggregated blacklists: industry blacklists from GSMA, national regulators, and commercial feeds. These carry high precision on known bad numbers but low coverage. Most attack traffic is not on any published list at the moment of the attack.
The weights, from the literature
Assigning weights is where reputation models diverge. The published work consistently emphasises that single-feature weights are less informative than feature-interaction weights, and that the interaction space is where the predictive signal concentrates.
The online-payment fraud work by Vanini and colleagues describes a feature set that includes transaction-level features, account-history features, and device-network features. The observation is that the economic optimisation step, where the model is tuned not to raw accuracy but to expected-loss minimisation, is what produces the large gains beyond baseline. The 52 percent loss reduction quoted is contingent on this tuning, not on any specific feature weight.
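The difference between tuning for accuracy and tuning for expected loss can be shown on a toy labelled sample. This is a minimal sketch and not the procedure used in the cited study: the case tuples, the flat review_cost for a wrongly flagged legitimate case, and the function names are all assumptions for illustration.

```python
def empirical_loss(threshold, cases, review_cost):
    """Loss on a labelled sample if every case scoring >= threshold is flagged.

    cases: list of (score, is_fraud, amount) tuples.
    A missed fraud costs its amount; a flagged legitimate case costs review_cost.
    """
    loss = 0.0
    for score, is_fraud, amount in cases:
        flagged = score >= threshold
        if is_fraud and not flagged:
            loss += amount
        elif flagged and not is_fraud:
            loss += review_cost
    return loss


def accuracy(threshold, cases):
    """Share of cases where the flag agrees with the fraud label."""
    correct = sum((score >= threshold) == is_fraud for score, is_fraud, _ in cases)
    return correct / len(cases)


def candidate_thresholds(cases):
    """Every observed score, plus 'flag nothing'."""
    return sorted({score for score, _, _ in cases}) + [float("inf")]


# Toy data: one small fraud scored high, one large fraud scored lower.
cases = [
    (0.90, True, 50.0),
    (0.60, True, 5000.0),
    (0.70, False, 20.0),
    (0.65, False, 30.0),
    (0.20, False, 10.0),
]

loss_opt = min(candidate_thresholds(cases),
               key=lambda t: empirical_loss(t, cases, review_cost=25.0))
acc_opt = max(candidate_thresholds(cases),
              key=lambda t: accuracy(t, cases))
print(loss_opt, acc_opt)
```

On this sample the accuracy-maximising threshold is 0.9 and the loss-minimising threshold is 0.6: the lower threshold accepts two extra false positives in order to catch the one fraud that carries almost all of the monetary exposure.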
For the mobile-specific side of the problem, the SectraBank study on Android banking fraud shows that a joint (IMEI, fingerprint, geolocation) tuple catches 98 percent of SIM-swap attempts in a controlled setup. Each individual signal in that tuple is less predictive alone. The combination is where the detection sits. Published work on SIM-swap countermeasures reaches the same conclusion: location anomaly plus recency of swap plus transaction context is the combination that carries predictive power.
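One way to make the "combination carries the signal" point concrete is to build explicit interaction features from the individual signals. The sketch below assumes three boolean-style inputs and a 72-hour recency window; the function name, the parameter names, and the window are illustrative choices, not taken from the cited studies.

```python
def interaction_features(sim_swap_hours_ago, location_anomaly, high_value_txn,
                         recent_hours=72):
    """Derive pairwise and three-way interaction flags from single signals.

    sim_swap_hours_ago: hours since the last SIM swap, or None if none is known.
    location_anomaly, high_value_txn: booleans supplied by the caller.
    """
    recent_swap = (sim_swap_hours_ago is not None
                   and sim_swap_hours_ago <= recent_hours)
    return {
        # Single signals: each is weakly predictive on its own.
        "recent_swap": recent_swap,
        "location_anomaly": location_anomaly,
        "high_value_txn": high_value_txn,
        # Interactions: the features a model can weight far more heavily.
        "swap_x_location": recent_swap and location_anomaly,
        "swap_x_location_x_txn": (recent_swap and location_anomaly
                                  and high_value_txn),
    }


print(interaction_features(68, True, True))
print(interaction_features(68, False, True))
```

A model fed only the three single flags has to learn the conjunction itself; supplying the interaction columns directly makes the combination available even to a linear scorer.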
The practical implication is that a reputation score should expose the individual signals that composed it, not just the final score. A relying party who is told "this number has a reputation score of 73" has less information than one who is told "this number has had a SIM swap in the last 72 hours, active call forwarding to an out-of-country destination, and a 14-month account tenure." The latter is composable with the relying party's own risk model. The former is not.
The aggregator vs operator gap
Several signal families are available only, or with higher fidelity, from the operator side.
Real-time mobility signals: an aggregator probing HLR via SRI-for-SM sees a SIM swap only when its next probe catches the IMSI change. An operator querying its own HSS provisioning trail sees the change at the moment the HSS record was written. Recency weights are typical in fraud-detection schemes because the predictive value of a mobility signal decays quickly. For a reputation model that weights recency heavily, the difference is material.
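The effect of detection delay on a recency-weighted signal can be sketched with a simple exponential decay. The decay form, the 24-hour half-life, and the assumption that a swap lands uniformly at random between two probes are all illustrative; a production model would fit its own decay.

```python
def recency_weight(hours_since_event, half_life_hours=24.0):
    """Exponential decay: the weight halves every half_life_hours."""
    return 0.5 ** (hours_since_event / half_life_hours)


def mean_detection_delay(probe_interval_hours):
    """Average delay before a periodic probe notices a change,
    assuming the change falls uniformly at random between two probes."""
    return probe_interval_hours / 2.0


# Operator view: the change is visible when the record is written.
operator_weight = recency_weight(0.0)

# Aggregator view: probing once a day, the swap is on average 12 hours old
# by the time it is first seen.
aggregator_weight = recency_weight(mean_detection_delay(24.0))

print(operator_weight, round(aggregator_weight, 3))
```

Under these assumptions the signal has already lost roughly 30 percent of its weight by the time a daily probe first observes it, and for the hours before that probe the swap is not visible to the aggregator-side model at all.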
Call-forwarding state: MAP InterrogateSS is a first-class operator operation. Aggregators can query it through operator partnerships, but the latency and the coverage vary.
IMSI age: the issuance date of the IMSI is held by the operator. Aggregators can approximate it by inference from first-appearance in their own data, but the approximation is biased.
Traffic-pattern signals: an operator sees traffic on its own network, including patterns that never cross to aggregators because they are filtered at the operator's own SMSC.
The consequence is that a reputation model running with operator-sourced data has access to features that an aggregator-sourced model approximates at best. This does not make aggregator-sourced reputation useless. The published fraud-detection work using aggregator-type features still shows meaningful loss reduction. The accurate comparison acknowledges the gap in feature set rather than claiming equivalent fidelity.

Drift, retraining, and what the response should expose
Any reputation model has a drift problem. Attacker behaviour shifts, the distribution of legitimate behaviour shifts with market conditions, and a model trained on last year's data produces last year's precision-recall curve on this year's traffic. The fintech fraud literature documents this consistently. Generative-AI-enabled adversarial tools have accelerated the drift, and models that were state-of-the-art in 2022 often underperform in 2026 without retraining.
The operational response is retraining cadence plus live-feature monitoring. Weekly or monthly retraining against recent-attack labels is common in the best-operated fraud-detection stacks. Live monitoring of feature distributions catches cases where a specific feature has become less predictive before the overall model accuracy degrades visibly. Reputation vendors who do not publish their retraining cadence carry hidden model risk.
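Live monitoring of a feature distribution is often done with a stability statistic. The sketch below uses the population stability index (PSI), which is a swapped-in, commonly used technique and not one the source names; the bin count and the epsilon floor for empty bins are assumptions.

```python
import math


def population_stability_index(baseline, current, bins=10, eps=1e-6):
    """PSI between a baseline sample and a current sample of one feature.

    Bins are equal-width over the baseline range; current values outside
    that range are clamped into the first or last bin.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    b, c = shares(baseline), shares(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))


# Example: hours since last SIM swap, last quarter versus this week.
baseline = list(range(100))
shifted = [v + 50 for v in baseline]
print(population_stability_index(baseline, baseline))
print(population_stability_index(baseline, shifted))
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but those cut-offs are conventions rather than guarantees, and the alerting threshold should be set against the integrator's own retraining history.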
A reputation API that exposes the decomposition rather than the final score is more useful for a sophisticated integrator. The response should include the individual signals that informed the score with their current values, the timestamp of each signal's last update, the provenance of each signal (operator-sourced, aggregator-sourced, or third-party feed), the interaction features that composed the signals where the model computes them server-side, and the false-positive envelope on the final score calibrated against recent production data.
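The response contents listed above can be sketched as a typed structure. Every field name here is hypothetical and chosen for the example; the sketch only shows one way to carry value, timestamp, and provenance per signal alongside the score, and the numbers are placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Signal:
    name: str
    value: object
    updated_at: str   # ISO 8601 timestamp of the signal's last update
    provenance: str   # "operator", "aggregator", or "third_party"


@dataclass
class ReputationResponse:
    score: int
    signals: list               # list of Signal
    interactions: dict          # server-side interaction features, if any
    false_positive_rate: float  # calibrated on recent production data


response = ReputationResponse(
    score=73,
    signals=[
        Signal("sim_swap_hours_ago", 68, "2026-01-10T08:00:00Z", "operator"),
        Signal("call_forward_active", True, "2026-01-12T19:30:00Z", "operator"),
        Signal("imei_eir_status", "clean", "2026-01-01T00:00:00Z", "third_party"),
    ],
    interactions={"recent_swap_and_forwarding": True},
    false_positive_rate=0.004,  # illustrative value only
)

print(json.dumps(asdict(response), indent=2))
```

Keeping provenance and timestamp on each signal, rather than once per response, lets an integrator discount a stale or lower-fidelity signal without discarding the rest.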
A response of the form { "score": 73, "risk_level": "medium" } is a user-experience concession to integrators who want a one-call decision. It is not the right primary response for an integrator running their own fraud model. A response that includes { "sim_swap_hours_ago": 68, "call_forward_active": true, "call_forward_destination_country": "GB", "subscriber_country": "NG", "imei_change_hours_ago": 2, "account_tenure_months": 14, "imei_eir_status": "clean" } gives the integrator the facts to compose with their own context. The integrator's risk model can weight these in ways the vendor's opaque score cannot.
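To show what "compose with their own context" can look like, the sketch below parses the decomposed response above and combines it with a transaction amount that only the integrator knows. The decision rule, the 72-hour window, and the high_value cut-off are assumptions for illustration, not a recommended policy.

```python
import json

RAW = """{
  "sim_swap_hours_ago": 68,
  "call_forward_active": true,
  "call_forward_destination_country": "GB",
  "subscriber_country": "NG",
  "imei_change_hours_ago": 2,
  "account_tenure_months": 14,
  "imei_eir_status": "clean"
}"""


def decide(signals, txn_amount, swap_window_hours=72, high_value=1000):
    """Return "allow", "step_up", or "block" from signals plus local context."""
    swap = signals.get("sim_swap_hours_ago")
    recent_swap = swap is not None and swap <= swap_window_hours

    # Forwarding is only treated as suspicious when it leaves the home country.
    cross_border_forward = bool(
        signals.get("call_forward_active", False)
        and signals.get("call_forward_destination_country")
        != signals.get("subscriber_country")
    )

    if recent_swap and cross_border_forward and txn_amount >= high_value:
        return "block"
    if recent_swap:
        return "step_up"
    return "allow"


signals = json.loads(RAW)
print(decide(signals, txn_amount=2500))
print(decide(signals, txn_amount=50))
```

The same network facts lead to different outcomes depending on the transaction, which is exactly the context a single vendor-side score cannot see.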
The limits of reputation
Reputation models are probabilistic. A well-designed model with strong features produces a substantial reduction in expected fraud loss, and the literature supports specific measurements: 15 to 52 percent loss reduction at 0.4 percent false-positive rate in production online-payment contexts, 98 percent SIM-swap-attack mitigation in controlled banking contexts, and 90 percent recall at 0.1 to 0.2 percent false-positive rate for AIT detection.
They do not produce certainty. Every model has a residual false-negative rate, and a sophisticated attacker studying the model eventually finds inputs that score below the decision threshold. Reputation is most useful as one layer among several. A decision composed from transaction-level, behavioural, device, and reputation signals produces a tighter risk surface than a single scoring primitive asked to answer the fraud question on its own.
How TensorShield ships reputation
TensorShield's reputation surface returns both shapes. For integrators who want a score, the composite risk score uses the weighting described in the TensorShield product documentation (SIM swap at approximately 30 percent, call forwarding at approximately 20 percent, with the remainder distributed across SIM-farm, device, and tenure signals). For integrators who want the decomposition, every signal that fed the score is returned with its provenance and timestamp.
The weighting is a starting point, not a claim of universal optimality. Integrators with their own production fraud data can reweight against their population, and the decomposed response supports that directly. The weights themselves are published on the product page, since opaque weights produce opaque decisions and integrators running their own fraud models need to see the inputs.
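A weighted composite of this kind, and the reweighting an integrator might apply, can be sketched as follows. Only the approximate 30 and 20 percent figures come from the text above; the split of the remaining 50 percent across SIM-farm, device, and tenure signals, the signal names, and the 0-to-1 signal scaling are assumptions for the example.

```python
# Weights in percentage points. The last three values are an assumed split
# of the remainder, chosen only so the example sums to 100.
DEFAULT_WEIGHTS = {
    "sim_swap": 30,
    "call_forwarding": 20,
    "sim_farm": 20,
    "device": 15,
    "tenure": 15,
}


def composite_score(signals, weights=DEFAULT_WEIGHTS):
    """Weighted risk score in [0, 100].

    signals: mapping of signal name to a risk value in [0, 1];
    missing signals count as 0 and out-of-range values are clamped.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    raw = sum(w * min(max(signals.get(name, 0.0), 0.0), 1.0)
              for name, w in weights.items())
    return 100.0 * raw / total


observed = {"sim_swap": 1.0, "call_forwarding": 1.0, "tenure": 0.2}
print(composite_score(observed))

# An integrator reweighting against their own population passes new weights.
own_weights = {"sim_swap": 50, "call_forwarding": 10, "device": 40}
print(composite_score(observed, own_weights))
```

Because the score is normalised by the weight total, custom weights do not need to sum to 100, and a signal the integrator omits from the mapping simply drops out of the score.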
The provenance field names the source operator for the mobility signals and distinguishes real-time from cached data. For integrators running tight risk windows, the real-time path is available at published latency per region. For integrators who accept higher latency for lower cost, the cached path is the cheaper option with the freshness interval explicit in the response.
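The choice between the two paths reduces to comparing the staleness an integrator can tolerate with the freshness interval the cached path reports. The sketch below is a hypothetical client-side helper; the field names in the provenance mapping are assumptions and not the documented response schema.

```python
def choose_path(max_staleness_seconds, cached_freshness_seconds):
    """Pick the cheaper cached path when it is fresh enough, else real-time."""
    if cached_freshness_seconds <= max_staleness_seconds:
        return "cached"
    return "realtime"


def is_usable(provenance, max_staleness_seconds):
    """Check one returned signal against the integrator's risk window.

    provenance: mapping with hypothetical keys "realtime" (bool) and
    "age_seconds" (int, age of the cached value when it was served).
    """
    if provenance.get("realtime", False):
        return True
    return provenance.get("age_seconds", float("inf")) <= max_staleness_seconds


# A payment flow tolerating 5 minutes of staleness against a 1-hour cache.
print(choose_path(max_staleness_seconds=300, cached_freshness_seconds=3600))

# A newsletter signup tolerating a day of staleness against the same cache.
print(choose_path(max_staleness_seconds=86400, cached_freshness_seconds=3600))
```

Checking each signal with is_usable after the call guards against the case where a cached value is older than the risk window even though the request was otherwise served successfully.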
The product boundary for TensorShield is the network-layer reputation rather than the full fraud model. The signals returned are what the network sees, weighted for integrators who want a score and decomposed for those who want the inputs. What the integrator does with those signals, in the context of their own behavioural and transaction data, is the fraud model. The two layers complement each other.