IP Anchor: Section 1 & 2 (Hypothesis & Data sourcing)

Alternative data informs which hypotheses are worth exploring and what data is feasible. This week bridges hypothesis and data sourcing.

What this week covers

Alternative data is any non-standard data source that provides an information edge. This week walks through three major categories (satellite, positioning, sentiment), real examples from AlgoGators, and the key validation steps before committing research time to an alternative data hypothesis.

What makes data "alternative"?

Standard data: OHLCV prices, fundamentals (earnings, balance sheets), economic releases.

Alternative data: Anything else — satellites, credit card transactions, web scraping, shipping data, options flow, weather.

The four questions to ask about any alternative dataset:

  1. Does it contain information not already in market prices? If the market already prices in the data, there's no edge.
  2. Is it available at the time the signal fires (no look-ahead)? If there's a 1-week data lag, you can't trade on it intraday.
  3. Does the fund have access, and can it be ingested at the required frequency? Data that costs $500k/month requires budget approval. Data not in our subscriptions requires QD investigation.
  4. How long is the history? Less than 5 years is usually insufficient. You need at least two full market cycles (bull and bear).

Satellite data: NasaPowerCouncil (detailed example)

Data source: NASA POWER API. Free, public. Daily data back to 1981.

Variables: Solar radiation (ALLSKY_SFC_SW_DWN), temperature (T2M, T2M_MAX, T2M_MIN), precipitation (PRECTOTCORR), humidity.

Geographic coverage: Any latitude/longitude on Earth. You request the grid.
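As a minimal sketch of what requesting a point looks like, the helpers below build a request URL for the POWER daily point endpoint and flatten the JSON response into per-date rows. The endpoint path, the "AG" community value, the response layout (properties → parameter → variable → date) and the -999 fill value are assumptions based on the public POWER documentation and should be checked against the current docs. The function names are hypothetical, not part of any AlgoGators codebase.

```python
from urllib.parse import urlencode

# Assumed daily point endpoint of the NASA POWER API (verify before use).
POWER_DAILY_URL = "https://power.larc.nasa.gov/api/temporal/daily/point"
FILL_VALUE = -999.0  # assumed missing-data sentinel in POWER responses


def build_power_url(lat, lon, start, end,
                    parameters=("T2M_MAX", "T2M_MIN", "PRECTOTCORR")):
    """Build a request URL for one grid point; start/end are YYYYMMDD strings."""
    query = {
        "parameters": ",".join(parameters),
        "community": "AG",  # assumed: agroclimatology community
        "latitude": lat,
        "longitude": lon,
        "start": start,
        "end": end,
        "format": "JSON",
    }
    return f"{POWER_DAILY_URL}?{urlencode(query)}"


def parse_power_daily(payload):
    """Turn a POWER-style JSON payload into {date: {variable: value_or_None}}."""
    by_variable = payload["properties"]["parameter"]
    rows = {}
    for variable, series in by_variable.items():
        for day, value in series.items():
            # Replace the fill value with None so it cannot leak into averages.
            rows.setdefault(day, {})[variable] = None if value == FILL_VALUE else value
    return dict(sorted(rows.items()))
```

In practice the URL would be fetched with something like requests.get(url, timeout=30).json() and the result passed to parse_power_daily before loading into the database.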

Information advantage: Crop yields depend on growing-season weather. USDA yield forecasts are survey-based and released monthly (dates announced in advance). Satellite-derived data provides daily conditions for the grid cell covering a growing region. The market does not instantly incorporate this data because it is (a) absent from standard feeds, (b) dependent on domain expertise to process, and (c) used by few participants.

Signal: Growing-degree-day (GDD) deviation from the 20-year seasonal average. When cumulative GDD is abnormally low relative to that average for the calendar week, crop stress is high, which predicts lower USDA yields.
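A minimal sketch of the GDD calculation, assuming the common capped-average method for corn with a 10 °C base and a 30 °C cap. Those thresholds and the function names are assumptions for illustration; the source does not specify them.

```python
def daily_gdd(t_max, t_min, base=10.0, cap=30.0):
    """Growing degree days for one day (deg C), capped-average method."""
    # Clamp both temperatures into [base, cap] before averaging.
    t_max = min(max(t_max, base), cap)
    t_min = min(max(t_min, base), cap)
    return (t_max + t_min) / 2.0 - base


def cumulative_gdd(t_max_series, t_min_series):
    """Running season-to-date GDD total, one value per day."""
    total = 0.0
    running = []
    for hi, lo in zip(t_max_series, t_min_series):
        total += daily_gdd(hi, lo)
        running.append(total)
    return running


def gdd_deviation(current_cumulative, historical_cumulatives):
    """Current season-to-date GDD minus the mean of prior seasons.

    historical_cumulatives holds the cumulative GDD reached at the same
    point in the season in each prior year.
    """
    mean = sum(historical_cumulatives) / len(historical_cumulatives)
    return current_cumulative - mean
```

A negative deviation means the season is running cooler than its historical norm at the same point in the calendar.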

Validation: Compute the Information Coefficient (IC) between the satellite GDD deviation and corn yield surprises (actual USDA yield minus the prior month's forecast). An IC above 0.05 indicates meaningful predictive power.

Cost & access: Free. No licensing required. Ingestible via Python requests → database. No budget approval needed.

Positioning data: COT reports

Data source: CFTC Commitments of Traders (COT) reports. Free, published every Friday. Data lag: three days, since Friday's report shows positions as of the preceding Tuesday.

Coverage: All major US futures (oil, gold, corn, wheat, currency, interest rates, equities).

Information advantage: COT breaks down positioning by trader category: commercial hedgers (producers, consumers), large speculators, small speculators. When large speculators are extremely net long (crowded trade), the position often reverses. When commercial hedgers are extremely short (hedging supply), the commodity is often at peak prices.

Signal: Speculator positioning z-score. When net speculator positioning is more than 2 standard deviations above the 20-year average (crowded long), short the commodity. When it is more than 2 standard deviations below (crowded short), go long. This is a contrarian signal.
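A minimal sketch of the z-score rule, assuming the history passed in excludes the latest observation and that the sample standard deviation is the intended measure. Function names and the +1/-1/0 encoding are illustrative choices, not the fund's convention.

```python
import statistics


def zscore(latest, history):
    """Standard score of the latest value against a historical window."""
    mean = statistics.fmean(history)
    std = statistics.stdev(history)  # sample standard deviation
    if std == 0:
        raise ValueError("history has no variation")
    return (latest - mean) / std


def contrarian_signal(net_spec_position, history, threshold=2.0):
    """Return -1 (short), +1 (long) or 0 (flat) from speculator crowding."""
    z = zscore(net_spec_position, history)
    if z > threshold:
        return -1  # crowded long: fade it
    if z < -threshold:
        return 1   # crowded short: fade it
    return 0
```

Keeping the latest value out of the history window avoids the current extreme dampening its own z-score.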

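Data lag risk: COT data is released Friday for Tuesday's positions. You can't trade Tuesday, Wednesday, or Thursday on data you don't yet have; the earliest you can act is after Friday's release, which in practice means late Friday or the following Monday. This is acceptable for daily/weekly signals.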
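To keep a backtest free of look-ahead, each report can be stamped with the first date it is tradable rather than the date it describes. The sketch below assumes the normal Tuesday-to-Friday schedule, ignores holiday-delayed releases, and by default takes the conservative view that the first tradable session is the Monday after release. Function names are hypothetical.

```python
from datetime import date, timedelta


def cot_release_date(report_date):
    """Friday on which a Tuesday-dated COT report is normally published."""
    if report_date.weekday() != 1:  # Monday is 0, Tuesday is 1
        raise ValueError("COT positions are reported as of a Tuesday")
    return report_date + timedelta(days=3)


def first_tradable_date(report_date, trade_on_release_day=False):
    """Earliest date a signal built from this report may be traded."""
    release = cot_release_date(report_date)
    if trade_on_release_day:
        return release
    return release + timedelta(days=3)  # the following Monday
```

In a backtest, the signal series would be indexed by first_tradable_date so that Tuesday's positions never influence trades earlier in the week.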

Cost & access: Free from CFTC website. Historical data via Bloomberg Terminal. Parseable via Python.

Options flow & market microstructure

Data source: Options flow (put/call volume, open interest, implied volatility). Available from exchanges (CBOE, CME) or data providers (Databento, OptionsIntelligence).

Information advantage: Options traders are often informed (they're willing to pay option premium to express a view). Unusual put buying can precede declines. Skew in implied volatility surfaces can signal tail hedging demand.

Signal examples:

  • Put/call ratio inversion: when calls exceed puts (bullish), the move often continues. When puts exceed calls (bearish), trend reversal likely.
  • IV skew steepening: when downside IV > upside IV by > 2 vol points (unusual), downside risk is being hedged. Precedes rebounds.
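Both signals reduce to simple arithmetic once the inputs are chosen. The sketch below assumes implied volatilities are quoted in percent and that "downside" and "upside" IV come from comparable out-of-the-money strikes (for example 25-delta puts and calls). That strike choice and the function names are assumptions, not something the notes specify.

```python
def put_call_ratio(put_volume, call_volume):
    """Put volume divided by call volume; above 1 means puts dominate."""
    if call_volume <= 0:
        raise ValueError("call volume must be positive")
    return put_volume / call_volume


def iv_skew(downside_iv, upside_iv):
    """Skew in vol points: downside IV minus upside IV (both in percent)."""
    return downside_iv - upside_iv


def skew_is_steep(downside_iv, upside_iv, threshold=2.0):
    """True when downside IV exceeds upside IV by more than the threshold."""
    return iv_skew(downside_iv, upside_iv) > threshold
```

For instance, a downside IV of 24.5 against an upside IV of 21.0 gives a skew of 3.5 vol points, which clears the 2-point threshold.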

Cost & access: Included in Databento subscription (existing infrastructure).

Alternative data validation checklist

Before committing 10 weeks to an alternative data hypothesis, validate:

  • Information uniqueness: Is this information publicly available? If it's in earnings transcripts, most sell-side researchers already parsed it. Low information value.
  • Market pricing lag: How long between data release and market incorporation? If it takes 6 months to fully price, edge is real. If 1 day, probably no edge.
  • Historical depth: At least 5 years of history. Ideally 10+ years covering multiple market cycles.
  • Data quality: No unexplained gaps, outliers, or quality issues. Satellite data: cloud cover bias? Weather data: sensor drift? Verify.
  • Cost & feasibility: Get pricing in writing. Budget approval if > $10k/month. Confirm it can be ingested into TimescaleDB at your required frequency.
  • IC test: Before full backtest, compute Information Coefficient between raw alt data and forward returns. IC > 0.02 is promising. IC < 0.01 means move on.
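The historical-depth and data-quality items can be screened mechanically before any modelling. A minimal sketch, assuming observations are keyed by calendar date and that a gap is any spacing wider than the expected cadence; function names and the default cadence are illustrative.

```python
from datetime import date


def history_years(dates):
    """Span between first and last observation, in years."""
    return (max(dates) - min(dates)).days / 365.25


def find_gaps(dates, max_gap_days=1):
    """Pairs of consecutive observations spaced further apart than expected."""
    ordered = sorted(dates)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if (later - earlier).days > max_gap_days
    ]
```

For a weekly series such as COT, max_gap_days would be set to 7. Any gaps returned still need a manual explanation (holiday, outage, vendor change) before the dataset passes the checklist.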

Real hypotheses using alternative data

Hypothesis 1: Satellite weather → corn futures

Growing-season soil moisture anomalies, measured via satellite, predict USDA yield report surprises. Market prices are set on survey data; satellite data arrives before surveys are finalized. Information lag of 2–4 weeks.

Hypothesis 2: Speculator crowding → FX reversals

When CFTC large speculators are positioned at extreme net-long (historical 95th percentile), the currency pair reverses within 2–8 weeks. Contrarian edge from crowded positioning.

Hypothesis 3: Options skew → equity tail protection

When put/call skew steepens (downside IV exceeds upside IV by more than 2 vol points), downside is overhedged. The market rebounds 3–10 days after peak skew. This is mean reversion in hedging demand.

Common mistakes with alternative data

Five alternative data pitfalls

  • Using alt data that requires paying for a license the fund doesn't have. Confirm cost and budget availability BEFORE spending research time. QD won't green-light a strategy on data they can't afford.
  • Not checking the data's lookback history. "Great satellite dataset available from 2020 onward" = 4 years of history. That's only one bull market. You need two full cycles (bull + bear).
  • Failing to account for data release lag. COT data released Friday applies to Tuesday positions. You can't trade intraday Tuesday. Yield forecast data released mid-month applies to month-end expectations. Signal lags matter.
  • Overfitting to an alt data peculiarity. "This parameter works great for this specific alt data in this specific market." Doesn't generalize. Always test robustness across subsets.
  • Skipping the IC test. Before building a backtest, compute Information Coefficient. IC < 0.02 = probably no edge. Move on. Save 8 weeks of wasted development.