Week 6: Toolstack, Data Pipeline & Writing the Hypothesis

IP Anchor: Section 2 (Data sourcing & QD approval) | Section 1 (Hypothesis)

Know the infrastructure before writing Section 2. Know how to write a falsifiable hypothesis before starting Section 1.

What this week covers

Part one: the AlgoGators data pipeline — what tools ingest data, where it lives, how it flows to strategies, and what the QD approval gate (Section 2b) is checking for. Part two: how to write a strong, falsifiable hypothesis with an economic mechanism — which is Section 1 of your IP.

Part 1 — Fund Toolstack & Data Pipeline

The data flow at AlgoGators

Raw market data
    ↓
Databento (ingest)
    ↓
TimescaleDB (storage)
    ↓
data-ngin (transform)
    ↓
AlgoSystem (backtest/live)
    ↓
Strategy execution

← QR writes signal spec here
← QD implements from IP

Understanding this flow is critical. You will describe your data requirements in Section 2a. The QD team will use those requirements to build the pipeline. But you need to understand the constraints at each step.

Databento

What it is: A market data provider. Tick data, OHLCV (open, high, low, close, volume), order book snapshots across equities, futures, options, FX.

Coverage: US equities, US futures (CME, CBOT, COMEX), cryptocurrencies, FX futures. International coverage limited.

Cost: Databento charges per data series per month. Novel data sources require approval before you spend time on them.

Frequency: Tick-level to daily OHLCV. You request the granularity you need.

For QR: Know what you're asking for before Section 2b. If you want daily closing prices for the last 10 years, that's cheap and trivial. If you want 1-minute bars for 500 stocks for 5 years, that's expensive and requires approval.
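To see why the two requests above differ so much, a back-of-envelope row count helps. This is only a rough sketch: the 252 trading days per year and 390 one-minute bars per session are assumptions, and row count is a proxy for cost, not Databento's actual pricing.

```python
# Back-of-envelope row counts for two data requests.
# Assumptions (not from the course material): 252 trading days per year and
# 390 one-minute bars per regular US equity session (09:30-16:00 ET).
TRADING_DAYS_PER_YEAR = 252
MINUTES_PER_SESSION = 390


def estimate_rows(n_symbols: int, years: int, bars_per_day: int) -> int:
    """Approximate number of bar rows a request would return."""
    return n_symbols * years * TRADING_DAYS_PER_YEAR * bars_per_day


daily_rows = estimate_rows(n_symbols=1, years=10, bars_per_day=1)
minute_rows = estimate_rows(n_symbols=500, years=5, bars_per_day=MINUTES_PER_SESSION)

print(f"daily, 1 symbol, 10y:   {daily_rows:,} rows")
print(f"1-min, 500 symbols, 5y: {minute_rows:,} rows")
print(f"ratio: {minute_rows / daily_rows:,.0f}x")
```

Running a quick estimate like this before drafting Section 2b makes it obvious which requests need an approval conversation.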

TimescaleDB

What it is: A PostgreSQL extension optimized for time-series data. Time-indexed tables (hypertables), automatic partitioning by time, continuous aggregates (pre-computed, incrementally refreshed time-bucket summaries).

How it works: Raw tick data from Databento is ingested into TimescaleDB. Millions of rows a day. Partitioned by time so queries on recent data are fast.

Continuous aggregates: You can define a continuous aggregate that computes, say, daily OHLCV from tick data. Once the view and its refresh policy are defined, TimescaleDB keeps the result up to date, so no separate aggregation job is needed.
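To make the daily-OHLCV idea concrete, here is the same bucketing done in pandas on a handful of made-up ticks. It illustrates what the aggregate computes, not how TimescaleDB computes it; the column names `price` and `size` are hypothetical.

```python
import pandas as pd

# Hypothetical tick data: timestamps, trade prices, trade sizes.
ticks = pd.DataFrame(
    {
        "price": [100.0, 101.5, 99.5, 100.5, 102.0, 103.0],
        "size": [10, 5, 20, 15, 10, 30],
    },
    index=pd.to_datetime(
        [
            "2024-01-02 09:30:00", "2024-01-02 12:00:00", "2024-01-02 15:59:00",
            "2024-01-03 09:30:00", "2024-01-03 12:00:00", "2024-01-03 15:59:00",
        ]
    ),
)

# Bucket ticks by calendar day: first/max/min/last price and summed volume.
daily = ticks["price"].resample("1D").ohlc()
daily["volume"] = ticks["size"].resample("1D").sum()
print(daily)
```

In the database the equivalent result is stored and refreshed incrementally, which is why queries against it stay fast as tick history grows.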

For QR: You don't write the production queries. You define your data requirements in Section 2a ("daily OHLCV for corn futures"), and QD writes the queries. But knowing the data lives in TimescaleDB helps you understand data availability and freshness.

data-ngin (data engineering pipeline)

What it is: Internal cleaning and feature engineering layer. Handles corporate actions (stock splits, dividends), outlier detection, missing value handling, and feature construction.

Corporate actions: When a stock splits 2-for-1, raw prices from before and after the split are not directly comparable. data-ngin adjusts for this using point-in-time adjustments (historical prices adjusted forward, not backward, to avoid look-ahead bias).
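The sketch below shows one reading of a forward, point-in-time split adjustment on made-up prices. It is an illustration only, not data-ngin's actual implementation; the `split_ratio` column and the price values are hypothetical.

```python
import pandas as pd

# Hypothetical raw closes around a 2-for-1 split with ex-date 2024-01-04.
raw = pd.DataFrame(
    {
        "close": [100.0, 102.0, 51.5, 52.0],
        # New shares per old share on the ex-date, 1.0 on every other day.
        "split_ratio": [1.0, 1.0, 2.0, 1.0],
    },
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"]),
)

# Naive return treats the split as a crash of roughly 50%.
naive_ret = raw["close"].pct_change()

# Point-in-time return: scale today's close by today's split ratio before
# comparing it with yesterday's close. Only data known on day t is used.
adj_ret = raw["close"] * raw["split_ratio"] / raw["close"].shift(1) - 1

# Forward-adjusted price: start from the first raw close and compound forward.
adj_close = raw["close"].iloc[0] * (1 + adj_ret.fillna(0)).cumprod()
print(pd.DataFrame({"naive_ret": naive_ret, "adj_ret": adj_ret, "adj_close": adj_close}))
```

Because earlier values are never rewritten when a later split occurs, a backtest run on any past date sees the same history it would have seen at the time.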

Outliers: Exchange halts, data errors, gaps. data-ngin flags and handles these.

Features: When you describe your signal as "5-day rolling z-score," data-ngin is what computes it. You specify the formula in your IP; QD implements it in data-ngin.

For QR: In Section 2a, you describe the features you need with formulas. In Section 3a, you use those features in your signal. data-ngin makes them available.

The QD Approval Gate (Section 2b)

This is a critical checkpoint. You cannot move forward without QD sign-off.

Why it exists: Too many QR analysts have written strategies on data that doesn't exist. They spent weeks developing on a dataset that was never subscribed to, or that costs $500k/month, or that isn't available at the frequency needed, or that has only 3 years of history (too short for backtesting).

What Section 2b must contain:

The exact data series needed (asset name, exchange, frequency, date range)
The cleaning steps and transformations (split adjustments, outlier treatment)
Any external or alternative data (satellite, social, etc.) with licensing status
Written confirmation from the Head of Data Engineering that the data is acquirable, ingestible at the required frequency, and fits within budget/subscriptions

What happens: You submit your IP with Section 2 (both 2a and 2b). QT reviews strategy fit. QD reviews data feasibility and signs off in 2b. If 2b is not signed, the IP is rejected, full stop. This is not punitive — it's a forcing function to catch infeasible research early.

Python data access patterns

Pulling OHLCV from TimescaleDB

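import os

import pandas as pd
import psycopg2

# Read the password from the environment instead of hard-coding it.
# TSDB_PASSWORD is an illustrative variable name.
conn = psycopg2.connect(
    host="timescaledb.internal",
    dbname="algogators",
    user="qr_user",
    password=os.environ["TSDB_PASSWORD"],
)

# BETWEEN is inclusive on both ends. Dates are passed as parameters
# rather than formatted into the SQL string.
query = """
    SELECT timestamp, open, high, low, close, volume
    FROM corn_futures
    WHERE timestamp BETWEEN %(start)s AND %(end)s
    ORDER BY timestamp
"""

# pandas officially supports SQLAlchemy connectables; a raw psycopg2
# connection works but emits a UserWarning.
try:
    df = pd.read_sql(
        query, conn,
        params={"start": "2020-01-01", "end": "2024-01-01"},
        parse_dates=['timestamp'], index_col='timestamp',
    )
finally:
    conn.close()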

Feature construction: rolling z-score

def rolling_zscore(series, window=20):
    """Z-score of each value relative to its trailing rolling window."""
    mean = series.rolling(window).mean()
    std = series.rolling(window).std()
    # A flat window has std == 0; return NaN there instead of +/-inf.
    return (series - mean) / std.where(std > 0)

# Normalized signal for entry/exit. The first window-1 rows are NaN.
df['signal'] = rolling_zscore(df['close'], window=60)
entry_threshold = 2.0  # enter when signal < -entry_threshold
df['signal_entry'] = df['signal'] < -entry_threshold

Handling corporate actions (splits/dividends)

# data-ngin handles this automatically.
# You specify the adjustment method in Section 2a, for example:
# "Prices adjusted for splits/dividends using the point-in-time
#  (forward-adjusted) method. Adjustment factors applied at ex-date."
# QD implements it; you use the adjusted prices in your signal.

NasaPowerCouncil data pipeline (example)

Raw data: NASA POWER API. Daily solar radiation (ALLSKY_SFC_SW_DWN), temperature (T2M), precipitation (PRECTOTCORR) at lat/lon grid.

Geographic mapping: Corn Belt counties (IL, IA, MN, MO, etc.). Map each county's centroid to the nearest lat/lon grid point.

Transformation: Compute growing-degree-day (GDD) deviation from 20-year seasonal average. Daily rolling sums, seasonal adjustment.

Feature: GDD_deviation_30d = 30-day rolling sum of (daily GDD - 20-year seasonal average for that calendar window).
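A minimal sketch of this feature on synthetic temperatures follows. Several things here are assumptions: GDD is simplified to max(T2M - base, 0) using the daily mean, the 10 °C base is a common corn convention rather than something the pipeline specifies, and the seasonal average is taken per day-of-year over the full sample for brevity. A production version would build the seasonal average only from years before each date to avoid look-ahead.

```python
import numpy as np
import pandas as pd

BASE_TEMP_C = 10.0  # assumed corn base temperature; not specified in the source


def gdd_deviation_30d(t2m: pd.Series, window: int = 30) -> pd.Series:
    """Rolling sum of (daily GDD - day-of-year average GDD) over `window` days."""
    gdd = (t2m - BASE_TEMP_C).clip(lower=0.0)
    # Seasonal average: mean GDD for each calendar day across all years.
    # Full-sample mean for brevity; this uses future years (look-ahead).
    climatology = gdd.groupby(gdd.index.dayofyear).transform("mean")
    return (gdd - climatology).rolling(window).sum()


# Synthetic daily mean temperatures standing in for NASA POWER T2M (deg C).
dates = pd.date_range("2001-01-01", "2020-12-31", freq="D")
rng = np.random.default_rng(0)
seasonal = 12.0 - 14.0 * np.cos(2 * np.pi * dates.dayofyear.to_numpy() / 365.25)
t2m = pd.Series(seasonal + rng.normal(0.0, 3.0, len(dates)), index=dates)

feature = gdd_deviation_30d(t2m)
print(feature.dropna().describe())
```

Writing the formula out at this level of detail is what lets QD implement it in data-ngin without guessing.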

Section 2b sign-off: "Data sourced from NASA POWER REST API (public, no licensing cost). Daily resolution, available back to 1981. Ingestible via Python requests → CSV → TimescaleDB. QD: Confirmed."

Part 2 — Writing the Hypothesis (IP §1)

The three components of a strong hypothesis

1. Economic mechanism

Why does this pattern exist? Not "I found it in the data." Why would market participants allow this inefficiency to persist?

Good mechanisms explain:

Behavioral: What bias or constraint causes participants to misbehave? (anchoring, disposition effect, herding, underreaction)
Structural: What market constraint forces suboptimal behavior? (hedging requirements, index rebalancing, roll yield, storage costs)

2. Predicted relationship

What signal predicts what return, in what direction, over what horizon?

Example: "When 30-day rolling soil moisture deviation is more than 1 standard deviation below the 20-year seasonal average, corn futures are expected to have positive returns over the next 5–10 trading days because low soil moisture predicts lower yields and therefore tighter supply, which the market will reprice when USDA yield reports are released."

Specify:

The signal (soil moisture deviation)
The direction (low moisture → lower expected yield → higher returns)
The horizon (5–10 trading days)
The mechanism link (yields → supply → prices)

3. Falsifiability condition

What result would convince you the hypothesis is wrong? If you can't answer this, you don't have a hypothesis.

Example: "This hypothesis is falsified if (a) the Information Coefficient between negated soil moisture deviation and forward corn returns is less than 0.02 (no predictive power in the predicted direction), or (b) the backtest Sharpe on a 2023 hold-out period is less than 0.5 (results don't hold out-of-sample)."

Good falsifiability conditions are:

Testable: You can calculate it from data
Specific: Explicit number, not "seems promising"
Fair: Not unreasonably high (Sharpe 5.0) or low (0.1)
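One way to keep a falsifiability condition honest is to write it down as code before looking at results. The sketch below is illustrative: the function names are hypothetical, the thresholds are the ones used in the example above, and the Sharpe calculation assumes daily returns and a zero risk-free rate.

```python
import numpy as np


def annualized_sharpe(daily_returns, periods_per_year: int = 252) -> float:
    """Mean over sample standard deviation of daily returns, annualized."""
    r = np.asarray(daily_returns, dtype=float)
    return float(r.mean() / r.std(ddof=1) * np.sqrt(periods_per_year))


def is_falsified(ic: float, holdout_sharpe: float,
                 min_ic: float = 0.02, min_sharpe: float = 0.5) -> bool:
    """True if either pre-registered rejection condition is met."""
    return ic < min_ic or holdout_sharpe < min_sharpe


# Hypothetical hold-out results.
holdout_returns = [0.01, -0.01, 0.02, 0.0]
sharpe = annualized_sharpe(holdout_returns)
print(is_falsified(ic=0.05, holdout_sharpe=sharpe))
```

Fixing `min_ic` and `min_sharpe` in the IP, before the backtest runs, is what makes the condition testable and specific.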

Hypothesis template

Copy this template into Section 1 of your IP. Fill in each section with your hypothesis.

HYPOTHESIS TEMPLATE

Economic mechanism:

[Market participants systematically _____ because _____, which causes prices to _____ .]

Predicted relationship:

[When [signal] is [high/low/rising/falling], [asset] returns are expected to be [positive/negative] over [horizon], because [mechanism above].]

Falsifiability:

[This hypothesis would be rejected if [specific quantitative condition — e.g., Information Coefficient < 0.02, or Sharpe < 0.5 on hold-out period, or directional accuracy < 55% on out-of-sample data].]

Hypothesis types with examples

Type 1: Behavioral inefficiency

Example: Post-earnings drift (momentum)

Economic mechanism: Investors systematically underreact to earnings news. The market reprices gradually over weeks or months, not immediately at announcement. Short-selling constraints and institutional limitations on volatility exposure prevent arbitrageurs from eliminating this drift immediately.

Predicted relationship: When earnings surprise is large and positive (actual EPS > consensus EPS by more than 1 standard deviation), the stock is expected to have positive abnormal returns for 3–6 months post-announcement.

Falsifiability: This hypothesis is rejected if the average post-announcement drift is not significantly different from zero (t-stat < 1.96) over a 3-month hold period, or if the drift reverses (negative returns) in the first week after announcement.
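The t-stat in this condition is the one-sample statistic for "mean drift equals zero". A minimal sketch, using made-up drift numbers:

```python
import math
import statistics


def t_stat(samples: list[float]) -> float:
    """One-sample t-statistic for the null hypothesis that the mean is zero."""
    n = len(samples)
    standard_error = statistics.stdev(samples) / math.sqrt(n)
    return statistics.mean(samples) / standard_error


# Hypothetical 3-month abnormal returns following positive earnings surprises.
drifts = [0.04, 0.01, -0.02, 0.03, 0.05, 0.00, 0.02, 0.03]
t = t_stat(drifts)
print(f"t-stat = {t:.2f}, exceeds 1.96: {t > 1.96}")
```

Note that 1.96 is the large-sample cutoff. With only a handful of events the t-distribution's critical value is higher (about 2.36 for 7 degrees of freedom), so a real test needs a sample large enough for the 1.96 threshold to be appropriate.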

Type 2: Structural inefficiency

Example: Commodity roll yield (contango/backwardation)

Economic mechanism: Commodity futures contracts are rolled before expiry. When the term structure is in backwardation (near-month contract trading at a premium to far-month), rolling captures positive yield. This is not an arbitrage — it reflects the physical convenience value of holding inventory. Hedgers (producers and consumers) are willing to pay this premium because holding physical commodity has value. The premium is persistent because the underlying convenience value doesn't disappear.

Predicted relationship: When the front/second-month spread is in backwardation (front price > second-month price), a rolling long position in the front-month contract is expected to earn positive roll yield equal to the spread minus the cost of financing. Expected return is positive.

Falsifiability: This hypothesis is rejected if average roll yield is zero or negative over a 5-year period, or if roll yield does not persist after transaction costs.
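To test this condition you need a roll yield number. Conventions vary (log vs. simple ratio, which contract goes in the denominator); the sketch below uses one simple convention and made-up prices, and ignores financing and transaction costs.

```python
def annualized_roll_yield(front: float, second: float, days_between: int) -> float:
    """(front - second) / second, scaled from the expiry gap to one year.

    Positive in backwardation (front > second), negative in contango.
    """
    return (front - second) / second * (365.0 / days_between)


# Hypothetical prices: front month 460, second month 450, expiries 60 days apart.
backwardation = annualized_roll_yield(front=460.0, second=450.0, days_between=60)
contango = annualized_roll_yield(front=450.0, second=460.0, days_between=60)
print(f"backwardation: {backwardation:.2%}, contango: {contango:.2%}")
```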

Type 3: Alternative data (information latency)

Example: Satellite weather data predicting crop yields

Economic mechanism: NASA satellite data provides daily measurements of soil moisture and temperature on a lat/lon grid covering crop regions. These measurements predict crop yields. The market prices crops based on USDA forecasts, which are survey-based and released once per month. There is an information gap: satellite data predicts yields before USDA reports them. The market does not instantaneously incorporate satellite data because (a) the data is not in standard market feeds, (b) processing it requires domain expertise, and (c) few market participants have the access or motivation to use it.

Predicted relationship: When growing-season soil moisture anomalies (satellite-derived deviation from the 20-year seasonal average) are large and negative, corn prices are expected to rise in the 5–10 trading days preceding the next USDA yield report, because lower expected yields imply tighter supply and the market reprices as the report data becomes known.

Falsifiability: This hypothesis is rejected if the Information Coefficient between negated satellite moisture deviation and forward corn returns is < 0.02, or if the strategy has negative Sharpe on a 2023 hold-out period.

Information Coefficient (IC)

IC measures the correlation between your signal and forward returns. It's the core metric for evaluating whether your hypothesis has predictive power before you build the full backtest.

\[ IC = \text{Corr}(\text{signal}_t,\ r_{t+h}) \]

Where signal_t is your signal at time t, r_{t+h} is the return from t to t+h (your chosen horizon).
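A minimal sketch of the calculation on synthetic data follows. The function name and the `rank` option are illustrative; the rank variant (Pearson correlation on ranks, which equals Spearman correlation) is a common robustness choice, not something the formula above requires.

```python
import numpy as np
import pandas as pd


def information_coefficient(signal: pd.Series, close: pd.Series, horizon: int,
                            rank: bool = False) -> float:
    """Correlation between signal_t and the return from t to t+horizon."""
    fwd = close.shift(-horizon) / close - 1.0
    # Drop rows where either the signal or the forward return is missing.
    pair = pd.concat([signal, fwd], axis=1, keys=["signal", "fwd"]).dropna()
    if rank:
        pair = pair.rank()  # Pearson on ranks equals Spearman correlation
    return float(pair["signal"].corr(pair["fwd"]))


# Synthetic prices and a signal that is weakly predictive by construction:
# next-day return plus heavy noise.
rng = np.random.default_rng(42)
rets = pd.Series(rng.normal(0.0, 0.01, 500))
close = 100.0 * (1.0 + rets).cumprod()
signal = rets.shift(-1) + rng.normal(0.0, 0.05, 500)

ic = information_coefficient(signal, close, horizon=1)
print(f"IC = {ic:.3f}")
```

The forward return is indexed at time t, so the last `horizon` rows have no return and are dropped. Aligning it any other way is a common source of look-ahead bias.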

Interpretation:

IC ≈ 0: No predictive power (noise)
IC = 0.05–0.10: Meaningful in quantitative finance. This is the zone for real edges.
IC = 0.20+: Exceptional (and suspicious — check for look-ahead bias)

IC Information Ratio (ICIR):

\[ ICIR = \frac{\overline{IC}}{\sigma_{IC}} \]

Average IC divided by the standard deviation of IC. ICIR > 0.5 suggests a consistently predictive signal.
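As a worked example, assuming you already have one IC estimate per period (the monthly values below are made up):

```python
import pandas as pd


def icir(ic_series: pd.Series) -> float:
    """Mean of per-period ICs divided by their sample standard deviation."""
    return float(ic_series.mean() / ic_series.std(ddof=1))


# Hypothetical monthly IC estimates for one signal.
monthly_ic = pd.Series([0.06, 0.02, 0.08, -0.01, 0.05, 0.04])
print(f"mean IC = {monthly_ic.mean():.3f}, ICIR = {icir(monthly_ic):.2f}")
```

Here the mean IC is 0.04 and the sample standard deviation is about 0.032, giving an ICIR of roughly 1.26. Six periods is far too few to trust; the point is only to show the arithmetic.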

Common mistakes

Data pipeline failures

Skipping Section 2b and assuming data availability. You need written confirmation from QD. Email counts; attachment or reply-all to your IP draft counts. Verbal is not enough.
Requesting data at the wrong frequency. If your signal needs intraday data, a daily request won't work. Be precise: "daily close at 4:00 PM ET", not "daily data."
Not accounting for data latency. Some releases lag: COT data (published Friday for positions as of Tuesday), USDA forecasts (monthly, mid-month), earnings (announced after close). If your backtest uses a value before its release time, you have look-ahead bias.
Writing Section 2a so vaguely that QD can't parse what you need. "Corn price data" is not enough. Need: "Continuous corn futures (ZC, nearest-to-expiry rolled 5 days before contract expiry), daily close, 2015–2024."
Assuming alternative data is cheap. Satellite data, web scraping, proprietary sources cost money. Get pricing in Section 2b before committing 10 weeks to research that costs $50k/month to run.
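The data latency item above can be made concrete with a release-date join. The sketch below uses made-up weekly report values and hypothetical column names; the key idea is to align on the publication date, not the measurement date.

```python
import pandas as pd

# Hypothetical weekly report: positions measured Tuesday, published Friday.
report = pd.DataFrame(
    {
        "as_of": pd.to_datetime(["2024-01-02", "2024-01-09"]),
        "released": pd.to_datetime(["2024-01-05", "2024-01-12"]),
        "net_long": [1200, 1500],
    }
)

# Trading days on which the strategy could act.
prices = pd.DataFrame({"date": pd.bdate_range("2024-01-03", "2024-01-12")})

# Join on the release date so each trading day sees only published values.
aligned = pd.merge_asof(
    prices,
    report.sort_values("released"),
    left_on="date",
    right_on="released",
    direction="backward",
)
print(aligned[["date", "as_of", "net_long"]])
```

Days before the first release get NaN rather than a value that was not yet public. If a report comes out after the close, passing `allow_exact_matches=False` delays its use to the next trading day.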

Hypothesis failures

Writing the hypothesis as a tautology. "When momentum is positive, the stock continues to have positive momentum." This is circular — of course it's true. A real hypothesis explains WHY.
No falsifiability condition. If you can't state what result would make you reject the hypothesis, you don't have a hypothesis. Add a specific quantitative threshold.
Economic mechanism stated too vaguely. "The market is inefficient" doesn't distinguish from data mining. "Investors take an average of 6 weeks to fully price earnings news due to limited attention" is specific enough to test.
Signal direction not specified before testing. "I'll look at momentum and see if it predicts returns" is wrong. "High momentum predicts positive returns over a 1-month horizon" must be specified before you test.
Hypothesis changed after seeing the data. If you test 10 signal variants and then form a hypothesis around the best one, you're p-hacking. The hypothesis must come before the backtest.
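The cost of testing many variants is easy to quantify. Assuming the tests are independent and each uses a 5% significance level (both simplifications), the chance that at least one looks significant by luck grows quickly:

```python
def prob_any_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent null tests passes at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the family-wise rate near alpha."""
    return alpha / n_tests


for n in (1, 10, 50):
    print(f"{n:>2} variants: P(any false positive) = {prob_any_false_positive(n):.0%}, "
          f"Bonferroni per-test alpha = {bonferroni_alpha(n):.4f}")
```

With 10 variants the chance of a spurious "winner" is about 40%, which is why the hypothesis has to be fixed before the backtest.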
