Understand the infrastructure that powers research at AlgoGators. Know what data exists before you write Section 2 of your IP.
Every strategy runs on data. The data must exist, be accessible, be affordable, and be ingestible into our systems. This week walks through the AlgoGators data pipeline: what tools ingest data, where it lives, how it flows to strategies, and what the QD approval gate (Section 2b) is checking for.
Understanding this flow is critical. You will describe your data requirements in Section 2a, and the QD team will build the pipeline from those requirements, so you need to understand the constraints at each step.
What it is: Databento, a market data provider. It supplies tick data, OHLCV (open, high, low, close, volume) bars, and order book snapshots across equities, futures, options, and FX.
Coverage: US equities, US futures (CME, CBOT, COMEX), cryptocurrencies, FX futures. International coverage limited.
Cost: Databento charges per data series per month. Novel data sources require approval before you spend time on them.
Frequency: Tick-level to daily OHLCV. You request the granularity you need.
For QR: Know what you're asking for before Section 2b. If you want daily closing prices for the last 10 years, that's cheap and trivial. If you want 1-minute bars for 500 stocks for 5 years, that's expensive and requires approval.
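To see why the two requests above differ so much, it helps to count rows before writing Section 2a. The sketch below is a rough estimate only: the 252 trading days per year, the 390-minute regular session, and the helper name estimate_rows are assumptions made for illustration, and row count is only a proxy for what Databento actually bills.

```python
# Rough row-count estimate for a data request (illustrative assumptions only).
TRADING_DAYS_PER_YEAR = 252   # assumption: typical US trading calendar
MINUTES_PER_SESSION = 390     # assumption: 09:30-16:00 regular session


def estimate_rows(symbols: int, years: int, bars_per_day: int) -> int:
    """Return the approximate number of bars a request would contain."""
    return symbols * years * TRADING_DAYS_PER_YEAR * bars_per_day


# Daily closes for one instrument over 10 years
daily_request = estimate_rows(symbols=1, years=10, bars_per_day=1)

# 1-minute bars for 500 stocks over 5 years
minute_request = estimate_rows(symbols=500, years=5, bars_per_day=MINUTES_PER_SESSION)

print(daily_request)                     # 2520
print(minute_request)                    # 245700000
print(minute_request // daily_request)   # 97500
```

Under these assumptions the second request is roughly five orders of magnitude larger than the first, which is why it goes through approval.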
What it is: TimescaleDB, a PostgreSQL extension optimized for time-series data. It provides time-indexed tables, automatic partitioning, and continuous aggregates (pre-computed, time-bucketed statistics that are kept up to date as new data arrives).
How it works: Raw tick data from Databento is ingested into TimescaleDB. Millions of rows a day. Partitioned by time so queries on recent data are fast.
Continuous aggregates: QD can define a continuous aggregate that automatically computes, say, daily OHLCV from tick data. Once it is defined, TimescaleDB refreshes it as new ticks arrive, so no separate aggregation job has to be written or scheduled.
For QR: You don't write SQL. You define your data requirements in Section 2a ("daily OHLCV for corn futures"), and QD writes the queries. But knowing the data lives in TimescaleDB helps you understand data availability and freshness.
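To make the continuous-aggregate idea concrete, the pandas sketch below computes the same kind of rollup (daily OHLCV from ticks) in memory. It is an analogue only: in production the rollup is defined inside TimescaleDB by QD, and the column names price and size and the sample ticks are made up for illustration.

```python
import pandas as pd

# Hypothetical tick data: three trades on each of two days
ticks = pd.DataFrame(
    {
        "price": [100.0, 101.5, 99.0, 100.5, 102.0, 101.0],
        "size": [10, 5, 20, 15, 10, 5],
    },
    index=pd.to_datetime(
        [
            "2024-01-02 09:30:00", "2024-01-02 12:00:00", "2024-01-02 15:59:00",
            "2024-01-03 09:30:00", "2024-01-03 12:00:00", "2024-01-03 15:59:00",
        ]
    ),
)

# Bucket ticks by calendar day and take first/max/min/last price per bucket
daily = ticks["price"].resample("1D").ohlc()

# Volume is the sum of trade sizes in the same daily buckets
daily["volume"] = ticks["size"].resample("1D").sum()

print(daily)
```

The practical consequence for QR is about freshness: a daily bar can only be as current as the last refresh of the aggregate that produces it.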
What it is: data-ngin, our internal cleaning and feature engineering layer. It handles corporate actions (stock splits, dividends), outlier detection, missing value handling, and feature construction.
Corporate actions: When a stock splits 2-for-1, the raw prices on either side of the split are not comparable: the price halves overnight even though nothing about the company's value changed. data-ngin adjusts for this. It uses point-in-time adjustments (historical prices adjusted forward, not backward, to avoid look-ahead bias).
Outliers: Exchange halts, data errors, gaps. data-ngin flags and handles these.
Features: When you describe your signal as "5-day rolling z-score," data-ngin is what computes it. You specify the formula in your IP; QD implements it in data-ngin.
For QR: In Section 2a, you describe the features you need with formulas. In Section 3a, you use those features in your signal. data-ngin makes them available.
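The corporate-action problem described above can be illustrated with a small example. This is a minimal sketch, not data-ngin's implementation: the prices, the split_ratio series, and the choice to adjust returns rather than price levels are all assumptions made for illustration, and dividends are ignored. It uses only information known on the ex-date.

```python
import pandas as pd

dates = pd.to_datetime(["2024-03-01", "2024-03-04", "2024-03-05", "2024-03-06"])

# Hypothetical raw closes with a 2-for-1 split taking effect on 2024-03-05
raw_close = pd.Series([100.0, 102.0, 51.5, 52.0], index=dates)

# New shares per old share on the ex-date, 1.0 on every other day
split_ratio = pd.Series(1.0, index=dates)
split_ratio.loc[pd.Timestamp("2024-03-05")] = 2.0

# Naive return treats the split as a ~50% crash
raw_return = raw_close.pct_change()

# Adjusted return compares today's value of one old share with yesterday's close
adjusted_return = (raw_close * split_ratio) / raw_close.shift(1) - 1.0

print(pd.DataFrame({"raw": raw_return, "adjusted": adjusted_return}))
```

On the ex-date the raw return is about -49.5%, while the adjusted return is about +1%. On all other days the two series agree. A signal built on the raw series would see a crash that never happened.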
The QD approval gate (Section 2b) is a critical checkpoint. You cannot move forward without QD sign-off.
Why it exists: Too many QR analysts have written strategies on data that doesn't exist. They spent weeks developing on a dataset that was never subscribed to, or that costs $500k/month, or that isn't available at the frequency needed, or that has only 3 years of history (too short for backtesting).
What Section 2b must contain:
What happens: You submit your IP with Section 2 (both 2a and 2b). QT reviews strategy fit. QD reviews data feasibility and signs off in 2b. If 2b is not signed, the IP is rejected, full stop. This is not punitive — it's a forcing function to catch infeasible research early.
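The failure modes listed above (unsubscribed data, excessive cost, wrong frequency, short history) can be written down as a checklist. The sketch below is hypothetical throughout: DataRequest, feasibility_issues, the frequency labels, and the budget and history thresholds are illustrative names and numbers, not QD's actual tooling or policy.

```python
from dataclasses import dataclass

# Frequencies ordered from finest to coarsest (hypothetical labels)
FREQUENCY_ORDER = ["tick", "1min", "1h", "1d"]


@dataclass
class DataRequest:
    name: str
    subscribed: bool
    monthly_cost_usd: float
    finest_available: str      # finest granularity the source offers
    required_frequency: str    # granularity the strategy needs
    history_years: float


def feasibility_issues(req, max_monthly_cost_usd, min_history_years):
    """Return a list of reasons the request would fail a feasibility review."""
    issues = []
    if not req.subscribed:
        issues.append("dataset is not subscribed")
    if req.monthly_cost_usd > max_monthly_cost_usd:
        issues.append("cost exceeds budget")
    # A source is too coarse if its finest bar is coarser than what is required
    if FREQUENCY_ORDER.index(req.finest_available) > FREQUENCY_ORDER.index(req.required_frequency):
        issues.append("required frequency not available")
    if req.history_years < min_history_years:
        issues.append("history too short for backtesting")
    return issues


# Hypothetical request: needs 1-minute bars, but only 3 years of daily data exist
request = DataRequest(
    name="example dataset",
    subscribed=True,
    monthly_cost_usd=200.0,
    finest_available="1d",
    required_frequency="1min",
    history_years=3.0,
)
issues = feasibility_issues(request, max_monthly_cost_usd=1000.0, min_history_years=10.0)
print(issues)
```

Running a check like this against your own Section 2a before submitting is a cheap way to anticipate what QD will ask.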
import pandas as pd
import psycopg2

# Connect to TimescaleDB (the password shown is a placeholder)
conn = psycopg2.connect(
    host="timescaledb.internal",
    dbname="algogators",
    user="qr_user",
    password="***"
)

# Pull OHLCV bars for corn futures over the backtest window
query = """
SELECT timestamp, open, high, low, close, volume
FROM corn_futures
WHERE timestamp BETWEEN '2020-01-01' AND '2024-01-01'
ORDER BY timestamp
"""
df = pd.read_sql(query, conn, parse_dates=['timestamp'], index_col='timestamp')
conn.close()

def rolling_zscore(series, window=20):
    """Z-score normalization using rolling window."""
    mean = series.rolling(window).mean()
    std = series.rolling(window).std()
    # The first window - 1 values are NaN because the window is not yet full
    return (series - mean) / std

# Normalized signal for entry/exit
df['signal'] = rolling_zscore(df['close'], window=60)
entry_threshold = 2.0  # Entry when signal < -2
df['signal_entry'] = df['signal'] < -entry_threshold
# data-ngin handles this automatically
# You specify the adjustment method in Section 2a:
# "Prices adjusted for splits/dividends using point-in-time adjustment.
# Adjustment factors applied at ex-date."
# QD implements it; you use the adjusted prices in your signal.
Raw data: NASA POWER API. Daily solar radiation (ALLSKY_SFC_SW_DWN), temperature (T2M), precipitation (PRECTOTCORR) at lat/lon grid.
Geographic mapping: Corn Belt counties (IL, IA, MN, MO, etc.). Map each county's lat/lon centroid to the nearest grid point.
Transformation: Compute growing-degree-day (GDD) deviation from 20-year seasonal average. Daily rolling sums, seasonal adjustment.
Feature: GDD_deviation_30d = 30-day rolling sum of (daily GDD - 20-year seasonal average for that calendar window).
Section 2b sign-off: "Data sourced from NASA POWER REST API (public, no licensing cost). Daily resolution, available back to 1981. Ingestible via Python requests → CSV → TimescaleDB. QD: Confirmed."
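The GDD_deviation_30d feature defined above can be sketched in pandas as follows. Several details are assumptions, because the example only gives the formula: daily GDD is taken as max(T2M - 10 °C, 0) using the daily mean temperature, the seasonal average is a plain day-of-year mean over a baseline period that ends before the evaluation period (to avoid look-ahead), and the temperatures are synthetic. The function name and the baseline_end parameter are hypothetical.

```python
import pandas as pd

BASE_TEMP_C = 10.0  # assumption: base temperature commonly used for corn GDD


def gdd_deviation_30d(t2m: pd.Series, baseline_end: str) -> pd.Series:
    """30-day rolling sum of (daily GDD - baseline day-of-year average GDD)."""
    # Daily GDD from mean temperature, floored at zero
    gdd = (t2m - BASE_TEMP_C).clip(lower=0.0)

    # Seasonal average per day of year, from the baseline period only
    baseline = gdd.loc[:baseline_end]
    seasonal_avg = baseline.groupby(baseline.index.dayofyear).mean()

    # Evaluate only on dates after the baseline to avoid look-ahead
    current = gdd.loc[gdd.index > pd.Timestamp(baseline_end)]
    expected = pd.Series(current.index.dayofyear, index=current.index).map(seasonal_avg)

    return (current - expected).rolling(30).sum()


# Synthetic data: 20 baseline years at 20 °C, then one year running 2 °C warmer
dates = pd.date_range("2001-01-01", "2021-12-31", freq="D")
t2m = pd.Series(20.0, index=dates)
t2m.loc[dates.year == 2021] = 22.0

feature = gdd_deviation_30d(t2m, baseline_end="2020-12-31")
print(feature.dropna().head())
```

With these synthetic inputs, every day in 2021 accumulates 2 GDD more than the baseline, so the feature settles at 60 once the 30-day window is full. A real implementation would also need to decide how to aggregate grid points across counties, which this sketch leaves out.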