Write and test schema_external methods
Draft schema_external.py
Success
Write a Pandera DataFrame Model to validate the dataframe returned by extract_external.
Starting from the Open-Meteo API documentation, draft an OpenMeteoSchema DataFrame Model that defines the following for each column in the dataset:
- column data types: Define an appropriate pandas data type for each column. The schema uses pd.DatetimeTZDtype for the date column to ensure timezone-aware datetime handling, and pd.Float32Dtype for numeric columns.
- description: Add a clear, human-readable description that explains what each column represents and its purpose. These descriptions are taken from the Open-Meteo API documentation and help developers understand the structure and meaning of each field.
- metadata-units: Include unit information in the metadata dictionary for each field (e.g., "°C", "%", "km/h", "mm", "cm", "m"). This metadata provides context for interpreting the data and enables automated unit conversion or validation in downstream processing.
- checks: Implement data validation constraints using Pandera's validation features:
  - Temperature fields have realistic min/max bounds (-60°C to 60°C)
  - Percentage fields (humidity, cloud cover) are constrained to 0-100%
  - Wind speed has a realistic maximum (200 km/h)
  - Precipitation and snow measurements are non-negative (≥0)
  - All numeric fields use coerce=True to handle type conversion gracefully. The date column is the exception, since datetime coercion may change the meaning of the data.
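The reason for leaving the date column uncoerced can be illustrated with pandas alone. The sketch below is not part of the project code; it assumes a hypothetical timezone-naive reading that was actually recorded at local time UTC+02:00, and compares stamping it as UTC with converting it properly.

```python
from datetime import timedelta, timezone

import pandas as pd

# Hypothetical tz-naive reading; assume it was recorded at local time UTC+02:00.
naive = pd.Series(pd.to_datetime(["2024-07-01 12:00"]))
local_tz = timezone(timedelta(hours=2))

# Simply stamping the values as UTC keeps the wall-clock time,
# which silently moves the reading to a different instant.
stamped_as_utc = naive.dt.tz_localize("UTC")

# Localizing to the true zone first and converting afterwards
# preserves the actual instant.
converted_to_utc = naive.dt.tz_localize(local_tz).dt.tz_convert("UTC")

shift = stamped_as_utc.iloc[0] - converted_to_utc.iloc[0]
print(shift)  # 0 days 02:00:00
```

Both results have a timezone-aware UTC dtype, so a dtype check alone cannot tell them apart. Requiring the extractor to deliver timezone-aware values, rather than coercing them in the schema, keeps that decision explicit.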
schema_external.py
"""
This Pandera DataFrame model validates the Open-Meteo datasets
extracted from the Open-Meteo web API.
"""
from typing import Annotated

import pandas as pd
import pandera.pandas as pa
from pandera.typing.pandas import Series

MAX_DEG_C = 60.0  # Maximum expected temperature in degrees Celsius
MIN_DEG_C = -60.0  # Minimum expected temperature in degrees Celsius
MAX_WIND_SPEED_KMH = 200.0  # Maximum expected wind speed in km/h


class OpenMeteoSchema(pa.DataFrameModel):
    """
    Schema for Open Meteo weather data.
    """

    # Not coerced: the extractor must deliver timezone-aware UTC timestamps.
    date: Series[Annotated[pd.DatetimeTZDtype, "ns", "UTC"]] = pa.Field(
        description="Date in datetime64[ns, UTC] format.",
    )
    temperature_2m: Series[pd.Float32Dtype] = pa.Field(
        coerce=True,
        description="Air temperature at 2 meters above ground (°C)",
        metadata={"units": "°C"},
        ge=MIN_DEG_C,
        le=MAX_DEG_C,
    )
    relative_humidity_2m: Series[pd.Float32Dtype] = pa.Field(
        coerce=True,
        description="Relative humidity at 2 meters above ground (%)",
        metadata={"units": "%"},
        ge=0.0,
        le=100.0,
    )
    wind_speed_10m: Series[pd.Float32Dtype] = pa.Field(
        coerce=True,
        description="Wind speed at 10 meters above ground level (km/h)",
        metadata={"units": "km/h"},
        ge=0.0,
        le=MAX_WIND_SPEED_KMH,
    )
    cloud_cover: Series[pd.Float32Dtype] = pa.Field(
        coerce=True,
        description="Total cloud cover as an area fraction (%)",
        metadata={"units": "%"},
        ge=0.0,
        le=100.0,
    )
    snowfall: Series[pd.Float32Dtype] = pa.Field(
        coerce=True,
        description=(
            "Snowfall amount of the preceding hour in centimeters. "
            + "For the water equivalent in millimeter, divide by 7. "
            + "E.g. 7 cm snow = 10 mm precipitation water equivalent (cm)"
        ),
        metadata={"units": "cm"},
        ge=0.0,
    )
    snow_depth: Series[pd.Float32Dtype] = pa.Field(
        coerce=True,
        description=(
            "Snow depth on the ground. Snow depth in ERA5-Land tends "
            + "to be overestimated. As the spatial resolution for "
            + "snow depth is limited, please use it with care. (m)"
        ),
        metadata={"units": "m"},
        ge=0.0,
    )
    rain: Series[pd.Float32Dtype] = pa.Field(
        coerce=True,
        description=(
            "Only liquid precipitation of the preceding hour including "
            + "local showers and rain from large scale systems. (mm)"
        ),
        metadata={"units": "mm"},
        ge=0.0,
    )
    apparent_temperature: Series[pd.Float32Dtype] = pa.Field(
        coerce=True,
        description=(
            "Apparent temperature is the perceived feels-like temperature "
            + "combining wind chill factor, relative humidity "
            + "and solar radiation"
        ),
        metadata={"units": "°C"},
        ge=MIN_DEG_C,
        le=MAX_DEG_C,
    )
    dew_point_2m: Series[pd.Float32Dtype] = pa.Field(
        coerce=True,
        description="Dew point temperature at 2 meters above ground (°C)",
        metadata={"units": "°C"},
        ge=MIN_DEG_C,
        le=MAX_DEG_C,
    )
    precipitation: Series[pd.Float32Dtype] = pa.Field(
        coerce=True,
        description=(
            "Total precipitation (rain, showers, snow) sum of the "
            + "preceding hour. Data is stored with a 0.1 mm precision. "
            + "If precipitation data is summed up to monthly sums, "
            + "there might be small inconsistencies with the "
            + "total precipitation amount. (mm)"
        ),
        metadata={"units": "mm"},
        ge=0.0,
    )
Run tests and debug
Run the tests through tox. Alternatively, use the VS Code debugger with a launch.json configuration for pytest and remote data.
Use breakpoints, watch variables, and the debug console to inspect failures, then adjust the test or schema until the code behaves as expected and the tests pass.
Update extract_external to validate schema
Modify extract_external so that it returns a dataframe validated against the OpenMeteoSchema DataFrame Model instead of a generic pd.DataFrame.
Commit and CI
Commit the changes, push to GitHub, and ensure all the continuous integration tests pass. Note: the CI run skips any tests marked with remote-data.