Draft and test transform methods
Draft transform.py
Success
Write methods to transform the input dataframe into the output dataframe of the ETL process.
The input data is type hinted using the schema_external dataframe model. NOTE: The return data will be type hinted in a later step, after schema.py is defined.
For this example, the data will be converted from °C to °F. The units recorded in the schema metadata determine which columns are converted.
transform.py
"""
Convert the Open Meteo dataset from °C to °F
"""

import pandas as pd
from pandera.typing.pandas import DataFrame

from able_weather.datasets.weather.open_meteo.runner import (
    schema_external,
)


def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a single temperature from °C to °F."""
    return (celsius * 9 / 5) + 32


def transform(data: DataFrame[schema_external.OpenMeteoSchema]) -> pd.DataFrame:
    """
    Convert temperature from Celsius to Fahrenheit in the Open Meteo dataset.

    The input dataframe is modified in place and returned.
    """
    # Get column metadata from the schema to find temperature columns
    col_metadata = (
        (schema_external.OpenMeteoSchema.get_metadata() or {}).get(
            "OpenMeteoSchema", {}
        )
        or {}
    ).get("columns", {}) or {}
    col_units = {
        col: (col_metadata.get(col, {}) or {}).get("units")
        for col in col_metadata.keys()
    }

    # Convert temperature columns from Celsius to Fahrenheit.
    # Iterate over a copy of the column names because columns are renamed in the loop.
    for col in list(data.columns):
        if col in col_units and col_units[col] == "°C":
            data[col] = data[col].apply(celsius_to_fahrenheit)
            data.rename(
                columns={col: col.replace("_deg_c", "_deg_f")},
                inplace=True,
            )
    return data
While writing the transformation, it became apparent that the column names should contain the units to avoid confusion. As such, extract_external.py, schema_external.py, test_extract_external.py, and test_schema_external.py were all updated so that the column names contain units.
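The metadata lookup and the unit-based rename can be tried out on plain dictionaries before touching a dataframe. The following is a minimal sketch, not part of the project: SAMPLE_METADATA is a hypothetical stand-in for the nested dictionary that get_metadata() is assumed to return, the "%" unit for humidity is an assumption, and celsius_columns and rename_map are illustrative helper names.

```python
"""Plain-Python sketch of the metadata-driven column selection."""

# Hypothetical stand-in for the nested dictionary that
# schema_external.OpenMeteoSchema.get_metadata() is assumed to return.
SAMPLE_METADATA = {
    "OpenMeteoSchema": {
        "columns": {
            "date": None,  # a column with no metadata attached
            "temperature_deg_c_2m": {"units": "°C"},
            "relative_humidity_2m": {"units": "%"},  # assumed unit
        }
    }
}


def celsius_columns(metadata: dict, schema_name: str) -> list[str]:
    """Return the names of columns whose metadata units are °C."""
    # Each lookup falls back to an empty dict so missing keys or None
    # values do not raise.
    columns = (metadata.get(schema_name) or {}).get("columns") or {}
    return [
        name
        for name, meta in columns.items()
        if (meta or {}).get("units") == "°C"
    ]


def rename_map(columns: list[str]) -> dict[str, str]:
    """Map each Celsius column name to its Fahrenheit equivalent."""
    return {name: name.replace("_deg_c", "_deg_f") for name in columns}


to_convert = celsius_columns(SAMPLE_METADATA, "OpenMeteoSchema")
print(to_convert)
print(rename_map(to_convert))
```

Building the rename mapping up front would also allow a single rename call after the loop, instead of one rename per converted column.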
Check that the code passes lint and typechecks
Ensure that the previously passing unit tests still pass
Write and run test_transform.py
Success
Write unit tests to ensure the transformation works as intended. This can use test data from the data/tests/ directory, or hard-code simple test data.
Write unit tests to confirm that columns with °C metadata units are converted to °F, their column names change, and other columns are untouched. Use a simple hard-coded dataframe for test data.
test_transform.py
import pandas as pd

from able_weather.datasets.weather.open_meteo.runner import transform


def make_sample_df() -> pd.DataFrame:
    """Create a tiny dataframe with temperatures in Celsius."""
    return pd.DataFrame(
        {
            "date": pd.date_range("2023-01-01", periods=2, freq="h", tz="UTC"),
            "temperature_deg_c_2m": [0.0, 100.0],
            "apparent_temperature_deg_c": [0.0, 10.0],
            "dew_point_temperature_deg_c_2m": [0.0, 5.0],
            "relative_humidity_2m": [100.0, 50.0],
        }
    )


def test_celsius_to_fahrenheit() -> None:
    """Verify basic Celsius→Fahrenheit conversion."""
    assert transform.celsius_to_fahrenheit(0.0) == 32.0
    assert transform.celsius_to_fahrenheit(100.0) == 212.0


def test_transform_temperature_conversion() -> None:
    """Temperature columns should be converted and renamed."""
    df = make_sample_df()
    # Pass a copy because transform() modifies its input in place
    result = transform.transform(df.copy())

    # temperature_deg_c_2m should be converted and renamed
    assert "temperature_deg_f_2m" in result.columns
    assert "temperature_deg_c_2m" not in result.columns
    assert result.loc[0, "temperature_deg_f_2m"] == 32.0
    assert result.loc[1, "temperature_deg_f_2m"] == 212.0

    # Other Celsius columns should also be converted and renamed
    assert result.loc[0, "apparent_temperature_deg_f"] == 32.0
    assert result.loc[1, "apparent_temperature_deg_f"] == 50.0
    assert result.loc[0, "dew_point_temperature_deg_f_2m"] == 32.0
    assert result.loc[1, "dew_point_temperature_deg_f_2m"] == 41.0

    # Relative humidity should remain unchanged
    assert "relative_humidity_2m" in result.columns
    assert result.loc[0, "relative_humidity_2m"] == 100.0
    assert result.loc[1, "relative_humidity_2m"] == 50.0
Then run the tests and debug the code as needed. These unit tests do not require remote data, so --remote-data=any can be omitted.
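The hard-coded values in test_transform.py were chosen so that the converted results are exact floats. If less convenient temperatures are added later, comparing with a tolerance is usually safer than ==. The following is a minimal sketch of that idea using only the standard library; fahrenheit_to_celsius and test_conversion_with_tolerance are hypothetical names that do not exist in the project, and the conversion formula is repeated here only so the example is self-contained.

```python
import math


def celsius_to_fahrenheit(celsius: float) -> float:
    # Same formula as in transform.py, repeated so the sketch runs on its own.
    return (celsius * 9 / 5) + 32


def fahrenheit_to_celsius(fahrenheit: float) -> float:
    # Hypothetical inverse helper, used only for the round-trip check below.
    return (fahrenheit - 32) * 5 / 9


def test_conversion_with_tolerance() -> None:
    # -40 is the point where the two scales agree.
    assert math.isclose(celsius_to_fahrenheit(-40.0), -40.0)
    # 37 °C is 98.6 °F; compare with a tolerance instead of ==.
    assert math.isclose(celsius_to_fahrenheit(37.0), 98.6)
    # A round trip should return the starting value within tolerance.
    # abs_tol is needed because the relative tolerance is useless near zero.
    for value in (-12.3, 0.0, 21.7, 36.6):
        round_trip = fahrenheit_to_celsius(celsius_to_fahrenheit(value))
        assert math.isclose(round_trip, value, abs_tol=1e-9)
```

A test written this way can sit alongside the exact-value tests and be collected by pytest like any other test function.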
Commit and CI
Commit the changes, push to GitHub, and ensure all the continuous integration tests pass. NOTE: The CI tests will skip any tests marked with remote-data.