DataFrame methods to parse the sky condition from a terminal aerodrome forecast (TAF).
A line in a TAF can report zero to eight cloud layers. Cloud layers are required in predominant lines and optional in temporary ones. Each cloud cover code (SKC|FEW|SCT|BKN|OVC) is associated with an "octave" value (oktas, eighths of sky coverage): 1, 3, 5 and 8 are the minimum sky coverage for reporting a FEW, SCT, BKN and OVC layer respectively.
I struggled to find a pure regex solution to generate the pattern I needed for repeating capture groups, hence the _unpack_setup function.
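For readers wondering why a single pattern is not enough: in Python's re module a quantified group keeps only its last repetition, so one "repeat this layer" group cannot return every layer. A minimal sketch of that behaviour, using the TEMPO line of the example TAF; the names LINE, LAYER, last_only and all_layers are illustrative and not part of the original code.

```python
import re

LINE = "TEMPO 2903/2906 4SM -SHRA FEW010 FEW015 BKN020TCU OVC025"
LAYER = r"(SKC|FEW|SCT|BKN|OVC)(\d{3})?(CB|TCU)?"

# A quantified group is overwritten on every repetition, so after the match
# the cover and base groups only hold the final layer.
repeated = re.search(rf"(?:{LAYER}\s?)+", LINE)
last_only = repeated.groups()

# Scanning with the single-layer pattern yields one tuple per layer instead.
all_layers = [m.groups() for m in re.finditer(LAYER, LINE)]

print(last_only[:2])
print(all_layers)
```

Unrolling the group a fixed number of times, as _unpack_setup does, is one way around this; iterating over matches is another.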
from typing import Iterable
import re

import pandas as pd
import numpy as np

TAF = """
KGCC 282320Z 2900/2924 09010KT P6SM -SHRA BKN070 OVC250
FM290300 24011KT P6SM OVC040
TEMPO 2903/2906 4SM -SHRA FEW010 FEW015 BKN020TCU OVC025
FM291000 18009KT 3SM -TSRA BR OVC004CB
FM291900 31022G33KT 6SM -SHRA OVC011
"""

OCTAVE_INDEX = pd.Series(
    (np.inf, 1, 3, 5, 8, np.nan), index=["SKC", "FEW", "SCT", "BKN", "OVC", np.nan]
)


def _unpack_setup():
    # one mandatory layer followed by up to seven optional ones
    base = r"(SKC|FEW|SCT|BKN|OVC)(\d{3})?(CB|TCU)?\s?"
    layers = f"(?:{base})?" * 7
    columns = pd.Series(["CloudCover", "CloudBase", "Flags"])
    return (
        re.compile(base + layers, re.VERBOSE),
        pd.concat(columns + str(i) for i in range(1, 9)),
    )


celestial_dome, cloud_columns = _unpack_setup()


def unpack_index(index: pd.Index, *args: str) -> Iterable[pd.Index]:
    for col in args:
        yield index[index.str.contains(col)]


def octave(sky_coverage: pd.Series) -> np.ndarray:
    """octave indexer"""
    return OCTAVE_INDEX[sky_coverage].values


def get_sky_condition():
    """creates sky condition dataframe"""
    series = pd.Series(re.split(r"(?:\s(?=BECMG|TEMPO|FM))", TAF.strip())).str.strip()
    sky_condition: pd.DataFrame = (
        series.str.extract(celestial_dome)
        .set_axis(cloud_columns, axis=1)
        .dropna(axis=1, how="all")
    )
    column_base, column_cover = unpack_index(
        sky_condition.columns, "CloudBase", "CloudCover"
    )
    sky_condition[column_base] = sky_condition[column_base].astype(float) * 100
    sky_condition[column_cover] = sky_condition[column_cover].apply(octave)
    print(sky_condition)


if __name__ == "__main__":
    get_sky_condition()
Results
   CloudCover1  CloudBase1  Flags1  CloudCover2  CloudBase2  CloudCover3  CloudBase3  Flags3  CloudCover4  CloudBase4
0          5.0      7000.0     NaN          8.0     25000.0          NaN         NaN     NaN          NaN         NaN
1          8.0      4000.0     NaN          NaN         NaN          NaN         NaN     NaN          NaN         NaN
2          2.0      1000.0     NaN          2.0      1500.0          5.0      2000.0     TCU          8.0      2500.0
3          8.0       400.0      CB          NaN         NaN          NaN         NaN     NaN          NaN         NaN
4          8.0      1100.0     NaN          NaN         NaN          NaN         NaN     NaN          NaN         NaN
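As an aside on the cover-to-okta lookup: a plain dict passed to Series.map is a possible alternative to indexing a Series that carries NaN as a label. This is only a sketch; MIN_OKTAS and the sample cover Series are hypothetical names, and the values are the ones from OCTAVE_INDEX in the question.

```python
import pandas as pd

# Values copied from OCTAVE_INDEX in the question (SKC maps to infinity there).
MIN_OKTAS = {"SKC": float("inf"), "FEW": 1, "SCT": 3, "BKN": 5, "OVC": 8}

# Illustrative column of cover codes, including a missing layer.
cover = pd.Series(["FEW", "BKN", None, "OVC"])

# Codes that are missing, or not present in the dict, come back as NaN.
oktas = cover.map(MIN_OKTAS)
print(oktas)
```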
2 Answers
At the start of get_sky_condition(), I don't see why you do a .str.strip() when defining series:
series = pd.Series(re.split(r"(?:\s(?=BECMG|TEMPO|FM))", TAF.strip())).str.strip()
I think that this should suffice?
series = pd.Series(re.split(r"(?:\s(?=BECMG|TEMPO|FM))", TAF.strip()))
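Whether the extra strip is redundant depends on the input. A small sketch, assuming two made-up inputs (flush and indented) and a hypothetical `\s+` variant of the split pattern: with flush-left continuation lines the single `\s` consumes the only separator, but with indented lines it leaves trailing whitespace on the previous piece.

```python
import re

SPLIT_ONE = re.compile(r"\s(?=BECMG|TEMPO|FM)")   # pattern from the question
SPLIT_ALL = re.compile(r"\s+(?=BECMG|TEMPO|FM)")  # hypothetical variant

flush = "KGCC 282320Z 2900/2924 09010KT\nFM290300 24011KT\nTEMPO 2903/2906 4SM"
indented = "KGCC 282320Z 2900/2924 09010KT\n  FM290300 24011KT\n  TEMPO 2903/2906 4SM"

flush_parts = SPLIT_ONE.split(flush)      # nothing left to strip
indented_one = SPLIT_ONE.split(indented)  # pieces keep a trailing "\n "
indented_all = SPLIT_ALL.split(indented)  # the whole whitespace run is consumed

print(flush_parts)
print(indented_one)
print(indented_all)
```

So dropping .str.strip() is safe for the TAF shown in the question; for input that may be indented, either keep the strip or split on `\s+`.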
For the regular expression, you could take advantage of named capture groups to avoid having to call .set_axis(cloud_columns, axis=1) to name the columns.
def cloud_layers_re() -> re.Pattern:
    layer_re_fmt = \
        r"(?P<CloudCover{0}>SKC|FEW|SCT|BKN|OVC)" \
        r"(?P<CloudBase{0}>\d{{3}})?" \
        r"(?P<Flags{0}>CB|TCU)?"
    return re.compile(
        layer_re_fmt.format(1) +
        "".join(r"(?:\s+" + layer_re_fmt.format(i) + ")?" for i in range(2, 9))
    )
⋮
def get_sky_condition():
    """creates sky condition dataframe"""
    series = pd.Series(re.split(r"(?:\s(?=BECMG|TEMPO|FM))", TAF.strip()))
    sky_condition: pd.DataFrame = (
        series.str.extract(cloud_layers_re())
        .dropna(axis=1, how="all")
    )
    ⋮
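To make the effect of the named groups concrete, here is a reduced sketch with only two layers. LAYER_FMT, TWO_LAYERS and the sample lines are illustrative; the pattern is passed to str.extract as a plain string, which is an assumption of this sketch rather than something the answer requires.

```python
import pandas as pd

LAYER_FMT = (
    r"(?P<CloudCover{0}>SKC|FEW|SCT|BKN|OVC)"
    r"(?P<CloudBase{0}>\d{{3}})?"
    r"(?P<Flags{0}>CB|TCU)?"
)
# First layer is mandatory, the second is optional.
TWO_LAYERS = LAYER_FMT.format(1) + r"(?:\s+" + LAYER_FMT.format(2) + r")?"

lines = pd.Series(["P6SM -SHRA BKN070 OVC250", "3SM -TSRA BR OVC004CB"])

# Group names become column names, so no set_axis call is needed.
layers = lines.str.extract(TWO_LAYERS)
print(layers.columns.tolist())
print(layers)
```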
Since get_sky_condition() is named like a getter function, I'd expect it to return its result rather than print it.
You've landed in trouble with your indices again. I think the shape of your dataframe significantly mischaracterises what your data are actually saying:
- Per station,
- per station observation time,
- per time group, there is some weather.
In addition to the above, per altitude, there are some clouds.
Whenever you say "per", there should be a MultiIndex level. Do not write CloudCover1, CloudCover2 etc. columns. A two-stage extract can do this for you. There will be two separate dataframes because there are two different cardinalities. Said another way, the number of visibility measurements is very different from the number of cloud measurements, and to mash them into the same dataframe does not make sense and is de-normalised, in database speak. The two separate dataframes will have some common index levels.
Suggested
import re
import pandas as pd
# Based on https://aviationweather.gov/taf/decoder#Forecast
TAF_PATTERN = re.compile(
    r'''(?x)  # verbose
    ^\s*  # beginning, strip whitespace
    (?P<group>[A-Z]+)?  # time group kind, greedy, optional
    (?:  # non-capture: separator between group name and time
        (?<!FM)\s+  # spaces for every group except FM
    )?
    (?P<time>\d\S+)  # group time, starting with any digit, greedy, mandatory
    (?:  # non-capture: wind speed with separator, optional
        \s+  # at least one separator space
        (?P<wind>\S*KT)  # anything followed by knots, greedy
    )?
    (?:  # non-capture: visibility with separator, optional
        \s+  # at least one separator space
        (?P<vis>  # visibility
            P?  # "more than"
            \d+  # distance figure
            (?:SM)?  # unit: 'statute miles' or implied metres
        )
    )?
    (?:  # non-capture: weather with separator, optional
        \s+  # at least one separator space
        (?P<weather>
            (?:\+|-|VC)?  # intensity or proximity
            (?:  # weather fragments, mandatory, greedy
                \s*  # any spaces between weather fragments
                (?:
                    MI|BC|DR|BL|SH|TS|FZ|PR|  # Qualifier descriptor
                    DZ|RA|SN|SG|IC|PL|GR|GS|UP|  # Precipitation
                    BR|FG|FU|DU|SA|HZ|PY|VA|  # Obscuration
                    PO|SQ|FC|\+FC|SS|DS  # "Other"
                )
            )+
        )
    )?
    (?P<clouds>  # cloud measurements, optional
        (?:  # non-capture: clouds, mandatory, greedy, multiple included
            \s+  # at least one separator space
            (?:  # cloud density measured in oktas (eighths)
                VV|NSC|SKC|NCD|CLR|FEW|SCT|BKN|OVC
            )
            \d*  # observation altitude in hundreds of feet
            (?:CB|TCU)?  # clouds, optional, cumulonimbus or towering cumulus
        )+
    )?
    # Don't specify the rest, and don't match on the end. This may exclude
    # wind shear, probability, etc.
    '''
)

CLOUD_PATTERN = re.compile(
    r'''(?x)  # verbose
    (?P<density>  # cloud density measured in oktas (eighths)
        VV|NSC|SKC|NCD|CLR|FEW|SCT|BKN|OVC
    )
    (?P<altitude>  # observation altitude in hundreds of feet, greedy, optional
        \d+
    )?
    (?P<kind>  # cloud kind, cumulonimbus or towering cumulus, optional
        CB|TCU
    )?
    '''
)
def get_sky_condition(taf: str) -> tuple[
    pd.DataFrame,  # Groups
    pd.DataFrame,  # Clouds
]:
    station, origin_time, body = taf.split(maxsplit=2)
    lines = pd.Series(body.splitlines())
    df: pd.DataFrame = lines.str.extract(TAF_PATTERN)
    df['station'] = station
    df['origin_time'] = origin_time
    df.set_index(['station', 'origin_time', 'group', 'time'], inplace=True)
    clouds: pd.DataFrame = df.clouds.str.extractall(CLOUD_PATTERN)
    clouds['altitude'] = clouds.altitude.astype(int) * 100
    clouds = clouds.droplevel('match').set_index('altitude', append=True)
    df.drop(columns=['clouds'], inplace=True)
    return df, clouds


def test() -> None:
    taf = """
KGCC 282320Z 2900/2924 09010KT P6SM -SHRA BKN070 OVC250
FM290300 24011KT P6SM OVC040
TEMPO 2903/2906 4SM -SHRA FEW010 FEW015 BKN020TCU OVC025
FM291000 18009KT 3SM -TSRA BR OVC004CB
FM291900 31022G33KT 6SM -SHRA OVC011
TEMPO 2903/2906 5000 TSRA
"""
    group_df, cloud_df = get_sky_condition(taf)
    print('Groups:')
    print(group_df)
    print('Clouds:')
    print(cloud_df)


if __name__ == "__main__":
    test()
Output
Groups:
                                           wind   vis   weather
station origin_time group time
KGCC    282320Z     NaN   2900/2924     09010KT  P6SM     -SHRA
                    FM    290300        24011KT  P6SM       NaN
                    TEMPO 2903/2906         NaN   4SM     -SHRA
                    FM    291000        18009KT   3SM  -TSRA BR
                          291900     31022G33KT   6SM     -SHRA
                    TEMPO 2903/2906         NaN  5000      TSRA
Clouds:
                                              density  kind
station origin_time group time      altitude
KGCC    282320Z     NaN   2900/2924 7000          BKN   NaN
                                    25000         OVC   NaN
                    FM    290300    4000          OVC   NaN
                    TEMPO 2903/2906 1000          FEW   NaN
                                    1500          FEW   NaN
                                    2000          BKN   TCU
                                    2500          OVC   NaN
                    FM    291000    400           OVC    CB
                          291900    1100          OVC   NaN
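Because the two frames share index levels, a per-group summary of the clouds lines up with the groups frame without any reshaping. A minimal sketch on hand-built stand-ins (groups, clouds, ceiling and summary are illustrative names, and the station and origin_time levels are left out for brevity). It takes the ceiling to be the lowest BKN or OVC layer and ignores VV, which is a simplification.

```python
import pandas as pd

# Stand-in for the per-group frame, indexed by the shared levels.
groups = pd.DataFrame(
    {"group": ["FM", "TEMPO"], "time": ["290300", "2903/2906"], "vis": ["P6SM", "4SM"]}
).set_index(["group", "time"])

# Stand-in for the per-layer frame: same levels plus altitude.
clouds = pd.DataFrame(
    {
        "group": ["FM", "TEMPO", "TEMPO", "TEMPO", "TEMPO"],
        "time": ["290300"] + ["2903/2906"] * 4,
        "altitude": [4000, 1000, 1500, 2000, 2500],
        "density": ["OVC", "FEW", "FEW", "BKN", "OVC"],
    }
).set_index(["group", "time", "altitude"])

# Lowest broken or overcast layer per (group, time).
is_ceiling = clouds["density"].isin(["BKN", "OVC"])
ceiling = (
    clouds[is_ceiling]
    .reset_index("altitude")
    .groupby(level=["group", "time"])["altitude"]
    .min()
    .rename("ceiling")
)

# The common index levels align the summary with the groups frame.
summary = groups.join(ceiling)
print(summary)
```

With the real frames, the NaN in the group level of the first line would need care, since groupby drops NaN keys unless dropna=False is passed.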
- On mobile at the moment. What I'm working towards is representing a TAF as a DataFrame and the observed condition as a Series, converting time strings to time objects and replacing string values with numeric ones, to then get the delta between a forecast and an observed condition, to which I apply an inversely proportional % of the time delta. – Jason Leaver, May 1, 2022 at 20:23
- That's quite fine but I consider it out of scope for the current question. – Reinderien, May 1, 2022 at 20:27
- Understood. I do have a question: why do you prefer to initialize with a DataFrame rather than a Series? – Jason Leaver, May 1, 2022 at 20:57
- Good catch; that should be a Series. – Reinderien, May 1, 2022 at 21:26
TEMPO 2903/2906 5000 TSRA
KT. Otherwise, what is it?
5000 may have been confusing, as the example uses statute miles for visibility and I used a metre example. In the application I'm developing, all values get converted to a standard unit.