\$\begingroup\$

DataFrame methods to parse the sky condition from a terminal aerodrome forecast.

A line in a TAF can report zero to eight cloud layers. Cloud layers are required in predominant lines and optional in temporary ones. Each cloud cover code (SKC|FEW|SCT|BKN|OVC) is associated with an okta value (called "octave" in the code below): 1, 3, 5, and 8 are the minimum sky coverage for reporting a FEW, SCT, BKN, and OVC layer, respectively.
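As a minimal sketch of that association, the lookup can be written as a plain dictionary. The names `MIN_OKTAS` and `min_oktas` are hypothetical, and mapping SKC to 0 is an assumption made here for illustration; the code in the question uses a different sentinel for SKC.

```python
# Minimum oktas (eighths of sky covered) implied by each cover code.
# SKC -> 0 is an assumption for this sketch, not the question's choice.
MIN_OKTAS = {"SKC": 0, "FEW": 1, "SCT": 3, "BKN": 5, "OVC": 8}


def min_oktas(layer: str) -> int:
    """Return the minimum okta value for a layer token such as 'BKN020TCU'."""
    # The cover code is always the first three characters of the token.
    return MIN_OKTAS[layer[:3]]


print(min_oktas("BKN020TCU"))
```

A dictionary keeps the lookup free of NaN handling; whether that suits the DataFrame workflow depends on how missing layers should be represented.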

I struggled to find a pure regex solution to generate the pattern I needed for repeating capture groups, hence the _unpack_setup function.

from typing import Iterable
import re
import pandas as pd
import numpy as np
TAF = """
KGCC 282320Z 2900/2924 09010KT P6SM -SHRA BKN070 OVC250
 FM290300 24011KT P6SM OVC040
 TEMPO 2903/2906 4SM -SHRA FEW010 FEW015 BKN020TCU OVC025
 FM291000 18009KT 3SM -TSRA BR OVC004CB
 FM291900 31022G33KT 6SM -SHRA OVC011
"""
OCTAVE_INDEX = pd.Series(
    (np.inf, 1, 3, 5, 8, np.nan), index=["SKC", "FEW", "SCT", "BKN", "OVC", np.nan]
)
def _unpack_setup():
    base = r"(SKC|FEW|SCT|BKN|OVC)(\d{3})?(CB|TCU)?\s?"
    layers = f"(?:{base})?" * 7
    columns = pd.Series(["CloudCover", "CloudBase", "Flags"])
    return (
        re.compile(base + layers, re.VERBOSE),
        pd.concat(columns + str(i) for i in range(1, 9)),
    )
celestial_dome, cloud_columns = _unpack_setup()
def unpack_index(index: pd.Index, *args: str) -> Iterable[pd.Index]:
    for col in args:
        yield index[index.str.contains(col)]
def octave(sky_coverage: pd.Series) -> np.ndarray:
    """octave indexer"""
    return OCTAVE_INDEX[sky_coverage].values
def get_sky_condition():
    """creates sky condition dataframe"""
    series = pd.Series(re.split(r"(?:\s(?=BECMG|TEMPO|FM))", TAF.strip())).str.strip()
    sky_condition: pd.DataFrame = (
        series.str.extract(celestial_dome)
        .set_axis(cloud_columns, axis=1)
        .dropna(axis=1, how="all")
    )
    column_base, column_cover = unpack_index(
        sky_condition.columns, "CloudBase", "CloudCover"
    )
    sky_condition[column_base] = sky_condition[column_base].astype(float) * 100
    sky_condition[column_cover] = sky_condition[column_cover].apply(octave)
    print(sky_condition)
if __name__ == "__main__":
    get_sky_condition()

results

   CloudCover1  CloudBase1  Flags1  CloudCover2  CloudBase2  CloudCover3  CloudBase3  Flags3  CloudCover4  CloudBase4
0          5.0      7000.0     NaN          8.0     25000.0          NaN         NaN     NaN          NaN         NaN
1          8.0      4000.0     NaN          NaN         NaN          NaN         NaN     NaN          NaN         NaN
2          2.0      1000.0     NaN          2.0      1500.0          5.0      2000.0     TCU          8.0      2500.0
3          8.0       400.0      CB          NaN         NaN          NaN         NaN     NaN          NaN         NaN
4          8.0      1100.0     NaN          NaN         NaN          NaN         NaN     NaN          NaN         NaN
200_success
asked Apr 29, 2022 at 0:55
\$\endgroup\$
  • \$\begingroup\$ Is this sample even TAF-compliant? Your TEMPO line is missing a wind speed. \$\endgroup\$ Commented Apr 30, 2022 at 16:50
  • \$\begingroup\$ TEMPO lines do not require every parameter. For example, you could have a TEMPO condition of just TEMPO 2903/2906 5000 TSRA \$\endgroup\$ Commented May 1, 2022 at 14:15
  • \$\begingroup\$ If 5000 is a wind speed, that's missing KT. Otherwise, what is it? \$\endgroup\$ Commented May 1, 2022 at 14:33
  • \$\begingroup\$ visibility, 5000 meters \$\endgroup\$ Commented May 1, 2022 at 14:56
  • \$\begingroup\$ Oh, I see where the 5000 may have been confusing: the example uses statute miles for visibility and I used a meter example. In the application I'm developing, all values get converted to a standard unit. \$\endgroup\$ Commented May 1, 2022 at 16:01

2 Answers

\$\begingroup\$

At the start of get_sky_condition(), I don't see why you do a .str.strip() when defining series:

series = pd.Series(re.split(r"(?:\s(?=BECMG|TEMPO|FM))", TAF.strip())).str.strip()

I think that this should suffice?

series = pd.Series(re.split(r"(?:\s(?=BECMG|TEMPO|FM))", TAF.strip()))
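To see what the trailing .str.strip() actually removes, here is a small standard-library check. The two-line `taf` string is an abbreviated sample assumed for illustration, not the full forecast. Because the split consumes only the single whitespace character directly before the keyword, the first piece keeps its trailing newline; since Series.str.extract searches for the first match rather than anchoring at the end, that leftover whitespace should not change the extracted columns.

```python
import re

# Abbreviated two-line sample in the same layout as the TAF in the question.
taf = (
    "KGCC 282320Z 2900/2924 09010KT P6SM -SHRA BKN070 OVC250\n"
    " FM290300 24011KT P6SM OVC040"
)

parts = re.split(r"(?:\s(?=BECMG|TEMPO|FM))", taf.strip())

# The newline before " FM" stays attached to the end of the first piece.
for part in parts:
    print(repr(part))
```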

For the regular expression, you could take advantage of named capture groups to avoid having to call .set_axis(cloud_columns, axis=1) to name the columns.

def cloud_layers_re() -> re.Pattern:
    layer_re_fmt = \
        r"(?P<CloudCover{0}>SKC|FEW|SCT|BKN|OVC)" \
        r"(?P<CloudBase{0}>\d{{3}})?" \
        r"(?P<Flags{0}>CB|TCU)?"
    return re.compile(
        layer_re_fmt.format(1) +
        "".join(r"(?:\s+" + layer_re_fmt.format(i) + ")?" for i in range(2, 9))
    )
⋮
def get_sky_condition():
    """creates sky condition dataframe"""
    series = pd.Series(re.split(r"(?:\s(?=BECMG|TEMPO|FM))", TAF.strip()))
    sky_condition: pd.DataFrame = (
        series.str.extract(cloud_layers_re())
        .dropna(axis=1, how="all")
    )
⋮

Since get_sky_condition() is named like a getter function, I'd expect that it returns its result rather than printing it.
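A minimal sketch of that shape, assuming the TAF text is passed in as an argument and only the first cloud layer per line is of interest. The names `LAYER` and `parse_sky_condition` are hypothetical and not part of the original code; the caller decides whether to print the result.

```python
import re

import pandas as pd

# Single-layer pattern with named groups; the group names become column names.
LAYER = re.compile(r"(?P<cover>SKC|FEW|SCT|BKN|OVC)(?P<base>\d{3})?(?P<flag>CB|TCU)?")


def parse_sky_condition(taf: str) -> pd.DataFrame:
    """Return one row per TAF line holding that line's first cloud layer."""
    lines = pd.Series(re.split(r"\s(?=BECMG|TEMPO|FM)", taf.strip()))
    return lines.str.extract(LAYER)


sample = (
    "KGCC 282320Z 2900/2924 09010KT P6SM -SHRA BKN070 OVC250\n"
    " FM290300 24011KT P6SM OVC040"
)
first_layers = parse_sky_condition(sample)
print(first_layers)
```

Returning the frame also makes the function straightforward to test, since the assertions can inspect columns directly instead of captured output.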

answered Apr 29, 2022 at 7:16
\$\endgroup\$
\$\begingroup\$

You've landed in trouble with your indices again. I think the shape of your dataframe significantly mischaracterises what your data are actually saying:

  • Per station,
  • per station observation time,
  • per time group, there is some weather.

In addition to the above, per altitude, there are some clouds.

Whenever you say "per", there should be a MultiIndex level. Do not write CloudCover1, CloudCover2 etc. columns. A two-stage extract can do this for you. There will be two separate dataframes because there are two different cardinalities. Said another way, the number of visibility measurements is very different from the number of cloud measurements, and to mash them into the same dataframe does not make sense and is de-normalised, in database speak. The two separate dataframes will have some common index levels.
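As a small illustration of the second stage, Series.str.extractall returns one row per match and adds a `match` index level, so repeated cloud layers become rows instead of numbered columns. The two input strings and the group names `cover` and `base` are assumptions made for this sketch.

```python
import pandas as pd

# Each element stands for the cloud portion of one TAF line.
cloud_text = pd.Series(["BKN070 OVC250", "OVC040"])

layers = cloud_text.str.extractall(r"(?P<cover>SKC|FEW|SCT|BKN|OVC)(?P<base>\d{3})")

# Bases are reported in hundreds of feet.
layers["base"] = layers["base"].astype(int) * 100
print(layers)
```

The outer index level still identifies the source line, which is what lets the layer rows be related back to the per-line data.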

Suggested

import re
import pandas as pd
# Based on https://aviationweather.gov/taf/decoder#Forecast
TAF_PATTERN = re.compile(
    r'''(?x)  # verbose
    ^\s*  # beginning, strip whitespace
    (?P<group>[A-Z]+)?  # time group kind, greedy, optional
    (?:  # non-capture: separator between group name and time
        (?<!FM)\s+  # spaces for every group except FM
    )?
    (?P<time>\d\S+)  # group time, starting with any digit, greedy, mandatory
    (?:  # non-capture: wind speed with separator, optional
        \s+  # at least one separator space
        (?P<wind>\S*KT)  # anything followed by knots, greedy
    )?
    (?:  # non-capture: visibility with separator, optional
        \s+  # at least one separator space
        (?P<vis>  # visibility
            P?  # "more than"
            \d+  # distance figure
            (?:SM)?  # unit: 'statute miles' or implied metres
        )
    )?
    (?:  # non-capture: weather with separator, optional
        \s+  # at least one separator space
        (?P<weather>
            (?:\+|-|VC)?  # intensity or proximity
            (?:  # weather fragments, mandatory, greedy
                \s*  # any spaces between weather fragments
                (?:
                    MI|BC|DR|BL|SH|TS|FZ|PR|  # Qualifier descriptor
                    DZ|RA|SN|SG|IC|PL|GR|GS|UP|  # Precipitation
                    BR|FG|FU|DU|SA|HZ|PY|VA|  # Obscuration
                    PO|SQ|FC|\+FC|SS|DS  # "Other"
                )
            )+
        )
    )?
    (?P<clouds>  # cloud measurements, optional
        (?:  # non-capture: clouds, mandatory, greedy, multiple included
            \s+  # at least one separator space
            (?:  # cloud density measured in oktas (eighths)
                VV|NSC|SKC|NCD|CLR|FEW|SCT|BKN|OVC
            )
            \d*  # observation altitude in hundreds of feet
            (?:CB|TCU)?  # clouds, optional, cumulonimbus or towering cumulus
        )+
    )?
    # Don't specify the rest, and don't match on the end. This may exclude
    # wind shear, probability, etc.
    '''
)
CLOUD_PATTERN = re.compile(
    r'''(?x)  # verbose
    (?P<density>  # cloud density measured in oktas (eighths)
        VV|NSC|SKC|NCD|CLR|FEW|SCT|BKN|OVC
    )
    (?P<altitude>  # observation altitude in hundreds of feet, greedy, optional
        \d+
    )?
    (?P<kind>  # cloud kind, cumulonimbus or towering cumulus, optional
        CB|TCU
    )?
    '''
)
def get_sky_condition(taf: str) -> tuple[
    pd.DataFrame,  # Groups
    pd.DataFrame,  # Clouds
]:
    station, origin_time, body = taf.split(maxsplit=2)
    lines = pd.Series(body.splitlines())
    df: pd.DataFrame = lines.str.extract(TAF_PATTERN)
    df['station'] = station
    df['origin_time'] = origin_time
    df.set_index(['station', 'origin_time', 'group', 'time'], inplace=True)
    clouds: pd.DataFrame = df.clouds.str.extractall(CLOUD_PATTERN)
    clouds['altitude'] = clouds.altitude.astype(int) * 100
    clouds = clouds.droplevel('match').set_index('altitude', append=True)
    df.drop(columns=['clouds'], inplace=True)
    return df, clouds
def test() -> None:
    taf = """
KGCC 282320Z 2900/2924 09010KT P6SM -SHRA BKN070 OVC250
 FM290300 24011KT P6SM OVC040
 TEMPO 2903/2906 4SM -SHRA FEW010 FEW015 BKN020TCU OVC025
 FM291000 18009KT 3SM -TSRA BR OVC004CB
 FM291900 31022G33KT 6SM -SHRA OVC011
 TEMPO 2903/2906 5000 TSRA
"""
    group_df, cloud_df = get_sky_condition(taf)
    print('Groups:')
    print(group_df)
    print('Clouds:')
    print(cloud_df)
if __name__ == "__main__":
    test()

Output

Groups:
                                           wind   vis   weather
station origin_time group time
KGCC    282320Z     NaN   2900/2924     09010KT  P6SM     -SHRA
                    FM    290300        24011KT  P6SM       NaN
                    TEMPO 2903/2906         NaN   4SM     -SHRA
                    FM    291000        18009KT   3SM  -TSRA BR
                          291900     31022G33KT   6SM     -SHRA
                    TEMPO 2903/2906         NaN  5000      TSRA
Clouds:
                                              density  kind
station origin_time group time      altitude
KGCC    282320Z     NaN   2900/2924 7000          BKN   NaN
                                    25000         OVC   NaN
                    FM    290300    4000          OVC   NaN
                    TEMPO 2903/2906 1000          FEW   NaN
                                    1500          FEW   NaN
                                    2000          BKN   TCU
                                    2500          OVC   NaN
                    FM    291000    400           OVC    CB
                          291900    1100          OVC   NaN
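Because the two frames share index levels, they can still be combined on demand. The following is a sketch only, using small hand-built frames with a single shared `time` level rather than the full output above; the frame contents and the name `combined` are assumptions for illustration. It resets the indices and uses a plain column merge, which sidesteps any questions about joining MultiIndexes directly.

```python
import pandas as pd

# One row per time group.
groups = pd.DataFrame(
    {"time": ["2900/2924", "290300"], "wind": ["09010KT", "24011KT"]}
).set_index("time")

# One row per cloud layer, keyed by the same time plus the altitude.
clouds = pd.DataFrame(
    {
        "time": ["2900/2924", "2900/2924", "290300"],
        "altitude": [7000, 25000, 4000],
        "density": ["BKN", "OVC", "OVC"],
    }
).set_index(["time", "altitude"])

# Attach each layer's group data by merging on the shared level.
combined = clouds.reset_index().merge(groups.reset_index(), on="time", how="left")
print(combined)
```

Keeping the frames separate and merging only when needed avoids repeating the group-level values in storage while still allowing a flat view for analysis.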
answered May 1, 2022 at 16:58
\$\endgroup\$
  • \$\begingroup\$ On mobile at the moment. What I'm working towards is representing a TAF as a DataFrame and the observed condition as a Series, converting time strings to time objects, and replacing string values with numeric ones. The goal is then to get the delta between a forecast and an observed condition, to which I apply an inversely proportional percentage of the time delta. \$\endgroup\$ Commented May 1, 2022 at 20:23
  • \$\begingroup\$ That's quite fine but I consider it out of scope for the current question. \$\endgroup\$ Commented May 1, 2022 at 20:27
  • \$\begingroup\$ Understood. I do have a question: why do you prefer to initialize with a DataFrame rather than a Series? \$\endgroup\$ Commented May 1, 2022 at 20:57
  • \$\begingroup\$ Good catch; that should be a Series \$\endgroup\$ Commented May 1, 2022 at 21:26
