Regex and pandas to read forecast sky condition string

Question 1

DataFrame methods to parse the sky condition from a terminal aerodrome forecast.

A line in a TAF can report zero to eight cloud layers. Cloud layers are required in predominant lines and optional in temporary ones. Each cloud cover code (SKC|FEW|SCT|BKN|OVC) is associated with an okta value (named "octave" in the code below), with 1, 3, 5 and 8 as the minimum sky coverage for reporting a FEW, SCT, BKN or OVC layer.

I struggled to find a pure regex solution to generate the pattern I needed for repeating capture groups, hence the _unpack_setup function.

from typing import Iterable
import re

import pandas as pd
import numpy as np

TAF = """
KGCC 282320Z 2900/2924 09010KT P6SM -SHRA BKN070 OVC250
 FM290300 24011KT P6SM OVC040
 TEMPO 2903/2906 4SM -SHRA FEW010 FEW015 BKN020TCU OVC025
 FM291000 18009KT 3SM -TSRA BR OVC004CB
 FM291900 31022G33KT 6SM -SHRA OVC011
"""

OCTAVE_INDEX = pd.Series(
    (np.inf, 1, 3, 5, 8, np.nan), index=["SKC", "FEW", "SCT", "BKN", "OVC", np.nan]
)


def _unpack_setup():
    base = r"(SKC|FEW|SCT|BKN|OVC)(\d{3})?(CB|TCU)?\s?"
    layers = f"(?:{base})?" * 7
    columns = pd.Series(["CloudCover", "CloudBase", "Flags"])
    return (
        re.compile(base + layers, re.VERBOSE),
        pd.concat(columns + str(i) for i in range(1, 9)),
    )


celestial_dome, cloud_columns = _unpack_setup()


def unpack_index(index: pd.Index, *args: str) -> Iterable[pd.Index]:
    for col in args:
        yield index[index.str.contains(col)]


def octave(sky_coverage: pd.Series) -> np.ndarray:
    """octave indexer"""
    return OCTAVE_INDEX[sky_coverage].values


def get_sky_condition():
    """creates sky condition dataframe"""
    series = pd.Series(re.split(r"(?:\s(?=BECMG|TEMPO|FM))", TAF.strip())).str.strip()
    sky_condition: pd.DataFrame = (
        series.str.extract(celestial_dome)
        .set_axis(cloud_columns, axis=1)
        .dropna(axis=1, how="all")
    )
    column_base, column_cover = unpack_index(
        sky_condition.columns, "CloudBase", "CloudCover"
    )
    sky_condition[column_base] = sky_condition[column_base].astype(float) * 100
    sky_condition[column_cover] = sky_condition[column_cover].apply(octave)
    print(sky_condition)


if __name__ == "__main__":
    get_sky_condition()

results

   CloudCover1  CloudBase1  Flags1  CloudCover2  CloudBase2  CloudCover3  CloudBase3  Flags3  CloudCover4  CloudBase4
0          5.0      7000.0     NaN          8.0     25000.0          NaN         NaN     NaN          NaN         NaN
1          8.0      4000.0     NaN          NaN         NaN          NaN         NaN     NaN          NaN         NaN
2          2.0      1000.0     NaN          2.0      1500.0          5.0      2000.0     TCU          8.0      2500.0
3          8.0       400.0      CB          NaN         NaN          NaN         NaN     NaN          NaN         NaN
4          8.0      1100.0     NaN          NaN         NaN          NaN         NaN     NaN          NaN         NaN

Question 2

Is this sample even TAF-compliant? Your TEMPO line is missing a wind speed.

Question 3

TEMPO lines do not require every parameter. For example, you could have a TEMPO condition of just TEMPO 2903/2906 5000 TSRA.

Question 4

If 5000 is a wind speed, that's missing KT. Otherwise, what is it?

Question 5

Visibility: 5000 meters.

Question 6

Oh, I see where the 5000 may have been confusing: the example uses statute miles for visibility and I used a meter example. In the application I'm developing, all values get converted to a standard unit.
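As a rough illustration of that conversion step (not the poster's actual code), the sketch below normalises whole-number visibility tokens to metres. The helper name `visibility_metres` and the constant are hypothetical, the "P" (more than) prefix is simply discarded, and fractional statute miles such as 1/2SM are not handled.

```python
import re

# Assumption: one statute mile is taken as 1609.344 metres.
METRES_PER_STATUTE_MILE = 1609.344

# Optional "P" prefix, a whole number, and an optional "SM" unit suffix.
_VIS = re.compile(r"^(?P<more>P)?(?P<value>\d+)(?P<unit>SM)?$")


def visibility_metres(token: str) -> float:
    """Convert a whole-number TAF visibility token to metres (hypothetical helper)."""
    match = _VIS.match(token)
    if match is None:
        raise ValueError(f"unrecognised visibility token: {token!r}")
    value = float(match["value"])
    if match["unit"] == "SM":
        # Statute miles need scaling; a bare number is already in metres.
        value *= METRES_PER_STATUTE_MILE
    return value


for token in ("P6SM", "4SM", "5000"):
    print(token, round(visibility_metres(token)))
```

With a helper like this, `4SM` and `5000` end up on the same numeric scale, so the two styles of report can be compared directly.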

Question 7

At the start of get_sky_condition(), I don't see why you do a .str.strip() when defining series:

series = pd.Series(re.split(r"(?:\s(?=BECMG|TEMPO|FM))", TAF.strip())).str.strip()

I think that this should suffice?

series = pd.Series(re.split(r"(?:\s(?=BECMG|TEMPO|FM))", TAF.strip()))

For the regular expression, you could take advantage of named capture groups to avoid having to call .set_axis(cloud_columns, axis=1) to name the columns.

def cloud_layers_re() -> re.Pattern:
    layer_re_fmt = \
        r"(?P<CloudCover{0}>SKC|FEW|SCT|BKN|OVC)" \
        r"(?P<CloudBase{0}>\d{{3}})?" \
        r"(?P<Flags{0}>CB|TCU)?"
    return re.compile(
        layer_re_fmt.format(1) +
        "".join(r"(?:\s+" + layer_re_fmt.format(i) + ")?" for i in range(2, 9))
    )
⋮
def get_sky_condition():
    """creates sky condition dataframe"""
    series = pd.Series(re.split(r"(?:\s(?=BECMG|TEMPO|FM))", TAF.strip()))
    sky_condition: pd.DataFrame = (
        series.str.extract(cloud_layers_re())
        .dropna(axis=1, how="all")
    )
⋮

Since get_sky_condition() is named like a getter function, I'd expect that it returns its result rather than printing it.
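A small sketch can check both points, under simplifying assumptions: the TAF is shortened to three groups, the pattern captures only the first cloud layer, and the names `split_groups` and `first_layer` are made up for this illustration. It compares the extraction with and without `.str.strip()`, and returns the DataFrame instead of printing it.

```python
import re

import pandas as pd

TAF = """
KGCC 282320Z 2900/2924 09010KT P6SM -SHRA BKN070 OVC250
 FM290300 24011KT P6SM OVC040
 TEMPO 2903/2906 4SM -SHRA FEW010 FEW015 BKN020TCU OVC025
"""

# Simplified pattern: only the first cloud layer of each group is captured.
LAYER = r"(?P<cover>SKC|FEW|SCT|BKN|OVC)(?P<base>\d{3})?(?P<flag>CB|TCU)?"


def split_groups(taf: str, strip: bool) -> pd.Series:
    """Split a TAF into change groups, optionally stripping each piece."""
    parts = pd.Series(re.split(r"\s(?=BECMG|TEMPO|FM)", taf.strip()))
    return parts.str.strip() if strip else parts


def first_layer(taf: str, strip: bool = False) -> pd.DataFrame:
    """Return (rather than print) the first cloud layer of each group."""
    return split_groups(taf, strip).str.extract(LAYER)


with_strip = first_layer(TAF, strip=True)
without_strip = first_layer(TAF, strip=False)

# The trailing newline left by the split does not affect what is extracted.
assert with_strip.equals(without_strip)
print(without_strip)
```

Because the function returns its result, the caller decides whether to print it, test it, or process it further.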

Question 8

You've landed in trouble with your indices again. I think the shape of your dataframe significantly mischaracterises what your data are actually saying:

Per station,
per station observation time,
per time group, there is some weather.

In addition to the above, per altitude, there are some clouds.

Whenever you say "per", there should be a MultiIndex level. Do not write CloudCover1, CloudCover2 etc. columns. A two-stage extract can do this for you. There will be two separate dataframes because there are two different cardinalities. Said another way, the number of visibility measurements is very different from the number of cloud measurements, and to mash them into the same dataframe does not make sense and is de-normalised, in database speak. The two separate dataframes will have some common index levels.
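To illustrate why a second extraction stage helps, here is a minimal sketch on two hand-picked group lines. The pattern is a reduced version written for this example only. In the standard-library `re` module a repeated capture group retains only its last match, whereas `Series.str.extractall` yields one row per match and adds a `match` index level.

```python
import re

import pandas as pd

line = "TEMPO 2903/2906 4SM -SHRA FEW010 FEW015 BKN020TCU OVC025"

# A repeated capture group keeps only the final repetition.
repeated = re.search(r"(?:\s(FEW|SCT|BKN|OVC)\d{3}(?:CB|TCU)?)+", line)
print(repeated.group(1))

# extractall returns every match, one row each, with a "match" index level.
layer = r"(?P<density>FEW|SCT|BKN|OVC)(?P<altitude>\d{3})(?P<kind>CB|TCU)?"
lines = pd.Series([line, "FM290300 24011KT P6SM OVC040"])
clouds = lines.str.extractall(layer)
print(clouds)
```

The resulting frame has one row per cloud layer rather than one column per layer, which is the shape the "per altitude" wording calls for.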

Suggested

import re

import pandas as pd

# Based on https://aviationweather.gov/taf/decoder#Forecast
TAF_PATTERN = re.compile(
    r'''(?x)                # verbose
    ^\s*                    # beginning, strip whitespace
    (?P<group>[A-Z]+)?      # time group kind, greedy, optional
    (?:                     # non-capture: separator between group name and time
        (?<!FM)\s+          # spaces for every group except FM
    )?
    (?P<time>\d\S+)         # group time, starting with any digit, greedy, mandatory
    (?:                     # non-capture: wind speed with separator, optional
        \s+                 # at least one separator space
        (?P<wind>\S*KT)     # anything followed by knots, greedy
    )?
    (?:                     # non-capture: visibility with separator, optional
        \s+                 # at least one separator space
        (?P<vis>            # visibility
            P?              # "more than"
            \d+             # distance figure
            (?:SM)?         # unit: 'statute miles' or implied metres
        )
    )?
    (?:                     # non-capture: weather with separator, optional
        \s+                 # at least one separator space
        (?P<weather>
            (?:\+|-|VC)?    # intensity or proximity
            (?:             # weather fragments, mandatory, greedy
                \s*         # any spaces between weather fragments
                (?:
                    MI|BC|DR|BL|SH|TS|FZ|PR|     # Qualifier descriptor
                    DZ|RA|SN|SG|IC|PL|GR|GS|UP|  # Precipitation
                    BR|FG|FU|DU|SA|HZ|PY|VA|     # Obscuration
                    PO|SQ|FC|\+FC|SS|DS          # "Other"
                )
            )+
        )
    )?
    (?P<clouds>             # cloud measurements, optional
        (?:                 # non-capture: clouds, mandatory, greedy, multiple included
            \s+             # at least one separator space
            (?:             # cloud density measured in oktas (eighths)
                VV|NSC|SKC|NCD|CLR|FEW|SCT|BKN|OVC
            )
            \d*             # observation altitude in hundreds of feet
            (?:CB|TCU)?     # clouds, optional, cumulonimbus or towering cumulus
        )+
    )?
    # Don't specify the rest, and don't match on the end. This may exclude
    # wind shear, probability, etc.
    '''
)

CLOUD_PATTERN = re.compile(
    r'''(?x)                # verbose
    (?P<density>            # cloud density measured in oktas (eighths)
        VV|NSC|SKC|NCD|CLR|FEW|SCT|BKN|OVC
    )
    (?P<altitude>           # observation altitude in hundreds of feet, greedy, optional
        \d+
    )?
    (?P<kind>               # cloud kind, cumulonimbus or towering cumulus, optional
        CB|TCU
    )?
    '''
)


def get_sky_condition(taf: str) -> tuple[
    pd.DataFrame,  # Groups
    pd.DataFrame,  # Clouds
]:
    station, origin_time, body = taf.split(maxsplit=2)
    lines = pd.Series(body.splitlines())
    df: pd.DataFrame = lines.str.extract(TAF_PATTERN)
    df['station'] = station
    df['origin_time'] = origin_time
    df.set_index(['station', 'origin_time', 'group', 'time'], inplace=True)
    clouds: pd.DataFrame = df.clouds.str.extractall(CLOUD_PATTERN)
    clouds['altitude'] = clouds.altitude.astype(int) * 100
    clouds = clouds.droplevel('match').set_index('altitude', append=True)
    df.drop(columns=['clouds'], inplace=True)
    return df, clouds


def test() -> None:
    taf = """
KGCC 282320Z 2900/2924 09010KT P6SM -SHRA BKN070 OVC250
 FM290300 24011KT P6SM OVC040
 TEMPO 2903/2906 4SM -SHRA FEW010 FEW015 BKN020TCU OVC025
 FM291000 18009KT 3SM -TSRA BR OVC004CB
 FM291900 31022G33KT 6SM -SHRA OVC011
 TEMPO 2903/2906 5000 TSRA
"""
    group_df, cloud_df = get_sky_condition(taf)
    print('Groups:')
    print(group_df)
    print('Clouds:')
    print(cloud_df)


if __name__ == "__main__":
    test()

Output

Groups:
                                           wind   vis   weather
station origin_time group time
KGCC    282320Z     NaN   2900/2924     09010KT  P6SM     -SHRA
                    FM    290300        24011KT  P6SM       NaN
                    TEMPO 2903/2906         NaN   4SM     -SHRA
                    FM    291000        18009KT   3SM  -TSRA BR
                          291900     31022G33KT   6SM     -SHRA
                    TEMPO 2903/2906         NaN  5000      TSRA
Clouds:
                                              density  kind
station origin_time group time      altitude
KGCC    282320Z     NaN   2900/2924 7000          BKN   NaN
                                    25000         OVC   NaN
                    FM    290300    4000          OVC   NaN
                    TEMPO 2903/2906 1000          FEW   NaN
                                    1500          FEW   NaN
                                    2000          BKN   TCU
                                    2500          OVC   NaN
                    FM    291000    400           OVC    CB
                          291900    1100          OVC   NaN

Question 9

I'm on mobile at the moment. What I’m working towards is representing a TAF as a DataFrame and the observed condition as a Series, converting time strings to time objects and replacing string values with numeric ones. I can then get the delta between a forecast and an observed condition, to which I apply a percentage that is inversely proportional to the time delta.
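One possible sketch of the time-conversion step, not taken from the poster's application: it parses a DDHH/DDHH validity period into UTC datetimes. The function name `parse_validity` is hypothetical, and the year and month (2022, April) are assumptions supplied by the caller, because the TAF text itself carries only day and hour. Month rollover is not handled.

```python
from datetime import datetime, timedelta, timezone


def parse_validity(period: str, year: int, month: int) -> tuple[datetime, datetime]:
    """Parse a 'DDHH/DDHH' validity period into UTC datetimes (hypothetical helper).

    Assumes both ends fall in the given year and month.
    """
    def to_datetime(ddhh: str) -> datetime:
        day, hour = int(ddhh[:2]), int(ddhh[2:])
        # Adding a timedelta lets hour 24 roll over to 00:00 of the next day.
        first_of_month = datetime(year, month, 1, tzinfo=timezone.utc)
        return first_of_month + timedelta(days=day - 1, hours=hour)

    start, end = period.split("/")
    return to_datetime(start), to_datetime(end)


# Assumed year and month, chosen only for the example.
start, end = parse_validity("2900/2924", year=2022, month=4)
print(start.isoformat(), end.isoformat())
print(end - start)
```

Once both forecast and observation times are datetimes, their difference is a `timedelta` that can feed whatever weighting is applied afterwards.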

Question 10

That's quite fine but I consider it out of scope for the current question.

Question 11

Understood. I do have a question: why do you prefer to initialize with a DataFrame rather than a Series?

Question 12

Good catch; that should be a Series.

200_success · Accepted Answer · 2022-04-29 07:16:10Z
