How to capture all items in one array in YAML?

Question 1

I want to be able to capture all tags in Obsidian front matter, which is YAML. The format is

---
tags:
 - recipe
 - cooking
---

but note that other data can appear before or after the tags section.

I'm using ICU Regex (technically Siri Shortcuts).

I'm having a devil of a time figuring this out. I've gotten as far as \s*?-\s([_/\w]*)\n, which was easy. But that captures any array, not just tags. Adding a lookbehind to capture only tags, as in (?<=tags:\n)\s*?-\s([_/\w]*)\n(?=[^\s]) or anything similar, fails.

Alternately, (?<=^tags:\n)(.*?)(?=\n[^\s]) captures everything between "tags:" and the next attribute or the end of the YAML, but also captures the spaces and hyphens. What am I doing wrong here?

Question 2

Regex isn’t the way to parse this

Question 3

Ah. So regex can't do what I want here? That explains why I'm having so much difficulty. I'll just do it in two steps then.

Question 4

If you can use \G to chain matches (e.g. PCRE, or Python with the PyPI regex module), you could do something like (?:\G(?!^)|^tags:)\n *- +(.+)

Question 5

That works! Amazing! Would you mind making this an Answer and explaining how it works? I've never understood \G, even though I've read that tutorial page several times.

Question 6

@Calion Sure, I tried to explain a bit how \G works. It's best to experiment with it on a site such as regex101.

Question 7

If you can use \G to chain matches (e.g. PCRE, .NET, Java, Python with PyPI regex), it can be done quite easily. \G is an anchor that matches where the previous match ended, or at the start of the string. It is usually used to chain matches to a defined starting point. The often undesired behaviour of also matching at the ^ start can be avoided with a negative lookahead: \G(?!^).

The typical usage is (?:\G(?!^)|start)stuffbetween(capturethis) where start is usually put on the right side of the alternation inside the non-capturing group for the simple reason of efficiency - because \G is supposed to match more often than the defined starting-point.

For your example a simple variant can look like

(?:\G(?!^)|^tags:)\n *- +(.+)

See this demo at regex101 - So how does this work?

(?:\G(?!^)|^tags:) either matches the substring tags: at the ^ start of a line (in multiline mode), or \G continues where the previous match ended (chaining the matches).
\n *- + then requires a newline after tags: or after the previous chained match, followed by any number of spaces, a hyphen, and one or more spaces (the stuff between).
Finally, (.+) captures the desired part into the first group (one or more of any character except a newline). In PCRE you could drop the capture group and use \K to reset the beginning of the reported match.
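
The chaining idea can also be sketched in Python. Note that the standard re module has no \G anchor, so this sketch emulates it instead of using the pattern above verbatim: Pattern.match(text, pos) only succeeds if the match begins exactly at pos, which plays the role of \G. The names START, ITEM and extract_tags and the sample front matter are illustrative assumptions, and unlike the single pattern this version only looks at the first tags: line.

```python
import re

# Stand-in for the "^tags:" starting point of the chain.
START = re.compile(r"^tags:", re.MULTILINE)
# The "stuff between" plus the capture group, taken from the pattern above.
ITEM = re.compile(r"\n *- +(.+)")


def extract_tags(text):
    """Return the list items that directly follow the first 'tags:' line."""
    start = START.search(text)
    if start is None:
        return []
    tags = []
    pos = start.end()
    while True:
        # match() is anchored at pos, so each item must begin exactly
        # where the previous match ended (the role \G plays).
        m = ITEM.match(text, pos)
        if m is None:
            break
        tags.append(m.group(1))
        pos = m.end()
    return tags


front_matter = """---
title: Soup
tags:
 - recipe
 - cooking
other:
 - ignored
---
"""

print(extract_tags(front_matter))  # ['recipe', 'cooking']
```

The chain breaks at other: because that line does not continue with the spaces, hyphen, spaces sequence, so the items under it are never reached.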

Question 8

I can propose an algorithmic solution based on 2 steps:

Create a pattern to capture the content between the tag section
Take the content and process it as a list of items

The following example is written in Python. You can take the regex patterns it contains and adapt them to your application (Siri Shortcuts), where, as the tutorials show, you can chain actions and save variables.

Solution

# regex YAML example
import re
content ="""
---
tags:
 - recipe
 - cooking
 - sea_food
 - food\\fruits
 - spices-pepper
--- # The Smiths
- name: Mary Smith
 age: 27
- [name, age]: [Rae Smith, 4] # sequences as keys are supported
--- # People, by gender
men: [John Smith, Bill Jones]
women:
 - Mary Smith
 - Susan Williams
"""
#first action: capture content
pattern_tag_content = r'---\ntags:\n(.*?)---'
pattern = re.compile(pattern_tag_content, re.DOTALL)
matches = pattern.findall(content)
tag_content = matches[0]
print("match content : |"+tag_content+"|")
#second action: list content
patter_list_content = r'\s+-\s([A-Za-z0-9_\-\\]+)\n?'
pattern = re.compile(patter_list_content, re.DOTALL)
matches = pattern.findall(tag_content)
for list_item in matches:
 print("match list : |"+list_item+"|")

Output

 match content : | - recipe
 - cooking
 - sea_food
 - food\fruits
 - spices-pepper
 |
 match list : |recipe|
 match list : |cooking|
 match list : |sea_food|
 match list : |food\fruits|
 match list : |spices-pepper|

Question 9

So it can't be done in one step. That's what was tripping me up. Thanks!

Question 10

However—that expression matches any text array, not just tags. In your text example it also matches name, Mary, and Susan.

Question 11

If only the pattern in patter_list_content is used, it also captures name, Mary, and Susan. That is why the content is first narrowed down to the tags section with pattern_tag_content.

Question 12

Maybe the pattern_tag_content pattern isn't working in your application the way it does in Python. The (.*?) syntax captures everything, but the *? makes it non-greedy, so the match stops at the first "---" separator instead of running past it. It may be that the non-greedy syntax is different in your application.
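
To illustrate the greedy versus non-greedy difference, here is a minimal Python sketch. The sample text and the names lazy and greedy are made up for the illustration; only the non-greedy pattern comes from the answer.

```python
import re

# Hypothetical sample with more than one "---" separator after the tags.
text = "---\ntags:\n - recipe\n---\nbody\n---\n"

# Non-greedy: (.*?) stops at the first "---" that follows the tags.
lazy = re.search(r"---\ntags:\n(.*?)---", text, re.DOTALL).group(1)

# Greedy: (.*) runs on to the last "---" in the text.
greedy = re.search(r"---\ntags:\n(.*)---", text, re.DOTALL).group(1)

print(repr(lazy))    # ' - recipe\n'
print(repr(greedy))  # ' - recipe\n---\nbody\n'
```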

Question 13

Okay, I see. There are two expressions here; I was using the second where I should have been using the first. But the first expression assumes that tags is the first attribute in the YAML, which is not necessarily the case.
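
A rough sketch of a two-step variant that does not assume tags is the first attribute: it anchors on a line that is exactly tags: anywhere in the text and collects the "- item" lines directly below it. The names TAG_BLOCK, TAG_ITEM and tags_two_step and the sample data are illustrative assumptions, and this has only been reasoned through for Python's re module, not for ICU or Siri Shortcuts.

```python
import re

# Step 1: the run of "- item" lines directly below a line that is exactly "tags:".
TAG_BLOCK = re.compile(r"^tags:\n((?:[ \t]*-[ \t]+.*(?:\n|$))+)", re.MULTILINE)
# Step 2: each individual item inside that run.
TAG_ITEM = re.compile(r"^[ \t]*-[ \t]+(.+)$", re.MULTILINE)


def tags_two_step(text):
    """Return the items of the tags list, wherever it sits in the front matter."""
    block = TAG_BLOCK.search(text)
    if block is None:
        return []
    return [item.strip() for item in TAG_ITEM.findall(block.group(1))]


sample = """---
title: Soup
aliases:
 - broth
tags:
 - recipe
 - cooking
more:
 - stuff
---
"""

print(tags_two_step(sample))  # ['recipe', 'cooking']
```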

Question 14

You could employ a YAML processor that can extract a YAML front matter, e.g. mikefarah/yq. Use the -f flag to extract it, the .tags[] filter to iterate over the items under the tags key, and the -r flag to output unencoded strings.

Example:

$ cat obsidian.md
---
some:
 - unimportant
 - items
tags:
 - recipe
 - cooking
more:
 - unimportant
 - stuff
---
# Obsidian Markdown Document
Note: This part is not YAML anymore: it's Markdown
$ yq -fr '.tags[]' obsidian.md
recipe
cooking

Accepted Answer (the \G approach) · bobble bubble · 2024-08-17 18:48:11Z

