I want to be able to capture all tags in Obsidian front matter, which is YAML. The format is

---
tags:
 - recipe
 - cooking
---

but note that other data can appear before or after the tags section.

I'm using ICU Regex (technically Siri Shortcuts).

I'm having a devil of a time figuring this out. I've gotten as far as \s*?-\s([_/\w]*)\n, which was easy, but that captures the items of any array, not just tags. Restricting it to tags with a lookbehind, as in (?<=tags:\n)\s*?-\s([_/\w]*)\n(?=[^\s]) or anything similar, fails.

Alternately, (?<=^tags:\n)(.*?)(?=\n[^\s]) captures everything between "tags:" and the next attribute or the end of the YAML, but also captures the spaces and hyphens. What am I doing wrong here?

asked Aug 17, 2024 at 0:18
  • Regex isn’t the way to parse this Commented Aug 17, 2024 at 2:43
  • Ah. So regex can't do what I want here? That explains why I'm having so much difficulty. I'll just do it in two steps then. Commented Aug 17, 2024 at 2:46
  • If you can use \G to chain matches e.g. PCRE, Python with PyPI regex you could do something like (?:\G(?!^)|^tags:)\n *- +(.+) Commented Aug 17, 2024 at 8:55
  • That works! Amazing! Would you mind making this an Answer and explaining how it works? I've never understood \G, even though I've read that tutorial page several times. Commented Aug 17, 2024 at 14:21
  • @Calion Sure, I tried to explain a bit how \G works. It's best to play with it on a site such as regex101. Commented Aug 17, 2024 at 18:52

3 Answers

If you can use \G to chain matches (e.g. PCRE, .NET, Java, Python with PyPI regex) it can be done quite easily. \G is an anchor that matches at the position where the previous match ended, or at the start of the string if there has been no previous match. It is usually used to chain matches to a defined starting point. The often undesired behaviour of also matching at the start can be avoided with a negative lookahead: \G(?!^).

The typical usage is (?:\G(?!^)|start)stuffbetween(capturethis), where start is usually put on the right side of the alternation inside the non-capturing group for the simple reason of efficiency: \G is expected to match more often than the defined starting point.

For your example a simple variant can look like

(?:\G(?!^)|^tags:)\n *- +(.+)

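See this demo at regex101. So how does this work? With multiline mode enabled, ^tags: matches the key at the start of a line and serves as the starting point. \n *- +(.+) then matches the line break, any indentation, the hyphen and the spaces after it, and captures the rest of the line as the tag. The next match attempt begins where that match ended, so \G(?!^) succeeds there and the same \n *- +(.+) has to follow immediately. The chain therefore continues through consecutive list items and breaks at the first line that is not a list item, which is why items of other arrays are not captured.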
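
For engines without \G (Python's built-in re module is one), the chaining can be imitated by hand. The following is only a sketch and not part of the regex above: extract_tags is a hypothetical helper name, and the sample note is made up for illustration. It finds ^tags: once and then applies the item pattern with match() at exactly the position where the previous match ended, which is the job \G does in the single-regex version. Unlike the regex, it only looks at the first tags: key.

```python
import re

# Starting point and list-item pattern, taken from the regex in this answer.
START = re.compile(r"^tags:", re.MULTILINE)
ITEM = re.compile(r"\n *- +(.+)")


def extract_tags(text):
    # Hypothetical helper: emulates the \G chain with explicit positions.
    tags = []
    start = START.search(text)
    if start is None:
        return tags
    pos = start.end()
    while True:
        # Pattern.match(text, pos) succeeds only if the match begins exactly
        # at pos, so the chain stays contiguous (the role \G plays).
        item = ITEM.match(text, pos)
        if item is None:
            break
        tags.append(item.group(1))
        pos = item.end()
    return tags


note = """---
some:
 - unimportant
tags:
 - recipe
 - cooking
more:
 - stuff
---
# Title
"""
print(extract_tags(note))  # ['recipe', 'cooking']
```

The loop stops at "more:" because that line does not start with a hyphen, so the items under it are never reached.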

answered Aug 17, 2024 at 18:48

I can propose an algorithmic solution based on 2 steps:

  1. Create a pattern to capture the content of the tags section
  2. Take the content and process it as a list of items

The following example is written in Python. You can take the regex patterns it contains and adapt them to Siri Shortcuts, where, as the tutorials show, you can chain actions and save variables.

Solution

# regex YAML example
import re
content ="""
---
tags:
 - recipe
 - cooking
 - sea_food
 - food\\fruits
 - spices-pepper
--- # The Smiths
- name: Mary Smith
 age: 27
- [name, age]: [Rae Smith, 4] # sequences as keys are supported
--- # People, by gender
men: [John Smith, Bill Jones]
women:
 - Mary Smith
 - Susan Williams
"""
#first action: capture content
pattern_tag_content = r'---\ntags:\n(.*?)---'
pattern = re.compile(pattern_tag_content, re.DOTALL)
matches = pattern.findall(content)
tag_content = matches[0]
print("match content : |"+tag_content+"|")
#second action: list content
patter_list_content = r'\s+-\s([A-Za-z0-9_\-\\]+)\n?'
pattern = re.compile(patter_list_content, re.DOTALL)
matches = pattern.findall(tag_content)
for list_item in matches:
 print("match list : |"+list_item+"|")

Output

 match content : | - recipe
 - cooking
 - sea_food
 - food\fruits
 - spices-pepper
 |
 match list : |recipe|
 match list : |cooking|
 match list : |sea_food|
 match list : |food\fruits|
 match list : |spices-pepper|
answered Aug 17, 2024 at 2:47

5 Comments

So it can't be done in one step. That's what was tripping me up. Thanks!
However—that expression matches any text array, not just tags. In your text example it also matches name, Mary, and Susan.
If only the pattern contained in patter_list_content is used, it captures name, Mary and Susan as well. That is why it is applied only to the text captured by pattern_tag_content.
Maybe the pattern_tag_content pattern isn't working in your application the way it does in Python. The (.*?) group captures everything, but the *? quantifier is non-greedy, so the match stops at the first "---" separator instead of running past it. It may be that your application uses a different syntax for non-greedy matching.
Okay, I see. There are two expressions here; I was using the second where I should have been using the first. But the first expression assumes that tags is the first attribute in the YAML, which is not necessarily the case.
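
One way to address the point in the last comment is to anchor the first step on the tags: key itself rather than on the opening "---". The following is a sketch under assumptions, not the code from the answer: tags_two_step is a hypothetical name, and it assumes block-style lists (one "- item" per line directly under tags:), not the inline [a, b] form.

```python
import re


def tags_two_step(text):
    # Step 1: capture the run of "- item" lines directly under "tags:",
    # wherever that key sits in the front matter.
    block = re.search(
        r"^tags:\n((?:[ \t]*-[ \t]+.*(?:\n|$))+)", text, re.MULTILINE
    )
    if block is None:
        return []
    # Step 2: pull the item text out of each captured line.
    return re.findall(
        r"^[ \t]*-[ \t]+(.+?)[ \t]*$", block.group(1), re.MULTILINE
    )


front_matter = """---
some:
 - unimportant
tags:
 - recipe
 - cooking
more:
 - stuff
---
"""
print(tags_two_step(front_matter))  # ['recipe', 'cooking']
```

Both patterns use only basic constructs (line anchors, groups, a lazy quantifier), so they may carry over to other engines, though that has not been checked here against ICU.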

You could employ a YAML processor that can extract a YAML front matter, e.g. mikefarah/yq. Use the -f flag to extract it, the .tags[] filter to iterate over the items under the tags key, and the -r flag to output unencoded strings.

Example:

$ cat obsidian.md
---
some:
 - unimportant
 - items
tags:
 - recipe
 - cooking
more:
 - unimportant
 - stuff
---
# Obsidian Markdown Document
Note: This part is not YAML anymore: it's Markdown
$ yq -fr '.tags[]' obsidian.md
recipe
cooking
answered Aug 17, 2024 at 19:44
