I want to be able to capture all tags in Obsidian front matter, which is YAML. The format is
---
tags:
- recipe
- cooking
---
but note that other data can appear before or after the tags section.
I'm using ICU Regex (technically Siri Shortcuts).
I'm having a devil of a time figuring this out. I've gotten as far as \s*?-\s([_/\w]*)\n, which is easy. But that captures the items of any array, not just tags. (?<=tags:\n)\s*?-\s([_/\w]*)\n(?=[^\s]) or anything similar, in order to only capture tags, fails.
Alternately, (?<=^tags:\n)(.*?)(?=\n[^\s]) captures everything between "tags:" and the next attribute or the end of the YAML, but also captures the spaces and hyphens. What am I doing wrong here?
3 Answers
If you can use \G to chain matches (e.g. PCRE, .NET, Java, Python with the PyPI regex module), this can be done quite easily. \G is an anchor that matches at the position where the previous match ended, or at the start of the string. It is usually used to chain matches to a defined starting point. The often undesired behaviour of also matching at the start can be avoided with a negative lookahead: \G(?!^).
The typical usage is (?:\G(?!^)|start)stuffbetween(capturethis), where start is usually put on the right side of the alternation inside the non-capturing group for efficiency: \G is expected to match more often than the defined starting point.
For your example a simple variant can look like
(?:\G(?!^)|^tags:)\n *- +(.+)
See this demo at regex101 - So how does this work?
- It either matches the literal tags: at the ^ start of a line (in multiline mode), | OR \G continues where a previous match ended (chain matches).
- \n *- + matches the newline that is needed after tags: or after a chained part, followed by any number of spaces, a hyphen and one or more spaces (stuff between).
- Finally, (.+) captures the desired part into the first group (one or more of any character). In PCRE you could drop the capture group and use \K to reset the beginning of the reported match.
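Python's built-in re module has no \G, so the following is only a rough sketch of the same chaining idea using the standard library: find the starting point once, then repeatedly match the "stuff between" pattern anchored at the position where the previous match ended. The sample text and the names TEXT, START, ITEM and chained_tags are assumptions made for illustration, not part of the original pattern.

```python
import re

# Hypothetical sample front matter; "tags" is deliberately not the first key.
TEXT = """---
title: Soup
tags:
  - recipe
  - cooking
aliases:
  - broth
---
"""

START = re.compile(r"^tags:", re.MULTILINE)  # the defined starting point
ITEM = re.compile(r"\n *- +(.+)")            # "stuff between" plus the capture


def chained_tags(text):
    """Collect the list items that directly follow the tags: line."""
    start = START.search(text)
    if start is None:
        return []
    tags, pos = [], start.end()
    while True:
        # Pattern.match(text, pos) only succeeds exactly at pos,
        # which plays the role of \G in this emulation.
        m = ITEM.match(text, pos)
        if m is None:
            break
        tags.append(m.group(1))
        pos = m.end()
    return tags


print(chained_tags(TEXT))
```

The chain stops at the first line that is not a list item, so the items under aliases are not picked up.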
I can propose an algorithmic solution based on 2 steps:
- Create a pattern to capture the content between the tag section
- Take the content and process it as a list of items
The following example is written in Python. You can take the regex patterns from it and adapt them to your application; in Siri Shortcuts you can chain actions and save intermediate results in variables, as the tutorials show.
Solution
# regex YAML example
import re

# Sample input: several YAML documents; only the first one has a tags list.
# Note: '\\' in this non-raw string yields a single backslash.
content = """
---
tags:
  - recipe
  - cooking
  - sea_food
  - food\\fruits
  - spices-pepper
--- # The Smiths
- name: Mary Smith
  age: 27
- [name, age]: [Rae Smith, 4] # sequences as keys are supported
--- # People, by gender
men: [John Smith, Bill Jones]
women:
  - Mary Smith
  - Susan Williams
"""

# First action: capture the content between "tags:" and the next "---".
pattern_tag_content = r'---\ntags:\n(.*?)---'
pattern = re.compile(pattern_tag_content, re.DOTALL)
matches = pattern.findall(content)
tag_content = matches[0]
print("match content : |" + tag_content + "|")

# Second action: extract each list item from the captured content.
pattern_list_content = r'\s+-\s([A-Za-z0-9_\-\\]+)\n?'
pattern = re.compile(pattern_list_content, re.DOTALL)
matches = pattern.findall(tag_content)
for list_item in matches:
    print("match list : |" + list_item + "|")
Output
match content : |  - recipe
  - cooking
  - sea_food
  - food\fruits
  - spices-pepper
|
match list : |recipe|
match list : |cooking|
match list : |sea_food|
match list : |food\fruits|
match list : |spices-pepper|
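Since the question notes that other data can appear before or after the tags section, here is a hedged variant of the same two-step idea that does not require tags to be the first key. It is only a sketch: the sample note and the names NOTE, BLOCK, ITEM and extract_tags are made up for illustration, and it assumes each tag sits on its own "- item" line directly under tags:.

```python
import re

# Hypothetical note; "tags" sits between other keys here.
NOTE = """---
title: Soup
tags:
  - recipe
  - sea_food
aliases:
  - broth
---
# Body
- not a tag
"""

# Step 1: grab the run of "- item" lines that directly follows a "tags:" line.
BLOCK = re.compile(r"^tags:[ \t]*\n((?:[ \t]*-[ \t]+.*\n?)+)", re.MULTILINE)
# Step 2: pull each item out of that block, trimming surrounding spaces.
ITEM = re.compile(r"^[ \t]*-[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def extract_tags(note):
    """Return the items listed under tags:, or an empty list if there are none."""
    block = BLOCK.search(note)
    return ITEM.findall(block.group(1)) if block else []


print(extract_tags(NOTE))
```

Because the first step stops at the first line that is not a list item, the entries under aliases and the Markdown bullet in the body are left out.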
You could employ a YAML processor that can extract YAML front matter, e.g. mikefarah/yq. Use the -f flag to extract it, the .tags[] filter to iterate over the items under the tags key, and the -r flag to output unencoded strings.
Example:
$ cat obsidian.md
---
some:
- unimportant
- items
tags:
- recipe
- cooking
more:
- unimportant
- stuff
---
# Obsidian Markdown Document
Note: This part is not YAML anymore: it's Markdown
$ yq -fr '.tags[]' obsidian.md
recipe
cooking