The following script converts Markdown files into something closer to my aesthetic preferences. It is not meant to be bulletproof, and I don't mind adjusting the output slightly by hand once the script has done the most tedious part of the work for me. The script is for nontechnical texts only, so support for lists, tables, indented text, etc. is not intended.
Typographic substitutions
- Em dashes and en dashes are replaced by two hyphens, and any spaces around them are removed. A single hyphen surrounded by spaces is likewise replaced by two hyphens without spaces. (Lorem—ipsum → Lorem--ipsum, Lorem — ipsum → Lorem--ipsum, Lorem - ipsum → Lorem--ipsum)
- The ellipsis character is replaced by three regular periods. (Lorem… ipsum → Lorem... ipsum)
- Asterisks used for italics are replaced by slashes. If an asterisk is preceded by a backslash, both characters are replaced by a single dagger. (Lorem\* *ipsum* → Lorem† /ipsum/)
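As a quick check of the rules above, here is a minimal sed sketch that exercises each rule on the sample strings. The function name typo_subst is hypothetical, and writing the dashes as an alternation with -E is an assumption made for this illustration, so that the pattern does not depend on how the locale treats multibyte characters inside a bracket expression.

```shell
#!/bin/sh
# typo_subst is a hypothetical helper name; it reads stdin and writes stdout.
typo_subst() {
  sed -E \
    -e 's/ - /--/g' \
    -e 's/ ?(—|–) ?/--/g' \
    -e 's/…/.../g' \
    -e 's/\\\*/†/g' \
    -e 's|\*|/|g'
}

printf '%s\n' 'Lorem — ipsum' 'Lorem… ipsum' 'Lorem\* *ipsum*' | typo_subst
# Lorem--ipsum
# Lorem... ipsum
# Lorem† /ipsum/
```

The escaped asterisk is handled before the plain one, so the footnote marker becomes a dagger rather than a slash.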
Line wrapping and indentation
- The text is wrapped at 72 columns.
- Single blank lines are removed.
- Not implemented yet: If there are consecutive blank lines, they are replaced by a single one.
- Trailing spaces and trailing backslashes are removed.
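The blank-line rules above (a single blank line disappears, a run of two or more collapses to one) could be sketched in awk roughly as follows. This is only an illustration under one reading of those rules: squeeze_blank_lines is a made-up name, whitespace-only lines are treated as blank, and blank lines at the very start or end of the input are simply dropped.

```shell
#!/bin/sh
# squeeze_blank_lines is a hypothetical helper; it reads files or stdin.
squeeze_blank_lines() {
  awk '
    NF == 0 { blanks++; next }               # count blank lines, print nothing yet
    {
      if (blanks >= 2 && printed) print ""   # a run of 2+ blank lines becomes one
      blanks = 0
      printed = 1
      print
    }
  ' "$@"
}

printf 'a\n\nb\n\n\n\nc\n' | squeeze_blank_lines
# a
# b
#
# c
```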
Example
Input:
Lorem - ipsum *dolor* sit\* amet, consectetur adipiscing elit, sed do euismo tempor incididunt ut labore et dolore magna aliqua. Lorem ipsum dolor sit amet.
Lorem ipsum dolor sit amet. \
Lorem ipsum dolor sit amet. \
Lorem ipsum dolor sit amet.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do euismo tempor incididunt ut labore et dolore magna aliqua.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do euismo tempor incididunt ut labore et dolore magna aliqua.
\* The word "sit" consists of three letters.
Desired output:
Lorem--ipsum /dolor/ sit† amet, consectetur adipiscing elit, sed do
euismo tempor incididunt ut labore et dolore magna aliqua. Lorem ipsum
dolor sit amet.
Lorem ipsum dolor sit amet.
Lorem ipsum dolor sit amet.
Lorem ipsum dolor sit amet.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do euismo
tempor incididunt ut labore et dolore magna aliqua.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do euismo
tempor incididunt ut labore et dolore magna aliqua.
† The word "sit" consists of three letters.
The current versions of the script and the output:
sed 's/ - /--/g
s/ \{0,1\}[—–] \{0,1\}/--/g
s/…/.../g
s/\\\*/†/g
s/*/\//g' $1 |
awk 'BEGIN{RS=""; ORS="\n  "}1' |
fold -sw 72 |
sed 's/[\ ]*$//g' > $2
Lorem--ipsum /dolor/ sit† amet, consectetur adipiscing elit, sed do
euismo tempor incididunt ut labore et dolore magna aliqua. Lorem ipsum
dolor sit amet.
Lorem ipsum dolor sit amet.
Lorem ipsum dolor sit amet.
Lorem ipsum dolor sit amet.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do
euismo tempor incididunt ut labore et dolore magna aliqua.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do
euismo tempor incididunt ut labore et dolore magna aliqua.
† The word "sit" consists of three letters.
Things that I haven't solved yet and would appreciate help with:
Missing indent at the very beginning.
Missing indents at the beginning of the 2nd and 3rd lines here:
  Lorem ipsum dolor sit amet.
Lorem ipsum dolor sit amet.
Lorem ipsum dolor sit amet.
Should be:
  Lorem ipsum dolor sit amet.
  Lorem ipsum dolor sit amet.
  Lorem ipsum dolor sit amet.
The actual maximum width of the text is not 72 but 71 characters, because fold is used before removing trailing spaces.
Implementing the things that I mentioned earlier as "Not implemented yet".
How can the script be improved so that its output is as close as possible to the desired output shown above?
Technical note: I use the awk, sed, and fold supplied with macOS.
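The width issue can be reproduced on a tiny input. The sketch below assumes a fold that, with -s, breaks after the last blank that still fits and keeps that blank on the line, which is how GNU fold is understood to behave; the width of 9 and the sample text are arbitrary choices for the illustration.

```shell
#!/bin/sh
# "aaaa bbbb" is exactly 9 columns wide, so it could fit on one line,
# but the blank that follows it lands in column 10 and forces an earlier break.
printf '%s\n' 'aaaa bbbb cc' | fold -sw 9 | sed 's/ *$//'
# With the fold behaviour assumed above this prints:
# aaaa
# bbbb cc
```

A wrapper that measures each candidate line without its trailing separator, for example a word-by-word loop, does not lose that last column.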
- "Things that I haven't solved yet" are outside the scope of Code Review; don't expect reviewers to suggest modifications to what your code actually does. – Toby Speight, Sep 1, 2024 at 9:31
1 Answer
You never need sed when you're using awk. Pipes of sed-or-grep-or-awk to sed-or-grep-or-awk are an antipattern (see https://porkmail.org/era/unix/award#grep), and if fold doesn't do what you want then it's pretty simple to implement the line wrapping functionality you're asking for in awk too.
Using any POSIX awk:
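$ cat ./foldLines.sh
#!/usr/bin/env bash

awk -v maxLen=72 '
    # Strip trailing backslashes and any whitespace that follows them.
    { sub(/\\+[[:space:]]*$/, "") }

    # Non-blank line: apply the substitutions, then wrap it word by word.
    NF {
        gsub(/ - /, "--")
        gsub(/ {0,1}[—–] {0,1}/, "--")
        gsub(/…/, "...")
        gsub(/\\\*/, "†")
        gsub(/\*/, "/")
        gsub(/\\\./, ".")
        sub(/[[:space:]]+$/,"")

        out = "  "    # each paragraph starts with a 2-space indent
        sep = ""
        for ( i=1; i<=NF; i++ ) {
            nextOut = out sep $i
            if ( length(nextOut) > maxLen ) {
                print out
                out = $i
            }
            else {
                out = nextOut
                sep = FS
            }
        }
        print out
        numEmpty = 0
        next
    }

    # Blank line: print one empty line only for a run of two or more.
    ++numEmpty == 2 {
        print ""
    }
' "${@:--}"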
$ ./foldLines.sh file
Lorem--ipsum /dolor/ sit† amet, consectetur adipiscing elit, sed do
euismo tempor incididunt ut labore et dolore magna aliqua. Lorem ipsum
dolor sit amet.
Lorem ipsum dolor sit amet.
Lorem ipsum dolor sit amet.
Lorem ipsum dolor sit amet.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do euismo
tempor incididunt ut labore et dolore magna aliqua.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do euismo
tempor incididunt ut labore et dolore magna aliqua.
† The word "sit" consists of three letters.
- Could you show how to use it so that the input and the output are files? – jsx97, Aug 31, 2024 at 21:19
- It's no different from any other Unix command: ./foldLines.sh < input_file > output_file. I just updated it so you can pass it a file name to use for input if you prefer: ./foldLines.sh input_file > output_file. If you want to specify input and output files as arguments then just change "${@:--}" to "$1" > "$2" and call it as ./foldLines.sh input_file output_file. – Ed Morton, Aug 31, 2024 at 21:41
- Thanks a lot, Ed. From what I see, "\\/" in gsub(/\*/, "\\/") should be changed to "\/": otherwise, *dolor* is converted to \/dolor\/. Though this might be beyond what you intended to help me with, could you show how to get rid of blank lines and to indent each paragraph with 2 spaces? I tried awk 'BEGIN{RS=""; ORS="\n  "}1' for this, but I don't understand how to properly integrate it into your version so that the lines do not exceed column 72 and the first paragraph is indented with 2 spaces as well. – jsx97, Sep 1, 2024 at 1:05
- I've made those changes and showed the script running against your new input now. – Ed Morton, Sep 1, 2024 at 12:31