QZ qz thoughts
a blog from Eli the Bearded

Matched Pairs


In vi (and vim), there's a "motion" command, %, that moves you to an enclosing symbol. In the previous sentence, with the cursor on the "v" of vim, using % will move the cursor to the "(", which is the start of the enclosed sequence. On that "(", the % motion moves to the matching ")".

Out of the box, vi knows the pairs "()", "[]", and "{}". You can change the pairs with the configuration variable matchpairs and people frequently do to add "<>" for XML or HTML work:

set matchpairs=(:),[:],{:},<:>
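
The value is a comma-separated list of open:close entries, and the two halves of an entry have to be different characters. As a rough illustration of that format, here is a small Perl sketch that splits a value and flags entries that don't fit. The check_matchpairs() helper is made up for this example; it is not something vim provides, and it only handles ASCII input as written.

```perl
#!/usr/bin/perl
# Sketch: check that each entry of a matchpairs value has the
# form "open:close". check_matchpairs() is a hypothetical helper
# for illustration only. For non-ASCII pairs the string would
# need to be decoded first so that . matches one character.
use strict;
use warnings;

sub check_matchpairs {
  my ($value) = @_;
  my (@good, @bad);
  for my $entry (split /,/, $value) {
    # one character, a colon, then a different character
    if ($entry =~ /^(.):(.)$/ and $1 ne $2) {
      push @good, $entry;
    } else {
      push @bad, $entry;
    }
  }
  return (\@good, \@bad);
}

# A well-formed value, then one with a comma typed where the
# colon should be.
for my $value ('(:),[:],{:},<:>', '(:),[:],{:},<,>') {
  my ($good, $bad) = check_matchpairs($value);
  print "value: $value\n";
  print "  ok:  @$good\n";
  print "  bad: @$bad\n" if @$bad;
}
```

A comma in place of the colon splits the angle brackets into two one-character entries, which is why both show up as bad.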

But there are a lot more, like quoting angles "«»" and smart quotes. And vim happily accepts UTF-8 characters for each half of a pair. So I could think up some Unicode pairs and stick them in there. Or I could look for all pairs that exist in Unicode.

Here's a stab at doing just that.

First off, we need the list of characters in Unicode. This is surprisingly easy to get. Unicode themselves provide an easy-to-parse list of characters in plain ASCII(!).

$ curl -sO http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
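
Each line of that file is one character, with semicolon-separated fields that start with the code point, the name, and the general category. A minimal Perl sketch of pulling those apart follows. The two sample records are typed in from memory for illustration, so check them against the real file before relying on them.

```perl
#!/usr/bin/perl
# Sketch: read the first three fields of UnicodeData.txt-style
# records. The sample records are reproduced by hand and are
# assumptions, not read from the downloaded file.
use strict;
use warnings;

binmode(STDOUT, ':utf8');

my @sample = (
  '0028;LEFT PARENTHESIS;Ps;0;ON;;;;;Y;OPENING PARENTHESIS;;;;',
  '00AB;LEFT-POINTING DOUBLE ANGLE QUOTATION MARK;Pi;0;ON;;;;;Y;LEFT POINTING GUILLEMET;;;;',
);

for my $line (@sample) {
  # fields: code point (hex), name, general category, ...
  my ($code, $name, $category) = split /;/, $line;
  # hex() turns the code point into a number; %c prints the character
  printf "U+%s\t%c\t%s\t%s\n", $code, hex($code), $category, $name;
}
```

The code point is a bare hex string, so hex() plus a %c format is enough to turn a record back into the character itself.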

The list is not fixed; new stuff gets added with each version of Unicode. Releases happen every 12 to 18 months. Refreshing that file is the major change needed to update my Unicode Toys for a new version.

So how about a quick script to find "LEFT" characters that have a matching "RIGHT" version?

#!/usr/bin/perl
# Read the UnicodeData.txt file to create a vim 
# matchpair list.
#
# May 2022 "Eli the Bearded"
use strict;
use warnings;

my $in = 'UnicodeData.txt';
my %found;
my %pool;
my %check;
my $id;
my $pid;
my $name;
my $count = 1; # first pair is the hardcoded one

# Tag characters are an obsolete invisible set of
# ASCII for hidden metadata. Modifiers and Combining
# should not be used on their own. Arabic are not
# left-to-right text, so I decided I don't need them.
# You may decide otherwise. The others, by inspection,
# don't have anything I'd want as a pair. (Some
# are part of larger sets, like up/down/left/right
# quads, or parts of multicharacter pictures.) This
# still leaves some unlikely pairs including box drawing
# stuff. It's a quick list.
#
# These are checked with word boundaries, so CIRCLE
# will not skip CIRCLED.
my @skip = qw[ TAG MODIFIER COMBINING ARABIC IDEOGRAPH
	       ARROWHEAD ARROW
	       AFFIX CIRCLE HALF
	       UP UPWARDS
	       DOWN DOWNWARDS
	     ];
my $skip = join('|', @skip);
my $skip_re = qr/\b(?:$skip)\b/; # \b for boundary

# no boundary check, allow "leftfacing" and the like
my $keep_re = qr/(?:LEFT|RIGHT)/;

binmode(STDOUT, ':utf8');
open(STDIN, '<', $in) or die "Cannot open $in: $!\n";

while(<>) {
  # keep code point and name only
  /^([^;]+);([^;]+);/ or next;
  $id = $1;
  $name = $2;

  # Stop checking if on skip list
  next if /$skip_re/;

  # if left or right, keep, but separately
  if (/$keep_re/) {
    if (/RIGHT/) {
      $check{$name} = $id;
    } else {
      $pool{$name} = $id;
    }
  }
}
close STDIN;

for $name (keys %check) {
  my $pair = $name;
  # %check has RIGHTs, see if there is a matching left
  $pair =~ s/RIGHT/LEFT/g;
  if (length( $pid = $pool{$pair} )) {
     $id = $check{$name};
     $found{$pid} = $id;

     # In .exrc or .vimrc " is used to begin a comment.
     # These three printf()s just document the pairs.
     printf(qq{" U+%s\t%c\t%s\n}, $pid, hex($pid), $pair);
     printf(qq{" U+%s\t%c\t%s\n}, $id, hex($id), $name);
     printf "\"\n";
     $count ++;
  }
}
print STDERR "Found $count pairs\n";

# Unfortunately < and > are not named with LEFT and RIGHT
# so hardcode that.
printf "set matchpairs=<:>";
for $id (sort { $a cmp $b } (keys %found)) {
  $pid = $found{$id};
  printf ",%c:%c", hex($id), hex($pid);
}
printf "\n";
__END__
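
To see the two name tests from the script in isolation, here is a cut-down sketch run on a handful of hand-picked names instead of the full data file. The skip list is shortened, and the table of names and code points is typed in for illustration rather than read from UnicodeData.txt.

```perl
#!/usr/bin/perl
# Sketch: the word-boundary skip test and the RIGHT-to-LEFT name
# swap, applied to a small hand-made table. The shortened skip
# list and the table itself are assumptions for illustration.
use strict;
use warnings;

my $skip_re = qr/\b(?:CIRCLE|ARROW|HALF)\b/;

my %id_for = (
  'LEFT PARENTHESIS'        => '0028',
  'RIGHT PARENTHESIS'       => '0029',
  'LEFTWARDS ARROW'         => '2190',
  'RIGHTWARDS ARROW'        => '2192',
  'CIRCLED PLUS'            => '2295',
  'LEFT HALF BLACK CIRCLE'  => '25D6',
  'RIGHT HALF BLACK CIRCLE' => '25D7',
);

# \b means CIRCLE matches as a whole word only, so CIRCLED PLUS
# survives while the arrows and half circles are dropped.
my @kept = grep { !/$skip_re/ } sort keys %id_for;
print "kept: $_\n" for @kept;

my %is_kept = map { $_ => 1 } @kept;
my %pairs;
for my $right (grep { /RIGHT/ } @kept) {
  # derive the expected LEFT name and see if it also survived
  (my $left = $right) =~ s/RIGHT/LEFT/g;
  next unless $is_kept{$left};
  $pairs{ $id_for{$left} } = $id_for{$right};
}
printf "U+%s pairs with U+%s\n", $_, $pairs{$_} for sort keys %pairs;
```

Only the parentheses come out as a pair here, because everything else with a LEFT or RIGHT in its name is caught by the skip list first.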

Saved as matchmaker, with the Unicode data file in the same directory, let's try it.

$ perl matchmaker >> .vimrc
Found 186 pairs
$ tail -1 .vimrc
set matchpairs=<:>,(:),[:],{:},«:»,֎:֍,܆:܇,࿖:࿕,࿘:࿗,𐡷:𐡸,𝄆:𝄇,𝅊:𝅌,𝅋:𝅍,👈:👉,🔍:🔎,🕃:🕄,🕻:🕽,🖉:✎,🖘:🖙,🖚:🖛,🖜:🖝,🗦:🗧,🗨:🗩,🗬:🗭,🗮:🗯,🙬:🙮,🤛:🤜,🫲:🫱,🭪:🭨,🭬:🭮,🭼:🭿,🭽:🭾,🮜:🮝,🮟:🮞,🮠:🮡,🮢:🮣,🮤:🮥,🯇:🯈,‘:’,“:”,‹:›,⁅:⁆,⁌:⁍,⁽:⁾,₍:₎,⇇:⇉,⊣:⊢,⋉:⋊,⋋:⋌,⌈:⌉,⌊:⌋,⌍:⌌,⌏:⌎,⌜:⌝,⌞:⌟,〈:〉,⌫:⌦,⍅:⍆,⎛:⎞,⎜:⎟,⎝:⎠,⎡:⎤,⎢:⎥,⎣:⎦,⎧:⎫,⎨:⎬,⎩:⎭,⎸:⎹,⏋:⎾,⏌:⎿,⏪:⏩,⏮:⏭,⏴:⏵,┤:├,┥:┝,┨:┠,┫:┣,╡:╞,╢:╟,╣:╠,╴:╶,╸:╺,▉:🮋,▊:🮊,▋:🮉,▍:🮈,▎:🮇,▏:▕,▖:▗,▘:▝,◀:▶,◁:▷,◂:▸,◃:▹,◄:►,◅:▻,◜:◝,◟:◞,◣:◢,◤:◥,◰:◳,◱:◲,◸:◹,◺:◿,☚:☛,☜:☞,⚟:⚞,⛦:⛥,❨:❩,❪:❫,❬:❭,❮:❯,❰:❱,❲:❳,❴:❵,⟅:⟆,⟕:⟖,⟞:⟝,⟢:⟣,⟤:⟥,⟦:⟧,⟨:⟩,⟪:⟫,⟬:⟭,⟮:⟯,⥼:⥽,⦃:⦄,⦅:⦆,⦇:⦈,⦉:⦊,⦋:⦌,⦍:⦐,⦏:⦎,⦑:⦒,⦗:⦘,⧘:⧙,⧚:⧛,⧼:⧽,⫍:⫎,⫥:⊫,⬱:⇶,⮄:⮆,⮐:⮑,⮒:⮓,⯇:⯈,⸂:⸃,⸄:⸅,⸉:⸊,⸌:⸍,⸜:⸝,⸠:⸡,⸦:⸧,⸨:⸩,⸶:⸷,⹑:⹐,⹕:⹖,⹗:⹘,⿸:⿹,〈:〉,《:》,「:」,『:』,【:】,〔:〕,〖:〗,〘:〙,〚:〛,꧁:꧂,﴾:﴿,︵:︶,︷:︸,︹:︺,︻:︼,︽:︾,︿:﹀,﹁:﹂,﹃:﹄,﹇:﹈,﹙:﹚,﹛:﹜,﹝:﹞,（:）,［:］,｛:｝,｟:｠,｢:｣
$

There are a lot of good pairs in that. But some pairs might need to be switched for taste. (Looking at those hands.)
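
Switching a pair just means reversing its two halves in the list. A minimal sketch of doing that mechanically follows, using a hypothetical flip_pairs() helper that is not part of matchmaker. Which pairs to flip is a matter of taste; the LEFTWARDS HAND / RIGHTWARDS HAND pair is used here only as an example.

```perl
#!/usr/bin/perl
# Sketch: reverse the halves of chosen pairs in a matchpairs
# value. flip_pairs() is a hypothetical helper for illustration.
use strict;
use warnings;

binmode(STDOUT, ':utf8');

sub flip_pairs {
  my ($value, @openers) = @_;
  # pairs to reverse, identified by their current first half
  my %flip = map { $_ => 1 } @openers;
  my @out;
  for my $entry (split /,/, $value) {
    my ($open, $close) = split /:/, $entry;
    ($open, $close) = ($close, $open) if $flip{$open};
    push @out, "$open:$close";
  }
  return join(',', @out);
}

# U+1FAF2 LEFTWARDS HAND and U+1FAF1 RIGHTWARDS HAND, written as
# escapes so this source file stays plain ASCII.
my $pairs = "<:>,\x{1FAF2}:\x{1FAF1}";
print "set matchpairs=", flip_pairs($pairs, "\x{1FAF2}"), "\n";
```

Run on the full generated list, the same helper would take the matchpairs value from the last line of the matchmaker output plus whichever first halves should be swapped.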