[ Vlado Keselj | 2020-12-06..2020-12-07 ]

Fixing Filenames

You should avoid filenames that contain dangerous characters, that is, any characters other than letters, digits, underscore, minus sign, and period. If you like to work on the command line and automate tasks with scripts, you should understand why those characters are dangerous: the shell and many tools give special meaning to spaces, quotes, brackets, and similar characters, so a filename that contains them can be split or misinterpreted. File names can be seen as names or identifiers in the programming context. Even if you do not do any programming at all, you should still be careful about this issue.
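As a rough illustration of the rule above, the following Perl sketch checks whether a name stays within the safe character set. The helper name is_safe_filename and the sample names are hypothetical and not part of any existing script. Treating a leading minus sign as unsafe is an added assumption here, based on the fact that commands may read such a name as an option.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (illustration only): returns 1 if the name consists
# solely of letters, digits, underscore, minus sign, and period, and does
# not start with a minus sign; returns 0 otherwise.
sub is_safe_filename {
    my ($name) = @_;
    return $name =~ /\A[A-Za-z0-9_.-]+\z/ && $name !~ /\A-/ ? 1 : 0;
}

# Made-up sample names; nothing on disk is read or changed.
for my $name ('report_2020-12-06.txt', 'my report.txt', '-rf', 'a;b.sh') {
    printf "%-24s %s\n", $name, is_safe_filename($name) ? 'safe' : 'dangerous';
}
```

Of these samples, only report_2020-12-06.txt is reported as safe: the others contain a space, start with a minus sign, or contain a semicolon.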

Whenever I download a file whose name contains dangerous characters, or receive such a file as an email attachment, I get mildly frustrated. To address this issue, I wrote a Perl script called fix-file-names, which renames such files. The script is given below:

 #!/usr/bin/perl
 # fix-file-names - change file names to safe names, e.g. space to _ etc.
 # 2009-2020 Vlado Keselj [email protected] http://vlado.ca last update:2020-12-08
 # Usage: fix-file-names f1 f2 ...
 
 for my $fnold (@ARGV) {
   my $fnnew = fix_filename($fnold);
 
   if ($fnnew eq $fnold) { print "$fnnew \t\tthe same file name kept!\n" }
   else {
     # Refuse to overwrite an existing file.
     if (-e $fnnew) { die "$fnnew already exists!\n" }
     print "$fnold \t-> $fnnew\n";
     rename($fnold,$fnnew) or die "cannot rename $fnold to $fnnew: $!\n";
   }
 }
 
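 sub fix_filename {
   local $_ = shift;
   s/^-/F-/;                      # leading minus could be read as an option
   s/ +- +/-/g;                   # " - " to "-"
   s/''+/--/g; s/'/-/g;           # apostrophes to minus signs
   s/[[(<{]/_-/g; s/[])>}]/-_/g;  # opening and closing brackets
   s/[,:;]\s*/--/g;               # comma, colon, semicolon (and following space)
   s/&/and/g; s/ /_/g;            # ampersand to "and", space to underscore
   s/__+/_/g; s/---+/--/g;        # collapse runs of _ and of 3+ minus signs
   s/\xE2\x80\x99/-/g;            # UTF-8 single right quote
   s/[^\w.-]/"0x".uc unpack("H2",$&)/ge;  # anything else to 0xHH
   return $_;
 }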
 
 # 2020-12-06
 # - =HH encoding is replaced with 0xHH since '=' is a special character in
 #   shell (bash)
The script first replaces various common constructs in filenames with roughly similar but safe equivalents, and finally it replaces any remaining potentially unsafe character with a hexadecimal 0xHH code.
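To see the substitutions at work, the sketch below reuses the substitutions of fix_filename from the script above and applies them to a few file names without renaming anything. The sample names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same substitutions as fix_filename in the fix-file-names script above.
sub fix_filename {
  local $_ = shift; s/^-/F-/; s/ +- +/-/g;
  s/''+/--/g; s/'/-/g; s/[[(<{]/_-/g; s/[])>}]/-_/g;
  s/[,:;]\s*/--/g; s/&/and/g; s/ /_/g;
  s/__+/_/g; s/---+/--/g;
  s/\xE2\x80\x99/-/g; # Single right quote
  s/[^\w.-]/"0x".uc unpack("H2",$&)/ge;
  return $_;
}

# Made-up sample names (illustration only); no file is touched.
my @samples = (
  'My Report (final).pdf',
  '-rf',
  'Tom & Jerry - Episode 1.mp4',
  'notes, draft; v2.txt',
  'a=b.txt',
);

for my $old (@samples) {
  printf "%-30s -> %s\n", $old, fix_filename($old);
}
```

For example, 'My Report (final).pdf' becomes My_Report_-final-_.pdf, '-rf' becomes F-rf, and 'a=b.txt' becomes a0x3Db.txt, since '=' has the hexadecimal code 3D.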

Related Work

An excellent essay on the dangers of arbitrary filenames and on fixing them is "Fixing Unix/Linux/POSIX Filenames: Control Characters (such as Newline), Leading Dashes, and Other Problems" by David A. Wheeler.

The package fix-filenames by Martin Zagora is an open-source program in TypeScript (JavaScript) that fixes filenames by recoding some non-ASCII characters. It contains an interesting mapping of non-ASCII characters to ASCII string equivalents. Its GitHub location is https://github.com/zaggino/fix-filenames.


created: 2020-12-06, last update: 2020-12-07

© 2020-2023 Vlado Keselj, last update: 14-Feb-2022