Friday, 28 October 2016

From File::Find to File::Find::Rule

I mostly rely on File::Find when I need to search for files and mangle them. My scripts usually share the same simple structure:

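use File::Find;

$| = 1; # autoflush
find( \&directory_scanner, ( $starting_directory ) );
$| = 0; # non autoflush

# and the scanner is something like
sub directory_scanner{
    chomp;

    # skip the starting directory itself and anything that is not a plain file
    return if ( $_ eq $starting_directory || ! -f $_ );
    # skip files whose directory does not end with the expected NNNN-NNNNNN suffix
    return if ( $File::Find::dir !~ /$re_dir(\d{4}-\d{6})$/ );
    ...
}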
 
The handler invoked by File::Find does two things. It prints a progress report (the $counter, not shown above) to tell me the script is still alive (I process 200k+ files at once) and, most notably, it applies a regexp to the current directory in order to skip staging, backup and similar directories that look like the ones I'm interested in but that I don't want the script to visit.
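To make the two jobs of the handler concrete, here is a minimal, self-contained sketch of such a scanner. The sample directory tree, the value of $re_dir and the @found array are assumptions made up for illustration (the post never shows them); the reporting intervals are the ones used later in this post.

```perl
use strict;
use warnings;
use File::Find;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Hypothetical sample tree: one "good" directory and one backup copy of it.
my $starting_directory = tempdir( CLEANUP => 1 );
my $re_dir             = qr/data/;    # assumed value, not from the post

for my $dir ( 'data2016-102801', 'data2016-102801.bak' ) {
    make_path( "$starting_directory/$dir" );
    for my $name ( 'a.txt', 'b.txt' ) {
        open my $fh, '>', "$starting_directory/$dir/$name"
            or die "Cannot create $dir/$name: $!";
        close $fh;
    }
}

my $counter = 0;
my @found;      # hypothetical: collects the files that survive the filters

sub directory_scanner {
    # $_ is the name of the current entry, relative to the directory
    # File::Find has moved into
    return if ( ! -f $_ );

    # $File::Find::dir is the directory being visited: keep only the
    # directories whose name ends with the expected pattern
    return if ( $File::Find::dir !~ /$re_dir(\d{4}-\d{6})$/ );

    # progress report: a dot every 100 files, the total every 1000
    $counter++;
    print "." if ( $counter % 100 == 0 );
    print "$counter\n" if ( $counter % 1000 == 0 );

    # $File::Find::name is the full path of the current entry
    push @found, $File::Find::name;
}

$| = 1;    # autoflush, so the dots show up immediately
find( \&directory_scanner, $starting_directory );
$| = 0;
```

With this sample tree only the two files under data2016-102801 end up in @found, because the .bak directory fails the anchored regexp.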
A few times I tried to convert my File::Find based scripts to File::Find::Rule, just to get more used to its interface, but I didn't know how to apply a regular expression to the path being traversed. Reading the documentation more carefully, I found the exec rule, which lets me specify a handler (i.e., a subroutine) that returns true or false depending on whether I want to keep the file being visited. Converting my scripts therefore becomes as easy as follows:
 
use File::Find::Rule;

$| = 1; # autoflush
my $engine = File::Find::Rule->new();
my @files  = $engine->file()
    ->exec( sub {
        my ( $shortname, $path, $fullname ) = @_;
        return $path =~ /$re_dir(\d{4}-\d{6})$/; # keep only matching directories, as the File::Find scanner does
            } )
    ->exec( sub{
        my ( $shortname, $path, $fullname ) = @_;
        $counter++;
        return $shortname =~ /KCL/;
            } )
    ->exec( sub{
        my ( $shortname, $path, $fullname ) = @_;
        print "." if ( $counter % 100 == 0 );
        print "$counter\n" if ( $counter % 1000 == 0 );
        return 1; # do not forget !
            } )
    ->in( $starting_directory );
$| = 0; # non autoflush
 
I've kept three separate handlers for readability's sake but, as you can imagine, they can be merged into a single one. The fun part is that I can check the path against a regexp again. The drawback is that a handler used only for output reporting must always return a true value.
In case you are wondering, autoflush is enabled simply so that the dots are displayed while the program is running.
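As a rough sketch of the single-handler variant mentioned above, the three subroutines can be folded into one exec call. The sample tree and the value of $re_dir are assumptions made up so that the snippet runs on its own. The handler keeps a file only when its directory matches the pattern, which is the behaviour of the File::Find scanner at the top of this post.

```perl
use strict;
use warnings;
use File::Find::Rule;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Hypothetical sample tree: one "good" directory and one backup copy of it.
my $starting_directory = tempdir( CLEANUP => 1 );
my $re_dir             = qr/data/;    # assumed value, not from the post

for my $dir ( 'data2016-102801', 'data2016-102801.bak' ) {
    make_path( "$starting_directory/$dir" );
    for my $name ( 'KCL_a.txt', 'other.txt' ) {
        open my $fh, '>', "$starting_directory/$dir/$name"
            or die "Cannot create $dir/$name: $!";
        close $fh;
    }
}

my $counter = 0;

$| = 1;    # autoflush, so the dots show up immediately
my @files = File::Find::Rule->file()
    ->exec( sub {
        my ( $shortname, $path, $fullname ) = @_;

        # directory filter: keep only directories ending with the pattern
        return 0 unless $path =~ /$re_dir(\d{4}-\d{6})$/;

        # count every file that lives in a good directory
        $counter++;

        # name filter
        return 0 unless $shortname =~ /KCL/;

        # progress report: a dot every 100 files, the total every 1000
        print "." if ( $counter % 100 == 0 );
        print "$counter\n" if ( $counter % 1000 == 0 );

        return 1;    # do not forget!
    } )
    ->in( $starting_directory );
$| = 0;

print "$_\n" for @files;
```

The early returns preserve the order of the three original handlers: the counter is incremented only for files in a matching directory, and the report runs only for files that also pass the name filter. With this sample tree, only KCL_a.txt under data2016-102801 is kept.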
