Batch image processing in Python

April 2, 2022


It has been a while since I worked on a proper programming project. I still write small scripts to help me at work, and I have wanted to expand my blog with programming and photography content for quite some time. So, in this post, I would like to focus on a few scripts for batch image processing that could be useful for a general audience.

My photography projects are lagging terribly due to many factors. However, I like to keep my Google Photos library up to date, and due to the recent changes in the storage policy, I cannot upload raw files anymore. Thus, I needed a reliable program that would let me quickly and automatically convert my RAW images into small jpg previews that can be synced to Google Photos. This task is relatively simple if EXIF data is not essential. The advantage of keeping EXIF is direct access to camera parameters and, in the case of drone photography, the GPS location. There are proprietary programs that can do both things at once (e.g., Nikon NX Studio), but they do not work recursively, meaning that each folder has to be processed separately. The automated tools that I found (e.g., optimize-images) struggle to keep the EXIF from RAWs. This convinced me to write a tool that optimizes many images at once and keeps the EXIF information.

The tool is not too complicated and is based on Pillow (PIL), rawpy, py3exiv2, joblib, and imageio. The script will work only on Linux because py3exiv2 requires the libexiv2 library, which is currently only available on Linux. However, the script works perfectly fine with Ubuntu under WSL2. The best way to get started is to prepare the environment in conda:

Install conda and create an environment:

  ### Install conda if it is not installed yet
  wget -q -P . https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  bash ./Miniconda3-latest-Linux-x86_64.sh
  conda init bash
  bash

  ### Create conda environment and activate it
  conda create --name photo-env python=3.9
  conda activate photo-env

  sudo apt-get install libexiv2-dev
  pip install joblib imageio py3exiv2 rawpy pillow
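
A quick way to confirm that the environment is complete is to check that every dependency can be found by Python. The snippet below is only a sketch: the helper name `missing_modules` is made up for this example, and the list simply mirrors the packages installed above (note that py3exiv2 is imported as `pyexiv2` and pillow as `PIL`).

```python
import importlib.util

# Import names differ from the pip package names for two of the dependencies:
# py3exiv2 is imported as "pyexiv2" and pillow as "PIL".
REQUIRED_MODULES = ["joblib", "imageio", "pyexiv2", "rawpy", "PIL"]


def missing_modules(names):
    """Return the module names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules: {}".format(", ".join(missing)))
    else:
        print("All required modules are importable.")
```

Running it inside the activated `photo-env` environment should report no missing modules if the installation went through.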

All packages should be successfully installed. The code for the script is shown below:

optimize_images.py ⬇️

import glob
import os
import imageio
import pyexiv2
import rawpy
from PIL import Image
from joblib import Parallel, delayed
import argparse
import time
import shutil

### Run from command line
parser = argparse.ArgumentParser(
    description='Convert RAW camera files into reduced size jpgs or reduce images in folders recursively.')
parser.add_argument('--i', type=str, help='Input directory', required=True)
parser.add_argument('--o', type=str, help='Output directory', required=True)
parser.add_argument('--d', type=int, default=1920,
                    help='Maximum dimension of a photo')
parser.add_argument('--s', type=float, default=0.4,
                    help='Maximum size in MB of a photo')
parser.add_argument('--e', type=str, help='file extension, only one accepted', required=True)
parser.add_argument('--q', type=int, default=75,
                    help='JPG save quality')
parser.add_argument('--j', type=int, default=8,
                    help='Number of parallel threads')
parser.add_argument('--m', action='store_true',
                    help='Silent (mute) mode. Show only final stats.')
parser.add_argument('--b', type=str, default=False, help='Backup dir. Provide the path of the backup dir. '
                                                         'Used to rerun the script on the same folder')
args = parser.parse_args()

### Image options
max_shape = args.d, args.d
save_quality = args.q
size_max = args.s  ### parsed, but not used anywhere below
saved_space = 0
out_folder = args.o
file_extension = args.e
input_folder = args.i
backup_folder = args.b

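### List folder with raw files (match the extension as typed, lower, upper, and capitalized)
extension_variants = {file_extension, file_extension.lower(), file_extension.upper(), file_extension.capitalize()}
files = []
for file_extension_mod in sorted(extension_variants):
    files = files + glob.glob(input_folder + '/**/*.{}'.format(file_extension_mod), recursive=True)

### Drop duplicates in case two variants matched the same file
files = sorted(set(files))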


### Job definition for parallel processing
def run_job(file):
    copied = False
    folder_new = (os.path.dirname(file).replace(os.path.dirname(input_folder), out_folder)) + '/'

    ### Create a new folder if it does not exist yet
    if not os.path.exists(folder_new):
        os.makedirs(folder_new, exist_ok=True)

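    ### Keep an existing jpg/jpeg extension as-is; everything else becomes .jpg
    name_stem, name_ext = os.path.splitext(os.path.basename(file))
    path_new = folder_new + name_stem + (name_ext if name_ext.lower() in ('.jpg', '.jpeg') else '.jpg')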

    ### Process file only if not present in the output directory or there is a backup folder
    if not os.path.exists(path_new) or args.b:

        ### Backup files
        if backup_folder:

            destination_folder = backup_folder + os.path.dirname(file)

            ### Create the matching folder in the backup directory
            if not os.path.exists(destination_folder):
                os.makedirs(destination_folder, exist_ok=True)

            ### copy file to back up location
            shutil.copy2(file, destination_folder)


        ### Open file and postprocess it

        try:

            ### Optimize images
            if backup_folder:
                try:
                    im = Image.open(file)

                except Exception as e:
                    print('Error detected: {}'.format(e))

            ### convert RAWs
            else:
                with rawpy.imread(file) as raw:
                    rgb = raw.postprocess()

                im = Image.fromarray(rgb, "RGB")


            ### Check file size and shape
            im_shape = im.size

            img_size = round(float(os.path.getsize(file)) / 1E6, 2)


            ### Metadata holder; stays empty when the image is copied unchanged
            exif_holder = []

            ### Reduce image size if needed
            if im_shape[0] > max_shape[0] or im_shape[1] > max_shape[1]:

                ### Get metadata from the file only if going to be modified

                try:
                    metadata = pyexiv2.ImageMetadata(file)
                    metadata.read()

                    ### Store metadata in a list
                    for tag in metadata:
                        try:
                            exif_holder.append([tag, metadata[tag].value])

                        ### not all exif fields can be processed
                        except Exception:
                            pass

                except Exception as e:
                    print('File: {} had unreadable EXIF'.format(file))


                ### Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter
                im.thumbnail(max_shape, Image.LANCZOS)
                im.save(path_new, optimize=True, quality=save_quality)
                img_size_new = round(float(os.path.getsize(path_new)) / 1E6, 2)

            else:
                try:
                    shutil.copy2(file, os.path.dirname(path_new))
                except Exception as e:
                    ### Same file exception
                    pass

                img_size_new = round(float(os.path.getsize(path_new)) / 1E6, 2)
                copied=True

            ### Write new exif
            metadata_new = pyexiv2.ImageMetadata(path_new)
            metadata_new.read()



            for element in exif_holder:
                try:
                    metadata_new[element[0]] = pyexiv2.ExifTag(element[0], element[1])

                except Exception as e:
                    pass

            metadata_new.write()

            if not args.m:
                size_diff = abs(round(img_size - img_size_new, 2))
                if not copied:
                    print('Saved {}. Saved space = {} MB.'.format(path_new, size_diff))
                else:
                    print('File already optimized, skipping.')

            return img_size - img_size_new

        except Exception as e:
            ### Fallback: convert to RGB and save without EXIF; report problems only when not muted
            try:
                if im.mode != 'RGB':
                    im = im.convert('RGB')
                im.save(path_new, optimize=True, quality=save_quality)

            except Exception as e:
                if not args.m:
                    print('There was a problem with a file: {}\nError: {}'.format(file, e))

    else:
        if not args.m:
            print('File {} already exists'.format(path_new))
        return round(float(os.path.getsize(file)) / 1E6, 2) - round(float(os.path.getsize(path_new)) / 1E6, 2)


t1 = time.time()
results = Parallel(n_jobs=args.j)(delayed(run_job)(queue_element) for queue_element in files)

print('Final statistics: saved space: {} MB, took: {} s.'.format(round(sum(filter(None, results)), 2),
                                                                 round(time.time() - t1, 2)))
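
The script collects files by globbing a few spellings of the extension. As an alternative, here is a minimal sketch that walks the folder tree and compares extensions case-insensitively, so `.NEF`, `.nef`, and `.Nef` are all picked up in one pass. The function name `find_files` is hypothetical and not part of the script above.

```python
import os


def find_files(root, extension):
    """Recursively list files under root whose extension matches, ignoring case."""
    wanted = "." + extension.lower().lstrip(".")
    matches = []
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            if os.path.splitext(name)[1].lower() == wanted:
                matches.append(os.path.join(folder, name))
    return sorted(matches)
```

Because every file is visited exactly once, no de-duplication is needed afterwards.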

The script command-line options are listed below:

usage: optimize_images.py [-h] --i I --o O [--d D] [--s S] --e E [--q Q] [--j J] [--m] [--b B]

Convert RAW camera files into reduced size jpgs or reduce images in folders recursively.

optional arguments:
  -h, --help  show this help message and exit
  --i I       Input directory
  --o O       Output directory
  --d D       Maximum dimension of a photo
  --s S       Maximum size in MB of a photo
  --e E       file extension, only one accepted
  --q Q       JPG save quality
  --j J       Number of parallel threads
  --m         Silent (mute) mode. Show only final stats.
  --b B       Backup dir. Provide the path of the backup dir. Used to rerun the script on the same folder

This script works in two modes:

converting RAW files to low-resolution preview jpg and
reducing images in the folder by rescaling and optimizing jpg quality.
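
In both modes the folder structure of the input is recreated in the output directory. The idea can be sketched with `os.path` alone. This is a simplified variant and not the exact mapping used in the script; `preview_path` and the example paths are made up for illustration.

```python
import os


def preview_path(file, input_root, output_root):
    """Map an input file to a .jpg path at the same relative location under output_root."""
    relative = os.path.relpath(file, input_root)
    stem, _extension = os.path.splitext(relative)
    return os.path.join(output_root, stem + ".jpg")


# Hypothetical example paths
print(preview_path("/photos/raw/170611/DSC_0001.NEF", "/photos/raw", "/out"))
```

Using `os.path.splitext` instead of a plain string replace avoids touching file names that happen to contain the extension text somewhere else.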

To run the script in RAW preview mode, use the command:

python optimize_images.py --i /mnt/d/Photo/Raw_photo/170611/ --o /home/dawid/reduced_raw/ --e NEF --j 8

That will convert NEFs in the given folder to jpg using 8 parallel threads. Image optimization, in contrast, can be run in the same folder with the backup option:

python optimize_images.py --i /home/dawid/dzyla-photo/ --o /home/dawid/dzyla-photo/ --e jpg --j 8 --b /home/dawid/dzyla-photo_backup/ --d 1920 --q 75

All jpgs from the given directory will first be backed up to another directory, then rescaled to a long edge of 1920 px and saved with a compression quality of 75.
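
The rescaling step comes down to a bit of arithmetic: the longer edge is brought down to the limit and the shorter edge is scaled by the same factor. The sketch below only illustrates that calculation. It is not Pillow's actual `thumbnail` implementation, which may round slightly differently, and both `fit_within` and the 6000 x 4000 px example image are hypothetical.

```python
def fit_within(width, height, max_dim=1920):
    """Return (width, height) scaled down so the longer edge is at most max_dim."""
    longest = max(width, height)
    if longest <= max_dim:
        return width, height  # already small enough, never upscale
    scale = max_dim / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(6000, 4000))  # (1920, 1280)
print(fit_within(1600, 1200))  # (1600, 1200), left untouched
```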

The script converts 100 RAWs (2.7 GB) within a minute. I was curious whether using joblib and multiple parallel threads would speed up the processing. I copied 100 NEFs to a standard SSD drive (maximum read speed ~500 MB/s) and ran the script, increasing the number of parallel threads. Images were saved to the same drive.

for i in $(seq 1 16); do python optimize_images.py --i /mnt/c/test/ --o /mnt/e/test1/ --e NEF --j $i --m; rm -r /mnt/e/test1/; done

Results are plotted below:

Using more parallel threads speeds up total processing time ~3-fold. On a Ryzen 7 5800X (8 core / 16 thread), there was no difference in time above 8 threads. Interestingly, changing the drive to an NVMe SSD did not improve the performance. There are possibly some tricks to run the code faster, but that is beyond my pay grade.
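
One rough way to read the ~3-fold figure is through Amdahl's law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that runs in parallel and n is the number of workers. This is an interpretation added for illustration, not something the measurements were fitted to, and the function names are hypothetical.

```python
def parallel_fraction(speedup, workers):
    """Solve Amdahl's law, speedup = 1 / ((1 - p) + p / workers), for p."""
    if workers < 2 or speedup <= 1:
        raise ValueError("need at least 2 workers and a speedup above 1")
    return (1 - 1 / speedup) / (1 - 1 / workers)


def amdahl_speedup(p, workers):
    """Predicted speedup for parallel fraction p on the given number of workers."""
    return 1 / ((1 - p) + p / workers)


p = parallel_fraction(3, 8)          # assumes ~3x speedup at 8 workers
print(round(p, 2))                   # 0.76
print(round(amdahl_speedup(p, 16), 2))  # 3.5
```

Under this simple model, roughly three quarters of the work would be parallel, and 16 workers would give only a modest further gain. The measured plateau above 8 threads is flatter than that, which may simply reflect that the CPU has 8 physical cores.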

Another script I want to share is a byproduct of repeatedly copying data by hand from the drone’s SD card into folders named after the shooting date. Nikon NX software with its sync function works great but does not support sources other than Nikon cameras. Maybe other software is available, but it’s easier for me to write a script ;) This script can again be run from the command line, but it supports more operating systems, including Windows. It requires two new libraries: tqdm and exifread.

organize_images.py ⬇️

import datetime
import os
import glob
import shutil
import tqdm
import exifread
import argparse

### Run from command line
parser = argparse.ArgumentParser(
    description='Sync photos from one folder to another and group pictures by creation date')
parser.add_argument('--i', type=str, help='Input directory', required=True)
parser.add_argument('--o', type=str, help='Output directory', required=True)
parser.add_argument('--e', type=str, help='file extension, only one accepted', required=True)
args = parser.parse_args()

def get_date_taken(path):
    ### Read file exif, using exifread, so it will work on Windows as well
    with open(path, 'rb') as fh:

        ### try to read exif
        try:
            tags = exifread.process_file(fh, stop_tag="EXIF DateTimeOriginal")
            dateTaken = tags["EXIF DateTimeOriginal"]
            return str(dateTaken)

        ### If there is no exif, use the file modification time
        except Exception as e:
            return datetime.datetime.fromtimestamp(os.path.getmtime(path)).strftime('%Y:%m:%d %H:%M')


input_folder = args.i
output_folder = args.o

### Get the list of all files (the set avoids globbing the same spelling twice)
files = []
for file_extension_mod in sorted({args.e, args.e.capitalize(), args.e.lower(), args.e.upper()}):
    files = files + glob.glob(input_folder + '/**/*.{}'.format(file_extension_mod), recursive=True)
files = sorted(set(files))

### Copy files to corresponding directories
for file in tqdm.tqdm(files):
    creation_date = get_date_taken(file)
    try:
        creation_date = creation_date.split()[0].replace(':', '')
    except IndexError:
        creation_date = '19700101'

    ### if files are in some other directory already, use this name after shooting date
    ### (os.path functions instead of splitting on '/', so it also works on Windows)
    parent_folder = os.path.relpath(os.path.dirname(file), input_folder)
    if parent_folder == os.curdir:
        subfolder_name = creation_date
    else:
        subfolder_name = creation_date + '_' + os.path.basename(parent_folder)

    destination_folder = os.path.join(output_folder, subfolder_name)

    ### Create new folder in the output directory
    os.makedirs(destination_folder, exist_ok=True)

    ### copy file
    shutil.copy2(file, destination_folder)

Example usage of the script:

python organize_images.py --i /mnt/c/Users/dawid/Pictures/ --o /mnt/q/organized/ --e jpg
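
The folder name is derived from the EXIF timestamp, which has the form `YYYY:MM:DD HH:MM:SS`. Below is a minimal sketch of that conversion with some extra validation that the script itself does not perform. The function name `date_folder_name` and the sample timestamp are made up for this example.

```python
from datetime import datetime


def date_folder_name(exif_datetime, fallback="19700101"):
    """Turn an EXIF-style 'YYYY:MM:DD HH:MM:SS' string into 'YYYYMMDD'."""
    try:
        date_part = exif_datetime.split()[0]
        return datetime.strptime(date_part, "%Y:%m:%d").strftime("%Y%m%d")
    except (IndexError, ValueError):
        # Empty or malformed timestamps fall back to a fixed placeholder date
        return fallback


print(date_folder_name("2022:04:02 10:15:30"))  # 20220402
print(date_folder_name(""))                     # 19700101
```

Parsing with `strptime` rejects placeholder values such as `0000:00:00 00:00:00`, which would otherwise end up as a folder name.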

The reason why I tried to reinvent the wheel by writing those scripts is relatively simple: homemade scripts make complicated tasks a little bit easier. My previous experience with pre-made tools is a little bit bittersweet. There are many excellent programs, but I often find that the functionality I am looking for is not there, or that extending a ready-made tool requires a lot of additional work. Writing my own code has two great advantages: it allows me to optimize software for my own needs, and more importantly, it is so much fun!