Python - how to speed up the algorithm on the example of PDF file import

Python - how to speed up the algorithm on the example of PDF file import

Why is our code often not optimal?

There are many reasons why we are not creating optimal code. The first that comes to mind is short terms that won't let you spread your wings and refine the details. Another may be that the client does not always expect and needs the code that can execute in a fraction of a second. In a situation where our algorithm will be used relatively rarely, it may be more profitable to wait a few seconds than incurring greater costs. The last reason I would like to mention is our skills. Who has never felt embarrassed when analyzing their own code from a few months ago, let alone years old, throw a flash drive first.

How to optimize python code and more?

We have two options to choose from. The first is to improve the existing algorithm, while the second - usually simpler and more effective - is to rewrite the function. In the example below, I will present both approaches. The second approach can be especially fruitful years later because, in the time that has passed since created the code, new, useful libraries may have appeared.

Practical example - PDF file import

The task set by the client is as follows: "Prepare a function for me that will load the PDF file, then split it into individual pages and add it to the existing report as an attachment." When testing the solutions, I used a 13-page PDF file provided by the customer.

At first, our algorithm looks like this:

def pdf_upload (filename, data):
    images = []
    created_jpgs = False
    with Image(filename=filename, resolution=300) as img:
        images = []
        if len(img.sequence) > 1:
           for x in img.sequence:
              path = '{0}-{1.index}.jpg'.format(data.full_path.replace('.jpg', ''), x)
              convert_image(images, x, path)
        else:
           path = '{0}-{1}.jpg'.format(data.full_path.replace('.jpg', ''), 0)
           convert_image(images, img, path)
        return images

def convert_image(images, img, path):
    """Postprocessing photo."""
    images.append(path)
    img_page = Image(image=img)
    img_page.compression_quality = 20
    img_page.resize(2000, 2820)  # zachowuje proporcje formatu A
    img_page.alpha_channel = 'remove'
    img_page.save(filename=path)

Pseudocode:

- load PDF and "scan it" with DPI 300
- check if the number of pages is greater than 1
- specify the path to the file and convert each page using the 'convert_image' function
- if not use a different name, then convert the page

The execution time of this algorithm is as high as 163s; for DPI 200 it decreases to 17.2s, but in the case of some PDF files, the quality may not be sufficient.

The first approach - using the code at hand

def pdf_upload_fast_and_furious(filename, data):
    images = []
    pdf = PdfFileReader(filename)
    for page in range(pdf.getNumPages()):
        pdf_writer = PdfFileWriter()
        pdf_writer.addPage(pdf.getPage(page))
        output = f'{filename.replace(".pdf", "")}-{page}.pdf'
        with open(output, 'wb') as output_pdf:
            pdf_writer.write(output_pdf)
        with Image(filename=output, resolution=300) as img:
            path = f'{data.full_path.replace(".jpg", "")}-{page}.jpg'
            images.append(path)
            convert_image(images, img, path)
            os.remove(output)
    return images

Pseudocode:

- for each page in the document do:
- load the page into memory
- save the page as PDF
- load the page by "scanning" it with DPI 300
- specify the path to the file and convert it with the well-known function 'convert_image'
- delete page in PDF

 

The code written in this way takes 24.5 seconds, which means 6.8 times faster than the algorithm at the beginning.

Final version - using pdf2image library

def pdf_upload_2_fast_2_furious(filename, data):
    """Upload drawing in high quality."""
    with open(filename, 'rb') as filehandle:
        pdf = PdfFileReader(filehandle)
        pages = pdf.getNumPages()
    with tempfile.TemporaryDirectory() as path:
        images_from_path = convert_from_path(
            filename,
            output_folder=path,
            last_page=pages,
            dpi= 300,
            first_page=1,
            thread_count=1,
        )
    images = []
    for index, page in enumerate(images_from_path):
        path = f'{data.full_path.replace(".jpg", "")}-{index}.jpg'
        page.save(path, 'JPEG')
        images.append(path)
    return images

Pseudocode:

- load the file into memory
- get page count
- use the 'convert_from_path' function to split the file into a single JPG
- save individual JPG as files

This time the algorithm takes only 9.2 seconds to execute, but it can speed up by increasing the number of threads involved in the operation. After selecting two threads, the execution time is 7.4s, while for, four 7.7s. This means that with the use of one logical processor, our algorithm accelerated almost 18 times.