Stitching Together Images from a PDF Generated by Microsoft
So I wanted to extract an image from a PDF. "Right-click -> Save As" and I thought I was done. Unfortunately, I was wrong: I only got a slice of the image, not the whole thing.
After some (non-LLM based) (re-)search, I learned that PDFs with a "Producer: Microsoft: Print To PDF" attribute tend to contain this "feature". So how do we remediate that?
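To check up front whether a PDF is affected, the producer attribute can be read with pdfinfo, which ships with poppler alongside pdfimages. The following is a minimal sketch; the function name `is_ms_print_to_pdf` is made up for illustration, and it assumes the producer string looks like the one quoted above.

```shell
# Hypothetical helper (name is an assumption, not part of poppler):
# reads pdfinfo output on stdin and succeeds if the Producer line
# names Microsoft's "Print To PDF".
is_ms_print_to_pdf() {
    grep -qi '^Producer:.*Microsoft.*Print To PDF'
}

# Usage (requires pdfinfo from poppler):
#   pdfinfo damaged_by_microsoft.pdf | is_ms_print_to_pdf && echo "expect sliced images"
```

The check only looks at metadata, so it says nothing about whether a particular image in the file was actually sliced.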
The first step is to get a list of all the images. This is easily done with pdfimages (a reasonably current, poppler-based version):
$ pdfimages -list damaged_by_microsoft.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     440   198  rgb     3   8  jpeg   no         4  0   600   600 8666B 3.3%
   1     1 image     440   198  rgb     3   8  jpeg   no         5  0   600   600 8382B 3.2%
   1     2 image     440    90  rgb     3   8  jpeg   no         6  0   600   600 6621B 5.6%
…
Extracting those images is easily done with pdfimages as well. I prefer my images as .png, so I added the appropriate conversion flag to the command.
pdfimages -png damaged_by_microsoft.pdf dbm_image
This results in a bunch of "dbm_image-000.png"-style files in the current directory. The cumbersome part starts here: you have to identify which fragment is the first and which is the last section of each image. In my case, I wanted the image spanning fragments 006 to 026 and the image spanning fragments 027 to 054.
Having noted down the index of the first and last fragment of each image we want to export, we can now stitch them together using ImageMagick:
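Finding the boundaries can be partly automated from the `pdfimages -list` output. The sketch below is a heuristic of my own, not something pdfimages offers: it assumes that slices of one image follow each other on the same page and share the same width. The function name `list_tile_runs` is made up. Two separate images of identical width that sit next to each other would be merged into one run, so the result still needs a visual check.

```shell
# Hypothetical helper (heuristic, not a pdfimages feature): reads
# `pdfimages -list` output on stdin and prints one line per run of
# consecutive images that share the same page and width.
list_tile_runs() {
    awk '
        # Skip the header, the separator line, and anything that is not an image row.
        NR <= 2 || $3 != "image" { next }
        {
            if (have && $1 == page && $4 == width) {
                last = $2; total += $5      # same run: extend it
            } else {
                if (have) printf "%03d-%03d width=%d height=%d\n", first, last, width, total
                page = $1; first = $2; last = $2; width = $4; total = $5; have = 1
            }
        }
        END { if (have) printf "%03d-%03d width=%d height=%d\n", first, last, width, total }
    '
}

# Usage (requires pdfimages from poppler):
#   pdfimages -list damaged_by_microsoft.pdf | list_tile_runs
```

Each output line gives the first and last fragment index of a run plus the summed height, which is the height the stitched image should end up with.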
magick convert dbm_image-0{06..26}.png -append image01.png
magick convert dbm_image-0{27..54}.png -append image02.png
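The brace expansion above depends on a shell such as bash that keeps the zero padding in `{06..26}`. As a sketch for shells without that feature, the file list can be generated explicitly. The helper name `tile_names` is made up, and it assumes the three-digit numbering seen in the extracted file names.

```shell
# Hypothetical helper: prints the tile file names for fragments
# first..last, zero-padded to three digits like the extracted files.
# Pass the indices as plain decimals (6, not 006); a leading zero would
# be read as octal by the shell arithmetic.
tile_names() {
    prefix=$1
    i=$2
    last=$3
    while [ "$i" -le "$last" ]; do
        printf '%s-%03d.png\n' "$prefix" "$i"
        i=$((i + 1))
    done
}

# Usage (requires ImageMagick; relies on file names without spaces):
#   magick $(tile_names dbm_image 6 26) -append image01.png
#   magick $(tile_names dbm_image 27 54) -append image02.png
```

The usage lines call `magick` directly, which is the primary entry point in ImageMagick 7; `-append` stacks the inputs top to bottom in the order given.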
Et voilà! I just had to spend a couple of minutes figuring this out instead of simply doing a "Save As", thanks to Microsoft's genius in PDF export.
Based on "PDF: extracted images are sliced / tiled" on Stack Overflow.
Tagged as: commandline, images, linux, microsoft, pdf, rant | Author: Martin Leyrer
[Sunday, 20250720, 11:35]