I‘ve decided to go paperless (as much as I can anyway). Nearly 20 years worth of various documents must be scanned – then saved to a safe place and the unimportant papers shredded. This pile of documents is pretty tall and I don’t really want to feed the scanner for days, so some ingenuity is required. As any decent software developer I hate doing the same thing over and over again so here was a chance to let the geek in me take over.
One of my old scanners, still waiting to be recycled, actually does have a document feeder. Even though I still would have to feed it, using it could make the job bearable. Or maybe not. The scanning software is not very good. So from the computer I would have to manually operate the program, handle each document and place the file in the destination folder.
Four years ago I wrote a bash script to automate scanning to PDF. It would handle a multi page document, which is nice. But the number of pages in each document had to be typed in. In this case most documents would have one page and some would have multiple. So this script would have to be elaborated upon. The basic idea was to do something like this:
- Keep on scanning while there are sheets in the ADF.
- One sheet of paper equals one page. Sheets are not turned.
- Somehow detect that pages belong in the same document.
- Convert each document to PDF.
The pixma_scan application that I used the previous time would have to be amended in order to be able to detect whether or not there are sheets in the ADF without actually scanning. So I did a minor tweak to the program allowing it to be used in this way. The code is available on GitHub. I also had to change the Bash script a bit to take advantage of this feature.
Raspberry Pi to the rescue
The Automatic Document Feeder (ADF) of the Pixma MP780 does not hold a lot of paper but it would still take quite some time to scan and convert a full ADF. I also don’t want to keep my laptop busy doing this work so I decided to utilize a Raspberry Pi. It runs Linux, so the scanning application would probably work. The same goes for the utilities the script calls for image manipulation. I had to change this script a bit and the initial test showed that this could work.
Detecting multi-page documents
Another good thing about using the Rapberry Pi is that I could make use of the camera module to detect whether or not a page should be in the same document as the current. I figured it would be possible to some degree of accuracy to analyze a portion of the image and detect whether or not it contains a marker. The images below show two documents with the marker placed in the bottom right corner of the sheet and one without.
# Test for more pages in the document
raspistill -roi 0.08,0.65,0.10,0.07 -w 256 -h 256 -br 90 -co 100 -mm average -ISO 100 -o camera.jpg
values=$(convert camera.jpg -threshold 50% -format %c histogram:info:- | tr -s ':' | cut -d ':' -f 1)
black=$(echo $values | cut -d ' ' -f 1)
white=$(echo $values | cut -d ' ' -f 2)
if [ $black -gt $white ] ; then
Using the convert utility of Imagemagick I was able to detect whether or not the image was mostly black or white. If it was mostly black the document belongs to the current document and should be appended. The problem that blocked progress was that the ADF could not be trusted to place the sheet of paper in the same position every time. I could probaly get it working if I put black separator sheets in between documents. However I did not continue down this path; as I it would take some time to tune and I am still able to merge PDF files after they have been scanned if need be.
This set-up has proved to be quite useful. It has so far scanned 68 documents in about one hour and half, which would take far longer to do manually. I still have to sort all the document files and place them where they belong. But at least I have a digital copy.
Now I’ll drop a stack at the ADF and get some Battlefield 4 quality time 🙂