Single file python script for managing a directory containing scanned documents
Felix Stupp c56794d1c2

scansystem

scansystem is a single-file Python script to maintain (hence its script name) a directory containing your personal documents as PDFs. It helps you scan them, apply OCR, and sort them into categories you make up as you like. It assigns incremental IDs to scanned documents, which can help in finding the paper originals later.

This script runs "stateless" without a database (everything required is encoded in human-readable filenames or stored inside the PDFs themselves). So, in contrast to other, more complex tools, this script may offer you these unique features:

  • You do not need this script to access your documents, re-sort them, or even add new ones. It is built to be just a small helper.
  • It's compatible with every file synchronization solution (a.k.a. cloud) and backup solution you already have and might want to use in the future.
  • You can use the tools you want to search through your documents (like pdfgrep or Recoll)
  • You can use it without first configuring a server or anything similar.
  • If this script stops working for you (for any reason), you can just continue on with your life without it.
  • You can use the script (only) from the terminal.

Other features:

  • It can combine multiple sides / pages into a single PDF which results in a single PDF having multiple IDs (ID per side) assigned.
  • For "special documents" (or if your scanner is not capable of ADF) you can use an interactive mode (provided by scanimage)
  • It also "supports" digital-only documents (but for now they do not have any IDs)

Disclaimer

Please check the known issues before using!

I started this project for my own personal needs. Because I have no problem sharing or working on it with others, I decided to publish it on GitHub and mirror it to my own Gitea instance. However, especially for the first versions, it might not fit your use case and is hard to change. So before using this script, check if it fits your needs. It may help you to read my own use case described below.

This idea might not be a new, innovative solution; however, I did not find any comparable solution which fit my use case. If you find other solutions which you deem better for you, use them. I would also be happy to know about them and have no problem listing them below for comparison. Create an issue or PR if you want to propose a similar project which should be listed below.

Known Issues / ToDo

I know about these limitations and I (or you) might resolve them in the future:

  • Because of dependencies, it only works on UNIX so far.
  • It assumes that your scanner will scan and return both sides of a scanned paper.
  • Digital only documents for now have no IDs assigned.
  • Ultimately, all scanned documents will be compressed with JPEG, but in a reasonable quality
  • It is not really aware of the sorting of the paper documents, it just manages the IDs for now.
  • Its configuration is "hard-coded" in the script making updating harder.
  • The index will not be maintained automatically.
  • The default configuration reflects my own setup and may not be sane defaults for you.
  • It uses the current directory to determine the location of your documents. If executed from another directory, it might access other files and store scans elsewhere.

How It Works (TL;DR)

Essentially, the system works as follows:

  • each incoming page of a paper document will be scanned immediately and be assigned a unique, incrementing ID, which is encoded in the filename
    • odd IDs indicate the front side of a page, even IDs indicate the back side of a page
    • meaning the sides from ID 1 to 200 reference 100 pages of paper
  • the documents will be inserted in order of the IDs into binders
  • each binder is labelled with the first and last IDs it holds
    • the binders can also be split by separators to find paper documents faster
  • it is NOT required to label each document because of insertion order
    • this allows you to preserve the original without modification (like a label)
    • this speeds up the insertion of hundreds of pages a lot (assuming an ADF scanner)
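
The odd/even ID convention above can be illustrated with a few lines of Python. This is only a sketch of the arithmetic, not code taken from maintain.py; the function names `describe_id` and `counterpart` are made up for this example.

```python
def describe_id(side_id: int) -> tuple[int, str]:
    """Map a side ID to (sheet number, side); IDs start at 1."""
    if side_id < 1:
        raise ValueError("IDs start at 1")
    sheet = (side_id + 1) // 2  # IDs 1,2 -> sheet 1; IDs 3,4 -> sheet 2
    side = "front" if side_id % 2 == 1 else "back"
    return sheet, side


def counterpart(side_id: int) -> int:
    """Return the ID of the other side of the same sheet of paper."""
    return side_id + 1 if side_id % 2 == 1 else side_id - 1
```

For example, `describe_id(200)` yields sheet 100, back side, which matches the statement that IDs 1 to 200 cover 100 pages of paper.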

How To Use

Preparations

You need:

  • a SANE compatible scanner or printer (see list of supported devices)
    • scanner with ADF is highly recommended
    • the feature to scan both sides automatically is also recommended
    • however, manual scanning with a flatbed is supported as well
  • empty ring binders
  • a hole punch
  • (optionally) plastic wraps for important paper documents which you might not want to hole punch

On your computer, the following tools should be installed (hopefully available in your distribution's repository):

  • ocrmypdf and its dependencies
    • could also call tesseract directly
  • parallel (GNU variant, optional)
  • utils from poppler
    • pdftotext
    • pdfunite
  • utils from SANE
    • scanimage

First Setup

  1. Set up at least one binder
    • label the binder with the range of IDs it will contain (starting at 1 for the first binder & section)
    • (optionally) each binder should have separators which you can label with the IDs they (will) contain
  2. Create a directory on your computer for your scanned PDFs and copy the ./maintain.py script into it
    • make it executable with chmod +x ./maintain.py
    • you can already create the subdirectories for the categories (which can be nested as well); however, you can also create them later as needed
  3. Customize the configuration to your liking
    • for now, the configuration is stored inside the script at the beginning
    • especially configure the scan sources of your scanner (USE_ADF_BY_DEFAULT / ADF_SCAN_SOURCE / FLATBET_SCAN_SOURCE)
  4. Migrate your old documents into the system by applying the steps below per document / page
    • the order is irrelevant, but I recommend chronological order, as future documents will be inserted in the same order

Example

  • I decided to hold up to 300 pages (600 sides or IDs) per binder
  • Each binder is divided by separators into 6 sections, each containing 50 pages (100 sides or IDs)
  • If a document spans multiple sections or binders and is therefore split up, I still insert it in ID order because I honor my system more than the "integrity" of the document
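
With the numbers from this example, the physical location of any ID follows from simple integer division. The sketch below assumes exactly this layout (600 IDs per binder, 100 IDs per section); the constants and function names are illustrative and not part of maintain.py.

```python
IDS_PER_BINDER = 600   # 300 pages per binder, as in the example above
IDS_PER_SECTION = 100  # 50 pages per separator section


def locate(side_id: int) -> tuple[int, int]:
    """Return (binder number, section number), both counted from 1."""
    if side_id < 1:
        raise ValueError("IDs start at 1")
    index = side_id - 1  # zero-based position of this side
    binder = index // IDS_PER_BINDER + 1
    section = (index % IDS_PER_BINDER) // IDS_PER_SECTION + 1
    return binder, section


def binder_label(binder: int) -> str:
    """Return the ID range to write on the spine of a binder."""
    first = (binder - 1) * IDS_PER_BINDER + 1
    last = binder * IDS_PER_BINDER
    return f"{first}-{last}"
```

So ID 601 is the first side in the second binder, and that binder would be labelled `601-1200`.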

How To Add New Documents

I recommend following the steps one by one per page at the beginning so the order does not get messed up. Once you feel safe, you can start to batch the tasks if you insert multiple pages at once.

I separate each document into its pages so it can be scanned automatically if possible; I even remove staples if required.

Per page:

  1. Scan both sides with ./maintain.py scan
    • really either scan both sides or call ./maintain.py scan again after each page, because otherwise scanimage will mess up the IDs: it is not aware that the ID pairs should match the front & back of a page
    • with --adf you can force use ADF and it will continue to scan all pages available
    • with --flatbed you can force use the flatbed (e.g. for "special" documents)
    • by default, it will automatically apply OCR and convert the documents to PDFs
      • to speed up conversion of multiple pages by using parallel, add --skip-convert and execute ./maintain.py convert --output-commands | parallel after scanning
    • after scanning, you might remove empty back pages, the script will still select the next ID correctly (see ./maintain.py next-id)
  2. Add ring holes using a hole punch if required
    • OR insert document into a plastic wrap with ring holes
  3. Insert document into the latest binder at the end
    • Check on the IDs assigned if you want to place the document behind the next separator or in the next binder
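
Why a removed empty back page does not disturb the numbering can be sketched as follows: if every new sheet starts on an odd ID, the next ID depends only on the highest ID still present. This is an assumption about the behaviour, not the actual logic behind ./maintain.py next-id, and it only looks at unsorted scans named outXXXX.jpg or outXXXX.png; the real script may consider other files too. The name `next_free_id` is hypothetical.

```python
import re

# Assumed naming of fresh scans: "out" + digits + ".jpg" or ".png"
SCAN_NAME = re.compile(r"^out(\d+)\.(?:jpg|png)$")


def next_free_id(names) -> int:
    """Return the next front-side (odd) ID after the highest scanned ID."""
    ids = [int(m.group(1)) for name in names if (m := SCAN_NAME.match(name))]
    highest = max(ids, default=0)
    # After an even ID (a back side) the next ID is simply highest + 1.
    # After an odd ID (back side missing) skip the unused even ID.
    return highest + 1 if highest % 2 == 0 else highest + 2
```

With only `out0007.jpg` left (its empty back side 8 deleted), this returns 9, the same value as if `out0008.jpg` were still there.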

How To Sort & Combine

By default, the documents are called outXXXX.jpg or outXXXX.png. If you want to add date & title to your document or sort it into a category, you can use ./maintain.py merge --id <IDs>. <IDs> is a comma-separated list of entries, each of which can be

  • a single ID, e.g. 123
  • a single ID with its counterpart ID (the other side), suffix +, e.g. 453+,88+ == 453,454,87,88
  • a single ID with its following page, suffix ++, e.g. 869++ == 869+,871+ == 869,870,871,872
  • an ID range, start and end separated by -, e.g. 123-128 == 123,124,125,126,127,128
  • a suffix of # (compatible with +, ++ and ranges) will also select all "context pages", by default ±10 pages, e.g. 100# == 90-110 (not useful for merge but for other commands)

The order of the IDs will determine the order of the pages later on. However, for merge:

  • single IDs will be completed to both sides so both sides end up in the same PDF at the end
  • missing IDs will be ignored (so missing back pages will not cause any error)

You can append --view so you will see the resulting document to verify. If you abort the process before answering the last question, e.g. by using CTRL+C, nothing will be changed.
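
The ID syntax described above can be sketched as a small parser. This is an illustration written from the examples in this README, not the parser used in maintain.py. Two points are assumptions: `#` is read as widening the selection by 10 IDs on each side (which reproduces 100# == 90-110), and `+` / `++` are only accepted on single IDs, not on ranges. All names are hypothetical.

```python
import re

CONTEXT = 10  # assumed reading of "±10": widen the selection by 10 IDs per side

# <start>, then optionally "-<end>" or "+" / "++", then optionally "#"
ENTRY = re.compile(r"^(\d+)(?:-(\d+)|(\+\+?))?(#)?$")


def sheet(side_id: int) -> list[int]:
    """Both side IDs of the sheet side_id belongs to, front (odd) first."""
    front = side_id if side_id % 2 == 1 else side_id - 1
    return [front, front + 1]


def expand_entry(entry: str) -> list[int]:
    """Expand one comma-separated entry into a list of IDs."""
    match = ENTRY.match(entry.strip())
    if match is None:
        raise ValueError(f"cannot parse ID entry: {entry!r}")
    start, end, plus, context = match.groups()
    first = int(start)
    if end is not None:
        ids = list(range(first, int(end) + 1))  # inclusive range
    elif plus == "+":
        ids = sheet(first)  # the ID and its counterpart
    elif plus == "++":
        ids = sheet(first) + sheet(sheet(first)[0] + 2)  # plus the next sheet
    else:
        ids = [first]
    if context and ids:
        ids = list(range(max(1, min(ids) - CONTEXT), max(ids) + CONTEXT + 1))
    return ids


def parse_ids(spec: str) -> list[int]:
    """Expand a full <IDs> argument, keeping the order of the entries."""
    return [i for entry in spec.split(",") for i in expand_entry(entry)]
```

Keeping the entries in the order given mirrors the rule that the order of the IDs determines the order of the pages.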

First, it asks for the date of the document. By default, the current date will be proposed. By using the arrow keys, you can select one of all dates found inside the document.
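
How dates might be collected from the OCR text can be sketched like this. Which date formats maintain.py actually recognizes is not stated here, so the two patterns below (ISO `YYYY-MM-DD` and the German `DD.MM.YYYY`) are assumptions, and `find_dates` is a hypothetical name. The input would be plain text, for example as produced by pdftotext.

```python
import re
from datetime import date

# Assumed formats; the real script may recognize others.
ISO = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
GERMAN = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")


def find_dates(text: str) -> list[date]:
    """Return all valid dates found in text, sorted and without duplicates."""
    candidates = [(int(y), int(m), int(d)) for y, m, d in ISO.findall(text)]
    candidates += [(int(y), int(m), int(d)) for d, m, y in GERMAN.findall(text)]
    dates = set()
    for year, month, day in candidates:
        try:
            dates.add(date(year, month, day))
        except ValueError:
            pass  # looks like a date but is not one, e.g. 31.02.2020
    return sorted(dates)
```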

Second, it asks for the title of the document. To assist you, the most used words per page will be displayed above. Because each side has its own ID and each document its own date, the titles are not required to be unique. I even recommend using the same title for documents of the same kind.
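
The "most used words per page" hint can be approximated with a word counter over the OCR text of a page. This is a minimal sketch; the minimum word length, the number of words shown, and the name `most_used_words` are assumptions and may differ from what maintain.py does.

```python
import re
from collections import Counter


def most_used_words(text: str, limit: int = 5, min_length: int = 4) -> list[str]:
    """Return the most frequent words of a page, most frequent first."""
    # [^\W\d_]+ matches runs of letters, including non-ASCII ones like umlauts
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(word for word in words if len(word) >= min_length)
    return [word for word, _ in counts.most_common(limit)]
```

Skipping very short words is a cheap way to drop articles and prepositions without needing a stop-word list per language.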

Finally, it asks you where to put the document. If one of the documents was already sorted into a category, that category will be proposed. You can browse through all categories using the arrow keys and search through them using CTRL+R.

Because no database is held, you can rename the files manually as well.

My Use Case

I am kind of a perfectionist and a lazy person, which resulted in me throwing every paper document I received into a single ring binder. I did not come up with a "perfect" list of categories and a way to distribute them across different binders to

  • minimize space (binders) required to hold all (important) documents
  • allow each category to hold all documents which I might receive in the future
  • be able to find a required document quickly

Also, I want all my documents to be accessible on all my devices. This is easy to accomplish with already digital documents; however, our world requires real paper documents, especially in Germany. So I wanted to scan every document to be able to store them on my personal cloud. However, the documents there would also need to be sorted appropriately. At least the digital world has the advantage that re-sorting documents into new categories scales a lot better and might also be automatable. But still, this would require me to keep both worlds, the analog and the digital, sorted, which means more work.

To solve both problems in an easy way, I introduced a system to "store" all my paper documents in binders so I only need to sort them in the digital world. If I then might need the original paper document, I can search for the desired document on my computer and look up where it is stored.

Other Projects