From 69d797851273fb05261283ce5436ce037c6ad6a9 Mon Sep 17 00:00:00 2001 From: Felix Stupp Date: Sat, 11 Jun 2022 23:50:46 +0200 Subject: [PATCH] Added README --- README.md | 201 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 201 insertions(+) create mode 100644 README.md diff --git a/README.md b/README.md new file mode 100644 index 0000000..b25b783 --- /dev/null +++ b/README.md @@ -0,0 +1,201 @@ +# scansystem + +scansystem is a single file Python script to maintain (hence its script name) a directory containing your personal documents as PDFs. +It helps you with scanning them, applying OCR and sorting them based on categories you make up as you like. +It assigns incremental IDs to scanned documents which can help in finding + +This script runs "stateless" without a database (everything required is encoded in human readable filenames or stored inside the PDFs itself). +So in contrast to other, more complex tools this script may provide this unique features to you: + +- You do not need to use this script to access your documents, resorting them or even adding new ones. It is built to be just a small helper. +- It's compatible with every file synchronization solution (a.k.a. cloud) and backup solution you already have and might want to use in the future. +- You can use the tools you want to search through your documents (like `pdfgrep` or Recoll) +- You can use it without first configuring a server or anything similar. +- If this script stops working for you (for any reason), you can just continue on with your life without it. +- You can use the script (only) from the terminal. + +Other features: + +- It can combine multiple sides / pages into a single PDF which results in a single PDF having multiple IDs (ID per side) assigned. +- For "special documents" (or if your scanner is not capable of [ADF][wiki-adf]) you can use an interactive mode (provided by scanimage) +- It also "supports" digital only documents (but for now the do not have any IDs) + + +## Disclaimer + +**Please check the known issues before using!** + +I started this project for my own personal needs. +Because I have no problem sharing or working on it with others, +I decided to publish it on [GitHub][self-github] and mirror it to my [own Gitea instance][self-gitea]. +However, especially for the first versions, it might not fit your use case and is hard to change. +So before using this script, check if it fits your needs. +It may help you to read my own use case described below. + +This idea might not be a new, innovative solution, however I did not found any comparable solution which fit my use case. +If you find other solutions which you deem better for you, use them. +I would also be happy to know about them and have no problem listing them below for comparison. +Create an issue or PR if you might propose a similar project which should be listed below. + + +## Known Issues / ToDo + +I know about these limitiations and I (or you) might resolve these in the future: +- Because of dependencies, it only works on UNIX so far. +- It assumes that your scanner will scan and return both sides of a scanned paper. +- Digital only documents for now have no IDs assigned. +- Ultimately, all scanned documents will be compressed with JPEG, but in a reasonable quality +- It is not really aware of the sorting of the paper documents, it just manages the IDs for now. +- Its configuration is "hard-coded" in the script making updating harder. +- The index will not be maintained automatically. +- The default configuration reflects my own and maybe are not sane defaults. + + +## How It Works (TL;DR) + +Essentially system works as follows: +- each incomming page of a paper document will be scanned immediatally and be assigned an unique and incrementing ID, which is encoded in the filename + - odd IDs indicate the front side of a page, even IDs indicate the back side of a page + - meaning the sides from ID 1 to 200 reference 100 pages of paper +- the documents will be inserted in order of the IDs into binders +- each binder is labelled with the first and last IDs it holds + - the binders can also be split by separators to find paper documents faster +- it is NOT required to label each document because of insertion order + - this allows to preserve the original without modification (like a label) + - this speeds up the insertion of 100's of pages a lot (assuming [ADF][wiki-adf] scanner) + + +## How To Use + +### Preperations + +You need: +- a [SANE][sane-web] compatible scanner or printer (see [list of supported devices][sane-supported]) + - scanner with [ADF][wiki-adf] is highly recommended + - the feature to scan both sides automatically is also recommeded + - however manual scanning with a flatbed is already supported +- empty ring binders +- a hole punch +- (optionally) plastic wraps for important paper documents which you might not want to hole punch + +On your computer following tools should be installed (hopefully available in your distributions repository): +- [ocrmypdf][ocrmypdf-github] and its dependencies + - could also call `tesseract` directly +- [parallel][parallel-web] (GNU variant, optional) +- utils from [poppler][poppler-web] + - `pdftotext` + - `pdfunite` +- utils from [SANE][sane-web] + - `scanimage` + +### First Setup + +1. Set up at least one binder + - label the binder with the range of IDs it will contain (starting at 1 for the first binder & section) + - (optionally) each binder should have separators which you can label with the IDs they (will) contain +2. Create a directory on your computer for your scanned PDFs and copy the `./maintain.py` script into it + - make it executable with `chmod +x ./maintain.py` + - you can already create the sub directories for the categories (which can be nested as well), however you can also create them dynamically +3. Customize the configuration to your liking + - for now, the configuration is stored inside the script at the beginning + - especially configure the scan sources of your scanner (`USE_ADF_BY_DEFAULT` / `ADF_SCAN_SOURCE` / `FLATBET_SCAN_SOURCE`) +4. Migrate your old documents into the system by applying the steps below per document / page + - the order is irrelevant, I recommend in chronical order as future documents will be inserted in the same order + +#### Example + +- I decided to hold up to 300 pages (600 sides or IDs) per binder +- Each binder is separated by separators into 6 sections containing each 50 pages (100 sides or IDs) +- If a document will span multiple sections or binders but is separated, I will still insert it correctly because I honor my system more than the "integrity" of the document + +### How To Add New Documents + +I recommend to follow the steps one by one per page at the beginning so the order does not get messed up. +If you feel safe, you can start to batch the tasks if you insert multiple pages at once. + +*I seperate each document into its pages so it can be scanned automatically if possible, I even remove staples if required.* + +Per page: +1. Scan both sides with `./maintain.py scan` + - really either scan both sides or recall `./maintain.py scan` after each page because otherwise `scanimage` will mess up the IDs as it is not aware that the ID pairs should match front & back page + - with `--adf` you can force use [ADF][wiki-adf] and it will continue to scan all pages available + - with `--flatbed` you can force use the flatbed (e.g. for "special" documents) + - by default, it will automatically apply OCR and convert the documents to PDFs + - to speed up conversion of multiple pages by using parallel, add `--skip-convert` and execute `./maintain.py convert --output-commands | parallel` after scanning + - after scanning, you might remove empty back pages, the script will still select the next ID correctly (see `./maintain.py next-id`) +2. Add ring holes using a hole punch if required + - OR insert document into a plastic wrap with ring holes +3. Insert document into the latest binder at the end + - Check on the IDs assigned if you want to place the document behind the next separator or in the next binder + +### How To Sort & Combine + +Per default, the documents are called `outXXXX.jpg` or `outXXXX.png`. +If you want to add date & title to your document or sort it into a category, +you can use `./maintain.py merge --id `. +`` might be a comma separated list of IDs which can be +- a single ID, e.g. `123` +- a single ID with its counterpart ID (the other side), suffix `+`, e.g. `453+,88+` == `453,454,87,88` +- a single ID with its following page, suffix `++`, e.g. `869++` == `869+,871+` == `869,870,871,872` +- an ID range, start and end separated by `-`, e.g. `123-128` == `123,124,125,126,127,128` +- a suffix of `#` (compatible to `+`, `++` and ranges) will also select all "context pages", by default ±10 pages, e.g. `100#` == `90-110` (not useful for merge but for other commands) +The order of the IDs will determine the order of the pages later on. However for merge: +- single IDs will be completed to both sides so both sides end up in the same PDF at the end +- missing IDs will be ignored (so missing back pages might not cause any error) +You can append `--view` so you will see the resulting document to verify. +If you abort the process before answering the last question, e.g. by using `CTRL+C`, nothing will be changed. + +First, it asks for the date of the document. +By default, the current date will be proposed. +By using the arrow keys, you can select one of all dates found inside the document. + +Second, it asks for the title of the document. +To assist you, the most used words per page will be displayed above. +Because each side has its own ID and each document its own date, the titles are not required to be unique. +I even recommend using the same title for documents of the same kind. + +At last, it asks you where to put the document to. +If one document was already sorted into a category, it will proposed. +You can browse through all categories using the arrow keys and search through them using `CTRL+R`. + +Because no database is held, you can rename the files manually as well. + + +## My Use Case + +I am kind of a perfectionist and a lazy person, +which resulted in that I throwed every paper document I received in a single ring binder. +I did not came up with a "perfect" list of categories and how to distribute them accross different binders to +- minimize space (binders) required to hold all (important) documents +- allow each category to allow all documents which I might receive in future +- be able to find a required document quickly + +Also I want all my documents to be accessable on all my devices. +This is easy to accomplish with already digital documents, +however our world requires real paper documents, especially in Germany. +So I wanted to scan every document to be able to store them on my personal cloud. +However the documents there would also be required to be sorted approriatly. +At least the digital world has the advantage that resorting documents in new categories scales a lot better and might also be automatable. +But still, this would require me to keep both worlds, the analog and the digital world sorted which means more work. + +To solve both problems in an easy way, +I introduced a system to "store" all my paper documents in binders so I only need to sort them in the digital world. +If I then might need the original paper document, I can search for the desired document on my computer and look up where it is stored. + + +## Other Projects + +- see [Awesome-Selfhosted][awesome-selfhosted] + + + + +[awesome-selfhosted]: https://github.com/awesome-selfhosted/awesome-selfhosted#document-management= "Document Management on Awesome-Selfhosted" +[ocrmypdf-github]: https://github.com/jbarlow83/OCRmyPDF "OCRmyPDF on GitHub" +[parallel-web]: https://www.gnu.org/software/parallel/ "GNU's parallel" +[poppler-web]: https://poppler.freedesktop.org/ "Poppler" +[sane-supported]: http://www.sane-project.org/sane-supported-devices.html "SANE - Supported Devices" +[sane-web]: http://www.sane-project.org/ "SANE - Scanner Access Now Easy" +[self-gitea]: https://git.banananet.work/zocker/scansystem "Self-Hosted Gitea Mirror" +[self-github]: https://github.com/Zocker1999NET/scansystem "Official GitHub Repository" +[wiki-adf]: https://en.wikipedia.org/wiki/Automatic_document_feeder "Automatic document feeder on Wikipedia"