FenScribe

A Smart PDF Layout Optimizer

Automatically detect and remove blank areas from images or PDFs to reduce printing costs.

Samples

[Figure: FenScribe Graphical User Interface]

[Figure: Example of PDF Optimization]

Key Features

  • Detect and remove long blank regions (by brightness threshold). New: marginal scan edge trimming (advanced option).
  • Split pages into content segments (per-line) and save as individual images.
  • Optional Word export: insert segments into a .docx template.
  • New: Automatic image width fitting for Word documents using python-docx, reducing the need for post-processing macros.

Installation & Dependencies

1. First, download or clone the repository:

  • Option A (Git): Clone the repository using Git:
    git clone https://github.com/ordylan/FenScribe.git
    cd FenScribe
  • Option B (Download ZIP): Download the ZIP file from the GitHub page: https://github.com/ordylan/FenScribe. Then, unzip the file and navigate into the extracted directory in your terminal.

2. Install the required Python libraries using pip:

pip install PyMuPDF Pillow python-docx tkinterdnd2 pywin32

Note: pywin32 is required only if you want the program to inject and run VBA macros in Word (Windows only). If you only need image output or python-docx insertion, neither Microsoft Word nor pywin32 is needed.
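As a rough illustration of the note above, the following sketch checks whether the optional macro step could run on the current machine. The function name `can_run_word_macros` is hypothetical and not part of FenScribe; the check only confirms that the platform is Windows and that pywin32's `win32com` package is importable, not that Microsoft Word itself is installed.

```python
import importlib.util
import sys


def can_run_word_macros() -> bool:
    """Return True if pywin32 looks usable for Word automation.

    Hypothetical helper for illustration only. It does not verify
    that Microsoft Word is installed.
    """
    # Word automation through pywin32 is only possible on Windows.
    if sys.platform != "win32":
        return False
    # find_spec returns None when a top-level package cannot be found.
    return importlib.util.find_spec("win32com") is not None


if __name__ == "__main__":
    if can_run_word_macros():
        print("pywin32 found: the VBA macro step can be attempted.")
    else:
        print("pywin32 unavailable: image output and python-docx insertion only.")
```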

Usage

  1. Navigate to the directory where you downloaded/cloned FenScribe.
  2. Launch the graphical interface by double-clicking gui.pyw or by running python gui.pyw in your terminal. A CLI version is currently unavailable.
  3. After processing, the output Word document (output_*.docx) will be saved in the __Output/ directory.

Typical Workflow

  1. Select a PDF file, single image, or a folder containing images.
  2. Tune parameters in Step 2 (threshold, DPI, min_height, blank_height).
  3. Choose a .docx template in Step 3 if Word output is desired.
  4. Start processing. Images/segments are saved under _temp/<input>_segments. If Word insertion is enabled, the final document is written to __Output/output_<basename>.docx.
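The output locations in the workflow above can be sketched as a small path helper. This is an assumption-laden illustration: the function name `output_paths` is hypothetical, and it assumes that both `<input>` and `<basename>` mean the input file or folder name without its extension, which the description does not state explicitly.

```python
from pathlib import Path


def output_paths(input_path: str, root: str = "."):
    """Return (segments_dir, docx_path) for a given input.

    Hypothetical helper: assumes <input> and <basename> are the
    input name without its extension.
    """
    base = Path(input_path).stem
    segments_dir = Path(root) / "_temp" / f"{base}_segments"
    docx_path = Path(root) / "__Output" / f"output_{base}.docx"
    return segments_dir, docx_path


if __name__ == "__main__":
    seg_dir, docx = output_paths("docs/report.pdf")
    print(seg_dir.as_posix())
    print(docx.as_posix())
```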

Configuration Parameters

These parameters can be adjusted within the GUI or potentially in a future configuration file:

Parameter    | Description
threshold    | Brightness threshold for blank line detection (0-255). Converts RGB to grayscale average; rows ≥ threshold are considered blank. Higher values detect lighter grays as blank.
dpi          | Image resolution (dots per inch) used when converting PDF pages to images for analysis. Higher DPI increases precision but also processing time and memory usage.
min_height   | Content validity filter (in pixels). Only preserves content blocks with height ≥ this value. Helps filter out small noise or artifacts.
blank_height | Paragraph separation baseline (in pixels). Content is split into separate paragraphs when consecutive blank lines reach this height. Defines the minimum vertical gap considered a paragraph break.
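To make the interaction between threshold, min_height, and blank_height concrete, here is a minimal sketch of row-based segmentation. It is not FenScribe's actual implementation: the function name `find_segments`, the default values, and the input format (one average grayscale value per pixel row) are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def find_segments(
    row_brightness: Sequence[float],
    threshold: float = 240,   # assumed default, for illustration only
    min_height: int = 10,     # assumed default, for illustration only
    blank_height: int = 20,   # assumed default, for illustration only
) -> List[Tuple[int, int]]:
    """Return (start, end) row ranges of content blocks, end exclusive."""
    segments: List[Tuple[int, int]] = []
    start = None    # first row of the block currently being collected
    blank_run = 0   # consecutive blank rows seen inside the current block

    for y, brightness in enumerate(row_brightness):
        if brightness >= threshold:          # this row counts as blank
            if start is None:
                continue                     # still in leading blank space
            blank_run += 1
            if blank_run >= blank_height:    # gap is large enough to split
                end = y - blank_run + 1      # row just after the last content row
                if end - start >= min_height:
                    segments.append((start, end))
                start = None
                blank_run = 0
        else:                                # this row has content
            if start is None:
                start = y
            blank_run = 0                    # short gaps stay inside the block

    # Close a block that runs to the bottom of the page.
    if start is not None:
        end = len(row_brightness) - blank_run
        if end - start >= min_height:
            segments.append((start, end))
    return segments


if __name__ == "__main__":
    rows = [255] * 5 + [0] * 12 + [255] * 3 + [0] * 4 + [255] * 30 + [0] * 3 + [255] * 2
    print(find_segments(rows))
```

In this sketch, gaps shorter than blank_height stay inside the surrounding block, and blocks shorter than min_height are discarded as noise. Because min_height and blank_height are measured in pixels, the same physical gap on a page spans more rows at a higher dpi, so these values generally need to be raised or lowered together with dpi.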

Macros & Word Automation

The app can optionally inject and run a VBA macro from the _Macros folder after producing the .docx file. This step requires Microsoft Word and pywin32 (Windows only).

Enable VBA access in Word: File → Options → Trust Center → Trust Center Settings → Macro Settings → Trust access to the VBA project object model.

Note: the program now attempts to auto-fit images to the document/column width using python-docx. The macro remains available as an optional post-processing step.
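The arithmetic behind fitting an image to the available width can be sketched as follows. This is only an illustration of the scaling step, not FenScribe's code: the function name `fit_to_width` is hypothetical, and it assumes that every image is scaled to fill the available width exactly (enlarging narrow images as well as shrinking wide ones).

```python
from typing import Tuple


def fit_to_width(
    img_width_px: int,
    img_height_px: int,
    available_width_in: float,
    dpi: float,
) -> Tuple[float, float]:
    """Return (width, height) in inches after scaling to the available width.

    Hypothetical helper: assumes the image always fills the full width.
    """
    # Natural size of the image on paper at the given resolution.
    natural_width_in = img_width_px / dpi
    natural_height_in = img_height_px / dpi
    # One scale factor for both dimensions keeps the aspect ratio.
    scale = available_width_in / natural_width_in
    return available_width_in, natural_height_in * scale


if __name__ == "__main__":
    # A 2400 x 600 px segment rendered at 300 DPI, placed in a 6-inch column.
    print(fit_to_width(2400, 600, 6.0, 300))
```

With python-docx, the available width of a section is typically the page width minus the left and right margins, and passing only a width to `add_picture` preserves the aspect ratio, so the height computed here is shown for illustration rather than something that needs to be set by hand.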

Troubleshooting

License

This project is licensed under the MIT License.

Development Notes

This third-generation version of FenScribe features:

  • A Graphical User Interface (GUI) implementation for more user-friendly operation.
  • Partial utilization of AI-assisted development tools during its creation.
  • Continuous optimization and refinement through multiple development iterations.
  • Focus on converting PDF content to Word format (.docx) for easier editing and manipulation post-processing.