FenScribe

✨ Smart PDF Layout Optimizer ✂️

Reduce printing costs by automatically detecting and removing blank spaces in PDF documents, intelligently rearranging content.

FenScribe Graphical User Interface

Example of PDF Optimization

1. First, download or clone the repository:

Option A (Git): Clone the repository using Git:

git clone https://github.com/ordylan/FenScribe.git
cd FenScribe

Option B (Download ZIP): Download the ZIP file from the GitHub page: https://github.com/ordylan/FenScribe. Then, unzip the file and navigate into the extracted directory in your terminal.

2. Install the required Python libraries using pip:

pip install PyMuPDF Pillow python-docx tkinterdnd2

Navigate to the directory where you downloaded/cloned FenScribe.
Run gui.pyw for the graphical interface (CLI version currently unavailable). Double-click the file or run python gui.pyw in your terminal.
After processing, the output Word document (output_*.docx) will be saved in the __Output/ directory.
Use the provided Office macro (图片溢出缩小.bas) to resize overflowing images in the generated Word document.
- Open the Word document (output_*.docx).
- Open the VBA editor (Alt + F11).
- Import the 图片溢出缩小.bas file (File > Import File...).
- Run the macro双栏图片溢出自动调() (Run > Run Sub/UserForm or press F5).
- The macro automatically detects the first column width in Word and scales oversized images to fit.

These parameters can be adjusted within the GUI or potentially in a future configuration file:

Parameter	Description
threshold	Brightness threshold for blank line detection (0-255). Converts RGB to grayscale average; rows ≥ threshold are considered blank. Higher values detect lighter grays as blank.
dpi	Image resolution (Dots Per Inch) used when converting PDF pages to images for analysis. Higher DPI increases precision but also processing time and memory usage.
min_height	Content validity filter (in pixels). Only preserves content blocks with height ≥ this value. Helps filter out small noise or artifacts.
blank_height	Paragraph separation baseline (in pixels). Content is split into separate paragraphs when consecutive blank lines reach this height. Defines the minimum vertical gap considered a paragraph break.

Potential Issues & Considerations:

Small images or geometric shapes might be accidentally removed or cropped during blank space removal if they fall below the min_height threshold.
Narrow vertical images or color blocks could be misidentified as separators. Adjusting configuration parameters like min_height or threshold might help.
Scanned documents must have reasonably horizontal text alignment. Pre-process significantly tilted pages (deskew) before using FenScribe for best results.
Manual margin trimming might still be required in the source PDF or output Word document to remove headers/footers if they interfere with content detection or are unwanted.
Performance depends on PDF complexity, DPI setting, and system resources. Large or complex PDFs may take time to process.

This project is licensed under the MIT License.

This third-generation version of FenScribe features:

A Graphical User Interface (GUI) implementation for more user-friendly operation.
Partial utilization of AI-assisted development tools during its creation.
Continuous optimization and refinement through multiple development iterations.
Focus on converting PDF content to Word format (.docx) for easier editing and manipulation post-processing.