@blackghost1987

This is mostly copied from the "text" example of the "pdf" crate, based on work by @s3bk
https://github.com/pdf-rs/pdf/blob/master/examples/text/src/main.rs

The FontInfo struct is created there as a way to cache the character maps retrieved from the Font resources. Unfortunately, it's not part of the library's main codebase, only the examples, so I can't import it and had to copy it as well. The handling of character maps doesn't look mature yet (based on the comment on line 24), but it worked fine for my use case, so I wanted to integrate it with the nicely typed Operations from this crate.
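To illustrate the caching idea, here is a minimal sketch in plain Rust. It is not the code from the pdf crate's example: the `FontInfo` fields, the `FontCache` type, and the `get_or_build` helper are all hypothetical, and a real character map would be built from the font's resources rather than passed in as a slice.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the example's FontInfo:
/// a character map kept around for one font.
#[derive(Debug, Default)]
pub struct FontInfo {
    /// Maps a character code to the Unicode text it stands for.
    cmap: HashMap<u16, String>,
    /// True if codes are two bytes wide (big-endian), false if one byte.
    two_byte: bool,
}

impl FontInfo {
    /// Builds a FontInfo from (code, text) pairs. Assumption: a real
    /// implementation would derive these pairs from the font itself.
    pub fn new(two_byte: bool, entries: &[(u16, &str)]) -> Self {
        FontInfo {
            cmap: entries
                .iter()
                .map(|&(code, text)| (code, text.to_string()))
                .collect(),
            two_byte,
        }
    }

    /// Decodes the raw bytes of a text operand into a String.
    /// Unmapped codes become U+FFFD; a trailing odd byte in
    /// two-byte mode is ignored.
    pub fn decode_string(&self, data: &[u8]) -> String {
        let mut out = String::new();
        if self.two_byte {
            for pair in data.chunks_exact(2) {
                let code = u16::from_be_bytes([pair[0], pair[1]]);
                self.push_code(code, &mut out);
            }
        } else {
            for &byte in data {
                self.push_code(u16::from(byte), &mut out);
            }
        }
        out
    }

    fn push_code(&self, code: u16, out: &mut String) {
        match self.cmap.get(&code) {
            Some(text) => out.push_str(text),
            None => out.push('\u{FFFD}'),
        }
    }
}

/// Hypothetical cache keyed by font resource name, so that each
/// font's character map is built only once.
#[derive(Default)]
pub struct FontCache {
    fonts: HashMap<String, FontInfo>,
}

impl FontCache {
    /// Returns the cached FontInfo for `name`, calling `build`
    /// only if the font has not been seen before.
    pub fn get_or_build<F>(&mut self, name: &str, build: F) -> &FontInfo
    where
        F: FnOnce() -> FontInfo,
    {
        self.fonts.entry(name.to_string()).or_insert_with(build)
    }
}
```

With something like this, the code handling a text-showing operation would look up the current font by name in the cache and pass the operand bytes to `decode_string`.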

I think the parsing of the Fonts is not really in the scope of this crate, but it would be really useful to use them if present.

@s3bk

s3bk commented May 25, 2021

Nitpick: I'd call it decode_string instead.

@s3bk

s3bk commented May 25, 2021

Regarding fonts:
While fonts play an important role in PDFs, there are plenty of cases that do not need them.
They are also a huge pain to get working reliably, and I don't want to put too much burden on the core PDF crate.
And finally, pdf-rs/font is just one of many font parsers and one may wish to use a different one (for example to get hinting).

I think text extraction is far from solved, and during this experimental stage copying code around is quite normal.

@s3bk

s3bk commented May 27, 2021

@blackghost1987 I am going to merge pdf-rs/pdf#89 soon, which may make this crate largely unnecessary.

My suggestion would be to create a pdf-tools or pdf-toolbox with the text extraction code and everything else that pops up but does not fit into the main crate.

@blackghost1987
Author

Great! It would be nice to have the typed Operations in the main PDF crate; that seems more logical.

My main use case is text extraction, so it would be awesome to have that as a feature in one of the crates, not just an example (even if it's not a full-fledged working solution at first; I know it's not that straightforward). It doesn't really matter which crate has the text extraction feature, though, so separate them as you like.

Feel free to close this one once it no longer makes sense.

@s3bk

s3bk commented May 29, 2021

https://github.com/pdf-rs/pdf_tools/ is online. For now it uses the dev branch of the pdf repo.
