HOW TO EXTRACT INFORMATION FROM PDF

Question

I am a new FME user and am trying to extract some text from a PDF. After reading this forum I came up with this:

Using an AttributeFilter, I selected the page number that contains the information I need, and the Inspector showed me this:

The yellow text is the information I need to extract. How do I extract it to an Excel file or a CSV?

Answer 1 · 2019-05-24T20:53:06Z

The PDF reader has a parameter (under Non-Spatial > Read Tagged Tables) which controls reading tagged tables as a feature type. If a tagged table is present in your PDF, features will be output from the pdf_table feature type.

You may want to confirm whether or not your input dataset contains tagged tables, as using this parameter would be the easiest way of extracting the information. You may also want to try decompressing your PDF file as suggested here before reading, as this allows the PDF reader to read tagged tables from certain datasets.

As an aside, if you want to see the PDF reader support reading non-tagged tables, please feel free to vote on this Idea.

If none of the above suggestions work for you, one workaround would be relating the insertion points of text features to a table cell and extracting the text strings which fall within the cell areas. I have attached an example workspace demonstrating this workflow here: getTextFromPDFTable.fmwt
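To illustrate the idea behind that workaround outside of FME, here is a minimal Python sketch. It is not the contents of the attached workspace: the `Cell` class, the `texts_by_cell` helper, and all coordinates and strings are hypothetical, and each table cell is assumed to be an axis-aligned rectangle. Each text string is assigned to the cell whose area contains its insertion point, and text that falls outside every cell is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Hypothetical table cell described by its bounding rectangle."""
    row: int
    col: int
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def contains(self, x, y):
        # Half-open bounds so a point on a shared edge matches only one cell.
        return self.xmin <= x < self.xmax and self.ymin <= y < self.ymax


def texts_by_cell(cells, texts):
    """Map (row, col) to the strings whose insertion point falls in that cell."""
    result = {(c.row, c.col): [] for c in cells}
    for x, y, string in texts:
        for c in cells:
            if c.contains(x, y):
                result[(c.row, c.col)].append(string)
                break  # a point belongs to at most one cell
    return result


# Hypothetical 2 x 2 grid: row 0 is the header row, row 1 holds the values.
cells = [
    Cell(0, 0, 0, 10, 50, 20),
    Cell(0, 1, 50, 10, 100, 20),
    Cell(1, 0, 0, 0, 50, 10),
    Cell(1, 1, 50, 0, 100, 10),
]

# Hypothetical text features as (x, y, string) insertion points.
texts = [
    (5, 12, "X error (cm)"),
    (55, 12, "Y error (cm)"),
    (5, 2, "0.275764"),
    (55, 2, "0.699132"),
    (200, 200, "page footer"),  # outside the table, so it is dropped
]

table = texts_by_cell(cells, texts)
```

Once every string is keyed by row and column, writing the table out as rows for Excel or CSV is a matter of iterating over the keys in order.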

Answer 2 · 2019-05-21T23:20:17Z

One option that you could try would be to use the `Non-Spatial > Read Non-Spatial Text` mode.

This produces all of the text that can be found for each page, and it may be easier to extract the information you're looking for from that output.

In your case I would expect the feature text to contain lines like:

"X error (cm) Y error (cm) Z error (cm) XY error (cm) Total error (cm)"

"0.275764 0.699132 4.04833 0.751553 4.11799"

You could use an AttributeSplitter to split the lines, and another one to split each line by whitespace.
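The same two-step split can be sketched in plain Python, for example to check the logic before building it in a workspace. This is only an illustration under stated assumptions: the `page_text` string is made up from the two lines quoted above, `parse_error_table` and `to_csv` are hypothetical helper names, and the header is assumed to follow the "<name> error (cm)" pattern. Because the column names themselves contain spaces, the header is matched with a regular expression rather than split on whitespace.

```python
import csv
import io
import re

# Hypothetical page text, modelled on the two lines quoted above.
page_text = (
    "X error (cm) Y error (cm) Z error (cm) XY error (cm) Total error (cm)\n"
    "0.275764 0.699132 4.04833 0.751553 4.11799\n"
)


def parse_error_table(text):
    """Split the text into lines, then split the value line on whitespace."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    header_line, value_line = lines[0], lines[1]
    # Column names contain spaces, so capture each "<name> error (cm)" group.
    headers = re.findall(r"\S+ error \(cm\)", header_line)
    values = [float(v) for v in value_line.split()]
    if len(headers) != len(values):
        raise ValueError("header and value counts differ")
    return headers, values


def to_csv(headers, values):
    """Return the table as CSV text (one header row, one value row)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(headers)
    writer.writerow(values)
    return buffer.getvalue()


headers, values = parse_error_table(page_text)
csv_text = to_csv(headers, values)
```

Writing `csv_text` to a file gives a CSV that Excel can open directly.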

Answer 3 · 2019-05-21T19:36:39Z

Answer by grazielatm · May 21 at 07:36 PM

@danilo_fme, thank you very much. I am going to check on that. Since I am a new user, I'm still struggling with FME. :)


Answer 4 · 2019-05-21T17:23:12Z

Answer by danilo_fme · May 21 at 05:23 PM

Hi @grazielatm,

Could you share the workspace template (.fmwt) or your PDF with us?

Thanks,

Danilo

