Using PDF Metadata in HTML
Cuc Tu
06-19-2009, 07:51 PM
I want to code a simple HTML page that has hyperlinks to PDF documents.
I would like the displayed hyperlink text to pull the data from the "Title" field of the PDF metadata.
I have no idea how to do this, but I think it can be done as our company Google search appliance does this in its search results.
Paul Komski
06-20-2009, 07:35 AM
When search engines spider web sites they extract a lot of metadata (including any Title tags) along with the word stats and weightings of a page, all of which then get stored in their database. The results are therefore not dynamic but reflect the page as it was when last cached or analysed for changes. You could change the Title of any "spiderable" page and see how long it takes before the search engines update their results.
What you want is dynamic, either on the client side or, if you control the web host, on the server side. I daresay either could be used (some JavaScript for the former or PHP/MySQL for the latter), but is it really worth the effort?
Do also bear in mind that different users and different browsers will want either to display the pdf pages or else to download them as saved files. I personally think you would be better served by just writing the Title directly into the hyperlink's text. You could also enter the name as the title attribute of the link (http://support.microsoft.com/kb/264916) to provide a balloon tip when hovered over in most browsers.
Cuc Tu
06-22-2009, 05:10 PM
Would it be worth it?
Our documents are all serialized with a 5x5 number for file names. Very abstract, so we needed a usable interface for the users. I created a simple table with some javascript filtering so users could select a particular product (works just like auto filters in Excel). For example, the number series 00001-001xx are all for a particular product. This made it easy for me to manually add this attribute to the table and drop down list.
But the user is then presented with a list of 10 to 20 documents, and they don't know exactly what each one is unless they open the file. So I envisioned something like:
<a href="Documents\10570\10570-00002B.pdf" get.pdftitle>document.write(pdftitle)</a>
It would be worth it because I have several thousand links.
Paul Komski
06-23-2009, 05:04 AM
If you can get it to work for you then fine, given that you have several thousand links to edit, but since I don't know whether pdf documents have an intrinsic <title> tag of their own I can't really comment. I mean the sort of tag that this page of the forums has:
<title>Using PDF Metadata in HTML - The PC Guide Discussion Forums</title> ... that title then appears in the Title bar of the browser displaying the page.
I suspect that you may well be able to dig out the Title element (or attribute) of any specific tag on the page that contains the link using JavaScript, but that is not, I think, what you want. The getAttribute method (http://www.java2s.com/Code/JavaScriptReference/Javascript-Methods/getAttribute.htm) could be used to that effect for such Title attributes, if you see what I mean. The Title tag and the Title attribute are two very different things.
It looks as if the pages may be being served on an intranet, and if so presumably there is some database of Title names for each 5x5 ID maintained somewhere. If so, setting up PHP/MySQL (or ASP, which I haven't studied, on NT servers) could be a useful way to collate the data, edit it and then use it in HTML documents as the front end of the database.
Cuc Tu
06-24-2009, 01:40 AM
It seems so hard. Forget about the intranet.
Let's say I have a 2nd hard drive that I dump 15,000 JPG images on, and they are all named DSC00001, DSC00002, DSC00003, etc...
Now I make a simple HTML page with a list of hrefs to the files, each including a thumbnail of the image. It's easy as pie to make the links and thumbs since I already know all the file names, but the link text is hard as nails because I don't know what DSC00001 is all about.
BUT
If I right click on the image file and view properties, the "Summary" tab has a "Title" field that explains what the image is all about (mine are all filled in).
Is it not possible to script some simple "get" function to put that "Title" text onto my simple HTML page for each link to each image?
If not, what in the world do we even have metadata for?
This would be the same question for a collection of MP3 files.
Of course, I know that certain applications are able to exploit this information, but is there no mechanism that is standard in web-based presentation? This would seem completely fundamental to the whole metadata concept.
My document collection is actually on an intranet. That is transparent to me, as I merely have file access to a single drive/directory on the server, so it just looks like another network location. I don't have full access to the server, nor can I run applications there. I suppose I could buy an application to index this information, even build the HTML as ASP, install it on the server, and let it run. But not in my company... I even wondered whether, if I set this up on my local system and let it index the files and create the HTML, I could then copy the content over to the server?
So I guess there is nothing I can do on the client side?
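The "build it locally, then copy it over" idea could be sketched roughly as below, in Python. This is only an illustration: the function name build_link_list is made up, the titles are invented sample data, and it assumes the titles have already been collected somehow (by hand or by an extractor).

```python
import html
from pathlib import PureWindowsPath

def build_link_list(titles):
    """Return an HTML <ul> with one link per (path, title) pair."""
    items = []
    for path, title in titles:
        # Fall back to the bare file name when no title is known.
        text = title or PureWindowsPath(path).name
        # Use forward slashes in the href and escape it for use in an attribute.
        href = html.escape(path.replace("\\", "/"), quote=True)
        items.append(f'<li><a href="{href}">{html.escape(text)}</a></li>')
    return "<ul>\n" + "\n".join(items) + "\n</ul>"

# Hypothetical sample data: the titles here are invented for illustration.
sample = [
    (r"Documents\10570\10570-00002B.pdf", "Installation & Setup Guide"),
    (r"Documents\10570\10570-00003A.pdf", ""),
]
print(build_link_list(sample))
```

The output is static HTML, so nothing needs to run on the server; the page only has to be regenerated when documents are added or retitled.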
Paul Komski
06-24-2009, 04:02 AM
There are lots of utilities that can make "html galleries" of jpg or other image files, and these can use much of the EXIF data that is stored within the images; IrfanView is a good example.
I haven't tried to do this with MP3 files but their metadata is, I believe, pretty standard and well known (with everything from author and song name etc stored there) so specific software that knows where this is kept in the files could no doubt extract it in similar fashion.
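As a rough idea of what such software does, here is a minimal Python sketch that reads the older ID3v1 tag, which is simply the last 128 bytes of the file. The function name read_id3v1 is made up for illustration, and the sketch assumes the file actually carries an ID3v1 tag; many files carry ID3v2 tags at the start of the file instead, which this does not read.

```python
import os

def read_id3v1(path):
    """Return (title, artist, album) from an ID3v1 tag, or None if absent."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() < 128:
            return None  # too small to hold a tag
        f.seek(-128, os.SEEK_END)
        tag = f.read(128)
    if tag[:3] != b"TAG":
        return None  # no ID3v1 marker

    def field(raw):
        # Fields are fixed-width, padded with NUL bytes or spaces.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    # Layout: "TAG", then title (30 bytes), artist (30), album (30), ...
    return field(tag[3:33]), field(tag[33:63]), field(tag[63:93])
```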
Perhaps there are specific pdf "metadata extractors" but if one can be found it would be necessary for the pdf files to have the desired title within the metadata for the extractor to be able to utilise it.
Possibly a Google Like This (http://www.google.ie/search?hl=en&q=pdf+metadata+extraction&meta=&aq=3&oq=pdf+metadata) might help find a way. It is a much more logical way to create the links rather than try to do it from within the html itself.
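To give a flavour of what such an extractor involves, here is a best-effort Python sketch that looks up the document's Info dictionary and pulls out its /Title entry. It is a sketch under assumptions, not a proper PDF parser: the function name pdf_info_title is made up, and it only works when the Info dictionary is stored uncompressed and unencrypted and the title is a simple literal string. A dedicated PDF library would be far more robust.

```python
import re

def pdf_info_title(data: bytes):
    """Best-effort read of /Title from a plain-text Info dictionary."""
    # The trailer points at the Info object, e.g. "/Info 5 0 R".
    ref = re.search(rb"/Info\s+(\d+)\s+(\d+)\s+R", data)
    if not ref:
        return None
    num, gen = ref.group(1), ref.group(2)
    # Find that object's body, e.g. "5 0 obj ... endobj".
    obj = re.search(rb"(?<!\d)" + num + rb"\s+" + gen + rb"\s+obj(.*?)endobj",
                    data, re.S)
    if not obj:
        return None
    # Match a literal string, allowing backslash escapes but not nested parens.
    m = re.search(rb"/Title\s*\(((?:\\.|[^\\()])*)\)", obj.group(1), re.S)
    if not m:
        return None
    # Undo only the simplest escapes: \( \) and \\ (octal escapes are ignored).
    raw = re.sub(rb"\\([()\\])", rb"\1", m.group(1))
    if raw.startswith(b"\xfe\xff"):
        # A byte order mark signals a UTF-16BE string.
        return raw[2:].decode("utf-16-be", errors="replace")
    return raw.decode("latin-1")
```

Usage would be along the lines of pdf_info_title(open(path, "rb").read()), with None meaning that no title was found this way.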
Cuc Tu
06-24-2009, 06:47 PM
Is this something that I could make work?
Also, what about using MS Access to get the info?
http://classicasp.aspfaq.com/files/directories-fso/how-do-i-find-the-owner-author-and-other-properties-of-a-file.html
Paul Komski
06-25-2009, 03:50 AM
I don't think that script will do it for you since the only information it can pull in that way are the various pdf files' properties and attributes just as can be displayed in Windows Explorer. It doesn't dig into the specific data (the pdf files' own headers and metadata) and extract it and since a nice friendly name for the files is not an attribute I don't see how it can be accessed that way.
Just about any database "backend" can be used to store the relevant Tables and Queries, and MS Access is no different in that regard. There are however certain "marriages of convenience": for example, MS ASP as the front end works well with MS Access as the back end, just as PHP as the front end works particularly well with MySQL.
Which front end to use for a web-based application depends on what the web host or server has running on it. ODBC is the most common way to provide cross-compatibility when the database is not the one that would be standard for one's front end.
Cuc Tu
06-25-2009, 11:30 AM
Some of our Windows Explorers have an additional "PDF" tab on the file properties dialog, which must have come from a plug-in installed with a later version of Acrobat.
The PDF tab shows the XMP data, which is different from the fields of the same names ("title, subject, author, keywords, etc.") under the Summary tab.
Anyway, I give up on automating this.
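For what it is worth, the XMP packet is often stored as plain, uncompressed XML inside the file, so a rough Python sketch for reading its title could look like the one below. The function name xmp_title is made up, and the sketch assumes the packet is neither compressed nor encrypted and that the title sits in the usual dc:title element.

```python
import re
import xml.etree.ElementTree as ET

# Standard namespace URIs for Dublin Core and RDF.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

def xmp_title(data: bytes):
    """Return the first dc:title value found in an embedded XMP packet."""
    # Scan the raw bytes for the XMP wrapper element.
    m = re.search(rb"<x:xmpmeta.*?</x:xmpmeta>", data, re.S)
    if not m:
        return None
    try:
        root = ET.fromstring(m.group(0))
    except ET.ParseError:
        return None  # packet was damaged or not plain XML
    # The title is stored as a list of language alternatives.
    li = root.find(".//dc:title/rdf:Alt/rdf:li", NS)
    return li.text.strip() if li is not None and li.text else None
```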