Tool for extracting (repetitive) terminology from PDF for glossary creation
Thread poster: Verena Schmidt (X)
Verena Schmidt (X)  Identity Verified
Germany
Local time: 07:18
English to German
Feb 8, 2011

Dear colleagues,

For a localization project, I have to create a glossary for a tourism website that contains all the relevant and repetitive terms and slogans (menu items etc.). I'm currently downloading the whole website as a PDF and was wondering whether there is a tool that automatically extracts all the repetitions, tabs and menus from a website, PDF or Word document.

Any ideas?

Regards,

Verena


 
FarkasAndras  Identity Verified
Local time: 07:18
English to Hungarian
Wrong format Feb 8, 2011

Verena Schmidt wrote:

I'm just downloading the whole web site into PDF

I'd start over and save as HTML files. I'm not sure how you're downloading and saving as PDF, but HTML is the native format, at least the native format your browser or downloader can access. It's also a hell of a lot better for any subsequent processing you'll do. I would use wget, but httrack is probably easier for you to use.

Of course, it would be even better to just get the original data from the website's owner instead of downloading the site yourself.

Once you have your HTML files, you can use tools such as LF aligner to align them all in one fell swoop.
Extracting terminology automatically won't be easy. I wouldn't bother trying.
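
To illustrate why saved HTML is easier to process than PDF, here is a minimal Python sketch that pulls the visible text (menu items, headings, body copy) out of a page using only the standard library. The class name `VisibleText`, the helper `visible_text` and the sample markup are made up for this example; a real site would need more care (navigation repeated on every page, text hidden in attributes, and so on).

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and ignore empty text nodes.
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the visible text fragments of an HTML string, in document order."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical sample page, standing in for a file saved by HTTrack or wget.
sample = (
    "<html><head><style>p{}</style></head><body>"
    "<ul><li>Home</li><li>Book now</li></ul>"
    "<p>Welcome  to\nthe coast</p></body></html>"
)
print(visible_text(sample))
```

Each fragment comes out as a separate string, so menu items and slogans stay intact instead of being merged into running text as often happens after a PDF conversion.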


 
Adam Łobatiuk  Identity Verified
Poland
Local time: 07:18
Member (2009)
English to Polish
Why PDF? Feb 8, 2011

Hi Verena,

I'm not sure why you want to use the most troublesome format for localization work. If the client hasn't provided you with source files for the website, you could use a tool like HTTrack Website Copier to download the site.

Still, if you have good reasons to use PDF and can transfer the content to Word or text files, free term extractors were the topic of a recent news story on Proz: http://www.proz.com/translation-news/?p=19987#1677922

Also, if you use Trados, you can analyse the file, and then use the "Export frequent segments" feature, which does what it says. It won't be exactly terminology, but slogans and menu items could be included. Other CATs may have a similar feature.
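
For anyone without Trados, the general idea of listing frequent segments can be approximated in a few lines of Python. This is only a rough sketch and not how Trados itself works: the function name `frequent_segments`, the splitting rule (line breaks and sentence-final punctuation) and the sample text are all assumptions made for the example.

```python
import re
from collections import Counter


def frequent_segments(text, min_count=2):
    """Return (segment, count) pairs for segments seen at least min_count times."""
    # Crude segmentation: split on line breaks and sentence-final punctuation.
    segments = re.split(r"[\n.!?]+", text)
    # Normalise case and surrounding whitespace so "Home" and "home " match.
    counts = Counter(s.strip().lower() for s in segments if s.strip())
    return [(seg, n) for seg, n in counts.most_common() if n >= min_count]


# Hypothetical text, e.g. the plain-text export of a few pages of the site.
sample_text = "Home\nBook now\nHome\nContact us\nBook now\nHome\nDiscover the coast."
for segment, count in frequent_segments(sample_text):
    print(count, segment)
```

Repeated menu items and slogans float to the top of the list; one-off sentences are dropped.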

Good luck


 
Verena Schmidt (X)  Identity Verified
Germany
Local time: 07:18
English to German
TOPIC STARTER
Thanks, Adam Feb 8, 2011

The PDF is just to get a first overview of the whole website, as graphics and images have to be analysed as well. I will convert the PDF into Word with PDFZilla. Right now it's just about extracting the relevant terminology. Thanks a lot for the link; the first tool sounds promising.

Do you mean analyse with Workbench?


 
Verena Schmidt (X)  Identity Verified
Germany
Local time: 07:18
English to German
TOPIC STARTER
Hi Farkas, Feb 8, 2011

I'm downloading the site with Adobe Acrobat Pro. You simply type in the website and Adobe creates one document with all the website content. This is fine to get an overview of the terminology and images used throughout the site. Right now this is NOT a translation project. The site has to be analysed regarding its cultural appropriateness and I have to create a glossary with the most relevant and repetitive terms (the client is paying an hourly rate for this type of work).

My workaround is this: Download to PDF -> Convert to Word (with PDFZilla) -> Use the Word file to extract the terminology

For me this is fine; the only thing missing is a good tool to extract all the repetitive terminology for me.


 
FarkasAndras  Identity Verified
Local time: 07:18
English to Hungarian
Fine Feb 8, 2011

Verena Schmidt wrote:

I'm downloading the site with Adobe Acrobat Pro. You simply type in the website and Adobe creates one document with all the website content. This is fine to get an overview of the terminology and images used throughout the site. Right now this is NOT a translation project. The site has to be analysed regarding its cultural appropriateness and I have to create a glossary with the most relevant and repetitive terms (the client is paying an hourly rate for this type of work).

My workaround is this: Download to PDF -> Convert to Word (with PDFZilla) -> Use the Word file to extract the terminology

For me this is fine, the only thing missing is a good tool extracting all the repetitive terminology for me



I see. This solution might be fine if you want to read the site yourself, but I definitely wouldn't use it for any automated processing. It introduces two unnecessary lossy file conversions (HTML->PDF->Word), which is just asking for trouble.
However, if the site is monolingual and you only need to write up a list of relevant terminology, you might as well just do it by hand from the PDF. Automated solutions are pretty much useless anyway, except if you want to compile, say, a list of all words that occur at least 5 times or something crude like that.
If the site is available in two languages, the HTTrack->aligner route is clearly the best solution.
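
The crude frequency list mentioned above (all words that occur at least 5 times) could be sketched in Python roughly like this. The function name `frequent_words`, the threshold parameter and the sample text are assumptions for illustration; the sketch does no stemming, stop-word filtering or multi-word term detection, which is exactly why its output still needs manual review.

```python
import re
from collections import Counter


def frequent_words(text, min_count=5):
    """Return (word, count) pairs for words occurring at least min_count times."""
    # Match runs of letters only (Unicode-aware, so umlauts etc. are kept).
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    frequent = ((w, n) for w, n in counts.items() if n >= min_count)
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical input, standing in for text extracted from the site.
sample_words = "Strand " * 6 + "Hotel " * 5 + "Spa " * 4
for word, count in frequent_words(sample_words):
    print(count, word)
```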


 
Adam Łobatiuk  Identity Verified
Poland
Local time: 07:18
Member (2009)
English to Polish
Yes, Workbench Feb 8, 2011

Verena Schmidt wrote:

The PDF is just to get a first overview of the whole website, as graphics and images have to be analysed as well. I will convert the PDF into Word with PDFZilla. Right now it's just about extracting the relevant terminology. Thanks a lot for the link; the first tool sounds promising.

Do you mean analyse with Workbench?


That's correct. I don't have Studio installed right now, but it may have a similar feature.


 
Michael Beijer  Identity Verified
United Kingdom
Local time: 06:18
Member (2009)
Dutch to English
"Extracting terminology from Translation Memory with Similis, step by step.pdf" Feb 14, 2011

You might want to take a look at:

"Extracting terminology from Translation Memory with Similis, step by step"

by

Jean-Marc Tapernoux


-> http://www.techni-tra.com/Extracting_terminology_with_Similis.pdf


 

