Tool for extracting (repetitive) terminology from PDF for glossary creation
Thread poster: Verena Schmidt (X)

Verena Schmidt (X) Germany Local time: 07:18 English to German + ...
Dear colleagues, For a localization project I have to create a glossary for a tourism website, containing all the relevant and repetitive terms and slogans (menu items etc.). I'm just downloading the whole web site into PDF and was wondering if there is any tool that automatically extracts all the repetitions, tabs and menus from a website/PDF/Word document. Any ideas? Regards, Verena
Verena Schmidt wrote: I'm just downloading the whole web site into PDF

I'd start over and save as HTML files. I'm not sure how you're downloading and saving as PDF, but HTML is the native format, or at least the native format your browser or downloader can access. It's also a hell of a lot better for any subsequent processing you'll do. I would use wget, but HTTrack is probably easier for you to use. Of course it would be even better to just get the original data from the website's owner instead of downloading the site yourself. Once you have your HTML files, you can use tools such as LF Aligner to align them all in one fell swoop. Extracting terminology automatically won't be easy. I wouldn't bother trying.

Adam Łobatiuk Poland Local time: 07:18 Member (2009) English to Polish + ...
Hi Verena, I'm not sure why you want to use the most troublesome format for localization work. If the client hasn't provided you with source files for the website, you could use a tool like HTTrack Website Copier to download the site. Still, if you have good reasons to use PDF and can transfer the content to Word or text files, free term extractors were the topic of a recent news story on Proz: http://www.proz.com/translation-news/?p=19987#1677922

Also, if you use Trados, you can analyse the file, and then use the "Export frequent segments" feature, which does what it says. It won't be exactly terminology, but slogans and menu items could be included. Other CATs may have a similar feature. Good luck

Verena Schmidt (X) Germany Local time: 07:18 English to German + ... TOPIC STARTER
The PDF is just to get a first overview of the whole website, as graphics and images have to be analysed as well. I will convert the PDF with PDFZilla into Word. Right now it's just about extracting the relevant terminology. Thanks a lot for the link, the first tool sounds promising. Do you mean analyse with Workbench?
Verena Schmidt (X) Germany Local time: 07:18 English to German + ... TOPIC STARTER
I'm downloading the site with Adobe Acrobat Pro. You simply type in the website and Adobe creates one document with all the website content. This is fine to get an overview of the terminology and images used throughout the site. Right now this is NOT a translation project. The site has to be analysed regarding its cultural appropriateness and I have to create a glossary with the most relevant and repetitive terms (the client is paying an hourly rate for this type of work).

My workaround is this: Download to PDF -> Convert to Word (with PDFZilla) -> Use the Word file to extract the terminology

For me this is fine, the only thing missing is a good tool extracting all the repetitive terminology for me
Verena Schmidt wrote: I'm downloading the site with Adobe Acrobat Pro. You simply type in the website and Adobe creates one document with all the website content. This is fine to get an overview of the terminology and images used throughout the site. Right now this is NOT a translation project. The site has to be analysed regarding its cultural appropriateness and I have to create a glossary with the most relevant and repetitive terms (the client is paying an hourly rate for this type of work). My workaround is this: Download to PDF -> Convert to Word (with PDFZilla) -> Use the Word file to extract the terminology For me this is fine, the only thing missing is a good tool extracting all the repetitive terminology for me

I see. This solution might be fine if you want to read the site yourself, but I definitely wouldn't use it for any automated processing. It introduces two unnecessary lossy file conversions (HTML->PDF->Word), which is just asking for trouble. However, if the site is monolingual and you only need to write up a list of relevant terminology, you might as well just do it by hand from the PDF. Automated solutions are pretty much useless anyway, unless you want to compile, say, a list of all words that occur at least 5 times or something crude like that. If the site is available in two languages, the HTTrack->aligner route is clearly the best solution.

Adam Łobatiuk Poland Local time: 07:18 Member (2009) English to Polish + ... Yes, Workbench | Feb 8, 2011
Verena Schmidt wrote: The PDF is just to get a first overview of the whole website, as graphics and images have to be analysed as well. I will convert the PDF with PDFZilla into Word. Right now it's just about extracting the relevant terminology. Thanks a lot for the link, the first tool sounds promising. Do you mean analyse with Workbench?

That's correct. I don't have Studio installed right now, but it may have a similar feature.
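
The crude frequency approach mentioned in the thread (listing everything that occurs at least 5 times) can be sketched in a few lines of Python. This is only an illustration, assuming the pages were saved as HTML (for example with HTTrack) rather than PDF. `TextExtractor`, `html_to_chunks` and `repeated_phrases` are made-up names, not part of any tool named in this thread. The sketch counts words and short phrases inside each piece of visible text, so menu items and slogans that recur across pages rise to the top; it does not decide what counts as terminology, which still has to be done by hand.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # One chunk per text node, e.g. one menu item or one paragraph.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_chunks(html):
    """Return the visible text nodes of one HTML page as a list of strings."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def repeated_phrases(chunks, max_words=3, min_count=5):
    """Count 1..max_words word sequences; keep those seen at least min_count times.

    Phrases never span two chunks, so a menu item is not glued to its neighbour.
    """
    counts = Counter()
    for chunk in chunks:
        words = re.findall(r"\w+(?:[-']\w+)*", chunk.lower())
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return [(phrase, c) for phrase, c in counts.most_common() if c >= min_count]


# Hypothetical sample: the same navigation menu appearing on five pages.
page = "<ul><li>Book now</li><li>Contact</li></ul><script>var x;</script>"
chunks = html_to_chunks(page * 5)
for phrase, count in repeated_phrases(chunks):
    print(count, phrase)
```

For a real site, the chunks from every downloaded HTML file would be collected into one list before calling `repeated_phrases`. The `min_count` and `max_words` defaults are arbitrary starting points; the resulting list still contains plenty of noise (articles, prepositions) and needs manual filtering.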