How to remove all special html characters from a file using Regex Tagger?
Thread poster: Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 05:24
ProZ.com member since 2009
Dutch to English
Jul 14, 2011

Can someone tell me how to remove all special HTML characters from an Excel file or text file imported into memoQ using the new Regex Tagger?

I want to get rid of stuff like this before trying to extract terms from the text to create a glossary:



Thanks!


 
Michael Grant
Japan
Local time: 13:24
Japanese to English
Remove? Or replace? Jul 15, 2011

I have more of a question than an answer for you, but I am wondering whether it would be better to replace the HTML special characters with their equivalents rather than simply remove them?

For example, if you simply remove &rsquo; from the text you quoted, and go from this:

Apple&rsquo;s Friend-Aggregator is a full-feature ...


to this:

Apples Friend-Aggregator is a full-feature ...


it will change the meaning of the term _Apple_, correct?

In any case, take a look at the memoQ help article here:

http://kilgray.com/memoq/50/help-en/index.html?import_documents_with_embedded.html

It should get you started.

One possible regex to match special characters might be: &#?[^;]+;
This would match entities with or without a # sign, made up of numbers and/or letters. However, I do not have memoQ, so I cannot test this (sorry!).

MGrant

[Edited at 2011-07-15 01:59 GMT]
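
For anyone who wants to try the pattern above outside memoQ, here is a minimal Python sketch. It has not been checked against memoQ's regex engine, and the stricter variant (`ENTITY_STRICT`) is my own assumption, not something from the help article. It shows that the suggested pattern finds `&rsquo;`, but can also swallow ordinary text when a bare `&` appears before a later semicolon.

```python
import re

# Pattern suggested in the post above: "&", an optional "#",
# one or more characters that are not ";", then ";".
ENTITY_LOOSE = re.compile(r"&#?[^;]+;")

# Assumed stricter variant (not from the thread): only named,
# decimal, or hexadecimal entities, with no spaces allowed inside.
ENTITY_STRICT = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")

sample = "Apple&rsquo;s Friend-Aggregator is a full-feature ..."
print(ENTITY_LOOSE.findall(sample))   # ['&rsquo;']
print(ENTITY_STRICT.findall(sample))  # ['&rsquo;']

# A bare ampersand followed later by a semicolon trips up the loose pattern.
tricky = "R&D costs rose; see &#8217; and &amp;"
print(ENTITY_LOOSE.findall(tricky))   # ['&D costs rose;', '&#8217;', '&amp;']
print(ENTITY_STRICT.findall(tricky))  # ['&#8217;', '&amp;']
```

If the text to be tagged can contain plain ampersands, the stricter form is probably the safer starting point.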


 
Michael Beijer  Identity Verified
United Kingdom
Local time: 05:24
ProZ.com member since 2009
Dutch to English
Thread poster
Thanks Michael! Jul 15, 2011

Replace, not remove! Of course.

That memoQ help file was exactly what I was looking for.

In the meantime I actually managed it with this (very useful) free online "HTML Encoder / Decoder":

http://www.web2generators.com/html/entities

Michael
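
The same "replace, not remove" decoding can also be done locally instead of through an online tool. The sketch below uses Python's standard `html.unescape`; the helper name `decode_entities` is only illustrative, and this is one possible approach, not what the online decoder or memoQ does internally.

```python
import html


def decode_entities(text: str) -> str:
    """Replace HTML entities with the characters they stand for (one pass)."""
    return html.unescape(text)


sample = "Apple&rsquo;s Friend-Aggregator is a full-feature ..."
print(decode_entities(sample))  # Apple’s Friend-Aggregator is a full-feature ...

# Numeric entities are handled the same way as named ones.
print(decode_entities("Apple&#8217;s"))  # Apple’s
```

Decoding before term extraction keeps "Apple’s" intact, rather than turning it into "Apples" as plain removal would.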


 
Gergely Vandor
Hungary
Local time: 06:24
English to Hungarian
memoQ has built-in support for this Aug 25, 2011

Hello All,

For the regex tagger, memoQ contains a "bundled" configuration called "Tags and entities". You can import the Excel file, and then run the regex tagger from the Format menu with this configuration.

Or you can even create a cascading filter (filter chain), where the regex tagger is chained after the Excel filter with this configuration.

best regards,
Gergely


 

