HTML/PHP segmentation rules


I am translating HTML and PHP files. Which segmentation options should I pick in the "preferences" dialog? If I pick "Sentence", the segmentation rules for these file types leave some text out or do not segment it the way I would like.

For example, a table of contents or a list marked up with <li> and </li> HTML tags is bunched up into one big segment, with the list items separated by CT's red number tags. Splitting a long table of contents by hand is not productive, and leaving it as is prevents me from accessing the section titles individually when I get to them in the rest of the file.
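To make the desired behaviour concrete, here is a minimal sketch of "one segment per list item", written with Python's standard html.parser module. It is not how CafeTran or Okapi segment files; the class name ListItemSegmenter, the function split_list_items and the sample table of contents are all made up for illustration.

```python
from html.parser import HTMLParser


class ListItemSegmenter(HTMLParser):
    """Collect the text of each <li> element as its own segment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.segments = []     # finished segments, one per list item
        self._in_item = False  # True while we are inside an <li>
        self._buffer = []      # text pieces of the current item

    def _flush(self):
        # Join the collected pieces and normalise whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.segments.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._flush()          # also copes with a missing </li>
            self._in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._flush()
            self._in_item = False

    def handle_data(self, data):
        # Inline tags such as <a> or <b> are ignored; only their text is kept.
        if self._in_item:
            self._buffer.append(data)


def split_list_items(html):
    """Return a list with one text segment per <li> found in the HTML string."""
    parser = ListItemSegmenter()
    parser.feed(html)
    parser.close()
    return parser.segments


# Made-up sample, not taken from the real file.
toc = (
    '<ul><li><a href="#intro">Introduction</a></li>'
    "<li>Cyclone <b>model</b></li><li>Results</li></ul>"
)
for segment in split_list_items(toc):
    print(segment)
```

Each list item comes out as a separate string, which is the granularity I would like to see in the CT segment grid.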

Also, some segments are left out and I have to open the file in BBEdit to get to them after import.

For example, "Cyclone Model" is not segmented and stays hidden in:

<div class="center"><img src="media/graphics/model.jpg" alt="Cyclone Model" /> </div>
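As a rough way to check which attribute texts exist in a file before import, the sketch below lists the values of attributes that usually need translating. It uses only Python's standard html.parser module and has nothing to do with CafeTran's own filter. The choice of "alt" and "title" as translatable attributes is an assumption, and the names TRANSLATABLE_ATTRS, AttributeTextExtractor and find_attribute_text are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these are the attributes whose values need translating.
TRANSLATABLE_ATTRS = {"alt", "title"}


class AttributeTextExtractor(HTMLParser):
    """Record (tag, attribute, value) for every translatable attribute found."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        for name, value in attrs:
            if name in TRANSLATABLE_ATTRS and value and value.strip():
                self.found.append((tag, name, value.strip()))


def find_attribute_text(html):
    """Return all translatable attribute values in the HTML string."""
    parser = AttributeTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.found


snippet = (
    '<div class="center"><img src="media/graphics/model.jpg" '
    'alt="Cyclone Model" /> </div>'
)
for tag, name, value in find_attribute_text(snippet):
    print(f"{tag}@{name}: {value}")
```

Run on the line above, this reports the alt text of the img tag, so the output can serve as a checklist of strings to look for after import.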

So if anyone can point me in the right direction regarding segmentation rules for HTML/PHP files and how to apply them in CT, it would be much appreciated.

Thanks for your insight.

Hi Hans,

Thanks a lot for the links. Okapi may be a solution. But even with the latest Java installed on my Mac (El Capitan), Rainbow complains that it needs 1.7 or higher! Weird thing though: System Preferences says 8 is installed, but typing "java -version" in Terminal says it's version 6?

Anyway, if someone has a solution that involves only CafeTran, I'm all for it. Or if any Mac users know about the version discrepancy... Googling "java discrepancy" returned useless answers.


Duh... a restart took care of the Java discrepancy! And the apps are launching fine. Now on to trying Okapi.

But if Okapi can do html/php files properly, surely there is a way of telling CT to segment those files as well.

Hello Julie,

Did you manage to start working with the Okapi framework?


I successfully went through the example that proves the installation works, and I looked into Ratel and the basics of segmentation rules. But then I had to work!

However, I saw that there is an option in Ratel called "sub-flow" which seems to pick up the text in the "alt" attribute. But I haven't been able to tell Ratel to split segments at the li and /li tags...

I'll go back to Rainbow later and try to figure it out. I'll come back here when I have gotten further.

