Special segmentation rule for extracted text

Of course you can use transtools or pdftidy to get rid of the superfluous line feeds at the end of every line, which often occur in text extracted from PDF files. The line feed may be preceded by a hard hyphen and even by a superfluous space. Has anyone written an SRX rule to accommodate such files? That would be very handy for the alignment workflow.
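
To make the pattern concrete, here is a rough Python sketch of the cleanup I have in mind. It is a preprocessing step, not an SRX rule, and it is not what transtools or pdftidy actually do internally. The function name `unwrap_extracted_text` is made up, and it rests on two assumptions: a blank line marks a real paragraph break, and a hyphen directly before a line feed is always a soft (line-wrap) hyphen. The second assumption will wrongly glue genuine compounds such as "well-known" when they happen to break at the hyphen.

```python
import re


def unwrap_extracted_text(text: str) -> str:
    """Join lines that were broken by PDF extraction (hypothetical helper)."""
    # Normalize Windows line endings so only "\n" has to be handled.
    text = text.replace("\r\n", "\n")
    # Assumption: an empty (or whitespace-only) line is a real paragraph break.
    paragraphs = re.split(r"\n[ \t]*\n", text)
    cleaned = []
    for para in paragraphs:
        # Hyphen, optional superfluous spaces, line feed: glue the word halves.
        # Assumption: such a hyphen is never part of a genuine compound.
        para = re.sub(r"(?<=\w)-[ \t]*\n[ \t]*(?=\w)", "", para)
        # Any remaining line feed, with surrounding spaces, becomes one space.
        para = re.sub(r"[ \t]*\n[ \t]*", " ", para)
        cleaned.append(para.strip())
    # Drop empty paragraphs and restore one blank line between the rest.
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    sample = "This is an exam- \nple of text \nextracted from PDF.\n\nNew paragraph."
    print(unwrap_extracted_text(sample))
```

The two regular expressions are the part that a segmentation rule would have to mirror: a line feed preceded by an optional space and an optional hyphen should not count as a segment break.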
