@daniel: You cannot delete the dialog file without deleting the visemes, but there is a sort of workaround, which I have described in the post (there's also a tutorial by user 3DTest about that on YouTube, I'll have to find it).
As you observed, viseme allocation is much better when using TTS voices. The reason seems to be that iClone (if that's what you are using) uses the text entered for the TTS as a guideline to determine the visemes. We have asked a million times for it to do the same for regular lip-syncing, but that has fallen on deaf ears thus far.
After you have done your TTS lip-syncing, you can mute the output for the voice (there are sliders on the timeline for that). Then you can import a sound file with your voice actor. Unfortunately, you won't see the waveform, but you can scrub the audio.
The way I have done it is to split things up into sentences or phrases, which makes lining everything up easier.
Here is the tutorial. The only difference is that you don't need to use CT8 to do the TTS lip-syncing; you can now do that in iClone directly.
I'm in a bit of a hurry right now, but recently Mike Kelley introduced a way to use Papagayo to generate the visemes and then use a Python script to "inject" those into iClone. I have been working on some improvements to that script.
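For anyone curious what such a script has to work with: Papagayo can export its timing as a plain-text "MOHO switch" file (one header line, then frame/phoneme pairs using the Preston Blair phoneme set). Below is a minimal sketch of the parsing-and-mapping step only; the phoneme-to-viseme table is my own illustrative assumption, not iClone's actual table, and the actual injection into iClone is not shown.

```python
# Sketch: parse a Papagayo "MOHO switch" export (.dat) and map its
# Preston Blair phonemes to viseme names. The PHONEME_TO_VISEME table
# below is a hypothetical example, NOT Reallusion's official mapping.

PHONEME_TO_VISEME = {
    "AI": "Open",          # open-mouth vowels
    "E": "Wide",
    "O": "Oh",
    "U": "OO",
    "WQ": "OO",
    "MBP": "Explosive",    # lips pressed together
    "FV": "Dental_Lip",
    "L": "Tongue_Up",
    "etc": "Affricate",
    "rest": "None",        # mouth closed / neutral
}

def parse_moho(lines):
    """Return a list of (frame, viseme) pairs from MOHO .dat lines."""
    keys = []
    for line in lines:
        line = line.strip()
        if not line or line == "MohoSwitch1":
            continue  # skip the header and blank lines
        frame_str, phoneme = line.split(None, 1)
        # fall back to the raw phoneme if it isn't in our table
        viseme = PHONEME_TO_VISEME.get(phoneme, phoneme)
        keys.append((int(frame_str), viseme))
    return keys

sample = ["MohoSwitch1", "1 rest", "5 AI", "12 MBP", "20 rest"]
print(parse_moho(sample))
```

From there, a script like Mike's would turn each (frame, viseme) pair into a viseme key on the iClone timeline.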
Of course, it would be much easier if this were actually integrated into iClone. The technology has been there for a long time, but there seems to be some reluctance on RL's part to make use of it. Even though we have facial mocap, that is not always a solution.