I'm not familiar with UE4, but from the description the idea seems to be to use text to assist in creating the lip-syncing for a speech file? Otherwise you would just have a moving mouth with no sound.
The phoneme morph targets in iClone would be composites: I believe there are 60 or more morph components for the face, which are combined to create the mouth shapes (visemes) for speech.
Partly because visemes are blended into one another to get fluid animation, it would not be easy to separate them out into distinct morph targets. So what happens in iClone is dynamic, whereas you are looking for static morph targets that each represent a specific phoneme, if I understand correctly.
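To make that concrete, here is a rough UE4 C++ sketch of what I mean by "a viseme is a composite that gets blended". I'm guessing at the engine side here: the morph component names are made up and would depend on how the character was exported from iClone, but SetMorphTarget is the standard skeletal mesh call for driving a single morph target.

```cpp
#include "CoreMinimal.h"
#include "Components/SkeletalMeshComponent.h"

// One viseme expressed as weights over several facial morph components.
// (Names like "Jaw_Open" are hypothetical examples, not iClone's actual list.)
struct FVisemePose
{
    TMap<FName, float> ComponentWeights;
};

// Linearly blend two viseme composites and push the result to the mesh.
// This is the "dynamic" part: at any instant the mouth shape is a mix
// of neighboring visemes, not a single static morph target.
void ApplyVisemeBlend(USkeletalMeshComponent* Mesh,
                      const FVisemePose& From, const FVisemePose& To,
                      float Alpha)
{
    // Gather every morph component referenced by either viseme.
    TSet<FName> Names;
    for (const auto& Pair : From.ComponentWeights) { Names.Add(Pair.Key); }
    for (const auto& Pair : To.ComponentWeights)   { Names.Add(Pair.Key); }

    for (const FName& Name : Names)
    {
        // FindRef returns 0.0f when a component is absent from a viseme.
        const float A = From.ComponentWeights.FindRef(Name);
        const float B = To.ComponentWeights.FindRef(Name);
        Mesh->SetMorphTarget(Name, FMath::Lerp(A, B, Alpha));
    }
}
```

The point of the sketch is just that capturing one phoneme as a single static morph target would mean baking down a whole set of component weights, and the in-between frames are blends that don't correspond to any one phoneme.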
Also, is your aim to do the lip-syncing in UE4 using the method you describe?