I don't see either point. You don't want your benchmarks to interact, and normally you have to make sure they don't. Forking saves you that trouble. That also means yellow flags should be raised only if you wanted to use the non-forking benchmark as the baseline, but why would you? Sure, if you find differences and didn't expect any, it's worth investigating the source of the interaction: if it's not in your own benchmarked code, modules you pull in might have an issue you weren't aware of. But beyond that, given a means to isolate benchmarks entirely, I just don't see any reason to go to the trouble of making them "clean".
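To make the isolation argument concrete, here is a rough sketch of the idea in Python rather than Perl. It is not the benchmarking module under discussion; `time_in_fork`, `polluting` and `cache` are names I made up for illustration, and it assumes a Unix-like system where `os.fork` is available. Each benchmark runs in a forked child and only the elapsed time travels back to the parent through a pipe, so whatever state the benchmarked code dirties dies with the child.

```python
import os
import time


def time_in_fork(func, iterations=1000):
    """Run func `iterations` times in a forked child; return elapsed seconds.

    Hypothetical helper: only the timing crosses the process boundary.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: any state func touches is a private copy of the parent's.
        os.close(read_fd)
        status = 1
        try:
            start = time.perf_counter()
            for _ in range(iterations):
                func()
            elapsed = time.perf_counter() - start
            os.write(write_fd, repr(elapsed).encode())
            status = 0
        finally:
            # Leave immediately so the child never runs the parent's cleanup.
            os._exit(status)
    # Parent: read the child's report, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        data = reader.read()
    _, wait_status = os.waitpid(pid, 0)
    if wait_status != 0 or not data:
        raise RuntimeError("benchmark child failed")
    return float(data)


# A deliberately "unclean" benchmark: it keeps growing a module-level dict.
cache = {}


def polluting():
    cache[len(cache)] = True


elapsed = time_in_fork(polluting, iterations=1000)
# The parent's cache is still empty here; only the child's copy was filled.
```

Run without the fork, `polluting` would leave 1000 entries behind for whichever benchmark comes next, which is exactly the kind of interaction you would otherwise have to clean up by hand.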
Makeshifts last the longest.