The solution is as follows:
- Construct N random datasets with the same statistical characteristics as B. Bootstrapping (resampling bars with replacement) works well: it preserves most statistical properties of the data except the relationships between bars, so the random data contains no exploitable inefficiencies.
- Perform the exact same mining process A on all random datasets (i.e., run the same optimization process, attempting to find a system as good as X).
- Create an average distribution of system results from mining A across all N datasets.
- The distribution must be stable, so use a large enough N; I generally run 500+ mining exercises on random data to build it. This requires a lot of computing power. The F4 framework supports OpenMP/MPI, so I have been able to run validations on clusters without much additional coding.
- Using this distribution, determine the probability that you found X in B due to random chance.
- You can then accept X as not derived from spurious relationships if this probability falls below your chosen significance threshold. I generally only accept strategies at a 4-sigma level (roughly 99.99% confidence).
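The steps above can be sketched in Python. This is a toy illustration, not the F4 framework: `mine`, `mining_bias_pvalue`, and the candidate-strategy scoring inside `mine` are all hypothetical stand-ins for the real optimization process A.

```python
import random
import statistics

def mine(returns, n_candidates=200, rng=random):
    # Stand-in for the real mining process A: try many candidate
    # "strategies" and keep the best score found. Here a candidate is
    # just a random subset of bars scored by its mean return; trying
    # many candidates and keeping the best is exactly what creates
    # data-mining bias.
    best = float("-inf")
    n = len(returns)
    k = max(1, n // 4)
    for _ in range(n_candidates):
        picks = [returns[rng.randrange(n)] for _ in range(k)]
        best = max(best, statistics.fmean(picks))
    return best

def mining_bias_pvalue(returns, n_random=500, seed=0):
    rng = random.Random(seed)
    real_score = mine(returns, rng=rng)
    # Bootstrap with replacement: keeps the marginal distribution of
    # the bars but destroys any bar-to-bar relationships, so the
    # resampled data contains no real inefficiencies.
    random_scores = []
    for _ in range(n_random):
        fake = rng.choices(returns, k=len(returns))
        random_scores.append(mine(fake, rng=rng))
    # Probability that pure random data yields a result at least as
    # good as the one mined from the real data.
    at_least_as_good = sum(s >= real_score for s in random_scores)
    return real_score, at_least_as_good / n_random
```

A strategy would then be accepted only if the returned probability falls below the chosen significance threshold.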
This type of test is what babelproofre has described. It derives from the ideas put forward in the paper on White's reality check but is computationally more expensive. I have used it to evaluate data-mining bias for the past several years, and it is also standard practice at the firm where I work. The process corresponds to the classic methodology for data-mining-bias evaluation (I also advise reading the articles posted by babelproofre; they are classics).
You are, in simple terms, comparing what your mining process (the strategy search) yields on real data with what it yields on average on random data. The mining process needs to generate many more systems of the intended quality on the real data than on the random data; otherwise, your process can find such systems through spurious relationships alone.
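The comparison can be reduced to asking how far the real-data result sits above the random-data distribution. A minimal sketch (the function name and the use of the sample standard deviation are my own choices, not anything from the F4 framework):

```python
import statistics

def z_score(real_score, random_scores):
    # How many standard deviations the real-data mining result sits
    # above the mean of the random-data distribution.
    mu = statistics.fmean(random_scores)
    sd = statistics.stdev(random_scores)
    return (real_score - mu) / sd
```

A result would pass a sigma-based acceptance rule like the one above only when this z-score exceeds the chosen sigma level.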
When doing machine learning you will often find that you cannot rule out that your system simply found spurious relationships: repeating the mining process on random data frequently yields a significant number of strategies with the same statistical properties. The more elaborate your mining method, the more prone you are to this; the larger your genetic optimization, the worse it gets.
Being able to perform the above evaluation is, IMHO, what separates the 99% of people who attempt machine learning and fail from those of us who have been able to do it successfully. It takes time and it is computationally expensive, but you end up with things that work. All of that, of course, is just my personal experience on the matter.