P-Hacking in Data Science and Machine Learning: How to Avoid It
P-hacking is the misuse of data analysis to find patterns that appear statistically significant but reflect no real effect. It is also called data dredging. Data dredging is extremely hard to spot, and it is a concern because it undermines the conclusions drawn from the data.
P-hacking can also be described as a form of cherry-picking, often unintentional, in which data or analyses are selected until they yield a surplus of noteworthy results. This inflates the number of false positives, which lowers the quality of the study. It can also mislead later data recording and inference and result in increased bias.
P-hacking is difficult to avoid entirely, but certain safeguards help reduce the chances of it and keep you out of the data dredging trap.
A Detailed Explanation of P-hacking
The p-value, or probability value, is a probability computed under the null hypothesis model: the chance that the test statistic is equal to, or more extreme than, the value actually observed.
If the computed p-value is small, the tail of the null distribution that extends beyond the observed statistic is small, which means the observed statistic lies far from the null value. In other words, the data are more consistent with the observed value than with the null value.
By convention, when the p-value is lower than 0.05, the result is called statistically significant and the null hypothesis is rejected.
Risks arise when the p-value is misused. P-hacking is a misuse of data analysis that presents patterns in the data as statistically significant when in reality they carry no significance. It may happen by mistake or be done deliberately.
But how is this done? Typically by running many tests on the data, finding a pattern, and then modifying some of the values or analysis choices until the result is biased toward significance.
What is Hypothesis Testing?
Hypothesis testing is a method of statistical inference that uses data from a sample to draw conclusions about a population parameter or a population probability distribution. First, a tentative assumption is made about the parameter or distribution. This assumption is known as the null hypothesis and is denoted H0.
What is a P-value?
In null-hypothesis significance testing, the p-value is the probability of obtaining test results at least as extreme as the results actually observed, assuming that the null hypothesis is correct.
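As a small illustration of this definition, the sketch below computes an exact two-sided p-value for a coin-flip experiment. The scenario (15 heads in 20 flips, with the null hypothesis that the coin is fair) and the function name `fair_coin_p_value` are assumptions made up for this example, not something taken from a particular study.

```python
from math import comb

def fair_coin_p_value(heads: int, flips: int) -> float:
    """Two-sided p-value under H0 'the coin is fair'.

    Adds up the probability of every outcome that is at least as far
    from the expected number of heads as the observed outcome.
    """
    expected = flips / 2
    observed_distance = abs(heads - expected)
    extreme_outcomes = [
        k for k in range(flips + 1)
        if abs(k - expected) >= observed_distance
    ]
    # Under a fair coin, each of the 2**flips sequences is equally likely.
    return sum(comb(flips, k) for k in extreme_outcomes) / 2**flips

p = fair_coin_p_value(heads=15, flips=20)
print(f"p-value = {p:.4f}")  # p-value = 0.0414
```

Here the p-value falls just under 0.05, so by the usual convention the result would be called statistically significant, even though a fair coin produces an outcome this lopsided about 4% of the time.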
How Does P-hacking Work?
Analyzing a large volume of data requires careful study to discover likely relationships within the dataset. In data mining, once a pattern is detected, data scientists form hypotheses based on it and try to explain why the relationship exists in the first place.
Conventional scientific data analysis, on the other hand, begins with a hypothesis and then examines the data to test whether that hypothesis holds.
In most cases, however, data dredging is an incorrect use of this approach. It commonly stems from a lack of understanding of how data mining methods should be applied to find previously unidentified relationships between variables, and from an unwillingness to accept that an identified association was just a coincidence.
Some unscrupulous researchers also p-hack deliberately: they analyze the data in many ways and report only the analysis that gives the lowest p-value. They then present the result as statistically significant even though it is nothing but a false positive and is therefore unreliable.
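The effect of reporting only the lowest p-value can be sketched with a small simulation. Everything below is an illustrative assumption: each simulated "study" measures 20 outcomes that are pure Gaussian noise, compares two groups of 30 with a simple z-test (variance assumed known and equal to 1), and keeps only the smallest p-value.

```python
import random
from statistics import NormalDist, fmean

def z_test_p_value(a, b):
    """Two-sided p-value for a difference in means (known variance 1)."""
    z = (fmean(a) - fmean(b)) / (1 / len(a) + 1 / len(b)) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

def hacked_study(rng, n_outcomes=20, n_per_group=30):
    """Test many noise outcomes and return only the smallest p-value."""
    p_values = []
    for _ in range(n_outcomes):
        treatment = [rng.gauss(0, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        p_values.append(z_test_p_value(treatment, control))
    return min(p_values)

rng = random.Random(0)  # fixed seed so the run is repeatable
n_studies = 1000
false_positives = sum(hacked_study(rng) < 0.05 for _ in range(n_studies))

print(f"Studies with a 'significant' result: {false_positives / n_studies:.0%}")
# With 20 independent tests at the 0.05 level, theory predicts 1 - 0.95**20.
print(f"Expected from theory: {1 - 0.95**20:.0%}")  # Expected from theory: 64%
```

Although no outcome has a real effect, roughly two out of three simulated studies can report something "significant" simply because the best of 20 tests was selected.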
Issues Caused by P-hacking
Many problems arise from p-hacking, including:
- Increase in false positive results
P-hacking produces many false-positive outcomes. This makes the data unreliable and biases every conclusion derived from it.
Even when the data point in a single direction, data dredging can shift the outcome. This may lead to the data being misrepresented in multiple places, so the decisions you make based on it could be wrong.
- Increased bias
Misinterpreted data can lead you to make wrong decisions. If the data form a time series, a distorted value can also upset the whole trend. For example, inflation is calculated yearly, and the budget is based on it. If the inflation data were p-hacked, the budget built on them could be thrown off and, in the worst case, contribute to a financial emergency in the nation.
- Misleading other statistical inference processes
As mentioned above, when data are recorded to predict a time-dependent result, p-hacking can corrupt the record and alter the variance. It may also affect future data recording by leading the statistician to focus on unnecessary variables.
How to Avoid P-hacking?
Here are a few ways to avoid p-hacking, based on your understanding of data dredging.
- Pre-registration of the study
Pre-registration is one of the best ways to avoid p-hacking, because it keeps you from tweaking the data or the analysis after the data are recorded. It requires a comprehensive test plan that specifies the statistical tools and examination methods to be applied to the data. This plan can be registered online. The results are then reported exactly as the registered plan produces them, without any tweaking, so readers can see the real outcomes.
- Avoid peeking at data and continuous observation
A data scientist’s curiosity about how a test is performing may lead them to check the results repeatedly while it is still running. Each look is another chance for a false positive, so peeking inflates the false positive rate and can distort the p-value considerably. Let the test run its planned course even if a suitable p-value is reached early.
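A short simulation can show why peeking is risky. The setup is an assumption chosen for illustration: an A/B test in which both groups are pure noise (so any "significant" result is a false positive), 500 observations per group, and a z-test with known variance 1 checked after every 25 observations.

```python
import random
from statistics import NormalDist

def p_value_from_sums(sum_a, sum_b, n):
    """Two-sided z-test p-value for two groups of size n (known variance 1)."""
    z = (sum_a - sum_b) / n / (2 / n) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

def run_experiment(rng, n_total=500, peek_every=25):
    """Return (significant at any peek, significant at the planned end)."""
    sum_a = sum_b = 0.0
    significant_at_some_peek = False
    for n in range(1, n_total + 1):
        sum_a += rng.gauss(0, 1)
        sum_b += rng.gauss(0, 1)
        if n % peek_every == 0 and p_value_from_sums(sum_a, sum_b, n) < 0.05:
            significant_at_some_peek = True  # a peeker would stop here
    significant_at_end = p_value_from_sums(sum_a, sum_b, n_total) < 0.05
    return significant_at_some_peek, significant_at_end

rng = random.Random(1)  # fixed seed so the run is repeatable
results = [run_experiment(rng) for _ in range(1000)]
peek_rate = sum(peeked for peeked, _ in results) / len(results)
final_rate = sum(final for _, final in results) / len(results)

print(f"False positive rate when stopping at the first p < 0.05: {peek_rate:.1%}")
print(f"False positive rate when testing once at the end: {final_rate:.1%}")
```

Testing once at the planned end keeps the false positive rate near the nominal 5%, while stopping at the first significant peek pushes it several times higher.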
- Bonferroni correction to address the problem
As the number of hypothesis tests grows, so does the number of false positives. The Bonferroni correction compensates for this increased chance of observing a rare event by testing each hypothesis at a stricter significance level: the overall level divided by the number of tests. The more hypotheses are tested, the stricter the threshold. The correction is a trade-off between false positives and true positives.
The Bonferroni correction protects against Type I errors (false positives), but it makes Type II errors (false negatives) more likely.
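A minimal sketch of the Bonferroni correction follows. The five p-values are hypothetical numbers chosen for illustration, and the helper name `bonferroni` is likewise an assumption for this example.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag each test as significant at the Bonferroni-corrected level.

    Every p-value is compared with alpha divided by the number of tests.
    """
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Hypothetical p-values from five tests run on the same dataset.
p_values = [0.001, 0.008, 0.020, 0.041, 0.300]

uncorrected = sum(p <= 0.05 for p in p_values)
print(uncorrected)           # 4
print(bonferroni(p_values))  # [True, True, False, False, False]
```

With five tests the corrected threshold is 0.05 / 5 = 0.01, so only two of the four nominally significant results survive. The two that were dropped may be false positives that were correctly removed, or real effects that were missed, which is the Type II risk mentioned above.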
- Power analysis
Power analysis is a statistical method for determining the approximate sample size needed to have a high probability of correctly rejecting the null hypothesis when it is false.
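As a rough sketch, the sample size for comparing two group means can be estimated with the standard normal approximation n = 2 * ((z_alpha + z_power) / d)^2 per group, where d is the standardized effect size. The defaults below (a two-sided significance level of 0.05 and 80% power) are common conventions assumed for this example, and an exact t-test calculation gives slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-sample test.

    effect_size is the difference in means divided by the standard
    deviation (Cohen's d). Uses the normal approximation.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

print(sample_size_per_group(0.5))  # 63
print(sample_size_per_group(0.2))  # 393
```

Fixing the sample size this way before collecting data removes the temptation to keep adding observations until the p-value drops below the threshold.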
Steps to take during testing to control false positives:
Step 1: Determine the statistical parameters before beginning the test. If a variable could change a parameter, note it down along with the reasoning.
Step 2: Decide before the tests are run how many replications there will be and when a sample can be excluded. This avoids stopping the test before the actual result is reached.
Step 3: When examining multiple results, make sure the statistics account for the multiple comparisons. If a result looks unusual, test the hypothesis again to get an accurate p-value.
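One way to test an unusual finding again, sketched below, is to split the data: explore on one half, then retest only the chosen finding on the untouched half. The setup is an illustrative assumption (20 noise features, two groups of 60, a z-test with known variance 1), not a prescribed procedure.

```python
import random
from statistics import NormalDist, fmean

def z_test_p_value(a, b):
    """Two-sided p-value for a difference in means (known variance 1)."""
    z = (fmean(a) - fmean(b)) / (1 / len(a) + 1 / len(b)) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

def explore_then_confirm(rng, n_features=20, n_per_group=60):
    """Return (best exploratory p-value, p-value of the retest)."""
    half = n_per_group // 2
    # Pure noise: no feature truly differs between the two groups.
    a = [[rng.gauss(0, 1) for _ in range(n_per_group)] for _ in range(n_features)]
    b = [[rng.gauss(0, 1) for _ in range(n_per_group)] for _ in range(n_features)]
    # Exploration: test every feature on the first half, keep the best one.
    explore = [z_test_p_value(x[:half], y[:half]) for x, y in zip(a, b)]
    best = min(range(n_features), key=lambda i: explore[i])
    # Confirmation: retest only that feature on the second half.
    confirm = z_test_p_value(a[best][half:], b[best][half:])
    return explore[best], confirm

rng = random.Random(7)  # fixed seed so the run is repeatable
results = [explore_then_confirm(rng) for _ in range(500)]
explore_rate = sum(e < 0.05 for e, _ in results) / len(results)
confirm_rate = sum(c < 0.05 for _, c in results) / len(results)

print(f"'Significant' during exploration: {explore_rate:.0%}")
print(f"Still significant on fresh data: {confirm_rate:.0%}")
```

The exploratory p-value is biased because it was selected as the best of many, whereas the retest on fresh data behaves like an ordinary single test and flags noise only about 5% of the time.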
Impact of P-hacking in Data Science and Machine Learning Projects
P-hacking harms research studies, often without the researcher being aware of it. Some well-known effects of data dredging on data science and machine learning models include:
- The creation of false positives affects the dependability of the results
- Misleading other examiners and upsetting the study results
- An increase in biases
- Loss of important resources, especially manpower
- Models are trained incorrectly
- Researchers have to retract their published outcomes
- Loss of funding
Guarding against p-hacking is essential when segregating data and searching for patterns that support a better study of any process. The aim is simple: to spot problems, even the smallest ones, in advance and take measures.