A paper's "Methods" (or "Materials and Methods") section provides information on the study's design and participants. Ideally, it should be so clear and detailed that other researchers can repeat the study without needing to contact the authors. You will need to examine this section to determine the study's strengths and limitations, which both affect how the study's results should be interpreted.
Demographics
The "Methods" section usually starts by providing information on the participants, such as historic period, sex, lifestyle, wellness status, and method of recruitment. This information will help you decide how relevant the study is to y'all, your loved ones, or your clients.
Figure 3: Example study protocol to compare two diets
The demographic information can be lengthy, and you might be tempted to skip it, but it affects both the reliability of the study and its applicability.
Reliability. The larger the sample size of a study (i.e., the more participants it has), the more reliable its results. Note that a study often starts with more participants than it ends with; nutrition studies, notably, commonly run into a fair number of dropouts.
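To make this concrete, here is a minimal Python sketch (the 2 kg average effect, 5 kg standard deviation, and sample sizes are all made up for illustration) showing how the noise in a study's estimate shrinks as the sample size grows:

```python
# Illustrative sketch: repeatedly "run" a toy study that measures a true average
# effect of 2.0 kg (SD 5.0 kg) and see how much the estimated average jumps around
# for small versus large samples.
import numpy as np

rng = np.random.default_rng(0)
true_effect, sd, repeats = 2.0, 5.0, 1000

for n in (10, 100, 1000):
    estimates = [rng.normal(true_effect, sd, n).mean() for _ in range(repeats)]
    print(f"n={n:5d}  spread of the estimates (SD): {np.std(estimates):.2f} kg")

# The spread shrinks roughly with the square root of n: larger studies give
# more reliable (less noisy) estimates of the same underlying effect.
```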
Applicability. In health and fitness, applicability means that a compound or intervention (e.g., exercise, diet, supplement) that is useful for one person may be a waste of money, or worse, a danger, for another. For example, while creatine is widely recognized as safe and effective, there are "nonresponders" for whom this supplement fails to improve exercise performance.
Your mileage may vary, as the creatine example shows, but a study's demographic information can help you assess its applicability. If a trial only recruited men, for instance, women reading the study should keep in mind that its results may be less applicable to them. Likewise, an intervention tested in college students may yield different results when performed on people from a retirement facility.
Figure 4: Some trials are sex-specific
Furthermore, different recruiting methods will attract different demographics, and so can influence the applicability of a trial. In most scenarios, trialists will use some form of "convenience sampling". For example, studies run by universities will often recruit among their students. However, some trialists will use "random sampling" to make their trial's results more applicable to the general population. Such trials are generally called "augmented randomized controlled trials".
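To see why the sampling method matters, here is a small illustrative sketch (the population, age range, and campus cutoff are invented for the example) comparing a convenience sample with a simple random sample:

```python
# Illustrative sketch: a fake "general population" with a wide age range, sampled
# two ways. A convenience sample of, say, university students skews young, while
# a simple random sample stays close to the population average.
import numpy as np

rng = np.random.default_rng(1)
population_ages = rng.integers(18, 90, size=100_000)        # fake general population

convenience = population_ages[population_ages <= 25][:500]  # e.g., recruiting on campus
random_sample = rng.choice(population_ages, size=500, replace=False)

print("population mean age:", population_ages.mean())
print("convenience sample: ", convenience.mean())   # skewed toward the young
print("random sample:      ", random_sample.mean()) # close to the population mean
```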
Confounders
Finally, the demographic information will usually mention whether people were excluded from the study, and if so, for what reason. Most often, the reason is the existence of a confounder: a variable that would confound (i.e., influence) the results.
For example, if you study the effect of a resistance training program on muscle mass, you don't want some of the participants to take muscle-building supplements while others don't. Either you'll want all of them to take the same supplements or, more likely, you'll want none of them to take any.
Likewise, if you study the effect of a muscle-building supplement on muscle mass, you don't want some of the participants to exercise while others do not. You'll either want all of them to follow the same workout program or, less likely, you'll want none of them to exercise.
It is of course possible for studies to have more than two groups. You could have, for instance, a study on the effect of a resistance training program with the following four groups:
- Resistance training program + no supplement
- Resistance training program + creatine
- No resistance training + no supplement
- No resistance training + creatine
But if your study has four groups instead of two, then for each group to keep the same sample size you need twice as many participants, which makes your study more difficult and expensive to run.
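For illustration only, here is how randomly assigning a hypothetical pool of 80 volunteers to those four groups might look (the participant labels and numbers are made up; this is just the mechanics of a 2x2 design):

```python
# Illustrative sketch: random assignment to the four groups of the 2x2 design above.
# Keeping the same per-group sample size as a two-group trial requires twice as
# many participants overall.
import random

groups = [
    "resistance training + no supplement",
    "resistance training + creatine",
    "no resistance training + no supplement",
    "no resistance training + creatine",
]

participants = [f"participant_{i:03d}" for i in range(1, 81)]  # hypothetical 80 volunteers
random.shuffle(participants)

per_group = len(participants) // len(groups)                   # 20 per group
assignment = {
    group: participants[i * per_group:(i + 1) * per_group]
    for i, group in enumerate(groups)
}

for group, members in assignment.items():
    print(f"{group}: {len(members)} participants")
```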
When you come right down to it, any differences between the participants are variables and thus potential confounders. That's why trials in mice use specimens that are genetically very close to one another. That's also why trials in humans seldom try to test an intervention on a diverse sample of people. A trial restricted to older women, for instance, has in effect eliminated age and sex as confounders.
As we saw above, with a great enough sample size, we can have more groups. We can even create more groups after the study has run its course, by performing a subgroup analysis. For instance, if you run an observational study on the effect of red meat on thousands of people, you can later separate the data for "male" from the data for "female" and run a separate analysis on each subset of data. However, subgroup analyses of this sort are considered exploratory rather than confirmatory and could potentially lead to false positives. (When, for instance, a blood test erroneously detects a disease, it is called a false positive.)
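Here is a small simulation (entirely synthetic data, with no real effect anywhere) illustrating why exploratory subgroup analyses can produce false positives: each extra subgroup is one more comparison that can come out "significant" by chance.

```python
# Illustrative sketch: 20 "subgroup" comparisons in which the null hypothesis is
# true by construction. With a 0.05 threshold, about 1 in 20 truly null
# comparisons will still look "significant" purely by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_subgroups, n_per_arm = 20, 50

false_positives = 0
for _ in range(n_subgroups):
    treatment = rng.normal(0.0, 1.0, n_per_arm)   # no true difference between arms
    placebo = rng.normal(0.0, 1.0, n_per_arm)
    if stats.ttest_ind(treatment, placebo).pvalue <= 0.05:
        false_positives += 1

print(f"'Significant' subgroup findings by chance alone: {false_positives} of {n_subgroups}")
```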
Design and endpoints
The "Methods" section will also describe how the study was run. Design variants include single-blind trials, in which only the participants don't know if they're receiving a placebo; observational studies, in which researchers simply observe a demographic and take measurements; and many more. (See figure 2 above for more examples.)
More specifically, this is where you will learn about the length of the study, the dosages used, the workout regimen, the testing methods, and so on. Ideally, as we said, this information should be so clear and detailed that other researchers can repeat the study without needing to contact the authors.
Finally, the "Methods" department can also make clear the endpoints the researchers volition exist looking at. For case, a study on the effects of a resistance grooming plan could utilise muscle mass as its main endpoint (its main criterion to judge the outcome of the study) and fat mass, strength performance, and testosterone levels as secondary endpoints.
One trick used by studies that want to find an effect (sometimes so that they can serve as marketing material for a product, but often simply because studies that show an effect are more likely to get published) is to collect many endpoints, then to make the paper about the endpoints that showed an effect, either by downplaying the other endpoints or by not mentioning them at all. To prevent such "data dredging/fishing" (a method whose devious efficacy was demonstrated through the hilarious chocolate hoax), many scientists push for the preregistration of studies.
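Some back-of-the-envelope arithmetic shows why collecting many endpoints invites spurious findings. The sketch below assumes independent endpoints and no true effect, which is a simplification, but it captures the problem:

```python
# Illustrative sketch: the chance of at least one false positive grows quickly
# with the number of endpoints tested at the 0.05 threshold (assuming independent
# endpoints and no true effect on any of them).
alpha = 0.05

for k in (1, 5, 10, 20):
    p_at_least_one = 1 - (1 - alpha) ** k
    print(f"{k:2d} endpoints -> {p_at_least_one:.0%} chance of at least one false positive")

# With 20 independent null endpoints, there is a ~64% chance that at least one
# comes out "significant" by chance, ready to be selectively reported.
```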
Sniffing out the tricks used by less scrupulous authors is, alas, part of the skill set you'll need to develop to assess published studies.
Interpreting the statistics
The "Methods" section unremarkably concludes with a hearty statistics give-and-take. Determining whether an appropriate statistical analysis was used for a given trial is an entire field of report, and then we propose you don't sweat the details; effort to focus on the big motion-picture show.
First, let's clear up two common misunderstandings. You may have read that an effect was significant, only to later discover that it was very small. Similarly, you may have read that no effect was found, yet when you read the paper you saw that the intervention group had lost more weight than the placebo group. What gives?
The problem is simple: those quirky scientists don't speak like normal people do.
For scientists, significant doesn't mean important; it means statistically significant. An effect is significant if the data collected over the course of the trial would be unlikely if there really was no effect.
Therefore, an effect can be significant yet very small: 0.2 kg (0.5 lb) of weight loss over a year, for instance. More to the point, an effect can be significant yet not clinically relevant (meaning that it has no discernible consequence for your health).
Relatedly, for scientists, no effect usually means no statistically significant effect. That's why you may review the measurements collected over the course of a trial and notice an increase or a decrease, yet read in the conclusion that no changes (or no effects) were found. There were changes, but they weren't significant. In other words, there were changes, but they were so small that they may be due to random fluctuations (they may also be due to an actual effect; we can't know for certain).
We saw earlier, in the "Demographics" section, that the larger the sample size of a study, the more reliable its results. Relatedly, the larger the sample size of a study, the greater its ability to detect whether small effects are significant. A small change is less likely to be due to random fluctuations when found in a study with a thousand people, let's say, than in a study with ten people.
This explains why a meta-analysis may find significant changes by pooling the data of several studies which, independently, found no significant changes.
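As a rough illustration (the 0.5 kg difference and 5 kg standard deviation are hypothetical numbers, not from any particular study), the sketch below holds the observed difference between groups fixed and only changes the sample size; the same difference crosses the usual 0.05 threshold once the groups are large enough:

```python
# Illustrative sketch: a fixed observed difference of 0.5 kg between two groups
# (pooled SD 5 kg) evaluated with a two-sample t-test at different sample sizes.
from math import sqrt
from scipy import stats

diff, sd = 0.5, 5.0   # hypothetical difference in means and pooled standard deviation

for n in (10, 100, 1000):
    t = diff / (sd * sqrt(2 / n))                 # t statistic for equal group sizes and SDs
    p = 2 * stats.t.sf(abs(t), df=2 * n - 2)      # two-sided p-value
    print(f"n={n:4d} per group  ->  p = {p:.3f}")

# n=10 and n=100 come out non-significant; at n=1000 the very same 0.5 kg
# difference falls below the 0.05 threshold.
```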
P-values 101
Most often, an effect is said to be significant if the statistical analysis (run by the researchers post-study) delivers a p-value that isn't higher than a certain threshold (set by the researchers pre-study). We'll call this threshold the threshold of significance.
Understanding how to interpret p-values correctly can be tricky, even for specialists, but here's an intuitive way to think about them:
Think about a coin toss. Flip a coin 100 times and you will get roughly a 50/50 split of heads and tails. Not terribly surprising. But what if you flip this coin 100 times and get heads every time? Now that's surprising! For the record, the probability of it actually happening is 0.00000000000000000000000000008%.
You can think of p-values in terms of getting all heads when flipping a coin.
- A p-value of 5% (p = 0.05) is no more surprising than getting all heads on 4 coin tosses.
- A p-value of 0.5% (p = 0.005) is no more surprising than getting all heads on 8 coin tosses.
- A p-value of 0.05% (p = 0.0005) is no more surprising than getting all heads on 11 coin tosses.
Contrary to popular belief, the "p" in "p-value" does not stand for "probability". The probability of getting 4 heads in a row is 6.25%, not 5%. If you want to convert a p-value into coin tosses (technically called S-values) and a probability percentage, check out the converter here.
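If you'd rather do the conversion yourself, here is a minimal sketch of the idea behind it (the S-value is s = -log2(p), rounded here to the nearest whole toss), which also shows why the all-heads probability is close to, but not the same as, the p-value:

```python
# Illustrative sketch: converting p-values into "all heads in a row" coin tosses
# (S-values) and back into a probability.
from math import log2

for p in (0.05, 0.005, 0.0005):
    s = round(-log2(p))          # S-value: the p-value expressed as all-heads coin tosses
    prob = 0.5 ** s              # probability of actually getting that many heads in a row
    print(f"p = {p:<7} ->  ~{s} coin tosses  (all-heads probability {prob:.4%})")

# p = 0.05 maps to about 4 tosses, and the probability of 4 heads in a row is
# 6.25%, not 5%, which is why a p-value is not that probability.
```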
As we saw, an effect is significant if the data collected over the course of the trial would be unlikely if there really was no effect. Now we can add that the lower the p-value (below the threshold of significance), the more confident we can be that the effect is real.
P-values 201
All right. Fair warning: we're going to get nerdy. Well, nerdier. Feel free to skip this section and resume reading here.
Still with us? All right, then, let's get to it. As we've seen, researchers run statistical analyses on the results of their study (usually one analysis per endpoint) in order to determine whether or not the intervention had an effect. They commonly make this decision based on the p-value of the results, which tells you how likely a result at least as extreme as the one observed would be if the null hypothesis, among other assumptions, were true.
Ah, jargon! Don't panic, we'll explain and illustrate those concepts.
In every experiment there are generally two opposing statements: the null hypothesis and the alternative hypothesis. Let's imagine a fictional study testing the weight-loss supplement "Better Weight" against a placebo. The two opposing statements would look like this:
- Null hypothesis: compared to placebo, Better Weight does not increase or decrease weight. (The hypothesis is that the supplement's effect on weight is null.)
- Alternative hypothesis: compared to placebo, Better Weight does decrease or increase weight. (The hypothesis is that the supplement has an effect, positive or negative, on weight.)
The purpose is to see whether the effect (here, on weight) of the intervention (here, a supplement called "Better Weight") is better, worse, or the same as the effect of the control (here, a placebo, but sometimes the control is another, well-studied intervention; for instance, a new drug can be studied against a reference drug).
For that purpose, the researchers usually set a threshold of significance (α) before the trial. If, at the end of the trial, the p-value (p) from the results is less than or equal to this threshold (p ≤ α), there is a significant difference between the effects of the two treatments studied. (Remember that, in this context, significant means statistically significant.)
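Here is a bare-bones sketch of that decision rule, using simulated data for the fictional Better Weight trial (the group sizes, average weight changes, and standard deviations are all assumptions made up for the example):

```python
# Illustrative sketch: comparing simulated weight changes in the "Better Weight"
# group and the placebo group with a two-sample t-test and alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05

# Hypothetical weight changes in kg after the trial (negative = weight lost).
better_weight = rng.normal(-2.0, 3.0, 60)   # assumed average loss of 2 kg in this group
placebo = rng.normal(-0.5, 3.0, 60)         # assumed small placebo response

p = stats.ttest_ind(better_weight, placebo).pvalue
print(f"p-value: {p:.4f}")
if p <= alpha:
    print("Significant: the data are hard to square with the null hypothesis.")
else:
    print("Not significant: the data are consistent with no difference between groups.")
```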
Figure 5: Threshold for statistical significance
The most commonly used threshold of significance is 5% (α = 0.05). It means that if the null hypothesis (i.e., the idea that there was no difference between treatments) is true, then, after repeating the experiment an infinite number of times, the researchers would get a false positive (i.e., would find a significant effect where there is none) at most 5% of the time (p ≤ 0.05).
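A quick way to see what that 5% means in practice: under the null hypothesis (and the model's other assumptions), p-values are spread evenly between 0 and 1, so a 0.05 cutoff flags about 5% of truly null experiments as "significant". A tiny sketch:

```python
# Illustrative sketch: p-values from many hypothetical experiments in which the
# null hypothesis is true are uniformly distributed, so about 5% of them fall
# at or below the 0.05 threshold.
import numpy as np

rng = np.random.default_rng(3)
p_values = rng.uniform(0.0, 1.0, 100_000)
print(f"Share of 'significant' null results: {(p_values <= 0.05).mean():.1%}")   # close to 5%
```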
Broadly speaking, the p-value is a measure of consistency between the results of the study and the idea that the two treatments have the same effect. Let's see how this would play out in our Better Weight weight-loss trial, where one of the treatments is a supplement and the other a placebo:
- Scenario 1: The p-value is 0.80 (p = 0.80). The results are more consistent with the null hypothesis (i.e., the idea that there is no difference between the two treatments). We conclude that Better Weight had no significant effect on weight loss compared to placebo.
- Scenario 2: The p-value is 0.01 (p = 0.01). The results are more consistent with the alternative hypothesis (i.e., the idea that there is a difference between the two treatments). We conclude that Better Weight had a significant effect on weight loss compared to placebo.
While p = 0.01 is a significant result, so is p = 0.000001. So what information do smaller p-values offer us? All other things being equal, they give us greater confidence in the findings. In our example, a p-value of 0.000001 would give us greater confidence that Better Weight had a significant effect on weight change. But sometimes things aren't equal between experiments, making direct comparison between two experiments' p-values tricky and sometimes downright invalid.
Even if a p-value is significant, remember that a significant effect may not be clinically relevant. Let's say that we found a significant result of p = 0.01 showing that Better Weight improves weight loss. The catch: Better Weight produced only 0.2 kg (0.5 lb) more weight loss compared to placebo after one year, a difference too small to have any meaningful effect on health. In this case, though the effect is statistically significant, the real-world effect is too small to justify taking this supplement. (This type of scenario is more likely to take place when the study is large since, as we saw, the larger the sample size of a study, the greater its ability to detect whether small effects are significant.)
Finally, we should mention that, though the most commonly used threshold of significance is 5% (p ≤ 0.05), some studies require greater certainty. For instance, for genetic epidemiologists to declare that a genetic association is statistically significant (say, to declare that a gene is associated with weight gain), the threshold of significance is usually set at 0.0000005% (p ≤ 0.000000005), which corresponds to getting all heads on 28 coin tosses. The probability of this happening is about 0.0000004%.
P-values: Don't worship them!
Finally, keep in mind that, while important, p-values aren't the final say on whether a study's conclusions are accurate.
We saw that researchers too eager to find an effect in their study may resort to "data fishing". They may also try to lower p-values in various ways: for instance, they may run different analyses on the same data and only report the significant p-values, or they may recruit more and more participants until they get a statistically significant result. These bad scientific practices are known as "p-hacking" or "selective reporting". (You can read about a real-life example of this here.)
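To see how "recruiting until significant" can manufacture a finding, here is a small simulation (there is no true effect in the data; the starting size, batch size, and cap are arbitrary choices for the example):

```python
# Illustrative sketch: "optional stopping". The experimenter peeks at the p-value,
# recruits another batch if it isn't significant, and stops as soon as it is.
# Even with no true effect, this inflates the false-positive rate well above 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def hacked_trial(start=10, step=10, max_n=200, alpha=0.05):
    """Return True if peeking at the growing data ever yields p <= alpha."""
    a = list(rng.normal(0, 1, start))
    b = list(rng.normal(0, 1, start))
    while len(a) <= max_n:
        if stats.ttest_ind(a, b).pvalue <= alpha:
            return True
        a.extend(rng.normal(0, 1, step))   # recruit another batch and test again
        b.extend(rng.normal(0, 1, step))
    return False

rate = sum(hacked_trial() for _ in range(500)) / 500
print(f"False-positive rate with optional stopping: {rate:.0%}")   # well above the nominal 5%
```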
While a study's statistical analysis usually accounts for the variables the researchers were trying to control for, p-values can also be influenced (on purpose or not) by study design, hidden confounders, the types of statistical tests used, and much, much more. When evaluating the strength of a study's design, imagine yourself in the researchers' shoes and consider how you could torture a study to make it say what you want and advance your career in the process.
Source: https://examine.com/guides/how-to-read-a-study/