- Chuda Prasad Dhakal, PhD
Institute of Agriculture and Animal Sciences, Tribhuvan University
1. Background
Statistics can be defined as the science of learning from data. According to Sarmento and Costa (2019), one of its primary goals is to extract information from data in order to better understand the situations the data represent. Statistical analysis, more specifically, involves investigating trends, patterns, and relationships using quantitative data (A step-by-step guide to statistical analysis, n.d.). DePoy and Gitlin (2016) likewise note that statistical analysis involves the organization and interpretation of data according to well-defined, systematic, and mathematical procedures and rules.
In summary, statistical analysis is a method for interpreting data that have been collected in a systematic, organized manner, typically from a sample obtained through an appropriate sampling technique. Sampling is a crucial aspect of statistical analysis because it allows researchers to draw conclusions about a larger population from a subset of data. Through statistical analysis, researchers can extract meaningful information from datasets and identify patterns and trends that might otherwise be difficult to discern, enabling them to make informed decisions and draw accurate conclusions.
2. Nature of statistical analysis
Statistical analysis can vary depending on the nature of the data being analyzed, the type of analysis required, and the purpose of the study. One common distinction is between descriptive and inferential statistical analysis.
Descriptive statistical analysis involves organizing and summarizing data using numbers and graphs. This type of analysis is useful for gaining a basic understanding of the data and identifying patterns and trends. Descriptive statistics are often used to calculate measures of central tendency (such as mean, median, and mode) and measures of variability (such as standard deviation and range). Descriptive statistics also include measures of skewness and kurtosis, which provide information about the shape of the distribution of data.
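To make these measures concrete, the following is a minimal Python sketch (not part of the original discussion) that computes each of them with NumPy and SciPy on an invented sample:

    import numpy as np
    from scipy import stats

    data = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical sample

    print("mean:", np.mean(data))
    print("median:", np.median(data))
    print("mode:", stats.mode(data, keepdims=False).mode)
    print("standard deviation:", np.std(data, ddof=1))  # sample SD (ddof=1)
    print("range:", np.ptp(data))                       # maximum minus minimum
    print("skewness:", stats.skew(data))
    print("kurtosis:", stats.kurtosis(data))            # excess kurtosis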
In contrast, inferential statistical analysis is used when it is not feasible to inspect every unit in the population. Inferential statistics involves generalizing the information obtained from a sample to the entire population. This type of analysis is used to make predictions about future events based on current and past data.
Other types of statistical analysis include predictive analysis, prescriptive analysis, exploratory data analysis, causal analysis, and mechanistic analysis (Sarmento & Costa, 2019). According to Sharma (2016), predictive analysis is used to make predictions about future events, while prescriptive analysis is used to determine the optimal course of action in a particular situation. Exploratory data analysis involves looking through the data to identify patterns and trends, while causal analysis is used to understand the causes of events. Mechanistic analysis, on the other hand, is used to explain how things happen (Pritchard & Hemberg, 2014).
Regardless of the type of statistical analysis being used, it is essential to plan the procedure carefully so that valid conclusions can be drawn. A typical analysis proceeds through a common sequence of steps: defining the research question, collecting data, cleaning and organizing the data, selecting appropriate statistical methods, analyzing the data, and interpreting the results.
3. Hypotheses and research design
Statistical analysis is often used to test the effect of a treatment on a sample or to investigate the relationship between variables within a population. To start this process, we set up null and alternative hypotheses. The null hypothesis states the position of no effect or no relationship, for example, “There is no relationship between the variables within the population.” The alternative hypothesis, sometimes referred to as the researcher’s hypothesis, contradicts the null hypothesis, for example, “The variables in the population are related to each other.”
Research design specifies how data will be collected, organized, and analyzed. The choice of research type, whether descriptive, correlational, or experimental, informs the selection of appropriate statistical tests for testing the hypotheses.
The level of measurement of variables is critical in selecting appropriate statistical tests; the four scales are nominal, ordinal, interval, and ratio. Nominal variables are categorical and cannot be ordered, while ordinal variables have a natural order but no consistent interval between values. Interval variables have consistent intervals between values but no true zero point, while ratio variables have both consistent intervals and a true zero point. Categorical data group observations and can be nominal or ordinal, while quantitative data represent amounts on an interval or ratio scale, as the sketch below illustrates.
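Assuming pandas is available, this brief sketch encodes one nominal and one ordinal variable; the variable names and values are invented for illustration:

    import pandas as pd

    # Nominal: categories with no inherent order (e.g., crop species)
    species = pd.Categorical(["rice", "maize", "wheat", "rice"])

    # Ordinal: categories with a natural order but no fixed interval
    severity = pd.Categorical(
        ["low", "high", "medium", "low"],
        categories=["low", "medium", "high"],
        ordered=True,
    )

    print(species.ordered)   # False -> order comparisons are undefined
    print(severity.ordered)  # True  -> min/max and < comparisons are meaningful
    print(severity.min())    # 'low'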
Overall, careful planning and consideration of hypotheses and research design are necessary to draw valid conclusions from statistical analysis.
4. Sampling design and collection of data
To make statistical inferences about a population, we collect data from a sample that is representative of that population. Selecting an appropriate sample is crucial for the generalizability of our findings. There are two main approaches: probability sampling, in which every member of the population has a known chance of being selected, and non-probability sampling, in which selection is based on convenience or judgment. Probability sampling supports generalizable findings analyzed with parametric tests, while non-probability sampling is more appropriate for non-parametric tests, which yield weaker inferences about the population.
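For illustration, a simple random sample, the most basic probability sampling design, can be drawn as in the following sketch; the sampling frame of unit IDs is hypothetical:

    import random

    random.seed(42)                      # fixed seed for reproducibility
    frame = list(range(1, 501))          # sampling frame of 500 unit IDs
    sample = random.sample(frame, k=50)  # each unit has an equal selection chance
    print(sample[:10])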
Before collecting data, we must decide on an appropriate sample size. Several approaches and formulas can be used to determine sample size, including looking at other similar studies or using appropriate formulas for our situation. A sample that is too small may be unrepresentative of the population, while a sample that is too large will be more costly than necessary.
Sample size determination depends on several factors. One of these is the significance level (alpha), which is the risk of rejecting a true null hypothesis. Typically, alpha is set to 5%. For example, suppose we want to test whether a new drug reduces blood pressure. If we set alpha at 5%, we are saying that we are willing to risk falsely rejecting the null hypothesis (i.e., concluding that the drug is effective when it is not) in 5% of similar studies.
Another factor is effect size, which is a standardized measure of the magnitude of the difference or relationship between two variables. Effect size is important because larger effects require smaller sample sizes to be detected than smaller effects. For example, suppose we want to compare the mean heights of two groups of people. If the difference between the means is large, we can detect it with a smaller sample size than if the difference is small.
The third factor is the power of the test, which is the probability of detecting an effect of a certain size if it exists. Power is typically set to 80% or higher, which means that we want to have at least an 80% chance of detecting an effect if it is there. For example, suppose we want to test whether a new teaching method improves students’ test scores. If we set the power to 80%, we are saying that we want to have at least an 80% chance of detecting a difference in test scores if the teaching method does in fact improve them.
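Putting the three factors together, a power analysis can solve for the required sample size. The sketch below uses the statsmodels library, assuming a two-sample t-test with a hypothetical medium effect size (Cohen’s d = 0.5), alpha = 0.05, and power = 0.80:

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
    print(round(n_per_group))  # roughly 64 per group under these assumptions

A larger assumed effect size or a lower required power would shrink this number, which is why these inputs must be chosen before data collection.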
5. Summarizing data
Once we have collected the data, we need to summarize it using descriptive statistics. This involves creating frequency distribution tables, bar charts, and scatter plots for key variables. Through these, we can check for any skewed data, missing values, or outliers.
In statistical analysis, the three main measures of central tendency that are often reported are the mean, median, and mode. However, the appropriate measure(s) to report depend on the shape of the distribution and the level of measurement of the variables considered in the analysis. Variability is another essential property to report, showing how spread out the values in a dataset are around the measure of central tendency. The four main measures of variability are the range, interquartile range, standard deviation, and variance.
The shape of the distribution and the level of measurement guide us in choosing the appropriate variability statistics. For instance, the interquartile range is the best measure for skewed distributions, while the standard deviation and variance are most informative for normal distributions. After summarizing the data, appropriate statistical tests are performed, taking into consideration the units of the descriptive statistics, the variance, and any outliers, to find out whether the patterns observed in the sample are statistically significant in the population.
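As a quick illustration of these checks, the sketch below builds a frequency table and applies the common 1.5 × IQR rule to flag potential outliers in an invented variable:

    import pandas as pd

    scores = pd.Series([12, 15, 15, 18, 19, 21, 22, 22, 22, 80])

    print(scores.value_counts().sort_index())   # frequency distribution

    q1, q3 = scores.quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = scores[(scores < q1 - 1.5 * iqr) | (scores > q3 + 1.5 * iqr)]
    print("IQR:", iqr, "possible outliers:", outliers.tolist())  # flags 80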
6. Testing hypotheses with inferential statistics
Whereas descriptive statistics summarize the sample itself, inferential statistical analysis generalizes the information obtained from a sample to the entire population. It is used when it is not feasible to inspect every unit in the population, and its purpose is to determine whether the findings observed in a sample can be generalized to the population from which the sample was drawn. Inferential statistics is used to test hypotheses about the relationship between variables or the effect of a treatment on a sample.
In inferential statistics, a number that describes a sample is called a statistic, while a number that describes a population is called a parameter. We can use inferential statistics to draw conclusions about population parameters based on sample statistics. The two common methods of estimation are point estimation and interval estimation. A point estimate is the single best guess of the parameter, while an interval estimate is a range of values within which the parameter is believed to lie. It is good practice to report both, since together they convey the information carried by the sample. A confidence interval is the usual interval estimate: it shows the variability around a point estimate and conveys the range in which we would expect the population parameter to fall in most repeated samples.
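A minimal sketch of both kinds of estimate, assuming SciPy and an invented sample: the sample mean serves as the point estimate, and a t-based 95% confidence interval serves as the interval estimate:

    import numpy as np
    from scipy import stats

    sample = np.array([4.2, 5.1, 4.8, 5.6, 4.9, 5.3, 4.7, 5.0])

    point_estimate = sample.mean()
    sem = stats.sem(sample)  # standard error of the mean
    ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1,
                                       loc=point_estimate, scale=sem)
    print(f"point estimate: {point_estimate:.2f}")
    print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")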
Regarding hypothesis testing, there are two broader categories of statistical tests: parametric and non-parametric tests. Parametric tests make powerful inferences about the population based on sample data, but some assumptions must be met, and only some types of variables can be used. If the data violate these assumptions, we can perform appropriate data transformations or use non-parametric tests instead.
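The decision between the two families can be sketched as follows; here a Shapiro-Wilk normality check (one common assumption check, chosen for illustration) routes invented data to either an independent t-test or its non-parametric counterpart, the Mann-Whitney U test:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    group_a = rng.normal(50, 10, size=30)  # hypothetical measurements
    group_b = rng.normal(55, 10, size=30)

    normal_a = stats.shapiro(group_a).pvalue > 0.05
    normal_b = stats.shapiro(group_b).pvalue > 0.05

    if normal_a and normal_b:
        result = stats.ttest_ind(group_a, group_b)     # parametric test
    else:
        result = stats.mannwhitneyu(group_a, group_b)  # non-parametric fallback
    print(result)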
We can test hypotheses about relationships between variables in the population using sample data. We first assume that the null hypothesis is true in the population and then carry out statistical tests to check whether the null hypothesis can be rejected or not. These tests come in three main forms: comparison tests, regression tests, and correlation tests. The choice of statistical test depends on the research questions, research design, sampling method, and data characteristics.
Comparison tests usually compare group means, such as the means of different groups within a sample, the means of one sample group measured at different times, or a sample mean and a population mean. Correlation tests, in turn, measure the association between variables; Pearson’s r, for example, is a parametric test that quantifies the strength of a linear relationship between two quantitative variables. To judge whether a correlation observed in the sample is strong enough to be important in the population, we perform a significance test of the correlation coefficient, usually a t-test, to obtain a p-value. This test uses the coefficient and the sample size to assess whether the correlation differs significantly from zero in the population.
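The sketch below illustrates this with SciPy, whose pearsonr function returns both the coefficient and the p-value of the significance test against the null hypothesis of zero correlation; the height and weight values are invented:

    from scipy import stats

    heights = [150, 155, 160, 165, 170, 175, 180]
    weights = [52, 56, 61, 64, 70, 74, 79]

    r, p_value = stats.pearsonr(heights, weights)
    print(f"r = {r:.3f}, p = {p_value:.4f}")  # reject H0 of no correlation if p < 0.05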
7. Interpreting the results and generalization
The final step in statistical analysis is interpreting the results; in hypothesis testing, statistical significance is the primary criterion for forming conclusions. We compare the p-value with a pre-set significance level, typically 0.05, to determine whether the results are statistically significant. Statistically significant results are considered unlikely to have occurred by chance alone: if the null hypothesis were true in the population, the likelihood of obtaining such an outcome would be very low. However, statistical significance does not necessarily imply that the results have important real-life implications. Effect size indicates the practical significance of the findings, and it is important to report effect sizes alongside inferential statistics for a comprehensive understanding of the results.
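As a brief illustration of reporting both quantities, the following sketch computes a t-test p-value together with Cohen’s d (one common effect-size measure, assumed here for illustration) on invented control and treatment scores:

    import numpy as np
    from scipy import stats

    control = np.array([70, 72, 68, 75, 71, 69, 73, 74])
    treated = np.array([74, 78, 73, 80, 76, 75, 79, 77])

    t_stat, p_value = stats.ttest_ind(treated, control)

    # Cohen's d: mean difference divided by the pooled standard deviation
    pooled_sd = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
    cohens_d = (treated.mean() - control.mean()) / pooled_sd
    print(f"p = {p_value:.4f}, Cohen's d = {cohens_d:.2f}")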
When testing a hypothesis, we risk two types of decision errors. A Type I error occurs when the null hypothesis is rejected despite being true, whereas a Type II error occurs when the null hypothesis is not rejected despite being false. To minimize the risk of these errors, we can choose an appropriate significance level and ensure high power. However, there is a trade-off between the two error rates, so a careful balance is necessary.
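The meaning of the significance level as the Type I error rate can be illustrated with a small simulation: when both groups are drawn from the same population, so the null hypothesis is true by construction, roughly 5% of tests still reject it. A hedged sketch:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    rejections = 0
    trials = 2000
    for _ in range(trials):
        a = rng.normal(0, 1, 30)  # both groups come from the same population,
        b = rng.normal(0, 1, 30)  # so the null hypothesis is true by construction
        if stats.ttest_ind(a, b).pvalue < 0.05:
            rejections += 1
    print(rejections / trials)    # close to 0.05, the chosen alpha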
Finally, when generalizing the results of a study, we must consider the representativeness of the sample and the characteristics of the population. The study’s external validity is determined by how well the results can be applied to other populations or contexts. Researchers must ensure that the sample is representative of the population and that the study’s methods and measures are appropriate for the research question. A large sample size, random sampling, and rigorous study design can enhance the study’s external validity.
8. Limitations of statistical analysis
When using statistical analysis, it is important to recognize its limitations to avoid drawing unwarranted conclusions. One major limitation is that statistical models often rely on simplifying assumptions that may not be true in real-world situations, potentially leading to biased or inaccurate results if not properly accounted for. In addition, statistical analysis can only reveal associations between variables, not causality, so it is essential to be cautious when interpreting statistical results.
Furthermore, the quality and quantity of available data can greatly affect the reliability of the statistical analysis, as poor-quality data can lead to unreliable results or make it impossible to draw meaningful conclusions. Additionally, statistical analysis cannot account for unmeasured variables that may impact the outcome of interest.
Despite these limitations, statistical analysis remains a valuable tool for gaining insights into complex phenomena. By carefully considering its limitations and using complementary methods when appropriate, statistical analysis can contribute to advancements in a wide range of fields.
However, it is worth noting that statistical analysis may not always be applicable or appropriate. For example, it may not be a primary tool in certain fields, such as philosophy, literature, the fine arts, theology, and the study of personal belief systems. While statistics can be used in the analysis of certain philosophical problems, such as logic and decision theory, it may not be as relevant to the subject matter of theology and individual belief systems. Furthermore, statistical analysis is generally not used to analyze creative works, such as literature, art, music, or other forms of expression.
9. Conclusion
To sum up, statistical analysis is a powerful tool that has revolutionized many fields and significantly contributed to scientific advancements. It provides a way to make sense of complex data, test hypotheses, and identify patterns and relationships that may not be immediately apparent. However, statistical analysis is not without limitations. Simplifying assumptions, limited data quality, and the inability to account for all variables can all affect the accuracy of statistical models. Additionally, statistical analysis can only identify associations, not causality, so it’s important to interpret statistical results with caution and avoid drawing unwarranted conclusions.
Moreover, statistical analysis may not be applicable or appropriate in certain fields, such as philosophy, literature, the fine arts, theology, and the study of personal belief systems. While it can be used in certain philosophical problems and decision theory, it may not be as relevant to theology or the study of individual belief systems. Creative expressions such as literature, art, and music may not be amenable to statistical analysis.
Despite these limitations, statistical analysis remains an important tool for gaining insights into complex phenomena. By recognizing its limits and complementing it with other methods where appropriate, we can continue to use statistical analysis to advance understanding and knowledge across many fields.
Bibliography
A step-by-step guide to statistical analysis. (n.d.). Scribbr. Retrieved from https://www.scribbr.com/category/statistics/. Accessed on 11 July 2021.
DePoy, E., & Gitlin, L.N. (2016). Chapter 20: Statistical analysis for experimental-type designs. In Introduction to Research (5th ed.). Retrieved from https://www.sciencedirect.com/science/article/pii/B9780323261715000203?via%3Dihub. Accessed on 15 July 2021.
Pritchard, D., & Hemberg, E. (2014). Prescriptive analytics. In Encyclopedia of business analytics and optimization (pp. 2269-2280). IGI Global.
Seven types of statistical analysis: Definition and explanation. (2021). Analytics Steps. Retrieved from https://www.analyticssteps.com/blogs/7-types-statistical-analysis-definition-explanation. Accessed on 15 July 2021.
Sarmento, R.P., & Costa, V. (2019). An overview of statistical data analysis. arXiv preprint arXiv:1908.07390. Retrieved from https://arxiv.org/abs/1908.07390. Accessed on 15 July 2021.
Sharma, N. (2016). Predictive analytics: The power to predict who will click, buy, lie, or die. John Wiley & Sons.