3 Research Methods in the Psychological Sciences

Tony Machin and Erich Fein


The field of psychology is characterised by a diversity of research questions related to human thought and behaviour. As such, psychology is organised into several distinct sub-disciplines such as clinical and organisational psychology. Although psychological research spans a wide range of different content areas, there is quite a bit of similarity underlying how psychologists go about answering research questions in these different areas. This is not to say that differences do not exist in the research approaches used within different areas of inquiry. However, these differences are in large part variations in emphasis and in the specific tactics used to accomplish research objectives. The broader principles and fundamental empirical or data-driven strategies guiding psychologists in different sub-disciplines are for the most part the same. Therefore, when using approaches to the discovery and construction of knowledge that differ from these broad principles and strategies, the researchers must endeavour to explain why they have adopted that approach.

If you find you struggle to understand some concepts in this chapter, don’t worry – these are topics experts throughout psychology continue to study. Indeed, understanding these concepts takes practice. Because readers have varied backgrounds in this area, there is a keyword index at the end of this chapter. Further, there are many additional resources to help you learn more about these topics. Open access (free) supports for statistics basics include Andy Field’s (2019) Discovering Statistics and the Noba Project (2021), and Daniel Lakens (2019) offers a low-cost course titled Improving Your Statistical Inferences. These resources are not a substitute for a university course in research methods or statistics, but they can provide supportive background information if you want to build a stronger foundation in these key areas.

The principles and procedures that guide psychologists’ exploration of research questions are what we typically refer to as ‘psychological research methods’. The goal of this chapter is to introduce readers to the key principles that nearly all psychological scientists rely on when conducting psychological research. Understanding research methods is obviously essential for any student whose ultimate goal is to embark on a career as a research psychologist in either academia or an applied setting. However, it’s also important for many non-research careers – for example, many professions require employees to be ‘consumers’ of psychological research. These individuals might not conduct research, but will often draw upon prior research to develop plans of action to help accomplish their objectives (e.g., advertising firms developing product campaigns, managers attempting to resolve conflicts between employees, etc.). Indeed, even people making decisions in their personal lives might find themselves needing to be consumers of psychological research (e.g., a parent of a child with behavioural problems considering various intervention plans). Regardless of the setting, being an informed consumer of psychological research and developing your psychological literacy requires an understanding of the key principles that guide how research is conducted.

In discussing psychological research methods, this chapter is based on a series of key steps that a researcher must undertake in conducting any program of research. For ease of presentation, these steps follow a straightforward sequence. This sequence is a logical progression and, as will be seen, some steps cannot really be undertaken without first completing earlier steps. However, the order of some steps can be reversed, and some steps can even be addressed at the same time. To illustrate this design process, a recurring hypothetical example of a research program will be used: how fear and anger might influence aggression.

We (Tony and Erich) have adapted the original chapter created by Vaughan-Johnston, Fabrigar and Lawrence (2019) to reflect the Australian context and fully accept responsibility for the revised chapter and any errors or omissions.

Key steps in the research process

Formulating Research Questions

The first step to any program of research is formulating the research question. Ultimately, any study is only as useful as the research question it’s designed to address. Additionally, as will be seen, many of the decisions made in later stages of the research process are informed by the nature of the question a study intends to answer.

Descriptive Versus Inferential Research Questions

When formulating a research question, the first issue to address is whether the goal of the research will be primarily descriptive versus inferential in nature. Descriptive research questions largely focus on describing one or more psychological or behavioural constructs in a given domain of interest. For example, a researcher studying aggression might be interested in the prevalence of verbal aggression in the workplace. This researcher might wish to determine the proportion of employees in Australian or New Zealand workplaces who have been verbally demeaned or insulted by their co-workers. In addition, there are several large-scale studies of work and life balance in Australia, such as the Australian Work-Life Index (Fein, Skinner, & Machin, 2017), which include variables related to aggressive supervision.

Although psychological research is sometimes primarily descriptive in nature, most psychological research is predominantly inferential in its goals. Inferential research involves the exploration of relations among psychological and behavioural constructs. For example, in the context of aggression, a researcher might want to know what characteristics of workplace employees are associated with them being perpetrators of verbal aggression. In this case, a well-developed study may suggest that certain characteristics of individuals (e.g., personality) and situations (e.g., abusive supervision) may contribute to causing verbal aggression. Clearly, both types of research question (descriptive and inferential) are useful and interesting. However, if we ultimately want to understand why something occurs and/or how we can influence it, research must move beyond the purely descriptive level and begin to address inferential questions.

Exploratory Versus Confirmatory Research Questions

Assuming an inferential research question, the next consideration is whether this question will be approached in an exploratory or confirmatory manner. In exploratory research, researchers do not have specific expectations, but rather more general notions regarding the answer to the question. For example, a researcher interested in what characteristics are associated with the likelihood of being a perpetrator of verbal aggression in the workplace might measure a wide range of different characteristics of employees (e.g., their proclivity to experience different emotions, their level of seniority in the organisation, various personality traits) and then conduct analyses to see which characteristics are associated with aggression. In contrast, for confirmatory research, the researcher specifies what factors are likely to cause aggression and perhaps even when and why such factors have their effects. These hypotheses are generally derived from past research and/or some theory regarding the phenomenon of interest. The researcher then focuses attention primarily on those factors that have been hypothesised to produce the outcome of interest.

Both approaches have their advantages and limitations. The strength of exploratory research is that it encourages researchers to think broadly about the phenomenon of interest and maximises the opportunity of stumbling on unexpected discoveries. However, although exploratory studies often consider a wide range of possibilities, they are rarely optimal tests of any single explanation. In contrast, confirmatory studies tend to have a narrow focus, but usually provide more systematic and complete tests of the factors they are designed to explore. For instance, if a study must cover a wide range of different characteristics of employees that could predict their proclivity to engage in verbal aggression, it might not be feasible to extensively measure each factor (e.g., the researcher might only be able to include a few questions measuring each factor). In contrast, if a researcher has explicitly postulated that tendency to experience the emotions of fear and anger are major determinants of aggression, the researcher might be able to include very extensive measures of each emotion, and perhaps even multiple different types of measures of each emotion. The two approaches, however, are not mutually exclusive. Indeed, often a program of research will adopt an exploratory approach in its early phases and then gradually transition to a more confirmatory approach.

Basic Versus Applied Research Questions

A final consideration during the research question formulation stage is whether the study will be designed to primarily address a basic (i.e., theoretical) research question versus an applied research question. Basic research is aimed at formulating and testing fundamental psychological principles governing a domain of interest. For instance, a researcher might be interested in developing a theory of the role of emotions in aggression. The goal of this researcher is to develop principles that explain which specific emotions either increase or decrease aggression and why these emotions have the effects they do on aggression. Thus, the goal is to arrive at a fundamental understanding of the relations among the constructs of emotions and the construct of aggression.

In contrast, applied research questions tend to focus on a specific problem. They typically emphasise predicting or influencing an outcome rather than focusing on understanding why that outcome is predicted or influenced by a given factor. Indeed, applied research questions often focus on the effects of a specific measure or intervention with less concern as to why that measure or intervention accomplishes its goal and/or the effects of the broader construct of interest that measure or intervention is presumed to represent. However, some applied research will also include a consideration of potential theoretical or conceptual causes because such frameworks provide a foundation for future research. For example, an applied researcher might be interested in testing if a specific measure of anger predicts employee aggression or if a specific anger management program lowers employee aggression. In this case, the researcher could test the impact of the anger management program on employee aggression and might explain the results with a conceptual model.

As with other distinctions, the basic versus applied research question distinctions are not mutually exclusive. Often basic research might have the ultimate goal of developing principles that can be used to solve applied problems. Likewise, the exploration of applied questions can often contribute to the understanding of basic questions. Thus, this distinction is more a matter of emphasis than a fundamental difference in the nature of the research question being addressed. However, this difference in emphasis does have implications for the methodological decisions that a researcher might make at subsequent stages of the research process.

Selecting Dependent Variables

Once a researcher has formulated a research question – and presuming that question is inferential in nature – the researcher’s next step is to determine the specific constructs of interest. More precisely, constructs are those psychological elements within people and groups thought to vary across people and/or situations. Although the goal of all inferential research is to determine the relationship between constructs, some of this research involves merely finding associations between constructs, whereas other studies test hypothesised causal relationships among the construct(s) of interest. A researcher cannot assess ‘fear’, ‘aggression’, or other constructs directly, but instead selects specific measures that represent constructs in an observable way. Measures representing the outcome constructs in hypothesised relationships are called ‘dependent variables’ because they are conceptualised to be dependent on the levels of one or more independent variables – a topic that will be addressed later in the chapter.

After having determined the constructs that one intends to study, one must more precisely define them. Some constructs are more easily defined than others. For example, psychological constructs such as personality have numerous conceptualisations, including the Big Five and HEXACO frameworks. In contrast, physical traits such as height and weight often have widely-accepted definitions that are consistently applied across domains of research. Keep in mind that how a researcher chooses to define the study variables will affect the results of the study, the comparability of outcomes to other studies that have researched the same constructs, and one’s ability to operationalise the constructs in a way that will allow for feasible, sensible, and meaningful measurement.

For example, there are a broad range of ways to characterise aggression (e.g., Archer & Coyne, 2005). For some research questions, a broad conceptualisation that includes indirect, relational, and social aggression may be very useful. In other cases, a very specific definition of aggression as ‘causing physical harm to others’ may be preferable. Even within this seemingly narrowed conceptualisation of physical harm, important conceptual questions require answering: for example, should the mere desire or wish to cause physical harm count, or only aggressive actions that are actually expressed by a participant?

Operationalisation is the formal term for the specific definition of constructs with linkage to specific measures. For example, if one wishes to measure an individual’s aggression, the experimenter must decide how – that is what method of instrumentation – should be utilised to obtain an accurate measurement (e.g., using a self-report scale, observational techniques). Thus, one possible operationalisation of individual aggression could be self-report using the Aggression Scale (e.g., Orpinas & Frankowski, 2001). Researchers usually hope they can make inferences from the measure back to the construct the measure is trying to capture. When operationalising dependent variables, one must aim to select measures that are sensitive enough that the influence of the independent variable on the dependent variable can be detected. Measures should strive to accurately capture a construct of interest – a topic that will be discussed in detail later as construct validity.

Level of measurement. There are four major categories of measurement level. Nominal scales involve any measure for which scores are given as categorical labels. For example, in our fear/anger and aggression study, we might assess participants’ cultural background (e.g., German, Chinese) as a nominal variable. Notice that nominal scales like this do not imply any rank ordering of the categories. That is, cultural backgrounds such as German or Chinese are not points along a single continuum, but simply distinct categories that are selected.

Conversely, ordinal scales provide a rank ordering of the categories. For example, a measure might ask people to rank-order several aggressive thoughts they are experiencing from most to least aggressive. Here the response options are ordered from most aggressive to least aggressive: a single continuum. However, also recognise there is no standard distance between the rankings: that is, the psychological distance implied by the gap between the first and second most aggressive thoughts would not be expected to match the distance between the fourth and fifth most aggressive thoughts.

Interval data provides response options that are equally spaced. In psychology it is often difficult to create truly interval scaling. Imagine a self-reported anger scale ranging from 1 (slight anger) to 2 (moderate anger) to 3 (strong anger). The psychological distance between response options such as slight to moderate, versus moderate to strong, although intended to be equal, might not necessarily be equivalent to one another, making it difficult to form truly interval measurements. However, when multiple items are aggregated together, pseudo-interval scaling often functions quite similarly to true interval scaling, and such aggregated ordinal data can often be treated statistically as though it were interval (Harpe, 2015).

Ratio data additionally has a true zero point. For example, if participants’ punching a doll is used as a behavioural measurement of aggression, zero punches indicate a complete absence of this behaviour. This matters, for example, when multiplying or dividing scale values to compare levels on the scale. A 2 on a self-report scale of anger does not indicate ‘twice’ as much anger as a 1, but a person who punches a doll twice has engaged in twice as much of this type of aggression compared to someone who punches once.
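
As a rough illustration of why level of measurement matters, the following Python sketch pairs each of the four levels with a summary that is meaningful for it. All data values and variable names are hypothetical and were invented for this illustration; they are not drawn from any study.

```python
from statistics import mean, median, mode

# Hypothetical data for five participants (invented for illustration only).
cultural_background = ["German", "Chinese", "German", "Australian", "German"]  # nominal
thought_rank = [1, 3, 2, 5, 4]   # ordinal: rank given to one thought (1 = most aggressive)
anger_rating = [1, 2, 2, 3, 2]   # treated as interval-like: 1 = slight, 2 = moderate, 3 = strong
doll_punches = [0, 2, 4, 1, 2]   # ratio: a count with a true zero point

# Nominal: categories can only be counted, so the mode is a sensible summary.
most_common_background = mode(cultural_background)

# Ordinal: order is meaningful but distances are not, so the median is safer than the mean.
median_rank = median(thought_rank)

# Interval (or aggregated pseudo-interval): equal spacing makes the mean interpretable.
mean_anger = mean(anger_rating)

# Ratio: a true zero makes statements such as 'twice as many punches' meaningful.
punch_ratio = doll_punches[2] / doll_punches[1]

print(most_common_background, median_rank, mean_anger, punch_ratio)
```

Note that dividing one anger rating by another would run without error, but the result would not be interpretable, because the rating scale has no true zero.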

Methods of measurement

There are several methods of measurement routinely used in psychology. The most common of these is self-report measurement. These measures ask participants to verbally report their standing on the psychological or behavioural construct of interest, typically using some form of structured rating scale. Self-report tools are usually considered to be direct measures because participants are directly asked to assess their own psychological attributes. Examples include the Beck Depression Inventory (Beck et al., 1996) or the NEO Five-Factor Inventory (Costa & McCrae, 1991). One issue that commonly arises when using self-report measures is that they are susceptible to socially desirable responding (Paulhus, 1991), meaning that respondents may distort their responses in order to present themselves favourably. For example, people may understate how much anger or fear they’re feeling if feeling these emotions strongly is considered inappropriate. Another issue is that people may not always be able to provide accurate self-report responses. For example, self-report responses are influenced by the cognitive accessibility of relevant information (e.g., Strack et al., 1988), making these responses susceptible to influence based on how questions are framed. Additionally, people may simply not have perfect introspective self-awareness (Nisbett & Wilson, 1977), and therefore not be capable of accurately describing why they think or feel certain ways.

Another common method of data collection is the use of indirect measures, which refer to tools that assess participants without directly asking them to provide self-assessment of their psychological attributes (De Houwer, 2006; Gawronski & De Houwer, 2014). A quite common form of indirect measure is implicit measurement, referring to measures that assess relatively uncontrolled and automatic types of participants’ responses. Examples of implicit measures include the Name-Letter Task (NLT; LeBel & Gawronski, 2009; Nuttin, 1985), the Implicit Association Test (IAT; Greenwald et al., 1998), and the Affect Misattribution Procedure (AMP; Payne et al., 2005). Although these implicit measures are quite diverse in form, they work by assessing reaction time, or subtle response patterns that would be difficult to deliberately control. For example, implicit measures often assess how quickly people pair objects together, following the logic that similar objects or ideas are ‘congruent’ for respondents, and are easily categorised together. For example, people who pair ‘good’ with ‘white’ quickly, but ‘good’ with ‘black’ slowly may be viewed as preferring white over black people. Other implicit measures suggest that underlying feelings about an object can be assessed by how respondents’ feelings spill over onto stimuli presented shortly after. The AMP, for example, exposes participants very briefly to an image of an attitude object (a prime), and then asks them to rate their opinion towards a neutral stimulus (e.g., rating how much they like a meaningless shape). Individuals who rate the neutral stimulus as ‘bad’ after viewing a particular prime are viewed as having a negative opinion of the prime object (Payne et al., 2005).

One reason that indirect measures are often championed is that they are thought to be highly resistant to social desirability concerns (Petty et al., 2012). For example, when measuring racial attitudes with a self-report scale, psychologists may be concerned that respondents would have a powerful motivation not to admit racist attitudes. An indirect measure can circumvent these social desirability concerns by measuring extremely subtle reaction time differences that would be difficult to control. It may be noted that some research has identified specific conditions whereby respondents can occasionally control ‘implicit’ responses (Klauer & Teige-Mocigemba, 2007), but generally respondents will find it much more difficult to deliberately control their responses on these tasks. Thus, implicit measures may not be completely immune to social desirability or other motivated control attempts, but they are highly resistant to such response biases.

One common observation about implicit measures is that they do not always show high levels of convergence with their explicit counterparts. Although critics have sometimes framed this low convergence as a problem, low correlations may simply suggest that implicit measures capture unique variance in constructs that traditional self-report measures fail to capture. Importantly, this implies that direct and indirect measures may have incremental validity in predicting behaviours – meaning using both types of measure to predict behaviour is more powerful than using only one type of measure. Reviews have shown that incremental validity of implicit and explicit attitudes can indeed be observed (Friese et al., 2008). Furthermore, each type of measure may be uniquely helpful in specific contexts. In conditions where people are deliberate and thoughtful, explicit measures have better predictive power, whereas implicit measures are better used to predict spontaneous behaviour (Asendorpf et al., 2002).

Oftentimes in psychology, psychological processes are inferred based on physical changes that occur in participants’ brains or other bodily regions. Physiological measures record processes such as voltage fluctuations in brain neurons (i.e., brain activity) captured using electroencephalography (EEG), metabolic processes using positron emission tomography (PET), and blood flow in the brain using functional magnetic resonance imaging (fMRI). For example, some researchers have assessed people’s fear responses by assessing activation of their amygdala region through techniques including magnetoencephalography (Moses et al., 2007). Cacioppo and Tassinary (1990) have chronicled some of the impressive advances in neuropsychology’s ability to non-invasively examine brain activity. Like implicit measures, physiological measures are often seen as preferable to self-report measurement because they can obviate participants’ attempts to control their responses. Although these measures therefore have great value in addressing certain concerns, one general limitation of these methods is that because of the complicated technology required, their administration requires highly specialised technicians, and they are therefore costly and time-consuming to use. More substantively, numerous neuropsychologists have warned readers about the dangers of over-assuming causal relationships between brain ‘signals’ and participants’ emotions, thoughts, or actions (Cacioppo et al., 2003).

Just as implicit and physiological measures operate by capturing respondents’ uncontrollable reactions, observational measures allow social scientists to obtain information from their subjects through evaluating participants’ overt behaviours. Observations can be made with or without participants’ awareness that such observations are occurring. For example, aggression has been assessed by recording how much hot sauce participants put into a glass of water supposedly intended for the next participant to enter the laboratory, with large amounts of hot sauce indicating an aggressive behaviour (Lieberman et al., 1999).

Reliability and validity

A comprehensive explanation of the development of new measures goes beyond the scope of this chapter, but guidelines are available for interested readers (John & Benet-Martinez, 2014; Simms, 2008). The following section instead focuses primarily on issues of measurement reliability and validity – two fundamental psychometric properties.

Although both reliability and validity in measurement are crucial, reliability is required for a measure to be valid, but validity is not required for a measure to be reliable. In principle, reliability simply refers to the consistency with which a measure provides the same information, although it comes in many forms. For example, psychologists may measure the same construct in the same people across a span of time, using the same measure. If a measure provides consistent measurements across time, and the construct it assesses remains stable, people who score low or high at one time point should continue to do so later – this is called ‘test-retest reliability.’ Of course, constructs that are expected to change across time (e.g., acute experiences of fear) don’t typically get measured with high test-retest reliability, because participants’ responses change due to the fleeting nature of emotion. However, many traits are thought to be relatively stable across the lifespan, such as personality (Costa & McCrae, 1993), and high test-retest reliabilities serve to indicate that these constructs’ measures are providing consistent information.
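
Test-retest reliability is commonly summarised as the correlation between scores at the two time points. The sketch below, in Python, computes a Pearson correlation for that purpose. The scores are hypothetical and invented for illustration, and the helper name `pearson_r` is our own rather than a function from any particular statistics package.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

# Hypothetical trait scores for the same five people, measured twice.
time_1 = [10, 14, 18, 22, 26]
time_2 = [11, 13, 19, 21, 27]

# A value close to 1 means people kept roughly the same relative standing.
test_retest_r = pearson_r(time_1, time_2)
print(round(test_retest_r, 2))
```

In this made-up example people who scored low at the first time point also scored low at the second, so the correlation is high. For a fleeting state such as acute fear, the same calculation would be expected to give a much lower value.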

Another tool for assessing reliability is the extent to which independent evaluators judge something in an equivalent manner: ‘inter-rater reliability’. For example, if observers were asked to evaluate aggressive behaviour displayed by participants, inter-rater reliability would be high if all the judges observed and recorded a similar number of aggressive behaviours. If judges’ evaluations completely differed from one to the next, this would be evidence that their observations lack reliability – that is, lack consistency. Similarly, when evaluating various items that are thought to assess the same underlying construct, ‘internal consistency’ refers to when items correlate highly with one another due to respondents answering in a consistent way across items (Henson, 2001). For example, a highly fearful individual should express that they are ‘terrified’, ‘frightened’, as well as ‘scared’. The core principle is consistency: consistent responses to these items by the same respondents would indicate that the items are reflecting the same construct, meaning that they have reliability.
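
The chapter does not name a specific index of internal consistency. One widely used index is coefficient (Cronbach’s) alpha, and the Python sketch below computes it for the three fear items mentioned above. Choosing alpha is our assumption for the purpose of illustration, and the responses are hypothetical.

```python
from statistics import variance

def cronbach_alpha(items):
    """Coefficient alpha for a list of items, each a list of respondent scores."""
    k = len(items)
    # Total score per respondent, summing across the items.
    total_scores = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(total_scores))

# Hypothetical 1-5 ratings from five respondents on three fear items.
terrified = [1, 2, 3, 4, 5]
frightened = [2, 2, 3, 5, 5]
scared = [1, 3, 3, 4, 4]

alpha = cronbach_alpha([terrified, frightened, scared])
print(round(alpha, 2))
```

Because the made-up respondents answer the three items in a similar way, alpha comes out high. If responses to the items were unrelated to one another, the item variances would account for most of the total-score variance and alpha would fall towards zero.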

After operationalising your measures, it’s also important you ensure that your measure displays validity. A measure is valid insofar as it quantifies accurately what it purports to measure. Construct validity refers to the degree to which a measure specifically and sensitively captures its intended construct (Cook & Campbell, 1979; Shadish et al., 2002). Although methodology texts often introduce dozens of unique types of validity as though each were completely separate, many of these are best viewed as similar types of evidence that allow researchers to determine if a measure has construct validity. After collecting these multiple types of evidence, researchers would unify them into a coherent argument for construct validity. For example, ‘criterion validity’ is the extent to which a measure is associated with other measures that should logically be related to its construct. This is really evidence of a measure’s construct validity – if a measure effectively captures its construct, it should be related to things that its construct relates to. For example, when developing a self-reported fear measure, this fear measure should be related to avoidance behaviours, because people are motivated to avoid things that frighten them. If they do correlate, this is consistent with the notion that the fear measure is accurately or validly measuring fear. Similarly, methodologists refer to ‘discriminant validity’ when a measure shows minimal associations with irrelevant variables. For example, a fear measure should not be closely associated with social desirability measures. Indeed, if a fear measure was negatively related to a social desirability measure, it might indicate that people are denying any fear that they feel due to social desirability concerns such as not wanting to sound afraid. This would threaten a fear measure’s construct validity, because the fear measure would no longer only be measuring fear.

If a measure appears to reflect its construct according to either experts or laypeople, then it is said to possess ‘face validity’. Once again, this is evidence of construct validity. If emotion experts think the items on a fear measure are not reflective of fear, this could raise concerns about the measure’s construct validity. Interestingly, sometimes it’s disadvantageous for a measure to possess face validity. For example, if participants are aware that a scale seeks to measure aggression, then it’s likely that participants may disagree with items to appear non-aggressive to the extent that aggression is socially inappropriate or anti-normative. To obtain accurate results it’s therefore occasionally advantageous to reduce face validity depending on the construct of interest – in other words, increasing a scale’s subtlety (Holden & Jackson, 1979).

Selecting independent variables

Once the dependent variable (DV) has been determined, a researcher selects one or more independent variables (IVs), which represent variables conceptualised as predicting or influencing DVs. Many of the same criteria used to evaluate DVs are also relevant when considering IVs. For example, the reliability and validity of IVs are as important as they are for DVs, and are often assessed in the same ways. Continuing with the example of fear or anger inducing aggression, fear/anger would be IVs – the variables understood to be increasing or decreasing aggression. However, IVs are not precisely like DVs. For one thing, DVs are always measured, whereas IVs may be measured or manipulated. Both measurement and manipulation have some advantages and disadvantages, and each opens up several specific questions for the researcher.


Manipulations are changes in constructs induced by deliberately stimulating or inhibiting those constructs through some process of the study. One common type of applied psychological intervention is a training intervention designed to affect knowledge and skill. Another example would be clinical and counselling interventions that affect emotional states. In our recurring example involving anger, a manipulation would be any action designed to actively change participants’ current levels of anger or fear. As with DVs, researchers must consider the many ways that fear/anger could be operationalised. One could remind participants of a time when they felt fear/anger in their own lives (recalled emotion – e.g., Baker & Guttfreund, 1993) or have them read fictitious narratives which are intended to make participants experience fear/anger (emotion stimulated by narrative engagement). One could employ deception to generate anger – for example, Nisbett and Cohen (1996) had a confederate ‘accidentally’ bump into participants as they walked in a corridor, which elicited anger in participants. Despite being quite different, these are all manipulations designed to stimulate an IV.

One reason to incorporate a manipulation rather than a measure of one’s IV is that manipulations have advantages with respect to internal validity, which reflects researchers’ ability to make causal claims about the relationship between study variables. Imagine measuring fear (our ‘IV’) and then measuring aggression (our ‘DV’) just a few moments afterwards. Assuming an association existed between these measures, what could a researcher conclude? It’s not clear that fear caused aggression. One other possibility would be that participants were already feeling aggressive before fear was measured. If this were the case, those aggressive intentions may have caused the participants to feel fear and were still present when the aggression measure was collected. Thus, in this case, aggression might just as easily have caused fear (this risk is sometimes called reverse causation). Perhaps more likely, a third construct could be responsible for causing the other two constructs to appear associated. For example, participants may have been experiencing physiological arousal at an earlier point in the procedure. This arousal caused them to endorse the fear items because their heart was racing and their palms were sweating, so they inferred that they were feeling fear. Furthermore, their arousal led them to behave more aggressively. Note that in this case, arousal was actually responsible for both variables seeming to ‘increase together’ (covary), and no real causal relationship existed between fear and aggression. This threat to internal validity is sometimes called the third variable problem.
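The third variable problem can be made concrete with a small simulation. The sketch below is purely illustrative: the variable names, effect sizes, sample size, and random seed are all assumptions rather than values from any study. Arousal is constructed to influence both fear and aggression, while fear is given no effect on aggression at all.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residuals(ys, xs):
    """Part of ys left over after removing its linear relation to xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]


rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 5000                # hypothetical number of participants

# Assumed causal structure: arousal -> fear and arousal -> aggression.
# Fear never appears in the equation that generates aggression.
arousal = [rng.gauss(0, 1) for _ in range(n)]
fear = [a + rng.gauss(0, 1) for a in arousal]
aggression = [a + rng.gauss(0, 1) for a in arousal]

r_raw = pearson_r(fear, aggression)
r_controlled = pearson_r(residuals(fear, arousal),
                         residuals(aggression, arousal))

print(f"fear-aggression correlation: {r_raw:.2f}")
print(f"after removing arousal:      {r_controlled:.2f}")
```

Under these assumptions the raw correlation should come out clearly positive, whereas the correlation that remains once arousal is statistically removed should sit close to zero. In a real study the third variable is usually unmeasured, which is why this kind of statistical control cannot replace manipulation.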

These types of associations that interfere with the direct relationship between IVs and DVs can pose serious threats to internal validity. Now imagine randomly assigning half of a group of participants to watch a frightening movie scene that results in increased fear, and the other half to watch a non-frightening scene that doesn’t increase fear (thus, fear is manipulated). Under random assignment, every participant has an equal likelihood of being in any of the experimental conditions. Because people are randomly sorted into these groups, it’s unlikely that a third variable caused differences in fear between the two groups. This is because any idiosyncratic individual differences between participants would be distributed randomly across conditions. Instead, differences between the groups are most likely attributable to the manipulation’s effects, helping to establish a causal relationship wherein the IV causes the DV. Researchers’ ability to make such causal claims is referred to as internal validity.
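Random assignment is simple to carry out in practice. The following is a minimal sketch, assuming participants are identified by hypothetical ID numbers and that the two condition labels are placeholders. The function name `randomly_assign` is likewise introduced here only for illustration.

```python
import random
from collections import Counter


def randomly_assign(participant_ids, conditions, seed=None):
    """Shuffle participants, then deal them into conditions in rotation.

    Every participant is equally likely to land in any condition, and
    group sizes differ by at most one.
    """
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    return {pid: conditions[i % len(conditions)]
            for i, pid in enumerate(shuffled)}


# Hypothetical participant IDs 1-40 and two hypothetical condition labels.
assignment = randomly_assign(range(1, 41), ["fear_clip", "neutral_clip"], seed=7)
group_sizes = Counter(assignment.values())
print(group_sizes)
```

Shuffling first and then dealing participants out in rotation keeps the groups equal in size, which a separate coin flip for each participant would not guarantee.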

Thus, a common choice when using manipulations is to incorporate a control group, representing the condition in which participants would be if they were not subjected to the part of a manipulation that is of interest to you. For example, consider all the elements of watching a five-minute frightening film clip – five minutes of audio and visual stimuli, the feeling of wearing headphones, sitting in a chair, and (hopefully) feeling fear. A control group controls for as many of these irrelevant aspects as possible, leaving only the fear variable to differ across groups. Thus, a control group might watch a five-minute film clip (sitting down, wearing headphones) of an emotionally ‘neutral’ scene such as a mechanic fixing a dishwasher. Differences in group behaviours are now hopefully attributable only to fear, rather than sitting, wearing headphones, or film-watching in general, since even a boring dishwasher scene contains all of those elements.

This clustering of participants such that some experience one condition, others experience a different condition, and others experience a control condition is characteristic of a between-participant design, which helps to examine causal relationships by randomly assigning people to one of several conditions and examining differences emerging between the groups. Alternatively, in a within-participant design, participants would each undergo each condition. Re-using the video-watching example, a within-participant design might have all participants watch both clips, measuring aggression after each clip. In this case, no random assignment is required because the same individuals participate in both conditions. However, a researcher will often rotate the order of presentation – half of participants watch the control film before the frightening film, and half watch in the reverse order (this process is sometimes called counterbalancing the order of conditions). Otherwise, the order of film presentation might explain any differences between conditions.
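Counterbalancing can be sketched in a few lines. The example below assumes hypothetical participant IDs and condition labels, and it uses every possible order of the conditions about equally often. The function name `counterbalance` is an illustrative choice, not an established routine.

```python
import itertools
import random
from collections import Counter


def counterbalance(participant_ids, conditions, seed=None):
    """Give each participant an order of conditions, using every
    possible order about equally often."""
    orders = list(itertools.permutations(conditions))
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)  # which participant gets which order is random
    return {pid: orders[i % len(orders)] for i, pid in enumerate(shuffled)}


# Hypothetical within-participant study with 20 people and two film clips.
schedule = counterbalance(range(1, 21), ["neutral_clip", "fear_clip"], seed=3)
order_counts = Counter(schedule.values())
print(order_counts)
```

With two conditions there are only two orders, so half of the participants receive each. The number of possible orders grows very quickly as conditions are added, so listing every order in this way is only practical for small designs.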

Issues of manipulations and measurements

It’s often advisable to consider a similar checklist of priorities when using measures or manipulations. First, consider issues of confounding variables. One common objection to measuring IVs is that measures are almost always influenced by constructs other than the one intended. For example, it may be difficult to measure fear without a measurement being impacted by participants’ neuroticism (a personality trait in which people experience chronic, negative emotionality). Therefore, manipulations may seem superior because they do not introduce such confounds. However, manipulations may also introduce irrelevant confounds if the manipulation influences constructs other than the one(s) intended (see Fiedler et al., 2012). For example, a manipulation designed to increase fear might also make some participants sad, angry, or surprised, making it harder to deduce what was ultimately responsible for any aggression effects. Thus, whether a researcher measures or manipulates an IV, they should still consider how irrelevant variables may interfere with their study’s validity.

Second, issues of transparency – or the degree to which participants can understand the true purpose of a study – are relevant to both measured and manipulated IVs. For example, it’s usually important that participants don’t know the precise hypothesis of a study, lest they simply act as they believe they are supposed to (i.e., demand characteristics – Orne, 1962). Suppose a study consists only of measuring fear and anger, before measuring aggression. Participants may deduce that the researcher wants to know whether fear and/or anger predict aggression, and act accordingly (acting either to confirm or disconfirm that hypothesis). One way to avoid this problem is to use one of many measures that are designed to measure a construct subtly, to avoid being obvious about what the experimenter is interested in, as discussed above. Another easy solution is to include filler measures – scales that researchers don’t want to evaluate, that are included to confuse participants’ understanding of the study’s purpose. Participants will typically assume that all study measures are relevant to the experimenter’s research questions, and therefore these bogus measures will throw off their guessing the true hypothesis.

In some contexts, manipulations may also make the study’s purposes transparent. If participants understand what a manipulation is meant to do to them, they may act differently due to their awareness of the experimenter’s research goals. Transparency is a particular issue for within-participant designs, because these often imply to participants that the experimenter wants to know how something varies across conditions, each of which each participant has experienced. In between-participant designs, in contrast, the design is often well-hidden simply because participants are not aware of what other participants are experiencing and thus do not know what their responses/actions are being compared against. One precaution that is often sensible is to include a funnel interview (Page & Scheidt, 1971). In a funnel interview, participants are asked increasingly probing questions about their experiences in the study and what they thought the study’s purpose was. Participants who truly understood the study’s purpose will presumably state this when they are asked, and researchers can consider whether to refine the manipulation, cut the data of the suspicious individuals, or else simply run statistical tests with and without suspicious participants included to assess the impact of suspicion.
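The last option, analysing the data with and without suspicious participants, can be sketched as follows. The records are invented for illustration, and the sketch compares a simple difference in group means rather than running a full statistical test.

```python
from statistics import mean

# Hypothetical records: condition, aggression score, and whether the funnel
# interview flagged the participant as suspicious of the study's purpose.
records = [
    {"condition": "fear", "aggression": 6, "suspicious": False},
    {"condition": "fear", "aggression": 5, "suspicious": False},
    {"condition": "fear", "aggression": 7, "suspicious": False},
    {"condition": "fear", "aggression": 9, "suspicious": True},
    {"condition": "neutral", "aggression": 3, "suspicious": False},
    {"condition": "neutral", "aggression": 4, "suspicious": False},
    {"condition": "neutral", "aggression": 2, "suspicious": False},
    {"condition": "neutral", "aggression": 1, "suspicious": True},
]


def mean_difference(rows):
    """Mean aggression in the fear condition minus the neutral condition."""
    fear = [r["aggression"] for r in rows if r["condition"] == "fear"]
    neutral = [r["aggression"] for r in rows if r["condition"] == "neutral"]
    return mean(fear) - mean(neutral)


diff_all = mean_difference(records)
diff_unsuspicious = mean_difference([r for r in records if not r["suspicious"]])

print("all participants:       ", diff_all)
print("suspicious ones removed:", diff_unsuspicious)
```

In this invented data set the two suspicious participants exaggerate the difference between conditions, so the effect shrinks once they are removed. If the two analyses lead to the same conclusion, suspicion is unlikely to be driving the result.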

The concept of construct validity was previously introduced with reference to measurements, but it has applicability to manipulations as well. Consider the previous example of bumping into participants to produce anger. In reality, it was primarily participants who were raised in the Southern, not Northern US states who felt anger at the staged hallway collision (Nisbett & Cohen, 1996) – Northerners quite often felt amused by the experience. This raises a critical question: for whom is a manipulation likely to activate its intended construct? The same stimulus that would frighten a child may not produce fear in adults. The easiest way to determine if a manipulation has construct validity is a manipulation check performed either during the study, or on a separate pilot sample. A manipulation check usually asks a participant a question directly related to the construct: for example, after watching a (hopefully) scary film clip, participants may be asked ‘How scary was that film?’ or ‘How scared are you?’. If the fear clip is felt to be scarier than the control clip, elevated fear ratings should be produced.


We next consider elements of research context that a researcher must consider when planning a study. In social science, context generally describes the population of interest (people) and the location and time (setting) in which research takes place. Context is of great importance to psychologists for at least two reasons. First, context helps to define how measures and manipulations should be designed to optimally capture a construct (i.e., construct validity). Just as some measures are only effective for children (e.g., ‘I want my mummy’ as an item measuring fear), some stimuli have different psychological meanings in certain eras. For example, consider how the meaning of the name ‘John F. Kennedy’ changed from 1962 to 1964 (with his assassination occurring in 1963), or how the words ‘John F. Kennedy’ might have radically different meanings to a respondent who was alive in the 1960s compared to a respondent who was born in the twenty-first century. This is very important in psychology, because it means that measures and manipulations that were developed originally for one context may or may not work effectively in other contexts. Ultimately, psychological scientists are interested in the relationships between constructs, not measures. Therefore, materials must be found to possess construct validity within a given context and within a given population before they can reasonably test how constructs interrelate. There is often a trade-off to consider. Materials that are very customised for a specific population may be extremely powerful tools for studying that population, but may require a serious re-evaluation and development process when alternative groups are studied, making generalisation attempts more laborious.

A second reason why context and population matter is that psychologists sometimes want to test the external validity or generalisability of findings. Suppose psychologists discover that fear does causally produce aggressive responses among children. Of course, it doesn’t automatically follow that the same relationship would occur among adults, whose emotional self-regulation abilities may be considerably different. Assuming a construct-valid fear manipulation was employed among adults, and assuming a construct-valid aggression measure was also used, the fear/aggression association could be examined among adults as well. Whether the association emerges or not would then test the external validity of the fear/aggression link – that is, how generalisable the link between variables is.


In psychology, the population of interest is typically a very large group of people about whom the researcher wishes to draw conclusions. Researchers create inclusion criteria and exclusion criteria to aid in the process of defining the population of interest. The former refers to characteristics that would render a participant eligible to participate, and the latter would disqualify a subject from partaking in the planned data collection. For example, if a social scientist was interested in the aggression levels of criminally-convicted juvenile offenders in Australia, then the inclusion criteria might include age (<18 years). Having no criminal record would be an exclusion criterion.

Measuring every individual in the population of interest is virtually never feasible (Banerjee & Chaudhury, 2010), requiring psychology researchers to test their hypotheses using a subset of the population of interest known as a sample. In some cases, researchers aim to obtain a truly random sample, which ensures that every member of the population under investigation has an equal probability of being included in the sample. One situation in which random sampling is important is when descriptive analyses are important to a researcher. For example, if researchers want to know accurately what the average aggression level is among Australian juvenile offenders, non-random sampling will likely undermine the accuracy of their descriptive estimates.

Truly random samples are often impossible to obtain (Sweetland, 1972), resulting in the collection of data by means of a convenience sample, meaning that a sample is obtained from a more readily available subgroup of the population. University students are a classic example of a convenience sample when the population of interest is ‘all people’, because students are often easily accessible to researchers – for example, participating in research in exchange for bonus marks in their courses or small cash payments. However, university students may differ from members of the general public in some important respects: they’re likely to have greater levels of education, may have a more critical approach to evaluating claims about the efficacy of a product or intervention, and so on. Therefore, a worthwhile consideration is whether a convenience sample differs from the population on specific constructs of interest to a researcher. For example, a perceptual psychologist studying visual perception may consider university students to be quite representative of people with respect to rods and cones in their retinas. To this researcher, the attributes for which university students might be expected to differ from the general population probably would not interfere with testing their key hypotheses.
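The consequences of sampling method for descriptive estimates can be illustrated with a simulation. Everything in the sketch below is assumed for the sake of the example: the population size, the share of students, and the idea that students score lower on the measure than non-students.

```python
import random
from statistics import mean

rng = random.Random(11)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 people. Students (20%) are assumed to
# score lower on some aggression measure than non-students.
population = []
for person_id in range(10_000):
    is_student = person_id % 5 == 0
    score = rng.gauss(3.0 if is_student else 5.0, 1.0)
    population.append({"id": person_id, "student": is_student, "score": score})

population_mean = mean([p["score"] for p in population])

# Simple random sample: every member has the same chance of selection.
random_sample = rng.sample(population, 500)
random_mean = mean([p["score"] for p in random_sample])

# Convenience sample: drawn only from the easily reached subgroup.
students = [p for p in population if p["student"]]
convenience_sample = rng.sample(students, 500)
convenience_mean = mean([p["score"] for p in convenience_sample])

print(f"population mean:         {population_mean:.2f}")
print(f"random sample mean:      {random_mean:.2f}")
print(f"convenience sample mean: {convenience_mean:.2f}")
```

Because students were built to differ from everyone else on the measured construct, the convenience sample misses the population mean by a wide margin while the random sample lands close to it. Had students been simulated as identical to non-students on this construct, both samples would have done equally well, which is the point of the retina example above.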

Other cases may be more ambiguous, and the utility of convenience samples may also depend on the type of research question being pursued. For example, if university students have unusually developed cognitive skills (e.g., memorisation and critical thinking skills), this is likely to bias descriptive research questions about cognitive skills or abilities. Inferential research, however, necessitates closer scrutiny regarding the use of convenience samples when such samples are a tenuous match to the populations of interest. For example, it’s unclear whether a convenience sample of university students may effectively map onto a broader population or a population composed of individuals at different ages. For example, a sample of Australian university students may have a different relationship between fear/anger and aggression, compared to children or older adults. That is, the relationship between emotion and aggression (an inferential question) may itself differ across a span of age levels. One possibility, if a researcher is concerned about such age effects, would be to collect a representative sample. However, this solution is not without issues. For example, suppose fear relates to increased aggression in young adults, but that children instead become less aggressive when they’re afraid. If a researcher were to engage in equal sampling of children and young adults, the study might show no effect of fear when in fact there are two quite different effects that are masked because the two patterns run in opposing directions. Indeed, if researchers have reasonable grounds to suspect that such differences occur across sample types, they may want to conduct multiple studies, each collecting a sample from a different population. In this hypothetical case, Study 1 would identify the positive fear/aggression association among young adults, and Study 2 would identify the negative association in children. 
An alternative approach would involve deliberately collecting both groups within a single large study (e.g., half young adults, half children), and then statistically analysing any differences across the groups.
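The masking problem described above can be reproduced in a simulation. The sketch assumes two hypothetical groups in which fear and aggression are related with equal strength but opposite sign. The slopes, group sizes, and seed are arbitrary choices for illustration.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(5)
n_per_group = 2000  # hypothetical group size


def simulate_group(slope):
    """Generate fear and aggression scores with an assumed linear relation."""
    fear = [rng.gauss(0, 1) for _ in range(n_per_group)]
    aggression = [slope * f + rng.gauss(0, 1) for f in fear]
    return fear, aggression


adult_fear, adult_aggression = simulate_group(slope=0.6)    # assumed positive
child_fear, child_aggression = simulate_group(slope=-0.6)   # assumed negative

r_adults = pearson_r(adult_fear, adult_aggression)
r_children = pearson_r(child_fear, child_aggression)
r_pooled = pearson_r(adult_fear + child_fear,
                     adult_aggression + child_aggression)

print(f"young adults: {r_adults:.2f}")
print(f"children:     {r_children:.2f}")
print(f"pooled:       {r_pooled:.2f}")
```

Analysed separately, each group shows a clear association, while the pooled sample shows almost none. This is why an equally mixed sample needs to be analysed by group rather than as a single undifferentiated pool.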

Another consideration regarding population is sample size – that is, the number of cases or observations produced by participants in a study. There exist numerous techniques to determine an appropriate sample size, usually termed power analyses, but the mathematical basis for these calculations is too complex to be fully advanced here. In general, larger samples decrease the chance that a finding will represent a statistical ‘fluke’ or false positive result. This is because as our sample becomes bigger, it better approximates the population that we want to make conclusions about. For example, if 10,000 Australian women were surveyed about workplace aggression, the conclusions that could be drawn about experiences of Australian women related to workplace aggression are more likely to reflect the population of all Australian women than a sample size of 10 Australian women.
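One simple way to see why larger samples approximate the population better is the standard error of the mean, which is the standard deviation divided by the square root of the sample size. The sketch below assumes a hypothetical standard deviation of 1.5 scale points. It illustrates precision only and is not a power analysis.

```python
import math


def standard_error_of_mean(sd, n):
    """Expected sampling error of a sample mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)


assumed_sd = 1.5  # hypothetical spread of scores in the population

for n in (10, 100, 10_000):
    se = standard_error_of_mean(assumed_sd, n)
    print(f"n = {n:>6}: standard error = {se:.3f}")
```

Because the sample size sits under a square root, precision improves with diminishing returns: a hundredfold increase in participants shrinks the standard error only tenfold.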

Although some psychologists advocate for always maximising sample size, there are a few issues to consider when deciding on an appropriate sample size. Certainly, it is true that a larger sample size increases statistical power, or the ability to detect inferential patterns between variables where they truly exist. Similarly, descriptive statistics become more precise with larger samples. However, there are other considerations to take into account when planning research. For example, researchers may become constrained in terms of the methodologies that can facilitate such enormous samples. For example, researchers can collect thousands or even millions of participants through crowdsourcing techniques or mass online testing (e.g., YourMorals.org – Iyer, 2019), but as we detail in a later section, online research has both advantages and disadvantages associated with it.
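The link between statistical power and sample size can be sketched with a common textbook approximation for comparing two group means. This approximation is introduced here for illustration and is not a method described in this chapter. It assumes a two-sided test, a standardised effect size (the expected mean difference divided by the standard deviation), and a normal distribution in place of the t distribution, so it slightly underestimates the required sample size. The default alpha of .05 and power of .80 are conventional choices, not requirements.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate participants needed per group for a two-group design.

    Normal-approximation formula: n = 2 * (z_alpha + z_power)^2 / d^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = z.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size_d ** 2)


# Hypothetical small, medium, and large standardised effects.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

The smaller the effect a researcher expects, the more participants are needed to detect it reliably. This is why the appropriate sample size depends on the research question and not on a single universal rule.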

A final issue prompting close attention to population is how stimuli and measures will be developed for various populations. As previously discussed, scientific research proceeds by using measures and manipulations to operationalise abstract constructs. Thus, it’s imperative that measures/manipulations have their intended meanings within each specific population. Consider, for example, if researchers used the same religious questionnaire for a study in both the Gold Coast and in rural Queensland. In this case, religious items may not have the same meaning for both populations because the questions could be perceived differently for a variety of factors – words within the questions could be unknown or have completely different meanings in different populations. Accordingly, some methodologists advocate for measurement invariance analysis (Millsap & Meredith, 2007; Widaman & Grimm, 2014), which uses a mathematical procedure to establish whether items of a measure perform similarly across groups at a psychometric level. Without establishing, at minimum, the basic levels of measurement invariance, comparisons across groups become suspect. To take a similar example, it becomes problematic to replicate a study of residents of inner-city Melbourne among rural and regional Victorians if a psychological measure has a completely different psychometric structure for these two groups.


A major factor in setting is whether a study takes place in a laboratory, in an online survey, or in a field context. The advantages and disadvantages of these contexts have stimulated productive research and debate. For example, laboratory research has sometimes been criticised as lacking mundane realism or being artificial and lacking applicability to ‘real-world’ situations (Ilgen & Favaro, 1985). However, psychologists rarely attempt to produce contexts that resemble ‘the real world’ literally, instead focusing on participants’ experiences of a study as psychologically meaningful (Berkowitz & Donnerstein, 1982). Recall that construct validity, for example, depends upon measures and/or manipulations being able to capture or produce psychological constructs within participants, such as fear, anger, or aggression. For example, a social rejection experience may be quite fabricated and artificial, but if it feels real to participants then causal hypotheses about the effects of feeling rejected can still be evaluated. Similarly, one might be concerned that participants will know they are being studied in a laboratory and therefore act unusually due to being observed. However, this risk can often be managed. Many experiments use deceptive procedures – or between-participant designs that hide the other conditions from participants – to disguise the true purpose of the research. For example, studies of bystander apathy examine how participants respond to emergencies (Latané & Darley, 1970). Although psychologists can’t ethically place people in real emergencies, they can lead participants to believe they’re attending a lab for one purpose, and have a simulated emergency occur, such as a person crying out in pain from an adjacent room. When participants intercede, they believe they’re responding to a real emergency disconnected from the experiment, and so concerns about participants ‘feeling studied’ can sometimes be controlled.

Practically, the laboratory offers many important advantages to researchers, such as the ability to control extraneous variables like time of day, temperature, noise and distractions, and so on. Although a variable like ‘temperature’ may not immediately seem important to a psychologist, note that room heat has been associated with aggression (Baron & Bell, 1975). Seemingly irrelevant environmental variables can directly influence psychological processes. Furthermore, lab equipment such as physiological measurement equipment, or computers that can assess reaction time, can be made available in a laboratory with relative ease. However, a disadvantage is that some kinds of experiences are not easily cultivated in a laboratory. For example, although psychologists may study group formation in a lab, it’s more difficult to study long-term group identity processes within a single-hour lab study, and impractical to have participants attend a laboratory for the years or decades required for some processes to unfold. Similarly, topics such as serious romantic relationships, bereavement, and so on, may be difficult to emulate in a laboratory and may be better studied in their natural contexts.

Although not overcoming all challenges associated with laboratory studies, one alternative context to the traditional laboratory is to conduct research in an online setting. There are several advantages to this setting. It’s relatively easy to solicit large samples of participants, particularly when using crowdsourcing technologies such as Amazon Mechanical Turk or Crowdflower. Furthermore, exceedingly rare (e.g., individuals with low-prevalence conditions) or distal groups (e.g., when an American researcher wishes to study Japanese populations) are much easier to obtain using online research. However, critics have suggested that attention levels may waver online, especially among university participants completing research online (Hauser & Schwarz, 2016). Others have argued this ‘online inattention’ problem may be obviated with attention checks (Goodman et al., 2013; but see Hauser & Schwarz, 2015). Certainly, online studies tend to involve participants who know they’re being studied, and so the above-noted concerns about presentation biases may be a concern here once again. With respect to the control psychological scientists have over respondents’ environments, the answer here is mixed. For example, an online study can request that participants work in a private, uninterrupted work environment, but can rarely enforce this behaviour within participants. Similarly, numerous random variables will fluctuate across participants in online samples. Variables such as room temperature, density of people within the room, and background noise, cannot be directly controlled. Additionally, online research may constrain researchers in their choice of measures and manipulations. For example, researchers can have online participants interact socially in web forums or chat rooms, but many aspects of social interaction (e.g., physical presence, non-verbal communication) are hard to capture in online studies. 
Similarly, some measures (e.g., physiological) may be impossible to obtain in online contexts, again restricting the sort of research that psychologists can pursue in this format.

Finally, some psychologists have argued for the benefits of field research, often protesting the apparent decrease in field studies in recent psychological science (Cialdini, 2009). Field studies do offer some advantages, such as making it typically quite easy to disguise a study’s purpose. For example, field studies in which subtle aspects of an environment are altered – such as changing the signs present in a neighbourhood and observing the results – will prevent participants from becoming aware they’re being studied, and therefore permit an authentic assessment of their reactions. However, a drawback to field research is that, although external behaviours can be easily detected and studied, internal processes such as participants’ private attitudes and emotional reactions to stimuli can be difficult to assess in this setting. Another potential drawback of field research is that many environmental factors that are easy to control in laboratories (e.g., temperature, wind, the presence of passers-by) may be much more difficult to standardise and regulate in field settings. Planning and careful attention to such factors can partially mitigate these risks, but the likely increased instability of noise variables in field research can interfere with inference testing.

Different contexts of data collection (in-lab, online, field, etc.) all carry certain advantages and disadvantages. One alternative to selecting one method and accepting all of the relevant drawbacks, is to conduct multiple studies using multiple methods. For example, a researcher might begin by testing anger’s relation to aggression using a laboratory experiment, using university students; then perform a similar test using a large sample of online participants who vary more widely across demographic variables; and then conduct a field study in which anger’s relation to aggression is monitored covertly (e.g., in a workplace setting).


Once a study is completed, the final steps in the research process are the analysis of the data, the interpretation of the results, and the report of the findings. In psychological research, most studies involve data that is quantitative in nature. Quantitative data refer to information that is expressed in some numerical form. For example, people’s responses to a 7-point rating scale indicating the level of anger they’re currently feeling might be represented by whole numbers ranging from 1 to 7. Once the data is collected, the researcher must formulate a statistical analysis of the data that corresponds to the question of interest.

If the goals of the study are purely descriptive in nature, analysis typically involves the computation of descriptive statistics for the measures of interest. Descriptive statistics summarise the overall pattern of responses for a given measure within a sample. The two most common types of descriptive statistics are indices of central tendency (i.e., indices of the single response that best characterises the sample as a whole; e.g., the average of anger ratings in a sample) and indices of variability (i.e., indices of the extent to which responses are very similar to versus different from one another in the sample – e.g., the range of ratings of anger in a sample).
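Both kinds of descriptive statistic are straightforward to compute. The sketch below uses a small, invented set of anger ratings on a 7-point scale.

```python
from statistics import mean, median, stdev

# Hypothetical anger ratings from eight participants (1-7 scale).
anger_ratings = [2, 5, 4, 7, 3, 5, 6, 4]

# Indices of central tendency: the single most typical response.
average_rating = mean(anger_ratings)
middle_rating = median(anger_ratings)

# Indices of variability: how much responses differ from one another.
rating_range = max(anger_ratings) - min(anger_ratings)
rating_sd = stdev(anger_ratings)  # sample standard deviation

print(f"mean = {average_rating}, median = {middle_rating}")
print(f"range = {rating_range}, standard deviation = {rating_sd:.2f}")
```

The range depends only on the two most extreme responses, whereas the standard deviation takes every rating into account, which is one reason the latter is more commonly reported.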

However, as noted earlier, most psychological research involves inferential research questions (i.e., questions regarding the relationship between two or more psychological or behavioural constructs). In these cases, a variety of inferential statistics are available to researchers. The specific type of inferential statistic that will be most appropriate for addressing a given research question depends on a number of factors. A detailed discussion of these different types of statistical tests obviously goes well beyond the scope of this chapter. However, in a broad sense, there are several factors that guide a researcher’s choice of statistical tests. First, the nature of the relationship being explored is an important consideration. For example, is the researcher only interested in a relationship between two variables? Alternatively, is the researcher interested in the relationships of several independent variables to a single dependent variable, or perhaps the relationships of multiple independent variables to multiple dependent variables? Second, what is the scale of measurement for the variables to be analysed? Are they purely nominal level variables (e.g., Queensland versus New South Wales), purely interval level (e.g., ratings from Strongly Agree to Strongly Disagree), or a mixture? Finally, what are the distributional properties of the variables? Do scores on the variables reflect a normal distribution? Depending on the answers to these sorts of questions, some types of analyses will be more appropriate than others because they make more or fewer assumptions about these properties of the data.

Although researchers have a vast array of different types of statistical tests from which they can choose, by far the most used statistical tests are based on the concept of Null Hypothesis Significance Testing (NHST). Simply stated, these tests assess the null hypothesis, which states that the relationship of interest (the alternative hypothesis) does not exist in the population. Tests are considered to be statistically significant when they produce a probability value (a p-value) equal to or less than .05. Statistical significance at the .05 level indicates that, if the null hypothesis were true, data at least as extreme as those obtained would be expected no more than 5 per cent of the time by chance alone. In these cases, the researcher is said to have rejected the null hypothesis (i.e., rejected the hypothesis that the relationship doesn’t exist in the population).

Tests are considered ‘nonsignificant’ when they produce a probability value (p) greater than .05. That is, a test is considered to have provided insufficient evidence for the existence of a relationship if a relationship at least as strong as the one observed would emerge by chance more than 5 per cent of the time when the null hypothesis is true. In such cases, the researcher is said to have ‘failed to reject the null hypothesis’.
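The logic of a p-value can be illustrated with an exact permutation test. This technique is chosen here only because it makes the reasoning visible, and it is not one of the tests discussed in this chapter. If the null hypothesis were true, the group labels would be arbitrary, so the test asks how often a random relabelling of the same scores produces a difference at least as large as the one observed. The data are invented.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)

    def mean_diff(sum_a):
        return sum_a / n_a - (total - sum_a) / n_b

    observed = abs(mean_diff(sum(group_a)))
    splits = 0
    as_extreme = 0
    # Try every way of relabelling n_a of the pooled scores as "group A".
    for indices in combinations(range(len(pooled)), n_a):
        splits += 1
        sum_a = sum(pooled[i] for i in indices)
        # Tiny tolerance guards against floating-point rounding.
        if abs(mean_diff(sum_a)) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / splits


# Hypothetical aggression scores after a frightening versus a neutral clip.
fear_group = [6, 7, 5, 7, 6]
neutral_group = [3, 2, 4, 3, 2]

p_value = permutation_p_value(fear_group, neutral_group)
print(f"p = {p_value:.4f}")
print("significant at .05" if p_value <= 0.05 else "nonsignificant")
```

In this invented example every score in the fear group exceeds every score in the neutral group, so very few relabellings are as extreme as the real data and the p-value falls well below .05.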

When an analysis of a study has produced an accurate conclusion regarding the existence of a relationship between variables, the study is said to be high in statistical conclusion validity (see Cook & Campbell, 1979; Shadish et al., 2002). Conceptually, there are two forms of errors that a researcher can make with a statistical test, thereby leading to low statistical conclusion validity. A Type I error is when a researcher falsely concludes that a relationship exists (i.e., incorrectly rejects the null hypothesis). Traditionally, researchers have considered this form of error to be very serious and set their level of risk for making such an error in their statistical tests (referred to as the alpha level) at .05. Recently, some researchers have called for even stricter alpha levels as a means of enhancing the statistical conclusion validity of psychological research (e.g., Benjamin et al., 2017). A Type II error is when a researcher falsely concludes there is no evidence for the existence of a relationship (i.e., incorrectly fails to reject the null hypothesis). Although researchers have traditionally placed less emphasis on this form of error, they still consider it problematic and have conventionally set their level of risk for making such an error in their statistical tests (referred to as beta) at .20. This means that researchers try to collect enough data that the risk of mistakenly concluding that no relationship exists (when a relationship actually does exist) is no greater than 20 per cent.
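Both error rates can be estimated by simulating many studies in which the truth is known. The sketch below makes several simplifying assumptions: two groups of 30 hypothetical participants, normally distributed scores with a known standard deviation of 1, and a z-test in place of the more usual t-test. The true difference of 0.5 used in the second scenario is arbitrary.

```python
import random
from statistics import NormalDist


def z_test_p_value(group_a, group_b, known_sd=1.0):
    """Two-sided p-value for a difference in means, assuming a known SD."""
    se = known_sd * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    diff = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    return 2 * (1 - NormalDist().cdf(abs(diff / se)))


def rejection_rate(true_difference, n_per_group=30, alpha=0.05,
                   n_studies=4000, seed=2):
    """Share of simulated studies that reject the null hypothesis."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        a = [rng.gauss(true_difference, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if z_test_p_value(a, b) <= alpha:
            rejections += 1
    return rejections / n_studies


# Scenario 1: no true relationship, so every rejection is a Type I error.
type_i_rate = rejection_rate(true_difference=0.0)

# Scenario 2: a true relationship exists, so every non-rejection is a
# Type II error.
type_ii_rate = 1 - rejection_rate(true_difference=0.5)

print(f"Type I error rate:  {type_i_rate:.3f}")
print(f"Type II error rate: {type_ii_rate:.3f}")
```

The Type I error rate should land near the chosen alpha level of .05. With only 30 participants per group and an effect of this assumed size, the Type II error rate should come out far above the conventional .20 target, which shows why sample size planning matters.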

Methodologists have identified a number of potential threats to the statistical conclusion validity of research (e.g., see Cook & Campbell, 1979; Shadish et al., 2002). For example, the validity of a statistical test can be undermined if the underlying assumptions of the test are violated. Many tests assume that interval or ratio level measures follow a normal distribution. Other tests assume that the observations comprising the sample are independent of one another (e.g., that the responses provided by one person in the sample are not in any way related to the responses provided by another person in the sample). Researchers may sometimes remedy such problems by selecting a statistical test with less stringent assumptions, such as a nonparametric test.

Other threats to statistical conclusion validity reflect more fundamental and sometimes perhaps even more intentional errors on the part of researchers. Concerns regarding these sorts of errors have received a great deal of attention in recent years and have led some researchers to call for major changes in the way psychological research is conducted (Lilienfeld, 2017; Lilienfeld & Waldman, 2017). One issue of concern has been the fact that many studies conducted in psychology have insufficient statistical power. Statistical power refers to the probability that a study will correctly reject the null hypothesis when it is in fact false. Traditionally, statistical power has primarily been a concern with respect to Type II errors (e.g., Cohen, 1988). However, methodologists have recently noted that, because studies with low power tend to be more likely to produce anomalous results, low power in a single study can sometimes also lead to Type I errors (e.g., Button & Munafò, 2017).
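As a rough illustration of how power depends on sample size, the sketch below uses a textbook normal approximation for a two-sided comparison of two equal-sized groups. The function names are hypothetical, and the standardised effect size of 0.5 is an assumed value chosen only for the example; dedicated power-analysis software based on the t distribution will give slightly different answers.

```python
from math import ceil, sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def approximate_power(effect_size_d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison (normal approximation)."""
    z_critical = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    noncentrality = effect_size_d * sqrt(n_per_group / 2)
    return (STANDARD_NORMAL.cdf(noncentrality - z_critical)
            + STANDARD_NORMAL.cdf(-noncentrality - z_critical))


def required_n_per_group(effect_size_d, alpha=0.05, beta=0.20):
    """Per-group sample size from the standard normal-approximation formula."""
    z_alpha = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STANDARD_NORMAL.inv_cdf(1 - beta)
    return ceil(2 * ((z_alpha + z_beta) / effect_size_d) ** 2)


# Assumed effect size of d = 0.5, used purely for illustration.
for n in (20, 40, 80):
    print(f"n = {n:>2} per group: power is about {approximate_power(0.5, n):.2f}")

print("n per group for 80% power:", required_n_per_group(0.5))
```

With alpha at .05 and beta at .20, this approximation suggests roughly 63 participants per group for an effect of 0.5 standard deviations, which shows why a study with 20 participants per group would be considered underpowered for an effect of that size.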

Another issue that has generated a great deal of interest is a set of practices known as QRPs (Questionable Research Practices; see John et al., 2012; Simmons et al., 2011). QRPs cover a wide range of data collection, analysis, and reporting practices, most of which are considered problematic because they can undermine the statistical conclusion validity of a study. Some of these practices involve incomplete reporting of results. For example, a researcher might conduct analyses on multiple dependent variables, but then only report the results for the dependent variables that produce significant effects. A researcher might also conduct multiple different types of analyses on a single dependent variable and only report the analysis that produces a significant effect. Similarly, a researcher might conduct a study involving multiple experimental conditions, but then only report the results for those conditions that produce significant differences. Alternatively, a researcher might conduct multiple studies and then only report those studies that produce a significant effect.

Other practices involve changes to the dataset itself or the manner in which it’s analysed. For example, a researcher might decide to drop participants from a dataset based on whether their deletion strengthens the key effects of interest in a study. Alternatively, a researcher might gradually add participants to an existing dataset and base their decision to stop adding participants solely on when the addition of participants produces a significant effect.

As these examples illustrate, many QRPs are practices that are intended to produce a significant effect, without any clear justification beyond the fact that they produce a desired outcome for the researchers, who may be motivated to identify a significant effect. As such, these practices can inflate Type I error rates. Indeed, although each practice can potentially undermine statistical conclusion validity on its own, the risk is even greater when several of these practices are performed in conjunction with one another (Simmons et al., 2011).
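The inflation of Type I error rates can be demonstrated with a small simulation of one QRP described above: adding participants in batches and stopping as soon as the result becomes significant. The sketch assumes there is no true effect, that scores are normally distributed with a known standard deviation of 1, and that the researcher checks the result after every five participants per group; all of these settings, and the function names, are illustrative assumptions.

```python
import random
from statistics import NormalDist, fmean

STANDARD_NORMAL = NormalDist()


def p_value(group_a, group_b):
    """Two-sided z-test p-value, assuming both groups have a known SD of 1."""
    standard_error = (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (fmean(group_a) - fmean(group_b)) / standard_error
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


def false_positive_rate(peek, n_studies=2_000, start_n=10, max_n=50, step=5, seed=2):
    """Share of null-effect studies declared significant, with or without peeking."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_studies):
        # Both groups come from the same population, so any "effect" is spurious.
        group_a = [rng.gauss(0, 1) for _ in range(max_n)]
        group_b = [rng.gauss(0, 1) for _ in range(max_n)]
        if peek:
            # QRP: test after every batch and stop as soon as p <= .05.
            checkpoints = range(start_n, max_n + 1, step)
        else:
            # Planned analysis: a single test at the pre-specified sample size.
            checkpoints = [max_n]
        if any(p_value(group_a[:n], group_b[:n]) <= 0.05 for n in checkpoints):
            false_positives += 1
    return false_positives / n_studies


planned_rate = false_positive_rate(peek=False)
peeking_rate = false_positive_rate(peek=True)
print(f"Single planned test: {planned_rate:.3f}")
print(f"Stop when p <= .05:  {peeking_rate:.3f}")
```

Because both runs use the same seed, they analyse identical simulated datasets, so any difference between the two rates is due entirely to the stopping rule rather than to the data.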

In short, numerous issues have been raised about how psychological scientists conduct aspects of research, and they’re often accompanied by guidelines for improving the statistical validity of research. However, other commentators have suggested there are problems inherent in NHST as a scientific tool and that no set of reforms to current practices will ultimately be successful in addressing these limitations. These commentators have argued alternative statistical approaches are required. For instance, some have proposed traditional statistical tests be abandoned or at least supplemented by the reporting of effect sizes and their corresponding confidence intervals (e.g., Cumming, 2014; Schmidt, 1996). Others have advocated use of Bayesian statistics (e.g., Wagenmakers et al., 2017). Space restrictions preclude a discussion of these alternatives and to date neither has gained widespread acceptance in psychology. However, psychologists continue to debate their potential advantages and disadvantages. This remains an important area of research for experimental and quantitative psychologists, as well as for individuals who interpret research and make policy recommendations that rest fundamentally on statistical inference. Practitioners in the field should remain familiar with developments in this evolving area as these decisions have important implications for the application of existing research.
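As a brief illustration of what reporting an effect size with a confidence interval might look like, the sketch below computes Cohen’s d for two groups together with an approximate 95 per cent confidence interval. It relies on a common large-sample approximation to the standard error of d, which is an assumption of this example rather than a method specified in the chapter, and the scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def cohens_d_with_ci(group_a, group_b, confidence=0.95):
    """Cohen's d with an approximate (large-sample) confidence interval."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    d = (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)
    # Common large-sample approximation to the standard error of d.
    standard_error = sqrt((n_a + n_b) / (n_a * n_b) + d ** 2 / (2 * (n_a + n_b)))
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return d, (d - z * standard_error, d + z * standard_error)


# Hypothetical aggression scores, invented for illustration only.
control = [3, 4, 4, 5, 5, 6, 6, 7]
provoked = [5, 6, 7, 7, 8, 8, 9, 10]

d, (lower, upper) = cohens_d_with_ci(provoked, control)
print(f"d = {d:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

Unlike a bare significant or nonsignificant verdict, the interval conveys both the estimated size of the effect and how precisely a sample of this size can pin it down.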

Shrout and Rodgers (2018, p. 504) conclude that psychological scientists must ‘engage in methodologically sound, ethically driven research in which probabilistic decisions are made explicitly and respected by an open research process’. This conclusion is supported by a very useful table (see their Table 1) that outlines a series of recommendations for research practices designed to speed knowledge construction in psychology and to reduce concerns about replication success in both exploratory (E) and confirmatory (C) studies. Their prognosis is that ‘The future of psychological science is bright’ (p. 506), which also allows for the psychology profession to prosper.


The previous sections have primarily explained social science methodology with the goal of maximising the reliability and validity of research findings. However, psychological scientists must balance their interests in obtaining reliable, valid results with several important ethical guidelines that establish how research should be conducted. Indeed, one can imagine scientific studies that could be highly reliable and valid, yet ethically egregious. For example, if a researcher were interested in the effects of socioeconomic status on aggressive behaviour, it would be methodologically sound to randomly assign children at birth to adopting parents who are poor or wealthy. However, such a study would obviously be considered ethically problematic. In the research context, the term ethics describes ‘good’ or ‘just’ treatment of the various groups and individuals involved in the research process. Ways of determining what is ‘good’ or ‘just’ are beyond the scope of this chapter. However, there are broad professional guidelines in Australia, such as the National Statement on Ethical Conduct in Human Research (2007), that must be considered as part of research planning and conduct.

Although researchers have debated and discussed many aspects of research ethics for decades, and specific guidelines and procedures vary as a function of locales and disciplines, three fundamental principles of research ethics tend to be emphasised in nearly all systems. These three core principles are a mandate for good practice: 1) giving participants information sufficient to allow informed choices about whether to participate, 2) minimising harm to participants, and 3) maintaining the privacy of participants’ responses.

Informed consent is the principle that participants should have a reasonable understanding of what they will be expected to do in a study, and the likely benefits/harms that may affect them. For example, participants should know if research may cause them harm (including physical, emotional, financial/professional, interpersonal, or other kinds of harm), and how much of their time is being requested as participants. Additionally, participants should be informed in advance about issues including whether their data will be confidential and/or anonymised (see below), or whether information about them will be obtained from sources other than themselves (e.g., from their academic transcripts). The point is that participants’ consent to participate in research is only meaningful if they know what they’re being asked to do.

One potential challenge to informed consent is the fact that some psychological questions are best pursued by partially or fully misleading participants about aspects of the research. For example, when researchers wish to covertly monitor participants’ aggressive behaviours, it may undermine the unobtrusive nature of this measurement if participants know they’re being watched. Similarly, some indirect measurements rely on participants not being aware of what is being measured, and in some cases the measure may be undermined if participants realise what is being measured. In other cases, participants are given false information about society, the actions of other participants in the experiment, the purpose of a study (often provided as a cover story in which researchers create a fictitious purpose of the research), or about the participant themselves (e.g., falsely informing participants that they have poor intelligence).

Deception is sometimes considered acceptable – provided it’s necessary to effectively study the question of interest – when a debriefing document or other method is used to inform participants at the end of a study. A debrief will often contain several elements, such as an explanation of what the truth is (e.g., what the real purpose of a study was), and why deception was considered necessary. Because this new information may alter participants’ willingness to have participated, in some contexts it may be appropriate to give participants a second opportunity to consent to the research study. For example, returning to the example of covert monitoring of aggressive behaviour, a researcher might reveal this covert monitoring at the study’s end, and offer to delete the recording if the participant does not consent to the researcher keeping this data. After all, they originally consented without knowing that such data was to be collected. The lack of initial disclosure may be necessary because the monitoring would not be covert if participants were warned about it when they first consented.

A second principle is the minimisation of harm. That is, participants’ exposure to loss, pain, and/or damage should be reduced as much as possible. Some studies may necessitate some degree of harm, such as when participants are given painful shocks to elicit anger (e.g., Berkowitz & LePage, 1967). Minimisation of harm would here involve careful scaling of the shock: it must be painful (enough to elicit anger), but no more painful than that (to minimise participants’ suffering). When possible, researchers should highlight ways in which participation can serve as a growth opportunity (such as a chance for participants to better understand themselves) rather than as something harmful. In addition to the ethical benefit, this also has a practical benefit: participants who see research and researchers in more positive terms are presumably more likely to understand the importance and value of research in psychological science.

Turning back to the recurring example, a researcher who wishes to induce fear in a participant should aim to have participants experience fear only for as long as is necessary to test a research question. Fear is usually considered a negative, uncomfortable emotion, so while researchers can ethically study fear they should also try to respect participants’ needs. For example, researchers may end the study with a positive emotion induction (Westermann et al., 1996) to reverse the harm. Also, consider deception in the context of minimising harm. We previously highlighted the potential issue of deception with informed consent, but there is also a risk of deception causing harm: participants may feel foolish for ‘falling for’ a deceptive manipulation. Thus, it may be advisable to remind participants that most experiments find only tiny suspicion rates: almost everybody ‘falls for it’, so participants should not feel embarrassed. It is possible that some deception could introduce other harms, such as leaving participants with inaccurate information about their having health problems. Sometimes, researchers may provide true information in the debriefing form, such as providing real statistics about social facts when false facts were provided in the experiment, or reminding undergraduate participants that the average undergraduate student has high intelligence when they were falsely told they lacked intelligence. The goal is to offset the harm incurred by the false information.

A third important principle is the privacy of participant data. Two aspects of participant data privacy are anonymity (i.e., the degree to which participants’ identifying information is disassociated from their study data), and confidentiality (i.e., whether researchers keep participants’ identifying information to themselves). Where possible, it’s usually advisable to maintain the anonymity of participants’ data by disassociating participants’ identifying information (e.g., name, email address) from their response data. This may have several advantages, such as protecting participants’ privacy rights. It also permits researchers to share data with others without having to compromise participants’ privacy. In some cases, it’s necessary for data to be non-anonymous at least temporarily, such as when a researcher tracks a sample of participants across multiple time points and wishes to correlate participants’ responses across time. In longitudinal research, this could mean that data is identifiable for decades! However, once data collection has been completed, it’s normally possible to anonymise data afterwards, stripping data of this identifying information.

Typically, even non-anonymous data should be confidential, meaning that a researcher would not share any identifier-data associations with others, even if the researcher can personally associate identifiers with data. In summary, the general principle of participant privacy is that privacy should be maintained as far as logistically possible. Tying this back to consent, in cases where confidentiality would not be possible to extend to participants, those participants should at least know what their expectations of privacy should be, preferably when they initially provide consent.
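The separation of identifying information from response data described above can be sketched in a few lines. This is only an illustrative outline under stated assumptions: the function name `pseudonymise`, the field names, and the example records are all hypothetical, and real projects should follow the data management requirements of their ethics approval.

```python
import secrets


def pseudonymise(records, id_fields=("name", "email")):
    """Split records into de-identified data and a separate linking key."""
    deidentified, linking_key = [], {}
    for record in records:
        code = secrets.token_hex(4)  # random code with no relation to the person
        while code in linking_key:  # regenerate in the unlikely event of a clash
            code = secrets.token_hex(4)
        linking_key[code] = {f: record[f] for f in id_fields if f in record}
        responses = {k: v for k, v in record.items() if k not in id_fields}
        deidentified.append({"participant_code": code, **responses})
    return deidentified, linking_key


# Hypothetical survey records, invented for illustration only.
records = [
    {"name": "Participant One", "email": "one@example.org", "wave": 1, "score": 12},
    {"name": "Participant Two", "email": "two@example.org", "wave": 1, "score": 17},
]

data, key = pseudonymise(records)
# `data` can be analysed or shared; `key` is stored separately and securely, and
# deleted once responses no longer need to be linked across time points.
print(data)
```

While the linking key exists, the data are confidential but not anonymous; deleting the key once data collection is complete removes the direct link between participants and their responses.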

There is an open access research ethics e-learning course developed by staff at the University of Southern Queensland that is available online and provides a detailed explanation of the ethics principles underpinning research and advice on completing a human research ethics application.

The objectives of the course are:

  • an introduction to research ethics
  • research methodology risks and benefits
  • recruitment and data collection
  • collection, use, and management of data and information
  • research merit, integrity, and monitoring requirements
  • communication of research findings.


Psychological research spans many diverse topics, specialty areas, and interests, such as organisational, clinical, and cognitive psychology. However, the fundamental, conceptual steps required to create high-quality research are in many ways similar. This chapter has focused on delineating research questions, selecting dependent and independent variables, issues involving the setting and population, data analysis, and ethics. It is important to remember that entire chapters and articles have been devoted to in-depth explorations of each of these individual topics (and others), and we have provided many references to example articles and chapters throughout. Importantly, we hope this chapter highlights often under-recognised skills that are developed through training in psychological science. Undergraduate programs in psychological science should prepare students to effectively evaluate research methodological issues including sample size, risks associated with third variables, whether questionable research practices were likely to have been present, whether rigorous ethical safeguards were in place, whether appropriate statistical tests were used, and whether researcher conclusions are consistent with the results from statistical tests based on the methodology employed. These are all skills that are valued beyond academia. In Australia, for example, psychological research skills can lead to evaluating research for policy development, interpreting survey data gathered in an applied setting, and providing evidence supporting one’s own professional practice. In general, professionals who display thoughtful and critical consideration of the quality of evidence are highly sought after across a variety of fields and will often exhibit the hallmarks of psychological research training.

This chapter has been adapted by Tony Machin, School of Psychology and Counselling, University of Southern Queensland and Erich Fein, School of Psychology and Counselling, University of Southern Queensland. It has been adapted from Vaughan-Johnston, T. I., Fabrigar, L. R., & Lawrence, K. (2019). Research methods in the psychological sciences. In M. E. Norris (Ed.), The Canadian Handbook for Careers in Psychological Science. Kingston, ON: eCampus Ontario. Licensed under CC BY NC 4.0. Retrieved from https://ecampusontario.pressbooks.pub/psychologycareers/chapter/researchmethods/


Archer, J., & Coyne, S. M. (2005). An integrated review of indirect, relational, and social aggression. Personality and Social Psychology Review, 9(3), 212–230. https://doi.org/10.1207%2Fs15327957pspr0903_2

Asendorpf, J. B., Banse, R., & Mucke, D. (2002). Double dissociation between explicit and implicit personality self-concept: The case of shy behavior. Journal of Personality and Social Psychology, 83(2), 380–393. https://psycnet.apa.org/doi/10.1037/0022-3514.83.2.380

Baker, R. C., & Guttfreund, D. O. (1993). The effects of written autobiographical recollection induction procedures on mood. Journal of Clinical Psychology, 49(4), 563–568. https://doi.org/10.1002/1097-4679(199307)49:4<563::AID-JCLP2270490414>3.0.CO;2-W

Banerjee, A., & Chaudhury, S. (2010). Statistics without tears: Populations and samples. Industrial Psychiatry Journal, 19(1), 60–65. https://doi.org/10.4103/0972-6748.77642

Baron, R. A., & Bell, P. A. (1975). Aggression and heat: Mediating effects of prior provocation and exposure to an aggressive model. Journal of Personality and Social Psychology, 31(5), 825–832. https://doi.org/10.1037/h0076647

Beck, A. T., Steer, R. A., & Brown, G. K. (1996). Beck Depression Inventory-II. https://www.apa.org/pi/about/publications/caregivers/practice-settings/assessment/tools/beck-depression

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., … & Cesarini, D. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6–10. https://doi.org/10.1038/s41562-017-0189-z

Berkowitz, L., & Donnerstein, E. (1982). External validity is more than skin deep: Some answers to criticisms of laboratory experiments. American Psychologist, 37(3), 245–257. https://doi.org/10.1037/0003-066X.37.3.245

Berkowitz, L., & LePage, A. (1967). Weapons as aggression-eliciting stimuli. Journal of Personality and Social Psychology, 7(2), 202–207. https://doi.org/10.1037/h0025008

Button, K. S., & Munafò, M. R. (2017). Powering reproducible research. In S. O. Lilienfeld and I. D. Waldman (Eds.), Psychological science under scrutiny: Recent challenges and proposed solutions (pp. 22–33). John Wiley & Sons. https://doi.org/10.1002/9781119095910

Cacioppo, J. T., Berntson, G. G., Lorig, T. S., Norris, C. J., Rickett, E., & Nusbaum, H. (2003). Just because you’re imaging the brain doesn’t mean you can stop using your head: A primer and set of first principles. Journal of Personality and Social Psychology, 85(4), 650–661. https://doi.org/10.1037/0022-3514.85.4.650

Cacioppo, J. T., & Tassinary, L. G. (1990). Inferring psychological significance from physiological signals. American Psychologist, 45(1), 16–28. https://doi.org/10.1037//0003-066x.45.1.16

Cialdini, R. B. (2009). We have to break up. Perspectives on Psychological Science, 4(1), 5–6. https://doi.org/10.1111%2Fj.1745-6924.2009.01091.x

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Rand McNally College Publishing Company.

Costa, P. T., Jr., & McCrae, R. R. (1991). NEO five-factor inventory (NEO-FFI), Form S (Adult). https://www.parinc.com/Products/Pkey/274

Costa, P. T., & McCrae, R. R. (1993). Psychological research in the Baltimore longitudinal study of aging. Zeitschrift fur Gerontologie, 26(3), 138–141. https://europepmc.org/article/med/8337906

Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7–29. https://doi.org/10.1177%2F0956797613504966

De Houwer, J. (2006). What are implicit measures and why are we using them? In R. W. Wiers & A. W. Stacy (Eds.), The handbook of implicit cognition and addiction (pp. 11–28). Sage. https://us.sagepub.com/en-us/nam/handbook-of-implicit-cognition-and-addiction/book227075

Diener Education Fund. (2021). Noba Project [Homepage]. https://nobaproject.com

Fein, E. C., Skinner, N., & Machin, M. A. (2017). Work intensification, work–life interference, stress, and well-being in Australian workers. International Studies of Management & Organization, 47(4), 360–371. https://doi.org/10.1080/00208825.2017.1382271

Fiedler, K., Kutzner, F., & Krueger, J. I. (2012). The long way from α-error control to validity proper: Problems with a short-sighted false-positive debate. Perspectives on Psychological Science, 7(6), 661–669. https://www.jstor.org/stable/44282621

Field, A. (2019). Discovering Statistics [Homepage]. https://www.discoveringstatistics.com

Friese, M., Hofmann, W., & Schmitt, M. (2008). When and why do implicit measures predict behaviour? Empirical evidence for the moderating role of opportunity, motivation, and process reliance. European Review of Social Psychology, 19(1), 285–338. https://doi.org/10.1080/10463280802556958

Gawronski, B., & De Houwer, J. (2014). Implicit measures in social and personality psychology. In H. T. Reis & C. M. Judd (Eds.), Handbook of research methods in social and personality psychology (2nd ed.). Cambridge University Press.

Goodman, J. K., Cryder, C. E., & Cheema, A. (2013). Data collection in a flat world: The strengths and weaknesses of Mechanical Turk samples. Journal of Behavioral Decision Making, 26(3), 213–224. https://doi.org/10.1002/bdm.1753

Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. (1998). Measuring individual differences in implicit cognition: The implicit association test. Journal of Personality and Social Psychology, 74(6), 1464–1480. https://doi.org/10.1037/0022-3514.74.6.1464

Harpe, S. E. (2015). How to analyze Likert and other rating scale data. Currents in Pharmacy Teaching and Learning, 7(6), 836–850. https://doi.org/10.1016/j.cptl.2015.08.001

Hauser, D. J., & Schwarz, N. (2015). It’s a trap! Instructional manipulation checks prompt systematic thinking on “tricky” tasks. Sage Open, 5(2). https://doi.org/10.1177%2F2158244015584617

Hauser, D. J., & Schwarz, N. (2016). Attentive Turkers: MTurk participants perform better on online attention checks than do subject pool participants. Behavior Research Methods, 48(1), 400–407. https://doi.org/10.3758/s13428-015-0578-z

Henson, R. K. (2001). Understanding internal consistency reliability estimates: A conceptual primer on coefficient alpha. Measurement and Evaluation in Counseling and Development, 34(3), 177–189. https://doi.org/10.1080/07481756.2002.12069034

Holden, R. R., & Jackson, D. N. (1979). Item subtlety and face validity in personality assessment. Journal of Consulting and Clinical Psychology, 47(3), 459–468. https://doi.org/10.1037/0022-006X.47.3.459

Ilgen, D. R., & Favero, J. L. (1985). Limits in generalization from psychological research to performance appraisal processes. Academy of Management Review, 10(2), 311–321. https://doi.org/10.2307/257972

Iyer, R. (2019). YourMorals.org [Homepage]. https://www.yourmorals.org/index.php

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524–532. https://doi.org/10.1177%2F0956797611430953

John, O. P., & Benet-Martinez, V. (2014). Measurement: Reliability, construct validation, and scale construction. In H. T. Reis & C. M. Judd (Eds.), Handbook of research methods in social and personality psychology (2nd ed., pp. 339–369). Cambridge University Press.

Klauer, K. C., & Teige-Mocigemba, S. (2007). Controllability and resource dependence in automatic evaluation. Journal of Experimental Social Psychology, 43(4), 648–655. https://doi.org/10.1016/j.jesp.2006.06.003

Lakens, D. (2019). Improving your statistical inferences [Online Course]. https://www.coursera.org/learn/statistical-inferences

Latané, B., & Darley, J. M. (1970). The unresponsive bystander: Why doesn’t he help? Century Psychology Series. Appleton-Century Crofts.

LeBel, E. P., & Gawronski, B. (2009). How to find what’s in a name: Scrutinizing the optimality of five scoring algorithms for the name‐letter task. European Journal of Personality, 23(2), 85–106. https://doi.org/10.1002/per.705

Lieberman, J. D., Solomon, S., Greenberg, J., & McGregor, H. A. (1999). A hot new way to measure aggression: Hot sauce allocation. Aggressive Behavior, 25(5), 331–348. https://doi.org/10.1002/(SICI)1098-2337(1999)25:5<331::AID-AB2>3.0.CO;2

Lilienfeld, S. O. (2017). Psychology’s replication crisis and the grant culture: Righting the ship. Perspectives on Psychological Science, 12(4), 660–664. https://doi.org/10.1177%2F1745691616687745

Lilienfeld, S. O., & Waldman, I. D. (Eds.). (2017). Psychological science under scrutiny: Recent challenges and proposed solutions. John Wiley & Sons.

Millsap, R. E., & Meredith, W. (2007). Factorial invariance: Historical perspectives and new problems. In R. Cudeck and R. C. MacCallum (Eds.), Factor analysis at 100: historical developments and future directions (pp. 131–152). Lawrence Erlbaum Associates. https://doi.org/10.1111/j.1751-5823.2008.00054_24.x

Moses, S. N., Houck, J. M., Martin, T., Hanlon, F. M., Ryan, J. D., Thoma, R. J., … & Tesche, D. (2007). Dynamic neural activity recorded from human amygdala during fear conditioning using magnetoencephalography. Brain Research Bulletin, 71(5), 452–460. https://doi.org/10.1016/j.brainresbull.2006.08.016

Nisbett, R. E., & Cohen, D. (1996). Culture of honor: The psychology of violence in the South. Westview Press.

Nisbett, R. E., & Wilson, T. D. (1977). Telling more than we can know: Verbal reports on mental processes. Psychological Review, 84(3), 231–259. https://doi.org/10.1037/0033-295X.84.3.231

Nuttin, M. J., Jr. (1985). Narcissism beyond Gestalt and awareness: The name letter effect. European Journal of Social Psychology, 15(3), 353–361. https://doi.org/10.1002/ejsp.2420150309

Orne, M. T. (1962). On the social psychology of the psychological experiment: With particular reference to demand characteristics and their implications. American Psychologist, 17(11), 776–783. https://doi.org/10.1037/h0043424

Orpinas, P., & Frankowski, R. (2001). The Aggression Scale: A self-report measure of aggressive behavior for young adolescents. The Journal of Early Adolescence, 21(1), 50–67. https://doi.org/10.1177/0272431601021001003

Page, M. M., & Scheidt, R. J. (1971). The elusive weapons effect: Demand awareness, evaluation apprehension, and slightly sophisticated subjects. Journal of Personality and Social Psychology, 20(3), 304–318. https://doi.org/10.1037/h0031806

Paulhus, D. (1991). Measurement and control of response bias. In J. Robinson, P. R. Shaver, & L. S. Wrightsman (Eds.), Measures of personality and social psychological attitudes, Vol. 1. (pp. 17–59). Academic Press.

Payne, B. K., Cheng, C. M., Govorun, O., & Stewart, B. D. (2005). An inkblot for attitudes: Affect misattribution as implicit measurement. Journal of Personality and Social Psychology, 89(3), 277–293. https://doi.org/10.1037/0022-3514.89.3.277

Petty, R. E., Fazio, R. H., & Briñol, P. (2012). Attitudes: Insights from the new implicit measures. Psychology Press.

Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1(2), 115–129. https://doi.org/10.1037/1082-989X.1.2.115

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Wadsworth Cengage Learning.

Shrout, P. E., & Rodgers, J. L. (2018). Psychology, science, and knowledge construction: Broadening perspectives from the replication crisis. Annual Review of Psychology, 69, 487–510. https://doi.org/10.1146/annurev-psych-122216-011845

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177%2F0956797611417632

Simms, L. J. (2008). Classical and modern methods of psychological scale construction. Social and Personality Psychology Compass, 2(1), 414–433. https://doi.org/10.1111/j.1751-9004.2007.00044.x

Strack, F., Martin, L. L., & Schwarz, N. (1988). Priming and communication: Social determinants of information use in judgments of life satisfaction. European Journal of Social Psychology, 18(5), 429–442. https://doi.org/10.1002/ejsp.2420180505

Sweetland, A. (1972). Comparing random with non-random sampling methods. Rand Corp.

Vaughan-Johnston, T. I., Fabrigar, L. R., & Lawrence, K. (2019). Research methods in the psychological sciences. In M. E. Norris (Ed.), The Canadian handbook for careers in psychological science. eCampusOntario. https://ecampusontario.pressbooks.pub/psychologycareers/chapter/researchmethods/

Wagenmakers, E. J., Verhagen, J., Ly, A., Matzke, D., Steingroever, H., Rouder, J. N., & Morey, R. D. (2017). The need for Bayesian hypothesis testing in psychological science. In S. O. Lilienfeld and I. D. Waldman (Eds.), Psychological science under scrutiny: Recent challenges and proposed solutions (pp. 123–138). John Wiley & Sons.

Westermann, R., Spies, K., Stahl, G., & Hesse, F. W. (1996). Relative effectiveness and validity of mood induction procedures: A meta‐analysis. European Journal of Social Psychology, 26(4), 557–580. https://doi.org/10.1002/(SICI)1099-0992(199607)26:4%3C557::AID-EJSP769%3E3.0.CO;2-4

Widaman, K. F., & Grimm, K. J. (2014). Advanced psychometrics: Confirmatory factor analysis, item response theory, and the study of measurement invariance. In H. T. Reis, & C. M. Judd (Eds.), Handbook of research methods in social and personality psychology (2nd ed.). Cambridge University Press.

Please reference this chapter as:

Machin, T., & Fein, E. (2022). Research methods in the psychological sciences. In T. Machin, T. Machin, C. Jeffries & N. Hoare (Eds.), The Australian handbook for careers in psychological science. University of Southern Queensland. https://usq.pressbooks.pub/psychologycareers/chapter/research-methods-in-the-psychological-sciences/.


The Australian Handbook for Careers in Psychological Science Copyright © 2022 by Tony Machin and Erich Fein is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.
