One of the more challenging aspects of data analysis is answering questions when the available data doesn't perfectly align with what you'd ideally want. I chose to focus on Airbnb bookings because the data presented an interesting analytical challenge: distinguishing between listings that are actively booked and those that are simply unavailable due to owner preferences or part-time rental status. This required thoughtful assumptions, proxy measures, and a careful methodological approach.
Many Airbnb listings rely on recent reviews to attract bookings, but it’s unclear whether newer reviews actually drive more reservations. This project estimates the effect of recent reviews on booking outcomes using observational data from Airbnb.
I used a public Airbnb dataset that includes listing-level variables such as:
I also used calendar-level variables such as:
Note that the data is a single snapshot taken in early March.
Data decisions:
Six variables were calculated:
To control for confounding, I used propensity score matching with the following covariates:
I estimated the average treatment effect on bookings using matched pairs and checked balance using standardized mean differences (SMDs).
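To make the matching step concrete, here is a minimal sketch of greedy 1:1 nearest-neighbor matching on propensity scores. It assumes the scores have already been estimated (for example, from a logistic regression of the recent-review indicator on the covariates). The function name `greedy_match`, the caliper of 0.05, matching without replacement, and the toy scores are all illustrative assumptions, not details taken from this project.

```python
def greedy_match(treated, control, caliper=0.05):
    """Pair each treated unit with its closest unused control unit.

    treated, control: dicts mapping a listing id to its propensity score.
    caliper: maximum allowed score gap for a pair (an assumed value).
    Returns a list of (treated_id, control_id) tuples.
    """
    available = dict(control)  # controls not yet used (no replacement)
    pairs = []
    # Visit treated units from highest to lowest score, since high-score
    # treated units usually have the fewest comparable controls.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: -kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


# Toy propensity scores, made up for illustration only.
treated = {"t1": 0.80, "t2": 0.55, "t3": 0.30}
control = {"c1": 0.78, "c2": 0.52, "c3": 0.10, "c4": 0.33}
pairs = greedy_match(treated, control)
print(pairs)
```

In this toy example the control listing `c3` goes unmatched because no treated listing has a score within the caliper of it, which is how matching discards poorly comparable controls.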
Before matching, there were substantial differences between listings that had recently received a review and those that had not. For example, listings with recent reviews averaged nearly twice as many prior booked days (8.73 vs. 4.66), about three times as many reviews (85.51 vs. 28.79), and a lower price ($266.97 vs. $475.71):
| Variable | No Recent Review | Recent Review |
|---|---|---|
| Prior Booked Days | 4.66 | 8.73 |
| Total Availability (April 2025) | 27.10 | 27.99 |
| Price | $475.71 | $266.97 |
| Review Score (Rating) | 4.81 | 4.84 |
| Number of Reviews | 28.79 | 85.51 |
After propensity score matching, these covariates were better balanced across groups. The absolute standardized mean difference (SMD) for every covariate fell below the common threshold of 0.1, suggesting a good match.
| Variable | No Recent Review | Recent Review |
|---|---|---|
| Prior Booked Days | 9.02 | 8.73 |
| Total Availability (April 2025) | 27.65 | 27.99 |
| Price | $267.90 | $266.97 |
| Review Score (Rating) | 4.85 | 4.84 |
| Number of Reviews | 78.59 | 85.51 |
I used a logistic regression to estimate the effect of recent reviews on the probability of being booked in April 2025 (`booked_yes`).
In the matched sample, listings with a recent review had an estimated 15.8 percentage point higher probability of being booked in April 2025 than similar listings without a recent review.
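A logistic regression reports coefficients on the log-odds scale, so a percentage-point effect has to be derived from predicted probabilities. The sketch below shows that conversion under a simplifying assumption: a model fitted on the matched sample with only an intercept and the treatment indicator. The function names and the coefficient values are hypothetical and chosen for illustration. They are not the fitted values from this project.

```python
import math


def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def percentage_point_effect(intercept, treatment_coef):
    """Gap in predicted booking probability, in percentage points.

    Assumes a logit model with only an intercept and a 0/1 treatment term.
    """
    p_control = sigmoid(intercept)
    p_treated = sigmoid(intercept + treatment_coef)
    return 100.0 * (p_treated - p_control)


# Illustrative coefficients only, NOT estimates from the Airbnb data.
effect = percentage_point_effect(intercept=0.0, treatment_coef=0.5)
print(round(effect, 2))
```

With additional covariates in the model, the same idea applies, but the gap would be averaged over the listings in the sample rather than read off a single pair of probabilities.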
Balance diagnostics for the matched sample:

| Covariate | SMD | Treated Mean | Control Mean | Pooled Std Dev |
|---|---|---|---|---|
| Prior Booked Days | -0.050 | 8.73 | 9.02 | 5.78 |
| Total Availability | 0.055 | 27.99 | 27.65 | 6.28 |
| Price | -0.003 | 266.97 | 267.90 | 292.92 |
| Review Score (Rating) | -0.030 | 4.84 | 4.85 | 0.27 |
| Number of Reviews | 0.060 | 85.51 | 78.59 | 115.18 |
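The SMD column can be reproduced approximately from the other three columns, since each SMD is the difference in group means divided by the pooled standard deviation. The sketch below applies that formula to the rounded values in the table. The helper name `smd` is hypothetical, and because the inputs are rounded to two decimals, the results can differ slightly from the reported SMDs.

```python
def smd(treated_mean, control_mean, pooled_sd):
    """Standardized mean difference: mean gap in pooled-SD units."""
    return (treated_mean - control_mean) / pooled_sd


# (treated mean, control mean, pooled SD), copied from the table above.
rows = {
    "Prior Booked Days": (8.73, 9.02, 5.78),
    "Total Availability": (27.99, 27.65, 6.28),
    "Price": (266.97, 267.90, 292.92),
    "Review Score (Rating)": (4.84, 4.85, 0.27),
    "Number of Reviews": (85.51, 78.59, 115.18),
}

smds = {name: smd(*values) for name, values in rows.items()}

# Balance check against the conventional |SMD| < 0.1 threshold.
balanced = all(abs(value) < 0.1 for value in smds.values())

for name, value in smds.items():
    print(f"{name}: {value:+.3f}")
print("All covariates balanced:", balanced)
```

The check uses the absolute value because the sign of an SMD only indicates which group has the larger mean.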