When you use someone else’s data with little to no documentation and no one around to answer questions, you quickly learn to rely on exploration, experience, and clear logic. This project was a perfect example: the raw transaction logs contained non-transaction rows, ambiguous return and cancellation flags, and inconsistent formatting. Tackling these challenges sharpened my ability to make evidence-based decisions, apply systematic cleaning steps, and document every assumption for future users.
This project analyzes the UCI Online Retail II dataset, which contains over a million transaction records from a UK-based e-commerce store spanning 2009–2011. Each record is one invoice line and includes an invoice number, stock code, product description, quantity, unit price, customer ID, country, and transaction date. Beyond straightforward sales, the data contains canceled and returned orders, sample fees, and non-product line items.
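To make the row types above concrete, here is a minimal sketch of how a record might be labeled before analysis. It is an illustration under stated assumptions, not the cleaning logic used in this project: the `Row` class, the `classify` function, the contents of `NON_PRODUCT_CODES`, and the sample rows are all hypothetical. The one rule taken from the dataset's published description is that an invoice number starting with "C" marks a cancellation.

```python
from dataclasses import dataclass

# Assumption: an illustrative set of non-product stock codes. The real list
# should come from inspecting the data, not from this sketch.
NON_PRODUCT_CODES = {"POST", "DOT", "M", "BANK CHARGES"}


@dataclass
class Row:
    invoice: str
    stock_code: str
    quantity: int
    unit_price: float


def classify(row: Row) -> str:
    """Label a row as 'non_product', 'cancellation', 'return_like', or 'sale'."""
    # Non-product lines (postage, manual adjustments, fees) are checked first,
    # so a cancelled postage charge is still reported as non-product.
    if row.stock_code.strip().upper() in NON_PRODUCT_CODES:
        return "non_product"
    # Per the dataset description, a leading "C" on the invoice number
    # indicates a cancellation.
    if row.invoice.strip().upper().startswith("C"):
        return "cancellation"
    # Assumption: a negative quantity without a "C" invoice is ambiguous,
    # so it gets its own label for manual review instead of being dropped.
    if row.quantity < 0:
        return "return_like"
    return "sale"


# Made-up sample rows, for illustration only.
rows = [
    Row("489434", "85048", 12, 6.95),
    Row("C489449", "22087", -12, 2.95),
    Row("489600", "POST", 1, 18.00),
    Row("489700", "21733", -3, 0.0),
]
labels = [classify(r) for r in rows]
```

Keeping ambiguous rows under a separate label, instead of deleting them, leaves the assumption visible and easy to revisit later.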
Data decisions:
Two variables were calculated: