Beware this scary thing Excel can do to your data!

[1] Heh, heh. I’ve always wanted to write a clickbait title. That was fun. Excel. Love it. Hate it. Most ecologists I know use it at least a little, including me. Now I know there are plenty of people who abhor the idea of using Excel for science. But Excel is a tool, just like any other. If you were to use a screwdriver to try to hammer a nail, then my dad – who taught me to respect tools and use them for their proper purposes – would be horrified. And if you were to use Excel to manage and analyze your dataset of tens of thousands of data points, I would be horrified. You could perhaps eventually manage to struggle through. But there are better tools.

Excel is probably best for things like manual data entry, especially if you constrain the format of your cells to only allow valid entries and save in a non-proprietary format. Excel can be fine for quick exploratory analysis of small datasets: means, standard deviations, sums… that sort of thing. It’s fast at making interpretable graphs. It’s a good exploratory viewer for datasets that you get from others – most of the time.

I say most of the time because Excel tries to think for you a little too much. When you open a file that is in CSV format (the best format to store your data in most of the time), Excel converts all the values to its best guess of the format you want to view them in. This is normally fine. But when it comes to dates and times, things can go wrong. Terribly wrong. For one thing, people store their dates and times in different formats according to culture. In the US, we (weirdly) prefer the month-day-year format. In Europe (and other places), the preferred format is the more logical day-month-year. So the value 03-04-2016 is ambiguous; it could be March 4, 2016 or April 3, 2016, depending on your cultural bias. (Times can be 12-hour or 24-hour, causing other problems.)

Each version of Excel has a setting as to which way to read dates, which it originally gets from your operating system, though you can switch it manually if you want to. So if you’ve got your computer set up with US defaults, Excel will display (and save) dates in month-day-year format. If you’ve got it set up with European defaults, Excel will use day-month-year format. This can be a problem if you’re collaborating with someone who uses a different format system than you and you’re using Excel spreadsheets to share data. (Been there. Done that.)

But it’s worse than that. Excel can actually change the format of the dates in a non-Excel file (e.g. CSV file) without your permission. Don’t believe me? Try it yourself:

  1. Start with a CSV file that contains a date in a year-month-day format – the type of format we scientists prefer because it’s unambiguous across cultures. (And, if you’re a data wrangler, because any type of sorting – numeric, date-based, alphabetical – puts the dates in proper order.) You can use one of my files if you like. Make sure it is saved to your computer. excel-before
  2. Open your file in Excel. Excel will automatically reformat your dates into something it prefers.
  3. Click “Save”.
  4. Excel will prompt you with an “Are you sure?” dialog. After all, you may lose your amazing formatting, graphs, and the like if you try to save as a CSV file instead of an Excel file.
  5. Click “No” because you don’t want to save (and overwrite) your CSV file.
  6. Excel will redirect you to the “Save As” dialog. But by now, the damage has been done! Click cancel or back (depending on your version of Excel) so that you don’t save anything.
  7. Close your file. When Excel asks if you want to save it, say no.
  8. Open your CSV file with a text editor and cringe as you see that Excel has changed the format of all your dates to month-day-year (or day-month-year) without your permission. excel-after

How do you avoid this? Some suggestions:

  1. Don’t use Excel’s “Save”. Only ever use “Save As”. Of course, this only works if you’re not an obsessive saver like me who clicks Ctrl-S every few minutes without even thinking about it. [2] Once upon a time, before a developer had ever dreamed up auto-save, it was quite easy to lose huge amounts of work because you forgot to save early and often. One day an eight-year-old girl was traumatized when her home lost power, deleting a story she had spent all afternoon writing. Never again, she vowed.
  2. Store dates in a format Excel doesn’t recognize. You can use an eight-digit string, such as YYYYMMDD. So that March 4, 2016 becomes 20160304. That’s not super easy to read as a human and it can get confused with integers, so I prefer the underscored version: YYYY_MM_DD. A date might look like 2016_03_04, but Excel has no idea it’s a date and so won’t try to auto-format it. The only downside is if you need interoperability along a data management pipeline. Packages for various programming languages will almost invariably recognize YYYY-MM-DD as a date, but you’ll have to write conversion routines if you want to use underscores in your dates instead of hyphens.
  3. Never open an original copy of a file in Excel. I do this frequently, too. Simply make a copy of a CSV file and open the copy in Excel. Store your originals somewhere where you’re not tempted to open them in Excel by mistake.
  4. Explicitly tell Excel not to convert the formats on your dates and times. This is especially useful for large files. You can import your CSV file rather than opening it. If you do this, you can manually tell Excel how to read each column. So for your dates and times, tell Excel that it’s a text field instead of a date or time one. The problem with this method is that it’s tedious and there’s no way to tell Excel to remember what you did for next time or set defaults. So you’ll have to go through the tedium every time you open a file.
  5. [UPDATED] Use separate columns for month, day, and year. [Thanks to Emily McKinnon and Kara Woo (in the comments), and Kristina Riemer (on Twitter) for pointing out this simple and useful workaround!]

Or you could just not use Excel. But since you probably will, just be aware that Excel can change the format of the data in your files (without you knowing) and take precautions as needed.

Permanent link to this article: http://ecologybits.com/index.php/2016/07/06/beware-this-scary-thing-excel-can-do-to-your-data/


1 ping

Skip to comment form

  1. Steve Heard

    Yes, I was able to reproduce this (in case anyone else was skeptical, as I was).

    At the bottom of this, I think, is an apparently minor but actually major semantic distinction. You say in your post to “open your file in Excel”. Excel does not an cannot “open” a CSV file. What it CAN do is import a CSV file, and save a file as CSV. So this explains 1/2 of the behaviour – of course Excel converts things into various cell formats; that is what it’s FOR! If you try to “open” a CSV file instead of importing it, what you’re really doing is asking Excel to import the CSV file making its own default guesses about your data. If you don’t want to use the defaults, then of course don’t use the (poorly named) command that uses those defaults. I had never really thought this out all the way before, but it makes complete sense.

    But the other 1/2 of the behaviour – where manipulating the file but not saving it changes the saved copy – that should never happen! That seems like a significant bug in Excel, and one not easily excused by semantic quibbles like mine above. Once one knows, of course, there are many workarounds (including using Excel rather than CSV as the data file of record, which is what I do; I only export CSVs for analysis as I need them). But it should never be allowed to happen!

    Thanks for posting this – the Twitter conversation had left me so very curious!

    1. Margaret Kosmala

      Okay about “opening” a CSV file. But, like many people, I double click CSV files to get them into Excel, which is typically considered an “opening” function. I can also use the “open” choice from the Excel menu to get the CSV file into Excel, in which I click the “open” button once I choose a file. So Excel itself supports the named concept of “opening” a CSV file, even if it’s technically importing it.

  2. Pavel Dodonov

    Thanks for this post! Even though I gave a coupel of Excel course I was never aware that this really, really scary thing can happen.
    Then again, I don’t often work with dates. And then again, I don’t use Excel anymore, but LibreOffice Calc – which, for example, usually asks what are the column and decimal number delimiters. I just performed the test on the example, file, and LibreOffice didn’t do anything to this data (i.e. no date format changes) even when saving.
    So a possible suggestion is to switch to Calc, which has most of the facilities Excel has but also doesn’t pretend to know more than the user. 🙂

    1. Margaret Kosmala

      Thanks for checking LibreOffice Calc. I use Calc sometimes, but find some its functions annoyingly difficult to use — buried in sub-menus instead of in a top menu. I also use Google Sheets sometimes when I want to do something quick and I know I won’t need the results offline.

  3. Emily McKinnon

    Another Excel quirk to report – apparently the default base date or something weird is set DIFFERENTLY depending on whether you have a Mac or Windows version of Excel. This means your dates can be subtly shifted (like 1 day off) if you create a file on a Windows version of Excel and open it on a Mac.
    Thankfully I realized this at some point during my PhD and now always save dates as separate columns for Month Day Year and if I need something in one column I calculate an Ordinal Date (i.e. Jan 1 = 1). It’s unfortunate that Excel manages at once to be so very useful and so maddening at the same time!

    1. Margaret Kosmala

      Oh golly. I didn’t know about that! Great suggestion for saving date parts in separate columns. I’ve added it to the main text.

    2. Jeremy Fox

      “apparently the default base date or something weird is set DIFFERENTLY depending on whether you have a Mac or Windows version of Excel.”

      Yes, this is true. It played hob with a working group I was once involved in.

  4. Kara Woo

    Another workaround that I like is saving year, month, and day as separate columns. Excel will assume the columns are numbers, not dates, and it’s easy to paste the columns back together later if you need the whole dates in your analysis.

    1. Margaret Kosmala

      Great suggestion! I’ve added it to the main text.

  5. Juan

    Thank you very much for the tip! I tried to open the CSV file in Excel 2010 and it didn’t change the format of the dates. I have the Spanish version of the software, with the format of dates day-month-year.

  1. Friday links: Most. Awkward. Department. Meeting. Ever. And more. | Dynamic Ecology

    […] Margaret Kosmala had a post warning about ways Excel can mess up data (and how to avoid the problem). (Jeremy adds: Hey, that was gonna be my link! Stop, link thief! Or I’ll say “stop” again!) […]

Leave a Reply

Your email address will not be published. Required fields are marked *