How to parse a tricky date string

by on Jul.28, 2010 , under JMP & JSL

Recently on LinkedIn, Asit Rairkar asked about how to parse a tricky date string into a numeric date column in JMP. He has date/time data as character data, but unfortunately it’s not formatted in one of the many date formats that JMP supports directly. He wrote:

I have found that informat does not work with all sorts of texted dates. e.g. I have had some tables with date codes as YYMMDDHHMMSS and I have had to manually seperate the 2 digits each and combine them again into YY/mm/dd hh:mm:ss format to feed in. Not sure if there is a better way to deal with this type of input.

Writing JSL to handle this situation is conceptually simple, but getting the JSL just right is tricky. The format JMP supports that comes closest to this format is “yyyy-mm-ddThh:mm:ss,” so that’s what I decided to work with, but overcoming the little differences between JMP’s format and Asit’s string takes some work. We have to doctor up the input string so that it matches the format JMP can handle. We’ll need to:

  1. change the two-digit years to four-digit
  2. add the “-” delimiter between year, month, and day
  3. add the “T” delimiter between date and time
  4. add the “:” delimiters between hours, minutes, and seconds

Once we’ve gotten the input string to match a JMP-supported format, we put that inside Num() as the Formula() for a new numeric column. We also have to specify one of the date/time formats for the column, and to keep things simple, I’ll just use the “yyyy-mm-ddThh:mm:ss” format again, but you could choose any format.

Here’s how it works

Suppose your data table looks like this:

You can generate my example by running this script:

dt = New Table( "tricky times",
  Add Rows( 4 ),
  New Column( "event",
    Character,
    Nominal,
    Set Values(
      {"Erin starts researching Asit's question", "Erin starts this blog post",
      "Erin wants a beer", "This blog post better be done"}
    )
  ),
  New Column( "when",
    Character,
    Nominal,
    Set Values( {"100726174324", "100728154315", "1007281730", "110728154315"} )
  )
)

So our first programming task is to convert a string like "100726174324" to a string like "2010-07-26T17:43:24", by performing the steps listed above. We could probably do this most efficiently with a clever pattern-match or regular expression statement, but both of those strategies present some disadvantages for a casual scripter:

  • they’re hard to learn, and it’s probably too much trouble given how seldom a casual scripter needs to do this sort of thing
  • they’re hard to read and decipher later—somebody who’s got a lot of practice at this kind of thing can write the patterns or regular expressions fairly easily and quickly, but figuring out a pattern or regular expression that someone else wrote (or that you yourself wrote a week ago) is much harder
  • they’re hard to debug
  • they’re hard to “borrow from” later on when you encounter a problem that’s slightly different

For these reasons, I’d rather solve this problem with a formula that is perhaps inefficient but is easy for me to read and easy for me to adapt later on. It won’t run as fast, but the time I really want to save is my own—I don’t care so much, in this case, if JMP has to work a little harder. So, let’s take it one step at a time.

1. change the two-digit years to four-digit

This is potentially tricky. Are all of the years in this century, e.g. 20-something? Or do we have a mix of centuries? If they’re all the same, it’s easy. We just Concat()enate a “20” in front of the two year digits we do have. I like the shorthand || operator better than the function Concat(). We get those two year digits by taking a Substr()ing of :when starting on the first character and continuing for two characters:

"20" || Substr( :when, 1, 2 )

If they’re a mixture of centuries, we’ll need to add some logic for guessing the correct century. Using a transition point is one way to do it. For example, if none of the dates are expected to be in the future or more than a hundred years in the past, we can safely assume that anytime the yy portion of the string is greater than the current year, we should prepend “19” rather than “20.” To do that, we need to compare the string’s yy, Substr( mydate, 1, 2 ), against today’s yy. But what’s today’s yy? Start by getting today, as a year: Year( Today() ) is the numeric value 10. Now convert that to a string with Char() so that we can use Substr() to grab just the last two characters:

Substr( Char( Year( Today() ) ), 3, 2 )

Now do the comparison inside an If(), and make the result "19" when yy is greater than the current year and "20" otherwise. Concat() that result onto the yy:

If( Substr( mydate, 1, 2 ) > Substr( Char( Year( Today() ) ), 3, 2 ),
  "19",
  "20"
) || Substr( mydate, 1, 2 )

Whew! Fortunately the next few steps are easy.

2. add the “-” delimiter between year, month, and day

We’ve already got the yyyy piece. Now we use Substr(:when, x, 2) a few more times with x being the first character of the mm and dd pieces. We insert a string "-" between each piece, sticking it all together with ||s:

 || "-" || Substr( :when, 3, 2 ) || "-" || Substr( :when, 5, 2 )

3. add the “T” delimiter between date and time

That’s easy:

|| "T" ||

4. add the “:” delimiters between hours, minutes, and seconds

Same old same old! Just more Substr() and || action.

Substr( :when, 7, 2 ) || ":"
|| Substr( :when, 9, 2 ) || ":"
|| Substr( :when, 11, 2 )

Put it all together!

Those four steps have produced a string formatted like "2010-07-26T17:43:24", and if we put all that inside Formula( Num( ) ) for a New Column command, we’ll be creating numeric date values.

New Column( "time",
  Format( "yyyy-mm-ddThh:mm:ss" ),
  Formula(
    Num(
      // convert what you have to something like "2010-07-26T17:43:24"
      // assume dates only in past, within last 100 years
      If( Substr( :when, 1, 2 ) > Substr( Char( Year( Today() ) ), 3, 2 ),
        "19",
        "20"
      ) || Substr( :when, 1, 2 ) || "-" || Substr( :when, 3, 2 )
        || "-" || Substr( :when, 5, 2 ) || "T" ||
      Substr( :when, 7, 2 ) || ":" || Substr( :when, 9, 2 )
        || ":" || Substr( :when, 11, 2 )
    )
  )
);

Now we have:

Notice that this technique is even forgiving of strings that are too short. I intentionally left off the ss part of a few of those strings, but since Substr(:when, 11, 2) just produced an empty string "", the formula still worked. That’s an advantage we’ll lose in the next strategy.

Uh-oh! There’s a problem in that last row! When I wrote my example time strings, I was thinking of 2011, but our algorithm above assumes that all data are from the current year or earlier, so that date got converted to 1911 instead. My event description needs improvement.

What if the dates are not in a data column?

Suppose the date value strings aren’t in a character column of a data table—suppose they’re just string values. In that case, we can use the In Format() function to convert them. The guts of it are still the same, but this time we’re parsing a string stored in a global called mydate instead of rows in a column.

mydate = "100726174324";
mydateNum=Informat(
  // convert the string to "2010-07-26T17:43:24",
  // assume dates only in past, within last 100 years
  If( Substr( mydate, 1, 2 ) > Substr( Char( Year( Today() ) ), 3, 2 ),
    "19",
    "20"
  ) || Substr( mydate, 1, 2 ) || "-" || Substr( mydate, 3, 2 ) || "-"
    || Substr( mydate, 5, 2 ) || "T" || Substr( mydate, 7, 2 ) || ":"
    || Substr( mydate, 9, 2 ) || ":" || Substr( mydate, 11, 2 ),
  "yyyy-mm-ddThh:mm:ss" // or another format
);

Running this script returns the numeric value 26Jul2010:17:43:24 and shows that in the log. Notice, however, that when mydate = "1007281730";, missing the ss part of the string, it doesn’t work, and we get a missing value. Our ugly hand-made parser is actually more forgiving than JMP’s In Format() command. But it’s not much better—if we were missing any other part of the string than the part at the end, we’d get mixed-up results.

What pesky date, time, or duration strings have vexed you?

The good folks at JMP keep adding support for more and more date/time/duration formats, but it seems like there’s a never-ending supply of crazy formats that analysts encounter. Typically the strings are coming into JMP from other programs and from data-streaming devices, but they might also arrive when peeling data off web pages or dealing with data entry performed by other people.

What are the crazy formats that you’ve had to crunch into numeric values in JMP?

What are the tricks you’ve developed for crunching them?

Please share your experiences in the Comments! I’ll try to answer any questions you leave there, too.

:, ,