Is South Carolina lazy? Um… no, not so much!
by Erin Vang on Aug.30, 2010, under JMP & JSL
We recently had the privilege of helping out the good folks at Moore Data LLC with a SAS data import project. Scott Moore had noticed an article in the Post and Courier claiming that South Carolina was the eighth laziest state in the US. He smelled a rat, though—as he put it, “Typically, subjective words used to describe data pop a red flag that warns me of impending data misuse doom.”
With a little SAS importing help from Global Pragmatica LLC, Scott was able to peruse the data himself in JMP, and he reached a rather different conclusion!
Read Scott’s rebuttal here.
How to parse a tricky date string
by Erin Vang on Jul.28, 2010, under JMP & JSL
Recently on LinkedIn, Asit Rairkar asked about how to parse a tricky date string into a numeric date column in JMP. He has date/time data as character data, but unfortunately it’s not formatted in one of the many date formats that JMP supports directly. He wrote:
I have found that informat does not work with all sorts of texted dates. e.g. I have had some tables with date codes as YYMMDDHHMMSS and I have had to manually seperate the 2 digits each and combine them again into YY/mm/dd hh:mm:ss format to feed in. Not sure if there is a better way to deal with this type of input.
Writing JSL to handle this situation is conceptually simple, but getting the JSL just right is tricky. The format JMP supports that comes closest to this format is “yyyy-mm-ddThh:mm:ss,” so that’s what I decided to work with, but overcoming the little differences between JMP’s format and Asit’s string takes some work. We have to doctor up the input string so that it matches the format JMP can handle. We’ll need to:
- change the two-digit years to four-digit
- add the “-” delimiter between year, month, and day
- add the “T” delimiter between date and time
- add the “:” delimiters between hours, minutes, and seconds
Once we’ve gotten the input string to match a JMP-supported format, we put that inside Num()
as the Formula()
for a new numeric column. We also have to specify one of the date/time formats for the column, and to keep things simple, I’ll just use the “yyyy-mm-ddThh:mm:ss” format again, but you could choose any format.
Here’s how it works
Suppose your data table looks like this:
You can generate my example by running this script:
dt = New Table( "tricky times", Add Rows( 4 ), New Column( "event", Character, Nominal, Set Values( {"Erin starts researching Asit's question", "Erin starts this blog post", "Erin wants a beer", "This blog post better be done"} ) ), New Column( "when", Character, Nominal, Set Values( {"100726174324", "100728154315", "1007281730", "110728154315"} ) ) )
So our first programming task is to convert a string like "100726174324"
to a string like "2010-07-26T17:43:24"
, by performing the steps listed above. We could probably do this most efficiently with a clever pattern-match or regular expression statement, but both of those strategies present some disadvantages for a casual scripter:
- they’re hard to learn, and it’s probably too much trouble given how seldom a casual scripter needs to do this sort of thing
- they’re hard to read and decipher later—somebody who’s got a lot of practice at this kind of thing can write the patterns or regular expressions fairly easily and quickly, but figuring out a pattern or regular expression that someone else wrote (or that you yourself wrote a week ago) is much harder
- they’re hard to debug
- they’re hard to “borrow from” later on when you encounter a problem that’s slightly different
For these reasons, I’d rather solve this problem with a formula that is perhaps inefficient but is easy for me to read and easy for me to adapt later on. It won’t run as fast, but the time I really want to save is my own—I don’t care so much, in this case, if JMP has to work a little harder. So, let’s take it one step at a time.
1. change the two-digit years to four-digit
This is potentially tricky. Are all of the years in this century, e.g. 20-something? Or do we have a mix of centuries? If they’re all the same, it’s easy. We just Concat()
enate a “20” in front of the two year digits we do have. I like the shorthand ||
operator better than the function Concat()
. We get those two year digits by taking a Substr()
ing of :when
starting on the first character and continuing for two characters:
"20" || Substr( :when, 1, 2 )
If they’re a mixture of centuries, we’ll need to add some logic for guessing the correct century. Using a transition point is one way to do it. For example, if none of the dates are expected to be in the future or more than a hundred years in the past, we can safely assume that anytime the yy portion of the string is greater than the current year, we should prepend “19” rather than “20.” To do that, we need to compare the string’s yy
, Substr( mydate, 1, 2 )
, against today’s yy
. But what’s today’s yy
? Start by getting today, as a year: Year( Today() )
is the numeric value 10
. Now convert that to a string with Char()
so that we can use Substr()
to grab just the last two characters:
Substr( Char( Year( Today() ) ), 3, 2 )
Now do the comparison inside an If()
, and make the result "19"
when yy
is greater than the current year and "20"
otherwise. Concat()
that result onto the yy
:
If( Substr( mydate, 1, 2 ) > Substr( Char( Year( Today() ) ), 3, 2 ), "19", "20" ) || Substr( mydate, 1, 2 )
Whew! Fortunately the next few steps are easy.
2. add the “-” delimiter between year, month, and day
We’ve already got the yyyy
piece. Now we use Substr(:when, x, 2)
a few more times with x
being the first character of the mm
and dd
pieces. We insert a string "-"
between each piece, sticking it all together with ||
s:
|| "-" || Substr( :when, 3, 2 ) || "-" || Substr( :when, 5, 2 )
3. add the “T” delimiter between date and time
That’s easy:
|| "T" ||
4. add the “:” delimiters between hours, minutes, and seconds
Same old same old! Just more Substr()
and ||
action.
Substr( :when, 7, 2 ) || ":" || Substr( :when, 9, 2 ) || ":" || Substr( :when, 11, 2 )
Put it all together!
Those four steps have produced a string formatted like "2010-07-26T17:43:24"
, and if we put all that inside Formula( Num( ) )
for a New Column
command, we’ll be creating numeric date values.
New Column( "time", Format( "yyyy-mm-ddThh:mm:ss" ), Formula( Num( // convert what you have to something like "2010-07-26T17:43:24" // assume dates only in past, within last 100 years If( Substr( :when, 1, 2 ) > Substr( Char( Year( Today() ) ), 3, 2 ), "19", "20" ) || Substr( :when, 1, 2 ) || "-" || Substr( :when, 3, 2 ) || "-" || Substr( :when, 5, 2 ) || "T" || Substr( :when, 7, 2 ) || ":" || Substr( :when, 9, 2 ) || ":" || Substr( :when, 11, 2 ) ) ) );
Now we have:
Notice that this technique is even forgiving of strings that are too short. I intentionally left off the ss
part of a few of those strings, but since Substr(:when, 11, 2)
just produced an empty string ""
, the formula still worked. That’s an advantage we’ll lose in the next strategy.
Uh-oh! There’s a problem in that last row! When I wrote my example time strings, I was thinking of 2011, but our algorithm above assumes that all data are from the current year or earlier, so that date got converted to 1911 instead. My event description needs improvement.
What if the dates are not in a data column?
Suppose the date value strings aren’t in a character column of a data table—suppose they’re just string values. In that case, we can use the In Format()
function to convert them. The guts of it are still the same, but this time we’re parsing a string stored in a global called mydate
instead of rows in a column.
mydate = "100726174324"; mydateNum=Informat( // convert the string to "2010-07-26T17:43:24", // assume dates only in past, within last 100 years If( Substr( mydate, 1, 2 ) > Substr( Char( Year( Today() ) ), 3, 2 ), "19", "20" ) || Substr( mydate, 1, 2 ) || "-" || Substr( mydate, 3, 2 ) || "-" || Substr( mydate, 5, 2 ) || "T" || Substr( mydate, 7, 2 ) || ":" || Substr( mydate, 9, 2 ) || ":" || Substr( mydate, 11, 2 ), "yyyy-mm-ddThh:mm:ss" // or another format );
Running this script returns the numeric value 26Jul2010:17:43:24
and shows that in the log. Notice, however, that when mydate = "1007281730";
, missing the ss
part of the string, it doesn’t work, and we get a missing value. Our ugly hand-made parser is actually more forgiving than JMP’s In Format()
command. But it’s not much better—if we were missing any other part of the string than the part at the end, we’d get mixed-up results.
What pesky date, time, or duration strings have vexed you?
The good folks at JMP keep adding support for more and more date/time/duration formats, but it seems like there’s a never-ending supply of crazy formats that analysts encounter. Typically the strings are coming into JMP from other programs and from data-streaming devices, but they might also arrive when peeling data off web pages or dealing with data entry performed by other people.
What are the crazy formats that you’ve had to crunch into numeric values in JMP?
What are the tricks you’ve developed for crunching them?
Please share your experiences in the Comments! I’ll try to answer any questions you leave there, too.
How to change string columns to numeric date columns
by Erin Vang on Jul.22, 2010, under JMP & JSL
Sudarshan Krishnan asked a question on the LinkedIn JMP Professional Network about how to change a string column into a numeric date column:
What is the JSL command to convert a character data type (“hr:m:s” –eg:”23:59:59″) to a Numeric, Continuous, duration (“hr:m:s”) data type?
Dates and times in a character column
Let’s assume you have a data table includes date, time, or duration values as strings, in character columns. You might run into this situation frequently when importing data from text files or from other programs such as Microsoft Excel. The columns might look like this: Here’s a simple JSL script you can run to create a data table like this one:
New Table( "sample", Add Rows( 4 ), New Column( "timeChar A", Character, Nominal, Set Values( {"23:59:59", "10:14:01", "22:01:03", "08:01:05"} ) ), New Column( "timeChar B", Character, Nominal, Set Values( {"23:59:59", "10:14:01", "22:01:03", "08:01:05"} ) ) );
Convert date/time strings to numeric
You have two options for how to handle this, depending on what your goals are. You can either make a new, numeric column that converts the strings to date/time values using a formula, or you can change the character column directly into a numeric column. Let’s try it each way.
Option A: make a new numeric column
To make a new numeric copy of the original column, use the New Column()
operator with the arguments specifying the new column’s name (“timeNum A”), data type (Numeric), the new format desired (in this case “hr:m:s”), and a formula. The formula is simply the original column, :timeChar A, inside the Num()
operator.
New Column( "timeNum A", Numeric, Format( "hr:m:s" ), Formula( Num( :timeChar A ) ) );
Now you have the original string column, unchanged, plus a new numeric copy of that column. Note that the new column is linked to the original with a formula, so if you want to delete the original column, you’ll get an error:
Cannot delete column “timeChar A” while it is referenced by the formula of column “timeNum A”. Removing a formula leaves the data unchanged. Removing references replaces each reference with an empty value. These effects are permanent and cannot be undone.
Just click “Remove Formula” to proceed. Of course, if you no longer want the original character column, you might as well go with Option B instead.
Option B: change a character column to numeric directly
To change an existing character column into numeric, just send a series of messages to the column, first changing the type to numeric, then specifying a format. You might also want to change the modeling type to continuous, and you might want to change the column’s name.
Column( "timeChar B" ) << data type( numeric ) << Format( "hr:m:s" ) << modeling type( continuous ) << set name( "timeNum B" );
After
Now the data table with two character columns should have one character column, a second column that was character and has been changed to numeric, and a third column that is a numeric copy of the first column:
One Proportion Test
by Erin Vang on Jul.21, 2010, under JMP & JSL
One Proportion Test is a JMP script for performing a One Proportion Test quickly and easily.
Background
This script is inspired by a recent question on the LinkedIn Group “JMP Professional Network” from Jennifer Atlas, Senior Business Development Coordinator at Minitab, who asked:
I know I can calculate a sample size for a one proportion test in JMP, but how do I test for 1 proportion?
Karen Copeland, Ph.D., from Boulder Statistics promptly gave a helpful explanation of how to do it in JMP. JMP handles the problem quite capably, but you have to know where to find it and remember a bunch of details about how to use it:
To test one proportion in JMP I use the distribution platform. When you run a distribution for a nominal variable then in the red drop down menu there is an option for testing probabilities.
Details:
If you have a data table with a column for your proportion then you can proceed straight to the distribution platform. If not then first create a summary table with two columns. The first column would be your outcome such as Y/N or 0/1 and the he second column would be your frequency column of how many of each outcome you observed.
Testing of a proportion:
1. With a column of outcomes: First make sure that your column is of the nominal type. Second, use the distribution platform to create a distribution of the outcome. Under the red triangle drop down menu select “Test Probabilities” and a dialog box will appear with various options for testing your probabilities. Note, you need only fill in the one probability that you are interested in testing.
2. With a summary table for your outcomes: Again, make sure that your outcome column is of the nominal type. Second, use the distribution platform with your outcome column as the Y and the frequency column as the “Freq”. Then continue as above by selecting the test probabilities from the red triangle menu.
Note that you will also find confidence intervals for your proportions in the red drop down menu.
Boulder Statistics LLC and Global Pragmatica LLC are strategic allies, pairing Boulder Statistics’ analytical expertise with Global Pragmatica’s JMP scripting expertise to deliver outstanding solutions to our clients. When I saw Karen’s explanation, I immediately thought this would be a great opportunity to collaborate with her and build a JSL widget to make this easier.
This script is the result, and Global Pragmatica and Boulder Statistics are proud to make it available free, in an encrypted (run-only) script you can download today.
How to use One Proportion Test
You must license or download a demo copy of JMP software from SAS to use this JMP add-on script.
Launch the script. You are asked to choose a data table, which can be set up either of the two ways Dr. Copeland describes above. A dialog box requests the necessary column assignments. Click OK. At the bottom, fill in the details of the One Proportion Test. Use the Start Over button to restart the analysis with a different data table. For further help, see text at the top of the window and tooltips when hovering over buttons.
How encrypted scripts work (the free widget)
Encrypted scripts are run-only scripts. They are not human-readable, so you can’t modify them or adapt them for other purposes. If you would like to study the JSL, modify, or adapt the script, you should license the unencrypted version of this script instead.
Why pay for the unencrypted script?
Don’t! At least not right away… Start by “buying” the free widget. Try it out and see if you like it!
If you do, you might want to buy the full-price unencrypted script, so that you can modify, customize, or adapt the script for your own specialized needs, or so you can study a professional-quality script! This script demonstrates quite a few advanced scripting techniques, including:
- building an elegant, all-in-one-window user interface using display objects
- attaching scripts to buttons, radio buttons, column lists, etc.
- including a custom logo graphic
- implementing a different dialog box according to the user’s radio button choice
- opening, closing, and deleting display tree elements dynamically
- offering tips and help right in the window
- including email and web links for more information, sending feedback, etc.
- hiding globals—whether to protect intellectual property or to avoid cluttering up
Show Globals()
output - optimizing memory management and simplifying between-use value-clearing by storing problem-specific “globals” as entries in an
Associative Array()
instead of in globals
You can read more about encrypted vs. unencrypted scripts offered by Global Pragmatica here.
Compatibility
One Proportion Test has been tested on JMP 8 for Mac and Windows as well as current beta versions of JMP 9 for Mac and Windows.
Known issues
- In the second (no longer current) early adopter release of JMP 9 for Windows, you have to start the script manually. (This is not a problem in JMP 8 or later beta releases of JMP 9.) There are several ways to do this:
- press Ctrl-R, or
- click the Run Script button in the toolbar, or
- from the Edit menu, select Run Script
Buy now!
Contact me if you’re interested in purchasing this script or commissioning one like it.
Should JMP scripts exploit data objects? Yes and no.
by Erin Vang on Jul.21, 2010, under JMP & JSL
Answering a LinkedIn group question about object oriented programming in JSL from Philip Brown, the always helpful Mark Bailey wrote, “The best scripts exploit JMP objects such as data columns and platforms.”
I don’t quite agree.
In my opinion, Mark’s both right and wrong in saying that the best scripts exploit JMP objects like data columns and platforms. Yes, if a JMP data object or analysis platform already knows how to do something you need, you shouldn’t be reinventing that wheel. BUT–and this is a big but–JMP’s data tables and their sub-objects (columns, rows, etc.) come with a heavy overhead cost, and this is where I think it’s sometimes better to avoid using JMP’s object.
It all depends, of course. If you just need a quick result, by all means, let JMP do the work for you.
But if you need to do something computationally intensive, you’re far better off grabbing just what you need in a JSL data structure–a vector, a matrix, a list, an associative array, whatever is appropriate–and doing that computation outside the data table. You’ll speed up your code, reduce the memory consumption, and even avoid difficult-to-chase-down crashes.
Don’t forget that most of JMP’s formula operators can work on things other than data columns, too. I recently worked on a client’s script that was doing its own calculations of means on lists of numbers by adding up the values, counting the number of values, and dividing the sum by that. It was correct, of course, but it was SLOW. I got the script to run about twice as fast by using JMP’s column-wise mean operator, Col Mean(), on data columns instead, and then I got it to run five times faster still by using JMP’s regular Mean() operator on vectors instead of data columns.
Why did this change give me a tenfold speed improvement? Two reasons.
- JMP’s internal calculation of means is more efficient than anything we can build in JSL—JMP’s developers have always optimized their code so that JMP’s numerical computations are done lickedy-split.
- Those calculations happen much faster when you don’t run them through the data table, which is a complex object with a lot of internal and external dependencies—and a big, fancy window that needs to be repainted whenever the data change. Window-painting alone can cause significant script slowdowns.
Another more important reason to use JMP’s operators is that JMP’s developers have already thought through pesky details like the proper handling of missing values, underflow and overflow errors, and other arcana of numerical computation that can cause reasonable-looking calculations to get wrong results.
I’m not bashing data tables, by the way. It’s amazing what JMP’s data tables can do for us, with all their table and column properties, row states, formulas, dynamic links to graphs, and so on. All that power comes at a cost, though, and basic numerical computations will go faster when they’re not manipulating complex data objects.
Bottom line: if you’re taking advantage of the rich features in JMP’s data tables, use data tables, but if you’re just doing some calculations, and speed is an issue, then do the calculations outside the data table. In either case, take advantage of JMP’s built-operators as much as possible.