View Full Version : New GRE cancelled - the cost of attempted gap-reduction?


Hartmann von Aue
10-13-2007, 09:09 AM
http://www.gnxp.com/blog/2007/04/new-gre-cancelled-cost-of-attempted-gap.php

The NYT reports that a completely revised GRE has been deep-sixed, not merely delayed (read the ETS press release here). The official story is that there is some insurmountable problem with providing access to all test-takers, an issue apparently too complicated for ETS to bother trying to explain to us. You'd figure that, since this was such a huge project that was suddenly halted, they'd want to clearly spell out why they dumped it -- unless that's the point. Although I'm no mind-reader, the true reason is pretty obvious: the made-over test was designed to narrow the male-female gap at the elite score level, but this diluted its g-loadedness such that it couldn't reliably distinguish between someone with, say, a 125 IQ and a 145+ IQ, which is what graduate departments who rely on super-smart students worry about. Rather than admit that this psychometric magic trick went awry and lopped off a few limbs of g-loadedness, they spun a yarn about access to the test. [1]

To put this in perspective, for those who took the SAT before spring 2005 -- which is everyone here, I assume -- the New SAT now includes a Writing test with both multiple choice grammar questions and a 25-minute persuasive essay. No admissions committee is paying serious attention to this silly addition, although high schoolers obsess over it. The real changes are that the Math test no longer includes the "quantitative comparison" questions (column A, B, equal, can't tell?), and the flavor of the questions is a bit more "book smarts"-based than before. Also, the Verbal test (now called Critical Reading) has zero analogies, fewer sentence completions, and much more passage-based reading. The gutted portions are those that are more highly g-loaded, for sure in the Verbal test, and most likely in the Math test as well. [2]

We now ask: why did ETS intentionally strip the SAT of some of its g-loadedness? Certainly not because they discovered IQ had little value in predicting academic performance, or that some items tap g more directly than others -- so why re-invent the wheel? Since scores on various verbal tasks highly correlate, this change cannot have much affected the mean of any group of test-takers. But if getting a perfect score required scoring correctly on, say, 10 easy questions, 5 medium, and 5 difficult (across 3 sections), a greater number of above-average students could come within striking distance of a perfect score if the new requirement were 10 easy, 9 medium, and 1 hard. I don't know exactly how they screwed around with the numbers, but that's what they pay their psychometricians big bucks to do. Now, reducing the difficulty of attaining elite scores, without also raising mean scores (as with the 1994 recentering), can only have had the goal of reducing a gap that exists at the level of variance, not a gap between means. This, then, cannot be a racial gap but the male-female gap, since here the difference in means is probably 0-2 IQ points, although male variance is consistently greater.

Certainly this reduces the power of the SAT to detect very brainy people -- those with an IQ of 145 or 160 or whatever big number you want -- but I can easily imagine that both ETS and elite universities such as Harvard were willing to trade off a bit of g-loadedness in order to close the male-female gap at the elite level. Harvard students wouldn't look stupider, of course: their prestige is based on their mean SAT score compared to those of others. And they probably have other ways of figuring out who is very brainy vs. fairly smart. (As an aside, this also explains why lots more high-scoring applicants will be rejected by top schools, another paradox that is easily, even if only partially, resolved by clear thinking.) Moreover, attending Harvard isn't all about having a 145 IQ -- a non-trivial number of their graduates will join professions that don't require eighth-grade algebra or sophisticated analysis (say, political office). So that, too, may lessen their concern over the SAT becoming somewhat less g-loaded.

Not so with the GRE -- those who score at the elite level here are hardcore nerds who are planning to do serious intellectual work, and elite graduate departments pay attention mostly to the applicant's intellectual promise. MIT's math department probably doesn't care that an applicant scored 650 on the Math portion but showed singular potential for leadership roles. So, I imagine something similar to the SAT make-over happened, only this time the professors and/or ETS' psychometricians discovered that it would make a joke of a test used to detect the very brainy in search of elite graduate work.

To make this concrete, let's assume that, among applicants to graduate school in the arts and sciences (i.e., future scholars, not professionals), males enjoy only a 0.1 SD advantage in mean IQ (or 1.5 IQ points), as well as a 0.05 SD advantage in their standard deviation. Then a test that is reliable up to 3 SD above the female mean will have 30% of those above this threshold being female. (For comparison with the real world, grad students at CalTech are 30% female.) Almost 10 percentage points can be gained by dumbing the test down so that it's only reliable up to about 2 SD, in which case 39% at the top will be female. Dumbing it down further so that it can only detect those 1 SD above the female mean just adds about 5 further percentage points; females will make up 45%. My guess is that they weren't foolish enough to toy around with a GRE that only tested up to an IQ of 115, but that they took a risk on some version that tested up to about an IQ of 130. Though that's just about enough to get you into MENSA, the real hullabaloo over sex disparities has raged within the halls of the uber-elite: Harvard (Larry Summers), MIT (Nancy Hopkins), Stanford (Ben Barres), and so on. At such an elite level, an applicant with an IQ of 130 would be like a 6'3 guy trying out for the NBA (whose mean height is 6'7). Although the NBA doesn't automatically weed out those 6'3 and under, surely the recruiters would protest to the manufacturers if their new-fangled measuring sticks only measured up to 6'3!
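The arithmetic in the paragraph above can be checked with a short sketch. It is only an illustration under stated assumptions: both sexes are taken to be normally distributed, the applicant pool is assumed to hold equal numbers of men and women, and the "0.05 SD advantage in their standard deviation" is read as a male SD of 1.05 female SDs. The function names are made up for this example. The figures it prints should be close to, though not necessarily identical with, the percentages quoted above.

```python
import math


def upper_tail(z: float) -> float:
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def female_share_above(cutoff_sd: float,
                       male_mean_shift: float = 0.1,
                       male_sd: float = 1.05) -> float:
    """Share of females among everyone scoring above a cutoff.

    The cutoff is measured in female SDs above the female mean.
    Assumptions (not from the original post): normal distributions
    and equal numbers of male and female applicants.
    """
    female_tail = upper_tail(cutoff_sd)
    # Re-express the same cutoff in male standard units.
    male_tail = upper_tail((cutoff_sd - male_mean_shift) / male_sd)
    return female_tail / (female_tail + male_tail)


if __name__ == "__main__":
    for cutoff in (3.0, 2.0, 1.0):
        share = female_share_above(cutoff)
        print(f"ceiling at {cutoff:.0f} SD: {share:.1%} female")
```

Lowering the cutoff moves the comparison from the far tail, where the variance difference dominates, toward the middle of the curve, where the two distributions nearly overlap.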

Pursuing this hunch, I picked up my Kaplan GRE self-study book and found out that they knew at least roughly what the new GRE was going to look like. Here were the proposed new question types for Verbal and Math:

Verbal
Reading Comprehension (4 types)
Sentence Completion (2 types)

Math
Word Problems (4 types)
Data Interpretation (2 types)
Quantitative Comparison (1 type, as before)

Notice the huge change in the Verbal test, which parallels the change in the SAT Verbal test: analogies are gone, and most of the test is reading comprehension. As for Math, they did keep the Quant Comps, but most of the new question types thereof sound too touchy-feely to be of good use: Word Problems include old-fashioned ones, plus "Free Response," "All That Apply," and "Conditional Table" (Kaplan admits they didn't know the exact names -- maybe the last was a contingency table type?). "Free Response" sounds like it would be more g-loaded since you can't rely on answer choices, but it definitely isn't, at least not if this type was to resemble its counterpart on the SAT. Here, you grid in your own answer, but only non-negative rational numbers can be gridded, precluding the use of any questions whose answer had a root or exponent or absolute value, whose trick hinged on the properties of positives vs negatives vs 0, whose answer was an equation or inequality, and most importantly whose point was abstract symbol manipulation (such as "solve for V in terms of p, q, and r"). Since females are better than males at calculation, and worse than males on more abstract math problems, "Free Response" is an easy way to obscure the male advantage at "thinking" math.

Not knowing much about what the other two new types of Word Problems are, I think it's still safe to say they were just as vacuous. In fact, the Data Interpretation problems were to come in 2 types: the old-fashioned one, and a new one called -- don't laugh -- "Sentence Completion"! For christ's sake, why not just turn some of the harder ones into Writing problems in disguise, where the test-taker corrects the grammar of a word problem rather than actually solve it! This psychometric flimflam is ultimately what all would-be gap-reducers must reduce themselves to, at least when the concern is the sex gap at uber-elite levels where those who matter will brook no nonsense over the basic tests being dumbed down.

[1] Since I'm pretty tired by now of writing about the "women in science" topic, for background info I'll just link to a very lengthy post of mine on point, plus Steven Pinker's debate with Elizabeth Spelke.

[2] See p.2 of the full PDF linked to in this post from the GNXP archives. It contains a graphic showing the g-loadedness of various cognitive tasks. Analogies are the most highly g-loaded verbal tasks, reading comprehension one of the least so (though still enough to validate its use on tests).

Macrobius
10-13-2007, 03:08 PM
Nice analysis. Is it possible that they are tinkering with the correlation between math and verbal scores? If a 'verbal' person can score higher on the 'math' portion than formerly, then total scores will go up for that person. The problem may not be (directly) male/female differences, but falling math ability in the general population. I don't think that possibility was considered in the analysis.

If the population were to split into two subgroups (math-capable, and math-ignorant), then the math-capable would have an even greater advantage than they do now -- higher total score at lower verbal ability, say, relative to the mean. Since we know males are 'over-represented' at high levels of math ability as tested on this sort of test, then the 'gender gap problem' might be a real one. Women would be falling out of the applicant pool because of their math scores, period. Naturally, this *would* terrify the uber-elites, as the article calls them, at least those who train the graduate students who design these tests.

Means and variances are composed of many complex components, and actual shifts in the means of society's sub-components, such as the math ability of White middle-class children dropping to working-class and lower levels through interbreeding and poor education, can have large effects at the margins.

Hartmann von Aue
10-14-2007, 02:02 AM
Nice analysis. Is it possible that they are tinkering with the correlation between math and verbal scores? If a 'verbal' person can score higher on the 'math' portion than formerly, then total scores will go up for that person. The problem may not be (directly) male/female differences, but falling math ability in the general population. I don't think that possibility was considered in the analysis.

If the population were to split into two subgroups (math-capable, and math-ignorant), then the math-capable would have an even greater advantage than they do now -- higher total score at lower verbal ability, say, relative to the mean. Since we know males are 'over-represented' at high levels of math ability as tested on this sort of test, then the 'gender gap problem' might be a real one.

You do not need a terribly high IQ, nor do you need to know much beyond basic algebra and geometry, to do well on the GRE quantitative.

Women would be falling out of the applicant pool because of their math scores, period. Naturally, this *would* terrify the uber-elites, as the article calls them, at least those who train the graduate students who design these tests.

Actually, I think you'd be seeing the reverse. Unless a woman were going into hard science, her mathematics score would probably be taken less seriously than her verbal score. In fact this article is suggesting that the changes were made to the verbal section to reduce male advantages on that section.

Angler
10-14-2007, 04:01 AM
I'm somewhat skeptical of the idea that ETS is deliberately sabotaging their own tests for the sake of political correctness. The sole value of these products they sell, the SAT and GRE, lies in their predictive ability. If they reduce that predictive ability, then they risk losing their substantial market share in the scholastic testing business.

I'm also not sure the SAT and GRE were ever explicitly intended to measure g -- though of course they are correlated with g, as are nearly all broad-based tests covering verbal and mathematical skills. (Simple vocabulary knowledge has an even higher correlation with g than the SAT or GRE, even though all vocabulary is learned knowledge. The reason is that smarter people usually read more -- though there are certainly exceptions -- and when they do, they tend to absorb the meanings of words more readily than less intelligent people.) To the fairly large extent that they do, much of it is "crystallized g" anyway.

Not that crystallized intelligence (http://en.wikipedia.org/wiki/Fluid_and_crystallized_intelligence) isn't extremely important. For example, you could have a very high level of innate potential to be good in math, but if you goofed off in math class and never mastered basic algebra and geometry, then your math SAT scores will reflect that -- and you'll sure as hell never make it through calculus.

On a more general note, I lean away from the position, apparently held by many at GNXP, that a person has a single, well-defined IQ score. The idea is ridiculous. Your IQ depends, to at least some extent, on the test used to estimate it. Well-written IQ tests do positively correlate with each other fairly well, but the correlations are far from perfect.

You know what's funny about that? Many of the people who get into high-IQ groups on the basis of one test wouldn't qualify on the basis of another test! Here are two short pieces on that from a (deceased) member of an ultra-high-IQ group, Grady Towers:

http://www.eskimo.com/~miyaguch/grady/societies.html
http://www.eskimo.com/~miyaguch/grady/followup.html (correction to the former)

Here's another great article by the same author:

http://www.prometheussociety.org/articles/multiple.html

The currently accepted relationship between these two kinds of ability is called the investment theory of intelligence. It says, in effect, that we are all born with a certain raw ability, or the eduction of relations and correlates, which can be measured with culture fair tests. As we get older, we "invest" this fluid g in certain kinds of judgment skills, such as those involved in doing a mathematical word problem, or parsing a sentence. When we are young, the theory goes, our formal educations are so much alike that we all invest our fluid g in much the same kinds of judgment skills. That means that our fluid intelligence and our crystallized intelligence are so similar at an early age that it's almost impossible to tell them apart. After we leave school, however, we all begin to invest our fluid g abilities in different things. Measures of fluid g and crystallized g begin to draw apart. Those that invest their fluid g in school-like activities, such as accounting or law, continue to show intellectual growth on conventional (crystallized) IQ tests. Those that put their intelligence to work in other ways, such as becoming ranchers or artists, will not show the same intellectual growth, and may even show a decline in IQ on conventional measures of intelligence.

Many years ago, Mensa was faced by a serious policy decision about the kind of intelligence that it wanted to select for. It turned out that three out of four prospective members who were selected using a culture fair test could not pass a culture loaded test. At the same time, it also turned out that three out of four prospective members who could pass a culture loaded test could not pass a culture fair test. In the end, Mensa chose to use culture loaded tests exclusively in selecting its members. Almost all other high IQ societies, with the exception of Four Sigma and Triple Nine, have followed suit. As a consequence, there are now three qualitatively different kinds of high IQ societies extant. One kind, represented by Four-Sigma and most of the membership of the Triple Nine Society, was recruited with the LAIT--a culture fair test--and is made up mostly of people gifted with fluid intelligence. A second kind, represented by Mensa, Intertel and ISPE, was recruited by more conventional tests, and is made up of those gifted primarily with crystallized intelligence. Some individuals, however, have joined or qualified for membership in both kinds of societies, and are about equally gifted with both kinds of ability. A large minority of Triple Nine members, as well as a majority of those in Prometheus, appear to belong in this category.

Hartmann von Aue
10-14-2007, 05:33 AM
I'm somewhat skeptical of the idea that ETS is deliberately sabotaging their own tests for the sake of political correctness

They do change their tests in response to the demands of the university system.

Consider the change they made to the SAT in the mid-90s:

New->Old
Verbal            Math
730-800  -70      780-800  -20
700-760  -60      700-690  +10
600-670  -70      600-600    0
500-580  -80      500-520  -20

http://www.arthurhu.com/index/sat.htm

What does this do? It makes the test less discriminating at the top of the curve - especially on the Verbal. Now if you truncate the distribution to ignore the top of the curve - say you only consider scores below 700 - there is still a huge sex difference around that region in favor of males on the math. So what did they do? They subtracted ten points - the only part of the test where they didn't boost scores.

http://photos1.blogger.com/blogger/5379/496/1600/0.jpg

http://dienekes.blogspot.com/2006/04/male-female-differences-in-general.html

Angler
10-14-2007, 05:49 AM
Interesting points, Leon.

But as to that bottom link, one of the people who left comments made an excellent point that seems to undermine the author's conclusions:

Since SATs are taken primarily by students intending to go to college, they overrepresent the right half of the curve and underrepresent the left half. I would predict that there are higher proportions of males in the lower intelligence ranges as well (i.e., the male IQ distribution has a higher standard deviation), but since low-IQ students are unlikely to take the SAT, the figures inaccurately reflect a higher male g.
C Phillips
I can't think of a way to argue with this.

Also, if the authors want to prove that males have a higher average g-level than females, then they should really base their study on a "purer" test of g, such as the Raven's Progressive Matrices. The SAT is too contaminated by education, although the use of a large number of subjects may cause that contamination to "average out."

In any case, I don't dispute that males and females differ in their patterns of abilities. The evidence for that is quite strong. For example, I know that one area in which males have an advantage is spatial tasks.

Angler
10-14-2007, 06:24 AM
I was just thinking back to a question I saw on the GRE when I took it. I knew the answer, but I remember thinking at the time that questions like this really do contaminate the test with educational factors.

The following isn't the exact question, but it uses exactly the same principle. No calculators were allowed on the test:

Which of the following is NOT evenly divisible by 3?

(a) 4575
(b) 5484
(c) 6912
(d) 7124
(e) 2463

The answer is (d). This can be found in two ways. One is to actually divide each number by 3, which is time-consuming. But if you know that a number is evenly divisible by 3 if and only if the sum of its digits is divisible by 3, then you can do the problem more quickly, which leaves more time to do other, more difficult problems on the test.
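As a quick illustration of the digit-sum rule described above, here is a minimal sketch that applies it to the five answer choices. The helper names are invented for this example; the rule itself is the standard one stated in the post.

```python
def digit_sum(n: int) -> int:
    """Sum of the decimal digits of a non-negative integer."""
    return sum(int(d) for d in str(n))


def divisible_by_3(n: int) -> bool:
    """Digit-sum test: n is a multiple of 3 iff its digit sum is."""
    return digit_sum(n) % 3 == 0


# The five answer choices from the example question.
options = {"a": 4575, "b": 5484, "c": 6912, "d": 7124, "e": 2463}

# Labels of the choices that fail the digit-sum test.
not_divisible = [label for label, n in options.items()
                 if not divisible_by_3(n)]

if __name__ == "__main__":
    for label, n in options.items():
        print(label, n, "digit sum:", digit_sum(n),
              "divisible:", divisible_by_3(n))
    print("not divisible by 3:", not_divisible)
```

Only one mental addition per answer choice is needed, which is where the time saving comes from.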

Of course, we can still come back to the observation that academic knowledge is itself highly correlated with g, so perhaps such "contamination" doesn't detract too much from the tests' predictive value. In fact, it may even enhance it.

Hartmann von Aue
10-14-2007, 06:37 AM
But if you know that a number is evenly divisible by 3 if and only if the sum of its digits is divisible by 3, then you can do the problem more quickly, which leaves more time to do other, more difficult problems on the test.

Right - if you have good intuition about the way division works you don't need to know the rule about the sum.

EDIT - no, I misread what you said. What I meant to say is that if you know by inspection that 45, 75, 54, 84, 69, 12, 24, 63 are each divisible by three, then you are already done.

Angler
10-14-2007, 07:01 AM
Right - if you have good intuition about the way division works you don't need to know the rule about the sum.
I don't see how intuition can be used to solve a divisibility problem.

The proof of the divisibility-by-3 rule is simple, but if you don't already know of the rule, then to apply it you have to (1) correctly guess that it exists, (2) confirm it with a proof, and (3) apply it. To do all that during a test would still take a lot longer than simply knowing the rule to begin with and jumping straight to step (3). So again, the person who knows the rule ahead of time has the advantage.
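For reference, the proof mentioned above can be written out in a few lines. This is the standard argument, with the digits of N labeled d_0 (units) through d_m; the notation is introduced here only for illustration.

```latex
N = \sum_{k=0}^{m} d_k \, 10^{k}
\quad\Longrightarrow\quad
N - \sum_{k=0}^{m} d_k = \sum_{k=0}^{m} d_k \,\bigl(10^{k} - 1\bigr).

\text{Each } 10^{k} - 1 = \underbrace{99\ldots9}_{k \text{ nines}}
\text{ is a multiple of } 3, \text{ so }
N \equiv \sum_{k=0}^{m} d_k \pmod{3}.
```

So N and its digit sum leave the same remainder on division by 3, and in particular one is divisible by 3 exactly when the other is.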

EDIT: I just saw your edit. Right, breaking the numbers up into two-digit segments helps in this example, though the divisibility rule is still being applied. Anyway, you get my point: knowing these mathematical concepts ahead of time greatly reduces the amount of thinking needed during the test. Perhaps this is why the verbal is more g-loaded than the math.

Hartmann von Aue
10-14-2007, 07:09 AM
I don't see how intuition can be used to solve a divisibility problem.

The proof of the divisibility-by-3 rule is simple, but if you don't already know of the rule, then to apply it you have to (1) correctly guess that it exists, (2) confirm it with a proof, and (3) apply it.

To do all that during a test would still take a lot longer than simply knowing the rule to begin with and jumping straight to step (3). So again, the person who knows the rule ahead of time has the advantage.

No - if you understand that the sum of two numbers divisible by three is divisible by three - then you don't need to know the rule.

(a) 4575 45=3*15 75=3*25
(b) 5484 54=9*6 84=12*7
(c) 6912 69=23*3 12=3*4
(d) 7124
(e) 2463 24=8*3 63=7*9
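The segment-by-segment check above can be sketched as follows. The function names are hypothetical, chosen for this example. The shortcut works because 4575 = 45*100 + 75, so if both two-digit pieces are multiples of 3, the whole number is too. Note that the check is sufficient but not necessary: a number can be a multiple of 3 even when its segments are not.

```python
def two_digit_segments(n: int) -> list[int]:
    """Split n into two-digit chunks from the right, e.g. 4575 -> [45, 75]."""
    segments = []
    while n > 0:
        segments.append(n % 100)
        n //= 100
    return segments[::-1] or [0]


def all_segments_divisible_by_3(n: int) -> bool:
    """True when every two-digit chunk of n is a multiple of 3.

    This guarantees n is a multiple of 3, but the converse can fail.
    """
    return all(seg % 3 == 0 for seg in two_digit_segments(n))


if __name__ == "__main__":
    for n in (4575, 5484, 6912, 7124, 2463):
        print(n, two_digit_segments(n), all_segments_divisible_by_3(n))
    # 1311 = 3 * 437, yet neither 13 nor 11 is a multiple of 3.
    print(1311, 1311 % 3 == 0, all_segments_divisible_by_3(1311))
```

For the five answer choices the shortcut happens to settle every case, but a number like 1311 shows it cannot replace the digit-sum rule or direct division in general.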

Direct division is at least as fast as the sum rule.

4575 / 3 = 1525
5484 / 3 = 1828
6912 / 3 = 2304
2463 / 3 = 821

It's still mainly a test of basic arithmetic skill - which can be taught.

Angler
10-14-2007, 08:03 AM
Oh, so you weren't using the divisibility rule on each of the two-digit segments. Sorry, I didn't read your post carefully enough. Okay, that's a little different from how I would do it -- i.e., I would see that "84" and immediately think "8+4=12, which is divisible by 3" -- but I can see how your way would be faster for some people.

Keep in mind, however, that it's really just a coincidence that each of those segments can be broken up like that -- I just pulled this question out of my hat in order to demonstrate the divisibility rule, and I wasn't thinking of other shortcuts. The actual test questions might not have been so flexible.

Anyway, this is only one example. There are other such rules that are good to know on these tests, such as the fact that the sum of the interior angles of a polygon equals 180(n-2) degrees, where n is the number of sides. Other things that help include having memorized the squares of as many integers as possible, up through at least 20. Sure, these things are easily derived/calculated, but again, knowing them ahead of time provides a major speed advantage. I was on a math team in high school, and I think that a lot of what I learned from that activity gave me a big advantage on the SAT.

So again, the point is that many of these skills are education-dependent. They're still connected to intelligence, of course, because a person's ability to profit from education is related to his intelligence. Nevertheless, I think research into male-female differences in intelligence would be better based on tests that focus more on "pure" reasoning skills.