Defense managers and system engineers require estimates of project cost/effort, duration, and quality in order to secure funding and set expectations with customers, end users, and management teams. Researchers and practitioners of software metrics have developed models to help project managers and system engineers produce these estimates. These models generally quantify project scope using estimated source lines of code (SLOC) or function points, and then apply generalized rules of thumb to arrive at the needed estimates of staffing, duration, and quality. Experts agree that these models perform best when their parameters are calibrated with project data from the using organization. Our question was, "How do parametric models perform out-of-the-box (that is, without calibration)?" This is analogous to a project team without access to historical data using the models as published. What level of accuracy can such a team expect? We evaluated nine simple published models (four effort estimation models, three duration estimation models, and two software quality, i.e., defect, models) by comparing their predicted values against the actual results of 54 non-trivial commercial software projects recently completed by an SEI CMMI Level 3 organization with a mature (and commended) measurement program. This certification means that the data was collected in a standard manner and is sensible to use in this study; it does not imply that a defined process level is needed to use the results.
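The comparison described above reduces to a simple containment check: for each project, does the actual value fall at or below the model's upper-bound prediction? The sketch below illustrates that calculation; the function name and data values are illustrative, not taken from the study.

```python
def containment_rate(actuals, upper_bounds):
    """Fraction of projects whose actual value falls at or below
    the model's upper-bound prediction (its 'containment')."""
    if len(actuals) != len(upper_bounds):
        raise ValueError("actuals and upper_bounds must align by project")
    hits = sum(a <= ub for a, ub in zip(actuals, upper_bounds))
    return hits / len(actuals)

# Hypothetical effort data (person-months) for five projects,
# paired with one model's upper-bound predictions.
actual_effort = [10, 14, 22, 35, 50]
upper_bound   = [12, 13, 30, 40, 45]
print(f"{containment_rate(actual_effort, upper_bound):.0%}")  # prints "60%"
```

A model whose upper bound contains 81% of projects, as with the best-case effort model in this study, would score 0.81 on this metric across the 54-project data set.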
For the effort estimation models, we found that the upper bound of the best-case model contained 81% of our projects; that is, four out of five of our projects used less effort than predicted by the best-case model, whereas the average effort estimate across all models contained only 54% of our projects, or little better than a coin flip. Duration estimates performed significantly better. In the best-case model, the upper-bound estimate contained 93% of our projects, with the overall model average at 91%, and the lower-bound estimate exceeded the actual duration more than 70% of the time. This means we can beat the estimated project duration seven out of 10 times using the shortest duration predicted by the models out-of-the-box. For quality modeling, one of the defect prediction approaches worked quite well, with the upper bound containing 94% of the projects (that is, more than 9 times out of 10 we will deliver fewer defects than forecast by the model). This information is useful to executives and managers performing early project estimates without detailed analysis of the requirements or architecture, as the bounds allow them to respond quickly to customer requests with some level of confidence. So, if you are asked for a project estimate and do not have access to historical data or well-calibrated local estimation models, there is hope. Based on your available sizing information, you can use these models out-of-the-box with some success as long as you keep these things in mind:
• Capers Jones' approach was the only one that (relatively) accurately addressed all three project management estimation needs: effort, duration, and quality.
• None of the four effort estimation models was particularly effective with our project data, but using the upper bound of the Rone model gives the project team an 80% chance of meeting the effort estimate.
• A project should never commit to the lower-bound effort estimates from any of the models we evaluated.
• The duration estimation models are particularly effective with our project data. Using the upper bound of the Boehm model gives a project team a better than 90% chance of completing the project within the estimated calendar time.
• Capers Jones' quality model was the most accurate predictor of the quantity of defects in our software development projects.
From our analysis, it appears that duration and quality models are quite useful, but effort estimation is still problematic. We suggest researchers investigate approaches to effort estimation that are not based on SLOC or function points. For example, models that rely on use cases or story points and can estimate all three key parameters (i.e., effort, duration, and quality) may prove valuable in the future. The translation from mission or business need to requirements and architecture is a huge challenge that impacts estimates on each iteration; by developing models that address these early solution descriptions, managers and system engineers could benefit from earlier estimates.