We often hear statements like "I have data on a hundred thousand leprosy patients," "I have data on climatic change in the Chilean city," "I have data on road traffic accidents," and so on. In all these statements there is the word "data." In this session we are going to see what this data means, what the different types of data are, and how to convert this data into pieces of information.

Data can broadly be classified into qualitative and quantitative data. Qualitative data, as the name suggests, cannot be quantified; it describes some sort of quality. Qualitative data can in turn be nominal or ordinal. Examples of nominal data are the color of eyes, the different regions of a city, and so on. Ordinal data are data which can be arranged in some sort of order, for example the stages of a disease condition. Quantitative data again are of two categories. One is discrete data, which is essentially a whole number: number of siblings, family size, etc. The other is continuous data, which is a continuous measurement like height and weight. So these are all different types of data, which require different types of analytical skill.

Now, our aim is to get some information out of data. A large set of data is very essential, but just looking at the data you can't get any information, so we need to summarize it. One of the ways of summarizing data is to get a sort of average value. Now, by average, the first average that comes to our mind is the mean. The mean, also called the arithmetic mean, is the most commonly used, and it is simply called the mean. You add all the observed values; we call that the sum, which is Σxᵢ in mathematical notation. The mean is nothing but this sum divided by the number of observations you have used in your calculation, which is n; that is, x̄ = Σxᵢ / n. The sample mean is denoted by x̄, a bar on top of x, and the population mean is denoted by μ (mu). Let's see an example. Suppose there are 10 pregnant patients
who had visited an ANC clinic, and their ages are 26, 31, 25, and so on. What is the mean age of these pregnant women? The mean is got by summing up all the ages, which comes to 260. There are 10 observations, so divide it by 10: 260 by 10, which is equal to 26. We say the mean age of pregnant women who visited the ANC clinic is 26 years.

Now, one of the problems with the mean is that some extreme values, either big or small, even one or two, if present in your data set, can influence the average. Because you are adding all the values, if you add one big value the whole mean becomes an overestimate. To avoid this we have another measure, which is called the median. The median is literally the middle value of the distribution. It divides the distribution exactly into two halves; that is, 50% of the data fall on either side. This is a very useful measure, especially when you have extreme values. Let's see this example. Suppose you have data on the duration of stay in hospital of 11 patients. The durations are one day, two days, three days ... and nine days for ten patients, and for the eleventh patient it is 77 days. Of course, I have arranged this data in ascending order. The median is the middle value, which is the sixth value: its position is (n + 1)/2, and 11 plus 1 is 12, divided by 2 is 6. The sixth value is 6, which means the median is 6 here, whereas when you compute the mean it comes out to be 11.8. As you can see, 6 is a more appropriate measure of average in this case than the mean of 11.8. If n is even, you take the average of the middle two values.

Now there is another measure, which is called the mode. The mode is the value that occurs most frequently. In fact, the mode is the only location statistic we can use for nominal data, which are not measurable. In epidemiology we use the mode quite often: in an epidemic curve with respect
to time, we look for the modal class, and that gives an idea of the incubation period of the pathogen. The example for a mode is color preference and the number of persons: 354 people prefer green, 852 prefer yellow, 310 prefer white, and 474 purple. So the maximum number of people prefer yellow, and so the modal class is yellow. As you can see, with respect to the mode, there can be multiple modes, or there can be no mode at all. Suppose all the values here are 354; then there is no mode. So a mode may not exist, or there can be multiple modes in a data set.

Now we have seen that mean, median, and mode are three good measures for summarizing your data to get an average value. But it is not enough to know just the average value. Say, for example, you go to a swimming pool, you don't know swimming, and you are five feet seven inches tall. If the pool manager says the average depth of the swimming pool is four and a half feet, you feel very comfortable and you jump. Suppose the place where you jump is nine feet deep. The thing that you missed asking is: yes, the average is four and a half feet, but what is the variability? There may be places where it is as shallow as three feet and as deep as nine or ten feet. So you need to ask what the variability is.

One of the measures that comes to our mind is the range. The range is the difference between the minimum and the maximum value of the observations. An advantage of this measure is that it is a very quick and easy indicator of dispersion. But as I said about the mean, the range is also influenced by extreme values. Also, we consider only two values, the first and the last, and we are not using the data in between at all, and that is a great disadvantage of the range.

There is another value, which is called the interquartile range. This, to a large extent, takes care of extreme values, in the sense that we divide the data set into four quarters, and we remove the first quarter and then
the last quarter and consider only the middle 50% of the values. This interquartile range is Q3 minus Q1. A great advantage is that this value doesn't get affected by extreme values. But again the disadvantage is that it covers only the middle 50% of the values, and it has the same disadvantage that we had for the range: it uses only two values, and the in-between values are not made use of.

Another measure of variability is the mean deviation from the mean. What do we mean by that? From every data point in your data set we subtract the mean, and then we take an average of these deviations, which is called the mean deviation from the mean. One of the problems with this is that some values are less than the mean and some values are more than the mean, and if you sum all these deviations you get a value of 0. So the mean deviation from the mean is always zero. To get over that, we ignore the sign, just take the difference, and then take the average. This is called the absolute mean deviation. Its advantages are that it is based on all observations in the group and it is easy to grasp the meaning of the whole procedure. But the disadvantage is that it ignores the signs of the differences, and mathematically it is not very rigorous to use this value.

So in order to get over that we have another measure. We take the difference of each observation from the mean, and instead of ignoring the sign we square it. The square takes care of the minus and the plus: everything becomes plus. Then we take an average of that, and that value is called the variance. Since we are squaring, the unit of measurement is also squared, so we take a square root at the end, and that is called the standard deviation. The standard deviation, denoted SD, is the square root of the average of the squared
deviations of the observations from the arithmetic mean. The square of the standard deviation is the variance. The standard deviation is the most important measure of dispersion. While the variance is in squared units, the standard deviation is expressed in the same units as the measurement, and it is suitable for further analysis. So the standard deviation together with the arithmetic mean is useful for describing the data, and these two measures are extensively used for further treatment of your data set.

I'm going to introduce to you one more measure, which is called the coefficient of variation. The purpose of this measure: suppose you have different groups, different data sets, and you want to compare the relative variability in those groups. The coefficient of variation is the standard deviation expressed as a percentage of the arithmetic mean. When you divide the standard deviation by the arithmetic mean, both are in the same units of measurement, so the units cancel and what you get is a pure number. That number expressed as a percentage, that is, multiplied by 100, is the coefficient of variation.

So in summary, we have to choose appropriate central and dispersion values. The mean and standard deviation are the most appropriate central and dispersion values, especially if there are no extreme values. If there are extreme values, there are methods of still using the mean and standard deviation through some transformations of your data, but that requires a little expert handling of your data. Otherwise you go in for the median and interquartile range, and these two measures do take care of extreme values. The mode and range are normally used for quantitative variables and for time distributions in an epidemic curve. The mean and standard deviation, as I said, are the most used measures of variability and summary statistics. Thank you.
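
The measures described in this session can be tried out with a short Python sketch. This is only an illustration, not part of the lecture. The talk gives the hospital-stay example just in outline (ten stays up to nine days, one stay of 77 days, median 6, mean about 11.8), so the individual values in `stays` below are an assumption, chosen to reproduce that median and mean. The function name `summarize` and the dictionary keys are likewise hypothetical. The variance divides by n, following the talk's "average of the squared deviations"; a sample estimate would usually divide by n - 1.

```python
import statistics

# ASSUMPTION: only the outlier (77 days), the median (6) and the mean
# (about 11.8) are stated in the talk; the other ten values are
# hypothetical, chosen so the data set reproduces those two figures.
stays = [1, 2, 3, 4, 5, 6, 7, 8, 8, 9, 77]


def summarize(values):
    """Return the central and dispersion measures described in the talk."""
    ordered = sorted(values)
    n = len(ordered)
    mean = sum(ordered) / n  # sum of x_i divided by n
    deviations = [x - mean for x in ordered]
    # Quartiles via the standard library's default ("exclusive") method;
    # other quartile conventions can give slightly different values.
    q1, _q2, q3 = statistics.quantiles(ordered, n=4)
    # Variance as the talk defines it: the average of squared deviations
    # (dividing by n; a sample estimate would divide by n - 1 instead).
    variance = sum(d * d for d in deviations) / n
    sd = variance ** 0.5
    return {
        "mean": mean,
        "median": statistics.median(ordered),
        "modes": statistics.multimode(ordered),
        "range": ordered[-1] - ordered[0],
        "iqr": q3 - q1,
        "sum_of_deviations": sum(deviations),  # zero up to rounding error
        "mean_abs_deviation": sum(abs(d) for d in deviations) / n,
        "variance": variance,
        "sd": sd,
        "cv_percent": sd / mean * 100,
    }


summary = summarize(stays)
for name, value in summary.items():
    print(name, value)

# Mode for nominal data, using the colour-preference counts from the talk.
colour_counts = {"green": 354, "yellow": 852, "white": 310, "purple": 474}
modal_colour = max(colour_counts, key=colour_counts.get)
print("modal colour:", modal_colour)
```

With these assumed values the single stay of 77 days pulls the mean to about 11.8 and the range to 76, while the median stays at 6 and the interquartile range at 5, which is the contrast the talk draws between the two pairs of measures.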