Detecting Power Laws
Binning SamplesMarcusFrean and NickChapman talked about how to detect power laws by binning samples. See bins.gif This is handy for discrete distributions, and essential for real-valued ones. If you use exponentially spaced bin boundaries, divide the bin's occupancy by its width, and then plot the log of that against the bin number, you should get a straight line whose negative slope is the exponent a. Nick's going to check it out in practice, by generating artificial "powerlaw" distributions... RoBt askz: So again I want to be clear about the rationale for exponentially sized bins. Is it because this will support the "scale-free" aspect of the distribution? Just to check my understanding of what you are then saying: does this mean you then do not take the log of the independent variable when plotting to look for a straight line? MarcusFrean sayz: the (current!) idea is to # use exponentially sized bins, # divide the counts by the bin-width to get "adjusted counts" # ''plot the log of the adjusted counts versus the bin index''. Or, equivalently, the abscissa can be the log of the bin boundary. This will be a straight line if the distribution is a powerlaw. and in that case the slope is the exponent. RoBt: Ok, this seems reasonable. I still wonder about effects at the low end though. When we are dealing with things are that come in discrete sizes, rather than continuous, it seems to me that the tiny bucket sizes make result in lots of noise data (exactly the kind of thing that bucket size selection normally handles). I am also unsure, especially at these tiny sizes, whether we are really sure what we are measuring: e.g. classes vs methods vs fields vs lines vs tokens vs lexemes vs characters vs bits. Surely other people have faced similar issues? It doesn't seem right to change our idea of the distribution because of edge-effects relating to the tiniest quanta in the data. Comments? MarcusFrean: Yah, I guess you can start with your first bucket boundary at the smallest value in the dataset, make the first bin width =1, and go from there upwards. I've got a different worry - it seems like this logarithmic binning runs the risk of losing information by sticking all the data from (say) 10 to 15 or whatever into a single bin, when in reality we may have loads of excellent info on 10, 11, 12, 13, 14 and 15. Seems like we're throwing hard-earned data away, in effect. Provided your data is positive integers though you don't need to do this binning anyway (do you?), and the standard method of ranking and then plotting on log scales (as in the scrips below, and the CACM paper) seems preferable... Agree / disagree ??
Matlab diagnosticsIt looks like some of the distributions that ''aren't'' PowerLaw might be LogNormal. I've written some quick and dirty diagnostics for the two cases. Put your raw data (list of samples) into a file "datafile". Grab this matlab script: diagnoseDatafile.m Then it's
- matlab -nodesktop
- generatePowerlawData.m - Call this with one argument, the exponent. E.g: ''generatePowerlawData(1.8) ''
- generateLognormalData.m - Call this with two arguments, the mode and its "spread". E.g: ''generateLognormalData(10, 0.5) ''
I've added a bit to Marcus' diagnose script which tries to fit a straight line to both the powerlaw and log-normal plots. It'll draw the best-fit line onto them and print out the corresponding equation. diagnoseDatafile.m JeromeDolman
Also see LogNormal distributions.