I'm echoing a website’s source with file_get_contents():
echo file_get_contents('http://www.threadless.com/blogs/blogs');
A small portion of the output is a little fishy:
<a class="pagea selected" href="/blogs/blogs?token=ccaea4f99cbadd8262c148c86e1d8b06&uuid=5abf8a35510975e77f4618b544f7fe65/page,1">1</a> <a class="pagea " href="/blogs/blogs?token=ccaea4f99cbadd8262c148c86e1d8b06&uuid=5abf8a35510975e77f4618b544f7fe65/page,2">2</a> <a class="pagea " href="/blogs/blogs?token=ccaea4f99cbadd8262c148c86e1d8b06&uuid=5abf8a35510975e77f4618b544f7fe65/page,3">3</a> <a class="pagea " href="/blogs/blogs?token=ccaea4f99cbadd8262c148c86e1d8b06&uuid=5abf8a35510975e77f4618b544f7fe65/page,4">4</a> <a class="pageelipsa" href="/blogs/blogs?token=ccaea4f99cbadd8262c148c86e1d8b06&uuid=5abf8a35510975e77f4618b544f7fe65/page,2826">...</a> <a class="pagea " href="/blogs/blogs?token=ccaea4f99cbadd8262c148c86e1d8b06&uuid=5abf8a35510975e77f4618b544f7fe65/page,2827">2,827</a> <a class="pagea " href="/blogs/blogs?token=ccaea4f99cbadd8262c148c86e1d8b06&uuid=5abf8a35510975e77f4618b544f7fe65/page,2828">2,828</a> <div class="pagecontext grey">(84,833 results!)</div>
When I view the source of the actual page in a browser, that same area is:
<a class="pagea selected" href="/blogs/blogs/page,1">1</a> <a class="pagea " href="/blogs/blogs/page,2">2</a> <a class="pagea " href="/blogs/blogs/page,3">3</a> <a class="pagea " href="/blogs/blogs/page,4">4</a> <a class="pageelipsa" href="/blogs/blogs/page,2826">...</a> <a class="pagea " href="/blogs/blogs/page,2827">2,827</a> <a class="pagea " href="/blogs/blogs/page,2828">2,828</a> <div class="pagecontext grey">(84,833 results!)</div>
So file_get_contents() returns slightly different content than the browser gets for the same page! Something has added “?token=randomString&uuid=randomString” to all of the pagination URLs, inserted right after “/blogs/blogs” and before “/page,N”. Since I’m working on a web crawler, this is a big no-no. What’s going on?
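As a stopgap while I figure out the cause, the injected fragment can be stripped from the markup before the crawler extracts links. This is only a minimal sketch: strip_injected_token() is a name I made up, and it assumes the token and uuid values are always hex strings that appear together in exactly the order shown in the output above.

```php
<?php
// Hypothetical helper (not part of PHP or the site): removes the injected
// "?token=<hex>&uuid=<hex>" fragment so the links match what the browser shows.
function strip_injected_token(string $html): string
{
    // Assumption: both values are hex and always appear as a pair in this order.
    // "&(?:amp;)?" also accepts an HTML-escaped ampersand, in case the markup uses one.
    $cleaned = preg_replace('/\?token=[0-9a-f]+&(?:amp;)?uuid=[0-9a-f]+/i', '', $html);

    // preg_replace() returns null on a regex error; fall back to the input untouched.
    return $cleaned ?? $html;
}

// One link copied from the fetched output above.
$sample = '<a class="pagea " href="/blogs/blogs?token=ccaea4f99cbadd8262c148c86e1d8b06&uuid=5abf8a35510975e77f4618b544f7fe65/page,2">2</a>';

echo strip_injected_token($sample), PHP_EOL;
```

This only hides the symptom. It does not explain why the server hands out different markup to file_get_contents() than to the browser, which is still what I’d like to understand.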