Kentucky Web Scraping: September 2014

Monday, 15 September 2014

Has It Been Done Before? Optimize Your Patent Search Using Patent Scraping Technology

Has it been done before? Optimize your Patent Search using Patent Scraping Technology.

Since the US patent office opened in 1790, inventors across the United States have been submitting all sorts of great products and half-baked ideas to their database. Nowadays, many individuals get ideas for great products only to have the patent office do a patent search and tell them that their ideas have already been patented by someone else! Herin lies a question: How do I perform a patent search to find out if my invention has already been patented before I invest time and money into developing it?

The US patent office patent search database is available to anyone with internet access.

US Patent Search Homepage

Performing a patent search with the patent searching tools on the US Patent office webpage can prove to be a very time consuming process. For example, patent searching the database for "dog" and "food" yields 5745 patent search results. The straight-forward approach to investigating the patent search results for your particular idea is to go through all 5745 results one at a time looking for yours. Get some munchies and settle in, this could take a while! The patent search database sorts results by patent number instead of relevancy. This means that if your idea was recently patented, you will find it near the top but if it wasn't, you could be searching for quite a while. Also, most patent search results have images associated with them. Downloading and displaying these images over the internet can be very time consuming depending on you internet connection and the availability of the patent search database servers.

Because patent searches take such a long time, many companies and organizations are looking ways to improve the process. Some organizations and companies will hire employees for the sole purpose of performing patent searches for them. Others contract out the job to small business that specialize in patent searches. The latest technology for performing patent searches is called patent scraping.

Patent scraping is the process of writing computer automated scripts that analyze a website and copy only the content you are interested in into easily accessible databases or spreadsheets on your computer. Because it is a computerized script performing the patent search, you don't need a separate employee to get the data, you can let it run the patent scraping while you perform other important tasks! Patent scraping technology can also extract text content from images. By saving the images and textual content to your computer, you can then very efficiently search them for content and relevancy; thus saving you lots of time that could be better spent actually inventing something!

To put a real-world face on this, let us consider the pharmaceutical industry. Many different companies are competing for the patent on the next big drug. It has become an indispensible tactic of the industry for one company to perform patent searches for what patents the other companies are applying for, thus learning in which direction the research and development team of the other company is taking them. Using this information, the company can then choose to either pursue that direction heavily, or spin off in a different direction. It would quickly become very costly to maintain a team of researchers dedicated to only performing patent searches all day. Patent scraping technology is the means for figuring out what ideas and technologies are coming about before they make headline news. It is by utilizing patent scraping technology that the large companies stay up to date on the latest trends in technology.

While some companies choose to hire their own programming team to do their patent scraping scripts for them, it is much more cost effective to contract out the job to a qualified team of programmers dedicated to performing such services.

Source:http://ezinearticles.com/?Has-It-Been-Done-Before?-Optimize-Your-Patent-Search-Using-Patent-Scraping-Technology&id=171000

Web data scraping (online news comments) with Scrapy (Python)

Since you seem like the try-first ask-question later type (that's a very good thing), I won't give you an answer, but a (very detailed) guide on how to find the answer.

The thing is, unless you are a yahoo developer, you probably don't have access to the source code you're trying to scrape. That is to say, you don't know exactly how the site is built and how your requests to it as a user are being processed on the server-side. You can, however, investigate the client-side and try to emulate it. I like using Chrome Developer Tools for this, but you can use others such as FF firebug.

So first off we need to figure out what's going on. So the way it works, is you click on the 'show comments' it loads the first ten, then you need to keep clicking for the next ten comments each time. Notice, however, that all this clicking isn't taking you to a different link, but lively fetches the comments, which is a very neat UI but for our case requires a bit more work. I can tell two things right away:

They're using javascript to load the comments (because I'm staying on the same page).
They load them dynamically with AJAX calls each time you click (meaning instead of loading the comments with the page and just showing them to you, with each click it does another request to the database).

Now let's right-click and inspect element on that button. It's actually just a simple span with text:

<span>View Comments (2077)</span>

By looking at that we still don't know how that's generated or what it does when clicked. Fine. Now, keeping the devtools window open, let's click on it. This opened up the first ten. But in fact, a request was being made for us to fetch them. A request that chrome devtools recorded. We look in the network tab of the devtools and see a lot of confusing data. Wait, here's one that makes sense:

http://news.yahoo.com/_xhr/contentcomments/get_comments/?content_id=42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc&_device=full&count=10&sortBy=highestRated&isNext=true&offset=20&pageNumber=2&_media.modules.content_comments.switches._enable_view_others=1&_media.modules.content_comments.switches._enable_mutecommenter=1&enable_collapsed_comment=1

See? _xhr and then get_comments. That makes a lot of sense. Going to that link in the browser gave me a JSON object (looks like a python dictionary) containing all the ten comments which that request fetched. Now that's the request you need to emulate, because that's the one that gives you what you want. First let's translate this to some normal reqest that a human can read:

go to this url: http://news.yahoo.com/_xhr/contentcomments/get_comments/
include these parameters: {'_device': 'full',
          '_media.modules.content_comments.switches._enable_mutecommenter': '1',
          '_media.modules.content_comments.switches._enable_view_others': '1',
          'content_id': '42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc',
          'count': '10',
          'enable_collapsed_comment': '1',
          'isNext': 'true',
          'offset': '20',
          'pageNumber': '2',
          'sortBy': 'highestRated'}

Now it's just a matter of trial-and-error. However, a few things to note here:

Obviously the count is what decides how many comments you're getting. I tried changing it to 100 to see what happens and got a bad request. And it was nice enough to tell me why - "Offset should be multiple of total rows". So now we understand how to use offset
The content_id is probably something that identifies the article you are reading. Meaning you need to fetch that from the original page somehow. Try digging around a little, you'll find it.
Also, you obviously don't want to fetch 10 comments at a time, so it's probably a good idea to find a way to fetch the number of total comments somehow (either find out how the page gets it, or just fetch it from within the article itself)
Using the devtools you have access to all client-side scripts. So by digging you can find that that link to /get_comments/ is kept within a javascript object named YUI. You can then try to understand how it is making the request, and try to emulate that (though you can probably figure it out yourself)
You might need to overcome some security measures. For example, you might need a session-key from the original article before you can access the comments. This is used to prevent direct access to some parts of the sites. I won't trouble you with the details, because it doesn't seem like a problem in this case, but you do need to be aware of it in case it shows up.
Finally, you'll have to parse the JSON object (python has excellent built-in tools for that) and then parse the html comments you are getting (for which you might want to check out BeautifulSoup).

As you can see, this will require some work, but despite all I've written, it's not an extremely complicated task either.

So don't panic.

It's just a matter of digging and digging until you find gold (also, having some basic WEB knowledge doesn't hurt). Then, if you face a roadblock and really can't go any further, come back here to SO, and ask again. Someone will help you.

Source: http://stackoverflow.com/questions/20218855/web-data-scraping-online-news-comments-with-scrapy-python

How do you scrape AJAX pages?

9 Answers

Overview:

All screen scraping first requires manual review of the page you want to extract resources from. When dealing with AJAX you usually just need to analyze a bit more than just simply the HTML.

When dealing with AJAX this just means that the value you want is not in the initial HTML document that you requested, but that javascript will be exectued which asks the server for the extra information you want.

You can therefore usually simply analyze the javascript and see which request the javascript makes and just call this URL instead from the start.

Example:

Take this as an example, assume the page you want to scrape from has the following script:

<script type="text/javascript">
function ajaxFunction()
{
var xmlHttp;
try
{
// Firefox, Opera 8.0+, Safari
xmlHttp=new XMLHttpRequest();
}
catch (e)
{
// Internet Explorer
try
    {
    xmlHttp=new ActiveXObject("Msxml2.XMLHTTP");
    }
catch (e)
    {
    try
      {
      xmlHttp=new ActiveXObject("Microsoft.XMLHTTP");
      }
    catch (e)
      {
      alert("Your browser does not support AJAX!");
      return false;
      }
    }
}
xmlHttp.onreadystatechange=function()
    {
    if(xmlHttp.readyState==4)
      {
      document.myForm.time.value=xmlHttp.responseText;
      }
    }
xmlHttp.open("GET","time.asp",true);
xmlHttp.send(null);
}
</script>

Then all you need to do is instead do an HTTP request to time.asp of the same server instead. Example from w3schools.

Advanced scraping with C++:

For complex usage, and if you're using C++ you could also consider using the firefox javascript engine SpiderMonkey to execute the javascript on a page.

Advanced scraping with Java:

For complex usage, and if you're using Java you could also consider using the firefox javascript engine for Java Rhino

Advanced scraping with .NET:

For complex usage, and if you're using .Net you could also consider using the Microsoft.vsa assembly. Recently replaced with ICodeCompiler/CodeDOM.

If you can get at it, try examining the DOM tree. Selenium does this as a part of testing a page. It also has functions to click buttons and follow links, which may be useful.

In my opinion the simpliest solution is to use Casperjs, a framework based on the WebKit headless browser phantomjs.

The whole page is loaded, and it's very easy to scrape any ajax-related data. You can check this basic tutorial to learn Automating & Scraping with PhantomJS and CasperJS

You can also give a look at this example code, on how to scrape google suggests keywords :

/*global casper:true*/
var casper = require('casper').create();
var suggestions = [];
var word = casper.cli.get(0);

if (!word) {
    casper.echo('please provide a word').exit(1);
}

casper.start('http://www.google.com/', function() {
    this.sendKeys('input[name=q]', word);
});

casper.waitFor(function() {
return this.fetchText('.gsq_a table span').indexOf(word) === 0
}, function() {
suggestions = this.evaluate(function() {
      var nodes = document.querySelectorAll('.gsq_a table span');
      return [].map.call(nodes, function(node){
          return node.textContent;
      });
});
});

casper.run(function() {
this.echo(suggestions.join('\n')).exit();
});

Depends on the ajax page. The first part of screen scraping is determining how the page works. Is there some sort of variable you can iterate through to request all the data from the page? Personally I've used Web Scraper Plus for a lot of screen scraping related tasks because it is cheap, not difficult to get started, non-programmers can get it working relatively quickly.

Side Note: Terms of Use is probably somewhere you might want to check before doing this. Depending on the site iterating through everything may raise some flags.

As a low cost solution you can also try SWExplorerAutomation (SWEA). The program creates an automation API for any Web application developed with HTML, DHTML or AJAX.

The best way to scrape web pages using Ajax or in general pages using Javascript is with a browser itself or a headless browser (a browser without GUI). Currently phantomjs is a well promoted headless browser using WebKit. An alternative that I used with success is HtmlUnit (in Java or .NET via IKVM, which is a simulated browser. Another known alternative is using a web automation tool like Selenium.

I wrote many articles about this subject like web scraping Ajax and Javascript sites and automated browserless OAuth authentication for Twitter. At the end of the first article there are a lot of extra resources that I have been compiling since 2011.

I am currently trying to retrieve only visible text using BeautifulSoup... but I keep getting javascript and html cruft that I didn't really want...

As alternative, I am playing with the idea of spawning a browser programmatically, sending cntr-c key presses to the process, and retrieving the text from the clipboard.

I really like the idea working with rendered data as opposed to the source data. However, this might not help you if you need interact with the dom tree. Personally, I am not happy with the clipboard killing all of my concurrency.. My next step is looking at the Chrome source code...

Another best tool that lets us write scrappers on live DOM is solvent from MIT . http://simile.mit.edu/wiki/Solvent .

Also promising is , Envjs at http://www.envjs.com/doc/guides .

If your intention is to scrape to a high scale , you might need to introduce delays and mimic human behavior to avoid being banned by the service being scraped.

I think Brian R. Bondy's answer is useful when the source code is easy to read. I prefer an easy way using tools like Wireshark or HttpAnalyzer to capture the packet and get the url from the "Host" field and the "GET" field.

For example,I capture a packet like the following:

    GET /hqzx/quote.aspx?type=3&market=1&sorttype=3&updown=up&page=1&count=8&time=164330 HTTP/1.1

    Accept: /

    Referer: http://quote.hexun.com/stock/default.aspx

    Accept-Language: zh-cn

    Accept-Encoding: gzip, deflate

    User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)

    Host: quote.tool.hexun.com

    Connection: Keep-Alive

Source:http://stackoverflow.com/questions/260540/how-do-you-scrape-ajax-pages/6484022#6484022

Scraping Data: Site-specific Extractors vs. Generic Extractors

Scraping is becoming a rather mundane job with every other organization getting its feet wet with it for their own data gathering needs. There have been enough number of crawlers built – some open-sourced and others internal to organizations for in-house utilities. Although crawling might seem like a simple technique at the onset, doing this at a large-scale is the real deal. You need to have a distributed stack set up to take care of handling huge volumes of data, to provide data in a low-latency model and also to deal with fail-overs. This still is achievable after crossing the initial tech barrier and via continuous optimizations. (P.S. Not under-estimating this part because it still needs a team of Engineers monitoring the stats and scratching their heads at times).

Social Media Scraping

Focused crawls on a predefined list of sites

However, you bump into a completely new land if your goal is to generate clean and usable data sets from these crawls i.e. “extract” data in a format that your DB can process and aid in generating insights. There are
2 ways of tackling this:

a. site-specific extractors which give desired results

b. generic extractors that result in few surprises

Assuming you still do focused crawls on a predefined list of sites, let’s go over specific scenarios when you have to pick between the two-

1. Mass-scale crawls; high-level meta data - Use generic extractors when you have a large-scale crawling requirement on a continuous basis. Large-scale would mean having to crawl sites in the range of hundreds of thousands. Since the web is a jungle and no two sites share the same template, it would be impossible to write an extractor for each. However, you have to settle in with just the document-level information from such crawls like the URL, meta keywords, blog or news titles, author, date and article content which is still enough information to be happy with if your requirement is analyzing sentiment of the data.

cb1c0_one-size

A generic extractor case

Generic extractors don’t yield accurate results and often mess up the datasets deeming it unusable. Reason being

programatically distinguishing relevant data from irrelevant datasets is a challenge. For example, how would the extractor know to skip pages that have a list of blogs and only extract the ones with the complete article. Or delineating article content from the title on a blog page is not easy either.

To summarize, below is what to expect of a generic extractor.

Pros-

minimal manual intervention

low on effort and time

can work on any scale

Cons-

Data quality compromised

inaccurate and incomplete datasets

lesser details suited only for high-level analyses

Suited for gathering- blogs, forums, news

Uses- Sentiment Analysis, Brand Monitoring, Competitor Analysis, Social Media Monitoring.

2. Low/Mid scale crawls; detailed datasets - If precise extraction is the mandate, there’s no going away from site-specific extractors. But realistically this is do-able only if your scope of work is limited i.e. few hundred sites or less. Using site-specific extractors, you could extract as many number of fields from any nook or corner of the web pages. Most of the times, most pages on a website share similar templates. If not, they can still be accommodated for using site-specific extractors.

cutlery

Designing extractor for each website

Pros-

High data quality

Better data coverage on the site

Cons-

High on effort and time

Site structures keep changing from time to time and maintaining these requires a lot of monitoring and manual intervention

Only for limited scale

Suited for gathering - any data from any domain on any site be it product specifications and price details, reviews, blogs, forums, directories, ticket inventories, etc.

Uses- Data Analytics for E-commerce, Business Intelligence, Market Research, Sentiment Analysis

Conclusion

Quite obviously you need both such extractors handy to take care of various use cases. The only way generic extractors can work for detailed datasets is if everyone employs standard data formats on the web (Read our post on standard data formats here). However, given the internet penetration to the masses and the variety of things folks like to do on the web, this is being overly futuristic.

So while site-specific extractors are going to be around for quite some time, the challenge now is to tweak the generic ones to work better. At PromptCloud, we have added ML components to make them smarter and they have been working well for us so far.

What have your challenges been? Do drop in your comments.

Source:http://promptcloud.com/blog/scraping-data-site-specific-extractors-vs-generic-extractors/