I was reading about Impala, a fast big data store from Cloudera, and noticed that only Ruby and Java had clients. Why was there no PHP client for Impala? So I thought I would have a go at creating one.
After a lot of reading and studying both the Java and the Ruby clients, I finally succeeded in getting a PHP client to connect to the Impala service.
This is meant to be the building block (a library) for a more user-friendly client. I am working on that currently and hope to have something to share soon. I plan on learning more about Impala and the API methods that are available via the Thrift service.
Here is the GitHub for the code.
I bundled the PHP library code into a phar archive, since there are a lot of files and a single archive is easier to deal with. I also did not use the supplied Thrift class loader; instead I used the Zend Framework classmap_generator script to build a classmap and wrote a very simple class loader to consume it.
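For readers curious what a classmap-driven loader can look like, here is a minimal sketch. It is not the loader shipped in the phar; the function name make_classmap_loader and the sample classmap entry are made up for illustration, and a generated classmap file would normally return an array of this general shape.

```php
<?php
// Hypothetical classmap: fully qualified class name => file path relative to a base directory.
$classmap = array(
    'Thrift\\Transport\\TSocket' => 'Thrift/Transport/TSocket.php',
);

// Build an autoloader closure that looks classes up in the map.
// (make_classmap_loader is an illustrative name, not part of the library.)
function make_classmap_loader(array $classmap, $baseDir)
{
    return function ($class) use ($classmap, $baseDir) {
        $class = ltrim($class, '\\');
        if (!isset($classmap[$class])) {
            return false; // not ours; let any other registered loader try
        }
        $file = $baseDir . '/' . $classmap[$class];
        if (!is_file($file)) {
            return false; // map entry points at a missing file
        }
        require $file;
        return true;
    };
}

// Register the loader; PHP calls it the first time an unknown class is used.
spl_autoload_register(make_classmap_loader($classmap, __DIR__));
```

The lookup is a single array access per class, which is the main appeal of a classmap over a loader that probes the filesystem.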
Check it out and let me know what you think.
I tried out your PHP client for Impala. Everything works OK for simple SELECT queries, but complex queries time out.
Hi Gideon,
I did not use very complex queries when building the library. If you could post your error message, I'd love to help troubleshoot the problem.
Here is the error message after doing a SELECT COUNT(id)
Fatal error: Uncaught exception 'Thrift\Exception\TTransportException' with message 'TSocket: timed out reading 4 bytes from 10.3.4.78:21000' in phar://C:/xampp/htdocs/impala/build/php-impala.phar/Thrift/Transport/TSocket.php:274
Stack trace:
#0 phar://C:/xampp/htdocs/impala/build/php-impala.phar/Thrift/Transport/TTransport.php(74): Thrift\Transport\TSocket->read(4)
#1 phar://C:/xampp/htdocs/impala/build/php-impala.phar/Thrift/Transport/TBufferedTransport.php(113): Thrift\Transport\TTransport->readAll(4)
#2 phar://C:/xampp/htdocs/impala/build/php-impala.phar/Thrift/Protocol/TBinaryProtocol.php(305): Thrift\Transport\TBufferedTransport->readAll(4)
#3 phar://C:/xampp/htdocs/impala/build/php-impala.phar/Thrift/Protocol/TBinaryProtocol.php(197): Thrift\Protocol\TBinaryProtocol->readI32(NULL)
#4 phar://C:/xampp/htdocs/impala/build/php-impala.phar/packages/BeeswaxService/BeeswaxService.php(243): Thrift\Protocol\TBinaryProtocol->readMessageBegin(NULL, 0, 0)
#5 phar://C:/xampp/htdocs/impala/build/php-impala.phar/packages/BeeswaxS in phar://C:/xampp/htdocs/impala/build/php-impala.phar/Thrift/Transport/TSocket.php on line 274
I am seeing the same thing. I did notice that there is a bug about that filed against Impala:
https://issues.cloudera.org/browse/IMPALA-168
We might be running into this bug, although I have not tried it against the newest Impala release (Cloudera Impala 1.0), just the beta version.
I just installed the latest version of Impala. The queries work OK on the command line. I am temporarily switching to Java but will watch for any updates.
Thanks
Hi Robert McFrazier,
I got the same error as Gideon Mazambani's. I fixed it by increasing the timeout of Thrift:
// modify the file test.php
$socket = new TSocket('172.16.0.26', 21000); // make sure to enter your impala host ip address
// Add following lines
const DEFAULT_THRIFT_TIMEOUT = 10; // in seconds.
$socket->setRecvTimeout(DEFAULT_THRIFT_TIMEOUT*1000);
$socket->setSendTimeout(DEFAULT_THRIFT_TIMEOUT*1000);
It works well now. This solution is based on the post here: https://groups.google.com/forum/#!topic/phpcassa/…
Thanks for the fix. I'm looking at the latest release of Impala now to see if I can update the PHP client; maybe the new version will work without extending the call timeout.
Great client Robert.
I had to modify the timeouts too, per the thread above.
Quick question: do you know how to run a query that returns more than 1024 results at a time? I tried increasing the batch size without luck.
Hi John,
I have not run a query that returned more than 1024 rows yet. I'll check to see if it is an option that I need to expose to control the result set size.
Thank you for this client, very easy to use and works perfectly!
You're welcome, glad you found it useful.
It doesn't return more than 1024 records. How do I get more than 1024?
I'm not sure; I believe that is an internal setting.
$result = $client->fetch($queryHandle,false,100);
The 100 is what limits the return. I am currently working on an updated version of that lib that uses a more recent version of the Impala Thrift interface.
I have tried it like this:
$result = $client->fetch($queryHandle,false,4000);
It still returns 1024.
No, I have checked. I have set the count to 4000 with $client->fetch($queryHandle,false,4000); and it still returns 1024.
When you select more than one column in your query, how do you get them from the $result object (without directly parsing the returned string)? As I see it, you only get one string field for every returned row, and if you have more than one column you need to parse the string manually?
I believe you are correct. I am currently working on an update and hope to address this there.
You get it as a tab-separated string, and then you need to parse the string.
Would it be in any way possible to fetch the results as an associative array? It's not a problem to parse the rows into individual cells, but then I lose information about the column names, and I'm having a lot of trouble thinking of a work-around for that…
No, it's not possible to get the result set as an associative array. That's really a problem.
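One possible workaround, when the query lists its columns explicitly, is that the caller already knows the names and can pair them with the parsed cells. The sketch below assumes each row arrives as one tab-separated string, as described above; rows_to_assoc and the sample column names are hypothetical and not part of the client library.

```php
<?php
// Pair known column names with tab-separated row strings.
// $columnNames is an assumption: the names from your own SELECT list, in order.
function rows_to_assoc(array $columnNames, array $rows)
{
    $records = array();
    foreach ($rows as $row) {
        $cells = explode("\t", $row);
        if (count($cells) !== count($columnNames)) {
            // A value containing a tab, or a wrong name list, lands here.
            throw new InvalidArgumentException('Column count mismatch in row: ' . $row);
        }
        $records[] = array_combine($columnNames, $cells);
    }
    return $records;
}

$records = rows_to_assoc(array('id', 'name'), array("1\talice", "2\tbob"));
// $records[0]['name'] is 'alice'; all values remain strings.
```

This does not help with "select *", where the names are not known up front, and it breaks if a value itself contains a tab character.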
Hi Robert,
Thanks for creating the Impala client.
It seems to be working perfectly for basic usage.
I appreciate your efforts.
Thanks, and I look forward to your assistance if I need any with it.
Hi Poonam,
Glad you are finding it useful. I'm currently working on an updated version of the Impala client, one that uses the HiveServer2 API.
Hi Poonam,
Did you check whether it returns more than 1024 records at a time?
Hi, when do you expect to have this new version of Impala client?
Hi Kreso,
I don't have a specific date, but I am working on it now. It should be a few weeks out, maybe by the first half of Jan 2014.
Hello, to get more than 1024 rows, set the configuration on your Beeswax Query like so:
$_oQuery = new Query();
$_oQuery->query = $sQuery;
$_oQuery->configuration = array("BATCH_SIZE=500");
The number will not be used, but the presence of the flag gets you everything.
Thanks, this works. Now I have nearly 3M rows to display, and fetching all the data at once and viewing it takes a long time. Can we have some kind of pagination on top of that?
If you have to display 3M rows of data you are doing it wrong! What device do you want to render that on? A high-end gaming station on a cinema screen?
Still, you can just get the query id out of your QueryHandle and restore it in an AJAX call.
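As a rough illustration of paging, the sketch below fetches one page per call and reports whether more rows remain, using the same fetch($queryHandle, false, $size) call shape and the has_more field seen in this thread. FakeClient is a stand-in written only so the example runs on its own, fetch_page is a hypothetical helper, and treating the second argument as a "start over" flag is an assumption about the Beeswax interface.

```php
<?php
// Stand-in for the real client (hypothetical), serving rows from an array.
class FakeClient
{
    private $rows;
    private $pos = 0;

    public function __construct(array $rows)
    {
        $this->rows = $rows;
    }

    // Same call shape as in the thread: fetch($handle, $startOver, $fetchSize).
    public function fetch($handle, $startOver, $fetchSize)
    {
        if ($startOver) {
            $this->pos = 0;
        }
        $result = new stdClass();
        $result->data = array_slice($this->rows, $this->pos, $fetchSize);
        $this->pos += count($result->data);
        $result->has_more = $this->pos < count($this->rows);
        return $result;
    }
}

// Fetch exactly one page; the caller decides when to ask for the next one.
function fetch_page($client, $handle, $pageSize)
{
    $result = $client->fetch($handle, false, $pageSize);
    return array('rows' => $result->data, 'has_more' => $result->has_more);
}

$client = new FakeClient(array('r1', 'r2', 'r3', 'r4', 'r5'));
$page = fetch_page($client, 'handle', 2);
// $page['rows'] holds 'r1' and 'r2'; $page['has_more'] is true.
```

With the real service, each AJAX request would need to reuse the stored query handle so the server-side cursor continues where the previous page stopped.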
How can I get the column names? I ran the test with a "select * from table_name" but I only get this kind of data:
object(Results)#10 (5) { ["ready"]=> bool(true) ["columns"]=> array(4) { [0]=> string(6) "string" [1]=> string(6) "string" [2]=> string(6) "string" [3]=> string(6) "string" } ["data"]=> array(35) {…} ["start_row"]=> float(0) ["has_more"]=> bool(true) }
This way, instead of the names of the columns, I only have their types.
How can I get the column names? I ran the test with a "select * from table_name" but I only get this kind of data:
object(Results)#10 (5) { ["ready"]=> bool(true) ["columns"]=> array(17) { [0]=> string(6) "string" [1]=> string(6) "string" [3]=> string(6) "double" … } ["data"]=> array(35) { … } ["start_row"]=> float(0) ["has_more"]=> bool(true) }